This is stage 01 of the six-stage producer chain. It reloads the corpus cache written by stage 00, runs the four register-family per-doc analyzers (mood, register, stance, sentence register), and writes a partial JSON keyed by file index.
| Analyzer | What it produces (per-file) |
| --- | --- |
| mood_for_doc | Imperative-marker lexical density (count + pct + per_sent). |
| register_for_doc | TTR, mean sentence length, dependency depth, Heylighen F-score, four register classes. |
| stance_for_doc | Five polarity-aware stance classes plus 1st/2nd-person engagement counts. |
| sentence_register_for_doc | Per-sentence multi-label flags across six classes. |
"""Reload corpus + DocBin from stage 00."""import os, pathlib, json, importlibimport pandas as pdfrom tqdm.auto import tqdmfrom spacy.tokens import DocBin_here = pathlib.Path.cwd().resolve()PROJECT_ROOT =next( (p for p in [_here, *_here.parents] if (p /"prompt_pipeline.py").is_file()),None,)if PROJECT_ROOT isNone:raiseRuntimeError(f"Could not find prompt_pipeline.py walking up from {_here}. ""Run from inside the claude-prompts-analysis repo." )if pathlib.Path.cwd() != PROJECT_ROOT: os.chdir(PROJECT_ROOT)CACHE_DIR = PROJECT_ROOT /"_pipeline_cache"DOCBIN_IN = CACHE_DIR /"corpus_docs.spacy"META_IN = CACHE_DIR /"corpus_meta.parquet"PARTIAL_OUT = CACHE_DIR /"partial_register.json"assert DOCBIN_IN.exists(), f"missing {DOCBIN_IN} — run 00_setup_and_corpus first"assert META_IN.exists(), f"missing {META_IN} — run 00_setup_and_corpus first"import prompt_pipelineimportlib.reload(prompt_pipeline)from prompt_pipeline import NLPdf = pd.read_parquet(META_IN)docs =list(DocBin().from_disk(DOCBIN_IN).get_docs(NLP.vocab))assertlen(docs) ==len(df), f"DocBin/df length mismatch: {len(docs)} vs {len(df)}"print(f"reloaded {len(df)} files, {sum(d.__len__() for d in docs):,} doc tokens")
reloaded 290 files, 145,534 doc tokens
1. Mood
Mood marker density is the share of word tokens matched by IMPERATIVE_MARKERS (the lexicon is echoed verbatim into the YAML’s lexicons block). Reported per file as count + pct (% of word tokens) + per_sent (rate per sentence). The per-sentence imperative classifier moved into sentence_register; this block carries only the lexical-density signal (marker_*).
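For intuition, here is a minimal sketch of the density arithmetic. This is not the real mood_for_doc: the actual lexicon is IMPERATIVE_MARKERS in prompt_pipeline, and the marker set and key names below are hypothetical stand-ins.

```python
# Minimal sketch of imperative-marker lexical density, assuming a flat set
# of single-token, lowercased markers. MARKERS is hypothetical; the real
# lexicon is IMPERATIVE_MARKERS in prompt_pipeline.
MARKERS = {"must", "never", "always", "ensure", "avoid"}

def mood_density_sketch(doc, n_tokens: int, n_sents: int) -> dict:
    # count word tokens (punctuation excluded) whose lowercase form matches
    count = sum(1 for tok in doc if not tok.is_punct and tok.lower_ in MARKERS)
    return {
        "marker_count": count,
        "marker_pct": 100.0 * count / max(n_tokens, 1),  # % of word tokens
        "marker_per_sent": count / max(n_sents, 1),      # rate per sentence
    }
```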
Code
```python
from prompt_pipeline import mood_for_doc

mood_per_file = [
    mood_for_doc(d, n, s)
    for d, n, s in zip(docs, df["n_tokens"], df["n_sents"])
]
df_mood = pd.DataFrame(mood_per_file)

print("per-file mood (head):")
print(df_mood.head().to_string())
print()
print("category mean (mood — imperative-marker density):")
print(
    pd.concat([df[["category"]], df_mood], axis=1)
    .groupby("category")
    .mean(numeric_only=True)
    .round(3)
    .to_string()
)
```
2. Register
Register captures formality. Four numeric metrics per file (a minimal sketch follows the list):
- TTR (type–token ratio) — unique types ÷ total tokens. Anti-correlates with file length (longer files reuse vocabulary, lowering TTR).
- Mean sentence length — tokens per sentence.
- Mean dependency depth — average nesting depth of the spaCy parse tree.
- Heylighen F-score (Heylighen & Dewaele 2002) — a 0–100 formality index: F = 50 + 0.5 × (noun + adj + prep + article − pronoun − verb − adverb − interjection), each term as a percentage of all tokens. Higher = more formal-academic prose; the corpus clusters in the 70–80 band.

Plus lexical density for four register classes (frozen / formal / consultative / casual).
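A minimal sketch of the four numeric metrics, under two stated assumptions: spaCy coarse POS tags proxy Heylighen & Dewaele's word classes (ADP for prepositions, DET for articles, VERB plus AUX as verbs), and "dependency depth" is read as the longest head-chain per sentence, averaged. The real register_for_doc may resolve both differently.

```python
def register_metrics_sketch(doc) -> dict:
    """Sketch of TTR, mean sentence length, parse depth, and Heylighen F."""
    words = [t for t in doc if not t.is_punct and not t.is_space]
    n = max(len(words), 1)
    sents = list(doc.sents)
    n_sents = max(len(sents), 1)

    # TTR: unique lowercased types / word tokens (drops as files get longer)
    ttr = len({t.lower_ for t in words}) / n

    # mean sentence length in word tokens
    mean_len = len(words) / n_sents

    def depth(tok):
        # distance from a token to its sentence root via the head chain
        d = 0
        while tok.head is not tok:
            tok, d = tok.head, d + 1
        return d

    # longest head-chain within each sentence, averaged over sentences
    mean_depth = sum(max(depth(t) for t in s) for s in sents) / n_sents

    # Heylighen F: each POS share as a percentage of all word tokens
    def pct(*pos):
        return 100.0 * sum(t.pos_ in pos for t in words) / n

    f_score = 50 + 0.5 * (
        pct("NOUN", "PROPN") + pct("ADJ") + pct("ADP") + pct("DET")
        - pct("PRON") - pct("VERB", "AUX") - pct("ADV") - pct("INTJ")
    )
    return {"ttr": ttr, "mean_sent_len": mean_len,
            "mean_dep_depth": mean_depth, "heylighen_f": f_score}
```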
Code
```python
from prompt_pipeline import register_for_doc

register_per_file = [
    register_for_doc(d, n, s)
    for d, n, s in zip(docs, df["n_tokens"], df["n_sents"])
]
df_register = pd.DataFrame(register_per_file)

print("per-file register (head):")
print(df_register.head().to_string())
print()
print("category mean (numeric register cols):")
num_cols = [
    c for c in df_register.columns
    if c != "dominant_register" and not c.endswith("_count")
]
print(
    pd.concat([df[["category"]], df_register[num_cols]], axis=1)
    .groupby("category")
    .mean(numeric_only=True)
    .round(3)
    .to_string()
)
```
3. Stance
Stance classifies expressed attitude into five polarity-aware classes: directive, expository, positive_evaluative, negative_evaluative, dialogic. Each is a hand-curated lexicon match, normalized as pct (% of word tokens) and per_sent. We also count 1st/2nd-person engagement (pronouns_1p, pronouns_2p) since they covary with stance.
The positive_evaluative quality / emphasis split. The union positive_evaluative lexicon conflates two phenomena (genuine quality praise and rule-emphasis intensity), so each file also carries the split into quality-only and emphasis-only counts.
The union is preserved for back-compat in existing charts. When the question is how much praise is here, cite the quality-only ratio against negative_evaluative. When the question is how loud is the rule emphasis, cite the emphasis count alongside the imperative-marker density.
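A sketch of the split bookkeeping. The lexicon contents and key names below are hypothetical; only the shape (union preserved, quality and emphasis carried separately) follows the description above.

```python
# Hypothetical sub-lexicons standing in for the hand-curated split.
QUALITY = {"excellent", "helpful", "accurate"}     # genuine praise vocabulary
EMPHASIS = {"important", "critical", "essential"}  # rule-emphasis intensity

def positive_split_sketch(doc, n_tokens: int, n_sents: int) -> dict:
    toks = [t.lower_ for t in doc if not t.is_punct]
    q = sum(t in QUALITY for t in toks)
    e = sum(t in EMPHASIS for t in toks)
    return {
        "positive_evaluative_count": q + e,  # union kept for back-compat
        "positive_quality_count": q,
        "positive_quality_pct": 100.0 * q / max(n_tokens, 1),
        "positive_emphasis_count": e,
        "positive_emphasis_per_sent": e / max(n_sents, 1),
    }
```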
Code
```python
from prompt_pipeline import stance_for_doc

stance_per_file = [
    stance_for_doc(d, n, s)
    for d, n, s in zip(docs, df["n_tokens"], df["n_sents"])
]
df_stance = pd.DataFrame(stance_per_file)

print("per-file stance (head):")
print(df_stance.head().to_string())
print()
print("dominant stance by category:")
print(pd.crosstab(df["category"], df_stance["dominant_stance"]))
print()
print("category mean stance (% of words and per-sentence rate):")
num_cols = [
    c for c in df_stance.columns
    if c != "dominant_stance" and not c.endswith("_count")
]
print(
    pd.concat([df[["category"]], df_stance[num_cols]], axis=1)
    .groupby("category")
    .mean(numeric_only=True)
    .round(3)
    .to_string()
)
```
4. Sentence register
Per-sentence multi-label classifier with six classes: collaborative / permissive / appreciative / imperative / directive / configuring. Implemented via M_SENT_REGISTER PhraseMatchers, M_STANCE["directive"], and a spaCy DependencyMatcher for parse-tree cues; the imperative flag is driven by classify_sent_mood. Near-zero classes are deliberately preserved — absence is the welfare-relevant signal for this corpus.
Multi-label semantics. A single sentence can carry several flags simultaneously. The hypothetical sentence "Please, we should consider running the migration." is permissive (please), collaborative (we should), imperative (consider), and directive (modal should). Because a flagged sentence increments every class it matches, per-class sent_pct values across the six classes can sum to more than 100% within a category — intentional, not a bug.
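A toy sketch of the multi-label bookkeeping, with made-up cue sets standing in for the real PhraseMatcher / DependencyMatcher patterns and for classify_sent_mood; the point is only that one sentence can raise several flags, so the per-class percentages are free to sum past 100.

```python
# Hypothetical single-token cues; the real classifier also uses
# parse-tree patterns, not just lexical lookup.
CUES = {
    "permissive": {"please", "may"},
    "collaborative": {"we", "together"},
    "directive": {"should", "must", "shall"},
}

def sentence_flags_sketch(doc) -> dict:
    sents = list(doc.sents)
    counts = {label: 0 for label in CUES}
    for sent in sents:
        toks = {t.lower_ for t in sent}
        for label, cues in CUES.items():
            if toks & cues:          # any cue present flags the sentence
                counts[label] += 1   # one sentence may hit several labels
    n = max(len(sents), 1)
    return {f"{label}_sent_pct": 100.0 * c / n for label, c in counts.items()}
```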
Code
```python
from prompt_pipeline import sentence_register_for_doc

sentence_register_per_file = [
    sentence_register_for_doc(d, s)
    for d, s in zip(docs, df["n_sents"])
]
df_sent_register = pd.DataFrame(sentence_register_per_file)

print("per-file sentence_register (head):")
print(df_sent_register.head().to_string())
print()
print("dominant by category:")
print(pd.crosstab(df["category"], df_sent_register["dominant"]))
print()
print("category mean sentence_register (% of sentences):")
pct_cols = [c for c in df_sent_register.columns if c.endswith("_sent_pct")]
print(
    pd.concat([df[["category"]], df_sent_register[pct_cols]], axis=1)
    .groupby("category")
    .mean(numeric_only=True)
    .round(2)
    .to_string()
)
```
Output
One JSON document, keyed by file index (string), with the four metric blocks per file. Stage 04 reloads this and feeds it directly to build_file_record.
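A sketch of the final write, given the four DataFrames built above. The block names inside each record are assumptions; the stated contract is only the shape: one JSON object, string file-index keys, four metric blocks per file.

```python
# Assemble one record per file from the four per-doc analyzer frames.
partial = {
    str(i): {
        "mood": df_mood.loc[i].to_dict(),
        "register": df_register.loc[i].to_dict(),
        "stance": df_stance.loc[i].to_dict(),
        "sentence_register": df_sent_register.loc[i].to_dict(),
    }
    for i in df.index
}
# default=float coerces any stray numpy scalars that json can't serialize
PARTIAL_OUT.write_text(json.dumps(partial, indent=2, default=float))
print(f"wrote {len(partial)} records to {PARTIAL_OUT}")
```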