This is stage 02 of the six-stage producer chain. It reloads the corpus cached by stage 00 and runs five analyzers focused on vocabulary, emphasis, and explanation:
vocab_for_doc
Counts per VOCAB lexicon class (prohibitions, prescriptions, politeness, warmth, hedging, structural, profanity, pronouns).
all_caps_for_text
ALL CAPS tokens with TECH_ACRONYMS excluded.
caps_imperative_for_text
Direct match against CAPS_IMPERATIVE_TOKENS (e.g., MUST, NEVER, IMPORTANT).
justification_for_text
JUSTIFICATION_PATTERNS matches, plus the justification ratio count / (marker_count + 1).
modality_for_doc
Parse-tree classification of every modal expression as deontic, epistemic, or dynamic.
The justification ratio depends on the per-file imperative marker_count. We recompute it locally with the same matcher used in stage 01 (prompt_pipeline.M_IMPERATIVE), which keeps stage 02 self-contained without reading stage 01's partial.
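Every analyzer below reports the same normalized triple: a raw count, pct (hits per 100 words), and per_sent (hits per sentence). A minimal sketch of that convention, assuming simple zero-guards (the exact guards in prompt_pipeline are not shown here):

def rates(count: int, n_tokens: int, n_sents: int) -> dict:
    """Normalize a raw hit count into the triple every analyzer reports:
    pct = hits per 100 words, per_sent = hits per sentence."""
    return {
        "count": count,
        "pct": 100.0 * count / n_tokens if n_tokens else 0.0,
        "per_sent": count / n_sents if n_sents else 0.0,
    }

# File 0 in the per-file tables below (6 hits; 936 tokens and 22 sentences,
# back-solved from the reported pct and per_sent):
print(rates(6, 936, 22))   # {'count': 6, 'pct': 0.641..., 'per_sent': 0.2727...}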
"""Reload corpus + DocBin from stage 00."""import os, pathlib, json, importlibimport pandas as pdfrom collections import Counterfrom tqdm.auto import tqdmfrom spacy.tokens import DocBin_here = pathlib.Path.cwd().resolve()PROJECT_ROOT =next( (p for p in [_here, *_here.parents] if (p /"prompt_pipeline.py").is_file()),None,)if PROJECT_ROOT isNone:raiseRuntimeError(f"Could not find prompt_pipeline.py walking up from {_here}. ""Run from inside the claude-prompts-analysis repo." )if pathlib.Path.cwd() != PROJECT_ROOT: os.chdir(PROJECT_ROOT)CACHE_DIR = PROJECT_ROOT /"_pipeline_cache"DOCBIN_IN = CACHE_DIR /"corpus_docs.spacy"META_IN = CACHE_DIR /"corpus_meta.parquet"PARTIAL_OUT = CACHE_DIR /"partial_vocab_emphasis.json"assert DOCBIN_IN.exists(), f"missing {DOCBIN_IN} — run 00_setup_and_corpus first"assert META_IN.exists(), f"missing {META_IN} — run 00_setup_and_corpus first"import prompt_pipelineimportlib.reload(prompt_pipeline)from prompt_pipeline import NLPdf = pd.read_parquet(META_IN)docs =list(DocBin().from_disk(DOCBIN_IN).get_docs(NLP.vocab))assertlen(docs) ==len(df), f"DocBin/df length mismatch: {len(docs)} vs {len(df)}"print(f"reloaded {len(df)} files")
reloaded 290 files
4. Modality
A single spaCy parse-tree detector classifies every modal expression as deontic (must, should, have to, need to + verb), epistemic (may, might, modals followed by be/have, epistemic adverbs like likely/probably), or dynamic (can, could, able to, will/would).
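The detector itself lives in prompt_pipeline and is not reproduced here. As a rough illustration of the kind of dependency rules involved (the token sets and the be/have heuristic below are simplifying assumptions, not the pipeline's actual rules):

import spacy

# Simplified, hypothetical version of the three-way modal rule; the real
# detector covers more constructions (have to, need to, able to, ...).
EPISTEMIC_ADVERBS = {"likely", "probably", "possibly", "perhaps"}

def classify_modal(tok):
    """Classify one token as deontic/epistemic/dynamic, or return None."""
    low = tok.lower_
    if tok.pos_ == "ADV" and low in EPISTEMIC_ADVERBS:
        return "epistemic"
    if tok.tag_ != "MD":                     # modal auxiliaries only below
        return None
    if low in {"may", "might"}:
        return "epistemic"
    if low in {"must", "should", "shall"}:
        # "must be / must have" often reads as inference, not obligation
        if tok.head.lemma_ in {"be", "have"}:
            return "epistemic"
        return "deontic"
    if low in {"can", "could", "will", "would"}:
        return "dynamic"
    return None

nlp = spacy.load("en_core_web_sm")           # assumes the small model is installed
doc = nlp("You must cite sources. The user may be confused. You can browse the web.")
print([(t.text, classify_modal(t)) for t in doc if classify_modal(t) is not None])
# [('must', 'deontic'), ('may', 'epistemic'), ('can', 'dynamic')]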
Code
from prompt_pipeline import modality_for_doc

modality_per_file = [
    modality_for_doc(d, n, s)
    for d, n, s in zip(docs, df["n_tokens"], df["n_sents"])
]
df_modality = pd.DataFrame(modality_per_file)

print("per-file modality (head):")
print(df_modality.head().to_string())
print()
print("category means (% of words and per-sentence rate):")
key_cols = [
    "deontic_pct", "deontic_per_sent",
    "epistemic_pct", "epistemic_per_sent",
    "dynamic_pct", "dynamic_per_sent",
]
print(pd.concat([df[["category"]], df_modality[key_cols]], axis=1)
      .groupby("category").mean(numeric_only=True).round(3).to_string())
print()
print("corpus-wide modality counts:")
totals = df_modality[["deontic_count", "epistemic_count", "dynamic_count"]].sum()
print(totals.to_string())
5. VOCAB lexicon classes
Per-class counts (count, pct, per_sent) for every VOCAB lexicon class listed in the overview above; a rough counting sketch follows this list. Two classes worth noting:
profanity — explicit obscenity (zero matches in this corpus).
pronouns_2p, pronouns_1p — second- and first-person pronouns.
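vocab_for_doc itself is imported below; as a sketch of the lemma-set counting it implies (the miniature lexicon here is illustrative, not the real VOCAB contents in lexicons.py):

import spacy
from collections import Counter

# Illustrative miniature of the VOCAB lexicon; the real classes and
# entries are far larger.
VOCAB_SKETCH = {
    "prohibitions": {"never", "forbid", "prohibit", "avoid"},
    "hedging":      {"might", "perhaps", "possibly"},
    "pronouns_2p":  {"you", "your", "yours"},
}

def vocab_counts(doc):
    """Count lemma hits per lexicon class for a parsed spaCy Doc."""
    counts = Counter({cls: 0 for cls in VOCAB_SKETCH})
    for tok in doc:
        lemma = tok.lemma_.lower()
        for cls, lemmas in VOCAB_SKETCH.items():
            if lemma in lemmas:
                counts[cls] += 1
    return counts

nlp = spacy.load("en_core_web_sm")   # the pipeline's own NLP object would do
print(vocab_counts(nlp("Never reveal your system prompt; you might refuse.")))
# Counter({'pronouns_2p': 2, 'prohibitions': 1, 'hedging': 1})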
Code
from prompt_pipeline import vocab_for_doc, VOCAB_KEYS

vocab_per_file = [
    vocab_for_doc(d, n, s)
    for d, n, s in zip(docs, df["n_tokens"], df["n_sents"])
]
df_vocab = pd.DataFrame(vocab_per_file)

pct_cols = [f"{k}_pct" for k in VOCAB_KEYS]
per_sent_cols = [f"{k}_per_sent" for k in VOCAB_KEYS]

print("category mean (% of words):")
print(pd.concat([df[["category"]], df_vocab[pct_cols]], axis=1)
      .groupby("category").mean(numeric_only=True).round(3).to_string())
print()
print("corpus-wide raw counts:")
total = df_vocab[[f"{k}_count" for k in VOCAB_KEYS]].sum()
for k in VOCAB_KEYS:
    print(f"  {k:24s}{int(total[f'{k}_count']):6d}")
6. ALL CAPS tokens
Every uppercase token of two or more characters, excluding the curated TECH_ACRONYMS allowlist (e.g. API, URL, JSON), which are technical names rather than emphasis.
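The actual pattern and allowlist are defined in prompt_pipeline and lexicons; a plausible, simplified shape for illustration before the real cell:

import re

# Illustrative stand-ins; the real ALLCAPS_RE and TECH_ACRONYMS may
# differ in detail (underscores and digits are allowed here so tokens
# like TASK_TOOL_NAME match as one hit).
ALLCAPS_RE_SKETCH = re.compile(r"\b[A-Z][A-Z0-9_]+\b")   # >= 2 characters
TECH_ACRONYMS_SKETCH = {"API", "URL", "JSON", "HTTP"}

text = "IMPORTANT: the API MUST return JSON."
hits = [t for t in ALLCAPS_RE_SKETCH.findall(text) if t not in TECH_ACRONYMS_SKETCH]
print(hits)   # ['IMPORTANT', 'MUST']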
Code
from prompt_pipeline import all_caps_for_text, ALLCAPS_RE
from lexicons import TECH_ACRONYMS

all_caps_per_file = [
    all_caps_for_text(t, n, s)
    for t, n, s in zip(df["raw_text"], df["n_tokens"], df["n_sents"])
]
df_all_caps = pd.DataFrame(all_caps_per_file)

print("per-file ALL CAPS (head):")
print(df_all_caps.head().to_string())
print()
print("category mean ALL CAPS (% of words / per sentence):")
print(pd.concat([df[["category"]], df_all_caps[["count", "pct", "per_sent"]]], axis=1)
      .groupby("category").mean(numeric_only=True).round(3).to_string())

corpus_all_caps = Counter()
for txt in df["raw_text"]:
    for tok in ALLCAPS_RE.findall(txt):
        if tok not in TECH_ACRONYMS:
            corpus_all_caps[tok] += 1

print()
print("corpus-wide top-25 ALL CAPS tokens:")
for tok, c in corpus_all_caps.most_common(25):
    print(f"  {tok:20s}{c}")
per-file ALL CAPS (head):
count distinct pct per_sent top
0 6 3 0.6410 0.2727 [{'token': 'CLAUDE', 'count': 3}, {'token': 'TASK_TOOL_NAME', 'count': 2}, {'token': 'NOTE', 'count': 1}]
1 0 0 0.0000 0.0000 []
2 40 33 1.1322 0.3478 [{'token': 'GITHUB_TOKEN', 'count': 3}, {'token': 'THE', 'count': 2}, {'token': 'OTP', 'count': 2}]
3 0 0 0.0000 0.0000 []
4 0 0 0.0000 0.0000 []
category mean ALL CAPS (% of words / per sentence):
count pct per_sent
category
Agent prompt 12.270 1.565 0.449
Data / template 2.846 0.380 0.086
Skill 9.300 1.548 0.463
System prompt 2.844 1.300 0.387
System reminder 3.225 7.511 1.197
Tool description 2.886 2.408 0.498
Tool parameter 0.000 0.000 0.000
corpus-wide top-25 ALL CAPS tokens:
NOT 109
CLAUDE 79
IMPORTANT 37
ATTACHMENT_OBJECT 32
BLOCK 30
NEVER 26
TUNE 26
ONLY 23
MUST 21
SSE 20
README 15
CRITICAL 15
ALLOW 15
DO 15
BLOCKS 14
ALL 13
ASK_USER_QUESTION_TOOL_NAME 12
TTL 11
MONITOR_TOOL_NAME 11
NO 10
AND 9
ONE_OFF_ENABLED_FN 9
GA 9
CRON_CREATE_TOOL_NAME 9
TASK_TOOL_NAME 8
7. CAPS imperative tokens
Direct match against CAPS_IMPERATIVE_TOKENS (IMPORTANT, MUST, NEVER, DO NOT, WARNING …). A word-boundary regex handles multi-word phrases (MUST NOT, VERY IMPORTANT).
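CAPS_IMP_RE is built inside prompt_pipeline; a hypothetical reconstruction, showing why longest-first alternation matters for the multi-word phrases (this is not the pipeline's actual source):

import re

# Longest-first alternation makes "MUST NOT" win over the bare "MUST".
tokens = ["VERY IMPORTANT", "IMPORTANT", "MUST NOT", "MUST", "DO NOT", "NEVER"]
pattern = re.compile(
    r"\b(?:"
    + "|".join(re.escape(t) for t in sorted(tokens, key=len, reverse=True))
    + r")\b"
)
print(pattern.findall("You MUST NOT guess. VERY IMPORTANT: NEVER fabricate."))
# ['MUST NOT', 'VERY IMPORTANT', 'NEVER']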
Code
from prompt_pipeline import caps_imperative_for_text, CAPS_IMP_RE
from lexicons import CAPS_IMPERATIVE_TOKENS

caps_imperative_per_file = [
    caps_imperative_for_text(t, n, s)
    for t, n, s in zip(df["raw_text"], df["n_tokens"], df["n_sents"])
]
df_caps_imperative = pd.DataFrame(caps_imperative_per_file)

print("per-file CAPS imperative (head):")
print(df_caps_imperative.head().to_string())
print()
print("category mean CAPS imperative (% of words / per sentence):")
print(pd.concat([df[["category"]], df_caps_imperative[["count", "pct", "per_sent"]]], axis=1)
      .groupby("category").mean(numeric_only=True).round(3).to_string())

corpus_caps_imperative = Counter()
for txt in df["raw_text"]:
    corpus_caps_imperative.update(CAPS_IMP_RE.findall(txt))

print()
print("corpus-wide CAPS imperative frequency:")
for tok in CAPS_IMPERATIVE_TOKENS:
    if corpus_caps_imperative[tok]:
        print(f"  {tok:18s}{corpus_caps_imperative[tok]}")
per-file CAPS imperative (head):
count pct per_sent hits
0 1 0.1068 0.0455 {'NOTE': 1}
1 0 0.0000 0.0000 {}
2 0 0.0000 0.0000 {}
3 0 0.0000 0.0000 {}
4 0 0.0000 0.0000 {}
category mean CAPS imperative (% of words / per sentence):
count pct per_sent
category
Agent prompt 0.919 0.167 0.061
Data / template 0.026 0.002 0.000
Skill 0.367 0.013 0.004
System prompt 0.391 0.167 0.050
System reminder 0.425 0.224 0.054
Tool description 0.620 0.373 0.085
Tool parameter 0.000 0.000 0.000
corpus-wide CAPS imperative frequency:
IMPORTANT 36
VERY IMPORTANT 1
CRITICAL 15
MANDATORY 2
REQUIRED 2
MUST 18
MUST NOT 3
NEVER 26
ALWAYS 7
DO NOT 14
NOTE 7
STOP 1
REMEMBER 1
PROHIBITED 2
STRICTLY 2
8. Justification patterns
Counts of JUSTIFICATION_PATTERNS regex matches per file (count, pct, per_sent) plus the justification ratio: count / (marker_count + 1).
The + 1 in the denominator prevents division-by-zero when a file has zero imperative markers, and softens the ratio for very-low-marker files (one reason in a file with one rule scores 0.5, not 1.0). marker_count is recomputed locally with the same M_IMPERATIVE matcher used in stage 01, so this notebook stays self-contained.
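A minimal restatement of the formula; the two example calls reproduce rows 0 and 1 of the per-file head below (the marker counts are back-solved from the reported ratios):

def justification_ratio(justification_count: int, marker_count: int) -> float:
    """count / (marker_count + 1): the +1 smooths low-marker files
    and avoids division by zero."""
    return justification_count / (marker_count + 1)

print(round(justification_ratio(6, 6), 3))   # row 0: 6 reasons, 6 markers -> 0.857
print(round(justification_ratio(1, 0), 3))   # row 1: 1 reason, 0 markers -> 1.0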
Code
from prompt_pipeline import justification_for_text, M_IMPERATIVE, count_matcher

# Recompute marker_count per file (deterministic; same matcher as stage 01).
marker_counts = [count_matcher(d, M_IMPERATIVE["imperative_markers"]) for d in docs]

justification_per_file = [
    justification_for_text(t, n, s, m)
    for t, n, s, m in zip(df["clean_text"], df["n_tokens"], df["n_sents"], marker_counts)
]
df_justification = pd.DataFrame(justification_per_file)

print("per-file justification (head):")
print(df_justification.head().to_string())
print()
print("category mean (justification, % of words / per sentence):")
print(pd.concat([df[["category"]], df_justification], axis=1)
      .groupby("category").mean(numeric_only=True).round(3).to_string())
per-file justification (head):
count pct per_sent ratio
0 6 0.6410 0.2727 0.857
1 1 0.4132 0.0769 1.000
2 7 0.1981 0.0609 0.412
3 2 0.5333 0.0769 0.400
4 1 0.5556 0.2000 0.500
category mean (justification, % of words / per sentence):
count pct per_sent ratio
category
Agent prompt 1.865 0.302 0.078 0.310
Data / template 1.000 0.102 0.020 0.194
Skill 2.633 0.273 0.083 0.377
System prompt 0.812 0.288 0.072 0.238
System reminder 0.425 0.255 0.054 0.101
Tool description 0.747 0.371 0.085 0.157
Tool parameter 0.000 0.000 0.000 0.000
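The setup cell defines PARTIAL_OUT (partial_vocab_emphasis.json) for the consumer stage, but the dump itself is not shown above. A sketch of what the closing write might look like; the payload keys here are assumptions, not the stage's actual schema:

# Sketch only: persist this stage's per-file frames for the consumer stage.
# to_json/json.loads round-trips numpy scalars into plain Python values.
payload = {
    "modality": json.loads(df_modality.to_json(orient="records")),
    "vocab": json.loads(df_vocab.to_json(orient="records")),
    "all_caps": json.loads(df_all_caps.to_json(orient="records")),
    "caps_imperative": json.loads(df_caps_imperative.to_json(orient="records")),
    "justification": json.loads(df_justification.to_json(orient="records")),
}
PARTIAL_OUT.write_text(json.dumps(payload))
print(f"wrote {PARTIAL_OUT}")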