Vocab + emphasis analyzers

This is stage 02 of the 6-stage producer chain. It reloads the cache written by stage 00 and runs five analyzers focused on vocabulary, emphasis, and justification.

  • modality_for_doc — three-class modality (deontic / epistemic / dynamic) via spaCy parse-tree patterns.
  • vocab_for_doc — counts per VOCAB lexicon class (prohibitions, prescriptions, politeness, warmth, hedging, structural, profanity, pronouns).
  • all_caps_for_text — ALL CAPS tokens, with TECH_ACRONYMS excluded.
  • caps_imperative_for_text — direct matches against CAPS_IMPERATIVE_TOKENS (e.g. MUST, NEVER, IMPORTANT).
  • justification_for_text — JUSTIFICATION_PATTERNS matches plus the justification ratio, count / (marker_count + 1).

The justification ratio depends on the per-file imperative marker_count. We recompute it locally with the same matcher used in stage 01 (prompt_pipeline.M_IMPERATIVE), which keeps stage 02 self-contained without reading stage 01’s partial.

Output: _pipeline_cache/partial_vocab_emphasis.json.

Code
"""Reload corpus + DocBin from stage 00."""
import os, pathlib, json, importlib
import pandas as pd
from collections import Counter
from tqdm.auto import tqdm
from spacy.tokens import DocBin

_here = pathlib.Path.cwd().resolve()
PROJECT_ROOT = next(
    (p for p in [_here, *_here.parents] if (p / "prompt_pipeline.py").is_file()),
    None,
)
if PROJECT_ROOT is None:
    raise RuntimeError(
        f"Could not find prompt_pipeline.py walking up from {_here}. "
        "Run from inside the claude-prompts-analysis repo."
    )
if pathlib.Path.cwd() != PROJECT_ROOT:
    os.chdir(PROJECT_ROOT)

CACHE_DIR  = PROJECT_ROOT / "_pipeline_cache"
DOCBIN_IN  = CACHE_DIR / "corpus_docs.spacy"
META_IN    = CACHE_DIR / "corpus_meta.parquet"
PARTIAL_OUT = CACHE_DIR / "partial_vocab_emphasis.json"

assert DOCBIN_IN.exists(),  f"missing {DOCBIN_IN} — run 00_setup_and_corpus first"
assert META_IN.exists(),    f"missing {META_IN} — run 00_setup_and_corpus first"

import prompt_pipeline
importlib.reload(prompt_pipeline)
from prompt_pipeline import NLP

df = pd.read_parquet(META_IN)
docs = list(DocBin().from_disk(DOCBIN_IN).get_docs(NLP.vocab))
assert len(docs) == len(df), f"DocBin/df length mismatch: {len(docs)} vs {len(df)}"
print(f"reloaded {len(df)} files")
reloaded 290 files

4. Modality

A single spaCy parse-tree detector classifies every modal expression as deontic (must, should, have to, need to + verb), epistemic (may, might, modals followed by be/have, epistemic adverbs like likely/probably), or dynamic (can, could, able to, will/would).
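
The production detector walks the spaCy parse tree; as a rough illustration of the three-way split and the pipeline's normalization (pct as % of word tokens, per_sent as matches per sentence), here is a purely lexical approximation. The word lists are toy examples, not the production patterns:

```python
import re

# Toy surface-level modal lists -- the real detector uses spaCy parse-tree
# patterns, so this only approximates the taxonomy described above.
DEONTIC   = r"\b(?:must|should|have to|need to)\b"
EPISTEMIC = r"\b(?:may|might|likely|probably)\b"
DYNAMIC   = r"\b(?:can|could|able to|will|would)\b"

def modality_counts(text: str) -> dict:
    """Count modal expressions per class, normalized like the pipeline:
    pct = 100 * count / n_tokens, per_sent = count / n_sents."""
    n_tokens = max(1, len(text.split()))
    n_sents = max(1, len(re.findall(r"[.!?]+", text)))
    low = text.lower()
    out = {}
    for name, pat in [("deontic", DEONTIC), ("epistemic", EPISTEMIC),
                      ("dynamic", DYNAMIC)]:
        c = len(re.findall(pat, low))
        out[f"{name}_count"] = c
        out[f"{name}_pct"] = round(100 * c / n_tokens, 4)
        out[f"{name}_per_sent"] = round(c / n_sents, 4)
    return out

print(modality_counts("You must reply. It may fail. You can retry."))
```

The word-boundary anchors matter: without them, "can" would also fire inside "cannot" or "scan".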

Code
from prompt_pipeline import modality_for_doc

modality_per_file = [modality_for_doc(d, n, s)
                     for d, n, s in zip(docs, df["n_tokens"], df["n_sents"])]
df_modality = pd.DataFrame(modality_per_file)

print("per-file modality (head):")
print(df_modality.head().to_string())
print()
print("category means (% of words and per-sentence rate):")
key_cols = ["deontic_pct", "deontic_per_sent",
            "epistemic_pct", "epistemic_per_sent",
            "dynamic_pct", "dynamic_per_sent"]
print(pd.concat([df[["category"]], df_modality[key_cols]], axis=1)
        .groupby("category").mean(numeric_only=True).round(3).to_string())
print()
print("corpus-wide modality counts:")
totals = df_modality[["deontic_count", "epistemic_count", "dynamic_count"]].sum()
print(totals.to_string())
per-file modality (head):
   deontic_count  deontic_pct  deontic_per_sent  epistemic_count  epistemic_pct  epistemic_per_sent  dynamic_count  dynamic_pct  dynamic_per_sent top_construction
0              5       0.5342            0.2273                8         0.8547              0.3636              4       0.4274            0.1818           should
1              2       0.8264            0.1538                2         0.8264              0.1538              2       0.8264            0.1538           should
2              7       0.1981            0.0609                0         0.0000              0.0000             48       1.3586            0.4174              'll
3              1       0.2667            0.0385                1         0.2667              0.0385              2       0.5333            0.0769               ca
4              0       0.0000            0.0000                0         0.0000              0.0000              0       0.0000            0.0000              NaN

category means (% of words and per-sentence rate):
                  deontic_pct  deontic_per_sent  epistemic_pct  epistemic_per_sent  dynamic_pct  dynamic_per_sent
category                                                                                                         
Agent prompt            0.236             0.062          0.409               0.109        0.351             0.092
Data / template         0.073             0.015          0.145               0.030        0.247             0.061
Skill                   0.132             0.031          0.142               0.036        0.370             0.101
System prompt           0.335             0.084          0.565               0.114        0.580             0.139
System reminder         0.299             0.068          0.503               0.103        0.262             0.055
Tool description        0.620             0.138          0.269               0.075        0.961             0.189
Tool parameter          0.000             0.000          0.000               0.000        0.000             0.000

corpus-wide modality counts:
deontic_count      261
epistemic_count    322
dynamic_count      559

5. Vocabulary profile

VOCAB is an 11-class lexicon. Each class is a hand-curated list of phrases matched per file and normalized as pct (% of word tokens) and per_sent (matches per sentence):

  • hard_prohibitions — categorical no: never, do not, forbidden, prohibited.
  • hard_prescriptions — categorical must: must, required, always.
  • soft_prescriptions — softened obligation: should, prefer, recommended.
  • politeness_direct — bare politeness markers: please, kindly.
  • politeness_softening — face-saving framings: would you, if you could.
  • warmth_encouragement — affirmative tone: great, well done, nice work.
  • hedging — uncertainty markers: perhaps, roughly, I think.
  • structural_markers — discourse organization: first, next, however.
  • profanity — explicit obscenity (zero matches in this corpus).
  • pronouns_2p, pronouns_1p — second- and first-person pronouns.
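
Each class boils down to phrase counting over lowercased text. A minimal sketch with an illustrative two-class mini-lexicon (not the real VOCAB lists, which live in the lexicons module):

```python
import re

# Illustrative mini-lexicon; the real VOCAB has 11 hand-curated classes.
MINI_VOCAB = {
    "hard_prohibitions": ["never", "do not", "forbidden", "prohibited"],
    "hedging": ["perhaps", "roughly", "i think"],
}

def vocab_counts(text: str, n_tokens: int, n_sents: int) -> dict:
    """Per-class phrase counts plus pct (% of word tokens) and per_sent."""
    low = text.lower()
    out = {}
    for cls, phrases in MINI_VOCAB.items():
        # Word-boundary alternation so "never" does not hit "nevertheless".
        pat = r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b"
        c = len(re.findall(pat, low))
        out[f"{cls}_count"] = c
        out[f"{cls}_pct"] = round(100 * c / max(1, n_tokens), 4)
        out[f"{cls}_per_sent"] = round(c / max(1, n_sents), 4)
    return out
```

Multi-word phrases ("do not", "i think") work unchanged because the alternation matches them as literal sequences.
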
Code
from prompt_pipeline import vocab_for_doc, VOCAB_KEYS

vocab_per_file = [vocab_for_doc(d, n, s)
                  for d, n, s in zip(docs, df["n_tokens"], df["n_sents"])]
df_vocab = pd.DataFrame(vocab_per_file)

pct_cols      = [f"{k}_pct" for k in VOCAB_KEYS]
per_sent_cols = [f"{k}_per_sent" for k in VOCAB_KEYS]

print("category mean (% of words):")
print(pd.concat([df[["category"]], df_vocab[pct_cols]], axis=1)
        .groupby("category").mean(numeric_only=True).round(3).to_string())
print()
print("corpus-wide raw counts:")
total = df_vocab[[f"{k}_count" for k in VOCAB_KEYS]].sum()
for k in VOCAB_KEYS:
    print(f"  {k:24s} {int(total[f'{k}_count']):6d}")
category mean (% of words):
                  hard_prohibitions_pct  hard_prescriptions_pct  soft_prescriptions_pct  politeness_direct_pct  politeness_softening_pct  warmth_encouragement_pct  hedging_pct  structural_markers_pct  profanity_pct  pronouns_2p_pct  pronouns_1p_pct
category                                                                                                                                                                                                                                                
Agent prompt                      0.477                   0.233                   0.379                  0.043                     0.032                     0.004        0.179                   0.230            0.0            1.362            0.089
Data / template                   0.282                   0.285                   0.021                  0.000                     0.036                     0.000        0.086                   0.118            0.0            0.637            0.054
Skill                             0.473                   0.105                   0.097                  0.002                     0.011                     0.018        0.169                   0.160            0.0            0.955            0.081
System prompt                     0.820                   0.273                   0.666                  0.030                     0.006                     0.000        0.313                   0.465            0.0            2.099            0.125
System reminder                   0.513                   0.407                   0.554                  0.150                     0.053                     0.000        0.472                   0.667            0.0            1.617            0.064
Tool description                  1.193                   0.988                   0.418                  0.005                     0.010                     0.000        0.138                   0.279            0.0            1.648            0.087
Tool parameter                    0.000                   0.000                   0.000                  0.000                     0.000                     0.000        0.000                   0.000            0.0            0.000            0.000

corpus-wide raw counts:
  hard_prohibitions           631
  hard_prescriptions          358
  soft_prescriptions          254
  politeness_direct            21
  politeness_softening         28
  warmth_encouragement          5
  hedging                     243
  structural_markers          287
  profanity                     0
  pronouns_2p                1397
  pronouns_1p                 185

6. ALL CAPS emphasis

Counts every uppercase token of two or more characters, excluding the curated TECH_ACRONYMS allowlist (e.g. API, URL, JSON), which are conventional spellings rather than emphasis.
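
A plausible shape for the matcher (the production ALLCAPS_RE in prompt_pipeline may differ in detail): uppercase runs of at least two characters, allowing digits and underscores so identifiers like TASK_TOOL_NAME match as one token.

```python
import re

# Assumed shape of the matcher; the real ALLCAPS_RE lives in prompt_pipeline.
ALLCAPS_SKETCH = re.compile(r"\b[A-Z][A-Z0-9_]+\b")

# Allowlisted tokens are skipped, so API/URL/JSON never count as emphasis.
TECH_ACRONYMS_SKETCH = {"API", "URL", "JSON"}

def all_caps_tokens(text: str) -> list[str]:
    return [t for t in ALLCAPS_SKETCH.findall(text)
            if t not in TECH_ACRONYMS_SKETCH]

print(all_caps_tokens("IMPORTANT: call the API, NEVER hardcode the URL."))
# ['IMPORTANT', 'NEVER']
```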

Code
from prompt_pipeline import all_caps_for_text, ALLCAPS_RE
from lexicons import TECH_ACRONYMS

all_caps_per_file = [all_caps_for_text(t, n, s)
                     for t, n, s in zip(df["raw_text"], df["n_tokens"], df["n_sents"])]
df_all_caps = pd.DataFrame(all_caps_per_file)

print("per-file ALL CAPS (head):")
print(df_all_caps.head().to_string())
print()
print("category mean ALL CAPS (% of words / per sentence):")
print(pd.concat([df[["category"]], df_all_caps[["count", "pct", "per_sent"]]], axis=1)
        .groupby("category").mean(numeric_only=True).round(3).to_string())

corpus_all_caps = Counter()
for txt in df["raw_text"]:
    for tok in ALLCAPS_RE.findall(txt):
        if tok not in TECH_ACRONYMS:
            corpus_all_caps[tok] += 1
print()
print("corpus-wide top-25 ALL CAPS tokens:")
for tok, c in corpus_all_caps.most_common(25):
    print(f"  {tok:20s} {c}")
per-file ALL CAPS (head):
   count  distinct     pct  per_sent                                                                                                        top
0      6         3  0.6410    0.2727  [{'token': 'CLAUDE', 'count': 3}, {'token': 'TASK_TOOL_NAME', 'count': 2}, {'token': 'NOTE', 'count': 1}]
1      0         0  0.0000    0.0000                                                                                                         []
2     40        33  1.1322    0.3478        [{'token': 'GITHUB_TOKEN', 'count': 3}, {'token': 'THE', 'count': 2}, {'token': 'OTP', 'count': 2}]
3      0         0  0.0000    0.0000                                                                                                         []
4      0         0  0.0000    0.0000                                                                                                         []

category mean ALL CAPS (% of words / per sentence):
                   count    pct  per_sent
category                                 
Agent prompt      12.270  1.565     0.449
Data / template    2.846  0.380     0.086
Skill              9.300  1.548     0.463
System prompt      2.844  1.300     0.387
System reminder    3.225  7.511     1.197
Tool description   2.886  2.408     0.498
Tool parameter     0.000  0.000     0.000

corpus-wide top-25 ALL CAPS tokens:
  NOT                  109
  CLAUDE               79
  IMPORTANT            37
  ATTACHMENT_OBJECT    32
  BLOCK                30
  NEVER                26
  TUNE                 26
  ONLY                 23
  MUST                 21
  SSE                  20
  README               15
  CRITICAL             15
  ALLOW                15
  DO                   15
  BLOCKS               14
  ALL                  13
  ASK_USER_QUESTION_TOOL_NAME 12
  TTL                  11
  MONITOR_TOOL_NAME    11
  NO                   10
  AND                  9
  ONE_OFF_ENABLED_FN   9
  GA                   9
  CRON_CREATE_TOOL_NAME 9
  TASK_TOOL_NAME       8

7. CAPS imperative tokens

Direct match against CAPS_IMPERATIVE_TOKENS (IMPORTANT, MUST, NEVER, DO NOT, WARNING …). Word-boundary regex handles multi-word phrases (MUST NOT, VERY IMPORTANT).
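
One detail worth noting: in a regex alternation, multi-word phrases must precede their prefixes, or MUST NOT would be counted as a bare MUST. A sketch of building such a matcher (the real CAPS_IMP_RE is defined in prompt_pipeline; this token list is abbreviated):

```python
import re

TOKENS = ["IMPORTANT", "VERY IMPORTANT", "MUST", "MUST NOT", "NEVER", "DO NOT"]

# Sort longest-first so "MUST NOT" wins over "MUST" in the alternation.
pattern = re.compile(
    r"\b(?:"
    + "|".join(re.escape(t) for t in sorted(TOKENS, key=len, reverse=True))
    + r")\b"
)

print(pattern.findall("You MUST NOT guess. This is VERY IMPORTANT."))
# ['MUST NOT', 'VERY IMPORTANT']
```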

Code
from prompt_pipeline import caps_imperative_for_text, CAPS_IMP_RE
from lexicons import CAPS_IMPERATIVE_TOKENS

caps_imperative_per_file = [
    caps_imperative_for_text(t, n, s)
    for t, n, s in zip(df["raw_text"], df["n_tokens"], df["n_sents"])
]
df_caps_imperative = pd.DataFrame(caps_imperative_per_file)

print("per-file CAPS imperative (head):")
print(df_caps_imperative.head().to_string())
print()
print("category mean CAPS imperative (% of words / per sentence):")
print(pd.concat([df[["category"]], df_caps_imperative[["count", "pct", "per_sent"]]], axis=1)
        .groupby("category").mean(numeric_only=True).round(3).to_string())

corpus_caps_imperative = Counter()
for txt in df["raw_text"]:
    corpus_caps_imperative.update(CAPS_IMP_RE.findall(txt))
print()
print("corpus-wide CAPS imperative frequency:")
for tok in CAPS_IMPERATIVE_TOKENS:
    if corpus_caps_imperative[tok]:
        print(f"  {tok:18s} {corpus_caps_imperative[tok]}")
per-file CAPS imperative (head):
   count     pct  per_sent         hits
0      1  0.1068    0.0455  {'NOTE': 1}
1      0  0.0000    0.0000           {}
2      0  0.0000    0.0000           {}
3      0  0.0000    0.0000           {}
4      0  0.0000    0.0000           {}

category mean CAPS imperative (% of words / per sentence):
                  count    pct  per_sent
category                                
Agent prompt      0.919  0.167     0.061
Data / template   0.026  0.002     0.000
Skill             0.367  0.013     0.004
System prompt     0.391  0.167     0.050
System reminder   0.425  0.224     0.054
Tool description  0.620  0.373     0.085
Tool parameter    0.000  0.000     0.000

corpus-wide CAPS imperative frequency:
  IMPORTANT          36
  VERY IMPORTANT     1
  CRITICAL           15
  MANDATORY          2
  REQUIRED           2
  MUST               18
  MUST NOT           3
  NEVER              26
  ALWAYS             7
  DO NOT             14
  NOTE               7
  STOP               1
  REMEMBER           1
  PROHIBITED         2
  STRICTLY           2

8. Justification patterns

Counts of JUSTIFICATION_PATTERNS regex matches per file (count, pct, per_sent) plus the justification ratio: count / (marker_count + 1).

The + 1 in the denominator prevents division-by-zero when a file has zero imperative markers, and softens the ratio for very-low-marker files (one reason in a file with one rule scores 0.5, not 1.0). marker_count is recomputed locally with the same M_IMPERATIVE matcher used in stage 01, so this notebook stays self-contained.
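
The smoothing is easiest to see with a couple of hypothetical counts:

```python
def justification_ratio(justification_count: int, marker_count: int) -> float:
    # +1 keeps the ratio finite at zero markers and damps low-marker files.
    return justification_count / (marker_count + 1)

print(justification_ratio(1, 1))  # 0.5: one reason for one rule, not 1.0
print(justification_ratio(3, 0))  # 3.0: reasons present, no imperative markers
```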

Code
from prompt_pipeline import justification_for_text, M_IMPERATIVE, count_matcher

# Recompute marker_count per file (deterministic; same matcher as stage 01).
marker_counts = [count_matcher(d, M_IMPERATIVE["imperative_markers"]) for d in docs]

justification_per_file = [
    justification_for_text(t, n, s, m)
    for t, n, s, m in zip(df["clean_text"], df["n_tokens"], df["n_sents"], marker_counts)
]
df_justification = pd.DataFrame(justification_per_file)

print("per-file justification (head):")
print(df_justification.head().to_string())
print()
print("category mean (justification, % of words / per sentence):")
print(pd.concat([df[["category"]], df_justification], axis=1)
        .groupby("category").mean(numeric_only=True).round(3).to_string())
per-file justification (head):
   count     pct  per_sent  ratio
0      6  0.6410    0.2727  0.857
1      1  0.4132    0.0769  1.000
2      7  0.1981    0.0609  0.412
3      2  0.5333    0.0769  0.400
4      1  0.5556    0.2000  0.500

category mean (justification, % of words / per sentence):
                  count    pct  per_sent  ratio
category                                       
Agent prompt      1.865  0.302     0.078  0.310
Data / template   1.000  0.102     0.020  0.194
Skill             2.633  0.273     0.083  0.377
System prompt     0.812  0.288     0.072  0.238
System reminder   0.425  0.255     0.054  0.101
Tool description  0.747  0.371     0.085  0.157
Tool parameter    0.000  0.000     0.000  0.000

9. Write partial_vocab_emphasis.json

Code
partial = {
    str(i): {
        "modality":         modality_per_file[i],
        "vocab":            vocab_per_file[i],
        "all_caps":         all_caps_per_file[i],
        "caps_imperative":  caps_imperative_per_file[i],
        "justification":    justification_per_file[i],
    }
    for i in range(len(df))
}
with open(PARTIAL_OUT, "w") as f:
    json.dump(partial, f)
size = PARTIAL_OUT.stat().st_size
print(f"wrote {PARTIAL_OUT.relative_to(PROJECT_ROOT)}  ({size:,} bytes, {size/1024:.1f} KiB)")
print(f"      {len(partial)} per-file records, 5 blocks each")
wrote _pipeline_cache/partial_vocab_emphasis.json  (452,906 bytes, 442.3 KiB)
      290 per-file records, 5 blocks each
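
A downstream stage can reload the partial and sanity-check its shape before merging. A hedged sketch (load_partial is hypothetical, not part of prompt_pipeline):

```python
import json
import pathlib

# The five block names written by this stage, per record.
EXPECTED_BLOCKS = {"modality", "vocab", "all_caps", "caps_imperative",
                   "justification"}

def load_partial(path: str) -> dict:
    """Reload the stage-02 partial, verifying every record has all five blocks."""
    partial = json.loads(pathlib.Path(path).read_text())
    bad = [i for i, rec in partial.items() if set(rec) != EXPECTED_BLOCKS]
    if bad:
        raise ValueError(f"records with missing/extra blocks: {bad[:5]}")
    return partial
```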