Register-family analyzers

This is stage 1 of the 6-stage producer chain. It reloads the cache written by stage 00, runs the four register-family per-doc analyzers, and writes a partial JSON keyed by file index.

Each analyzer and what it produces (per-file):

  • mood_for_doc: Imperative-marker lexical density (count + pct + per_sent).
  • register_for_doc: TTR, mean sentence length, dependency depth, Heylighen F-score, four register classes.
  • stance_for_doc: Five lexical stance classes (directive / expository / positive_evaluative / negative_evaluative / dialogic) + 1p/2p engagement + the positive_evaluative quality/emphasis split.
  • sentence_register_for_doc: Per-sentence multi-label classifier (six classes: collaborative / permissive / appreciative / imperative / directive / configuring) + addressee classification for appreciative/collaborative.

Output: _pipeline_cache/partial_register.json (per-file metric trees).

Code
"""Reload corpus + DocBin from stage 00."""
import os, pathlib, json, importlib
import pandas as pd
from tqdm.auto import tqdm
from spacy.tokens import DocBin

_here = pathlib.Path.cwd().resolve()
PROJECT_ROOT = next(
    (p for p in [_here, *_here.parents] if (p / "prompt_pipeline.py").is_file()),
    None,
)
if PROJECT_ROOT is None:
    raise RuntimeError(
        f"Could not find prompt_pipeline.py walking up from {_here}. "
        "Run from inside the claude-prompts-analysis repo."
    )
if pathlib.Path.cwd() != PROJECT_ROOT:
    os.chdir(PROJECT_ROOT)

CACHE_DIR  = PROJECT_ROOT / "_pipeline_cache"
DOCBIN_IN  = CACHE_DIR / "corpus_docs.spacy"
META_IN    = CACHE_DIR / "corpus_meta.parquet"
PARTIAL_OUT = CACHE_DIR / "partial_register.json"

assert DOCBIN_IN.exists(),  f"missing {DOCBIN_IN} — run 00_setup_and_corpus first"
assert META_IN.exists(),    f"missing {META_IN} — run 00_setup_and_corpus first"

import prompt_pipeline
importlib.reload(prompt_pipeline)
from prompt_pipeline import NLP

df = pd.read_parquet(META_IN)
docs = list(DocBin().from_disk(DOCBIN_IN).get_docs(NLP.vocab))
assert len(docs) == len(df), f"DocBin/df length mismatch: {len(docs)} vs {len(df)}"
print(f"reloaded {len(df)} files, {sum(len(d) for d in docs):,} doc tokens")
reloaded 290 files, 145,534 doc tokens

1. Mood

Mood marker density is the share of word tokens matched by IMPERATIVE_MARKERS (the lexicon is echoed verbatim into the YAML’s lexicons block). Reported per file as count + pct (% of word tokens) + per_sent (rate per sentence). The per-sentence imperative classifier moved into sentence_register; this block carries only the lexical-density signal (marker_*).
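The count / pct / per_sent triple reduces to simple arithmetic over word tokens. A minimal sketch, using a tiny hypothetical marker set (the real IMPERATIVE_MARKERS lexicon lives in prompt_pipeline and is much larger):

```python
# Hypothetical stand-in for IMPERATIVE_MARKERS — illustration only.
MARKERS = {"must", "always", "never", "ensure", "do"}

def marker_density(word_tokens, n_sents):
    """count + pct (% of word tokens) + per_sent (rate per sentence)."""
    count = sum(1 for w in word_tokens if w.lower() in MARKERS)
    return {
        "marker_count": count,
        "marker_pct": round(100 * count / max(len(word_tokens), 1), 4),
        "marker_per_sent": round(count / max(n_sents, 1), 4),
    }

words = "You must always run the tests and never skip them".split()
print(marker_density(words, 2))
# → {'marker_count': 3, 'marker_pct': 30.0, 'marker_per_sent': 1.5}
```

The real analyzer tokenizes with spaCy rather than `str.split`, but the normalization is the same: percentages divide by word-token count, rates divide by sentence count.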

Code
from prompt_pipeline import mood_for_doc

mood_per_file = [mood_for_doc(d, n, s)
                 for d, n, s in zip(docs, df["n_tokens"], df["n_sents"])]
df_mood = pd.DataFrame(mood_per_file)
print("per-file mood (head):")
print(df_mood.head().to_string())
print()
print("category mean (mood — imperative-marker density):")
print(pd.concat([df[["category"]], df_mood], axis=1)
        .groupby("category").mean(numeric_only=True).round(3).to_string())
per-file mood (head):
   marker_count  marker_pct  marker_per_sent
0             6      0.6410           0.2727
1             0      0.0000           0.0000
2            16      0.4529           0.1391
3             4      1.0667           0.1538
4             1      0.5556           0.2000

category mean (mood — imperative-marker density):
                  marker_count  marker_pct  marker_per_sent
category                                                   
Agent prompt             5.459       0.773            0.224
Data / template          4.667       0.573            0.123
Skill                    8.400       0.603            0.162
System prompt            2.328       1.093            0.264
System reminder          1.700       1.032            0.241
Tool description         2.190       2.136            0.412
Tool parameter           0.000       0.000            0.000

2. Register

Register captures formality. Four numerical metrics per file:

  • TTR (type-token ratio) — unique types ÷ total tokens. Anti-correlates with file length (longer files reuse vocabulary, so TTR drops).
  • Mean sentence length — tokens per sentence.
  • Mean dependency depth — average nesting depth of the spaCy parse tree.
  • Heylighen F-score (Heylighen & Dewaele 2002) — a 0–100 formality index: F = 50 + 0.5 × (noun + adj + prep + article − pronoun − verb − adverb − interjection), each term as a percentage of all tokens. Higher = more formal-academic prose; the corpus clusters in the 70–80 band.

Plus lexical density for four register classes (frozen / formal / consultative / casual).
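The F-score formula can be sketched directly from coarse POS tags. This assumes spaCy's Universal POS set, with DET standing in for "article" (a slight overcount, since DET also covers demonstratives and quantifiers); the actual register_for_doc implementation may bucket tags differently:

```python
from collections import Counter

def heylighen_f(pos_tags):
    """F = 50 + 0.5 * (noun + adj + prep + article - pron - verb - adv - intj),
    each term as a percentage of all tokens (Heylighen & Dewaele 2002)."""
    pct = {t: 100 * c / len(pos_tags) for t, c in Counter(pos_tags).items()}
    formal  = sum(pct.get(t, 0.0) for t in ("NOUN", "ADJ", "ADP", "DET"))
    deictic = sum(pct.get(t, 0.0) for t in ("PRON", "VERB", "ADV", "INTJ"))
    return 50 + 0.5 * (formal - deictic)

print(heylighen_f(["NOUN", "NOUN", "ADJ", "DET", "ADP", "VERB", "PRON", "ADV"]))
# → 62.5
```

All-formal text tops out at 100, all-deictic text bottoms out at 0, which is why the 70–80 band observed here reads as formal-academic.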

Code
from prompt_pipeline import register_for_doc

register_per_file = [register_for_doc(d, n, s)
                     for d, n, s in zip(docs, df["n_tokens"], df["n_sents"])]
df_register = pd.DataFrame(register_per_file)
print("per-file register (head):")
print(df_register.head().to_string())
print()
print("category mean (numeric register cols):")
num_cols = [c for c in df_register.columns
            if c != "dominant_register" and not c.endswith("_count")]
print(pd.concat([df[["category"]], df_register[num_cols]], axis=1)
        .groupby("category").mean(numeric_only=True).round(3).to_string())
per-file register (head):
      ttr  mean_sent_len  dep_depth  f_score  frozen_count  formal_count  consultative_count  casual_count  frozen_pct  formal_pct  consultative_pct  casual_pct  frozen_per_sent  formal_per_sent  consultative_per_sent  casual_per_sent dominant_register
0  0.5407          42.55      3.910    71.98             0             0                   5             1         0.0         0.0            0.5342      0.1068              0.0              0.0                 0.2273           0.0455      consultative
1  0.6421          18.62      2.661    75.89             0             0                   0             1         0.0         0.0            0.0000      0.4132              0.0              0.0                 0.0000           0.0769            casual
2  0.3840          30.72      2.924    61.05             0             0                   8            28         0.0         0.0            0.2264      0.7925              0.0              0.0                 0.0696           0.2435            casual
3  0.7986          14.42      2.263    59.18             0             0                   0             2         0.0         0.0            0.0000      0.5333              0.0              0.0                 0.0000           0.0769            casual
4  0.7802          36.00      3.259    73.20             0             0                   0             0         0.0         0.0            0.0000      0.0000              0.0              0.0                 0.0000           0.0000              none

category mean (numeric register cols):
                    ttr  mean_sent_len  dep_depth  f_score  frozen_pct  formal_pct  consultative_pct  casual_pct  frozen_per_sent  formal_per_sent  consultative_per_sent  casual_per_sent
category                                                                                                                                                                                  
Agent prompt      0.601         26.657      3.087   70.479         0.0       0.000             0.259       0.294              0.0            0.000                  0.067            0.076
Data / template   0.536         22.658      2.702   74.358         0.0       0.001             0.249       0.252              0.0            0.000                  0.056            0.060
Skill             0.566         26.245      2.669   68.818         0.0       0.000             0.283       0.550              0.0            0.000                  0.070            0.130
System prompt     0.740         25.695      2.858   66.976         0.0       0.003             0.631       0.509              0.0            0.001                  0.149            0.129
System reminder   0.791         18.335      2.306   71.012         0.0       0.000             0.848       0.283              0.0            0.000                  0.160            0.060
Tool description  0.793         22.826      2.588   64.828         0.0       0.000             0.717       0.392              0.0            0.000                  0.147            0.074
Tool parameter    0.600         12.790      1.995   75.410         0.0       0.000             0.000       0.000              0.0            0.000                  0.000            0.000

3. Stance

Stance classifies expressed attitude into five polarity-aware classes: directive, expository, positive_evaluative, negative_evaluative, dialogic. Each is a hand-curated lexicon match, normalized as pct (% of word tokens) and per_sent. We also count 1st/2nd-person engagement (pronouns_1p, pronouns_2p) since they covary with stance.

The positive_evaluative quality / emphasis split. The union positive_evaluative lexicon conflates two phenomena, so each file also carries the split:

  • positive_evaluative_quality — genuinely affirmative tone: good, recommended, optimal, safe.
  • positive_evaluative_emphasis — emphasis-of-rule words: important, critical, essential, key.

The union is preserved for back-compat in existing charts. When the question is how much praise is here, cite the quality-only ratio against negative_evaluative. When the question is how loud is the rule emphasis, cite the emphasis count alongside the imperative-marker density.
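The "how much praise" reading might be consumed downstream like this. The field names mirror the per-file columns above; the input dict and the helper name are hypothetical, not part of prompt_pipeline:

```python
def praise_ratio(stance):
    """Quality-only positives vs. negatives — 'how much praise is here'.

    Uses positive_evaluative_quality_count (not the union count, which
    would inflate praise with emphasis-of-rule words like 'important').
    Returns None when neither class fires.
    """
    pos = stance["positive_evaluative_quality_count"]
    neg = stance["negative_evaluative_count"]
    return pos / (pos + neg) if (pos + neg) else None

rec = {"positive_evaluative_quality_count": 3, "negative_evaluative_count": 1}
print(praise_ratio(rec))  # → 0.75
```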

Code
from prompt_pipeline import stance_for_doc

stance_per_file = [stance_for_doc(d, n, s)
                   for d, n, s in zip(docs, df["n_tokens"], df["n_sents"])]
df_stance = pd.DataFrame(stance_per_file)
print("per-file stance (head):")
print(df_stance.head().to_string())
print()
print("dominant stance by category:")
print(pd.crosstab(df["category"], df_stance["dominant_stance"]))
print()
print("category mean stance (% of words and per-sentence rate):")
num_cols = [c for c in df_stance.columns
            if c != "dominant_stance" and not c.endswith("_count")]
print(pd.concat([df[["category"]], df_stance[num_cols]], axis=1)
        .groupby("category").mean(numeric_only=True).round(3).to_string())
per-file stance (head):
   directive_count  directive_pct  directive_per_sent  expository_count  expository_pct  expository_per_sent  positive_evaluative_count  positive_evaluative_pct  positive_evaluative_per_sent  negative_evaluative_count  negative_evaluative_pct  negative_evaluative_per_sent  dialogic_count  dialogic_pct  dialogic_per_sent  pronouns_1p_count  pronouns_1p_pct  pronouns_1p_per_sent  pronouns_2p_count  pronouns_2p_pct  pronouns_2p_per_sent dominant_stance  positive_evaluative_quality_count  positive_evaluative_quality_pct  positive_evaluative_quality_per_sent  positive_evaluative_emphasis_count  positive_evaluative_emphasis_pct  positive_evaluative_emphasis_per_sent
0               18         1.9231              0.8182                15          1.6026               0.6818                          6                   0.6410                        0.2727                          0                   0.0000                        0.0000               3        0.3205             0.1364                  2           0.2137                0.0909                 16           1.7094                0.7273       directive                                  3                           0.3205                                0.1364                                   3                            0.3205                                 0.1364
1                3         1.2397              0.2308                 6          2.4793               0.4615                          1                   0.4132                        0.0769                          0                   0.0000                        0.0000               0        0.0000             0.0000                  0           0.0000                0.0000                  2           0.8264                0.1538      expository                                  1                           0.4132                                0.0769                                   0                            0.0000                                 0.0000
2                8         0.2264              0.0696                59          1.6700               0.5130                          5                   0.1415                        0.0435                          5                   0.1415                        0.0435               7        0.1981             0.0609                 46           1.3020                0.4000                 18           0.5095                0.1565      expository                                  3                           0.0849                                0.0261                                   2                            0.0566                                 0.0174
3                2         0.5333              0.0769                10          2.6667               0.3846                          0                   0.0000                        0.0000                          1                   0.2667                        0.0385               0        0.0000             0.0000                  0           0.0000                0.0000                  9           2.4000                0.3462      expository                                  0                           0.0000                                0.0000                                   0                            0.0000                                 0.0000
4                1         0.5556              0.2000                 1          0.5556               0.2000                          0                   0.0000                        0.0000                          0                   0.0000                        0.0000               0        0.0000             0.0000                  0           0.0000                0.0000                  0           0.0000                0.0000       directive                                  0                           0.0000                                0.0000                                   0                            0.0000                                 0.0000

dominant stance by category:
dominant_stance   dialogic  directive  expository  negative_evaluative  none  \
category                                                                       
Agent prompt             0          9          24                    1     2   
Data / template          0          3          34                    1     0   
Skill                    0          3          27                    0     0   
System prompt            2         29          22                    0     2   
System reminder          0         13           9                    0    18   
Tool description         1         32          24                    1    17   
Tool parameter           0          0           0                    0     0   

dominant_stance   positive_evaluative  
category                               
Agent prompt                        1  
Data / template                     1  
Skill                               0  
System prompt                       9  
System reminder                     0  
Tool description                    4  
Tool parameter                      1  

category mean stance (% of words and per-sentence rate):
                  directive_pct  directive_per_sent  expository_pct  expository_per_sent  positive_evaluative_pct  positive_evaluative_per_sent  negative_evaluative_pct  negative_evaluative_per_sent  dialogic_pct  dialogic_per_sent  pronouns_1p_pct  pronouns_1p_per_sent  pronouns_2p_pct  pronouns_2p_per_sent  positive_evaluative_quality_pct  positive_evaluative_quality_per_sent  positive_evaluative_emphasis_pct  positive_evaluative_emphasis_per_sent
category                                                                                                                                                                                                                                                                                                                                                                                                                                                             
Agent prompt              0.982               0.273           1.580                0.408                    0.381                         0.093                    0.127                         0.040         0.109              0.028            0.089                 0.027            1.362                 0.343                            0.196                                 0.050                             0.186                                  0.043
Data / template           0.385               0.078           1.414                0.308                    0.357                         0.085                    0.115                         0.022         0.171              0.034            0.054                 0.020            0.637                 0.136                            0.211                                 0.048                             0.146                                  0.037
Skill                     0.565               0.142           1.513                0.394                    0.248                         0.060                    0.086                         0.018         0.195              0.045            0.081                 0.021            0.955                 0.220                            0.187                                 0.044                             0.061                                  0.016
System prompt             1.574               0.350           1.686                0.382                    0.650                         0.175                    0.135                         0.045         0.277              0.076            0.125                 0.037            2.099                 0.465                            0.425                                 0.114                             0.225                                  0.061
System reminder           1.335               0.286           1.605                0.340                    0.166                         0.046                    0.016                         0.008         0.470              0.092            0.064                 0.025            1.617                 0.353                            0.056                                 0.014                             0.110                                  0.032
Tool description          2.160               0.427           1.263                0.321                    0.499                         0.143                    0.053                         0.015         0.282              0.066            0.087                 0.021            1.648                 0.372                            0.372                                 0.106                             0.127                                  0.037
Tool parameter            0.000               0.000           0.000                0.000                    2.235                         0.286                    0.000                         0.000         0.000              0.000            0.000                 0.000            0.000                 0.000                            1.676                                 0.214                             0.559                                  0.071

3b. Sentence-level pragmatic register

Per-sentence multi-label classifier with six classes: collaborative / permissive / appreciative / imperative / directive / configuring. Implemented via M_SENT_REGISTER PhraseMatchers, M_STANCE["directive"], and a spaCy DependencyMatcher for parse-tree cues; the imperative flag is driven by classify_sent_mood. Near-zero classes are deliberately preserved — absence is the welfare-relevant signal for this corpus.

Multi-label semantics. A single sentence can carry several flags simultaneously. The hypothetical sentence "Please, we should consider running the migration." is permissive (please), collaborative (we should), imperative (the bare verb consider), and directive (the modal should). Because one sentence can add to several class counts at once, the per-class sent_pct values across the six classes can sum to more than 100% within a category; this is intentional, not a bug.

Code
from prompt_pipeline import sentence_register_for_doc

sentence_register_per_file = [
    sentence_register_for_doc(d, s)
    for d, s in zip(docs, df["n_sents"])
]
df_sent_register = pd.DataFrame(sentence_register_per_file)

print("per-file sentence_register (head):")
print(df_sent_register.head().to_string())
print()
print("dominant by category:")
print(pd.crosstab(df["category"], df_sent_register["dominant"]))
print()
print("category mean sentence_register (% of sentences):")
pct_cols = [c for c in df_sent_register.columns if c.endswith("_sent_pct")]
print(pd.concat([df[["category"]], df_sent_register[pct_cols]], axis=1)
        .groupby("category").mean(numeric_only=True).round(2).to_string())
per-file sentence_register (head):
   collaborative_sent_count  collaborative_sent_pct  permissive_sent_count  permissive_sent_pct  appreciative_sent_count  appreciative_sent_pct  imperative_sent_count  imperative_sent_pct  directive_sent_count  directive_sent_pct  configuring_sent_count  configuring_sent_pct  none_sent_count  none_sent_pct  appreciative_addressee_claude_count  appreciative_addressee_user_count  appreciative_addressee_unknown_count  collaborative_addressee_claude_count  collaborative_addressee_user_count  collaborative_addressee_unknown_count    dominant
0                         0                     0.0                      3              13.6364                        0                    0.0                      7              31.8182                     9             40.9091                       5               22.7273                7        31.8182                                    0                                  0                                     0                                     0                                   0                                      0   directive
1                         0                     0.0                      0               0.0000                        0                    0.0                      1               7.6923                     2             15.3846                       1                7.6923               10        76.9231                                    0                                  0                                     0                                     0                                   0                                      0   directive
2                         0                     0.0                      3               2.6087                        0                    0.0                     39              33.9130                     7              6.0870                       1                0.8696               71        61.7391                                    0                                  0                                     0                                     0                                   0                                      0  imperative
3                         0                     0.0                      0               0.0000                        0                    0.0                     10              38.4615                     2              7.6923                       0                0.0000               15        57.6923                                    0                                  0                                     0                                     0                                   0                                      0  imperative
4                         0                     0.0                      0               0.0000                        0                    0.0                      1              20.0000                     1             20.0000                       0                0.0000                4        80.0000                                    0                                  0                                     0                                     0                                   0                                      0  imperative

dominant by category:
dominant              collaborative  configuring  directive  imperative  \
category                                                                  
Agent prompt       0              0            0          7          30   
Data / template    1              1            1          0          36   
Skill              0              0            0          1          29   
System prompt      2              0            1          7          53   
System reminder   16              0            0          4          17   
Tool description  16              1            3          8          50   
Tool parameter     0              0            0          0           1   

dominant          permissive  
category                      
Agent prompt               0  
Data / template            0  
Skill                      0  
System prompt              1  
System reminder            3  
Tool description           1  
Tool parameter             0  

category mean sentence_register (% of sentences):
                  collaborative_sent_pct  permissive_sent_pct  appreciative_sent_pct  imperative_sent_pct  directive_sent_pct  configuring_sent_pct  none_sent_pct
category                                                                                                                                                          
Agent prompt                        0.21                 2.59                   0.27                31.61               18.42                  6.20          54.53
Data / template                     1.32                 1.20                   0.00                25.65                6.88                  5.90          66.40
Skill                               0.47                 2.05                   0.00                36.21               12.72                  2.64          55.31
System prompt                       0.85                 4.07                   0.49                41.84               25.86                  6.46          41.25
System reminder                     0.13                 7.07                   0.00                34.59               16.76                  1.04          56.48
Tool description                    1.32                 2.92                   0.09                41.94               30.49                  6.96          41.54
Tool parameter                      0.00                 0.00                   0.00                78.57                0.00                 50.00          14.29

Write partial_register.json

One JSON document, keyed by file index (string), with the four metric blocks per file. Stage 04 reloads this and feeds it directly to build_file_record.
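One consequence of JSON serialization worth noting before stage 04 consumes this file: object keys are coerced to strings, which is why the partial is keyed by str(i) and why a reader must index with string keys. A self-contained round-trip sketch of the shape (empty metric blocks stand in for the real trees):

```python
import json

# One record per file index (stringified), four metric blocks each —
# the same shape the writer below produces.
partial = {
    "0": {"mood": {}, "register": {}, "stance": {}, "sentence_register": {}},
}

reloaded = json.loads(json.dumps(partial))
assert list(reloaded) == ["0"]  # keys survive as strings, not ints
assert set(reloaded["0"]) == {"mood", "register", "stance", "sentence_register"}
print("shape ok")
```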

Code
partial = {
    str(i): {
        "mood":              mood_per_file[i],
        "register":          register_per_file[i],
        "stance":            stance_per_file[i],
        "sentence_register": sentence_register_per_file[i],
    }
    for i in range(len(df))
}
with open(PARTIAL_OUT, "w") as f:
    json.dump(partial, f)
size = PARTIAL_OUT.stat().st_size
print(f"wrote {PARTIAL_OUT.relative_to(PROJECT_ROOT)}  ({size:,} bytes, {size/1024:.1f} KiB)")
print(f"      {len(partial)} per-file records, 4 blocks each")
wrote _pipeline_cache/partial_register.json  (598,693 bytes, 584.7 KiB)
      290 per-file records, 4 blocks each