Headline numbers + corpus audit

Stage 5 of the 6-stage producer chain — the final stage. Loads the YAML and parquet written by stage 04 and prints two pieces:

  1. Canonical HEADLINE sheet: prompt_analysis.headline_numbers(data, alt_df=…, parquet=…). Same call signature the consumer notebooks use.
  2. Audit table — the human-readable summary of corpus statistics (Files / Sentences / Word tokens / ccVersions / Rule sentences / etc.). Each row references the canonical YAML key its value comes from.

This is the only producer stage that appears in the published Quarto site (it’s the most reader-friendly view); stages 00–04 run in the kernel but are hidden from the navbar.

12. Canonical HEADLINE sheet

Re-uses prompt_analysis.headline_numbers() — the same function consumer notebooks call. Pass alt_df (for composite-directiveness range and per-version mood_marker_pct extremes) and the per-sentence parquet (for parquet-level threat / causal / rule counts) so the producer’s audit covers the full HEADLINE contract.

"""Compute and display the canonical HEADLINE dict.

Prints as YAML so the values are visible in this notebook (saving a copy in the
cell output) without requiring another tool.
"""
import os, sys, pathlib, importlib
sys.path.insert(0, ".")
import pandas as pd
import yaml as _yaml

_here = pathlib.Path.cwd().resolve()
PROJECT_ROOT = next(
    (p for p in [_here, *_here.parents] if (p / "prompt_pipeline.py").is_file()),
    None,
)
if PROJECT_ROOT is None:
    raise RuntimeError(
        f"Could not find prompt_pipeline.py walking up from {_here}. "
        "Run from inside the claude-prompts-analysis repo."
    )
if pathlib.Path.cwd() != PROJECT_ROOT:
    os.chdir(PROJECT_ROOT)

import prompt_analysis
importlib.reload(prompt_analysis)  # pick up edits without restarting the kernel
from prompt_analysis import (
    load_yaml, build_alt_df, headline_numbers, qualitative_phrases, bind_inline_vars,
    use_deterministic_ids,
)

# Replace random Altair / Styler IDs with a deterministic counter so re-runs
# produce byte-identical .ipynb outputs (no UUID churn in `git diff`).
use_deterministic_ids()

data    = load_yaml()
alt_df  = build_alt_df(data)
parquet = pd.read_parquet("sentences_classified.parquet")

HEADLINE = headline_numbers(data, alt_df=alt_df, parquet=parquet)
PHRASES  = qualitative_phrases(HEADLINE, alt_df=alt_df, parquet=parquet)

# Make every formatted figure available as a plain-name variable for inline {python} expressions in the audit-table cell below.
globals().update(bind_inline_vars(HEADLINE, PHRASES))

print(_yaml.safe_dump(HEADLINE, sort_keys=False, default_flow_style=False))
n_files: 290
n_sentences: 5881
n_word_tokens: 133611
n_versions: 58
n_rule_sentences: 2288
pct_explained_same: 6.6871
pct_explained_para: 24.3444
judgment_count: 78
procedural_count: 595
judgment_to_procedural_ratio: 0.131
threat_count: 8
causal_count: 137
threat_share: 0.0552
question_count: 87
apology_count: 3
selfref_claude: 521
selfref_assistant: 20
selfref_model: 266
pct_anthropomorphic: 0.6456
positive_evaluative_quality: 298
positive_evaluative_emphasis: 185
positive_evaluative_union: 483
negative_evaluative: 152
ratio_quality_to_negative: 1.9605263157894737
ratio_union_to_negative: 3.1776315789473686
appreciative_sent: 4
collaborative_sent: 30
streak_max: 12
n_streaks_ge3: 230
n_streaks_ge5: 52
vocab_hard_prohibitions: 631
vocab_hard_prescriptions: 358
vocab_pronouns_2p: 1397
vocab_pronouns_1p: 185
vocab_profanity: 0
modality_deontic: 261
modality_epistemic: 322
modality_dynamic: 559
mood_marker_pct: 0.7679
top_caps_imperative:
- - IMPORTANT
  - 36
- - NEVER
  - 26
- - MUST
  - 18
- - CRITICAL
  - 15
rules_section_in_paragraphs: 27
rules_section_out_paragraphs: 1286
rules_section_in_paragraphs_explained: 5
rules_section_out_paragraphs_explained: 213
rules_section_in_pct_explained: 18.5185
rules_section_out_pct_explained: 16.563
composite_directiveness_min: -19.54089614772803
composite_directiveness_max: 19.208174461486646
mood_marker_pct_first_version: 1.596813473053892
mood_marker_pct_latest_version: 1.1647437603993345
mood_marker_pct_first_version_id: 2.0.14
mood_marker_pct_latest_version_id: 2.1.133
parquet_threat_count: 8
parquet_causal_count: 132
parquet_threat_and_rule_count: 5
parquet_threat_and_rule_with_causal: 0
parquet_threat_and_rule_explained: 4
threat_count_unambiguous: 5
imperative_sent_pct: 30.9811
imperative_share: 0.309811
pct_paragraphs_with_rules_unexplained: 83.3968
selfref_claude_share: 0.6456
n_paragraphs_with_rules: 1313
n_paragraphs_with_rules_explained: 218
n_paragraphs_with_rules_unexplained_count: 1095
pct_paragraphs_with_rules_explained: 16.6032
rule_sentences_per_explained_paragraph: 2.555045871559633
rule_sentences_per_unexplained_paragraph: 1.5808219178082192
rule_exp_pct_agent_prompt: 37.8897
n_rule_sentences_agent_prompt: 417
rule_exp_pct_data_template: 10.6855
n_rule_sentences_data_template: 496
rule_exp_pct_skill: 19.6137
n_rule_sentences_skill: 673
rule_exp_pct_system_prompt: 31.9876
n_rule_sentences_system_prompt: 322
rule_exp_pct_system_reminder: 30.9278
n_rule_sentences_system_reminder: 97
rule_exp_pct_tool_description: 29.7794
n_rule_sentences_tool_description: 272
rule_exp_pct_tool_parameter: 0.0
n_rule_sentences_tool_parameter: 11
judgment_to_procedural_ratio_first_version: 0.42105263157894735
judgment_to_procedural_ratio_first_version_id: 2.1.18
judgment_to_procedural_ratio_latest_version: 0.13109243697478992
judgment_to_procedural_ratio_latest_version_id: 2.1.133
n_uptick_transitions: 10
n_total_transitions: 49
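The derived fields above can be spot-checked against their raw inputs. A minimal sketch, assuming (as the printed values suggest) that threat_share is threat_count / (threat_count + causal_count) and that the paragraph percentage is explained / total; the HEADLINE stand-in below carries only the keys being checked, with values copied from the printed YAML:

```python
import math

# Stand-in for the full HEADLINE dict; values copied from the printed YAML.
HEADLINE = {
    "threat_count": 8, "causal_count": 137, "threat_share": 0.0552,
    "judgment_count": 78, "procedural_count": 595,
    "judgment_to_procedural_ratio": 0.131,
    "n_paragraphs_with_rules": 1313,
    "n_paragraphs_with_rules_explained": 218,
    "pct_paragraphs_with_rules_explained": 16.6032,
    "imperative_sent_pct": 30.9811, "imperative_share": 0.309811,
}
h = HEADLINE

# Each stored ratio should match its recomputed value up to YAML rounding.
assert math.isclose(h["threat_share"],
                    h["threat_count"] / (h["threat_count"] + h["causal_count"]),
                    abs_tol=5e-4)
assert math.isclose(h["judgment_to_procedural_ratio"],
                    h["judgment_count"] / h["procedural_count"], abs_tol=5e-4)
assert math.isclose(h["pct_paragraphs_with_rules_explained"],
                    100 * h["n_paragraphs_with_rules_explained"]
                    / h["n_paragraphs_with_rules"], abs_tol=5e-4)
assert math.isclose(h["imperative_share"],
                    h["imperative_sent_pct"] / 100, abs_tol=1e-6)
print("HEADLINE derived fields consistent")
```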

Audit table

Live corpus statistics — these are the canonical values for every prose mention across the notebooks; any number that disagrees gets reconciled to these. Every figure below is computed from the YAML at render time.

Quantity Value
Files 290
Sentences 5,881
Word tokens 133,611
ccVersions (distinct) 58 (latest 2.1.133)
Rule sentences 2,288
pct_explained_same 6.69%
pct_explained_para (per rule sentence) 24.34% (the headline rule-explanation rate)
Rule-bearing paragraphs (total / explained / unexplained) 1,313 / 218 / 1,095
pct_paragraphs_with_rules_explained / _unexplained 16.60% / 83.40%
Avg rule sentences per paragraph (explained / unexplained) 2.56 / 1.58 — explains the gap between the per-sentence rate (24.34%) and the per-paragraph rate (16.60%)
judgment_count / procedural_count / ratio 78 / 595 / 0.131
threat_count / causal_count / threat_share 8 / 137 / 0.0552 (5.5% in narrative)
question_count / apology_count 87 / 3
selfref_claude / _assistant / _model 521 / 20 / 266
pct_anthropomorphic 0.6456 (64.6%)
Imperative streaks: streak_max / n_ge3 / n_ge5 12 / 230 / 52
RULES-section paragraphs (in / out, % explained) 27 (18.52%) / 1286 (16.56%)
Modality (deontic / epistemic / dynamic) 261 / 322 / 559
vocab.hard_prohibitions.count 631
vocab.hard_prescriptions.count 358
vocab.pronouns_2p.count 1397
vocab.pronouns_1p.count 185
vocab.profanity.count 0
Stance: positive_evaluative_quality / _emphasis / negative_evaluative 298 / 185 / 152
Quality-only positive-vs-negative ratio 1.96× (298 / 152)
Union positive-vs-negative ratio 3.18× (483 / 152)
appreciative_sent 4
collaborative_sent 30
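A minimal sketch of how two-column rows like the ones above could be generated so each value is pulled from its canonical YAML key; the (label, format-string) `rows` spec is hypothetical illustration, not the notebook's actual rendering code:

```python
# Hypothetical (label, format-string) spec; every value comes straight from a
# HEADLINE-style dict, so the table cannot drift from the canonical YAML.
HEADLINE = {"n_files": 290, "n_sentences": 5881,
            "n_word_tokens": 133611, "pct_explained_para": 24.3444}

rows = [
    ("Files", "{n_files}"),
    ("Sentences", "{n_sentences:,}"),
    ("Word tokens", "{n_word_tokens:,}"),
    ("pct_explained_para (per rule sentence)", "{pct_explained_para:.2f}%"),
]
for label, fmt in rows:
    # Left-align the label, then render the value via str.format on the dict.
    print(f"{label:<42} {fmt.format(**HEADLINE)}")
```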