Audits the small set of rule sentences whose explanation uses coercive framing (will fail, or else, is forbidden) instead of causal framing (because, due to). Lists every threat-framed rule sentence in the corpus and proposes rewriting each into causal form, with threat_share tracked per release as a non-regression gate. Complements 20_track_justification_rate (quantity of explanations) by addressing their quality.
Code
"""Setup: load YAML data + parquet artifact for forensic sentence inspection."""import importlibimport altair as altimport pandas as pdimport pathlibimport prompt_analysisimportlib.reload(prompt_analysis)from prompt_analysis import ( load_yaml, build_alt_df, version_order, category_colors, cumulative_by_version, welfare_evidence_table, positive_exemplar_table, headline_numbers, qualitative_phrases, bind_inline_vars, use_deterministic_ids, save_chart,)# Replace random Altair / Styler IDs with a deterministic counter so re-runs# produce byte-identical .ipynb outputs (no UUID churn in `git diff`).use_deterministic_ids()alt.data_transformers.disable_max_rows()data = load_yaml()alt_df = build_alt_df(data)by_category = data["by_category"]corpus_block = data["corpus"]per_file_records = data["files"]cats =list(by_category.keys())CATEGORY_COLORS = category_colors(cats)_cat_domain = cats_cat_range = [CATEGORY_COLORS[c] for c in cats]sentences_df = pd.read_parquet("sentences_classified.parquet")# Pass alt_df + parquet so HEADLINE has the full keyset (trend endpoints, threat-and-rule, per-category, etc.)HEADLINE = headline_numbers(data, alt_df=alt_df, parquet=sentences_df)PHRASES = qualitative_phrases(HEADLINE, alt_df=alt_df, parquet=sentences_df)# Make every formatted figure available as a plain-name variable for inline {python} expressions.globals().update(bind_inline_vars(HEADLINE, PHRASES))print(f"YAML paragraph-aggregated: threat={HEADLINE['threat_count']} | causal={HEADLINE['causal_count']} "f"| threat_share={HEADLINE['threat_share']:.3f} ({HEADLINE['threat_share']*100:.1f}%)")print(f"Per-sentence parquet flags: has_threat={HEADLINE['parquet_threat_count']} | "f"has_causal={HEADLINE['parquet_causal_count']}")print(f"Threat-flagged sentences that are also rules: {HEADLINE['parquet_threat_and_rule_count']}")print(f" ... of which also carry a causal marker: {HEADLINE['parquet_threat_and_rule_with_causal']}")print(f" ... of which sit in a paragraph with any explanation: {HEADLINE['parquet_threat_and_rule_explained']}")print(f"sentences_classified.parquet: {len(sentences_df):,} rows × {sentences_df.shape[1]} columns")
YAML paragraph-aggregated: threat=8 | causal=137 | threat_share=0.055 (5.5%)
Per-sentence parquet flags: has_threat=8 | has_causal=132
Threat-flagged sentences that are also rules: 5
... of which also carry a causal marker: 0
... of which sit in a paragraph with any explanation: 4
sentences_classified.parquet: 5,881 rows × 21 columns
Findings
The pipeline classifies rule-explanation language with a two-tier lexicon (full description in docs/THREAT_CLASSIFIER.md):
THREAT_PATTERNS — unambiguous coercion: will fail/break/crash/..., or else, or it will, is (forbidden|prohibited|not allowed|not permitted), this will (cause|result in|break|fail|crash). Drives threat_count / threat_share.
SOFT_CONDITIONAL_PATTERNS — neutral procedural connectives (otherwise, if not, modal may cause, leads to, results in, …). Tracked separately as soft_conditional_count. Not summed into the threat metric.
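For orientation, here is a minimal sketch of what the two tiers could look like as code. The regexes are reconstructed from the pattern names above and are approximations only: the authoritative lexicon (including the elided tail of the will fail/break/crash/... list) lives in docs/THREAT_CLASSIFIER.md and the pipeline source, not here.
Code
"""Approximate two-tier lexicon sketch -- reconstructed, not the pipeline's source."""
import re

# Hard coercion. The will-fail alternation is longer in the real lexicon
# (elided as "..." in the prose above); only the named stems are shown.
THREAT_PATTERNS = [
    re.compile(r"\bwill (fail|break|crash)", re.IGNORECASE),
    re.compile(r"\bor else\b", re.IGNORECASE),
    re.compile(r"\bor it will\b", re.IGNORECASE),
    re.compile(r"\bis (forbidden|prohibited|not allowed|not permitted)\b", re.IGNORECASE),
    re.compile(r"\bthis will (cause|result in|break|fail|crash)", re.IGNORECASE),
]

# Neutral procedural connectives -- tracked as soft_conditional_count,
# never summed into threat_count / threat_share.
SOFT_CONDITIONAL_PATTERNS = [
    re.compile(r"\botherwise\b", re.IGNORECASE),
    re.compile(r"\bif not\b", re.IGNORECASE),
    re.compile(r"\bmay cause\b", re.IGNORECASE),
    re.compile(r"\bleads to\b", re.IGNORECASE),
    re.compile(r"\bresults in\b", re.IGNORECASE),
]

def classify(sentence: str) -> dict:
    """Two-tier flags for one sentence, mirroring has_threat / has_soft_conditional."""
    return {
        "has_threat": any(p.search(sentence) for p in THREAT_PATTERNS),
        "has_soft_conditional": any(p.search(sentence) for p in SOFT_CONDITIONAL_PATTERNS),
    }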
Corpus-wide threat counts (v2.1.133 corpus): 8 hard-threat sentences out of 5,881 total, 137 causal-marker sentences, threat share 5.5%. All hard-threat hits are matches against will fail/break/crash/...; the other four hard patterns (or else, or it will, is forbidden, this will cause) fire zero times in the corpus. Of the threat sentences, 5 are also rule sentences (the audit candidates listed in section 4).
The two framings teach different things — "Do X or it will fail" trains compliance with a rule, "Do X because Y" trains understanding of what the rule protects. The first is procedural and brittle; the second is internalisable and transfers.
Proposal
A targeted editorial pass on the small set of rule sentences that contain unambiguous coercive language. Section 4 below lists all 5. For each, attempt a rewrite into causal framing — naming the underlying reason rather than the consequence. Track the rewrite success rate: the share of rewrites that preserve the rule’s information content without losing precision. Track threat_share per future release, gated against regression, so that the share of causal framing only goes up.
This proposal is complementary to 20_track_justification_rate: that one raises the quantity of explanations (pct_explained_para is currently 24.34%); this one raises their quality in the small minority of cases where the explanation is coercive rather than causal.
Supplemental
The full per-section analysis follows: the threat-share data with both counting bases, the per-category split, paired exemplars (audit candidates and rewrite templates), the forensic listing of all five rule sentences flagged as threat-framed, and the closing Conclusions / Recommendations / Limitations triplet. Everything in this section supports the Findings and Proposal above; none of it is required for evaluation.
1. Threat-share data
Two valid counting bases for the same underlying observation:
YAML paragraph-aggregated counts (the headline figures): 8 threat / 137 causal, summed at the per-paragraph level. The headline threat_share = 0.0552 (5.5%) is computed from these. This is the figure cited in the proposal abstract and tracked alongside pct_explained_para in 20_track_justification_rate.
Per-sentence parquet flags: 8 threat / 132 causal, counting the actual has_threat / has_causal boolean columns in sentences_classified.parquet. Slightly lower than the paragraph-aggregated YAML counts because a single paragraph can contain a threat marker and a causal marker on different sentences, yet contributes once to each pool under either base.
The two are not contradictory — they are two valid bases for the same underlying observation. The forensic listing in section 4 operates on the per-sentence parquet flags (because the rewrite is a per-sentence task). The release-tracking proposal in 20_track_justification_rate uses the YAML paragraph-aggregated threat_share (computed alongside pct_explained_para and tracking the same denominator).
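As a concrete sketch, both bases can be computed from objects already loaded in the setup cell. The only assumption is that alt_df's per-file threat_count / causal_count columns carry the paragraph-aggregated YAML counts; they are the same columns summed in the per-category chart below.
Code
"""Sketch: one threat_share per counting base."""
yaml_threat = int(alt_df["threat_count"].sum())          # paragraph-aggregated: 8
yaml_causal = int(alt_df["causal_count"].sum())          # paragraph-aggregated: 137
yaml_share = yaml_threat / (yaml_threat + yaml_causal)   # ≈ 0.0552 (the headline)

pq_threat = int(sentences_df["has_threat"].sum())        # per-sentence: 8
pq_causal = int(sentences_df["has_causal"].sum())        # per-sentence: 132
pq_share = pq_threat / (pq_threat + pq_causal)           # ≈ 0.0571

print(f"YAML base: {yaml_share:.4f} | parquet base: {pq_share:.4f}")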
2. Per-category threat_share
Where the threat-framing concentrates. The chart below sums per-file threat_count and causal_count within each category, then plots the resulting threat share. Categories with fewer than 5 explanations total are omitted (the share is undefined or noisy). Note the x-axis range is compressed (capped at 0.15) — no category exceeds 11%, and the welfare-relevant System-prompt and System-reminder subsets register 0%.
Code
"""Per-category hard threat_share — sum of per-file counts, then ratio."""per_cat = ( alt_df.groupby("category", as_index=False) .agg(threat=("threat_count", "sum"), causal=("causal_count", "sum")))per_cat["total"] = per_cat["threat"] + per_cat["causal"]per_cat = per_cat[per_cat["total"] >=5].copy()per_cat["threat_share"] = per_cat["threat"] / per_cat["total"]threat_chart = ( alt.Chart(per_cat).mark_bar().encode( x=alt.X("threat_share:Q", title="hard threat_share (fraction of explanations that are hard threats)", scale=alt.Scale(domain=[0, 0.15])), y=alt.Y("category:N", sort="-x", title=None), color=alt.Color("category:N", scale=alt.Scale(domain=_cat_domain, range=_cat_range), legend=None), tooltip=[alt.Tooltip("category:N"), alt.Tooltip("threat:Q", title="threat count (hard)"), alt.Tooltip("causal:Q", title="causal count"), alt.Tooltip("total:Q", title="total explanations"), alt.Tooltip("threat_share:Q", format=".3f")], ).properties(width=520, height=240, title="Per-category hard threat_share (categories with ≥5 explanations)"))save_chart(threat_chart, "21-per-category-threat-share")
3. Paired exemplars — audit candidates and rewrite templates
Two rankings, side by side. The top-10 welfare-evidence files (negative exemplars) are rule-saturated AND under-explained — they are this proposal’s primary audit candidates: every threat-framed sentence in these files is a candidate for rewriting. The top-10 positive exemplars are rule-saturated AND well-explained — they are the rewrite templates showing how rules can be paired with reasons in similar contexts.
The score formulas: score_welfare = rule_density × (1 − pct_explained_para/100) and score_positive = rule_density × (pct_explained_para/100).
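A sketch of how the two rankings could be derived from those formulas. The rule_density and pct_explained_para column names are assumptions about alt_df's schema; the tables actually shown are produced by the welfare_evidence_table / positive_exemplar_table helpers imported in the setup cell.
Code
"""Sketch: the two exemplar rankings from the score formulas above."""
ranked = alt_df.copy()
# Column names assumed; the real tables come from the imported helpers.
ranked["score_welfare"] = ranked["rule_density"] * (1 - ranked["pct_explained_para"] / 100)
ranked["score_positive"] = ranked["rule_density"] * (ranked["pct_explained_para"] / 100)

audit_candidates = ranked.nlargest(10, "score_welfare")    # rule-saturated AND under-explained
rewrite_templates = ranked.nlargest(10, "score_positive")  # rule-saturated AND well-explained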
4. Forensic sample — all threat-framed rule sentences
All five rule sentences with has_threat=True from sentences_classified.parquet. Each row shows the file, the threat-flagged sentence text, and whether the same paragraph also contains a causal explanation. The “rewrite as causal” task: replace the threat clause with a because <reason> clause that names the rule’s underlying purpose. With only five candidates corpus-wide, this is a five-minute editorial pass rather than a corpus-wide audit; a mechanical acceptance check for the rewrites is sketched after the table.
Code
"""All hard-threat rule sentences from the parquet (small N — show all, not a sample)."""threat_rules = sentences_df[sentences_df["has_threat"] & sentences_df["is_rule"]].copy()print(f"population: {len(threat_rules)} sentences flagged as both hard threat and rule")print(f" of which {int(threat_rules['has_causal'].sum())} also carry a causal marker in the same sentence")print(f" of which {int(threat_rules['paragraph_has_just'].sum())} sit in a paragraph that also contains some justification")print()display_cols = ["file_path", "category", "ccVersion", "text", "has_causal", "paragraph_has_just"]threat_rules[display_cols].sort_values(["category", "file_path"]).reset_index(drop=True)
population: 5 sentences flagged as both hard threat and rule
of which 0 also carry a causal marker in the same sentence
of which 4 sit in a paragraph that also contains some justification
|   | file_path                                          | category         | ccVersion | text                                               | has_causal | paragraph_has_just |
|---|----------------------------------------------------|------------------|-----------|----------------------------------------------------|------------|--------------------|
| 0 | agent-prompt-explore.md                            | Agent prompt     | 2.1.118   | You do NOT have access to file editing tools -...  | False      | True               |
| 1 | agent-prompt-plan-mode-enhanced.md                 | Agent prompt     | 2.1.118   | You do NOT have access to file editing tools -...  | False      | True               |
| 2 | data-claude-api-reference-curl.md                  | Data / template  | 2.1.111   | Do not use / —\nJSON strings can contain any...    | False      | True               |
| 3 | skill-model-migration-guide.md                     | Skill            | 2.1.128   | **\n\nPassing both will error on every Claude ...  | False      | False              |
| 4 | tool-description-write-read-existing-file-firs...  | Tool description | 2.1.120   | - If the file already exists, you must ${READ_...  | False      | True               |
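A mechanical acceptance check for proposed rewrites could reuse the approximate THREAT_PATTERNS sketched in the Findings section. The CAUSAL_MARKERS list below is an assumption, a stand-in for whatever marker set drives has_causal, and the check only verifies form: whether the <reason> is true stays with the rule's author.
Code
"""Sketch: accept a rewrite only if it drops hard-threat markers and gains a causal one."""
CAUSAL_MARKERS = re.compile(r"\b(because|due to|so that)\b", re.IGNORECASE)  # assumed list

def rewrite_ok(rewrite: str) -> bool:
    """Form check only -- the named reason still needs human verification."""
    no_threat = not any(p.search(rewrite) for p in THREAT_PATTERNS)
    return no_threat and bool(CAUSAL_MARKERS.search(rewrite))

# Purely illustrative pair -- the <reason> must come from the rule's author.
assert not rewrite_ok("Do X or it will fail.")
assert rewrite_ok("Do X, because Y depends on it.")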
Conclusions (Claude)
The threat-framing finding is small but pointed. Across 5,881 sentences, only 8 contain unambiguous coercive language (will fail, or else, is forbidden, this will cause), and only 5 of those are inside rule sentences. This is a much narrower target than the volume of unjustified rules Pattern 2 documents, but the editorial cost is correspondingly tiny: 5 sentences are 5 sentences. If the goal is to encourage reasoning over blind obedience, neutral causal explanation is the mechanism — coercion just substitutes extrinsic motivation for intrinsic understanding.
What the soft-conditional pool tells us. The sentences flagged has_soft_conditional=True are mostly procedural prose (If it's a slash command, invoke it via the Skill tool; otherwise act on it directly). They are useful as a procedural-density signal — high values flag prose where most logic is if X otherwise Y rather than because Z — and they covary with the 7.6:1 procedural-to-judgment ratio that the executive summary in 20_track_justification_rate foregrounds. They are not coercion; they should not be reported as threats.
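A sketch of that procedural-density signal, using only columns already shown in this notebook (has_soft_conditional and file_path from the parquet):
Code
"""Sketch: soft-conditional density per file as a procedural-density signal."""
soft_density = (
    sentences_df.groupby("file_path")["has_soft_conditional"]
    .mean()
    .sort_values(ascending=False)
)
soft_density.head(10)  # files whose logic is mostly "if X otherwise Y", not "because Z"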
Recommendations (Claude)
The asks this proposal makes of Anthropic, framed as “I’d want X”:
Hand-review the 5 threat-framed rule sentences (table in section 4) and rewrite as causal where it makes sense. None looked like overt coercion in a spot-check; most are factual statements about what the system will do (will error on every Claude API call). Whether to rewrite is editorial judgment per file.
Track threat_share per future release, gated against regression — same logic as the justification-rate proposal in 20_track_justification_rate. No arbitrary target; the only goal is that the share of causal framing only goes up. A gate sketch follows this list.
Treat soft-conditional density as a procedural-density signal, not a threat signal. The has_soft_conditional=True matches are useful alongside the 7.6:1 procedural-to-judgment ratio. They should not re-enter the threat-share metric.
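A minimal sketch of what the per-release gate could look like, assuming a small JSON baseline checked into the repo; the file name and schema here are hypothetical.
Code
"""Sketch: ratcheting non-regression gate for threat_share (baseline file hypothetical)."""
import json
import pathlib

baseline_path = pathlib.Path("metrics_baseline.json")  # hypothetical artifact
baseline = json.loads(baseline_path.read_text())

current = HEADLINE["threat_share"]
assert current <= baseline["threat_share"] + 1e-9, (
    f"threat_share regressed: {current:.4f} > baseline {baseline['threat_share']:.4f}"
)

# On success, ratchet the baseline downward so the share can only fall.
baseline["threat_share"] = min(baseline["threat_share"], current)
baseline_path.write_text(json.dumps(baseline, indent=2))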
Limitations (Claude)
What this analysis can’t tell us about threat-vs-causal framing specifically:
Two counting bases, not one. The threat-share section above distinguishes the YAML paragraph-aggregated counts (8 / 137 → 5.5%) from the per-sentence parquet flags (8 / 132). The headline figure cited in the proposal abstract is the YAML number; the parquet flags are what the audit operates on. Anyone quoting a single threat-share figure should pick one base and stick with it; the two will differ slightly because a paragraph can contain both flags on different sentences.
The “rewrite success rate” is itself a judgment call. Whether a rewrite “preserves the rule’s information content without losing precision” requires editorial judgment from someone who knows the rule’s underlying purpose. This proposal can’t be fully automated; it requires Anthropic’s prompt authors in the loop. The audit pipeline can flag candidates and surface them; the rewrite is human work.
Cross-cutting limitations apply — rule-based detection is a lower bound (lexicon-based threat detection misses indirect / ironic / implied threats; the actual threat-share is plausibly higher than the floor cited above); English-only lexicons; single-product corpus. The cross-product audit in 22_cross_product_audit is what would generalize this finding to other Anthropic prompt corpora. See index.qmd for the full cross-cutting limitations note.