Audit threat-framed rule explanations and rewrite them as causal

Audits the small set of rule sentences whose explanation uses coercive framing (will fail, or else, is forbidden) instead of causal framing (because, due to). Lists every threat-framed rule sentence in the corpus and proposes rewriting each into causal form, with threat_share tracked per release as a non-regression gate. Complements 20_track_justification_rate (quantity of explanations) by addressing their quality.

Code
"""Setup: load YAML data + parquet artifact for forensic sentence inspection."""
import importlib
import altair as alt
import pandas as pd
import pathlib

import prompt_analysis
importlib.reload(prompt_analysis)
from prompt_analysis import (
    load_yaml, build_alt_df, version_order, category_colors,
    cumulative_by_version, welfare_evidence_table, positive_exemplar_table,
    headline_numbers, qualitative_phrases, bind_inline_vars,
    use_deterministic_ids, save_chart,
)

# Replace random Altair / Styler IDs with a deterministic counter so re-runs
# produce byte-identical .ipynb outputs (no UUID churn in `git diff`).
use_deterministic_ids()

alt.data_transformers.disable_max_rows()

data              = load_yaml()
alt_df            = build_alt_df(data)
by_category       = data["by_category"]
corpus_block      = data["corpus"]
per_file_records  = data["files"]
cats              = list(by_category.keys())

CATEGORY_COLORS = category_colors(cats)
_cat_domain     = cats
_cat_range      = [CATEGORY_COLORS[c] for c in cats]

sentences_df = pd.read_parquet("sentences_classified.parquet")

# Pass alt_df + parquet so HEADLINE has the full keyset (trend endpoints, threat-and-rule, per-category, etc.)
HEADLINE = headline_numbers(data, alt_df=alt_df, parquet=sentences_df)
PHRASES  = qualitative_phrases(HEADLINE, alt_df=alt_df, parquet=sentences_df)

# Make every formatted figure available as a plain-name variable for inline {python} expressions.
globals().update(bind_inline_vars(HEADLINE, PHRASES))

print(f"YAML paragraph-aggregated: threat={HEADLINE['threat_count']} | causal={HEADLINE['causal_count']} "
      f"| threat_share={HEADLINE['threat_share']:.3f} ({HEADLINE['threat_share']*100:.1f}%)")
print(f"Per-sentence parquet flags: has_threat={HEADLINE['parquet_threat_count']} | "
      f"has_causal={HEADLINE['parquet_causal_count']}")
print(f"Threat-flagged sentences that are also rules:    {HEADLINE['parquet_threat_and_rule_count']}")
print(f"  ... of which also carry a causal marker:       {HEADLINE['parquet_threat_and_rule_with_causal']}")
print(f"  ... of which sit in a paragraph with any explanation: {HEADLINE['parquet_threat_and_rule_explained']}")
print(f"sentences_classified.parquet: {len(sentences_df):,} rows × {sentences_df.shape[1]} columns")
YAML paragraph-aggregated: threat=8 | causal=137 | threat_share=0.055 (5.5%)
Per-sentence parquet flags: has_threat=8 | has_causal=132
Threat-flagged sentences that are also rules:    5
  ... of which also carry a causal marker:       0
  ... of which sit in a paragraph with any explanation: 4
sentences_classified.parquet: 5,881 rows × 21 columns

Findings

The pipeline classifies rule-explanation language with a two-tier lexicon (full description in docs/THREAT_CLASSIFIER.md):

  • THREAT_PATTERNS — unambiguous coercion: will fail/break/crash/..., or else, or it will, is (forbidden|prohibited|not allowed|not permitted), this will (cause|result in|break|fail|crash). Drives threat_count / threat_share.
  • SOFT_CONDITIONAL_PATTERNS — neutral procedural connectives (otherwise, if not, modal may cause, leads to, results in, …). Tracked separately as soft_conditional_count. Not summed into the threat metric.
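
As a rough illustration of the first tier, a minimal regex sketch reconstructed from the pattern names above — an approximation for readability, not the authoritative definition, which lives in docs/THREAT_CLASSIFIER.md (note the will-fail alternation elides further verbs):

Code
"""Illustrative hard-threat matcher reconstructed from the pattern names above."""
import re

# Approximation only — the authoritative list is in docs/THREAT_CLASSIFIER.md,
# and the will-(fail|break|crash|...) alternation has more verbs than shown here.
THREAT_PATTERNS = [
    re.compile(r"\bwill (fail|break|crash)", re.IGNORECASE),
    re.compile(r"\bor else\b", re.IGNORECASE),
    re.compile(r"\bor it will\b", re.IGNORECASE),
    re.compile(r"\bis (forbidden|prohibited|not allowed|not permitted)\b", re.IGNORECASE),
    re.compile(r"\bthis will (cause|result in|break|fail|crash)", re.IGNORECASE),
]

def has_hard_threat(sentence: str) -> bool:
    """True if any hard-threat pattern fires on the sentence."""
    return any(p.search(sentence) for p in THREAT_PATTERNS)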

Corpus-wide threat counts (v2.1.133 corpus): 8 hard-threat sentences out of 5,881 total, 137 causal-marker sentences, threat share 5.5%. All hard-threat hits are matches against will fail/break/crash/...; the other four hard patterns (or else, or it will, is forbidden, this will cause) fire zero times in the corpus. Of the threat sentences, 5 are also rule sentences (the audit candidates listed in section 4).

The two framings teach different things — "Do X or it will fail" trains compliance with a rule, "Do X because Y" trains understanding of what the rule protects. The first is procedural and brittle; the second is internalisable and transfers.

Proposal

A targeted editorial pass on the small set of rule sentences that contain unambiguous coercive language. Section 4 below lists all 5. For each, attempt a rewrite into causal framing — naming the underlying reason rather than the consequence. Track the rewrite success rate: the share of candidates where the rewrite preserves the rule’s information content without losing precision. Track threat_share per future release and gate it against regression, so that threat_share can only fall (equivalently, the share of causal framing can only rise).
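
A minimal sketch of what that gate could look like in CI — the metrics-file layout and function names here are assumptions for illustration, not an existing artifact of this pipeline:

Code
"""Illustrative threat_share non-regression gate (hypothetical metrics files)."""
import json

EPS = 1e-9  # tolerate float noise only, never an actual regression

def load_share(path: str) -> float:
    """Read threat_share from a per-release metrics JSON (assumed layout)."""
    with open(path) as f:
        return json.load(f)["threat_share"]

def check_threat_share_gate(prev_path: str, curr_path: str) -> None:
    """Fail the release check if threat_share rose since the previous release."""
    prev, curr = load_share(prev_path), load_share(curr_path)
    assert curr <= prev + EPS, f"threat_share regressed: {prev:.4f} -> {curr:.4f}"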

This proposal is complementary to 20_track_justification_rate: that one raises the quantity of explanations (pct_explained_para is currently 24.34%); this one raises their quality in the small minority of cases where the explanation is coercive rather than causal.

Supplemental

The full per-section analysis follows: the threat-share data with both counting bases, the per-category split, paired exemplars (audit candidates and rewrite templates), the forensic listing of all five rule sentences flagged as threat-framed, and the closing Conclusions / Recommendations / Limitations triplet. Everything in this section supports the Findings and Proposal above; none of it is required for evaluation.

1. Threat-share data

Two valid counting bases for the same underlying observation:

  • YAML paragraph-aggregated counts (the headline figures): 8 threat / 137 causal, summed at the per-paragraph level. The headline threat_share = 0.0552 (5.5%) is computed from these. This is the figure cited in the proposal abstract and tracked alongside pct_explained_para in 20_track_justification_rate.
  • Per-sentence parquet flags: 8 threat / 132 causal, counting the actual has_threat / has_causal boolean columns in sentences_classified.parquet. Slightly lower on the causal side than the paragraph-aggregated YAML counts: the two tallies operate at different granularities, and a single paragraph can carry a threat and a causal marker on different sentences, so the paragraph-level and sentence-level counts need not agree (see the sketch below).
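
A minimal sketch of how the two bases reduce to code, using the frames already loaded in the setup cell (alt_df carries the per-file YAML counters, as in the section 2 cell; the parquet carries the per-sentence booleans, as in the section 4 cell):

Code
"""Both counting bases from the already-loaded frames."""

# Per-sentence base: count the boolean flags in the parquet directly.
pq_threat = int(sentences_df["has_threat"].sum())    # 8
pq_causal = int(sentences_df["has_causal"].sum())    # 132

# Paragraph-aggregated base: sum the per-file YAML counters.
yaml_threat = int(alt_df["threat_count"].sum())      # 8
yaml_causal = int(alt_df["causal_count"].sum())      # 137

# The headline figure is computed on the YAML base.
yaml_share = yaml_threat / (yaml_threat + yaml_causal)  # ≈ 0.0552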

The two are not contradictory. The forensic listing in section 4 operates on the per-sentence parquet flags (the rewrite is a per-sentence task); the release-tracking proposal in 20_track_justification_rate uses the YAML paragraph-aggregated threat_share, computed alongside pct_explained_para over the same denominator.

2. Per-category threat_share

Where the threat-framing concentrates. The chart below sums per-file threat_count and causal_count within each category, then plots the resulting threat share. Categories with fewer than 5 explanations total are omitted (the share is undefined or noisy). Note the x-axis range is compressed (capped at 15% rather than 100%) — no category exceeds 11%, and the welfare-relevant System-prompt and System-reminder subsets register 0%.

Code
"""Per-category hard threat_share — sum of per-file counts, then ratio."""

per_cat = (
    alt_df.groupby("category", as_index=False)
          .agg(threat=("threat_count", "sum"),
               causal=("causal_count", "sum"))
)
per_cat["total"] = per_cat["threat"] + per_cat["causal"]
per_cat = per_cat[per_cat["total"] >= 5].copy()
per_cat["threat_share"] = per_cat["threat"] / per_cat["total"]

threat_chart = (
    alt.Chart(per_cat).mark_bar().encode(
        x=alt.X("threat_share:Q",
                title="hard threat_share (fraction of explanations that are hard threats)",
                scale=alt.Scale(domain=[0, 0.15])),
        y=alt.Y("category:N", sort="-x", title=None),
        color=alt.Color("category:N",
                        scale=alt.Scale(domain=_cat_domain, range=_cat_range),
                        legend=None),
        tooltip=[alt.Tooltip("category:N"),
                 alt.Tooltip("threat:Q", title="threat count (hard)"),
                 alt.Tooltip("causal:Q", title="causal count"),
                 alt.Tooltip("total:Q",  title="total explanations"),
                 alt.Tooltip("threat_share:Q", format=".3f")],
    ).properties(width=520, height=240,
                 title="Per-category hard threat_share (categories with ≥5 explanations)")
)

save_chart(threat_chart, "21-per-category-threat-share")

3. Paired exemplars — audit candidates and rewrite templates

Two rankings, side by side. The top-10 welfare-evidence files (negative exemplars) are rule-saturated AND under-explained — they are this proposal’s primary audit candidates: every threat-framed sentence in these files is a candidate for rewriting. The top-10 positive exemplars are rule-saturated AND well-explained — they are the rewrite templates showing how rules can be paired with reasons in similar contexts.

The score formulas: score_welfare = rule_density × (1 − pct_explained_para/100) and score_positive = rule_density × (pct_explained_para/100).
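
A minimal sketch of the same arithmetic applied directly to alt_df — assuming the frame carries per-file rule_density and rule_explained_para_pct columns (the latter being the per-file pct_explained_para figure, as the chart tooltips suggest); the welfare_evidence_table / positive_exemplar_table helpers encapsulate this plus the top-N selection:

Code
"""Direct computation of the two ranking scores (the helpers wrap the same math)."""

scored = alt_df.copy()
explained_frac = scored["rule_explained_para_pct"] / 100

scored["score_welfare"]  = scored["rule_density"] * (1 - explained_frac)  # loud + unexplained
scored["score_positive"] = scored["rule_density"] * explained_frac        # loud + explained

audit_candidates  = scored.nlargest(10, "score_welfare")
rewrite_templates = scored.nlargest(10, "score_positive")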

Code
"""Welfare-evidence + positive-exemplar paired top-10 chart."""

we_top = welfare_evidence_table(alt_df, top_n=10).copy()
we_top["kind"] = "audit candidates (loudest, least-explained)"
pe_top = positive_exemplar_table(alt_df, top_n=10).copy()
pe_top["kind"] = "rewrite templates (rules-with-reasons)"

paired = pd.concat([we_top, pe_top], ignore_index=True)

paired_chart = (
    alt.Chart(paired)
    .mark_bar()
    .encode(
        x=alt.X("score:Q",
                title="rule_density × (explanation factor)"),
        y=alt.Y("path:N",
                sort=alt.SortField("score", order="descending"),
                title=None,
                axis=alt.Axis(labelLimit=420)),
        color=alt.Color("category:N",
                        scale=alt.Scale(domain=_cat_domain, range=_cat_range),
                        legend=alt.Legend(title="Category", orient="bottom", columns=4)),
        row=alt.Row("kind:N",
                    title=None,
                    sort=["audit candidates (loudest, least-explained)",
                          "rewrite templates (rules-with-reasons)"],
                    header=alt.Header(labelAngle=0, labelAlign="left")),
        tooltip=[
            alt.Tooltip("path:N"),
            alt.Tooltip("category:N"),
            alt.Tooltip("ccVersion:N"),
            alt.Tooltip("rule_n:Q",                  format=",", title="rule sentences"),
            alt.Tooltip("rule_density:Q",            format=".3f"),
            alt.Tooltip("rule_explained_para_pct:Q", format=".2f", title="% explained"),
            alt.Tooltip("score:Q",                   format=".3f"),
        ],
    )
    .properties(width=560, height=240,
                title="Top-10 audit candidates (top) and top-10 rewrite templates (bottom)")
    .resolve_scale(x="independent", y="independent")
)

save_chart(paired_chart, "21-welfare-evidence-pairing")

4. Forensic sample — all threat-framed rule sentences

All five rule sentences with has_threat=True from sentences_classified.parquet. Each row shows the file, the threat-flagged sentence text, and whether the same paragraph also contains a causal explanation. The “rewrite as causal” task: replace the threat clause with a because <reason> clause that names the rule’s underlying purpose. With only five candidates corpus-wide, this is a five-minute editorial pass rather than a corpus-wide audit.
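
To make the rewrite shape concrete, a hypothetical before/after pair — both sentences are invented for illustration and are not drawn from the corpus:

Code
"""Hypothetical threat -> causal rewrite (invented example, not a corpus sentence)."""

BEFORE = "Always run the formatter before committing, or the CI build will fail."
AFTER  = ("Always run the formatter before committing, because the repository "
          "enforces one canonical style and unformatted diffs obscure real changes.")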

Code
"""All hard-threat rule sentences from the parquet (small N — show all, not a sample)."""

threat_rules = sentences_df[sentences_df["has_threat"] & sentences_df["is_rule"]].copy()
print(f"population: {len(threat_rules)} sentences flagged as both hard threat and rule")
print(f"  of which {int(threat_rules['has_causal'].sum())} also carry a causal marker in the same sentence")
print(f"  of which {int(threat_rules['paragraph_has_just'].sum())} sit in a paragraph that also contains some justification")
print()

display_cols = ["file_path", "category", "ccVersion", "text", "has_causal", "paragraph_has_just"]
threat_rules[display_cols].sort_values(["category", "file_path"]).reset_index(drop=True)
population: 5 sentences flagged as both hard threat and rule
  of which 0 also carry a causal marker in the same sentence
  of which 4 sit in a paragraph that also contains some justification
file_path category ccVersion text has_causal paragraph_has_just
0 agent-prompt-explore.md Agent prompt 2.1.118 You do NOT have access to file editing tools -... False True
1 agent-prompt-plan-mode-enhanced.md Agent prompt 2.1.118 You do NOT have access to file editing tools -... False True
2 data-claude-api-reference-curl.md Data / template 2.1.111 Do not use  /  —\nJSON strings can contain any... False True
3 skill-model-migration-guide.md Skill 2.1.128 **\n\nPassing both will error on every Claude ... False False
4 tool-description-write-read-existing-file-firs... Tool description 2.1.120 - If the file already exists, you must ${READ_... False True

Conclusions (Claude)

The threat-framing finding is small but pointed. Across 5,881 sentences, only 8 contain unambiguous coercive language (all of them will fail/break/crash matches; the remaining hard patterns never fire in the corpus), and only 5 of those are inside rule sentences. This is a much narrower target than the volume of unjustified rules Pattern 2 documents, but the editorial cost is correspondingly tiny: 5 sentences are 5 sentences. If the goal is to encourage reasoning over blind obedience, neutral causal explanation is the mechanism — coercion just substitutes extrinsic motivation for intrinsic understanding.

What the soft-conditional pool tells us. The sentences flagged has_soft_conditional=True are mostly procedural prose (If it's a slash command, invoke it via the Skill tool; otherwise act on it directly). They are useful as a procedural-density signal — high values flag prose where most logic is if X otherwise Y rather than because Z — and they covary with the 7.6:1 procedural-to-judgment ratio that the executive summary in 20_track_justification_rate foregrounds. They are not coercion; they should not be reported as threats.
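
A minimal sketch of that per-file signal, assuming the parquet exposes the has_soft_conditional boolean flag referenced above:

Code
"""Per-file soft-conditional density — a procedural-density signal, not a threat signal."""

soft_density = (
    sentences_df.groupby("file_path")["has_soft_conditional"]
                .mean()                       # fraction of sentences with a soft conditional
                .sort_values(ascending=False)
)
print(soft_density.head(10))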


Recommendations (Claude)

The asks this proposal makes of Anthropic, framed as “I’d want X”:

  1. Hand-review the 5 threat-framed rule sentences (table in section 4) and rewrite as causal where it makes sense. None looked like overt coercion in a spot-check; most are factual statements about what the system will do (will error on every Claude API call). Whether to rewrite is editorial judgment per file.

  2. Track threat_share per future release, gated against regression — same logic as the justification-rate proposal in 20_track_justification_rate, and the same shape as the gate sketched in the Proposal above. No arbitrary target; the only requirement is that threat_share never rises (equivalently, the share of causal framing never falls).

  3. Treat soft-conditional density as a procedural-density signal, not a threat signal. The has_soft_conditional=True matches are useful alongside the 7.6:1 procedural-to-judgment ratio. They should not re-enter the threat-share metric.


Limitations (Claude)

What this analysis can’t tell us about threat-vs-causal framing specifically:

  1. Two counting bases, not one. The threat-share section above distinguishes the YAML paragraph-aggregated counts (8 / 137 → 5.5%) from the per-sentence parquet flags (8 / 132). The headline figure cited in the proposal abstract is the YAML number; the parquet flags are what the audit operates on. Anyone quoting a single threat-share figure should pick one base and stick with it; the two will differ slightly because a paragraph can contain both flags on different sentences.

  2. The “rewrite success rate” is itself a judgment call. Whether a rewrite “preserves the rule’s information content without losing precision” requires editorial judgment from someone who knows the rule’s underlying purpose. This proposal can’t be fully automated; it requires Anthropic’s prompt authors in the loop. The audit pipeline can flag candidates and surface them; the rewrite is human work.

  3. Cross-cutting limitations apply — rule-based detection is a lower bound (lexicon-based threat detection misses indirect / ironic / implied threats; the actual threat-share is plausibly higher than the floor cited above); English-only lexicons; single-product corpus. The cross-product audit in 22_cross_product_audit is what would generalize this finding to other Anthropic prompt corpora. See index.qmd for the full cross-cutting limitations note.
