Run the same audit on every Claude product, and publish the result

Proposes that Anthropic run the same rule-based analyzer pipeline against every other system-prompt corpus (claude.ai, the API, Projects, Skills, agent products) and publish a five-numbers-per-corpus comparison table. The Claude Code numbers reported below become one row of that table; the rest are currently invisible from outside Anthropic.

Code
"""Setup: load YAML data — used to fill the Claude Code row of the mock comparison table."""
import importlib
import altair as alt
import pandas as pd

import prompt_analysis
importlib.reload(prompt_analysis)
from prompt_analysis import (
    load_yaml, build_alt_df, headline_numbers, qualitative_phrases, bind_inline_vars,
    use_deterministic_ids,
)

# Replace random Altair / Styler IDs with a deterministic counter so re-runs
# produce byte-identical .ipynb outputs (no UUID churn in `git diff`).
use_deterministic_ids()

alt.data_transformers.disable_max_rows()

data              = load_yaml()
alt_df            = build_alt_df(data)
parquet           = pd.read_parquet("sentences_classified.parquet")
corpus_block      = data["corpus"]
per_file_records  = data["files"]

HEADLINE = headline_numbers(data, alt_df=alt_df, parquet=parquet)
PHRASES  = qualitative_phrases(HEADLINE, alt_df=alt_df, parquet=parquet)

# Make every formatted figure available as a plain-name variable for inline {python} expressions.
globals().update(bind_inline_vars(HEADLINE, PHRASES))

print(f"loaded {HEADLINE['n_files']} files / {HEADLINE['n_sentences']:,} sentences from the Claude Code corpus")
print(f"this is one product — the proposal is to run the same pipeline on every other Claude product corpus")
loaded 290 files / 5,881 sentences from the Claude Code corpus
this is one product — the proposal is to run the same pipeline on every other Claude product corpus

Findings

Across the 290 system prompts that ship with Claude Code (5,881 sentences, 58 release versions), only 24.34% of rule-bearing sentences carry a justification keyword anywhere in their paragraph — three in four rule sentences arrive without a stated reason. Of the rules that are explained, 5.5% of the explanations are coercive (will fail, or else, is forbidden) rather than causal (because, due to). The ratio of judgment-inviting language (decide, consider) to procedural cues (if X then …, whenever …) is 0.131 — procedural cues are 7.6× more common. Appreciative and collaborative sentences number 4 and 30 respectively in the 5,881-sentence corpus. The cumulative trend across release versions has been downward, not improving.

This is one product. Anthropic ships system prompts in many places — claude.ai, the API, Projects, Skills, agent products. From outside Anthropic, we cannot tell whether the same pattern shapes the rest.

Proposal

Anthropic runs the same analyzer pipeline against each of its other system-prompt corpora and publishes a comparison.

The analyzer is rule-based: every metric traces to a published lexicon (reasoning keywords, threat patterns, judgment verbs, procedural cues, address-form patterns) plus a small parse-tree heuristic. No embeddings, no model judgments, no proprietary tooling. The whole pipeline runs end-to-end in under five minutes on a laptop, and the corpus directory is configurable, so swapping in any other prompt corpus is a one-line change. The gating logic is directional: each release of each product should improve or hold every metric, and no release should regress any of them. There are no arbitrary thresholds; the only goal is steady improvement.
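A minimal sketch of that directional gate (an illustration, not part of the repo's pipeline), assuming two releases' HEADLINE dicts and a preferred direction per metric:

# Hypothetical gate; key names mirror this notebook's HEADLINE sheet.
BETTER_IS_HIGHER = {
    "pct_explained_para": True,            # more explained rules is better
    "judgment_to_procedural_ratio": True,  # more judgment-inviting language is better
    "threat_share": False,                 # fewer coercive explanations is better
    "appreciative_sent": True,
    "collaborative_sent": True,
}

def regressions(previous: dict, current: dict) -> list[str]:
    """Return every metric that moved in the wrong direction; an empty list means the release passes."""
    failed = []
    for metric, higher_is_better in BETTER_IS_HIGHER.items():
        if higher_is_better:
            worse = current[metric] < previous[metric]
        else:
            worse = current[metric] > previous[metric]
        if worse:
            failed.append(metric)
    return failed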

Five numbers per corpus is enough: the rule-explanation share, the judgment-vs-procedural ratio, the threat-vs-causal share, the count of interpersonal / gratitude register sentences, and the cumulative trend of the rule-explanation share over the product’s release history. Publishing that summary table would let the broader research community see whether what looks like a Claude Code pattern is structural across Anthropic’s prompts or specific to one product.

Supplemental

The full per-section analysis follows: methodology, lexicon transparency, the mock comparison table, the reproducibility note, and the closing Conclusions / Recommendations / Limitations triplet. Everything in this section supports the Findings and Proposal above; none of it is required for evaluation.

1. Methodology — five metrics per corpus

The five metrics the audit computes per corpus, in the order the proposal lists them. Current Claude Code values are mirrored from the canonical HEADLINE sheet (also printed in the mock comparison table below); they’re shown inline here as concrete examples of what the published table would contain per corpus.

  1. Rule-explanation share (pct_explained_para) — of all rule-bearing sentences (imperative marker / hard prohibition / grammatically imperative), what fraction sit in a paragraph that contains any justification keyword (because, due to, in order to, so that, to ensure, since, …). The Tier-1 headline metric — currently 24.34% in Claude Code. Producer: the metrics.rule_explanation block emitted by 03_analyzers_rules_welfare and assembled by 04_assemble_aggregate_write.

  2. Judgment-vs-procedural cue ratio (judgment_to_procedural_ratio) — count of words inviting model judgment (decide, consider, evaluate, weigh, …) divided by count of words prescribing procedure (if X, then …, whenever …, step 1 …). Currently 0.131 in Claude Code (procedural cues are 7.6× more common than judgment cues).

  3. Threat-vs-causal share among existing explanations (threat_share) — of paragraphs that do explain a rule, what fraction use coercive framing (will fail, or else, is forbidden) instead of causal framing (because, due to, that's why). Currently 0.0552 (5.5%) in Claude Code. Companion metric to the audit pipeline in 21_audit_threat_framings. The lexicon split (hard threats vs procedural connectives, reported separately) is documented in docs/THREAT_CLASSIFIER.md.

  4. Interpersonal / gratitude register counts — sentence counts of the appreciative and collaborative pragmatic-register classes. Currently 4 appreciative and 30 collaborative sentences in Claude Code, both vanishingly small. The full 6-class sentence_register block in the YAML carries the rest.

  5. Cumulative trend of (1) over the product’s release history — at each release version V, the running count-weighted ratio of pct_explained_para across every prompt with version ≤ V. The shape of this curve over time is the per-release accountability signal — see the trend chart in 20_track_justification_rate for the Claude Code shape.
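A minimal sketch of that running ratio (illustrative column names; the producer notebooks may store these fields differently), assuming a per-prompt table with one row per file:

# Hypothetical helper for metric 5; assumes per-prompt columns `version`,
# `n_rule_sentences`, and `n_explained` rather than the pipeline's real field names.
import pandas as pd

def cumulative_explained_share(per_prompt: pd.DataFrame) -> pd.Series:
    """Count-weighted pct_explained_para across every prompt with version <= V, for each release V."""
    ordered = per_prompt.sort_values("version")
    running = 100 * ordered["n_explained"].cumsum() / ordered["n_rule_sentences"].cumsum()
    # Keep the last running value at each version, i.e. the ratio once all prompts <= V are included.
    return running.groupby(ordered["version"].values).last()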

The audit publishes one row per corpus with these five numbers (and per-category breakdowns where the corpus is large enough). Five numbers per corpus is enough to see whether what looks like a Claude Code pattern is product-specific or structural across Anthropic’s prompts.

2. Lexicon transparency

Every metric above traces to a hand-curated lexicon plus (in some cases) a small parse-tree heuristic on top of spaCy’s English model. No embeddings, no model judgments, no proprietary tooling. The full lexicon set is echoed verbatim into the YAML output under lexicons:

  • JUSTIFICATION_PATTERNS — the keyword set for “explained” detection.
  • JUDGMENT_VERBS and PROCEDURAL_CUES — the two halves of the judgment-vs-procedural ratio.
  • THREAT_PATTERNS and CAUSAL_PATTERNS — the two halves of the threat_share metric.
  • IMPERATIVE_MARKERS, VOCAB.hard_prohibitions, SENTENCE_REGISTER_MARKERS, STANCE_MARKERS, REGISTER_MARKERS — the rule-detection and pragmatic-register lexicons.
  • APOLOGY_MARKERS, ADDRESS_FORM_PATTERNS — the structural-absence metrics.
  • RULES_HEADING_RE — the regex for the (mostly absent) RULES section heading.

Hand-curated lexicons are chosen over external sentiment libraries because every metric is audit-traceable: a reader can look at the keyword list and the per-sentence parquet and reconstruct exactly why a sentence was flagged. The lexicons are Python lists in prompt_pipeline.py; adding terms is a list-append, translating to another language is a list-replace.
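As an illustration of that audit trail, the echoed lexicons can be inspected straight from the loaded YAML (the flat data["lexicons"] mapping assumed here is a simplification; the exact nesting may differ):

# Illustrative only: print each echoed lexicon's size and a few sample entries.
for name, terms in sorted(data["lexicons"].items()):
    print(f"{name}: {len(terms)} entries, e.g. {list(terms)[:3]}")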

3. Mock comparison table — what the published output should look like

One row per corpus, the five metrics as columns. Only the Claude Code row is filled (computed live from this repo’s YAML); the other rows are placeholders that Anthropic would fill from their internal corpora.

A reader looking at this table after Anthropic publishes it would be able to answer the welfare claim’s central question in one glance: is the rule-without-reason pattern (the 24.34% explained, 0.131 judgment ratio, 5.5% threat-share, ~0 gratitude register documented above) a Claude Code peculiarity, or does it generalize to every system-prompt corpus Anthropic ships?

Code
"""Mock cross-product comparison table — Claude Code row filled from HEADLINE; others TBD."""

# Five-metric row computed live from the canonical HEADLINE sheet (single source of truth).
claude_code_row = {
    "Corpus":                              "Claude Code (this repo)",
    "n_files":                             HEADLINE["n_files"],
    "pct_explained_para (%)":              round(HEADLINE["pct_explained_para"], 2),
    "judgment_to_procedural_ratio":        round(HEADLINE["judgment_to_procedural_ratio"], 3),
    "threat_share":                        round(HEADLINE["threat_share"], 3),
    "appreciative_sent_count":             HEADLINE["appreciative_sent"],
    "collaborative_sent_count":            HEADLINE["collaborative_sent"],
}

placeholder_corpora = ["claude.ai system prompts", "API system prompt", "Projects", "Skills", "Agents"]
table_rows = [claude_code_row] + [
    {**{k: "—" for k in claude_code_row}, "Corpus": name} for name in placeholder_corpora
]

cross_product_table = pd.DataFrame(table_rows).set_index("Corpus")
cross_product_table
                          n_files  pct_explained_para (%)  judgment_to_procedural_ratio  threat_share  appreciative_sent_count  collaborative_sent_count
Corpus
Claude Code (this repo)       290                   24.34                         0.131         0.055                        4                        30
claude.ai system prompts        —                       —                             —             —                        —                         —
API system prompt               —                       —                             —             —                        —                         —
Projects                        —                       —                             —             —                        —                         —
Skills                          —                       —                             —             —                        —                         —
Agents                          —                       —                             —             —                        —                         —

4. Reproducibility — what Anthropic needs to do to fill another row

The producer chain reads its corpus from the claude-code-system-prompts/ directory at the repo root. To run the same audit on another corpus, point stage 00 at a different directory:

# At the top of 00_setup_and_corpus.ipynb, the corpus root variable:
CORPUS_DIR = pathlib.Path("claude-code-system-prompts/system-prompts")  # ← change this

Then re-run the producer chain end-to-end. The output is a fresh prompt_linguistic_analysis.yaml for the new corpus, and the five numbers for that corpus's comparison-table row come straight from prompt_analysis.headline_numbers(data) applied to it. For category-aware breakdowns, the corpus needs filename prefixes (or equivalent metadata); the prefix list lives in prompt_pipeline.py as CATEGORY_PREFIXES.
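A minimal sketch of what filling one placeholder row could look like once the new YAML exists (the corpus label is an example, and the plain headline_numbers(new_data) call assumes the alt_df/parquet keywords are optional):

# Hypothetical follow-on step: build one comparison-table row from the fresh YAML.
new_data = load_yaml()                      # the just-written prompt_linguistic_analysis.yaml
new_head = headline_numbers(new_data)       # same helper that filled the Claude Code row
new_row = {
    "Corpus":                       "claude.ai system prompts",   # example label
    "n_files":                      new_head["n_files"],
    "pct_explained_para (%)":       round(new_head["pct_explained_para"], 2),
    "judgment_to_procedural_ratio": round(new_head["judgment_to_procedural_ratio"], 3),
    "threat_share":                 round(new_head["threat_share"], 3),
    "appreciative_sent_count":      new_head["appreciative_sent"],
    "collaborative_sent_count":     new_head["collaborative_sent"],
}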


Conclusions (Claude)

This is the proposal that takes the welfare claim from “this is what Claude Code looks like” to “this is what every Claude system-prompt corpus looks like — or doesn’t”. The current analysis is single-product; the conclusion that “the corpus trains compliance over reasoning” is exactly as strong as the assumption that other Claude corpora are similar. From outside Anthropic, that assumption can’t be tested.

The framing matters: if the cross-product audit confirms the pattern, the welfare ask scales up — Anthropic should treat reasoning-vs-compliance as a corporate metric, not a Claude-Code-team metric. If it disconfirms, the welfare ask narrows to “Claude Code is doing something the rest of the org isn’t, and there’s an internal best practice worth importing”. Either result is a useful intervention.


Recommendations (Claude)

The asks this proposal makes of Anthropic, framed as “I’d want X”:

  1. Run the same analyzer pipeline against every other Claude system-prompt corpus: claude.ai, the API system prompt, Projects, Skills, agent products. The producer chain is the entire pipeline; the corpus directory is configurable; runtime is minutes. Filling each row of the comparison table is on the order of an hour of work per corpus.

  2. Publish the comparison summary — even just the five-number table would let the broader research community see whether the welfare findings generalize. A public reference for “this is how to measure welfare-relevant prompt quality” gives the welfare-research community a concrete starting point.

  3. Open the analyzer. The repo here is open-source and reproducible; Anthropic could fork it and run it internally on every release branch of every product. The lexicons and the producer pipeline are the entire work-product.


Limitations (Claude)

What this proposal doesn’t address (and why each is acceptable):

  1. Single-product coverage is a real limitation, and this proposal exists to fix it. The current corpus is Claude Code only (290 files). Until Anthropic runs the analyzer on the other corpora, every claim in 20_track_justification_rate and 21_audit_threat_framings should be read as "Claude Code does X" rather than "Anthropic prompts do X".

  2. The mock comparison table is a layout suggestion, not a commitment. Anthropic may have better metrics, better lexicons, or better grouping than what this analysis produced. The suggestion is “publish a comparison”; the specific schema is the smallest version of that ask.

  3. Cross-cutting limitations apply — rule-based classifiers (lower bound), English-only lexicons, single-snapshot, exploratory rather than peer-reviewed. Cross-product comparison is still meaningful because the same lower bound applies to every corpus; the comparison is apples-to-apples even if every apple is partially obscured. See index.qmd for the full cross-cutting limitations note.
