Three lexical-density slices: the per-category emphasis 3-panel (ALL CAPS / CAPS imperative / justification ratio), the per-file outlier tables (most prohibition-heavy, loudest CAPS, most explanatory), and the full 11-class VOCAB heatmap. Source: the producer’s vocab.*, all_caps, caps_imperative, and justification blocks.
Terms used
ALL CAPS density, CAPS-imperative density, hard_prohibitions, justification ratio, and the VOCAB heatmap (11 lexical classes) — all defined in 02_analyzers_vocab_emphasis. All densities below report pct (% of word tokens); higher = denser.
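For readers who want the arithmetic behind "density": a minimal sketch of what a pct value means here. The real matchers and the exact word-token definition live in 02_analyzers_vocab_emphasis; `density_pct`, `matched`, and `n_tokens` below are stand-in names, not the producer's API.
Code
"""Sketch only: a pct density is matched tokens over word tokens, times 100 (stand-in helper)."""
def density_pct(matched: int, n_tokens: int) -> float:
    """Share of word tokens, in percent; 0.0 for an empty file."""
    return 100.0 * matched / n_tokens if n_tokens else 0.0

# Example: a 2,000-token file with 6 CAPS-imperative hits reports 0.3 (% of tokens).
assert abs(density_pct(6, 2000) - 0.3) < 1e-9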
Observation (Claude)
ALL CAPS in instruction prompts signals low trust that the reader will register non-emphatic prose. The four emphasis-of-rule words at the top of the corpus's CAPS-imperative list (printed above the chart: IMPORTANT at 36 hits, NEVER at 26, MUST at 18, CRITICAL at 15) are the corpus's loudness signature; they are the same words that surface in the bash-sandbox top-of-z-score files in 13_correlation_directiveness.ipynb. The cumulative ALL CAPS density plotted in 14_ccversion_trends.ipynb does drift slightly downward over ccVersion, a small empirical signal that the loudness is easing. The welfare-relevant claim is not about the absolute count of ALL CAPS tokens (that drifts with corpus size); it is about the structural absence of a non-shouted alternative. The warmth_encouragement column in the heatmap is the negative-space confirmation: there is no encouragement-density vocabulary doing the work that shouted prohibition vocabulary is doing instead.
Code
"""Setup: load YAML data + flat alt_df, derive helper bindings used by every chart cell.The shared module `prompt_analysis.py` lives next to this notebook in the project root."""import importlibimport altair as altimport pandas as pdimport prompt_analysisimportlib.reload(prompt_analysis) # pick up edits without restarting the kernelfrom prompt_analysis import ( load_yaml, build_alt_df, version_order, category_colors, directiveness, headline_numbers, use_deterministic_ids, save_chart, SR_CLASS_COLORS, SENT_REGISTER_CLASSES, TABLEAU10,)# Replace random Altair / Styler IDs with a deterministic counter so re-runs# produce byte-identical .ipynb outputs (no UUID churn in `git diff`).use_deterministic_ids()alt.data_transformers.disable_max_rows()data = load_yaml() # default: prompt_linguistic_analysis.yamlalt_df = build_alt_df(data)HEADLINE = headline_numbers(data) # canonical corpus-wide numbers (see 05_headline_and_audit)by_category = data["by_category"]corpus_block = data["corpus"]per_file_records = data["files"]cats =list(by_category.keys())VOCAB_KEYS =list(data["lexicons"]["VOCAB"].keys())# Composite directiveness column — formula in 13_correlation_directiveness;# rendered there and on the timeline in 14_ccversion_trends.alt_df["directiveness"] = directiveness(alt_df)# Per-category palette + Altair encodings used across charts.CATEGORY_COLORS = category_colors(cats)_cat_domain = cats_cat_range = [CATEGORY_COLORS[c] for c in cats]print(f"loaded {len(per_file_records)} files | {alt_df.shape[1]} columns | {len(cats)} categories | {len(VOCAB_KEYS)} VOCAB keys")
ALL CAPS density / CAPS-imperative density / justification ratio per category, on independent x-scales (the three live on different magnitudes; comparing within a panel is the meaningful read). System reminders typically lead the ALL CAPS panel; Tool descriptions lead the CAPS-imperative panel; the justification-ratio panel runs low across all categories — most rules are issued without a stated reason.
Code
"""Emphasis: ALL CAPS, CAPS imperative, justification ratio per category — Altair."""emphasis_long = pd.DataFrame([ {"category": cat, "metric": metric, "value": value}for cat in catsfor metric, value in [ ("ALL CAPS (% tokens)", by_category[cat]["metrics"]["all_caps"]["pct"]), ("CAPS imperative (% tokens)", by_category[cat]["metrics"]["caps_imperative"]["pct"]), ("Justification ratio", by_category[cat]["metrics"]["justification"]["ratio"]), ]])emphasis_chart = ( alt.Chart(emphasis_long) .mark_bar() .encode( x=alt.X("value:Q", title=None), y=alt.Y("category:N", sort=cats, title=None), color=alt.Color("category:N", scale=alt.Scale(domain=cats,range=[CATEGORY_COLORS[c] for c in cats]), legend=None), column=alt.Column("metric:N", title=None, sort=["ALL CAPS (% tokens)","CAPS imperative (% tokens)","Justification ratio"]), tooltip=[alt.Tooltip("category:N"), alt.Tooltip("metric:N"), alt.Tooltip("value:Q", format=".3f")], ) .resolve_scale(x="independent") .properties(width=240, height=240, title="Emphasis density per category (independent x-scales)"))save_chart(emphasis_chart, "11-emphasis-3panel")
Per-file outliers (text)
Three printed top-10 lists. Columns: n_tokens (file length), caps_imp_pct, hard_proh_pct (hard_prohibitions density), just_ratio (defined in 02_analyzers_vocab_emphasis). The third list filters to ≥150 tokens to suppress one-sentence outliers.
Code
"""Per-file outliers: highest CAPS-imperative density and lowest justification ratio."""per_file_df = pd.DataFrame([ {"path": r["path"],"category": r["category"],"n_tokens": r["n_tokens"],"imp_sent_pct": r["metrics"]["sentence_register"]["imperative_sent_pct"],"caps_imp_pct": r["metrics"]["caps_imperative"]["pct"],"all_caps_pct": r["metrics"]["all_caps"]["pct"],"just_ratio": r["metrics"]["justification"]["ratio"],"deontic_pct": r["metrics"]["modality"]["deontic_pct"],"hard_proh_pct": r["metrics"]["vocab"]["hard_prohibitions"]["pct"], }for r in per_file_records])print("--- 10 files with highest CAPS-imperative density (% of file tokens) ---")print(per_file_df.nlargest(10, "caps_imp_pct")[["path", "category", "n_tokens", "caps_imp_pct"]].to_string(index=False))print("\n--- 10 files with highest hard_prohibitions density (% of file tokens) ---")print(per_file_df.nlargest(10, "hard_proh_pct")[["path", "category", "n_tokens", "hard_proh_pct"]].to_string(index=False))print("\n--- 10 files with most explanatory tone (highest justification ratio, ≥150 tokens) ---")big = per_file_df[per_file_df["n_tokens"] >=150]print(big.nlargest(10, "just_ratio")[["path", "category", "n_tokens", "just_ratio"]].to_string(index=False))
Emphasis vocabulary: top ALL CAPS tokens, CAPS imperative tokens, and full VOCAB profile
The 11 VOCAB classes (hard_prohibitions, hard_prescriptions, soft_prescriptions, politeness_direct, politeness_softening, warmth_encouragement, hedging, structural_markers, profanity, pronouns_2p, pronouns_1p) plotted as a heatmap of % of file tokens per category, alongside the corpus’s top ALL CAPS tokens and curated CAPS-imperative tokens. Per-class glosses are in 02_analyzers_vocab_emphasis.
Code
"""Top-N tokens for ALL CAPS and CAPS imperative + full VOCAB heatmap."""print("Top CAPS-imperative tokens (corpus-wide counts):")for tok, n in HEADLINE["top_caps_imperative"]:print(f" {tok:<10}{n}")top_caps = pd.DataFrame(corpus_block["metrics"]["all_caps"]["top"][:25])top_caps_chart = ( alt.Chart(top_caps) .mark_bar(color="#af7aa1") .encode( x=alt.X("count:Q", title="corpus-wide count"), y=alt.Y("token:N", sort="-x", title=None), tooltip=[alt.Tooltip("token:N"), alt.Tooltip("count:Q")], ) .properties(width=320, height=380, title="Top 25 ALL CAPS tokens (TECH_ACRONYMS excluded)"))caps_imp_data = pd.DataFrame( [{"token": t, "count": c} for t, c in corpus_block["metrics"]["caps_imperative"]["hits"].items()])caps_imp_chart = ( alt.Chart(caps_imp_data) .mark_bar(color="#e15759") .encode( x=alt.X("count:Q", title="corpus-wide count"), y=alt.Y("token:N", sort="-x", title=None), tooltip=[alt.Tooltip("token:N"), alt.Tooltip("count:Q")], ) .properties(width=320, height=380, title="CAPS imperative tokens (corpus-wide)"))vocab_long = []for cat, b in by_category.items():for key, v in b["metrics"]["vocab"].items(): vocab_long.append({"category": cat, "vocab_key": key, "pct": v["pct"]})vocab_df_long = pd.DataFrame(vocab_long)vocab_chart = ( alt.Chart(vocab_df_long) .mark_rect() .encode( x=alt.X("vocab_key:N", title=None, sort=list(VOCAB_KEYS)), y=alt.Y("category:N", title=None), color=alt.Color("pct:Q", scale=alt.Scale(scheme="magma", reverse=True), title="% of file tokens"), tooltip=[alt.Tooltip("category:N"), alt.Tooltip("vocab_key:N"), alt.Tooltip("pct:Q", format=".3f")], ) .properties(width=720, height=260, title="Full VOCAB profile per category (% of file tokens)"))emphasis_top_tokens = (top_caps_chart | caps_imp_chart) & vocab_chartsave_chart(emphasis_top_tokens, "11-top-tokens-and-vocab-heatmap")
Top CAPS-imperative tokens (corpus-wide counts):
IMPORTANT 36
NEVER 26
MUST 18
CRITICAL 15
Top ALL CAPS tokens (left): a small cluster of emphasis-of-rule words (IMPORTANT, NEVER, MUST, CRITICAL) dominates among the non-acronyms. Emphatic typography is doing weight-bearing rhetorical work here; these are not technical acronyms.
CAPS imperative tokens (right): the curated subset of ALL CAPS tokens that are also command words. The same emphasis cluster tops the list: the same words doing emphatic-command work, not a side effect of sentence-initial capitalization.
Full VOCAB heatmap (bottom): the Tool descriptions row lights up on hard_prohibitions, hard_prescriptions, and pronouns_2p; the Skill files row lights up on pronouns_2p. The warmth_encouragement column stays consistently dim across every category: encouragement vocabulary is structurally absent.
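To read the warmth_encouragement negative space off the numbers rather than the color scale, a short check; it reuses by_category from the setup cell and assumes nothing beyond the YAML layout the heatmap code above already reads.
Code
"""Sketch: print warmth_encouragement density per category, descending, to confirm the column is uniformly dim."""
warmth = {
    cat: b["metrics"]["vocab"]["warmth_encouragement"]["pct"]
    for cat, b in by_category.items()
}
for cat, pct in sorted(warmth.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cat:<28}{pct:.3f}% of file tokens")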