Emphasis, ALL CAPS, and the full VOCAB profile

Three lexical-density slices: the per-category emphasis 3-panel (ALL CAPS / CAPS imperative / justification ratio), the per-file outlier tables (most prohibition-heavy, loudest CAPS, most explanatory), and the full 11-class VOCAB heatmap. Source: the producer’s vocab.*, all_caps, caps_imperative, and justification blocks.

Terms used

ALL CAPS density, CAPS-imperative density, hard_prohibitions, justification ratio, and the VOCAB heatmap (11 lexical classes) — all defined in 02_analyzers_vocab_emphasis. All densities below report pct (% of word tokens); higher = denser.
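The pct densities themselves come from the producer, but as a rough sketch of what an ALL CAPS density means (hypothetical regex tokenizer, not the analyzer's exact rules, which among other things exclude TECH_ACRONYMS):

```python
import re

def all_caps_pct(text: str, min_len: int = 2) -> float:
    """Rough ALL CAPS density: % of word tokens that are fully uppercase.

    Hypothetical approximation of the producer's all_caps.pct -- the real
    analyzer in 02_analyzers_vocab_emphasis also excludes TECH_ACRONYMS.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    if not tokens:
        return 0.0
    caps = sum(1 for t in tokens if len(t) >= min_len and t.isupper())
    return 100.0 * caps / len(tokens)

print(round(all_caps_pct("IMPORTANT: NEVER run rm; prefer trash"), 2))  # → 33.33
```

Higher = denser throughout; a two-token shout in a fifteen-token file scores far higher than the same shout buried in a long prompt, which is why the short bash-sandbox files dominate the outlier tables below.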


Observation (Claude)

ALL CAPS in instruction prompts is a sign of low trust in the reader’s ability to weigh non-emphatic prose. The four emphasis-of-rule words at the top of the corpus’s CAPS-imperative list (printed above the chart: IMPORTANT at 36 hits, NEVER at 26, MUST at 18, CRITICAL at 15) are the corpus’s loudness signature; they are the same words that surface in the highest-z-score bash-sandbox files in 13_correlation_directiveness.ipynb. The cumulative ALL CAPS density visible in 14_ccversion_trends.ipynb does drift slightly downward over ccVersion, a small empirical signal in that direction. The welfare-relevant claim is not about the absolute count of ALL CAPS tokens (that drifts with corpus size); it is about the structural absence of a non-shouted alternative. The warmth_encouragement column in the heatmap is the negative-space confirmation: no encouragement-density vocabulary is doing the work that shouted prohibition vocabulary is doing instead.


Code
"""Setup: load YAML data + flat alt_df, derive helper bindings used by every chart cell.

The shared module `prompt_analysis.py` lives next to this notebook in the project root.
"""
import importlib
import altair as alt
import pandas as pd

import prompt_analysis
importlib.reload(prompt_analysis)   # pick up edits without restarting the kernel
from prompt_analysis import (
    load_yaml, build_alt_df, version_order, category_colors,
    directiveness, headline_numbers, use_deterministic_ids, save_chart,
    SR_CLASS_COLORS, SENT_REGISTER_CLASSES, TABLEAU10,
)

# Replace random Altair / Styler IDs with a deterministic counter so re-runs
# produce byte-identical .ipynb outputs (no UUID churn in `git diff`).
use_deterministic_ids()

alt.data_transformers.disable_max_rows()

data              = load_yaml()                  # default: prompt_linguistic_analysis.yaml
alt_df            = build_alt_df(data)
HEADLINE          = headline_numbers(data)       # canonical corpus-wide numbers (see 05_headline_and_audit)
by_category       = data["by_category"]
corpus_block      = data["corpus"]
per_file_records  = data["files"]
cats              = list(by_category.keys())
VOCAB_KEYS        = list(data["lexicons"]["VOCAB"].keys())

# Composite directiveness column — formula in 13_correlation_directiveness;
# rendered there and on the timeline in 14_ccversion_trends.
alt_df["directiveness"] = directiveness(alt_df)

# Per-category palette + Altair encodings used across charts.
CATEGORY_COLORS = category_colors(cats)
_cat_domain     = cats
_cat_range      = [CATEGORY_COLORS[c] for c in cats]

print(f"loaded {len(per_file_records)} files | {alt_df.shape[1]} columns | {len(cats)} categories | {len(VOCAB_KEYS)} VOCAB keys")
loaded 290 files | 181 columns | 7 categories | 11 VOCAB keys

Emphasis density per category (3-panel)

ALL CAPS density / CAPS-imperative density / justification ratio per category, on independent x-scales (the three live on different magnitudes; comparing within a panel is the meaningful read). System reminders typically lead the ALL CAPS panel; Tool descriptions lead the CAPS-imperative panel; the justification-ratio panel runs low across all categories — most rules are issued without a stated reason.

Code
"""Emphasis: ALL CAPS, CAPS imperative, justification ratio per category — Altair."""

emphasis_long = pd.DataFrame([
    {"category": cat, "metric": metric, "value": value}
    for cat in cats
    for metric, value in [
        ("ALL CAPS (% tokens)",
            by_category[cat]["metrics"]["all_caps"]["pct"]),
        ("CAPS imperative (% tokens)",
            by_category[cat]["metrics"]["caps_imperative"]["pct"]),
        ("Justification ratio",
            by_category[cat]["metrics"]["justification"]["ratio"]),
    ]
])

emphasis_chart = (
    alt.Chart(emphasis_long)
    .mark_bar()
    .encode(
        x=alt.X("value:Q", title=None),
        y=alt.Y("category:N", sort=cats, title=None),
        color=alt.Color("category:N",
                         scale=alt.Scale(domain=cats,
                                         range=[CATEGORY_COLORS[c] for c in cats]),
                         legend=None),
        column=alt.Column("metric:N", title=None,
                            sort=["ALL CAPS (% tokens)",
                                  "CAPS imperative (% tokens)",
                                  "Justification ratio"]),
        tooltip=[alt.Tooltip("category:N"),
                 alt.Tooltip("metric:N"),
                 alt.Tooltip("value:Q", format=".3f")],
    )
    .resolve_scale(x="independent")
    .properties(width=240, height=240,
                title="Emphasis density per category (independent x-scales)")
)
save_chart(emphasis_chart, "11-emphasis-3panel")

Per-file outliers (text)

Three printed top-10 lists. Columns: n_tokens (file length), caps_imp_pct, hard_proh_pct (hard_prohibitions density), just_ratio (defined in 02_analyzers_vocab_emphasis). The third list filters to ≥150 tokens to suppress one-sentence outliers.

Code
"""Per-file outliers: highest CAPS-imperative density and lowest justification ratio."""

per_file_df = pd.DataFrame([
    {
        "path": r["path"],
        "category": r["category"],
        "n_tokens": r["n_tokens"],
        "imp_sent_pct":     r["metrics"]["sentence_register"]["imperative_sent_pct"],
        "caps_imp_pct":     r["metrics"]["caps_imperative"]["pct"],
        "all_caps_pct":     r["metrics"]["all_caps"]["pct"],
        "just_ratio":       r["metrics"]["justification"]["ratio"],
        "deontic_pct":      r["metrics"]["modality"]["deontic_pct"],
        "hard_proh_pct":    r["metrics"]["vocab"]["hard_prohibitions"]["pct"],
    }
    for r in per_file_records
])

print("--- 10 files with highest CAPS-imperative density (% of file tokens) ---")
print(per_file_df.nlargest(10, "caps_imp_pct")[["path", "category", "n_tokens", "caps_imp_pct"]].to_string(index=False))
print("\n--- 10 files with highest hard_prohibitions density (% of file tokens) ---")
print(per_file_df.nlargest(10, "hard_proh_pct")[["path", "category", "n_tokens", "hard_proh_pct"]].to_string(index=False))
print("\n--- 10 files with most explanatory tone (highest justification ratio, ≥150 tokens) ---")
big = per_file_df[per_file_df["n_tokens"] >= 150]
print(big.nlargest(10, "just_ratio")[["path", "category", "n_tokens", "just_ratio"]].to_string(index=False))
--- 10 files with highest CAPS-imperative density (% of file tokens) ---
                                                  path         category  n_tokens  caps_imp_pct
       tool-description-bash-sandbox-mandatory-mode.md Tool description        15        6.6667
                  tool-description-bash-no-newlines.md Tool description        16        6.2500
                         tool-description-websearch.md Tool description       241        2.4896
             system-prompt-chrome-browser-mcp-tools.md    System prompt        96        2.0833
       tool-description-bash-prefer-dedicated-tools.md Tool description        51        1.9608
tool-description-bash-prefer-dedicated-tools-bullet.md Tool description        52        1.9231
                system-prompt-tool-execution-denied.md    System prompt       128        1.5625
                     agent-prompt-quick-pr-creation.md     Agent prompt       414        1.4493
                              tool-description-edit.md Tool description       140        1.4286
                  system-reminder-btw-side-question.md  System reminder       215        1.3953

--- 10 files with highest hard_prohibitions density (% of file tokens) ---
                                                             path         category  n_tokens  hard_proh_pct
tool-description-bash-sandbox-evidence-operation-not-permitted.md Tool description        11         9.0909
                   tool-description-bash-sandbox-no-exceptions.md Tool description        11         9.0909
           tool-description-bash-sandbox-retry-without-sandbox.md Tool description        12         8.3333
                   tool-description-bash-sleep-run-immediately.md Tool description        14         7.1429
        system-prompt-one-of-six-rules-for-using-sleep-command.md    System prompt        15         6.6667
                             tool-description-bash-no-newlines.md Tool description        16         6.2500
              tool-description-bash-sandbox-default-to-sandbox.md Tool description        18         5.5556
       system-prompt-doing-tasks-no-unnecessary-error-handling.md    System prompt        55         5.4545
                         tool-description-bash-semicolon-usage.md Tool description        21         4.7619
              tool-description-bash-sandbox-no-sensitive-paths.md Tool description        22         4.5455

--- 10 files with most explanatory tone (highest justification ratio, ≥150 tokens) ---
                                             path        category  n_tokens  just_ratio
system-prompt-subagent-prompt-writing-examples.md   System prompt       369         2.5
         skill-loop-slash-command-dynamic-mode.md           Skill       291         2.0
    system-prompt-subagent-delegation-examples.md   System prompt       503         2.0
          agent-prompt-auto-mode-rule-reviewer.md    Agent prompt       242         1.0
    agent-prompt-bash-command-prefix-detection.md    Agent prompt       630         1.0
       agent-prompt-dream-memory-consolidation.md    Agent prompt       582         1.0
                      agent-prompt-worker-fork.md    Agent prompt       214         1.0
                 data-claude-api-reference-php.md Data / template       529         1.0
               data-live-documentation-sources.md Data / template      1638         1.0
                  skill-dream-nightly-schedule.md           Skill       257         1.0
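The just_ratio column reads as justification markers per rule word: above 1.0 means rules are explained more often than issued bare. A minimal sketch under assumed marker lists (hypothetical; the real definition lives in 02_analyzers_vocab_emphasis):

```python
import re

# Hypothetical marker lists -- illustrative only, not the producer's lexicons.
JUSTIFIERS = ("because", "so that", "since", "in order to")
RULE_WORDS = ("must", "never", "always", "do not", "don't")

def justification_ratio(text: str) -> float:
    """Justification markers per rule word (0.0 if no rule words)."""
    low = text.lower()
    count = lambda phrases: sum(
        len(re.findall(rf"\b{re.escape(p)}\b", low)) for p in phrases
    )
    rules = count(RULE_WORDS)
    return count(JUSTIFIERS) / rules if rules else 0.0

print(justification_ratio("You must do X because Y. Never do Z."))  # → 0.5
```

On this toy definition, one "because" against two rule words gives 0.5: one rule explained, one issued bare.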

Emphasis vocabulary: top ALL CAPS tokens, CAPS imperative tokens, and full VOCAB profile

The 11 VOCAB classes (hard_prohibitions, hard_prescriptions, soft_prescriptions, politeness_direct, politeness_softening, warmth_encouragement, hedging, structural_markers, profanity, pronouns_2p, pronouns_1p) plotted as a heatmap of % of file tokens per category, alongside the corpus’s top ALL CAPS tokens and curated CAPS-imperative tokens. Per-class glosses are in 02_analyzers_vocab_emphasis.

Code
"""Top-N tokens for ALL CAPS and CAPS imperative + full VOCAB heatmap."""

print("Top CAPS-imperative tokens (corpus-wide counts):")
for tok, n in HEADLINE["top_caps_imperative"]:
    print(f"  {tok:<10}  {n}")

top_caps = pd.DataFrame(corpus_block["metrics"]["all_caps"]["top"][:25])
top_caps_chart = (
    alt.Chart(top_caps)
    .mark_bar(color="#af7aa1")
    .encode(
        x=alt.X("count:Q", title="corpus-wide count"),
        y=alt.Y("token:N", sort="-x", title=None),
        tooltip=[alt.Tooltip("token:N"), alt.Tooltip("count:Q")],
    )
    .properties(width=320, height=380,
                title="Top 25 ALL CAPS tokens (TECH_ACRONYMS excluded)")
)

caps_imp_data = pd.DataFrame(
    [{"token": t, "count": c} for t, c in
     corpus_block["metrics"]["caps_imperative"]["hits"].items()]
)
caps_imp_chart = (
    alt.Chart(caps_imp_data)
    .mark_bar(color="#e15759")
    .encode(
        x=alt.X("count:Q", title="corpus-wide count"),
        y=alt.Y("token:N", sort="-x", title=None),
        tooltip=[alt.Tooltip("token:N"), alt.Tooltip("count:Q")],
    )
    .properties(width=320, height=380,
                title="CAPS imperative tokens (corpus-wide)")
)

vocab_long = []
for cat, b in by_category.items():
    for key, v in b["metrics"]["vocab"].items():
        vocab_long.append({"category": cat, "vocab_key": key, "pct": v["pct"]})
vocab_df_long = pd.DataFrame(vocab_long)

vocab_chart = (
    alt.Chart(vocab_df_long)
    .mark_rect()
    .encode(
        x=alt.X("vocab_key:N", title=None, sort=list(VOCAB_KEYS)),
        y=alt.Y("category:N", title=None),
        color=alt.Color("pct:Q", scale=alt.Scale(scheme="magma", reverse=True),
                         title="% of file tokens"),
        tooltip=[alt.Tooltip("category:N"),
                 alt.Tooltip("vocab_key:N"),
                 alt.Tooltip("pct:Q", format=".3f")],
    )
    .properties(width=720, height=260,
                title="Full VOCAB profile per category (% of file tokens)")
)

emphasis_top_tokens = (top_caps_chart | caps_imp_chart) & vocab_chart
save_chart(emphasis_top_tokens, "11-top-tokens-and-vocab-heatmap")
Top CAPS-imperative tokens (corpus-wide counts):
  IMPORTANT   36
  NEVER       26
  MUST        18
  CRITICAL    15
  • Top ALL CAPS tokens (left): a small cluster of emphasis-of-rule words dominates among the non-acronyms (IMPORTANT at 36 hits, NEVER at 26, MUST at 18, CRITICAL at 15). This is emphatic typography used as a weight-bearing rhetorical device, not a pile of technical acronyms.
  • CAPS imperative tokens (right): the curated subset of ALL CAPS tokens that are also command words. The same emphasis cluster tops the list — same words doing emphatic-command work, not the side-effect of sentence-initial capitalization.
  • Full VOCAB heatmap (bottom): Tool descriptions rows light up on hard_prohibitions, hard_prescriptions, and pronouns_2p; Skill files light up on pronouns_2p. The warmth_encouragement column stays consistently dim across every category — encouragement vocabulary is structurally absent.