Correlation matrix and composite directiveness ranking

Three views: a Pearson-correlation heatmap across ~30 per-file metrics; a per-file ranking by a single composite directiveness z-score combining eight emphatic / softening features; and a per-word vs per-sentence unit comparison. The composite is the single number used downstream (15_rule_explanation) to identify the corpus’s most authoritative prose; this notebook is where it is defined and computed.

Three views below:

  1. Cross-metric correlation matrix — Pearson r across ~30 per-file metrics.
  2. Top-25 most directive prompts by composite directiveness z-score.
  3. Per-word vs per-sentence comparison — six representative metrics in their two unit views (pct vs per_sent).
Code
"""Setup: load YAML data + flat alt_df, derive helper bindings used by every chart cell."""
import importlib
import altair as alt
import pandas as pd

import prompt_analysis
importlib.reload(prompt_analysis)
from prompt_analysis import (
    load_yaml, build_alt_df, version_order, category_colors,
    directiveness, headline_numbers, qualitative_phrases, bind_inline_vars,
    use_deterministic_ids, save_chart,
    SR_CLASS_COLORS, SENT_REGISTER_CLASSES, TABLEAU10,
)

# Replace random Altair / Styler IDs with a deterministic counter so re-runs
# produce byte-identical .ipynb outputs (no UUID churn in `git diff`).
use_deterministic_ids()

alt.data_transformers.disable_max_rows()

data              = load_yaml()
alt_df            = build_alt_df(data)
parquet           = pd.read_parquet("sentences_classified.parquet")
by_category       = data["by_category"]
corpus_block      = data["corpus"]
per_file_records  = data["files"]
cats              = list(by_category.keys())
VOCAB_KEYS        = list(data["lexicons"]["VOCAB"].keys())

# Composite directiveness column.
alt_df["directiveness"] = directiveness(alt_df)

HEADLINE = headline_numbers(data, alt_df=alt_df, parquet=parquet)
PHRASES  = qualitative_phrases(HEADLINE, alt_df=alt_df, parquet=parquet)

globals().update(bind_inline_vars(HEADLINE, PHRASES))

CATEGORY_COLORS = category_colors(cats)
_cat_domain     = cats
_cat_range      = [CATEGORY_COLORS[c] for c in cats]

print(f"loaded {len(per_file_records)} files | {alt_df.shape[1]} columns | {len(cats)} categories | {len(VOCAB_KEYS)} VOCAB keys")
print(f"composite directiveness range: "
      f"{HEADLINE['composite_directiveness_min']:.2f} to {HEADLINE['composite_directiveness_max']:.2f}")
loaded 290 files | 181 columns | 7 categories | 11 VOCAB keys
composite directiveness range: -19.54 to 19.21
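
The deterministic-ID call is what keeps re-runs byte-identical. What use_deterministic_ids does internally is not shown here; as a minimal sketch of the idea only (an assumption, not the prompt_analysis implementation), a module-level counter can feed pandas' Styler.set_uuid so every re-run emits the same element ids:

# Illustrative sketch only — the real mechanism lives in prompt_analysis.
# Replace randomly generated ids with a monotonically increasing counter so
# re-running the notebook never changes the serialized output.
import itertools

_ID_COUNTER = itertools.count()

def stable_styler(styler):
    """Give a pandas Styler a deterministic uuid (Styler.set_uuid is standard pandas)."""
    return styler.set_uuid(f"t{next(_ID_COUNTER)}")

# usage: display(stable_styler(some_df.style.background_gradient(...)))  # some_df is hypothetical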

Terms used

  • Pearson correlation (r) — linear-relationship strength across per-file rows. Range −1 to +1; |r| ≥ 0.7 is conventionally strong. In the heatmap below, hue encodes the sign (red for positive, blue for negative) and intensity encodes |r|.

  • z-score — standard deviations from the column mean. Used as the building block for composite directiveness so metrics measured on different units (% of tokens vs % of sentences) add on a common scale.

  • Composite directiveness z-score — a single per-file authority score combining eight z-scored metrics:

    directiveness =
      z(mood_marker_pct)
      + z(hard_prohibitions_pct)
      + z(caps_imp_pct)
      + z(directive_sent_pct)
      + z(configuring_sent_pct)
      − z(collaborative_sent_pct)
      − z(permissive_sent_pct)
      − z(appreciative_sent_pct)

    Higher = more authoritative tone. The first five summands are emphatic features (added); the last three are softening features (subtracted). Observed corpus range: −19.54 to +19.21; the bash-sandbox tool descriptions cluster at the top end. A negative score just means “less directive than the corpus average”.

  • Per-word vs per-sentence units — every density metric reports both pct (% of word tokens) and per_sent (rate per sentence); see 00_setup_and_corpus.
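
A toy illustration of why the z-scoring step matters (the values are made up; the column names mirror two real metrics measured on very different scales):

import pandas as pd

toy = pd.DataFrame({
    "caps_imp_pct":       [0.0, 0.5, 1.0, 6.25],      # % of word tokens: small numbers
    "directive_sent_pct": [10.0, 40.0, 60.0, 100.0],  # % of sentences: much larger scale
})
z = (toy - toy.mean()) / toy.std(ddof=0)
print(z.round(2))  # both columns now sit on a common zero-mean, unit-variance scale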


Observation (Claude)

The composite directiveness z-score is a useful summary, but the formula is ad-hoc by construction — eight metrics, equally weighted, hand-picked. That the bash-sandbox tool descriptions top the chart is a robust result (those files are the most directive prompts in the corpus by any reasonable reading), but the precise z-score values shouldn’t be quoted as if they had measurement-grade meaning. The defensible reading is “the bash-sandbox family ranks at the top under any reasonable composite of imperative density / prohibition density / CAPS density / directive-sentence density”; the exact corpus range (currently −19.54 to +19.21, printed by the setup cell) is a summary number, not a calibrated metric. Treat the ranking as ordinal — who’s more directive than whom — and discount the cardinal magnitudes; a quick rank-robustness sketch follows.
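
A hedged sketch of that robustness check, assuming the eight component columns are present in alt_df as listed above (illustration only, not part of the pipeline): randomly perturb the equal weights and measure how much the file ordering moves.

# Rank-robustness sketch: equal weights vs randomly perturbed weights.
import numpy as np
import pandas as pd

def _z(s):
    s = s.astype(float)
    return (s - s.mean()) / (s.std(ddof=0) or 1.0)

_COMPONENTS = {
    "mood_marker_pct": +1, "hard_prohibitions_pct": +1, "caps_imp_pct": +1,
    "directive_sent_pct": +1, "configuring_sent_pct": +1,
    "collaborative_sent_pct": -1, "permissive_sent_pct": -1, "appreciative_sent_pct": -1,
}

def _composite(df, jitter=0.0, seed=0):
    rng = np.random.default_rng(seed)
    score = pd.Series(0.0, index=df.index)
    for col, sign in _COMPONENTS.items():
        weight = sign * (1.0 + jitter * rng.uniform(-1, 1))
        score = score + weight * _z(df[col])
    return score

# A Spearman rho near 1.0 would support reading the composite as ordinal.
rho = _composite(alt_df).corr(_composite(alt_df, jitter=0.5, seed=1), method="spearman")
print(f"Spearman rho, equal vs perturbed weights: {rho:.3f}")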


Cross-metric correlation matrix

Each cell is the Pearson r between two metrics, computed across all 290 files. Red = positive correlation; blue = negative; intensity encodes |r|. Conventional thresholds: |r| ≥ 0.3 is “moderate”, |r| ≥ 0.7 is “strong”.

Welfare-relevant pattern to watch: the imperative-marker / hard-prohibition / deontic-modality cluster should form a tight red block (these dimensions all measure rule-bearing language from different angles). The justification ratio should sit isolated or correlate negatively with the directive cluster — more justification → less raw directive density.

Code
"""Cross-metric correlation matrix across the main per-file numeric metrics.

Updated for the new schema:
  - dropped: mood_imperative_sent_pct (gone after Tier-3 v2 6b refactor — replaced
    by streak_max / streak_n_ge3 / streak_n_ge5).
  - added: rule_density, rule_explained_para_pct, judgment_to_procedural_ratio,
    threat_share, pct_anthropomorphic, prohibition_to_prescription_ratio (the
    Tier-3 derived columns).
"""

corr_cols = [
    # core directiveness signals
    "mood_marker_pct", "hard_prohibitions_pct", "caps_imp_pct",
    "directive_sent_pct", "configuring_sent_pct",
    "imperative_sent_pct",
    # softening / dialogic
    "collaborative_sent_pct", "permissive_sent_pct", "appreciative_sent_pct",
    "dialogic_pct",
    # register and stance
    "ttr", "f_score", "mean_sent_len", "dep_depth",
    "directive_pct", "expository_pct",
    "positive_evaluative_pct", "negative_evaluative_pct",
    # vocabulary
    "hedging_pct", "structural_markers_pct", "pronouns_2p_pct", "pronouns_1p_pct",
    # justification
    "just_pct", "just_ratio",
    # tier-3 welfare extensions
    "rule_density", "rule_explained_para_pct",
    "judgment_to_procedural_ratio", "threat_share",
    "pct_anthropomorphic", "prohibition_to_prescription_ratio",
    # composite
    "directiveness",
]
corr_cols = [c for c in corr_cols if c in alt_df.columns]
corr_df = alt_df[corr_cols].astype(float).corr().round(3).reset_index().melt(id_vars="index")
corr_df.columns = ["x", "y", "r"]

corr_chart = (
    alt.Chart(corr_df)
    .mark_rect()
    .encode(
        x=alt.X("x:N", sort=corr_cols, title=None,
                axis=alt.Axis(labelAngle=-45, labelLimit=130)),
        y=alt.Y("y:N", sort=corr_cols, title=None,
                axis=alt.Axis(labelLimit=130)),
        color=alt.Color("r:Q",
                         # reverse the redblue scheme so positive r renders red and
                         # negative r renders blue, matching the legend described above
                         scale=alt.Scale(scheme="redblue", domain=[-1, 1], reverse=True),
                         title="Pearson r"),
        tooltip=[alt.Tooltip("x:N"), alt.Tooltip("y:N"),
                 alt.Tooltip("r:Q", format=".2f")],
    )
    .properties(width=620, height=620,
                title=f"Cross-metric correlation (Pearson r) — {len(corr_cols)} per-file metrics")
)
save_chart(corr_chart, "13-correlation-matrix")

The positive_evaluative_pct row/column plots the union of quality and emphasis words; the split is defined in 01_analyzers_register. The positive correlation between positive_evaluative_pct and directive_pct / mood_marker_pct is partly driven by the emphasis-of-rule subset (important, critical, essential) co-occurring with rule-bearing language — not purely “encouraging tone tracks directiveness”.
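
To check the expected clusters numerically rather than by eye, the long-format corr_df built in the cell above can be filtered at the conventional 0.7 threshold (a small follow-up, not a saved chart):

# Strongest off-diagonal pairs, deduplicated: keeping x < y drops the diagonal
# and one half of each symmetric pair.
strong = (corr_df[(corr_df["x"] < corr_df["y"]) & (corr_df["r"].abs() >= 0.7)]
          .sort_values("r", key=lambda s: s.abs(), ascending=False))
print(strong.to_string(index=False))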

Top-25 most directive prompts

Composite directiveness z-score (formula in the Terms used block above). A negative score just means “less directive than the corpus average”. The bash-sandbox tool descriptions cluster at the top end.

Code
"""Top-25 most directive files by extended composite z-score.

Updated formula adds the new sentence_register signals:
  + directive_sent_pct      (raises authority)
  + configuring_sent_pct    (raises authority)
  - collaborative_sent_pct  (softens — interpersonal)
  - permissive_sent_pct     (softens — reader-agency / "you can")
  - appreciative_sent_pct   (softens — gratitude / praise)
"""

def zscore(s):
    s = s.astype(float)
    return (s - s.mean()) / (s.std(ddof=0) or 1.0)

alt_df["directiveness"] = (
    zscore(alt_df["mood_marker_pct"])
    + zscore(alt_df["hard_prohibitions_pct"])
    + zscore(alt_df["caps_imp_pct"])
    + zscore(alt_df["directive_sent_pct"])
    + zscore(alt_df["configuring_sent_pct"])
    - zscore(alt_df["collaborative_sent_pct"])
    - zscore(alt_df["permissive_sent_pct"])
    - zscore(alt_df["appreciative_sent_pct"])
)

top25 = (alt_df.nlargest(25, "directiveness")
              [["path", "category", "n_tokens", "directiveness",
                "mood_marker_pct", "hard_prohibitions_pct",
                "caps_imp_pct",
                "directive_sent_pct", "configuring_sent_pct",
                "collaborative_sent_pct", "permissive_sent_pct",
                "appreciative_sent_pct",
                "just_ratio"]]
              .reset_index(drop=True))

print(f"composite directiveness range: "
      f"{alt_df['directiveness'].min():.2f}{alt_df['directiveness'].max():.2f}")
display(top25.style.background_gradient(subset=["directiveness"], cmap="Reds")
                   .format({"directiveness": "{:.2f}",
                            "mood_marker_pct": "{:.2f}",
                            "hard_prohibitions_pct": "{:.2f}",
                            "caps_imp_pct": "{:.2f}",
                            "directive_sent_pct": "{:.1f}",
                            "configuring_sent_pct": "{:.1f}",
                            "collaborative_sent_pct": "{:.1f}",
                            "permissive_sent_pct": "{:.1f}",
                            "appreciative_sent_pct": "{:.1f}",
                            "just_ratio": "{:.2f}"}))

top25_chart = (
    alt.Chart(top25)
    .mark_bar()
    .encode(
        x=alt.X("directiveness:Q", title="Composite z-score (higher = more directive)"),
        y=alt.Y("path:N", sort="-x", title=None,
                axis=alt.Axis(labelLimit=320)),
        color=alt.Color("category:N",
                         scale=alt.Scale(domain=_cat_domain, range=_cat_range)),
        tooltip=[
            alt.Tooltip("path:N"),
            alt.Tooltip("category:N"),
            alt.Tooltip("n_tokens:Q", format=","),
            alt.Tooltip("directiveness:Q", format=".2f"),
            alt.Tooltip("mood_marker_pct:Q",
                         title="imperative markers %", format=".2f"),
            alt.Tooltip("hard_prohibitions_pct:Q",
                         title="hard prohibitions %", format=".2f"),
            alt.Tooltip("caps_imp_pct:Q",
                         title="CAPS imperative %", format=".2f"),
            alt.Tooltip("directive_sent_pct:Q",
                         title="directive sent %", format=".1f"),
            alt.Tooltip("configuring_sent_pct:Q",
                         title="configuring sent %", format=".1f"),
            alt.Tooltip("collaborative_sent_pct:Q",
                         title="collab sent %", format=".1f"),
            alt.Tooltip("permissive_sent_pct:Q",
                         title="permissive sent %", format=".1f"),
            alt.Tooltip("just_ratio:Q", title="justification ratio", format=".2f"),
        ],
    )
    .properties(width=560, height=560,
                title="Top 25 most directive prompts (extended z-summed composite)")
)
save_chart(top25_chart, "13-top25-directive-prompts")
composite directiveness range: -19.54  ↔  19.21
  path category n_tokens directiveness mood_marker_pct hard_prohibitions_pct caps_imp_pct directive_sent_pct configuring_sent_pct collaborative_sent_pct permissive_sent_pct appreciative_sent_pct just_ratio
0 tool-description-bash-no-newlines.md Tool description 16 19.21 6.25 6.25 6.25 100.0 0.0 0.0 0.0 0.0 0.00
1 tool-description-bash-sandbox-default-to-sandbox.md Tool description 18 18.44 16.67 5.56 0.00 100.0 50.0 0.0 0.0 0.0 0.00
2 tool-description-bash-sandbox-mandatory-mode.md Tool description 15 15.40 6.67 0.00 6.67 100.0 0.0 0.0 0.0 0.0 0.00
3 tool-description-bash-sandbox-retry-without-sandbox.md Tool description 12 12.22 8.33 8.33 0.00 100.0 0.0 0.0 0.0 0.0 0.00
4 tool-description-bash-sleep-run-immediately.md Tool description 14 10.73 7.14 7.14 0.00 100.0 0.0 0.0 0.0 0.0 0.00
5 tool-description-bash-semicolon-usage.md Tool description 21 10.16 9.52 4.76 0.00 100.0 0.0 0.0 0.0 0.0 0.00
6 system-prompt-one-of-six-rules-for-using-sleep-command.md System prompt 15 10.14 6.67 6.67 0.00 100.0 0.0 0.0 0.0 0.0 0.00
7 system-prompt-tone-and-style-concise-output-short.md System prompt 8 9.99 0.00 0.00 0.00 100.0 100.0 0.0 0.0 0.0 0.00
8 tool-description-bash-sandbox-no-exceptions.md Tool description 11 9.24 9.09 9.09 0.00 0.0 0.0 0.0 0.0 0.0 0.00
9 tool-description-bash-sandbox-tmpdir.md Tool description 33 8.54 6.06 3.03 0.00 66.7 33.3 0.0 0.0 0.0 0.00
10 tool-description-bash-sleep-keep-short.md Tool description 15 8.53 13.33 0.00 0.00 100.0 0.0 0.0 0.0 0.0 0.33
11 tool-description-sendmessagetool.md Tool description 149 7.75 4.70 3.36 0.67 75.0 12.5 0.0 0.0 0.0 0.00
12 tool-description-bash-sandbox-no-sensitive-paths.md Tool description 22 7.49 4.55 4.55 0.00 100.0 0.0 0.0 0.0 0.0 0.00
13 tool-description-bash-sleep-no-polling-background-tasks.md Tool description 22 7.49 4.55 4.55 0.00 100.0 0.0 0.0 0.0 0.0 0.00
14 tool-description-bash-sleep-use-check-commands.md Tool description 20 6.85 10.00 0.00 0.00 100.0 0.0 0.0 0.0 0.0 0.00
15 system-prompt-avoiding-unnecessary-sleep-commands-part-of-powershell-tool-description.md System prompt 132 6.18 5.30 2.27 0.00 100.0 0.0 0.0 0.0 0.0 0.12
16 tool-description-websearch.md Tool description 241 5.57 3.73 0.41 2.49 42.9 0.0 0.0 0.0 0.0 0.00
17 system-reminder-plan-mode-is-active-subagent.md System reminder 177 5.19 4.52 1.13 1.13 62.5 0.0 0.0 0.0 0.0 0.11
18 agent-prompt-quick-git-commit.md Agent prompt 373 5.16 2.95 2.14 1.34 54.5 0.0 0.0 0.0 0.0 0.08
19 tool-description-askuserquestion-preview-field.md Tool description 104 4.91 1.92 0.96 0.00 66.7 33.3 0.0 0.0 0.0 0.00
20 tool-description-bash-sandbox-evidence-operation-not-permitted.md Tool description 11 4.66 0.00 9.09 0.00 0.0 0.0 0.0 0.0 0.0 0.00
21 tool-description-edit.md Tool description 140 4.41 3.57 1.43 1.43 37.5 0.0 0.0 0.0 0.0 0.00
22 system-reminder-plan-mode-approval-tool-enforcement.md System reminder 169 4.41 3.55 1.18 0.59 75.0 0.0 0.0 0.0 0.0 0.00
23 tool-description-webfetch.md Tool description 248 4.25 0.81 0.00 0.40 66.7 33.3 0.0 0.0 0.0 0.00
24 system-prompt-doing-tasks-no-unnecessary-error-handling.md System prompt 55 3.96 5.45 5.45 0.00 50.0 0.0 0.0 25.0 0.0 0.00
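
The highest-scoring rows are very short files (8–33 tokens), where sentence-level percentages can only land on 0 or 100. A hedged robustness check is to re-rank after dropping the tiniest files and confirm the bash-sandbox family still dominates; the 50-token threshold below is arbitrary and purely illustrative:

# Robustness check (not part of the pipeline): drop the tiniest files and re-rank.
MIN_TOKENS = 50  # arbitrary illustrative threshold
top_longer = (alt_df[alt_df["n_tokens"] >= MIN_TOKENS]
              .nlargest(25, "directiveness")
              [["path", "category", "n_tokens", "directiveness"]]
              .reset_index(drop=True))
print(top_longer.head(10).to_string(index=False))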

Per-word vs per-sentence comparison

Six representative density metrics in their two unit views (pct vs per_sent). A category with longer, denser sentences can show a similar pct to a category with shorter sentences yet a higher per_sent, because each sentence packs in more matches. The chart places the two unit views side by side, one metric per row, so the divergence between the normalizations becomes legible: read across a row to compare the same metric under both units.

Code
"""Per-word vs per-sentence comparison: vconcat of hconcat rows, one metric per row.

Each row is one metric; left column shows the per-word `% of file tokens` view,
right column shows the `per-sentence` rate view. Same data, two units, side-by-side
so the reader can directly compare what the same metric looks like under each
normalisation.
"""

# Six representative metrics (one per dimension), with their (pct_col, per_sent_col) pair
metric_pairs = [
    ("Imperative markers",  "mood_marker_pct",        "mood_marker_per_sent"),
    ("Hard prohibitions",   "hard_prohibitions_pct",  "hard_prohibitions_per_sent"),
    ("Directive stance",    "directive_pct",          "directive_per_sent"),
    ("Hedging",             "hedging_pct",            "hedging_per_sent"),
    ("CAPS imperative",     "caps_imp_pct",           "caps_imp_per_sent"),
    ("Justifications",      "just_pct",               "just_per_sent"),
]

# Build long dataframe (same as before)
rows = []
for cat in cats:
    sub = alt_df[alt_df["category"] == cat]
    for label, pct_col, ps_col in metric_pairs:
        rows.append({"category": cat, "metric": label, "unit": "% of words",
                     "value": round(sub[pct_col].mean(), 4)})
        rows.append({"category": cat, "metric": label, "unit": "per sentence",
                     "value": round(sub[ps_col].mean(), 4)})
ws_df = pd.DataFrame(rows)


def _ws_panel(metric_label, unit, x_title):
    sub = ws_df[(ws_df["metric"] == metric_label) & (ws_df["unit"] == unit)]
    color = "#4e79a7" if unit == "% of words" else "#e15759"
    return (
        alt.Chart(sub)
        .mark_bar(color=color, opacity=0.85)
        .encode(
            x=alt.X("value:Q", title=x_title),
            y=alt.Y("category:N", sort="-x", title=None),
            tooltip=[
                alt.Tooltip("category:N"),
                alt.Tooltip("metric:N"),
                alt.Tooltip("unit:N"),
                alt.Tooltip("value:Q", format=".3f"),
            ],
        )
        .properties(width=320, height=160,
                    title=f"{metric_label}{unit}")
    )


ws_rows = []
for label, _pct_col, _ps_col in metric_pairs:
    row = alt.hconcat(
        _ws_panel(label, "% of words",   "% of file tokens"),
        _ws_panel(label, "per sentence", "matches per sentence"),
    ).resolve_scale(x="independent")
    ws_rows.append(row)

per_word_vs_sentence = alt.vconcat(*ws_rows).resolve_scale(color="independent").properties(
    title=alt.TitleParams(
        "Per-word vs per-sentence comparison — six metrics, two unit views",
        subtitle=["Left: % of file tokens. Right: matches per sentence. "
                  "Same per-category sort within each panel."],
        anchor="start",
    )
)
save_chart(per_word_vs_sentence, "13-per-word-vs-per-sentence")
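
To make the divergence between the two units concrete, one more small sketch (assuming alt_df carries mean_sent_len, which appears in the correlation-matrix column list above) lines up each category's mean sentence length against the per_sent / pct ratio for a single metric:

# Sketch: categories with longer sentences should show a larger per_sent / pct ratio.
ratio = (alt_df.groupby("category")[["mean_sent_len",
                                     "mood_marker_pct",
                                     "mood_marker_per_sent"]]
         .mean())
ratio["per_sent_over_pct"] = ratio["mood_marker_per_sent"] / ratio["mood_marker_pct"]
print(ratio.round(2).sort_values("mean_sent_len").to_string())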