Correlation matrix and composite directiveness ranking

Three views: a Pearson-correlation heatmap across ~30 per-file metrics; a per-file ranking by a single composite directiveness z-score combining eight emphatic / softening features; and a per-word vs per-sentence unit comparison. The composite is the single number used downstream (15_rule_explanation) to identify the corpus’s most authoritative prose; this notebook is where it is defined and computed.

Three views below:

  1. Cross-metric correlation matrix — Pearson r across ~30 per-file metrics.
  2. Top-25 most directive prompts by composite directiveness z-score.
  3. Per-word vs per-sentence comparison — six representative metrics in their two unit views (pct vs per_sent).
Code
"""Setup: load YAML data + flat alt_df, derive helper bindings used by every chart cell."""
import importlib
import altair as alt
import pandas as pd

import prompt_analysis
importlib.reload(prompt_analysis)
from prompt_analysis import (
    load_yaml, build_alt_df, version_order, category_colors,
    directiveness, headline_numbers, qualitative_phrases, bind_inline_vars,
    use_deterministic_ids, save_chart,
    SR_CLASS_COLORS, SENT_REGISTER_CLASSES, TABLEAU10,
)

# Replace random Altair / Styler IDs with a deterministic counter so re-runs
# produce byte-identical .ipynb outputs (no UUID churn in `git diff`).
use_deterministic_ids()

alt.data_transformers.disable_max_rows()

data              = load_yaml()
alt_df            = build_alt_df(data)
parquet           = pd.read_parquet("sentences_classified.parquet")
by_category       = data["by_category"]
corpus_block      = data["corpus"]
per_file_records  = data["files"]
cats              = list(by_category.keys())
VOCAB_KEYS        = list(data["lexicons"]["VOCAB"].keys())

# Composite directiveness column.
alt_df["directiveness"] = directiveness(alt_df)

HEADLINE = headline_numbers(data, alt_df=alt_df, parquet=parquet)
PHRASES  = qualitative_phrases(HEADLINE, alt_df=alt_df, parquet=parquet)

globals().update(bind_inline_vars(HEADLINE, PHRASES))

CATEGORY_COLORS = category_colors(cats)
_cat_domain     = cats
_cat_range      = [CATEGORY_COLORS[c] for c in cats]

print(f"loaded {len(per_file_records)} files | {alt_df.shape[1]} columns | {len(cats)} categories | {len(VOCAB_KEYS)} VOCAB keys")
print(f"composite directiveness range: "
      f"{HEADLINE['composite_directiveness_min']:.2f} to {HEADLINE['composite_directiveness_max']:.2f}")
loaded 290 files | 181 columns | 7 categories | 11 VOCAB keys
composite directiveness range: -19.54 to 19.21
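
The deterministic-ID call is what keeps re-runs byte-identical. What use_deterministic_ids does internally is not shown here; as a minimal sketch of the idea only (an assumption, not the prompt_analysis implementation), a module-level counter can feed pandas' Styler.set_uuid so every re-run emits the same element ids:

# Illustrative sketch only — the real mechanism lives in prompt_analysis.
# Replace randomly generated ids with a monotonically increasing counter so
# re-running the notebook never changes the serialized output.
import itertools

_ID_COUNTER = itertools.count()

def stable_styler(styler):
    """Give a pandas Styler a deterministic uuid (Styler.set_uuid is standard pandas)."""
    return styler.set_uuid(f"t{next(_ID_COUNTER)}")

# usage: display(stable_styler(some_df.style.background_gradient(...)))  # some_df is hypothetical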

Terms used

  • Pearson correlation (r) — linear-relationship strength across per-file rows. Range −1 to +1; |r| ≥ 0.7 is conventionally strong. In the heatmap below, hue encodes the sign (red for positive, blue for negative) and intensity encodes |r|.

  • z-score — standard deviations from the column mean. Used as the building block for composite directiveness so metrics measured on different units (% of tokens vs % of sentences) add on a common scale.

  • Composite directiveness z-score — a single per-file authority score combining eight z-scored metrics:

    directiveness =
      z(mood_marker_pct)
      + z(hard_prohibitions_pct)
      + z(caps_imp_pct)
      + z(directive_sent_pct)
      + z(configuring_sent_pct)
      − z(collaborative_sent_pct)
      − z(permissive_sent_pct)
      − z(appreciative_sent_pct)

    Higher = more authoritative tone. The first five summands are emphatic features (added); the last three are softening features (subtracted). Observed corpus range: −19.54 to +19.21; the bash-sandbox tool descriptions cluster at the top end. A negative score just means “less directive than the corpus average”.

  • Per-word vs per-sentence units — every density metric reports both pct (% of word tokens) and per_sent (rate per sentence); see 00_setup_and_corpus.
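
A toy illustration of why the z-scoring step matters (the values are made up; the column names mirror two real metrics measured on very different scales):

import pandas as pd

toy = pd.DataFrame({
    "caps_imp_pct":       [0.0, 0.5, 1.0, 6.25],      # % of word tokens: small numbers
    "directive_sent_pct": [10.0, 40.0, 60.0, 100.0],  # % of sentences: much larger scale
})
z = (toy - toy.mean()) / toy.std(ddof=0)
print(z.round(2))  # both columns now sit on a common zero-mean, unit-variance scale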


Observation (Claude)

The composite directiveness z-score is a useful summary, but the formula is ad-hoc by construction — eight metrics, equally weighted, hand-picked. That the bash-sandbox tool descriptions top the chart is a robust result (those files are the most directive prompts in the corpus by any reasonable reading), but the precise z-score values shouldn’t be quoted as if they had measurement-grade meaning. The defensible reading is “the bash-sandbox family ranks at the top under any reasonable composite of imperative density / prohibition density / CAPS density / directive-sentence density”; the exact corpus range (currently −19.54 to +19.21, printed by the setup cell) is a summary number, not a calibrated metric. Treat the ranking as ordinal — who’s more directive than whom — and discount the cardinal magnitudes; a quick rank-robustness sketch follows.
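
A hedged sketch of that robustness check, assuming the eight component columns are present in alt_df as listed above (illustration only, not part of the pipeline): randomly perturb the equal weights and measure how much the file ordering moves.

# Rank-robustness sketch: equal weights vs randomly perturbed weights.
import numpy as np
import pandas as pd

def _z(s):
    s = s.astype(float)
    return (s - s.mean()) / (s.std(ddof=0) or 1.0)

_COMPONENTS = {
    "mood_marker_pct": +1, "hard_prohibitions_pct": +1, "caps_imp_pct": +1,
    "directive_sent_pct": +1, "configuring_sent_pct": +1,
    "collaborative_sent_pct": -1, "permissive_sent_pct": -1, "appreciative_sent_pct": -1,
}

def _composite(df, jitter=0.0, seed=0):
    rng = np.random.default_rng(seed)
    score = pd.Series(0.0, index=df.index)
    for col, sign in _COMPONENTS.items():
        weight = sign * (1.0 + jitter * rng.uniform(-1, 1))
        score = score + weight * _z(df[col])
    return score

# A Spearman rho near 1.0 would support reading the composite as ordinal.
rho = _composite(alt_df).corr(_composite(alt_df, jitter=0.5, seed=1), method="spearman")
print(f"Spearman rho, equal vs perturbed weights: {rho:.3f}")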


Cross-metric correlation matrix

Each cell is the Pearson r between two metrics, computed across all 290 files. Red = positive correlation; blue = negative; intensity encodes |r|. Conventional thresholds: |r| ≥ 0.3 is “moderate”, |r| ≥ 0.7 is “strong”.

Welfare-relevant pattern to watch: the imperative-marker / hard-prohibition / deontic-modality cluster should form a tight red block (these dimensions all measure rule-bearing language from different angles). The justification ratio should sit isolated or correlate negatively with the directive cluster — more justification → less raw directive density.

Code
"""Cross-metric correlation matrix across the main per-file numeric metrics.

Updated for the new schema:
  - dropped: mood_imperative_sent_pct (gone after Tier-3 v2 6b refactor — replaced
    by streak_max / streak_n_ge3 / streak_n_ge5).
  - added: rule_density, rule_explained_para_pct, judgment_to_procedural_ratio,
    threat_share, pct_anthropomorphic, prohibition_to_prescription_ratio (the
    Tier-3 derived columns).
"""

corr_cols = [
    # core directiveness signals
    "mood_marker_pct", "hard_prohibitions_pct", "caps_imp_pct",
    "directive_sent_pct", "configuring_sent_pct",
    "imperative_sent_pct",
    # softening / dialogic
    "collaborative_sent_pct", "permissive_sent_pct", "appreciative_sent_pct",
    "dialogic_pct",
    # register and stance
    "ttr", "f_score", "mean_sent_len", "dep_depth",
    "directive_pct", "expository_pct",
    "positive_evaluative_pct", "negative_evaluative_pct",
    # vocabulary
    "hedging_pct", "structural_markers_pct", "pronouns_2p_pct", "pronouns_1p_pct",
    # justification
    "just_pct", "just_ratio",
    # tier-3 welfare extensions
    "rule_density", "rule_explained_para_pct",
    "judgment_to_procedural_ratio", "threat_share",
    "pct_anthropomorphic", "prohibition_to_prescription_ratio",
    # composite
    "directiveness",
]
corr_cols = [c for c in corr_cols if c in alt_df.columns]
corr_df = alt_df[corr_cols].astype(float).corr().round(3).reset_index().melt(id_vars="index")
corr_df.columns = ["x", "y", "r"]

corr_chart = (
    alt.Chart(corr_df)
    .mark_rect()
    .encode(
        x=alt.X("x:N", sort=corr_cols, title=None,
                axis=alt.Axis(labelAngle=-45, labelLimit=130)),
        y=alt.Y("y:N", sort=corr_cols, title=None,
                axis=alt.Axis(labelLimit=130)),
        color=alt.Color("r:Q",
                         # reverse the redblue scheme so positive r renders red and
                         # negative r renders blue, matching the legend described above
                         scale=alt.Scale(scheme="redblue", domain=[-1, 1], reverse=True),
                         title="Pearson r"),
        tooltip=[alt.Tooltip("x:N"), alt.Tooltip("y:N"),
                 alt.Tooltip("r:Q", format=".2f")],
    )
    .properties(width=620, height=620,
                title=f"Cross-metric correlation (Pearson r) — {len(corr_cols)} per-file metrics")
)
save_chart(corr_chart, "13-correlation-matrix")

The positive_evaluative_pct row/column plots the union of quality and emphasis words; the split is defined in 01_analyzers_register. The positive correlation between positive_evaluative_pct and directive_pct / mood_marker_pct is partly driven by the emphasis-of-rule subset (important, critical, essential) co-occurring with rule-bearing language — not purely “encouraging tone tracks directiveness”.
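
To check the expected clusters numerically rather than by eye, the long-format corr_df built in the cell above can be filtered at the conventional 0.7 threshold (a small follow-up, not a saved chart):

# Strongest off-diagonal pairs, deduplicated: keeping x < y drops the diagonal
# and one half of each symmetric pair.
strong = (corr_df[(corr_df["x"] < corr_df["y"]) & (corr_df["r"].abs() >= 0.7)]
          .sort_values("r", key=lambda s: s.abs(), ascending=False))
print(strong.to_string(index=False))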

Top-25 most directive prompts

Composite directiveness z-score (formula in the Terms used block above). A negative score just means “less directive than the corpus average”. The bash-sandbox tool descriptions cluster at the top end.

Code
"""Top-25 most directive files by extended composite z-score.

Updated formula adds the new sentence_register signals:
  + directive_sent_pct      (raises authority)
  + configuring_sent_pct    (raises authority)
  - collaborative_sent_pct  (softens — interpersonal)
  - permissive_sent_pct     (softens — reader-agency / "you can")
  - appreciative_sent_pct   (softens — gratitude / praise)
"""

def zscore(s):
    s = s.astype(float)
    return (s - s.mean()) / (s.std(ddof=0) or 1.0)

alt_df["directiveness"] = (
    zscore(alt_df["mood_marker_pct"])
    + zscore(alt_df["hard_prohibitions_pct"])
    + zscore(alt_df["caps_imp_pct"])
    + zscore(alt_df["directive_sent_pct"])
    + zscore(alt_df["configuring_sent_pct"])
    - zscore(alt_df["collaborative_sent_pct"])
    - zscore(alt_df["permissive_sent_pct"])
    - zscore(alt_df["appreciative_sent_pct"])
)

top25 = (alt_df.nlargest(25, "directiveness")
              [["path", "category", "n_tokens", "directiveness",
                "mood_marker_pct", "hard_prohibitions_pct",
                "caps_imp_pct",
                "directive_sent_pct", "configuring_sent_pct",
                "collaborative_sent_pct", "permissive_sent_pct",
                "appreciative_sent_pct",
                "just_ratio"]]
              .reset_index(drop=True))

print(f"composite directiveness range: "
      f"{alt_df['directiveness'].min():.2f}{alt_df['directiveness'].max():.2f}")
display(top25.style.background_gradient(subset=["directiveness"], cmap="Reds")
                   .format({"directiveness": "{:.2f}",
                            "mood_marker_pct": "{:.2f}",
                            "hard_prohibitions_pct": "{:.2f}",
                            "caps_imp_pct": "{:.2f}",
                            "directive_sent_pct": "{:.1f}",
                            "configuring_sent_pct": "{:.1f}",
                            "collaborative_sent_pct": "{:.1f}",
                            "permissive_sent_pct": "{:.1f}",
                            "appreciative_sent_pct": "{:.1f}",
                            "just_ratio": "{:.2f}"}))

top25_chart = (
    alt.Chart(top25)
    .mark_bar()
    .encode(
        x=alt.X("directiveness:Q", title="Composite z-score (higher = more directive)"),
        y=alt.Y("path:N", sort="-x", title=None,
                axis=alt.Axis(labelLimit=320)),
        color=alt.Color("category:N",
                         scale=alt.Scale(domain=_cat_domain, range=_cat_range)),
        tooltip=[
            alt.Tooltip("path:N"),
            alt.Tooltip("category:N"),
            alt.Tooltip("n_tokens:Q", format=","),
            alt.Tooltip("directiveness:Q", format=".2f"),
            alt.Tooltip("mood_marker_pct:Q",
                         title="imperative markers %", format=".2f"),
            alt.Tooltip("hard_prohibitions_pct:Q",
                         title="hard prohibitions %", format=".2f"),
            alt.Tooltip("caps_imp_pct:Q",
                         title="CAPS imperative %", format=".2f"),
            alt.Tooltip("directive_sent_pct:Q",
                         title="directive sent %", format=".1f"),
            alt.Tooltip("configuring_sent_pct:Q",
                         title="configuring sent %", format=".1f"),
            alt.Tooltip("collaborative_sent_pct:Q",
                         title="collab sent %", format=".1f"),
            alt.Tooltip("permissive_sent_pct:Q",
                         title="permissive sent %", format=".1f"),
            alt.Tooltip("just_ratio:Q", title="justification ratio", format=".2f"),
        ],
    )
    .properties(width=560, height=560,
                title="Top 25 most directive prompts (extended z-summed composite)")
)
save_chart(top25_chart, "13-top25-directive-prompts")
composite directiveness range: -19.54  ↔  19.21
  path category n_tokens directiveness mood_marker_pct hard_prohibitions_pct caps_imp_pct directive_sent_pct configuring_sent_pct collaborative_sent_pct permissive_sent_pct appreciative_sent_pct just_ratio
0 tool-description-bash-no-newlines.md Tool description 16 19.21 6.25 6.25 6.25 100.0 0.0 0.0 0.0 0.0 0.00
1 tool-description-bash-sandbox-default-to-sandbox.md Tool description 18 18.44 16.67 5.56 0.00 100.0 50.0 0.0 0.0 0.0 0.00
2 tool-description-bash-sandbox-mandatory-mode.md Tool description 15 15.40 6.67 0.00 6.67 100.0 0.0 0.0 0.0 0.0 0.00
3 tool-description-bash-sandbox-retry-without-sandbox.md Tool description 12 12.22 8.33 8.33 0.00 100.0 0.0 0.0 0.0 0.0 0.00
4 tool-description-bash-sleep-run-immediately.md Tool description 14 10.73 7.14 7.14 0.00 100.0 0.0 0.0 0.0 0.0 0.00
5 tool-description-bash-semicolon-usage.md Tool description 21 10.16 9.52 4.76 0.00 100.0 0.0 0.0 0.0 0.0 0.00
6 system-prompt-one-of-six-rules-for-using-sleep-command.md System prompt 15 10.14 6.67 6.67 0.00 100.0 0.0 0.0 0.0 0.0 0.00
7 system-prompt-tone-and-style-concise-output-short.md System prompt 8 9.99 0.00 0.00 0.00 100.0 100.0 0.0 0.0 0.0 0.00
8 tool-description-bash-sandbox-no-exceptions.md Tool description 11 9.24 9.09 9.09 0.00 0.0 0.0 0.0 0.0 0.0 0.00
9 tool-description-bash-sandbox-tmpdir.md Tool description 33 8.54 6.06 3.03 0.00 66.7 33.3 0.0 0.0 0.0 0.00
10 tool-description-bash-sleep-keep-short.md Tool description 15 8.53 13.33 0.00 0.00 100.0 0.0 0.0 0.0 0.0 0.33
11 tool-description-sendmessagetool.md Tool description 149 7.75 4.70 3.36 0.67 75.0 12.5 0.0 0.0 0.0 0.00
12 tool-description-bash-sandbox-no-sensitive-paths.md Tool description 22 7.49 4.55 4.55 0.00 100.0 0.0 0.0 0.0 0.0 0.00
13 tool-description-bash-sleep-no-polling-background-tasks.md Tool description 22 7.49 4.55 4.55 0.00 100.0 0.0 0.0 0.0 0.0 0.00
14 tool-description-bash-sleep-use-check-commands.md Tool description 20 6.85 10.00 0.00 0.00 100.0 0.0 0.0 0.0 0.0 0.00
15 system-prompt-avoiding-unnecessary-sleep-commands-part-of-powershell-tool-description.md System prompt 132 6.18 5.30 2.27 0.00 100.0 0.0 0.0 0.0 0.0 0.12
16 tool-description-websearch.md Tool description 241 5.57 3.73 0.41 2.49 42.9 0.0 0.0 0.0 0.0 0.00
17 system-reminder-plan-mode-is-active-subagent.md System reminder 177 5.19 4.52 1.13 1.13 62.5 0.0 0.0 0.0 0.0 0.11
18 agent-prompt-quick-git-commit.md Agent prompt 373 5.16 2.95 2.14 1.34 54.5 0.0 0.0 0.0 0.0 0.08
19 tool-description-askuserquestion-preview-field.md Tool description 104 4.91 1.92 0.96 0.00 66.7 33.3 0.0 0.0 0.0 0.00
20 tool-description-bash-sandbox-evidence-operation-not-permitted.md Tool description 11 4.66 0.00 9.09 0.00 0.0 0.0 0.0 0.0 0.0 0.00
21 tool-description-edit.md Tool description 140 4.41 3.57 1.43 1.43 37.5 0.0 0.0 0.0 0.0 0.00
22 system-reminder-plan-mode-approval-tool-enforcement.md System reminder 169 4.41 3.55 1.18 0.59 75.0 0.0 0.0 0.0 0.0 0.00
23 tool-description-webfetch.md Tool description 248 4.25 0.81 0.00 0.40 66.7 33.3 0.0 0.0 0.0 0.00
24 system-prompt-doing-tasks-no-unnecessary-error-handling.md System prompt 55 3.96 5.45 5.45 0.00 50.0 0.0 0.0 25.0 0.0 0.00
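
The highest-scoring rows are very short files (8–33 tokens), where sentence-level percentages can only land on 0 or 100. A hedged robustness check is to re-rank after dropping the tiniest files and confirm the bash-sandbox family still dominates; the 50-token threshold below is arbitrary and purely illustrative:

# Robustness check (not part of the pipeline): drop the tiniest files and re-rank.
MIN_TOKENS = 50  # arbitrary illustrative threshold
top_longer = (alt_df[alt_df["n_tokens"] >= MIN_TOKENS]
              .nlargest(25, "directiveness")
              [["path", "category", "n_tokens", "directiveness"]]
              .reset_index(drop=True))
print(top_longer.head(10).to_string(index=False))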

Per-word vs per-sentence comparison

Six representative density metrics in their two unit views (pct vs per_sent). A category with longer, denser sentences can show a similar pct to a category with shorter sentences yet a higher per_sent, because each sentence packs in more matches. The chart places the two unit views side by side, one metric per row, so the divergence between the normalizations becomes legible: read across a row to compare the same metric under both units.

Code
"""Per-word vs per-sentence comparison: vconcat of hconcat rows, one metric per row.

Each row is one metric; left column shows the per-word `% of file tokens` view,
right column shows the `per-sentence` rate view. Same data, two units, side-by-side
so the reader can directly compare what the same metric looks like under each
normalisation.
"""

# Six representative metrics (one per dimension), with their (pct_col, per_sent_col) pair
metric_pairs = [
    ("Imperative markers",  "mood_marker_pct",        "mood_marker_per_sent"),
    ("Hard prohibitions",   "hard_prohibitions_pct",  "hard_prohibitions_per_sent"),
    ("Directive stance",    "directive_pct",          "directive_per_sent"),
    ("Hedging",             "hedging_pct",            "hedging_per_sent"),
    ("CAPS imperative",     "caps_imp_pct",           "caps_imp_per_sent"),
    ("Justifications",      "just_pct",               "just_per_sent"),
]

# Build long dataframe (same as before)
rows = []
for cat in cats:
    sub = alt_df[alt_df["category"] == cat]
    for label, pct_col, ps_col in metric_pairs:
        rows.append({"category": cat, "metric": label, "unit": "% of words",
                     "value": round(sub[pct_col].mean(), 4)})
        rows.append({"category": cat, "metric": label, "unit": "per sentence",
                     "value": round(sub[ps_col].mean(), 4)})
ws_df = pd.DataFrame(rows)


def _ws_panel(metric_label, unit, x_title):
    sub = ws_df[(ws_df["metric"] == metric_label) & (ws_df["unit"] == unit)]
    color = "#4e79a7" if unit == "% of words" else "#e15759"
    return (
        alt.Chart(sub)
        .mark_bar(color=color, opacity=0.85)
        .encode(
            x=alt.X("value:Q", title=x_title),
            y=alt.Y("category:N", sort="-x", title=None),
            tooltip=[
                alt.Tooltip("category:N"),
                alt.Tooltip("metric:N"),
                alt.Tooltip("unit:N"),
                alt.Tooltip("value:Q", format=".3f"),
            ],
        )
        .properties(width=320, height=160,
                    title=f"{metric_label}{unit}")
    )


ws_rows = []
for label, _pct_col, _ps_col in metric_pairs:
    row = alt.hconcat(
        _ws_panel(label, "% of words",   "% of file tokens"),
        _ws_panel(label, "per sentence", "matches per sentence"),
    ).resolve_scale(x="independent")
    ws_rows.append(row)

per_word_vs_sentence = alt.vconcat(*ws_rows).resolve_scale(color="independent").properties(
    title=alt.TitleParams(
        "Per-word vs per-sentence comparison — six metrics, two unit views",
        subtitle=["Left: % of file tokens. Right: matches per sentence. "
                  "Same per-category sort within each panel."],
        anchor="start",
    )
)
save_chart(per_word_vs_sentence, "13-per-word-vs-per-sentence")
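
To make the divergence between the two units concrete, one more small sketch (assuming alt_df carries mean_sent_len, which appears in the correlation-matrix column list above) lines up each category's mean sentence length against the per_sent / pct ratio for a single metric:

# Sketch: categories with longer sentences should show a larger per_sent / pct ratio.
ratio = (alt_df.groupby("category")[["mean_sent_len",
                                     "mood_marker_pct",
                                     "mood_marker_per_sent"]]
         .mean())
ratio["per_sent_over_pct"] = ratio["mood_marker_per_sent"] / ratio["mood_marker_pct"]
print(ratio.round(2).sort_values("mean_sent_len").to_string())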