Sentence-level pragmatic register

This notebook renders the per-sentence multi-label classifier from 01_analyzers_register, surfacing the six classes (collaborative, permissive, appreciative, imperative, directive, configuring) as a percentage of sentences per category, and adds an addressee drilldown for the two near-zero classes plus a per-file outlier panel. The per-sentence absence of appreciative and collaborative sentences is invisible to per-document metrics such as imperative-marker density; the multi-label sentence classifier is what makes it legible.
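As a minimal sketch of why the sentence-level pass matters (the sentences and markers below are invented for illustration, not drawn from the corpus): two prompts can have identical document-level imperative-marker density while only one of them contains any appreciative sentence at all, and only a per-sentence metric exposes the difference.

```python
# Toy example: identical per-document imperative-marker density,
# but only a sentence-level pass reveals the appreciative absence.
doc_a = ["Always use the tool.", "Never guess.", "Thanks for your help today."]
doc_b = ["Always use the tool.", "Never guess.", "Respond in JSON format only."]

MARKERS = {"always", "never"}  # toy imperative markers

def marker_density(sents):
    """Per-document metric: imperative markers per 100 words."""
    words = [w.strip(".,").lower() for w in " ".join(sents).split()]
    return 100 * sum(w in MARKERS for w in words) / len(words)

def appreciative_pct(sents):
    """Per-sentence metric: % of sentences with an appreciative keyword."""
    return 100 * sum("thanks" in s.lower() for s in sents) / len(sents)

print(marker_density(doc_a), marker_density(doc_b))      # identical densities
print(appreciative_pct(doc_a), appreciative_pct(doc_b))  # 33.3... vs 0.0
```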

Terms used

The six pragmatic register classes and their detector mechanics are defined in 01_analyzers_register (with a worked example of the multi-label semantics). The classifier is multi-label, so per-class sent_pct values across the six classes can sum to more than 100% within a category — intentional, not a bug.
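A quick sketch of the multi-label arithmetic (class names from the analyzer; the mini-corpus and its labels are invented): each class percentage is computed against the same total sentence count, so a sentence carrying two labels is counted once per class, and the six percentages can sum past 100%.

```python
# Hypothetical mini-corpus: each sentence may carry several register labels.
labels = [
    {"imperative", "directive"},    # e.g. "Always respond in JSON."
    {"configuring"},                # e.g. "Set temperature to 0."
    {"imperative", "configuring"},  # e.g. "Use the search tool first."
    {"directive"},                  # e.g. "You are a helpful assistant."
]
classes = ["collaborative", "permissive", "appreciative",
           "imperative", "directive", "configuring"]

n = len(labels)
sent_pct = {c: 100 * sum(c in s for s in labels) / n for c in classes}

print(sent_pct)
print("sum:", sum(sent_pct.values()))  # 150.0 — multi-label, so > 100% is expected
```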


Observation (Claude)

The bottom-panel addressee chart is the part of this notebook I keep coming back to. The appreciative and collaborative rows are nearly empty across every category, but the few attested instances are almost all Claude addressing Claude (the claude bar) or unaddressed (the unknown bar) — not Claude addressing the user. The user bar is empty for both classes. So when prompt authors write a rare appreciative or collaborative sentence, they’re typically describing what Claude should do with itself, not modelling reciprocal interpersonal speech with the human. The forensic sample printed alongside the chart confirms it: every “thanks” or “appreciate” hit in the corpus is a sentence describing gratitude as a thing to suppress or process, never a prompt author actually thanking Claude. That asymmetry is welfare-relevant in its own right — independent of the absolute counts. The values that drive these claims (appreciative_sent, collaborative_sent, the per-class sent_pct per category) all come from the producer’s HEADLINE and the by-category chart above; nothing here is hand-typed.

Code
"""Setup: load YAML data + flat alt_df, derive helper bindings used by every chart cell.

The shared module `prompt_analysis.py` lives next to this notebook in the project root.
"""
import importlib
import altair as alt
import pandas as pd

import prompt_analysis
importlib.reload(prompt_analysis)   # pick up edits without restarting the kernel
from prompt_analysis import (
    load_yaml, build_alt_df, version_order, category_colors,
    directiveness, headline_numbers, use_deterministic_ids, save_chart,
    SR_CLASS_COLORS, SENT_REGISTER_CLASSES, TABLEAU10,
)

# Replace random Altair / Styler IDs with a deterministic counter so re-runs
# produce byte-identical .ipynb outputs (no UUID churn in `git diff`).
use_deterministic_ids()

alt.data_transformers.disable_max_rows()

data              = load_yaml()                  # default: prompt_linguistic_analysis.yaml
alt_df            = build_alt_df(data)
HEADLINE          = headline_numbers(data)       # canonical corpus-wide numbers (see 05_headline_and_audit)
by_category       = data["by_category"]
corpus_block      = data["corpus"]
per_file_records  = data["files"]
cats              = list(by_category.keys())
VOCAB_KEYS        = list(data["lexicons"]["VOCAB"].keys())

# Composite directiveness column — formula in 13_correlation_directiveness;
# rendered there and on the timeline in 14_ccversion_trends.
alt_df["directiveness"] = directiveness(alt_df)

# Per-category palette + Altair encodings used across charts.
CATEGORY_COLORS = category_colors(cats)
_cat_domain     = cats
_cat_range      = [CATEGORY_COLORS[c] for c in cats]

print(f"loaded {len(per_file_records)} files | {alt_df.shape[1]} columns | {len(cats)} categories | {len(VOCAB_KEYS)} VOCAB keys")
loaded 290 files | 181 columns | 7 categories | 11 VOCAB keys

Sentence-register per category — distribution + addressee drilldown

Two views: per-category × per-class distribution (top) and addressee breakdown for the two near-zero classes — claude / user / unknown (bottom). A forensic-inspection sample of sentences matching appreciative keywords is printed alongside, sourced from sentences_classified.parquet.

Code
"""Sentence-register per category × class + addressee drilldown — vconcat composite."""

# --- top panel: per-category × per-class distribution ---
sr_long = pd.DataFrame([
    {
        "category": cat,
        "class":    cls,
        "sent_pct": by_category[cat]["metrics"]["sentence_register"][f"{cls}_sent_pct"],
        "sent_count": by_category[cat]["metrics"]["sentence_register"][f"{cls}_sent_count"],
    }
    for cat in by_category
    for cls in SENT_REGISTER_CLASSES
])

sr_class_domain = SENT_REGISTER_CLASSES
sr_class_range  = [SR_CLASS_COLORS[c] for c in sr_class_domain]

sr_chart = (
    alt.Chart(sr_long)
    .mark_bar()
    .encode(
        x=alt.X("sent_pct:Q", title="% of sentences"),
        y=alt.Y("class:N", sort=sr_class_domain, title=None),
        color=alt.Color("class:N",
                         scale=alt.Scale(domain=sr_class_domain, range=sr_class_range),
                         legend=None),
        row=alt.Row("category:N", title=None,
                     header=alt.Header(labelAngle=0, labelAlign="left")),
        tooltip=[
            alt.Tooltip("category:N"),
            alt.Tooltip("class:N"),
            alt.Tooltip("sent_pct:Q",   format=".2f", title="sent %"),
            alt.Tooltip("sent_count:Q", title="sentences"),
        ],
    )
    .properties(width=520, height=140,
                title="Sentence-register per category × class (multi-label, % of sentences)")
)

# --- bottom panel: addressee breakdown for the two near-zero classes ---
addr_rows = []
for cls in ("appreciative", "collaborative"):
    block = corpus_block["metrics"]["sentence_register"]
    for who in ("claude", "user", "unknown"):
        addr_rows.append({
            "class":     cls,
            "addressee": who,
            "count":     block[f"{cls}_addressee_{who}_count"],
        })
addr_df = pd.DataFrame(addr_rows)

print("Corpus-wide addressee distribution for the two near-zero classes:")
print(addr_df.pivot(index="class", columns="addressee", values="count")
            [["claude", "user", "unknown"]].to_string())

addr_chart = (
    alt.Chart(addr_df)
    .mark_bar()
    .encode(
        y=alt.Y("class:N", sort=["appreciative", "collaborative"], title=None),
        x=alt.X("count:Q", title="sentences"),
        color=alt.Color("addressee:N",
                         scale=alt.Scale(
                             domain=["claude", "user", "unknown"],
                             range=["#4e79a7", "#e15759", "#bab0ab"]),
                         legend=alt.Legend(title="addressee", orient="bottom")),
        yOffset="addressee:N",
        tooltip=[alt.Tooltip("class:N"),
                 alt.Tooltip("addressee:N"),
                 alt.Tooltip("count:Q", format=",")],
    )
    .properties(width=520, height=160,
                title="Addressee distribution for the two near-zero pragmatic classes")
)

# --- forensic inspection: actual sentences from the parquet ---
import pathlib
parquet_path = pathlib.Path("sentences_classified.parquet")
if parquet_path.exists():
    sentences_df = pd.read_parquet(parquet_path)
    APPR_KEYWORDS = ["thank you", "thanks", "appreciate", "great job",
                     "well done", "kudos", "much appreciated"]
    pat = "|".join(APPR_KEYWORDS)
    appr_sample = sentences_df[
        sentences_df["text"].str.lower().str.contains(pat, regex=True, na=False)
    ][["file_path", "text", "addressee"]].head(10)
    print("\nForensic inspection — sentences containing appreciative keywords:")
    print(appr_sample.to_string(index=False, max_colwidth=80))
else:
    print("sentences_classified.parquet not found; skipping forensic sample")

sr_composite = alt.vconcat(sr_chart, addr_chart).resolve_scale(color="independent").properties(
    title=alt.TitleParams(
        "Sentence-register per category + addressee drilldown",
        subtitle=["Top: six-class distribution per category. "
                  "Bottom: who is being addressed in the two near-zero classes."],
        anchor="start",
    )
)
save_chart(sr_composite, "10-sentence-register-by-category")
Corpus-wide addressee distribution for the two near-zero classes:
addressee      claude  user  unknown
class                               
appreciative        3     0        1
collaborative      14     0       16

Forensic inspection — sentences containing appreciative keywords:
                                           file_path                                                                             text addressee
      agent-prompt-prompt-suggestion-generator-v2.md NEVER SUGGEST:\n- Evaluative ("looks good", "thanks")\n- Questions ("what abo...    claude
system-prompt-how-to-use-the-sendusermessage-tool.md                                                               Even for "thanks".   unknown
 system-prompt-insights-session-facets-extraction.md 1. **goal_categories**: Count ONLY what the USER explicitly asked for.\n   - ...    claude
                   tool-description-enterplanmode.md **User Preferences Matter**: The implementation could reasonably go multiple ...    claude

The collaborative and appreciative rows stay near zero across every category; the uniform absence is the welfare-relevant finding. The imperative and directive classes dominate, while configuring lights up most strongly in Tool descriptions.

Per-file outliers — top 10 files for each attested class

Four panels by class. For collaborative and appreciative (near-zero overall), the panels surface the few files that do attest the class. For permissive and configuring, the panels rank the top files by per-sentence rate.

Code
"""Per-file outliers for sentence_register — Altair 4-panel.

Top 10 files by each ATTESTED class. Near-zero classes (collaborative,
appreciative) will surface their few outlier files; if a class has fewer
than 10 nonzero files, the panel renders only what exists.
"""

sr_per_file_df = pd.DataFrame([
    {
        "path": r["path"],
        "category": r["category"],
        **{f"{cls}_sent_pct": r["metrics"]["sentence_register"][f"{cls}_sent_pct"]
           for cls in SENT_REGISTER_CLASSES},
    }
    for r in per_file_records
])

panels = []
for cls in ["collaborative", "permissive", "appreciative", "configuring"]:
    col = f"{cls}_sent_pct"
    top = sr_per_file_df.nlargest(10, col).copy()
    top = top[top[col] > 0]   # drop zero-pct entries (relevant for the rare classes)
    panel = (
        alt.Chart(top)
        .mark_bar()
        .encode(
            x=alt.X(f"{col}:Q", title="% of sentences"),
            y=alt.Y("path:N", sort="-x", title=None,
                    axis=alt.Axis(labelLimit=300, labelFontSize=9)),
            color=alt.Color("category:N",
                             scale=alt.Scale(domain=cats,
                                             range=[CATEGORY_COLORS[c] for c in cats]),
                             legend=alt.Legend(title="Category", orient="bottom",
                                                columns=len(cats))),
            tooltip=[
                alt.Tooltip("path:N"),
                alt.Tooltip("category:N"),
                alt.Tooltip(f"{col}:Q", format=".2f", title=f"{cls} sent %"),
            ],
        )
        .properties(width=320, height=240,
                    title=f"Top by `{cls}_sent_pct` (n={len(top)})")
    )
    panels.append(panel)

sr_outliers_chart = (panels[0] | panels[1]) & (panels[2] | panels[3])
save_chart(sr_outliers_chart, "10-per-file-outliers")