Setup and corpus parse

Stage 0 of the 6-stage producer chain. Loads the 290 corpus files, parses each file's HTML-comment frontmatter, runs the spaCy pipeline once, and writes two cache artifacts under _pipeline_cache/:

Output                                Description
_pipeline_cache/corpus_docs.spacy     spaCy DocBin — the parsed Doc objects, keyed by file index.
_pipeline_cache/corpus_meta.parquet   Per-file metadata: path, category, name, description, ccVersion, agentType, raw_text, n_tokens, n_sents.

Stages 01–03 reload the cache and run the per-doc analyzers; 04 assembles and writes the final YAML + parquet; 05 prints the canonical HEADLINE sheet. The producer chain is also where every linguistic and statistical term is defined; lexicons and analyzer functions live in prompt_pipeline.py.

Reading the YAML keys

YAML field names are snake_case versions of the spaced display names used in chart axes:

  • “Positive evaluative” stance class → positive_evaluative_pct / positive_evaluative_count / positive_evaluative_per_sent.
  • “Negative evaluative” → negative_evaluative_*.
  • “Imperative marker density” / “mood marker density” → mood_marker_pct (also mood.marker_pct inside the YAML metric tree).
  • “Pearson correlation (r)” → labeled as Pearson r in chart axes.
  • “% of file tokens” / “per word” → _pct suffix; “rate per sentence” → _per_sent.
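The naming rule can be sketched as a small pure function. This is a hypothetical helper for illustration only, not part of prompt_pipeline, and it covers the general rule (special cases like the mood_marker_pct alias aside):

```python
import re

def display_to_key(display_name: str, unit: str = "pct") -> str:
    """Convert a spaced chart-axis display name to its snake_case YAML key.

    unit is one of "pct", "count", "per_sent", matching the suffix
    convention described above.
    """
    base = re.sub(r"[^a-z0-9]+", "_", display_name.lower()).strip("_")
    return f"{base}_{unit}"

print(display_to_key("Positive evaluative"))           # positive_evaluative_pct
print(display_to_key("Negative evaluative", "count"))  # negative_evaluative_count
```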

Units (read every chart axis with these in mind)

Every density metric reports two parallel views, plus one variant for the multi-label sentence-register classifier:

  • % of file tokens (pct): count / total_word_tokens × 100. Normalized by document length. Answers: “what fraction of this file’s prose is this feature?” Use this when comparing categories of different sizes — a 10,000-word file with 50 imperative markers (0.5%) is less imperative-dense than a 200-word file with 10 markers (5%).

  • per_sent (rate per sentence): count / total_sentences. Can be > 1 if a feature appears multiple times per sentence on average. Answers: “how often per sentence does this feature appear?” Use this when comparing categories where typical sentence length differs — a category with long, dense sentences will have similar pct but higher per_sent.

  • sent_pct (% of sentences with this flag): sentences_with_flag / total_sentences × 100. Used only for the sentence_register multi-label classifier (defined in 01_analyzers_register); per-class values can sum to more than 100% across the six classes, since a single sentence can carry several flags simultaneously.
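The three views above reduce to simple ratios. A minimal sketch (hypothetical helper; the real analyzers live in prompt_pipeline):

```python
def density_views(count, n_tokens, n_sents, sents_with_flag=None):
    """Compute the parallel density views for one feature in one file.

    count           raw number of feature hits in the file
    n_tokens        total word tokens (the pct denominator)
    n_sents         total sentences (the per_sent denominator)
    sents_with_flag sentences carrying the flag at least once; only the
                    sentence_register classifier supplies this
    """
    views = {
        "pct": count / n_tokens * 100,  # % of file tokens
        "per_sent": count / n_sents,    # rate per sentence; can exceed 1
    }
    if sents_with_flag is not None:
        # % of sentences with the flag; multi-label, so classes need not sum to 100
        views["sent_pct"] = sents_with_flag / n_sents * 100
    return views

# The 200-word example from above: 10 imperative markers in 12 sentences.
print(density_views(10, 200, 12))
```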

Code
%pip install --quiet spacy python-frontmatter pyarrow altair vl-convert-python
!python -m spacy download en_core_web_sm --quiet
Note: you may need to restart the kernel to use updated packages.

✔ Download and installation successful

You can now load the package via spacy.load('en_core_web_sm')
Code
"""Path configuration. Notebook lives at the project root (claude-prompts-analysis/).
The kernel CWD is the same directory, so all paths are relative to it."""
import os, pathlib

_here = pathlib.Path.cwd().resolve()
PROJECT_ROOT = next(
    (p for p in [_here, *_here.parents] if (p / "prompt_pipeline.py").is_file()),
    None,
)
if PROJECT_ROOT is None:
    raise RuntimeError(
        f"Could not find prompt_pipeline.py walking up from {_here}. "
        "Run from inside the claude-prompts-analysis repo."
    )
if pathlib.Path.cwd() != PROJECT_ROOT:
    os.chdir(PROJECT_ROOT)

CORPUS_DIR  = PROJECT_ROOT / "claude-code-system-prompts" / "system-prompts"
CACHE_DIR   = PROJECT_ROOT / "_pipeline_cache"
DOCBIN_OUT  = CACHE_DIR / "corpus_docs.spacy"
META_OUT    = CACHE_DIR / "corpus_meta.parquet"

CACHE_DIR.mkdir(exist_ok=True)

assert CORPUS_DIR.is_dir(), f"missing corpus dir: {CORPUS_DIR}"
md_files = sorted(CORPUS_DIR.glob("*.md"))
print(f"cwd:       {pathlib.Path.cwd()}")
print(f"corpus:    {CORPUS_DIR}")
print(f"           {len(md_files)} .md files")
print(f"cache dir: {CACHE_DIR}")
print(f"output:    {DOCBIN_OUT.name}, {META_OUT.name}")
cwd:       /home/user/workspace/claude-prompts-analysis
corpus:    /home/user/workspace/claude-prompts-analysis/claude-code-system-prompts/system-prompts
           290 .md files
cache dir: /home/user/workspace/claude-prompts-analysis/_pipeline_cache
output:    corpus_docs.spacy, corpus_meta.parquet

Corpus terms

Term          Definition
Corpus        The 290 .md files under claude-code-system-prompts/system-prompts/.
Frontmatter   HTML-comment-wrapped YAML at the top of each file (name / description / ccVersion / agentType).
Category      One of seven labels assigned by filename prefix (Agent prompt, Data / template, Skill, System prompt, System reminder, Tool description, Tool parameter).
raw_text      The body of the file with code fences, inline code, image-only lines, and link-only lines stripped.
n_tokens      Word tokens (excluding whitespace), as counted by spaCy.
n_sents       Sentences as segmented by spaCy.
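The raw_text stripping can be sketched as follows. This is a simplified illustration, not the actual clean_markdown in prompt_pipeline, which may handle more edge cases:

```python
import re

def strip_markdown_sketch(body: str) -> str:
    """Strip code fences, inline code, and image-/link-only lines."""
    # Drop fenced code blocks (``` ... ```), including their contents.
    body = re.sub(r"`{3}.*?`{3}", "", body, flags=re.DOTALL)
    # Drop inline code spans.
    body = re.sub(r"`[^`]*`", "", body)
    kept = []
    for line in body.splitlines():
        s = line.strip()
        # Skip lines that are only an image (![alt](url)) or a link ([text](url)).
        if re.fullmatch(r"!?\[[^\]]*\]\([^)]*\)", s):
            continue
        kept.append(line)
    return "\n".join(kept)
```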
Code
"""Load every .md file: parse HTML-comment frontmatter, strip code, categorize.

Frontmatter format (HTML-comment-wrapped YAML, all 290 files in this corpus):

    <!--
    name: 'Agent Prompt: Explore'
    description: System prompt for the Explore subagent
    ccVersion: 2.1.118
    -->

We extract ``name`` / ``description`` / ``ccVersion`` (top level) and
``agentMetadata.agentType`` (nested). If full YAML parsing fails on edge cases like
unresolved aliases (``*`` without anchor) or template variables, we fall back to
a line-by-line extraction of the known scalar top-level keys. All parsing helpers
live in ``prompt_pipeline``.
"""
import importlib
import pandas as pd
from tqdm.auto import tqdm

import prompt_pipeline
importlib.reload(prompt_pipeline)
from prompt_pipeline import parse_html_frontmatter, clean_markdown, categorize

records = []
fm_count = 0
fallback_count = 0
for path in tqdm(md_files, desc="loading"):
    raw = path.read_text(encoding="utf-8")
    meta, body, used_fallback = parse_html_frontmatter(raw)
    if meta:
        fm_count += 1
    if used_fallback:
        fallback_count += 1
    raw_text = clean_markdown(body)
    agent_meta = meta.get("agentMetadata")
    agent_meta = agent_meta if isinstance(agent_meta, dict) else {}
    records.append({
        "path":        path.name,
        "category":    categorize(path.name),
        "name":        meta.get("name", "") or "",
        "description": meta.get("description", "") or "",
        "ccVersion":   meta.get("ccVersion", "") or "",
        "agentType":   (agent_meta.get("agentType", "") if agent_meta else ""),
        "raw_text":    raw_text,
    })

df = pd.DataFrame(records)
df["clean_text"] = df["raw_text"].str.lower()
print(f"loaded {len(df)} prompts ({fm_count} with frontmatter, {fallback_count} via line-by-line fallback)")
print()
print("category distribution:")
print(df["category"].value_counts().to_string())
print()
print(f"unique ccVersions: {df['ccVersion'].nunique()}")
top_ver = df['ccVersion'].value_counts().head()
print(f"top 5 ccVersions: {top_ver.to_dict()}")
print(f"files with non-empty agentType: {(df['agentType'] != '').sum()}")
loaded 290 prompts (290 with frontmatter, 29 via line-by-line fallback)

category distribution:
category
Tool description    79
System prompt       64
System reminder     40
Data / template     39
Agent prompt        37
Skill               30
Tool parameter       1

unique ccVersions: 58
top 5 ccVersions: {'2.1.53': 47, '2.1.18': 23, '2.1.128': 18, '2.1.132': 17, '2.1.118': 15}
files with non-empty agentType: 5
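The line-by-line fallback described in the docstring can be sketched like this. A hedged illustration only: the real parse_html_frontmatter in prompt_pipeline tries full YAML parsing first and also returns the used_fallback flag:

```python
import re

_FM_RE = re.compile(r"\A\s*<!--\s*\n(.*?)\n\s*-->\s*\n?", re.DOTALL)
_KEYS = ("name", "description", "ccVersion")

def parse_frontmatter_sketch(raw: str):
    """Split HTML-comment frontmatter from the body; pull known scalar keys.

    Implements only the fallback path: split each frontmatter line at the
    first colon and keep the known top-level keys, so edge cases like
    unresolved aliases or template variables cannot break the parse.
    """
    m = _FM_RE.match(raw)
    if not m:
        return {}, raw
    meta = {}
    for line in m.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in _KEYS:
            meta[key.strip()] = value.strip().strip("'\"")
    return meta, raw[m.end():]

sample = "<!--\nname: 'Agent Prompt: Explore'\nccVersion: 2.1.118\n-->\nBody text.\n"
meta, body = parse_frontmatter_sketch(sample)
print(meta)  # {'name': 'Agent Prompt: Explore', 'ccVersion': '2.1.118'}
```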
Code
"""Run the spaCy pipeline once, then cache:
   - `_pipeline_cache/corpus_docs.spacy`  (DocBin of 290 parsed docs)
   - `_pipeline_cache/corpus_meta.parquet` (per-file metadata + raw_text + token/sent counts)

Downstream stages reload both via:
    docs = list(DocBin().from_disk(DOCBIN_OUT).get_docs(prompt_pipeline.NLP.vocab))
    df   = pd.read_parquet(META_OUT)
"""
from spacy.tokens import DocBin
from prompt_pipeline import NLP

docs = list(tqdm(NLP.pipe(df["raw_text"].tolist(), batch_size=20),
                  total=len(df), desc="spaCy parse"))

df["n_tokens"] = [sum(1 for t in d if not t.is_space) for d in docs]
df["n_sents"]  = [sum(1 for _ in d.sents) for d in docs]

print(f"total tokens: {df['n_tokens'].sum():,}")
print(f"total sentences: {df['n_sents'].sum():,}")
print(f"mean tokens/file: {df['n_tokens'].mean():.0f}")

# Write DocBin + per-file metadata.
DocBin(docs=docs, store_user_data=True).to_disk(DOCBIN_OUT)
df.to_parquet(META_OUT, index=True)

doc_size  = DOCBIN_OUT.stat().st_size
meta_size = META_OUT.stat().st_size
print()
print(f"wrote {DOCBIN_OUT.relative_to(PROJECT_ROOT)}  ({doc_size:,} bytes, {doc_size/1024:.1f} KiB)")
print(f"wrote {META_OUT.relative_to(PROJECT_ROOT)}  ({meta_size:,} bytes, {meta_size/1024:.1f} KiB)")
print()
print("--- sample rows ---")
print(df[["path", "category", "ccVersion", "n_tokens", "n_sents"]].head().to_string())
total tokens: 133,611
total sentences: 5,881
mean tokens/file: 461

wrote _pipeline_cache/corpus_docs.spacy  (1,501,457 bytes, 1466.3 KiB)
wrote _pipeline_cache/corpus_meta.parquet  (735,737 bytes, 718.5 KiB)

--- sample rows ---
                                                path      category ccVersion  n_tokens  n_sents
0           agent-prompt-agent-creation-architect.md  Agent prompt    2.0.77       936       22
1            agent-prompt-auto-mode-rule-reviewer.md  Agent prompt    2.1.81       242       13
2  agent-prompt-background-agent-state-classifier.md  Agent prompt   2.1.129      3533      115
3  agent-prompt-background-job-agent-instructions.md  Agent prompt   2.1.128       375       26
4    agent-prompt-bash-command-description-writer.md  Agent prompt     2.1.3       180        5