Stage 0 of the 6-stage producer chain. Loads the 290 corpus files, parses HTML-comment frontmatter, runs the spaCy pipeline once, and writes two cache artifacts under _pipeline_cache/:
Output                                 Description
_pipeline_cache/corpus_docs.spacy      spaCy DocBin — the parsed Doc objects, keyed by file index
_pipeline_cache/corpus_meta.parquet    per-file metadata + raw_text + token/sentence counts
Stages 01–03 reload the cache and run the per-doc analyzers; 04 assembles and writes the final YAML + parquet; 05 prints the canonical HEADLINE sheet. The producer chain is also where every linguistic and statistical term is defined; lexicons and analyzer functions live in prompt_pipeline.py.
Reading the YAML keys
YAML field names are snake_case versions of the spaced display names used in chart axes:
“Positive evaluative” stance class → positive_evaluative_pct / positive_evaluative_count / positive_evaluative_per_sent.
“Pearson correlation (r)” → labeled as Pearson r in chart axes.
“% of file tokens” / “per word” → _pct suffix; “rate per sentence” → _per_sent.
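The display-name-to-key mapping above is a mechanical transformation; a minimal sketch (`yaml_key` is a hypothetical helper for illustration, not part of prompt_pipeline):

```python
import re

def yaml_key(display_name: str, suffix: str) -> str:
    """Turn a chart display name like 'Positive evaluative' into a snake_case
    YAML key with the given metric suffix ('pct', 'count', 'per_sent')."""
    base = re.sub(r"[^a-z0-9]+", "_", display_name.lower()).strip("_")
    return f"{base}_{suffix}"

print(yaml_key("Positive evaluative", "pct"))       # positive_evaluative_pct
print(yaml_key("Positive evaluative", "per_sent"))  # positive_evaluative_per_sent
```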
Units (read every chart axis with these in mind)
Every density metric reports two parallel views, plus one variant for the multi-label sentence-register classifier:
% of file tokens (pct) — count / total_word_tokens × 100. Normalized by document length. Answers: “what fraction of this file’s prose is this feature?” Use this when comparing categories of different sizes — a 10,000-word file with 50 imperative markers (0.5%) is less imperative-dense than a 200-word file with 10 markers (5%).
per_sent (rate per sentence) — count / total_sentences. Can be > 1 if a feature appears multiple times per sentence on average. Answers: “how often per sentence does this feature appear?” Use this when comparing categories where typical sentence length differs — a category with long, dense sentences will have similar pct but higher per_sent.
sent_pct (% of sentences with this flag) — sentences_with_flag / total_sentences × 100. Used only for the sentence_register multi-label classifier (defined in 01_analyzers_register); per-class values can sum to more than 100% across the six classes, since a single sentence can carry several flags simultaneously.
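The three unit conventions side by side, with illustrative numbers (not corpus values):

```python
# A short file: 200 word tokens, 12 sentences; a feature fires 10 times,
# and 5 sentences carry the corresponding flag.
count, total_word_tokens, total_sentences = 10, 200, 12
sentences_with_flag = 5

pct = count / total_word_tokens * 100                   # 5.0   -> "% of file tokens"
per_sent = count / total_sentences                      # ~0.83 -> "rate per sentence" (can exceed 1)
sent_pct = sentences_with_flag / total_sentences * 100  # ~41.7 -> "% of sentences with this flag"
```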
Note: you may need to restart the kernel to use updated packages.
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Code
"""Path configuration. Notebook lives at the project root (claude-prompts-analysis/).The kernel CWD is the same directory, so all paths are relative to it."""import os, pathlib_here = pathlib.Path.cwd().resolve()PROJECT_ROOT =next( (p for p in [_here, *_here.parents] if (p /"prompt_pipeline.py").is_file()),None,)if PROJECT_ROOT isNone:raiseRuntimeError(f"Could not find prompt_pipeline.py walking up from {_here}. ""Run from inside the claude-prompts-analysis repo." )if pathlib.Path.cwd() != PROJECT_ROOT: os.chdir(PROJECT_ROOT)CORPUS_DIR = PROJECT_ROOT /"claude-code-system-prompts"/"system-prompts"CACHE_DIR = PROJECT_ROOT /"_pipeline_cache"DOCBIN_OUT = CACHE_DIR /"corpus_docs.spacy"META_OUT = CACHE_DIR /"corpus_meta.parquet"CACHE_DIR.mkdir(exist_ok=True)assert CORPUS_DIR.is_dir(), f"missing corpus dir: {CORPUS_DIR}"md_files =sorted(CORPUS_DIR.glob("*.md"))print(f"cwd: {pathlib.Path.cwd()}")print(f"corpus: {CORPUS_DIR}")print(f" {len(md_files)} .md files")print(f"cache dir: {CACHE_DIR}")print(f"output: {DOCBIN_OUT.name}, {META_OUT.name}")
The 290 .md files under claude-code-system-prompts/system-prompts/.
Frontmatter
HTML-comment-wrapped YAML at the top of each file (name / description / ccVersion / agentType).
Category
One of seven labels assigned by filename prefix: Agent prompt, Data / template, Skill, System prompt, System reminder, Tool description, Tool parameter.
raw_text
The body of the file with code fences, inline code, image-only lines, and link-only lines stripped.
n_tokens
Word tokens (excluding whitespace), as counted by spaCy.
n_sents
Sentences as segmented by spaCy.
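The frontmatter extraction described above can be sketched with a stdlib-only parser. This is illustrative: the project's real parse_html_frontmatter in prompt_pipeline attempts full YAML parsing first and only falls back to line-by-line extraction on edge cases; this version implements just the fallback path for the known scalar keys.

```python
import re

FM_RE = re.compile(r"\A\s*<!--\s*\n(.*?)\n\s*-->\s*\n?", re.DOTALL)
KNOWN_KEYS = {"name", "description", "ccVersion"}

def extract_frontmatter(raw: str):
    """Return (meta, body). Line-by-line extraction of known scalar top-level keys
    from an HTML-comment frontmatter block at the top of the file."""
    m = FM_RE.match(raw)
    if not m:
        return {}, raw
    meta = {}
    for line in m.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in KNOWN_KEYS:
            meta[key.strip()] = value.strip().strip("'\"")
    return meta, raw[m.end():]
```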
Code
"""Load every .md file: parse HTML-comment frontmatter, strip code, categorize.Frontmatter format (HTML-comment-wrapped YAML, all 287 files in this corpus): <!-- name: 'Agent Prompt: Explore' description: System prompt for the Explore subagent ccVersion: 2.1.118 -->We extract ``name`` / ``description`` / ``ccVersion`` (top level) and``agentMetadata.agentType`` (nested). If full YAML parsing fails on edge cases likeunresolved aliases (``*`` without anchor) or template variables, we fall back toa line-by-line extraction of the known scalar top-level keys. All parsing helperslive in ``prompt_pipeline``."""import importlibimport pandas as pdfrom tqdm.auto import tqdmimport prompt_pipelineimportlib.reload(prompt_pipeline)from prompt_pipeline import parse_html_frontmatter, clean_markdown, categorizerecords = []fm_count =0fallback_count =0for path in tqdm(md_files, desc="loading"): raw = path.read_text(encoding="utf-8") meta, body, used_fallback = parse_html_frontmatter(raw)if meta: fm_count +=1if used_fallback: fallback_count +=1 raw_text = clean_markdown(body) agent_meta = meta.get("agentMetadata") ifisinstance(meta.get("agentMetadata"), dict) else {} records.append({"path": path.name,"category": categorize(path.name),"name": meta.get("name", "") or"","description": meta.get("description", "") or"","ccVersion": meta.get("ccVersion", "") or"","agentType": (agent_meta.get("agentType", "") if agent_meta else""),"raw_text": raw_text, })df = pd.DataFrame(records)df["clean_text"] = df["raw_text"].str.lower()print(f"loaded {len(df)} prompts ({fm_count} with frontmatter, {fallback_count} via line-by-line fallback)")print()print("category distribution:")print(df["category"].value_counts().to_string())print()print(f"unique ccVersions: {df['ccVersion'].nunique()}")top_ver = df['ccVersion'].value_counts().head()print(f"top 5 ccVersions: {top_ver.to_dict()}")print(f"files with non-empty agentType: {(df['agentType'] !='').sum()}")
loaded 290 prompts (290 with frontmatter, 29 via line-by-line fallback)
category distribution:
category
Tool description 79
System prompt 64
System reminder 40
Data / template 39
Agent prompt 37
Skill 30
Tool parameter 1
unique ccVersions: 58
top 5 ccVersions: {'2.1.53': 47, '2.1.18': 23, '2.1.128': 18, '2.1.132': 17, '2.1.118': 15}
files with non-empty agentType: 5
Code
"""Run the spaCy pipeline once, then cache: - `_pipeline_cache/corpus_docs.spacy` (DocBin of 286 parsed docs) - `_pipeline_cache/corpus_meta.parquet` (per-file metadata + raw_text + token/sent counts)Downstream stages reload both via: docs = list(DocBin().from_disk(DOCBIN_OUT).get_docs(prompt_pipeline.NLP.vocab)) df = pd.read_parquet(META_OUT)"""from spacy.tokens import DocBinfrom prompt_pipeline import NLPdocs =list(tqdm(NLP.pipe(df["raw_text"].tolist(), batch_size=20), total=len(df), desc="spaCy parse"))df["n_tokens"] = [sum(1for t in d ifnot t.is_space) for d in docs]df["n_sents"] = [sum(1for _ in d.sents) for d in docs]print(f"total tokens: {df['n_tokens'].sum():,}")print(f"total sentences: {df['n_sents'].sum():,}")print(f"mean tokens/file: {df['n_tokens'].mean():.0f}")# Write DocBin + per-file metadata.DocBin(docs=docs, store_user_data=True).to_disk(DOCBIN_OUT)df.to_parquet(META_OUT, index=True)doc_size = DOCBIN_OUT.stat().st_sizemeta_size = META_OUT.stat().st_sizeprint()print(f"wrote {DOCBIN_OUT.relative_to(PROJECT_ROOT)} ({doc_size:,} bytes, {doc_size/1024:.1f} KiB)")print(f"wrote {META_OUT.relative_to(PROJECT_ROOT)} ({meta_size:,} bytes, {meta_size/1024:.1f} KiB)")print()print("--- sample rows ---")print(df[["path", "category", "ccVersion", "n_tokens", "n_sents"]].head().to_string())