About 60 minutes. Open a Claude Code session in
~/ai-training and hand it this guide:
Read the file at /Users/<you>/ai-training/week-6-guide.md (or wherever you saved it) and walk me through Session 2.6.
I've completed Sessions 2.1 through 2.5.
Posture: public, synthetic, or personal data only. Today’s corpus is public — comments on a regulations.gov docket, a folder of NBER abstracts, public consultation responses, court filings, anything published in volume. Nothing client-internal, even abstracted.
By the end of this session you will have, inside ~/ai-training/corpus/<docket-slug>/:

- pipeline/00-raw/ — 100+ public comments downloaded as individual files.
- pipeline/01-scored/scores.csv — every comment scored 1–5 on novelty and substance, with a one-sentence justification.
- pipeline/02-summarized/ — one-paragraph summaries for the items that scored 4–5 only.
- pipeline/03-fact-checked/ — the same summaries after two parallel sub-agent fact-checkers ran on each. Items where the two agents agreed are clean; items where they disagreed are flagged for your review.
- corpus-report.md — a one-page synthesis of what the high-scoring comments collectively say.

Four stages, one folder per stage, one report at the end. The pipeline is the artifact. Re-running on a different docket means swapping out 00-raw/ and pressing go.
A production version of this is a daily 2 AM scan across eight NBER sections, scoring each new working paper 1–5, drafting threads only for the 4s and 5s, running two parallel fact-check agents on every drafted thread, and emailing the 5+ best out on Saturday at 6 AM in a format reviewable on a phone in 10 minutes. The corpus is ~30–60 papers per week. Without the pipeline, that’s a full afternoon of reading. With it, it’s a 10-minute mobile review.
The pattern transfers cleanly. The piece you’re learning today is not “how to score NBER papers”; it’s the four-stage pipeline that turns “too much to read” into “five things worth reading, with confidence.”
The mistake people make on a 100-comment docket: they ask Claude to summarize all 100. Two hours and a lot of tokens later, they have 100 summaries — most of which are noise, because most public comments on most dockets are noise. The summaries weren’t worth writing.
The unlock: score first, draft second. Scoring a comment 1–5 with one sentence costs almost nothing. Drafting a real summary costs a lot more. By scoring everything and drafting only the high-scorers, you spend tokens where they earn return.
Same logic at the next stage: fact-check the drafts you wrote, not the corpus. Two parallel fact-checkers on the 5–10 high-scoring items is a 10-minute job. Fact-checking 100 noise items would be 5 hours and find almost nothing.
The pipeline is the discipline. Each stage filters; each filter removes work from the next stage.
You need a public corpus with at least 100 items, where the items are distinct and short enough to score individually. Good options: comments on a regulations.gov docket, NBER working-paper abstracts, or issues on a busy public repo pulled with gh issue list.

Pick one in or near your running domain. The exact corpus matters less than the size and structure.
From inside ~/ai-training, start claude and
paste:
Read the file at /Users/<you>/ai-training/week-6-guide.md (or wherever you saved it) and walk me through the Reference Sections.
I've picked <docket-name> as the corpus, with ~<N> items.
Claude takes over.
Claude: walk the user through the four-stage pipeline. Peer-tone. Two rules:
Design: this session introduces sub-agents (the Task tool — fan
out to parallel workers), the two-terminal / headless pattern
(claude -p for non-interactive runs), and the
score-then-draft pipeline shape. 2.7 builds on the parallel-agent
pattern (formalized as dueling fact-checkers). 2.8 reuses this exact
pipeline, scheduled.
Before any pipeline work: /refresh from Session 2.5 still works — Claude has read CLAUDE.md and MEMORY.md and knows what kind of project this is.

The shape: every item from the corpus lands as a separate file in corpus/<docket-slug>/pipeline/00-raw/. One item per file, named with a stable ID (the comment ID, the issue number, the paper ID — whatever the source uses).
In Claude:
Create the directory corpus/<docket-slug>/pipeline/00-raw/. Then download
every public comment from <docket URL or API endpoint> as a separate
.txt or .json file in that directory. Name each file by its native ID
(e.g. EPA-HQ-OAR-2026-0014.txt). After download, write
corpus/<docket-slug>/pipeline/00-raw/manifest.json listing every file with
its size and a SHA-256 hash.
Claude: regulations.gov has a public API at api.regulations.gov; GitHub has the gh CLI; NBER has RSS. Pick the right transport for the corpus the user chose. Do NOT scrape the HTML site — use the API. Confirm with the user that the count of files in 00-raw/ matches the expected docket size.
If a few items fail to download (network glitch, rate limit), retry; don’t move on with a half-corpus.
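If you want to see what the manifest step amounts to outside of Claude, here is a minimal sketch; the paths mirror the layout above, and the file selection is illustrative rather than prescriptive:

```python
# Minimal sketch of the manifest step: list every raw file with its size and SHA-256.
# The <docket-slug> placeholder is illustrative; adjust to your corpus.
import hashlib
import json
from pathlib import Path

raw_dir = Path("corpus/<docket-slug>/pipeline/00-raw")

manifest = []
for path in sorted(raw_dir.glob("*")):
    if path.name == "manifest.json" or path.is_dir():
        continue
    data = path.read_bytes()
    manifest.append({
        "file": path.name,
        "bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    })

(raw_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
print(f"{len(manifest)} files hashed into manifest.json")
```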
This is the first place sub-agents pay for themselves. You don’t want to score 100 items sequentially in one Claude conversation — context fills up, the model’s calibration drifts. Instead, fan out.
The shape: one sub-agent per ~10 items, each agent scores its batch, scores merge into one CSV.
In Claude:
Read corpus/<docket-slug>/pipeline/00-raw/manifest.json. Split the items into
batches of 10. For each batch, dispatch a sub-agent (Task tool, model: haiku
for cost) with this prompt:
"Read each of these public comment files. For each, score 1-5 on (a)
novelty (does it add something new or just restate prior comments?) and
(b) substance (is the argument concrete and supported, or vague?). For
each item, return: filename, novelty (1-5), substance (1-5), one-sentence
justification. Output as CSV rows."
Collect all sub-agent outputs into a single CSV at
corpus/<docket-slug>/pipeline/01-scored/scores.csv with columns:
filename, novelty, substance, justification.
Sort by (novelty + substance) descending. Show me the top 10 and the
bottom 10 so I can sanity-check the calibration.
Claude: dispatch the sub-agents in parallel using the Task tool. Use a cheap model (haiku) for the scoring — 100 short scoring decisions don’t need full Opus. The user reviews the top/bottom sample to confirm the model’s calibration matches theirs. If the calibration is off, tighten the rubric in the sub-agent prompt and re-run.
The “top 10 and bottom 10” sanity check is non-negotiable. If the bottom 10 contains items the user thinks are obviously substantive, the rubric needs work before any drafts are written.
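For reference, the parent's merge-and-sort step amounts to something like this sketch. The batch-*.csv names are an assumption about how the sub-agent outputs were captured; the columns follow the prompt above:

```python
# Sketch of the parent's merge step: combine per-batch score rows, sort by the
# combined score, and surface the extremes for the calibration check.
# Assumes sub-agent outputs were saved as batch-*.csv fragments (an illustrative
# convention) with columns filename,novelty,substance,justification.
import csv
from pathlib import Path

scored_dir = Path("corpus/<docket-slug>/pipeline/01-scored")
cols = ["filename", "novelty", "substance", "justification"]
rows = []

for batch_file in sorted(scored_dir.glob("batch-*.csv")):
    with batch_file.open() as f:
        for r in csv.DictReader(f, fieldnames=cols):
            if r["novelty"].strip().isdigit():  # skip any stray header rows
                rows.append(r)

rows.sort(key=lambda r: int(r["novelty"]) + int(r["substance"]), reverse=True)

with (scored_dir / "scores.csv").open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=cols)
    writer.writeheader()
    writer.writerows(rows)

print("Top 10:", [r["filename"] for r in rows[:10]])
print("Bottom 10:", [r["filename"] for r in rows[-10:]])
```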
Take items where (novelty + substance) ≥ 8 — typically 5–15 items out of 100.
In Claude:
Read corpus/<docket-slug>/pipeline/01-scored/scores.csv. Filter to items
with novelty + substance >= 8. For each filtered item, read the raw
comment from pipeline/00-raw/ and write a one-paragraph summary covering:
the position taken, the strongest evidence cited, the implicit
assumptions, and why this comment scored high. Save each as
pipeline/02-summarized/<filename>.md.
Claude: this is the expensive stage in token cost; it’s also where the value sits. Write each summary carefully — these are the things the user will actually read.
Read 3 of the summaries together with the user. They should feel substantive, not generic. If they read like Wikipedia introductions, the summarizer prompt needs more specificity.
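The selection step itself is small; a sketch under the same column assumptions as above:

```python
# Sketch: pick the items worth summarizing (novelty + substance >= 8).
import csv
from pathlib import Path

scores_path = Path("corpus/<docket-slug>/pipeline/01-scored/scores.csv")

with scores_path.open() as f:
    keep = [row["filename"] for row in csv.DictReader(f)
            if int(row["novelty"]) + int(row["substance"]) >= 8]

print(f"{len(keep)} items clear the bar:", keep)
```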
This is the spine of the session. Two genuinely independent sub-agents fact-check each summary, with explicit instructions to disagree where they can.
In Claude:
For each .md file in corpus/<docket-slug>/pipeline/02-summarized/:
Dispatch two sub-agents IN PARALLEL with these distinct prompts.
Agent A (model: haiku, role: factual-error finder):
"Read this summary and the underlying comment file at
pipeline/00-raw/<filename>. Find every factual claim in the summary
that's wrong, misattributed, or unsupported by the source. Be
aggressive — your job is to find errors, not to be balanced."
Agent B (model: haiku, role: missing-context finder):
"Read this summary and the underlying comment file. Find every place
where the summary omits context that would change a careful reader's
interpretation, or overstates the comment's confidence. Be aggressive
— your job is to find what's missing, not to be balanced."
After both agents return, write
pipeline/03-fact-checked/<filename>.md containing:
- The original summary
- Agent A's findings
- Agent B's findings
- A confidence label: "clean" (both agents found nothing material),
"agreed-issue" (both agents flagged the same issue), or "disputed"
(only one agent flagged something).
After the loop, give me a count: N clean, N agreed-issue, N disputed.
Claude: parallel matters. Sequential dispatch wastes the independence — the second agent can be subtly biased by the first’s output if they share context. The Task tool runs sub-agents in parallel by default. Use that.
The output is rich. Clean items go to the report as-is. Agreed-issue items get the agreed correction folded in before the report. Disputed items get held — the user reviews them manually and decides.
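If it helps to see the routing concretely, here is a sketch of the label decision the parent makes once both agents return. The findings format (a list of issue strings per agent) and the overlap test are assumptions for illustration, not the Task tool's actual output shape; the case where both agents flag different issues is not defined by the labels above, so it is treated conservatively as disputed here.

```python
# Sketch of the confidence-label decision. Assumes each agent's findings were
# reduced to a list of issue strings (empty list = nothing material found).
def confidence_label(agent_a_issues: list[str], agent_b_issues: list[str]) -> str:
    if not agent_a_issues and not agent_b_issues:
        return "clean"          # both agents found nothing material
    if agent_a_issues and agent_b_issues:
        # Both flagged something; call it "agreed-issue" only if they point at
        # the same problem (string match here stands in for the parent's judgment).
        overlap = {i.strip().lower() for i in agent_a_issues} & \
                  {i.strip().lower() for i in agent_b_issues}
        return "agreed-issue" if overlap else "disputed"
    return "disputed"           # only one agent flagged something

# Illustrative calls:
print(confidence_label([], []))                                # -> clean
print(confidence_label(["misstates the cost estimate"], []))   # -> disputed
```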
The two-fault-line preview here is direct: agent A is your hallucination defense; agent B is your underclaim/overclaim defense. Session 2.7 formalizes both into reusable skills.
In Claude:
Read every clean and agreed-issue summary in
pipeline/03-fact-checked/. Synthesize a one-page corpus-report.md with:
- A two-sentence overall framing of what the high-scoring comments
collectively argue.
- 3-5 themes, each with the 1-2 strongest comments cited by filename.
- A "minority view" section for any high-scoring comment that cuts
against the majority.
- A "disputed items, for human review" footer listing the disputed
summaries by filename.
Save to corpus/<docket-slug>/corpus-report.md.
Read it. The report should be the thing the user would have wanted at the start — a fluent summary of the substantive comments, footnoted to specific items, with disputes flagged rather than buried.
Name these out loud:

- pipeline/00-raw → 01-scored → 02-summarized → 03-fact-checked. Each stage has its own folder; downstream stages read from upstream folders only. Reproducibility, debuggability, restartability.
- claude -p "<prompt referencing the pipeline>" from a second terminal — no interactive UI, just a job that runs and writes output. Session 2.8 schedules this; today, just notice the option.
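If you want to see the headless shape now, here is a sketch; the prompt text is illustrative, and the only assumption is that the claude CLI is on your PATH:

```python
# Sketch: drive one pipeline stage headlessly (the same thing Session 2.8 will schedule).
# The prompt text is illustrative.
import subprocess

prompt = (
    "Read corpus/<docket-slug>/pipeline/01-scored/scores.csv and draft summaries "
    "for any item with novelty + substance >= 8 that is not yet in 02-summarized/."
)

# `claude -p` runs non-interactively and prints the result to stdout.
subprocess.run(["claude", "-p", prompt], check=True)
```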
Three things to try this week:

- prompts/corpus-pipeline.md captures Steps A–E. By Session 2.8 you'll wire this as a slash command + scheduled job.

Feedback.
The user submits feedback at https://docs.google.com/forms/d/e/1FAIpQLSdJLxiyFT58MkkkGq3ZUvquelZcGvUkTIoVi5Zn-vUdLpZC2A/viewform.
Claude: paste the URL into chat. The form mirrors the questions below. Collect answers conversationally first, then have the user click through and submit.
Tell the user: “Your instructor uses these to tailor next week’s session.”
Sub-agents are not free. Each fan-out adds tokens. The discipline is to use a cheaper model for cheaper work — haiku for scoring, opus only for the synthesis report. Specify model per task; don’t let everything inherit the parent.
Parallelism breaks if you share state by accident. Two fact-check agents with distinct prompts but reading from the same scratch file are not independent. Each agent gets its own input file path; outputs go to distinct files; the parent does the merge.
100 items is the threshold. Below ~30 items, reading by hand is fine; the pipeline is overkill. Above 100, you can’t physically read everything; the pipeline pays for itself. In between, judgment.
The pipeline is recoverable mid-run. If the network drops at stage 2, the stage-1 scores are still on disk; resume from there. Stages reading from upstream folders means restartability is free.
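A sketch of what resuming can look like in practice: check which stage outputs already exist on disk and start from the first gap (folder names follow the layout above; the docket slug is illustrative):

```python
# Sketch: find the first pipeline stage whose output is missing and resume there.
from pathlib import Path

root = Path("corpus/<docket-slug>/pipeline")
stage_outputs = [
    "00-raw/manifest.json",   # ingest done
    "01-scored/scores.csv",   # scoring done
    "02-summarized",          # summaries done (non-empty folder)
    "03-fact-checked",        # fact-check done (non-empty folder)
]

for rel in stage_outputs:
    out = root / rel
    done = out.exists() and (out.is_file() or any(out.iterdir()))
    print(f"{rel}: {'present' if done else 'missing, resume from this stage'}")
    if not done:
        break
```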
Pipelines age well. A pipeline you build today and
rerun in six months keeps working as long as the corpus source’s API
doesn’t change. The manifest.json and the per-stage folders
make it obvious what was last run when. This becomes the spine of the
brief in 2.8.