Session 2.6: Track a corpus you can’t read alone

About 60 minutes. Open a Claude Code session in ~/ai-training and hand it this guide:

Read the file at /Users/<you>/ai-training/week-6-guide.md (or wherever you saved it) and walk me through Session 2.6.
I've completed Sessions 2.1 through 2.5.

Posture: public, synthetic, or personal data only. Today’s corpus is public — comments on a regulations.gov docket, a folder of NBER abstracts, public consultation responses, court filings, anything published in volume. Nothing client-internal, even abstracted.


Practice task

By the end of this session you will have, inside ~/ai-training/corpus/<docket-slug>/:

  1. pipeline/00-raw/ — 100+ public comments downloaded as individual files.
  2. pipeline/01-scored/scores.csv — every comment scored 1–5 on novelty and substance, with a one-sentence justification.
  3. pipeline/02-summarized/ — one-paragraph summaries for the items that scored 4–5 only.
  4. pipeline/03-fact-checked/ — the same summaries after two parallel sub-agent fact-checkers ran on each. Items where the two agents agreed are clean; items where they disagreed are flagged for your review.
  5. corpus-report.md — a one-page synthesis of what the high-scoring comments collectively say.

Four stages, one folder per stage, one report at the end. The pipeline is the artifact. Re-running on a different docket means swapping out 00-raw/ and pressing go.

A production version of this is a daily 2 AM scan across eight NBER sections, scoring each new working paper 1–5, drafting threads only for the 4s and 5s, running two parallel fact-check agents on every drafted thread, and emailing the 5+ best out on Saturday at 6 AM in a format reviewable on a phone in 10 minutes. The corpus is ~30–60 papers per week. Without the pipeline, that’s a full afternoon of reading. With it, it’s a 10-minute mobile review.

The pattern transfers cleanly. The piece you’re learning today is not “how to score NBER papers”; it’s the four-stage pipeline that turns “too much to read” into “five things worth reading, with confidence.”


Why score before draft

The mistake people make on a 100-comment docket: they ask Claude to summarize all 100. Two hours and a lot of tokens later, they have 100 summaries — most of which are noise, because most public comments on most dockets are noise. The summaries weren’t worth writing.

The unlock: score first, draft second. Scoring a comment 1–5 with one sentence costs almost nothing. Drafting a real summary costs a lot more. By scoring everything and drafting only the high-scorers, you spend tokens where they earn return.

Same logic at the next stage: fact-check the drafts you wrote, not the corpus. Two parallel fact-checkers on the 5–10 high-scoring items is a 10-minute job. Fact-checking 100 noise items would be 5 hours and find almost nothing.

The pipeline is the discipline. Each stage filters; each filter removes work from the next stage.
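The economics above can be sketched in a few lines. A minimal back-of-envelope comparison, where the per-item token counts are illustrative assumptions, not measurements:

```python
# Illustrative token budgets (assumed numbers, not measured):
SCORE_TOKENS = 300    # score one comment 1-5 with a one-sentence justification
DRAFT_TOKENS = 3000   # draft a real one-paragraph summary

def cost_draft_everything(n_items):
    # naive approach: summarize the whole corpus
    return n_items * DRAFT_TOKENS

def cost_score_then_draft(n_items, n_high_scorers):
    # pipeline approach: cheap triage on everything, drafts only for winners
    return n_items * SCORE_TOKENS + n_high_scorers * DRAFT_TOKENS

print(cost_draft_everything(100))        # 300,000 tokens
print(cost_score_then_draft(100, 10))    # 60,000 tokens
```

On these assumed numbers, scoring all 100 and drafting 10 costs a fifth of drafting all 100, and the gap widens as the corpus grows.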


Step 1: Pick the docket (5 minutes)

You need a public corpus with at least 100 items where the items are distinct and short enough to score individually. Good options: comments on a regulations.gov docket, a month of NBER working-paper abstracts, public consultation responses, court filings, or issues on a large public GitHub repo.

Pick one in or near your working domain. The exact corpus matters less than the size and structure.


Step 2: Hand Claude this guide

From inside ~/ai-training, start claude and paste:

Read the file at /Users/<you>/ai-training/week-6-guide.md (or wherever you saved it) and walk me through the Reference Sections.
I've picked <docket-name> as the corpus, with ~<N> items.

Claude takes over.


Reference Sections

Claude: walk the user through the four-stage pipeline. Peer tone. Two rules:

  1. Don’t try to read the whole corpus into one prompt. The whole point of the pipeline is per-item scoring without loading everything into context. Use sub-agents or shell loops for the per-item work.
  2. The fact-check stage uses two genuinely independent sub-agents, not one. The disagreement between them is the signal.

Design: this session introduces sub-agents (the Task tool — fan out to parallel workers), the two-terminal / headless pattern (claude -p for non-interactive runs), and the score-then-draft pipeline shape. 2.7 builds on the parallel-agent pattern (formalized as dueling fact-checkers). 2.8 reuses this exact pipeline, scheduled.


Confirm the setup

Before any pipeline work:

  1. The user has a docket / corpus picked, and you both know roughly how many items it contains.
  2. /refresh from Session 2.5 still works — Claude has read CLAUDE.md and MEMORY.md and knows what kind of project this is.
  3. The user has at least 4–5 GB of free disk and is on a stable connection. Downloading 100+ files is bandwidth-light but failure-prone if Wi-Fi drops mid-batch.

Step A — Stage 0, raw download (10 minutes)

The shape: every item from the corpus lands as a separate file in corpus/<docket-slug>/pipeline/00-raw/. One item per file, named with a stable ID (the comment ID, the issue number, the paper ID — whatever the source uses).

In Claude:

Create the directory corpus/<docket-slug>/pipeline/00-raw/. Then download
every public comment from <docket URL or API endpoint> as a separate
.txt or .json file in that directory. Name each file by its native ID
(e.g. EPA-HQ-OAR-2026-0014.txt). After download, write
corpus/<docket-slug>/pipeline/00-raw/manifest.json listing every file with
its size and a SHA-256 hash.

Claude: regulations.gov has a public API at api.regulations.gov; GitHub has the gh CLI; NBER has RSS. Pick the right transport for the corpus the user chose. Do NOT scrape the HTML site — use the API. Confirm with the user that the count of files in 00-raw/ matches the expected docket size.

If a few items fail to download (network glitch, rate limit), retry; don’t move on with a half-corpus.
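The manifest step can be sketched as follows. A minimal version, assuming items were already downloaded as individual files into pipeline/00-raw/ (the directory path and file naming here are illustrative):

```python
import hashlib
import json
from pathlib import Path

def write_manifest(raw_dir):
    """List every downloaded file with its size and SHA-256 hash."""
    raw_dir = Path(raw_dir)
    entries = []
    for f in sorted(raw_dir.iterdir()):
        if not f.is_file() or f.name == "manifest.json":
            continue
        data = f.read_bytes()
        entries.append({
            "file": f.name,
            "size": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    (raw_dir / "manifest.json").write_text(json.dumps(entries, indent=2))
    return entries
```

Comparing len(entries) against the expected docket size is the "does the count match" check, and the hashes let a later rerun detect items that changed at the source.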


Step B — Stage 1, score everything (15 minutes)

This is the first place sub-agents pay for themselves. You don't want to score 100 items sequentially in one Claude conversation: context fills up and the model's calibration drifts. Instead, fan out.

The shape: one sub-agent per ~10 items, each agent scores its batch, scores merge into one CSV.

In Claude:

Read corpus/<docket-slug>/pipeline/00-raw/manifest.json. Split the items into
batches of 10. For each batch, dispatch a sub-agent (Task tool, model: haiku
for cost) with this prompt:

  "Read each of these public comment files. For each, score 1-5 on (a)
   novelty (does it add something new or just restate prior comments?) and
   (b) substance (is the argument concrete and supported, or vague?). For
   each item, return: filename, novelty (1-5), substance (1-5), one-sentence
   justification. Output as CSV rows."

Collect all sub-agent outputs into a single CSV at
corpus/<docket-slug>/pipeline/01-scored/scores.csv with columns:
filename, novelty, substance, justification.

Sort by (novelty + substance) descending. Show me the top 10 and the
bottom 10 so I can sanity-check the calibration.

Claude: dispatch the sub-agents in parallel using the Task tool. Use a cheap model (haiku) for the scoring — 100 short scoring decisions don’t need full Opus. The user reviews the top/bottom sample to confirm the model’s calibration matches theirs. If the calibration is off, tighten the rubric in the sub-agent prompt and re-run.

The “top 10 and bottom 10” sanity check is non-negotiable. If the bottom 10 contains items the user thinks are obviously substantive, the rubric needs work before any drafts are written.
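The merge-and-rank step at the end of this stage can be sketched directly. A minimal version, assuming each sub-agent returns CSV text with the scores.csv columns named above (filename, novelty, substance, justification):

```python
import csv
from io import StringIO

def merge_and_rank(batch_csvs):
    """Merge per-batch CSV outputs and sort by combined score, descending."""
    rows = []
    for blob in batch_csvs:
        for row in csv.DictReader(StringIO(blob)):
            row["novelty"] = int(row["novelty"])
            row["substance"] = int(row["substance"])
            rows.append(row)
    rows.sort(key=lambda r: r["novelty"] + r["substance"], reverse=True)
    return rows

def calibration_sample(rows, k=10):
    """The non-negotiable sanity check: top k and bottom k for human review."""
    return rows[:k], rows[-k:]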


Step C — Stage 2, summarize the high-scorers (10 minutes)

Take items where (novelty + substance) ≥ 8 — typically 5–15 items out of 100.

In Claude:

Read corpus/<docket-slug>/pipeline/01-scored/scores.csv. Filter to items
with novelty + substance >= 8. For each filtered item, read the raw
comment from pipeline/00-raw/ and write a one-paragraph summary covering:
the position taken, the strongest evidence cited, the implicit
assumptions, and why this comment scored high. Save each as
pipeline/02-summarized/<filename>.md.

Claude: this is the expensive stage in token cost; it’s also where the value sits. Write each summary carefully — these are the things the user will actually read.

Read 3 of the summaries together with the user. They should feel substantive, not generic. If they read like Wikipedia introductions, the summarizer prompt needs more specificity.
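The filter that selects which items get the expensive treatment is one line of logic. A sketch, assuming rows shaped like scores.csv above; the threshold of 8 is the one used in this step, but tune it per corpus:

```python
def high_scorers(rows, threshold=8):
    """Keep only items whose combined novelty + substance clears the bar."""
    return [
        r for r in rows
        if int(r["novelty"]) + int(r["substance"]) >= threshold
    ]
```

On a typical 100-item docket this yields the 5-15 items worth a real summary; if it yields 40, the rubric is too generous and should be tightened before drafting.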


Step D — Stage 3, parallel fact-check (15 minutes)

This is the spine of the session. Two genuinely independent sub-agents fact-check each summary, with explicit instructions to disagree where they can.

In Claude:

For each .md file in corpus/<docket-slug>/pipeline/02-summarized/:

  Dispatch two sub-agents IN PARALLEL with these distinct prompts.

  Agent A (model: haiku, role: factual-error finder):
    "Read this summary and the underlying comment file at
    pipeline/00-raw/<filename>. Find every factual claim in the summary
    that's wrong, misattributed, or unsupported by the source. Be
    aggressive — your job is to find errors, not to be balanced."

  Agent B (model: haiku, role: missing-context finder):
    "Read this summary and the underlying comment file. Find every place
    where the summary omits context that would change a careful reader's
    interpretation, or overstates the comment's confidence. Be aggressive
    — your job is to find what's missing, not to be balanced."

  After both agents return, write
  pipeline/03-fact-checked/<filename>.md containing:
    - The original summary
    - Agent A's findings
    - Agent B's findings
    - A confidence label: "clean" (both agents found nothing material),
      "agreed-issue" (both agents flagged the same issue), or "disputed"
      (only one agent flagged something).

After the loop, give me a count: N clean, N agreed-issue, N disputed.

Claude: parallel matters. Sequential dispatch wastes the independence — the second agent can be subtly biased by the first’s output if they share context. The Task tool runs sub-agents in parallel by default. Use that.

The output is rich. Clean items go to the report as-is. Agreed-issue items get the agreed correction folded in before the report. Disputed items get held — the user reviews them manually and decides.

The two-fault-line preview here is direct: agent A is your hallucination defense; agent B is your underclaim/overclaim defense. Session 2.7 formalizes both into reusable skills.
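The confidence-label logic can be sketched as a small function. A simplified version, assuming each agent returns a list of findings (empty list means nothing material); note the full pipeline also checks whether both agents flagged the *same* issue before calling it agreed, which this sketch skips:

```python
def confidence_label(findings_a, findings_b):
    """Map two agents' findings to the clean / agreed-issue / disputed labels."""
    if not findings_a and not findings_b:
        return "clean"          # both agents found nothing material
    if findings_a and findings_b:
        return "agreed-issue"   # both flagged something (same-issue check omitted)
    return "disputed"           # exactly one agent flagged something

def tally(labels):
    """The end-of-loop count: N clean, N agreed-issue, N disputed."""
    return {lab: labels.count(lab) for lab in ("clean", "agreed-issue", "disputed")}
```

The asymmetry is deliberate: "disputed" is not a failure state, it is the signal that routes an item to the user's manual review queue.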


Step E — Synthesize the report (5 minutes)

In Claude:

Read every clean and agreed-issue summary in
pipeline/03-fact-checked/. Synthesize a one-page corpus-report.md with:

  - A two-sentence overall framing of what the high-scoring comments
    collectively argue.
  - 3-5 themes, each with the 1-2 strongest comments cited by filename.
  - A "minority view" section for any high-scoring comment that cuts
    against the majority.
  - A "disputed items, for human review" footer listing the disputed
    summaries by filename.

Save to corpus/<docket-slug>/corpus-report.md.

Read it. The report should be the thing the user would have wanted at the start — a fluent summary of the substantive comments, footnoted to specific items, with disputes flagged rather than buried.


Micro-skills introduced

Name these out loud:

  1. Sub-agent fan-out: the Task tool dispatching parallel workers, each with its own batch.
  2. Score-then-draft: cheap triage on every item, expensive drafting only for the high scorers.
  3. Disagreement as signal: two independent fact-checkers whose divergence flags items for human review.
  4. Headless runs: claude -p for non-interactive stages, previewing the scheduled version in 2.8.


Wrapping up Session 2.6

Three things to try this week:

  1. Run the pipeline on a second corpus. Different docket, different domain. Notice what breaks — usually the rubric needs tuning per corpus, but the four-stage shape doesn’t.
  2. Watch for the disputed items. Over the week, manually review whatever lands in the disputed footer. Are the agents catching real ambiguities, or generating false positives? Tighten the prompts based on what you see.
  3. Add a saved prompt. prompts/corpus-pipeline.md captures Steps A–E. By Session 2.8 you’ll wire this as a slash command + scheduled job.

Feedback

The user submits feedback at https://docs.google.com/forms/d/e/1FAIpQLSdJLxiyFT58MkkkGq3ZUvquelZcGvUkTIoVi5Zn-vUdLpZC2A/viewform.

Claude: paste the URL into chat. The form mirrors the questions below. Collect answers conversationally first, then have the user click through and submit.

  1. On a 1–5 scale, how useful did this session feel?
  2. Did the score-before-draft discipline feel like a real shift, or did it feel like extra structure for its own sake?
  3. The two-parallel-fact-checker pattern — was the disagreement signal actually useful? Did you end up reading any of the disputed items by hand?
  4. Of the 4-stage pipeline, which stage felt least convincing? Where would you push back?
  5. Did the corpus-report.md at the end feel like the artifact you wanted, or does it need a different shape?
  6. What confused you most this session?
  7. Anything you want covered in Session 2.7 that you didn’t see here?

Tell the user: “Your instructor uses these to tailor next week’s session.”


Good to know

Sub-agents are not free. Each fan-out adds tokens. The discipline is to use a cheaper model for cheaper work — haiku for scoring, opus only for the synthesis report. Specify model per task; don’t let everything inherit the parent.

Parallelism breaks if you share state by accident. Two fact-check agents with distinct prompts but reading from the same scratch file are not independent. Each agent gets its own input file path; outputs go to distinct files; the parent does the merge.

100 items is the threshold. Below ~30 items, reading by hand is fine; the pipeline is overkill. Above 100, you can’t physically read everything; the pipeline pays for itself. In between, judgment.

The pipeline is recoverable mid-run. If the network drops at stage 2, the stage-1 scores are still on disk; resume from there. Because each stage reads from the previous stage's output folder, restartability comes free.
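That resume logic is mechanical enough to sketch. A minimal version, assuming each stage's output folder doubles as its completion marker (folder and stage names match the pipeline layout above; the function name is illustrative):

```python
from pathlib import Path

STAGES = ["00-raw", "01-scored", "02-summarized", "03-fact-checked"]

def next_stage(pipeline_dir):
    """Return the first stage whose output folder is missing or empty,
    i.e. where a restarted run should pick up. None means all stages
    have output and only the report remains."""
    for stage in STAGES:
        out = Path(pipeline_dir) / stage
        if not out.exists() or not any(out.iterdir()):
            return stage
    return None
```

This is the whole restart story: no checkpoint files, no run database, just "first folder with nothing in it".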

Pipelines age well. A pipeline you build today and rerun in six months keeps working as long as the corpus source’s API doesn’t change. The manifest.json and the per-stage folders make it obvious what was last run when. This becomes the spine of the brief in 2.8.