This is the handoff brief for Claude Code (Insiders). The hackathon is shipped and won. Snowflake's role is over. This document is the spec for migrating to a hobby-budget stack while preserving every public contract.
Read AGENTS.md before writing code. Read PRD.md for product intent. This doc supersedes the Snowflake-specific sections of AGENTS.md once a phase is complete; update AGENTS.md in Phase 4.
| Concern | Decision | Rationale |
|---|---|---|
| Storage backend | Cloudflare R2 | 10GB free, $0 egress, fits "could go viral, won't always". User has Cloudflare account already. GCS rejected because all infra is in us-east1 and GCS always-free is us-central only. |
| Query engine | DuckDB embedded in FastAPI | No server, no daemon, queries Parquet over httpfs. Column pruning + predicate pushdown over HTTP make this viable for low-traffic. |
| Data format | Parquet | 5-10x compression vs CSV, columnar reads, universal. |
| LLM (runtime) | Claude Sonnet 4.6 with prompt caching | NL→SQL not needed; tool-using agent is the model. Sonnet for capability, prompt cache for cost. |
| LLM (build-time prose) | Claude Sonnet 4.6, output committed | Mine narratives baked once per mine, never called at request time. |
| Compute host | Cloud Run stays (with config fix) | Rewriting FastAPI as TS Workers for $3/mo savings is not worth the time. |
| Frontend host | Cloud Run stays for now | Cloudflare Pages migration is a separate concern, not in this brief. Note in TODO if user wants it later. |
| Public endpoint contracts | Unchanged | /mine-for-me, /ask, /h3-density, /emissions/{plant} keep current request/response shapes. Frontend code does not change. |
| Degraded-mode contract | Unchanged | Per-subregion fallback JSON (19 files) and the degraded: true flag stay. Add new fallback paths only if a new failure mode is introduced. |
- No new features. This is a stack migration, not a product change.
- No new endpoints. Same four routes, same request/response shapes.
- No frontend rewrite. Footer credit copy update only (Phase 4).
- No Cloudflare Workers / Pages migration. Separate effort.
- No changes to the eGRID GeoJSON, the 19 fallback JSONs, the semantic model YAML's question-pattern coverage (the YAML itself goes away in Phase 3, but the same patterns must still be answerable).
- No "TODO-gated" half-finished work. AGENTS.md §6 stands. If a phase can't fully ship, scope it down — don't merge a stub.
Deliverables:
- R2 bucket created. Suggested name:
unearthed-data. Region: pick the auto / nearest. Public access: off (DuckDB authenticates via S3-compatible keys). - Cloudflare R2 API token issued. Scope: Object Read + Object Write on the bucket only. Not the account-wide token.
- Anthropic API key issued, stored in Cloud Run as
ANTHROPIC_API_KEYsecret via Secret Manager. - R2 credentials stored in Cloud Run as
R2_ACCESS_KEY_ID,R2_SECRET_ACCESS_KEY,R2_ENDPOINT,R2_BUCKETsecrets. - Bucket layout decided and documented in this file:
unearthed-data/
├── raw/
│ ├── msha_mines.parquet
│ ├── msha_quarterly_production.parquet
│ ├── msha_accidents.parquet
│ ├── eia_923_fuel_receipts.parquet
│ ├── eia_860_plants.parquet
│ └── plant_subregion_lookup.parquet
├── mrt/
│ ├── v_mine_for_plant.parquet
│ ├── v_mine_for_subregion.parquet
│ ├── mine_plant_for_subregion.parquet
│ └── emissions_by_plant.parquet
└── prose/
└── mine_narratives.json (pre-baked, committed to R2 not the repo)
Acceptance: aws s3 ls --endpoint-url $R2_ENDPOINT s3://unearthed-data/ returns an empty bucket. Cloud Run revision has all secrets bound (verify via gcloud run services describe).
Goal: Prove the R2 + DuckDB pattern end-to-end on the simplest endpoint before committing to it for all four.
Deliverables:
- Export script (
scripts/export_snowflake_to_parquet.pyor similar). Reads each Snowflake table once, writes Parquet locally, validates row counts againstSELECT COUNT(*)from Snowflake. Must fail loudly on count mismatch. No silent data loss. - Upload script (
scripts/upload_to_r2.pyormaketarget). Uses boto3 against R2's S3-compatible endpoint. Idempotent. app/data_client.py— new module. DuckDB connection lifecycle (one connection per FastAPI worker, lazyINSTALL httpfs; LOAD httpfs;, R2 credentials from env). Helper functions for parameterized parquet reads. Do not expose rawcon.execute()outside this module — keep query construction here so SQL injection surface stays narrow (mirrors the SELECT-only validation discipline of the currentsnowflake_client)./emissions/{plant}migrated. Hand-written DuckDB SQL againsts3://unearthed-data/mrt/emissions_by_plant.parquet. Same response shape. Same EIA→EPA name normalization (LIKEprefix matching, parenthetical stripping).- Tests: existing
test_new_endpoints.py::TestEmissions*must pass without modification. Add a unit test fordata_client.query_emissions_for_plantthat hits a fixture parquet file checked intotests/fixtures/. - Old code path stays alive.
snowflake_client.get_emissions_for_plantis not removed yet — Phase 5 does the deletion. Endpoint just stops calling it.
Acceptance:
pytest tests/integration/test_new_endpoints.py::TestEmissionsgreen.curl localhost:8001/emissions/COLSTRIPreturns identical JSON shape and values to current Snowflake-backed response (spot-check 5 plants).- Cold-start
/emissions/{plant}latency under 1s (DuckDB needs to fetch the parquet's footer the first time; subsequent queries reuse the open connection). - No regressions in the other three endpoints.
Goal: All non-LLM endpoints serve from R2 + DuckDB. /ask stays on Cortex Analyst until Phase 3.
Deliverables:
/mine-for-me— querys3://unearthed-data/mrt/mine_plant_for_subregion.parquet. Same 19-subregion fallback contract viaassets/fallback/<subregion>.json. Cortex Complete prose generation is NOT yet replaced — this endpoint still callsprose_client.generate_proseagainst Snowflake. That's fine; replaced in Phase 3./h3-density— DuckDB does not have a native H3 extension equivalent to Snowflake'sH3_LATLNG_TO_CELL_STRING. Use theh3-pylibrary to compute hex cell IDs at query time, OR pre-compute hex assignments in Phase 1's export and store as a Parquet column. Decide and document which. Pre-computing is cheaper at runtime; computing at query time is more flexible if resolution becomes a tunable. Pre-computing wins for hobby cost.- Same
totalssecond-query pattern stays — unfiltered registry counts go through DuckDB the same way. - Cortex Complete H3 summary stays on Snowflake until Phase 3.
- Tests: all existing
/h3-densityand/mine-for-metests pass unmodified, including the degraded-mode tests and thesummary_degradedflag behavior.
Acceptance:
- All three non-LLM endpoints return DuckDB-sourced data with identical JSON shapes.
- Snowflake connection still alive (used by
/askand prose endpoints). make test-cigreen.
Goal: No more Snowflake calls. /ask becomes a Claude tool-using agent. Mine narratives and H3 summary become build-time-baked or one-shot Claude calls.
Deliverables:
-
app/llm_client.py— Anthropic SDK wrapper. One client per worker. Prompt caching enabled on the system prompt + tool catalog. Model:claude-sonnet-4-6. -
app/tools.py— tool catalog. Functions Claude can call. Each is a thin Python wrapper overdata_client.pyqueries. Suggested first cut:lookup_mine_by_id(mine_id: int) -> dictmines_supplying_plant(plant_name: str, year: int | None = None) -> list[dict]production_history(mine_id: int, year_range: tuple[int, int] | None = None) -> list[dict]safety_record(mine_id: int) -> dict— fatalities, injuries, days lostemissions_for_plant(plant_name: str) -> dictsearch_mines(state: str | None = None, county: str | None = None, status: str | None = None) -> list[dict]top_supplier_in_state(state: str) -> dictsubregion_breakdown(subregion_id: str) -> dict
These cover the six question patterns in AGENTS.md §9 plus comfortable freeform headroom. Tools must enforce the 500-row cap and 10s timeout that the current Snowflake session-level guards provide. Do this in Python with explicit limits in the SQL and
concurrent.futurestimeouts. -
app/skill.md(or a constant inllm_client.py, but a markdown file is easier to iterate on) — system prompt covering:- Domain primer (US coal supply chain, MSHA vs EIA vs EPA roles, what each ID means)
- Tool catalog with when-to-use guidance
- Reasoning patterns (question → tool sequence)
- Output rules (format, citation of mine IDs, "I don't know" triggers)
- Refusal patterns (no investment advice, no predictions, no data outside the corpus, no scope creep into adjacent topics)
- Scope reminder: this is a US coal lookup, not a general energy chatbot
-
/askrewrite. Endpoint accepts{question, subregion_id?}(unchanged). Invokes Claude with tools. Response shape:{ "answer": "...", "tools_used": ["mines_supplying_plant", "safety_record"], "suggestions": [...], "degraded": false, "error": null }The
sqlfield from the current Cortex Analyst response goes away — the frontend's CortexChat component will need a small update to drop the SQL display block. Coordinate with the frontend update in Phase 4. -
Pre-baked mine narratives. New script
scripts/bake_mine_narratives.py. For each mine inMSHA_MINES, generate the safety prose Cortex Complete used to write — but call Claude once at build time. Outputprose/mine_narratives.jsonkeyed by mine ID, upload to R2. Commit the script, do not commit the JSON output (it lives in R2)./mine-for-mereads from R2 and falls back to a template if the mine ID is missing. -
H3 summary. Two options: (a) bake it per state at build time and store in
prose/h3_summaries.json, OR (b) one-shot Claude call at request time with the cell counts. Bake-time wins for cost; pick (a). Thesummary_degradedflag still fires if the prose is missing for the requested scope. -
Honest attribution preserved. AGENTS.md §3's rule that "the 'Cortex, on this map' byline never sits over template prose" becomes "the 'Claude, on this map' byline never sits over template prose." Same contract, new label. Frontend copy updates in Phase 4.
Acceptance:
- All four endpoints serve without any Snowflake call. Verify with
grep -r 'snowflake' app/returning only the deadsnowflake_client.py(Phase 5 deletion). /askanswers the six AGENTS.md §9 question patterns correctly. Add a regression test per pattern.- Realistic per-question cost measured and logged. Document actual numbers in this file.
Deliverables:
- Question-hash cache for
/ask. SHA256 of normalized question (lowercased, whitespace-collapsed) → cached response. Backend: in-memoryOrderedDictLRU capped at 1024 entries (matches the 256-entry pattern of_emissions_cachebut bigger because question space is larger). Optional: persist to R2 ascache/ask/<hash>.jsonwith a 24h TTL; only worth it if cold-start traffic is non-trivial. - Rate limit on
/ask.slowapimiddleware. 5/min/IP, 50/day/IP. Return 429 with a polite message that points to the chips. Document in AGENTS.md §2. - Cloud Run config. Update the deploy command / Terraform /
gcloud run services updateinvocation to set--min-instances=0and--cpu-throttling. Verify Cloud Run cost drops. - Doc sync — non-negotiable, all in this PR:
AGENTS.md§1 (project constraints): hackathon constraint section can soften — mark deadline as "originally 2026-04-20, since shipped." Cost-optimization rule replaces shipping rule.AGENTS.md§2 (architecture): rewrite to describe DuckDB + R2 + Claude tool agent. Drop the two-Snowflake-roles section. Update degraded-mode table.AGENTS.md§3 (Snowflake): delete the section. Add a §3 (Data Layer) covering DuckDB conventions (read-onlydata_client.pyboundary, noSELECT *, parameterized queries).AGENTS.md§8 (Snowflake MCP): delete.AGENTS.md§9 (semantic model): delete; the question patterns it lists move into the skill prompt as reasoning patterns.PRD.md: data architecture section, update.system-diagram.md: regenerate the Mermaid.frontend/src/routes/+layout.sveltefooter: drop the Snowflake Cortex link, add "Claude (Anthropic) · DuckDB · Cloudflare R2" credits.CortexChat.svelte→ consider rename toAskChat.svelteor similar. Drop the SQL-display block. Keep the chips. Coordinate with/askresponse shape change.frontend/e2e/fixtures.jsmockBackend→ update/askmock response shape.
Acceptance:
make lint,make test,pnpm test,pnpm test:e2e,pnpm lhciall green.- Lighthouse thresholds (a11y 1.0, SEO 1.0, BP 0.98, perf 0.90) hold.
- One-week observation window: Cloud Run + Anthropic + R2 spend matches projected ~$5/mo or document why not.
Deliverables:
- Delete
app/snowflake_client.pyandapp/prose_client.py(Snowflake-specific). - Drop
snowflake-connector-pythonfrompyproject.toml. Runuv syncand verify nothing imports it. - Remove Snowflake secrets from Cloud Run (
SNOWFLAKE_*env vars and the private-key secret). - Suspend the Snowflake account via the Snowflake console. Do not delete it for one billing cycle in case of post-cutover issue. After 30 days clean, delete.
- Remove
snowflake-mcp-config.yamlfrom the repo. Remove the Snowflake MCP server section from any local.mcp.jsonor Claude Code config. - CHANGELOG entry describing the migration.
Acceptance:
grep -ri snowflake .returns only this file, the CHANGELOG entry, and historical commits.- Cloud Run service has no Snowflake env vars.
- Anthropic + R2 + Cloud Run are the only three line items on the bill.
/asksubregion_id?— RESOLVED. Stay explicit.- Donate platforms — RESOLVED. Buy Me a Coffee (
buymeacoffee.com/anchildress1) + GitHub Sponsors (github.com/sponsors/anchildress1). Both wired in the footer. - Data expansion — see new section below. Confirm subset before Phase 3 starts.
This is the user's "I want the actual fatality reports, incidents, and how they classify the major shit" requirement. It expands the data corpus before Phase 3's tool catalog is finalized — adding tools after the agent is shipped is fine, but it's cheaper to do it once if we know the scope now.
Shipped on phase-3a-msha-data-expansion:
scripts/msha_scrape_index.py— paginates the MSHA fatality search filtered to coal (field_mine_category_target_id=191), one year at a time, writesdata/msha/manifest.csv. Polite 1.0s throttle, identifying User-Agent.scripts/msha_scrape_interstitial.py— for each manifest row with a final report, fetches the interstitial HTML and parses out structured metadata + section text + PDF citation URL.scripts/msha_build_fatality_parquet.py— joins manifest + interstitial JSON intomrt/fatality_narratives.parquet(UPPERCASE column convention, one row per fatality).app/data_client.py::query_fatalities_for_mine,::query_recent_fatalities— read-only DuckDB helpers the runtime agent will call..github/workflows/refresh-msha-fatalities.yml— annual cron + manual dispatch.
Deferred to a follow-up branch:
- Tier 1 enforcement datasets (Violations, 107(a) Orders, Assessed Violations, Contested Violations) — MSHA's bulk zip files at
arlweb.msha.gov/OpenGovernmentData/DataSets/*.zipship with malformed CDFH/LFH headers (Central Directory offset off by ~750 bytes, garbage compression method0x808, wrong filename length). Standardzipfile,bsdtar,unzip,zipfile-deflate64, andzipfile-inflate64all reject the archives. Working around requires either (a) a custom byte-level CD walker that ignores the EOCD record, or (b) finding a non-zip MSHA endpoint. Scoped out of Phase 3.A v1 so the fatality narratives could ship.
1. No build-time AI extraction.
The original Phase 3.A spec called for Claude to extract fact_lines and cause_category from PDF text at build time. The runtime-synthesis approach the user proposed wins: store raw extracted prose in the parquet, let the agent synthesize the answer at request time inside its existing /ask tool-call loop. Zero AI cost in the build path. Synthesis style adapts to the question ("give me a list" vs "what happened" vs "compare to other mines"). Re-extracts cost nothing on refresh.
2. No PDF parsing.
Late discovery during the build: every Final Report has a parallel HTML rendering at /data-reports/fatality-reports/<year>/<slug>/final-report. The page exposes a .field--name-body block whose content mirrors the source PDF exactly — same h2 sections, same paragraphs. That collapsed two planned pipeline steps (download PDF + extract text via PyMuPDF) into one HTML fetch and removed the AGPL viral-copyleft concern entirely. PDFs are still cited via URL in the parquet so the agent can surface a "[MSHA report ↗]" chip on every answer; the PDFs themselves never land in R2.
Three tiers of relevant data, all public:
Tier 1 — Structured bulk data (pipe-delimited text files, MSHA Open Government Data Portal):
| Dataset | Why it matters | Rough size |
|---|---|---|
| Violations (since 2000) | Every citation issued. Filterable to S&S (Significant & Substantial — the "could reasonably cause serious injury" tier). The mine's enforcement history. | ~5M+ rows, ~200-500 MB Parquet |
| 107(a) Orders (since 2000) | Imminent-danger withdrawal orders. The legal "stop work" hammer — when MSHA forcibly shuts down a section. This is the "major shit" classification. | ~few thousand rows, trivial size |
| Inspections (since 2000) | Linked to Violations by Event Number. Gives "what triggered the citation." Optional — only worth ingesting if we want inspection-history surface. | Large, defer unless needed |
| Assessed Violations | Penalty amounts. Lets the agent answer "how much has mine X been fined?" | ~few hundred MB Parquet |
| Contested Violations | Which citations the operator is fighting. Adds nuance to enforcement history. | Smaller |
Tier 2 — Per-incident summary records (Data.gov "MSHA Fatality Reports" dataset):
Structured rows with date, mine, location, accident type, mined material, victim count. Fast to ingest. Already a superset of the user's current MSHA_ACCIDENTS table for fatal events specifically.
Tier 3 — Narrative investigation documents (the gold):
| Document type | Format | What it gives |
|---|---|---|
| Preliminary Accident Reports | HTML/PDF, published days after a fatality | First-pass narrative, basic facts |
| Fatalgrams | 1-page PDF, educational | "Best practices" framing of the incident |
| Fatal Investigation Reports | Multi-page PDF, published 6-18 months later | Full root cause, contributing factors, citations issued, recommendations. The narrative the agent can quote. |
The historical TICL archive holds ~24,000 investigation documents (~33,000 victims) going back to MSHA's predecessor agencies.
The four-part subset below was the original ingestion plan. Items 1-3 + 5 (the Tier 1 enforcement datasets) are the deferred work; item 4 (fatality narratives) shipped, with the runtime-synthesis architecture noted above replacing the build-time AI extraction the original draft called for.
Pick a defensible scope and ship it; do not try to ingest everything:
-
Violations dataset, filtered to S&S, joined on
MINE_ID→ new Parquetmrt/violations_ss_by_mine.parquet. Pre-aggregate "S&S count by mine, by year." Tool:safety_violations(mine_id, year_range=None). -
107(a) Orders dataset, full →
mrt/withdrawal_orders.parquet. Tool:withdrawal_orders(mine_id). Surface the most charged signal: "MSHA forced this mine to stop work N times since 2000." -
Assessed Violations totals by mine → roll into
mrt/violations_ss_by_mine.parquetas atotal_penaltiescolumn. Tool: extendsafety_violationsto include penalties. -
Fatal Investigation Reports for coal-related fatalities, last 15 years → scraped, text-extracted via Claude at build time, structured as
mrt/fatality_narratives.parquetwith columns:mine_idincident_datevictim_countcause_category(Claude-classified into a controlled vocab: roof_fall, methane_ignition, equipment, electrical, etc.)fact_lines(list of 2-5 short declarative sentences; see extraction rules below)contributing_factors(list, Claude-extracted from MSHA's findings — verbs, not adjectives)citations_issued(list of citation IDs, regex-extracted)report_source(one of:msha_final,state_wv,state_ky,state_pa,state_other,msha_preliminary)report_status(one of:final,preliminary)contest_status(one of:uncontested,contested,not_applicable—not_applicablefor state and preliminary reports where contest doesn't apply)report_pdf_url(link back to the source — rendered as a source chip on every card; chip label reflectsreport_source)
Estimated scope: ~200-400 PDFs for coal-only, ~50 MB Parquet after extraction.
Tool:
fatality_narrative(mine_id, date=None)returns the structured rows + URL to the source PDF. The agent surfaces facts; it does not paraphrase the report's prose, and it does not editorialize.
For each incident, the build pipeline picks ONE report and extracts only from it. Never blend sources — they can disagree, and stitching them produces composite "facts" no single agency endorses. Priority:
- MSHA Fatal Investigation Report (final) if it exists — extract from this and ONLY this. Set
report_source=msha_final,report_status=final, populatecontest_statusfrom MSHA's contest dataset (any citation issued in the report being contested =contested). - State investigation report (final) if no MSHA final exists yet — extract from the state report (WV, KY, PA, etc. produce their own). Set
report_source=state_<abbr>,report_status=final,contest_status=not_applicable(state reports follow a different appeals track; do not conflate). - MSHA Preliminary Accident Report as last resort if neither final exists yet — extract with explicit
report_status=preliminaryso the frontend can surface a "preliminary — subject to revision" label on the card.contest_status=not_applicable.
Re-run rule: when the build pipeline detects a final report has been published for an incident previously stored as preliminary or state-only, it replaces the row entirely. Never append. Never keep the old extraction "for history." Stale facts hurt more than missing ones.
-
Include Contested Violations as a
contest_statusfield onmrt/violations_ss_by_mine.parquet. Do not filter contested-but-unruled rows out — most S&S citations get contested at first and rulings take 2-5 years; filtering on "ruled only" guts the dataset to a narrow window of old data. Surface the contest status as a small label in the rendered card ("12 S&S citations, 4 currently contested"). Reader weighs it; the data isn't suppressed; the context is honest. -
Pass on for now: Inspections (huge, marginal value), Fatalgrams (Fatal Investigation Reports cover the same ground in more depth).
- New script:
scripts/scrape_msha_fatality_pdfs.py— pulls PDFs, deduplicates, stores raw underraw_pdfs/in R2 (separate prefix from the working corpus). Respectrobots.txt; throttle to 1-2 req/sec. - New script:
scripts/extract_fatality_narratives.py— for each PDF, calls Claude with a structured-extraction prompt, writes a row to the parquet. Build-time cost: ~$3-10 one-time for ~300 PDFs at Sonnet rates with prompt caching across the prompt template. - Refresh cadence: annual is fine for fatality reports (they trail by 6-18 months anyway). Quarterly for the structured tier-1 datasets.
- GitHub Action (or local make target) to run the refresh and upload to R2.
The extraction prompt lives in scripts/extract_fatality_narratives.py (or a sibling .md file the script reads). It must enforce:
- No PII. Replace all personal names — victim, foreman, supervisor, operator personnel, MSHA inspector — with role descriptors: "the miner", "the foreman", "the operator", "the inspector". Mine names and operator company names are public record and stay. If a PDF page has names in headers/captions, drop them at extraction.
- Strip the bureaucratic phrasing. Convert MSHA's regulator prose into short declarative sentences. No "the investigation determined that the entity failed to ensure compliance with..." — write "the operator did not follow the approved ventilation plan."
- Facts only. Each
fact_linesentry must be a fact MSHA established in the report. No softening ("may have"), no amplifying ("egregiously"), no editorial frame ("a tragic preventable loss"). - Sentences are short. Target 6-14 words per fact line. Long sentences are a signal you're paraphrasing instead of extracting.
- No comparison or pattern claims. "This was the third fatality at the mine since 2019" is a comparison the agent must compute from the data, not a claim the extractor pulls from the PDF — and even then it belongs in a separate
pattern_contextfield rendered with explicit data citations, not infact_lines. - Citations to MSHA are mandatory at the row level.
report_pdf_urlis required; rows without it are dropped.
- Cards live in their own section, not nested in the cost block. The cost block is the author's voice (yours). The fatality narrative cards are the data's voice (the source report's facts, surfaced cleanly). Visual separation prevents the editorial weight of the cost block from leaking into the narrative cards or vice versa.
- Each card renders: date + victim count as the eyebrow,
fact_linesas the card body in serif, a small mono source chip linkingreport_pdf_url(chip label varies:[MSHA report ↗]formsha_final,[WV state report ↗]forstate_wv,[MSHA preliminary ↗]formsha_preliminary, etc.), and acause_categorytag. - Status labels render when not the canonical case. If
report_status=preliminary, render a smallpreliminary — subject to revisionlabel on the card. Ifcontest_status=contested, render a smallfindings contestedlabel. Neither label suppresses the card; both add honest context. - Inline source chip on every card. No footer disclaimer required — the source is always one click away from any claim. (Footer-level "this site uses AI" copy still optional but no longer load-bearing.)
- Author voice belongs to the surrounding section chrome (heading, sub, transitions), not to any extracted card content. Mixing them is a bug.
Append these to the Phase 3 tool list:
safety_violations(mine_id: int, year_range: tuple[int, int] | None = None) -> dict— S&S count, total penalties, top violation typeswithdrawal_orders(mine_id: int) -> list[dict]— every 107(a) order, date and reasonfatality_narrative(mine_id: int, date: str | None = None) -> list[dict]— investigation summaries with MSHA PDF URLs
Teach Claude:
- The legal weight of a 107(a) order ("imminent danger withdrawal" — the strongest enforcement tool MSHA has).
- The S&S classification ("significant and substantial" — a citation tier predicting serious injury risk).
- The contest-status nuance: a contested-but-unruled citation is still data; surface the contest label, do not suppress.
- Fatality narratives are passed through, not paraphrased. When
fatality_narrativeis called, the agent renders thefact_linesas-is (or with minimal connective tissue) and includes the source URL chip. The agent does not rewrite, soften, amplify, or editorialize the facts. If the user asks "what happened," the answer is MSHA's facts plus the source link — full stop. - No comparative superlatives. "Most dangerous mine in WV" is editorial. "Highest S&S count in WV in 2023" is factual and requires citing the rank computation. Default to factual; refuse the editorial framing even if the user asks for it.
- Voice boundary. The agent never inserts editorial framing of its own ("tragically", "preventably", "shockingly"). The site's editorial voice belongs to the page copy, not to the data layer or the agent's responses. This is a hard line.
- Refusal pattern: never speculate beyond what the investigation report concludes; never compare across mines without a tool-computed basis; never name individuals.
Shipped (fatality narratives slice):
mrt/fatality_narratives.parquetlives in R2, refreshed annually byrefresh-msha-fatalities.yml.query_fatalities_for_mine(mine_id)andquery_recent_fatalities(state)available inapp/data_client.pyfor the upcoming Phase 3 agent tool catalog.- Spot-check the captured Leer Mine fixture: victim name redacted to role descriptor in OVERVIEW + CONCLUSION; ROOT CAUSE ANALYSIS preserved verbatim with role-only references; ENFORCEMENT ACTIONS surfaces 103(k) order text and citation IDs; PDF URL captured for the source chip.
- Source-priority rule still designed-for but only one tier (
msha_final) implemented in v1 — state and preliminary reports stay out of the parquet for now (the search index does include them, so the manifest carries the flags; a future commit can extract preliminary section text from the same interstitial path).
Deferred (Tier 1 enforcement slice):
- 4 new tools (
safety_violations,withdrawal_orders,fatality_narrativealready shipped, plus contested + assessed wiring) — onlyfatality_narrativeis live; the other three depend on the deferred bulk datasets. /ask "which mines have been shut down by 107(a) orders"will not resolve until Tier 1 ships.
Frontend (Phase 4):
- Fatality cards rendering in a dedicated section with source chip on every card — designed but not yet built; lives in Phase 4 with the rest of the
/askrewrite.
- MSHA Open Government Initiative Portal — bulk pipe-delimited datasets
- MSHA Data and Reports — entry point
- MSHA Fatality Reports search — per-incident reports
- Data.gov MSHA Fatality Reports dataset — structured fatality summaries
- Data.gov MSHA Accident Injuries Data Set — current source for the existing
MSHA_ACCIDENTStable