Project context for AI coding agents (Cursor, Claude Code, etc.). If you're
a human, README.md is the better entry point.
cxg-census-mcp is a Model Context Protocol (MCP) server that lets LLM
agents query the CZ CELLxGENE Discover Census
single-cell atlas with ontology-aware filters, cost caps, and per-response
provenance. It speaks MCP over stdio.
It is independent — not affiliated with CZI, EMBL-EBI, or any government
agency. Every tool response carries attribution and unaffiliated fields
that downstream code must preserve.
src/cxg_census_mcp/
  tools/       # thin MCP wrappers, no business logic
  planner/     # FilterSpec → QueryPlan; cost estimate; tier routing
  ontology/    # OLS4 client + local hint overlay + CL/UBERON/MONDO expansion
  execution/   # Tier 0 facet counts | Tier 1 chunked obs scan |
               # Tier 2 expression aggregate | Tier 9 refuse → snippet
  clients/     # OLS4 (HTTPS) + Census/SOMA wrappers
  caches/      # SqliteKV-backed OLS / facet / plan caches + filter LRU
  models/      # Pydantic models incl. ResponseEnvelope (provenance + caveats)
uv sync --extra dev # install everything except live Census
uv sync --extra dev --extra census # add cellxgene_census + tiledbsoma
make lint typecheck test # ruff + mypy + pytest (mock mode)
make audit # pip-audit on locked prod deps
make cov # tests + coverage

make test runs in mock mode (CXG_CENSUS_MCP_MOCK_MODE=1) and uses
deterministic fixtures. make test-live hits real OLS / Census and is
gated by the live pytest marker.
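
As a rough illustration of how a mock-mode switch like this could be consumed, the sketch below reads CXG_CENSUS_MCP_MOCK_MODE from the environment and returns a fixed value when it is set. The helper names is_mock_mode and fetch_cell_count, the fixture value, and the choice to treat only "1" as enabled are assumptions made for this example, not code from the repository.

```python
import os
from typing import Mapping, Optional


def is_mock_mode(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when CXG_CENSUS_MCP_MOCK_MODE is exactly "1" (assumed convention)."""
    env = os.environ if env is None else env
    return env.get("CXG_CENSUS_MCP_MOCK_MODE", "0") == "1"


def fetch_cell_count(env: Optional[Mapping[str, str]] = None) -> int:
    """Hypothetical client call that short-circuits to a fixture in mock mode."""
    if is_mock_mode(env):
        # Deterministic fixture value, invented for this sketch.
        return 1234
    # The live path is deliberately left out; it would need the census extra.
    raise RuntimeError("live Census access is not part of this sketch")
```

Passing the environment as an argument keeps the helper easy to test without mutating os.environ.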
- Comments explain why, not what. Don't narrate code.
- No new runtime deps without strong justification. This is a single user-installable tool; every dep is a Docker layer and a CVE surface.
- Cache values are JSON-serializable. The KV layer (caches/_sqlite_kv.py) json-encodes; storing arbitrary pickle values would re-introduce CVE-2025-69872. See SECURITY.md.
- Every tool returns ResponseEnvelope. Includes data, query_provenance, attribution, unaffiliated, disclaimer, call_id, defaults_applied, warnings. Don't return raw payloads.
- Tools are async, planner is async, execution is async. Stay in asyncio; don't introduce threadpools.
- Caps live in Settings (config.py). Don't hard-code limits in tools.
- Don't bypass the planner. Tools must call plan_query(...) first; it is responsible for ontology resolution + tier selection + cost estimate + refusal logic.
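
The JSON-only cache rule can be pictured with a small standard-library sketch. JsonKV below is a hypothetical stand-in, not the contents of caches/_sqlite_kv.py; it only shows how json-encoding on write rejects values that are not JSON-serializable before they reach disk.

```python
import json
import sqlite3
from typing import Any


class JsonKV:
    """Hypothetical SQLite-backed key/value store that persists JSON text only."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )

    def set(self, key: str, value: Any) -> None:
        # json.dumps raises TypeError for objects it cannot encode,
        # so nothing pickle-like is ever written.
        encoded = json.dumps(value)
        self._db.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, encoded)
        )
        self._db.commit()

    def get(self, key: str, default: Any = None) -> Any:
        row = self._db.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        return default if row is None else json.loads(row[0])
```

With this shape, a call such as JsonKV().set("plan", object()) fails with TypeError instead of silently serializing an arbitrary Python object.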
- New module under src/cxg_census_mcp/tools/yourtool.py. Follow the existing pattern: validate input, call planner, run execution, build envelope, register call_id.
- Export from tools/__init__.py.
- Register in server.py so MCP tools/list advertises it.
- Integration test under tests/integration/test_yourtool.py (the _isolated_env fixture handles cache + mock mode for you).
- Update docs/tool-reference.md.
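
The tool pattern named in the first step (validate input, call planner, run execution, build envelope) might look roughly like the sketch below. Everything here is a stand-in: Envelope mimics the ResponseEnvelope field names listed earlier using a dataclass rather than Pydantic, plan_query and run_plan are placeholders whose real signatures are not shown in this file, and the placeholder strings are not the real attribution text. Registering the call_id is not modeled.

```python
import asyncio
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class Envelope:
    """Stand-in for ResponseEnvelope with the documented field names."""
    data: Any
    query_provenance: dict
    attribution: str
    unaffiliated: str
    disclaimer: str
    call_id: str
    defaults_applied: dict = field(default_factory=dict)
    warnings: list = field(default_factory=list)


async def plan_query(filters: dict) -> dict:
    """Placeholder planner; the real one also resolves ontology terms and costs."""
    return {"tier": 0, "filters": filters, "refused": False}


async def run_plan(plan: dict) -> dict:
    """Placeholder execution step returning a made-up payload."""
    return {"n_cells": 0, "tier": plan["tier"]}


async def yourtool(filters: dict) -> dict:
    # 1. Validate input.
    if not isinstance(filters, dict) or not filters:
        raise ValueError("filters must be a non-empty dict")
    # 2. Call the planner first; the tool never reads Census directly.
    plan = await plan_query(filters)
    # 3. Run execution for the planned tier.
    data = await run_plan(plan)
    # 4. Build the envelope with a fresh call_id.
    envelope = Envelope(
        data=data,
        query_provenance={"plan": plan},
        attribution="<ATTRIBUTION>",
        unaffiliated="<UNAFFILIATED>",
        disclaimer="<DISCLAIMER>",
        call_id=uuid.uuid4().hex,
    )
    return asdict(envelope)


if __name__ == "__main__":
    print(asyncio.run(yourtool({"tissue": "lung"})))
```

Keeping every step async matches the convention of staying in asyncio end to end.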
- LICENSE and the trademark / unaffiliated notices in README.md, __init__.py, models/provenance.py. They're load-bearing legally.
- data/ontology_hints.json and data/facet_catalog.json. They're refreshed by scheduled GitHub Actions, not edited by hand.
- The ATTRIBUTION / UNAFFILIATED / DISCLAIMER strings in src/cxg_census_mcp/__init__.py. They surface in every response.
- src/cxg_census_mcp/server.py: MCP wire protocol, tool dispatch.
- src/cxg_census_mcp/planner/query_plan.py: plan_query, the brain.
- src/cxg_census_mcp/execution/tier{0,1,2}_*.py: actual Census reads.
- src/cxg_census_mcp/ontology/resolver.py: text → CURIE.
- Read docs/architecture.md and docs/tool-reference.md first.
- Then look at how an existing similar tool / planner branch is wired and copy the pattern.
- The mock-mode fixtures in clients/census.py (_mock_*) are the canonical reference for the data shape.