Enrich inferred schemas with canonical data dictionaries (closes #192) #214

Open: amc-corey-cox wants to merge 2 commits into main from dd-enrichment

@amc-corey-cox

Contributor

Closes #192. First step of the schema enrichment pipeline (#190): overlay declared metadata from a canonical DD onto an inferred LinkML schema.

schemauto enrich is the primary surface; --data-dictionary (repeatable) is wired into generalize-tsv/generalize-tsvs as a convenience. Matching is global by slot name; class-aware targeting is deferred until we have a real use case. The Pandera path is documented as a known limitation.
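For readers unfamiliar with what "global by slot name" implies, here is a minimal sketch of the idea, not the code in this PR. It assumes the schema and DD are plain dicts; the function name `overlay_by_slot_name`, the `OVERLAY_FIELDS` list, and the rule that a declared value replaces an inferred one are all hypothetical choices made for illustration.

```python
from copy import deepcopy

# Hypothetical field list; the PR does not enumerate which metadata it overlays.
OVERLAY_FIELDS = ("description", "title", "unit")


def overlay_by_slot_name(schema, dd_slots, fields=OVERLAY_FIELDS):
    """Copy declared metadata onto every inferred slot with a matching name."""
    enriched = deepcopy(schema)  # leave the caller's schema untouched
    matched = set()
    for cls in enriched.get("classes", {}).values():
        for slot_name, slot in cls.get("attributes", {}).items():
            declared = dd_slots.get(slot_name)
            if declared is None:
                continue  # this slot is not described by the DD
            matched.add(slot_name)
            for field in fields:
                if field in declared:
                    # Assumption: the declared value wins over the inferred one.
                    slot[field] = declared[field]
    # DD entries that matched no slot are returned so they can be reported.
    unmatched = sorted(set(dd_slots) - matched)
    return enriched, unmatched
```

Because the lookup key is only the slot name, a slot called `sex` receives the same metadata in every class that has it, which is the trade-off that class-aware targeting would later remove.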

The count-based heuristic on num_distinct_values lets enrichment refuse to upgrade a primitive range to an enum when the DD provably can't cover what inference observed; DD codes get stashed as a slot annotation in that case.
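The heuristic can be read as a pigeonhole argument: if inference saw more distinct values than the DD declares codes, the DD cannot cover the data. The sketch below illustrates that reading under stated assumptions and is not the PR's implementation. The function name `apply_dd_codes`, the dict shapes, and the annotation key `dd_declared_codes` are hypothetical; only `num_distinct_values` comes from the description above.

```python
def apply_dd_codes(slot, dd_codes, enum_name):
    """Upgrade a slot to an enum range only if the DD could cover the data.

    slot      -- dict for one inferred slot (hypothetical shape)
    dd_codes  -- dict mapping each declared code to its label
    enum_name -- name to use for the enum range if the upgrade goes ahead
    Returns (enriched_slot, enum_definition_or_None).
    """
    enriched = dict(slot)
    annotations = dict(slot.get("annotations", {}))
    observed = annotations.get("num_distinct_values")

    if observed is not None and int(observed) > len(dd_codes):
        # More distinct values were observed than the DD declares codes for,
        # so the DD provably cannot cover the data: keep the primitive range
        # and stash the declared codes on the slot instead.
        annotations["dd_declared_codes"] = sorted(dd_codes)  # hypothetical key
        enriched["annotations"] = annotations
        return enriched, None

    # Otherwise the DD may cover the data, so upgrade the range to an enum.
    enriched["range"] = enum_name
    enum_def = {
        "permissible_values": {
            code: {"description": label} for code, label in dd_codes.items()
        }
    }
    return enriched, enum_def
```

Note that the check is one-directional: a distinct count at or below the number of declared codes does not prove coverage, it only fails to rule it out.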

Test plan

  • pytest tests/ — 185 passed, 3 skipped, 9 xfailed
  • Smoke: schemauto generalize-tsv --data-dictionary <jhs.dd.yaml> jhs.tsv round-trips through yaml_loader.load(..., SchemaDefinition)

Copilot AI review requested due to automatic review settings May 22, 2026 20:00

Copilot AI left a comment

Contributor


Pull request overview

Adds the first step of the schema enrichment pipeline by overlaying canonical data dictionary (DD) declarations onto inferred LinkML schemas, exposing this via a new schemauto enrich command and --data-dictionary convenience flags on TSV generalization.

Changes:

  • Introduces schema_automator.enrichers.data_dictionary_enricher with an EnrichmentReport, metadata overlay rules, permissible-values merging, and an “incomplete DD enum” heuristic driven by observed distinct counts.
  • Wires repeatable --data-dictionary into generalize-tsv / generalize-tsvs and adds a standalone schemauto enrich CLI command.
  • Adds unit/integration/CLI tests plus realistic dbGaP-like fixtures (JHS) to validate enrichment behavior and discrepancy surfacing.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| schema_automator/enrichers/data_dictionary_enricher.py | New core enricher implementation + report structure + heuristic logic. |
| schema_automator/enrichers/__init__.py | Exposes enricher/report as package API. |
| schema_automator/generalizers/csv_data_generalizer.py | Records num_distinct_values annotation used by enrichment heuristic. |
| schema_automator/cli.py | Adds --data-dictionary options to generalize commands and new enrich command. |
| tests/test_enrichers/test_data_dictionary_enricher.py | Unit tests covering overlay, type/required handling, PV merge, and heuristic. |
| tests/test_enrichers/test_integration_jhs.py | End-to-end enrichment tests against JHS-shaped fixture. |
| tests/test_enrichers/test_cli.py | CLI integration tests for enrich and generalize convenience flags. |
| tests/test_enrichers/__init__.py | Initializes new test package. |
| tests/resources/dbgap/JHS_Subject.dd.yaml | Canonical DD fixture used in integration/CLI tests. |
| tests/resources/dbgap/JHS_Subject.data.tsv | TSV fixture used for inference + enrichment tests. |


Comment thread schema_automator/enrichers/data_dictionary_enricher.py Outdated
Comment thread schema_automator/enrichers/data_dictionary_enricher.py Outdated
Comment thread schema_automator/enrichers/data_dictionary_enricher.py Outdated
Comment thread schema_automator/cli.py
Development

Successfully merging this pull request may close these issues.

Ingest structured data dictionary to enrich inferred schemas
