feat(trino): add column-level lineage on upstreamLineage to connector… by alfiyas-datahub · Pull Request #16292 · datahub-project/datahub

alfiyas-datahub · 2026-02-20T05:54:26Z

Summary

Adds column-level lineage (CLL) on the upstreamLineage from Trino to connector sources (e.g. Iceberg, Hive) and emits upstreamLineage for views (previously only siblings were emitted). Addresses ING-1608.

Solution

New config include_column_lineage (default True). When set and schema is available, gen_lineage_workunit emits fineGrainedLineages with a 1:1 column mapping in addition to table-level upstreams.
Tables: build schema and pass it into gen_lineage_workunit when emitting lineage to a connector.
Views: now call gen_lineage_workunit as well (so they get upstreamLineage and optional CLL, not only siblings).
Unit tests in test_sql_common.py cover CLL on/off and schema present/absent.

github-actions · 2026-02-20T05:54:35Z

Linear: ING-1695

codecov · 2026-02-20T05:56:56Z

Codecov Report

❌ Patch coverage is 88.23529% with 4 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
...ngestion/src/datahub/ingestion/source/sql/trino.py	88.23%	4 Missing ⚠️

📢 Thoughts on this report? Let us know!

alwaysmeticulous · 2026-02-20T06:00:06Z

🔴 Meticulous spotted visual differences in 1 of 1198 screens tested: view and approve differences detected.

Meticulous evaluated ~8 hours of user flows against your PR.

_{Last updated for commit c3f0aa1. This comment will update as new commits are pushed.}

codecov · 2026-02-20T06:16:31Z

Bundle Report

Bundle size has no change ✅

… sources - Emit fineGrainedLineages when include_column_lineage=True and schema available (tables and views) - Emit upstreamLineage for views (previously only siblings were emitted) - Add include_column_lineage config (default True) - Clarify in docstring: Siblings vs UpstreamLineage are distinct relations - Add LINEAGE_FINE capability for Table and View - Add unit tests in test_sql_common.py for Trino lineage and CLL Co-authored-by: Cursor <cursoragent@cursor.com>

…import order Co-authored-by: Cursor <cursoragent@cursor.com>

Regenerated integration test golden files to include fineGrainedLineages in upstreamLineage aspects emitted by the new column-level lineage feature. Co-authored-by: Cursor <cursoragent@cursor.com>

rajatoss · 2026-02-25T06:08:39Z

Connector Tests Results

All connector tests passed for commit 11cf8ab

View full test logs →

Autogenerated by the connector-tests CI pipeline.

skrydal · 2026-02-27T12:33:47Z

The Trino-specific unit tests added here (test_trino_gen_lineage_workunit_*, test_trino_process_table_*, test_trino_process_view_*) should live in a dedicated Trino test file (e.g., tests/unit/test_trino.py) rather than test_sql_common.py.

test_sql_common.py tests shared SQL infrastructure (foreign keys, SQLAlchemy URI mapping, fine-grained lineage in the base class, etc.). These new tests import and exercise TrinoSource and TrinoConfig directly — they're connector-specific, not shared infrastructure.

skrydal · 2026-02-27T12:39:21Z

Replace `Optional[Any]` with `Optional[SchemaMetadataClass]`

Three occurrences use Optional[Any] where the type is always SchemaMetadataClass:

gen_lineage_workunit parameter: schema_metadata: Optional[Any] = None
_process_table local: schema_metadata: Optional[Any] = None
_process_view local: schema_metadata_view: Optional[Any] = None

get_schema_metadata() already returns SchemaMetadataClass, so erasing this to Any disables all downstream type checking. The project style guide explicitly says: "Avoid Any type — use specific types."

Fix — change all three to Optional[SchemaMetadataClass] and add SchemaMetadataClass to the existing datahub.metadata.schema_classes import block in trino.py (it's already imported in the test file):

def gen_lineage_workunit(
    self,
    dataset_urn: str,
    source_dataset_urn: str,
    schema_metadata: Optional[SchemaMetadataClass] = None,
) -> Iterable[MetadataWorkUnit]:

schema_metadata: Optional[SchemaMetadataClass] = None   # in _process_table
schema_metadata_view: Optional[SchemaMetadataClass] = None  # in _process_view

skrydal · 2026-02-27T12:42:58Z

Replace `getattr` duck typing with direct attribute access

Once schema_metadata is typed as Optional[SchemaMetadataClass] (see previous comment), the getattr fallback in gen_lineage_workunit (trino.py) becomes unnecessary:

# Current:
and getattr(schema_metadata, "fields", None)

# With proper typing, simplify to:
and schema_metadata.fields

SchemaMetadataClass.fields is a known attribute (always present, defaults to []), so after the is not None check, direct access is safe and mypy can verify it.

skrydal · 2026-02-27T12:49:00Z

Suggestion: Reuse schema from parent instead of re-fetching

The CLL here builds a 1:1 field mapping — the same field path is used for both the upstream (connector source) and downstream (Trino) dataset. This relies on the assumption that field paths are identical on both sides.

If that assumption holds, there's no need to re-fetch columns from the inspector. The parent _process_table (in sql_common.py) already calls inspector.get_columns(), get_schema_fields(), and get_schema_metadata() for the same table via the same Trino inspector. The Trino override then duplicates all of those calls just to get the field paths.

Instead, you could intercept the workunits from super() and extract the already-emitted SchemaMetadataClass:

def _process_table(self, dataset_name, inspector, schema, table, sql_config, data_reader):
    schema_metadata = None
    for wu in super()._process_table(
        dataset_name, inspector, schema, table, sql_config, data_reader
    ):
        sm = wu.get_aspect_of_type(SchemaMetadataClass)
        if sm is not None:
            schema_metadata = sm
        yield wu

    if self.config.ingest_lineage_to_connectors:
        # ... use schema_metadata directly for CLL, no redundant inspector calls

Same approach would apply to _process_view.

Could you check whether this would be a valid approach? It would eliminate the duplicate inspector queries and the redundant try/except blocks for schema building.

skrydal

Thank you for the contribution, please see my comments.

skrydal · 2026-02-27T13:01:09Z

metadata-ingestion/tests/unit/test_sql_common.py

+    )
+    assert len(workunits) == 1
+    upstream_lineage = workunits[0].get_aspect_of_type(UpstreamLineageClass)
+    assert upstream_lineage is not None


to avoid calling getattr you could assert here isinstance.

github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Feb 20, 2026

github-actions bot deployed to datahub-wheels (Preview) February 20, 2026 05:56 View deployment

github-actions bot deployed to datahub-project-web-react (Preview) February 20, 2026 05:59 View deployment

datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Feb 20, 2026

vercel bot deployed to Preview February 20, 2026 06:05 View deployment

github-actions bot deployed to datahub-wheels (Preview) February 20, 2026 10:16 View deployment

github-actions bot deployed to datahub-project-web-react (Preview) February 20, 2026 10:20 View deployment

vercel bot deployed to Preview February 20, 2026 10:29 View deployment

alfiyas-datahub and others added 2 commits February 23, 2026 10:25

test(trino): add coverage for _process_table/_process_view, fix lint …

cacc1d9

…import order Co-authored-by: Cursor <cursoragent@cursor.com>

alfiyas-datahub force-pushed the feat/trino-column-lineage branch from c3f0aa1 to cacc1d9 Compare February 23, 2026 04:56

github-actions bot deployed to datahub-wheels (Preview) February 23, 2026 04:58 View deployment

vercel bot deployed to Preview February 23, 2026 05:11 View deployment

test(trino): update golden files for column-level lineage

8d8ce39

Regenerated integration test golden files to include fineGrainedLineages in upstreamLineage aspects emitted by the new column-level lineage feature. Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions bot deployed to datahub-wheels (Preview) February 23, 2026 05:38 View deployment

vercel bot deployed to Preview February 23, 2026 05:51 View deployment

Merge branch 'master' into feat/trino-column-lineage

bf87cab

github-actions bot deployed to datahub-wheels (Preview) February 23, 2026 09:44 View deployment

vercel bot deployed to Preview February 23, 2026 09:57 View deployment

github-actions bot deployed to datahub-wheels (Preview) February 24, 2026 11:29 View deployment

vercel bot deployed to Preview February 24, 2026 11:42 View deployment

datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Feb 25, 2026

Merge branch 'master' into feat/trino-column-lineage

11cf8ab

alfiyas-datahub force-pushed the feat/trino-column-lineage branch from 65130f4 to 11cf8ab Compare February 25, 2026 10:54

github-actions bot deployed to datahub-wheels (Preview) February 25, 2026 10:56 View deployment

vercel bot deployed to Preview February 25, 2026 11:09 View deployment

skrydal requested changes Feb 27, 2026

View reviewed changes

skrydal reviewed Feb 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(trino): add column-level lineage on upstreamLineage to connector…#16292

feat(trino): add column-level lineage on upstreamLineage to connector…#16292
alfiyas-datahub wants to merge 5 commits intomasterfrom
feat/trino-column-lineage

alfiyas-datahub commented Feb 20, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Feb 20, 2026

Uh oh!

codecov bot commented Feb 20, 2026 •

edited

Loading

Uh oh!

alwaysmeticulous bot commented Feb 20, 2026 •

edited

Loading

Uh oh!

codecov bot commented Feb 20, 2026

Uh oh!

rajatoss commented Feb 25, 2026 •

edited

Loading

Uh oh!

skrydal commented Feb 27, 2026

Uh oh!

skrydal commented Feb 27, 2026

Uh oh!

skrydal commented Feb 27, 2026

Uh oh!

skrydal commented Feb 27, 2026

Uh oh!

skrydal left a comment

Uh oh!

skrydal Feb 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

alfiyas-datahub commented Feb 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Feb 20, 2026

Uh oh!

codecov bot commented Feb 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

alwaysmeticulous bot commented Feb 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Feb 20, 2026

Bundle Report

Uh oh!

rajatoss commented Feb 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Connector Tests Results

Uh oh!

skrydal commented Feb 27, 2026

Uh oh!

skrydal commented Feb 27, 2026

Replace Optional[Any] with Optional[SchemaMetadataClass]

Uh oh!

skrydal commented Feb 27, 2026

Replace getattr duck typing with direct attribute access

Uh oh!

skrydal commented Feb 27, 2026

Suggestion: Reuse schema from parent instead of re-fetching

Uh oh!

skrydal left a comment

Choose a reason for hiding this comment

Uh oh!

skrydal Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

alfiyas-datahub commented Feb 20, 2026 •

edited

Loading

codecov bot commented Feb 20, 2026 •

edited

Loading

alwaysmeticulous bot commented Feb 20, 2026 •

edited

Loading

rajatoss commented Feb 25, 2026 •

edited

Loading

Replace `Optional[Any]` with `Optional[SchemaMetadataClass]`

Replace `getattr` duck typing with direct attribute access