Skip to content

feat(trino): add column-level lineage on upstreamLineage to connector…#16292

Open
alfiyas-datahub wants to merge 5 commits intomasterfrom
feat/trino-column-lineage
Open

feat(trino): add column-level lineage on upstreamLineage to connector…#16292
alfiyas-datahub wants to merge 5 commits intomasterfrom
feat/trino-column-lineage

Conversation

@alfiyas-datahub
Copy link
Contributor

@alfiyas-datahub alfiyas-datahub commented Feb 20, 2026

Summary

Adds column-level lineage (CLL) on the upstreamLineage from Trino to connector sources (e.g. Iceberg, Hive) and emits upstreamLineage for views (previously only siblings were emitted). Addresses ING-1608.

Solution

New config include_column_lineage (default True). When set and schema is available, gen_lineage_workunit emits fineGrainedLineages with a 1:1 column mapping in addition to table-level upstreams.
Tables: build schema and pass it into gen_lineage_workunit when emitting lineage to a connector.
Views: now call gen_lineage_workunit as well (so they get upstreamLineage and optional CLL, not only siblings).
Unit tests in test_sql_common.py cover CLL on/off and schema present/absent.

@github-actions
Copy link
Contributor

Linear: ING-1695

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Feb 20, 2026
@codecov
Copy link

codecov bot commented Feb 20, 2026

Codecov Report

❌ Patch coverage is 88.23529% with 4 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...ngestion/src/datahub/ingestion/source/sql/trino.py 88.23% 4 Missing ⚠️

📢 Thoughts on this report? Let us know!

@alwaysmeticulous
Copy link

alwaysmeticulous bot commented Feb 20, 2026

🔴 Meticulous spotted visual differences in 1 of 1198 screens tested: view and approve differences detected.

Meticulous evaluated ~8 hours of user flows against your PR.

Last updated for commit c3f0aa1. This comment will update as new commits are pushed.

@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Feb 20, 2026
@codecov
Copy link

codecov bot commented Feb 20, 2026

Bundle Report

Bundle size has no change ✅

alfiyas-datahub and others added 2 commits February 23, 2026 10:25
… sources

- Emit fineGrainedLineages when include_column_lineage=True and schema available (tables and views)
- Emit upstreamLineage for views (previously only siblings were emitted)
- Add include_column_lineage config (default True)
- Clarify in docstring: Siblings vs UpstreamLineage are distinct relations
- Add LINEAGE_FINE capability for Table and View
- Add unit tests in test_sql_common.py for Trino lineage and CLL

Co-authored-by: Cursor <cursoragent@cursor.com>
…import order

Co-authored-by: Cursor <cursoragent@cursor.com>
Regenerated integration test golden files to include fineGrainedLineages
in upstreamLineage aspects emitted by the new column-level lineage feature.

Co-authored-by: Cursor <cursoragent@cursor.com>
@rajatoss
Copy link
Member

rajatoss commented Feb 25, 2026

Connector Tests Results

All connector tests passed for commit 11cf8ab

View full test logs →

Autogenerated by the connector-tests CI pipeline.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Feb 25, 2026
@alfiyas-datahub alfiyas-datahub force-pushed the feat/trino-column-lineage branch from 65130f4 to 11cf8ab Compare February 25, 2026 10:54
@skrydal
Copy link
Collaborator

skrydal commented Feb 27, 2026

The Trino-specific unit tests added here (test_trino_gen_lineage_workunit_*, test_trino_process_table_*, test_trino_process_view_*) should live in a dedicated Trino test file (e.g., tests/unit/test_trino.py) rather than test_sql_common.py.

test_sql_common.py tests shared SQL infrastructure (foreign keys, SQLAlchemy URI mapping, fine-grained lineage in the base class, etc.). These new tests import and exercise TrinoSource and TrinoConfig directly — they're connector-specific, not shared infrastructure.

@skrydal
Copy link
Collaborator

skrydal commented Feb 27, 2026

Replace Optional[Any] with Optional[SchemaMetadataClass]

Three occurrences use Optional[Any] where the type is always SchemaMetadataClass:

  1. gen_lineage_workunit parameter: schema_metadata: Optional[Any] = None
  2. _process_table local: schema_metadata: Optional[Any] = None
  3. _process_view local: schema_metadata_view: Optional[Any] = None

get_schema_metadata() already returns SchemaMetadataClass, so erasing this to Any disables all downstream type checking. The project style guide explicitly says: "Avoid Any type — use specific types."

Fix — change all three to Optional[SchemaMetadataClass] and add SchemaMetadataClass to the existing datahub.metadata.schema_classes import block in trino.py (it's already imported in the test file):

def gen_lineage_workunit(
    self,
    dataset_urn: str,
    source_dataset_urn: str,
    schema_metadata: Optional[SchemaMetadataClass] = None,
) -> Iterable[MetadataWorkUnit]:
schema_metadata: Optional[SchemaMetadataClass] = None   # in _process_table
schema_metadata_view: Optional[SchemaMetadataClass] = None  # in _process_view

@skrydal
Copy link
Collaborator

skrydal commented Feb 27, 2026

Replace getattr duck typing with direct attribute access

Once schema_metadata is typed as Optional[SchemaMetadataClass] (see previous comment), the getattr fallback in gen_lineage_workunit (trino.py) becomes unnecessary:

# Current:
and getattr(schema_metadata, "fields", None)

# With proper typing, simplify to:
and schema_metadata.fields

SchemaMetadataClass.fields is a known attribute (always present, defaults to []), so after the is not None check, direct access is safe and mypy can verify it.

@skrydal
Copy link
Collaborator

skrydal commented Feb 27, 2026

Suggestion: Reuse schema from parent instead of re-fetching

The CLL here builds a 1:1 field mapping — the same field path is used for both the upstream (connector source) and downstream (Trino) dataset. This relies on the assumption that field paths are identical on both sides.

If that assumption holds, there's no need to re-fetch columns from the inspector. The parent _process_table (in sql_common.py) already calls inspector.get_columns(), get_schema_fields(), and get_schema_metadata() for the same table via the same Trino inspector. The Trino override then duplicates all of those calls just to get the field paths.

Instead, you could intercept the workunits from super() and extract the already-emitted SchemaMetadataClass:

def _process_table(self, dataset_name, inspector, schema, table, sql_config, data_reader):
    schema_metadata = None
    for wu in super()._process_table(
        dataset_name, inspector, schema, table, sql_config, data_reader
    ):
        sm = wu.get_aspect_of_type(SchemaMetadataClass)
        if sm is not None:
            schema_metadata = sm
        yield wu

    if self.config.ingest_lineage_to_connectors:
        # ... use schema_metadata directly for CLL, no redundant inspector calls

Same approach would apply to _process_view.

Could you check whether this would be a valid approach? It would eliminate the duplicate inspector queries and the redundant try/except blocks for schema building.

Copy link
Collaborator

@skrydal skrydal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution, please see my comments.

)
assert len(workunits) == 1
upstream_lineage = workunits[0].get_aspect_of_type(UpstreamLineageClass)
assert upstream_lineage is not None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to avoid calling getattr you could assert here isinstance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ingestion PR or Issue related to the ingestion of metadata pending-submitter-response Issue/request has been reviewed but requires a response from the submitter

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants