feat(trino): add column-level lineage on upstreamLineage to connector…#16292
feat(trino): add column-level lineage on upstreamLineage to connector…#16292alfiyas-datahub wants to merge 5 commits intomasterfrom
Conversation
|
Linear: ING-1695 |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
🔴 Meticulous spotted visual differences in 1 of 1198 screens tested: view and approve differences detected. Meticulous evaluated ~8 hours of user flows against your PR. Last updated for commit c3f0aa1. This comment will update as new commits are pushed. |
Bundle ReportBundle size has no change ✅ |
… sources - Emit fineGrainedLineages when include_column_lineage=True and schema available (tables and views) - Emit upstreamLineage for views (previously only siblings were emitted) - Add include_column_lineage config (default True) - Clarify in docstring: Siblings vs UpstreamLineage are distinct relations - Add LINEAGE_FINE capability for Table and View - Add unit tests in test_sql_common.py for Trino lineage and CLL Co-authored-by: Cursor <cursoragent@cursor.com>
…import order Co-authored-by: Cursor <cursoragent@cursor.com>
c3f0aa1 to
cacc1d9
Compare
Regenerated integration test golden files to include fineGrainedLineages in upstreamLineage aspects emitted by the new column-level lineage feature. Co-authored-by: Cursor <cursoragent@cursor.com>
Connector Tests ResultsAll connector tests passed for commit Autogenerated by the connector-tests CI pipeline. |
65130f4 to
11cf8ab
Compare
|
The Trino-specific unit tests added here (
|
Replace
|
Replace
|
Suggestion: Reuse schema from parent instead of re-fetchingThe CLL here builds a 1:1 field mapping — the same field path is used for both the upstream (connector source) and downstream (Trino) dataset. This relies on the assumption that field paths are identical on both sides. If that assumption holds, there's no need to re-fetch columns from the inspector. The parent Instead, you could intercept the workunits from def _process_table(self, dataset_name, inspector, schema, table, sql_config, data_reader):
schema_metadata = None
for wu in super()._process_table(
dataset_name, inspector, schema, table, sql_config, data_reader
):
sm = wu.get_aspect_of_type(SchemaMetadataClass)
if sm is not None:
schema_metadata = sm
yield wu
if self.config.ingest_lineage_to_connectors:
# ... use schema_metadata directly for CLL, no redundant inspector callsSame approach would apply to Could you check whether this would be a valid approach? It would eliminate the duplicate inspector queries and the redundant |
skrydal
left a comment
There was a problem hiding this comment.
Thank you for the contribution, please see my comments.
| ) | ||
| assert len(workunits) == 1 | ||
| upstream_lineage = workunits[0].get_aspect_of_type(UpstreamLineageClass) | ||
| assert upstream_lineage is not None |
There was a problem hiding this comment.
to avoid calling getattr you could assert here isinstance.
Summary
Adds column-level lineage (CLL) on the upstreamLineage from Trino to connector sources (e.g. Iceberg, Hive) and emits upstreamLineage for views (previously only siblings were emitted). Addresses ING-1608.
Solution
New config include_column_lineage (default True). When set and schema is available, gen_lineage_workunit emits fineGrainedLineages with a 1:1 column mapping in addition to table-level upstreams.
Tables: build schema and pass it into gen_lineage_workunit when emitting lineage to a connector.
Views: now call gen_lineage_workunit as well (so they get upstreamLineage and optional CLL, not only siblings).
Unit tests in test_sql_common.py cover CLL on/off and schema present/absent.