fix(ingest/powerbi): stop CTE alias leaking as upstream in native SQL lineage #17700
Summary
Supersedes #17606 (which can be closed).
PowerBI native-SQL lineage emitted the CTE alias as a real upstream table and dropped the actual source tables when the custom SQL was a single CTE (`WITH … SELECT`). It triggered when the CTE had blank lines before its closing `SELECT` and/or a semicolon inside a SQL comment (e.g. `-- done; continue`).
Root cause
`parse_custom_sql` decided "single vs multiple statements" with two fragile heuristics:
- `remove_tsql_control_statements` replaced inter-statement separators (`GO`/`DROP`/`USE`/`SET`) with empty strings, leaving only an ambiguous blank line between statements.
- `parse_custom_sql` then inserted `;` before any blank-line `SELECT` and routed on a substring `";" in query` check.
That substring check fired on semicolons inside comments, and the blank-line insertion split a CTE's closing `SELECT` away from its `WITH` clause, so sqlglot resolved the CTE alias as a real table.
Fix
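The failure mode described under Root cause can be reproduced without a SQL parser. The sketch below is a hypothetical reconstruction of the two old heuristics: the regex, the function names, and the sample queries are assumptions made for illustration, not the code that was removed.

```python
import re

# Hypothetical stand-ins for the two pre-fix heuristics. The real patterns
# used by parse_custom_sql may have differed.
BLANK_LINE_SELECT = re.compile(r"\n\s*\n(\s*SELECT\b)", re.IGNORECASE)


def insert_separators(query: str) -> str:
    """Insert ';' before any SELECT that follows a blank line."""
    return BLANK_LINE_SELECT.sub(r";\n\1", query)


def looks_multi_statement(query: str) -> bool:
    """Naive routing check: any ';' counts, even one inside a comment."""
    return ";" in query


# A single CTE whose closing SELECT sits after a blank line.
cte_with_blank_line = (
    "WITH recent AS (\n"
    "    SELECT id FROM sales.orders\n"
    ")\n"
    "\n"
    "SELECT id FROM recent"
)

# A single CTE with a semicolon that only appears inside a comment.
cte_with_comment = (
    "WITH recent AS (\n"
    "    SELECT id FROM sales.orders -- done; continue\n"
    ")\n"
    "SELECT id FROM recent"
)

# The blank-line rule cuts the CTE in two, so the second piece reads
# `recent` as if it were a real table.
pieces = [p.strip() for p in insert_separators(cte_with_blank_line).split(";")]

# The substring rule misroutes a single statement as multi-statement.
misrouted = looks_multi_statement(cte_with_comment)
```

Here `pieces` ends up with two entries, the second being a bare `SELECT id FROM recent`, and `misrouted` is `True` even though the query is one statement.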
Replace the heuristics with boundary-preserving cleanup + grammar-aware parsing:
- `remove_tsql_control_statements` now replaces each stripped separator with `;` instead of deleting it, preserving the statement boundary as a real terminator (with a string-safe collapse of duplicate/leading semicolons).
- `parse_custom_sql` decides single- vs multi-statement by actually parsing (sqlglot, dialect-aware, with a default-dialect fallback for platforms without a sqlglot dialect such as `odbc`). Single statements (CTE / `UNION` / plain `SELECT`) go through the single-statement parser untouched, so blank-line formatting can never split them. Multiple statements are split with the grammar- and CTE-aware `split_statements`. A blank-line regex remains only as a last-resort fallback for genuinely separator-less statements (e.g. `SELECT … INTO #temp` followed by a `SELECT`).
Why a new PR instead of iterating on #17606
#17606 fixed the symptom with a line-by-line blank-line heuristic (`_insert_statement_separators` / `_scan_line_depth` / `_has_real_semicolons`). That approach still mis-handled set operations (`UNION`/`EXCEPT`), comments between statements, a comment before `WITH`, and `;` inside string literals. This PR addresses the root cause instead, with less code, and is verified to have no regression vs the current behaviour while fixing those additional cases.
Behaviour
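As an illustration of the boundary-preserving cleanup described under Fix, the sketch below replaces `GO` lines with `;` and then collapses leading and duplicate semicolons outside string literals. It is a minimal sketch under stated assumptions: the function names and the `GO` pattern are hypothetical, `DROP`/`USE`/`SET` handling is omitted, and SQL comments are not treated specially.

```python
import re

# Assumption: only a bare GO line is treated as a separator here. The PR also
# strips DROP/USE/SET, whose exact patterns are not shown in the description.
GO_LINE = re.compile(r"^[ \t]*GO[ \t]*$", re.IGNORECASE | re.MULTILINE)


def replace_separators(sql: str) -> str:
    """Swap each GO line for ';' so the statement boundary survives."""
    return GO_LINE.sub(";", sql)


def collapse_semicolons(sql: str) -> str:
    """Drop leading and duplicate ';' while leaving string literals alone."""
    out = []
    in_string = False
    last_significant = ";"  # acts as a virtual terminator: drops leading ';'
    for ch in sql:
        if ch == "'":
            # A doubled '' escape toggles twice, so the state is unchanged.
            in_string = not in_string
        if ch == ";" and not in_string and last_significant == ";":
            continue  # redundant terminator outside a string literal
        out.append(ch)
        if in_string or not ch.isspace():
            last_significant = ch
    return "".join(out).strip()


def clean(sql: str) -> str:
    """Hypothetical two-step cleanup: preserve boundaries, then tidy them."""
    return collapse_semicolons(replace_separators(sql))


raw = (
    "GO\n"
    "WITH a AS (SELECT 1 AS x)\n"
    "\n"
    "SELECT x FROM a\n"
    "GO\n"
    "GO\n"
    "SELECT ';;' AS lit\n"
    "GO\n"
)
cleaned = clean(raw)
```

In `cleaned`, the blank line inside the CTE is left alone, the two adjacent `GO` lines become a single `;`, and the `';;'` literal keeps both of its semicolons.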
- … (`;`-in-comment).
- … `GO`-separated CTEs.
- … `master` on any tested case.
Testing
Added/updated regression tests in `metadata-ingestion/tests/integration/powerbi/test_native_sql_parser.py`:
- `;`-in-comment (the reported bug)
- `;`-separated multi-statement query
- `GO`-separated CTE through the real cleanup pipeline
- `SELECT … INTO #temp` + blank-line `SELECT` (separator-less fallback)
Test relocation
Moved `metadata-ingestion/tests/integration/powerbi/test_native_sql_parser.py` → `metadata-ingestion/tests/unit/powerbi/test_native_sql_parser.py`.
These are pure-logic tests (they only call `get_tables` / `remove_*` / `parse_custom_sql`, with no Docker, graph, or network), but living under `tests/integration/` meant `conftest.py` auto-marked them `integration`.