Skip to content

fix(ingest/powerbi): stop CTE alias leaking as upstream in native SQL lineage#17700

Merged
aviraj-gour merged 5 commits into
masterfrom
fix/powerbi-native-sql-single-statement-detection
Jun 5, 2026
Merged

fix(ingest/powerbi): stop CTE alias leaking as upstream in native SQL lineage#17700
aviraj-gour merged 5 commits into
masterfrom
fix/powerbi-native-sql-single-statement-detection

Conversation

@aviraj-gour
Copy link
Copy Markdown
Contributor

@aviraj-gour aviraj-gour commented Jun 3, 2026

Summary

Supersedes #17606 (which can be closed).

PowerBI native-SQL lineage emitted the CTE alias as a real upstream table and dropped the actual source tables when the custom SQL was a single CTE (WITH … SELECT). It triggered when the CTE had blank lines before its closing SELECT and/or a semicolon inside a SQL comment (e.g. -- done; continue).

Root cause

parse_custom_sql decided "single vs multiple statements" with two fragile heuristics:

  1. remove_tsql_control_statements deleted inter-statement separators (GO/DROP/USE/SET) into empty strings, leaving only an ambiguous blank line between statements.
  2. parse_custom_sql then inserted ; before any blank-line SELECT and routed on a substring ";" in query check.

That substring check fired on semicolons inside comments, and the blank-line insertion split a CTE's closing SELECT away from its WITH clause — so sqlglot resolved the CTE alias as a real table.

Fix

Replace the heuristics with boundary-preserving cleanup + grammar-aware parsing:

  • remove_tsql_control_statements now replaces each stripped separator with ; instead of deleting it, preserving the statement boundary as a real terminator (with a string-safe collapse of duplicate/leading semicolons).
  • parse_custom_sql decides single- vs multi-statement by actually parsing (sqlglot, dialect-aware, with a default-dialect fallback for platforms without a sqlglot dialect such as odbc). Single statements (CTE / UNION / plain SELECT) go through the single-statement parser untouched, so blank-line formatting can never split them. Multiple statements are split with the grammar- and CTE-aware split_statements. A blank-line regex remains only as a last-resort fallback for genuinely separator-less statements (e.g. SELECT … INTO #temp followed by a SELECT).

Why a new PR instead of iterating on #17606

#17606 fixed the symptom with a line-by-line blank-line heuristic (_insert_statement_separators / _scan_line_depth / _has_real_semicolons). That approach still mis-handled set operations (UNION/EXCEPT), comments between statements, a comment before WITH, and ; inside string literals. This PR addresses the root cause instead, with less code, and is verified to have no regression vs the current behaviour while fixing those additional cases.

Behaviour

  • Resolves the reported bug (single CTE, blank lines, ;-in-comment).
  • Additionally fixes CTEs inside multi-statement queries and GO-separated CTEs.
  • No regression vs current master on any tested case.

Testing

Added/updated regression tests in metadata-ingestion/tests/integration/powerbi/test_native_sql_parser.py:

  • single nested CTE with blank lines + ;-in-comment (the reported bug)
  • CTE inside a ;-separated multi-statement query
  • GO-separated CTE through the real cleanup pipeline
  • SELECT … INTO #temp + blank-line SELECT (separator-less fallback)
  • single- vs multi-statement classification

Test relocation

Moved metadata-ingestion/tests/integration/powerbi/test_native_sql_parser.pymetadata-ingestion/tests/unit/powerbi/test_native_sql_parser.py.

These are pure-logic tests (they only call get_tables / remove_* / parse_custom_sql, with no Docker, graph, or network), but living under tests/integration/ meant conftest.py auto-marked them integration.

@github-actions github-actions Bot added the ingestion PR or Issue related to the ingestion of metadata label Jun 3, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

📢 Thoughts on this report? Let us know!

@aviraj-gour aviraj-gour force-pushed the fix/powerbi-native-sql-single-statement-detection branch from ca01c70 to 5593389 Compare June 3, 2026 08:08
@alwaysmeticulous
Copy link
Copy Markdown

alwaysmeticulous Bot commented Jun 4, 2026

🤖 Meticulous evaluated 130 user flows and took 946 visual snapshots. Meticulous has not yet run on df17f98 of the main branch and so there was nothing to compare against.

If you recently setup Meticulous, this is expected. Meticulous will start reporting comparisons for new pull requests after the next commit to the main branch.

Last updated for commit 18b7ef6 fix(tests/powerbi): remove xfail tests for blank-line-separated SELECTs .... This comment will update as new commits are pushed.

@datahub-connector-tests
Copy link
Copy Markdown

Connector Tests Results

All connector tests passed for commit 18b7ef6

View full test logs →

To skip connector tests, add the skip-connector-tests label (org members only).

Autogenerated by the connector-tests CI pipeline.

@maggiehays maggiehays added pending-submitter-merge and removed needs-review Label for PRs that need review from a maintainer. labels Jun 4, 2026
@aviraj-gour aviraj-gour merged commit 839f9d3 into master Jun 5, 2026
58 checks passed
@aviraj-gour aviraj-gour deleted the fix/powerbi-native-sql-single-statement-detection branch June 5, 2026 07:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ingestion PR or Issue related to the ingestion of metadata pending-submitter-merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants