
feat(ingest/sigma): add default_db/schema config and optimize SQL parsing performance#16050

Closed
kyungsoo-datahub wants to merge 35 commits into master from dev/sigma-parallel-sql-parsing

Conversation


@kyungsoo-datahub kyungsoo-datahub commented Feb 2, 2026

  • Add default_db / default_schema config to generate fully qualified URNs (e.g., prod.public.table)
  • Add generate_column_lineage config (default: false) to skip expensive column lineage computation
  • Add retry logic for 429/503 errors with exponential backoff
  • Handle 400 Bad Request for elements without lineage support (text boxes, images)
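The retry behavior described above could be sketched roughly as follows. This is an illustrative sketch, not the PR's actual implementation; `request_with_retry`, the `send` callable, and the delay values are all hypothetical:

```python
import time

# Status codes the PR description says are retried with backoff.
RETRYABLE_STATUS_CODES = {429, 503}

def request_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() and retry 429/503 responses with exponential backoff.

    `send` is a zero-argument callable returning an object with a
    `status_code` attribute (e.g. a requests.Response).
    """
    for attempt in range(max_attempts):
        response = send()
        if response.status_code not in RETRYABLE_STATUS_CODES:
            return response
        if attempt < max_attempts - 1:
            # 1s, 2s, 4s, ... between attempts
            sleep(base_delay * (2 ** attempt))
    return response  # give up after max_attempts, return the last response
```

Injecting `sleep` keeps the backoff testable without real delays.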

Why generate_column_lineage=false by default

Sigma only uses `in_tables` (table-level lineage) - see `sigma.py:411`:

```python
sql_parser_in_tables = create_lineage_sql_parsed_result(...).in_tables
```

`column_lineage` is computed but never used, so skipping it improves performance with no functional impact.
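As a toy illustration of the skip, the flag can simply gate the expensive column pass; the parser below is a deliberately naive stand-in, not DataHub's actual `create_lineage_sql_parsed_result` internals:

```python
import re
from dataclasses import dataclass, field

@dataclass
class ParsedResult:
    in_tables: list
    column_lineage: list = field(default_factory=list)

def parse_lineage(sql: str, generate_column_lineage: bool = False) -> ParsedResult:
    # Table-level lineage is always extracted (cheap): grab names after FROM/JOIN.
    in_tables = re.findall(r"(?:from|join)\s+([\w.]+)", sql, re.IGNORECASE)
    column_lineage = []
    if generate_column_lineage:
        # The expensive per-column resolution would happen here; Sigma only
        # consumes in_tables, so this branch is skipped by default.
        select_list = re.search(r"select\s+(.*?)\s+from", sql,
                                re.IGNORECASE | re.DOTALL)
        if select_list:
            column_lineage = [c.strip() for c in select_list.group(1).split(",")]
    return ParsedResult(in_tables, column_lineage)
```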

Example Config

```yaml
chart_sources_platform_mapping:
  "*":
    data_source_platform: "redshift"
    default_db: "prod"
    default_schema: "public"
generate_column_lineage: false  # default
```
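One way the `default_db`/`default_schema` values could be applied when building fully qualified URNs is sketched below; `qualify_table` is a hypothetical helper for illustration, not code from this PR:

```python
def qualify_table(name: str, default_db: str = None,
                  default_schema: str = None) -> str:
    """Fill in missing db/schema parts so table URNs are fully qualified.

    "table"        -> "prod.public.table"  (both defaults applied)
    "public.table" -> "prod.public.table"  (only db applied)
    "prod.public.table" is returned unchanged.
    """
    parts = name.split(".")
    if len(parts) == 2 and default_db:
        parts = [default_db] + parts
    elif len(parts) == 1:
        if default_schema:
            parts = [default_schema] + parts
        if default_db:
            parts = [default_db] + parts
    return ".".join(parts)
```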

…nfig

- Add parallel SQL parsing using ThreadPoolExecutor for significant
  performance improvement (10x+ faster lineage extraction)
- Add sql_parsing_threads config option (default: 10) to control
  parallelism
- Add default_db and default_schema options to PlatformDetail for
  generating fully qualified table URNs
- Pre-parse all SQL queries in parallel before generating workunits
- Cache parsing results to avoid redundant computation

This addresses slow ingestion times when using chart_sources_platform_mapping
with large numbers of charts (e.g., 5000+ charts taking 5+ hours).
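The pre-parse-then-cache approach from this commit can be sketched as below (this is a minimal illustration, assuming the parse function is thread-safe; `preparse_queries` is a hypothetical name):

```python
from concurrent.futures import ThreadPoolExecutor

def preparse_queries(queries, parse, max_workers=10):
    """Parse distinct SQL queries in parallel, caching results by query text.

    Deduplicating first means a query repeated across many charts is parsed
    only once; the returned cache is consulted during workunit generation.
    """
    cache = {}
    unique = list(dict.fromkeys(queries))  # dedupe while preserving order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for sql, result in zip(unique, pool.map(parse, unique)):
            cache[sql] = result
    return cache
```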
@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Feb 2, 2026

github-actions bot commented Feb 2, 2026

Linear: ING-1487

@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Feb 2, 2026

codecov bot commented Feb 2, 2026

Codecov Report

❌ Patch coverage is 65.78947% with 13 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...on/src/datahub/ingestion/source/sigma/sigma_api.py | 44.44% | 10 Missing ⚠️ |
| ...gestion/src/datahub/sql_parsing/sqlglot_lineage.py | 87.50% | 2 Missing ⚠️ |
| ...estion/src/datahub/ingestion/source/sigma/sigma.py | 0.00% | 1 Missing ⚠️ |

❌ Your patch status has failed because the patch coverage (65.78%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.



alwaysmeticulous bot commented Feb 2, 2026

✅ Meticulous spotted 0 visual differences across 969 screens tested: view results.

Meticulous evaluated ~8 hours of user flows against your PR.

Last updated for commit 2390383.

…e support

Some Sigma elements (text boxes, images, UI elements) return 400 Bad Request
when fetching lineage. Handle this gracefully like 403/500 to reduce log noise.
@kyungsoo-datahub kyungsoo-datahub marked this pull request as ready for review February 4, 2026 16:19

github-actions bot commented Feb 4, 2026

Linear: ING-1508

…hing

- Parallelize Sigma API calls for fetching element lineage and SQL queries
  using ThreadPoolExecutor (previously sequential, now parallel)
- Add column_lineage_batch_size parameter to process columns in batches,
  reducing peak memory for wide queries (100+ columns)
- Rename sql_parsing_threads to max_workers for clarity (used for both
  API calls and SQL parsing)
- Add gc.collect() after each batch to reduce memory pressure

Config options:
- max_workers: controls parallelism for API calls and SQL parsing (default: 20)
- column_lineage_batch_size: columns per batch for lineage (default: 20)

This significantly improves ingestion performance:
- API calls: ~10-20x faster (parallel instead of sequential)
- Memory: reduced peak usage for wide Salesforce queries (150+ columns)
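The batching described above might look like this minimal sketch; `process_columns_in_batches` and `handle` are illustrative names, not the PR's actual API:

```python
import gc

def process_columns_in_batches(columns, handle, batch_size=20):
    """Process column lineage in fixed-size batches to cap peak memory.

    `handle` receives one batch (a list of columns) at a time; the
    gc.collect() after each batch mirrors the commit's memory-pressure
    mitigation for wide queries (100+ columns).
    """
    results = []
    for i in range(0, len(columns), batch_size):
        batch = columns[i:i + batch_size]
        results.extend(handle(batch))
        gc.collect()
    return results
```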

When max_workers > 5, enable rate limiting at 10 requests/second
to avoid hitting Sigma API 429 Too Many Requests errors.

Uses token bucket algorithm for thread-safe rate limiting.

Rate limiting now enables at max_workers > 3 (previously > 5).
With max_workers=5, the parallel API calls still hit Sigma's rate
limits, causing 429 errors. Lowering threshold ensures rate limiting
is active at typical worker counts.
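A thread-safe token bucket of the kind this commit describes can be sketched as follows (an illustrative implementation under the stated 429-avoidance assumptions, not the PR's code):

```python
import threading
import time

class TokenBucket:
    """Thread-safe token bucket: allows `rate` acquisitions/second on average,
    with bursts up to `capacity`."""

    def __init__(self, rate, capacity=None, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity if capacity is not None else rate
        self.tokens = self.capacity
        self.clock = clock
        self.last = clock()
        self.lock = threading.Lock()

    def acquire(self):
        """Block until a token is available, then consume it."""
        while True:
            with self.lock:
                now = self.clock()
                # Refill proportionally to elapsed time, capped at capacity.
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                wait = (1 - self.tokens) / self.rate
            time.sleep(wait)  # sleep outside the lock so other threads proceed
```

Each worker thread calls `acquire()` before issuing a Sigma API request.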

Pass self.ctx.graph to create_lineage_sql_parsed_result() so the schema
resolver can look up pre-ingested table schemas from DataHub. This enables
column-level lineage when upstream tables (e.g., Redshift) are already
ingested.

- Change default from 20 to 4 (conservative default)
- Add le=100 upper bound validation
- Fix misleading "in parallel" comment

…ug logging

- Change SQL parsing log messages from info to debug level
- Remove dead fallback code for direct SQL parsing (cache is always populated)
- Simplify _get_element_input_details to use cache.get() directly
- Clarify sql_parsing_threads only applies when chart_sources_platform_mapping is configured
- Add comment about memory implications for large deployments

…age config

- Remove sql_parsing_threads config and ThreadPoolExecutor logic
- Add generate_column_lineage config (default: false) to skip expensive
  column-level lineage computation since Sigma only uses table-level lineage
- Simplify _parse_sql_in_parallel to sequential _parse_sql_queries
- Add generate_column_lineage parameter to sqlglot_lineage functions
- Remove _parse_sql_queries method and _sql_parsing_cache
- Add _parse_element_sql that parses SQL on-demand when needed
- Change main loop back to direct iteration over workbooks
- Remove unused SqlParsingResult import

…ture

- Restore inline SQL parsing in _get_element_input_details (original approach)
- Remove _parse_element_sql helper method
- Remove _fetch_element_lineage_data and two-pass approach in sigma_api.py
- Keep only essential additions: default_db, default_schema, generate_column_lineage params
- Keep 429/503 retry logic and 400 Bad Request handling (needed in production)
@kyungsoo-datahub kyungsoo-datahub changed the title feat(ingest/sigma): add parallel SQL parsing and fully qualified URN support feat(ingest/sigma): add default_db/schema config and optimize SQL parsing performance Feb 7, 2026
@kyungsoo-datahub kyungsoo-datahub marked this pull request as ready for review February 8, 2026 02:02

github-actions bot commented Feb 8, 2026

Linear: ING-1542


ligfx commented Feb 9, 2026

  • Adding default_db/default_schema makes sense to me, though unfortunate that it's needed 🤔 I can't see anywhere in the Sigma REST API that it exposes the default database/schema for a connection, even though it's user-configurable via the UI. (IMO, also we should really re-architect how we handle connections and get rid of chart_sources_platform_mapping, but that's for another time.)

  • Is there a reason to add generate_column_lineage as a user-configurable option, if column lineage is never used by the ingestion source anyways? Seems like it should just pass it to the lineage parser automatically (although I'm also not convinced that it's that much of a performance benefit).

  • Also, a lot of separate behavior changes in this PR. I wonder if we could break them out into different PRs.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Feb 9, 2026
@kyungsoo-datahub (Contributor, Author) replied:

> Adding default_db/default_schema makes sense to me, though unfortunate that it's needed 🤔 I can't see anywhere in the Sigma REST API that it exposes the default database/schema for a connection, even though it's user-configurable via the UI. (IMO, also we should really re-architect how we handle connections and get rid of chart_sources_platform_mapping, but that's for another time.)

Yeah, I looked into the Sigma connections API (GET /v2/connections/{connectionId}) but the response fields vary by connection type and don't expose default db/schema directly. Would need connection-type-specific parsing to extract them. Static config felt like the practical choice for now.

> Is there a reason to add generate_column_lineage as a user-configurable option, if column lineage is never used by the ingestion source anyways? Seems like it should just pass it to the lineage parser automatically (although I'm also not convinced that it's that much of a performance benefit).

Thank you for the suggestion. Updated. As for the performance benefit: without skipping column-level lineage, our ingestion run was aborting entirely. Skipping it also cuts the cost of CLL computation, which doesn't work for Sigma anyway.

> Also, a lot of separate behavior changes in this PR. I wonder if we could break them out into different PRs.

Yes, it's now split into:

PR1: #16243 (Retry 429/503 errors)
PR2: #16244 (Skip column level lineage)
PR3: #16245 (default db/default schema)
