
feat(ingest/sigma): add default_db/schema config and optimize SQL parsing performance#16050

Closed
kyungsoo-datahub wants to merge 35 commits into master from dev/sigma-parallel-sql-parsing

Conversation


@kyungsoo-datahub kyungsoo-datahub commented Feb 2, 2026

  • Add default_db / default_schema config to generate fully qualified URNs (e.g., prod.public.table)
  • Add generate_column_lineage config (default: false) to skip expensive column lineage computation
  • Add retry logic for 429/503 errors with exponential backoff
  • Handle 400 Bad Request for elements without lineage support (text boxes, images)
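The retry behavior described above could be sketched roughly as follows. This is an illustrative sketch, not the PR's actual implementation; `request_with_retry`, the `send` callable, and the delay values are all hypothetical:

```python
import time

# Status codes the PR description says are retried with backoff.
RETRYABLE_STATUS_CODES = {429, 503}

def request_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() and retry 429/503 responses with exponential backoff.

    `send` is a zero-argument callable returning an object with a
    `status_code` attribute (e.g. a requests.Response).
    """
    for attempt in range(max_attempts):
        response = send()
        if response.status_code not in RETRYABLE_STATUS_CODES:
            return response
        if attempt < max_attempts - 1:
            # 1s, 2s, 4s, ... between attempts
            sleep(base_delay * (2 ** attempt))
    return response  # give up after max_attempts, return the last response
```

Injecting `sleep` keeps the backoff testable without real delays.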

Why generate_column_lineage=false by default

Sigma only uses `in_tables` (table-level lineage) - see `sigma.py:411`:

```python
sql_parser_in_tables = create_lineage_sql_parsed_result(...).in_tables
```

`column_lineage` is computed but never used, so skipping it improves performance with no functional impact.
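As a toy illustration of the skip, the flag can simply gate the expensive column pass; the parser below is a deliberately naive stand-in, not DataHub's actual `create_lineage_sql_parsed_result` internals:

```python
import re
from dataclasses import dataclass, field

@dataclass
class ParsedResult:
    in_tables: list
    column_lineage: list = field(default_factory=list)

def parse_lineage(sql: str, generate_column_lineage: bool = False) -> ParsedResult:
    # Table-level lineage is always extracted (cheap): grab names after FROM/JOIN.
    in_tables = re.findall(r"(?:from|join)\s+([\w.]+)", sql, re.IGNORECASE)
    column_lineage = []
    if generate_column_lineage:
        # The expensive per-column resolution would happen here; Sigma only
        # consumes in_tables, so this branch is skipped by default.
        select_list = re.search(r"select\s+(.*?)\s+from", sql,
                                re.IGNORECASE | re.DOTALL)
        if select_list:
            column_lineage = [c.strip() for c in select_list.group(1).split(",")]
    return ParsedResult(in_tables, column_lineage)
```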

Example Config

```yaml
chart_sources_platform_mapping:
  "*":
    data_source_platform: "redshift"
    default_db: "prod"
    default_schema: "public"
generate_column_lineage: false  # default
```
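One way the `default_db`/`default_schema` values could be applied when building fully qualified URNs is sketched below; `qualify_table` is a hypothetical helper for illustration, not code from this PR:

```python
def qualify_table(name: str, default_db: str = None,
                  default_schema: str = None) -> str:
    """Fill in missing db/schema parts so table URNs are fully qualified.

    "table"        -> "prod.public.table"  (both defaults applied)
    "public.table" -> "prod.public.table"  (only db applied)
    "prod.public.table" is returned unchanged.
    """
    parts = name.split(".")
    if len(parts) == 2 and default_db:
        parts = [default_db] + parts
    elif len(parts) == 1:
        if default_schema:
            parts = [default_schema] + parts
        if default_db:
            parts = [default_db] + parts
    return ".".join(parts)
```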

…nfig

- Add parallel SQL parsing using ThreadPoolExecutor for significant
  performance improvement (10x+ faster lineage extraction)
- Add sql_parsing_threads config option (default: 10) to control
  parallelism
- Add default_db and default_schema options to PlatformDetail for
  generating fully qualified table URNs
- Pre-parse all SQL queries in parallel before generating workunits
- Cache parsing results to avoid redundant computation

This addresses slow ingestion times when using chart_sources_platform_mapping
with large numbers of charts (e.g., 5000+ charts taking 5+ hours).
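The pre-parse-then-cache approach from this commit can be sketched as below (this is a minimal illustration, assuming the parse function is thread-safe; `preparse_queries` is a hypothetical name):

```python
from concurrent.futures import ThreadPoolExecutor

def preparse_queries(queries, parse, max_workers=10):
    """Parse distinct SQL queries in parallel, caching results by query text.

    Deduplicating first means a query repeated across many charts is parsed
    only once; the returned cache is consulted during workunit generation.
    """
    cache = {}
    unique = list(dict.fromkeys(queries))  # dedupe while preserving order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for sql, result in zip(unique, pool.map(parse, unique)):
            cache[sql] = result
    return cache
```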
@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Feb 2, 2026

github-actions bot commented Feb 2, 2026

Linear: ING-1487

@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Feb 2, 2026

codecov bot commented Feb 2, 2026

Codecov Report

❌ Patch coverage is 65.78947% with 13 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...on/src/datahub/ingestion/source/sigma/sigma_api.py | 44.44% | 10 Missing ⚠️ |
| ...gestion/src/datahub/sql_parsing/sqlglot_lineage.py | 87.50% | 2 Missing ⚠️ |
| ...estion/src/datahub/ingestion/source/sigma/sigma.py | 0.00% | 1 Missing ⚠️ |

❌ Your patch status has failed because the patch coverage (65.78%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.



alwaysmeticulous bot commented Feb 2, 2026

✅ Meticulous spotted 0 visual differences across 969 screens tested: view results.

Meticulous evaluated ~8 hours of user flows against your PR.

Last updated for commit 2390383.

…e support

Some Sigma elements (text boxes, images, UI elements) return 400 Bad Request
when fetching lineage. Handle this gracefully like 403/500 to reduce log noise.
@kyungsoo-datahub kyungsoo-datahub marked this pull request as ready for review February 4, 2026 16:19

github-actions bot commented Feb 4, 2026

Linear: ING-1508

…hing

- Parallelize Sigma API calls for fetching element lineage and SQL queries
  using ThreadPoolExecutor (previously sequential, now parallel)
- Add column_lineage_batch_size parameter to process columns in batches,
  reducing peak memory for wide queries (100+ columns)
- Rename sql_parsing_threads to max_workers for clarity (used for both
  API calls and SQL parsing)
- Add gc.collect() after each batch to reduce memory pressure

Config options:
- max_workers: controls parallelism for API calls and SQL parsing (default: 20)
- column_lineage_batch_size: columns per batch for lineage (default: 20)

This significantly improves ingestion performance:
- API calls: ~10-20x faster (parallel instead of sequential)
- Memory: reduced peak usage for wide Salesforce queries (150+ columns)
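The batching described above might look like this minimal sketch; `process_columns_in_batches` and `handle` are illustrative names, not the PR's actual API:

```python
import gc

def process_columns_in_batches(columns, handle, batch_size=20):
    """Process column lineage in fixed-size batches to cap peak memory.

    `handle` receives one batch (a list of columns) at a time; the
    gc.collect() after each batch mirrors the commit's memory-pressure
    mitigation for wide queries (100+ columns).
    """
    results = []
    for i in range(0, len(columns), batch_size):
        batch = columns[i:i + batch_size]
        results.extend(handle(batch))
        gc.collect()
    return results
```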

When max_workers > 5, enable rate limiting at 10 requests/second
to avoid hitting Sigma API 429 Too Many Requests errors.

Uses token bucket algorithm for thread-safe rate limiting.

Rate limiting now enables at max_workers > 3 (previously > 5).
With max_workers=5, the parallel API calls still hit Sigma's rate
limits, causing 429 errors. Lowering threshold ensures rate limiting
is active at typical worker counts.
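A thread-safe token bucket of the kind this commit describes can be sketched as follows (an illustrative implementation under the stated 429-avoidance assumptions, not the PR's code):

```python
import threading
import time

class TokenBucket:
    """Thread-safe token bucket: allows `rate` acquisitions/second on average,
    with bursts up to `capacity`."""

    def __init__(self, rate, capacity=None, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity if capacity is not None else rate
        self.tokens = self.capacity
        self.clock = clock
        self.last = clock()
        self.lock = threading.Lock()

    def acquire(self):
        """Block until a token is available, then consume it."""
        while True:
            with self.lock:
                now = self.clock()
                # Refill proportionally to elapsed time, capped at capacity.
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                wait = (1 - self.tokens) / self.rate
            time.sleep(wait)  # sleep outside the lock so other threads proceed
```

Each worker thread calls `acquire()` before issuing a Sigma API request.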

Pass self.ctx.graph to create_lineage_sql_parsed_result() so the schema
resolver can look up pre-ingested table schemas from DataHub. This enables
column-level lineage when upstream tables (e.g., Redshift) are already
ingested.

- Change default from 20 to 4 (conservative default)
- Add le=100 upper bound validation
- Fix misleading "in parallel" comment

…ug logging

- Change SQL parsing log messages from info to debug level
- Remove dead fallback code for direct SQL parsing (cache is always populated)
- Simplify _get_element_input_details to use cache.get() directly
- Clarify sql_parsing_threads only applies when chart_sources_platform_mapping is configured
- Add comment about memory implications for large deployments

…age config

- Remove sql_parsing_threads config and ThreadPoolExecutor logic
- Add generate_column_lineage config (default: false) to skip expensive
  column-level lineage computation since Sigma only uses table-level lineage
- Simplify _parse_sql_in_parallel to sequential _parse_sql_queries
- Add generate_column_lineage parameter to sqlglot_lineage functions
- Remove _parse_sql_queries method and _sql_parsing_cache
- Add _parse_element_sql that parses SQL on-demand when needed
- Change main loop back to direct iteration over workbooks
- Remove unused SqlParsingResult import

…ture

- Restore inline SQL parsing in _get_element_input_details (original approach)
- Remove _parse_element_sql helper method
- Remove _fetch_element_lineage_data and two-pass approach in sigma_api.py
- Keep only essential additions: default_db, default_schema, generate_column_lineage params
- Keep 429/503 retry logic and 400 Bad Request handling (needed in production)
@kyungsoo-datahub kyungsoo-datahub changed the title feat(ingest/sigma): add parallel SQL parsing and fully qualified URN support feat(ingest/sigma): add default_db/schema config and optimize SQL parsing performance Feb 7, 2026
@kyungsoo-datahub kyungsoo-datahub marked this pull request as ready for review February 8, 2026 02:02

github-actions bot commented Feb 8, 2026

Linear: ING-1542


ligfx commented Feb 9, 2026

  • Adding default_db/default_schema makes sense to me, though unfortunate that it's needed 🤔 I can't see anywhere in the Sigma REST API that it exposes the default database/schema for a connection, even though it's user-configurable via the UI. (IMO, also we should really re-architect how we handle connections and get rid of chart_sources_platform_mapping, but that's for another time.)

  • Is there a reason to add generate_column_lineage as a user-configurable option, if column lineage is never used by the ingestion source anyways? Seems like it should just pass it to the lineage parser automatically (although I'm also not convinced that it's that much of a performance benefit).

  • Also, a lot of separate behavior changes in this PR. I wonder if we could break them out into different PRs.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Feb 9, 2026
@kyungsoo-datahub (Contributor, Author) replied:

> Adding default_db/default_schema makes sense to me, though unfortunate that it's needed 🤔 I can't see anywhere in the Sigma REST API that it exposes the default database/schema for a connection, even though it's user-configurable via the UI. (IMO, also we should really re-architect how we handle connections and get rid of chart_sources_platform_mapping, but that's for another time.)

Yeah, I looked into the Sigma connections API (GET /v2/connections/{connectionId}) but the response fields vary by connection type and don't expose default db/schema directly. Would need connection-type-specific parsing to extract them. Static config felt like the practical choice for now.

> Is there a reason to add generate_column_lineage as a user-configurable option, if column lineage is never used by the ingestion source anyways? Seems like it should just pass it to the lineage parser automatically (although I'm also not convinced that it's that much of a performance benefit).

Thank you for the suggestion. Updated. As for the performance benefit: without skipping column-level lineage, our ingestion run was aborting entirely. Skipping it also cuts the cost of CLL computation, which doesn't work for Sigma anyway.

> Also, a lot of separate behavior changes in this PR. I wonder if we could break them out into different PRs.

Yes, it's now split into:

PR1: #16243 (Retry 429/503 errors)
PR2: #16244 (Skip column level lineage)
PR3: #16245 (default db/default schema)
