feat(ingest/sigma): add default_db/schema config and optimize SQL parsing performance #16050
kyungsoo-datahub wants to merge 35 commits into master from …
Conversation
…nfig
- Add parallel SQL parsing using ThreadPoolExecutor for a significant performance improvement (10x+ faster lineage extraction)
- Add sql_parsing_threads config option (default: 10) to control parallelism
- Add default_db and default_schema options to PlatformDetail for generating fully qualified table URNs
- Pre-parse all SQL queries in parallel before generating workunits
- Cache parsing results to avoid redundant computation

This addresses slow ingestion times when using chart_sources_platform_mapping with large numbers of charts (e.g., 5000+ charts taking 5+ hours).
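A minimal sketch of the pre-parse-and-cache idea described above, under assumed names (`queries`, `parse_fn`); this is not the actual source change:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch only: pre-parse every SQL query in parallel and cache the results so
# the workunit-generation loop never parses the same query twice.
def pre_parse_queries(queries, parse_fn, sql_parsing_threads=10):
    cache = {}
    with ThreadPoolExecutor(max_workers=sql_parsing_threads) as executor:
        # executor.map preserves input order, so results can be zipped back
        for query, result in zip(queries, executor.map(parse_fn, queries)):
            cache[query] = result
    return cache
```

Workunit generation would then look results up with `cache.get(query)` instead of re-parsing.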
Linear: ING-1487
Codecov Report: ❌ Your patch status has failed because the patch coverage (65.78%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.
✅ Meticulous spotted 0 visual differences across 969 screens tested (~8 hours of user flows evaluated against this PR). Last updated for commit 2390383.
…e support

Some Sigma elements (text boxes, images, UI elements) return 400 Bad Request when fetching lineage. Handle this gracefully, like 403/500, to reduce log noise.
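A hedged sketch of that handling (the function name and URL handling are illustrative, not the connector's actual code):

```python
import logging

import requests

logger = logging.getLogger(__name__)

# Sketch only: treat 400 like 403/500 when fetching element lineage --
# log at debug level and return an empty result instead of raising.
def fetch_element_lineage(session: requests.Session, url: str) -> dict:
    response = session.get(url)
    if response.status_code in (400, 403, 500):
        logger.debug("Skipping lineage for %s (HTTP %s)", url, response.status_code)
        return {}
    response.raise_for_status()
    return response.json()
```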
Linear: ING-1508
…hing
- Parallelize Sigma API calls for fetching element lineage and SQL queries using ThreadPoolExecutor (previously sequential, now parallel)
- Add column_lineage_batch_size parameter to process columns in batches, reducing peak memory for wide queries (100+ columns)
- Rename sql_parsing_threads to max_workers for clarity (used for both API calls and SQL parsing)
- Add gc.collect() after each batch to reduce memory pressure

Config options:
- max_workers: controls parallelism for API calls and SQL parsing (default: 20)
- column_lineage_batch_size: columns per batch for lineage (default: 20)

This significantly improves ingestion performance:
- API calls: ~10-20x faster (parallel instead of sequential)
- Memory: reduced peak usage for wide Salesforce queries (150+ columns)
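A sketch of the batching-plus-gc step described above, with hypothetical names (`columns`, `handle_batch`):

```python
import gc

# Sketch only: process column lineage in fixed-size batches to cap peak memory
# for wide queries, releasing intermediate objects between batches.
def process_columns_in_batches(columns, handle_batch, column_lineage_batch_size=20):
    results = []
    for start in range(0, len(columns), column_lineage_batch_size):
        batch = columns[start:start + column_lineage_batch_size]
        results.extend(handle_batch(batch))
        gc.collect()  # reduce memory pressure before the next batch
    return results
```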
When max_workers > 5, enable rate limiting at 10 requests/second to avoid hitting Sigma API 429 Too Many Requests errors. Uses a token bucket algorithm for thread-safe rate limiting.
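A minimal thread-safe token bucket along those lines (class and method names are illustrative, not the PR's code):

```python
import threading
import time

# Sketch only: a thread-safe token bucket allowing `rate` requests per second,
# shared by all worker threads making Sigma API calls.
class TokenBucket:
    def __init__(self, rate: float = 10.0, capacity: float = 10.0):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self) -> None:
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(
                    self.capacity,
                    self.tokens + (now - self.last_refill) * self.rate,
                )
                self.last_refill = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                wait = (1 - self.tokens) / self.rate
            time.sleep(wait)  # sleep outside the lock so other threads are not blocked
```

Each worker would call acquire() before issuing a request; with rate=10 this caps throughput at roughly 10 requests/second regardless of max_workers.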
Rate limiting is now enabled at max_workers > 3 (previously > 5). With max_workers=5, the parallel API calls still hit Sigma's rate limits, causing 429 errors. Lowering the threshold ensures rate limiting is active at typical worker counts.
Pass self.ctx.graph to create_lineage_sql_parsed_result() so the schema resolver can look up pre-ingested table schemas from DataHub. This enables column-level lineage when upstream tables (e.g., Redshift) are already ingested.
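A sketch of the call, assuming the parameter names of DataHub's create_lineage_sql_parsed_result utility; the exact import path and signature may differ from what the PR uses:

```python
from typing import Optional

from datahub.ingestion.graph.client import DataHubGraph
from datahub.sql_parsing.sqlglot_lineage import create_lineage_sql_parsed_result


# Sketch only: passing the DataHub graph lets the schema resolver look up
# already-ingested table schemas, which is what enables column-level lineage
# for upstream tables such as Redshift.
def parse_element_sql(
    sql_query: str,
    default_db: Optional[str],
    default_schema: Optional[str],
    graph: Optional[DataHubGraph],
):
    return create_lineage_sql_parsed_result(
        query=sql_query,
        default_db=default_db,
        default_schema=default_schema,
        platform="redshift",   # upstream platform from chart_sources_platform_mapping
        platform_instance=None,
        env="PROD",
        graph=graph,           # schema-aware resolution against DataHub
    )
```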
- Change default from 20 to 4 (conservative default)
- Add le=100 upper bound validation
- Fix misleading "in parallel" comment
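A sketch of the bounded field using pydantic (the config class name here is illustrative):

```python
from pydantic import BaseModel, Field

# Sketch only: default of 4 worker threads, with values above 100 rejected
# at config validation time via the `le` bound.
class WorkerConfig(BaseModel):
    max_workers: int = Field(
        default=4,
        le=100,
        description="Number of worker threads used for API calls and SQL parsing.",
    )
```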
…ug logging
- Change SQL parsing log messages from info to debug level
- Remove dead fallback code for direct SQL parsing (the cache is always populated)
- Simplify _get_element_input_details to use cache.get() directly
- Clarify that sql_parsing_threads only applies when chart_sources_platform_mapping is configured
- Add comment about memory implications for large deployments
…age config
- Remove sql_parsing_threads config and ThreadPoolExecutor logic
- Add generate_column_lineage config (default: false) to skip expensive column-level lineage computation, since Sigma only uses table-level lineage
- Simplify _parse_sql_in_parallel to a sequential _parse_sql_queries
- Add generate_column_lineage parameter to sqlglot_lineage functions
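For context on why table-level lineage is cheap relative to column-level lineage, here is a standalone sqlglot sketch (not the connector's code) that extracts only the referenced table names:

```python
import sqlglot
from sqlglot import exp

# Sketch only: pulling referenced tables out of a parsed statement is a simple
# tree walk; column-level lineage needs schema-aware analysis of every
# projection, which is the expensive step generate_column_lineage=false skips.
def extract_table_names(sql: str, dialect: str = "redshift") -> set:
    statement = sqlglot.parse_one(sql, read=dialect)
    names = set()
    for table in statement.find_all(exp.Table):
        parts = [table.catalog, table.db, table.name]
        names.add(".".join(part for part in parts if part))
    return names

print(extract_table_names(
    "SELECT o.id, u.email FROM public.orders o JOIN public.users u ON o.user_id = u.id"
))
# {'public.orders', 'public.users'}
```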
235e040 to e8a5c5d
- Remove _parse_sql_queries method and _sql_parsing_cache
- Add _parse_element_sql that parses SQL on-demand when needed
- Change the main loop back to direct iteration over workbooks
- Remove unused SqlParsingResult import
…ture
- Restore inline SQL parsing in _get_element_input_details (original approach)
- Remove _parse_element_sql helper method
- Remove _fetch_element_lineage_data and the two-pass approach in sigma_api.py
- Keep only essential additions: default_db, default_schema, generate_column_lineage params
- Keep 429/503 retry logic and 400 Bad Request handling (needed in production)
Linear: ING-1542
Yeah, I looked into the Sigma connections API (GET /v2/connections/{connectionId}) but the response fields vary by connection type and don't expose default db/schema directly. Would need connection-type-specific parsing to extract them. Static config felt like the practical choice for now.
Thank you for the suggestion. Updated. In terms of performance benefit, ingestion was aborted without this skipping. Also, we can reduce our cost for the CLL computation, which doesn't work for Sigma anyway.
Yes, it's now divided into PR1: #16243 (Retry 429/503 errors)
- default_db/default_schema config to generate fully qualified URNs (e.g., prod.public.table)
- generate_column_lineage config (default: false) to skip expensive column lineage computation

Why
generate_column_lineage=false by default

Sigma only uses in_tables (table-level lineage) - see sigma.py:411:
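The referenced snippet is not reproduced here; the sketch below (aspect class names from DataHub's metadata model, the function itself hypothetical rather than copied from sigma.py) shows what consuming only in_tables looks like:

```python
from datahub.metadata.schema_classes import DatasetLineageTypeClass, UpstreamClass

# Sketch only: build table-level upstream lineage from parsed_result.in_tables;
# parsed_result.column_lineage is never read, which is why defaulting
# generate_column_lineage to false does not change what Sigma emits.
def build_upstreams(parsed_result) -> list:
    return [
        UpstreamClass(dataset=table_urn, type=DatasetLineageTypeClass.TRANSFORMED)
        for table_urn in parsed_result.in_tables  # dataset URN strings
    ]
```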