fix(ingest/snowflake): order queries for queries_v2 #12551
Codecov Report: All modified and coverable lines are covered by tests. All tests successful. No failed tests found.
@@ -696,6 +696,9 @@ def _build_enriched_query_log_query(
JOIN filtered_access_history a USING (query_id)
)
SELECT * FROM query_access_history
-- Our query aggregator expects the queries to be added in chronological order.
-- It's easier for us to push down the sorting to Snowflake/SQL instead of doing it in Python.
ORDER BY start_time ASC
Should this be QUERY_START_TIME?
I checked whether we already sort in other sources, e.g. BigQuery, and yes, we do:
datahub/metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries_extractor.py
Line 503 in ac13f25
ORDER BY creation_time
Are there other sources to validate?
Snowflake is being fixed here, and BigQuery is already fine. For Redshift, we register temp tables as a preprocessing step, so it's fine that the other queries don't have an ORDER BY.
What if we add a validation to the SQL parsing aggregator? We could track the query start time as queries are added to the aggregator and, for example, raise a warning if the start time does not increase from one addition to the next.
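A rough sketch of what such a check could look like, in case it helps the discussion. Everything here is hypothetical: `QueryOrderValidator`, `observe`, and `out_of_order_count` are names made up for illustration and are not part of the existing aggregator. It also assumes ties are acceptable, so it only warns when a start time is strictly earlier than the latest one seen so far.

```python
import logging
from datetime import datetime
from typing import Optional

logger = logging.getLogger(__name__)


class QueryOrderValidator:
    """Hypothetical helper, not an existing DataHub class.

    Remembers the latest start time seen so far and warns when a query
    arrives with an earlier start time.
    """

    def __init__(self) -> None:
        self._last_start_time: Optional[datetime] = None
        self.out_of_order_count = 0

    def observe(self, query_id: str, start_time: datetime) -> bool:
        """Return True if this query is in non-decreasing start-time order."""
        last = self._last_start_time
        if last is not None and start_time < last:
            # Keep the previous high-water mark so one early query does not
            # hide later violations.
            self.out_of_order_count += 1
            logger.warning(
                "Query %s has start time %s, earlier than the previous %s",
                query_id,
                start_time.isoformat(),
                last.isoformat(),
            )
            return False
        self._last_start_time = start_time
        return True


if __name__ == "__main__":
    validator = QueryOrderValidator()
    validator.observe("q1", datetime(2025, 1, 1, 10, 0))
    validator.observe("q2", datetime(2025, 1, 1, 9, 0))  # logs a warning
    print(validator.out_of_order_count)
```

The aggregator could call something like `observe` each time a query is added. Counting violations instead of raising keeps ingestion running while still surfacing sources whose query log is missing an ORDER BY.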