Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingest/snowflake): apply table name normalization for queries_v2 #12566

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

mayurinehate
Copy link
Collaborator

Adds config table_name_normalization_rules to snowflake. The tables identified by these rules should typically be temporary or transient. This helps in uniform query_id generation for queries using random table names for each ETL run for tools like DBT, Segment, Fivetran

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

Adds config `table_name_normalization_rules` to snowflake.
The tables identified by these rules should typically be temporary or transient.
This helps in uniform query_id generation for queries using random table names
for each ETL run for tools like DBT, Segment, Fivetran
@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Feb 6, 2025
Copy link

codecov bot commented Feb 6, 2025

Codecov Report

Attention: Patch coverage is 66.66667% with 8 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...ub/ingestion/source/snowflake/snowflake_queries.py 33.33% 6 Missing ⚠️
...hub/ingestion/source/snowflake/snowflake_config.py 80.00% 2 Missing ⚠️
Files with missing lines Coverage Δ
...ingestion/source/snowflake/snowflake_lineage_v2.py 41.37% <ø> (-38.37%) ⬇️
...datahub/ingestion/source/snowflake/snowflake_v2.py 60.60% <ø> (-27.95%) ⬇️
.../src/datahub/sql_parsing/sql_parsing_aggregator.py 93.91% <100.00%> (-0.14%) ⬇️
...gestion/src/datahub/sql_parsing/sqlglot_lineage.py 93.97% <ø> (ø)
...ingestion/src/datahub/sql_parsing/sqlglot_utils.py 90.47% <100.00%> (ø)
...hub/ingestion/source/snowflake/snowflake_config.py 96.93% <80.00%> (-1.11%) ⬇️
...ub/ingestion/source/snowflake/snowflake_queries.py 43.06% <33.33%> (-0.34%) ⬇️

... and 18 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ac13f25...40444fe. Read the comment docs.

@@ -300,6 +301,18 @@ class SnowflakeV2Config(
"to ignore the temporary staging tables created by known ETL tools.",
)

table_name_normalization_rules: Dict[re.Pattern, str] = pydantic.Field(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where are the defaults injected in?

@@ -300,6 +301,18 @@ class SnowflakeV2Config(
"to ignore the temporary staging tables created by known ETL tools.",
)

table_name_normalization_rules: Dict[re.Pattern, str] = pydantic.Field(
default={},
description="[Advanced] Regex patterns for table names to normalize in lineage ingestion. "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it make sense for us to have table_name_normalization_rules and extra_table_name_normalization_rules - setting the former overrides the defaults, but the latter lets you extend it?

the ruff linter uses a similar pattern which works pretty well https://docs.astral.sh/ruff/settings/#lint_extend-select

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an excellent idea. Have not injected defaults yet. That's exactly the problem I was looking to solve - how to let users extend defaults without re-declaring them, and at the same time, allow users to change defaults if they do not suit them.

@@ -8,6 +8,7 @@
from datahub.sql_parsing.sql_parsing_common import QueryType
from datahub.sql_parsing.sqlglot_lineage import _UPDATE_ARGS_NOT_SUPPORTED_BY_SELECT
from datahub.sql_parsing.sqlglot_utils import (
_TABLE_NAME_NORMALIZATION_RULES,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should remove the _ prefix if this is a public thing that other sources depend on

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ingestion PR or Issue related to the ingestion of metadata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants