Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using CASE in a select statement breaks the split_statements logic used for inferring lineage from stored procedures in mssql #12606

Open
ttekampe opened this issue Feb 12, 2025 · 3 comments
Labels
bug Bug report

Comments

@ttekampe
Copy link

Describe the bug
Using CASE in a select statement breaks the split_statements logic used for inferring lineage from stored procedures in mssql

To Reproduce

from datahub.sql_parsing.schema_resolver import SchemaResolver
from datahub.sql_parsing.split_statements import split_statements
from datahub.sql_parsing.sqlglot_lineage import sqlglot_lineage

sql_text = """
SELECT
    Id,
    CASE
        WHEN Id > 10 THEN 'IS_GREATER_THAN_TEN'
        WHEN Id > 5 THEN 'IS_GREATER_THAN_FIVE'
        ELSE 'IS_SMALL'
    END AS foo,
    bar
FROM mySchema.myTable
"""

schema_resolver = SchemaResolver(platform="mssql")

statements = split_statements(sql_text)
results = []
for s in statements:
    print(s)
    print("------")
    parsed = sqlglot_lineage(
        s,
        schema_resolver=schema_resolver,
        default_db="myDatabase",
        default_schema="mySchema",
    )
    results.append(parsed)


for r in results:
    print(r.json(indent=4))
    print("------")

The individual statements that are produced by split_statements are

SELECT
    Id,
    CASE
        WHEN Id > 10 THEN 'IS_GREATER_THAN_TEN'
        WHEN Id > 5 THEN 'IS_GREATER_THAN_FIVE'
        ELSE 'IS_SMALL'
------
END
------
AS foo,
    bar
FROM mySchema.myTable
------

Expected behavior
The statement should not be broken up, such that the SELECT part can be parsed correctly and lineage can be inferred.

Additional context
acryl-datahub python package version is 0.15.0.5

@ttekampe ttekampe added the bug Bug report label Feb 12, 2025
@ttekampe
Copy link
Author

I wanted to prepare a PR just to find out that this is already fixed on the master branch.

Any plans to release a new version of the acryl-datahub python package, so we can benefit from these improvements?

@ttekampe
Copy link
Author

ttekampe commented Feb 13, 2025

There is still one thing that leaves me a bit puzzled:

The regex patterns to match control flow keywords here

CONTROL_FLOW_KEYWORDS = [
"GO",
r"BEGIN\w+TRY",
r"BEGIN\w+CATCH",
"BEGIN",
r"END\w+TRY",
r"END\w+CATCH",
# This isn't strictly correct, but we assume that IF | (condition) | (block) should all be split up
# This mainly ensures that IF statements don't get tacked onto the previous statement incorrectly
"IF",
# For things like CASE, END does not mean the end of a statement.
# We have special handling for this.
END_KEYWORD,
# "ELSE", # else is also valid in CASE, so we we can't use it here.
]

Why do you use \w+ and not \s+?

For example BEGIN\w+TRY matches BEGINFooBarTRY, but it does not match BEGIN TRY

@hsheth2 and @mayurinehate is this intentional or simply a typo?

@hsheth2
Copy link
Collaborator

hsheth2 commented Feb 13, 2025

Yeah that looks like a typo - we'd welcome a PR with some corresponding tests :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug report
Projects
None yet
Development

No branches or pull requests

2 participants