# [BUGFIX] Databricks SQL tables with special characters #11280
Base: develop
## Conversation
| Name | Link |
|---|---|
| 🔨 Latest commit | 0d60767 |
```python
class DatabricksIdentifierPreparer(IdentifierPreparer):
    """Custom identifier preparer for Databricks that uses backticks."""

    def __init__(
        self,
        dialect,
        initial_quote="`",
        final_quote="`",
        escape_quote="``",
        quote_case_sensitive_collations=True,
        omit_schema=False,
    ):
        super().__init__(
            dialect,
            initial_quote=initial_quote,
            final_quote=final_quote,
            escape_quote=escape_quote,
            quote_case_sensitive_collations=quote_case_sensitive_collations,
            omit_schema=omit_schema,
        )
```

```python
# Replace the dialect's preparer with our custom one
self.engine.dialect.identifier_preparer = DatabricksIdentifierPreparer(self.engine.dialect)
logger.debug("Successfully patched Databricks dialect to use backticks for identifier quoting")
```
This doesn't seem correct.
The databricks dialect we use already overrides the preparer. See here.
Furthermore, `identifier_preparer` is a `memoized_property`, which I believe is meant to be read-only. It's derived from the dialect's `preparer` property.
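To illustrate the reviewer's point (a minimal sketch, not code from this PR or from the Databricks dialect): SQLAlchemy dialects normally declare their preparer as a class attribute, and the dialect instantiates `identifier_preparer` from it itself, which is why patching the attribute on a live dialect instance is unusual:

```python
from sqlalchemy.engine.default import DefaultDialect
from sqlalchemy.sql.compiler import IdentifierPreparer


class BacktickPreparer(IdentifierPreparer):
    def __init__(self, dialect):
        # Quote with backticks; a literal backtick inside an identifier
        # is escaped by doubling it.
        super().__init__(dialect, initial_quote="`", escape_quote="`")


class ExampleDialect(DefaultDialect):
    # The dialect builds its identifier_preparer from this class attribute,
    # so subclasses customize quoting here rather than patching instances.
    preparer = BacktickPreparer
```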
❌ 62 Tests Failed; 3 ❄️ flaky tests.
**NathanFarmer** left a comment:
Thanks for this contribution @fredwang1012! This "quoted identifiers" bug is one we have been tackling piece by piece for a long time. There are already some existing tests for similar cases in this file that may or may not start failing once you move the changes into `DatabricksTableAsset`.
I also noticed that the linter is failing. You can run the linter at any time with `invoke lint`. Also, if you run `pre-commit install`, the linter will run automatically before each commit.
Force-pushed 0c6cfc4 to 85a82be.
…acters

- Enhanced `_resolve_quoted_name` validator to automatically detect and quote table names with special characters (digits, spaces, hyphens, dots, symbols)
- Added `_resolve_quoted_schema_name` validator to handle schema names with special characters
- Moved test file to correct directory structure
- Fixes issue with table/schema names starting with digits or containing special characters
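Roughly, the detection-and-quoting approach this commit describes might look like the following (an illustrative sketch; the regex and the standalone helper name are assumptions, not the PR's exact code):

```python
import re

from sqlalchemy.sql.elements import quoted_name

# Identifiers that need no quoting: letters/underscore first, then
# letters, digits, or underscores. Anything else gets quoted.
_SIMPLE_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def resolve_quoted_name(table_name: str):
    """Return a quoted_name for identifiers that need quoting, else the plain string."""
    if _SIMPLE_IDENTIFIER.match(table_name):
        return table_name
    # quote=True tells SQLAlchemy to always emit this identifier quoted,
    # using whatever quote character the active dialect's preparer defines.
    # Strip any backticks the user already supplied to avoid double quoting.
    return quoted_name(table_name.strip("`"), quote=True)
```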
Force-pushed 85a82be to 6b5261e.
**NathanFarmer** left a comment:
This is looking pretty good! Running `pre-commit install` or `invoke lint` will get you past the linting errors so the rest of the tests will run.
- Fix `QueryMetricProvider` to use dialect-specific identifier quoting
- Update `_get_substituted_batch_subquery_from_query_and_batch_selectable` to handle `TableClause` objects
- Use the dialect's identifier preparer to ensure correct quote characters (backticks for Databricks)
- Add test for Custom SQL Expectations with special table names
- Fixes PARSE_SYNTAX_ERROR when using table names starting with digits in Custom SQL

This completes the fix for Databricks table identifier quoting across all expectation types.
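For reference, SQLAlchemy's preparer already exposes dialect-correct, on-demand quoting; a small sketch of the kind of call this commit refers to (using MySQL's dialect here only because, like Databricks, it quotes with backticks):

```python
from sqlalchemy.dialects.mysql.base import MySQLDialect

# The dialect owns an identifier_preparer configured with its quote character.
preparer = MySQLDialect().identifier_preparer

# quote() applies the dialect's quote character only when the identifier
# needs it: reserved words, a leading digit, or other illegal characters.
print(preparer.quote("247_asset_class_returns"))  # `247_asset_class_returns`
print(preparer.quote("normal_table"))             # normal_table
```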
…2/great_expectations into databricks-quote-bug
@NathanFarmer I've addressed the linting issues by moving the imports to the top of the file. Ready for re-review and workflow approval. Thanks!
**NathanFarmer** left a comment:
According to the static analysis stage, the `type: ignore` comments on these lines need to be removed. You can run the type checker with `invoke types`. Once you get past that stage, we should be able to see if the Databricks tests pass.
```python
    ensures the Databricks dialect uses backticks for special identifiers.
    """
    # If it's already a quoted_name object, return as-is
    if hasattr(sqlalchemy, "quoted_name") and hasattr(table_name, "__class__"):  # type: ignore[truthy-function]
```
This type ignore comment needs to be removed
```python
        return schema_name

    # If it's already a quoted_name object, return as-is
    if hasattr(sqlalchemy, "quoted_name") and hasattr(schema_name, "__class__"):  # type: ignore[truthy-function]
```
This type ignore comment needs to be removed
…2/great_expectations into databricks-quote-bug
…2/great_expectations into databricks-quote-bug
**NathanFarmer** left a comment:
We are getting closer, but we definitely need some end-to-end integration tests. You were pretty close before you completely changed the `tests/integration/data_sources_and_expectations/data_sources/test_databricks_special_table_names.py` file. Commit 8f9ab2b5b59241824afb391353158a229ee49034 was the last time I reviewed, and the file was close then; you were just attempting to use fixtures that didn't exist. You should remove the `context` and `databricks_datasource` fixtures from those tests and then use `DatabricksBatchTestSetup` with `DatabricksDatasourceTestConfig`. You should be able to pass the special table names to either of those classes. Follow the established patterns in `tests/integration/data_sources_and_expectations/data_sources/test_postgresql.py`, except pass the `table_name`.
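A rough sketch of the shape being asked for (the import path and the keyword arguments below are assumptions; copy the exact ones from `test_postgresql.py`):

```python
import pandas as pd

# Import path is an assumption for illustration; mirror the real one used in
# tests/integration/data_sources_and_expectations/data_sources/test_postgresql.py.
from tests.integration.test_utils.data_source_config import (
    DatabricksBatchTestSetup,
    DatabricksDatasourceTestConfig,
)

SPECIAL_TABLE_NAMES = [
    "247_asset_class_returns",
    "table with spaces",
    "table-with-hyphens",
]


def test_special_table_names_end_to_end() -> None:
    for table_name in SPECIAL_TABLE_NAMES:
        # Hypothetical keyword: per the review, the special table name can be
        # passed to either DatabricksBatchTestSetup or DatabricksDatasourceTestConfig.
        setup = DatabricksBatchTestSetup(
            config=DatabricksDatasourceTestConfig(),
            data=pd.DataFrame({"col_a": [1, 2, 3]}),
            table_name=table_name,
        )
        # The batch setup creates the table and yields a batch; the test then
        # asserts on a real expectation result (unlike the skipped tests below).
        ...
```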
```python
    # Integration tests - NO @pytest.mark.databricks decorator
    # These get their markers from fixture parametrization
    def test_table_asset_integration_digit_start(self, request, data_context):
        """Integration test: table names starting with digits work end-to-end."""
        # Skip ephemeral backend as it doesn't support these special table name features
        if hasattr(request, "param") and "ephemeral" in str(request.param):
            pytest.skip("Ephemeral backend doesn't support special table names")

        # Skip if no real Databricks connection is available
        pytest.skip("Integration test requires real Databricks connection")
```
As far as I can tell, this test doesn't assert anything
```python
    def test_table_asset_integration_special_chars(self, request, data_context):
        """Integration test: table names with special characters work end-to-end."""
        # Skip ephemeral backend as it doesn't support these special table name features
        if hasattr(request, "param") and "ephemeral" in str(request.param):
            pytest.skip("Ephemeral backend doesn't support special table names")

        # Skip if no real Databricks connection is available
        pytest.skip("Integration test requires real Databricks connection")
```
This test does nothing
```python
    def test_table_asset_integration_schema_and_table_special(self, request, data_context):
        """Integration test: both schema and table names with special characters work."""
        # Skip ephemeral backend as it doesn't support these special table name features
        if hasattr(request, "param") and "ephemeral" in str(request.param):
            pytest.skip("Ephemeral backend doesn't support special table names")

        # Skip if no real Databricks connection is available
        pytest.skip("Integration test requires real Databricks connection")
```
This test does nothing
```python
    @pytest.mark.databricks
    def test_needs_databricks_backticks_digit_start(self):
        """Test that table names starting with digits require backticks."""
        assert DatabricksTableAsset._needs_databricks_backticks("247_asset_class_returns")
        assert DatabricksTableAsset._needs_databricks_backticks("123_test_table")
        assert DatabricksTableAsset._needs_databricks_backticks("9_column_table")

    @pytest.mark.databricks
    def test_needs_databricks_backticks_special_chars(self):
        """Test that table names with special characters require backticks."""
        assert DatabricksTableAsset._needs_databricks_backticks("table with spaces")
        assert DatabricksTableAsset._needs_databricks_backticks("table-with-hyphens")
        assert DatabricksTableAsset._needs_databricks_backticks("table.with.dots")
        assert DatabricksTableAsset._needs_databricks_backticks("table#with#hash")
        assert DatabricksTableAsset._needs_databricks_backticks("table@with@at")

    @pytest.mark.databricks
    def test_needs_databricks_backticks_normal_names(self):
        """Test that normal table names don't require backticks."""
        assert not DatabricksTableAsset._needs_databricks_backticks("normal_table")
        assert not DatabricksTableAsset._needs_databricks_backticks("table_name")
        assert not DatabricksTableAsset._needs_databricks_backticks("TableName")
        assert not DatabricksTableAsset._needs_databricks_backticks("test_table_123")

    @pytest.mark.databricks
    def test_needs_databricks_backticks_already_quoted(self):
        """Test that already quoted names don't require additional backticks."""
        assert not DatabricksTableAsset._needs_databricks_backticks("`247_asset_class_returns`")
        assert not DatabricksTableAsset._needs_databricks_backticks("`table with spaces`")

    @pytest.mark.databricks
    def test_is_bracketed_by_quotes_true(self):
        """Test that names with backticks are detected correctly."""
        assert DatabricksTableAsset._is_bracketed_by_quotes("`table_name`")
        assert DatabricksTableAsset._is_bracketed_by_quotes("`247_test_table`")
        assert DatabricksTableAsset._is_bracketed_by_quotes("`table with spaces`")

    @pytest.mark.databricks
    def test_is_bracketed_by_quotes_false(self):
        """Test that names without backticks are detected correctly."""
        assert not DatabricksTableAsset._is_bracketed_by_quotes("table_name")
        assert not DatabricksTableAsset._is_bracketed_by_quotes("247_test_table")
        assert not DatabricksTableAsset._is_bracketed_by_quotes('"table_name"')
        assert not DatabricksTableAsset._is_bracketed_by_quotes("'table_name'")

    @pytest.mark.databricks
    def test_resolve_quoted_name_digit_start(self):
        """Test that table names starting with digits get quoted_name objects."""
        result = DatabricksTableAsset._resolve_quoted_name("247_asset_class_returns")
        assert isinstance(result, quoted_name)
        assert str(result) == "247_asset_class_returns"

    @pytest.mark.databricks
    def test_resolve_quoted_name_special_chars(self):
        """Test that table names with special chars get quoted_name objects."""
        result = DatabricksTableAsset._resolve_quoted_name("table with spaces")
        assert isinstance(result, quoted_name)
        assert str(result) == "table with spaces"

        result = DatabricksTableAsset._resolve_quoted_name("table-with-hyphens")
        assert isinstance(result, quoted_name)
        assert str(result) == "table-with-hyphens"

    @pytest.mark.databricks
    def test_resolve_quoted_name_normal_names(self):
        """Test that normal table names remain as strings."""
        result = DatabricksTableAsset._resolve_quoted_name("normal_table")
        assert isinstance(result, str)
        assert result == "normal_table"

    @pytest.mark.databricks
    def test_resolve_quoted_name_already_quoted(self):
        """Test that already quoted names get their quotes stripped and become
        quoted_name objects."""
        result = DatabricksTableAsset._resolve_quoted_name("`247_asset_class_returns`")
        assert isinstance(result, quoted_name)
        assert str(result) == "247_asset_class_returns"

    @pytest.mark.databricks
    def test_resolve_quoted_schema_name_digit_start(self):
        """Test that schema names starting with digits get quoted_name objects."""
        result = DatabricksTableAsset._resolve_quoted_schema_name("123_schema")
        assert isinstance(result, quoted_name)
        assert str(result) == "123_schema"

    @pytest.mark.databricks
    def test_resolve_quoted_schema_name_special_chars(self):
        """Test that schema names with special chars get quoted_name objects."""
        result = DatabricksTableAsset._resolve_quoted_schema_name("schema with spaces")
        assert isinstance(result, quoted_name)
        assert str(result) == "schema with spaces"

    @pytest.mark.databricks
    def test_resolve_quoted_schema_name_normal_names(self):
        """Test that normal schema names remain as strings."""
        result = DatabricksTableAsset._resolve_quoted_schema_name("normal_schema")
        assert isinstance(result, str)
        assert result == "normal_schema"

    @pytest.mark.databricks
    def test_resolve_quoted_schema_name_none(self):
        """Test that None schema names remain None."""
        result = DatabricksTableAsset._resolve_quoted_schema_name(None)
        assert result is None
```
These are unit tests, since they only test the inputs and outputs of this one method (you don't need a Databricks connection for that). They should be marked with `@pytest.mark.unit` and moved into a new file, `tests/datasource/fluent/data_asset/databricks_table_asset.py`.
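A sketch of the requested relocation (the import location of `DatabricksTableAsset` is an assumption based on the files touched in this PR):

```python
# tests/datasource/fluent/data_asset/databricks_table_asset.py
import pytest
from sqlalchemy.sql.elements import quoted_name

# Assumed import location, inferred from the PR's diff paths.
from great_expectations.datasource.fluent.databricks_sql_datasource import (
    DatabricksTableAsset,
)


@pytest.mark.unit
def test_needs_databricks_backticks_digit_start():
    assert DatabricksTableAsset._needs_databricks_backticks("247_asset_class_returns")


@pytest.mark.unit
def test_resolve_quoted_name_digit_start():
    result = DatabricksTableAsset._resolve_quoted_name("247_asset_class_returns")
    assert isinstance(result, quoted_name)
    assert str(result) == "247_asset_class_returns"
```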
I mistakenly thought the tests were supposed to be for the changes made in the PR, not the end-to-end integration tests. I will change them back. Sorry about that.
`invoke lint` (uses `ruff format` + `ruff check`). For more information about contributing, visit our community resources.
After you submit your PR, keep the page open and monitor the statuses of the various checks made by our continuous integration process at the bottom of the page. Please fix any issues that come up and reach out on Slack if you need help. Thanks for contributing!