Skip to content

JDBC logging and safety improvements #1117

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Apr 14, 2025
Merged

JDBC logging and safety improvements #1117

merged 8 commits into from
Apr 14, 2025

Conversation

zaleslaw
Copy link
Collaborator

@zaleslaw zaleslaw commented Apr 3, 2025

This pull request includes several changes aimed at enhancing the validation and execution of SQL queries and table names, as well as improving the database connection handling in the dataframe-jdbc module.

The most important changes include the introduction of strict validation for SQL queries and table names, the replacement of createStatement with prepareStatement for better security, and the addition of new validation methods.

Enhancements to SQL Validation:

  • Added strictValidation parameter to methods readSqlTable, readSqlQuery, and readDataFrame to enforce or bypass strict validation of SQL queries and table names. Default is true.
  • Implemented isValidSqlQuery and isValidTableName methods to validate SQL queries and table names, ensuring they do not contain forbidden patterns and have balanced quotes.

Security Improvements:

Code Cleanup:

  • Removed the unused constant MULTIPLE_SQL_QUERY_SEPARATOR from readJdbc.kt.

Documentation Updates:

  • Updated method documentation to include descriptions for the new strictValidation parameter.

These changes collectively improve the robustness, security, and clarity of the codebase, ensuring safer and more reliable database interactions.

zaleslaw added 5 commits April 2, 2025 11:01
Introduces a new `strictValidation` parameter to SQL reading methods, defaulting to `true`, to ensure table names and queries follow a valid format. If disabled, a warning is logged, granting users flexibility while maintaining safety by default.
@zaleslaw zaleslaw requested a review from Copilot April 3, 2025 19:18
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces JDBC logging and safety improvements by adding strict validation for SQL queries and table names while also updating test cases to prevent SQL injection. Key changes include:

  • Adding tests to verify that invalid table names and malicious SQL queries are rejected.
  • Refactoring query execution to use prepared statements and introducing a strictValidation flag.
  • Enhancing validation methods with detailed logging for both SQL queries and table names.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
dataframe-jdbc/src/test/kotlin/org/jetbrains/kotlinx/dataframe/io/h2/h2Test.kt Added tests for SQL injection prevention and strictValidation behavior
dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/readJdbc.kt Refactored query execution to use prepared statements; added strictValidation parameter and enhanced SQL validation
dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/db/util.kt Updated database utility to use prepared statements for improved safety

@zaleslaw zaleslaw marked this pull request as ready for review April 3, 2025 19:22
@zaleslaw zaleslaw requested a review from Jolanrensen April 3, 2025 19:22
* Checks if a given string contains forbidden patterns or keywords.
* Logs a clear and friendly message if any forbidden pattern is found.
*/
private fun containsForbiddenPatterns(input: String): Boolean {
Copy link
Collaborator

@Jolanrensen Jolanrensen Apr 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering whether we, as DataFrame, are responsible for catching these types of attacks. Allow me to elaborate:
This PR makes it so that we can catch 5 types of attacks, which is good! But I'm pretty sure there are more, and we'll miss all of those. Couldn't we better ask the JDBC implementation or a proven validation library whether the statement is safe instead?

Copy link
Collaborator Author

@zaleslaw zaleslaw Apr 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I investigated a few libraries with statement validation: some of them supports a limited number of databases or have complex API and seems like overengineering or for example licensing is SOMETHING SPECIAL

Pros of Catching Attacks in DataFrame.readSql*:

  • Defense in depth: Even if JDBC is secure, a second layer helps prevent human errors in API usage.
  • Prevents misusing readSqlQuery for multiple queries or data modifications.
  • Improves developer ergonomics: Clear logs/warnings/errors when something sketchy is passed.
  • Customizability: The user can turn strictValidation off in edge cases.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also mostly protected on JDBC level with PreparedStatement and executeQuery, but...different drivers give different possiblities to push some DROP Table/Database stuff through executeQuery and execute it BEFORE exception

executeQuery("SELECT * FROM users; DROP TABLE X") could lead to SQL injection, especially if multiple statements are allowed by the DB/driver (in URL for example)

My idea was a 3-tier protection:

✅ Prepared statements

✅ Manual validation (for SELECT-only queries, no ;, no suspicious tokens)

✅ strictValidation as opt-in/out

Together, that's a pretty solid defense-in-depth.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds like a good approach indeed :)

}

// Validate the table name structure: letters, numbers, underscores, and dots are allowed
val tableNameRegex = Regex("^[A-Z0-9_]+(\\.[A-Z0-9_]+)*$")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regex can be a const somewhere

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only uppercase letters are allowed? Oh I see, normalizedTableName is made uppercase, got it

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, do we specifically only allow A-Z? So, not Æ, Ø, Å, etc? Otherwise, it might make more sense to replace [A-Z0-9_] by unicode category with [\p{L}\p{N}_] (https://www.regular-expressions.info/unicode.html). If SQL has a specific ISO standard for these names, let's stick to that instead.

Copy link
Collaborator Author

@zaleslaw zaleslaw Apr 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree about possibility to use other name, it seems inconsistent with other DataFrame parts, from other hand, I thought that this behavior could be covered with disabling strictValidation

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah probably, but it seems unnecessarily strict to me. If you're from a country that uses a different alphabet you will always have to turn off strictValidation in this case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said I agree, we're supporting different alphabets and symbols in other stuff, better to be aligned with that, I consider this and add test cases (honestly never named in Cyrillic any table in my life)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, Unicode is supported and test is added

Added handling and tests for tables with Unicode names in `readSqlTable`. Updated the table name validation regex to support Unicode letters, enhancing compatibility with diverse database naming conventions.
@zaleslaw zaleslaw merged commit 5f3b13f into master Apr 14, 2025
4 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants