Fix Error Message in .query() for DataFrame with Duplicate Column Names #60864

Open
Xenomorph149

Summary of Changes:

  • Enhanced the error messaging in the .query() method for pandas DataFrames when duplicate column names are present.
  • Prior to this change, invoking .query() on a DataFrame with duplicate column names resulted in an unclear TypeError, making it difficult for users to understand the root cause.
  • With this update, users will now receive a more descriptive and helpful ValueError, similar to when columns are accessed directly, with a message such as:
    "ValueError: cannot reindex on an axis with duplicate labels"

Reasoning Behind the Change:

  • Currently, .query() does not give clear feedback when users run queries on DataFrames with duplicate column names.
  • By improving the error message, we enhance the overall user experience, making it easier for users to diagnose and resolve issues related to duplicate columns.
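As an illustration of the "diagnose and resolve" step, here is a minimal sketch of how a user might find and drop repeated column labels before calling .query(). It uses only existing pandas calls (`Index.duplicated`, `DataFrame.loc`); the sample data and the choice to keep the first occurrence of each label are assumptions made for the example, not part of this PR.

```python
import pandas as pd

# Build a frame with a repeated label by passing the labels explicitly.
df = pd.DataFrame(
    [[1, 10, 5], [2, 8, 4], [3, 6, 3]],
    columns=["A", "B", "A"],
)

# Index.duplicated() marks every repeat after the first occurrence.
dup_mask = df.columns.duplicated()
duplicate_labels = df.columns[dup_mask].unique().tolist()
print(f"Duplicate labels: {duplicate_labels}")

# Assumption for this sketch: keep only the first occurrence of each label,
# so that .query() sees unique column names.
deduped = df.loc[:, ~dup_mask]
result = deduped.query("A <= 2 and B <= 8")
print(result)
```

Whether to drop, rename, or merge the repeated columns depends on the data; keeping the first occurrence is only one option.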

Testing Approach:

  • I tested the change by creating a sample DataFrame with duplicate column names and attempting to execute a .query() operation on it.
  • Below is the test case used:
import pandas as pd

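# Create a DataFrame with duplicate column names.
# A dict literal cannot hold two "A" keys (the second silently overwrites
# the first), so the column labels are passed explicitly instead.
df = pd.DataFrame(
    [[1, 10, 5], [2, 8, 4], [3, 6, 3], [4, 4, 2], [5, 2, 1]],
    columns=["A", "B", "A"],  # Duplicate column name "A"
)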

# Test the query functionality
try:
    result = df.query("A <= 4 and B <= 8")
    print(result)
except Exception as e:
    print(f"Error: {e}")
  • After applying the fix, the query raises a ValueError indicating that it cannot be executed because of duplicate column names. The except block in the test case prints that message, which makes the issue much easier to pinpoint.
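For readers on a pandas version without this change, the intended behavior can be approximated with a small wrapper. This is only a sketch: `query_checked` is a hypothetical helper name, its error message is invented for illustration, and it is not the implementation in this PR.

```python
import pandas as pd


def query_checked(df: pd.DataFrame, expr: str) -> pd.DataFrame:
    """Hypothetical helper: fail fast when column labels repeat."""
    if not df.columns.is_unique:
        # Collect each repeated label once for the error message.
        dupes = df.columns[df.columns.duplicated()].unique().tolist()
        raise ValueError(
            f"cannot query a DataFrame with duplicate column labels: {dupes}"
        )
    return df.query(expr)


df = pd.DataFrame(
    [[1, 10, 5], [2, 8, 4], [3, 6, 3]],
    columns=["A", "B", "A"],
)

try:
    query_checked(df, "A <= 4 and B <= 8")
except ValueError as e:
    print(f"Error: {e}")
```

Checking `df.columns.is_unique` up front keeps the check cheap and leaves the query itself untouched when the labels are already unique.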

Issue Addressed:

  • This PR resolves the issue documented in #60863, where .query() failed to provide a clear error message when used on DataFrames containing duplicate column names.
