Description
When using to_dummies() with both drop_first=True and drop_nulls=True, the implementation appears to apply drop_first before drop_nulls. This causes issues when null values are the first category (lexicographically), as the null column gets dropped by drop_first, leaving drop_nulls with nothing to do.
This behavior differs from the expected outcome where both the first non-null category and the null category should be dropped.
Environment
- Polars version: 1.35.1
- Python version: 3.12.3
Minimal Reproducible Example
import polars as pl
import pandas as pd
# Create DataFrame with null as first category
df = pl.DataFrame({
    "p54_i2": [None, "Cheadle (imaging)", "Bristol (imaging)", "Newcastle (imaging)", "Reading (imaging)"],
    "eid": [1, 2, 3, 4, 5],
}).with_columns(
    pl.col("p54_i2").cast(pl.Categorical)
)
# Test different parameter combinations
not_drop_first_df = df.to_dummies(columns=["p54_i2"])
drop_nulls_df = df.to_dummies(columns=["p54_i2"], drop_nulls=True)
drop_first_df = df.to_dummies(columns=["p54_i2"], drop_first=True)
drop_first_nulls_df = df.to_dummies(columns=["p54_i2"], drop_first=True, drop_nulls=True)
drop_first_pandas_df = pd.get_dummies(df.to_pandas(), columns=["p54_i2"], drop_first=True)
# Print results
print("No dropping:", not_drop_first_df.columns)
print("Drop nulls only:", drop_nulls_df.columns)
print("Drop first only:", drop_first_df.columns)
print("Drop first AND nulls:", drop_first_nulls_df.columns)
print("Drop first in pandas:", drop_first_pandas_df.columns)
Current Output
No dropping: ['p54_i2_Bristol (imaging)', 'p54_i2_Cheadle (imaging)', 'p54_i2_Newcastle (imaging)', 'p54_i2_Reading (imaging)', 'p54_i2_null', 'eid']
Drop nulls only: ['p54_i2_Bristol (imaging)', 'p54_i2_Cheadle (imaging)', 'p54_i2_Newcastle (imaging)', 'p54_i2_Reading (imaging)', 'p54_i2_null', 'eid']
Drop first only: ['p54_i2_Bristol (imaging)', 'p54_i2_Cheadle (imaging)', 'p54_i2_Newcastle (imaging)', 'p54_i2_Reading (imaging)', 'eid']
Drop first AND nulls: ['p54_i2_Bristol (imaging)', 'p54_i2_Cheadle (imaging)', 'p54_i2_Newcastle (imaging)', 'p54_i2_Reading (imaging)', 'eid']
Drop first in pandas: ['p54_i2_Bristol (imaging)', 'p54_i2_Newcastle (imaging)', 'p54_i2_Reading (imaging)', 'eid']
Expected Output
When using drop_first=True, drop_nulls=True, the expected result should drop both:
- The null category column (p54_i2_null)
- The first non-null category column (p54_i2_Bristol (imaging))
Drop first AND nulls: ['p54_i2_Bristol (imaging)', 'p54_i2_Newcastle (imaging)', 'p54_i2_Reading (imaging)', 'eid']
Analysis
The issue arises from the implementation order:
- drop_first=True drops the first category (which is null in this case)
- drop_nulls=True then has no null column to drop
This differs from pandas behavior and causes problems when working with datasets like UK Biobank where null values may be encoded as a category.
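The ordering effect described above can be modeled without Polars at all. The following is a toy sketch, not the library's actual implementation: it assumes categories are ordered by first appearance (so null comes first in the example data), and the function names drop_first_then_nulls and drop_nulls_then_first are hypothetical.

```python
from typing import List, Optional

Category = Optional[str]  # None stands in for the null category


def drop_first_then_nulls(categories: List[Category]) -> List[Category]:
    """Toy model of the suspected order: drop_first runs, then drop_nulls."""
    # drop_first removes whichever category comes first, null included.
    remaining = categories[1:]
    # drop_nulls then removes a null category only if one is still present.
    return [c for c in remaining if c is not None]


def drop_nulls_then_first(categories: List[Category]) -> List[Category]:
    """Toy model of the proposed order: drop_nulls runs, then drop_first."""
    non_null = [c for c in categories if c is not None]
    # drop_first now always removes a real (non-null) category.
    return non_null[1:]


# Assumption: categories in order of first appearance, as in the example data.
categories = [
    None,
    "Cheadle (imaging)",
    "Bristol (imaging)",
    "Newcastle (imaging)",
    "Reading (imaging)",
]

print(drop_first_then_nulls(categories))  # four categories survive
print(drop_nulls_then_first(categories))  # three categories survive
```

Under these assumptions, the first ordering keeps four category columns (matching the current output in count), while the second keeps three (matching the expected output).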
Question
Is this implementation order intentional, or should drop_nulls be applied before drop_first to ensure both operations have their intended effect?
Proposed Solution
Consider either:
- Applying drop_nulls before drop_first so both parameters work as expected
- Documenting this behavior clearly if it's intentional
- Having drop_first skip null categories and only drop the first non-null category
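Until the behavior is settled, one possible workaround is to call to_dummies() with neither flag and pick the columns to drop by name. The sketch below works on plain column-name lists; the helper dummy_columns_to_drop is hypothetical, and it assumes the "<column>_<value>" naming and the "null" label seen in the output above. "First" here means first in the order of the column list passed in.

```python
from typing import List


def dummy_columns_to_drop(
    columns: List[str],
    prefix: str,
    null_label: str = "null",
    separator: str = "_",
) -> List[str]:
    """Return the null dummy column (if present) plus the first non-null dummy column."""
    stem = f"{prefix}{separator}"
    null_column = f"{stem}{null_label}"
    # Dummy columns are assumed to be those starting with "<prefix><separator>".
    dummies = [c for c in columns if c.startswith(stem)]
    non_null = [c for c in dummies if c != null_column]
    to_drop = [c for c in dummies if c == null_column]
    if non_null:
        to_drop.append(non_null[0])
    return to_drop


# Column names copied from the "No dropping" output above.
columns = [
    "p54_i2_Bristol (imaging)",
    "p54_i2_Cheadle (imaging)",
    "p54_i2_Newcastle (imaging)",
    "p54_i2_Reading (imaging)",
    "p54_i2_null",
    "eid",
]

to_drop = dummy_columns_to_drop(columns, "p54_i2")
kept = [c for c in columns if c not in to_drop]
print(to_drop)
print(kept)
```

With a real DataFrame, the resulting list could be passed to drop(), for example df.to_dummies(columns=["p54_i2"]).drop(to_drop). Note that a genuine category literally named "null", or another column sharing the same prefix, would confuse this name-based matching.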
Context
This was discovered while working with the UK Biobank dataset. The behavior makes it difficult to create consistent dummy variable matrices when null handling is required.