to_dummies applies drop_first before drop_nulls, causing unexpected behavior #25334

@HumbleHumbert

Description

When to_dummies() is called with both drop_first=True and drop_nulls=True, the implementation appears to apply drop_first before drop_nulls. This causes a problem when null is the first category (in the example below, null is the first value in the column): drop_first removes the null column, leaving drop_nulls with nothing to drop, so every non-null category is kept.

The expected outcome is that both the null category and the first non-null category are dropped.

Environment

  • Polars version: 1.35.1
  • Python version: 3.12.3

Minimal Reproducible Example

import polars as pl
import pandas as pd

# Create DataFrame with null as first category
df = pl.DataFrame({
    "p54_i2": [None, "Cheadle (imaging)", "Bristol (imaging)", "Newcastle (imaging)", "Reading (imaging)"],
    "eid": [1, 2, 3, 4, 5]
}).with_columns(
    pl.col("p54_i2").cast(pl.Categorical)
)

# Test different parameter combinations
not_drop_first_df = df.to_dummies(columns=["p54_i2"])
drop_nulls_df = df.to_dummies(columns=["p54_i2"], drop_nulls=True)
drop_first_df = df.to_dummies(columns=["p54_i2"], drop_first=True)
drop_first_nulls_df = df.to_dummies(columns=["p54_i2"], drop_first=True, drop_nulls=True)
drop_first_pandas_df = pd.get_dummies(df.to_pandas(), columns=["p54_i2"], drop_first=True)

# Print results
print("No dropping:", not_drop_first_df.columns)
print("Drop nulls only:", drop_nulls_df.columns)
print("Drop first only:", drop_first_df.columns)
print("Drop first AND nulls:", drop_first_nulls_df.columns)
print("Drop first in pandas:", drop_first_pandas_df.columns)

Current Output

No dropping: ['p54_i2_Bristol (imaging)', 'p54_i2_Cheadle (imaging)', 'p54_i2_Newcastle (imaging)', 'p54_i2_Reading (imaging)', 'p54_i2_null', 'eid']
Drop nulls only: ['p54_i2_Bristol (imaging)', 'p54_i2_Cheadle (imaging)', 'p54_i2_Newcastle (imaging)', 'p54_i2_Reading (imaging)', 'p54_i2_null', 'eid']
Drop first only: ['p54_i2_Bristol (imaging)', 'p54_i2_Cheadle (imaging)', 'p54_i2_Newcastle (imaging)', 'p54_i2_Reading (imaging)', 'eid']
Drop first AND nulls: ['p54_i2_Bristol (imaging)', 'p54_i2_Cheadle (imaging)', 'p54_i2_Newcastle (imaging)', 'p54_i2_Reading (imaging)', 'eid']
Drop first in pandas:  ['p54_i2_Bristol (imaging)', 'p54_i2_Newcastle (imaging)', 'p54_i2_Reading (imaging)', 'eid']

Expected Output

With drop_first=True and drop_nulls=True, the result should drop both:

  1. The null category column (p54_i2_null)
  2. The first non-null category column (p54_i2_Cheadle (imaging)), matching the pandas result above

Drop first AND nulls: ['p54_i2_Bristol (imaging)', 'p54_i2_Newcastle (imaging)', 'p54_i2_Reading (imaging)', 'eid']

Analysis

The issue arises from the implementation order:

  1. drop_first=True drops the first category (which is null in this case)
  2. drop_nulls=True then has no null column to drop

This differs from pandas behavior and causes problems when working with datasets like UK Biobank where null values may be encoded as a category.
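
To make the suspected ordering concrete, the two orders can be modeled on a plain list of categories. This is only an illustrative sketch in pure Python, not the actual Polars implementation; the function names are hypothetical, and it assumes categories are ordered by first appearance, as the output above suggests.

```python
# Hypothetical model of the two orderings; not Polars internals.
def drop_first_then_nulls(categories):
    # Suspected current order: drop the first category, then any nulls.
    remaining = categories[1:]
    return [c for c in remaining if c is not None]


def drop_nulls_then_first(categories):
    # Proposed order: drop nulls, then the first remaining category.
    non_null = [c for c in categories if c is not None]
    return non_null[1:]


# Categories in order of appearance, with null first (as in the example).
cats = [
    None,
    "Cheadle (imaging)",
    "Bristol (imaging)",
    "Newcastle (imaging)",
    "Reading (imaging)",
]

observed = drop_first_then_nulls(cats)  # keeps all four sites
expected = drop_nulls_then_first(cats)  # keeps Bristol, Newcastle, Reading
print(observed)
print(expected)
```

In this model the first ordering keeps all four non-null categories, which matches the "Drop first AND nulls" output reported above, while the second ordering matches the pandas result.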

Question

Is this implementation order intentional, or should drop_nulls be applied before drop_first to ensure both operations have their intended effect?

Proposed Solution

Consider one of the following:

  1. Applying drop_nulls before drop_first so both parameters work as expected
  2. Documenting this behavior clearly if it's intentional
  3. Having drop_first skip null categories and only drop the first non-null category

Context

This was discovered while working with the UK Biobank dataset. The behavior makes it difficult to create consistent dummy variable matrices when null handling is required.
