`to_dummies` applies `drop_first` before `drop_nulls`, causing unexpected behavior

**Description**

When `to_dummies()` is called with both `drop_first=True` and `drop_nulls=True`, the implementation appears to apply `drop_first` before `drop_nulls`. This is a problem when null is treated as the first category (in the example below, the first value in the column is null): `drop_first` removes the null column, `drop_nulls` then has nothing left to do, and every non-null category is kept.

This behavior differs from the expected outcome where both the first non-null category and the null category should be dropped.

**Environment**
- Polars version: 1.35.1
- Python version: 3.12.3

**Minimal Reproducible Example**

```python
import polars as pl
import pandas as pd

# Create DataFrame with null as first category
df = pl.DataFrame({
    "p54_i2": [None, "Cheadle (imaging)", "Bristol (imaging)", "Newcastle (imaging)", "Reading (imaging)"],
    "eid": [1, 2, 3, 4, 5]
}).with_columns(
    pl.col("p54_i2").cast(pl.Categorical)
)

# Test different parameter combinations
not_drop_first_df = df.to_dummies(columns=["p54_i2"])
drop_nulls_df = df.to_dummies(columns=["p54_i2"], drop_nulls=True)
drop_first_df = df.to_dummies(columns=["p54_i2"], drop_first=True)
drop_first_nulls_df = df.to_dummies(columns=["p54_i2"], drop_first=True, drop_nulls=True)
drop_first_pandas_df = pd.get_dummies(df.to_pandas(), columns=["p54_i2"], drop_first=True)

# Print results
print("No dropping:", not_drop_first_df.columns)
print("Drop nulls only:", drop_nulls_df.columns)
print("Drop first only:", drop_first_df.columns)
print("Drop first AND nulls:", drop_first_nulls_df.columns)
print("Drop first in pandas:", drop_first_pandas_df.columns)
```

**Current Output**

```
No dropping: ['p54_i2_Bristol (imaging)', 'p54_i2_Cheadle (imaging)', 'p54_i2_Newcastle (imaging)', 'p54_i2_Reading (imaging)', 'p54_i2_null', 'eid']
Drop nulls only: ['p54_i2_Bristol (imaging)', 'p54_i2_Cheadle (imaging)', 'p54_i2_Newcastle (imaging)', 'p54_i2_Reading (imaging)', 'p54_i2_null', 'eid']
Drop first only: ['p54_i2_Bristol (imaging)', 'p54_i2_Cheadle (imaging)', 'p54_i2_Newcastle (imaging)', 'p54_i2_Reading (imaging)', 'eid']
Drop first AND nulls: ['p54_i2_Bristol (imaging)', 'p54_i2_Cheadle (imaging)', 'p54_i2_Newcastle (imaging)', 'p54_i2_Reading (imaging)', 'eid']
Drop first in pandas:  ['p54_i2_Bristol (imaging)', 'p54_i2_Newcastle (imaging)', 'p54_i2_Reading (imaging)', 'eid']
```

**Expected Output**

With `drop_first=True, drop_nulls=True`, the result should drop both:
1. The null category column (`p54_i2_null`)
2. The first non-null category column (`p54_i2_Cheadle (imaging)`, the first non-null value in the data, matching the pandas result above)

```
Drop first AND nulls: ['p54_i2_Bristol (imaging)', 'p54_i2_Newcastle (imaging)', 'p54_i2_Reading (imaging)', 'eid']
```

**Analysis**

The issue arises from the implementation order:
1. `drop_first=True` drops the first category (which is `null` in this case)
2. `drop_nulls=True` then has no null column to drop

This differs from pandas behavior and causes problems when working with datasets like UK Biobank where null values may be encoded as a category.
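The effect of the two orderings can be illustrated with a toy model in plain Python. This is only a sketch of the ordering described above, not Polars' actual implementation: `select_categories` and its `nulls_first` flag are hypothetical names, and it assumes the categories are listed in order of appearance with null (`None`) first.

```python
def select_categories(categories, drop_first, drop_nulls, nulls_first):
    """Toy model: return the categories that would get a dummy column.

    nulls_first=False mimics the suspected current order
    (drop_first, then drop_nulls); nulls_first=True mimics the
    proposed order (drop_nulls, then drop_first).
    """
    kept = list(categories)
    steps = ["nulls", "first"] if nulls_first else ["first", "nulls"]
    for step in steps:
        if step == "nulls" and drop_nulls:
            # Remove the null category if it is still present.
            kept = [c for c in kept if c is not None]
        elif step == "first" and drop_first and kept:
            # Remove whatever category is currently first.
            kept = kept[1:]
    return kept


# Categories in order of appearance, with null first (as in the example data).
cats = [None, "Cheadle", "Bristol", "Newcastle", "Reading"]

suspected = select_categories(cats, drop_first=True, drop_nulls=True, nulls_first=False)
proposed = select_categories(cats, drop_first=True, drop_nulls=True, nulls_first=True)
print("drop_first then drop_nulls:", suspected)
print("drop_nulls then drop_first:", proposed)
```

In this model the first ordering keeps all four non-null categories, which matches the observed output, while the second also drops `Cheadle`, which matches the expected output.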

**Question**

Is this implementation order intentional, or should `drop_nulls` be applied before `drop_first` to ensure both operations have their intended effect?

**Proposed Solution**

Consider either:
1. Applying `drop_nulls` before `drop_first` so both parameters work as expected
2. Documenting this behavior clearly if it's intentional
3. Having `drop_first` skip null categories and only drop the first non-null category

**Context**

This was discovered while working with the UK Biobank dataset. The behavior makes it difficult to create consistent dummy variable matrices when null handling is required.
