@davidia commented Dec 4, 2025

When performing left/right joins on chunked DataFrames, the `take_chunked_unchecked` and `take_opt_chunked_unchecked` methods for List types lost the inner dtype information. As a result, `List(Categorical)` became `List(UInt32)`, because `ChunkedArray::with_chunk` re-infers the dtype from the physical Arrow array.

The fix preserves the original dtype by calling `Series::from_chunks_and_dtype_unchecked` with the original `self.dtype()` instead of letting the dtype be re-inferred.
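
To illustrate why re-inference loses information, here is a minimal sketch in plain Rust. It is a toy model and not the Polars API: `DataType`, `PhysicalType`, `to_physical`, and `infer_dtype` are hypothetical names, and the model assumes, as the description above implies, that categorical values are stored physically as `u32` codes.

```rust
// Toy model only: these types and functions are hypothetical, not Polars code.

/// Logical dtypes, as seen by the user.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    UInt32,
    Categorical,
    List(Box<DataType>),
}

/// Physical layouts, which carry no logical type information.
#[derive(Debug, Clone, PartialEq)]
enum PhysicalType {
    UInt32,
    List(Box<PhysicalType>),
}

/// Maps a logical dtype to its physical layout.
/// Assumption: categoricals are stored as u32 codes.
fn to_physical(dtype: &DataType) -> PhysicalType {
    match dtype {
        DataType::UInt32 | DataType::Categorical => PhysicalType::UInt32,
        DataType::List(inner) => PhysicalType::List(Box::new(to_physical(inner))),
    }
}

/// Infers a logical dtype from the physical layout alone.
/// Two logical dtypes share one layout, so this cannot recover Categorical.
fn infer_dtype(physical: &PhysicalType) -> DataType {
    match physical {
        PhysicalType::UInt32 => DataType::UInt32,
        PhysicalType::List(inner) => DataType::List(Box::new(infer_dtype(inner))),
    }
}
```

In this model, `infer_dtype(&to_physical(&DataType::List(Box::new(DataType::Categorical))))` yields `List(UInt32)`. The mapping to the physical layout is many-to-one, so the only way to keep `List(Categorical)` is to carry the original dtype alongside the data, which is what passing `self.dtype()` explicitly does.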

Fixes #25626

@github-actions bot added the fix (Bug fix) and rust (Related to Rust Polars) labels Dec 4, 2025
codecov bot commented Dec 5, 2025

Codecov Report

❌ Patch coverage is 92.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.44%. Comparing base (4742c6a) to head (233d6cc).
⚠️ Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...tes/polars-ops/src/chunked_array/gather/chunked.rs | 92.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #25634      +/-   ##
==========================================
+ Coverage   79.35%   79.44%   +0.09%     
==========================================
  Files        1743     1743              
  Lines      240295   240328      +33     
  Branches     3038     3038              
==========================================
+ Hits       190683   190928     +245     
+ Misses      48830    48618     -212     
  Partials      782      782              


let ca = self.list().unwrap();
ca.take_chunked_unchecked(by, sorted, avoid_sharing)
.into_series()
let taken = ca.take_chunked_unchecked(by, sorted, avoid_sharing);
Collaborator

I don't think this is the right place to fix it. Could we instead update the `take_chunked_unchecked()` impl for `ChunkedArray` so that the original dtype is restored there?

unsafe fn take_chunked_unchecked<const B: u64>(

@davidia force-pushed the fix-list-categorical-dtype-loss branch 3 times, most recently from c725f27 to dc53b56, December 5, 2025 13:59
When performing left/right joins on chunked DataFrames (common with the native
Parquet reader), the `take_chunked_unchecked` and `take_opt_chunked_unchecked`
methods would lose dtype information for nested types like `List(Categorical)`.

The issue was that `ChunkedArray::with_chunk` re-infers the dtype from the
physical Arrow array, causing `List(Categorical)` to become `List(UInt32)`.

The fix uses `ChunkedArray::with_chunk_like` instead, which preserves the
original ChunkedArray's dtype when constructing the result.
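
The difference between the two constructors can be sketched with a small stand-in type. This is a toy model under stated assumptions, not the Polars implementation: the `ChunkedArray` struct below is hypothetical, it stores only `u32` values, its method names mirror the commit message for readability only, and `take_chunked` is a bounds-checked simplification of the real `unsafe` method.

```rust
// Toy model only: a hypothetical stand-in, not the Polars ChunkedArray.

#[derive(Debug, Clone, PartialEq)]
enum DataType {
    UInt32,
    Categorical,
}

#[derive(Debug, Clone)]
struct ChunkedArray {
    dtype: DataType,
    chunks: Vec<Vec<u32>>,
}

impl ChunkedArray {
    /// Builds a single-chunk array and infers the dtype from the physical
    /// values. The values are always u32 here, so the result is always UInt32.
    fn with_chunk(values: Vec<u32>) -> Self {
        ChunkedArray { dtype: DataType::UInt32, chunks: vec![values] }
    }

    /// Builds a single-chunk array that copies the dtype from `template`
    /// instead of inferring it.
    fn with_chunk_like(template: &ChunkedArray, values: Vec<u32>) -> Self {
        ChunkedArray { dtype: template.dtype.clone(), chunks: vec![values] }
    }

    /// Gathers rows addressed as (chunk index, row index within that chunk).
    /// Panics on out-of-bounds indices, unlike an unchecked gather.
    fn take_chunked(&self, by: &[(usize, usize)]) -> Self {
        let values: Vec<u32> = by.iter().map(|&(c, r)| self.chunks[c][r]).collect();
        // Using the dtype-preserving constructor keeps the logical type.
        ChunkedArray::with_chunk_like(self, values)
    }
}
```

With a two-chunk `Categorical` array, `take_chunked` returns a single-chunk array that is still `Categorical`, whereas building the same values through `with_chunk` would report `UInt32`. Putting the dtype handling inside the gather routine means every caller benefits, rather than each call site having to restore the dtype afterwards.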

Fixes pola-rs#25626
@davidia force-pushed the fix-list-categorical-dtype-loss branch from dc53b56 to 233d6cc, December 5, 2025 14:23
Collaborator

@nameexhaustion left a comment

Thanks

Development

Successfully merging this pull request may close these issues.

Under LEFT join list[cat] sometimes becomes list[u32] after parquet serialization