
Under LEFT join list[cat] sometimes becomes list[u32] after parquet serialization #25626

@davidia

Description

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import tempfile

cat_val = [f"XX{i:010d}" for i in range(300)]
data = []
for i in range(200):
    data.append({'col1': i, 'cat': cat_val})

df = pl.DataFrame(data).with_columns(pl.col('cat').list.eval(pl.element().cast(pl.Categorical)))
tmp = tempfile.NamedTemporaryFile(suffix='.parquet', delete=False)
df.write_parquet(tmp.name)
df2 = pl.read_parquet(tmp.name)
join_df = pl.DataFrame({'col1': [17]})

left = join_df.join(df2, on='col1', how='left')
inner = join_df.join(df2, on='col1', how='inner')

print("LEFT:", left.schema['cat'])
print("INNER:", inner.schema['cat'])
print("LEFT data:", left['cat'][0][:5])
print("INNER data:", inner['cat'][0][:5])

Log output

_init_credential_provider_builder(): credential_provider_init = None
Writeable: try_new: local: /tmp/tmp5vpw18k7.parquet (canonicalize: Ok("/tmp/tmp5vpw18k7.parquet"))
_init_credential_provider_builder(): credential_provider_init = None
sourcing parquet scan file schema from: '/tmp/tmp5vpw18k7.parquet'
polars-stream: updating graph state
async thread count: 4
polars-stream: running in-memory-sink in subgraph
polars-stream: running multi-scan[parquet] in subgraph
[MultiScanTaskInit]: 1 sources, reader name: parquet, ReaderCapabilities(ROW_INDEX | PRE_SLICE | NEGATIVE_PRE_SLICE | PARTIAL_FILTER | FULL_FILTER | MAPPED_COLUMN_PROJECTION), n_readers_pre_init: 1, max_concurrent_scans: 1
[MultiScanTaskInit]: predicate: None, skip files mask: None, predicate to reader: None
[MultiScanTaskInit]: scan_source_idx: 0, extra_ops: ExtraOperations { row_index: None, row_index_col_idx: 18446744073709551615, pre_slice: None, include_file_paths: None, file_path_col_idx: 18446744073709551615, predicate: None }
[MultiScanTaskInit]: Readers init: 1 / (1 total) (range: 0..1, filtered out: 0)
[MultiScan]: Initialize source 0
[ReaderStarter]: scan_source_idx: 0
[ReaderStarter]: max_concurrent_scans is 1, waiting..
[AttachReaderToBridge]: received reader (n_readers_received: 1)
[ReaderStarter]: scan_source_idx: 0: pre_slice_to_reader: None, external_filter_mask: None, file_iceberg_schema: None
memory prefetch function: madvise_willneed
[ParquetFileReader]: project: 2 / 2, pre_slice: None, resolved_pre_slice: None, row_index: None, predicate: None 
[ParquetFileReader]: Config { num_pipelines: 56, row_group_prefetch_size: 128, target_values_per_thread: 16777216 }
[ParquetFileReader]: ideal_morsel_size: 100000
start_reader_impl: scan_source_idx: 0, first_morsel_position: RowCounter { physical_rows: 0, deleted_rows: 0 }
start_reader_impl: scan_source_idx: 0, ApplyExtraOps::Noop, first_morsel_position: RowCounter { physical_rows: 0, deleted_rows: 0 }
[ReaderStarter]: Stopping (no more readers)
[MultiScanState]: Readers disconnected
polars-stream: done running graph phase
polars-stream: updating graph state
join parallel: true
LEFT join dataframes finished
join parallel: true
INNER join dataframes finished

Issue description

A left join changes the type of a list[cat] column to list[u32] when the dataframe has been round-tripped through Parquet; an inner join on the same data preserves the dtype. Output from the script above:

LEFT: List(UInt32)
INNER: List(Categorical)
LEFT data: shape: (5,)
Series: '' [u32]
[
	0
	1
	2
	3
	4
]
INNER data: shape: (5,)
Series: '' [cat]
[
	"XX0000000000"
	"XX0000000001"
	"XX0000000002"
	"XX0000000003"
	"XX0000000004"
]

Expected behavior

The column should remain list[cat] after the left join, as it does with the inner join.

Installed versions

--------Version info---------
Polars:              1.35.2
Index type:          UInt32
Platform:            Linux-5.15.150.1-microsoft-standard-WSL2-x86_64-with-glibc2.39
Python:              3.12.5 | packaged by conda-forge | (main, Aug  8 2024, 18:36:51) [GCC 12.4.0]
Runtime:             rt32

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               5.5.0
azure.identity       <not installed>
boto3                1.35.22
cloudpickle          3.0.0
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.9.0
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           3.10.3
numpy                1.26.4
openpyxl             3.1.5
pandas               2.1.4
polars_cloud         <not installed>
pyarrow              18.1.0
pydantic             2.8.2
pyiceberg            <not installed>
sqlalchemy           1.4.49
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>

Metadata


    Labels

    A-dtype-categorical (Area: categorical data type)
    A-io-parquet (Area: reading/writing Parquet files)
    P-high (Priority: high)
    accepted (Ready for implementation)
    bug (Something isn't working)
    python (Related to Python Polars)
