Open
Labels
A-dtype-categorical (Area: categorical data type), A-io-parquet (Area: reading/writing Parquet files), P-high (Priority: high), accepted (Ready for implementation), bug (Something isn't working), python (Related to Python Polars)
Description
Checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import polars as pl
import tempfile
cat_val = [f"XX{i:010d}" for i in range(300)]
data = []
for i in range(200):
    data.append({'col1': i, 'cat': cat_val})
df = pl.DataFrame(data).with_columns(pl.col('cat').list.eval(pl.element().cast(pl.Categorical)))
tmp = tempfile.NamedTemporaryFile(suffix='.parquet', delete=False)
df.write_parquet(tmp.name)
df2 = pl.read_parquet(tmp.name)
join_df = pl.DataFrame({'col1': [17]})
left = join_df.join(df2, on='col1', how='left')
inner = join_df.join(df2, on='col1', how='inner')
print("LEFT:", left.schema['cat'])
print("INNER:", inner.schema['cat'])
print("LEFT data:", left['cat'][0][:5])
print("INNER data:", inner['cat'][0][:5])
Log output
_init_credential_provider_builder(): credential_provider_init = None
Writeable: try_new: local: /tmp/tmp5vpw18k7.parquet (canonicalize: Ok("/tmp/tmp5vpw18k7.parquet"))
_init_credential_provider_builder(): credential_provider_init = None
sourcing parquet scan file schema from: '/tmp/tmp5vpw18k7.parquet'
polars-stream: updating graph state
async thread count: 4
polars-stream: running in-memory-sink in subgraph
polars-stream: running multi-scan[parquet] in subgraph
[MultiScanTaskInit]: 1 sources, reader name: parquet, ReaderCapabilities(ROW_INDEX | PRE_SLICE | NEGATIVE_PRE_SLICE | PARTIAL_FILTER | FULL_FILTER | MAPPED_COLUMN_PROJECTION), n_readers_pre_init: 1, max_concurrent_scans: 1
[MultiScanTaskInit]: predicate: None, skip files mask: None, predicate to reader: None
[MultiScanTaskInit]: scan_source_idx: 0, extra_ops: ExtraOperations { row_index: None, row_index_col_idx: 18446744073709551615, pre_slice: None, include_file_paths: None, file_path_col_idx: 18446744073709551615, predicate: None }
[MultiScanTaskInit]: Readers init: 1 / (1 total) (range: 0..1, filtered out: 0)
[MultiScan]: Initialize source 0
[ReaderStarter]: scan_source_idx: 0
[ReaderStarter]: max_concurrent_scans is 1, waiting..
[AttachReaderToBridge]: received reader (n_readers_received: 1)
[ReaderStarter]: scan_source_idx: 0: pre_slice_to_reader: None, external_filter_mask: None, file_iceberg_schema: None
memory prefetch function: madvise_willneed
[ParquetFileReader]: project: 2 / 2, pre_slice: None, resolved_pre_slice: None, row_index: None, predicate: None
[ParquetFileReader]: Config { num_pipelines: 56, row_group_prefetch_size: 128, target_values_per_thread: 16777216 }
[ParquetFileReader]: ideal_morsel_size: 100000
start_reader_impl: scan_source_idx: 0, first_morsel_position: RowCounter { physical_rows: 0, deleted_rows: 0 }
start_reader_impl: scan_source_idx: 0, ApplyExtraOps::Noop, first_morsel_position: RowCounter { physical_rows: 0, deleted_rows: 0 }
[ReaderStarter]: Stopping (no more readers)
[MultiScanState]: Readers disconnected
polars-stream: done running graph phase
polars-stream: updating graph state
join parallel: true
LEFT join dataframes finished
join parallel: true
INNER join dataframes finished
Issue description
A left join changes the dtype of a list[cat] column to list[u32] when the DataFrame has been round-tripped through Parquet. An inner join on the same data keeps the correct dtype. Output from the script above:
LEFT: List(UInt32)
INNER: List(Categorical)
LEFT data: shape: (5,)
Series: '' [u32]
[
0
1
2
3
4
]
INNER data: shape: (5,)
Series: '' [cat]
[
"XX0000000000"
"XX0000000001"
"XX0000000002"
"XX0000000003"
"XX0000000004"
]
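The left-join output above looks like bare dictionary codes rather than category strings. As a rough illustration of the general dictionary-encoding idea (a sketch only, not a claim about Polars internals; the `mapping`/`categories` names are hypothetical), a categorical column is physically integer codes plus a code-to-string table, and losing that table degrades the column to plain integers:

```python
# Sketch: a categorical column is stored as integer codes plus a
# code -> string table (the "categories"). If only the codes survive,
# the column degrades to plain integers, matching the LEFT output above.
cat_val = [f"XX{i:010d}" for i in range(5)]

# Dictionary encoding: string -> code, and the reverse lookup table.
mapping = {s: i for i, s in enumerate(cat_val)}
codes = [mapping[s] for s in cat_val]   # physical representation of the column
categories = list(mapping)              # reverse table: code -> string

# With the table present, codes decode back to the original strings.
decoded = [categories[c] for c in codes]
assert decoded == cat_val

# Without the table, only the bare u32-like codes remain: 0, 1, 2, ...
print(codes)
```

This is consistent with the observed symptom: the left-join result prints 0, 1, 2, ... where the inner-join result prints "XX0000000000", "XX0000000001", ....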
Expected behavior
The column should still be list[cat] after the left join, as it is with the inner join.
Installed versions
--------Version info---------
Polars: 1.35.2
Index type: UInt32
Platform: Linux-5.15.150.1-microsoft-standard-WSL2-x86_64-with-glibc2.39
Python: 3.12.5 | packaged by conda-forge | (main, Aug 8 2024, 18:36:51) [GCC 12.4.0]
Runtime: rt32
----Optional dependencies----
Azure CLI <not installed>
adbc_driver_manager <not installed>
altair 5.5.0
azure.identity <not installed>
boto3 1.35.22
cloudpickle 3.0.0
connectorx <not installed>
deltalake <not installed>
fastexcel <not installed>
fsspec 2024.9.0
gevent <not installed>
google.auth <not installed>
great_tables <not installed>
matplotlib 3.10.3
numpy 1.26.4
openpyxl 3.1.5
pandas 2.1.4
polars_cloud <not installed>
pyarrow 18.1.0
pydantic 2.8.2
pyiceberg <not installed>
sqlalchemy 1.4.49
torch <not installed>
xlsx2csv <not installed>
xlsxwriter <not installed>