TypeError returned when scanning pyarrow dataset with allow_pyarrow_filter=False #25316

@westonpace

Description

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import pyarrow as pa
import pyarrow.dataset as ds

data = pa.table({
    "vector": [[3.1, 4.1], [5.9, 26.5]],
    "item": ["foo", "bar"],
    "price": [10.0, 20.0],
})
pyds = ds.dataset(data)
# With allow_pyarrow_filter=True this passes; with allow_pyarrow_filter=False it currently fails
print(pl.scan_pyarrow_dataset(pyds, allow_pyarrow_filter=False).first().collect())

Log output

Traceback (most recent call last):
  File "/home/pace/dev/lance-experiments/polars-bug/simple_repr.py", line 11, in <module>
    print(pl.scan_pyarrow_dataset(pyds, allow_pyarrow_filter=False).first().collect())
  File "/home/pace/miniconda3/envs/lance/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 97, in wrapper
    return function(*args, **kwargs)
  File "/home/pace/miniconda3/envs/lance/lib/python3.10/site-packages/polars/lazyframe/opt_flags.py", line 328, in wrapper
    return function(*args, **kwargs)
  File "/home/pace/miniconda3/envs/lance/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 2422, in collect
    return wrap_df(ldf.collect(engine, callback))
  File "/home/pace/miniconda3/envs/lance/lib/python3.10/site-packages/polars/_utils/scan.py", line 27, in _execute_from_rust
    return function(with_columns, *args)
TypeError: _scan_pyarrow_dataset_impl() got multiple values for argument 'batch_size'
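The "got multiple values for argument" TypeError is Python's generic error when the same parameter is filled both positionally and by keyword in one call. The Polars internals are not shown in the traceback, so the function below is purely illustrative (not the real _scan_pyarrow_dataset_impl signature); it is only a minimal sketch of the mechanism that produces this error shape:

def _scan_impl(ds, batch_size=None, with_columns=None):
    # stand-in for the internal scan callback; the real signature may differ
    return ds

# The positional 1000 already fills the batch_size slot, so passing
# batch_size=1000 by keyword as well collides with it and Python raises:
# TypeError: _scan_impl() got multiple values for argument 'batch_size'
_scan_impl("dataset", 1000, batch_size=1000)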

Issue description

This worked correctly in previous versions of Polars: it works in 1.3.0 and fails in 1.4.1.
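
As the comment in the reproducible example notes, the failing path is only hit with allow_pyarrow_filter=False, so callers that can accept pyarrow-side predicate pushdown can fall back to the default until the regression is fixed. A minimal sketch, reusing the pyds dataset from the example above:

# allow_pyarrow_filter defaults to True; with it enabled the scan succeeds
print(pl.scan_pyarrow_dataset(pyds, allow_pyarrow_filter=True).first().collect())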

Expected behavior

The code should run without error.

Installed versions

--------Version info---------
Polars:              1.35.2
Index type:          UInt32
Platform:            Linux-6.8.0-87-generic-x86_64-with-glibc2.39
Python:              3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
Runtime:             rt32

----Optional dependencies----
Azure CLI            2.76.0
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                1.35.58
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.6.1
gevent               <not installed>
google.auth          2.27.0
great_tables         <not installed>
matplotlib           3.9.2
numpy                2.2.3
openpyxl             <not installed>
pandas               2.2.2
polars_cloud         <not installed>
pyarrow              20.0.0
pydantic             2.12.2
pyiceberg            <not installed>
sqlalchemy           1.4.54
torch                2.7.1+cu126
xlsx2csv             <not installed>
xlsxwriter           <not installed>

Labels

  • A-interop-arrow: interoperability with other Arrow implementations (such as pyarrow)
  • bug: Something isn't working
  • needs triage: Awaiting prioritization by a maintainer
  • python: Related to Python Polars
