
Cannot load a binary column of many rows via the to_arrow method. #344

Closed
castedice opened this issue Feb 1, 2024 · 5 comments

Comments

@castedice
Contributor

Feature Request / Improvement

I manage binary data ranging in size from 500 KB to 2 MB per value, together with metadata about it, in a single table.
The table has hundreds of thousands to millions of rows.
Writing the data to the table as Parquet files via pyiceberg works without any problem.

However, when I fetch the data from the table via the to_arrow method, I get an OOM error while pyarrow's combine_chunks method runs.
This is because pyarrow's binary type uses 32-bit offsets, so a single array can hold only about 2 GB of data (in my case the error shows up after roughly 4000 rows).
The same error is often reported when pyarrow is used by other libraries besides pyiceberg (arrow, ray, vaex, ...).

Storing and managing data such as images, audio, or LLM tokens as binary doesn't seem like an unusual use case to me.
So why not change pyiceberg to load binary columns as large_binary when converting to pyarrow, so that it can handle data of this size?

For now, I have solved the problem for my use case with a few local modifications.
While testing them for a contribution, I realized that pyiceberg relies on pyarrow's binary type in many places internally, so more fixes and test-code changes would be needed.
However, I'm not sure whether this is the direction the pyiceberg maintainers want to go.
If this isn't specific to my situation and the maintainers agree with the direction, I'll submit a PR to address it.
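
For context, here is a minimal standalone sketch of the limitation described above; the sizes and column name are illustrative assumptions, not taken from the issue, and the script allocates several GiB of RAM. A plain pa.binary() column uses 32-bit offsets and cannot be combined into a single array once its payload exceeds ~2 GiB, while pa.large_binary() (64-bit offsets) can.

```python
import pyarrow as pa

# One ~1 GiB chunk of 512 KiB blobs (illustrative sizes, not the reporter's data).
chunk = pa.array([b"x" * (512 * 1024)] * 2048, type=pa.binary())

# Three references to the same chunk: more than 2 GiB of payload in total.
col = pa.chunked_array([chunk, chunk, chunk])
table = pa.Table.from_arrays([col], names=["blob"])

try:
    # binary() uses 32-bit offsets, so flattening > 2 GiB into one array fails
    # (depending on the pyarrow version this surfaces as an overflow error or OOM).
    table.combine_chunks()
except pa.ArrowInvalid as exc:
    print("binary() overflow:", exc)

# large_binary() uses 64-bit offsets; the same data combines fine (memory permitting).
large = table.cast(pa.schema([pa.field("blob", pa.large_binary())]))
large.combine_chunks()
```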

@Fokko
Contributor

Fokko commented Feb 1, 2024

Hey @castedice, thanks for reaching out here.

The logic of creating a PyArrow dataframe is currently sub-optimal as you already mentioned. We're always on the lookout to make this more efficient, ideally by pushing things to Arrow itself instead of doing it in Python.

I'm very curious about your solution and I would love to see that PR! 👍

@castedice
Contributor Author

Thank you for your quick feedback. 👍
I'll submit a PR once I've adjusted the code to match the contribution guidelines.
It may take longer than I expected, as there were test failures in places I hadn't anticipated.

@Fokko
Contributor

Fokko commented Feb 5, 2024

@castedice Also, feel free to open a draft PR if you want some early feedback.

@castedice
Contributor Author

This was a tough week and I didn't have time to work on it.
I'll try to submit a PR tomorrow or the day after.
The way you support large_string in #382 is similar to how I was thinking of doing it.
Alternatively, I was considering replacing all binary types with large_binary.
That seems like a much simpler approach, although the pyarrow objects would then take a bit more memory (and, in terms of modifying the test code, it might be more work).
From a strict memory-management point of view it seems like a solution you might be reluctant to adopt; I'd be interested in your thoughts. @Fokko
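
As an illustration of that alternative (mapping binary/string types to their large_ counterparts), here is a small pyarrow-level sketch. It is a hedged example, not the code from #382 or the PR that eventually landed, and it only handles top-level fields; a complete version would also recurse into nested list/struct types.

```python
import pyarrow as pa

def _widen_field(field: pa.Field) -> pa.Field:
    """Swap 32-bit-offset types for their 64-bit-offset counterparts."""
    if pa.types.is_binary(field.type):
        return field.with_type(pa.large_binary())
    if pa.types.is_string(field.type):
        return field.with_type(pa.large_string())
    return field

def widen_schema(schema: pa.Schema) -> pa.Schema:
    """Return a schema where binary/string columns use large_binary/large_string."""
    return pa.schema([_widen_field(f) for f in schema], metadata=schema.metadata)

# Usage sketch: cast each per-file table to the widened schema *before* the pieces
# are concatenated / combined, so the 32-bit offsets never get a chance to overflow.
# widened = file_table.cast(widen_schema(file_table.schema))
```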

@Fokko
Contributor

Fokko commented Feb 16, 2024

Closing this one, since #409 has been merged.

Fokko closed this as completed Feb 16, 2024