Cannot load a binary column of many rows via the `to_arrow` method. #344
Comments
Hey @castedice, thanks for reaching out here. The logic of creating a PyArrow dataframe is currently sub-optimal, as you already mentioned. We're always on the lookout to make this more efficient, ideally by pushing things down to Arrow itself instead of doing them in Python. I'm very curious about your solution and I would love to see that PR! 👍
Thank you for your quick feedback. 👍
@castedice Also feel free to open up a draft PR if you want some early feedback.
This week was a tough week and I didn't have time to work on it. |
Closing this one, since #409 has been merged.
Feature Request / Improvement
I manage binary data ranging in size from 500 KB to 2 MB, together with metadata about that data, in a single table.
The number of rows ranges from hundreds of thousands to millions.
Saving the data from the table as a parquet file via pyiceberg is no problem.
However, when I fetch the data from the table via the `to_arrow` method, I get an OOM error while pyarrow's `combine_chunks` method is running. This is because pyarrow's `binary` type uses 32-bit offsets, which means a single array cannot hold more than about 2 GB of data (in my case, I get the error after about 4,000 rows).
This error is often reported when pyarrow is used in other libraries besides pyiceberg (arrow, ray, vaex, ...).
I don't think it's a strange usage to store and manage data like images, sounds, LLM tokens, etc. in binary.
So why not change pyiceberg to load data as `large_binary` when converting to pyarrow, so that it can handle such large columns?
For now, I have solved the problem in my use case with a few modifications.
When I tested my changes with a view to contributing, I realized that pyarrow's `binary` type is used in many places internally, so more fixes, along with updates to the test code, would be needed.
However, I'm not sure whether this is the direction the pyiceberg maintainers want to go.
If this is not just my particular situation and the maintainers agree with this direction, I will submit a PR to address it.