
Cannot load a binary column of many rows via the to_arrow method. #344

Closed
castedice opened this issue Feb 1, 2024 · 5 comments

Comments

@castedice
Contributor

Feature Request / Improvement

I manage binary data ranging in size from 500 KB to 2 MB per value, together with metadata about it, in a single table.
The table has hundreds of thousands to millions of rows.
Writing the data to the table as Parquet files via pyiceberg works without any problem.

However, when I fetch the data from the table via the to_arrow method, I get an OOM error while pyarrow's combine_chunks method runs.
This is because pyarrow's binary type uses 32-bit offsets, so a single array can hold only about 2 GB of data (in my case the error shows up after roughly 4000 rows).
The same error is often reported when pyarrow is used by other libraries besides pyiceberg (arrow, ray, vaex, ...).

Storing and managing data such as images, audio, or LLM tokens as binary doesn't seem like an unusual use case to me.
So why not change pyiceberg to load binary columns as large_binary when converting to pyarrow, so that it can handle data of this size?

For now, I have solved the problem for my use case with a few local modifications.
While testing them for a contribution, I realized that pyiceberg relies on pyarrow's binary type in many places internally, so more fixes and test-code changes would be needed.
However, I'm not sure whether this is the direction the pyiceberg maintainers want to go.
If this isn't specific to my situation and the maintainers agree with the direction, I'll submit a PR to address it.
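
For context, here is a minimal standalone sketch of the limitation described above; the sizes and column name are illustrative assumptions, not taken from the issue, and the script allocates several GiB of RAM. A plain pa.binary() column uses 32-bit offsets and cannot be combined into a single array once its payload exceeds ~2 GiB, while pa.large_binary() (64-bit offsets) can.

```python
import pyarrow as pa

# One ~1 GiB chunk of 512 KiB blobs (illustrative sizes, not the reporter's data).
chunk = pa.array([b"x" * (512 * 1024)] * 2048, type=pa.binary())

# Three references to the same chunk: more than 2 GiB of payload in total.
col = pa.chunked_array([chunk, chunk, chunk])
table = pa.Table.from_arrays([col], names=["blob"])

try:
    # binary() uses 32-bit offsets, so flattening > 2 GiB into one array fails
    # (depending on the pyarrow version this surfaces as an overflow error or OOM).
    table.combine_chunks()
except pa.ArrowInvalid as exc:
    print("binary() overflow:", exc)

# large_binary() uses 64-bit offsets; the same data combines fine (memory permitting).
large = table.cast(pa.schema([pa.field("blob", pa.large_binary())]))
large.combine_chunks()
```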

@Fokko
Contributor

Fokko commented Feb 1, 2024

Hey @castedice, thanks for reaching out here.

The logic of creating a PyArrow dataframe is currently sub-optimal as you already mentioned. We're always on the lookout to make this more efficient, ideally by pushing things to Arrow itself instead of doing it in Python.

I'm very curious about your solution and I would love to see that PR! 👍

@castedice
Contributor Author

Thank you for your quick feedback. 👍
I'll submit a PR once I've adjusted the code to match the contribution guidelines.
It may take longer than I expected, as there were test failures in places I hadn't anticipated.

@Fokko
Contributor

Fokko commented Feb 5, 2024

@castedice Also, feel free to open a draft PR if you want some early feedback.

@castedice
Contributor Author

This was a tough week and I didn't have time to work on it.
I'll try to submit a PR tomorrow or the day after.
The way you support large_string in #382 is similar to how I was thinking of doing it.
Alternatively, I was considering replacing all binary types with large_binary.
That seems like a much simpler approach, although the pyarrow objects would then take a bit more memory (and, in terms of modifying the test code, it might be more work).
From a strict memory-management point of view it seems like a solution you might be reluctant to adopt; I'd be interested in your thoughts. @Fokko
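
As an illustration of that alternative (mapping binary/string types to their large_ counterparts), here is a small pyarrow-level sketch. It is a hedged example, not the code from #382 or the PR that eventually landed, and it only handles top-level fields; a complete version would also recurse into nested list/struct types.

```python
import pyarrow as pa

def _widen_field(field: pa.Field) -> pa.Field:
    """Swap 32-bit-offset types for their 64-bit-offset counterparts."""
    if pa.types.is_binary(field.type):
        return field.with_type(pa.large_binary())
    if pa.types.is_string(field.type):
        return field.with_type(pa.large_string())
    return field

def widen_schema(schema: pa.Schema) -> pa.Schema:
    """Return a schema where binary/string columns use large_binary/large_string."""
    return pa.schema([_widen_field(f) for f in schema], metadata=schema.metadata)

# Usage sketch: cast each per-file table to the widened schema *before* the pieces
# are concatenated / combined, so the 32-bit offsets never get a chance to overflow.
# widened = file_table.cast(widen_schema(file_table.schema))
```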

@Fokko
Contributor

Fokko commented Feb 16, 2024

Closing this one, since #409 has been merged.

Fokko closed this as completed Feb 16, 2024