Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arrow: Support Large Binary when using to_arrow #409

Merged
merged 2 commits into from
Feb 10, 2024
Merged

Conversation

castedice
Copy link
Contributor

This PR is to address an issue that prevented the to_arrow method from handling binaries larger than 2GB when used as mentioned in #344.

With this change in place, all binary types must be defined via pa.large_binary when defining a pyarrow schema.

I considered leaving it as pa.binary and casting it in pyiceberg, but since defining it as pa.large_binary is essential for pyarrow to handle 2GB of data, I implemented it this way.

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@castedice Thanks for raising this. I think this is fine since Polars does it as well:

python3
Python 3.11.7 (main, Dec  4 2023, 18:10:11) [Clang 15.0.0 (clang-1500.1.0.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import polars as pl
>>> df = pl.DataFrame(
...     {"foo": [1, 2, 3, 4, 5, 6], "bar": [b"a", b"b", b"c", b"d", b"e", b"f"]}
... )
>>> df.to_arrow()
pyarrow.Table
foo: int64
bar: large_binary
----
foo: [[1,2,3,4,5,6]]
bar: [[61,62,63,64,65,66]]

@Fokko Fokko added this to the PyIceberg 0.6.0 release milestone Feb 10, 2024
@Fokko Fokko merged commit a576fc9 into apache:main Feb 10, 2024
6 checks passed
@castedice
Copy link
Contributor Author

Thanks for review
This change will require a few changes to the documentation.
After checking the documentation, I'll create an additional PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants