Cast large_string to string right before saving to Parquet. #234

Merged
davidknight-seequent merged 8 commits into SeequentEvo:main from davidknight-seequent:fix-large-string-error
Mar 24, 2026

@davidknight-seequent
Contributor

@davidknight-seequent davidknight-seequent commented Mar 23, 2026

Description

Second attempt at getting rid of large_string.

  • Reverted previous changes meant to fix this problem.
  • Targeted functions that actually save dataframes to Parquet.

Checklist

  • I have read the contributing guide and the code of conduct

@davidknight-seequent davidknight-seequent requested a review from a team as a code owner March 23, 2026 01:46
@wordsworthc
Contributor

My understanding here is that the issue lies in the default handling of pandas StringDtype when converting a pandas dataframe to a pyarrow table. To avoid this area of code becoming a patchwork quilt, I think we should

  1. revert the original change that didn't fix the issue; then
  2. update the way we convert dataframes to tables
    a. df.convert_dtypes(dtype_backend="pyarrow") seems to be one way of handling this, but check for performance penalties first.

Optionally, add some documentation about this issue with pandas v3, because this seems like an easy trap to fall into if you're doing DIY conversion.

@davidknight-seequent davidknight-seequent requested a review from a team as a code owner March 23, 2026 02:27
@davidknight-seequent
Contributor Author

The convert_dtypes approach is about 2.3x slower than the approach implemented in this PR, because it copies the dataframe and re-infers every column's dtype in pandas before handing off to Arrow.

@davidknight-seequent davidknight-seequent merged commit 34ac615 into SeequentEvo:main Mar 24, 2026
88 checks passed
@davidknight-seequent davidknight-seequent deleted the fix-large-string-error branch March 24, 2026 02:43
3 participants