Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add dtype argument to str.decode #60940

Merged
merged 7 commits into from
Feb 18, 2025

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented Feb 16, 2025

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Ref: #60795 (comment)

PyArrow-backed strings cannot handle surrogates. When users have infer_string=True and PyArrow installed, they can end up with a failure they can't workaround when calling str.decode. Adding the dtype argument allows for a workaround.

@rhshadrach rhshadrach added Enhancement Strings String extension data type and string data labels Feb 16, 2025
@rhshadrach rhshadrach added this to the 2.3 milestone Feb 16, 2025
if (
dtype is not None
and not is_string_dtype(dtype)
and not is_object_dtype(dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, is_string_dtype currently (confusingly, sometimes, if you really only want StringDtype) also returns True for object dtype, so this final and not is_object_dtype is not strictly needed (although I find it useful for reading the code ;))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Maybe add a test with actual surrogates to make sure the dtype=object workaround works in that case?

@rhshadrach
Copy link
Member Author

Maybe add a test with actual surrogates to make sure the dtype=object workaround works in that case?

There is a surrogate in test_decode_object_dtype.

@WillAyd
Copy link
Member

WillAyd commented Feb 18, 2025

Have to think about this some more but my initial reaction is a -1. If someone needs non UTF-8 support I think we should push them to using the object dtype, rather than extending the API like this

@jorisvandenbossche
Copy link
Member

Maybe add a test with actual surrogates to make sure the dtype=object workaround works in that case?

There is a surrogate in test_decode_object_dtype.

Sorry, missed that!

If someone needs non UTF-8 support I think we should push them to using the object dtype, rather than extending the API like this

That's exactly what this PR is intending to enable? With this keyword, people can choose to use object dtype explicitly (without it, we always use str dtype, which then fails)

@WillAyd
Copy link
Member

WillAyd commented Feb 18, 2025

With this keyword, people can choose to use object dtype explicitly (without it, we always use str dtype, which then fails)

Ah sorry I am getting tripped up over terminology and the existing API. I'm guessing this is a Python2 relic that we even offer str.decode, since there is no string method in Python3 for decode.

So makes sense, I just want to really limit the use of keywords to control data types because as they can be hard to reason about. I don't think it can be avoided specifically for this method though

@mroeschke mroeschke merged commit 63249f2 into pandas-dev:main Feb 18, 2025
42 checks passed
@mroeschke
Copy link
Member

Thanks @rhshadrach

Copy link

lumberbot-app bot commented Feb 18, 2025

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
git checkout 2.3.x
git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
git cherry-pick -x -m1 63249f2aa95ef0b0300ea2f1cc68200cc8b13484
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
git commit -am 'Backport PR #60940: ENH: Add dtype argument to str.decode'
  1. Push to a named branch:
git push YOURFORK 2.3.x:auto-backport-of-pr-60940-on-2.3.x
  1. Create a PR against branch 2.3.x, I would have named this PR:

"Backport PR #60940 on branch 2.3.x (ENH: Add dtype argument to str.decode)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

@rhshadrach
Copy link
Member Author

I'm guessing this is a Python2 relic that we even offer str.decode, since there is no string method in Python3 for decode.

It's bytes.decode, I think we just don't want to add a bytes namespace so we stashed it under str.

@WillAyd
Copy link
Member

WillAyd commented Feb 18, 2025

Back in Python2 the str type had a decode method, as it was closer to an array of bytes than the unicode object type that it currently is in Python3. The accessor being under that namespace today is a bit of cruft owing back to that legacy.

Certainly not something to fix here, but its an interesting API nonetheless

@WillAyd
Copy link
Member

WillAyd commented Feb 19, 2025

Actually giving this a little more thought, instead of adding a dtype argument what if we coerce to object when the encoding is anything but utf-8 and just use the standard string inference logic when it is utf-8?

@rhshadrach
Copy link
Member Author

I'm negative on adding values-specific behavior.

rhshadrach added a commit to rhshadrach/pandas that referenced this pull request Feb 19, 2025
mroeschke pushed a commit that referenced this pull request Feb 20, 2025
* ENH: Improved error message and raise new error for small-string NaN edge case in HDFStore.append (#60829)

* Add clearer error messages for datatype mismatch in HDFStore.append. Raise ValueError when nan_rep too large for pytable column. Add and modify applicable test code.

* Fix missed tests and correct mistake in error message.

* Remove excess comments. Reverse error type change to avoid api changes. Move nan_rep tests into separate function.

(cherry picked from commit 57340ec)

* TST(string dtype): Resolve xfails in pytables (#60795)

(cherry picked from commit 4511251)

* BUG(string dtype): Resolve pytables xfail when reading with condition (#60943)

(cherry picked from commit 0ec5f26)

* Backport PR #60940: ENH: Add dtype argument to str.decode

---------

Co-authored-by: Jake Thomas Trevallion <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Still Needs Manual Backport Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants