ENH: Add dtype argument to str.decode #60940

rhshadrach · 2025-02-16T12:55:33Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

PyArrow-backed strings cannot handle surrogates. When users have infer_string=True and PyArrow installed, they can hit a failure in str.decode that they have no way to work around. Adding the dtype argument provides a workaround.
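
For readers unfamiliar with the surrogate problem, here is a minimal stdlib-only sketch (not taken from the PR). It assumes the underlying issue is that a lone surrogate such as U+D800 is not valid strict UTF-8, so any container that insists on valid UTF-8 has to reject it. The byte string below is an illustrative value, not one from the test suite.

```python
# b"\xed\xa0\x80" is what the UTF-8 codec produces for the lone surrogate
# U+D800 when the "surrogatepass" error handler is used.
raw = b"\xed\xa0\x80"
text = raw.decode("utf-8", errors="surrogatepass")
print(repr(text))  # '\ud800'

# Strict UTF-8 refuses to encode a lone surrogate. A UTF-8-only string
# container has to enforce the same rule.
try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    strict_error = exc
else:
    strict_error = None

# With "surrogatepass" the round trip works, which is why a plain Python
# str (and therefore object dtype) can hold the value.
round_trip = text.encode("utf-8", errors="surrogatepass")
print(round_trip == raw)  # True
```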

jorisvandenbossche · 2025-02-17T08:55:25Z

pandas/core/strings/accessor.py

+        if (
+            dtype is not None
+            and not is_string_dtype(dtype)
+            and not is_object_dtype(dtype)


FWIW, is_string_dtype currently also returns True for object dtype (which is sometimes confusing if you really only want StringDtype), so this final "and not is_object_dtype" is not strictly needed (although I find it useful for reading the code ;))
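
A small sketch of the point above, assuming pandas is installed. The helper name check_decode_dtype and its error message are hypothetical and only mirror the shape of the condition in the diff; they are not the code in accessor.py.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def check_decode_dtype(dtype):
    """Hypothetical helper: accept None, a string dtype, or object dtype."""
    if (
        dtype is not None
        and not is_string_dtype(dtype)
        and not is_object_dtype(dtype)  # redundant, kept for readability
    ):
        raise ValueError(f"dtype must be string or object, got {dtype!r}")
    return dtype


# is_string_dtype already treats the object dtype as a string dtype.
object_counts_as_string = is_string_dtype(object)

check_decode_dtype(None)
check_decode_dtype(object)
check_decode_dtype(pd.StringDtype())

try:
    check_decode_dtype("int64")
except ValueError:
    rejected_int64 = True
else:
    rejected_int64 = False
```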

jorisvandenbossche

Looks good!

Maybe add a test with actual surrogates to make sure the dtype=object workaround works in that case?

rhshadrach · 2025-02-17T21:17:29Z

> Maybe add a test with actual surrogates to make sure the dtype=object workaround works in that case?

There is a surrogate in test_decode_object_dtype.

WillAyd · 2025-02-18T01:18:23Z

Have to think about this some more but my initial reaction is a -1. If someone needs non UTF-8 support I think we should push them to using the object dtype, rather than extending the API like this

jorisvandenbossche · 2025-02-18T08:24:22Z

> > Maybe add a test with actual surrogates to make sure the dtype=object workaround works in that case?
>
> There is a surrogate in test_decode_object_dtype.

Sorry, missed that!

> If someone needs non UTF-8 support I think we should push them to using the object dtype, rather than extending the API like this

That's exactly what this PR is intending to enable? With this keyword, people can choose to use object dtype explicitly (without it, we always use str dtype, which then fails)
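
A hedged usage sketch of what the keyword enables, assuming pandas is installed. The dtype keyword only exists in pandas versions that include this PR, so the snippet checks the signature first and otherwise falls back to the plain call. The sample data is illustrative.

```python
import inspect

import pandas as pd

# Bytes data where the second element decodes to a lone surrogate (U+D800).
ser = pd.Series([b"abc", b"\xed\xa0\x80"], dtype=object)

# Assumption: only versions containing this PR accept a dtype keyword.
has_dtype = "dtype" in inspect.signature(ser.str.decode).parameters
kwargs = {"dtype": object} if has_dtype else {}

# Explicitly asking for object dtype keeps the result as Python str objects,
# so the surrogate never has to pass through a UTF-8-only string array.
decoded = ser.str.decode("utf-8", errors="surrogatepass", **kwargs)
print(decoded.dtype)
```

Without the keyword and with infer_string=True, the result would be the str dtype, which is the case this PR reports as failing.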

WillAyd · 2025-02-18T15:06:28Z

> With this keyword, people can choose to use object dtype explicitly (without it, we always use str dtype, which then fails)

Ah sorry I am getting tripped up over terminology and the existing API. I'm guessing this is a Python2 relic that we even offer str.decode, since there is no string method in Python3 for decode.

So this makes sense. I just want to really limit the use of keywords to control data types, because they can be hard to reason about. I don't think it can be avoided for this method specifically, though.

mroeschke · 2025-02-18T17:39:56Z

Thanks @rhshadrach

lumberbot-app · 2025-02-18T17:40:19Z

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

Checkout backport branch and update it.

git checkout 2.3.x
git pull

Cherry-pick the first parent branch of this PR on top of the older branch:

git cherry-pick -x -m1 63249f2aa95ef0b0300ea2f1cc68200cc8b13484

You will likely have some merge/cherry-pick conflicts here; fix them and commit:

git commit -am 'Backport PR #60940: ENH: Add dtype argument to str.decode'

Push to a named branch:

git push YOURFORK 2.3.x:auto-backport-of-pr-60940-on-2.3.x

Create a PR against branch 2.3.x, I would have named this PR:

"Backport PR #60940 on branch 2.3.x (ENH: Add dtype argument to str.decode)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

rhshadrach · 2025-02-18T21:10:05Z

> I'm guessing this is a Python2 relic that we even offer str.decode, since there is no string method in Python3 for decode.

It's bytes.decode, I think we just don't want to add a bytes namespace so we stashed it under str.

WillAyd · 2025-02-18T21:23:51Z

Back in Python2 the str type had a decode method, as it was closer to an array of bytes than the unicode object type that it currently is in Python3. The accessor being under that namespace today is a bit of cruft dating back to that legacy.

Certainly not something to fix here, but it's an interesting API nonetheless
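
For reference, a quick stdlib check of where decode lives in Python 3 (a sketch for illustration, not part of the PR):

```python
# In Python 3, decode is a bytes method and encode is a str method.
bytes_has_decode = hasattr(bytes, "decode")
str_has_decode = hasattr(str, "decode")
str_has_encode = hasattr(str, "encode")

print(bytes_has_decode)  # True
print(str_has_decode)    # False
print(str_has_encode)    # True

# Decoding goes from bytes to str.
word = b"caf\xc3\xa9".decode("utf-8")
print(word)  # café
```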

WillAyd · 2025-02-19T15:53:06Z

Actually, giving this a little more thought: instead of adding a dtype argument, what if we coerce to object when the encoding is anything but utf-8, and just use the standard string inference logic when it is utf-8?

rhshadrach · 2025-02-19T21:22:35Z

I'm negative on adding values-specific behavior.

* ENH: Improved error message and raise new error for small-string NaN edge case in HDFStore.append (#60829)
* Add clearer error messages for datatype mismatch in HDFStore.append. Raise ValueError when nan_rep too large for pytable column. Add and modify applicable test code.
* Fix missed tests and correct mistake in error message.
* Remove excess comments. Reverse error type change to avoid api changes. Move nan_rep tests into separate function.
(cherry picked from commit 57340ec)
* TST(string dtype): Resolve xfails in pytables (#60795)
(cherry picked from commit 4511251)
* BUG(string dtype): Resolve pytables xfail when reading with condition (#60943)
(cherry picked from commit 0ec5f26)
* Backport PR #60940: ENH: Add dtype argument to str.decode

---------

Co-authored-by: Jake Thomas Trevallion <[email protected]>

…cked strings (#60984)

* ENH: Improved error message and raise new error for small-string NaN edge case in HDFStore.append (#60829)
* Add clearer error messages for datatype mismatch in HDFStore.append. Raise ValueError when nan_rep too large for pytable column. Add and modify applicable test code.
* Fix missed tests and correct mistake in error message.
* Remove excess comments. Reverse error type change to avoid api changes. Move nan_rep tests into separate function.
(cherry picked from commit 57340ec)
* TST(string dtype): Resolve xfails in pytables (#60795)
(cherry picked from commit 4511251)
* BUG(string dtype): Resolve pytables xfail when reading with condition (#60943)
(cherry picked from commit 0ec5f26)
* Backport PR #60940: ENH: Add dtype argument to str.decode
* Backport PR #60938: ENH(string dtype): Implement cumsum for Python-backed strings

---------

Co-authored-by: Jake Thomas Trevallion <[email protected]>

jorisvandenbossche · 2025-06-11T13:51:22Z

Backported in #60968

ENH: Add dtype argument to str.decode

e19455d

rhshadrach added the Enhancement and Strings (String extension data type and string data) labels Feb 16, 2025

rhshadrach added this to the 2.3 milestone Feb 16, 2025

rhshadrach added 4 commits February 16, 2025 07:56

Refinements

d37469f

cleanup

797f99c

cleanup

6cd5f02

type-hint fixup

5a836bb

rhshadrach requested review from WillAyd and jorisvandenbossche February 16, 2025 14:00

jorisvandenbossche reviewed Feb 17, 2025

jorisvandenbossche approved these changes Feb 17, 2025

rhshadrach and others added 2 commits February 17, 2025 16:18

Simplify condition

ee2d377

lint

91d6be3

WillAyd approved these changes Feb 18, 2025

mroeschke approved these changes Feb 18, 2025

mroeschke merged commit 63249f2 into pandas-dev:main Feb 18, 2025
42 checks passed

lumberbot-app bot added the Still Needs Manual Backport label Feb 18, 2025

rhshadrach added a commit to rhshadrach/pandas that referenced this pull request Feb 19, 2025

Backport PR pandas-dev#60940: ENH: Add dtype argument to str.decode

3d5c84b

rhshadrach mentioned this pull request Feb 23, 2025

ENH(string dtype): fallback for HDF5 with UTF-8 surrogates #60993

Merged

5 tasks

jorisvandenbossche removed the Still Needs Manual Backport label Jun 11, 2025
