ENH: Add dtype argument to str.decode #60940
pandas/core/strings/accessor.py
if (
    dtype is not None
    and not is_string_dtype(dtype)
    and not is_object_dtype(dtype)
FWIW, is_string_dtype currently also returns True for object dtype (which is confusing if you really only want StringDtype), so this final "and not is_object_dtype" check is not strictly needed (although I find it useful for reading the code ;))
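The overlap described in this comment can be checked directly. The snippet below is a minimal sketch, not the PR's code: the helper name `is_allowed_decode_dtype` is hypothetical and only mirrors the shape of the condition under review, returning a bool instead of raising.

```python
from pandas.api.types import is_object_dtype, is_string_dtype


def is_allowed_decode_dtype(dtype) -> bool:
    """Hypothetical helper mirroring the reviewed condition.

    None means "use the default"; otherwise the dtype must be a
    string dtype or object dtype.
    """
    if dtype is None:
        return True
    # is_string_dtype already returns True for object dtype, so the
    # is_object_dtype check is redundant but documents the intent.
    return is_string_dtype(dtype) or is_object_dtype(dtype)


object_counts_as_string = is_string_dtype(object)
int_allowed = is_allowed_decode_dtype("int64")
object_allowed = is_allowed_decode_dtype(object)
```

Here `object_counts_as_string` is True, which is why the explicit object check only serves readability.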
Thanks!
Looks good!
Maybe add a test with actual surrogates to make sure the dtype=object workaround works in that case?
There is a surrogate in |
Have to think about this some more, but my initial reaction is a -1. If someone needs non-UTF-8 support, I think we should push them to use the object dtype rather than extending the API like this.
Sorry, missed that!
That's exactly what this PR is intending to enable? With this keyword, people can choose to use object dtype explicitly (without it, we always use str dtype, which then fails).
Ah sorry, I am getting tripped up over terminology and the existing API. I'm guessing this is a Python 2 relic that we even offer So makes sense, I just want to really limit the use of keywords to control data types because they can be hard to reason about. I don't think it can be avoided specifically for this method though.
Thanks @rhshadrach
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the If these instructions are inaccurate, feel free to suggest an improvement. |
It's bytes.decode, I think we just don't want to add a |
Back in Python 2 the str type had a decode method, as it was closer to an array of bytes than the unicode object type that it currently is in Python 3. The accessor being under that namespace today is a bit of cruft dating back to that legacy. Certainly not something to fix here, but it's an interesting API nonetheless.
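For readers less familiar with that history, the Python 3 split can be illustrated with the standard library alone. This is a small illustrative sketch; the sample text is arbitrary.

```python
# In Python 3, decoding is a bytes operation: bytes -> str.
payload = "café".encode("utf-8")
text = payload.decode("utf-8")

# The Python 3 str type no longer has a decode method; only bytes does.
str_has_decode = hasattr(str, "decode")
bytes_has_decode = hasattr(bytes, "decode")
```

On Python 3, `str_has_decode` is False and `bytes_has_decode` is True, which is why a decode method under the `.str` accessor reads as a leftover from the Python 2 naming.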
Actually giving this a little more thought, instead of adding a |
I'm negative on adding values-specific behavior.
* ENH: Improved error message and raise new error for small-string NaN edge case in HDFStore.append (#60829)
  * Add clearer error messages for datatype mismatch in HDFStore.append. Raise ValueError when nan_rep too large for pytable column. Add and modify applicable test code.
  * Fix missed tests and correct mistake in error message.
  * Remove excess comments. Reverse error type change to avoid api changes. Move nan_rep tests into separate function.
  (cherry picked from commit 57340ec)
* TST(string dtype): Resolve xfails in pytables (#60795)
  (cherry picked from commit 4511251)
* BUG(string dtype): Resolve pytables xfail when reading with condition (#60943)
  (cherry picked from commit 0ec5f26)
* Backport PR #60940: ENH: Add dtype argument to str.decode
---------
Co-authored-by: Jake Thomas Trevallion <[email protected]>
Ref: #60795 (comment)
PyArrow-backed strings cannot handle surrogates. When users have infer_string=True and PyArrow installed, they can end up with a failure they can't work around when calling str.decode. Adding the dtype argument allows for a workaround.
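The underlying problem can be reproduced without pandas. The sketch below uses only the standard library and assumes the surrogate arrives through the `surrogatepass` error handler, which is one way to get one and not necessarily the case exercised by the PR's tests. The helper `can_store_as_utf8` is hypothetical and stands in for the constraint that a UTF-8 backed string array places on its values.

```python
def can_store_as_utf8(text: str) -> bool:
    """Hypothetical check: can this str be encoded as strict UTF-8?"""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# Byte sequence for the lone surrogate U+D800. Strict UTF-8 decoding
# rejects it, but the "surrogatepass" handler lets it through.
raw = b"\xed\xa0\x80"
decoded = raw.decode("utf-8", errors="surrogatepass")

# The result is a legal Python str, yet it cannot be re-encoded as
# strict UTF-8, so only an object-dtype container can hold it.
surrogate_ok = can_store_as_utf8(decoded)
plain_ok = can_store_as_utf8("abc")
```

With the change described above, the pandas-level workaround would look like `ser.str.decode("utf-8", errors="surrogatepass", dtype=object)` on a Series of bytes, assuming a pandas version that includes the new dtype argument.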