Fix use-after-free in the unicode-escape decoder with error handler #129648

serhiy-storchaka · 2025-02-04T14:12:43Z

If the error handler is used, a new bytes object is created to set as the object attribute of UnicodeDecodeError, and that bytes object then replaces the original data. A pointer into the decoded data becomes invalid once that temporary bytes object is destroyed. So we need another way to return the first invalid escape from _PyUnicode_DecodeUnicodeEscapeInternal().

_PyBytes_DecodeEscape() does not have this issue, because it does not use the error handler registry, but it should be changed for compatibility with _PyUnicode_DecodeUnicodeEscapeInternal().
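
To illustrate the hazard described above, here is a minimal standalone sketch in plain C. It is not the CPython code: `input_buf`, `replace_input()` and `decode_and_report()` are hypothetical names, and the sketch simply treats the character after the first backslash as the one to report. The point is only that a value copied out before the buffer is replaced stays valid, whereas a pointer into the old buffer would dangle.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the decoder's input, which an error handler
 * may replace with a freshly allocated buffer. */
typedef struct {
    char *data;
    size_t len;
} input_buf;

/* Simulates an error handler swapping the input for a new object:
 * the old storage is freed, so any pointer into it becomes dangling.
 * Returns 0 on success, -1 if allocation fails. */
static int
replace_input(input_buf *in, const char *replacement)
{
    size_t n = strlen(replacement);
    char *fresh = malloc(n + 1);
    if (fresh == NULL) {
        return -1;
    }
    memcpy(fresh, replacement, n + 1);
    free(in->data);
    in->data = fresh;
    in->len = n;
    return 0;
}

/* Copies the character after the first backslash BEFORE the input is
 * replaced, and returns it by value (-1 if there is no such character
 * or if allocation fails). Nothing returned points into freed memory. */
static int
decode_and_report(input_buf *in)
{
    int first = -1;
    for (size_t i = 0; i + 1 < in->len; i++) {
        if (in->data[i] == '\\') {
            first = (unsigned char)in->data[i + 1];
            break;
        }
    }
    if (replace_input(in, "replaced") < 0) {
        return -1;
    }
    return first;
}
```

Had `decode_and_report()` returned `&in->data[i + 1]` instead, the caller would be reading freed memory after `replace_input()` ran, which is the same shape of bug as the one this PR addresses.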

gpshead · 2025-02-04T21:55:53Z

Nice! This is similar enough to, but clearly far more polished than, what I quickly whipped up while trying to understand the problem and linked to on the PSRT mailing list, so I won't bother posting my own draft PR.

I don't have a good feel for whether we need to retain the older internal-use-only C APIs. Doing this change via new functions with a suffix, as you seem to be proposing, and leaving the old ones in place (though now unused by our own internals) in case something else references them makes sense to me.

serhiy-storchaka · 2025-02-04T22:47:45Z

I experimented with several different solutions. One of them was similar to yours, except that I copied all three bytes. It was also necessary to distinguish "no invalid escape" from "escaped null byte". In the end, the currently proposed solution is the simplest.

This PR does not leave the old C API in place. I do not think that it is needed. The functions are renamed because an error at link time is preferable to undefined behavior at run time.
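
The need to distinguish "no invalid escape" from "escaped null byte" can be sketched with an `int` result that uses -1 as the "none" sentinel. This is an illustrative standalone function, not the signature used in the PR: the name `first_invalid_escape_char()` and the list of recognized escape characters are assumptions made for the example, and a trailing lone backslash is simply ignored.

```c
#include <stddef.h>
#include <string.h>

/* Returns the character following the first unrecognized backslash
 * escape in s[0..len), as a value in 0..255, or -1 if every escape is
 * recognized. Returning int keeps 0 (backslash followed by a NUL byte)
 * distinct from "nothing found", which a plain char result could not. */
static int
first_invalid_escape_char(const char *s, size_t len)
{
    /* Illustrative set of recognized escape characters (an assumption). */
    static const char known[] = "\\'\"abfnrtvxuUN01234567\n";

    for (size_t i = 0; i + 1 < len; i++) {
        if (s[i] != '\\') {
            continue;
        }
        unsigned char c = (unsigned char)s[i + 1];
        /* strchr() also matches the terminator, so test c == 0 first. */
        if (c == 0 || strchr(known, c) == NULL) {
            return c;
        }
        i++;  /* skip the escaped character */
    }
    return -1;
}
```

With this shape, a caller checks `result >= 0` to decide whether to warn, and the value 0 remains a legitimate "invalid escape" report.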

gpshead · 2025-02-04T23:56:13Z

Parser/string_parser.c

                    first_invalid_escape, first_invalid_escape);
            }
            else {
                RAISE_SYNTAX_ERROR(
                    "\"\\%c\" is an invalid escape sequence. "
                    "Did you mean \"\\\\%c\"? A raw string is also an option.",
-                    c, c);
+                    first_invalid_escape, first_invalid_escape);


Double-checking: in C, does passing an int when formatting with %c always work? Is a single char parameter expected to come from the least significant byte of an int-sized slot, per C calling conventions? Casting these down to (unsigned char) might be a good idea for clarity even if it is defined behavior.
Same comment for the other files where this is done.

serhiy-storchaka · 2025-02-05

The expected argument type for %c in printf() is int. The expected argument type for %c in PyUnicode_FromFormat() (which is used here) is also int.
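
As a small standalone illustration of the printf() half of that statement (the helper name `format_escape()` is hypothetical and nothing here calls into CPython): in standard C, a char passed through a variadic argument list is promoted to int, and %c converts the int argument to unsigned char before writing it, so passing an int directly is the expected usage.

```c
#include <stdio.h>

/* Writes a backslash followed by the character c into buf, using "%c"
 * with an int argument. Returns snprintf()'s result, which is the
 * number of characters that the full output requires. */
static int
format_escape(char *buf, size_t size, int c)
{
    return snprintf(buf, size, "\\%c", c);
}
```

Calling `format_escape(buf, sizeof buf, ch)` with a `char ch` behaves the same as passing an int, because the conversion to int happens at the call.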

serhiy-storchaka requested review from Yhg1s, ericvsmith and sethmlarson February 4, 2025 14:12

gpshead reviewed Feb 4, 2025

serhiy-storchaka Feb 5, 2025
