Docs: Encourage SvPVbyte and SvPVutf8 over SvPV. #18587

FGasper · 2021-02-17T02:00:17Z

Per an (admittedly short) p5p discussion.

I don’t know if this is appropriate in Perl’s docs, but I’ve found utf8::downgrade a useful workaround for XS code that uses SvPV.

pod/perlguts.pod

FGasper · 2021-02-20T02:43:21Z

Incidentally, YAML::Syck has the problem caused by using SvPV rather than SvPVbyte/SvPVutf8, too: cpan-authors/YAML-Syck#60

tonycoz · 2021-03-18T00:08:29Z

I do think the distinction between UTF-8 flagged and and non-flagged SVs could be integrated into the other discussion of SVs in perlguts. Right now we have general discussion of SVs, and then a separate discussion of Unicode further down. The only mention of UTF8 is the example discussing direct buffer access added more years ago than I want to believe.

FGasper · 2021-03-18T02:52:31Z

I do think the distinction between UTF-8 flagged and and non-flagged SVs could be integrated into the other discussion of SVs in perlguts. Right now we have general discussion of SVs, and then a separate discussion of Unicode further down. The only mention of UTF8 is the example discussing direct buffer access added more years ago than I want to believe.

Clarifying the imperative for SvPV callers to check the string’s encoding would definitely be an improvement.

All the same, the following seem apparent:

SvPV leaks the internal string-encoding abstraction—or, at least, makes the caller go out of their way to preserve the abstraction. (Same problem as bytes.pm, but in C.)
SvPVbyte and SvPVutf8 preserve that abstraction.
All things being equal, it’s better for integrators to prefer “safe”, abstraction-preserving APIs rather than “dangerous”, abstraction-leaking ones.

… in light of which, wouldn’t some manner of “deprioritization” of SvPV (et al.) in the documentation be apropos?

FWIW, personally I’d like to see SvPV_INTERNAL be that macro’s canonical name, with a note in documentation about the “historical” SvPV alias.

pod/perlguts.pod

FGasper · 2021-03-22T11:06:28Z

I realized that SvPVutf8 has its own problem since this is Perl’s “lax” UTF-8, not actual valid UTF-8.

So I added:

commit 4ca94ddf03397718a9cc6958ef8956e99dfba4a7
Author: Felipe Gasper <[email protected]>
Date:   Mon Mar 22 07:04:07 2021 -0400

    Add caveat about SvPVutf8 and UTF-8 validity.

diff --git a/pod/perlguts.pod b/pod/perlguts.pod
index c201be0a6f..a5e1a988e5 100644
--- a/pod/perlguts.pod
+++ b/pod/perlguts.pod
@@ -171,10 +171,19 @@ or string:
 
 If the Perl string is C<"\xff\xff">, then this returns a 2-byte C<char*>.
 
+This is suitable for Perl strings that represent bytes.
+
 =item * UTF-8 string: C<SvPVutf8(SV*, STRLEN len)> or C<SvPVutf8_nolen(SV*)>
 
 If the Perl string is C<"\xff\xff">, then this returns a 4-byte C<char*>.
 
+This is suitable for Perl strings that represent characters.
+
+B<CAVEAT>: That C<char*> will be encoded via Perl’s internal "lax" UTF-8
+variant, which accepts various things that the official UTF-8 standard
+forbids. See L<perlapi/is_strict_utf8_string> for some methods Perl gives
+you to ensure that what these macros give you is I<actually> valid UTF-8.
+
 =item * You can also use C<SvPV(SV*, STRLEN len)> or C<SvPV_nolen(SV*)>
 to fetch the SV's raw internal buffer. This is tricky, though; if your Perl
 string
@@ -192,6 +201,9 @@ similarly-named macros I<without> looking up the SV's UTF8 bit is
 a likely source of character-encoding bugs, unless the string in question
 is always fully UTF-8 invariant.
 
+When the UTF8 bit is on, the same B<CAVEAT> about UTF-8 validity applies
+here as for C<SvPVutf8>.
+
 =back
 
 (See L</How do I pass a Perl string to a C library?> for more details.)

tonycoz · 2021-03-24T05:05:25Z

+B<CAVEAT>: That C<char*> will be encoded via Perl’s internal "lax" UTF-8
+variant, which accepts various things that the official UTF-8 standard
+forbids. See L<perlapi/is_strict_utf8_string> for some methods Perl gives
+you to ensure that what these macros give you is I<actually> valid UTF-8.

The problem with calling it "lax" is it doesn't really explain anything. I think it would be better to say if you're input SV has non-Unicode code points, then the result may contain extensions over valid UTF-8.

FGasper · 2021-03-24T12:39:04Z

+B<CAVEAT>: That C<char*> will be encoded via Perl’s internal "lax" UTF-8
+variant, which accepts various things that the official UTF-8 standard
+forbids. See L<perlapi/is_strict_utf8_string> for some methods Perl gives
+you to ensure that what these macros give you is I<actually> valid UTF-8.
The problem with calling it "lax" is it doesn't really explain anything. I think it would be better to say if you're input SV has non-Unicode code points, then the result may contain extensions over valid UTF-8.

“lax” is what Larry called it, but that change makes sense. I’ll amend.

FGasper · 2021-03-28T21:28:34Z

I realized that SvPV’s description in perlapi was unchanged. So to the section that describes the different SvPV* macros in sv.h I added:

+The forms with neither C<byte> nor C<utf8> in their names (e.g., C<SvPV> or
+C<SvPV_nolen>) can expose the SV's internal string buffer. If
+that buffer consists entirely of bytes 0-255 and includes any bytes above
+127, then you B<MUST> consult C<SvUTF8> to determine the actual code points
+the string is meant to contain. Generally speaking, it is probably safer to
+prefer C<SvPVbyte>, C<SvPVutf8>, and the like. See
+L<perlguts/How do I pass a Perl string to a C library?> for more details.

salva

I have made some comments, but all of them are minor issues.

As it is right now this patch is already a big improvement over what we have.

salva · 2021-03-31T16:39:30Z

sv.h

-Returns the length of the string which is in the SV.  See C<L</SvLEN>>.
+Returns the length, in bytes, of the C string which is in the SV.
+Note that this may not match Perl's C<length>; for that, use
+C<SvUTF8(sv) ? sv_len_utf8(sv) : sv_len(sv)>.


There is really no need to check the value of SvUTF8(sv), sv_len_utf8 already does it internally.

Thank you for pointing this out! I’m revising.

salva · 2021-03-31T16:46:49Z

sv.h

+C<SvPV_nolen>) can expose the SV's internal string buffer. If
+that buffer consists entirely of bytes 0-255 and includes any bytes above
+127, then you B<MUST> consult C<SvUTF8> to determine the actual code points
+the string is meant to contain. Generally speaking, it is probably safer to


Generally speaking, it is probably safer

Is that too mild? IMO the usage of SvPV should be clearly discouraged, more on the line of...

Don't use SvPV unless you know what you are doing.

I agree, but I’m not sure that’s generally accepted. For now I just want to improve on status quo.

I'm OK with the current new wording too

salva · 2021-03-31T16:51:47Z

pod/perlguts.pod

+C<SvPVbyte> if your C library expects byte strings, or C<SvPVutf8>
+if it expects UTF-8.
+
+If your C library happens to support both encodings, then of course


That would happen very seldom and there are several gotchas associated with doing that.

I think it would be better to just remove this paragraph or at least the "is preferable" part.

Revised to:

+If your C library happens to support both encodings, then C<SvPV>--always +in tandem with lookups to C<SvUTF8>!--may be safe and (slightly) more +efficient.

FGasper · 2021-04-03T12:10:47Z

@Grinnz @Leont @khwilliamson Do you see any outstanding issues as of now with this PR?

Thank you!

Grinnz · 2021-04-03T16:34:03Z

LGTM

khwilliamson · 2021-04-05T15:23:15Z

pod/perlguts.pod

+This is suitable for Perl strings that represent characters.
+
+B<CAVEAT>: That C<char*> will be encoded via Perl's internal UTF-8 variant,
+which means that if the SV contains non-Unicode code points (e.g., U+FFFF),


U+FFFF is a non-character Unicode code point, so this example is wrong. I suggest 0x110000. 'U+' is strictly wrong here, which is why I used '0x'

In documentation I've written recently, I've tried to be careful to use U+ only for the 0-10FFFF range

I'm trying to see that fix. I reloaded this PR and it says the commit was in March

I haven’t pushed; I was waiting for your thoughts on #18587 (comment)

But I‘ll push up now.

khwilliamson · 2021-04-05T15:28:03Z

pod/perlguts.pod

+
+Some C libraries may expect other encodings (e.g., UTF-16LE). To give
+Perl strings to such libraries
+you must either do that encoding in Perl then use C<SvPVbyte>, or


Add comma before 'then'

That seems like it would potentially confuse; the separation that the single comma creates between the two options would be weakened by having another comma.

What about:

a) Reformat it as an ordered list (=item a), =item b))

b) Replace the comma before or with a semicolon

c) Just add the extra comma, if you think it’s best that way.

?

@khwilliamson ^^ Question for you, when you have a chance?

khwilliamson · 2021-04-05T15:29:49Z

pod/perlguts.pod

+B<TESTING> B<TIP:> Use L<utf8>'s C<upgrade> and C<downgrade> functions
+in your tests to ensure consistent handling regardless of Perl's
+internal encoding.
+


I really like this section

khwilliamson · 2021-04-05T15:31:59Z

sv.h

@@ -855,8 +857,8 @@ Set the value of the MAGIC pointer in C<sv> to val.  See C<L</SvIV_set>>.
 Set the value of the STASH pointer in C<sv> to val.  See C<L</SvIV_set>>.

 =for apidoc Am|void|SvCUR_set|SV* sv|STRLEN len
-Set the current length of the string which is in the SV.  See C<L</SvCUR>>
-and C<SvIV_set>>.
+Sets the current length, in bytes, of the C string which is in the SV.


I don't understands why this has to be a C string. The SV may contain embedded NULs, and those are counted as bytes.

Perhaps an explicit mention that the string may contain NULs and they are counted.

Yeah I meant “C string” as distinct from “Perl string”, i.e., what sv_len_utf8 does. When I looked at it I wasn’t sure if it referred to Perl length() or the number of bytes in the PV.

Maybe:

Sets the current length of the PV inside the SV.

?

Sets the current length, in bytes, of the PV inside the SV

would be better IMO

khwilliamson · 2021-04-05T15:33:08Z

sv.h

+C<SvPV_nolen>) can expose the SV's internal string buffer. If
+that buffer consists entirely of bytes 0-255 and includes any bytes above
+127, then you B<MUST> consult C<SvUTF8> to determine the actual code points
+the string is meant to contain. Generally speaking, it is probably safer to


I'm OK with the current new wording too

perlguts, perlxs, perlxstut, and perlapi. Issue Perl#18600

Leont reviewed Feb 17, 2021

View reviewed changes

pod/perlguts.pod Outdated Show resolved Hide resolved

FGasper mentioned this pull request Feb 21, 2021

[doc] Is SvPV properly documented? #18600

Closed

FGasper force-pushed the prefer_encoding_specific_svpv_variants branch 2 times, most recently from e133ac9 to db2627f Compare March 21, 2021 19:50

Leont reviewed Mar 21, 2021

View reviewed changes

pod/perlguts.pod Show resolved Hide resolved

FGasper force-pushed the prefer_encoding_specific_svpv_variants branch 2 times, most recently from b83b19e to 437a6ef Compare March 22, 2021 11:04

FGasper force-pushed the prefer_encoding_specific_svpv_variants branch 3 times, most recently from 8fe0a1b to 8a126a4 Compare March 27, 2021 11:21

FGasper marked this pull request as ready for review March 27, 2021 11:46

FGasper force-pushed the prefer_encoding_specific_svpv_variants branch from 8a126a4 to 01ae805 Compare March 28, 2021 21:24

salva approved these changes Mar 31, 2021

View reviewed changes

FGasper force-pushed the prefer_encoding_specific_svpv_variants branch 2 times, most recently from 3472d76 to 92d6369 Compare April 1, 2021 00:50

khwilliamson requested changes Apr 5, 2021

View reviewed changes

Docs: Emphasize SvPVbyte and SvPVutf8 over SvPV. This updates

0ea2594

perlguts, perlxs, perlxstut, and perlapi. Issue Perl#18600

FGasper force-pushed the prefer_encoding_specific_svpv_variants branch from 92d6369 to 0ea2594 Compare April 9, 2021 03:34

khwilliamson merged commit 3c3f883 into Perl:blead Apr 14, 2021

Docs: Encourage SvPVbyte and SvPVutf8 over SvPV. #18587

Docs: Encourage SvPVbyte and SvPVutf8 over SvPV. #18587

Uh oh!

Conversation

FGasper commented Feb 17, 2021

Uh oh!

Uh oh!

FGasper commented Feb 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tonycoz commented Mar 18, 2021

Uh oh!

FGasper commented Mar 18, 2021

Uh oh!

Uh oh!

FGasper commented Mar 22, 2021

Uh oh!

tonycoz commented Mar 24, 2021

Uh oh!

FGasper commented Mar 24, 2021

Uh oh!

FGasper commented Mar 28, 2021

Uh oh!

salva left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

FGasper commented Apr 3, 2021

Uh oh!

Grinnz commented Apr 3, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

FGasper commented Feb 20, 2021 •

edited

Loading