Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docs: Encourage SvPVbyte and SvPVutf8 over SvPV. #18587

Merged

Conversation

FGasper
Copy link
Contributor

@FGasper FGasper commented Feb 17, 2021

Per an (admittedly short) p5p discussion.

I don’t know if this is appropriate in Perl’s docs, but I’ve found utf8::downgrade a useful workaround for XS code that uses SvPV.

pod/perlguts.pod Outdated Show resolved Hide resolved
@FGasper
Copy link
Contributor Author

FGasper commented Feb 20, 2021

Incidentally, YAML::Syck has the problem caused by using SvPV rather than SvPVbyte/SvPVutf8, too: cpan-authors/YAML-Syck#60

@tonycoz
Copy link
Contributor

tonycoz commented Mar 18, 2021

I do think the distinction between UTF-8 flagged and and non-flagged SVs could be integrated into the other discussion of SVs in perlguts. Right now we have general discussion of SVs, and then a separate discussion of Unicode further down. The only mention of UTF8 is the example discussing direct buffer access added more years ago than I want to believe.

@FGasper
Copy link
Contributor Author

FGasper commented Mar 18, 2021

I do think the distinction between UTF-8 flagged and and non-flagged SVs could be integrated into the other discussion of SVs in perlguts. Right now we have general discussion of SVs, and then a separate discussion of Unicode further down. The only mention of UTF8 is the example discussing direct buffer access added more years ago than I want to believe.

Clarifying the imperative for SvPV callers to check the string’s encoding would definitely be an improvement.

All the same, the following seem apparent:

  1. SvPV leaks the internal string-encoding abstraction—or, at least, makes the caller go out of their way to preserve the abstraction. (Same problem as bytes.pm, but in C.)
  2. SvPVbyte and SvPVutf8 preserve that abstraction.
  3. All things being equal, it’s better for integrators to prefer “safe”, abstraction-preserving APIs rather than “dangerous”, abstraction-leaking ones.

… in light of which, wouldn’t some manner of “deprioritization” of SvPV (et al.) in the documentation be apropos?

FWIW, personally I’d like to see SvPV_INTERNAL be that macro’s canonical name, with a note in documentation about the “historical” SvPV alias.

@FGasper FGasper force-pushed the prefer_encoding_specific_svpv_variants branch 2 times, most recently from e133ac9 to db2627f Compare March 21, 2021 19:50
@FGasper FGasper force-pushed the prefer_encoding_specific_svpv_variants branch 2 times, most recently from b83b19e to 437a6ef Compare March 22, 2021 11:04
@FGasper
Copy link
Contributor Author

FGasper commented Mar 22, 2021

I realized that SvPVutf8 has its own problem since this is Perl’s “lax” UTF-8, not actual valid UTF-8.

So I added:

commit 4ca94ddf03397718a9cc6958ef8956e99dfba4a7
Author: Felipe Gasper <[email protected]>
Date:   Mon Mar 22 07:04:07 2021 -0400

    Add caveat about SvPVutf8 and UTF-8 validity.

diff --git a/pod/perlguts.pod b/pod/perlguts.pod
index c201be0a6f..a5e1a988e5 100644
--- a/pod/perlguts.pod
+++ b/pod/perlguts.pod
@@ -171,10 +171,19 @@ or string:
 
 If the Perl string is C<"\xff\xff">, then this returns a 2-byte C<char*>.
 
+This is suitable for Perl strings that represent bytes.
+
 =item * UTF-8 string: C<SvPVutf8(SV*, STRLEN len)> or C<SvPVutf8_nolen(SV*)>
 
 If the Perl string is C<"\xff\xff">, then this returns a 4-byte C<char*>.
 
+This is suitable for Perl strings that represent characters.
+
+B<CAVEAT>: That C<char*> will be encoded via Perl’s internal "lax" UTF-8
+variant, which accepts various things that the official UTF-8 standard
+forbids. See L<perlapi/is_strict_utf8_string> for some methods Perl gives
+you to ensure that what these macros give you is I<actually> valid UTF-8.
+
 =item * You can also use C<SvPV(SV*, STRLEN len)> or C<SvPV_nolen(SV*)>
 to fetch the SV's raw internal buffer. This is tricky, though; if your Perl
 string
@@ -192,6 +201,9 @@ similarly-named macros I<without> looking up the SV's UTF8 bit is
 a likely source of character-encoding bugs, unless the string in question
 is always fully UTF-8 invariant.
 
+When the UTF8 bit is on, the same B<CAVEAT> about UTF-8 validity applies
+here as for C<SvPVutf8>.
+
 =back
 
 (See L</How do I pass a Perl string to a C library?> for more details.)

@tonycoz
Copy link
Contributor

tonycoz commented Mar 24, 2021

+B<CAVEAT>: That C<char*> will be encoded via Perl’s internal "lax" UTF-8
+variant, which accepts various things that the official UTF-8 standard
+forbids. See L<perlapi/is_strict_utf8_string> for some methods Perl gives
+you to ensure that what these macros give you is I<actually> valid UTF-8.

The problem with calling it "lax" is it doesn't really explain anything. I think it would be better to say if you're input SV has non-Unicode code points, then the result may contain extensions over valid UTF-8.

@FGasper
Copy link
Contributor Author

FGasper commented Mar 24, 2021

+B<CAVEAT>: That C<char*> will be encoded via Perl’s internal "lax" UTF-8
+variant, which accepts various things that the official UTF-8 standard
+forbids. See L<perlapi/is_strict_utf8_string> for some methods Perl gives
+you to ensure that what these macros give you is I<actually> valid UTF-8.

The problem with calling it "lax" is it doesn't really explain anything. I think it would be better to say if you're input SV has non-Unicode code points, then the result may contain extensions over valid UTF-8.

“lax” is what Larry called it, but that change makes sense. I’ll amend.

@FGasper FGasper force-pushed the prefer_encoding_specific_svpv_variants branch 3 times, most recently from 8fe0a1b to 8a126a4 Compare March 27, 2021 11:21
@FGasper FGasper marked this pull request as ready for review March 27, 2021 11:46
@FGasper FGasper force-pushed the prefer_encoding_specific_svpv_variants branch from 8a126a4 to 01ae805 Compare March 28, 2021 21:24
@FGasper
Copy link
Contributor Author

FGasper commented Mar 28, 2021

I realized that SvPV’s description in perlapi was unchanged. So to the section that describes the different SvPV* macros in sv.h I added:

+The forms with neither C<byte> nor C<utf8> in their names (e.g., C<SvPV> or
+C<SvPV_nolen>) can expose the SV's internal string buffer. If
+that buffer consists entirely of bytes 0-255 and includes any bytes above
+127, then you B<MUST> consult C<SvUTF8> to determine the actual code points
+the string is meant to contain. Generally speaking, it is probably safer to
+prefer C<SvPVbyte>, C<SvPVutf8>, and the like. See
+L<perlguts/How do I pass a Perl string to a C library?> for more details.

Copy link
Contributor

@salva salva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made some comments, but all of them are minor issues.

As it is right now this patch is already a big improvement over what we have.

sv.h Outdated
Returns the length of the string which is in the SV. See C<L</SvLEN>>.
Returns the length, in bytes, of the C string which is in the SV.
Note that this may not match Perl's C<length>; for that, use
C<SvUTF8(sv) ? sv_len_utf8(sv) : sv_len(sv)>.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is really no need to check the value of SvUTF8(sv), sv_len_utf8 already does it internally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing this out! I’m revising.

C<SvPV_nolen>) can expose the SV's internal string buffer. If
that buffer consists entirely of bytes 0-255 and includes any bytes above
127, then you B<MUST> consult C<SvUTF8> to determine the actual code points
the string is meant to contain. Generally speaking, it is probably safer to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally speaking, it is probably safer

Is that too mild? IMO the usage of SvPV should be clearly discouraged, more on the line of...

Don't use SvPV unless you know what you are doing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but I’m not sure that’s generally accepted. For now I just want to improve on status quo.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with the current new wording too

pod/perlguts.pod Outdated
C<SvPVbyte> if your C library expects byte strings, or C<SvPVutf8>
if it expects UTF-8.

If your C library happens to support both encodings, then of course
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would happen very seldom and there are several gotchas associated with doing that.

I think it would be better to just remove this paragraph or at least the "is preferable" part.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised to:

+If your C library happens to support both encodings, then C<SvPV>--always
+in tandem with lookups to C<SvUTF8>!--may be safe and (slightly) more
+efficient.

@FGasper FGasper force-pushed the prefer_encoding_specific_svpv_variants branch 2 times, most recently from 3472d76 to 92d6369 Compare April 1, 2021 00:50
@FGasper
Copy link
Contributor Author

FGasper commented Apr 3, 2021

@Grinnz @Leont @khwilliamson Do you see any outstanding issues as of now with this PR?

Thank you!

@Grinnz
Copy link
Contributor

Grinnz commented Apr 3, 2021

LGTM

pod/perlguts.pod Outdated
This is suitable for Perl strings that represent characters.

B<CAVEAT>: That C<char*> will be encoded via Perl's internal UTF-8 variant,
which means that if the SV contains non-Unicode code points (e.g., U+FFFF),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

U+FFFF is a non-character Unicode code point, so this example is wrong. I suggest 0x110000. 'U+' is strictly wrong here, which is why I used '0x'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In documentation I've written recently, I've tried to be careful to use U+ only for the 0-10FFFF range

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to see that fix. I reloaded this PR and it says the commit was in March

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven’t pushed; I was waiting for your thoughts on #18587 (comment)

But I‘ll push up now.


Some C libraries may expect other encodings (e.g., UTF-16LE). To give
Perl strings to such libraries
you must either do that encoding in Perl then use C<SvPVbyte>, or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comma before 'then'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems like it would potentially confuse; the separation that the single comma creates between the two options would be weakened by having another comma.

What about:

a) Reformat it as an ordered list (=item a), =item b))

b) Replace the comma before or with a semicolon

c) Just add the extra comma, if you think it’s best that way.

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@khwilliamson ^^ Question for you, when you have a chance?

B<TESTING> B<TIP:> Use L<utf8>'s C<upgrade> and C<downgrade> functions
in your tests to ensure consistent handling regardless of Perl's
internal encoding.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this section

@@ -855,8 +857,8 @@ Set the value of the MAGIC pointer in C<sv> to val. See C<L</SvIV_set>>.
Set the value of the STASH pointer in C<sv> to val. See C<L</SvIV_set>>.

=for apidoc Am|void|SvCUR_set|SV* sv|STRLEN len
Set the current length of the string which is in the SV. See C<L</SvCUR>>
and C<SvIV_set>>.
Sets the current length, in bytes, of the C string which is in the SV.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understands why this has to be a C string. The SV may contain embedded NULs, and those are counted as bytes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps an explicit mention that the string may contain NULs and they are counted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I meant “C string” as distinct from “Perl string”, i.e., what sv_len_utf8 does. When I looked at it I wasn’t sure if it referred to Perl length() or the number of bytes in the PV.

Maybe:

Sets the current length of the PV inside the SV.

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sets the current length, in bytes, of the PV inside the SV

would be better IMO

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

C<SvPV_nolen>) can expose the SV's internal string buffer. If
that buffer consists entirely of bytes 0-255 and includes any bytes above
127, then you B<MUST> consult C<SvUTF8> to determine the actual code points
the string is meant to contain. Generally speaking, it is probably safer to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with the current new wording too

perlguts, perlxs, perlxstut, and perlapi.

Issue Perl#18600
@FGasper FGasper force-pushed the prefer_encoding_specific_svpv_variants branch from 92d6369 to 0ea2594 Compare April 9, 2021 03:34
@khwilliamson khwilliamson merged commit 3c3f883 into Perl:blead Apr 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants