Docs: Encourage SvPVbyte and SvPVutf8 over SvPV. #18587
Incidentally, YAML::Syck has the problem caused by using SvPV rather than SvPVbyte/SvPVutf8, too: cpan-authors/YAML-Syck#60
I do think the distinction between UTF-8 flagged and non-flagged SVs could be integrated into the other discussion of SVs in perlguts. Right now we have general discussion of SVs, and then a separate discussion of Unicode further down. The only mention of UTF8 is the example discussing direct buffer access added more years ago than I want to believe.
Clarifying the imperative for SvPV callers to check the string’s encoding would definitely be an improvement. All the same, the following seem apparent:
… in light of which, wouldn’t some manner of “deprioritization” of SvPV (et al.) in the documentation be apropos? FWIW, personally I’d like to see
Force-pushed from e133ac9 to db2627f.
Force-pushed from b83b19e to 437a6ef.
I realized that So I added:
The problem with calling it "lax" is that it doesn't really explain anything. I think it would be better to say that if your input SV has non-Unicode code points, then the result may contain extensions over valid UTF-8.
“lax” is what Larry called it, but that change makes sense. I’ll amend.
Force-pushed from 8fe0a1b to 8a126a4.
Force-pushed from 8a126a4 to 01ae805.
I realized that SvPV’s description in
I have made some comments, but all of them are minor issues.
As it is right now this patch is already a big improvement over what we have.
sv.h
Outdated
Returns the length of the string which is in the SV. See C<L</SvLEN>>.
Returns the length, in bytes, of the C string which is in the SV.
Note that this may not match Perl's C<length>; for that, use
C<SvUTF8(sv) ? sv_len_utf8(sv) : sv_len(sv)>.
There is really no need to check the value of SvUTF8(sv), sv_len_utf8 already does it internally.
Thank you for pointing this out! I’m revising.
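To illustrate why a byte length and Perl's `length` can disagree, here is a minimal standalone C sketch. It does not use the Perl API; `utf8_codepoint_count` is a hypothetical helper, and it assumes the buffer is well-formed UTF-8. It only shows the byte-versus-character distinction the revised wording is pointing at.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the Perl API.
 * Counts code points in a UTF-8 buffer of known byte length by counting
 * every byte that is NOT a continuation byte (bit pattern 10xxxxxx).
 * Assumes the input is well-formed UTF-8. */
size_t utf8_codepoint_count(const unsigned char *buf, size_t byte_len)
{
    size_t count = 0;
    for (size_t i = 0; i < byte_len; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the UTF-8 bytes of "café" (`63 61 66 C3 A9`), the byte length is 5 while the helper reports 4 code points.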
C<SvPV_nolen>) can expose the SV's internal string buffer. If
that buffer consists entirely of bytes 0-255 and includes any bytes above
127, then you B<MUST> consult C<SvUTF8> to determine the actual code points
the string is meant to contain. Generally speaking, it is probably safer to
"Generally speaking, it is probably safer"
Is that too mild? IMO the usage of SvPV should be clearly discouraged, more along the lines of...
"Don't use SvPV unless you know what you are doing."
I agree, but I’m not sure that’s generally accepted. For now I just want to improve on status quo.
I'm OK with the current new wording too
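The point about consulting SvUTF8 can be shown with a standalone C sketch that does not touch the Perl API. `decode_buffer` and its `is_utf8` parameter are hypothetical stand-ins: the parameter plays the role of the flag, and the function assumes well-formed input. The same two bytes decode to different code points depending on that flag.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Perl API.
 * Decodes `len` bytes into code points, writing them to `out` (which must
 * have room for `len` entries) and returning how many were written.
 * `is_utf8` stands in for the SV's UTF-8 flag:
 *   0 -> each byte is one code point (0-255)
 *   1 -> the bytes are UTF-8 (assumed well-formed) */
size_t decode_buffer(const unsigned char *buf, size_t len,
                     int is_utf8, uint32_t *out)
{
    size_t n = 0, i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        if (!is_utf8 || b < 0x80) {
            out[n++] = b;
            i++;
            continue;
        }
        /* Number of continuation bytes implied by the lead byte. */
        size_t extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;
        if (i + extra >= len)
            break;                      /* truncated sequence: stop */
        uint32_t cp = b & (0x3F >> extra);
        for (size_t j = 1; j <= extra; j++)
            cp = (cp << 6) | (buf[i + j] & 0x3F);
        out[n++] = cp;
        i += extra + 1;
    }
    return n;
}
```

With the bytes `C3 A9`, a cleared flag yields two code points (0xC3, 0xA9), while a set flag yields the single code point 0xE9. Code that ignores the flag cannot tell these cases apart.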
pod/perlguts.pod
Outdated
C<SvPVbyte> if your C library expects byte strings, or C<SvPVutf8>
if it expects UTF-8.

If your C library happens to support both encodings, then of course
That would happen very seldom and there are several gotchas associated with doing that.
I think it would be better to just remove this paragraph or at least the "is preferable" part.
Revised to:
+If your C library happens to support both encodings, then C<SvPV>--always
+in tandem with lookups to C<SvUTF8>!--may be safe and (slightly) more
+efficient.
Force-pushed from 3472d76 to 92d6369.
@Grinnz @Leont @khwilliamson Do you see any outstanding issues as of now with this PR? Thank you!
LGTM
pod/perlguts.pod
Outdated
This is suitable for Perl strings that represent characters.

B<CAVEAT>: That C<char*> will be encoded via Perl's internal UTF-8 variant,
which means that if the SV contains non-Unicode code points (e.g., U+FFFF),
U+FFFF is a non-character Unicode code point, so this example is wrong. I suggest 0x110000. 'U+' is strictly wrong here, which is why I used '0x'
In documentation I've written recently, I've tried to be careful to use U+ only for the 0-10FFFF range
Fixed.
I'm trying to see that fix. I reloaded this PR and it says the commit was in March
I haven’t pushed; I was waiting for your thoughts on #18587 (comment)
But I’ll push up now.
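The distinction raised in this thread (U+FFFF is a noncharacter but still a Unicode code point, whereas 0x110000 lies outside Unicode entirely) can be sketched in standalone C. Both function names are hypothetical; the ranges are the Unicode code space 0 to 0x10FFFF and the 66 noncharacters (U+FDD0 to U+FDEF, plus the last two code points of each plane).

```c
#include <stdint.h>

/* Hypothetical helpers, not part of the Perl API. */

/* True if cp lies within the Unicode code space (0 .. 0x10FFFF). */
int is_unicode_code_point(uint32_t cp)
{
    return cp <= 0x10FFFF;
}

/* True if cp is one of the 66 Unicode noncharacters:
 * U+FDD0..U+FDEF, or the last two code points of any plane
 * (low 16 bits equal to FFFE or FFFF). */
int is_noncharacter(uint32_t cp)
{
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return 1;
    return cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE;
}
```

Under these definitions 0xFFFF is both a Unicode code point and a noncharacter, while 0x110000 is neither, which is why the latter is the better example of a non-Unicode code point.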

Some C libraries may expect other encodings (e.g., UTF-16LE). To give
Perl strings to such libraries
you must either do that encoding in Perl then use C<SvPVbyte>, or
Add comma before 'then'
That seems like it would potentially confuse; the separation that the single comma creates between the two options would be weakened by having another comma.
What about:
a) Reformat it as an ordered list (=item a), =item b))
b) Replace the comma before "or" with a semicolon
c) Just add the extra comma, if you think it’s best that way.
?
@khwilliamson ^^ Question for you, when you have a chance?
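The quoted hunk is cut off after "or", so the second option is not visible here. As a rough illustration of what producing UTF-16LE involves, here is a standalone C sketch that encodes a single code point. `encode_utf16le` is a hypothetical name and nothing here uses the Perl API; a real XS module would more likely lean on an existing encoding library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Perl API.
 * Encodes one code point as UTF-16LE into out[0..3].
 * Returns the number of bytes written (2 or 4), or 0 if cp cannot be
 * represented in UTF-16 (a surrogate, or anything above 0x10FFFF). */
size_t encode_utf16le(uint32_t cp, unsigned char out[4])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {
        out[0] = (unsigned char)(cp & 0xFF);   /* low byte first */
        out[1] = (unsigned char)(cp >> 8);
        return 2;
    }

    /* Supplementary plane: split into a surrogate pair. */
    cp -= 0x10000;
    uint16_t hi = (uint16_t)(0xD800 | (cp >> 10));
    uint16_t lo = (uint16_t)(0xDC00 | (cp & 0x3FF));
    out[0] = (unsigned char)(hi & 0xFF);
    out[1] = (unsigned char)(hi >> 8);
    out[2] = (unsigned char)(lo & 0xFF);
    out[3] = (unsigned char)(lo >> 8);
    return 4;
}
```

Note that a code point such as 0x110000 has no UTF-16 form at all, so the strings handed to such a library must be restricted to Unicode code points.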
B<TESTING> B<TIP:> Use L<utf8>'s C<upgrade> and C<downgrade> functions
in your tests to ensure consistent handling regardless of Perl's
internal encoding.
I really like this section
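What upgrading and downgrading do to the underlying bytes can be modeled in standalone C. The sketch below is only an analogy to utf8::upgrade and utf8::downgrade, not their implementation; `upgrade_latin1` and `downgrade_utf8` are hypothetical names, and the downgrade assumes well-formed UTF-8 input.

```c
#include <stddef.h>

/* Hypothetical helpers, not part of the Perl API. */

/* Re-encodes bytes 0-255 as UTF-8. `out` must hold 2 * len bytes.
 * Returns the number of bytes written. */
size_t upgrade_latin1(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char b = in[i];
        if (b < 0x80) {
            out[n++] = b;
        } else {
            out[n++] = (unsigned char)(0xC0 | (b >> 6));
            out[n++] = (unsigned char)(0x80 | (b & 0x3F));
        }
    }
    return n;
}

/* Converts UTF-8 back to one byte per code point. `out` must hold len
 * bytes. Returns 0 on success, or -1 if the input contains a code point
 * above 0xFF (which cannot be downgraded) or is truncated. */
int downgrade_utf8(const unsigned char *in, size_t len,
                   unsigned char *out, size_t *out_len)
{
    size_t n = 0, i = 0;
    while (i < len) {
        unsigned char b = in[i];
        if (b < 0x80) {
            out[n++] = b;
            i += 1;
        } else if ((b == 0xC2 || b == 0xC3) && i + 1 < len) {
            out[n++] = (unsigned char)(((b & 0x1F) << 6) | (in[i + 1] & 0x3F));
            i += 2;
        } else {
            return -1;
        }
    }
    *out_len = n;
    return 0;
}
```

The character é is the single byte `E9` in one form and the two bytes `C3 A9` in the other. A test that runs the same input through both forms exercises both code paths in XS code.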
@@ -855,8 +857,8 @@ Set the value of the MAGIC pointer in C<sv> to val. See C<L</SvIV_set>>.
Set the value of the STASH pointer in C<sv> to val. See C<L</SvIV_set>>.

=for apidoc Am|void|SvCUR_set|SV* sv|STRLEN len
Set the current length of the string which is in the SV. See C<L</SvCUR>>
and C<SvIV_set>>.
Sets the current length, in bytes, of the C string which is in the SV.
I don't understand why this has to be a C string. The SV may contain embedded NULs, and those are counted as bytes.
Perhaps an explicit mention that the string may contain NULs and they are counted.
Yeah I meant “C string” as distinct from “Perl string”, i.e., what sv_len_utf8 does. When I looked at it I wasn’t sure if it referred to Perl length() or the number of bytes in the PV.
Maybe:
Sets the current length of the PV inside the SV.
?
"Sets the current length, in bytes, of the PV inside the SV" would be better IMO
Done.
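The embedded-NUL concern is easy to demonstrate in standalone C. In this minimal sketch, `has_embedded_nul` is a hypothetical helper rather than a Perl API function. A length tracked alongside the buffer counts every byte, whereas `strlen` stops at the first NUL.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the Perl API.
 * Reports whether a buffer of known byte length contains a NUL byte
 * anywhere within that length. */
int has_embedded_nul(const char *buf, size_t len)
{
    return memchr(buf, '\0', len) != NULL;
}
```

For the five bytes `a b \0 c d`, the tracked length is 5 but `strlen` reports 2. Passing such a buffer to a C API that takes only a `char*` silently drops everything after the NUL.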
perlguts, perlxs, perlxstut, and perlapi. Issue Perl#18600
Force-pushed from 92d6369 to 0ea2594.
Per an (admittedly short) p5p discussion.
I don’t know if this is appropriate in Perl’s docs, but I’ve found utf8::downgrade a useful workaround for XS code that uses SvPV.