Commit 8fe0a1b

committed
Docs: Emphasize SvPVbyte and SvPVutf8 over SvPV.
Issue #18600
1 parent 9915526 commit 8fe0a1b

4 files changed, +141 -27 lines changed

dist/ExtUtils-ParseXS/lib/perlxs.pod

Lines changed: 4 additions & 4 deletions
@@ -603,7 +603,7 @@ and C<$type> can be used as in typemaps.
 
 bool_t
 rpcb_gettime(host,timep)
-char *host = (char *)SvPV_nolen($arg);
+char *host = (char *)SvPVbyte_nolen($arg);
 time_t &timep = 0;
 OUTPUT:
 timep
@@ -630,7 +630,7 @@ Here's a truly obscure example:
 bool_t
 rpcb_gettime(host,timep)
 time_t &timep; /* \$v{timep}=@{[$v{timep}=$arg]} */
-char *host + SvOK($v{timep}) ? SvPV_nolen($arg) : NULL;
+char *host + SvOK($v{timep}) ? SvPVbyte_nolen($arg) : NULL;
 OUTPUT:
 timep
 
@@ -993,7 +993,7 @@ The XS code, with ellipsis, follows.
 char *host = "localhost";
 CODE:
 if( items > 1 )
-host = (char *)SvPV_nolen(ST(1));
+host = (char *)SvPVbyte_nolen(ST(1));
 RETVAL = rpcb_gettime( host, &timep );
 OUTPUT:
 timep
@@ -1294,7 +1294,7 @@ prototypes.
 char *host = "localhost";
 CODE:
 if( items > 1 )
-host = (char *)SvPV_nolen(ST(1));
+host = (char *)SvPVbyte_nolen(ST(1));
 RETVAL = rpcb_gettime( host, &timep );
 OUTPUT:
 timep

dist/ExtUtils-ParseXS/lib/perlxstut.pod

Lines changed: 2 additions & 1 deletion
@@ -1143,7 +1143,8 @@ Mytest.xs:
 for (n = 0; n <= numpaths; n++) {
 HV * rh;
 STRLEN l;
-char * fn = SvPV(*av_fetch((AV *)SvRV(paths), n, 0), l);
+SV * path = *av_fetch((AV *)SvRV(paths), n, 0);
+char * fn = SvPVbyte(path, l);
 
 i = statfs(fn, &buf);
 if (i != 0) {
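
Reviewer's sketch (not part of the commit): the `statfs(fn, &buf)` call in the hunk above receives only the pointer and ignores the length `l`, so a path containing a NUL byte would be silently truncated by the C library. The perlguts text added in this same commit suggests rejecting such strings when the length cannot be passed along. A minimal standalone C check might look like the following. `has_embedded_nul` is a hypothetical helper name, not a Perl API function, and the buffer/length pair stands in for what `SvPVbyte(path, l)` would yield.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of the Perl API): reports whether a
 * length-counted buffer contains a NUL byte anywhere in its first
 * `len` bytes. A C function that takes only a `char *` would stop
 * reading at that byte. */
static bool has_embedded_nul(const char *buf, size_t len)
{
    assert(buf != NULL);
    /* memchr scans exactly `len` bytes, so it does not rely on the
     * buffer being NUL-terminated. */
    return memchr(buf, '\0', len) != NULL;
}
```

In an XS function one could presumably call this with `fn` and `l` right after the `SvPVbyte` line and croak when it returns true, but that wiring is an assumption and is not shown in the commit.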

pod/perlguts.pod

Lines changed: 129 additions & 19 deletions
@@ -153,37 +153,87 @@ Perl's own functions typically add a trailing C<NUL> for this reason.
 Nevertheless, you should be very careful when you pass a string stored
 in an SV to a C function or system call.
 
-To access the actual value that an SV points to, you can use the macros:
-
-SvIV(SV*)
-SvUV(SV*)
-SvNV(SV*)
-SvPV(SV*, STRLEN len)
-SvPV_nolen(SV*)
-
-which will automatically coerce the actual scalar type into an IV, UV, double,
-or string.
-
-In the C<SvPV> macro, the length of the string returned is placed into the
-variable C<len> (this is a macro, so you do I<not> use C<&len>). If you do
-not care what the length of the data is, use the C<SvPV_nolen> macro.
-Historically the C<SvPV> macro with the global variable C<PL_na> has been
-used in this case. But that can be quite inefficient because C<PL_na> must
+To access the actual value that an SV points to, Perl's API exposes
+several macros that coerce the actual scalar type into an IV, UV, double,
+or string:
+
+=over
+
+=item * C<SvIV(SV*)> (C<IV>) and C<SvUV(SV*)> (C<UV>)
+
+=item * C<SvNV(SV*)> (C<double>)
+
+=item * Strings are a bit complicated:
+
+=over
+
+=item * Byte string: C<SvPVbyte(SV*, STRLEN len)> or C<SvPVbyte_nolen(SV*)>
+
+If the Perl string is C<"\xff\xff">, then this returns a 2-byte C<char*>.
+
+This is suitable for Perl strings that represent bytes.
+
+=item * UTF-8 string: C<SvPVutf8(SV*, STRLEN len)> or C<SvPVutf8_nolen(SV*)>
+
+If the Perl string is C<"\xff\xff">, then this returns a 4-byte C<char*>.
+
+This is suitable for Perl strings that represent characters.
+
+B<CAVEAT>: That C<char*> will be encoded via Perl’s internal UTF-8 variant,
+which means that if the SV contains non-Unicode code points (e.g., U+FFFF),
+then the result may contain extensions over valid UTF-8.
+See L<perlapi/is_strict_utf8_string> for some methods Perl gives
+you to check the UTF-8 validity of these macros’ returns.
+
+=item * You can also use C<SvPV(SV*, STRLEN len)> or C<SvPV_nolen(SV*)>
+to fetch the SV's raw internal buffer. This is tricky, though; if your Perl
+string
+is C<"\xff\xff">, then depending on the SV's internal encoding you might get
+back a 2-byte B<OR> a 4-byte C<char*>.
+Moreover, if it's the 4-byte string, that could come from either Perl
+C<"\xff\xff"> stored UTF-8 encoded, or Perl C<"\xc3\xbf\xc3\xbf"> stored
+as raw octets. To differentiate between these you B<MUST> look up the
+SV's UTF8 bit (cf. C<SvUTF8>) to know whether the source Perl string
+is 2 characters (C<SvUTF8> would be on) or 4 characters (C<SvUTF8> would be
+off).
+
+B<IMPORTANT:> Use of C<SvPV>, C<SvPV_nolen>, or
+similarly-named macros I<without> looking up the SV's UTF8 bit is
+a likely source of character-encoding bugs, unless the string in question
+is always fully UTF-8 invariant.
+
+When the UTF8 bit is on, the same B<CAVEAT> about UTF-8 validity applies
+here as for C<SvPVutf8>.
+
+=back
+
+(See L</How do I pass a Perl string to a C library?> for more details.)
+
+In C<SvPVbyte>, C<SvPVutf8>, and C<SvPV>, the length of the C<char*> returned
+is placed into the
+variable C<len> (these are macros, so you do I<not> use C<&len>). If you do
+not care what the length of the data is, use C<SvPVbyte_nolen>,
+C<SvPVutf8_nolen>, or C<SvPV_nolen> instead.
+The global variable C<PL_na> can also be given to
+C<SvPVbyte>/C<SvPVutf8>/C<SvPV>
+in this case. But that can be quite inefficient because C<PL_na> must
 be accessed in thread-local storage in threaded Perl. In any case, remember
 that Perl allows arbitrary strings of data that may both contain NULs and
 might not be terminated by a C<NUL>.
 
-Also remember that C doesn't allow you to safely say C<foo(SvPV(s, len),
+Also remember that C doesn't allow you to safely say C<foo(SvPVbyte(s, len),
 len);>. It might work with your
 compiler, but it won't work for everyone.
 Break this sort of statement up into separate assignments:
 
 SV *s;
 STRLEN len;
 char *ptr;
-ptr = SvPV(s, len);
+ptr = SvPVbyte(s, len);
 foo(ptr, len);
 
+=back
+
 If you want to know if the scalar value is TRUE, you can use:
 
 SvTRUE(SV*)
@@ -200,7 +250,7 @@ add space for the trailing C<NUL> byte (perl's own string functions typically do
 C<SvGROW(sv, len + 1)>).
 
 If you want to write to an existing SV's buffer and set its value to a
-string, use SvPV_force() or one of its variants to force the SV to be
+string, use SvPVbyte_force() or one of its variants to force the SV to be
 a PV. This will remove any of various types of non-stringness from
 the SV while preserving the content of the SV in the PV. This can be
 used, for example, to append data from an API function to a buffer
@@ -3243,6 +3293,66 @@ There is no published API for dealing with this, as it is subject to
 change, but you can look at the code for C<pp_lc> in F<pp.c> for an
 example as to how it's currently done.
 
+=head2 How do I pass a Perl string to a C library?
+
+A Perl string, conceptually, is an opaque sequence of code points.
+Many C libraries expect their inputs to be "classical" C strings, which are
+arrays of octets 1-255, terminated with a NUL byte. Your job when writing
+an interface between Perl and a C library is to define the mapping between
+Perl and that library.
+
+Generally speaking, C<SvPVbyte> and related macros suit this task well.
+These assume that your Perl string is a "byte string", i.e., is either
+raw, undecoded input into Perl or is pre-encoded to, e.g., UTF-8.
+
+Alternatively, if your C library expects UTF-8 text, you can use
+C<SvPVutf8> and related macros. This has the same effect as encoding
+to UTF-8 then calling the corresponding C<SvPVbyte>-related macro.
+
+Some C libraries may expect other encodings (e.g., UTF-16LE). To give
+Perl strings to such libraries
+you must either do that encoding in Perl then use C<SvPVbyte>, or
+use an intermediary C library to convert from however Perl stores the
+string to the desired encoding.
+
+Take care also that NULs in your Perl string don't confuse the C
+library. If possible, give the string's length to the C library; if that's
+not possible, consider rejecting strings that contain NUL bytes.
+
+=head3 What about C<SvPV>, C<SvPV_nolen>, etc.?
+
+Consider a 3-character Perl string C<$foo = "\x64\x78\x8c">.
+Perl can store these 3 characters either of two ways:
+
+=over
+
+=item * bytes: 0x64 0x78 0x8c
+
+=item * UTF-8: 0x64 0x78 0xc2 0x8c
+
+=back
+
+Now let's say you convert C<$foo> to a C string thus:
+
+STRLEN strlen;
+char *str = SvPV(foo_sv, strlen);
+
+At this point C<str> could point to a 3-byte C string or a 4-byte one.
+
+Generally speaking, we want C<str> to be the same regardless of how
+Perl stores C<$foo>, so the ambiguity here is undesirable. C<SvPVbyte>
+and C<SvPVutf8> solve that by giving predictable output: use
+C<SvPVbyte> if your C library expects byte strings, or C<SvPVutf8>
+if it expects UTF-8.
+
+If your C library happens to support both encodings, then of course
+C<SvPV>--always in tandem with lookups to C<SvUTF8>!--is preferable since
+it will avoid superfluous encoding/decoding operations.
+
+B<TESTING> B<TIP:> Use L<utf8>'s C<upgrade> and C<downgrade> functions
+in your tests to ensure consistent handling regardless of Perl's
+internal encoding.
+
 =head2 How do I convert a string to UTF-8?
 
 If you're mixing UTF-8 and non-UTF-8 strings, it is necessary to upgrade
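
Reviewer's sketch (not part of the commit): the two storage forms listed for `"\x64\x78\x8c"` in the new perlguts section can be reproduced without Perl at all. The standalone C below models what the "UTF-8" storage form looks like for a string whose code points are all below 0x100. `latin1_to_utf8` is a hypothetical name chosen for illustration. It is not a Perl API function and makes no claim about how Perl implements upgrading internally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the Perl API): treats each input
 * byte as one code point in the range 0x00-0xFF and writes its UTF-8
 * encoding to `out`. The caller must provide room for 2 * len bytes.
 * Returns the number of bytes written. */
static size_t latin1_to_utf8(const unsigned char *in, size_t len,
                             unsigned char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            /* ASCII range: identical in both storage forms. */
            out[n++] = in[i];
        } else {
            /* 0x80-0xFF: two-byte UTF-8 sequence. */
            out[n++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[n++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    return n;
}
```

Feeding it the three bytes 0x64 0x78 0x8c yields the four bytes 0x64 0x78 0xc2 0x8c from the list in the diff. That is the same ambiguity a bare `SvPV` exposes when the UTF8 flag is not consulted.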

sv.h

Lines changed: 6 additions & 3 deletions
@@ -801,7 +801,10 @@ compiler will complain if you were to try to modify the contents of the string,
 (unless you cast away const yourself).
 
 =for apidoc Am|STRLEN|SvCUR|SV* sv
-Returns the length of the string which is in the SV. See C<L</SvLEN>>.
+Returns the length, in bytes, of the C string which is in the SV.
+Note that this may not match Perl's C<length>; for that, use
+C<SvUTF8(sv) ? sv_len_utf8(sv) : sv_len(sv)>.
+See C<L</SvLEN>> also.
 
 =for apidoc Am|STRLEN|SvLEN|SV* sv
 Returns the size of the string buffer in the SV, not including any part
@@ -855,8 +858,8 @@ Set the value of the MAGIC pointer in C<sv> to val. See C<L</SvIV_set>>.
 Set the value of the STASH pointer in C<sv> to val. See C<L</SvIV_set>>.
 
 =for apidoc Am|void|SvCUR_set|SV* sv|STRLEN len
-Set the current length of the string which is in the SV. See C<L</SvCUR>>
-and C<SvIV_set>>.
+Sets the current length, in bytes, of the C string which is in the SV.
+See C<L</SvCUR>> and C<SvIV_set>>.
 
 =for apidoc Am|void|SvLEN_set|SV* sv|STRLEN len
 Set the size of the string buffer for the SV. See C<L</SvLEN>>.
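
Reviewer's sketch (not part of the commit): the revised `SvCUR` wording distinguishes the byte length of the buffer from Perl's character `length`. The standalone C below illustrates why the two can differ when the buffer holds UTF-8. `utf8_char_count` is a hypothetical helper for illustration only. It assumes well-formed UTF-8 input and is not a substitute for `sv_len_utf8`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the Perl API): counts code points
 * in a buffer assumed to hold well-formed UTF-8, by counting every
 * byte that is not a continuation byte (binary 10xxxxxx). */
static size_t utf8_char_count(const unsigned char *buf, size_t byte_len)
{
    size_t chars = 0;

    assert(buf != NULL);
    for (size_t i = 0; i < byte_len; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

For the UTF-8 stored form of `"\xff\xff"` (bytes 0xc3 0xbf 0xc3 0xbf), the byte length is 4 while this count is 2, which mirrors the byte-versus-character mismatch the new documentation warns about.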
