@@ -153,37 +153,86 @@ Perl's own functions typically add a trailing C<NUL> for this reason.
Nevertheless, you should be very careful when you pass a string stored
in an SV to a C function or system call.

- To access the actual value that an SV points to, you can use the macros:
-
-     SvIV(SV*)
-     SvUV(SV*)
-     SvNV(SV*)
-     SvPV(SV*, STRLEN len)
-     SvPV_nolen(SV*)
-
- which will automatically coerce the actual scalar type into an IV, UV, double,
- or string.
-
- In the C<SvPV> macro, the length of the string returned is placed into the
- variable C<len> (this is a macro, so you do I<not> use C<&len>). If you do
- not care what the length of the data is, use the C<SvPV_nolen> macro.
- Historically the C<SvPV> macro with the global variable C<PL_na> has been
- used in this case. But that can be quite inefficient because C<PL_na> must
+ To access the actual value that an SV points to, Perl's API exposes
+ several macros that coerce the actual scalar type into an IV, UV, double,
+ or string:
+
+ =over
+
+ =item * C<SvIV(SV*)> (C<IV>) and C<SvUV(SV*)> (C<UV>)
+
+ =item * C<SvNV(SV*)> (C<double>)
+
+ =item * Strings are a bit complicated:
+
+ =over
+
+ =item * Byte string: C<SvPVbyte(SV*, STRLEN len)> or C<SvPVbyte_nolen(SV*)>
+
+ If the Perl string is C<"\xff\xff">, then this returns a 2-byte C<char*>.
+
+ This is suitable for Perl strings that represent bytes.
+
+ =item * UTF-8 string: C<SvPVutf8(SV*, STRLEN len)> or C<SvPVutf8_nolen(SV*)>
+
+ If the Perl string is C<"\xff\xff">, then this returns a 4-byte C<char*>.
+
+ This is suitable for Perl strings that represent characters.
+
+ B<CAVEAT>: That C<char*> will be encoded via Perl's internal UTF-8 variant,
+ which means that if the SV contains non-Unicode code points (e.g.,
+ 0x110000), then the result may contain extensions over valid UTF-8.
+ See L<perlapi/is_strict_utf8_string> for some methods Perl gives
+ you to check the UTF-8 validity of these macros' returns.
+
+ =item * You can also use C<SvPV(SV*, STRLEN len)> or C<SvPV_nolen(SV*)>
+ to fetch the SV's raw internal buffer. This is tricky, though; if your Perl
+ string
+ is C<"\xff\xff">, then depending on the SV's internal encoding you might get
+ back a 2-byte B<OR> a 4-byte C<char*>.
+ Moreover, if it's the 4-byte string, that could come from either Perl
+ C<"\xff\xff"> stored UTF-8 encoded, or Perl C<"\xc3\xbf\xc3\xbf"> stored
+ as raw octets. To differentiate between these you B<MUST> look up the
+ SV's UTF8 bit (cf. C<SvUTF8>) to know whether the source Perl string
+ is 2 characters (C<SvUTF8> would be on) or 4 characters (C<SvUTF8> would be
+ off).
+
+ B<IMPORTANT:> Use of C<SvPV>, C<SvPV_nolen>, or
+ similarly-named macros I<without> looking up the SV's UTF8 bit is
+ almost certainly a bug if non-ASCII input is allowed.
+
+ When the UTF8 bit is on, the same B<CAVEAT> about UTF-8 validity applies
+ here as for C<SvPVutf8>.
+
+ =back
+
+ (See L</How do I pass a Perl string to a C library?> for more details.)
+
+ In C<SvPVbyte>, C<SvPVutf8>, and C<SvPV>, the length of the C<char*> returned
+ is placed into the
+ variable C<len> (these are macros, so you do I<not> use C<&len>). If you do
+ not care what the length of the data is, use C<SvPVbyte_nolen>,
+ C<SvPVutf8_nolen>, or C<SvPV_nolen> instead.
+ The global variable C<PL_na> can also be given to
+ C<SvPVbyte>/C<SvPVutf8>/C<SvPV>
+ in this case. But that can be quite inefficient because C<PL_na> must
be accessed in thread-local storage in threaded Perl. In any case, remember
that Perl allows arbitrary strings of data that may both contain NULs and
might not be terminated by a C<NUL>.

- Also remember that C doesn't allow you to safely say C<foo(SvPV (s, len),
+ Also remember that C doesn't allow you to safely say C<foo(SvPVbyte (s, len),
len);>. It might work with your
compiler, but it won't work for everyone.
Break this sort of statement up into separate assignments:

    SV *s;
    STRLEN len;
    char *ptr;
-     ptr = SvPV (s, len);
+     ptr = SvPVbyte (s, len);
    foo(ptr, len);

+ =back
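The split-assignment pattern can be tried outside a Perl build with the minimal plain-C sketch below. Everything in it is a hypothetical stand-in invented for illustration: C<struct fake_sv>, the C<FAKE_PV> macro, and C<call_foo_safely> only mimic the way the C<SvPV>-style macros store the length into their second argument, and do not reflect how Perl implements them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an SV: a buffer plus its length. */
struct fake_sv {
    const char *data;
    size_t      len;
};

/* Hypothetical macro: like the SvPV family, it assigns the length to
 * lenvar as a side effect and evaluates to the data pointer. */
#define FAKE_PV(sv, lenvar) ((lenvar) = (sv)->len, (sv)->data)

/* Stand-in for the C library function; it just reports the length seen. */
static size_t foo(const char *ptr, size_t len)
{
    assert(ptr != NULL);
    return len;
}

/* Writing foo(FAKE_PV(sv, len), len) would read and write len in two
 * arguments whose evaluation order C leaves unspecified. Assigning first
 * guarantees that len is set before foo() reads it. */
static size_t call_foo_safely(const struct fake_sv *sv)
{
    size_t len = 0;
    const char *ptr = FAKE_PV(sv, len);
    return foo(ptr, len);
}
```

Because the macro call is a complete statement of its own, the assignment to C<len> is finished before C<foo> is called, whatever order the compiler chooses for evaluating arguments.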
+
If you want to know if the scalar value is TRUE, you can use:

    SvTRUE(SV*)
@@ -200,7 +249,7 @@ add space for the trailing C<NUL> byte (perl's own string functions typically do
C<SvGROW(sv, len + 1)>).

If you want to write to an existing SV's buffer and set its value to a
- string, use SvPV_force () or one of its variants to force the SV to be
+ string, use SvPVbyte_force () or one of its variants to force the SV to be
a PV. This will remove any of various types of non-stringness from
the SV while preserving the content of the SV in the PV. This can be
used, for example, to append data from an API function to a buffer
@@ -3243,6 +3292,66 @@ There is no published API for dealing with this, as it is subject to
change, but you can look at the code for C<pp_lc> in F<pp.c> for an
example as to how it's currently done.

+ =head2 How do I pass a Perl string to a C library?
+
+ A Perl string, conceptually, is an opaque sequence of code points.
+ Many C libraries expect their inputs to be "classical" C strings, which are
+ arrays of octets 1-255, terminated with a NUL byte. Your job when writing
+ an interface between Perl and a C library is to define the mapping between
+ Perl and that library.
+
+ Generally speaking, C<SvPVbyte> and related macros suit this task well.
+ These assume that your Perl string is a "byte string", i.e., is either
+ raw, undecoded input into Perl or is pre-encoded to, e.g., UTF-8.
+
+ Alternatively, if your C library expects UTF-8 text, you can use
+ C<SvPVutf8> and related macros. This has the same effect as encoding
+ to UTF-8 then calling the corresponding C<SvPVbyte>-related macro.
+
+ Some C libraries may expect other encodings (e.g., UTF-16LE). To give
+ Perl strings to such libraries
+ you must either do that encoding in Perl then use C<SvPVbyte>, or
+ use an intermediary C library to convert from however Perl stores the
+ string to the desired encoding.
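To make concrete what "converting to the desired encoding" involves, here is a rough plain-C sketch of a UTF-16LE encoder. It is an illustration only: the function name C<encode_utf16le> is hypothetical, it assumes the code points have already been extracted into an array, and in practice you would more likely encode in Perl or rely on an established conversion library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes n Unicode code points as UTF-16LE into
 * dst, which must have room for 4 * n bytes. Returns the number of bytes
 * written, or (size_t)-1 if a code point is a surrogate or lies above
 * 0x10FFFF and so cannot be represented in UTF-16. */
static size_t encode_utf16le(const uint32_t *cps, size_t n, unsigned char *dst)
{
    size_t out = 0;

    assert(cps != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = cps[i];

        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;

        if (cp < 0x10000) {
            /* One 16-bit unit, low byte first. */
            dst[out++] = (unsigned char)(cp & 0xFF);
            dst[out++] = (unsigned char)(cp >> 8);
        } else {
            /* Surrogate pair: high unit, then low unit. */
            uint32_t v  = cp - 0x10000;
            uint32_t hi = 0xD800 + (v >> 10);
            uint32_t lo = 0xDC00 + (v & 0x3FF);

            dst[out++] = (unsigned char)(hi & 0xFF);
            dst[out++] = (unsigned char)(hi >> 8);
            dst[out++] = (unsigned char)(lo & 0xFF);
            dst[out++] = (unsigned char)(lo >> 8);
        }
    }
    return out;
}
```

The rejection branch matters for Perl strings in particular, since they may hold code points that have no UTF-16 representation.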
+
+ Take care also that NULs in your Perl string don't confuse the C
+ library. If possible, give the string's length to the C library; if that's
+ not possible, consider rejecting strings that contain NUL bytes.
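One way to reject such strings is to scan the buffer before handing it over. The sketch below is plain C with a hypothetical helper name, C<buffer_is_nul_free>; in XS code the pointer and length would typically be the ones obtained from C<SvPVbyte>.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the len-byte buffer contains no NUL
 * byte, so it is safe to give to an API that relies on NUL termination;
 * returns 0 if an embedded NUL would truncate it. */
static int buffer_is_nul_free(const char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len == 0)
        return 1;
    return memchr(buf, '\0', len) == NULL;
}
```

A caller would check the result and raise an error when it is 0, rather than pass a silently truncated string to the library.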
+
+ =head3 What about C<SvPV>, C<SvPV_nolen>, etc.?
+
+ Consider a 3-character Perl string C<$foo = "\x64\x78\x8c">.
+ Perl can store these 3 characters in either of two ways:
+
+ =over
+
+ =item * bytes: 0x64 0x78 0x8c
+
+ =item * UTF-8: 0x64 0x78 0xc2 0x8c
+
+ =back
+
+ Now let's say you convert C<$foo> to a C string thus:
+
+     STRLEN len;
+     char *str = SvPV(foo_sv, len);
+
+ At this point C<str> could point to a 3-byte C string or a 4-byte one.
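The two storage forms listed above can be reproduced with a small plain-C sketch. The helper name C<latin1_to_utf8> is hypothetical, and the code assumes an ASCII-based platform and code points no larger than 0xFF; it only imitates the byte-level effect of upgrading a string and is not Perl's own routine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encodes n code points in the range 0x00-0xFF as
 * UTF-8 into dst, which must have room for 2 * n bytes. Returns the
 * number of bytes written. */
static size_t latin1_to_utf8(const unsigned char *src, size_t n,
                             unsigned char *dst)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        unsigned char c = src[i];

        if (c < 0x80) {
            /* ASCII is stored identically in both forms. */
            dst[out++] = c;
        } else {
            /* 0x80-0xFF becomes a two-byte sequence. */
            dst[out++] = (unsigned char)(0xC0 | (c >> 6));
            dst[out++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

Given the 3 bytes 0x64 0x78 0x8c, the helper writes the 4 bytes 0x64 0x78 0xc2 0x8c, which is why the length seen through C<SvPV> depends on how the SV happens to be stored.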
+
+ Generally speaking, we want C<str> to be the same regardless of how
+ Perl stores C<$foo>, so the ambiguity here is undesirable. C<SvPVbyte>
+ and C<SvPVutf8> solve that by giving predictable output: use
+ C<SvPVbyte> if your C library expects byte strings, or C<SvPVutf8>
+ if it expects UTF-8.
+
+ If your C library happens to support both encodings, then C<SvPV>--always
+ in tandem with lookups to C<SvUTF8>!--may be safe and (slightly) more
+ efficient.
+
+ B<TESTING TIP:> Use L<utf8>'s C<upgrade> and C<downgrade> functions
+ in your tests to ensure consistent handling regardless of Perl's
+ internal encoding.
+
=head2 How do I convert a string to UTF-8?

If you're mixing UTF-8 and non-UTF-8 strings, it is necessary to upgrade