@@ -153,37 +153,73 @@ Perl's own functions typically add a trailing C<NUL> for this reason.
  Nevertheless, you should be very careful when you pass a string stored
  in an SV to a C function or system call.

- To access the actual value that an SV points to, you can use the macros:
-
-     SvIV(SV*)
-     SvUV(SV*)
-     SvNV(SV*)
-     SvPV(SV*, STRLEN len)
-     SvPV_nolen(SV*)
-
- which will automatically coerce the actual scalar type into an IV, UV, double,
- or string.
-
- In the C<SvPV> macro, the length of the string returned is placed into the
- variable C<len> (this is a macro, so you do I<not> use C<&len>). If you do
- not care what the length of the data is, use the C<SvPV_nolen> macro.
- Historically the C<SvPV> macro with the global variable C<PL_na> has been
- used in this case. But that can be quite inefficient because C<PL_na> must
+ To access the actual value that an SV points to, Perl's API exposes
+ several macros that coerce the actual scalar type into an IV, UV, double,
+ or string:
+
+ =over
+
+ =item * C<SvIV(SV*)> (C<IV>) and C<SvUV(SV*)> (C<UV>)
+
+ =item * C<SvNV(SV*)> (C<double>)
+
+ =item * Strings are a bit complicated:
+
+ =over
+
+ =item * Byte string: C<SvPVbyte(SV*, STRLEN len)> or C<SvPVbyte_nolen(SV*)>
+
+ If the Perl string is C<"\xff\xff">, then this returns a 2-byte C<char*>.
+
+ =item * UTF-8 string: C<SvPVutf8(SV*, STRLEN len)> or C<SvPVutf8_nolen(SV*)>
+
+ If the Perl string is C<"\xff\xff">, then this returns a 4-byte C<char*>.
+
+ =item * You can also use C<SvPV(SV*, STRLEN len)> or C<SvPV_nolen(SV*)>
+ to fetch the SV's raw internal buffer. This is tricky, though; if your Perl
+ string
+ is C<"\xff\xff">, then depending on the SV's internal encoding you might get
+ back a 2-byte B<OR> a 4-byte C<char*>.
+ Moreover, if it's the 4-byte string, that could come from either Perl
+ C<"\xff\xff"> stored UTF-8 encoded, or Perl C<"\xc3\xbf\xc3\xbf"> stored
+ as raw octets. To differentiate between these you B<MUST> look up the
+ SV's UTF8 bit (cf. C<SvUTF8>) to know whether the source Perl string
+ is 2 characters (C<SvUTF8> would be on) or 4 characters (C<SvUTF8> would be
+ off).
+
+ B<IMPORTANT:> Use of C<SvPV>, C<SvPV_nolen>, or
+ similarly-named macros I<without> looking up the SV's UTF8 bit is
+ almost certainly a bug.
+
+ =back
+
+ (See L</How do I pass a Perl string to a C library?> for more details.)
+
+ In C<SvPVbyte>, C<SvPVutf8>, and C<SvPV>, the length of the C<char*> returned
+ is placed into the
+ variable C<len> (these are macros, so you do I<not> use C<&len>). If you do
+ not care what the length of the data is, use C<SvPVbyte_nolen>,
+ C<SvPVutf8_nolen>, or C<SvPV_nolen> instead.
+ The global variable C<PL_na> can also be given to
+ C<SvPVbyte>/C<SvPVutf8>/C<SvPV>
+ in this case. But that can be quite inefficient because C<PL_na> must
  be accessed in thread-local storage in threaded Perl. In any case, remember
  that Perl allows arbitrary strings of data that may both contain NULs and
  might not be terminated by a C<NUL>.

- Also remember that C doesn't allow you to safely say C<foo(SvPV (s, len),
+ Also remember that C doesn't allow you to safely say C<foo(SvPVbyte (s, len),
  len);>. It might work with your
  compiler, but it won't work for everyone.
  Break this sort of statement up into separate assignments:

      SV *s;
      STRLEN len;
      char *ptr;
-     ptr = SvPV (s, len);
+     ptr = SvPVbyte (s, len);
      foo(ptr, len);

+ =back
+
  If you want to know if the scalar value is TRUE, you can use:

      SvTRUE(SV*)
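
As an illustration of the byte-versus-UTF-8 distinction this hunk documents, here is a minimal plain-C sketch of why the 2-character string "\xff\xff" occupies 2 bytes in one representation and 4 in the other. It deliberately does not use the Perl API: latin1_to_utf8 is a hypothetical helper, not something perl provides, and it only handles code points 0x00-0xFF.

```c
#include <stddef.h>

/* Hypothetical helper (NOT part of the Perl API): treat each input byte as
 * one code point in the range 0x00-0xFF and encode it as UTF-8 into `out`.
 * `out` must have room for 2 * len bytes.
 * Returns the number of bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[n++] = in[i];                  /* ASCII: one byte, unchanged */
        } else {
            out[n++] = 0xC0 | (in[i] >> 6);    /* lead byte: 0xC2 or 0xC3 */
            out[n++] = 0x80 | (in[i] & 0x3F);  /* continuation byte */
        }
    }
    return n;
}
```

Given the two input bytes 0xff 0xff, the helper writes the four bytes 0xc3 0xbf 0xc3 0xbf, which is the 4-byte form the hunk attributes to SvPVutf8; the untouched input is the 2-byte form attributed to SvPVbyte.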
@@ -200,7 +236,7 @@ add space for the trailing C<NUL> byte (perl's own string functions typically do
  C<SvGROW(sv, len + 1)>).

  If you want to write to an existing SV's buffer and set its value to a
- string, use SvPV_force () or one of its variants to force the SV to be
+ string, use SvPVbyte_force () or one of its variants to force the SV to be
  a PV. This will remove any of various types of non-stringness from
  the SV while preserving the content of the SV in the PV. This can be
  used, for example, to append data from an API function to a buffer
@@ -3243,6 +3279,66 @@ There is no published API for dealing with this, as it is subject to
  change, but you can look at the code for C<pp_lc> in F<pp.c> for an
  example as to how it's currently done.

+ =head2 How do I pass a Perl string to a C library?
+
+ A Perl string, conceptually, is an opaque sequence of code points.
+ Many C libraries expect their inputs to be "classical" C strings, which are
+ arrays of octets 1-255, terminated with a NUL byte. Your job when writing
+ an interface between Perl and a C library is to define the mapping between
+ Perl and that library.
+
+ Generally speaking, C<SvPVbyte> and related macros suit this task well.
+ These assume that your Perl string is a "byte string", i.e., is either
+ raw, undecoded input into Perl or is pre-encoded to, e.g., UTF-8.
+
+ Alternatively, if your C library expects UTF-8 text, you can use
+ C<SvPVutf8> and related macros. This has the same effect as encoding
+ to UTF-8 then calling the corresponding C<SvPVbyte>-related macro.
+
+ Some C libraries may expect other encodings (e.g., UTF-16LE). To give
+ Perl strings to such libraries,
+ you must either do that encoding in Perl then use C<SvPVbyte>, or
+ use an intermediary C library to convert from however Perl stores the
+ string to the desired encoding.
+
+ Take care also that NULs in your Perl string don't confuse the C
+ library. If possible, give the string's length to the C library; if that's
+ not possible, consider rejecting strings that contain NUL bytes.
+
+ =head3 What about C<SvPV>, C<SvPV_nolen>, etc.?
+
+ Consider a 3-character Perl string C<$foo = "\x64\x78\x8c">.
+ Perl can store these 3 characters in either of two ways:
+
+ =over
+
+ =item * bytes: 0x64 0x78 0x8c
+
+ =item * UTF-8: 0x64 0x78 0xc2 0x8c
+
+ =back
+
+ Now let's say you convert C<$foo> to a C string thus:
+
+     STRLEN len;
+     char *str = SvPV(foo_sv, len);
+
+ At this point C<str> could point to a 3-byte C string or a 4-byte one.
+
+ Generally speaking, we want C<str> to be the same regardless of how
+ Perl stores C<$foo>, so the ambiguity here is undesirable. C<SvPVbyte>
+ and C<SvPVutf8> solve that by giving predictable output: use
+ C<SvPVbyte> if your C library expects byte strings, or C<SvPVutf8>
+ if it expects UTF-8.
+
+ If your C library happens to support both encodings, then of course
+ C<SvPV>--always in tandem with lookups to C<SvUTF8>!--is preferable since
+ it will avoid superfluous encoding/decoding operations.
+
+ B<TESTING TIP:> Use L<utf8>'s C<upgrade> and C<downgrade> functions
+ in your tests to ensure consistent handling regardless of Perl's
+ internal encoding.
+
  =head2 How do I convert a string to UTF-8?

  If you're mixing UTF-8 and non-UTF-8 strings, it is necessary to upgrade
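
To make the SvPV ambiguity described in this hunk concrete, the following plain-C sketch shows how the same buffer means different things depending on a UTF-8 flag. It is only an illustration under stated assumptions: struct pv_view and char_count are hypothetical names, not Perl's SV structure or any perl function, and the counting assumes well-formed UTF-8.

```c
#include <stddef.h>

/* Hypothetical stand-in for what an XS author learns from SvPV plus SvUTF8:
 * a buffer, its length in bytes, and whether the buffer is UTF-8 encoded.
 * This is NOT Perl's SV structure. */
struct pv_view {
    const unsigned char *buf;
    size_t len;
    int is_utf8;
};

/* Count the characters the buffer represents.
 * Assumes well-formed UTF-8 when is_utf8 is set. */
size_t char_count(const struct pv_view *v)
{
    if (!v->is_utf8)
        return v->len;              /* raw octets: one character per byte */

    size_t count = 0;
    for (size_t i = 0; i < v->len; i++) {
        /* Bytes of the form 10xxxxxx continue a character; skip them. */
        if ((v->buf[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the 4-byte buffer 0x64 0x78 0xc2 0x8c from the example in the hunk, char_count reports 3 characters with the flag set and 4 with it clear, which is why reading the raw buffer without consulting the flag cannot tell the two source strings apart.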