
Commit e814197

Fix bug in mb_get_substr_slow (sometimes outputs wrong number of characters)
Thanks to Maurício Fauth for finding and reporting this bug. The bug was introduced in October 2022. It originally affected only text encodings which do not have a fixed byte width per character and for which mbstring does not have an mblen_table. However, I recently made another change to mbstring, such that mb_substr no longer relies on the mblen_table even if one is available. Because of that change, the bug introduced in October 2022 came to affect a greater number of text encodings, including UTF-8.
1 parent 87c906c commit e814197

2 files changed: +9 -2 lines changed
ext/mbstring/mbstring.c

+3 -2
@@ -2092,9 +2092,10 @@ static zend_string* mb_get_substr_slow(unsigned char *in, size_t in_len, size_t
 		if (from >= out_len) {
 			from -= out_len;
 		} else {
-			enc->from_wchar(wchar_buf + from, MIN(out_len - from, len), &buf, !in_len || out_len >= len);
+			size_t needed_codepoints = MIN(out_len - from, len);
+			enc->from_wchar(wchar_buf + from, needed_codepoints, &buf, !in_len || out_len >= len);
 			from = 0;
-			len -= MIN(out_len, len);
+			len -= needed_codepoints;
 		}
 	}
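The effect of the hunk above can be reproduced with a small stand-alone sketch. It is not the mbstring code: `substr_chunked` and its `use_fix` flag are hypothetical names, the input is assumed to be plain ASCII so that one byte is one codepoint, and `memcpy` stands in for `enc->from_wchar`. Only the bookkeeping of `from` and `len` across 128-codepoint chunks follows the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 128 /* assumed chunk size, matching the 128 mentioned in the test comment */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical stand-in for mb_get_substr_slow, ASCII only.
 * Copies up to `len` characters starting at character offset `from` into
 * `out` and returns how many characters were copied. `use_fix` selects the
 * corrected bookkeeping (non-zero) or the old, buggy one (zero). */
static size_t substr_chunked(const char *in, size_t in_len, size_t from,
                             size_t len, char *out, int use_fix)
{
    size_t written = 0;

    while (in_len && len) {
        /* "Decode" the next chunk of at most CHUNK codepoints. */
        size_t out_len = MIN(in_len, CHUNK);
        const char *chunk = in;
        assert(out_len <= CHUNK);
        in += out_len;
        in_len -= out_len;

        if (from >= out_len) {
            /* The whole chunk lies before the requested start offset. */
            from -= out_len;
        } else {
            size_t needed_codepoints = MIN(out_len - from, len);
            memcpy(out + written, chunk + from, needed_codepoints);
            written += needed_codepoints;
            /* The old code subtracted MIN(out_len, len), which also counts
             * the `from` characters that were skipped, not emitted. */
            len -= use_fix ? needed_codepoints : MIN(out_len, len);
            from = 0;
        }
    }
    return written;
}
```

With a 141-character input, `from = 19` and `len = 121` (the offsets implied by the new test case), the old bookkeeping stops after the first chunk with only 109 characters copied, because `len` is reduced by 121 even though 19 of the 128 decoded characters were skipped. With the fix, `len` drops by 109, the loop continues into the second chunk, and all 121 characters are copied. When `from` is 0 the two variants agree, which is consistent with the bug only showing up for a non-zero start offset.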

ext/mbstring/tests/mb_substr.phpt

+6
@@ -133,6 +133,11 @@ echo "Regression:\n";
 $str = "\xbd\xbd\xbd\xbd\xbd\xbd\xbd\xbe\xbd\xbd\xbd\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x89\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x00\x00\x00\x00\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8b\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8f\x8b\x8b\x8b\xbd\xbd\xbd\xbd\xbd\xbd\xbd\xbe\x01:O\xaa\xd3";
 echo bin2hex(mb_substr($str, 0, 128, "JIS")), "\n";
 
+/* Alex messed up when reimplementing mb_substr and, in cases where `from` is non-zero and
+ * the number of characters to extract is more than 128, miscalculated where to end the substring
+ * Thanks to Maurício Fauth for finding the issue */
+var_dump(mb_substr('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum dapibus feugiat ex non cursus. Pellentesque vestibulum tellus sit lectus.', 19, -1));
+
 ?>
 --EXPECT--
 EUC-JP:
@@ -213,3 +218,4 @@ Testing agreement with mb_strpos on invalid UTF-8 string:
 ?AAA
 Regression:
 1b28493d3d3d3d3d3d3d3e3d3d3d1b28423f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f000000003f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f3f1b28493d3d3d3d3d3d3d3e1b2842013a4f1b28492a1b2842
+string(121) "it amet, consectetur adipiscing elit. Vestibulum dapibus feugiat ex non cursus. Pellentesque vestibulum tellus sit lectus"
