Commit e261316

- lexer-strings.rb: Avoid an exception on utf8 surrogate pair codepoints (#1051)
Starting from Ruby 2.4, these are a syntax error. I don't see an easy way of representing such strings. Right now the parser actually crashes (in all versions), so I'd say reporting a diagnostic instead is an improvement.
1 parent 6f54456 commit e261316

2 files changed: +28 -0 lines changed

lib/parser/lexer-strings.rl

+9
@@ -429,6 +429,15 @@ class Parser::LexerStrings
           break
         end
 
+        # UTF-16 surrogate pairs. These are actually accepted before Ruby 2.4
+        # but can't be represented in the AST. Make them a syntax error in
+        # all versions instead, Ruby would raise an exception otherwise.
+        if codepoint & 0xfffff800 == 0xd800
+          diagnostic :error, :invalid_unicode_escape, nil,
+                     range(codepoint_s, codepoint_s + codepoint_str.length)
+          break
+        end
+
         @escape += codepoint.chr(Encoding::UTF_8)
         codepoint_s += codepoint_str.length
       end
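
The mask test in the hunk above can be checked in isolation. The following is a minimal sketch, not code from the parser gem: `surrogate?` is a hypothetical helper that copies the condition from the diff, and the script compares it against a plain range check over U+D800..U+DFFF. It also shows the exception that `Integer#chr` raises for a surrogate, which is presumably the crash the commit message refers to.

```ruby
# Hypothetical helper (not part of the parser gem): the same mask test as in
# the diff. In Ruby, `&` binds tighter than `==`, so no parentheses are needed.
def surrogate?(codepoint)
  codepoint & 0xfffff800 == 0xd800
end

# Clearing the low 11 bits leaves 0xD800 exactly for the 2048 codepoints
# U+D800..U+DFFF, so the mask should agree with a range check everywhere.
mismatches = (0..0x10FFFF).reject do |cp|
  surrogate?(cp) == (0xD800..0xDFFF).cover?(cp)
end
puts "mask and range check agree: #{mismatches.empty?}"

# Without the guard, converting a surrogate to a UTF-8 string raises.
begin
  0xD800.chr(Encoding::UTF_8)
rescue RangeError => e
  puts "RangeError: #{e.message}"
end
```

The boundary values used in the test below (D7FF and E000 accepted, D800 and DFFF rejected) are the first codepoints on either side of that range.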

test/test_parser.rb

+19
@@ -5782,6 +5782,25 @@ def test_codepoint_too_large
       SINCE_1_9)
   end
 
+  def test_codepoint_surrogate
+    assert_diagnoses(
+      [:error, :invalid_unicode_escape],
+      %q{"\u{D800}"},
+      %q{    ~~~~ location})
+
+    assert_diagnoses(
+      [:error, :invalid_unicode_escape],
+      %q{"\u{DFFF}"},
+      %q{    ~~~~ location})
+
+    [
+      %q{"\u{D7FF}"},
+      %q{"\u{E000}"},
+    ].each do |code|
+      refute_diagnoses(code)
+    end
+  end
+
   def test_on_error
     assert_diagnoses(
       [:error, :unexpected_token, { :token => 'tIDENTIFIER' }],
