Commit 92a393e

Strip invalid chars without raising encoding exception regardless of encoding
On upgrade to faraday 2, HTTP responses can come back tagged with a specific encoding, per the content-type of the response. This affected one of our tests, which makes a live query to google.com (probably not a great idea) and gets back ISO-8859-1. But it would actually have failed with UTF-8 too, because we're trying to gsub some intentionally illegal bytes.

I am not sure the approach to encoding in this gem is totally sound in general. XML does have to be Unicode legally, but can be UTF-8 or UTF-16. This strip routine may be assuming UTF-8, but without actually tagging the string with the ruby UTF-8 encoding?

This is just an attempt to change as little as possible, staying as backward compatible as possible while passing tests in faraday 2, and to resolve the specific problem of strip_invalid_utf_8_chars failing when the response comes back with a ruby encoding tag from faraday 2.
1 parent 6f53af2 commit 92a393e

1 file changed, +14 -1 lines changed
lib/oai/client.rb

@@ -331,14 +331,27 @@ def parse_date(value)
     # Regex is from WebCollab:
     # http://webcollab.sourceforge.net/unicode.html
     def strip_invalid_utf_8_chars(xml)
-      xml && xml.gsub(/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]
+      return xml unless xml
+
+      # If it's in a specific encoding other than BINARY, it may trigger
+      # an exception to try to gsub these illegal bytes. Temporarily
+      # put it in BINARY. NOTE: We're not totally sure what's going on
+      # with encodings in this gem in general, it might not be totally reasonable.
+      orig_encoding = xml.encoding
+      xml.force_encoding("BINARY")
+
+      xml = xml.gsub(/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]
                              | [\x00-\x7F][\x80-\xBF]+
                              | ([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*
                              | [\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})
                              | [\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))
                              | (?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/x, '?')\
               .gsub(/\xE0[\x80-\x9F][\x80-\xBF]
                              | \xED[\xA0-\xBF][\x80-\xBF]/,'?')
+
+      xml.force_encoding(orig_encoding)
+
+      xml
     end
 
   end
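
One detail of the round trip above: `String#force_encoding` retags its receiver in place, while `gsub` returns a new string. As committed, the string the caller passed in therefore stays tagged BINARY, and only the returned copy gets the original encoding back. A minimal sketch of the same idea done on a copy is below. The helper name `gsub_as_binary` and the constant `CONTROL_CHARS` are hypothetical and not part of the gem, and the pattern is a simplified ASCII-only stand-in for the commit's full regex.

```ruby
# Hypothetical helper (not in the gem): run a byte-oriented gsub on a
# BINARY-tagged copy, then restore the original encoding tag on the result.
def gsub_as_binary(str, pattern, replacement)
  return str unless str

  original_encoding = str.encoding
  # String#b returns a copy tagged ASCII-8BIT (BINARY), so the caller's
  # string keeps its own encoding tag.
  cleaned = str.b.gsub(pattern, replacement)
  # Retag only the new string; no bytes are transcoded.
  cleaned.force_encoding(original_encoding)
end

# Simplified, ASCII-only pattern for illustration (an assumption, not the
# commit's regex): control characters other than tab, LF and CR.
CONTROL_CHARS = /[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/

# A body tagged ISO-8859-1, similar to what the commit message describes.
input = "ok\x01\xFFdone".dup.force_encoding("ISO-8859-1")
output = gsub_as_binary(input, CONTROL_CHARS, "?")

puts output.encoding  # encoding tag of the cleaned copy
puts input.encoding   # the caller's string is not retagged
```

Whether the caller in this gem relies on the argument being retagged is not something the diff shows, so treat this as an alternative to consider rather than a required fix.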
