Wrong result when parsing escaped unicode characters #120

@m1dnight

I tried parsing some Excel files that contain newlines, and encountered some errors while parsing the file.

Input

If a cell in an Excel sheet contains a newline, that Unicode value is not allowed in the standard. Therefore Excel stores it as _x000D_.
An underscore is also not allowed, and that one is encoded as _x005F_.
This means that a carriage return is encoded as _x005F_x000D_.
A document with a newline is properly parsed by the library.

I do have an Excel file (which I cannot share) that is parsed incorrectly. This might be because an older version of Excel created the file: as soon as I open and save it with my Excel version, it works fine.

When a cell contains the literal string _x000D_ it is parsed as _x005F_x000D_.
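To make the expected result concrete, here is a minimal sketch of a decoder for the _xHHHH_ convention. It is written in Python purely for illustration and is not the library's code; `unescape_cell` is a hypothetical name. It assumes escapes are resolved in a single left-to-right pass, so the underscore produced by _x005F_ is not scanned again.

```python
import re

# One escape of the form _xHHHH_, where HHHH is four hex digits.
ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")


def unescape_cell(raw: str) -> str:
    """Decode _xHHHH_ escapes in a single left-to-right pass.

    Hypothetical helper for illustration; not part of the library.
    """
    # re.sub never re-scans its own output, so the underscore produced
    # by _x005F_ cannot combine with the text that follows it.
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), raw)


print(repr(unescape_cell("line1_x000D_line2")))  # 'line1\rline2'
print(repr(unescape_cell("_x005F_x000D_")))      # '_x000D_'
```

Under that assumption, a stored _x005F_x000D_ would come back as the literal text _x000D_ instead of being returned unchanged.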

Guess

I have yet to find a specific reason why this happens, but I found this, which states that some Unicode characters are not allowed in XML 1.0 and are therefore escaped as _xHHHH_. The underscore in the prefix is also escaped, as _x005F_, which results in the entire string being represented as _x005F_x000D_.

This link tells us that CR is escaped and that the first underscore of its escaped representation is also escaped, leading a CR to be represented as _x005F_x000D_.

Proof

I have a test case in commit m1dnight@c335061 that shows the behavior.

I'm not sure, though, whether this is a bug in SAX.

Any ideas on how to proceed?
