
bodyfile: extend character escaping for special Unicode and non-Unicode characters #77

@joachimmetz


Certain file systems allow file names to contain code points that have a special meaning in Unicode, such as the surrogate U+D800, and/or values that are not Unicode characters at all.

The extended bodyfile 3 format currently does not specify how to handle these characters. The proposal is to escape them as "\u####" or "\U########", preferring the short form over the long form where possible.
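A minimal sketch of the proposed escaping, not an existing bodyfile implementation. The function names, the lowercase hexadecimal digits, and the choice of which code points count as "needing an escape" (surrogates and values above U+10FFFF) are assumptions made for illustration. Input is a sequence of integer code points, because a Python `str` cannot hold values above U+10FFFF. Escaping of the backslash itself is deliberately left out.

```python
def escape_code_point(code_point: int) -> str:
    """Format one code point as \\u#### when it fits in 16 bits, else \\U########."""
    if not 0 <= code_point <= 0xFFFFFFFF:
        raise ValueError("code point out of 32-bit range: {0!r}".format(code_point))
    if code_point <= 0xFFFF:
        # Short form is preferred whenever the value fits in 4 hex digits.
        return "\\u{0:04x}".format(code_point)
    return "\\U{0:08x}".format(code_point)


def needs_escape(code_point: int) -> bool:
    """Assumption: only surrogates and non-Unicode values are escaped here."""
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    is_beyond_unicode = code_point > 0x10FFFF
    return is_surrogate or is_beyond_unicode


def escape_path(code_points) -> str:
    """Build an escaped string from an iterable of integer code points."""
    parts = []
    for code_point in code_points:
        if needs_escape(code_point):
            parts.append(escape_code_point(code_point))
        else:
            parts.append(chr(code_point))
    return "".join(parts)


# Hypothetical file name: "a", a lone surrogate U+D800, "b".
print(escape_path([0x61, 0xD800, 0x62]))
# Hypothetical value outside the Unicode range, which needs the long form.
print(escape_path([0x110000]))
```

Working on integers rather than decoded strings keeps the sketch independent of how a particular file system back-end exposes its raw name data.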

Open questions

  • What about "Unicode compatibility characters"?
  • What about values in the range U+110000 through U+FFFFFFFF?
  • What if the original path uses a specific code page (encoding) and is converted to Unicode, but the resulting string can be encoded back into more than one byte sequence in the original encoding, e.g. U+2252 in cp932? What if two paths decode to the same string? How can the original path best be preserved?
  • What if a file name contains a path segment separator (e.g. \ or /)? If it is not escaped this leads to ambiguity: if / is the path segment separator, is 'test/1234' a single file name or a path?
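The code page question in the list above can be illustrated with a short sketch. It assumes that Python's built-in "cp932" codec is a reasonable stand-in for the code page used by the file system, and that the byte sequences 0x81 0xE0 and 0x87 0x90 both map to U+2252 in that codec. The variable names are hypothetical.

```python
# Two distinct byte sequences that are assumed to decode to U+2252 in cp932.
PATH_A = b"\x81\xe0"
PATH_B = b"\x87\x90"

decoded_a = PATH_A.decode("cp932")
decoded_b = PATH_B.decode("cp932")

# If both decode to the same string, the Unicode form alone cannot tell
# the two original paths apart.
print(decoded_a == decoded_b)

# Re-encoding produces a single byte sequence, so at most one of the two
# original paths survives a decode/encode round trip.
round_trip = decoded_a.encode("cp932")
print(round_trip == PATH_A, round_trip == PATH_B)
```

If the two original byte sequences need to stay distinguishable, the output would have to carry more than the decoded Unicode string, which is what the open question is asking about.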

Related discussion: dfxml-working-group/dfxml_schema#34

Also consider whether the format should be extended with a header that specifies its encoding.

Labels: enhancement