This repository has been archived by the owner on Nov 16, 2020. It is now read-only.

Invalid ALTO and PAGE export #38

Open
stweil opened this issue Oct 22, 2019 · 1 comment


stweil commented Oct 22, 2019

The exported ALTO and PAGE files are not valid XML. Validators report errors, and the PRIMA PageViewer refuses to load the files. Tested with an example from the ÖNB ground-truth (GT) data set:

$ ocr-validate alto-2-0 ONB_aze_18950706_1.alto 
mXSDFilename: /usr/local/share/ocr-fileformat/xsd/alto-2-0.xsd
mXMLFilename: ONB_aze_18950706_1.alto
ONB_aze_18950706_1.alto fails to validate because: 

cvc-id.1: There is no ID/IDREF binding for IDREF 'Times_New_Roman_4.5_______'.
At: 1:103402

$ ocr-validate page-2013-07-15 ONB_aze_18950706_1.xml 
mXSDFilename: /usr/local/share/ocr-fileformat/xsd/page-2013-07-15.xsd
mXMLFilename: /tmp/ONB_aze_18950706_1.xml
ONB_aze_18950706_1.xml fails to validate because: 

cvc-complex-type.2.4.d: Invalid content was found starting with element 'TranskribusMetadata'. No child element is expected at this point.
At: 12:290
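
The `cvc-id.1` error above means that some IDREF value in the ALTO file has no element with a matching `ID`. A minimal sketch of how such dangling references could be listed, assuming the reference sits in a `STYLEREFS` attribute (the log does not say which attribute is affected); the `dangling_refs` helper and the `SAMPLE` document are hypothetical and only mimic the reported value:

```python
import xml.etree.ElementTree as ET

# Attributes treated as ID references. Assumption for illustration:
# the failing reference is a STYLEREFS value.
REF_ATTRS = ("STYLEREFS",)


def dangling_refs(xml_text):
    """Return the sorted reference tokens that match no ID attribute."""
    root = ET.fromstring(xml_text)
    # Collect every declared ID in the document.
    ids = {el.get("ID") for el in root.iter() if el.get("ID") is not None}
    # Collect every referenced token (reference attributes are space-separated).
    refs = set()
    for el in root.iter():
        for attr in REF_ATTRS:
            refs.update(el.get(attr, "").split())
    return sorted(refs - ids)


# Hypothetical miniature ALTO document, not taken from the ÖNB file.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Styles>
    <TextStyle ID="TXT_0" FONTFAMILY="Times New Roman" FONTSIZE="4.5"/>
  </Styles>
  <Layout>
    <Page ID="Page1">
      <PrintSpace>
        <TextBlock ID="r1" STYLEREFS="Times_New_Roman_4.5_______"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""

print(dangling_refs(SAMPLE))  # ['Times_New_Roman_4.5_______']
```

An empty list would mean every reference resolves; it does not replace full schema validation with `ocr-validate`.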

hackmanschorsch commented Apr 9, 2020

One of the two formats is fixed now.

  • ALTO XML (fixed)
  • PAGE XML (still open)

For PAGE XML, we need to publish a new XSD. Even with that XSD, there is no guarantee that the PRIMA PageViewer will load the files.
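
Until such an XSD exists, one possible workaround on the consumer side is to drop the `TranskribusMetadata` element before validating against the stock `page-2013-07-15` schema. This is only a sketch under assumptions, not the fix planned here: the `strip_elements` helper, the sample document, and its `docId`/`pageId` attributes are hypothetical, and removing the element discards whatever Transkribus stored in it.

```python
import xml.etree.ElementTree as ET


def strip_elements(xml_text, local_name="TranskribusMetadata"):
    """Remove every element with the given local name.

    Returns the serialized document and the number of removed elements.
    Note: ElementTree may rewrite namespace prefixes (e.g. to ns0:) on output.
    """
    root = ET.fromstring(xml_text)
    removed = 0
    # Snapshot the element list first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            # Tags look like "{namespace}LocalName"; compare the local part only.
            if child.tag.rsplit("}", 1)[-1] == local_name:
                parent.remove(child)
                removed += 1
    return ET.tostring(root, encoding="unicode"), removed


# Hypothetical miniature PAGE document, not taken from the ÖNB file.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Metadata>
    <Creator>example</Creator>
    <Created>2019-10-22T00:00:00</Created>
    <LastChange>2019-10-22T00:00:00</LastChange>
    <TranskribusMetadata docId="0" pageId="0"/>
  </Metadata>
  <Page imageFilename="example.jpg" imageWidth="1" imageHeight="1"/>
</PcGts>"""

cleaned, count = strip_elements(SAMPLE)
print(count)  # 1
```

Whether the PRIMA PageViewer accepts the cleaned file is not something this sketch can confirm.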
