[Canonical-rebuilt] Several problems with CI titles #144

piconti · 2025-01-15T17:02:25Z

Running the linguistic processing has allowed @simon-clematide to identify a variety of problems with the current content-item titles, most of which have existed for some time.

Several issues have been opened relating to titles, and I think it would be beneficial to have one umbrella issue that encompasses all the others and gives us a better overall view of the problems that can occur with content-item titles.
In particular, it might show that these problems make more sense to fix at the level of a consolidated-rebuilt rather than at any earlier stage, given their diversity, the complexity of handling them in the first processing stages, and the fact that a consolidated-rebuilt is a better context in which to apply a unifying solution.

For each of these issues, we should investigate further and identify whether the problem comes from the processing (text-importer/rebuilt) or from the original OCR/OLR data. Based on this, each issue should be fixed either at the corresponding start of the pipeline or in a consolidated-rebuilt data stage, respectively.

The issues in question are:

Real titles vs Dummy titles #136: actual titles are missing on the page, and the title is in fact the first tokens/lines of the CI full text.
Missing content (only titles) in Escher Tageblatt #137: Tageblatt has CIs with titles but no full text. There are also many CIs whose full text has length 0 or 1, which can correspond to various situations (no text at all, incorrect region segments, or glued words).
HTML Entities in certain titles of Express #139: titles are shown differently in the app depending on where they are displayed. There is a special-character conversion problem at some point (is this in the data or only in the app?).
Filtering UNKNOWN/UNTITLED as markers for non-existing titles #140: some CIs have "untitled", "unknown", etc. as verbatim titles. These should probably be corrected afterwards, in the consolidated-rebuilt.
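
To make the four cases above more concrete, here is a minimal sketch of how a consolidated-rebuilt step could flag them per content item. Everything in it is an assumption made for illustration: the function name `classify_title`, the problem labels, the `PLACEHOLDER_TITLES` set and the sample strings are hypothetical and do not come from the text-importer or rebuilt code. The real placeholder strings and thresholds would have to be derived from the data.

```python
import html
from typing import List, Optional

# Hypothetical placeholder strings (assumption); the real set must come from the data.
PLACEHOLDER_TITLES = {"untitled", "unknown", "untitled article"}


def classify_title(title: Optional[str], full_text: Optional[str]) -> List[str]:
    """Return the suspected title problems for one content item (CI)."""
    problems = []
    title = (title or "").strip()
    text = (full_text or "").strip()

    # cf. #140: missing title, or a verbatim marker such as "UNTITLED"
    if not title or title.casefold() in PLACEHOLDER_TITLES:
        problems.append("placeholder_or_missing_title")

    # cf. #139: unescaping changes the string, so it contains HTML entities
    if html.unescape(title) != title:
        problems.append("html_entities_in_title")

    # cf. #137: full text of length 0 or 1
    if len(text) <= 1:
        problems.append("missing_full_text")
    else:
        # cf. #136: the title is just the beginning of the full text
        prefix = title.casefold().rstrip(".… ")
        if prefix and text.casefold().startswith(prefix):
            problems.append("dummy_title_from_full_text")

    return problems


# Invented sample values, for illustration only.
print(classify_title("UNTITLED", "Some text body here"))
print(classify_title("L&#39;Express du jour", "Texte complet de l'article"))
print(classify_title("Real headline", ""))
```

A check like this only labels CIs. Whether a flagged title should be dropped, replaced or left alone remains a separate decision for each of the linked issues.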


piconti assigned simon-clematide, e-maud and piconti Jan 15, 2025

e-maud mentioned this issue Jan 17, 2025

Solr: test new image collection (metadata and embeddings) impresso/impresso-middle-layer#489

Closed
