Running the linguistic processing has allowed @simon-clematide to identify a variety of problems with the current content-item titles, most of which have existed for some time.
Several issues have been opened relating to titles, and I think it would be beneficial to have one large issue that encompasses all the others and gives us a better overall view of the problems that can occur with content-item titles.
In particular, it might show that these problems make more sense to fix at the level of a consolidated-rebuilt stage rather than at any earlier stage, given their diversity, the complexity of handling them in the first processing stages, and the fact that a consolidated stage is a better context in which to apply a unifying solution.
For each of these issues, we should investigate further and identify whether the problem comes from the processing (text-importer/rebuilt) or from the original OCR/OLR data. Based on this, each issue should be fixed either at the corresponding start of the pipeline or in a consolidated-rebuilt data stage, respectively.
The issues in question are:
Real titles vs Dummy titles #136: when actual titles are missing on the page, the title is actually the first tokens/lines of the CI full text.
Missing content (only titles) in Escher Tageblatt #137: tageblatt had CIs with titles but no full text. There are also many CIs whose full text has length 0 or 1, which can correspond to various situations (no text at all, incorrect region segments, or glued words).
HTML Entities in certain titles of Express #139: titles show differently in the app depending on where they are displayed. There is a special-character conversion problem at some point (is this in the data or only in the app?).
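To get an overall view of how often each of these cases occurs, the three checks could be sketched roughly as below. This is only a minimal sketch and not the actual pipeline code: the function name `diagnose_title`, the flag names, and the plain `title`/`fulltext` string arguments are hypothetical, and "length 0 or 1" is assumed here to mean characters after trimming whitespace. A title that matches the start of the full text is only a candidate dummy title, since a real title may also be repeated at the start of the text.

```python
import html


def diagnose_title(title, fulltext, max_short_len=1):
    """Return a list of hypothetical flags for one content item (CI)."""
    flags = []

    # Normalise whitespace so line breaks do not hide a prefix match.
    norm_title = " ".join(title.split())
    norm_text = " ".join(fulltext.split())

    # Candidate for #136: the title is just the first tokens of the full text.
    if norm_title and norm_text.startswith(norm_title):
        flags.append("dummy_title_candidate")

    # Candidate for #137: full text is empty or at most max_short_len characters
    # (assumption: length is counted in characters, not tokens).
    if len(norm_text) <= max_short_len:
        flags.append("missing_or_short_fulltext")

    # Candidate for #139: the title still contains HTML entities,
    # i.e. unescaping it changes the string.
    if html.unescape(title) != title:
        flags.append("html_entities")

    return flags
```

Counting these flags per newspaper title would show whether a given problem is already present in the stored data (for example, escaped entities in the title string) or only appears later, in the app.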