Improve WET text extraction, address #45 and #46 #47

sebastian-nagel · 2025-02-20T23:45:48Z

A workaround that reads until the closing `</script>` or `</style>` tag. This prevents script or style blocks that contain HTML snippets from ending prematurely and adding unwanted text to the extract.

Note: this is more of a workaround because the underlying HTML parser library, htmlparser.org, has not been maintained for 14 years, so fixing the issue there is not possible. Main changes:

- use `Lexer.parseCDATA(false)`, which disables the "quotesmart" feature that tries to ignore HTML end tags inside quotes. "quotesmart" sometimes fails, and the script/style block is then prematurely "closed" by a `</[a-z]` sequence.
- while inside a script/style block, loop until the parser returns a closing `</script>` or `</style>` tag, and append everything else to the cached text value of the script/style block.
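The loop described in the second item can be illustrated with a minimal, self-contained sketch. It does not use the htmlparser API and is not the code of this PR: the class `RawTextBlockSketch` and its methods are hypothetical names that work on a plain string, and only show the idea that any `</x` sequence other than the matching closing tag stays part of the block.

```java
public class RawTextBlockSketch {

    /** Returns the index of the closing tag for tagName at or after from, or -1 if there is none. */
    static int indexOfClosingTag(String html, int from, String tagName) {
        int idx = html.indexOf("</", from);
        while (idx >= 0) {
            int nameEnd = idx + 2 + tagName.length();
            // Case-insensitive comparison of the tag name following "</".
            boolean nameMatches = html.regionMatches(true, idx + 2, tagName, 0, tagName.length());
            // The name must be followed by '>', '/', whitespace, or end of input,
            // so that "</scripts>" is not mistaken for "</script>".
            boolean terminated = nameEnd >= html.length()
                    || html.charAt(nameEnd) == '>'
                    || html.charAt(nameEnd) == '/'
                    || Character.isWhitespace(html.charAt(nameEnd));
            if (nameMatches && terminated) {
                return idx;
            }
            // Any other "</x" sequence (e.g. "</div>" inside a JS string) belongs to the block.
            idx = html.indexOf("</", idx + 2);
        }
        return -1;
    }

    /** Returns the raw content of a script/style block whose content starts at contentStart. */
    static String readRawText(String html, int contentStart, String tagName) {
        int end = indexOfClosingTag(html, contentStart, tagName);
        // Without a closing tag, the block runs to the end of the input.
        return html.substring(contentStart, end < 0 ? html.length() : end);
    }

    public static void main(String[] args) {
        String html = "<script>var s = '<div>x</div>';</script><p>visible text</p>";
        int contentStart = html.indexOf('>') + 1;
        // Prints: var s = '<div>x</div>';
        System.out.println(readRawText(html, contentStart, "script"));
    }
}
```

In this sketch the `</div>` inside the JavaScript string does not end the block, so the string content is kept out of the visible text.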

Tested so far:

- the WET extracts of the pages referenced in "WET extracts may contain embedded CSS" (#45) and "WET extracts may include embedded Javascript" (#46) no longer contain CSS or JavaScript snippets. In addition, some more text is extracted.
- a "before and after" comparison of about 1000 WET records also shows better text extracts overall. There are a few regressions, see below.
- verified the WAT extraction, which is also affected by this fix: there are very small differences in the number of extracted URLs.

Regressions (less text extracted now):

- if the script or style block includes an opening HTML comment marker (`<!--`), everything up to the closing marker (`-->`) or until end of file is consumed as a comment by the parser. Not fixable from outside of "htmlparser.org": a fix would require overriding the method `org.htmlparser.lexer.Lexer.parseString(...)` with a version that ignores HTML comments only when inside a script or style block.

- templates that include HTML snippets are now ignored (not really a regression, but less text is extracted). For example:

<script type="text/x-ph-tmpl" ...><!-- HTML snippet --></script>
<script type="text/html" id="tmpl-..."><!-- HTML snippet --></script>
<script id="ckyBannerTemplate" type="text/template"><!-- HTML snippet --></script>
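The first regression above (an opening comment marker inside a script or style block) can be made concrete with a small sketch. It is a toy model, not htmlparser code: the class `CommentInRawTextSketch` and its methods are hypothetical, and the comment-aware variant only approximates the behavior described in this PR, under the assumption that a comment runs to the next `-->` or to end of file.

```java
public class CommentInRawTextSketch {

    /** Case-insensitive indexOf; returns -1 if needle does not occur at or after from. */
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    /** Toy model of a lexer that honors "<!--" even inside a script block. */
    static int scriptEndCommentAware(String html, int from) {
        int close = indexOfIgnoreCase(html, "</script", from);
        int comment = html.indexOf("<!--", from);
        if (comment >= 0 && (close < 0 || comment < close)) {
            int commentEnd = html.indexOf("-->", comment + 4);
            if (commentEnd < 0) {
                // Unterminated comment: everything up to end of file is swallowed.
                return html.length();
            }
            // Continue the search after the comment.
            return scriptEndCommentAware(html, commentEnd + 3);
        }
        return close < 0 ? html.length() : close;
    }

    /** Variant that ignores comment markers while inside the script block. */
    static int scriptEndIgnoringComments(String html, int from) {
        int close = indexOfIgnoreCase(html, "</script", from);
        return close < 0 ? html.length() : close;
    }

    public static void main(String[] args) {
        String html = "<script>var a = 1; <!-- var b = 2;</script><p>visible text</p>";
        int contentStart = "<script>".length();
        System.out.println(scriptEndCommentAware(html, contentStart) == html.length());
        System.out.println(scriptEndIgnoringComments(html, contentStart) == html.indexOf("</script>"));
    }
}
```

In the example input, the comment-aware variant runs to the end of the document, so the paragraph after the script is lost, while the variant that ignores comment markers stops at `</script>`.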

- work around to read until closing `</script>` or `</style>` tag. Avoid that scripts or styles containing HTML snippets end prematurely and add bad textual content.
- add `<center>` and `<noframes>` tags as HTML block elements

wumpus · 2025-02-21T00:35:01Z

Ouch. This looks like a good minimal fix.

Improve WET text extraction, address #45 and #46

01052bc


sebastian-nagel merged commit ed57699 into master Mar 10, 2025
5 checks passed

This was referenced Mar 10, 2025

WET extracts may contain embedded CSS #45

Closed

WET extracts may include embedded Javascript #46

Closed
