Improve WET text extraction, address #45 and #46 #47

Open
sebastian-nagel wants to merge 1 commit into base: master

sebastian-nagel commented Feb 20, 2025

A workaround that reads until the closing </script> or </style> tag. This prevents scripts or styles that contain HTML snippets from ending prematurely and adding bad textual content.

Note: this is more of a workaround because the underlying HTML parser library, htmlparser.org, has not been maintained for 14 years, so fixing the issue there is not possible. Main changes:

  • use Lexer.parseCDATA(false), which disables the "quotesmart" feature that tries to ignore HTML end tags inside quotes. The "quotesmart" logic sometimes fails, with the result that the script/style block is prematurely "closed" by a </[a-z] sequence.
  • if the state is inside a script/style block, loop until the parser returns a closing </script> or </style> tag. Everything else is appended to the cached text value of the script/style block.
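
The idea behind the second change can be illustrated with a minimal, self-contained sketch. It does not use htmlparser and is not the code of this pull request: the class `RawTextScanner` and the method `findClosingTag` are hypothetical names chosen for illustration. The sketch scans raw text for the matching closing tag only, ignoring quotes, so that a sequence such as `</div>` inside a script does not end the block.

```java
final class RawTextScanner {

    /**
     * Returns the index of the first closing tag "</tagName" at or after
     * 'from' that is followed by '>', '/' or whitespace, or html.length()
     * if no such tag exists. Quotes are deliberately not interpreted, and
     * other end tags (e.g. "</div>") are treated as part of the block.
     * Hypothetical helper, for illustration only.
     */
    public static int findClosingTag(String html, int from, String tagName) {
        String needle = "</" + tagName;
        for (int pos = Math.max(from, 0); pos + needle.length() < html.length(); pos++) {
            // case-insensitive comparison without copying the input
            if (!html.regionMatches(true, pos, needle, 0, needle.length())) {
                continue;
            }
            // reject longer tag names such as "</scripts>"
            char next = html.charAt(pos + needle.length());
            if (next == '>' || next == '/' || Character.isWhitespace(next)) {
                return pos;
            }
        }
        return html.length();
    }

    public static void main(String[] args) {
        String html = "<script>var s = \"</div>\"; var t = 1;</script><p>Hello</p>";
        int start = "<script>".length();
        int end = findClosingTag(html, start, "script");
        // Everything between start and end belongs to the script block.
        System.out.println(html.substring(start, end));
    }
}
```

In the sketch, an unterminated block consumes the rest of the input, which is an assumption made for simplicity and not necessarily how the extractor behaves.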

Tested so far:

  1. the WET extracts of the pages referenced in "WET extracts may contain embedded CSS" (#45) and "WET extracts may include embedded Javascript" (#46) no longer contain CSS or JavaScript snippets. In addition, some more text is extracted.
  2. a "before and after" comparison of about 1000 WET records also shows better text extracts overall. There are a few regressions, see below.
  3. verified the WAT extraction, which is also affected by this fix: there are very small differences in the number of extracted URLs.

Regressions (less text extracted now):

  • if the script or style block includes an opening HTML comment marker (<!--) but no closing marker, the closing </script> or </style> tag is not found. That is, everything up to the next HTML comment end marker (-->) or the end of the file is consumed as a comment by the parser. This is not fixable from outside of "htmlparser.org": a fix would require overriding the method org.htmlparser.lexer.Lexer.parseString(...) with a version that ignores HTML comments only when inside a script or style block.
  • templates including HTML snippets are now ignored (not really a regression, but less text is extracted). For example:
    <script type="text/x-ph-tmpl" ...><!-- HTML snippet --></script>
    <script type="text/html" id="tmpl-..."><!-- HTML snippet --></script>
    <script id="ckyBannerTemplate" type="text/template"><!-- HTML snippet --></script>

- workaround to read until the closing `</script>`
  or `</style>` tag. Prevents scripts or styles
  containing HTML snippets from ending prematurely and
  adding bad textual content.
- add `<center>` and `<noframes>` tags as HTML block
  elements
wumpus (Member) commented Feb 21, 2025

Ouch. This looks like a good minimal fix.
