Improve WET text extraction, address #45 and #46 #47
A work-around to read until the closing </script> or </style> tag. This prevents script or style blocks that contain HTML snippets from ending prematurely and adding bad textual content.

Note: this is more of a work-around because the underlying HTML parser library, htmlparser.org, has not been maintained for 14 years. Fixing the issue there is not possible.

Main changes:
</[a-z] sequence.
</script> or </style> tag. Append everything else to the cached text value of the script/style block.

Tested so far:
Regressions (less text extracted now):
If a script or style block contains an HTML comment start marker (<!--) but no closing marker, the closing </script> or </style> element is not found. That is, everything until another HTML comment end marker (-->) or until the end of the file is consumed as a comment by the parser. Not fixable from outside of "htmlparser.org": a fix would require overriding the method org.htmlparser.lexer.Lexer.parseString(...) with a version that ignores HTML comments only inside a script or style block.
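
The idea behind the work-around and the regression can be illustrated with a small stand-alone sketch. This is not the code from this pull request and does not use the htmlparser.org API. The class RawTextBlockSketch and all of its methods are hypothetical names chosen for illustration. findClosingTag models "read until the matching closing tag", while findClosingTagSkippingComments models a lexer that first consumes HTML comments, which is assumed here to be what makes the closing tag unreachable when a comment is never terminated.

```java
public class RawTextBlockSketch {

    /** True if "</element" followed by '>', '/' or whitespace starts at index i (case-insensitive). */
    static boolean isClosingTagAt(String html, int i, String element) {
        String needle = "</" + element;
        int after = i + needle.length();
        if (after >= html.length()) {
            return false; // no room for the character that terminates the tag name
        }
        if (!html.regionMatches(true, i, needle, 0, needle.length())) {
            return false;
        }
        char c = html.charAt(after);
        return c == '>' || c == '/' || Character.isWhitespace(c);
    }

    /** Work-around idea: ignore any other "</x" sequence and stop only at the matching closing tag. */
    static int findClosingTag(String html, int from, String element) {
        for (int i = Math.max(from, 0); i < html.length(); i++) {
            if (isClosingTagAt(html, i, element)) {
                return i;
            }
        }
        return -1;
    }

    /** Content of the script/style block starting at 'from'; the rest of the input if no closing tag exists. */
    static String rawContent(String html, int from, String element) {
        int end = findClosingTag(html, from, element);
        return end < 0 ? html.substring(from) : html.substring(from, end);
    }

    /** Regression model (assumption): comments are consumed first, so an unterminated one hides the closing tag. */
    static int findClosingTagSkippingComments(String html, int from, String element) {
        int i = Math.max(from, 0);
        while (i < html.length()) {
            if (html.startsWith("<!--", i)) {
                int end = html.indexOf("-->", i + 4);
                if (end < 0) {
                    return -1; // comment runs to end of input
                }
                i = end + 3;
            } else if (isClosingTagAt(html, i, element)) {
                return i;
            } else {
                i++;
            }
        }
        return -1;
    }
}
```

With the input `<script>var s = "</div>";</script>after` and a start offset just past the opening tag, rawContent keeps the whole script body even though it contains `</div>`. For a block that opens a comment with `<!--` and never closes it, findClosingTagSkippingComments reports no closing tag, whereas the plain findClosingTag still finds it.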