@sebastian-nagel

This merges the upstream branch master into the code base, updating the version from 1.3.1-SNAPSHOT to 3.0.1-SNAPSHOT. Changes from upstream:

  • dependency upgrades (commons-io, commons-lang)
  • drop dependency on Apache Commons HttpClient 3.1
  • update effective_tld_names.dat and move to org/archive/effective_tld_names.dat to prevent conflict with the list shipped together with "crawler-commons"
  • upgrade to JUnit 5
  • removal of deprecated classes and methods
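
The relocation of the public suffix list matters because, when two jars ship a resource under the same path, the class loader returns whichever copy comes first on the classpath. A minimal sketch of how one might check for such duplicates follows; the class name `ResourceCollisionCheck` and the unqualified old resource path are assumptions for illustration, and only `org/archive/effective_tld_names.dat` is taken from the description above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

public class ResourceCollisionCheck {

    /** Lists every copy of the named resource visible to the system class loader. */
    static List<URL> copiesOnClasspath(String resourceName) {
        try {
            return Collections.list(
                    ClassLoader.getSystemClassLoader().getResources(resourceName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "effective_tld_names.dat",            // assumed old, unqualified location
            "org/archive/effective_tld_names.dat" // new location named in this PR
        };
        for (String name : names) {
            List<URL> copies = copiesOnClasspath(name);
            // More than one URL means several jars ship the same resource path.
            System.out.println(name + ": " + copies.size() + " copy/copies " + copies);
        }
    }
}
```

Run with both jars on the classpath, more than one URL for a name would indicate a collision.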

ato and others added 30 commits May 18, 2025 09:39
HttpClient 3 was discontinued in 2007 and frequently triggers alerts in dependency vulnerability scanners. We're also not using much of it anymore, with one big exception.

The URI class is the foundation of UsableURI and central to Heritrix, which has made removing the library difficult. URIException in particular appears a lot in client code. HttpClient 4+ has switched to java.net.URI, but the main reason Heritrix was built on HttpClient's URI in the first place is that java.net.URI is inflexible and differs from how browsers behave. (Although how browsers behave has shifted over time.)
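
As a rough illustration of that inflexibility (not code from this change), the sketch below feeds java.net.URI a few strings that a browser would typically accept and fix up. The class name `StrictUriDemo` and the sample URLs are made up for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class StrictUriDemo {

    /** Returns true if java.net.URI parses the string as-is, false if it rejects it. */
    static boolean parsesWithJavaNetUri(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://example.com/path", // plain URL
            "http://example.com/a b",  // raw space in the path
            "http://example.com/a|b",  // raw pipe in the path
        };
        for (String s : samples) {
            System.out.println(s + " -> " + parsesWithJavaNetUri(s));
        }
    }
}
```

The last two samples are rejected with a URISyntaxException, which is the kind of strictness a crawler has to work around when dealing with URLs found in the wild.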

Eventually we'll probably need to rework Heritrix's URI handling to follow the WHATWG URL spec. However, to let us remove the dependency while keeping UsableURI working, this copies HttpClient 3's URI, URIException and ChunkedInputStream, with some small tweaks to remove their dependency on other classes in HttpClient. The HttpClient Header class is replaced with our existing HttpHeader. URI and ChunkedInputStream are marked package-private for now.

This is a breaking API change and will trigger a bump of the major version number.
Remove dependency on Apache Commons HttpClient 3.1
Remove deprecated code for 2.0.0 release
Using RecordingInputStream requires an awkward workaround when the API being recorded is not in the form of an InputStream, for example, if it's asynchronous. This adds a method to access the underlying RecordingOutputStream so you can write to it directly when that would be easier.
…tream

Add RecordingInputStream.asOutputStream()
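
To illustrate the idea behind asOutputStream(), here is a self-contained sketch of a callback-style source writing directly into a recorder's output stream. `Recorder` and `pushChunks` are hypothetical stand-ins written for this example; they are not the webarchive-commons classes, and the real RecordingInputStream's constructor and other methods are not shown.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

public class RecordingSketch {

    /** Hypothetical stand-in for a recorder; NOT the webarchive-commons class. */
    static class Recorder {
        private final ByteArrayOutputStream recorded = new ByteArrayOutputStream();

        /** Exposes the underlying sink so callers can write to it directly. */
        OutputStream asOutputStream() {
            return recorded;
        }

        byte[] recordedBytes() {
            return recorded.toByteArray();
        }
    }

    /** Hypothetical push-style API: delivers chunks to a callback, no InputStream involved. */
    static void pushChunks(Consumer<byte[]> onChunk) {
        onChunk.accept("HTTP/1.1 200 OK\r\n\r\n".getBytes(StandardCharsets.US_ASCII));
        onChunk.accept("hello".getBytes(StandardCharsets.US_ASCII));
    }

    /** Records everything the push-style source delivers and returns the bytes. */
    static byte[] record() {
        Recorder recorder = new Recorder();
        OutputStream out = recorder.asOutputStream();
        pushChunks(chunk -> {
            try {
                out.write(chunk); // write each chunk as it arrives
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        return recorder.recordedBytes();
    }

    public static void main(String[] args) {
        System.out.println(record().length + " bytes recorded");
    }
}
```

Without direct access to the output side, the callback would have to be adapted into an InputStream first (for example via a pipe) just so it could be wrapped and read back.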
We'll do this in settings instead.
…lic suffixes file to avoid collisions with other jars
…parsing

Fix: public suffixes tld parsing

sebastian-nagel commented Aug 28, 2025

Successfully tested converting 8 WARC files to WAT and WET.

  • no regressions found
  • minor and rare improvements when converting HTML entities (&quot;, etc.), due to the jsoup upgrade:
    wat/CC-MAIN-20250803011606-20250803041606-00237.warc.wat.gz
    <               "title": "האתר פועל ברישיון תל&quotי",
    ---
    >               "title": "האתר פועל ברישיון תל\"י",
    
    wat/CC-MAIN-20250805191422-20250805221422-00000.warc.wat.gz
    <               "title": "&laquoОтделение почтовой связи Лошнів» - Посмотреть на большой карте",
    ---
    >               "title": "«Отделение почтовой связи Лошнів» - Посмотреть на большой карте",
    
    wat/CC-MAIN-20250812110712-20250812140712-00289.warc.wat.gz
    <               "title": "Карабины пожарные  D IN 5299C (ранее &quotс фиксатором\")",
    ---
    >               "title": "Карабины пожарные  D IN 5299C (ранее \"с фиксатором\")",
    <               "title": "Зажимы &quotкрокодилы\"",
    ---
    >               "title": "Зажимы \"крокодилы\"",
    

@sebastian-nagel sebastian-nagel merged commit ee10c72 into master Aug 28, 2025
5 checks passed

4 participants