Commit

Merge branch 'master' into sitemaps
anjackson authored May 20, 2021
2 parents 52093eb + c1bcdd9 commit 396467c
Showing 104 changed files with 3,680 additions and 2,207 deletions.
5 changes: 5 additions & 0 deletions .gitignore
@@ -11,3 +11,8 @@ contrib/target
*/.classpath
*/.project
*/.settings
.idea
*.iml
/adhoc.keystore
/heritrix_dmesg.log
/jobs
20 changes: 8 additions & 12 deletions .travis.yml
@@ -1,31 +1,27 @@
sudo: false

language: java

jdk:
- oraclejdk8
- openjdk7
- openjdk8

matrix:
allow_failures:
- jdk: openjdk7

include:
- jdk: oraclejdk8
dist: trusty
- jdk: openjdk8
- jdk: openjdk11

before_install:
- "export JAVA_OPTS=-Xmx1500m"
- "echo JAVA_OPTS=$JAVA_OPTS"
- "export MAVEN_OPTS=-Xmx1500m"
- "echo MAVEN_OPTS=$MAVEN_OPTS"
- "export _JAVA_OPTIONS=-Xmx1500m"
- "echo _JAVA_OPTIONS=$_JAVA_OPTIONS"

install: mvn dependency:resolve -B -V

cache:
directories:
- $HOME/.m2

script:
- travis_wait 30 mvn install
- cd contrib && mvn install

after_failure:
- cat */target/surefire-reports/*.txt
86 changes: 86 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,90 @@
# Change Log

## [Unreleased](https://github.com/internetarchive/heritrix3/tree/HEAD)

[Full Changelog](https://github.com/internetarchive/heritrix3/compare/3.4.0-20200518...HEAD)

**Closed issues:**

- Add support for the SFTP protocol [\#319](https://github.com/internetarchive/heritrix3/issues/319)

**Merged pull requests:**

- Fixes extractor multiple regex matcher recycle [\#335](https://github.com/internetarchive/heritrix3/pull/335) ([adam-miller](https://github.com/adam-miller))
- Remove deprecated sudo setting. [\#333](https://github.com/internetarchive/heritrix3/pull/333) ([dengliming](https://github.com/dengliming))

## [3.4.0-20200518](https://github.com/internetarchive/heritrix3/tree/3.4.0-20200518) (2020-05-18)
[Full Changelog](https://github.com/internetarchive/heritrix3/compare/3.4.0-20200304...3.4.0-20200518)

**Closed issues:**

- Cannot find class \[ExtractorYoutubeDL\] [\#322](https://github.com/internetarchive/heritrix3/issues/322)
- Checkpoints 'spoiled' when used to resume crawls [\#277](https://github.com/internetarchive/heritrix3/issues/277)

**Merged pull requests:**

- Fix match result is always false in MatchesListRegexDecideRule [\#328](https://github.com/internetarchive/heritrix3/pull/328) ([morokosi](https://github.com/morokosi))
- Add real crawlStatus in the crawlReport [\#326](https://github.com/internetarchive/heritrix3/pull/326) ([clawia](https://github.com/clawia))
- youtube-dl: request best medium-ish size format [\#325](https://github.com/internetarchive/heritrix3/pull/325) ([galgeek](https://github.com/galgeek))
- Add parsing for HTML tags \(data-\*\) [\#323](https://github.com/internetarchive/heritrix3/pull/323) ([clawia](https://github.com/clawia))
- Add support for the SFTP protocol [\#320](https://github.com/internetarchive/heritrix3/pull/320) ([bnfleb](https://github.com/bnfleb))

## [3.4.0-20200304](https://github.com/internetarchive/heritrix3/tree/3.4.0-20200304) (2020-03-04)
[Full Changelog](https://github.com/internetarchive/heritrix3/compare/3.4.0-20190418...3.4.0-20200304)

**Fixed bugs:**

- exception logged when opening/saving crawler-beans.cxml via web interface editor [\#305](https://github.com/internetarchive/heritrix3/issues/305)
- Java interface text editor error when saving crawler-beans.cxml [\#293](https://github.com/internetarchive/heritrix3/issues/293)
- Unable to upload crawler-beans.cxml with curl [\#282](https://github.com/internetarchive/heritrix3/issues/282)
- CookieStoreTest.testConcurrentLoad fails randomly [\#274](https://github.com/internetarchive/heritrix3/issues/274)

**Closed issues:**

- Contrib project has a maven dependency with an older version of guava library. [\#311](https://github.com/internetarchive/heritrix3/issues/311)
- BloomFilter64bitTest is slow [\#299](https://github.com/internetarchive/heritrix3/issues/299)
- ObjectIdentityBdbManualCacheTest is slow [\#297](https://github.com/internetarchive/heritrix3/issues/297)
- HTTPS console inaccessible via browser [\#279](https://github.com/internetarchive/heritrix3/issues/279)
- JDK11 support: ssl errors from console [\#275](https://github.com/internetarchive/heritrix3/issues/275)
- JDK11 support: FetchHTTPTest: ssl handshake\_failure [\#268](https://github.com/internetarchive/heritrix3/issues/268)
- JDK11 support: org.archive.util.ObjectIdentityBdbCacheTest failures [\#267](https://github.com/internetarchive/heritrix3/issues/267)
- JDK11 support: ClassNotFoundException: javax.transaction.xa.Xid [\#266](https://github.com/internetarchive/heritrix3/issues/266)
- JDK11 support: tools.jar [\#265](https://github.com/internetarchive/heritrix3/issues/265)
- JDK11 support: jaxb [\#264](https://github.com/internetarchive/heritrix3/issues/264)

**Merged pull requests:**

- Use the Wayback Machine to repair a link to Oracle docs. [\#315](https://github.com/internetarchive/heritrix3/pull/315) ([anjackson](https://github.com/anjackson))
- Utilize the `d` parameter [\#314](https://github.com/internetarchive/heritrix3/pull/314) ([hennekey](https://github.com/hennekey))
- Exclude hbase-client's guava 12 transitive dependency [\#312](https://github.com/internetarchive/heritrix3/pull/312) ([ato](https://github.com/ato))
- Fix stream closed exception for Paged view [\#308](https://github.com/internetarchive/heritrix3/pull/308) ([ldko](https://github.com/ldko))
- Fix stream closed exception by not closing output stream [\#306](https://github.com/internetarchive/heritrix3/pull/306) ([ato](https://github.com/ato))
- Replace custom Base32 encoding [\#304](https://github.com/internetarchive/heritrix3/pull/304) ([hennekey](https://github.com/hennekey))
- Replace constant with accessor methods [\#303](https://github.com/internetarchive/heritrix3/pull/303) ([hennekey](https://github.com/hennekey))
- limit ExtractorYoutubeDL heap usage [\#302](https://github.com/internetarchive/heritrix3/pull/302) ([nlevitt](https://github.com/nlevitt))
- fix logging config [\#301](https://github.com/internetarchive/heritrix3/pull/301) ([nlevitt](https://github.com/nlevitt))
- Use Guava instead of custom bloom filter implementation [\#300](https://github.com/internetarchive/heritrix3/pull/300) ([hennekey](https://github.com/hennekey))
- Speed up ObjectIdentityBdbManualCacheTest [\#298](https://github.com/internetarchive/heritrix3/pull/298) ([hennekey](https://github.com/hennekey))
- Set JUnit version to latest [\#296](https://github.com/internetarchive/heritrix3/pull/296) ([hennekey](https://github.com/hennekey))
- Disable test that connects to wwwb-dedup.us.archive.org [\#295](https://github.com/internetarchive/heritrix3/pull/295) ([ato](https://github.com/ato))
- Fix 'Method Not Allowed' on POST of config editor form [\#294](https://github.com/internetarchive/heritrix3/pull/294) ([ato](https://github.com/ato))
- Crawltrap regex timeout [\#290](https://github.com/internetarchive/heritrix3/pull/290) ([csrster](https://github.com/csrster))
- Bdb frontier access [\#289](https://github.com/internetarchive/heritrix3/pull/289) ([csrster](https://github.com/csrster))
- Attempt to filter out embedded images. [\#288](https://github.com/internetarchive/heritrix3/pull/288) ([csrster](https://github.com/csrster))
- change trough dedup `date` type to varchar. [\#287](https://github.com/internetarchive/heritrix3/pull/287) ([nlevitt](https://github.com/nlevitt))
- Add support for forced queue assignment and parallel queues [\#286](https://github.com/internetarchive/heritrix3/pull/286) ([adam-miller](https://github.com/adam-miller))
- Warc writer chain [\#285](https://github.com/internetarchive/heritrix3/pull/285) ([nlevitt](https://github.com/nlevitt))
- Fix jobdir PUT [\#283](https://github.com/internetarchive/heritrix3/pull/283) ([ato](https://github.com/ato))
- Upgrade BDB JE to version 7.5.11 - IMPORTANT CHANGE [\#281](https://github.com/internetarchive/heritrix3/pull/281) ([anjackson](https://github.com/anjackson))
- Mitigate random CookieStore.testConcurrentLoad test failures [\#280](https://github.com/internetarchive/heritrix3/pull/280) ([ato](https://github.com/ato))
- JDK11 support: upgrade to Jetty 9.4.19, Restlet 2.4.0 and drop JDK 7 support [\#276](https://github.com/internetarchive/heritrix3/pull/276) ([ato](https://github.com/ato))
- JDK11 support: remove unused class ObjectIdentityBdbCache and tests [\#273](https://github.com/internetarchive/heritrix3/pull/273) ([ato](https://github.com/ato))
- JDK11 support: upgrade maven-surefire-plugin to 2.22.2 [\#272](https://github.com/internetarchive/heritrix3/pull/272) ([ato](https://github.com/ato))
- JDK11 support: exclude tools.jar from hbase-client dependency [\#271](https://github.com/internetarchive/heritrix3/pull/271) ([ato](https://github.com/ato))
- Travis fixes [\#270](https://github.com/internetarchive/heritrix3/pull/270) ([ato](https://github.com/ato))
- WIP: ExtractorYoutubeDL [\#257](https://github.com/internetarchive/heritrix3/pull/257) ([nlevitt](https://github.com/nlevitt))
- Update README and add LICENSE.txt [\#256](https://github.com/internetarchive/heritrix3/pull/256) ([ruebot](https://github.com/ruebot))

## [3.4.0-20190418](https://github.com/internetarchive/heritrix3/tree/3.4.0-20190418) (2019-04-18)
[Full Changelog](https://github.com/internetarchive/heritrix3/compare/3.4.0-20190207...3.4.0-20190418)

@@ -68,6 +153,7 @@

**Merged pull requests:**

- JDK11 support: explicitly depend on JAXB [\#269](https://github.com/internetarchive/heritrix3/pull/269) ([ato](https://github.com/ato))
- do not checkpoint if crawl job has not started [\#227](https://github.com/internetarchive/heritrix3/pull/227) ([nlevitt](https://github.com/nlevitt))
- namespace scope log logger to crawl job [\#226](https://github.com/internetarchive/heritrix3/pull/226) ([nlevitt](https://github.com/nlevitt))
- un-threadlocal the HConnection [\#224](https://github.com/internetarchive/heritrix3/pull/224) ([nlevitt](https://github.com/nlevitt))
11 changes: 11 additions & 0 deletions LICENSE
@@ -0,0 +1,11 @@
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
71 changes: 23 additions & 48 deletions README.md
@@ -1,64 +1,39 @@
Readme for Heritrix
====================
# Heritrix
[![Build Status](https://travis-ci.org/internetarchive/heritrix3.svg?branch=master)](https://travis-ci.org/internetarchive/heritrix3)
[![Maven Central](https://maven-badges.herokuapp.com/maven-central/org.archive/heritrix/badge.svg)](https://maven-badges.herokuapp.com/maven-central/org.archive/heritrix)
[![Javadoc](https://javadoc-badge.appspot.com/org.archive/heritrix.svg?label=javadoc)](https://www.javadoc.io/doc/org.archive.heritrix/heritrix-engine)
[![LICENSE](https://img.shields.io/badge/license-Apache-blue.svg?style=flat-square)](./LICENSE)

1. Introduction
2. Crawl Operators!
3. Getting Started
4. Developer Documentation
5. Release History
6. License
## Introduction

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word for heiress (woman who inherits). Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.

## 1. Introduction
## Crawl Operators!

Heritrix is the Internet Archive's open-source, extensible, web-scale,
archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or
misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word
for heiress (woman who inherits). Since our crawler seeks to collect and
preserve the digital artifacts of our culture for the benefit of future
researchers and generations, this name seemed apt.
Heritrix is designed to respect the [`robots.txt`](http://www.robotstxt.org/robotstxt.html) exclusion directives<sup>†</sup> and [META nofollow tags](http://www.robotstxt.org/meta.html). Please consider the
load your crawl will place on seed sites and set politeness policies accordingly. Also, always identify your crawl with contact information in the `User-Agent` so sites that may be adversely affected by your crawl can contact you or adapt their server behavior accordingly.

<sup>†</sup> The newer wildcard extension to robots.txt is [not yet](https://github.com/internetarchive/heritrix3/issues/250) supported.

## 2. Crawl Operators!
## Getting Started

Heritrix is designed to respect the robots.txt
<http://www.robotstxt.org/wc/robots.html> exclusion directives and META robots
tags <http://www.robotstxt.org/wc/exclusion.html#meta>. Please consider the
load your crawl will place on seed sites and set politeness policies
accordingly. Also, always identify your crawl with contact information in the
User-Agent so sites that may be adversely affected by your crawl can contact
you or adapt their server behavior accordingly.
- [User Manual](https://github.com/internetarchive/heritrix3/wiki)

## Developer Documentation

## 3. Getting Started
- [Developer Manual](http://crawler.archive.org/articles/developer_manual/index.html)
- [REST API documentation](https://heritrix.readthedocs.io/en/latest/api.html)
- JavaDoc: [engine](https://www.javadoc.io/doc/org.archive.heritrix/heritrix-engine), [modules](https://www.javadoc.io/doc/org.archive.heritrix/heritrix-modules), [commons](https://www.javadoc.io/doc/org.archive.heritrix/heritrix-commons), [contrib](https://www.javadoc.io/doc/org.archive.heritrix/heritrix-contrib)

See the User Manual, available from <https://github.com/internetarchive/heritrix3/wiki>

## Latest Releases

## 4. Developer Documentation
Information about releases can be found [here](https://github.com/internetarchive/heritrix3/wiki#latest-releases).

See <http://crawler.archive.org/articles/developer_manual/index.html>.
For REST API documentation, see <https://heritrix.readthedocs.io/en/latest/api.html>
and for JavaDoc see <http://builds.archive.org/javadoc/heritrix-3.2.0/> (n.b. Javadoc currently out of date).
## License

Heritrix is free software; you can redistribute it and/or modify it under the terms of the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0)

## 5. Latest Releases

Information about releases can be found at <https://github.com/internetarchive/heritrix3/wiki#latest-releases>


## 6. License

Heritrix is free software; you can redistribute it and/or modify it
under the terms of the Apache License, Version 2.0:

http://www.apache.org/licenses/LICENSE-2.0

Some individual source code files are subject to or offered under other
licenses. See the included LICENSE.txt file for more information.

Heritrix is distributed with the libraries it depends upon. The
libraries can be found under the 'lib' directory, and are used under
the terms of their respective licenses, which are included alongside
the libraries in the 'lib' directory.
Some individual source code files are subject to or offered under other licenses. See the included [`LICENSE.txt`](./LICENSE) file for more information.

Heritrix is distributed with the libraries it depends upon. The libraries can be found under the `lib` directory in the release distribution, and are used under the terms of their respective licenses, which are included alongside the libraries in the `lib` directory.
27 changes: 13 additions & 14 deletions commons/pom.xml
@@ -18,7 +18,7 @@
<repositories>
<repository>
<id>download.oracle.com,maven</id>
<url>http://download.oracle.com/maven</url>
<url>https://download.oracle.com/maven</url>
</repository>
</repositories>

@@ -34,7 +34,7 @@
<dependency>
<groupId>com.sleepycat</groupId>
<artifactId>je</artifactId>
<version>4.1.6</version>
<version>7.5.11</version>
</dependency>
<dependency>
<groupId>commons-lang</groupId>
@@ -119,7 +119,6 @@
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.2</version>
<scope>compile</scope>
</dependency>
<dependency>
@@ -137,27 +136,22 @@
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-core</artifactId>
<version>3.0.5.RELEASE</version>
<version>${spring.version}</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-beans</artifactId>
<version>3.0.5.RELEASE</version>
<version>${spring.version}</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-context</artifactId>
<version>3.0.5.RELEASE</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-asm</artifactId>
<version>3.0.5.RELEASE</version>
<version>${spring.version}</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-expression</artifactId>
<version>3.0.5.RELEASE</version>
<version>${spring.version}</version>
</dependency>

<dependency>
@@ -196,7 +190,11 @@
</exclusion>
</exclusions>
</dependency>

<dependency>
<groupId>com.jcraft</groupId>
<artifactId>jsch</artifactId>
<version>0.1.52</version>
</dependency>
</dependencies>
<build>
<resources>
@@ -221,7 +219,7 @@
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.9</version>
<version>2.22.2</version>
<configuration>
<!--
There was a unit test, SinkHandlerTest, that required
@@ -250,5 +248,6 @@
</build>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<spring.version>5.3.3</spring.version>
</properties>
</project>
2 changes: 1 addition & 1 deletion commons/src/main/java/org/archive/bdb/BdbModule.java
@@ -269,7 +269,7 @@ protected void setup(File f, boolean create)
config.setSharedCache(getUseSharedCache());

// we take the advice literally from...
// http://www.oracle.com/technology/products/berkeley-db/faq/je_faq.html#33
// https://web.archive.org/web/20100727081707/http://www.oracle.com/technology/products/berkeley-db/faq/je_faq.html#33
long nLockTables = getExpectedConcurrency()-1;
while(!BigInteger.valueOf(nLockTables).isProbablePrime(Integer.MAX_VALUE)) {
nLockTables--;
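
The loop in this hunk walks down from `getExpectedConcurrency()-1` until it reaches a prime, so the lock-table count ends up as the largest prime below the expected concurrency. A minimal standalone sketch of that idea follows. The class `LockTableSizing` and method `lockTablesFor` are hypothetical names, not part of Heritrix. The certainty value of 100 and the lower bound of 1 are assumptions added for the sketch; the hunk as shown has no such bound.

```java
import java.math.BigInteger;

// Hypothetical helper for illustration only; not part of Heritrix.
class LockTableSizing {

    /**
     * Returns the largest prime strictly below expectedConcurrency.
     * Falls back to 1 when no such prime exists (an assumption made
     * for this sketch; the original loop has no lower bound).
     */
    static long lockTablesFor(long expectedConcurrency) {
        long candidate = expectedConcurrency - 1;
        // Step down until a (probable) prime is found or we run out of candidates.
        while (candidate >= 2
                && !BigInteger.valueOf(candidate).isProbablePrime(100)) {
            candidate--;
        }
        return Math.max(candidate, 1);
    }

    public static void main(String[] args) {
        System.out.println(lockTablesFor(25)); // prints 23
        System.out.println(lockTablesFor(2));  // prints 1 (fallback)
    }
}
```

The bound matters because `BigInteger.isProbablePrime` tests the absolute value, so an unbounded countdown from a very small starting value can continue past zero and stop at -2.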