Commit 396467c

Merge branch 'master' into sitemaps
2 parents 52093eb + c1bcdd9 commit 396467c

104 files changed, +3680 -2207 lines changed

.gitignore

Lines changed: 5 additions & 0 deletions
@@ -11,3 +11,8 @@ contrib/target
 */.classpath
 */.project
 */.settings
+.idea
+*.iml
+/adhoc.keystore
+/heritrix_dmesg.log
+/jobs

.travis.yml

Lines changed: 8 additions & 12 deletions
@@ -1,31 +1,27 @@
-sudo: false
-
 language: java
-
-jdk:
-  - oraclejdk8
-  - openjdk7
-  - openjdk8
-
 matrix:
-  allow_failures:
-    - jdk: openjdk7
-
+  include:
+    - jdk: oraclejdk8
+      dist: trusty
+    - jdk: openjdk8
+    - jdk: openjdk11
+
 before_install:
   - "export JAVA_OPTS=-Xmx1500m"
   - "echo JAVA_OPTS=$JAVA_OPTS"
   - "export MAVEN_OPTS=-Xmx1500m"
   - "echo MAVEN_OPTS=$MAVEN_OPTS"
   - "export _JAVA_OPTIONS=-Xmx1500m"
   - "echo _JAVA_OPTIONS=$_JAVA_OPTIONS"
+
+install: mvn dependency:resolve -B -V

 cache:
   directories:
     - $HOME/.m2

 script:
   - travis_wait 30 mvn install
-  - cd contrib && mvn install

 after_failure:
   - cat */target/surefire-reports/*.txt

CHANGELOG.md

Lines changed: 86 additions & 0 deletions
@@ -1,5 +1,90 @@
 # Change Log

+## [Unreleased](https://github.com/internetarchive/heritrix3/tree/HEAD)
+
+[Full Changelog](https://github.com/internetarchive/heritrix3/compare/3.4.0-20200518...HEAD)
+
+**Closed issues:**
+
+- Add support for the SFTP protocol [\#319](https://github.com/internetarchive/heritrix3/issues/319)
+
+**Merged pull requests:**
+
+- Fixes extractor multiple regex matcher recycle [\#335](https://github.com/internetarchive/heritrix3/pull/335) ([adam-miller](https://github.com/adam-miller))
+- Remove deprecated sudo setting. [\#333](https://github.com/internetarchive/heritrix3/pull/333) ([dengliming](https://github.com/dengliming))
+
+## [3.4.0-20200518](https://github.com/internetarchive/heritrix3/tree/3.4.0-20200518) (2020-05-18)
+[Full Changelog](https://github.com/internetarchive/heritrix3/compare/3.4.0-20200304...3.4.0-20200518)
+
+**Closed issues:**
+
+- Cannot find class \[ExtractorYoutubeDL\] [\#322](https://github.com/internetarchive/heritrix3/issues/322)
+- Checkpoints 'spoiled' when used to resume crawls [\#277](https://github.com/internetarchive/heritrix3/issues/277)
+
+**Merged pull requests:**
+
+- Fix match result is always false in MatchesListRegexDecideRule [\#328](https://github.com/internetarchive/heritrix3/pull/328) ([morokosi](https://github.com/morokosi))
+- Add real crawlStatus in the crawlReport [\#326](https://github.com/internetarchive/heritrix3/pull/326) ([clawia](https://github.com/clawia))
+- youtube-dl: request best medium-ish size format [\#325](https://github.com/internetarchive/heritrix3/pull/325) ([galgeek](https://github.com/galgeek))
+- Add parsing for HTML tags \(data-\*\) [\#323](https://github.com/internetarchive/heritrix3/pull/323) ([clawia](https://github.com/clawia))
+- Add support for the SFTP protocol [\#320](https://github.com/internetarchive/heritrix3/pull/320) ([bnfleb](https://github.com/bnfleb))
+
+## [3.4.0-20200304](https://github.com/internetarchive/heritrix3/tree/3.4.0-20200304) (2020-03-04)
+[Full Changelog](https://github.com/internetarchive/heritrix3/compare/3.4.0-20190418...3.4.0-20200304)
+
+**Fixed bugs:**
+
+- exception logged when opening/saving crawler-beans.cxml via web interface editor [\#305](https://github.com/internetarchive/heritrix3/issues/305)
+- Java interface text editor error when saving crawler-beans.cxml [\#293](https://github.com/internetarchive/heritrix3/issues/293)
+- Unable to upload crawler-beans.cxml with curl [\#282](https://github.com/internetarchive/heritrix3/issues/282)
+- CookieStoreTest.testConcurrentLoad fails randomly [\#274](https://github.com/internetarchive/heritrix3/issues/274)
+
+**Closed issues:**
+
+- Contrib project has a maven dependency with an older version of guava library. [\#311](https://github.com/internetarchive/heritrix3/issues/311)
+- BloomFilter64bitTest is slow [\#299](https://github.com/internetarchive/heritrix3/issues/299)
+- ObjectIdentityBdbManualCacheTest is slow [\#297](https://github.com/internetarchive/heritrix3/issues/297)
+- HTTPS console inaccessible via browser [\#279](https://github.com/internetarchive/heritrix3/issues/279)
+- JDK11 support: ssl errors from console [\#275](https://github.com/internetarchive/heritrix3/issues/275)
+- JDK11 support: FetchHTTPTest: ssl handshake\_failure [\#268](https://github.com/internetarchive/heritrix3/issues/268)
+- JDK11 support: org.archive.util.ObjectIdentityBdbCacheTest failures [\#267](https://github.com/internetarchive/heritrix3/issues/267)
+- JDK11 support: ClassNotFoundException: javax.transaction.xa.Xid [\#266](https://github.com/internetarchive/heritrix3/issues/266)
+- JDK11 support: tools.jar [\#265](https://github.com/internetarchive/heritrix3/issues/265)
+- JDK11 support: jaxb [\#264](https://github.com/internetarchive/heritrix3/issues/264)
+
+**Merged pull requests:**
+
+- Use the Wayback Machine to repair a link to Oracle docs. [\#315](https://github.com/internetarchive/heritrix3/pull/315) ([anjackson](https://github.com/anjackson))
+- Utilize the `d` parameter [\#314](https://github.com/internetarchive/heritrix3/pull/314) ([hennekey](https://github.com/hennekey))
+- Exclude hbase-client's guava 12 transitive dependency [\#312](https://github.com/internetarchive/heritrix3/pull/312) ([ato](https://github.com/ato))
+- Fix stream closed exception for Paged view [\#308](https://github.com/internetarchive/heritrix3/pull/308) ([ldko](https://github.com/ldko))
+- Fix stream closed exception by not closing output stream [\#306](https://github.com/internetarchive/heritrix3/pull/306) ([ato](https://github.com/ato))
+- Replace custom Base32 encoding [\#304](https://github.com/internetarchive/heritrix3/pull/304) ([hennekey](https://github.com/hennekey))
+- Replace constant with accessor methods [\#303](https://github.com/internetarchive/heritrix3/pull/303) ([hennekey](https://github.com/hennekey))
+- limit ExtractorYoutubeDL heap usage [\#302](https://github.com/internetarchive/heritrix3/pull/302) ([nlevitt](https://github.com/nlevitt))
+- fix logging config [\#301](https://github.com/internetarchive/heritrix3/pull/301) ([nlevitt](https://github.com/nlevitt))
+- Use Guava instead of custom bloom filter implementation [\#300](https://github.com/internetarchive/heritrix3/pull/300) ([hennekey](https://github.com/hennekey))
+- Speed up ObjectIdentityBdbManualCacheTest [\#298](https://github.com/internetarchive/heritrix3/pull/298) ([hennekey](https://github.com/hennekey))
+- Set JUnit version to latest [\#296](https://github.com/internetarchive/heritrix3/pull/296) ([hennekey](https://github.com/hennekey))
+- Disable test that connects to wwwb-dedup.us.archive.org [\#295](https://github.com/internetarchive/heritrix3/pull/295) ([ato](https://github.com/ato))
+- Fix 'Method Not Allowed' on POST of config editor form [\#294](https://github.com/internetarchive/heritrix3/pull/294) ([ato](https://github.com/ato))
+- Crawltrap regex timeout [\#290](https://github.com/internetarchive/heritrix3/pull/290) ([csrster](https://github.com/csrster))
+- Bdb frontier access [\#289](https://github.com/internetarchive/heritrix3/pull/289) ([csrster](https://github.com/csrster))
+- Attempt to filter out embedded images. [\#288](https://github.com/internetarchive/heritrix3/pull/288) ([csrster](https://github.com/csrster))
+- change trough dedup `date` type to varchar. [\#287](https://github.com/internetarchive/heritrix3/pull/287) ([nlevitt](https://github.com/nlevitt))
+- Add support for forced queue assignment and parallel queues [\#286](https://github.com/internetarchive/heritrix3/pull/286) ([adam-miller](https://github.com/adam-miller))
+- Warc writer chain [\#285](https://github.com/internetarchive/heritrix3/pull/285) ([nlevitt](https://github.com/nlevitt))
+- Fix jobdir PUT [\#283](https://github.com/internetarchive/heritrix3/pull/283) ([ato](https://github.com/ato))
+- Upgrade BDB JE to version 7.5.11 - IMPORTANT CHANGE [\#281](https://github.com/internetarchive/heritrix3/pull/281) ([anjackson](https://github.com/anjackson))
+- Mitigate random CookieStore.testConcurrentLoad test failures [\#280](https://github.com/internetarchive/heritrix3/pull/280) ([ato](https://github.com/ato))
+- JDK11 support: upgrade to Jetty 9.4.19, Restlet 2.4.0 and drop JDK 7 support [\#276](https://github.com/internetarchive/heritrix3/pull/276) ([ato](https://github.com/ato))
+- JDK11 support: remove unused class ObjectIdentityBdbCache and tests [\#273](https://github.com/internetarchive/heritrix3/pull/273) ([ato](https://github.com/ato))
+- JDK11 support: upgrade maven-surefire-plugin to 2.22.2 [\#272](https://github.com/internetarchive/heritrix3/pull/272) ([ato](https://github.com/ato))
+- JDK11 support: exclude tools.jar from hbase-client dependency [\#271](https://github.com/internetarchive/heritrix3/pull/271) ([ato](https://github.com/ato))
+- Travis fixes [\#270](https://github.com/internetarchive/heritrix3/pull/270) ([ato](https://github.com/ato))
+- WIP: ExtractorYoutubeDL [\#257](https://github.com/internetarchive/heritrix3/pull/257) ([nlevitt](https://github.com/nlevitt))
+- Update README and add LICENSE.txt [\#256](https://github.com/internetarchive/heritrix3/pull/256) ([ruebot](https://github.com/ruebot))
+
 ## [3.4.0-20190418](https://github.com/internetarchive/heritrix3/tree/3.4.0-20190418) (2019-04-18)
 [Full Changelog](https://github.com/internetarchive/heritrix3/compare/3.4.0-20190207...3.4.0-20190418)

@@ -68,6 +153,7 @@

 **Merged pull requests:**

+- JDK11 support: explicitly depend on JAXB [\#269](https://github.com/internetarchive/heritrix3/pull/269) ([ato](https://github.com/ato))
 - do not checkpoint if crawl job has not started [\#227](https://github.com/internetarchive/heritrix3/pull/227) ([nlevitt](https://github.com/nlevitt))
 - namespace scope log logger to crawl job [\#226](https://github.com/internetarchive/heritrix3/pull/226) ([nlevitt](https://github.com/nlevitt))
 - un-threadlocal the HConnection [\#224](https://github.com/internetarchive/heritrix3/pull/224) ([nlevitt](https://github.com/nlevitt))

LICENSE

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.

README.md

Lines changed: 23 additions & 48 deletions
@@ -1,64 +1,39 @@
-Readme for Heritrix
-====================
+# Heritrix
+[![Build Status](https://travis-ci.org/internetarchive/heritrix3.svg?branch=master)](https://travis-ci.org/internetarchive/heritrix3)
+[![Maven Central](https://maven-badges.herokuapp.com/maven-central/org.archive/heritrix/badge.svg)](https://maven-badges.herokuapp.com/maven-central/org.archive/heritrix)
+[![Javadoc](https://javadoc-badge.appspot.com/org.archive/heritrix.svg?label=javadoc)](https://www.javadoc.io/doc/org.archive.heritrix/heritrix-engine)
+[![LICENSE](https://img.shields.io/badge/license-Apache-blue.svg?style=flat-square)](./LICENSE)

-1. Introduction
-2. Crawl Operators!
-3. Getting Started
-4. Developer Documentation
-5. Release History
-6. License
+## Introduction

+Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word for heiress (woman who inherits). Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.

-## 1. Introduction
+## Crawl Operators!

-Heritrix is the Internet Archive's open-source, extensible, web-scale,
-archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or
-misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word
-for heiress (woman who inherits). Since our crawler seeks to collect and
-preserve the digital artifacts of our culture for the benefit of future
-researchers and generations, this name seemed apt.
+Heritrix is designed to respect the [`robots.txt`](http://www.robotstxt.org/robotstxt.html) exclusion directives<sup>†</sup> and [META nofollow tags](http://www.robotstxt.org/meta.html). Please consider the
+load your crawl will place on seed sites and set politeness policies accordingly. Also, always identify your crawl with contact information in the `User-Agent` so sites that may be adversely affected by your crawl can contact you or adapt their server behavior accordingly.

+<sup>†</sup> The newer wildcard extension to robots.txt is [not yet](https://github.com/internetarchive/heritrix3/issues/250) supported.

-## 2. Crawl Operators!
+## Getting Started

-Heritrix is designed to respect the robots.txt
-<http://www.robotstxt.org/wc/robots.html> exclusion directives and META robots
-tags <http://www.robotstxt.org/wc/exclusion.html#meta>. Please consider the
-load your crawl will place on seed sites and set politeness policies
-accordingly. Also, always identify your crawl with contact information in the
-User-Agent so sites that may be adversely affected by your crawl can contact
-you or adapt their server behavior accordingly.
+- [User Manual](https://github.com/internetarchive/heritrix3/wiki)

+## Developer Documentation

-## 3. Getting Started
+- [Developer Manual](http://crawler.archive.org/articles/developer_manual/index.html)
+- [REST API documentation](https://heritrix.readthedocs.io/en/latest/api.html)
+- JavaDoc: [engine](https://www.javadoc.io/doc/org.archive.heritrix/heritrix-engine), [modules](https://www.javadoc.io/doc/org.archive.heritrix/heritrix-modules), [commons](https://www.javadoc.io/doc/org.archive.heritrix/heritrix-commons), [contrib](https://www.javadoc.io/doc/org.archive.heritrix/heritrix-contrib)

-See the User Manual, available from <https://github.com/internetarchive/heritrix3/wiki>

+## Latest Releases

-## 4. Developer Documentation
+Information about releases can be found [here](https://github.com/internetarchive/heritrix3/wiki#latest-releases).

-See <http://crawler.archive.org/articles/developer_manual/index.html>.
-For REST API documentation, see <https://heritrix.readthedocs.io/en/latest/api.html>
-and for JavaDoc see <http://builds.archive.org/javadoc/heritrix-3.2.0/> (n.b. Javadoc currently out of date).
+## License

+Heritrix is free software; you can redistribute it and/or modify it under the terms of the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0)

-## 5. Latest Releases
-
-Information about releases can be found at <https://github.com/internetarchive/heritrix3/wiki#latest-releases>
-
-
-## 6. License
-
-Heritrix is free software; you can redistribute it and/or modify it
-under the terms of the Apache License, Version 2.0:
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Some individual source code files are subject to or offered under other
-licenses. See the included LICENSE.txt file for more information.
-
-Heritrix is distributed with the libraries it depends upon. The
-libraries can be found under the 'lib' directory, and are used under
-the terms of their respective licenses, which are included alongside
-the libraries in the 'lib' directory.
+Some individual source code files are subject to or offered under other licenses. See the included [`LICENSE.txt`](./LICENSE) file for more information.

+Heritrix is distributed with the libraries it depends upon. The libraries can be found under the `lib` directory in the release distribution, and are used under the terms of their respective licenses, which are included alongside the libraries in the `lib` directory.

commons/pom.xml

Lines changed: 13 additions & 14 deletions
@@ -18,7 +18,7 @@
   <repositories>
     <repository>
       <id>download.oracle.com,maven</id>
-      <url>http://download.oracle.com/maven</url>
+      <url>https://download.oracle.com/maven</url>
     </repository>
   </repositories>

@@ -34,7 +34,7 @@
     <dependency>
       <groupId>com.sleepycat</groupId>
       <artifactId>je</artifactId>
-      <version>4.1.6</version>
+      <version>7.5.11</version>
     </dependency>
     <dependency>
       <groupId>commons-lang</groupId>
@@ -119,7 +119,6 @@
     <dependency>
       <groupId>junit</groupId>
       <artifactId>junit</artifactId>
-      <version>3.8.2</version>
       <scope>compile</scope>
     </dependency>
     <dependency>
@@ -137,27 +136,22 @@
     <dependency>
       <groupId>org.springframework</groupId>
       <artifactId>spring-core</artifactId>
-      <version>3.0.5.RELEASE</version>
+      <version>${spring.version}</version>
     </dependency>
     <dependency>
       <groupId>org.springframework</groupId>
       <artifactId>spring-beans</artifactId>
-      <version>3.0.5.RELEASE</version>
+      <version>${spring.version}</version>
     </dependency>
     <dependency>
       <groupId>org.springframework</groupId>
       <artifactId>spring-context</artifactId>
-      <version>3.0.5.RELEASE</version>
-    </dependency>
-    <dependency>
-      <groupId>org.springframework</groupId>
-      <artifactId>spring-asm</artifactId>
-      <version>3.0.5.RELEASE</version>
+      <version>${spring.version}</version>
     </dependency>
     <dependency>
       <groupId>org.springframework</groupId>
       <artifactId>spring-expression</artifactId>
-      <version>3.0.5.RELEASE</version>
+      <version>${spring.version}</version>
     </dependency>

     <dependency>
@@ -196,7 +190,11 @@
         </exclusion>
       </exclusions>
     </dependency>
-
+    <dependency>
+      <groupId>com.jcraft</groupId>
+      <artifactId>jsch</artifactId>
+      <version>0.1.52</version>
+    </dependency>
   </dependencies>
   <build>
     <resources>
@@ -221,7 +219,7 @@
       <plugin>
         <groupId>org.apache.maven.plugins</groupId>
         <artifactId>maven-surefire-plugin</artifactId>
-        <version>2.9</version>
+        <version>2.22.2</version>
         <configuration>
           <!--
             There was a unit test, SinkHandlerTest, that required
@@ -250,5 +248,6 @@
   </build>
   <properties>
     <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
+    <spring.version>5.3.3</spring.version>
   </properties>
 </project>

commons/src/main/java/org/archive/bdb/BdbModule.java

Lines changed: 1 addition & 1 deletion
@@ -269,7 +269,7 @@ protected void setup(File f, boolean create)
         config.setSharedCache(getUseSharedCache());

         // we take the advice literally from...
-        // http://www.oracle.com/technology/products/berkeley-db/faq/je_faq.html#33
+        // https://web.archive.org/web/20100727081707/http://www.oracle.com/technology/products/berkeley-db/faq/je_faq.html#33
         long nLockTables = getExpectedConcurrency()-1;
         while(!BigInteger.valueOf(nLockTables).isProbablePrime(Integer.MAX_VALUE)) {
             nLockTables--;
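
The hunk above is cut off inside the `while` loop, so only the count-down itself is visible: start at the expected concurrency minus one and decrement until `BigInteger.isProbablePrime` accepts the value. A minimal standalone sketch of that search follows. The class `LockTableSketch`, the method `largestPrimeAtMost`, the lower-bound guard, and the sample concurrency of 25 are all assumptions made for illustration and are not part of `BdbModule`; how Heritrix uses the resulting value after the loop is not shown in this diff.

```java
import java.math.BigInteger;

/** Hypothetical illustration only; not part of Heritrix or BdbModule. */
final class LockTableSketch {

    /**
     * Counts down from upperBound until a (probable) prime is found,
     * in the same way as the loop shown in the hunk above.
     * The guard for values below 2 is an addition of this sketch.
     */
    static long largestPrimeAtMost(long upperBound) {
        if (upperBound < 2) {
            throw new IllegalArgumentException("no prime <= " + upperBound);
        }
        long candidate = upperBound;
        // Same certainty argument as in the diff (Integer.MAX_VALUE).
        while (!BigInteger.valueOf(candidate).isProbablePrime(Integer.MAX_VALUE)) {
            candidate--;
        }
        return candidate;
    }

    public static void main(String[] args) {
        int expectedConcurrency = 25; // assumed value, for illustration
        // 24 is even, 23 is prime, so this prints 23.
        System.out.println(largestPrimeAtMost(expectedConcurrency - 1));
    }
}
```

The explicit guard is a design choice of the sketch: without it, a starting value below 2 would keep decrementing into negative numbers before the primality test accepts anything.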
