Commit
Showing 104 changed files with 3,680 additions and 2,207 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -11,3 +11,8 @@ contrib/target | |
*/.classpath | ||
*/.project | ||
*/.settings | ||
.idea | ||
*.iml | ||
/adhoc.keystore | ||
/heritrix_dmesg.log | ||
/jobs |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,31 +1,27 @@ | ||
sudo: false | ||
|
||
language: java | ||
|
||
jdk: | ||
- oraclejdk8 | ||
- openjdk7 | ||
- openjdk8 | ||
|
||
matrix: | ||
allow_failures: | ||
- jdk: openjdk7 | ||
|
||
include: | ||
- jdk: oraclejdk8 | ||
dist: trusty | ||
- jdk: openjdk8 | ||
- jdk: openjdk11 | ||
|
||
before_install: | ||
- "export JAVA_OPTS=-Xmx1500m" | ||
- "echo JAVA_OPTS=$JAVA_OPTS" | ||
- "export MAVEN_OPTS=-Xmx1500m" | ||
- "echo MAVEN_OPTS=$MAVEN_OPTS" | ||
- "export _JAVA_OPTIONS=-Xmx1500m" | ||
- "echo _JAVA_OPTIONS=$_JAVA_OPTIONS" | ||
|
||
install: mvn dependency:resolve -B -V | ||
|
||
cache: | ||
directories: | ||
- $HOME/.m2 | ||
|
||
script: | ||
- travis_wait 30 mvn install | ||
- cd contrib && mvn install | ||
|
||
after_failure: | ||
- cat */target/surefire-reports/*.txt |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
Licensed under the Apache License, Version 2.0 (the "License"); | ||
you may not use this file except in compliance with the License. | ||
You may obtain a copy of the License at | ||
|
||
http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Unless required by applicable law or agreed to in writing, software | ||
distributed under the License is distributed on an "AS IS" BASIS, | ||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
See the License for the specific language governing permissions and | ||
limitations under the License. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,64 +1,39 @@ | ||
Readme for Heritrix | ||
==================== | ||
# Heritrix | ||
[Build Status](https://travis-ci.org/internetarchive/heritrix3) | |
[Maven Central](https://maven-badges.herokuapp.com/maven-central/org.archive/heritrix) | |
[Javadoc](https://www.javadoc.io/doc/org.archive.heritrix/heritrix-engine) | |
[LICENSE](./LICENSE) | |
|
||
1. Introduction | ||
2. Crawl Operators! | ||
3. Getting Started | ||
4. Developer Documentation | ||
5. Release History | ||
6. License | ||
## Introduction | ||
|
||
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word for heiress (woman who inherits). Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt. | ||
|
||
## 1. Introduction | ||
## Crawl Operators! | ||
|
||
Heritrix is the Internet Archive's open-source, extensible, web-scale, | ||
archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or | ||
misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word | ||
for heiress (woman who inherits). Since our crawler seeks to collect and | ||
preserve the digital artifacts of our culture for the benefit of future | ||
researchers and generations, this name seemed apt. | ||
Heritrix is designed to respect the [`robots.txt`](http://www.robotstxt.org/robotstxt.html) exclusion directives<sup>†</sup> and [META nofollow tags](http://www.robotstxt.org/meta.html). Please consider the | ||
load your crawl will place on seed sites and set politeness policies accordingly. Also, always identify your crawl with contact information in the `User-Agent` so sites that may be adversely affected by your crawl can contact you or adapt their server behavior accordingly. | ||
|
||
<sup>†</sup> The newer wildcard extension to robots.txt is [not yet](https://github.com/internetarchive/heritrix3/issues/250) supported. | ||
|
||
## 2. Crawl Operators! | ||
## Getting Started | ||
|
||
Heritrix is designed to respect the robots.txt | ||
<http://www.robotstxt.org/wc/robots.html> exclusion directives and META robots | ||
tags <http://www.robotstxt.org/wc/exclusion.html#meta>. Please consider the | ||
load your crawl will place on seed sites and set politeness policies | ||
accordingly. Also, always identify your crawl with contact information in the | ||
User-Agent so sites that may be adversely affected by your crawl can contact | ||
you or adapt their server behavior accordingly. | ||
- [User Manual](https://github.com/internetarchive/heritrix3/wiki) | ||
|
||
## Developer Documentation | ||
|
||
## 3. Getting Started | ||
- [Developer Manual](http://crawler.archive.org/articles/developer_manual/index.html) | ||
- [REST API documentation](https://heritrix.readthedocs.io/en/latest/api.html) | ||
- JavaDoc: [engine](https://www.javadoc.io/doc/org.archive.heritrix/heritrix-engine), [modules](https://www.javadoc.io/doc/org.archive.heritrix/heritrix-modules), [commons](https://www.javadoc.io/doc/org.archive.heritrix/heritrix-commons), [contrib](https://www.javadoc.io/doc/org.archive.heritrix/heritrix-contrib) | ||
|
||
See the User Manual, available from <https://github.com/internetarchive/heritrix3/wiki> | ||
|
||
## Latest Releases | ||
|
||
## 4. Developer Documentation | ||
Information about releases can be found [here](https://github.com/internetarchive/heritrix3/wiki#latest-releases). | ||
|
||
See <http://crawler.archive.org/articles/developer_manual/index.html>. | ||
For REST API documentation, see <https://heritrix.readthedocs.io/en/latest/api.html> | ||
and for JavaDoc see <http://builds.archive.org/javadoc/heritrix-3.2.0/> (n.b. Javadoc currently out of date). | ||
## License | ||
|
||
Heritrix is free software; you can redistribute it and/or modify it under the terms of the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0) | ||
|
||
## 5. Latest Releases | ||
|
||
Information about releases can be found at <https://github.com/internetarchive/heritrix3/wiki#latest-releases> | ||
|
||
|
||
## 6. License | ||
|
||
Heritrix is free software; you can redistribute it and/or modify it | ||
under the terms of the Apache License, Version 2.0: | ||
|
||
http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Some individual source code files are subject to or offered under other | ||
licenses. See the included LICENSE.txt file for more information. | ||
|
||
Heritrix is distributed with the libraries it depends upon. The | ||
libraries can be found under the 'lib' directory, and are used under | ||
the terms of their respective licenses, which are included alongside | ||
the libraries in the 'lib' directory. | ||
Some individual source code files are subject to or offered under other licenses. See the included [`LICENSE.txt`](./LICENSE) file for more information. | ||
|
||
Heritrix is distributed with the libraries it depends upon. The libraries can be found under the `lib` directory in the release distribution, and are used under the terms of their respective licenses, which are included alongside the libraries in the `lib` directory. |
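
The "Crawl Operators!" section of the README diff above asks operators to respect `robots.txt`, set politeness policies, and put contact information in the `User-Agent`. As a rough illustration of those three ideas, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not how Heritrix implements them (Heritrix is a Java application configured through its crawl job settings). The bot name, contact details, sample rules, and the 5-second fallback delay are all hypothetical values chosen for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler identity with contact information, so that site
# owners can reach the operator. This is NOT Heritrix's default User-Agent.
USER_AGENT = "ExampleArchiveBot/0.1 (+https://example.org/crawl-info; crawl-ops@example.org)"

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def politeness_delay(parser: RobotFileParser, user_agent: str = USER_AGENT,
                     default_seconds: float = 5.0) -> float:
    """Use the site's Crawl-delay if it declares one, else an assumed default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default_seconds


if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS_TXT)
    for candidate in ("https://example.org/index.html",
                      "https://example.org/private/report.html"):
        print(candidate, "allowed" if may_fetch(rules, candidate) else "disallowed")
    print("seconds between requests:", politeness_delay(rules))
```

A real crawl would also send `USER_AGENT` as the request header and wait `politeness_delay(...)` seconds between requests to the same host. Like the footnote in the README, this parser performs plain prefix matching and does not interpret the wildcard extension.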