Commit

Clarify robots protocol support
Closes #351, #353
ato authored Feb 15, 2021
1 parent b10b338 commit 35bb328
Showing 1 changed file with 3 additions and 1 deletion.
4 changes: 3 additions & 1 deletion README.md
@@ -10,9 +10,11 @@ Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-

## Crawl Operators!

-Heritrix is designed to respect the [`robots.txt`](http://www.robotstxt.org/robotstxt.html) exclusion directives and [META robots tags](http://www.robotstxt.org/meta.html). Please consider the
+Heritrix is designed to respect the [`robots.txt`](http://www.robotstxt.org/robotstxt.html) exclusion directives<sup>†</sup> and [META nofollow tags](http://www.robotstxt.org/meta.html). Please consider the
load your crawl will place on seed sites and set politeness policies accordingly. Also, always identify your crawl with contact information in the `User-Agent` so sites that may be adversely affected by your crawl can contact you or adapt their server behavior accordingly.

+<sup>†</sup> The newer wildcard extension to robots.txt is [not yet](https://github.com/internetarchive/heritrix3/issues/250) supported.
+
## Getting Started

- [User Manual](https://github.com/internetarchive/heritrix3/wiki)
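
For readers unfamiliar with the distinction the new footnote draws, the following is a minimal sketch (not Heritrix code) contrasting the original prefix-based `robots.txt` matching with the newer wildcard extension, in which `*` matches any run of characters and a trailing `$` anchors the end of the path. The function names `prefix_match` and `wildcard_match`, and the example rule and path, are hypothetical and chosen only for illustration.

```python
import re


def prefix_match(rule_path: str, url_path: str) -> bool:
    """Original robots.txt semantics: the rule is a literal path prefix."""
    return url_path.startswith(rule_path)


def wildcard_match(rule_path: str, url_path: str) -> bool:
    """Wildcard extension: '*' matches any characters, trailing '$' anchors the end."""
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape the literal pieces and join them with '.*' where '*' appeared.
    pattern = ".*".join(re.escape(part) for part in rule_path.split("*"))
    if anchored:
        pattern += "$"
    # re.match anchors at the start, so an unanchored rule still acts as a prefix.
    return re.match(pattern, url_path) is not None


# Hypothetical rule and path, for illustration only.
rule = "/private/*.pdf$"
path = "/private/reports/q1.pdf"

print(prefix_match(rule, path))    # literal prefix comparison
print(wildcard_match(rule, path))  # wildcard-aware comparison
```

Under these assumptions, a matcher that implements only the prefix semantics would not treat the example rule as covering the example path, whereas a wildcard-aware matcher would. Site owners relying on wildcard rules may want to keep that difference in mind.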
