From 35bb3286f4d441538ac7d74cd65835abaaff2736 Mon Sep 17 00:00:00 2001
From: Alex Osborne
Date: Mon, 15 Feb 2021 17:29:49 +0900
Subject: [PATCH] Clarify robots protocol support

Closes #351, #353
---
 README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 9c88e6d78..dccbac23a 100644
--- a/README.md
+++ b/README.md
@@ -10,9 +10,11 @@ Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-
 
 ## Crawl Operators!
 
-Heritrix is designed to respect the [`robots.txt`](http://www.robotstxt.org/robotstxt.html) exclusion directives and [META robots tags](http://www.robotstxt.org/meta.html). Please consider the
+Heritrix is designed to respect the [`robots.txt`](http://www.robotstxt.org/robotstxt.html) exclusion directives and [META nofollow tags](http://www.robotstxt.org/meta.html). Please consider the
 load your crawl will place on seed sites and set politeness policies accordingly. Also, always identify your crawl with contact information in the `User-Agent` so sites that may be adversely affected by your crawl can contact you or adapt their server behavior accordingly.
 
+The newer wildcard extension to robots.txt is [not yet](https://github.com/internetarchive/heritrix3/issues/250) supported.
+
 ## Getting Started
 
 - [User Manual](https://github.com/internetarchive/heritrix3/wiki)