Robots.txt
Further Information
What is robots.txt used for?
"robots.txt" is a file that website owners may (and should) place in the root folder of each of their domains. Although the file itself is human-readable, its purpose is to give web-crawlers guidelines on how to deal with the website.
Nearly every search engine, such as Google, Yahoo or Bing, uses web-crawlers to read web pages, index their content, extract all links to other pages and then follow those links in turn.
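The read, extract and follow loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular search engine works: the three-page site in PAGES is hypothetical and held in memory instead of being fetched over HTTP, and the names LinkExtractor and crawl are made up for this sketch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href targets of all <a> tags in one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def crawl(start_url, fetch):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    seen = {start_url}
    visited = []
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page not available, nothing to read
        visited.append(url)
        extractor = LinkExtractor(url)
        extractor.feed(html)
        for link in extractor.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical site used instead of real HTTP requests.
PAGES = {
    "https://example.com/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "https://example.com/a.html": '<a href="/">home</a>',
    "https://example.com/b.html": '<a href="https://example.org/">external</a>',
}

print(crawl("https://example.com/", PAGES.get))
```

A real crawler would replace PAGES.get with an HTTP fetch and would consult robots.txt before each request.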
Whenever a web-crawler intends to index a new page, it first reads the site's /robots.txt file and checks whether that page is disallowed by the Robots Exclusion Protocol rules in the file. If so, the page is skipped.
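As a rough sketch of that check, Python's standard urllib.robotparser module can evaluate robots.txt rules for a given crawler name and page URL. The crawler name ExampleBot, the host www.example.com and the helper names robots_url and may_crawl are assumptions made for this example; the rules are parsed from an in-memory list so that no network access is needed.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url):
    """Return the robots.txt URL in the root folder of the page's host."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def may_crawl(robots_lines, user_agent, page_url):
    """Check a page URL against already downloaded robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, page_url)


# Hypothetical rules, as they might be served from robots_url(...).
rules = ["User-agent: *", "Disallow: /tmp/"]

print(robots_url("https://www.example.com/shop/item.html?id=7"))
print(may_crawl(rules, "ExampleBot", "https://www.example.com/index.html"))
print(may_crawl(rules, "ExampleBot", "https://www.example.com/tmp/cache.html"))
```

In a live crawler the rules would be downloaded once per host, for instance with RobotFileParser.set_url() followed by read(), and reused for every page on that host.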
Although honouring robots.txt is best practice for web-crawlers, compliance is voluntary: dubious harvesting projects, such as those collecting email addresses for spammers, will not follow the rules in robots.txt.
Allow crawling of full site
- Have no robots.txt in the domain root folder, or
- Create a robots.txt in the domain root folder as follows:
# No restrictions to any robot
User-agent: *
Disallow:
Having no path description in Disallow means: nothing is disallowed.
"#" at the beginning of a line marks a comment.
Disallow parts of a site
- To block a site completely, have a robots.txt in the domain root folder as follows:
# Exclude all robots from entire site
User-agent: *
Disallow: /
See how "/" as path is used to disallow the root folder and every folder beneath it, which blocks the whole site.
- Only block specific folders using:
# Exclude selected folders from being crawled
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
- Block a specific aggressive robot, called AggressiveBot here for amusement's sake:
# Exclude AggressiveBot from site
User-agent: AggressiveBot
Disallow: /
All other robots will still be able to crawl the site.
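The effect of the rule sets in this and the previous section can be checked with Python's standard urllib.robotparser module. The sketch below is one way to do that, assuming a second, well-behaved crawler named OtherBot and the host www.example.com, both of which are made up for illustration. Individual search engines may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The four robots.txt variants discussed above, keyed by a descriptive name.
RULE_SETS = {
    "allow_all": "User-agent: *\nDisallow:\n",
    "block_all": "User-agent: *\nDisallow: /\n",
    "block_folders": "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /tmp/\n",
    "block_one_bot": "User-agent: AggressiveBot\nDisallow: /\n",
}


def allowed(rule_set, user_agent, path):
    """Return True if user_agent may crawl path under the named rule set."""
    parser = RobotFileParser()
    parser.parse(RULE_SETS[rule_set].splitlines())
    # Hypothetical host; only the path matters for rule matching.
    return parser.can_fetch(user_agent, "https://www.example.com" + path)


for name in RULE_SETS:
    for agent in ("AggressiveBot", "OtherBot"):
        for path in ("/index.html", "/tmp/x.html"):
            print(name, agent, path, allowed(name, agent, path))
```

Running the loop shows, for example, that block_folders only rejects paths under /cgi-bin/ and /tmp/, while block_one_bot rejects every path for AggressiveBot and nothing for OtherBot.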
The use of robots.txt as a de facto standard goes back to two main publications; the entries after them give further background:
- A Standard for Robot Exclusion, 1994, www.robotstxt.org/orig.html
- A Method for Web Robots Control, 1997, www.robotstxt.org/norobots-rfc.txt
- Robots exclusion standard, Wikipedia, en.wikipedia.org/wiki/Robots_exclusion_standard
- Introduction to robots.txt, Google Search Central, developers.google.com/search/docs/advanced/robots/intro