Robots.txt

Check whether an already published URL, path or file is meant to be crawled


Further Information

What is robots.txt used for?

"robots.txt" is a file website owners may/should place in root folder of each of their domains. Though the file itself is human-readable it is targeted to give web-crawlers guide-lines on how to deal with a website.
Nearly every search engine like Google, Yahoo, Bing or many others use web-crawlers to read web pages, index their content, extract all links to other sites and continue following these links.
Whenever a web-crawler intends to index a new page it reads /robots.txt file first and checks whether the planned site to be indexed is disallowed by The Robots Exclusion Protocol inside that robots.txt file. If so, indexing of this page is skipped.
Though using robots.txt by web-crawlers is best practise, it has to be mentioned that there are suspicious harvesting projects like collecting email addresses for spammers which will not follow the rules in robots.txt.
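
As a minimal sketch of this check, one can use the Robots Exclusion Protocol parser shipped in Python's standard library, urllib.robotparser (the domain example.com and the crawler name MyCrawler are placeholders):

    from urllib.robotparser import RobotFileParser

    # Download and parse the robots.txt of the target domain
    parser = RobotFileParser()
    parser.set_url("https://example.com/robots.txt")
    parser.read()

    # can_fetch() applies the parsed rules and tells the crawler
    # whether it may fetch and index the given URL
    if parser.can_fetch("MyCrawler", "https://example.com/some/page.html"):
        print("Page may be crawled and indexed")
    else:
        print("Page is disallowed, skip it")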

Basic use cases

Allow crawling of full site

  1. Have no robots.txt in the domain root folder, or
  2. Create a robots.txt in the domain root folder as follows:

    # No restrictions to any robot
    User-agent: *
    Disallow:

    Having no path after Disallow means: nothing is disallowed. A "#" at the beginning of a line marks a comment.
    The sketch below demonstrates the effect.

Disallow parts of a site

  1. To block a site completely, have a robots.txt in the domain root folder as follows:

    # Exclude all robots from entire site
    User-agent: *
    Disallow: /

    See how "/" as path is used to disallow root folder and every folder beneath it to block the whole site.
  2. Only block specific folders using:

    # Exclude selected folders from being crawled
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/

  3. To block a specific aggressive robot, which we will call AggressiveBot for amusement's sake:

    # Exclude AggressiveBot from site
    User-agent: AggressiveBot
    Disallow: /

    All other robots will still be able to crawl the site; the sketch below shows how these rules are evaluated together.
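
As a sketch of how such rules combine, here is how Python's urllib.robotparser evaluates a robots.txt that bans AggressiveBot entirely and keeps all other robots out of selected folders (bot names and URLs are illustrative):

    from urllib.robotparser import RobotFileParser

    # AggressiveBot is banned entirely; everyone else only has to
    # stay out of /cgi-bin/ and /tmp/
    parser = RobotFileParser()
    parser.parse([
        "User-agent: AggressiveBot",
        "Disallow: /",
        "",
        "User-agent: *",
        "Disallow: /cgi-bin/",
        "Disallow: /tmp/",
    ])

    print(parser.can_fetch("AggressiveBot", "https://example.com/"))          # False
    print(parser.can_fetch("FriendlyBot", "https://example.com/"))            # True
    print(parser.can_fetch("FriendlyBot", "https://example.com/tmp/a.html"))  # False
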
Read more about the standards for robots.txt

Using robots.txt as a de-facto standard goes back to two main publications: the original 1994 document "A Standard for Robot Exclusion" by Martijn Koster, and the 1997 Internet Draft "A Method for Web Robots Control".

Additional information can always be found with our friends at Wikipedia. If you are interested in learning how a specific search engine handles robots.txt files, it may also be helpful to consult that search engine's own documentation directly, as they often provide such information.