Nov 6, 2008

Robots.txt file

"robots.txt" is a plain text file, placed at the root of a site, whose name gives it special meaning to the majority of "honorable" (well-behaved) robots on the web. By defining a few rules in this file, you can instruct robots not to crawl certain files or directories within your site, or not to crawl the site at all.

There are three ways to give instructions to robots:


  • robots.txt
  • Meta Robots
  • Nofollow Tag
robots.txt

The most basic robots.txt file is the following, which blocks all robots from the whole site:
User-agent: *
Disallow: /

  • "User-agent: *" applies the rules to all search engine robots.
  • "Disallow: /" names the part of the site that must not be crawled; here "/" means the entire site.
User-Agent: googlebot
Disallow: /images/

  • This applies only to Google's crawler (Googlebot) and blocks it from crawling the "images" folder.
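Disallow: /*.doc$

  • This blocks URLs that end in ".doc" (Word documents). The "*" and "$" wildcards are supported by major search engines such as Google, but not every robot understands them.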
Sitemap: http://example.com/mainsitemap.xml

  • This tells robots where to find the site's XML sitemap.
Allow: /research/findings/*

  • The Allow directive explicitly permits crawling of a path, typically as an exception inside a directory that is otherwise disallowed.
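
To see how a robot might read rules like the ones above, here is a minimal sketch using Python's standard urllib.robotparser module. The rules are combined into one file for illustration; the "Disallow: /research/" line and the "examplebot" name are assumptions that do not appear in the examples above. Note that this particular parser matches rules by simple path prefix and takes the first rule that matches, so the wildcard rule for ".doc" files is left out and the Allow line is placed before the Disallow line.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content (assumed, assembled from the examples above).
RULES = """\
User-agent: googlebot
Disallow: /images/

User-agent: *
Allow: /research/findings/
Disallow: /research/

Sitemap: http://example.com/mainsitemap.xml
"""


def build_parser(text):
    """Parse robots.txt text from a string instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(RULES)

# Googlebot is blocked from /images/ but may crawl other pages.
print(parser.can_fetch("googlebot", "http://example.com/images/logo.png"))  # False
print(parser.can_fetch("googlebot", "http://example.com/index.html"))       # True

# "examplebot" (hypothetical) falls under the "*" group.
print(parser.can_fetch("examplebot", "http://example.com/research/findings/a.html"))  # True
print(parser.can_fetch("examplebot", "http://example.com/research/draft.html"))       # False

# Sitemap URLs declared in the file (available in Python 3.8 and later).
print(parser.site_maps())  # ['http://example.com/mainsitemap.xml']
```

Real search engines may resolve conflicts between Allow and Disallow differently (for example by the most specific match), so treat the output as what this parser decides, not as a guarantee of any crawler's behaviour.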
Meta Robots

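It is also possible to give robots instructions for a single page.

In the head section of the page, add a meta tag with the attributes name="robots" and content="noindex, nofollow", i.e. <meta name="robots" content="noindex, nofollow">. Here "noindex" asks robots not to index the page, and "nofollow" asks them not to follow the links on it.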
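
As a rough illustration of how a robot could read this tag, the sketch below uses Python's standard html.parser module to collect the directives from a page. The class name, function name and sample page are hypothetical, chosen only for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" ...> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma separated, e.g. "noindex, nofollow".
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html):
    """Return the set of meta robots directives declared in the HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical sample page.
PAGE = (
    '<html><head><meta name="robots" content="noindex, nofollow">'
    "</head><body>Hello</body></html>"
)

directives = robots_directives(PAGE)
print("noindex" in directives)   # True
print("nofollow" in directives)  # True
```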

Nofollow Tag

In some cases, we link to another site's URL as a reference on our own site. A normal link like this acts as an endorsement of the other site and passes ranking credit to it. To avoid this, add the rel="nofollow" attribute to the anchor tag; it tells search engines not to follow the link or give the linked site credit for it.
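
The following sketch shows one way a robot could separate followed links from nofollow links, again using Python's standard html.parser module. The class name and the sample URLs are assumptions made for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Sorts the links of a page into followed and nofollowed lists."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollowed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel may hold several space separated tokens, e.g. "nofollow noopener".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollowed.append(href)
        else:
            self.followed.append(href)


# Hypothetical sample markup.
PAGE = (
    '<p><a href="http://example.com/about">About us</a> '
    '<a href="http://example.org/" rel="nofollow">A reference</a></p>'
)

collector = LinkCollector()
collector.feed(PAGE)
print(collector.followed)    # ['http://example.com/about']
print(collector.nofollowed)  # ['http://example.org/']
```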
