Robots.txt

There are a lot of questions about robots.txt lately, so I decided to create a hub that answers most of the newbie questions related to robots.txt.

What is robots.txt?


The robots.txt file is read by search engine bots to work out which content on your web server they may crawl and index. If no folders or files are explicitly disallowed, search engine bots will index everything by default. They crawl and index almost everything unless they are told not to with a "Disallow" directive.
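
For example, bots always fetch the file from the root of the domain, so for a site at example.com (a placeholder name) they would request example.com/robots.txt. A minimal file that blocks a single folder (the folder name is only an illustration) looks like this:

User-agent: *
Disallow: /private/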

How to create robots.txt?


There are online robots.txt generators which let you create a robots.txt for your website. If you don't want to use an online tool, you can create your own as well; just follow the instructions below:

Create a new file using Notepad (or any plain-text editor) and add the following code to it.

User-agent: *
Disallow:


Save this file as "robots.txt" and upload it to the root of your web server, so that it is reachable at yoursite.com/robots.txt. On a typical Linux host this means uploading it to the "/public_html" folder. If you want everything indexed, the above code will work just fine. If you wish to stop search engines from indexing some parts of your website, use the code below:

User-agent: *
Disallow: /secret-folder/
Disallow: /don-touch/


If the above code is placed inside your robots.txt, search bots will not index the "secret-folder" and "don-touch" folders mentioned above. Any path you write after a Disallow directive will be excluded from indexing. If you wish to get your entire site de-indexed from Google (not recommended), use this code:

User-agent: *
Disallow: /


This code will block the entire content of your domain. If you want to enable indexing again, remove the "/" after Disallow, leaving it empty.
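
After removing the "/", the file looks like the allow-everything example from earlier:

User-agent: *
Disallow: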

What is user-agent?


The User-agent field holds the name of a search bot. In this field you can specify the name of a bot such as "googlebot" or "Slurp". Naming a bot in the User-agent field tells that bot that the instructions which follow are meant specifically for it. For example, if you want googlebot to leave some folders out of its index, here is what you need to write:

User-agent: googlebot
Disallow: /secret-folder/


This tells googlebot to exclude that directory from its index. If no bot name is given, "*" is used to apply the instructions to all search engine bots.
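
You can also combine a bot-specific group with a catch-all group in one file. In this sketch (the folder names are just examples), googlebot skips one folder while every other bot skips another; note that a bot obeys only the most specific group that matches it, so googlebot here ignores the "*" rules:

User-agent: googlebot
Disallow: /secret-folder/

User-agent: *
Disallow: /temp/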

What is crawl-delay?


Search engines like Yahoo and Bing support the Crawl-delay directive. The number you specify after Crawl-delay tells the bot how many seconds to wait between successive requests to your server.

User-agent: Slurp
Crawl-delay: 20
Disallow: /secret-folder/


Here Crawl-delay is set to 20, so Yahoo's Slurp bot will wait 20 seconds between requests. Googlebot ignores this directive; Google lets you set the crawl rate through its Webmaster Central service instead, so there is no need to modify robots.txt for googlebot.
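
Since rules are grouped per bot, you can give different bots different delays. A quick sketch (the numbers are arbitrary):

User-agent: Slurp
Crawl-delay: 20

User-agent: bingbot
Crawl-delay: 10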

How to analyze robots.txt for errors?


Yahoo, Bing, and Google offer their own webmaster tools for analyzing robots.txt. The dashboard of each service will tell you what errors its search bot found in your robots.txt file. You can also check your robots.txt with third-party sites like ShoeMoney Tools and correct the errors.
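
One common mistake these validators flag is a Disallow path without a leading slash (the folder name here is only an illustration):

Disallow: secret-folder/
Disallow: /secret-folder/

The first line is wrong because paths must start with "/"; the second is the correct form.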

What about the robots.txt of HubPages?

You don't need to worry about the robots.txt of HubPages. It is maintained by the HubPages team, and if they find any error in it, they'll fix it.

Why does the AdSense media bot complain about robots.txt?

This happens when Google Translate or another service renders your page but its robots.txt keeps the AdSense bot from reading the rendered copy. This is not a bug or an issue with your robots.txt; it's an issue between the translator bot and Google AdSense, so you don't need to worry about it.
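
If you want to make sure the AdSense crawler can always read your own pages, a commonly suggested setup gives Google's Mediapartners-Google bot an empty Disallow of its own (the blocked folder below is just an example):

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /secret-folder/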

Robots.txt Generator

If you run WordPress, you can use a WordPress plugin that generates the robots.txt file for you. If you are using a CMS other than WordPress, you can usually find a similar plugin for it; I suggest searching for such plugins on a search engine.
You can also read more about robots.txt generators in the Google Webmaster Support forum.