
Why is robots.txt important?


What is robots.txt?
When a search engine crawler comes to your site, it looks for a special file called robots.txt. This file tells the search engine spider which pages of your site it may crawl and which pages it should ignore.

Where to place it?
The robots.txt file is a simple text file (no HTML) that must be placed in the root directory of your site, for example:

http://www.yourwebsite.com/robots.txt
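
Because the file always sits at the root, its address can be worked out from any page URL on the same site. The following is a minimal sketch of that idea; the helper name robots_url and the sample page URL are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; drop the path, query and fragment,
    # because robots.txt always lives directly under the root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URL used only for this example.
print(robots_url("http://www.yourwebsite.com/blog/post.html?page=2"))
# prints http://www.yourwebsite.com/robots.txt
```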


How to create it?
It is a simple text file. Each record has basically two parts:

1. User-agent
The User-agent line specifies the robot. For example:
User-agent: googlebot

You may also use the wildcard character “*” to specify all robots:
User-agent: *

You can find user agent names in your own logs by checking for requests to robots.txt. Most major search engines have short names for their spiders.
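
As a rough sketch of the log check described above, the snippet below counts which user agents asked for robots.txt. It assumes your server writes the common "combined" log format; the sample lines, the regular expression and the function name robots_txt_agents are illustrative assumptions, so adapt them to your own logs.

```python
import re
from collections import Counter

# Hypothetical access-log lines in the "combined" format (an assumption).
SAMPLE_LOG = '''\
66.249.66.1 - - [23/Sep/2011:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
157.55.39.1 - - [23/Sep/2011:10:05:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "bingbot/2.0"
10.0.0.5 - - [23/Sep/2011:10:06:00 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
66.249.66.1 - - [23/Sep/2011:11:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
'''

# The request is the first quoted field we match; the user agent is the
# last quoted field on the line.
LINE_RE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) [^"]*".*"(?P<agent>[^"]*)"$')


def robots_txt_agents(log_text):
    """Count requests for /robots.txt per user-agent string."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.search(line.strip())
        if match and match.group("path") == "/robots.txt":
            counts[match.group("agent")] += 1
    return counts


for agent, hits in robots_txt_agents(SAMPLE_LOG).most_common():
    print(hits, agent)
```

The short name to put on the User-agent line is usually the first token of the string, such as Googlebot or bingbot.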

2. Disallow
The second part of a record consists of Disallow: directive lines. These lines specify files and/or directories as paths that start from the site root. For example, the following line instructs spiders not to download adminlogin.amty:
Disallow: /adminlogin.amty

You may also specify directories:
Disallow: /cgi-bin/
This blocks spiders from your cgi-bin directory.

The Disallow directive matches by prefix. The standard dictates that /bob disallows both /bob.html and /bob/index.html (neither the file bob.html nor the files in the bob directory will be crawled).
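
The prefix behaviour can be checked with Python's standard urllib.robotparser module. This is only a sketch: the helper blocked_paths, the agent name googlebot and the path /alice.html are chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /bob
"""


def blocked_paths(rules_text, paths, agent="googlebot"):
    """Return the paths that the given rules forbid for this agent."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    base = "http://www.yourwebsite.com"
    return [p for p in paths if not parser.can_fetch(agent, base + p)]


# /alice.html is a hypothetical page that does not share the /bob prefix.
print(blocked_paths(RULES, ["/bob", "/bob.html", "/bob/index.html", "/alice.html"]))
# prints ['/bob', '/bob.html', '/bob/index.html']
```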

If you leave the Disallow line blank, it indicates that ALL files may be retrieved. At least one Disallow line must be present for each User-agent directive for the record to be correct. A completely empty robots.txt file is treated the same as if it were not present.
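
To see the difference between a blank Disallow line, an empty file and a rule that blocks everything, here is a small sketch using Python's urllib.robotparser. The helper can_fetch, the agent name and the test path are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(rules_text, path, agent="googlebot"):
    """Parse rules_text and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(agent, "http://www.yourwebsite.com" + path)


BLANK_DISALLOW = "User-agent: *\nDisallow:\n"  # blank value: nothing is blocked
EMPTY_FILE = ""                                # no rules at all
BLOCK_ALL = "User-agent: *\nDisallow: /\n"     # every path starts with "/"

# /private/page.html is a hypothetical path.
print(can_fetch(BLANK_DISALLOW, "/private/page.html"))  # True
print(can_fetch(EMPTY_FILE, "/private/page.html"))      # True
print(can_fetch(BLOCK_ALL, "/private/page.html"))       # False
```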


A self-explanatory example:

Sitemap: http://www.yourwebsite.com/sitemap-web.xml
Sitemap: http://www.yourwebsite.com/sitemap-mobile.xml
Sitemap: http://www.yourwebsite.com/sitemap-image.xml
Sitemap: http://www.yourwebsite.com/sitemap-video.xml

User-Agent: *
Disallow: /wp/wp-admin/
Disallow: /wp/wp-includes/
Disallow: /wp/wp-content/
Disallow: /wp/wp-
Disallow: /go/
Disallow: /forums/profile/

1. Attackers might use less popular crawlers to look for restricted material on your site. To cover them, either list every user-agent you know of or simply use the wildcard character.

2. Try to avoid comments in robots.txt
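
One way to try out the example file above before uploading it is to feed it to Python's urllib.robotparser and ask about a few URLs. This is a sketch, not a replacement for a search engine's own testing tool; the agent name somebot and the test paths are invented, and site_maps() needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

EXAMPLE = """\
Sitemap: http://www.yourwebsite.com/sitemap-web.xml
Sitemap: http://www.yourwebsite.com/sitemap-mobile.xml
Sitemap: http://www.yourwebsite.com/sitemap-image.xml
Sitemap: http://www.yourwebsite.com/sitemap-video.xml

User-Agent: *
Disallow: /wp/wp-admin/
Disallow: /wp/wp-includes/
Disallow: /wp/wp-content/
Disallow: /wp/wp-
Disallow: /go/
Disallow: /forums/profile/
"""

parser = RobotFileParser()
parser.parse(EXAMPLE.splitlines())


def allowed(path, agent="somebot"):
    """Report whether the (hypothetical) agent may fetch path."""
    return parser.can_fetch(agent, "http://www.yourwebsite.com" + path)


# Hypothetical paths: the first three match a Disallow prefix, the last two do not.
for path in ["/wp/wp-login.php", "/go/offer", "/forums/profile/amty",
             "/forums/topic/1", "/"]:
    print(path, "allowed" if allowed(path) else "blocked")

# Sitemap lines are collected separately from the crawl rules.
print(parser.site_maps())
```

Since matching is by prefix, the line Disallow: /wp/wp- already covers the three more specific /wp/wp-... lines before it; listing them separately does no harm and makes the intent easier to read.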


What you can do with robots.txt
You can stop crawlers from looking into your site's contents.
You can ask well-behaved crawlers to stay out of cache folders and private folders.

What to hide?
1. Cache folders & files
2. Search results
3. Login page

Don't forget to read about how robots.txt can be used to discover the actual path of a WordPress installation directory.

Amit Gupta

Hey! This is Amit Gupta (amty). By profession I am a software engineer, and teaching is my passion. Sometimes I am a teacher, as you can see from the many technical tutorials on my site, sometimes I am a poet, and sometimes just a friend of friends...

  • Hello, I have a couple of questions about the robots.txt file.
    1) If my blog is on Blogger, how do I upload a robots.txt file?
    2) I have the following URL that is restricted: http://www.bloggerbroadcast.com/search/label/Savings
    URL restricted by robots.txt Sep 23, 2011
    a) Is it bad for my site to have these labels restricted? Is this why my search widget doesn’t work?
    b) How do I stop these files from being restricted?

    Thanks for your time, any help would be appreciated.

  • Unfortunately, you cannot upload robots.txt to a Blogger or WordPress site unless you host it on your own server.
    Moreover, if you are searching within your own site, robots.txt doesn’t interfere with your search.
