The Process by Which a Robots.txt File Blocks Web Pages

A robots.txt file restricts the access of search engine “spiders” that “crawl” a website. These automated bots check whether a robots.txt file exists before accessing the pages of a site. A website needs a robots.txt file whenever it contains content that the owner does not want indexed. Google follows robots.txt directives and will not “crawl” the content or pages blocked by the file; it can, however, still index those URLs if they are linked from other pages on the web. In such cases, the URL of the page, along with other publicly available information such as the anchor text of links pointing to the site or the page title from the Open Directory Project, can still appear in Google search results. To use a robots.txt file, it is necessary to have access to the root of the domain.
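As a minimal sketch of what such a file looks like, the following blocks all compliant crawlers from one directory (the domain and the “/private/” path are illustrative assumptions, not taken from the article):

    # Served at the domain root, e.g. https://www.example.com/robots.txt
    # Tells every compliant crawler to stay out of /private/
    User-agent: *
    Disallow: /private/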


Site owners who cannot access the root of the domain can still restrict how their pages are indexed by using the robots meta tag, sketched just below.
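As a brief sketch, a page that should stay out of the index can carry the standard robots meta tag in its head section:

    <meta name="robots" content="noindex">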
With a robots.txt file in place, the following procedure helps to block pages:
01)      First, click the desired site on the Webmaster Tools home page.

02)      Under the “Site configuration” tab, click “Crawler access”.

03)      Next, click the “Robots.txt” tab.

04)      Select the default robot access. It is advisable to keep the default permissive and restrict only specific bots; this avoids the problem of crucial “crawlers” being accidentally blocked from indexing the website.

05)      To block Googlebot from all files and directories, choose “Disallow” as the action, select “Googlebot” in the robot list, type the files or directories into the box, and click “Add”. The code for the robots.txt file is then generated automatically (a sketch of such a file follows this list).

06)      Save the robots.txt file by downloading it and copying its contents into a .txt file. The file should be saved in the highest-level directory of the website: a robots.txt file located in a subdirectory is not valid, because bots check only the root of a domain for its presence.
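As referenced in step 05, a hedged sketch of what the generated file might contain when Googlebot is disallowed from the entire site (the exact output of the Webmaster Tools generator may differ, and example.com is an illustrative domain):

    # Must live at the domain root, e.g. https://www.example.com/robots.txt
    # (a copy at https://www.example.com/folder/robots.txt would be ignored)
    User-agent: Googlebot
    Disallow: /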


Some rogue robots do not respect the instructions in a robots.txt file. Under such circumstances, it is suggested that all confidential information be kept in a password-protected directory on the server. Because different robots can interpret the text file differently, care should be taken to create a robots.txt file that works consistently for all of them. Each protocol and hostname needs its own robots.txt file, so a site whose content is served over both http and https should make the file available on each. Writing the file is easy, as it is a plain ASCII text file. The file normally lists the names of “spiders” and the directories they are not allowed to access. The asterisk wildcard in the User-agent line applies a rule to all robots; paired with a Disallow line for the “cgi-bin” directory, it tells every robot that it may not access any content in that directory or its descendants.
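A brief sketch of the wildcard usage described above (the cgi-bin path comes from the article; the comment lines are illustrative):

    # The * wildcard matches every user agent, so the rule applies to all robots
    User-agent: *
    # Block the cgi-bin directory and everything beneath it
    Disallow: /cgi-bin/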
