Robots.txt is a vital part of most websites. It feeds the robots of the different search engines, e.g. Google, Bing, etc.

Instructions in this file tell robots what to crawl (eat) and what not to crawl (eat) by blacklisting or whitelisting URLs. The rule of thumb for picking either approach is:

1.) Use whitelisting of public URLs if one wishes to block crawlers from eating up newly added URLs (without an entry being written for them in robots.txt). We whitelist URLs by writing

Allow: /users
Allow: /*/tests

in the robots.txt (see the complete sketch after the second approach below)

2.) Use blacklisting of URLs if one wishes to allow crawlers to eat up everything else. We blacklist URLs by writing

Disallow: /users
Disallow: /*/tests

in robots.txt
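For either approach, note that a robots.txt record starts with a User-agent line naming the robots the rules apply to (* matches all of them), and that an Allow line on its own blocks nothing: the whitelist approach only keeps crawlers away from unlisted or newly added URLs when it is paired with a catch-all Disallow. A minimal sketch of both styles, reusing the placeholder paths from above:

Whitelist style:

User-agent: *
Allow: /users
Allow: /*/tests
Disallow: /

Blacklist style:

User-agent: *
Disallow: /users
Disallow: /*/tests

Major crawlers such as Googlebot and Bingbot resolve conflicts in favour of the most specific (longest) matching rule, so in the first file the Allow lines win over the trailing Disallow: / for those paths.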

Now this is okay as long as one is only blocking public URLs from being content-aggregated. But it may pose a security threat to the web application if one blacklists secret URLs: since robots.txt is publicly available and in human-readable format, listing those URLs unknowingly makes them public.

Disallow: /secret.html
Disallow: /*/password.xml

So by blacklisting secret URLs, one makes them prone to attack. Instead of blacklisting secret URLs, one should whitelist the public ones

Allow: /public
Allow: /public/*
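Assuming /public is the prefix under which all crawlable content lives, the complete file would again pair these Allow lines with a catch-all Disallow, so that everything else, including any secret URL added later, stays blocked without ever being named:

User-agent: *
Allow: /public
Allow: /public/*
Disallow: /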

<3 <3 <3


About The Author

I am Pankaj Baagwan, a System Design Architect. I am a computer scientist at heart, a process enthusiast, and an open source author, contributor and writer. I advocate Karma and love working with cutting-edge, fascinating, open source technologies.

  • To consult Pankaj Bagwan on System Design, Cyber Security and Application Development, SEO and SMO, please reach out at me[at]bagwanpankaj[dot]com

  • For promotion/advertisement of your services and products on this blog, please reach out at me[at]bagwanpankaj[dot]com

Stay tuned <3. Signing off for RAAM