Not all subdirectories were made for all people; at least not to see. For example, you may not want a Google cache exposing the mistakes of a test site you’re working on. Similarly, you may have a photo album that you’d rather not find showing up on search.yahoo.com. How do you keep the various search engines from indexing these directories while still including the rest? Simple …
A long time ago, on a blog far away, we discussed the ‘Robot Exclusion Tutorial’ using just enough geek so you don’t shoot your foot clean off. Meaning, by simply using the robots exclusion standard, you can usually keep most ‘well behaved’ search engines from ‘spidering’ into semi-private directories.
Note the emphasis on ‘well behaved.’ There are some nasty-bots out there that of course treat such entries as an engraved invitation to sneak a peek. Mark Pilgrim wrote about such ne’er-do-wells, even setting up a form of a ‘honey pot’ to ‘nail the suckas.’
That said, while searching for various robots.txt validators, I came across a tool that generates a robots.txt file based upon entries you make … along with offering an option to exclude ‘nasty bots.’ Judging from the prompts on the page, though, the webmaster needs to get two things straight.
First, you’re not really ‘blocking’ anything; you’re requesting that a search engine not crawl a stated path. Second, you don’t make an entry like ‘www.yoursite.com/private/’ but rather ‘/private/’.
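To make both points concrete, here’s a minimal robots.txt sketch. The directory names are placeholders I made up for illustration; note that each Disallow path starts at the site root with a slash, not with your domain name:

```
# Hypothetical robots.txt - directory names are examples only.
# Paths are relative to the site root: "/private/", never "www.yoursite.com/private/".
User-agent: *
Disallow: /private/
Disallow: /test-site/
Disallow: /photo-album/
```

Remember, this is a polite request, not an access control: well-behaved crawlers honor it, while the nasty-bots mentioned above may read it as a map to the good stuff.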
Once you get past that, the robots.txt generator tool works as advertised. When you’re done there, you may also want to visit my post listing posts on blocking sites using mod_rewrite via .htaccess and other such fun.
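For the bots that ignore robots.txt entirely, an .htaccess rule can refuse them outright. A minimal sketch, assuming Apache with mod_rewrite enabled; ‘NastyBot’ is a made-up User-Agent string standing in for whichever crawler is misbehaving in your logs:

```
# Hypothetical .htaccess fragment - "NastyBot" is a placeholder User-Agent.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} NastyBot [NC]
RewriteRule .* - [F,L]
```

Unlike robots.txt, this actually enforces the block by returning a 403 Forbidden, though a determined spammer can always lie about its User-Agent.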
Too bad we just can’t send all the spammers on a rocket ship to the sun or something.