Would anyone out there care to share their experience using CentOS as a web server, along with their robots.txt files?

I realize this is a somewhat simple exercise, yet I am sure there are both large and small hosters out there, and possibly those with high traffic configure their robots.txt files differently than others do.

Please share if you can or care to.

For years we have just used a * (allow all) and a Disallow on things like /cgi-bin.

As examples of places to visit, for those out or in the know:

http://www.robotstxt.org/

http://en.wikipedia.org/wiki/Robots_exclusion_standard

http://www.google.com/robots.txt

and others...

Quite frankly, there are many orgs out there that don't follow this anyway, right?

Anyone?

tia

 - rh
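For anyone following along, the allow-all-plus-/cgi-bin setup described above looks roughly like this (a minimal sketch; any additional Disallow paths depend on your own site layout):

```
# Applies to all crawlers that honor the robots exclusion standard
User-agent: *
# Hide the CGI directory; an empty Disallow would permit everything
Disallow: /cgi-bin/
```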
On Sat, Jan 16, 2010 at 10:18 PM, R-Elists <lists07 at abbacomm.net> wrote:
> quite frankly, there are many orgs out there that don't follow this
> anyway, right?

Right:

http://blogs.perl.org/users/cpan_testers/2010/01/msnbot-must-die.html

_______________________________________________
CentOS mailing list
CentOS at centos.org
http://lists.centos.org/mailman/listinfo/centos
On Sat, 2010-01-16 at 14:18 -0800, R-Elists wrote:
> quite frankly, there are many orgs out there that don't follow this
> anyway, right?

Since robots.txt is a "suggestion" and .htaccess is actually enforced, I use
a simple robots.txt like this:

User-agent: *
Disallow:

and put the bad guys into .htaccess.

--
MELVILLE THEATRE ~ Melville Sask ~ http://www.melvilletheatre.com
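A minimal sketch of the .htaccess side of that approach, assuming Apache 2.2 with mod_setenvif loaded (the "BadBot" User-Agent string is only a placeholder, not a real crawler name):

```
# Flag requests whose User-Agent matches a known-bad pattern
# (case-insensitive); "BadBot" here is an example, not a real bot name.
SetEnvIfNoCase User-Agent "BadBot" bad_bot

# Apache 2.2-style access control: allow everyone except flagged clients
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

Unlike robots.txt, this is enforced server-side, so it works even against crawlers that never fetch or ignore robots.txt.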
Add

User-agent: Slurp
Crawl-delay: 86400

to stop misbehaving Yahoo bots. Slurp is often misbehaving, but it at least
follows these rules. Something you can't say of Googlebot, for instance.

Kai

--
Get your web at Conactive Internet Services: http://www.conactive.com