Published: 2010-11-18
Last Updated: 2010-11-18 21:53:16 UTC
by Chris Carboni (Version: 1)
We received a report of a very aggressive web spider that apparently is not obeying robots.txt.

The report claims the spider is from

Here are a few interesting tidbits from that site.

"008 runs on a grid computing platform that consists of several thousand computers, which is why you may see our web crawler access your site from many different IP addresses."

"If you block 008 using robots.txt, you will see crawl requests die down gradually, rather than immediately. This happens because of our distributed architecture. Our computers only periodically receive robots.txt information for domains they are crawling."

And my personal favorite ...

"Blocking our web crawler by IP address will not work. Due to the distributed nature of our infrastructure, we have thousands of constantly changing IP addresses. We strongly recommend you don't try to block our web crawler by IP address, as you'll most likely spend several hours of futile effort and be in a very bad mood at the end of it."

Several thousand computers? Sounds like a recipe for a DDoS attack if I ever saw one, and I don't even want to think about what could happen if that site got 0wn3d.
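To get a feel for how many source addresses are actually hitting a site, a rough sketch like the one below could tally requests per IP for a given User-Agent token. It assumes the web server writes Combined Log Format and that the crawler identifies itself with a string containing "008/". Both are assumptions on my part, as are the sample log lines, the `crawler_hits_by_ip` helper name and the documentation-range IP addresses.

```python
import re
from collections import Counter

# Combined Log Format (assumed):
# ip ident user [time] "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"'
)

def crawler_hits_by_ip(lines, ua_token="008/"):
    """Count requests per client IP whose User-Agent contains ua_token.

    ua_token is a guess at how the crawler identifies itself; adjust it to
    whatever actually shows up in your own logs.
    """
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and ua_token in match.group(2):
            hits[match.group(1)] += 1
    return hits

# Hypothetical sample data, not taken from any real report.
SAMPLE = [
    '192.0.2.10 - - [18/Nov/2010:21:40:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; 008/0.83)"',
    '198.51.100.7 - - [18/Nov/2010:21:40:02 +0000] "GET /b HTTP/1.1" 200 734 "-" "Mozilla/5.0 (compatible; 008/0.83)"',
    '192.0.2.10 - - [18/Nov/2010:21:40:03 +0000] "GET /c HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; 008/0.83)"',
    '203.0.113.5 - - [18/Nov/2010:21:40:04 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    hits = crawler_hits_by_ip(SAMPLE)
    print(f"{len(hits)} distinct source IPs, {sum(hits.values())} requests")
    for ip, count in hits.most_common():
        print(ip, count)
```

Matching on a substring of the User-Agent is crude and trivially spoofed, so treat the numbers as a first look rather than proof of anything.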

Has anyone else seen this?  Let us know.
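Before reporting that a spider ignores robots.txt, it is worth confirming that the file really disallows it. The sketch below uses Python's standard `urllib.robotparser` to test a rule set offline. The robots.txt content, the `is_allowed` helper and the example.com URLs are illustrative assumptions; only the "008" name comes from the quotes above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the crawler named "008" everywhere,
# and keep everyone else out of /private/ only.
ROBOTS_TXT = """\
User-agent: 008
Disallow: /

User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    for agent in ("008", "ExampleBot"):
        for url in ("http://example.com/index.html",
                    "http://example.com/private/report.html"):
            print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

This only shows what a well-behaved parser would conclude from the file. It says nothing about what a given crawler actually does with it, and per the quote above, requests may keep arriving for a while even after a correct rule is in place.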

Christopher Carboni - Handler On Duty

