
seeqpod.com is flooding librivox

Wednesday, February 27th, 2008

Seeqpod.com runs a very badly behaved spider that identifies itself as ‘heritrix’. It has been responsible for two librivox outages in the last three days. Specifically, it opens an extremely large number of simultaneous connections and requests the exact same URL over and over again. Here’s a log snippet. Imagine the same line about 6000 times in just a few minutes:

4.71.164.213 - - [26/Feb/2008:17:35:14 -0800] "GET /2007/06/ HTTP/1.0" 200 21312 "http://librivox.org/far-away-and-long-ago-by-wh-hudson/" "Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.seeqpod.com)"
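For anyone who wants to confirm a flood like this in their own logs, here is a rough sketch in Python. It assumes the Apache "combined" log format shown above and IPv4 client addresses, and it tallies hits by the first three octets of the address, by User-Agent, and by request line. The names `LINE_RE` and `tally` are made up for this example, and this is not the tooling I actually used.

```python
import re
from collections import Counter

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def tally(lines):
    """Count hits per address block, per User-Agent, and per request line.

    The "block" is simply the first three octets of an IPv4 address
    (an assumption; IPv6 clients are not handled sensibly here).
    Lines that do not match the combined format are skipped.
    """
    by_block = Counter()
    by_agent = Counter()
    by_request = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue
        block = ".".join(m.group("ip").split(".")[:3])
        by_block[block] += 1
        by_agent[m.group("agent")] += 1
        by_request[m.group("request")] += 1
    return by_block, by_agent, by_request


# Example use (the log path is hypothetical):
#
#     with open("access.log") as f:
#         blocks, agents, requests = tally(f)
#     print(blocks.most_common(5))
#     print(agents.most_common(5))
#     print(requests.most_common(5))
```

A single address block, User-Agent, and request line dominating all three counters within a few minutes is the pattern described here.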

The hits all came from different IPs in the 4.71.164 block. Occasionally the User-Agent was “Python-urllib/2.4” instead. I blocked both with .htaccess. This kind of bad programming is totally inexcusable in this day and age, especially considering that their about page claims their algorithm was developed at the Lawrence Berkeley National Laboratory.

Please fix this.