Search engines that are no search engines

Published: 2007-11-09
Last Updated: 2007-11-09 15:53:19 UTC
by Johannes Ullrich (Version: 1)
0 comment(s)
The DShield database was running a bit "hot" earlier today, so I took a closer look at the web log and found that one particular "search engine" was indexing the site rather aggressively:

a.b.c.d - - [09/Nov/2007:15:24:35 +0000] "GET /portreportascii.html?date=2007-11-09 HTTP/1.0" 200 500572 "-" "gsa-crawler (Enterprise; S5-FTNF3BWZPUJAS; nobody@google.com)" "-"

At first, I thought "oh well, its google". But looking at the user agent string closer, reveals some subtle differences. This is a Google search appliance, not the uber-google-bot we all love. The regular Google bot looks like this:

66.249.65.233 - - [09/Nov/2007:15:24:37 +0000] "GET /date.html?port=47109&date=2007-10-25 HTTP/1.1" 200 7538 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"

I have seen similar cases a few times now. While this one was not malicious, in some cases attacks used google's (or other search engine) user agent strings. I can only assume that this is an attempt to fit in better, and maybe retrieve a search engine version of the page. If anybody knows a good reference where to find IP address ranges used by certain search engines: let us know.

(and btw... if you need bulk data access to dshield data: Please ask. Spidering the site is just not very efficient and you will run into some anti-harvesting traps sending you in circles)

-----
Johannes B. Ullrich
Chief Research Officer, SANS Technology Institute

Keywords:
0 comment(s)

Comments


Diary Archives