SANS ISC: Scripting Web Categorization SANS ISC InfoSec Forums

Scripting Web Categorization

When you are dealing with a huge amount of data, it can be very useful to enrich it with additional context. Examples:

  • Geolocation of IP addresses
  • Getting the DShield score of an IP address
  • Looking up domain names in a list of malicious domains
  • ...
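
As an illustration of this kind of enrichment, here is a minimal sketch of the third item: flagging records whose domain appears in a list of malicious domains. The record layout (a dict with a "domain" key), the function name and the sample data are assumptions made for this example only.

```python
def enrich_with_blocklist(records, malicious_domains):
    """Return copies of the records with a 'malicious' flag added."""
    # Normalise once so that lookups ignore case and a trailing dot.
    blocklist = {d.strip().lower().rstrip(".") for d in malicious_domains}
    enriched = []
    for record in records:
        domain = record["domain"].strip().lower().rstrip(".")
        # Copy the record so the caller's data is left untouched.
        enriched.append(dict(record, malicious=domain in blocklist))
    return enriched


# Hypothetical sample data, using documentation-reserved names and addresses.
events = [
    {"src": "192.0.2.10", "domain": "www.example.com"},
    {"src": "192.0.2.11", "domain": "Bad.Example.NET."},
]
for event in enrich_with_blocklist(events, ["bad.example.net"]):
    print(event)
```

Using a set keeps each lookup cheap, which matters once the list of malicious domains grows large.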

When you are processing many URLs during a security incident investigation, or while extracting IOCs from a malware sample or logs, it can also be very useful to categorize them. Categorization tags a URL with a label such as the classic "Adult Content", "Government", "Forums", etc. Many commercial solutions offer this feature, and it can be very powerful to configure your firewall to deny access to non-business categories. But because the feature is integrated into closed solutions, it is not easy to reuse this information in your own scripts.

For years, Bluecoat has had a product called "K9" that helps protect kids surfing the web. It's free: you just get a license key and install the tool or... use the online API! I had to categorize a bunch of URLs, so I decided to take some time to write a few lines of Python to automate this task.

My script fetches the defined categories at regular intervals (every two hours) and performs a lookup for each URL passed as an argument:

$ ./,Education

Multiple URLs can be passed on the same command line, or the script can be fed via STDIN if you use "-" as the parameter:

$ ./,Education,Technology/Internet
$ cat suspicious-urls.tmp | ./ -,Business/Economy,Business/Economy,Malicious Outbound Data/Botnets,Malicious Outbound Data/Botnets,Malicious Sources/Malnets,Malicious Sources/Malnets,Malicious Outbound Data/Botnets,Uncategorized,Malicious Outbound Data/Botnets,Malicious Outbound Data/Botnets,Malicious Outbound Data/Botnets,Uncategorized,Uncategorized,Sports/Recreation,Malicious Sources/Malnets,Malicious Sources/Malnets,Malicious Sources/Malnets,Malicious Sources/Malnets,Malicious Sources/Malnets,Malicious Outbound Data/Botnets
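
The input handling described above (URLs given as arguments, or read from STDIN when "-" is used) could look roughly like the sketch below. This is not the author's code: the function names are hypothetical, and the default port of 80 for the "fqdn[:port]" format is an assumption.

```python
import io


def collect_urls(arguments, stream):
    """Return the URLs to check, reading from `stream` when "-" is given."""
    urls = []
    for argument in arguments:
        if argument == "-":
            # "-" means: take one URL per line from the input stream.
            for line in stream:
                line = line.strip()
                if line:  # skip blank lines
                    urls.append(line)
        else:
            urls.append(argument)
    return urls


def split_host_port(url, default_port=80):
    """Split 'fqdn[:port]' into (fqdn, port); default_port is an assumption."""
    host, _, port = url.partition(":")
    return host, (int(port) if port else default_port)


# A StringIO object stands in for sys.stdin in this example.
fake_stdin = io.StringIO("www.example.com\n\nwww.example.org:8080\n")
for url in collect_urls(["-"], fake_stdin):
    print(split_host_port(url))
```

In a real script, `arguments` would come from the parsed command line and `stream` would be `sys.stdin`.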

The API returns a hexadecimal code corresponding to the web category. That's why the script fetches the categories at regular intervals and stores them in a local file:

$ ./ -h
usage: [-h] [-f CACHEFILE] [-F] [URL [URL ...]]

Categorize URL using BlueCoat K9

positional arguments:
  URL                   the URL(s) to check. Format: fqdn[:port]

optional arguments:
  -h, --help            show this help message and exit
                        Categories local cache file (default:
  -F, --force           force a fetch of categories
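
The caching behaviour (refresh the category list when the local file is older than two hours, or when a fetch is forced) can be sketched as follows. This is only an illustration under stated assumptions: the download itself is replaced by a caller-supplied `fetch` function, the cache is assumed to be a JSON object keyed by lowercase hexadecimal codes, and all names are hypothetical.

```python
import json
import os
import time

MAX_AGE_SECONDS = 2 * 60 * 60  # two hours, the interval mentioned in the article


def cache_is_fresh(path, max_age=MAX_AGE_SECONDS, now=None):
    """True if the cache file exists and is younger than max_age seconds."""
    if not os.path.exists(path):
        return False
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) < max_age


def load_categories(path, fetch, force=False):
    """Return {hex_code: name}, refreshing the cache file when needed.

    `fetch` is a caller-supplied function returning a dict; it stands in
    for the real download, which is not shown here.
    """
    if force or not cache_is_fresh(path):
        categories = fetch()
        with open(path, "w") as handle:
            json.dump(categories, handle)
        return categories
    with open(path) as handle:
        return json.load(handle)


def category_name(categories, hex_code):
    """Translate a hexadecimal code into its label (assumed key format)."""
    key = format(int(hex_code, 16), "x")  # accepts "1B", "1b" or "0x1B"
    return categories.get(key, "Unknown")
```

Passing `force=True` mirrors the `-F` option: the cache file is rewritten even if it is still fresh.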

Before using the script, you have to register to get your K9 license and add it to the script (line 30).

Note: I'm not aware of any rate limit in place when querying the API. During my investigations, I was never blocked.

Xavier Mertens
ISC Handler - Freelance Security Consultant


I tried the script and I am getting an error like "No JSON object could be decoded". Any help is appreciated. (Windows 8, Python 2.7)

C:\Python27>python -F
Traceback (most recent call last):
  File "", line 133, in <module>
  File "", line 107, in main
    webCats = fetchCategories(args.cacheFile)
  File "", line 43, in fetchCategories
    data = json.load(r)
  File "C:\Python27\lib\json\", line 290, in load
  File "C:\Python27\lib\json\", line 338, in loads
    return _default_decoder.decode(s)
  File "C:\Python27\lib\json\", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Python27\lib\json\", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
