Analyzing JPEG files
Last Updated: 2017-09-10 18:10:51 UTC
by Didier Stevens (Version: 1)
As part of the PDF analysis I started last week, I have to analyze a JPEG file. I usually do this with a binary editor with templates (010 Editor), but this is not an open source solution.
I made a tool (written in Python) to help me analyze JPEG files. The tool, jpegdump.py, is still in beta. Before I finish my short diary entry series "It is a resume", I want to show some analysis examples with this tool.
First a normal JPEG file:
Each line describes one marker and the data that belongs to it. We see that the file starts with a Start Of Image marker (SOI) at position 0, and ends with an End Of Image marker (EOI), with no data following this marker. So that looks clean.
And then we have the markers we can expect: application (APP?), quantization tables (DQT), start of frame (SOF), Huffman tables (DHT), and finally the compressed image: start of scan (SOS). That is what we can expect in a normal image.
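To make the structure described above concrete, here is a minimal sketch of how a JPEG file can be walked marker by marker. This is not the code of jpegdump.py: the names walk_markers and marker_name are made up for illustration, and the sketch only knows a handful of marker names. It relies on the general JPEG layout: a marker is 0xFF followed by a code byte, most markers are followed by a 2-byte big-endian length that includes the length field itself, and the compressed data after SOS runs until the next real marker.

```python
import struct

# Markers without a length field: TEM, RST0..RST7, SOI, EOI.
STANDALONE = {0x01, 0xD8, 0xD9} | set(range(0xD0, 0xD8))

# Only a few names, enough for this illustration.
NAMES = {0xD8: "SOI", 0xD9: "EOI", 0xDB: "DQT", 0xC4: "DHT",
         0xDA: "SOS", 0xC0: "SOF0", 0xC2: "SOF2"}


def marker_name(code):
    """Return a readable name, or FFxx for markers this sketch does not know."""
    if 0xE0 <= code <= 0xEF:
        return "APP%d" % (code - 0xE0)
    return NAMES.get(code, "FF%02X" % code)


def walk_markers(data):
    """Yield (position, name, data_length) for each marker found in order."""
    pos = 0
    while pos + 1 < len(data):
        if data[pos] != 0xFF:
            break  # not a marker: stop instead of guessing
        code = data[pos + 1]
        if code in STANDALONE:
            yield pos, marker_name(code), 0
            pos += 2
            if code == 0xD9:  # EOI ends the image
                break
            continue
        if pos + 4 > len(data):
            break  # truncated length field
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        yield pos, marker_name(code), length - 2
        pos += 2 + length
        if code == 0xDA:
            # Skip the compressed image data: inside it, FF00 is an
            # escaped 0xFF byte and FFD0..FFD7 are restart markers.
            while pos + 1 < len(data):
                nxt = data[pos + 1]
                if data[pos] == 0xFF and nxt != 0x00 and not 0xD0 <= nxt <= 0xD7:
                    break
                pos += 1
```

Calling list(walk_markers(open(filename, "rb").read())) on a normal image should give a sequence like the one described above: SOI, APP, DQT, SOF, DHT, SOS and finally EOI.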
Compare this with a JPEG file containing an exploit I created with Metasploit:
Most markers look normal, but not number 6: this is an unknown marker (FFAC), and it does not directly follow the data of the previous marker (5 DHT): there is a gap of 108 bytes (d=108).
This unknown marker is also supposed to have 15457 bytes of data, but the last message (negative trailing) informs us that fewer bytes than that remain in the file.
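The two checks behind these observations are simple arithmetic, and can be sketched as follows. This is my own illustration and not jpegdump's code: the function name check_layout and the format of its messages are assumptions. It takes the position and declared size (marker, length field and data together) of each segment, and compares where each segment starts with where the previous one ended.

```python
def check_layout(segments, file_size):
    """Report gaps between segments and data missing or left over at the end.

    segments: list of (position, declared_size) tuples in file order,
              where declared_size covers marker, length field and data.
    file_size: total size of the file in bytes.
    """
    findings = []
    expected = 0  # where the next segment should start
    for index, (position, size) in enumerate(segments, start=1):
        gap = position - expected
        if gap != 0:
            findings.append("segment %d: d=%d" % (index, gap))
        expected = position + size
    # Positive: bytes after the last segment. Negative: the last
    # segment claims more bytes than the file actually contains.
    trailing = file_size - expected
    if trailing != 0:
        findings.append("trailing=%d" % trailing)
    return findings
```

An empty result means every segment starts exactly where the previous one ends and the last one ends exactly at the end of the file.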
Another more subtle anomaly is the entropy of the data in the Huffman table (e=7.26...): this looks high for a Huffman table.
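For readers who want to reproduce such a number: I am assuming here that the entropy is the usual Shannon entropy of the byte values, which ranges from 0 (a single repeated byte) to 8 (all byte values equally frequent). A minimal sketch, with byte_entropy as a name I chose for this example:

```python
import math
from collections import Counter


def byte_entropy(data):
    """Shannon entropy of a bytes object, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # how often each byte value occurs
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Encrypted, compressed or encoded payloads tend to score close to 8, while structured data like a table of code lengths and symbol values usually scores much lower.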
With jpegdump, we can dump the content of the data of the Huffman table in marker 5:
This data looks random, and not like a normal Huffman table. For comparison, here is a Huffman table dump of the first image we analyzed:
You can see that in this table, the data is far less random.
Let's see if we can find anything interesting in this random looking data. First we look for strings in the data starting with marker 5 (position 0xae):
We can clearly see an IP address, and something that resembles BASE64 data or the path of a URL.
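Looking for strings in binary data can be done with any strings-like tool. As an illustration, here is a minimal sketch in Python that extracts runs of printable ASCII characters; the function name ascii_strings and the default minimum length of 4 are my choices for this example.

```python
import re


def ascii_strings(data, min_length=4):
    """Return all runs of at least min_length printable ASCII characters."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]
```

To search from a given position only, slice the data first, for example ascii_strings(data[0xae:]).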
URLs used by Metasploit payloads encode data, and I have a tool to try to decode this data (metatool.py). Let's try this here:
This confirms that this is a Metasploit exploit: metatool can extract the payload UID, platform and architecture, and also the timestamp when I created the payload.
This is how I proceed when I analyze data structures: I take an overall look at the structure, checking if all expected elements are there. And if I find anomalies, I take a closer look.
In my next diary entry, I will do this for the image in the PDF I was analyzing.