Recognizing ZLIB Compression
Last Updated: 2019-07-29 15:30:24 UTC
by Didier Stevens (Version: 1)
In diary entry "Analyzing Compressed PowerShell Scripts" and video "Video: Analyzing Compressed PowerShell Scripts" I show how to decompress ZLIB compressed data.
Let me share some more info on ZLIB compressed data. Compressing data with ZLIB is called deflating, and the algorithm is called DEFLATE.
When I compress the text "Hello, Hello, Hello, Hello" with Python's ZLIB module, I obtain the following binary data (represented in hexadecimal): 789cf348cdc9c9d751f0c0a400745608b5.
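A minimal sketch of how this example can be reproduced, assuming Python 3 and its standard zlib module (the variable names are only illustrative). The exact bytes can vary with the compression level and the zlib build, so the sketch prints the result rather than asserting a fixed value.

```python
import zlib

text = b"Hello, Hello, Hello, Hello"

# Compress with the default compression level
compressed = zlib.compress(text)

# Show the compressed bytes in hexadecimal
print(compressed.hex())

# Inflating the result gives back the original text
roundtrip = zlib.decompress(compressed)
print(roundtrip)
```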
This data is structured according to RFC 1950: the first byte (0x78 in this example) is known as CMF (Compression Method and Flags). This byte is very often equal to 0x78. The 4 least significant bits identify the compression method (8 is DEFLATE and 15 is reserved); the 4 most significant bits encode the size of the window when the compression method is 8. This value is often 7 (32K window size).
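The bit layout of CMF can be illustrated with a few lines of Python. This is only a sketch: parse_cmf is a hypothetical helper name, and the window size formula (2 to the power of the 4-bit value plus 8) comes from RFC 1950.

```python
def parse_cmf(cmf):
    """Split the CMF byte into compression method and window size info."""
    method = cmf & 0x0F   # 4 least significant bits: compression method
    cinfo = cmf >> 4      # 4 most significant bits: window size info
    # For method 8 (DEFLATE), the window size is 2 ** (cinfo + 8) bytes
    window_size = 1 << (cinfo + 8) if method == 8 else None
    return method, cinfo, window_size

print(parse_cmf(0x78))  # (8, 7, 32768)
```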
0x78 is a lowercase letter x, so it is easy to recognize in an ASCII dump. So, if you encounter some high entropy data that starts with x (0x78), it might be ZLIB compressed data according to RFC 1950.
My tool translate.py can be used, with function ZlibD (ZLIB Decompression), to decompress this data:
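For readers without translate.py at hand, the same data can be inflated with Python's standard zlib module. This sketch does not show how ZlibD is implemented; it is just an equivalent using zlib.decompress on the hexadecimal example from above.

```python
import zlib

# The RFC 1950 example data from this diary entry
data = bytes.fromhex("789cf348cdc9c9d751f0c0a400745608b5")

# zlib.decompress expects the 2-byte header and the Adler-32 trailer by default
text = zlib.decompress(data)
print(text)  # b'Hello, Hello, Hello, Hello'
```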
There's a second header byte after CMF: FLG (flags). And depending on these flags, there might be some more data, but usually, it's the compressed data that follows. This is compressed with the DEFLATE algorithm, and is structured according to RFC 1951. translate.py can also decompress this data, using function ZlibRawD.
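As a sketch of what this looks like in practice, the following Python code decodes the FLG byte of the example and then inflates only the RFC 1951 part. The field names (FCHECK, FDICT, FLEVEL) come from RFC 1950. The slicing assumes no preset dictionary is present, and a negative wbits value is how Python's zlib module selects raw DEFLATE; this is not a description of how ZlibRawD is implemented.

```python
import zlib

data = bytes.fromhex("789cf348cdc9c9d751f0c0a400745608b5")
cmf, flg = data[0], data[1]

fcheck = flg & 0x1F       # bits 0-4: check bits
fdict = (flg >> 5) & 1    # bit 5: preset dictionary present
flevel = flg >> 6         # bits 6-7: compression level hint

# CMF and FLG, read as a 16-bit big-endian number, must be a multiple of 31
header_ok = ((cmf << 8) | flg) % 31 == 0
print(fcheck, fdict, flevel, header_ok)

# Without a preset dictionary, the raw DEFLATE data sits between the
# 2-byte header and the 4-byte Adler-32 checksum at the end
raw = data[2:-4]
text = zlib.decompress(raw, -15)  # negative wbits: no header, no trailer
print(text)

# The trailer is the Adler-32 checksum of the uncompressed data
checksum_ok = zlib.adler32(text) == int.from_bytes(data[-4:], "big")
print(checksum_ok)
```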
If you mix up the data formats and the functions, you get an error.
Inflating data with a header with function ZlibRawD produces this error:
And inflating data without a header with function ZlibD produces this error:
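The same two mix-ups can be reproduced with Python's zlib module. This is a sketch with a hypothetical helper, try_inflate; the error messages come from the zlib library and are not necessarily the ones translate.py displays, so they are printed rather than listed here.

```python
import zlib

with_header = bytes.fromhex("789cf348cdc9c9d751f0c0a400745608b5")
raw = with_header[2:-4]  # RFC 1951 data only, header and checksum removed

def try_inflate(data, wbits):
    """Return the inflated data, or the zlib error message as a string."""
    try:
        return zlib.decompress(data, wbits)
    except zlib.error as e:
        return "error: %s" % e

# Data with a header, inflated as raw DEFLATE (negative wbits)
result_raw_mode = try_inflate(with_header, -15)
print(result_raw_mode)

# Data without a header, inflated as RFC 1950 data (positive wbits)
result_zlib_mode = try_inflate(raw, 15)
print(result_zlib_mode)
```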
In my tool file-magic.py, I have some custom definitions to detect ZLIB compressed data (RFC1950):
If you see a byte sequence starting with 7801, 789C or 78DA, your best bet is to first try to decompress it with ZlibD.
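A more generic check than matching these three byte pairs can be sketched in Python. This is not the definition used in file-magic.py; looks_like_zlib is a hypothetical helper that applies the RFC 1950 header rules (method 8, window size value of at most 7, and the two header bytes forming a multiple of 31). It remains a heuristic: other data can pass this test by chance.

```python
def looks_like_zlib(data):
    """Heuristic check for an RFC 1950 header at the start of data."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    return (
        (cmf & 0x0F) == 8                  # compression method is DEFLATE
        and (cmf >> 4) <= 7                # window size value is valid
        and ((cmf << 8) | flg) % 31 == 0   # header check bits are correct
    )

for header in ("7801", "789c", "78da", "7800"):
    print(header, looks_like_zlib(bytes.fromhex(header)))
```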
And GZIP? That's RFC 1952. The content of a GZIP file looks like this:
The compressed data in this example is RFC 1951.
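To illustrate how the three formats relate, here is a sketch using Python's gzip and zlib modules. It assumes the GZIP header has no optional fields (flags byte equal to 0), in which case RFC 1952 defines a 10-byte header and an 8-byte trailer (CRC-32 and uncompressed size) around the RFC 1951 data.

```python
import gzip
import zlib

text = b"Hello, Hello, Hello, Hello"
gz = gzip.compress(text)

magic = gz[:2]   # GZIP files start with 1f 8b
method = gz[2]   # 8 is DEFLATE
flags = gz[3]    # optional header fields (file name, comment, ...)
print(magic.hex(), method, flags)

# Without optional fields, the raw DEFLATE data sits between the
# 10-byte header and the 8-byte trailer
if flags == 0:
    raw = gz[10:-8]
    print(zlib.decompress(raw, -15))

# zlib can also handle the complete GZIP stream when wbits is 16 + 15
inflated = zlib.decompress(gz, 31)
print(inflated)
```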
I'll provide more details in an upcoming diary entry, but there are many tools to decompress GZIP files.