Extracting BTC addresses from emails
Last Updated: 2018-07-16 00:07:22 UTC
by Didier Stevens (Version: 1)
I was asked whether I had a tip for automatically extracting Bitcoin addresses from emails (cf. Retrieving and processing JSON data (BTC example)). I do.
My tool, re-search.py, comes with a regular expression to match Bitcoin addresses, and also with the Bitcoin address checksum validation algorithm.
Bitcoin addresses are Base58Check encoded: a Base58 string that encodes an integer and includes a checksum. The following regular expression will match a Bitcoin address:
Of course, regular expressions cannot be used for checksum calculations, and hence this regular expression will also match strings that are not valid Bitcoin addresses (e.g. correct syntax, but invalid checksum).
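To illustrate the difference between a syntax match and a checksum check, here is a minimal Python sketch. It is not the code from re-search.py: the pattern, the constant names and the functions is_valid_btc_address and find_valid_addresses are assumptions made for this example, and it only covers Base58 addresses starting with 1 or 3.

```python
import hashlib
import re

# Assumed pattern for Base58 addresses starting with 1 or 3.
# This is an illustration, not necessarily the pattern shipped with re-search.py.
BTC_CANDIDATE = re.compile(r"\b[13][a-km-zA-HJ-NP-Z1-9]{25,34}\b")

# Base58 alphabet: no 0, O, I or l.
BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def is_valid_btc_address(address):
    """Return True if the Base58Check checksum of the address is correct."""
    number = 0
    for char in address:
        index = BASE58_ALPHABET.find(char)
        if index == -1:
            return False
        number = number * 58 + index
    try:
        # 25 bytes: 1 version byte + 20 hash bytes + 4 checksum bytes.
        raw = number.to_bytes(25, "big")
    except OverflowError:
        return False
    # Every leading zero byte must be encoded as a leading '1'.
    leading_ones = len(address) - len(address.lstrip("1"))
    leading_zeros = len(raw) - len(raw.lstrip(b"\x00"))
    if leading_ones != leading_zeros:
        return False
    payload, checksum = raw[:-4], raw[-4:]
    digest = hashlib.sha256(hashlib.sha256(payload).digest()).digest()
    return digest[:4] == checksum


def find_valid_addresses(text):
    """Return regex candidates from text that also pass the checksum."""
    return [c for c in BTC_CANDIDATE.findall(text) if is_valid_btc_address(c)]
```

A string that differs from a real address only in its last character still matches the pattern, but the checksum test rejects it.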
My re-search.py tool contains a function to validate Bitcoin addresses (BTCValidate) by checking the checksum. It is used like this:
(?# ... ) is a comment for regular expressions, and is thus ignored by regular expression engines, but re-search interprets this comment to take extra actions, like in this case, calling BTCValidate.
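The general mechanism can be sketched in a few lines of Python. This is only an illustration of the idea, assuming a hypothetical directive syntax (?#validate=NAME); it is not the syntax or the code that re-search.py uses. The function search_with_directive and the stand-in validator even_digit_sum are invented for this example; a real setup would register a checksum function instead.

```python
import re

# Hypothetical directive syntax, invented for this sketch: (?#validate=NAME)
DIRECTIVE = re.compile(r"\(\?#validate=(\w+)\)")


def search_with_directive(pattern, text, validators):
    """Match pattern against text; if the pattern carries a validate
    directive inside a regex comment, keep only matches the validator accepts."""
    directive = DIRECTIVE.search(pattern)
    # Python's re module ignores the (?#...) comment when compiling.
    matches = [m.group(0) for m in re.finditer(pattern, text)]
    if directive is None:
        return matches
    check = validators[directive.group(1)]
    return [match for match in matches if check(match)]


def even_digit_sum(candidate):
    """Stand-in validator: accept strings whose digits sum to an even number."""
    return sum(int(digit) for digit in candidate) % 2 == 0


validators = {"even_digit_sum": even_digit_sum}
pattern = r"(?#validate=even_digit_sum)\b\d{4}\b"
print(search_with_directive(pattern, "1234 1235 2222", validators))
```

The regular expression engine sees an ordinary pattern, while the wrapper reads the comment and applies the extra filter after matching.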
This is the command I use to extract Bitcoin addresses from emails:
Option -n with argument btc directs re-search.py to look up and use the regular expression named btc from its library. That's the regular expression for Bitcoin addresses.
Option -c directs re-search.py to perform case-sensitive matches (Bitcoin addresses can contain an uppercase letter L but not a lowercase letter l).
Option -u directs re-search.py to produce a list of unique Bitcoin addresses, i.e. to remove duplicate entries.
And finally, option -e directs re-search.py to extract strings from the files it processes (*.vir files). That's because the extortion emails that I have come in various formats: MIME files, RTF files, MSG files (i.e. OLE files). OLE files are a binary format, and by default re-search.py reads text files. Option -e extracts ASCII and Unicode strings from binary files (and text files too) before processing.
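For readers who want to see what string extraction and de-duplication involve, here is a rough Python sketch. It is not how re-search.py implements options -e and -u: the function names extract_strings and unique_in_order, the minimum length of 4, and the assumption that Unicode strings are UTF-16LE with printable ASCII characters are all choices made for this example.

```python
import re


def extract_strings(data, min_length=4):
    """Return printable ASCII runs and UTF-16LE runs found in binary data.

    Assumption for this sketch: "Unicode" strings are printable ASCII
    characters stored as UTF-16LE (each character followed by a zero byte).
    """
    ascii_run = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    utf16_run = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_length)
    strings = [m.group(0).decode("ascii") for m in ascii_run.finditer(data)]
    strings += [m.group(0).decode("utf-16-le") for m in utf16_run.finditer(data)]
    return strings


def unique_in_order(items):
    """Remove duplicates while keeping the first occurrence of each item."""
    return list(dict.fromkeys(items))


# Example: binary data holding one ASCII string and one UTF-16LE string.
sample = b"\x00\x01hello world\xff\xfe" + "wide text".encode("utf-16-le") + b"\x00\x02ab"
print(extract_strings(sample))
```

The extracted strings can then be searched with a case-sensitive regular expression, and the resulting matches passed through unique_in_order to drop repeats.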