
Getting a Better Handle on International Domain Names and Punycode

Published: 2025-08-26. Last Updated: 2025-08-26 16:34:11 UTC
by Johannes Ullrich (Version: 1)

International domain names (IDNs) continue to be an interesting topic. For the most part, they are probably less of an issue than some people make them out to be, given that popular browsers like Google Chrome are pretty selective about displaying them. On the other hand, they are still in use, legitimately or not, and it is worth keeping a handle on them.

When analyzing DNS traffic, you will see these domain names in their Punycode encoding. Punycode is defined in RFC 3492 [1]. Punycode-encoded domain names start with "xn--", which makes them easy to identify.
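
As a minimal sketch of that first step, the "xn--" prefix can be used to pick Punycode labels out of a name seen in DNS traffic and decode them with the "punycode" codec that ships with Python. The function names and the sample domain are illustrative assumptions, not taken from the idntest repository.

```python
ACE_PREFIX = "xn--"  # prefix that marks a Punycode-encoded label


def punycode_labels(domain):
    """Return the labels of `domain` that start with the "xn--" prefix."""
    labels = domain.lower().rstrip(".").split(".")
    return [label for label in labels if label.startswith(ACE_PREFIX)]


def decode_label(label):
    """Decode one label; labels without the prefix are returned unchanged."""
    if not label.lower().startswith(ACE_PREFIX):
        return label
    # Strip the prefix and hand the remainder to Python's built-in codec.
    return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


print(punycode_labels("www.xn--mnchen-3ya.de"))  # ['xn--mnchen-3ya']
print(decode_label("xn--mnchen-3ya"))            # münchen
```

This only undoes the Punycode step. It does not apply any of the additional IDNA rules about which characters are permitted in a label.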

Several anomalies can occur with Punycode; luckily, some Python modules can help us identify them.

1 - Invalid Punycode

The Punycode standard is complex, and you may end up with invalid Punycode domains: labels that start with "xn--" but do not decode to a Unicode string.
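
One simple way to flag such labels, sketched here under the assumption that "invalid" means "Python's built-in punycode codec refuses to decode it", is to attempt the decode and catch the resulting error. The helper name is hypothetical, and a label that passes this check may still violate other IDNA rules.

```python
def is_valid_punycode(label):
    """Return False if an "xn--" label cannot be decoded as Punycode."""
    if not label.lower().startswith("xn--"):
        return True  # plain ASCII label, nothing to decode
    try:
        # The codec raises a UnicodeError subclass on malformed input.
        label[4:].encode("ascii").decode("punycode")
    except UnicodeError:
        return False
    return True


for candidate in ("xn--mnchen-3ya", "xn--mnchen-3y"):
    print(candidate, is_valid_punycode(candidate))
```

The second sample is the first one with its final character removed, which leaves the encoded part truncated and makes the decoder give up.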

2 - Mixed Script

This is the most interesting issue: detecting whether a domain name mixes different languages. There is no easy way to identify the "language" of a character; instead, we use its "script". A script is a set of characters shared by a group of languages. The Latin script, for example, is used for most European languages.

In Python, the "unicodedata2" module can be used to look up the Unicode name of a character, and the first word of that name generally identifies the script the character belongs to. Mixing different scripts in a domain name is suspicious, as legitimate international domain names should normally use only one script.
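
The name-based lookup described above could be sketched roughly as follows. The function names are hypothetical, and the fallback to the standard library "unicodedata" module (which offers the same name() call, with possibly older Unicode data) is an assumption added so the example runs without extra packages. The first word of the name is only an approximation of the script.

```python
try:
    import unicodedata2 as unicodedata  # newer Unicode tables, if installed
except ImportError:
    import unicodedata  # standard library module with the same interface


def scripts_in_label(label):
    """Collect the first word of each letter's Unicode name as its script."""
    scripts = set()
    for ch in label:
        if not ch.isalpha():
            continue  # digits and hyphens carry no script information here
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts


def is_mixed_script(label):
    """Flag a label whose letters come from more than one script."""
    return len(scripts_in_label(label)) > 1


# "\u0430" is CYRILLIC SMALL LETTER A, which looks like a Latin "a".
print(scripts_in_label("\u0430pple"))
print(is_mixed_script("apple"))
```

The check is meant to run on the decoded Unicode label, not on the "xn--" form, since the encoded form contains only ASCII characters.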

You can find a quick Python implementation on GitHub: https://github.com/jullrich/idntest

[1] https://datatracker.ietf.org/doc/html/rfc3492


Johannes B. Ullrich, Ph.D., Dean of Research, SANS.edu