Anthony Y. Fu, Xiaotie Deng, Wenyin Liu, and Greg Little: Methodology and an Application to Fight Unicode Attacks

July 14, 2006 by Ping

Read the paper here.

Unicode makes available a wide range of similar-looking characters that can be used to fool us into trusting the wrong domain name or address.  For example, “citibank” can be spelled with similar-looking characters in over 200 billion different ways (there are about 20 characters that look like “c”, 58 that look like “i”, and so on).
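As a rough sanity check on that figure, here is a minimal Python sketch. It assumes the count is simply the product of the number of look-alikes available at each position, which the post does not spell out. Only the counts for "c" and "i" come from the text above; the names `KNOWN_LOOKALIKES` and `CLAIMED_TOTAL` are made up for illustration. Rather than guess the counts for the other letters, it works out how many look-alikes they would need on average to reach 200 billion.

```python
import math

WORD = "citibank"
# Counts quoted in the post; the other letters are not given there.
KNOWN_LOOKALIKES = {"c": 20, "i": 58}
CLAIMED_TOTAL = 200e9  # "over 200 billion", treated as a lower bound

# Contribution of the positions whose counts we know (c, i, i).
known_product = math.prod(
    KNOWN_LOOKALIKES[ch] for ch in WORD if ch in KNOWN_LOOKALIKES
)

# Positions with unknown counts (t, b, a, n, k).
remaining = [ch for ch in WORD if ch not in KNOWN_LOOKALIKES]

# What the unknown positions must contribute in total, and the
# geometric mean per letter that this implies.
needed = CLAIMED_TOTAL / known_product
per_letter = needed ** (1 / len(remaining))

print(f"known positions contribute {known_product:,} spellings")
print(f"remaining {len(remaining)} letters need about {per_letter:.1f} look-alikes each")
```

Under that assumption, the remaining five letters need roughly twenty look-alikes apiece, which is in line with the count given for "c".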

The authors presented two methods for automatically detecting the confusability of Unicode strings: one based on visual and semantic edit distance (VSED) and one based on the Knuth-Morris-Pratt algorithm (VSKMP).  Both use a table of visual similarity between characters, produced by comparing the pixels of characters rendered in Arial Unicode MS, and a table of semantic similarity, produced by hand.
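To give a feel for the edit-distance idea, here is a minimal Python sketch of a similarity-weighted Levenshtein distance. This is not the authors' VSED algorithm, whose details are in the paper; it only illustrates the general approach of making a substitution cheaper when two characters look alike. The function name `weighted_edit_distance`, the cost rule (1 minus similarity), and the similarity values in the table are all assumptions made up for the example.

```python
def weighted_edit_distance(a, b, similarity):
    """Levenshtein distance where substituting x for y costs
    1 - similarity(x, y). `similarity` maps character pairs to a
    score in [0, 1]; pairs not listed are treated as dissimilar."""

    def sub_cost(x, y):
        if x == y:
            return 0.0
        score = similarity.get((x, y), similarity.get((y, x), 0.0))
        return 1.0 - score

    # Standard dynamic program, keeping only the previous row.
    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        curr = [float(i)]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1.0,               # delete x
                curr[j - 1] + 1.0,           # insert y
                prev[j - 1] + sub_cost(x, y) # substitute x -> y
            ))
        prev = curr
    return prev[-1]


# Hypothetical similarity table: Latin "i" vs Cyrillic "і" (U+0456).
# The 0.9 score is invented, not taken from the paper's table.
SIMILARITY = {("i", "\u0456"): 0.9}

spoof = "c\u0456t\u0456bank"  # both i's replaced with U+0456
d_spoof = weighted_edit_distance("citibank", spoof, SIMILARITY)
d_typo = weighted_edit_distance("citibank", "citybank", SIMILARITY)
print(d_spoof, d_typo)
```

With these made-up scores, the spoofed string ends up much closer to "citibank" than an ordinary one-letter typo does, which is the property a confusability detector would want to flag.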

See also the researchers’ website.

I’m highly amused by how their semantic similarity example shows that “student” and “coin” are completely unrelated.