Published: 2026-06-10

Homograph Attacks: When the Domain Name Is a Lie

In 2017 a researcher registered a domain that displayed as apple.com in three major browsers, served over valid HTTPS. Every character was Cyrillic. The reason it's hard to fix is older and stranger than the bug itself.

In April 2017 a researcher named Xudong Zheng registered apple.com.

Not the real one, obviously. But if you loaded his page in Chrome, Firefox, or Opera that spring, the address bar said apple.com, the connection was HTTPS, and the certificate was valid and green. There was no typo to squint at. The letters were the right letters in the right order. You could read the URL character by character out loud and never catch it.

Every one of those characters was Cyrillic. The domain he actually registered was xn--80ak6aa92e.com, and the thing your browser rendered — аррӏе.com — was built from the Cyrillic а (U+0430), the Cyrillic р (U+0440), and a palochka standing in for the lowercase L. To a font, they’re apple. To DNS, they’re a completely different name. To you, looking at the bar, there is no difference at all.
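You can check this yourself. The snippet below is a minimal sketch that decodes the registered label with Python's built-in punycode codec and prints the name of each code point. The variable names are mine, not anything from Zheng's write-up.

```python
import unicodedata

# The ASCII label from the registered name, with the "xn--" prefix removed.
ascii_label = b"80ak6aa92e"

# Python's built-in "punycode" codec implements the RFC 3492 algorithm.
decoded = ascii_label.decode("punycode")

# Print every code point with its official Unicode name.
for ch in decoded:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# The decoded label only looks like the Latin string; it does not equal it.
print(decoded == "apple")  # False
```

All five names come back starting with CYRILLIC, which is the whole trick in one loop.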

That’s a homograph attack. The unsettling part isn’t the trick — it’s that the trick falls directly out of two decisions the internet made on purpose, and there’s no clean way to undo either one.

DNS never learned to read

DNS speaks ASCII. Letters, digits, hyphens — the LDH set, and nothing else. That was fine when the web was American and the names were English, and catastrophic the moment someone in Seoul or Cairo or Athens wanted a domain in their own script.

The fix, standardized as IDNA, is a piece of genuinely clever engineering called Punycode (RFC 3492). Punycode is a reversible, deterministic encoding: it takes a Unicode string and produces a pure-ASCII string that DNS can carry, prefixed with xn-- so software knows to decode it. münchen.com becomes xn--mnchen-3ya.com on the wire. The resolver, the registry, the TLS layer — they all deal in the ASCII form. Only the browser, at the very last step, decodes it back to Unicode and paints münchen in the address bar for a human to read.
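The round trip is easy to reproduce. This is a small sketch using Python's standard codecs; note that the built-in "idna" codec follows the older IDNA 2003 rules, so treat it as an illustration of the encoding step rather than of what a current browser or registry runs.

```python
# What the human types and sees.
name = "münchen.com"

# What goes on the wire: each non-ASCII label is Punycode-encoded and
# prefixed with "xn--"; pure-ASCII labels such as "com" pass through.
wire = name.encode("idna")
print(wire)   # b'xn--mnchen-3ya.com'

# What the display layer does at the very last step.
shown = wire.decode("idna")
print(shown)  # münchen.com

# The raw Punycode step on a single label, without the "xn--" prefix.
raw = "münchen".encode("punycode")
print(raw)    # b'mnchen-3ya'
```

Everything between the encode and the decode only ever sees the ASCII form.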

This is the right design. You cannot rewrite every DNS server on Earth to handle UTF-8, so you encode at the edges and leave the core untouched. It shipped, it works, and you’ve almost certainly used it without noticing.

The problem is the decode-and-display step. The browser’s job is to show the user the friendly Unicode name. And Unicode, by design, contains thousands of characters that look identical to other characters.

Unicode is full of twins

Latin a and Cyrillic а are different code points that a typical font draws as the same shape, and Greek alpha α comes close. This is not a bug in Unicode. It encodes the world’s writing systems, and the world’s writing systems genuinely share glyphs: Latin and Cyrillic both inherited letters from Greek, so of course an o is an о is an ο. Unicode calls these confusables, and it ships an entire technical standard (UTS 39) cataloguing them, because the people who built it knew exactly what was coming.
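Here is a short sketch of what "different code points, same shape" means in practice, using the three o's. It also shows that ordinary Unicode normalization does not merge them, which is why a separate confusables table is needed at all. The list name is just for illustration.

```python
import unicodedata

# Latin o, Cyrillic o, Greek omicron, written as escapes so they stay distinct.
letters = ["o", "\u043e", "\u03bf"]

for ch in letters:
    print(f"{ch}  U+{ord(ch):04X}  {unicodedata.name(ch)}")

# NFKC folds compatibility variants together, but these three are
# unrelated characters as far as normalization is concerned.
normalized = {unicodedata.normalize("NFKC", ch) for ch in letters}
print(len(normalized))  # 3
```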

So you have a naming system that now accepts the full Unicode range, a display layer obligated to render it faithfully, and a character set where dozens of letters have perfect look-alikes in other scripts. The homograph attack falls out of those three facts like a dropped ball. You don’t exploit a flaw. You just spell a famous name using somebody else’s alphabet.

And people figured this out almost immediately. Gabrilovich and Gontmakher described the attack in Communications of the ACM in February 2002, and registered a microsoft.com with Cyrillic с and о swapped in to prove it worked. At ShmooCon in 2005, Eric Johanson demoed a pаypal.com with a Cyrillic а. The 2017 apple.com that made the rounds was the third act of a play that had already been running for fifteen years.

So browsers became the spell-checkers

Since DNS won’t police this and the registries can’t agree to, the browser became the last line of defense — which means the browser has to guess, on every single label of every single hostname, whether the Unicode you’re about to see is a real name or a costume.

Chrome’s answer is a stack of heuristics, and reading them tells you how hard the problem is. It decodes each label and shows the raw xn-- Punycode instead of the pretty Unicode if any of several alarms trip: the label mixes scripts in a suspicious way (Latin glued to Cyrillic), or its “skeleton” (the shape you get after mapping every confusable to a canonical form) matches a known popular domain, or it’s entirely drawn from look-alike letters in a script whose top-level domain doesn’t match. That last rule exists because of Zheng’s аррӏе.com. His label was pure Cyrillic, single-script, internally consistent. The mixed-script detector that browsers had leaned on since the 2005 scare saw nothing to flag, because nothing was mixed. Chrome added whole-script confusable detection in version 58 specifically to close that hole.
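To make two of those alarms concrete, here is a toy sketch of a mixed-script check and a skeleton check. It is not Chrome's code. The confusables table, the popular-name list, and every function name are hypothetical stand-ins, and the script test is a rough approximation that reads the first word of each character's Unicode name. The real data lives in the UTS 39 confusables file.

```python
import unicodedata

# Hypothetical, hand-picked subset of a confusables table.
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u04cf": "l",  # CYRILLIC SMALL LETTER PALOCHKA
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
}

# Stand-in for a list of popular names worth protecting.
POPULAR = {"apple", "paypal", "microsoft"}

def scripts(label):
    """Approximate each letter's script by the first word of its Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}

def skeleton(label):
    """Map every known confusable to its Latin look-alike."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in label)

def show_as_punycode(label):
    """Return True if the label should be displayed in raw xn-- form."""
    if len(scripts(label)) > 1:
        return True  # mixed scripts, e.g. Latin glued to Cyrillic
    if label not in POPULAR and skeleton(label) in POPULAR:
        return True  # looks like a popular name but is not that name
    return False

spoof = "\u0430\u0440\u0440\u04cf\u0435"  # the all-Cyrillic label from 2017
mixed = "p\u0430ypal"                      # Latin with one Cyrillic letter
moscow = "\u043c\u043e\u0441\u043a\u0432\u0430"  # an ordinary Cyrillic word

print(show_as_punycode(spoof))    # True, caught by the skeleton check
print(show_as_punycode(mixed))    # True, caught by the mixed-script check
print(show_as_punycode("apple"))  # False
print(show_as_punycode(moscow))   # False
```

Notice that the all-Cyrillic label passes the mixed-script test and is only caught by the skeleton comparison, while the ordinary Cyrillic word passes both. That second part is the hard requirement: the check has to leave legitimate names alone.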

Firefox made a different call, and it’s worth sitting with because it isn’t obviously wrong. Firefox has a preference, network.IDN_show_punycode, that forces every internationalized name to display as raw xn-- gibberish. It defaults to off. Mozilla declined to flip it, and declined again when people asked after 2017, on the grounds that turning it on would render every legitimate non-Latin domain — every Cyrillic, Arabic, Chinese, Devanagari name on the internet — as unreadable ASCII soup. Protecting English-reading users from a Cyrillic spoof by breaking the address bar for everyone who actually writes in Cyrillic is a strange definition of safety. So they lean on per-registry rules and confusable checks instead, and accept that the protection is softer.

There’s no resolution to that argument, which is the whole point. Every defense against homographs is a tax on internationalization, and every concession to internationalization is an opening for homographs. You’re trading off two good things against each other and there’s no setting where both win.

The part that doesn’t get fixed

Browser heuristics have gotten genuinely good, and the easy version of this attack mostly doesn’t survive a modern address bar anymore. Skeleton matching catches the obvious targets, and the script rules catch the lazy ones.

But heuristics are a blocklist wearing a trench coat, and the search space is the entire Unicode confusables table crossed with every brand worth impersonating. There will always be a combination — some script pairing, some less-famous target, some newly assigned code point — that nobody wrote a rule for yet. The 2017 attack worked because everyone’s mental model was “watch for mixed scripts,” and pure Cyrillic walked straight through that model. The next one will work the same way: it’ll exploit the gap between the rule we wrote and the thing the rule was supposed to mean.

You learned to read a URL left to right and trust your eyes. The whole game here is that your eyes were never reading the URL. They were reading a picture of it.
