Question 1

Why does X (Twitter/Discord/Mastodon) strip the zero-width space I just pasted?

Accepted Answer

Each platform applies its own normalization. Twitter filters U+200B from tweet bodies and display names per third-party trip reports. Discord allows U+3164 (Hangul Filler) because its Unicode General Category is Lo (Letter, other) and not Cf (Format), so identifier validators that block format characters let it through. Mastodon enforces RFC 7564 PRECIS IdentifierClass on remote handles and stricter ASCII rules on local sign-up — invisible characters fail both gates. The differences are intentional: each surface decides how strict it wants to be.

Question 2

Why is U+200B 3 bytes when an ASCII space is 1?

Accepted Answer

UTF-8 encodes code points in the range U+0800–U+FFFF as three bytes (RFC 3629 §3). U+200B sits in that range. ASCII space (U+0020) is in U+0000–U+007F and encodes to a single byte. The visual width is zero in both cases, but on the wire and in storage the cost differs by 3×. A tweet padded to 280 characters with U+200B carries the same payload as roughly 840 ASCII characters, which matters for SMS gateways, log-file rotation, and any system that bills or budgets by byte rather than visible glyph.

Question 3

Are zero-width characters legitimate, or only used for tricks?

Accepted Answer

Both. U+200B marks word boundaries in Thai, Khmer, Myanmar, and Lao text — scripts that do not use spaces between words — and Unicode Standard Annex #14 treats it as a soft line-break opportunity. U+200C and U+200D control ligature formation in Arabic, Persian, and Devanagari, and emoji ZWJ sequences (the family 👨‍👩‍👧 emoji is five code points: U+1F468 + U+200D + U+1F469 + U+200D + U+1F467). Misuse and legitimate use share the same code points; the security posture lives in normalization, not in banning the characters.

Question 4

Can these characters be used to attack AI assistants or websites?

Accepted Answer

Hidden prompt injection via Tag-block characters (U+E0000–U+E007F) was demonstrated against Amazon Q Developer in 2025. AWS Bedrock Guardrails added a prompt-attack filter that year, though no AWS WAF managed rule ships specifically for invisible-character injection — practitioners deploy custom byte-match rules on the Tag block. On the web side, IDN homograph domains exploit visually confusable characters: the Akamai 2022 DNS-traffic study observed 6,670 such domains across 29,071 devices over 32 days. MITRE catalogs the underlying weakness as CWE-1007 (Insufficient Visual Distinction of Homoglyphs).

Question 5

How do registrars and protocols defend against invisible-character abuse?

Accepted Answer

At the DNS layer, IDNA normalizes Unicode labels to Punycode (RFC 3492 / IDNA2008 RFCs 5891–5892); ICANN IDN Implementation Guidelines v4.1 (November 2022) sets the registry baseline. At the protocol layer, RFC 7564 (PRECIS Framework) defines the IdentifierClass that strict applications use to reject format characters in usernames and resource names. UTS #39 (Unicode Security Mechanisms) defines the confusable-detection algorithm registrars and identity systems use when the policy needs to compare strings for visual similarity.

Invisible Character

Invisible characters in 2026 — what they are, where they break

Frequently asked questions