Click any character to copy it to your clipboard.
Invisible characters in 2026 — what they are, where they break
Invisible characters are Unicode code points that occupy logical position in a string but produce no visible glyph. The seven listed above belong to the layout-control class defined in Chapter 23 of the Unicode Standard. Three of them — Zero-Width Space (U+200B), Zero-Width Non-Joiner (U+200C), and Zero-Width Joiner (U+200D) — exist for legitimate typographic reasons: U+200B marks word boundaries in Thai, Khmer, Myanmar, and Lao text per UAX #14 and UAX #29; U+200C and U+200D control ligature formation in Arabic, Persian, Devanagari, and emoji ZWJ sequences (the family 👨👩👧 emoji is five code points: U+1F468 + U+200D + U+1F469 + U+200D + U+1F467). Misuse occurs when these characters travel through identifiers, URLs, or LLM prompts. The Akamai 2022 DNS-traffic study observed 6,670 IDN homograph domains across 29,071 devices over 32 days, exploiting visual confusables documented in MITRE CWE-1007. Defenses since IDNA2008 converge on Punycode normalization (RFC 3492) at the registrar layer and PRECIS IdentifierClass enforcement (RFC 7564) at protocol layer; ICANN IDN Implementation Guidelines v4.1 (November 2022) sets the registry baseline. Platform behavior diverges. Twitter strips U+200B aggressively from tweet bodies and display names according to third-party trip reports through 2026. Discord accepts U+3164 (Hangul Filler) consistently because its General Category is Lo (Letter, other), not Cf (Format). Mastodon enforces RFC 7564 on remote handles and stricter ASCII rules locally — invisible characters are rejected on both gates. AI assistants became an attack surface in 2025: hidden prompt injection via Tag-block characters (U+E0000–U+E007F) was demonstrated against Amazon Q Developer that year; AWS Bedrock Guardrails added a prompt-attack filter that year, but no AWS WAF managed rule ships specifically for invisible-character injection — practitioners deploy custom byte-match rules. The byte cost matters: U+200B encodes to 3 bytes in UTF-8 (RFC 3629 §3) while ASCII space is 1 byte, so a 280-character tweet padded with zero-width spaces consumes triple the wire payload it appears to. The collection on this page lets a reader copy each character to inspect how a target system normalizes, displays, or rejects it.
How this tool fits the rest of the toolkit
Pair this with the Character Counter to compare visible vs total length, the Lorem Ipsum generator when test fixtures need filler that does not interfere with the variables under test, and the Text-to-Binary converter to inspect how each character encodes to UTF-8 bytes.
- 7 layout-control characters (U+200B, U+200C, U+200D, U+2060, U+00AD, U+200A, U+2800)
- Per-character Unicode code point and short description
- One-click copy — character is placed on the clipboard, no transformation
- Runs in the browser; no network call when copying
Free. No signup. Your inputs stay in your browser. Ads via Google AdSense (consent required).
Frequently asked questions
Why does X (Twitter/Discord/Mastodon) strip the zero-width space I just pasted?
Each platform applies its own normalization. Twitter filters U+200B from tweet bodies and display names per third-party trip reports. Discord allows U+3164 (Hangul Filler) because its Unicode General Category is Lo (Letter, other) and not Cf (Format), so identifier validators that block format characters let it through. Mastodon enforces RFC 7564 PRECIS IdentifierClass on remote handles and stricter ASCII rules on local sign-up — invisible characters fail both gates. The differences are intentional: each surface decides how strict it wants to be.
Why is U+200B 3 bytes when an ASCII space is 1?
UTF-8 encodes code points in the range U+0800–U+FFFF as three bytes (RFC 3629 §3). U+200B sits in that range. ASCII space (U+0020) is in U+0000–U+007F and encodes to a single byte. The visual width is zero in both cases, but on the wire and in storage the cost differs by 3×. A tweet padded to 280 characters with U+200B carries the same payload as roughly 840 ASCII characters, which matters for SMS gateways, log-file rotation, and any system that bills or budgets by byte rather than visible glyph.
Are zero-width characters legitimate, or only used for tricks?
Both. U+200B marks word boundaries in Thai, Khmer, Myanmar, and Lao text — scripts that do not use spaces between words — and Unicode Standard Annex #14 treats it as a soft line-break opportunity. U+200C and U+200D control ligature formation in Arabic, Persian, and Devanagari, and emoji ZWJ sequences (the family 👨👩👧 emoji is five code points: U+1F468 + U+200D + U+1F469 + U+200D + U+1F467). Misuse and legitimate use share the same code points; the security posture lives in normalization, not in banning the characters.
Can these characters be used to attack AI assistants or websites?
Hidden prompt injection via Tag-block characters (U+E0000–U+E007F) was demonstrated against Amazon Q Developer in 2025. AWS Bedrock Guardrails added a prompt-attack filter that year, though no AWS WAF managed rule ships specifically for invisible-character injection — practitioners deploy custom byte-match rules on the Tag block. On the web side, IDN homograph domains exploit visually confusable characters: the Akamai 2022 DNS-traffic study observed 6,670 such domains across 29,071 devices over 32 days. MITRE catalogs the underlying weakness as CWE-1007 (Insufficient Visual Distinction of Homoglyphs).
How do registrars and protocols defend against invisible-character abuse?
At the DNS layer, IDNA normalizes Unicode labels to Punycode (RFC 3492 / IDNA2008 RFCs 5891–5892); ICANN IDN Implementation Guidelines v4.1 (November 2022) sets the registry baseline. At the protocol layer, RFC 7564 (PRECIS Framework) defines the IdentifierClass that strict applications use to reject format characters in usernames and resource names. UTS #39 (Unicode Security Mechanisms) defines the confusable-detection algorithm registrars and identity systems use when the policy needs to compare strings for visual similarity.
Sources (9)
- The Unicode Consortium (2024). The Unicode Standard, Version 16.0 — Chapter 23 (Special Areas and Format Characters). Unicode Consortium, Mountain View, CA.
- Davis, M., & Suignard, M. (Eds.) (2024). UAX #14: Unicode Line Breaking Algorithm. Unicode Standard Annex, revision 53 (Unicode 16.0).
- Davis, M. (2024). UTS #39: Unicode Security Mechanisms. Unicode Technical Standard, version 16.0.0.
- Yergeau, F. (2003). UTF-8, a transformation format of ISO 10646. RFC 3629, IETF.
- Costello, A. (2003). Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA). RFC 3492, IETF.
- Klensin, J. (2010). Internationalized Domain Names in Applications (IDNA): Protocol. RFC 5891, IETF (IDNA2008).
- Saint-Andre, P., & Blanchet, M. (2015). PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols. RFC 7564, IETF.
- MITRE Corporation (2024). CWE-1007: Insufficient Visual Distinction of Homoglyphs Presented to User. Common Weakness Enumeration v4.19.1 (introduced CWE 3.1, 2018).
- ICANN (2022). IDN Implementation Guidelines v4.1. Internet Corporation for Assigned Names and Numbers (17 November 2022).
These are the original publications the formulas in this tool are based on. Locate them by journal name and year on Google Scholar or PubMed.