Character Encoding for Developers — ASCII, UTF-8, UTF-16
Character encoding still confuses people decades after it should have been solved. Here is the mental model: code points, encodings, BOMs, and why UTF-8 almost always wins.
Character encoding should be a solved problem by 2026, and for most of us it is — almost everything is UTF-8. But the mental model still matters, because the 5% of cases where it goes wrong (BOMs, legacy exports, Windows file dialogs, database client libraries) eat hours out of your week if you do not know the shape of the problem. This guide is the foundation.
The two separate concepts
The single biggest source of confusion: “Unicode” and “UTF-8” are different things.
- Unicode is a standard that assigns every character a unique integer called a code point, written U+XXXX. A is U+0041. é is U+00E9. The pile-of-poo emoji is U+1F4A9. There are over a million possible code points; about 150,000 are assigned as of Unicode 16.
- An encoding is a way to turn code points into bytes. UTF-8, UTF-16, UTF-32, and legacy encodings like Latin-1, Shift-JIS, Windows-1252 all encode (some subset of) characters into bytes.
Unicode is the character set. UTF-8 is one encoding of that set. You can have UTF-8 encoded Unicode, UTF-16 encoded Unicode, or the same characters as legacy Windows-1252 bytes. Same characters, different bytes on disk.
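To make the split concrete, here is a minimal Node.js sketch that takes one code point and prints the bytes three encodings produce for it. The `hex` helper is a made-up name for illustration, and `Buffer` is assumed to be available as the Node.js global.

```javascript
// One character, one code point, three different byte sequences.
// Assumes Node.js, where Buffer is a global.
const ch = "\u00E9"; // é as a single precomposed code point

// The code point is a property of the character, not of any encoding.
const codePoint = ch.codePointAt(0).toString(16).toUpperCase(); // "E9"

// Hypothetical helper: render bytes as space-separated hex.
const hex = (buf) =>
  [...buf].map((b) => b.toString(16).padStart(2, "0")).join(" ");

const utf8 = hex(Buffer.from(ch, "utf8"));       // "c3 a9"
const utf16le = hex(Buffer.from(ch, "utf16le")); // "e9 00"
const latin1 = hex(Buffer.from(ch, "latin1"));   // "e9"

console.log(`U+00${codePoint}`, { utf8, utf16le, latin1 });
```

The code point stays the same in every row; only the bytes change.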
ASCII, the common ancestor
ASCII is a 7-bit encoding defined in 1963. It covers 128 code points: Latin letters, digits, punctuation, control characters. Every byte fits in 7 bits; the eighth bit is zero.
UTF-8 and most legacy 8-bit encodings are backward-compatible with ASCII at the byte level for those 128 code points. (UTF-16 and UTF-32 are not: they use two or four bytes even for A.) A file that contains only ASCII is simultaneously valid ASCII, UTF-8, Latin-1, and Windows-1252. That is why tools from the 1980s still work for English-only text.
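A quick way to convince yourself, sketched in Node.js: decode the same ASCII-only bytes three ways and compare the results. The byte values are arbitrary sample data, and `Buffer` is assumed to be the Node.js global.

```javascript
// ASCII-only bytes decode to the same text under several encodings.
const bytes = Buffer.from([0x48, 0x65, 0x6c, 0x6c, 0x6f]); // "Hello"

// Pure ASCII means every byte has its high bit clear.
const isAscii = bytes.every((b) => b < 0x80);

const asAscii = bytes.toString("ascii");
const asUtf8 = bytes.toString("utf8");
const asLatin1 = bytes.toString("latin1");

console.log(isAscii, asAscii === asUtf8 && asUtf8 === asLatin1); // true true
```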
UTF-8 — the universal default
UTF-8 is a variable-length encoding. Each code point encodes to 1-4 bytes:
| Code point range | Bytes | Binary pattern |
|---|---|---|
| U+0000 to U+007F | 1 | 0xxxxxxx |
| U+0080 to U+07FF | 2 | 110xxxxx 10xxxxxx |
| U+0800 to U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx |
| U+10000 to U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
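The table translates almost line for line into code. The following is a sketch of a hand-rolled encoder, written only to illustrate the bit patterns; `encodeUtf8` is a hypothetical name, and in real code you would use the built-in `TextEncoder` instead.

```javascript
// Encode a single code point to UTF-8 bytes, one branch per table row.
function encodeUtf8(cp) {
  // Surrogates (U+D800..U+DFFF) and values past U+10FFFF are not encodable.
  if (cp < 0 || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)) {
    throw new RangeError("not a Unicode scalar value: " + cp);
  }
  if (cp <= 0x7f) return [cp]; // 0xxxxxxx
  if (cp <= 0x7ff) {
    return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)]; // 110xxxxx 10xxxxxx
  }
  if (cp <= 0xffff) {
    return [
      0xe0 | (cp >> 12),          // 1110xxxx
      0x80 | ((cp >> 6) & 0x3f),  // 10xxxxxx
      0x80 | (cp & 0x3f),         // 10xxxxxx
    ];
  }
  return [
    0xf0 | (cp >> 18),            // 11110xxx
    0x80 | ((cp >> 12) & 0x3f),   // 10xxxxxx
    0x80 | ((cp >> 6) & 0x3f),    // 10xxxxxx
    0x80 | (cp & 0x3f),           // 10xxxxxx
  ];
}

const toHex = (bytes) => bytes.map((b) => b.toString(16)).join(" ");
console.log(toHex(encodeUtf8(0x41)));    // 41
console.log(toHex(encodeUtf8(0xe9)));    // c3 a9
console.log(toHex(encodeUtf8(0x4f60)));  // e4 bd a0
console.log(toHex(encodeUtf8(0x1f4a9))); // f0 9f 92 a9
```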
Three consequences worth internalizing:
- ASCII code points are one byte each, bit-identical to ASCII. A in UTF-8 is 0x41, the same as ASCII.
- Any byte that starts with the bits 10 is a continuation byte, never the start of a character. This is why UTF-8 is self-synchronizing: you can land in the middle of a file and find the next character boundary in at most 3 bytes.
- Non-ASCII characters take 2-4 bytes. é is 2 bytes (0xC3 0xA9). 你 is 3 bytes (0xE4 0xBD 0xA0). A pile-of-poo emoji is 4 bytes.
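The self-synchronizing property can be sketched in a few lines: from any offset, skip bytes that match the continuation pattern until you reach one that does not. `nextBoundary` is a hypothetical helper name, and the sketch assumes the input is valid UTF-8.

```javascript
// Find the next character boundary at or after offset i in UTF-8 bytes.
function nextBoundary(bytes, i) {
  // Continuation bytes look like 10xxxxxx, i.e. (b & 0xC0) === 0x80.
  while (i < bytes.length && (bytes[i] & 0xc0) === 0x80) {
    i++;
  }
  return i;
}

// "a💩b" encodes as 61 | f0 9f 92 a9 | 62
const sample = new TextEncoder().encode("a\u{1F4A9}b");

console.log(nextBoundary(sample, 1)); // 1 (already on the lead byte f0)
console.log(nextBoundary(sample, 2)); // 5 (skips 9f 92 a9, lands on 62)
console.log(nextBoundary(sample, 4)); // 5
```

In valid UTF-8 the loop never advances more than three bytes, because a character has at most three continuation bytes.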
UTF-8 is now the dominant encoding on the web (over 98% of pages), in most databases, in Linux filesystems, and in most programming languages’ default I/O.
```shell
# See the bytes behind a character
printf "é" | xxd
# 00000000: c3a9                                     ..
printf "A" | xxd
# 00000000: 41                                       A
printf "你" | xxd
# 00000000: e4bd a0                                  ...
```
To explore the relationship between characters and their binary representation interactively, the text-to-binary and binary-to-text tools let you see the byte layout character by character.
UTF-16 — why it persists
UTF-16 encodes code points in 2 or 4 bytes. It was the “obvious” encoding when Unicode was 16 bits wide (pre-1996). Unicode outgrew 16 bits, so UTF-16 added “surrogate pairs” — two 16-bit code units that together encode a single code point above U+FFFF.
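The surrogate-pair arithmetic is short enough to write out. This is a sketch for illustration; `toSurrogatePair` is a hypothetical name, and it assumes its input is a code point above U+FFFF.

```javascript
// Split a supplementary code point (U+10000..U+10FFFF) into two
// UTF-16 code units: a high surrogate followed by a low surrogate.
function toSurrogatePair(cp) {
  const v = cp - 0x10000;            // leaves a 20-bit value
  const high = 0xd800 + (v >> 10);   // top 10 bits
  const low = 0xdc00 + (v & 0x3ff);  // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f4a9);
console.log(high.toString(16), low.toString(16)); // d83d dca9

// JavaScript strings expose the same two code units.
const poo = "\u{1F4A9}";
console.log(poo.charCodeAt(0) === high, poo.charCodeAt(1) === low); // true true
```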
UTF-16 still shows up in three places:
- Windows APIs internally. The W versions of Win32 functions (CreateFileW, MessageBoxW) take UTF-16 wide strings.
- Java and JavaScript strings internally. A JavaScript string's length counts UTF-16 code units, not characters, so "💩".length is 2, not 1.
- Some older XML and SOAP services that emit UTF-16 with a BOM.
For almost any new I/O, UTF-8 is correct. UTF-16 has no storage advantage for English text (it doubles the size), no byte-level self-synchronization, and a byte-order problem UTF-8 does not have.
```javascript
// JavaScript string indexing is UTF-16, not code points
const s = "💩abc";
s.length;      // 5 (2 code units for 💩 + 3 for "abc")
[...s].length; // 4 (the spread iterates code points)
s.charAt(0);   // "\uD83D", a lone surrogate, not a character
[...s][0];     // "💩", because spreading iterates by code point
```
Byte Order Mark — a historical hazard
A Byte Order Mark (BOM) is an optional U+FEFF at the start of a file that indicates the byte order of UTF-16 or UTF-32. In UTF-16LE it is FF FE; in UTF-16BE it is FE FF.
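As a rough illustration, the same U+FEFF code point yields the two byte orders, and a reader can sniff the first two bytes to tell them apart. `detectUtf16Bom` is a hypothetical helper; it deliberately ignores UTF-32, whose little-endian BOM also begins with FF FE. `Buffer` is assumed to be the Node.js global.

```javascript
// U+FEFF serialized in both UTF-16 byte orders.
const bomLE = Buffer.from("\uFEFF", "utf16le");          // ff fe
const bomBE = Buffer.from("\uFEFF", "utf16le").swap16(); // fe ff

// Hypothetical sniffer: returns the byte order implied by a UTF-16 BOM,
// or null when no UTF-16 BOM is present. UTF-32 is not handled.
function detectUtf16Bom(bytes) {
  if (bytes.length >= 2) {
    if (bytes[0] === 0xff && bytes[1] === 0xfe) return "utf-16le";
    if (bytes[0] === 0xfe && bytes[1] === 0xff) return "utf-16be";
  }
  return null;
}

console.log(detectUtf16Bom(bomLE));               // utf-16le
console.log(detectUtf16Bom(bomBE));               // utf-16be
console.log(detectUtf16Bom(Buffer.from("plain"))); // null
```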
For UTF-8 there is no byte-order ambiguity — bytes are bytes — but Microsoft Notepad and some other tools write a three-byte UTF-8 BOM (EF BB BF) at the start of files. This is legal per the Unicode spec but discouraged. It breaks:
- Shell scripts (the BOM appears before #!/bin/bash).
- CSV parsers that expect the first field to start at byte 0.
- YAML and JSON parsers in some older libraries.
- git diff of a file that suddenly has a BOM.
If your CSV export from Excel opens with strange characters (often ï»¿) before the first field name, it has a UTF-8 BOM. Strip it with sed -i '1s/^\xEF\xBB\xBF//' file.csv (GNU sed syntax) or read the file as utf-8-sig in Python.
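In JavaScript the equivalent cleanup might look like the sketch below. `TextDecoder` drops a leading BOM by default, while a string that was decoded some other way may still begin with U+FEFF; `stripBom` is a hypothetical helper for that case, and the sample bytes are made up.

```javascript
// A UTF-8 BOM followed by the text "id".
const withBom = Uint8Array.of(0xef, 0xbb, 0xbf, 0x69, 0x64);

// TextDecoder removes a leading BOM unless ignoreBOM is set to true.
const stripped = new TextDecoder("utf-8").decode(withBom);
const kept = new TextDecoder("utf-8", { ignoreBOM: true }).decode(withBom);

// Hypothetical helper for strings that still carry U+FEFF at the front.
function stripBom(text) {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

console.log(stripped.length);       // 2
console.log(kept.length);           // 3 (U+FEFF + "id")
console.log(stripBom(kept) === "id"); // true
```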
Legacy encodings still in the wild
| Encoding | What it is | Where you meet it |
|---|---|---|
| Latin-1 / ISO-8859-1 | 1-byte European characters | Old HTTP default, older database dumps |
| Windows-1252 | Latin-1 + some punctuation | Excel, older Windows programs |
| Shift-JIS | Japanese 1-2 byte | Legacy Japanese systems |
| GB18030 | Chinese (mainland) | Chinese government systems |
| UTF-16LE | 2- or 4-byte Unicode encoding, used internally by Windows and JavaScript | Occasionally serialized as a file format |
The é character is U+00E9 in Unicode, byte 0xE9 in Latin-1, bytes 0xC3 0xA9 in UTF-8. When you see Ã© in a browser, you are looking at UTF-8 bytes being decoded as Latin-1: the two UTF-8 bytes for é are interpreted as two separate Latin-1 characters (0xC3 is Ã, 0xA9 is ©). This single symptom tells you: “something along the pipeline is not UTF-8-aware.”
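The symptom is easy to reproduce, and in this simple case to reverse. The sketch below assumes Node.js with `Buffer` as a global; it encodes é as UTF-8, misreads the bytes as Latin-1, then undoes the mistake.

```javascript
const original = "\u00E9"; // é

// Wrong: UTF-8 bytes (c3 a9) read back as if they were Latin-1.
const garbled = Buffer.from(original, "utf8").toString("latin1");

// Repair: turn the garbled text back into bytes as Latin-1,
// then decode those bytes as the UTF-8 they always were.
const repaired = Buffer.from(garbled, "latin1").toString("utf8");

console.log(garbled);               // Ã©
console.log(repaired === original); // true
```

The round trip only works when no bytes were lost along the way; once a pipeline has replaced undecodable bytes with ? or U+FFFD, the original text is gone.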
Grapheme clusters — the thing beyond code points
One user-perceived character is not always one code point. 👨‍👩‍👧 is a family emoji made of five code points: three emoji joined by two zero-width joiners (U+200D). é can be either U+00E9 (one code point) or U+0065 + U+0301 (two code points: e + combining acute). To count “characters” the way a user sees them, you need grapheme cluster segmentation, which is the job of Intl.Segmenter in JavaScript or the unicode-segmentation crate in Rust.
This is why "💩".length === 2 is not the right answer for “how many characters is this string” — and why the word counter tool uses grapheme-aware counting for correct results on emoji-heavy input.
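A minimal sketch of grapheme-aware counting with `Intl.Segmenter`, assuming a runtime that ships it (recent Node.js versions and current browsers do). `countGraphemes` is a hypothetical helper name, and the strings are written with escapes so the joiners and combining mark stay visible.

```javascript
// Count user-perceived characters (grapheme clusters) in a string.
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}

// man + ZWJ + woman + ZWJ + girl
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
// e + combining acute accent
const decomposed = "e\u0301";

console.log([...family].length, countGraphemes(family));         // 5 1
console.log([...decomposed].length, countGraphemes(decomposed)); // 2 1
```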
Practical rules
- Use UTF-8 everywhere. Input, output, filenames, database column encoding, HTTP headers. Everywhere.
- Declare the encoding in every format that supports it. <meta charset="utf-8"> in HTML, # -*- coding: utf-8 -*- in old Python, Content-Type: application/json; charset=utf-8 in HTTP responses.
- Strip BOMs at ingest. If you accept files, handle an optional BOM. Do not write BOMs on output.
- Never trust that “a string has a length” — know whether you are counting bytes, code units (UTF-16), code points, or grapheme clusters. They are four different numbers.
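The last rule can be made concrete with one function that reports all four numbers side by side. This is a sketch under the assumption that `TextEncoder` and `Intl.Segmenter` are available; `measure` is a hypothetical name.

```javascript
// Report the four "lengths" of a string.
function measure(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    bytes: new TextEncoder().encode(text).length, // UTF-8 bytes
    codeUnits: text.length,                       // UTF-16 code units
    codePoints: [...text].length,                 // Unicode code points
    graphemes: [...segmenter.segment(text)].length, // user-perceived characters
  };
}

// Pile-of-poo emoji followed by e + combining acute accent.
const result = measure("\u{1F4A9}e\u0301");
console.log(result);
// { bytes: 7, codeUnits: 4, codePoints: 3, graphemes: 2 }
```

Which number you want depends on the question: storage limits care about bytes, JavaScript indexing about code units, and anything shown to a user about graphemes.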
Takeaways
Unicode is the character set. UTF-8 is the encoding. Default to UTF-8 for everything. BOMs are a historical hazard; handle them on input, do not emit them on output. String length is four different numbers depending on what you mean. For the adjacent transport-encoding topic, see the Base64 in production guide; for how character encoding interacts with URLs, see the URL encoding guide.