Character Encoding for Developers — ASCII, UTF-8, UTF-16


Character encoding confuses people decades after it should be solved. Here is the mental model: code points, encodings, BOMs, and why UTF-8 almost always wins.

Character encoding should be a solved problem by 2026, and for most of us it is — almost everything is UTF-8. But the mental model still matters, because the 5% of cases where it goes wrong (BOMs, legacy exports, Windows file dialogs, database client libraries) eat hours out of your week if you do not know the shape of the problem. This guide is the foundation.

The two separate concepts

The single biggest source of confusion: “Unicode” and “UTF-8” are different things.

  • Unicode is a standard that assigns every character a unique integer called a code point, written U+XXXX. A is U+0041. é is U+00E9. The pile-of-poo emoji is U+1F4A9. There are 1,114,112 possible code points (U+0000 to U+10FFFF); about 155,000 are assigned as of Unicode 16.0.
  • An encoding is a way to turn code points into bytes. UTF-8, UTF-16, UTF-32, and legacy encodings like Latin-1, Shift-JIS, Windows-1252 all encode (some subset of) characters into bytes.

Unicode is the character set. UTF-8 is one encoding of that set. You can have UTF-8 encoded Unicode, UTF-16 encoded Unicode, or the same characters as legacy Windows-1252 bytes. Same characters, different bytes on disk.
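
The split between code point and encoding can be seen directly in a language that keeps text and bytes as separate types. The following is a minimal Python sketch, not part of the original article; the variable names are illustrative only.

```python
# One character, one code point, several byte sequences.
ch = "\u00e9"  # é, written as an escape so the source file's own encoding cannot interfere

code_point = ord(ch)                   # the Unicode code point (an integer)
utf8_bytes = ch.encode("utf-8")        # 2 bytes
utf16_bytes = ch.encode("utf-16-le")   # 2 bytes, little-endian, no BOM
cp1252_bytes = ch.encode("cp1252")     # 1 byte in legacy Windows-1252

print(f"U+{code_point:04X}")   # U+00E9
print(utf8_bytes.hex())        # c3a9
print(utf16_bytes.hex())       # e900
print(cp1252_bytes.hex())      # e9
```

The character never changes; only the bytes chosen to represent it do.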

ASCII, the common ancestor

ASCII is a 7-bit encoding defined in 1963. It covers 128 code points: Latin letters, digits, punctuation, control characters. Every byte fits in 7 bits; the eighth bit is zero.

UTF-8 and most legacy 8-bit encodings are backward-compatible with ASCII at the byte level for those 128 code points; UTF-16 and UTF-32 are not, because they use two or four bytes even for ASCII characters. A file that contains only ASCII is simultaneously valid ASCII, UTF-8, Latin-1, and Windows-1252. That is why tools from the 1980s still work for English-only text.
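
A quick way to check the compatibility claim is to decode the same ASCII-only bytes with several codecs and compare. This is a small sketch under the assumption that Python's standard codec names (ascii, utf-8, latin-1, cp1252) stand in for the encodings named above.

```python
data = b"Hello, world"  # every byte is below 0x80

# ASCII-compatible encodings all decode these bytes to the same text.
decoded = {name: data.decode(name) for name in ("ascii", "utf-8", "latin-1", "cp1252")}
all_same = len(set(decoded.values())) == 1

# UTF-16 is not byte-compatible: the same text needs two bytes per character.
utf16 = "Hello, world".encode("utf-16-le")

print(all_same)                # True
print(len(data), len(utf16))   # 12 24
```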

UTF-8 — the universal default

UTF-8 is a variable-length encoding. Each code point encodes to 1-4 bytes:

Code point range       Bytes   Binary pattern
U+0000 to U+007F       1       0xxxxxxx
U+0080 to U+07FF       2       110xxxxx 10xxxxxx
U+0800 to U+FFFF       3       1110xxxx 10xxxxxx 10xxxxxx
U+10000 to U+10FFFF    4       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
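
The bit patterns in the table translate almost line for line into code. The sketch below is a hand-rolled encoder written only to illustrate the table; `encode_utf8` is a hypothetical helper, and real code should simply call `str.encode("utf-8")`, which the loop uses as a cross-check.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point by hand, following the table's bit patterns."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:                      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                     # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp <= 0xFFFF:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    return bytes([0b11110000 | (cp >> 18),   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])

# Cross-check against the built-in encoder for one code point per row of the table.
for cp in (0x41, 0xE9, 0x4F60, 0x1F4A9):
    assert encode_utf8(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {encode_utf8(cp).hex()}")
```

The range check at the top rejects surrogates (U+D800 to U+DFFF), which are not valid in UTF-8.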

Three consequences worth internalizing:

  1. ASCII code points are one byte each, bit-identical to ASCII. A in UTF-8 is 0x41, the same as ASCII.
  2. Any byte that starts with 10 is a continuation byte, never the start of a character. This is why UTF-8 is self-synchronizing: you can land in the middle of a file and find the next character boundary in at most 3 bytes.
  3. Non-ASCII characters take 2-4 bytes. é is 2 bytes (0xC3 0xA9). 你 is 3 bytes (0xE4 0xBD 0xA0). A pile-of-poo emoji is 4 bytes (0xF0 0x9F 0x92 0xA9).
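
Self-synchronization, the second point above, can be sketched in a few lines. The helper name `next_boundary` is hypothetical, and the sketch assumes the buffer holds valid UTF-8.

```python
def next_boundary(buf: bytes, pos: int) -> int:
    """Return the index of the first byte at or after pos that starts a character."""
    while pos < len(buf) and (buf[pos] & 0b11000000) == 0b10000000:
        pos += 1  # skip continuation bytes (10xxxxxx)
    return pos

buf = "a\U0001F4A9b".encode("utf-8")   # 61 f0 9f 92 a9 62
print(next_boundary(buf, 2))           # 5: skipped three continuation bytes
print(next_boundary(buf, 1))           # 1: 0xF0 is already a lead byte
```

Landing on the first continuation byte of a 4-byte sequence is the worst case, which is where the "at most 3 bytes" bound comes from.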

UTF-8 is now the dominant encoding on the web (over 98% of pages), in most databases, in Linux filesystems, and in most programming languages’ default I/O.

# See the bytes behind a character
printf "é" | xxd
# 00000000: c3a9       .. 
printf "A" | xxd
# 00000000: 41         A
printf "你" | xxd
# 00000000: e4bda0     ...

To explore the relationship between characters and their binary representation interactively, the text-to-binary and binary-to-text tools let you see the byte layout character by character.

UTF-16 — why it persists

UTF-16 encodes code points in 2 or 4 bytes. Its fixed-width predecessor, UCS-2, was the “obvious” encoding when Unicode was 16 bits wide (pre-1996). Unicode outgrew 16 bits, so UTF-16 added “surrogate pairs”: two 16-bit code units that together encode a single code point above U+FFFF.
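
The surrogate-pair arithmetic is short enough to write out. This is an illustrative sketch; `to_surrogates` is a hypothetical helper, not a standard-library function.

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    v = cp - 0x10000              # 20 bits remain after subtracting the offset
    high = 0xD800 + (v >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (v & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low

high, low = to_surrogates(0x1F4A9)
print(f"{high:04X} {low:04X}")    # D83D DCA9
```

The result matches what Python's own `utf-16-be` codec produces for U+1F4A9.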

UTF-16 still shows up in three places:

  • Windows APIs internally. The W versions of Win32 functions (CreateFileW, MessageBoxW) take UTF-16 wide strings.
  • Java and JavaScript strings internally. A JavaScript string.length counts UTF-16 code units, not characters — so "💩".length is 2, not 1.
  • Some older XML and SOAP services that emit UTF-16 with a BOM.

For almost any new I/O, UTF-8 is correct. UTF-16 has no storage advantage for English text (it doubles the size), cannot resynchronize from an arbitrary byte offset the way UTF-8 can, and has a byte-order problem UTF-8 does not have.

// JavaScript string indexing is UTF-16, not code points
const s = "💩abc";
s.length;        // 5 (2 code units for 💩 + 3)
[...s].length;   // 4 (the spread iterates code points)
s.charAt(0);     // "\uD83D": a lone surrogate, not a character
[...s][0];       // "💩": iterating with the spread operator uses code points

Byte Order Mark — a historical hazard

A Byte Order Mark (BOM) is an optional U+FEFF at the start of a file that indicates the byte order of UTF-16 or UTF-32. In UTF-16LE it is FF FE; in UTF-16BE it is FE FF.

For UTF-8 there is no byte-order ambiguity — bytes are bytes — but Microsoft Notepad and some other tools write a three-byte UTF-8 BOM (EF BB BF) at the start of files. This is legal per the Unicode spec but discouraged. It breaks:

  • Shell scripts (the BOM appears before #!/bin/bash).
  • CSV parsers that expect the first field to start at byte 0.
  • YAML and JSON parsers in some older libraries.
  • git diff of a file that suddenly has a BOM.

If your CSV export from Excel opens with strange characters before the first field name, it has a UTF-8 BOM. Strip it with sed -i '1s/^\xEF\xBB\xBF//' file.csv (GNU sed syntax; BSD/macOS sed handles -i and \x escapes differently) or read the file as utf-8-sig in Python.
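
Handling an optional BOM at ingest might look like the following sketch. `strip_utf8_bom` is a hypothetical helper; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the Python standard library.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM (EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

raw = codecs.BOM_UTF8 + b"name,city\n"
print(strip_utf8_bom(raw))             # b'name,city\n'
print(repr(raw.decode("utf-8")))       # '\ufeffname,city\n' (BOM survives as U+FEFF)
print(repr(raw.decode("utf-8-sig")))   # 'name,city\n'
```

Decoding with plain utf-8 keeps the BOM as an invisible first character, which is what corrupts the first CSV field name; utf-8-sig removes it when present and is harmless when it is absent.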

Legacy encodings still in the wild

Encoding               What it is                      Where you meet it
Latin-1 / ISO-8859-1   1-byte European characters      Old HTTP default, older database dumps
Windows-1252           Latin-1 + some punctuation      Excel, older Windows programs
Shift-JIS              Japanese 1-2 byte               Legacy Japanese systems
GB18030                Chinese (mainland)              Chinese government systems
UTF-16LE               JavaScript / Windows internal   Serialized as file format occasionally

The é character is U+00E9 in Unicode, byte 0xE9 in Latin-1, bytes 0xC3 0xA9 in UTF-8. When you see Ã© in a browser, you are looking at UTF-8 bytes being decoded as Latin-1: the two UTF-8 bytes for é are interpreted as two separate Latin-1 characters. This single symptom tells you: “something along the pipeline is not UTF-8-aware.”
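
The symptom is easy to reproduce, and in simple cases to reverse. The following is a minimal sketch; the round-trip repair only works when the wrongly decoded text has not been altered or truncated since.

```python
original = "\u00e9"                                      # é
mojibake = original.encode("utf-8").decode("latin-1")    # wrong decoder applied
print(mojibake)                                          # Ã©

# If no bytes were lost, reversing the two steps recovers the text.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired == original)                              # True
```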

Grapheme clusters — the thing beyond code points

One user-perceived character is not always one code point. 👨‍👩‍👧 is a family emoji made of five code points: three person emoji joined by two zero-width joiners. é can be either U+00E9 (one code point) or U+0065 + U+0301 (two code points: e + combining acute). To count “characters” the way a user sees them, you need grapheme cluster segmentation, which is the job of Intl.Segmenter in JavaScript or the unicode-segmentation crate in Rust.
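
The two spellings of é can be compared directly. The sketch below uses `unicodedata.normalize` from the Python standard library to convert between them; note that normalization is a related but separate tool and does not perform grapheme segmentation.

```python
import unicodedata

composed = "\u00e9"      # é as one code point
decomposed = "e\u0301"   # e + combining acute accent

print(composed == decomposed)                                 # False
print(len(composed), len(decomposed))                         # 1 2
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```

Both strings render identically, which is why equality checks and length counts on user-visible text can surprise you.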

This is why "💩".length === 2 is not the right answer for “how many characters is this string” — and why the word counter tool uses grapheme-aware counting for correct results on emoji-heavy input.

Practical rules

  1. Use UTF-8 everywhere. Input, output, filenames, database column encoding, HTTP headers. Everywhere.
  2. Declare the encoding in every format that supports it. <meta charset="utf-8"> in HTML, # -*- coding: utf-8 -*- in old Python, Content-Type: application/json; charset=utf-8 in HTTP responses.
  3. Strip BOMs at ingest. If you accept files, handle an optional BOM. Do not write BOMs on output.
  4. Never trust that “a string has a length” — know whether you are counting bytes, code units (UTF-16), code points, or grapheme clusters. They are four different numbers.
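
Rule 4 can be made concrete with a small sketch that reports three of the four numbers. `lengths` is a hypothetical helper; the fourth number, grapheme clusters, is left out because the Python standard library has no segmenter and a third-party library would be needed.

```python
def lengths(s: str) -> dict:
    """Return byte, UTF-16 code unit, and code point counts for a string."""
    return {
        "utf8_bytes": len(s.encode("utf-8")),
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        "code_points": len(s),
    }

family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # man, ZWJ, woman, ZWJ, girl
print(lengths("\U0001F4A9"))  # {'utf8_bytes': 4, 'utf16_code_units': 2, 'code_points': 1}
print(lengths(family))        # {'utf8_bytes': 18, 'utf16_code_units': 8, 'code_points': 5}
```

A user sees the family emoji as one character, so for that string the four numbers are 18, 8, 5, and 1.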

Takeaways

Unicode is the character set. UTF-8 is the encoding. Default to UTF-8 for everything. BOMs are a historical hazard; handle them on input, do not emit them on output. String length is four different numbers depending on what you mean. For the adjacent transport-encoding topic, see the Base64 in production guide; for how character encoding interacts with URLs, see the URL encoding guide.
