
Bits and Codes · 25 min

Character Encodings: ASCII, UTF-8, Unicode

How text became bytes: from 7-bit ASCII to Unicode code points to UTF-8 variable-width encoding, with worked byte-level examples.

Why This Matters

The string é is one Unicode code point, two UTF-8 bytes, one UTF-16 code unit, and often one grapheme cluster. The string 😀 is one Unicode code point, four UTF-8 bytes, two UTF-16 code units, and one grapheme cluster. A buffer size, database column limit, or substring call must name which unit it counts.

The byte pair C3 A9 means é in UTF-8. If those same bytes are decoded as Windows-1252, they print as Ã©. That error, mojibake, is not random corruption. It is the deterministic result of using the wrong mapping from bytes to text.

Core Definitions

Definition

Character

A character is an abstract text element such as Latin capital A, Greek small alpha, or grinning face. It is not a byte pattern. The same character can have several encoded byte sequences across encodings.

Definition

Code point

A Unicode code point is an integer assigned by the Unicode Standard, written U+ followed by hexadecimal digits. Valid scalar values range from U+0000 to U+10FFFF, excluding surrogate code points U+D800 through U+DFFF.

Definition

Code unit

A code unit is the fixed-size storage unit used by an encoding form. UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units, and UTF-32 uses 32-bit code units.

Definition

Encoding

An encoding maps a sequence of characters or code points to a sequence of code units, then to bytes. ASCII maps characters directly to 7-bit integers. UTF-8 maps Unicode scalar values to one to four bytes.

Definition

Grapheme cluster

A grapheme cluster is what a user often perceives as one displayed character. It can contain several code points, such as e plus a combining acute accent, or an emoji plus a skin-tone modifier.

ASCII and the 7-Bit Byte Interface

ASCII is a 7-bit code with 128 entries, numbered 0x00 through 0x7F. The printable region starts at space, 0x20, and runs through ~, 0x7E. The digits, uppercase letters, and lowercase letters are contiguous ranges:

'0'..'9'  0x30..0x39
'A'..'Z'  0x41..0x5A
'a'..'z'  0x61..0x7A
DEL       0x7F

The first 32 entries are control codes because ASCII was designed for teleprinters and data links, not just for memory buffers. NUL at 0x00 could mean no punched holes. BEL at 0x07 rang a bell. BS at 0x08 moved the print head backward. TAB, LF, and CR at 0x09, 0x0A, and 0x0D controlled layout. ESC at 0x1B introduced terminal control sequences. DEL at 0x7F was all seven bits set, useful for punching over a bad character on paper tape.

ASCII fits in a byte by leaving the high bit zero. The letter A has code 0x41, binary 0100 0001. A newline has code 0x0A, binary 0000 1010.

#include <stdio.h>

int main(void) {
    unsigned char s[] = { 'A', '\n', 0 };
    printf("A: 0x%02X\n", s[0]);   // 0x41
    printf("\\n: 0x%02X\n", s[1]);  // 0x0A
}

A C string literal such as "A\n" usually stores three bytes: 41 0A 00. The final 00 is the NUL terminator used by C library functions, not part of the text.

The 8-Bit Code Page Problem

Once bytes became common storage units, systems used the unused high-bit range 0x80 through 0xFF for extra characters. ISO-8859-1, also called Latin-1, assigns many Western European characters in that range. In ISO-8859-1, é is the single byte 0xE9.

Windows-1252 is close to ISO-8859-1 but not the same. The range 0x80 through 0x9F is the trap. ISO-8859-1 reserves that range for control codes. Windows-1252 assigns printable characters such as curly quotes and the euro sign:

byte    ISO-8859-1        Windows-1252
0x80    control           €
0x91    control           ‘
0x92    control           ’
0x93    control           “
0x94    control           ”

This is why old text migrations fail in specific ways. If a Windows-1252 file containing € as byte 0x80 is decoded as ISO-8859-1, the result is a control character, not the euro sign. If UTF-8 bytes are decoded as Windows-1252, the result can be printable garbage.

Worked mojibake example:

text intended:        é
Unicode code point:   U+00E9
UTF-8 bytes:          C3 A9

Wrong decoding as Windows-1252:
C3 -> Ã
A9 -> ©

displayed result:     Ã©

No byte changed. The decoder used the wrong table.

Unicode Code Points and Planes

Unicode assigns integers to characters across writing systems, symbols, punctuation, and emoji. The range is U+0000 through U+10FFFF. These code points are divided into 17 planes of 0x10000 code points each.

Plane 0 is the Basic Multilingual Plane, abbreviated BMP. It spans U+0000 through U+FFFF. Many common scripts and symbols live there, including ASCII, Latin extensions, Greek, Cyrillic, Hebrew, Arabic, Devanagari, and common CJK characters.

Supplementary planes start at U+10000. Emoji such as 😀 at U+1F600 live outside the BMP. This split matters because UTF-16 stores BMP code points in one 16-bit code unit but stores supplementary code points in two code units.

Unicode separates identity from storage. The code point U+0041 names Latin capital A. UTF-8 stores it as 41. UTF-16 stores it as 0041, subject to byte order. UTF-32 stores it as 00000041, subject to byte order.

UTF-32 and UTF-16

UTF-32 uses one 32-bit code unit per code point. It is simple for indexing by code point: the nth code unit is the nth code point. It is also wasteful for ASCII-heavy text.

For the text cat\n, the code points are U+0063 U+0061 U+0074 U+000A.

UTF-8 bytes:
63 61 74 0A

UTF-32 big-endian bytes:
00 00 00 63  00 00 00 61  00 00 00 74  00 00 00 0A

That is 4 bytes versus 16 bytes for the same four code points.

UTF-16 uses one 16-bit code unit for BMP code points, except the surrogate range. Code points from U+10000 through U+10FFFF are encoded as surrogate pairs. The high surrogate range is U+D800 through U+DBFF. The low surrogate range is U+DC00 through U+DFFF.

For 😀, code point U+1F600:

U              = 0x1F600
U - 0x10000    = 0x0F600
high ten bits  = 0x3D
low ten bits   = 0x200

high surrogate = 0xD800 + 0x3D  = 0xD83D
low surrogate  = 0xDC00 + 0x200 = 0xDE00

So UTF-16 represents 😀 as two code units: D83D DE00.

JavaScript strings expose this historical choice. The .length property counts UTF-16 code units, not Unicode code points.

"A".length      = 1
"é".length      = 1
"中".length     = 1
"😀".length     = 2

This is not an emoji special case. Any supplementary-plane code point has length 2 in UTF-16 code units.

UTF-8 Encoding Mechanics

UTF-8 is a variable-width encoding of Unicode scalar values into one to four bytes. It preserves ASCII: code points U+0000 through U+007F use the same byte values as ASCII. It also makes byte classes visible from their leading bits.

range                  byte layout
U+0000..U+007F         0xxxxxxx
U+0080..U+07FF         110xxxxx 10xxxxxx
U+0800..U+FFFF         1110xxxx 10xxxxxx 10xxxxxx
U+10000..U+10FFFF      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Continuation bytes always start with binary 10. Leading bytes never start with 10. This lets a decoder recover boundaries after seeking into the middle of a byte stream: scan left until a byte is not 10xxxxxx, then decode forward.

UTF-8 also has a shortest-form rule. A code point must use the shortest byte length that can represent it. For example, U+0041 must be 41, not C1 81. Shortest-form validation prevents multiple byte strings from naming the same scalar value.

Worked UTF-8 Encodings

The letter A is U+0041, decimal 65. It is in the ASCII range, so the UTF-8 byte is unchanged:

U+0041 = 0b1000001
layout = 0xxxxxxx
byte   = 0100 0001 = 0x41

The character é is U+00E9, decimal 233. It needs two UTF-8 bytes.

U+00E9 payload, 11 bits: 000 11101001
split into 5 and 6 bits: 00011 101001

byte 1: 11000011 = 0xC3
byte 2: 10101001 = 0xA9

UTF-8: C3 A9

The character 中 is U+4E2D. It needs three bytes.

U+4E2D in binary:       0100 1110 0010 1101
split into 4,6,6 bits:  0100 111000 101101

byte 1: 11100100 = 0xE4
byte 2: 10111000 = 0xB8
byte 3: 10101101 = 0xAD

UTF-8: E4 B8 AD

The emoji 😀 is U+1F600. It needs four bytes.

U+1F600 payload, 21 bits: 000011111011000000000
split into 3,6,6,6 bits:  000 011111 011000 000000

byte 1: 11110000 = 0xF0
byte 2: 10011111 = 0x9F
byte 3: 10011000 = 0x98
byte 4: 10000000 = 0x80

UTF-8: F0 9F 98 80

A byte dump for the string Aé中😀 is therefore:

41 C3 A9 E4 B8 AD F0 9F 98 80

One string, four code points, ten UTF-8 bytes.

Key Result

Proposition

UTF-8 Is Self-Synchronizing

Statement

In a valid UTF-8 byte stream, every continuation byte has prefix 10, and no leading byte has prefix 10. From any byte position, scanning left at most three bytes reaches the start byte of the current code point or the start of the stream.

Intuition

A UTF-8 code point has length at most four bytes. Only non-initial bytes have the 10xxxxxx pattern. The leading byte tells the decoder how many total bytes belong to the code point.

Proof Sketch

The layouts are 0xxxxxxx, 110xxxxx 10xxxxxx, 1110xxxx 10xxxxxx 10xxxxxx, and 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. A byte beginning with 0, 110, 1110, or 11110 is a start byte. A byte beginning with 10 is a continuation byte. Since the longest sequence has three continuation bytes after its start byte, scanning left across continuation bytes takes at most three steps before reaching the start byte.

Why It Matters

A parser can resynchronize after a split or seek without rereading the whole file. ASCII bytes also remain byte-identical, so older protocols that treat bytes below 0x80 specially often still parse separators such as newline and slash.

Failure Mode

If a decoder accepts overlong encodings or isolated continuation bytes, distinct byte strings can decode to the same code point or to invalid scalar values. That breaks byte-level filters that assume a single canonical encoding.

Common Confusions

Watch Out

Byte Length Is Not Character Length

"Aé中😀" has 4 Unicode code points but 10 UTF-8 bytes. A fixed 8-byte field can store "Aé中" because 41 C3 A9 E4 B8 AD is 6 bytes, but it cannot store "Aé中😀" without truncation. Truncating after 8 bytes gives 41 C3 A9 E4 B8 AD F0 9F, which ends in the middle of a four-byte UTF-8 sequence.

Watch Out

Code Point Length Is Not Grapheme Length

The displayed text é can be one code point, U+00E9, or two code points, U+0065 U+0301. Both can render as the same glyph. Counting code points gives 1 for the first representation and 2 for the second, even when the user sees one grapheme cluster.

Watch Out

JavaScript Length Counts UTF-16 Code Units

"😀".length is 2 because the string stores the surrogate pair D83D DE00. Iterating by numeric index can split the pair. APIs that iterate by code point, such as for...of, treat it as one iteration value.

Watch Out

Mojibake Is Usually a Decode Error, Not a Font Error

If é becomes Ã©, the UTF-8 bytes C3 A9 were decoded as Windows-1252. A font problem usually shows a missing-glyph box for a code point the font lacks. Mojibake produces consistent wrong characters.

Exercises

ExerciseCore

Problem

Encode the Unicode code point U+20AC euro sign in UTF-8. Give the final bytes in hexadecimal.

ExerciseCore

Problem

Decode the byte sequence F0 9F 99 82 as UTF-8. Give the Unicode code point.

ExerciseAdvanced

Problem

A log displays â‚¬ where the input should display €. Explain the byte path and recover the intended UTF-8 bytes.

References

Canonical:

  • Charles Petzold, Code: The Hidden Language of Computer Hardware and Software, 2nd ed. (2022), ch. 7-9, covers telegraphs, binary codes, ASCII, and stored text
  • Andrew S. Tanenbaum and Todd Austin, Structured Computer Organization, 6th ed. (2013), ch. 2, covers data representation and character codes
  • Unicode Consortium, The Unicode Standard, Version 15.0 (2022), ch. 2 and ch. 3, defines conformance, code points, encoding forms, and UTF-8
  • Unicode Consortium, The Unicode Standard, Version 15.0 (2022), ch. 23, covers special areas including surrogates and noncharacters
  • Randal E. Bryant and David R. O'Hallaron, Computer Systems: A Programmer's Perspective, 3rd ed. (2016), §2.1-§2.2, covers bytes, representations, and C strings

Accessible:

  • Joel Spolsky, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
  • W3C Internationalization, Character Encodings: Essential Concepts
  • Python Documentation, Unicode HOWTO

Next Topics

  • /computationpath/error-detection-correction
  • /computationpath/binary-data-formats
  • /computationpath/compression-basics