Unicode, UTF-8, and Why ASCII Wasn't Enough: Character Encoding Fundamentals

ASCII was designed in 1963 for 7-bit telegraph machines. Every country's attempt to extend it to 8 bits was incompatible, producing mojibake when files crossed systems. Here's how Unicode solved the problem, why UTF-8 became dominant (backward compatibility with ASCII), what byte order marks are, and what character encoding corruption actually looks like.

ASCII was designed in 1963 for telegraph machines — and its limitations created the character encoding chaos that Unicode had to fix

ASCII (American Standard Code for Information Interchange) defines 128 characters: 26 uppercase and 26 lowercase English letters, 10 digits, the space, punctuation, and 33 control characters (things like newline, tab, and carriage return). It fits in 7 bits. For English text on American telegraph networks, it was sufficient.

The problem: most of the world's languages use characters that ASCII doesn't include. Every attempt to solve this before Unicode created incompatible standards. The mess those incompatible standards left behind still produces the corrupted text ("mojibake") that the HTML entities tool helps clean up.

The ASCII table and why 128 characters wasn't enough

ASCII characters 0–31 are control characters (non-printing). Characters 32–126 are printable. Character 127 is DEL.

The 95 printable ASCII characters are the space, uppercase A–Z, lowercase a–z, digits 0–9, and 32 punctuation marks: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~.

What ASCII excludes: accented characters (é, ü, ñ), currency symbols (€, £, ¥), non-Latin scripts (Arabic, Chinese, Cyrillic, Greek, Hebrew, Japanese, Korean), mathematical symbols, and everything else.
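
The ranges above can be checked directly. The following is a small Python sketch, not part of any particular tool; the variable names are illustrative, and `str.isascii()` assumes Python 3.7 or later.

```python
# Partition the 128 ASCII values the way the text describes them.
control = [c for c in range(128) if c < 32 or c == 127]   # 0-31 plus DEL
printable = [c for c in range(128) if 32 <= c <= 126]     # space through ~

print(len(control), len(printable))  # 33 95

# str.isascii() reports whether every character has a value below 128.
for sample in ["cafe", "café", "€", "中"]:
    print(repr(sample), sample.isascii())

# Encoding a character outside the table as ASCII fails outright.
ascii_error = None
try:
    "é".encode("ascii")
except UnicodeEncodeError as exc:
    ascii_error = exc
    print("cannot encode as ASCII:", exc.reason)
```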

The 8th bit extensions: ASCII uses 7 bits, leaving the 8th bit unused. Different regions filled this with characters 128–255 (the "extended ASCII" range) for their own languages — but with completely incompatible assignments.

ISO 8859-1 (Latin-1): used by most of Western Europe; positions 160–255 cover accented Latin characters (é = 233, ü = 252, ñ = 241)
ISO 8859-5: Cyrillic alphabet occupies positions 160–255
ISO 8859-6: Arabic
Windows-1252 (cp1252): Microsoft's variant of Latin-1 with additional printable characters (such as € and curly quotes) in positions 128–159, which Latin-1 does not assign to printable characters

The problem: a file created on a French computer in ISO 8859-1 and opened on a Russian computer expecting ISO 8859-5 produces garbled output. The byte 0xE9 means é in Latin-1 and щ in ISO 8859-5.
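
To see the clash concretely, here is a minimal Python sketch that decodes the same byte under both standards. The codec names are Python's standard aliases; the variable names are just for illustration.

```python
# One byte, two legacy code pages, two different characters.
raw = bytes([0xE9])

as_latin1 = raw.decode("iso-8859-1")    # Western European reading
as_cyrillic = raw.decode("iso-8859-5")  # Cyrillic reading

print(as_latin1, as_cyrillic)  # é щ

# A whole word written on a Latin-1 system, then opened as ISO 8859-5.
stored = "café".encode("iso-8859-1")
misread = stored.decode("iso-8859-5")
print(misread)  # cafщ
```

Nothing in the bytes themselves says which reading is right, which is why the encoding has to be communicated separately or agreed on in advance.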

Unicode: one standard for all characters

Unicode is the solution: a single character set that assigns a unique "code point" to every character in every writing system in the world, plus emoji, mathematical symbols, and ancient scripts.

Current scope: Unicode 15.1 (2023) defines 149,813 characters across 161 scripts, including Emoji (1,872 emoji), cuneiform, ancient Egyptian hieroglyphs, and the Linear A script of the Minoans.

How code points are written: U+XXXX where XXXX is a hexadecimal number.

U+0041 = A (Latin capital letter A)
U+00E9 = é (Latin small letter e with acute)
U+4E2D = 中 (Chinese character for "middle/China")
U+1F600 = 😀 (grinning face emoji)

Code points are NOT bytes. The Unicode character set defines what characters exist and their code point numbers. How those numbers are stored as bytes is a separate question, answered by the encoding forms UTF-8, UTF-16, and UTF-32.
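
The split between code point and bytes is easy to observe in Python, where `ord()` returns the code point and `str.encode()` produces the bytes. The sketch below is illustrative only; `hex_bytes` and `describe` are hypothetical helper names.

```python
def hex_bytes(data: bytes) -> str:
    """Render bytes as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in data)

def describe(ch: str) -> dict:
    """Return the code point of ch and its bytes in three encodings."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf-8": hex_bytes(ch.encode("utf-8")),
        "utf-16-be": hex_bytes(ch.encode("utf-16-be")),
        "utf-32-be": hex_bytes(ch.encode("utf-32-be")),
    }

# Same four characters as the list above: one code point each,
# but a different byte sequence in every encoding.
for ch in ["A", "é", "中", "😀"]:
    print(describe(ch))
```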

UTF-8: the dominant web encoding

UTF-8 is a variable-width encoding:

Code points U+0000–U+007F (ASCII range): 1 byte — identical to ASCII
U+0080–U+07FF: 2 bytes
U+0800–U+FFFF: 3 bytes (covers most CJK characters)
U+10000–U+10FFFF: 4 bytes (emoji, rare scripts)

The elegance: ASCII text is valid UTF-8. Any program that correctly reads ASCII also correctly reads the ASCII-compatible portions of UTF-8. This backward compatibility was crucial for adoption.
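
A short Python sketch of both points, assuming nothing beyond the standard library: the width ranges listed above, and the fact that ASCII text yields identical bytes under either encoding. The function name `utf8_width` is hypothetical.

```python
def utf8_width(code_point: int) -> int:
    """Bytes UTF-8 needs for a code point, following the ranges listed above."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4

# Compare the table against what Python's own encoder produces.
for ch in ["A", "é", "中", "😀"]:
    actual = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}: table says {utf8_width(ord(ch))}, encoder used {actual}")

# Pure ASCII text is byte-for-byte the same in both encodings.
message = "Hello, world!"
same_bytes = message.encode("ascii") == message.encode("utf-8")
print(same_bytes)  # True
```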

UTF-8 byte patterns:

1-byte:  0xxxxxxx                            (U+0000–U+007F)
2-byte:  110xxxxx 10xxxxxx                   (U+0080–U+07FF)
3-byte:  1110xxxx 10xxxxxx 10xxxxxx          (U+0800–U+FFFF)
4-byte:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (U+10000–U+10FFFF)

The leading byte pattern tells you how many continuation bytes follow. Continuation bytes always begin with the bits 10, and no leading byte does, which makes UTF-8 self-synchronising: if you start reading in the middle of a byte stream, you can skip any continuation bytes and always find the start of the next character.
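
The bit patterns can be turned into a hand-written encoder. This is a teaching sketch in Python, not a replacement for `str.encode("utf-8")`; the function names `encode_utf8` and `next_char_start` are hypothetical.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point using the four bit patterns shown above."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0x7F:                       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    return bytes([0b11110000 | (cp >> 18),   # 11110xxx + three continuations
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])

def next_char_start(data: bytes, pos: int) -> int:
    """Skip continuation bytes (10xxxxxx) to reach the next character start."""
    while pos < len(data) and (data[pos] & 0b11000000) == 0b10000000:
        pos += 1
    return pos

# The hand-written encoder agrees with Python's built-in one.
for ch in ["A", "é", "中", "😀"]:
    assert encode_utf8(ord(ch)) == ch.encode("utf-8")

# Landing in the middle of 中 (e4 b8 ad) and resynchronising on "b".
data = "a中b".encode("utf-8")
print(next_char_start(data, 2))  # 4
```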

Web dominance: UTF-8 became the encoding for over 98% of websites (W3Techs, 2024). Every HTML page should declare <meta charset="UTF-8">.

UTF-16: the Windows and JavaScript encoding

UTF-16 encodes most characters as 2 bytes (16 bits). Characters above U+FFFF use "surrogate pairs" — two 2-byte units.
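
The surrogate pair arithmetic is short enough to write out. The Python sketch below follows the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves); the function names are hypothetical.

```python
import struct

def to_surrogates(cp: int) -> tuple:
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = cp - 0x10000                  # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

def from_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# The pair matches the bytes Python's UTF-16 encoder emits for the emoji.
packed = struct.pack(">HH", high, low)
print(packed == "😀".encode("utf-16-be"))  # True
```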

Where UTF-16 is used:

Windows internal APIs (wchar_t is 2 bytes on Windows)
JavaScript strings (exposed to code as sequences of UTF-16 code units)
Java's char type
.NET string type

The surrogate pair problem in JavaScript:

"😀".length  // 2: JS counts UTF-16 code units, not characters
"😀"[0]      // '\uD83D': a lone high surrogate, half of the pair

This is why JavaScript string operations on emoji require special handling (covered in the Text Reverser article).
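
For contrast, Python's `len()` counts code points rather than UTF-16 code units, so the "length" of a string depends on what is being counted. A minimal sketch, with `utf16_units` as a hypothetical helper name:

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units the text occupies in UTF-16."""
    return len(text.encode("utf-16-le")) // 2

s = "😀"
code_points = len(s)                   # what Python's len() counts
code_units = utf16_units(s)            # what JavaScript's .length counts
utf8_bytes = len(s.encode("utf-8"))    # what a UTF-8 file stores

print(code_points, code_units, utf8_bytes)  # 1 2 4
```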

Byte Order Mark (BOM)

UTF-16 has an ambiguity: the same two bytes can be read as two different characters depending on whether the system is big-endian (most significant byte first) or little-endian (least significant byte first).

The Byte Order Mark (BOM) is a special character (U+FEFF, "zero-width no-break space") placed at the start of a text file to signal the byte order:

UTF-16 BE (big-endian): FE FF
UTF-16 LE (little-endian): FF FE
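
The ambiguity and its fix can be reproduced in a few lines of Python. This is an illustrative sketch; the BOM constants come from the standard `codecs` module, and Python's generic `utf-16` codec uses a leading BOM to pick the byte order.

```python
import codecs

# The same two bytes, read in each byte order.
pair = bytes([0x4E, 0x2D])
big = pair.decode("utf-16-be")      # U+4E2D
little = pair.decode("utf-16-le")   # U+2D4E, a different character
print(f"U+{ord(big):04X} U+{ord(little):04X}")  # U+4E2D U+2D4E

# With a BOM in front, the generic "utf-16" codec works out the order itself
# and drops the BOM from the decoded text.
be_file = codecs.BOM_UTF16_BE + "中".encode("utf-16-be")   # fe ff 4e 2d
le_file = codecs.BOM_UTF16_LE + "中".encode("utf-16-le")   # ff fe 2d 4e
print(be_file.decode("utf-16") == le_file.decode("utf-16") == "中")  # True
```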

UTF-8 BOM: Some tools (particularly Microsoft tools) add a BOM to UTF-8 files: the bytes EF BB BF. This is unnecessary in UTF-8 (which has no endianness ambiguity) and causes problems:

# UTF-8 with BOM — the BOM appears as a character
with open('file_with_bom.txt', 'r', encoding='utf-8') as f:
    content = f.read()
    print(content[0])  # '\ufeff' — the BOM character

# Correct: use utf-8-sig which strips BOM automatically
with open('file_with_bom.txt', 'r', encoding='utf-8-sig') as f:
    content = f.read()

A UTF-8 BOM at the start of a file causes parsing failures in CSV imports, JSON parsing, XML processing, and PHP files (the BOM appears before the <?php opening tag, producing the "headers already sent" error).
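
The JSON case is easy to reproduce without a file on disk. In this Python sketch the payload is made up for illustration; the behaviour shown is that of the standard `json` module when handed a string that still contains the BOM.

```python
import codecs
import json

# Bytes as a BOM-adding editor would save them: EF BB BF, then the JSON.
payload = codecs.BOM_UTF8 + '{"name": "café"}'.encode("utf-8")

text_plain = payload.decode("utf-8")       # BOM survives as '\ufeff'
text_clean = payload.decode("utf-8-sig")   # BOM removed during decoding

# json.loads refuses a string that starts with the BOM character.
try:
    json.loads(text_plain)
    bom_rejected = False
except json.JSONDecodeError:
    bom_rejected = True

data = json.loads(text_clean)
print(bom_rejected, data)
```

Decoding with `utf-8-sig` is safe for files without a BOM too, since it only strips the marker when one is present.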

Mojibake: what character encoding corruption looks like

Mojibake (文字化け, Japanese for "character transformation") is the garbled text that appears when bytes written in one encoding are decoded as another.

Common patterns:

Original	Stored as	Read as	Result
café (UTF-8)	`63 61 66 c3 a9`	Windows-1252	`cafÃ©`
résumé (Latin-1)	`72 e9 73 75 6d e9`	UTF-8	`r�sum�` (U+FFFD replacement characters) or a decoding error
中文 (UTF-8)	`e4 b8 ad e6 96 87`	Windows-1252	`ä¸æ–‡` (with an invisible soft hyphen after `¸`)
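
Each row of the table can be reproduced by encoding with one codec and decoding with another. The Python sketch below uses a hypothetical helper, `misdecode`; the repair step at the end only works when the wrong decoding lost no information, which is an assumption that does not hold for the second row.

```python
def misdecode(text: str, stored_as: str, read_as: str, errors: str = "strict") -> str:
    """Encode text with one codec, then decode the bytes with another."""
    return text.encode(stored_as).decode(read_as, errors=errors)

row1 = misdecode("café", "utf-8", "cp1252")
row2 = misdecode("résumé", "latin-1", "utf-8", errors="replace")
row3 = misdecode("中文", "utf-8", "cp1252")

print(row1)         # cafÃ©
print(ascii(row2))  # 'r\ufffdsum\ufffd'
print(ascii(row3))  # '\xe4\xb8\xad\xe6\u2013\u2021'

# Row 1 is reversible: re-encode with the wrong codec, decode with the right one.
repaired = row1.encode("cp1252").decode("utf-8")
print(repaired)     # café
```

With the default `errors="strict"`, the second row raises `UnicodeDecodeError` instead of producing replacement characters, and the original bytes cannot be recovered from the replaced text.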

The HTML entities connection: HTML entities (&eacute; for é, &#233; for the same) are a way to represent non-ASCII characters safely in ASCII-only contexts. In correctly declared UTF-8 HTML, you can just write é directly, but entities remain useful for legacy environments, email HTML, and situations where the character encoding declaration might be missing or wrong.
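
As a rough illustration of what encoding and decoding entities involves, here is a Python sketch using only the standard library. It is not the implementation behind any particular tool; `html.unescape` and the `xmlcharrefreplace` error handler are standard Python features.

```python
import html

# Decoding: named, decimal, and hexadecimal references all yield the same é.
named = html.unescape("caf&eacute;")
decimal = html.unescape("caf&#233;")
hexadecimal = html.unescape("caf&#xE9;")
print(named == decimal == hexadecimal == "café")  # True

# Encoding: replace every non-ASCII character with a numeric reference,
# leaving a pure-ASCII string that survives a wrong charset declaration.
encoded = "café 😀".encode("ascii", "xmlcharrefreplace").decode("ascii")
print(encoded)  # caf&#233; &#128512;
```

Note that `html.escape` serves a different purpose: it escapes the markup characters &, <, > and quotes, and leaves non-ASCII text untouched.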

How to use the HTML Entities tool on sadiqbd.com

Encode text containing special characters to HTML entities — safe for older systems
Decode entities back to characters — read encoded text as readable text
Identify unknown characters — paste text with unusual characters to see their entity codes
Fix mojibake — use the tool alongside a character encoding converter to identify and fix encoding mismatches

Frequently Asked Questions

Does UTF-8 support all human languages? Yes — UTF-8 encodes all Unicode code points, which covers every writing system in current use plus many historical scripts. The practical question is font support: a browser or application must have a font containing the relevant glyphs to display them.

Why do some developers still use ISO 8859-1? Legacy applications and databases. If a database was created with a Latin-1 character set and its data has since been migrated without re-encoding, the stored bytes are still Latin-1. Converting databases with decades of data to UTF-8 requires a careful migration that re-encodes every text column.

Is the HTML Entities tool free? Yes — completely free, no sign-up required.

Try the HTML Entities tool free at sadiqbd.com — encode and decode HTML entities, and convert any character to its Unicode code point.