Huffman Coding: How Character Frequency Powers ZIP, PNG, and JPEG Compression

Every ZIP file, PNG image, and gzip-compressed page relies on the same observation as the previous article's frequency analysis: characters aren't equally frequent, so assigning shorter codes to common symbols and longer codes to rare ones saves space. Here's how Huffman coding builds optimal variable-length codes from frequency data, why "prefix-free" codes need no separators, how this fits into larger compression pipelines like DEFLATE, and why already-compressed data resists further compression.

Every time you save a ZIP file, send a fax, or stream a video, character (or symbol) frequency analysis is running in the background — because the most common compression technique in computing history assigns shorter codes to more frequent symbols, and the previous article's "E is the most common letter" observation is exactly the kind of fact this technique exploits

The previous article on this site covered how character frequency analysis breaks classical ciphers. This article looks at a completely different application of the same underlying observation — that characters/symbols in real data are not equally frequent — applied not to breaking codes, but to compressing data: Huffman coding, one of the foundational algorithms in data compression.

The core insight: fixed-length codes waste space on common symbols

Standard text encoding (ASCII, or the ASCII-compatible portion of UTF-8, covered in previous articles) uses a fixed number of bits per character — 8 bits (1 byte) per character, regardless of which character it is.

But, as the frequency-analysis article demonstrated, characters are not equally frequent — "E" appears far more often than "Z" in English text. A fixed-length code "spends" the same number of bits on every occurrence of "E" as on every occurrence of "Z" — even though "E" occurs far more often.

The compression insight: if frequently-occurring characters were assigned shorter codes, and rarely-occurring characters longer codes — the total number of bits needed to represent a piece of text could be reduced — frequent characters (getting short codes) contribute less, per occurrence, to the total size; infrequent characters (getting longer codes) contribute more per occurrence, but occur rarely enough that this doesn't outweigh the savings from the frequent characters' shorter codes.

Huffman coding: building an optimal variable-length code from frequencies

Huffman coding (developed by David Huffman in 1952) is an algorithm that, given the frequency of each symbol in a piece of data, constructs a variable-length binary code — assigning shorter bit sequences to more frequent symbols, and longer bit sequences to less frequent symbols — in a way that's provably optimal (for this general approach — symbol-by-symbol, variable-length, prefix-free codes, a category that Huffman coding belongs to, and is optimal within).

The algorithm (conceptually):

Start with each symbol as a separate "node," weighted by its frequency
Repeatedly combine the two lowest-frequency nodes into a new node (whose frequency is the sum of the two combined nodes), until only one node remains (the "root")
This process builds a binary tree — the path from the root to each original symbol (a sequence of left/right choices, encoded as 0s/1s) becomes that symbol's code

The result: symbols that were combined early (low frequency, far from the root) end up with longer codes (more steps from root to leaf); symbols combined late (high frequency, close to the root) get shorter codes.

"Prefix-free": why variable-length codes can be unambiguous

A natural question: if codes have different lengths, how does a decoder know where one code ends and the next begins, without explicit separators (which would themselves take space, undermining the compression)?

Huffman codes are prefix-free (also called "prefix codes") — no code is a prefix of any other code — e.g., if "E" is coded as 0, no other symbol's code can start with 0 (every other code must start with 1) — this property guarantees that a decoder, reading bits left-to-right, can always determine, unambiguously, "I've just completed reading one full code" — without needing any separator characters between codes.

This is why Huffman's tree-based construction produces prefix-free codes automatically — each symbol corresponds to a unique leaf of the binary tree, and the path from root to leaf (the code) can't be a prefix of another leaf's path, because reaching a leaf means the path has ended — you can't continue "further" from a leaf to reach a different leaf.

Huffman coding's role within larger compression algorithms

Huffman coding is rarely used entirely alone in modern compression — it's typically one stage within a larger compression pipeline:

DEFLATE (used by ZIP, gzip, PNG, and many others): combines LZ77 (a different technique, finding repeated sequences within the data and replacing them with references to earlier occurrences) with Huffman coding — LZ77 reduces repetition; Huffman coding then further compresses the output of the LZ77 stage, by assigning shorter codes to whatever symbols (which, after LZ77, represent a mix of literal characters and "reference" codes) are most frequent in that specific file's LZ77-processed output.

JPEG image compression uses Huffman coding (or a similar variable-length-code technique) as a late stage — after other transformations (related to how visual data is represented — beyond the scope of this article) produce a sequence of values whose frequency distribution is then Huffman-coded for final compression.

The pattern: Huffman coding is a general-purpose, "final-stage" technique — applicable whenever you have a sequence of symbols with a non-uniform frequency distribution — regardless of what those symbols "mean" (raw text characters, LZ77 output codes, JPEG coefficient values) — the frequency-distribution observation is the common thread, connecting text-character frequency (the original topic of this site's char-frequency articles) to image/archive-file compression, via the shared underlying principle.

Why Huffman coding alone can't always reach the theoretical (entropy) limit

The previous article on Shannon entropy established that entropy represents a theoretical lower bound on the average number of bits needed per symbol, given a probability distribution.

Huffman coding gets close to this limit, but generally can't reach it exactly — because Huffman codes must have whole-number bit-lengths (a code is some integer number of bits — 1, 2, 3, etc.) — but the entropy-optimal "length" for a symbol, given its probability p, is -log2(p) bits — which is, in general, not a whole number.

Example: a symbol with probability 0.4 — its entropy-optimal "length" would be -log2(0.4) ≈ 1.32 bits — but Huffman coding must assign it either 1 bit or 2 bits (some whole number) — whichever is chosen, there's a small "mismatch" between the assigned (whole-number) length and the entropy-optimal (fractional) length — *this mismatch, summed across all symbols, is why Huffman coding's average code length is generally slightly above the entropy limit, though often quite close.

More advanced techniques (arithmetic coding, range coding) can achieve compression closer to the entropy limit, by avoiding the "whole-number-of-bits-per-symbol" constraint entirely (effectively allowing "fractional-bit" codes, conceptually, by encoding entire sequences of symbols together, rather than symbol-by-symbol) — these techniques are used in some modern compression formats (certain video/image codecs, for instance) where the additional compression gains justify the additional computational complexity, compared to Huffman coding's simplicity.

How to use the Character Frequency tool on sadiqbd.com

Explore why compression works: analyze the character-frequency distribution of a piece of text — more skewed distributions (a few characters dominating) represent more "compressible" data (in principle) than flatter distributions (where all characters occur with similar frequency, closer to "random" data, which is harder to compress)
Estimate theoretical compression potential: combined with the Shannon entropy calculation (covered in the previous article) — the frequency distribution directly informs the entropy figure, which represents the theoretical lower bound a Huffman-coding-based (or similar) compressor could approach for this specific data

Frequently Asked Questions

Why doesn't every file get compressed using Huffman coding, if it's based on such a fundamental principle? Data that's already close to "random" (a flat, uniform frequency distribution — every possible byte value occurring with roughly equal frequency) has little "compressibility" via frequency-based techniques — Huffman coding (or any frequency-based approach) provides minimal benefit for such data, because there's no "skew" to exploit (assigning shorter codes to some symbols requires assigning longer codes to others — if all symbols are equally frequent, there's no net benefit to this reassignment). **This is why already-compressed data (a .zip file, a .jpg image, encrypted data) doesn't compress further (much, if at all) when run through another compression algorithm — prior compression (or encryption, which aims for output indistinguishable from random — per the previous cryptanalysis article's discussion of modern ciphers) has already "flattened" the frequency distribution toward uniform, leaving little additional frequency-based compressibility for a second pass to exploit.