Frequency Analysis: How Counting Letters Breaks Classical Ciphers

A Caesar cipher can be broken in seconds — not by trying all 25 shifts, but by counting which ciphertext letter appears most often and matching it against English's most common letter, "E." Here's how frequency analysis breaks substitution ciphers, why polyalphabetic ciphers like Vigenère were designed to defeat it, and why modern encryption (AES, RSA) is immune to this entire category of attack.


Frequency analysis is one of the oldest and most fundamental techniques in cryptanalysis: count how often each letter appears in a piece of ciphertext, then compare that distribution against the known letter frequencies of the plaintext language. Simple substitution ciphers, where each letter is consistently replaced by the same other letter throughout the entire message, are vulnerable precisely because substitution doesn't change how often each symbol appears. It only relabels the frequencies.
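The counting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular tool; the function name `letter_frequencies` and the sample text are made up for the example, and only the letters A to Z are counted.

```python
from collections import Counter


def letter_frequencies(text: str) -> dict[str, float]:
    """Return each letter's share of all A-Z letters in `text`, most common first."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {letter: count / total for letter, count in counts.most_common()}


sample = "ATTACK AT DAWN"  # 12 letters: A appears 4 times, T appears 3 times
for letter, share in letter_frequencies(sample).items():
    print(f"{letter}: {share:.2f}")
```

Spaces, digits, and punctuation are ignored here, so the shares always sum to 1 over the letters that are present.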

English letter frequency: the baseline

In typical English text, letter frequencies are famously not uniform — some letters appear far more often than others:

Most frequent letters (approximate order): E, T, A, O, I, N, S, H, R. By commonly cited counts, these nine letters together account for roughly 70% of all letter occurrences in typical English text.

Least frequent letters: Q, X, Z, J. Each of these typically accounts for well under 1% of letters.

"E" is the most common letter in English, making up roughly 12 to 13% of letters in typical text. This is why "E" is often the first letter cryptanalysts look for when attacking a substitution cipher.
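For reference, the sketch below hard-codes one commonly cited set of approximate English letter percentages. Exact values vary from corpus to corpus, so the numbers (and the name `ENGLISH_FREQ`) should be read as illustrative assumptions rather than a definitive table.

```python
# Approximate English letter frequencies in percent. These are commonly cited
# values; different corpora give slightly different numbers.
ENGLISH_FREQ = {
    "A": 8.17, "B": 1.49, "C": 2.78, "D": 4.25, "E": 12.70, "F": 2.23,
    "G": 2.02, "H": 6.09, "I": 6.97, "J": 0.15, "K": 0.77, "L": 4.03,
    "M": 2.41, "N": 6.75, "O": 7.51, "P": 1.93, "Q": 0.10, "R": 5.99,
    "S": 6.33, "T": 9.06, "U": 2.76, "V": 0.98, "W": 2.36, "X": 0.15,
    "Y": 1.97, "Z": 0.07,
}

# Rank letters from most to least frequent.
ranked = sorted(ENGLISH_FREQ, key=ENGLISH_FREQ.get, reverse=True)
top_nine = ranked[:9]
top_nine_share = sum(ENGLISH_FREQ[letter] for letter in top_nine)

print("".join(top_nine))         # ETAOINSHR
print(f"{top_nine_share:.1f}%")  # 69.6%
print(sorted(ranked[-4:]))       # ['J', 'Q', 'X', 'Z']
```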

How frequency analysis breaks a Caesar cipher

A Caesar cipher shifts every letter by a fixed amount — e.g., a shift of 3 turns A→D, B→E, ..., Z→C.
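As a concrete illustration of the shift, here is a small Python sketch. The function name `caesar_shift` is an assumption made for this example; it leaves non-letters untouched and decrypts by applying the negative shift.

```python
def caesar_shift(text: str, shift: int) -> str:
    """Shift every ASCII letter by `shift` positions, wrapping around after Z."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            out.append(ch)  # spaces, digits, punctuation pass through unchanged
    return "".join(out)


ciphertext = caesar_shift("ATTACK AT DAWN", 3)
print(ciphertext)                    # DWWDFN DW GDZQ
print(caesar_shift(ciphertext, -3))  # ATTACK AT DAWN
```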

The attack: count the frequency of each letter in the ciphertext. Whatever letter appears most frequently in the ciphertext is likely to correspond to "E" (the most frequent letter in English plaintext) — because the Caesar cipher preserves the relative frequency ordering, just shifts which symbol represents which frequency rank.

If the most frequent ciphertext letter is, say, "H" — and "E" is expected to be the most frequent plaintext letter — this suggests a shift where E→H, which is a shift of +3 (E is the 5th letter, H is the 8th letter, 8-5=3). Testing this shift (shifting the entire ciphertext back by 3) would, if correct, produce readable English text.
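The "most frequent letter is probably E" guess can be sketched as follows. The helper names (`caesar_shift`, `guess_shift`) and the sample message are hypothetical, and the guess is only a starting hypothesis: on short or unusual texts the most frequent letter may not be "E."

```python
from collections import Counter


def caesar_shift(text: str, shift: int) -> str:
    """Shift uppercase letters by `shift`; other characters pass through."""
    return "".join(
        chr((ord(ch) - ord("A") + shift) % 26 + ord("A")) if "A" <= ch <= "Z" else ch
        for ch in text.upper()
    )


def guess_shift(ciphertext: str) -> int:
    """Guess the shift by assuming the most frequent ciphertext letter stands for E."""
    letters = [ch for ch in ciphertext.upper() if "A" <= ch <= "Z"]
    if not letters:
        raise ValueError("ciphertext contains no letters")
    most_common = Counter(letters).most_common(1)[0][0]
    return (ord(most_common) - ord("E")) % 26


message = "MEET ME HERE BEFORE THE SEVENTH EVENING"
ciphertext = caesar_shift(message, 3)

shift = guess_shift(ciphertext)
print(shift)                             # 3, because E was shifted to H
print(caesar_shift(ciphertext, -shift))  # MEET ME HERE BEFORE THE SEVENTH EVENING
```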

Why this works so quickly for Caesar ciphers specifically: there are only 25 possible shifts (excluding shift-by-0, which would be no encryption at all) — even without frequency analysis, simply trying all 25 shifts and checking which produces readable text ("brute force") is fast for a computer (and even feasible by hand for a short message). Frequency analysis doesn't make Caesar-cipher-breaking possible — it makes it faster, by immediately suggesting the most likely shift(s) to try first, rather than requiring trying all 25 systematically.
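A brute-force variant can also let the computer decide which of the candidate shifts "looks like English." The sketch below scores every shift with a chi-squared statistic against approximate English letter percentages and keeps the best one. The scoring method, the frequency values, and all names here are assumptions chosen for illustration; checking candidates against a word list would be another reasonable approach.

```python
# Approximate English letter frequencies in percent (commonly cited values).
ENGLISH_FREQ = {
    "A": 8.17, "B": 1.49, "C": 2.78, "D": 4.25, "E": 12.70, "F": 2.23,
    "G": 2.02, "H": 6.09, "I": 6.97, "J": 0.15, "K": 0.77, "L": 4.03,
    "M": 2.41, "N": 6.75, "O": 7.51, "P": 1.93, "Q": 0.10, "R": 5.99,
    "S": 6.33, "T": 9.06, "U": 2.76, "V": 0.98, "W": 2.36, "X": 0.15,
    "Y": 1.97, "Z": 0.07,
}


def caesar_shift(text: str, shift: int) -> str:
    """Shift uppercase letters by `shift`; other characters pass through."""
    return "".join(
        chr((ord(ch) - ord("A") + shift) % 26 + ord("A")) if "A" <= ch <= "Z" else ch
        for ch in text.upper()
    )


def chi_squared(text: str) -> float:
    """Lower scores mean the letter counts look more like typical English."""
    letters = [ch for ch in text if "A" <= ch <= "Z"]
    n = len(letters)
    if n == 0:
        return float("inf")
    score = 0.0
    for letter, pct in ENGLISH_FREQ.items():
        expected = n * pct / 100
        observed = letters.count(letter)
        score += (observed - expected) ** 2 / expected
    return score


def crack_caesar(ciphertext: str) -> tuple[int, str]:
    """Try all 26 shifts and return (shift, plaintext) for the best-scoring one."""
    best_shift = min(range(26), key=lambda s: chi_squared(caesar_shift(ciphertext, -s)))
    return best_shift, caesar_shift(ciphertext, -best_shift)


message = ("THE ENEMY WILL ATTACK THE NORTHERN GATE AT DAWN SEND "
           "REINFORCEMENTS TO THE EASTERN WALL BEFORE SUNRISE")
shift, recovered = crack_caesar(caesar_shift(message, 7))
print(shift, recovered)
```

With very short ciphertexts the statistic becomes unreliable, so printing the top few candidates for a human to inspect is a sensible fallback.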

General substitution ciphers: where frequency analysis becomes essential

A general substitution cipher doesn't use a simple shift. Instead, each letter is mapped to some other letter via an arbitrary (not necessarily shift-based) one-to-one mapping. There are 26! (26 factorial, roughly 4 × 10^26) possible substitution mappings, which makes brute-force testing of every mapping completely infeasible, even for computers.
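To get a feel for the size of that key space, the arithmetic can be checked directly. The testing rate of one billion keys per second below is a hypothetical figure chosen only to make the scale concrete.

```python
import math

keyspace = math.factorial(26)
print(keyspace)           # 403291461126605635584000000
print(f"{keyspace:.2e}")  # 4.03e+26

# Hypothetical attacker testing one billion mappings per second.
KEYS_PER_SECOND = 1_000_000_000
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

years = keyspace / KEYS_PER_SECOND / SECONDS_PER_YEAR
print(f"{years:.1e} years to try every mapping")
```

Under that assumed rate, exhausting the key space would take on the order of ten billion years, compared with 25 trials for a Caesar cipher.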

Frequency analysis remains effective, however, because the substitution doesn't change how often each underlying plaintext letter occurs — it just changes which symbol represents that letter. If "E" is the most common plaintext letter, whatever ciphertext symbol "E" gets mapped to will also be the most common symbol in the ciphertext — frequency analysis identifies this symbol as "likely E," regardless of what the substitution mapping actually is, because frequency analysis operates on the statistical pattern, not on trying specific mappings.
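The "relabeling" effect is easy to check empirically. In this sketch a random substitution key is generated (the seed, helper names, and sample message are arbitrary choices for the example), and the sorted letter counts come out identical before and after encryption.

```python
import random
import string
from collections import Counter


def random_substitution_key(seed: int) -> dict[str, str]:
    """Build an arbitrary one-to-one mapping of A-Z onto a shuffled A-Z."""
    rng = random.Random(seed)
    shuffled = list(string.ascii_uppercase)
    rng.shuffle(shuffled)
    return dict(zip(string.ascii_uppercase, shuffled))


def substitute(text: str, key: dict[str, str]) -> str:
    """Replace each letter using `key`; other characters pass through."""
    return "".join(key.get(ch, ch) for ch in text.upper())


def letter_counts(text: str) -> Counter:
    return Counter(ch for ch in text if ch in string.ascii_uppercase)


plaintext = "MEET ME HERE BEFORE THE SEVENTH EVENING"
key = random_substitution_key(seed=42)
ciphertext = substitute(plaintext, key)

plain_counts = letter_counts(plaintext)
cipher_counts = letter_counts(ciphertext)

# Same counts, different labels: the shape of the distribution is unchanged.
print(sorted(plain_counts.values(), reverse=True))
print(sorted(cipher_counts.values(), reverse=True))

# The most common ciphertext symbol is whatever "E" was mapped to.
print(cipher_counts.most_common(1)[0][0] == key["E"])  # True
```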

The iterative process: identify the most frequent ciphertext symbol and hypothesize that it is "E." Then look for other patterns, such as common short words like "the" and "and." If a three-letter ciphertext "word" appears very frequently and its last letter is the hypothesized "E," that is consistent with "the," suggesting the first two symbols correspond to "T" and "H." From there, progressively build up the full substitution mapping, symbol by symbol, using both overall frequency and contextual or positional patterns (common digraphs, common short words, and doubled letters such as "ll," "ee," and "ss") to confirm or refine each hypothesis.
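The clue-gathering part of this process can be sketched as a small helper that tallies single letters, three-letter words, and doubled letters. It assumes the ciphertext keeps its word spacing, which real ciphertext often does not; the function name `gather_clues` and the sample ciphertext are invented for the example.

```python
import re
from collections import Counter


def gather_clues(ciphertext: str, top: int = 3) -> dict[str, list[str]]:
    """Collect the most common letters, three-letter words, and doubled letters."""
    words = re.findall(r"[A-Z]+", ciphertext.upper())
    letters = Counter(ch for word in words for ch in word)
    three_letter_words = Counter(word for word in words if len(word) == 3)
    digraphs = Counter(word[i:i + 2] for word in words for i in range(len(word) - 1))
    doubles = Counter({pair: n for pair, n in digraphs.items() if pair[0] == pair[1]})
    return {
        "letters": [item for item, _ in letters.most_common(top)],
        "three_letter_words": [item for item, _ in three_letter_words.most_common(top)],
        "doubles": [item for item, _ in doubles.most_common(top)],
    }


# Sample ciphertext (a substitution of "THE BELL FELL AS THE MEN MET THE KEEPER").
clues = gather_clues("WKH EHOO IHOO DV WKH PHQ PHW WKH NHHSHU")
print(clues["letters"][0])             # H    -> hypothesis: H stands for E
print(clues["three_letter_words"][0])  # WKH  -> ends in H, consistent with "THE"
print(clues["doubles"])                # ['OO', 'HH'] -> candidates for LL, EE, SS
```

Each clue only narrows the possibilities; the analyst still has to test hypotheses against the partially decoded text and backtrack when a guess produces nonsense.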

Why this historically mattered: Vigenère and beyond

Simple substitution ciphers were used historically for genuine secrecy, but frequency analysis made them fundamentally insecure for any message of meaningful length. The technique dates back to early cryptanalysis work in the medieval Islamic world and was developed further through the European Renaissance. It does require enough ciphertext for the statistical patterns to emerge clearly: very short messages don't provide enough data for it to be reliable, which is part of why a short message encrypted with an otherwise weak cipher can still be hard to break in practice, even though the cipher itself is theoretically weak.

Polyalphabetic ciphers (like the Vigenère cipher) were developed specifically to defeat simple frequency analysis. They use multiple substitution alphabets and cycle through them based on a key, so the same plaintext letter, appearing at different positions, may be encrypted to different ciphertext letters depending on which alphabet in the cycle is active at that position. This flattens the overall frequency distribution and makes simple single-distribution frequency analysis much less effective.
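A short sketch makes the "same letter, different ciphertext" effect visible. It uses the widely quoted textbook example of encrypting ATTACKATDAWN with the key LEMON; the function name `vigenere_encrypt` is an assumption for this illustration.

```python
def vigenere_encrypt(plaintext: str, key: str) -> str:
    """Shift each letter by the matching key letter (A=0, B=1, ...), cycling the key."""
    out = []
    key_index = 0
    for ch in plaintext.upper():
        if "A" <= ch <= "Z":
            shift = ord(key[key_index % len(key)].upper()) - ord("A")
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
            key_index += 1  # only letters consume a key position
        else:
            out.append(ch)
    return "".join(out)


plaintext = "ATTACKATDAWN"
ciphertext = vigenere_encrypt(plaintext, "LEMON")
print(ciphertext)  # LXFOPVEFRNHR

# The four plaintext A's (positions 0, 3, 6, 9) become four different letters.
print([ciphertext[i] for i, ch in enumerate(plaintext) if ch == "A"])  # ['L', 'O', 'E', 'N']
```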

Breaking polyalphabetic ciphers required further techniques. The first step is determining the key length, for example with the Kasiski examination, which looks for repeated sequences in the ciphertext; the spacing between repeats can reveal likely key lengths. The second step is performing frequency analysis separately on each "slice" of the ciphertext corresponding to a single position within the repeating key. Each slice was encrypted with a single, consistent substitution, so once the slices are correctly identified, each one is again vulnerable to standard frequency analysis.
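Both steps can be sketched briefly. The code below is a simplified take on the Kasiski idea (it only looks at repeated three-letter sequences and votes on divisors of their spacing), and all function names are hypothetical. The sample ciphertext is ATTACKEASTATTACKWEST encrypted with the key LEMON, chosen so that the repeated word lines up with the key.

```python
from collections import Counter, defaultdict


def repeat_distances(ciphertext: str, seq_len: int = 3) -> list[int]:
    """Distances between consecutive occurrences of each repeated sequence."""
    positions = defaultdict(list)
    for i in range(len(ciphertext) - seq_len + 1):
        positions[ciphertext[i:i + seq_len]].append(i)
    distances = []
    for found in positions.values():
        distances.extend(later - earlier for earlier, later in zip(found, found[1:]))
    return distances


def candidate_key_lengths(distances: list[int], max_len: int = 20) -> list[tuple[int, int]]:
    """Count how many distances each possible key length divides evenly."""
    votes = Counter()
    for distance in distances:
        for length in range(2, max_len + 1):
            if distance % length == 0:
                votes[length] += 1
    return votes.most_common()


def split_into_slices(ciphertext: str, key_length: int) -> list[str]:
    """Slice i holds every letter that was encrypted with key letter i."""
    return [ciphertext[i::key_length] for i in range(key_length)]


ciphertext = "LXFOPVIMGGLXFOPVAQGG"
distances = repeat_distances(ciphertext)
print(distances)                         # [10, 10, 10, 10]
print(candidate_key_lengths(distances))  # 2, 5 and 10 each divide all four distances
print(split_into_slices(ciphertext, 5))
```

As the output suggests, the spacing usually yields several candidate lengths rather than one answer. Each candidate is then tested by running ordinary single-alphabet frequency analysis on its slices, which needs far more ciphertext than this toy example provides.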

Modern cryptography (AES, RSA, and the other algorithms covered conceptually in previous articles on this site, in the context of hashing) is designed specifically not to exhibit any of these statistical patterns. Modern ciphers aim for ciphertext that is computationally indistinguishable from random data, regardless of what the plaintext was. Frequency analysis, and the broader category of classical cryptanalysis techniques it represents, has essentially no applicability to correctly implemented modern symmetric or asymmetric encryption. This is a historical and educational topic, not a practical attack against modern systems.

Frequency analysis beyond cryptanalysis

The broader concept — analyzing the frequency distribution of symbols in a dataset, and comparing it against an expected distribution — has applications well beyond cryptanalysis:

Detecting the language of unknown text: different languages have different characteristic letter-frequency distributions, so comparing an unknown text's distribution against known distributions for various languages can help identify which language it is in. This works as a simple, fast heuristic in, for example, automated language detection in text-processing pipelines, though modern language detection typically uses more sophisticated techniques than single-letter frequency alone, often incorporating word-level and n-gram-level statistics.
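A minimal sketch of this heuristic builds a letter profile for each reference text and picks the closest one. The one-sentence references here are toy stand-ins (real profiles would come from large corpora), and the function names and the choice of distance measure (sum of absolute differences) are assumptions made for the example.

```python
from collections import Counter


def profile(text: str) -> dict[str, float]:
    """Relative frequency of each alphabetic character, case-insensitive."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    return {ch: count / total for ch, count in Counter(letters).items()}


def distance(p: dict[str, float], q: dict[str, float]) -> float:
    """Sum of absolute differences between two profiles (0 means identical)."""
    return sum(abs(p.get(ch, 0.0) - q.get(ch, 0.0)) for ch in set(p) | set(q))


def closest_language(text: str, references: dict[str, str]) -> str:
    """Return the label of the reference text whose profile is nearest to `text`."""
    target = profile(text)
    return min(references, key=lambda lang: distance(target, profile(references[lang])))


# Toy reference samples; far too short for reliable results in practice.
references = {
    "english": "the children were watching the rain from the window all afternoon",
    "spanish": "los niños miraban la lluvia desde la ventana durante toda la tarde",
}

print(closest_language("where are the children this morning", references))
```

With samples this small the result can easily be wrong, which is one reason production systems rely on larger corpora and n-gram statistics.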

Anomaly detection in data: in various data-quality contexts, an unexpected shift in character (or word, or value) frequency distributions, compared to a historical baseline, can signal a change worth investigating. For example, if a dataset of customer names suddenly shows a very different distribution of characters than historically observed, this might indicate a data-import issue (the wrong character encoding being applied, as covered in previous articles on Unicode/mojibake) or a genuine shift in the underlying population (such as expansion into a new geographic market with different naming conventions). Frequency analysis serves as a diagnostic first step here, prompting further investigation into which of these (or other) explanations applies.
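One way to put a number on such a shift is the total variation distance between two character distributions, which ranges from 0 (identical) to 1 (no characters in common). The sketch below is an illustration under assumptions: the sample names, the helper names, and the alert threshold are all hypothetical, and a real threshold would need tuning on real data.

```python
from collections import Counter


def char_distribution(values: list[str]) -> dict[str, float]:
    """Relative frequency of every character across all strings in `values`."""
    counts = Counter(ch for value in values for ch in value)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: count / total for ch, count in counts.items()}


def drift(baseline: dict[str, float], current: dict[str, float]) -> float:
    """Total variation distance: 0.0 for identical distributions, 1.0 for disjoint."""
    chars = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline.get(ch, 0.0) - current.get(ch, 0.0)) for ch in chars)


baseline_names = ["José", "Zoë", "Renée", "Chloé"]
# The same names after UTF-8 bytes were wrongly decoded as Latin-1 (mojibake).
garbled_names = ["JosÃ©", "ZoÃ«", "RenÃ©e", "ChloÃ©"]

ALERT_THRESHOLD = 0.1  # hypothetical cutoff for this example

score = drift(char_distribution(baseline_names), char_distribution(garbled_names))
if score > ALERT_THRESHOLD:
    print(f"character distribution drifted by {score:.2f}; check the import encoding")
```

A high score only says that something changed; deciding whether it is an encoding fault or a genuine change in the data still requires looking at which characters moved.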

How to use the Character Frequency tool on sadiqbd.com

Analyze ciphertext (for educational/historical cryptanalysis exploration): paste suspected substitution-cipher ciphertext and compare its frequency distribution against known English letter frequencies — the most-frequent ciphertext character is a starting hypothesis for "E"
For language-detection-adjacent tasks: compare the character-frequency profile of a text sample against known profiles for different languages, as a simple heuristic
For data-quality monitoring: compare character-frequency distributions of a dataset over time — significant shifts can be a signal worth investigating, whether for encoding issues or genuine underlying data changes

Frequently Asked Questions

Can frequency analysis break modern encryption (AES, etc.)? No — as discussed, modern symmetric ciphers (AES) and asymmetric ciphers (RSA, elliptic-curve cryptography) are designed such that ciphertext exhibits no statistically-exploitable patterns related to the plaintext — frequency analysis (and classical cryptanalysis generally) has no applicability to correctly-implemented, correctly-used modern cryptography. This topic remains educationally valuable (illustrating fundamental concepts about information and redundancy in language) and historically significant, but isn't a practical concern for modern security.

Does frequency analysis work on languages other than English? Yes — the technique (comparing observed frequency distributions against known baseline distributions) applies to any language with a characteristic, non-uniform letter-frequency distribution — which describes most natural languages (the specific frequency rankings/values differ by language — e.g., "E" being the most common letter is an English-specific observation; other languages have their own most-common letters/characters, reflecting that language's own vocabulary and grammar).

Is the Character Frequency tool free? Yes — completely free, no sign-up required.

Try the Character Frequency tool free at sadiqbd.com — analyze the character distribution of any text instantly.
