Shannon Entropy and Character Frequency: The Information Theory Behind Text Analysis

Character frequency analysis connects directly to Shannon entropy, data compression, and information theory. Here's what the distribution of characters in text reveals about compressibility, password strength, Zipf's Law, and stylometric authorship analysis.

Shannon entropy is the number that tells you how much information is actually in a message

Claude Shannon's 1948 paper "A Mathematical Theory of Communication" introduced a precise, quantitative definition of information. Before Shannon, information was a fuzzy concept. After Shannon, it had a unit — the bit — and a formula that revealed something surprising: the information content of a message has nothing to do with its meaning, only with how unpredictable it is.

Character frequency analysis is one of the entry points to understanding Shannon entropy. The distribution of characters in a text determines its character-level entropy, which is a first estimate of how much information the text contains.

The entropy formula

Shannon entropy for a string is:

H = −∑ p(x) × log₂(p(x))

Where p(x) is the probability (relative frequency) of each character x, the sum runs over every distinct character in the string, and H is measured in bits per character.

What this produces:

A string of all identical characters: H = 0 bits (no uncertainty — you always know what comes next)
A string where all characters are equally probable: H = log₂(alphabet size) bits — maximum entropy
Natural language text: somewhere between these extremes, with H typically 3–5 bits per character
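To make the formula concrete, here is a minimal Python sketch that computes the empirical entropy of a string from its character counts. The function name shannon_entropy is just an illustrative choice, and the estimate only reflects single-character frequencies in the text you pass in.

```python
from collections import Counter
from math import log2


def shannon_entropy(text: str) -> float:
    """Empirical Shannon entropy of `text`, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # Sum of p * log2(1/p) over distinct characters,
    # which is the same as -sum(p * log2(p)).
    return sum((n / total) * log2(total / n) for n in counts.values())


print(shannon_entropy("aaaaaaaa"))  # all identical characters
print(shannon_entropy("abcdefgh"))  # eight equally likely characters
print(shannon_entropy("the quick brown fox jumps over the lazy dog"))
```

The first call gives 0 bits and the second gives log₂(8) = 3 bits, matching the two extremes listed above. Short strings give noisy estimates, so treat results on small inputs as rough.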

For English text:

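Individual letter entropy: ~4.0–4.5 bits per character (single-character frequencies only, ignoring context)
With longer-range structure (spelling, words, grammar) taken into account: roughly 0.6–1.3 bits per character (Shannon's 1951 estimate, based on how well human readers could predict the next letter)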

The gap between these numbers (4 bits vs 1 bit) quantifies how much redundancy English contains — how much of any given sentence could be predicted from context.
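Redundancy can be expressed as the fraction of the maximum possible entropy that a text does not use. The sketch below assumes a 27-symbol alphabet (26 letters plus space) and plugs in the rounded 4-bit and 1-bit figures from the paragraph above; both inputs are illustrative, not measurements.

```python
from math import log2


def redundancy(entropy_bits: float, alphabet_size: int) -> float:
    """Fraction of the maximum per-character entropy that goes unused."""
    return 1.0 - entropy_bits / log2(alphabet_size)


# Assumed alphabet: 26 letters plus space.
print(f"{redundancy(4.0, 27):.0%}")  # using single-character frequencies only
print(f"{redundancy(1.0, 27):.0%}")  # using the context-aware estimate
```

With these inputs the redundancy comes out at roughly 16% and 79% respectively, which is the sense in which most of an English sentence is predictable from context.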

What character frequency reveals about text compression

Compression tools such as gzip and brotli, and coding schemes such as Huffman coding, work by exploiting redundancy: the predictability in text that entropy measures.

Huffman coding assigns shorter bit sequences to frequent characters and longer sequences to rare ones. The most common character in English (space) might be encoded as 0 (1 bit). A rare character like z might be encoded as 11010110 (8 bits).

An optimal Huffman code for a text has an average code length within one bit per character of the Shannon entropy of its character distribution. If entropy is 4 bits per character and the text is stored at 8 bits per character, Huffman coding can shrink the file by roughly 50%, before counting the small overhead of storing the code table.
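A small sketch can show how close Huffman coding gets to the entropy. The code below builds Huffman code lengths with a priority queue and compares the average bits per character to the entropy of the same text. The function name huffman_code_lengths and the sample string are illustrative choices, and the sketch ignores the cost of storing the code table.

```python
import heapq
from collections import Counter
from math import log2


def huffman_code_lengths(text: str) -> dict[str, int]:
    """Huffman code length in bits for each distinct character of `text`."""
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single symbol still needs one bit per occurrence in practice.
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tie-breaker, {char: depth so far}).
    heap = [(n, i, {ch: 0}) for i, (ch, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every character one level deeper.
        merged = {ch: depth + 1 for ch, depth in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


text = "abracadabra"
counts = Counter(text)
lengths = huffman_code_lengths(text)
avg_bits = sum(counts[ch] * lengths[ch] for ch in counts) / len(text)
entropy = sum((n / len(text)) * log2(len(text) / n) for n in counts.values())
print(f"entropy = {entropy:.3f} bits/char")
print(f"huffman = {avg_bits:.3f} bits/char")
```

The Huffman average is never below the entropy and is always less than one bit above it, which is why entropy works as a quick estimate of what a character-level coder can achieve.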

Practical implication: running a character frequency analysis on a text gives you a rough sense of how compressible it is. A text where one character dominates (low entropy) compresses well. A text with nearly uniform character distribution (high entropy) compresses poorly.

This is why encrypted data (which should look random) doesn't compress — its entropy is near maximum. Trying to compress already-encrypted data often makes it slightly larger.
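You can see this effect with Python's standard zlib module. In the sketch below, os.urandom output stands in for well-encrypted data (an assumption for illustration), and the English sample is a single sentence repeated many times, so it is far more repetitive than real prose and compresses unusually well.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (below 1.0 means it shrank)."""
    return len(zlib.compress(data, 9)) / len(data)


sentence = "Character frequency analysis connects directly to Shannon entropy. "
english = (sentence * 200).encode("utf-8")
random_bytes = os.urandom(len(english))  # stand-in for encrypted data

print(f"repetitive English: {compression_ratio(english):.3f}")
print(f"random bytes:       {compression_ratio(random_bytes):.3f}")
```

The random input typically comes out marginally larger than it went in, because the compressor finds nothing to exploit and still has to add its own framing bytes.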

Information theory in password strength

Password strength is fundamentally an entropy problem. A password's entropy (in bits) is:

H = log₂(C^L) = L × log₂(C)

Where C is the character set size and L is the password length.

Password	Entropy
8 lowercase letters	8 × log₂(26) = 37.6 bits
8 mixed case + digits	8 × log₂(62) = 47.6 bits
12 mixed case + digits + symbols	12 × log₂(95) = 78.8 bits
4 random words from 7776-word list	4 × log₂(7776) = 51.7 bits
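The table rows can be recomputed with a few lines of Python. This is a sketch of the formula above, so it assumes every character (or word) is chosen uniformly at random; the function name password_entropy_bits is an illustrative choice.

```python
from math import log2


def password_entropy_bits(charset_size: int, length: int) -> float:
    """Entropy in bits of `length` symbols drawn uniformly from `charset_size`."""
    return length * log2(charset_size)


rows = [
    ("8 lowercase letters", 26, 8),
    ("8 mixed case + digits", 62, 8),
    ("12 mixed case + digits + symbols", 95, 12),
    ("4 random words from 7776-word list", 7776, 4),
]
for label, charset_size, length in rows:
    bits = password_entropy_bits(charset_size, length)
    print(f"{label}: {bits:.1f} bits")
```

For the word-list row, the "character set" is the list itself and the "length" is the number of words, which is why four words can rival eight random characters.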

But this calculation assumes the password characters are chosen randomly. Human-chosen "random" passwords have far less entropy than the formula suggests, because humans follow predictable patterns (capital first letter, number at end, common substitutions). Character frequency analysis of large password databases reveals this — the distribution is not uniform, meaning actual entropy is much lower than theoretical.
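To get a feel for the size of that gap, consider a purely hypothetical pattern: an eight-letter word from a 10,000-word list, capitalised, followed by two digits. The list size and the pattern are assumptions made up for this illustration, not figures from any password dataset.

```python
from math import log2

# Hypothetical attacker model (illustrative assumption, not measured data):
# a capitalised word from a 10,000-word list followed by two digits.
WORDLIST_SIZE = 10_000
DIGIT_SUFFIXES = 100

# Entropy if the attacker knows the pattern: one choice of word, one of suffix.
pattern_bits = log2(WORDLIST_SIZE * DIGIT_SUFFIXES)

# The same 10-character password scored naively as random mixed case + digits.
naive_bits = 10 * log2(62)

print(f"pattern-aware estimate: {pattern_bits:.1f} bits")
print(f"naive formula:          {naive_bits:.1f} bits")
```

Under these assumptions the pattern-aware estimate is about 20 bits, against nearly 60 bits from the naive formula.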

Zipf's Law and natural language character/word distribution

Natural language has a remarkable property known as Zipf's Law: the frequency of any word is roughly inversely proportional to its rank in a frequency table. The most common word appears about twice as often as the second most common, three times as often as the third, and so on. Character frequencies are similarly skewed, although they follow a strict power law less closely than word frequencies do.

This power-law distribution is not unique to English — it appears in virtually every analysed natural language, in network degree distributions, in city population sizes, and in many other complex systems.
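One simple way to check for a Zipf-like pattern is to multiply each word's rank by its count: under an ideal inverse-rank law the product stays roughly constant. The sketch below is an illustration with a deliberately artificial sample; the function name rank_frequency and the tokenising regex are assumptions, and real text needs thousands of words before the pattern becomes visible.

```python
import re
from collections import Counter


def rank_frequency(text: str, top: int = 10) -> list[tuple[int, str, int, int]]:
    """Return (rank, word, count, rank * count) for the most common words."""
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common(top)
    return [
        (rank, word, count, rank * count)
        for rank, (word, count) in enumerate(ranked, start=1)
    ]


# Artificial sample built to follow the 1, 1/2, 1/3 pattern exactly.
sample = "the " * 6 + "of " * 3 + "and " * 2
for rank, word, count, product in rank_frequency(sample):
    print(rank, word, count, product)
```

On real text the products drift rather than matching exactly, so look for the same order of magnitude across ranks rather than equality.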

What this means for character frequency analysis:

Character frequency distributions in natural language are highly predictable across documents
A text that dramatically deviates from the typical distribution may be synthetic, encoded, or highly domain-specific
The characteristic "E-T-A-O-I-N" ordering in English is robust enough to form the basis of classical cipher attacks
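As an illustration of the cipher-attack point, the sketch below recovers a Caesar shift by trying all 26 shifts and keeping the one whose letter distribution is closest to English under a chi-squared score. The frequency table holds commonly cited approximate percentages, and the function names are illustrative; this is one simple approach, not the only way such attacks are done.

```python
import string
from collections import Counter

# Approximate English letter frequencies in percent (commonly cited values).
ENGLISH_FREQ = {
    "a": 8.17, "b": 1.49, "c": 2.78, "d": 4.25, "e": 12.70, "f": 2.23,
    "g": 2.02, "h": 6.09, "i": 6.97, "j": 0.15, "k": 0.77, "l": 4.03,
    "m": 2.41, "n": 6.75, "o": 7.51, "p": 1.93, "q": 0.10, "r": 5.99,
    "s": 6.33, "t": 9.06, "u": 2.76, "v": 0.98, "w": 2.36, "x": 0.15,
    "y": 1.97, "z": 0.07,
}


def caesar_shift(text: str, shift: int) -> str:
    """Shift ASCII letters by `shift` positions, leaving other characters alone."""
    out = []
    for c in text:
        if c in string.ascii_lowercase:
            out.append(chr((ord(c) - ord("a") + shift) % 26 + ord("a")))
        elif c in string.ascii_uppercase:
            out.append(chr((ord(c) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(c)
    return "".join(out)


def chi_squared(text: str) -> float:
    """Chi-squared distance between the text's letter counts and English."""
    letters = [c for c in text.lower() if c in ENGLISH_FREQ]
    if not letters:
        return float("inf")
    counts = Counter(letters)
    n = len(letters)
    return sum(
        (counts.get(ch, 0) - n * pct / 100) ** 2 / (n * pct / 100)
        for ch, pct in ENGLISH_FREQ.items()
    )


def crack_caesar(ciphertext: str) -> tuple[int, str]:
    """Return the most English-looking (shift, plaintext) candidate."""
    best = min(range(26), key=lambda s: chi_squared(caesar_shift(ciphertext, -s)))
    return best, caesar_shift(ciphertext, -best)


plaintext = (
    "it was the best of times it was the worst of times it was the age of "
    "wisdom it was the age of foolishness it was the epoch of belief"
)
shift, recovered = crack_caesar(caesar_shift(plaintext, 7))
print(shift, recovered)
```

The approach needs enough ciphertext for the letter counts to settle; on very short messages the best-scoring shift can be wrong.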

Using character frequency for stylometric analysis

Authors have measurable stylometric signatures — consistent patterns in word length, sentence structure, punctuation frequency, and character frequency that persist across their writing.

Character-level stylometry (using character frequency distributions as features) has been used in:

Authorship attribution in legal and historical contexts
Identifying AI-generated vs. human-written text
Plagiarism detection at the stylistic level
Cross-language authorship analysis

A character frequency distribution is a basic but surprisingly informative feature vector for any piece of text.
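Here is a minimal sketch of what "character frequencies as a feature vector" can look like in practice: each text becomes a fixed-length vector of relative frequencies, and two texts are compared with cosine similarity. The chosen alphabet, the function names, and the use of cosine similarity are all assumptions for illustration; real stylometry systems use richer features and proper statistical testing.

```python
import string
from collections import Counter
from math import sqrt

# Assumed feature set: lowercase letters plus common punctuation and space.
ALPHABET = string.ascii_lowercase + " .,;:!?'\"-"


def char_profile(text: str) -> list[float]:
    """Relative frequency of each ALPHABET character in `text`."""
    counts = Counter(c for c in text.lower() if c in ALPHABET)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(ALPHABET)
    return [counts[c] / total for c in ALPHABET]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


text_a = "Well, I never! Surely, surely, that cannot be right; can it?"
text_b = "The system logs each request and returns a status code"
print(cosine_similarity(char_profile(text_a), char_profile(text_b)))
```

Scores close to 1.0 mean similar character profiles. Because all English text shares a broadly similar distribution, differences between authors tend to be small and only become meaningful on long samples.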

How to use the Character Frequency tool on sadiqbd.com

Paste your text — the longer, the more statistically meaningful the distribution
Run the analysis — counts each character's frequency and percentage
Compare against known distributions — does the character distribution look like typical English? Like code? Like encrypted or random data?
Examine the top phrases (bigrams and trigrams) — phrase frequency reveals more semantic structure than individual character frequency
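If you want to reproduce the same kind of analysis locally, the steps above can be sketched in Python. This is not the tool's actual implementation, just a rough equivalent; the function names and the whitespace-based word splitting are assumptions.

```python
from collections import Counter


def char_percentages(text: str) -> list[tuple[str, int, float]]:
    """Return (character, count, percentage) sorted by descending count."""
    counts = Counter(text)
    total = len(text)
    return [(ch, n, 100 * n / total) for ch, n in counts.most_common()]


def word_ngrams(text: str, n: int) -> Counter:
    """Count phrases of `n` consecutive words (n=2 bigrams, n=3 trigrams)."""
    words = text.lower().split()
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


sample = "to be or not to be that is the question"
for ch, count, pct in char_percentages(sample)[:5]:
    print(repr(ch), count, f"{pct:.1f}%")
print(word_ngrams(sample, 2).most_common(3))
```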

Frequently Asked Questions

What's the entropy of a typical English paragraph? At the character level, English text has approximately 4–4.5 bits of entropy per character. Accounting for the predictability that comes from context (as Shannon measured), it's closer to 0.6–1.3 bits per character of "true" information. This high redundancy is why text compression works so well.

Can character frequency identify what language a text is written in? As a first-pass heuristic: yes. Different languages have distinctive frequency distributions. For example, é is frequent in French, ä/ö/ü appear in German, and ñ appears in Spanish alongside frequent n and a. Combined with other features, character frequency is one component of language identification algorithms.

Why do some data types look like random noise in character frequency analysis? Encrypted data, compressed data, and binary files in compressed formats (such as JPEG images or packed executables) have near-uniform character/byte distributions: high entropy, no dominant patterns. This is what you want from encryption (you can't extract information by frequency analysis) and is a side effect of compression (the output looks like entropy-maximised data).
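The difference shows up clearly if you measure entropy per byte rather than per character. In this sketch, os.urandom output stands in for encrypted or compressed data (an assumption for illustration), and the function name byte_entropy is an illustrative choice.

```python
import os
from collections import Counter
from math import log2


def byte_entropy(data: bytes) -> float:
    """Empirical entropy of `data` in bits per byte (maximum is 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return sum((n / total) * log2(total / n) for n in counts.values())


english = (
    b"it was the best of times it was the worst of times it was the age of "
    b"wisdom it was the age of foolishness it was the epoch of belief"
)
random_like = os.urandom(65536)  # stand-in for encrypted or compressed data

print(f"English text: {byte_entropy(english):.2f} bits/byte")
print(f"random bytes: {byte_entropy(random_like):.2f} bits/byte")
```

Values very close to 8 bits per byte are a common hint that a file is encrypted or already compressed, though the measure cannot tell those two cases apart.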

Is the Character Frequency tool free? Yes — completely free, no sign-up required.

Character frequency is a surprisingly rich lens on text. The distribution reveals compressibility, hints at the language, provides stylometric features, and connects directly to Shannon's foundational theory of information.

Try the Character Frequency tool free at sadiqbd.com — analyse how often every character appears in any text, with frequency percentages and distribution charts.