Character Frequency and NLP Foundations: From Zipf's Law to Word Embeddings

Word frequency analysis underlies search engines, compression algorithms, and how large language models learn. This article covers Zipf's Law, TF-IDF for meaningful keyword extraction, how word embeddings emerge from co-occurrence statistics, and why the frequency distributions you can measure yourself are the same statistical foundation that GPT models learn from.

Word frequency analysis is the foundation of natural language processing — and understanding it explains how large language models work

Modern NLP systems, from search engines to large language models, rely on statistical relationships between words and characters. Character frequency analysis is where this story begins: the insight that language has statistical structure — that some words and letter combinations appear far more often than others — is the foundation on which text analysis, compression, machine translation, and neural language models are all built.

Word frequency in language: Zipf's Law

In the 1930s, linguist George Zipf observed that word frequency in any large corpus of natural language follows a remarkably consistent pattern. The most common word appears approximately twice as often as the second most common, three times as often as the third, and so on.

This relationship, in which a word's frequency is inversely proportional to its frequency rank (f(r) ∝ 1/r), is known as Zipf's Law. It holds approximately across languages, authors, topics, and time periods.

English word frequency (approximate, based on large corpora):

"the": ~7% of all words
"of": ~3.5%
"and": ~3%
"to": ~2.5%

The top 100 most common English words account for roughly 50% of all text. The top 1,000 words account for about 85%.

This has practical implications: any NLP system that learns from text will see the word "the" many more times than "quantum" — and will therefore develop much stronger statistical representations of "the."
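The rank-frequency relationship described in this section can be checked on any text you have to hand. The sketch below is a minimal illustration, not a reference implementation: the function names `rank_frequency` and `zipf_prediction` are made up for this example, the tokeniser is a deliberately crude regular expression, and the sample sentence is far too short to show a real Zipf curve.

```python
import re
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, count, share) tuples, most frequent word first."""
    # Crude tokeniser (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return [
        (rank, word, count, count / total)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]

def zipf_prediction(top_count, rank):
    """Count expected at `rank` if frequency is proportional to 1 / rank."""
    return top_count / rank

sample = "the cat and the dog and the bird saw the fish"
table = rank_frequency(sample)
for rank, word, count, share in table[:3]:
    predicted = zipf_prediction(table[0][2], rank)
    print(rank, word, count, round(share, 3), round(predicted, 2))
```

On a large corpus, the observed count and the predicted count should stay roughly close for the first few hundred ranks. On a short sample like this one, expect the fit to be poor.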

TF-IDF: from raw frequency to meaningful signal

Raw term frequency is a poor measure of a word's importance to a document. "The," "is," and "a" have high frequency in nearly every document — they tell you nothing about what the document is about.

TF-IDF (Term Frequency-Inverse Document Frequency) corrects this by weighting frequency against how common a term is across all documents in a corpus:

TF (term frequency): how often the term appears in this document.

IDF (inverse document frequency): IDF(t) = log(N / df(t))

Where N is the total number of documents and df(t) is the number of documents containing term t.

A term that appears in every document gets an IDF of exactly zero (log(N/N) = log 1 = 0). A rare term that appears in only a few documents gets a high IDF.

TF-IDF = TF × IDF

Words like "the" have high TF but near-zero IDF, so their TF-IDF score is near zero. Domain-specific terms have moderate TF but high IDF, so they score high. These high-scoring terms represent the document's distinctive content.
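To make the formula concrete, here is a small from-scratch sketch of TF × IDF using the IDF definition given above. Two choices are assumptions of this example rather than part of the definition: TF is normalised by document length, and tokenisation is a plain whitespace split. The function name `tf_idf` and the three toy documents are also invented for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document, using TF x log(N / df)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # df(t): number of documents that contain term t at least once.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        tf = Counter(tokens)
        scores.append({
            # TF is the term's share of this document (an assumed normalisation).
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the quantum computer ran the algorithm",
]
scores = tf_idf(docs)

# Terms of the third document, highest score first.
print(sorted(scores[2].items(), key=lambda item: item[1], reverse=True))
```

Because "the" occurs in all three documents, its score is zero everywhere, while a term such as "quantum" that occurs in only one document scores highest within that document.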

Search engines use TF-IDF variants (or learned representations that implicitly capture similar information) to match documents to queries.

Word embeddings: from frequency to meaning

Word2Vec (2013, Google) showed that by training a neural network to predict words from their context (or context from words), you learn vector representations of words where semantic relationships appear as geometric relationships in the vector space.

The famous example: king - man + woman ≈ queen

This works because of co-occurrence statistics: "king" and "queen" appear in similar contexts (near words like "reign," "throne," "crown") but differ in the contexts they share with "man" and "woman." Statistical analysis of which words appear near which other words produces representations that capture semantic relationships.
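The co-occurrence idea can be illustrated without any neural network. The sketch below is not Word2Vec: it simply counts which words appear within a small window of each other and compares the resulting count vectors with cosine similarity. The function names, the window size of 2, and the toy sentences are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

sentences = [
    "the king sat on the throne",
    "the queen sat on the throne",
    "the king wore the crown",
    "the queen wore the crown",
    "the dog chased the ball",
]
vectors = cooccurrence_vectors(sentences)

print(cosine(vectors["king"], vectors["queen"]))
print(cosine(vectors["king"], vectors["dog"]))
```

In this toy corpus "king" and "queen" occur in identical contexts, so their vectors are more similar to each other than either is to "dog". Learned embeddings compress the same kind of signal into dense, low-dimensional vectors.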

Character frequency → word frequency → co-occurrence statistics → word embeddings

The path from basic character counting to the word representations inside GPT-4 is a progression of increasingly sophisticated statistical analysis of the same underlying phenomenon: text has statistical structure.

Text preprocessing in machine learning

Before training any NLP model, text must be preprocessed — and character/word frequency analysis informs several preprocessing decisions:

Tokenisation: splitting text into tokens (words, subwords, or characters). Common words are typically kept as single tokens; rare words are split into subword pieces.

Vocabulary construction: NLP models have fixed vocabularies. Frequency analysis determines the vocabulary: the most frequent words/subwords are included; rare terms are handled as unknown or split into known subword pieces.

Stopword removal: high-frequency, low-information words ("the," "is," "a") are often removed before analysis. The choice of stopwords is determined by frequency analysis of the specific language and domain.

BPE (Byte Pair Encoding): the tokenisation method used by GPT models and many others. BPE starts with character-level tokens (byte-level in GPT-2 and later) and iteratively merges the most frequent adjacent token pair into a single new token until the vocabulary reaches a target size. The process is entirely driven by frequency statistics.
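The BPE merge loop is short enough to sketch directly. This is a simplified, character-level illustration rather than the tokeniser any particular model ships with: the names `merge_pair` and `learn_bpe` are hypothetical, the word counts are made up, and ties between equally frequent pairs are resolved by whichever pair was seen first.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merges from a {word: count} mapping."""
    # Start from single characters (real GPT tokenisers start from bytes).
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single token
        # Most frequent adjacent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): count for symbols, count in vocab.items()}
    return merges, vocab

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}  # made-up counts
merges, vocab = learn_bpe(corpus, num_merges=2)
print(merges)
print(vocab)
```

After two merges on this corpus the frequent suffix "est" has become a single token, while the rarer word "lower" is still spelled out character by character.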

Frequency analysis for text analytics

Character and word frequency analysis has practical applications beyond NLP training:

Keyword extraction: TF-IDF on a document corpus extracts the terms most distinctive to each document.

Language detection: character frequency distributions are highly distinctive across languages. "e" is the most common letter in English, while "о" is the most common letter in Russian. Language detection systems use character n-gram frequencies as a primary signal.

Authorship attribution: different authors have measurably different function word frequencies. John Burrows' "Delta" method uses the most common words' frequencies to distinguish authors — a technique applied to disputed historical texts.

Compression: Huffman coding assigns shorter bit sequences to more frequent characters, so a Huffman-compressed text file is essentially a frequency-based encoding. DEFLATE (used in gzip and ZIP) combines Huffman coding with LZ77 matching of repeated strings.
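As an illustration of the compression point above, the following sketch builds a Huffman code from character counts using Python's `heapq`. It is a teaching example only: the function name `huffman_codes` is made up, the codes are returned as strings of "0" and "1" rather than packed bits, and a real compressor would also need to store the code table alongside the data.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Return a {character: bit string} prefix code built from character counts."""
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single distinct character still needs one bit per occurrence.
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, tie_breaker, {char: code_so_far}).
    # The integer tie_breaker keeps heapq from ever comparing the dicts.
    heap = [(weight, i, {char: ""}) for i, (char, weight) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # lightest subtree
        w2, _, right = heapq.heappop(heap)  # second lightest subtree
        merged = {c: "0" + code for c, code in left.items()}
        merged.update({c: "1" + code for c, code in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

text = "aaaaaaaabbbc"
codes = huffman_codes(text)
encoded_bits = sum(len(codes[c]) for c in text)
print(codes, encoded_bits, "bits versus", 8 * len(text), "bits at one byte per character")
```

The most frequent character ends up with the shortest code, which is where the size reduction comes from.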

How to use the Character Frequency tool on sadiqbd.com

Paste any text — article, code, document, or any text sample
See frequency distribution — which characters or words appear most and least often
Apply to text analysis:
- Compare frequency distributions across documents to identify distinctive vocabulary
- Check for unusual character distributions that might indicate encoding issues
- Analyse keyword density from a frequency perspective
Use for language learning — frequency analysis of a target-language text reveals which words are most worth prioritising for vocabulary learning
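If you want to reproduce the basic count locally, a few lines of Python are enough. This is a rough stand-in, not the implementation behind the tool on sadiqbd.com: the function name `character_frequencies` and the `ignore_whitespace` option are assumptions made for this example.

```python
from collections import Counter

def character_frequencies(text, ignore_whitespace=True):
    """Return (character, count, percentage) rows, most frequent first."""
    chars = [c for c in text if not (ignore_whitespace and c.isspace())]
    counts = Counter(chars)
    total = len(chars)
    return [(c, n, 100 * n / total) for c, n in counts.most_common()]

rows = character_frequencies("hello world")
for char, count, percent in rows[:3]:
    print(repr(char), count, f"{percent:.1f}%")
```

Swapping the character list for `text.lower().split()` gives the equivalent word-level count.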

Frequently Asked Questions

How are large language models different from statistical frequency models? LLMs learn much richer representations than simple frequency statistics. They capture contextual meaning (the same word in different contexts has different representations), long-range dependencies, and complex semantic relationships. But their training begins with the same foundation: processing enormous text corpora and learning statistical patterns — including, at the lowest level, character and word frequencies.

Why does English text compress more efficiently than some other languages? Character frequency distribution affects compression efficiency. English has highly unequal character frequencies (e, t, a, o dominate), and the more skewed a distribution is, the fewer bits per character a Huffman-style coder needs on average. Text with a more uniform character frequency distribution compresses less efficiently under the same kind of character-level coding.

Is the Character Frequency tool free? Yes — completely free, no sign-up required.

Character and word frequency analysis is simultaneously a simple tool for understanding text composition and the foundation of one of the most powerful technology paradigms of the past decade. The statistical structure of language that makes frequency analysis useful is the same structure that large language models learn to capture at higher levels of abstraction.

Try the Character Frequency tool free at sadiqbd.com — count how often every character and word appears in any text, instantly.