Fuzzy Deduplication and Record Linkage: When Exact Matching Isn't Enough

Exact deduplication handles perfect matches — but real data has name variations, address inconsistencies, and multi-source formatting differences. Here's fuzzy matching, the record linkage workflow, edit distance, Soundex, and SQL and Python approaches for production-quality deduplication.

Exact deduplication is the easy case — real data requires fuzzy matching

Removing exact duplicate lines from a list is solved by a single tool call. The harder problem is deduplication when records that represent the same entity are not textually identical: "John Smith" and "John R. Smith," "IBM" and "International Business Machines," "42 Main St" and "42 Main Street, Apt 2B."

This is the record linkage problem, and it sits at the intersection of data quality, probabilistic matching, and machine learning. Understanding both the exact and fuzzy cases makes deduplication a proper data engineering skill rather than just a cleanup operation.

When exact deduplication is sufficient

Exact deduplication — remove any line that appears more than once — works when:

Your data is already structured and standardised (machine-generated IDs, normalised email addresses, canonical product SKUs)
You're working with clean, consistent data exports from a single source system
The identifying field is an exact-match key (user ID, UUID, email address)

For these cases, the Remove Duplicate Lines tool handles it instantly. Paste your list, get back a deduplicated list.

When exact deduplication fails: the messiness of real data

Human-entered or multi-source data rarely matches exactly even when representing the same entity:

Name variations:

"Alexander Johnson" vs. "Alex Johnson" vs. "Alexander R. Johnson"
"Dr. Sarah Williams" vs. "Sarah Williams"
"Xiaoming Wang" vs. "Wang Xiaoming" (different name order conventions)

Address variations:

"150 King's Road, Chelsea, London SW3 5XP"
"150 Kings Road London SW3 5XP"
"150 Kings Rd, SW3 5XP"

Company name variations:

"Apple Inc." vs. "Apple, Inc." vs. "Apple Incorporated"
"McKinsey & Company" vs. "McKinsey and Company" vs. "McKinsey"

Phone number variations: covered in the data cleaning post — formatting inconsistencies that require normalisation before any matching.

Fuzzy matching techniques

Edit distance (Levenshtein distance)

Counts the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another.

"kitten" → "sitting": distance = 3 (substitute k→s, substitute e→i, insert g at end)

Threshold-based matching: if Levenshtein distance ≤ N, treat as a match. What value of N to use depends on the expected variation in your data.

Jaro-Winkler similarity

Designed for short strings (particularly names). Accounts for character transposition and gives extra weight to matching prefixes. Scores from 0 (no similarity) to 1 (identical).

"Johnathan" and "Jonathan": Jaro-Winkler ~0.96 (very similar) "John Smith" and "Jon Smythe": ~0.85 (similar) "John Smith" and "Jane Brown": ~0.60 (less similar)

Token-based similarity (Jaccard, cosine)

Splits strings into tokens (words or n-grams) and compares the sets. Useful when word order varies or when parts of a name/address are swapped.

"International Business Machines" and "Business Machines International": token sets are identical → Jaccard similarity = 1.0 despite different word order.

Soundex and phonetic algorithms

Maps words to their phonetic representation — words that sound the same produce the same code.

"Smith" → S530 "Smyth" → S530 "Smythe" → S530

All three sound the same and produce the same Soundex code. Useful for name matching across spelling variations.

The record linkage workflow

For merging two datasets where the same entity may appear with variations:

1. Standardisation / normalisation

Before any comparison:

Lowercase everything
Remove extra whitespace
Normalise punctuation (remove commas, periods, hyphens)
Expand abbreviations where possible (St → Street, Rd → Road)
For names: parse into components (first name, last name, middle initial)

2. Blocking (candidate generation)

Comparing every record to every other record is O(n²) — for 100,000 records, that's 10 billion comparisons. Blocking reduces this by only comparing records that share a key characteristic (same postcode, same first three characters of last name, same phone area code).

3. Comparison

Apply similarity functions to candidate pairs. Often multiple fields:

Last name: Jaro-Winkler
First name: Jaro-Winkler
Phone: exact after normalisation
Email: exact after lowercasing

4. Classification

Combine similarity scores into a match/no-match decision. Approaches:

Threshold-based: if weighted score exceeds threshold, classify as match
Probabilistic (Fellegi-Sunter model): mathematical framework for estimating probability of true match given observed field agreements
Machine learning: train a classifier on labelled examples of known matches and non-matches

5. Review of uncertain cases

Pairs near the decision boundary are flagged for human review. Automation handles clear matches and clear non-matches; humans resolve edge cases.

Deduplication in SQL

For exact database-level deduplication:

-- Find duplicate email addresses
SELECT email, COUNT(*) as count
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

-- Keep most recent record for each email, delete others
DELETE FROM users
WHERE id NOT IN (
  SELECT MAX(id)
  FROM users
  GROUP BY email
);

-- Deduplicate into a new table
CREATE TABLE users_deduped AS
SELECT DISTINCT ON (email) *
FROM users
ORDER BY email, created_at DESC;

Python libraries for fuzzy deduplication

thefuzz (formerly fuzzywuzzy): Levenshtein distance and ratio-based matching:

from thefuzz import fuzz
fuzz.ratio("John Smith", "Jon Smith")  # 91

dedupe: probabilistic machine learning deduplication:

import dedupe
# Train on labelled examples, then classify unseen pairs

recordlinkage: comprehensive record linkage toolkit:

import recordlinkage
indexer = recordlinkage.Index()
indexer.block('surname')  # block on surname

How to use the Remove Duplicate Lines tool on sadiqbd.com

For exact deduplication of simple lists:

Paste the list — email addresses, IDs, keywords, URLs
Configure options — case sensitivity, whitespace handling
Remove duplicates — instant clean list

For fuzzy deduplication of complex data: the tool handles exact matching; the Python libraries above handle fuzzy matching programmatically.

Frequently Asked Questions

What's the best free library for fuzzy deduplication? For Python: thefuzz for simple similarity scoring, dedupe for production-quality probabilistic record linkage. For R: RecordLinkage package. For SQL without external tools: the pg_trgm extension in PostgreSQL provides trigram-based similarity matching.

At what scale does deduplication become computationally expensive? Exact deduplication scales linearly — 10M records deduplicate quickly. Fuzzy deduplication scales quadratically without blocking — 100K records with naive all-pairs comparison becomes impractical without blocking or indexing strategies.

Is the Remove Duplicate Lines tool free? Yes — completely free, no sign-up required.

Exact deduplication handles the clean case. The real world is messier, and the record linkage problem — matching entities across imperfect representations — is where data quality work actually lives.

Try the Remove Duplicate Lines tool free at sadiqbd.com — deduplicate any list instantly with case and whitespace options.