Regex for Data Cleaning: Practical Patterns for Messy Real-World Data
Real-world data — phone numbers, dates, emails, log entries, product codes — arrives inconsistently. Here are the regex patterns for the most common data cleaning tasks: phone normalisation, date standardisation, HTML stripping, whitespace cleaning, and log redaction.
By sadiqbd · June 9, 2026
Messy real-world data has patterns — and regex with find & replace can fix most of them
Clean data is a myth. Production systems accumulate phone numbers in five different formats, dates written three different ways, email addresses with trailing spaces, product codes mixed with descriptions, log entries that need to be parsed. Regular expressions combined with find and replace are the fastest way to normalise this kind of inconsistent data without writing a full parsing script.
This is a practical guide to the regex patterns that handle the most common real-world data cleaning tasks.
Phone number normalisation
Phone numbers arrive in every format imaginable:
+1 (415) 555-0123
415.555.0123
4155550123
+14155550123
(415)555-0123
Step 1: Strip everything except digits and leading +
Find (regex): [^\d+]
Replace: `` (empty)
Result: 14155550123, 4155550123, 14155550123, 14155550123, 4155550123
Step 2: Normalise to E.164 format (for US numbers)
Find: ^(\+?1)?(\d{10})$
Replace: +1$2
Now all valid US numbers are +1XXXXXXXXXX.
For UK numbers, similar approach — strip formatting, ensure leading +44.
Date format standardisation
Dates arrive in multiple formats across datasets:
15/06/2024
06/15/2024
2024-06-15
June 15, 2024
15-Jun-2024
DD/MM/YYYY to ISO 8601 (YYYY-MM-DD):
Find: (\d{2})/(\d{2})/(\d{4})
Replace: $3-$2-$1
US MM/DD/YYYY to ISO 8601:
Find: (\d{2})/(\d{2})/(\d{4})
Replace: $3-$1-$2
Note: you need to know which format the source uses before choosing the replacement — the pattern matches both. Context or validation is required to disambiguate.
Stripping timestamp from datetime (keeping date only):
Find: T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?
Replace: `` (empty)
Turns 2024-06-15T14:30:00Z into 2024-06-15.
Stripping HTML tags
Log files, CMS exports, and scraped data often contain HTML markup mixed with text content.
Remove all HTML tags:
Find (regex): <[^>]+>
Replace: `` (empty)
<p>Hello, <strong>world</strong>!</p> → Hello, world!
Remove specific tags only (e.g., just <span> tags):
Find: </?span[^>]*>
Replace: `` (empty)
Collapse multiple spaces left by tag removal:
Find: {2,} (two or more spaces)
Replace: (single space)
Email address validation and extraction
Find all email addresses in a body of text:
Find: [a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}
This returns all matches. In a find-and-replace context, you can replace with [REDACTED] to anonymise email addresses in logs:
Replace: [REDACTED]
Find addresses with specific domains:
Find: \b[\w.+-]+@example\.com\b
Normalising whitespace in CSV/TSV data
Data exported from spreadsheets often has leading/trailing spaces in cells, multiple spaces, and inconsistent line endings.
Trim leading spaces from start of each line:
Find: ^ +
Replace: `` (empty)
Trim trailing spaces from end of each line:
Find: +$
Replace: `` (empty)
Collapse multiple consecutive spaces:
Find: {2,}
Replace: (single space)
Normalise Windows line endings to Unix:
Find: \r\n
Replace: \n
Extracting structured data from log files
Server log entries often have a fixed format that can be parsed with regex:
192.168.1.1 - - [10/Jun/2024:14:23:45 +0000] "GET /page HTTP/1.1" 200 1234
Extract just the IP address:
Find: ^(\d+\.\d+\.\d+\.\d+).*
Replace: $1
Extract status codes (third field after the quoted request):
Find: ^.+ " +\d{3}
This identifies the status code in context — can be used to filter for specific codes.
Redact IPs in logs (replace with X.X.X.X):
Find: \b(?:\d{1,3}\.){3}\d{1,3}\b
Replace: X.X.X.X
Product code cleaning
E-commerce data often mixes product codes with descriptions or formatting inconsistencies:
SKU-12345-BLU
SKU_12345_BLU
sku12345blu
12345-BLU
Normalise to uppercase hyphenated SKU format:
Step 1: Replace underscores and spaces with hyphens:
Find: [_ ]+
Replace: -
Step 2: Uppercase (depends on tool — some have an uppercase replacement modifier \U$0)
Step 3: Remove leading "SKU-" if inconsistently present:
Find: ^SKU-
Replace: `` (empty if standardising to numbers only)
Credit card number redaction
Redact VISA/Mastercard numbers (leaving last 4 digits):
Find: \b(?:\d{4}[- ]?){3}(\d{4})\b
Replace: ****-****-****-$1
This preserves the last 4 digits for reference while masking the rest.
How to use the Find & Replace tool on sadiqbd.com
- Paste your data — the messy text or dataset
- Enable regex mode — for pattern-based matching
- Enter the find pattern — from the examples above, adapted to your data
- Enter the replacement — with capture group references (
$1,$2) where needed - Preview — most tools show matched text before committing
- Apply — get the cleaned output
- Chain operations — paste the result back and run additional passes for multi-step cleaning
Frequently Asked Questions
Can I use these patterns directly in Python or JavaScript?
Mostly yes, with minor syntax differences. Python uses \1 or \g<1> for backreferences; JavaScript uses $1. The patterns themselves (character classes, quantifiers, anchors) are nearly identical across PCRE, Python re, and JavaScript regex.
What if the regex matches too much or too little?
Use the test input: try the pattern on a small sample first including edge cases. If it matches too broadly, add word boundaries (\b), anchors (^, $), or negative lookaheads. If it matches too narrowly, relax the quantifiers or remove overly specific character requirements.
Is the Find & Replace tool free? Yes — completely free, no sign-up required.
Real-world data cleaning is repetitive, time-consuming, and error-prone when done manually. These regex patterns eliminate the manual work for the most common normalisation tasks — and the find & replace tool lets you apply them without writing code.
Try the Find & Replace tool free at sadiqbd.com — apply regex patterns to any text for data cleaning, normalisation, and extraction.