Catastrophic Backtracking and ReDoS: When Regex Patterns Hang on Real Data

A find-and-replace that works instantly on a 100-character test string can take minutes (or never finish) on a 10,000-character real document — not because the engine is slow, but because certain regex patterns cause processing time to grow exponentially with input length. Here's how "catastrophic backtracking" works with patterns like (a+)+b, why it's invisible on short test inputs, why it's also a recognized security vulnerability (ReDoS), and how to recognize vulnerable pattern structures.

A find-and-replace operation that works instantly on a 100-character test string can take minutes (or effectively never finish) on a 10,000-character real document — not because the regex engine is slow, but because certain regex patterns cause the engine's processing time to grow exponentially with input length

The previous articles on this site covered regex patterns for data cleaning, spreadsheet regex functions, and capture groups/multi-cursor editing. This article addresses catastrophic backtracking — a regex performance phenomenon that's invisible on small test inputs but can make a seemingly simple pattern effectively hang on larger, real-world text — and is also the basis of a recognized security concern (ReDoS).

How regex engines "search" via backtracking

Most regex engines (the kind used in find-and-replace tools, programming languages, etc.) use a backtracking search strategy: when matching a pattern against text, the engine tries a particular interpretation of the pattern — and, if that interpretation fails to produce a match, backtracks (undoes some of its choices) and tries a different interpretation.

For most patterns, against most text, this backtracking is fast — the engine finds a match (or determines there's no match) quickly, trying relatively few alternative interpretations before succeeding or giving up.

The problem arises with certain pattern structures, combined with certain input structures — where the number of alternative interpretations the engine might need to try grows exponentially with the length of the relevant portion of the input — turning a pattern that matches (or fails to match) instantly on short input into one that takes prohibitively long on longer input.

The classic pattern: nested/adjacent quantifiers on overlapping character sets

A commonly-cited example pattern: (a+)+b — matching "one or more "a" characters, repeated one or more times, followed by "b"."

Against input like "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!" (many "a"s, followed by a character that's NOT "b" — causing the overall pattern to fail to match, since there's no "b" anywhere) — the regex engine, before concluding "no match," tries an enormous number of ways of grouping the "a"s between the outer + and the inner + — because (a+)+ can match the same sequence of "a"s in many different "groupings" (one group of all the "a"s; two groups; three groups; ... all the way down to "each "a" is its own group") — and, for each such grouping, the engine checks "does the next character match "b"?" — failing each time (since the next character is "!"), forcing the engine to try the next grouping.

*The number of possible groupings of N "a"s grows exponentially with N` — for a short string (say, 10 "a"s), the engine tries a manageable number of groupings (on the order of hundreds or low thousands) and quickly concludes "no match." For a longer string (say, 40 "a"s) — the number of groupings can reach into the billions or more — taking the engine a correspondingly enormous amount of time to exhaust all possibilities before concluding "no match."

Why this is invisible on small test inputs

*The exponential growth means: a pattern that "works fine" (returns instantly) on a test string of, say, 20 characters — could take seconds, minutes, or effectively forever on a string of 40-50 characters of the same, "problematic" structure — the transition from "instant" to "effectively hangs" can happen within a fairly narrow range of input lengths, due to the exponential (not linear or gradual) nature of the growth.

This is precisely why "it worked in my testing" doesn't guarantee "it'll work on real data" — test inputs (especially hand-crafted, short test cases) are often far shorter than real-world data (user-submitted text, log files, documents) — and "catastrophic" patterns are, almost by definition, fine on short inputs — the problem only manifests at lengths that might not be represented in typical, small-scale testing.

ReDoS: "Regular Expression Denial of Service"

The security implication: if user-supplied input is processed by a vulnerable regex pattern (a pattern susceptible to catastrophic backtracking, combined with input that triggers it) — an attacker could deliberately submit input crafted to trigger catastrophic backtracking — causing the processing server/application to consume excessive CPU time (potentially for a single request) — **a denial-of-service vector: "ReDoS" (Regular Expression Denial of Service).

This is a recognized, documented category of security vulnerability — appearing in vulnerability databases/advisories for various software over the years, where user-controllable input was processed by a vulnerable regex — *even a single request, containing a carefully-crafted string, could cause the processing server to "hang" for an extended period (or effectively indefinitely, for sufficiently "bad" combinations of pattern/input) — representing a genuine availability risk, not just a "slow" user experience.

Recognizing potentially-vulnerable pattern structures

Patterns susceptible to catastrophic backtracking generally share a structure: nested or adjacent quantifiers (+, *, {n,m}) applied to overlapping character sets or sub-patterns — meaning the same input characters could, in principle, be "consumed" by more than one of the quantified components, in multiple different ways — creating the "many possible groupings" situation described above.

Common "danger sign" patterns:

(a+)+, (a*)*, (a|a)* and similar — nested quantifiers on the same/overlapping content
(a+)(a+) — adjacent groups both matching the same character (the engine can try many different "splits" of a run of "a"s between the two groups)
Patterns with .* ... .* (multiple "match anything, any amount" segments) — particularly if combined with other quantified segments — can, depending on the specific overall pattern and the input, exhibit similar issues

**A general, if imprecise, heuristic: if a pattern contains multiple quantifiers, and it's unclear, just by reading the pattern, "could the same characters be matched by more than one of these quantified parts, in different ways?" — this ambiguity is exactly what creates the "many possible groupings" scenario — patterns where each quantified part is clearly, unambiguously responsible for a distinct, non-overlapping portion of the input (e.g., via mutually-exclusive character classes — [a-z]+[0-9]+, where letters and digits can't overlap) generally don't exhibit this problem, since there's no ambiguity in "which part matched which characters."

Mitigations: tools and approaches

Some regex engines impose a timeout — if matching exceeds a configured time limit, the operation is aborted (raising an error, rather than hanging indefinitely) — this limits the worst-case impact (preventing truly "indefinite" hangs) but doesn't prevent the underlying inefficiency — a request that "times out" after consuming significant CPU time for (say) 5 seconds, repeated across many requests, could still represent a meaningful resource-consumption issue, even if no single request "hangs forever."

Rewriting the pattern to eliminate the ambiguity (the overlapping-quantifier structure) is the more fundamental fix — often achievable by making character classes mutually-exclusive where possible, or restructuring the pattern to avoid nested/adjacent quantifiers over the same potential input.

Alternative regex engines — some regex implementations use non-backtracking algorithms (based on different theoretical approaches to regex matching) that don't exhibit catastrophic backtracking at all — guaranteeing linear-time (in input length) matching, regardless of pattern structure — though such engines typically don't support the full range of "regex" features that backtracking engines do (certain advanced features, like backreferences, generally require backtracking-style engines and aren't supported by linear-time alternatives) — representing a feature-vs-guaranteed-performance trade-off.

How to use the Find & Replace tool on sadiqbd.com

Test regex patterns against realistic-length input, not just short test strings — particularly for patterns involving multiple quantifiers — if a pattern seems to "hang" or take unexpectedly long on longer test input, this is a signal worth investigating for catastrophic-backtracking structure
For patterns intended to process user-supplied/untrusted input (in applications you're building, beyond this tool's own use) — consider the "ReDoS" implications: could a malicious user craft input that triggers catastrophic backtracking in patterns your application applies to their input?
For complex patterns: simplifying — making character-class boundaries mutually-exclusive where possible, avoiding nested/adjacent quantifiers over potentially-overlapping content — generally improves both performance and (often) readability/correctness of the pattern

Frequently Asked Questions

Is there a tool that can tell me, in advance, whether a regex pattern is vulnerable to catastrophic backtracking? Various static-analysis tools/libraries exist that attempt to analyze regex patterns and flag potentially-vulnerable structures — these can be useful as a "first pass" check, particularly for patterns that will process untrusted input — though, given the genuinely complex nature of determining, with certainty, whether a given pattern is vulnerable for all possible inputs — such tools generally provide "likely vulnerable, worth reviewing" signals, rather than definitive, exhaustive guarantees. *Testing patterns against progressively longer "adversarial"-style inputs (similar to the "many a's, then a non-matching character" example above, adapted to the specific pattern's structure) during development remains a valuable, practical complement to automated analysis tools.