Text Diff Granularity: Character, Word, Line, and Structured

The simplest possible text diff, "what characters changed?", is the edit distance problem. Its classic O(n×m) dynamic programming solution becomes slow at document scale, which is why practical diff tools work at line level and run a second character-level pass only within changed lines. This article covers how granularity (character, word, line, sentence, structural) changes both what's shown and what's computationally feasible, how to read the unified diff format, and why structured formats such as JSON and HTML need tree-level rather than line-level comparison.

The previous articles on this site covered the Myers diff algorithm, practical diff workflows (legal redlining, code review), and plagiarism detection. This article addresses diff granularity — why diffs can be computed at the character, word, sentence, or line level, and how the choice of granularity changes both what's shown and what's computationally feasible.

What "edit distance" measures

Edit distance (Levenshtein distance) is the minimum number of single-character operations (insertions, deletions, substitutions) required to transform one string into another.

"cat" → "bat": 1 substitution (c→b). Edit distance = 1.
"cat" → "cart": 1 insertion (add r). Edit distance = 1.
"cat" → "dog": 3 substitutions. Edit distance = 3.

The classic way to compute exact edit distance fills a table with one cell for every pair of positions in the two strings: an O(n×m) dynamic programming algorithm (where n and m are the lengths of the two strings). For two strings of 1,000 characters each, that is 1,000,000 cell updates. For two strings of 100,000 characters each (a document of a few dozen pages), it is 10 billion, which is far too slow for an interactive tool.
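The table-filling approach described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: the function name edit_distance is ours, and it keeps only two rows of the table in memory at a time.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic O(n*m) dynamic programming table."""
    # prev[j] = distance between the previous prefix of a and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (free if they match)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("cat", "bat"))   # 1
print(edit_distance("cat", "cart"))  # 1
print(edit_distance("cat", "dog"))   # 3
```

The two nested loops are where the O(n×m) cost comes from: every character of one string is visited once for every character of the other.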

Why line-level diffing is the standard for documents

Git, diff command-line tools, and most text diff applications work at line level — each "element" being compared is a full line, not a character. This dramatically reduces the problem size:

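Comparing a 1,000-line document to another 1,000-line document takes at most about 1,000,000 line comparisons in the worst case, far fewer than a character-level table over the same text would need. Each comparison asks "are these two lines identical?", which is cheap if every line is hashed first: comparing two hashes takes constant time, with a full string comparison needed only to rule out a collision. When the two files are similar, the Myers algorithm does much better than the worst case: its running time is O(N×D), where N is the combined length of the two files and D is the number of differing lines. The shortest edit script it finds corresponds to a longest common subsequence of matching lines.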

After finding which lines differ, some tools apply a second pass — character-level diff within changed lines — to highlight exactly which characters changed within a modified line. This two-pass approach (line-level diff to find changed lines, then character-level diff only within those changed-line pairs) gives good granularity where it matters while keeping the computation tractable.
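A rough sketch of the two-pass idea, using Python's standard difflib module. Note the assumptions: difflib's SequenceMatcher uses its own matching algorithm rather than Myers, the function name two_pass_diff and the result tuples are ours, and the second pass only runs when a changed block has the same number of old and new lines (a simplification; real tools pair lines more carefully).

```python
from difflib import SequenceMatcher


def two_pass_diff(old_text: str, new_text: str) -> list:
    """Line-level diff first, then a character-level diff inside changed line pairs."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    results = []
    # Pass 1: compare whole lines. autojunk=False disables difflib's heuristic
    # that treats very frequent elements as junk in long sequences.
    line_matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in line_matcher.get_opcodes():
        if tag == "equal":
            continue
        removed, added = old_lines[i1:i2], new_lines[j1:j2]
        if tag == "replace" and len(removed) == len(added):
            # Pass 2: character-level diff, only within paired changed lines.
            for old_line, new_line in zip(removed, added):
                chars = SequenceMatcher(None, old_line, new_line, autojunk=False)
                spans = [
                    (op, old_line[a1:a2], new_line[b1:b2])
                    for op, a1, a2, b1, b2 in chars.get_opcodes()
                    if op != "equal"
                ]
                results.append(("changed", old_line, new_line, spans))
        else:
            results.extend(("removed", line) for line in removed)
            results.extend(("added", line) for line in added)
    return results


for entry in two_pass_diff("alpha\nthe cat sat\nomega", "alpha\nthe bat sat\nomega"):
    print(entry)
```

The expensive character-level comparison runs only on the one line that changed, not on the whole document.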

Word-level diffing: better for prose, worse for code

Word-level diffing treats each whitespace-delimited token as an element — useful for prose (where you want to see "the word urgent was changed to important") but less meaningful for code (where punctuation and whitespace have semantic significance that word tokenization loses).

Many document tools offer this kind of comparison: Microsoft Word's "Compare" feature and Google Docs' "Compare documents" both take two versions of a document and mark up what differs, typically at word granularity. Word's "Track Changes" and Google Docs' suggestion mode produce similar-looking markup but work differently: they record edits as they are made rather than diffing two finished versions after the fact.

Limitation: word-level diffing is sensitive to tokenization — "it's" vs "its" is one word changed; "its" vs "it is" is one word deleted and two inserted (a three-token change). Different tokenizers (split on whitespace only, vs split on punctuation too) produce different diffs for the same change.
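The tokenizer sensitivity is easy to see in a small sketch. The two tokenizers and the word_diff helper below are illustrative names of our own, and the matching is done with Python's difflib rather than any particular tool's algorithm.

```python
import re
from difflib import SequenceMatcher


def tokenize_whitespace(text: str) -> list[str]:
    """Split on whitespace only: "it's" stays one token."""
    return text.split()


def tokenize_punctuation(text: str) -> list[str]:
    """Split words and punctuation marks into separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def word_diff(old: str, new: str, tokenize) -> list[tuple[str, list[str], list[str]]]:
    """Return the non-equal opcodes of a token-level diff."""
    a, b = tokenize(old), tokenize(new)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


print(word_diff("The task is urgent", "The task is important", tokenize_whitespace))
print(word_diff("it's late", "its late", tokenize_whitespace))
print(word_diff("it's late", "its late", tokenize_punctuation))
```

With the whitespace tokenizer, "it's" to "its" is a single replaced token. With the punctuation-aware tokenizer, the same edit becomes three tokens ("it", "'", "s") replaced by one.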

Sentence and semantic diffing: for meaning, not characters

Sentence-level diffing identifies which sentences changed — useful for analyzing document revisions at a higher abstraction level than word changes.

Semantic diffing — comparing meaning rather than surface form — is an emerging area where LLMs and embedding models are used. Two sentences can be character-level identical while meaning different things (irony, context), or character-level very different while meaning almost the same thing (paraphrase). Traditional diff algorithms operate purely on surface form; semantic approaches attempt to surface meaningful changes.

This connects to plagiarism detection (covered in the previous article) — paraphrase plagiarism changes the surface form while preserving the meaning, which character/word/line diff entirely misses.

Structured diffs: XML, JSON, HTML

Plain text diff algorithms applied to XML, JSON, or HTML produce results that can be misleading — a reformatted JSON file (all on one line vs pretty-printed) would show every line as changed in a line-level diff, even if the data content is identical.

Structured diff tools parse the document into its semantic structure first (the XML/JSON/HTML tree), then diff at the structural level:

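JSON diff: compare key-value trees. A renamed key at a nested level shows up as a change at that one path (typically "this key was removed and that key was added," or "this key was renamed" in tools that detect renames) rather than "this entire JSON object was rewritten" (as a line-level diff might show if the indentation changed).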

HTML diff: compare DOM trees — a moved element shows as "this element moved from here to there" rather than lines of HTML being deleted and inserted at different positions.

XML schema-aware diffing: some tools use knowledge of the schema to produce semantically meaningful diffs — e.g., "this legal clause was amended" rather than "these 23 lines changed."
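For JSON, a tree-level comparison can be sketched with nothing more than the standard json module. The json_diff function and its slash-separated paths are our own illustrative conventions. The sketch makes two simplifying assumptions: lists are compared position by position only when their lengths match, and a renamed key is reported as a removal plus an addition (detecting renames would need an extra matching step).

```python
import json


def json_diff(old, new, path=""):
    """Yield (path, change, old_value, new_value) for two parsed JSON values."""
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(old.keys() | new.keys()):
            child = f"{path}/{key}"
            if key not in new:
                yield (child, "removed", old[key], None)
            elif key not in old:
                yield (child, "added", None, new[key])
            else:
                yield from json_diff(old[key], new[key], child)
    elif isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        # Simplification: same-length lists are compared index by index.
        for index, (o, n) in enumerate(zip(old, new)):
            yield from json_diff(o, n, f"{path}/{index}")
    elif type(old) is not type(new) or old != new:
        # The type check stops Python from treating JSON true and 1 as equal.
        yield (path or "/", "changed", old, new)


compact = '{"name":"Ada","tags":["a","b"],"address":{"city":"Dhaka"}}'
pretty = json.dumps(json.loads(compact), indent=2)  # same data, different layout
edited = '{"name":"Ada","tags":["a","c"],"address":{"town":"Dhaka"}}'

print(list(json_diff(json.loads(compact), json.loads(pretty))))  # []
for change in json_diff(json.loads(compact), json.loads(edited)):
    print(change)
```

The reformatted copy produces no changes at all, while the edited copy reports exactly the paths that differ.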

The "unified diff" format: the standard representation

The output format most diff tools use — the one you see in git diff output and pull request views — is the unified diff format, where context lines and changed lines are shown together:

@@ -10,4 +10,5 @@    ← hunk header: original lines 10-13, new lines 10-14
 context line 1        ← unchanged (shown for context, prefixed with space)
 context line 2
-removed line          ← line only in original (prefixed with -)
+added line            ← line only in new version (prefixed with +)
+another added line
 context line 3

The hunk header @@ -10,4 +10,5 @@ means: in the original file, this hunk starts at line 10 and spans 4 lines (3 context lines plus 1 removed line); in the new file, it starts at line 10 and spans 5 lines (the same 3 context lines plus 2 added lines).
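To see how the header numbers relate to the hunk body, here is a small sketch that generates a unified diff with Python's difflib.unified_diff and then parses the header. The regular expression and the parse_hunk_header helper are our own. The file names are placeholders, and because the sample lists start at line 1 the generated header starts at 1 rather than 10.

```python
import re
from difflib import unified_diff

# "@@ -start[,count] +start[,count] @@" ; a missing count means 1 line.
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line: str) -> dict:
    """Extract start line and line count for the old and new side of a hunk."""
    match = HUNK_HEADER.match(line)
    if match is None:
        raise ValueError(f"not a hunk header: {line!r}")
    old_start, old_count, new_start, new_count = match.groups()
    return {
        "old_start": int(old_start),
        "old_count": int(old_count) if old_count is not None else 1,
        "new_start": int(new_start),
        "new_count": int(new_count) if new_count is not None else 1,
    }


old = ["context line 1", "context line 2", "removed line", "context line 3"]
new = ["context line 1", "context line 2", "added line",
       "another added line", "context line 3"]

diff_lines = list(unified_diff(old, new, fromfile="old.txt", tofile="new.txt", lineterm=""))
print("\n".join(diff_lines))
print(parse_hunk_header(diff_lines[2]))
```

The old-side count equals the number of context lines plus removed lines in the hunk, and the new-side count equals context lines plus added lines.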

Context lines (unchanged lines surrounding changes) help reviewers understand where in the document a change occurred. Without context, a removed line reading "}" tells you that a closing brace was deleted, but not which function or block it closed.

How to use the Text Diff tool on sadiqbd.com

For code comparison: line-level diffing (the standard) is appropriate — code is line-structured and line-level changes are meaningful units
For prose revision comparison: if the tool offers word-level diff, this is usually more readable — seeing which words changed rather than which lines is more natural for prose
For JSON/structured data: be aware that reformatting (changing indentation or key order) will appear as a massive diff even if the data is logically unchanged — consider normalizing format before diffing (minifying JSON, sorting keys consistently) if you want to compare content rather than formatting
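One way to normalize JSON before pasting it into a line-level diff is to re-serialize both sides with sorted keys and fixed indentation. The sketch below uses Python's json and difflib modules. The helper names canonical_json_lines and normalized_diff are ours, and pretty-printing is used here instead of minifying so that the resulting diff stays readable line by line.

```python
import json
from difflib import unified_diff


def canonical_json_lines(text: str) -> list[str]:
    """Re-serialize JSON with sorted keys and 2-space indentation, split into lines."""
    data = json.loads(text)
    return json.dumps(data, sort_keys=True, indent=2, ensure_ascii=False).splitlines()


def normalized_diff(old_text: str, new_text: str) -> list[str]:
    """Line-level unified diff of two JSON documents after normalizing both."""
    return list(unified_diff(
        canonical_json_lines(old_text),
        canonical_json_lines(new_text),
        fromfile="old.json",
        tofile="new.json",
        lineterm="",
    ))


# Same data, different key order and layout: no diff after normalizing.
print(normalized_diff('{"b":1,"a":{"y":2,"x":1}}', '{\n "a": {"x": 1, "y": 2},\n "b": 1\n}'))

# A real value change still shows up, and only on the affected line.
print("\n".join(normalized_diff('{"a":1,"b":2}', '{"b":3,"a":1}')))
```

After normalization, formatting and key-order noise disappears and only genuine data changes remain in the diff.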

Frequently Asked Questions

Why do some diff tools show different results for the same change? Because there is often more than one minimal diff: several edit sequences of the same total cost can produce the same transformation, and different algorithms choose differently among them. Git's default algorithm is a variant of Myers, which works only with insertions and deletions (there is no "replace" operation, so a modified line appears as a deletion followed by an insertion) and applies heuristics to choose among equally short scripts. Git also offers alternatives (--diff-algorithm=patience, histogram, or minimal) that can align the same input differently. Neither result is "wrong"; they are different but equally valid edit scripts. This is particularly visible when comparing large blocks of similar-but-not-identical content, where algorithms differ in how they align corresponding sections.
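A tiny example makes the "more than one minimal diff" point concrete. The apply_script helper and the ("keep" | "delete" | "insert") script format below are illustrative conventions of our own, not the output format of any particular tool.

```python
def apply_script(lines: list[str], script: list[tuple[str, str]]) -> list[str]:
    """Apply ("keep" | "delete" | "insert", value) steps to a list of lines."""
    out, pos = [], 0
    for op, value in script:
        if op == "insert":
            out.append(value)
            continue
        # "keep" and "delete" both consume one line of the input.
        if pos >= len(lines) or lines[pos] != value:
            raise ValueError("script does not match the input")
        if op == "keep":
            out.append(value)
        pos += 1
    if pos != len(lines):
        raise ValueError("script did not consume the whole input")
    return out


def cost(script: list[tuple[str, str]]) -> int:
    """Number of insertions and deletions in the script."""
    return sum(1 for op, _ in script if op != "keep")


old, new = ["A", "B"], ["B", "A"]
script_1 = [("delete", "A"), ("keep", "B"), ("insert", "A")]
script_2 = [("insert", "B"), ("keep", "A"), ("delete", "B")]

print(apply_script(old, script_1), cost(script_1))
print(apply_script(old, script_2), cost(script_2))
```

Both scripts turn the old text into the new one with two operations, yet one reports "A" as moved and the other reports "B" as moved. Which one a tool shows depends on how its algorithm breaks the tie.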

Is the Text Diff tool free? Yes — completely free, no sign-up required.

Try the Text Diff tool free at sadiqbd.com — compare any two texts and see exactly what changed, character by character.