Text Truncation Edge Cases: UTF-8 Boundaries, Word Breaks, and HTML Tags

Truncating at exactly 100 characters sounds simple — until you're splitting a multibyte UTF-8 sequence and producing invalid text, cutting mid-word and leaving "fo" dangling, or cutting HTML and leaving unclosed tags. Here's the byte vs code point vs grapheme cluster distinction, the word-boundary backtracking algorithm, why HTML truncation requires an actual parser (not string slicing), and why your ellipsis length has to come out of the character budget before you cut.

The previous articles on this site covered platform truncation rules, database VARCHAR data loss bugs, and pagination vs "read more" patterns. This article addresses what happens at the byte level when text is truncated — specifically the ways truncation can produce malformed output, and the correct implementations for each failure mode.

The byte vs character vs grapheme boundary problem

"Character 100" in a string depends on how you count characters.

Bytes: in UTF-8, characters outside the ASCII range (code points > 127) use 2, 3, or 4 bytes. "100 bytes" into a UTF-8 string is not the same as "100 characters" — cutting at byte 100 might land in the middle of a multi-byte character, producing invalid UTF-8.
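One way to cut at a byte limit without splitting a character is to step back from the cut point until it no longer lands on a UTF-8 continuation byte. The sketch below assumes the input is a Python str and the limit is a byte count; the function name truncate_utf8_bytes is illustrative, not from any library.

```python
def truncate_utf8_bytes(text: str, max_bytes: int) -> bytes:
    """Return at most max_bytes of UTF-8 without splitting a character."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return data
    if max_bytes <= 0:
        return b""
    cut = max_bytes
    # Continuation bytes have the bit pattern 10xxxxxx. If the first dropped
    # byte is one of them, the cut is inside a character, so step back until
    # the first dropped byte starts a new character.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


# "na\u00efve" is "naive" with U+00EF, which takes two bytes in UTF-8.
print(truncate_utf8_bytes("na\u00efve", 3))  # b'na'
```

A plain slice of the encoded bytes at position 3 would keep only the first half of the two-byte character, and decoding the result as UTF-8 would fail. The result here can come out shorter than the limit, which is the price of keeping the output valid.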

Code points: Unicode code points are the standard way to count "characters", but some things a user sees as one character (an emoji with a skin-tone modifier, a base letter followed by a combining accent) are made of multiple code points. Cutting at "100 code points" might split a grapheme cluster and leave a base character without its modifier, so the text displays incorrectly or shows a different character than intended.

Grapheme clusters: the visual unit — what a user would call "one character." Truncating at grapheme cluster boundaries is the safest approach for user-visible text, though it requires grapheme-segmentation algorithms (available in ICU, or via Intl.Segmenter in modern JavaScript).
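Python's standard library has no grapheme segmenter, so the sketch below is only a rough approximation of the idea, not an implementation of the Unicode segmentation rules. It backs the cut up when the first dropped code point is a combining mark, a skin-tone modifier, a variation selector, or part of a zero-width-joiner sequence. The helper names are hypothetical, and it will miss other cluster types (flag emoji and Hangul syllable parts, for example), so a full segmenter such as ICU is still the safer choice for real text.

```python
import unicodedata


def _extends_previous(ch: str) -> bool:
    """Rough check: does this code point attach to the one before it?"""
    cp = ord(ch)
    return (
        unicodedata.combining(ch) != 0   # combining accents and similar marks
        or 0x1F3FB <= cp <= 0x1F3FF      # emoji skin-tone modifiers
        or 0xFE00 <= cp <= 0xFE0F        # variation selectors
        or cp == 0x200D                  # zero-width joiner
    )


def truncate_code_points_safely(text: str, limit: int) -> str:
    """Keep at most `limit` code points without orphaning a modifier."""
    if len(text) <= limit:
        return text
    cut = max(limit, 0)
    # Back up while the first dropped code point modifies the last kept one,
    # or while the kept text would end on a dangling zero-width joiner.
    while cut > 0 and (_extends_previous(text[cut]) or text[cut - 1] == "\u200d"):
        cut -= 1
    return text[:cut]


accented = "cafe\u0301 au lait"   # "e" followed by a combining acute accent
print(len("\U0001F44D\U0001F3FD"))                 # 2 code points, one visible emoji
print(truncate_code_points_safely(accented, 4))    # caf
```

A plain slice at 4 code points would return "cafe" with the accent stripped, which is a different word from the one the user typed.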

For ASCII-only text, these distinctions collapse — every byte is a code point is a grapheme, and "100 characters" is unambiguous. Unicode-aware truncation only matters when text may contain characters outside the ASCII range.

Word-boundary truncation: avoiding mid-word cuts

Truncating at exactly 140 characters (for a character limit) might produce "The quick brown fo" — cutting mid-word, leaving "fo" as a dangling non-word. Most user-facing truncation wants "The quick brown..." — finding the last word boundary before the limit.

The word-boundary algorithm:

Find the character at position N (the limit)
If position N is mid-word (the character is not a space/punctuation boundary), backtrack to the last word boundary before N
Append the ellipsis or truncation marker
Report the actual length (which may be less than N due to the backtracking)

Edge case: what if there's no word boundary? If the first N characters are all one continuous word (a URL, a compound German word, a string of characters with no spaces), there's no boundary to backtrack to — the implementation needs a fallback (typically: truncate at N characters, word boundary or not, and accept the mid-word cut).

Ellipsis length: if you're truncating to 140 characters including the ellipsis, you need to find the last word boundary at or before character 137 (reserving 3 for "..."). Truncating to 140 characters then appending "..." produces a 143-character result that exceeds the limit.
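The steps above, together with the no-boundary fallback and the ellipsis budget, can be sketched as follows. This is a minimal version under stated assumptions: it counts code points rather than grapheme clusters, treats only whitespace as a word boundary (not punctuation), and the name truncate_words is illustrative.

```python
def truncate_words(text: str, limit: int, ellipsis: str = "...") -> str:
    """Truncate to at most `limit` characters, ellipsis included."""
    if len(text) <= limit:
        return text
    # Reserve room for the ellipsis before choosing the cut point.
    budget = limit - len(ellipsis)
    if budget <= 0:
        return ellipsis[:max(limit, 0)]
    cut = budget
    if not text[budget].isspace():
        # The cut lands mid-word: backtrack to just after the last whitespace.
        while cut > 0 and not text[cut - 1].isspace():
            cut -= 1
    head = text[:cut].rstrip()
    if not head:
        # No usable word boundary (a URL, one long word): accept a hard cut.
        head = text[:budget]
    return head + ellipsis


print(truncate_words("The quick brown fox jumps", 20))  # The quick brown...
```

With a limit of 20 the content budget is 17, which lands inside "fox", so the cut backs up to the end of "brown" and the result is 18 characters. The caller can take len() of the return value to get the actual length reported in the last step.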

HTML truncation: the unclosed tag problem

Truncating HTML as if it were plain text can produce structurally broken HTML:

This is some <b>bold text and this</b> is the rest

Truncating the raw string at the "a" in "and" produces: This is some <b>bold text a

That leaves an unclosed <b> tag. Browsers will attempt to repair this (implicitly closing the tag later in the document), but until then the bold formatting can spill into the content that follows, so the visual and structural result is unpredictable.

HTML-aware truncation requires:

Parsing the HTML into a DOM tree
Truncating the text content (ignoring tag characters) at the character limit
Serializing the resulting DOM back to HTML (which automatically closes any open tags)

This is non-trivial — it requires an HTML parser, not just string operations. Libraries exist for this (html-truncate in Node.js, truncate_html in Ruby) but naive string truncation should never be applied directly to HTML content.
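As a rough illustration of what such a library does, the sketch below uses Python's standard html.parser. It swaps the DOM tree for an event-based parse: it counts only text characters against the limit, keeps a stack of open tags, and closes whatever is still open at the cut. The names truncate_html and _Truncator are hypothetical, and the sketch drops comments and doctypes and counts script and style contents as text, so treat it as a demonstration rather than a replacement for a maintained library.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class _Truncator(HTMLParser):
    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.remaining = limit   # text characters still allowed
        self.out = []            # output fragments
        self.open_tags = []      # stack of currently open elements
        self.done = False        # True once the text budget is spent

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}>")
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return
        # Close everything back to the matching open tag.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        data = data[:self.remaining]
        self.remaining -= len(data)
        self.out.append(escape(data, quote=False))
        if self.remaining == 0:
            self.done = True


def truncate_html(html, limit):
    """Keep at most `limit` text characters and close any open tags."""
    parser = _Truncator(limit)
    parser.feed(html)
    parser.close()
    closing = "".join(f"</{tag}>" for tag in reversed(parser.open_tags))
    return "".join(parser.out) + closing


print(truncate_html("This is some <b>bold text and this</b> is the rest", 24))
# This is some <b>bold text a</b>
```

The limit here applies to visible text only, so the returned string is longer than 24 characters once the tags are counted. Word-boundary backtracking and an ellipsis would need to be layered on top.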

A simpler alternative: strip HTML entirely before truncating (producing plain text), then truncate the plain text. Loses formatting, but produces valid output.
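A minimal sketch of the strip-then-truncate route, again with Python's standard html.parser. The function names are illustrative. Two simplifications to be aware of: text from adjacent block elements is joined with no separator, and script or style contents are kept as text.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the text nodes and collapse runs of whitespace.
    return " ".join("".join(parser.parts).split())


def strip_and_truncate(html, limit):
    """Drop all markup, then cut the plain text at `limit` characters."""
    return html_to_text(html)[:limit]


print(strip_and_truncate("<p>This is some <b>bold text and this</b> is the rest</p>", 24))
# This is some bold text a
```

The result is plain text with entities decoded, so it needs to be escaped again before being placed back into an HTML page.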

Markdown truncation: a less severe but still tricky case

Markdown is less structurally critical than HTML (a truncated bold marker **bold tex doesn't produce invalid syntax, just unrendered asterisks) — but truncating a markdown string at a backtick or bracket boundary can produce confusing output.

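For rendered-markdown contexts (where the output will be processed by a markdown parser): as with HTML, render-then-truncate is safer than truncate-then-render, because the rendered HTML can be cut with HTML-aware truncation instead of handing the parser half-open markdown syntax.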

For raw-markdown contexts (where the truncated output will be displayed as-is): simple character-boundary truncation is generally acceptable, since markdown's syntax damage from truncation is visual rather than structural.

Database truncation: the silent data loss case

The previous "database VARCHAR silent data loss" article covered the scenario where inserting a string longer than a column's VARCHAR limit silently truncates the data in some database/ORM configurations. That's truncation-as-failure. This article's context is truncation-as-feature — intentional truncation for display purposes.

The key difference: truncation-as-feature should happen before data reaches the database (at the application layer), at the correct boundaries (word boundaries, grapheme clusters, valid UTF-8), with the truncated value being what gets stored. Relying on the database to truncate (the silent-data-loss scenario) skips every one of these boundary checks.

How to use the Text Truncator on sadiqbd.com

Set the limit including the ellipsis — if you want output no longer than 140 characters, and you're using a 3-character ellipsis, truncate at 137 characters of content
Word-boundary truncation (if the tool offers it) produces more readable output for prose — character-boundary truncation is appropriate for code, URLs, and strings where word structure doesn't apply
For HTML content: extract plain text first before using a plain-text truncator, or use an HTML-aware truncation approach — never apply character-boundary truncation directly to HTML strings

Frequently Asked Questions

Should I count bytes or characters for storage limits vs display limits? For storage limits (database columns, API payload sizes): count bytes — storage is measured in bytes, not characters, and multi-byte characters count as multiple bytes. A VARCHAR(100) allows 100 bytes (in some databases) or 100 characters (in others, when using a multi-byte character set) — check your database's definition. For display limits (character limits shown to users, SERP title length): count characters (code points or grapheme clusters) — users think in characters, not bytes. Displaying "max 160 characters" and then rejecting a 160-grapheme-cluster input because it's 200 bytes is a confusing user experience.
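The mismatch in that last example is easy to reproduce. The sketch below checks one string against a character limit and a byte limit separately; it counts code points with len(), which matches grapheme clusters only for simple text, and the function name fits_limits is hypothetical.

```python
def fits_limits(text: str, max_chars: int, max_bytes: int) -> tuple[bool, bool]:
    """Return (fits the character limit, fits the UTF-8 byte limit)."""
    char_ok = len(text) <= max_chars
    byte_ok = len(text.encode("utf-8")) <= max_bytes
    return char_ok, byte_ok


# 120 ASCII letters plus 40 two-byte characters: 160 characters, 200 bytes.
sample = "a" * 120 + "\u00e9" * 40
print(len(sample), len(sample.encode("utf-8")))  # 160 200
print(fits_limits(sample, 160, 160))             # (True, False)
```

Validating against both numbers up front, and telling the user which one failed, avoids the confusing rejection described above.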

Is the Text Truncator free? Yes — completely free, no sign-up required.

Try the Text Truncator free at sadiqbd.com — shorten any text to a set character or word limit with clean word-boundary cutting.