String Reversal and Unicode: Why Naive Implementations Break for Emoji

Reversing a string in JavaScript with split("").reverse().join("") breaks emoji: most emoji occupy two UTF-16 code units, and the split separates them. Here's why naive string reversal fails for Unicode, how surrogate pairs and combining characters cause problems, and the grapheme-cluster approach that fixes it in Python and JavaScript.

Reversing a string in code is a classic interview question, but doing it correctly for Unicode reveals that most implementations are wrong

The naive implementation of string reversal in most languages iterates over characters and builds the reversed sequence. This works perfectly for ASCII. For Unicode text — particularly strings containing emoji, combining characters, or right-to-left scripts — naive reversal produces broken output.

Understanding why reveals how strings are represented in memory, and fixing it produces a genuinely correct implementation.

The naive approach and why it works for ASCII

Python:

text = "Hello, World!"
reversed_text = text[::-1]
# → "!dlroW ,olleH"

JavaScript:

const text = "Hello, World!";
const reversed = text.split("").reverse().join("");
// → "!dlroW ,olleH"

These work correctly for ASCII because every character is a single code unit. "H" is one character, one code unit, one entry in the string's internal representation.

The Unicode problem: multi-unit characters

Modern Unicode contains characters that require more than one code unit in common string encodings.
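
As a rough illustration, the sketch below counts how many units the same text occupies in three representations. The helper name unit_counts is made up for this example; it relies only on Python's built-in str.encode.

```python
def unit_counts(text):
    """Count code points, UTF-16 code units, and UTF-8 bytes for text."""
    return {
        "code_points": len(text),                           # a Python str is indexed by code point
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # each UTF-16 code unit is 2 bytes
        "utf8_bytes": len(text.encode("utf-8")),
    }

print(unit_counts("H"))           # {'code_points': 1, 'utf16_units': 1, 'utf8_bytes': 1}
print(unit_counts("\U0001F600"))  # 😀 → {'code_points': 1, 'utf16_units': 2, 'utf8_bytes': 4}
```

The ASCII letter is one unit everywhere, while the emoji is one code point but two UTF-16 code units and four UTF-8 bytes.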

Emoji and supplementary characters

Most emoji, like every character outside the Basic Multilingual Plane (code points above U+FFFF), are represented as "surrogate pairs" in JavaScript's UTF-16 encoding: two 16-bit code units that together represent one code point.

const emoji = "😀";
emoji.length          // 2: JavaScript counts UTF-16 code units, not code points
emoji.split("")       // ['\ud83d', '\ude00']: the surrogate pair split into two lone surrogates
emoji.split("").reverse().join("")  // '\ude00\ud83d': surrogates in the wrong order, no longer a valid character
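
To see the same failure outside JavaScript, this Python sketch exposes the UTF-16 code units as integers and flips them the way split("").reverse() does. The function names utf16_units and is_high_surrogate are assumptions made for illustration.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

def is_high_surrogate(unit):
    """A valid surrogate pair must start with a unit in 0xD800..0xDBFF."""
    return 0xD800 <= unit <= 0xDBFF

units = utf16_units("\U0001F600")   # 😀
print([hex(u) for u in units])      # ['0xd83d', '0xde00']

flipped = units[::-1]               # what a code-unit reversal produces
print(is_high_surrogate(units[0]), is_high_surrogate(flipped[0]))  # True False
```

After the flip the low surrogate comes first, so the two units no longer decode to any character.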

The fix for surrogate pairs in JavaScript is to split by code points instead:

// Use Array.from() to iterate by code points, not code units
const text = "Hello 😀🌍";
const reversed = Array.from(text).reverse().join("");
// → "🌍😀 olleH" — emoji preserved correctly

Or using the spread operator (which also iterates by code points):

const reversed = [...text].reverse().join("");

Combining characters

Unicode allows characters to be modified by separate "combining" characters. A character like "é" can be represented as:

Precomposed: a single code point U+00E9 (LATIN SMALL LETTER E WITH ACUTE)
Decomposed: two code points U+0065 (e) + U+0301 (COMBINING ACUTE ACCENT)

If you naively reverse a string with decomposed characters, the combining marks become detached from their base characters:

# Decomposed form: 'e' + combining accent
text = "\u0065\u0301"  # "é" as two code points
print(text)            # → é

# Naive reversal separates the accent
reversed_naive = text[::-1]
print(reversed_naive)  # → ́e (accent before e — garbled)

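A partial fix in Python is to normalize before reversing:

import unicodedata

def reverse_string_unicode(text):
    # Normalize to NFC so base + combining-mark pairs collapse into a single
    # precomposed code point wherever Unicode defines one
    normalized = unicodedata.normalize('NFC', text)
    # Reverse by code points. This is not grapheme-aware: sequences with no
    # precomposed form (and ZWJ emoji) still break, so use the third-party
    # 'grapheme' package for full correctness
    return normalized[::-1]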
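
Where no precomposed form exists, NFC cannot help. One standard-library middle ground is to attach each combining mark to the base character before it and reverse those groups. The sketch below is only an approximation of grapheme clusters (it ignores ZWJ emoji sequences, for example), and reverse_keeping_marks is a hypothetical helper name.

```python
import unicodedata

def reverse_keeping_marks(text):
    """Reverse text while keeping combining marks attached to their base character."""
    groups = []
    for ch in text:
        # Unicode general categories Mn, Mc and Me are all combining marks
        if groups and unicodedata.category(ch).startswith("M"):
            groups[-1] += ch
        else:
            groups.append(ch)
    return "".join(reversed(groups))

# "q" + COMBINING ACUTE ACCENT has no precomposed form, so NFC leaves it as two code points
print(len(unicodedata.normalize("NFC", "q\u0301")))        # 2
print(reverse_keeping_marks("cafe\u0301") == "e\u0301fac")  # True
```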

For complete correctness, reversing should operate on grapheme clusters — what humans perceive as single characters, which may consist of multiple Unicode code points. The Python grapheme package and JS Intl.Segmenter implement this:

// Modern JavaScript: grapheme cluster segmentation
const segmenter = new Intl.Segmenter();
const text = "Hello 👨‍👩‍👧‍👦";  // Family emoji (ZWJ sequence, 11 code units!)
const graphemes = [...segmenter.segment(text)].map(s => s.segment);
const reversed = graphemes.reverse().join("");
// → "👨‍👩‍👧‍👦 olleH" — family emoji intact
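
For comparison, here is a rough Python sketch using only the standard library. It is not a full implementation of Unicode's grapheme segmentation rules: it only glues combining marks, emoji skin-tone modifiers, and ZWJ-joined characters onto the preceding character, and it does not handle cases such as flag emoji. The name approx_clusters is an assumption; for production use, a dedicated library such as the grapheme package mentioned above is the safer choice.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

def approx_clusters(text):
    """Split text into approximate grapheme clusters (simplified rules)."""
    clusters = []
    join_next = False  # True right after a ZWJ: the next character joins the cluster
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")   # combining marks, variation selectors
        is_skin_tone = "\U0001F3FB" <= ch <= "\U0001F3FF"    # emoji skin-tone modifiers
        if clusters and (join_next or is_mark or is_skin_tone or ch == ZWJ):
            clusters[-1] += ch
        else:
            clusters.append(ch)
        join_next = (ch == ZWJ)
    return clusters

family = "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"  # 👨‍👩‍👧‍👦
reversed_text = "".join(reversed(approx_clusters("Hello " + family)))
print(reversed_text == family + " olleH")  # True
```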

RTL scripts and logical vs visual reversal

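Arabic and Hebrew are written right-to-left, but their strings are stored in logical order: the first character in memory is the first one read. The Unicode Bidirectional Algorithm (which governs how RTL text is displayed) lays those characters out from right to left on screen. Reversing an Arabic string character by character changes the logical order, and the Bidirectional Algorithm then displays that new sequence right-to-left like any other Arabic text, which may not be the intended result.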

The distinction:

Character-level reversal: reverses the stored (logical) order of code points
Visual reversal: reverses how the text appears on screen

For RTL text, these are different operations. Reversing the character sequence of Arabic text produces different Arabic words (or gibberish) — not a mirror image of the original display.
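
Before deciding what "reverse" should mean for a given string, it can help to detect whether it contains RTL characters at all. Python's unicodedata.bidirectional reports each character's Bidi class; the helper name has_rtl below is hypothetical, and the check is a sketch that only looks for the two strong RTL classes.

```python
import unicodedata

def has_rtl(text):
    """True if any character has a strong right-to-left Bidi class (R or AL)."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)

arabic = "\u0633\u0644\u0627\u0645"       # سلام, stored in logical (reading) order
print(has_rtl(arabic), has_rtl("Hello"))  # True False
print(arabic[::-1] == arabic)             # False: the logical order has changed
```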

Palindrome detection with Unicode

A proper Unicode palindrome check must:

Normalise to NFC (or NFD) form to handle composed/decomposed variants
Fold case (using Unicode case folding, not just toLowerCase())
Remove punctuation, spaces, and non-letter characters
Segment by grapheme clusters
Compare the sequence to its reverse

import unicodedata

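def is_palindrome(text):
    # Casefold first, then normalise: case folding can leave text that is no longer in NFC
    normalised = unicodedata.normalize('NFC', text.casefold())
    # Keep only letters and digits (this also drops any combining marks left after NFC)
    letters_only = ''.join(c for c in normalised if c.isalnum())
    # Compares code points, not grapheme clusters
    return letters_only == letters_only[::-1]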

is_palindrome("A man, a plan, a canal: Panama")  # True
is_palindrome("Was it a car or a cat I saw?")      # True
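
The grapheme-segmentation step from the list can be approximated with the standard library by keeping combining marks attached to their base letters instead of discarding them. This is a sketch under that simplification, not full grapheme segmentation, and is_palindrome_clusters is a hypothetical name.

```python
import unicodedata

def is_palindrome_clusters(text):
    """Palindrome check that compares base letters together with their combining marks."""
    folded = unicodedata.normalize("NFD", text.casefold())
    clusters = []
    prev_kept = False  # was the previous character a letter/digit (or a mark on one)?
    for ch in folded:
        if unicodedata.category(ch).startswith("M"):
            if prev_kept:
                clusters[-1] += ch  # attach the mark to its base letter
        elif ch.isalnum():
            clusters.append(ch)
            prev_kept = True
        else:
            prev_kept = False       # punctuation and spaces are skipped
    return clusters == clusters[::-1]

print(is_palindrome_clusters("A man, a plan, a canal: Panama"))  # True
print(is_palindrome_clusters("e\u0301te\u0301"))                 # True  (decomposed "été")
print(is_palindrome_clusters("q\u0301aq"))                       # False (accent on one side only)
```

The last case is where the two approaches differ: a check that drops combining marks would treat the accented and unaccented letter as equal.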

Practical uses of string reversal

Checking for palindromes (words, phrases, numbers).

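Reversing domain names for storage and lookup: some DNS software and databases store domain names with their labels in reverse order, so www.example.com becomes com.example.www and names in the same zone sort next to each other. This reverses the order of the labels, not the characters within them.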
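
A minimal sketch of that label reversal, with reverse_labels as a made-up helper name:

```python
def reverse_labels(domain):
    """Reverse the order of dot-separated labels, leaving each label intact."""
    return ".".join(reversed(domain.split(".")))

print(reverse_labels("www.example.com"))  # com.example.www
```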

Data obfuscation (weak, not security): reversing strings in configuration files to prevent casual reading — similar to ROT13, this is obscurity, not encryption.

Right-to-left text debugging: checking whether a rendering engine is correctly handling RTL text by comparing expected and actual visual positions.

How to use the Text Reverser on sadiqbd.com

Enter any text
Select reversal mode:
- By character — reverses individual characters
- By word — reverses word order but preserves each word's characters
- By line — reverses line order but preserves each line's content
Use for: palindrome checking, creating mirror text effects, reversing word order in headlines
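
For readers who want to script the same three modes, they might look roughly like this in Python. These functions are illustrative assumptions, not the tool's actual implementation; the character mode here reverses by code point, and the word mode collapses runs of whitespace into single spaces.

```python
def reverse_by_character(text):
    """Reverse individual characters (by code point)."""
    return text[::-1]

def reverse_by_word(text):
    """Reverse word order, keeping each word's characters in place."""
    return " ".join(reversed(text.split()))

def reverse_by_line(text):
    """Reverse line order, keeping each line's content in place."""
    return "\n".join(reversed(text.splitlines()))

print(reverse_by_character("abc"))          # cba
print(reverse_by_word("one two three"))     # three two one
print(reverse_by_line("a\nb\nc") == "c\nb\na")  # True
```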

Frequently Asked Questions

Why does Python's [::-1] break for emoji in some cases? Python 3 strings are sequences of Unicode code points, so [::-1] correctly handles most supplementary characters (emoji) by reversing code points, not bytes. It does fail for strings with combining characters (decomposed form) and for complex multi-code-point sequences like ZWJ (Zero Width Joiner) sequences used in family emoji. For truly correct reversal of all Unicode, grapheme cluster segmentation is needed.
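
A quick way to check both halves of that answer in a Python 3 interpreter (the variable names are arbitrary):

```python
smiley = "Hello \U0001F600"                 # "Hello 😀"
print(smiley[::-1] == "\U0001F600 olleH")   # True: a single-code-point emoji survives

family = "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"  # 👨‍👩‍👧‍👦
print(len(family))                          # 7 code points: four people joined by three ZWJs
print(family[::-1] == family)               # False: the members are reordered
```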

Is reversing a string an O(n) operation? Yes — reversing a string requires reading every character and writing it in reverse order, which is proportional to string length. Memory-wise, in most languages this creates a new string (immutable strings in Python, Java, JavaScript), so it's O(n) in time and O(n) in space.
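
As a sketch of where the linear cost comes from, the classic two-pointer reversal performs about n/2 swaps on a mutable list. Because Python strings are immutable, converting to a list and joining back are each O(n) as well. The function name reverse_in_place is an assumption for this example.

```python
def reverse_in_place(items):
    """Reverse a list in place by swapping from both ends toward the middle."""
    left, right = 0, len(items) - 1
    while left < right:
        items[left], items[right] = items[right], items[left]
        left += 1
        right -= 1
    return items

chars = list("stressed")                     # O(n) copy into a mutable list
print("".join(reverse_in_place(chars)))      # desserts
```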

Is the Text Reverser free? Yes — completely free, no sign-up required.

String reversal is the quintessential "simple" operation that isn't simple once Unicode is involved. The jump from "reverse the bytes" to "reverse the grapheme clusters" encompasses decades of Unicode standardisation and the full complexity of how modern text is encoded.

Try the Text Reverser free at sadiqbd.com — reverse any text by character, word, or line instantly.