Natural Sort vs Lexicographic Sort: How Languages, Databases, and File Managers Differ

"file10" sorting before "file2" isn't a bug specific to one tool — it's the default lexicographic behavior across most programming languages, while file managers typically default to natural sort, creating a common mismatch. Here's how Python, JavaScript, SQL ORDER BY, and spreadsheets each handle this differently, and why version-number sorting (SemVer) is a related but distinct problem with its own rules.

"file10" sorting before "file9" is one symptom of a broader pattern: the default string sort in most programming languages and the default file listing in most graphical file managers order text containing numbers differently, and the differences follow predictable patterns once you know what to look for.

The previous article on this site covered why "file10" sorts before "file2" under standard lexicographic (character-by-character) sorting, and introduced "natural sort" as the alternative. This article surveys how different programming languages and platforms handle the problem, because the default behavior varies. Code that assumes one behavior while running in an environment with the other produces subtly wrong ordering, which often goes unnoticed until specific edge cases (like double-digit-numbered items) appear.

The default: lexicographic (character-code) sorting

Most programming languages' default string-sorting (when you call a generic "sort this list of strings" function without specifying a custom comparator) performs lexicographic comparison — comparing strings character by character, based on each character's underlying code-point value.

Python: sorted(["file1", "file10", "file2"]) returns ['file1', 'file10', 'file2']. This is lexicographic order: "file10" sorts before "file2" because the two strings first differ at index 4 (the fifth character), where '1' < '2' as characters. The comparison stops at that first difference, so the trailing "0" of "file10" is never considered.
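
The comparison described above can be checked directly. This is a minimal sketch using only built-ins; the list name is just an illustration.

```python
names = ["file1", "file10", "file2"]

# The default sort compares strings code point by code point.
lexicographic = sorted(names)
print(lexicographic)  # ['file1', 'file10', 'file2']

# "file10" vs "file2" is decided at index 4, where '1' (code point 49)
# is compared with '2' (code point 50).
print(ord("1"), ord("2"))  # 49 50
print("file10" < "file2")  # True
```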

JavaScript: ["file1", "file10", "file2"].sort() produces the same result. With no comparator, .sort() converts elements to strings and compares them by UTF-16 code unit values. The same default applies to arrays of numbers, so [1, 10, 2].sort() yields [1, 10, 2].

Lexicographic order is the default string sort in most languages because it is consistent, predictable, and (importantly) fast: each comparison is a simple, direct operation. Natural sort has to find the numeric substrings, parse them, and compare them numerically, which makes each comparison more complex and slower.
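
To make the extra parsing work concrete, here is one possible natural-sort key in Python. It is a sketch, not a standard-library feature: natural_key is a hypothetical helper name, and it only treats unsigned runs of digits as numbers.

```python
import re

def natural_key(text):
    """Split text into digit and non-digit runs; digit runs become ints."""
    parts = re.split(r"(\d+)", text)
    # With a capturing group, re.split puts the digit runs at odd indexes,
    # so keys always alternate str, int, str, ... and stay comparable.
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

names = ["file10", "file2", "file1"]
print(sorted(names))                   # ['file1', 'file10', 'file2']
print(sorted(names, key=natural_key))  # ['file1', 'file2', 'file10']
```

Every comparison now involves a regular-expression split and integer conversions, which is the per-comparison cost the default sort avoids.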

Where "natural sort" is the default: file managers and some specific APIs

Most modern graphical file managers (Windows Explorer, macOS Finder, and many Linux file managers) default to natural sort for filenames, placing "file2" before "file10". For people browsing files, this matches intuitive expectations: a user who names files "Chapter 1," "Chapter 2," ..., "Chapter 10" expects them in that numeric order, not "Chapter 1, Chapter 10, Chapter 2, ...".

This creates a common mismatch. A script that lists files through a programming-language API and then processes them "in order" typically ends up with lexicographic order (or, with some APIs, no guaranteed order at all), which can differ from what a user sees for the same files in their file manager. It is a common, if often subtle, source of "why did my script process these files in a weird order" confusion.
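
The mismatch can be reproduced in a few lines of Python. In this sketch, list_files and natural_key are hypothetical helpers. Python's os.listdir returns names in arbitrary order, so the script sorts explicitly either way. Real file managers may also apply case and locale rules that this key does not model.

```python
import os
import re
import tempfile

def natural_key(text):
    # Digit runs become ints; everything else stays text.
    parts = re.split(r"(\d+)", text)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

def list_files(directory, natural=False):
    """Return file names in a defined order (os.listdir order is arbitrary)."""
    names = os.listdir(directory)
    return sorted(names, key=natural_key if natural else None)

with tempfile.TemporaryDirectory() as tmp:
    for name in ["Chapter 1.txt", "Chapter 2.txt", "Chapter 10.txt"]:
        open(os.path.join(tmp, name), "w").close()
    script_order = list_files(tmp)                 # plain sorted()
    browser_order = list_files(tmp, natural=True)  # natural-sort key

print(script_order)   # ['Chapter 1.txt', 'Chapter 10.txt', 'Chapter 2.txt']
print(browser_order)  # ['Chapter 1.txt', 'Chapter 2.txt', 'Chapter 10.txt']
```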

Database ORDER BY: lexicographic by default, with collation-dependent nuances

SQL ORDER BY on a text/string column is lexicographic by default, with the exact order determined by the column's collation. Collations differ in how they order characters and case, but number-aware ordering is generally not part of standard collations. Under most standard text collations, 'file10' sorts before 'file2', just as it does in a programming language's lexicographic sort.

Achieving "natural sort" in SQL typically requires either:

Storing a separate numeric column (for example, extracting 10 from "file10" into its own integer column at insert time) and using ORDER BY on that column. This is the most reliable approach, but it requires schema and data changes.
Database-specific natural-sort functions or extensions. Some databases offer built-in or extension-provided natural-sort collations or functions, but support varies significantly: PostgreSQL, MySQL, and SQL Server each have different, non-standard approaches, where available at all. Because none of this is part of the SQL standard, check your specific database's documentation for "natural sort" support.
Application-level sorting. Retrieve the rows without ORDER BY (or with a simple ORDER BY as a starting point), then sort in application code using a natural-sort comparator. Because natural sort is usually not part of a standard library's default sort, many languages rely on third-party libraries for it.
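
The first option can be sketched with Python's built-in sqlite3 module and an in-memory database. The table name (files) and column names (name, seq) are made up for illustration, and SQLite's default text collation stands in for "a standard text collation"; other databases may behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: seq holds the number extracted from name at insert time.
conn.execute("CREATE TABLE files (name TEXT, seq INTEGER)")
rows = [("file1", 1), ("file10", 10), ("file2", 2)]
conn.executemany("INSERT INTO files (name, seq) VALUES (?, ?)", rows)

# Ordering by the text column is lexicographic.
by_text = [r[0] for r in conn.execute("SELECT name FROM files ORDER BY name")]
# Ordering by the integer column gives the numeric order.
by_number = [r[0] for r in conn.execute("SELECT name FROM files ORDER BY seq")]

print(by_text)    # ['file1', 'file10', 'file2']
print(by_number)  # ['file1', 'file2', 'file10']
conn.close()
```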

Spreadsheet software: often natural sort, but inconsistently

Spreadsheet applications (Excel, Google Sheets, and similar) do not sort text values containing numbers in one consistent way. Behavior can depend on whether the values are recognized as text, as mixed alphanumeric content, or are auto-detected as some other type. Some sort implementations apply natural-sort-like behavior for certain patterns (for example, recognizing "Item 1," "Item 2," ..., "Item 10" and sorting by the trailing number), but this is not applied consistently across spreadsheet software and versions. If natural ordering matters for a particular spreadsheet, test the actual sort behavior with your own data rather than assuming it.

Version number sorting: a related but distinct problem

Version strings (like "1.9.0", "1.10.0", "1.2.0") present a related challenge. Lexicographic sort orders them as "1.10.0", "1.2.0", "1.9.0": comparing character by character, "1.1" < "1.2" < "1.9", so "1.10.0", which starts with "1.1", sorts first. That is clearly wrong for version numbers, where "1.10.0" comes after "1.9.0" because ten is greater than nine in the second component.

Package managers use dedicated version-comparison logic instead. npm and Cargo follow Semantic Versioning (SemVer), while pip follows Python's own version scheme (PEP 440), which is similar in spirit but has different rules. In each case, the dot-separated components are (generally) parsed as numbers and compared numerically, component by component. This is conceptually similar to natural sort, but it is applied specifically and consistently to a recognized version-string format, not as a general-purpose "sort any text containing numbers naturally" operation. Package managers don't rely on generic natural-sort libraries for this. Their comparison logic follows the relevant specification, which has rules beyond "numeric components compare numerically". For example, SemVer defines precedence for pre-release identifiers, so "1.0.0-alpha" ranks below "1.0.0".
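
As an illustration of how version precedence differs from plain text ordering, here is a simplified sort key based on the SemVer 2.0.0 precedence rules. semver_key is a hypothetical helper, not any package manager's actual code, and it assumes well-formed "major.minor.patch" input with no validation.

```python
def semver_key(version):
    """Sort key following SemVer 2.0.0 precedence (simplified, no validation)."""
    version = version.split("+", 1)[0]        # build metadata is ignored
    core, _, prerelease = version.partition("-")
    major, minor, patch = (int(part) for part in core.split("."))
    if not prerelease:
        # A release ranks above all of its own pre-releases.
        return (major, minor, patch, 1, ())
    # Numeric identifiers compare as numbers and rank below alphanumeric ones.
    identifiers = tuple(
        (0, int(ident), "") if ident.isdigit() else (1, 0, ident)
        for ident in prerelease.split(".")
    )
    return (major, minor, patch, 0, identifiers)

versions = ["1.10.0", "1.2.0", "1.0.0", "1.9.0", "1.0.0-alpha"]
print(sorted(versions))
# ['1.0.0', '1.0.0-alpha', '1.10.0', '1.2.0', '1.9.0']
print(sorted(versions, key=semver_key))
# ['1.0.0-alpha', '1.0.0', '1.2.0', '1.9.0', '1.10.0']
```

For real projects, a maintained version-parsing library for your ecosystem is a safer choice than a hand-written key like this one.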

How to use the Sort Lines tool on sadiqbd.com

Check which sort mode you need: the tool offers both alphabetical (lexicographic) and numeric/natural sort options. Match the mode to what the downstream use of the sorted list expects. Are you matching a file manager's display order, or a script's processing order, which may itself be lexicographic by default?
For preparing data for SQL ORDER BY: if your target database lacks natural-sort support, consider whether a separate numeric column (extracted here, or via the find-and-replace tool's capture-group techniques from a previous article) would serve you better than relying on text-column sorting.
For version-like strings specifically: general natural sort and semantic-versioning comparison can diverge on edge cases (pre-release identifiers and other SemVer-specific rules). For genuine version-number data, dedicated SemVer-aware tooling, where available in your programming environment, is more correct than general-purpose natural sort.

Frequently Asked Questions

Why doesn't every programming language just default to natural sort, if it's "more intuitive"? Performance and predictability, primarily. Lexicographic comparison is a simple, fast, well-defined operation on any string, with no ambiguity. Natural sort requires parsing decisions. What counts as a "number" within a string? How are leading zeros handled: does "file01" equal "file1" for sorting purposes, or not? Is "v1.10" one number, "1.10", or two numbers, "1" and "10"? These questions have no single, universally agreed answer, which makes natural sort a poor fit as the one unambiguous default for a general-purpose language's string comparison.
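
A short Python sketch shows how one particular natural-sort key answers these questions. natural_key is a hypothetical helper, and other implementations may reasonably decide each case differently.

```python
import re

def natural_key(text):
    # Digit runs become ints; everything else stays text.
    parts = re.split(r"(\d+)", text)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

# Leading zeros: int("01") == int("1"), so the two names tie under this key.
print(natural_key("file01") == natural_key("file1"))  # True

# Python's sort is stable, so tied names simply keep their input order.
print(sorted(["file1", "file01"], key=natural_key))  # ['file1', 'file01']
print(sorted(["file01", "file1"], key=natural_key))  # ['file01', 'file1']

# Dots: "v1.10" is read as the integers 1 and 10, not the decimal 1.10.
print(natural_key("v1.10"))  # ['v', 1, '.', 10, '']
```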

Does natural sort handle negative numbers and decimals correctly? This depends entirely on the specific implementation. Different natural-sort libraries and tools make different choices about whether to read "-5" as a negative number or as a hyphen followed by the number 5, and whether to read "3.14" as a decimal or as the integer 3, a period, and the integer 14. If your data includes negative numbers or decimals and their ordering matters, test the specific tool's behavior with representative examples rather than assuming a particular interpretation.
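
The two interpretations can be compared side by side. Both key functions below are hypothetical sketches written for this comparison; neither represents how any particular library behaves.

```python
import re

def integer_runs_key(text):
    """Treat only runs of digits as numbers; '-' and '.' stay text."""
    parts = re.split(r"(\d+)", text)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

def signed_decimal_key(text):
    """Treat an optional leading '-' and a fractional part as one number."""
    parts = re.split(r"(-?\d+(?:\.\d+)?)", text)
    return [float(p) if i % 2 else p for i, p in enumerate(parts)]

values = ["3.5", "3.14", "-5", "-10"]
print(sorted(values, key=integer_runs_key))    # ['3.5', '3.14', '-5', '-10']
print(sorted(values, key=signed_decimal_key))  # ['-10', '-5', '3.14', '3.5']
```

The second key would also read the hyphen in a name like "file-5" as a minus sign, which is exactly the kind of choice worth testing against your own data.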