How I learned to sort strings in Python
TL;DR
If your Python program sorts strings, don’t forget to put
locale.setlocale(locale.LC_ALL, "")
at the beginning of your program
and align the locale settings across all environments. This might save you
some time and headaches down the road…
Full story
One day I was debugging a subtle production issue. The gist was that string values were sorted inconsistently across the environments the application ran in and, most importantly, the order was wrong in production.
In production, the app was running on a Debian Linux server, but locally I was developing on a macOS laptop. In the CI environment, where we run the tests, the results matched those on my local development machine.
To sort a list of strings in Python, we would normally use the built-in sorted function. Let’s run a simple test:
>>> sorted(["a", "A"])
['A', 'a']
OK, looks fine… Uppercase letters come before lowercase ones, right? Or do they? Let’s take a step back and check how sorting works in Linux…
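As a side note on the result above: without a key, Python compares strings by the Unicode code points of their characters, which is why every uppercase ASCII letter lands before every lowercase one. The snippet below is a small illustration of that; the word list is made up for the example, and the str.casefold key is shown only as a common workaround that is not locale-aware.

```python
words = ["a", "A", "b", "B"]

# Each character maps to an integer code point; uppercase ASCII letters
# have smaller code points than lowercase ones.
print([ord(ch) for ch in "AaBb"])  # [65, 97, 66, 98]

# The default sort compares code points, so "B" comes before "a".
default_order = sorted(words)
print(default_order)  # ['A', 'B', 'a', 'b']

# Folding case groups letters together, but it knows nothing about locales.
folded_order = sorted(words, key=str.casefold)
print(folded_order)  # ['a', 'A', 'b', 'B']
```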
You’ve probably heard of a thing called a locale. A locale defines, among other things, how sorting should be performed for a given natural language. To see your current locale settings, run the locale command in a terminal:
~$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
Each locale category printed above can be set to use a different locale. The category that is relevant for string sorting is LC_COLLATE. This article assumes that all categories are configured to use the same locale.
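When the categories are not all set the same way, it helps to know which environment variable wins. On POSIX systems, LC_ALL overrides the per-category variable (such as LC_COLLATE), which in turn overrides LANG. The sketch below only models that lookup order in plain Python. The helper name effective_locale is hypothetical, and the fallback to "C" when nothing is set is an assumption, since POSIX leaves that default to the implementation.

```python
def effective_locale(category, env):
    """Return the locale name a POSIX system would pick for `category`.

    `env` is a mapping of environment variables, e.g. os.environ.
    """
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:  # unset and empty variables are both skipped
            return value
    return "C"  # assumed default when nothing is configured


# LANG alone applies to every category.
print(effective_locale("LC_COLLATE", {"LANG": "en_US.UTF-8"}))  # en_US.UTF-8

# A per-category variable overrides LANG.
print(effective_locale("LC_COLLATE", {"LANG": "en_US.UTF-8", "LC_COLLATE": "C"}))  # C

# LC_ALL overrides everything else.
print(effective_locale("LC_COLLATE", {"LC_COLLATE": "C", "LC_ALL": "en_US.UTF-8"}))  # en_US.UTF-8
```

Calling it with os.environ on each machine is a quick way to compare which collation locale a process would start with.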
The most basic locale is called C, which is built around the ASCII character encoding standard and compares characters by their code values. On a development machine you would normally use a locale that corresponds to the natural language you speak and the region you live in. For example, in the US it’s common to use the en_US.UTF-8 locale, which represents the language rules of American English.
Let’s try to perform the same sorting test on the Debian Linux machine where the original issue was happening:
~$ sort <<< $'a\nA'
a
A
This is not the same result that Python’s sorted function returned. Why? It’s the same machine and I didn’t change the locale…
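One way to probe the difference is to run the same sort command under an explicitly chosen locale and see whether the order changes. The sketch below does that from Python. It assumes a POSIX system with a sort command on the PATH, and the helper name system_sort is made up for this example.

```python
import os
import shutil
import subprocess


def system_sort(lines, lc_all):
    """Sort lines with the system `sort` command under the given LC_ALL value."""
    env = dict(os.environ, LC_ALL=lc_all)  # LC_ALL overrides the other locale variables
    result = subprocess.run(
        ["sort"],
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        env=env,
        check=True,
    )
    return result.stdout.splitlines()


# Only run the demo where a POSIX `sort` is available.
if os.name == "posix" and shutil.which("sort"):
    print(system_sort(["a", "A"], "C"))  # ['A', 'a'] in the C locale
```

Under the C locale the command orders the two letters by their byte values, so uppercase comes first. Passing another installed locale name as lc_all shows what that locale’s collation rules do instead.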
Could it be that Python doesn’t use the system locale by default? Of course, I should have checked the documentation beforehand…
The documentation for sorted mentions nothing about the locale. After looking into the linked “Sorting HOW TO” tutorial, I found in its “Odds & Ends” section that locale-aware sorting should use locale.strcoll() as the comparison function. OK, let’s try that out… After reading some more documentation I came up with this:
>>> import locale
>>> from functools import cmp_to_key
>>> sorted(["a", "A"], key=cmp_to_key(locale.strcoll))
['A', 'a']
Hmm, that’s still not what I expected. Let’s check whether Python is even picking up my system locale:
>>> locale.getlocale()
('en_US', 'UTF-8')
Yep, that’s the one. Python knows what my system locale is, but doesn’t apply it automatically. Let me try to “force” it and try again:
>>> locale.setlocale(locale.LC_ALL, "en_US.UTF-8")
'en_US.UTF-8'
>>> sorted(["a", "A"], key=cmp_to_key(locale.strcoll))
['a', 'A']
OK, now it works as I expected. But why?
The answer is documented deep in the locale docs. Here’s the important bit:
Initially, when a program is started, the locale is the C locale, no matter what the user’s preferred locale is. There is one exception: the LC_CTYPE category is changed at startup to set the current locale encoding to the user’s preferred locale encoding. The program must explicitly say that it wants the user’s preferred locale settings for other categories by calling setlocale(LC_ALL, "").
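The quoted behavior also explains the earlier confusion: locale.getlocale() without arguments reports the LC_CTYPE category, which is the one category Python does adjust at startup, so it says nothing about collation. A rough way to check the collation setting of a freshly started interpreter is sketched below. Calling locale.setlocale with only a category queries the current value without changing it. The PROBE string and the helper name collation_at_startup are made up for this example.

```python
import subprocess
import sys

# Code executed in a brand-new interpreter, so nothing has called
# setlocale() on its behalf yet.
PROBE = (
    "import locale\n"
    "print(locale.setlocale(locale.LC_COLLATE))\n"
)


def collation_at_startup():
    """Return the LC_COLLATE setting a fresh Python process starts with."""
    result = subprocess.run(
        [sys.executable, "-c", PROBE],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print(collation_at_startup())  # C
```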
This is how I learned the hard way that sorting strings correctly in Python is not as easy as it might appear at first glance.
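Putting the pieces together, a startup routine might look like the sketch below. It is one possible arrangement, not the only one: the helper names use_system_locale and locale_sorted are hypothetical, the fallback on locale.Error is an assumption about how you want to handle a machine whose configured locale is not installed, and locale.strxfrm is used as a key function in place of cmp_to_key(locale.strcoll), which the Sorting HOW TO lists as the key-based alternative.

```python
import locale


def use_system_locale():
    """Adopt the user's preferred locale; keep the current one if that fails."""
    try:
        return locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # Raised when the environment names a locale that is not installed.
        return locale.setlocale(locale.LC_ALL)  # query only, no change


def locale_sorted(strings):
    """Sort strings using the collation rules of the current LC_COLLATE."""
    return sorted(strings, key=locale.strxfrm)


# Call once, early in the program, before any locale-aware sorting.
use_system_locale()
print(locale_sorted(["a", "A"]))  # order depends on the active locale
```

The printed order will differ between machines unless their locale settings are aligned, which is the point of the TL;DR.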