How I learned to sort strings in Python


If your Python program sorts strings, don’t forget to put locale.setlocale(locale.LC_ALL, "") at the beginning of your program, sort with a locale-aware key such as locale.strcoll, and align the “locale” settings across all environments. This might save you some time and headache down the road…
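As a rough sketch of that advice, the snippet below adopts the user's preferred locale once at startup and then sorts with locale.strcoll. The helper names init_locale and locale_sorted are made up for illustration, and the fallback to the C locale when the environment names a locale that is not installed is an assumption about what a program might want, not a recommendation from the Python docs.

```python
import locale
from functools import cmp_to_key


def init_locale() -> str:
    """Adopt the user's preferred locale and return the active collation locale."""
    try:
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The environment names a locale that is not installed here;
        # the process simply stays on whatever locale it already had.
        pass
    # Calling setlocale with only a category queries the current setting.
    return locale.setlocale(locale.LC_COLLATE)


def locale_sorted(items):
    """Sort strings according to the current LC_COLLATE rules."""
    return sorted(items, key=cmp_to_key(locale.strcoll))


if __name__ == "__main__":
    print("collation locale:", init_locale())
    print(locale_sorted(["a", "A", "b", "B"]))
```

Printing the collation locale at startup is one cheap way to notice when two environments are not aligned.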

Full story

One day I was debugging a subtle production issue. The gist was that string values were sorted differently across the environments where the application ran and, most importantly, the order was wrong in production.

In production, the app was running on a Debian Linux server, but locally I was developing on a macOS laptop. In the CI environment, where we run the tests, the results matched those on my local development machine.

To sort a list of strings in Python, we would normally use the built-in sorted function. Let’s run a simple test:

>>> sorted(["a", "A"])
['A', 'a']
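For context, plain sorted compares strings by Unicode code point, which is why the uppercase letter wins here. A small illustration (the words list is just sample data):

```python
words = ["a", "A", "b", "B"]

# Every uppercase ASCII letter has a smaller code point than any lowercase one.
code_points = {ch: ord(ch) for ch in words}
print(code_points)    # {'a': 97, 'A': 65, 'b': 98, 'B': 66}

# With no key function, sorted() orders strings by those code points.
print(sorted(words))  # ['A', 'B', 'a', 'b']
```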

OK, looks fine… Uppercase letters come before lowercase ones, right? Or do they? Let’s take a step back and check how sorting works in Linux…

You’ve probably heard of a thing called a locale. A locale defines language-specific rules for, among other things, how strings should be sorted. To see your current locale settings, run the locale command in a terminal:

~$ locale

Each locale category that this command prints can be set to use a different locale. The category that is relevant for string sorting is LC_COLLATE. This article assumes that all categories are configured to use the same locale.

The most basic locale is called C, which works only with the ASCII character encoding standard. On a development machine you would normally use a locale that corresponds to the natural language you speak and the region you live in. For example, in the US it’s common to use the en_US.UTF-8 locale, which represents the language rules of American English.

Let’s try to perform the same sorting test on the Debian Linux machine where the original issue was happening:

~$ sort <<< $'a\nA'

Here a comes before A, which is not the result that Python’s sorted function returned. Why? It’s the same machine, and I didn’t change the locale…

Could it be that Python doesn’t use the system locale by default? Of course, I should have checked the documentation beforehand…

The documentation for sorted doesn’t mention locale at all. After looking into the linked “Sorting HOW TO” tutorial, I found in its “Odds & Ends” section that for locale-aware sorting one should use locale.strcoll() as the comparison function. OK, let’s try that out… After reading some more documentation, I came up with this:

>>> import locale
>>> from functools import cmp_to_key
>>> sorted(["a", "A"], key=cmp_to_key(locale.strcoll))
['A', 'a']
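As an aside, the same tutorial section also names locale.strxfrm() as a key function, which avoids the cmp_to_key wrapper. A minimal sketch comparing the two forms under whatever locale is currently active (the words list is sample data):

```python
import locale
from functools import cmp_to_key

words = ["b", "a", "B", "A"]

# Comparison-function style: strcoll compares two strings at a time.
by_strcoll = sorted(words, key=cmp_to_key(locale.strcoll))

# Key-function style: strxfrm transforms each string once into a value
# whose ordinary comparison follows the same collation rules.
by_strxfrm = sorted(words, key=locale.strxfrm)

print(by_strcoll == by_strxfrm)
```

Both keys consult the same LC_COLLATE setting, so switching between them should not change the ordering.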

Hmm, that’s still not what I expected. Let’s check whether Python is even picking up my system locale:

>>> locale.getlocale()
('en_US', 'UTF-8')

Yep, that’s the one. Python knows what my system locale is but doesn’t apply it automatically. Let me try to “force” it and try again:

>>> locale.setlocale(locale.LC_ALL, "en_US.UTF-8")
'en_US.UTF-8'
>>> sorted(["a", "A"], key=cmp_to_key(locale.strcoll))
['a', 'A']
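Setting a locale by name, as above, changes it for the whole process and only works where that locale is installed. Here is a hedged sketch of a helper that switches LC_COLLATE temporarily and then restores the previous setting. The name collation is hypothetical and not part of the locale module, and since the locale docs warn that setlocale is not thread-safe on most systems, this is only suitable for single-threaded code.

```python
import locale
from contextlib import contextmanager


@contextmanager
def collation(name: str):
    """Temporarily switch LC_COLLATE to `name`, then restore the old value."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query only, no change
    locale.setlocale(locale.LC_COLLATE, name)       # may raise locale.Error
    try:
        yield
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


try:
    with collation("en_US.UTF-8"):
        print(sorted(["a", "A"], key=locale.strxfrm))
except locale.Error:
    # Assumption: on machines without this locale we just report it.
    print("en_US.UTF-8 is not installed on this machine")
```

The printed order depends on the platform's collation tables, so no expected output is shown here.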

OK, now it works as I expected. But why?

The answer is documented deep in the locale docs. Here’s the important bit:

Initially, when a program is started, the locale is the C locale, no matter what the user’s preferred locale is. There is one exception: the LC_CTYPE category is changed at startup to set the current locale encoding to the user’s preferred locale encoding. The program must explicitly say that it wants the user’s preferred locale settings for other categories by calling setlocale(LC_ALL, "").
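To observe the behavior described in that passage, one can query each category before and after the call. The sketch below is only an illustration: category_settings is a hypothetical helper, and the output depends entirely on the machine it runs on. Calling setlocale with only a category returns the current setting without changing it.

```python
import locale

# Categories available on all platforms supported by the locale module.
CATEGORIES = ("LC_CTYPE", "LC_COLLATE", "LC_TIME", "LC_NUMERIC", "LC_MONETARY")


def category_settings() -> dict:
    """Return the current locale name for each category (query only)."""
    return {name: locale.setlocale(getattr(locale, name)) for name in CATEGORIES}


before = category_settings()
try:
    # Ask for the user's preferred locale in every category.
    locale.setlocale(locale.LC_ALL, "")
except locale.Error:
    # The environment names a locale that is not installed; nothing changes.
    pass
after = category_settings()

for name in CATEGORIES:
    print(f"{name:12} {before[name]!r} -> {after[name]!r}")
```

Run in a fresh interpreter, the "before" column should show LC_COLLATE still set to C even when LC_CTYPE already reflects the user's locale.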

This is how I learned the hard way that sorting strings correctly in Python is not as easy as it might appear at first glance.