In Python 2, comparing two unicode strings can sometimes produce unexpected results. Two strings can look identical on screen, yet the Python comparison reports them as unequal. How is this possible? How could the German word “Glück” not be the same as “Glück”?

In Unicode there are two ways to encode a German umlaut. The two forms are canonically equivalent, but they consist of different code points. See more: https://en.wikipedia.org/wiki/Unicode_equivalence

In the case of the word “Glück”, the umlaut “ü” can be encoded as:

  • A single precomposed character, U+00FC.
  • The character “u” followed by a non-spacing mark (a combining accent), which is not a character on its own but modifies the preceding character. This is encoded as the ASCII “u” (U+0075) followed by the combining diaeresis (U+0308).
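
To make the difference concrete, here is a minimal sketch that builds both spellings of “Glück” from explicit escape sequences, so the result does not depend on how the source file or terminal is encoded. The variable names composed and decomposed are only illustrative.

```python
from __future__ import print_function  # lets the same code run on Python 2 and 3

# Precomposed form: a single code point U+00FC for the umlaut.
composed = u"Gl\u00fcck"
# Decomposed form: plain "u" (U+0075) followed by combining diaeresis (U+0308).
decomposed = u"Glu\u0308ck"

# Both render as the same word, but they are different code point sequences.
print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 5 6
```

The differing lengths show that the comparison works on code points, not on what is displayed.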

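The unicodedata module converts between the two forms. NFD decomposes a character into base character plus combining mark, and NFC composes them back into a single character: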

>>> import unicodedata
>>> unicodedata.normalize("NFD", u"ü")
u'u\u0308'
>>> unicodedata.normalize("NFC", u"u\u0308")
u'\xfc'
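
One way to get a reliable comparison is to normalize both strings to the same form before comparing them. The following is a minimal sketch of that idea; the helper name same_text and its form parameter are hypothetical, not part of any library. Note that in Python 2, unicodedata.normalize expects unicode objects, not byte strings.

```python
from __future__ import print_function  # lets the same code run on Python 2 and 3
import unicodedata


def same_text(a, b, form="NFC"):
    """Compare two unicode strings after normalizing both to the same form.

    Hypothetical helper for illustration; `form` can be any form accepted
    by unicodedata.normalize, such as "NFC" or "NFD".
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = u"Gl\u00fcck"      # precomposed U+00FC
decomposed = u"Glu\u0308ck"   # "u" + combining diaeresis U+0308

print(composed == decomposed)            # False
print(same_text(composed, decomposed))   # True
```

Which form is used matters less than using the same one on both sides. NFC is a common choice because it keeps strings shorter.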

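So please keep this in mind when comparing strings in Python: the comparison works on code points and does not normalize the strings for you. Also be aware that Python 2 and Python 3 do not handle strings in exactly the same way. In Python 2 the unicode type is separate from the byte-string str type, while in Python 3 str is always Unicode.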