In Python, comparing two Unicode strings can sometimes give an unexpected result. Two strings can look identical, yet the comparison reports that they differ. How is this possible? How can the German word “Glück” not be the same as “Glück”?
In Unicode, there are two ways to encode a German umlaut. For more detail, see https://en.wikipedia.org/wiki/Unicode_equivalence
In the case of the word “Glück”, the umlaut ü can be encoded in two ways:
- As a single precomposed character, U+00FC.
- As the character “u” followed by a non-spacing (combining) mark, which is not a character on its own but modifies the preceding character. This is encoded as the ASCII “u” (U+0075) followed by the combining diaeresis (U+0308).
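To make the difference visible, here is a minimal sketch, assuming Python 3. The variable names are illustrative only, and the two spellings are written with explicit escapes so that the example does not depend on how an editor stores the ü.

```python
# Two spellings of "Glück" that render identically.
composed = "Gl\u00fcck"      # precomposed ü (U+00FC)
decomposed = "Glu\u0308ck"   # "u" (U+0075) followed by combining diaeresis (U+0308)

print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 5 6
```

The strings differ in length because the decomposed form stores the ü as two code points.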
Code to convert between the two forms (NFD decomposes, NFC composes; the output is shown as Python 2 prints it):
>>> import unicodedata
>>> unicodedata.normalize("NFD", u"ü")
u'u\u0308'
>>> unicodedata.normalize("NFC", u"u\u0308")
u'\xfc'
Keep this in mind when comparing strings in Python: normalize both strings to the same form before comparing them. Also note that Python 2 and Python 3 do not handle Unicode strings in exactly the same way.
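One way to make such comparisons robust is to normalize both sides to the same form first. The following is a minimal sketch, assuming Python 3; the helper name `nfc_equal` is hypothetical and not part of any library.

```python
import unicodedata

def nfc_equal(a, b):
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Precomposed and decomposed spellings of "Glück" now compare equal.
print(nfc_equal("Gl\u00fcck", "Glu\u0308ck"))  # True
```

NFC is used here only as an example; normalizing both strings to NFD would work equally well, as long as both sides use the same form.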