Python Unicode string to lower case and caseless match

str.lower() and str.casefold()

Starting with Python 3.0, strings are stored as Unicode.

Python provides two string methods, str.lower() and str.casefold(), that can be used to convert a string to lowercase:

str.lower()
Return a copy of the string with all the cased characters converted to lowercase.

The lowercasing algorithm used is described in section 3.13 of the Unicode Standard.

str.casefold()
Return a casefolded copy of the string. Casefolded strings may be used for caseless matching.

Casefolding is similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string. For example, the German lowercase letter ‘ß’ is equivalent to “ss”.

Since it is already lowercase, lower() would do nothing to ‘ß’; casefold() converts it to “ss”.
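The difference matters as soon as two spellings of the same word are compared. The short sketch below uses the word pair "Straße" / "STRASSE" (chosen here for illustration, not taken from the Python documentation) to show that lower() leaves the two strings unequal while casefold() makes them equal:

```python
german_mixed = "Straße"
german_upper = "STRASSE"

# lower() keeps 'ß' as it is, so the results differ: 'straße' vs 'strasse'
lower_equal = german_mixed.lower() == german_upper.lower()

# casefold() maps 'ß' to 'ss', so both sides become 'strasse'
casefold_equal = german_mixed.casefold() == german_upper.casefold()

print(lower_equal)     # False
print(casefold_equal)  # True
```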

Python’s casefold() implements Unicode’s toCasefold() operation. Unicode also defines NFKC_Casefold, a mapping designed for best behavior when doing caseless matching of strings interpreted as identifiers.

Example usage of lower() and casefold() in Python:

>>> import unicodedata

>>> def casefold(s):
...     return unicodedata.normalize('NFKC', s).casefold()
...

>>> s = 'AÁÀÃÂÄÅĀĂĄ-ḂƁḄḆƂƄɃ-ĆĈĊČƇÇḈȻ-ḊƊḌḎḐḒĎÐĐƉ-ƋÈÉÊẼĒĔĖËẺĚȄȆẸ-ß'

>>> s.lower()
'aáàãâäåāăą-ḃɓḅḇƃƅƀ-ćĉċčƈçḉȼ-ḋɗḍḏḑḓďðđɖ-ƌèéêẽēĕėëẻěȅȇẹ-ß'

>>> unicodedata.normalize('NFKC', s).casefold()
'aáàãâäåāăą-ḃɓḅḇƃƅƀ-ćĉċčƈçḉȼ-ḋɗḍḏḑḓďðđɖ-ƌèéêẽēĕėëẻěȅȇẹ-ss'

>>> unicodedata.normalize('NFKC', s).casefold().islower()
True

>>> unicodedata.normalize('NFKC', s).casefold() == s.lower()
False

# Be careful with these two 'K' characters: they are different code points in upper case but lower to the same 'k'
>>> k1 = 'K'  # The first 'K' code point is 0x212A
>>> k2 = 'K'

>>> hex(ord(k1))
'0x212a'

>>> hex(ord(k2))
'0x4b'

>>> k1 == k2
False

>>> k1.lower() == k2.lower()
True

>>> casefold(k1) == casefold(k2)
True

This raises the question: what is caseless matching?

Unicode Default Caseless Matching

Section 3.13 of the Unicode Standard defines default caseless matching.

Default caseless matching is the process of comparing two strings for case-insensitive equality. The definitions of Unicode Default Caseless Matching build on the definitions of Unicode Default Case Folding.

Default Caseless Matching uses full case folding:

A string X is a caseless match for a string Y if and only if:
toCasefold(X) = toCasefold(Y)
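Since str.casefold() applies full case folding, the definition above can be sketched directly in Python. The function name default_caseless_match is made up for this example:

```python
def default_caseless_match(x: str, y: str) -> bool:
    """Sketch of default caseless matching: toCasefold(X) == toCasefold(Y)."""
    return x.casefold() == y.casefold()


print(default_caseless_match("Straße", "STRASSE"))  # True
print(default_caseless_match("\u212a", "k"))        # True: KELVIN SIGN folds to 'k'
print(default_caseless_match("Python", "Pythons"))  # False
```

No normalization is performed here, so this is only the plain default match.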

When comparing strings for case-insensitive equality, the strings should also be normalized for most correct results. For example, the case folding of U+00C5 Å LATIN CAPITAL LETTER A WITH RING ABOVE is U+00E5 å LATIN SMALL LETTER A WITH RING ABOVE, whereas the case folding of the sequence <U+0041 “A” LATIN CAPITAL LETTER A, U+030A ◌̊ COMBINING RING ABOVE> is the sequence <U+0061 “a” LATIN SMALL LETTER A, U+030A ◌̊ COMBINING RING ABOVE>. Simply doing a binary comparison of the results of case folding both strings will not catch the fact that the resulting case-folded strings are canonical-equivalent sequences. In principle, normalization needs to be done after case folding, because case folding does not preserve the normalized form of strings in all instances. This requirement for normalization is covered in the following definition for canonical caseless matching:

D145 A string X is a canonical caseless match for a string Y if and only if:
NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))
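A minimal Python sketch of D145, assuming unicodedata.normalize() for NFD and str.casefold() for toCasefold. The helper names nfd and canonical_caseless_match are invented for this example:

```python
import unicodedata


def nfd(s: str) -> str:
    """Canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)


def canonical_caseless_match(x: str, y: str) -> bool:
    """Sketch of D145: NFD(toCasefold(NFD(X))) == NFD(toCasefold(NFD(Y)))."""
    return nfd(nfd(x).casefold()) == nfd(nfd(y).casefold())


precomposed = "\u00c5"   # Å as a single code point
decomposed = "A\u030a"   # 'A' followed by COMBINING RING ABOVE

# Case folding alone does not make the two spellings equal
print(precomposed.casefold() == decomposed.casefold())    # False

# Canonical caseless matching does
print(canonical_caseless_match(precomposed, decomposed))  # True
```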

The invocations of canonical decomposition (NFD normalization) before case folding in D145 are to catch very infrequent edge cases. Normalization is not required before case folding, except for the character U+0345 ◌ͅ COMBINING GREEK YPOGEGRAMMENI and any characters that have it as part of their canonical decomposition, such as U+1FC3 ῃ GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI. In practice, optimized versions of canonical caseless matching can catch these special cases, thereby avoiding an extra normalization step for each comparison.

In some instances, implementers may wish to ignore compatibility differences between characters when comparing strings for case-insensitive equality. The correct way to do this makes use of the following definition for compatibility caseless matching:

A string X is a compatibility caseless match for a string Y if and only if:
NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) =
  NFKD(toCasefold(NFKD(toCasefold(NFD(Y)))))
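The same definition can be sketched in Python, again assuming unicodedata.normalize() and str.casefold() as the building blocks. The names compatibility_key and compatibility_caseless_match are hypothetical:

```python
import unicodedata


def compatibility_key(s: str) -> str:
    """Sketch of NFKD(toCasefold(NFKD(toCasefold(NFD(s)))))."""
    s = unicodedata.normalize("NFD", s)
    s = unicodedata.normalize("NFKD", s.casefold())
    return unicodedata.normalize("NFKD", s.casefold())


def compatibility_caseless_match(x: str, y: str) -> bool:
    return compatibility_key(x) == compatibility_key(y)


square_mhz = "\u3392"  # SQUARE MHZ, a single compatibility character

# Case folding alone leaves the compatibility character untouched
print(square_mhz.casefold() == "MHZ".casefold())          # False

# NFKD expands it to letters, which the second fold then lowercases
print(compatibility_caseless_match(square_mhz, "MHZ"))    # True
print(compatibility_caseless_match("\ufb01", "FI"))       # True: 'fi' ligature
```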

Compatibility caseless matching requires an extra cycle of case folding and normalization for each string compared, because the NFKD normalization of a compatibility character such as U+3392 ㎒ SQUARE MHZ may result in a sequence of alphabetic characters which must again be case folded (and normalized) to be compared correctly.

Caseless matching for identifiers can be simplified and optimized by using the NFKC_Casefold mapping. That mapping incorporates internally the derived results of iterated case folding and NFKD normalization. It also maps away characters with the property value Default_Ignorable_Code_Point = True, which should not make a difference when comparing identifiers. The following defines identifier caseless matching:

A string X is an identifier caseless match for a string Y if and only if:
toNFKC_Casefold(NFD(X)) = toNFKC_Casefold(NFD(Y))

Special upper case transformation

Char  Code Point  Output
ß     0x00DF      SS
ı     0x0131      I
ſ     0x017F      S
ﬀ     0xFB00      FF
ﬁ     0xFB01      FI
ﬂ     0xFB02      FL
ﬃ     0xFB03      FFI
ﬄ     0xFB04      FFL
ﬅ     0xFB05      ST
ﬆ     0xFB06      ST
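The table can be checked against str.upper() with a few lines of Python. The dictionary name special_upper is invented for this sketch. The characters are written as escapes so that the code points are unambiguous:

```python
# Expected upper-case output for each special character in the table
special_upper = {
    "\u00df": "SS",   # ß
    "\u0131": "I",    # ı (dotless i)
    "\u017f": "S",    # ſ (long s)
    "\ufb00": "FF",
    "\ufb01": "FI",
    "\ufb02": "FL",
    "\ufb03": "FFI",
    "\ufb04": "FFL",
    "\ufb05": "ST",
    "\ufb06": "ST",
}

for ch, expected in special_upper.items():
    print(f"0x{ord(ch):04X} {ch} -> {ch.upper()} (matches table: {ch.upper() == expected})")

# upper() followed by lower() does not round-trip for these characters
round_trip = "\u00df".upper().lower()
print(round_trip)  # ss
```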

Special lower case transformation

Char  Code Point  Output
K     0x212A      k

Reference