M-A's

technology blog

Thursday, 15 November 2012

Unicode equivalence may not be handled as you think

Unicode normalization is not always happening how you would expect, especially w.r.t. file systems. First, I recommend you to read about it on the wikipedia page http://en.wikipedia.org/wiki/Unicode_equivalence that is fairly well written:
In general, the code points of truly identical characters (which can be rendered in the same way in Unicode fonts) are defined to be canonically equivalent.
Unicode has 2 equivalence notions, with pre-composed or decomposed representing the same characters, and 2 normal forms, the canonical one (NF) and the "compatible" one (NFK).
In order to compare or search Unicode strings, software can use either composed or decomposed forms; this choice does not matter as long as it is the same for all strings involved in a search, comparison, etc. On the other hand, the choice of equivalence criteria can affect search results. For instance some typographic ligatures like U+FB03 (ffi), roman numerals like U+2168 (Ⅸ) and even subscripts and superscripts, e.g. U+2075 (⁵) have their own Unicode code points. Canonical normalization (NF) does not affect any of these, but compatibility normalization (NFK) will decompose the ffi ligature into the constituent letters, so a search for U+0066 (f) as substring would succeed in an NFKC normalization of U+FB03 but not in NFC normalization of U+FB03. Likewise when searching for the Latin letter I (U+0049) in the precomposed Roman Numeral Ⅸ (U+2168). Similarly the superscript "⁵" (U+2075) is transformed to "5" (U+0035) by compatibility mapping.
I found out while writing an universal file tracer for the Chromium project. [Spoiler alert]: The gory details are buried in trace_inputs.py. Note that the code in trace_inputs.py also does case normalization, which a subject in itself, maybe for another post.

I wasn't sure about each OS behaviour with regard to file path handling so I wrote a small python script to figure out what is happening exactly. I pasted the script's code at the bottom of this post. I'll let you guess what happens on each of the following OS: OSX 10.8, Ubuntu 12.04 with LANG=foo.UTF-8 and Windows 7. The analysis is under the eye of if NF or NFK is employed when trying to open a file. I'm explicitly excluding case normalization (a vs A) for this post.

Ubuntu

Let's start with Ubuntu, which behaved exactly as I imagined. Note that I'm using LANG=foo.UTF-8:
~/src/foo> ./unicode_is_hard.py
e-acute-circumflex
Found 2 different encodings for u'\u1ebf'
  NFKC: u'\u1ebf'
   NFD: u'e\u0302\u0301'
   NFC: u'\u1ebf'
  NFKD: u'e\u0302\u0301'
  OS returned: u'NFC\u1ebf', u'NFDe\u0302\u0301', u'NFKC\u1ebf', u'NFKDe\u0302\u0301'

roman_numeral_one
Found 2 different encodings for u'\u2160'
  NFKC: u'I'
   NFD: u'\u2160'
   NFC: u'\u2160'
  NFKD: u'I'
  OS returned: u'NFC\u2160', u'NFD\u2160', u'NFKCI', u'NFKDI'

e-acute-circumflex + roman_numeral_one
Found 4 different encodings for u'\u1ebf\u2160'
  NFKC: u'\u1ebfI'
   NFD: u'e\u0302\u0301\u2160'
   NFC: u'\u1ebf\u2160'
  NFKD: u'e\u0302\u0301I'
  OS returned: u'NFC\u1ebf\u2160', u'NFDe\u0302\u0301\u2160', u'NFKC\u1ebfI', u'NFKDe\u0302\u0301I'
How Nautilus displays the files
As you can see, the file system is not processing the Unicode characters at all so what you write is what you get. Now I'll let you guess what happens on OSX and Windows. Prepare your bets.

Windows

Windows is interesting because it didn't behave as I expected.
D:\src>python unicode_is_hard.py
e-acute-circumflex
Found 2 different encodings for u'\u1ebf'
  NFKC: u'\u1ebf'
   NFD: u'e\u0302\u0301'
   NFC: u'\u1ebf'
  NFKD: u'e\u0302\u0301'
  OS returned: u'NFC\u1ebf', u'NFDe\u0302\u0301', u'NFKC\u1ebf', u'NFKDe\u0302\u0301'

roman_numeral_one
Found 2 different encodings for u'\u2160'
  NFKC: u'I'
   NFD: u'\u2160'
   NFC: u'\u2160'
  NFKD: u'I'
  OS returned: u'NFC\u2160', u'NFD\u2160', u'NFKCI', u'NFKDI'

e-acute-circumflex + roman_numeral_one
Found 4 different encodings for u'\u1ebf\u2160'
  NFKC: u'\u1ebfI'
   NFD: u'e\u0302\u0301\u2160'
   NFC: u'\u1ebf\u2160'
  NFKD: u'e\u0302\u0301I'
  OS returned: u'NFC\u1ebf\u2160', u'NFDe\u0302\u0301\u2160', u'NFKC\u1ebfI', u'NFKDe\u0302\u0301I'

How Windows Explorer displays the files
As you can see, and that was unexpected to me, Windows doesn't normalize the unicode code points to NFK so you will get whatever the program used like for Ubuntu. As a spoiler, cygwin is doing the same but I left its output for brevity. Note how the rendering is significantly different for \u2160 (I) unlike Ubuntu's default rendering in Unity.

OSX

If you already played with unicode code point normalization and had to touch OSX, you problaby know why I kept it as the last one:
~/src/foo> ./unicode_is_hard.py
e-acute-circumflex
Found 2 different encodings for u'\u1ebf'
  NFKC: u'\u1ebf'
   NFD: u'e\u0302\u0301'
   NFC: u'\u1ebf'
  NFKD: u'e\u0302\u0301'
  OS returned: u'NFCe\u0302\u0301', u'NFDe\u0302\u0301', u'NFKCe\u0302\u0301', u'NFKDe\u0302\u0301'
  2 are not matching.
  For  NFC, expected  NFC, NFKC but could with  NFC,  NFD, NFKC, NFKD
  For NFKC, expected  NFC, NFKC but could with  NFC,  NFD, NFKC, NFKD
  For  NFD, expected  NFD, NFKD but could with  NFC,  NFD, NFKC, NFKD
  For NFKD, expected  NFD, NFKD but could with  NFC,  NFD, NFKC, NFKD

roman_numeral_one
Found 2 different encodings for u'\u2160'
  NFKC: u'I'
   NFD: u'\u2160'
   NFC: u'\u2160'
  NFKD: u'I'
  OS returned: u'NFC\u2160', u'NFD\u2160', u'NFKCI', u'NFKDI'

e-acute-circumflex + roman_numeral_one
Found 4 different encodings for u'\u1ebf\u2160'
  NFKC: u'\u1ebfI'
   NFD: u'e\u0302\u0301\u2160'
   NFC: u'\u1ebf\u2160'
  NFKD: u'e\u0302\u0301I'
  OS returned: u'NFCe\u0302\u0301\u2160', u'NFDe\u0302\u0301\u2160', u'NFKCe\u0302\u0301I', u'NFKDe\u0302\u0301I'
  2 are not matching.
  For  NFC, expected  NFC but could with  NFC,  NFD
  For NFKC, expected NFKC but could with NFKC, NFKD
  For  NFD, expected  NFD but could with  NFC,  NFD
  For NFKD, expected NFKD but could with NFKC, NFKD
How Finder displays the files
As you can see, OSX is the only OS to normalize Unicode code points. But it is doing partial normalization, only for NFD vs NFC but not for NFKx vs NFx. That's interesting as I'd have expected NFK handling instead. So a file written in NFKx cannot be opened in NFx but NFC vs NFD is transparently converted.

The code

#!/usr/bin/env python
# Copyright (c) 2012 Marc-Antoine Ruel. All rights reserved.

"""This scripts create a subdirectory named unicode_is_hard which contains
various files in various encoding.

See http://en.wikipedia.org/wiki/Unicode_equivalence for the various UTF
encodings.
"""

import os
import shutil
import sys
import unicodedata

BASE_DIR = os.path.dirname(os.path.abspath(__file__))

def try_with_string(work_dir, unicode_string):
  """Encodes an unicode string with 4 different encodings and tries to open the
  file with the other encodings.
  """
  # Delete the work directory if present.
  if os.path.isdir(work_dir):
    shutil.rmtree(work_dir)
  os.mkdir(work_dir)

  encodings = (u'NFC', u'NFKC', u'NFD', u'NFKD')
  encoded = dict(
      (key, unicodedata.normalize(key, unicode_string)) for key in encodings)
  filenames = dict((key, key + value) for key, value in encoded.iteritems())

  # This implicitly assumes python does the right thing here.
  different_encodings = len(set(encoded.itervalues()))
  print(
      'Found %d different encodings for %r' %
      (different_encodings, unicode_string))
  for encoding, value in encoded.iteritems():
    print('  %4s: %r' % (encoding, value))

  # Now for each type, create a file. See if the other encodings can open it.
  for filename in filenames.itervalues():
    open(os.path.join(work_dir, filename), 'w').close()

  files_found = sorted(os.listdir(work_dir))
  print('  OS returned: %s' % ', '.join(repr(i) for i in files_found))
  not_matching = set(filenames.itervalues()).difference(files_found)
  if not_matching:
    print('  %d are not matching.' % len(not_matching))

  expected = {}
  for encoding, value in encoded.iteritems():
    # Assumes comparison in python is correctly done.
    for encoding_to_confirm, value_to_confirm in encoded.iteritems():
      if value_to_confirm == value:
        expected.setdefault(encoding, []).append(encoding_to_confirm)

  # Now do the 16 combinations to try to open each files with the other
  # encoding.
  actual = {}
  for encoding, original_filename in filenames.iteritems():
    for encoding_to_try, value_to_try in encoded.iteritems():
      # Try to open the file with the other encoding.
      try:
        open(os.path.join(work_dir, encoding + value_to_try)).close()
        actual.setdefault(encoding, []).append(encoding_to_try)
      except IOError:
        pass

  # Print if anything unexpected succeeded. This happens in the case
  # encoded[encoding1] != encoded[encoding2] but they could open each other.
  for encoding in encodings:
    if sorted(expected[encoding]) != sorted(actual[encoding]):
      print(
          '  For %4s, expected %s but could with %s' % (
            encoding,
            ', '.join('%4s' % i for i in sorted(expected[encoding])),
            ', '.join('%4s' % i for i in sorted(actual[encoding]))))

def main():
  work_dir = os.path.join(unicode(BASE_DIR), u'unicode_is_hard')

  # Examples taken from the Wikipedia page and unicodedata python stdlib doc.
  # http://docs.python.org/2/library/unicodedata.html
  e_acute_circumflex = u'\u1ebf'
  roman_numeral_one = u'\u2160'

  print('e-acute-circumflex')
  try_with_string(work_dir, e_acute_circumflex)

  print('\nroman_numeral_one')
  try_with_string(work_dir, roman_numeral_one)

  print('\ne-acute-circumflex + roman_numeral_one')
  try_with_string(work_dir, e_acute_circumflex + roman_numeral_one)
  return 0

if __name__ == '__main__':
  sys.exit(main())