M-A's

technology blog

Thursday 15 November 2012

Unicode equivalence may not be handled as you think

Unicode normalization does not always happen the way you would expect, especially with respect to file systems. First, I recommend reading the Wikipedia page http://en.wikipedia.org/wiki/Unicode_equivalence, which is fairly well written:
In general, the code points of truly identical characters (which can be rendered in the same way in Unicode fonts) are defined to be canonically equivalent.
Unicode has 2 notions of equivalence, canonical (NF) and "compatibility" (NFK), and each string can be represented in pre-composed (C) or decomposed (D) form, giving 4 normal forms: NFC, NFD, NFKC and NFKD.
In order to compare or search Unicode strings, software can use either composed or decomposed forms; this choice does not matter as long as it is the same for all strings involved in a search, comparison, etc. On the other hand, the choice of equivalence criteria can affect search results. For instance some typographic ligatures like U+FB03 (ffi), roman numerals like U+2168 (Ⅸ) and even subscripts and superscripts, e.g. U+2075 (⁵) have their own Unicode code points. Canonical normalization (NF) does not affect any of these, but compatibility normalization (NFK) will decompose the ffi ligature into the constituent letters, so a search for U+0066 (f) as substring would succeed in an NFKC normalization of U+FB03 but not in NFC normalization of U+FB03. Likewise when searching for the Latin letter I (U+0049) in the precomposed Roman Numeral Ⅸ (U+2168). Similarly the superscript "⁵" (U+2075) is transformed to "5" (U+0035) by compatibility mapping.
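
To make the quote concrete, here is a minimal illustration with python's stdlib unicodedata module (the same module the script at the bottom of this post uses):

import unicodedata

ligature = u'\ufb03'  # U+FB03, LATIN SMALL LIGATURE FFI.
print(repr(unicodedata.normalize('NFC', ligature)))   # u'\ufb03': unchanged.
print(repr(unicodedata.normalize('NFKC', ligature)))  # u'ffi': 3 separate letters.
print(u'f' in unicodedata.normalize('NFC', ligature))   # False
print(u'f' in unicodedata.normalize('NFKC', ligature))  # True
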
I found this out while writing a universal file tracer for the Chromium project. [Spoiler alert]: the gory details are buried in trace_inputs.py. Note that the code in trace_inputs.py also does case normalization, which is a subject in itself, maybe for another post.

I wasn't sure about each OS's behaviour with regard to file path handling, so I wrote a small python script to figure out what happens exactly. The script's code is pasted at the bottom of this post. I'll let you guess what happens on each of the following OSes: OSX 10.8, Ubuntu 12.04 with LANG=foo.UTF-8 and Windows 7. The analysis looks at whether NF or NFK normalization is applied when trying to open a file. I'm explicitly excluding case normalization (a vs A) from this post.

Ubuntu

Let's start with Ubuntu, which behaved exactly as I imagined. Note that I'm using LANG=foo.UTF-8:
~/src/foo> ./unicode_is_hard.py
e-acute-circumflex
Found 2 different encodings for u'\u1ebf'
  NFKC: u'\u1ebf'
   NFD: u'e\u0302\u0301'
   NFC: u'\u1ebf'
  NFKD: u'e\u0302\u0301'
  OS returned: u'NFC\u1ebf', u'NFDe\u0302\u0301', u'NFKC\u1ebf', u'NFKDe\u0302\u0301'

roman_numeral_one
Found 2 different encodings for u'\u2160'
  NFKC: u'I'
   NFD: u'\u2160'
   NFC: u'\u2160'
  NFKD: u'I'
  OS returned: u'NFC\u2160', u'NFD\u2160', u'NFKCI', u'NFKDI'

e-acute-circumflex + roman_numeral_one
Found 4 different encodings for u'\u1ebf\u2160'
  NFKC: u'\u1ebfI'
   NFD: u'e\u0302\u0301\u2160'
   NFC: u'\u1ebf\u2160'
  NFKD: u'e\u0302\u0301I'
  OS returned: u'NFC\u1ebf\u2160', u'NFDe\u0302\u0301\u2160', u'NFKC\u1ebfI', u'NFKDe\u0302\u0301I'
How Nautilus displays the files
As you can see, the file system does not process the Unicode characters at all, so what you write is what you get.
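
A minimal sketch of this "bytes in, bytes out" behaviour (assuming python 2, a UTF-8 locale and an ext4 volume; the demo_ filenames are mine):

import os
import unicodedata

# The same character in its composed (NFC) and decomposed (NFD) forms.
nfc = unicodedata.normalize('NFC', u'\u1ebf')
nfd = unicodedata.normalize('NFD', u'\u1ebf')
open((u'demo_' + nfc).encode('utf-8'), 'w').close()
open((u'demo_' + nfd).encode('utf-8'), 'w').close()
# Prints 2: two distinct files exist, the kernel compares raw bytes only.
print(len([n for n in os.listdir('.') if n.startswith('demo_')]))

Now I'll let you guess what happens on OSX and Windows. Place your bets.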

Windows

Windows is interesting because it didn't behave as I expected.
D:\src>python unicode_is_hard.py
e-acute-circumflex
Found 2 different encodings for u'\u1ebf'
  NFKC: u'\u1ebf'
   NFD: u'e\u0302\u0301'
   NFC: u'\u1ebf'
  NFKD: u'e\u0302\u0301'
  OS returned: u'NFC\u1ebf', u'NFDe\u0302\u0301', u'NFKC\u1ebf', u'NFKDe\u0302\u0301'

roman_numeral_one
Found 2 different encodings for u'\u2160'
  NFKC: u'I'
   NFD: u'\u2160'
   NFC: u'\u2160'
  NFKD: u'I'
  OS returned: u'NFC\u2160', u'NFD\u2160', u'NFKCI', u'NFKDI'

e-acute-circumflex + roman_numeral_one
Found 4 different encodings for u'\u1ebf\u2160'
  NFKC: u'\u1ebfI'
   NFD: u'e\u0302\u0301\u2160'
   NFC: u'\u1ebf\u2160'
  NFKD: u'e\u0302\u0301I'
  OS returned: u'NFC\u1ebf\u2160', u'NFDe\u0302\u0301\u2160', u'NFKC\u1ebfI', u'NFKDe\u0302\u0301I'

How Windows Explorer displays the files
As you can see, and this was unexpected to me, Windows doesn't normalize the Unicode code points to NFK, so you get whatever the program wrote, just like on Ubuntu. As a spoiler, cygwin does the same, but I left its output out for brevity. Note how the rendering of \u2160 (I) is significantly different from Ubuntu's default rendering in Unity.

OSX

If you have already played with Unicode code point normalization and had to touch OSX, you probably know why I kept it for last:
~/src/foo> ./unicode_is_hard.py
e-acute-circumflex
Found 2 different encodings for u'\u1ebf'
  NFKC: u'\u1ebf'
   NFD: u'e\u0302\u0301'
   NFC: u'\u1ebf'
  NFKD: u'e\u0302\u0301'
  OS returned: u'NFCe\u0302\u0301', u'NFDe\u0302\u0301', u'NFKCe\u0302\u0301', u'NFKDe\u0302\u0301'
  2 are not matching.
  For  NFC, expected  NFC, NFKC but could with  NFC,  NFD, NFKC, NFKD
  For NFKC, expected  NFC, NFKC but could with  NFC,  NFD, NFKC, NFKD
  For  NFD, expected  NFD, NFKD but could with  NFC,  NFD, NFKC, NFKD
  For NFKD, expected  NFD, NFKD but could with  NFC,  NFD, NFKC, NFKD

roman_numeral_one
Found 2 different encodings for u'\u2160'
  NFKC: u'I'
   NFD: u'\u2160'
   NFC: u'\u2160'
  NFKD: u'I'
  OS returned: u'NFC\u2160', u'NFD\u2160', u'NFKCI', u'NFKDI'

e-acute-circumflex + roman_numeral_one
Found 4 different encodings for u'\u1ebf\u2160'
  NFKC: u'\u1ebfI'
   NFD: u'e\u0302\u0301\u2160'
   NFC: u'\u1ebf\u2160'
  NFKD: u'e\u0302\u0301I'
  OS returned: u'NFCe\u0302\u0301\u2160', u'NFDe\u0302\u0301\u2160', u'NFKCe\u0302\u0301I', u'NFKDe\u0302\u0301I'
  2 are not matching.
  For  NFC, expected  NFC but could with  NFC,  NFD
  For NFKC, expected NFKC but could with NFKC, NFKD
  For  NFD, expected  NFD but could with  NFC,  NFD
  For NFKD, expected NFKD but could with NFKC, NFKD
How Finder displays the files
As you can see, OSX is the only one of the three OSes to normalize Unicode code points. But it does partial normalization: NFC vs NFD is converted, but NFKx vs NFx is not. That's interesting, as I'd have expected NFK handling instead. So a file written in NFKx cannot be opened with an NFx path, but NFC vs NFD is transparently converted.
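
A minimal sketch of that behaviour (assuming python 2 and an HFS+ volume; the demo_ filename is mine):

import unicodedata

nfc = unicodedata.normalize('NFC', u'\u1ebf')
nfd = unicodedata.normalize('NFD', u'\u1ebf')
open(u'demo_' + nfc, 'w').close()
# Succeeds on HFS+: the file system treats the NFC and NFD spellings as the
# same name. On ext4 or NTFS the same call raises IOError instead.
open(u'demo_' + nfd).close()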

The code

#!/usr/bin/env python
# Copyright (c) 2012 Marc-Antoine Ruel. All rights reserved.

"""This scripts create a subdirectory named unicode_is_hard which contains
various files in various encoding.

See http://en.wikipedia.org/wiki/Unicode_equivalence for the various UTF
encodings.
"""

import os
import shutil
import sys
import unicodedata

BASE_DIR = os.path.dirname(os.path.abspath(__file__))

def try_with_string(work_dir, unicode_string):
  """Encodes an unicode string with 4 different encodings and tries to open the
  file with the other encodings.
  """
  # Delete the work directory if present.
  if os.path.isdir(work_dir):
    shutil.rmtree(work_dir)
  os.mkdir(work_dir)

  encodings = (u'NFC', u'NFKC', u'NFD', u'NFKD')
  encoded = dict(
      (key, unicodedata.normalize(key, unicode_string)) for key in encodings)
  filenames = dict((key, key + value) for key, value in encoded.iteritems())

  # This implicitly assumes python does the right thing here.
  different_encodings = len(set(encoded.itervalues()))
  print(
      'Found %d different encodings for %r' %
      (different_encodings, unicode_string))
  for encoding, value in encoded.iteritems():
    print('  %4s: %r' % (encoding, value))

  # Now for each type, create a file. See if the other encodings can open it.
  for filename in filenames.itervalues():
    open(os.path.join(work_dir, filename), 'w').close()

  files_found = sorted(os.listdir(work_dir))
  print('  OS returned: %s' % ', '.join(repr(i) for i in files_found))
  not_matching = set(filenames.itervalues()).difference(files_found)
  if not_matching:
    print('  %d are not matching.' % len(not_matching))

  expected = {}
  for encoding, value in encoded.iteritems():
    # Assumes comparison in python is correctly done.
    for encoding_to_confirm, value_to_confirm in encoded.iteritems():
      if value_to_confirm == value:
        expected.setdefault(encoding, []).append(encoding_to_confirm)

  # Now try the 16 combinations: opening each file with each of the other
  # normalization forms.
  actual = {}
  for encoding in filenames:
    for encoding_to_try, value_to_try in encoded.iteritems():
      # Try to open the file with the other encoding.
      try:
        open(os.path.join(work_dir, encoding + value_to_try)).close()
        actual.setdefault(encoding, []).append(encoding_to_try)
      except IOError:
        pass

  # Print if anything unexpected succeeded. This happens in the case
  # encoded[encoding1] != encoded[encoding2] but they could open each other.
  for encoding in encodings:
    if sorted(expected[encoding]) != sorted(actual[encoding]):
      print(
          '  For %4s, expected %s but could with %s' % (
            encoding,
            ', '.join('%4s' % i for i in sorted(expected[encoding])),
            ', '.join('%4s' % i for i in sorted(actual[encoding]))))

def main():
  work_dir = os.path.join(unicode(BASE_DIR), u'unicode_is_hard')

  # Examples taken from the Wikipedia page and unicodedata python stdlib doc.
  # http://docs.python.org/2/library/unicodedata.html
  e_acute_circumflex = u'\u1ebf'
  roman_numeral_one = u'\u2160'

  print('e-acute-circumflex')
  try_with_string(work_dir, e_acute_circumflex)

  print('\nroman_numeral_one')
  try_with_string(work_dir, roman_numeral_one)

  print('\ne-acute-circumflex + roman_numeral_one')
  try_with_string(work_dir, e_acute_circumflex + roman_numeral_one)
  return 0

if __name__ == '__main__':
  sys.exit(main())

Friday 12 October 2012

Short 10-item work-efficiency recipe

Here's a repost of a message I wrote internally at Google. I had been asked how to be more efficient, or put another way, how to generate that much code. To get an idea, you can look at the data here:
http://svnsearch.org/svnsearch/repos/CHROMIUM/search?view=plot&author=maruel%40chromium.org. In that time frame, I also contributed to buildbot, Rietveld, and worked on Google-internal projects.

So here's my short 10-item work-efficiency recipe:

1. Always keep the same work schedule as much as possible

But work when your brain is in flow state. If you wake up at noon and get to bed at 3am every day, keep it always the same. When you're 25 years old, it's fine to be less stable in your work schedule. You'll get old eventually, if you survive yourself, and eventually, your body will hate what you do to it. Work on weekends if needed, but keeping a stable schedule is important for maximal brain efficiency. Keep coding up to the exact moment you see yourself unsure of the design for the next line to write; stop coding at that point.

2. Do small changes

Other committers probably have a larger diffstat than me, but their CLs are more complicated and thus harder to read. I try to make small CLs because they are:
  • Easier to glance at to figure out what the CL is doing.
  • Much easier to review, reducing turnaround time -> enabling review over email -> improving your own efficiency.
  • Easier to revert, with less chance of merge errors.
  • Possible to TBR= more often. TBR in this context means "to be reviewed".

3. Learn to cope with review latency

When doing small changes, you can pipeline them to reduce the effect of review latency. You can sometimes cheat with TBR, but don't abuse it too much. Working on 2 projects concurrently helps a lot. I often start with a large change, then split it up into smaller CLs. This always improves the quality of the code.

4. Take time to pay technical debt

It's probably worth keeping aside 20% of your coding time for technical debt:
  • Adding tests. In spring 2011, I took 3 months writing unit tests for depot_tools. It was really depressing but it really helped afterward.
  • Refactoring poor designs. It's good to accumulate technical debt since it's often after the fact that you can really see the best design. Do not try to design too much up front, unless you're designing an API!
Often people are afraid to refactor because of the cost of doing so. Planning is the key. Split it into subtasks:
  1. Identify consumers.
  2. Identify problem and how a new interface would fix the problem.
  3. Evaluate the cost/benefit of a refactor. Think about intangibles, would it reduce the learning curve of a potential new contributor?
  4. Create the new interface.
  5. Write tests for the new interface.
  6. Alert everyone.
  7. Switch consumer to new interface.
  8. Wait for propagation delay.
  9. Remove old interface.
It applies mostly everywhere. It requires being methodical. But sometimes, give up: the refactor is not worth it! A refactor for the fun of refactoring skips the "Identify problem" step. See the next item.

5. Focus on your user's benefit

Do not focus on code or just yet-another-feature. It's not the number of commits or the diffstat; it's stuff that works that counts. Do not fix problems in old code; do not be afraid to deprecate cleanly. Work on complex problems! Fix a complex problem with many simple solutions by splitting the problem into parts so that as many existing components as possible can be reused. Work most of the time on non-visible grungy stuff, but occasionally work on highly visible projects, otherwise you'll get no recognition.

6. Fix repetition with code

Kill idle time with code. Take the time to automate anything you see yourself doing 3 times. Write one-liner scripts and put them in an SCM. Separate your public scripts from the private ones; for example, this permits putting the public ones on GitHub.
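
As a sketch, the kind of trivial script I mean, built on the Chromium tools mentioned elsewhere in this post (the script name and build target are hypothetical):

#!/usr/bin/env python
# sync_and_build.py: I typed this sequence 3 times, so now it's a script.
import subprocess
import sys

# Sync the checkout, then build; stop at the first failing step and
# propagate its exit code.
sys.exit(
    subprocess.call(['gclient', 'sync']) or
    subprocess.call(['ninja', '-C', 'out/Release', 'chrome']))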

7. Write the code to be refactorable in the first place

This is generally overlooked by new grads but it is extremely important. Someone will be stuck with the code you wrote 4 years from now, will hate you, and will wonder why you did it this way. So at least make it easy for them to refactor it.

That's why I always indent the function call arguments at +4 on the following line, so that a single argument addition is a single-line diff that is very easy to revert or merge with other commits. Never align at "("; otherwise, the moment you rename the function, you have to realign all the call sites!
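
A quick illustration in python (the function and arguments are made up):

def frobnicate_widget(a, b, c):
  # Hypothetical stand-in function.
  return (a, b, c)

# Indented at +4: adding or removing an argument is a one-line diff, and
# renaming the function only touches the first line.
result = frobnicate_widget(
    1,
    2,
    3)

# Aligned at '(': renaming frobnicate_widget forces realigning, and thus
# touching, every continuation line.
result = frobnicate_widget(1,
                           2,
                           3)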

Another example is to use style checks or static analysis.

8. Abuse to some extent the "test on prod" mentality

To be able to achieve that, you have to:
  • Write code defensively. Especially with python, sprinkle asserts generously (a minimal sketch follows this list).
  • Plan for failure. If everything breaks, what is the cascading effect? Plan for cascading failures. For example, a gclient sync [the Chromium meta checkout tool] breakage could DDoS the subversion server, so provision accordingly.
  • Keep any single breakage from being too important -> do many incremental changes instead of big ones.
  • Make sure it's easy to revert fast (small CLs).
  • Have some sort of monitoring. Devs yelling at you is a form of monitoring; otherwise, it's time to pay technical debt. I abuse 'devs monitoring' a bit. Try to do it without pissing off your colleagues too much.
  • Unit tests are great. You need tests. But you need integration (smoke) tests too. Are your mocks representative of the actual implementation? Is the component you rely on working with your use case?
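
A minimal sketch of that defensive style (the function and its checks are hypothetical):

import os

def apply_patch(checkout_root, patch_data):
  # Fail fast and loudly on programmer error instead of corrupting the
  # checkout later; the assert message shows the offending value.
  assert os.path.isabs(checkout_root), checkout_root
  assert patch_data, 'empty patch'
  # ... the actual patching would go here.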

9. Optimize your work environment

  • Make your text editor efficient. I personally use vim exclusively, even if I do not consider myself a power user. Spend an inordinate amount of time configuring it. Try a few before settling on one.
  • Use the CLI all the time.
  • Try to never touch the mouse. But still use a high-quality mouse.
  • Use a high-quality keyboard. Grab a keyboard where the F-keys are near the number row if you use F-keys. Millimeters count.
  • Take time to learn how to use your SCM and review tool. As an example in Chromium-land, commands like "git cl comments" help boost your productivity.
  • Not using a GUI makes it easier to effectively use any wait time I may have: grab the laptop, fire up an ssh window and screen -x exactly where I had left off. Set up ssh keys to reduce wait time. That helps with #1.
  • Do not be lazy. Use the best tools available. There are awesome engineers in the world producing new tools that could be of use to you; use their output. So the list of tools is different from last year's; be prepared for change. For example, "If you are not using ninja to build Chromium, you are compiling it wrong(tm)". Do not accept the status quo for toolsets.

10. Optimize your meta-work environment

  • Do not get distracted. Social events are great. Visit other offices if you work in a multi-office environment; meeting colleagues face to face is extremely important to build trust relationships. Otherwise, join meetups to learn how other companies solve common problems. But most of your time should be spent coding if you are a SWE.
  • Communicate asynchronously as much as possible. But when it's time for coordination, communicate synchronously: VC/IM/F2F.
  • Do not be shy. You are not paid to be shy. That doesn't mean being a jerk; just don't be afraid to ask questions. Be prepared to receive RTFM as an answer.
  • Do not meta-work. Filter out as much as possible in Gmail. Force yourself to use keyboard shortcuts in Gmail. Do not spend as much time on G+ as I do. :) Meetings are meta-work. Meta-work is your #1 enemy.
  • Reduce communication overhead as much as possible. Use broadcast instead of 1:1 to spread information. Use mailing lists instead of direct email addresses for easier searchability and archival.
  • If you do not like working with someone, do not work with that person. Do not let management overhead kill your productivity.