A string of unexpected lengths
When you start learning to program, or working in a new language, it’s often suggested that you build a simple program like Battleship or Tic-tac-toe. The games’ rules are well-defined and easy to grasp, and you only need to read and print text to get started. This frees you up to focus on the mechanics and ideas of the programming language you’re learning.
To create the game’s interface in the terminal,
you end up doing a lot of string formatting: board layout,
progress bars, announcements to the user.
The length of a string is useful when formatting for
terminals, since they usually use monospaced fonts.
For example, while writing a game of Battleship in Python
we might use the
len() function explicitly for formatting math,
or implicitly in convenient built-in methods like
center(), to make exciting messages like the following:
>>> msg = 'battleship sunk!'
>>> len(msg)
16
>>> def underlined(msg):
...     return msg + '\n' + '-' * len(msg)
...
>>> print underlined(msg)
battleship sunk!
----------------
>>> print msg.center(30, '*')
*******battleship sunk!*******
However, the code above won’t always work as we expect because
len() of text isn’t necessarily the same as its width
when displayed in a terminal.
Let’s explore three ways these numbers can differ.
Multiple bytes for one character
Byte strings (known as “strings” in Python 2) have formatting methods like
center() which assume that the displayed width of a string is equal to the number of bytes it contains.
But this assumption doesn’t always hold!
The single visible character
Ä might be encoded as several bytes in a source file:
>>> shipname = 'Ägir'
>>> shipname
'\xc3\x84gir'
>>> len(shipname)
5
>>> print shipname.center(10, '=')
==Ägir===
>>> print shipname + '\n' + '-' * len(shipname)
Ägir
-----
The number of bytes in this byte string doesn’t match the number of characters, so built-in formatting operations don’t behave correctly. A Unicode string counts characters instead of bytes, which fixes both examples:
>>> shipname = u'Ägir'
>>> len(shipname)
4
>>> shipname.center(10, u'=')
u'===\xc4gir==='
>>> print shipname.center(10, u'=')
===Ägir===
>>> print shipname + '\n' + '-' * len(shipname)
Ägir
----
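In Python 3 the same distinction shows up as bytes versus str. The following is a minimal sketch, written for Python 3 rather than the Python 2 used in the sessions above, of decoding before doing any width math. It assumes the data is UTF-8, and the helper name underlined_text is hypothetical.

```python
# Python 3 sketch: decode bytes to text before measuring.
raw = 'Ägir'.encode('utf-8')   # bytes, as they might arrive from a file or socket
text = raw.decode('utf-8')     # assumes the data really is UTF-8

def underlined_text(data, encoding='utf-8'):
    """Return the text followed by one dash per character.

    Accepts bytes or str; bytes are decoded first (hypothetical helper).
    """
    if isinstance(data, bytes):
        data = data.decode(encoding)
    return data + '\n' + '-' * len(data)

# len() counts bytes for bytes objects and code points for str objects.
print(len(raw), len(text))     # 5 4
```

Calling underlined_text(raw) returns the name followed by four dashes, one per character, rather than five.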
ANSI escape code formatting
ANSI escape codes let us format text
by writing bytes like
'\x1b[31m' to start writing in red, and
'\x1b[0m' to stop. If we build a string containing these sequences,
the calculated length of our string won’t match its
displayed width in a terminal:
>>> s = '\x1b[31mhit!\x1b[0m'
>>> print s
hit!
>>> len(s)
13
>>> print s + '\n' + '-' * len(s)
hit!
-------------
>>> print s.center(14, '*')
hit!*
The colored string reports a length larger than its displayed width, causing problems for built-in text-alignment methods. Fortunately, there are several Python libraries that make it easier to work with colored string-like objects that don’t include formatting characters in their length calculations.
Clint’s colored strings have formatting methods that produce the output you expect:
>>> from clint.textui.colored import green
>>> len(green(u'ship'))
4
>>> green(u'ship').center(10)
<GREEN-string: u'   ship   '>
>>> print green(u'ship').center(10)
   ship   
but this no longer works once two colored strings are combined into a new colored string:
>>> from clint.textui.colored import blue, green
>>> len(green('ship') + blue('ocean'))
39
>>> green('ship') + blue('ocean')
'\x1b[32m\x1b[22mship\x1b[39m\x1b[34m\x1b[22mocean\x1b[39m'
>>> print (green('ship') + blue('ocean')).center(10, '*')
shipocean
My own attempt at solving this problem uses smart string objects which know how to concatenate:
>>> from curtsies.fmtfuncs import green, blue
>>> len(green(u'ship'))
4
>>> green(u'ship').center(10)
green("   ship   ")
>>> print green(u'ship').center(10)
   ship   
>>> s = green(u'ship') + blue(u'ocean')
>>> len(s)
9
>>> print s.center(13, '*')
**shipocean**
but it doesn’t correctly implement every formatting method yet: above,
**shipocean** has lost its color information because
a fallback implementation of
center() was used.2
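For the simplest cases no library is needed. The following is a minimal sketch, in Python 3 with only the standard library, that strips color codes before measuring. It assumes the only escape sequences present are SGR color and style codes of the form '\x1b[...m'. The names SGR_PATTERN, visible_len and center_colored are hypothetical and do not come from Clint or curtsies.

```python
import re

# Matches SGR sequences such as '\x1b[31m' or '\x1b[0m'.
# Other escape sequences (cursor movement and so on) are not handled.
SGR_PATTERN = re.compile(r'\x1b\[[0-9;]*m')

def visible_len(s):
    """Length of s once SGR color codes are removed."""
    return len(SGR_PATTERN.sub('', s))

def center_colored(s, width, fillchar=' '):
    """Center s using its visible length, leaving the escape codes intact."""
    pad = max(0, width - visible_len(s))
    left = pad // 2
    return fillchar * left + s + fillchar * (pad - left)

hit = '\x1b[31mhit!\x1b[0m'
print(len(hit), visible_len(hit))   # 13 4
```

Because the padding is added outside the escape sequences, center_colored(hit, 10, '*') yields three uncolored asterisks on each side of the red text. This only holds when the string ends with a reset code, as hit does.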
The Unicode jungle
Formatting methods of Python Unicode strings like
center() assume that the display width of a string is equal
to its character count. But this assumption doesn’t always hold!
What if we use fullwidth Unicode characters?
>>> battleship = u'扶桑'
>>> len(battleship)
2
>>> print battleship + '\n' + '-' * len(battleship)
扶桑
--
Or combining characters, which modify the preceding character and take up no width of their own?
>>> battleship = u'Fuso\u0304'
>>> print battleship
Fusō
>>> len(battleship)
5
>>> print battleship.center(6, u'*')
Fusō*
The displayed width of a Unicode string can differ from the number of characters in it, in either direction.
Fortunately, we can use the POSIX standard function
wcswidth to calculate
the display width of a Unicode string.
We can use this function to rebuild our
basic formatting functionality.4
>>> from wcwidth import wcswidth
>>> wcswidth(battleship)
4
>>> def center(s, n, fillchar=' '):
...     pad = max(0, n - wcswidth(s))
...     lpad, rpad = (pad + 1) // 2, pad // 2
...     return lpad * fillchar + s + rpad * fillchar
...
>>> print center(battleship, 6, '*')
*Fusō*
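If installing wcwidth isn’t an option, the standard library’s unicodedata module gets part of the way there. The following is a rough sketch in Python 3, not a replacement for wcswidth: it assumes wide and fullwidth characters occupy two columns and that nonspacing or enclosing marks occupy none, and it ignores control characters and other zero-width cases. The names approx_width and underlined_wide are hypothetical.

```python
import unicodedata

def approx_width(s):
    """Rough display width of s in terminal columns.

    Counts 2 for East Asian wide/fullwidth characters, 0 for
    nonspacing and enclosing marks, and 1 for everything else.
    """
    width = 0
    for ch in s:
        if unicodedata.category(ch) in ('Mn', 'Me'):
            continue  # combining marks stack on the previous character
        if unicodedata.east_asian_width(ch) in ('W', 'F'):
            width += 2
        else:
            width += 1
    return width

def underlined_wide(s):
    """Underline s with one dash per display column rather than per character."""
    return s + '\n' + '-' * approx_width(s)

print(approx_width('扶桑'), approx_width('Fuso\u0304'))   # 4 4
```

Both ship names from this section come out four columns wide, even though len() reports 2 and 5 for them.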
Unfortunately, for versions of Python earlier than 3.3 it’s still possible that the
len() of a Unicode character like
u'\U00010123' will be 2 if your Python was built to use the “narrow” internal representation of Unicode. You can check this with
sys.maxunicode: if it’s less than 1114111 (0x10FFFF), the highest Unicode code point, some Unicode characters are going to have a
len() other than 1.
The Unicode spec calls this an extended grapheme cluster. Interestingly, the
Character type in the Swift programming language represents an extended grapheme cluster and may be composed of multiple Unicode code points.