A string of unexpected lengths

Tom Ballinger Feb 20, 2015

When you start learning to program, or working in a new language, it’s often suggested that you build a simple program like Battleship or Tic-tac-toe. The games’ rules are well-defined and easy to grasp, and you only need to read and print text to get started. This frees you up to focus on the mechanics and ideas of the programming language you’re learning.

To create the game’s interface in the terminal, you end up doing a lot of string formatting: board layout, progress bars, announcements to the user. The length of a string is useful when formatting for terminals, since they usually use monospaced fonts. For example, while writing a game of Battleship in Python we might use the len() function explicitly for formatting math or implicitly in convenient built-in methods like center() to make exciting messages like the following:

>>> msg = 'battleship sunk!'
>>> len(msg)
16
>>> def underlined(msg):
...     return msg + '\n' + '-' * len(msg)
...
>>> print underlined(msg)
battleship sunk!
----------------
>>> print msg.center(30, '*')
*******battleship sunk!*******

However, the code above won’t always work as we expect because the len() of text isn’t necessarily the same as its width when displayed in a terminal. Let’s explore three ways these numbers can differ.

Multiple bytes for one character

Byte strings (known as “strings” in Python 2) have formatting methods like center() which assume that the displayed width of a string is equal to the number of bytes it contains. But this assumption doesn’t always hold! The single visible character Ä might be encoded as several bytes in a source file or terminal.

>>> shipname = 'Ägir'
>>> shipname
'\xc3\x84gir'
>>> len(shipname)
5
>>> print shipname.center(10, '=')
==Ägir===
>>> print shipname + '\n' + '-' * len(shipname)
Ägir
-----

The number of bytes in this byte string doesn’t match the number of characters, so built-in formatting operations don’t behave correctly.

Fortunately, using Unicode strings instead of byte strings solves this problem because they usually report a length equal to the number of Unicode code points they contain.¹

>>> shipname = u'Ägir'
>>> len(shipname)
4
>>> shipname.center(10, u'=')
u'===\xc4gir==='
>>> print shipname.center(10, u'=')
===Ägir===
>>> print shipname + '\n' + '-' * len(shipname)
Ägir
----
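
When text arrives as bytes (from a file or a socket, say), one way to apply this is to decode before measuring or padding. The following is a minimal sketch that assumes the input is UTF-8; center_bytes is a hypothetical helper, not part of any library.

```python
def center_bytes(raw, width, fillchar=u' ', encoding='utf-8'):
    """Decode raw bytes to text first, then pad to the requested width."""
    text = raw.decode(encoding)  # assumption: the caller knows the encoding
    return text.center(width, fillchar)

raw = b'\xc3\x84gir'               # the UTF-8 bytes shown above
print(len(raw))                    # 5 bytes
print(len(raw.decode('utf-8')))    # 4 code points
centered = center_bytes(raw, 10, u'=')
print(len(centered))               # 10
```

Decoding at the edges of the program and working with Unicode strings everywhere else keeps this kind of mismatch out of the formatting code.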

ANSI escape code formatting

ANSI escape codes let us format text by writing bytes like '\x1b[31m' to start writing in red, and '\x1b[39m' to restore the default color (or '\x1b[0m' to reset all formatting). If we build a string containing these sequences, the calculated length of our string won’t match its displayed width in a terminal:

>>> s = '\x1b[31mhit!\x1b[0m'
>>> print s
hit!
>>> len(s)
13
>>> print s + '\n' + '-' * len(s)
hit!
-------------
>>> print s.center(14, '*')
hit!*

The colored string reports a length larger than its displayed width, causing problems for built-in text-alignment methods. Fortunately, there are several Python libraries that make it easier to work with colored string-like objects that don’t include formatting characters in their length calculations.

Clint’s colored strings have formatting methods that produce the output you expect:

>>> from clint.textui.colored import green
>>> len(green(u'ship'))
4
>>> green(u'ship').center(10)
<GREEN-string: u'   ship   '>
>>> print green(u'ship').center(10)
   ship

but this no longer works once two colored strings are combined into a new colored string:

>>> from clint.textui.colored import blue, green
>>> len(green('ship') + blue('ocean'))
39
>>> green('ship') + blue('ocean')
'\x1b[32m\x1b[22mship\x1b[39m\x1b[34m\x1b[22mocean\x1b[39m'
>>> print (green('ship') + blue('ocean')).center(10, '*')
shipocean

My own attempt at solving this problem uses smart string objects which know how to concatenate:

>>> from curtsies.fmtfuncs import green, blue
>>> len(green(u'ship'))
4
>>> green(u'ship').center(10)
green("   ship   ")
>>> print green(u'ship').center(10)
   ship
>>> s = green(u'ship') + blue(u'ocean')
>>> len(s)
9
>>> print s.center(13, '*')
**shipocean**

but it doesn’t correctly implement every formatting method yet: above, **shipocean** has lost its color information because a fallback implementation of center() was used.²
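
For plain strings that merely contain color codes, a lighter-weight alternative is to ignore the escape sequences when measuring. The sketch below is an illustration only: visible_len and center_ansi are hypothetical helpers, and the regular expression only covers color/style ("SGR") sequences of the form ESC [ ... m, not every ANSI escape code.

```python
import re

# Matches SGR sequences such as '\x1b[31m', '\x1b[39m' or '\x1b[0m'.
SGR_RE = re.compile(r'\x1b\[[0-9;]*m')

def visible_len(s):
    """Length of s once color/style escape sequences are removed."""
    return len(SGR_RE.sub('', s))

def center_ansi(s, width, fillchar=' '):
    """Center s in width columns, counting only the visible characters."""
    pad = max(0, width - visible_len(s))
    lpad = pad // 2
    rpad = pad - lpad
    return fillchar * lpad + s + fillchar * rpad

s = '\x1b[31mhit!\x1b[0m'
print(len(s))          # 13
print(visible_len(s))  # 4
```

Because the padding is added outside the escape sequences, the fill characters stay uncolored while the original text keeps its color.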

The Unicode jungle

Formatting methods of Python Unicode strings like center() assume that the display width of a string is equal to its character count. But this assumption doesn’t always hold!

What if we use fullwidth Unicode characters?

>>> battleship = u'扶桑'
>>> len(battleship)
2
>>> print battleship + '\n' + '-' * len(battleship)
扶桑
--

What about multiple Unicode code points that combine to display a single character?³

>>> battleship = u'Fuso\u0304'
>>> print battleship
Fusō
>>> len(battleship)
5
>>> print battleship.center(6, u'*')
Fusō*
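
For this particular string, Unicode normalization offers a partial workaround: the NFC form replaces a base character and combining mark with a single precomposed code point when Unicode defines one. A small sketch, written for Python 3 (where every str is Unicode); it only helps when such a precomposed form exists, so it is not a general fix.

```python
import unicodedata

decomposed = u'Fuso\u0304'   # 'o' followed by U+0304 COMBINING MACRON
composed = unicodedata.normalize('NFC', decomposed)

print(len(decomposed))   # 5
print(len(composed))     # 4: 'o' + U+0304 became the single code point U+014D
```

After normalization, center() pads this string as expected, but other combining sequences can still have a length greater than their width.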

The display width of a Unicode string can differ from the number of code points in it. Fortunately, we can use the POSIX standard function wcswidth to calculate the display width of a Unicode string, and use it to rebuild our basic formatting functionality.⁴

>>> from wcwidth import wcswidth
>>> wcswidth(battleship)
4
>>> def center(s, n, fillchar=' '):
...     pad = max(0, n - wcswidth(s))
...     lpad, rpad = (pad + 1) // 2, pad // 2
...     return lpad * fillchar + s + rpad * fillchar
...
>>> print center(battleship, 6, '*')
*Fusō*
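
If installing wcwidth is not an option, a rough approximation can be built from the standard library's unicodedata module. The sketch below, written for Python 3, is an assumption-laden stand-in rather than a replacement for wcswidth: approx_width is a hypothetical helper that counts wide and fullwidth East Asian characters as two columns and combining marks as zero, and it ignores control characters and other special cases.

```python
import unicodedata

def approx_width(text):
    """Approximate the number of terminal columns text occupies."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn on the previous character
        if unicodedata.east_asian_width(ch) in ('W', 'F'):
            width += 2  # wide and fullwidth characters take two columns
        else:
            width += 1
    return width

def underlined(msg):
    """Underline msg using its approximate display width, not its len()."""
    return msg + '\n' + '-' * approx_width(msg)

print(approx_width(u'\u6276\u6851'))   # 4: two wide characters
print(approx_width(u'Fuso\u0304'))     # 4: the macron adds no column
```

With this helper, the underlined() function from the start of the article produces four dashes for both battleship names.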

¹ Unfortunately, for versions of Python earlier than 3.3 it’s still possible that the len() of a Unicode character like u'\U00010123' will be 2 if your Python was built to use the “narrow” internal representation of Unicode. You can check this with sys.maxunicode: if it’s less than 0x10FFFF (the highest Unicode code point), some Unicode characters are going to have a len() other than 1.
² Want to fix this? Pull requests are welcome! The fix would be pretty similar to the fix for this issue about .ljust and .rjust.
³ The Unicode spec calls this an extended grapheme cluster. Interestingly, the Character class in the Swift programming language represents an extended grapheme cluster and may be composed of multiple Unicode code points.
⁴ Here we’re using a pure Python implementation for compatibility and readability.