February 20, 2014 · Python

Unicode I/O and Locales in Python

I recently ran into a weird error when running some Python code in a chroot jail.

s = '你好'
with open('/tmp/asdf', 'w') as f:
  f.write(s)

gave me

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

The same happened with interprocess I/O:

with subprocess.Popen(
    '/usr/bin/cat',
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    universal_newlines=True) as proc:
  (cmd_stdout, cmd_stderr) = proc.communicate('你好')

gave me

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/subprocess.py", line 578, in check_output
    output, unused_err = process.communicate(timeout=timeout)
  File "/usr/lib/python3.3/subprocess.py", line 908, in communicate
    stdout = _eintr_retry_call(self.stdout.read)
  File "/usr/lib/python3.3/subprocess.py", line 479, in _eintr_retry_call
    return func(*args)
  File "/usr/lib/python3.3/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

It turns out that Python str objects are encoded to and decoded from raw bytes during I/O (print, file I/O, IPC, etc.) using the default system locale encoding. The advantage is that, if your system locale is set up correctly, everything just works: there's no explicit encoding or decoding between strings and bytes. The downside is that Python code that runs fine on one machine can fail mysteriously on another.
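
You can see which encoding Python picked up from the environment with the standard library alone; a quick sanity check:

```python
import locale
import sys

# The locale's preferred encoding drives Python 3's default text I/O.
print(locale.getpreferredencoding())  # e.g. 'UTF-8', or 'ANSI_X3.4-1968' under LANG=C
print(sys.stdout.encoding)            # encoding of the standard output stream
```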

In my case, the chroot jail yielded:

$ locale
LANG=C
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=

Solution A

The simplest solution is to set the system locale, either just for the Python program or for your shell. For example,

# Run ./my_program.py with a custom LANG value.
LANG=en_US.UTF-8 ./my_program.py

or

# Set locale for current shell session.
export LANG=en_US.UTF-8
./my_program.py

In fact, it’s probably a good idea to add the export line to your ~/.bashrc, or to configure the locale however your Linux distro prefers.
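
If only the standard streams (print and friends) are misbehaving, Python also honors the PYTHONIOENCODING environment variable, which overrides the locale for stdin/stdout/stderr only (it does not affect open()):

```shell
# Override the encoding of stdin/stdout/stderr, regardless of locale.
PYTHONIOENCODING=utf-8 python3 -c 'import sys; print(sys.stdout.encoding)'
# prints: utf-8
```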

Solution B

On the other hand, you can explicitly set the encoding used during I/O in your Python code.

For file I/O, in Python 3.x, you can set the encoding argument of open:

# Python 3.x
with open('/tmp/asdf', 'w', encoding='utf-8') as f:
  f.write('你好')

In Python 2.x, you can use codecs.open:

# Python 2.x
import codecs
with codecs.open('/tmp/asdf', 'w', encoding='utf-8') as f:
  f.write('你好')

Alternatively, you can use raw mode for file I/O:

with open('/tmp/asdf', 'wb') as f:
  f.write('你好'.encode('utf-8'))
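
Either way, as long as the encoding is explicit on both ends, the round trip is locale-independent. For example:

```python
# Write raw UTF-8 bytes, then read them back with an explicit decode.
with open('/tmp/asdf', 'wb') as f:
  f.write('你好'.encode('utf-8'))

with open('/tmp/asdf', 'rb') as f:
  assert f.read().decode('utf-8') == '你好'
```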

For IPC with subprocess, don't use universal_newlines=True, since that always encodes and decodes using the system locale encoding. Work with raw bytes instead:

with subprocess.Popen(
    '/usr/bin/cat',
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE) as proc:
  (cmd_stdout_bytes, cmd_stderr_bytes) = proc.communicate('你好'.encode('utf-8'))
  (cmd_stdout, cmd_stderr) = (
      cmd_stdout_bytes.decode('utf-8'), cmd_stderr_bytes.decode('utf-8'))
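
On Python 3.4 and later, subprocess.check_output also accepts an input argument, so the bytes-in/bytes-out pattern gets shorter. A sketch (using cat from PATH rather than a fixed path):

```python
import subprocess

# Encode before sending, decode after receiving; the locale never gets involved.
out_bytes = subprocess.check_output(['cat'], input='你好'.encode('utf-8'))
print(out_bytes.decode('utf-8'))  # 你好
```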