When UnicodeEncodeError becomes irrational, check $LANG


I spent hours this week trying to understand why an installation script failed on some installations but not on others.

The input is a UTF-8 encoded file, to which we add some XML files, also UTF-8 encoded. These are parsed with Markdown.

python -m lom2mlr.markdown -l -c rationale.md

It is really simple, but on some machines we ran into a strange error:

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/edegoute/Projects/lom2mlr/lom2mlr/lom2mlr/markdown/__main__.py", line 3, in <module>
    compile()
  File "lom2mlr/markdown/__init__.py", line 55, in compile
    extensions=extensions)
  File "/home/edegoute/Projects/lom2mlr/local/lib/python2.7/site-packages/markdown/__init__.py", line 529, in markdownFromFile
    kwargs.get('encoding', None))
  File "/home/edegoute/Projects/lom2mlr/local/lib/python2.7/site-packages/markdown/__init__.py", line 441, in convertFile
    html = self.convert(text)
  File "/home/edegoute/Projects/lom2mlr/local/lib/python2.7/site-packages/markdown/__init__.py", line 375, in convert
    newRoot = treeprocessor.run(root)
  File "lom2mlr/markdown/test_mlr.py", line 76, in run
    print(" " * int(element.tag[1]) + element.text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc9' in position 4: ordinal not in range(128)

At first, it was difficult to understand how UTF-8 files could trigger a Unicode error that looked like an ISO-8859-1 problem (u'\xc9' is É). Digging deeper, I found some known problems with ‘codecs.open’ in Python 2.7, but no solution. I tried to force Markdown to treat these files as ISO-8859-1; it then raised a UTF-8 Unicode error at the same line, not when opening the file. It felt like too much magic to me.
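The last line of the traceback can be reproduced without any file at all, which suggests the problem is in writing text out rather than in reading the input. This is a minimal sketch, not code from lom2mlr: the string is a made-up heading, chosen so that É sits at position 4 as in the traceback.

```python
# Hypothetical heading: four spaces of indentation followed by "École".
# u"\xc9" is the same character (É) that the traceback complains about.
text = u"    \xc9cole"

try:
    # Encoding to ASCII is what has to happen when the output
    # encoding is ASCII; É has no ASCII representation.
    text.encode("ascii")
    failed = False
    message = ""
except UnicodeEncodeError as error:
    failed = True
    message = str(error)

print(message)
```

The message names the 'ascii' codec and position 4, even though no ASCII or ISO-8859-1 file is involved anywhere.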

At that point, I checked again that the installations were identical: same Python version, same pip version, same egg versions. I tried upgrading some eggs, without any success. Finally came the idea to check the environment variables. Bingo! On all systems with failing installations, no localization was configured ($LANG was C). The fix was simple:

export LANG=en_US.UTF-8

That’s it!
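To compare a working machine with a failing one, it can help to print what Python derives from the environment. The sketch below uses only the standard library; describe_encoding_environment is a hypothetical helper name, and the exact values it reports depend on the system and the Python version.

```python
import locale
import os
import sys


def describe_encoding_environment():
    """Collect the settings that influence how printed text is encoded."""
    return {
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        # May be None when output is piped or redirected.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        "preferred_encoding": locale.getpreferredencoding(),
    }


if __name__ == "__main__":
    for name, value in sorted(describe_encoding_environment().items()):
        print("%s = %r" % (name, value))
```

Running it once with LANG=C and once with LANG=en_US.UTF-8 should show whether the two installations really see the same encodings.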

I still don’t understand the magic in Python’s codecs module: why does it compute a different encoding when the function call already asks for one? At least the workaround is simple for programmers.
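One possible programmer-side workaround is to encode the output explicitly instead of relying on the environment. This is only a sketch, assuming the failing call is the print to standard output shown in the traceback; make_utf8_writer is a hypothetical helper, not part of lom2mlr or Markdown, and an in-memory byte stream stands in for stdout.

```python
import codecs
import io


def make_utf8_writer(binary_stream):
    """Wrap a byte stream so that text written to it is always UTF-8 encoded."""
    return codecs.getwriter("utf-8")(binary_stream)


# In-memory byte stream used here in place of the real standard output.
buffer = io.BytesIO()
writer = make_utf8_writer(buffer)

# Same character as in the traceback; no dependency on $LANG here.
writer.write(u"    \xc9cole\n")
encoded = buffer.getvalue()
```

With Python 2.7 the wrapper would be applied to sys.stdout itself, and with Python 3 to sys.stdout.buffer. Setting $LANG remains the simpler fix when the environment is under your control.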
