When UnicodeDecodeError becomes irrational, check $LANG
December 12, 2015
I spent hours this week trying to understand why an installation script failed on some installations and not on others.
The input is a UTF-8 encoded file, to which we add some XML files, also UTF-8 encoded. These are parsed with Markdown.
python -m lom2mlr.markdown -l -c rationale.md
It is really simple but sometimes we ran into a strange error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/edegoute/Projects/lom2mlr/lom2mlr/lom2mlr/markdown/__main__.py", line 3, in <module>
    compile()
  File "lom2mlr/markdown/__init__.py", line 55, in compile
    extensions=extensions)
  File "/home/edegoute/Projects/lom2mlr/local/lib/python2.7/site-packages/markdown/__init__.py", line 529, in markdownFromFile
    kwargs.get('encoding', None))
  File "/home/edegoute/Projects/lom2mlr/local/lib/python2.7/site-packages/markdown/__init__.py", line 441, in convertFile
    html = self.convert(text)
  File "/home/edegoute/Projects/lom2mlr/local/lib/python2.7/site-packages/markdown/__init__.py", line 375, in convert
    newRoot = treeprocessor.run(root)
  File "lom2mlr/markdown/test_mlr.py", line 76, in run
    print(" " * int(element.tag) + element.text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc9' in position 4: ordinal not in range(128)
At first, it was hard to understand how a Unicode error could point to an iso-8859-1 problem on UTF-8 files. Digging deeper, I found some known problems with ‘codecs.open’ in Python 2.7, but no solution. I tried to force Markdown to treat these files as ‘iso-8859-1’; it then raised a UTF-8 Unicode error at the same line, not when opening the file. That was too much magic for me.
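The traceback names the 'ascii' codec and the failing call is print, which suggests the error comes from writing the text out rather than from reading the files. Here is a minimal sketch of that failure, assuming the output stream falls back to ASCII. The string is a made-up example with U+00C9 at position 4, not text taken from the project, and try_encode is a hypothetical helper:

```python
# -*- coding: utf-8 -*-
# Hypothetical text: four spaces of indentation, then U+00C9 (the character
# named in the traceback) at position 4.
text = u"    \xc9cole"


def try_encode(value, encoding):
    """Return the encoded bytes, or the exception raised by the codec."""
    try:
        return value.encode(encoding)
    except UnicodeEncodeError as error:
        return error


# An ASCII-only output stream has to do this, and it cannot.
ascii_result = try_encode(text, "ascii")
# A UTF-8 output stream does this, and it works.
utf8_result = try_encode(text, "utf-8")

print(type(ascii_result).__name__)
print(repr(utf8_result))
```

The same text encodes fine as UTF-8, so the files themselves are not at fault in this sketch; only the target encoding changes the outcome.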
At that point, I checked again that the installations were identical: same Python version, same pip version, same egg versions. I tried upgrading some eggs, without any success. Finally came the idea to check the environment variables. Bingo! All the systems with failing installations had no localization ($LANG=C). The fix was simple: set $LANG to a UTF-8 locale on those systems.
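To compare machines quickly, it can help to print which locale a process actually ends up with. The sketch below applies the POSIX lookup order (LC_ALL, then LC_CTYPE, then LANG) and checks whether the result declares a UTF-8 codeset. Both helper names are hypothetical, and the check on the locale name is only a rough heuristic:

```python
import os


def effective_ctype_locale(environ=None):
    """Return the locale name that governs character encoding (POSIX order)."""
    environ = os.environ if environ is None else environ
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name)
        if value:  # an empty value counts as unset
            return value
    return "C"  # POSIX default when nothing is set


def looks_utf8(locale_name):
    """Rough check: does the locale name declare a UTF-8 codeset?"""
    # "fr_FR.UTF-8@euro" -> codeset "UTF-8"; "C" -> codeset ""
    codeset = locale_name.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"


if __name__ == "__main__":
    name = effective_ctype_locale()
    print("%s -> %s" % (name, "UTF-8" if looks_utf8(name) else "not UTF-8"))
```

On a failing system like the ones described above, this would report the C locale as not UTF-8; a value such as en_US.UTF-8 (an example only, any installed UTF-8 locale should do) would pass.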
I still don’t understand the magic in Python’s codecs module: why does it compute a different encoding when the function call already asks for one? For programmers, at least, the workaround is simple.
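One possible workaround on the programming side, offered as a sketch and not as what this project did, is to stop relying on the locale and write through an explicit UTF-8 writer. The example uses codecs.getwriter from the standard library on an in-memory buffer; utf8_writer is a hypothetical helper name and the text is made up:

```python
import codecs
import io


def utf8_writer(byte_stream):
    """Wrap a byte stream so that unicode text is always encoded as UTF-8."""
    return codecs.getwriter("utf-8")(byte_stream)


# Demonstration on an in-memory buffer. A real script would wrap the byte
# stream behind its standard output instead (sys.stdout in Python 2,
# sys.stdout.buffer in Python 3).
buffer = io.BytesIO()
out = utf8_writer(buffer)
out.write(u"    \xc9cole\n")  # hypothetical text with a non-ASCII character
written = buffer.getvalue()
```

With this wrapper the output encoding no longer depends on $LANG, at the cost of assuming that whatever reads the output expects UTF-8.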