Ah, Unicode, XML and Python. The combination can be a bit of a tangle. Not that it gets any friendlier if another language is substituted for Python. Anyway, there is a lot of good material on the subject.

  • This message by Paul Boddie gives a simple explanation of some of the issues in using Python to generate XML with non-ASCII characters. Beware a follow-up that recommends encoding all non-ASCII characters as XML character references. This is overkill.
  • Martin von Löwis, in this message, answers a variety of questions about Unicode/Python/XML/HTML.
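
As a rough illustration of why escaping every non-ASCII character is overkill, the sketch below serializes the same Unicode text twice: once encoded as UTF-8 with a matching XML declaration, and once as ASCII with numeric character references. The element name and the sample text are invented for the example. Both results are well-formed XML; the first is simply easier to read.

```python
# Illustrative only: the <name> element and the sample text are invented.
# The text contains no "&" or "<"; real code must escape those as well.
text = u"Martin von L\u00f6wis"

# Option 1: declare an encoding and encode the whole document with it.
utf8_doc = (u'<?xml version="1.0" encoding="utf-8"?>\n'
            u'<name>%s</name>' % text).encode("utf-8")

# Option 2: stay in ASCII and turn every other character into a
# numeric character reference (the "overkill" approach).
ascii_doc = (u'<?xml version="1.0" encoding="us-ascii"?>\n'
             u'<name>%s</name>' % text).encode("ascii", "xmlcharrefreplace")
```

In the first document the "ö" is stored as the two bytes 0xC3 0xB6; in the second it appears as `&#246;`.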

Cardinal rule number one: deal only with Unicode objects when interacting with Python XML APIs. You may get plain strings from external interfaces, files, sockets, etc., but you should convert them to Unicode objects as soon as possible. To do so, you need to know what encoding they are in; then you can run:

unicode_obj = unicode(plain_string, encoding)

If you don't know which encoding was used, a decent first guess is "iso8859-1".
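
A minimal sketch of the "convert as soon as possible" rule follows. The helper name `to_unicode` is hypothetical, not part of any library. It uses the `.decode()` method, which does the same job as the `unicode()` call above, and byte literals so that it also runs on later Python versions.

```python
def to_unicode(raw, encoding=None):
    """Decode a byte string to a Unicode object.

    Hypothetical helper for illustration; not part of any library.
    """
    if encoding is None:
        # Fallback guess. Every byte value is valid in ISO 8859-1, so this
        # decode never fails, but the text will be wrong if the data is
        # really in some other encoding.
        encoding = "iso8859-1"
    return raw.decode(encoding)

name = to_unicode(b"L\xc3\xb6wis", "utf-8")  # encoding known: UTF-8 bytes
guess = to_unicode(b"L\xf6wis")              # encoding unknown: use the guess
```

Once the data is decoded at the boundary like this, the rest of the program can pass Unicode objects to the XML APIs without caring where the bytes came from.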

If you are dealing with Unicode in Python, it is almost inevitable that you will come across the dreaded UnicodeError. If you do, here are some resources that may help.

Start with Marc-André Lemburg's slides from his presentation, Unicode Support in Python (PDF). Also see Andy Robinson's Python Unicode Tutorial.

There is a FAQ entry on the UnicodeError.
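
For readers who have not met the error yet, here is a small sketch of one common way to provoke it: asking the ASCII codec to handle a character it cannot represent. The sample text is invented. The error also arises implicitly whenever Python falls back to ASCII on your behalf.

```python
text = u"L\u00f6wis"

try:
    text.encode("ascii")       # ASCII cannot represent U+00F6
except UnicodeError:
    encode_failed = True
else:
    encode_failed = False

# Naming an encoding that can represent the character avoids the error.
data = text.encode("utf-8")
```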

I would recommend that you always build Python with UCS-4 support on platforms that support it. But there are a couple of blocking bugs in Python that can cause trouble with UCS-4 builds. One, #610783, is fixed for Python 2.3, but this fix cannot be backported to the 2.2 series. The other, #599377, unfortunately looks as if it might not even be fixed in Python 2.3. Martin von Löwis suggests a workaround for #610783.
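
To find out which kind of build you are running, one option is to inspect `sys.maxunicode`. This is a sketch rather than a definitive test. Note that Python 3.3 and later always report 0x10FFFF, so the distinction only matters on older interpreters.

```python
import sys

# sys.maxunicode is the largest code point a single Unicode character can
# hold: 0xFFFF on a narrow (UCS-2) build, 0x10FFFF on a wide (UCS-4) build.
is_wide_build = sys.maxunicode > 0xFFFF

print("wide (UCS-4) build" if is_wide_build else "narrow (UCS-2) build")
```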

For all the details, good, bad and ugly, of wide Unicode support in Python, see PEP 261. If you mess with the internal Unicode encoding while building Python, watch out for the situation outlined in this FAQ.

I must say that the state of Unicode in Python is a bit of a mess. Not that this is unique. The state of Unicode in what are, frankly, Python's main rivals, Perl and Java, is also a bit of a mess. Basically, despite the fact that XML has thrust Unicode into very broad prominence, Unicode is still a new beast, and everyone is still shaking out the bugs. For a slog through all the arguments and rants about wide character Unicode support, see the long thread that ends with Guido's summation.

The Python Unicode implementation is formally described in PEP 100.

