XML allows a parser to receive an entity (e.g. a document file) whose encoding is declared externally.
That is, rather than looking into the document itself to find a byte-order mark (BOM) and/or an encoding declaration in the prolog, the parser can instead be told explicitly which encoding to assume. See sections 4.3.3 and F.2 of XML 1.0 (Third Edition) for the requirements and guidelines regarding internal and external mechanisms for indicating an entity's encoding. Also see the "Media Types for XML" section of Architecture of the World Wide Web for good-practice guidelines on serving XML on the web; they actually discourage external declarations. However, the mechanism is there for people to use, and there are situations where it is helpful.
So, since there are legitimate use cases for it, 4Suite now supports the use of external encoding information for general non-validating parsing of entities that aren't in a 4Suite repository.
The encoding can be explicitly set in an InputSource. Users can do this when they create the InputSource (e.g. when calling one of the InputSourceFactory methods). Here is a brief example of forcing the assumed encoding to be iso-8859-15:
from Ft.Xml.Domlette import NonvalidatingReader, Print
from Ft.Xml.InputSource import DefaultFactory
# byte 0xA4 is the Euro symbol (U+20AC) in iso-8859-15
ORDER = '<order><qty>10</qty><price>\xa4420</price></order>'
isrc = DefaultFactory.fromString(ORDER, 'file:///order.xml',
                                 encoding='iso-8859-15')
doc = NonvalidatingReader.parse(isrc)
Print(doc, encoding='us-ascii')
#
#EXPECTED RESULT:
#
#<?xml version="1.0" encoding="us-ascii"?>
#<order><qty>10</qty><price>&#8364;420</price></order>
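To see why the assumed encoding matters here, the following is a minimal sketch in modern Python 3 that uses only the standard codecs, not 4Suite itself. It decodes the same bytes under iso-8859-15 and under iso-8859-1, then serializes each result to us-ascii with numeric character references, which is roughly what a us-ascii serializer has to do with a character it cannot represent. The variable names are illustrative.

```python
# Illustrative sketch only: plain Python 3 codecs, no 4Suite involved.
ORDER = b'<order><qty>10</qty><price>\xa4420</price></order>'

# The same byte, 0xA4, maps to different characters in the two encodings.
as_latin9 = ORDER.decode('iso-8859-15')   # 0xA4 -> U+20AC EURO SIGN
as_latin1 = ORDER.decode('iso-8859-1')    # 0xA4 -> U+00A4 CURRENCY SIGN

# us-ascii cannot represent either character directly, so the encoder
# falls back to a numeric character reference.
ascii_latin9 = as_latin9.encode('us-ascii', 'xmlcharrefreplace')
ascii_latin1 = as_latin1.encode('us-ascii', 'xmlcharrefreplace')

print(ascii_latin9.decode('us-ascii'))
print(ascii_latin1.decode('us-ascii'))
```

The first line printed carries the price as &#8364;420 (the Euro sign), the second as &#164;420 (the generic currency sign), so a wrong assumption about the encoding silently changes the meaning of the data.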
A more practical use case for this functionality might be when you need to read a document whose encoding is not supported by Expat (see the 4Suite FAQ for info on what encodings are supported). As long as you know the actual encoding, you can transcode the document to UTF-8 and force a UTF-8 interpretation of it, without having to rewrite the encoding declaration in the document itself.
Here's a quick test, using an EUC-JP encoded document:
from Ft.Xml.Domlette import NonvalidatingReader, Print
from Ft.Xml.InputSource import DefaultFactory
GREETING = """<?xml version="1.0" encoding="euc-jp"?>
<greeting xml:lang="ja">\xba\xa3\xc6\xfc\xa4\xcf</greeting>"""
GREETING_UTF8 = GREETING.decode('euc-jp').encode('utf-8')
isrc = DefaultFactory.fromString(GREETING_UTF8,
                                 'file:///greeting.xml',
                                 encoding='utf-8')
doc = NonvalidatingReader.parse(isrc)
Print(doc, encoding='us-ascii')
#
#EXPECTED RESULT:
#
#<?xml version="1.0" encoding="us-ascii"?>
#<greeting xml:lang="ja">&#20170;&#26085;&#12399;</greeting>
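The same idea can be tried without 4Suite. The sketch below uses Python 3's standard xml.etree.ElementTree, whose XMLParser accepts an encoding argument that overrides the encoding named in the XML declaration. It is an analogue of the InputSource encoding override, not the 4Suite API, and the variable names are illustrative.

```python
# Illustrative sketch only: Python 3 standard library, not the 4Suite API.
import xml.etree.ElementTree as ET

# The document as it arrives: EUC-JP bytes with an EUC-JP declaration.
GREETING = (b'<?xml version="1.0" encoding="euc-jp"?>\n'
            b'<greeting xml:lang="ja">\xba\xa3\xc6\xfc\xa4\xcf</greeting>')

# Transcode the bytes to UTF-8; the declaration still says "euc-jp".
GREETING_UTF8 = GREETING.decode('euc-jp').encode('utf-8')

# Tell the parser to assume UTF-8 regardless of the declaration.
parser = ET.XMLParser(encoding='utf-8')
root = ET.fromstring(GREETING_UTF8, parser=parser)

# Serialize to us-ascii; non-ASCII text becomes character references.
result = ET.tostring(root, encoding='us-ascii')
print(result.decode('us-ascii'))
```

The declaration inside the document is never edited; only the externally supplied encoding changes how the bytes are read.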
Furthermore, when a stream object that is being wrapped by an InputSource supplies its own external encoding or media type information, this info will be automatically used to deduce the encoding, where such a deduction can be made with confidence, in accordance with RFC 3023 and RFC 2616. Streams believed to originate from the local filesystem or via FTP are excluded from this checking.
The typical use case is when a stream is obtained via HTTP and the HTTP response contains a Content-Type header like one of these:
Content-Type: text/xml
Content-Type: application/xml;charset=KOI8-R
Content-Type: application/xhtml+xml;charset=utf-8
This metadata usually ends up in the stream object, and will be acted upon automatically. In cases where the charset value is specified, it will be used as the assumed encoding, regardless of what is in the actual XML entity. In cases where the charset value is not specified, the RFC 3023 and 2616 rules will apply:
1. text/xml, text/foo+xml, and text/xml-external-parsed-entity entities must be assumed to be us-ascii encoded, a fact which may surprise a lot of people. Many HTTP servers unfortunately send all files named foo.xml with Content-Type: text/xml. As of this writing, discussion of completely deprecating text/xml for XML is underway in web architecture circles (see the thread starting at http://lists.w3.org/Archives/Public/www-tag/2003Oct/0152.html).
2. Other text/* entities, if served via HTTP, must be assumed to be iso-8859-1 encoded. This usually doesn't affect XML files unless they happen to be served as text/plain or text/html, as XHTML documents often are.
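The rules above can be sketched as a small function. This is a hedged illustration in Python 3 of the RFC 3023 and RFC 2616 defaults as described here, not the actual logic of 4Suite's _getStreamEncoding(). The function name deduce_encoding is hypothetical, and returning None is this sketch's way of saying that no confident deduction is made and the entity itself should be consulted.

```python
# Illustrative sketch only; deduce_encoding is a hypothetical helper,
# not part of 4Suite.
from email.message import Message


def deduce_encoding(content_type):
    """Return the encoding implied by a Content-Type value, or None."""
    msg = Message()
    msg['Content-Type'] = content_type
    charset = msg.get_content_charset()
    if charset:
        # An explicit charset parameter always wins.
        return charset
    maintype = msg.get_content_maintype()
    subtype = msg.get_content_subtype()
    if maintype == 'text':
        if (subtype in ('xml', 'xml-external-parsed-entity')
                or subtype.endswith('+xml')):
            # Rule 1: XML served as text/* defaults to us-ascii.
            return 'us-ascii'
        # Rule 2: other text/* served via HTTP defaults to iso-8859-1.
        return 'iso-8859-1'
    # e.g. application/xml without a charset: no external deduction.
    return None


for value in ('text/xml',
              'application/xml;charset=KOI8-R',
              'application/xhtml+xml;charset=utf-8',
              'text/html',
              'application/xml'):
    print(value, '->', deduce_encoding(value))
```

Note that get_content_charset() lowercases the value, so KOI8-R is reported as koi8-r in this sketch.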
If the automatic application of these rules is troublesome to you, it is possible to defeat them by subclassing InputSource and making the _getStreamEncoding() method a 'return None' no-op, and passing your class as the inputSourceClass argument to the InputSourceFactory:
from Ft.Xml.InputSource import InputSourceFactory
from Ft.Xml.InputSource import InputSource
class MyInputSource(InputSource):
    # disable all external encoding support
    #def __init__(self, *v_args, **kw_args):
    #    if kw_args.has_key('encoding'):
    #        del kw_args['encoding']
    #    InputSource.__init__(self, *v_args, **kw_args)
    #    return
    #
    # disable detection of encoding from stream metadata
    def _getStreamEncoding(self, stream):
        return None
MyFactory = InputSourceFactory(inputSourceClass=MyInputSource)
# now you can use MyFactory as you would normally use DefaultFactory
