Domlette is 4Suite's lightweight DOM implementation. It is optimized for XPath operations, speed, and relatively low memory overhead, at least when compared to 4DOM and minidom. It is not fully DOM compliant, but it does provide an interface very close to DOM Level 2. In Domlette, where DOM and XPath disagree, XPath wins.

There is a single Domlette API accessible through Ft.Xml.Domlette. This is a wrapper for one of two underlying implementations: cDomlette or FtMiniDom.

  • cDomlette - a very fast and lightweight Domlette implemented completely in C. This is the default Domlette implementation on all platforms. It uses its own bundled version of Expat to do the XML parsing.
  • FtMiniDom - available only in 4Suite 0.12.0a1 through 1.0a4, FtMiniDom is identical to cDomlette, but is implemented completely in Python. In most cases, it uses your platform's PyExpat to do the XML parsing. FtMiniDom is very much like the xml.dom.minidom that comes with Python and the xml.dom.minidom replacement that comes with PyXML, but it does have some differences and added features. You can force the use of FtMiniDom by setting USE_MINIDOM=1 in your environment before importing Ft.Xml.Domlette.

Domlette Reader API

One can create Domlette instances by parsing XML documents with the reader system. The reader API is fairly simple. For general use, you can get a NonvalidatingReader or NoExtDtdReader instance from Ft.Xml.Domlette. You then feed an XML document entity's byte stream to the reader via one of these methods:

  • parseUri(uri) - The uri argument is the absolute URI of the document entity to parse. The URI will be dereferenced by the default resolver.
  • parseString(st, uri) - st is the XML document entity in the form of an encoded Python string (not a Unicode string). See the note about the uri argument, below.
  • parseStream(stream, uri) - stream is a Python file-like object that can supply the document entity's bytes via read() calls. See the note about the uri argument, below.
  • parse(isrc) - isrc is an Ft.Xml.InputSource object, described in the next section.

The Importance of Base URIs

In the first three methods, the uri argument is the URI of the document entity that you are feeding to the parser. The document URI is a very important but often overlooked concept in document processing.

The URI gives the document entity a unique identifier that can be used to refer to the document as a whole. Also, each Domlette node derived from a particular entity inherits that entity's URI as the node's baseURI property, unless an alternative base URI was indicated, such as with xml:base.
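
The inheritance rule can be sketched with the Python 3 standard library. Here xml.dom.minidom stands in for Domlette purely for illustration (Domlette exposes the result directly as baseURI), and the helper name effective_base_uri is hypothetical:

```python
from urllib.parse import urljoin
from xml.dom import minidom


def effective_base_uri(node, document_uri):
    """Return the base URI in effect at an element node.

    Walks up the ancestor elements collecting xml:base values, then
    resolves them outermost-first against the document URI.
    """
    bases = []
    while node is not None and node.nodeType == node.ELEMENT_NODE:
        if node.hasAttribute("xml:base"):
            bases.append(node.getAttribute("xml:base"))
        node = node.parentNode
    base = document_uri
    for value in reversed(bases):
        base = urljoin(base, value)
    return base


doc = minidom.parseString('<a xml:base="sub/"><b/></a>')
inner = doc.documentElement.firstChild

# Without xml:base the node would simply inherit the document URI.
inherited = effective_base_uri(inner, "http://foo/test/spam.xml")
print(inherited)
```

An element with no xml:base on itself or any ancestor simply gets the document URI back unchanged.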

The document's URI is also used as the "base URI" for resolving any relative URI references that may appear within the document itself. Relative URI references may occur in a document in places like:

  • <!DOCTYPE> or <!ENTITY>, immediately following the keyword SYSTEM
  • <xsl:import> and <xsl:include>, in the value of the href attribute
  • <xi:include>, in the value of the href attribute
  • <exsl:document>, in the value of the href attribute
  • the arguments to XSLT's document() function

It is a common misconception that relative URI references in a document's content are considered to be relative to the processor's current working directory. They are actually resolved relative to the URI of the document that contains the relative URI reference (more specifically, relative to the URI of the entity in which the reference occurs, keeping in mind that a document may be comprised of multiple entities, i.e., separate files).
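
As a small illustration of that rule, the Python 3 standard library's urljoin performs the same kind of resolution. The URIs below are made up for the example and this is not 4Suite's own resolver:

```python
from urllib.parse import urljoin

# URI of the document (entity) in which the references occur
document_uri = "http://foo/test/spam.xml"

# Relative references as they might appear in SYSTEM identifiers or href attributes
references = ["eggs.xml", "../common/header.xsl", "/dtd/spam.dtd"]

# Each reference is resolved against the document URI, not the working directory
resolved = [urljoin(document_uri, ref) for ref in references]

for ref, uri in zip(references, resolved):
    print(ref, "->", uri)
```

The current working directory of the process plays no part in any of these results.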

In all cases, the document URI that you supply in the reader API must be "absolute", which means that it has a scheme, e.g. "http://spam/eggs.xml", not just "/spam/eggs.xml" or "eggs.xml".
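
A rough way to check for a scheme before calling the reader, sketched with the Python 3 standard library. The function name is_absolute_uri is hypothetical and is not part of 4Suite:

```python
from urllib.parse import urlsplit


def is_absolute_uri(uri):
    """Return True if the URI reference carries a scheme such as http: or file:."""
    return bool(urlsplit(uri).scheme)


candidates = ["http://spam/eggs.xml", "urn:dummy", "/spam/eggs.xml", "eggs.xml"]
results = dict((uri, is_absolute_uri(uri)) for uri in candidates)

for uri in candidates:
    print(uri, results[uri])
```

This is only a heuristic: a Windows path such as C:\spam.xml would be mistaken for a URI with scheme "c", so OS paths should be converted to file URIs first.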

If you know there are not going to be any relative URI references to resolve during initial parsing or during processing of the Domlette by other tools, then you can safely omit the argument or, preferably, supply a dummy URI like "urn:dummy" or "http://spam/eggs.xml". If you choose to omit URI arguments from APIs that need them, you may get a Python warning, and a random URI, which is probably not what you want, will be assigned.

Parsing XML that's already a Unicode string

Because 4Suite tries to provide as thin a wrapper around the underlying parser as possible, and because of complexities in the APIs of these parsers, there is no API in 4Suite for parsing Python's Unicode strings.

If your XML is in the form of a Unicode string, you must encode the string as bytes so that the underlying parser can read it. Once you have an encoded string, you can pass it to the reader's parseString(), or wrap it in an InputSource using the InputSourceFactory's fromString(). If the string is not UTF-16 or UTF-8 encoded, then you must tell the reader what encoding it actually uses. You can do this either by writing or replacing the XML declaration in the string itself, or (much easier) setting the optional encoding keyword argument in the reader's parseString() method or the InputSourceFactory's fromString() method. For an example, see the Akara article on external encoding declarations.
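
The encode-then-declare step can be sketched in Python 3 with xml.dom.minidom standing in for the Domlette reader (an assumption made so the example runs without 4Suite). Here the encoding is declared by writing an XML declaration into the byte string:

```python
from xml.dom import minidom

# XML held as a Unicode (text) string; \u00e9 is a non-ASCII character
text = "<spam>\u00e9ggs</spam>"

# UTF-8 bytes need no declaration: the parser assumes UTF-8 by default
utf8_bytes = text.encode("utf-8")
doc1 = minidom.parseString(utf8_bytes)

# Any other encoding has to be announced, here via the XML declaration
latin1_bytes = (b'<?xml version="1.0" encoding="iso-8859-1"?>'
                + text.encode("iso-8859-1"))
doc2 = minidom.parseString(latin1_bytes)

# Both parses recover the same character data
value1 = doc1.documentElement.firstChild.data
value2 = doc2.documentElement.firstChild.data
print(value1 == value2)
```

With 4Suite itself, the encoding keyword argument described above avoids having to edit the declaration by hand.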

NonvalidatingReader Examples

Example of parsing XML from the web:

from Ft.Xml.Domlette import NonvalidatingReader
doc = NonvalidatingReader.parseUri("http://xmlhack.com/read.php?item=1560")

Example of parsing XML from the local filesystem, using a URI:

from Ft.Xml.Domlette import NonvalidatingReader
doc = NonvalidatingReader.parseUri("file:///tmp/spam.xml")

Example of parsing XML from the local filesystem, when given a relative file path in the local OS's format, and you want it to be relative to the current working directory:

from Ft.Xml.Domlette import NonvalidatingReader
from Ft.Lib import Uri
file_uri = Uri.OsPathToUri('spam.xml', attemptAbsolute=1)
doc = NonvalidatingReader.parseUri(file_uri) 
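
For comparison, a similar path-to-URI conversion can be sketched with only the Python 3 standard library. The helper name os_path_to_file_uri is hypothetical, and its handling of edge cases may differ from Uri.OsPathToUri:

```python
import os
from pathlib import Path


def os_path_to_file_uri(path):
    """Make an OS path absolute (relative to the cwd) and express it as a file: URI."""
    return Path(os.path.abspath(path)).as_uri()


file_uri = os_path_to_file_uri("spam.xml")
print(file_uri)
```

The file does not have to exist for the conversion to work; only the path is examined.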

Example of parsing XML from a string, without providing a document/base URI:

from Ft.Xml.Domlette import NonvalidatingReader
doc = NonvalidatingReader.parseString("<spam>eggs</spam>")

Example of parsing XML from a string, with a document/base URI, and a document that really needs that base URI:

from Ft.Xml.Domlette import NonvalidatingReader
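# eggs is declared as an external entity (SYSTEM), so its replacement
# text is fetched from a URI resolved against the document's base URI
s = """<!DOCTYPE spam [ <!ENTITY eggs SYSTEM "eggs.xml"> ]>
<spam>&eggs;</spam>"""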
doc = NonvalidatingReader.parseString(s, 'http://foo/test/spam.xml')
# during parsing, the replacement text for &eggs;
# will be obtained from http://foo/test/eggs.xml

In all cases, doc is now a cDomlette or FtMiniDom node. As mentioned above, the reader uses cDomlette unless the environment variable USE_MINIDOM is set to 1.

EntityReader Examples

Sometimes you need to parse a fragment of XML rather than the full document. Domlette in non-validating mode has a reader that can handle this case, returning a Domlette document fragment rather than a document object.

from Ft.Xml.Domlette import EntityReader
s = """
<spam1>eggs</spam1>
<spam2>more eggs</spam2>
"""
docfrag = EntityReader.parseString(s, 'http://foo/test/spam.xml')

Notice: EntityReader cannot handle all XML document entities. In particular, it chokes if the document entity has a document type declaration. See this thread for some discussion.
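
Where no fragment-aware reader is available, a common workaround is to wrap the fragment in a throwaway root element, parse it as a document, and move the children into a document fragment. The sketch below does this in Python 3 with xml.dom.minidom. It is a substitute technique, not how EntityReader works internally, and the names parse_fragment and wrapper are hypothetical:

```python
from xml.dom import minidom


def parse_fragment(markup):
    """Parse markup that may have several top-level elements into a DocumentFragment."""
    # Wrap the fragment so the parser sees a single root element
    doc = minidom.parseString("<wrapper>%s</wrapper>" % markup)
    fragment = doc.createDocumentFragment()
    wrapper = doc.documentElement
    # appendChild moves each node out of the wrapper and into the fragment
    while wrapper.firstChild is not None:
        fragment.appendChild(wrapper.firstChild)
    return fragment


docfrag = parse_fragment("<spam1>eggs</spam1><spam2>more eggs</spam2>")
names = [n.nodeName for n in docfrag.childNodes
         if n.nodeType == n.ELEMENT_NODE]
print(names)
```

Like EntityReader, this approach breaks down when the input carries a document type declaration, since that cannot appear inside an element.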

ValidatingReader

If you want to parse with DTD validation, use ValidatingReader instead. You'll need PyXML installed, because 4Suite does not come with a validating parser.

#ValidatingReader is a global instance
from Ft.Xml.Domlette import ValidatingReader
#raises SystemExit if PyXML 0.8+ not installed
doc = ValidatingReader.parseUri("http://xmlhack.com/read.php?item=1560")

NoExtDtdReader

When using the NonvalidatingReader, the document's DTD is still checked for things like entity declarations and default attribute values. You cannot suppress reading of the internal DTD subset, but you can prevent the external subset from being accessed by using NoExtDtdReader. This won't affect the processing of external parameter entities defined in the internal DTD subset.

Creating your own reader instance

In some cases you might not want to use the global reader instances. For instance, in multithreaded use, you might want a reader per thread. Or you might want to change some of the parameters on the readers. If so, you can create your own reader instance:

from Ft.Xml.Domlette import NonvalidatingReaderBase 
reader = NonvalidatingReaderBase()
doc = reader.parseUri("http://xmlhack.com/read.php?item=1560")

Instead of NonvalidatingReaderBase, you could use NoExtDtdReaderBase or ValidatingReaderBase, depending on your needs. Each of these three readers takes an optional inputSourceFactory constructor argument, which you can use to supply a custom URI resolver.
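
One way to arrange a reader per thread is thread-local storage. The sketch below (Python 3) uses a stand-in class DummyReader so that it runs without 4Suite; with 4Suite installed, the factory passed in would be a class such as NonvalidatingReaderBase. The names get_reader and DummyReader are hypothetical:

```python
import threading

_local = threading.local()


def get_reader(factory):
    """Return this thread's reader, creating it with factory() on first use."""
    reader = getattr(_local, "reader", None)
    if reader is None:
        reader = factory()
        _local.reader = reader
    return reader


class DummyReader(object):
    """Stand-in for a reader class; not part of 4Suite."""


readers = []


def worker():
    # Each thread gets, and keeps reusing, its own instance
    readers.append(get_reader(DummyReader))


threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(readers[0] is not readers[1])
```

Repeated calls within one thread return the same object, so per-reader settings only have to be applied once per thread.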

InputSource objects

You can also handle InputSource objects. An InputSource is an object that encapsulates a source of encoded text for parsing, and a URI resolver. The advantage to using an InputSource is that it provides a standard API to the text stream, and, perhaps more importantly, allows you to associate a custom URI resolver with the stream.

Normally, you can just get an InputSource from the factory by calling the appropriate method: fromUri(uri), fromString(st), or fromStream(stream), much like the reader API described above. Then you pass the InputSource object to the reader's parse() method:

from Ft.Xml import InputSource
from Ft.Xml.Domlette import NonvalidatingReader
factory = InputSource.DefaultFactory
isrc = factory.fromUri("http://xmlhack.com/read.php?item=1560")
doc1 = NonvalidatingReader.parse(isrc)
#
# The factory is reusable. Here we also parse a string:
#
isrc = factory.fromString("<spam>eggs</spam>", "http://spam.com/base")
doc2 = NonvalidatingReader.parse(isrc)
#
# InputSource is a file-like object, so you can treat it as such:
#
isrc = factory.fromUri("http://xmlhack.com/read.php?item=1560")
raw_text = isrc.read()
#
#The uri/system ID you used for it is maintained
#
print isrc.uri
#
#You can also create other InputSources from URIs relative to this one
#
isrc2 = isrc.resolve("read.php?item=1703")
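
To make the idea concrete, here is a deliberately simplified Python 3 sketch of what an input source bundles together: a byte stream, the URI it came from, and a way to resolve related URIs. The class SimpleInputSource and its resolve_uri method are hypothetical. The real Ft.Xml.InputSource does more, and its resolve() returns a new InputSource rather than just a URI:

```python
import io
from urllib.parse import urljoin


class SimpleInputSource(object):
    """Illustrative stand-in: a file-like stream paired with its URI."""

    def __init__(self, stream, uri):
        self.stream = stream
        self.uri = uri

    def read(self, size=-1):
        # Delegate to the wrapped stream so the object is file-like
        return self.stream.read(size)

    def resolve_uri(self, relative):
        # Resolve a relative reference against this source's own URI
        return urljoin(self.uri, relative)


isrc = SimpleInputSource(io.BytesIO(b"<spam>eggs</spam>"),
                         "http://xmlhack.com/read.php?item=1560")
raw_text = isrc.read()
related_uri = isrc.resolve_uri("read.php?item=1703")
print(related_uri)
```

Keeping the URI next to the stream is what lets later processing steps resolve relative references without being told the base URI again.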

Converting from other DOM libraries

You can convert another Python DOM object (e.g. 4DOM or minidom) to a Domlette object using the function ConvertDocument:

Ft.Xml.Domlette.ConvertDocument(oldDocument, documentURI=u'')

The documentURI argument provides a base URI for the converted nodes. If it is not specified, the attributes documentURI and then baseURI are checked in the source DOM, as defined in DOM Level 3. If no URI is found in this way, a warning is issued and a UUID URI is generated for the new Domlette.
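
The fallback order just described can be written out as follows. This Python 3 sketch is only a paraphrase of the documented behaviour, not ConvertDocument's actual code, and the names choose_document_uri and FakeDocument are hypothetical:

```python
import uuid
import warnings


def choose_document_uri(old_document, documentURI=u''):
    """Pick a base URI: explicit argument, then documentURI, then baseURI, else a UUID URI."""
    if documentURI:
        return documentURI
    for name in ("documentURI", "baseURI"):
        uri = getattr(old_document, name, None)
        if uri:
            return uri
    warnings.warn("no document URI found; generating a UUID URI")
    return uuid.uuid4().urn


class FakeDocument(object):
    """Stand-in for a source DOM document that only knows its baseURI."""
    documentURI = None
    baseURI = "http://foo/base.xml"


from_source = choose_document_uri(FakeDocument())

# The warning is silenced here only to keep the demonstration output clean
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    generated = choose_document_uri(object())

print(from_source)
```

A generated UUID URI is unique but carries no location information, so relative references in the converted document cannot be resolved meaningfully against it.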

Domlette API summary

Everything should be self-explanatory; it's mostly the same as the normal xml.dom API, and the node type constants are inherited from xml.dom.Node.

DOMImplementation methods:

  • createDocument(namespaceURI, qualifiedName, doctype) - doctype must be None
  • createRootNode(documentURI)
  • hasFeature(feature, version)

Node creation:

  • cloneNode(deep)
  • createElementNS(namespaceURI, qualifiedName)
  • createAttributeNS(namespaceURI, qualifiedName)
  • createTextNode(data)
  • createComment(data)
  • createProcessingInstruction(target, data)
  • createDocumentFragment()
  • importNode(importedNode, deep)

Node tree manipulation:

  • appendChild(newChild)
  • insertBefore(newChild, refChild)
  • normalize()
  • removeChild(oldChild)
  • replaceChild(newChild, oldChild)
  • removeAttributeNS(namespaceURI, localName)
  • setAttributeNS(namespaceURI, qualifiedName, value)
  • setAttributeNodeNS(newAttr)
  • removeAttributeNode(oldAttr)

Node access:

  • getAttributeNS(namespaceURI, localName)
  • getAttributeNodeNS(namespaceURI, localName)
  • ownerDocument
  • rootNode
  • parentNode
  • childNodes
  • firstChild
  • lastChild
  • previousSibling
  • nextSibling
  • attributes

Node metadata:

  • hasAttributeNS(namespaceURI, localName)
  • hasChildNodes()
  • isSameNode(node)
  • nodeName
  • nodeValue
  • nodeType
  • namespaceURI
  • prefix
  • localName
  • baseURI
  • publicId - on Document nodes only
  • systemId - on Document nodes only

XPath query:

  • xpath(expr[, explicitNss]) - described in the XPath query section below

Looking for getElementsByTagName()? It isn't supported, because there are better options. See getElementsByTagName Alternatives for more info.
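
One possible alternative, besides XPath, is a small generator that walks childNodes and compares namespaceURI and localName. The sketch below runs on Python 3 with xml.dom.minidom as a stand-in for Domlette, on the assumption that these node properties behave alike in both. The function name iter_elements is hypothetical:

```python
from xml.dom import Node, minidom


def iter_elements(node, namespaceURI, localName):
    """Yield descendant elements matching a namespace and local name, in document order."""
    for child in node.childNodes:
        if child.nodeType == Node.ELEMENT_NODE:
            if (child.namespaceURI == namespaceURI
                    and child.localName == localName):
                yield child
            # Recurse only into elements; other node types have no element children
            for descendant in iter_elements(child, namespaceURI, localName):
                yield descendant


doc = minidom.parseString("<spam>eggs<a/><b><a/></b></spam>")

# None stands for "no namespace"
matches = list(iter_elements(doc, None, "a"))
print(len(matches))
```

Because it is a generator, the caller can stop at the first match without visiting the rest of the tree.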

Serializing Domlette nodes

Domlette comes with a couple of very fast printer functions which also go to great pains to correctly handle character encoding issues. Here are some serialization examples using the Domlette printers, given a node 'node' (it doesn't have to be a document node):

from Ft.Xml.Domlette import Print, PrettyPrint

# basic serialization to sys.stdout
Print(node)

# ...with extra whitespace (indenting)
PrettyPrint(node)

# ...using a single tab, rather than 2 spaces, to indent at each level
PrettyPrint(node, indent='\t')

# serializing to a utf-8 encoded file
f = open('output.xml','w')
Print(node, stream=f)
f.close()

# ...to an iso-8859-1 encoded file
f = open('output.xml','w')
Print(node, stream=f, encoding='iso-8859-1')
f.close()

# ...to an ascii encoded string
import cStringIO
buf = cStringIO.StringIO()
Print(node, stream=buf, encoding='us-ascii')
# fetch the value before closing; getvalue() fails on a closed buffer
s = buf.getvalue()
buf.close()

# Normally, output syntax (XML or HTML) is chosen based on the DOM type,
# which is automatically detected. A Domlette or XML DOM can be output in
# HTML syntax if the asHtml=1 argument is given.
PrettyPrint(node, asHtml=1)

See also: Serializing XML from DOM or Domlette documents
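
The effect of the encoding argument can be seen without 4Suite by using the Python 3 standard library's xml.dom.minidom serializer, which is a stand-in here and not the Domlette printer. Characters the target encoding cannot represent are written as numeric character references:

```python
from xml.dom import minidom

doc = minidom.parseString("<spam>\u00e9ggs<a/></spam>")
node = doc.documentElement

# iso-8859-1 can represent e-acute directly, as the single byte 0xE9
latin1_bytes = doc.toxml(encoding="iso-8859-1")

# us-ascii cannot, so the character becomes a character reference
ascii_bytes = doc.toxml(encoding="us-ascii")

# Indented output with one tab per level, returned as text
pretty_text = node.toprettyxml(indent="\t")

print(latin1_bytes)
print(ascii_bytes)
```

In both byte strings the XML declaration names the encoding, so a parser reading the output back knows how to decode it.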

Building a DOM from scratch

implementation.createRootNode is a more natural approach for creating an XPath model root node. This is similar to the DOM idea of a document node and even closer to a DOM document fragment (multiple element children are allowed). implementation.createDocument on the other hand is designed to come close to the DOM interface.

As Mike Brown pointed out to a user,

from Ft.Xml.Domlette import implementation
doc = implementation.createRootNode('file:///article.xml')

is the equivalent of

from Ft.Xml import EMPTY_NAMESPACE
from Ft.Xml.Domlette import implementation
doc = implementation.createDocument(EMPTY_NAMESPACE, None, None)
doc.baseURI = 'file:///article.xml'

And similarly

from Ft.Xml import EMPTY_NAMESPACE
from Ft.Xml.Domlette import implementation
doc = implementation.createRootNode('file:///article.xml')
docelement = doc.createElementNS(EMPTY_NAMESPACE, 'article')
doc.appendChild(docelement)

is the equivalent of

from Ft.Xml import EMPTY_NAMESPACE
from Ft.Xml.Domlette import implementation
doc = implementation.createDocument(EMPTY_NAMESPACE, 'article', None)
doc.baseURI = 'file:///article.xml'

If you want as much fidelity to the DOM API as Domlette offers, use implementation.createDocument. If you just want to create a document or other such root-level node, and never mind the strange parameters, use implementation.createRootNode.

XPath query

You can easily perform XPath queries by using the .xpath() method of cDomlette nodes, as follows:

from Ft.Xml.Domlette import NonvalidatingReader
doc = NonvalidatingReader.parseString("<spam>eggs<a/><a/></spam>")
print doc.xpath('//a')
print doc.xpath('string(/spam)')

Which is largely a shortcut for:

from Ft.Xml.XPath import Evaluate
print Evaluate('//a', contextNode=doc)

Notice: this is nothing like W3C DOM's XPath query module. The emphasis, as usual with Domlette, is on speed, simplicity and pythonic-ness (in that order).

The API, in brief:

node.xpath(expr[, explicitNss])

  • node - will be used as the core of the context for evaluating the XPath
  • expr - XPath expression in string or compiled form
  • explicitNss - (optional) any additional or overriding namespace mappings, in the form of a dictionary of prefix: namespace. The base namespace mappings are taken from in-scope declarations on the given node; this explicit dictionary is superimposed on the base mappings.
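
The role of an explicit prefix-to-namespace dictionary can be illustrated with the Python 3 standard library's ElementTree, which supports only a small subset of XPath and, unlike Domlette, does not pick up in-scope declarations from the node. It is used here only as a stand-in, and the namespace URI is made up:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<spam xmlns:x="urn:example:x">eggs<a/><x:a/><a/></spam>')

# Unprefixed names match elements in no namespace
plain = root.findall(".//a")

# The prefix used in the expression is bound by the explicit mapping,
# independently of the prefix used in the document itself
prefixed = root.findall(".//p:a", namespaces={"p": "urn:example:x"})

print(len(plain), len(prefixed))
```

The prefix in the query ("p") and the prefix in the document ("x") differ on purpose: only the namespace name they map to has to agree.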

See also: Basic use of the 4Suite Python XPath API

More URI info

For some users, always specifying a base URI feels like an inconvenience. Perhaps they always generate XML sources from text or streams without naturally associated URIs, and they have to figure out schemes to come up with base URIs for the parse. But there is good reason for this pickiness. Just ask one of the users who got bitten by carelessness with base URIs in practice. It's better to always put some amount of thought into base URIs when processing XML, and 4Suite encourages this.

Do rest assured that 4Suite only enforces the requirement for base URIs in cases where they are needed to make sense of a requested operation. Mike Brown has a nice discussion of the issue in this message. I also offer a less patient explanation. The discussion threads that lie behind the major rewrite of the parsing and URI resolution infrastructure in 4Suite are scattered all over the place, but one instructive starting point is this message. In it, Mike Brown even points to the similar situation with the Saxon XSLT processor, which has led to a FAQ for that community.

One useful note is that if your main use for URI resolution is XSLT imports and includes, you can avoid having to give valid base URIs by using XSLT include paths.

Additional information

Basic DOM processing

External encoding declarations

XML Catalogs

