The authoritative reference is the official Python Library documentation page on the xml.dom module.
There are already many resources discussing how to use the various Python/XML DOMs. The rest of this item mostly points to those discussions, adding updates or suggestions where needed.
This slide and the ones that follow, from Alexandre Fayolle's excellent EuroPython 2002 tutorial on Python/XML processing, are a great introduction to DOM processing in Python. (FIX ME: the slides seem to be gone.)
Christoph Dietze's Python Cookbook recipe "turn the structure of a XML-document into a combination of dictionaries and lists" is a very useful example of minidom usage, and it is still up to date as written. It could be faster using cDomlette, if available, and the port would be trivial: simply replace "parse(filename)" with "NonvalidatingReader.parseUri(filename)". And if the document had elements with many largish text nodes, the string_value iterator I've presented would probably be faster than Christoph's getTextFromNode function.
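For readers without cDomlette installed, the string_value idea can be tried on the standard library's minidom. The following is only a minimal sketch: it assumes a current Python 3 interpreter (hence "yield from"), and the helper name doc_order_iter is chosen here for illustration rather than taken from the recipe.

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def doc_order_iter(node):
    """Yield node and all of its descendants in document order."""
    yield node
    for child in node.childNodes:
        yield from doc_order_iter(child)


def string_value(node):
    """Concatenate the data of every text node at or below node."""
    return "".join(
        n.data for n in doc_order_iter(node) if n.nodeType == Node.TEXT_NODE
    )


# Illustrative input only; any well-formed document would do.
doc = parseString("<a>one<b>two</b>three</a>")
text = string_value(doc.documentElement)
print(text)
```

Joining once at the end avoids building many intermediate strings, which is where the benefit for large text nodes would be expected to come from.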
Andrew Cooke's "practical python, xml and dom" gives a brief discussion and example of using the DOM. Unfortunately, the author had to scrounge for any useful detail, because of the poor documentation of DOM for PyXML at the time. For example, he had to grep the source code to see whether there was a pretty-printer routine for PyXML DOM.
His code uses the slower 4DOM, and even then uses some deprecated APIs (of course he could hardly have known better). Here is an update of his first example to use the faster domlettes and the updated APIs:
from __future__ import generators
import os, sys
from xml.dom import Node
from Ft.Xml.Domlette import Print, PrettyPrint
from Ft.Xml.Domlette import NonvalidatingReader

def doc_order_iterator_filter(node, filter_func):
    if filter_func(node):
        yield node
    for child in node.childNodes:
        for cn in doc_order_iterator_filter(child, filter_func):
            if filter_func(cn):
                yield cn
    return

def get_elements_by_tag_name_ns(node, ns, local):
    return doc_order_iterator_filter(node, lambda n: n.nodeType == Node.ELEMENT_NODE and n.namespaceURI == ns and n.localName == local)

def get_first_element_by_tag_name_ns(node, ns, local):
    return get_elements_by_tag_name_ns(node, ns, local).next()

def string_value(node):
    text_nodes = doc_order_iterator_filter(node, lambda n: n.nodeType == Node.TEXT_NODE)
    return u''.join([ n.data for n in text_nodes ])

def addFile(doc, file, filelist):
    name = doc.createElementNS(None, "name")
    name.appendChild(doc.createTextNode(file))
    file = doc.createElementNS(None, "file")
    file.appendChild(name)
    filelist.appendChild(file)

def main():
    doc = NonvalidatingReader.parseUri(sys.argv[1])
    #don't traverse the whole DOM tree just to find the filelist element
    #I assume doc.documentElement was not used in case the filelist
    #XML format is embedded in another
    filelist = get_first_element_by_tag_name_ns(doc, None, "filelist")
    #Compute the list of names once: the algorithm doesn't require more
    names = get_elements_by_tag_name_ns(doc, None, "name")
    #Convert once to list of strings
    names = [ string_value(n) for n in names ]
    for file in os.listdir(sys.argv[2]):
        if not file in names:
            addFile(doc, file, filelist)
    PrettyPrint(doc)

if __name__ == "__main__": main()
Just as a crude benchmark, the original code took 3 seconds to run for a particular data set. The version above takes 0.4 seconds. I think the advantage would increase dramatically as the number of files in the given directory increases.
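Much of that gain presumably comes from collecting the existing names once instead of searching the tree for every file. The sketch below illustrates just that step on the standard library's minidom, assuming Python 3. The sample XML, the candidates list (standing in for os.listdir) and the helper names are all hypothetical, and the use of a set for membership tests is a substitution made here, not part of the listing above.

```python
from xml.dom.minidom import parseString

# Hypothetical sample data standing in for the real filelist document.
XML = (
    "<filelist>"
    "<file><name>a.txt</name></file>"
    "<file><name>b.txt</name></file>"
    "</filelist>"
)
doc = parseString(XML)
filelist = doc.documentElement


def text_of(element):
    """Join the data of the element's direct text children."""
    return "".join(
        c.data for c in element.childNodes if c.nodeType == c.TEXT_NODE
    )


def add_file(doc, filename, filelist):
    """Append <file><name>filename</name></file> to filelist."""
    name = doc.createElement("name")
    name.appendChild(doc.createTextNode(filename))
    entry = doc.createElement("file")
    entry.appendChild(name)
    filelist.appendChild(entry)


# Walk the tree once and keep the names in a set for fast lookups.
known = {text_of(e) for e in doc.getElementsByTagName("name")}

# Stand-in for os.listdir(some_directory).
candidates = ["a.txt", "c.txt", "b.txt", "d.txt"]
missing = [f for f in candidates if f not in known]
for filename in missing:
    add_file(doc, filename, filelist)

print(doc.toprettyxml(indent="  "))
```

With this shape the tree is traversed once regardless of how many files the directory holds, so the per-file cost is a single set lookup.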
A lot of the other discussion in the article is based on 4DOM, and therefore I cannot recommend using the classes he mentions. For example, in place of the Visitor, WalkerInterface and PreOrderWalker classes he mentions, I'd recommend using generator idioms such as those I've been presenting.
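As an illustration of the generator idiom standing in for a pre-order walker class, here is a minimal sketch on the standard library's minidom, assuming Python 3. The function name preorder and the sample document are made up for this example; this is not a port of the 4DOM classes.

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def preorder(node, depth=0):
    """Yield (depth, node) pairs in document (pre-order) order."""
    yield depth, node
    for child in node.childNodes:
        yield from preorder(child, depth + 1)


# Illustrative document only.
doc = parseString("<root><a><b/></a><c/></root>")

# The "visitor" logic becomes an ordinary loop or comprehension.
outline = [
    (depth, node.tagName)
    for depth, node in preorder(doc.documentElement)
    if node.nodeType == Node.ELEMENT_NODE
]

for depth, tag in outline:
    print("  " * depth + tag)
```

The caller decides what to do with each node inside a plain loop, so no visitor subclass or callback registration is needed.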
