The following discussion is inspired by this message and the response.
If you use xml.dom.minidom, the built-in ways to re-serialize are the toxml and toprettyxml methods:
DOC="<DESC><PARA>foo</PARA><PARA>bar</PARA></DESC>" from xml.dom import minidom doc = minidom.parseString(DOC) normal = doc.toxml() prettied = doc.toprettyxml()
The string "normal" will now be
<?xml version="1.0" ?> <DESC><PARA>foo</PARA><PARA>bar</PARA></DESC>
The string "prettied" will be something like (with tabs converted to 4 spaces):
<?xml version="1.0" ?>
<DESC>
<PARA>
foo
</PARA>
<PARA>
bar
</PARA>
</DESC>
Which is usually not what one wants. The added white space within the PARA tags is not even allowed under considerations of ignorable white-space. Both of these methods also do not deal well with unencodable characters.
With PyXML you can get a safer output by using the 4DOM printer. Append to the above script:
from xml.dom.ext import PrettyPrint PrettyPrint(doc)
This should output
<?xml version='1.0' encoding='UTF-8'?> <!DOCTYPE DESC> <DESC> <PARA>foo</PARA> <PARA>bar</PARA> </DESC>
which not only doesn't abuse white space, but also adds the encoding to the XML declaration, which is always good practice. PyXML's PrettyPrint should work with any DOM node that follows the Python protocol. There is also xml.dom.ext.Print which leaves white space entirely as it is.
If you are running 4Suite 0.12.0 or 1.0, you can use the Domlette printers, which are faster and have some important character encoding fixes. They have pretty much the same API as the 4DOM printer.
Here are some serialization examples using 4Suite's Domlette printers, given a document node 'doc':
from Ft.Xml.Domlette import Print, PrettyPrint
# basic serialization to sys.stdout
Print(doc)
# ...with extra whitespace (indenting)
PrettyPrint(doc)
# ...using a single tab, rather than 2 spaces, to indent at each level
PrettyPrint(doc, indent='\t')
# serializing to a utf-8 encoded file
f = open('output.xml','w')
Print(doc, stream=f)
f.close()
# ...to an iso-8859-1 encoded file
f = open('output.xml','w')
Print(doc, stream=f, encoding='iso-8859-1')
f.close()
# ...to an ascii encoded string
import cStringIO
buf = cStringIO.StringIO()
Print(doc, stream=buf, encoding='us-ascii')
buf.close()
s = buf.getvalue()
# Normally, output syntax (XML or HTML) is chosen based on the DOM type,
# which is automatically detected. A Domlette or XML DOM can be output in
# HTML syntax if the asHtml=1 argument is given.
PrettyPrint(doc, asHtml=1)
Note there is no way to control namespace prefix mappings when using any of these serialization methods.
