Introducing the Schematron

A fresh approach to XML validation and reporting

Originally featured August 2000 on UNIX Insider, now ITWorld
Summary
Judging from the ongoing developments and debates about XML document validation, it's evident the language is in flux. In this article, writer and consultant Uche Ogbuji gets a handle on some of these changes and introduces the Schematron, a new validation and reporting methodology and toolkit. (4,200 words)
By Uche Ogbuji


This article introduces the Schematron, a current validation methodology. Understanding the Schematron requires familiarity with XML and XML DTDs, along with some familiarity with XPath and XSLT transforms. For those who might need some grounding on one or more of those areas, I've added some helpful links in the Resources section below.

A bit of background
As I pointed out in my last XML article for Unix Insider, although XML doesn't introduce any notable innovation in data processing, it has introduced many useful disciplines inherited from Standard Generalized Markup Language (SGML). Perhaps the core discipline in this regard is its native support for validation. One of XML's early promises involved its support for bundling a data schema with data, and its provision for standard schema discovery in cases where this bundling was not done. Of course, the real world has proven that this facility, while useful, is no panacea. Even if one has a schema for machine interpretation of a data set, how does one determine the semantics associated with that schema? A further problem is the schema methodology with which XML ends up being bundled: the Document Type Definition (DTD).

DTDs are an odd mix of very generic and very specific expressions. For instance, simple tasks such as specifying that an element can have a number of particular child elements within a given range can be very cumbersome using DTDs. Yet DTDs are generic enough to allow elegant design patterns such as architectural forms. The expressive shortcomings of DTDs, along with arguments that XML validation should not require a separate computer language (DTDs differ in syntax from XML instances), encouraged the W3C, XML's major standards body, to develop a new schema language for XML using XML syntax. The resulting XML Schema specification is currently in the candidate recommendation phase and will presumably hit version 1.0 soon.

One of the key XML developments since XML 1.0's release is XML Namespaces 1.0. This recommendation provided a mechanism for disambiguating XML names, but did so in a way that is unfriendly to DTD users. There are tricks for using namespaces with DTDs but they are quite arcane. Many members of the SGML school of thought have argued that namespaces are a brittle solution and solve too narrow a problem to justify such disruption in XML technologies. The reality, however, is that with XML-related standards from XSLT to XLink relying heavily on namespaces, we'll have to develop solutions to the core problems that take namespaces into account.

The W3C Schema specification was a long time in development, and along the way there were rumblings about the complexity of the emerging model. XML Schemas did fill a very ambitious charter: covering document structure, data-typing worthy of databases, and even abstract data-modeling such as inheritance and subclassing.

Due to the gap between the emergence of namespaces and the completion of XML Schemas, as well as fears that the coming specification was far too complex, the XML community, which has a remarkable history of practical problem solving, went to work.

One of the developments was Murata Makoto's Regular Language description for XML, RELAX (see Resources). RELAX provides a system for developing grammars to describe XML documents. It uses XML-like syntax and offers features similar to those offered by DTDs. It includes facilities offered by XML Schemas, such as schema annotation and data typing, as well as exotic additions of its own, such as hedge grammars. RELAX also supports namespaces and provides a clean and inherently modular approach to validation. It has become popular enough to spawn its own mailing lists and contributed tools, such as a DTD-to-RELAX translator.

Harnessing the power of XPath
In the meantime, XSLT emerged as a W3C standard and immediately established itself as one of the most successful XML-related products. Most people are familiar with XSLT as a tool to display XML content on legacy HTML-only browsers, but there is much more to XSLT, largely because XPath, which it uses to express patterns in the XML source, is such a well-conceived tool.

In fact, because XPath is such a comprehensive system for indicating patterns and selecting from among them in XML, there is no reason that it could not express structural concepts similar to those expressed in a DTD. Take the DTD in Listing 1:

Listing 1

<!ELEMENT ADDRBOOK (ENTRY*)>
<!ELEMENT ENTRY (NAME, ADDRESS, PHONENUM+, EMAIL) >
<!ATTLIST ENTRY
    ID ID #REQUIRED
>
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT ADDRESS (#PCDATA)>
<!ELEMENT PHONENUM (#PCDATA)>
<!ATTLIST PHONENUM
    DESC CDATA #REQUIRED
>
<!ELEMENT EMAIL (#PCDATA)>

Listing 2 is a sample document valid against this DTD:

Listing 2

<?xml version = "1.0"?>
<!DOCTYPE ADDRBOOK SYSTEM "addr_book.dtd">
<ADDRBOOK>
        <ENTRY ID="pa">
                <NAME>Pieter Aaron</NAME>
                <ADDRESS>404 Error Way</ADDRESS>
                <PHONENUM DESC="Work">404-555-1234</PHONENUM>
                <PHONENUM DESC="Fax">404-555-4321</PHONENUM>
                <PHONENUM DESC="Pager">404-555-5555</PHONENUM>
                <EMAIL>pieter.aaron@inter.net</EMAIL>
        </ENTRY>
        <ENTRY ID="en">
                <NAME>Emeka Ndubuisi</NAME>
                <ADDRESS>42 Spam Blvd</ADDRESS>
                <PHONENUM DESC="Work">767-555-7676</PHONENUM>
                <PHONENUM DESC="Fax">767-555-7642</PHONENUM>
                <PHONENUM DESC="Pager">800-SKY-PAGEx767676</PHONENUM>
                <EMAIL>endubuisi@spamtron.com</EMAIL>
        </ENTRY>
</ADDRBOOK>

Examine the declaration of the ENTRY element. It says that such elements must have at least four child elements: a NAME, an ADDRESS, one or more PHONENUMs, and an EMAIL. This can be expressed in XPath with a combination of the following three Boolean expressions (using the ENTRY element as the context):

  1. count(NAME) = 1 and count(ADDRESS) = 1 and count(EMAIL) = 1

  2. NAME[following-sibling::ADDRESS] and ADDRESS[following-sibling::PHONENUM] and PHONENUM[following-sibling::EMAIL]

  3. count(NAME|ADDRESS|PHONENUM|EMAIL) = count(*)

The first is true if and only if (iff) there is exactly one NAME, one ADDRESS, and one EMAIL. This establishes the occurrence rule for these children. The second is true iff there is a NAME followed by an ADDRESS, an ADDRESS followed by a PHONENUM, and a PHONENUM followed by an EMAIL. This establishes the sequence rule for the children. Note that the preceding-sibling axis could have been used to the same effect. The third expression is true iff there are no child elements other than NAME, ADDRESS, PHONENUM, and EMAIL. This establishes the (implied) DTD rule that elements are not permitted except where specified in the content model by name, or with the ANY symbol.
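As a rough cross-check of the three expressions, here is a minimal sketch in Python using only the standard library's xml.etree.ElementTree. Python is my choice for illustration and is not part of the toolchain used in this article; the helper name entry_checks and the two tiny test documents are hypothetical. ElementTree's XPath subset has no count() function or following-sibling axis, so each condition is spelled out by hand.

```python
import xml.etree.ElementTree as ET

ALLOWED = ("NAME", "ADDRESS", "PHONENUM", "EMAIL")

def entry_checks(entry):
    """Evaluate the three Boolean expressions with `entry` as the context node."""
    tags = [child.tag for child in entry]  # child element names in document order

    # 1. count(NAME) = 1 and count(ADDRESS) = 1 and count(EMAIL) = 1
    occurrence = all(tags.count(t) == 1 for t in ("NAME", "ADDRESS", "EMAIL"))

    # 2. X[following-sibling::Y] is non-empty when some X has a Y after it,
    #    that is, when the first X comes before the last Y.
    def followed_by(x, y):
        if x not in tags or y not in tags:
            return False
        first_x = tags.index(x)
        last_y = len(tags) - 1 - tags[::-1].index(y)
        return first_x < last_y

    sequence = (followed_by("NAME", "ADDRESS")
                and followed_by("ADDRESS", "PHONENUM")
                and followed_by("PHONENUM", "EMAIL"))

    # 3. count(NAME|ADDRESS|PHONENUM|EMAIL) = count(*)
    closed = all(t in ALLOWED for t in tags)

    return occurrence, sequence, closed

good = ET.fromstring("<ENTRY><NAME/><ADDRESS/><PHONENUM/><PHONENUM/><EMAIL/></ENTRY>")
bad = ET.fromstring("<ENTRY><NAME/><PHONENUM/><EMAIL/><ADDRESS/><SPAM/></ENTRY>")
print(entry_checks(good))  # (True, True, True)
print(entry_checks(bad))   # (True, False, False)
```

The second document passes the occurrence test but fails the sequence test (its ADDRESS has no PHONENUM after it) and the closed-content test (SPAM is not an allowed child).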

Your first reaction might be that the XPath expressions are much more verbose than the equivalent DTD specification. That is true in this case, though one can easily think of situations where the opposite would be true. The DTD version is more concise because it is carefully designed to model such occurrence and sequence patterns. XPath has a far more general purpose, and we are building the DTD equivalent through a series of primitives, each of which operates at a more granular conceptual level than the DTD equivalent. So it may be more wordy, but it has greater expressive power.

Let's say we wanted to specify that there can be multiple ADDRESS and EMAIL children, but that they must be of the same number. This task, which seems a simple enough extension of the previous content model, is beyond the abilities of DTD. This is not true for XPath. Because XPath gives a primitive but complete model of the document, it's an easy addition.

  1. count(NAME) = 1 and count(ADDRESS) = count(EMAIL)

  2. NAME[following-sibling::ADDRESS] and ADDRESS[following-sibling::PHONENUM] and PHONENUM[following-sibling::EMAIL]

  3. count(NAME|ADDRESS|PHONENUM|EMAIL) = count(*)

The only change is in expression one and it should be self-explanatory. This small foray beyond the scope of DTD illustrates the additional power offered by XPath. Of course, XPath can handle the attribute declarations as well. For example, the attribute declaration for PHONENUM in the DTD can be expressed as follows (again using the ENTRY element as context):

PHONENUM/@DESC

All these XPath expressions are fine in the abstract, but how would one actually use them for validation? The most convenient way is to write an XSLT transform that uses them to determine validity. Listing 3 is such a transform; it checks a subset of the rules in the address book DTD.

Listing 3

<?xml version="1.0"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">

  <xsl:template match="/">
    <xsl:if test='not(ADDRBOOK)'>
      Validation error: there must be an ADDRBOOK element at the root of the document.
    </xsl:if>
    <xsl:apply-templates select='*'/>
  </xsl:template>

  <xsl:template match="ENTRY">
    <xsl:if test='not(count(NAME) = 1)'>
      Validation error: ENTRY element missing a NAME child.
    </xsl:if>
    <xsl:if test='not(count(ADDRESS) = 1)'>
      Validation error: ENTRY element missing an ADDRESS child.
    </xsl:if>
    <xsl:if test='not(count(EMAIL) = 1)'>
      Validation error: ENTRY element missing an EMAIL child.
    </xsl:if>
    <xsl:if test='not(NAME[following-sibling::ADDRESS] and ADDRESS[following-sibling::PHONENUM] and PHONENUM[following-sibling::EMAIL])'>
      Validation error: ENTRY must have a NAME, ADDRESS, one or more PHONENUM, and an EMAIL in sequence
    </xsl:if>
    <xsl:if test='not(count(NAME|ADDRESS|PHONENUM|EMAIL) = count(*))'>
      Validation error: there is an extraneous element child of ENTRY
    </xsl:if>
    <xsl:apply-templates select='*'/>
  </xsl:template>

  <xsl:template match="PHONENUM">
    <xsl:if test='not(@DESC)'>
      Validation error: PHONENUM must have a DESC attribute
    </xsl:if>
    <xsl:apply-templates select='*'/>
  </xsl:template>

  <xsl:template match="*">
    <xsl:apply-templates select='*'/>
  </xsl:template>

</xsl:transform>

When run with a valid document such as the one above, this stylesheet would produce no output; with an invalid document such as Listing 4, however, it's a different story.

Listing 4

<?xml version = "1.0"?>
<ADDRBOOK>
        <ENTRY ID="pa">
                <NAME>Pieter Aaron</NAME>
                <PHONENUM DESC="Work">404-555-1234</PHONENUM>
                <PHONENUM DESC="Fax">404-555-4321</PHONENUM>
                <PHONENUM DESC="Pager">404-555-5555</PHONENUM>
                <EMAIL>pieter.aaron@inter.net</EMAIL>
        </ENTRY>
        <ENTRY ID="en">
                <NAME>Emeka Ndubuisi</NAME>
                <PHONENUM DESC="Work">767-555-7676</PHONENUM>
                <PHONENUM DESC="Fax">767-555-7642</PHONENUM>
                <PHONENUM DESC="Pager">800-SKY-PAGEx767676</PHONENUM>
                <EMAIL>endubuisi@spamtron.com</EMAIL>
                <ADDRESS>42 Spam Blvd</ADDRESS>
                <SPAM>Make money fast</SPAM>
        </ENTRY>
        <EXTRA/>
</ADDRBOOK>

Note that all of the XPath expressions we came up with are placed in xsl:if tests and then negated. This is because a message is emitted only when a condition is not met. Running this source against the validation stylesheet using an XSLT processor results in the following output:

    Validation error: ENTRY element missing an ADDRESS child.

    Validation error: ENTRY must have a NAME, ADDRESS, one or more PHONENUM, and an EMAIL in sequence

    Validation error: ENTRY must have a NAME, ADDRESS, one or more PHONENUM, and an EMAIL in sequence

    Validation error: there is an extraneous element child of ENTRY

And so we have our validation result. Note that it's a report of the document and that you can see all the validation errors at once. Most XML parsers will only give you one error at a time. But the real power of this XSLT-based validation report is that it's just that: a report. We used it for DTD-based XML validation, but it's not hard to see how this could be extended to more sophisticated data-management needs. For instance, suppose we wanted to examine address book documents for email addresses in the .gov domain. This is beyond the realm of validation, but it is an example of reporting.

While it might be argued whether or not validation and reporting are cut from the same cloth, in practice, XML document validation can be treated as a subset of XML document reporting, and XPath and XSLT provide a powerful toolkit for document validation.

Introducing the Schematron
The Schematron (see Resources for a link) is a validation and reporting methodology and toolkit developed by Rick Jelliffe, a member of the W3C Schema working group. Without denigrating the efforts of his group, Mr. Jelliffe has pointed out that XML Schemas may be too complex for many users, and so he comes at validation from the same direction as the DTD.

Jelliffe developed the Schematron as a simple tool to harness the power of XPath, attacking the schema problem from a new angle. As he writes on his Website (see Resources for a link), "The Schematron differs in basic concept from other schema languages in that it is not based on grammars but on finding tree patterns in the parsed document. This approach allows many kinds of structures to be represented which are inconvenient and difficult in grammar-based schema languages."

The Schematron is no more than an XML vocabulary that can be used as an instruction set for generating stylesheets such as the one above. For instance, Listing 5 shows what our XPath-based validation might look like in the Schematron:

Listing 5


<schema xmlns='http://www.ascc.net/xml/schematron'>
        <pattern name="Structural Validation">
                <!-- Use a hack to set the root context.  Necessary because of
                     a bug in the schematron 1.3 meta-transforms. -->
                <rule context="/*">
                        <assert test="../ADDRBOOK">Validation error: there must be an ADDRBOOK element at the root of the document.</assert>
                </rule>
                <rule context="ENTRY">
                        <assert test="count(NAME) = 1">Validation error: <name/> element missing a NAME child.</assert>
                        <assert test="count(ADDRESS) = 1">Validation error: <name/> element missing an ADDRESS child.</assert>
                        <assert test="count(EMAIL) = 1">Validation error: <name/> element missing an EMAIL child.</assert>
                        <assert test="NAME[following-sibling::ADDRESS] and ADDRESS[following-sibling::PHONENUM] and PHONENUM[following-sibling::EMAIL]">Validation error: <name/> must have a NAME, ADDRESS, one or more PHONENUM, and an EMAIL in sequence</assert>
                        <assert test="count(NAME|ADDRESS|PHONENUM|EMAIL) = count(*)">Validation error: there is an extraneous element child of ENTRY</assert>
                </rule>
                <rule context="PHONENUM">
                        <assert test="@DESC">Validation error: <name/> must have a DESC attribute</assert>
                </rule>
        </pattern>
</schema>

The root element in the Schematron is the schema element in the appropriate namespace. It contains one or more patterns, each of which represents a conceptual grouping of declarations. Patterns contain one or more rules, each of which sets a context for a series of declarations. This is not only a conceptual context, but one used for the XPath expressions in declarations within each rule.

Each rule contains a collection of asserts, reports, and keys. You can see asserts at work in the listing above. Asserts here are similar to asserts in C. They represent a declaration that a condition is true, and a signal if it is not. In the Schematron, assert elements specify that if the condition in their test attribute is not true, the text message within the assert elements will be printed. You'll note that the assert messages contain empty name elements. This is a convenient shorthand for the name of the current context node, given by the parent rule element, which makes it easy to reuse asserts from context to context.

Reports are similar to asserts, except that they output their contents when the condition in their test attribute is true rather than false. They also allow the use of the empty name element. Reports, as their name implies, tend to make sense for structural reporting tasks. For instance, to implement our earlier example of reporting email addresses in the .gov domain, we might add the following rule to our Schematron:

                <rule context="EMAIL">
                        <report test="contains(., '.gov') and not(substring-after(., '.gov'))">This address book contains government contacts.</report>
                </rule>
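To make the difference between asserts and reports concrete, here is a minimal sketch in Python (standard library only). It is not how the Schematron is implemented: the function run_rule is hypothetical, Python predicates stand in for the XPath in each test attribute, and a "{name}" placeholder stands in for the empty name element. It also ignores patterns, so it is a simplification of the real rule-matching behavior.

```python
import xml.etree.ElementTree as ET

def run_rule(doc, context, checks):
    """Apply one rule. `context` is an element name; `checks` is a list of
    (kind, test, message) tuples, where kind is 'assert' or 'report'."""
    messages = []
    for node in doc.iter(context):
        for kind, test, message in checks:
            result = bool(test(node))
            # An assert speaks up when its test fails; a report when it succeeds.
            if (kind == "assert" and not result) or (kind == "report" and result):
                # "{name}" plays the role of the empty <name/> element.
                messages.append(message.format(name=node.tag))
    return messages

doc = ET.fromstring(
    "<ADDRBOOK>"
    "<ENTRY><NAME>Manfredo Manfredi</NAME><EMAIL>mpm@scudetto.dom.gov</EMAIL></ENTRY>"
    "<ENTRY><EMAIL>pieter.aaron@inter.net</EMAIL></ENTRY>"
    "</ADDRBOOK>"
)

entry_rule = [("assert", lambda e: len(e.findall("NAME")) == 1,
               "Validation error: {name} element missing a NAME child.")]
email_rule = [("report", lambda e: (e.text or "").strip().endswith(".gov"),
               "This address book contains government contacts.")]

messages = run_rule(doc, "ENTRY", entry_rule) + run_rule(doc, "EMAIL", email_rule)
print("\n".join(messages))
```

With this input, the assert fires once for the second ENTRY, which has no NAME, and the report fires once for the .gov address.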

I've already mentioned that namespaces are an important reason to seek a validation methodology other than DTDs. Schematron supports namespaces through XPath's own namespace support. For instance, if we wanted to validate that all child elements of ENTRY in our address book document were in a particular namespace, we could do so by adding an assert to check the count of elements in a particular namespace. Assume that the prefix addr is bound to the valid namespace in the following example:

count(addr:*) = count(*)

Perhaps that's too draconian for your practical needs and you also want to allow elements in a particular namespace reserved for extensions:

count(addr:*) + count(ext:*) = count(*)

Maybe rather than permitting a single particular extension namespace, you want to allow any elements with namespaces whose URI contains the string extension:

count(addr:*) + count(*[contains(namespace-uri(.), 'extension')]) = count(*)

With this latter addition and the addition of a report for email addresses in the .gov domain, our Schematron looks like Listing 6:

Listing 6

<schema xmlns='http://www.ascc.net/xml/schematron'>

        <ns prefix='addr' uri='http://addressbookns.com'/>

        <pattern name="Structural Validation">
                <!-- Use a hack to set the root context.  Necessary because of
                     a bug in the schematron 1.3 meta-transforms. -->
                <rule context="/*">
                        <assert test="../addr:ADDRBOOK">Validation error: there must be an ADDRBOOK element at the root of the document.</assert>
                </rule>
                <rule context="addr:ENTRY">
                        <assert test="count(addr:*) + count(*[contains(namespace-uri(.), 'extension')]) = count(*)">
Validation error: all children of <name/> must either be in the namespace 'http://addressbookns.com' or in a namespace containing the substring 'extension'.
                        </assert>
                        <assert test="count(addr:NAME) = 1">
Validation error: <name/> element missing a NAME child.
                        </assert>
                        <assert test="count(addr:ADDRESS) = 1">
Validation error: <name/> element missing an ADDRESS child.
                        </assert>
                        <assert test="count(addr:EMAIL) = 1">
Validation error: <name/> element missing an EMAIL child.
                        </assert>
                        <assert test="addr:NAME[following-sibling::addr:ADDRESS] and addr:ADDRESS[following-sibling::addr:PHONENUM] and addr:PHONENUM[following-sibling::addr:EMAIL]">
Validation error: <name/> must have a NAME, ADDRESS, one or more PHONENUM, and an EMAIL in sequence
                        </assert>
                        <assert test="count(addr:NAME|addr:ADDRESS|addr:PHONENUM|addr:EMAIL) = count(*)">
Validation error: there is an extraneous element child of ENTRY
                        </assert>
                </rule>
                <rule context="addr:PHONENUM">
                        <assert test="@DESC">
Validation error: <name/> must have a DESC attribute
                        </assert>
                </rule>
        </pattern>
        <pattern name="Government Contact Report">
                <rule context="addr:EMAIL">
                        <report test="contains(., '.gov') and not(substring-after(., '.gov'))">
This address book contains government contacts.
                        </report>
                </rule>
        </pattern>
</schema>

Note the new top-level element, ns. We use this to declare the namespace we'll be incorporating into the Schematron rules. If you have multiple namespaces to declare, use one ns element for each. There are advanced uses of Schematron namespace declarations which you can read about on the Schematron site.
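The namespace test in the first assert of Listing 6 can be sketched in Python as well (standard library only). This is an illustration under stated assumptions, not part of the Schematron toolkit: the helper names are hypothetical, and so is the extension namespace URI http://example.com/extension/v1 in the sample document. ElementTree stores a namespaced element name as "{uri}local", which is what the first helper unpacks.

```python
import xml.etree.ElementTree as ET

ADDR_NS = "http://addressbookns.com"

def namespace_uri(element):
    """Return the namespace URI of an element, or '' if it has none."""
    tag = element.tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""

def children_in_allowed_namespaces(entry):
    """Mirror count(addr:*) + count(*[contains(namespace-uri(.), 'extension')]) = count(*)."""
    children = list(entry)
    in_addr = sum(1 for c in children if namespace_uri(c) == ADDR_NS)
    in_ext = sum(1 for c in children if "extension" in namespace_uri(c))
    return in_addr + in_ext == len(children)

doc = ET.fromstring(
    "<ADDRBOOK xmlns='http://addressbookns.com'>"
    "<ENTRY ID='pa'><NAME xmlns='http://bogus.com'>Pieter Aaron</NAME>"
    "<ADDRESS>404 Error Way</ADDRESS></ENTRY>"
    "<ENTRY ID='en'><NAME>Emeka Ndubuisi</NAME>"
    "<note xmlns='http://example.com/extension/v1'>met at a conference</note></ENTRY>"
    "</ADDRBOOK>"
)
results = [children_in_allowed_namespaces(e) for e in doc.iter("{%s}ENTRY" % ADDR_NS)]
print(results)  # [False, True]
```

The first ENTRY fails because its NAME is in an unrelated namespace; the second passes because its extra child is in a namespace whose URI contains the string extension.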

This was a pretty quick whirl through the Schematron. For more instruction, there is the tidy tutorial put together by Dr Miloslav Nic (see Resources).

Putting the Schematron to work
Remember that a Schematron document can be thought of as a set of instructions for generating special validation and report stylesheets, as we demonstrated earlier. This is the most common way of using the Schematron in practice. Conveniently, XSLT has all the power to convert Schematron specifications to XSLT-based validators. There is a metastylesheet available at the Schematron Website which, when run against a Schematron specification, will generate a specialized validator/reporter stylesheet, which can then be run against target source documents.

For instance, suppose I have the above Schematron specification as listing6.schematron. I can turn it into a validator/reporter stylesheet as follows:

[uogbuji@borgia code]$ 4xslt.py listing6.schematron ~/devel/Ft/Xslt/test_suite/borrowed/schematron-skel-ns.xslt > addrbook.validator.xslt

As with all examples in this article, I'm using the 4XSLT stylesheet processor, which is an XSLT 1.0-compliant stylesheet processor written in Python and distributed as open source by my company, Fourthought. I ran the above from Linux. The first argument to 4xslt.py is the XML source document, the Schematron specification in the above listing, and the second is the stylesheet to be used, the Schematron namespace-aware metastylesheet. Next, I redirect the standard output to the file addrbook.validator.xslt, which becomes my validator/reporter stylesheet. I then run the validator stylesheet against the source document in Listing 7:

Listing 7

<?xml version = "1.0"?>
<ADDRBOOK xmlns='http://addressbookns.com'>
        <ENTRY ID="pa">
                <NAME xmlns='http://bogus.com'>Pieter Aaron</NAME>
                <ADDRESS>404 Error Way</ADDRESS>
                <PHONENUM DESC="Work">404-555-1234</PHONENUM>
                <PHONENUM DESC="Fax">404-555-4321</PHONENUM>
                <PHONENUM DESC="Pager">404-555-5555</PHONENUM>
                <EMAIL>pieter.aaron@inter.net</EMAIL>
        </ENTRY>
        <ENTRY ID="en">
                <NAME xmlns='http://bogus.com'>Emeka Ndubuisi</NAME>
                <ADDRESS>42 Spam Blvd</ADDRESS>
                <PHONENUM DESC="Work">767-555-7676</PHONENUM>
                <PHONENUM DESC="Fax">767-555-7642</PHONENUM>
                <PHONENUM DESC="Pager">800-SKY-PAGEx767676</PHONENUM>
                <EMAIL>endubuisi@spamtron.com</EMAIL>
        </ENTRY>
</ADDRBOOK>

This yields the following output:

[uogbuji@borgia code]$ 4xslt.py listing7.xml addrbook.validator.xslt
Validation error: all children of ENTRY must either be in the namespace 'http://addressbookns.com' or in a namespace containing the substring 'extension'.Validation error: ENTRY element missing a NAME child.Validation error: ENTRY must have a NAME, ADDRESS, one or more PHONENUM, and an EMAIL in sequenceValidation error: there is an extraneous element child of ENTRYValidation error: all children of ENTRY must either be in the namespace 'http://addressbookns.com' or in a namespace containing the substring 'extension'.Validation error: ENTRY element missing a NAME child.Validation error: ENTRY must have a NAME, ADDRESS, one or more PHONENUM, and an EMAIL in sequenceValidation error: there is an extraneous element child of ENTRY

Hmm. Rather a mess. Looks as if there were quite a few messages combined without clear separation. There were actually two sets of messages, one for each ENTRY in the source document, because we caused the same cascade of validation errors in both by messing with the namespace of the NAME element. The two messages ran together because we used the skeleton Schematron metastylesheet. The skeleton comes with only basic output support, and normalizes space in all output, running the results together.

There's a good chance this is not what you want, and luckily Schematron is designed to be quite extensible. There are several Schematron metastylesheets on the Schematron home page that build on the skeleton by importing it. They range from basic, clearer text messages to HTML for browser integration. Using the sch-basic metastylesheet rather than the skeleton yields the following:

[uogbuji@borgia code]$ 4xslt.py listing6.schematron ~/devel/Ft/Xslt/test_suite/borrowed/sch-basic.xslt > addrbook.validator.xslt
[uogbuji@borgia code]$ 4xslt.py listing7.xml addrbook.validator.xslt 
In pattern Structural Validation:
   Validation error: all children of ENTRY must either be in the namespace 'http://addressbookns.com' or in a namespace containing the substring 'extension'.
In pattern Structural Validation:
   Validation error: ENTRY element missing a NAME child.
In pattern Structural Validation:
   Validation error: ENTRY must have a NAME, ADDRESS, one or more PHONENUM, and an EMAIL in sequence
In pattern Structural Validation:
   Validation error: there is an extraneous element child of ENTRY
In pattern Structural Validation:
   Validation error: all children of ENTRY must either be in the namespace 'http://addressbookns.com' or in a namespace containing the substring 'extension'.
In pattern Structural Validation:
   Validation error: ENTRY element missing a NAME child.
In pattern Structural Validation:
   Validation error: ENTRY must have a NAME, ADDRESS, one or more PHONENUM, and an EMAIL in sequence
In pattern Structural Validation:
   Validation error: there is an extraneous element child of ENTRY

To round things up, Listing 8 is a source document that validates cleanly against our sample Schematron:

Listing 8

<?xml version = "1.0"?>
<ADDRBOOK xmlns='http://addressbookns.com'>
        <ENTRY ID="pa">
                <NAME>Pieter Aaron</NAME>
                <ADDRESS>404 Error Way</ADDRESS>
                <PHONENUM DESC="Work">404-555-1234</PHONENUM>
                <PHONENUM DESC="Fax">404-555-4321</PHONENUM>
                <PHONENUM DESC="Pager">404-555-5555</PHONENUM>
                <EMAIL>pieter.aaron@inter.net</EMAIL>
        </ENTRY>
        <ENTRY ID="en">
                <NAME>Manfredo Manfredi</NAME>
                <ADDRESS>4414 Palazzo Terrace</ADDRESS>
                <PHONENUM DESC="Work">888-555-7676</PHONENUM>
                <PHONENUM DESC="Fax">888-555-7677</PHONENUM>
                <EMAIL>mpm@scudetto.dom.gov</EMAIL>
        </ENTRY>
</ADDRBOOK>

We can test this as follows:

[uogbuji@borgia code]$ 4xslt.py listing8.xml addrbook.validator.xslt 
In pattern Government Contact Report:
   This address book contains government contacts.

Everything is in the correct namespace, so we get no validation errors. However, notice that we did get the report from the email address in the .gov domain.

This is all very well and good, but no doubt you're wondering whether XSLT is fast enough to suit your real-world validation needs. This will depend on your requirements. In my experience, it is rarely necessary to validate a document every time it is processed. If you have attributes with default values, or no control over the data sources throughout your processing applications, you may have no choice. In this case, validation by an XML 1.0-compliant validating parser such as xmlproc is almost certainly faster than XSLT-based Schematron. But then again, there is no hard requirement that a Schematron processor must use XSLT. It would not be terribly difficult, given an efficient XPath library, to write a specialized Schematron application that doesn't need translation from metastylesheets.

But to give a quick comparison, I parsed a 170 KB address book example similar to the one above but with more entries. Using xmlproc and DTD validation, it took 7.25 seconds. Parsing this document without validation and then applying the Schematron stylesheet took 10.61 seconds, hardly a great penalty.

There are several things that DTDs provide that Schematron cannot, such as entity and notation definitions, and fixed or default attribute values. RELAX does not provide any of these facilities either, but XML Schemas provide them all -- as they must, because they are positioned as a DTD replacement. RELAX makes no such claim, and indeed the RELAX documentation has a section on using RELAX in concert with DTDs.

We have already mentioned that Schematron, far from claiming to be a DTD replacement, is positioned as an entirely fresh approach to validation. Nevertheless, attribute-value defaulting can be a useful way to reduce the clutter of XML documents for human readability, so we'll examine one way to provide default attributes in association with Schematron.

Remember that you're always free to combine DTDs with Schematron to get the best of both worlds, but if you want to leave DTDs behind, you can still get attribute-defaulting at the cost of one more pass through the document when the values are to be substituted. This can be done by a stylesheet that transforms a source document into a result that is identical except that all default attribute values are given.
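The extra defaulting pass is simple in outline. The following minimal sketch uses Python (standard library only) in place of the stylesheet described above, purely for illustration. The DEFAULTS table is hypothetical: in our address book DTD the DESC attribute is actually #REQUIRED, so the default of "Home" is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical defaults: element name -> {attribute name: default value}.
DEFAULTS = {"PHONENUM": {"DESC": "Home"}}

def apply_defaults(root, defaults=DEFAULTS):
    """Fill in missing attributes in place; values already present are kept."""
    for element in root.iter():
        for attr, value in defaults.get(element.tag, {}).items():
            element.attrib.setdefault(attr, value)
    return root

doc = ET.fromstring(
    "<ENTRY ID='pa'>"
    "<PHONENUM DESC='Work'>404-555-1234</PHONENUM>"
    "<PHONENUM>404-555-9999</PHONENUM>"
    "</ENTRY>"
)
apply_defaults(doc)
descs = [p.get("DESC") for p in doc.iter("PHONENUM")]
print(descs)  # ['Work', 'Home']
```

The result is the same document with every defaulted attribute made explicit, which can then be handed to the Schematron validator or to downstream processing.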

There are other features of Schematron for those interested in further exploration. For instance, it supports keys, a mechanism similar to DTD's ID and IDREF. There are some hooks for embedding and control through external applications.

A more formal introduction to Schematron is available in the Schematron specification (see Resources).

Conclusion
At Fourthought, we've used Schematron in deployed work products both for our clients and for ourselves. Because we already do a lot of work with XSLT, it's a very comfortable system and there's not much training required for XPath. To match the basic features of DTD, not a lot more knowledge is needed than path expressions, predicates, unions, the sibling and attribute axes, and a handful of functions. Performance has not been an issue because we typically have strong control over XML data in our systems and rarely use defaulted attributes. This allows us to validate only when a new XML datum is input, or an existing datum has been modified in our systems, reducing performance concerns.

Schematron is a clean, well-considered approach to validation and simple reporting. XML Schemas are significant, but it is debatable whether such a new and complex system is required for validation. RELAX and the Schematron both present simpler approaches coming from different angles, and might be a better fit for quick integration into XML systems. In any case, Schematron once again demonstrates the extraordinary reach of XSLT and the flexibility of XML as a data-management technology.

About the author
Uche Ogbuji is a consultant for and cofounder of Fourthought, a consulting firm that specializes in custom software development for enterprise applications. Fourthought uses XML to provide Web-based integration platforms for small or medium-sized businesses.