12.2 Checking XML Well-Formedness
Credit: Paul Prescod
from xml.sax.handler import ContentHandler
from xml.sax import make_parser
from glob import glob
import sys

def parsefile(file):
    # Parse with a do-nothing handler; any syntax error raises an exception.
    parser = make_parser()
    parser.setContentHandler(ContentHandler())
    parser.parse(file)

# Each argument may be a wildcard pattern, so expand it with glob.
for arg in sys.argv[1:]:
    for filename in glob(arg):
        try:
            parsefile(filename)
            print("%s is well-formed" % filename)
        except Exception as e:
            print("%s is NOT well-formed! %s" % (filename, e))
A text is a well-formed XML document if it adheres to all the basic syntax rules for XML documents. In other words, it has a single root element, all tags are properly nested and closed, attribute values are quoted, the XML declaration (if one is present) is correct, and so on.
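To make these rules concrete, here is a minimal sketch (Python 3, standard library only) that runs a few small documents through xml.sax.parseString. The helper name is_well_formed and the sample documents are made up for illustration; they are not part of the recipe.

```python
from xml.sax import parseString, SAXParseException
from xml.sax.handler import ContentHandler

def is_well_formed(data):
    """Return True if the bytes in `data` parse as well-formed XML."""
    try:
        # The base ContentHandler ignores every event, so only the
        # parser's own syntax checks matter here.
        parseString(data, ContentHandler())
        return True
    except SAXParseException:
        return False

# Hypothetical sample documents, one per rule mentioned above.
samples = {
    "single root, nested tags": b"<a><b>text</b></a>",
    "two root elements": b"<a/><b/>",
    "improperly nested tags": b"<a><b></a></b>",
    "unquoted attribute value": b"<a x=1/>",
}

for label, data in samples.items():
    print("%-26s %s" % (label, is_well_formed(data)))
```

Only the first sample should be reported as well-formed; each of the others breaks one of the rules listed above.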
This recipe uses the SAX API with a dummy ContentHandler that does nothing. Generally, when we parse an XML document with SAX, we use a ContentHandler instance to process the document's contents. But in this case, we only want to know if the document meets the most fundamental syntax constraints of XML; therefore, there is no processing that we need to do, and the do-nothing handler suffices.
$ python wellformed.py test.xml
test.xml is NOT well-formed! test.xml:1002:2: mismatched tag
The message gives the file name, line number, and column number: the parser found a mismatched tag at column 2 of line 1,002.
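The location in that message comes from the exception object itself. As a rough sketch (Python 3; the function name describe_error and the sample document are illustrative assumptions, not part of the recipe), the SAXParseException methods getLineNumber, getColumnNumber, and getMessage expose the pieces separately:

```python
from xml.sax import parseString, SAXParseException
from xml.sax.handler import ContentHandler

def describe_error(data):
    """Return (line, column, message) for the first error, or None."""
    try:
        parseString(data, ContentHandler())
    except SAXParseException as e:
        return e.getLineNumber(), e.getColumnNumber(), e.getMessage()
    return None

# Hypothetical document: the closing tag on line 3 does not match <b>.
broken = b"<a>\n  <b>\n  </c>\n</a>"
print(describe_error(broken))
```

Having the line and column as separate values is handy if the script should feed its results to an editor or another tool rather than print them for a person to read.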
This recipe does not check adherence to a DTD or schema; that is a separate procedure called validation. The script should perform quite well, precisely because it does only this minimal core task.
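The difference between well-formedness and validity can be seen in a small sketch (Python 3, assuming the default Expat-based parser that ships with Python, which does not validate). The sample document and the helper name parses are hypothetical:

```python
from xml.sax import parseString, SAXParseException
from xml.sax.handler import ContentHandler

def parses(data):
    """Return True if `data` gets through the SAX parser without error."""
    try:
        parseString(data, ContentHandler())
        return True
    except SAXParseException:
        return False

# Hypothetical document: the internal DTD says <note> must contain exactly
# one <to> element, but the body contains <from> instead. That makes the
# document invalid, yet every tag is still properly nested and closed.
doc = b"""<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to)>
  <!ELEMENT to (#PCDATA)>
]>
<note><from>Paul</from></note>
"""

print(parses(doc))                          # DTD violation goes unnoticed
print(parses(doc.replace(b"</note>", b"")))  # missing end tag is caught
```

The first call succeeds because the parser never compares the content against the DTD; the second fails because dropping the end tag breaks a basic syntax rule.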
12.2.4 See Also
Recipe 12.3, Recipe 12.4, and Recipe 12.6 for other uses of the SAX API; the PyXML package (http://pyxml.sourceforge.net/) includes the pure-Python validating parser xmlproc, which checks the conformance of XML documents to specific DTDs; the PyRXP package from ReportLab is a wrapper around the faster validating parser RXP (http://www.reportlab.com/xml/pyrxp.html), which is available under the GPL license.