
12.12 Module: XML Lexing (Shallow Parsing)

Credit: Paul Prescod

It's not uncommon to want to work with the form of an XML document rather than with the structural information it contains (e.g., to change a bunch of entity references or element names). The XML may be slightly incorrect, enough to choke a traditional parser. In such cases, you need an XML lexer, also known as a shallow parser.

You might be tempted to hack together a regular expression or two to do some simple parsing of XML (or other structured text formats) rather than using the appropriate library module. Don't; it's not a trivial task to get the regular expressions right! However, the hard work has already been done for you in Example 12-1, which contains already-debugged regular expressions and supporting functions that you can use for shallow-parsing tasks on XML data (or, more importantly, on data that is almost, but not quite, correct XML, so that a real XML parser seizes up with error diagnostics when you try to parse it).

A traditional XML parser does a few tasks:

  • It breaks up the stream of text into logical components (tags, text, processing instructions, etc.).

  • It ensures that these components comply with the XML specification.

  • It throws away extra characters and reports the significant data. For instance, it would report tag names but not the less-than and greater-than signs around them.

The shallow parser in Example 12-1 performs only the first task. It breaks up the document and presumes that you know how to deal with the fragments yourself. That makes it efficient and forgiving of errors in the document.
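The core idea can be sketched with a deliberately simplified pattern (this is an illustration only, not the recipe's full grammar, which also handles comments, CDATA sections, DOCTYPE declarations, and processing instructions correctly):

```python
import re

# Naive sketch of shallow lexing: one alternative matches a run of
# character data, the other matches anything from "<" up to the next
# ">" (or end of input).  Example 12-1's patterns are far more careful.
NAIVE_XML_LEXER = re.compile(r"[^<]+|<[^>]*>?")

def naive_lex(data):
    # findall returns the matched tokens in document order
    return NAIVE_XML_LEXER.findall(data)

tokens = naive_lex("<a href='x'>hi &amp; bye</a>")
# Every character of the input lands in exactly one token, so joining
# the tokens reconstructs the original document verbatim.
assert "".join(tokens) == "<a href='x'>hi &amp; bye</a>"
```

Even this toy version shows the two properties that make shallow parsing attractive: the tokens round-trip to the original text, and nothing chokes on markup that a validating parser would reject.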

The lexxml function is the code's entry point. Call lexxml(data) to get back a list of tokens (strings that are bits of the document). This lexer also makes it easy to get back the exact original content of the document. Unless there is a bug in the recipe, the following code should always succeed:

tokens = lexxml(data)
data2 = "".join(tokens)
assert data == data2

If you find any bugs that disallow this, please report them! There is a second, optional argument to lexxml that allows you to get back only markup and ignore the text of the document. This is useful as a performance optimization when you care only about tags. The walktokens function in the recipe shows how to walk over the tokens and work with them.
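Since the lexer reports raw fragments, classifying a token is just a matter of inspecting its first and last characters, which is exactly the dispatch walktokens performs. The same logic as a standalone function (the name classify is illustrative, not part of the recipe) might look like:

```python
def classify(token):
    # Markup tokens start with "<"; everything else is character data.
    if not token.startswith("<"):
        return "text"
    if token.startswith("<?xml"):
        return "xml declaration"
    if token.startswith("<?"):
        return "processing instruction"
    if token.startswith("<!"):
        return "declaration"       # comment, CDATA, or DOCTYPE
    if token.startswith("</"):
        return "end-tag"
    if token.endswith("/>"):
        return "empty-tag"
    if token.endswith(">"):
        return "start-tag"
    return "error"                 # truncated or malformed markup

assert classify("<abc>") == "start-tag"
assert classify("</abc>") == "end-tag"
assert classify("Blah") == "text"
```

Returning a label rather than printing makes the dispatch easy to reuse, for example to filter a token list down to just the start-tags.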

Example 12-1. XML lexing
import re

class recollector:
    def __init__(self):
        self.res = {}
    def add(self, name, reg):
        re.compile(reg) # Check that it is valid
        self.res[name] = reg % self.res

collector = recollector()
a = collector.add

a("TextSE" , "[^<]+")
a("UntilHyphen" , "[^-]*-")
a("Until2Hyphens" , "%(UntilHyphen)s(?:[^-]%(UntilHyphen)s)*-")
a("CommentCE" , "%(Until2Hyphens)s>?")
a("UntilRSBs" , "[^\\]]*](?:[^\\]]+])*]+")
a("CDATA_CE" , "%(UntilRSBs)s(?:[^\\]>]%(UntilRSBs)s)*>" )
a("S" , "[ \\n\\t\\r]+")
a("NameStrt" , "[A-Za-z_:]|[^\\x00-\\x7F]")
a("NameChar" , "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]")
a("Name" , "(?:%(NameStrt)s)(?:%(NameChar)s)*")
a("QuoteSE" , "\"[^\"]*\"|'[^']*'")
a("DT_IdentSE" , "%(S)s%(Name)s(?:%(S)s(?:%(Name)s|%(QuoteSE)s))*" )
a("MarkupDeclCE" , "(?:[^\\]\"'><]+|%(QuoteSE)s)*>" )
a("S1" , "[\\n\\r\\t ]")
a("UntilQMs" , "[^?]*\\?+")
a("PI_Tail" , "\\?>|%(S1)s%(UntilQMs)s(?:[^>?]%(UntilQMs)s)*>" )
a("DT_ItemSE",
  "<(?:!(?:--%(Until2Hyphens)s>|[^-]%(MarkupDeclCE)s)|"
  "\\?%(Name)s(?:%(PI_Tail)s))|%%%(Name)s;|%(S)s")
a("DocTypeCE",
"%(DT_IdentSE)s(?:%(S)s)?(?:\\[(?:%(DT_ItemSE)s)*](?:%(S)s)?)?>?" )
a("DeclCE",
  "--(?:%(CommentCE)s)?|\\[CDATA\\[(?:%(CDATA_CE)s)?|"
  "DOCTYPE(?:%(DocTypeCE)s)?")
a("PI_CE" , "%(Name)s(?:%(PI_Tail)s)?")
a("EndTagCE" , "%(Name)s(?:%(S)s)?>?")
a("AttValSE" , "\"[^<\"]*\"|'[^<']*'")
a("ElemTagCE",
  "%(Name)s(?:%(S)s%(Name)s(?:%(S)s)?=(?:%(S)s)?(?:%(AttValSE)s))*"
  "(?:%(S)s)?/?>?")

a("MarkupSPE",
  "<(?:!(?:%(DeclCE)s)?|\\?(?:%(PI_CE)s)?|/(?:%(EndTagCE)s)?|"
  "(?:%(ElemTagCE)s)?)")
a("XML_SPE" , "%(TextSE)s|%(MarkupSPE)s")
a("XML_MARKUP_ONLY_SPE" , "%(MarkupSPE)s")

def lexxml(data, markuponly=0):
    if markuponly:
        reg = "XML_MARKUP_ONLY_SPE"
    else:
        reg = "XML_SPE"
    regex = re.compile(collector.res[reg])
    return regex.findall(data)

def assertlex(data, numtokens, markuponly=0):
    tokens = lexxml(data, markuponly)
    if len(tokens)!=numtokens:
        assert len(lexxml(data))==numtokens, \
            "data = '%s', numtokens = '%s'" %(data, numtokens)
    if not markuponly:
        assert "".join(tokens)==data

def walktokens(tokens):
    for token in tokens:
        if token.startswith("<"):
            if token.startswith("<!"):
                print "declaration:", token
            elif token.startswith("<?xml"):
                print "xml declaration:", token
            elif token.startswith("<?"):
                print "processing instruction:", token
            elif token.startswith("</"):
                print "end-tag:", token
            elif token.endswith("/>"):
                print "empty-tag:", token
            elif token.endswith(">"):
                print "start-tag:", token
            else:
                print "error:", token
        else:
            print "text:", token

def testlexer():
    # This test suite could be larger!
    assertlex("<abc/>", 1)
    assertlex("<abc><def/></abc>", 3)
    assertlex("<abc>Blah</abc>", 3)
    assertlex("<abc>Blah</abc>", 2, markuponly=1)
    assertlex("<?xml version='1.0'?><abc>Blah</abc>", 3,
        markuponly=1)
    assertlex("<abc>Blah&foo;Blah</abc>", 3)
    assertlex("<abc>Blah&foo;Blah</abc>", 2, markuponly=1)
    assertlex("<abc><abc>", 2)
    assertlex("</abc></abc>", 2)
    assertlex("<abc></def></abc>", 3)

if __name__ == "__main__":
    testlexer()

12.12.1 See Also

This recipe is based on the following article, with regular expressions translated from Perl into Python: "REX: XML Shallow Parsing with Regular Expressions", Robert D. Cameron, Markup Languages: Theory and Applications, Summer 1999, pp. 61-88, http://www.cs.sfu.ca/~cameron/REX.html.
