|I l@ve RuBoard|
12.4 Extracting Text from an XML Document
Credit: Paul Prescod
from xml.sax.handler import ContentHandler import xml.sax import sys class textHandler(ContentHandler): def characters(self, ch): sys.stdout.write(ch.encode("Latin-1")) parser = xml.sax.make_parser( ) handler = textHandler( ) parser.setContentHandler(handler) parser.parse("test.xml")
Sometimes you want to get rid of XML tags梖or example, to rekey a document or to spellcheck it. This recipe performs this task and will work with any well-formed XML document. It is quite efficient. If the document isn't well-formed, you could try a solution based on the XML lexer (shallow parser) shown in Recipe 12.12.
In this recipe's textHandler class, we subclass ContentHander's characters method, which the parser calls for each string of text in the XML document (excluding tags, XML comments, and processing instructions), passing as the only argument the piece of text as a Unicode string. We have to encode this Unicode before we can emit it to standard output. In this recipe, we're using the Latin-1 (also known as ISO-8859-1) encoding, which covers all Western-European alphabets and is supported by many popular output devices (e.g., printers and terminal-emulation windows). However, you should use whatever encoding is most appropriate for the documents you're handling and is supported by the devices you use. The configuration of your devices may depend on your operating system's concepts of locale and code page. Unfortunately, these vary too much between operating systems for me to go into further detail.
12.4.4 See Also
|I l@ve RuBoard|