
15.1 Introduction

Credit: Paul F. Dubois, Ph.D., Program for Climate Model Diagnosis and Intercomparison, Lawrence Livermore National Laboratory

This chapter covers topics such as lexing, parsing, and program introspection. Python has extensive facilities related to lexing and parsing, and the large number of user-contributed modules related to parsing standard languages reduces the need for doing your own programming. This introduction contains a general guide to solving some common problems in these categories.

Lexing and parsing are among the most common of programming tasks, and as a result, both are the subject of much theory and much prior development. Therefore, in these areas more than most, you will often profit if you take the time to search for solutions before resorting to writing your own. The recipes in this chapter concern accomplishing certain tasks in Python. The most important of these is currying, in which new functions are created from existing ones by fixing some of their arguments in advance.

15.1.1 Lexing

Lexing is the process of dividing an input stream into meaningful units, or tokens, which are then processed. Lexing occurs in tasks such as data processing and creating tools for inspecting and modifying text.

The regular-expression facilities in Python are extensive and highly evolved, so your first consideration for a lexing task is to see if it can be formulated using regular expressions. Also, see the next section about parsers for common languages and how to lex them.
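For instance, here is a minimal sketch of regular-expression lexing; the token classes and the input line are invented for this illustration:

import re

# One alternative per token class; inner groups are non-capturing so that
# match.lastgroup always names the token class that matched.
token_pattern = re.compile(r"""
    (?P<number>   \d+(?:\.\d*)?   )
 |  (?P<name>     [A-Za-z_]\w*    )
 |  (?P<operator> [-+*/=]         )
""", re.VERBOSE)

def tokens(text):
    "Return a list of (token_class, token_text) pairs, skipping whitespace."
    result = []
    position = 0
    while position < len(text):
        if text[position].isspace():
            position = position + 1
            continue
        match = token_pattern.match(text, position)
        if match is None:
            raise ValueError, "unrecognized input: " + repr(text[position:])
        result.append((match.lastgroup, match.group()))
        position = match.end()
    return result

For example:

>>> print tokens("rate = 1.5 * base")
[('name', 'rate'), ('operator', '='), ('number', '1.5'), ('operator', '*'), ('name', 'base')]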

The tokenize module splits an input stream into Python-language tokens. Since Python's tokenization rules are similar to those of many other languages, this module may be suitable for other tasks.
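Here is a sketch using tokenize's callback interface; the input string is invented, and any readline-style callable (such as an open file's readline method) would do:

import tokenize, StringIO

def print_token(type, string, start, end, line):
    "Called by tokenize.tokenize once per token."
    print tokenize.tok_name[type], repr(string)

source = StringIO.StringIO("fields = line.split(':')\n")
tokenize.tokenize(source.readline, print_token)

This prints one line per token, such as NAME 'fields' and OP '=', ending with the NEWLINE and ENDMARKER tokens.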

The built-in string method split can also be used for many simple cases. For example, consider a file consisting of colon-separated text fields, with one record per line. You can read a line from the file as follows:

fields = line.split(':')

This produces a list of the fields. If at this point you fear spurious whitespace at the beginnings and ends of the fields, you can remove it with:

fields = map(lambda x: x.strip(), fields)

For example:

>>> x = "abc :def:ghi    : klm\n"
>>> fields = x.split(':')
>>> print fields
['abc ', 'def', 'ghi    ', ' klm\n']
>>> print map(lambda x: x.strip(), fields)
['abc', 'def', 'ghi', 'klm']

Resist the urge to elaborate on this example: existing packages already handle tab-, comma-, or colon-separated values. The ScientificPython package includes a module for reading and writing with Fortran-like formats. (See http://starship.python.net/crew/hinsen/scientific.html. For other links related to numeric data processing, see http://www.pfdubois.com/numpy/.)

A common "gotcha" for beginners is that, while this technique can be used to read numerical data from a file, the entries at the end of this stage are text strings, not numbers. The string module functions atoi and atof, or the int and float built-in functions, are frequently needed here:

>>> x = "1.2, 2.3, 4, 5.6"
>>> import string
>>> print map(lambda f: string.atof(f.strip()), x.split(','))
[1.2, 2.2999999999999998, 4.0, 5.5999999999999996]

15.1.2 Parsing

Parsing refers to extracting semantic meaning from a series of tokens according to the rules of a grammar. Parsing tasks are quite ubiquitous. Programming tools may attempt to discover information about program texts or to modify them to fit a task. (Python's introspection capabilities, discussed later, come into play here.) "Little languages" is the name given to application-specific languages that serve as human-readable forms of computer input. These can vary from simple lists of commands and arguments to full-blown languages.

In the previous lexing example, there was a grammar, but it was implicit: the data you need is organized as one line per record with the fields separated by a special character. The "parser" in that case was supplied by the programmer reading the lines from the file and applying the simple split function to obtain the information. This sort of input file can easily lead to requests for a more elaborate form. For example, users may wish to use comments, blank lines, conditional statements, or alternate forms. While most of this can be handled with simple logic, at some point, it becomes so complicated that it is much more reliable to use a real grammar.
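Here is a sketch of the kind of simple logic that handles the first few such requests; the format (colon-separated fields, with blank lines and #-comments ignored) is invented for this illustration:

def read_records(filename):
    "Return a list of field lists, skipping blank lines and # comments."
    records = []
    for line in open(filename).readlines():
        line = line.split('#')[0].strip()    # drop any comment, then whitespace
        if not line:                         # skip blank and comment-only lines
            continue
        records.append(map(lambda field: field.strip(), line.split(':')))
    return records

Each new feature costs only a few more lines of this kind; the trouble starts when the features begin to interact, which is the signal to switch to a real grammar.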

There is no hard-and-fast rule for deciding which part of the job is lexing and which belongs to the grammar. For example, comments can often be discarded during lexing, but doing so is unwise in a program-transformation tool whose output must preserve the original comments.

Your strategy for parsing tasks can include:

  • Using a parser for that language from the standard library.

  • Using a parser from the user community. You can find one by visiting the Vaults of Parnassus or by searching http://www.python.org.

  • Generating a parser using a parser generator.

  • Using Python itself as your input language.

A combination of approaches is often fruitful. For example, a simple parser can turn input into Python-language statements, which Python executes in concert with a supporting package that you supply.
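Here is a sketch of that combination: a tiny "parser" that rewrites an invented command format, connect A B, into calls on a supporting package (the edge class defined later in this introduction), which Python then executes:

def run_script(filename, namespace):
    "Translate each 'connect A B' line into edge('A', 'B') and execute it."
    for line in open(filename).readlines():
        words = line.split()
        if words and words[0] == 'connect':
            statement = "edge(%s, %s)" % (repr(words[1]), repr(words[2]))
            exec statement in namespace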

A number of parsers for specific languages exist in the standard library and in the user community. In particular, there are parsing packages for XML, HTML, SGML, command-line arguments, configuration files, and for Python itself.
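For instance, configuration files in the familiar INI format are handled by the standard ConfigParser module. A minimal sketch, with an invented filename and invented section and option names:

import ConfigParser

config = ConfigParser.ConfigParser()
config.read(['settings.ini'])               # hypothetical configuration file
print config.get('output', 'directory')     # option 'directory' in [output]
print config.getint('output', 'verbosity')  # same, converted to an integer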

You do not need to parse C to connect C routines to Python. Use SWIG (http://www.swig.org). Likewise, you do not need a Fortran parser to connect Fortran and Python. See the Numerical Python web page at http://www.pfdubois.com/numpy/ for further information.

15.1.3 PLY and SPARK

PLY and SPARK are Python-based parser generators. That is, they take as input statements that describe the grammar to be parsed and generate the parser for you. To make a useful tool, you must then add the semantic actions to be taken when a certain statement is recognized.

PLY (http://systems.cs.uchicago.edu/ply) is a Python implementation of the popular Unix tool yacc. SPARK (http://www.cpsc.ucalgary.ca/~aycock/spark) uses a cleverly introspective approach and parses a more general class of grammars than yacc.

The chief problem in using either of these tools is that you need to educate yourself about grammars and learn to write them. Except for very simple grammars, a novice will encounter some difficulty. There is a lot of literature out there to teach you how to use yacc, and most of this knowledge will help you use SPARK as well.
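To give the flavor, here is a lexer specification sketched in PLY's style; it follows the conventions in PLY's documentation, but details such as error handling vary between PLY versions:

import lex        # lex.py comes with the PLY distribution

tokens = ('NAME', 'NUMBER', 'PLUS')

t_NAME = r'[A-Za-z_][A-Za-z0-9_]*'
t_PLUS = r'\+'
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)       # attach a semantic value to the token
    return t

def t_error(t):
    raise SyntaxError, 'illegal character ' + repr(t.value[0])

lex.lex()                        # build the lexer from this module's rules
lex.input('alpha + 42')
while 1:
    token = lex.token()          # returns None when the input is exhausted
    if token is None:
        break
    print token.type, token.value

A grammar specification for the companion yacc.py follows the same pattern, with the grammar rules themselves supplied as docstrings.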

If you are interested in this area, the ultimate reference is Aho, Sethi, and Ullman's Compilers (Addison-Wesley), affectionately known as "The Dragon Book" to generations of computer-science majors.

15.1.4 Using Python Itself as a Little Language

Python itself can be used to create many application-specific languages. By writing suitable classes, you can rapidly make something that is easy to get running yet is extensible later. Suppose I want a language to describe graphs. There are nodes that have names and edges that connect the nodes. I want a way to input such graphs so that after reading the input, I will have the data structures in Python that I need. So, for example:

nodes = {}

def getnode(name):
    "Return the node with the given name, creating it if necessary."
    if not nodes.has_key(name):
        nodes[name] = node(name)
    return nodes[name]

class node:
    "A node has a name and a list of edges emanating from it."
    def __init__(self, name):
        self.name = name
        self.edgelist = []

class edge:
    "An edge connects two nodes."
    def __init__(self, name1, name2):
        self.nodes = (getnode(name1), getnode(name2))
        for n in self.nodes:
            n.edgelist.append(self)

    def __repr__(self):
        return self.nodes[0].name + self.nodes[1].name

Using just these simple statements, I can now parse a list of edges that describe a graph, and afterwards have data structures that contain all my information. Here, I enter a graph with four edges and print the list of edges emanating from node 'A':

>>> edge('A', 'B')
AB
>>> edge('B', 'C')
BC
>>> edge('C', 'D')
CD
>>> edge('C', 'A')
CA
>>> print getnode('A').edgelist
[AB, CA]

Suppose that I now want a weighted graph. I could easily add a weight=1.0 argument to the edge constructor, and the old input would still work. Also, I could easily add error-checking logic to ensure that edge lists have no duplicates. Furthermore, I already have my node class and can start adding logic to it. I can easily turn the entries in the dictionary nodes into similarly named variables that are bound to the node objects. After adding a few more classes corresponding to other input I need, I am well on my way.
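A sketch of that first extension, with the weight defaulting so that old input still works:

class edge:
    "An edge connects two nodes and carries a weight, defaulting to 1.0."
    def __init__(self, name1, name2, weight=1.0):
        self.weight = weight
        self.nodes = (getnode(name1), getnode(name2))
        for n in self.nodes:
            n.edgelist.append(self)
    # __repr__ is unchanged from the version above

Old input such as edge('A', 'B') keeps working, while edge('A', 'B', 2.5) creates a weighted edge.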

The advantage to this approach is clear. For example, the following is already handled correctly:

edge('A', 'B')

if not nodes.has_key('X'):
    edge('X', 'A')

def triangle(n1, n2, n3):
    edge(n1, n2)
    edge(n2, n3)
    edge(n3, n1)
triangle('A','W','K')

execfile('mygraph.txt')     # Read graph from a datafile

So I already have syntactic sugar, user-defined language extensions, and input from other files. Usually, the definitions will go into a module, and the user will simply import them. Had I written my own language, such accomplishments might be months away.

15.1.5 Introspection

Python programs have the ability to examine themselves; this set of facilities comes under the general title of introspection. For example, a Python function object knows the names of its arguments and the docstring that was given when it was defined:

>>> def f(a, b):
        "Return the difference of a and b"
        return a-b

>>> dir(f)
['__dict__', '__doc__', '__name__', 'func_closure', 'func_code',
'func_defaults', 'func_dict', 'func_doc', 'func_globals', 'func_name']
>>> f.func_name
'f'
>>> f.func_doc
'Return the difference of a and b'
>>> f.func_code
<code object f at 0175DDF0, file "<pyshell#18>", line 1>
>>> dir(f.func_code)
['co_argcount', 'co_cellvars', 'co_code', 'co_consts', 'co_filename',
'co_firstlineno', 'co_flags', 'co_freevars', 'co_lnotab', 'co_name',
'co_names', 'co_nlocals', 'co_stacksize', 'co_varnames']
>>> f.func_code.co_varnames
('a', 'b')

SPARK makes an interesting use of introspection: the grammar is entered as docstrings in the routines that perform the semantic actions when those grammar constructs are recognized. (Hey, don't turn your head all the way around like that! Introspection has its limits.)
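The mechanism itself is plain Python. Here is a sketch of the trick; the class and function names are invented for illustration and are not SPARK's actual API:

class RuleHolder:
    "Semantic actions whose docstrings double as grammar rules."
    def p_expr(self, args):
        ' expr ::= expr + term '
    def p_term(self, args):
        ' term ::= NUMBER '

def collect_rules(instance):
    "Return the grammar rules found in an instance's p_* methods."
    rules = []
    for name in dir(instance.__class__):
        if name[:2] == 'p_':
            method = getattr(instance, name)
            rules.append(method.__doc__.strip())
    return rules

For example:

>>> print collect_rules(RuleHolder())
['expr ::= expr + term', 'term ::= NUMBER']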

Python is the most powerful language that you can still read. The kinds of tasks discussed in this chapter show just how versatile and powerful it really is.
