
15.2 Colorizing Python Source Using the Built-in Tokenizer

Credit: Jürgen Hermann

15.2.1 Problem

You need to convert Python source code into HTML markup, rendering comments, keywords, operators, and numeric and string literals in different colors.

15.2.2 Solution

tokenize.tokenize does most of the work and calls us back for each token found, so we can output it with appropriate colorization:

""" MoinMoin - Python Source Parser """
import cgi, string, sys, cStringIO
import keyword, token, tokenize

# Python Source Parser (does highlighting into HTML)

_KEYWORD = token.NT_OFFSET + 1
_TEXT    = token.NT_OFFSET + 2
_colors = {
    token.NUMBER:       '#0080C0',
    token.OP:           '#0000C0',
    token.STRING:       '#004080',
    tokenize.COMMENT:   '#008000',
    token.NAME:         '#000000',
    token.ERRORTOKEN:   '#FF8080',
    _KEYWORD:           '#C00000',
    _TEXT:              '#000000',
}

class Parser:
    """ Send colorized Python source as HTML to an output file (normally stdout).
    """

    def __init__(self, raw, out = sys.stdout):
        """ Store the source text. """
        self.raw = string.strip(string.expandtabs(raw))
        self.out = out

    def format(self):
        """ Parse and send the colorized source to output. """
        # Store line offsets in self.lines
        self.lines = [0, 0]
        pos = 0
        while 1:
            pos = string.find(self.raw, '\n', pos) + 1
            if not pos: break
            self.lines.append(pos)
        self.lines.append(len(self.raw))

        # Parse the source and write it
        self.pos = 0
        text = cStringIO.StringIO(self.raw)
        self.out.write('<pre><font face="Lucida,Courier New">')
        try:
            tokenize.tokenize(text.readline, self) # self as handler callable
        except tokenize.TokenError, ex:
            msg = ex[0]
            line = ex[1][0]
            self.out.write("<h3>ERROR: %s</h3>%s\n" % (
                msg, self.raw[self.lines[line]:]))
        self.out.write('</font></pre>')

    def __call__(self, toktype, toktext, (srow,scol), (erow,ecol), line):
        """ Token handler """
        if 0:  # You may enable this for debugging purposes only
            print "type", toktype, token.tok_name[toktype], "text", toktext,
            print "start", srow,scol, "end", erow,ecol, "<br>"

        # Calculate new positions
        oldpos = self.pos
        newpos = self.lines[srow] + scol
        self.pos = newpos + len(toktext)

        # Handle newlines
        if toktype in [token.NEWLINE, tokenize.NL]:
            self.out.write('\n')
            return

        # Send the original whitespace, if needed
        if newpos > oldpos:
            self.out.write(self.raw[oldpos:newpos])

        # Skip indenting tokens
        if toktype in [token.INDENT, token.DEDENT]:
            self.pos = newpos
            return

        # Map token type to a color group
        if token.LPAR <= toktype <= token.OP:
            toktype = token.OP
        elif toktype == token.NAME and keyword.iskeyword(toktext):
            toktype = _KEYWORD
        color = _colors.get(toktype, _colors[_TEXT])

        style = ''
        if toktype == token.ERRORTOKEN:
            style = ' style="border: solid 1.5pt #FF0000;"'

        # Send text
        self.out.write('<font color="%s"%s>' % (color, style))
        self.out.write(cgi.escape(toktext))
        self.out.write('</font>')

if __name__ == "__main__":
    import os, sys
    print "Formatting..."

    # Open own source
    source = open('python.py').read()

    # Write colorized version to "python.html"
    Parser(source, open('python.html', 'wt')).format()

    # Load HTML page into browser
    if os.name == "nt":
        os.system("explorer python.html")
    else:
        os.system("netscape python.html &")

15.2.3 Discussion

This code is part of MoinMoin (see http://moin.sourceforge.net/) and shows how to use the built-in keyword, token, and tokenize modules to scan Python source code and re-emit it with appropriate color markup but no changes to its original formatting ("no changes" is the hard part!).
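
As a rough illustration of what these three modules provide, the following sketch lists the tokens of a short statement and flags the keywords among them. It assumes Python 3, where tokenize.generate_tokens yields tokens instead of calling a handler as the recipe's tokenize.tokenize does; the classify function and the "KEYWORD" label are invented for this example and are not part of the recipe.

```python
import io
import keyword
import token
import tokenize

def classify(source):
    """Return (category, text) pairs for the visible tokens in source."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == token.NAME and keyword.iskeyword(tok.string):
            category = "KEYWORD"      # hypothetical label, not a token constant
        else:
            category = tokenize.tok_name[tok.type]
        if tok.string.strip():        # skip NEWLINE, INDENT, DEDENT, ENDMARKER
            result.append((category, tok.string))
    return result

pairs = classify("if x: y = 42  # answer\n")
for category, text in pairs:
    print(category, repr(text))
```

The tokenizer reports both if and x as NAME tokens; only the keyword module tells them apart, which is why the recipe needs it.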

The Parser class's constructor saves the Python source to colorize (a multiline string, with tabs expanded and surrounding whitespace stripped) and the file object, open for writing, to which the colorized result is sent. The format method then builds the self.lines list, which holds the offset (the index into the source string, self.raw) at which each line starts.
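
The offset table can be sketched on its own as follows. This is a minimal version written for Python 3; line_offsets is a hypothetical stand-alone function, not a name from the recipe. The list starts with a dummy entry because the tokenizer numbers rows from 1, and it ends with the length of the text so that a token reported on the row after the last line still has an entry.

```python
def line_offsets(raw):
    """Return a list whose entry n is the index in raw where line n starts."""
    offsets = [0, 0]            # dummy entry for row 0; line 1 starts at index 0
    pos = 0
    while True:
        pos = raw.find("\n", pos) + 1
        if not pos:             # find returned -1: no more newlines
            break
        offsets.append(pos)
    offsets.append(len(raw))    # sentinel entry: the end of the text
    return offsets

raw = "a = 1\nbb = 22\nc"
offsets = line_offsets(raw)

# A (row, column) pair from the tokenizer maps to an index into raw:
row, col = 2, 5
index = offsets[row] + col
print(offsets, repr(raw[index:index + 2]))
```

With this table, converting a token's (row, column) position into a string index is a single addition.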

format then calls tokenize.tokenize, passing self as the callback. Thus, the __call__ method is invoked for each token, with arguments giving the token type, the token text, and the token's starting and ending positions in the source (each expressed as a line number and an offset within the line). The body of the __call__ method reconstructs the exact position within the original source code string self.raw, so it can emit exactly the same whitespace that was present in the original source. It then picks a color code from the _colors dictionary (which uses HTML color coding), with help from the keyword standard module to determine if a NAME token is actually a Python keyword (to be emitted in a different color than that used for ordinary identifiers).

The test code at the bottom of the module formats the module itself and launches a browser with the result. To stay compatible with stone-age versions of Python, it does not use the standard Python module webbrowser. If you have no such worries, you can change the last few lines of the recipe to:

# Load HTML page into browser
import webbrowser
webbrowser.open("python.html", 0, 1)

and enjoy the result in your favorite browser.
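
On a current Python 3, a slightly more defensive variant passes the browser an absolute file: URL rather than a bare relative filename, since opening plain filenames is not handled the same way on every platform. This is only a sketch; page_url and show are hypothetical helper names, and the new and autoraise keywords correspond to the two positional arguments used above.

```python
import os
import webbrowser
from pathlib import Path

def page_url(path):
    """Return an absolute file: URL for a local path (hypothetical helper)."""
    return Path(os.path.abspath(path)).as_uri()

def show(path):
    """Ask the default browser to display a local HTML file."""
    # new=0 reuses an existing browser window if possible;
    # autoraise=True asks for that window to be brought to the front
    return webbrowser.open(page_url(path), new=0, autoraise=True)

# Example call, commented out so that running this sketch opens nothing:
# show("python.html")
```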

15.2.4 See Also

Documentation for the webbrowser, token, tokenize, and keyword modules in the Library Reference; the colorizer is available at http://purl.net/wiki/python/MoinMoinColorizer, part of MoinMoin (http://moin.sourceforge.net).
