
8.3 Library Modules

Currently, there are more than 200 modules in the standard distribution, covering topics such as string and text processing, networking and web tools, system interfaces, database interfaces, serialization, data structures and algorithms, user interfaces, numerical computing, and others. We touch on only the most widely used here and mention some of the more powerful and specialized ones in Chapter 9 and Chapter 10.

8.3.1 Basic String Operations: The string Module

The string module is somewhat of a historical anomaly. If Python were being designed today, chances are many functions currently in the string module would be implemented instead as methods of string objects.[4] The string module operates on strings. Table 8.4 lists the most useful functions defined in the string module, along with brief descriptions, just to give you an idea as to the module's purpose. The descriptions given here are not complete; for an exhaustive listing, check the Library Reference or the Python Pocket Reference. Except when otherwise noted, each function returns a string.

[4] For a more detailed discussion of this and of many other commonly asked questions about Python, check out the FAQ list at http://www.python.org/doc/FAQ.html. For the question of string methods versus string functions, see Question 6.4 in that document.

Table 8.4. String Module Functions

Function Name

Behavior

atof(string)

Converts a string to a floating point number (see the float built-in):

>>> string.atof("1.4")
 1.4
atoi(string [, base])

Converts a string to an integer, using the base specified (base 10 by default; see the int built-in):

>>> string.atoi("365")
 365
atol(string [, base])

Same as atoi, except converts to a long integer (see the long built-in):

>>> string.atol("987654321")
 987654321L
capitalize(word)

Capitalizes the first letter of word:

>>> string.capitalize("tomato")
 'Tomato'
capwords(string)

Capitalizes each word in the string:

>>> string.capwords("now is the time")
 'Now Is The Time'
expandtabs(string, tabsize)

Expands the tab characters in string, using the specified tab size (no default)

find(s, sub [, start [, end]])

Returns the index of the string s corresponding to the first occurrence of the substring sub in s, or -1 if sub isn't in s:

>>> string.find("now is the time", 'is')
 4
rfind(s, sub [, start [, end]])

Same as find, but gives the index of the last occurrence of sub in s

index(s, sub [, start [, end]])

Same as find, but raises a ValueError exception if sub isn't found in s

rindex(s, sub[, start [, end]])

Same as rfind, but raises a ValueError exception if sub is not found in s

count(s, sub[, start [, end]])

Returns the number of occurrences of sub in s:

>>> string.count("now is the time", 'i')
 2
replace(str, old, new[, maxsplit])

Returns a string like str except that all (or some) occurrences of old have been replaced with new:

>>> string.replace("now is the time", ' ', '_')
 'now_is_the_time'
lower(string), upper(string)

Returns a lowercase (or uppercase) version of string

split(s [, sep[, maxsplit]])

Splits the string s at the specified separator string sep (whitespace by default), and returns a list of the "split" substrings:

>>> string.split("now is the time")
 ['now', 'is', 'the', 'time']
join(wordlist[, sep])

Joins a sequence of strings, inserting copies of sep between each (a single space by default):

>>> string.join(["now","is","the","time"], '*')
 'now*is*the*time'
 >>> string.join("now is the time", '*')
 'n*o*w* *i*s* *t*h*e* *t*i*m*e'

Remember that a string is itself a sequence of one-character strings!

lstrip(s), rstrip(s), strip(s)

Strips whitespace occurring at the left, right, or both ends of s:

>>> string.strip("  before  and  after   ")
 'before  and  after'
swapcase(s)

Returns a version of s with the lowercase letters replaced with their uppercase equivalent and vice versa

ljust(s, width), rjust(s, width), center(s, width)

Left-justifies, right-justifies, or centers the string s, padding it with spaces so that the returned string has width characters

The string module also defines a few useful constants, as shown in Table 8.5.

Table 8.5. String Module Constants

Constant Name    Value
digits           '0123456789'
octdigits        '01234567'
hexdigits        '0123456789abcdefABCDEF'
lowercase        'abcdefghijklmnopqrstuvwxyz' [5]
uppercase        'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
letters          lowercase + uppercase
whitespace       ' \t\n\r\v\f' (all whitespace characters)

[5] On most systems, the string.lowercase, string.uppercase, and string.letters have the values listed above. If one uses the locale module to specify a different cultural locale, they are updated. Thus for example, after doing locale.setlocale(locale.LC_ALL, 'fr'), the string.letters attribute will also include accented letters and other valid French letters.

The constants in Table 8.5 are generally used to test whether specific characters fit a criterion; for example, x in string.whitespace returns true only if x is one of the whitespace characters.

A typical use of the string module is to clean up user input. The following line removes all "extra" whitespace, meaning it replaces sequences of whitespace with single space characters, and it deletes leading and trailing spaces:

thestring = string.strip(string.join(string.split(thestring)))
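In current Python releases, the same cleanup is usually written with string methods rather than string module functions. The following is a minimal sketch, assuming Python 3; the helper name normalize_whitespace is ours for illustration and is not part of any library:

```python
def normalize_whitespace(text):
    """Collapse runs of whitespace to single spaces and trim both ends."""
    # split() with no arguments splits on any run of whitespace and
    # discards leading and trailing whitespace, so no strip() is needed.
    return " ".join(text.split())


cleaned = normalize_whitespace("  now   is\tthe \n time  ")
print(cleaned)
```

Because split() already drops leading and trailing whitespace, the join alone should give the same result as the three-call version above.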

8.3.2 Advanced String Operations: The re Module

The string module defines basic operations on strings. It shows up in almost all programs that interact with files or users. Because Python strings can contain null bytes, the same operations can also process binary data; more on this when we get to the struct module.

In addition, Python provides a specialized string-processing tool to use with regular expressions. For a long time, Python's regular expressions (available in the regex and regsub modules), while adequate for some tasks, were not up to par with those offered by competing languages, such as Perl. As of Python 1.5, a new module called re provides a completely overhauled regular expression package, which significantly enhances Python's string-processing abilities.

8.3.2.1 Regular expressions

Regular expressions are strings that let you define complicated pattern matching and replacement rules for strings. These strings are made up of symbols that emphasize compact notation over mnemonic value. For example, the single character . means "match any single character." The character + means "one or more of what just preceded me." Table 8.6 lists some of the most commonly used regular expression symbols and their meanings in English.

Table 8.6. Common Elements of Regular Expression Syntax

Special Character    Meaning
.                    Matches any character except newline by default
^                    Matches the start of the string
$                    Matches the end of the string
*                    "Any number of occurrences of what just preceded me"
+                    "One or more occurrences of what just preceded me"
|                    "Either the thing before me or the thing after me"
\w                   Matches any alphanumeric character
\d                   Matches any decimal digit
tomato               Matches the string tomato
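To get a feel for these symbols, it can help to try each one on a short string. The sketch below assumes Python 3 and uses only re.findall and re.search; the sample strings are made up for illustration:

```python
import re

digits = re.findall(r"\d+", "3 tomatoes, 12 peppers")        # runs of decimal digits
words = re.findall(r"\w+", "red pepper")                     # runs of alphanumerics
starts = bool(re.search(r"^tomato", "tomato salad"))         # anchored at the start
ends = bool(re.search(r"salad$", "tomato salad"))            # anchored at the end
colors = re.findall(r"red|green", "red and green peppers")   # alternation
wild = re.findall(r"p.p", "pepper pop")                      # . matches any one character

print(digits, words, starts, ends, colors, wild)
```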

8.3.2.2 A real regular expression problem

Suppose you need to write a program to replace the strings "green pepper" and "red pepper" with "bell pepper" if and only if they occur together in a paragraph before the word "salad" and not if they are followed (with no space) by the string "corn." These kinds of requirements are surprisingly common in computing. Assume that the file you need to process is called pepper.txt. Here's a silly example of such a file:

This is a paragraph that mentions bell peppers multiple times. For
one, here is a red pepper and dried tomato salad recipe. I don't like
to use green peppers in my salads as much because they have a harsher
flavor.

This second paragraph mentions red peppers and green peppers but not
the "s" word (s-a-l-a-d), so no bells should show up.

This third paragraph mentions red peppercorns and green peppercorns,
which aren't vegetables but spices (by the way, bell peppers really
aren't peppers, they're chilies, but would you rather have a good cook
or a good botanist prepare your salad?).

The first task is to open it and read in the text:

file = open('pepper.txt')
text = file.read()

We read the entire text at once and avoid splitting it into lines, since we will assume that paragraphs are defined by two consecutive newline characters. This is easy to do using the split function of the string module:

import string
paragraphs = string.split(text, '\n\n')

At this point we've split the text into a list of paragraph strings, and all that is left to do is perform the actual replacement operation. Here's where regular expressions come in:

import re
matchstr = re.compile(
    r"""\b(red|green)     # 'red' or 'green' starting new words
        (\s+              # followed by whitespace
         pepper           # the word 'pepper'
         (?!corn)         # if not followed immediately by 'corn'
         (?=.*salad))""", # and if followed at some point by 'salad'
    re.IGNORECASE |       # allow pepper, Pepper, PEPPER, etc.
    re.DOTALL |           # allow . to match newlines as well
    re.VERBOSE)           # this allows the comments and the newlines above

for paragraph in paragraphs:
    fixed_paragraph = matchstr.sub(r'bell\2', paragraph)
    print fixed_paragraph+'\n'

The re.compile(...) statement is the hardest one; it creates a compiled regular expression pattern, which is like a program. Such a pattern specifies two things: which parts of the strings we're interested in and how they should be grouped. Let's go over these in turn.

Defining which parts of the string we're interested in is done by specifying a pattern of characters that defines a match. This is done by concatenating smaller patterns, each of which specifies a simple matching criterion (e.g., "match the string 'pepper'," "match one or more whitespace characters," "don't match 'corn'," etc.). As mentioned, we're looking for the words red or green, if they're followed by the word pepper, that is itself followed by the word salad, as long as pepper isn't followed immediately by 'corn'. Let's take each line of the re.compile(...) expression in turn.

The first thing to notice about the string in the re.compile() is that it's a "raw" string (the quotation marks are preceded by an r). Prepending such an r to a string (single- or triple-quoted) turns off the interpretation of the backslash characters within the string.[6] We could have used a regular string instead and used \\b instead of \b and \\s instead of \s. In this case, it makes little difference; for complicated regular expressions, raw strings allow much more clear syntax than escaped backslashes.

[6] Raw strings can't end with an odd number of backslash characters. That's unlikely to be a problem when using raw strings for regular expressions, however, since regular expressions can't end with backslashes.

The first line in the pattern is \b(red|green). \b stands for "the empty string, but only at the beginning or end of a word"; using it here prevents matches that have red or green as the final part of a word (as in "tired pepper"). The (red|green) pattern specifies an alternation: either 'red' or 'green'. Ignore the left parenthesis that follows for now. \s is a special symbol that means "any whitespace character," and + means "one or more occurrences of whatever comes before me," so, put together, \s+ means "one or more whitespace characters." Then, pepper just means the string 'pepper'. (?!corn) prevents matches of "patterns that have 'corn' at this point," so we prevent the match on 'peppercorn'. Finally, (?=.*salad) says that for the pattern to match, it must be followed by any number of characters (that's what .* means), followed by the word salad. The ?= bit specifies that while the pattern should determine whether the match occurs, it shouldn't be "used up" by the match process; it's a subtle point, which we'll ignore for now. At this point we've defined the pattern corresponding to the substring.

Now, note that there are two parentheses we haven't explained yet: the one before \s+ and the last one. These two define a "group," which starts after the red or green and goes to the end of the pattern. We'll use that group in the next operation, the actual replacement. First, we need to mention the three flags that are joined by the bitwise "or" operator (|). These specify kinds of pattern matches. The first, re.IGNORECASE, says that the text comparisons should ignore differences in case between the text and the pattern. The second, re.DOTALL, specifies that the . character should match any character, including the newline character (that's not the default behavior). Finally, the third, re.VERBOSE, allows us to insert extra newlines and # comments in the regular expression, making it easier to read and understand. We could have written the statement more compactly as:

matchstr = re.compile(r"\b(red|green)(\s+pepper(?!corn)(?=.*salad))", re.I | re.S)

The actual replacement operation is done with the line:

fixed_paragraph = matchstr.sub(r'bell\2', paragraph)

First, it should be fairly clear that we're calling the sub method of the matchstr object. That object is a compiled regular expression object, meaning that some of the processing of the expression has already been done (in this case, outside the loop), thus speeding up the total program execution. We use a raw string again to write the first argument to the method. The \2 is a reference to group 2 in the regular expression, that is, the second group of parentheses; in our case, it holds the whitespace followed by the word pepper. (The (?!corn) and (?=.*salad) assertions sit inside the group, but they only check the text that follows and consume none of it, so the match ends right after pepper.) This line therefore means, "Replace the matched string with the string that is 'bell' followed by whatever comes after red or green and goes up to the end of the matched string, in the paragraph string."
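One way to convince yourself of what the groups hold is to run the compact form of the pattern on a few short strings. This is a sketch assuming Python 3; the sample strings are invented for the purpose and are not taken from pepper.txt:

```python
import re

pattern = re.compile(r"\b(red|green)(\s+pepper(?!corn)(?=.*salad))", re.I | re.S)

samples = [
    "a red pepper and tomato salad",   # qualifies: 'salad' appears later
    "red peppercorn salad",            # rejected by (?!corn)
    "green peppers, no dressing",      # rejected: 'salad' never appears
]
fixed = [pattern.sub(r"bell\2", s) for s in samples]

# Inspect the two groups for the one sample that matches.
match = pattern.search(samples[0])
group1, group2 = match.group(1), match.group(2)
print(repr(group1), repr(group2))
for line in fixed:
    print(line)
```

Only the first sample should change, and group 2 should contain just the whitespace and the word pepper, since lookahead assertions do not consume text.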

So, does it work? The pepper.txt file we saw earlier had three paragraphs: the first satisfied the requirements of the match twice, the second didn't because it didn't mention the word "salad," and the third didn't because the red and green words are before peppercorn, not pepper. As it was supposed to, our program (saved in a file called pepper.py) modifies only the first paragraph:

/home/David/book$ python pepper.py
This is a paragraph that mentions bell peppers multiple times. For
one, here is a bell pepper and dried tomato salad recipe. I don't like 
to use bell peppers in my salads as much because they have a harsher 
flavor.

This second paragraph mentions red peppers and green peppers but not
the "s" word (s-a-l-a-d), so no bells should show up.

This third paragraph mentions red peppercorns and green peppercorns,
which aren't vegetables but spices (by the way, bell peppers really
aren't peppers, they're chilies, but would you rather have a good cook
or a good botanist prepare your salad?).

This example, while artificial, shows how regular expressions can compactly express complicated matching rules. If this kind of problem occurs often in your line of work, mastering regular expressions is a worthwhile investment of time and effort.

A thorough coverage of regular expressions is beyond the scope of this book. Jeffrey Friedl gives an excellent coverage of regular expressions in his book Mastering Regular Expressions (O'Reilly & Associates). His description of Python regular expressions (at least in the First Edition) uses the old-style syntax, which is no longer the recommended one, so those specifics should mostly be ignored; the regular expressions currently used in Python are much more similar to those of Perl. Still, his book is a must-have for anyone doing serious text processing. For the casual user (such as these authors), the descriptions in the Library Reference do the job most of the time. Use the re module, not the regexp, regex, and regsub modules, which are deprecated.

8.3.3 Generic Operating-System Interfaces: The os Module

The operating-system interface defines the mechanism by which programs are expected to manipulate things like files, processes, users, and threads.

8.3.3.1 The os and os.path modules

The os module provides a generic interface to the operating system's most basic set of tools. The specific set of calls it defines depends on which platform you use. (For example, the permission-related calls are available only on platforms that support them, such as Unix and Windows.) Nevertheless, it's recommended that you always use the os module, instead of the platform-specific versions of the module (called posix, nt, and mac). Table 8.7 lists some of the most often-used functions in the os module. When referring to files in the context of the os module, one is referring to filenames, not file objects.

Table 8.7. Most Frequently Used Functions From the os Module

Function Name

Behavior

getcwd()

Returns a string referring to the current working directory (cwd):

>>> print os.getcwd()
 h:\David\book

listdir(path)

Returns a list of all of the files in the specified directory:

>>> os.listdir(os.getcwd())
 ['preface.doc', 'part1.doc', 'part2.doc']

chown(path, uid, gid)

Changes the owner ID and group ID of specified file

chmod(path, mode)

Changes the permissions of specified file with numeric mode mode (e.g., 0644 means read/write for owner, read for everyone else)

rename(src, dest)

Renames file named src with name dest

remove(path) or unlink(path)

Deletes specified file (see rmdir to remove directories)

mkdir(path[, mode])

Creates a directory named path with numeric mode mode (see os.chmod):

>>> os.mkdir('newdir')

rmdir(path)

Removes directory named path

system(command)

Executes the shell command in a subshell; the return value is the return code of the command

symlink(src, dest)

Creates soft link from file src to file dest

link(src, dest)

Creates hard link from file src to file dest
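Several of the functions in this table can be exercised safely inside a temporary directory. The following sketch assumes Python 3 and uses the tempfile module to get a scratch location; the file and directory names are hypothetical:

```python
import os
import tempfile

# Work inside a throwaway directory so nothing real is touched.
base = tempfile.mkdtemp()
newdir = os.path.join(base, "newdir")
os.mkdir(newdir)                              # create a directory

draft = os.path.join(newdir, "draft.txt")     # hypothetical file name
with open(draft, "w") as f:
    f.write("hello\n")

final = os.path.join(newdir, "final.txt")
os.rename(draft, final)                       # works on filenames, not file objects
listing = os.listdir(newdir)                  # names only, not full paths
print(listing)

os.remove(final)                              # delete the file
os.rmdir(newdir)                              # rmdir needs an empty directory
os.rmdir(base)
```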

There are many other functions in the os module; in fact, any function that's part of the POSIX standard and widely available on most Unix platforms is supported by Python on Unix. The interfaces to these routines follow the POSIX conventions. You can retrieve and set UIDs, PIDs, and process groups; control nice levels; create pipes; manipulate file descriptors; fork processes; wait for child processes; send signals to processes; use the execv variants; etc.

The os module also defines some important attributes that aren't functions:

  • The os.name attribute defines the current version of the platform-specific operating-system interface. Registered values for os.name are 'posix', 'nt', 'dos', and 'mac'. It's different from sys.platform, which we discussed earlier in this chapter.

  • os.error defines a class used when calls in the os module raise errors. When this exception is raised, the value of the exception contains two variables. The first is the number corresponding to the error (known as errno), and the second is a string message explaining it (known as strerror):

    >>> os.rmdir('nonexistent_directory')      # how it usually shows up
    Traceback (innermost last):
      File "<stdin>", line 1, in ?
    os.error: (2, 'No such file or directory')
    >>> try:                                   # we can catch the error and take
    ...    os.rmdir('nonexistent directory')   # it apart
    ... except os.error, value:
    ...     print value[0], value[1]
    ...
    2 No such file or directory
  • The os.environ dictionary contains key/value pairs corresponding to the environment variables of the shell from which Python was started. Because this environment is inherited by the commands that are invoked using the os.system call, modifying the os.environ dictionary modifies the environment:

    >>> print os.environ['SHELL']
    /bin/sh
    >>> os.environ['STARTDIR'] = 'MyStartDir'
    >>> os.system('echo $STARTDIR')           # 'echo %STARTDIR%' on DOS/Win
    MyStartDir                                # printed by the shell
    0                                         # return code from echo
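For readers on a current interpreter, the same three attributes can be sketched as follows. This assumes Python 3, where os.error is an alias of OSError and the error number and message are available as attributes; it also swaps os.system for subprocess.run so that the child's output can be captured portably. The directory name is hypothetical:

```python
import errno
import os
import subprocess
import sys

print(os.name)    # for example 'posix' or 'nt'

# Catch the error and take it apart through its attributes.
try:
    os.rmdir("nonexistent_directory_for_demo")
except OSError as exc:
    caught = (exc.errno, exc.strerror)
    print(caught)

# Changes to os.environ are inherited by child processes.
os.environ["STARTDIR"] = "MyStartDir"
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['STARTDIR'])"],
    capture_output=True, text=True, check=True,
)
print(child.stdout.strip())
```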

The os module also includes a set of strings that define portable ways to refer to directory-related operations, as shown in Table 8.8.

Table 8.8. String Attributes of the os Module

Attribute Name

Meaning and Value

curdir

A string that denotes the current directory:

'.' on Unix, DOS, and Windows; ':' on the Mac

pardir

A string that denotes the parent directory:

'..' on Unix, DOS, and Windows; '::' on the Mac

sep

The character that separates pathname components:

'/' on Unix; '\' on DOS, Windows; ':' on the Mac

altsep

An alternate character to sep when available; set to None on all systems except DOS and Windows, where it's '/'

pathsep

The character that separates path components:

':' on Unix; ';' on DOS and Windows

These strings are especially useful when combined with the functionality in the os.path module, which provides many functions that manipulate file paths (see Table 8.9). Note that the os.path module is an attribute of the os module; it's imported automatically when the os module is loaded, so you don't need to import it explicitly. The outputs of the examples in Table 8.9 correspond to code run on a Windows or DOS machine. On another platform, the appropriate path separators would be used instead.

Table 8.9. Most Frequently Used Functions from the os.path Module

Function Name

Behavior

split(path)

Splits the given path into a pair consisting of a head and a tail; the head is the path up to the directory, and the tail is the filename. The result is equivalent to the tuple (dirname(path), basename(path)):

>>> os.path.split("h:/David/book/part2.doc")
 ('h:/David/book', 'part2.doc')

join(path, ...)

Joins path components intelligently:

>>> print os.path.join(os.getcwd(),
 ... os.pardir, 'backup', 'part2.doc')
 h:\David\book\..\backup\part2.doc

exists(path)

Returns true if path corresponds to an existing path

expanduser(path)

Expands an initial ~ in the path, optionally followed by a username, into that user's home directory:

>>> print os.path.expanduser('~/mydir')
 h:\David\mydir

expandvars(path)

Expands the path argument with the variables specified in the environment:

>>> print os.path.expandvars('$TMP')
 C:\TEMP

isfile(path), isdir(path), islink(path), ismount(path)

Returns true if the specified path is a file, directory, link, or mount point, respectively

normpath(path)

Normalizes the given path, collapsing redundant separators and uplevel references:

>>> print os.path.normpath("/foo/bar\\../tmp")
 \foo\tmp

samefile(p, q)

Returns true if both arguments refer to the same file

walk(p, visit, arg)

Calls the function visit with arguments (arg, dirname, names) for each directory in the directory tree rooted at p (including p itself, if it's a directory); the argument dirname specifies the visited directory; the argument names lists the files in the directory:

>>> def test_walk(arg, dirname, names):
 ...     print arg, dirname, names
 ...
 >>> os.path.walk('..', test_walk, 'show')
 show ..\logs ['errors.log', 'access.log']
 show ..\cgi-bin ['test.cgi']
 ...
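The Windows-style results in this table can be reproduced on any platform by importing ntpath, the Windows implementation behind os.path. The sketch below assumes Python 3; since os.path.walk no longer exists there, it uses os.walk as a stand-in for the directory traversal, and the directory layout is created just for the example:

```python
import ntpath   # Windows flavor of os.path, importable on any platform
import os
import shutil
import tempfile

head_tail = ntpath.split("h:/David/book/part2.doc")
joined = ntpath.join(r"h:\David\book", "..", "backup", "part2.doc")
normalized = ntpath.normpath("/foo/bar\\../tmp")
print(head_tail, joined, normalized)

# Build a tiny tree and walk it; os.walk yields one triple per directory.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "logs"))
open(os.path.join(root, "logs", "errors.log"), "w").close()

visited = []
for dirname, subdirs, files in os.walk(root):
    visited.append((os.path.relpath(dirname, root), sorted(files)))
visited.sort()
print(visited)

shutil.rmtree(root)   # clean up the scratch tree
```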

8.3.4 Copying Files and Directories: The shutil Module

The keen-eyed reader might have noticed that the os module, while it provides lots of file-related functions, doesn't include a copy function. On DOS, copying a file is basically the same thing as opening one file in read/binary mode, reading all its data, opening a second file in write/binary mode, and writing the data to the second file. On Unix and Windows, making that kind of copy fails to copy the so-called stat bits (permissions, modification times, etc.) associated with the file. On the Mac, that operation won't copy the resource fork, which contains data such as icons and dialog boxes. In other words, copying is just not so simple. Nevertheless, often you can get away with a fairly simple function that works on Windows, DOS, Unix, and Macs as long as you're manipulating just data files with no resource forks. That function, called copyfile, lives in the shutil module. The module also includes a few other generally useful functions, shown in Table 8.10.

Table 8.10. Functions of the shutil Module

Function Name

Behavior

copyfile(src, dest)

Makes a copy of the file src and calls it dest (straight binary copy).

copymode(src, dest)

Copies mode information (permissions) from src to dest.

copystat(src, dest)

Copies all stat information (mode, utime) from src to dest.

copy(src, dest)

Copies data and mode information from src to dest (doesn't include the resource fork on Macs).

copy2(src, dest)

Copies data and stat information from src to dest (doesn't include the resource fork on Macs).

copytree(src, dest, symlinks=0)

Copies a directory recursively using copy2. The symlinks flag specifies whether symbolic links in the source tree must result in symbolic links in the destination tree, or whether the files being linked to must be copied. The destination directory must not already exist.

rmtree(path, ignore_errors=0, onerror=None)

Recursively deletes the directory indicated by path. If ignore_errors is true, errors are ignored. Otherwise (ignore_errors is 0, the default behavior), if onerror is set, it's called to handle the error; if not, an exception is raised on error.
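A short round trip through copyfile, copytree, and rmtree shows how these functions fit together. This is a sketch assuming Python 3 and a scratch directory from tempfile; all file names are hypothetical:

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()
src_dir = os.path.join(base, "src")
os.mkdir(src_dir)
with open(os.path.join(src_dir, "data.txt"), "w") as f:
    f.write("some data\n")

# copyfile copies the contents only, not permissions or timestamps.
shutil.copyfile(os.path.join(src_dir, "data.txt"),
                os.path.join(src_dir, "data_copy.txt"))

# copytree requires that the destination directory not exist yet.
dest_dir = os.path.join(base, "backup")
shutil.copytree(src_dir, dest_dir)
copied = sorted(os.listdir(dest_dir))
print(copied)

# With the default ignore_errors, rmtree raises an exception on failure.
shutil.rmtree(base)
```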

8.3.5 Internet-Related Modules

8.3.5.1 The Common Gateway Interface: The cgi module

Python programs often process forms from web pages. To make this task easy, the standard Python distribution includes a module called cgi. Chapter 10 includes an example of a Python script that uses the CGI.

8.3.5.2 Manipulating URLs: The urllib and urlparse modules

Uniform resource locators (URLs) are strings such as http://www.python.org/ that are now ubiquitous.[7] Two modules, urllib and urlparse, provide tools for processing URLs.

[7] The syntax for URLs was designed in the early days of the Web with the expectation that users would rarely see them and would instead click on hyperlinks tagged with the URLs, which would then be processed by computer programs. Had their future in advertising been predicted, a syntax making them more easily pronounced would probably have been chosen!

urllib defines a few functions for writing programs that must be active users of the Web (robots, agents, etc.). These are listed in Table 8.11.

Table 8.11. Functions of the urllib Module

Function Name

Behavior

urlopen (url[, data])

Opens a network object denoted by a URL for reading; it can also open local files:

>>> page = urlopen('http://www.python.org')
 >>> page.readline()
 '<HTML>\012'
 >>> page.readline()
 DO NOT EDIT. -->\012'
urlretrieve (url[, filename][, hook])

Copies a network object denoted by a URL to a local file (uses a cache):

>>> urllib.urlretrieve('http://www.python.org/',
 'wwwpython.html')
urlcleanup()

Cleans up the cache used by urlretrieve

quote(string[, safe])

Replaces special characters in string using the %xx escape; the optional safe parameter specifies additional characters that shouldn't be quoted; its default value is '/':

>>> quote('this & that @ home')
 'this%20%26%20that%20%40%20home'
quote_plus (string[, safe])

Like quote(), but also replaces spaces by plus signs

unquote (string)

Replaces %xx escapes by their single-character equivalent:

>>> unquote('this%20%26%20that%20%40%20home')
 'this & that @ home'
urlencode (dict)

Converts a dictionary to a URL-encoded string, suitable to pass to urlopen() as the optional data argument:

>>> locals()
 {'urllib': <module 'urllib'>, '__doc__': None, 'x':
 '__builtin__'>}
 >>> urllib.urlencode(locals())
 __builtin__%27%3e'
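The quoting functions are easy to try without any network access. The sketch below assumes Python 3, where quote, quote_plus, unquote, and urlencode are imported from urllib.parse rather than from urllib itself; the query dictionary is made up for illustration:

```python
from urllib.parse import quote, quote_plus, unquote, urlencode

quoted = quote("this & that @ home")        # spaces become %20
plus_quoted = quote_plus("this & that @ home")   # spaces become +
restored = unquote(quoted)
query = urlencode({"q": "bell pepper", "page": 2})

print(quoted)
print(plus_quoted)
print(restored)
print(query)
```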

urlparse defines a few functions that simplify taking URLs apart and putting new URLs together. These are listed in Table 8.12.

Table 8.12. Functions of the urlparse Module

Function Name

Behavior

urlparse(urlstring[, default_scheme[, allow_fragments]])

Parses a URL into six components, returning a six tuple: (addressing scheme, network location, path, parameters, query, fragment identifier):

>>> urlparse('http://www.python.org/FAQ.html')
  ('http', 'www.python.org', '/FAQ.html', '', '', '')
urlunparse(tuple)

Constructs a URL string from a tuple as returned by urlparse()

urljoin(base, url[, allow_fragments])

Constructs a full (absolute) URL by combining a base URL (base) with a relative URL (url):

>>> urljoin('http://www.python.org', 'doc/lib')
 'http://www.python.org/doc/lib'
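Taking a URL apart and putting it back together can be sketched as follows. This assumes Python 3, where the functions of the urlparse module are found in urllib.parse; the second urljoin call uses a made-up relative path to show resolution against a base that names a file:

```python
from urllib.parse import urljoin, urlparse, urlunparse

parts = urlparse("http://www.python.org/FAQ.html")
six_tuple = tuple(parts)             # scheme, netloc, path, params, query, fragment
rebuilt = urlunparse(parts)          # inverse of urlparse

simple_join = urljoin("http://www.python.org", "doc/lib")
# A relative URL replaces the last path component of the base.
sibling = urljoin("http://www.python.org/doc/FAQ.html", "lib/index.html")

print(six_tuple)
print(rebuilt)
print(simple_join)
print(sibling)
```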

8.3.5.3 Specific Internet protocols

The most commonly used protocols built on top of TCP/IP are supported with modules named after them. These are the httplib module (for processing web pages with the HTTP protocol); the ftplib module (for transferring files using the FTP protocol); the gopherlib module (for browsing Gopher servers); the poplib and imaplib modules (for reading mail files on POP3 and IMAP servers, respectively); the nntplib module (for reading Usenet news from NNTP servers); and the smtplib module (for communicating with standard mail servers). We'll use some of these in Chapter 9. There are also modules that can build Internet servers, specifically a generic socket-based IP server (SocketServer), a simple web server (SimpleHTTPServer), and a CGI-compliant HTTP server (CGIHTTPServer).

8.3.5.4 Processing Internet data

Once you use an Internet protocol to obtain files from the Internet (or before you serve them to the Internet), you must process these files. They come in many different formats. Table 8.13 lists each module in the standard library that processes a specific kind of Internet-related file format (there are others for sound and image format processing: see the Library Reference).

Table 8.13. Modules Dedicated to Internet File Processing

Module Name    File Format
sgmllib        A simple parser for SGML files
htmllib        A parser for HTML documents
xmllib         A parser for XML documents
formatter      Generic output formatter and device interface
rfc822         Parse RFC-822 mail headers (i.e., "Subject: hi there!")
mimetools      Tools for parsing MIME-style message bodies (a.k.a. file attachments)
multifile      Support for reading files that contain distinct parts
binhex         Encode and decode files in binhex4 format
uu             Encode and decode files in uuencode format
binascii       Convert between binary and various ASCII-encoded representations
xdrlib         Encode and decode XDR data
mailcap        Mailcap file handling
mimetypes      Mapping of filename extensions to MIME types
base64         Encode and decode MIME base64 encoding
quopri         Encode and decode MIME quoted-printable encoding
mailbox        Read various mailbox formats
mimify         Convert mail messages to and from MIME format
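A few of these modules are small enough to try in a couple of lines. The sketch below assumes Python 3, where base64, binascii, and mimetypes are still available (several of the other modules in the table have since been removed or replaced); the sample data and filename are illustrative only:

```python
import base64
import binascii
import mimetypes

original = b"hi there!"
encoded = base64.b64encode(original)     # binary data to ASCII-safe bytes
decoded = base64.b64decode(encoded)      # and back again
print(encoded, decoded)

hex_digits = binascii.hexlify(b"\x00\xff")   # bytes to ASCII hex digits
print(hex_digits)

# guess_type maps a filename extension to a MIME type and an encoding.
mime_type, encoding = mimetypes.guess_type("wwwpython.html")
print(mime_type, encoding)
```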

8.3.6 Dealing with Binary Data: The struct Module

A frequent question about file manipulation is "How do I process binary files in Python?" The answer to that question usually involves the struct module. It has a simple interface, since it exports just three functions: pack, unpack, and calcsize.

Let's start with the task of decoding a binary file. Imagine a binary file bindat.dat that contains data in a specific format: first there's a float corresponding to a version number, then a long integer corresponding to the size of the data, and then the number of unsigned bytes corresponding to the actual data. The key to using the struct module is to define a "format" string, which corresponds to the format of the data you wish to read, and find out which subset of the file corresponds to that data. For our example, we could use:

import struct

data = open('bindat.dat', 'rb').read()    # binary mode, so the bytes arrive unaltered
start, stop = 0, struct.calcsize('fl')
version_number, num_bytes = struct.unpack('fl', data[start:stop])
# the data bytes begin where the header ends
start, stop = stop, stop + struct.calcsize('B'*num_bytes)
bytes = struct.unpack('B'*num_bytes, data[start:stop])

'f' is a format string for a single floating point number (a C float, to be precise), 'l' is for a long integer, and 'B' is a format string for an unsigned char. The available unpack format strings are listed in Table 8.14. Consult the Library Reference for usage details.

Table 8.14. Format Codes Used by the struct Module

Format    C Type            Python
x         pad byte          No value
c         char              String of length 1
b         signed char       Integer
B         unsigned char     Integer
h         short             Integer
H         unsigned short    Integer
i         int               Integer
I         unsigned int      Integer
l         long              Integer
L         unsigned long     Integer
f         float             Float
d         double            Float
s         char[]            String
p         char[]            String
P         void *            Integer

At this point, bytes is a tuple of num_bytes Python integers. If we know that the data is in fact storing characters, we could use chars = map(chr, bytes). To be more efficient, we could change the last unpack to use 'c' instead of 'B', which would do the conversion for us and return a tuple of num_bytes single-character strings. More efficiently still, we could use a format string that specifies a string of characters of a specified length, such as:

chars = struct.unpack(str(num_bytes)+'s', data[start:stop])[0]  # unpack returns a tuple
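The three approaches just described can be compared side by side. This is a minimal sketch that assumes a Python 3 interpreter, where packed data is a bytes object rather than a string, so 'c' and 's' yield bytes values. The four-byte sample payload is made up for the illustration.

```python
import struct

data = b"spam"                 # hypothetical payload standing in for file contents
num_bytes = len(data)

# 'B' gives one small integer per byte.
as_ints = struct.unpack('B' * num_bytes, data)

# 'c' gives one length-1 item per byte.
as_chars = struct.unpack('c' * num_bytes, data)

# A count followed by 's' gives the whole run as a single item;
# unpack still returns a tuple, so we pull out its only element.
(as_string,) = struct.unpack(str(num_bytes) + 's', data)

print(as_ints)
print(as_chars)
print(as_string)
```

The last form does the least work, since it builds one object instead of num_bytes of them.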

The packing operation is the exact converse; instead of taking a format string and a data string, and returning a tuple of unpacked values, it takes a format string and a variable number of arguments and packs those arguments using that format string into a new "packed" string.
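As a rough illustration of packing, the sketch below writes a record in the same layout as the bindat.dat example (a float, a long, then the data bytes) and reads it back. It assumes a Python 3 interpreter; the function names encode_record and decode_record are ours, and we use the '<' prefix so that the header is always 8 bytes regardless of platform.

```python
import struct

HEADER = '<fl'   # version number (float), then payload length (long)

def encode_record(version, payload):
    """Pack a version number and a sequence of small integers (0-255)."""
    header = struct.pack(HEADER, version, len(payload))
    body = struct.pack('%dB' % len(payload), *payload)
    return header + body

def decode_record(data):
    """Reverse encode_record: return (version, tuple_of_byte_values)."""
    header_size = struct.calcsize(HEADER)
    version, num_bytes = struct.unpack(HEADER, data[:header_size])
    body = data[header_size:header_size + num_bytes]
    return version, struct.unpack('%dB' % num_bytes, body)

packed = encode_record(1.5, [1, 2, 3])
version, values = decode_record(packed)
print(len(packed), version, values)
```

Note that pack takes the values as separate arguments, which is why the payload list is spread out with the * syntax.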

Note that the struct module can process data that's encoded with either kind of byte-ordering,[8] thus allowing you to write platform-independent binary file manipulation code. For large files, consider using the array module.

[8] The order in which computers store the bytes of multibyte words depends on the chip used (so much for standards). Intel and DEC systems use so-called little-endian ordering, while Motorola and Sun-based systems use big-endian ordering. Network transmissions also use big-endian ordering, so the struct module comes in handy when doing network I/O on PCs.
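To make the byte-ordering point concrete, here is a short sketch (again assuming Python 3) that packs the same number with the '<', '>', and '!' prefixes, which request little-endian, big-endian, and network ordering respectively. The value 258 is just a convenient example whose two bytes differ.

```python
import struct

value = 258                          # 0x0102: high byte is 1, low byte is 2

little = struct.pack('<H', value)    # little-endian unsigned short
big = struct.pack('>H', value)       # big-endian unsigned short
network = struct.pack('!H', value)   # network order, which is big-endian

# Unpacking with the matching prefix recovers the value on any machine.
from_little = struct.unpack('<H', little)[0]
from_big = struct.unpack('>H', big)[0]

print(little, big, network, from_little, from_big)
```

Stating the byte order explicitly in the format string is what lets the same code read a file correctly on both kinds of machine.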

8.3.7 Debugging, Timing, Profiling

These last few modules will help debug, time, and optimize your Python programs.

The first task is, not surprisingly, debugging. Python's standard distribution includes a debugger called pdb. Using pdb is fairly straightforward. You import the pdb module and call its run function with the Python code the debugger should execute. For example, if you're debugging the program in spam.py from Chapter 6, do this:

>>> import spam                       # import the module we wish to debug
>>> import pdb                        # import pdb
>>> pdb.run('instance = spam.Spam()') # start pdb with a statement to run
> <string>(0)?()
(Pdb) break spam.Spam.__init__                 # we can set break points
(Pdb) next
> <string>(1)?()
(Pdb) n                                        # 'n' is short for 'next'
> spam.py(3)__init__()
-> def __init__(self):
(Pdb) n
> spam.py(4)__init__()
-> Spam.numInstances = Spam.numInstances + 1
(Pdb) list                                     # show the source code listing
  1    class Spam:
  2        numInstances = 0
  3 B      def __init__(self):                 # note the B for Breakpoint
  4  ->        Spam.numInstances = Spam.numInstances + 1  # where we are
  5        def printNumInstances(self):
  6            print "Number of instances created: ", Spam.numInstances
  7
[EOF]
(Pdb) where                                    # show the calling stack
  <string>(1)?()
> spam.py(4)__init__()
-> Spam.numInstances = Spam.numInstances + 1
(Pdb) Spam.numInstances = 10          # note that we can modify variables
(Pdb) print Spam.numInstances         # while the program is being debugged
10
(Pdb) continue                        # this continues until the next break-
--Return--                            # point, but there is none, so we're
> <string>(1)?()->None                # done
(Pdb) c                               # this ends up quitting Pdb
<spam.Spam instance at 80ee60>        # this is the returned instance
>>> instance.numInstances             # note that the change to numInstances
11                                    # was *before* the increment op

As the session above shows, with pdb you can list the current code being debugged (with an arrow pointing to the line about to be executed), examine variables, modify variables, and set breakpoints. The Library Reference's Chapter 9 covers the debugger in detail.

Even when a program is working, it can sometimes be too slow. If you know what the bottleneck in your program is, and you know of alternative ways to code the same algorithm, then you might time the various alternative methods to find out which is fastest. The time module, which is part of the standard distribution, provides many time-manipulation routines. We'll use just one, which returns the time since a fixed "epoch" with the highest precision available on your machine. As we'll use just relative times to compare algorithms, the precision isn't all that important. Here are two different ways to create a list of 10,000 zeros, saved in a module file named makezeros.py:

def lots_of_appends():
  zeros = []
  for i in range(10000):
    zeros.append(0)

def one_multiply():
  zeros = [0] * 10000

How can we time these two solutions? Here's a simple way:

import time, makezeros

def do_timing(num_times, funcs):           # funcs is a tuple of functions
    totals = {}
    for func in funcs: totals[func] = 0.0
    for x in range(num_times):
        for func in funcs:
            starttime = time.time()        # record starting time
            apply(func)                    # call func with no arguments
            stoptime = time.time()         # record ending time
            elapsed = stoptime - starttime # difference yields time elapsed
            totals[func] = totals[func] + elapsed
    for func in funcs:
        print "Running %s %d times took %.3f seconds" % (func.__name__,
                                                         num_times,
                                                         totals[func])
do_timing(100, (makezeros.lots_of_appends, makezeros.one_multiply))

And running this program yields:

csh> python timings.py
Running lots_of_appends 100 times took 7.891 seconds
Running one_multiply 100 times took 0.120 seconds

As you might have suspected, a single list multiplication is much faster than lots of appends. Note that in timings, it's always a good idea to compare lots of runs of functions instead of just one. Otherwise the timings are likely to be heavily influenced by things that have nothing to do with the algorithm, such as network traffic on the computer or GUI events.
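The same repeat-and-accumulate idea can be sketched for a current interpreter. The version below assumes Python 3 and is not the program that produced the output above: the name time_functions is ours, it returns the totals instead of printing them, and it uses time.perf_counter, a clock that modern Python provides for measuring short intervals, in place of time.time.

```python
import time

def lots_of_appends():
    zeros = []
    for i in range(10000):
        zeros.append(0)
    return zeros

def one_multiply():
    return [0] * 10000

def time_functions(num_times, funcs):
    """Return a dict mapping each function's name to its total elapsed seconds."""
    totals = {}
    for func in funcs:
        totals[func.__name__] = 0.0
    for _ in range(num_times):           # many runs smooth out unrelated noise
        for func in funcs:
            start = time.perf_counter()
            func()
            totals[func.__name__] += time.perf_counter() - start
    return totals

totals = time_functions(20, [lots_of_appends, one_multiply])
for name in sorted(totals):
    print("Running %s 20 times took %.4f seconds" % (name, totals[name]))
```

Interleaving the functions inside the outer loop, as the original does, means that a burst of unrelated activity on the machine tends to affect both candidates rather than just one.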

What if you've written a complex program, and it's running slower than you'd like, but you're not sure what the problem spot is? In those cases, what you need to do is profile the program: determine which parts of the program are the time-sinks and see if they can be optimized, or if the program structure can be modified to even out the bottlenecks. The Python distribution includes just the right tool for that, the profile module, documented in the Library Reference. Assuming that you want to profile a given function in the current namespace, do this:

>>> import profile
>>> from timings import *
>>> from makezeros import *
>>> profile.run('do_timing(100, (lots_of_appends, one_multiply))')
Running lots_of_appends 100 times took 8.773 seconds
Running one_multiply 100 times took 0.090 seconds
         203 function calls in 8.823 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      100   8.574   0.086   8.574  0.086 makezeros.py:1(lots_of_appends)
      100   0.101   0.001   0.101  0.001 makezeros.py:6(one_multiply)
        1   0.001   0.001   8.823  8.823 profile:0(do_timing(100, 
                                            (lots_of_appends, one_multiply)))
        0   0.000           0.000        profile:0(profiler)
        1   0.000   0.000   8.821  8.821 python:0(194.C.2)
        1   0.147   0.147   8.821  8.821 timings.py:2(do_timing)

As you can see, this gives a fairly complicated listing, which includes such things as per-call time spent in each function and the number of calls made to each function. In complex programs, the profiler can help find surprising inefficiencies. Optimizing Python programs is beyond the scope of this book; if you're interested, however, check the Python newsgroup: periodically, a user asks for help speeding up a program and a spontaneous contest starts up, with interesting advice from expert users.
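If you would rather capture the profiler's report in a string than have it printed, something along the following lines should work. It is a sketch under assumptions: it targets Python 3, the helper name profile_call is ours, and it relies on the pstats module (the profiler's companion for formatting results) to sort the entries by cumulative time.

```python
import io
import profile
import pstats

def lots_of_appends():
    zeros = []
    for i in range(1000):
        zeros.append(0)
    return zeros

def profile_call(func):
    """Run func under the profiler and return the formatted report as text."""
    profiler = profile.Profile()
    profiler.runcall(func)               # profile a single call of func
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats('cumulative').print_stats()
    return stream.getvalue()

report = profile_call(lots_of_appends)
print(report)
```

The report has the same columns as the listing above (ncalls, tottime, percall, cumtime), so the functions at the top of a cumulative sort are the first places to look for time-sinks.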
