
9.4 Internet-Related Activities

The Internet is a treasure trove of information, but its exponential growth can make it hard to manage. Furthermore, most tools currently available for "surfing the Web" are not programmable. Many web-related tasks can be automated quite simply with the tools in the standard Python distribution.

9.4.1 Downloading a Web Page Programmatically

If you want to track the weather in a given location over a period of months, it's much easier to have an automated program fetch the information and collect it in a file than to remember to do it by hand.

Here is a program that reports the current temperature in a few cities, using the pages of the weather.com web site:

import urllib, urlparse, string, time

def get_temperature(country, state, city):
    # Build the page URL, e.g. .../cities/us_ca_san_francisco.html
    url = urlparse.urljoin('http://www.weather.com/weather/cities/',
                           string.lower(country)+'_' + \
                           string.lower(state) + '_' + \
                           string.replace(string.lower(city), ' ',
                                          '_') + '.html')
    data = urllib.urlopen(url).read()
    # The temperature sits between 'current temp: ' and '&deg;F'
    start = string.index(data, 'current temp: ') + len('current temp: ')
    stop = string.index(data, '&deg;F', start)
    temp = int(data[start:stop])
    localtime = time.asctime(time.localtime(time.time()))
    # vars() supplies the local variables named in the format string
    print ("On %(localtime)s, the temperature in %(city)s, " +\
           "%(state)s %(country)s is %(temp)s F.") % vars()

get_temperature('FR', '', 'Paris')
get_temperature('US', 'RI', 'Providence')
get_temperature('US', 'CA', 'San Francisco')

When run, it produces output like:

~/book:> python get_temperature.py
On Wed Nov 25 16:22:25 1998, the temperature in Paris,  FR is 39 F.
On Wed Nov 25 16:22:30 1998, the temperature in Providence, RI US is 39 F.
On Wed Nov 25 16:22:35 1998, the temperature in San Francisco, CA US is 58 F.
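
The URL-building step can be tried on its own, without network access. The following is a minimal sketch written for Python 3, where the urlparse functions live in urllib.parse; the helper name build_city_url is hypothetical, and the base URL is simply the one used in the listing above, which the site may no longer serve.

```python
from urllib.parse import urljoin

# Base URL taken from the listing above; it may no longer exist on the site.
BASE_URL = 'http://www.weather.com/weather/cities/'

def build_city_url(country, state, city):
    """Mirror the listing's scheme: country_state_city.html, lowercased."""
    # Spaces in the city name become underscores, as in the original code.
    page = '_'.join([country.lower(), state.lower(),
                     city.lower().replace(' ', '_')]) + '.html'
    return urljoin(BASE_URL, page)

for args in [('FR', '', 'Paris'),
             ('US', 'RI', 'Providence'),
             ('US', 'CA', 'San Francisco')]:
    print(build_city_url(*args))
```

Note that an empty state, as in the Paris call, produces a doubled underscore in the page name (fr__paris.html), just as it produces a doubled space in the output shown above.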

The code in get_temperature.py has one significant flaw: both the URL construction and the temperature extraction depend on the specific HTML produced by the web site you use. The day the site's graphic designer decides that "current temp:" should be spelled with capitalized words, this script will stop working. This problem with programmatic parsing of web pages will go away only when more structured formats (such as XML) are used to produce web pages.^[6]

^[6] XML (eXtensible Markup Language) is a language for marking up structured text files that emphasizes the structure of the document, not its graphical nature. XML processing is an entirely different area of Python text processing, with much ongoing work. See Appendix A for some pointers to discussion groups and software.
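
One way to soften (but not remove) the brittleness described above is to match the label case-insensitively and to report failure instead of raising an exception. The following Python 3 sketch uses the re module rather than the string functions of the listing; the pattern, the function name extract_temperature, and the sample HTML fragments are all assumptions made for illustration, not the site's actual markup.

```python
import re

# Hypothetical pattern: tolerates changes in case and whitespace, but still
# assumes the page contains something like "current temp: 39&deg;F".
TEMP_PATTERN = re.compile(r'current\s+temp:\s*(-?\d+)\s*&deg;F', re.IGNORECASE)

def extract_temperature(html):
    """Return the temperature as an int, or None if the label is not found."""
    match = TEMP_PATTERN.search(html)
    if match is None:
        return None
    return int(match.group(1))

# Made-up fragments, not real output from any web site.
print(extract_temperature('<b>Current Temp: 39&deg;F</b>'))
print(extract_temperature('<p>no reading here</p>'))
```

This survives a change in capitalization, but any larger redesign of the page would still break it, so the caller should be prepared for a None result.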

9.4.2 Checking the Validity of Links and Mirroring Web Sites: webchecker.py and Friends

One of the big hassles of maintaining a web site is that as the number of links in the site increases, so does the chance that some of the links will no longer be valid. Good web-site maintenance therefore includes periodic checking for such stale links. The standard Python distribution includes a tool that does just this. It lives in the Tools/webchecker directory and is called webchecker.py.

A companion program called websucker.py, located in the same directory, uses similar logic to create a local copy of a remote web site. Be careful when trying it out, because it can end up trying to download the entire Web onto your machine! The same directory includes two programs called wsgui.py and webgui.py that are Tkinter-based frontends to websucker and webchecker, respectively. We encourage you to look at the source code for these programs to see how one can build sophisticated web-management systems with Python's standard toolset.

In the Tools/Scripts directory, you'll find many other small to medium-sized scripts that might be of interest, such as an equivalent of websucker.py for FTP servers called ftpmirror.py.
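
To make the idea behind a link checker concrete, here is a minimal Python 3 sketch of its first step: collecting the links on a page and resolving them to absolute URLs. It is not how webchecker.py is implemented; the class and function names, the sample page, and the example.com address are all hypothetical. A real checker would go on to fetch each collected URL and report the ones that fail.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href=...> tag on a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

def collect_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links

# Made-up page used only for illustration.
page = '<a href="news.html">News</a> <a href="http://www.python.org/">Python</a>'
for link in collect_links(page, 'http://www.example.com/index.html'):
    print(link)
```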

9.4.3 Checking Mail

Electronic mail is probably the most important medium on the Internet today; it's certainly the means by which most information passes between individuals. Python includes several libraries for processing mail. The one you'll need depends on the kind of mail server you're using. Modules for interacting with POP3 servers (poplib) and IMAP servers (imaplib) are included. If you need to talk to a Microsoft Exchange server, you'll need some of the tools in the win32 distribution (see Appendix B for pointers to the win32 extensions web page).

Here's a simple test of the poplib module, which is used to talk to a mail server running the POP protocol:

>>> from poplib import *
>>> server = POP3('mailserver.spam.org')
>>> print server.getwelcome()
+OK QUALCOMM Pop server derived from UCB (version 2.1.4-R3) at spam starting.
>>> server.user('da')
'+OK Password required for da.'
>>> server.pass_('youllneverguess')
'+OK da has 153 message(s) (458167 octets).'
>>> response, msg, octets = server.retr(152)  # fetch one of the most recent messages
>>> import string
>>> print string.join(msg[:3], '\n')   # and look at the first three lines
Return-Path: <jim@bigbad.com>
Received: from gator.bigbad.com by mailserver.spam.org (4.1/SMI-4.1)
        id AA29605; Wed, 25 Nov 98 15:59:24 PST
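
The same steps can be wrapped in functions. The following is a minimal Python 3 sketch, in which poplib returns message lines as bytes rather than strings; the function names are hypothetical, and the host, user, and password are whatever your own mail server requires. Nothing here contacts a server until check_mail is called.

```python
import poplib

def latest_message_lines(server, count=3):
    """Return the first `count` lines of the newest message as text."""
    num_messages, _total_octets = server.stat()
    if num_messages == 0:
        return []
    # POP3 numbers messages from 1, so the newest is message num_messages.
    _response, lines, _octets = server.retr(num_messages)
    return [line.decode('ascii', 'replace') for line in lines[:count]]

def check_mail(host, user, password):
    """Log in, read the start of the newest message, and always log out."""
    server = poplib.POP3(host)
    try:
        server.user(user)
        server.pass_(password)
        return latest_message_lines(server)
    finally:
        server.quit()
```

Keeping the login separate from the retrieval makes the second function easy to exercise with a stand-in object that provides stat and retr methods.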

In a real application, you'd use a specialized module such as rfc822 to parse the header lines, and perhaps the mimetools and mimify modules to get the data out of the message body (e.g., to process attached files).
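
As a rough illustration of header parsing, the sketch below uses the email package, which in Python 3 takes the place of the rfc822 module named above (rfc822 is no longer part of the standard library). The message lines are given in the form retr returns them under Python 3, a list of bytes; the Subject header and the body are made up for the example, and the function name parse_message is hypothetical.

```python
from email import message_from_bytes

# The first three lines come from the session above; the Subject header
# and the body are invented for illustration.
msg_lines = [
    b'Return-Path: <jim@bigbad.com>',
    b'Received: from gator.bigbad.com by mailserver.spam.org (4.1/SMI-4.1)',
    b'        id AA29605; Wed, 25 Nov 98 15:59:24 PST',
    b'Subject: lunch?',
    b'',
    b'See you at noon.',
]

def parse_message(lines):
    """Join the retrieved lines and parse them into a Message object."""
    return message_from_bytes(b'\r\n'.join(lines))

message = parse_message(msg_lines)
print(message['Return-Path'])    # headers are looked up by name
print(message['Subject'])
print(message.get_payload())     # the body of this single-part message
```

Header lookup is case-insensitive, and a continuation line such as the indented "id ..." line is treated as part of the header that precedes it.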
