I l@ve RuBoard

11.6 Resuming the HTTP Download of a File

Credit: Chris Moffitt

11.6.1 Problem

You need to resume an HTTP download of a file that has been partially transferred.

11.6.2 Solution

Large downloads are sometimes interrupted. However, a good HTTP server that supports the Range header lets you resume the download from where it was interrupted. The standard Python module urllib lets you access this functionality almost seamlessly. You need only add the Range header and intercept the 206 status code, which urllib treats as an error but which the server sends to confirm that it is responding with a partial file:

import urllib, os

class myURLOpener(urllib.FancyURLopener):
    """ Subclass to override error 206 (partial file being sent); okay for us """
    def http_error_206(self, url, fp, errcode, errmsg, headers, data=None):
        pass    # Ignore the expected "non-error" code

def getrest(dlFile, fromUrl, verbose=0):
    loop = 1
    existSize = 0
    myUrlclass = myURLOpener(  )
    if os.path.exists(dlFile):
        outputFile = open(dlFile,"ab")
        existSize = os.path.getsize(dlFile)
        # If the file exists, then download only the remainder
        myUrlclass.addheader("Range","bytes=%s-" % (existSize))
    else:
        outputFile = open(dlFile,"wb")

    webPage = myUrlclass.open(fromUrl)
    if verbose:
        for k, v in webPage.headers.items(  ):
            print k, "=", v

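    # If we already have the whole file, there is no need to download it again.
    # A partial (206) reply carries a Content-Range header, and its
    # Content-Length counts only the bytes still to come, so compare the
    # sizes only when the server is sending the whole file
    numBytes = 0
    webSize = int(webPage.headers['Content-Length'])
    isPartial = webPage.headers.get('Content-Range') is not None
    if not isPartial and webSize == existSize:
        if verbose: print "File (%s) was already downloaded from URL (%s)"%(
            dlFile, fromUrl)
    else:
        if verbose: print "Downloading %d more bytes" % webSize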
        while 1:
            data = webPage.read(8192)
            if not data:
                break
            outputFile.write(data)
            numBytes = numBytes + len(data)

    webPage.close(  )
    outputFile.close(  )

    if verbose:
        print "downloaded", numBytes, "bytes from", webPage.url
    return numBytes

11.6.3 Discussion

The HTTP Range header lets the web server know that you want only a certain range of data to be downloaded, and this recipe takes advantage of this header. Of course, the server needs to support the Range header, but since the header is part of the HTTP 1.1 specification, it's widely supported. This recipe has been tested with Apache 1.3 as the server, but I expect no problems with other reasonably modern servers.
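
To make the header concrete, here is a minimal sketch written for Python 3, where the corresponding module is urllib.request rather than urllib. The helper name build_range_request and the example URL are hypothetical, and nothing is sent over the network; the sketch only shows the header value that a resume request would carry.

```python
import urllib.request

def build_range_request(url, existing_size):
    """Return a Request asking for everything from byte existing_size on."""
    request = urllib.request.Request(url)
    if existing_size > 0:
        # "bytes=N-" means: from offset N through the end of the resource
        request.add_header("Range", "bytes=%d-" % existing_size)
    return request

# Hypothetical URL, used only to show the resulting header
req = build_range_request("http://www.example.com/big.iso", 1024)
print(req.get_header("Range"))
```

With no bytes on disk yet, the sketch leaves the header out entirely, so the server sends the whole file as usual.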

The recipe lets urllib.FancyURLopener do all the hard work of adding a new header, as well as the normal handshaking. I had to subclass it to make it known that the error 206 is not really an error in this case, so you can proceed normally. I also do some extra checks to quit the download if I've already downloaded the whole file.
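
One way to picture the decision a resuming client has to make is the following sketch, written for Python 3. It is not the recipe's own logic: the function name plan_resume and its return values are assumptions chosen for illustration, and it takes the status code and headers as plain arguments so that it can be tried without a server.

```python
import re

def plan_resume(status, headers, existing_size):
    """Decide what to do with a reply to a Range request.

    Returns "append" (206: write the body after the existing bytes),
    "restart" (200: the server ignored Range and is sending everything),
    or "complete" (416: assumed to mean nothing is left to fetch).
    """
    if status == 206:
        # Content-Range looks like "bytes 1024-2047/2048"
        content_range = headers.get("Content-Range", "")
        match = re.match(r"bytes (\d+)-(\d+)/(\d+|\*)", content_range)
        if match and int(match.group(1)) == existing_size:
            return "append"
        raise ValueError("unexpected Content-Range in partial reply")
    if status == 200:
        return "restart"
    if status == 416:
        # Range Not Satisfiable: the requested start is at or past the end
        return "complete"
    raise ValueError("unexpected HTTP status %d" % status)
```

Checking that the range in Content-Range starts where the local file ends guards against appending bytes from the wrong offset.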

Check out the HTTP 1.1 RFC (2616) to learn more about what all of the headers mean. You may find a header that is very useful, and Python's urllib lets you send any header you want. This recipe should probably also check that the web server accepts Range requests, for example by looking for an Accept-Ranges header in the response, which is simple to do.
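
A check for Range support could look roughly like the sketch below, written for Python 3. The function name server_accepts_ranges is hypothetical, and the headers argument is assumed to be any mapping with a get method, such as the headers attribute of a response object or a plain dictionary.

```python
def server_accepts_ranges(headers):
    """True if the reply headers advertise byte-range support."""
    value = headers.get("Accept-Ranges", "")
    # The header may list several units; "none" means ranges are refused
    units = [unit.strip().lower() for unit in value.split(",")]
    return "bytes" in units
```

A server is not required to send Accept-Ranges, so a False result here does not prove that a Range request would fail; it only means that the server did not advertise support.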

11.6.4 See Also

Documentation of the standard library module urllib in the Library Reference; the HTTP 1.1 RFC (http://www.ietf.org/rfc/rfc2616.txt).
