
7.7 Calculating the Rate of Client Cache Hits on Apache

Credit: Mark Nenadov

7.7.1 Problem

You need to monitor how often your Apache web server answers a client request with "not modified", rather than sending the page again, because the client's cached copy of the page is up to date.

7.7.2 Solution

When a browser requests a page that it already holds in its cache, it tells the server about the cached copy. If that copy is still up to date, the server returns a "not modified" status code (304) rather than serving the page again. Here's how to find the statistics for such occurrences in your server's logs:

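def ClientCachePercentage(logfile_pathname):
    Contents = open(logfile_pathname, "r").xreadlines(  )
    TotalRequests = 0
    CachedRequests = 0

    for line in Contents:
        TotalRequests += 1
        fields = line.split(" ")
        # the ninth field holds the status code; ignore lines too short to have one
        if len(fields) > 8 and fields[8] == "304":  # server returned "not modified"
            CachedRequests += 1

    if TotalRequests == 0:  # empty log: avoid dividing by zero
        return 0
    return (100*CachedRequests)/TotalRequests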

7.7.3 Discussion

The percentage of requests to your Apache server that are met by the client's own cache is an important factor in the perceived performance of your server. The code in this recipe helps you get this information from the server's log. Typical use would be:

log_path = "/usr/local/nusphere/apache/logs/access_log"
print "Percentage of requests that are client-cached: " + str(
    ClientCachePercentage(log_path)) + "%"

The recipe reads the log file via the special method xreadlines, introduced in Python 2.1, rather than via the more usual readlines. readlines returns a list of all lines, so it must read the whole file into memory, which makes it unsuitable for very large files, and server log files can certainly be very large. Reading such a log into memory at once might fail outright, or run very slowly because of virtual-memory thrashing. xreadlines instead returns a special object, meant to be used only in a for statement (somewhat like an iterator in Python 2.2; Python 2.1 did not have a formal concept of iterators), which fetches lines as the loop needs them and so can save a lot of memory. In Python 2.2, it would be simplest to iterate on the file object directly, with a for statement such as:

for line in open(logfile_pathname):

This is the simplest and fastest approach, but it does require Python 2.2 or later to work.
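As a minimal sketch of that direct-iteration approach in current Python (Python 3, where xreadlines no longer exists), the function might look like the following. The name client_cache_percentage, the with block, the guards for short lines and empty logs, and the floating-point result are assumptions added for illustration, not part of the original recipe:

```python
def client_cache_percentage(logfile_pathname):
    """Return the percentage of logged requests answered with status 304."""
    total_requests = 0
    cached_requests = 0

    # Iterating on the file object yields one line at a time,
    # so the whole log never has to fit in memory.
    with open(logfile_pathname, "r") as logfile:
        for line in logfile:
            total_requests += 1
            fields = line.split(" ")
            # Assumes the status code is the ninth space-separated field.
            if len(fields) > 8 and fields[8] == "304":
                cached_requests += 1

    if total_requests == 0:
        return 0.0  # assumption: report 0 for an empty log
    return 100.0 * cached_requests / total_requests
```

The with statement closes the file even if a line raises an exception, which the one-line open call in the recipe leaves to the garbage collector.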

The body of the for loop calls the split method on each line string, with a string of a single space as the argument, to split the line into a list of its space-separated fields. Then it uses indexing ([8]) to get the ninth such field, which is where Apache's common log format puts the response status code. Code "304" means "not modified" (i.e., the client's cached copy was still current). We count those cases in the CachedRequests variable and all lines in the log in the TotalRequests variable, so that, in the end, we can return the percentage of cache hits. Note that in the expression used with the return statement, it's important to multiply by 100 before we divide, since up to Python 2.1 (and even in 2.2, by default), division between integers truncates (i.e., ignores the remainder). CachedRequests is never larger than TotalRequests, so dividing first would truncate to 0 in every case except when all requests were cache hits, and multiplying that 0 by 100 would still give 0, which is not a very useful result!
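To make the field indexing and the order of operations concrete, here is a small sketch. The log line is a made-up sample in the common log format, and the counts are hypothetical. The // operator is used because it always truncates, as / did between integers in Python 2.1:

```python
# A made-up access-log line in the common log format (an assumption for illustration).
sample_line = '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 304 -'

fields = sample_line.split(" ")
status = fields[8]  # ninth space-separated field: the status code, as a string

# Hypothetical counts: 3 cached requests out of 8 in total.
cached_requests, total_requests = 3, 8

multiply_first = (100 * cached_requests) // total_requests  # 300 // 8
divide_first = 100 * (cached_requests // total_requests)    # 100 * (3 // 8)

print(status, multiply_first, divide_first)  # prints: 304 37 0
```

Note that the status comes back as the string "304", not the integer 304, which is why the recipe compares against a quoted value.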

7.7.4 See Also

The Apache web server is available and documented at http://httpd.apache.org.
