
### 2.3 Processing Selected Pairs of Structured Data Efficiently

Credit: Alex Martelli, David Ascher

#### 2.3.1 Problem

You need to efficiently process pairs of data from two large and related data sets.

#### 2.3.2 Solution

Use an auxiliary dictionary to do preprocessing of the data, thereby reducing the need for iteration over mostly irrelevant data. For instance, if xs and ys are the two data sets, with matching keys as the first item in each entry, so that x[0] == y[0] defines an "interesting" pair:

```
auxdict = {}
for y in ys:
    auxdict.setdefault(y[0], []).append(y)
result = [process(x, y) for x in xs for y in auxdict.get(x[0], [])]
```

Using `auxdict.get(x[0], [])` rather than `auxdict[x[0]]` ensures that entries in xs with no matching entry in ys simply contribute no pairs, instead of raising a KeyError.
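A minimal, self-contained sketch of the technique with toy data (the xs, ys, and process used here are hypothetical stand-ins, just to show the mechanics):

```python
# Toy data sets: matching keys are the first item of each tuple
xs = [(1, 'a'), (2, 'b'), (3, 'c')]
ys = [(1, 'X'), (1, 'Y'), (3, 'Z')]

def process(x, y):
    # Stand-in for whatever per-pair computation you actually need
    return (x[0], x[1], y[1])

# Build the auxiliary dictionary: key -> list of ys entries with that key
auxdict = {}
for y in ys:
    auxdict.setdefault(y[0], []).append(y)

# Pair each x only with the ys entries that share its key
result = [process(x, y) for x in xs for y in auxdict.get(x[0], [])]
# result -> [(1, 'a', 'X'), (1, 'a', 'Y'), (3, 'c', 'Z')]
```

Note that (2, 'b') produces no pairs at all: `auxdict.get(2, [])` returns an empty list, so the inner loop is skipped entirely rather than scanning ys.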

#### 2.3.3 Discussion

To make the problem more concrete, let's look at an example. Say you need to analyze data about visitors to a web site who have purchased something online. This means you need to perform some computation based on data from two log files: one from the web server and one from the credit-card processing framework. Each log file is huge, but only a small number of the web server log entries correspond to credit-card log entries. Let's assume that cclog is a sequence of records, one for each credit-card transaction, and that weblog is a sequence of records describing each web site hit. Let's further assume that each record uses the attribute ipaddress to refer to the IP address involved in each event. In this case, a reasonable first approach would be to do something like:

```
results = [process(webhit, ccinfo)
           for webhit in weblog
           for ccinfo in cclog
           if ccinfo.ipaddress == webhit.ipaddress]
```
The problem with this approach is that the nested list comprehension will iterate over each entry in the web server log, and, for each entry in the web server log, it will also iterate over each entry in the credit-card log. This means that the algorithm has O(M x N) performance characteristics; in other words, the time it takes to compute will be proportional to the product of the size of both logs. As the web site becomes more popular and as data accumulates, the performance of the algorithm will rapidly deteriorate.

The key to optimizing this algorithm is to recognize that the computation (process) needs to happen for only a small subset of all of the possible combinations of the two variables (in this case, webhit and ccinfo). If we could search for only the right pairs, we might be able to speed things up considerably. As Tim Peters says in the introduction to this chapter, if you need to search, use an auxiliary dictionary. The solution described earlier, rewritten to use our variables, yields:

```
ipdict = {}
for webhit in weblog:
    ipdict.setdefault(webhit.ipaddress, []).append(webhit)
results = [process(webhit, ccinfo)
           for ccinfo in cclog
           for webhit in ipdict.get(ccinfo.ipaddress, [])]
```

The preprocessing pass builds ipdict in a single O(M) scan of the web server log, and the list comprehension then looks up each credit-card record's IP address in constant time, so the total work is proportional to the size of the logs plus the number of interesting pairs, rather than to the product of the log sizes.
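To see the optimized pattern end to end, here is a self-contained sketch with made-up log records; the Record class, the sample data, and this particular process function are all hypothetical, purely for illustration:

```python
class Record:
    """A minimal stand-in for a parsed log entry."""
    def __init__(self, ipaddress, info):
        self.ipaddress = ipaddress
        self.info = info

# Hypothetical sample logs: many web hits, few credit-card transactions
weblog = [Record('1.2.3.4', 'GET /'),
          Record('5.6.7.8', 'GET /browse'),
          Record('1.2.3.4', 'GET /checkout')]
cclog = [Record('1.2.3.4', 'charge $10')]

def process(webhit, ccinfo):
    # Stand-in for the real per-pair analysis
    return (webhit.ipaddress, webhit.info, ccinfo.info)

# One pass over weblog builds the IP-address index
ipdict = {}
for webhit in weblog:
    ipdict.setdefault(webhit.ipaddress, []).append(webhit)

# Each transaction is paired only with the hits from its own IP address
results = [process(webhit, ccinfo)
           for ccinfo in cclog
           for webhit in ipdict.get(ccinfo.ipaddress, [])]
# Only the two hits from 1.2.3.4 are paired with the single transaction;
# the hit from 5.6.7.8 is never examined by the inner loop.
```

The hit from 5.6.7.8 never reaches process at all, which is exactly the savings the auxiliary dictionary buys: irrelevant combinations are filtered out by the dictionary lookup instead of being generated and discarded.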