
### 2.3 Processing Selected Pairs of Structured Data Efficiently

Credit: Alex Martelli, David Ascher

#### 2.3.1 Problem

You need to efficiently process pairs of data from two large and related data sets.

#### 2.3.2 Solution

Use an auxiliary dictionary to do preprocessing of the data, thereby reducing the need for iteration over mostly irrelevant data. For instance, if xs and ys are the two data sets, with matching keys as the first item in each entry, so that x[0] == y[0] defines an "interesting" pair:

```
auxdict = {}
for y in ys:
    auxdict.setdefault(y[0], []).append(y)
result = [process(x, y) for x in xs for y in auxdict.get(x[0], [])]
```

Using `auxdict.get(x[0], [])` rather than `auxdict[x[0]]` ensures that entries in xs with no matching entry in ys simply contribute no pairs, instead of raising a KeyError.
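A minimal, self-contained sketch of the technique with toy data (the xs, ys, and process used here are hypothetical stand-ins, just to show the mechanics):

```python
# Toy data sets: matching keys are the first item of each tuple
xs = [(1, 'a'), (2, 'b'), (3, 'c')]
ys = [(1, 'X'), (1, 'Y'), (3, 'Z')]

def process(x, y):
    # Stand-in for whatever per-pair computation you actually need
    return (x[0], x[1], y[1])

# Build the auxiliary dictionary: key -> list of ys entries with that key
auxdict = {}
for y in ys:
    auxdict.setdefault(y[0], []).append(y)

# Pair each x only with the ys entries that share its key
result = [process(x, y) for x in xs for y in auxdict.get(x[0], [])]
# result -> [(1, 'a', 'X'), (1, 'a', 'Y'), (3, 'c', 'Z')]
```

Note that (2, 'b') produces no pairs at all: `auxdict.get(2, [])` returns an empty list, so the inner loop is skipped entirely rather than scanning ys.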

#### 2.3.3 Discussion

To make the problem more concrete, let's look at an example. Say you need to analyze data about visitors to a web site who have purchased something online. This means you need to perform some computation based on data from two log files: one from the web server and one from the credit-card processing framework. Each log file is huge, but only a small number of the web server log entries correspond to credit-card log entries. Let's assume that cclog is a sequence of records, one for each credit-card transaction, and that weblog is a sequence of records describing each web site hit. Let's further assume that each record uses the attribute ipaddress to refer to the IP address involved in each event. In this case, a reasonable first approach would be to do something like:

```
results = [process(webhit, ccinfo)
           for webhit in weblog
           for ccinfo in cclog
           if ccinfo.ipaddress == webhit.ipaddress]
```
The problem with this approach is that the nested list comprehension will iterate over each entry in the web server log, and, for each entry in the web server log, it will also iterate over each entry in the credit-card log. This means that the algorithm has O(M x N) performance characteristics; in other words, the time it takes to compute will be proportional to the product of the size of both logs. As the web site becomes more popular and as data accumulates, the performance of the algorithm will rapidly deteriorate.

The key to optimizing this algorithm is to recognize that the computation (process) needs to happen for only a small subset of all of the possible combinations of the two variables (in this case, webhit and ccinfo). If we could search for only the right pairs, we might be able to speed things up considerably. As Tim Peters says in the introduction to this chapter, if you need to search, use an auxiliary dictionary. The solution described earlier, rewritten to use our variables, yields:

```
ipdict = {}
for webhit in weblog:
    ipdict.setdefault(webhit.ipaddress, []).append(webhit)
results = [process(webhit, ccinfo)
           for ccinfo in cclog
           for webhit in ipdict.get(ccinfo.ipaddress, [])]
```

The preprocessing pass builds ipdict in a single O(M) scan of the web server log, and the list comprehension then looks up each credit-card record's IP address in constant time, so the total work is proportional to the size of the logs plus the number of interesting pairs, rather than to the product of the log sizes.
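To see the optimized pattern end to end, here is a self-contained sketch with made-up log records; the Record class, the sample data, and this particular process function are all hypothetical, purely for illustration:

```python
class Record:
    """A minimal stand-in for a parsed log entry."""
    def __init__(self, ipaddress, info):
        self.ipaddress = ipaddress
        self.info = info

# Hypothetical sample logs: many web hits, few credit-card transactions
weblog = [Record('1.2.3.4', 'GET /'),
          Record('5.6.7.8', 'GET /browse'),
          Record('1.2.3.4', 'GET /checkout')]
cclog = [Record('1.2.3.4', 'charge $10')]

def process(webhit, ccinfo):
    # Stand-in for the real per-pair analysis
    return (webhit.ipaddress, webhit.info, ccinfo.info)

# One pass over weblog builds the IP-address index
ipdict = {}
for webhit in weblog:
    ipdict.setdefault(webhit.ipaddress, []).append(webhit)

# Each transaction is paired only with the hits from its own IP address
results = [process(webhit, ccinfo)
           for ccinfo in cclog
           for webhit in ipdict.get(ccinfo.ipaddress, [])]
# Only the two hits from 1.2.3.4 are paired with the single transaction;
# the hit from 5.6.7.8 is never examined by the inner loop.
```

The hit from 5.6.7.8 never reaches process at all, which is exactly the savings the auxiliary dictionary buys: irrelevant combinations are filtered out by the dictionary lookup instead of being generated and discarded.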