I l@ve RuBoard

1.9 Finding the Intersection of Two Dictionaries

Credit: Andy McKay, Chris Perkins, Sami Hangaslammi

1.9.1 Problem

Given two dictionaries, you need to find the set of keys that are in both dictionaries.

1.9.2 Solution

Dictionaries are a good concrete representation for sets in Python, so operations such as intersections are common. Say you have two dictionaries (but pretend that they each contain thousands of items):

some_dict = { 'zope':'zzz', 'python':'rocks' }
another_dict = { 'python':'rocks', 'perl':'$' }

Here's a bad way to find their intersection that is very slow:

intersect = []
for item in some_dict.keys(  ):
    if item in another_dict.keys(  ):
        intersect.append(item)
print "Intersects:", intersect

And here's a good way that is simple and fast:

intersect = []
for item in some_dict.keys(  ):
    if another_dict.has_key(item):
        intersect.append(item)
print "Intersects:", intersect

In Python 2.2, the following is elegant and even faster:

print "Intersects:", [k for k in some_dict if k in another_dict]

And here's an alternate approach that wins hands down in speed, for Python 1.5.2 and later:

print "Intersects:", filter(another_dict.has_key, some_dict.keys())

1.9.3 Discussion

The keys method produces a list of all the keys of a dictionary. It can be pretty tempting to fall into the trap of just using in, with this list as the righthand side, to test for membership. However, in the first example, you're looping through all of some_dict, then each time looping through all of another_dict. If some_dict has N1 items, and another_dict has N2 items, your intersection operation will have a compute time proportional to the product of N1x N2. (O(N1x N2) is the common computer-science notation to indicate this.)

By using the has_key method, you are not looping on another_dict any more, but rather checking the key in the dictionary's hash table. The processing time for has_key is basically independent of dictionary size, so the second approach is O(N1). The difference is quite substantial for large dictionaries! If the two dictionaries are very different in size, it becomes important to use the smaller one in the role of some_dict, while the larger one takes on the role of another_dict (i.e., loop on the keys of the smaller dictionary, thus picking the smaller N1).

Python 2.2 lets you iterate on a dictionary's keys directly, with the statement:

for key in dict

You can test membership with the equally elegant:

if key in dict

rather than the equivalent but syntactically less nice dict.has_key(key). Combining these two small but nice innovations of Python 2.2 with the list-comprehension notation introduced in Python 2.0, we end up with a very elegant approach, which is at the same time concise, clear, and quite speedy.

However, the fastest approach is the one that uses filter with the bound method another_dict.has_key on the list some_dict.keys. A typical intersection of two 500-item dictionaries with 50% overlap, on a typical cheap machine of today (AMD Athlon 1.4GHz, DDR2100 RAM, Mandrake Linux 8.1), took 710 microseconds using has_key, 450 microseconds using the Python 2.2 technique, and 280 microseconds using the filter-based way. While these speed differences are almost substantial, they pale in comparison with the timing of the bad way, for which a typical intersection took 22,600 microseconds?0 times longer than the simple way and 80 times longer than the filter-based way! Here's the timing code, which shows a typical example of how one goes about measuring relative speeds of equivalent Python constructs:

import time

def timeo(fun, n=1000):
    def void(  ): pass
    start = time.clock(  )
    for i in range(n): void(  )
    stend = time.clock(  )
    overhead = stend - start
    start = time.clock(  )
    for i in range(n): fun(  )
    stend = time.clock(  )
    thetime = stend-start
    return fun._ _name_ _, thetime-overhead

to500 = {}
for i in range(500): to500[i] = 1
evens = {}
for i in range(0, 1000, 2): evens[i] = 1

def simpleway(  ):
    result = []
    for k in to500.keys(  ):
        if evens.has_key(k):
            result.append(k)
    return result

def pyth22way(  ):
    return [k for k in to500 if k in evens]

def filterway(  ):
    return filter(evens.has_key, to500.keys(  ))

def badsloway(  ):
    result = []
    for k in to500.keys(  ):
        if k in evens.keys(  ):
            result.append(k)
    return result

for f in simpleway, pyth22way, filterway, badsloway:
    print "%s: %.2f"%timeo(f)

You can save this code into a .py file and run it (a few times, on an otherwise quiescent machine, of course) with python -O to check how the timings of the various constructs compare on any specific machine in which you're interested. (Note that this script requires Python 2.2 or later.) Timing different code snippets to find out how their relative speeds compare is an important Python technique, since intuition is a notoriously unreliable guide to such relative-speed comparisons. For detailed and general instruction on how to time things, see the introduction to Chapter 17.

When applicable without having to use a lambda form or a specially written function, filter, map, and reduce often offer the fastest solution to any given problem. Of course, a clever Pythonista cares about speed only for those very, very few operations where speed really matters more than clarity, simplicity, and elegance! But these built-ins are pretty elegant in their own way, too.

We don't have a separate recipe for the union of the keys of two dictionaries, but that's because the task is even easier, thanks to a dictionary's update method:

def union_keys(some_dict, another_dict):
    temp_dict = some_dict.copy(  )
    temp_dict.update(another_dict)
    return temp_dict.keys(  )

1.9.4 See Also

The Library Reference section on mapping types.

I l@ve RuBoard