I l@ve RuBoard Previous Section Next Section

2.9 Selecting Random Elements from a List Without Repetition

Credit: Iuri Wickert, Duncan Grisby, Steve Holden, Alex Martelli

2.9.1 Problem

You need to consume, in random order, the items of a rather long list, and the most direct approach is painfully slow.

2.9.2 Solution

While it's a common mistake to be overly concerned with speed, you should not ignore the different performances of various algorithms. Suppose we must process all of the items in a long list in random order, without repetition. For example:

import random

# an example of a processing function
def process(datum): print datum
# an example of a set of data to process
data = range(10000)

The simplest version is very slow:

def simple(  ):
    while data:
        # returns element, not index (doesn't help)
        elem = random.choice(data) 
        # needs to search for the element throughout the list!

Here is a faster version:

def faster(  ):
    while data:
        index = random.randrange(len(data))
        elem = data[index]
        # direct deletion, no search needed
        del data[index] 

# the same, but preserving the data list
def faster_preserve(  ):
    aux = range(len(data))
    while aux:
        posit = random.randrange(len(aux))
        index = aux[posit]
        elem = data[index]
        # alters the auxiliary list only
        del aux[posit] 

However, the key improvement is to switch to an O(N) algorithm:

def improved(  ):
    size = len(data)
    while size:
        index = random.randrange(size)
        elem = data[index]
        data[index] = data[size-1]
        size = size - 1

Of course, you can also implement a version of this that preserves the data list.

But the winner is the version that appears to be the simplest:

def best(  ):
    for elem in data: process(elem)

# or, if you need to preserve the data list's original ordering:
def best_preserve(  ):
    aux = list(data)
    for elem in aux: process(elem)

2.9.3 Discussion

The simplest, most direct way of consuming a list in a random fashion is painfully slow for lists with a few hundred elements. While it is tempting to use the simple, clear choice/remove combination, as in the simple function, this is a bad choice, because remove must linearly search through the list to find the element to delete. In other words, the overall performance is O(N2), with a large multiplicative constant.

Fortunately, there are equally simple (or simpler), but much faster, ways to do this. The faster function, using randrange/del to generate the random indexes for the list, can skip the costly search for the deletion. If it's important to preserve the input list, you can use a disposable auxiliary list to generate the data list indexes, as in the faster_preserve function.

However, del anylist[x] for a random x is still O(N), so overall performance is still O(N2), albeit with a much smaller multiplicative constant. Incidentally, the pseudorandom order in which items are processed is not the same with the various approaches, even if random is seeded in the same way. All of the orderings are equally pseudorandom, though.

Pursuing O(N) performance, one possibility is not to delete an item from a list at all, but rather to overwrite it with the last item and decrease at each step the number of items from which we're choosing. The improved function takes this tack and benefits greatly from it, in terms of performance.

The fastest approach, however, is to shuffle the data (or an auxiliary copy of it) once, at the start, then just process the items sequentially in the resulting pseudorandom order. The nice thing about this approach, shown in the best and best_preserve functions, is that it's actually the simplest of all.

On lists of 10,000 items, as shown in this recipe, the overhead (meaning pure overhead, using a do-nothing processing function) of simple is about 13 or 14 times more than that of faster and faster_preserve. Those functions, in turn, have over twice as much overhead as improved, best, and best_preserve. On lists of 100,000 items, faster and faster_preserve become about 15 times slower than improved, best, and best_preserve. The latter two have, for every list size, about 20%-30% less overhead than improved梐 very minor issue, although their utter simplicity clearly does make them deserve their names.

While an improvement of 25%, or even a factor of 2, may be neglected without substantial cost for the performance of your program as a whole, the same does not apply to an algorithm that is 10 or more times as slow as it could be. Such terrible performance is likely to make that program fragment a bottleneck, all by itself, particularly when we're talking about O(N2) versus O(N) behavior.

2.9.4 See Also

The documentation for the random module in the Library Reference.

    I l@ve RuBoard Previous Section Next Section