I l@ve RuBoard

17.4 Removing Duplicates from a Sequence

Credit: Tim Peters

17.4.1 Problem

You have a sequence that may include duplicates, and you need to remove the duplicates in the fastest possible way without knowing much about the properties of the items in the sequence. You do not care about the order of items in the resulting sequence.

17.4.2 Solution

The key is to try several approaches, fastest first, and use try/except to handle the failing cases of the fastest approaches:

def unique(s):
    """ Return a list of the elements in s in arbitrary order, but without
    duplicates. """

    # Get the special case of an empty s out of the way very rapidly
    n = len(s)
    if n == 0:
        return []

    # Try using a dict first, because it's the fastest and will usually work
    u = {}
    try:
        for x in s:
            u[x] = 1
    except TypeError:
        del u  # Move on to the next method
    else:
        return u.keys(  )

    # Since you can't hash all elements, try sorting, to bring equal items
    # together and weed them out in a single pass
    try:
        t = list(s)
        t.sort(  )
    except TypeError:
        del t  # Move on to the next method
    else:
        assert n > 0
        last = t[0]
        lasti = i = 1
        while i < n:
            if t[i] != last:
                t[lasti] = last = t[i]
                lasti += 1
            i += 1
        return t[:lasti]

    # Brute force is all that's left
    u = []
    for x in s:
        if x not in u:
            u.append(x)
    return u

17.4.3 Discussion

The purpose of this recipe's unique function is to take a sequence s as an argument and return a list of the items in s in arbitrary order, but without duplicates. For example, calling unique([1, 2, 3, 1, 2, 3]) returns an arbitrary permutation of [1, 2, 3], calling unique("abcabc") returns an arbitrary permutation of ["a", "b", "c"], and calling unique(([1, 2], [2, 3], [1, 2])) returns an arbitrary permutation of [[2, 3], [1, 2]].

The fastest way to remove duplicates from a sequence depends on some pretty subtle properties of the sequence elements, such as whether they're hashable and whether they support full comparisons. The unique function shown in this recipe tries three methods, from fastest to slowest, letting runtime exceptions pick the best method available for the sequence at hand.

For best speed, all sequence elements should be hashable. When they are, the unique function will usually work in linear time (i.e., O(N), or directly proportional to the number of elements in the input, which is a good and highly scalable performance characteristic).

If it turns out that hashing the elements (using them as dictionary keys) is not possible, the next best thing is that the elements enjoy a total ordering. If list(s).sort( ) doesn't raise a TypeError, we can assume that s's elements do enjoy a total ordering. Then unique will usually work in O(N x log(N)) time. Note that Python lists' sort method was specially designed to be highly efficient in the presence of many duplicate elements, so the sorting approach may be more effective in Python than elsewhere.

If sorting also turns out to be impossible, the sequence elements must at least support equality testing, or else the very concept of duplicates can't really be meaningful for them. In this case, unique works in quadratic time (i.e., O(N²), or proportional to the square of the number of elements in the input, which is not very scalable, but is the least of all evils, given the sequence item's obviously peculiar nature if we get all the way to this subcase).

This is a pure example of how algorithm efficiency depends on the strength of the assumptions you can make about the data. Of course, you could split this into three distinct functions and directly call the one that best meets your needs. In practice, however, the brute-force method is so slow for large sequences that nothing measurable is lost by simply letting the function as written try the faster methods first.

If you need to preserve the same order of items in the output sequence as in the input sequence, see Recipe 17.5.

17.4.4 See Also

Recipe 17.5.

I l@ve RuBoard