I l@ve RuBoard

9.1 Data Structure Manipulations

One of Python's greatest features is that it provides the list, tuple, and dictionary built-in types. They are so flexible and easy to use that once you've grown used to them, you'll find yourself reaching for them automatically.

9.1.1 Making Copies Inline

Due to Python's reference management scheme, the statement a = b doesn't make a copy of the object referenced by b ; instead, it makes a new reference to that object. Sometimes a new copy of an object, not just a shared reference, is needed. How to do this depends on the type of the object in question. The simplest way to make copies of lists and tuples is somewhat odd. If myList is a list, then to make a copy of it, you can do:

newList = myList[:]

which you can read as "slice from beginning to end," since you'll remember from Chapter 2, that the default index for the start of a slice is the beginning of the sequence (0), and the default index for the end of a slice is the end of sequence. Since tuples support the same slicing operation as lists, this same technique can also copy tuples. Dictionaries, on the other hand, don't support slicing. To make a copy of a dictionary myDict, you can use:

newDict = {}
for key in myDict.keys():
    newDict[key] = myDict[key]

This is such a common task that a new method was added to the dictionary object in Python 1.5, the copy()method, which performs this task. So the preceding code can be replaced with the single statement:

newDict = myDict.copy()

Another common dictionary operation is also now a standard dictionary feature. If you have a dictionary oneDict, and want to update it with the contents of a different dictionary otherDict, simply type oneDict.update(otherDict). This is the equivalent of:

for key in otherDict.keys():
    oneDict[key] = otherDict[key]

If oneDict shared some keys with otherDict before the update() operation, the old values associated with the keys in oneDict are obliterated by the update. This may be what you want to do (it usually is, which is why this behavior was chosen and why it was called "update"). If it isn't, the right thing to do might be to complain (raise an exception), as in:

def mergeWithoutOverlap(oneDict, otherDict):
    newDict = oneDict.copy()
    for key in otherDict.keys():
        if key in oneDict.keys():
            raise ValueError, "the two dictionaries are sharing keys!"
        newDict[key] = otherDict[key]
    return newDict

or, alternatively, combine the values of the two dictionaries, with a tuple, for example:

def mergeWithOverlap(oneDict, otherDict):
    newDict = oneDict.copy()
    for key in otherDict.keys():
        if key in oneDict.keys():
            newDict[key] = oneDict[key], otherDict[key]
        else:
            newDict[key] = otherDict[key]
    return newDict

To illustrate the differences between the preceding three algorithms, consider the following two dictionaries:

phoneBook1 = {'michael': '555-1212', 'mark': '554-1121', 'emily': '556-0091'}
phoneBook2 = {'latoya': '555-1255', 'emily': '667-1234'}

If phoneBook1 is possibly out of date, and phoneBook2 is more up to date but less complete, the right usage is probably phoneBook1.update(phoneBook2). If the two phoneBooks are supposed to have nonoverlapping sets of keys, using newBook = mergeWithoutOverlap(phoneBook1, phoneBook2) lets you know if that assumption is wrong. Finally, if one is a set of home phone numbers and the other a set of office phone numbers, chances are newBook = mergeWithOverlap(phoneBook1, phoneBook2)is what you want, as long as the subsequent code that uses newBook can deal with the fact that newBook['emily'] is the tuple ('556-0091', '667-1234').

9.1.2 Making Copies: The copy Module

Back to making copies: the [:] and .copy()tricks will get you copies in 90% of the cases. If you are writing functions that, in true Python spirit, can deal with arguments of any type, it's sometimes necessary to make copies of X, regardless of what X is. In comes the copy module. It provides two functions, copy and deepcopy. The first is just like the [:] sequence slice operation or the copy method of dictionaries. The second is more subtle and has to do with deeply nested structures (hence the term deepcopy). Take the example of copying a list listOne by slicing it from beginning to end using the [:] construct. This technique makes a new list that contains references to the same objects contained in the original list. If the contents of that original list are immutable objects, such as numbers or strings, the copy is as good as a "true" copy. However, suppose that the first element in listOne is itself a dictionary (or any other mutable object). The first element of the copy of listOne is a new reference to the same dictionary. So if you then modify that dictionary, the modification is evident in both listOne and the copy of listOne. An example makes it much clearer:

>>> import copy
>>> listOne = [{"name": "Willie", "city": "Providence, RI"}, 1, "tomato", 3.0]
>>> listTwo = listOne[:]                   # or listTwo=copy.copy(listOne)
>>> listThree = copy.deepcopy(listOne)
>>> listOne.append("kid")
>>> listOne[0]["city"] = "San Francisco, CA"
>>> print listOne, listTwo, listThree
[{'name': 'Willie', 'city': 'San Francisco, CA'}, 1, 'tomato', 3.0, 'kid']
[{'name': 'Willie', 'city': 'San Francisco, CA'}, 1, 'tomato', 3.0]
[{'name': 'Willie', 'city': 'Providence, RI'}, 1, 'tomato', 3.0]

As you can see, modifying listOne directly modified only listOne. Modifying the first entry of the list referenced by listOne led to changes in listTwo, but not in listThree; that's the difference between a shallow copy ([:]) and a deepcopy. The copy module functions know how to copy all the built-in types that are reasonably copyable,^[1] including classes and instances.

^[1] Some objects don't qualify as "reasonably copyable," such as modules, file objects, and sockets. Remember that file objects are different from files on disk.

9.1.3 Sorting and Randomizing

In Chapter 2, you saw that lists have a sort method that does an in-place sort. Sometimes you want to iterate over the sorted contents of a list, without disturbing the contents of this list. Or you may want to list the sorted contents of a tuple. Because tuples are immutable, an operation such as sort, which modifies it in place, is not allowed. The only solution is to make a list copy of the elements, sort the list copy, and work with the sorted copy, as in:

listCopy = list(myTuple)
listCopy.sort()
for item in listCopy:
    print item                             # or whatever needs doing

This solution is also the way to deal with data structures that have no inherent order, such as dictionaries. One of the reasons that dictionaries are so fast is that the implementation reserves the right to change the order of the keys in the dictionary. It's really not a problem, however, given that you can iterate over the keys of a dictionary using an intermediate copy of the keys of the dictionary:

keys = myDict.keys()                       # returns an unsorted list of
                                           # the keys in the dict
keys.sort()
for key in keys:                           # print key, value pairs 
    print key, myDict[key]                 # sorted by key

The sort method on lists uses the standard Python comparison scheme. Sometimes, however, that scheme isn't what's needed, and you need to sort according to some other procedure. For example, when sorting a list of words, case (lower versus UPPER) may not be significant. The standard comparison of text strings, however, says that all uppercase letters "come before" all lowercase letters, so 'Baby' is "less than" 'apple' but 'baby' is "greater than" 'apple'. In order to do a case-independent sort, you need to define a comparison function that takes two arguments, and returns -1, 0, or 1 depending on whether the first argument is smaller than, equal to, or greater than the second argument. So, for our case-independent sorting, you can use:

>>> def caseIndependentSort(something, other):

...    something, other  = string.lower(something), string.lower(other)

...    return cmp(something, other)

... 
>>> testList = ['this', 'is', 'A', 'sorted', 'List']
>>> testList.sort()
>>> print testList
['A', 'List', 'is', 'sorted', 'this']
>>> testList.sort(caseIndependentSort)
>>> print testList
['A', 'is', 'List', 'sorted', 'this']

We're using the built-in function cmp, which does the hard part of figuring out that 'a' comes before 'b', 'b' before 'c', etc. Our sort function simply lowercases both items and sorts the lowercased versions, which is one way of making the comparison case-independent. Also note that the lowercasing conversion is local to the comparison function, so the elements in the list aren't modified by the sort.

9.1.4 Randomizing: The random Module

What about randomizing a sequence, such as a list of lines? The easiest way to randomize a sequence is to repeatedly use the choice function in the random module, which returns a random element from the sequence it receives as an argument.^[2] In order to avoid getting the same line multiple times, remember to remove the chosen item. When manipulating a list object, use the remove method:

^[2] The random module provides many other useful functions, such as the random function, which returns a random floating-point number between and 1. Check a reference source for details.

while myList:                        # will stop looping when myList is empty
    element = random.choice(myList)
    myList.remove(element)
    print element,

If you need to randomize a nonlist object, it's usually easiest to convert that object to a list and randomize the list version of the same data, rather than come up with a new strategy for each data type. This might seem a wasteful strategy, given that it involves building intermediate lists that might be quite large. In general, however, what seems large to you probably won't seem so to the computer, thanks to the reference system. Also, consider the time saved by not having to come up with a different strategy for each data type! Python is designed to save time; if that means running a slightly slower or bigger program, so be it. If you're handling enormous amounts of data, it may be worthwhile to optimize. But never optimize until the need for optimization is clear; that would be a waste of time.

9.1.5 Making New Data Structures

The last point about not reinventing the wheel is especially true when it comes to data structures. For example, Python lists and dictionaries might not be the lists and dictionaries or mappings you're used to, but you should avoid designing your own data structure if these structures will suffice. The algorithms they use have been tested under wide ranges of conditions, and they're fast and stable. Sometimes, however, the interface to these algorithms isn't convenient for a particular task.

For example, computer-science textbooks often describe algorithms in terms of other data structures such as queues and stacks. To use these algorithms, it may make sense to come up with a data structure that has the same methods as these data structures (such as pop and push for stacks or enqueue/dequeue for queues). However, it also makes sense to reuse the built-in list type in the implementation of a stack. In other words, you need something that acts like a stack but is based on a list. The easiest solution is to use a class wrapper around a list. For a minimal stack implementation, you can do this:

class Stack:
    def __init__(self, data):
        self._data = list(data)
    def push(self, item):
        self._data.append(item)
    def pop(self):
        item = self._data[-1]
        del self._data[-1]
        return item

The following is simple to write, to understand, to read, and to use:

>>> thingsToDo = Stack(['write to mom', 'invite friend over', 'wash the kid'])

>>> thingsToDo.push('do the dishes')

>>> print thingsToDo.pop()

do the dishes
>>> print thingsToDo.pop()

wash the kid

Two standard Python naming conventions are used in the Stack class above. The first is that class names start with an uppercase letter, to distinguish them from functions. The other is that the _data attribute starts with an underscore. This is a half-way point between public attributes (which don't start with an underscore), private attributes (which start with two underscores; see Chapter 6), and Python-reserved identifiers (which both start and end with two underscores). What it means is that _data is an attribute of the class that shouldn't be needed by clients of the class. The class designer expects such "pseudo-private" attributes to be used only by the class methods and by the methods of any eventual subclass.

9.1.6 Making New Lists and Dictionaries: The UserList and UserDict Modules

The Stack class presented earlier does its minimal job just fine. It assumes a fairly minimal definition of what a stack is, specifically, something that supports just two operations, a push and a pop. Quickly, however, you find that some of the features of lists are really nice, such as the ability to iterate over all the elements using the for...in... construct. This can be done by reusing existing code. In this case, you should use the UserList class defined in the UserList module as a class from which the Stack can be derived. The library also includes a UserDict module that is a class wrapper around a dictionary. In general, they are there to be specialized by subclassing. In our case:

# import the UserList class from the UserList module
from UserList import UserList

# subclass the UserList class
class Stack(UserList):
    push = UserList.append
    def pop(self):
        item = self[-1]                    # uses __getitem__
        del self[-1] 
        return item

This Stack is a subclass of the UserList class. The UserListclass implements the behavior of the [] brackets by defining the special __getitem__ and __delitem_ _ methods among others, which is why the code in pop works. You don't need to define your own __init__ method because UserList defines a perfectly good default. Finally, the push method is defined just by saying that it's the same as UserList's append method. Now we can do list-like things as well as stack-like things:

>>> thingsToDo = Stack(['write to mom', 'invite friend over', 'wash the kid'])?/B>
>>> print thingsToDo?/b>                  # inherited from UserList
['write to mom', 'invite friend over', 'wash the kid']
>>> thingsToDo.pop()?/b>        
'wash the kid'
>>> thingsToDo.push('change the oil')?/B>
>>> for chore in thingsToDo:?/B>          # we can also iterate over the contents?/i>
...    print chore                    ?/b># as "for .. in .." uses __getitem__
...
write to mom
invite friend over
change the oil

As this book was being written, Guido van Rossum announced that in Python 1.5.2 (and subsequent versions), list objects now have an additional method called pop, which behaves just like the one here. It also has an optional argument that specifies what index to use to do the pop (with the default being the last element in the list).

I l@ve RuBoard