8.3 Serializing Data Using the pickle and cPickle Modules

Credit: Luther Blissett

8.3.1 Problem

You have a Python data structure, which may include fundamental Python objects, and possibly classes and instances, and you want to serialize it and reconstruct it at a reasonable speed.

8.3.2 Solution

If you don't want to assume that your data is composed of only fundamental Python objects, or you need portability across versions of Python, or you need to transmit the serialized form as text, the best way of serializing your data is with the cPickle module (the pickle module is a pure-Python equivalent, but it is far slower and worth using only when cPickle is unavailable). For example:

data = {12:'twelve', 'feep':list('ciao'), 1.23:4+5j, (1,2,3):u'wer'}

You can serialize data to a text string:

import cPickle
text = cPickle.dumps(data)

or to a binary string, which is faster and takes up less space:

bytes = cPickle.dumps(data, 1)

You can now sling text or bytes around as you wish (e.g., send it across a network or store it as a BLOB in a database), as long as you keep it intact. In the case of bytes, this means keeping its arbitrary binary bytes intact. In the case of text, this means keeping its textual structure intact, including newline characters. Then you can reconstruct the data at any time, regardless of machine architecture or Python release:

redata1 = cPickle.loads(text)
redata2 = cPickle.loads(bytes)

Either call reconstructs a data structure that compares equal to data. The order of keys in dictionaries is arbitrary in both the original and reconstructed data structures, but order in any kind of sequence is meaningful and is therefore preserved. You don't need to tell cPickle.loads whether the original dumps used text mode (the default) or binary mode (faster and more compact): loads figures it out by examining its argument's contents.
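The round trip described above can be sketched as follows. This is a minimal illustration, not part of the original recipe: the try/except import is an assumption added so the same code also runs on Python versions that no longer ship cPickle, and the protocol is passed explicitly (0 for text, 1 for binary) because the default is not text mode on every Python version.

```python
try:
    import cPickle as pickle   # the module this recipe uses
except ImportError:
    import pickle              # assumption: fallback where cPickle is absent

data = {12: 'twelve', 'feep': list('ciao'), 1.23: 4+5j, (1, 2, 3): u'wer'}

text = pickle.dumps(data, 0)     # protocol 0: the text format
binary = pickle.dumps(data, 1)   # protocol 1: the binary format

# loads is never told which format was used; it inspects the string itself
redata1 = pickle.loads(text)
redata2 = pickle.loads(binary)

assert redata1 == data           # equal to the original...
assert redata2 == data
assert redata1 is not data       # ...but a newly built object
assert redata1['feep'] == ['c', 'i', 'a', 'o']   # sequence order is preserved
```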

When you specifically want to write the data to a file, you can also use the dump function of the cPickle module, which lets you dump several data structures one after the other:

ouf = open('datafile.txt', 'w')
cPickle.dump(data, ouf)
cPickle.dump('some string', ouf)
cPickle.dump(range(19), ouf)
ouf.close()

Once you have done this, you can recover from datafile.txt the same data structures you dumped into it, in the same sequence:

inf = open('datafile.txt')
a = cPickle.load(inf)
b = cPickle.load(inf)
c = cPickle.load(inf)
inf.close()

You can also pass cPickle.dump a third argument of 1 to tell it to serialize the data in binary form (faster and more compact), but the datafile must be opened for binary I/O, not in the default text mode, when you originally dump to it and when you later load from it.
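A binary-format variant of the file example might look like the sketch below. The file name datafile.bin and the temporary directory are hypothetical, chosen only so the example cleans up after itself, and the import fallback is an assumption for Pythons without cPickle.

```python
import os
import tempfile
try:
    import cPickle as pickle   # the module this recipe uses
except ImportError:
    import pickle              # assumption: fallback where cPickle is absent

data = {12: 'twelve', 'feep': list('ciao')}

# Hypothetical location, for illustration only
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'datafile.bin')

ouf = open(path, 'wb')                 # binary mode when dumping
pickle.dump(data, ouf, 1)              # third argument 1: binary format
pickle.dump('some string', ouf, 1)
pickle.dump(list(range(19)), ouf, 1)
ouf.close()

inf = open(path, 'rb')                 # binary mode again when loading
a = pickle.load(inf)
b = pickle.load(inf)
c = pickle.load(inf)
inf.close()

# Remove the temporary file and directory
os.remove(path)
os.rmdir(tmpdir)
```

Opening the file in text mode for either step risks newline translation on some platforms, which would corrupt the binary data.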

8.3.3 Discussion

Python offers several ways to serialize data (i.e., make the data into a string of bytes that you can save on disk or in a database, send across the network, and so on) and corresponding ways to reconstruct the data from such serialized forms. Typically, the best approach is to use the cPickle module. There is also a pure-Python equivalent, called pickle (the cPickle module is coded in C as a Python extension), but pickle is substantially slower, and the only reason to use it is if you don't have cPickle (e.g., a Python port onto a handheld computer with tiny storage space, where you saved every byte you possibly could by installing only an indispensable subset of Python's large standard library).

cPickle supports most elementary data types (e.g., dictionaries, lists, tuples, numbers, strings) and combinations thereof, as well as classes and instances. Pickling classes and instances saves only the data involved, not the code. (Code objects are not even among the types that cPickle knows how to serialize, because there would be no way to guarantee their portability across disparate versions of Python.) See Recipe 8.4 for more about pickling classes and instances.
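The point about code objects can be checked directly. In the sketch below, the function double is a hypothetical example, and the exception types caught are an assumption meant to cover the errors the pickle modules raise for unsupported objects.

```python
try:
    import cPickle as pickle   # the module this recipe uses
except ImportError:
    import pickle              # assumption: fallback where cPickle is absent

def double(x):                 # hypothetical function, for illustration only
    return 2 * x

# Plain data pickles without trouble
plain = pickle.loads(pickle.dumps([1, 'two', (3.0, 4+5j)], 1))

# The function's compiled code object is refused
try:
    pickle.dumps(double.__code__)
    code_pickled = True
except (TypeError, pickle.PicklingError):
    code_pickled = False
```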

cPickle guarantees compatibility from one Python release to another and independence from a specific machine's architecture. Data serialized with cPickle will still be readable if you upgrade your Python release, and pickling is guaranteed to work if you're sending serialized data between different machines.

The dumps function of cPickle accepts any Python data structure and returns a text string representing it. Or, if you call dumps with a second argument of 1, it returns an arbitrary byte string instead, which is faster and takes up less space. You can pass either the text or the byte string to the loads function, which will return another Python data structure that compares equal (==) to the one you originally dumped. In between the dumps and loads calls, you can subject the byte string to any procedure you wish, such as sending it over the network, storing it in a database and retrieving it, or encrypting it and decrypting it. As long as the string's textual or binary structure is correctly restored, loads will work fine on it (even across platforms and releases).
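As an illustration of transforming the string between dumps and loads, the sketch below uses zlib compression and base64 encoding as stand-ins for "any procedure" (they are not mentioned in the recipe, and a real application might encrypt or store the string instead). The only requirement is that the exact original string is restored before loads is called.

```python
import base64
import zlib
try:
    import cPickle as pickle   # the module this recipe uses
except ImportError:
    import pickle              # assumption: fallback where cPickle is absent

data = {12: 'twelve', 'feep': list('ciao'), (1, 2, 3): u'wer'}

payload = pickle.dumps(data, 1)                  # binary pickle

# Outbound: compress, then encode as ASCII for a text-only channel
wire = base64.b64encode(zlib.compress(payload))

# ... wire travels over the network or sits in a database ...

# Inbound: undo the steps in reverse order
restored = zlib.decompress(base64.b64decode(wire))

assert restored == payload                       # byte-for-byte identical
redata = pickle.loads(restored)
assert redata == data
```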

When you specifically need to save the data into a file, you can also use cPickle's dump function, which takes two arguments: the data structure you're dumping and the open file object. You can ask for binary format, which is faster and takes up less space, by giving dump a third argument of 1, provided the file is opened for binary I/O rather than the default (text I/O). The advantage of dump over dumps is that, with dump, you can perform several calls, one after the other, with various data structures and the same open file object. Each dumped data structure is self-delimiting, because its serialized form ends with an explicit stop marker, so load can tell where one structure ends and the next begins. Consequently, when you later open the file for reading (binary reading, if you asked for binary format), and then repeatedly call cPickle.load, passing the file as the argument, each data structure previously dumped is reloaded sequentially, one after the other. The return value of load, like that of loads, is a new data structure that compares equal to the one you originally dumped.
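Because each call to load consumes exactly one dumped structure, a reader that does not know how many structures were dumped can keep calling load until the file is exhausted, at which point load raises EOFError. A minimal sketch, in which the helper name load_all is hypothetical and an in-memory io.BytesIO buffer stands in for a file opened in binary mode:

```python
import io
try:
    import cPickle as pickle   # the module this recipe uses
except ImportError:
    import pickle              # assumption: fallback where cPickle is absent

def load_all(fileobj):
    """Yield every pickled object in fileobj, in the order it was dumped."""
    while True:
        try:
            yield pickle.load(fileobj)
        except EOFError:       # nothing left to read
            return

# In-memory stand-in for a file opened with 'wb' and later 'rb'
buf = io.BytesIO()
for item in ({'a': 1}, 'some string', list(range(5))):
    pickle.dump(item, buf, 1)

buf.seek(0)                    # rewind, as if reopening the file for reading
items = list(load_all(buf))
```

A generator keeps only one reloaded structure in memory at a time, which matters when the file holds many large pickles.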

8.3.4 See Also

Recipe 8.2 and Recipe 8.4; documentation for the standard library module cPickle in the Library Reference.
