
4.2 Reading from a File

Credit: Luther Blissett

4.2.1 Problem

You want to read text or data from a file.

4.2.2 Solution

Here's the most convenient way to read all of the file's contents at once into one big string:

all_the_text = open('thefile.txt').read()      # all text from a text file
all_the_data = open('abinfile', 'rb').read()   # all data from a binary file

However, it is better to bind the file object to a variable so that you can call close on it as soon as you're done. For example, for a text file:

file_object = open('thefile.txt')
all_the_text = file_object.read()
file_object.close()
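
In Python 2.5 and later, the with statement gives the same prompt closing and also closes the file if read raises an exception. The following is a minimal sketch, assuming a modern Python 3 interpreter; the helper name read_whole_file and the temporary sample file are illustrative only and are not part of the original recipe:

```python
import os
import tempfile

def read_whole_file(path):
    """Return the entire contents of a text file, closing it promptly."""
    # The with block closes file_object on exit, even if read() raises.
    with open(path) as file_object:
        return file_object.read()

# Build a throwaway sample file so the sketch is self-contained.
handle, sample_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(handle, 'w') as sample:
    sample.write('first line\nsecond line\n')

all_the_text = read_whole_file(sample_path)
os.remove(sample_path)  # clean up the sample file
```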

There are four ways to read a text file's contents at once as a list of strings, one per line:

list_of_all_the_lines = file_object.readlines()
list_of_all_the_lines = file_object.read().splitlines(1)
list_of_all_the_lines = file_object.read().splitlines()
list_of_all_the_lines = file_object.read().split('\n')

The first two ways leave a '\n' at the end of each line (i.e., in each string item in the result list), while the other two ways remove the '\n' line terminators. Note that when the text ends with '\n', split('\n') also yields one extra empty string as the last item of the list, which splitlines does not. The first of these four ways is the fastest and most Pythonic. In Python 2.2 and later, there is a fifth way that is equivalent to the first one:

list_of_all_the_lines = list(file_object)
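
The differences among these approaches are easiest to see side by side. The sketch below assumes a modern Python 3 interpreter and uses io.StringIO as a stand-in for a real file object so that it needs no file on disk; the sample text and variable names are illustrative only:

```python
import io

text = 'alpha\nbeta\n'   # sample contents that end with a newline

with_newlines = io.StringIO(text).readlines()   # keeps each '\n'
keepends = text.splitlines(True)                # same as splitlines(1)
stripped = text.splitlines()                    # drops each '\n'
split_on_newline = text.split('\n')             # drops '\n', extra '' at end
from_list = list(io.StringIO(text))             # equivalent to readlines()

print(with_newlines)      # ['alpha\n', 'beta\n']
print(keepends)           # ['alpha\n', 'beta\n']
print(stripped)           # ['alpha', 'beta']
print(split_on_newline)   # ['alpha', 'beta', '']
print(from_list)          # ['alpha\n', 'beta\n']
```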

4.2.3 Discussion

Unless the file you're reading is truly huge, slurping it all into memory in one gulp is fastest and generally most convenient for any further processing. The built-in function open creates a Python file object. With that object, you call the read method to get all of the contents (whether text or binary) as a single large string. If the contents are text, you may choose to immediately split that string into a list of lines, with the split method or with the specialized splitlines method. Since such splitting is a frequent need, you may also call readlines directly on the file object, for slightly faster and more convenient operation. In Python 2.2 and later, you can also pass the file object directly as the only argument to the built-in type list.

On Unix and Unix-like systems, such as Linux and BSD variants, there is no real distinction between text files and binary data files. On Windows and classic (pre-OS X) Macintosh systems, however, line terminators in text files are encoded not with the standard '\n' separator, but with '\r\n' and '\r', respectively. Python translates the line-termination characters into '\n' on your behalf, but this means that you need to tell Python when you open a binary file, so that it won't perform the translation. To do that, use 'rb' as the second argument to open. This is innocuous even on Unix-like platforms, and it's a good habit to distinguish binary files from text files even there, although it's not mandatory in that case. Such a good habit will make your programs more directly understandable, as well as letting you move them between platforms more easily.
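
The effect of the translation can be observed directly. The sketch below assumes Python 3, where text mode by default translates '\r\n' and '\r' to '\n' on every platform and where binary mode returns bytes objects rather than strings; the temporary file and the variable names are illustrative only:

```python
import os
import tempfile

# Store Windows-style line endings as raw bytes (binary mode, no translation).
handle, sample_path = tempfile.mkstemp()
with os.fdopen(handle, 'wb') as sample:
    sample.write(b'one\r\ntwo\r\n')

with open(sample_path, 'rb') as binary_file:
    raw_data = binary_file.read()    # bytes exactly as stored on disk

with open(sample_path) as text_file:
    text_data = text_file.read()     # line terminators translated to '\n'

os.remove(sample_path)  # clean up the sample file

print(repr(raw_data))    # b'one\r\ntwo\r\n'
print(repr(text_data))   # 'one\ntwo\n'
```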

You can call methods such as read directly on the file object produced by the open function, as shown in the first snippet of the solution. When you do this, as soon as the reading operation finishes, you no longer have a reference to the file object. In practice, the standard C implementation of Python uses reference counting, so it notices the lack of a reference at once and immediately closes the file. However, it is better to bind a name to the result of open, so that you can call close yourself explicitly when you are done with the file. This ensures that the file stays open for as short a time as possible, even on platforms such as Jython and hypothetical future versions of Python on which more advanced garbage-collection mechanisms might delay the automatic closing that Python performs.

If you choose to read the file a little at a time, rather than all at once, the idioms are different. Here's how to read a binary file 100 bytes at a time, until you reach the end of the file:

file_object = open('abinfile', 'rb')
while 1:
    chunk = file_object.read(100)
    if not chunk: break
    do_something_with(chunk)
file_object.close()

Passing an argument N to the read method ensures that read will read only the next N bytes (or fewer, if the file is closer to the end). read returns the empty string when it reaches the end of the file.
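
The chunked-reading loop can also be wrapped in a generator so that callers can use a plain for loop. This is only a sketch, assuming Python 3 (generators need Python 2.2 or later); the name read_in_chunks is hypothetical, and io.BytesIO stands in for a file opened with 'rb':

```python
import io

def read_in_chunks(file_object, chunk_size=100):
    """Yield successive chunks until read returns an empty result."""
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:      # empty result signals the end of the file
            break
        yield chunk

data = io.BytesIO(b'x' * 250)   # stand-in for open('abinfile', 'rb')
chunk_lengths = [len(chunk) for chunk in read_in_chunks(data)]
print(chunk_lengths)   # [100, 100, 50]
```

The last chunk is shorter than the others because only 50 bytes remain when it is read.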

Reading a text file one line at a time is a frequent task. In Python 2.2 and later, this is the easiest, clearest, and fastest approach:

for line in open('thefile.txt'):
    do_something_with(line)
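
Each line produced by this loop still carries its trailing '\n'. The sketch below, assuming Python 3, shows one way to strip it while looping; io.StringIO stands in for open('thefile.txt') so that the example needs no real file, and the variable names are illustrative only:

```python
import io

# io.StringIO stands in for open('thefile.txt').
file_object = io.StringIO('alpha\n\nbeta\n')

stripped_lines = []
for line in file_object:
    # Each line arrives with its terminator; rstrip('\n') removes only that.
    stripped_lines.append(line.rstrip('\n'))
file_object.close()

print(stripped_lines)   # ['alpha', '', 'beta']
```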

Several idioms were common in older versions of Python. The one idiom you can be sure will work even on extremely old versions of Python, such as 1.5.2, is quite similar to the idiom for reading a binary file a chunk at a time:

file_object = open('thefile.txt')
while 1:
    line = file_object.readline()
    if not line: break
    do_something_with(line)
file_object.close()

readline, like read, returns the empty string when it reaches the end of the file. Note that the end of the file is easily distinguished from an empty line because the latter is returned by readline as '\n', which is not an empty string but rather a string with a length of 1.
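
The distinction between an empty line and the end of the file can be checked with a short experiment. This sketch assumes Python 3 and again uses io.StringIO in place of a real file; the sample text and the names results and after_eof are illustrative only:

```python
import io

file_object = io.StringIO('first\n\nlast\n')   # the second line is empty

results = []
while 1:
    line = file_object.readline()
    if not line:           # '' means end of file; '\n' is an empty line
        break
    results.append(line)

after_eof = file_object.readline()   # still '' once the end is reached
file_object.close()

print(results)           # ['first\n', '\n', 'last\n']
print(repr(after_eof))   # ''
```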

4.2.4 See Also

Recipe 4.3; documentation for the open built-in function and file objects in the Library Reference.
