4.6 Retrieving a Line at Random from a File of Unknown Size
Credit: Richard Papworth
4.6.1 Problem
You have a file of unknown (but
potentially very large) size, and you need to select one line at
random from it, with a single pass on the file.
4.6.2 Solution
We do need to
read the whole file, but we don't have to read it
all at once:
import random
def randomLine(file_object):
"Retrieve a random line from a file, reading through the file once"
lineNum = 0
selected_line = ''
while 1:
aLine = file_object.readline( )
if not aLine: break
lineNum = lineNum + 1
# How likely is it that this is the last line of the file?
if random.uniform(0,lineNum)<1:
selected_line = aLine
file_object.close( )
return selected_line
4.6.3 Discussion
Of course, a more obvious approach would be:
random.choice(file_object.readlines( ))
But that requires reading the whole file into memory at once, which
can be a problem for truly enormous files.
This recipe works thanks to an unobvious but not terribly deep
theorem: when we have seen the first N lines in
the file, there is a probability of exactly 1/N
that each of them is the one selected so far (i.e., the one to which
the selected_line variable refers at this point).
This is easy to see for the last line (the one just read into the
aLine variable), because we explicitly choose it
with a probability of 1.0/lineNum. The general
validity of the theorem follows by induction. If it was true after
the first N1 lines were read,
it's clearly still true after the
Nth one is read. As we select the very first
line with probability 1, the limit case of the theorem clearly does
hold when N=1.
Of course, the same technique holds for a uniformprobability
selection from any finite sequence that, for whatever reason, is made
available only one item at a time. But, apart from, for example,
selecting a random word from /usr/dict/words,
there aren't all that many practical applications of
this pretty theorem.
4.6.4 See Also
Documentation for the random module in the
Library Reference.
