This article shows a Python generator that lets you read lines from a file in chunks, or blocks, making it easy to parallelize a program that analyzes a very large file, without having to split that file into separate smaller files first.
Suppose you have a large text file where each line is a row of data, and you want to process all of it to compute some statistics. In most cases like this you have an “embarrassingly parallel” problem, and you can easily make your program faster by following these steps:
- Split your data into chunks, or blocks.
- Process these blocks separately, producing an intermediate result for each block.
- Combine the per-block results to produce the result for the whole data set.
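As a toy illustration of these three steps on in-memory data (the numbers and the four-way split are arbitrary assumptions, and a real program would run step 2 in parallel):

```python
# Toy illustration of split / process / combine on in-memory data.
data = list(range(100))                      # the full data set
blocks = [data[i::4] for i in range(4)]      # 1. split into 4 blocks
partials = [sum(block) for block in blocks]  # 2. process each block separately
total = sum(partials)                        # 3. combine the partial results
assert total == sum(data)                    # same answer as the serial version
```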
The problem is that to use tools built for this kind of job you first of all must have the whole environment set up nicely, your data must be stored in the right place, and, most importantly, your data must already have been split into chunks. That can be quite a nuisance if all you have is one single large file that you want to analyse, and you are not dealing with tons of computers on a network. All you want instead is to distribute your load over the few processors in your personal computer, for example.
So this is the problem scenario: I have a single large file to process, and all I want to do is write one single Python script to do the job. I have already coded the analysis, and I can process that big file in a single process. Now I want to use the multiprocessing module to parallelize the processing.
So we originally have some code like this:

```python
fp = open(filename)
for line in fp.readlines():
    process(line)
```
What I created is a generator that takes a file object and iterates over the lines of the “n-th” chunk of the file. The first step is to find the file size, which is done by seeking to the end of the file. We then divide the file size by the number of chunks and multiply by the index of the chunk we want to process; that gives us the start of the chunk. We then “fast-forward” to the beginning of the next line, and start yielding the chunk's lines until we reach the beginning of the next chunk, or the end of the file. Here is the generator, and how to use it:
```python
def file_block(fp, number_of_blocks, block):
    '''
    A generator that splits a file into blocks and iterates
    over the lines of one of the blocks.
    '''
    assert 0 < number_of_blocks
    assert 0 <= block < number_of_blocks
    fp.seek(0, 2)  # seek to the end to find the file size
    file_size = fp.tell()
    ini = file_size * block // number_of_blocks
    end = file_size * (1 + block) // number_of_blocks
    if ini <= 0:
        fp.seek(0)
    else:
        # Start just before the chunk and skip the partial line,
        # so we always begin at the beginning of a line.
        fp.seek(ini - 1)
        fp.readline()
    while fp.tell() < end:
        yield fp.readline()

if __name__ == '__main__':
    # Binary mode, so tell()/seek() work with plain byte offsets.
    fp = open(filename, 'rb')
    number_of_chunks = 4
    for chunk_number in range(number_of_chunks):
        print(chunk_number, 100 * '=')
        for line in file_block(fp, number_of_chunks, chunk_number):
            process(line)
```
Of course, all that code does right now is process the lines of each chunk sequentially; I am actually just printing the lines to debug. The next step is to actually use the multiprocessing module, and replace the outer loop with distributed processing.
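That next step could look something like the sketch below. It repeats the file_block generator so it is self-contained, and the process_chunk and total_lines helpers, along with the line-counting job, are made-up illustrations, not part of the original script:

```python
from multiprocessing import Pool

def file_block(fp, number_of_blocks, block):
    # The generator from above, repeated so this sketch is self-contained.
    assert 0 < number_of_blocks
    assert 0 <= block < number_of_blocks
    fp.seek(0, 2)
    file_size = fp.tell()
    ini = file_size * block // number_of_blocks
    end = file_size * (1 + block) // number_of_blocks
    if ini <= 0:
        fp.seek(0)
    else:
        fp.seek(ini - 1)
        fp.readline()
    while fp.tell() < end:
        yield fp.readline()

def process_chunk(args):
    # Hypothetical worker: each process opens its own file object and
    # computes a partial result for one chunk (here, a line count).
    filename, number_of_chunks, chunk_number = args
    with open(filename, 'rb') as fp:
        return sum(1 for _ in file_block(fp, number_of_chunks, chunk_number))

def total_lines(filename, number_of_chunks):
    # Map each chunk to a worker process and combine the partial results.
    jobs = [(filename, number_of_chunks, k) for k in range(number_of_chunks)]
    with Pool(number_of_chunks) as pool:
        return sum(pool.map(process_chunk, jobs))
```

Note that each worker opens the file on its own, so no file object needs to be shared between processes; only the file name and the chunk index are passed around.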
I think this is a neat example of “hiding complexity”. We had a line iterator at first; now we have a line iterator for the lines of a chunk, which can very easily be used for multiprocessing. That generator function is small, but it hides away quite a lot of complexity: finding the file size, making sure we always start at the beginning of a line, and so on. Writing this was one of the coolest “Python experiences” I have had recently.