
Handling Faults in Python Generator Chains
July 11, 2008
Here’s an idea I’m going to try Real Soon Now.
David Beazley gave a cool talk at PyCon ’08 called “Generator Tricks for Systems Programmers.” Take a look if you haven’t seen it and are interested in this kind of stuff. His slides use the processing of an Apache log as the running example, so it’s fairly practical and relevant.
Down at the very end of the talk (slide 71 of 71… the slides are here) David mentions a number of pitfalls with the technique, and error handling is one of them. The problem is essentially that when a chain of generators is used for offline processing of a large data stream, it becomes difficult to know when to catch exceptions and what to do with them. In particular, the link in the generator chain that raises an exception may not have any access to the data that the processing chain started with – the log line, in the Apache example. And if you have a large data set then this might come up thousands of times, so you really can’t handle faults by stopping and fixing…
In one case there might be a very nice solution for this: when the exceptions are the result of malformed or otherwise exceptional data, and the right reaction is to filter that data into a separate file for manual or special processing later. The solution is to pass the exception back up the yield chain, using the generator’s send method. The semantics are basically, “hey, that didn’t work so well for us here down the line, could you please put this one aside for special processing…” In the Apache example, the outcome could be a new log excerpt containing only the lines that raised exceptions during normal processing.
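Here is a rough sketch of what that might look like, assuming a two-stage chain. The names read_lines, parse_sizes and rejects are made up for illustration, and the log lines are simplified stand-ins for real Apache entries. The source stage remembers the line it just yielded; if the downstream stage sends an exception back, the source sets that line aside and replies with a dummy value so that no input gets skipped.

```python
def read_lines(lines, rejects):
    """Source stage (hypothetical): yields each line and listens for faults."""
    for line in lines:
        fault = yield line
        # A downstream stage called send(exc): set the offending line aside.
        while fault is not None:
            rejects.append((line, fault))
            # This yield is the return value of send(); the next plain
            # next() call resumes here with fault = None and moves on.
            fault = yield None


def parse_sizes(upstream):
    """Processing stage (hypothetical): extracts the byte count from a line."""
    for line in upstream:
        try:
            yield int(line.split()[-1])
        except (ValueError, IndexError) as exc:
            # "That didn't work so well for us here down the line..."
            upstream.send(exc)


# Simplified, made-up log lines; the last field is the response size.
log = [
    "GET /a 200 512",
    "GET /b 200 -",      # malformed for our purposes: size is not a number
    "GET /c 200 1024",
]

rejects = []
total = sum(parse_sizes(read_lines(log, rejects)))

print(total)                              # sum of the sizes that parsed
print([line for line, _ in rejects])      # lines set aside for later
```

With more than two stages, every intermediate generator would have to forward the sent exception to its own upstream, which this sketch does not attempt. Writing rejects to a separate file instead of a list would give the "log excerpt" described above.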
Some related reading: the generator coroutines PEP (PEP 342), and thoughts on send semantics.
I’ll post a code sample when I get around to trying this out.