I have a Python script that, to oversimplify, reads very large log files and runs a whole bunch of regular expressions on each line. As it had started running inconveniently slowly, I had a look at improving its performance.
The conventional wisdom is that if you are reading a file (or standard input), the simplest method is almost always the fastest :-
for line in logstream:
    processline(line)
But being stubborn, I looked at possible improvements and came up with :-
from itertools import islice

while True:
    # Pull the next batch of up to islicecount lines from the stream
    buffer = list(islice(logstream, islicecount))
    if buffer != []:
        for line in buffer:
            processline(line)
    else:
        # islice returned nothing, so the end of the stream has been reached
        break
This code has been updated twice because the first version added a splat to the output and the second version (which was far more elegant) didn’t work. The final version is the one shown above.
This I benchmarked as being nearly 5% quicker – not bad, but nowhere near enough for my purposes.
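For anyone wanting to reproduce that sort of comparison, here is a rough sketch of how the two read loops could be timed against each other; the file name, the batch size and the do-nothing processline are placeholders rather than anything from the real script :-

import time
from itertools import islice

logfile = "example.log"   # placeholder path
islicecount = 1000        # placeholder batch size - worth experimenting with

def processline(line):
    pass                  # stand-in for the real regex processing

def simple_read():
    with open(logfile) as logstream:
        for line in logstream:
            processline(line)

def batched_read():
    with open(logfile) as logstream:
        while True:
            buffer = list(islice(logstream, islicecount))
            if not buffer:
                break
            for line in buffer:
                processline(line)

for func in (simple_read, batched_read):
    start = time.perf_counter()
    func()
    print(f"{func.__name__}: {time.perf_counter() - start:.2f} seconds")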
The next step was to improve the regular expressions – I read somewhere that .* can be expensive and that [^\s]* is often far quicker while giving the same result. I replaced a number of .* occurrences in the “patterns” file and re-ran the benchmark to find that (in a case with lots of regular expressions) the time had dropped by nearly 25%.
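To illustrate the sort of substitution involved, here is a small made-up example; the pattern and log line are invented, and whether [^\s]* really gives the same matches as .* depends entirely on the pattern and the data, so each change needs checking against real log lines :-

import re

# A made-up log line for illustration only
line = '192.0.2.1 - - [10/Oct/2024:13:55:36] "GET /index.html HTTP/1.1" 200'

# Greedy .* can match far past the intended point and then backtrack
slow = re.compile(r'"GET (.*) HTTP')

# [^\s]* stops at the first whitespace, which here is what was meant anyway
fast = re.compile(r'"GET ([^\s]*) HTTP')

print(slow.search(line).group(1))   # /index.html
print(fast.search(line).group(1))   # /index.html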
The last step was to install Nuitka and compile the Python script into a binary executable. This showed a further 25% drop – a script that started the day taking 15 minutes for one particular run ended the day taking just under 8 minutes.
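For reference, the compilation itself is a single command. This is just the general shape of a Nuitka invocation rather than my exact one, and myscript.py is a placeholder name :-

python -m nuitka --follow-imports myscript.py

The result is a compiled executable that can be run in place of the original script.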
The funny thing is that the optimisation that took the longest and had the biggest effect on the code showed the smallest improvement!