I have a Python script that, to oversimplify, reads very large log files and runs a whole bunch of regular expressions on each line. As it had started running inconveniently slowly, I had a look at improving its performance.
The conventional wisdom is that if you are reading a file (or standard input), the simplest method is almost always the fastest :-
for line in logstream:
    processline(line)
But being stubborn, I looked at possible improvements and came up with :-
from itertools import islice

while True:
    # Pull the next batch of up to islicecount lines from the stream
    buffer = list(islice(logstream, islicecount))
    if buffer != []:
        for line in buffer:
            processline(line)
    else:
        # islice returned nothing, so the end of the stream has been reached
        break
This code has been updated twice because the first version added a splat to the output and the second version (which was far more elegant) didn’t work. The final version is the one shown above.
This I benchmarked as being nearly 5% quicker – not bad, but nowhere near enough for my purposes.
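For anyone wanting to reproduce that sort of comparison, here is a rough sketch of how the two read loops could be timed against each other; the file name, the batch size and the do-nothing processline are placeholders rather than anything from the real script :-

import time
from itertools import islice

logfile = "example.log"   # placeholder path
islicecount = 1000        # placeholder batch size - worth experimenting with

def processline(line):
    pass                  # stand-in for the real regex processing

def simple_read():
    with open(logfile) as logstream:
        for line in logstream:
            processline(line)

def batched_read():
    with open(logfile) as logstream:
        while True:
            buffer = list(islice(logstream, islicecount))
            if not buffer:
                break
            for line in buffer:
                processline(line)

for func in (simple_read, batched_read):
    start = time.perf_counter()
    func()
    print(f"{func.__name__}: {time.perf_counter() - start:.2f} seconds")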
The next step was to improve the regular expressions – I read somewhere that .* can be expensive and that [^\s]* is often far quicker while giving the same result. I replaced a number of .* occurrences in the “patterns” file and re-ran the benchmark to find that (in a case with lots of regular expressions) the time had dropped by nearly 25%.
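To illustrate the sort of substitution involved, here is a small made-up example; the pattern and log line are invented, and whether [^\s]* really gives the same matches as .* depends entirely on the pattern and the data, so each change needs checking against real log lines :-

import re

# A made-up log line for illustration only
line = '192.0.2.1 - - [10/Oct/2024:13:55:36] "GET /index.html HTTP/1.1" 200'

# Greedy .* can match far past the intended point and then backtrack
slow = re.compile(r'"GET (.*) HTTP')

# [^\s]* stops at the first whitespace, which here is what was meant anyway
fast = re.compile(r'"GET ([^\s]*) HTTP')

print(slow.search(line).group(1))   # /index.html
print(fast.search(line).group(1))   # /index.html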
The last step was to install Nuitka and compile the Python script into a binary executable. This showed a further 25% drop – a script that started the day taking 15 minutes for one particular run ended the day taking just under 8 minutes.
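For reference, the compilation itself is a single command. This is just the general shape of a Nuitka invocation rather than my exact one, and myscript.py is a placeholder name :-

python -m nuitka --follow-imports myscript.py

The result is a compiled executable that can be run in place of the original script.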
The funny thing is that the optimisation that took the longest and had the biggest effect on the code showed the smallest improvement!