Tag: regex

  • Optimising A Python Script

    I have a Python script that, to over-simplify, reads very large log files and runs a whole bunch of regular expressions against each line. Once it started running inconveniently slowly, I had a look at improving its performance.

    The conventional wisdom is that if you are reading a file (or standard input), then the simplest method is almost always the fastest :-

    for line in logstream:
        processline(line)

    But being stubborn, I looked at possible improvements and came up with :-

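    from itertools import islice

    islicecount = 1000  # lines per batch; an arbitrary illustrative value, tune by benchmarking

    while True:
        # Pull the next batch of lines from the stream in one go
        buffer = list(islice(logstream, islicecount))
        if not buffer:
            break  # stream exhausted
        for line in buffer:
            processline(line)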

    This code has been updated twice because the first version added a splat to the output and the second version (which was far more elegant) didn’t work. The final version 

    This I benchmarked as being nearly 5% quicker – not bad, but nowhere near enough for my purposes.
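
    A comparison like this can be reproduced with a small harness along the following lines. This is only a sketch: the sample log line, the placeholder processline() and the batch size of 1000 are all made up for illustration, and the timings will vary with the machine and the real per-line work.

```python
import io
import timeit
from itertools import islice

# Hypothetical stand-in data: the real log files are not shown in the post.
SAMPLE = "2024-01-01 12:00:00 host sshd[123]: Accepted password for user\n" * 50_000


def processline(line):
    # Placeholder for the real per-line work (assumption)
    return len(line)


def plain_loop(stream):
    # The "conventional wisdom" version: iterate the stream directly
    count = 0
    for line in stream:
        processline(line)
        count += 1
    return count


def batched_loop(stream, islicecount=1000):
    # The batched version: read islicecount lines at a time into a list
    count = 0
    while True:
        buffer = list(islice(stream, islicecount))
        if not buffer:
            break
        for line in buffer:
            processline(line)
            count += 1
    return count


def bench(func, repeats=5):
    # Best of several runs, each on a fresh in-memory stream
    return min(timeit.repeat(lambda: func(io.StringIO(SAMPLE)), number=1, repeat=repeats))


if __name__ == "__main__":
    print(f"plain:   {bench(plain_loop):.3f}s")
    print(f"batched: {bench(batched_loop):.3f}s")
```

    Both functions return the number of lines handled, which gives a cheap check that the batched loop is not dropping or repeating lines.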

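    The next step was to improve the regular expressions – I had read somewhere that .* can be expensive, and that [^\s]* (equivalent to \S*) is far quicker and often gives the same result. The two are only interchangeable where the text being matched contains no whitespace, since [^\s]* stops at the first whitespace character. I replaced a number of .* occurrences in the “patterns” file and re-ran the benchmark to find that (in a case with lots of regular expressions) the run time had dropped by nearly 25%.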
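
    As a rough illustration of that swap, the sketch below runs a .* pattern and its [^\s]* equivalent against two made-up log lines. The lines and patterns are hypothetical (the real “patterns” file is not shown), and the timing loop is only there so the difference can be measured locally.

```python
import re
import timeit

# Hypothetical log lines, invented for illustration
LINE = "Jan  1 12:00:00 host sshd[123]: Failed password for root from 10.0.0.1 port 22 ssh2"
LINE2 = "Jan  1 12:00:00 host sshd[123]: Failed password for invalid user bob from 10.0.0.1 port 22 ssh2"

# Same pattern twice: once with .* and once with [^\s]*
greedy = re.compile(r"for (.*) from (.*) port")
narrow = re.compile(r"for ([^\s]*) from ([^\s]*) port")


def groups(pattern, line=LINE):
    # Return the captured groups, or None if the pattern does not match
    m = pattern.search(line)
    return m.groups() if m else None


if __name__ == "__main__":
    # Single-word fields: both patterns capture the same text
    print(groups(greedy), groups(narrow))
    # A field containing spaces: only the .* pattern still matches
    print(groups(greedy, LINE2), groups(narrow, LINE2))
    for name, pat in (("greedy", greedy), ("narrow", narrow)):
        t = timeit.timeit(lambda: pat.search(LINE), number=100_000)
        print(f"{name}: {t:.3f}s")
```

    The second line is the case to watch for when editing a patterns file in bulk: a field that can contain spaces will silently stop matching once .* is replaced.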

    The last step was to install Nuitka to compile the Python script into a binary executable. This showed a further 25% drop – a script that started the day taking 15 minutes to get through one particular run ended the day taking just under 8 minutes.

    The funny thing is that the optimisation that took the longest and had the biggest effect on the code showed the smallest improvement!
