I'm parsing big Apache logs like:
example.com:80 1.2.3.4 - - [01/Jul/2021:06:12:12 +0000] "GET /test/example/index.php?a=b&c=d HTTP/1.1" 302 486 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.3945.117 Safari/537.36"
with:
import apache_log_parser, shlex
parser = apache_log_parser.make_parser("%v:%p %h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"")
with open("access.log") as f:
    for l in f:  # iterate over lines directly; enumerate() would yield (index, line) tuples
        x = parser(l)
For each line, it takes ~0.1 ms (i5 laptop) / ~0.9 ms (low-end Atom N2800 CPU, 1.86 GHz).
This is quite slow: nearly one millisecond per line on the low-end machine! So I decided to do my own parsing with shlex (which deals nicely with quoted fields such as first "second block" "third block" fourth).
It's worse! Per line, I get ~0.3 ms (i5 laptop) / ~1.6 ms (low-end server):
with open("access.log") as f:
    for l in f:
        x = shlex.split(l)
Question: Which faster method (maybe a direct regex?) could parse such logs? I only need these fields:
server, port, ip, datetime, url, status, bytes, referer, useragent.
I finally found a solution that gives a 10x speed improvement: pure regex.
It takes ~0.01 ms per line on my i5 laptop.
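The exact regex is not shown above, so here is only a minimal sketch of what a precompiled pattern for this log format could look like. The names LOG_RE, parse_line and SAMPLE are made up for illustration, and the pattern assumes the server name contains no colon and that quoted fields contain no escaped quotes (\"); lines that violate these assumptions simply do not match.

```python
import re

# Hypothetical pattern for the format:
#   %v:%p %h %l %u %t "%r" %>s %O "%{Referer}i" "%{User-Agent}i"
# Compiled once, outside the loop, so the per-line cost is a single match.
LOG_RE = re.compile(
    r'(?P<server>[^\s:]+):(?P<port>\d+) '          # virtual host and port
    r'(?P<ip>\S+) \S+ \S+ '                        # client IP, then %l and %u (ignored)
    r'\[(?P<datetime>[^\]]+)\] '                   # timestamp between brackets
    r'"\S+ (?P<url>\S+)[^"]*" '                    # request line: method, URL, protocol
    r'(?P<status>\d{3}) (?P<bytes>\d+) '           # status code and bytes sent
    r'"(?P<referer>[^"]*)" "(?P<useragent>[^"]*)"' # referer and user agent
)

def parse_line(line):
    """Return a dict of the named fields, or None if the line does not match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

# The sample line from the question.
SAMPLE = (
    'example.com:80 1.2.3.4 - - [01/Jul/2021:06:12:12 +0000] '
    '"GET /test/example/index.php?a=b&c=d HTTP/1.1" 302 486 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/89.0.3945.117 Safari/537.36"'
)

if __name__ == "__main__":
    print(parse_line(SAMPLE))
```

To use it on a file, call parse_line(l) for each l in the open file and skip the None results. All values come back as strings, so convert status and bytes with int() only if you need numbers; leaving the datetime unparsed also avoids a comparatively expensive step on every line.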