Parsing Apache logs
###################

:date: 2021-09-18
:summary: Parsing Apache logs in Python

Due to a clerical error at my ISP, my internet connection is down for the weekend. So what's an activity that doesn't require a working internet connection? Blogging.

Earlier this year I was interviewing, and on 3 different occasions I was presented with the technical challenge of parsing Apache access logs. I know from experience that the goal of the challenge is to gauge how well I can parse text files with the standard Unix command line tools (mainly :code:`awk`, but also :code:`grep`, :code:`wc`, :code:`sed` and some plain shell scripting).

Now, :code:`awk` is a great tool, and its performance is better than any Hadoop cluster (as long as the data fits on a single machine), and usually the challenge is limited enough that it's doable with :code:`awk`. However, a solution written in Python may be easier to read and debug, and since Python is a much more flexible tool, we can solve any problem (even a real-life one) that involves parsing Apache's access logs.

For this I'm going to use `Parse <https://github.com/r1chardj0n3s/parse>`_. With Parse we can write specifications (sort of the reverse of f-strings) and use them to parse log lines and get back nicely structured data. Also, for bonus points, I'm going to use generators (which should also improve performance a bit).

.. code:: python

    from parse import compile


    def parse_log(fh):
        """Read the file handle line by line and yield the parsed
        log fields for every line that matches."""
        parser = compile(
            '''{ip} {logname} {user} [{date:th}] "{request}" {status:d} '''
            '''{bytes} "{referer}" "{user_agent}"'''
        )
        for line in fh:  # iterate lazily instead of slurping with readlines()
            if isinstance(line, bytes):
                # Accept binary streams (e.g. urllib responses) as well.
                line = line.decode("utf-8")
            result = parser.parse(line)
            if result is not None:
                yield result

This function will work with any file opened with :code:`open`, with :code:`sys.stdin`, and (thanks to the :code:`bytes` check) with binary streams such as :code:`urllib` responses.
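To see what those "reverse f-string" specifications do, here is a minimal sketch that parses a single, made-up log line with the same pattern (the log line itself is invented for illustration):

.. code:: python

    from parse import parse

    # A sample Common Log Format line, made up for this example.
    line = (
        '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /index.html HTTP/1.0" 200 2326 "-" "Mozilla/5.0"'
    )

    result = parse(
        '{ip} {logname} {user} [{date:th}] "{request}" {status:d} '
        '{bytes} "{referer}" "{user_agent}"',
        line,
    )

    print(result["ip"])       # 127.0.0.1
    print(result["status"])   # 200, converted to int by the :d spec
    print(result["request"])  # GET /index.html HTTP/1.0

Note how :code:`:d` converts the status code to an integer and :code:`:th` parses the HTTP-log-style timestamp into a :code:`datetime` object, so no manual post-processing is needed.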
Let's grab an example log file from `Elastic's examples repo <https://github.com/elastic/examples>`_ and print the 10 most frequent client IP addresses.

.. code:: python

    import urllib.request

    ipaddresses = {}

    with urllib.request.urlopen(
        "https://github.com/elastic/examples/blob/master/Common%20Data%20Formats/apache_logs/apache_logs?raw=true",
    ) as fh:
        for record in parse_log(fh):
            ip = record["ip"]
            ipaddresses[ip] = ipaddresses.get(ip, 0) + 1

    sorted_addresses = sorted(
        ipaddresses.items(),
        key=lambda x: x[1],
        reverse=True,
    )

    for address, count in sorted_addresses[:10]:
        print(f"{address}: {count}")

Obviously this is a simple example, but the method is not limited in any way. There's no messing around with delimiters, no worrying about long strings inside quotation marks, and no checking the :code:`awk` man page for functions you've never used before. The resulting code is pretty clear, and the performance is on par with any shell script you can whip together.
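As an aside, the counting-and-sorting step above can also be done with the standard library's :code:`collections.Counter`. A small sketch, using a hypothetical list of records standing in for the output of the log parser:

.. code:: python

    from collections import Counter

    # Hypothetical records, standing in for parsed log entries.
    records = [
        {"ip": "10.0.0.1"}, {"ip": "10.0.0.2"},
        {"ip": "10.0.0.1"}, {"ip": "10.0.0.3"},
        {"ip": "10.0.0.1"}, {"ip": "10.0.0.2"},
    ]

    counts = Counter(record["ip"] for record in records)

    # most_common(n) replaces the manual sort-and-slice.
    for address, count in counts.most_common(2):
        print(f"{address}: {count}")

:code:`Counter` accepts any iterable, so it composes nicely with the generator from :code:`parse_log` without loading the whole log into memory.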