Parsing Apache logs
###################
:date: 2021-09-18
:summary: Parsing Apache logs in Python

Due to a clerical error at my ISP, my internet connection is down for the
weekend. So what's an activity that doesn't require a working internet
connection? Blogging. Earlier this year I was interviewing, and on three
different occasions I was presented with the technical challenge of parsing
Apache access logs. I know from experience that the goal of such a challenge
is to gauge how well I can parse text files with the standard Unix command-line
tools (mainly :code:`awk`, but also :code:`grep`, :code:`wc`, :code:`sed` and
some plain shell scripting).

Now, :code:`awk` really is a great tool, and it outperforms any Hadoop cluster
as long as the data fits on a single machine; usually the challenge is limited
enough that it's doable with :code:`awk` alone. However, a solution written in
Python can be easier to read and debug, and since Python is a much more
flexible tool, it can handle any problem (even a real-life one) that involves
parsing Apache's access logs.

For this I'm going to use `Parse <https://github.com/r1chardj0n3s/parse>`_.
With Parse we can write format specifications (sort of the reverse of
f-strings) and use them to parse log lines into nicely structured data. For
bonus points, I'm going to use a generator, which keeps memory use low and
should also improve performance a bit.

.. code:: python

    from parse import compile

    def parse_log(fh):
        """Read the file handle line by line and yield the parsed log
        fields of each line that matches the format specification.
        """
        parser = compile(
            '''{ip} {logname} {user} [{date:th}] "{request}" {status:d} {bytes} "{referer}" "{user_agent}"'''
        )
        # Iterating over the file handle directly streams one line at a
        # time instead of loading the whole file with readlines().
        for line in fh:
            result = parser.parse(line)
            if result is not None:
                yield result


This function works with any file opened with :code:`open` or with
:code:`sys.stdin`. Let's grab an example log file from `Elastic's examples repo
<https://github.com/elastic/examples>`_ and print the 10 most frequent client
IP addresses. One catch: :code:`urlopen` returns a binary stream, so we wrap it
in :code:`io.TextIOWrapper` to get the text lines that :code:`parse_log`
expects.

.. code:: python

    import io
    import urllib.request

    ipaddresses = {}
    with urllib.request.urlopen(
        "https://github.com/elastic/examples/blob/master/Common%20Data%20Formats/apache_logs/apache_logs?raw=true",
    ) as fh:
        for record in parse_log(io.TextIOWrapper(fh, encoding="utf-8")):
            ip = record["ip"]
            ipaddresses[ip] = ipaddresses.get(ip, 0) + 1
    sorted_addresses = sorted(
        ipaddresses.items(),
        key=lambda x: x[1],
        reverse=True,
    )
    for ip, count in sorted_addresses[:10]:
        print(f"{ip}: {count}")


Obviously this is a simple example, but the method is not limited in any way.
There's no messing around with delimiters, no worrying about quoted strings
that contain spaces, and no checking the :code:`awk` man page for functions
you've never used before. The resulting code is clear, and the performance is
on par with any shell script you can whip together.