Parsing Apache logs
###################
:date: 2021-09-18
:summary: Parsing Apache logs in Python

Due to a clerical error at my ISP, my internet connection is down for the
weekend. So what's an activity that doesn't require a working internet
connection? Blogging. Earlier this year I was interviewing, and on three
different occasions I was presented with the technical challenge of parsing
Apache access logs. I know from experience that the goal of such a challenge
is to gauge how well I can parse text files with the standard Unix
command-line tools (mainly :code:`awk`, but also :code:`grep`, :code:`wc`,
:code:`sed`, and some plain shell scripting).

Now, :code:`awk` is a great tool whose performance beats any Hadoop cluster
(as long as the data fits on a single machine), and the challenge is usually
limited enough to be doable with :code:`awk` alone. However, a solution
written in Python can be easier to read and debug, and since Python is a much
more flexible tool, it can handle any problem (even a real-life one) that
involves parsing Apache's access logs.

For this I'm going to use `Parse <https://github.com/r1chardj0n3s/parse>`_.
With Parse we write format specifications (sort of the reverse of f-strings)
and use them to parse log lines into nicely structured data. For bonus
points, I'm also going to use a generator, which keeps memory usage low no
matter how big the log file gets.
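
To get a feel for Parse's API, here is a minimal, made-up example. A format
spec looks like an f-string template: named fields capture substrings, and
conversions such as :code:`:d` turn them into typed values.

.. code:: python

    from parse import parse

    # parse() matches the spec against the string and returns a Result
    # object with dict-like access to the captured fields.
    result = parse("{name} scored {points:d} points", "alice scored 42 points")
    print(result["name"])    # alice
    print(result["points"])  # 42

Applying the same idea to Apache's combined log format, we can write a
generator that parses a log file line by line: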

.. code:: python

    from parse import compile

    def parse_log(fh):
        """Read the file handle line by line and yield a parse ``Result``
        (with dict-like access to the log fields) for every line that
        matches the Apache combined log format.
        """
        # Compile the spec once up front instead of re-parsing it per line.
        # ``{date:th}`` understands the Apache timestamp format and returns
        # a datetime, ``{status:d}`` converts the status code to an int, and
        # ``{bytes}`` stays a string because it can be "-" for bodyless
        # responses.
        parser = compile(
            '''{ip} {logname} {user} [{date:th}] "{request}" {status:d} {bytes} "{referer}" "{user_agent}"'''
        )
        for line in fh:  # iterate lazily instead of slurping with readlines()
            result = parser.parse(line)
            if result is not None:  # skip lines that don't match
                yield result


This function will work with any file opened with :code:`open` or with
:code:`sys.stdin`. For instance, a tiny stdin filter (the script and file
names are made up):
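
.. code:: python

    import sys

    # Usage: cat access.log | python top_ips.py
    for record in parse_log(sys.stdin):
        print(record["ip"], record["status"])

Let's grab an example log file from `Elastic's examples repo
<https://github.com/elastic/examples>`_ and print the 10 most frequent client
IP addresses. One wrinkle: :code:`urllib.request.urlopen` returns a binary
stream, so we wrap it in :code:`io.TextIOWrapper` to decode the bytes into
text before handing it over to :code:`parse_log`.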

.. code:: python

    import io
    import urllib.request

    ipaddresses = {}
    with urllib.request.urlopen(
        "https://github.com/elastic/examples/blob/master/Common%20Data%20Formats/apache_logs/apache_logs?raw=true",
    ) as fh:
        # urlopen returns a binary stream; TextIOWrapper decodes it to text.
        for record in parse_log(io.TextIOWrapper(fh, encoding="utf-8")):
            ip = record["ip"]
            ipaddresses[ip] = ipaddresses.get(ip, 0) + 1
    sorted_addresses = sorted(
        ipaddresses.items(),
        key=lambda x: x[1],
        reverse=True,
    )
    for ip, count in sorted_addresses[:10]:
        print(f"{ip}: {count}")


Obviously this is a simple example, but the method is not limited in any way.
There's no messing around with delimiters, no worrying about quoted strings
that contain spaces, and no trawling through the :code:`awk` man page for
functions you have never used before. The resulting code is pretty clear, and
the performance is on par with any shell script you can whip together.
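
As a final illustration, here is a sketch (against a hypothetical local copy
of the log) that counts the most common request lines that resulted in a 404.
This is exactly the kind of query where quoted fields containing spaces make
:code:`awk` fiddly:

.. code:: python

    from collections import Counter

    # "access.log" is a hypothetical local copy of the log file.
    with open("access.log") as fh:
        missing = Counter(
            record["request"]
            for record in parse_log(fh)
            if record["status"] == 404  # already an int thanks to {status:d}
        )

    for request, count in missing.most_common(5):
        print(f"{count:5d} {request}")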