From 21a79a34bccc4c48aa404b5d566aaf40a13756af Mon Sep 17 00:00:00 2001
From: Adar Nimrod <nimrod@shore.co.il>
Date: Sat, 18 Sep 2021 20:39:05 +0300
Subject: [PATCH] Post on parsing Apache access logs.

---
 content/parsing-apache-logs.rst | 77 +++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)
 create mode 100644 content/parsing-apache-logs.rst

diff --git a/content/parsing-apache-logs.rst b/content/parsing-apache-logs.rst
new file mode 100644
index 0000000..6b415ab
--- /dev/null
+++ b/content/parsing-apache-logs.rst
@@ -0,0 +1,77 @@
+Parsing Apache logs
+###################
+:date: 2021-09-18
+:summary: Parsing Apache logs in Python
+
+Due to a clerical error at my ISP, my internet connection is down for the
+weekend. So what's an activity that doesn't require a working internet
+connection? Blogging. Earlier this year I was interviewing, and on three
+different occasions I was presented with a technical challenge of parsing
+Apache access logs. I know from experience that the goal of such a challenge
+is to gauge how well I can parse text files with the standard Unix
+command-line tools (mainly :code:`awk`, but also :code:`grep`, :code:`wc`,
+:code:`sed`, and some plain shell scripting).
+
+Now, :code:`awk` is a great tool, and its performance can beat a Hadoop
+cluster (as long as the data fits on a single machine); usually the challenge
+is limited enough that it's doable with :code:`awk` alone. However, a solution
+written in Python can be easier to read and debug, and since Python is a much
+more flexible tool, we can solve any problem (even a real-life one) that
+involves parsing Apache's access logs.
+
+For this I'm going to use `Parse <https://github.com/r1chardj0n3s/parse>`_.
+With Parse we write format specifications (sort of the reverse of f-strings)
+and use them to parse log lines into nicely structured data. For bonus points,
+I'm going to use generators, which should keep memory usage low by processing
+one line at a time.
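+
+As a quick illustration, here's a minimal sketch with a made-up log line and a
+simplified format specification:
+
+.. code:: python
+
+    from parse import parse
+
+    # A format specification works like an f-string run in reverse: instead
+    # of filling in the fields, we extract them.
+    result = parse('{ip} "{request}" {status:d}', '127.0.0.1 "GET / HTTP/1.1" 200')
+    print(result["status"])  # 200
+
+With that in hand, here's a generator that parses the full combined log format: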
+
+.. code:: python
+
+    from parse import compile
+
+    def parse_log(fh):
+        """Read the file handle line by line and yield a Result with the log
+        fields for every line that parses successfully.
+        """
+        # "th" parses HTTP log format dates and "d" converts the status code
+        # to an int; the remaining fields are captured as strings.
+        parser = compile(
+            '''{ip} {logname} {user} [{date:th}] "{request}" {status:d} {bytes} "{referer}" "{user_agent}"'''
+        )
+        # Iterate over the handle directly (rather than readlines()) so the
+        # whole file is never loaded into memory.
+        for line in fh:
+            result = parser.parse(line)
+            if result is not None:
+                yield result
+
+
+This function works with any text file handle, such as a file opened with
+:code:`open` or :code:`sys.stdin`.
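+
+For example, a minimal sketch (the script name is hypothetical) that tallies
+status codes from a log piped in on stdin:
+
+.. code:: python
+
+    import sys
+
+    # Run as: python summarize.py < access.log, with parse_log from above in
+    # scope.
+    statuses = {}
+    for record in parse_log(sys.stdin):
+        statuses[record["status"]] = statuses.get(record["status"], 0) + 1
+    print(statuses)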
+
+Let's grab an example log file from `Elastic's examples repo
+<https://github.com/elastic/examples>`_ and print the 10 most frequent client
+IP addresses. One caveat: :code:`urlopen` returns a binary stream, so we wrap
+it in :code:`io.TextIOWrapper` to decode it into text lines.
+
+.. code:: python
+
+    import io
+    import urllib.request
+
+    ipaddresses = {}
+    with urllib.request.urlopen(
+        "https://github.com/elastic/examples/blob/master/Common%20Data%20Formats/apache_logs/apache_logs?raw=true",
+    ) as response:
+        # urlopen yields bytes, so wrap the response to decode it into text.
+        fh = io.TextIOWrapper(response, encoding="utf-8")
+        for record in parse_log(fh):
+            ip = record["ip"]
+            ipaddresses[ip] = ipaddresses.get(ip, 0) + 1
+    sorted_addresses = sorted(
+        ipaddresses.items(),
+        key=lambda x: x[1],
+        reverse=True,
+    )
+    for ip, count in sorted_addresses[:10]:
+        print(f"{ip}: {count}")
+
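+As a side note, :code:`collections.Counter` from the standard library can
+replace the manual tally; here's a sketch of the same query:
+
+.. code:: python
+
+    import io
+    import urllib.request
+    from collections import Counter
+
+    url = (
+        "https://github.com/elastic/examples/blob/master/"
+        "Common%20Data%20Formats/apache_logs/apache_logs?raw=true"
+    )
+    with urllib.request.urlopen(url) as response:
+        fh = io.TextIOWrapper(response, encoding="utf-8")
+        # Counter consumes the generator; most_common() replaces sorted().
+        counts = Counter(record["ip"] for record in parse_log(fh))
+    for ip, count in counts.most_common(10):
+        print(f"{ip}: {count}")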
+
+Obviously this is a simple example, but the approach extends to any query you
+can express in Python. There's no messing around with delimiters, no worrying
+about quoted strings that contain spaces, and no checking the :code:`awk` man
+page for functions you've never used before. The resulting code is clear, and
+the performance is on par with any shell script you can whip together.
-- 
GitLab