Parsing Apache logs
###################
:date: 2021-09-18
:summary: Parsing Apache logs in Python

Due to a clerical error at my ISP, my internet connection is down for the
weekend. So what's an activity that doesn't require a working internet
connection? Blogging. Earlier this year I was interviewing, and on three
different occasions I was presented with the technical challenge of parsing
Apache access logs. I know from experience that the goal of such a challenge is
to gauge how well I can parse text files with the standard Unix command-line
tools (mainly :code:`awk`, but also :code:`grep`, :code:`wc`, :code:`sed` and
some plain shell scripting).

Now, :code:`awk` is a great tool, and its performance beats any Hadoop cluster
(as long as the data fits on a single machine), and usually the challenge is
limited enough that it's doable with :code:`awk` alone. However, a solution
written in Python can be easier to read and debug, and since Python is a much
more flexible tool, it can handle any problem (even a real-life one) that
involves parsing Apache's access logs.

For this I'm going to use `Parse <https://github.com/r1chardj0n3s/parse>`_.
With Parse we can write format specifications (sort of the reverse of
f-strings) and use them to parse log lines into nicely structured data. Also,
for bonus points, I'm going to use a generator (which should also improve
performance a bit, since lines are processed lazily).

.. code:: python

    from parse import compile


    def parse_log(fh):
        """Read the file handle line by line and yield the parsed log
        fields for every line that matches.
        """
        parser = compile(
            '''{ip} {logname} {user} [{date:th}] "{request}" {status:d} {bytes} "{referer}" "{user_agent}"'''
        )
        for line in fh:  # iterate lazily instead of reading the whole file
            result = parser.parse(line)
            if result is not None:
                yield result
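
To see what the parser gives back, here's a quick check against a single
made-up log line in Apache's "combined" format (the sample line and its field
values are illustrative, not from a real log):

.. code:: python

    from parse import compile

    # Same format specification as in parse_log above.
    parser = compile(
        '''{ip} {logname} {user} [{date:th}] "{request}" {status:d} {bytes} "{referer}" "{user_agent}"'''
    )

    # A made-up sample line in the Apache "combined" log format.
    line = (
        '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://example.com/start.html" "Mozilla/4.08"'
    )

    result = parser.parse(line)
    print(result["ip"])       # 127.0.0.1
    print(result["status"])   # 200 (an int, thanks to the :d spec)
    print(result["request"])  # GET /apache_pb.gif HTTP/1.0

Note that the :code:`:d` spec converts the status to an integer for us.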

This function works with any file opened with :code:`open`, or with
:code:`sys.stdin`. Let's grab an example log file from `Elastic's examples repo
<https://github.com/elastic/examples>`_ and print the 10 most frequent client
IP addresses.

.. code:: python

    import io
    import urllib.request

    ipaddresses = {}
    with urllib.request.urlopen(
        "https://github.com/elastic/examples/blob/master/Common%20Data%20Formats/apache_logs/apache_logs?raw=true",
    ) as response:
        # urlopen yields bytes; wrap the response so we read decoded strings.
        fh = io.TextIOWrapper(response, encoding="utf-8")
        for record in parse_log(fh):
            ip = record["ip"]
            if ip in ipaddresses:
                ipaddresses[ip] += 1
            else:
                ipaddresses[ip] = 1

    sorted_addresses = sorted(
        ipaddresses.items(),
        key=lambda x: x[1],
        reverse=True,
    )
    for ip, count in sorted_addresses[:10]:
        print(f"{ip}: {count}")
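
As an aside, the manual counting and sorting can be collapsed with
:code:`collections.Counter` from the standard library. A minimal sketch, using
a few hypothetical pre-parsed records in place of the real output of
:code:`parse_log`:

.. code:: python

    from collections import Counter

    # Hypothetical records standing in for parse_log() output.
    records = [{"ip": "1.2.3.4"}, {"ip": "5.6.7.8"}, {"ip": "1.2.3.4"}]

    # Count occurrences of each IP and print the most frequent ones.
    counts = Counter(record["ip"] for record in records)
    for ip, n in counts.most_common(10):
        print(f"{ip}: {n}")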

Obviously this is a simple example, but the method itself isn't limited to it.
There's no messing around with delimiters, no worrying about quoted strings
that contain spaces, and no checking the :code:`awk` man page for functions
you've never used before. The resulting code is pretty clear, and the
performance is on par with any shell script you can whip together.