Commit 21a79a34 in nimrod/blog, authored 3 years ago by nimrod (parent 0893e56b)
Post on parsing Apache access logs.
content/parsing-apache-logs.rst: 77 additions, 0 deletions (0 → 100644)

Parsing Apache logs
###################
:date: 2021-09-18
:summary: Parsing Apache logs in Python

Due to a clerical error at my ISP, my internet connection is down for the
weekend. So what's an activity that doesn't require a working internet
connection? Blogging. Earlier this year I was interviewing, and on three
separate occasions I was given the technical challenge of parsing Apache access
logs. I know from experience that the goal of the challenge is to gauge how well
I can parse text files with the standard Unix command line tools (mainly
:code:`awk` but also :code:`grep`, :code:`wc`, :code:`sed` and some plain shell
scripting).

Now :code:`awk` is a great tool. As long as the data fits on a single machine,
it can easily outperform a Hadoop cluster, and the challenge is usually limited
enough to be doable with :code:`awk`. However, a solution written in Python may
be easier to read and debug, and since Python is a much more flexible tool, it
can handle pretty much any problem (even a real-life one) that involves parsing
Apache's access logs.

For this I'm going to use `Parse <https://github.com/r1chardj0n3s/parse>`_.
With Parse we can write specifications (sort of the reverse of f-strings) and
use them to parse log lines and get back nicely structured data. Also, for
bonus points, I'm going to use a generator, so records are produced one at a
time instead of being collected in a list first (which should also help
performance a bit).
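
.. code:: python

    from parse import compile


    def parse_log(fh):
        """Read the file handle line by line and yield one parse result per
        matching log line. Fields are available by name, e.g. result["ip"].
        """
        # ":th" parses the HTTP log timestamp, ":d" converts the status code
        # to an integer.
        parser = compile(
            '''{ip} {logname} {user} [{date:th}] "{request}" {status:d} {bytes} "{referer}" "{user_agent}"'''
        )
        for line in fh:  # iterate lazily instead of loading every line at once
            if isinstance(line, bytes):  # e.g. lines read from an HTTP response
                line = line.decode("utf-8", errors="replace")
            # Strip the line ending so the closing quote is the last character.
            result = parser.parse(line.rstrip("\r\n"))
            if result is not None:  # skip lines that don't match
                yield result
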

This function works with any text file opened with :code:`open` or with
:code:`sys.stdin`. Let's grab an example log file from `Elastic's examples repo
<https://github.com/elastic/examples>`_ and print the 10 most frequent client IP
addresses.

.. code:: python

    import io
    import urllib.request

    ipaddresses = {}
    with urllib.request.urlopen(
        "https://github.com/elastic/examples/blob/master/Common%20Data%20Formats/apache_logs/apache_logs?raw=true",
    ) as response:
        # urlopen returns bytes; wrap the response so parse_log gets text lines.
        fh = io.TextIOWrapper(response, encoding="utf-8", errors="replace")
        for record in parse_log(fh):
            ip = record["ip"]
            ipaddresses[ip] = ipaddresses.get(ip, 0) + 1

    # Sort (ip, count) pairs by count, most frequent first.
    sorted_addresses = sorted(
        ipaddresses.items(),
        key=lambda x: x[1],
        reverse=True,
    )
    # Slicing avoids an IndexError when there are fewer than 10 distinct IPs.
    for ip, count in sorted_addresses[:10]:
        print(f"{ip}: {count}")

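
The counting and sorting above can also be handed to :code:`collections.Counter` from the standard library. The following is a sketch rather than a drop-in replacement: :code:`top_clients` is a hypothetical helper name, and plain dictionaries stand in for parsed records so the snippet runs on its own.

```python
from collections import Counter


def top_clients(records, n=10):
    """Return the n most frequent client IPs as (ip, count) pairs."""
    # Counter consumes the records lazily, one at a time.
    counts = Counter(record["ip"] for record in records)
    return counts.most_common(n)


# Plain dicts stand in for parsed log records here.
sample = [{"ip": "10.0.0.1"}, {"ip": "10.0.0.2"}, {"ip": "10.0.0.1"}]
for ip, count in top_clients(sample, n=2):
    print(f"{ip}: {count}")
```

Because the helper only needs something it can iterate over and index by :code:`"ip"`, it should accept the output of a record generator directly, and :code:`most_common` simply returns fewer pairs when there are fewer distinct addresses than requested.
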

Obviously this is a simple example, but the same approach extends to more
complicated questions. There's no messing around with delimiters, no worrying
about long strings inside quotation marks, and no checking the :code:`awk` man
page for functions you've never used before. The resulting code is pretty clear
and the performance should be on par with a shell script you can whip together.