Commit fd0d924c authored by nimrod

Post on amilive.

parent de135d15
Pipeline #1406 passed

Monitoring shore.co.il
======================

:date: 2021-05-08
:summary: Monitoring shore.co.il

Recently, I had some time to work on a project that had been on my to-do list
for a long time: monitoring the services on `shore.co.il
<https://www.shore.co.il/>`_. The project is now done and is available on my
`GitLab instance <https://git.shore.co.il/shore/amilive>`_.

Requirements
------------

By monitoring, I mean periodic checks on services and alerts if they fail. I
had a specific set of requirements in mind for this project. I wanted the
monitoring to be reliable, meaning that even if everything in my infrastructure
failed, I would still get alerts. This was critical for me since I run a lot of
my infrastructure at home and a prolonged internet or power outage would bring
down many services. Cheap and easy would also be nice.

Architecture
------------

I decided to use Lambda functions along with SMS notifications from SNS on AWS.
Lambda functions can be reliably triggered on a schedule (every x minutes)
using CloudWatch Events, and failures can be published to an SNS topic with a
subscription that sends SMS messages to my cellphone. So far this is very
reliable, with no dependency on anything in my infrastructure. For added
reliability, I added CloudWatch alarms for the case where a function wasn't
invoked recently or an invocation failed. Those alarms also send me an SMS
message. SMS messages cost a little (hopefully there will be few of those), my
Lambda invocations and runtime stay within the free tier, and the cost of
storing the code in S3 isn't high either. For me, this was easier, cheaper and
more reliable than setting up Nagios, Sensu or similar.
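
To make the flow above concrete, here is a minimal sketch of what a scheduled
check might look like. It is not the code from the amilive repository: the
names ``run_check``, ``sns_publisher`` and ``handler`` and the ``TOPIC_ARN`` and
``CHECK_URL`` environment variables are assumptions made for illustration. The
idea is to publish to SNS on failure and then re-raise, so the invocation itself
is recorded as failed.

```python
import os
import urllib.request


def run_check(name, check, publish):
    """Run one check; on failure publish an alert and re-raise.

    Re-raising marks the invocation as failed, which is what an alarm on
    failed invocations would react to.
    """
    try:
        check()
    except Exception as exc:
        publish(f"{name} check failed: {exc}")
        raise


def check_http(url):
    """Fetch the URL; urlopen raises on connection errors and HTTP error statuses."""
    with urllib.request.urlopen(url, timeout=10):
        pass


def sns_publisher(topic_arn):
    """Build a publish function backed by SNS."""
    import boto3  # imported lazily so the rest of the sketch runs without AWS

    client = boto3.client("sns")
    return lambda message: client.publish(TopicArn=topic_arn, Message=message)


def handler(event, context):
    """Hypothetical Lambda entry point, invoked on a schedule."""
    url = os.environ["CHECK_URL"]  # assumed variable name
    publish = sns_publisher(os.environ["TOPIC_ARN"])  # assumed variable name
    run_check(url, lambda: check_http(url), publish)
```

Passing ``publish`` in as a plain function keeps the alerting logic testable
locally, for example with ``run_check("web", some_check, print)``.
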

Solution
--------

I wrote a few Python functions to test the different services I run (DNS, SMTP,
IMAP, SSH, different web services). To deploy them I wrote a Terraform module
that does everything: creating the SNS topic, uploading the Python code and
hooking up the Lambda functions. Everything runs inside a GitLab CI pipeline
and uses the `GitLab remote Terraform state
<https://docs.gitlab.com/ee/user/infrastructure/terraform_state.html>`_ (I
recently had reason to try it out and I was impressed).
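
As an illustration of the kind of check described above (not the actual
functions from the repository), several of these protocols greet the client as
soon as it connects, so a simple check can connect and look at the first bytes.
The ``check_banner`` helper and the ``GREETINGS`` table are hypothetical names,
and the prefixes are the common case only: a server may legitimately greet
differently, so treat this as a rough sketch.

```python
import socket

# Typical first bytes sent by the server on a plain-text connection.
# Simplified: real servers may send other valid greetings.
GREETINGS = {
    "ssh": b"SSH-",   # identification string
    "smtp": b"220",   # service ready reply
    "imap": b"* OK",  # untagged OK greeting
}


def check_banner(host, port, expected_prefix, timeout=5.0):
    """Connect over TCP and verify the greeting starts with expected_prefix.

    Raises OSError if connecting or reading fails (including timeouts) and
    RuntimeError if the greeting is not the expected one.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        # A single recv is assumed to contain at least the short prefix.
        banner = conn.recv(256)
    if not banner.startswith(expected_prefix):
        raise RuntimeError(
            f"unexpected greeting from {host}:{port}: {banner!r}"
        )
    return banner
```

A call such as ``check_banner("mail.example.com", 143, GREETINGS["imap"])``
(with a placeholder hostname) either returns the greeting or raises, which fits
a handler that treats any exception as a failed check.
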

Conclusions
-----------

I don't think I would set up this specific solution for a company. A company
would most likely have an on-call schedule. Maybe using a SaaS product would be
easier and better in some aspects (like running checks from multiple locations).
But for my small infrastructure and considerations it was a success. The project
can be adapted to use a service like PagerDuty to have an on-call schedule, and
it can be deployed to multiple regions to run checks from multiple locations.
Lastly, Nagios and Sensu have a library of ready-made checks in Ruby or Perl so
you don't have to write them yourself. This project has been live for more than
a week now and has been reliable. The AWS Cost Explorer predicts that the cost
for this month will be a few dollars. I call it a success.