From fd0d924c21bd7debe0b99dd0e7f66a5294fd2941 Mon Sep 17 00:00:00 2001
From: Adar Nimrod <nimrod@shore.co.il>
Date: Sat, 8 May 2021 20:35:24 +0300
Subject: [PATCH] Post on amilive.

---
 content/monitoring-shore.rst | 60 ++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)
 create mode 100644 content/monitoring-shore.rst

diff --git a/content/monitoring-shore.rst b/content/monitoring-shore.rst
new file mode 100644
index 0000000..4b9d357
--- /dev/null
+++ b/content/monitoring-shore.rst
@@ -0,0 +1,60 @@
+Monitoring shore.co.il
+======================
+
+:date: 2021-05-08
+:summary: Monitoring shore.co.il
+
+Recently, I had some time to work on a project that had been on my to-do list
+for a long time: monitoring the services in `shore.co.il
+<https://www.shore.co.il/>`_. The project is now done and is available in my
+`GitLab instance <https://git.shore.co.il/shore/amilive>`_.
+
+Requirements
+------------
+
+By monitoring, I mean periodic checks on services and alerts when they fail. I
+had a specific set of requirements in mind for this project. I wanted the
+monitoring to be reliable, meaning that even if everything in my infrastructure
+failed, I would still get alerts. This was critical for me since I run a lot of
+my infrastructure at home and a prolonged internet or power outage would bring
+down many services. Cheap and easy would also be nice.
+
+Architecture
+------------
+
+I decided on using Lambda functions along with SMS notifications from SNS on
+AWS. Lambda functions can be reliably triggered using CloudWatch Events on a
+schedule (every x minutes) and failures can be published to an SNS topic that
+has a target that sends SMS messages to my cellphone. So far, very reliable, no
+dependency on anything in my infrastructure. For added reliability, I added
+CloudWatch alarms in case a function failed to be invoked recently or if the
+invocation failed. Said alarms would also send me an SMS message. SMS messages
+cost a little (hopefully there will be few of those), I don't have enough Lambda
+function invocations or runtime to go over the free tier and the price for
+storing the code in S3 isn't high either. For me, it was easier, cheaper and more
+reliable than setting up Nagios, Sensu or similar.
+
+Solution
+--------
+
+I wrote a few Python functions to test the different services I run (DNS, SMTP,
+IMAP, SSH, different web services). To deploy them I wrote a Terraform module
+that does everything, from creating the SNS topic to uploading the Python code
+and hooking up the Lambda functions. Everything runs inside a GitLab CI pipeline
+and uses the `GitLab remote Terraform state
+<https://docs.gitlab.com/ee/user/infrastructure/terraform_state.html>`_ (I
+recently had reason to try it out and I was impressed).
+
+Conclusions
+-----------
+
+I don't think I would set up this specific solution for a company. A company
+would most likely have an on-call schedule. Maybe using a SaaS product would be
+easier and better in some aspects (like running checks from multiple locations).
+But for my small infrastructure and considerations it was a success. The project
+can be adapted to use a service like PagerDuty to have an on-call schedule and
+it can be deployed to multiple AWS regions to run checks from multiple
+locations. Lastly, Nagios and Sensu have a library of ready-made checks in Ruby
+or Perl so you don't have to write them yourself. This project has been live for
+more than a week now and has been reliable. The AWS Cost Explorer predicts that
+the cost for this month will be a few dollars. I call it a success.
-- 
GitLab