From fd0d924c21bd7debe0b99dd0e7f66a5294fd2941 Mon Sep 17 00:00:00 2001
From: Adar Nimrod <nimrod@shore.co.il>
Date: Sat, 8 May 2021 20:35:24 +0300
Subject: [PATCH] Post on amilive.

---
 content/monitoring-shore.rst | 60 ++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)
 create mode 100644 content/monitoring-shore.rst

diff --git a/content/monitoring-shore.rst b/content/monitoring-shore.rst
new file mode 100644
index 0000000..4b9d357
--- /dev/null
+++ b/content/monitoring-shore.rst
@@ -0,0 +1,60 @@
+Monitoring shore.co.il
+======================
+
+:date: 2021-05-08
+:summary: Monitoring shore.co.il
+
+Recently, I had some time to work on a project that had been on my to-do list
+for a long time: monitoring the services in `shore.co.il
+<https://www.shore.co.il/>`_. The project is now done and is available in my
+`GitLab instance <https://git.shore.co.il/shore/amilive>`_.
+
+Requirements
+------------
+
+By monitoring, I mean periodic checks on services and alerts if they fail. I had
+a specific set of requirements in mind for this project. I wanted the
+monitoring to be reliable, meaning that if anything and everything in my
+infrastructure failed, I would still get alerts. This was critical for me since
+I run a lot of my infrastructure at home and a prolonged internet or power
+outage would bring down many services. Cheap and easy would also be nice.
+
+Architecture
+------------
+
+I decided on using Lambda functions along with SMS notifications from SNS on
+AWS. Lambda functions can be reliably triggered using CloudWatch Events on a
+schedule (every x minutes) and failures can be published to an SNS topic that
+has a target that sends SMS messages to my cellphone. So far, very reliable, no
+dependency on anything in my infrastructure. For added reliability, I added
+CloudWatch alarms in case a function wasn't invoked recently or if the
+invocation failed. Those alarms also send me an SMS message. SMS messages
+cost a little (hopefully there will be few of those), I don't have enough
+Lambda function invocations or runtime to go over the free tier and the cost of
+storing the code in S3 is negligible too. For me, it was easier, cheaper and
+more reliable than setting up Nagios, Sensu or similar.
+
+Solution
+--------
+
+I wrote a few Python functions to test the different services I run (DNS, SMTP,
+IMAP, SSH, different web services). To deploy them I wrote a Terraform module
+that does everything, from creating the SNS topic to uploading the Python code
+and hooking up the Lambda functions. Everything runs inside a GitLab CI pipeline
+and uses the `GitLab remote Terraform state
+<https://docs.gitlab.com/ee/user/infrastructure/terraform_state.html>`_ (I
+recently had reason to try it out and I was impressed).
+
+Conclusions
+-----------
+
+I don't think I would set up this specific solution for a company. A company
+would most likely have an on-call schedule. Maybe using a SaaS product would be
+easier and better in some aspects (like running checks from multiple locations).
+But for my small infrastructure and considerations it was a success. The project
+can be adapted to use a service like PagerDuty to have an on-call schedule and
+it can be deployed to multiple regions to run checks from multiple locations.
+Lastly, Nagios and Sensu have a library of ready-made checks in Ruby or Perl so
+you don't have to write them yourself. This project has been live for more than
+a week now and has been reliable. The AWS Cost Explorer predicts that the cost
+for this month will be a few dollars. I call it a success.
-- 
GitLab