diff --git a/content/collecting-metrics.rst b/content/collecting-metrics.rst
new file mode 100644
index 0000000000000000000000000000000000000000..5ad7c510c358c3888f86ae56bdedc288f84794a2
--- /dev/null
+++ b/content/collecting-metrics.rst
@@ -0,0 +1,66 @@
Collecting metrics
==================

:date: 2021-05-08
:summary: Integrating metric collection in your application

A few startups I worked at had a similar story. When they got started, they
didn't have any metric collection (maybe some system metrics from their cloud
provider, but nothing more). After a few incidents that were hard to debug
without metrics, they decided to start collecting metrics from the application.
Since they were a small team with little experience in setting up the needed
infrastructure, and without the manpower to handle such a task, they decided to
use a SaaS product (NewRelic and Datadog are both good choices here). Then, as
the company grew and the number of users, processes, components and instances
grew, so did the bill from that SaaS. This is usually the time when they decide
that they need a DevOps person on the team (not just because of the metrics
issue, but because as the company matures, the customer base grows, uptime
requirements increase, scaling becomes an issue, and so on). What follows is my
advice on such undertakings.

First of all, use StatsD. Not specifically the StatsD daemon, but the protocol.
It's mature, flexible enough for most tasks, and supported in all languages and
by all metrics collection software. You can even use netcat to push metrics
from shell scripts.

You don't have to set up storage for the metrics you collect; CloudWatch,
Librato or similar services can be used and are cheaper than the more
integrated SaaS offerings. If you already have Elasticsearch for log
collection, you can use it for storing metrics as well (great for a few dozen
instances). InfluxDB and Gnocchi support the StatsD protocol.
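To give a feel for how simple the protocol is, here is a minimal sketch of
pushing metrics over UDP from Python, using only the standard library. The
metric names, host and port are made-up examples (8125 is the conventional
StatsD port, but yours may differ):

```python
import socket


def statsd_payload(name, value, metric_type):
    """Format a metric in the plain-text StatsD protocol.

    metric_type is "c" (counter), "g" (gauge) or "ms" (timer).
    """
    return f"{name}:{value}|{metric_type}"


def send_metric(name, value, metric_type, host="localhost", port=8125):
    # UDP is fire-and-forget: if nothing is listening on the port,
    # the packet is silently dropped and the application is unaffected.
    payload = statsd_payload(name, value, metric_type)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode("ascii"), (host, port))
    finally:
        sock.close()


send_metric("myapp.requests", 1, "c")     # increment a counter
send_metric("myapp.queue_size", 42, "g")  # set a gauge
send_metric("myapp.db_query", 12, "ms")   # record a timing in milliseconds
```

The netcat equivalent from a shell script is a one-liner along the lines of
``echo "myapp.requests:1|c" | nc -u -w0 localhost 8125``.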
As you can see, it's a flexible solution.

I think that most people associate the StatsD protocol with the Graphite
protocol and software, but there is a key difference: StatsD uses UDP. It's
faster and there's less overhead. You can default to sending the metrics to
localhost, and if nothing collects them, that's still fine (great for local
development and CI jobs where you don't want to run metrics collection software
or where it doesn't make sense). I know that some would say that TCP is more
reliable than UDP and that you can lose metrics this way. To that I would say
that the lower overhead of UDP and the lack of a connection are actually
advantages: no :code:`connection closed` errors, and a single packet is more
likely to reach its destination when the system is under heavy load than a new
TCP connection is to be opened. You can read this `GitHub blog post
<https://github.blog/2015-06-15-brubeck/>`_ to see how they dealt with their
metric loss.

I know that Prometheus is the new hotness and very popular when running
Kubernetes, and it's not a bad choice. But there are a few things you need to
keep in mind when considering it. It works best when service discovery is
available for it (if you're already using Kubernetes or running in a cloud
provider, that's not an issue, but that's not everybody). Also, for short-lived
tasks (like cron jobs or processes that pull from a job queue), for processes
that can't open a listening socket, or if you have many instances of the same
process running on the same host, you need to set up a push gateway. The
process pushes metrics to the gateway and Prometheus collects them afterwards
(sounds an awful lot like the StatsD daemon, doesn't it?). Prometheus has a
StatsD exporter, so you can still use StatsD along with Prometheus.

If you want some ad-hoc or more lightweight metrics collection, `statsd-vis
<https://github.com/rapidloop/statsd-vis>`_ is a good solution.
It holds the
data in memory for a configurable time and has a built-in web UI to watch the
graphs. I have a very small `Docker image
<https://hub.docker.com/r/adarnimrod/statsd-vis>`_ for that.

Lastly, for Python applications I recommend the `Markus
<https://markus.readthedocs.io/>`_ library. It's easy to use and reliable, and
you can add more metrics as needed. Two thumbs up.