In our previous blog posts we talked about the benefits of containers, and tried to answer the question “What’s the right choice for my company: containers or virtual machines?” by describing several business examples. When adopting a new technology, you will always face some challenges. One of the important questions in this regard is how to gain insight into the behaviour of your applications and container environment. To answer this question, this week we’ll talk about monitoring containers.
The dynamic nature of the cloud
One aspect of cloud computing is that systems, and the jobs that run on them, can easily be started, stopped and migrated. This is in contrast with traditional computing, where one uses a fixed set of systems, often running applications in a fixed layout. At Kumina, we’ve come to realise that in order to use such a dynamic system effectively, you also need a monitoring system that accommodates this dynamism. In our experience, such a system should meet at least the following criteria:
- Reliable service discovery. The monitoring system should integrate with the cluster manager and continuously extract a list of systems and tasks, so that the right targets end up being monitored. This is where a system like Prometheus has a huge advantage over tools like Nagios, as it implements service discovery for many cluster managers and cloud providers: Microsoft Azure, Amazon AWS, Google Compute Engine, Kubernetes, etc.
- High-level methods of processing the data. Setups with higher levels of replication are fault tolerant. If only a small fraction of jobs fails, that isn’t necessarily a reason for concern, and thus no reason to page someone in the middle of the night. A proper monitoring system should therefore let you write alerting rules based on aggregated statistics. Prometheus does this through its query language (PromQL), whereas Nagios can only alert on individual probes.
- Good support for federation. In the case of hybrid-cloud setups, systems tend to be spread out geographically across different sites. To monitor such setups reliably and accurately, it makes sense to run one copy of your monitoring system in each of those sites, with each one of them only monitoring systems nearby. Prometheus allows you to set up federation easily, so that you can still analyse your setup centrally.
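As a sketch of what alerting on aggregated statistics can look like, the hypothetical Prometheus alerting rule below pages only when more than 10% of a job’s replicas are down, rather than on any individual failure (the job name, threshold and labels are all illustrative):

```yaml
groups:
  - name: example-availability
    rules:
      - alert: TooManyReplicasDown
        # Average the per-target "up" metric across the whole job:
        # fire only when more than 10% of replicas are down, instead
        # of paging on every single failed probe.
        expr: avg by (job) (up{job="my-app"}) < 0.9
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "More than 10% of {{ $labels.job }} replicas are down"
```

The `for: 5m` clause adds a further safeguard: the condition has to hold for five minutes before anyone gets paged, which filters out brief restarts.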
These aspects have been the main reasons for us as a company to migrate from our Nagios- and Munin-based monitoring stack to Prometheus. Combined with Grafana, we are able to provide functional dashboards that offer good insight into our production setup.
Whitebox and blackbox monitoring
When you’re using cluster managers like Kubernetes, your applications consisting of separate components are more likely to be spread out across multiple systems. In such a distributed model, performance is also influenced by the performance of the network, in terms of reliability, latency and throughput.
With most traditional monitoring systems, there is a strong emphasis on performing blackbox monitoring: testing the system by only considering its externally observable behaviour. A good example of such a test is an HTTP probe, which checks whether the response for a given URL matches the expected output. For distributed applications, such probes are sufficient for alerting, but they don’t provide enough insight into the root causes of complex problems.
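In a Prometheus setup, blackbox probes of this kind are typically performed through the blackbox_exporter. The configuration fragment below is a sketch of the commonly documented pattern (the target URL and exporter address are placeholders): Prometheus passes each target as a parameter to the exporter’s `/probe` endpoint, which performs the actual HTTP check.

```yaml
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]   # probe for an HTTP 200-class response
    static_configs:
      - targets:
          - https://example.com
    relabel_configs:
      # Pass the original target as the ?target= parameter...
      - source_labels: [__address__]
        target_label: __param_target
      # ...keep it as the instance label for readability...
      - source_labels: [__param_target]
        target_label: instance
      # ...and send the actual scrape to the blackbox exporter.
      - target_label: __address__
        replacement: 'blackbox-exporter:9115'
```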
This is why it’s important to also focus on whitebox monitoring: extending your application so that it reports statistics relevant to its operation as well. The Prometheus project offers client libraries that you can link into your application, allowing you to easily expose such statistics as counters, gauges and histograms. These metrics are then exported over HTTP, so that they can be scraped by Prometheus.
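To give an idea of how little code this takes, here is a minimal instrumentation sketch using the official Python client library (prometheus_client); the metric names and the request handler are purely illustrative:

```python
# Whitebox-instrumentation sketch using prometheus_client.
# A real application would expose these metrics over HTTP with
# prometheus_client.start_http_server(port); here we render the
# exposition format directly to show what Prometheus would scrape.
from prometheus_client import Counter, Gauge, Histogram, generate_latest

REQUESTS = Counter('app_requests_total', 'Total requests handled',
                   ['method'])
IN_PROGRESS = Gauge('app_requests_in_progress', 'Requests in flight')
LATENCY = Histogram('app_request_duration_seconds',
                    'Request duration in seconds')

@IN_PROGRESS.track_inprogress()  # gauge goes up on entry, down on exit
@LATENCY.time()                  # observe the call duration
def handle_request(method: str) -> None:
    REQUESTS.labels(method=method).inc()
    # ... application logic would go here ...

handle_request('GET')
handle_request('POST')

# Render the Prometheus text exposition format.
print(generate_latest().decode())
```

The output contains lines such as `app_requests_total{method="GET"} 1.0`, which is exactly the text format Prometheus scrapes.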
Which metrics should processes expose?
By far the most interesting thing to measure is a task’s communication with its surroundings, both in terms of the requests it receives and the ones it generates. This applies to RPCs sent over the network, but also to the task’s interaction with the system (e.g. disk I/O). For each of these channels, it makes sense to measure at least these five aspects:
- Traffic: the number of requests.
- Bandwidth: the size of these requests in bytes.
- Latency: the duration of these requests in seconds.
- Errors: how many of these requests failed, grouped by cause if applicable.
- Saturation: in the case of incoming requests, how many requests are enqueued, but also whether requests had to be discarded due to the queue being saturated.
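Assuming your task exports metrics along these lines (the metric names below are illustrative, not a standard), the five aspects map naturally onto PromQL queries:

```promql
# Traffic: requests per second, per job
sum by (job) (rate(app_requests_total[5m]))

# Bandwidth: bytes per second
sum by (job) (rate(app_request_size_bytes_sum[5m]))

# Latency: 95th-percentile request duration, from a histogram
histogram_quantile(0.95,
  sum by (job, le) (rate(app_request_duration_seconds_bucket[5m])))

# Errors: fraction of requests that failed
sum by (job) (rate(app_request_errors_total[5m]))
  / sum by (job) (rate(app_requests_total[5m]))

# Saturation: current queue length, and discarded requests per second
app_request_queue_length
rate(app_requests_discarded_total[5m])
```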
Having at least these metrics available makes it a lot easier to identify the root cause of problems in large distributed systems. Be sure to read the chapter “Monitoring Distributed Systems” of Google’s SRE book, as it provides good insight into their experiences monitoring their production setup.
Metrics shouldn’t be an afterthought
We’ve observed that many open-source applications don’t export useful metrics by default. Even when they do, the metrics are typically only added late in the application’s development lifecycle, after users have filed feature requests for them.
As we at Kumina believe that metrics are not only of interest to administrators, but also to the software’s developers, we advise that metrics shouldn’t be treated as an afterthought. They should be added during the early stages of the software development process, just like unit tests. Is there a Production Readiness Review (PRR) process in place at your organisation for determining whether software can be taken into production? If so, consider making the availability of useful metrics a requirement for completion.
While setting up Prometheus at Kumina, we’ve had to develop several utilities for converting metrics provided by existing pieces of software to Prometheus’ format (so-called metrics exporters). Since we at Kumina want to give back to the open-source community, we’ve published most of these on our company’s GitHub page.
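To illustrate what such a metrics exporter does, here is a stdlib-only sketch: it polls an existing application for statistics (the stats source here is a stand-in) and re-exposes them in Prometheus’ text-based format on `/metrics`. All names are hypothetical; a real exporter would query the actual software.

```python
# Minimal metrics-exporter sketch: translate an existing application's
# statistics into the Prometheus text exposition format.
from http.server import BaseHTTPRequestHandler, HTTPServer

def read_app_stats():
    # Stand-in for querying the monitored software, e.g. by parsing
    # its status page or running an administrative command.
    return {'connections_active': 42, 'requests_served': 10234}

def render_metrics(stats):
    """Render a stats dict in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(stats.items()):
        metric = 'exampleapp_%s' % name
        lines.append('# TYPE %s gauge' % metric)
        lines.append('%s %s' % (metric, value))
    return '\n'.join(lines) + '\n'

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != '/metrics':
            self.send_error(404)
            return
        body = render_metrics(read_app_stats()).encode()
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; version=0.0.4')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == '__main__':
    # Port 9100 is just an example; pick any free port and point a
    # Prometheus scrape job at it.
    HTTPServer(('', 9100), MetricsHandler).serve_forever()
```

In practice, most exporters are written against the official client libraries rather than hand-rolling the text format, but the structure is the same: poll the target software, translate its counters into metrics, and serve them over HTTP.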
Kumina creates and manages Docker- and Kubernetes-based container platforms, complemented with a wide range of professional services and unlimited support. Don’t hesitate to contact us when you are considering the move to a container-based platform; we’d love to help you get started by offering you an hour of free consultancy.