Having worked with hundreds of customers on building a monitoring stack for their containerized environments, we’ve learned a thing or two about what works and what doesn’t. The outcomes might surprise you - including the observation that instrumentation is just as important as the application when it comes to monitoring.
In this post, I wanted to cover some details around what it takes to build a scale-out, highly reliable monitoring system to work across tens of thousands of containers. I’ll share a bit about what our infrastructure looks like, the design choices we made, and tradeoffs. The five areas I’ll cover:
Instrumenting the system
Relating your data to your applications, hosts, and containers.
Deciding what to data to store
How to enable troubleshooting in containerized environments
For context, Sysdig is the container monitoring company. We’re based on the open source Linux troubleshooting project by the same name. The open source project allows you to see every single system call down to process, arguments, payload, and connection on a single host. The commercial offering turns all this data into thousands of metrics for every container and host, aggregates it all, and gives you dashboarding, alerting, and an htop-like exploration environment.
Ok, let’s get into the details, starting with the impact containers have had on monitoring systems.
Why do containers change the rules of the monitoring game?
Containers are pretty powerful. They are :
Simple - typically an individual process
Small - 1/10th of the size of a VM means they’re portable
Isolated - They have few dependencies
Dynamic - Rapid startup times means they can be scaled, killed, and moved quickly
Keeping containers simple is core to their value proposition and what makes them a great building block for microservices. But this simplicity comes at a cost. From an ops perspective, you need deep visibility inside containers rather than just knowing that some containers exist.
Your containers are also likely managed by an orchestration system (think Kubernetes or swarm), and if you’re operating a PaaS, then your developers may be pushing new applications at any time, without informing the platform team.
Alright, so now we know we’re dealing with these small black boxes that appear, die, and are moved at the whims of your orchestration system. Your developers have the freedom to add and modify their applications freely. And your job is to make sure that your company’s apps are running properly, not to mention have the data to solve issues when they might arise. Let’s start breaking down the monitoring challenge.
Instrumentation needs to be transparent
In static or virtual environments it was simple to get an agent running on a host and configure the agent based on the relevant applications. In containerized environments, however, this approach doesn’t work:
You can’t place an agent within each container (without destroying a key value of containers)
With applications coming and going, you can’t manually configure agent plug-ins to collect relevant app-level metrics
So we went about trying to make the act of instrumenting must be as transparent as possible, requiring as little human intervention as possible. Infrastructure metrics, application metrics, service response times, custom metrics and resource/network utilization should be ingested without any effort from within the container. It certainly shouldn’t require efforts with the spin up of each additional container.
There are two possible approaches to achieving this. First are pods, a concept created by Kubernetes. A pod is a group of containers that share a common namespace. As a consequence, a container inside a pod can see what the other containers in the same pod are doing. For monitoring agents, this is often referred to as a “sidecar” container.
The positive here is that this is something relatively easy to do in Kubernetes. The downsides however are worrisome: resource consumption can be high if you have many pods on a machine - it’s a bit like having a monitoring agent per process(!) - and you create dependencies as well as additional attack surface within that pod. That means if your monitoring sidecar has performance, stability, or security issues, it can wreak havoc on your applications.
The second model is per-host, transparent instrumentation. This transparent instrumentation captures all application, container, statsd, and host metrics with a single instrumentation point, leveraging the tracepoints facility in the kernel, and sends it to a container per host for processing & transfer. This eliminates the need to turn everything into a statsd metric, which is something we’ve seen many people resort to. Unlike sidecar models, per-host agents drastically reduce resource consumption of monitoring agents, and require no modification to application code. It does, however, require a privileged container and a kernel module.
Sysdig chose to do the latter, and herein lies the biggest tradeoff we had to make. Sysdig calls this “ContainerVision” - I’ll use that as our shorthand for this technical approach.
Despite greater complexity in building the agent, despite the potential concerns of running in the kernel, we believed this approach would allow us to collect more data with lower overhead even in high density container environments. We also believed the instrumentation approach represents a reduced threat to environments versus complex networking and a high density of monitoring agents in your environment. Finally, to reduce concerns, we open sourced our kernel module as part of the Sysdig linux and container visibility command-line tool.
Wait, How does this kernel instrumentation work?
ContainerVision is a core element of what makes Sysdig’s approach different, so let’s spend a little time diving into how it works. For this section, I’ll focus on how the open source project works. For many of you, you’re most likely to work with that version of Sysdig first, so after reading this you’ll have a deeper understanding of how it works at the kernel level.
Sysdig has an architecture that is very similar to that of libpcap/tcpdump/wireshark.(This is not by chance - Sysdig was created by one of the co-creators of wireshark.) First, events are captured in the kernel by a small driver, called sysdig-probe, which leverages a kernel facility called tracepoints. Tracepoints make it possible to install a “handler” that is called from specific functions in the kernel. Currently, sysdig registers tracepoints for system calls on enter and exit, and for process scheduling events. Sysdig-probe’s handler for these events is super simple – it’s limited to copying the event details into a shared buffer, encoded for later consumption. The reason to keep the handler simple, as you can imagine, is performance, since the original kernel execution is “frozen” until the handler returns.
That’s all the driver does. The rest of the magic happens at user level.
The event buffer is memory-mapped into user space so that it can be accessed without any copy, minimizing CPU usage and cache misses. Two libraries, libscap and libsinsp, then offer support for reading, decoding, and parsing events. Specifically, libscap offers trace file management functionality, while libsinsp includes sophisticated state tracking functionality (e.g. you can use a file name instead of an FD number) and also filtering, event decoding, a Lua JIT compiler to run chisels, and much more. Finally, sysdig tops it off as a simple wrapper around these libraries.
But what if sysdig, libsinsp or libscap are not fast enough to keep up with the stream of events coming from the kernel? Will sysdig slow down my system just like strace (the wise reader asks)? Well of course not. In this scenario, the event buffer fills up, and sysdig-probe starts dropping the incoming events. So you will lose a bit of trace information, but the machine, and the other processes running on it, will not be slowed down.
This is a key advantage of sysdig’s architecture. It means the tracing overhead is very predictable, and it means sysdig is ideal for running in production environments. It also means that our chisels don’t need to be as carefully optimized as DTraces’s D scripts. And on top of that, chisels get to leverage the rich set of libraries written for Lua (want to pipe your chisel’s data to Redis? There’s a library for that!). To go a little deeper on this, feel free to read DTrace vs strace vs sysdig: A technical discussion.
The open source troubleshooting tool gives you both a command-line interface and a curses-based interface to all this data (as seen below) for a single host.
For our monitoring product, we use these same system calls to extract the relevant metrics up and down your stack and create an htop-like interface across your distributed environment.
Our belief, however, is the metrics alone are not enough once you’re dealing with containers. So let’s talk about making sense of it all.
How to relate your data to your applications, hosts, containers, and orchestrators
As your environment increases in complexity, the ability to filter, segment and group metrics based on metadata is essential. Tags allow you to represent the logical blueprint of your application architecture in addition to the physical reality of where containers are running. For example, you may split by dev/prod, by service, by pod, or many other things.
You should be able to dynamically select or order tags on metrics - the era of hierarchically named, dot-separated metric names with pre-defined aggregations is fading away. Dynamic selection gives you the ability to quickly answer questions about the performance of a service at large, or drill down into a namespace or even container.
There are two ways to think about tagging metrics: explicit (attributes you’d like to store) vs implicit (orchestrator) tags. You should have a mechanism and a best practice for adding explicit tags so that anyone on your team can add them as needed. But implicit tags should be captured by default. This latter point is so important, it’s a big part of the next point on orchestrators. As our product evolved, we found that we typically add 12-25 tags to any given metric by default. Power users of our product may have significantly more. Think of each unique combination of tags as a separate metric that you need to store, process, and then recall on demand for your user. This has some pretty major implications, as we discuss down below in “Deciding What to Store.”
Orchestrators radically changed the scheduling management approach for containers, and impacted users’ monitoring strategy along the way. Whether it’s Kubernetes, Swarm, or Mesos, you’ll see a similar change to the monitoring approach. Individual containers become less important, while the performance of a service becomes more important. The service is made up of potentially many containers, and more importantly, the orchestrator can move those containers as needed to meet performance and health requirements. Specifically, there are two implications here for a monitoring system:
Your monitoring system must implicitly tag all metrics according to the metadata of your orchestrator. This applies to system metrics, container metrics, application component metrics and even custom metrics.
Custom metrics bears repeating: your developers should be able to simply output the custom metric, and the monitoring system should keep state regarding where the metric is from. If you rely on “best practice” for tagging, your metrics will be as good as useless when it really comes time to troubleshoot a problem. More on this topic here.
Your monitoring agents should be able to discover the application components running on a host, in a container, without any manual input. Given that an orchestrator can move containers at any point in time, the burden falls to your monitoring system to auto-discover what’s running and then collect the correct metrics. Auto-discovery might require you to write a custom “check” for your custom code once, but it should never have to be repeated or manually told when to run.
Nominally there are two methods to do this: depend on events emitted by the orchestrator to flag containers, or determine applications based on the heuristics of a container. The latter requires more intelligence in your monitoring system, but produces more reliable results.That’s because your system calls don’t lie - you can easily relate them to the application running. Given our instrumentation, you can guess that Sysdig chose the latter approach.
For more on monitoring Kubernetes and orchestrators, check out “Monitoring Kubernetes: A Detailed How-To Guide”
Deciding what to data to store: “All the Data”
Another way of saying this is…. Collect all the data. All the data. No filters. No “intelligence” beforehand to filter out certain metrics. No “discretion” that allows variation per service.
You must accept the the fact that the more complex your system gets, the more distributed your software becomes, the more data you’re going to have to collect. Tags are a perfect example of this: they are a multiplier on the amount of metrics (or time-series) that you must store.
Sure it’s appealing to reduce your metric count by, say, only storing the avg/min/max/95th percentile of your node app Kubernetes deployment. It will significantly reduce metrics collection, storage, and compute.
But the more complex your infrastructure becomes, the more important it becomes to have all the data, and have it stored in a way that allows for ad-hoc analysis and troubleshooting. For example, what if you have an intermittent slow response time in that node app mentioned before? How can you figure out if it’s a systemic problem in the code, a container on the frtiz, or an issue in AWS? Aggregating all that information by deployment will not give you enough resolution to solve the problem.
Of course, the resulting problems of this approach are not surprising: you have a lot of data to collect. Both metrics and events. You have to persist it and make it accessible on-the-fly to users. Here we made a couple big design decisions as well:
We decided that it wasn’t reasonable to deploy smaller, isolated backends on a per-service or per-application environment.
To us, this didn’t seem like a manageable approach for us to run (in the cloud environment) or for a customer to manage (in an on-premise deployment of our software). If you’re wondering why this would be a consideration at all, you’ll see some open source projects like Prometheus in fact default to the isolated backend model, and thereby push concerns about scalability by forcing significantly more management work to the user. Instead we wanted to build a horizontally scalable backend, with the ability for our application to then isolate data, dashboards, alerts, etc. based on a user or service.
In order to offer retention that wasn’t time-bounded, we decided to roll up data over time. We store full resolution data for 6 hours, and then begin aggregating data after that. (We also have a solution for storing full resolution data around alerts…. See the next section for that.)
While our backend continues to evolve, today it consists of horizontally scalable clusters of Cassandra (metrics), ElasticSearch (events), and Redis (intra-service brokering). Building on these components gives high reliability and scale to store years of data for long term trending and analysis. All data is accessible by a REST API. This entire backend can be used via Sysdig’s cloud service or deployed as software in a private cloud for greater security and isolation. This design allows you to avoid running one system for monitoring and another system for long term analysis or data retention.
How to enable troubleshooting in containerized environments
Containers are designed to be small, lightweight, and distributed. This is all fantastic for deployability and repeatability, but will also impact the way you troubleshoot the container.
Remember your old friends ssh, top, ps, ifconfig, and the like? You’re likely not to have them in your containers. If you’re operating in a controlled PaaS environment, you might not even have that access even if those tools were available. And…. did we mention the container may not even exist anymore? If the orchestrator is doing its job, that container is probably long gone before you get to troubleshooting it.
In short - it’s going to be complicated to get the information you need. And, on top of that, having the appropriate context from the orchestrator for troubleshooting will be essential. Thus it’s essential to enable your developers to be able to get this in-depth information, ideally without polluting your production environment. We had to address this issue because we decided that simplifying troubleshooting was just as important as enabling monitoring for container workloads.
This is where a container troubleshooting tool like open source Sysdig comes into play. Its ability to capture every single system call on a host gives deep visibility into how an application, container, host, and the network are performing.
The ability to interface with the orchestration master means that it can collect relevant metadata so that you can capture context and state of the distributed system, not just state of the individual machine. And finally, Sysdig’s ability to capture everything into a file means that you can capture data from prod, but troubleshoot on your laptop. And you can do it long after the containers are gone, letting you do a proper post-mortem when your hair is no longer on fire.
For example, imagine you get an alert because you see a large number of network connections on a particular container. That alert in our monitoring service can trigger a capture, recording all system calls for a period of time on that host. Exploring that in cSysdig, you can get the correct container context and then drill down into its network connections:
To get a sense of what you can do with open source sysdig, check out Understanding how Kubernetes DNS Services work.
Building a highly scalable, distributed monitoring system is not an easy task. Whether you choose to do it yourself or leverage someone else’s system, I believe you’re going to have to make many of the same choices we made along the way.
How are you going to instrument the system?
How will you interface with orchestrators?
How will you simplify the collection of application data and custom metrics?
What will you decide to retain?
And will you enable troubleshooting in dynamic, ephemeral environments?
Hopefully this post has been useful to you. It talks a lot about the what of building a container monitoring system. Of course, talk is cheap, so if you’d like just take us for a spin with a Docker Container Monitoring Trial so you can see for yourself how these design decisions play out.