VHJvdWJsZXNob290aW5nIHJlc3BvbnNlIHRpbWUgcHJvYmxlbXMg4oCTIHdoeSB5b3UgY2Fubm90 IHRydXN0IHlvdXIgc3lzdGVtIG1ldHJpY3M=

Production Monitoring is about ensuring the stability and health of our system, that also  includes the application. A lot of times we encounter production systems  that concentrate on System Monitoring, under the assumption that a  stable system leads to stable and healthy applications. So let’s see  what System Monitoring can tell us about our Application.

Let’s take a very simple two tier Web Application:

A simple two tier web application

This is a simple multi-tier eCommerce solution. Users are concerned  about bad performance when they do a search. Let's see what we can find  out about it if performance is not satisfactory. We start by looking at a  couple of simple metrics.

CPU Utilization

The best known operating system metric is CPU utilization, but it is  also the most misunderstood. This metric tells us how much time the CPU  spent executing code in the last interval and how much more it could  execute theoretically. Like all other utilization measures it tells us  something about the capacity, but not about health, stability or even  performance. Simply put: 99% CPU utilization can either be optimal or  indicate impeding disaster depending on the application.

The CPU Usage of the two tiers

The CPU charts show no shortage on either tier

Let's look at our setup. We see that the CPU  utilization is well below 100%, so we do have capacity left. But does  that mean the machine or the application can be considered healthy?  Let’s look at another measure that is better suited for the job, the  Load Average  (System\Processor QueueLength on Windows ). The Load  Average tells us how many threads or processes are currently executed or  waiting to get CPU time.

Unix Top Output: load average: 1.31, 1.13, 1.10

Linux systems display three sliding load averages for the last one,  five and 15 minutes. The output above shows that in the last minute  there were on average 1.3 processes that needed a CPU core at the same  time.

If the Load Average is higher than the number of cores in the system  we should either see near 100% CPU utilization, or the system has to  wait for other resources and cannot max out the CPU. Examples would be  Swapping or other I/O related tasks. So the Load Average tells us if we  should trust the CPU usage on the one hand and if the machine is  overloaded on the other. It does not tell us how well the application  itself is performing, but whether the shortage of CPU might impact it  negatively. If we do notice a problem we can identify the application  that is causing the issue, but not why it is causing it.

In our case we see that neither the load average nor the CPU usage  shines any light on our performance issue. If it were to show high CPU  utilization or a high load average we could assume that the shortage in  CPU is a problem, but we could not be certain.

Memory Usage

Used memory is monitored because the lack of memory will lead to  system instability. An important fact to note is that Unix and Linux  operating systems will most always show close to 100% memory utilization  over time. They fill the memory up with buffers and caches which get  discarded, as opposed to swapped out, if that memory is needed  otherwise. In order to get the "real" memory usage we need subtract  these. In Linux we can do by using the free command.

Memory Usage on the two systems

Memory Usage on the two systems, neither is suffering memory problems

If we do not have enough memory we can try to identify which  application consumes the most by looking at the resident memory usage of  a process. Once identified we will have to use other means to identify  why the process uses up the memory and whether this is ok. When we look  towards memory regarding Java/.NET performance we have to make sure that  the application itself is never swapped out. This is especially  important because  Java accesses all its memory in a random-access  fashion and if a portion were to be swapped out it would have serve  performance penalties. We can monitor this via swapping measures on the  process itself. So what we can learn here is whether the shortage of  memory has a negative impact on application performance. As this is not  the case, we are tempted to ignore memory as the issue.

We could look at other measures like network or disk, but in all cases the same thing would be true, the shortage of a resource might have impact, but we cannot say for sure. And if we don't find a shortage it does not necessarily mean that everything is fine.

Database

An especially good example of this problem is the database. Very  often the database is considered the source of all performance problems,  at least by the application people. From a DBA's and operations point of view the database is often running fine though.  Their reasoning is simple  enough, the database is not running out of any resources, there are no  especially long running or CPU consuming statements or processes running  and most statements execute quite fast. So the database can not be the  problem.

Let's look at this from an application point of view

Looking at the Application

As users are reporting performance problems the first thing that we  do is to look at the response time and its distribution within our  system.

The overall distribution in our web application

The overall distribution in our system does not show any particular bottleneck

At first glance we don't see anything particularly interesting when looking at the whole system. As users are complaining about specific requests lets go ahead and look at these in particular:

Response time distribution of the search

The  response time distribution of the specific request shows a bottleneck  in the backend and a lot of database calls for each and every search  request

We see that the majority of the response time lies in the backend and  the database layer. That the database contributes a major portion to  the response time does not mean however that the DBA was wrong. We see  that every single search executes 416 statements on average! That means  that every statement is executing in under one millisecond and this is  fast enough from the database point of view. The problem really lies  within the application and its usage of the database. Let's look at the  backend next.

Heap usage and GC activity on the backend

The Heap Usage and GC activity chart shows a lot of GC runs, but does it have negative impact?

Looking at the JVM we immediately see that it does execute a lot of garbage collection (the red spikes), as you would probably see in every monitoring tool.  Although this gives us a strong suspicion, we do not know how this is  affecting our users. So let's look at that impact:

GC Runtime suspensions that have an impact on the search

These  are the runtime suspensions that directly impact the search. It is  considerable but still amounts to only 10% of the response time

A single transaction is hit by garbage collection several times and if we do the math we find out that garbage collection  contributes 10% to the response time. While that is considerable it  would not have made sense to spend a lot of time on tuning it just now.  Even if we get it down to half it would only have saved us 5% of the  response time. So while monitoring garbage collection is important, we  should always analyze the impact before we jump to conclusions.

So let's take a deeper look at where that particular transaction is  spending time on the backend. To do this we need to have application  centric monitoring in place which we can then use to isolate the root  cause.

Response time distribution of the search within the backend

The  detailed response time distribution of the search within the backend  shows two main problems: too many EJB calls and a very slow doPost  method

With the right measure points within our application we immediately  see the root causes of the response time problem. At first we see that  the WebService call done by the search takes up a large portion of the  response time. It is also the largest CPU hotspot within that call. So  while the host is not suffering CPU problems, we are in fact consuming a  lot of it in that particular transaction. Secondly we see that an awful  lot of EJB calls are done which in turn leads to the many database  calls that we have already noticed.

That means we have identified a small memory-related issue; although  there are no memory problems noticeable if we were to look only at  system monitoring. We also found that we have a CPU hotspot, but the  machine itself does not have a CPU problem. And finally we found that  the biggest issue is squarely within the application; too many database  and EJB calls, which we cannot see on a system monitoring level at all.

Conclusion

System metrics do a very good job at describing the environment,  after all that is what they are meant for. If the environment itself has  resource shortages we can almost assume that this has a negative impact  on the applications, but we cannot be sure. If there is no obvious  shortage this does not, however, imply that the application is running  smoothly. A healthy and stable environment does not guarantee a healthy,  stable and performing application.

Similar to the system, the application needs to be monitored in  detail and with application-specific metrics in order to ensure its  health and stability. There is no universal rule as to what these  metrics are, but they should enable us to describe the health, stability  and performance of the application itself.

Related reading:

  1. Application Performance Monitoring in production – Step by Step Guide – Measuring a distributed system // Last time I explained logical and organizational prerequisites to...
  2. Application Performance Monitoring in production – A Step-by-Step Guide – Part 1 // Setting up Application Performance Monitoring is a big task,...
  3. The Top Java Memory Problems – Part 1 // Memory and Garbage Collection problems are still the most...
  4. Application Performance and Architectural Problems You can Find in One Hour // When we get engaged with prospects that are interested...
  5. The impact of Garbage Collection on Java performance // In my last post I explained what a major...