
How do you monitor the performance of your cluster?

I posted a note the other day about collectl and its ganglia interface, but perhaps I wasn't provocative enough to get any responses, so let me ask it a different way: how do people monitor their clusters and, more importantly, how often? Do you monitor to get a general sense of what the system is doing, OR do you monitor with the expectation that when something goes wrong you'll have enough data to diagnose the problem? Or both? I suspect both...

Many cluster-based monitoring tools tend to have a data collection daemon running on each target node which periodically sends data to some central management station. That machine typically writes the data to some database from which it can then extract historical plots. Some even put up graphics in real-time.

From my experience working with large clusters, and I'm talking many hundreds or even 1000s of nodes, most sites have to limit both the amount of data they manage centrally and the frequency at which they collect it. Otherwise they'll overwhelm their management station, because most DBs can't write hundreds of counters many times a minute from thousands of nodes.
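To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The node count, counter count and intervals are illustrative assumptions, not measurements from any particular cluster, and `central_write_rate` is a made-up helper name.

```python
def central_write_rate(nodes: int, counters_per_node: int, interval_seconds: float) -> float:
    """Counter updates per second arriving at the central station."""
    return nodes * counters_per_node / interval_seconds


# Assumed cluster: 2000 nodes, 200 counters each (hypothetical figures).
for interval in (10, 60, 600):
    rate = central_write_rate(nodes=2000, counters_per_node=200, interval_seconds=interval)
    print(f"sampling every {interval:>3}s -> {rate:,.0f} counter updates/sec")
```

Under these assumptions, going from a 10-minute interval to a 10-second one multiplies the central write load by 60, which is why central stations tend to trade away either counters or frequency.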

As a related example, how many of you run sar at the default monitoring interval of 10 minutes? Do you really think you're getting useful information? What happens if you have a 2-minute burst of 100% network load and you're idle the other 8 minutes? Sar will happily tell you the network load was 20% and you'll never know your network is tanking.
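The arithmetic behind that 20% figure can be sketched in a few lines of Python. The 10-second sample spacing and the 95% "saturated" cutoff are assumptions chosen for illustration; the load values are synthetic, not real sar output.

```python
SAMPLE_SECONDS = 10  # assumed fine-grained sampling interval

# One 10-minute window: 2 minutes at 100% load, then 8 minutes idle.
burst = [100.0] * (2 * 60 // SAMPLE_SECONDS)  # 12 samples
idle = [0.0] * (8 * 60 // SAMPLE_SECONDS)     # 48 samples
samples = burst + idle

# What a single 10-minute average reports.
ten_minute_average = sum(samples) / len(samples)

# What the fine-grained samples reveal.
peak = max(samples)
saturated_seconds = sum(SAMPLE_SECONDS for s in samples if s >= 95.0)

print(f"10-minute average: {ten_minute_average:.0f}%")
print(f"peak sample: {peak:.0f}%, time at or above 95%: {saturated_seconds}s")
```

The single averaged number looks healthy, while the individual samples show two full minutes of saturation.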

The point of all this is I do think there's a place for central monitoring, though I'm personally not a fan because of the inaccuracy of infrequent data samples, but I also appreciate some data is better than none, as long as you realize the inherent accuracy problems.

And that's where collectl comes in and my previous comment about ganglia. When I wrote collectl my overarching design goal, from which I haven't wavered, was to provide highly accurate local data with minimal overhead so you can take samples in the 1-10 second range without fear of impacting the rest of the system. You can literally sample just about everything going on every 10 seconds and use <0.1% of the CPU. If you're willing to give up a few more tenths of a percent you can even monitor processes and slab activity, though you should only sample them at a 60-second interval because it IS expensive to monitor them.

However, I also realize this doesn't do any good if you do have 1000s of machines you want to watch, and that's where the socket interface comes in. Over it collectl can send data to a central manager at that same frequency OR, if you prefer, you can have it send its remote data at a different rate, giving you the best of both worlds. Collectl can provide its data to a central management station while at the same time providing local logging for accuracy, which will let you do a deep dive into the data if a problem does arise for which there is not enough data stored centrally.
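The general pattern, full-detail logging on the node plus a thinner stream to the central station, can be sketched as follows. This is NOT collectl's code or its socket protocol; `run_sampler`, `fake_counters` and every parameter name are hypothetical stand-ins, written in Python only to illustrate the idea.

```python
import io


def run_sampler(read_counters, local_log, send_remote, ticks, remote_every, remote_keys):
    """Log every sample locally; forward a subset of counters every `remote_every` ticks."""
    for tick in range(ticks):
        counters = read_counters(tick)
        # Full-resolution record stays on the node for later deep dives.
        fields = " ".join(f"{k}={v}" for k, v in sorted(counters.items()))
        local_log.write(f"{tick} {fields}\n")
        # Only a few counters go up the wire, and less often.
        if tick % remote_every == 0:
            send_remote({k: counters[k] for k in remote_keys if k in counters})


def fake_counters(tick):
    # Stand-in for reading real system counters.
    return {"cpu": tick % 100, "net_kb": tick * 2, "disk_kb": tick * 3}


log = io.StringIO()   # stand-in for a local log file
sent = []             # stand-in for a socket to the central manager
run_sampler(fake_counters, log, sent.append,
            ticks=12, remote_every=6, remote_keys=("cpu", "net_kb"))
print(len(log.getvalue().splitlines()), "local records,", len(sent), "remote messages")
```

In this sketch the node keeps all 12 samples with all three counters, while the central side receives only two messages carrying two counters each.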

My point about the ganglia interface was a response to the fact that a lot of people running large (as well as smaller) clusters do use ganglia but, as with most central monitoring stations, have to give up the accuracy of finer-grained data. I was just wondering if anyone looking at this forum uses ganglia and might be interested in trying out the collectl interface to it.


Reader Comments (8)

At Viigo, I have an SNMP daemon on each logical element (process), i.e. multiple SNMP daemons per host. I have a dedicated Cacti host that polls and collects measurements from all the daemons on the server farm. All measurements are stored in round-robin-database files, so there is never a problem with storage or performance. Cacti provides a web interface to render the file contents into nice graphs. That is all. Works like a charm.

As soon as the # of hosts grows big enough, I'll start grouping them into clusters with one Cacti host per cluster (the same way I suppose ganglia does) and have supervisor Cacti host(s) poll files on their subordinate Cacti hosts, thus providing me with aggregated statistics. Should I need to drill down to an individual measurement, I can log in to that individual cluster and scrutinize the stats data there.

So there is no problem with performance, it does scale, there is central data storage for aggregated data, but at the same time all the data is distributed.

November 29, 1990 | Unregistered Commenter Igor Katkov

What is your monitoring interval, how big is your cluster and how many performance counters are you recording in RRD? My experience has been that many people who use RRD have to store a limited number of counters taken at intervals of a minute or more, and I'm explicitly talking about dealing with 100s of samples in the 1-10 second range.

My suggestion is the only way to scale this level of data collection is to store those 100s of samples on each node and only send a subset up the wire. Also, what happens if you have network problems and the data never gets to rrd?


November 29, 1990 | Unregistered Commenter markseger

there is no problem with performance, it does scale, there is central data storage for aggregated data, but at the same time all the data is distributed.

November 29, 1990 | Unregistered Commenter ilahiler

ilahiler - same questions I had for Igor. What is your monitoring interval, how many counters are you collecting and how many nodes are you monitoring?

November 29, 1990 | Unregistered Commenter markseger

50 nodes, ~500 counters in total. Poll interval is 1-5 min.
Well, I guess it does not really match your farm of 100s of nodes. But, frankly, I don't see any reason why I can't stretch my solution to that extent.

> My suggestion is the only way to scale this level of data collection is to store those 100s of
> samples on each node and only send a subset up the wire

I totally agree with this; it is exactly my plan to scale my monitoring solution. As soon as I see that one poller host can't keep up with the load, I'll add another one. For aggregation I'll add yet another poller and start aggregating monitoring data, not from the target hosts, but from these X subordinate poller hosts. I'd actually have a poller that goes and reads these X rrd files and stores aggregated data in yet another rrd file.

That way I have a bird's-eye view and can see the little details if I want.

November 29, 1990 | Unregistered Commenter ikatkov

I guess my plan to stir up some lively debate isn't getting anywhere. I still claim that if you're monitoring once a minute or less often, the data you're getting isn't providing very good information about what might be going wrong. If you have occasional spikes you'll never see them. I guess I've never looked at data less frequently than every 1-10 seconds and so am very used to seeing a much more detailed picture.

November 29, 1990 | Unregistered Commenter markseger

I should have also mentioned that I just released a new version of collectl that contains two new pieces. One is an API for adding custom data, something I've resisted for a long time out of fear of reducing collectl's efficiency. I think if people are careful they can add stats and retain efficiency.

The other piece is ganglia support; note that it's been running for a while on a 2000+ node cluster, taking samples every 10 seconds on ALL the systems. Of course ganglia can't write the data to RRD at that rate, and that's why this cluster doesn't use that component, but rather uses ganglia for moving the data around and has its own central recording/display system.


November 29, 1990 | Unregistered Commenter markseger

Hi, I have a similar problem with monitoring CPU/memory usage. I have googled this topic for a while and this is the only place I can find that discusses the issue.

So my problem is that my customer wants me to monitor the system performance, and when CPU usage is above X% or memory usage is above Y% I should reject further requests. I am not sure exactly why they want to do that; maybe they want the system to be "stable", or maybe they want to monitor to get a general sense of what the system is doing.

For me, I just need to figure 2 things out:

1. How long must the system run above these thresholds before I can say it is indeed overloaded?
2. What is the appropriate sampling rate?

Any further suggestions?

November 29, 1990 | Unregistered Commenter qiulang@beijing
