How do you monitor the performance of your cluster?

I had posted a note the other day about collectl and its ganglia interface but perhaps I wasn't provocative enough to get any responses so let me ask it a different way, specifically how do people monitor their clusters and more importantly how often?  Do you monitor to get a general sense of what the system is doing OR do you monitor with the expectation that when something goes wrong you'll have enough data to diagnose the problem?  Or both?  I suspect both...

Many cluster-based monitoring tools tend to have a data collection daemon running on each target node which periodically sends data to some central management station.  That machine typically writes the data to some database from which it can then extract historical plots.  Some even put up graphics in real-time.

From my experience working with large clusters - and I'm talking either many hundreds or even 1000s of nodes, most have to limit both the amount of data they manage centrally as well as the frequency that they collect it, otherwise they'll overwhelm their management station because most DBs can't write hundreds of counters many times/minute from thousands of nodes.

As a related example, how many of you run sar at the default monitoring interval of 10 minutes?  Do you really think you're getting useful information?  What happens if you have a 2 minutes burst of 100% network load and you're idle the other 8 minutes?  Sar will happily tell you the network load was 20% and you'll never know your network is tanking.

The point of all this is I do think there's a place for central monitoring, though I'm personally not a fan because of the inaccuracy of infrequent data samples, but I also appreciate some data is better than none, as long as you realize the inherent accuracy problems.

And that's where collectl comes in and my previous comment about ganglia.  When I wrote collectl my overarching design goal, from which I haven't wavered, was to provide highly accurate local data with minimal overhead so you will take samples in the 1-10 second range without fear of impacting the rest of the system.  You can literally sample just about everything going on every 10 seconds and use <0.1% of the CPU.  If you're willing to give up a few more tenths of a percent you can even monitor processes and slab activity, though you should only sample them at a 60 second frequency because it IS expensive to monitor them.

However I also realize this doesn't do any good if do have 1Ks of machine you want to watch and so that's where the socket interface comes in over which collectl can send data to a central manager at that same frequency OR if you prefer have is send its remote data at a different rate, giving you the best of both worlds.  Collectl can provide it's data to a central management station while at the same time providing local logging for accuracy, which will let you do a deep dive into the data if a problem does arrive for which there is not enough data stored centrally.

My point about the ganglia interface was my response to the fact that a lot of of people running large (as well as smaller) clusters do use ganglia but like most central monitoring stations have to give up the accuracy of finer-grained data and I was just wondering if anyone looking at this forum use ganglia and if they might be interested in trying out the collectl interface to it.

-mark