
Colmux - Finding Memory Leaks, High I/O Wait Times, and Hotness on 3000 Node Clusters

Todd had originally posted an entry on collectl here at Collectl - Performance Data Collector. Collectl collects real-time data from a large number of subsystems like buddyinfo, cpu, disk, inodes, infiniband, lustre, memory, network, nfs, processes, quadrics, slabs, sockets and tcp, all using one tool and in one consistent format.

Since then a lot has happened. Collectl is now part of both the Fedora and Debian distros, not to mention several others. There has also been a pretty good summary written up by Joe Brockmeier. It's also pretty well documented (I like to think) on SourceForge, and Martin Bach has written a few posts about it on his blog.

Anyhow, a while back I released a new version of collectl-utils and gave a complete face-lift to one of the utilities, colmux, which is a collectl multiplexor. This tool runs collectl on multiple systems, each of which sends its output back to colmux. Colmux then sorts the output on a user-specified column and reports the 'top-n' results.

For example, here are the top users of slab memory from a 41 node sample:

> colmux -addr cn[10-50] -command "-sm" -column 5
# Thu Aug 25 06:15:41 2011  Connected: 41 of 41
#         <-----------Memory----------->
#Host     Free Buff Cach Inac Slab  Map
cn23       60G    0 174M  87M 109M 190M
cn28       60G    0 177M  89M 107M 177M
cn27       60G    0 186M 101M 105M 139M
cn24       60G    0 103M  48M 105M 175M
cn21      123G    0  43M  27M 105M  90M
cn17      123G    0  42M  27M 104M  48M
cn35       60G    0 102M  54M 104M 173M
cn25       60G    0 125M  63M 104M 176M
cn18      123G    0  42M  27M 103M  49M
cn36       60G    0 103M  54M 103M 135M
cn34       60G    0  76M  54M 103M 174M
cn31       60G    0 103M  54M 103M 174M
cn19      123G    0  42M  27M 102M  49M
cn32       60G    0 103M  54M 102M 135M
cn22      123G    0  43M  27M 102M  90M
cn30       60G    0 103M  54M 101M 176M
cn26       60G    0 110M  55M 101M 175M
cn20      123G    0  42M  27M 101M  49M
cn15      123G    0  42M  26M 100M  47M
cn14      123G    0  42M  26M 100M  54M
cn13      123G    0  42M  27M 100M  50M
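
To make the sort-and-report step concrete, here is a minimal Python sketch of ordering rows like the ones above on one column and keeping the top n. It is not how colmux itself is implemented. The helper names `parse_size` and `top_n` are hypothetical, the sketch assumes the host name counts as column 0 (which is consistent with `-column 5` selecting Slab above), and it assumes the K/M/G suffixes are powers of 1024.

```python
import re

# Assumed multipliers for the size suffixes seen in the output above.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a value such as '109M' or '0' into a number of bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGT]?)", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return float(number) * _UNITS[unit]


def top_n(lines, column, n, reverse=True):
    """Return the n rows with the largest (or smallest) value in `column`."""
    rows = []
    for line in lines:
        # Skip blank lines and the '#' header lines.
        if not line.strip() or line.startswith("#"):
            continue
        rows.append(line.split())
    rows.sort(key=lambda fields: parse_size(fields[column]), reverse=reverse)
    return rows[:n]


# A few rows copied from the sample output, deliberately out of order.
sample = """\
#Host     Free Buff Cach Inac Slab  Map
cn21      123G    0  43M  27M 105M  90M
cn23       60G    0 174M  87M 109M 190M
cn13      123G    0  42M  27M 100M  50M
cn28       60G    0 177M  89M 107M 177M
""".splitlines()

for fields in top_n(sample, column=5, n=3):
    print(fields[0], fields[5])
```

Passing `reverse=False` gives the smallest values first.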


Debugging a Memory Leak Across a 64 Node Cluster

In fact I used this very command to track down a very strange problem on a large cluster running the lustre file system.  Several nodes were running slower than others and nobody knew why.  I ran collectl on each, comparing virtually everything I could think of from cpu loads, to interrupts, context switches, disks, networks (both ethernet and infiniband) and other subsystems as well.  

It wasn't until I stumbled on the fact that some machines were using a lot more slab memory that I tried unmounting/remounting lustre on them, and sure enough, the slab sizes dropped and their performance immediately improved. Drilling down into the individual slab allocations with collectl, I then discovered that a single slab called ll_async_page seemed to be the culprit, and digging deeper with Google I found a known problem with lustre memory leaks. With this factoid in mind, I could then use colmux to identify the top slab memory consumers across the cluster, and sure enough, a small number of nodes had significantly higher values than the bulk of them.

Therefore, it was simply a matter of unmounting/remounting lustre on just those nodes and the problem was resolved. While it didn't fix the memory leak itself, which is a slow one, it at least got the cluster operating at full efficiency for a couple of months, after which the process had to be repeated. This was an older version of lustre, so the problem may have been fixed since.
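
Spotting the "small number of nodes with significantly higher values" can also be automated once the per-node numbers are in hand. The sketch below is one possible way to do it, not something colmux provides: it flags any host whose slab usage exceeds a multiple of the cluster median. The function name `find_outliers`, the factor of 2.0, and the sample values are all made up for illustration.

```python
from statistics import median


def find_outliers(slab_mb, factor=2.0):
    """Return hosts whose slab usage exceeds `factor` times the median,
    largest first. The threshold factor is an arbitrary assumption."""
    typical = median(slab_mb.values())
    suspects = [host for host, value in slab_mb.items() if value > factor * typical]
    return sorted(suspects, key=lambda host: slab_mb[host], reverse=True)


# Hypothetical slab sizes in MB; not data from the cluster described above.
slab_mb = {
    "n01": 104,
    "n02": 101,
    "n03": 2950,
    "n04": 103,
    "n05": 1875,
    "n06": 100,
}

print(find_outliers(slab_mb))
```

The median is used rather than the mean so that a handful of leaking nodes does not drag the baseline upward.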

Tracking Down High I/O Wait Times

I've also used colmux on a large disk farm to track down disks with high I/O wait times.  The possibilities are limitless, but naturally the commands/columns you choose to look at are highly dependent on the problem you're trying to solve.

Finding a Hot Needle in a 2000+ Node Haystack

One other pretty cool use: a colleague combined colmux with collectl's ability to monitor temperatures to track down 'hot' systems on a 2000+ node cluster during a linpack run.

Reliving History

You can also change the sort column dynamically by typing in the column number or using the arrow keys (if you have installed the Perl module Term::ReadKey). You can even reverse the sort order!

Furthermore, if you have historical data you've collected over several days, you can instruct colmux to play it back and sort it.  So let's say you had some sort of hang on the cluster yesterday at 2PM.  Just play back ALL the data across all the nodes and look at the top processes or maybe the network or anything else that could cause a hang.

Take a Look

Anyhow, if you think this might be worth a look, install collectl-utils and take it for a spin. If you haven't tried collectl yet, perhaps this would be a good reason to do so.

There is an alternative output format I call single line, in which a small number of columns are all reported on the same line, making it real easy to spot change. If you look at the bottom of the colmux page, there's a cool picture of monitoring close to 200 systems on a single line. Of course, it takes three 30" monitors to see them all.

Reader Comments (6)

"2000K+ node cluster"

Really? That's 2 million+ nodes. That's more servers than all of Google worldwide combined.

Is it even possible to have a cluster with that many nodes??

If that's really the case, I'd definitely want to learn more about it.

August 25, 2011 | Unregistered Commenter Andy

Oops, I suppose I'd like to think colmux could do that, but of course you're right. But even looking at 2000 nodes once a second is still pretty impressive in my opinion. How would you go about tracking down virtually any performance counter on a 2K node cluster in real-time? How about something as obscure as an nfs client doing too many commits? Or how about the one node getting excessive interrupts, by exact interrupt number? You can look at some pretty bizarre stuff you never even thought of in this way. And remember - the nodes you're monitoring aren't even breaking a sweat, as all the work is on the machine running colmux.

August 26, 2011 | Unregistered Commenter Mark Seger

Nice, I didn't know about collectl. Could probably hook it up to OpenTSDB to persist the data points it collects. At StumbleUpon we're now collecting about 10000 data points per second, and persisting them all forever in OpenTSDB. Ultimately my goal is to get almost every metric exposed by the kernel and our apps into OpenTSDB, and collect them all every few seconds.

August 26, 2011 | Unregistered Commenter Benoit Sigoure

The CLI looks inspired by xCAT (which is written in perl too). Are these projects related in any way?

August 27, 2011 | Unregistered Commenter Nikolay

re OpenTSDB - I had never heard of it before but it sounds pretty cool, especially since it does plotting. When you talk about 10K data points/sec, can I assume that's the aggregate across a cluster as opposed to a single node? In the case of collectl I never counted, but I suspect it collects on the order of hundreds of counters every 10 seconds, not counting slab or process data, which it only collects every minute. I'm sure one can collect more with lower overhead, but I'm not sure how much more, as this is basically a problem of reading MANY different data structures in /proc. One can certainly crank up the monitoring frequency to as fine-grained a level as you like, say 100ths of a second, but then you're starting to use real cpu time.

On the other hand if you can gather that much data across a cluster I'm sure there are many uses such as feeding it with collectl data. The current collectl model is to collect data locally for 2 reasons:
- I've always felt, and still do, that the major flaw with remote collection is that if you lose your network during times of network problems, you lose the very data you need to diagnose them. My solution is to do both - log locally as well as send it off to a remote 'catcher'. This is a core capability of collectl.
- centralized DBs have always been a great concept, but I've yet to see one that could handle heavy loads or high numbers of variables, for example RRD. Great tool, but it can't deal with volume, at least not that I know of. Sounds like OpenTSDB could be the answer everyone has been looking for!

As for collectl, it has the ability to send data to a remote collector. We have an HP product called CMU or Cluster Management Utility, that can optionally use collectl to remotely collect data centrally from thousands of nodes every 5 seconds and display the output in real time. I'd think with OpenTSDB you could use the same methodology. I'd be more than happy to have a discussion, perhaps on collectl's mailing list or a forum on SourceForge, primarily so others can participate if they like.

re xCAT - sorry, but I'm not familiar with it. Collectl is solely based on the Tru64 utility, collect. Collectl gets its 'l' from Linux, as in 'collect for linux'. ;)


August 27, 2011 | Unregistered Commenter Mark Seger

Is it the same as collectd, ending in 'd', in the Ubuntu repositories?
I can't find colmux there.

June 1, 2012 | Unregistered Commenter pepe
