Colmux - Finding Memory Leaks, High I/O Wait Times, and Hotness on 3000 Node Clusters
Todd had originally posted an entry on collectl here at Collectl - Performance Data Collector. Collectl collects real-time data from a large number of subsystems like buddyinfo, cpu, disk, inodes, infiniband, lustre, memory, network, nfs, processes, quadrics, slabs, sockets and tcp, all using one tool and in one consistent format.
Since then a lot has happened. It's now part of both Fedora and Debian distros, not to mention several others. There has also been a pretty good summary written up by Joe Brockmeier. It's also pretty well documented (I like to think) on sourceforge. There have also been a few blog postings by Martin Bach on his blog.
Anyhow, a while back I released a new version of collectl-utils and gave a complete face-lift to one of the utilities, colmux, which is a collectl multiplexor. This tool runs collectl on multiple systems, each of which sends its output back to colmux. Colmux then sorts the output on a user-specified column and reports the 'top-n' results.
For example, here are the top users of slab memory from a 41 node sample:
> colmux -addr cn[10-50] -command "-sm" -column 5
# Thu Aug 25 06:15:41 2011 Connected: 41 of 41
#      <-----------Memory----------->
#Host  Free Buff Cach Inac Slab  Map
cn23    60G    0 174M  87M 109M 190M
cn28    60G    0 177M  89M 107M 177M
cn27    60G    0 186M 101M 105M 139M
cn24    60G    0 103M  48M 105M 175M
cn21   123G    0  43M  27M 105M  90M
cn17   123G    0  42M  27M 104M  48M
cn35    60G    0 102M  54M 104M 173M
cn25    60G    0 125M  63M 104M 176M
cn18   123G    0  42M  27M 103M  49M
cn36    60G    0 103M  54M 103M 135M
cn34    60G    0  76M  54M 103M 174M
cn31    60G    0 103M  54M 103M 174M
cn19   123G    0  42M  27M 102M  49M
cn32    60G    0 103M  54M 102M 135M
cn22   123G    0  43M  27M 102M  90M
cn30    60G    0 103M  54M 101M 176M
cn26    60G    0 110M  55M 101M 175M
cn20   123G    0  42M  27M 101M  49M
cn15   123G    0  42M  26M 100M  47M
cn14   123G    0  42M  26M 100M  54M
cn13   123G    0  42M  27M 100M  50M
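To make the 'sort on a column, report the top-n' idea concrete, here is a minimal Python sketch of that one step. It is not how colmux is implemented; the function names `parse_size` and `top_n` are made up for illustration, and it assumes the host name counts as column 0, so that Slab is column 5 as in the command above.

```python
# Illustrative sketch only: hypothetical helper names, not colmux internals.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def parse_size(text):
    """Convert a value such as '109M' or '0' to a number of bytes."""
    suffix = text[-1].upper()
    if suffix in UNITS:
        return float(text[:-1]) * UNITS[suffix]
    return float(text)

def top_n(lines, column, n=5, reverse=True):
    """Sort data rows on one column and keep the first n.

    Lines starting with '#' are treated as headers and skipped.
    Column 0 is assumed to be the host name.
    """
    rows = [line.split() for line in lines
            if line.strip() and not line.startswith("#")]
    rows.sort(key=lambda row: parse_size(row[column]), reverse=reverse)
    return rows[:n]

# A few rows taken from the output above, deliberately out of order.
sample = [
    "#Host Free Buff Cach Inac Slab Map",
    "cn13 123G 0 42M 27M 100M 50M",
    "cn23 60G 0 174M 87M 109M 190M",
    "cn28 60G 0 177M 89M 107M 177M",
    "cn21 123G 0 43M 27M 105M 90M",
]

for row in top_n(sample, column=5, n=3):
    print(" ".join(row))
```

Passing `reverse=False` would list the smallest values first instead.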
Debugging a Memory Leak Across a 64 Node Cluster
In fact I used this very command to track down a very strange problem on a large cluster running the lustre file system. Several nodes were running slower than others and nobody knew why. I ran collectl on each, comparing virtually everything I could think of from cpu loads, to interrupts, context switches, disks, networks (both ethernet and infiniband) and other subsystems as well.
It wasn't until I stumbled on the fact that some machines were using a lot more slab memory that I tried unmounting/remounting lustre on them. Sure enough, the slab sizes dropped and their performance immediately improved. Drilling down into the individual slab allocations with collectl, I discovered that a single slab called ll_async_page seemed to be the culprit, and digging deeper with Google I found a known problem with lustre memory leaks. With this factoid in mind, I could then use colmux to identify the top slab memory consumers across the whole cluster, and sure enough, a small number of nodes had significantly higher values than the bulk of the nodes.
Therefore, it was simply a matter of unmounting/remounting lustre on just those nodes and the problem was resolved. While this didn't fix the memory leak itself, which is a slow one, it at least got the cluster operating at full efficiency for a couple of months, after which the process had to be repeated. This was an older version of lustre, so the problem may since have been resolved.
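Picking out the 'small number of nodes with significantly higher values' can be sketched in a few lines of Python. I did this by eye from the sorted colmux output, so the following is only an illustration: the function name `slab_outliers`, the 2x-median threshold and the numbers are all made up.

```python
from statistics import median

def slab_outliers(slab_mb, factor=2.0):
    """Return hosts whose slab usage exceeds factor * cluster median.

    slab_mb maps host name -> slab memory in MB. The result is ordered
    from the largest consumer down. The factor is an arbitrary choice.
    """
    typical = median(slab_mb.values())
    flagged = {host: mb for host, mb in slab_mb.items() if mb > factor * typical}
    return sorted(flagged, key=flagged.get, reverse=True)

# Made-up values for illustration, not measurements from the cluster.
usage = {
    "cn01": 105, "cn02": 102, "cn03": 980,
    "cn04": 101, "cn05": 1450, "cn06": 104,
}

print(slab_outliers(usage))
```

The median is used rather than the mean so that a handful of leaking nodes does not drag the baseline up with them.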
Tracking Down High I/O Wait Times
I've also used colmux on a large disk farm to track down disks with high I/O wait times. The possibilities are limitless, but naturally the commands/columns you choose to look at are highly dependent on the problem you're trying to solve.
Finding a Hot Needle in a 2000+ Node Haystack
One other pretty cool use came from a colleague, who combined colmux with collectl's ability to monitor temperatures to track down 'hot' systems on a 2000+ node cluster during a linpack run.
Reliving History
You can also change the sort column dynamically by typing in the column number or using the arrow keys (if you have installed the perl module Term::ReadKey). You can even reverse the sort order!
Furthermore, if you have historical data you've collected over several days, you can instruct colmux to play it back and sort it. So let's say you had some sort of hang on the cluster yesterday at 2PM. Just play back ALL the data across all the nodes and look at the top processes or maybe the network or anything else that could cause a hang.
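The idea behind that kind of playback can be sketched in Python: keep only the samples near the time of interest, then rank the hosts. This is not colmux's code or its file format. The `(timestamp, host, value)` tuples, the function name `busiest_in_window` and the sample numbers are assumptions made for the example.

```python
from datetime import datetime, timedelta

def busiest_in_window(samples, center, minutes=5, n=3):
    """Rank hosts by their peak value within +/- minutes of center.

    samples is an iterable of (timestamp, host, value) tuples, a made-up
    stand-in for whatever metric was recorded on each node.
    """
    window = timedelta(minutes=minutes)
    peaks = {}
    for when, host, value in samples:
        if abs(when - center) <= window:
            peaks[host] = max(value, peaks.get(host, value))
    return sorted(peaks.items(), key=lambda item: item[1], reverse=True)[:n]

# Hypothetical recorded data around a hang at 2PM.
hang = datetime(2011, 8, 24, 14, 0)
samples = [
    (datetime(2011, 8, 24, 13, 58), "cn10", 35.0),
    (datetime(2011, 8, 24, 14, 1), "cn10", 97.5),
    (datetime(2011, 8, 24, 14, 2), "cn11", 12.0),
    (datetime(2011, 8, 24, 9, 0), "cn12", 99.9),  # outside the window, ignored
]

print(busiest_in_window(samples, hang))
```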
Take a Look
Anyhow, if you think this might be worth a look, install collectl-utils and take it for a spin. If you haven't tried collectl yet, perhaps this would be a good reason to do so.
There is an alternative output format I call single line, in which a small number of columns are all reported on the same line, making it really easy to spot change. If you look at the bottom of the colmux page, there's a cool picture of close to 200 systems being monitored on a single line. Of course, it takes three 30" monitors to see them all.