Product: Collectl - Performance Data Collector

From their website:
There are a number of times in which you find yourself needing performance data. These can include benchmarking, monitoring a system's general health or trying to determine what your system was doing at some time in the past. Sometimes you just want to know what the system is doing right now. Depending on what you're doing, you often end up using different tools, each designed for that specific situation. Features include:

  • You are able to run with non-integral sampling intervals.
  • Collectl uses very little CPU. In fact it has been measured to use <0.1% when run as a daemon using the default sampling interval of 60 seconds for process and slab data and 10 seconds for everything else.
  • Brief, verbose, and plot formats are supported.
  • You can report aggregated performance numbers on many devices such as CPUs, Disks, interconnects such as Infiniband or Quadrics, Networks or even Lustre file systems.
  • Collectl will align its sampling on integral second boundaries.
  • Supports process and slab monitoring.
  • New to the 2.4.0 release is the monitoring of process i/o statistics.

    Unlike most monitoring tools that either focus on a small set of statistics, format their output in only one way, run either interactively or as a daemon but not both, collectl tries to do it all. You can choose to monitor any of a broad set of subsystems which currently include cpu, disk, inodes, infiniband, lustre, memory, network, nfs, processes, quadrics, slabs, sockets and tcp. The following is an example of simply running the collectl command with no arguments and using its default settings. Below we see what the cpu, network and disk were doing while writing a large file:

    #cpu sys inter ctxsw KBRead Reads KBWrit Writes netKBi pkt-in netKBo pkt-out
      37  37   382   188      0     0  27144    254     45     68      3      21
      25  25   366   180     20     4  31280    296      0      1      0       0
      25  25   368   183      0     0  31720    275      2     20      0       1
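
Because the output above is plain whitespace-separated text, it can be pulled into a script with very little code. The following is a minimal Python sketch, not part of collectl itself: the function name `parse_brief` is made up for illustration, and it assumes a single header line starting with `#` followed by rows of integers, exactly as in the sample above.

```python
SAMPLE = """\
#cpu sys inter ctxsw KBRead Reads KBWrit Writes netKBi pkt-in netKBo pkt-out
37 37 382 188 0 0 27144 254 45 68 3 21
25 25 366 180 20 4 31280 296 0 1 0 0
25 25 368 183 0 0 31720 275 2 20 0 1
"""


def parse_brief(text):
    """Turn header-plus-rows text into a list of dicts keyed by column name."""
    columns = None
    records = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if line.lstrip().startswith("#"):
            # Header line: drop the leading '#' from the first column name.
            fields[0] = fields[0].lstrip("#")
            columns = fields
            continue
        if columns is None:
            raise ValueError("data row seen before header")
        records.append(dict(zip(columns, map(int, fields))))
    return records


records = parse_brief(SAMPLE)
total_kb_written = sum(r["KBWrit"] for r in records)
print("KB written across the three samples:", total_kb_written)
```

Keying each row by its column name keeps later analysis readable (`r["KBWrit"]` rather than `row[6]`) and survives a change in column order.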

    Output can also be saved in a rolling set of logs for later playback or displayed interactively in a variety of formats. If all that isn't enough there are additional mechanisms for supplying data to external tools via a socket interface or by generating its output as s-expressions, a format of choice for some tools such as supermon. You can even create files in space-separated formats for plotting with external packages like the one below which was done with gnuplot using 1 second samples.
Reader Comments (21)

    use sar.

    November 29, 1990 | Unregistered Commenter Anonymous

    I use Munin a lot lately for collecting data.

    November 29, 1990 | Unregistered Commenter Kent

    I'm the author of collectl and in response to the 2 word comment of 'use sar' I have to say before I wrote collectl I looked very closely at sar. It does some things very well but I also think it's gotten a little long in the tooth. The problem with some of the older tools is that so many scripts are dependent on their output formats they can't be changed and are locked in an older way of doing things. For example:
    - can sar display multiple types of data on a single line?
    - can it report Infiniband stats?
    - how about lustre? nfs? tcp? slabs?
    - is it possible to load sar data directly into a spreadsheet?
    - how about plotting sar data without having to manipulate the data?
    - what about sub-second monitoring intervals?

    enough rambling. the list is much longer...

    November 29, 1990 | Unregistered Commenter Mark Seger

    thanks for your answer, mark

    put in a point-to-point comparison on your site, or people will just "use sar" and never complain to you. Also you may consider supporting windows os. Remember these are the OSS freaks, they would rather sell their mom on ebay than pay for commercial software. The slashdot clique isn't the most distinguished customer base.

    have fun competing with the community that does things for free (i.e. $0).

    November 29, 1990 | Unregistered Commenter Anonymous

    Also you may consider nicer graphing output or support for RRDTool or nagios

    November 29, 1990 | Unregistered Commenter Anonymous

    re: side-by-side comparison to SAR
    I didn't want to get too deeply into that as I've noticed many people who use SAR are quite happy with the default 10 minute monitoring intervals which I find relatively useless for any kind of analytical troubleshooting. After all, if you're told the cpu was 25% busy for a given 10 minute period how could you ever tell it was idle for 7:30 and pegged for the other 2:30?
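
The arithmetic behind that example is easy to check. Here is a small Python sketch using made-up one-second samples (450 seconds idle, then 150 seconds at 100%) rather than real sar or collectl data:

```python
# Hypothetical 10-minute window sampled once per second:
# 7:30 (450 s) completely idle, then 2:30 (150 s) pegged at 100%.
samples = [0] * 450 + [100] * 150

average = sum(samples) / len(samples)  # what a single 10-minute sample reports
peak = max(samples)                    # what 1-second sampling can reveal
busy_seconds = sum(1 for s in samples if s >= 90)

print(f"10-minute average: {average:.0f}%")
print(f"peak: {peak}%, seconds at or above 90%: {busy_seconds}")
```

A steady 25% load over the same ten minutes produces exactly the same average, which is why the two cases cannot be told apart from the coarse number alone.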

    As I said before the main differences are in the types of data collected. SAR collects a lot but collectl collects a lot more.

    re: windows - ain't gonna happen! collectl is based on the very light-weight /proc interface to get its data and doing something similar in windows would be very painful.

    re: 'competing with the community that does things for free'
    what gives anyone the impression collectl costs something? it is open source and free!

    re: graphing
    You can generate data in plottable form and load it into a spreadsheet and use its graphing features or call something like gnuplot to do it for you. If you have a particular set of data you want to plot over and over again, you could always script it.

    re: rrd
    Collectl can generate data in rrd format. I was contemplating trying to actually load an rrd database directly from collectl and asked if anyone wanted to work with me and I didn't get any takers. I also did some experimenting with rrd and found it didn't really meet my plotting needs, which require that I get 100% accurate plotting data, and you can't do that with rrd since it normalizes multiple data points into a single one and you therefore lose information. Since rrd was never intended to be a highly accurate plotting package but rather to focus on trends, this is fine for it to do, but that's not ok when you're trying to use that data to diagnose a system problem.


    November 29, 1990 | Unregistered Commenter Mark Seger

    interesting little hack you made there but I cannot imagine why anyone would want to use it.
    for "getting numbers fast" there is sar, which is timetested and does pretty much everything
    one would need in such a use-case.

    for larger scale or longtime monitoring people commonly use munin, nagios or other
    plugin-based solutions. seems like your tool has nothing to offer on that front.

    November 29, 1990 | Unregistered Commenter John

    @mark, I am sorry but these 2 are non-arguments
    - is it possible to load sar data directly into a spreadsheet?
    Why would you want a specialized program to output into a specialized format (spreadsheet.. wasn't this the tool for the accounting types?)

    - how about plotting sar data without having to manipulate the data?
    Similar to the one above.

    PS: I have not looked at collectd yet, just commenting as knee-jerk.

    November 29, 1990 | Unregistered Commenter atif.ghaffar

    re: Interesting little hack
    It may be a hack to you but in the world of High Performance Computing, something many people may not be all that familiar with, it's proven to be invaluable as some of the largest computers in the world run collectl on a daily basis. Ever hear of the top500 list? The majority of the systems listed are HP and many of those run collectl. People who have used it recognize its worth. While sar provides a lot of data, it does not include information on Infiniband and Lustre and they are far too important to not have at your fingertips.

    As for tools like nagios, etc I find they don't scale. Can you send hundreds of performance counters to them every 10 seconds (or less) from over 1K nodes and not choke them? Furthermore, what happens when you're trying to debug a network problem and you can't get the data to nagios to display?

    re: output to a spreadsheet
    I tried to choose my words carefully but perhaps not carefully enough 9-)
    What I was trying to say is collectl can generate data in space-separated format (or in fact let you choose your own separator) and as such can be easily imported into any tool that recognizes such a format. Spreadsheets are the main ones that come to mind. However, more importantly, you can also run gnuplot directly.

    tough crowd, but I like to hear all feedback 8-)
    btw - at least Kevin Closson agrees with me
    see http://kevinclosson.wordpress.com/2007/12/18/its-your-choice-collectl-or-some-odd-collection-of-sundry-commands/


    November 29, 1990 | Unregistered Commenter Mark Seger

    I for one will definitely have a look into Collectl.

    We run a bunch of clusters and some SMP-machines, most of them doing HPC.
    Right now we primarily use Ganglia and what I find useful is that you can define and implement your own metrics with gmetric - just get your data in any way you want (e.g. shell script that gets the temperature via IPMI) and gmetric will feed it into Ganglia, which produces all charts, statistics, etc.
    I wonder if we could use Collectl to get some data into Ganglia.

    BTW - for a moment I thought your tool was not free (as in beer ;-)) but it is, and even if it wasn't, that really wouldn't matter given the kind of hardware it is meant for (this is in respect to one of the previous comments).


    I run a blog about building and administering clusters - http://clusteradmin.blogspot.com - perhaps somebody here will be interested.

    November 29, 1990 | Unregistered Commenter clusteradmin.blogspot.com

    Actually beer isn't free, but collectl is 9-) so by all means give it a shot. I suspect it's fairly easy to pass data from collectl to ganglia, but just be aware that when you plot data you lose accuracy and so you really want to keep collectl data local as well. Perhaps the best place to have that discussion is on your blog so I'll enter a few comments there...

    November 29, 1990 | Unregistered Commenter Mark Seger

    Sounds great. I'll cover some monitoring aspects (including your tool) over the weekend. Your comments will be very valuable.

    http://clusteradmin.blogspot.com :: blog about building and administering clusters.

    November 29, 1990 | Unregistered Commenter clusteradmin.blogspot.com

    "This blog is open to invited readers only" -- I don't know why you are advertising your blog when you apparently don't want anyone to read it....

    November 29, 1990 | Unregistered Commenter Anonymous

    Also you may consider nicer graphing output or support for RRDTool or nagios

    November 29, 1990 | Unregistered Commenter Culture - Bilisim

    The short answers to the last questions are yes and yes. Collectl's mission is to be the best data collector and logger around. To that end it doesn't try to aggregate data from multiple nodes, load it into databases or do fancy graphics. Those are jobs for other tools like Ganglia, Nagios or RRD.

    However what collectl does do is make data available for importation through a number of mechanisms. What I don't want to do is dictate how that data should be loaded and therefore I leave it to others. Is that a cop-out on my part? I say no, because whatever implementation I might choose there will be others who either disagree with that mechanism or simply know the external tools better and have more efficient ways to implement those mechanisms. Quite frankly I'm waiting for someone to raise their hand and say they want to import collectl data into another tool and are looking for help. I'd be more than happy to hear what they have to say and help where I can.


    November 29, 1990 | Unregistered Commenter Mark Seger

    It's been awhile but I thought I'd post an update on collectl. A couple of weeks ago collectl was added to the Fedora 10 release and quickly back-ported to releases 8 & 9 so it looks like it's starting to gain some traction.

    I was also looking through some previous discussions in this thread and, as a more detailed comparison of collectl to other utilities, I have developed a chart which shows a subset of collectl's commands mapped against existing tools like sar. If I've missed any sar (or other tool) options let me know and I'll be happy to update the table. It's at - http://collectl.sourceforge.net/Matrix.html

    I also thought I'd take the opportunity to mention that I have released collectl 3.0.0 which I think is pretty cool because of a key new feature I added, specifically the --top switch which makes collectl sort of work like top, only better! With this switch you not only can display processes sorted by cpu, you can also display top processes by I/O (assuming your kernel supports that). Furthermore you can simultaneously display other stats such as disk traffic, network, etc. In fact, you can even include process threads. But wait - there's more! Since collectl has a highly integrated set of capabilities, if you've had it running as a daemon and writing statistics to a file, you can play back that file with --top, multiple times if you like, and see who the top processes were at different times in the past! More on process monitoring here - http://collectl.sourceforge.net/Process.html

    Something new I'm currently working on is adding ipmi data such as fan/temp data. That tends to be a bit more challenging because every system reports its ipmi data in a different format!

    If anyone has tried collectl since my last posting and has any feedback (both good and bad) I'd be interested in hearing what you think.

    marek - I finally got around to trying to access your blog but apparently I need to be authorized by you to access it, or am I missing something? If only those who are given permission are allowed in, and people don't know your email address to ask, isn't that self-defeating?


    November 29, 1990 | Unregistered Commenter Mark Seger

    Thanks for the update Mark. All sounds good. You aren't missing anything on the new users. Because of spam I approve new users in batches. Sorry for the awkwardness of the process but I'm not sure what else to do.

    November 29, 1990 | Unregistered Commenter Todd Hoff

    Actually I'm talking about the pointer to clusteradmin.blogspot.com that was mentioned by Marek. He invited people to join in but you can't unless you're a member and without an email to ask him it's kind of a catch-22.

    November 29, 1990 | Unregistered Commenter Mark Seger

    Interesting...perhaps a way to correlate to actual end-user performance experience? I use an appliance to stitch together all the http packets. I guess if I knew what to correlate against, collectl would be good for deep-dive forensics?

    November 29, 1990 | Unregistered Commenter Tim

    The whole point is you never know ahead of time what to correlate and so you collect everything including process data. When the time comes to analyze a problem at a particular time, you just play back all different types of data around the time and start trying to correlate it with what else you might know such as a message in /var/log/messages or perhaps an entry in a web log that might have caused a failure or perhaps resulted in a slow response time. Does that help?

    November 29, 1990 | Unregistered Commenter markseger
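
The collect-everything-then-correlate workflow described in the comment above can be pictured with a small Python sketch that, given an event time taken from a log, pulls out every recorded sample within a window around it. The sample tuples, the `samples_near` helper and the 20-second window are all illustrative assumptions. This is not collectl's playback mechanism, only the idea of looking at everything collected around a point in time.

```python
from datetime import datetime, timedelta


def samples_near(samples, event_time, window_seconds=30):
    """Return the (timestamp, metrics) pairs within +/- window_seconds of event_time."""
    window = timedelta(seconds=window_seconds)
    return [(ts, m) for ts, m in samples if abs(ts - event_time) <= window]


# Made-up samples taken every 10 seconds; the values are invented for illustration.
start = datetime(2008, 1, 1, 12, 0, 0)
samples = [
    (start + timedelta(seconds=10 * i), {"cpu": 5 + 10 * i, "KBWrit": 1000 * i})
    for i in range(12)
]

# Pretend a log entry at 12:01:00 reported a slow response.
event = datetime(2008, 1, 1, 12, 1, 0)
nearby = samples_near(samples, event, window_seconds=20)
for ts, metrics in nearby:
    print(ts.time(), metrics)
```

Because every subsystem is stored with the same timestamps, the same window can be applied to cpu, disk, network and process data without deciding in advance which one matters.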

    We have been having strange problems with our processes on a large server (>60T disk, 700GB ram). The problem with sar is trying to correlate all the data points. It's very difficult. Anything that can make this job easier would be welcome. I'm very interested to see what collectl can do.

    October 28, 2016 | Unregistered Commenter GreyGnome
