Product: Collectl - Performance Data Collector

From their website: There are a number of times in which you find yourself needing performance data. These can include benchmarking, monitoring a system's general heath or trying to determine what your system was doing at some time in the past. Sometimes you just want to know what the system is doing right now. Depending on what you're doing, you often end up using different tools, each designed to for that specific situation. Features include:

  • You are be able to run with non-integral sampling intervals.
  • Collectl uses very little CPU. In fact it has been measured to use <0.1% when run as a daemon using the default sampling interval of 60 seconds for process and slab data and 10 seconds for everything else.
  • Brief, verbose, and plot formats are supported.
  • You can report aggregated performance numbers on many devices such as CPUs, Disks, interconnects such as Infiniband or Quadrics, Networks or even Lustre file systems.
  • Collectl will align its sampling on integral second boundaries.
  • Supports process and slab monitoring.
  • New to the 2.4.0 release is the monitoring of process i/o statistics. Unlike most monitoring tools that either focus on a small set of statistics, format their output in only one way, run either interactively or as a daemon but not both, collectl tries to do it all. You can choose to monitor any of a broad set of subsystems which currently include cpu, disk, inodes, infiniband, lustre, memory, network, nfs, processes, quadrics, slabs, sockets and tcp. The following is an example of simply running the collectl command with no arguments and using its default settings. Below we see what the cpu, network and disk were doing while writing a large file: #<--------CPU--------><-----------Disks-----------><-----------Network----------> #cpu sys inter ctxsw KBRead Reads KBWrit Writes netKBi pkt-in netKBo pkt-out 37 37 382 188 0 0 27144 254 45 68 3 21 25 25 366 180 20 4 31280 296 0 1 0 0 25 25 368 183 0 0 31720 275 2 20 0 1 Output can also be saved in a rolling set of logs for later playback or displayed interactively in a variety of formats. If all that isn't enough there are additional mechanisms for supplying data to external tools via a socket interface or by generating its output as s-expressions, a format of choice for some tools such as supermon. You can even create files in space-separated formats for plotting with external packages like the one below which was done with gnuplot using 1 second samples.

    Click to read more ...

  • Sunday

    Ideas on how to scale a shared inventory database???

    We have a database today that holds all of our shared inventory. How do we scale out ? We run into concurrency issues today as mutliple users may want to access the same inventory,etc. Im sure its a common problem.. So how do folks implement this while also having faster response to available inventory and also ensuring no downtime Thanks

    Click to read more ...


    The case against ORM Frameworks in High Scalability Architectures

    Let me begin by saying that I have used and continue to use various ORM frameworks such as hibernate, ibatis, propel and activerecord in applications and websites that have a user base ranging from a couple hundred to 500k users. Especially for projects that have to be up and running in a short duration of time, ORM frameworks significantly reduce the effort required to manipulate and persist OOP objects by providing time saving facilities such as automatically generated model objects, integrated unit testing, secure variable substitution, etc. Hibernate even supports horizontal data partitioning via Hibernate Shards. However, the lay of the land is significantly different in the rarefied space occupied by applications needing to support millions of users. Profiling an application at this level and paying particular attention to the operations needed to move data to and from the database, it becomes evident that a significant portion of the operations are API related, whereby the ORM framework is traversing the abstraction layer built between the application logic and the native methods that ultimately interact with the database. I see a couple of problems with this level of abstraction and for the purpose of this discussion, I will purposely ignore caching for the sake of keeping the scope succinct. 1. The process of optimizing database queries is as much an art as it is a science and I am yet to see an ORM framework that does this well. In the case of mysql, optimization involves using facilities such as explain, benchmark, analyze table, show index, and the slow queries log to identify non-performing queries and tweak them to extract the leanest performance. These optimizations necessarily work best when applied as close as possible to the bare metal, so to speak, and the abstraction of an ORM framework negates to an extent the benefits of optimization. The devil remains in the details and the further away you are from the details, the lesser a chance you have to find and square with the devil. 2. At the end of the day, an ORM framework is essentially middleware. My reading of some of the real life architectures presented on this sites seems to reinforce the assessment that middleware will only take you so far, beyond which you have to roll your own. This makes perfect sense. ORM frameworks are built to serve as wide an audience as possible and while their success is unquestionable in the commodity/middle market, they are not and cannot possibly be tooled to accommodate the atypical demands of high scalability architecture. That would be akin to running with hares and hunting with the hounds. Building a framework for hight scalability would also require that the builders have a front and center seat in an enterprise where they are exposed to the machinery and day to day operations of a high scalability site. A situation for which you would be hard pressed to find another installation bearing similar characteristics or with similar requirements. Additionally, and without putting down the developers who contribute to these frameworks, a majority of them would not have the exposure to a bona fide high scalability architecture to be able to bring their experience to bear on the framework code base. 3. Just as with kernel developers, I have a significant amount of faith in the folks that spend their every waking hour coding database engines such as MySQL, Postgres, Oracle, MS SQL etc. Consequently, when the main goal is ultimate performance and scalability, I generally frown upon efforts to introduce a middle man between the wicked fast database and the application logic. And having invested the time and effort over many years to learn the intricacies of a database engine, I am more apt to cast my lot with the devil that I know than abdicate control to a framework, however versatile. One could argue that it makes sense to start off with an ORM framework and as the demands for the site begin to eclipse what the framework can provide, gradually transition to a custom built solution. In my experience, refactoring on the database tier for a site that has a significant amount of data and needs to be operational 24x7 is pure hell. So much so that a more feasible option would be to build a parallel site then migrate and switch over. Of course this could be mitigated by using a service oriented architecture and thereby giving yourself some degree of maneuverability, but at the end of the day, there will be thousands of operations trying to read and write to the db every second. You are had, whichever which way you turn. Taking a look at the mediawiki source code that powers the Wikimedia sites including Wikipedia, there are two classes, DatabaseMySQL and DatabasePostgress which encapsulate the native PHP functions that talk to MySQL or PostgreSQL respectively. The other main classes such as the Article class then use these database classes to interact with the db. Simple and straight forward and in my opinion, the best way to get maximum performance and throughput.

    Click to read more ...


    The AOL XMPP scalability challenge

    Large scale distributed instant messaging, presence based protocol are a real challenge. With big players adopting the standard, the XMPP (eXtensible Messaging and Presence Protocol) community is facing the need to validate protocol and implementations to even larger scale.

    Click to read more ...


    How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data

    How do you query hundreds of gigabytes of new data each day streaming in from over 600 hyperactive servers? If you think this sounds like the perfect battle ground for a head-to-head skirmish in the great MapReduce Versus Database War, you would be correct. Bill Boebel, CTO of Mailtrust (Rackspace's mail division), has generously provided a fascinating account of how they evolved their log processing system from an early amoeba'ic text file stored on each machine approach, to a Neandertholic relational database solution that just couldn't compete, and finally to a Homo sapien'ic Hadoop based solution that works wisely for them and has virtually unlimited scalability potential. Rackspace faced a now familiar problem. Lots and lots of data streaming in. Where do you store all that data? How do you do anything useful with it? In the first version of their system logs were stored in flat text files and had to be manually searched by engineers logging into each individual machine. Then came a scripted version of the same process. The next big evolution was a single machine MySQL version. Inserts quickly became the bottleneck as the huge torrents of data flooding caused a lot of index churn. Perdiodic bulk loading was the remedy to this problem, but the shear size of the indexes slowed it down. Data was then broken into Merge Tables based on time so index updates weren't a problem. As more and more data this solution broke down with a combination of load and operational problems. Facing exponential growth they spent about 3 months building a new log processing system using Hadoop (an open-source implementation of Google File System and MapReduce), Lucene and Solr. Moving to a partitioned MySQL data set was an option, but they thought it would only buy time until and a more scalable solution would need to be created in the future anyway. The future came a little early this year. The advantage of their new system is that they can now look at their data in anyway they want:

  • Nightly MapReduce jobs collect statistics about their mail system such as spam counts by domain, bytes transferred and number of logins.
  • When they wanted to find out which part of the the world their customers logged in from, a quick MapReduce job was created and they had the answer within a few hours. Not really possible in your typical ETL system. This switch has changed how they run their business. Stu Hood nicely sums up the impact: "Now whenever we think of complex question about our customers’ usage patterns, we can pull the answer from our logs within hours via MapReduce. This is powerful stuff." In the rest of this post Bill describes the evolution of their system and the forces that caused them to move from a relational database solution to a MapReduce system. Before getting started, I'd really like to thank Bill Boebel for spending so much time and effort in creating this very valuable experience report.

    Information Sources

  • MapReduce at Rackspace
  • A document sent to me by Bill Boebel, CTO of Mailtrust (Rackspace's mail division). This post is a little different than normal because most all the content past this point is by Bill, I've just organized it a little differently.

    The Platform

  • Hadoop
  • Hadoop Distributed File System (HDFS)
  • Lucene
  • Solr
  • Tomcat

    The Stats

  • Rackspace has more than 50K devices and 7 data centers.
  • The mail system and logging servers are currently in 3 of the Rackspace data centers.
  • The system stores over 800 million objects (an object = a user event such as receiving an email or logging into IMAP) within Solr and 9.6 billion within Hadoop, which equals 6.3 TB compressed.
  • Several hundred gigabytes of email log data is generated each day.

    Background on Mailtrust

  • Email hosting company
  • Founded in 1999, merged with Rackspace in 2007, previous name:
  • 80K business customers, 700K mailboxes.
  • 2 hosted mail products: Noteworthy, MS Exchange
  • The Noteworthy System: * Homegrown, Linux based, POP3, IMAP, webmail, RSS feeds, shared calendaring, Outlook sync, Blackberry sync. * ~600 servers, commodity hardware, designed to work around frequent failures.
  • The MS Exchange System: * MAPI, POP, IMAP, OWA, Blackberry, Goodmail, ActiveSync. * ~100 servers, higher-end hardware, SAN & DAS storage.

    The Architecture

    The way the current Hadoop based system works is:
  • Raw logs get streamed from hundreds of mail servers to the Hadoop Distributed File System (”HDFS”) in real time.
  • MapReduce jobs are scheduled run to index the new data using Apache Lucene and Solr.
  • Once the indexes have been built, they are compressed and stored away in HDFS.
  • Each Hadoop datanode runs a Tomcat servlet container, which hosts a number of Solr instances that pull and merge the new indexes, and provide really fast search results to our support team.

    The System Evolution

    The Problem

    Mailtrust is a very customer service focused company. It is extremely important for our support techs to be able to examine mail logs in order to troubleshoot problems for our customers. Our support techs need to search the logs hundreds of times per day, so the tools that provide this functionality must be fast and accurate. With over 600 mail servers, and hundreds of gigabytes of raw log data produced each day, this can be tricky to manage. Here is a brief history of the Mailtrust logging architecture, problems we faced, how we over came them, and what the system looks like today...

    Logging v1.0

    Logs were stored in flat text files on the local disk of each mail server and were kept for 14 days. Our support techs did not have login access to the servers, so in order to search the logs they would have to escalate a ticket to our engineers. The engineers would then have to ssh into each mail server and grep /var/log/maillog. Problems: Once we grew much past a dozen servers, this manual process of logging into each server become too time consuming for our engineers.

    Logging v1.1

    Sped up the search process by writing a script that would search multiple servers via one command run from a centralized server. An engineer could tell the script what type of mail server to search (inbound smtp, outbound smtp, backend mailbox). The script would look at /etc/hosts for a list of servers of that type, and then iterate through each server, ssh in, perform the grep and then output the results. The script could also search in the past via "gunzip -c /var/log/maillog.* | grep" Problems: The support techs still had to escalate a ticket to the engineers in order to perform a search. As the number of customers and servers increased, this began to take too much of our engineers' scarce time. Also, storing and searching the logs on a live server was negatively affecting the performance of the servers. To make matters worse, the engineering team had grown and we started running into the problem where two engineers would perform a search at the same time, which really slowed things down.

    Logging v2.0

    We released a log search tool that the support techs could use directly, without involving the engineers. The support team was given a web-based tool where they could search the logs. It allowed searching by the sender or recipient's email address, domain name or IP address. All of these were indexed fields in a MySQL database. Wildcard text searches (i.e. MySQL "LIKE" statements) were not allowed because the data set was very large and these queries would be horribly slow. Each day's logs were stored in a separate table, so that we could cleanup old data by simply dropping and recreating MySQL tables. This made cleanup really fast compared to running a conditional DELETE command on a large table. Log data was only kept for 3 days in order to keep the MySQL database down to a reasonable size. To get the logs into the database, each mail server initially wrote its log data to a local 16MB tempfs partition. Logrotate was called via cron every 60 seconds to rotate the temporary log file and then preprocess the data before sending it on to the centralized log server. This preprocessing step reduced the volume of data that had to be transmitted over the network to the log server, and this also distributed the processing workload to avoid creating bottleneck on the log server. After the data was processed locally, the script would send comma delimited log data back to syslog-ng on the local server, and syslog-ng would then send it over the network to the centralized log server. The log server was configured to receive data on 6 different ports, one for each type of log data... inbound smtp, outbound smtp, backend smtp, spam/virus filtering, POP3 and IMAP. As log data was received, the records were inserted one by one into the database via MySQL INSERT commands. Problems: We quickly realized that we had a bottleneck with the MySQL inserts. As the tables grew, indexing each entry as it was inserted became slow. Within the first hours of testing, the inserts began slowing and could not keep up with the rate at which data was received. Version 2.0 of the logging system was never used in production.

    Logging v2.1

    Fixed the MySQL INSERT bottleneck by queuing up the log entries in local text files on the centralized log server and periodically bulk loading them into the database. As syslog-ng received logs on its 6 ports, the data would be streamed to 6 separate text files. Every 10 minutes a script would rotate those text files and execute a MySQL LOAD to load the data into the database. This was magnitudes faster than inserting the log data one record at a time. Problems: The LOADs would get progressively slower as the database grew because MySQL indexing performance decreases as the table you are inserting into gets larger. This version was fast enough to be released into production, but we knew the system would not scale too far without additional work.

    Logging v2.2

    Introduced Merge Tables in order to speed up loading the log data into the database. With this version, every 10 minutes our script would create a new database table and then load the text logs into the empty table. This made the LOAD command extremely fast because there were no existing database indexes that could negatively affect performance. After the data was loaded, the script would modify a set of Merge Tables that combined all of the 10-minute tables together. The web search tool was modified to allow searching within the following time ranges: all day, past 12-hours, past 6-hours, past 2-hours. Corresponding Merge Tables existed for each of those time ranges, and were modified every 10 minutes as new tables were created. Problems: This version of the logging system worked reliably for about one year. But we began having problems with it as our support team, customer base and server count grew. When we reached about 100 servers the database LOAD operations would take 2-3 minutes to run, which was acceptable, but the server was now always under a heavy cpu and disk IO load. Searches were being performed more frequently and were becoming slow. We started to see some strange problems such as random errors while trying to create new tables or modify the Merge Tables. These errors progressively became more frequent, resulting in missing log data. The support team began to lose confidence in the system's accuracy. Also, there were several occasions where our engineers performed a software upgrade to a particular application, which changed log format in such a way that broke the preprocessing script. Since our raw logs were deleted from the local mail servers every 60 seconds, we'd have no way to recover the missing logs when this occurred. Additionally, the log search tool was becoming ever more critical to our support team's daily operations; however, the logging system had no redundancy. There was no RAID, no backups, no failover system. We also do not have a good plan for scaling the log system beyond a single monolithic server. Incrementally patching problems and tweaking performance with the log system was taking up a lot of time and we needed something better. We needed a new solution that would be fast, reliable and could scale indefinitely with our growth. We needed something truly scalable.

    Logging v3.0

    While designing v3.0, we looked at several commercial log processing applications. Splunk stood out, and did just about everything we wanted; however, we worried that using a vendor product like this might limit our abilities to build new features down the road. For example, we wanted to build a tool that would allow our customers to search their logs directly. We had been keeping an eye on the Apache Hadoop project since its inception, and were extremely impressed with its progress and direction. Hadoop is an open-source implementation of Google File System and MapReduce... a system that is designed specifically for large scale distributed data processing. It scales out it's workload horizontally by adding servers and distributing the data and MapReduce jobs amongst the servers. Other companies were already using it for their own log processing. So chose to go with Hadoop. In about 3 months we build a fresh new log processing system using Hadoop, Lucene and Solr. The system is described here: We believe this new system will be able to scale with us as our company grows. And there is a lot of momentum behind the Hadoop project, which gives us a lot of confidence that its scalability will continue to improve. Yahoo is one of the major contributors to the project and has built Hadoop clusters that contain thousands of servers, and they are aggressively working to get Hadoop to support tens of thousands of servers. Problems: To date, the only problems we have found have been our own bugs; and we fix those as we find them. We are actively running v3.0 today, but we're not going to stop here. We have a lot of plans for new features...

    The Future

    Version 3.1 is being coded currently. It includes new MapReduce jobs that support Microsoft Exchange log processing. (currently we only process Noteworthy logs with this system). We plan to go live in March. In version 4.0 we plan to put the log search tool in the hands of our customers so that they can have the same troubleshooting power that our support team has. This will most likely require reorganizing the way we store log index shards so that they are grouped by user, rather than letting Solr randomly group them. Our resellers seem to be excited about this, because it should allow them to better support their customers. Who knows what we'll build after v4.0...

    Related Articles

  • Google Architecture
  • Database People Hating on MapReduce
  • Product: Hadoop
  • Running Hadoop MapReduce on Amazon EC2 and Amazon S3
  • Solr

    Click to read more ...

  • Tuesday

    Too many databases

    Hi, I am using drupal for my clients website, and was thinking is it possible to host all ( about 500) of them on the same server(maybe VPS or dedicated). Here is the situation..... Each clients website has a database with about 50 tables each, all the databases are small in size about 2-5 MB .... and the websites are low traffic websites with say.. 50 hits/day on avg.... that means about 2000 queries/db/day ..... (avg 40 queries per hit).... Wanted to know if it is possible to have so many databases about 500 on the same server? what are the things that i should look into if i should make this happen?

    Click to read more ...


    Building scalable storage into application - Instead of MogileFS OpenAFS etc.

    I am planning the scaling of a hosted service, similar to typepad etc. and would appreciate feedback on my plan so far. Looking into scaling storage, I have come accross MogileFS and OpenAFS. My concern with these is I am not at all experienced with them and as the sole tech guy I don't want to build something into this hosting service that proves complex to update and adminster. So, I'm thinking of building replication and scalability right into the application, in a similar but simplified way to how MogileFS works (I think). So, for our database table of uploaded files, here's how it currently looks (simplified): fileid (pkey) filename ownerid For adding the replication and scalability, I would add a few more columns: serveroneid servertwoid serverthreeid s3 At the time the user uploads a file, it will go to a specific server (managed by the application) and the id of that server will be placed in the "serverone" column. Then hourly or so, a cron job will run through the "files" table, and copy any files that haven't been replicated (where servertwo and serverthree are null) to other servers. Another cron will copy files to Amazon's s3 for an extra backup (if null then copy to s3). Now at the client level, when the page to display the file is loaded, it will know which of the three servers it can pull the file from. If one server goes down, the application will know and use one of the other servers. When storage capacity runs low, another server is added with a big drive, perhaps not even having raid on it. These servers will also be used for php serving through load balancing. I'm probably missing some big drawbacks of this approach but it appeals to me that it should be quite simple to implement and be less complex to adminster than systems like MogileFS which would present a lot more unknowns.

    Click to read more ...


    Speed up (Oracle) database code with result caching

    One of the most interesting new features of Oracle 11 is the new function result caching mechanism. Until now, making sure that a PL/SQL function gets executed only as many times as necessary was a black art. The new caching system makes that quite easy -- here is how it works.

    Click to read more ...


    When things aren't scalable

    OK, I know this site is for scalable web site design. But as there aren't any sites I can find for graceful failure under "slashdotted" like pressure I'll ask here. Does anyone have a sensible way, once you have a "web application" that either won't scale, or can't scale, that you can give some users a good consistent experience and bounce other users to a busy site page. I have seen sites do this to varying degrees, some of which work better than others, but no explanations beyond simply bouncing requests to a "we're busy page server" when you have more than a given number of connections. This is obviously useless as a web page likely requires multiple connection (ignoring keep-alive, pipelining etc) multiple connection to completely render properly. The normal problem is users getting a page and not the "furniture" for that page like images or css. Other problems are having to wait ages to get the busy page or the site being slow even if you do "get in". And some site let a user "in" and then as they browse around they get bounced out suddenly to the busy page. Obviously not being the developer for sites I deal with (I am an infrastructure bod) I can't solve the problem where it should have been pre-emptively solved. That is to say I can't write the code to be scalable or re-write the code to do some simple session filtering or the like (and not being a developer I get dirty looks when I point developers at information like your site ... I can hear them thinking "how dare you suggest I don't know how to code a web site you lowly infrastructure cretin"). Before developer on-line lynch me I should point out that sometimes the cause of not being able to scale a site is that I can't get in new hardware quick enough, but then who knows when you will get slashdotted right ?. So my question applies even when a developer of genius level brilliance has built a unsurpasibly scalable web site for me to run the infrastructure for. My best guess so far is using something like HAProxy to load balance sessions, and then use it's more advanced total session count, and cookie issuing abilities to track users and bounce some at a given "heavy load" point. This isn't ideal as the heavy load point would have to be based on connection counts not server load or server response times, but it's the best I can come up with so far. Also, having mentioned brilliant developers writing great sites not always making my question redundant, could I ask, do people normally think about coping with overload when designing scalable solution - surely they should but I don't see much talk about it. Couldn't a simple Java filter or the equivalent for other things be built into applications ? It'd be nice to have a site that not only scales, but "is nice" when waiting for the infrastructure it runs on to be scaled, which could be several days when you have to purchase new hardware.

    Click to read more ...


    Howto setup GFS/GNBD

    Before you proceed make sure you have physical volume(something like /dev/sda1, /dev/sda4, etc) with no data. This is going to be the gfs volume which you will export to other nodes. It should be on the node which is going to be your gnbd server. If you dont have such volume create one using fdisk. I used mounted gfs volume as a DOCUMENT ROOT for my Apache server nodes(Load Balanced). I tried it on FC4 64-bit. If you plan to try it on any other distribution or 32-bit arch.. still the procedure remains same. Since I built it from source but not RPMs, you may have to simply supply config options with a different CFLAGS. Full details at

    Click to read more ...