How do you monitor the performance of your cluster?

I had posted a note the other day about collectl and its ganglia interface but perhaps I wasn't provocative enough to get any responses so let me ask it a different way, specifically how do people monitor their clusters and more importantly how often? Do you monitor to get a general sense of what the system is doing OR do you monitor with the expectation that when something goes wrong you'll have enough data to diagnose the problem? Or both? I suspect both... Many cluster-based monitoring tools tend to have a data collection daemon running on each target node which periodically sends data to some central management station. That machine typically writes the data to some database from which it can then extract historical plots. Some even put up graphics in real-time. From my experience working with large clusters - and I'm talking either many hundreds or even 1000s of nodes, most have to limit both the amount of data they manage centrally as well as the frequency that they collect it, otherwise they'll overwhelm their management station because most DBs can't write hundreds of counters many times/minute from thousands of nodes. As a related example, how many of you run sar at the default monitoring interval of 10 minutes? Do you really think you're getting useful information? What happens if you have a 2 minutes burst of 100% network load and you're idle the other 8 minutes? Sar will happily tell you the network load was 20% and you'll never know your network is tanking. The point of all this is I do think there's a place for central monitoring, though I'm personally not a fan because of the inaccuracy of infrequent data samples, but I also appreciate some data is better than none, as long as you realize the inherent accuracy problems. And that's where collectl comes in and my previous comment about ganglia. When I wrote collectl my overarching design goal, from which I haven't wavered, was to provide highly accurate local data with minimal overhead so you will take samples in the 1-10 second range without fear of impacting the rest of the system. You can literally sample just about everything going on every 10 seconds and use <0.1% of the CPU. If you're willing to give up a few more tenths of a percent you can even monitor processes and slab activity, though you should only sample them at a 60 second frequency because it IS expensive to monitor them. However I also realize this doesn't do any good if do have 1Ks of machine you want to watch and so that's where the socket interface comes in over which collectl can send data to a central manager at that same frequency OR if you prefer have is send its remote data at a different rate, giving you the best of both worlds. Collectl can provide it's data to a central management station while at the same time providing local logging for accuracy, which will let you do a deep dive into the data if a problem does arrive for which there is not enough data stored centrally. My point about the ganglia interface was my response to the fact that a lot of of people running large (as well as smaller) clusters do use ganglia but like most central monitoring stations have to give up the accuracy of finer-grained data and I was just wondering if anyone looking at this forum use ganglia and if they might be interested in trying out the collectl interface to it. -mark

Click to read more ...


A picture is realy worth a thousand word, and also a window in time...

Photograpic picture to me is window, an address to that specific moment what do your think about that?

Click to read more ...


At Some Point the Cost of Servers Outweighs the Cost of Programmers

This is the intriguing quote by Bill Venners in an interview with Twitter's Alex Payne on Twitter's heretical switch from a pure Ruby stack to a Ruby on Rails stack on the front-end and JVM/Scala on the back-end:

So performance was also one of the problems with JRuby, which I [Bill Venners] think helps explain better why they'd [Twitter] prefer Scala over Ruby or JRuby for some things. I have often heard Rubyists say that although Ruby is slower than Java, for many things it is plenty fast enough, and they are right. The logic goes further, saying that servers are cheap, and programmers expensive, so it makes sense to tradeoff some runtime performance for programmer productivity. And I think that's very often true too, but not always. If you have enough traffic, at some point the cost of servers outweighs the cost of programmers. I'm not sure whether Twitter is past that point, but they get a lot of traffic. And frankly this isn't an intrinsic tradeoff. Other dynamic languages are faster than Ruby, and Scala is too. And people can be quite productive in these other languages too, including Scala.
I feel Alex's Max Payne. You might wonder why the geekosphere cares so passionately which technology stack Twitter uses? Well, it's Twitter and it's Ruby on Rails. That's like the Lindsay Lohan and Samantha Ronson of tech buzz. It creates it's own self-sustaining posting reaction. Boom! It took some giant cajones to switch from a well defended platform like Ruby on Rails to an obscure language like Scala. Few people would have been brave enough to pull the trigger on that decision. Twitter didn't take this large leap out of ignorance or incompetence. Twitter's Steve Jenson said they spent several weeks going over our options, running extensive load tests, and presented our findings to the team at each stage. We did our due diligence. They did the work and came to conclusions valid for their situation. They have to follow their own bliss. They aren't telling you to use Scala. They aren't telling you not to use Ruby. Have at it. But they have chosen the path less traveled and seem happy with the direction they are heading. If you aren't happy with their decision then that's a you problem, not a them problem. This points out for me the evolving nature of the two tier web: a client tier and a back-end tier accessed only via an API. That back-end tier can be implemented in anything at all. Twitter likes the JVM for it's undeniable performance, reliability, and scalability. They may like Scala because it has a lot of cool features, is concise, has static typing, performs well, and is a pleasure to develop in. When people jumped with a happy little grin on their face to Ruby, they loved a lot of things about Ruby that helped them make that decision to switch. Maybe Scala is cool, effective, and fun to use too? So instead of worrying how the homeboys have ever so indirectly dissed Ruby, maybe it's worth taking a look at Scala to see if it's actually any good? As a start try Bill Venners on the rise of Scala. It's a great overview of Scala and why it's a worthy next evolution. Then maybe take a peek at First Steps to Scala. And if you want to take a look at some real-life code try Kestrel - tiny queue system based on starling. Knowing only the vaguest bits about Scala before my own little exploration, I come out impressed and interested. I like a lot of things Bill had to say: Scala means scalable language as in it scales to different sized tasks and domains; Scala's static typing is valuable, especially in a team project; Scala has the feel of a scripting language, but can also work as a systems language; Scala is an artful blend of a fully Object and fully Functional language; Scala supports imperative programming, but with a functional bent; Scala can express more in types than Java so you get more rigorous type checking; Scala is concise so can say more with less; Scala treats libraries like creating internal DSLs; Scala helps programmer productivity like Ruby and Python. Sounds like Scala is worth a deeper look to me. Can't we all just get along? Well, that's not how this post started at least. It started with Bill's refreshingly anti-establishment statement: If you have enough traffic, at some point the cost of servers outweighs the cost of programmers. This is an idea we don't hear much of anymore in this era of abundance. Servers cost real cash though, especially for bootstrappers and the truly humongous. So if you can cut your cash requirements by putting in a little programmer elbow grease, that makes a lot of sense to me. Performance still matters.

Related Articles

  • Twitter on Scala by Bill Venners
  • My Reasoned Response about Scala at Twitter by Obie Fernandez.
  • Twitter Blaming Ruby for their Mistakes by Tony Arcieri
  • Scala's home page
  • Hacker News Thread on Message Queue Options
  • The Secret Behind Twitter's Growth by Kate Greene
  • Why Scala for Web 2.0? by Alex Payne
  • Mending The Bitter Absence of Reasoned Technical Discussion by Alex Payne

    Click to read more ...

  • Saturday

    Performance Anti-Pattern

    Want your apps to run faster? Here’s what not to do. By: Bart Smaalders, Sun Microsystems. Performance Anti-Patterns: - Fixing Performance at the End of the Project - Measuring and Comparing the Wrong Things - Algorithmic Antipathy - Reusing Software - Iterating Because That’s What Computers Do Well - Premature Optimization - Focusing on What You Can See Rather Than on the Problem - Software Layering - Excessive Numbers of Threads - Asymmetric Hardware Utilization - Not Optimizing for the Common Case - Needless Swapping of Cache Lines Between CPUs For more detail go there

    Click to read more ...


    Digg Architecture

    Update 4:: Introducing Digg’s IDDB Infrastructure by Joe Stump. IDDB is a way to partition both indexes (e.g. integer sequences and unique character indexes) and actual tables across multiple storage servers (MySQL and MemcacheDB are currently supported with more to follow). Update 3:: Scaling Digg and Other Web Applications. Update 2:: How Digg Works and How Digg Really Works (wear ear plugs). Brought to you straight from Digg's blog. A very succinct explanation of the major elements of the Digg architecture while tracing a request through the system. I've updated this profile with the new information. Update: Digg now receives 230 million plus page views per month and 26 million unique visitors - traffic that necessitated major internal upgrades. Traffic generated by Digg's over 22 million famously info-hungry users and 230 million page views can crash an unsuspecting website head-on into its CPU, memory, and bandwidth limits. How does Digg handle billions of requests a month? Site:

    Information Sources

  • How Digg Works by Digg
  • How uses the LAMP stack to scale upward
  • Digg PHP's Scalability and Performance


  • MySQL
  • Linux
  • PHP
  • Lucene
  • Python
  • APC PHP Accelerator
  • MCache
  • Gearman - job scheduling system
  • MogileFS - open source distributed filesystem
  • Apache
  • Memcached

    The Stats

  • Started in late 2004 with a single Linux server running Apache 1.3, PHP 4, and MySQL. 4.0 using the default MyISAM storage engine
  • Over 22 million users.
  • 230 million plus page views per month
  • 26 million unique visitors per month
  • Several billion page views per month
  • None of the scaling challenges faced had anything to do with PHP. The biggest issues faced were database related.
  • Dozens of web servers.
  • Dozens of DB servers.
  • Six specialized graph database servers to run the Recommendation Engine.
  • Six to ten machines that serve files from MogileFS.

    What's Inside

  • Specialized load balancer appliances monitor the application servers, handle failover, constantly adjust the cluster according to health, balance incoming requests and caching JavaScript, CSS and images. If you don't have the fancy load balancers take a look at Linux Virtual Server and Squid as a replacement.
  • Requests are passed to the Application Server cluster. Application servers consist of: Apache+PHP, Memcached, Gearman and other daemons. They are responsible for making coordinating access to different services (DB, MogileFS, etc) and creating the response sent to the browser.
  • Uses a MySQL master-slave setup. - Four master databases are partitioned by functionality: promotion, profiles, comments, main. Many slave databases hang off each master. - Writes go to the masters and reads go to the slaves. - Transaction-heavy servers use the InnoDB storage engine. - OLAP-heavy servers use the MyISAM storage engine. - They did not notice a performance degradation moving from MySQL 4.1 to version 5. - The schema is denormalized more than "your average database design." - Sharding is used to break the database into several smaller ones.
  • Digg's usage pattern makes it easier for them to scale. Most people just view the front page and leave. Thus 98% of Digg's database accesses are reads. With this balance of operations they don't have to worry about the complex work of architecting for writes, which makes it a lot easier for them to scale.
  • They had problems with their storage system telling them writes were on disk when they really weren't. Controllers do this to improve the appearance of their performance. But what it does is leave a giant data integrity whole in failure scenarios. This is really a pretty common problem and can be hard to fix, depending on your hardware setup.
  • To lighten their database load they used the APC PHP accelerator MCache.
  • Memcached is used for caching and memcached servers seemed to be spread across their database and application servers. A specialized daemon monitors connections and kills connections that have been open too long.
  • You can configure PHP not parse and compile on each load using a combination of Apache 2’s worker threads, FastCGI, and a PHP accelerator. On a page's first load the PHP code is compiles so any subsequent page loads are very fast.
  • MogileFS, a distributed file system, serves story icons, user icons, and stores copies of each story’s source. A distributed file system spreads and replicates files across a lot of disks which supports fast and scalable file access.
  • A specialized Recommendation Engine service was built to act as their distributed graph database. Relational databases are not well structured for generating recommendations so a separate service was created. LinkedIn did something similar for their graph.

    Lessons Learned

  • The number of machines isn't as important what the pieces are and how they fit together.
  • Don't treat the database as a hammer. Recommendations didn't fit will with the relational model so they made a specialized service.
  • Tune MySQL through your database engine selection. Use InnoDB when you need transactions and MyISAM when you don't. For example, transactional tables on the master can use MyISAM for read-only slaves.
  • At some point in their growth curve they were unable to grow by adding RAM so had to grow through architecture.
  • People often complain Digg is slow. This is perhaps due to their large javascript libraries rather than their backend architecture.
  • One way they scale is by being careful of which application they deploy on their system. They are careful not to release applications which use too much CPU. Clearly Digg has a pretty standard LAMP architecture, but I thought this was an interesting point. Engineers often have a bunch of cool features they want to release, but those features can kill an infrastructure if that infrastructure doesn't grow along with the features. So push back until your system can handle the new features. This goes to capacity planning, something the Flickr emphasizes in their scaling process.
  • You have to wonder if by limiting new features to match their infrastructure might Digg lose ground to other faster moving social bookmarking services? Perhaps if the infrastructure was more easily scaled they could add features faster which would help them compete better? On the other hand, just adding features because you can doesn't make a lot of sense either.
  • The data layer is where most scaling and performance problems are to be found and these are language specific. You'll hit them using Java, PHP, Ruby, or insert your favorite language here.

    Related Articles

    * LinkedIn Architecture * Live Journal Architecture * Flickr Architecture * An Unorthodox Approach to Database Design : The Coming of the Shard
  • Ebay Architecture

    Click to read more ...

  • Friday

    Collectl interface to Ganglia - any interest?

    It's been awhile since I've said anything about collectl and I wanted to let this group know I'm currently working on an interface to ganglia since I've seen a variety of posts ranging from how much data to log and where to log it as well as which tools/mechanism to use for logging. From my perspective there are essentially 2 camps on the monitoring front - one says to have distributed agents all sending their data to a central point, but don't send too much or too often. The other camp (which is the one I'm in) says do it all locally with a highly efficient data collector, because you need a lot of data (I also read a post in here about logging everything) and you can't possibly monitors 100s or 1Ks of nodes remotely at the granularity necessary to get anything meaningful. Enter collectl and its evolving interface for ganglia. This will allow you to log lots of detailed data on local nodes at the usual 10 sec interval (or more frequent if you prefer) at about 0.1% system overhead while sending a subset at a lower rate to the ganglia gmonds. This would give you the best of both worlds but I don't know if people are too married to the centralized concept to try something different. I don't know how many people who follow this forum have actually tried it, I know at least a few of you have, but to learn more just go to and look at some of documentation or just download the rpm and type 'collectl'. -mark

    Click to read more ...


    Art of scalability (1) - Scalability principles


    Ebay history and architecture

    Ebay[1] Starts in 1995, initial name AuctionWeb (V1) : - very simple architecture - based on perl - no database, for data persistence they used plain files Because of rapid growth they needed to improve their architecture and so V2 (clever name) was born: - replaced perl with C/C++ - started using a database in a master-slave configuration - C++ back-end - XSLT front-end Any request will lead to an XML file being created in C++ and the XLST processor will transform that into html. *pretty sophisticated architecture for the 90s, XLST was cutting-edge back then* ebay v2 That hold out pretty well for a while but in the late 90s ebay experienced an exponential growth. They started having some trouble with outages and needed improvements, so V3 was developed: - based on java - search engine still used C++ - proof that relational databases can scale (aggressive caching) - developed a messaging layer for making a lot of asyncronious calls, they actually ended up being sued because of the delay in which images appear on the site after an item gets posted :-) - although they switched to java the basic principle of generating xml files for each request was still used. ebay v3 Combining the need for a multilingual website with the boom of AJAX technologies and flash applications they started to doubt their XSL system and moved on to what became in 2006 V4 of ebay: - remove everything that could be replaced with java, Ebay loves Java Everything is Java (Code): - Image - Java class - Link - Java class - Javascript - Java classes - Content - Java classes => lot's of code to write, they're using Eclipse for developing. ebay v4 For more details check: Eclipse at Ebay Tailoring Eclipse to the eBay architecture [1] Images and ideas/info are from the links above.

    Click to read more ...


    Lavabit Architecture - Creating a Scalable Email Service

    Ladar Levison of Lavabit has written an incredible article on how they took a centralized off-the-shelf email server that could handle only few thousand users and built their own custom distributed infrastructure for handling hundreds of thousands of email users. Lavabit processes 70 gigabytes of data per day, is made up of 26 servers, hosts 260,000 email addresses, and processes 600,000 emails a day. That's a lot of email.

    Lavabit's mission has a little edge to it too:

    Lavabit was founded as a direct reaction to the larger free e-mail services available. We felt it was possible to create an e-mail service that was fast, reliable, feature rich and didn't achieve profitability by prostituting its user base to marketers.

    What I really like about this article is that Lavabit has some challenging elements in dealing with different email protocols while being able to scale to a lot of users. There's more going on than just trying to scale out a database. Many products contain complicated bits like this, so it's interesting to see how Ladar handled them. There are lots of useful details that will help anyone build their own system. Putting in this extra work in is what Ladar thinks makes Lavabit different:

    One of the ways to gain an advantage over your competition is to invest the time and money needed to build systems that are better than what is easily available to your competition. It is the custom platform we developed that has allowed us to thrive while many other free email companies either stopped offering their service for free, or shut down altogether.

    Since Ladar was so thorough I saved article as a separate html file. Please select the visit link to read the entire article. I'd like to thank Ladar again for taking the time and making the effort to document their architecture for the benefit of the community at large to learn from.


    Performance - When do I start worrying?

    A common problem of the application designers is to predict when they need to start worrying about the Architectural/System improvements on their application. Do I need to add more resources? If yes, then how long before I am compelled to do so? The question is not only when but also what. Should I plan to implement a true caching layer on top of my application or do I need to shard my database. Do I need to move to a distributed search infrastructure and if yes when ! Essentially we try to find out the functionalities of the application that will become critical over time.

    Click to read more ...