Monday
Oct 4, 2010

Paper: An Analysis of Linux Scalability to Many Cores  

An Analysis of Linux Scalability to Many Cores, by a number of MIT researchers, is a refreshingly practical paper on what it takes to scale Linux and common applications like Exim, memcached, Apache, PostgreSQL, gmake, Psearchy, and MapReduce to run on 48-core systems. A very timely paper, given that moderately massive multicore systems are reportedly the near future of computing.

This paper must have taken a lot of work. The authors tracked down bottlenecks in both the Linux kernel and a number of applications, and then set about fixing them. They describe their changes to the kernel and applications as "modest," but there's nothing modest about what they did. It's excellent work.

After the abstract below, there's a list of the problems they found and how they fixed them.

Click to read more ...

Friday
Oct 1, 2010

Hot Scalability Links For Oct 1, 2010

Click to read more ...

Friday
Oct 1, 2010

Google Paper: Large-scale Incremental Processing Using Distributed Transactions and Notifications

This paper, Large-scale Incremental Processing Using Distributed Transactions and Notifications by Daniel Peng and Frank Dabek, is Google's much anticipated description of Percolator, their new real-time indexing system.

The abstract:

Updating an index of the web as documents are crawled requires continuously transforming a large repository of existing documents as new documents arrive. This task is one example of a class of data processing tasks that transform a large repository of data via small, independent mutations. These tasks lie in a gap between the capabilities of existing infrastructure. Databases do not meet the storage or throughput requirements of these tasks: Google’s indexing system stores tens of petabytes of data and processes billions of updates per day on thousands of machines. MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency.

We have built Percolator, a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index. By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, we process the same number of documents per day, while reducing the average age of documents in Google search results by 50%. 
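The gap the abstract describes is easier to see in code. Below is a deliberately tiny, purely illustrative sketch of the incremental style: each arriving document triggers a small, independent, transactional mutation rather than a periodic batch rebuild. None of this is Percolator's actual API; the transaction class, observer function, and repository are made-up stand-ins.

```python
# Purely illustrative sketch, not Percolator's API: the point is the shape of
# incremental processing, where each arriving document triggers a small,
# independent, transactional mutation instead of a periodic batch rebuild.

class Transaction:
    """Hypothetical transactional view over a shared repository (a stand-in
    for cross-row transactions on a Bigtable-like store)."""

    def __init__(self, store):
        self.store = store
        self.writes = {}

    def get(self, key):
        return self.writes.get(key, self.store.get(key))

    def set(self, key, value):
        self.writes[key] = value

    def commit(self):
        # A real system would detect conflicts and retry here.
        self.store.update(self.writes)


def on_document_crawled(store, url, content):
    """Observer-style hook: when one document arrives, update only the index
    entries that this document touches."""
    txn = Transaction(store)
    for word in set(content.split()):
        postings = txn.get(("index", word)) or set()
        txn.set(("index", word), postings | {url})
    txn.set(("doc", url), content)
    txn.commit()


repository = {}  # stands in for the shared document/index repository
on_document_crawled(repository, "http://example.com", "hello web")
on_document_crawled(repository, "http://example.org", "hello again")
print(repository[("index", "hello")])  # both URLs, updated incrementally
```

In the sketch, the cost of each update is proportional to the one document that changed, not to the size of the whole repository, which is exactly the property the batch approach gives up.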

Thursday
Sep 30, 2010

More Troubles with Caching

As a tasty pairing with Facebook And Site Failures Caused By Complex, Weakly Interacting, Layered Systems, here is another excellent tale of caching gone wrong by Peter Zaitsev, in an exciting twin billing: Cache Miss Storm and More on dangers of the caches. This is a fascinating case where the cause turned out to be a software upgrade that ran long because it had to be rolled back. During the long recovery window many of the cache entries timed out. When the database came back, slam, all the clients queried the database to repopulate the cache and bad things happened to the database. The solution was equally interesting:
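To get a feel for the general shape of a fix, here's a minimal sketch (my own illustration, not necessarily the solution Peter describes; the cache and loader names are made up) of one common defense: let a single caller rebuild an expired entry while everyone else keeps serving the stale value, so the recovering database sees one query per key instead of thousands.

```python
import threading
import time

# Toy in-process cache illustrating one common defense against a cache miss
# storm: only one caller recomputes an expired entry while everyone else
# keeps serving the stale value, so the database sees one query per key.

_cache = {}          # key -> (value, expires_at)
_locks = {}          # key -> lock guarding regeneration of that key
_locks_guard = threading.Lock()


def get_with_stale_fallback(key, ttl, load_from_db):
    now = time.time()
    entry = _cache.get(key)
    if entry and entry[1] > now:
        return entry[0]                      # fresh hit

    with _locks_guard:
        lock = _locks.setdefault(key, threading.Lock())

    if lock.acquire(blocking=False):         # we are the one rebuilder
        try:
            value = load_from_db(key)        # single database query
            _cache[key] = (value, now + ttl)
            return value
        finally:
            lock.release()
    elif entry:
        return entry[0]                      # expired but usable: serve stale
    else:
        with lock:                           # cold cache: wait for the rebuilder
            return _cache[key][0]
```

Variants of the same idea include probabilistic early expiration and refreshing hot keys from a background job before they ever expire.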

Click to read more ...

Thursday
Sep 30, 2010

Facebook and Site Failures Caused by Complex, Weakly Interacting, Layered Systems

Facebook has been so reliable that when a site outage does occur it's a definite learning opportunity. Fortunately for us we can learn something, because in More Details on Today's Outage, Facebook's Robert Johnson gave a pretty candid explanation of what caused a rare 2.5-hour period of downtime for Facebook. It wasn't a simple problem. The root causes were feedback loops and transient spikes, caused ultimately by the complexity of weakly interacting layers in modern systems. You know, the kind everyone is building these days. Problems like this are notoriously hard to fix, and finding a real solution may send Facebook back to the whiteboard. There's a technical debt that must be paid.

Here's the outline, and my interpretation (reading between the lines), of what happened:

Click to read more ...

Tuesday
Sep 28, 2010

Sponsored Post: Wiredrive, Joyent, DeviantART, CloudSigma, ManageEngine, Site24x7

Who's Hiring?

Cool Products and Services

Click to read more ...

Tuesday
Sep 28, 2010

6 Strategies for Scaling BBC iPlayer

The BBC's iPlayer site averages 8 million page views a day for 1.3 million users. Technical Architect Simon Frost describes how they scaled their site in Scaling the BBC iPlayer to handle demand:

  1. Use frameworks. Frameworks support component-based development, which is convenient for team development, but they can introduce delays that have to be minimized. Zend/PHP is used because it supports components and is easy to recruit for. MySQL is used for program metadata. CouchDB is used for key-value access for fast read/write of user-focused data.
  2. Prove the architecture before building it. Eliminate guesswork by coming up with alternative architectures and creating prototypes to determine which option works best. Balance performance against factors like ease of development.
  3. Cache a lot. Data is cached in memcached for a few seconds to a few minutes. Short cache invalidation periods keep the data up to date for users, but even these short periods make a huge difference in performance; caching doesn't have to be long-lived to pay off. Varnish is used to cache HTML pages. Much of the invalidation is time or action-based (e.g. someone adds a new favourite). A minimal sketch of this kind of short-TTL caching follows the list.
  4. Click to read more ...
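Here is the sketch promised in point 3: a cache-aside lookup with a deliberately short TTL, using the pymemcache client as an example. The key format, TTL value, and fetch_programme_from_db function are all made up for illustration; this is not the BBC's code.

```python
import json
from pymemcache.client.base import Client  # assumes the pymemcache library

cache = Client(("localhost", 11211))
PROGRAMME_TTL = 30  # seconds; even a short TTL sheds most of the read load


def fetch_programme_from_db(programme_id):
    """Placeholder for the real metadata query (MySQL in the article)."""
    return {"id": programme_id, "title": "example programme"}


def get_programme(programme_id):
    key = "programme:%s" % programme_id
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)          # served from memcached
    data = fetch_programme_from_db(programme_id)
    cache.set(key, json.dumps(data), expire=PROGRAMME_TTL)
    return data
```

With a 30-second TTL, a programme page taking hundreds of requests per second still results in roughly one database query per key every 30 seconds.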

Thursday
Sep 23, 2010

Working With Large Data Sets

This is an excerpt from my blog post Working With Large Data Sets...

For the past 18 months I’ve moved from working on the SMTP proxy to working on our other systems, all of which make use of the data we collect from each connection. It’s a fair amount of data and it can be up to 2Kb in size for each connection. Our servers receive approximately 1000 of these pieces of data per second, which is fairly sustained due to our global distribution of customers. If you compare that to Twitter’s peak of 3,283 tweets per second (maximum of 140 characters), you can see it’s not a small amount of data that we are dealing with here.
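To put that in perspective, here's a quick back-of-the-envelope calculation using the figures above, treating 2Kb as the per-record ceiling, so the results are an upper bound:

```python
# Upper-bound data volume from the figures quoted above:
# roughly 1000 records per second, each up to 2 KB.
records_per_second = 1000
max_record_bytes = 2 * 1024

bytes_per_day = records_per_second * max_record_bytes * 60 * 60 * 24
print("up to %.0f GB per day" % (bytes_per_day / 1024**3))            # ~165 GB/day
print("up to %.1f TB over 66 days" % (bytes_per_day * 66 / 1024**4))  # ~10.6 TB
```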

I recently set out to scientifically prove the benefits of throttling, which is our technology for slowing down connections in order to detect spambots, who are kind enough to disconnect quite quickly when they see a slow connection. Given the nature of the data we had, I needed a long range of data to show that an IP which later appeared on Spamhaus had previously been throttled and disconnected, and then to measure the duration until it appeared on Spamhaus. I set up a job to pre-process a selected set of customers' data and arbitrarily decided 66 days would be a good amount to process, as this was 2 months plus a little breathing room. I knew from experience that it could take up to 2 months for a bad IP to be picked up by Spamhaus.

I extracted 28,204,693 distinct IPs, some of which were seen over a million times in this data set.
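The measurement itself boils down to joining two per-IP event streams: when an IP was first throttled and disconnected, and when it first appeared on Spamhaus. A hypothetical sketch of that join is below; the function and data structures are illustrative stand-ins, not the actual pipeline described in the post.

```python
from datetime import datetime

# Hypothetical join of two pre-processed per-IP event streams:
#   throttled[ip]     -> time the IP was first throttled and disconnected
#   spamhaus_seen[ip] -> time the IP first appeared on Spamhaus
# The quantity of interest is the lag between the two events.

def time_to_listing(throttled, spamhaus_seen):
    lags = {}
    for ip, throttle_time in throttled.items():
        listed_at = spamhaus_seen.get(ip)
        if listed_at is not None and listed_at >= throttle_time:
            lags[ip] = listed_at - throttle_time
    return lags


throttled = {"192.0.2.1": datetime(2010, 7, 1, 12, 0)}
spamhaus_seen = {"192.0.2.1": datetime(2010, 8, 15, 9, 30)}
print(time_to_listing(throttled, spamhaus_seen))  # lag of about 44.9 days
```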

Click here to read more...

Wednesday
Sep 22, 2010

Applying Scalability Patterns to Infrastructure Architecture

Software design patterns are too often overlooked by network and application delivery architects, yet these patterns are equally applicable to a broad range of architectural challenges in the application delivery tier of the data center.

Click to read more ...

Tuesday
Sep 21, 2010

Sponsored Post: Joyent, DeviantART, CloudSigma, ManageEngine, Site24x7

Who's Hiring?

Cool Products and Services

Click to read more ...