Paper: Mind the Gap: Reconnecting Architecture and OS Research

Mind the Gap: Reconnecting Architecture and OS Research is a paper presented at HotOS XIII, the place where researchers talk about making potential futures happen. For a great overview of the conference take a look at this article by Matt Welsh: Conference report: HotOS 2011 in Napa.

In the VM/cloud age I question the need for an OS at all, since programs can compile directly against "raw" hardware, but the paper does a good job of trying to figure out the new role operating systems can play in the future. We've been in a long OS holding pattern, so long that we've seen the rise of PaaS vendors skipping the OS-level abstraction completely, but there's room for a middle ground between the legacy time-sharing systems of the past and service-level APIs that are but one possible future.




Troubleshooting response time problems – why you cannot trust your system metrics

Production Monitoring is about ensuring the stability and health of our system, and that includes the application. Many of the production systems we encounter concentrate on System Monitoring, under the assumption that a stable system leads to stable and healthy applications. So let’s see what System Monitoring can tell us about our Application.

Let’s take a very simple two-tier Web Application:

A simple two tier web application 

This is a simple multi-tier eCommerce solution. Users are concerned about bad performance when they do a search. Let's see what we can find out about it if performance is not satisfactory. We start by looking at a couple of simple metrics.

CPU Utilization



Viddler Architecture - 7 Million Embeds a Day and 1500 Req/Sec Peak  

Viddler is in the high quality Video as a Service business for a customer who wants to pay a fixed cost, be done with it, and just have it work. They are similar to Blip and Ooyala, and more focussed on business than YouTube. They serve thousands of business customers, including high traffic websites like FailBlog, Engadget, and Gawker.

Viddler is a good case to learn from because they are a small company trying to provide a challenging service in a crowded field. We are catching them just as they are transitioning from a startup that began in one direction, as a YouTube competitor, and pivoted into a slightly larger company focussed on paying business customers.

Transition is the key word for Viddler: transitioning from a free YouTube clone to a high quality paid service. Transitioning from a few colo sites that didn't work well to a new higher quality datacenter. Transitioning from an architecture that was typical of a startup to one that features redundancy, high availability, and automation. Transitioning from a lot of experiments to figuring out how they want to do things and making that happen. Transitioning from an architecture where features were spread out amongst geographically distributed teams using different technology stacks to one with clearly defined roles.

In other words, Viddler is like most every other maturing startup out there and that's fun to watch. Todd Troxell, Systems Architect at Viddler, was kind enough to give us an interview and share the details on Viddler's architecture. It's an interesting mix of different technologies, groups, and processes, but it somehow seems to all work. It works because behind all the moving parts is the single idea: making the customer happy and giving them what they want, no matter what. That's not always pretty, but it does get results.


The Stats



Stuff The Internet Says On Scalability For May 6th, 2011

Submitted for your reading pleasure...Hi Mom!...


  • We don't need no stinking servers says the W3C. This Could be Big: Decentralized Web Standard Under Development by W3C by Marshall Kirkpatrick. Browsers talking directly to other browsers. Marshall is right, this could be very big.
  • Quotable Quotes for Pi Alex:
    • @eric_brewer The Amazon outage & CAP theorem: (partition is the root cause)
    • @kylecordes A problem with cloud hosting (EC2) is that it brings the problems of scalability to systems that *don't* need scalability.
    • @virtualpete Last month everyone was a nuclear physicist. Today everyone is a web scalability architect
    • @jfelipe We cannot overlook migration/federation issues (scalability) in cloud tech: open standards are a plus compared 2 closed (Amazon)
    • @lapsu Stored procedures aren't so bad if you write them in Javascript & they do MapReduce. That makes them cool. #nosql
  • Adapteva wants your tablet and phone to have 64 processors. What can you do with all that power? Process the world around you in real-time. Analyzing sound, video, making sense of it, embedding you in a data enchanted world. That's one option anyway.
For a lot more Stuff the Internet Says please read below...



Paper: A Study of Practical Deduplication

With BigData comes BigStorage costs. One way to store less is simply not to store the same data twice. That's the radically simple and powerful notion behind data deduplication. If you are one of those who got a good laugh out of the idea of eliminating SQL queries as a rather obvious scalability strategy, you'll love this one, but it is a powerful feature and one I don't hear talked about outside the enterprise. A parallel idea in programming is the once-and-only-once principle of never duplicating code.

Using deduplication technology, for some upfront CPU usage, which is a plentiful resource in many systems that are IO bound anyway, it's possible to reduce storage requirements by up to 20:1, depending on your data, which saves both money and disk write overhead.
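To make the idea concrete, here is a minimal sketch of block-level deduplication in Python. It is only an illustration under stated assumptions, not how ZFS or any product discussed in the paper implements it: the DedupStore class, its method names, and the fixed block size are all hypothetical, and it never frees blocks that are no longer referenced. Each block is hashed (the upfront CPU cost), and a block whose hash has already been seen is not stored again.

```python
import hashlib


class DedupStore:
    """Toy in-memory block store that keeps one copy of each distinct block.

    Hypothetical illustration only; names and block size are assumptions.
    """

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # SHA-256 hex digest -> block bytes, stored once
        self.files = {}   # file name -> ordered list of block digests

    def put(self, name, data):
        """Split data into fixed-size blocks and store only unseen blocks."""
        digests = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            digest = hashlib.sha256(block).hexdigest()  # CPU spent up front
            self.blocks.setdefault(digest, block)       # skip known blocks
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        """Rebuild a file from its list of block digests."""
        return b"".join(self.blocks[d] for d in self.files[name])

    def logical_bytes(self):
        """Bytes the files would occupy without deduplication."""
        return sum(len(self.blocks[d])
                   for digests in self.files.values() for d in digests)

    def physical_bytes(self):
        """Bytes actually held in the block table."""
        return sum(len(block) for block in self.blocks.values())
```

Storing two files that share blocks and then comparing logical_bytes() with physical_bytes() gives the deduplication ratio for that data. How close real data gets to the ratios quoted above depends entirely on how much of it repeats.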

This comes up because of a really good article Robin Harris of StorageMojo wrote, All de-dup works, on a paper, A Study of Practical Deduplication by Dutch Meyer and William Bolosky.

For a great explanation of deduplication we turn to Jeff Bonwick and his experience on the ZFS Filesystem:



Sponsored Post: Percona, Mathworks, AppDynamics, Gazillion, Edmunds, OPOWER, ClearStone, ScaleOut, aiCache, WAPT, Karmasphere, Newrelic, Cloudkick, Membase, CloudSigma, ManageEngine, Site24x7

Who's Hiring?

  • MathWorks Looking for Multiple, Full-time Scaling Experts. Apply now:
  • Gazillion Entertainment is looking for a Web Developer Generalist to work on massively multiplayer online games. Please apply here
  • Edmunds helps people find the car that meets their every need. We’re currently hiring talented Java Developers in the Los Angeles area.
  • OPOWER motivates millions to become more energy efficient, and we're hiring!

Fun and Informative Events

  • Percona is running an intensive one-day MySQL conference in New York City on May 26th.  High Scalability readers save $50 with the code PLNY-HiSc. Learn more and register at
  • Interested in CouchDB or Membase Training? CouchOne just announced dates for our CouchDB Developer and Membase Server Ops Training. Click here to learn more or register today.

Cool Products and Services

  • AppDynamics is the very first free product designed for troubleshooting Java performance while getting full visibility in production environments. Visit
  • APM (Application Performance Management) for NOSQL, Java and More - Try ClearStone 5.0. Download ClearStone 5.0 today!
  • ScaleOut StateServer - Scale Out Your Server Farm Applications!
  • aiCache creates a better user experience by increasing the speed, scale, and stability of your website.
  • WAPT is a load, stress and performance testing tool for websites and web-based applications.
  • Karmasphere is bringing Apache Hadoop power to developers and analysts. Download your Free Community Edition today!
  • Newrelic - What are you doing to ensure the performance of your apps?
  • Cloudkick - monitor & manage your servers better with a FREE Cloudkick developer account.
  • CloudSigma. Instantly scalable European cloud servers.
  • ManageEngine Applications Manager : Monitor physical, virtual and Cloud Applications.
  • Site24x7: Monitor End User Experience from a global monitoring network.

For a longer description of each sponsor please read more below...



The Updated Big List of Articles on the Amazon Outage

Since The Big List Of Articles On The Amazon Outage was published we've had a few updates that people might not have seen. Amazon of course released their Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region. Netflix shared their Lessons Learned from the AWS Outage as did Heroku (How Heroku Survived the Amazon Outage), SmugMug (How SmugMug survived the Amazonpocalypse), and SimpleGeo (How SimpleGeo Stayed Up During the AWS Downtime).

The curious thing from my perspective is the general lack of response to Amazon's explanation. I expected more discussion. There's been almost none that I've seen. My guess is very few people understand what Amazon was talking about enough to comment whereas almost everyone feels qualified to talk about the event itself.

Lesson for crisis handlers: deep dive post-mortems that are timely, long, honestish, and highly technical are the most effective means of staunching the downward spiral of media attention. 

Amazon's Explanation of What Happened



Stack Overflow Makes Slow Pages 100x Faster by Simple SQL Tuning

The most common complaint against NoSQL is that if you know how to write good SQL queries then SQL works fine. If SQL is slow you can always tune it and make it faster. A great example of this incremental improvement process was written up by StackExchange's Sam Saffron, in A day in the life of a slow page at Stack Overflow, where he shows that, through profiling and SQL tuning, it was possible to reduce page load times from 630ms to 40ms for some pages, and for other pages the improvement was 100x.

Sam provides a lot of wonderful detail of his tuning process, how it works, the thought process, the tools used, and the tradeoffs involved. Here's a short summary of the steps:



PaaS on OpenStack - Run Applications on Any Cloud, Any Time Using Any Thing

Yesterday, I had a session during the OpenStack Summit where I tried to present a more general view on how we should be thinking about PaaS in the context of OpenStack.

The key takeaway:

The main goal of PaaS is to drive productivity into the process by which we can deliver new applications.

Most of the existing PaaS solutions take a fairly extreme approach with their abstraction of the underlying infrastructure and therefore fit a fairly small number of extremely simple applications and thus miss the real promise of PaaS.

Amazon's Elastic Beanstalk took a more bottom-up approach, giving us a better set of tradeoffs between abstraction and control, which makes it applicable to a larger set of applications.

The fact that OpenStack is open source allows us to think differently about the things we can do at the platform layer. We can create a tighter integration between the PaaS and IaaS layers and thus come up with a better set of tradeoffs in the way we drive productivity without giving up control. Specifically that means that:

  • Anyone should be able to:
    • Build their own PaaS in a snap
    • Run on any cloud (public/private)
    • Gain multi-tenancy, elasticity… Without code changes.
  • Provide a significantly higher degree of control without adding substantial complexity over our:
    • Language choice
    • Operating System
    • Middleware stack
  • Should come pre-integrated with a popular stack:
    • Spring, Tomcat, DevOps, NoSQL, Hadoop...
    • Designed to run the most demanding mission-critical app

You can read the full story and see the demo here


Heroku Emergency Strategy: Incident Command System and 8 Hour Ops Rotations for Fresh Minds

In Resolved: Widespread Application Outage, Heroku tells their story of how they dealt with the Amazon outage. While taking 100% responsibility for the downtime, they also shared a number of the strategies they used to bring their service back to full working order.

One of Heroku's most interesting strategies wasn't a technical hack at all, but how they consciously went about deploying their Ops personnel in response to the emergency. An outline of their strategy is:
