Stuff the Internet Says on Scalability For November 29th, 2010

Eating turkey all weekend and wondering what you might have missed?

  • James Hamilton on why “all you have learned about disks so far is probably wrong” in Availability in Globally Distributed Storage. It turns out the culprit is the same thing that makes our financial systems melt down: black swans. The world is predictably unpredictable. Murat Demirbas also has a good post on the same Google research paper.
  • Stack Overflow Hits 10M Uniques
  • Vroom... A Formula One racecar streams 27 gigabytes of telemetry data during a race weekend! 200 sensors “measuring anything and everything that moves or gets warm.”
  • Quotable Quotes:
    • @dmalenko: It is cool to sit by the ocean, oversee the sunset and think about scalability models for a web app
    • @detroitpro: I have to admit; sometimes I think "This would be easier with a SQL DB" #NoSQL #NotOften #ComplextRelationships #FindingRootObjects
  • You may have missed the Google App Engine cage match. First GAE sucks and then it's great. Whatever your conclusion, it's an informative discussion. GAE is one of the biggest managed hosting systems in the world and it's based on a distributed file system, so the restrictions are there for a reason. Royans has a nice post summarizing some features of GAE 1.4.0 that will remove some of the problems. You will be able to buy an Always On feature that keeps your instances warm so they can respond to requests. Remember, GAE is a "pay for what you use" system, so this won't be free.
  • How long does it take to make a context switch? Tsuna tests 3 different generations of CPUs and concludes: “Context switching is expensive. My rule of thumb is that it'll cost you about 30µs of CPU overhead. Applications that create too many threads that are constantly fighting for CPU time (such as Apache's HTTPd or many Java applications) can waste considerable amounts of CPU cycles just to switch back and forth between different threads. I think the sweet spot for optimal CPU use is to have the same number of worker threads as there are hardware threads, and write code in an asynchronous / non-blocking fashion.”
  • StumbleUpon open-sourced their main monitoring system, OpenTSDB: a distributed, scalable Time Series Database (TSDB) written on top of HBase. OpenTSDB allows you to collect many thousands of metrics from thousands of hosts and applications, at a high rate (every few seconds). OpenTSDB will never delete or downsample data and can easily store billions of data points. As a matter of fact, StumbleUpon uses it to keep track of hundreds of thousands of time series and collects over 100 million data points per day in their main production cluster.
  • Partitioning MySQL database with high load solutions. Using MySQL's partitioning capability to handle a daily waterfall of about 30 million INSERTs, plus a few large daily analytic SELECTs.
  • OrientDB is the new kid on the block, and as this slide deck shows, this kid has skills: The Document Database with the support of ACID Transactions, SQL and Native Queries, Asynchronous Commands, Intents, and much more.
  • Inside SQL Azure. Kalen Delaney examines the internals of the SQL Azure databases, and how they are managed in the Microsoft Data Centers. 
  • Mike's Place with a nice write-up on Automatic Monitoring with Puppet and Nagios.
  • A free preview of Scalability Rules: 50 Principles for Scaling Web Sites written by Michael T. Fisher and Martin L. Abbott, authors of the excellent Art of Scalability. This one is looking good too.
  • Rohit Garg with a First look at Google BigQuery. The BigQuery API shows good performance in scenarios where you have huge amounts of data that need to be processed. Running any query on almost 28 million rows uploaded in a test data set returns a response in just 2-3 seconds. That time includes posting the request and receiving the response.
  • Baron Schwartz talks about Scaling Without Sharding. Scalability = Performance = Tasks and Time
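
Tsuna's rule of thumb from the context switch item above (about 30µs per switch, and one worker thread per hardware thread) can be roughed out in a few lines of Python. This is only a sketch under stated assumptions: the helper names are made up for illustration, the 30µs figure is the quoted rule of thumb and not a measurement, and in CPython the GIL keeps threads from running Python bytecode in parallel, so the point here is the pool sizing, not the speedup.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Rule-of-thumb cost per context switch quoted above, in microseconds.
# This is an estimate from the linked post, not something measured here.
CONTEXT_SWITCH_COST_US = 30.0


def hardware_threads() -> int:
    """Number of logical CPUs, falling back to 1 if it cannot be determined."""
    return os.cpu_count() or 1


def switch_overhead_fraction(switches_per_second: float,
                             cost_us: float = CONTEXT_SWITCH_COST_US) -> float:
    """Fraction of one CPU-second spent switching (hypothetical helper)."""
    return switches_per_second * cost_us / 1_000_000


def make_worker_pool() -> ThreadPoolExecutor:
    """Thread pool sized to one worker per hardware thread."""
    return ThreadPoolExecutor(max_workers=hardware_threads())


if __name__ == "__main__":
    # Run a few small tasks on a pool that matches the hardware thread count.
    with make_worker_pool() as pool:
        squares = list(pool.map(lambda n: n * n, range(8)))
    print(squares)
    print(switch_overhead_fraction(10_000))
```

At 10,000 context switches per second, the 30µs rule of thumb works out to 0.3 CPU-seconds every second, or roughly 30% of one core spent just switching.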