Hot Scalability Links for July 2, 2010

  • What says 4th of July like Nathan's ultimate scalable hot dog eating contest? This totally requires a scale-up strategy.
  • Facebook at 60,000 servers and counting.
  • Deepak Singh has collected some impressive massive data stats on extreme Hadoop usage: Facebook: 36 PB of uncompressed data, 2,250 machines, 23,000 cores, 32 GB of RAM per machine, processing 80-90 TB/day; Yahoo: 70 PB of data in HDFS, 170 PB spread across the globe, 34,000 servers, processing 3 PB per day, 120 TB flowing through Hadoop every day; Twitter: 7 TB/day into HDFS; LinkedIn: 120 billion relationships, 82 Hadoop jobs daily (IIRC), 16 TB of intermediate data.
  • Who knew DevOps could be so funny? Adam Jacob, CTO of Opscode, gave a hilarious talk at the Velocity conference on the true nature of DevOps. Warning: your neck may get sore from nodding in agreement and your belly may ache from laughing.


Paper: GraphLab: A New Framework For Parallel Machine Learning

In the never-ending quest to figure out how to do something useful with never-ending streams of data, GraphLab: A New Framework For Parallel Machine Learning wants to go beyond low-level programming, MapReduce, and dataflow languages with a new parallel framework for ML (machine learning) that exploits the sparse structure and common computational patterns of ML algorithms. According to the paper, GraphLab enables ML experts to easily design and implement efficient, scalable parallel algorithms by composing problem-specific computation, data dependencies, and scheduling. The authors list their main contributions as:

  • A graph-based data model which simultaneously represents data and computational dependencies. 
  • A set of concurrent access models which provide a range of sequential-consistency guarantees. 
  • A sophisticated modular scheduling mechanism. 
  • An aggregation framework to manage global state. 
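
To make the four pieces above a little more concrete, here is a minimal single-threaded sketch in Python. It is not GraphLab's API: the names DataGraph, pagerank_update, run_fifo, and total_rank are invented for illustration, PageRank is just a convenient example algorithm, and the sketch ignores the concurrent access models entirely by running one update at a time.

```python
from collections import deque


class DataGraph:
    """Hypothetical data graph: vertices hold data, edges encode dependencies."""

    def __init__(self):
        self.vertex_data = {}
        self.out_edges = {}
        self.in_edges = {}

    def add_vertex(self, v, value):
        self.vertex_data[v] = value
        self.out_edges.setdefault(v, [])
        self.in_edges.setdefault(v, [])

    def add_edge(self, src, dst):
        self.out_edges[src].append(dst)
        self.in_edges[dst].append(src)


def pagerank_update(graph, v, damping=0.85, tolerance=1e-6):
    """Update function: recompute v from its in-neighbours only.

    Returns the vertices that should be rescheduled because v changed.
    Assumes every vertex has at least one outgoing edge.
    """
    n = len(graph.vertex_data)
    incoming = sum(
        graph.vertex_data[u] / len(graph.out_edges[u]) for u in graph.in_edges[v]
    )
    new_rank = (1 - damping) / n + damping * incoming
    changed = abs(new_rank - graph.vertex_data[v]) > tolerance
    graph.vertex_data[v] = new_rank
    return graph.out_edges[v] if changed else []


def run_fifo(graph, update, max_updates=100_000):
    """Scheduler: a FIFO work queue; other orderings could be swapped in."""
    queue = deque(graph.vertex_data)
    pending = set(queue)
    updates = 0
    while queue and updates < max_updates:
        v = queue.popleft()
        pending.discard(v)
        for w in update(graph, v):
            if w not in pending:
                queue.append(w)
                pending.add(w)
        updates += 1
    return updates


def total_rank(graph):
    """Aggregation: fold over all vertex data to produce a global value."""
    return sum(graph.vertex_data.values())


# Tiny example graph: a -> b, a -> c, b -> c, c -> a
g = DataGraph()
for name in ("a", "b", "c"):
    g.add_vertex(name, 1.0 / 3)
for src, dst in (("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")):
    g.add_edge(src, dst)
run_fifo(g, pagerank_update)
```

The point of the sketch is the separation of concerns: the graph carries both the data and the dependency structure, the update function touches only a vertex's neighbourhood, and the scheduler decides what runs next. A parallel implementation would additionally need to guarantee that overlapping neighbourhoods are not updated at the same time.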



VoltDB Decapitates Six SQL Urban Myths and Delivers Internet Scale OLTP in the Process

What do you get when you take a SQL database and start a new implementation from scratch, taking advantage of the latest research and modern hardware? Mike Stonebraker, the sword wielding Johnny Appleseed of the database world, hopes you get something like his new database, VoltDB: a pure SQL, pure ACID, pure OLTP, shared nothing, sharded, scalable, lockless, open source, in-memory DBMS, purpose-built for running hundreds of thousands of transactions a second. VoltDB claims to be 100 times faster than MySQL, up to 13 times faster than Cassandra, and 45 times faster than Oracle, with near-linear scaling.

Will VoltDB kill off the new NoSQL upstarts? Will VoltDB cause a mass extinction of ancient databases? Probably no to both questions, but it's a product with a definite point of view and is worth a look as the transaction component in your system. But will it be right for you? Let's see...



Hot Scalability Links for June 25, 2010

  • Royans Tharakan is blogging like a madman at the Velocity Conference. Read a summary of many of the presentations on his blog.
  • Zuckerberg almost guarantees 1 billion Facebook users. And I almost believe him.
  • Northscale introduces Membase, a new distributed key-value NoSQL competitor that offers a memcached-compatible interface yet is persistent like a database. Hopefully we'll have more on their internals later.
  • Notable Tweets: 
    • Aaron Cordova - scalability means "can change size" and also "works at large sizes" - this conflates two orthogonal features of cloud computing. 
    • Jaime Garcia Reinoso - It's the scalability, stupid! 
    • Alex Averbuch - when I read/hear "unlimited/infinite scalability" I stop reading/listening and start thinking about cake.
    • Dennis Clark - I used to smirk at developers whose main DB experience was in MUMPS or Pick, until I realized those are old-school #NoSQL engines.



Product: dbShards - Share Nothing. Shard Everything.

I met the CodeFutures folks, makers of dbShards, at Gluecon. They occupy an interesting niche in the database space, somewhere between NoSQL, which jettisons everything SQL, and high-end analytics platforms that completely rewrite the backend while keeping a SQL facade.

High concept: I think of dbShards as a sort of commercial OLTP mashup of features from HSCALE (partitioning) + MySQL Proxy (transparent intermediate layer) + Memcached (client side sharding) + Gigaspaces (parallel query) + MySQL (transactions).
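
Two of the ingredients in that mashup, key-based sharding and parallel query, can be sketched in a few lines of Python. This is only an illustration of the general technique, not how dbShards works internally: the ShardedTable class and its methods are hypothetical, in-memory dicts stand in for the MySQL instances, and routing by CRC32 modulo the shard count is an assumption made for the example.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


class ShardedTable:
    """Hypothetical sharded table; each dict stands in for one database shard."""

    def __init__(self, shard_count):
        self.shards = [dict() for _ in range(shard_count)]

    def shard_for(self, key):
        # Stable hash so the same key always routes to the same shard.
        return zlib.crc32(str(key).encode("utf-8")) % len(self.shards)

    def put(self, key, row):
        self.shards[self.shard_for(key)][key] = row

    def get(self, key):
        # Single-key reads and writes touch exactly one shard.
        return self.shards[self.shard_for(key)].get(key)

    def parallel_query(self, predicate):
        # Scatter: run the same filter on every shard concurrently.
        def scan(shard):
            return [row for row in shard.values() if predicate(row)]

        with ThreadPoolExecutor(max_workers=len(self.shards)) as pool:
            partials = list(pool.map(scan, self.shards))
        # Gather: merge the per-shard partial results.
        return [row for part in partials for row in part]


users = ShardedTable(shard_count=4)
for user_id in range(100):
    users.put(user_id, {"id": user_id, "age": user_id})

older_users = users.parallel_query(lambda row: row["age"] >= 90)
```

Key lookups go to a single shard, which is what lets writes and reads scale out, while queries that cannot be routed by key fan out to every shard and are merged afterwards. Modulo routing is the simplest choice, but it forces most keys to move when the shard count changes, so a real system would likely use something more elaborate.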

You may find dbShards interesting if you are looking to keep SQL, need scale-out writes and reads, need out-of-the-box parallel query capabilities, and would prefer to use a standard platform like MySQL as a base. To learn more about dbShards I asked Cory Isaacson (CEO and CTO) a few devastatingly difficult questions (not really).

Who are you, what is dbShards, and what problem was dbShards created to solve?



Exploring the software behind Facebook, the world’s largest site

Peter Alguacil at Pingdom wrote a HighScalability-worthy article on Facebook's architecture: Exploring the software behind Facebook, the world’s largest site. It covers the challenges Facebook faces, the software Facebook uses, and the techniques Facebook uses to keep on scaling. Definitely worth a look.





Paper: The Declarative Imperative: Experiences and Conjectures in Distributed Logic

The Declarative Imperative: Experiences and Conjectures in Distributed Logic was written by UC Berkeley's Joseph Hellerstein for a keynote speech he gave at PODS. The video version of the talk is here. You may have heard about Mr. Hellerstein through the Berkeley Orders Of Magnitude project (BOOM), whose purpose is to help people build systems that are OOM (orders of magnitude) bigger than those being built today, with OOM less effort than traditional programming methodologies. A noble goal, which may be why BOOM was rated as a top 10 emerging technology for 2010 by MIT Technology Review. Quite an honor.

The motivation for the talk is a familiar one: it's a dark period for computer programming, and if we don't learn how to write parallel programs the children of Moore's law will destroy us all. We have more and more processors, yet we are stuck on figuring out how the average programmer can exploit them. The BOOM solution is the Bloom language, which is based on Dedalus.



WTF is Elastic Data Grid? (By Example)

Forrester released their new wave report, The Forrester Wave™: Elastic Caching Platforms, Q2 2010, where they listed GigaSpaces, IBM, Oracle, and Terracotta as leading vendors in the field. In this post I'd like to take some time to explain what some of these terms mean, and why they’re important to you. I’ll start with a definition of Elastic Data Grid (Elastic Caching) and how it differs from other caching and NoSQL alternatives, and, more importantly, I'll illustrate how it works through some real code examples.

You can read the full story here.


Hot Scalability Links for June 16, 2010

  • You're Doing it Wrong by Poul-Henning Kamp. Don't look so guilty, he's not talking about you know what, he's talking about writing high-performance server programs: Not just wrong as in not perfect, but wrong as in wasting half, or more, of your performance. What good is an O(log2(n)) algorithm if those operations cause page faults and slow disk operations? For most relevant datasets an O(n) or even an O(n^2) algorithm, which avoids page faults, will run circles around it. 
  • A Microsoft Windows Azure primer: the basics by Peter Bright. Nice article explaining the basics of Azure and how it compares to Google and Amazon.
  • A call to change the name from NoSQL to Postmodern Databases. Interesting idea, but the problem is the same one I have with Postmodern Art: when is it? I always feel like I'm in the post-postmodern period, yet for art it's really in the early 1900s. Let's save future developers from this existential time crisis.
  • Constructions from Dots and Lines by Marko A. Rodriguez, Peter Neubauer. Delightful yet in-depth explanation of the complex world of graph data structures. To make use of the graphs beyond simply representing their explicit structure, graph traversal frameworks and algorithms have been developed in order to shape graphs by driving the evolution of the entities that they model—e.g. humans and their relationships to one another and the objects of their world
