Intercloud: How Will We Scale Across Multiple Clouds?

A Brief History of the Internet explains that the Internet was based on the idea that there would be multiple independent networks of rather arbitrary design. The Internet as we now know it embodies a key underlying technical idea, namely that of open architecture networking. In this approach, the choice of any individual network technology was not dictated by a particular network architecture but rather could be selected freely by a provider and made to interwork with the other networks through a meta-level "Internetworking Architecture".

With the cloud we are in the same situation today, just a layer or two higher up the stack. We have independent clouds that we would like to connect and have work together seamlessly, preferably with the same ease with which we currently connect nodes to a network and networks to the Internet. This technology seems to be called the Intercloud: an interconnected global "cloud of clouds" as opposed to the Internet, which is a "network of networks."

Vint Cerf, the person most often called the father of the Internet, says in Cloud Computing and the Internet that we are ripe for an Intercloud in the same way we were once ripe for the Internet:

Click to read more ...


Hot Scalability Links for April 1, 2010

  1. Why NoSQL Will Not Die. Stephan Schmidt explains why you may wait a long time for NoSQL to go to that great bit bucket in the sky.
  2. DBMS Musings: Distinguishing Two Major Types of Column-Stores by Daniel Abadi. I have noticed that Bigtable, HBase, Hypertable, and Cassandra are being called column-stores with increasing frequency, due to their ability to store and access column families separately. This makes them appear to be in the same category as column-stores such as Sybase IQ, C-Store, Vertica, VectorWise, MonetDB, ParAccel, and Infobright, which also are able to access columns separately.
  3. Cloud Economics, By The Square Foot by Rich Miller. But cloud computing offers a middle path, offering cost and usability advantages for customers, as well as an attractive return for providers.
  4. PostgreSQL: meet your queue by Theo Schlossnagle. I really think that cueing your database to publish over AMQP is the bees knees and it turns out I wasn't alone!
  5. Scaling GIS Data in Non-relational Data Store by Mike Malone. How SimpleGeo uses NoSQL and other technologies. Yes, they still use memcached. Caching ain’t going anywhere.
  6. CLTV45: The Evolution of the Graph Data Structure from Research to Production. In this recording from “NoSQL Live Boston” we learn how Graph Data Structures evolved from research into production.
  7. Spanner: Google’s next Massive Storage and Computation infrastructure by Royans. MapReduce, Bigtable and Pregel have their origins in Google and they all deal with “large systems”. But all of them may be dwarfed in size and complexity by a new project Google is working on.

Click to read more ...


Running Large Graph Algorithms - Evaluation of Current State-of-the-Art and Lessons Learned

On the surface nothing appears more different than soft data and hard raw materials like iron. Then isn’t it ironic, in the Alanis Morissette sense, that in this Age of Information, great wealth still lies hidden deep beneath piles of stuff? It's so strange how directly digging for dollars in data parallels the great wealth producing models of the Industrial Revolution.

The pile of stuff is the Internet. It takes lots of prospecting to find the right stuff. Mighty web crawling machines tirelessly collect stuff, bringing it into their huge maws, then depositing load after load into rack after rack of distributed file system machines. Then armies of still other machines take this stuff and strip out the valuable raw materials, which in the Information Age are endless bytes of raw data. Link clicks, likes, page views, content, headlines, searches, inbound links, outbound links, search clicks, hashtags, friends, purchases: anything and everything you do on the Internet is a valuable raw material.

By itself data is no more useful than a truckload of iron ore. Data must be brought to a factory. It must be purified, processed, and formed. That’s the job for a new field of science called Data Science. Yes, while you weren't looking a whole new branch of science was created. It makes sense in a way. Since data is a new kind of material, we need a new profession paralleling that of the Materials Scientist: someone who seeks to deeply understand data, the Data Scientist. We aren't so much in the age of data as in the age of data inference.

Click to read more ...


Strategy: Caching 404s Saved the Onion 66% on Server Time

In the article The Onion Uses Django, And Why It Matters To Us, a lot of interesting points are made about their ambitious infrastructure move from Drupal/PHP to Django/Python: the move wasn't that hard thanks to their previous experience moving the A.V. Club website, it just took time and work; churn in core framework APIs makes it more attractive to move than to stay; supporting the structure of older versions of the site is an unsolved problem; the built-in Django admin saved a lot of work; group development is easier with "fewer specialized or hacked together pieces"; they use IRC for distributed development; Sphinx for full-text search; nginx is the media server and reverse proxy; HAProxy made the launch process a 5 second procedure; Capistrano for deployment; clean component separation makes moving easier; Git for version control; ORM with complicated querysets is a performance problem; memcached for caching rendered pages; the CDN checks for updates every 10 minutes; videos, articles, images, and 404 pages are all served by a CDN.

But the most surprising point had to be:

Click to read more ...
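
The idea in the title, caching 404 responses so that repeated requests for missing pages never reach the application servers, can be sketched in a few lines. This is only an illustration and not The Onion's implementation: the excerpt says the 404 pages are served by a CDN, while the sketch below keeps a small in-process cache instead. The names NotFoundCache, handle, and origin_lookup are hypothetical, and the 600 second lifetime is an assumption borrowed from the 10 minute CDN refresh interval mentioned above.

```python
import time


class NotFoundCache:
    """Remembers paths that recently returned 404 (hypothetical helper)."""

    def __init__(self, ttl_seconds=600, clock=time.monotonic):
        # ttl_seconds=600 is an assumption mirroring the 10 minute CDN refresh.
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._expires = {}  # path -> time at which the cached 404 expires

    def is_known_missing(self, path):
        expiry = self._expires.get(path)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            # Entry is stale: forget it so the origin is asked again.
            del self._expires[path]
            return False
        return True

    def remember(self, path):
        self._expires[path] = self.clock() + self.ttl_seconds


def handle(path, cache, origin_lookup):
    """Return (status, body), consulting the 404 cache before the origin.

    origin_lookup is a hypothetical callable standing in for the expensive
    application server; it returns the page body, or None if no page exists.
    """
    if cache.is_known_missing(path):
        return 404, "Not Found"
    body = origin_lookup(path)
    if body is None:
        cache.remember(path)
        return 404, "Not Found"
    return 200, body
```

With a cache like this in front of the application, a crawler hammering the same dead URL costs one origin lookup per expiry window rather than one per request.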


Digg: 4000% Performance Increase by Sorting in PHP Rather than MySQL

O'Reilly Radar's James Turner conducted a very informative interview with Joe Stump, current CTO of SimpleGeo and former lead architect at Digg, in which Joe makes some of his usual insightful comments on his experience using Cassandra vs MySQL. As Digg started out with a MySQL-oriented architecture and has recently been moving full speed to Cassandra, his observations on some of their lessons learned and the motivation for the move are especially valuable. Here are some of the key takeaways you may find useful:

Click to read more ...


7 Secrets to Successfully Scaling with Scalr (on Amazon) by Sebastian Stadil

This is a part interview, part guest post with Sebastian Stadil, founder of Scalr, a cheaper open-source version of RightScale. Scalr takes care of all the web site infrastructure bits on Amazon (and other clouds) so you don’t have to.

I first met Sebastian at one of the original Silicon Valley Cloud Computing Group meetups, a group which he founded. The meetings started in the tiny offices of Intalio, where Sebastian was working with this newfangled Amazon thing to create an auto-scaling server farm on EC2. I remember distinctly how Sebastian met everyone at the door with a handshake and a smile, making us all feel welcome. Later I took one of the classes he created on how to use AWS. I guess he figured all this cloud stuff was going somewhere and decided to start Scalr.

My only regret about this post is that the name Amazon does not begin with the letter ‘S’. That would have made for an epic title.

Click to read more ...


Hot Scalability Links for March 19, 2010

  1. The Changelog Episode 0.1.8 - NoSQL Smackdown! This podcast was recorded at SXSW and features some energetic trash talking by Stu Hood from Cassandra; Jan Lehnardt from CouchDB; Wynn Netherland from The Changelog, subbing for MongoDB; and Werner Vogels, CTO at Amazon. It's fun hearing these guys step out of their sober advocacy roles and let loose a little with why they are great and the other products suck, hard.
  2. Algorithmic Graph Theory. It's FREE! A GNU-FDL book on algorithmic graph theory by David Joyner, Minh Van Nguyen, and Nathann Cohen. This is an introductory book on algorithmic graph theory.
  3. HBase vs Cassandra: why we moved by Dominic Williams.
  4. Benchmarking Cloud Serving Systems with YCSB by lots of people from Yahoo! Research. We present the Yahoo! Cloud Serving Benchmark (YCSB) framework, with the goal of facilitating performance comparisons of the new generation of cloud data serving systems. We define a core set of benchmarks and report results for four widely used systems: Cassandra, HBase, Yahoo!’s PNUTS, and a simple sharded MySQL implementation.
  5. All recordings from NoSQL Live Boston now online! It's almost like you were there.

Click to read more ...


1 Billion Reasons Why Adobe Chose HBase 

Cosmin Lehene wrote two excellent articles on Adobe's experiences with HBase: Why we’re using HBase: Part 1 and Why we’re using HBase: Part 2. Adobe needed a generic, real-time, structured data storage and processing system that could handle any data volume, with access times under 50ms, with no downtime and no data loss. The article goes into great detail about their experiences with HBase and their evaluation process, providing a "well reasoned impartial use case from a commercial user". It talks about failure handling, availability, write performance, read performance, random reads, sequential scans, and consistency. 

One of the knocks against HBase has been its complexity, as it has many parts that need installation and configuration. All is not lost according to the Adobe team:

HBase is more complex than other systems (you need Hadoop, Zookeeper, cluster machines have multiple roles). We believe that for HBase, this is not accidental complexity and that the argument that “HBase is not a good choice because it is complex” is irrelevant. The advantages far outweigh the problems. Relying on decoupled components plays nice with the Unix philosophy: do one thing and do it well. Distributed storage is delegated to HDFS, so is distributed processing, cluster state goes to Zookeeper. All these systems are developed and tested separately, and are good at what they do. More than that, this allows you to scale your cluster on separate vectors. This is not optimal, but it allows for incremental investment in either spindles, CPU or RAM. You don’t have to add them all at the same time.

Highly recommended, especially if you need some sort of balance to the recent gush of Cassandra articles. 

Justin.tv's Live Video Broadcasting Architecture

The future is live. The future is real-time. The future is now. That's the hype anyway. And as it has a habit of doing, the hype is slowly becoming reality. We are seeing live searches, live tweets, live location, live reality augmentation, live crab (fresh and local), and live event publishing. One of the most challenging of all live technologies is that of live video broadcasting. Imagine a world in which everyone becomes a broadcaster and a consumer of video streams, all in real-time (< 250 msec latency), all so you can talk and interact directly without feeling like you are in the middle of a time shift war. The resources and the engineering needed to make this happen must be substantial. How do you do that?

To find out I talked to Kyle Vogt, Justin.tv Founder and VP of Engineering. Justin.tv certainly has the numbers. Their 30 million unique monthly visitors even outshine YouTube in the video upload game, reportedly uploading nearly 30 hours of video per minute compared to YouTube's 23. I asked for an interview after listening to an interview with Justin Kan, another Founder of the eponymously named Justin.tv. Justin talked about how live video is fundamentally different from YouTube's batch video approach, where all the video is stored on disk and replayed later on demand. Live video can't be made by pushing video faster; it takes a completely different architecture. Since the YouTube Architecture article is the most popular article ever on this site, I thought people might also enjoy learning about the live side of the video world. Kyle was unbelievably generous with his time and insight into how Justin.tv makes all this live video magic happen, going way beyond the call, providing a tremendous number of juicy details. Anyone building a system can learn something from how they run their business. I can't thank Kyle enough for putting up with my never-ending prodding.

Click to read more ...


What would you like to ask Justin.tv?

It looks like I'll have the chance to interview someone from Justin.tv tomorrow about their architecture, which is pretty exciting given their leadership role in live broadcasting. They get 30 million uniques a month, can handle 1 million simultaneous broadcasts, and hope to grow another order of magnitude in the near future. That must take some doing.

Here's your opportunity, especially if you think my questions suck, to ask your own sucky questions :-) What would you like to know about Justin.tv?