Strategy: Make it Really Fast vs Do the Work Up Front

In Cool spatial algos with Neo4j: Part 1 - Routing with A* in Ruby Peter Neubauer not only does a fantastic job explaining a complicated routing algorithm using the graph database Neo4j, but he surfaces an interesting architectural conundrum: make it really fast so work can be done on the reads or do all the work on the writes so the reads are really fast.

The money quote pointing out the competing options is:

[Being] able to do these calculations in sub-second speeds on graphs of millions of roads and waypoints makes it possible in many cases to abandon the normal approach of precomputing indexes with K/V stores and be able to put routing into the critical path with the possibility to adapt to the live conditions and build highly personalized and dynamic spatial services.

The poster boys for the precompute strategy is SimpleGeo, a startup that is building a "scaling infrastructure for geodata." Their strategy for handling geodata is to use Cassandra and build two clusters: one for indexes and one for records. The records cluster is a simple data lookup. The index cluster has a carefully constructed key for every lookup scenario. The indexes are computed on the write, so reads are very fast. Ad hoc queries are not allowed. You can only search on what has been precomputed.

What I think Peter is saying is because a graph database represents the problem in such a natural way and graph navigation is so fast, it becomes possible to run even large complex queries in real-time. No special infrastructure is needed.

If you are creating a geo service, which approach would you choose? Before you answer, let's first ponder: is the graph database solution really solving the same problem as SimpleGeo is solving?

Click to read more ...


Sponsored Post: Event - Social Developer Summit

Social Developer Summit - June 29, 2010 - San Franciso, CA

A meeting of the technically social - Building, scaling, and profiting in a social age

Whether it's social games, social news, social discovery, social search, or other forms of social solutions, developers today are facing new hurdles in building instantly scalable products. As new technologies emerge to address the challenges faced by social application developers, it's increasingly important to come together for knowledge sharing purposes.

The first Social Developer Summit will bring together social application developers to discuss the challenges, solutions, and best practices for building applications in the rapidly expanding social web economy. At the Social Developer Summit, industry experts will share tips and case studies for building high performance social web products.

For more information please take a look at Social Developer Summit.

If you are interested in a sponsored post for an event, job, or product, please take a look at the advertising section. 


Intercloud: How Will We Scale Across Multiple Clouds?

In A Brief History of the Internet it was revealed that the Internet was based on the idea that there would be multiple independent networks of rather arbitrary design. The Internet as we now know it embodies a key underlying technical idea, namely that of open architecture networking. In this approach, the choice of any individual network technology was not dictated by a particular network architecture but rather could be selected freely by a provider and made to interwork with the other networks through a meta-level "Internetworking Architecture".  

With the cloud we are in the same situation today, just a layer or two higher up the stack. We have independent clouds that we would like to connect and work seamlessly together, preferably with the ease at which we currently connect nodes to a network and networks to the Internet. This technology seems to be called the Intercloud: an interconnected global "cloud of clouds" as apposed to the Internet which is a "network of networks."

person most often called the father of the Internet, says in Cloud Computing and the Internet that we are ripe for an Intercloud in the same way we were once ripe for the Internet:

Click to read more ...


Hot Scalability Links for April 1, 2010

  1. Why NoSQL Will Not Die. Stephan Schmidt explains why you may wait a long time for NoSQL to go to that great bit bucket in the sky.
  2. DBMS Musings: Distinguishing Two Major Types of Column-Stores by Daniel Abadi. I have noticed that Bigtable, HBase, Hypertable, and Cassandra are being called column-stores with increasing frequency, due to their ability to store and access column families separately. This makes them appear to be in the same category as column-stores such as Sybase IQ, C-Store, Vertica, VectorWise, MonetDB, ParAccel, and Infobright, which also are able to access columns separately.
  3. Cloud Economics, By The Square Foot by Rich Miller. But cloud computing offers a middle path, offering cost and usability advantages for customers, as well as an attractive return for providers.
  4. PostgreSQL: meet your queue by Theo Schlossnagle. I really think that cueing your database to publish over AMQP is the bees knees and it turns out I wasn't alone!
  5. Scaling GIS Data in Non-relational Data Store by Mike Malone. How SimpleGEO uses NoSQL and other technologies. Yes, the still use memcached. Caching ain’t going anywhere.
  6. CLTV45: The Evolution of the Graph Data Structure from Research to Production. In this recording from “NoSQL Live Boston” we learn how Graph Data Structures evolved from research into production.
  7. Spanner: Google’s next Massive Storage and Computation infrastructure by Royans. MapReduce, Bigtable and Pregel have their origins in Google and they all deal with “large systems”. But all of them may be dwarfed in size and complexity by a new project Google is working on. .
  8. Click to read more ...


Running Large Graph Algorithms - Evaluation of Current State-of-the-Art and Lessons Learned

On the surface nothing appears more different than soft data and hard raw materials like iron. Then isn’t it ironic, in the Alanis Morissette sense, that in this Age of Information, great wealth still lies hidden deep beneath piles of stuff? It's so strange how directly digging for dollars in data parallels the great wealth producing models of the Industrial Revolution.

The piles of stuff is the Internet. It takes lots of prospecting to find the right stuff. Mighty web crawling machines tirelessly collect stuff, bringing it into their huge maws, then depositing load after load into rack after rack of distributed file system machines. Then armies of still other machines take this stuff and strip out the valuable raw materials, which in the Information Age, are endless bytes of raw data. Link clicks, likes, page views, content, head lines, searches, inbound links, outbound links, search clicks, hashtags, friends, purchases: anything and everything you do on the Internet is a valuable raw material.

By itself data is no more useful than a truck load of iron ore. Data must be brought to a factory. It must be purified, processed, and formed. That’s the job for a new field of science called Data Science. Yes, while you weren't looking a whole new branch of science was created. It makes sense in a way. Since data is a new kind of material we need a new profession paralleling that of the Material Scientist, someone who seeks to deeply understand data, the Data Scientist. We aren't so much in the age of data, as the age of data inference.

Click to read more ...


Strategy: Caching 404s Saved the Onion 66% on Server Time

In the article The Onion Uses Django, And Why It Matters To Us, a lot of interesting points are made about their ambitious infrastructure move from Drupal/PHP to Django/Python: the move wasn't that hard, it just took time and work because of their previous experience moving the A.V. Club website; churn in core framework APIs make it more attractive to move than stay; supporting the structure of older versions of the site is an unsolved problem; the built-in Django admin saved a lot of work; group development is easier with "fewer specialized or hacked together pieces"; they use IRC for distributed development; sphinx for full-text search; nginx is the media server and reverse proxy; haproxy made the launch process a 5 second procedure; capistrano for deployment; clean component separation makes moving easier; Git for version control; ORM with complicated querysets is a performance problem; memcached for caching rendered pages; the CDN checks for updates every 10 minutes; videos, articles, images, 404 pages are all served by a CDN.

But the most surprising point had to be:

Click to read more ...


Digg: 4000% Performance Increase by Sorting in PHP Rather than MySQL

O'Reilly Radar's James Turner conducted a very informative interview with Joe Stump, current CTO of SimpleGeo and former lead architect at Digg, in which Joe makes some of his usually insightful comments on his experience using Cassandra vs MySQL. As Digg started out with a MySQL oriented architecture and has recently been moving full speed to Cassandra, his observations on some of their lessons learned and the motivation for the move are especially valuable. Here are some of the key takeaways you find useful:

Click to read more ...


7 Secrets to Successfully Scaling with Scalr (on Amazon) by Sebastian Stadil

This is a part interview part guest with Sebastian Stadil, founder of Scalr, a cheaper open-source version of RightScale. Scalr takes care of all the web site infrastructure bits to on Amazon (and other clouds) so you don’t have to.

I first met Sebastian at one of the original Silicon Valley Cloud Computing Group meetups, a group which he founded. The meetings started in the tiny offices of Intalio where Sebastian was working with this new fangled Amazon thing to create an auto-scaling server farm on EC2. I remember distinctly how Sebastian met everyone at the door with a handshake and a smile, making us all feel welcome. Later I took one of the classes he created on how to use AWS. I guess he figured all this cloud stuff was going somewhere and decided to start Scalr.

My only regret about this post is that the name Amazon does not begin with the letter ‘S’, that would have made for an epic title.

Click to read more ...


Hot Scalability Links for March 19, 2010

  1. The Changelog Episode 0.1.8 - NoSQL Smackdown! This podcast was recorded at SXSW and features some energetic trash talking by: Stu Hood from Cassandra, Jan Lehnardt from CouchDB, Wynn Netherland from The Changelog, subbing for MongoDB, Werner Vogels CTO at Amazon. It's fun hearing these guys step out of their sober advocacy roles and let loose a little with why they are great and the other products suck, hard.
  2. Algorithmic Graph Theory . It's FREE! A GNU-FDL book on algorithmic graph theory by David Joyner, Minh Van Nguyen, and Nathann Cohen. This is an introductory book on algorithmic graph theory.
  3. HBase vs Cassandra: why we moved by Dominic Williams.
  4. Benchmarking Cloud Serving Systems with YCSB by lots of people from Yahoo! Research. We present the Yahoo! Cloud Serving Benchmark (YCSB) framework, with the goal of facilitating performance comparisons of the new generation of cloud data serving systems. We define a core set of benchmarks and report re- sults for four widely used systems: Cassandra, HBase, Yahoo!’s PNUTS, and a simple sharded MySQL implementation.
  5. All recordings from NoSQL Live Boston now online!. It's almost like you were there.

Click to read more ...


1 Billion Reasons Why Adobe Chose HBase 

Cosmin Lehene wrote two excellent articles on Adobe's experiences with HBase: Why we’re using HBase: Part 1 and Why we’re using HBase: Part 2. Adobe needed a generic, real-time, structured data storage and processing system that could handle any data volume, with access times under 50ms, with no downtime and no data loss. The article goes into great detail about their experiences with HBase and their evaluation process, providing a "well reasoned impartial use case from a commercial user". It talks about failure handling, availability, write performance, read performance, random reads, sequential scans, and consistency. 

One of the knocks against HBase has been it's complexity, as it has many parts that need installation and configuration. All is not lost according to the Adobe team:

HBase is more complex than other systems (you need Hadoop, Zookeeper, cluster machines have multiple roles). We believe that for HBase, this is not accidental complexity and that the argument that “HBase is not a good choice because it is complex” is irrelevant. The advantages far outweigh the problems. Relying on decoupled components plays nice with the Unix philosophy: do one thing and do it well. Distributed storage is delegated to HDFS, so is distributed processing, cluster state goes to Zookeeper. All these systems are developed and tested separately, and are good at what they do. More than that, this allows you to scale your cluster on separate vectors. This is not optimal, but it allows for incremental investment in either spindles, CPU or RAM. You don’t have to add them all at the same time.

Highly recommended, especially if you need some sort of balance to the recent gush of Cassandra articles.