
Strategy: Make it Really Fast vs Do the Work Up Front

In Cool spatial algos with Neo4j: Part 1 - Routing with A* in Ruby, Peter Neubauer not only does a fantastic job of explaining a complicated routing algorithm using the graph database Neo4j, he also surfaces an interesting architectural conundrum: make the engine really fast so the work can be done on the reads, or do all the work on the writes so the reads are really fast.

The money quote pointing out the competing options is:

[Being] able to do these calculations in sub-second speeds on graphs of millions of roads and waypoints makes it possible in many cases to abandon the normal approach of precomputing indexes with K/V stores and be able to put routing into the critical path with the possibility to adapt to the live conditions and build highly personalized and dynamic spatial services.
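Peter's article implements this with Neo4j's traversal framework; purely to illustrate the read-time approach, here is a minimal, self-contained A* sketch over a hypothetical in-memory adjacency list. The graph, coordinates, and straight-line heuristic are all made up for this example, not taken from the article:

```ruby
# Minimal A* shortest-path search over an in-memory graph.
# Hypothetical data: node => { neighbor => edge_cost }, plus (x, y)
# coordinates used for the straight-line distance heuristic.
EDGES = {
  a: { b: 2, c: 5 },
  b: { c: 1, d: 4 },
  c: { d: 1 },
  d: {}
}
COORDS = { a: [0, 0], b: [1, 0], c: [1, 1], d: [2, 1] }

def heuristic(n, goal)
  (x1, y1), (x2, y2) = COORDS[n], COORDS[goal]
  Math.hypot(x2 - x1, y2 - y1)  # admissible: never overestimates
end

def a_star(start, goal)
  open_set = [start]
  came_from = {}
  g = Hash.new(Float::INFINITY).merge(start => 0)  # cost from start
  until open_set.empty?
    # Pick the open node with the lowest estimated total cost f = g + h.
    current = open_set.min_by { |n| g[n] + heuristic(n, goal) }
    return reconstruct(came_from, current) if current == goal
    open_set.delete(current)
    EDGES[current].each do |neighbor, cost|
      tentative = g[current] + cost
      next unless tentative < g[neighbor]
      came_from[neighbor] = current
      g[neighbor] = tentative
      open_set << neighbor unless open_set.include?(neighbor)
    end
  end
  nil  # no route exists
end

def reconstruct(came_from, node)
  path = [node]
  path.unshift(node = came_from[node]) while came_from[node]
  path
end

puts a_star(:a, :d).inspect  # => [:a, :b, :c, :d]
```

A real road network would hold millions of nodes and use a proper priority queue, but the shape of the computation is the same: the route is computed at read time, so live traffic conditions or per-user preferences can feed straight into the edge costs.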

The poster child for the precompute strategy is SimpleGeo, a startup that is building a "scaling infrastructure for geodata." Their strategy for handling geodata is to use Cassandra and build two clusters: one for indexes and one for records. The records cluster is a simple data lookup. The index cluster has a carefully constructed key for every lookup scenario. The indexes are computed on the write, so reads are very fast. Ad hoc queries are not allowed. You can only search on what has been precomputed.
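As a toy sketch of that write-time indexing idea: the grid-cell key scheme below and the two in-memory hashes standing in for the two clusters are my own invention for illustration, not SimpleGeo's actual Cassandra schema.

```ruby
# Toy sketch of the write-heavy / read-cheap split: a "records" store
# plus an "index" store whose keys are precomputed at write time.
RECORDS = {}
INDEX = Hash.new { |h, k| h[k] = [] }

# Hypothetical key scheme: a coarse lat/lon grid cell (~0.1 degree).
def grid_cell(lat, lon)
  "cell:#{(lat * 10).floor}:#{(lon * 10).floor}"
end

# All the indexing work happens on the write path...
def write(id, lat, lon, attrs = {})
  RECORDS[id] = attrs.merge(lat: lat, lon: lon)
  INDEX[grid_cell(lat, lon)] << id
end

# ...so a read is one key lookup plus record fetches. Anything the
# key scheme didn't anticipate simply cannot be queried.
def query_cell(lat, lon)
  INDEX[grid_cell(lat, lon)].map { |id| RECORDS[id] }
end

write("cafe", 37.77, -122.42, name: "Cafe")
write("park", 37.77, -122.41, name: "Park")
puts query_cell(37.77, -122.42).inspect  # both nearby records come back
```

A production system would precompute one such key family per supported lookup scenario, which is exactly why ad hoc queries are off the table.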

What I think Peter is saying is that because a graph database represents the problem in such a natural way, and graph navigation is so fast, it becomes possible to run even large, complex queries in real-time. No special infrastructure is needed.

If you are creating a geo service, which approach would you choose? Before you answer, let's first ponder: is the graph database solution really solving the same problem as SimpleGeo is solving?

Not really. In this configuration the graph database is serving more like a specialized analytics database.

What SimpleGeo wanted is a system that supports very high write loads, is an operational no-brainer, is highly available, and performs well. Shortly after launching they already had over 1TB of data; it must be much larger now and growing all the time, probably exponentially. A graph database constrained to a single node simply can't compete in this space.

So SimpleGeo's architecture makes a great deal of sense for their requirements. Using a lot of up front smarts they've created a system that can handle very high write loads and it performs amazingly well for reads. The downside is they are very limited because everything has to be precanned. Well, that works for them, so far.

Wouldn't an SSD solve everything? Consider a system handling very high write loads while sequentially scanning very large tables to satisfy requests. The backplane of that box will be saturated in a heartbeat. An SSD doesn't make large problems small, and it doesn't make complex queries over large datasets magically solvable either. SSDs simply shift the scale-up/scale-out transition curve.

SimpleGeo could keep graph databases as their index nodes. There's a lag between their data cluster and their index cluster anyway, so the index cluster could be a graph cluster. Or perhaps they could keep a graph cluster in parallel for ad hoc queries. What I'm not sure about is how a graph larger than RAM will perform, or how it will perform in a replicated, load balanced situation within the same datacenter and then across multiple datacenters. Mapping all queries to a key-value lookup is a very robust approach.

Joe Stump took a lot of crap for suggesting a relational database couldn't scale to solve their problem. Given all the hubbub it's interesting that a graph database and not a relational database turns out to be the speed demon. It almost makes one think that in the same way there are optimal human body types for each sport (top level sprinters are muscular, higher proportion of fast twitch fibers, for example), there may be optimal data structures for solving specific problems.


Reader Comments (2)

Hi there,
thanks for the writeup! Regarding the SimpleGeo scenario, of course the solution depends a lot on what type of data and what type of query is to be served. To be fair, given the sheer amount and type of data handled there, the graph would need to be sharded, and thus wouldn't fit without some work into the scenario outlined in the post.

The A* algorithm and routing are extremely well suited to deep traversals and graph databases, so the point is to show that, not to discard everything else. We are moving towards Polyglot Persistence, meaning that different processing and persistence frameworks will be applied to interesting subsets of the whole dataset, in the scenarios where each is most suitable.


April 6, 2010 | Unregistered CommenterPeter Neubauer

I don't get the SimpleGeo strategy. Seamlessly distributable polygon indexes will scale on massive clusters without loss of ingest performance concurrent with very high query loads. I've seen (as of 2007) single petabyte data sets with tens of terabytes of ingest per day (sustained rates of millions of polygons ingested per second) and linear scalability, nowhere near the limit. Fully general and in more than two dimensions, but the algorithms were unusual. I'm building a similar system now. I'm not following the use of graphs to fix a design/implementation bug.

Graph analytics is trickier but I've recently seen an unusual design by an old school company that linearly shards those on massive clusters as well...

April 7, 2010 | Unregistered CommenterAndrew
