Strategy: Make it Really Fast vs Do the Work Up Front

In Cool spatial algos with Neo4j: Part 1 - Routing with A* in Ruby, Peter Neubauer not only does a fantastic job of explaining a complicated routing algorithm using the graph database Neo4j, he also surfaces an interesting architectural conundrum: make the engine really fast so the work can be done on reads, or do all the work on writes so the reads are really fast.
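If you haven't seen A* in this setting, the core idea is simple: expand outward from the start waypoint, always taking the node with the lowest cost-so-far plus a heuristic estimate of the remaining distance (straight-line distance to the goal works). Here's a minimal sketch in Python; Peter's article does this in Ruby against a real Neo4j road graph, so the adjacency dict and toy coordinates below are just stand-ins:

```python
import heapq
import math

def a_star(graph, coords, start, goal):
    """Minimal A* over an adjacency dict: graph[node] -> [(neighbor, edge_cost), ...].
    coords[node] -> (x, y) feeds the straight-line-distance heuristic."""
    def h(n):
        (x1, y1), (x2, y2) = coords[n], coords[goal]
        return math.hypot(x1 - x2, y1 - y2)

    open_set = [(h(start), 0.0, start, [start])]   # (f = g + h, g, node, path)
    best_g = {start: 0.0}
    while open_set:
        f, g, node, path = heapq.heappop(open_set)
        if node == goal:
            return g, path
        for neighbor, cost in graph.get(node, []):
            new_g = g + cost
            if new_g < best_g.get(neighbor, float("inf")):
                best_g[neighbor] = new_g
                heapq.heappush(open_set,
                               (new_g + h(neighbor), new_g, neighbor, path + [neighbor]))
    return None  # no route

# Toy road network: A -> B -> D and A -> C -> D
coords = {"A": (0, 0), "B": (1, 0), "C": (0, 2), "D": (2, 0)}
graph = {"A": [("B", 1), ("C", 2)], "B": [("D", 1)], "C": [("D", 3)]}
print(a_star(graph, coords, "A", "D"))   # (2.0, ['A', 'B', 'D'])
```

On a toy graph this is trivial; the point of Peter's post is that a graph database keeps it sub-second on graphs of millions of roads and waypoints.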

The money quote pointing out the competing options is:

[Being] able to do these calculations in sub-second speeds on graphs of millions of roads and waypoints makes it possible in many cases to abandon the normal approach of precomputing indexes with K/V stores and be able to put routing into the critical path with the possibility to adapt to the live conditions and build highly personalized and dynamic spatial services.

The poster boy for the precompute strategy is SimpleGeo, a startup that is building a "scaling infrastructure for geodata." Their strategy for handling geodata is to use Cassandra and build two clusters: one for indexes and one for records. The records cluster is a simple data lookup. The index cluster has a carefully constructed key for every lookup scenario. The indexes are computed on the write, so reads are very fast. Ad hoc queries are not allowed. You can only search on what has been precomputed.
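The shape of that pattern is easy to sketch: every write also writes an index entry for each question you plan to ask, and a read collapses into a key lookup. Here's a rough illustration, with plain dicts standing in for the two Cassandra clusters and a simple grid-cell key standing in for whatever key scheme SimpleGeo actually uses (those details aren't public):

```python
# Rough sketch of the precompute-on-write pattern. Plain dicts stand in for the
# two Cassandra clusters; the grid-cell key is a stand-in for SimpleGeo's real
# (unpublished) key scheme.
records = {}   # "record cluster": record_id -> record
index = {}     # "index cluster": precomputed key -> list of record_ids

def cell_key(lat, lon, precision=2):
    # Bucket coordinates into a coarse grid cell, roughly what a geohash prefix does.
    return f"cell:{round(lat, precision)}:{round(lon, precision)}"

def write(record_id, lat, lon, payload):
    # All the indexing work happens here, at write time.
    records[record_id] = {"lat": lat, "lon": lon, **payload}
    index.setdefault(cell_key(lat, lon), []).append(record_id)

def nearby(lat, lon):
    # Reads are pure key lookups -- no scanning, but only precomputed
    # questions can be asked.
    return [records[rid] for rid in index.get(cell_key(lat, lon), [])]

write("cafe-1", 37.78, -122.42, {"name": "cafe one"})
write("cafe-2", 37.78, -122.41, {"name": "cafe two"})
print(nearby(37.78, -122.42))   # only cafe-1; cafe-2 sits in the next cell over
```

Every new query type means another key scheme and a backfill, which is exactly the "you can only search on what has been precomputed" constraint.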

What I think Peter is saying is that because a graph database represents the problem in such a natural way, and graph navigation is so fast, it becomes possible to run even large, complex queries in real-time. No special infrastructure is needed.
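To make the contrast concrete, here's roughly what routing in the critical path of a read looks like. This sketch uses the official Neo4j Python driver and Cypher's built-in hop-count shortestPath; the Waypoint/ROAD schema is invented for illustration, and a real weighted A* route would go through a graph algorithms add-on rather than plain Cypher:

```python
from neo4j import GraphDatabase   # official Neo4j Python driver

# Connection details are placeholders; the Waypoint/ROAD schema is made up
# for illustration, not taken from Peter's article.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

ROUTE_QUERY = """
MATCH (a:Waypoint {id: $src}), (b:Waypoint {id: $dst})
MATCH p = shortestPath((a)-[:ROAD*]-(b))
RETURN [n IN nodes(p) | n.id] AS route
"""

def route(src, dst):
    # The traversal runs inside the database at read time -- nothing was
    # precomputed when the road segments were written.
    with driver.session() as session:
        record = session.run(ROUTE_QUERY, src=src, dst=dst).single()
        return record["route"] if record else None

print(route("castro-st", "market-st"))
```

Nothing here is precomputed at write time; the cost is paid, and the flexibility gained, on every read.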

If you are creating a geo service, which approach would you choose? Before you answer, let's first ponder: is the graph database solution really solving the same problem as SimpleGeo is solving?

Not really. In this configuration the graph database is serving more like a specialized analytics database.

What SimpleGeo wanted is a system that supports very high write loads, is an operational no-brainer, is highly available, and performs well. SimpleGeo has just started and they already have over 1TB of data. It must be much larger now and growing all the time, probably exponentially. A graph database constrained to a single node simply can't compete in this space.

So SimpleGeo's architecture makes a great deal of sense for their requirements. Using a lot of up-front smarts they've created a system that can handle very high write loads and performs amazingly well for reads. The downside is they are very limited because everything has to be pre-canned. Well, that works for them, so far.

Wouldn't an SSD solve everything? Consider a system handling very high write loads while sequentially scanning very large tables to satisfy requests. The backplane of that box will be saturated in a heartbeat. An SSD doesn't make large problems small, and it doesn't make complex queries over large datasets magically solvable either. SSDs simply change the scale-up/scale-out transition curve.
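A quick back-of-the-envelope, using ballpark figures rather than benchmarks, shows why:

```python
# Back-of-envelope with ballpark figures (not benchmarks): even a fast SSD
# can't turn a full scan of a big table into a sub-second read.
table_size_gb = 1000          # SimpleGeo-scale dataset, roughly 1TB
ssd_seq_read_gb_per_s = 0.5   # generous sequential read rate for one drive

scan_seconds = table_size_gb / ssd_seq_read_gb_per_s
print(f"One full scan: ~{scan_seconds:.0f} s (about {scan_seconds/60:.0f} minutes)")
# ~2000 s -- and that bandwidth is shared with the incoming write load,
# so the box saturates long before reads get anywhere near sub-second.
```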

SimpleGeo could keep graph databases as their index nodes. There's a lag between their data cluster and their index cluster anyway, so the index cluster could be a graph cluster. Or perhaps they could keep a graph cluster in parallel for ad hoc queries. What I'm not sure about is how a graph larger than RAM will perform, or how it will perform in a replicated, load-balanced situation within the same datacenter and then across multiple datacenters. Mapping all queries to a key-value lookup is a very robust approach.

Joe Stump took a lot of crap for suggesting a relational database couldn't scale to solve their problem. Given all the hubbub, it's interesting that a graph database, and not a relational database, turns out to be the speed demon. It almost makes one think that, in the same way there are optimal human body types for each sport (top-level sprinters are muscular with a higher proportion of fast-twitch fibers, for example), there may be optimal data structures for solving specific problems.