Neo4j - a Graph Database that Kicks Buttox

Update: Social networks in the database: using a graph database. A nice post on representing, traversing, and performing other common social network operations using a graph database.

If you are Digg or LinkedIn you can build your own speedy graph database to represent your complex social network relationships. For those of more modest means Neo4j, a graph database, is a good alternative.

A graph is a collection nodes (things) and edges (relationships) that connect pairs of nodes. Slap properties (key-value pairs) on nodes and relationships and you have a surprisingly powerful way to represent most anything you can think of. In a graph database "relationships are first-class citizens. They connect two nodes and both nodes and relationships can hold an arbitrary amount of key-value pairs. So you can look at a graph database as a key-value store, with full support for relationships."

A graph looks something like:

For more lovely examples take a look at the Graph Image Gallery.

Here's a good summary by Emil Eifrem, founder of the Neo4j, making the case for why graph databases rule:

Most applications today handle data that is deeply associative, i.e. structured as graphs (networks). The most obvious example of this is social networking sites, but even tagging systems, content management systems and wikis deal with inherently hierarchical or graph-shaped data.

This turns out to be a problem because it’s difficult to deal with recursive data structures in traditional relational databases. In essence, each traversal along a link in a graph is a join, and joins are known to be very expensive. Furthermore, with user-driven content, it is difficult to pre-conceive the exact schema of the data that will be handled. Unfortunately, the relational model requires upfront schemas and makes it difficult to fit this more dynamic and ad-hoc data.

A graph database uses nodes, relationships between nodes and key-value properties instead of tables to represent information. This model is typically substantially faster for associative data sets and uses a schema-less, bottoms-up model that is ideal for capturing ad-hoc and rapidly changing data.

So relational database can't handle complex relationships. Graph systems are opaque, unmaintainable, and inflexible. OO databases loose flexibility by combining logic and data. Key-value stores require the programmer to maintain all relationships. There, everybody sucks :-)

Neo4j's Key Characteristics

  • Dual license: open source and commercial.
  • Well suited for many web use cases such as tagging, metadata annotations, social networks, wikis and other network-shaped or hierarchical data sets.
  • An intuitive graph-oriented model for data representation. Instead of static and rigid tables, rows and columns, you work with a flexible graph network consisting of nodes, relationships and properties.
  • Decent documentation, active and responsive email list, a few releases, good buzz. All a good sign for something that has a chance to last a while.
  • Has bindings for a number of languages Python, Jython, Ruby, and Clojure. No binding for .Net yet. The recommendation is to access using a REST interface.
  • Disk-based, native storage manager completely optimized for storing graph structures for maximum performance and scalability. SSD ready.
  • Massive scalability. Neo4j can handle graphs of several billion nodes/relationships/properties on a single machine.
  • Frequently outperforms relational backends with >1000x for many increasingly important use cases.
  • Powerful traversal framework for high-speed traversals in the node space.
  • Small footprint. Neo4j is a single <500k jar with one dependency (the Java Transaction API).
  • Simple and convenient object-oriented API.
  • Retrieving children is trivial in a graph database.
  • No need to flatten and serialize an object graph as graphs are native to a graph database.
  • Fully transactional like a real database. Supports JTA/JTS, XA, 2PC, Tx recovery, deadlock detection, etc.
  • Current implementation is built to handle large graphs that don't fit in memory with durability. It's not a cache, it's a fully persistent transactional store.
  • No events or triggers. Planned in a future release.
  • No sharding. A suggestion for how one might shard is here.
  • Some common graph calculations are missing. For example, in a social network finding a common friend for a set of users.
  • Separates data and logic with a more "natural" representation than tables. This makes it easy to use Neo4j as the storage tier for OO code while keeping behaviour and state separate.
  • Neo4j traverses depths of 1000 levels and beyond at millisecond speed. That's many orders of magnitude faster than relational systems.

    Neo4j vs Hadoop

    This post makes an illuminating comparison between Neo4j vs Hadoop:

    In principle, Hadoop and other Key-Value stores are mostly concerned with relatively flat data structures. That is, they are extremely fast and scalable regarding retrieval of simple objects, like values, documents or even objects.

    However, if you want to do deeper traversal of e.g. a graph, you will have to retrieve the nodes for every traversal step (very fast) and then match them yourself in some manner (e.g. in Java or so) - slow.

    Neo4j in contrast is build around the concept of "deep" data structures. This gives you almost unlimited flexibility regarding the layout of your data and domain object graph and very fast deep
    traversals (hops over several nodes) since they are handled natively by the Neo4j engine down to the storage layer and not your client code. The drawback is that for huge data amounts (>1Billion nodes) the clustering and partitioning of the graph becomes non-trivial, which is one of the areas we are working on.

    Then of course there are differences in the transaction models, consistency and others, but I hope this gives you a very short philosophical answer :)

    It would have never occurred to me to compare the two, but the comparison shows why we need multiple complementary views of data. Hadoop scales the data grid and the compute grid and is more flexible in how data are queried and combined. Neo4j has far lower latencies for complex navigation problems. It's not a zero-sum game.

    Related Articles

  • Neo4j -- or why graph dbs kick ass
  • The current database debate and graph databases by Anders Nawroth
  • On Building a Stupidly Fast Graph Database by Scott Wheeler and the Hacker News Thread
  • Network Model from wikipedia
  • Databases as a service: FathomDB
  • Using Neo4J to load and query OWL ontologies by Sujit Pal
  • Graph Databases and the Future of Large-Scale Knowledge Management by Marko A. Rodriguez
  • Memo To The Semantic Web: Drop “Semantic” And Become The “Graph Web” by Hank Williams
  • Is the Relational Database Doomed? by Tony Bain
  • Neo Database Introduction
  • Neo4j Email List
  • flare Data Visualization for the Web
  • Giant Global Graph by Tim Berners-Lee
  • Tim Berners-Lee -- Linked Data at TED
  • Drop ACID and Think About Data by Bob Ippolito
  • Analyzing and adapting graph algorithms for large persistent graphs by Larsson, Patrik
  • Thursday

    Yahoo! Distribution of Hadoop

    Many people in the Apache Hadoop community have asked Yahoo! to publish the version of Apache Hadoop they test and deploy across their large Hadoop clusters. As a service to the Hadoop community, Yahoo is releasing the Yahoo! Distribution of Hadoop -- a source code distribution that is based entirely on code found in the Apache Hadoop project.

    This source distribution includes code patches that they have added to improve the stability and performance of their clusters. In all cases, these patches have already been contributed back to Apache, but they may not yet be available in an Apache release of Hadoop.

    Read more and get the Hadoop distribution from Yahoo


    Hive - A Petabyte Scale Data Warehouse using Hadoop

    This post about using Hive and Hadoop for analytics comes straight from Facebook engineers.

    Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook - both engineering and non-engineering. Apart from ad hoc analysis and business intelligence applications used by analysts across the company, a number of Facebook products are also based on analytics.

    These products range from simple reporting applications like Insights for the Facebook Ad Network, to more advanced kind such as Facebook's Lexicon product.

    As a result a flexible infrastructure that caters to the needs of these diverse applications and users and that also scales up in a cost effective manner with the ever increasing amounts of data being generated on Facebook, is critical. Hive and Hadoop are the technologies that we have used to address these requirements at Facebook.

    Read the rest of the article on Engineering @ Facebook's Notes page


    Dealing with multi-partition transactions in a distributed KV solution

    I've been getting asked about this a lot lately so I figured I'd just blog about it. Products like WebSphere eXtreme Scale work by taking a dataset, partitioning it using a key and then assigning those partitions to a number of JVMs. Each partition usually has a primary and a replica. These 'shards' are assigned to JVMs. A transactional application typically interacts with the data on a single partition at a time. This means the transaction is executed in a single JVM. A server box will be able to do M of those transactions per second and it scales because N boxes does MN (M multiplied by N) transactions per second. Increase N, you get more transactions per second. Availability is very good because a transaction only depends on 1 of the N servers that are currently online. Any of the other (N-1) servers can go down or fail with no impact on the transaction. So, single partition transactions can scale indefinitely from a throughput point of view, offer very consistent response times and they are very available because they only point a small part of the grid at once.

    All-partition transactions are different. A simple example might be that we are storing bank accounts in a grid. The account key is the bank account number. The value is an account object with the users online username and their password, address, portal profile, bank account information etc. Almost all access to the account is using the account number. Now, lets look at the login process for the banks portal. The user doesn't login with their account number, they login with the username. We have not partitioned on user name, we partitioned on account and did so for good reason as every other transaction type is keyed on account number.

    So, given we can't easily look up a record using the user name what can we do. Option 1. Lets do a parallel search across all partitions to find account objects whose user name attribute is 'billy'. We can use a MapGridAgent in WebSphere eXtreme Scale to do this. The agent code will be executed in parallel across all partitions. It will run a query within that partition to find any accounts in that partition with a username of 'billy'. One account object should match across the whole grid and the client which called the agent should receive the account number as the result. Problem solved!

    Not so fast. Lets examine this parallel search. How long does it take to run? The client invokes instructs each partition to execute the search code. These searches run in parallel and the client blocks until they all return. So, the client basically waits for the slowest 'partition' or server to return before it continues. How many of these lookup transactions can the grid perform per second? As many as the slowest box can do. If the number of accounts was to double, we could double the size of the grid. This lets us store twice as many accounts but what about the effect on our parallel search? It's true we are searching twice as fast as before (double the CPUs) but there is also twice as much data to search through so we are probably achieving the same response time as before. What about throughput? It's still the same. We can only do as many transactions per second as the slowest machine. Our throughput hasn't changed even though we doubled the size of the grid. Now, we can search twice as many records with the same response time as before, but throughput wise, nothing changed. The grid is scaling in terms of account capacity and records searched/second but the throughput number is not scaling at all.

    Availability is also impacted when compared with single partition transactions. The single partition transactions only used a single partition/server. The every partition transaction needs the whole grid to be up to complete. The failure of a single box will delay the transaction from completing. Now, products like WebSphere eXtreme Scale will very quickly recover from a failure (typically sub second) but on a large enough grid then you'll see response time glitches where maybe a second or so is added if the admins are cycling through servers doing maintenance or something like that. This delay is very unlikely to happen in a single partition transaction case. You'd have a 1/N change of it happening. Much better than the 100% chance with a every partition transaction.

    This lack of throughput scalability for every partition transactions is a problem as login is a operation whose throughput needs to go up as the web site becomes more popular. So, it looks like using parallel search for an operations which need to scale from a throughput point of view is a bad idea. What else can we do?

    We could partition using user name instead of account but now we have the search problem for all the account number based transactions which are the bulk of all transactions and besides, users like being able to change the user name which would be a nightmare if everything was based on usernames.

    We could cache the results of looking up usernames with parallel searches. The cache would be a Map whose key was username and the value was account number. A Loader attached to the Map would do a parallel search with a MapGridAgent if its Loader#get method was called on a cache miss. The problem here is that when we warm up the cache, we'll be getting a lot of cache misses and a lot of parallel searches. Not good either.

    Or, we could maintain a persistent reverse index. This index is a Map which has the user name for the key and the account id for the value. The Map is backed by a database table or other long term persistence mechanism. Now, when a user logs in, we simply do a Map.get("billy") and receive the account id with a single partition transaction and the throughput of those does scale with grid size. We have to maintain this reverse index so that if the user changes their username then we need to make sure the reverse index is updated and so on.

    Login now is a matter of looking up the user name in the reverse index map (revMap.get "billy" returning 1234) and then retrieving the account object using a second get to check the password (accMap.get "1234" returning the account object with the password). This is a much better solution than a parallel search. This is a query cache. Effectively, we are caching the results of the parallel search using a persistent map. We have converted a parallel transaction to a single partition transaction and as a result, our login operation is now throughput scalable.

    Multi-partition transactions can be great for searching large amounts of data in parallel. The search speed/second does increase with the grid size. Larger grids can store larger amounts of data but the throughput typically stays the same as the grid grows (assuming the data size grows linearly with grid size). This means using parallel operations for something whose throughput will grow as your application scales up is a mistake as the throughput of the grid has nothing to do with the grid size, it's limited to the throughput of the slowest box.

    You need to convert that parallel search operation to a single partition get if you want the system to scale from a throughput point of view. Caching the parallel searches OR using a reverse index (effectively this is a disk persistent query cache) is the normal way to handle this conversion.

    How can you make an every partition operation scale from a throughput point of view then if you can't use reverse indexes? Use multiple grids which are all the same and round robin the requests over them. Each grid will be able to do M transactions per second and N grids givens you MN per second. If you need throughput scalable every partition transactions then this is probably the only way to make it scale from a throughput point of view. Ever wonder why google needs millions of servers...

    This article is really talking about transactions that involve every partition like a search. Some transaction may use two partitions for example or some small number of partitions relative to the total number but thats for another blog entry...


    Paper: Graph Databases and the Future of Large-Scale Knowledge Management

    Relational databases, document databases, and distributed hash tables get most of the hype these days, but there's another option: graph databases. Back to the future it seems. Here's a really interesting paper by Marko A. Rodriguez introducing the graph model and it's extension to representing the world wide web of data.

    Modern day open source and commercial graph databases can store on the order of 1 billion relationships with some databases reaching the 10 billion mark. These developments are making the graph database practical for applications that require large-scale knowledge structures. Moreover, with
    the Web of Data standards set forth by the Linked Data community, it is possible to interlink graph databases across the web into a giant global knowledge structure. This talk will discuss graph databases, their underlying data model, their querying mechanisms, and the benefits of the graph data structure for modeling and analysis.

    Managing cross partition transactions in a distributed KV system

    I spend a blog entry discussing single partition and every partition transactions when using distributed KV systems and solutions for some common problems


    Distribution of queries per second

    We need to measure the number of queries-per-second our site gets for capacity planning purposes.

    Obviously, we need to provision the site based on the peak QPS, not average QPS. There will always be some spikes in traffic, though, where for one particular second we get a really huge number of queries. It's ok if site performance slightly degrades during that time. So what I'd really like to do is estimate the *near* peak QPS based on average or median QPS. Near peak might be defined as the QPS that I get at the 95th percentile of the busiest seconds during the day.

    My guess is that this is similar to what ISPs do when they measure your bandwidth usage and then charge for usage over the 95th percentile.

    What we've done is analyzed our logs, counted the queries executed during each second during the day, sorted from the busiest seconds to the least busy ones, and graphed it. What you get is a histogram that steeply declines and flattens out near zero.

    Does anyone know if there is a mathematical formula that describes this distribution?

    I'd like to say with some certainty that the second at the 95th percentile will get X times the number of average or median number of QPS.

    (Experimentally, our data shows, over a six week period, an avg QPS of 7.3, a median of 4, and a 95th percentile of 27. But I want a better theoretical basis for claiming that we need to be able to handle 4x the average amount of traffic.)


    Graph server

    I've seen mentioned in few times sites like Digg or LinkedIn using graph servers to hold their social graphs. But the only sort of open source graph server I've found is .

    Can anyone recommend an open source graph server?



    Google Wave Architecture

    Update: Good Vibrations by Radovan Semančík. Lot's of interesting questions about how Wave works, scalability, security, RESTyness, and so on.

    Google Wave is a new communication and collaboration platform based on hosted XML documents (called waves) supporting concurrent modifications and low-latency updates. This platform enables people to communicate and work together in new, convenient and effective ways. We will offer these benefits to users of Google Wave and we also want to share them with everyone else by making waves an open platform that everybody can share. We welcome others to run wave servers and become wave providers, for themselves or as services for their users, and to "federate" waves, that is, to share waves with each other and with Google Wave. In this way users from different wave providers can communicate and collaborate using shared waves. We are introducing the Google Wave Federation Protocol for federating waves between wave providers on the Internet.

    Here are the initial white papers that are available to complement the Google Wave Federation Protocol:

    • Google Wave Federation Architecture

    • Google Wave Data Model and Client-Server Protocol

    • Google Wave Operational Transform

    • General Verifiable Federation

    The Google Wave APIs are documented here.


    HotPads Shows the True Cost of Hosting on Amazon

    Mather Corgan, president of HotPads, gave a great talk on how HotPads uses AWS to run their real estate search engine. I loved the presentation for a few reasons:

  • It gives real costs on on their servers, how many servers they have, what they are used for, and exactly how they use S2, EBS, CloudFront and other AWS services. This is great information for anybody trying to architect a system and wondering where to run it.
  • HotPads is a "real" application. It's a small company and at 4.5 million page-views/month it's large but not super large. It has custom server side components like indexing engines, image processing, and background database update engines for syncing new real estate data. And it also stores a lot of images and has low latency requirements.

    This a really good example mix of where many companies are or would like to be with their applications.

    Their total costs are about $11K/month, which is about what they were paying at their previous provider. I found this is a little surprising as I thought the cloud would be more expensive, but they only pay for what they need instead of having to over provision for transient uses like testing. And some servers aren't necessary anymore as EBS handles backups so database slave servers are no longer required.

    There are lots more lessons like this that I've abstracted down below.

    Site: - a map-based real estate search engine, listing homes for sale, apartments, condos, and rental houses.


  • 800,000 visits/month
  • 4.5 million page-views/month
  • 3.5 million real-estate listings updated daily


  • Java
  • MySQL
  • AWS


  • EC2 - $7400/month - run 20 of various size instances at anyone time. Most work is in the background processing of images, not web serving.
    * $150: 2 Small HAProxy Load Balancers - 2 for failover, these have the elastic IPs, round robin DNS point at the elastic IPs.
    * $1,200: 3-5 Large Tomcat Web Servers - an array of 3 run at night and 5 during the day.
    * $1,500: 5 Large Tomcat Job Servers
    * $900: 1 X-Large 1 Large Index Server - used to power property search and have several GB of RAM for the JVM
    * $1,200: 1 X-Large 2 Large MySQL masters
    * $1,200: 1 X-Large 2 Large MySQL slaves
    * $300: 1 Large Messaging Server ActiveMQ - will be replaced with SQS
    * $300: 1 Large Map tile creation servers Tilecache
    * $600: Development/testing/migration/ servers
  • S3 - $1500/month - few hundred million objects for files for maps and real-estate listing photos. 4TB of database backup stored as EBS diffs ($600/month).
  • Elastic Block Storage - $500/month
  • CloudFront - $460/month - is used to serve static files and map files throughout the world. It serves static files, map tiles, and listing photos.
  • Elastic IP Addresses - $8/month
  • RightScale - $500/month - used for management and deployment.

    Lessons Learned

  • Major reason for choosing EC2 was the cloud API which allows adding servers at any time. In their previous hosting service they had to prepay for a month at a time so they would order the minimum necessary to get by that month. That doesn't leave room for servers for development, test, preview servers for customers or making live database servers upgrades (which requires 2x servers)?
  • Overall cost is about the same as with previous hosting site but the overall speed of development and ease of management is night and day different. Getting more servers and lots more flexibility.
  • HotPads is a small company and doesn't think added trouble of colocation isn't worth it for them yet.
  • Advantage of Amazon over something like Google App Engine is that Amazon allows you to innovate by building your own services on your own machines.
  • S3 is better for larger objects because for small files that are not viewed often the cost of puts outweighs everything. Not a cache to use for short lived objects because the put costs start to dominate.
    * For a 67 KB object (600 px image) which is where the cost of putting an image into S3 equals the cost of storing it there and about equal the cost of storing it once.
    * For a 6.7 KB object (15 px thumb nail) the put (small fee for putting an object into S3) cost is 10x the storage transfer costs.
  • Costs have to figured into the algorithms you use.
    * In April 330 GB of images downloaded at $.15/GB cost $49. 55mm GETs at $1/mm cost $55. 42mm PUTs at $1/1k cost $420!
    * $100 download and GETs of maptiles.
    * So S3 very cheap for larger files, watch out for lots of short lived small files.
  • CloudFront is 10 times faster than S3 but is more expensive for infrequently viewed files.
    * Makes frequently viewed listings faster.
    * For infrequently viewed listings the CloudFront has to go to S3 to get the file the first time which means you have to pay twice for a file that will be viewed only once.
  • EBS
    * Used on database servers because it's faster than local storage (especially for random writes), blocks of data redundant, and supports easy backups and versioning via cloning.
    * Only 10% cost overhead.
    * Allowed them to get rid of second set of slaves because the backups were so CPU intensive they had to have slaves to do the backups. EBS allows snapshots of running drives so the extra slaves are unnecessary.
    * Databases are I/O bound and the CPU is vastly underutilized so there's extra capacity when you need it.
  • SimpleDB - not using, pretty proprietary. May be of value because you only pay for what you use given how under utilized your own database servers can be.
  • Reserved Instances
    * 1 year for the cost of 6 months and guaranteed (denied one time) to get an instance.
    * Con is tied to an instance type and they want more flexibility to choose instance types as their software changes and take advantage of new instance types as they are released.
  • Rather than having dedicated memcached machines they've scavenged 8 GB of memory from their existing servers.

    Related Sites

  • AWS Start-Up Event DC 2009: HotPads On AWS Slideshow.
  • Cloud Programming Directly Feeds Cost Allocation Back into Software Design
  • AWS Elastic Load Balancer Tutorial