
Building a Unique Data Warehouse

There are many reasons to roll your own data storage solution on top of existing technologies. We've seen stories on HighScalability about custom databases for very large sets of individual data (like Twitter) and large amounts of binary data (like Facebook pictures). However, I recently ran into a unique type of problem. I was tasked with recording and storing bandwidth information for more than 20,000 servers and their associated networking equipment. This data needed to be accessed in real time, with less than a five-minute delay between the data being recorded and the data showing up on customer bandwidth graphs on our customer portal.

After numerous false starts with off-the-shelf components and existing database clustering technology, we decided we had to roll our own system. The real key to our problem (literally) was the ratio of the size of the key to the size of the actual data. Because the tracked metric was so small (a 64-bit counter) compared to its unique identifier (a 32-bit network component ID, a 32-bit timestamp, and a 16-bit data type identifier), existing database technologies would choke on the key sizes.
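The key-to-value imbalance is easy to see by packing the fields described above. Below is a minimal sketch in Python; the exact field order and big-endian byte order are assumptions, since the article only gives the field sizes:

```python
import struct

# Key: 32-bit network component ID, 32-bit timestamp, 16-bit data type
# identifier -- 10 bytes total (big-endian layout assumed).
KEY_FORMAT = ">IIH"
# Value: a single 64-bit bandwidth counter -- 8 bytes.
VALUE_FORMAT = ">Q"

def pack_sample(component_id, timestamp, data_type, counter):
    """Pack one bandwidth sample into its key and value bytes."""
    key = struct.pack(KEY_FORMAT, component_id, timestamp, data_type)
    value = struct.pack(VALUE_FORMAT, counter)
    return key, value

key, value = pack_sample(42, 1254800000, 1, 123456789)
# The 10-byte key is larger than the 8-byte value it indexes,
# which is exactly the ratio the article identifies as the core problem.
```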

Eventually we decided that the best solution was to write our own wrapper for standard MySQL databases. No fancy features, no clustering, no merge tables or partitioning, no extra indexes, just hundreds of thousands of flat tables on as many physical machines as necessary. I chronicled the whole decision-making process in the full article, located here, on our developers' blog.
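One way such a wrapper could route a sample to one of hundreds of thousands of flat tables is simple modular sharding on the component ID. This is a hypothetical sketch, not the author's actual scheme; the machine count, table count, and table naming convention are invented for illustration:

```python
NUM_MACHINES = 20           # hypothetical number of MySQL hosts
TABLES_PER_MACHINE = 10000  # hypothetical flat tables per host

def locate(component_id):
    """Map a 32-bit network component ID to a (machine, table) pair."""
    machine = component_id % NUM_MACHINES
    table = (component_id // NUM_MACHINES) % TABLES_PER_MACHINE
    return machine, f"bandwidth_{table:05d}"

# Every sample for a given component always lands in the same flat
# table, so reads for one customer's graph touch a single machine.
machine, table = locate(123456)
```

Because the mapping is a pure function of the component ID, no central lookup table is needed and new machines can be added by re-mapping only the IDs whose modulus changes.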

Reader Comments (2)

Presuming none of the 200,000 flows is transferring more than 819MiB/sec, this amounts to an aggregate write bandwidth of about 62.5KB/sec of unique data, or about 156GiB/month before compression and excluding indices. An off-the-shelf time-series database could handle this without blinking. Commercial solutions aside, rrdtool could also handle this, with a much smaller storage requirement.

Any column-oriented database would make it cheap to generate graphs on the fly, negating the second problem. This ability is also built into rrdtool.

October 6, 2009 | Unregistered CommenterDavid W
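The back-of-envelope figures in the comment above can be roughly reproduced under one assumption: each flow writes one record per minute, where a record is the 8-byte counter plus the 10-byte key from the article. The per-minute cadence is a guess, since the article only states a 5-minute end-to-end delay:

```python
FLOWS = 200_000                 # the commenter's flow count
RECORD_BYTES = 8 + 4 + 4 + 2    # 64-bit counter + 32-bit ID + 32-bit timestamp + 16-bit type
SECONDS_PER_MONTH = 30 * 86400

# Assumption: one record per flow per minute.
bytes_per_second = FLOWS * RECORD_BYTES / 60
gib_per_month = bytes_per_second * SECONDS_PER_MONTH / 2**30

# bytes_per_second comes out near 60 KB/s and gib_per_month near 145 GiB,
# the same order of magnitude as the comment's 62.5KB/sec and 156GiB.
```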

Oracle's index-organized tables would have been a great fit here.

October 12, 2009 | Unregistered CommenterXailor
