I met the CodeFutures folks, makers of dbShards, at Gluecon. They occupy an interesting niche in the database space, somewhere between NoSQL, which jettisons everything SQL, and high end analytics platforms that completely rewrite the backend while keeping a SQL facade.
High concept: I think of dbShards as a sort of commercial OLTP mashup of features from HSCALE (partitioning) + MySQL Proxy (transparent intermediate layer) + Memcached (client side sharding) + Gigaspaces (parallel query) + MySQL (transactions).
You may find dbShards interesting if you are looking to keep SQL, need scale out writes and reads, need out of the box parallel query capabilities, and would prefer to use a standard platform like MySQL as a base. To learn more about dbShards I asked Cory Isaacson (CEO and CTO) a few devastatingly difficult questions (not really).
Who are you, what is dbShards, and what problem was dbShards created to solve?
I’m Cory Isaacson, CEO/CTO of CodeFutures Corporation, the makers of dbShards. dbShards was created specifically to address the scalability needs of general purpose commodity databases, particularly those that require high transaction volumes, large growth, and high availability. dbShards operates in conjunction with commodity databases to make database sharding easy to implement and manage, accommodating rapid growth in data size and transaction volumes.
What kind of customer are you targeting with dbShards? Who ends up using your product and why?
The primary customers for dbShards fit into two categories: 1) fast-growing Web or online applications (e.g., Gaming, Facebook apps, social network sites, analytics) and 2) any application involved in high volume data collection and analysis (e.g., Device Measurement). Any application that requires high rates of read/write transaction volumes with a growing data set is a good candidate for the technology.
Can you compare Terracotta (and ehcache) vs. Memcache vs dbShards?
Memcache and Terracotta perform very similar jobs, and are both compatible with dbShards. Memcache can store objects in memory, reducing reads to the database. We have several customers that use it, and it is helpful, but our experience is that this approach can only go so far. With large databases you cannot, of course, load the entire thing into memory, and if the database is write-intensive it does not help at all. Terracotta is similar, but adds replication across servers (they actually replicate Java JVM memory across nodes). These products require a special API, and the application has to interact with them directly. The technique has been around for many years, and is effective, but again limited if an application continues to scale (particularly for writes, where there is no benefit).
dbShards is a full, general purpose database scalability platform, allowing you to scale both reads and writes across multiple nodes. We have seen that the DBMS engines themselves are excellent at “most recently used” caching, so by sharding your database you change the data size to memory/CPU balance. Ideally we balance the shards such that at least the indexes are in memory, often much of the data too. That means that any normal SQL statements automatically take advantage of the native DBMS’s raw performance for reads (which is very fast). We have scaled to 10s of 1000s of single row SELECT statement reads/second with just 6 servers in a data center environment (single row SELECT statements are the most intensive type of read in a sharded environment).
On the write side it is critical that data be persisted in the database, and fully reliable. dbShards handles this with our high-performance reliable replication, which guarantees each write across at least 2 servers for a single shard. Then we shard across multiple servers and provide linear scalability (i.e., 4 shards give 4X the write performance of 1 shard).
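The write guarantee described above can be sketched in a few lines. This is a minimal illustration only, assuming an in-memory `Server` stand-in for a shard's primary and replica; it is not dbShards' actual replication protocol, just the contract it describes: a write succeeds only once it is durable on at least two servers.

```python
# Sketch of "a write is acknowledged only after it reaches 2 servers".
# Server is a toy stand-in for a real database node on one shard.
class Server:
    def __init__(self):
        self.log = []  # stands in for the durable write-ahead log

    def apply(self, stmt):
        self.log.append(stmt)
        return True

def reliable_write(stmt, primary, replica):
    # Both copies must persist the statement before the client sees success.
    if primary.apply(stmt) and replica.apply(stmt):
        return True
    raise RuntimeError("write not durable on 2 servers")

primary, replica = Server(), Server()
reliable_write("INSERT INTO orders VALUES (1)", primary, replica)
assert primary.log == replica.log  # both servers now hold the write
```

Because each shard replicates independently, adding shards multiplies write capacity without weakening this two-copy guarantee.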
Lastly, dbShards offers Parallel Query, where users can perform what we call a “Go Fish” query across all shards, with the results collated automatically by the environment. This is very similar to the MapReduce approach, and can provide aggregate reads at extremely high speeds. An example would be: Get a count of all customers who purchased the latest Stephen King book last month. In many cases using this approach we can provide much better than linear improvement as shards are added, and this type of access cannot be supported at all by a caching layer.
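The "Go Fish" pattern is easy to picture as scatter-gather code. Below is a minimal sketch, assuming fake shard dictionaries in place of real MySQL connections and a hypothetical `count_on_shard` helper; the real dbShards driver does this fan-out and collation transparently inside a single SELECT.

```python
# Scatter-gather sketch of a "Go Fish" aggregate query across shards.
from concurrent.futures import ThreadPoolExecutor

def count_on_shard(shard):
    # In a real setup this would run something like
    # "SELECT COUNT(*) FROM purchases WHERE book_id = ? AND month = ?"
    # against one shard's database; here the count is precomputed.
    return shard["purchase_count"]

def go_fish_count(shards):
    # Fan the query out to every shard in parallel, then collate.
    # For COUNT, collation is simply summing the per-shard results.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        return sum(pool.map(count_on_shard, shards))

shards = [{"purchase_count": n} for n in (120, 87, 95, 143)]
print(go_fish_count(shards))  # -> 445, the count across all 4 shards
```

Because each shard scans only its own slice of the data, and the scans run concurrently, the aggregate can complete far faster than the same query against one monolithic database.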
Customers that have used this technique successfully store common, frequently accessed data in the cache layer, but keep important data directly in the database with dbShards.
In summary, every technology has its use and purpose. Caching is great, and it can give you a very quick improvement for specific problems, but for general scalability and read/write intensive applications, relying on proven database technologies like MySQL and Postgres is still a requirement (which of course dbShards is designed to get the most from).
Why should someone use dbShards over Cassandra or CouchDB or some other option?
Every technology has its area of applicability. The NoSQL databases are very good for high volume writes, but do not support relationships natively; the developer has to do that. For some applications this is a great fit; for others it can be a real challenge, as you end up hand-coding all of your data relationships and their maintenance. Since dbShards uses commodity database engines like MySQL to do the actual data management, all of the natural capabilities are there: Joins, Referential Integrity, Aggregation, etc. These engines are fully proven components through years of development and intensive usage; that's what the product capitalizes on and improves upon. So it's really a matter of application requirements: what is the best tool for the job.
Can you characterize the performance, scalability, and fault tolerance of dbShards?
dbShards is the first packaged implementation of Database Sharding, a technique which involves partitioning data by rows across a number of distributed servers. As such it delivers linear scalability improvements as shards are added to the environment. Further, dbShards utilizes our own plug-in driver technology, so the application in fact connects directly to each native database running on each shard server. This provides the same or better latency (performance) than you would expect with a native database connection, with the tremendous advantage of dealing with databases that are far smaller in size when compared to a monolithic setup. We know from extensive testing that really large databases are slow and small ones are fast. The reason for this is simply the balance between CPU, memory, and database size: get that right for a given application utilizing a sharding approach, and everything flies.
One issue with sharding is the introduction of more failure points, given the very nature of the distributed database structure across multiple servers. To address this we developed our patent-pending dbShards/Replicate technology. This feature gives us reliable writes across at least 2 servers per shard, in a high-performance, scalable package. In other words, our reliable replication is designed to keep up with a sharded environment, which makes it unique. It supports failover, and in fact customers can perform planned failover so that specific shard databases can be taken offline for maintenance without disruption of the application. We are extending this to support disaster recovery at remote sites in the near future.
How much does it cost? Including support and maintenance for lots of nodes.
dbShards/Enterprise is designed to be installed on your own hardware, and starts at $5,000 per year per server. We also offer dbShards/Cloud which works in environments such as Amazon EC2, and its pricing is $0.60/instance/hour. Both editions include support with the subscription, and the pricing goes down as the number of servers increases.
What are operations like with dbShards?
dbShards provides a management console and monitoring framework, providing continuous visibility into the running dbShards environment. The management console is very easy to use, and allows you to view the health of all components, restart components if required, and perform other routine tasks. Once a customer gets the product set up and has automated backups in place, it takes very little attention to operate.
Are there any single points of failure? How do you handle fault tolerance? Replication? Cross data center operations?
No. There is no single point of failure or shared “bottleneck” component in the environment, everything is redundant and implements a complete “shared nothing” approach — the only way to get the type of scalability and reliability we are after. (See above re: cross data center operations, that is coming soon).
How is node rebalancing handled? Is data automatically rebalanced when node limits are reached? In the cloud, using an elastic architecture, is data rebalanced when nodes are added and deleted?
dbShards implements a concept called Virtual Shards, where we design more shards into the architecture than are actually deployed physically. Because we have the ability to take a replicate server for a shard offline without disruption, re-sharding (or “Shard Mitosis” as we like to call it) can then be performed by separating out one physical shard into two or more physical shards along virtual shard definition boundaries.
For example, if a database is sharded by customer_no, then we might define 200 Virtual Shards, but start with only 2 Physical Shards. Then it is a simple matter to expand the number of Physical Shards as demand grows. The same process is used in the cloud. Currently the shard strategy and configuration is defined by a system administrator, but we have plans to improve upon that going forward using Dynamic Sharding.
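The virtual-to-physical mapping described above can be sketched as follows. This is an illustrative scheme only (modulus on customer_no, contiguous virtual-shard ranges per server), not dbShards' actual hashing strategy; the counts match the 200-virtual / 2-physical example in the text.

```python
# Sketch of Virtual Shards: many fixed virtual shards mapped onto a
# small, growable set of physical shards.
VIRTUAL_SHARDS = 200

def virtual_shard(customer_no: int) -> int:
    # A row's virtual shard never changes once assigned.
    return customer_no % VIRTUAL_SHARDS

def physical_shard(customer_no: int, physical_count: int) -> int:
    # Each physical shard owns a contiguous range of virtual shards,
    # so "Shard Mitosis" just splits one range between two servers.
    return virtual_shard(customer_no) * physical_count // VIRTUAL_SHARDS

# With 2 physical shards, virtual shards 0-99 live on server 0 and
# 100-199 on server 1. Growing to 4 physical shards splits each range
# in half; rows only move from a server to its newly added sibling.
print(physical_shard(12345, 2))  # -> 1 (virtual shard 145)
print(physical_shard(12345, 4))  # -> 2 after a split
```

The key property is that re-sharding never needs to rehash every row: only the virtual shards carved off to a new server move, which is why a replica can be split out along virtual-shard boundaries without disruption.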
Is SQL fully supported across all the shards? What limitations are there?
We support all dynamic SQL that the database supports. The dbShards driver examines each query and directs it to the proper shard based on the “shard key”, or directs it to all shards using our parallel query mechanism (which we like to call a “Go Fish” query). Go Fish queries are automatically distributed and results collated in a very high performance fashion, so the application developer issues a single SELECT and gets back a single result set. There are some rules we have developed to make an application Shard Safe, but mainly these are performance driven and make sense in any partitioned environment.
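The routing decision the driver makes can be sketched simply. Assume the shard key has already been extracted from the statement (a real driver parses the SQL and consults the shard map; the modulus here is a placeholder for whatever shard strategy is configured):

```python
# Sketch of driver-side routing: single-shard vs. fan-out ("Go Fish").
def route(shard_key_value, num_shards):
    if shard_key_value is not None:
        # The statement filters on the shard key, so exactly one shard
        # can hold the matching rows: send it there directly.
        return [shard_key_value % num_shards]
    # No shard key in the query: fan out to all shards and let the
    # driver collate the per-shard result sets into one.
    return list(range(num_shards))

print(route(7, 4))     # -> [3]: a single-shard read
print(route(None, 4))  # -> [0, 1, 2, 3]: a Go Fish query
```

This is why queries keyed on the shard key stay as fast as a native single-database connection, while everything else falls back to the parallel path.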
Are distributed transactions supported? If so, is linear growth maintained?
We have done distributed transactions, but don't recommend them as there is a performance penalty. In all of our applications to date we have not found a case where distributed transactions were a must.
Anything else you would like to add?
Database Sharding is typically a lot of work, and requires major modifications to an application to support it, but with dbShards the whole thing is transparent to the application developer. Our support for sharding, high availability, Global Table replication, Parallel Query, and other features really make it much more practical for a wide audience to implement. The scalability is excellent, and we have customers with viral applications who have been able to fully unleash their marketing efforts, comfortably handling the increased user load in all application tiers due to the dbShards implementation.
I'd really like to thank Cory Isaacson for taking the time to answer my questions. I think his answers will give you a pretty good picture of why you may or may not want to adopt dbShards in your own product.
- HSCALE - Handling 200 Million Transactions Per Month Using Transparent Partitioning With MySQL Proxy
- DbShards Part Deux - The Internals
- dbShards — a lot like an MPP OLTP DBMS based on MySQL or PostgreSQL