As software developers, we value abstraction. The simpler the API, the more attractive it becomes. Arguably, MongoDB’s greatest strengths are its elegant API and its agility, which let developers simply code.
But when MongoDB runs into scalability problems on big data, developers need to peek underneath the covers to understand the underlying issues and how to fix them. Without understanding, one may end up with an inefficient solution that costs time and money. For example, one may shard prematurely, increasing hardware and management costs, when a simpler replication setup would do. Or, one may increase the size of a replica set when upgrading to SSDs would suffice.
This article shows how to reason about some big data scalability problems in an effort to find efficient solutions.
Defining the Issues
First, let’s define the application. We are discussing MongoDB applications. That means we are addressing a document-store database that supports secondary indexes and sharded clusters. In the context of other NoSQL technologies, such as Riak or Cassandra, we may discuss these I/O bottlenecks differently, but this article just focuses on the properties of MongoDB.
Second, what does the application do? Are we processing transactions on-line (OLTP) or are we doing analytical processing (OLAP)? For this article, we are discussing OLTP applications. OLAP applications have big data challenges that MongoDB may or may not be able to address, but this article focuses on OLTP applications.
Third, what’s big data? By big data, we mean that we are accessing and using more data than we can fit in RAM on a single machine. As a result, if the data resides on one server, then most of it must reside on disk, and require I/O to access. Note that we are not discussing scenarios where the database is large, but the data accessed or used (sometimes called the “working set”) is small. An example would be storing years of data, but the application only frequently accesses the last day’s worth.
Fourth, what are the limiting factors in OLTP applications with big data? In short: I/O. Hard disk drives do at most hundreds of I/O’s per second. RAM, on the other hand, accesses data millions of times per second. The disparity in these limits causes I/O a bottleneck for big data applications.
Lastly, how do we solve our I/O bottlenecks? With analytical thinking. Formulas and direct instructions can get us most of the way there, but a lasting solution requires understanding. Users must look at the I/O characteristics of their application and make design decisions to best fit those characteristics.
To solve I/O bottlenecks, we first need to understand what database operations induce I/O.
One can argue that MongoDB, and many other databases, underneath all of the bells and whistles, perform three basic operations:
● Point Query: Find a single document. Given the location of a document somewhere either on disk or in memory, retrieve the document. On big data, where the document is likely not in memory, this operation probably causes an I/O.
● Range Query: Find some number of documents in succession in an index. This is generally a MUCH more effective operation than a point query, because the data we are looking for is packed together on disk and brought into memory with very few I/Os. A range query used to retrieve 100 documents may induce 1 I/O, whereas 100 point queries to retrieve 100 documents may induce 100 I/O’s.
● Write: Write a document to the database. For traditional databases such as MongoDB, this may cause I/O. For databases with write-optimized data structures, such as TokuMX, this induces very little I/O. Unlike traditional databases, write-optimized data structures are able to amortize the I/O performed against many inserts.
Understanding the I/O implications of these three basic operations leads to understanding the I/O used by MongoDB statements made against a database. MongoDB takes these three basic operations and builds four basic user level operations:
● Inserts. This writes a new document to the database.
● Queries. Using an index on a collection, this does a combination of range queries and point queries. If the index is a covering index or a clustering index (conceptually the same as TokuDB for MySQL’s clustering index), then the query is likely doing just range queries. If not, then a combination of range queries and point queries are used. I explain these concepts in an indexing talk. The talk uses MySQL, but all of the concepts apply to MongoDB and TokuMX as well.
● Updates and Deletes. These are a combination of queries and writes. A query is used to find the documents to be updated or deleted, and then writes are used update or remove the found documents.
Now that we understand the cost model, to resolve I/O bottlenecks, the user needs to understand where the application induces I/O. This is where we need to break some abstraction and peek at how the database behaves. Does the I/O come from queries? If so, how are the queries behaving that is causing the I/O? Does it come from updates? If it comes from updates, is it coming from the query used in the update or the insert used in the update? Once the user understands what is inducing the I/O, steps can be taken to resolve the bottleneck.
Assuming we understand the I/O characteristics of the application, we can talk about several approaches to addressing them. The approach I like to take is this: first attack the problem with software, and when that is not enough, then attack the problem with hardware. Software is cheaper and easier to maintain.
Possible Software Solutions
A possible software solution is one where we change the software or the application to reduce I/O usage. Here are possible solutions for different bottlenecks
Problem: Inserts causing too much I/O.
Problem: Queries causing too much I/O.
Possible Solutions: Use better indexing. Reduce point queries by using range queries instead.
In my talk, “Understanding Indexing”, I explain how to reason about indexes to reduce I/O for queries. It’s difficult to summarize the talk in one paragraph, but the gist is as follows. One can reduce the I/O of the application by avoiding doing individual point queries to retrieve each document. To do this, we use covering or clustering indexes that smartly filter the documents analyzed by the query, and can report results using range queries.
Better indexing may not be sufficient. If you have an OLTP application and your queries are essentially point queries (because they retrieve very few documents), then even with proper indexes, you may still have an I/O bottleneck. In such cases, a hardware solution is probably necessary.
Also, additional indexes increase the cost of insertions, as each insertion must keep the indexes up to date as well, but write-optimized databases mitigate that cost. This is where we need to approach the problem with an understanding of our application. For some applications, this is practical, and for others it is not.
Problem: Updates/Deletes cause too much I/O
Solution: Combine the solutions above.
Updates and deletes are tricky in that they are a combination of queries and inserts. Improving their performance requires a good understanding of the cost of the operation. Which part of the operation induces I/O? Is it the query? If so, one can try to improve the indexing. Is it the write? Is it both? Based on what part is inducing the I/O, we apply the solutions above.
One mistake many make is taking a write-optimized database such as TokuMX and expect it to eliminate I/O bottlenecks of updates and deletes without changing any of the indexing. A write-optimized database is not enough. The implicit query within an update/delete must be handled as well.
Possible Hardware Solutions
As mentioned above, when software solutions are not enough we look to hardware. Let’s look at these possibilities, and analyze their benefits and drawbacks:
● Buy more memory to hopefully get more, if not all, of your working set into memory.
● Increase your IOPS by moving to an SSD.
● Buy more machines and move to a scaled out solution. This can be:
○ Read scaling via replication
Buying more memory
RAM is expensive, and there is only so much RAM one can put on a single machine. Simply put, if data is large enough, keeping it in RAM is not a feasible option. This solution may work for a wide range of applications, but our focus here is tackling applications that cannot do this.
Moving to SSDs
Making the storage device an SSD is a practical solution for improving throughput. If I/O is the limiting factor of your application, an increase in IOPS (I/Os per second) naturally leads to an increase in throughput. This may be simple and practical.
The cost model of SSDs is different than the cost model of hard disks. SSDs dramatically improve your I/O throughput, but are not cheap. They are smaller, cost more, and wear out faster. Therefore, the cost per GB of SSDs is quite higher than the cost per GB of hard disks. To keep costs down, data compression becomes a key.
So, the cost of the hardware increases, but the cost of managing the application does not.
Read scaling via replication
Read scaling with replication is effective for applications where queries are the bottleneck. The idea is as follows:
● Use replication to store multiple copies of your data on separate machines
● Distribute reads across the machines, thereby improving read throughput
By taking the reads that used to bottleneck on one machine and spreading them out, more resources are available for the application to either handle more concurrent queries with the same write workload or to increase the write workload on the application.
If inserts, updates, or deletes are your bottleneck, then replication may not be very effective, because the write work is duplicated on all servers that are added to the replica set. The machine that takes the data input (called the master in MySQL or primary in MongoDB) will still have the same bottleneck.
Sharding partitions your data across different replica sets based on a shard key. Different replica sets in the cluster are responsible for ranges of values in the shard key space. So, an application’s write throughput is increased by spreading the write workload across separate replica sets in a cluster. For high write workloads, sharding can be very effective.
By partitioning the data by ranges in the shard key space, queries that use the shard key can effectively do range queries on a few shards, making such queries very efficient. If one makes the shard key a hash, then all range queries must run on all shards in the cluster, but point queries on the shard key run on single shards.
Because MongoDB is schema-less and does not support joins, sharding is elegant and relatively easy to use. If the solutions above are not enough to handle your application and more resources are required, then one can argue that sharding is inevitable.
Nevertheless, sharding is arguably the most heavyweight solution and has a high cost. For starters, your entire hardware budget is multiplied. You do not just add machines to a sharded setup, you add entire replica sets. You need to add and manage config servers. Due to these costs, one should really consider if sharding is truly necessary. Usually, any one of the solutions presented above are cheaper.
Another big challenge is selecting a shard key. A good shard key has the following properties:
● Many (if not all) of your queries use the shard key. Any query that does not use it must run on all shards. This can be a pain point.
● The shard key should do a good job of distributing writes to different replica sets in the cluster. If all writes are directed to the same replica set in the cluster, then that replica set becomes a bottleneck for writes, just as it was in a non-sharded setup. This makes something like a timestamp a very bad shard key.
These requirements are not always easy to fill. Sometimes, a good shard key does not exist, making sharding ineffective.
In conclusion, many solutions may work, but none is always guaranteed, not even sharding. This is why understanding the characteristics of the application is crucial. These solutions are tools. How best to use the tools, is up to the user.