Dealing with multi-partition transactions in a distributed KV solution

I've been getting asked about this a lot lately so I figured I'd just blog about it. Products like WebSphere eXtreme Scale work by taking a dataset, partitioning it using a key and then assigning those partitions to a number of JVMs. Each partition usually has a primary shard and a replica shard, and these shards are spread across the JVMs. A transactional application typically interacts with the data in a single partition at a time, which means the transaction executes in a single JVM. A server box can do M of those transactions per second, and it scales because N boxes do MN (M multiplied by N) transactions per second. Increase N and you get more transactions per second. Availability is also very good because a transaction depends on only 1 of the N servers that are currently online; any of the other (N-1) servers can go down or fail with no impact on the transaction. So, single partition transactions can scale indefinitely from a throughput point of view, offer very consistent response times and are very available because each one only touches a small part of the grid.
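
As a rough illustration of the routing (not WebSphere eXtreme Scale's actual placement code), picking the partition for a key usually boils down to hashing the key into one of a fixed number of partitions, so everything with the same key always lands in the same JVM:

```java
// Minimal sketch of hash-based partition routing; illustrative only, not the
// actual WebSphere eXtreme Scale placement algorithm.
public class PartitionRouter {
    private final int numPartitions;

    public PartitionRouter(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // Every transaction keyed on the same account number resolves to the same
    // partition, and therefore runs against a single primary shard/JVM.
    public int partitionFor(Object key) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```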

All-partition transactions are different. A simple example: we are storing bank accounts in a grid. The key is the bank account number. The value is an account object holding the user's online username and password, address, portal profile, bank account information and so on. Almost all access to the account is by account number. Now, let's look at the login process for the bank's portal. The user doesn't log in with their account number, they log in with their username. We did not partition on username, we partitioned on account number, and we did so for good reason as every other transaction type is keyed on account number.
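
To make the example concrete, here is a hypothetical shape for the value object (the field names are mine, not from any real schema); the point is simply that the account number is the map key while the username is buried inside the value:

```java
import java.io.Serializable;

// Hypothetical account value object for this example; field names are
// illustrative. The map key is the account number, so a lookup by username
// cannot be routed to a single partition.
public class Account implements Serializable {
    private String accountNumber;   // also used as the partition key
    private String username;        // what the user types at login
    private String password;
    private String address;
    private String portalProfile;
    // getters/setters omitted for brevity
}
```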

So, given we can't easily look up a record using the username, what can we do? Option 1: do a parallel search across all partitions to find account objects whose username attribute is 'billy'. We can use a MapGridAgent in WebSphere eXtreme Scale to do this. The agent code is executed in parallel across all partitions, and in each partition it runs a query to find any accounts with a username of 'billy'. One account object should match across the whole grid, and the client which called the agent receives the account number as the result. Problem solved!
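
A sketch of what such an agent might look like. The MapGridAgent name is WebSphere eXtreme Scale's, but the exact signatures and query syntax below are from memory, so treat them as an approximation of the API rather than copy-paste code:

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import com.ibm.websphere.objectgrid.ObjectMap;
import com.ibm.websphere.objectgrid.Session;
import com.ibm.websphere.objectgrid.datagrid.MapGridAgent;
import com.ibm.websphere.objectgrid.query.ObjectQuery;

// Runs once in every partition; each copy only sees the accounts stored in
// that partition. The client blocks until all partitions have answered.
public class FindAccountByUsernameAgent implements MapGridAgent, Serializable {
    private final String username;

    public FindAccountByUsernameAgent(String username) {
        this.username = username;
    }

    public Map processAllEntries(Session session, ObjectMap accountMap) {
        // Query syntax is approximate; the idea is a per-partition query on
        // the non-key attribute.
        ObjectQuery query = session.createObjectQuery(
            "SELECT a.accountNumber FROM Account a WHERE a.username = ?1");
        query.setParameter(1, username);
        Map result = new HashMap();
        for (Iterator it = query.getResultIterator(); it.hasNext();) {
            result.put(username, it.next()); // at most one match expected grid-wide
        }
        return result;
    }

    // Per-key variant of the agent; not needed for this scatter-gather search.
    public Object process(Session session, ObjectMap accountMap, Object key) {
        return null;
    }
}
```

The client side would then be something along the lines of agentManager.callMapAgent(new FindAccountByUsernameAgent("billy")), with the AgentManager obtained from the Account map; again, treat the exact call as an assumption about the API.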

Not so fast. Let's examine this parallel search. How long does it take to run? The client instructs each partition to execute the search code. These searches run in parallel and the client blocks until they all return, so the client basically waits for the slowest partition or server before it continues. How many of these lookup transactions can the grid perform per second? As many as the slowest box can do. If the number of accounts were to double, we could double the size of the grid. That lets us store twice as many accounts, but what about the effect on our parallel search? It's true we are searching twice as fast as before (double the CPUs), but there is also twice as much data to search through, so we are probably achieving the same response time as before. What about throughput? It's still the same: we can only do as many of these transactions per second as the slowest machine, so throughput hasn't changed even though we doubled the size of the grid. The grid is scaling in terms of account capacity and records searched per second, but the throughput number is not scaling at all.
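
Here is a toy model of that scatter-gather behaviour in plain Java (nothing WebSphere eXtreme Scale specific): the client's response time is the maximum of the per-partition times, and because every request occupies every partition, adding partitions does not add lookup throughput.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Toy model of an every-partition search: one task per partition, and the
// caller blocks until the slowest one finishes.
public class ScatterGatherModel {
    public static void main(String[] args) throws Exception {
        int partitions = 8;
        ExecutorService pool = Executors.newFixedThreadPool(partitions);

        List<Callable<Long>> perPartitionSearch = new ArrayList<Callable<Long>>();
        for (int p = 0; p < partitions; p++) {
            perPartitionSearch.add(new Callable<Long>() {
                public Long call() throws Exception {
                    long work = (long) (Math.random() * 50); // pretend per-partition search time in ms
                    Thread.sleep(work);
                    return work;
                }
            });
        }

        long start = System.nanoTime();
        pool.invokeAll(perPartitionSearch); // returns only when every partition has answered
        long elapsedMillis = (System.nanoTime() - start) / 1000000;
        System.out.println("client waited " + elapsedMillis + " ms, roughly the slowest partition");
        pool.shutdown();
    }
}
```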

Availability is also impacted when compared with single partition transactions. A single partition transaction only uses a single partition/server, whereas an every-partition transaction needs the whole grid to be up to complete; the failure of a single box will delay the transaction from completing. Now, products like WebSphere eXtreme Scale recover from a failure very quickly (typically sub-second), but on a large enough grid you'll see response time glitches where maybe a second or so is added while the admins are cycling through servers doing maintenance or something like that. This delay is much less likely to hit a single partition transaction: there's a 1/N chance of it happening, much better than the 100% chance with an every-partition transaction.

This lack of throughput scalability for every-partition transactions is a problem, as login is an operation whose throughput needs to go up as the web site becomes more popular. So, using parallel search for an operation which needs to scale from a throughput point of view looks like a bad idea. What else can we do?

We could partition on username instead of account number, but then we'd have the search problem for all the account-number-based transactions, which are the bulk of all transactions. Besides, users like being able to change their username, which would be a nightmare if everything was partitioned on it.

We could cache the results of looking up usernames with parallel searches. The cache would be a Map whose key is the username and whose value is the account number. A Loader attached to the Map would do a parallel search with a MapGridAgent whenever its Loader#get method is called on a cache miss. The problem here is that while the cache warms up we'll get a lot of cache misses and therefore a lot of parallel searches. Not good either.
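
The shape of that approach, sketched in plain Java with hypothetical helper names rather than a real WXS Loader implementation: hit the username-to-account map first and fall back to the parallel search on a miss. Every miss is still an every-partition search, which is exactly why the warm-up period hurts.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache-aside lookup: the cache maps username -> account number,
// and a miss falls back to the expensive every-partition search.
public class UsernameLookupCache {
    private final Map<String, String> usernameToAccount = new ConcurrentHashMap<String, String>();
    private final ParallelAccountSearch searcher; // wraps the MapGridAgent call (hypothetical)

    public UsernameLookupCache(ParallelAccountSearch searcher) {
        this.searcher = searcher;
    }

    public String accountNumberFor(String username) {
        String account = usernameToAccount.get(username);
        if (account == null) {
            // Cold cache: this is still a full scatter-gather across the grid.
            account = searcher.findAccountNumber(username);
            usernameToAccount.put(username, account);
        }
        return account;
    }
}

// Hypothetical interface over the parallel search from the earlier sketch.
interface ParallelAccountSearch {
    String findAccountNumber(String username);
}
```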

Or, we could maintain a persistent reverse index. This index is a Map whose key is the username and whose value is the account id, backed by a database table or other long-term persistence mechanism. Now, when a user logs in, we simply do a Map.get("billy") and receive the account id with a single partition transaction, and the throughput of those does scale with grid size. We do have to maintain this reverse index: if the user changes their username then we need to make sure the reverse index is updated, and so on.
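
Maintaining the reverse index on a username change might look roughly like this. The Session/ObjectMap calls follow WebSphere eXtreme Scale's API as I recall it and the map names are from this example, so treat the details as assumptions. Note the two maps are keyed differently, so the steps generally land in different partitions; keeping them consistent is the maintenance cost mentioned above, and the steps are ordered so a concurrent login never finds a username that resolves to nothing.

```java
import com.ibm.websphere.objectgrid.ObjectGrid;
import com.ibm.websphere.objectgrid.ObjectMap;
import com.ibm.websphere.objectgrid.Session;

// Rough sketch of keeping the reverse index in step when a username changes.
// API details are from memory; error handling and retries omitted.
public class UsernameChanger {
    public void changeUsername(ObjectGrid grid, String accountNumber,
                               String oldUsername, String newUsername) throws Exception {
        Session session = grid.getSession();

        // 1. Point the reverse index at the account under the new username.
        session.begin();
        ObjectMap revIndex = session.getMap("UsernameToAccount");
        revIndex.insert(newUsername, accountNumber);
        session.commit();

        // 2. Update the account object itself (a different partition, keyed by account number).
        session.begin();
        ObjectMap accounts = session.getMap("Account");
        Account account = (Account) accounts.getForUpdate(accountNumber);
        account.setUsername(newUsername);
        accounts.update(accountNumber, account);
        session.commit();

        // 3. Remove the stale reverse index entry last.
        session.begin();
        revIndex = session.getMap("UsernameToAccount");
        revIndex.remove(oldUsername);
        session.commit();
    }
}
```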

Login is now a matter of looking up the username in the reverse index map (revMap.get("billy") returning 1234) and then retrieving the account object with a second get to check the password (accMap.get("1234") returning the account object containing the password). This is a much better solution than a parallel search. It is effectively a query cache: we are caching the results of the parallel search in a persistent map. We have converted a parallel transaction into a single partition transaction and, as a result, our login operation is now throughput scalable.
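
Put together, the login path becomes two single-partition gets. Again, the ObjectMap calls are from memory, the map names are from this example and getPassword comes from the hypothetical Account class sketched earlier.

```java
import com.ibm.websphere.objectgrid.ObjectGrid;
import com.ibm.websphere.objectgrid.ObjectMap;
import com.ibm.websphere.objectgrid.Session;

// Login as two single-partition gets: username -> account number, then
// account number -> account object. API details from memory.
public class LoginService {
    public boolean login(ObjectGrid grid, String username, String password) throws Exception {
        Session session = grid.getSession();

        session.begin();
        ObjectMap revIndex = session.getMap("UsernameToAccount");
        String accountNumber = (String) revIndex.get(username);   // e.g. "billy" -> "1234"
        session.commit();

        if (accountNumber == null) {
            return false; // unknown username
        }

        session.begin();
        ObjectMap accounts = session.getMap("Account");
        Account account = (Account) accounts.get(accountNumber);  // "1234" -> account object
        session.commit();

        return account != null && password.equals(account.getPassword());
    }
}
```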

Multi-partition transactions can be great for searching large amounts of data in parallel. The records searched per second does increase with grid size, and larger grids can store more data, but the throughput typically stays the same as the grid grows (assuming the data size grows linearly with grid size). This means using parallel operations for something whose throughput will grow as your application scales up is a mistake: the throughput of the grid has nothing to do with the grid size, it's limited to the throughput of the slowest box.

You need to convert that parallel search operation into a single partition get if you want the system to scale from a throughput point of view. Caching the parallel searches or using a reverse index (effectively a disk-persistent query cache) is the normal way to handle this conversion.

How can you make an every-partition operation scale from a throughput point of view if you can't use reverse indexes? Use multiple grids which are all the same and round robin the requests over them. Each grid can do M transactions per second and N grids give you MN per second. If you need throughput-scalable every-partition transactions then this is probably the only way to get it. Ever wonder why Google needs millions of servers...
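
A minimal sketch of that round-robin idea, with a hypothetical GridClient standing in for a connection to one complete copy of the data:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical round-robin over N identical grids: each every-partition search
// goes to exactly one grid, so N grids give roughly N times the throughput.
public class GridRoundRobin<R> {
    private final List<GridClient<R>> grids;
    private final AtomicInteger next = new AtomicInteger();

    public GridRoundRobin(List<GridClient<R>> grids) {
        this.grids = grids;
    }

    public R searchEverywhere(String username) {
        int i = Math.floorMod(next.getAndIncrement(), grids.size());
        return grids.get(i).parallelSearch(username);
    }
}

// Stand-in for a client connected to one of the identical grids.
interface GridClient<R> {
    R parallelSearch(String username);
}
```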

This article is really talking about transactions that involve every partition, like a search. Some transactions may use two partitions, for example, or some small number of partitions relative to the total, but that's for another blog entry...