advertise
« Facebook at 13 Million Queries Per Second Recommends: Minimize Request Variance | Main | NoSQL Took Away the Relational Model and Gave Nothing Back »
Monday
Nov012010

Hot Trend: Move Behavior to Data for a New Interactive Application Architecture

Two forces account for the trend of moving behavior to data: larger values used in key-value stores and spotty cloud networks. For some time we've seen functions pushed close to data with MapReduce, which is a batch process, but we are now seeing this model extend to interactive applications, which match the current emphasis on highly scalable, real-time, event driven applications.

To see the trend look at the increasing support for collocated behavior at the datastore level:

Isn't this Just Stored Procedures All Over Again?

But wait I hear you say, we've always had stored procedures in databases for this exactly the same efficiency reasons. Why is this happening all over again? Well, to really understand that you'll need to watch the entire Battle Star Galactica series. But, in short: 1) the move to the cloud, 2) large values, 3)  stored procedures are now horizontally scalable and won't become the bottleneck, 4) more palatable implementation languages (PL/SQL anyone?).

The Problem of the Cloud, Spotty Networks, and Large Values

For the past decade or so, to overcome non-scalable databases, objects were put in caches. The caches were sharded, or spread horizontally in order to scale. The logic tier of an application was also scaled horizontally across nodes, using the caching tier for fast and scalable data access. An application does a get, processes the data, and does a set with the new values. This works great, until...

...until you move to the cloud and you no-longer have direct control over the quality of your network. One of the biggest complaints about the Amazon cloud is its poor network. Many developers have found Amazon's network lossy, with a high and variable latency. Connections consistently drop which has caused a number to consider moving out of the cloud to get more control over the network.

Spotty networks are also exacerbated by large values. With the move to key-value stores we've seen values become larger as they are denormalized. Values have become fatter as more and more references have been transformed into contained objects. A user object, for example, may now contain its social network. If you have 10,000 people in your social network that user object just became large. And this trend is accelerating. These large objects over a spotty network are a problem. One partial solution is for databases to support programmers with easier to use references. 

When you control your own equipment you can of course use powerful SANs, SSDs, fast fibre channel connections, more disks, more spindles, write caches, smart controllers, and so on. If you've made the move to the cloud however, these tricks are not available to you, so you have to make do with the resources that are available.

Sharding Could Remove Stored Procedures as the Bottleneck

Another solution is to move the functionality that was in the logic tier to the database tier. Won't this become a bottleneck as it always did before? Not necessarily. The difference now is that databases themselves are sharded, which means the functions running in the database may not become the bottleneck.

Stored procedures in the database bring up all sorts of locking and threading issues that could kill performance. The clean get/set a buffer from a slab type logic is certainly complicated by having application logic running at the same time. MongoDB takes a giant lock while a stored proc executes. Membase is getting around this somewhat by moving the code to a separate process and leaving the data manger clean, which of course adds another hop. VoltDB puts stringent requirements on how long a stored procedure can operate.

This trend is why in Troubles With Sharding - What Can We Learn From The Foursquare Incident? I defined sharding as: The goal of sharding is to continually structure a system to ensure data is chunked in small enough units and spread across enough nodes that data operations are not limited by resources. The trick is to keep stored procedures within bounds by sizing shards not simply on memory, but on CPU and IO as well. A little different approach than we have now, but one that may be necessary if data and behavior are to be collocated.

The Server Side Environment Needs to be Richer

Another weakness of these systems is that although they will have nicer server side languages (Java, JavaScript, etc), the programming model on the server side is still impoverished. It would great to see something like Node.js merged in with the database so programmers would have a real application platform to build on.

An obvious question is how should application logic be divided? By being able to shard on multiple criteria the typical arguments against using the database as an application container are muted a bit because the same horizontal partitioning at the application layer can be pushed down to do the database. Given this, should all application logic be pushed into the database? Should some reside in the database and some outside? How do you know where the line is? Should a web server be merged into each shard so clients can directly access code via REST? Could or should this replace the service tier? How does this work with search, secondary indexes and other database oriented features? It's these kind of difficult to answer questions that still make keeping behavior out of the database very attractive.

What Does All this Seem to Add Up To?

The next generation cloud enabled, interactive application architecture may be:

  • A rich, asynchronous, evented server side application environment, coded in a dynamic language with direct, low latency access to a collocated and integrated datastore.
  • Dynamically sharded to ensure both scalable data and behavior.

Or maybe that's just "Silly me. Silly, silly me."

Reader Comments (10)

MondgoDB? A typo, I guess.

November 1, 2010 | Unregistered CommenterSteven Prostre

Isn't this Just Stored Procedures All Over Again? <-- yes, but it's for NoSQL. :)

November 1, 2010 | Unregistered Commenterjstephens

That's the french spelling Steven :-)

November 1, 2010 | Registered CommenterTodd Hoff

I added similar functionality to Memcache using Lua. Well not quite the regular memcache, but a compatible implementation in java using NIO. The main benefit are:

1) Atomicity (CAS only tells something is wrong, retry)
2) network bandwidth and latency (consider adding one element to a 1000 element list and if this is high frequency update, consider number of retries)

http://chakpak.blogspot.com/2010/10/scriptable-object-cache.html

November 2, 2010 | Unregistered CommenterRohit Karlupia

Don't we really want to move from data to behavior? People make decisions based on the context the data describes, not on the data itself. This low level data centric view is part of the problem, isn't it? We can't move low level data around the cloud in a timely or flexible fashion.

November 2, 2010 | Unregistered CommenterCharlie

That's cool Rohit. I'd love to see something similar done for Node.js. I looked at using Redis, which has a very clean code base, and I would think would make a good match.

Charlie, I don't really know. The data we are talking about is usually denormalized so much of the context is there. I don't think the service layer is gone though. Some application component needs to run the state machines that make the required requests and integrates the responses. This could be distributed to any data node, but that would probably kill those nodes. What may change is that less work is done on the service layer and more work is done on the data layer. Less raw data would need to be brought into the service layer for processing, which on a cloud network is a win.

November 2, 2010 | Registered CommenterTodd Hoff

Todd

Yes - it's nice to see this sort of "forward" thinking - as in, bringing the data forward, alongside the processing.

Your "collocated and integrated datastore" will give problems for

- demanding applications (too slow, difficult to manage the move) and
- EC2 users (where your virtual disk vanishes with a reboot).

The other alternative to a datasore is putting all data into RAM - as in your RAMClouds article.

Then you're back to in-memory data grids or equivalent, which probably explains the big push in this area from the big industry players.

So, if you've got a tier of service+data nodes, sharded based on data affinity, the design issues are how to

- work with a spotty network
- aggregate and collect the large data values.

I'm not sure these is a "no-brain" answer to this - you have to consider each data domain individually. The patterns we use are

- use hot backups for all nodes - for when a node goes AWOL
- give the application programmer an easy way to group data, so that data with high affinity is placed in the same node
- send service invocations (and map-reduce searches with a known data target) using the data affinity to the master data record.

Intuitively, this gives programmers to tools to achieve optimal data placement. I agree it's not automatic ... but years of experience with Tuxedo proves it can perform!

November 3, 2010 | Unregistered CommenterMatthew Fowler

@Todd

Be sure I am not a GigaSpaces employee, neither a GigaSpaces user ;-)
I can not tell if GigaSpaces or GridGain matches *all* criteria you have written into your conclusion.
But, AFAIK GigaSpaces (and GridGain ?) is a step in the right direction...

November 5, 2010 | Unregistered CommenterDominique De Vito

@Todd

[I re-send the first part of my msg, because only the second part has hit your server]

You wrote:
""
The next generation [...] application architecture may be:
- A rich, asynchronous, evented server side application environment, coded in a [...] language with direct, low latency access to a collocated and integrated datastore.
- Dynamically sharded to ensure both scalable data and behavior.
""

As far as I have read (and understood it), GigaSpaces (GS) looks like to match the above criteria as code and data are colocated within GS, and developers have the ability to express, through annotations, the links between data, so that they will be colocated inside a server (a "processor unit" according to GS naming) - it's all about "data affinity" as named by GS writers. I see GridGain as an app server with quite similar features too, even if I see GridGain as a kind of "little brother" of GigaSpaces.

See the GigaSpaces whitepapers: http://www.gigaspaces.com/WhitePapers

In "Data-Awareness and Low-Latency on the Enterprise Grid", one could find:


There are two main technical requirements when enabling data awareness on the Enterprise Grid:
- The Enterprise Grid must know what data is stored on which IMDG instances.
- There must be a way to guarantee data affinity—tasks must always execute with the relevant data coupled to them.

In "The Scalability Revolution - From Dead End to Open Road", GS writers point the "data affinity" goal:


The SBA Value Proposition
SBA removes the scalability dead-end facing today’s tier-based applications, and guarantees improved scalability in four dimensions: (1) fixed, minimal latency; (2) low, predictable hardware/software costs; (3) reduced development costs; and (4) consistent data affinity.

In "Migrating from JEE to GigaSpaces", there is some details about data routing:


4.2 Controlling Object Routing
Let's examine the way we control how bid and ask orders for the same stock, end up in the same physical location. This is done by using the GigaSpaces support for object routing. The user determines for each class written to the space, the field which is used to control the routing. This means that two instances of the same class are routed to the same physical location, provided that they have the same value for the field used for the routing. The designation of the routing field can be done either via annotations or XML. Here's how this is done for the Order object in our application (note the @SpaceRouting annotation) [...]

November 6, 2010 | Unregistered CommenterDominique De Vito

Dominique, there's definitely a circling back to data grids here, the difference being how tightly data and behaviors are coupled. Memcached and other key-value stores have one largely on simplicity. They can be used from any language using a client library that speaks the right protocol. When you are bound to a particular language like C++ and Java, and you have to run post processors at compile time to make it work, and it must be managed completely through a complicated and closed infrastructure, then you've lost a lot of flexibility and that agility is very attractive to developers. You also of course gain a great deal if you are willing to commit to that. Combining the best of both worlds is attractive.

November 6, 2010 | Registered CommenterTodd Hoff

PostPost a New Comment

Enter your information below to add a new comment.
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>