« Super Bowl Advertisers Ready for the Traffic? Nope..It's Lights Out. | Main | Sponsored Post: Amazon, Zoosk, aiCache, Teradata Aster, Aerospike, Percona, ScaleOut, New Relic, NetDNA, Logic Monitor, AppDynamics, ManageEngine, Site24x7 »

Ask HighScalability: Memcached and Relations

Hi everybody

I'm wondering what you would do: I develop a webapp using Grails, Memcached and Mysql as persistence. Now, I have following domain classes (simplified):

Product: Can be in one category
Category: Can have nested children, and have multiple products.

I need to access all product objects by id which led me to the idea to store all products in one big Memcached-entry with a key: PRODUCTMAP and as value, all product attributes as array, like: [productId1: [title: 'title'], productId2: [title: 'title']]

If I browse to category 4, I simply get my map categoryMap with value [cateoryId: [productId1, productId2]]
I also can list all products of a certain category by providing that id. 

The bad thing about this is that I always have to put back everything if I modify a single product.

Who can give advice how to realize that?
Any help will be appreciated! 




Reader Comments (8)

Isn't it two more months before April 1?

February 6, 2013 | Unregistered CommenterRichard

Store each product as individual key and use Memcached::getMulti
Loading all products on each requests is not good path

February 6, 2013 | Unregistered CommenterRokiś

You should consider switching to Redis instead of Memcached... You will be able to get more feature than Memcached (not criticizing memcached) but Redis provide various advantages over Memcached.

February 6, 2013 | Unregistered CommenterSteph

Or just build an in-memory structure. We had a similar problem where we needed a fast index and the DB just couldn't cut it so we simply stored it in memory and used a message bus to query the service. If the server went down and had to rebuild the index, it could take a few minutes, but you could offset that by having a second redundant server responding to the same queues (or put it behind an HTTP load balancer if queues aren't your thing).

Not every problem needs to be solved with a database or memcached.

February 6, 2013 | Unregistered CommenterBryan Murphy

I would use couchbase as the database with elastic search for discovery: See Couchbase Blog Post (super easy by the way)

* Give each product and category a unique identifier, see: Couchbase design patterns
* Map products to categories and categories to categories
* Create a views that list all the products in a category
* Create a view that lists all the categories / sub categories
* Use search for everything else

February 6, 2013 | Unregistered CommenterSteve

Use Redis instead of MemCache. It provides much better key and data handling functions (for example listing all keys matching a pattern!) with comparable performance.

See the command list: Redis commands

February 7, 2013 | Unregistered CommenterOchronus

Rokiś has it. No need to switch away from Memcached for this use-case.

I would say your best trade-off between simple updates and quick reads would be to design for a max of 3 Memcached hits on a request instead of just 1. It will still be really, really fast, but it will also use more keys, which means it will scale better over multiple memcached instances as your data grows (not sure if this is a concern, as you didn't state how many categories and products we're talking about here.

I would store products by their ID with a serialized (JSON, whatever) version of all their attributes as the value, including their category id.

Categories are then stored with their serialized attributes, a list of all product ids in that category and a list of sub-categories. If you ever need to retrieve all of a category's products, including all products in sub-categories, you can also store 'descendant_categories', a list of all nested categories, which would include grandchildren, etc.

For listing all products in a category:
1. One GET on the category ID
2. One multi-get on all of the product ids in the category.

To list all products in a category, including products in any child/grandchild/etc categories:
1. One GET on the category ID
2. One multi-get with all of the category ID's in the `descendant_categories`
3. One multi-get on the combined list of product ids from your original category plus its `descendant_categories`

To update the cache when a product's attributes change:
Just one SET operation.

To update the cache when a product's category changes:
Either a DB hit plus a cache SET, or just a GET followed by a SET.

To update the cache when category nesting changes:
You'll have to walk up the category tree with GET and SET with this implementation, but you only need to update Categories up the ancestry chain, so if you don't have deep nesting, that's not a big deal. If you do have deeply-nested categories, or your category nesting changes often (both unlikely), then you could store `ancestor_categories` on a category as well, to make this a GET plus mult-GET plus multi-SET operation. This would come at the expense of needing to update more-often, though, because you'd need to update categories in both directions if any of them change.

Another way to make category nesting updates easier if they're done often, but to avoid needing to update the chain in both directions, would be to just store the root category ID for any child category. This would free you from needing to walk the chain and instead you could just get the root, then use `descendant_categories` to pull all affected categories, and then only update the ones above the change (including updating the root category ID if the previous root now has a parent).

Note: In almost all of the cache update situations, you're dealing with non-atomic operations, and the naive implementation means you're subject to older, slower updates overwritting newer, faster updates. You'd benefit from either doing this on a single-process queue responsible for updating the cache, or you could incorporate atomic locks.


February 7, 2013 | Unregistered CommenterWes Winham

I think there are several questions that need to be asked here before a really good answer can be given: How large is the dataset that will be stored in Memcache? How far away is the Memcache server, and would hitting Memcache with get_multi be an option in terms of bandwidth/etc? Are you concerned with bandwidth limitations to the Memcache server, or is bandwidth near-infinite?

If you are running Memcache locally or near-locally (IE Amazon + VPC), calls to Memcache are very inexpensive - using a single call to get categories which is stored as a tree with product IDs as values, then retrieving individual products with get_multi would be the most efficient way to go. This also uses significantly less RAM than the other method if you have a large dataset.

If you're not running Memcache locally and making several calls to the remote would be undesirable but you do have lots of bandwidth, it would be best to store the data in a more complex format; performing replacements in a low-traffic scenario would mean getting the category key, updating the product, then pushing the update back to Memcache. This falls apart if you have multiple processes updating items simultaneously - you'd have to request the data from the DB to ensure consistency between the DB and cache, format/ORM it, then update.

If it's far away and low-bandwidth, switch away from GoDaddy. ;)

February 10, 2013 | Unregistered CommenterIsrael

PostPost a New Comment

Enter your information below to add a new comment.
Author Email (optional):
Author URL (optional):
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>