Tuesday, September 18, 2007

Session management in highly scalable web sites

Hi,

Every application server has its own session management implementation for supporting high scalability, but an application architect or developer has to design and implement the application to make the best use of it.

What are the guiding principles and patterns for session state management?

The WebSphere system management Redbook notes that session management performance is optimal when session data per user is around 2 KB, and degrades when it grows beyond that.

I have the following questions.
1. How do you measure session data per user?
2. It is generally recommended that you keep all session state in the database and keep only the keys in the HttpSession object. Every time a web request is processed, session data is fetched from the database; that way the data stays in memory only while the request is processed, and the actual data in the HttpSession is very small (only a few keys).
What is the general practice?
At what point should you switch from keeping data in the HttpSession to the database?
Do websites like Amazon or eBay follow this?
3. Is there any open source framework that helps you do session management in the way described in point 2?
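The "keys only in session" pattern from point 2 can be sketched as below. This is a minimal, hypothetical illustration, not a real framework API: a plain Map stands in for both the HttpSession and the backing database, and the names (KeyOnlySession, save, load, cartKey) are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Sketch: the session holds only a small key; the payload lives in a
// backing store (a database in practice) and is fetched per request.
public class KeyOnlySession {
    // Stand-in for the database table holding the real session payload.
    static final Map<String, String> dataStore = new HashMap<>();

    // Put only an opaque key into the session; the payload goes to the store.
    static String save(Map<String, Object> httpSession, String cartJson) {
        String key = UUID.randomUUID().toString();
        dataStore.put(key, cartJson);
        httpSession.put("cartKey", key); // session now holds bytes, not the cart
        return key;
    }

    // On each request, fetch the payload by key; it stays in memory only
    // while the request is being processed.
    static String load(Map<String, Object> httpSession) {
        String key = (String) httpSession.get("cartKey");
        return key == null ? null : dataStore.get(key);
    }
}
```

The session footprint per user is then roughly the size of the key plus the attribute name, regardless of how large the cart grows.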

Thanks,
Unmesh

Reader Comments (13)

I like the Amazon approach of storing most of the data in the database anyway, so there's little need for a separate session data store. When you really need session state, how about using something like memcached so you can take a lot of write pressure off your DB?

November 29, 1990 | Unregistered CommenterTodd Hoff

But at what point do you decide that you have to store session state in the database? Like I mentioned in the post, the WebSphere Redbook recommends having <= 2 KB of session data per user (they do not mention for how many users, though). I have seen three types of data generally kept in session in web applications:
1. User profile and other user-related information.
2. Information a transaction collects from multiple web pages that the user submits one after another (like the typical shopping cart scenario).
3. The result set of a query, for showing search results with pagination.
While for 3 we can use the pagination support of the database, for 1 and 2 it is often useful to populate an object model and keep it in memory for easy access. Often in scenario 1 we have to call many web services to get all the data for the complete user profile, which is needed for processing other transactions. In this scenario, is it better to use a cache instead of the HttpSession?
And how do we generally measure how much data is kept in session per user?

November 29, 1990 | Unregistered Commenterunmesh

My standard answer to this type of question is usually: try as far as possible to avoid keeping session data on the server. HttpSession is a very convenient way of keeping data on the server, but it doesn't scale very well. Of the three examples you describe in your post, I think only 2 is a candidate for session information. You would want to save user data to the database anyway. For 3, as you say, using database pagination or offsets in your queries is better than using session cookies (which is what we're talking about anyway).
There might be ways of doing 2 without server-side sessions; that might require some real deep thinking from a developer standpoint, but it is by no means impossible.

The only case for server-side sessions, as far as I can see, is really short-lived data that does not fit naturally into the database (like point 2). But deciding to introduce server-side sessions and cookies has a lot of implications for your infrastructure (load balancing, session replication, etc.), and should not be done lightly.

If you do decide to use server-side sessions, however, letting the server side hold the actual data (not necessarily in the database; maybe in some sort of replicated session mechanism like memcached or Tapestry), and letting the client hold only a reference, would seem like a much safer option, minimizing the chance of someone clever rolling their own cookies and sending bogus data to the server.

Hope this answers at least a bit of your question.
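The "client holds only a reference" idea above can be sketched as follows. This is a hypothetical illustration (the class and method names are made up): the cookie carries an unguessable random token, the data stays server side, and a client rolling its own cookie can at worst get a miss, never inject bogus session data.

```java
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;

// Sketch: opaque random tokens in the cookie, real data on the server.
public class OpaqueTokenSessions {
    private final Map<String, String> serverSide = new HashMap<>();
    private final SecureRandom random = new SecureRandom();

    // Issue a token to put in the cookie; store the data under it.
    public String issue(String sessionData) {
        byte[] raw = new byte[16];
        random.nextBytes(raw);
        StringBuilder token = new StringBuilder();
        for (byte b : raw) token.append(String.format("%02x", b));
        serverSide.put(token.toString(), sessionData);
        return token.toString();
    }

    // A forged or expired token simply finds nothing.
    public String lookup(String token) {
        return serverSide.get(token);
    }
}
```

Because the token carries no meaning, there is nothing for a client to tamper with; validation is just a lookup.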

November 29, 1990 | Unregistered CommenterKyrre

Thanks Kyrre. You've made me very happy. That's the kind of quality answer I was hoping would happen when I started this site!

November 29, 1990 | Unregistered CommenterTodd Hoff

Thanks a lot for the very detailed answer; it was definitely very useful. One thing I tried for short-lived data is to populate an object model as the user enters data and serialize the objects to the database. I had a special table in the database to keep this kind of conversational state, with just a few columns: a key, a timestamp, and a CLOB (where I kept all the serialized objects). The HttpSession then contained only the key to rows in this cache table. I thought this was also helpful for keeping the user profile, because getting the full user profile sometimes involves multiple web service calls.
Is this a kind of standard solution?
Do you have experience using an in-memory cache like JCache? I was also thinking of introducing some in-memory LRU cache.
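The conversation table described above (key, timestamp, serialized blob) can be sketched like this. The names (ConversationStore, put, get) are hypothetical, a Map stands in for the table (in practice it would be JDBC plus a BLOB column), and the timestamp column is omitted for brevity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Sketch: conversational state serialized to a (key -> bytes) "table",
// with only the key kept in the HttpSession.
public class ConversationStore {
    private final Map<String, byte[]> table = new HashMap<>();

    // Serialize the object graph into the row (the CLOB/BLOB column).
    public void put(String key, Serializable state) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(state);
            }
            table.put(key, bytes.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deserialize the row back into objects; null if the key is unknown.
    @SuppressWarnings("unchecked")
    public <T> T get(String key) {
        byte[] row = table.get(key);
        if (row == null) return null;
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(row))) {
            return (T) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```

An expiry sweep on the timestamp column would keep abandoned conversations from accumulating.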

November 29, 1990 | Unregistered Commenterunmesh

Thanks for the great feedback on my post! Always nice to hear that people like my ramblings ;-)

As for storing objects in the database as CLOBs (or, probably better, BLOBs), I don't think it's a very "standard" way of doing things. As far as I can tell from your post, you're using the database as a cache for the web service calls and short-lived application data. If you expect the load on the site to increase a lot, or to spike in certain periods (like Christmas for shopping sites), you'd probably be better off using something that is meant to be a distributed cache in the first place (a lot of visits will put a strain on the database for your caching, on top of all the other database access your site is doing). Even if you cache your objects in JCache on each application instance, if you're using, for example, round-robin load balancing in front, the chances of cache hits degrade, and you'll be going all the way down to the basement for data a distributed cache would probably already have replicated to you.

As for using JCache, I have very good experience using it on a few sites, but all very simple, primarily read-only sites (e.g., the database contents are updated on a batch basis, so flushing the cache at the same time is easy: you just flush everything and build the cache up again as you go along).

It all really comes down to the nature of your site, your data, and the amount of traffic you're getting, but keeping short-lived data away from the database is usually a good idea in order to handle spikes and growing popularity.

BTW, good thing this is not a math question, I actually got the question for the spam-guard wrong...

November 29, 1990 | Unregistered CommenterKyrre

Hello.

Just wanted to point out that the sentence "maybe in some sort of replicated session mechanism like memcached or Tapestry" should read "maybe in some sort of replicated session mechanism like memcached or Terracotta"

November 29, 1990 | Unregistered CommenterKyrre

In ColdFusion, there was (is) a concept of "client" variables from the early days of the platform.
It exposed a scope called "client" which you could write to in your code. For each request, that data was serialized to WDDX format (basically XML describing native data types) and, at the end of the request, saved to a configurable source (a database table being the useful one). At the start of each request the data was read from the DB and deserialized.
I found it one of the most useful features of the platform and used it to create shared-nothing architectures. The session data was an XML packet in a schema separate from my application schema. That way I could add or drop app servers on the fly without interrupting sessions. If a box dropped, all users on it would switch to another box while maintaining their session transparently (as all boxes looked at the same session store database). Lo-tek clustering ;o) Very effective. In fact, so effective that I'll be reluctant to change the strategy for "traditional clusters".
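The shared-nothing pattern described here (load the client scope from a shared store before the request, write it back after) can be sketched as below. This is a hypothetical illustration: the names are made up, the shared Map stands in for the session database, and real serialization (WDDX, JSON, etc.) is elided.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch: every request loads the client scope from a shared store, runs
// the handler, and writes the scope back, so any app server (including a
// replacement after a crash) can serve the next request.
public class SharedNothingScope {
    // Stand-in for the shared session database all boxes point at.
    static final Map<String, Map<String, String>> sharedStore = new HashMap<>();

    // Wrap a request handler with load-before / save-after semantics.
    static String handle(String clientId, Function<Map<String, String>, String> handler) {
        Map<String, String> scope =
            sharedStore.computeIfAbsent(clientId, id -> new HashMap<>()); // "deserialize"
        String response = handler.apply(scope);
        sharedStore.put(clientId, scope);                                 // "serialize" back
        return response;
    }
}
```

Because no box holds state between requests, failover is just the load balancer picking a different box.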

Harel

November 29, 1990 | Unregistered CommenterHarel Malka

I just moved from DB sessions to memcache sessions (100% transparent: just apt-get install php5-memcache and set up the memcache servers). Just a few minutes.

It works perfectly, and my MySQL servers are happier than ever :D

Right now I have around 600 active sessions, and everything seems to run smoothly :)

It can be tested easily, without touching a line of code (if you are using the standard PHP session commands).

November 29, 1990 | Unregistered CommenterAlberto

We use Sharedance for managing sessions for about 25K websites.
It performs quite well and has a very small footprint.

Give it a try.
http://sharedance.pureftpd.org/project/sharedance

November 29, 1990 | Unregistered Commenteratif.ghaffar

If using a DB to store sessions, does anyone know how enterprise-class sites housed in two different data centers (running live/live) maintain sessions across both? The problem as I see it is that since each data center has its own session database, if I were to flip users to only access data center 1, all previous data center 2 users would lose their sessions. What are some pure hardware-based solutions to this that are being used now?

November 29, 1990 | Unregistered CommenterGusto

Simple: if the session is not found in the current data center, try the other one. If it's found there, move it to the current data center. Of course, there will be some latency, so it's something you want to avoid doing often. If you're using PHP, it's a simple matter of rewriting your session handling functions.

Why do it in hardware when it's easy to do in software?
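The fallback described above can be sketched as follows. This is a hypothetical illustration in Java rather than PHP session handlers: Maps stand in for the two session databases, the names are made up, and the slow cross-data-center call is just a local lookup here.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: look in the local data center's session store first; on a miss,
// try the remote one and migrate the session locally so the slow cross-DC
// lookup happens only once per flipped user.
public class CrossDcSessions {
    final Map<String, String> local = new HashMap<>();
    final Map<String, String> remote = new HashMap<>();

    String find(String sessionId) {
        String data = local.get(sessionId);
        if (data == null) {
            data = remote.get(sessionId);     // slow cross-DC call in practice
            if (data != null) {
                local.put(sessionId, data);   // migrate to the current DC
                remote.remove(sessionId);
            }
        }
        return data;
    }
}
```

After a traffic flip, each user pays the cross-DC cost once; subsequent requests hit the local store.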

November 29, 1990 | Unregistered CommenterMark Rose

We've had several customers move to using IBM WebSphere eXtreme Scale (an IMDG) for session state recently. They were previously using nothing, the built-in application server session management, or a database. This scales linearly, and we give them several deployment options depending on what works best for them.

1) Run the IMDG in the same JVM as their Tomcat or servlet container. We store the primary copy there and have a dynamic replica running somewhere else. This is very fast and scalable, but uses memory on the servlet container and requires affinity to work best.

2) Run a separate IMDG tier and let the applications pull the session each time. This also scales linearly, but now you have to manage two tiers from a scaling point of view.

3) Implement 2 and use a session cache in the servlet container. This requires session affinity to work, but eliminates the latency of fetching sessions.

4) Use 1 as the normal path, but on a failure degrade to 3 or 2. We're seeing this pattern a lot when dealing with multiple data centers.

The above patterns can be applied regardless of the technology used to actually store the sessions.
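Option 3 above (a per-container cache in front of the grid tier) can be sketched like this. This is a generic read-through cache illustration, not the eXtreme Scale API: Maps stand in for the grid and the local cache, and the names are made up. With session affinity, repeat requests for the same session hit the local cache and skip the grid fetch.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: a local session cache in front of a remote grid tier. The counter
// shows how affinity turns repeat lookups into local cache hits.
public class CachedGridSessions {
    final Map<String, String> grid = new HashMap<>();   // remote IMDG tier
    final Map<String, String> cache = new HashMap<>();  // local session cache
    int gridFetches = 0;                                // counts the slow path

    String get(String sessionId) {
        String data = cache.get(sessionId);
        if (data == null) {
            gridFetches++;                    // slow path: fetch from the grid tier
            data = grid.get(sessionId);
            if (data != null) cache.put(sessionId, data);
        }
        return data;
    }
}
```

Without affinity, each container would miss its cache on the user's first request there, eroding the benefit, which is why the comment notes affinity is required.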

November 29, 1990 | Unregistered CommenterBilly Newport
