advertise
« Five Ways to Stop Framework Fixation from Crashing Your Scaling Strategy | Main | Pyshards aspires to build sharding toolkit for Python »
Saturday
Jun282008

ID generation schemes

Hi,

Generating unique ids is a common requirements in many projects. Generally, this responsibility is given to Database layer. By using sequences or some other technique. This is a problem for horizontal scalability.
What are the Guid generation schemes used in high scalable web sites generally? I have seen use java's SecureRandom class to generate Guid. What are the other methods generally used?

Thanks
Unmesh

Reader Comments (3)

In my project I've got a bunch of web servers connected to a central SQL box, but they also do a load of crunch work on their own local SQLExpress instances for speed.

To get a set of unique ID's that are system wide, I precomputed a load of 9 char keys:

AAAAAAAAAA
AAAAAAAAAB
AAAAAAAAAC

etc
(in actual fact, I converted from decimal to base 36, and went up the range uaing every 1,000,000th value so the keys look a load more random than that.... for my second server I used

1
1,000,001
2,000,001

for 3rd server:

2
1,000,002
2,000,002

This means every server can have it's own allocation of ID's, or just connect via a webservice to get a key.

The table is simply:

Key (Char 9)
Status (Boolean)
TempGUID

Because I'll never repeat a value, for speed I can keep the table fairly small, say around 100,000 free keys. When keys have been marked as used I can archive them off... when it's low on keys I can generate new ones.

To get the keys at random I just do a very simple:

in code:
TempGUID = newGUID (or whatever it is in ASP.net, i forget now)

in SQL;
SET ROWCOUNT 1
UPDATE Keys SET STATUS=1, TempGUID =@TempGUID where status = 0
SELECT key from keys where TEMPGUID = @TempGUID

Hey presto, a 100% unique key even across a set of sharded servers

November 29, 1990 | Unregistered CommenterBrianDrought

There have been some good suggestions so far. My recommendation though would be to encapsulate this process.

Lets break down the needs:
1. High performance & scalability
2. Unique identifier
3. unique identifier generation must be flexible for the purpose (maybe an int/string/date/etc.)

Some of the common solutions and problems are things like:
1. GUID/UUID = not guaranteed unique, has some form of performance hit to generate, commonly large amount of data (~16 bytes) where a much smaller amount would due
2. Natural/Real Data = commonly larger amounts of data, ID changes needed if the data changes which created them, can also result in development issues where programmers use this ID and break it down to use the data within (programmers may make mistakes that cost performance or data integrity issues)
3. Hi-Lo/Data Ranges = requires a "central service" hit to get the range, can result in unused data values
4. ID Lists = requires a "central service" hit, can result in unused data ranges, can result in local performance hits when large lists of ID's are downloaded
5. Central Service ID Generator (service that is called to get the next ID; it manages all ID locks & generation) = Bottleneck for new items

The common problems that come out of all of this is that either the ID is misused, too big data wise, unused unique identifiers happen, or performance/scalability bottlenecks occur. So how can we win this battle? There has to be a solution.

Here is what you do:

1. Create a service (webservice or whatever works for you in your environment)
2. Create a method that takes a key that returns an object to your application to be persisted locally, for each local unique service
3. Include within the object a time to live allowance
4. The object that was returned should have a getID() method
5. After the object throws an event called requiresRefresh(), your service should request a new object (don't keep the object or program it to get it's own updates)

FYI: For time savings, you can make an object that encapsulates the object passed around and automatically runs the refresh update. You just have to watch out for this happening when it isn't needed. For example, the TTL expires and you don't need new ID's for a few hours. You also have to be aware of using this object too much and not tending to it, thus making it a potential infrastructure bottleneck for new feature rollouts.

The reasoning behind this is that you want something fast right? Of course. You need something flexible. For the purpose of needing a new ID, at the application level...who cares what/where it comes from as long as it is unique, fast, and not bloated. The central service could generate the high-lo pattern for a while, then go to get a list of the unused ids and generate and hand out lists of unused ids. However, your application wouldn't even know. Why, because it doesn't matter. You want a local object so it runs fast. Now, how your persist this object locally will totally depend on your needs and environment.

The result of all of this is a hybrid solution. Use a central ID generation service, to generate an object (or pattern) that your applications use locally until either the clock runs out or the object's pattern implementation is empty. You need the central service to ensure true unique ID's. You need a local object that actually gives you the ID's without delay, for speed reasons. You need to pass an object to allow multi-pattern usage as desired. This pattern also works for disconnected environments (FYI).

I realize that the thought of just saying, never mind I'll just use GUID's, is not unreasonable. It's easy and for most peoples' cares, it's unique. But what about the size? A GUID is 16 bytes while an Int is 4 and a Long is 8. Let's be honest, if you need a large amount of unique identifiers I think a Long will more than suffice. And in a world of read-intensive applications/websites, having the opportunity to save half your storage space/variable memory space (i realize it is just for ID's) and half of your bandwidth is not something to baulk at. In many cases an Int would even work fine, and there we are talking about a 4x increase to a GUID.

As always, you have to make the decisions regarding what architecture is right for your project and environment.

Hope this rambling helps. :-)

November 29, 1990 | Unregistered CommenterRandy Skopecek

Okay okay, I couldn't help myself. There is another solution to the ID generation game. However it must be a fit for the purpose. Simply generate an ID using on of the previously mentioned methodologies that require no external service calls. Then have a back-end ID replacement service that works on normalizing the ID to a fully utilized unique ID value whose size is reduced.

The two gotchas are:
1) If the ID's generated on the service/front-end could overlap the back-end ID replacement
2) The new ID's have to be replaced everywhere the old ones were.

Unfortunately, ID replacement can be tricky since it might need to have all of the old ID's locked, and if the ID get's persisted in data sources outside your control or time line (for synchronous ID replacement) you will have to keep and ID translator service (which to me isn't worth it,...why have two ID's even with the size differential, you should consider a different solution for this situation). The up side though to all of this is you get the best of both worlds, high performance ID creation and optimized ID size.

Once again, you have to consider what is right for your situation.

November 29, 1990 | Unregistered CommenterRandy Skopecek

PostPost a New Comment

Enter your information below to add a new comment.
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>