I've been interested in sharding concepts since first hearing the term "shard" a few years back. My interest had been piqued earlier, the first time I read about Google's original approach to distributed search. It was described as a hashtable-like system in which independent physical machines play the role of the buckets. More recently, I needed the capacity and performance of a Sharded system, but did not find helpful libraries or toolkits which would assist with the configuration for my language of preference these days, which is Python. And, since I had a few weeks on my hands, I decided I would begin the work of creating these tools. The result of my initial work the Pyshards project, a still-incomplete python and MySQL based horizontal partitioning and sharding toolkit. HighScalability.com readers will already know that horizontal partitioning is a data segmenting pattern in which distinct groups of physical row-based datasets are distributed across multiple partitions. When the partitions exist as independent databases and when they exist within a shared-nothing architecture they are known as shards. (Google apparently coined the term shard for such database partitions, and pyshards has adopted it.) The goal is to provide big opportunities for database scalability while maintaining good performance. Sharded datasets can be queried individually (one shard) or collectively (aggregate of all shards). In the spirit of The Zen of Python, Pyshards focuses on one obvious way to accomplish horizontal partitioning, and that is by using a hash/modulo based algorithm. Pyshards provides the ability to reasonably add polynomial capacity (number of original shards squared) without re-balancing (re-sharding). Pyshards is designed with re-sharding in mind (because the time will come when you must re-balance) and provides re-sharding algorithms and tools. Finally, Pyshards aspires to provide a web-based shard monitoring tool so that you can keep an eye on resource capacity. So why publish an incomplete open source project? I'd really prefer to work with others who are interested in this topic instead of working in a vacuum. If you are curious, or think you might want to get involved, come visit the project page, join a mailing list, or add a comment on the WIKI. http://code.google.com/p/pyshards/wiki/Pyshards Devin
Update 2: Hank Williams says iPhone Background Processing: Not Fixed But Halfway There. Excellent analysis of all the reasons you need real background processing. Hey, you can't even build an alarm clock! Hard to believe some commenters say it's not so.. Update: Josh Lowensohn of Webware tells us Why users should be scared of Apple's new notification system. A big item on the iPhone developer iWishlist has been background processing. If you can't write an app to poll for new data in the background how will you keep your even more important non-foreground app in sync? Live from the Apple developer conference we learn the solution is a centralized push based architecture. Here's the relevant MacRumorsLive transcript:
I have a table .This table has many columns but search performed based on 1 columns ,this table can have more than million rows. The data in these columns is something like funny,new york,hollywood User can search with parameters as funny hollywood .I need to take this 2 words and then search on column whether that column contain this words and how many times .It is not possible to index here .If the results return say 1200 results then without comparing each and every column i can't determine no of results.I need to compare for each and every column.This query is very frequent .How can i approach for this problem.What type of architecture,tools is helpful. I just know that this can be accomplished with distributed system but how can i make this system. I also see in this website that LinkedIn uses Lucene for search .Is Lucene is helpful in my case.My table has also lots of insertion ,however updation in not very frequent.
If you just can't get enough high scalability talk you might want to take a look GigaOm's Structure 08 conference. The slate of speakers looks appropriately interesting and San Francisco is truly magical this time of year. High Scalability readers even get a price break is you use the HIGHSCALE discount code! I'll be on vacation so I won't see you there, but it looks like a good time. For a nice change of pace consider visiting MoMA next door. Here's a blurb on the conference:
A reminder to our readers about Structure 08, GigaOm's upcoming conference dedicated to web infrastructure. In addition to keynotes from leaders like Jim Crowe, chairman and CEO of Level 3 Communications and Werner Vogels, CTO of Amazon, the event will feature workshops from Google App Engine, Microsoft and a special workshop from Fenwick and West who will cover how to raise money for an infrastructure start up. Learn from the guru's at Amazon, Google, Microsoft, Sun, VMWare and more about what the future of internet infrastructure is at Structure 08, which will be held on June 25 in San Francisco.
Scalability forces us to think differently. What worked on a small scale doesn't always work on a large scale -- and costs are no different. If 90% of our application is free of contention, and only 10% is spent on a shared resources, we will need to grow our compute resources by a factor of 100 to scale by a factor of 10! Another important thing to note is that 10x, in this case, is the limit of our ability to scale, even if more resources are added. 1. The cost of non-linearly scalable applications grows exponentially with the demand for more scale. 2. Non-linearly scalable applications have an absolute limit of scalability. According to Amdhal's Law, with 10% contention, the maximum scaling limit is 10. With 40% contention, our maximum scaling limit is 2.5 - no matter how many hardware resources we will throw at the problem. This post discuss in further details how to measure the true cost of non linearly scalable systems and suggest a model for reducing that cost significantly.
I would like to compile a comparison matrix on the total cost of ownership for .Net, Java, Lamp & Rails. Where should I start? Has anyone seen or know of a recent study on this subject?
My first post, please be gentle. I know it is long. You are all like doctors - the more info, the better the diagnosis. ----------- What is the best way to store a list of all of your friends in the memcached cache (a simple boolean saying “yes this user is your friend”, or “no”)? Think Robert Scoble (26,000+ “friends”) on Twitter.com. He views a list of ALL existing users, and in this list, his friends are highlighted. I came up with 4 possible methods: --store in memcache as an array, a list of all the "yes" friend ID's --store your friend ID's as individual elements. --store as a hash of arrays based on last 3 digits of friend's ID -- so have up to 1000 arrays for you. --comma-delimited string of ID's as one element I'm using the second one because I think it is faster to update. The single array or hash of arrays feels like too much overhead calculating and updating – and even just loading – to check for existence of a friend. The key is FRIEND[small ID#]_[big ID#]. The value is 1. This way there are no dupes. (I add u as friend, it always adds me as ur friend...I remove u, u remove me). Store with it 2 additional flags: One denotes start of entries. One denotes end of entries. As friends are added, the end flag position relative to new friends will become meaningless, but that is ok (I think). To see if someone is your friend, the system checks if both start and end flags exist. If both exist, it can check for existence of friend ID - if exists, then friend. Start flag is required. If start flag is pushed out of cache, we must assume some friends were also pushed out. Currently, the system loads from DB in a daemon in the background after you log in (if two flags are not already set). Until the two flags are set, it does db lookups. There is no timeout on the data in cache. Adding/removing friends to your account adds/removes to/from memcache - so, theoretically, it might never have to pre-load anything. Downside of my method is if the elements span multiple servers and one dies, you loose some of your friends (that's the upside of using arrays). I don't know how to resolve if the lost box didn't contain either of the flags -- in that case, the users' info will NEVER get refreshed. This is my concern. Any ideas? Thanks so much!!!
It's surprising that the blogosphere hasn't picked up the biggest difference in pricing: Google's datastore is less than a tenth of the price of Amazon's SimpleDB while offering a better API.If money matters to you then the burn rate under GAE could be convincingly lower. Let's compare the numbers: GAE pricing: * $0.10 - $0.12 per CPU core-hour * $0.15 - $0.18 per GB-month of storage * $0.11 - $0.13 per GB outgoing bandwidth * $0.09 - $0.11 per GB incoming bandwidth SimpleDB Pricing: * $0.14 per Amazon SimpleDB Machine Hour consumed * Structured Data Storage - $1.50 per GB-month * $0.100 per GB - all data transfer in * $0.170 per GB - first 10 TB / month data transfer out (more on the site) Clearly Google priced their services to be competitive with Amazon. We may see a response by Amazon in the near feature, but the database storage cost for GAE is dramatically cheaper at $0.15 - $0.18 per GB-month vs $1.50 per GB-month. Interestingly, Google's price is the same as Amazon's S3 (file storage) pricing. Google seems to think of database storage as more like file storage. That makes a certain amount of sense because BigTable is a layer on the Google File System. File system pricing may be the more appropriate price reference point. On SimpleDB a 1TB database costs $1,500/month and BigTable costs in the $180/month range. As you grow into ever larger data sets the difference becomes even more compelling. If you are a startup your need for funding just dropped another notch. It's hard to self-finance many thousands of dollars a month, but hundreds of dollars is an easy nut to make. Still, Amazon's advantage is they support application clusters that can access the data for free within AWS. GAE excels at providing a scalable two tier architecture for displaying web pages. Doing anything else with your data has to be done outside GAE, which kicks up your bandwidth costs considerably. How much obviously depends on your application. But if your web site is of the more vanilla variety the cost savings could be game changing.