advertise

Entries by devinvenable (1)

Wednesday
Jun112008

Pyshards aspires to build sharding toolkit for Python

I've been interested in sharding concepts since first hearing the term "shard" a few years back. My interest had been piqued earlier, the first time I read about Google's original approach to distributed search. It was described as a hashtable-like system in which independent physical machines play the role of the buckets. More recently, I needed the capacity and performance of a Sharded system, but did not find helpful libraries or toolkits which would assist with the configuration for my language of preference these days, which is Python. And, since I had a few weeks on my hands, I decided I would begin the work of creating these tools. The result of my initial work the Pyshards project, a still-incomplete python and MySQL based horizontal partitioning and sharding toolkit. HighScalability.com readers will already know that horizontal partitioning is a data segmenting pattern in which distinct groups of physical row-based datasets are distributed across multiple partitions. When the partitions exist as independent databases and when they exist within a shared-nothing architecture they are known as shards. (Google apparently coined the term shard for such database partitions, and pyshards has adopted it.) The goal is to provide big opportunities for database scalability while maintaining good performance. Sharded datasets can be queried individually (one shard) or collectively (aggregate of all shards). In the spirit of The Zen of Python, Pyshards focuses on one obvious way to accomplish horizontal partitioning, and that is by using a hash/modulo based algorithm. Pyshards provides the ability to reasonably add polynomial capacity (number of original shards squared) without re-balancing (re-sharding). Pyshards is designed with re-sharding in mind (because the time will come when you must re-balance) and provides re-sharding algorithms and tools. Finally, Pyshards aspires to provide a web-based shard monitoring tool so that you can keep an eye on resource capacity. So why publish an incomplete open source project? I'd really prefer to work with others who are interested in this topic instead of working in a vacuum. If you are curious, or think you might want to get involved, come visit the project page, join a mailing list, or add a comment on the WIKI. http://code.google.com/p/pyshards/wiki/Pyshards Devin

Click to read more ...