Shard

Todd Hoff's picture

An Unorthodox Approach to Database Design : The Coming of the Shard

Update: Dan Pritchett shares some excellent Sharding Lessons: Size Your Shards, Use Math on Shard Counts, Carefully Consider the Spread, Plan for Exceeding Your Shards

Once upon a time we scaled databases by buying ever bigger, faster, and more expensive machines. While this arrangement is great for big iron profit margins, it doesn't work so well for the bank accounts of our heroic system builders who need to scale well past what they can afford to spend on giant database servers. In a extraordinary two article series, Dathan Pattishall, explains his motivation for a revolutionary new database architecture--sharding--that he began thinking about even before he worked at Friendster, and fully implemented at Flickr. Flickr now handles more than 1 billion transactions per day, responding in less then a few seconds and can scale linearly at a low cost.

What is sharding and how has it come to be the answer to large website scaling problems?

Todd Hoff's picture

Sharding the Hibernate Way

Update: A very nice JavaWorld podcast interview with Google engineer Max Ross on Hibernate Shards. Max defines Hibernate Shards (horizontal partitioning), how it works (pretty well), virtual shards (don't ask), what they need to do in the future (query, replication, operational tools), and how it relates to Google AppEngine (not much).

To scale you are supposed to partition your data. Sounds good, but how do you do it? When you actually sit down to work out all the details it’s not that easy. Hibernate Shards to the rescue! Hibernate shards is: an extension to the core Hibernate product that adds facilities for horizontal partitioning. If you know the core Hibernate API you know the shards API. No learning curve at all. Here is what a few members of the core group had to say about the Hibernate Shards open source project. Although there are some limitations, from the sound of it they are doing useful stuff in the right way and it’s very much worth looking at, especially if you use Hibernate or some other ORM layer.

Todd Hoff's picture

Federation at Flickr: Doing Billions of Queries Per Day

Flickr's lone database guy Dathan Pattishall made his excellent presentation available on how on how Flickr scales its backend to handle tremendous loads. Some of this information is available in Flickr Architecture, but the paper is so good it's worth another read. If you want to see sharding done right, at scale, take a look.

Pyshards aspires to build sharding toolkit for Python

I've been interested in sharding concepts since first hearing the term "shard" a few years back. My interest had been piqued earlier, the first time I read about Google's original approach to distributed search. It was described as a hashtable-like system in which independent physical machines play the role of the buckets. More recently, I needed the capacity and performance of a Sharded system, but did not find helpful libraries or toolkits which would assist with the configuration for my language of preference these days, which is Python. And, since I had a few weeks on my hands, I decided I would begin the work of creating these tools.

The result of my initial work the Pyshards project, a still-incomplete python and MySQL based horizontal partitioning and sharding toolkit. HighScalability.com readers will already know that horizontal partitioning is a data segmenting pattern in which distinct groups of physical row-based datasets are distributed across multiple partitions. When the partitions exist as independent databases and when they exist within a shared-nothing architecture they are known as shards. (Google apparently coined the term shard for such database partitions, and pyshards has adopted it.) The goal is to provide big opportunities for database scalability while maintaining good performance. Sharded datasets can be queried individually (one shard) or collectively (aggregate of all shards). In the spirit of The Zen of Python, Pyshards focuses on one obvious way to accomplish horizontal partitioning, and that is by using a hash/modulo based algorithm.

Pyshards provides the ability to reasonably add polynomial capacity (number of original shards squared) without re-balancing (re-sharding). Pyshards is designed with re-sharding in mind (because the time will come when you must re-balance) and provides re-sharding algorithms and tools. Finally, Pyshards aspires to provide a web-based shard monitoring tool so that you can keep an eye on resource capacity.

So why publish an incomplete open source project? I'd really prefer to work with others who are interested in this topic instead of working in a vacuum. If you are curious, or think you might want to get involved, come visit the project page, join a mailing list, or add a comment on the WIKI.

http://code.google.com/p/pyshards/wiki/Pyshards

Devin

Todd Hoff's picture

How I Learned to Stop Worrying and Love Using a Lot of Disk Space to Scale

Update 3: ReadWriteWeb says Google App Engine Announces New Pricing Plans, APIs, Open Access. Pricing is specified but I'm not sure what to make of it yet. An image manipulation library is added (thus the need to pay for more CPU :-) and memcached support has been added. Memcached will help resolve the can't write for every read problem that pops up when keeping counters.
Update 2: onGWT.com threw a GAE load party and a lot of people came. The results at Load test : Google App Engine = 1, Community = 0. GAE handled a peak of 35 requests/second and a sustained 10 requests/second. Some think performance was good, others not so good. My GMT watch broke and I was late to arrive. Maybe next time. Also added a few new design rules from the post.
Update: Added a few new rules gleaned from the GAE Meetup: Design By Explicit Cost Model and Puts are Precious.

How do you structure your database using a distributed hash table like BigTable? The answer isn't what you might expect. If you were thinking of translating relational models directly to BigTable then think again. The best way to implement joins with BigTable is: don't. You--pause for dramatic effect--duplicate data instead of normalize it. *shudder*

Flickr anticipated this design in their architecture when they chose to duplicate comments in both the commentor and the commentee user shards rather than create a separate comment relation. I don't know how that decision was made, but it must have gone against every fiber in their relational bones...

Todd Hoff's picture

Scaling Mania at MySQL Conference 2008

The 2008 MySQL Conference & Expo has now closed, but what is still open for viewing is all the MySQL scaling knowledge that was shared. Planet MySQL is a great source of the goings on:

  • Scaling out MySQL: Hardware today and tomorrow by Jeremy Cole and Eric Bergen of Proven Scaling. In it are answered all the big questions of life: What about 64-bit? How many cores? How much memory? Shared storage? Finally we learn the secrets of true happiness.
  • Panel Video: Scaling MySQL? Up or Out?. Don't have time? Take a look at the Diamond Note excellent game day summary. Companies like MySQL, Sun, Flickr, Fotolog, Wikipedia, Facebook and YouTube share intel on how many web servers they have, how they handle failure, and how they scale.
  • Kevin Burton in Scaling MySQL and Java in High Write Throughput Environments - How we built Spinn3r shows how they crawl and index 500k posts per hour using MySQL and 40 servers.
  • Venu Anuganti channels Dathan Pattishall's talk on scaling heavy concurrent writes in real time.
  • This time Venu channels Helping InnoDB scale on servers with many cores by Mark Callaghan from Google.
  • Exploring Amazon EC2 for Scale-out Applications by Morgan Tocker, MySQL Canada, Carl Mercier, Defensio. RoR based spam filtering services that runs completely on EC2. Show evolution from a simple configuration to a sharded architecture.
  • Applied Partitioning and Scaling Your (OLTP) Database System by Phil Hilderbrand.
  • Real World Web: Performance & Scalability by Ask Bjorn Hansen. (189 slides!). He promises you haven't seen this talk before. The secret: Think Horizontal.
  • Too many to list here. All the presentations are available on scribd.

  • Todd Hoff's picture

    YouTube Architecture

    Update: YouTube: The Platform. YouTube adds a new rich set of APIs in order to become your video platform leader--all for free. Upload, edit, watch, search, and comment on video from your own site without visiting YouTube. Compose your site internally from APIs because you'll need to expose them later anyway.

    YouTube grew incredibly fast, to over 100 million video views per day, with only a handful of people responsible for scaling the site. How did they manage to deliver all that video to all those users? And how have they evolved since being acquired by Google?

    Todd Hoff's picture

    Flickr Architecture

    Update: Flickr hits 2 Billion photos served. That's a lot of hamburgers.

    Flickr is both my favorite bird and the web's leading photo sharing site. Flickr has an amazing challenge, they must handle a vast sea of ever expanding new content, ever increasing legions of users, and a constant stream of new features, all while providing excellent performance. How do they do it?

    Todd Hoff's picture

    Scaling Secret #2: Denormalizing Your Way to Speed and Profit

    Alan Watts once observed how after we accepted Descartes' separation of the mind and body we've been trying to smash them back together again ever since when really they were never separate to begin with.

    The database normalization-denormalization dualism has the same mobius shaped reverberations as Descartes' error. We separate data into a million jagged little pieces and then spend all our time stooping over, picking them and up, and joining them back together again.

    Normalization has been standard practice now for decades. But times are changing. Many mega-website architects are concluding Watts was right: the data was never separate to begin with. And even more radical, we may even need to store multiple copies of data.

    Todd Hoff's picture

    Product: Hibernate Shards

    If you want to adopt a shard architecture, but don't want to start from scratch, you may want to consider Hibernate's sharding system.

    Hibernate Shards is a framework that is designed to encapsulate and minimize this complexity by adding support for horizontal partitioning to Hibernate Core.

    Hibernate Shards key features:

    * Standard Hibernate programming model - Hibernate Shards allows you to continue using the Hibernate APIs you know and love: SessionFactory, Session, Criteria, Query. If you already know how to use Hibernate, you already know how to use Hibernate Shards.

    * Flexible sharding strategies - Distribute data across your shards any way you want. Use one of the default strategies we provide or plug in your own application-specific logic.

    * Support for virtual shards - Think your sharding strategy is never going to change? Think again. Adding new shards and redistributing your data is one of the toughest operational challenges you will face once you've deployed your shard-aware application. Hibernate Sharding supports virtual shards, a feature designed to simplify the process of resharding your data.

    * Free/open source - Hibernate Shards is licensed under the LGPL (Lesser GNU Public License)

    Todd Hoff's picture

    GoogleTalk Architecture

    Google Talk is Google's instant communications service. Interestingly the IM messages aren't the major architectural challenge, handling user presence indications dominate the design. They also have the challenge of handling small low latency messages and integrating with many other systems. How do they do it?

    Todd Hoff's picture

    Paper: Replication Under Scalable Hashing

    Replication Under Scalable Hashing:
    A Family of Algorithms for Scalable Decentralized Data Distribution

    Typical algorithms for decentralized data distribution
    work best in a system that is fully built before it first used;
    adding or removing components results in either extensive
    reorganization of data or load imbalance in the system.
    We have developed a family of decentralized algorithms,
    RUSH (Replication Under Scalable Hashing), that
    maps replicated objects to a scalable collection of storage
    servers or disks. RUSH algorithms distribute objects to
    servers according to user-specified server weighting. While
    all RUSH variants support addition of servers to the system,
    different variants have different characteristics with
    respect to lookup time in petabyte-scale systems, performance
    with mirroring (as opposed to redundancy codes),
    and storage server removal. All RUSH variants redistribute
    as few objects as possible when new servers are
    added or existing servers are removed, and all variants
    guarantee that no two replicas of a particular object are
    ever placed on the same server. Because there is no central
    directory, clients can compute data locations in parallel,
    allowing thousands of clients to access objects on thousands
    of servers simultaneously.

    Todd Hoff's picture

    mixi.jp Architecture

    Mixi is a fast growing social networking site in Japan. They provide services like: diary, community, message, review, and photo album. Having a lot in common with LiveJournal they also developed many of the same approaches. Their write up on how they scaled their system is easily one of the best out there.

    Todd Hoff's picture

    LiveJournal Architecture

    A fascinating and detailed story of how LiveJournal evolved their system to scale. LiveJournal was an early player in the free blog service race and faced issues from quickly adding a large number of users. Blog posts come fast and furious which causes a lot of writes and writes are particularly hard to scale. Understanding how LiveJournal faced their scaling problems will help any aspiring website builder.

    Syndicate content