The key (no pun intended) to understanding how to organize your dataset’s data is to think of each shard not as an individual database, but as one large singular database. Just as in a normal single server database setup where you have a unique key for each row within a table, each row key within each individual shard must be unique to the whole dataset partitioned across all shards.
There are a few different ways we can accomplish uniqueness of row keys across a shard cluster. Each has its pro’s and con’s and the one chosen should be specific to the problems you’re trying to solve.
Successful software design is all about trade-offs. In the typical (if there is such a thing) distributed system, recognizing the importance of trade-offs within the design of your architecture is integral to the success of your system. Despite this reality, I see time and time again, developers choosing a particular solution based on an ill-placed belief in their solution as a “silver bullet”, or a solution that conquers all, despite the inevitable occurrence of changing requirements. Regardless of the reasons behind this phenomenon, I’d like to outline a few of the methods I use to ensure that I’m making good scalable decisions without losing sight of the trade-offs that accompany them. I’d also like to compile (pun intended) the issues at hand, by formulating a simple theorem that we can use to describe this oft occurring situation.
This article is a primer, intended to shine some much needed light on the logical, process oriented implementations of database scalability strategies in the form of a broad introduction. More specifically, the intent is to elaborate on the majority of these implementations by example.
Kevin Clark, director of IT operations for Lucasfilm, discusses how their data center works:
* Linux-based platform, SUSE (looking to change), and a lot of proprietary open source applications for content creation.
* 4,500-processor render farm in the datacenter. Workstations are used off hours.
* Developed their own proprietary scheduler to schedule their 5,500 available processors.
* Render nodes, the blade racks (from Verari), run dual-core dual Opteron chips with 32GB of memory on board, but are expanding those to quad-core. Are an AMD shop.
* 400TB of storage online for production.
* Every night they write out 10-20TB of new data on a render. A project will use up to a hundred-plus terabytes of storage.
* Incremental backups are a challenge because the data changes up to 50 percent over a week.
* NetApps used for storage. They like the global namespace in the virtual file system.
* Foundry Networks architecture shop. One of the larger 10-GbE-backbone facilities on the West coast. 350-plus 10 GbE ports that used for distribution throughout the facility and the backend.
* Grid computing used for over 4 years.
* A 10-Gig dark fiber connection connects San Rafael and their home office. Enables them to co-render and co-storage between the two facilities. No difference in performance in terms of where they went to look for their data and their shots.
* Artists get server class machines: HP 9400 workstations with dual-core dual Opteron processors and 16GB of memory.
* Challenge now is to better segment storage to not continue to sink costs into high-cost disks.
* VMware used to host a lot of development environments. Allows the quick turn up of testing as the tests can be allocated across VMs.
* Provides PoE (power-over-ethernet) out from the switch to all of our Web farms.
* Next push on the facilities side. How to be more efficient at airflow management and power utilization.
By Bhavin Turakhia CEO, Directi. Covers:
* Why scalability is important. Viral marketing can result in instant success. With RSS/Ajax/SOA number of requests grow exponentially with user base. Goal is to build a web 2.0 app that can server millions of users with zero downtime.
* Introduction to the variables. Scalability, performance, responsiveness, availability, downtime impact, cost, maintenance effort.
* Introduction to the factors. Platform selection, hardware, application design, database architecture, deployment architecture, storage architecture, abuse prevention, monitoring mechanisms, etc.
* Building our own scalable architecture in incremental steps: vertical scaling, vertical partitioning, horizontal scaling, horizontal partitioning, etc. First buy bigger. Then deploy each service on a separate node. Then increase the number of nudes and load balance. Deal with session management. Remove single points of failure. Use a shared nothing cluster. Choice of master-slave multi-master setup. Replication can be synchronous or asynchronous.
* Platform selection considerations. Use a global redirector for serving multiple datacenters. Add object, session API, and page cache. Add reverse proxy. Think about CDNs, Grid computing.
* Tips. Use a Horizontal DB architecture from the beginning. Loosely couple all modules. Use a REST interface for easier caching. Perform application sizing ongoingly to ensure optimal hardware utilization.
By George Palmer of 3dogsbark.com. Covers:
* How you start out: shared hosting, web server DB on same machine. Move two 2 machines. Minimal code changes.
* Scaling the database. Add read slaves on their own machines. Then master-master setup. Still minimal code changes.
* Scaling the web server. Load balance against multiple application servers. Application servers scale but the database doesn't.
* User clusters. Partition and allocate users to their own dedicated cluster. Requires substantial code changes.
* Caching. A large percentage of hits are read only. Use reverse proxy, memcached, and language specific cache.
* Elastic architectures. Based on Amazon EC2. Start and stop instances on demand. For global applications keep a cache on each continent, assign users to clusters by location, maintain app servers on each continent, use transaction replication software if you must replicate your site globally.
By John Engales CTO, Rackspace. Good presentation of the stages a typical successful website goes through:
* Stage 1 - The Beginning: Simple architecture, low complexity. no redundancy. Firewall, load balancer, a pair of web servers, database server, and internal storage.
* Stage 2 - More of the same, just bigger.
* Stage 3 - The Pain Begins: publicity hits. Use reverse proxy, cache static content, load balancers, more databases, re-coding.
* Stage 4 - The Pain Intensifies: caching with memcached, writes overload and replication takes too long, start database partitioning, shared storage makes sense for content, significant re-architecting for DB.
* Stage 5 - This Really Hurts!: rethink entire application, partition on geography user ID, etc, create user clusters, using hashing scheme for locating which user belongs to which cluster.
* Stage 6 - Getting a little less painful: scalable application and database architecture, acceptable performance, starting to add ne features again, optimizing some code, still growing but manageable.
* Stage 7 - Entering the unknown: where are the remaining bottlenecks (power, space, bandwidth, CDN, firewall, load balancer, storage, people, process, database), all eggs in one basked (single datacenter, single instance of data).
Just thought I'd drop a brief suggestion to anyone building a large mail system. Our solution for scaling mail pickup was to develop a sharded architecture whereby accounts are spread across a cluster of servers, each with imap/pop3 capability. Then we use a cluster of reverse proxies (Perdition) speaking to the backend imap/pop3 servers . The benefit of this approach is you can use simply use round-robin or HA loadbalancing on the perdition servers that end users connect to (e.g. admins can easily move accounts around on the backend storage servers without affecting end users). Perdition manages routing users to the appropriate backend servers and has MySQL support. What we also liked about this approach was that it had no dependency on a distributed or networked filesystem, so less chance of corruption or data consistency issues. When an individual server reaches capacity, we just off load users to a less used server. If any server goes offline, it only affects the fraction of users assigned to that server.
Best,
Erik Osterman
Recent comments
4 hours 37 min ago
1 day 6 hours ago
1 day 12 hours ago
2 days 3 hours ago
2 days 7 hours ago
2 days 16 hours ago
2 days 20 hours ago
3 days 9 hours ago
3 days 17 hours ago
3 days 22 hours ago