Monday
Feb 16, 2009

Handle 1 Billion Events Per Day Using a Memory Grid

Moshe Kaplan of RockeTier shows the life cycle of an affiliate marketing system that starts off as a cub handling one million events per day and ends up a lion handling 200 million to even one billion events per day. The resulting system uses ten commodity servers at a cost of $35,000. Mr. Kaplan's paper is especially interesting because it documents a system architecture evolution we may see a lot more of in the future: database centric --> cache centric --> memory grid. As scaling and performance requirements for complicated operations increase, leaving the entire system in memory starts to make a great deal of sense. Why use cache at all? Why shouldn't your system be all in memory from the start?

General Approach to Evolving the System to Scale

  • Analyze the system architecture and the main business processes. Detect the main hardware bottlenecks and the business processes causing them. Focus efforts on the points of greatest return.
  • Rate the bottlenecks by importance and provide immediate, practical recommendations to improve performance.
  • Implement the recommendations to provide immediate relief to problems. Risk is reduced by avoiding a full rewrite and by not spending a fortune on more resources.
  • Plan a road map toward the next generation of the solution.
  • Scale up and scale out when redesign is necessary.

    One Million Event Per Day System

  • The events are common advertising system operations like: ad impressions, clicks, and sales.
  • A typical two-tier system. Impressions and banner sales are written directly to the database.
  • The immediate goal was to process 2.5 million events per day, so something needed to be done.

    2.5 Million Event Per Day System

  • PerfMon was used to check web server and DB performance counters. CPU usage was at 100% at peak.
  • Immediate fixes included: tuning SQL queries, implementing stored procedures, using a PHP compiler, removing include files and fixing other programming errors.
  • The changes successfully doubled the performance of the system within 3 months. The next goal was to handle 20 million events per day.

    20 Million Event Per Day System

  • To make this scaling leap a rethinking of how the system worked was in order.
  • The main load of the system was validating inputs in order to prevent forgery.
  • A cache was maintained in the application servers to cut unnecessary database access. The result was a 50% reduction in CPU utilization.
  • An in-memory database was used to accumulate transactions over time (impression counting, clicks, sales recording).
  • A periodic process was used to write transactions from the in-memory database to the database server (see the sketch after this list).
  • This architecture could handle 20 million events per day using existing hardware.
  • Business projections required a system that could handle 200 million events per day.
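
A minimal sketch of that accumulate-in-memory, persist-periodically pattern, in Python. All names are illustrative and sqlite3 stands in for the real database server; the paper's system used an in-memory database product in front of its database tier.

    # Hot path increments a counter in memory; a background thread
    # periodically writes the accumulated totals to the database.
    import sqlite3
    import threading
    import time
    from collections import Counter

    class EventAccumulator:
        def __init__(self, db_path, flush_interval=5.0):
            self.counts = Counter()          # in-memory accumulation of events
            self.lock = threading.Lock()
            self.db_path = db_path
            self.flush_interval = flush_interval

        def record(self, banner_id, event_type):
            # Called on the hot path: no database I/O, just a memory increment.
            with self.lock:
                self.counts[(banner_id, event_type)] += 1

        def flush_forever(self):
            # The periodic process: write accumulated totals to the
            # persistent store, then reset the in-memory counts.
            conn = sqlite3.connect(self.db_path)
            conn.execute("""CREATE TABLE IF NOT EXISTS event_totals
                            (banner_id TEXT, event_type TEXT, n INTEGER)""")
            while True:
                time.sleep(self.flush_interval)
                with self.lock:
                    batch, self.counts = self.counts, Counter()
                conn.executemany("INSERT INTO event_totals VALUES (?, ?, ?)",
                                 [(b, e, n) for (b, e), n in batch.items()])
                conn.commit()

    acc = EventAccumulator("events.db")
    threading.Thread(target=acc.flush_forever, daemon=True).start()
    acc.record("banner-42", "impression")   # fast: memory only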

    200 Million Event Per Day System

  • The next architectural evolution was to a scale out grid product. It's not mentioned in the paper but I think GigaSpaces was used.
  • A Layer 7 load balancer is used to route requests to sharded application servers. Each app server supports a different set of banners (see the routing sketch after this list).
  • Data is still stored in the database as the data is used for statistics, reports, billing, fraud detection and so on.
  • Latency was slashed because logic was separated out of the HTTP request/response loop into a separate process, and database persistence was moved offline. At this point the architecture supports near-linear scaling, and it's projected to scale easily to a billion events per day.
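
Here is the routing sketch. The hash-modulo rule and the server names are assumptions for illustration; in the paper the equivalent decision is made by the Layer 7 load balancer routing on the request itself.

    # Each application server owns a subset of banners, so any request
    # for a given banner always lands on the same shard.
    import hashlib

    APP_SERVERS = ["app1.example.com", "app2.example.com",   # placeholders
                   "app3.example.com", "app4.example.com"]

    def shard_for_banner(banner_id: str) -> str:
        # A stable hash maps the same banner to the same server every time.
        digest = hashlib.md5(banner_id.encode()).hexdigest()
        return APP_SERVERS[int(digest, 16) % len(APP_SERVERS)]

    print(shard_for_banner("banner-42"))  # always the same app server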

    Related Articles

  • GridGain: One Compute Grid, Many Data Grids


    Saturday
    Feb 14, 2009

    Scaling Digg and Other Web Applications

    Joe Stump, Lead Architect at Digg, gave this presentation at the Web 2.0 Expo. I couldn't find the actual presentation, but fortunately Kris Jordan took some great notes. That's how key moments in history are accidentally captured forever. Joe was also kind enough to respond to my email questions with a phone call. In this first part of the post Joe shares some timeless wisdom that you may or may not have read before. I of course take some pains to extract all the wit from the original presentation in favor of simple rules. What really struck me, however, was how Joe thought MemcacheDB will be the biggest new kid on the block in scaling. MemcacheDB has been around for a little while and I've never thought of it in that way. We'll learn why Joe is so excited by MemcacheDB at the end of the post.

    Impressive Stats

  • 80th-100th largest site in the world
  • 26 million uniques a month
  • 30 million users.
  • Uniques are only half that traffic. Traffic = unique web visitors + APIs + Digg buttons.
  • 2 billion requests a month
  • 13,000 requests a second, peaking at 27,000 requests a second.
  • 3 Sys Admins, 2 DBAs, 1 Network Admin, 15 coders, QA team
  • Lots of servers.

    Scaling Strategies

  • Scaling is specialization. When off the shelf solutions no longer work at a certain scale you have to create systems that work for your particular needs.
  • Lesson of web 2.0: people love making crap and sharing it with the world.
  • Web 2.0 sucks for scalability. Web 1.0 was flat with a lot of static files. Additional load is handled by adding more hardware. Web 2.0 is heavily interactive. Content can be created at a crushing rate.
  • Languages don't scale. 100% of the time bottlenecks are in IO. Bottlenecks aren't in the language when you are handling so many simultaneous requests. Making PHP 300% faster won't matter. Don't optimize PHP by using single quotes instead of double quotes when the database is pegged.
  • Don’t share state. Decentralize. Partitioning is required to process a high number of requests in parallel.
  • Scale out instead of up. Expect failures. Just add boxes to scale and avoid the fail.
  • Database-driven sites need to be partitioned to scale both horizontally and vertically. Horizontal partitioning means storing a subset of rows on different machines. It is used when there's more data than will fit on one machine. Vertical partitioning means putting some columns in one table and some columns in another table. This allows you to add data to the system without downtime.
  • Data are separated into separate clusters: User Actions, Users, Comments, Items, etc.
  • Build a data access layer so partitioning is hidden behind an API.
  • With partitioning comes the CAP Theorem: you can only pick two of the following three: Strong Consistency, High Availability, Partition Tolerance.
  • Partitioned solutions require denormalization, which has become a big problem at Digg. Denormalization means data is copied to multiple objects and must be kept synchronized.
  • MySQL replication is used to scale out reads.
  • Use an asynchronous queuing architecture for near-term processing (see the fan-out sketch after this list).
    - This approach pushes chunks of processing to another service and lets that service schedule the processing on a grid of processors.
    - It's faster and more responsive than cron and only slightly less responsive than real-time.
    - For example, issuing 5 synchronous database requests slows you down. Do them in parallel.
    - Digg uses Gearman. An example use is to get a permalink. Three operations are done in parallel: get the currently logged-in user, get the permalink, and grab the comments. All three are then combined into a single answer that's returned to the client. It's also used for site crawling and logging. It's a different way of thinking.
    - See Flickr - Do the Essential Work Up-front and Queue the Rest and The Canonical Cloud Architecture for more information.
  • Bottlenecks are in IO so you have to tune the database. When the database is bigger than RAM the disk is hit all the time, which kills performance. As the database gets larger the table can't be scanned anymore. So you have to:
    - denormalize
    - avoid joins
    - avoid large scans across databases by partitioning
    - cache
    - add read slaves
    - don't use NFS
  • Run numbers before you try and fix a problem to make sure things actually will work.
  • Files like icons and photos are handled by using MogileFS, a distributed file system. DFSs support high request rates because files are distributed and replicated around a network.
  • Cache forever and explicitly expire.
  • Cache fairly static content in a file based cache.
  • Cache changeable items in memcached.
  • Cache rarely changed items in APC. APC is a local cache. It's not distributed so no other program has access to the values.
  • For caching use the Chain of Responsibility pattern. Cache in MySQL, memcached, APC, and PHP globals. First check PHP globals as the fastest cache. If not present check APC, memcached and on up the chain (see the cache-chain sketch after this list).
  • Digg's recommendation engine is a custom graph database that is eventually consistent. Eventually consistent means that writes to one partition will eventually make it to all the other partitions. After a write, reads made one after another don't have to return the same value, as they could be handled by different partitions. This is a more relaxed constraint than strict consistency, which requires changes to be visible at all partitions simultaneously, so reads made one after another always return the same value.
  • Assume 1 million people a day will bang on any new feature so make it scalable from the start. Example: the About page on Digg did a live query against the master database to show all employees. It was just a quick hack to get it out. Then a spider went crazy and took the site down.
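
    The fan-out sketch promised above, in Python. The thread pool stands in for a real Gearman client, and the three fetch functions are placeholders for actual service calls.

        # Issue the three operations in parallel rather than serially,
        # then combine them into a single answer for the client.
        from concurrent.futures import ThreadPoolExecutor

        def get_logged_in_user(session_id): return {"user": "joe"}
        def get_permalink(item_id): return {"url": "/story/123"}
        def get_comments(item_id): return [{"by": "kevin", "text": "dugg!"}]

        def build_permalink_page(session_id, item_id):
            with ThreadPoolExecutor(max_workers=3) as pool:
                user = pool.submit(get_logged_in_user, session_id)
                link = pool.submit(get_permalink, item_id)
                comments = pool.submit(get_comments, item_id)
                return {"user": user.result(),
                        "permalink": link.result(),
                        "comments": comments.result()}

        print(build_permalink_page("sess-1", "item-123"))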
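
    And the cache-chain sketch: check the fastest cache first and fall back up the chain on a miss, populating each level on the way back. The classes are stand-ins; in Digg's stack the levels are PHP globals, APC, memcached, and finally MySQL.

        # Chain of Responsibility: each cache level answers if it can,
        # otherwise it delegates to the next (slower) level.
        class CacheLevel:
            def __init__(self, name, next_level=None):
                self.name, self.store, self.next_level = name, {}, next_level

            def get(self, key):
                if key in self.store:             # hit at this level
                    return self.store[key]
                if self.next_level is None:       # end of the chain (the database)
                    return None
                value = self.next_level.get(key)  # miss: ask the next level
                if value is not None:
                    self.store[key] = value       # populate on the way back
                return value

        # Fastest first: request globals -> APC -> memcached -> database.
        database = CacheLevel("mysql")
        memcached = CacheLevel("memcached", database)
        apc = CacheLevel("apc", memcached)
        request_globals = CacheLevel("globals", apc)

        database.store["story:123"] = {"title": "Scaling Digg"}
        print(request_globals.get("story:123"))   # walks the chain on first read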

    Miscellaneous

  • Digg buttons were a major key to generating traffic.
  • Uses Debian Linux, Apache, PHP, MySQL.
  • Pick a language you enjoy developing in, pick a coding standard, add inline documentation that's extractable, use a code repository, and use a bug tracker. Likes PHP, Trac, and SVN.
  • You are only as good as your people. You have to trust that the guy next to you is doing his job. To cultivate trust, empower people to make decisions. Trust that people have it handled and they'll take care of it. This cuts down on meetings because you know people will do the job right.
  • Completely a Mac shop.
  • Almost all developers are local. Some people are remote to offer 24 hour support.
  • Joe's approach is pragmatic. He doesn't have a language fetish. People went from PHP, to Python/Ruby, to Erlang. Uses vim. Develops from the command line. Has no idea how people constantly change tool sets all the time. It's not very productive.
  • Services (SOA) decoupling is a big win. Digg uses REST. Internal services return a vanilla structure that's mapped to JSON, XML, etc. Version in URL because it costs you nothing, for example: /1.0/service/id/xml. Version both internal and external services.
  • People don't understand how many moving parts are in a website. Something is going to happen and it will go down.

    MemcacheDB: Evolutionary Step for Code, Revolutionary Step for Performance

    Imagine Kevin Rose, the founder of Digg, who at the time of this presentation had 40,000 followers. If Kevin diggs just once a day that's 40,000 writes. As the most active diggers are the most followed, this becomes a huge performance bottleneck. Two problems appear. You can't update 40,000 follower accounts at once. Fortunately the queuing system we talked about earlier takes care of that. The second problem is the huge number of writes that happen. Digg has a write problem. If the average user has 100 followers that's 300 million diggs a day. That's 3,000 writes per second, 7GB of storage per day, and 5TB of data spread across 50 to 60 servers. With such a heavy write load MySQL wasn't going to work for Digg.

    That's where MemcacheDB comes in. In initial tests on a laptop MemcacheDB was able to handle 15,000 writes a second. MemcacheDB's own benchmark shows it capable of 23,000 writes/second and 64,000 reads/second. At those write rates it's easy to see why Joe was so excited about MemcacheDB's ability to handle their digg deluge. What is MemcacheDB? It's a distributed key-value storage system designed for persistence. It is NOT a cache solution, but a persistent storage engine for fast and reliable key-value based object storage and retrieval. It conforms to the memcache protocol (not completely), so any memcached client can connect to it. MemcacheDB uses Berkeley DB as its storage backend, so lots of features including transactions and replication are supported. Before you get too excited keep in mind this is a key-value store. You read and write records by a single key. There aren't multiple indexes and there's no SQL. That's why it can be so fast.

    Digg uses MemcacheDB to scale out the huge number of writes that happen when data is denormalized. Remember, it's a key-value store. The value is usually a complete application level object merged together from a possibly large number of normalized tables. Denormalizing introduces redundancies because you are keeping copies of data in multiple records instead of just one copy in a nicely normalized table. So denormalization means a lot more writes as data must be copied to all the records that contain a copy. To keep up they needed a database capable of handling their write load. MemcacheDB has the performance, especially when you layer memcached's normal partitioning scheme on top. I asked Joe why he didn't turn to one of the in-memory data grid solutions. Some of the reasons were:
  • This data is generated from many different databases and takes a long time to generate. So they want it in a persistent store.
  • MemcacheDB uses the memcache protocol. Digg already uses memcache so it's a no-brainer to start using MemcacheDB. It's easy to use and easy to set up (see the sketch after this list).
  • Operations is happy with deploying it into the datacenter as it's not a new setup.
  • They already have memcached high availability and failover code so that stuff already works.
  • Using a new system would require more ramp-up time.
  • If there are any problems with the code you can take a look. It's all open source.
  • Not sure those other products are stable enough. So it's an evolutionary step for code and a revolutionary step for performance. Digg is looking at using MemcacheDB across the board.
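
    Because MemcacheDB speaks the memcache protocol, a stock memcached client can talk to it. A minimal sketch, assuming the python-memcached package and a MemcacheDB instance listening locally (21201 is its usual default port):

        import memcache

        mc = memcache.Client(["127.0.0.1:21201"])

        # Store a denormalized, application-level object under a single key.
        # Unlike memcached, MemcacheDB persists it (to Berkeley DB).
        profile = {"user": "kevin", "followers": 40000, "diggs": ["item-123"]}
        mc.set("profile:kevin", profile)

        print(mc.get("profile:kevin"))  # read it back by key: no SQL, no joins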

    Related Articles

  • Scaling Digg and Other Web Applications by Kris Jordan.
  • MemcacheDB
  • Joe Stump's Blog
  • Memcached Related Tags on HighScalability
  • Caching Related Tags on HighScalability
  • BigTable
  • SimpleDB
  • Anti-RDBMS: A list of distributed key-value stores
  • An Unorthodox Approach to Database Design : The Coming of the Shard
  • Flickr Architecture
  • Episode 4: Scaling Large Web Sites with Joe Stump, Lead Architect at DIGG


    Thursday
    Feb 12, 2009

    MySpace Architecture

    Update: Presentation: Behind the Scenes at MySpace.com. Dan Farino, Chief Systems Architect at MySpace, shares details of some of MySpace's cool internal operations tools. MySpace.com is one of the fastest growing sites on the Internet, with 65 million subscribers and 260,000 new users registering each day. Often criticized for poor performance, MySpace has had to tackle scalability issues few other sites have faced. How did they do it? Site: http://myspace.com

    Information Sources

  • Presentation: Behind the Scenes at MySpace.com
  • Inside MySpace.com

    Platform

  • ASP.NET 2.0
  • Windows
  • IIS
  • SQL Server

    What's Inside?

  • 300 million users.
  • Pushes 100 gigabits/second to the internet. 10Gb/sec is HTML content.
  • 4,500+ web servers running Windows 2003/IIS 6.0/ASP.NET.
  • 1,200+ cache servers running 64-bit Windows 2003. 16GB of objects cached in RAM.
  • 500+ database servers running 64-bit Windows and SQL Server 2005.
  • MySpace processes 1.5 billion page views per day and handles 2.3 million concurrent users during the day.
  • Membership Milestones:
    - 500,000 Users: A Simple Architecture Stumbles
    - 1 Million Users: Vertical Partitioning Solves Scalability Woes
    - 3 Million Users: Scale-Out Wins Over Scale-Up
    - 9 Million Users: Site Migrates to ASP.NET, Adds Virtual Storage
    - 26 Million Users: MySpace Embraces 64-Bit Technology
  • 500,000 accounts was too much load for two web servers and a single database.
  • At 1-2 Million Accounts
    - They used a database architecture built around the concept of vertical partitioning, with separate databases for parts of the website that served different functions such as the log-in screen, user profiles and blogs.
    - The vertical partitioning scheme helped divide up the workload for database reads and writes alike, and when users demanded a new feature, MySpace would put a new database online to support it.
    - MySpace switched from using storage devices directly attached to its database servers to a storage area network (SAN), in which a pool of disk storage devices are tied together by a high-speed, specialized network, and the databases connect to the SAN. The change to a SAN boosted performance, uptime and reliability.
  • At 3 Million Accounts
    - The vertical partitioning solution didn't last because they replicated some horizontal information like user accounts across all vertical slices. With so many replications one would fail and slow down the system.
    - Individual applications like blogs on sub-sections of the Web site would grow too large for a single database server.
    - Reorganized all the core data to be logically organized into one database.
    - Split the user base into chunks of 1 million accounts and put all the data keyed to those accounts in a separate instance of SQL Server.
  • At 9 Million-17 Million Accounts
    - Moved to ASP.NET, which used fewer resources than their previous architecture. 150 servers running the new code were able to do the same work that had previously required 246.
    - Saw storage bottlenecks again. Implementing a SAN had solved some early performance problems, but now the Web site's demands were starting to periodically overwhelm the SAN's I/O capacity - the speed with which it could read and write data to and from disk storage.
    - Hit the limits of the 1 million-accounts-per-database division approach.
    - Moved to a virtualized storage architecture where the entire SAN is treated as one big pool of storage capacity, without requiring that specific disks be dedicated to serving specific applications. MySpace standardized on equipment from a relatively new SAN vendor, 3PARdata.
  • Added a caching tier—a layer of servers placed between the Web servers and the database servers whose sole job was to capture copies of frequently accessed data objects in memory and serve them to the Web application without the need for a database lookup.
  • 26 Million Accounts - Moved to 64-bit SQL server to work around their memory bottleneck issues. Their standard database server configuration uses 64 GB of RAM.
  • Horizontally federated databases. Databases are partitioned by purpose: profile, email, etc. Partitioning is based on user ranges. 1 million users live in each database. So you have Profile1, Profile2 all the way up to Profile300, as they have 300 million users.
  • Doesn't use ASP cache because they don't have a high enough hit rate on the front-end. The middle tier cache does have a high hit rate.
  • Failure isolation. Segment requests on the web server by database. Allow only 7 threads per database. So if a database is slow only those threads will slow down, and the traffic in the other threads will flow (see the sketch after this list).
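
    A sketch of that thread cap, using a semaphore per database. The mechanism here is illustrative (MySpace's implementation is .NET-specific); the point is that a slow database exhausts only its own 7-thread budget instead of stalling the whole web server.

        import threading

        MAX_THREADS_PER_DB = 7
        db_gates = {}                 # one semaphore per database
        gates_lock = threading.Lock()

        def query(db_name, run_query):
            with gates_lock:
                gate = db_gates.setdefault(
                    db_name, threading.Semaphore(MAX_THREADS_PER_DB))
            if not gate.acquire(blocking=False):
                # Database saturated: fail fast rather than piling on threads.
                raise RuntimeError(f"{db_name} is saturated")
            try:
                return run_query()
            finally:
                gate.release()

        # Traffic to a slow Profile1 can't block threads meant for Profile2.
        print(query("Profile1", lambda: "rows..."))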

    Operations

  • PerfCollector. Centralized collection of performance data via UDP. More reliable than the Windows facilities and allows any client to connect and see stats (a sketch follows this list).
  • Web Based Stack Dump Tool. Can right-click on a problem server and get a stack dump of the .NET managed threads. Used to have to RDC into the system, attach a debugger, and get an answer a half hour later. Slow, nonscalable, and tedious. Not just a stack dump, it gives a lot of context about what the thread is doing. Troubleshooting is easier because you can see that 90 threads are blocked on a database, so the database may be down.
  • Web Based Heap Dump Tool. Dumps all memory allocations. Very useful for developers. Saves hours of doing it by hand.
  • Profiler. Traces a request from start to finish and produces a report. See URL, methods, status, everything that will help you identify a slow request. Looks at lock contentions, whether a lot of exceptions are being thrown, anything that might be interesting. Very lightweight. It's running on one box in every VIP (group of 100 servers) in production. Samples 1 thread every 10 seconds. Always tracing in the background.
  • PowerShell. Microsoft's new shell that runs in process and passes objects between commands instead of parsing text output. MySpace develops a lot of commandlets to support operations.
  • Developed their own asynchronous communication technology to get around Windows networking problems and treat servers as a group. Can ship a .cs file, compile it, run it, and ship the response back.
  • Codespew. Pushes code updates on their communication technology. Used to do 5 code pushes a day, now down to 1 a week.
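
    A sketch of the UDP collection idea behind PerfCollector: fire-and-forget datagrams from every server, aggregated by one central listener. The address, port, and message fields are made up for this example.

        import json
        import socket
        import time

        COLLECTOR = ("127.0.0.1", 8125)   # hypothetical central collector

        def emit(metric, value):
            # UDP is cheap and connectionless; losing an occasional sample
            # under load is an acceptable trade for not slowing the servers.
            payload = json.dumps({"host": socket.gethostname(),
                                  "metric": metric,
                                  "value": value,
                                  "ts": time.time()}).encode()
            sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            sock.sendto(payload, COLLECTOR)

        emit("requests_per_sec", 1342)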

    Lessons Learned

  • You can build big websites using Microsoft tech.
  • A cache should have been used from the beginning.
  • The cache is a better place to store transitory data that doesn't need to be recorded in a database, such as temporary files created to track a particular user's session on the Web site.
  • Built-in OS features to detect denial of service attacks can cause inexplicable failures.
  • Distribute your data to geographically diverse data centers to handle power failures.
  • Consider using virtualized storage/clustered file systems from the start. It allows you to massively parallelize IO access while being able to add disks as needed without any reorganization.
  • Develop tools that work in a production environment. You can't simulate everything in a test environment. The scale and variety of uses APIs are put to can't be simulated in QA during testing. Legitimate users and hackers will run into corner cases that weren't hit in testing, though QA will find most of the problems.
  • Throw hardware at problems. Easier than changing their backend software to a new way of doing things. The example is they add a new database server for every million users. It might be more efficient to change their approach to more efficiently use the database hardware, but it's easier just to add servers. For now.


    Monday
    Feb 9, 2009

    Paper: Consensus Protocols: Two-Phase Commit  

    Henry Robinson has created an excellent series of articles on consensus protocols. Henry starts with a very useful discussion of what all this talk about consensus really means: the consensus problem is the problem of getting a set of nodes in a distributed system to agree on something - it might be a value, a course of action or a decision. Achieving consensus allows a distributed system to act as a single entity, with every individual node aware of and in agreement with the actions of the whole of the network. In this article Henry tackles Two-Phase Commit, the protocol most databases use to arrive at a consensus for database writes. The article is very well written with lots of pretty and informative pictures. He did a really good job. In conclusion we learn 2PC is very efficient: a minimal number of messages are exchanged and latency is low. The problem is that when a co-ordinator fails, availability is dramatically reduced. This is why 2PC isn't generally used on highly distributed systems. To solve that problem we have to move on to different algorithms, and those are the subject of other articles.
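
    To make the two phases concrete, here is a toy co-ordinator sketched in Python: phase one asks every participant to prepare and vote, phase two commits only on unanimous yes votes and aborts otherwise. Real implementations add durable logging and timeouts, which is exactly where the availability problems come from.

        class Participant:
            def __init__(self, name, will_commit=True):
                self.name, self.will_commit, self.state = name, will_commit, "init"

            def prepare(self):   # phase 1: vote, holding locks while prepared
                self.state = "prepared" if self.will_commit else "aborted"
                return self.will_commit

            def commit(self):    # phase 2, success path
                self.state = "committed"

            def abort(self):     # phase 2, failure path
                self.state = "aborted"

        def two_phase_commit(participants):
            if all(p.prepare() for p in participants):   # unanimous yes?
                for p in participants:
                    p.commit()
                return "committed"
            for p in participants:
                p.abort()
            return "aborted"

        nodes = [Participant("db1"), Participant("db2", will_commit=False)]
        print(two_phase_commit(nodes))  # "aborted": one no vote aborts everyone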


    Thursday
    Feb 5, 2009

    Product: HAProxy - The Reliable, High Performance TCP/HTTP Load Balancer

    Update: Load Balancing in Amazon EC2 with HAProxy. Grig Gheorghiu writes a nice post on HAProxy functionality and configuration: emulating virtual servers, logging, SSL, load balancing algorithms, session persistence with cookies, server health checks, etc. Adapted from the website: HAProxy is a free, very fast and reliable solution offering high availability, load balancing, and proxying for TCP and HTTP-based applications. It is particularly suited for web sites crawling under very high loads while needing persistence or Layer 7 processing. Supporting tens of thousands of connections is clearly realistic with today's hardware. Its mode of operation makes its integration into existing architectures very easy and riskless, while still offering the possibility not to expose fragile web servers to the Net. Currently, two major versions are supported:

  • Version 1.1 - maintaining critical sites online since 2002. The most stable and reliable, it has reached years of uptime. It receives no new features and is dedicated to mission-critical usages only.
  • Version 1.2 - opening the way to very high traffic sites. The same as 1.1 with some new features such as poll/epoll support for very large numbers of sessions, IPv6 on the client side, application cookies, hot-reconfiguration, advanced dynamic load regulation, TCP keepalive, source hash, weighted load balancing, an rbtree-based scheduler, and a nice Web status page. This code is still evolving but has significantly stabilized since 1.2.8.

    Unlike other free "cheap" load-balancing solutions, this product is only used by a few hundred people around the world, but those people run very big sites serving several million hits and between several tens of gigabytes to several terabytes per day to hundreds of thousands of clients. They need 24x7 availability and have the internal skills to take the risk of maintaining a free software solution. Often, the solution is deployed for internal uses and I only know about it when they send me some positive feedback or when they ask for a missing feature ;-) According to many users HAProxy competes quite well with the likes of Pound and Ultramonkey.
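
    For flavor, a minimal configuration sketch in the classic 1.x "listen" style, assuming two hypothetical backends; it shows round-robin balancing, cookie-based session persistence, and health checks:

        # Addresses and server names are placeholders.
        listen webfarm 0.0.0.0:80
            mode http
            balance roundrobin
            cookie SERVERID insert indirect
            server web1 10.0.0.1:80 cookie w1 check
            server web2 10.0.0.2:80 cookie w2 check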


    Thursday
    Feb 5, 2009

    Beta testers wanted for ultra high-scalability/performance clustered object storage system designed for web content delivery

    DataDirect Networks (www.ddn.com) is searching for beta testers for our exciting new object-based clustered storage system. Does this sound like you?

  • Need to store millions to hundreds of billions of files
  • Want to use one big file system but can't because no single file system scales big enough
  • Running out of inodes
  • Have to constantly tweak file systems to perform better
  • Need to replicate content to more than one data center across geographies
  • Have thumbnail images or other small files that wreak havoc on your file and storage systems
  • Constantly tweaking and engineering around performance and scalability limits
  • No storage system delivers enough IOPS to serve your content
  • Spend time load balancing the storage environment
  • Want a single, simple way to manage all this data

    If this sounds like you, please contact me at jgoldstein@ddn.com. DataDirect Networks is a 10-year old, well-established storage systems company specializing in Extreme Storage environments. We've deployed both the largest and the fastest storage/file systems on the planet - currently running at over 250GB/s. Our upcoming product is going to change the way storage is deployed for scalable web content and we're seeking testers who can throw their most challenging problems at our new system. It's time for something better and we're going to deliver it.


    Tuesday
    Feb 3, 2009

    10 More Rules for Even Faster Websites

    Update: How-To Minimize Load Time for Fast User Experiences. Shows how to analyze the bottlenecks preventing websites and blogs from loading quickly and how to resolve them. 80-90% of the end-user response time is spent on the frontend, so it makes sense to concentrate efforts there before heroically rewriting the backend. Take a shower before buying a Porsche, if you know what I mean. Steve Souders, author of High Performance Web Sites and YSlow, has ten more best practices to speed up your website:

  • Split the initial payload
  • Load scripts without blocking
  • Don’t scatter scripts
  • Split dominant content domains
  • Make static content cookie-free
  • Reduce cookie weight
  • Minify CSS
  • Optimize images
  • Use iframes sparingly
  • To www or not to www

    Sadly, according to String Theory, there are only 26.7 rules left, so get them while they're still in our dimension. Here are slides on the first few rules. Love the speeding dog slide. That's exactly what my dog looks like traveling down the road, head hanging out the window, joyfully battling the wind. Also see 20 New Rules for Faster Web Pages.


    Tuesday
    Feb 3, 2009

    Paper: Optimistic Replication

    To scale in the large you have to partition. Data has to be spread around, replicated, and kept consistent (keeping replicas sufficiently similar to one another despite operations being submitted independently at different sites). The result is a highly available, well performing, and scalable system. Partitioning is required, but it's a pain to do efficiently and correctly. Until quantum teleportation becomes a reality, how data is kept consistent across a bewildering number of failure scenarios is a key design decision. This excellent paper by Yasushi Saito and Marc Shapiro takes us on a wild ride (OK, maybe not so wild) through different approaches to achieving consistency. What's cool about this paper is they go over some real systems that we are familiar with and cover how they work: DNS (single-master, state-transfer), Usenet (multi-master), PDAs (multi-master, state-transfer, manual or application-specific conflict resolution), Bayou (multi-master, operation-transfer, epidemic propagation, application conflict resolution), CVS (multi-master, operation-transfer, centralized, manual conflict resolution). The paper then goes on to explain in detail the different approaches to achieving consistency. Most of us will never have to write the central nervous system of an application like this, but knowing about the different approaches and tradeoffs is priceless. The abstract:

    Data replication is a key technology in distributed data sharing systems, enabling higher availability and performance. This paper surveys optimistic replication algorithms that allow replica contents to diverge in the short term, in order to support concurrent work practices and to tolerate failures in low-quality communication links. The importance of such techniques is increasing as collaboration through wide-area and mobile networks becomes popular. Optimistic replication techniques are different from traditional “pessimistic” ones. Instead of synchronous replica coordination, an optimistic algorithm propagates changes in the background, discovers conflicts after they happen and reaches agreement on the final contents incrementally. We explore the solution space for optimistic replication algorithms. This paper identifies key challenges facing optimistic replication systems — ordering operations, detecting and resolving conflicts, propagating changes efficiently, and bounding replica divergence—and provides a comprehensive survey of techniques developed for addressing these challenges.
    If you can't wait to know the ending, here's the summary of the paper:
    We summarize some of the lessons learned from our own experience and in reviewing the literature. Optimistic, asynchronous data replication is an appealing technique; it indeed improves networking flexibility and scalability. Some environments or application areas could simply not function without optimistic replication. However, optimistic replication also comes with a cost. The algorithmic complexity of ensuring eventual consistency can be high. Conflicts usually require application-specific resolution, and the lost update problem is ultimately unavoidable. Hence our recommendations:

    (1) Keep it simple. Traditional, pessimistic replication, with many off-the-shelf solutions, is perfectly adequate in small-scale, fully connected, reliable networking environments. Where pessimistic techniques are the cause of poor performance or lack of availability, or do not scale well, try single-master replication: it is simple, conflict-free, and scales well in practice. State transfer using Thomas's write rule works well for many applications. Advanced techniques such as version vectors and operation transfer should be used only when you need flexibility and semantically rich conflict resolution.

    (2) Propagate operations quickly to avoid conflicts. While connected, propagate often and keep replicas in close synchronization. This will minimize divergence when disconnection does occur.

    (3) Exploit commutativity. Commutativity should be the default; design your system so that non-commutative operations are the uncommon case. For instance, whenever possible, partition data into small, independent objects. Within an object, use monotonic data structures such as an append-only log, a monotonically increasing counter, or a union-only set. When operations are dependent upon each other, represent the invariants explicitly.
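
    Since the recommendations lean on version vectors, here is a small sketch of how they work: each replica keeps a per-replica update counter, comparing two vectors tells you whether one state dominates the other or the updates were concurrent, and merging takes the union of history. Replica names and values are illustrative.

        def dominates(a, b):
            # True if vector a has seen everything vector b has.
            return all(a.get(site, 0) >= n for site, n in b.items())

        def compare(a, b):
            if dominates(a, b) and dominates(b, a):
                return "equal"
            if dominates(a, b):
                return "a is newer"
            if dominates(b, a):
                return "b is newer"
            return "conflict"   # concurrent updates: resolve in the application

        def merge(a, b):
            # After resolving a conflict, the merged state carries both histories.
            return {s: max(a.get(s, 0), b.get(s, 0)) for s in set(a) | set(b)}

        v1 = {"replicaA": 2, "replicaB": 1}
        v2 = {"replicaA": 1, "replicaB": 3}
        print(compare(v1, v2))  # "conflict": neither saw all of the other's writes
        print(merge(v1, v2))    # {"replicaA": 2, "replicaB": 3}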

    Related Articles

  • The End of an Architectural Era (It’s Time for a Complete Rewrite)
  • Big Table
  • Google's Paxos Made Live – An Engineering Perspective
  • Dynamo: Amazon’s Highly Available Key-value Store
  • Eventually Consistent - Revisited by Werner Vogels


    Sunday
    Feb 1, 2009

    More Chips Means Less Salsa

    Yes, I just got through watching the Superbowl so chips and salsa are on my mind and in my stomach. In recreational eating more chips requires downing more salsa. With multicore chips it turns out that as cores go up salsa goes down, salsa obviously being a metaphor for speed. Sandia National Laboratories found in their simulations: a significant increase in speed going from two to four multicores, but an insignificant increase from four to eight multicores. Exceeding eight multicores causes a decrease in speed. Sixteen multicores perform barely as well as two, and after that, a steep decline is registered as more cores are added. The problem is the lack of memory bandwidth as well as contention between processors over the memory bus available to each processor. The implication for those following a diagonal scaling strategy is to work like heck to make your system fit within eight multicores. After that you'll need to consider some sort of partitioning strategy. What's interesting is the research on where the cutoff point will be.


    Thursday
    Jan 29, 2009

    Event: MySQL Conference & Expo 2009

    The 5th annual MySQL Conference & Expo, co-presented by Sun Microsystems, MySQL and O'Reilly Media, happening April 20-23, 2009 in Santa Clara, CA, at the Santa Clara Convention Center and Hyatt Regency Santa Clara, brings over 2,000 open source and database enthusiasts together to harness the power of MySQL and celebrate the huge MySQL ecosystem. All around the world, people just like you are innovating with MySQL - and MySQL is fueling the innovation engine by releasing new mission critical solutions to help you work smarter. This deeply technical conference brings all of that creativity, energy, and knowledge together in one place for four very full days. Early registration ends February 16, 2009. The largest gathering of MySQL developers, users, and DBAs worldwide, the event reflects MySQL's wide-ranging appeal and capabilities. The open atmosphere of the MySQL Conference & Expo helps IT professionals and community members launch and develop the best database applications, tools, and software. As companies of all sizes look for ways to remain competitive and manage costs, open source software and tools provide valuable and efficient solutions for the enterprise. The 2009 edition of the MySQL Conference & Expo will present strategies for businesses to not just survive, but thrive in a challenging economy. Through expert instruction, hands-on tutorials, and readily available MySQL developers, users at all levels gain the knowledge they need to rapidly build solid applications with MySQL that scale with the enterprise. New to the 2009 program will be MySQL Camp, a space where any and all participants can create an "unconference" within the larger event.
