High Performance Multithreaded Access to Amazon SimpleDB is a great follow up to the idea in How SimpleDB Differs from a RDBMS that more programming is the price paid for performance in SimpleDB. It shows how much work and infrastructure is required to batter better performance out of SimpleDB.
Remember, in SimpleDB you get keys to records from queries so if you want to get all the fields for records you need to make separate requests. Since SimpleDB isn't exactly a speed daemon the obvious strategy is to parallelize. Even if a job takes a 100 msecs you can get a lot done in a little time if you can execute enough jobs in parallel.
Parallelization is the approach taken by Haakon@AWS in his Java code example of how to get the most out of SimpleDB. You can find the code at Indexing and Querying Amazon S3 Metadata with Amazon SimpleDB. We'll also consider how a back-end service architecture built on Erlang may be a better fit with cloud computing.
Update 2: A HSCALE benchmark finds HSCALE "adds a maximum overhead of about 0.24 ms per query (against a partitioned table)." Future releases promise much improved results.
Update: A new presentation at An Introduction to HSCALE.
After writing Skype Plans for PostgreSQL to Scale to 1 Billion Users, which shows how Skype smartly uses a proxy architecture for scaling, I'm now seeing MySQL Proxy articles all over the place. It's like those "get rich quick" books that say all you have to do is visualize a giraffe with a big yellow dot superimposed over it and by sympathetic magic giraffes will suddenly stampede into your life. Without realizing it I must have visualized transparent proxies smothered in yellow dots.
One of the brightest images is a wonderful series of articles by Peter Romianowski describing the evolution of their proxy architecture. Their application is an OLTP system executing 200 million transaction per month, tables with more than 1.5 billion rows, and a 600 GB total database size. They ran into a wall buying bigger boxes and wanted to move to a sharded architecture. The question for them was: how do you implement sharding?
Caching is like aspirin for headaches. Head hurts: pop a 'sprin. Slow site: add caching. Facebook must have a lot of headaches because they popped 805 memcached servers between 10,000 web servers and 1,800 MySQL servers and they reportedly have a 99% cache hit rate. But what's the best way for you to cache for your application? It's a remarkably complex and rich topic. Alexey Kovyrin talks about one common caching problem called the Dog Pile Effect in Dog-pile Effect and How to Avoid it with Ruby on Rails. Glenn Franxman also has a Django solution in MintCache.
Data is usually cached because it's too expensive to calculate for every hit. Maybe it's a gnarly SQL query you want to avoid and a little stale data is OK. Or maybe the amount of data you have is simply larger than physical memory on any one machine. Or maybe you have the temerity to write to your database and cause its cache to flush so database caching isn't sufficient at a certain level of scale.
Typical examples are for caching article vote counts, comment threads, and event streams. One familiar example that bit me hard is displaying the the top N blog articles. Do you want to scan through your entire access log table for every page display? Absolutely not. Especially when the nightly backups are going on and the network is very slow. Not good :-) Yet you still want to update the results every X minutes so the stats stay fresh.
Data freshness requires a refrigeration truck or an expiry time on your cache entry that causes stats to be periodically recalculated. Now, what happens when your cached data expires and a 1000 requests simultaneously try to recalculate the expensive to calculate data? Database load spikes and the world nearly ends. And since memcached operations are not atomic it's possible stale data could be cached and you'll serve stale data. Which kind of defeats of the purpose of taking load off the data while providing accurate data. So, how do you unpile the dogs?
80-90% of the end-user response time is spent on the frontend, so it makes sense to concentrate efforts there before heroically rewriting the backend. Take a shower before buying a Porsche, if you know what I mean. Steve Souders, author of High Performance Websites and Yslow, has ten more best practices to speed up your website:
Sadly, according to String Theory, there are only 26.7 rules left, so get them while they're still in our dimension.
Here are slides on the first few rules. Love the speeding dog slide. That's exactly what my dog looks like traveling down the road, head hanging out the window, joyfully battling the wind.
Also see 20 New Rules for Faster Web Pages.
Update: Arjen links to video Supporting Scalable Online Statistical Processing which shows
"rather than doing complete aggregates, use statistical sampling to provide a reasonable estimate (unbiased guess) of the result."
When you have a lot of data, sampling allows you to draw conclusions from a much smaller amount of data. That's why sampling is a scalability solution. If you don't have to process all your data to get the information you need then you've made the problem smaller and you'll need fewer resources and you'll get more timely results.
It's a sad fact of life, but processes die. I know, it's horrible. You start them, send them out into process space, and hope for the best. Yet sometimes, despite your best coding, they core dump, seg fault, or some other calamity befalls them. Unlike our messy biological world so cruelly ruled by entropy, in the digital world processes can be given another chance. They can be restarted. A greater destiny awaits. And hopefully this time the random lottery of unforeseen killing factors will be avoided and a long productive life will be had by all.
This is fun code to write because it's a lot more complicated than you might think. And restarting processes is a highly effective high availability strategy. Most faults are transient, caused by an unexpected series of events. Rather than taking drastic action, like taking a node out of production or failing over, transients can be effectively masked by simply restarting failed processes. Though complexity makes it a fun problem, it's also why you may want to "buy" rather than build. If you are in the market, Supervisor looks worth a visit.
Adapted from their website:
Supervisor is a Python program that allows you to start, stop, and restart other programs on UNIX systems. It can restart crashed processes.
Update: Nice explanation in The importance of bandwidth versus latency of how long latencies cause cascading delays in resource loading. Doloto tries to optimize how resources are loaded.
Twenty new rules have been added to the original 14 rules for sizzling web performance. Part of scalability is worrying about performance too. The front-end is where 80-90% of end-user response time is spent and following these best practices improved the performance of Yahoo! properties by 25-50%. The rules are divided into server, content, cookie, JavaScript, CSS, images, and mobile categories. The new rules are:
I attended Sebastian Stadil's AWS Training Camp Saturday and during the class Sebastian brought up a wonderfully counter-intuitive idea: CPU (EC2) costs a lot less than storage (S3, SDB) so you should systematically move as much work as you can to the CPU. This is said to be the Client-Cloud Paradigm. It leverages the well pummeled trend that CPU power follows Moore's Law while storage follows The Great Plains' Law (flat). And what sane computing professional would do battle with Sir Moore and his trusty battle sword of a law?
Embedded systems often make similar environmental optimizations. CPU rich and memory poor means operate on compressed serialized data structures. Deserialized data structures use a lot of memory, so why use them? It's easy enough to create an object wrapper around a buffer. Programmers shouldn't care how their objects are represented anyway. Yet we waste ginormous amounts of time and memory uselessly transforming XML in and out of different representations. Just transport compressed binary objects around and use them in place. Serialization and deserialization happen only on access (Pimpl Idiom).
It never occurred to me that in the land of AWS plenty similar "tricks" would make sense. But EC2 is a loss leader in AWS. CPU is plentiful and cheap. It's IO and storage that costs you...
Cal Henderson writes at thinkvitamin.com: "With our so-called "Web 2.0' applications and their rich content and interaction, we expect our applications to increasingly make use of CSS and JavaScript. To make sure these applications are nice and snappy to use, we need to optimize the size and nature of content required to render the page, making sure we’re delivering the optimum experience. In practice, this means a combination of making our content as small and fast to download as possible, while avoiding unnecessarily refetching unmodified resources." A lot of good comments too.
True high availability requires a presence in multiple data centers. The recent downtime of even a high quality operation like Amazon makes this need all the more clear. Typically only the big boys can afford the complexity of operating in two or more data centers. Cloud computing along with utility billing starts to change that equation, leveling the playing field. Even smaller outfits will be in a position to manage risk by spreading machines amongst EC2, 3tera, Slicehost, Mosso and other providers.
The question then becomes: given we aren't Angels, how do we walk amongst the clouds? One fascinating answer is exquisitely explained by Dmitriy Samovskiy in his Linux Journal article titled Building a Multisourced Infrastructure Using OpenVPN.
One of the most important architectural decisions that must be done early on in a scalable web site project is splitting the data flow into two streams: one that is user specific and one that is generic. If this is done properly, the system will be able to grow easily. On the other hand, if the data streams are not separated from the start, then the growth options will be severely limited. Trying to make such a web site scale will be just painting the corpse, and this change will cost a whole lot more when you need to introduce it later (and it is "when" in this case, not "if").
This is what Mike Peters says he can do: make your site run 10 times faster. His test bed is "half a dozen servers parsing 200,000 pages per hour over 40 IP addresses, 24 hours a day." Before optimization CPU spiked to 90% with 50 concurrent connections. After optimization each machine "was effectively handling 500 concurrent connections per second with CPU at 8% and no degradation in performance."
Mike identifies six major bottlenecks:
Mike's solutions:
Atif Ghaffar has a nice strategy to deal with virus checking uploads:
This removes the CPU bottleneck from your web servers and distributes it through your cluster. Keep your web servers providing prompt service to users. Let your cluster do the heavy lifting. This minimizes response time and maximizes throughput. A similar system can be used for creating thumbnails, transcoding, copyright checks, updating indexes, event notification or any other kind of intensive work.
I had a false belief
I thought I came here to stay
We're all just visiting
All just breaking like waves
The oceans made me, but who came up with me?
Push me, pull me, push me, or pull me out .
So true Perl Jam (Push me Pull me lyrics), so true. I too have wondered how web clients should be notified of model changes. Should servers push events to clients or should clients pull events from servers? A topic worthy of its own song if ever there was one.
A very entertaining and somewhat educational article on IBM Poopheads say LAMP Users Need to "grow up". The physical three tier architecture turns out to be the root of all evil and shared nothing architectures brings simplicity and light.
In the comments Simon Willison makes an insightful comment on why fine grained caching works for personalized pages and proxy's don't:
Great post, but I have to disagree with you on the finely grained caching part. If you look at big LAMP deployments such as Flickr, LiveJournal and Facebook the common technology component that enables them to scale is memcached - a tool for finely grained caching. That's not to say that they aren't doing shared-nothing, it's just that memcached is critical for helping the database layer scale. LiveJournal serves around 50% of its page views "permission controlled" (friends only) so an HTTP proxy on the front end isn't the right solution - but memcached reduces their database hits by 90%.
Release It! author Michael Nygard tells a tale of two web sites, both brought low by unexpectedly huge unbounded results sets that slowed down their sites to the speed of a Christmas checkout line.
I've committed this error more than a few times. During testing the results sets are often small, so you don't see problems. Or when a product is new you don't have a lot of data so everything is fine, until some magic line is crossed and you get that dreaded 2AM fix it call.
This is a question asked on the ycombinator list and there are some good responses. I gave a quick response, but I particularly like neilk's knock out of the park insightful answer:
Viewing an application as a series of state transitions instead of a blizzard of actions and events is a way under appreciated design perspective. This is one of they key design approaches for making robust embedded systems. A great paper talking about this sort of stuff is Mission Planning and Execution Within the Mission Data System - an effort to make engineering flight software more straightforward and less prone to error through the explicit modeling of spacecraft state. Another interesting paper is CLEaR: Closed Loop Execution and Recovery High-Level Onboard Autonomy for Rover Operations.
In general I call these Fact Based Architectures. I'm really glad neilk brought it up.
A lot of apps need to map IP addresses to locations. Jeremy Cole in On efficiently geo-referencing IPs with MaxMind GeoIP and MySQL GIS succinctly explains the many uses for such a feature:
Geo-referencing IPs is, in a nutshell, converting an IP address, perhaps from an incoming web visitor, a log file, a data file, or some other place, into the name of some entity owning that IP address. There are a lot of reasons you may want to geo-reference IP addresses to country, city, etc., such as in simple ad targeting systems, geographic load balancing, web analytics, and many more applications.
This is difficult to do efficiently, at least it gives me a bit of brain freeze. In the same post Jeremy nicely explains where to get the geo-rereferncing data, how to load data, and the performance of different approaches for IP address searching. It's a great practical introduction to the subject.
Now that the internet has become defined as a mashup over a collection of service APIs, we have a little problem: for clients using APIs is a lot like drinking beer through a straw. You never get as much beer as you want and you get a headache after. But what if I've been a good boy and deserve a bigger straw? Maybe we can use game theory to model trust relationships over a life time of interactions over many different services and then give more capabilities/beer to those who have earned them?
Bees have a similar problem to website servers: how to do a lot of work with limited resources in an ever changing environment. Usually lessons from biology are hard to apply to computer problems. Nature throws hardware at problems. Billions and billions of cells cooperate at different levels of organizations to find food, fight lions, and make sure your DNA is passed on.
Nature's software is "simple," but her hardware rocks. We do the opposite. For us hardware is in short supply so we use limited hardware and leverage "smart" software to work around our inability to throw hardware at problems. But we might be able to borrow some load balancing techniques from bees. What do bees do that we can learn from?
Michael Nygard talks about Two Ways To Boost Your Flagging Web Site. The idea behind cache farms is to move memory devoted to the various caching layers into one large farm of caches, as with memcached. The idea behind read pools is to allocate your database read requests to a pool of dedicated read servers, thus offloading the write server. Using a combination of the strategies you aren't forced to scale up the database tier to scale your website.
All the cool kids advocate scaling out as the secret sauce of scaling. And it is, but don't forget to serve some tasty "scaling up" as a side dish. Scaling up doesn't have to mean buying a jet propelled, liquid cooled, 128 core monster super computer. Scaling up can just mean buying at the high end of the commodity buffet by buying more cores, more memory and using a shared nothing architecture to take advantage of all that power without adding complexity. Scale out when you need to, but big beefy boxes can absorb a lot of load before it's necessary to hit up your data center for more rack space. Here are a few examples of scaling out and up:
Jesse Robbins at O'Reily Radar has a nice post on how spending a little up front time on figuring out how to scale your operations process saves money on ops people and allows you to save time adding and upgrading servers. Adding, monitoring, and upgrading servers can get so incredibly screwed up that a herd of squirrels has to work overtime just to put out a release. Or it can be one button simple from your automated build system out to your servers. This is one area where "do the simplest thing that could possibly work" is a dumb idea and Jesse does a good job capturing the advantages of doing it right.
One of the premier scaling strategies is always: get someone else to do the work for you. But unlike Huckleberry Finn in Tom Sawyer, you won't have to trick anyone into whitewashing a fence for you. Times have changed. Companies like Ning, Facebook, and Salesforce are more than happy to help. Their price: lock-in.
Previously you had few options when building a "real" website. You needed to do everything yourself. Infrastructure and application were all yours. Then companies stepped in by commoditizing parts of the infrastructure, but the application was still yours. The next step is full on Borg take no prisoners assimilation where the infrastructure and application are built as one collective. What you have to decide as someone faced with building a scalable website is if these new options are worth the price.
Robert Stewart shared this useful Ajax related scalability strategy:
We avoided XMLHttpRequests for individual keystrokes, choosing to go back to the server only when a field lost focus. Google can afford all the servers to handle the load for that, but we didn't want to.
Do you have a scalability strategy to share? Then share it!.