Hi High Scalability community!
Do you have any information on the architecture behind Google Voice, the new service by Google that offers one Google Number for all your calls and SMS? It is based on GrandCentral, which Google acquired two years ago.
Update 6: Some interesting changes from Twitter's Evan Weaver: everything in RAM now, database is a backup; peaks at 300 tweets/second; every tweet followed by an average of 126 people; vector cache of tweet IDs; row cache; fragment cache; page cache; keep separate caches; GC makes Ruby optimization resistant so went with Scala; Thrift and HTTP are used internally; 100s of internal requests for every external request; rewrote MQ but kept interface the same; 3 queues are used to load balance requests; extensive A/B testing for backwards compatibility; switched to C memcached client for speed; optimize critical path; faster to get the cached results from the network memory than recompute them locally.
Update 5: Twitter on Scala. A Conversation with Steve Jenson, Alex Payne, and Robey Pointer by Bill Venners. A fascinating discussion of why Twitter moved to the Java JVM for their server infrastructure (long lived processes) and why they moved to Scala to program against it (high level language, static typing, functional). Ruby is used on the front-end but wasn't performant or reliable enough for the back-end.
Update 4: Improving Running Components at Twitter by Evan Weaver. Tells how Twitter changed their infrastructure to go from handling 3 requests to 139 requests a second. They moved to a messaging model, asynchronous processing, 3 levels of cache, and moved their middleware to a mixture of C and Scala/JVM.
Update 3: Upgrading Twitter without service disruptions by Gojko Adzic. Lots of good updates on the new Twitter architecture.
Update 2: a commenter in Twitter Fails Macworld Keynote Test said this entry needs to be updated. LOL. My uneducated guess is it's not a language or architecture problem, but more a problem of not being able to add hardware fast enough into their data center. The predictability of this problem is debatable, but once you have it, it's hard to fix.
Update: Twitter releases Starling - light-weight persistent queue server that speaks the MemCache protocol. It was built to drive Twitter's backend, and is in production across Twitter's cluster.
Twitter started as a side project and blew up fast, going from 0 to millions of page views within a few terrifying months. Early design decisions that worked well in the small melted under the crush of new users chirping tweets to all their friends. Web darling Ruby on Rails was fingered early for the scaling problems, but Blaine Cook, Twitter's lead architect, held Ruby blameless:
For us, it’s really about scaling horizontally - to that end, Rails and Ruby haven’t been stumbling blocks, compared to any other language or framework. The performance boosts associated with a “faster” language would give us a 10-20% improvement, but thanks to architectural changes that Ruby and Rails happily accommodated, Twitter is 10000% faster than it was in January.
If Ruby on Rails wasn't to blame, how did Twitter learn to scale ever higher and higher?
Update: added slides Small Talk on Getting Big. Scaling a Rails App & all that Jazz
- For example, if getting a count is slow, you can memoize the count into memcache in a millisecond.
- Getting your friends status is complicated. There are security and other issues. So rather than doing a query, a friend's status is updated in cache instead. It never touches the database. This gives a predictable response time frame (upper bound 20 msecs).
- ActiveRecord objects are huge so that's why they aren't cached. So they want to store critical attributes in a hash and lazy load the other attributes on access.
- 90% of requests are API requests. So don't do any page/fragment caching on the front-end. The pages are so time sensitive it doesn't do any good. But they cache API requests.
- Use messaging a lot. Producers produce messages, which are queued and then distributed to consumers. Twitter's main functionality is to act as a messaging bridge between different formats (SMS, web, IM, etc).
- Send message to invalidate friend's cache in the background instead of doing all individually, synchronously.
- Started with DRb, which stands for distributed Ruby. A library that allows you to send and receive messages from remote Ruby objects via TCP/IP. But it was a little flaky and single point of failure.
- Moved to Rinda, which is a shared queue that uses a tuplespace model, along the lines of Linda. But the queues aren't persistent, so the messages are lost on failure.
- Tried Erlang. Problem: How do you get a broken server running at Sunday Monday with 20,000 users waiting? The developer didn't know. Not a lot of documentation. So it violates the use what you know rule.
- Moved to Starling, a distributed queue written in Ruby.
- Distributed queues were made to survive system crashes by writing them to disk. Other big websites take this simple approach as well.
- They do a review and push out new mongrel servers. No graceful way yet.
- An internal server error is given to the user if their mongrel server is replaced.
- All servers are killed at once. A rolling blackout isn't used because the message queue state is in the mongrels and a rolling approach would cause all the queues in the remaining mongrels to fill up.
- A lot of down time because people crawl the site and add everyone as friends. 9000 friends in 24 hours. It would take down the site.
- Build tools to detect these problems so you can pinpoint when and where they are happening.
- Be ruthless. Delete them as users.
- Plan to partition in the future. Currently they don't. These changes have been enough so far.
- The partition scheme will be based on time, not users, because most requests are very temporally local.
- Partitioning will be difficult because of automatic memoization. They can't guarantee read-only operations will really be read-only. May write to a read-only slave, which is really bad.
- Their API is the most important thing Twitter has done.
- Keeping the service simple allowed developers to build on top of their infrastructure and come up with ideas that are way better than Twitter could come up with. For example, Twitterrific, which is a beautiful way to use Twitter that a small team with different priorities could create.
- Index everything. Rails won't do this for you.
- Use EXPLAIN to see how your queries are running. Indexes may not be used as you expect.
- Denormalize a lot. It single-handedly saved them. For example, they store all of a user's friend IDs together, which prevented a lot of costly joins.
- Avoid complex joins.
- Avoid scanning large sets of data.
- You want to know when you deploy an application that it will render correctly.
- They have a full test suite now. So when the caching broke they were able to find the problem before going live.
- Scale changes what can be stupid.
- Trying to load 3000 friends at once into memory can bring a server down, but when there were only 4 friends it worked great.
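Two of the ideas above, memoizing a slow count into a cache and invalidating a friend's cache through a message queue instead of synchronously, can be sketched roughly as follows. This is only an illustration, not Twitter's code: a plain dict stands in for memcache, `queue.Queue` stands in for a queue server like Starling, and every name here (`CountCache`, `slow_friend_count`, `add_friend`, `drain_invalidations`) is made up for the example.

```python
import queue


class CountCache:
    """Dict-backed stand-in for memcache (hypothetical, for illustration)."""

    def __init__(self, slow_count):
        self._slow_count = slow_count  # the expensive query we want to avoid
        self._store = {}
        self.misses = 0                # how often we fell through to the query

    def get_count(self, user_id):
        key = f"friend_count:{user_id}"
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._slow_count(user_id)
        return self._store[key]

    def invalidate(self, user_id):
        self._store.pop(f"friend_count:{user_id}", None)


# Toy "database" of friendships.
FRIENDS = {"alice": {"bob", "carol"}, "bob": {"alice"}}


def slow_friend_count(user_id):
    # Stands in for a slow COUNT query.
    return len(FRIENDS.get(user_id, ()))


cache = CountCache(slow_friend_count)
invalidations = queue.Queue()


def add_friend(user_id, friend_id):
    FRIENDS.setdefault(user_id, set()).add(friend_id)
    # Producer: enqueue the invalidation rather than doing it in the request.
    invalidations.put(user_id)


def drain_invalidations():
    # Consumer: in a real deployment this would run in a background worker.
    while True:
        try:
            user_id = invalidations.get_nowait()
        except queue.Empty:
            break
        cache.invalidate(user_id)
```

The tradeoff is visible in the sketch: between `add_friend` and the next `drain_invalidations` call, readers see a stale count. In exchange, the request that added the friend never waits on cache maintenance.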
Update 5: PlentyOfFish Update - 6 Billion Pageviews And 32 Billion Images A Month
Update 4: Jeff Atwood costs out Markus' scale up approach against a scale out approach and finds scale up wanting. The discussion in the comments is as interesting as the article. My guess is Markus doesn't want to rewrite his software to work across a scale out cluster so even if it's more expensive scale up works better for his needs.
Update 3: POF now has 200 million images and serves 10,000 images per second. They'll be moving to a 250,000 IOPS RamSan to handle the load. Also upgraded to a core database machine with 512 GB of RAM, 32 CPUs, SQLServer 2008 and Windows 2008.
Update 2: This seems to be a POF Peer1 love fest infomercial. It's pretty content free, but the production values are high. Lots of quirky sounds and fish swimming on the screen.
Update: by Facebook standards Read/WriteWeb says POF is worth a cool one billion dollars. It helps to talk like Dr. Evil when saying it out loud.
PlentyOfFish is a hugely popular on-line dating system slammed by over 45 million visitors a month and 30+ million hits a day (500 - 600 pages per second). But that's not the most interesting part of the story. All this is handled by one person, using a handful of servers, working a few hours a day, while making $6 million a year from Google ads. Jealous? I know I am. How are all these love connections made using so few resources?
- IIS arbitrarily limits the total connections to 64,000 so a load balancer was added to handle the large number of simultaneous connections. Adding a second IP address and then using a round robin DNS was considered, but the load balancer was considered more redundant and allowed easier swap in of more web servers. And using ServerIron allowed advanced functionality like bot blocking and load balancing based on passed on cookies, session data, and IP data.
- The Windows Network Load Balancing (NLB) feature was not used because it doesn't do sticky sessions. A way around this would be to store session state in a database or in a shared file system.
- 8-12 NLB servers can be put in a farm and there can be an unlimited number of farms. A DNS round-robin scheme can be used between farms. Such an architecture has been used to enable 70 front end web servers to support over 300,000 concurrent users.
- NLB has an affinity option so a user always maps to a certain server, thus no external storage is used for session state and if the server fails the user loses their state and must relogin. If this state includes a shopping cart or other important data, this solution may be poor, but for a dating site it seems reasonable.
- It was thought that the cost of storing and fetching session data in software was too expensive. Hardware load balancing is simpler. Just map users to specific servers and if a server fails have the user log in again.
- The cost of a ServerIron was cheaper and simpler than using NLB. Many major sites use them for TCP connection pooling, automated bot detection, etc. ServerIron can do a lot more than load balancing and these features are attractive for the cost.
- One database is the main database.
- Two databases are for search. Load balanced between search servers based on the type of search performed.
- Monitors performance using task manager. When spikes show up he investigates. Problems were usually blocking in the database. It's always database issues. Rarely any problems in .net. Because POF doesn't use the .net library it's relatively easy to track down performance problems. When you are using many layers of frameworks finding out where problems are hiding is frustrating and hard.
- If you call the database 20 times per page view you are screwed no matter what you do.
- Separate database reads from writes. If you don't have a lot of RAM and you do reads and writes you get paging involved which can hang your system for seconds.
- Try and make a read only database if you can.
- Denormalize data. If you have to fetch stuff from 20 different tables try and make one table that is just used for reading.
- One day it will work, but when your database doubles in size it won't work anymore.
- If you only do one thing in a system it will do it really really well. Just do writes and that's good. Just do reads and that's good. Mix them up and it messes things up. You run into locking and blocking issues.
- If you are maxing the CPU you've either done something wrong or it's really really optimized. If you can fit the database in RAM do it.
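The affinity approach described above (map each user to one server, keep session state only on that server, and make the user log in again if it dies) might look something like this. It is a toy sketch and has nothing to do with how ServerIron or NLB are actually implemented. The `AffinityBalancer` class, the server names, and the choice of hashing the client IP are all assumptions for illustration.

```python
import hashlib


class AffinityBalancer:
    """Toy sticky-session balancer: a client IP always maps to one server."""

    def __init__(self, servers):
        self.servers = list(servers)
        # Session state lives only on the server that owns the client.
        self.sessions = {server: {} for server in self.servers}

    def pick(self, client_ip):
        # Stable hash so the same IP lands on the same server every time.
        digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.servers)
        return self.servers[index]

    def login(self, client_ip, user):
        server = self.pick(client_ip)
        self.sessions[server][client_ip] = user
        return server

    def current_user(self, client_ip):
        # None means the client has no session and must log in again.
        server = self.pick(client_ip)
        return self.sessions[server].get(client_ip)

    def fail(self, server):
        # A dead server takes its in-memory sessions with it.
        self.servers.remove(server)
        del self.sessions[server]
```

One caveat of this particular sketch: because it uses a simple modulo over the server list, removing a server can remap clients of the surviving servers as well, so they too would have to log in again. A real load balancer would presumably keep a connection table or use consistent hashing to avoid that.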
Thanks to Erik Osterman for recommending profiling PlentyOfFish.
Nick Belhomme wrote up an excellent summary of a talk given by Eli White on building scalable web applications. Eli worked at digg.com and is now the PHP Community Manager & DevZone Editor-in-Chief at Zend Technologies. Eli takes us on a grand tour through various proven scaling strategies. On the trip you'll visit:
InfoQ has this excellent talk by Brian Goetz on the new features being added to Java SE 7 that will allow programmers to fully exploit our massively multi-processor future. While the talk is about Java it's really more general than that and there's a lot to learn here for everyone.
Brian starts with a short, coherent, and compelling explanation of why programmers can't expect to be saved by ever faster CPUs and why we must learn to exploit the strengths of multiple core computers to make our software go faster.
Some techniques for exploiting multiple cores are given in an equally short, coherent, and compelling explanation of why divide and conquer is the secret to multi-core bliss, how fork-join works, how the Java approach differs from map-reduce, and lots of other juicy topics.
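To make the divide and conquer idea concrete, here is a rough sketch of the fork-join shape: split the problem in half until the pieces are small, solve the pieces in parallel, then combine the partial results. It is not Java's fork-join framework and has no work stealing; it uses Python's standard `concurrent.futures` thread pool, and the function names and the threshold value are made up for the example. Threads are used only to show the structure. In standard CPython, CPU-bound work like this would need processes to get a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split(lo, hi, threshold):
    """Recursively halve [lo, hi) until each piece is at most `threshold` long."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    if hi - lo <= threshold:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return split(lo, mid, threshold) + split(mid, hi, threshold)


def parallel_sum(values, threshold=1000, workers=4):
    """Sum `values` by solving small slices in parallel and combining them."""
    ranges = split(0, len(values), threshold)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # "Fork": each small slice is summed as an independent task.
        partials = pool.map(lambda r: sum(values[r[0]:r[1]]), ranges)
        # "Join": combine the partial results into the final answer.
        return sum(partials)
```

The threshold is the knob that matters: too large and there is not enough parallelism, too small and the bookkeeping costs more than the work being split.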
The multi-core "problem" is only going to get worse. Tilera founder Anant Agarwal estimates by 2017 embedded processors could have 4,096 cores, server CPUs might have 512 cores and desktop chips could use 128 cores. Some disagree saying this is too optimistic, but Agarwal maintains the number of cores will double every 18 months.
An abstract of the talk follows though I would highly recommend watching the whole thing. Brian does a great job.
Why is Parallelism More Important Now?
What has Java 7 Added to Support Parallelism?
Distributed systems are not typically a place domain driven design is applied. Distributed processing projects often start with an overall architecture vision and the idea about a processing model which basically drives the whole thing, including object design if it exists at all. Elaborate object designs are thought of as something that just gets in the way of distribution and performance, so the idea of spending time to apply DDD principles gets rejected in favour of raw throughput and processing power. However, from my experience, some more advanced DDD concepts can significantly improve the performance, scalability and throughput of distributed systems when applied correctly.
This article is a summary of the presentation titled "DDD in a distributed world" from the DDD Exchange 09 in London.
GemStone has unveiled GemFire 6.0 which is the culmination of several years of development and the continuous solving of the hardest data management problems in the world. With this release GemFire touts some of the latest innovative features in data management.
In this release:
- GemFire introduces a resource manager to continuously monitor and protect cache instances from running out of memory, triggering rebalancing to migrate data to less loaded nodes or allow dynamic increase/decrease in the number of nodes hosting data for linear scalability without impeding ongoing operations (no contention points).
- GemFire provides explicit control over when rebalancing can be triggered, on what class of data and even allows the administrator to simulate a "rebalance" operation to quantify the benefits before actually doing it.
- With built in instrumentation that captures throughput and latency metrics, GemFire now enables applications to sense changing performance patterns and proactively provision extra resources and trigger rebalancing. The end result is predictable data access throughput and latency without the need to overprovision capacity.
- We continue down the path of making the product more resilient than ever before - dealing with complex membership issues when operating in large clusters and allowing thresholds to be set in terms of consumption of memory in any server JVM that significantly reduces the probability of "stop the world" garbage collection cycles.
- Advanced Data Partitioning: Applications are no longer restricted by the memory available across the cluster to manage partitioned data. Applications can pool available memory as well as disk and stripe the data across memory and disk across the cluster. When the data fabric is configured as a cache, partitioned data can be expired or evicted so that only the most frequently used data is managed.
- Data-aware application behavior routing: There are several extensions added to the GemFire data-aware function execution service - a simple grid programming model that allows the application to synchronously or asynchronously execute application behavior on the data nodes. Applications invoke functions hinting the data they are dependent on and the service parallelizes the execution of the application function on all the grid nodes where the data is being managed. Applications can now define relationships between different classes of data to colocate all related data sets and application functions when routed to the data nodes can execute complex queries on in-process data. These and other features offered in the 'Function execution service' offer linear scalability for compute and data intensive applications. Simply add more nodes when demand spikes to rebalance data and behavior to increase the overall throughput for your application.
- API additions for C++, C#: Support for continuous querying, client side connection pooling and dynamic load balancing and ability to invoke server side functions.
- Cost based Query optimization: A new compact index to conserve memory utilization and enhanced query processor design with cost-based optimization has been introduced as part of this release.
- Developer productivity tools: It can be daunting when developers have to quickly develop and test their clustered application. Developers need the capability to browse the distributed data using ad-hoc queries, apply corrections or monitor resource utilization and performance metrics. A new graphical Data browser permits browsing and editing of data across the entire cluster, execution of ad-hoc queries and even creation of real-time table views that are continuously kept up-to-date through continuous queries. The GemFire Monitor tool (GFMon) also has several enhancements making the tool much more developer friendly.
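The data-aware routing idea in the list above, where the caller names the keys a function depends on and the function runs on whichever node holds that data, can be illustrated with a small in-process model. To be clear, this is not the GemFire API. The `Grid` class, its methods, and the CRC32-based partitioning are all invented for this sketch, and the "nodes" are just dicts in one process.

```python
from collections import defaultdict
import zlib


class Grid:
    """Toy partitioned data grid: each key is owned by exactly one node."""

    def __init__(self, node_count):
        self.nodes = [dict() for _ in range(node_count)]

    def owner(self, key):
        # Deterministic partitioning: hash the key to a node index.
        return zlib.crc32(key.encode("utf-8")) % len(self.nodes)

    def put(self, key, value):
        self.nodes[self.owner(key)][key] = value

    def execute(self, func, keys):
        """Run `func` against each key's value on the node that owns the key."""
        by_node = defaultdict(list)
        for key in keys:
            by_node[self.owner(key)].append(key)
        results = {}
        # In a real grid each group would execute on its node in parallel;
        # here the groups simply run one after another.
        for node_id, node_keys in by_node.items():
            local_data = self.nodes[node_id]
            for key in node_keys:
                results[key] = func(local_data[key])
        return results
```

The point of the shape is that only the function and the small results travel between caller and node. The data itself never moves.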
For more information on GemFire, view our newly rewritten technical white paper at:
To continue the graph theme Google has got into the act and released information on Pregel. Pregel does not appear to be a new type of potato chip. Pregel is instead a scalable infrastructure...
...to mine a wide range of graphs. In Pregel, programs are expressed as a sequence of iterations. In each iteration, a vertex can, independently of other vertices, receive messages sent to it in the previous iteration, send messages to other vertices, modify its own and its outgoing edges' states, and mutate the graph's topology.
Currently, Pregel scales to billions of vertices and edges, but this limit will keep expanding. Pregel's applicability is harder to quantify, but so far we haven't come across a type of graph or a practical graph computing problem which is not solvable with Pregel. It computes over large graphs much faster than alternatives, and the application programming interface is easy to use. Implementing PageRank, for example, takes only about 15 lines of code. Developers of dozens of Pregel applications within Google have found that "thinking like a vertex," which is the essence of programming in Pregel, is intuitive.
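Since Pregel itself isn't available, here is a guess at what "thinking like a vertex" looks like, as a single-process simulation of the iteration model described above. Each vertex reads the messages sent to it in the previous iteration, updates its own value, and sends messages along its outgoing edges. The example propagates the maximum value through the graph rather than computing PageRank. The function name, the data layout, and the stopping rule (stop when no messages are sent) are assumptions made for this sketch.

```python
def propagate_max(values, edges, max_supersteps=50):
    """Spread the largest vertex value along outgoing edges.

    values: {vertex: number}
    edges:  {vertex: [neighbor, ...]}, outgoing edges only; every neighbor
            must also appear as a key in `values`.
    """
    values = dict(values)
    inbox = {v: [] for v in values}
    for superstep in range(max_supersteps):
        outbox = {v: [] for v in values}
        for v in values:
            if superstep == 0:
                # First iteration: every vertex announces its value.
                changed = True
            else:
                # Later iterations: adopt a larger value if one arrived.
                best = max(inbox[v], default=values[v])
                changed = best > values[v]
                if changed:
                    values[v] = best
            if changed:
                for neighbor in edges.get(v, []):
                    outbox[neighbor].append(values[v])
        if not any(outbox.values()):
            break  # nobody sent anything, so nothing can change anymore
        inbox = outbox
    return values
```

Each vertex only ever looks at its own value, its own inbox, and its own outgoing edges, which is what would let a real system spread the vertices across many machines.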
Pregel does not appear to be publicly available, so it's not clear what the purpose of the announcement could be. Maybe it will be a new gmail extension :-)
I'm working on a web site that might totally flop or it might explode to be the next facebook/flickr/digg/etc. Since I really don't know how popular the site will be I don't want to spend a ton of money on the hardware/hosting right away but I want to be able to scale it easily if it does grow rapidly. With this in mind, what would be the best approach to launch the site?