High Performance Multithreaded Access to Amazon SimpleDB is a great follow up to the idea in How SimpleDB Differs from a RDBMS that more programming is the price paid for performance in SimpleDB. It shows how much work and infrastructure is required to batter better performance out of SimpleDB. Remember, in SimpleDB you get keys to records from queries so if you want to get all the fields for records you need to make separate requests. Since SimpleDB isn't exactly a speed daemon the obvious strategy is to parallelize. Even if a job takes a 100 msecs you can get a lot done in a little time if you can execute enough jobs in parallel. Parallelization is the approach taken by Haakon@AWS in his Java code example of how to get the most out of SimpleDB. You can find the code at Indexing and Querying Amazon S3 Metadata with Amazon SimpleDB. We'll also consider how a back-end service architecture built on Erlang may be a better fit with cloud computing. Two general mechanisms of parallelism are available: threads and boxes. To get the most bang out of a single machine you need threads (events, etc). To scale beyond the load handled by a single machine you need multiple boxes. The example code uses the Executor Thread Pool for parallelism within a program. Thread pools are a pretty common idiom by now. Amazon's queue service SQS was used to distribute work amongst boxes. Work was queued to SQS in batches of 1000 work items. The items were pulled by the thread pool and processed. Why 1000? The idea is to balance processing overhead with work overhead. You don't want popping items off SQS to dominate your processing time so you have to do enough work in each pass to make it worth the investment. The architecture uses two thread pools: one to run queries and one to get record values. Applications must carefully tune the number of threads in each pool so the queries to overwhelm the gets. Using a query thread pool with 2 threads and a get thread pool with 32 threads it was possible to perform 300 TPS on a small EC2 instances. Theoretically the advantage of this architecture is that it will scale to any size you need. SQS is your work distribution backbone and you just spin up the number of thread pool instances you need. The disadvantage is that this is a lot of programmer effort. But let's consider that you had to do some serious processing on each record, you would need something like this approach anyway to scale out the processing. But to perform simple aggregation operations it's total overkill which is why more time needs to be spent on the write site of the equation in SimpleDB/BigTable than the read side as we are used to with a RDBMS. What's the best way to go parallel? On the front-end life is simple. Go shared nothing and compose your pages from scalable back-end services. This is how Amazon does it and it's how Google AppEngine does it. GAE completely punts on the back-end service layer architecture. Unfortunately we still need to create a back-end architecture for more complex applications. Thread pools and SQS is one parallelization approach. Instead of thread pools something like Java's fork/join framework could be used. Initially I thought piling on more low level primitive threading facilities into Java was the wrong way to go. Yes, it is a "'multicore-friendly lightweight parallel framework' that supports a style of parallel programming where problems are recursively split into smaller fragments, solved in parallel and recombined," but it's also a style of programming that is very difficult to program correctly. If cloud architectures will rely on these primitives for efficiency then I think we have regressed. Erlang style architectures described by Luke Hoersten in Scalable Web Apps: Erlang + Python is a simpler more reliable to programming model. An event driven actor based approach is much harder to screw up than closely cooperating threads in a shared memory space. Erlang originally ran in embedded systems where the requirement was to reliably squeeze the most work possible out of limited CPU and other compute resources. Oddly enough the embedded node of old closely parallels your basic cloud VM. Start your work horse Erlang (or other similar system) instances and let them efficiently chew up your work loads. Erlang's scheduling model fits perfectly with a service centric job engine cloud instance. It will get more work done then your typical thread based system ever would.
HSCALE - Handling 200 Million Transactions Per Month Using Transparent Partitioning With MySQL Proxy
Update 2: A HSCALE benchmark finds HSCALE "adds a maximum overhead of about 0.24 ms per query (against a partitioned table)." Future releases promise much improved results. Update: A new presentation at An Introduction to HSCALE. After writing Skype Plans for PostgreSQL to Scale to 1 Billion Users, which shows how Skype smartly uses a proxy architecture for scaling, I'm now seeing MySQL Proxy articles all over the place. It's like those "get rich quick" books that say all you have to do is visualize a giraffe with a big yellow dot superimposed over it and by sympathetic magic giraffes will suddenly stampede into your life. Without realizing it I must have visualized transparent proxies smothered in yellow dots. One of the brightest images is a wonderful series of articles by Peter Romianowski describing the evolution of their proxy architecture. Their application is an OLTP system executing 200 million transaction per month, tables with more than 1.5 billion rows, and a 600 GB total database size. They ran into a wall buying bigger boxes and wanted to move to a sharded architecture. The question for them was: how do you implement sharding? In the first article four approaches to sharding were identified:
Misusing HTTP sessions is probably the number one obstacle to building scalable web sites today. Here are some tips how to consume HTTP sessions responsibly.
Update 6: nginx_http_push_module. Turn nginx into a long-polling message queuing HTTP push server.
Update 5: In Load Balancer Update Barry describes how WordPress.com moved from Pound to Nginx and are now "regularly serving about 8-9k requests/second and about 1.2Gbit/sec through a few Nginx instances and have plenty of room to grow!".
Update 4: Nginx better than Pound for load balancing. Pound spikes at 80% CPU, Nginx uses 3% and is easier to understand and better documented.
Update 3: igvita.com combines two cool tools together for better performance in Nginx and Memcached, a 400% boost!.
Update 2: Software Project on Installing Nginx Web Server w/ PHP and SSL. Breaking away from mother Apache can be a scary proposition and this kind of getting started article really helps easy the separation.
Update: Slicehost has some nice tutorials on setting up Nginx.
From their website:
Nginx ("engine x") is a high-performance HTTP server and reverse proxy, as well as an IMAP/POP3/SMTP proxy server. Nginx was written by Igor Sysoev for Rambler.ru, Russia's second-most visited website, where it has been running in production for over two and a half years. Igor has released the source code under a BSD-like license. Although still in beta, Nginx is known for its stability, rich feature set, simple configuration, and low resource consumption.
Bob Ippolito says of Nginx:
The only solution I know of that's extremely high performance that offers all of the features that you want is Nginx... I currently have Nginx doing reverse proxy of over tens of millions of HTTP requests per day (thats a few hundred per second) on a single server. At peak load it uses about 15MB RAM and 10% CPU on my particular configuration (FreeBSD 6).
Under the same kind of load, Apache falls over (after using 1000 or so processes and god knows how much RAM), Pound falls over (too many threads, and using 400MB+ of RAM for all the thread stacks), and Lighty leaks more than 20MB per hour (and uses more CPU, but not significantly more).
Update: Jake in Does Django really scale better than Rails? thinks apps like FFS shouldn't need so much hardware to scale.
In a short three months Friends for Sale (think Hot-or-Not with a market economy) grew to become a top 10 Facebook application handling 200 gorgeous requests per second and a stunning 300 million page views a month. They did all this using Ruby on Rails, two part time developers, a cluster of a dozen machines, and a fairly standard architecture. How did Friends for Sale scale to sell all those beautiful people? And how much do you think your friends are worth on the open market?
- 6, 4 core 8 GB application servers.
- Each application server runs 16 mongrels for a total of 96 mongrels. -
- 4 GB memcache instance on each application server
- 2 32GB 4 core servers with 4x 15K SCSI RAID 10 disks in a master-slave setup
Getting to Know You
Our system is designed for our Facebook application, Friends for Sale.
It's basically Hot-or-Not with a market economy. At the time of this
writing it's the 10th most popular application on Facebook.
Their Facebook description reads: Buy and sell your friends as pets! You can make your pets poke, send gifts, or just show off for you.
Make money as a shrewd pets investor or as a hot commodity! Friends for Sale is the bees knees!
We designed this as more of an experiment to see if we understood virality concepts and metrics on Facebook. I guess we do. =)
As a Facebook application, every request is dynamic so no page caching is possible. Also, it is a very interactive, write heavy application so scaling the database was a challenge.
We memcached extensively early on - every page reload results in 0 SQL calls. We use Rail's fragment caching with custom expiration logic mostly.
We had more than half a million unique visitors yesterday and growing fast. We're on track to do more than 300 million page views this month.
We used around 3 terabytes of bandwidth last month. This month should be at least 5TB or so. This number is just for a few icons and XHTML/CSS.
We don't really have unique documents ... we do have around 10 million user profiles though.
The only images we store are a few static image icons.
We went from around 3M page views per day a month ago to more than 10M page views a day. A month before that we were doing 1M page views per day. So that's around a 300% monthly growth rate but that is plateauing. On a request per second basis, we get around 200 requests per second.
It's all free.
It's around 1% per day, with a growth rate of 3% or so per day in terms of installed users.
We had roughtly 2.1 million unique visitors in the past month according to Google.
It's a relatively standard Rails cluster. We have a dedicated front end proxy balancer / static web server running nginx, which proxies directly to 6, 4 core 8 GB application servers. Each application server runs 16 mongrels for a total of 96 mongrels. The front end load balancer proxies directly to the mongrel ports. In addition, we run a 4 GB memcache instance on each application server, along with a local starling distributed queue server and misc background processes.
We use god to monitor our processes.
On the DB layer, we have 2 32GB 4 core servers with 4x 15K SCSI RAID 10 disks in a master-slave setup. We use Dr Nic's magic multi-connection's gem in production split reads and writes to each
We are adding more slaves right now so we can distribute the read load better and have better redundancy and backup policies. We also get help from Percona (the mysqlperformanceblog guys) for remote DBA work.
We're hosted on Softlayer - they're a fantastic host. The only problem was that their hardware load balancing server doesn't really work very well ... we had lots of problems with hanging connections and latency. Switching a dedicated box running just nginx fixed everything.
It really isn't. On the application layer we are shared-nothing so it's pretty trivial. On the database side we're still with a monolithic master and we're trying to push off sharding for as long as we can. We're still vertically scaled on the database side and I think we can get away with it for quite some time.
The three things that are unique is -
1. Neither of the two developers in involved had previous experience in large scale Rails deployment.
2. Our growth trajectory is relatively rare in the history of Rails deployments
3. We had very little opportunity for static page caching - each request does hit the full Rails stack
We learned that a good host, good hardware, and a good DBA are very important. We used to be hosted on Railsmachine, which to be fair is an excellent shared hosting company and they did go out of there way to support us. In the end though, we were barely responsive for a good month due to hardware problems, and it only took two hours to get up and running on Softlayer without a hitch. Choose a good host if you plan on scaling, because migrating isn't fun.
The most important thing we learned is that your scalability problems is pretty much always, always, always the database. Check it first, and if you don't find anything, check again. Then check again. Without exception, every performance problem we had can be traced to the database server, the database configuration, the query, or the use and non-use of indices.
We definitely should have gotten on to a better host earlier in the game so we would have been up.
We definitely wouldn't change our choice of framework - Rails was invaluable for rapid application development, and I think we've pretty much proven that two guys without a lot of scaling experience can scale a Rails app up. The whole 'but does Rails scale?' discussion sounds like a bunch of masturbation - the point is moot.
We have two Rails developers, inclusive of me. We very recently retained the services of a remote DBA for help on the database end.
On the technical side, 2 part time (now full time), and 1 remote DBA contractor.
The full time employees are also located in the SOMA area of San Francisco.
The two developers server as co-founders . I (Siqi) was responsible for front end design and development early on, but since I had some experience with deployment I also ended up handling network operations and deployment as well. My co founder Alex is responsible for the bulk of the Rails code - basically all the application logic is from him. Now I find myself doing more deep back end network operations tasks like MySQL optimization and replication - it's hard to find time to get back to the front end which is what I love. But it's been a real fun learning experience so I've been eating up all I can from this.
Yes - basically find the smartest people you can, give them the best deal possible, and get out of their way. The best managers GET OUT OF THE WAY, so I try to run the company as much as I can with that in mind. I think I usually fail at it.
We'd have to have some really good communication tools in the cloud - somebody would have to be a Basecamp nazi. I think remote work / outsourcing is really difficult - I prefer to stay away with from it
for core development. For something like MySQL DBA or even sysadmin - it might make more sense.
What do you use?We use Rails with a bunch of plugins, most notable cache-fu from Chris Wanstrath and magic multi connections from Dr. Nic. I use VIM as the editor with the rails.vim plugin.
Ruby / Rails
We now have 12 servers in the cluster.
4 DB servers, 6 application servers, 1 staging server, and 1 front end server.
We order them from Softlayer - there's a less than 4 hour turn around for most boxes, which is awesome.
CentOS 5 (64 bit)
We just use nginx's built in proxy balancer.
We use a dedicated hosting service, Softlayer.
We use NAS for backups but internal SCSI drives for our production boxes.
Across all of our boxes we probably have around ... 5 TB of storage or
Ad-hoc. We haven't done a proper capacity planning study, to our detriment.
Right now we just persist it to the database - it would be fairly easy to use memcache directly for this purpose though.
Master/slave right now. We're moving towards a Master/Multi-slave with a read only load balancing proxy to the slave cluster.
We do it in software via nginx.
We run network ads. We also weight our various ad networks by eCPM on our application layer.
Me: Front end design, development, limited Rails. Obviously, recently proficient in MySQL optimization and large scale Rails deployment.
Alex: application logic development, front end design, general software engineering.
Alex develops on OSX while I develop on Ubuntu. We use SVN for version control. I use VIM for editing and Alex uses TextMate.
On the logic layer, it's very test driven - we test extensively. On the application layer, it's all about quick iterations and testing.
We cache both in memcache with no TTL, and we just manually expire.
How do you manage your system?
We use Pingdom for external website monitoring - they're really good.
Right now we're just relying on our external monitoring and Softlayer's ping monitoring. We're investigating FiveRuns for monitoring as a possible solution to server monitoring.
We deploy to staging and run some sanity tests, then we do a deploy to all application servers.
We trace back every SQL query in development to make sure we're not doing any unnecessary calls or model instantiations. Other than that, we haven't done any real benchmarking.
User feedback and critical thinking. We are big believers in simplicity so we are pretty careful to consider before we add any major features.
We use a home grown metrics tracking system for virality optimization,
and we also use Google Analytics.
Yes, from the time to time we will tweak aspects of our design to optimize for virality.
How is your data center setup?
Don't know to all of the above.
We use LVM to do incrementals on a weekly and daily basis.
Right now they are done manually, except for new Rails application deployments. We use capistrano to update and restart our application servers.
We usually migrate on a slave first and then just switch masters.
Not very good.
Oh we wish.
CPM - more page views more money. We also have incentivized direct offers through our virtual currency.
Word of mouth - the social graph. We just leverage viral design tactics to grow.
I think Ruby is pretty particularly cool. But no, not really - we're not doing rocket science, we're just trying to get people laid.
No, that wouldn't be very smart.
Hm. I'd say none if you haven't scaled up anything before, and a lot if you have. It's hard to know what's actually going to be the problem until you've actually been through and see what real load problems look like. Once you've done that, then you have enough domain knowledge to do some actual meaningful up front design on our next go around.
How unreliable vendor hardware can be, and how different support can be from host to host. The number one most important thing you will need is a scaled up dedicated host who can support your needs. We use Softlayer and we can't recommend them highly enough.
On the other hand, it's surprising how far just a master-multislave setup can take you on commodity hardware. You can easily do a Billion page views per month on this setup.
It doesn't really, we just fix bottle necks as they come and we see them coming.
Brad Fitzpatrick for inventing memcache, and anyone who has successfully horizontally scaled anything.
We will have to start sharding by users soon as we hit database size and write limits.
Their Thoughts on Facebook Virality
I'd really like to thank Siqi taking the time to answer all my questions and provide this fascinating look in to their system. It's amazing what you've done in so little time. Excellent job and thanks again.
Website stats:Webserver: Apache 2.2 Database: MySQL 5.0 APC cache for php CMS: Drupal 6.2 (bleeding-edge version)* *Aggressive caching is ON, Page Compression ON, Block Cache ON (can't use CCS),Optimize CSS/JS ON. 2 Servers: Apache/Mysql (low-tech servers - Celeron processors, 512 MB RAM, 7200 RPM HDD) Bandwidth 10 Mb/s
The benchmark:Used ab : ab -n 1000 -c 20 howwhatwho.com Server Software: Apache/2.2.3 Server Hostname: howwhatwho.com Server Port: 80 Document Path: / Document Length: 41639 bytes Concurrency Level: 20 Time taken for tests: 13.556796 seconds Complete requests: 1000 Failed requests: 0 Write errors: 0 Total transferred: 42118000 bytes HTML transferred: 41639000 bytes Requests per second: 73.76 [#/sec] (mean) Time per request: 271.136 [ms] (mean) Time per request: 13.557 [ms] (mean, across all concurrent requests) Transfer rate: 3033.90 [Kbytes/sec] received The Apache server is also running the postifx and bind although they aren't resource intensive applications. The Cron job for drupal runs every 50 minutes, and the agreggator module is enabled and fetches more than 30 rss feeds every time. The site used to be hosted on a single Celeron machine but on peak times the CPU went up to 80 %. Question : Does anybody know a website hosted on an IBM Mainframe? :) Todd?
Update: Arjen links to video Supporting Scalable Online Statistical Processing which shows "rather than doing complete aggregates, use statistical sampling to provide a reasonable estimate (unbiased guess) of the result." When you have a lot of data, sampling allows you to draw conclusions from a much smaller amount of data. That's why sampling is a scalability solution. If you don't have to process all your data to get the information you need then you've made the problem smaller and you'll need fewer resources and you'll get more timely results. Sampling is not useful when you need a complete list that matches a specific criteria. If you need to know the exact set of people who bought a car in the last week then sampling won't help. But, if you want to know many people bought a car then you could take a sample and then create estimate of the full data-set. The difference is you won't really know the exact car count. You'll have a confidence interval saying how confident you are in your estimate. We generally like exact numbers. But if running a report takes an entire day because the data set is so large, then taking a sample is an excellent way to scale.
What have bunch of applications which run on Debian servers, which processes huge amount of data stored in a shared NFS drive. we have 3 applications working as a pipeline, which process data stored in the NFS drive. The first application processes the data and store the output in some folder in the NFS drive, the second app in the pipeline process the data from the previous step and so on. The data load to the pipeline is like 1 GBytes per minute. I think the NFS drive is the bottleneck here. Would buying a specialized file server improve the performance of data read write from the disk ?
The recent Data-Intensive Computing Symposium brought together experts in system design, programming, parallel algorithms, data management, scientific applications, and information-based applications to better understand existing capabilities in the development and application of large-scale computing systems, and to explore future opportunities. Google Fellow Jeff Dean had a very interesting presentation on Handling Large Datasets at Google: Current Systems and Future Directions. He discussed: • Hardware infrastructure • Distributed systems infrastructure: –Scheduling system –GFS –BigTable –MapReduce • Challenges and Future Directions –Infrastructure that spans all datacenters –More automation It is really like a "How does Google work" presentation in ~60 slides? Check out the slides and the video!
I've been trying to find a high availability file storage solution without success. I tried GlusterFS which looks very promising but experienced problems with stability and don't want something I can't easily control and rely on. Other solutions are too complicated or have a SPOF. So I'm thinking of the following setup: Two NFS servers, a primary and a warm backup. The primary server will be rsynced with the warm backup every minute or two. I can do it so frequently as a PHP script will know which directories have changed recently from a database and only rsync those. Both servers will be NFS mounted on a cluster of web servers as /mnt/nfs-primary (sym linked as /home/websites) and /mnt/nfs-backup. I'll then use Ucarp (http://www.ucarp.org/project/ucarp) to monitor both NFS servers availability every couple of seconds and when one goes down, the Ucarp up script will be set to change the symbolic link on all web servers for the /home/websites dir from /mnt/nfs-primary to /mnt/nfs-backup The rsync script will then switch and the backup NFS will become primary and backup to the previous primary when it gets back online. Can it really be this simple or am I missing something? Just setting up a trial system now but would be interested in feedback. :) Also, I can't find out whether it's best to use NFS V3 or V4 these days?