Check Yourself Before You Wreck Yourself - Avocado's 5 Early Stages of Architecture Evolution

In "Don’t panic! Here’s how to quickly scale your mobile apps," Mike Maelzer paints a wonderful picture of how Avocado, a mobile app for connecting couples, evolved to handle 30x traffic within a few weeks. If you are just getting started, this is a great example to learn from.

What I liked: it's well written, packing a lot of useful information into a small space; it's failure-driven, showing a process of incremental change guided by purposeful testing and production experience; it shows awareness of what's important, in their case user signup; and a replica setup was used for testing, a nice cloud benefit.

Their biggest lesson learned is a good one:

It would have been great to start the scaling process much earlier. Due to time pressure we had to make compromises – like dropping four of our media resizer boxes. While throwing more hardware at some scaling problems does work, it’s less than ideal.

Here's my gloss on the article:

Evolution One - Make it Work

When you're just starting out you want to get stuff done, pretty much regardless of how it's done. How do you decide what needs fixing? What Avocado did was purposefully test their software, look for weaknesses, fix them, and iterate. A very sensible incremental process. You may think, hey, why not jump straight to the end result? But that never really works out. There is a lot of truth to all that Agile/Lean thinking. Every large system evolves from a smaller system.

  • The first question: do you really need to scale? Why go through all this effort? Avocado was running a promotion that they expected to drive a lot more traffic, and they couldn't afford to have those potential new users have a bad experience. So dealing with large traffic spikes was a must.
  • They started with a typical one-server-per-function kind of setup: 1 frontend server (running the API, web presence, socket.io, HAProxy, and stunnel for SSL encryption); 1 Redis server; 1 MySQL server; 1 batch server (running miscellaneous daemons).
  • A replica test setup was used to run common scenarios like account creation.
  • Using smaller EC2 instances in the test environment saved money and flushed out resource-related errors.
  • Test scripts built with JMeter simulated tens of thousands of user signups per minute (see the sketch after this list).
  • Testing was directed at finding weak spots using different scenarios, like adding a second server and using all the CPU on a server. 
  • Performance crapped out with thousands of simultaneous registrations:
    • CPU spikes caused by creating new Python processes to send email verifications and to resize profile pictures. 
    • All this happened on one box. 
    • Some MySQL queries were taking a long time.
    • SSL was a major CPU load.
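
Avocado drove this load with jmeter. As a rough illustration of the same idea, here is a minimal sketch in TypeScript (Node 18+, which has a global fetch) that floods a signup endpoint with concurrent registrations. The /signup path, field names, rates, and URL are made up for illustration; they are not Avocado's actual API or numbers.

```typescript
// load-signups.ts -- a rough stand-in for a jmeter signup scenario.
// Assumes Node 18+ (global fetch) and a made-up POST /signup endpoint.

const BASE_URL = "http://localhost:8080"; // assumed test-environment URL
const USERS_PER_MINUTE = 20_000;          // target signup rate
const CONCURRENCY = 200;                  // simultaneous in-flight requests
const DURATION_MS = 60_000;               // how long to keep the pressure on

let launched = 0;
let failures = 0;

async function signup(i: number): Promise<void> {
  const res = await fetch(`${BASE_URL}/signup`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      email: `loadtest+${i}@example.com`,
      password: "correct-horse-battery",
    }),
  });
  if (res.status >= 400) failures++;
}

async function worker(): Promise<void> {
  // Each worker paces itself so the whole pool approximates the target rate
  // (crude: it ignores request latency, which a real tool like jmeter handles).
  const pauseMs = (60_000 / USERS_PER_MINUTE) * CONCURRENCY;
  const start = Date.now();
  while (Date.now() - start < DURATION_MS) {
    try {
      await signup(launched++);
    } catch {
      failures++;
    }
    await new Promise((r) => setTimeout(r, pauseMs));
  }
}

Promise.all(Array.from({ length: CONCURRENCY }, worker))
  .then(() => console.log(`launched=${launched} failures=${failures}`))
  .catch(console.error);
```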

Evolution Two - Test Driven Changes

Driven by the test findings, they:

  • Added some database indexes to speed up access. Shouldn't you already know which indexes to add when you create the tables? Isn't it obvious? Sometimes yes and sometimes no. Letting your database tell you what is slow works too, especially early in the game.
  • Rewrote email verification to use node.js's async capabilities instead of forking a separate Python process each time; throughput increased 3x-5x (see the sketch after this list). Why did they use the Python approach in the first place? It seems obviously wasteful. Maybe they had the code already? Maybe they knew how to do it in Python? And it might not have mattered on a larger machine. But this is a sensible incremental change: do what works and fix it when it doesn't. Though they still have a huge failure hole here: if a process dies, a user's registration state machine will stall.
  • SSL was moved to its own server. Again, it was fairly obvious that this could be a problem, and a larger machine could have made it a non-problem, but it's certainly easier to start on one box.
  • For the promotion four new instances were allocated for image resizing. After the promotion these were removed. A good strategy for handling expected traffic spikes.
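
The email-verification rewrite is a nice small example of the pattern. Here is a minimal before/after sketch, not Avocado's code: the "before" forks a Python script per signup (the script name is hypothetical), the "after" does the work in-process with async I/O behind a generic mailer interface that stands in for whatever SMTP client they actually used.

```typescript
// Before: each signup forked a Python process to send the verification email.
// Per-request process creation is what drove the CPU spikes under load.
import { execFile } from "node:child_process";

function sendVerificationViaPython(email: string, token: string): void {
  // "send_verification.py" is a hypothetical script name
  execFile("python", ["send_verification.py", email, token], (err) => {
    if (err) console.error("verification email failed", err);
  });
}

// After: do the work in-process with async I/O. The Mailer interface is a
// placeholder for whatever SMTP/API client is actually used.
interface Mailer {
  send(opts: { to: string; subject: string; text: string }): Promise<void>;
}

async function sendVerificationAsync(
  mailer: Mailer,
  email: string,
  token: string
): Promise<void> {
  await mailer.send({
    to: email,
    subject: "Verify your account",
    text: `Tap to verify: https://example.com/verify?token=${token}`,
  });
}
```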

Evolution Three - Production Driven Changes

As we get deeper into production experience, we can see the type of problem changing: now it's the strange ones that are very hard to predict and only seem to have sucky fixes.

  • During the promotion, monitoring alerted on slow response times, so a second front-end server was added. So have monitoring. Eventually 15 servers were added.
  • On larger EC2 instances they noticed that only one CPU was being used: node.js is a single-process solution, and HAProxy was routing to only one node.js instance per box. Restarting HAProxy to read in a new configuration to fix this took a few seconds, so they created a redundant HAProxy server to reduce downtime.
  • ELB was used to route traffic to the two HAProxy servers and handle SSL, which reduced the number of proxy boxes and saved money.
  • This created a multiple-server problem with socket.io, since a user needs to maintain one session with one server. So they went to sticky sessions, which required some finicky changes.
  • Interesting finding: WebSockets do not work on all cellular networks, and since this is a mobile app that's critical. Also, ELB does not support WebSockets. So they fell back to XHR-polling (see the transport sketch after this list). A surprising turn of fate.
  • Socket.io was moved to its own boxes, with 8 socket.io servers per box (one per CPU), for a total of 16 socket.io server instances.
  • That seems like a lot of socket servers. To reduce the load, a database connection pool was used. Initially each socket.io server had just one database connection, so work queued up waiting on the database and socket.io consumed 60% of a box's 8 CPUs. Using a pool of 2 to 10 database connections per server instance dropped CPU usage to 3% and improved response times dramatically (see the pool sketch after this list).
  • Why not use connection pools from the start? That's pretty standard. I was surprised they didn't, but there will be all kinds of these issues, the point isn't which issues they are, but that you find and work through them.
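
On the WebSockets finding: socket.io lets you restrict the allowed transports so clients never attempt a websocket upgrade at all. Here is a minimal sketch using the current socket.io server API and option names (the 2012-era API was different, so treat this purely as an illustration of the idea):

```typescript
// socket-server.ts -- restrict socket.io to long-polling, since ELB (at the
// time) and some cellular networks would not pass websocket traffic through.
import { createServer } from "node:http";
import { Server } from "socket.io";

const httpServer = createServer();
const io = new Server(httpServer, {
  transports: ["polling"], // XHR long-polling only; never upgrade to websocket
});

io.on("connection", (socket) => {
  socket.emit("hello", { connected: true });
});

httpServer.listen(3000);
```

On the client side, passing transports: ["polling"] to the socket.io-client io() call keeps the client on XHR polling as well.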
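
And on the pooling change: here is a minimal sketch of a small per-process MySQL connection pool, using the mysql2 driver as a stand-in (the driver choice, credentials, and table names are assumptions, not what Avocado ran). The point is that each socket.io server gets a handful of connections instead of serializing every query through one.

```typescript
// db.ts -- a small pool per node process instead of a single connection.
// With one connection, every query from a socket.io server queued behind the
// previous one; a handful of pooled connections removes that choke point.
import mysql from "mysql2/promise";

export const pool = mysql.createPool({
  host: "127.0.0.1",        // assumed connection details
  user: "app",
  password: "secret",
  database: "app",
  connectionLimit: 10,      // roughly the "2 to 10" range mentioned above
  waitForConnections: true, // queue briefly instead of failing when saturated
});

// Example use inside a socket.io handler: each call borrows a connection from
// the pool and returns it when the query finishes. Table and column names are
// made up.
export async function loadCouple(coupleId: number) {
  const [rows] = await pool.query(
    "SELECT id, name FROM couples WHERE id = ?",
    [coupleId]
  );
  return rows;
}
```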

Evolution Four - Going Global

  • Mobile couples exist throughout the world, so they adapted by launching servers throughout the world too, directing Amazon to route by latency. Note how crazy this is. An international launch would have been a huge deal at one time. Now it's just another thing you do.
  • Should they have gone global earlier? Mobile performance is key to adoption, after all. But that would have been crazy; you need experience with your system before you spread it across multiple locations.

Evolution Five - Now Make it Work Better

  • After handling the initial promotional load, learning how their system works, and fixing problems, now it's time to worry about automation and auto-deployment strategies. Take it slow so you can worry about one kind of problem at a time. Automating from the start would have complicated things greatly, because you'd be trying to automate something before you even know how it should work. More good incremental thinking.