Khan Academy Checkbook Scaling to 6 Million Users a Month on GAE

Khan Academy is a nonprofit company started by Salman Khan with the Big Hairy Audacious Goal of providing a free, world-class education to anyone, anywhere, anytime. That’s a lot of knowledge. Having long been inspired and captivated by the Khan Academy, I was really curious to know how they plan to do it. Ben Kamens, lead developer at Khan Academy, gives the somewhat surprising answer in an interview: How to Scale your Startup to Millions of Users.

The short answer: develop a strong team, focus on features, let Google App Engine do the heavy lifting.

Some people seem to be turned off by all the GAE love in the interview. Part of it is that the interviewer is Fred Sauer, Developer Advocate for Google App Engine, so there’s a level of familiarity between the two. But the biggest part is simply that they really like GAE, for all the reasons you are supposed to like GAE. And that’s OK. In this day and age you are free to love whichever platform you choose.

Biggest surprise:

  • A profile on 60 Minutes drove more traffic than TechCrunch, Hacker News, and everything else combined. Old media is not dead.

Part I liked the best:

GAE is an abstraction over all the typical scalability issues, and that lets you focus on business problems. All abstractions leak, and you are going to have to deal with problems no matter what you choose, but you are choosing the type of problems you want to deal with by the platform you select. It's all about understanding the tradeoffs you're making.

Here’s my gloss on the major takeaways from the interview:

  • Khan Academy has about 6 million users a month, perhaps 15 million registered users. For comparison, Coursera has 6 million users total, but I’m not sure how active those users are.

  • Evolution:

    • Initially videos were created in support of learning how to solve math problems. Those videos were hosted on YouTube. No scaling problems with YouTube.

    • But YouTube is not interactive, and a big part of what they want to do is present a learning tree to users, give quizzes, manage results, etc. So they first tried a self-hosted, Java-based site, which was crushed under the load.

    • Then the switch was made to GAE. Khan Academy is different from a lot of startups because it already had a ready made audience built on YouTube. Hundreds of thousands of users were switched over from YouTube to GAE. They slowly built the foundation of a non-YouTube site that could handle new users. Bootstrapping customers on YouTube could be a good general strategy.

    • There was lots of press and the whole thing snowballed; it just kept growing and growing.

  • Why GAE?

    • It doesn’t make sense for any startup not to be on some provider like GAE or EC2. They help solve a lot of scalability problems that you don’t want to have to solve. A no-brainer. If you get a lot of attention you can throw more money at it and scale immediately.

    • Spectrum: how much control do you want vs how much scalability do you want out of the box? The more control you get the more likely you are to shoot yourself in the foot and do something wrong.

    • GAE gives you scalability out of the box. As long as your credit card is plugged in you won’t have many scalability problems. Which brought up the point for me: how do they afford GAE? Nonprofit economics, interestingly enough, may enable a shift away from focussing on costs per new customer acquisition and worrying about monetization strategies.

    • With AWS you have to be more on your game, know how to deal with instances, it takes more time.

    • They don’t carry pagers, don’t worry about replication, don’t have to restart instances, and don’t apply OS patches. There are a lot of things they don’t have to do with GAE.

    • Dylan Vassallo, a summer intern at Khan Academy, worked on various App Engine projects and made an excellent comment in a Hacker News thread on why they like GAE: I'm a summer intern at Khan Academy working on various App Engine projects. My uneducated guess is that we're one of the platform's bigger customers. One of GAE's biggest downsides is lack of control: unlike EC2 where you get to mold vanilla Linux installations to suit your needs, your application must fit GAE's service model through and through. But by giving up some of those freedoms you receive a whole lot of awesome in return. Khan Academy has successfully leveraged App Engine to scale to millions of users without hiring a single sysadmin or spending too much time worrying about anything ops-related. We're able to handle traffic spikes like our 60 Minutes appearance and the launch of our new computer science curriculum (http://khanacademy.org/cs) with no sweat. To deploy the site, any developer (or our friendly CI bot) can simply run our "deploy.py" and wait a few minutes, then get back to spending time on the product. We haven't had to think once about whether or not the database can handle the write load we throw at it; the App Engine Datastore is uniquely worry-free in that regard. (Well, I'm sure Google SREs worry about it plenty, but we don't have to.)

  • Why old media is still relevant. All traffic combined from TechCrunch, Hacker News, etc. was nothing compared to the traffic driven by 60 Minutes. Not even close.

    • They knew they were going to get a lot of traffic so they were prepared to turn the right dials that would start new instances quickly. That’s all they did. Which was a mistake.

    • They ended up doing a lot of work for these new users that didn’t need to be done. They needed to simplify the experience. For example, they thought with all this new traffic it would be a good time to run, for the first time, an A/B test with their new testing infrastructure to track all this delicious data. It failed.

    • Just make sure the main experience is clean and simple. If you have to track data then make sure it’s really important and uses a well tested data tracking solution.

    • Don’t run new code for a one time uptick in traffic. You only get it one time. If half the nation can’t access the site because you are trying to track too much data then it’s not worth it.

    • What they do now is make the home page as static, simple, and fast-loading as they can. If you branch off the home page, you get the whole experience. Simplify for the incoming traffic.

  • Didn’t do enough load testing. 60 Minutes drove a lot of traffic and they hadn’t tested to that level. But they have consistently high traffic loads that are spiky throughout the day as users in different time zones come online. Traffic is also seasonal. They get a lot of useful stress testing from their natural traffic patterns. They have to handle spikes, valleys, months where there’s less traffic, and then months where there’s more traffic. So they know they have to configure their code in GAE to handle that spike.

  • Offload as much work as you can to somebody else, which is GAE for them. Huge win. At 9 or 10 employees they felt it was no longer acceptable to patch over performance problems and hope it went OK next time. They dedicated people to performance. They built dashboards to predict problems ahead of time. They tried to be more active about performance and scalability issues.

  • At a group size of 2 to 5 people, don’t focus on scalability issues. Focus on features. Make your product great. Get people coming back to it. Then you can over-engineer all you want once you have a successful business. Worry about problems when you have a chance to build a successful product.

  • Performance is everyone's job and everyone is assigned to it. Try to create a general awareness of performance through the entire team.

    • You’ll have a few big-win performance problems; after that it’s a game of repeated smaller improvements of .25% or .5% at a time that add up, so it has to be everyone’s job. People can’t just throw in extra JavaScript or make new database queries; it will just slow your site down.

    • They have quarters where a one or two person team focuses on performance. It could be performance improvements or tools to become more aware of performance.

  • Collecting a lot of data. They want to report data to teachers and other people. Their old data storage system couldn’t summarize all that data. So they’ve worked on:

    • rolling up data

    • doing more work on writes so reads are fast. Rearchitecting data. Copy it everywhere. Make it really fast to analyze later.

    • Looks like for analytics they might be going outside GAE, but that wasn’t specifically said.

  • Don’t obsess over the hard problems, like rearchitecting features on the fly, until there’s a lot of demand.

  • They are small enough that everyone is responsible for DevOps, keeping the system up. In the future there may be enough systems outside of GAE that they will have separate teams, but they aren’t there yet.

  • Keep things simple both technically and in the product until you know absolutely what you need to build. A lot of the features they’ve built to solve an edge case turned out to cause real problems in the long run. It’s hard to turn things off once you build them.

  • Be lazy. Use other people’s tools. Until your business is doing really well don’t be afraid to use other people’s work.
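
To make the "rolling up data" and "doing more work on writes so reads are fast" takeaways above a little more concrete, here is a minimal Python sketch of write-time rollups. It is not Khan Academy's code and it does not use the App Engine Datastore; the `ProgressStore` class, its method names, and the student/exercise fields are all hypothetical, and the storage is just in-memory Python structures. The only point is the shape of the tradeoff: each write updates a precomputed summary, so a report read is a single lookup instead of a scan over raw events.

```python
from collections import defaultdict


class ProgressStore:
    """Hypothetical in-memory store: raw events plus write-time rollups."""

    def __init__(self):
        # Raw event log, kept so it can still be analyzed later.
        self.events = []
        # Rollup keyed by (student, exercise) -> [attempts, correct answers].
        self.rollups = defaultdict(lambda: [0, 0])

    def record_attempt(self, student, exercise, correct):
        # Write path does the extra work: store the raw event
        # and update the precomputed summary in the same call.
        self.events.append((student, exercise, correct))
        summary = self.rollups[(student, exercise)]
        summary[0] += 1
        summary[1] += int(correct)

    def accuracy(self, student, exercise):
        # Read path is one dictionary lookup; it never scans self.events.
        attempts, correct = self.rollups.get((student, exercise), (0, 0))
        return correct / attempts if attempts else 0.0
```

In use, a caller would invoke `record_attempt` once per answered question and `accuracy` whenever a teacher report is rendered. The cost is paid on every write and in duplicated data, which is the tradeoff the interview describes accepting in exchange for fast reads.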