How is software developed at Amazon?

How is software developed at Amazon? Get a couple of prime pizzas delivered and watch this excellent interview with Ken Exner, GM of AWS Developer Tools. It's notable Ken is from the tools group because progress in an industry is almost always made possible by the development of better tools.

The key themes from the talk: decomposition, automation, and organizing around the customer.

The key idea:

Scaling is by mitosis. Teams split apart into smaller teams that completely own a service. EC2 started as one two pizza team.

This quote nicely embodies all three of the themes and is the key reason AWS keeps on winning the public cloud. Bottom up, Amazon adaptively grows their entire organization in response to customer inputs.

If you want an example of how a complex AWS feature was developed from customer input then take a listen to Heavy Networking 433: An Insider’s Guide To AWS Transit Gateways. The AWS Transit Gateway was developed because customers asked for it...and AWS listened.

AWS is eating the world because customers keep on asking for a bigger menu.

And here's a short gloss of the talk...

  • Amazon loves decomposition. Amazon used to have a monolithic organization and software architecture (Perl/Mason/C++). They decomposed the monolith into services and decomposed the organization into two pizza teams. Teams are autonomous, independent, and have ownership. Teams own a service end-to-end. They deal with customers, development, testing, support, etc.
  • Amazon loves automation. Automate all the things. Their first tools automated the build and release process, then deployment was automated. At first it's scary that a committed change automatically flows into production, but anything you can do manually can be put in automation, so it happens the same way every single time. Every deployment goes through several different kinds of testing. It started with integration testing, followed by browser and web-based testing and load testing. They monitor and measure everything. They found they were able to push out changes more frequently and the quality was higher. They could release more and better.
  • Deployment is a pessimistic process: they constantly try to find reasons to fail a deployment, either in pre-production or in production. In production they roll out to one box in one AZ. Any problems? Roll back. Success? Fan out to the rest of the AZ, then to more AZs, and then to more regions. If a problem is found at any point, roll back to a known good state.
  • Security is managed throughout the entire process. Developers need to think like security engineers. That's part of Amazon's culture. Engineers need to be developers, operators, architects, testers and security experts. Amazon teaches developers these skills. Robert A. Heinlein would be proud. DevOps is also DevSecOps; it's about injecting security into the process.
  • When starting a new project the first thing developers work on is an architecture and a threat model. The threat model is reviewed with a security engineer. Developers own their own security because they're closest to the problem, so they're most likely to find problems. Then development is started. Code is submitted for review. Peers give feedback before commit. Static analysis is performed. Then it goes to the build, which also has static analysis. Then it goes into the release pipeline, where there are more checks. There are canary monitors that run positive and negative checks against the deployment before the code goes out.
  • Checks are built into the entire pipeline through a combination of local and globally mandated policies. If you can inspect a pipeline you can determine if it's following best practices. If you can describe best practices, people can create rules that govern the shape, structure, and contents of the pipeline. As an organizational leader you can have rules for your team, like every new commit must have 70% unit test code coverage before it can deploy. There are AWS-wide rules that cover every deployment, like you can't deploy to every region at the same time. That's a bad practice, and it can be stopped with a rule. Rules can be applied at the team level, the organizational level, and the company level. This inspection capability makes sure people can't do bad things. Pipelines have best practices baked in from years of learning. It has been very liberating. Developers don't have to make mistakes and learn the hard way. Through automation you can ensure processes are being followed every single time.
  • Developers on a team are responsible for architecture; the architecture doesn't come from architects. Once they have an architecture it's reviewed with an architect or a principal engineer. The role of a principal engineer is to review and teach, not do the architecture. Same with security. The role of a security engineer is not to create the threat model, which is the job of a developer on the team, but to review it. Same with testing. A team owns the entire process. A lot of time is spent teaching because you want developers to learn.
  • Leaders at Amazon are expected to model what's important. Operations is important at Amazon. You know that because leadership spends a lot of time on operations. For something to be taken seriously leadership must take it seriously. For example, every team must be able to present their dashboards at the weekly ops meetings. Teams must be able to explain every blip.
  • The best way to plan is bottom up. Teams closest to the product are closest to the customer. They know what the customer wants. The people closest to the customer should tell Amazon what to do. Every year there are two docs, OP1 and OP2 (Operating Plan). Every level of the organization writes a 6-page document about what they want to do next year. In the plan you say what you would do if you had flat resources and what you would do with incremental resources. These 6-page business plans are presented at every level of the organization. Managers take the 6-page docs from all the teams they manage, make their own 6-page doc, and present it to their management. This happens all the way up to Bezos. Resources then flow down to the teams.
  • The layers of management arbitrate different requests and apply judgement. The ideas still come from teams closest to the customer.  
  • Teams also have goals and they are given resources to attain those goals, which are tracked. Teams are thought of as startups and management acts as a board of directors managing their different startups by reviewing goals and metrics.
  • Teams can have specialists. They can have a mix of different skills, like a webdev, SE, PM, doc writer, marketer, etc.
  • Communication and consistency can be difficult because the teams are separate. Amazon often ends up with two of something, but it's better to have two of something than none of something. That's an accepted risk that can be fixed afterwards. It's better not to slow things down. Consistency is solved by refactoring teams: create another team and another service to handle a responsibility.
  • How do you convince another team to do something you need them to do? You must be convincing. Global initiatives are driven top down during the annual planning process. For example, if they're going into a new region, teams must plan for that.
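
To make the pessimistic deployment bullet a bit more concrete, here's a minimal Python sketch of the one-box-then-fan-out idea. None of this is Amazon's actual tooling. The stage names are paraphrased from the talk, and `staged_rollout`, `deploy`, and `healthy` are hypothetical names standing in for whatever a real pipeline would call. The only point is the shape: widen the blast radius one step at a time, and the moment a stage looks unhealthy, put everything touched so far back on the known good version.

```python
from typing import Callable, List, Sequence

# Hypothetical stage names, smallest blast radius first (paraphrased from the talk).
STAGES = ("one box in one AZ", "rest of the AZ", "more AZs", "more regions")


def staged_rollout(
    new: str,
    known_good: str,
    deploy: Callable[[str, str], None],   # assumed callback: deploy(stage, version)
    healthy: Callable[[str], bool],       # assumed callback: health check for a stage
    stages: Sequence[str] = STAGES,
) -> str:
    """Fan out stage by stage and return the version left running."""
    touched: List[str] = []
    for stage in stages:
        deploy(stage, new)
        touched.append(stage)
        if not healthy(stage):
            # Any problem fails the deployment: restore every stage touched
            # so far, most recent first, to the known good version.
            for done in reversed(touched):
                deploy(done, known_good)
            return known_good
    return new


if __name__ == "__main__":
    log = []
    result = staged_rollout(
        "v2",
        "v1",
        deploy=lambda stage, version: log.append((stage, version)),
        healthy=lambda stage: stage != "more AZs",  # simulate a failure at the third stage
    )
    print("running version:", result)
    for entry in log:
        print(entry)
```

In the simulated run the new version never reaches "more regions", which is the whole idea: a bad change is caught while only a small slice of the fleet has seen it.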

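The bullet on pipeline rules can be sketched the same way. The talk only says that a pipeline can be inspected and that rules exist at the team, organizational, and company level, so everything below is an assumption for illustration: the `Pipeline` and `Rule` shapes, the `violations` function, and the placeholder region names are all made up. The two example rules are the ones mentioned in the talk, 70% unit test coverage and no deploying to every region at the same time.

```python
from dataclasses import dataclass
from typing import Callable, List

# Placeholder region names (assumed); a real pipeline would list actual regions.
ALL_REGIONS = frozenset({"region-a", "region-b", "region-c"})


@dataclass
class Pipeline:
    unit_test_coverage: float   # fraction between 0.0 and 1.0
    waves: List[List[str]]      # regions deployed together, in rollout order


@dataclass
class Rule:
    scope: str                  # "team", "organization", or "company"
    name: str
    passes: Callable[[Pipeline], bool]


def min_coverage(threshold: float) -> Callable[[Pipeline], bool]:
    """Build a check that requires at least `threshold` unit test coverage."""
    return lambda p: p.unit_test_coverage >= threshold


def no_all_region_wave(p: Pipeline) -> bool:
    """Pass only if no single wave covers every region at once."""
    return not any(ALL_REGIONS <= set(wave) for wave in p.waves)


RULES = [
    Rule("team", "70% unit test coverage", min_coverage(0.70)),
    Rule("company", "no simultaneous deploy to every region", no_all_region_wave),
]


def violations(pipeline: Pipeline, rules: List[Rule] = RULES) -> List[str]:
    """Inspect a pipeline and list every rule it breaks."""
    return [f"{r.scope}: {r.name}" for r in rules if not r.passes(pipeline)]


if __name__ == "__main__":
    risky = Pipeline(0.65, [["region-a", "region-b", "region-c"]])
    careful = Pipeline(0.82, [["region-a"], ["region-b", "region-c"]])
    print(violations(risky))
    print(violations(careful))
```

Because a rule is just a function over a description of the pipeline, a team lead and the company can each add their own without either needing to read anyone's deployment scripts.
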
Just a note: when I hear high-level manager types explain how the software development process at their company works, I am always a little bit dubious. As a long-time individual contributor I know management often has no idea how the sausage is really made. But according to the Reddit thread listed below, people who I assume work at Amazon agree this is actually how it works. Color me impressed.

  • On Reddit
  • Site Reliability Engineering
  • mjr00: Despite what the article says, you can deploy to all regions in one day, but you require VP approval. So a critical bug could be fixed as fast as your deployment code allows. However, this is not a regular occurrence. The real fun stuff happens after you've fixed the bug: you get to dig into all the logs and metrics to explain what happened, why it happened, why it wasn't detected sooner, and how you're going to make sure it never happens again. Then you get to prepare a document, lovingly called a "correction of error" or COE, which if you're lucky, will only be looked at and approved by your director. (And they don't rubber-stamp. They will have questions.) If you're unlucky, you get to do the honor of presenting your document to Charlie Bell and Andy Jassy, who will tear it apart. Oh yeah, and the entire AWS engineering organization is in the room or watching on stream.