This is a wonderfully informative Amazon update based on Joachim Rohde's discovery of an interview with Amazon's CTO. You'll learn about how Amazon organizes their teams around services, the CAP theorem of building scalable systems, how they deploy software, and a lot more. Many new additions from the ACM Queue article have also been included.
Amazon grew from a tiny online bookstore to one of the largest stores on earth. They did it while pioneering new and interesting ways to rate, review, and recommend products. Greg Linden shared his version of Amazon's birth pangs in a series of blog articles.
AJAX, etc. Too complex. If middleware was available in smaller components, more as a tool than a framework, they would be more interested.
- Services are the independent units delivering functionality within Amazon. It's also how Amazon is organized internally in terms of teams.
- If you have a new business idea or a problem you want to solve, you form a team. Limit the team to 8-10 people because communication is hard. These are called two-pizza teams: the number of people you can feed with two pizzas.
- Teams are small. They are assigned authority and empowered to solve a problem as a service in any way they see fit.
- As an example, they created a team to find phrases within a book that are unique to the text. This team built a separate service interface for that feature and they had authority to do what they needed.
- Extensive A/B testing is used to integrate a new service. They see what the impact is and take extensive measurements.
- They create special infrastructure for managing dependencies and doing a deployment.
- The goal is to have all the right services deployed on a box. All application code, monitoring, licensing, etc. should be on a box.
- Everyone has a home grown system to solve these problems.
- Output of deployment process is a virtual machine. You can use EC2 to run them.
- Work from the customer backward. Focus on the value you want to deliver for the customer.
- Force developers to focus on value delivered to the customer instead of building technology first and then figuring out how to use it.
- Start with a press release of what features the user will see and work backwards to check that you are building something valuable.
- End up with a design that is as minimal as possible. Simplicity is the key if you really want to build large distributed systems.
- Internally they can deliver infinite storage.
- Not all that many operations are stateful. Checkout steps are stateful.
- The most-recently-clicked web page service bases its recommendations on session IDs.
- They keep track of everything anyway so it's not a matter of keeping state. There's little separate state that needs to be kept for a session. The services will already be keeping the information so you just use the services.
- Three properties of a system: consistency, availability, tolerance to network partitions.
- You can have at most two of these three properties for any shared-data system.
- Partitionability: divide nodes into small groups that can see other groups, but they can't see everyone.
- Consistency: if you write a value and then read it, you get the same value back. In a partitioned system there are windows where that's not true.
- Availability: reads and writes may not always succeed. The system may refuse a write because it wants to keep the system consistent.
- To scale you have to partition, so you are left with choosing either high consistency or high availability for a particular system. You must find the right overlap of availability and consistency.
- Choose a specific approach based on the needs of the service.
- For the checkout process you always want to honor requests to add items to a shopping cart because it's revenue producing. In this case you choose high availability. Errors are hidden from the customer and sorted out later.
- When a customer submits an order you favor consistency because several services--credit card processing, shipping and handling, reporting--are simultaneously accessing the data.
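The cart-versus-order trade-off above can be illustrated with a toy two-replica store. This is only a sketch, not Amazon's implementation: the `Replica` class, the `favor` setting, `make_pair`, `set_partition`, and the set-union merge used to "sort things out later" are all assumptions made for the example.

```python
class Unavailable(Exception):
    """Raised when a replica refuses a write to stay consistent."""


class Replica:
    """Hypothetical replica holding a set of items (cart lines or orders)."""

    def __init__(self, name, favor):
        self.name = name
        self.favor = favor          # "availability" or "consistency"
        self.items = set()
        self.peer = None
        self.can_reach_peer = True  # False while the network is partitioned

    def add(self, item):
        if not self.can_reach_peer and self.favor == "consistency":
            # Favoring consistency: refuse the write rather than diverge.
            raise Unavailable(f"{self.name}: peer unreachable, write refused")
        # Favoring availability (or no partition): always accept locally.
        self.items.add(item)
        if self.can_reach_peer:
            self.peer.items.add(item)


def make_pair(favor):
    """Create two replicas that replicate to each other."""
    a, b = Replica("a", favor), Replica("b", favor)
    a.peer, b.peer = b, a
    return a, b


def set_partition(a, b, partitioned):
    """Start or heal a partition; on heal, reconcile by set union (assumed)."""
    a.can_reach_peer = b.can_reach_peer = not partitioned
    if not partitioned:
        merged = a.items | b.items
        a.items, b.items = set(merged), set(merged)


# Shopping cart: favor availability, reconcile after the partition heals.
cart_a, cart_b = make_pair("availability")
set_partition(cart_a, cart_b, True)
cart_a.add("book")
cart_b.add("lamp")
set_partition(cart_a, cart_b, False)

# Order submission: favor consistency, so writes fail during a partition.
order_a, order_b = make_pair("consistency")
set_partition(order_a, order_b, True)
try:
    order_a.add("order-1")
    order_refused = False
except Unavailable:
    order_refused = True
```

During the partition the two cart replicas briefly disagree, which is the inconsistency window described above; the order replicas never disagree, but they turn requests away instead.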
reality, embrace it. For example, go more with a fast reboot and fast recover approach. With a decent spread of data and services you might get close to 100%. Create self-healing, self-organizing lights out operations.
Avinash Kaushik calls this getting rid of the influence of the HiPPOs, the highest paid people in the room. This is done with techniques like A/B testing and web analytics. If you have a question about what you should do, code it up, let people use it, and see which alternative gives you the results you want.
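As a rough illustration of "code it up and see which alternative wins", the sketch below assigns each user to a variant deterministically and compares conversion rates. The function names, the hash-based 50/50 split, and the event format are assumptions for the example; the source does not describe how Amazon buckets users, and a real experiment would also need a statistical significance check before declaring a winner.

```python
import hashlib


def assign_variant(user_id: str, experiment: str) -> str:
    """Deterministically place a user in variant "A" or "B" (assumed 50/50 split)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def conversion_rates(events):
    """Compute the conversion rate per variant.

    events: iterable of (variant, converted) pairs, one per user exposure.
    """
    totals = {}
    for variant, converted in events:
        shown, won = totals.get(variant, (0, 0))
        totals[variant] = (shown + 1, won + int(converted))
    return {variant: won / shown for variant, (shown, won) in totals.items()}


# Hypothetical measurements: each tuple is (variant shown, did the user convert).
sample_events = [("A", True), ("A", False), ("B", True), ("B", True)]
rates = conversion_rates(sample_events)
```

Hashing the experiment name together with the user ID keeps a user in the same variant across visits while letting different experiments split the population independently.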