S3 Failed Because of Authentication Overload

Being an authentic human being is difficult, and apparently authenticating all those S3 requests can be a bit overwhelming as well. Amazon fingered a flood of processor-heavy authentication requests as the reason for their downtime:

Early this morning, at 3:30am PST, we started seeing elevated levels of authenticated requests from multiple users in one of our locations.  While we carefully monitor our overall request volumes and these remained within normal ranges, we had not been monitoring the proportion of authenticated requests.  Importantly, these cryptographic requests consume more resources per call than other request types.

Shortly before 4:00am PST, we began to see several other users significantly increase their volume of authenticated calls.  The last of these pushed the authentication service over its maximum capacity before we could complete putting new capacity in place.  In addition to processing authenticated requests, the authentication service also performs account validation on every request Amazon S3 handles.  This caused Amazon S3 to be unable to process any requests in that location, beginning at 4:31am PST.  By 6:48am PST, we had moved enough capacity online to resolve the issue.

Interesting problem. The same thing happens with sites that do a lot of SSL. They need to purchase specialized SSL concentrators to handle the load, which makes capacity planning a lot trickier and more expensive.
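To get a feel for why authenticated calls cost more per request, here's a minimal sketch of HMAC-based request signing in the spirit of S3's classic Signature Version 2 scheme. The field layout is simplified and the key, bucket, and object names are made up; the point is just that every signed call makes the service recompute and compare a cryptographic digest on top of its normal work.

```python
import base64
import hashlib
import hmac
from datetime import datetime, timezone

SECRET_KEY = b"example-secret-key"  # hypothetical credential


def sign_request(verb, resource, date_str, secret=SECRET_KEY):
    """Build a simplified S3-style string-to-sign and HMAC it.

    Real Signature Version 2 also folds in Content-MD5, Content-Type
    and the canonicalized x-amz-* headers; this sketch keeps only the
    pieces needed to show the per-request crypto cost.
    """
    string_to_sign = f"{verb}\n\n\n{date_str}\n{resource}"
    digest = hmac.new(secret, string_to_sign.encode("utf-8"), hashlib.sha1)
    return base64.b64encode(digest.digest()).decode("ascii")


def verify_request(verb, resource, date_str, signature, secret=SECRET_KEY):
    """Server side: recompute the HMAC and compare in constant time."""
    expected = sign_request(verb, resource, date_str, secret)
    return hmac.compare_digest(expected, signature)


if __name__ == "__main__":
    now = datetime.now(timezone.utc).strftime("%a, %d %b %Y %H:%M:%S GMT")
    sig = sign_request("GET", "/example-bucket/private-object", now)
    print("Authorization: AWS EXAMPLEKEYID:" + sig)
    print("verified:", verify_request("GET", "/example-bucket/private-object", now, sig))
```

A plain GET of a public object skips all of this, which is why a shift in the mix toward authenticated traffic can overload a service even while total request volume looks normal.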

In the comments, Allen conjectured: "What caused the problem, however, was a sudden unexpected surge in a particular type of usage (PUTs and GETs of private files, which require cryptographic credentials, rather than GETs of public files that require no credentials). As I understand what Kathrin said, the surge was caused by several large customers suddenly and unexpectedly increasing their usage. Perhaps they all decided to go live with a new service at around the same time, although this is not clear."

We see these kinds of bring-up problems all the time. The Skype failure was blamed on a software update that caused all nodes to log back in at the same time. Bring up a new disk storage filer and, if you aren't load balancing requests, all new storage requests will go to that new filer and you'll be down lickety-split.
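One common mitigation for the new-filer case is to "slow start" a freshly added node rather than letting it soak up every new request the moment it appears. Here's a rough sketch of that idea; the Backend class, the warmup window, and the filer names are all made up for illustration.

```python
import random
import time


class Backend:
    def __init__(self, name, added_at, warmup_secs=600):
        self.name = name
        self.added_at = added_at
        self.warmup_secs = warmup_secs  # hypothetical ramp-up window

    def weight(self, now):
        """Ramp a freshly added node from a trickle to full weight over
        the warmup window so it doesn't absorb every new request at once."""
        age = now - self.added_at
        return max(0.05, min(1.0, age / self.warmup_secs))


def pick_backend(backends, now=None):
    """Weighted random choice; a brand-new filer only sees a small share
    of traffic until it has warmed up."""
    now = time.time() if now is None else now
    weights = [b.weight(now) for b in backends]
    return random.choices(backends, weights=weights, k=1)[0]


if __name__ == "__main__":
    t0 = time.time()
    pool = [Backend("filer-old-1", t0 - 86400),
            Backend("filer-old-2", t0 - 86400),
            Backend("filer-new", t0)]  # just brought up
    picks = [pick_backend(pool).name for _ in range(10000)]
    for b in pool:
        print(b.name, picks.count(b.name))
```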

Booting is one of the most stressful times on a large network. Bandwidth and CPU both become scarce, which causes a cascade of failures. DHCP and ARP packets get dropped or lost, so machines never get their IP addresses or can't find their neighbors. Dropped packets cause retransmissions, which chew up bandwidth, which burns CPU, which causes more drops. CPUs spike, which causes timeouts and reconnects, which again spirals everything out of control.
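The standard defense against that kind of retry spiral is exponential backoff with random jitter, so a crowd of clients that failed at the same moment doesn't retry at the same moment too. A minimal sketch, where connect() is a placeholder for whatever operation is being retried:

```python
import random
import time


def connect():
    """Placeholder for the operation being retried
    (DHCP request, login, RPC, boot-image fetch, ...)."""
    raise ConnectionError("still overloaded")


def connect_with_backoff(max_attempts=8, base_delay=1.0, max_delay=60.0):
    """Retry with exponential backoff plus full jitter, spreading out
    retries from clients that all failed at the same instant."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```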

When I worked at a set-top company we had the scenario of a whole neighborhood rebooting after a power outage: lots of houses needing to boot large boot images over asymmetric, low-bandwidth cable connections. As a fix we broadcast boot image blocks to all set-tops, so no set-top had to perform its own individual boot image download. Worked like a charm.
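The trick generalizes: instead of N boxes pulling the same image N times over a thin pipe, the head end sends each block once and everyone listens. Here's a toy sketch of the sender side using UDP multicast; the group address, port, block size, and image filename are arbitrary placeholders.

```python
import socket
import struct

MULTICAST_GROUP = "239.1.2.3"   # arbitrary site-local multicast address
PORT = 5007
BLOCK_SIZE = 1024               # payload bytes per datagram


def broadcast_image(path, group=MULTICAST_GROUP, port=PORT):
    """Send a boot image as numbered blocks to a multicast group, so
    every listening set-top receives each block from a single send."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    with open(path, "rb") as img:
        seq = 0
        while True:
            block = img.read(BLOCK_SIZE)
            if not block:
                break
            # A 4-byte sequence number lets receivers detect gaps and
            # wait for the next broadcast cycle to fill them in.
            sock.sendto(struct.pack("!I", seq) + block, (group, port))
            seq += 1
    sock.close()


if __name__ == "__main__":
    broadcast_image("boot.img")  # hypothetical image file
```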

Amazon's problem was a subtle one in a very obscure corner of their system. It's not surprising they found a weakness. But I'm sure Amazon will be back even bigger and better once they get their improvements online.