
S3 Failed Because of Authentication Overload

Being an authentic human being is difficult, and apparently authenticating all those S3 requests can be a bit overwhelming as well. Amazon fingered a flood of processor-heavy authentication requests as the reason for their downtime:

Early this morning, at 3:30am PST, we started seeing elevated levels of authenticated requests from multiple users in one of our locations. While we carefully monitor our overall request volumes and these remained within normal ranges, we had not been monitoring the proportion of authenticated requests. Importantly, these cryptographic requests consume more resources per call than other request types.

Shortly before 4:00am PST, we began to see several other users significantly increase their volume of authenticated calls. The last of these pushed the authentication service over its maximum capacity before we could complete putting new capacity in place. In addition to processing authenticated requests, the authentication service also performs account validation on every request Amazon S3 handles. This caused Amazon S3 to be unable to process any requests in that location, beginning at 4:31am PST. By 6:48am PST, we had moved enough capacity online to resolve the issue.

Interesting problem. The same thing happens with sites that use a lot of SSL: they need to purchase specialized SSL concentrators to handle the load, which makes capacity planning a lot trickier and more expensive.
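To get a feel for why authenticated calls cost so much more per request than public ones, here's a rough Python sketch. It compares an HMAC-SHA1 signature check, in the style of S3's then-current request signing, against serving a public GET that needs no cryptography at all. The key, string-to-sign, and handler names are all hypothetical, not Amazon's actual code:

```python
import hashlib
import hmac
import timeit

SECRET_KEY = b"hypothetical-secret-key"  # illustrative only

def verify_signed_request(string_to_sign: bytes, signature: bytes) -> bool:
    # Recompute the HMAC-SHA1 over the canonical request, as S3's
    # signature scheme of the time did, and compare digests.
    expected = hmac.new(SECRET_KEY, string_to_sign, hashlib.sha1).digest()
    return hmac.compare_digest(expected, signature)

def handle_public_request(path: str) -> bool:
    # A public GET needs no cryptography -- just routing.
    return path.startswith("/")

string_to_sign = b"GET\n\n\nWed, 20 Jun 2008 12:00:00 GMT\n/bucket/key"
signature = hmac.new(SECRET_KEY, string_to_sign, hashlib.sha1).digest()

auth_cost = timeit.timeit(
    lambda: verify_signed_request(string_to_sign, signature), number=100_000)
plain_cost = timeit.timeit(
    lambda: handle_public_request("/bucket/key"), number=100_000)
print(f"authenticated: {auth_cost:.3f}s  public: {plain_cost:.3f}s")
```

The per-call gap is what made the *proportion* of authenticated requests, not overall volume, the metric that mattered.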

In the comments, Allen conjectured: "What caused the problem, however, was a sudden unexpected surge in a particular type of usage (PUTs and GETs of private files, which require cryptographic credentials, rather than GETs of public files, which require no credentials). As I understand what Kathrin said, the surge was caused by several large customers suddenly and unexpectedly increasing their usage. Perhaps they all decided to go live with a new service at around the same time, although this is not clear."

We see these kinds of bring-up problems all the time. The Skype failure was blamed on a software update that caused all nodes to re-login at the same time. Bring up a new disk storage filer and, if you aren't load balancing requests, all new storage requests will go to that new filer and you'll be down lickety-split.

Booting is one of the most stressful times on a large network. Bandwidth and CPU both become constrained, which causes a cascade of failures. ARP packets get dropped or lost and machines never get their IP addresses. Packets drop, which causes retransmissions, which chew up bandwidth, which burns CPU and causes more drops. CPUs spike, which causes timeouts and reconnects, which again spiral everything out of control.
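One common defense against this reconnect spiral is to have every node randomize its retry delay, so timeouts don't resynchronize into another stampede. A minimal sketch of full-jitter exponential backoff (the base and cap values are illustrative, not anything Amazon or Skype documented):

```python
import random

def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    # Full-jitter exponential backoff: pick a uniform delay in
    # [0, min(cap, base * 2**attempt)]. Each failed attempt widens the
    # window, and the randomness spreads reconnects out in time instead
    # of letting every node hammer the server on the same tick.
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Successive failures back off further, but never beyond the cap.
delays = [reconnect_delay(a) for a in range(8)]
print([round(d, 2) for d in delays])
```

The cap keeps long-failing nodes retrying often enough to notice recovery; the jitter is what breaks the synchronized spike.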

When I worked at a set-top company we faced the scenario of a whole neighborhood rebooting after a power outage: lots of houses needing to fetch large boot images over asymmetric, low-bandwidth cable connections. As a fix we broadcast the boot image blocks to all set-tops, so no set-top performed the typical point-to-point boot image download. Worked like a charm.
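The broadcast trick amounts to a data carousel: the head end cycles through the image's blocks forever, and a set-top that tunes in at any point just collects blocks until it has the full set, finishing within one rotation no matter when it joined. A toy version (block contents and the carousel function are made up for illustration, not the real set-top protocol):

```python
from itertools import cycle, islice

def receive_from_carousel(blocks, start_offset):
    # Simulate a set-top that tunes in mid-broadcast at start_offset.
    # It hears (index, block) pairs in carousel order and stops once
    # it has collected every block, then reassembles them by index.
    received = {}
    stream = islice(cycle(enumerate(blocks)),
                    start_offset, start_offset + len(blocks))
    for index, block in stream:
        received[index] = block
    return b"".join(received[i] for i in sorted(received))

image = [b"block0", b"block1", b"block2"]
print(receive_from_carousel(image, 2))
```

Because order doesn't matter on the receive side, one shared broadcast replaces thousands of individual downloads over the constrained upstream path.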

Amazon's problem was a subtle one in a very obscure corner of their system, so it's not surprising a weakness was found there. But I'm sure Amazon will be back even bigger and better once they get their improvements online.

Reader Comments (5)

And those who did NOT rely only on AWS, most probably you do NOT hear their vocal frustrations... and the matter of the reliability of AWS was cast waaaay back: http://web.archive.org/web/20070406174427/http://blogs.smugmug.com/don/files/ETech-SmugMug-Amazon-2007.pdf (ETech 2007 SmugMug Amazon Slides)

November 29, 1990 | Unregistered CommenterA.T.

I guess services like S3 need to realize that they have to maintain some kind of relationship with big users that goes beyond simply providing the service and sending bills. Large users should be actively encouraged, and easily able, to give prior notice of larger changes in their usage. Without that, it becomes an issue for people trying to manage going live with something that will go from 0 to 100 in a short amount of time.

November 29, 1990 | Unregistered CommenterLukas

One thing that we don't know is just how much headroom was left in their resource capacity when they noticed the problem. The ability to add capacity quickly is apparently an issue there as well, and for a cloud type of service, they need to optimize that not just for traffic spikes but for growth as well.

Snarky side note, assuming they eat their own dogfood, Amazon.com could reduce their resource load by chucking the half-page of crap every item listing has, or even dynamically adjust their site to not do certain functions when a crisis is looming, then shift resources to the cloud as needed. Assuming they eat their own dogfood, that is. The cloud itself should be dynamically configurable to throttle things back -- better a slowdown than a shutdown.

November 29, 1990 | Unregistered CommenterAnonymous

"(and Europe?" Just yes.

November 29, 1990 | Unregistered CommenterElizabeth T Mcmahon
