Tuesday, August 9, 2016

10 Gameday Failure Testing Scenarios from Obama for America

I have dozens if not hundreds of half-finished articles and snippets of ideas in the haunted house that is my Google Docs. Walking the house around midnight, with the lights turned off, of course, I stumbled upon one ghost that has been haunting me since 2012. It is time to perform the ritual of exorcism by just publishing something.

You may or may not remember Obama for America, which in 2012 had a staff of 120 people who built and maintained the infrastructure that helped get out the vote for Obama.

Harper Reed and Dylan Richard headed up the effort. Around that time they were getting a lot of press. One of the things that interested me was how they held Gameday test events, where they would simulate failure modes in their testing environments. Google calls these DiRT (Disaster Recovery Testing) exercises.

So I asked Harper and Dylan what these exercises actually were and they were kind enough to reply. And I apparently forgot all about it. My apologies. Better late than never? Yah, let's go with that.

Here are some of the failure testing scenarios carried out by the Obama for America team:

  1. Flush memcache
  2. Kill memcache (null route on instances)
  3. Kill replicants (we used security groups to deny access)
  4. Kill master
  5. Kill the backing API (we had a heavy SOA)
  6. Put API in read-only (killing master should accomplish this - but this tests client apps explicitly)
  7. Kill SQS (we used it heavily, particularly for decoupled systems and fall backs)
  8. Emulate an EBS failure (kill all DBs [we used RDS], kill all EBS backed instances)
  9. Emulate full east coast failure (we had a two-stage failover plan to the west coast: first fail to a read-only mode, which we could do easily, then fail over permanently, which would only happen in the case of extended east coast AWS unavailability)
  10. Emulate human error (claim to have done something [scale up, restart a DB, flush the cache, bounce the wsgi proc, etc] but don't actually do it) 
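
To make a few of these scenarios concrete, here is a minimal sketch of what a scripted gameday check could look like. It is not the Obama for America tooling. `FakeCache`, `FakeApi`, `lookup`, and `run_gameday` are all hypothetical names invented for illustration, and the "failures" are simulated in-process rather than with null routes or security groups. It covers scenario 1 (flush the cache), scenario 2 (kill the cache), and scenario 6 (API in read-only mode), and checks that a client still gets an answer in each case.

```python
class CacheDown(Exception):
    """Raised by the fake cache once a scenario has 'killed' it."""


class FakeCache:
    """Hypothetical stand-in for memcache, with a kill switch."""

    def __init__(self):
        self.data = {}
        self.alive = True

    def get(self, key):
        if not self.alive:
            raise CacheDown("cache unreachable")
        return self.data.get(key)

    def set(self, key, value):
        if not self.alive:
            raise CacheDown("cache unreachable")
        self.data[key] = value

    def flush(self):
        self.data.clear()


class FakeApi:
    """Hypothetical stand-in for the backing API, with a read-only switch."""

    def __init__(self, records):
        self.records = dict(records)
        self.read_only = False

    def read(self, key):
        return self.records[key]

    def write(self, key, value):
        if self.read_only:
            raise PermissionError("API is in read-only mode")
        self.records[key] = value


def lookup(cache, api, key):
    """Read-through lookup that treats a cache outage like a cache miss."""
    try:
        value = cache.get(key)
    except CacheDown:
        return api.read(key)
    if value is None:
        value = api.read(key)
        try:
            cache.set(key, value)
        except CacheDown:
            pass  # serving the request matters more than caching it
    return value


def run_gameday():
    """Run three simulated scenarios; True means the client coped."""
    cache = FakeCache()
    api = FakeApi({"precinct:1": "open"})
    results = {}

    lookup(cache, api, "precinct:1")  # warm the cache first

    cache.flush()  # scenario 1: flush the cache
    results["flush_cache"] = lookup(cache, api, "precinct:1") == "open"

    cache.alive = False  # scenario 2: kill the cache
    results["kill_cache"] = lookup(cache, api, "precinct:1") == "open"

    api.read_only = True  # scenario 6: API in read-only mode
    try:
        api.write("precinct:1", "closed")
        results["read_only_api"] = False  # the write should have been refused
    except PermissionError:
        results["read_only_api"] = lookup(cache, api, "precinct:1") == "open"

    return results


if __name__ == "__main__":
    for scenario, survived in run_gameday().items():
        print(scenario, "ok" if survived else "FAILED")
```

In a real exercise the injection step would happen against actual infrastructure and the check would be whether users and dashboards notice. The point of the sketch is only the shape: inject one failure, then assert on what the client sees.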

Now there's one less ghost haunting the halls.
