I have dozens if not hundreds of half-finished articles and snippets of ideas in the haunted house that is my Google Docs. Walking the house around midnight, with the lights turned off, of course, I stumbled upon one ghost that has been haunting me since 2012. It is time to perform the ritual of exorcism by just publishing something.
Harper Reed and Dylan Richard headed up the technology effort for the 2012 Obama for America campaign. Around that time they were getting a lot of press. One of the things that interested me was how they held Gameday test events, where they would simulate failure modes in their testing environments. Google calls these DiRT (Disaster Recovery Testing) exercises.
So I asked Harper and Dylan what these exercises actually were and they were kind enough to reply. And I apparently forgot all about it. My apologies. Better late than never? Yeah, let's go with that.
Here are some of the failure testing scenarios carried out by the Obama for America team:
- Flush memcache
- Kill memcache (null route on instances)
- Kill replicants (we used security groups to deny access)
- Kill master
- Kill the backing API (we had a heavy SOA)
- Put API in read-only (killing master should accomplish this - but this tests client apps explicitly)
- Kill SQS (we used it heavily, particularly for decoupled systems and fallbacks)
- Emulate an EBS failure (kill all DBs [we used RDS], kill all EBS backed instances)
- Emulate full east coast failure (we had a two-stage failover plan to the west coast: first fail to a read-only mode, which we could do easily, then fail over permanently, which would only happen in the case of extended east coast AWS unavailability)
- Emulate human error (claim to have done something [scale up, restart a DB, flush the cache, bounce the wsgi proc, etc] but don't actually do it)
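To make the first two scenarios concrete, here is a minimal sketch of what a "flush memcache" and "kill memcache" drill checks for: that the client keeps serving correct data from the backing store when the cache is emptied or unreachable. It is not the campaign's code. `FakeCache`, `CacheDown`, `Client`, and `run_gameday` are hypothetical names invented for this illustration, and a plain dict stands in for the database.

```python
class CacheDown(Exception):
    """Raised when the simulated cache is unreachable (hypothetical)."""


class FakeCache:
    """In-memory stand-in for memcache with two injectable faults."""

    def __init__(self):
        self._data = {}
        self.killed = False

    def flush(self):
        # "Flush memcache": drop every entry but keep the service up.
        self._data.clear()

    def kill(self):
        # "Kill memcache": every call now fails, like a null-routed instance.
        self.killed = True

    def get(self, key):
        if self.killed:
            raise CacheDown(key)
        return self._data.get(key)

    def set(self, key, value):
        if self.killed:
            raise CacheDown(key)
        self._data[key] = value


class Client:
    """Read-through client that treats the cache as optional."""

    def __init__(self, cache, database):
        self.cache = cache
        self.database = database
        self.db_reads = 0

    def read(self, key):
        try:
            value = self.cache.get(key)
        except CacheDown:
            value = None  # degrade to the database instead of erroring
        if value is not None:
            return value
        self.db_reads += 1
        value = self.database[key]
        try:
            self.cache.set(key, value)
        except CacheDown:
            pass  # a dead cache must not break the read path
        return value


def run_gameday():
    """Warm the cache, inject each fault, and record what the client saw."""
    client = Client(FakeCache(), {"user:1": "Ada"})
    results = {}
    client.read("user:1")  # cache miss, goes to the database
    client.read("user:1")  # cache hit, no database read
    results["warm_db_reads"] = client.db_reads
    client.cache.flush()
    results["after_flush"] = client.read("user:1")
    client.cache.kill()
    results["after_kill"] = client.read("user:1")
    results["total_db_reads"] = client.db_reads
    return results


if __name__ == "__main__":
    print(run_gameday())
```

In a real drill the fault would be injected at the infrastructure level, as the list describes, and the thing being watched would be the extra load on the database and whether the client apps stay up. The sketch only shows the behavior being tested for.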
Now there's one less ghost haunting the halls.