Second Hand Seizure: A New Cause of Site Death

Like a digital SWAT team breaking down the wrong door on a raid, the FBI seized multiple racks of computers from DigitalOne. Those racks hosted websites from many clients that just happened to share space with whomever the FBI was investigating. Downed sites include Instapaper, Curbed Network, and Pinboard. At today's server densities, many thousands of sites could easily have been affected.

Sites like Pinboard were victims by association; they did not inhale. It's an association sites have no control over. On a shared hosting service, you have no control over your fellow VM mates. In a cloud or a managed service, you have no control over which racks your servers are in. So, like second hand smoke, you get the disease by random association. There's something inherently unfair about that.

A comment by illumin8 shows just how Darth insidious this process can be:

A popular method used by hackers is to sign up for a virtual server with a stolen credit card. If they are careful and only access it through a proxy, their hacking attempts are virtually untraceable. With the amount of hacking going on lately by Lulzsec and other groups, there is bound to be a lot of collateral damage.

A New Disaster Scenario - Law Enforcement

We are used to data center downtime for more prosaic reasons like natural disasters, power failures, and infrastructure meltdowns. It looks like we'll have to add search and seizure to our list of disaster planning scenarios. As crime in the digital realm can only increase, what is a small risk now will surely grow apace.

How Would You Do It if You Were the FBI?

It's understandably tempting to jump immediately to the civil rights card, but is it that simple?

There's an interesting angle here from mikem, Founder and CEO of M5Hosting.com, who implies the FBI resorts to these measures when a hosting service doesn't cooperate fully. What are your host's policies for working with the FBI? Do you really want them to fight tooth and nail when it would mean taking down so many unrelated sites?

You can imagine the problem the FBI has. How would you begin analyzing a complex, distributed thing like a website? To get the servers working in an external lab, it would make sense to take the entire rack. The rack has the power, KVM, network, switches, and so on, so taking the whole thing is the most effective strategy.

How Bad Can It Get?

How far does this go? Would an entire SAN be taken? If data were replicated across multiple data centers, would all the data centers be impacted? If data were replicated across multiple SANs, would all the SANs be taken? If you had an architecture that could fail over to another data center, would that be allowed to run? If you ran 1,000 servers on AWS with local storage, would they take all 1,000 physical machines, wiping out the local storage of your VM mates in the process? What if they transitively extend the seizure to all shared resources, like virtual drives and network equipment?

The implications are as unclear as they are scary.

How Pinboard Is Dealing With It

Pinboard is one of the sites affected by the seizure, and there's a good Hacker News thread discussing their professional response.

  • Pinboard is not down. They are running on a smaller backup server. Services like the API, search, and feeds have been turned off, and the main DB server is unreachable. So the site is in a highly degraded state, but it hasn't disappeared.
  • Bookmarks that are being added will not be lost; this is an essential capability.
  • This was not a restoration from an offsite backup. It was a normal failover. The failover machines are in the same data center, but were not seized.

For more information on their architecture see Pinboard.In Architecture - Pay To Play To Keep A System Small. It's clear Pinboard tries hard to do a quality job for their customers in a cost-efficient way. That's evident in how gracefully they are handling this outage.
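As a rough sketch of what that kind of degraded failover can look like, assuming hypothetical hosts, ports, and feature names rather than Pinboard's actual setup, the logic is: probe the primary database, and when it's unreachable switch to the backup while shedding every feature that isn't the essential write path.

```python
import socket

# Hypothetical endpoints; Pinboard's real topology isn't public at this level of detail.
PRIMARY_DB = ("db-primary.example.com", 3306)
BACKUP_DB = ("db-backup.example.com", 3306)

# Nonessential features to shed when running on the smaller backup box.
DEGRADED_FEATURES_OFF = {"api", "search", "feeds"}


def reachable(host_port, timeout=2.0):
    """Crude TCP reachability check for a database endpoint."""
    try:
        with socket.create_connection(host_port, timeout=timeout):
            return True
    except OSError:
        return False


def choose_backend():
    """Pick a database and decide which features to turn off.

    Full service on the primary; on failover, keep the essential write
    path (saving bookmarks) and shed everything else.
    """
    if reachable(PRIMARY_DB):
        return PRIMARY_DB, set()
    return BACKUP_DB, DEGRADED_FEATURES_OFF
```

The point is that the failover target and the degraded feature set are decided ahead of time, which is why the site degraded instead of disappearing.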

What Can You Do?

There's really no difference between this use case and any other data center failure scenario: stuff can fail at any time, and this is just one more way it can happen. Handle it.

One possible difference is that a legal disaster could potentially do more damage than any natural disaster. An agency could simultaneously roll up every server, every copy of all your data, wherever it is (in the US at least). There's no handling that. It would be a true death.

Some options:

  • The Big List Of Articles On The Amazon Outage - many good articles on how to architect robust services. The usual sort of stuff: no single points of failure, diversity, resilience, replication, redundancy, monitoring. Make sure nodes are in multiple racks and, if possible, multiple availability zones. Make sure to back up and to test the backups. Make sure to have a failure plan and to test that plan.
  • Netflix: Continually Test By Failing Servers With Chaos Monkey - how to test your Second Hand Seizure remedies. A minimal sketch of the idea follows this list.
  • To guard against a true death scenario, think about true geographical and legal diversity: design active sites across country boundaries, using some sort of global load balancing to handle failures (a rough DNS failover sketch also follows the list). Given that governments cooperate, and given the recent weaknesses of the DNS system, this is not a slam dunk, but it would reduce the risk, at considerable expense.
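Chaos Monkey itself is Netflix's internal tool, but the core idea is easy to approximate. Here's a minimal sketch using boto3, assuming an EC2 fleet identified by a tag; the tag, region, and dry-run default are placeholders, not anything Netflix ships:

```python
import random

import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is a placeholder


def running_instances(tag_key, tag_value):
    """Return the ids of running instances carrying the given tag."""
    resp = ec2.describe_instances(Filters=[
        {"Name": f"tag:{tag_key}", "Values": [tag_value]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ])
    return [i["InstanceId"]
            for r in resp["Reservations"]
            for i in r["Instances"]]


def kill_one(tag_key="role", tag_value="web", dry_run=True):
    """Terminate one randomly chosen instance, Chaos Monkey style."""
    victims = running_instances(tag_key, tag_value)
    if not victims:
        return None
    victim = random.choice(victims)
    try:
        ec2.terminate_instances(InstanceIds=[victim], DryRun=dry_run)
    except ClientError as e:
        # With DryRun=True, AWS signals "this would have worked" by raising.
        if e.response["Error"]["Code"] != "DryRunOperation":
            raise
    return victim
```

Run something like this on a schedule during business hours and treat every surprise as a bug in the architecture, not in the monkey.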
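For the cross-border option, one hedged sketch of DNS-based failover uses a primary/secondary record pair tied to a health check, shown here with Route 53 via boto3; the hosted zone id, domain, IPs, and health check path are all placeholders:

```python
import boto3

route53 = boto3.client("route53")

ZONE_ID = "Z_EXAMPLE"            # placeholder hosted zone
PRIMARY_IP = "203.0.113.10"      # e.g. a US data center
SECONDARY_IP = "198.51.100.20"   # e.g. an EU data center

# Health check against the primary site; if it fails, DNS answers flip to the secondary.
hc = route53.create_health_check(
    CallerReference="primary-www-check-1",
    HealthCheckConfig={
        "IPAddress": PRIMARY_IP,
        "Port": 443,
        "Type": "HTTPS",
        "ResourcePath": "/health",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)

route53.change_resource_record_sets(
    HostedZoneId=ZONE_ID,
    ChangeBatch={"Changes": [
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": "www.example.com.", "Type": "A",
            "SetIdentifier": "us-primary", "Failover": "PRIMARY",
            "TTL": 60, "ResourceRecords": [{"Value": PRIMARY_IP}],
            "HealthCheckId": hc["HealthCheck"]["Id"],
        }},
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": "www.example.com.", "Type": "A",
            "SetIdentifier": "eu-secondary", "Failover": "SECONDARY",
            "TTL": 60, "ResourceRecords": [{"Value": SECONDARY_IP}],
        }},
    ]},
)
```

A low TTL keeps failover reasonably quick, but DNS is itself a shared dependency, so this reduces the risk rather than eliminating it.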