Most US readers have probably seen news reports of the huge storm that affected the US, from the Midwest to the mid-Atlantic states, beginning Friday afternoon, June 29, and extending into the early hours of Saturday, June 30. The storm, a long-lived line of severe thunderstorms and high winds of a type known as a derecho, traveled generally eastward at about 95 km/h (60 mph), and packed wind gusts in excess of 70 mph when it reached us here in the Washington DC metro area. I can personally testify that it was impressively noisy.
We were very fortunate that, in our immediate area of Northern Virginia, west of Washington DC, we had only minimal storm damage, and have not (touch wood!) had any significant power outages. That last is very good news when daytime temperatures of 100 F (38 C) make air conditioning close to a necessity. Many other folks nearby were not so lucky. As always with thunderstorms, though, the effects varied widely over a relatively small geographic area. Amazon has a data center a handful of miles from here, which lost power during the storm, and, owing to a chain of problems, was offline for several hours, affecting some of Amazon’s “cloud” customers, including Instagram and Netflix. Many people in the affected areas are still, four days later, without electricity.
The folks over at the SANS Internet Storm Center have posted a diary entry with an anonymous report from a sysadmin at an affected data center. The story is, I think, well worth reading, because, at least in my experience, nothing in it sounds very much out of the ordinary for this kind of event.
The account offers some lessons relearned, and possibly a few new ones. It's only a few paragraphs long, so I urge you to read it in full; I would emphasize these points:
- Mother Nature is not obliged to follow the script in your emergency/disaster plans.
- Having good solutions to the technical problems posed by a disaster is a Very Good Thing. However, low- or no-tech problems can be as bad or worse (e.g., how do you get key people on site if there are problems with the roads?).
- Emergency recovery procedures and data (e.g., critical configuration parameters and passwords) need to be documented and accessible, even when systems are not available. (Paper still has its uses.)
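To make the last point concrete, here is a minimal sketch of the idea of keeping recovery data usable when systems are down: a small script that gathers critical files into a single dated plain-text snapshot you could print and keep in a binder. The file names and contents below are purely illustrative stand-ins, not a recommendation of what any particular site should capture.

```python
#!/usr/bin/env python3
"""Sketch: snapshot critical configuration into one printable text file.

All paths and contents here are hypothetical examples, chosen only so the
script runs as-is; a real site would list its own critical files.
"""
from datetime import datetime, timezone
from pathlib import Path

# Illustrative stand-ins for files you would want on paper during an outage.
CRITICAL_FILES = [Path("network.conf"), Path("ups-contacts.txt")]


def build_runbook_snapshot(files, out_path):
    """Concatenate each critical file, under a dated header, into out_path."""
    lines = [f"RUNBOOK SNAPSHOT  {datetime.now(timezone.utc):%Y-%m-%d %H:%M UTC}"]
    for f in files:
        lines.append(f"\n===== {f} =====")
        # Flag missing files loudly rather than silently omitting them.
        lines.append(f.read_text() if f.exists() else "** MISSING: investigate **")
    Path(out_path).write_text("\n".join(lines) + "\n")
    return out_path


if __name__ == "__main__":
    # Create the illustrative inputs so the sketch is self-contained.
    Path("network.conf").write_text("router0: 192.0.2.1\n")
    Path("ups-contacts.txt").write_text("Facilities on-call: (example number)\n")
    build_runbook_snapshot(CRITICAL_FILES, "runbook-snapshot.txt")
    print(Path("runbook-snapshot.txt").read_text())
```

Run periodically (e.g., from cron) and print the result; the point is simply that the snapshot exists somewhere that does not depend on the systems it describes.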
One of the things that make planning for this kind of event difficult is the habits of mind we acquire, largely unconsciously. Under ordinary circumstances, for example, electrical power from the grid is just there; since I can drive from my house to the data center in ten minutes, 30 minutes of UPS power seems ample. In this way, the problem resembles debugging, or security evaluation. In looking at potential solutions to a problem, we commonly ask, “Will this work?” With debugging, security evaluation, or disaster recovery planning, the question has to be something like, “How can this possibly go wrong?”
I’m grateful to those, like the anonymous reporter to the SANS ISC, who share their experiences with us.