The Planet, a popular hosting provider, had a fire and explosion earlier this weekend that resulted in an outage of their H1 data center.
Reading through the announcements and the usual techno-press reports on it, a few things struck me. While the last word on this is far from said, here is what stood out from a BCP/DRP viewpoint:
- First I'd like to mention that I'm actually impressed by the frequent communication and the calmness of those messages from The Planet: http://forums.theplanet.com/index.php?showtopic=90185.
I think it's important to teach those dealing with (major) incidents to remain calm. Not just when dealing with the public or the press, but also internally. Think through your decisions before you act, as doing things in a panic will result in making the wrong choices.
Communicating the right way can also be critical, and planning ahead helps a lot.
- Next I saw they were "requiring us to take down all generators as instructed by the fire department". I have seen BCP/DRP plans derail before because officials stepped in and handled the emergency their way, not the way the organization itself had planned it.
I think it would be interesting for most of us to actually talk to fire departments and/or police officers about what their normal responses are, and to take those into account in our plans. When you build a BCP you basically try to build (and spend money) to make sure you don't lose a site. One of the things you foresee is redundant power, but if you're not going to be allowed to use it, perhaps your priorities would shift to spending that money elsewhere and fixing the problem at another layer?
- The reason they went down seems to have been: "electrical gear shorted, creating an explosion and fire that knocked down three walls surrounding our electrical equipment room". While it doesn't say as much, knocking down 3 walls is violent. Now an explosion can do that, and transformers can indeed explode, but there's another thing that can knock down walls: violently expanding gases from fire suppression systems. That's why you have those automatic vents in the walls. Please note: I'm not saying I know what happened, because I don't. But there's one thing I'd do as a precaution: I'd like to make sure that my facilities processes include some regular check to see if those vents are still OK. Knocking over walls is just too scary an idea.
- The Planet got vendors involved during the weekend itself: "As you know, we have vendors onsite at the H1 data center. With their help, we’ve created a list of equipment that will be required, and we’re already dealing with those manufacturers to find the gear. Since it’s Saturday night, we do have a few challenges".
What have you arranged so that, within hours of a fire or explosion, vendors are helping you assess what equipment you need to get back on-line? Can you even reach them during a weekend? Every bit of time you put into collecting and updating this information up front in your BCP/DRP will pay back many times over in getting back on-line.
- It's good to see they made a list of priorities public.
Your plans could include such lists pre-made. It's easier to cross off items you still have than to think up the list yourself during the emergency.
- There is talk from both The Planet and some of their customers about DNS and redundancy. I'm pretty sure it's not entirely The Planet's fault: customers putting all their eggs in one basket exist all too often.
Still, I find this strange: DNS is, in my opinion, about the most redundant system you can get. You can easily add another server anywhere in the world, and there is hardly any penalty for not having all of them in the same spot. So why would you even consider having them in the same spot? Yet I've more than once seen setups where all the NS records entered in a TLD point to adjacent IP addresses, and a traceroute shows they follow exactly the same route. This isn't using DNS for what it can do for you: it'll protect against a server outage, but not much more than that, whereas with a handful of DNS servers spread out there, you'd be next to impossible to knock off the air DNS-wise.
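The "adjacent IP addresses" symptom is easy to check for once you have the addresses of your name servers in hand. Below is a minimal sketch, assuming IPv4 addresses that you have already looked up yourself; the function names and the /24 threshold are my own illustrative choices, and sharing a /24 is only a rough hint of co-location, not proof. Servers in different networks can still sit in the same building, so a traceroute remains worthwhile.

```python
import ipaddress
from collections import defaultdict


def group_by_network(addresses, prefix_len=24):
    """Group IPv4 address strings by the network they fall into.

    prefix_len=24 is an arbitrary illustrative threshold.
    """
    groups = defaultdict(list)
    for addr in addresses:
        # strict=False lets us pass a host address and get its enclosing network
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        groups[str(net)].append(addr)
    return dict(groups)


def looks_colocated(addresses, prefix_len=24):
    """Return True when the addresses span fewer than two networks."""
    return len(group_by_network(addresses, prefix_len)) < 2


if __name__ == "__main__":
    # Documentation-range addresses standing in for real name servers
    name_servers = ["192.0.2.10", "192.0.2.11"]
    for network, members in group_by_network(name_servers).items():
        print(network, members)
    if looks_colocated(name_servers):
        print("All name servers share one network: little protection beyond a server outage.")
```

Running this against the addresses behind your own NS records takes a minute and is a cheap item to add to a periodic BCP/DRP review.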
- The Planet is slowly getting back on the air, so that's good.
I think it would be a good idea for a next BCP/DRP exercise to replay an existing incident and measure how you do against how they did in real life.
- Lastly, The Planet seems to be suffering from a /. effect on their forum. I think this is about the worst moment to get on Slashdot you can imagine. But it's a likely result of the incident that whatever you still have on-line will attract more visitors than ever before.
Again something to plan for, although we here at the ISC needed to be featured on /. a few times before we nailed it ourselves as well.
Basically, make sure your emergency communication is as solid as you can make it, as static as possible, and as lightweight on the server(s) as you can imagine. The last thing you want during an emergency is to have to survive a DDoS from curious people (like we all are ourselves).
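As an illustration of "as static as possible", here is a minimal sketch that renders incident updates into a single self-contained HTML page with no scripts, images or database behind it. The function name, the page layout and the sample update are hypothetical and not something The Planet or the ISC is known to use; the resulting file could be served by any plain web server or mirrored elsewhere.

```python
from html import escape


def render_status_page(title, updates):
    """Build a static HTML page from (timestamp, message) pairs.

    The updates are listed in the order given; all text is HTML-escaped.
    """
    items = "\n".join(
        f"<li><strong>{escape(ts)}</strong> {escape(msg)}</li>"
        for ts, msg in updates
    )
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{escape(title)}</title></head>\n'
        f"<body><h1>{escape(title)}</h1>\n<ul>\n{items}\n</ul></body></html>\n"
    )


if __name__ == "__main__":
    # Hypothetical update, for illustration only
    page = render_status_page(
        "Data center status",
        [("Sunday 14:00", "Vendors onsite, equipment list being compiled.")],
    )
    with open("status.html", "w", encoding="utf-8") as handle:
        handle.write(page)
```

Regenerating one small file per update keeps the load per visitor close to the minimum a web server can manage, which is what you want when the curious crowd shows up.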
My best wishes to the folks at The Planet and glad to read nobody got hurt.
To the thousands of customers affected, well there's the SLA, but there's also some pretty decent work in recovering going on. And I can only hope those companies where I host servers would be able to do equally well and be as open about it as these folks have been so far.
Swa Frantzen -- Gorilla Security
Jun 1st 2008
I can see the fire department's perspective here. Backup power systems are great for other kinds of disasters, but if the building is on fire the fire department is likely to want to know that equipment has been de-energized before they go in and start spraying water around. Some data centers have an E-stop button that kills all the UPS power for exactly this reason...in fact, I think this is actually an electrical code requirement in some places.