12/2 So I am trying to convince my company to take disaster planning
more seriously. Does anyone have any hard numbers on how often
data centers fail? I mean blow up, burn down, flood, etc., with
total loss of all services for an extended period of time.
\_ hard numbers tend to be SEKRET. But check out Yahoo's recent
outage and UltraDNS' outage. Those were both pretty bad.
\_ I don't know, but I am in the same predicament. Instead of
focusing on _TOTAL FAILURE_ (which is rare and evokes a
response of "In a 10.2 earthquake I am not coming in to work
anyway, because I am going to go check on my kids and
get my gun, buddy."), focus on real problems that are much more
likely to occur: power outages, localized floods, localized fires,
etc. The entire data center does not have to blow up for there
to be a major, major problem.
\_ What is the likelihood of those events, then? This is pretty
important engineering data; someone must have studied this.
\_ Any company that has a lot of data would know.
Insurance companies, government disaster data, etc.
\_ 1. This is going to vary based on a lot of factors like
your geography, industry, utilities, building construction,
etc. I would argue it is so specialized to your site
that averages do not matter. Plus, we all have
different needs as far as availability and uptime go.
\_ I don't care what your needs are; I can determine that
for my own company (okay, I actually draw up a menu of
options and senior executives make the decision based on
my risk analysis vs. cost). What I need is hard data to
make an informed risk analysis. What is the MTBF for
an entire data center? It is in Dallas if you really
think that matters.
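For reference, here is the arithmetic those rates would
feed; a rough sketch in Python, and every rate and dollar
figure below is invented, since real rates are exactly the
hard data I am missing:

# Annualized Loss Expectancy: (events/year) * (cost/event).
# All numbers are placeholders, not real Dallas data.
events = {
    # name: (events per year, cost per event in dollars)
    "power outage":    (0.5,  200_000),
    "localized flood": (0.05, 1_500_000),
    "total loss":      (0.01, 20_000_000),
}

for name, (rate, cost) in events.items():
    ale = rate * cost
    print(f"{name}: ALE = ${ale:,.0f}/yr")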
\_ Listen, guy. I am trying to help you. I don't
need attitude from you. The MTBF of a data
center made of concrete and buried a mile under
the Rockies is quite different from that of a
data center built out of paper on the banks
of the Mississippi. So the MTBF of "the average
data center" has very little bearing on _your_
MTBF. Hell yes it matters that you are in Dallas
versus, say, Somalia, dipshit.
\_ I'm not the op, but my reading comprehension
is a lot better than yours so let me help you
out. The disaster planning guy simply wants
a real life story of a disaster that was
averted due to planning. A story, albeit
irrelevant to actual circumstances, is
sometimes more powerful than listing boring
numbers that a lot of the upper management
MBA business dudes don't understand and
don't want to hear.
\_ Good point, but it's not what the guy asked
for. He asked for "hard data" for a
"risk analysis". Your reading comprehension
sucks.
\_ why don't we let the op decide -pp
\_ I suspect that in the public sector, the best
way to empire build is to create a horrific
scenario that only you can solve, by the
application of a multi-million dollar budget
and a bevy of new hires. This explains crazy
shit like having thousands of Federal officers
force everyone to take off their shoes before
flying and data centers a mile underground.
In the private sector, you have to do a Cost-
Benefit analysis to prove that what you want
to accomplish makes financial sense. So anecdotes
won't do, though they might help. Your reminder
that historical data is the best way to go is
useful, though; our DC provider is AT&T, which
surely must have done this analysis already. I
will ask them. Thank you for your advice.
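The Cost-Benefit check itself is one line once the
numbers exist; a sketch, with figures invented for
illustration (the real ones are what I will ask AT&T for):

# Mitigation is justified if the reduction in Annualized
# Loss Expectancy exceeds the annualized mitigation cost.
# All figures below are made up.
ale_before = 450_000   # $/yr expected loss, no DR plan
ale_after  = 120_000   # $/yr expected loss, with DR plan
capex      = 900_000   # one-time DR build-out cost
opex       =  80_000   # $/yr to operate it
years      = 5         # amortization horizon

annualized_cost = capex / years + opex
benefit = ale_before - ale_after
verdict = "do it" if benefit > annualized_cost else "skip it"
print(f"benefit ${benefit:,}/yr vs "
      f"cost ${annualized_cost:,.0f}/yr: {verdict}")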
\_ You're an idiot.
2. This is not easy data to find. I am guessing companies
do not like to make this info public. Sungard keeps
some statistics like:
a. Hardware failure remains the leading cause of business
disruption (almost 50%)
b. Problems resulting from disruptions to power
supplies account for more than one-quarter (26%) of
customer disaster invocations
c. Flooding and infrastructure-related problems such
as air conditioning faults and failure of uninterruptible
power supply (UPS) systems were the third biggest cause of
business disruption.
d. The average customer affected by Katrina used the
backup facility for 22 days.
\_ Can you point me to the Sungard info? Thanks.
3. What we did was analyze our own site and infrastructure
over the last 30 years. Sure, we might be missing the
100-year flood and the 100-million-year meteor impact, but
it gives us a good idea of the events likely to occur, and
protecting against those does a pretty good job against
the rare events, too, in most cases.
\_ This company does not have that kind of data available
internally; for one thing, we are only 15 years old.
Plus, we are in only a few data centers, so we just
don't have enough data points.
\_ Certainly you can examine the last 15 years
and for certain catastrophes like hurricanes
you can go back even before the company existed.
For example, we have a good idea of how often
earthquakes strike California. You should have
a good idea of how often disastrous tornadoes
(or whatever) strike your area even if the
last one happened in 1950. In our case we
expect a wildfire every 50 years, an earthquake
every 20 years, a windstorm every decade, etc.
Figure this out for your own site and then add
in other variables like construction of your
buildings, physical security (terrorism),
how good your utilities have been over the
15 years you have data (blackouts), etc.
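If you want to turn those recurrence intervals into
numbers management can read, a common move is to treat
each one as a Poisson rate; a sketch (the 50/20/10-year
intervals are ours, yours will differ):

# "One event every N years" as a Poisson process with
# rate 1/N: P(at least one event in T years) = 1 - e^(-T/N).
import math

intervals = {"wildfire": 50, "earthquake": 20, "windstorm": 10}
horizon = 15  # planning horizon in years

for name, n in intervals.items():
    p = 1 - math.exp(-horizon / n)
    print(f"{name}: {p:.0%} chance of at least "
          f"one in {horizon} years")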