Vox, TypePad and Movable Type—were down 90 minutes. On LiveJournal, the company posted an apology, explaining that during outages it would normally display a message telling visitors about the status of the site. "But because this was a full power outage there was a period of time where we could not access or update a status page," the posting explained.

Fortunately, 365 Main was able to manually restart the generators that failed to kick in automatically, which allowed it to operate on backup power until PG&E began delivering a stable power supply.

The root cause of the outage turned out to be highly unusual: The data center's $1.2 million diesel-powered generators had gradually fallen out of sync with the electronic controllers that start them. "This was a truly rare incident," Staten says.

Unlike centers with battery-started backup generators, 365 Main's data center has a continuous power system. Energy from the local utility flows into the system to operate its generators, which supply power to the building. In the event of a power failure, the system normally restarts using energy stored in each generator's flywheel. The flywheels, basically large spinning discs, keep turning long enough after a power failure to restart the diesel engines.

With its 10 backup diesel-powered generator units, 365 Main's primary data center had operated without a glitch through numerous power outages since its construction in the spring of 2001. The building has eight data rooms, each with its own dedicated generating unit. There are also two extra units, ready to kick in if one of the eight dedicated backup generators fails.

As it turned out, on July 24, four of the diesel engines failed to start, causing three computer rooms to lose power. "We could have failed three units, and through an automatic load-shed of chillers and air conditioning units, we could have continued to function," says Jean-Paul Balajadia, senior vice president of operations at 365 Main. In other words, the facility had enough backup generators to run the computer rooms if only three units had gone down. But with four unable to start and keep running, up to 45% of the building's computer systems shut down.

As soon as the failure occurred, Balajadia and his staff called the manufacturer of the power generators, Hitec Power Solutions. They also called Cupertino Electric, the engineers and project manager for the building's construction.

After several days, they determined the cause to be a discrepancy in the engines' start-up routine. Over the years, as each engine was periodically tested and shut down, the engine's digital controller would record the exact position of the pistons in the cylinders when they stopped so that, on the next start-up, fuel would be injected at the precise moment. "The controller writes this into memory at zero RPM, reading the information and then clearing out the prior memory," Balajadia explains.

When the engines were first shipped from the factory, it took seven to 10 seconds for an engine to come to a complete stop, at which point the pistons' positions would be recorded for the next startup sequence. This is critical, because if the fuel is injected at the wrong time, the engines won't start. But over several years, during which time 365 Main had accumulated more than 1,000 hours of operation on the diesels, the engines had been fully broken in, so their shut-down time had increased to as much as 13 seconds.

Still, in the controllers' memory from the last shutdown, the pistons were recorded as being several seconds out of position, because each digital controller is calibrated to initiate the next restart based on the position of the pistons only seven to 10 seconds after the last shutdown. That variance of three or four seconds had caused four of the diesel engines to be out of their normal starting sequence, so they misfired, failing to start and keep running.

"This was completely unique—it was a true bug," Kelly says. The fix was to adjust the controller to allow more time between the shutdown and the reset command. The company implemented the fix not only at its San Francisco facility but also at its El Segundo, Calif., data center, which had the same Hitec generators containing the identical controllers.

Hitec reports that only about 100 such engines were shipped in 2001 with this particular Detroit Diesel controller, and that other companies using them in data centers have had their controllers fixed as a result. Newer diesel generators have a more sophisticated ignition sequence. "We had only two other sites that used these engines as extensively, and both customers had reported isolated incidents where single engines failed to start," says John Sears, marketing and sales manager at Hitec Power Solutions in Stafford, Texas.

Although the root cause of the outage is rare, the lessons gleaned from the experience are useful for data center managers seeking ways to guard against massive system failures:

Work closely with vendors: If 365 Main's technical staff had known the meaning of the error code or had had access to relevant online technical information, they'd have been able to solve the problem much faster.

Distribute data center resources: If you have the budget, establish a disaster recovery plan that allows for redundancy.

Communicate immediately and openly: "One of the things CIOs can learn from our experience on July 24th is that we communicated as transparently as possible," Kelly says. Sun's Snow agrees: "Their remediation and open correspondence on the problem and resolution was adequate."

Be ready to deal with the unexpected: "Sometimes in live environments things happen that you don't anticipate," Kelly says. "One hundred percent uptime is not a reality."

Never get complacent: "The extent of the disruption caused by 365 Main's power outage to the industry as a whole causes us to realize that, more than ever, a company can never be too prepared," Snow says.

365 MAIN AT A GLANCE
Headquarters: 365 Main St., San Francisco, CA 94105
Phone: (877) 365-6246
URL: www.365main.com
Business: Data center developer and operator providing mission-critical operations and business continuity for tenants
Senior Vice President, Operations: Jean-Paul Balajadia
Financials: NA (privately held)
Challenge: Maintain the company's post-outage record of 99.9967% uptime across half a dozen data centers around the U.S., including the San Francisco facility
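The controller failure described in the article boils down to a timing-window check: the restart logic assumed the engine had stopped within the factory-measured seven-to-10-second window, so a broken-in engine that took 13 seconds to spin down left the pistons "out of position" as far as the controller was concerned. The sketch below models that logic in simplified form; all names and the widened window in the fix are illustrative assumptions, not Hitec's or Detroit Diesel's actual firmware values.

```python
# Illustrative model of the restart-window bug: the controller records
# piston position at zero RPM, but its restart calibration assumes the
# engine stopped within the factory spin-down window of 7-10 seconds.
FACTORY_WINDOW = (7.0, 10.0)  # seconds; measured on factory-fresh engines


def can_start(spindown_seconds: float, window=FACTORY_WINDOW) -> bool:
    """Restart succeeds only if the engine stopped inside the calibrated
    window; outside it, the recorded piston position no longer matches
    the controller's timing, fuel is injected at the wrong moment, and
    the engine misfires."""
    lo, hi = window
    return lo <= spindown_seconds <= hi


# A factory-fresh engine stopping in ~8 s restarts normally:
print(can_start(8.0))    # True

# A broken-in engine taking 13 s falls outside the window and misfires:
print(can_start(13.0))   # False

# The fix was to allow more time between shutdown and the reset command;
# the 15-second upper bound here is a made-up value for illustration:
WIDENED_WINDOW = (7.0, 15.0)
print(can_start(13.0, window=WIDENED_WINDOW))   # True
```

The point of the sketch is that nothing was "broken" in either component individually: the engines wore in as expected and the controller did exactly what it was calibrated to do. The failure lived in a stale assumption shared between them, which is why it only surfaced after years of accumulated run time.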