Temperatures exceeding 100 degrees Fahrenheit (38 degrees Celsius) may not be damaging to disk drives, according to new research by Google engineers which casts doubt on previous findings linking heat to increased failure rates.
After studying five years' worth of monitoring statistics from Google’s massive datacentres, the engineers say they could find no consistent pattern linking failure rates to high temperatures or high utilisation levels. Temperature, they say, is often cited as the most important environmental factor affecting disk drive reliability.
“This is a fairly surprising result, which could indicate that datacentre or server designers have more freedom than previously thought when setting operating temperatures for equipment that contains disk drives,” write Google engineers Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz Andre Barroso. “We can conclude that at moderate temperature ranges it is likely that there are other effects which affect failure rates much more strongly than temperatures do.”
The Google researchers are more optimistic about the impact of heat on computer systems than a Forrester Research analyst who, in a webinar last month, said the increasingly fine features of new chips must be protected by lowering maximum operating temperatures.
The Google research, presented at this month’s Usenix Conference on File and Storage Technologies, examined datacentre performance at temperatures from 15 to 45 degrees Celsius (59 to 113 degrees Fahrenheit).
The engineers found negative effects from high temperatures only at the higher end of the range (40 degrees Celsius, or 104 degrees Fahrenheit, and above), and even at those temperatures the effects were observed only in drives at least three years old.
By contrast, software and hardware manufacturer Avtech says the “optimal” temperature range to maintain datacentre reliability is between 20 and 24 degrees Celsius (68 to 75 degrees Fahrenheit).
The Google engineers do report seeing a “modest increase” in failure rates at the lowest end of the temperature distribution they studied.
The engineers did not see a consistent correlation between high utilisation and high failure rates, a finding they say also contradicts previous literature on the subject. Frequent utilisation seems to lead to problems in drives that are less than a year old, and also in drives that are at least five years old, but not in drives that are in the middle of the age range, they found. This may be because drives that cope poorly with heavy use tend not to survive past their first year.
More than 90% of new information produced today is stored on magnetic media, mostly hard disk drives, according to an estimate cited in the Google paper. Drive manufacturers say yearly failure rates are below 2%, but user studies have found rates as high as 6%, the paper states.
The Google researchers did find several measures useful for predicting drive failure. The measures, known as SMART (Self-Monitoring, Analysis and Reporting Technology) parameters, include scan errors, which are reported when drives scan the disk surface in the background.
“After their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors,” the Google researchers write.
But more than half of Google’s failed drives did not exhibit scan errors or any of the four most prominent SMART signals. This makes it difficult to develop a comprehensive model for predicting failure.
“It is possible, however, that models that use parameters beyond those provided by SMART could achieve significantly better accuracies,” the Google engineers write. “For example, performance anomalies and other application or operating system signals could be useful in conjunction with SMART data to create more powerful models. We plan to explore this possibility in our future work.”
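The “39 times more likely” figure quoted above is a relative risk: the failure rate among drives that have shown a scan error, divided by the failure rate among drives that have not. The short Python sketch below shows how such a ratio could be computed. The function name and the drive counts are hypothetical, chosen only so the arithmetic works out to 39; they are not taken from the Google paper.

```python
def relative_risk(failed_exposed, total_exposed, failed_unexposed, total_unexposed):
    """Return the failure rate of the exposed group divided by that of the unexposed group.

    "Exposed" here means drives that reported at least one scan error.
    """
    if total_exposed <= 0 or total_unexposed <= 0:
        raise ValueError("group sizes must be positive")
    risk_exposed = failed_exposed / total_exposed
    risk_unexposed = failed_unexposed / total_unexposed
    if risk_unexposed == 0:
        raise ValueError("baseline failure rate is zero, so the ratio is undefined")
    return risk_exposed / risk_unexposed


# Hypothetical counts for illustration only (not from the study):
# 39 of 1,000 drives with a scan error fail within 60 days,
# against 10 of 10,000 drives with no scan error.
rr = relative_risk(failed_exposed=39, total_exposed=1000,
                   failed_unexposed=10, total_unexposed=10000)
print(round(rr, 1))  # 39.0
```

A high relative risk does not by itself make scan errors a complete predictor. As the researchers note, more than half of the failed drives showed none of the most prominent SMART signals, so a model built on this ratio alone would miss most failures.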
Although the Google data showed higher failure rates in older disk drives, the numbers do not prove there is a correlation between age and failure rates because there were many different models of disk drives observed in the study. “These data are not directly useful in understanding the effects of disk age on failure rates,” the engineers write.