
04-17-2008, 08:56 PM
Re: Recommended hard drive temperature
Previously Franc Zabkar wrote:
> On 17 Apr 2008 13:22:52 GMT, Arno Wagner put finger
> to keyboard and composed:
[...]
>>Ok, I have it now. I think you refer to figure 5: "AFR for average
>>drive Temperature". This one seems to indicate slightly higher failure
>>rates for the 15...30C window than for the others in drives younger
>>than 3 years. If you consult figure 4, you see that temperature
>>extremes are rare. Then there is one thing: Partially defective drives
>>work slower or not at all. This may result in lower drive temperatures
>>(spin down, refusal to execute access) and higher drive temperatures
>>(lots and lots of retries, heat from bearings). This can
>>significantly skew the results.
> I would expect that Google would identify a partially defective drive
> (assuming it was detected by SMART) and eventually take it out of
> service. Certainly, if the drive does not work at all, then by
> definition it must be totally, not partially, defective. Having said
> that, the article doesn't really give a satisfactory definition of
> failure other than to say that it is the reason that a drive is
> replaced.
Another problem is that the failure time (according to the article)
was the replacement time. I have heard the chief Google technology
guy speak about this and he stated something like "every few months
defectives are repaired". There can be a long time between
failure and replacement.
> As for spin problems, the article states ...
> "Spin Retries. Counts the number of retries when the drive is
> attempting to spin up. We did not register a single count within our
> entire population."
That may just mean that no drive managed to spin up at all
after the first try failed. Or the attribute is unused.
>>The basic results could be that
>>failing drives run hotter or colder than others. I am also missing
>>more break-downs into different temperature profiles (e.g. mainly
>>constant, strong variation, etc.) as it is, e.g., possible that the
>>problem in the low temp section is due to cycling temperatures.
> The article states ...
> "As is common in server-class deployments, the disks were powered on,
> spinning, and generally in service for essentially all of their
> recorded life. They were deployed in rack-mounted servers and housed
> in professionally managed datacenter facilities."
> I think that would discount your temperature cycling hypothesis.
Not at all. The very fact that disks managed to get to high
temperatures means that temperature cycles are possible.
>>I am not saying the results are wrong, but they are suspicious and
>>with the data given are _very_ difficult to even understand
>>properly. It does not seem any statistics expert was consulted by the
>>writers and the temperature results are by far the weakest in the
>>paper. I also miss a proof or at least conclusive argument that the
>>remaining observations are temperature independent, both for absolute
>>value and different change profiles.
>>
>>The paper is still very valuable. Figures 7-10 give solid results, and
>>need no further details. Scanning your disks every 2 weeks or so and
>>monitoring reallocation counts is a very good idea (and something I
>>have been doing for several years now). The folks at Google likely
>>also found that the SMART status alone is typically over-optimistic.
>>As to many failures not being predicted by SMART data, my results
>>are different. It is possible that the drive selection here again
>>skewed the picture compared to modern drives. Personally I have had
>>100% prediction by SMART attributes (not SMART status though) in
>>an admittedly small population of about 50 drives over three
>>years and with mostly Maxtors that are known to fail gradually.
>>
>>Arno
> With respect, I prefer to accept Google's experience.
> "It is difficult to add temperature to this analysis since despite it
> being reported as part of SMART there are no crisp thresholds that
> directly indicate errors. However, if we arbitrarily assume that
> spending more than 50% of the observed time above 40C is an indication
> of possible problem, and add those drives to the set of predictable
> failures, we still are left with about 36% of all drives with no
> failure signals at all."
This does not counter my argument. It just states that at least
36% of failures are not temperature related. And the threshold is,
as the authors note, quite arbitrary. They are speculating here
about whether temperature above 40C is the killer when observed more
than 50% of the time. It is not in their environment. This does not
surprise me at all.
Also note that there is no "Google's experience" in the paper.
This is "observations in a specific environment by three people
with Google" and certainly the observations are not well
documented with regard to temperature. On the other hand, an air
conditioned data center and only two years of observation are not
enough to answer that question conclusively.
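To make the quoted heuristic concrete, here is a minimal sketch of
the "more than 50% of the observed time above 40C" test. It assumes
evenly spaced temperature samples per drive, so that the fraction of
samples equals the fraction of time. The function names and the
sample data are made up for illustration and are not from the paper.

```python
from typing import Sequence


def fraction_above(temps_c: Sequence[float], threshold_c: float = 40.0) -> float:
    """Fraction of samples strictly above threshold_c.

    Assumes the samples are evenly spaced in time (an assumption,
    the paper does not describe its sampling in this form).
    """
    if not temps_c:
        raise ValueError("no temperature samples")
    hot = sum(1 for t in temps_c if t > threshold_c)
    return hot / len(temps_c)


def flagged_by_heuristic(
    temps_c: Sequence[float],
    threshold_c: float = 40.0,
    min_fraction: float = 0.5,
) -> bool:
    """True if the drive spent MORE than min_fraction of samples above threshold_c."""
    return fraction_above(temps_c, threshold_c) > min_fraction


if __name__ == "__main__":
    # Hypothetical drive that is hot in 3 of 4 samples.
    samples = [41.0, 42.5, 43.0, 30.0]
    print(fraction_above(samples), flagged_by_heuristic(samples))
```

Both numbers (40C and 50%) are plain parameters here, which shows
how arbitrary the cut-off is: moving either one changes which
drives count as "predictable" failures.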
> I notice also that Google have an interesting observation regarding
> seek errors.
> "When examining our population, we find that seek errors are
> widespread within drives of one manufacturer only, while others are
> more conservative in showing this kind of errors. For this one
> manufacturer, the trend in seek errors is not clear, changing from one
> vintage to another. For other manufacturers, there is no correlation
> between failure rates and seek errors."
> I wonder if the abovementioned manufacturer is Seagate. IME, when
> Seagate drives report a "seek error rate", they are actually reporting
> a seek count.
Quite frankly this shows that the authors do not have a lot of
experience with SMART data. Seek errors are due to modern drives
starting to read before the heads have settled. This usually works,
but when it does not work it becomes a seek error. Some
manufacturers list these in the SMART data, others do not. The
number seen does not mean much, which is well known to people
that work a lot with SMART data.
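As a contrast to the seek error numbers, the reallocation count
monitoring mentioned earlier in the thread could look roughly like
the sketch below. It compares the raw reallocated sector count per
drive between two scans and reports any growth. How the counts are
collected is left open (for example from attribute 5 in the output
of "smartctl -A"). The drive names and the function name are
hypothetical.

```python
from typing import Dict


def reallocation_alerts(previous: Dict[str, int], current: Dict[str, int]) -> Dict[str, int]:
    """Return {drive: growth} for drives whose reallocated count increased.

    previous/current map a drive name to its raw reallocated sector
    count at the last scan and at this scan. A drive missing from
    `previous` is compared against 0, so a newly added drive that
    already has reallocations is reported as well.
    """
    alerts: Dict[str, int] = {}
    for drive, count in current.items():
        growth = count - previous.get(drive, 0)
        if growth > 0:
            alerts[drive] = growth
    return alerts


if __name__ == "__main__":
    # Hypothetical counts from two scans, two weeks apart.
    last_scan = {"sda": 0, "sdb": 8}
    this_scan = {"sda": 3, "sdb": 8, "sdc": 0}
    print(reallocation_alerts(last_scan, this_scan))
```

The point of looking at the change rather than the absolute value
is that a stable nonzero count is less worrying than a count that
keeps growing from scan to scan.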
Arno