Re: [9fans] ata drive capabilities - Plan9


  1. Re: [9fans] ata drive capabilities

    > the google paper shows a 40% afr for the first 6 months after some
    > smart errors appear. (unfortunately they don't do numbers for
    > a simple smart status.)


    Yes, and I rather mischaracterized the google paper's comments on
    SMART. A reread (I first read them a few months ago) shows the above.
    Further, the CMU paper even references the google study on the SMART
    subject:

    ``They find that [ ... ] the value of several SMART counters
    correlate highly with failures.''

    So SMART appears a little less dumb. I'd say it meets the
    better-than-nothing criterion.

    > from my understanding of how google do things, losing a drive just
    > means they need to replace it. so it's cheaper to let drives fail.
    > on the other hand, we have our main filesystem raided on an aoe
    > appliance. suppose that one of those raids has two disks showing
    > a smart status of "will fail". in this case i want to know the
    > elevated risk and i will allocate a spare drive to replace at least
    > one of the drives.
    >
    > i guess this is the long way of saying, it all depends on how painful
    > losing your data might be. if it's painful enough, even a poor tool
    > like smart is better than nothing.
    >

    I agree (plus I was just wrong about SMART at first), though I do
    think your example above is about preventing downtime, not so much
    data loss. (Even with no SMART at all, if all the disks come up
    corrupt, we're all backed up within some acceptable window, right?)


    > what a pity! it would have been so great to have had
    > an objective assessment of reliability by manufacturer.
    >

    Since the CMU thing found no difference between disk *types*, I
    wonder if it might be that there's little difference between
    manufacturers either -- instead the difference is in manufacturing,
    i.e., `vintage' & the like.

    > i've found it really quite hard to find useful data to
    > indicate how reliable a drive might be.
    >


    I think Fig. 2, Sec. 4.2 of the CMU paper relates to that; the
    `infant mortality' of manufactured mechanical parts isn't captured in
    MTTF -- but IDEMA is apparently going to address this by replacing the
    single MTTF number (which I don't quite understand anyway) with four
    MTTF numbers, one for each `phase' of a disk's life.

    --
    Josh




  2. Re: [9fans] ata drive capabilities

    > > from my understanding of how google do things, losing a drive just
    > > means they need to replace it. so it's cheaper to let drives fail.
    > > on the other hand, we have our main filesystem raided on an aoe
    > > appliance. suppose that one of those raids has two disks showing
    > > a smart status of "will fail". in this case i want to know the
    > > elevated risk and i will allocate a spare drive to replace at least
    > > one of the drives.
    > >
    > > i guess this is the long way of saying, it all depends on how painful
    > > losing your data might be. if it's painful enough, even a poor tool
    > > like smart is better than nothing.
    > >

    > I agree (plus I was just wrong about SMART at first), though I do
    > think your example above is about preventing downtime, not so much
    > data loss. (Even with no SMART at all, if all the disks come up
    > corrupt, we're all backed up within some acceptable window, right?)


    i don't know. if you lean in that direction, then the only thing raid
    gives you is reduced downtime.

    i think of raid as reliable storage. backups are for saving one's bacon in
    the face of other disasters. you know, sysadmin mistakes, misconfiguration,
    code gone wild, the building burning down: disaster recovery.

    (and if my experience with backups is any indication, it's best not to
    rely on them.)

    but this thinking is probably specific to how i use raid. i imagine
    what raid actually gives you should be worked out based on the
    application. on linux-type filesystems, e.g., raid won't save your
    accidentally deleted files.

    - erik
