Thread: [9fans] Recovering a venti from disk failure

  1. [9fans] Recovering a venti from disk failure

    I had a disk fail a few days ago, after a power outage here. Various
    spots in the fossil partition generate IO errors. The Venti arenas
    seem intact. I've bought a new (and larger) drive, and have done a
    basic Plan 9 install onto that, and moved the old disk from sdC0 to
    sdD0. I'd like to recover the data from my old venti. It looks like
    the process is roughly this (a command sketch follows below):

    1) Boot off new disk.
    2) Recover last score from old fossil w/ fossil/last (this works).
    3) Start a venti running off old disk (will require editing venti/conf?)
    4) venti/copy old-venti new-venti score-from-step-2
    5) Reboot off some other medium.
    6) Reformat new fossil from new venti using score from step 2.
    7) Reboot off new fossil+venti.

    Does that sound like an accurate script? Can anyone who's done this
    confirm? By chance an easier way?
    Anthony
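
    Roughly what those steps look like as commands, assuming the old disk
    now shows up as sdD0 and the new install is on sdC0; the config path,
    dial strings and partition names below are illustrative guesses, not
    details from the thread:

        # step 2: print the score of the last archival snapshot
        fossil/last /dev/sdD0/fossil

        # step 3: run a second venti off the old disk, from a copy of its
        # config with the file names changed to the sdD0 partitions and,
        # say, addr tcp!127.1!17035 so it doesn't collide with the new one
        venti/venti -c /tmp/old-venti.conf

        # step 4: copy everything reachable from the step-2 score from the
        # old venti to the new one (assumed to be on the default port)
        venti/copy tcp!127.1!17035 tcp!127.1!17034 score-from-step-2

        # step 6: after rebooting off other media, reformat the new fossil
        # from that same score
        fossil/flfmt -v score-from-step-2 /dev/sdC0/fossil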

  2. Re: [9fans] Recovering a venti from disk failure

    > 1) Boot off new disk.
    > 2) Recover last score from old fossil w/ fossil/last (this works).
    > 3) Start a venti running off old disk (will require editing venti/conf?)
    > 4) venti/copy old-venti new-venti score-from-step-2
    > 5) Reboot off some other medium.
    > 6) Reformat new fossil from new venti using score from step 2.
    > 7) Reboot off new fossil+venti.


    Looks good to me. You'll have to change the file
    names in the venti config to get the old venti
    running again, but the rest can stay the same.

    Russ
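
    For concreteness, the old config might have held lines like

        index main
        arenas /dev/sdC0/arenas
        isect /dev/sdC0/isect

    (plus a bloom line, cache sizes and so on); the edited copy used to
    bring the old venti up would just point those file names at the sdD0
    partitions instead, e.g. /dev/sdD0/arenas.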


  3. Re: [9fans] Recovering a venti from disk failure

    > 1) Boot off new disk.
    > 2) Recover last score from old fossil w/ fossil/last (this works).
    > 3) Start a venti running off old disk (will require editing venti/conf?)
    > 4) venti/copy old-venti new-venti score-from-step-2


    Wouldn't this lose all old snapshots?

  4. Re: [9fans] Recovering a venti from disk failure

    If your old venti is intact, I don't see the need to copy it (or is it
    on the drive with the damaged fossil and you don't trust the drive?).
    I would just format the new fossil partition using the last fossil
    dump score.

    If the arenas are okay but not the index, I'd rebuild the index.
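
    In command form, this shorter path is just the flfmt step, run with
    $venti pointing at the old venti that holds the score (the score and
    partition name here are placeholders):

        # reformat the new fossil partition from the last dump score
        fossil/flfmt -v score-from-fossil-last /dev/sdC0/fossil

    If the index rather than the arenas turned out to be damaged,
    venti-fmt(8) covers rebuilding it from the intact arenas
    (fmtisect/fmtindex/buildindex) instead of copying them.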


  5. Re: [9fans] Recovering a venti from disk failure

    > If your old venti is intact, I don't see the need to copy it (or is it
    > on the drive with the damaged fossil and you don't trust the drive?).


    If your venti disk is similar to the failed disk and of the same
    vintage, it is also likely to fail and should be replaced
    before it actually does. Similarly, replace all disks in a
    RAID if one of the disks dies (and the death was not in the
    first few months of its life).

  6. Re: [9fans] Recovering a venti from disk failure

    Geoff wrote:
    // (or is it on the drive with the damaged fossil and you
    // don't trust the drive?).

    exactly. i'm also replacing it with something about five times the
    size at the same cost. amazing what a few years will do. once the
    transition is complete, i'll use most of the old disk as an "other". i'm
    also considering options for a more disaster-tolerant setup.

    Bakul wrote:
    // Wouldn't this lose all old snapshots?

    I believe the method described will take the root of fossil, above
    /active, the normally visible part. I'm particularly encouraged by the
    example involving a corrupted disk and fossil/flfmt in fossil(4). i'll
    let you know if my experience says otherwise.

  7. Re: [9fans] Recovering a venti from disk failure

    > // Wouldn't this lose all old snapshots?
    >
    > I believe the method described will take the root of fossil, above
    > /active, the normally visible part. I'm particularly encouraged by the
    > example involving a corrupted disk and fossil/flfmt in fossil(4). i'll
    > let you know if my experience says otherwise.


    You get to keep /n/dump but not /n/snap.

    Russ


  8. Re: [9fans] Recovering a venti from disk failure

    >> If your old venti is intact, I don't see the need to copy it (or is it
    >> on the drive with the damaged fossil and you don't trust the drive?).

    >
    > If your venti disk is similar to the failed disk and of same
    > vintage, it is also likely to fail and should be replaced
    > before it actually does. Similarly replace all disks in a
    > RAID if one of the disks dies (and the death was not in the
    > first few months of its life).


    our experience has been that this is not a cost-effective way of dealing
    with disk failures, as disks do not fail en masse.

    also, once a failure has occurred, it is too late. if there are other
    disks with latent errors, raid reconstruction will fail. in a raid 5
    array, a latent error + a failed disk means that block can't be
    reconstructed.

    i think this correlation gives people the false impression that they do
    fail en masse, but that's really wrong. the latent errors probably
    happened months ago.

    our solution to this problem (RaidShield™) is to preemptively scan
    disks while the raid is idle and rewrite these blocks. this either
    (a) corrects the bit rot, (b) causes the drive to remap the sector, or
    (c) notifies the user that there's a real disk problem so it can be
    replaced before a second drive in the array fails.

    - erik
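
    A crude, read-only cousin of that idea that works on any Plan 9 box,
    assuming the disk shows up as /dev/sdC0/data (this is not RaidShield,
    just a pass that makes the drive visit every sector so latent errors
    surface as i/o errors instead of lurking until a rebuild):

        # read the whole device and discard the data; any read error it
        # reports points at a sector with a latent problem
        dd -if /dev/sdC0/data -of /dev/null -bs 65536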


  9. Re: [9fans] Recovering a venti from disk failure

    > >> If your old venti is intact, I don't see the need to copy it (or is it
    > >> on the drive with the damaged fossil and you don't trust the drive?).

    > >
    > > If your venti disk is similar to the failed disk and of same
    > > vintage, it is also likely to fail and should be replaced
    > > before it actually does. Similarly replace all disks in a
    > > RAID if one of the disks dies (and the death was not in the
    > > first few months of its life).

    >
    > our experience has been this is not a cost-effective way of dealing
    > with disk failures as disks do not fail en masse.


    Various studies seem to indicate failure rates are highly
    correlated with drive model, vintage and manufacturer.
    Assuming a RAID is built from similar disks, when one fails
    the others are more likely to fail.

    Google has a recent paper on disk failures that shows 8% to
    8.6% annualized failure rate for 2 to 3 year old disks
    (measured across a very large population of disks and not
    accounting for vendor/model differences). Things are likely
    to be worse in a typical home environment -- dust, heat
    variation, poor or nonworking surge protectors, more power
    cycles, computers get moved around or kicked more, poorer
    quality components etc.

    If you can afford it, replacing your most critical disks
    every three years is a good rule of thumb -- the first disk
    failure is a strong reminder of that:-)

    > i think this correlation gives people the false impression that they do
    > fail en masse, but that's really wrong. the latent errors probably
    > happened months ago.


    Yes but if there are many latent errors and/or the error rate
    is going up it is time to replace it.

    > our solution to this problem (RaidShield™) is to preemptively scan
    > disks while the raid is idle and rewrite these blocks. this either
    > (a) corrects the bit rot (b) causes the drive to remap the sector or
    > (c) notifies the user that there's a real disk problem so it can be
    > replaced before a second drive in the array fails.


    This is a good idea. We did this in 1983, back when disks
    were simpler beasts. No RAID then of course.

  10. Re: [9fans] Recovering a venti from disk failure

    > Various studies seem to indicate failure rates are highly
    > correlated with drive model, vintage and manufacturer.
    > Assuming a RAID is built from similar disks, when one fails
    > the others are more likely to fail.


    while it is true that some disk vintages are better than others, when
    one drive fails, the probability of the other drives failing has not
    changed. this is the same as if you flip a coin ten times and get ten
    heads: the probability of flipping the same coin and getting heads is
    still 1/2.

    >> i think this correlation gives people the false impression that they do
    >> fail en masse, but that's really wrong. the latent errors probably
    >> happened months ago.

    >
    > Yes but if there are many latent errors and/or the error rate
    > is going up it is time to replace it.


    maybe. the google paper you cited didn't find a strong correlation
    between smart errors (including block relocation) and failure.

    > This is a good idea. We did this in 1983, back when disks
    > were simpler beasts. No RAID then of course.


    an even better idea back then. disks didn't have 1/4 million
    lines of firmware relocating blocks and doing other things to^w
    i mean for you.

    - erik


  11. Re: [9fans] Recovering a venti from disk failure

    Bakul Shah wrote:

    *snip*

    >
    > This is a good idea. We did this in 1983, back when disks
    > were simpler beasts. No RAID then of course.
    >


    Sure there were. From SMD days in the '70s, even.

    We called them 'mirrored' (2 drives, one controller and cable set)
    or 'duplexed' (separate controllers and cables) - sometimes on separate hosts
    sharing common drives.

    The other so-called 'levels' and 'RAID' terminology took longer to catch on. Or
    become affordable enough to leave the mainframe arena anyway.

    Drives - such as the ISS-80, a few of whose remains litter my garage yet today -
    were hardly 'simpler'.

    Quite the reverse! Built like a milling machine or Hardinge lathe to even
    *attempt* to keep heads aligned with (replaceable) media.

    "Simplicity" - or decent and affordable reliability at least - arrived with
    IBM-UK 'Winchester' technology and the CDC-'birdnames' notably Lark and Wren.

    Even with 'Winchester', 'Bernoulli' 8" cartridges were far more reliable than the
    comparable 'SeaGRRRRRATE' HDD of the same era. Twinned units were not 'RAID' but
    had a fast 'dd'-like imaging utility for manual duplication.

    Seagate 5 1/4" drives used to be good for 3 to 6 months, Western Digital or
    Microscience 6 to 12, CDC 12 to 24 months.

    Big improvement, as the first-generation NCR 'Century' series once lunched a set of
    platters as often as twice a week...

    The reasons Fossil and Venti are as they are may no longer be as obvious or
    compelling as they once were, but how soon we forget how *seriously* fragile and
    failure-prone HDDs once were.

    Bill



  12. Re: [9fans] Recovering a venti from disk failure

    erik quanstrom wrote:
    >> Various studies seem to indicate failure rates are highly
    >> correlated with drive model, vintage and manufacturer.
    >> Assuming a RAID is built from similar disks, when one fails
    >> the others are more likely to fail.

    >
    > while it is true that some disk vintages are better than others, when
    > one drive fails, the probability of the other drives failing has not
    > changed. this is the same as if you flip a coin ten times and get ten
    > heads, the probability of flipping the same coin and getting heads, is
    > still 1/2.
    >
    >>> i think this correlation gives people the false impression that they do
    >>> fail en masse, but that's really wrong. the latent errors probably
    >>> happened months ago.

    >> Yes but if there are many latent errors and/or the error rate
    >> is going up it is time to replace it.

    >
    > maybe. the google paper you cited didn't find a strong correlation
    > between smart errors (including block relocation) and failure.
    >
    >> This is a good idea. We did this in 1983, back when disks
    >> were simpler beasts. No RAID then of course.

    >
    > an even better idea back then. disks didn't have 1/4 million
    > lines of firmware relocating blocks and doing other things to^w
    > i mean for you.
    >
    > - erik
    >
    >


    And - lest we forget - a RAID array actually has a higher statistical chance of
    suffering a drive failure, and a *lower* MTBF (to first component failure) than
    a single drive. Simple math.

    What we gain is a reduced risk of *unrecoverable* damage, not fewer failures,
    per se.

    Bill
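
    To make that "simple math" concrete (assuming independent failures and
    the ~8% annualized rate quoted earlier from the Google paper): a single
    disk fails in a given year with probability 0.08, while a five-disk
    array sees at least one member fail with probability 1 - 0.92^5, about
    0.34, i.e. roughly four times as often. The redundancy changes what a
    single failure costs, not how often a drive has to be swapped.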




  13. Re: [9fans] Recovering a venti from disk failure

    > while it is true that some disk vintages are better than others, when
    > one drive fails, the probability of the other drives failing has not
    > changed. this is the same as if you flip a coin ten times and get ten
    > heads, the probability of flipping the same coin and getting heads, is
    > still 1/2.


    They are not independent events since they see similar wear
    and tear and experience the same environmental stress. In
    fact, by the 2nd/3rd year they are starting to approach the other
    end of the bathtub curve of failures.

    The disk industry needs to hire real actuaries. Or maybe
    google will!

    >
    > >> i think this correlation gives people the false impression that they do
    > >> fail en masse, but that's really wrong. the latent errors probably
    > >> happened months ago.

    > >
    > > Yes but if there are many latent errors and/or the error rate
    > > is going up it is time to replace it.

    >
    > maybe. the google paper you cited didn't find a strong correlation
    > between smart errors (including block relocation) and failure.


    Actually, they did find a strong correlation for some of the
    SMART parameters. I think what they said was that over 36%
    of failed disks had no SMART errors of certain kinds (that
    could be because the drive firmware didn't actually count
    those errors, but it seems they didn't investigate this). If
    you do see more soft errors, there is definitely something to
    worry about.

    > > This is a good idea. We did this in 1983, back when disks
    > > were simpler beasts. No RAID then of course.

    >
    > an even better idea back then. disks didn't have 1/4 million
    > lines of firmware relocating blocks and doing other things to^w
    > i mean for you.


    I preferred doing bad block forwarding etc. in 100 or so
    lines of C code in the OS, but it was clear even then that
    disk vendors were going to make their disks "smarter".
