Am I endangering my RAID 5 array? - Storage




  1. Am I endangering my RAID 5 array?

    At home I have a Linux desktop and a headless Linux fileserver with a
    software RAID 5 array (see

    for details).

    A few days ago the UPS battery conked out. Besides the battery alarm
    ringing every ten seconds and slowly driving me mad, the fileserver is
    spontaneously rebooting several times a day; apparently momentary dips
    and other irregularities in the power here in downtown San Francisco,
    which the UPS had before filtered (and which likely prematurely aged
    the battery after replacement only 19 months ago), are causing it to
    reboot. (Interestingly, the Linux desktop hasn't hiccuped once;
    apparently its power supply is less sensitive.)

    The storage array is a pretty straightforward
    JFS-on-LVM2-on-software-RAID 5 setup. Each time the server reboots it
    usually causes the array to automatically rebuild. Sometimes the
    reboots occur during the rebuilding process, causing it to restart.

    My question: Am I risking data corruption through the repeated
    rebuilds? Should I just shut the server down until the replacement UPS
    battery arrives? Or, given that there haven't actually been any
    hardware drive failures, are the RAID structure and filesystem robust
    enough in the meanwhile?

    --
    PERTH ----> *
    Cpu(s): 48.6% us, 3.3% sy, 0.6% ni, 45.8% id, 1.5% wa, 0.1% hi, 0.0% si
    Mem: 515416k total, 481280k used, 34136k free, 8212k buffers
    Swap: 3052208k total, 1206032k used, 1846176k free, 56704k cached

  2. Re: Am I endangering my RAID 5 array?

    Yeechang Lee wrote

    > At home I have a Linux desktop and a headless
    > Linux fileserver with a software RAID 5 array (see
    >
    > for details).


    > A few days ago the UPS battery conked out. Besides the battery alarm
    > ringing every ten seconds and slowly driving me mad, the fileserver is
    > spontaneously rebooting several times a day; apparently momentary
    > dips and other irregularities in the power here in downtown San Francisco,
    > which the UPS had before filtered (and which likely prematurely aged
    > the battery after replacement only 19 months ago), are causing it to reboot.


    It's much more likely the UPS itself is the cause of the reboots.

    > (Interestingly, the Linux desktop hasn't hiccuped once;
    > apparently its power supply is less sensitive.)


    Yeah, that's not unusual.

    > The storage array is a pretty straightforward
    > JFS-on-LVM2-on-software-RAID 5 setup. Each time the server reboots it
    > usually causes the array to automatically rebuild. Sometimes the
    > reboots occur during the rebuilding process, causing it to restart.


    > My question: Am I risking data corruption through the repeated rebuilds?


    Yes.

    > Should I just shut the server down until the replacement UPS battery arrives?


    It would be better to just plug it into the mains without the UPS.

    > Or, given that there haven't actually been any
    > hardware drive failures, are the RAID structure
    > and filesystem robust enough in the meanwhile?


    You're risking a reboot in the middle of a write, and with some
    drives that can leave bad sectors behind.



  3. Re: Am I endangering my RAID 5 array?

    Rod Speed wrote:
    > It's much more likely the UPS itself is the cause of the reboots.


    Makes sense. Or, to put it more accurately, there are likely
    fluctuations in the power which are relatively harmless in real life
    but which the UPS dutifully tries to fix up anyway, and of course
    can't because the battery is out (it's a nice model, in which the
    battery is always providing the power regardless of whether mains
    power is actually available or not, thus eliminating downtime when the
    power does go out).

    > > Should I just shut the server down until the replacement UPS
    > > battery arrives?

    >
    > It would be better to just plug it into the mains without the UPS.


    I'll make the switch when I get home.

    > > My question: Am I risking data corruption through the repeated
    > > rebuilds?

    >
    > Yes.


    On the other hand, the only thing that's writing on the drive at the
    moment is BitTorrent downloads, and that is an inherently
    self-correcting mechanism, so I'm not too worried.

    Filesystemwise, I'll run a fsck of the entire RAID once I remove the
    UPS from the equation. (I'm curious as to how long it'll take on a
    2.8TB array!)

    --
    PERTH ----> *
    Cpu(s): 49.1% us, 3.3% sy, 0.6% ni, 45.3% id, 1.5% wa, 0.1% hi, 0.0% si
    Mem: 515416k total, 477372k used, 38044k free, 3720k buffers
    Swap: 3052208k total, 1244744k used, 1807464k free, 67164k cached

  4. Re: Am I endangering my RAID 5 array?

    Yeechang Lee wrote
    > Rod Speed wrote


    >> It's much more likely the UPS itself is the cause of the reboots.


    > Makes sense. Or, to put it more accurately, likely there
    > are fluctuations in the power which are relatively harmless
    > in real life but which the UPS dutifully tries to fix up
    > anyway, and of course can't because the battery is out


    It's much more likely that it isn't actually trying to switch
    to the battery that often because of sags in the mains.

    > (it's a nice model, in which the battery is always providing the
    > power regardless of whether power is actually available or not,


    And that's why it's likely not sags in the mains, just the UPS
    being unable to work properly with failing batteries. It has
    probably got a shorted cell, which means the voltage available
    from the battery isn't enough to produce a UPS output voltage
    high enough to keep the server power supply happy.

    > thus eliminating downtime when the power does go out).


    Yeah, always-on UPSs are by far the best approach.

    Though they do have that downside if the batteries have gone bad.

    I bet the reason the server reboots and the desktop
    doesn't is just that the server puts a much higher
    load on its power supply, so its internal caps can't
    ride through much of a sag in the output it sees from the UPS.

    >>> Should I just shut the server down until
    >>> the replacement UPS battery arrives?


    >> It would be better to just plug it into the mains without the UPS.


    > I'll make the switch when I get home.


    >>> My question: Am I risking data corruption
    >>> through the repeated rebuilds?


    >> Yes.


    > On the other hand, the only thing that's writing on
    > the drive at the moment is BitTorrent downloads,


    That's not correct while the array is rebuilding; the rebuild writes to the drives too.

    > and that is an inherently self-correcting
    > mechanism, so I'm not too worried.


    Sure, the main potential problem is that some drives
    don't handle a power down while writing very well and
    can produce bad sectors on the drive as a result.

    > Filesystemwise, I'll run a fsck of the entire RAID
    > once I remove the UPS from the equation. (I'm
    > curious as to how long it'll take on a 2.8TB array!)


    Yeah, it will be an interesting test.



  5. Re: Am I endangering my RAID 5 array?

    Yeechang Lee wrote:

    ....

    > On the other hand, the only thing that's writing on the drive at the
    > moment is BitTorrent downloads, and that is an inherently
    > self-correcting mechanism, so I'm not too worried.


    A more insidious problem could be corrupted parity data, which you might
    never see until some other failure occurred and you suddenly had to
    depend upon it. Validating (or, if that's not an available option, just
    forcing a complete rebuild of) the parity data after you've eliminated
    the problem of frequent restarts might be prudent (if you're truly
    paranoid, you'll back all the data up first).
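
    To make the stale-parity concern concrete, here is a minimal sketch
    of single-parity XOR on one stripe. The 4-drive layout, the block
    contents, and the helper names (xor_blocks, stripe_is_consistent) are
    invented for illustration; this is not how Linux md lays out or
    updates stripes, only the arithmetic behind the risk.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks, parity_block):
    """A stripe is consistent when parity equals the XOR of its data blocks."""
    return xor_blocks(data_blocks) == parity_block


# One stripe on a hypothetical 4-drive array: three data blocks plus parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Consistent stripe: a lost block is recovered from the survivors plus parity.
recovered = xor_blocks([data[0], data[2], parity])

# Interrupted write: block 0 reached the disk, the matching parity update did not.
data[0] = b"ZZZZ"
stale_parity = parity

# Normal reads still return good data, so nothing looks wrong. Only when
# block 1 has to be reconstructed does the stale parity produce garbage.
bad = xor_blocks([data[0], data[2], stale_parity])
```

    In this toy case every ordinary read still succeeds after the
    interrupted write; the damage only shows once a drive is lost and
    block 1 has to be rebuilt, which is why it helps to validate parity
    before you need it.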

    - bill

  6. Re: Am I endangering my RAID 5 array?

    Bill Todd wrote:
    > A more insidious problem could be corrupted parity data, which you
    > might never see until some other failure occurred and you suddenly
    > had to depend upon it. Validating (or, if that's not an available
    > option, just forcing a complete rebuild of) the parity data after
    > you've eliminated the problem of frequent restarts might be prudent


    Good point. However, I'm not aware of a way of forcing a parity
    rebuild in Linux software RAID except for marking a drive as failed
    then reinserting it into the array. In any case, the resulting resync
    shouldn't be any different than the automatic postboot resyncing the
    array is doing right now (after indeed having eliminated the
    random-restarting problem by bypassing the faulty UPS), right?
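
    For what it's worth, Linux md kernels that are recent enough expose
    a sysfs interface for exactly this: writing "check" to an array's
    md/sync_action file starts a read-only parity scan, and
    md/mismatch_cnt reports what it found. Below is a small sketch of
    driving it from Python. The array name md0 and the helper names are
    assumptions for illustration, and the files will not exist on older
    kernels.

```python
from pathlib import Path


def md_attr(array, name, sysfs_root="/sys/block"):
    """Path of an md sysfs attribute, e.g. /sys/block/md0/md/sync_action."""
    return Path(sysfs_root) / array / "md" / name


def start_parity_check(array="md0", sysfs_root="/sys/block"):
    """Ask md to read every stripe and count parity mismatches (needs root)."""
    md_attr(array, "sync_action", sysfs_root).write_text("check\n")


def current_action(array="md0", sysfs_root="/sys/block"):
    """What md is doing now, e.g. 'idle', 'check', or 'resync'."""
    return md_attr(array, "sync_action", sysfs_root).read_text().strip()


def read_mismatch_count(array="md0", sysfs_root="/sys/block"):
    """Mismatch counter reported by the most recent check or repair pass."""
    return int(md_attr(array, "mismatch_cnt", sysfs_root).read_text().strip())
```

    Typical use would be to call start_parity_check("md0") as root, poll
    current_action("md0") until it returns "idle", then look at
    read_mismatch_count("md0"). Writing "repair" instead of "check" asks
    md to rewrite parity where it disagrees with the data.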

    > (if you're truly paranoid, you'll back all the data up first).


    If you know of a cost- and time-effective way of backing up a 2.8TB
    storage array being used for personal purposes, please let me
    know. I'm not being flippant; if there is such a thing, I'd really
    like to know! But I'm pretty sure there isn't one.

    --
    PERTH ----> *
    Cpu(s): 50.5% us, 3.4% sy, 0.7% ni, 43.8% id, 1.5% wa, 0.1% hi, 0.0% si
    Mem: 515416k total, 479540k used, 35876k free, 11148k buffers
    Swap: 3052208k total, 1405240k used, 1646968k free, 92024k cached

  7. Re: Am I endangering my RAID 5 array?

    Yeechang Lee wrote:
    > Bill Todd wrote:
    >
    >>A more insidious problem could be corrupted parity data, which you
    >>might never see until some other failure occurred and you suddenly
    >>had to depend upon it. Validating (or, if that's not an available
    >>option, just forcing a complete rebuild of) the parity data after
    >>you've eliminated the problem of frequent restarts might be prudent

    >
    >
    > Good point. However, I'm not aware of a way of forcing a parity
    > rebuild in Linux software RAID except for marking a drive as failed
    > then reinserting it into the array. In any case, the resulting resync
    > shouldn't be any different than the automatic postboot resyncing the
    > array is doing right now (after indeed having eliminated the
    > random-restarting problem by bypassing the faulty UPS), right?


    I'm not sufficiently familiar with the Linux design to say. If it makes
    no attempt to log what it's doing and simply does a brute-force complete
    rebuild of *all* the parity information after an interruption, then yes.

    >
    >
    >>(if you're truly paranoid, you'll back all the data up first).

    >
    >
    > If you know of a cost- and time-effective way of backing up a 2.8TB
    > storage array being used for personal purposes, please let me
    > know. I'm not being flippant; if there is such a thing, I'd really
    > like to know! But I'm pretty sure there isn't one.


    What is cost- and time-effective really depends upon the relationship
    between the value you place on your data and the value you place on
    other things. Or, to look at it another way, data you don't back up is
    by definition not worth backing up (which means that any data that *is*
    worth backing up must be placed on storage which it is feasible to back up).

    A solid RAID implementation has its own built-in paranoia and shouldn't
    make an already-bad situation worse during a rebuild (e.g., if it finds
    a hard-to-read sector it will *really* try to read it rather than
    immediately go to the rest of the stripe to rebuild it, just in case
    whatever affected the original sector may have left something in the
    rest of the stripe - parity being the most obvious possibility, since it
    would have been being written at about the same time - inconsistent as
    well). How solid the Linux implementation is in this regard I don't know.

    - bill

  8. Re: Am I endangering my RAID 5 array?

    In comp.sys.ibm.pc.hardware.storage Yeechang Lee wrote:
    > At home I have a Linux desktop and a headless Linux fileserver with a
    > software RAID 5 array (see
    >
    > for details).


    > A few days ago the UPS battery conked out. Besides the battery alarm
    > ringing every ten seconds and slowly driving me mad, the fileserver is
    > spontaneously rebooting several times a day; apparently momentary dips
    > and other irregularities in the power here in downtown San Francisco,
    > which the UPS had before filtered (and which likely prematurely aged
    > the battery after replacement only 19 months ago), are causing it to
    > reboot. (Interestingly, the Linux desktop hasn't hiccuped once;
    > apparently its power supply is less sensitive.)


    > The storage array is a pretty straightforward
    > JFS-on-LVM2-on-software-RAID 5 setup. Each time the server reboots it
    > usually causes the array to automatically rebuild. Sometimes the
    > reboots occur during the rebuilding process, causing it to restart.


    > My question: Am I risking data corruption through the repeated
    > rebuilds? Should I just shut the server down until the replacement UPS
    > battery arrives? Or, given that there haven't actually been any
    > hardware drive failures, are the RAID structure and filesystem robust
    > enough in the meanwhile?



    This is a bit strange. Is this still a 2.4.x or an older 2.6.x kernel?
    The newer ones only rebuild if the array was dirty.
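
    The "only rebuild if dirty" idea can be sketched with a toy model:
    the array is flagged dirty before a write and clean again once the
    write has settled, so a resync is only needed when a crash lands in
    between. The ToyArray class and its method names are made up for
    illustration and are not the md driver's actual logic.

```python
class ToyArray:
    """Toy model of a clean/dirty flag; not the real md implementation."""

    def __init__(self):
        self.dirty = False
        self.resyncs = 0

    def write(self, crash=False):
        self.dirty = True       # flag set before any data or parity moves
        if crash:
            return              # power lost mid-write: flag stays dirty
        self.dirty = False      # write settled: array marked clean again

    def assemble(self):
        """Called at boot: resync only if the last shutdown left it dirty."""
        if self.dirty:
            self.resyncs += 1
            self.dirty = False
            return "resync"
        return "clean"


array = ToyArray()
array.write()                   # completed write, then a reboot
first_boot = array.assemble()   # no resync needed
array.write(crash=True)         # reboot in the middle of a write
second_boot = array.assemble()  # resync needed
```

    Under this model a power dip that hits an idle array costs nothing,
    while one that hits during a write triggers a resync on the next
    boot.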

    On the question of risk: Yes, there is some risk, but more that the
    JFS gets corrupted (unless you have switched write-buffering off on
    the disks, which AFAIK cannot be done reliably at the moment) than
    that the array itself dies. At least that is my intuition with Linux
    software RAID. You also have a pretty high risk of the PSU in that
    system dying, so I would take the machine offline.

    Arno

