Need help with corrupted Linux raid and filesystems

  1. Need help with corrupted Linux raid and filesystems

    Hi,

    My mailserver has had some problems with its hardware over the last months.
    The machine was totally reinstalled about 6 weeks ago, and has appeared to
    be OK, until last week when the 3COM NIC started to fail every 1 - 2 days.
    I shut down the machine today and replaced the NIC. When I tried to boot,
    Linux found no valid raid partitions.

    From the boot-screen
    md: Autodetecting RAID arrays
    md: invalid raid superblock magic on hda1
    md: hda1 has invalid sb, not importing!
    md: could not import hda1!
    md: invalid raid superblock magic on hda3
    .... (repeats itself for each partition)
    ...
    Kernel Panic: VFS: Unable to mount root fs on hda2

    The disks are partitioned as:
    /dev/hd?1 --> md0 - 100mb /boot (ext2)
    /dev/hd?2 --> md1 - 2gb swap
    /dev/hd?3 --> md2 - 8gb / (ext3)
    /dev/hd?4 --> md3 - 67gb /var/lib/courier/spool (xfs)
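
    The "invalid raid superblock magic" message can be checked by hand on
    each member partition. What follows is a minimal sketch, assuming the
    arrays use the old version-0.90 md superblock (the format used by kernel
    autodetection), which sits in the last 64 KiB-aligned block of the
    partition. The function names are illustrative, and little-endian byte
    order is assumed because the machine is x86.

```python
import struct

MD_SB_MAGIC = 0xA92B4EFC        # magic number of the version-0.90 md superblock
MD_RESERVED_BYTES = 64 * 1024   # superblock sits in the last 64 KiB-aligned block


def md090_superblock_offset(device_size: int) -> int:
    """Byte offset where a 0.90 superblock is expected for this device size."""
    return (device_size & ~(MD_RESERVED_BYTES - 1)) - MD_RESERVED_BYTES


def has_md090_superblock(path: str) -> bool:
    """Return True if the 0.90 magic is present at the expected offset."""
    with open(path, "rb") as dev:
        size = dev.seek(0, 2)               # seek to the end to learn the size
        offset = md090_superblock_offset(size)
        if offset < 0:                      # device smaller than 64 KiB
            return False
        dev.seek(offset)
        raw = dev.read(4)
    if len(raw) < 4:
        return False
    # Assumption: the superblock was written by a little-endian (x86) host.
    return struct.unpack("<I", raw)[0] == MD_SB_MAGIC
```

    The check only reads, so it could be run against an unaltered partition
    (for example the hd?1 member) or against an image file of it.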

    The machine is an (old) P4 1.6 GHz based PC/server with 1,256 GB RAM, 2 x
    Western Digital 80GB ATA disks in Linux raid 1 (mirror), Linux kernel 2.6.6
    or 8 (I'm not sure about the exact version, as I'm unable to access the
    disks) with lilo as boot-loader. Linux is configured for 4 CPUs and
    hyperthreading (to be fast if I boot it on my dual CPU workstation). The
    Linux distribution is Debian "woody" with some packages from backports.org
    to handle kernel 2.6.*

    I unmounted the disks, and placed one disk as a second disk in my Linux
    workstation. I edited /etc/raidtab on the workstations disk to fit the raid
    on the second disk, and ran "mkraid" on each of the failed raid devices.
    The command did not complain. Then I tried to mount the raid devices (as I
    have done successfully before a few times when Linux raid has failed).
    Mount did not recognize the file system type on any of the newly added raid
    devices. I tried to run "fsck.ext3" and "xfs_repair" in read-only mode.
    Both complained. I then ran "fsck.ext3 -y /dev/md2" and
    "xfs_repair /dev/md3". xfs_repair complained about a bad superblock, but
    found a second superblock. It then suggested mounting the disk to replay
    the logfile. Mount failed this time as well. I ran "xfs_repair -L /dev/md3",
    and this time it found _lots_ of errors. When I mounted the device, it
    turned out that all remaining files on that filesystem were in /lost+found.
    None had their original file-name. The ext3 filesystem on /dev/md2 was also
    destroyed, and the remaining files are in that filesystem's lost+found, most
    of them without their original names.

    If I run "fsck.ext3 -n /dev/md0", fsck complains about a missing superblock,
    and then starts to list lots of problems on the filesystem (like open
    files, bad mode ...). The /boot filesystem that resides there, should not
    have any open files ( - at least not for write), so the errors indicate
    that something is very wrong indeed.
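
    When fsck reports a missing primary superblock, ext2/ext3 normally keeps
    backup copies in later block groups. The sketch below estimates where
    those backups would be. It assumes the filesystem was created with
    default mke2fs parameters (8 * block_size blocks per group, and the
    sparse_super layout that stores backups in group 1 and in groups that
    are powers of 3, 5 and 7). The function names and the 102400-block size
    in the usage note are illustrative only.

```python
from typing import List


def is_power_of(n: int, base: int) -> bool:
    """True if n == base**k for some k >= 0."""
    while n > 1 and n % base == 0:
        n //= base
    return n == 1


def backup_superblock_blocks(total_blocks: int, block_size: int = 1024) -> List[int]:
    """Block numbers of the backup superblocks under default mke2fs settings."""
    blocks_per_group = 8 * block_size                   # mke2fs default (assumed)
    first_data_block = 1 if block_size == 1024 else 0   # 1 KiB blocks start at 1
    usable = total_blocks - first_data_block
    groups = (usable + blocks_per_group - 1) // blocks_per_group
    backups = []
    for group in range(1, groups):
        # sparse_super: group 1 (covered by base**0) and powers of 3, 5, 7
        if any(is_power_of(group, base) for base in (3, 5, 7)):
            backups.append(first_data_block + group * blocks_per_group)
    return backups
```

    For a filesystem of 102400 blocks of 1 KiB, the first backup lands at
    block 8193. A value like that could then be tried read-only with
    something along the lines of "fsck.ext3 -n -b 8193 /dev/md0", if the
    filesystem really was created with those defaults.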


    When I shut down the machine, I checked /var/log/kern.log, and all that
    caught my attention was problems with the network card. Aside from the
    network problem, the machine appeared to be fine. No error messages were
    printed to the console, as I would expect if Linux detected serious
    problems with the disks.

    I made a copy of the second disk, and the copy appears to have the same
    problems as the first one. (I tried to boot from the second disk, but when
    that failed with the same errors as the first one, I decided to leave the
    disk unaltered, and made a copy to a third disk that was used for the
    recovery attempts).
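
    Working only on a copy, as described above, can be sketched as a plain
    byte-for-byte image with a checksum, so that later copies can be compared
    against the untouched original. This is an illustration only: the
    function name and chunk size are arbitrary, and unlike dedicated rescue
    tools it simply stops with an exception on a read error.

```python
import hashlib


def image_device(src_path: str, dst_path: str, chunk_size: int = 1024 * 1024) -> str:
    """Copy src to dst in chunks and return the SHA-256 of the copied data."""
    digest = hashlib.sha256()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:          # end of device or file reached
                break
            dst.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()
```

    The source is opened read-only, so the unaltered disk is never written.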

    Using "less -f" on a non-altered disk I can see that /dev/hd?1 (boot/ext2)
    contains binary data. /dev/hd?3 (ext3/root filesystem) begins with a large
    block of zeros, then a large block of 0xff, then more zeros and then binary
    data. /dev/hd?4 (xfs/mailspool) begins with a small block of zeros, then
    binary data that looks like the second block on another machine with the
    xfs filesystem, and then binary data and text from email-messages on the
    filesystem.
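
    Eyeballing the start of each partition can be made a little more precise
    by looking for the filesystem signatures directly. A minimal sketch,
    assuming RAID 1 with the md superblock at the end of each member, so that
    the filesystem itself starts at byte 0 of the partition: ext2/ext3 stores
    the 16-bit magic 0xEF53 at byte 1080 (superblock at 1024, field offset
    56), and XFS starts with the four bytes "XFSB". The function names are
    illustrative.

```python
import struct

EXT_MAGIC = 0xEF53
EXT_MAGIC_OFFSET = 1024 + 56   # superblock at byte 1024, s_magic at offset 56
XFS_MAGIC = b"XFSB"            # first four bytes of an XFS primary superblock


def probe_filesystem(head: bytes) -> str:
    """Guess the filesystem type from the first bytes of a partition."""
    if head[:4] == XFS_MAGIC:
        return "xfs"
    if len(head) >= EXT_MAGIC_OFFSET + 2:
        (magic,) = struct.unpack_from("<H", head, EXT_MAGIC_OFFSET)
        if magic == EXT_MAGIC:
            return "ext2/ext3"
    return "unknown"


def probe_path(path: str) -> str:
    """Read the first 2 KiB of a partition or image file and probe it."""
    with open(path, "rb") as dev:
        return probe_filesystem(dev.read(2048))
```

    A result of "unknown" on an unaltered member partition would suggest that
    the start of the filesystem itself is damaged, and not only the raid
    superblock.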

    At first glance, it seems unlikely that all three filesystems on the disk(s)
    should be corrupted to this degree. The machine was rebooted about 30 hours
    before, and no problems (except for the NIC) surfaced until this last boot.
    Also, the partition table seems OK, and lilo works until the kernel is
    unable to mount the root file system. If the filesystem were corrupted with
    random data all over (as I first suspected), lilo or the Linux image should
    have aborted with an illegal instruction.

    I strongly suspect (and certainly hope!) that there is some obvious error in
    my recovery attempts so far. Please reply if you have a clue about how to
    get the raid and filesystems back. I hope to avoid reinstalling and
    restoring from backup, as the faulty NIC has caused the backup (to another
    machine) for the last few days to fail. Also - my email is not working at
    the moment - so please reply to this thread.

    Jarle
    --
    Jarle Aase http://www.jgaa.com
    mailto:jgaa@jgaa.com

    <<< no need to argue - just kill'em all! >>>

  2. Re: Need help with corrupted Linux raid and filesystems

    Jarle Aase wrote:
    : Hi,

    : My mailserver has had some problems with its hardware over the last months.
    : The machine was totally reinstalled about 6 weeks ago, and has appeared to
    : be OK, until last week when the 3COM NIC started to fail every 1 - 2 days.
    : I shut down the machine today and replaced the NIC. When I tried to boot,
    : Linux found no valid raid partitions.

    Since my first suspicion would be hardware issues- like maybe power
    supply or the like- first thing I'd do is to try mounting the
    disks on another system. If that's successful then you've
    confirmed that there's a local problem.

    Next after that I'd try booting from your recovery disks or
    Knoppix or the like to see if independent software does the trick.

    At that point you should know if you have a local software problem
    or a local hardware problem.

    Stan
    --
    Stan Bischof ("stan" at the below domain)
    www.worldbadminton.com

  3. Re: Need help with corrupted Linux raid and filesystems

    essteeaenn@worldbadminton.com wrote:

    > Since my first suspicion would be hardware issues- like maybe power
    > supply or the like- first thing I'd do is to try mounting the
    > disks on another system.


    I've tried to boot and mount the disks on two other machines. The behavior
    does not change.

    Jarle
    --
    Jarle Aase http://www.jgaa.com
    mailto:jgaa@jgaa.com

    <<< no need to argue - just kill'em all! >>>

  4. Re: Need help with corrupted Linux raid and filesystems

    Jarle Aase wrote:
    : essteeaenn@worldbadminton.com wrote:

    :> Since my first suspicion would be hardware issues- like maybe power
    :> supply or the like- first thing I'd do is to try mounting the
    :> disks on another system.

    : I've tried to boot and mount the disks on two other machines. The behavior
    : does not change.
    :

    Then either you're doing something grossly wrong or you have some
    fried disks. I'd be very cautious about mounting anything on that
    server since the odds are that if you have multiple fried disks
    there is a power supply problem that caused it.

    On the other hand if the disks are actually OK and the bits just
    got scrambled due to a FS glitch then all you'd have to do is to
    reformat and restore from your latest backup.

    Once you determine that the data is irretrievable try formatting
    one of the disks to see if the disk is OK.

    good luck

    Stan
    --
    Stan Bischof ("stan" at the below domain)
    www.worldbadminton.com
