
Thread: sata sil3114 vs. certain seagate drives results in filesystem corruptions

  1. sata sil3114 vs. certain seagate drives results in filesystem corruptions

    Dear all,

    I finally managed to find a *reproducible* setup and a way to trigger
    random corruptions using a SATA SiI 3114 controller connected to 4
    Seagate drives:

    port 1: ST3400832AS sda
    port 2: ST3400620AS sdb
    port 3: ST3750640AS sdc
    port 4: ST3750640AS sdd

    sda & sdb form md0 via a raid1 setup, followed by an additional
    device-mapper layer (root). sdc and sdd are separate and also each
    have an additional device-mapper layer (public and backups,
    respectively).
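
    For readers who want to rebuild a comparable stack: the posts never say
    which device-mapper target sits on top of md0 (LVM, dm-crypt, ...), so
    the following sketch simply uses a plain dm-linear mapping as a
    stand-in:

    # RAID1 md0 from the first two drives, then a device-mapper target on
    # top (dm-linear here, purely as a placeholder for the real target)
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb
    SIZE=$(blockdev --getsz /dev/md0)   # size in 512-byte sectors
    echo "0 $SIZE linear /dev/md0 0" | dmsetup create root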

    Now when I write large files of zeros to root (sda & sdb) and read the
    file back in, it contains a few nonzero entries:

    # dd if=/dev/zero of=/foo bs=1M count=2000
    # hexdump /foo
    0000000 0000 0000 0000 0000 0000 0000 0000 0000
    *
    <at ~1GB: random parts, within large blocks of zeroes>

    I can reliably trigger this on the md0 / devmapper-root setup when I
    write about 2GB of data (note that this machine has 1.5G of memory - and
    still 1GB is often enough to see this problem). Here it does not matter
    where in the filesystem I do these writes.
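
    For anyone trying to reproduce this, the whole check can be scripted
    like this (a sketch: /foo and the 2000MB count are the same as above;
    the cache drop is only there to force re-reading from disk):

    # write ~2GB of zeros, force it out, drop the page cache, then list
    # any nonzero bytes with their offsets (cmp stops at /foo's EOF, so
    # comparing against the endless /dev/zero is fine)
    dd if=/dev/zero of=/foo bs=1M count=2000
    sync
    echo 3 > /proc/sys/vm/drop_caches
    cmp -l /foo /dev/zero | head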

    As a test I did the same on sdc / devmapper-public and
    sdd / devmapper-backups with even 30G of zeros. Nothing, no errors,
    everything is perfectly OK.

    So I thought that this is also the sil mod15write problem
    http://home-tj.org/wiki/index.php/Sil_m15w and applied patches 1 & 2
    from http://lkml.org/lkml/2007/10/11/115 (adding my two disks) and
    rebooted. Now there was some MOD15 stuff in dmesg for the two disks,
    but apart from the disks being even slower it was of no use - the
    corruption problem was still there (I then also tried patch 3 from
    Bernd, but that immediately caused oopses/fs errors). So it looks like
    the problem I am having is different...

    Now I remembered that this machine also has two idle Promise PDC20376
    SATA ports, on which I first tried the ST3400832AS (sda) and
    ST3400620AS (sdb) about a year ago
    http://lists.openwall.net/linux-kernel/2006/08/27/106 . At that time I
    just saw random error messages and then finally hangs - quoting Tejun
    Heo:

    "I see. your drive is reporting error for some reason and libata is
    failing to recover."

    Now sata_promise is converted to the new EH, so I simply gave it a go,
    i.e. I attached the ST3400832AS and ST3400620AS to the Promise
    controller, rebooted, and redid the experiments from above.

    No data corruptions whatsoever. I even ran the dd on all three
    devmapped mount points simultaneously with a size of 30GB each, still
    no corruption. However, the error messages I saw a year ago are back
    for the ST3400832AS and ST3400620AS attached to the Promise controller
    (see below).

    Please find all the details below:

    - uname

    Linux 2.6.23.1 #3 PREEMPT Fri Oct 19 20:39:45 CEST 2007 i686 GNU/Linux

    - lspci

    00:0e.0 RAID bus controller: Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)
    00:0e.0 0104: 1095:3114 (rev 02)

    00:08.0 RAID bus controller: Promise Technology, Inc. PDC20376 (FastTrak 376) (rev 02)
    00:08.0 0104: 105a:3376 (rev 02)

    - proc interrupts

    17: 4434549 IO-APIC-fasteoi sata_promise, sata_sil, ohci1394

    - dmesg

    sata_sil 0000:00:0e.0: version 2.3
    ACPI: PCI Interrupt 0000:00:0e.0[A] -> GSI 17 (level, low) -> IRQ 17
    sata_sil 0000:00:0e.0: Applying R_ERR on DMA activate FIS errata fix
    scsi3 : sata_sil
    scsi4 : sata_sil
    scsi5 : sata_sil
    scsi6 : sata_sil
    ata4: SATA max UDMA/100 cmd 0xf882e080 ctl 0xf882e08a bmdma 0xf882e000 irq 17
    ata5: SATA max UDMA/100 cmd 0xf882e0c0 ctl 0xf882e0ca bmdma 0xf882e008 irq 17
    ata6: SATA max UDMA/100 cmd 0xf882e280 ctl 0xf882e28a bmdma 0xf882e200 irq 17
    ata7: SATA max UDMA/100 cmd 0xf882e2c0 ctl 0xf882e2ca bmdma 0xf882e208 irq 17
    ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
    ata4.00: ATA-7: ST3400832AS, 3.01, max UDMA/133
    ata4.00: 781422768 sectors, multi 16: LBA48 NCQ (depth 0/32)
    ata4.00: configured for UDMA/100
    ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
    ata5.00: ATA-7: ST3400620AS, 3.AAE, max UDMA/133
    ata5.00: 781422768 sectors, multi 16: LBA48 NCQ (depth 0/32)
    ata5.00: configured for UDMA/100
    ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
    ata6.00: ATA-7: ST3750640AS, 3.AAE, max UDMA/133
    ata6.00: 1465149168 sectors, multi 16: LBA48 NCQ (depth 0/32)
    ata6.00: configured for UDMA/100
    ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
    ata7.00: ATA-7: ST3750640AS, 3.AAC, max UDMA/133
    ata7.00: 1465149168 sectors, multi 16: LBA48 NCQ (depth 0/32)
    ata7.00: configured for UDMA/100
    scsi 3:0:0:0: Direct-Access ATA ST3400832AS 3.01 PQ: 0 ANSI: 5
    sd 3:0:0:0: [sda] 781422768 512-byte hardware sectors (400088 MB)
    sd 3:0:0:0: [sda] Write Protect is off
    sd 3:0:0:0: [sda] Mode Sense: 00 3a 00 00
    sd 3:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    sd 3:0:0:0: [sda] 781422768 512-byte hardware sectors (400088 MB)
    sd 3:0:0:0: [sda] Write Protect is off
    sd 3:0:0:0: [sda] Mode Sense: 00 3a 00 00
    sd 3:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    sda: unknown partition table
    sd 3:0:0:0: [sda] Attached SCSI disk
    sd 3:0:0:0: Attached scsi generic sg0 type 0
    scsi 4:0:0:0: Direct-Access ATA ST3400620AS 3.AA PQ: 0 ANSI: 5
    sd 4:0:0:0: [sdb] 781422768 512-byte hardware sectors (400088 MB)
    sd 4:0:0:0: [sdb] Write Protect is off
    sd 4:0:0:0: [sdb] Mode Sense: 00 3a 00 00
    sd 4:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    sd 4:0:0:0: [sdb] 781422768 512-byte hardware sectors (400088 MB)
    sd 4:0:0:0: [sdb] Write Protect is off
    sd 4:0:0:0: [sdb] Mode Sense: 00 3a 00 00
    sd 4:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    sdb: unknown partition table
    sd 4:0:0:0: [sdb] Attached SCSI disk
    sd 4:0:0:0: Attached scsi generic sg1 type 0
    scsi 5:0:0:0: Direct-Access ATA ST3750640AS 3.AA PQ: 0 ANSI: 5
    sd 5:0:0:0: [sdc] 1465149168 512-byte hardware sectors (750156 MB)
    sd 5:0:0:0: [sdc] Write Protect is off
    sd 5:0:0:0: [sdc] Mode Sense: 00 3a 00 00
    sd 5:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    sd 5:0:0:0: [sdc] 1465149168 512-byte hardware sectors (750156 MB)
    sd 5:0:0:0: [sdc] Write Protect is off
    sd 5:0:0:0: [sdc] Mode Sense: 00 3a 00 00
    sd 5:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    sdc: unknown partition table
    sd 5:0:0:0: [sdc] Attached SCSI disk
    sd 5:0:0:0: Attached scsi generic sg2 type 0
    scsi 6:0:0:0: Direct-Access ATA ST3750640AS 3.AA PQ: 0 ANSI: 5
    sd 6:0:0:0: [sdd] 1465149168 512-byte hardware sectors (750156 MB)
    sd 6:0:0:0: [sdd] Write Protect is off
    sd 6:0:0:0: [sdd] Mode Sense: 00 3a 00 00
    sd 6:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    sd 6:0:0:0: [sdd] 1465149168 512-byte hardware sectors (750156 MB)
    sd 6:0:0:0: [sdd] Write Protect is off
    sd 6:0:0:0: [sdd] Mode Sense: 00 3a 00 00
    sd 6:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    sdd: unknown partition table
    sd 6:0:0:0: [sdd] Attached SCSI disk
    sd 6:0:0:0: Attached scsi generic sg3 type 0


    - promise errors:

    ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x2
    ata1.00: port_status 0x20200000
    ata1.00: cmd 25/00:00:c0:b6:74/00:01:20:00:00/e0 tag 0 cdb 0x0 data 131072 in
    res 51/0c:00:c0:b6:74/0c:01:20:00:00/e0 Emask 0x10 (ATA bus error)
    ata1: soft resetting port
    ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
    ata1.00: configured for UDMA/133
    ata1: EH complete
    sd 0:0:0:0: [sda] 781422768 512-byte hardware sectors (400088 MB)
    sd 0:0:0:0: [sda] Write Protect is off
    sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
    sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x2
    ata2.00: port_status 0x20200000
    ata2.00: cmd c8/00:00:40:16:fe/00:00:00:00:00/e1 tag 0 cdb 0x0 data 131072 in
    res 51/0c:00:40:16:fe/00:00:00:00:00/e1 Emask 0x10 (ATA bus error)
    ata2: soft resetting port
    ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
    ata2.00: configured for UDMA/133
    ata2: EH complete
    sd 1:0:0:0: [sdb] 781422768 512-byte hardware sectors (400088 MB)
    sd 1:0:0:0: [sdb] Write Protect is off
    sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
    sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x2
    ata2.00: port_status 0x20200000
    ata2.00: cmd c8/00:50:58:25:e3/00:00:00:00:00/ec tag 0 cdb 0x0 data 40960 in
    res 51/0c:50:58:25:e3/00:00:00:00:00/ec Emask 0x10 (ATA bus error)
    ata2: soft resetting port
    ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
    ata2.00: configured for UDMA/133
    ata2: EH complete
    sd 1:0:0:0: [sdb] 781422768 512-byte hardware sectors (400088 MB)
    sd 1:0:0:0: [sdb] Write Protect is off
    sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
    sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x2
    ata2.00: port_status 0x20200000
    ata2.00: cmd c8/00:08:b0:54:b0/00:00:00:00:00/ed tag 0 cdb 0x0 data 4096 in
    res 51/0c:08:b0:54:b0/00:00:00:00:00/ed Emask 0x10 (ATA bus error)
    ata2: soft resetting port
    ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
    ata2.00: configured for UDMA/133
    ata2: EH complete
    sd 1:0:0:0: [sdb] 781422768 512-byte hardware sectors (400088 MB)
    sd 1:0:0:0: [sdb] Write Protect is off
    sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
    sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x2
    ata2.00: port_status 0x20200000
    ata2.00: cmd 25/00:00:f8:af:c0/00:02:12:00:00/e0 tag 0 cdb 0x0 data 262144 in
    res 51/0c:00:f8:af:c0/0c:02:12:00:00/e0 Emask 0x10 (ATA bus error)
    ata2: soft resetting port
    ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
    ata2.00: configured for UDMA/133
    ata2: EH complete
    sd 1:0:0:0: [sdb] 781422768 512-byte hardware sectors (400088 MB)
    sd 1:0:0:0: [sdb] Write Protect is off
    sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
    sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x2
    ata2.00: port_status 0x20200000
    ata2.00: cmd c8/00:00:60:f5:e2/00:00:00:00:00/e2 tag 0 cdb 0x0 data 131072 in
    res 51/0c:00:60:f5:e2/0c:02:12:00:00/e2 Emask 0x10 (ATA bus error)
    ata2: soft resetting port
    ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
    ata2.00: configured for UDMA/133
    ata2: EH complete
    sd 1:0:0:0: [sdb] 781422768 512-byte hardware sectors (400088 MB)
    sd 1:0:0:0: [sdb] Write Protect is off
    sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
    sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x2
    ata2.00: port_status 0x20200000
    ata2.00: cmd 25/00:f0:80:54:da/00:00:12:00:00/e0 tag 0 cdb 0x0 data 122880 in
    res 51/0c:f0:80:54:da/0c:00:12:00:00/e0 Emask 0x10 (ATA bus error)
    ata2: soft resetting port
    ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
    ata2.00: configured for UDMA/133
    ata2: EH complete
    sd 1:0:0:0: [sdb] 781422768 512-byte hardware sectors (400088 MB)
    sd 1:0:0:0: [sdb] Write Protect is off
    sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
    sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

    Open for suggestions/ideas,
    Soeren.

  2. Re: sata sil3114 vs. certain seagate drives results in filesystem corruptions

    Hello,

    Soeren Sonnenburg wrote:
    > I finally managed to find a *reproducible* setup and way to trigger
    > random corruptions using a sata sil 3114 controller connected to 4
    > seagate drives
    >
    > port 1: ST3400832AS sda
    > port 2: ST3400620AS sdb
    > port 3: ST3750640AS sdc
    > port 4: ST3750640AS sdd
    >
    > sda & sdb form md0 via a raid1 setup followed by an additional
    > devicemapper layer ( root ). sdc and sdd are separate and also have an
    > additional device mapper layer ( public ) and ( backups ).
    >
    > Now when I write large files of zeros to root(sda&sdb) and read the file
    > back in it contains a few nonzero entries:
    >
    > # dd if=/dev/zero of=/foo bs=1M count=2000
    > # hexdump /foo
    > 0000000 0000 0000 0000 0000 0000 0000 0000 0000
    > *
    > <at ~1GB: random parts, within large blocks of zeroes>
    >
    > I can reliably trigger this on the md0 / devmapper-root setup when I
    > write about 2GB of data (note that this machine has 1.5G of memory - and
    > still 1GB is often enough to see this problem). Here it does not matter
    > where in the filesystem I do these writes.


    Thanks. I'll try to reproduce the problem here. What's your motherboard?

    > Now sata_promise is converted to the new EH, so I simply gave it a go, i.e.
    > I attached ST3400832AS and ST3400620AS to the promise controller and
    > rebooted and redid the experiments from above.
    >
    > No data corruptions whatsoever. I even ran the dd on all three devmapped
    > mount points simultaneously with a size of 30GB each, still no
    > corruption. However the error messages I saw a year ago are back for
    > the ST3400832AS and ST3400620AS attached to the promise controller (see
    > below).

    [--snip--]
    > ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x2
    > ata1.00: port_status 0x20200000
    > ata1.00: cmd 25/00:00:c0:b6:74/00:01:20:00:00/e0 tag 0 cdb 0x0 data 131072 in
    > res 51/0c:00:c0:b6:74/0c:01:20:00:00/e0 Emask 0x10 (ATA bus error)
    > ata1: soft resetting port


    Yeah, still the same. Your drives don't like the way the Promise
    controller speaks to them, but now that sata_promise has proper EH it
    can recover from those errors. As long as nothing worse happens, it
    should be okay.

    Thanks.

    --
    tejun

  3. Re: sata sil3114 vs. certain seagate drives results in filesystem corruptions

    On Mon, 2007-10-22 at 11:12 +0900, Tejun Heo wrote:
    > Hello,
    >
    > Soeren Sonnenburg wrote:
    > > I finally managed to find a *reproducible* setup and way to trigger
    > > random corruptions using a sata sil 3114 controller connected to 4
    > > seagate drives
    > >
    > > port 1: ST3400832AS sda
    > > port 2: ST3400620AS sdb
    > > port 3: ST3750640AS sdc
    > > port 4: ST3750640AS sdd
    > >
    > > sda & sdb form md0 via a raid1 setup followed by an additional
    > > devicemapper layer ( root ). sdc and sdd are separate and also have an
    > > additional device mapper layer ( public ) and ( backups ).
    > >
    > > Now when I write large files of zeros to root(sda&sdb) and read the file
    > > back in it contains a few nonzero entries:
    > >
    > > # dd if=/dev/zero of=/foo bs=1M count=2000
    > > # hexdump /foo
    > > 0000000 0000 0000 0000 0000 0000 0000 0000 0000
    > > *
    > > <at ~1GB: random parts, within large blocks of zeroes>
    > >
    > > I can reliably trigger this on the md0 / devmapper-root setup when I
    > > write about 2GB of data (note that this machine has 1.5G of memory - and
    > > still 1GB is often enough to see this problem). Here it does not matter
    > > where in the filesystem I do these writes.

    >
    > Thanks. I'll try to reproduce the problem here. What's your motherboard?


    It is an Asus A7V8X with an AMD Athlon(TM) XP 3000+ and admittedly
    almost completely filled PCI slots (4 DVB cards, 1 with the sil3114; 1
    empty; in the AGP slot a Radeon 9200). Nevertheless I would not expect
    the power supply to be the problem (it was recently replaced by a 500W
    one), and cooling should be sufficient (it is winter in Germany, plus
    several fans).

    > > Now sata_promise is converted to the new EH, so I simply gave it a go, i.e.
    > > I attached ST3400832AS and ST3400620AS to the promise controller and
    > > rebooted and redid the experiments from above.
    > >
    > > No data corruptions whatsoever. I even ran the dd on all three devmapped
    > > mount points simultaneously with a size of 30GB each, still no
    > > corruption. However the error messages I saw a year ago are back for
    > > the ST3400832AS and ST3400620AS attached to the promise controller (see
    > > below).

    > [--snip--]
    > > ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x2
    > > ata1.00: port_status 0x20200000
    > > ata1.00: cmd 25/00:00:c0:b6:74/00:01:20:00:00/e0 tag 0 cdb 0x0 data 131072 in
    > > res 51/0c:00:c0:b6:74/0c:01:20:00:00/e0 Emask 0x10 (ATA bus error)
    > > ata1: soft resetting port

    >
    > Yeah, still the same. Your drives don't like the way the Promise
    > controller speaks to them, but now that sata_promise has proper EH it
    > can recover from those errors. As long as nothing worse happens, it
    > should be okay.


    These errors only appear when I generate some stress (like with the
    dd). The machine has now been up 2 days 8 hrs with no further such
    warnings in the log.

    Soeren

  4. Re: sata sil3114 vs. certain seagate drives results in filesystem corruptions

    Hello,

    On Monday 22 October 2007 04:12:44 Tejun Heo wrote:
    > Hello,
    >
    > Soeren Sonnenburg wrote:
    > > I finally managed to find a *reproducible* setup and way to trigger
    > > random corruptions using a sata sil 3114 controller connected to 4
    > > seagate drives
    > >
    > > port 1: ST3400832AS sda
    > > port 2: ST3400620AS sdb
    > > port 3: ST3750640AS sdc
    > > port 4: ST3750640AS sdd
    > >
    > > sda & sdb form md0 via a raid1 setup followed by an additional
    > > devicemapper layer ( root ). sdc and sdd are separate and also have an
    > > additional device mapper layer ( public ) and ( backups ).
    > >
    > > Now when I write large files of zeros to root(sda&sdb) and read the file
    > > back in it contains a few nonzero entries:
    > >
    > > # dd if=/dev/zero of=/foo bs=1M count=2000
    > > # hexdump /foo
    > > 0000000 0000 0000 0000 0000 0000 0000 0000 0000
    > > *
    > > <at ~1GB: random parts, within large blocks of zeroes>
    > >
    > > I can reliably trigger this on the md0 / devmapper-root setup when I
    > > write about 2GB of data (note that this machine has 1.5G of memory - and
    > > still 1GB is often enough to see this problem). Here it does not matter
    > > where in the filesystem I do these writes.


    That's almost the same test I'm always doing, only I do not write just
    2GB but as much as fits onto the disk. On reading back this file, the
    filesystem will report errors somewhere between 50GB and 230GB (disk
    size is 250GB).
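
    A sketch of that fill-the-disk variant (Bernd's exact commands are not
    shown in the thread; the mount point is made up):

    # fill the filesystem until ENOSPC, flush and drop caches, then read
    # everything back; on a corrupting setup the read side reports errors
    dd if=/dev/zero of=/mnt/test/fill bs=1M
    sync
    echo 3 > /proc/sys/vm/drop_caches
    dd if=/mnt/test/fill of=/dev/null bs=1M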

    >
    > Thanks. I'll try to reproduce the problem here. What's your motherboard?


    All the boards tested here are S2882s.


    Cheers,
    Bernd

    --
    Bernd Schubert
    Q-Leap Networks GmbH

  5. Re: sata sil3114 vs. certain seagate drives results in filesystem corruptions

    On Mon, 2007-10-22 at 11:48 +0200, Bernd Schubert wrote:
    > Hello,
    >
    > On Monday 22 October 2007 04:12:44 Tejun Heo wrote:
    > > Hello,
    > > [...]
    > > > Now when I write large files of zeros to root(sda&sdb) and read the file
    > > > back in it contains a few nonzero entries:
    > > >
    > > > # dd if=/dev/zero of=/foo bs=1M count=2000
    > > > # hexdump /foo
    > > > 0000000 0000 0000 0000 0000 0000 0000 0000 0000
    > > > *
    > > > <at ~1GB: random parts, within large blocks of zeroes>
    > > >
    > > > I can reliably trigger this on the md0 / devmapper-root setup when I
    > > > write about 2GB of data (note that this machine has 1.5G of memory - and
    > > > still 1GB is often enough to see this problem). Here it does not matter
    > > > where in the filesystem I do these writes.

    >
    > That's almost the same test as I'm always doing. Only I do not write only 2GB,


    Well, when I read your mail I thought that I could be seeing exactly
    the same bug... it still may be. However, "my" problem does not go away
    with the mod15 fix...

    > but as much as it fits onto the disk. On reading back this file, the
    > filesystem will report errors somewhere between 50GB and 230GB (disk size is
    > 250GB).


    Wow, I really see lots of corruptions (well, every 1-2 GB a couple of
    bytes are corrupted). Are you getting similarly many in the 50G - 230G
    region?
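
    A quick way to count the corrupted bytes, assuming the zero-fill test
    from above:

    cmp -l /foo /dev/zero | wc -l   # prints one line per differing byte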

    > > Thanks. I'll try to reproduce the problem here. What's your motherboard?

    >
    > All tested S2882 boards here.


    I assume they are all equipped with lots of memory and mostly empty PCI
    slots?

    Soeren

  6. Re: sata sil3114 vs. certain seagate drives results in filesystem corruptions

    On Monday 22 October 2007 12:36:32 Soeren Sonnenburg wrote:
    > > but as much as it fits onto the disk. On reading back this file, the
    > > filesystem will report errors somewhere between 50GB and 230GB (disk size
    > > is 250GB).

    >
    > Wow, I really see lots of corruptions (well every 1-2 GB a couple of
    > bytes are corrupted). Are you getting similarly many in the 50G - 230G
    > region?


    I never tested what is corrupted. Well, a diff over 250GB would take quite a
    lot of time...

    --
    Bernd Schubert
    Q-Leap Networks GmbH

  7. Re: sata sil3114 vs. certain seagate drives results in filesystem corruptions

    On Monday 22 October 2007 12:36:32 Soeren Sonnenburg wrote:
    > On Mon, 2007-10-22 at 11:48 +0200, Bernd Schubert wrote:
    > > Hello,
    > >
    > > On Monday 22 October 2007 04:12:44 Tejun Heo wrote:
    > > > Hello,
    > > > [...]
    > > >
    > > > > Now when I write large files of zeros to root(sda&sdb) and read the
    > > > > file back in it contains a few nonzero entries:
    > > > >
    > > > > # dd if=/dev/zero of=/foo bs=1M count=2000
    > > > > # hexdump /foo
    > > > > 0000000 0000 0000 0000 0000 0000 0000 0000 0000
    > > > > *
    > > > > <at ~1GB: random parts, within large blocks of zeroes>
    > > > >
    > > > > I can reliably trigger this on the md0 / devmapper-root setup when I
    > > > > write about 2GB of data (note that this machine has 1.5G of memory -
    > > > > and still 1GB is often enough to see this problem). Here it does not
    > > > > matter where in the filesystem I do these writes.

    > >
    > > That's almost the same test as I'm always doing. Only I do not write only
    > > 2GB,

    >
    > Well when I read your mail I thought that I could be seeing exactly the
    > same bug... it still may be. However ``my'' problem does not go away
    > with the mod15fix ...


    Yeah, pity it did not fix it. I will try to port Tejun's patch
    (http://home-tj.org/wiki/index.php/Sil_m15w#Patches) to 2.6.23 today or
    tomorrow. If you are testing anyway, could you then also try this?

    >
    > > but as much as it fits onto the disk. On reading back this file, the
    > > filesystem will report errors somewhere between 50GB and 230GB (disk size
    > > is 250GB).

    >
    > Wow, I really see lots of corruptions (well every 1-2 GB a couple of
    > bytes are corrupted). Are you getting similarly many in the 50G - 230G
    > region?
    >
    > > > Thanks. I'll try to reproduce the problem here. What's your
    > > > motherboard?

    > >
    > > All tested S2882 boards here.

    >
    > I assume all equipped with lots of memory and mostly empty pci slots?


    Yes, all PCI slots are free and the systems have between 4 and 16GB of
    memory (ECC, monitored with EDAC). Well, those are cluster systems
    (actually Tyan names those B2882).
    Do you think the configuration is related? Here it also happens with
    O_DIRECT; we tested this to minimize memory effects.
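
    For reference, a minimal sketch of such an O_DIRECT test with GNU dd
    (the exact invocation is not given in the thread; the path is made up):

    # O_DIRECT on both the write and the read-back bypasses the page
    # cache, so any mismatch must come from the storage path
    dd if=/dev/zero of=/mnt/test/zeros bs=1M count=2000 oflag=direct
    dd if=/mnt/test/zeros of=/dev/null bs=1M iflag=direct
    cmp -l /mnt/test/zeros /dev/zero | head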


    Cheers,
    Bernd


    --
    Bernd Schubert
    Q-Leap Networks GmbH

  8. Re: sata sil3114 vs. certain seagate drives results in filesystem corruptions

    On Mon, 2007-10-22 at 13:02 +0200, Bernd Schubert wrote:
    > On Monday 22 October 2007 12:36:32 Soeren Sonnenburg wrote:
    > > > but as much as it fits onto the disk. On reading back this file, the
    > > > filesystem will report errors somewhere between 50GB and 230GB (disk size
    > > > is 250GB).

    > >
    > > Wow, I really see lots of corruptions (well every 1-2 GB a couple of
    > > bytes are corrupted). Are you getting similarly many in the 50G - 230G
    > > region?

    >
    > I never tested what is corrupted. Well, a diff over 250GB would take quite a
    > lot of time...


    Actually hexdump does not display duplicate lines, so if your file is
    really all zeros it will only display a single line, a '*', and the
    final offset; I don't think it is particularly optimized beyond that,
    though...
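
    For example, hexdumping 1MB of zeros prints just the first line, a '*'
    for the squeezed repeats, and the final offset (hexdump -v would print
    every line instead):

    # dd if=/dev/zero bs=1M count=1 2>/dev/null | hexdump
    0000000 0000 0000 0000 0000 0000 0000 0000 0000
    *
    0100000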

    Soeren

  9. Re: sata sil3114 vs. certain seagate drives results in filesystem corruptions

    On Mon, 2007-10-22 at 12:59 +0200, Bernd Schubert wrote:
    > On Monday 22 October 2007 12:36:32 Soeren Sonnenburg wrote:
    > > On Mon, 2007-10-22 at 11:48 +0200, Bernd Schubert wrote:
    > > > Hello,
    > > >
    > > > On Monday 22 October 2007 04:12:44 Tejun Heo wrote:
    > > > > Hello,
    > > > > [...]
    > > > >
    > > > > > Now when I write large files of zeros to root(sda&sdb) and read the
    > > > > > file back in it contains a few nonzero entries:
    > > > > >
    > > > > > # dd if=/dev/zero of=/foo bs=1M count=2000
    > > > > > # hexdump /foo
    > > > > > 0000000 0000 0000 0000 0000 0000 0000 0000 0000
    > > > > > *
    > > > > > <at ~1GB: random parts, within large blocks of zeroes>
    > > > > >
    > > > > > I can reliably trigger this on the md0 / devmapper-root setup when I
    > > > > > write about 2GB of data (note that this machine has 1.5G of memory -
    > > > > > and still 1GB is often enough to see this problem). Here it does not
    > > > > > matter where in the filesystem I do these writes.
    > > >
    > > > That's almost the same test as I'm always doing. Only I do not write only
    > > > 2GB,

    > >
    > > Well when I read your mail I thought that I could be seeing exactly the
    > > same bug... it still may be. However ``my'' problem does not go away
    > > with the mod15fix ...

    >
    > Yeah, pity it did not fix it. I will try to port Tejun's patch
    > (http://home-tj.org/wiki/index.php/Sil_m15w#Patches) to 2.6.23 today or
    > tomorrow. If you are testing anyway, could you then also try this?


    Hmmhh, dmesg said the m15 fix was turned on (at least it appeared for
    the 2 drives in question), so I fear it is something different. On the
    other hand this is a 'production' machine, so I am not too eager to try
    very experimental things...

    > > > but as much as it fits onto the disk. On reading back this file, the
    > > > filesystem will report errors somewhere between 50GB and 230GB (disk size
    > > > is 250GB).

    > >
    > > Wow, I really see lots of corruptions (well every 1-2 GB a couple of
    > > bytes are corrupted). Are you getting similarly many in the 50G - 230G
    > > region?
    > >
    > > > > Thanks. I'll try to reproduce the problem here. What's your
    > > > > motherboard?
    > > >
    > > > All tested S2882 boards here.

    > >
    > > I assume all equipped with lots of memory and mostly empty pci slots?

    >
    > Yes, all pci-slots are free and the systems have between 4 and 16GB memory
    > (ecc, monitored with edac). Well, those are cluster systems (actually tyan
    > names those B2882).
    > Do you think the configuration is related? Here it also happens with odirect,
    > we tested this to minimize memory effects.


    Mine is just an A7V8X with a VIA KT400 chipset... really old, but
    several of the PCI slots are filled, so the problem may be more likely
    to happen here... on the other hand I never tried writing 50-250G on
    the drives I considered OK. Will do. Also, what could be helpful is
    checking whether we both see patterns in the corruptions, e.g. whether
    corruptions are always 512 bytes long or so (IIRC in my case they were
    only up to 64 bytes).
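
    For instance, the run lengths could be binned from the cmp output (a
    sketch; the awk grouping is made up here, and cmp -l offsets are
    1-based):

    # group consecutive differing offsets into runs and print a histogram
    # of the lengths of the corrupted stretches
    cmp -l /foo /dev/zero | awk '
        $1 != prev + 1 && len { print len; len = 0 }
        { len++; prev = $1 }
        END { if (len) print len }' | sort -n | uniq -c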

    Soeren
