Bug#419950: NETDEV WATCHDOG problems return - Debian

  1. Bug#419950: NETDEV WATCHDOG problems return

    Back in April, 2007, I opened this bug because of problems with
    "NETDEV WATCHDOG: eth0: transmit timed out" errors.

    I found that changing one of my BIOS settings seemed to make the
    problem go away. The setup menu named "Resource Configuration"
    offers a setting named "Shared PCI IRQs". I found that if I left
    this set to the original "Auto", then I would have the problems
    described in the bug report, making the network unusable; but if
    I changed this setting to "Share Three IRQs", then everything seemed
    to work OK.

    I ran like this for several months without seeing the problem.
    Sometime around October/November, in the midst of several kernel revisions,
    the problem returned briefly, but before I had time to investigate it,
    yet another kernel upgrade or two seemed to get me back to normal.
    (All these are the standard etch kernels 2.6.18-686, as pushed to me
    by security updates).

    This morning I finally got around to installing a big batch of recent
    security updates, including a new kernel, and I'm sorry to report that
    I'm seeing NETDEV WATCHDOG network paralysis again.

    Previous to rebooting this morning, the machine was up for 19 days,
    without problems -- previous boot was December 18th,
    Linux version 2.6.18-5-686 (Debian 2.6.18.dfsg.1-13etch5).
    Today's version ran OK for approximately 13 hours, then went into
    continuous network lockup. Today's version is:
    Linux version 2.6.18-5-686 (Debian 2.6.18.dfsg.1-17).

    Here is a little kern.log extract showing the
    end of the reboot, and the start of the lockup:

    Jan 6 01:16:26 legba kernel: IPv6 over IPv4 tunneling driver
    Jan 6 01:16:26 legba kernel: 0000:00:10.0: tulip_stop_rxtx() failed
    Jan 6 01:16:26 legba kernel: eth0: Setting full-duplex based on MII#1 link partner capability of 41e1.
    Jan 6 01:16:32 legba kernel: eth0: no IPv6 routers present
    Jan 6 01:16:33 legba kernel: lp0: using parport0 (interrupt-driven).
    Jan 6 01:16:33 legba kernel: ppdev: user-space parallel port driver
    Jan 6 04:09:07 legba kernel: 0000:00:10.0: tulip_stop_rxtx() failed
    Jan 6 04:11:06 legba kernel: 0000:00:10.0: tulip_stop_rxtx() failed
    Jan 6 04:18:21 legba kernel: 0000:00:10.0: tulip_stop_rxtx() failed
    Jan 6 14:37:02 legba kernel: 0000:00:10.0: tulip_stop_rxtx() failed
    Jan 6 14:37:11 legba kernel: NETDEV WATCHDOG: eth0: transmit timed out
    Jan 6 14:37:11 legba kernel: 0000:00:10.0: tulip_stop_rxtx() failed
    Jan 6 14:37:19 legba kernel: NETDEV WATCHDOG: eth0: transmit timed out
    Jan 6 14:37:19 legba kernel: 0000:00:10.0: tulip_stop_rxtx() failed
    Jan 6 14:37:27 legba kernel: NETDEV WATCHDOG: eth0: transmit timed out
    Jan 6 14:37:27 legba kernel: 0000:00:10.0: tulip_stop_rxtx() failed

    These 2 lines continued to repeat every 8 or 12 seconds until I rebooted.
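
    The repeat interval can be checked directly from the log. What follows
    is only a rough Python sketch: the function name watchdog_intervals and
    the year argument are made up for illustration (syslog timestamps carry
    no year), and it assumes the default C/English month abbreviations.
    The sample lines are copied from the extract above.

```python
import re
from datetime import datetime

# Matches syslog lines such as:
#   Jan 6 14:37:11 legba kernel: NETDEV WATCHDOG: eth0: transmit timed out
WATCHDOG = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+"
    r"NETDEV WATCHDOG: \w+: transmit timed out"
)


def watchdog_intervals(lines, year=2008):
    """Return the seconds between consecutive NETDEV WATCHDOG timeouts."""
    stamps = []
    for line in lines:
        match = WATCHDOG.match(line.strip())
        if match:
            # The year is an assumption supplied by the caller.
            stamps.append(
                datetime.strptime(f"{year} {match.group('stamp')}",
                                  "%Y %b %d %H:%M:%S")
            )
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


SAMPLE = """\
Jan 6 14:37:02 legba kernel: 0000:00:10.0: tulip_stop_rxtx() failed
Jan 6 14:37:11 legba kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jan 6 14:37:11 legba kernel: 0000:00:10.0: tulip_stop_rxtx() failed
Jan 6 14:37:19 legba kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jan 6 14:37:19 legba kernel: 0000:00:10.0: tulip_stop_rxtx() failed
Jan 6 14:37:27 legba kernel: NETDEV WATCHDOG: eth0: transmit timed out
"""

print(watchdog_intervals(SAMPLE.splitlines()))  # [8, 8]
```

    To run it against a real log, pass an open file instead of SAMPLE,
    for example open("/var/log/kern.log") (the usual Debian location,
    assumed here).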

    Any suggestions for what I should experiment with are welcome.
    I will happily provide any other information you might want.
    On this new reboot, I added the kernel parameter "pci=routeirq"
    as my own experiment with this, but the box has only been up for
    about an hour, so I can't say for sure if it helps.
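
    When experimenting with boot parameters like this, it helps to confirm
    that the running kernel actually received them. A minimal Python sketch
    follows; the helper name kernel_param_values is hypothetical, and the
    sample command line (including root=/dev/hda1) is invented for the
    example, not taken from this machine.

```python
def kernel_param_values(cmdline, key):
    """Collect the comma-separated values of every key=... token."""
    values = []
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if name == key and sep:
            values.extend(value.split(","))
    return values


# Hypothetical command line, for illustration only.
sample = "root=/dev/hda1 ro pci=routeirq"
print("routeirq" in kernel_param_values(sample, "pci"))  # True
```

    On a live system the string would come from /proc/cmdline, for example
    open("/proc/cmdline").read().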



    --
    To UNSUBSCRIBE, email to debian-bugs-dist-REQUEST@lists.debian.org
    with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org

  2. Bug#419950: NETDEV WATCHDOG problems return

    On Sun, 6 Jan 2008, Lou Poppler wrote:

    > Back in April, 2007, I opened this bug because of problems with
    > "NETDEV WATCHDOG: eth0: transmit timed out" errors.

    [snip]
    > Previous to rebooting this morning, the machine was up for 19 days,
    > without problems -- previous boot was December 18th,
    > Linux version 2.6.18-5-686 (Debian 2.6.18.dfsg.1-13etch5).
    > Today's version ran OK for approximately 13 hours, then went into
    > continuous network lockup. Today's version is:
    > Linux version 2.6.18-5-686 (Debian 2.6.18.dfsg.1-17).

    [snip]

    I found version (Debian 2.6.18.dfsg.1-13etch6) in my cache, and
    downgraded to it as a test. (I had skipped this version before, going
    straight from 2.6.18.dfsg.1-13etch5 to 2.6.18.dfsg.1-17.)
    1-13etch6 is also giving me the same problems with NETDEV WATCHDOG.

    Is it possible to find version 2.6.18.dfsg.1-13etch5 somewhere ?
    I don't seem to have it on my disks, and I don't see it on the ftp sites.

    I would like to try running it again, to test some more and verify
    that it really does work better here than the 2 later kernels.

    Thanks,
    Lou




  3. Bug#419950: NETDEV WATCHDOG problems return

    On Tue, Jan 08, 2008 at 03:44:11PM -0500, Lou Poppler wrote:
    > Is it possible to find version 2.6.18.dfsg.1-13etch5 somewhere ?


    snapshot.debian.net

    --
    maks




  4. Bug#419950: NETDEV WATCHDOG problems return

    On Tue, 8 Jan 2008, maximilian attems wrote:

    > On Tue, Jan 08, 2008 at 03:44:11PM -0500, Lou Poppler wrote:
    >> Is it possible to find version 2.6.18.dfsg.1-13etch5 somewhere ?

    >
    > snapshot.debian.net


    Thanks Maks,
    I re-installed 1-13etch5 and it also fails for me now.
    With any more than light load, the network card becomes unusable
    under 1-13etch5, 1-13etch6, and 1-17. Apparently something
    else besides the kernel has changed here to bring back the problem.

    I guess I had better re-install 1-17, and experiment with changing kernel
    options; but I don't really have any good ideas at this point.




  5. Bug#419950: eth ini. vs. ide ini.

    As long as I have had this computer, since some 2.6.8 sarge kernel,
    I have had occasional problems where the network goes bad, with these
    lines repeating forever in the syslog:
    Feb 14 06:44:10 legba kernel: NETDEV WATCHDOG: eth0: transmit timed out
    Feb 14 06:44:10 legba kernel: 0000:00:0f.0: tulip_stop_rxtx() failed

    Sometimes the problem does not occur, and everything runs just fine until
    I reboot the system, even if I pound on the network, trying to make it fail.
    Sometimes the problem shows up, even with moderate network load, and the
    network is _very_ sluggish until I reboot.

    So far, this does not seem to depend on the kernel version. Each kernel
    I've tried is bad sometimes, and occasionally will boot up OK.
    After combing through the logs, I have found a pattern which correlates
    with my problems. It looks like when I have the problem, there is some
    overlapping of the initialization messages for hda and for eth0; and when
    the machine is booting OK and will not have a problem, these initialization
    messages are separated in the logs. Here are some sample logs:

    Here is an extract from dmesg on 2008-04-01, still running today with
    no problems:
    > Linux Tulip driver version 1.1.13-NAPI (May 11, 2002)
    > PCI: Enabling device 0000:00:0f.0 (0114 -> 0117)
    > ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 10
    > PCI: setting IRQ 10 as level-triggered
    > ACPI: PCI Interrupt 0000:00:0f.0[A] -> Link [LNKC] -> GSI 10 (level, low) -> IRQ 10
    > tulip0: MII transceiver #1 config 1000 status 786d advertising 05e1.
    > eth0: ADMtek Comet rev 17 at 00011400, 00:14:BF:5C:E1:35, IRQ 10.
    > Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
    > ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
    > PIIX4: IDE controller at PCI slot 0000:00:07.1
    > PIIX4: chipset revision 1
    > PIIX4: not 100% native mode: will probe irqs later
    > ide0: BM-DMA at 0x1000-0x1007, BIOS settings: hdaio, hdbMA
    > ide1: BM-DMA at 0x1008-0x100f, BIOS settings: hdcMA, hddio
    > Probing IDE interface ide0...
    > usb 1-2: new full speed USB device using uhci_hcd and address 2
    > usb 1-2: configuration #1 chosen from 1 choice
    > hub 1-2:1.0: USB hub found
    > hub 1-2:1.0: 4 ports detected
    > hda: WDC WD800JB-00CRA1, ATA DISK drive
    > Time: acpi_pm clocksource has been installed.
    > hdb: ST3250623A, ATA DISK drive
    > ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
    > Probing IDE interface ide1...
    > hdc: Hewlett-Packard CD-Writer Plus 9100, ATAPI CD/DVD-ROM drive
    > ide1 at 0x170-0x177,0x376 on irq 15
    > hda: max request size: 128KiB
    > hda: 156301488 sectors (80026 MB) w/8192KiB Cache, CHS=65535/16/63, UDMA(33)
    > hda: cache flushes not supported
    > hda: hda1 hda2 hda3 hda4 < hda5 hda6 hda7 hda8 hda9 hda10 >
    > hdb: max request size: 512KiB
    > hdb: 488397168 sectors (250059 MB) w/16384KiB Cache, CHS=30401/255/63, UDMA(33)
    > hdb: cache flushes supported
    > hdb: hdb1 hdb2


    For contrast, here is a similar extract from 2008-03-16 dmesg, after which
    the network became bad under light bittorrent pressure:
    > Linux Tulip driver version 1.1.13-NAPI (May 11, 2002)
    > hda: WDC WD800JB-00CRA1, ATA DISK drive
    > Time: acpi_pm clocksource has been installed.
    > hdb: ST3250623A, ATA DISK drive
    > ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
    > Probing IDE interface ide1...
    > hdc: Hewlett-Packard CD-Writer Plus 9100, ATAPI CD/DVD-ROM drive
    > ide1 at 0x170-0x177,0x376 on irq 15
    > ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 9
    > PCI: setting IRQ 9 as level-triggered
    > ACPI: PCI Interrupt 0000:00:07.2[D] -> Link [LNKD] -> GSI 9 (level, low) -> IRQ 9
    > uhci_hcd 0000:00:07.2: UHCI Host Controller
    > uhci_hcd 0000:00:07.2: new USB bus registered, assigned bus number 1
    > uhci_hcd 0000:00:07.2: irq 9, io base 0x00001020
    > usb usb1: configuration #1 chosen from 1 choice
    > hub 1-0:1.0: USB hub found
    > hub 1-0:1.0: 2 ports detected
    > hda: max request size: 128KiB
    > hda: 156301488 sectors (80026 MB) w/8192KiB Cache, CHS=65535/16/63, UDMA(33)
    > hda: cache flushes not supported
    > hda:PCI: Enabling device 0000:00:0f.0 (0114 -> 0117)
    > ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 10
    > PCI: setting IRQ 10 as level-triggered
    > ACPI: PCI Interrupt 0000:00:0f.0[A] -> Link [LNKC] -> GSI 10 (level, low) -> IRQ 10
    > tulip0: MII transceiver #1 config 1000 status 786d advertising 05e1.
    > eth0: ADMtek Comet rev 17 at 00011400, 00:14:BF:5C:E1:35, IRQ 10.
    > hda1 hda2 hda3 hda4 < hda5 hda6 hda7 hda8 hda9 hda10 >
    > hdb: max request size: 512KiB
    > hdb: 488397168 sectors (250059 MB) w/16384KiB Cache, CHS=30401/255/63, UDMA(33)
    > hdb: cache flushes supported
    > hdb: hdb1 hdb2


    What I notice here is that the log message that should be 1 line like this:
    > hda: hda1 hda2 hda3 hda4 < hda5 hda6 hda7 hda8 hda9 hda10 >

    is split after the " hda:" in all the cases of an unsuccessful boot,
    with some of the ethernet initialization messages printed before the
    remaining part of the hda message
    " hda1 hda2 hda3 hda4 < hda5 hda6 hda7 hda8 hda9 hda10 >"

    From what I observe, this corresponds 100% with the bad network behavior.

    The kernel version currently running here is:
    Linux version 2.6.18-6-686 (Debian 2.6.18.dfsg.1-18etch1) (waldi@debian.org)

    I'm willing to try other kernel versions or parameters, and willing to
    provide any other info that might help someone understand this problem.

    For now, I at least have a clumsy workaround of rebooting until I see that
    the eth0 and hda initializations are not intermingled in dmesg.
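
    The check described above could be automated rather than done by eye.
    The Python sketch below is one possible approach, not a tested tool:
    it flags a boot as suspect when a line begins with partition names
    (such as " hda1 hda2 ...") without the leading "hda:" prefix, which is
    how the split message shows up in the bad extract. The function name
    partition_line_split is made up, and the sample lines are taken from
    the two dmesg extracts above.

```python
import re


def partition_line_split(dmesg_lines, disk="hda"):
    """Return True if the '<disk>: <disk>1 <disk>2 ...' line was split.

    A split is recognised by an orphaned continuation line that starts
    with a partition name instead of the '<disk>:' prefix.
    """
    orphan = re.compile(rf"^\s*{re.escape(disk)}\d+\b")
    return any(orphan.match(line) for line in dmesg_lines)


GOOD = [
    "hda: cache flushes not supported",
    " hda: hda1 hda2 hda3 hda4 < hda5 hda6 hda7 hda8 hda9 hda10 >",
    "hdb: max request size: 512KiB",
]

BAD = [
    "hda: cache flushes not supported",
    " hda:PCI: Enabling device 0000:00:0f.0 (0114 -> 0117)",
    "eth0: ADMtek Comet rev 17 at 00011400, 00:14:BF:5C:E1:35, IRQ 10.",
    " hda1 hda2 hda3 hda4 < hda5 hda6 hda7 hda8 hda9 hda10 >",
    "hdb: max request size: 512KiB",
]

print(partition_line_split(GOOD))  # False
print(partition_line_split(BAD))   # True
```

    Feeding it the output of dmesg shortly after boot would show whether
    the current boot is one of the intermingled ones. It only detects the
    symptom seen in these logs and says nothing about the cause.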




