net: tx timeouts with skge, 8139too, dmfe drivers/NICs - Kernel


  1. net: tx timeouts with skge, 8139too, dmfe drivers/NICs

    Hi all,

    I experience very rare network freezes under heavy outbound traffic
    (sending a ~4GB DVD image to one or more hosts on the same LAN) with
    the skge driver (on-board NIC) as well as (recently tested) with
    rtl8139 or dmfe NICs on the PCI bus. There is a single switch between
    the hosts (I also tested with another switch to rule out a faulty one).

    skge <--> Marvell 88E8001 chip
    8139too <--> Realtek 8136B chip
    dmfe <--> Davicom DM9102 chip

    The symptoms are similar: tx timeouts, then no more network activity.
    The KDE desktop works, computational programs work, and the machine
    is usable, but it can neither ping nor be pinged anymore.
    Running rmmod && modprobe on the respective module fixes the problem.
    Simple web surfing or e-mailing from the machine does not trigger it.

    The machine is used as an LTSP server for old PCs (as X terminals),
    so its traffic is mostly outbound, and this problem makes it unusable
    in that role.

    The kernel is 2.6.24.2-SMP/x86_32 (PREEMPT or not - NO difference).

    Since this happens with 3 different NICs/drivers, could it be
    a problem in the networking subsystem (common to all of them)?

    Since many people work on this machine, only limited testing
    can be done.

    Thank you in advance for your suggestions, help (and patches).

    Regards.

    Marin Mitov
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: net: tx timeouts with skge, 8139too, dmfe drivers/NICs

    Marin Mitov wrote:
    > Since this happens with 3 different NICs/drivers, could it be
    > a problem in the networking subsystem (common to all of them)?


    A TX timeout (like hardware timeouts, in general) is a very generic
    behavior, with many causes.

    In general, when you see timeouts with varied hardware and drivers,
    you're almost always dealing with a problem with interrupt delivery, or
    a generic system problem, rather than bugs in the network stack or all
    three drivers.

    Jeff




  3. Re: net: tx timeouts with skge, 8139too, dmfe drivers/NICs

    On Monday 25 February 2008 10:53:01 pm you wrote:
    > Marin Mitov wrote:
    > > Since this happens with 3 different NICs/drivers, could it be
    > > a problem in the networking subsystem (common to all of them)?

    >
    > A TX timeout (like hardware timeouts, in general) is a very generic
    > behavior, with many causes.
    >
    > In general, when you see timeouts with varied hardware and drivers,
    > you're almost always dealing with a problem with interrupt delivery, or


    All the drivers use INTA# on the PCI bus (no MSI/MSI-X).

    "problem with interrupt delivery" - do you suspect interrupts incorrectly
    disabled (lost) in the drivers, or faulty hardware (motherboard)?

    > a generic system problem, rather than bugs in the network stack or all

    "a generic system problem" - a bad config or faulty hardware (motherboard)?

    Where should I look for the problem?

    Just for info: the system is otherwise very stable - uptime (barring power
    outages) can be a month or more (I reboot only for kernel changes or updates).

    Marin Mitov

    > three drivers.
    >
    > Jeff

  4. Re: net: tx timeouts with skge, 8139too, dmfe drivers/NICs

    On Mon, 25 Feb 2008 23:36:06 +0200
    Marin Mitov wrote:

    > On Monday 25 February 2008 10:53:01 pm you wrote:
    > > Marin Mitov wrote:
    > > > Since this happens with 3 different NICs/drivers, could it be
    > > > a problem in the networking subsystem (common to all of them)?

    > >
    > > A TX timeout (like hardware timeouts, in general) is a very generic
    > > behavior, with many causes.
    > >
    > > In general, when you see timeouts with varied hardware and drivers,
    > > you're almost always dealing with a problem with interrupt delivery, or

    >
    > All the drivers use INTA# on the PCI bus (no MSI/MSI-X).
    >
    > "problem with interrupt delivery" - do you suspect interrupts incorrectly
    > disabled (lost) in the drivers, or faulty hardware (motherboard)?
    >
    > > a generic system problem, rather than bugs in the network stack or all
    >
    > "a generic system problem" - a bad config or faulty hardware (motherboard)?
    >
    > Where should I look for the problem?
    >
    > Just for info: the system is very stable - uptime (if no power outages) could
    > be a month or more (rebooting for kernel changes or updates).
    >
    > Marin Mitov


    Make sure the interrupt is showing up as level triggered in /proc/interrupts.
    The BIOS may be configuring it as edge-triggered and that won't work with
    Ethernet drivers that use NAPI.
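
A minimal sketch of this check, assuming the /proc/interrupts layout of kernels from this era (IRQ number, one count per CPU, a flow-handler label, then the device names). The helper name irq_trigger, the label table, and the sample text are illustrative and not from the thread. On x86 kernels of this generation, IO-APIC-fasteoi is the label shown for level-triggered IO-APIC lines and IO-APIC-edge for edge-triggered ones; any other label is reported as unknown.

```python
import re

# Map the flow-handler label printed in /proc/interrupts to a trigger mode.
# "IO-APIC-level" is the label older kernels printed for level-triggered lines.
TRIGGER_BY_LABEL = {
    "IO-APIC-fasteoi": "level",
    "IO-APIC-level": "level",
    "IO-APIC-edge": "edge",
}

def irq_trigger(proc_interrupts_text, device):
    """Return (irq, label, trigger) for the line that lists `device`, or None."""
    lines = proc_interrupts_text.splitlines()
    if not lines:
        return None
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        fields = line.split()
        # Only numbered IRQ lines look like "21:"; skip NMI:, LOC:, etc.
        if not fields or not re.fullmatch(r"\d+:", fields[0]):
            continue
        if len(fields) < 1 + ncpus + 2:
            continue
        label = fields[1 + ncpus]
        # Shared lines list several devices separated by ", ".
        names = " ".join(fields[2 + ncpus:]).split(", ")
        if device in names:
            irq = int(fields[0].rstrip(":"))
            return irq, label, TRIGGER_BY_LABEL.get(label, "unknown")
    return None

# Illustrative sample in the same layout as /proc/interrupts.
SAMPLE = """\
           CPU0       CPU1
 14:      59682     286628   IO-APIC-edge      ide0
 21:   11691000   11933174   IO-APIC-fasteoi   eth0
"""

if __name__ == "__main__":
    print(irq_trigger(SAMPLE, "eth0"))
```

On a live system the same helper could be fed the real file, for example irq_trigger(open("/proc/interrupts").read(), "eth0").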

  5. Re: net: tx timeouts with skge, 8139too, dmfe drivers/NICs

    Hi Stephen,

    > Make sure the interrupt is showing up as level triggered in
    > /proc/interrupts. The BIOS may be configuring it as edge-triggered and that
    > won't work with Ethernet drivers that use NAPI.


    For skge (Marvell 88E8001 chip), cat /proc/interrupts gives
    (AMD64 X2, SMP):

               CPU0       CPU1
     21:   11691000   11933174   IO-APIC-fasteoi   eth0

    It is neither IO-APIC-edge nor IO-APIC-level.

    Could this be the problem?

    Marin Mitov



  6. Re: net: tx timeouts with skge, 8139too, dmfe drivers/NICs

    On Tue, 26 Feb 2008 00:09:46 +0200
    Marin Mitov wrote:

    > Hi Stephen,
    >
    > > Make sure the interrupt is showing up as level triggered in
    > > /proc/interrupts. The BIOS may be configuring it as edge-triggered and that
    > > won't work with Ethernet drivers that use NAPI.

    >
    > for: skge <--> Marvell 88E8001 chip
    > cat /proc/interrupts gives (AMD64 X2 SMP):
    >            CPU0       CPU1
    >  21:   11691000   11933174   IO-APIC-fasteoi   eth0
    >
    > It is neither IO-APIC-edge, nor IO-APIC-level.
    >
    > Could it be the problem?
    >
    > Marin Mitov


    No, that isn't the problem.

  7. Re: net: tx timeouts with skge, 8139too, dmfe drivers/NICs

    On Monday 25 February 2008 10:53:01 pm you wrote:
    > > Since this happens with 3 different NICs/drivers, could it be
    > > a problem in the networking subsystem (common to all of them)?

    >
    > A TX timeout (like hardware timeouts, in general) is a very generic
    > behavior, with many causes.
    >
    > In general, when you see timeouts with varied hardware and drivers,
    > you're almost always dealing with a problem with interrupt delivery, or
    > a generic system problem, rather than bugs in the network stack or all
    > three drivers.


    Well, this gave me a direction for research.

    Using printk in various parts of the skge driver, and modifying it to
    collect additional statistics (read via ethtool -S eth0), I made the
    following observations when it freezes:

    1. Interrupts are generated (the status register shows pending
    interrupts and they are NOT masked), but the irq handler is NOT invoked.

    2. cat /proc/interrupts shows that while skge is working, both CPUs
    receive IRQs of any kind. When skge freezes, NO CPU receives skge's
    interrupts: CPU[0] receives every IRQ except skge's, while CPU[1] does not
    receive any IRQ above the line (see below), but does receive LOC: and RES:
    below the line.
    # cat /proc/interrupts
               CPU0       CPU1
      0:         85          1   IO-APIC-edge      timer
      1:      34078          9   IO-APIC-edge      i8042
      6:          1          4   IO-APIC-edge      floppy
      7:        216          1   IO-APIC-edge      parport0
      8:          0          1   IO-APIC-edge      rtc
      9:          0          0   IO-APIC-fasteoi   acpi
     12:     893003    1390080   IO-APIC-edge      i8042
     14:      59682     286628   IO-APIC-edge      ide0
     15:    5458527         12   IO-APIC-edge      ide1
     16:   60547054          1   IO-APIC-fasteoi   mga@pci:0000:01:00.0
     17:    1634623     914447   IO-APIC-fasteoi   sata_via
     18:       7768          7   IO-APIC-fasteoi   sata_promise
     19:          0          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, uhci_hcd:usb4, uhci_hcd:usb5
     20:     535380          1   IO-APIC-fasteoi   VIA8237
     21:   30780380   31448992   IO-APIC-fasteoi   eth0
    ---------line added by me----------------------------------
    NMI:          0          0   Non-maskable interrupts
    LOC:  154311126  154736178   Local timer interrupts
    RES:    1325239    2423719   Rescheduling interrupts
    CAL:      40893        456   function call interrupts
    TLB:      52651      29184   TLB shootdowns
    TRM:          0          0   Thermal event interrupts
    SPU:          0          0   Spurious interrupts
    ERR:          0
    MIS:          0

    It looks as if IRQs are somehow disabled (at the IO-APIC/LAPIC?)
    at some priority and below.
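
The per-CPU comparison described above can be sketched as a diff of two /proc/interrupts snapshots taken some time apart. This is only an illustration: the function names parse_irq_counts and stalled_irqs are hypothetical, and the BEFORE/AFTER snapshots are made-up data in the same layout, not measurements from this machine. A counter that does not advance only shows that the handler was not invoked on that CPU during the interval; an idle device looks the same, so the result has to be read together with the device's own pending-interrupt status.

```python
def parse_irq_counts(text):
    """Parse /proc/interrupts text into {irq_label: [count per CPU]}."""
    lines = text.splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    counts = {}
    for line in lines[1:]:
        fields = line.split()
        if not fields or not fields[0].endswith(":"):
            continue
        per_cpu = fields[1:1 + ncpus]
        # Rows such as ERR: and MIS: carry a single global counter; skip them.
        if len(per_cpu) < ncpus or not all(f.isdigit() for f in per_cpu):
            continue
        counts[fields[0].rstrip(":")] = [int(f) for f in per_cpu]
    return counts

def stalled_irqs(before, after):
    """Return {irq: [bool per CPU]}; True means that CPU's count did not advance."""
    old, new = parse_irq_counts(before), parse_irq_counts(after)
    return {irq: [n == o for o, n in zip(old[irq], new[irq])]
            for irq in new if irq in old}

# Made-up snapshots for illustration only.
BEFORE = """\
           CPU0       CPU1
 16:   60547054          1   IO-APIC-fasteoi   mga@pci:0000:01:00.0
 21:   30780380   31448992   IO-APIC-fasteoi   eth0
LOC:  154311126  154736178   Local timer interrupts
ERR:          0
"""
AFTER = """\
           CPU0       CPU1
 16:   60547900          1   IO-APIC-fasteoi   mga@pci:0000:01:00.0
 21:   30780380   31448992   IO-APIC-fasteoi   eth0
LOC:  154312000  154737000   Local timer interrupts
ERR:          0
"""

if __name__ == "__main__":
    for irq, flags in stalled_irqs(BEFORE, AFTER).items():
        print(irq, flags)
```

In this made-up data, IRQ 21 is stalled on both CPUs, IRQ 16 advances only on CPU0, and LOC advances on both, which is the same pattern as the one reported above.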

    I should mention here that after a freeze, ifconfig down/up (plus restoring
    the routing info) does NOT solve the problem, while rmmod/modprobe of the
    driver makes it work again.

    So I moved the request_irq()/free_irq() calls from the driver's probe()/remove()
    methods to its open()/stop() methods. With this modification, when skge freezes,
    ifconfig down/up makes it work again (no need to rmmod/modprobe).

    That makes me think that skge's IRQ is somehow disabled OUTSIDE the driver,
    and that free_irq()/request_irq() clears the problem. Am I wrong?

    Is this possible? How could it happen?

    Any comments/suggestions/patches are welcome.

    Regards

    Marin Mitov
