RE: e1000e: sporadic "hardware error"s with Intel 82563EB on Supermicro X7DB3 - Kernel

  1. RE: e1000e: sporadic "hardware error"s with Intel 82563EB on Supermicro X7DB3

    Hi Gernot,

    Thanks for reporting this issue. We have witnessed this in our labs too,
    only on platforms that have BMC management firmware. I'm very familiar
    with the problem, and believe that we have fixed it, though the
    application of the fix may not be simple. The problem is a result of
    improper synchronization between the platform FW and the e1000e driver
    when they attempt concurrent access to LAN resources, and fixes were
    made both on the driver side, and on the FW side. On some platforms a
    simple driver update resolves the problem, others require FW fixes too.

    The 0.2.0 driver in 2.6.25 has no fixes for this problem, and so I am
    not surprised that you see it there. The first set of changes for this
    issue are already in the 0.3.3.3-k2 driver that you are still seeing the
    problem with on 2.6.26, so either those changes are not good, or your
    issue requires one of the additional fixes.

    There have been further improvements made to the driver synchronization
    code since the 0.3.3.3-k2 driver, and it is possible that a newer driver
    would resolve the issue. It'd be good for us to know if that's the case.
    The driver version is not yet (AFAICS) upstream, but is already
    available in the standalone e1000e-0.4.1.7 driver on sourceforge.
    (google "sourceforge e1000e"). Would you be able to try that, as a first
    step ?

    If this does not resolve the issue for the Supermicro board, you likely
    also require a "FW-side" fix, and this comes in one of two flavors. If
    the board has an INTEL BMC, then we will need to update it with a new
    BMC version. If the board has a Supermicro BMC (I expect that it does),
    then we can provide a patch to some of the platform microcode using an
    EEPROM update. To determine which is appropriate for you, we'll need to
    know more about the platform. There's probably a BMC version number on
    one of the BIOS menus. I can work with you to find the info we need, and
    then, to help you to perform the necessary steps to perform an upgrade.

    Dave



    -----Original Message-----
    From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
    On Behalf Of Hillier, Gernot
    Sent: Tuesday, October 07, 2008 7:26 AM
    To: Brandeburg, Jesse
    Cc: linux-kernel@vger.kernel.org; netdev@vger.kernel.org; Allan, Bruce W
    Subject: e1000e: sporadic "hardware error"s with Intel 82563EB on
    Supermicro X7DB3

    Hi there,

    On at least two machines using the Supermicro X7DB3 board with Intel
    82563EB (a.k.a. PCI device 8086:1096), we see sporadic problems on
    modprobe (about 1 time in some hundred tries):

    e1000e: Intel(R) PRO/1000 Network Driver - 0.3.3.3-k2
    e1000e: Copyright (c) 1999-2008 Intel Corporation.
    e1000e 0000:06:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
    e1000e 0000:06:00.0: setting latency timer to 64
    0000:06:00.0: 0000:06:00.0: Hardware Error
    0000:06:00.0: eth0: (PCI Express:2.5GB/s:Width x4) 00:30:48:67:f5:f6
    0000:06:00.0: eth0: Intel(R) PRO/1000 Network Connection
    0000:06:00.0: eth0: MAC: 3, PHY: 5, PBA No: 2050ff-0ff
    e1000e 0000:06:00.1: PCI INT B -> GSI 19 (level, low) -> IRQ 19
    e1000e 0000:06:00.1: setting latency timer to 64
    0000:06:00.1: eth1: (PCI Express:2.5GB/s:Width x4) 00:30:48:67:f5:f7
    0000:06:00.1: eth1: Intel(R) PRO/1000 Network Connection
    0000:06:00.1: eth1: MAC: 3, PHY: 5, PBA No: 2050ff-0ff
    0000:06:00.0: eth0: Hardware Error

    eth0 is not available after module loading. During boot, this means the
    machine won't come up correctly. Problem can be "fixed" by removing and
    reloading the module.
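
    For anyone scripting a check for this failure, the log above can be
    scanned mechanically. The following is only a sketch in Python: the
    function name count_hardware_errors and the regular expression are
    illustrative assumptions derived from the message format shown in this
    report, not anything taken from the e1000e driver itself.

```python
import re
from collections import Counter

# Assumed line shape, taken from the log in this report, e.g.
#   "0000:06:00.0: eth0: Hardware Error"
#   "0000:06:00.0: 0000:06:00.0: Hardware Error"
# search() is used so that a leading dmesg timestamp does not prevent a match.
HW_ERROR = re.compile(
    r"(?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+\S+:\s+Hardware Error"
)


def count_hardware_errors(log_text):
    """Return a Counter mapping PCI address to its number of 'Hardware Error' lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = HW_ERROR.search(line)
        if match:
            counts[match.group("pci")] += 1
    return counts
```

    Fed the log excerpt above, this would attribute both error lines to
    0000:06:00.0 and none to 0000:06:00.1, matching the observation that
    only eth0 is unavailable.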

    This happens on the rather old SUSE-patched 2.6.25.11 with e1000e 0.2.0
    as well as with vanilla 2.6.27-rc8 including e1000e 0.3.3.3-k2.

    The machines are equipped with two Quad-Core Xeons E5440 and 8GB of RAM.
    Both kernels are compiled for x86_64.

    Supermicro claims that there's no known hardware problem with these
    boards and that the Windows driver doesn't show any issue...

    Is there anything I can do to help narrow down the problem? Anything I
    can test? Any help greatly appreciated...

    TIA!

    --
    Gernot Hillier
    Siemens AG, CT SE 2, Corporate Competence Center Embedded Linux
    --
    To unsubscribe from this list: send the line "unsubscribe netdev" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: e1000e: sporadic "hardware error"s with Intel 82563EB on Supermicro X7DB3

    On Wed, 8 Oct 2008 08:25:49 -0700
    "Graham, David" wrote:

    > Hi Gernot,
    >
    > Thanks for reporting this issue. We have witnessed this in our labs too,
    > only on platforms that have BMC management firmware. I'm very familiar
    > with the problem, and believe that we have fixed it, though the
    > application of the fix may not be simple. The problem is a result of
    > improper synchronization between the platform FW and the e1000e driver
    > when they attempt concurrent access to LAN resources, and fixes were
    > made both on the driver side, and on the FW side. On some platforms a
    > simple driver update resolves the problem, others require FW fixes too.
    >
    > The 0.2.0 driver in 2.6.25 has no fixes for this problem, and so I am
    > not surprised that you see it there. The first set of changes for this
    > issue are already in the 0.3.3.3-k2 driver that you are still seeing the
    > problem with on 2.6.26, so either those changes are not good, or your
    > issue requires one of the additional fixes.
    >
    > There have been further improvements made to the driver synchronization
    > code since the 0.3.3.3-k2 driver, and it is possible that a newer driver
    > would resolve the issue. It'd be good for us to know if that's the case.
    > The driver version is not yet (AFAICS) upstream, but is already
    > available in the standalone e1000e-0.4.1.7 driver on sourceforge.
    > (google "sourceforge e1000e"). Would you be able to try that, as a first
    > step ?


    Repeat rant heard from many users and vendors:
    Why does Intel continue to not do driver development in mainline kernel?


  3. Re: e1000e: sporadic "hardware error"s with Intel 82563EB on Supermicro X7DB3

    Dear David,

    First of all, thanks for your quick answer! This is what I call great
    support from a hardware vendor!! :-)

    Graham, David wrote:
    > Thanks for reporting this issue. We have witnessed this in our labs too,
    > only on platforms that have BMC management firmware. I'm very familiar
    > with the problem, and believe that we have fixed it, though the
    > application of the fix may not be simple. The problem is a result of
    > improper synchronization between the platform FW and the e1000e driver
    > when they attempt concurrent access to LAN resources, and fixes were
    > made both on the driver side, and on the FW side. On some platforms a
    > simple driver update resolves the problem, others require FW fixes too.


    That sounds quite promising and seems to fit our problem.

    However, one detail confuses us: we can currently reproduce this problem on
    two machines. One of them is equipped with an optional IPMI card, the other
    one isn't. (The Supermicro X7DB3 doesn't include full IPMI support onboard,
    but has a "LP IPMI 2.0 (SIMLP) Slot" where you can place an optional card).

    The box with the IPMI card shows the hardware errors quite often (in one of
    about 200 tries) while the other box still shows the problem, but much more
    rarely (in one of >1000 tries). Now we wonder if the BMC is on the IPMI
    card or on the board itself - in the first case, I'm not sure if your thesis
    fully explains the problems we can see.

    And there's another detail I'd like to mention: we first found the problem
    by doing continuous reboots as originally described, but we found we can
    also reproduce it with an endless loop of "rmmod;sleep 3;modprobe". Does
    this somehow contradict your thesis?
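
    The reload loop described above could be automated roughly as follows.
    This is a sketch under stated assumptions, not the script actually used
    here: the helper names (reload_driver, read_kernel_log,
    first_failing_try) are hypothetical, the commands need root, and a
    failure is detected simply by a rise in the number of "Hardware Error"
    strings in the dmesg output, which can miss a failure if the kernel ring
    buffer wraps.

```python
import subprocess
import time


def reload_driver(module="e1000e", pause_s=3.0):
    """Unload and reload the module, mirroring 'rmmod; sleep 3; modprobe' (needs root)."""
    subprocess.run(["rmmod", module], check=True)
    time.sleep(pause_s)
    subprocess.run(["modprobe", module], check=True)


def read_kernel_log():
    """Return the current kernel ring buffer as text."""
    result = subprocess.run(["dmesg"], check=True, capture_output=True, text=True)
    return result.stdout


def first_failing_try(reload=reload_driver, read_log=read_kernel_log, max_tries=1000):
    """Reload up to max_tries times.

    Returns the 1-based number of the first try after which the count of
    'Hardware Error' strings in the log increased, or None if it never did.
    """
    seen = read_log().count("Hardware Error")
    for attempt in range(1, max_tries + 1):
        reload()
        now = read_log().count("Hardware Error")
        if now > seen:
            return attempt
        seen = now  # also tolerates the count dropping when old lines scroll out
    return None
```

    The reload and log-reading steps are passed in as parameters so the
    loop logic can be exercised without touching real hardware.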

    > There have been further improvements made to the driver synchronization
    > code since the 0.3.3.3-k2 driver, and it is possible that a newer driver
    > would resolve the issue. It'd be good for us to know if that's the case.
    > The driver version is not yet (AFAICS) upstream, but is already
    > available in the standalone e1000e-0.4.1.7 driver on sourceforge.
    > (google "sourceforge e1000e"). Would you be able to try that, as a first
    > step ?


    Yes, I did. Unfortunately, 0.4.1.7 still shows the problem - on both machines:

    e1000e: Intel(R) PRO/1000 Network Driver - 0.4.1.7-NAPI
    e1000e: Copyright (c) 1999-2008 Intel Corporation.
    ACPI: PCI Interrupt 0000:06:00.0[A] -> GSI 18 (level, low) -> IRQ 18
    PCI: Setting latency timer of device 0000:06:00.0 to 64
    0000:06:00.0: 0000:06:00.0: Hardware Error
    0000:06:00.0: eth0: (PCI Express:2.5GB/s:Width x4) 00:30:48:66:c7:06
    0000:06:00.0: eth0: Intel(R) PRO/1000 Network Connection
    0000:06:00.0: eth0: MAC: 5, PHY: 5, PBA No: 2050ff-0ff
    ACPI: PCI Interrupt 0000:06:00.1[B] -> GSI 19 (level, low) -> IRQ 19
    PCI: Setting latency timer of device 0000:06:00.1 to 64
    0000:06:00.1: eth1: (PCI Express:2.5GB/s:Width x4) 00:30:48:66:c7:07
    0000:06:00.1: eth1: Intel(R) PRO/1000 Network Connection
    0000:06:00.1: eth1: MAC: 5, PHY: 5, PBA No: 2050ff-0ff
    0000:06:00.0: eth0: Hardware Error
    0000:06:00.0: eth0: Hardware Error
    0000:06:00.0: eth0: Hardware Error
    0000:06:00.0: eth0: Hardware Error
    0000:06:00.0: eth0: Hardware Error

    Is there any further debug code I could add to narrow down things?

    > If this does not resolve the issue for the Supermicro board, you likely
    > also require a "FW-side" fix, and this comes in one of two flavors. If
    > the board has an INTEL BMC, then we will need to update it with a new
    > BMC version. If the board has a Supermicro BMC (I expect that it does),
    > then we can provide a patch to some of the platform microcode using an
    > EEPROM update. To determine which is appropriate for you, we'll need to
    > know more about the platform. There's probably a BMC version number on
    > one of the BIOS menus. I can work with you to find the info we need, and
    > then, to help you to perform the necessary steps to perform an upgrade.


    Sorry, but we can't provide any further details about this yet. We are still
    trying to get through to the Supermicro developers, but so far our FAE contact
    insists on telling us "don't use e1000e, e1000 is the right driver for your
    hardware".

    --
    Gernot Hillier
    Siemens AG, CT SE 2

  4. Re: e1000e: sporadic "hardware error"s with Intel 82563EB on Supermicro X7DB3

    Hi Dave!

    Sorry for the delay (and the self-follow-up), but now I can hopefully
    provide answers to all your questions...

    Hillier, Gernot wrote:
    > However, one detail confuses us: we can currently reproduce this problem on
    > two machines. One of them is equipped with an optional IPMI card, the other
    > one isn't. (The Supermicro X7DB3 doesn't include full IPMI support onboard,
    > but has a "LP IPMI 2.0 (SIMLP) Slot" where you can place an optional card).


    The "IPMI card" we use is a "Supermicro AOC-SIMLP-B".

    Overview: http://www.supermicro.com/products/a.../addon/sim.cfm
    Manual: http://www.supermicro.com/manuals/other/AOC-SIMLP.pdf

    > The box with the IPMI card shows the hardware errors quite often (in one of
    > about 200 tries) while the other box still shows the problem, but much more
    > rarely (in one of >1000 tries). Now we wonder if the BMC is on the IPMI
    > card or on the board itself - in the first case, I'm not sure if your thesis
    > fully explains the problems we can see.


    However, after digging through some manuals, I'm quite sure the BMC is
    integrated in the Intel ESB2 I/O Controller Hub used on our board, not
    on the IPMI card. So we should have an Intel BMC.

    > And there's another detail I'd like to mention: we first found the problem
    > by doing continuous reboots as originally described, but we found we can
    > also reproduce it with an endless loop of "rmmod;sleep 3;modprobe". Does
    > this somehow contradict your thesis?
    >
    >> There have been further improvements made to the driver synchronization
    >> code since the 0.3.3.3-k2 driver, and it is possible that a newer driver
    >> would resolve the issue. It'd be good for us to know if that's the case.
    >> The driver version is not yet (AFAICS) upstream, but is already
    >> available in the standalone e1000e-0.4.1.7 driver on sourceforge.
    >> (google "sourceforge e1000e"). Would you be able to try that, as a first
    >> step ?

    >
    > Yes, I did. Unfortunately, 0.4.1.7 still shows the problem - on both machines:
    >
    > e1000e: Intel(R) PRO/1000 Network Driver - 0.4.1.7-NAPI
    > e1000e: Copyright (c) 1999-2008 Intel Corporation.
    > ACPI: PCI Interrupt 0000:06:00.0[A] -> GSI 18 (level, low) -> IRQ 18
    > PCI: Setting latency timer of device 0000:06:00.0 to 64
    > 0000:06:00.0: 0000:06:00.0: Hardware Error
    > 0000:06:00.0: eth0: (PCI Express:2.5GB/s:Width x4) 00:30:48:66:c7:06
    > 0000:06:00.0: eth0: Intel(R) PRO/1000 Network Connection
    > 0000:06:00.0: eth0: MAC: 5, PHY: 5, PBA No: 2050ff-0ff
    > ACPI: PCI Interrupt 0000:06:00.1[B] -> GSI 19 (level, low) -> IRQ 19
    > PCI: Setting latency timer of device 0000:06:00.1 to 64
    > 0000:06:00.1: eth1: (PCI Express:2.5GB/s:Width x4) 00:30:48:66:c7:07
    > 0000:06:00.1: eth1: Intel(R) PRO/1000 Network Connection
    > 0000:06:00.1: eth1: MAC: 5, PHY: 5, PBA No: 2050ff-0ff
    > 0000:06:00.0: eth0: Hardware Error
    > 0000:06:00.0: eth0: Hardware Error
    > 0000:06:00.0: eth0: Hardware Error
    > 0000:06:00.0: eth0: Hardware Error
    > 0000:06:00.0: eth0: Hardware Error
    >
    > Is there any further debug code I could add to narrow down things?
    >
    >> If this does not resolve the issue for the Supermicro board, you likely
    >> also require a "FW-side" fix, and this comes in one of two flavors. If
    >> the board has an INTEL BMC, then we will need to update it with a new
    >> BMC version. If the board has a Supermicro BMC (I expect that it does),
    >> then we can provide a patch to some of the platform microcode using an
    >> EEPROM update. To determine which is appropriate for you, we'll need to
    >> know more about the platform. There's probably a BMC version number on
    >> one of the BIOS menus. I can work with you to find the info we need, and
    >> then, to help you to perform the necessary steps to perform an upgrade.

    >

    [...]
    Still no helpful contact within Supermicro, but we found the following
    information in the web interface provided by the "IPMI card":

    Device Information
    Product Name: Supermicro Daughter Card
    Serial Number: 02969601ac46a6df
    Device IP Address: 192.168.2.4
    Device MAC Address: 08:15:08:15:08:15
    Firmware Version: 01.59.00
    Firmware Build Number: 5420
    Firmware Description: Sep-29-2008-09-45-NonKVM
    Hardware Revision: 0x22

    The BIOS IPMI menu itself says:

    IPMI Specification Version: 2.0
    Firmware Version: 1.59

    I hope that those details answered your questions, so that we can
    proceed with your suggestions. Think we now need the "new BMC version"
    you mentioned, right?

    If there's anything I can test or look up from the software side to
    speed things up (like additional debugging of the driver, etc.), please
    don't hesitate to ask!

    --
    Gernot

  5. RE: e1000e: sporadic "hardware error"s with Intel 82563EB on Supermicro X7DB3

    Hi Gernot,

    I think that the system with the SuperMicro IPMI card is configured as
    having an "external BMC" from the perspective of the INTEL-based system.
    My experience of such configurations is that the IPMI traffic is handled
    by the BMC in the card, but routed in/out of the system over the "eth0"
    on-motherboard esb2 interface. I looked at the AOC-SIMLP-B card
    described in the SuperMicro link you provided and see that it too has an
    ethernet interface. I'm not sure if the interface on the card provides a
    second IPMI interface to the system, or that IPMI to the mainboard eth0
    is disabled. I have IPMI management contacts here in INTEL, and am
    trying to find out.

    If this system does route IPMI traffic between the SuperMicro card & the
    mainboard LAN eth0, the onboard LAN now has two clients, one on the
    SuperMicro card, and one in the host OS. INTEL provides APIs to external
    BMCs so that they can use the LAN, and hidden behind those APIs is code
    to allow each client to operate without having to be aware of the state
    of the other. There is a bug in this code that can be exposed when the
    host resets the LAN. The bug is resolved by a patch to the API code,
    which is applied as an EEPROM update to the system. I am working with
    Jeff Hockert & others in-house to find out details of how we are
    deploying that EEPROM update.

    I continue to review - with help - the information that you have already
    provided, to determine whether this system does match the IPMI
    configuration that I think it does. I'll keep you up to date.

    OK, now for the system without the IPMI card. Probably that one does
    have an active INTEL BMC. And, if it does, the core bug that I (sort-of)
    explained above is also relevant there, though it's not fixable in the
    same way because the buggy code in this case is integrated directly as
    part of the INTEL BMC. In this case, you'll need a BMC upgrade. But
    first, just like for the other case, I need to confirm that the
    configuration is what I think it is.

    It would help if you could provide a little more information. Could you
    provide (for one of each of the two configurations that you have - one
    with the IPMI card, one without):

    lspci -t
    lspci -vvv -xxxx
    ethtool -e eth0
    BIOS "IPMI" menus (I know you already gave us one, but both
    would be good)

    Thanks

    Dave

    -----Original Message-----
    From: Gernot Hillier [mailto:gernot.hillier@siemens.com]
    Sent: Tuesday, October 14, 2008 2:18 AM
    To: Graham, David
    Cc: linux-kernel@vger.kernel.org; netdev@vger.kernel.org; Allan, Bruce
    W; Hockert, Jeff W
    Subject: Re: e1000e: sporadic "hardware error"s with Intel 82563EB on
    Supermicro X7DB3

    [...]

  6. Re: e1000e: sporadic "hardware error"s with Intel 82563EB on Supermicro X7DB3

    Hi Dave!

    I added Zoltan Fodor from your PAE department to the distribution list as
    he also supports us regarding this problem.

    On 15.10.2008 18:37, Graham, David wrote:
    > I think that the system with the SuperMicro IPMI card is configured as
    > having an "external BMC" from the perspective of the INTEL-based system.


    Exactly. That's what our hardware experts told me in the meantime, too.

    > My experience of such configurations is that the IPMI traffic is handled
    > by the BMC in the card, but routed in/out of the system over the "eth0"
    > on-motherboard esb2 interface. I looked at the AOC-SIMLP-B card
    > described in the SuperMicro link you provided and see that it too has an
    > ethernet interface. I'm not sure if the interface on the card provides
    > a second IPMI interface to the system, or that IPMI to the mainboard
    > eth0 is disabled. I have IPMI management contacts here in INTEL, and am
    > trying to find out.
    >
    > If this system does route IPMI traffic between the SuperMicro card & the
    > mainboard LAN eth0, the onboard LAN now has two clients, one on the
    > SuperMicro card, and one in the host OS.


    The latter is true for us. This IPMI card has its own Ethernet interface, as
    you mentioned, but due to product requirements we can't use it and need the
    "shared NIC" feature instead. Therefore, this card is configured (jumpered)
    to route its IPMI traffic through the eth0 on the motherboard.

    > INTEL provides APIs to external BMCs so that they can use the LAN, and
    > hidden behind those APIs is code to allow each client to operate without
    > having to be aware of the state of the other. There is a bug in this
    > code that can be exposed when the host resets the LAN. The bug is
    > resolved by a patch to the API code, which is applied as an EEPROM
    > update to the system. I am working with Jeff Hockert & others in-house
    > to find out details of how we are deploying that EEPROM update.


    Thanks for the explanation. I would be more than happy to try anything in
    that area!

    > I continue to review - with help - the information that you have already
    > provided, to determine whether this system does match the IPMI
    > configuration that I think it does. I'll keep you up to date.


    As explained above, your assumptions should exactly apply to our scenario, yes.

    > OK, now for the system without the IPMI card. Probably that one does
    > have an active INTEL BMC. And, if it does, the core bug that I (sort-of)
    > explained above is also relevant there, though it's not fixable in the
    > same way because the buggy code in this case is integrated directly as
    > part of the INTEL BMC. In this case, you'll need a BMC upgrade. But
    > first, just like for the other case, I need to confirm that the
    > configuration is what I think it is.
    >
    > It would help if you could provide a little more information. Could you
    > provide (for one of each of the two configurations that you have - one
    > with the IPMI card, one without):
    >
    > lspci -t
    > lspci -vvv -xxxx
    > ethtool -e eth0


    I will provide those as soon as possible. Currently, they would probably be
    meaningless for you, as our hardware experts tried some kind of firmware
    update which broke the "Shared NIC" feature - so I doubt we can reproduce
    the bug in the current configuration.

    As soon as I can get the machines back to the state where we can reproduce
    the issue, I'll send you the requested details.

    > BIOS "IPMI" menus (I know you
    > already gave us one, but both would be good)


    For this, I can already tell that there is no BIOS IPMI menu available if
    there's no IPMI card plugged in. Seems like the Supermicro BIOS developers
    deny access to the Intel BMC in standalone mode...

    --
    With kind regards,
    Gernot Hillier
    Siemens AG, CT SE 2, Corporate Competence Center Embedded Linux

  7. Re: e1000e: sporadic "hardware error"s with Intel 82563EB on Supermicro X7DB3

    Hi there!

    Graham, David wrote:
    > It would help if you could provide a little more information. Could you
    > provide (for one of each of the two configurations that you have - one
    > with the IPMI card, one without):
    >
    > lspci -t
    > lspci -vvv -xxxx
    > ethtool -e eth0


    Ok, it turned out that we still can reproduce the problem - even after the
    firmware upgrade. So I collected the information you requested from two
    machines:

    - BVSIM3 is the one with the IPMI card.
    - BVSIM5 the one without the IPMI card.

    As the output from the commands you requested is rather large, I uploaded
    it to the following URLs instead of posting it to the list, hope that's ok:

    http://www.hillier.de/bvsim5-ethtool-e.txt
    http://www.hillier.de/bvsim5-lspci-t.txt
    http://www.hillier.de/bvsim5-lspci-vvv-xxxx.txt
    http://www.hillier.de/bvsim3-ethtool-e.txt
    http://www.hillier.de/bvsim3-lspci-t.txt
    http://www.hillier.de/bvsim3-lspci-vvv-xxxx.txt

    All commands were run in the error case, i.e. after e1000e said "Hardware
    error".
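
    Collecting the three requested outputs per machine could be scripted
    along these lines. This is a hedged sketch: the function names
    output_name and collect are hypothetical, the file-naming scheme is only
    modeled on the file names listed above, and "lspci -xxxx" and
    "ethtool -e" generally need root.

```python
import subprocess
from pathlib import Path

# The three commands requested in the thread.
COMMANDS = [
    ["lspci", "-t"],
    ["lspci", "-vvv", "-xxxx"],
    ["ethtool", "-e", "eth0"],
]


def output_name(host, command):
    """Build a name like 'bvsim5-lspci-vvv-xxxx.txt' from the program and its flags.

    Non-flag arguments such as the interface name are left out of the name.
    """
    flags = [arg for arg in command[1:] if arg.startswith("-")]
    return f"{host}-{command[0]}{''.join(flags)}.txt"


def collect(host, out_dir=".", run=subprocess.run):
    """Run each command, write its stdout to out_dir, and return the written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for command in COMMANDS:
        # check=False: a non-zero exit still leaves whatever output was produced.
        result = run(command, capture_output=True, text=True, check=False)
        path = out / output_name(host, command)
        path.write_text(result.stdout)
        written.append(path)
    return written
```

    Running collect("bvsim3") right after a failed module load would capture
    the state in the error case, as was done for the files above.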

    > BIOS "IPMI" menus (I know you already gave us one, but both
    > would be good)


    BVSIM3 shows the IPMI menu I already provided
    BVSIM5 shows no IPMI menu

    Thanks in advance!

    --
    With kind regards,
    Gernot Hillier
    Siemens AG, CT SE 2, Corporate Competence Center Embedded Linux

  8. Re: e1000e: sporadic "hardware error"s with Intel 82563EB on Supermicro X7DB3

    Dear Dave,

    On 2008-10-16, Hillier, Gernot wrote:
    > Hi there!
    >
    > Graham, David wrote:
    >> It would help if you could provide a little more information. Could you
    >> provide (for one of each of the two configurations that you have - one
    >> with the IPMI card, one without):
    >>
    >> lspci -t
    >> lspci -vvv -xxxx
    >> ethtool -e eth0

    >
    > Ok, it turned out that we still can reproduce the problem - even after the
    > firmware upgrade. So I collected the information you requested from two
    > machines:


    Wanted to let you know that this problem seems to be fixed for us.

    We received a preliminary update from Supermicro which contains a new NIC
    firmware 2.5 they recently received from your side with improved Shared LAN
    support (56313.eep, release date 2008-10-01).

    After flashing this update together with a new BMC card firmware 1.59, the
    problem has finally vanished for us. (For some reason, updating only the
    NIC firmware wasn't possible, so unfortunately we can't pin down which
    part of the update really fixed the problem.)

    1.5 days of an rmmod/insmod loop and 2.5 days of a complete OS reboot loop
    have now passed without problems. Both tests previously triggered the
    problem reliably within at most one day.
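
    As a rough cross-check of how long such a soak test needs to run, the
    failure rates quoted earlier in the thread (about 1 in 200 tries with
    the IPMI card, at most 1 in 1000 without) can be turned into a number of
    clean tries. The sketch below assumes each try fails independently with
    a fixed probability, which is a simplification; the function name
    clean_runs_needed is hypothetical.

```python
import math


def clean_runs_needed(failure_rate, confidence=0.95):
    """Smallest n for which n clean tries in a row would be at most
    (1 - confidence) likely if the per-try failure rate were unchanged.

    Assumes independent tries with a constant failure probability.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve (1 - failure_rate) ** n <= 1 - confidence for n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))
```

    Under that assumption, clean_runs_needed(1 / 200) gives 598 tries and
    clean_runs_needed(1 / 1000) gives 2995, so a loop that has run several
    thousand clean iterations is reasonable evidence that the failure rate
    has really dropped.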

    So thanks again for your help!

    --
    With kind regards,

    Gernot Hillier
    Siemens AG, Corporate Competence Center Embedded Linux