NMI error and Intel S5000PSL Motherboards - Kernel

  1. NMI error and Intel S5000PSL Motherboards

    We have about 100 servers based on Intel S5000PSL-SATA motherboards.
    They have been running for anywhere between 1 and 10 months. For the
    past few months, after updating them all to the 2.6.20.15 kernel
    (because of a bug in the 2.6.18 kernel), we are seeing some strange NMI
    errors. For example:

    Aug 29 09:02:10 master kernel: Uhhuh. NMI received for unknown reason 30.
    Aug 29 09:02:10 master kernel: Do you have a strange power saving mode enabled?
    Aug 29 09:02:10 master kernel: Dazed and confused, but trying to continue

    Sometimes these errors cause a total system freeze. Most of the time the
    systems keep running.

    We have determined these errors come most frequently on machines that
    have an Intel PCI-e Quad Port Gigabit Adapter. On machines that HAVE
    these cards (it doesn't matter what slot they are in), the NMI errors
    can occur as frequently as every 3-5 minutes. On machines that do NOT
    have these Quad Port Adapters, the NMI errors occur about once per month
    on average. (We have tried the in-kernel e1000 driver as well as
    Intel's latest, 7.6.5.)

    We have also determined (through a chance discovery) that running
    "scanpci" can 100 percent reliably reproduce the NMI error on any
    machine that has the Quad Port NICs. Our various motherboards have
    different Intel BIOS versions: some have Rev 70, others 74, 79 or 81.
    They all exhibit the same behavior regardless of BIOS version.

    We have reproduced this problem with:

    Mandriva 2008 RC2 (2.6.22 kernel)
    Mandriva 2007 with custom 2.6.20.15 kernel
    Mandriva 2007 with custom 2.6.19.8 kernel
    Ubuntu Feisty with 2.6.20 kernel
    Fedora Core 7 with 2.6.22 kernel

    The problem does NOT occur with any distribution running a 2.6.18 kernel
    or earlier, for example CentOS, SUSE 10, and Mandriva 2007 with either
    its included 2.6.17 kernel or a custom-compiled 2.6.18 kernel.

    We have been in contact with Intel. Their high level tech support people
    have basically said,

    "the errors we have logged so far are pointing to a kernel issue and
    not a hardware problem. If we [Intel] can confirm this, it will be
    up to the kernel developer or OS system manufacturer to debug those
    ones, as we do not perform Operating system support."

    In other words, Intel seems to be blaming the problem we are seeing on
    something introduced starting with the 2.6.19 kernel. We are not looking
    to blame anybody. We are only looking for a solution.

    Does anybody have an idea what could be going on here, as well as what
    the solution may be? Going back to 2.6.18 or lower is not an option.


    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: NMI error and Intel S5000PSL Motherboards

    On Wed, 26 Sep 2007 02:12:34 -0800 AndrewL733 wrote:

    > We have about 100 servers based on Intel S5000PSL-SATA motherboards.


    product info (for others):
    http://support.intel.com/support/mot...0psl/index.htm

    > Does anybody have an idea what could be going on here, as well as what
    > the solution may be? Going back to 2.6.18 or lower is not an option.



    Please provide some basic info, like:

    - how much RAM
    - what CPUs (be precise: use 'cat /proc/cpuinfo')
    - output of 'lspci -v'
    - what kind(s) of SATA drives
    - are you using 32-bit or 64-bit kernel(s)

    Can you test kernels from kernel.org (i.e., not vendor kernels,
    no other [unknown] patches applied to them)?

    Does tracing 'scanpci' produce any helpful information?
    # strace -o scanpci.trace scanpci
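
    A minimal sketch of gathering the items in the list above in one pass,
    assuming a POSIX shell. The function names and output file names are
    made up for illustration, and reading drive models from
    /sys/block/sd*/device/model assumes the SATA disks show up as sd*
    devices. Each file records either the command output or a note that
    the command was unavailable or failed.

```shell
#!/bin/sh
# Gather the requested details into one directory for posting to the list.
# collect_info, capture and the file names are illustrative, not a standard tool.

# Run a command and save its output; leave a note if it is missing or fails.
capture() {
    out=$1; shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >"$out" 2>&1 || echo "command failed: $*" >>"$out"
    else
        echo "not available: $1" >"$out"
    fi
}

collect_info() {
    dir=$1
    mkdir -p "$dir" || return 1
    capture "$dir/meminfo.txt" cat /proc/meminfo    # how much RAM
    capture "$dir/cpuinfo.txt" cat /proc/cpuinfo    # exact CPU model(s)
    capture "$dir/lspci.txt"   lspci -v             # PCI devices, including the NICs
    capture "$dir/uname.txt"   uname -a             # kernel version and 32/64-bit machine type
    capture "$dir/disks.txt"   sh -c 'cat /sys/block/sd*/device/model'   # drive models (assumed path)
}

# Usage: collect_info ./nmi-report
```

    Run it once on a machine with the Quad Port card and once on a machine
    without it, so the two sets of output can be compared.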


    ---
    ~Randy
    Phaedrus says that Quality is about caring.

  3. Re: NMI error and Intel S5000PSL Motherboards

    On Wed, 26 Sep 2007 02:12:34 -0800 AndrewL733 wrote:

    > Does anybody have an idea what could be going on here, as well as what
    > the solution may be? Going back to 2.6.18 or lower is not an option.


    Answer #2: if a kernel change was responsible for this problem,
    the direct way to find that change is to clone the kernel 'git' tree
    and then use git bisect to find the culprit. If you are certain
    that 2.6.18 is good and 2.6.19 is bad, then use those git tree tags
    instead of the ones that are used in the example at:
    http://www.kernel.org/pub/software/s...it-bisect.html

    git wiki is here: http://git.or.cz/
    and git docs are here: http://www.kernel.org/pub/software/scm/git/docs/

    If you want to use this tool, say so and I think that we (the royal
    "we") will try to walk you through it.
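
    As a rough sketch of what that looks like in practice, assuming an
    existing clone of the mainline kernel tree (where the releases are
    tagged v2.6.18 and v2.6.19) and a git new enough to support "git -C".
    The helper name start_bisect and the ~/linux path are hypothetical;
    the git subcommands themselves are the standard ones.

```shell
#!/bin/sh
# Start a bisection between a known-good and a known-bad revision.
# start_bisect is a hypothetical helper name; "git bisect" is the real tool.
start_bisect() {
    repo=$1 good=$2 bad=$3
    # "git bisect start <bad> <good>" marks both endpoints and checks out
    # a revision roughly halfway between them.
    git -C "$repo" bisect start "$bad" "$good"
}

# Typical session, run inside the clone after the start step:
#   start_bisect ~/linux v2.6.18 v2.6.19
#   ...build and boot the checked-out revision, then run scanpci...
#   git bisect good     # no NMI message appeared
#   git bisect bad      # the NMI message was reproduced
#   ...repeat until git names the first bad commit...
#   git bisect reset    # return the tree to where it started
```

    Because scanpci reproduces the error reliably on machines with the
    Quad Port NICs, each step gives a clear good or bad answer.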

    ---
    ~Randy
    Phaedrus says that Quality is about caring.

  4. Re: NMI error and Intel S5000PSL Motherboards

    > Aug 29 09:02:10 master kernel: Uhhuh. NMI received for unknown reason 30.
    > Aug 29 09:02:10 master kernel: Do you have a strange power saving mode enabled?


    What would be useful is to know under what situations that board can
    raise NMI 30.
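
    For reference, the number in that message is printed in hex, and on
    x86 the kernel reads it from I/O port 0x61. It reports the NMI as
    "unknown" when neither bit 7 nor bit 6 is set. A small sketch that
    decodes the byte, assuming the standard PC layout of that port (the
    function names are made up for illustration):

```shell
#!/bin/sh
# Decode the hex byte from "NMI received for unknown reason XX",
# assuming it is the x86 port 0x61 status byte in the standard PC layout.

# bit VALUE MASK DESCRIPTION: print DESCRIPTION if MASK is set in VALUE.
bit() {
    if [ $(( $1 & $2 )) -ne 0 ]; then echo "$3"; fi
}

decode_nmi_reason() {
    val=$((0x$1))
    bit "$val" 0x80 "bit 7: SERR# / memory parity error (NMI source)"
    bit "$val" 0x40 "bit 6: IOCHK# I/O channel check (NMI source)"
    bit "$val" 0x20 "bit 5: timer 2 output state (not an NMI source)"
    bit "$val" 0x10 "bit 4: refresh toggle (not an NMI source)"
    bit "$val" 0x08 "bit 3: IOCHK# NMI disabled"
    bit "$val" 0x04 "bit 2: SERR# NMI disabled"
    bit "$val" 0x02 "bit 1: speaker data enabled"
    bit "$val" 0x01 "bit 0: timer 2 gate enabled"
}

# Usage: decode_nmi_reason 30
```

    For the value 30 only bits 5 and 4 are set, so the byte itself does
    not identify a source, which is why the board-specific NMI triggers
    are the interesting question.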

    > In other words, Intel seems to be blaming the problem we are seeing on
    > something introduced starting with the 2.6.19 kernel. We are not looking
    > to blame anybody. We are only looking for a solution.


    The first thing to find out is which kernel introduced the behaviour.
    It might also be worth disabling MSI in case Intel screwed the board
    up somewhat.

    > Does anybody have an idea what could be going on here, as well as what
    > the solution may be? Going back to 2.6.18 or lower is not an option.


    See if 2.6.20.* with the 2.6.18 driver compiles and how that behaves.
    Also see if pci=nomsi and/or pci=nommconf make a difference.
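
    Both options go on the kernel command line in the boot loader
    configuration, and pci= takes a comma-separated list, so
    pci=nomsi,nommconf sets both at once. When comparing test runs it
    helps to confirm what a machine actually booted with. A minimal
    sketch, where has_pci_option is a hypothetical helper name:

```shell
#!/bin/sh
# Report whether a kernel command line carries a given pci= option.
# has_pci_option is a hypothetical helper, not an existing tool.
has_pci_option() {
    want=$1 cmdline=$2
    for word in $cmdline; do
        case $word in
            pci=*)
                # Wrap the value list in commas so each entry is matched whole.
                case ",${word#pci=}," in
                    *,"$want",*) return 0 ;;
                esac
                ;;
        esac
    done
    return 1
}

# Usage on a running system:
#   has_pci_option nomsi "$(cat /proc/cmdline)" && echo "booted with pci=nomsi"
```

    Testing each option separately before combining them shows which one,
    if either, changes the behaviour.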

    Alan