X4100 mysterious failure - SUN



Thread: X4100 mysterious failure

  1. X4100 mysterious failure

    We have an X4100 which last week seems to have turned itself off under
    somewhat mysterious circumstances. There is no noise at all in any
    Solaris logs (it just rebooted cleanly when it was turned back on, and
    the machine was idle so there's not even a clear trace of when it
    died), but the ILOM event log reports:

    305 03/21/2007 12:49:54 ps1.pwrok Power Supply State Deasserted - Asserted
    304 03/21/2007 12:49:52 ps0.pwrok Power Supply State Deasserted - Asserted

    which (perhaps?) means it thinks it lost power then. We're fairly
    sure that there was no real power outage of any kind (other machines
    running off the same power strips were all fine, there's a UPS, etc.).

    Has anyone seen anything like this?

    --tim

    PS there is confusion about ILOM / other firmware revisions, but the
    ILOM reports:

    Device Hardware Revision 1
    Device Firmware Revision 1.0.117440512
    IPMI Version 2.0
    Filesystem Version 0.1.13
    Build Number 12513

    At the time we built the machine (Feb) these were one back from the
    most recent, as the remote console didn't work in the most recent
    version (as reported here:
    http://groups.google.com/group/comp....4a86acef73064a)


  2. Re: X4100 mysterious failure

    On Mar 26, 2:56 am, "Tim Bradshaw" wrote:
    > We have an X4100 which last week seems to have turned itself off under
    > somewhat mysterious circumstances.
    >
    > Has anyone seen anything like this?


    Hi Tim,

    Did you ever resolve this? I'm having exactly the same problem. The
    x4100 has powered itself off about five times already.

    Regards,
    Gavin


  3. Re: X4100 mysterious failure

    On 2007-04-13 17:48:41 +0100, kgmathias@gmail.com said:

    > Did you ever resolve this? I'm having exactly the same problem. The
    > x4100 has powered itself off about five times already.


    No. We now have a case open with Sun (it's done it once more since
    then). If you mail me I can send you the ID on Monday and you could
    cross reference it. The engineer thinks it may be a thermal trip issue,
    which is plausible, except that the machine is completely idle when it
    dies, so nothing that might make it hot is running (and it's in a
    decently cooled room). We're going to try VTS on Monday.

    If you can (if the machine is new), I'd argue that this is a warranty
    case and you just want a new one. We might try to do that (we should),
    but there's a vast bureaucracy in the way, this being a large company
    (I think the machine was bought in France even though we're in the UK,
    so you can see the issues).

    It's spectacularly annoying because we can't really deploy it now it's
    done this to us as we'd be relying on it a bit.

    --tim


  4. Re: X4100 mysterious failure

    On 13 Apr 2007 09:48:41 -0700, kgmathias@gmail.com wrote:

    >Hi Tim,
    >
    >Did you ever resolve this? I'm having exactly the same problem. The
    >x4100 has powered itself off about five times already.
    >
    >Regards,
    >Gavin



    I had a Compaq box do something similar to me: a DL590/64 with 2 power
    supplies, running Debian. It would turn itself off at random times.
    Turn it back on and it was okay. For a while. The machine was rather
    heavily loaded (CPU usage >90%), running 4 instances of the BOINC
    SETI@Home application and little else.

    It went from turning itself off once every few days to turning itself
    off every few hours as the condition continued to deteriorate. The
    only other symptom was the DC powered fans seemed to slow down
    _slightly_ just a couple seconds before the power off 'click'. Since
    the machine was an eBay purchase (I spent more on freight than I did
    on the computer) I spent some time troubleshooting it myself.

    It turned out that one of the power supplies had a bad output filter
    capacitor. It would go to a near-shorted condition after a time, then
    recover functionality after a power cycle, then go shorted again after
    a while. When the 'bad' supply started acting up it would drag down
    the 48V buss supply until voltage dropped to about 30V and the system
    would quietly click itself off.

    I changed out the 'bad' capacitor and a couple more questionable
    looking electrolytic capacitors in the 'bad' power supply and the
    machine is now fine. For insurance, I changed the same electrolytics
    in the 'good' power supply. I spent $15 in parts from Digi-Key and 4
    or 5 hours time.

    I know we're comparing oranges and pickles here with Sun and Compaq,
    but power supplies are power supplies.... Maybe Sun had a bad batch
    from their vendor.

    Have you got another X4100 you could swap a power supply from
    temporarily? Might save fighting red-tape with Sun to replace the
    whole machine.

    a/k/a Brian


  5. Re: X4100 mysterious failure

    On 2007-04-13 21:51:03 +0100, lost@the.net said:

    > It turned out that one of the power supplies had a bad output filter
    > capacitor. It would go to a near-shorted condition after a time, then
    > recover functionality after a power cycle, then go shorted again after
    > a while. When the 'bad' supply started acting up it would drag down
    > the 48V buss supply until voltage dropped to about 30V and the system
    > would quietly click itself off.


    That might make sense as a failure mode - if one PSU fails in such a
    way as to short (or near short) some rail to earth, I guess it could
    cause the other supply to have a tantrum and die (reasonably so, since
    it can't keep the machine alive).

    > Have you got another X4100 you could swap a power supply from
    > temporarily? Might save fighting red-tape with Sun to replace the
    > whole machine.


    That might be a good approach, yes (we do have another one).

    --t


  6. Re: X4100 mysterious failure

    On Apr 13, 7:52 pm, Tim Bradshaw wrote:
    > On 2007-04-13 17:48:41 +0100, kgmath...@gmail.com said:
    >
    > > Did you ever resolve this? I'm having exactly the same problem. The
    > > x4100 has powered itself off about five times already.

    >
    > No. We now have a case open with Sun (it's done it once more since
    > then). If you mail me I can send you the ID on Monday and you could
    > cross reference it. The engineer thinks it may be a thermal trip issue
    > which is plausible except that the machine is completely idle when it
    > dies so anything that might make it hot won't be (and it's in a
    > decently cooled room). We're going to try VTS on Monday.
    >


    For what it's worth: we did try VTS, and I can reliably kill the
    machine in an hour or so (with symptoms identical to the original
    mysterious death) by running the CPU tests. Indications are that
    it's some thermal issue. So I'd try VTS on it, run the CPU stress
    tests with the number of instances set much higher than the default
    (we're using 10, up from 2), and you might find something interesting.

    --tim
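
If the stress test above really does trigger a thermal trip, it may help to keep a record of the temperature readings leading up to the power-off. The following is only a rough sketch, not something anyone in the thread ran. It assumes `ipmitool` is installed and can talk to the service processor, and that `ipmitool sdr type Temperature` prints pipe-separated rows ending in a reading such as `45 degrees C`. The sensor names in the sample text and the log path in the usage note are made up for illustration.

```python
import os
import subprocess
import time
from datetime import datetime


def parse_temperature_rows(text):
    """Extract (sensor, reading) pairs from pipe-separated sensor rows.

    Rows without a numeric "degrees C" reading are skipped.
    """
    rows = []
    for line in text.splitlines():
        cols = [c.strip() for c in line.split("|")]
        if len(cols) >= 5 and cols[-1].endswith("degrees C"):
            rows.append((cols[0], float(cols[-1].split()[0])))
    return rows


def log_temperatures(path, interval_seconds=10):
    """Append timestamped readings to `path` until interrupted.

    Each batch is flushed and fsync'ed so that the last readings are
    more likely to survive an abrupt power-off.
    """
    with open(path, "a") as out:
        while True:
            result = subprocess.run(
                ["ipmitool", "sdr", "type", "Temperature"],
                capture_output=True, text=True, check=False,
            )
            stamp = datetime.now().isoformat(timespec="seconds")
            for sensor, reading in parse_temperature_rows(result.stdout):
                out.write(f"{stamp} {sensor} {reading}\n")
            out.flush()
            os.fsync(out.fileno())
            time.sleep(interval_seconds)


# Hypothetical sample text, only to exercise the parser.
SAMPLE_OUTPUT = (
    "example.t_cpu | 01h | ok | 3.0 | 45 degrees C\n"
    "example.t_amb | 02h | ns | 7.0 | No Reading\n"
)

print(parse_temperature_rows(SAMPLE_OUTPUT))
```

To use it during a stress run, one would call something like `log_temperatures("/var/tmp/temps.log")` in a separate session before starting the CPU tests, then look at the tail of the file after the machine comes back.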


  7. Re: X4100 mysterious failure

    Hi Tim,
    Just wondering if you have found any solution to your X4100 mysterious
    failure. We seem to be plagued with the same problem recently. Did
    VTS give you any helpful information? Did you end up getting new
    power supplies or a new server?

    Our machine has only failed between 1AM and 6AM, which means there is no
    stress on it. Could it be a temp sensor issue?
    Thanks for your help in advance.
    cheers,
    Amin


  8. Re: X4100 mysterious failure

    On Jul 26, 12:07 pm, "AMV" wrote:
    > Our machine has only failed between 1AM and 6AM, which means there is no
    > stress on it. Could it be a temp sensor issue?


    I am seeing exactly the same problem on an X4200. No load on the
    machine, decently cooled area (housing other servers and AC, etc),
    mysterious power-offs occurring, generally every 2-3 weeks. The
    machine shuts down cleanly but this is obviously not a situation we
    want to continue.

    Did anyone definitively establish whether this was a temperature trip or
    just faulty PSUs (but why would they both go at the same time)?

    Thanks



  9. Re: X4100 mysterious failure

    On Aug 2, 10:15 am, RobT wrote:
    > I am seeing exactly the same problem on an X4200. No load on the
    > machine, decently cooled area (housing other servers and AC, etc),
    > mysterious power-offs occurring, generally every 2-3 weeks.


    A Sun engineer had me update the ILOM firmware to the latest revision
    (1.1.8 at the time of writing), as earlier versions were buggy. I guess
    we have to wait a while to see if that does the trick.


  10. Re: X4100 mysterious failure

    Keep us posted. I'd like to know if that does the trick.

  11. Re: X4100 mysterious failure

    I think I have the same problem as yours.

    We have two X4100M2 servers delivered one month ago. They came with
    the latest SP firmware 1.1.8 and BIOS version 039. The servers are
    connected to an APC UPS.

    We noticed the problem one day when the two servers were switched off
    mysteriously at the same time. The UPS log showed that there was a power
    outage that lasted for just a second. The service processor logs also
    recorded something like this:
    ID = 1100 : Wed Sep 19 16:51:51 2007 : IPMI : Log : critical : ID = 1f5 : 09/19/2007 : 16:51:51 : Power Supply : ps0.pwrok : State Deasserted
    ID = 1101 : Wed Sep 19 16:51:53 2007 : IPMI : Log : critical : ID = 1f6 : 09/19/2007 : 16:51:53 : Power Supply : ps1.pwrok : State Deasserted

    The same UPS had other loads, including an Ethernet switch and a SPARC
    workstation. All were OK and did not reboot or switch themselves off.

    We simulated a power outage again by disconnecting the UPS from the main
    power line. The two X4100M2 servers simply switched themselves off at
    the instant the UPS transferred from main line power to battery power.

    Strangely the SP processors of the two servers were not reset or
    rebooted. Our login sessions were still there.

    So it appeared to me that the power supplies of the servers survived the
    power outage, but there was some kind of circuitry that switched the
    servers off when abnormal power conditions were detected.

    I have reported it to Sun and am waiting for their response.
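
For anyone comparing their own logs against this report, here is a small sketch that picks out cases where both supplies lose `pwrok` within a few seconds of each other. It assumes each SP log entry sits on one line with fields separated by ` : `, as in the two entries quoted in the post above. The `SpEvent` type, the function names and the 5-second window are illustrative choices, not anything defined by the ILOM.

```python
from datetime import datetime
from typing import NamedTuple


class SpEvent(NamedTuple):
    log_id: int
    when: datetime
    severity: str
    sensor: str
    state: str


def parse_sp_line(line: str) -> SpEvent:
    """Parse one SP log entry of the shape quoted in the post."""
    fields = [f.strip() for f in line.split(" : ")]
    if len(fields) != 11:
        raise ValueError(f"unexpected field count {len(fields)}: {line!r}")
    log_id = int(fields[0].split("=")[1])
    # Use the numeric date and time fields, which do not depend on locale.
    when = datetime.strptime(f"{fields[6]} {fields[7]}", "%m/%d/%Y %H:%M:%S")
    return SpEvent(log_id, when, fields[4], fields[9], fields[10])


def simultaneous_pwrok_loss(events, window_seconds=5):
    """Return (sensor_a, sensor_b, gap_seconds) for consecutive pwrok
    deassertions on different supplies that fall within the window."""
    drops = sorted(
        (e for e in events
         if e.sensor.endswith(".pwrok") and "Deasserted" in e.state),
        key=lambda e: e.when,
    )
    pairs = []
    for first, second in zip(drops, drops[1:]):
        gap = (second.when - first.when).total_seconds()
        if first.sensor != second.sensor and gap <= window_seconds:
            pairs.append((first.sensor, second.sensor, gap))
    return pairs


# The two entries from the post, each joined onto a single line.
SAMPLE = [
    "ID = 1100 : Wed Sep 19 16:51:51 2007 : IPMI : Log : critical : "
    "ID = 1f5 : 09/19/2007 : 16:51:51 : Power Supply : ps0.pwrok : "
    "State Deasserted",
    "ID = 1101 : Wed Sep 19 16:51:53 2007 : IPMI : Log : critical : "
    "ID = 1f6 : 09/19/2007 : 16:51:53 : Power Supply : ps1.pwrok : "
    "State Deasserted",
]

events = [parse_sp_line(line) for line in SAMPLE]
print(simultaneous_pwrok_loss(events))
```

Both supplies dropping within a couple of seconds, while the SP itself stays up, points towards something upstream of or common to both supplies rather than a single failing PSU.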

  12. Re: X4100 mysterious failure

    On Sep 20, 3:48 am, WS Chan wrote:
    > We have two X4100M2 servers delivered one month ago. They came with
    > latest SP firmware 1.1.8 and BIOS version 039.


    My case number is 65598735 (same problem for a X4200 M2).
    I have run LOTS of diagnostics from Sun and this is still not
    resolved.

