V880: going down on signal 15 - SUN


Thread: V880: going down on signal 15

  1. V880: going down on signal 15

    On returning from vacation I found the V880 down (Solaris 8, fully
    patched). The primary TrippLite UPS that powers it showed line
    power present but the inverter was off. It took a couple of hours to
    get the system to come back up because for some obscure reason it
    couldn't see the disks on one of the two Ultra320 channels. (Note that
    the system disk is on the internal fibre-channel, so any Ultra320
    problems should have shown up in /var/adm/messages prior to the crash,
    but didn't.) There may have been a stuck bit somewhere as it finally
    resolved after leaving both the server and the external JBOD powered off
    for several minutes.

    Now the mystery: why did the V880 go down? The only hint is the last 5
    lines of /var/adm/messages which were all logged within 1 second of each
    other starting at 07:56:15 and were:

    gec pseudo: [ID 129642 kern.info] pseudo-device: tod0
    gec genunix: [ID 936769 kern.info] tod0 is /pseudo/tod@0
    gec pseudo: [ID 129642 kern.info] pseudo-device: pm0
    gec genunix: [ID 936769 kern.info] pm0 is /pseudo/pm@0
    gec syslogd: going down on signal 15

    The messages preceding these were logged over 2 hours earlier, for a
    corrected memory error. The ECC catches these once a week or so and
    spews a lot of lines into the message log, but they have not previously
    been correlated with any crash.

    The PowerAlertPlus logs didn't indicate any power events, and
    specifically, didn't indicate that the power monitoring software
    shut down the system. However, it looked as though it might have,
    given the "off" state of the UPS. Then again, when the Solaris boxes
    go down they lose control of their serial lines, and the random values
    that come out could conceivably have triggered an inverter shutdown.

    The last message in /var/cron/log prior to the shutdown was
    ! SIGTERM Fri Dec 30 07:56:16 2005
    ! ******* CRON ABORTED ******** Fri Dec 30 07:56:16 2005

    There is no crontab entry for root or any other process that runs at or
    near that time.
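
    A minimal sketch, not part of the original investigation: the cron
    log lines above carry their own timestamps, so they can be extracted
    and lined up against the syslogd entry in /var/adm/messages. The
    function name find_cron_sigterms and the sample list are illustrative
    assumptions; the line format is taken only from the two lines quoted
    above.

```python
from datetime import datetime

# Timestamp layout as it appears in the quoted cron log lines,
# e.g. "Fri Dec 30 07:56:16 2005". Assumes an English/C locale.
CRON_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"
SIGTERM_PREFIX = "! SIGTERM "


def find_cron_sigterms(lines):
    """Return a datetime for every '! SIGTERM <date>' line in lines."""
    stamps = []
    for line in lines:
        text = line.strip()
        if not text.startswith(SIGTERM_PREFIX):
            continue
        raw = text[len(SIGTERM_PREFIX):]
        try:
            stamps.append(datetime.strptime(raw, CRON_TIME_FORMAT))
        except ValueError:
            # Skip lines whose date does not match the assumed layout.
            continue
    return stamps


# The two lines quoted from /var/cron/log in this post.
sample = [
    "! SIGTERM Fri Dec 30 07:56:16 2005",
    "! ******* CRON ABORTED ******** Fri Dec 30 07:56:16 2005",
]

for stamp in find_cron_sigterms(sample):
    print("cron received SIGTERM at", stamp.isoformat())
```

    To run it against the real file, pass an open file object instead of
    the sample list, for example find_cron_sigterms(open("/var/cron/log")).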

    Any suggestions where else to look? There was nothing else obvious
    anywhere in /var/adm.

    Thanks

    David Mathog
    mathog@caltech.edu

  2. Re: V880: going down on signal 15

    David Mathog writes:

    >Now the mystery, why did the V880 go down? The only hint is the last 5
    >lines of /var/adm/messages which were all logged within 1 second of each
    >other starting at 07:56:15 and were:


    >gec pseudo: [ID 129642 kern.info] pseudo-device: tod0
    >gec genunix: [ID 936769 kern.info] tod0 is /pseudo/tod@0
    >gec pseudo: [ID 129642 kern.info] pseudo-device: pm0
    >gec genunix: [ID 936769 kern.info] pm0 is /pseudo/pm@0
    >gec syslogd: going down on signal 15


    This looks like an ordinary shutdown where the shutdown scripts are run.

    >The PowerAlertPlus logs didn't indicate any power events, and
    >specifically, didn't indicate that the power monitoring software
    >shutdown the system. However it looked maybe it had, given the "off"
    >state of the UPS. But since when the Solaris boxes go down
    >they lose control of their serial lines and the random values that come
    >out could conceivably have triggered an inverter shutdown.


    Solaris boxes don't go "down" when that happens; they drop into the
    monitor prompt. See kbd(1m) for a way to disable that feature.

    >There last message in /var/cron/log prior to the shutdown was
    >! SIGTERM Fri Dec 30 07:56:16 2005
    >! ******* CRON ABORTED ******** Fri Dec 30 07:56:16 2005


    >There is no crontab entry for root or any other process that runs at or
    >near that time.


    >Any suggestions where else to look? There was nothing else obvious
    >anywhere in /var/adm.


    last, perhaps?

    It looks very much like a clean shutdown.

    Casper
    --
    Expressed in this posting are my opinions. They are in no way related
    to opinions held by my employer, Sun Microsystems.
    Statements on Sun products included here are not gospel and may
    be fiction rather than truth.

  3. Re: V880: going down on signal 15

    Casper H.S. Dik wrote:
    >
    > This looks like an ordinary shutdown where the shutdown scripts are run.
    >


    Agreed.

    I dug around further and found that PowerAlertPlus had indeed shut the
    system down. At first glance nothing in the logs indicated a power
    problem, but they did show a "UPS On Battery" event (way over to the
    right, scrolled off the screen), and a separate log file said that
    paplus had shut down after going on batteries for a minute (as it is
    supposed to). Odd that it went to batteries, though, since none of the
    other systems in the room logged any power events. TrippLite support
    suggests that the unit may have cut over to batteries because of line
    noise, or because of some glitch in their software. They also just
    revealed that there is a SolarisSparc version of PAP 12 on their
    password-protected ftp site. I'll upgrade to that from the current 10.1
    (from their software web site) and hopefully it will at least eliminate
    the latter possibility.
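
    Since the "UPS On Battery" text was missed only because it sat far to
    the right of a wide log line, a small search that ignores column
    position avoids the problem. This is a rough sketch under assumptions:
    the function name find_phrase is hypothetical, the sample lines are
    made up for illustration, and the real PowerAlertPlus log layout is
    not known here.

```python
def find_phrase(lines, phrase="UPS On Battery"):
    """Return (line_number, column, text) for each line containing phrase.

    The match is case-insensitive and independent of where in the line
    the phrase occurs, so text scrolled off the right edge is still found.
    """
    needle = phrase.lower()
    hits = []
    for number, line in enumerate(lines, start=1):
        column = line.lower().find(needle)
        if column != -1:
            hits.append((number, column, line.rstrip("\n")))
    return hits


# Made-up sample: a very wide line with the event far to the right.
# This is NOT real PowerAlertPlus output.
sample = [
    "x" * 120 + " UPS On Battery",
    "an unrelated line",
]

for number, column, text in find_phrase(sample):
    print("line", number, "column", column, ":", text[column:])
```

    Reporting the column makes it obvious when a hit lies beyond the
    width of a typical terminal window.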

    Thanks,

    David Mathog
    mathog@caltech.edu
