Trying to identify why a system rebooted by itself - SCO


  1. Trying to identify why a system rebooted by itself

    My client's system has started throwing the following message
    on the system console (retyped from description read over
    the phone):

    WARNING: allocb failed - NSTRPAGES exceeded

    I read TA116684 on how to debug these failures. The items on
    kernel tuning and drivers don't seem to be applicable, as this
    just started happening. That leaves:

    4. Failing hardware

    5. External network hardware misbehaving

    6. Extremely high network traffic

    Items 4, 5, and 6 seem to be possible candidates.

    The client called today: he had simply rebooted the system
    (as I had advised him to do on previous occasions), and then
    the system spontaneously rebooted about an hour later.

    I connected via SSH to the running system and as I was
    monitoring it, it rebooted again.

    I found the following in /usr/adm/syslog:
    Mar 7 14:55:05 vetreal bootpd[1359]: IP address not found: 192.168.160.143
    Mar 7 15:00:53 vetreal TLW param1=-1
    Fri Mar 7 15:00:53 CST 2008 reboot initated
    Mar 7 15:03:44 vetreal syslogd: restart

    The two odd things that jump out above are "TLW param1=-1"
    and "reboot initated", both at 15:00:53.

    Does anyone recognize these two entries, which seem to be related?

    I've never seen a system log with the message "YYYY reboot initated".

    Checking /usr/adm/syslog:

    # grep "2008 reboot initated" /usr/adm/syslog
    Mon Jan 28 13:19:20 CST 2008 reboot initated
    Sat Feb 16 13:35:05 CST 2008 reboot initated
    Mon Feb 25 17:30:05 CST 2008 reboot initated
    Wed Mar 5 16:37:54 CST 2008 reboot initated
    Fri Mar 7 12:38:13 CST 2008 reboot initated
    Fri Mar 7 14:09:16 CST 2008 reboot initated
    Fri Mar 7 14:16:23 CST 2008 reboot initated
    Fri Mar 7 15:00:53 CST 2008 reboot initated
    # grep "2007 reboot initated" /usr/adm/syslog
    Tue May 1 00:04:44 CDT 2007 reboot initated
    # grep "2006 reboot initated" /usr/adm/syslog
    # grep "2005 reboot initated" /usr/adm/syslog
    #

    Syslog starts Jan 31 2005.

    So this has happened before, but never as often as it
    has today.
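
The per-year greps above can be folded into a single pass. This is only a minimal sketch: it assumes the entries keep the date-style layout shown in the log (year in the sixth field) and that the log lives at /usr/adm/syslog. The function name count_reboots_by_year is made up for illustration.

```shell
#!/bin/sh
# Count "reboot initated" entries per year in a syslog file.
# The misspelling is deliberate: it is the literal text written to the log.
# Expected line layout: Fri Mar 7 15:00:53 CST 2008 reboot initated
count_reboots_by_year() {
    # $1: path to the log file
    awk '/reboot initated$/ { n[$6]++ }
         END { for (y in n) print y, n[y] }' "$1" | sort
}

# Assumed default location; override with LOG=/some/other/file
LOG=${LOG:-/usr/adm/syslog}
if [ -r "$LOG" ]; then
    count_reboots_by_year "$LOG"
fi
```

Against the entries listed above, this should report one reboot for 2007 and eight for 2008.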



    --
    Steve Fabac
    S.M. Fabac & Associates
    816/765-1670

  2. Re: Trying to identify why a system rebooted by itself

    Is this on an HP or Compaq system, with the full EFS installed?

    I've seen similar things from cpqmon, the EFS health monitor.

    But nothing matching that particular entry. Do you have some other HW
    monitor installed?

    Is 'initiated' really misspelled 'initated' that way? It might be
    worthwhile to run strings on binaries, looking for 'reboot' or that
    particular misspelling of 'initiated'.
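
One way to run that search, as a rough sketch only. It uses grep -l rather than strings, on the assumption that the local grep copes with binary files as well as scripts; if it does not, piping strings into grep per file is the fallback. The function name and the directories in the usage comment are guesses, not known locations.

```shell
#!/bin/sh
# List every regular file under the given directories that contains
# a literal string (matched as fixed text, not as a regular expression).
find_files_with_string() {
    # $1: literal string to look for; remaining args: directories to search
    needle=$1
    shift
    find "$@" -type f -exec grep -l -F -- "$needle" {} \; 2>/dev/null
}

# Example call (directories are assumptions):
#   find_files_with_string 'reboot initated' /etc /usr/local /opt
```

Searching for the misspelled form keeps the result list short, since correctly spelled "reboot initiated" strings will not match it.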

    --
    ----------------------------------------------------
    Pat Welch, UBB Computer Services, a WCS Affiliate
    SCO Authorized Partner
    Microlite BackupEdge Certified Reseller
    Unix/Linux/Windows/Hardware Sales/Support
    (209) 745-1401 Cell: (209) 251-9120
    E-mail: patubb@inreach.com
    ----------------------------------------------------

  3. Re: Trying to identify why a system rebooted by itself

    On Mar 7, 10:02 pm, "Steve M. Fabac, Jr." wrote:
    > My clients system has started throwing the following message
    > on the system console (retyped from description read over
    > the phone):
    >
    > WARNING: allocb failed - NSTRPAGES exceeded
    >
    > I read TA116684 on how to debug failures and the
    > items on kernel tuning and drivers don't seem to be
    > applicable as this just started happening.


    You might look at http://aplawrence.com/SCOFAQ/FAQ_scotec1haltcatch.html
    which includes some suggestions Bela made way back when.



  4. Re: Trying to identify why a system rebooted by itself

    Pat Welch wrote:
    >
    > Is this on an HP or Compaq system, with the full EFS installed?
    >
    > I've seen similar things from cpqmon, the EFS health monitor.
    >
    > But nothing matching that particular entry. Do you have some other HW
    > monitor installed?
    >
    > Is 'initiated' really misspelled 'initated' that way? Might be
    > worthwhile to run strings on binaries looking for 'reboot' or that
    > particular mis-spelling of initiated.
    >


    That was a good suggestion. I found the offending script in:

    #
    # Purpose:
    # To install and umpgrade the Fault-Freedom II Driver
    # Description:
    # arg 1: Name of step to perform.
    # arg 2: Keyword list, e.g. UPGRADE.
    # arg 3: space-separated list of packages.
    #*****************************************************************

    LOGFILE=/tmp/ff2install.log
    INSTALL_DIR=/usr/local/ff2
    HALTSYS=`l -Wv /etc/haltsys | awk '{ print $11 }'`
    SHUTDOWN=`l -Wv /etc/shutdown | awk '{ print $11 }'`
    INITFILE=`l -Wv /etc/inittab | awk '{ print $11 }'`
    INITBASE=/etc/conf/cf.d/init.base
    ....

    ###############
    # /etc/reboot #
    ###############
    grep Ff2 ${HALTSYS} > /dev/null 2>&1
    if [ "$?" = "1" ]
    then
    ex_cmd cp ${HALTSYS} /etc/haltsys.preff2
    sed -e '/^haltsys/a\
    [ -x /usr/local/ff2/bin/Ff2 ] && \
    {\
    /usr/local/ff2/bin/Ff2 shutdown\
    DATE=\`date\`\
    > echo "${DATE} reboot initated" >> /usr/adm/syslog\

    }' < /tmp/haltsys$$ > ${HALTSYS}
    ex_cmd rm /tmp/haltsys$$
    ex_cmd chmod 700 ${HALTSYS}
    ex_cmd chown root ${HALTSYS}
    ex_cmd chgrp sys ${HALTSYS}
    fi
    "ccs" line 250 of 649 --38%--
    :!pwd
    /opt/K/1776/FF2/1.0.2W/cntl/packages/FF2

    Fault-Freedom II (FF2) from 1776 Software is installed but
    not running (no heartbeat connection).

    It's a long story. When I was called in July 2001, the
    client was trying to get FF2 working, but whenever they
    tried to manually fail over to the backup server, the system
    would lock up. So it was left installed, with all of its
    daemons disabled from starting.

    I wrote scripts to manually switch identities between the
    two machines and mirror the data directories overnight.

    Now I've got to look to see what is still running and why
    it has suddenly shut the server down four times in one day.
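
A first pass at seeing what FF2 left behind could reuse the installer's own test (grep Ff2 against the target file). This is a sketch under assumptions: the installer excerpt only shows the haltsys hook, so checking /etc/shutdown and /etc/inittab as well is a guess based on the variables it sets up, and has_ff2_hook is a name invented here.

```shell
#!/bin/sh
# Report whether a file mentions Ff2, mirroring the installer's own
# check: grep Ff2 ${HALTSYS}
has_ff2_hook() {
    # $1: file to check, e.g. /etc/haltsys
    grep Ff2 "$1" >/dev/null 2>&1
}

# Only /etc/haltsys is confirmed by the excerpt; the other two are assumptions.
for f in /etc/haltsys /etc/shutdown /etc/inittab; do
    if has_ff2_hook "$f"; then
        echo "$f: Ff2 hook present"
        grep -n Ff2 "$f"
    fi
done
```

The installer saved the original as /etc/haltsys.preff2, so comparing that copy against the current file would show exactly what was added.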

    What we did find was that the Cisco switches were showing a
    lot of chatter. After hours, the client powered the
    switches down and then back on, and the chatter subsided.

    netstat -m, run today after rebooting the Cisco switches on Friday,
    shows all zeros in the fail column, where before they were
    getting 300+ in multiple buffers.

    So it looks like
    >> 5. External network hardware misbehaving

    is the correct assessment.



    --
    Steve Fabac
    S.M. Fabac & Associates
    816/765-1670
