Trying to identify why a system rebooted by itself - SCO
-
Trying to identify why a system rebooted by itself
My client's system has started throwing the following message
on the system console (retyped from description read over
the phone):
WARNING: allocb failed - NSTRPAGES exceeded
I read TA116684 on how to debug failures and the
items on kernel tuning and drivers don't seem to be
applicable as this just started happening.
4. Failing hardware
5. External network hardware misbehaving
6. Extremely high network traffic
Items 4, 5, and 6 seem to be possible candidates.
The client called today when he simply rebooted the system
(as that was what I was advising him to do on previous
occasions) and then the system spontaneously rebooted about
an hour later.
I connected via SSH to the running system and as I was
monitoring it, it rebooted again.
I found the following in /usr/adm/syslog:
Mar 7 14:55:05 vetreal bootpd[1359]: IP address not found: 192.168.160.143
Mar 7 15:00:53 vetreal TLW param1=-1
Fri Mar 7 15:00:53 CST 2008 reboot initated
Mar 7 15:03:44 vetreal syslogd: restart
The two odd things that jump out above are TLW param1=-1
and "reboot initiated", both at 15:00:53.
Does anyone recognize these two entries that seem to be related?
I've never seen a system log with the message "YYYY reboot initiated".
Checking /usr/adm/syslog:
# grep "2008 reboot initated" /usr/adm/syslog
Mon Jan 28 13:19:20 CST 2008 reboot initated
Sat Feb 16 13:35:05 CST 2008 reboot initated
Mon Feb 25 17:30:05 CST 2008 reboot initated
Wed Mar 5 16:37:54 CST 2008 reboot initated
Fri Mar 7 12:38:13 CST 2008 reboot initated
Fri Mar 7 14:09:16 CST 2008 reboot initated
Fri Mar 7 14:16:23 CST 2008 reboot initated
Fri Mar 7 15:00:53 CST 2008 reboot initated
# grep "2007 reboot initated" /usr/adm/syslog
Tue May 1 00:04:44 CDT 2007 reboot initated
# grep "2006 reboot initated" /usr/adm/syslog
# grep "2005 reboot initated" /usr/adm/syslog
#
Syslog starts Jan 31 2005.
So it has happened before, but never as often as it
has today.
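The per-year greps above can be collapsed into one pass. The following is only a sketch: count_reboots is a made-up helper name, and it assumes the entries keep the date(1) layout shown above (month in field 2, year in field 6) and that the local awk supports associative arrays.

```shell
# Count "reboot initated" entries per year and month.
# Assumes lines look like:
#   Fri Mar 7 15:00:53 CST 2008 reboot initated
# so field 2 is the month and field 6 is the year.
# count_reboots is a hypothetical helper name, not a system command.
count_reboots() {
    grep 'reboot initated' "$1" |
        awk '{ n[$6 " " $2]++ } END { for (k in n) print k, n[k] }' |
        sort
}

# Usage: count_reboots /usr/adm/syslog
```

Each output line is "year month count", which makes a sudden jump in frequency easy to spot.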
--
Steve Fabac
S.M. Fabac & Associates
816/765-1670
-
Re: Trying to identify why a system rebooted by itself
Is this on an HP or Compaq system, with the full EFS installed?
I've seen similar things from cpqmon, the EFS health monitor.
But nothing matching that particular entry. Do you have some other HW
monitor installed?
Is 'initiated' really misspelled 'initated' that way? Might be
worthwhile to run strings on binaries looking for 'reboot' or that
particular mis-spelling of initiated.
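One way to run that search, as a rough sketch: instead of piping strings through grep file by file, this uses grep -l, which only prints the names of files containing the text. find_string is a made-up helper name, and the directories in the usage line are guesses at where such a script or binary might live.

```shell
# List files under the given directories that contain the given text.
# grep -l prints only the names of matching files.
# find_string is a hypothetical helper name.
find_string() {
    pattern=$1
    shift
    find "$@" -type f -exec grep -l "$pattern" {} \; 2>/dev/null
}

# Usage (directory list is an assumption):
#   find_string 'reboot initated' /etc /usr/local /opt
```

For a single suspect binary, strings file | grep reboot does the same job.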
--
----------------------------------------------------
Pat Welch, UBB Computer Services, a WCS Affiliate
SCO Authorized Partner
Microlite BackupEdge Certified Reseller
Unix/Linux/Windows/Hardware Sales/Support
(209) 745-1401 Cell: (209) 251-9120
E-mail: patubb@inreach.com
----------------------------------------------------
-
Re: Trying to identify why a system rebooted by itself
On Mar 7, 10:02 pm, "Steve M. Fabac, Jr." wrote:
> My clients system has started throwing the following message
> on the system console (retyped from description read over
> the phone):
>
> WARNING: allocb failed - NSTRPAGES exceeded
>
> I read TA116684 on how to debug failures and the
> items on kernel tuning and drivers don't seem to be
> applicable as this just started happening.
Might look at http://aplawrence.com/SCOFAQ/FAQ_scotec1haltcatch.html
which includes some suggestions Bela made way back when.
-
Re: Trying to identify why a system rebooted by itself
Pat Welch wrote:
> Is this on an HP or Compaq system, with the full EFS installed?
>
> I've seen similar things from cpqmon, the EFS health monitor.
>
> But nothing matching that particular entry. Do you have some other HW
> monitor installed?
>
> Is 'initiated' really misspelled 'initated' that way? Might be
> worthwhile to run strings on binaries looking for 'reboot' or that
> particular mis-spelling of initiated.
>
That was a good suggestion. I found the offending script in:
#
# Purpose:
# To install and umpgrade the Fault-Freedom II Driver
# Description:
# arg 1: Name of step to perform.
# arg 2: Keyword list, e.g. UPGRADE.
# arg 3: space-separated list of packages.
#*****************************************************************
LOGFILE=/tmp/ff2install.log
INSTALL_DIR=/usr/local/ff2
HALTSYS=`l -Wv /etc/haltsys | awk '{ print $11 }'`
SHUTDOWN=`l -Wv /etc/shutdown | awk '{ print $11 }'`
INITFILE=`l -Wv /etc/inittab | awk '{ print $11 }'`
INITBASE=/etc/conf/cf.d/init.base
....
###############
# /etc/reboot #
###############
grep Ff2 ${HALTSYS} > /dev/null 2>&1
if [ "$?" = "1" ]
then
ex_cmd cp ${HALTSYS} /etc/haltsys.preff2
sed -e '/^haltsys/a\
[ -x /usr/local/ff2/bin/Ff2 ] && \
{\
/usr/local/ff2/bin/Ff2 shutdown\
DATE=\`date\`\
echo "${DATE} reboot initated" >> /usr/adm/syslog\
}' < /tmp/haltsys$$ > ${HALTSYS}
ex_cmd rm /tmp/haltsys$$
ex_cmd chmod 700 ${HALTSYS}
ex_cmd chown root ${HALTSYS}
ex_cmd chgrp sys ${HALTSYS}
fi
"ccs" line 250 of 649 --38%--
:!pwd
/opt/K/1776/FF2/1.0.2W/cntl/packages/FF2
Fault-Freedom II from 1776 Software is installed but
not running (no heartbeat connection).
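To confirm what the installer left behind in haltsys, something like the following could be used. It is a sketch only: show_ff2_hook is a made-up name, and it assumes the backup copy /etc/haltsys.preff2 made by the install script above is still in place.

```shell
# Show the Ff2 lines in haltsys and how it differs from the copy the
# installer saved before editing it.
# show_ff2_hook is a hypothetical helper name; the default paths are
# taken from the install script excerpt and may not match every system.
show_ff2_hook() {
    current=${1:-/etc/haltsys}
    backup=${2:-/etc/haltsys.preff2}
    grep -n 'Ff2' "$current"
    diff "$backup" "$current"
}

# Usage: show_ff2_hook
```

The diff output should contain the added block, including the echo that writes the "reboot initated" line to /usr/adm/syslog.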
It's a long story. When I was called in July 2001, the
client was trying to get FF2 working but whenever they
tried to manually fail to the backup server, the system
would lock up. So it was installed but disabled from
starting all its daemons.
I wrote scripts to manually switch identities between the
two machines and mirror the data directories overnight.
Now I've got to look to see what is still running and why
it has suddenly shut the server down four times in one day.
What we did find was the Cisco switches were showing a
lot of chatter. After hours, the client powered the
switches down and then back on and the chatter subsided.
netstat -m, run today after the Cisco switches were rebooted on Friday,
shows all zeros in the fail column, where before it was showing
300+ failures in multiple buffers.
So it looks like
>> 5. External network hardware misbehaving
is the correct assessment.
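For keeping an eye on this, a small filter can pick out only the buffer rows with failures. This is a sketch that assumes the fail count is the last column of each netstat -m row; nonzero_fails is a made-up name, and the column layout should be checked against the real output first.

```shell
# Print only the rows whose last column is a number greater than zero.
# Assumes the last column of each `netstat -m` row is the fail count.
# nonzero_fails is a hypothetical helper name.
nonzero_fails() {
    awk '$NF ~ /^[0-9]+$/ && $NF + 0 > 0'
}

# Usage: netstat -m | nonzero_fails
```

No output means every fail counter is zero.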
--
Steve Fabac
S.M. Fabac & Associates
816/765-1670