sco 5.0.5 panic crashes - SCO
-
sco 5.0.5 panic crashes
A 5.0.5 box that has run quietly for 30 months has developed the distinctly
un-quiet habit of crashing with panic messages, almost exactly every 73
minutes.
Sorry, I don't have the screen messages. I'm remote from the site and will
need someone to copy them down.
Can anyone suggest what might explain this behavior, or a strategy that
might help isolate the issue?
I'm thinking it must be a hardware issue-- could it be software?
Will sar tell me anything useful?
Any suggestions would be greatly welcomed.
tia
Barry
-
Re: sco 5.0.5 panic crashes
Barry Swane wrote:
> A 5.0.5 box that has run quietly for 30 months has developed the distinctly
> un-quiet habit of crashing with panic messages, almost exactly every 73
> minutes.
> Sorry, I don't have the screen messages. I'm remote from the site and will
> need someone to copy them down.
> Can anyone suggest what might explain this behavior, or a strategy that
> might help isolate the issue?
> I'm thinking it must be a hardware issue-- could it be software?
> Will sar tell me anything useful?
> Any suggestions would be greatly welcomed.
> tia
> Barry
See SCO's http://wdb1.sco.com/kb/showta?taid=106009 and my
http://aplawrence.com/Unixart/trape.html which, though about E traps
specifically, also has general advice.
It's unusual to have that degree of precise repetition.
--
Tony Lawrence
Unix/Linux/Mac OS X resources: http://aplawrence.com
-
Re: sco 5.0.5 panic crashes
"Tony Lawrence" wrote in message
news:i_ednd8ZOLzFv2jfRVn-rg@comcast.com...
> See SCO's http://wdb1.sco.com/kb/showta?taid=106009 and my
> http://aplawrence.com/Unixart/trape.html which, though about E traps
> specifically, also has general advice.
>
> It's unusual to have that degree of precise repetition.
>
> --
> Tony Lawrence
> Unix/Linux/Mac OS X resources: http://aplawrence.com
Thanks so much for the pointers, Tony-- there's a wealth of information
available there, and I really do appreciate you taking the time to point me
towards it.
I have a little more information now, which I will post here, just in case
someone has a thought for me:
The system continues to crash exactly 73 minutes after bootup.
The trap message is:
PANIC: k_trap - kernel mode trap type 0x0000000E
I have had the site log the cs and eip registers-- they are identical every
time:
cs 0X00000158
eip 0XF00A8C14
As I understand things, this almost certainly points to a software issue, as
opposed to a hardware issue, where I would expect random contents in the cs
and eip registers.
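For anyone reading along: trap type 0x0000000E is x86 exception vector 14, the page fault. A minimal sketch of decoding the number from the panic line, assuming a hand-written table of a few standard x86 vectors (the table and the function name are only for illustration, nothing here comes from SCO):

```python
# A few standard x86 exception vectors; deliberately not a complete list.
X86_TRAPS = {
    0x00: "divide error",
    0x06: "invalid opcode",
    0x08: "double fault",
    0x0D: "general protection fault",
    0x0E: "page fault",
}

def describe_trap(trap_type):
    """Turn a trap type as printed in the panic message into a name."""
    number = int(trap_type, 16)  # accepts the leading "0x"
    return X86_TRAPS.get(number, "unknown (0x%02X)" % number)

print(describe_trap("0x0000000E"))  # page fault
```

A page fault in kernel mode means the kernel touched an address that was not mapped, which by itself does not settle the hardware versus software question.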
I have been watching for runaway processes, but I see nothing.
Thanks for any suggestions from anyone
Barry
-
Re: sco 5.0.5 panic crashes
"Barry Swane" wrote in message
news:L9adnZ2dnZ3jPw22nZ2dnRLuat-dnZ2dRVn-052dnZ0@rogers.com...
> Thanks so much for the pointers, Tony-- there's a wealth of information
> available there, and I really do appreciate you taking the time to point
> me towards it.
> I have a little more information now, which I will post here, just in case
> someone has a thought for me:
> The system continues to crash exactly 73 minutes after bootup.
> The trap message is:
> PANIC: k_trap - kernel mode trap type 0x0000000E
>
> I have had the site log the cs and eip registers-- they are identical
> every time:
>
> cs 0X00000158
>
> eip 0XF00A8C14
>
> As I understand things, this almost certainly points to a software issue,
> as opposed to a hardware issue, where I would expect random contents in
> the cs and eip registers.
>
> I have been watching for runaway processes, but I see nothing.
The first thing I do with Trap E errors, if one has happened more than once
within one month, no matter what the CS and EIP registers say, is replace
the memory.
Memory is pretty cheap, and if the system is not full, the extra can't hurt.
This has solved 99% of my Trap E problems.
The remaining 1% was after a reload went kooky; I reloaded immediately and
it has not had a problem for years.
moncho
-
Re: sco 5.0.5 panic crashes
Barry Swane wrote:
> The system continues to crash exactly 73 minutes after bootup.
> The trap message is:
> PANIC: k_trap - kernel mode trap type 0x0000000E
>
> I have had the site log the cs and eip registers-- they are identical every
> time:
>
> cs 0X00000158
>
> eip 0XF00A8C14
>
> As I understand things, this almost certainly points to a software issue, as
> opposed to a hardware issue, where I would expect random contents in the cs
> and eip registers.
It could still be either software or hardware -- a more specific kind of
hardware problem causing a failure at a specific place in the software.
Translate the EIP by doing:
    # crash
    > ts 0XF00A8C14
    > quit
using the same kernel that's been crashing. What is the symbol+offset?
Also, what is the value of CR2 in these crashes? That gives the address
to which the bad reference was made.
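For readers who have not used `ts`: the symbol+offset it reports is the nearest kernel symbol at or below the address, plus the distance past it. A rough sketch of that kind of lookup, assuming a made-up symbol table (the names and addresses below are hypothetical and do not come from any real 5.0.5 kernel):

```python
import bisect

# Hypothetical (address, name) pairs, sorted by address. A real table would
# have to come from the same kernel that has been crashing.
SYMBOLS = [
    (0xF00A8000, "example_func_a"),
    (0xF00A8B00, "example_func_b"),
    (0xF00A9000, "example_func_c"),
]

def symbol_plus_offset(addr, symbols=SYMBOLS):
    """Return 'name+0xoffset' for the last symbol at or below addr."""
    addresses = [a for a, _ in symbols]
    i = bisect.bisect_right(addresses, addr) - 1
    if i < 0:
        return None  # address lies below every known symbol
    base, name = symbols[i]
    return "%s+0x%x" % (name, addr - base)

print(symbol_plus_offset(0xF00A8C14))  # example_func_b+0x114
```

The answer is only meaningful against the matching kernel, since a relink moves the symbols around.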
>Bela<
-
Re: sco 5.0.5 panic crashes
"Bela Lubkin" wrote in message
news:200508081133.aa14834@deepthought.armory.com...
> It could still be either software or hardware -- a more specific kind of
> hardware problem causing a failure at a specific place in the software.
>
> Translate the EIP by doing:
>
>     # crash
>     > ts 0XF00A8C14
>     > quit
>
> using the same kernel that's been crashing. What is the symbol+offset?
> Also, what is the value of CR2 in these crashes? That gives the address
> to which the bad reference was made.
>
> >Bela<
Thanks, everybody, for your contributions. It appears I have managed to
resolve my issue. The fact that I inflicted this upon myself is
embarrassing, but I feel compelled to post the "resolution", after pleading
for help:
I had built a networked printer last Friday at 10am -- software-wise, and
sent out a test job, waiting to be printed when the print server got
connected. Which didn't happen. No big deal, I do it all the time.
Except, that in this case, when I set up the host address in /etc/hosts, I
typed in the address of the sco server itself.
God knows (I sure don't) what happened after that, but it resulted in
remarkably consistent behavior from 5.0.5-- it would corrupt memory totally,
somewhere between 73 and 76 minutes, without fail. Resulting in the panic
crashes I was reporting.
What finally got my fossilized brain kicked into some semblance of
animation was struggling through syslog, poking around, and coming across
this:
Aug 5 10:00:40 amre lpd[4898]: intnetprint: lost connection
Aug 5 10:01:12 amre last message repeated 5 times
Aug 5 10:01:44 amre lpd[4898]: intnetprint: lost connection
Aug 5 10:02:49 amre lpd[4898]: intnetprint: lost connection
Aug 5 10:04:57 amre lpd[4898]: intnetprint: lost connection
Aug 5 10:09:13 amre lpd[4898]: intnetprint: lost connection
And lots more, every 4 or 5 minutes.
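The timestamps in that excerpt can be turned into gaps between retries. A small sketch, assuming all the entries fall on the same day (the function name is only for illustration):

```python
LOG = """\
Aug 5 10:00:40 amre lpd[4898]: intnetprint: lost connection
Aug 5 10:01:12 amre last message repeated 5 times
Aug 5 10:01:44 amre lpd[4898]: intnetprint: lost connection
Aug 5 10:02:49 amre lpd[4898]: intnetprint: lost connection
Aug 5 10:04:57 amre lpd[4898]: intnetprint: lost connection
Aug 5 10:09:13 amre lpd[4898]: intnetprint: lost connection
"""

def lost_connection_gaps(log_text):
    """Seconds between successive 'lost connection' lines (same day assumed)."""
    stamps = []
    for line in log_text.splitlines():
        if "lost connection" not in line:
            continue  # skip "last message repeated N times"
        clock = line.split()[2]  # third field is HH:MM:SS
        hours, minutes, seconds = (int(part) for part in clock.split(":"))
        stamps.append(hours * 3600 + minutes * 60 + seconds)
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]

print(lost_connection_gaps(LOG))  # [64, 65, 128, 256]
```

Over the entries shown, the gaps roughly double each time, which looks like a retry interval backing off toward the 4 or 5 minute spacing seen later.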
So, somewhere along the way, the kernel got fed up with this, and cashed
(cached? -- sorry!) in its (memory?--really sorry!) chips.
I still feel this was a rather diabolical revenge for a simple typing error,
but, I probably deserved it.
Thanks again for the suggestions and help. If nothing else, I did learn
some neat stuff over the weekend.
Barry
-
Re: sco 5.0.5 panic crashes
Barry Swane wrote:
> I had built a networked printer last Friday at 10am -- software-wise, and
> sent out a test job, waiting to be printed when the print server got
> connected. Which didn't happen. No big deal, I do it all the time.
> Except, that in this case, when I set up the host address in /etc/hosts, I
> typed in the address of the sco server itself.
> God knows (I sure don't) what happened after that, but it resulted in
> remarkably consistent behavior from 5.0.5-- it would corrupt memory totally,
> somewhere between 73 and 76 minutes, without fail. Resulting in the panic
> crashes I was reporting.
> What finally got my fossilized brain kicked into some semblance of
> animation was struggling through syslog, poking around, and coming across
> this:
> Aug 5 10:00:40 amre lpd[4898]: intnetprint: lost connection
> Aug 5 10:01:12 amre last message repeated 5 times
> Aug 5 10:01:44 amre lpd[4898]: intnetprint: lost connection
> Aug 5 10:02:49 amre lpd[4898]: intnetprint: lost connection
> Aug 5 10:04:57 amre lpd[4898]: intnetprint: lost connection
> Aug 5 10:09:13 amre lpd[4898]: intnetprint: lost connection
>
> And lots more, every 4 or 5 minutes.
> So, somewhere along the way, the kernel got fed up with this, and cashed
> (cached? -- sorry!) in its (memory?--really sorry!) chips.
> I still feel this was a rather diabolical revenge for a simple typing error,
> but, I probably deserved it.
That shouldn't cause a crash. You've located the trigger but not the
real cause. There's some sort of fragility in your system that should
not exist. It may be that nothing else will ever trigger it -- or you
may be in for future mysterious crashes without such an easily located
trigger...
I have a guess: something about running out of STREAMS buffers. What
you were doing shouldn't have eaten up buffers, but I can imagine how it
might. Do you care to set up the problem scenario again and watch
`netstat -m` output for a while? You can cancel the job before (well
before) the moment of fatality... Run `netstat -m` every few minutes
while the job is building up "lost connection" messages in syslog. Look
at the "streams memory in use" value -- is it rising rapidly? Can you
project that after 73 minutes it will exceed or approximate the "total
configured streams memory" value?
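The projection asked for here is a straight-line extrapolation. A minimal sketch, assuming the "streams memory in use" figures are copied by hand from successive `netstat -m` runs (every number below is hypothetical, and the function name is only for illustration):

```python
def minutes_until_exhausted(samples, total_configured):
    """Estimate when usage reaches total_configured.

    samples: list of (minutes_since_start, bytes_in_use), oldest first.
    Extrapolates linearly from the first to the last sample and returns the
    estimated minute mark, or None if usage is flat or falling.
    """
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    if t1 <= t0 or u1 <= u0:
        return None
    rate = (u1 - u0) / (t1 - t0)  # bytes per minute
    return t1 + (total_configured - u1) / rate

# Hypothetical readings taken every five minutes.
samples = [(0, 100_000), (5, 150_000), (10, 200_000)]
print(minutes_until_exhausted(samples, total_configured=1_000_000))  # 90.0
```

If the estimate from real readings lands near the 73 minute mark, that would support the STREAMS guess; readings that go up and down would not.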
>Bela<
-
Re: sco 5.0.5 panic crashes
"Barry Swane" wrote in message
news:YIednQdaUvUNZGrfRVn-gg@rogers.com...
> Thanks, everybody, for your contributions. It appears I have managed to
> resolve my issue. The fact that I inflicted this upon myself is
> embarrassing, but I feel compelled to post the "resolution", after pleading
> for help:
> I had built a networked printer last Friday at 10am -- software-wise, and
> sent out a test job, waiting to be printed when the print server got
> connected. Which didn't happen. No big deal, I do it all the time.
> Except, that in this case, when I set up the host address in /etc/hosts, I
> typed in the address of the sco server itself.
> God knows (I sure don't) what happened after that, but it resulted in
> remarkably consistent behavior from 5.0.5-- it would corrupt memory
> totally, somewhere between 73 and 76 minutes, without fail. Resulting in
> the panic crashes I was reporting.
> What finally got my fossilized brain kicked into some semblance of
> animation was struggling through syslog, poking around, and coming across
> this:
> Aug 5 10:00:40 amre lpd[4898]: intnetprint: lost connection
> Aug 5 10:01:12 amre last message repeated 5 times
> Aug 5 10:01:44 amre lpd[4898]: intnetprint: lost connection
> Aug 5 10:02:49 amre lpd[4898]: intnetprint: lost connection
> Aug 5 10:04:57 amre lpd[4898]: intnetprint: lost connection
> Aug 5 10:09:13 amre lpd[4898]: intnetprint: lost connection
>
> And lots more, every 4 or 5 minutes.
> So, somewhere along the way, the kernel got fed up with this, and cashed
> (cached? -- sorry!) in its (memory?--really sorry!) chips.
> I still feel this was a rather diabolical revenge for a simple typing
> error, but, I probably deserved it.
> Thanks again for the suggestions and help. If nothing else, I did learn
> some neat stuff over the weekend.
> Barry
I believe Bela is correct that this is not the exact problem with the
system.
I would still stick with replacing the memory.
If I am wrong, the worst it will cost you is about $200. If you are going
down every 73 minutes, then you will have recouped the $200 in 73 minutes.
At least purchase the memory to have on hand, so that if the system does go
down again, you can immediately replace it during the downtime.
Any time I get a trap E more than once in a reasonable amount of time, I
automatically change the memory right off the bat, provided I have not made
any changes to the system recently.
As I said in my previous post, this has worked more often than not.
moncho
-
Re: sco 5.0.5 panic crashes
"Bela Lubkin" wrote in message
news:200508090019.aa21532@deepthought.armory.com...
> That shouldn't cause a crash. You've located the trigger but not the
> real cause. There's some sort of fragility in your system that should
> not exist. It may be that nothing else will ever trigger it -- or you
> may be in for future mysterious crashes without such an easily located
> trigger...
>
> I have a guess: something about running out of STREAMS buffers. What
> you were doing shouldn't have eaten up buffers, but I can imagine how it
> might. Do you care to set up the problem scenario again and watch
> `netstat -m` output for a while? You can cancel the job before (well
> before) the moment of fatality... Run `netstat -m` every few minutes
> while the job is building up "lost connection" messages in syslog. Look
> at the "streams memory in use" value -- is it rising rapidly? Can you
> project that after 73 minutes it will exceed or approximate the "total
> configured streams memory" value?
>
> >Bela<
Bela, I tried to set up the problem scenario again, to monitor netstat, as
you suggested. I can't explain why, but I don't seem to be able to recreate
the same scenario. I did the same as before, and monitored netstat for 50
minutes. I saw some fluctuations, but nothing that would imply it getting
out of control, and it was up and down. I also did not see the same
messages in syslog, which leads me to believe that I didn't recreate the
problem.
I did this during the evening, with no one else on the system; perhaps that
explains why the results were different.
Interestingly, syslog shows the following entries this evening:
Aug 9 21:55:28 amre lpd[4379]: torinvoice: lost connection
Aug 9 21:55:28 amre lpd[4379]: restarting torinvoice
Aug 9 22:15:30 amre lpd[4379]: torinvoice: lost connection
Aug 9 22:15:30 amre lpd[4379]: restarting torinvoice
Aug 9 22:24:39 amre lpd[4379]: torinvoice: lost connection
Aug 9 22:24:39 amre lpd[4379]: restarting torinvoice
Aug 9 22:33:42 amre lpd[4379]: torinvoice: lost connection
Aug 9 22:33:42 amre lpd[4379]: restarting torinvoice
Aug 9 22:42:46 amre lpd[4379]: torinvoice: lost connection
Aug 9 22:42:46 amre lpd[4379]: restarting torinvoice
Aug 9 22:51:56 amre lpd[4379]: torinvoice: lost connection
Aug 9 22:51:56 amre lpd[4379]: restarting torinvoice
This for an entirely different networked printer, for which I believe there
is no problem, other than the printer is just not ready to print. (I don't
really believe there is a problem reaching this print server, but I don't
know that for sure)
These syslog entries are similar-- but not identical-- to the ones I found
the other day.
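One way to pin down "similar but not identical" is to list which distinct lpd messages each day's entries contain. A small sketch, assuming the syslog line layout seen in this thread (the regular expression and function name are only for illustration, and only the first lines of each excerpt are included):

```python
import re
from collections import defaultdict

# Layout assumed from the excerpts: "Mon D HH:MM:SS host lpd[pid]: text"
LINE = re.compile(
    r"^(?P<month>\w{3}) +(?P<day>\d+) (?P<time>\d\d:\d\d:\d\d) "
    r"(?P<host>\S+) lpd\[(?P<pid>\d+)\]: (?P<text>.*)$"
)

def message_kinds(log_text):
    """Map each date (e.g. 'Aug 9') to the set of distinct lpd messages."""
    kinds = defaultdict(set)
    for line in log_text.splitlines():
        m = LINE.match(line.strip())
        if m:  # non-lpd lines such as "last message repeated" are skipped
            kinds[f"{m['month']} {m['day']}"].add(m["text"])
    return dict(kinds)

LOG = """\
Aug 5 10:00:40 amre lpd[4898]: intnetprint: lost connection
Aug 5 10:01:12 amre last message repeated 5 times
Aug 9 21:55:28 amre lpd[4379]: torinvoice: lost connection
Aug 9 21:55:28 amre lpd[4379]: restarting torinvoice
"""

for date, texts in message_kinds(LOG).items():
    print(date, sorted(texts))
```

On these excerpts the Aug 9 entries pair every "lost connection" with a "restarting" line, while the Aug 5 entries show only "lost connection".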
I'll see if I can screw up my courage to attempt another test during
production hours.
Barry