sco 5.0.5 panic crashes - SCO

  1. sco 5.0.5 panic crashes

    A 5.0.5 box that has run quietly for 30 months has developed the distinctly
    un-quiet habit of crashing with panic messages, almost exactly every 73
    minutes.
    Sorry, I don't have the screen messages; I'm remote from the site and will
    need someone to copy them down.
    Can anyone suggest what might explain this behavior, or a strategy that
    might help isolate the issue?
    I'm thinking it must be a hardware issue-- could it be software?
    Will sar tell me anything useful?
    Any suggestions would be greatly welcomed.
    tia
    Barry



  2. Re: sco 5.0.5 panic crashes

    See SCO's http://wdb1.sco.com/kb/showta?taid=106009 and my
    http://aplawrence.com/Unixart/trape.html which, though about E traps
    specifically, also has general advice.

    It's unusual to have that degree of precise repetition.

    --
    Tony Lawrence
    Unix/Linux/Mac OS X resources: http://aplawrence.com

  3. Re: sco 5.0.5 panic crashes

    Thanks so much for the pointers, Tony-- there's a wealth of information
    available there, and I really do appreciate you taking the time to point me
    towards it.
    I have a little more information now, which I will post here, just in case
    someone has a thought for me:
    The system continues to crash exactly 73 minutes after bootup.
    The trap message is:
    PANIC: k_trap - kernel mode trap type 0x0000000E

    I have had the site log the cs and eip registers-- they are identical every
    time:

    cs 0X00000158
    eip 0XF00A8C14

    As I understand things, this almost certainly points to a software issue, as
    opposed to a hardware issue, where I would expect random contents in the cs
    and eip registers.

    I have been watching for runaway processes, but I see nothing.

    Thanks for any suggestions from anyone

    Barry





  4. Re: sco 5.0.5 panic crashes

    The first thing I do with Trap E errors, if one has happened more than
    once within one month, no matter what the CS and EIP registers say, is
    replace the memory.

    Memory is pretty cheap, and if the system is not full, the extra can't hurt.

    This has solved 99% of my Trap E problems.

    The other 1% was after a reload went kooky; I reloaded immediately and it
    has not had a problem for years.

    moncho



  5. Re: sco 5.0.5 panic crashes

    Barry Swane wrote:

    > The system continues to crash exactly 73 minutes after bootup.
    > The trap message is:
    > PANIC: k_trap - kernel mode trap type 0x0000000E
    >
    > I have had the site log the cs and eip registers-- they are identical every
    > time
    >
    > cs 0X00000158
    >
    > eip 0XF00A8C14
    >
    > As I understand things, this almost certainly points to a software issue, as
    > opposed to a hardware issue, where I would expect random contents in the cs
    > and eip registers.


    It could still be either software or hardware -- a more specific kind of
    hardware problem causing a failure at a specific place in the software.

    Translate the EIP by doing:

    # crash
    > ts 0XF00A8C14
    > quit


    using the same kernel that's been crashing. What is the symbol+offset?
    Also, what is the value of CR2 in these crashes? That gives the address
    to which the bad reference was made.

    >Bela<
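
For reference, trap type 0x0000000E is x86 exception vector 14, the page fault, which is why the CR2 value (the faulting address) is worth collecting. A minimal Python sketch for decoding the trap number from a hand-copied panic line follows. It assumes the line has the shape quoted in this thread; the function name `decode_trap` and the short vector table are illustrative only and are not part of any SCO tool.

```python
import re

# A subset of the standard x86 exception vectors.
X86_TRAPS = {
    0x00: "divide error",
    0x06: "invalid opcode",
    0x08: "double fault",
    0x0D: "general protection fault",
    0x0E: "page fault",
}

def decode_trap(panic_line):
    """Return (vector, name) for a 'trap type 0x...' panic line, or None."""
    match = re.search(r"trap type\s+0[xX]([0-9A-Fa-f][0-9A-Fa-f\s]*)", panic_line)
    if match is None:
        return None
    # Tolerate a stray space inside the hex number, as in a hand-copied message.
    vector = int("".join(match.group(1).split()), 16)
    return vector, X86_TRAPS.get(vector, "unknown")

print(decode_trap("PANIC: k_trap - kernel mode trap type 0x0000000E"))
# -> (14, 'page fault')
```

This only names the trap; translating the EIP to a symbol still has to be done with `crash` against the kernel that actually panicked.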


  6. Re: sco 5.0.5 panic crashes

    Thanks, everybody, for your contributions. It appears I have managed to
    resolve my issue. The fact that I inflicted this upon myself is
    embarrassing, but I feel compelled to post the "resolution", after pleading
    for help:
    I had built a networked printer last Friday at 10am -- software-wise, and
    sent out a test job, waiting to be printed when the print server got
    connected. Which didn't happen. No big deal, I do it all the time.
    Except, that in this case, when I set up the host address in /etc/hosts, I
    typed in the address of the sco server itself.
    God knows (I sure don't) what happened after that, but it resulted in
    remarkably consistent behavior from 5.0.5-- it would corrupt memory totally,
    somewhere between 73 and 76 minutes, without fail. Resulting in the panic
    crashes I was reporting.
    What finally got my fossilized brain kicked into some semblance of
    animation, was, struggling thru syslog, poking around, and coming across
    this:
    Aug 5 10:00:40 amre lpd[4898]: intnetprint: lost connection
    Aug 5 10:01:12 amre last message repeated 5 times
    Aug 5 10:01:44 amre lpd[4898]: intnetprint: lost connection
    Aug 5 10:02:49 amre lpd[4898]: intnetprint: lost connection
    Aug 5 10:04:57 amre lpd[4898]: intnetprint: lost connection
    Aug 5 10:09:13 amre lpd[4898]: intnetprint: lost connection

    And lots more, every 4 or 5 minutes.
    So, somewhere along the way, the kernel got fed up with this, and cashed
    (cached? -- sorry!) in its (memory?--really sorry!) chips.
    I still feel this was a rather diabolical revenge for a simple typing error,
    but, I probably deserved it.
    Thanks again for the suggestions and help. If nothing else, I did learn
    some neat stuff over the weekend.
    Barry
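
The kind of /etc/hosts slip described above (a printer entry carrying the server's own address) can be caught with a quick duplicate-address check. The sketch below is a hedged illustration in Python: the sample addresses and the name `printsrv` are hypothetical, `amre` is simply the hostname seen in the syslog excerpt, and the function `duplicate_addresses` is not part of any SCO utility. Several lines sharing one address can be legitimate, so the output is a list of entries to review, not proof of an error.

```python
from collections import defaultdict

def duplicate_addresses(hosts_text):
    """Map each address that appears on more than one line to its name lists."""
    names_by_addr = defaultdict(list)
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        addr, *names = line.split()
        names_by_addr[addr].append(names)
    return {a: n for a, n in names_by_addr.items() if len(n) > 1}

# Hypothetical hosts content reproducing the mistake described in the post.
SAMPLE = """\
127.0.0.1     localhost
192.168.1.10  amre
192.168.1.10  printsrv   # meant to be the print server's own address
"""

print(duplicate_addresses(SAMPLE))
# -> {'192.168.1.10': [['amre'], ['printsrv']]}
```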



  7. Re: sco 5.0.5 panic crashes

    Barry Swane wrote:

    > I had built a networked printer last Friday at 10am -- software-wise, and
    > sent out a test job, waiting to be printed when the print server got
    > connected. Which didn't happen. No big deal, I do it all the time.
    > Except, that in this case, when I set up the host address in /etc/hosts, I
    > typed in the address of the sco server itself.
    > God knows (I sure don't) what happened after that, but it resulted in
    > remarkably consistent behavior from 5.0.5-- it would corrupt memory totally,
    > somewhere between 73 and 76 minutes, without fail. Resulting in the panic
    > crashes I was reporting.
    > What finally got my fossilized brain kicked into some semblance of
    > animation, was, struggling thru syslog, poking around, and coming across
    > this:
    > Aug 5 10:00:40 amre lpd[4898]: intnetprint: lost connection
    > Aug 5 10:01:12 amre last message repeated 5 times
    > Aug 5 10:01:44 amre lpd[4898]: intnetprint: lost connection
    > Aug 5 10:02:49 amre lpd[4898]: intnetprint: lost connection
    > Aug 5 10:04:57 amre lpd[4898]: intnetprint: lost connection
    > Aug 5 10:09:13 amre lpd[4898]: intnetprint: lost connection
    >
    > And lots more, every 4 or 5 minutes.
    > So, somewhere along the way, the kernel got fed up with this, and cashed
    > (cached? -- sorry!) in its (memory?--really sorry!) chips.
    > I still feel this was a rather diabolical revenge for a simple typing error,
    > but, I probably deserved it.


    That shouldn't cause a crash. You've located the trigger but not the
    real cause. There's some sort of fragility in your system that should
    not exist. It may be that nothing else will ever trigger it -- or you
    may be in for future mysterious crashes without an as easily located
    trigger...

    I have a guess: something about running out of STREAMS buffers. What
    you were doing shouldn't have eaten up buffers, but I can imagine how it
    might. Do you care to set up the problem scenario again and watch
    `netstat -m` output for a while? You can cancel the job before (well
    before) the moment of fatality... Run `netstat -m` every few minutes
    while the job is building up "lost connection" messages in syslog. Look
    at the "streams memory in use" value -- is it rising rapidly? Can you
    project that after 73 minutes it will exceed or approximate the "total
    configured streams memory" value?

    >Bela<
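
The projection suggested above is a simple linear extrapolation. A rough Python sketch follows, assuming the "streams memory in use" figure is copied by hand from successive `netstat -m` runs. The sample numbers and the function name `projected_exhaustion_minute` are made up for illustration and are not measurements from the system in this thread.

```python
def projected_exhaustion_minute(samples, total):
    """Estimate when usage reaches `total`, in minutes since the first sample.

    samples: list of (minutes_since_start, bytes_in_use), oldest first.
    Returns None if usage is flat or falling between first and last sample.
    """
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    if t1 == t0:
        raise ValueError("need samples taken at different times")
    rate = (u1 - u0) / (t1 - t0)  # bytes per minute, averaged over the run
    if rate <= 0:
        return None
    return t1 + (total - u1) / rate

# Hypothetical readings: in-use bytes at 0, 10 and 20 minutes after starting the job.
readings = [(0, 100_000), (10, 150_000), (20, 200_000)]
print(projected_exhaustion_minute(readings, total=465_000))
# -> 73.0
```

If the projected minute lands near the observed crash time, that would support the buffer-exhaustion guess; a flat or falling trend (a `None` result here) would argue against it.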


  8. Re: sco 5.0.5 panic crashes

    I believe Bela is correct that this is not the exact problem with the
    system.

    I would still stick with replacing the memory.

    If I am wrong, the worst it will cost you is about $200. If you are going
    down every 73 minutes, then you will have recouped the $200 in 73 minutes.
    At least purchase the memory to have on hand, so that if the system does go
    down again, you can immediately replace it during the downtime.

    Any time I get a trap E more than once in a reasonable amount of time, I
    automatically change the memory right off the bat, provided I have not made
    any changes to the system recently.

    As I said in my previous post, this has worked more often than not.

    moncho



  9. Re: sco 5.0.5 panic crashes

    Bela, I tried to set up the problem scenario again, to monitor netstat, as
    you suggested. I can't explain why, but I don't seem to be able to recreate
    the same scenario. I did the same as before, and monitored netstat for 50
    minutes. I saw some fluctuations, but nothing that would imply it getting
    out of control, and it was up and down. I also did not see the same
    messages in syslog, which leads me to believe that I didn't recreate the
    problem.
    I did this during the evening, with no one else on the system; perhaps that
    explains why the results were different.
    Interestingly, syslog shows the following entries this evening:
    Aug 9 21:55:28 amre lpd[4379]: torinvoice: lost connection
    Aug 9 21:55:28 amre lpd[4379]: restarting torinvoice
    Aug 9 22:15:30 amre lpd[4379]: torinvoice: lost connection
    Aug 9 22:15:30 amre lpd[4379]: restarting torinvoice
    Aug 9 22:24:39 amre lpd[4379]: torinvoice: lost connection
    Aug 9 22:24:39 amre lpd[4379]: restarting torinvoice
    Aug 9 22:33:42 amre lpd[4379]: torinvoice: lost connection
    Aug 9 22:33:42 amre lpd[4379]: restarting torinvoice
    Aug 9 22:42:46 amre lpd[4379]: torinvoice: lost connection
    Aug 9 22:42:46 amre lpd[4379]: restarting torinvoice
    Aug 9 22:51:56 amre lpd[4379]: torinvoice: lost connection
    Aug 9 22:51:56 amre lpd[4379]: restarting torinvoice

    This for an entirely different networked printer, for which I believe there
    is no problem, other than the printer is just not ready to print. (I don't
    really believe there is a problem reaching this print server, but I don't
    know that for sure)
    These syslog entries are similar-- but not identical-- to the ones I found
    the other day.
    I'll see if I can screw up my courage to attempt another test during
    production hours.
    Barry
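
One way to make "similar-- but not identical" concrete is to compare the gaps between log timestamps in the two syslog excerpts posted in this thread. The Python sketch below does that. The helper name `retry_intervals` is illustrative, and it assumes all lines in an excerpt fall on the same day, so only the time-of-day field is parsed.

```python
from datetime import datetime

FIRST = """\
Aug 5 10:00:40 amre lpd[4898]: intnetprint: lost connection
Aug 5 10:01:12 amre last message repeated 5 times
Aug 5 10:01:44 amre lpd[4898]: intnetprint: lost connection
Aug 5 10:02:49 amre lpd[4898]: intnetprint: lost connection
Aug 5 10:04:57 amre lpd[4898]: intnetprint: lost connection
Aug 5 10:09:13 amre lpd[4898]: intnetprint: lost connection
"""

SECOND = """\
Aug 9 21:55:28 amre lpd[4379]: torinvoice: lost connection
Aug 9 21:55:28 amre lpd[4379]: restarting torinvoice
Aug 9 22:15:30 amre lpd[4379]: torinvoice: lost connection
Aug 9 22:24:39 amre lpd[4379]: torinvoice: lost connection
Aug 9 22:33:42 amre lpd[4379]: torinvoice: lost connection
Aug 9 22:42:46 amre lpd[4379]: torinvoice: lost connection
Aug 9 22:51:56 amre lpd[4379]: torinvoice: lost connection
"""

def retry_intervals(log_text):
    """Seconds between consecutive distinct timestamps in a syslog excerpt."""
    stamps = []
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        stamp = datetime.strptime(fields[2], "%H:%M:%S")  # third field is HH:MM:SS
        if not stamps or stamp != stamps[-1]:             # skip same-second repeats
            stamps.append(stamp)
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]

print(retry_intervals(FIRST))   # -> [32, 32, 65, 128, 256]
print(retry_intervals(SECOND))  # -> [1202, 549, 543, 544, 550]
```

In the first excerpt the gaps roughly double each time, while the second settles at about nine minutes, so the two printers were not failing in the same pattern. What that difference means for the crash is not something these excerpts can settle.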


