How do I diagnose a server that crashes every night? - VMS

  1. How do I diagnose a server that crashes every night?

    Hi,

    I have very little VMS experience but we have inherited a nice shiny
    new alphaserver 250 to support (OK, it's not very shiny or new!) which
    is located in the middle of the sea.

    EVERY night without fail, this server is crashing and restarting
    itself. I'd really like to get to the bottom of this as I am being
    called every morning at 3am to log in and start some services which
    don't seem to launch at startup despite being in the startup file ("ahh
    it's always been that way...")

    Below is the FATAL BUGCHECK which I suspect is causing the machine to
    reboot. The process which appears to be crashing is key to this
    server's functionality. This is a relatively new problem as this box
    has run itself for the past 10 years.

    I have no idea where to even begin determining the source of the
    problem from this. Is anyone able to give me any pointers as to what I
    should be looking for, what information I will need, and how to make
    sense of it all? My VMS knowledge, as I say, is extremely limited, so
    any commands would also be useful and appreciated. It would be great
    if anyone can help!

    Thanks
    str8



    ******************************* ENTRY 435. *******************************
    ERROR SEQUENCE 432.                 LOGGED ON: CPU_TYPE 00000006
    DATE/TIME 10-SEP-2008 05:27:25.87              SYS_TYPE 0000000D
    SYSTEM UPTIME: 1 DAYS 00:55:22
    SCS NODE: PIN01                     OpenVMS AXP V6.2-1H3

    HW_MODEL: 00000000 Hardware Model = 0.

    FATAL BUGCHECK AlphaStation 250 4/266

    MACHINECHK, Machine check while in kernel mode

    PROCESS NAME BLYSEM_I1
    PROCESS ID 0001001F

    ERROR PC FFFFFFFF 800485F8

    Process Status = 20000000 00001F04, SW = 00, Previous Mode = KERNEL
    System State = 01, Current Mode = KERNEL
    VMM = 00 IPL = 31, SP Alignment = 32

    STACK POINTERS

    KSP 00000000 7FF91EE0 ESP 00000000 7FF96000 SSP 00000000 7FF9C100
    USP 00000000 7EE7D390

    GENERAL REGISTERS

    R0 00000000 00000002 R1 00000000 0000940A R2 FFFFFFFF 80C2DB50
    R3 FFFFFFFF 80C04D98 R4 00000000 00000048 R5 00000000 00001F04
    R6 00000000 00000000 R7 00000000 00000001 R8 00000000 7FF9C1F8
    R9 00000000 7FF9C400 R10 00000000 7FF9D228 R11 00000000 7FFBE3E0
    R12 00000000 00000000 R13 FFFFFFFF 8326B910 R14 00000000 00000000
    R15 00000000 7EE7D498 R16 00000000 00000215 R17 00000000 00000001
    R18 00000000 00000001 R19 00000000 00000000 R20 FFFFFFFF FFFFFFF8
    R21 00000000 00000017 R22 00000000 00000100 R23 FFFFFFFF 80E08368
    R24 FFFFFFFF 80E08000 R25 00000000 00000003 R26 00000000 00000210
    R27 FFFFFFFF 80C34D60 R28 FFFFFFFF 8003B9C4 FP 00000000 7FF91EE0
    SP 00000000 7FF91EE0 PC FFFFFFFF 800485F8 PS 20000000 00001F04

    SYSTEM REGISTERS

    PTBR        00000000 00001F19  Page Table Base Register
    PCBB        00000000 0414A080  Privileged Context Block Base
    PRBR        FFFFFFFF 80E0A000  Processor Base Register
    VPTB        00000002 00000000  Virtual Page Table Base Register
    SCBB        00000000 000001A1  System Control Block Base
    SISR        00000000 00000000  Software Interrupt Summary Register
    ASN         00000000 0000003B  Address Space Number
    ASTSR_ASTEN 00000000 0000000F  AST Summary/AST Enable
    FEN         00000000 00000001  Floating-Point Enable
    IPL         00000000 0000001F  Interrupt Priority Level
    MCES        00000000 00000008  Machine Check Error Summary

  2. Re: How do I diagnose a server that crashes every night?

    In article <6df72036-9e6f-4a79-96cf-a841020f7b26@l64g2000hse.googlegroups.com>, StraightEight writes:
    >I have very little VMS experience but we have inherited a nice shiny
    >new alphaserver 250 to support (ok its not very shiny or new!) which
    >is located in the middle of the sea.
    >
    >EVERY night without fail, this server is crashing and restarting
    >itself. I'd really like to get to the bottom of this as I am being
    >called every morning at 3am to log in and start some services which
    >don't seem launch at startup despite being in the startup file ("ahh
    >it's always been that way...")
    >
    >Below is the FATAL BUGCHECK which I suspect is causing the machine to
    >reboot.


    Correct, a FATAL BUGCHECK results in a crash of the system.

    >The process which appears to be crashing is key to this
    >servers functionality. This is a relatively new problem as this box
    >has run itself for the past 10 years.


    So the first question is: was anything changed on this system?

    >I have no idea where to even begin determining the source of the
    >problem from this. Is anyone able to give me any pointers as to what I
    >should be looking for, what information I will need, and how to make
    >sense of it all?


    Have a look in sys$common:[syserr] for files named CLUE$*.LIS.
    Also see the online help for "ANALYZE/ERROR". In addition, is there
    anything in sys$manager:operator.log?
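
    At the DCL prompt, those three checks might look something like the
    sketch below. It assumes the default locations: the SYS$ERRORLOG
    logical name (which normally covers the [SYSERR] directories) and an
    operator log called OPERATOR.LOG in SYS$MANAGER. Either may have been
    changed on this particular system.

    ```dcl
    $ ! List any CLUE crash summaries together with their creation dates
    $ DIRECTORY/DATE SYS$ERRORLOG:CLUE$*.LIS
    $ ! Read the online help for the error log report utility
    $ HELP ANALYZE /ERROR_LOG
    $ ! Page through the operator log, looking around the crash times
    $ TYPE/PAGE SYS$MANAGER:OPERATOR.LOG
    ```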

    Regards,
    Christoph Gartmann

    --
    Max-Planck-Institut fuer Phone : +49-761-5108-464 Fax: -80464
    Immunbiologie
    Postfach 1169 Internet: gartmann@immunbio dot mpg dot de
    D-79011 Freiburg, Germany
    http://www.immunbio.mpg.de/home/menue.html

  3. Re: How do I diagnose a server that crashes every night?

    StraightEight wrote:

    > EVERY night without fail, this server is crashing and restarting
    > itself. I'd really like to get to the bottom of this as I am being
    > called every morning at 3am to log in and start some services which


    Does it crash at exactly the same time every night ? Or does it vary ?
    Any relationship with actual operations being done related to that
    machine ? Or does it crash when some link goes down and the code just
    doesn't handle this properly ?


    > I have no idea where to even begin determining the source of the
    > problem from this.


    It would help to provide more background on what the application is. Is
    it some COBOL app that just prints an accounting report, or is it some
    real-time application that controls a drilling rig ?

    What sort of stuff is connected to that machine using what sort of
    protocol ?


    In terms of services not starting when it boots and needing to be
    started manually, you would need to look at the SYSTARTUP_VMS.COM file
    in the SYS$MANAGER directory and take a careful look at it. The output
    normally just goes on the operator console, so if you are not on site,
    you have a hard time seeing error messages.

    However, if you start a service by submitting a batch job, then there
    should be a log file that contains some information on why the service
    didn't start.

    If SYSTARTUP_VMS.COM calls a command procedure to start a service, you
    can add /OUTPUT=logfile.log to the command

    eg:
    @disk:[directory]myapplication_startup.com/output=sys$manager:myapplication_startup.log

    Then, you could consult the log file later on to find out why the
    application didn't start.

    Remember that some services take some time to become available, so on a
    faster machine, you might be trying to start your app before TCPIP is
    fully available for instance, and the app would fail. But later on when
    you log in to fix the problem, TCPIP would be available and the app
    would start properly.
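
    For the batch-job variant mentioned above, a rough sketch follows.
    The procedure and log file names are the same placeholders used in
    the /OUTPUT example, not real files, and it assumes a batch queue
    (SYS$BATCH by default) has already been started earlier in the
    startup sequence.

    ```dcl
    $ ! In SYSTARTUP_VMS.COM: run the application startup as a batch job so
    $ ! that its output, including any error messages, is kept in a log file
    $ SUBMIT/NOPRINTER/LOG_FILE=SYS$MANAGER:MYAPPLICATION_STARTUP.LOG -
            DISK:[DIRECTORY]MYAPPLICATION_STARTUP.COM
    $ ! Later, after a reboot, read what the job reported
    $ TYPE/PAGE SYS$MANAGER:MYAPPLICATION_STARTUP.LOG
    ```

    Because the batch job runs on its own, SYSTARTUP_VMS.COM carries on
    without waiting for it, so the job can include its own wait for
    whatever the application depends on.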



  4. Re: How do I diagnose a server that crashes every night?

    Thanks for a quick reply. Here are my findings.

    > So the first question is: was anything changed on this system?

    No, as far as I am aware, this server has always been running
    unchanged for several years

    > Have a look in sys$common:[syserr] for files named CLUE$*.LIS


    If I look at these files there are pages of information, but I am
    unsure just what I need to be looking at!

    > In addition see the online help for "ANALYZE/ERROR". In addition, is there
    > anything in sys$manager:operator.log?


    This mostly just contains our Telnet requests. Here's one I
    spotted... do you know what this means? Sometimes when we try to
    connect by telnet to the server we see "No License for the Active
    Product" (or something along those lines). Could it be something as
    simple as a licensing problem, or is this a red herring?

    %%%%%%%%%%% OPCOM 9-SEP-2008 04:32:26.73 %%%%%%%%%%%
    Message from user SYSTEM on PIN01
    %LICENSE-E-TERM, C ALL-IL-1997NOV26-2136 License has terminated

    Thanks!

  5. Re: How do I diagnose a server that crashes every night?

    StraightEight wrote:

    > %%%%%%%%%%% OPCOM 9-SEP-2008 04:32:26.73 %%%%%%%%%%%
    > Message from user SYSTEM on PIN01
    > %LICENSE-E-TERM, C ALL-IL-1997NOV26-2136 License has terminated


    This is the C compiler licence.

    The command:

    SHOW LICENSE will give you a list of active licences on that node.

    LICENSE LIST will give you a list of registered licences (this will
    exclude expired licences or licences that aren't valid for this node).

    It is possible that you have 2 C licences, the "real" one and some
    temporary one which has expired.
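
    To look at just the C entries, something like the following should
    do. This is a sketch that assumes the product name is simply "C", as
    the OPCOM message suggests; adjust it if the licence is registered
    under a different name.

    ```dcl
    $ ! Licences currently loaded (active) on this node for product C
    $ SHOW LICENSE C
    $ ! Registrations for product C in the licence database, in full detail
    $ LICENSE LIST C /FULL
    ```

    If two entries show up, comparing them should show which one the
    "License has terminated" message refers to.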


    Not having the C compiler would not cause problems to run programs. It
    would only affect the invocation of the C compiler (CC command).
    Programs compiled with this compiler will run fine without the licence.

  6. Re: How do I diagnose a server that crashes every night?

    On 10 Sep, 10:46, JF Mezei wrote:

    > Does it crash at exactly the same time every night ? Or does it vary ?
    > Any relationship with actual operations being done related to that
    > machine ? Or does it crash when some link goes down and the code just
    > doesn't handle this properly ?


    Doesn't really seem to be any pattern, some nights it restarts just
    once, some nights it can happen up to 4 times.

    > It would help to provide more background on what the application is. Is
    > it some COBOL app that just prints an accounting report, or is it some
    > real-time application that controls a drilling rig ?
    > What sort of stuff is connected to that machine using what sort of
    > protocol ?


    The file BLYSEM, I'm sure, is a software interface to a Bailey INFI900
    DCS (so real-time data acquisition on a rig, as guessed!) for OSI PI
    software. The volume of data this handles has probably increased over
    the years... could a capacity problem knock a service over?

    > In terms of services not starting when it boots and needing to be
    > started manually, you would need to look at the SYSTARTUP_VMS.COM file
    > in the SYS$MANAGER directory and take a careful look at it. The output
    > normally just goes on the operator console, so if you are not on site,
    > you have a hard time seeing error messages.
    >
    > However, if you start a service by submitting a batch job, then there
    > should be a log file that contains some information on why the service
    > didn't start.
    >
    > If SYSTARTUP_VMS.COM calls a command procedure to start a service, you
    > can add /OUTPUT=logfile.log to the command
    >
    > eg:
    > @disk:[directory]myapplication_startup.com/output=sys$manager:myapplication_startup.log
    >
    > Then, you could consult the log file later on to find out why the
    > application didn't start.
    >
    > Remember that some services take some time to become available, so on a
    > faster machine, you might be trying to start your app before TCPIP is
    > fully available for instance, and the app would fail. But later on when
    > you log in to fix the problem, TCPIP would be available and the app
    > would start properly.


    Thanks for the tips, I think I will try the output switch and see what
    is logged. It's a good hunch at the end....the call to the service is
    at the very end of the startup file, would each line in the startup
    file wait until it is executed before moving to the next, or does it
    just fire off all the commands at once? Now I think about it, the last
    time we caught the error very early there was still a batch job
    running...maybe we should call it at the end of this batch job?

    Thanks for your response!



  7. Re: How do I diagnose a server that crashes every night?


    "StraightEight" wrote in message
    news:6df72036-9e6f-4a79-96cf-a841020f7b26@l64g2000hse.googlegroups.com...

    > I have no idea where to even begin determining the source of the
    > problem from this.


    MACHINECHK, Machine check while in kernel mode suggests
    hardware. There may be other entries in the error log as well as
    the bugcheck, which may give more detail.

    Looking at the CLUE files in sys$errorlog, particularly the
    _collect.dat may help nail down common features.
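
    One quick way to compare the crashes side by side is to search all
    of the CLUE listings at once. This is only a sketch: it assumes the
    listings live under SYS$ERRORLOG and that each contains a line with
    the word "Bugcheck" naming the crash type, which is worth confirming
    by reading one listing first.

    ```dcl
    $ ! Show the matching line(s) from every crash summary, file by file
    $ ! (SEARCH ignores case unless told otherwise)
    $ SEARCH SYS$ERRORLOG:CLUE$*.LIS "Bugcheck"
    ```

    If every file reports MACHINECHK, the next things to compare are the
    failing PC and the current process in each listing.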



  8. Re: How do I diagnose a server that crashes every night?


    str8,

    I do not have an Alpha CPU manual handy, so I will restrict this set
    of comments to the other issues raised. However, one important
    question is whether the machine check error information is the same on
    every crash.

    Being responsible for a system with little or no documentation can be
    a significant challenge. I have seen this kind of situation often when
    getting called into a site which has been without a good system
    manager in a while. It is common to find things "broken" that, in
    effect, were never working correctly. The absence of a noticeable
    failure does not mean there was no issue, only that the issue never
    became severe enough to be noticed.

    While the overall STARTUP process is capable of parallel operation,
    each individual command file is executed sequentially. (Using the
    parallel execution features can speed restarts substantially, as I
    noted in my presentation "SYSMAN for Improved Restart Performance" at
    the Fall 1999 US DECUS symposium; slides are available via
    http://www.rlgsc.com/decus/usf99/index.html .)

    Most likely, the parallel execution features were not used in this
    case. If processes that are supposed to start during a restart do not
    in fact start, and the requests to start them are in the system
    startup file, the most common reason for the failure is a small
    typographical error made when editing the startup file. If there is an
    error, the startup file will exit, with only a transiently visible
    message on the console.

    Typically, there are two ways to resolve this: 1) extremely close
    inspection of the startup file (generally
    SYS$MANAGER:SYSTARTUP_VMS.COM), or 2) enable logging of the startup
    sequence using the SYSMAN STARTUP SET OPTIONS/OUTPUT=FILE command [the
    latter creates the file SYS$SPECIFIC:[SYSEXE]STARTUP.LOG]. Reviewing
    the log file generated often clarifies precisely what messages were
    scrolled rapidly off the screen. I often leave unattended systems in
    the FILE setting so that problems can be diagnosed after the fact.
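
    Spelled out as a terminal session, the logging option might be
    enabled as sketched below. It assumes a suitably privileged account
    such as SYSTEM. The first SYSMAN command turns on logging to a file
    for subsequent startups, and the second displays the current options
    so the change can be confirmed.

    ```dcl
    $ RUN SYS$SYSTEM:SYSMAN
    SYSMAN> STARTUP SET OPTIONS/OUTPUT=FILE
    SYSMAN> STARTUP SHOW OPTIONS
    SYSMAN> EXIT
    $ ! After the next reboot, read the startup log
    $ TYPE/PAGE SYS$SPECIFIC:[SYSEXE]STARTUP.LOG
    ```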

    One important recommendation is to make sure that there is a good
    backup of the system disk, and a log kept of any changes to any of the
    files.

    It goes without saying that at some point, it may be wise to retain
    outside experienced assistance to examine the problem [Disclosure: our
    firm does provide consulting services in this area].

    - Bob Gezelter, http://www.rlgsc.com

  9. Re: How do I diagnose a server that crashes every night?

    str8,

    I should note that it is also possible that the Machine Check and the
    active process are, in effect, not related.

    The last client system that was having machine checks turned out to be
    caused by an erratic power supply. The power supply worked well when
    it was working, but it was apparently having problems. The fact that
    the system in question would appear to be in a somewhat industrial
    setting raises the question of whether there is an external power or
    grounding event that is the underlying cause of the Machine Check.

    If there is a UPS involved, there could also be a problem there.

    - Bob Gezelter, http://www.rlgsc.com

  10. Re: How do I diagnose a server that crashes every night?

    In article <9074c354-2557-4638-961e-7860857e3485@s50g2000hsb.googlegroups.com>, Bob Gezelter writes:
    > str8,
    >
    > I should note that it is also possible that the Machine Check and the
    > active process are, in effect, not related.
    >
    > The last client system that was having machine checks turned out to be
    > caused by an erratic power supply. The power supply worked well when
    > it was working, but it was apparently having problems. The fact that
    > the system in question would appear to be in a somewhat industrial
    > setting raises the question of whether there is an external power or
    > grounding event that is the underlying cause of the Machine Check.
    >
    > If there is a UPS involved, there could also be a problem there.
    >


    The OP should also be aware that although a machine check is usually a
    hardware issue, it can be caused by a faulty device driver as well.

    Personal experience here: I have caused VMS to issue machine checks while
    I have been developing VMS device drivers in the past.

    Simon.

    --
    Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP
    Microsoft: Bringing you 1980's technology to a 21st century world

  11. Re: How do I diagnose a server that crashes every night?

    > The last client system that was having machine checks turned out to be
    > caused by an erratic power supply. The power supply worked well when
    > it was working, but it was apparently having problems. The fact that
    > the system in question would appear to be in a somewhat industrial
    > setting raises the question of whether there is an external power or
    > grounding event that is the underlying cause of the Machine Check.
    >
    > If there is a UPS involved, there could also be a problem there.


    I think you could be onto something here...as funnily enough we _had_
    two VMS servers, one came back to be repaired (power supply problem!)
    Now mentioning UPS gets me wondering, if there is indeed a UPS (I'll
    need to check) I would imagine both servers would come off the same
    UPS...maybe machine 1 never had problems with its power supply after
    all! Definitely something to rule out (and perhaps in light of recent
    experiences something I should have considered straight away!) Many
    thanks.

  12. Re: How do I diagnose a server that crashes every night?

    In article <6df72036-9e6f-4a79-96cf-a841020f7b26@l64g2000hse.googlegroups.com>, StraightEight writes:
    > Hi,
    >
    > I have very little VMS experience but we have inherited a nice shiny
    > new alphaserver 250 to support (ok its not very shiny or new!) which
    > is located in the middle of the sea.


    I would think _very_ seriously about contracting a consultant who
    knows VMS.


  13. Re: How do I diagnose a server that crashes every night?

    On Sep 10, 7:52 am, koeh...@eisner.nospam.encompasserve.org (Bob
    Koehler) wrote:
    > In article <6df72036-9e6f-4a79-96cf-a841020f7...@l64g2000hse.googlegroups.com>, StraightEight writes:
    >
    > > Hi,
    > >
    > > I have very little VMS experience but we have inherited a nice shiny
    > > new alphaserver 250 to support (ok its not very shiny or new!) which
    > > is located in the middle of the sea.
    >
    > I would think _very_ seriously about contracting a consultant who
    > knows VMS.


    With the OP mentioning that the system was located out to sea
    somewhere, I wonder what might be happening to power and/or other
    environmental stuff during non-daylight hours?


  14. Re: How do I diagnose a server that crashes every night?

    StraightEight wrote:
    > MACHINECHK, Machine check while in kernel mode


    Well, it says "Machine Check" and that generally means a hardware
    problem of some sort. If you have a service contract, just pick up the
    phone and call for help. If not, get prior approval from whoever pays
    the bills and then pick up the phone and call for help!

    It might also help to try to find out what else happens every morning at
    3:00 AM. The fact that the timing is consistent suggests that it's
    something happening in the environment that triggers the machine check.


  15. Re: How do I diagnose a server that crashes every night?

    If you see MACHINECHK crashes, think of hardware problems first. There
    should be errlog entries immediately preceding the system crash. Find
    those and analyze them.

    $ ANAL/ERR/SINCE=<1-minute-before-system-crash>

    or look at the errors in the dump:

    $ ANAL/CRASH SYS$SYSTEM
    SDA> CLUE ERRLOG
    SDA> EXIT

    You may need to install DECevent V3.4 ( $ DIAGNOSE command ) to
    translate those errors into meaningful text.
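
    Filled in for the bugcheck posted at the top of the thread (logged
    at 10-SEP-2008 05:27:25), the first command might look like the
    sketch below. The two-minute window and the output file name are
    arbitrary choices for illustration, and the default error log file
    is assumed.

    ```dcl
    $ ! Report error-log entries from shortly before the crash to just
    $ ! after it, written to a file so it can be read at leisure
    $ ANALYZE/ERROR_LOG/SINCE=10-SEP-2008:05:25/BEFORE=10-SEP-2008:05:28 -
            /OUTPUT=SYS$LOGIN:CRASH_ERRLOG.LIS
    $ TYPE/PAGE SYS$LOGIN:CRASH_ERRLOG.LIS
    ```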

    ---
    Volker Halle, Invenate GmbH, OpenVMS Support

    An OpenVMS crashdump analysis a day
    makes the Windows headaches go away.


  16. Re: How do I diagnose a server that crashes every night?

    On Sep 10, 6:49 am, clubley@remove_me.eisner.decus.org-Earth.UFP
    (Simon Clubley) wrote:
    > The OP should also be aware that although a machine check is usually a
    > hardware issue, it can be caused by a faulty device driver as well.
    >
    > Personal experience here: I have caused VMS to issue machine checks while
    > I have been developing VMS device drivers in the past.

    Simon,

    Indeed. When one is not careful running in Kernel mode, particularly
    at interrupt level, all kinds of strange results can ensue, for all
    kinds of reasons.

    My favorite was a problem on an early version of a third-party
    J-11-based product: there was no RESET control, as it was presumed
    that PowerFail could do it. I pointed out that there were many
    situations in which PowerFail would cause a problem if there was not
    a valid kernel stack pointer. Oops.

    - Bob Gezelter, http://www.rlgsc.com

  17. Re: How do I diagnose a server that crashes every night?

    Volker Halle wrote:
    > If you see MACHINECHK crashes, think of hardware problems first.



    When in kernel mode, is there a wide variety of possible crash reasons,
    or does that narrow down to machinechk ?

    For instance, if a driver were to divide by 0, or try to execute an
    instruction whose opcode doesn't exist, would it be a different crash
    reason ?

    And if the crashes happen ONLY during the night, then it would seem that
    some external factors might be triggering this.

    Question to the OP: Does that machine get a yearly visit to be
    maintained, cleaned to remove the dust bunnies etc ? Or has it been
    running for years without anyone giving it any hardware maintenance ?

    As others have pointed out before, if you could compare the crash logs
    from different nights to see if it crashes at the same location/reason
    every night, this might be some clue.

    Is it possible that the data generating device is shut down every night
    for a few minutes and this greatly confuses your VMS application because
    it wasn't set up to handle such events ?


    Does the device use TCP or UDP or serial line communications ?

  18. Re: How do I diagnose a server that crashes every night?

    In article <48c81caa$0$12385$c3e8da3@news.astraweb.com>, JF Mezei writes:
    >
    > When in kernel mode, is there a wide variety of possible crash reasons,
    > or does that narrow down to machinechk ?


    There is a very wide variety of crash reasons available in kernel mode,
    and very few elsewhere. Or were you asking the OP what he's actually
    seen?

    > For instance, if a driver were to divide by 0, or try to execute an
    > instruction whose opcode doesn't exist, would it be a different crash
    > reason ?


    Divide by 0 exception is not a machine check, so as a good example, it
    would lead to some other bugcheck.


  19. Re: How do I diagnose a server that crashes every night?

    In article
    <3360773d-860d-4145-8509-21752f00e75a@m73g2000hsh.googlegroups.com>,
    DaveG wrote:

    > With the OP mentioning that the system was located out to sea
    > somewhere, I wonder what might be happening to power and/or other
    > environmental stuff during non-daylight hours?


    A few thoughts on environmental and power stuff.

    I've seen instances where folks have switched off the air conditioning
    when working in a computer room (because they didn't like the noise) and
    cleaners borrowing an occupied power socket to plug their equipment in.
    There's also the possibility of hefty electrical machinery being
    switched off then on, particularly at a work break or shift change.

    One amusing example was the new production line worker who hung his
    overalls over a bar code reader at his mid-morning and lunch breaks. It
    took some lateral thinking to diagnose that one!

    --
    Paul Sture

  20. Re: How do I diagnose a server that crashes every night?

    Simon Clubley wrote:
    > The OP should also be aware that although a machine check is usually a
    > hardware issue, it can be caused by a faulty device driver as well.
    >
    > Personal experience here: I have caused VMS to issue machine checks while
    > I have been developing VMS device drivers in the past.
    >
    > Simon.
    >


    While I would agree that one can cause MCHK with a device driver, on a
    system that has been in use for years without change (as stated by the
    OP), I would have to say that it is either a PS or possibly memory. I
    would check the logfiles for ECC errors as well. Not having access to
    the full error logs, diagnosing will be an exercise in divination.
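
    As a first pass at the memory question, something along these lines
    could be tried. Treat it as a sketch: the /INCLUDE keywords are given
    as I recall them from the error log utility's help, so check them
    with HELP ANALYZE /ERROR_LOG before relying on the report. The
    counts from SHOW ERROR only cover the time since the last boot, so
    on a machine that reboots nightly they need to be read soon after
    the event.

    ```dcl
    $ ! Running error counts since boot for CPU, memory and devices
    $ SHOW ERROR
    $ ! Error-log report limited to memory and machine-check entries
    $ ANALYZE/ERROR_LOG/INCLUDE=(MEMORY,MACHINE_CHECKS)/SINCE=YESTERDAY
    ```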
