How do I diagnose a server that crashes every night? - VMS
-
How do I diagnose a server that crashes every night?
Hi,
I have very little VMS experience but we have inherited a nice shiny
new alphaserver 250 to support (OK, it's not very shiny or new!) which
is located in the middle of the sea.
EVERY night without fail, this server is crashing and restarting
itself. I'd really like to get to the bottom of this as I am being
called every morning at 3am to log in and start some services which
don't seem to launch at startup despite being in the startup file ("ahh
it's always been that way...")
Below is the FATAL BUGCHECK which I suspect is causing the machine to
reboot. The process which appears to be crashing is key to this
server's functionality. This is a relatively new problem as this box
has run itself for the past 10 years.
I have no idea where to even begin determining the source of the
problem from this. Is anyone able to give me any pointers as to what I
should be looking for, what information I will need, and how to make
sense of it all? My VMS knowledge as I say is extremely limited, so
any commands would be useful and appreciated also. Nice to see if
anyone can help!
Thanks
str8
******************************* ENTRY 435. *******************************
ERROR SEQUENCE 432.                  LOGGED ON: CPU_TYPE 00000006
DATE/TIME 10-SEP-2008 05:27:25.87               SYS_TYPE 0000000D
SYSTEM UPTIME: 1 DAYS 00:55:22
SCS NODE: PIN01                                 OpenVMS AXP V6.2-1H3
HW_MODEL: 00000000 Hardware Model = 0.
FATAL BUGCHECK                                  AlphaStation 250 4/266
MACHINECHK, Machine check while in kernel mode
PROCESS NAME BLYSEM_I1
PROCESS ID 0001001F
ERROR PC FFFFFFFF 800485F8
Process Status = 20000000 00001F04, SW = 00, Previous Mode = KERNEL
System State = 01, Current Mode = KERNEL
VMM = 00 IPL = 31, SP Alignment = 32
STACK POINTERS
KSP 00000000 7FF91EE0 ESP 00000000 7FF96000 SSP 00000000 7FF9C100
USP 00000000 7EE7D390
GENERAL REGISTERS
R0 00000000 00000002 R1 00000000 0000940A R2 FFFFFFFF 80C2DB50
R3 FFFFFFFF 80C04D98 R4 00000000 00000048 R5 00000000 00001F04
R6 00000000 00000000 R7 00000000 00000001 R8 00000000 7FF9C1F8
R9 00000000 7FF9C400 R10 00000000 7FF9D228 R11 00000000 7FFBE3E0
R12 00000000 00000000 R13 FFFFFFFF 8326B910 R14 00000000 00000000
R15 00000000 7EE7D498 R16 00000000 00000215 R17 00000000 00000001
R18 00000000 00000001 R19 00000000 00000000 R20 FFFFFFFF FFFFFFF8
R21 00000000 00000017 R22 00000000 00000100 R23 FFFFFFFF 80E08368
R24 FFFFFFFF 80E08000 R25 00000000 00000003 R26 00000000 00000210
R27 FFFFFFFF 80C34D60 R28 FFFFFFFF 8003B9C4 FP 00000000 7FF91EE0
SP 00000000 7FF91EE0 PC FFFFFFFF 800485F8 PS 20000000 00001F04
SYSTEM REGISTERS
PTBR         00000000 00001F19   Page Table Base Register
PCBB         00000000 0414A080   Privileged Context Block Base
PRBR         FFFFFFFF 80E0A000   Processor Base Register
VPTB         00000002 00000000   Virtual Page Table Base Register
SCBB         00000000 000001A1   System Control Block Base
SISR         00000000 00000000   Software Interrupt Summary Register
ASN          00000000 0000003B   Address Space Number
ASTSR_ASTEN  00000000 0000000F   AST Summary/AST Enable
FEN          00000000 00000001   Floating-Point Enable
IPL          00000000 0000001F   Interrupt Priority Level
MCES         00000000 00000008   Machine Check Error Summary
-
Re: How do I diagnose a server that crashes every night?
In article <6df72036-9e6f-4a79-96cf-a841020f7b26@l64g2000hse.googlegroups.com>, StraightEight writes:
>I have very little VMS experience but we have inherited a nice shiny
>new alphaserver 250 to support (ok its not very shiny or new!) which
>is located in the middle of the sea.
>
>EVERY night without fail, this server is crashing and restarting
>itself. I'd really like to get to the bottom of this as I am being
>called every morning at 3am to log in and start some services which
>don't seem launch at startup despite being in the startup file ("ahh
>it's always been that way...")
>
>Below is the FATAL BUGCHECK which I suspect is causing the machine to
>reboot.
Correct, a FATAL BUGCHECK results in a crash of the system.
>The process which appears to be crashing is key to this
>servers functionality. This is a relatively new problem as this box
>has run itself for the past 10 years.
So the first question is: was anything changed on this system?
>I have no idea where to even begin determining the source of the
>problem from this. Is anyone able to give me any pointers as to what I
>should be looking for, what information I will need, and how to make
>sense of it all?
Have a look in sys$common:[syserr] for files named CLUE$*.LIS.
In addition see the online help for "ANALYZE/ERROR". Also, is there
anything in sys$manager:operator.log?
Regards,
Christoph Gartmann
--
Max-Planck-Institut fuer Phone : +49-761-5108-464 Fax: -80464
Immunbiologie
Postfach 1169 Internet: gartmann@immunbio dot mpg dot de
D-79011 Freiburg, Germany
http://www.immunbio.mpg.de/home/menue.html
-
Re: How do I diagnose a server that crashes every night?
StraightEight wrote:
> EVERY night without fail, this server is crashing and restarting
> itself. I'd really like to get to the bottom of this as I am being
> called every morning at 3am to log in and start some services which
Does it crash at exactly the same time every night ? Or does it vary ?
Any relationship with actual operations being done related to that
machine ? Or does it crash when some link goes down and the code just
doesn't handle this properly ?
> I have no idea where to even begin determining the source of the
> problem from this.
It would help to provide more background on what the application is. Is
it some COBOL app that just prints an accounting report, or is it some
real time application that controls a drilling rig ?
What sort of stuff is connected to that machine using what sort of
protocol ?
In terms of services not starting when it boots and needing to be
started manually, you would need to look at the SYSTARTUP_VMS.COM file
in the SYS$MANAGER directory and take a careful look at it. The output
normally just goes on the operator console, so if you are not on site,
you have a hard time seeing error messages.
However, if you start a service by submitting a batch job, then there
should be a log file that contains some information on why the service
didn't start.
If SYSTARTUP_VMS.COM calls a command procedure to start a service, you
can add /OUTPUT=logfile.log to the command
eg:
@disk:[directory]myapplication_startup.com/output=sys$manager:myapplication_startup.log
Then, you could consult the log file later on to find out why the
application didn't start.
Remember that some services take some time to become available, so on a
faster machine, you might be trying to start your app before TCPIP is
fully available for instance, and the app would fail. But later on when
you log in to fix the problem, TCPIP would be available and the app
would start properly.
-
Re: How do I diagnose a server that crashes every night?
Thanks for a quick reply. Here are my findings.
> So the first question is: was anything changed on this system?
No, as far as I am aware, this server has always been running
unchanged for several years
> Have a look in sys$common:[syserr] for files named CLUE$*.LIS
If I look at these files there are pages of information, but I am
unsure just what I need to be looking at!
> In addition see the online help for "ANALYZE/ERROR". In addition, is there
> anything in sys$manager:operator.log?
This mostly just contains our Telnet requests. Here's one I
spotted... do you know what this means? Sometimes when we try to
connect by telnet to the server we see No License for the Active
Product (or something along those lines). Could it be something as
simple as a licensing problem, or is this a red herring?
%%%%%%%%%%% OPCOM 9-SEP-2008 04:32:26.73 %%%%%%%%%%%
Message from user SYSTEM on PIN01
%LICENSE-E-TERM, C ALL-IL-1997NOV26-2136 License has terminated
Thanks!
-
Re: How do I diagnose a server that crashes every night?
StraightEight wrote:
> %%%%%%%%%%% OPCOM 9-SEP-2008 04:32:26.73 %%%%%%%%%%%
> Message from user SYSTEM on PIN01
> %LICENSE-E-TERM, C ALL-IL-1997NOV26-2136 License has terminated
This is the C compiler licence.
The command:
SHOW LICENSE will give you a list of active licences on that node.
LICENSE LIST will give you a list of registered licences. (this will
exclude expired licences or licences that aren't valid for this node).
It is possible that you have 2 C licences, the "real" one and some
temporary one which has expired.
Not having the C compiler licence would not cause problems running
programs. It would only affect the invocation of the C compiler (CC command).
Programs compiled with this compiler will run fine without the licence.
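One way to judge whether the licence message is a red herring is to check when it was logged relative to the last boot. The sketch below (Python, meant to be run on any workstation rather than on the VMS box) subtracts the SYSTEM UPTIME in the posted bugcheck entry from its DATE/TIME and compares the result with the OPCOM timestamp. It assumes the uptime counter starts at boot; the function names are made up for illustration.

```python
from datetime import datetime, timedelta

MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

def parse_vms_time(stamp):
    """Parse a VMS absolute time such as '10-SEP-2008 05:27:25.87'."""
    date_part, time_part = stamp.split()
    day, month, year = date_part.split("-")
    hours, minutes, seconds = time_part.split(":")
    whole, _, hundredths = seconds.partition(".")
    micro = int(hundredths.ljust(2, "0")[:2]) * 10000
    return datetime(int(year), MONTHS[month.upper()], int(day),
                    int(hours), int(minutes), int(whole), micro)

def parse_uptime(text):
    """Parse an uptime such as '1 DAYS 00:55:22' into a timedelta."""
    days, _, clock = text.split()
    h, m, s = (int(x) for x in clock.split(":"))
    return timedelta(days=int(days), hours=h, minutes=m, seconds=s)

# All three values are copied from the posts in this thread.
crash = parse_vms_time("10-SEP-2008 05:27:25.87")
boot = crash - parse_uptime("1 DAYS 00:55:22")
opcom = parse_vms_time("9-SEP-2008 04:32:26.73")

print("estimated boot time:", boot)
print("licence message logged", (opcom - boot).total_seconds(),
      "seconds after the estimated boot")
```

On the posted numbers the message lands roughly 23 seconds after the estimated boot, which would be consistent with the licence being checked during startup rather than with anything happening at crash time. The uptime is only given to the second, so the figure is approximate.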
-
Re: How do I diagnose a server that crashes every night?
On 10 Sep, 10:46, JF Mezei wrote:
> Does it crash at exactly the same time every night ? Or does it vary ?
> Any relationship with actual operations being done related to that
> machine ? Or does it crash when some link goes down and the code just
> doesn't handle this properly ?
Doesn't really seem to be any pattern, some nights it restarts just
once, some nights it can happen up to 4 times.
> It would help to provide more background on what the application is. Is
> it some COBOL app that just prints an accounting report, or is it some
> real time application that controls a drilling rig ?
> What sort of stuff is connected to that machine using what sort of
> protocol ?
The file BLYSEM, I'm sure, is a software interface to a Bailey INFI900
DCS (so real time data acquisition on a rig as guessed!) for OSI PI
software. The volume of data this handles has probably increased over
the years... could a capacity problem knock a service over?
> In terms of services not starting when it boots and needing to be
> started manually, you would need to look at the SYSTARTUP_VMS.COM file
> in the SYS$MANAGER directory and take a careful look at it. The output
> normally just goes on the operator console, so if you are not on site,
> you have a hard time seeing error messages.
>
> However, if you start a service by submitting a batch job, then there
> should be a log file that contains some information on why the service
> didn't start.
>
> If SYSTARTUP_VMS.COM calls a command procedure to start a service, you
> can add /OUTPUT=logfile.log to the command
>
> eg:
> @disk:[directory]myapplication_startup.com/output=sys$manager:myapplication_startup.log
>
> Then, you could consult the log file later on to find out why the
> application didn't start.
>
> Remember that some services take some time to become available, so on a
> faster machine, you might be trying to start your app before TCPIP is
> fully available for instance, and the app would fail. But later on when
> you log in to fix the problem, TCPIP would be available and the app
> would start properly.
Thanks for the tips, I think I will try the output switch and see what
is logged. It's a good hunch at the end....the call to the service is
at the very end of the startup file, would each line in the startup
file wait until it is executed before moving to the next, or does it
just fire off all the commands at once? Now I think about it, the last
time we caught the error very early there was still a batch job
running...maybe we should call it at the end of this batch job?
Thanks for your response!
-
Re: How do I diagnose a server that crashes every night?
"StraightEight" wrote in message
news:6df72036-9e6f-4a79-96cf-a841020f7b26@l64g2000hse.googlegroups.com...
> I have no idea where to even begin determining the source of the
> problem from this.
MACHINECHK, Machine check while in kernel mode suggests
hardware. There may be other entries in the error log as well as
the bugcheck, which may give more detail.
Looking at the CLUE files in sys$errorlog, particularly the
_collect.dat may help nail down common features.
-
Re: How do I diagnose a server that crashes every night?
str8,
I do not have an Alpha CPU manual handy, so I will restrict this set
of comments to the other issues raised. However, one important
question is whether the machine check error information is the same on
every crash.
Being responsible for a system with little or no documentation can be
a significant challenge. I have seen this kind of situation often when
getting called into a site which has been without a good system
manager for a while. It is common to find things "broken" that, in
effect, were never working correctly. The absence of a noticeable
failure does not mean that there was no issue; it may simply never have
become severe enough to be noticed.
While the overall STARTUP process is capable of parallel operation,
each individual command file is executed sequentially. (Using the
parallel execution features can speed restarts substantially, as I
noted in my presentation "SYSMAN for Improved Restart Performance" at
the Fall 1999 US DECUS symposium; slides available via
http://www.rlgsc.com/decus/usf99/index.html )
Most likely, the parallel execution features were not used in this
case. If processes that are supposed to start during a restart do not
in fact start, and the requests to start them are in the system
startup file, the most common reason for the failure is a small
typographical error made when editing the startup file. If there is an
error, the startup file will exit, with only a transiently visible
message on the console.
Typically, there are two ways to resolve this: 1) extremely close
inspection of the startup file (generally
SYS$MANAGER:SYSTARTUP_VMS.COM), or 2) enable logging of the startup
sequence using the SYSMAN STARTUP SET OPTIONS/OUTPUT=FILE command [the
latter creates the file SYS$SPECIFIC:[SYSEXE]STARTUP.LOG]. Reviewing
the log file generated often clarifies precisely what messages
scrolled rapidly off the screen. I often leave unattended systems in
the FILE setting so that problems can be diagnosed after the fact.
One important recommendation is to make sure that there is a good
backup of the system disk, and a log kept of any changes to any of the
files.
It goes without saying that at some point, it may be wise to retain
outside experienced assistance to examine the problem [Disclosure: our
firm does provide consulting services in this area].
- Bob Gezelter, http://www.rlgsc.com
-
Re: How do I diagnose a server that crashes every night?
str8,
I should note that it is also possible that the Machine Check and the
active process are, in effect, not related.
The last client system that was having machine checks turned out to be
caused by an erratic power supply. The power supply worked well when
it was working, but it was apparently having problems. The fact that
the system in question would appear to be in a somewhat industrial
setting raises the question of whether there is an external power or
grounding event that is the underlying cause of the Machine Check.
If there is a UPS involved, there could also be a problem there.
- Bob Gezelter, http://www.rlgsc.com
-
Re: How do I diagnose a server that crashes every night?
In article <9074c354-2557-4638-961e-7860857e3485@s50g2000hsb.googlegroups.com>, Bob Gezelter writes:
> str8,
>
> I should note that it is also possible that the Machine Check and the
> active process are, in effect, not related.
>
> The last client system that was having machine checks turned out to be
> caused by an erratic power supply. The power supply worked well when
> it was working, but it was apparently having problems. The fact that
> the system in question would appear to be in a somewhat industrial
> setting raises the question of whether there is an external power or
> grounding event that is the underlying cause of the Machine Check.
>
> If there is a UPS involved, there could also be a problem there.
>
The OP should also be aware that although a machine check is usually a
hardware issue, it can be caused by a faulty device driver as well.
Personal experience here: I have caused VMS to issue machine checks while
I have been developing VMS device drivers in the past.
Simon.
--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP
Microsoft: Bringing you 1980's technology to a 21st century world
-
Re: How do I diagnose a server that crashes every night?
> The last client system that was having machine checks turned out to be
> caused by an erratic power supply. The power supply worked well when
> it was working, but it was apparently having problems. The fact that
> the system in question would appear to be in a somewhat industrial
> setting raises the question of whether there is an external power or
> grounding event that is the underlying cause of the Machine Check.
>
> If there is a UPS involved, there could also be a problem there.
I think you could be onto something here...as funnily enough we _had_
two VMS servers, one came back to be repaired (power supply problem!)
Now mentioning UPS gets me wondering, if there is indeed a UPS (I'll
need to check) I would imagine both servers would come off the same
UPS...maybe machine 1 never had problems with its power supply after
all! Definitely something to rule out (and perhaps in light of recent
experiences something I should have considered straight away!) Many
thanks.
-
Re: How do I diagnose a server that crashes every night?
In article <6df72036-9e6f-4a79-96cf-a841020f7b26@l64g2000hse.googlegroups.com>, StraightEight writes:
> Hi,
>
> I have very little VMS experience but we have inherited a nice shiny
> new alphaserver 250 to support (ok its not very shiny or new!) which
> is located in the middle of the sea.
I would think _very_ seriously about contracting a consultant who
knows VMS.
-
Re: How do I diagnose a server that crashes every night?
On Sep 10, 7:52 am, koeh...@eisner.nospam.encompasserve.org (Bob
Koehler) wrote:
> In article <6df72036-9e6f-4a79-96cf-a841020f7...@l64g2000hse.googlegroups.com>, StraightEight writes:
>
> > Hi,
>
> > I have very little VMS experience but we have inherited a nice shiny
> > new alphaserver 250 to support (ok its not very shiny or new!) which
> > is located in the middle of the sea.
>
> I would think _very_ seriously about contracting a consultant who
> knows VMS.
With the OP mentioning that the system was located out to sea
somewhere, I wonder what might be happening to power and/or other
environmental stuff during non-daylight hours?
-
Re: How do I diagnose a server that crashes every night?
StraightEight wrote:
> MACHINECHK, Machine check while in kernel mode
Well, it says "Machine Check" and that generally means a hardware
problem of some sort. If you have a service contract, just pick up the
phone and call for help. If not, get prior approval from whoever pays
the bills and then pick up the phone and call for help!
It might also help to try to find out what else happens every morning at
3:00 AM. The fact that the timing is consistent suggests that it's
something happening in the environment that triggers the machine check.
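Whether the timing really is consistent can be checked from the error log rather than from memory. A minimal sketch in Python, assuming the DATE/TIME stamps of the bugcheck entries have been copied off the machine into a list: it counts crashes per hour of the day. Only the first stamp below comes from the posted entry; the others are made-up placeholders, and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime

MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

def parse_vms_time(stamp):
    """Parse a VMS absolute time such as '10-SEP-2008 05:27:25.87'."""
    date_part, time_part = stamp.split()
    day, month, year = date_part.split("-")
    hours, minutes, seconds = time_part.split(":")
    # Fractions of a second do not matter for hourly buckets, so drop them.
    return datetime(int(year), MONTHS[month.upper()], int(day),
                    int(hours), int(minutes), int(float(seconds)))

def crashes_per_hour(stamps):
    """Count crashes in each hour of the day (0-23)."""
    return Counter(parse_vms_time(s).hour for s in stamps)

stamps = [
    "10-SEP-2008 05:27:25.87",  # from the posted bugcheck entry
    "9-SEP-2008 04:31:00.00",   # placeholder, replace with real data
    "8-SEP-2008 04:40:10.00",   # placeholder, replace with real data
    "8-SEP-2008 23:05:00.00",   # placeholder, replace with real data
]

counts = crashes_per_hour(stamps)
for hour, count in sorted(counts.items()):
    print(f"{hour:02d}:00  {'#' * count}")
```

A tight cluster in one or two hours would point towards something scheduled or environmental; an even spread would argue against it.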
-
Re: How do I diagnose a server that crashes every night?
If you see MACHINECHK crashes, think of hardware problems first. There
should be errlog entries immediately preceding the system crash. Find
those and analyze them.
$ ANAL/ERR/SINCE=<1-minute-before-system-crash>
or look at the errors in the dump:
$ ANAL/CRASH SYS$SYSTEM
SDA> CLUE ERRLOG
SDA> EXIT
You may need to install DECevent V3.4 ( $ DIAGNOSE command ) to
translate those errors into meaningful text.
---
Volker Halle, Invenate GmbH, OpenVMS Support
An OpenVMS crashdump analysis a day
makes the Windows headaches go away.
-
Re: How do I diagnose a server that crashes every night?
On Sep 10, 6:49 am, clubley@remove_me.eisner.decus.org-Earth.UFP
(Simon Clubley) wrote:
> The OP should also be aware that although a machine check is usually a
> hardware issue, it can be caused by a faulty device driver as well.
>
> Personal experience here: I have caused VMS to issue machine checks while
> I have been developing VMS device drivers in the past.
Simon,
Indeed. When one is not careful running in Kernel mode, particularly
at interrupt level, all kinds of strange results can ensue, for all
kinds of reasons.
My favorite was a problem on an early version of a third-party
J-11-based product: there was no RESET control, as it was presumed that
PowerFail could do it. I pointed out that there were many situations
in which PowerFail would cause a problem if there was not a valid
kernel stack pointer. Oops.
- Bob Gezelter, http://www.rlgsc.com
-
Re: How do I diagnose a server that crashes every night?
Volker Halle wrote:
> If you see MACHINECHK crashes, think of hardware problems first.
When in kernel mode, is there a wide variety of possible crash reasons,
or does that narrow down to machinechk ?
For instance, if a driver were to divide by 0, or try to execute an
instruction whose opcode doesn't exist, would it be a different crash
reason ?
And if the crashes happen ONLY during the night, then it would seem that
some external factors might be triggering this.
Question to the OP: Does that machine get a yearly visit to be
maintained, cleaned to remove the dust bunnies etc ? Or has it been
running for years without anyone giving it any hardware maintenance ?
As others have pointed out before, if you could compare the crash logs
from different nights to see if it crashes at the same location/reason
every night, this might be some clue.
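A rough sketch of that comparison in Python, assuming the ANALYZE/ERROR output has been saved as text and copied to another machine. The field labels (DATE/TIME, PROCESS NAME, ERROR PC) and the entry banner are taken from the entry posted at the top of this thread; the second entry in the sample is invented purely to show a mismatch, and the function names are hypothetical.

```python
import re
from collections import Counter

FIELDS = {
    "time": re.compile(r"DATE/TIME\s+(\d{1,2}-[A-Z]{3}-\d{4} [\d:.]+)"),
    "process": re.compile(r"PROCESS NAME\s+(\S+)"),
    "pc": re.compile(r"ERROR PC\s+([0-9A-F]{8} [0-9A-F]{8})"),
    # Bugcheck name: an upper-case word at line start followed by a comma.
    "reason": re.compile(r"^([A-Z_]+), ", re.MULTILINE),
}

def split_entries(listing):
    """Split error log text on its '**** ENTRY n. ****' banner lines."""
    parts = re.split(r"^\*+ ENTRY\s+\d+\.\s*\**\s*$", listing,
                     flags=re.MULTILINE)
    return [part for part in parts if part.strip()]

def summarize(entry):
    """Pull the comparable fields out of one entry (None if missing)."""
    summary = {}
    for name, pattern in FIELDS.items():
        match = pattern.search(entry)
        summary[name] = match.group(1) if match else None
    return summary

LISTING = """\
******************************* ENTRY 435. *******************************
DATE/TIME 10-SEP-2008 05:27:25.87
FATAL BUGCHECK
MACHINECHK, Machine check while in kernel mode
PROCESS NAME BLYSEM_I1
ERROR PC FFFFFFFF 800485F8
******************************* ENTRY 436. *******************************
DATE/TIME 11-SEP-2008 03:10:00.00
FATAL BUGCHECK
MACHINECHK, Machine check while in kernel mode
PROCESS NAME SWAPPER
ERROR PC FFFFFFFF 80012340
"""  # ENTRY 436 is an invented example, not real data

entries = [summarize(e) for e in split_entries(LISTING)]
signatures = Counter((e["reason"], e["process"], e["pc"]) for e in entries)
for (reason, process, pc), count in signatures.most_common():
    print(count, reason, process, pc)
```

If the same process and PC turn up every night, the software or one particular device looks more suspect; if they vary while the bugcheck stays MACHINECHK, that would fit the hardware and power theories raised elsewhere in this thread.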
Is it possible that the data generating device is shut down every night
for a few minutes and this greatly confuses your VMS application because
it wasn't set up to handle such events ?
Does the device use TCP or UDP or serial line communications ?
-
Re: How do I diagnose a server that crashes every night?
In article <48c81caa$0$12385$c3e8da3@news.astraweb.com>, JF Mezei writes:
>
> When in kernel mode, is there a wide variety of possible crash reasons,
> or does that narrow down to machinechk ?
There is a very wide variety of crash reasons available in kernel mode,
and very few elsewhere. Or were you asking the OP what he's actually
seen?
> For instance, if a driver were to divide by 0, or try to execute an
> instruction whose opcode doesn't exist, would it be a different crash
> reason ?
Divide by 0 exception is not a machine check, so as a good example, it
would lead to some other bugcheck.
-
Re: How do I diagnose a server that crashes every night?
In article
<3360773d-860d-4145-8509-21752f00e75a@m73g2000hsh.googlegroups.com>,
DaveG wrote:
> With the OP mentioning that the system was located out to sea
> somewhere, I wonder what might be happening to power and/or other
> environmental stuff during non-daylight hours?
A few thoughts on environmental and power stuff.
I've seen instances where folks have switched off the air conditioning
when working in a computer room (because they didn't like the noise) and
cleaners borrowing an occupied power socket to plug their equipment in.
There's also the possibility of hefty electrical machinery being
switched off then on, particularly at a work break or shift change.
One amusing example was the new production line worker who hung his
overalls over a bar code reader at his mid-morning and lunch breaks. It
took some lateral thinking to diagnose that one!
--
Paul Sture
-
Re: How do I diagnose a server that crashes every night?
Simon Clubley wrote:
> The OP should also be aware that although a machine check is usually a
> hardware issue, it can be caused by a faulty device driver as well.
>
> Personal experience here: I have caused VMS to issue machine checks while
> I have been developing VMS device drivers in the past.
While I would agree that one can cause MCHK with a device driver, on a
system that has been in use for years without change (as stated by the
OP), I would have to say that it is either a PS or possibly memory. I
would check the logfiles for ECC errors as well. Not having access to
the full error logs, diagnosing will be an exercise in divination.