Re: RX2600 hangs (VMS 8.3) - VMS


  1. Re: RX2600 hangs (VMS 8.3)

    In article <47583595$1@flight>,
    Malcolm Dunnett wrote:

    > I have an rx2600 (dual 1.4GHz/1.5MB CPUs) which has been happily running
    > VMS for several years. It's at 8.3 with VMS83I_UPDATE4 and VMS83I_SYS4
    > (along with a few other) patches installed. It's running Oracle 10.2.0.2
    > server as its only application.
    >
    > This morning at around 4am it "stopped cold". No crash dump, no errors
    > in the system error log, nothing untoward in the OPERATOR log, no error
    > messages on the console. The other cluster nodes simply report losing
    > communication to it at that time.
    >
    > Unfortunately in the heat of the moment to get it going again this
    > morning I didn't get a chance to take a crash dump (I will if the
    > problem occurs again.) I did note the following errors in the BMC error
    > log that happened at the time of the hang:
    >
    > 576 SFW 0 2 0x5680028500E02630 0000000000000000 MC_INITIALIZED_RSE
    > 06 Dec 2007 04:13:10
    > 577 SFW *7 0xC1475776D6022650 003FA17000130300 Type-02 137001 1273857
    > 06 Dec 2007 04:13:10
    > 578 SFW 0 *7 0xF680009800E02660 000000000000000B MC_INITIATED
    > 06 Dec 2007 04:13:10
    > 579 SFW 0 2 0x568002A100E02680 08000000FFF61020 MC_PSP
    > 06 Dec 2007 04:13:11
    > 580 SFW 0 2 0x5680010900E026A0 0000000000000000 PAL_CORRECTED_MC
    > 06 Dec 2007 04:13:11
    > 581 SFW 0 2 0x568002B000E026C0 2007120600041313 MC_TIMESTAMP
    > 06 Dec 2007 04:13:11
    > 582 SFW 0 2 0x568002A000E026E0 0000000000000000 MC_POST_PROCESS_PLAT
    > 06 Dec 2007 04:13:11
    > 583 SFW 0 *7 0x7680011700E02700 0000000000000000 UNEXPECTED_RET_TO_SAL_CHECK
    > 06 Dec 2007 04:13:11
    >
    > do these provide any clue as to what the problem might be?
    >
    > ps I already have a call open through ITRC - but I have a feeling
    > there's folks in this group who know a lot more than ITRC does.


    This cascading series of system events involves a "Machine Check" a.k.a.
    "Machine Check Abort".

    While an MCA can be caused by software -- a wayward device driver or
    other inner-mode code -- the general situation you describe sounds more
    like a hardware-initiated problem.
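
    If you want to sift a captured log like the one quoted above for the
    flagged entries, here is a minimal Python sketch. It assumes the
    layout seen in that paste: an entry line, then a timestamp line, with
    the keyword sometimes wrapped onto a line of its own. The field names
    (num, source, level, data1, data2, keyword) are my own labels, and
    reading the starred value as a raised alert level is an assumption,
    not something taken from HP documentation.

```python
import re
from datetime import datetime

# Entry line: number, source, optional cpu/level columns, two hex words, keyword.
ENTRY = re.compile(
    r"^(?P<num>\d+)\s+(?P<source>\S+)\s+(?P<mid>.*?)\s*"
    r"(?P<data1>0x[0-9A-Fa-f]{16})\s+(?P<data2>[0-9A-Fa-f]{16})\s*(?P<keyword>.*)$"
)
STAMP = re.compile(r"^\d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2}$")

def parse_events(text):
    """Turn a pasted event log into a list of dicts (hypothetical field names)."""
    events = []
    for raw in text.splitlines():
        line = raw.lstrip("> \t").rstrip()   # drop quote markers and indentation
        if not line:
            continue
        m = ENTRY.match(line)
        if m:
            tokens = m.group("mid").split()
            events.append({
                "num": int(m.group("num")),
                "source": m.group("source"),
                "level": tokens[-1] if tokens else "",
                "data1": m.group("data1"),
                "data2": m.group("data2"),
                "keyword": m.group("keyword").strip(),
                "time": None,
            })
        elif STAMP.match(line) and events:
            events[-1]["time"] = datetime.strptime(line, "%d %b %Y %H:%M:%S")
        elif events and events[-1]["time"] is None:
            # A keyword that wrapped onto its own line before the timestamp.
            events[-1]["keyword"] = (events[-1]["keyword"] + " " + line).strip()
    return events

SAMPLE = """\
> 576 SFW 0 2 0x5680028500E02630 0000000000000000 MC_INITIALIZED_RSE
> 06 Dec 2007 04:13:10
> 577 SFW *7 0xC1475776D6022650 003FA17000130300 Type-02 137001 1273857
> 06 Dec 2007 04:13:10
> 583 SFW 0 *7 0x7680011700E02700 0000000000000000
> UNEXPECTED_RET_TO_SAL_CHECK
> 06 Dec 2007 04:13:11
"""

events = parse_events(SAMPLE)
# Entries whose level column carries a star (assumed to mark a raised alert level).
flagged = [e["num"] for e in events if e["level"].startswith("*")]
print(flagged)
```

    This only tidies up what is already in the paste. It does not decode
    the hex data words; that still needs the text-mode view or the
    support tools.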

    There is much more error information logged in the system than what is
    shown here. I recommend that you do NOT clear any of the logs until the
    HP support folks ask you to. But looking at the logs shouldn't cause
    any harm...

    Does this system have the optional Management Processor card? If so, it
    is much easier to look at the logs. I'll assume you have the MP...

    The events you saw are logged in the "System Event Log" (SEL) and/or the
    "Forward Progress Log" (FPL). Generally the SEL only contains the most
    serious events, while the FPL contains many routine status messages as
    well. If the FPL fills up, the oldest events are overwritten. The SEL
    is persistent until something deletes events; VMS will do so if it gets
    too full. VMS also attempts to transfer events from the SEL to the VMS
    error log, where they can be analyzed by tools like WEBES. But the OS
    can't get the information for severe failures, at least not until after
    the fact. If the failure is bad enough, VMS does not get control of the
    system to do any logging.

    From the MP main menu, you can view the SEL by choosing the SL menu
    item, then the E option. Once you are in the SEL section, you can type
    T to switch to Text mode, where you get a more verbose interpretation of
    each message. You can use J to jump to a particular event by event
    number, and + and - to navigate forward and backward.

    The full text interpretation of the messages from your system may shed
    some additional light.

    Lots of information from the status and error registers in the chipset
    is logged in a separate section of NVRAM. You can view this from the
    EFI shell, using the "errdump" command. "errdump MCA" would be of
    interest here. If you can capture the errdump information, the services
    folks have a tool that can interpret it in great detail. The tool can
    often point right at the component that most likely caused the failure.

    Unfortunately, the errdump memory can only hold one event worth of data,
    and it saves the first event until it is cleared. If your system has
    ancient data in the errdump memory, it won't be helpful in solving the
    current problem.
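
    The three retention behaviours described so far (FPL overwrites the
    oldest entries, SEL keeps everything until cleared, errdump keeps only
    the first event until cleared) can be pictured with a toy model. This
    is purely illustrative Python: the class names and the capacity of 3
    are made up, and it says nothing about how the firmware actually
    stores these logs.

```python
from collections import deque

class ForwardProgressLog:
    """Fixed-size log: when full, the oldest event is overwritten."""
    def __init__(self, capacity):
        self.events = deque(maxlen=capacity)

    def record(self, event):
        self.events.append(event)

class SystemEventLog:
    """Persistent log: events stay until something clears them."""
    def __init__(self):
        self.events = []

    def record(self, event):
        self.events.append(event)

    def clear(self):
        self.events.clear()

class ErrdumpSlot:
    """Single slot: keeps the FIRST event and ignores later ones until cleared."""
    def __init__(self):
        self.saved = None

    def record(self, event):
        if self.saved is None:
            self.saved = event

    def clear(self):
        self.saved = None

fpl = ForwardProgressLog(capacity=3)   # hypothetical capacity
for name in ["e1", "e2", "e3", "e4", "e5"]:
    fpl.record(name)
print(list(fpl.events))                # e1 and e2 have been overwritten

slot = ErrdumpSlot()
slot.record("old MCA")
slot.record("new MCA")                 # ignored: the slot is still occupied
print(slot.saved)
```

    The last two lines show why stale errdump contents matter: a newer
    failure is not captured while an old record still occupies the slot.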

    There's another log that holds information about memory pages that have
    been marked bad and deallocated. That is the Page Deallocation Table
    (PDT), and you can view it from the EFI shell via the "pdt" command.
    It's not unusual to see a small number of entries on a system that's
    been in production for a while.

    -- Robert

  2. Re: RX2600 hangs (VMS 8.3)

    Malcolm,

    I have to agree on the India problem (and Costa Rica). I have a few
    names and e-mail addresses of some very special people within support
    for whom English is a native language and VMS a specialty. I open the
    call with Bangalore or Costa Rica, describe the problem (knowing that
    the phone 'droid likely doesn't understand me, nor the problem), and
    get the case number. Then I send an e-mail with the case-number in the
    subject line, to one of those people.

    They then help to dispatch the case to someone appropriate.

    I don't know what happens to those VMS support cases when the default
    HP policies are followed. For instance, 3 days after my last problem
    (combination hardware and VMS diagnostic tools) was solved, the Indian
    support person sent me an e-mail wondering if I needed any assistance
    ... 3 days... sigh

    Carl Friedberg

    On Dec 7, 2007 2:03 PM, Malcolm Dunnett wrote:

    > ps. I have to concur with those others who have bemoaned the quality
    > of the service since it got transferred to India. The person I was
    > dealing with said that the log entries (same ones I posted here that
    > clearly show Machine Checks) were "nothing special there - just a
    > machine initated messages."


  3. Re: RX2600 hangs (VMS 8.3)

    In article <475998ee$1@flight>,
    Malcolm Dunnett wrote:

    > Robert Deininger wrote:
    >
    > > This cascading series of system events involves a "Machine Check" a.k.a.
    > > "Machine Check Abort".
    > >
    > > While an MCA can be caused by software -- a wayward device driver or
    > > other inner-mode code -- the general situation you describe sounds more
    > > like a hardware-initiated problem.
    > >

    > The system hung again yesterday afternoon. It would not respond to
    > the ^P on the console to get the IPC> prompt so that I could take a
    > crash dump. I suspect this would most likely only happen with a hardware
    > failure.


    There are a few failure modes where the crash message sent via ^P never
    arrives. I'm not convinced they are entirely hardware-related.

    But you have clear symptoms of a hardware problem. The hardware
    indications need to be diagnosed. If they point to possible OS
    involvement, then worry about getting a crash dump.

    > I swapped another server in for this one and I'll see what happens
    > with it (so far it's good since yesterday afternoon whereas the other
    > one failed twice in 24 hours).
    >
    > > Does this system have the optional Management Processor card?

    >
    > Yes.
    >
    > > Lots of information from the status and error registers in the chipset
    > > is logged in a separate section of NVRAM. You can view this from the
    > > EFI shell, using the "errdump" command. "errdump MCA" would be of
    > > interest here. If you can capture the errdump information, the services
    > > folks have a tool that can interpret it in great detail. The tool can
    > > often point right at the component that most likely caused the failure.
    > >

    >
    > I somehow managed to blow this away (or it never got logged), in
    > either case there's nothing in there now.


    If it never got logged, it's probably a hardware problem.

    > Thanks for the information - I now know better what to capture next
    > time something like this happens.
    >
    > ps. I have to concur with those others who have bemoaned the quality
    > of the service since it got transferred to India. The person I was
    > dealing with said that the log entries (same ones I posted here that
    > clearly show Machine Checks) were "nothing special there - just a
    > machine initated messages."


    Grrr. So what is the status of the case now? Did it get logged as a
    hardware problem, or a VMS problem? Hardware support should be dealing
    with it.

    Did you get any useful progress on the case yet? Are you waiting for
    2nd-level support to take a look?

    -- Robert
