On Apr 11, 6:09 am, Malcolm Dunnett wrote:
> Can someone out there refresh my memory on this. I have an ES40 with 4
> CPUs. A couple of times in the last few months it has "hung". It falls
> out of the cluster and won't respond to the console terminal (^P doesn't
> do anything). Pressing the reset button on the OCP reboots it and it
> continues to run fine once again. I don' think it's a case of one of the
> CPUs failing as it never gets the CPUSPINWAIT error (yesterday it was
> hung for about 5 hours overnight)
>
> ISTR a report a few years ago that early versions of ES40s with multiple
> CPUs showed a problem like this. Does anyone recall if that sounds like
> the same symptoms? This system has old EV6's in it:
>
> CPU ID 0. CPU State rc,pa,pp,cv,pv,pmv,pl
> CPU Type EV6 Pass 2.3 (21264)
> PAL Code 1.98-4 Halt PC 00000000.20000000
> CPU Revision .... Halt PS 00000000.00001F00
> Serial Number AY92637351 Halt Code "Bootstrap or
> Powerfail"
> Console Vers V7.3-1 Halt Request "Default, No Action"
>
> IIRC you needed at least pass 2.5 to avoid the problem. I've got a set
> of 667MHz CPUs I can swap in there, but before I try that I'm hoping to
> verify that this sounds like the problem caused by the earlier CPU revs.
>
> Anyone recall?


I agree with Volker, that the CPUs are suspicious.

At my former employer, we had many ES40 clusters in
the company, mine was the earliest one purchased (I'm
told), and so had early pass EV6's (as did a few other
clusters).

We did a lot of debug work with HP over precisely the
same symptoms: system hangs, no crash dump written,
nothing in the error log. We were given extensive
instructions on things to capture when this happened
(reading LED's, going into the RMC and dumping some
registers, etc.).

HP finally acknowledged the bug in the CPUs and we
put a program in place to upgrade all of them (which
eliminated the problem).

-Ken