While running MemTest86+ v1.70 (latest, from a Win98SE boot floppy) to
reliability-test a used RAM upgrade in a popular, redeployed IBM
xSeries 345 (dual Xeon 2.8GHz, 8670-61X, 100MHz FSB, 2002 era with
latest v1.21 BIOS, v1.09 ISMP) 2U rack server, the screen hung about
30 min into the test. After that the box would not boot: dark screen,
no BIOS. The box powers on with the green power-on LED lit on the
front panel but otherwise appears totally dead. It has dual IBM 350
watt power supplies, both with green LEDs lit at the rear.

Have not seen this IBM xSeries server issue discussed when googling
the newsgroups and Tek-Tips, so this long solution is described here,
with additional questions on enhancing the reliability of legacy IBM
servers. Details follow. Regards, Phil

-----

The RAM totals 4GB (8GB max with 2GB DIMMs and dual PS): 4 pieces of
1GB IBM FRU 09N4308 / 38L4031 184-pin double-sided DIMMs (DDR 266MHz,
PC2100 CL2.5 registered ECC, spec 100MHz 2.5v) with Samsung SDRAM
memory chips (K4H510638D-TC80), which got quite hot to the touch
during this strenuous testing. The heat was most evident on the older
chips with 2002 date codes, compared to the 2003 ones.

The previous memory totalled 1GB RAM: 4 pieces of 256MB double-sided
registered ECC DIMMs FRU 09N4306 / 38L4029 with Micron MT46V32M4-75A
chips (512MB (2x256MB) min). One pair had older late-02 and one pair
mid-03 chip date codes. The Micron chips are rated 7.5ns vs 8.0ns for
the Samsung chips (reading the -75 and 80 suffixes as cycle times);
both appear to have sufficient design margin for the 10ns cycle of
the 100MHz operational spec. The memory sticks are installed in
matched pairs for 2-way interleaved operation; factory labels face
inwards.

Has any SysE added gamer RAM chip coolers to servers in heavy
production? There is only 3/16" (0.1875 in) between the chips of DIMMs
in adjacent slots; the chip array is 4 3/4" long and the double-sided
DIMM is 1/4" (0.25 in) thick (1/8" (0.125 in) on the 256MB). So the
metal heat sinks can be no more than 1/8" thick, fins included. Does
one remove the labels for better plastic-to-metal heat transfer?

When booting the box from cold, the computer powered on but there was
no video from the integrated ATI RAGE XL chipset and no BIOS beeps.
The Light Path Diagnostics panel showed nothing (latest Integrated Sys
Mgmt Processor firmware). The blinking green LED next to the CMOS
battery (ISMP activity) just shows that AC is connected to the box.
The LEDs next to the DIMM slots showed nothing. Testing was done in
pairs in DIMM slots 1 and 2, which are closest to the edge of the
mainboard and case (HW manual p57-8).

IBM's Hardware Maintenance Manual (48P9718, 11th Ed, Feb 04, latest)
showed no such diagnosis and no such remedy; see Chap6 Symptom-to-FRU
index, p83-113. The closest symptom would be BIOS beep code 1-1-3
(CMOS write/read test failed, p83), but there were no beeps. The
"No-beep symptoms" table on p86 offered no clue. Same with the other
manuals: Options p21-3 (48P9719, 1st Ed, 7/2002) and the User's Guide
memory spec p3, reliability p5-6 (48P9717, 1st Ed, 7/2002). The
Installation Guide (48P9714, 2nd Ed, 7/2002) has Chap2 Installing
Options, Memory p9-11, and Chap5 Solving Problems p27, where Table 3
says that with Boot Code = No beep, one should call IBM Service.

The only real clues to solving the problem were in the HW Maint Manual
"Undetermined problems" p112 near the end of the chapter, which has
Notes 1 and 2 on damaged data in CMOS and BIOS.

The undocumented fix was to remove the CMOS CR2032 3v Lithium battery
(FRU 33F8354), short the leads, and replace it, which resets the BIOS
to default. Since this systemboard has an upright CMOS battery
retainer, one needs to use an insulated forceps (napkin) to remove the
battery with one hand, with the other hand's fingernail on the
retaining clip (the flat positive + side with the mfgr name faces
backwards). Advise removing any ServeRAID board in Slot 2 (PCI-X
100MHz) too, for ease of access. Use a screwdriver in the black
plastic clip to ease opening the Adapter Retainer without breaking the
blue plastic pronged snaps. The manual procedures "Replacing the
battery" p69-71 and "Installing a ServeRAID-5i adapter" p54-5 have
diagrams with the details.

Then the box will boot with a "161 Bad CMOS battery" error (p102), and
the BIOS needs the time and date reset and the system-error logs
cleared.

After several repetitions of this scenario, was able to get both pairs
of Samsung memory to pass at least one complete cycle of the test (abt
45 min), but not much further than that before the box hung again and
required an R&R of the CMOS battery. The newer pair had less trouble
clearing the test hurdle and seemed to run cooler.

Even with the chassis cover on and all 8 chassis fans in the array
running, the rear-most memory chips were noticeably hotter than the
chips closer to the dual fans and CPUs. The DIMMs closer to the edge
of the case also ran hotter, thus the newer pair's final destination
was DIMM slots 1 and 2. The BIOS did spin down the fans after initial
startup with all 8 fans roaring away. During the hour-long memory
testing, at no time did the fans spin up, with room ambient at about
60F.

We now have 4GB of iffy RAM...any comments from fellow IBM
SysE?? Did Samsung change (shrink) their process technology for
fabricating half-gigabit (512Mb) DDR DRAM memory chips in the late
2002 timeframe?? I used to think that Samsung set the world standard
in memory chips. That was during the late 90s era, when Intel / Rambus
RDRAM RIMMs were battling the SDRAM DDR world.

Are there more reliable memory chip mfgrs that IBM OEMs, such as Hynix
(another Korean), Elpida Opt 33L5039 (Japanese), Infineon (higher
rated IBM FRU 09N4308 33L5039 CL2 PC2100) (German), and Micron
(unbranded IBM compatible to FRU 10K0071) (USA)?? Should we be looking
at enhanced memory specialists such as Corsair, OCZ, Patriot, etc? Or
is the best solution to the overheating memory chip issue IBM's
Chipkill technology? Has anyone installed this more expensive memory
in xSeries servers??

BTW, this issue was cross-posted in news:comp.os.linux.hardware ,
news:comp.sys.ibm.pc.hardware.chips and Tek-Tips.com's IBM Server
discussion group on 27 Apr 07.