ECC errors: is my RAM bad? - Linux
-
ECC errors: is my RAM bad?
I powered up my Alphastation 500/266 after it had been off for
several months. Now, during fsck, I get lots of errors like this
(many per second):
CIA machine check: vector=0x630 pc=0xfffffc00004913e8 code=0x86
machine check type: correctable ECC error (retryable)
pc = [] ra = [] ps = 0000 Not tainted
v0 = 0000000000000038 t0 = 0000000000000000 t1 = bfffd2e500000000
t2 = 0000000000000000 t3 = 0000000000000030 t4 = 0000000000000020
t5 = 00000001200c6c78 t6 = fffffc00168d9fc8 t7 = fffffc001fbf4000
a0 = fffffc001fbf7eb8 a1 = 000000000007c0a4 a2 = 0000000000000000
a3 = 0000000000002000 a4 = 00000001200c4cb0 a5 = 000000011ffffa80
t8 = 0000000000008000 t9 = 0000020000170e58 t10= 0000000000000008
t11= 0000000000000001 pv = fffffc00004912e0 at = fffffc000034c834
gp = fffffc0000547458 sp = fffffc001fbf7e08
Does this mean the RAM has gone bad? What is the failing address?
The memory configuration is:
Memory Size = 512Mb
Bank    Size/Sets   Base Addr  Speed
------  ----------  ---------  -----
  00    256Mb/2     000000000  Fast
  01    256Mb/2     010000000  Fast
This machine has SRM V7.2-2 (Apr 4 2000 17:17:43). I tried the
"memory" command, which should run "memtest" with the appropriate
arguments, but it just fails with "Invalid group selection".
Are there any NVRAM variables that could cause ECC errors?
I am using Debian GNU/Linux, kernel-image-2.4.21-5-generic.
-
Re: ECC errors: is my RAM bad?
Kalle Olavi Niemitalo writes:
> I powered up my Alphastation 500/266 after it had been off for
> several months.
I forgot to mention I also moved it to a different room, where it
now lies flat on the desk. In the previous room, it had been
standing on its right side (where there are no ventilation holes)
in a tower orientation of sorts. I suppose this change might
have loosened a contact, or something.
Anyway, I now rebooted Linux with mem=256M, and there are no ECC
errors so far. Which makes me suspect the flaw is in the second
bank.
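The reasoning behind that suspicion can be sketched in a few lines of Python. The bank table is copied from the SRM output above; the assumption (not confirmed anywhere in this thread) is that mem=256M makes the kernel use only physical addresses below 256 MB, starting at address 0. The function names are made up for illustration.

```python
# Bank layout copied from the SRM "Memory Size" output in the first post.
# The "/2" sets figure is ignored here; only base address and size matter.
MIB = 1024 * 1024
BANKS = [
    (0, 0x000000000, 256 * MIB),
    (1, 0x010000000, 256 * MIB),
]

def bank_for_address(phys_addr):
    """Return the bank number holding phys_addr, or None if unpopulated."""
    for bank, base, size in BANKS:
        if base <= phys_addr < base + size:
            return bank
    return None

def banks_used(mem_limit):
    """Return the banks overlapping [0, mem_limit).

    Assumption: the kernel's memory limit counts up from physical address 0.
    """
    return [bank for bank, base, _size in BANKS if base < mem_limit]

print(banks_used(256 * MIB))         # [0]
print(banks_used(512 * MIB))         # [0, 1]
print(bank_for_address(0x10000000))  # 1
```

Under that assumption, a 256 MB limit leaves bank 01 entirely unused, which is consistent with the errors disappearing.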
-
Re: ECC errors: is my RAM bad?
Kalle Olavi Niemitalo writes:
> I powered up my Alphastation 500/266 after it had been off for
> several months. Now, during fsck, I get lots of errors like this
> (many per second):
>
> CIA machine check: vector=0x630 pc=0xfffffc00004913e8 code=0x86
> machine check type: correctable ECC error (retryable)
> pc = [] ra = [] ps = 0000 Not tainted
> v0 = 0000000000000038 t0 = 0000000000000000 t1 = bfffd2e500000000
> t2 = 0000000000000000 t3 = 0000000000000030 t4 = 0000000000000020
> t5 = 00000001200c6c78 t6 = fffffc00168d9fc8 t7 = fffffc001fbf4000
> a0 = fffffc001fbf7eb8 a1 = 000000000007c0a4 a2 = 0000000000000000
> a3 = 0000000000002000 a4 = 00000001200c4cb0 a5 = 000000011ffffa80
> t8 = 0000000000008000 t9 = 0000020000170e58 t10= 0000000000000008
> t11= 0000000000000001 pv = fffffc00004912e0 at = fffffc000034c834
> gp = fffffc0000547458 sp = fffffc001fbf7e08
>
> Does this mean the RAM has gone bad?
Most likely. It did when I got it, at least.
> What is the failing address?
Impossible to tell from that information.
--
Måns Rullgård
mru@kth.se
-
Re: ECC errors: is my RAM bad?
Kalle Olavi Niemitalo writes:
> Kalle Olavi Niemitalo writes:
>
>> I powered up my Alphastation 500/266 after it had been off for
>> several months.
>
> I forgot to mention I also moved it to a different room, where it
> now lies flat on the desk. In the previous room, it had been
> standing on its right side (where there are no ventilation holes)
> in a tower orientation of sorts. I suppose this change might
> have loosened a contact, or something.
Take the memory out, then put it back, just to be sure there isn't a
loose connection.
> Anyway, I now rebooted Linux with mem=256M, and there are no ECC
> errors so far. Which makes me suspect the flaw is in the second
> bank.
That seems reasonable to me.
--
Måns Rullgård
mru@kth.se
-
Re: ECC errors: is my RAM bad?
mru@kth.se (Måns Rullgård) writes:
> Take the memory out, then put it back, just to be sure there isn't a
> loose connection.
No effect. :-(
-
Re: ECC errors: is my RAM bad?
Kalle Olavi Niemitalo writes:
> mru@kth.se (Måns Rullgård) writes:
>
>> Take the memory out, then put it back, just to be sure there isn't a
>> loose connection.
>
> No effect. :-(
Then you'll just have to swap it out.
--
Måns Rullgård
mru@kth.se
-
Re: ECC errors: is my RAM bad?
mru@kth.se (Måns Rullgård) writes:
> Kalle Olavi Niemitalo writes:
>> What is the failing address?
>
> Impossible to tell from that information.
There were no other messages in /var/log/kern.log nor on the
console, between the messages that I already posted.
How can I tell Linux to display the address?
Or doesn't Linux even know it?
I would be interested in knowing whether the fault is at a
specific address or spans the entire bank. I fear it's the
latter, because I get so many of those ECC messages.
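To put a number on "so many", here is a rough Python sketch that tallies the machine-check reports in a chunk of kernel log text. The regular expression is written against the one message format quoted in this thread and may not match other kernels; summarize_mchecks is a made-up name. Note that pc is a program counter, not a memory address, so this only shows how many reports there are and whether they come from the same code location. It cannot recover the failing address.

```python
import re
from collections import Counter

# Matches the header line of the report quoted earlier in the thread.
MCHECK_RE = re.compile(
    r"CIA machine check: vector=(0x[0-9a-f]+) pc=(0x[0-9a-f]+) code=(0x[0-9a-f]+)"
)

def summarize_mchecks(log_text):
    """Count machine-check reports, keyed by (vector, pc, code) strings."""
    counts = Counter()
    for match in MCHECK_RE.finditer(log_text):
        counts[match.groups()] += 1
    return counts

# Sample input taken from the messages posted above.
sample = """\
CIA machine check: vector=0x630 pc=0xfffffc00004913e8 code=0x86
machine check type: correctable ECC error (retryable)
"""

for (vector, pc, code), n in summarize_mchecks(sample).most_common():
    print(n, vector, pc, code)
```

In practice the text would come from /var/log/kern.log rather than a string literal.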
I tried to trigger the error by writing and reading at the
beginning of the second bank in SRM:
>>>deposit -q -physical -n f 10000000 0123456789abcdef
>>>examine -q -physical -n f 10000000
I didn't get any unusual messages from this. Does SRM report
recoverable ECC errors at all?
If I specify mem=256M to Linux, can I still access the second
bank via /dev/mem?
-
Re: ECC errors: is my RAM bad?
Kalle Olavi Niemitalo writes:
> mru@kth.se (Måns Rullgård) writes:
>
>> Kalle Olavi Niemitalo writes:
>>> What is the failing address?
>>
>> Impossible to tell from that information.
>
> There were no other messages in /var/log/kern.log nor on the
> console, between the messages that I already posted.
>
> How can I tell Linux to display the address?
> Or doesn't Linux even know it?
I don't know. Try poking around in arch/alpha/kernel/irq_alpha.c and
core_YOURCHIPSET.c.
> I would be interested in knowing whether the fault is at a
> specific address or spans the entire bank. I fear it's the
> latter, because I get so many of those ECC messages.
It could be that one data pin is broken somehow. That would make the
whole module useless. I once had a 32 MB module that had an error at
just one place. The first 8 MB were fine, IIRC.
> I tried to trigger the error by writing and reading at the
> beginning of the second bank in SRM:
>
> >>>deposit -q -physical -n f 10000000 0123456789abcdef
> >>>examine -q -physical -n f 10000000
>
> I didn't get any unusual messages from this. Does SRM report
> recoverable ECC errors at all?
You could try playing around with memexer, I think that's what it's
called. It's supposed to loop over a region of memory reading and
writing until it's stopped. I've never managed to get any useful
information from it, though.
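For what it's worth, the general idea of such an exerciser (write a pattern over a region, read it back, repeat) looks roughly like the Python sketch below. This is not memexer's actual algorithm, which I don't know; the patterns and the pattern_test name are made up. It also runs on an ordinary user-space buffer, so it cannot be aimed at a particular physical bank, and correctable ECC errors are fixed by the hardware before software sees the data, so they would show up in the kernel log rather than as mismatches here.

```python
def pattern_test(buf, patterns=(0x00, 0xFF, 0x55, 0xAA), passes=1):
    """Fill buf with each pattern, read it back, and collect mismatches.

    Returns a list of (offset, expected, got) tuples; empty if every
    byte read back as written.
    """
    mismatches = []
    for _ in range(passes):
        for pattern in patterns:
            # Write phase: fill the whole region with the pattern.
            for i in range(len(buf)):
                buf[i] = pattern
            # Read phase: verify every byte.
            for i in range(len(buf)):
                if buf[i] != pattern:
                    mismatches.append((i, pattern, buf[i]))
    return mismatches

region = bytearray(64 * 1024)  # arbitrary 64 KB test buffer
print(pattern_test(region))
```

Alternating patterns such as 0x55 and 0xAA toggle every bit, which is the kind of thing that would expose a stuck data line as an uncorrected mismatch.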
> If I specify mem=256M to Linux, can I regardless access the
> second bank via /dev/mem?
I wouldn't think so.
--
Måns Rullgård
mru@kth.se