ECC errors: is my RAM bad? - Linux

  1. ECC errors: is my RAM bad?

    I powered up my Alphastation 500/266 after it had been off for
    several months. Now, during fsck, I get lots of errors like this
    (many per second):

    CIA machine check: vector=0x630 pc=0xfffffc00004913e8 code=0x86
    machine check type: correctable ECC error (retryable)
    pc = [] ra = [] ps = 0000 Not tainted
    v0 = 0000000000000038 t0 = 0000000000000000 t1 = bfffd2e500000000
    t2 = 0000000000000000 t3 = 0000000000000030 t4 = 0000000000000020
    t5 = 00000001200c6c78 t6 = fffffc00168d9fc8 t7 = fffffc001fbf4000
    a0 = fffffc001fbf7eb8 a1 = 000000000007c0a4 a2 = 0000000000000000
    a3 = 0000000000002000 a4 = 00000001200c4cb0 a5 = 000000011ffffa80
    t8 = 0000000000008000 t9 = 0000020000170e58 t10= 0000000000000008
    t11= 0000000000000001 pv = fffffc00004912e0 at = fffffc000034c834
    gp = fffffc0000547458 sp = fffffc001fbf7e08

    Does this mean the RAM has gone bad? What is the failing address?
    The memory configuration is:

    Memory Size = 512Mb

    Bank    Size/Sets   Base Addr   Speed
    ------  ----------  ---------   -----
    00      256Mb/2     000000000   Fast
    01      256Mb/2     010000000   Fast


    This machine has SRM V7.2-2 (Apr 4 2000 17:17:43). I tried the
    "memory" command, which should run "memtest" with the appropriate
    arguments, but it just fails with "Invalid group selection".
    Are there any NVRAM variables that could cause ECC errors?

    I am using Debian GNU/Linux, kernel-image-2.4.21-5-generic.

  2. Re: ECC errors: is my RAM bad?

    Kalle Olavi Niemitalo writes:

    > I powered up my Alphastation 500/266 after it had been off for
    > several months.


    I forgot to mention I also moved it to a different room, where it
    now lies flat on the desk. In the previous room, it had been
    standing on its right side (where there are no ventilation holes)
    in a tower orientation of sorts. I suppose this change might
    have loosened a contact, or something.

    Anyway, I now rebooted Linux with mem=256M, and there are no ECC
    errors so far. Which makes me suspect the flaw is in the second
    bank.
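
    As a rough sanity check of that reasoning, here is a minimal Python
    sketch that maps physical addresses to the banks listed by SRM. The
    bank table is copied from the listing above; the helper names and the
    assumption that mem=256M simply limits Linux to physical addresses
    below 256 MB are mine, not something the kernel or SRM reports.

```python
MIB = 1024 * 1024

# Bank layout copied from the SRM memory listing: (bank, base address, size).
BANKS = [
    (0, 0x000000000, 256 * MIB),
    (1, 0x010000000, 256 * MIB),
]


def bank_of(phys_addr):
    """Return the bank number containing phys_addr, or None if unmapped."""
    for bank, base, size in BANKS:
        if base <= phys_addr < base + size:
            return bank
    return None


def banks_in_use(mem_limit):
    """Banks overlapping [0, mem_limit), assuming mem= caps usable RAM from 0."""
    return [bank for bank, base, _size in BANKS if base < mem_limit]


print(banks_in_use(256 * MIB))  # [0]: only the first bank is touched
print(bank_of(0x10000000))      # 1: the second bank starts at exactly 256 MB
```

    Under those assumptions, mem=256M leaves bank 01 entirely unused,
    which fits the errors disappearing.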

  3. Re: ECC errors: is my RAM bad?

    Kalle Olavi Niemitalo writes:

    > I powered up my Alphastation 500/266 after it had been off for
    > several months. Now, during fsck, I get lots of errors like this
    > (many per second):
    >
    > CIA machine check: vector=0x630 pc=0xfffffc00004913e8 code=0x86
    > machine check type: correctable ECC error (retryable)
    > pc = [] ra = [] ps = 0000 Not tainted
    > v0 = 0000000000000038 t0 = 0000000000000000 t1 = bfffd2e500000000
    > t2 = 0000000000000000 t3 = 0000000000000030 t4 = 0000000000000020
    > t5 = 00000001200c6c78 t6 = fffffc00168d9fc8 t7 = fffffc001fbf4000
    > a0 = fffffc001fbf7eb8 a1 = 000000000007c0a4 a2 = 0000000000000000
    > a3 = 0000000000002000 a4 = 00000001200c4cb0 a5 = 000000011ffffa80
    > t8 = 0000000000008000 t9 = 0000020000170e58 t10= 0000000000000008
    > t11= 0000000000000001 pv = fffffc00004912e0 at = fffffc000034c834
    > gp = fffffc0000547458 sp = fffffc001fbf7e08
    >
    > Does this mean the RAM has gone bad?


    Most likely. It did when I got it, at least.

    > What is the failing address?


    Impossible to tell from that information.

    --
    Måns Rullgård
    mru@kth.se

  4. Re: ECC errors: is my RAM bad?

    Kalle Olavi Niemitalo writes:

    > Kalle Olavi Niemitalo writes:
    >
    >> I powered up my Alphastation 500/266 after it had been off for
    >> several months.
    >
    > I forgot to mention I also moved it to a different room, where it
    > now lies flat on the desk. In the previous room, it had been
    > standing on its right side (where there are no ventilation holes)
    > in a tower orientation of sorts. I suppose this change might
    > have loosened a contact, or something.


    Take the memory out, then put it back, just to be sure there isn't a
    loose connection.

    > Anyway, I now rebooted Linux with mem=256M, and there are no ECC
    > errors so far. Which makes me suspect the flaw is in the second
    > bank.


    That seems reasonable to me.

    --
    Måns Rullgård
    mru@kth.se

  5. Re: ECC errors: is my RAM bad?

    mru@kth.se (Måns Rullgård) writes:

    > Take the memory out, then put it back, just to be sure there isn't a
    > loose connection.


    No effect. :-(

  6. Re: ECC errors: is my RAM bad?

    Kalle Olavi Niemitalo writes:

    > mru@kth.se (Måns Rullgård) writes:
    >
    >> Take the memory out, then put it back, just to be sure there isn't a
    >> loose connection.
    >
    > No effect. :-(


    Then you'll just have to swap it out.

    --
    Måns Rullgård
    mru@kth.se

  7. Re: ECC errors: is my RAM bad?

    mru@kth.se (Måns Rullgård) writes:

    > Kalle Olavi Niemitalo writes:
    >> What is the failing address?
    >
    > Impossible to tell from that information.


    There were no other messages in /var/log/kern.log nor on the
    console, between the messages that I already posted.

    How can I tell Linux to display the address?
    Or doesn't Linux even know it?

    I would be interested in knowing whether the fault is at a
    specific address or spans the entire bank. I fear it's the
    latter, because I get so many of those ECC messages.
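
    To get at least a rough picture of how the errors are distributed, the
    log could be tallied with something like the Python sketch below. It
    only parses the "CIA machine check" header line in the format quoted
    earlier in this thread; the function name is hypothetical. As far as I
    understand, the pc field is the program counter at the time of the
    check, so this shows which code was running, not which memory address
    failed.

```python
import re
from collections import Counter

# Matches the header line of each report as quoted earlier in this thread.
MCHECK_RE = re.compile(
    r"CIA machine check: vector=(0x[0-9a-fA-F]+) "
    r"pc=(0x[0-9a-fA-F]+) code=(0x[0-9a-fA-F]+)"
)


def tally_machine_checks(lines):
    """Count machine-check reports per (pc, code) pair."""
    counts = Counter()
    for line in lines:
        match = MCHECK_RE.search(line)
        if match:
            _vector, pc, code = match.groups()
            counts[(pc, code)] += 1
    return counts


# Small demonstration on the one header line posted in this thread.
sample = [
    "CIA machine check: vector=0x630 pc=0xfffffc00004913e8 code=0x86",
    "machine check type: correctable ECC error (retryable)",
]
for (pc, code), n in tally_machine_checks(sample).most_common():
    print(n, pc, code)
```

    For a real log, the same function can be fed an open file, for
    example tally_machine_checks(open("/var/log/kern.log")).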

    I tried to trigger the error by writing and reading at the
    beginning of the second bank in SRM:

    >>>deposit -q -physical -n f 10000000 0123456789abcdef
    >>>examine -q -physical -n f 10000000


    I didn't get any unusual messages from this. Does SRM report
    recoverable ECC errors at all?

    If I specify mem=256M to Linux, can I still access the
    second bank via /dev/mem?

  8. Re: ECC errors: is my RAM bad?

    Kalle Olavi Niemitalo writes:

    > mru@kth.se (Måns Rullgård) writes:
    >
    >> Kalle Olavi Niemitalo writes:
    >>> What is the failing address?
    >>
    >> Impossible to tell from that information.
    >
    > There were no other messages in /var/log/kern.log nor on the
    > console, between the messages that I already posted.
    >
    > How can I tell Linux to display the address?
    > Or doesn't Linux even know it?


    I don't know. Try poking around in arch/alpha/kernel/irq_alpha.c and
    core_YOURCHIPSET.c.

    > I would be interested in knowing whether the fault is at a
    > specific address or spans the entire bank. I fear it's the
    > latter, because I get so many of those ECC messages.


    It could be that one data pin is broken somehow. That would make the
    whole module useless. I once had a 32 MB module that had an error at
    just one place. The first 8 MB were fine, IIRC.
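
    To illustrate the difference, here is a toy Python simulation, not a
    model of the real hardware: FaultyMemory and failing_words are
    made-up names, and the 64-bit word size is just an assumption for the
    example. A data line stuck at 0 corrupts every word that should have
    that bit set, while a single bad cell affects only one word.

```python
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1


class FaultyMemory:
    """Toy model of a memory module with optional injected faults."""

    def __init__(self, n_words, stuck_low_bit=None, bad_cell=None):
        self.words = [0] * n_words
        self.stuck_low_bit = stuck_low_bit  # data line stuck at 0 for all words
        self.bad_cell = bad_cell            # (word index, bit) stuck at 0

    def write(self, index, value):
        self.words[index] = value & WORD_MASK

    def read(self, index):
        value = self.words[index]
        if self.stuck_low_bit is not None:
            value &= ~(1 << self.stuck_low_bit)
        if self.bad_cell is not None and self.bad_cell[0] == index:
            value &= ~(1 << self.bad_cell[1])
        return value & WORD_MASK


def failing_words(mem, pattern=WORD_MASK):
    """Write pattern to every word, read it back, return mismatching indices."""
    n = len(mem.words)
    for i in range(n):
        mem.write(i, pattern)
    return [i for i in range(n) if mem.read(i) != pattern]


broken_pin = FaultyMemory(1024, stuck_low_bit=5)
single_cell = FaultyMemory(1024, bad_cell=(700, 5))
print(len(failing_words(broken_pin)))  # 1024: every word fails
print(failing_words(single_cell))      # [700]: only one word fails
```

    With ECC, a single stuck bit per word would normally be corrected on
    every read, which would fit a flood of correctable errors.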

    > I tried to trigger the error by writing and reading at the
    > beginning of the second bank in SRM:
    >
    > >>>deposit -q -physical -n f 10000000 0123456789abcdef
    > >>>examine -q -physical -n f 10000000
    >
    > I didn't get any unusual messages from this. Does SRM report
    > recoverable ECC errors at all?


    You could try playing around with memexer, I think that's what it's
    called. It's supposed to loop over a region of memory reading and
    writing until it's stopped. I've never managed to get any useful
    information from it, though.

    > If I specify mem=256M to Linux, can I still access the
    > second bank via /dev/mem?


    I wouldn't think so.

    --
    Måns Rullgård
    mru@kth.se
