Re: Opteron Rev E has a bug ... a locked instruction doesn't act as a read-acquire barrier (confirmed) - Kernel




  1. Re: Opteron Rev E has a bug ... a locked instruction doesn't act as a read-acquire barrier (confirmed)

    On Wednesday 06 August 2008, Wahlig, Elsie wrote:
    > Your issue may be one that has been seen on 1st generation
    > AMD Opteron processors with cpuid family 0Fh, cpuid models
    > < 40h with the code sequence that performs a read-modify-write
    > operation after acquiring a semaphore.


    Matches my hardware

    cpu family : 15
    model : 33

    >
    > The memory read ordering between a semaphore operation and a
    > subsequent read-modify-write instruction (an instruction which
    > uses the same memory location as both a source and destination)
    > may allow the read-modify-write instruction to operate on the
    > memory location ahead of the completion of the semaphore
    > operation and an erratum may occur.


    I wonder why there was no official errata about this?


    > If you think your software is encountering this code sequence,
    > a work-around should be implemented by adding an LFENCE
    > instruction right after the semaphore, after a cpuid check.
    > The workarounds applied to OpenSolaris at
    > http://mail.opensolaris.org/pipermai...ober/009080.html
    > and to the Google performance tools at
    > http://google-perftools.googlecode.c...nk/src/base/atomicops-internals-x86.cc
    > are suitable examples.
    > A list of the model numbers this issue may occur on is at
    > http://products.amd.com/en-us/downlo...Generation_Reference_101607.pdf.


    It would be better to fix the bug at the kernel level, if that is possible.
    It just needs someone with the knowledge to do it. Anyone interested?

    > Mikael Pettersson writes:
    > ... snip ...
    >
    > > I investigated the Solaris track, but I've found no detailed
    > > explanation of the alleged bug. I've asked the Sun engineer
    > > who committed the fix for an explanation, but so far there's
    > > been no reply.
    > >
    > > Anyway, here's what I've found out.
    > >
    > > It's Solaris bug # 6323525.
    > >
    > > They call it "Mutex primitives don't work as expected."
    > >
    > > if (number_of_cores() < 2) then don't have bug
    > > if (family == 0xf && Model < 0x40) then have bug
    > > if (rdmsr(MSR_BU_CFG/*0xC0011023*/) & 2) then bug is masked
    > >
    > > lock: // mutex_lock, spin_lock, etc
    > > ...
    > > lock; cmpxchg ..
    > > jnz fail
    > > ret; nop; nop; nop // patched to "lfence; ret" if bug
    > >
    > > The workaround is to place a fencing instruction (lfence) between
    > > the mutex operation and the subsequent read-modify-write instruction.
    > > (This provides the necessary load memory barrier.)
    > >
    > > There's no change to the unlock code.
    > >
    > > Anyone know who to contact @ AMD about confirming or denying this?
    > >
    > > /Mikael
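
The detection rules quoted above can be sketched in C roughly as follows. This is only an illustration under stated assumptions: `struct cpu_sig`, `decode_cpuid_eax()` and `needs_lfence_workaround()` are hypothetical names, the caller is assumed to have already checked that the vendor is AMD, and the core count and the MSR_BU_CFG value have to be supplied from elsewhere, since reading an MSR needs kernel privileges.

```c
#include <stdbool.h>
#include <stdint.h>

/* Family/model decoded from CPUID leaf 1 (illustrative type). */
struct cpu_sig {
    unsigned family;
    unsigned model;
};

/* Decode CPUID.1:EAX. On AMD the extended family/model fields are
 * only folded in when the base family field is 0xF. */
static struct cpu_sig decode_cpuid_eax(uint32_t eax)
{
    unsigned base_family = (eax >> 8) & 0xf;
    unsigned base_model  = (eax >> 4) & 0xf;
    unsigned ext_family  = (eax >> 20) & 0xff;
    unsigned ext_model   = (eax >> 16) & 0xf;
    struct cpu_sig sig = { base_family, base_model };

    if (base_family == 0xf) {
        sig.family += ext_family;
        sig.model |= ext_model << 4;
    }
    return sig;
}

/* Bit tested in the quoted pseudo-code: (rdmsr(MSR_BU_CFG) & 2). */
#define BU_CFG_BUG_MASKED 0x2ull

/* Mirrors the three quoted rules; assumes an AMD CPU. */
static bool needs_lfence_workaround(struct cpu_sig sig, unsigned cores,
                                    uint64_t bu_cfg)
{
    if (cores < 2)
        return false;                 /* single core: no bug */
    if (sig.family != 0xf || sig.model >= 0x40)
        return false;                 /* not family 0Fh, model < 40h */
    if (bu_cfg & BU_CFG_BUG_MASKED)
        return false;                 /* bug is masked */
    return true;
}
```

For the hardware reported in this message (cpu family 15, model 33, i.e. model 21h), the predicate is true on a multi-core system unless the MSR bit masks the bug.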




    --
    Arkadiusz Miśkiewicz PLD/Linux Team
    arekm / maven.pl http://ftp.pld-linux.org/
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: Opteron Rev E has a bug ... a locked instruction doesn't act as a read-acquire barrier (confirmed)

    On Wed, 6 Aug 2008 19:13:34 +0200, Arkadiusz Miskiewicz wrote:
    >On Wednesday 06 August 2008, Wahlig, Elsie wrote:
    >> Your issue may be one that has been seen on 1st generation
    >> AMD Opteron processors with cpuid family 0Fh, cpuid models
    >> < 40h with the code sequence that performs a read-modify-write
    >> operation after acquiring a semaphore.

    >
    >Matches my hardware
    >
    >cpu family : 15
    >model : 33
    >
    >>
    >> The memory read ordering between a semaphore operation and a
    >> subsequent read-modify-write instruction (an instruction which
    >> uses the same memory location as both a source and destination)
    >> may allow the read-modify-write instruction to operate on the
    >> memory location ahead of the completion of the semaphore
    >> operation and an erratum may occur.


    Thanks for the detailed erratum description.

    >I wonder why there was no official errata about this?


    Indeed.

    >> If you think your software is encountering this code sequence,
    >> a work-around should be implemented by adding an LFENCE
    >> instruction right after the semaphore, after a cpuid check.
    >> The workarounds applied to OpenSolaris at
    >> http://mail.opensolaris.org/pipermai...ober/009080.html
    >> and to the Google performance tools at
    >> http://google-perftools.googlecode.c...nk/src/base/atomicops-internals-x86.cc
    >> are suitable examples.
    >> A list of the model numbers this issue may occur on is at
    >> http://products.amd.com/en-us/downlo...Generation_Reference_101607.pdf.

    >
    >It would be better to fix the bug at the kernel level, if that is possible.
    >It just needs someone with the knowledge to do it. Anyone interested?


    In principle it's easy. We append a 3-byte nop to the lock-taking
    instructions. We invent an AMD_MUTEX_BUG synthetic cpuid feature
    bit and add boot-time code to detect it. We use the alternatives()
    infrastructure to replace that nop with lfence at boot-time if
    AMD_MUTEX_BUG is present.

    I think the hardest part is locating all lock-taking code sequences.

    Also I think I'll start by writing a user-space test program that
    does a stress-test of the plain lock;rmw;unlock sequence to see if
    it can break it. (Locks/mutexes are also used in user-space.)
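
A user-space stress test along these lines might look like the sketch below. It is an assumption-laden illustration, not the test program mentioned above: all names are hypothetical, the `use_lfence` flag stands in for the nop that the kernel would patch at boot time, and the lock is a bare compare-and-swap spinlock built on C11 atomics.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_THREADS 16

static atomic_int lock_word;   /* 0 = free, 1 = held */
static long counter;           /* only touched while lock_word is held */
static bool use_lfence;        /* hypothetical stand-in for the patched nop */

static void lock_acquire(void)
{
    int expected = 0;

    /* On x86 this compiles to a "lock cmpxchg" loop. */
    while (!atomic_compare_exchange_weak_explicit(&lock_word, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
        expected = 0;
#if defined(__x86_64__)
    if (use_lfence)            /* the proposed workaround */
        __asm__ __volatile__("lfence" ::: "memory");
#endif
}

static void lock_release(void)
{
    atomic_store_explicit(&lock_word, 0, memory_order_release);
}

static void *worker(void *arg)
{
    long iterations = *(const long *)arg;

    for (long i = 0; i < iterations; i++) {
        lock_acquire();
        counter++;             /* the read-modify-write under the lock */
        lock_release();
    }
    return NULL;
}

/* Runs the lock;rmw;unlock loop on several threads and returns the
 * number of lost updates (0 if the lock behaved as a barrier). */
static long run_stress(int nthreads, long iterations, bool with_lfence)
{
    pthread_t tid[MAX_THREADS];
    int started = 0;

    if (nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;
    counter = 0;
    use_lfence = with_lfence;

    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tid[started], NULL, worker, &iterations) == 0)
            started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    return (long)started * iterations - counter;
}
```

Calling `run_stress(4, 10000000, false)` and then `run_stress(4, 10000000, true)` on an affected machine would be one way to compare the two variants; a non-zero return value indicates lost increments. Whether this simple loop is enough to trigger the erratum is not known from the discussion here.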

    /Mikael

  3. RE: Opteron Rev E has a bug ... a locked instruction doesn't act as a read-acquire barrier (confirmed)



    Mikael Pettersson writes:
    >
    > On Wed, 6 Aug 2008 19:13:34 +0200, Arkadiusz Miskiewicz wrote:
    > >On Wednesday 06 August 2008, Wahlig, Elsie wrote:
    > >> Your issue may be one that has been seen on 1st generation AMD
    > >> Opteron processors with cpuid family 0Fh, cpuid models < 40h with
    > >> the code sequence that performs a read-modify-write operation after
    > >> acquiring a semaphore.
    > >
    > >Matches my hardware
    > >
    > >cpu family : 15
    > >model : 33
    > >
    > >>
    > >> The memory read ordering between a semaphore operation and a
    > >> subsequent read-modify-write instruction (an instruction which uses
    > >> the same memory location as both a source and destination) may allow
    > >> the read-modify-write instruction to operate on the memory location
    > >> ahead of the completion of the semaphore operation and an erratum may
    > >> occur.

    >
    > Thanks for the detailed erratum description.
    >
    > >I wonder why there was no official errata about this?

    >
    > Indeed.


    I don't know but I will see about getting it in there.

    Elsie

    >
    > >> If you think your software is encountering this code sequence, a
    > >> work-around should be implemented by adding an LFENCE instruction
    > >> right after the semaphore, after a cpuid check.
    > >> The workarounds applied to OpenSolaris at
    > >> http://mail.opensolaris.org/pipermai...October/009080.html
    > >> and to the Google performance tools at
    > >> http://google-perftools.googlecode.c...trunk/src/base/atomicops-internals-x86.cc
    > >> are suitable examples.
    > >> A list of the model numbers this issue may occur on is at
    > >> http://products.amd.com/en-us/downlo...st_Generation_Reference_101607.pdf.
    > >
    > >It would be better to fix the bug at the kernel level, if that is possible.
    > >It just needs someone with the knowledge to do it. Anyone interested?
    >
    > In principle it's easy. We append a 3-byte nop to the
    > lock-taking instructions. We invent an AMD_MUTEX_BUG
    > synthetic cpuid feature bit and add boot-time code to detect
    > it. We use the alternatives() infrastructure to replace that
    > nop with lfence at boot-time if AMD_MUTEX_BUG is present.
    >
    > I think the hardest part is locating all lock-taking code sequences.
    >
    > Also I think I'll start by writing a user-space test program
    > that does a stress-test of the plain lock;rmw;unlock sequence
    > to see if it can break it. (Locks/mutexes are also used in
    > user-space.)
    >
    > /Mikael
    >
    >



  4. Re: Opteron Rev E has a bug ... a locked instruction doesn't act as a read-acquire barrier (confirmed)

    On Wednesday 06 August 2008, Mikael Pettersson wrote:
    > On Wed, 6 Aug 2008 19:13:34 +0200, Arkadiusz Miskiewicz wrote:
    > >On Wednesday 06 August 2008, Wahlig, Elsie wrote:
    > >> Your issue may be one that has been seen on 1st generation
    > >> AMD Opteron processors with cpuid family 0Fh, cpuid models
    > >> < 40h with the code sequence that performs a read-modify-write
    > >> operation after acquiring a semaphore.

    [...]
    > Also I think I'll start by writing a user-space test program that
    > does a stress-test of the plain lock;rmw;unlock sequence to see if
    > it can break it. (Locks/mutexes are also used in user-space.)


    I've filed a bug report so that hopefully this won't get lost:

    http://bugzilla.kernel.org/show_bug.cgi?id=11305

    > /Mikael


    --
    Arkadiusz Miśkiewicz PLD/Linux Team
    arekm / maven.pl http://ftp.pld-linux.org/
