really strange memmove problem - Linux



  1. really strange memmove problem

    hello,
    I've got a really strange issue on a multithreaded program running
    under 2.6 smp kernel on two different linux boxes; the problem has
    shown up during tests running under valgrind

    machine 1 details:

    4 logical cpus (2 dual core cpus)
    model: Dual Core AMD Opteron Processor 275,
    2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:54:53 EST 2006 athlon GNU/Linux

    machine 2 details:

    8 logical cpus (2 dual core ht cpus)
    cpu model: Intel(R) Xeon(TM) CPU 3.20GHz
    2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 GNU/Linux

    the program erratically crashes while doing a memmove; stack analysis
    shows the memmove has been called with overlapping sources and
    destination - the following diagram shows the pattern:

    [xxxxxxxx....] -> [....xxxxxxxx]

    the implementation of memmove linked-in loads esi and edi with the end
    of data (shown as x's) and end of buffer respectively, then it does the
    copy backwards; at the heart of memmove is the following statement:

    0xa9c15d : repz movsl %ds:(%esi),%es:(%edi)

    to work properly this needs the direction flag to be set so that
    addresses are decremented after they're dereferenced; well, it happens
    that sometimes the program segfaults due to accessing addresses after
    the end of the buffer, I look at eflags register and it seems as if
    someone changed (unset) the direction flag during the copy!
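
For illustration, the role of the direction flag can be modelled in plain C. This is only a sketch: `rep_movs_model` is a hypothetical helper, not the libc code, and it uses array indices with a bounds check in place of the real %esi/%edi pointers. With the flag set, a copy that starts at the end of the data walks downwards and stays inside the buffer; with the flag clear, the same starting indices walk upwards and run past the end, which matches the faulting addresses described above.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of `rep movsl` (not the libc implementation).
 * Copies `count` 32-bit words inside `buf`, starting at index `src`
 * and writing to index `dst`. After each word both indices step
 * down by one if `df` is non-zero (direction flag set) or up by one
 * if it is zero (direction flag clear).
 *
 * Returns 0 on success, or -1 as soon as an index leaves [0, cap).
 * Real hardware has no such check and would simply touch memory
 * outside the buffer at that point.
 */
static int rep_movs_model(uint32_t *buf, size_t cap,
                          size_t dst, size_t src,
                          size_t count, int df)
{
    while (count--) {
        if (dst >= cap || src >= cap)
            return -1;          /* would access beyond the buffer */
        buf[dst] = buf[src];
        if (df) {
            dst--;              /* flag set: walk towards index 0 */
            src--;
        } else {
            dst++;              /* flag clear: walk towards the end */
            src++;
        }
    }
    return 0;
}
```

Starting from a 12-word buffer holding 1..8 followed by four unused words, `rep_movs_model(buf, 12, 11, 7, 8, 1)` shifts the data up by four words, while the same call with the last argument 0 stops with -1 after the first word.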

    how can this be? I kept trying on a 2.4 box and hadn't the chance to
    make the program crash; but I also noticed that on that box the memmove
    linked-in is implemented differently (no direction flag involved):

    0x4009d334 : mov 0xffffffe0(%edi),%eax
    0x4009d337 : sub $0x20,%ecx
    0x4009d33a : mov 0xfffffffc(%esi),%eax
    0x4009d33d : mov 0xfffffff8(%esi),%edx
    0x4009d340 : mov %eax,0xfffffffc(%edi)
    0x4009d343 : mov %edx,0xfffffff8(%edi)
    0x4009d346 : mov 0xfffffff4(%esi),%eax
    0x4009d349 : mov 0xfffffff0(%esi),%edx
    0x4009d34c : mov %eax,0xfffffff4(%edi)
    0x4009d34f : mov %edx,0xfffffff0(%edi)
    0x4009d352 : mov 0xffffffec(%esi),%eax
    0x4009d355 : mov 0xffffffe8(%esi),%edx
    0x4009d358 : mov %eax,0xffffffec(%edi)
    0x4009d35b : mov %edx,0xffffffe8(%edi)
    0x4009d35e : mov 0xffffffe4(%esi),%eax
    0x4009d361 : mov 0xffffffe0(%esi),%edx
    0x4009d364 : mov %eax,0xffffffe4(%edi)
    0x4009d367 : mov %edx,0xffffffe0(%edi)
    0x4009d36a : lea 0xffffffe0(%esi),%esi
    0x4009d36d : lea 0xffffffe0(%edi),%edi
    0x4009d370 : jns 0x4009d334
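
A flag-free backward copy can be written in C along the same lines as the listing above. This is a byte-at-a-time sketch with a hypothetical name, `copy_backward`; the listing moves 32 bytes per iteration with explicit negative offsets, but the idea is the same: the addresses are decremented by ordinary arithmetic, so the source never relies on the direction flag.

```c
#include <stddef.h>

/*
 * Hypothetical sketch, not the libc routine: copy n bytes from src
 * to dst starting with the highest address. This order is the safe
 * one when dst lies above src and the two ranges overlap.
 */
static void *copy_backward(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst + n;             /* one past the last destination byte */
    const unsigned char *s = (const unsigned char *)src + n; /* one past the last source byte */

    while (n--)
        *--d = *--s;    /* step both pointers down, then copy one byte */

    return dst;
}
```

With a 12-byte buffer holding "abcdefgh....", `copy_backward(buf + 4, buf, 8)` leaves "abcdabcdefgh" in the buffer.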

    thanks for any insight!

    P.


  2. Re: really strange memmove problem

    Hi!

    > I've got a really strange issue on a multithreaded program running
    > under 2.6 smp kernel on two different linux boxes; the problem has
    > shown up during tests running under valgrind


    Does the problem only show up under valgrind?

    > the program erratically crashes while doing a memmove; stack analysis
    > shows the memmove has been called with overlapping sources and
    > destination - the following diagram shows the pattern:
    >
    > [xxxxxxxx....] -> [....xxxxxxxx]
    >
    > the implementation of memmove linked-in loads esi and edi with the end
    > of data (shown as x's) and end of buffer respectively, then it does the
    > copy backwards; at the heart of memmove is the following statement:
    >
    > 0xa9c15d : repz movsl %ds:(%esi),%es:(%edi)
    >
    > to work properly this needs the direction flag to be set so that
    > addresses are decremented after they're dereferenced; well, it happens
    > that sometimes the program segfaults due to accessing addresses after
    > the end of the buffer, I look at eflags register and it seems as if
    > someone changed (unset) the direction flag during the copy!
    >
    > how can this be? I kept trying on a 2.4 box and hadn't the chance to
    > make the program crash; but I also noticed that on that box the memmove
    > linked-in is implemented differently (no direction flag involved):


    Hmmm... If you obtain SIGSEGV, this means that you are accessing an
    address during the memmove() which is outside the process space. Since
    your process is multithreaded, this could be the result of a nasty race
    condition.

    I would serialize all the memmove operations using a big lock to see if
    the problem still persists or disappears. This could be done as
    follows:

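    #include <pthread.h>
    #include <string.h>

    static pthread_mutex_t memmoveLock = PTHREAD_MUTEX_INITIALIZER;

    /* Debugging aid: serialize every memmove() behind one global lock. */
    void *
    my_memmove(void *dst, const void *src, size_t n)
    {
        void *p;

        pthread_mutex_lock(&memmoveLock);
        p = memmove(dst, src, n);
        pthread_mutex_unlock(&memmoveLock);
        return p;
    }

    /* Must come after my_memmove() so the call above reaches the real memmove(). */
    #define memmove(dst,src,n) my_memmove(dst,src,n)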

    Another possibility would be an issue with valgrind. But you should
    consider this option as unlikely.

    If possible, post a minimalistic code that compiles and shows your
    problem.

    HTH,
    Loic.


  3. Re: really strange memmove problem


    spamtrap@crayne.org wrote:

    > Hi!
    >
    > > I've got a really strange issue on a multithreaded program running
    > > under 2.6 smp kernel on two different linux boxes; the problem has
    > > shown up during tests running under valgrind

    >
    > Does the problem only show up under valgrind?
    >


    yes

    > > the program erratically crashes while doing a memmove; stack analysis
    > > shows the memmove has been called with overlapping sources and
    > > destination - the following diagram shows the pattern:
    > >
    > > [xxxxxxxx....] -> [....xxxxxxxx]
    > >
    > > the implementation of memmove linked-in loads esi and edi with the end
    > > of data (shown as x's) and end of buffer respectively, then it does the
    > > copy backwards; at the heart of memmove is the following statement:
    > >
    > > 0xa9c15d : repz movsl %ds:(%esi),%es:(%edi)
    > >
    > > to work properly this needs the direction flag to be set so that
    > > addresses are decremented after they're dereferenced; well, it happens
    > > that sometimes the program segfaults due to accessing addresses after
    > > the end of the buffer, I look at eflags register and it seems as if
    > > someone changed (unset) the direction flag during the copy!
    > >
    > > how can this be? I kept trying on a 2.4 box and hadn't the chance to
    > > make the program crash; but I also noticed that on that box the memmove
    > > linked-in is implemented differently (no direction flag involved):

    >


    > Hmmm... If you obtain SIGSEGV, this means that you are accessing an
    > address during the memmove() which is outside the process space. Since
    > your process is multithreaded, this could be the result of a nasty race
    > condition.
    >


    only one thread accesses that buffer; moreover, crash dump analysis has
    shown that memmove has been entered with proper arguments - memmove
    itself is a couple screenful of assembly code and I've verified it
    holds all of its variables in registers - I think these are saved in a
    system stack during context switch, so that nothing in the program has
    ever the chance to interfere once memmove has been called (ok, another
    thread could release the memory, but that's not the case)

    >
    > Another possibility would be an issue with valgrind. But you should
    > consider this option as unlikely.
    >


    yet if the analysis above is correct, that could explain how something
    meddled with the registers during the copy

    > If possible, post a minimalistic code that compiles and shows your
    > problem.


    unluckily we weren't able to reproduce the problem outside the affected
    program

    > HTH,


    thanks,

    P.


  4. Re: really strange memmove problem

    Hello,

    > > Does the problem only show up under valgrind?
    > >

    >
    > yes




    > only one thread accesses that buffer; moreover, crash dump analysis has
    > shown that memmove has been entered with proper arguments - memmove
    > itself is a couple screenful of assembly code and I've verified it
    > holds all of its variables in registers - I think these are saved in a
    > system stack during context switch, so that nothing in the program has
    > ever the chance to interfere once memmove has been called (ok, another
    > thread could release the memory, but that's not the case)


    I don't know if that's doable, but you could gain more confidence that
    Valgrind is at fault by checking the validity of every address
    used in the memmove():

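    #include <errno.h>
    #include <string.h>
    #include <unistd.h>

    /* access() fails with EFAULT when it cannot read the path at p;
       any other outcome means the memory at p was readable. */
    #define valid_ptr(p) ((access((char*)p, 0) == -1) ? (errno != EFAULT) : 1)

    void*
    my_memmove(void *dst, const void* src, size_t n)
    {
        const char *s = src;

        for (size_t i = 0; i < n; ++i) {
            if ( !valid_ptr(s + i) ) {
                // oops...
                // address outside process space
            }
        }
        return memmove(dst, src, n);
    }

    #define memmove(dst,src,n) my_memmove(dst,src,n)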

    Otherwise, you may want to raise your issue on the valgrind mailing
    list.

    HTH,
    Loic.


  5. Re: really strange memmove problem


    wrote in message
    news:1161763231.185115.56480@k70g2000cwa.googlegroups.com...
    > Hello,
    >
    > > > Does the problem only show up under valgrind?
    > > >

    > >
    > > yes

    >
    >
    >
    > > only one thread accesses that buffer; moreover, crash dump analysis has
    > > shown that memmove has been entered with proper arguments - memmove
    > > itself is a couple screenful of assembly code and I've verified it
    > > holds all of its variables in registers - I think these are saved in a
    > > system stack during context switch, so that nothing in the program has
    > > ever the chance to interfere once memmove has been called (ok, another
    > > thread could release the memory, but that's not the case)

    >
    > I don't know if that's doable, but you could gain more confidence that
    > Valgrind is at fault by checking the validity of every address
    > used in the memmove():
    >
    > #define valid_ptr(p) ((access((char*)p, 0) == -1) ? (errno != EFAULT)
    > : 1)
    >


    access() and EFAULT?

    The Open Group's IEEE 1003.1 (POSIX.1) specification for access() doesn't
    list EFAULT:

    http://www.opengroup.org/onlinepubs/...ns/access.html
    http://www.opengroup.org/onlinepubs/009695399/toc.htm

    I do see that HP-UX and SunOS may return EFAULT for access(). I may have
    missed it, but from my search of the Glibc v2.4 sources, access() doesn't
    seem to return EFAULT (or much of anything, for that matter...). One file,
    errlist.c, says that for GNU, EFAULT is never generated; a signal is
    generated instead, perhaps SIGSEGV or SIGBUS. Is GNU/Linux supposed to
    return EFAULT (or does it through some indirect method)?


    Rod Pemberton


  6. Re: really strange memmove problem

    Rod Pemberton wrote:

    .....
    >>#define valid_ptr(p) ((access((char*)p, 0) == -1) ? (errno != EFAULT)
    >>: 1)
    >>

    >
    >
    > access() and EFAULT?
    >
    > The Open Group's IEEE 1003.1 (POSIX.1) specification for access() doesn't
    > list EFAULT:
    >
    > http://www.opengroup.org/onlinepubs/...ns/access.html
    > http://www.opengroup.org/onlinepubs/009695399/toc.htm
    >
    > I do see that HP-UX and SunOS may return EFAULT for access(). I may have
    > missed it, but my search of Glibc v2.4 doesn't seem to return EFAULT for
    > access() (or much of anything for that matter...). One file, errlist.c,
    > says that for GNU, EFAULT is never generated, a signal is generated instead,
    > perhaps SIGSEGV or SIGBUS. Is GNU/Linux supposed to (or does it through
    > some indirect method) return EFAULT?


    Man page says it does (not verified). I see... it's a trick! Access
    isn't expected to succeed - any errno besides EFAULT means that we can
    access memory at "pathname"... but of course there's no filename there
    to check access to... Neat trick!
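
The trick can be tried out in a small C program. This is a sketch under assumptions: it targets Linux, where the access(2) man page lists EFAULT, and `probe_readable` and `probe_demo` are hypothetical names. One caveat: the kernel reads the argument as a NUL-terminated path, so the probe tests whether a string starting at the address can be read, not just a single byte. A readable byte that sits right before an unmapped page, with no NUL in between, would still be reported as bad.

```c
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS with strict -std= modes */
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 0 if access() rejected the pointer with
 * EFAULT, 1 otherwise. The call is expected to fail in both cases,
 * since no real file name is stored at p; only the errno value matters.
 */
static int probe_readable(const void *p)
{
    if (access((const char *)p, F_OK) == -1 && errno == EFAULT)
        return 0;
    return 1;
}

/*
 * Probes one anonymous page while it is mapped and again after it has
 * been unmapped. Returns (before << 1) | after, or -1 if mmap() fails.
 * Assumes a single-threaded caller, so nothing else can reuse the
 * address between munmap() and the second probe.
 */
static int probe_demo(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    int before = probe_readable(p);   /* zero-filled page reads as an empty path */
    munmap(p, page);
    int after = probe_readable(p);    /* address is no longer mapped */

    return (before << 1) | after;
}
```

On a Linux system that behaves as the man page describes, the first probe should report the page as readable and the second should not.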

    ..... could this be used to guard against the "really bad bug" that man 3
    malloc mentions?

    Best,
    Frank


  7. Re: really strange memmove problem

    Frank Kotler wrote:
    > Rod Pemberton wrote:
    >
    > ....
    >
    >>> #define valid_ptr(p) ((access((char*)p, 0) == -1) ? (errno != EFAULT)
    >>> : 1)
    >>>

    >>
    >>
    >> access() and EFAULT?
    >>
    >> The Open Group's IEEE 1003.1 (POSIX.1) specification for access() doesn't
    >> list EFAULT:
    >>
    >> http://www.opengroup.org/onlinepubs/...ns/access.html
    >> http://www.opengroup.org/onlinepubs/009695399/toc.htm
    >>
    >> I do see that HP-UX and SunOS may return EFAULT for access(). I may have
    >> missed it, but my search of Glibc v2.4 doesn't seem to return EFAULT for
    >> access() (or much of anything for that matter...). One file, errlist.c,
    >> says that for GNU, EFAULT is never generated, a signal is generated
    >> instead,
    >> perhaps SIGSEGV or SIGBUS. Is GNU/Linux supposed to (or does it through
    >> some indirect method) return EFAULT?

    >
    > Man page says it does (not verified).


    Well... I verified it, I guess. I had a segfaulting program at hand that
    I thought would make a "testbed". Results confused me, so I pared it
    down to this simple test. As expected, it segfaults in short order. In
    an xterm, same thing, as expected. Uncomment the lines enabling the
    memory check, and it exits cleanly - fairly promptly. In an xterm,
    however, it runs for... a long time. Still running...

    Can anyone tell me what's going on here? TIA.

    Best,
    Frank

    ; nasm -f elf myprog.asm
    ; ld -o myprog myprog.o

    global _start

    section .bss
    buf     resb 100

    section .text
    _start:
            mov esi, buf
    .top:
            ; call is_memory_valid
            ; test eax, eax
            ; js exit

            lodsb                   ; read the byte at [esi], then step esi
            jmp .top

    exit:
            mov eax, 1              ; __NR_exit
            int 80h
    ;-----------------------------

    ;-----------------------------
    is_memory_valid:
            push ebx
            push ecx
            mov eax, 33             ; __NR_access
            mov ebx, esi            ; pathname = address under test
            xor ecx, ecx            ; F_OK
            int 80h
            cmp eax, -14            ; -EFAULT
            jnz .good
            or eax, byte -1         ; unreadable: return -1
            jmp short .done
    .good:
            xor eax, eax            ; readable: return 0
    .done:
            pop ecx
            pop ebx
            ret
    ;----------------------------

