really strange memmove problem - Linux
-
really strange memmove problem
hello,
I've got a really strange issue on a multithreaded program running
under 2.6 smp kernel on two different linux boxes; the problem has
shown up during tests running under valgrind
machine 1 details:
4 logical cpus (2 dual core cpus)
model: Dual Core AMD Opteron Processor 275,
2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:54:53 EST 2006 athlon GNU/Linux
machine 2 details:
8 logical cpus (2 dual core ht cpus)
cpu model: Intel(R) Xeon(TM) CPU 3.20GHz
2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 GNU/Linux
the program erratically crashes while doing a memmove; stack analysis
shows the memmove has been called with overlapping sources and
destination - the following diagram shows the pattern:
[xxxxxxxx....] -> [....xxxxxxxx]
the implementation of memmove linked-in loads esi and edi with the end
of data (shown as x's) and end of buffer respectively, then it does the
copy backwards; at the heart of memmove is the following statement:
0xa9c15d : repz movsl %ds:(%esi),%es:(%edi)
to work properly this needs the direction flag to be set so that
addresses are decremented after they're dereferenced; well, it happens
that sometimes the program segfaults due to accessing addresses after
the end of the buffer, I look at eflags register and it seems as if
someone changed (unset) the direction flag during the copy!
how can this be? I kept trying on a 2.4 box and was never able to
make the program crash there; but I also noticed that on that box the
linked-in memmove is implemented differently (no direction flag involved):
0x4009d334 : mov 0xffffffe0(%edi),%eax
0x4009d337 : sub $0x20,%ecx
0x4009d33a : mov 0xfffffffc(%esi),%eax
0x4009d33d : mov 0xfffffff8(%esi),%edx
0x4009d340 : mov %eax,0xfffffffc(%edi)
0x4009d343 : mov %edx,0xfffffff8(%edi)
0x4009d346 : mov 0xfffffff4(%esi),%eax
0x4009d349 : mov 0xfffffff0(%esi),%edx
0x4009d34c : mov %eax,0xfffffff4(%edi)
0x4009d34f : mov %edx,0xfffffff0(%edi)
0x4009d352 : mov 0xffffffec(%esi),%eax
0x4009d355 : mov 0xffffffe8(%esi),%edx
0x4009d358 : mov %eax,0xffffffec(%edi)
0x4009d35b : mov %edx,0xffffffe8(%edi)
0x4009d35e : mov 0xffffffe4(%esi),%eax
0x4009d361 : mov 0xffffffe0(%esi),%edx
0x4009d364 : mov %eax,0xffffffe4(%edi)
0x4009d367 : mov %edx,0xffffffe0(%edi)
0x4009d36a : lea 0xffffffe0(%esi),%esi
0x4009d36d : lea 0xffffffe0(%edi),%edi
0x4009d370 : jns 0x4009d334
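As a rough illustration of the two copy directions discussed above, here is a minimal C sketch. It is not glibc's code: `sketch_memmove` is a made-up name, and the byte loops only stand in for what `rep movs` does with the direction flag clear (forward) or set (backward). For the [xxxxxxxx....] -> [....xxxxxxxx] pattern the backward branch is the one taken, which is why a direction flag flipped mid-copy would send the pointers past the end of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration only, not the glibc implementation.
 * Copies n bytes from src to dst and picks the direction so that
 * overlapping regions are handled correctly. */
void *sketch_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    uintptr_t da = (uintptr_t)dst;
    uintptr_t sa = (uintptr_t)src;

    assert(dst != NULL && src != NULL);

    if (da <= sa || da - sa >= n) {
        /* dst is below src, or far enough above it: copy forward,
         * first byte first (the DF-clear case). */
        for (size_t i = 0; i < n; ++i)
            d[i] = s[i];
    } else {
        /* dst overlaps the tail of src: copy backward, last byte first
         * (the DF-set case), so no source byte is overwritten before
         * it has been read. */
        for (size_t i = n; i > 0; --i)
            d[i - 1] = s[i - 1];
    }
    return dst;
}
```

Calling `sketch_memmove(buf + 4, buf, 8)` on a 12-byte buffer reproduces the pattern in the diagram and exercises the backward branch.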
thanks for any insight!
P.
-
Re: really strange memmove problem
Hi!
> I've got a really strange issue on a multithreaded program running
> under 2.6 smp kernel on two different linux boxes; the problem has
> shown up during tests running under valgrind
Does the problem only show up under valgrind?
> the program erratically crashes while doing a memmove; stack analysis
> shows the memmove has been called with overlapping sources and
> destination - the following diagram shows the pattern:
>
> [xxxxxxxx....] -> [....xxxxxxxx]
>
> the implementation of memmove linked-in loads esi and edi with the end
> of data (shown as x's) and end of buffer respectively, then it does the
> copy backwards; at the heart of memmove is the following statement:
>
> 0xa9c15d : repz movsl %ds:(%esi),%es:(%edi)
>
> to work properly this needs the direction flag to be set so that
> addresses are decremented after they're dereferenced; well, it happens
> that sometimes the program segfaults due to accessing addresses after
> the end of the buffer, I look at eflags register and it seems as if
> someone changed (unset) the direction flag during the copy!
>
> how can this be? I kept trying on a 2.4 box and hadn't the chance to
> make the program crash; but I also noticed that on that box the memmove
> linked-in is implemented differently (no direction flag involved):
Hmmm... If you obtain SIGSEGV, this means that you are accessing an
address during the memmove() which is outside the process space. Since
your process is multithreaded, this could be the result of a nasty race
condition.
I would serialize all the memmove operations using a big lock to see if
the problem still persists or disappears. This could be done as
follows:
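#include <pthread.h>
#include <string.h>
/* One global lock serializes every memmove() call in the program. */
pthread_mutex_t memmoveLock = PTHREAD_MUTEX_INITIALIZER;
void*
my_memmove(void *dst, const void* src, size_t n)
{
    void* p;
    pthread_mutex_lock(&memmoveLock);
    p = memmove(dst, src, n);
    pthread_mutex_unlock(&memmoveLock);
    return p;
}
/* Defined after my_memmove() so that the wrapper itself calls the real memmove(). */
#define memmove(dst,src,n) my_memmove(dst,src,n)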
Another possibility would be an issue with valgrind. But you should
consider this option as unlikely.
If possible, post a minimal example that compiles and shows your
problem.
HTH,
Loic.
-
Re: really strange memmove problem
spamtrap@crayne.org wrote:
> Hi!
>
> > I've got a really strange issue on a multithreaded program running
> > under 2.6 smp kernel on two different linux boxes; the problem has
> > shown up during tests running under valgrind
>
> Does the problem only shows up under valgrind?
>
yes
> > the program erratically crashes while doing a memmove; stack analysis
> > shows the memmove has been called with overlapping sources and
> > destination - the following diagram shows the pattern:
> >
> > [xxxxxxxx....] -> [....xxxxxxxx]
> >
> > the implementation of memmove linked-in loads esi and edi with the end
> > of data (shown as x's) and end of buffer respectively, then it does the
> > copy backwards; at the heart of memmove is the following statement:
> >
> > 0xa9c15d : repz movsl %ds:(%esi),%es:(%edi)
> >
> > to work properly this needs the direction flag to be set so that
> > addresses are decremented after they're dereferenced; well, it happens
> > that sometimes the program segfaults due to accessing addresses after
> > the end of the buffer, I look at eflags register and it seems as if
> > someone changed (unset) the direction flag during the copy!
> >
> > how can this be? I kept trying on a 2.4 box and hadn't the chance to
> > make the program crash; but I also noticed that on that box the memmove
> > linked-in is implemented differently (no direction flag involved):
>
> Hmmm... If you obtain SIGSEGV, this means that you are accessing an
> address during the memmove() which is outside the process space. Since
> your process is multithreaded, this could be the result of a nasty race
> condition.
>
only one thread accesses that buffer; moreover, crash dump analysis has
shown that memmove has been entered with proper arguments - memmove
itself is a couple screenful of assembly code and I've verified it
holds all of its variables in registers - I think these are saved in a
system stack during context switch, so that nothing in the program has
ever the chance to interfere once memmove has been called (ok, another
thread could release the memory, but that's not the case)
>
> Another possibility would be an issue with valgrind. But you should
> consider this option as unlikely.
>
yet if the analysis above is correct, that could explain how something
meddled with the registers during the copy
> If possible, post a minimalistic code that compiles and shows your
> problem.
unluckily we weren't able to reproduce the problem outside the affected
program
> HTH,
thanks,
P.
-
Re: really strange memmove problem
Hello,
> > Does the problem only shows up under valgrind?
> >
>
> yes
> only one thread accesses that buffer; moreover, crash dump analysis has
> shown that memmove has been entered with proper arguments - memmove
> itself is a couple screenful of assembly code and I've verified it
> holds all of its variables in registers - I think these are saved in a
> system stack during context switch, so that nothing in the program has
> ever the chance to interfere once memmove has been called (ok, another
> thread could release the memory, but that's not the case)
I don't know if that's doable, but you could gain confidence that
Valgrind is at fault by checking the validity of every address
used in the memmove():
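#include <errno.h>
#include <string.h>
#include <unistd.h>
/* access() fails with EFAULT when its "pathname" pointer is not readable. */
#define valid_ptr(p) ((access((const char*)(p), F_OK) == -1) ? (errno != EFAULT) : 1)
void*
my_memmove(void *dst, const void* src, size_t n)
{
    const char* s = src; /* byte pointer: arithmetic on void* is not standard C */
    for (size_t i=0; i<n; ++i) {
        if ( !valid_ptr(s+i) ) {
            // oops...
            // address outside process space
        }
    }
    return memmove(dst, src, n);
}
#define memmove(dst,src,n) my_memmove(dst,src,n)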
Otherwise, you may want to raise your issue on the valgrind mailing
list.
HTH,
Loic.
-
Re: really strange memmove problem
wrote in message
news:1161763231.185115.56480@k70g2000cwa.googlegroups.com...
> Hello,
>
> > > Does the problem only shows up under valgrind?
> > >
> >
> > yes
>
>
>
> > only one thread accesses that buffer; moreover, crash dump analysis has
> > shown that memmove has been entered with proper arguments - memmove
> > itself is a couple screenful of assembly code and I've verified it
> > holds all of its variables in registers - I think these are saved in a
> > system stack during context switch, so that nothing in the program has
> > ever the chance to interfere once memmove has been called (ok, another
> > thread could release the memory, but that's not the case)
>
> I don't know if that's doable, but you could get better confidence that
> Valgrind is doing wrong by checking the validity of every addresses
> used in the memmove():
>
> #define valid_ptr(p) ((access((char*)p, 0) == -1) ? (errno != EFAULT)
> : 1)
>
access() and EFAULT?
The Open Group's IEEE 1003.1 (POSIX.1) specification of access() doesn't
list EFAULT:
http://www.opengroup.org/onlinepubs/...ns/access.html
http://www.opengroup.org/onlinepubs/009695399/toc.htm
I do see that HP-UX and SunOS may return EFAULT for access(). I may have
missed it, but my search of Glibc v2.4 doesn't seem to return EFAULT for
access() (or much of anything for that matter...). One file, errlist.c,
says that for GNU, EFAULT is never generated, a signal is generated instead,
perhaps SIGSEGV or SIGBUS. Is GNU/Linux supposed to (or does it through
some indirect method) return EFAULT?
Rod Pemberton
-
Re: really strange memmove problem
Rod Pemberton wrote:
.....
>>#define valid_ptr(p) ((access((char*)p, 0) == -1) ? (errno != EFAULT)
>>: 1)
>>
>
>
> access() and EFAULT?
>
> The OpenGroups IEEE 1003.1 POSIX.1 standard for access() doesn't list
> EFAULT:
>
> http://www.opengroup.org/onlinepubs/...ns/access.html
> http://www.opengroup.org/onlinepubs/009695399/toc.htm
>
> I do see that HP-UX and SunOS may return EFAULT for access(). I may have
> missed it, but my search of Glibc v2.4 doesn't seem to return EFAULT for
> access() (or much of anything for that matter...). One file, errlist.c,
> says that for GNU, EFAULT is never generated, a signal is generated instead,
> perhaps SIGSEGV or SIGBUS. Is GNU/Linux supposed to (or does it through
> some indirect method) return EFAULT?
Man page says it does (not verified). I see... it's a trick! Access
isn't expected to succeed - any errno besides EFAULT means that we can
access memory at "pathname"... but of course there's no filename there
to check access to... Neat trick!
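A minimal C sketch of the same trick, assuming Linux behaviour (POSIX does not list EFAULT for access(), as noted earlier in the thread). The names `probe_readable` and `make_unmapped_address` are made up for this illustration. As far as I understand it, the kernel reads a NUL-terminated path starting at the pointer, so the probe covers more than the single byte at that address.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if access() rejects p with EFAULT,
 * 1 otherwise. Any other error (ENOENT, ENAMETOOLONG, ...) means the
 * kernel was able to read a path string starting at p. */
int probe_readable(const void *p)
{
    assert(p != NULL);
    errno = 0;
    if (access((const char *)p, F_OK) == -1 && errno == EFAULT)
        return 0;
    return 1;
}

/* Hypothetical demo helper: maps one page, unmaps it again and returns
 * the stale address, which should no longer be readable.
 * Returns NULL if the mapping could not be created. */
void *make_unmapped_address(void)
{
    long page = sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    munmap(p, (size_t)page);
    return p;
}
```

Probing a pointer to an ordinary string should report it as readable (access() typically fails with ENOENT instead), while probing the address returned by `make_unmapped_address()` should report EFAULT. Each probe costs a system call, so checking every byte of a large buffer this way is slow.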
..... could this be used to guard against the "really bad bug" that man 3
malloc mentions?
Best,
Frank
-
Re: really strange memmove problem
Frank Kotler wrote:
> Rod Pemberton wrote:
>
> ....
>
>>> #define valid_ptr(p) ((access((char*)p, 0) == -1) ? (errno != EFAULT)
>>> : 1)
>>>
>>
>>
>> access() and EFAULT?
>>
>> The OpenGroups IEEE 1003.1 POSIX.1 standard for access() doesn't list
>> EFAULT:
>>
>> http://www.opengroup.org/onlinepubs/...ns/access.html
>> http://www.opengroup.org/onlinepubs/009695399/toc.htm
>>
>> I do see that HP-UX and SunOS may return EFAULT for access(). I may have
>> missed it, but my search of Glibc v2.4 doesn't seem to return EFAULT for
>> access() (or much of anything for that matter...). One file, errlist.c,
>> says that for GNU, EFAULT is never generated, a signal is generated
>> instead,
>> perhaps SIGSEGV or SIGBUS. Is GNU/Linux supposed to (or does it through
>> some indirect method) return EFAULT?
>
> Man page says it does (not verified).
Well... I verified it, I guess. I had a segfaulting program at hand that
I thought would make a "testbed". Results confused me, so I pared it
down to this simple test. As expected, it segfaults in short order. In
an xterm, same thing, as expected. Uncomment the lines enabling the
memory check, and it exits cleanly - fairly promptly. In an xterm,
however, it runs for... a long time. Still running...
Can anyone tell me what's going on here? TIA.
Best,
Frank
; nasm -f elf myprog.asm
; ld -o myprog myprog.o
global _start
section .bss
    buf resb 100
section .text
_start:
    mov esi, buf
.top:
;   call is_memory_valid
;   test eax, eax
;   js exit
    lodsb
    jmp .top
exit:
    mov eax, 1
    int 80h
;-----------------------------
;-----------------------------
is_memory_valid:
    push ebx
    push ecx
    mov eax, 33 ; __NR_access
    mov ebx, esi
    xor ecx, ecx ; F_OK
    int 80h
    cmp eax, -14 ; -EFAULT
    jnz .good
    or eax, byte -1
    jmp short .done
.good:
    xor eax, eax
.done:
    pop ecx
    pop ebx
    ret
;----------------------------