On Fri, 8 Jul 2005, Andy Polyakov wrote:

>>> Do note "[when] num [as in memcpy(ovec,ovec+num,8)] is guaranteed to be
>>> positive." Question was can you imagine memcpy implementation that would
>>> fail to handle overlapping regions when source address is *larger* than
>>> destination? Question was *not* if you can imagine memcpy implementation
>>> that would fail to handle arbitrary overlapping regions.

>> Yes.
>> void * memcpy(void * dst, const void * src, size_t len) {
>>     char * d = ((char *) dst) + len;
>>     const char * s = ((const char *) src) + len;
>>     while (len-- > 0) {
>>         *--d = *--s;
>>     }
>>     return dst;
>> }
>> This is a fully conformant implementation of memcpy. Not sure why you'd
>> implement it this way, but it's legal.

> Question is not how I implement it, but why ICC would. What would be a
> performance reason to implement something similar to this... But whatever,
> memmove it is...

Hmm. I am sort of jumping into the middle of things here. The question
is how portable the code needs to be. If it's using inline assembly, and
as such isn't very portable by its nature (gcc and icc and that's about
it), then this isn't that big of a problem. Make sure it works correctly
on both compilers, and you're fine. If it's generic C code that needs to
work on a large number of platforms, then this might be a problem.

My argument isn't so much to be against non-portable code, it's to be
aware when you're writing non-portable code.

Although, if it's to be x86-specific, I'd be tempted to replace it with:
((unsigned int *) ptr)[0] = ((unsigned int *) (ptr+off))[0];
((unsigned int *) ptr)[1] = ((unsigned int *) (ptr+off))[1];

Note that this bit of code contains both gcc-isms (void * arith) and
x86-specific things (not checking the alignment of 32-bit loads and
stores, assuming ints are 32 bits). But it has the advantage of being
what you really meant to do.
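For what it's worth, the same two loads and stores can be expressed without
either of those assumptions: a fixed-size memcpy through local temporaries is
usually lowered by gcc and icc to plain register moves, but stays legal on
strict-alignment targets and under strict aliasing. A minimal sketch (the names
`ptr` and `off` are just the ones from the snippet above, not anything from the
actual OpenSSL code):

```c
#include <stdint.h>
#include <string.h>

/* Portable equivalent of the two 32-bit assignments above.  Both words
 * are loaded before either store, so this also behaves sanely when the
 * source and destination regions overlap. */
static void copy_two_words(unsigned char *ptr, size_t off)
{
    uint32_t w0, w1;
    memcpy(&w0, ptr + off, 4);      /* unaligned-safe 32-bit load */
    memcpy(&w1, ptr + off + 4, 4);
    memcpy(ptr, &w0, 4);            /* stores after both loads */
    memcpy(ptr + 4, &w1, 4);
}
```

Whether a given compiler actually turns this into two register moves is of
course something to verify in the generated assembly.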

>>> See a). Inlining is believed/expected to be faster than call to a
>>> function.

>> This is not always true. If the inlining causes the code size to bloat
>> and no longer fit into cache, for example. Also, shared copies of the
>> function can share branch prediction information.

> Well, if one uses designated intrinsic function, compiler has a chance
> to evaluate the trade-off and "decide" when it's appropriate to inline
> or call a function, while in case of memmove you're bound to call...

Actually, the argument in favor of inlining memcpy also applies to
memmove- especially in the case when the length is short enough that the
data can first be copied to a temporary place (registers), and then
copied back to its destination. I don't know why gcc doesn't do this.
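Done by hand, that read-everything-then-write-everything scheme looks like
the sketch below (the name `move8` and the fixed length of 8 are just for
illustration; for a short constant size the temporary should end up entirely
in registers):

```c
#include <string.h>

/* Overlap-safe move of 8 bytes: read the whole source into a temporary
 * first, then write it out, which is exactly the guarantee memmove
 * gives.  This is the transformation a compiler could apply when
 * inlining memmove with a small, known length. */
static void move8(unsigned char *dst, const unsigned char *src)
{
    unsigned char tmp[8];
    memcpy(tmp, src, 8);   /* read everything first... */
    memcpy(dst, tmp, 8);   /* ...then write, so overlap is harmless */
}
```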


______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List openssl-dev@openssl.org
Automated List Manager majordomo@openssl.org