Re: 0.9.8: cfb_enc.c bug? and AES speed on Win64/x64
>> If I use the Intel C++ Compiler 9.0 for EM64T with /O2 or higher, it
>> replaces the above memcpy with the optimized function
>> __intel_fast_memcpy, which breaks DES in OpenSSL.
> For reference, note that the Linux version avoids __intel_fast_memcpy with
> -Dmemcpy=__builtin_memcpy, because libirc.a caused grief when linked
> into a shared library. __intel_fast_memcpy feels like overkill in the
> OpenSSL context, and inlined code [movs or an unrolled loop] should do a
> better job. Can you try to compile with -Dmemcpy=__builtin_memcpy?
That fails at link time:
unresolved external symbol __builtin_memcpy
unresolved external symbol __builtin_memset
However, /Oi- disables inlining of all intrinsic functions, and it works (as
far as destest is concerned) if I compile cfb_enc.c with that.
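As a rough sketch of why the define helps on one toolchain and fails on another: -Dmemcpy=__builtin_memcpy simply renames every memcpy call, so it only links where the compiler provides that builtin. The guard below assumes that GCC-compatible compilers define __GNUC__ and that the Windows compiler does not; copy_block is a hypothetical helper, not an OpenSSL function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: only GCC-compatible compilers provide __builtin_memcpy.
 * Elsewhere the rename is skipped and the library memcpy is used. */
#if defined(__GNUC__)
# undef memcpy
# define memcpy __builtin_memcpy
#endif

/* Hypothetical helper: copy len bytes between non-overlapping buffers. */
void copy_block(unsigned char *dst, const unsigned char *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, len);  /* expands to __builtin_memcpy under the guard */
}
```

Without such a guard, the renamed call has no matching symbol on a compiler that lacks the builtin, which would match the unresolved external errors above.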
>> It seems like memcpy should be replaced with memmove here?
> Does it mean that you've tried replacing it with memmove and can
> confirm that DES works if compiled with ICC /O2 or higher? It actually
> smells more like a compiler bug than a memcpy vs. memmove issue...
Yes, DES works with memmove and breaks with memcpy for /O2 and higher.
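To illustrate why memmove can matter here, below is a simplified model, not the actual cfb_enc.c code. It assumes a 16-byte buffer whose leading 8 bytes are refreshed from a position n bytes further along, so source and destination overlap for 0 < n < 8. The C standard leaves memcpy undefined for overlapping regions, while memmove is required to handle them. The name slide_window is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: move the 8 bytes starting at buf + n down to
 * the start of buf. For 0 < n < 8 the two regions overlap, so memmove
 * is the only standard-conforming choice; memcpy would be undefined
 * behaviour and may misbehave with an aggressively optimized copy. */
void slide_window(unsigned char buf[16], size_t n)
{
    assert(n <= 8);            /* keep the source range inside buf */
    memmove(buf, buf + n, 8);
}
```

An optimized memcpy is free to copy in any order or in wide chunks, so whether an overlapping call happens to work depends on the implementation the compiler selects.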
>> 3) Is AES really a lot faster on Win64/x64 compared to the i586 asm
>> version or am I doing something wrong?
> 1. Who says that AES is assembler-empowered on Win32? It's not, not yet :-)
> 2. What's wrong with 64-bit code being faster than 32-bit code? 64-bit
> code has access to a wider register bank, 8 extra registers, and in the AES
> case there is no need to spill any registers to the stack in every loop
> spin. Fewer instructions, no wasted bus bandwidth -> better performance.
I assumed that the asm file was used since it was included... Some
OpenSSL algorithms are slower on x64, like RSA. SHA1 and RC4 seem to be
faster, but the speed command breaks for all but the first test:
Doing rc4 209715200 times on 16 size blocks: 209715200 rc4's in 20.84s
Doing rc4 -14680064 times on 64 size blocks: 0 rc4's in 0.00s
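The negative count is consistent with a 32-bit overflow. On Win64, long is 32 bits, and 209715200 * 16 = 3355443200 does not fit in a signed 32-bit value. It wraps to -939524096, and dividing that by 64 gives exactly -14680064. The sketch below reproduces the arithmetic. The scaling rule (previous count times previous block size, divided by the next block size) is an assumption about how speed derives the count, not a quote from its source, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Reinterpret the low 32 bits of v as a signed two's-complement value,
 * modelling what a 32-bit long typically ends up holding. */
int32_t wrap_to_int32(int64_t v)
{
    uint32_t low = (uint32_t)v;   /* well-defined: reduces modulo 2^32 */
    if (low >= UINT32_C(0x80000000))
        return (int32_t)((int64_t)low - INT64_C(4294967296));
    return (int32_t)low;
}

/* Assumed scaling rule: count * from_size / to_size, with the
 * intermediate product truncated to 32 bits. */
int32_t scaled_count(int32_t count, int32_t from_size, int32_t to_size)
{
    assert(to_size > 0);
    int64_t product = (int64_t)count * from_size;  /* exact in 64 bits */
    return wrap_to_int32(product) / to_size;
}
```

Calling scaled_count(209715200, 16, 64) returns -14680064, the value printed above. If this is indeed the cause, doing the intermediate multiplication in a 64-bit type would avoid the wrap.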