>>>Or how about moving movzb (%rdi,%r10),%r8d upwards as movzb
>>>(%rdi,%r10),%r14b and make inter-register move between r8 and r14

>> I will try it.

> I have tried it, no performance gain.

Does that mean it's the same, or does it mean it's slower? Was it
cmov, or was it a jump over a mov instruction? BTW, what is the
latency/throughput of cmov on Intel anyway? I can't find that
information.

Another question. Why are the rotations 32-bit? Did you try 64-bit
rotations and find them slow? If so, by how much?

You may wonder why all these questions. I want to understand the code
well enough to make it regular, so that the unrolled assembler loop can
be expressed as a perl loop. That would make it easier for us to
maintain, and I'm even ready to sacrifice a few percent of performance
for more regular-looking code.
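
To illustrate what expressing an unrolled loop through a script loop
looks like, here is a rough sketch. It is written in Python rather than
perl only to keep the example self-contained. The helper name `unroll`,
the template and the placeholder comment line are hypothetical and are
not taken from the actual module; only the movzb operands come from the
instruction quoted above.

```python
def unroll(template, count):
    """Return `count` copies of `template`, one per unrolled iteration.

    `{i}` in a template line is replaced by the iteration index, so
    every copy can address a different byte offset.
    """
    lines = []
    for i in range(count):
        for insn in template:
            lines.append("\t" + insn.format(i=i))
    return lines


# Hypothetical loop body: one byte load per iteration at offset i,
# followed by a placeholder for the rest of the iteration.
body = [
    "movzb\t{i}(%rdi,%r10),%r8d",
    "# ... rest of iteration {i} ...",
]

print("\n".join(unroll(body, 4)))
```

The more regular the body is, the less special-casing the generator
needs, which is the trade-off against the last few percent of speed.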

>>>BTW, 272MBps at 3.6GHz? I get 262MBps out of [as just mentioned
>>>virtually identical] 32-bit code at 2.4GHz P4... A.

>> In fact, your implementation on EM64T isn't that slow if
>> we change the inc and dec to add and sub.
>> With that change the throughput goes up from 272MB/s to 396MB/s.
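
For reference, the gain presumably comes from the fact that on the P4
family inc and dec update only part of the flags register (CF is left
untouched), which creates a dependency on the previous flags value,
while add and sub write all the arithmetic flags. The sketch below shows
the substitution as a text filter over assembler lines. It is in Python
for illustration only, `inc_to_add` is a hypothetical helper and not
part of the tree, and the rewrite is safe only where no later
instruction relies on inc/dec preserving CF.

```python
import re

# Matches e.g. "inc %r10b" or "decl %ecx":
# groups are indentation, inc|dec, optional size suffix, operand.
_INCDEC = re.compile(r"^(\s*)(inc|dec)([bwlq]?)\s+(\S+)\s*$")


def inc_to_add(line):
    """Rewrite 'inc X' as 'add $1,X' and 'dec X' as 'sub $1,X'.

    Lines that are not a plain inc/dec are returned unchanged.
    """
    m = _INCDEC.match(line)
    if m is None:
        return line
    indent, op, suffix, operand = m.groups()
    new_op = "add" if op == "inc" else "sub"
    return f"{indent}{new_op}{suffix}\t$1,{operand}"


print(inc_to_add("\tinc\t%r10b"))
print(inc_to_add("\tdecl\t%ecx"))
```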

For *now* I'm committing only this change to CVS and will have a closer
look at the unrolled loop later on [some time next week]. BTW, there is
another idea I'd like to try, so I'm likely to send you some code for
benchmarking on EM64T hardware. A.
______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List openssl-dev@openssl.org
Automated List Manager majordomo@openssl.org