> -----Original Message-----
> From: email@example.com
> On Behalf Of Andy Polyakov
> Sent: Wednesday, April 06, 2005 5:34 PM
> To: firstname.lastname@example.org
> Subject: Re: RC4 optimize for em64t
> >>>Or how about moving movzb (%rdi,%r10),%r8d upwards as movzb
> >>>(%rdi,%r10),%r14b and making an inter-register move between r8 and r14
> >> I will try it.
> > I have tried it; no performance gain.
> Does it mean that it's same or does it mean that it's slower? Was it
> cmov or was it jump over mov instruction? BTW, what is the
> latency/throughput for Intel cmov anyway? I can't find information
Using cmov here slows things down a lot.
Making the mov r13b, (%rdi, %rdi) conditional instead gives the same speed...
> Another question. Why are the rotations 32-bit? Did you try 64-bit
> and find them slow? If so, by how much?
Changing to a 64-bit ror slows the throughput to around 480Mb/s.
> You may wonder why all these questions. I want to understand the code and
> make it regular enough to express the assembler unrolled loop in perl loop
> terms. That makes it easier for us to maintain, and I'm even ready to
> sacrifice a few percent of performance for more regular looking code.
> >>>BTW, 272MBps at 3.6GHz? I get 262MBps out of [as just mentioned
> >>>virtually identical] 32-bit code at 2.4GHz P4... A.
> >> In fact, your implementation on EM64T isn't that slow if
> >> we change the inc and dec to add and sub.
> >> With that change the throughput goes up from 272Mb/s to 396Mb/s.
> For *now* I'm committing only this change to CVS and will have closer
> look at unrolled loop later on [some time next week]. BTW, there is
> another idea I'd like to try, so I'm likely to send you some code for
> benchmarking on EM64T hardware. A.
I am glad to do the test for you.
I have tested changing inc and dec in the 32-bit code to add and sub and
see a 2% performance gain on a P4.
It is a bit strange that you see a slowdown. In theory, changing inc to add
should only make a difference on the P4.
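
For readers following the thread without the assembler at hand, here is a
minimal C sketch of the RC4 loop being tuned. It is not the OpenSSL code:
the names rc4_sketch, rc4_sketch_init and rc4_sketch_crypt are made up for
illustration, and the comment marking where an inc/add would sit is an
assumption about how such a loop is usually written in assembler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state layout for illustration; not OpenSSL's RC4_KEY. */
typedef struct {
    unsigned char x, y;      /* the two RC4 indices */
    unsigned char s[256];    /* the permutation table */
} rc4_sketch;

/* Standard RC4 key schedule. */
void rc4_sketch_init(rc4_sketch *st, const unsigned char *key, size_t keylen)
{
    unsigned int i, j = 0;
    unsigned char tmp;

    assert(keylen > 0);
    for (i = 0; i < 256; i++)
        st->s[i] = (unsigned char)i;
    for (i = 0; i < 256; i++) {
        j = (j + st->s[i] + key[i % keylen]) & 0xff;
        tmp = st->s[i];
        st->s[i] = st->s[j];
        st->s[j] = tmp;
    }
    st->x = 0;
    st->y = 0;
}

/* One byte per iteration; an unrolled assembler loop repeats this body. */
void rc4_sketch_crypt(rc4_sketch *st, const unsigned char *in,
                      unsigned char *out, size_t len)
{
    unsigned char x = st->x, y = st->y, tx, ty;
    size_t n;

    for (n = 0; n < len; n++) {
        x = (unsigned char)(x + 1);  /* index step: assumed spot for inc vs. add */
        tx = st->s[x];
        y = (unsigned char)(y + tx);
        ty = st->s[y];
        st->s[x] = ty;               /* swap s[x] and s[y] */
        st->s[y] = tx;
        out[n] = in[n] ^ st->s[(unsigned char)(tx + ty)];
    }
    st->x = x;
    st->y = y;
}
```

Each output byte depends on the table update from the previous byte, so the
loop is a long dependency chain. As far as I know, that is why a small change
such as inc to add can show up in the throughput: on the P4, inc and dec
leave the carry flag untouched and so depend on whatever set the flags
earlier, while add and sub rewrite all of them.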
Zou Nan hai
OpenSSL Project http://www.openssl.org
Development Mailing List email@example.com
Automated List Manager firstname.lastname@example.org