Re: RC4 optimize for em64t
On Tue, 2005-04-05 at 18:17, Andy Polyakov wrote:
> > Current OpenSSL (0.9.8-dev) rc4speed throughput on a Nocona (EM64T, 64-bit) 3.6GHz is 272Mb/s, while this version of the RC4 code can achieve 536Mb/s in rc4speed.
> > Would you please review it?
> Cool conditional moves in unrolled loop. Have you considered/tried cmov
> instead of jump over move instruction? No, there is no conditional move
> with zero extension, but upper part is maintained zeroed, so that byte
> cmov should do... Well, I bet those jumps are seldom taken, so that
> branch prediction logic can do a better job than cmov, but I have to ask:-)
Well, I tried using cmov here; it just slowed the throughput down a lot...
> Or how about moving movzb (%rdi,%r10),%r8d upwards as movzb
> (%rdi,%r10),%r14b and make inter-register move between r8 and r14
I will try it.
> The reason I didn't attempt to unroll the RC4_CHAR loop was that I never
> had access to EM64T hardware and simply mechanically ported P4 loop from
> 32-bit implementation [where unrolling affected performance negatively]
> and tested it for correctness on Opteron.
> BTW, 272MBps at 3.6GHz? I get 262MBps out of [as just mentioned
> virtually identical] 32-bit code at 2.4GHz P4...
> A.
In fact, your implementation on EM64T isn't that slow if
we change the inc and dec to add and sub.
With that change the throughput goes from 272Mb/s up to 396Mb/s.
I have not investigated the 32-bit P4 path yet,
but you should see a performance gain on P4 with this change.
Zou Nan hai
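
For reference, here is a minimal C sketch of the byte-oriented (RC4_CHAR-style) loop this thread is about. It is not the OpenSSL code and not the assembly patch under review: the names rc4_sketch, rc4_sketch_init, rc4_sketch_crypt and rc4_sketch_selftest are hypothetical. The two lines marked "step" are where the hand-written assembly would use inc/dec or add/sub. C cannot express that choice, since the compiler picks the instruction. A likely reason the change matters on P4-family cores is that inc and dec leave the carry flag untouched, so they update the flags only partially and can wait on an earlier flag-writing instruction, while add and sub rewrite all the arithmetic flags.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical layout for illustration; not OpenSSL's RC4_KEY. */
typedef struct {
    unsigned char x, y;     /* running indices, kept byte-sized */
    unsigned char s[256];   /* permutation table, one byte per entry */
} rc4_sketch;

/* Standard RC4 key schedule. keylen must be non-zero. */
static void rc4_sketch_init(rc4_sketch *k, const unsigned char *key,
                            size_t keylen)
{
    unsigned int i;
    unsigned char j = 0, t;

    for (i = 0; i < 256; i++)
        k->s[i] = (unsigned char)i;
    for (i = 0; i < 256; i++) {
        j = (unsigned char)(j + k->s[i] + key[i % keylen]);
        t = k->s[i];
        k->s[i] = k->s[j];
        k->s[j] = t;
    }
    k->x = 0;
    k->y = 0;
}

/* One byte per iteration, not unrolled. Encrypts or decrypts len bytes. */
static void rc4_sketch_crypt(rc4_sketch *k, size_t len,
                             const unsigned char *in, unsigned char *out)
{
    unsigned char x = k->x, y = k->y, tx, ty;

    while (len > 0) {
        x = (unsigned char)(x + 1);   /* step: inc or add $1 in assembly */
        tx = k->s[x];
        y = (unsigned char)(y + tx);
        ty = k->s[y];
        k->s[x] = ty;                 /* swap s[x] and s[y] */
        k->s[y] = tx;
        *out++ = (unsigned char)(*in++ ^ k->s[(unsigned char)(tx + ty)]);
        len -= 1;                     /* step: dec or sub $1 in assembly */
    }
    k->x = x;
    k->y = y;
}

/* Checks the widely published vector: key "Key", plaintext "Plaintext". */
static int rc4_sketch_selftest(void)
{
    static const unsigned char expect[9] = {
        0xBB, 0xF3, 0x16, 0xE8, 0xD9, 0x40, 0xAF, 0x0A, 0xD3
    };
    unsigned char out[9];
    rc4_sketch k;

    rc4_sketch_init(&k, (const unsigned char *)"Key", 3);
    rc4_sketch_crypt(&k, 9, (const unsigned char *)"Plaintext", out);
    return memcmp(out, expect, sizeof(expect)) == 0;
}
```

Calling rc4_sketch_selftest() returns 1 when the output matches the known vector. To compare inc/dec against add/sub on a given machine, the loop would have to be written in assembly, as in the patch, and timed with rc4speed.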
OpenSSL Project http://www.openssl.org
Development Mailing List email@example.com
Automated List Manager firstname.lastname@example.org