On Wed, 2005-04-06 at 08:08, Zou Nan hai wrote:
> On Tue, 2005-04-05 at 18:17, Andy Polyakov wrote:
> > > Current OpenSSL (0.9.8-dev) rc4speed throughput on a Nocona (EM64T, 64-bit) 3.6GHz is 272Mb/s, while this version of the RC4 code can achieve 536Mb/s in rc4speed.
> > >
> > >   Would you please review it?

> >
> > Cool conditional moves in unrolled loop. Have you considered/tried cmov
> > instead of jump over move instruction? No, there is no conditional move
> > with zero extension, but upper part is maintained zeroed, so that byte
> > cmov should do... Well, I bet those jumps are seldom taken, so that
> > branch prediction logic can do a better job than cmov, but I have to ask:-)
> >

> Well, I tried using cmov here; it just slows down the throughput a lot...
> > Or how about moving movzb (%rdi,%r10),%r8d upwards as movzb
> > (%rdi,%r10),%r14b and making the inter-register move between r8 and r14
> > conditional?
> >

> I will try it.

I have tried it; no performance gain.
> > The reason I didn't attempt to unroll the RC4_CHAR loop was that I never
> > had access to EM64T hardware and simply mechanically ported P4 loop from
> > 32-bit implementation [where unrolling affected performance negatively]
> > and tested it for correctness on Opteron.
> >
> > BTW, 272MBps at 3.6GHz? I get 262MBps out of [as just mentioned
> > virtually identical] 32-bit code at 2.4GHz P4... A.
> >

>
> In fact, your implementation on EM64T isn't that slow if
> we change the inc and dec to add and sub.
>
> With that change the throughput is boosted from 272Mb/s to 396Mb/s.
>
> I have not investigated the 32-bit P4 path yet,
> but you should see a performance gain on P4 with this change.
>
> Zou Nan hai
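
For anyone reading this thread without the patch at hand, here is a minimal
C sketch of the byte-at-a-time RC4 loop that the assembly under discussion
implements. It is not the OpenSSL source and not the code from the patch:
the names rc4_state, rc4_init and rc4_crypt are made up for illustration,
and the state is kept as an unsigned char array only in the spirit of the
RC4_CHAR variant mentioned above.

```c
#include <stddef.h>

/* Hypothetical state layout for illustration; not OpenSSL's RC4_KEY. */
typedef struct {
    unsigned char x, y;     /* the two RC4 indices */
    unsigned char S[256];   /* the permutation, one byte per entry */
} rc4_state;

/* Key schedule: start from the identity permutation and mix in the key.
 * keylen must be non-zero. */
static void rc4_init(rc4_state *st, const unsigned char *key, size_t keylen)
{
    unsigned int i;
    unsigned char j = 0, t;

    for (i = 0; i < 256; i++)
        st->S[i] = (unsigned char)i;
    for (i = 0; i < 256; i++) {
        j = (unsigned char)(j + st->S[i] + key[i % keylen]);
        t = st->S[i];
        st->S[i] = st->S[j];
        st->S[j] = t;
    }
    st->x = 0;
    st->y = 0;
}

/* One keystream byte per iteration: two loads, two stores (the swap),
 * and one more load for the output byte. Not unrolled. */
static void rc4_crypt(rc4_state *st, const unsigned char *in,
                      unsigned char *out, size_t len)
{
    unsigned char x = st->x, y = st->y, tx, ty;
    size_t n;

    for (n = 0; n < len; n++) {
        x = (unsigned char)(x + 1);
        tx = st->S[x];
        y = (unsigned char)(y + tx);
        ty = st->S[y];
        st->S[x] = ty;                      /* swap S[x] and S[y] */
        st->S[y] = tx;
        out[n] = in[n] ^ st->S[(unsigned char)(tx + ty)];
    }
    st->x = x;
    st->y = y;
}
```

Unrolling this loop is delicate because a load of the next S[x+1] issued
early can be invalidated by the store to S[y] when the two indices
coincide, so the early value then needs a fix-up. That this is what the
jump-over-move sequences in the patch guard against is my assumption; the
thread does not say so explicitly.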


______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List openssl-dev@openssl.org
Automated List Manager majordomo@openssl.org