> Current OpenSSL (0.9.8-dev) rc4speed throughput on a Nocona (EM64T,
> 64-bit) 3.6GHz is 272Mb/s, while this version of the RC4 code can
> achieve 536Mb/s in RC4Speed.
>
> Would you please review it?


Cool conditional moves in the unrolled loop. Have you considered/tried cmov
instead of a jump over the move instruction? No, there is no conditional move
with zero extension, but the upper part is kept zeroed, so a byte
cmov should do... Well, I bet those jumps are seldom taken, so the
branch prediction logic can do a better job than cmov, but I have to ask:-)

Or how about moving movzb (%rdi,%r10),%r8d upwards as movzb
(%rdi,%r10),%r14b and making the inter-register move between r8 and r14
conditional?
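To make the trade-off concrete, here is a rough C sketch. It is not the submitted assembler (which is not reproduced in this message), and the names rc4_sketch, rc4_plain and rc4_preload are made up for illustration. It assumes the conditional move in question is the usual fix-up needed when the next S[x+1] is loaded early, before the current iteration's stores: if y happens to equal x+1, the early load is stale and must be replaced by tx. The fix-up is written both as a branch over a move and as a branchless select; a compiler is free to emit either a jump or a cmov for either form.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state type for this sketch; not OpenSSL's RC4_KEY. */
typedef struct { uint8_t S[256]; uint8_t x, y; } rc4_sketch;

/* Standard RC4 key schedule. */
static void rc4_sketch_init(rc4_sketch *st, const uint8_t *key, size_t keylen)
{
    uint8_t j = 0;
    for (unsigned i = 0; i < 256; i++) st->S[i] = (uint8_t)i;
    for (unsigned i = 0; i < 256; i++) {
        uint8_t t = st->S[i];
        j = (uint8_t)(j + t + key[i % keylen]);
        st->S[i] = st->S[j];
        st->S[j] = t;
    }
    st->x = st->y = 0;
}

/* Textbook keystream loop: every load happens after the previous stores. */
static void rc4_plain(rc4_sketch *st, uint8_t *out, size_t len)
{
    uint8_t *S = st->S, x = st->x, y = st->y;
    for (size_t n = 0; n < len; n++) {
        x = (uint8_t)(x + 1);
        uint8_t tx = S[x];
        y = (uint8_t)(y + tx);
        uint8_t ty = S[y];
        S[y] = tx;
        S[x] = ty;
        out[n] = S[(uint8_t)(tx + ty)];
    }
    st->x = x; st->y = y;
}

/* Same keystream, but S[x+1] is loaded before the stores of the current
 * iteration. The store S[y] = tx clobbers that slot exactly when y == x+1,
 * so the early load needs a conditional fix-up. */
static void rc4_preload(rc4_sketch *st, uint8_t *out, size_t len, int branchless)
{
    uint8_t *S = st->S, x = st->x, y = st->y;
    if (len == 0) return;
    uint8_t xn = (uint8_t)(x + 1);
    uint8_t tx = S[xn];                 /* preload for the first iteration */
    for (size_t n = 0; n < len; n++) {
        x = xn;
        y = (uint8_t)(y + tx);
        uint8_t ty = S[y];
        xn = (uint8_t)(x + 1);
        uint8_t next = S[xn];           /* early load, may go stale below */
        S[y] = tx;
        S[x] = ty;
        out[n] = S[(uint8_t)(tx + ty)];
        if (branchless) {
            /* select without a branch: mask is 0xFF when y == xn, else 0 */
            uint8_t mask = (uint8_t)-(y == xn);
            next = (uint8_t)((tx & mask) | (next & (uint8_t)~mask));
        } else if (y == xn) {
            next = tx;                  /* the "jump over move" form */
        }
        tx = next;
    }
    st->x = x; st->y = y;
}
```

Both variants of rc4_preload should produce the same keystream as rc4_plain for any key; which one is faster on a given core is exactly the question above and can only be settled by measuring.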

The reason I didn't attempt to unroll the RC4_CHAR loop was that I never
had access to EM64T hardware and simply mechanically ported the P4 loop from
the 32-bit implementation [where unrolling affected performance negatively]
and tested it for correctness on Opteron.

BTW, 272MBps at 3.6GHz? I get 262MBps out of [as just mentioned
virtually identical] 32-bit code at 2.4GHz P4...

A.

__________________________________________________ ____________________
OpenSSL Project http://www.openssl.org
Development Mailing List openssl-dev@openssl.org
Automated List Manager majordomo@openssl.org