>>BTW, 272MBps at 3.6GHz? I get 262MBps out of [as just mentioned
>>virtually identical] 32-bit code at 2.4GHz P4...

> In fact, your implementation on EM64T isn't that slow if
> we change the inc and dec to add and sub.
> With that change the throughput rises from 272MBps to 396MBps.

Huh? And what if you replace inc/add with lea 1(%reg),%reg to eliminate
even the possibility of contention for %eflags?
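
To make the three alternatives concrete, here is a minimal sketch in C
with GCC-style inline assembly. The function names bump_inc, bump_add and
bump_lea are hypothetical and are not taken from the code under
discussion. It only shows how each form treats the flags and says nothing
about which is faster on a given core.

```c
/* Three ways to add 1 to a register, differing only in flags behaviour.
 * Assumes GCC or Clang on x86/x86-64; other targets fall back to plain C. */
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#define HAVE_X86_ASM 1
#else
#define HAVE_X86_ASM 0
#endif

/* inc: updates OF/SF/ZF/AF/PF but leaves CF untouched (partial flags write). */
static unsigned long bump_inc(unsigned long x)
{
#if HAVE_X86_ASM
    __asm__("inc %0" : "+r"(x) : : "cc");
#else
    x += 1;
#endif
    return x;
}

/* add $1: rewrites all arithmetic flags, including CF. */
static unsigned long bump_add(unsigned long x)
{
#if HAVE_X86_ASM
    __asm__("add $1, %0" : "+r"(x) : : "cc");
#else
    x += 1;
#endif
    return x;
}

/* lea 1(%reg),%reg: address arithmetic only, does not touch the flags. */
static unsigned long bump_lea(unsigned long x)
{
#if HAVE_X86_ASM
    __asm__("lea 1(%0), %0" : "+r"(x));
#else
    x += 1;
#endif
    return x;
}
```

All three return x + 1. Which one wins has to be measured in the real
inner loop on each core.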

> I have not investigated the 32-bit P4 path yet,
> but you should see a performance gain on P4 with this change.

I see a >10% slow-down, even on a Prescott core... A.
