# fastest transition from bitmask to the position of it's lowest SETbit

Show 40 post(s) from this thread on one page
Page 1 of 2 1 2 Last
• 05-29-2008, 01:40 PM
unix
fastest transition from bitmask to the position of it's lowest SETbit
I want to combine a bitmask M with a byte-aligned variable V to extract
a (not necessarily) byte-aligned parameter P (from a network protocol).

The bitmask of the same bitsize as V has the (successive) bits set to 1,
where the parameter is stored, and the rest to 0.

So basically I wanna apply this math:

P = (V AND M) SHR O

where AND is a bitwise AND, SHR is bitwise shift-right and O is the
order of the lowest set bit in the bitmask.

For example:
V = b0110101000110010 (2 bytes)
M = b0000011111110000 (2 bytes)
O = ln (16) / ln (2) = 4

P = b0110101000110010
AND b0000011111110000
SHR 4
= b0000001000110000 SHR 4 = b0000000000100011 = b100011 (= d35)

Now I am quite at a loss to determine that order O in a single
mathematical operation other than the logarithm.

I could of course do this operation in a loop
while (M AND 1) == 0
P = P SHR 1
M = M SHR 1
end while

But this would definitely be slower if the parameter is a very long one.

So - any suggestions to determine the (binary) order of the lowest set
bit of a variable?

Thank you in advance!

Lars
• 05-29-2008, 02:33 PM
unix
Re: fastest transition from bitmask to the position of it's lowestSET bit
On May 29, 9:40 am, Lars Uffmann <a...@nurfuerspam.de> wrote:[color=blue]
> For example:
> V = b0110101000110010 (2 bytes)
> M = b0000011111110000 (2 bytes)
> O = ln (16) / ln (2) = 4
>
> P = b0110101000110010
> AND b0000011111110000
> SHR 4
> = b0000001000110000 SHR 4 = b0000000000100011 = b100011 (= d35)
>
> Now I am quite at a loss to determine that order O in a single
> mathematical operation other than the logarithm.[/color]

You want (V & M) / (M & -M):

V = b0110101000110010
M = b0000011111110000
-M = b1111100000010000 <-- 2's complement
M&-M = b0000000000010000 <-- lowest bit on
V&M = b0000001000110000 <-- masked
/M&-M = b0000000000100011 <-- divide is like shr low_bit_pos

Jason

• 05-29-2008, 02:53 PM
unix
Re: fastest transition from bitmask to the position of it's lowestSET bit
On May 29, 10:33 am, "jason.cipri...@gmail.com"
<jason.cipri...@gmail.com> wrote:[color=blue]
> M&-M = b0000000000010000 <-- lowest bit on[/color]

It should be apparent but I forgot to mention you'll have to handle it
as a special case if M==0. Also note that if you can guarantee that
your mask bits are contiguous and you want to avoid integer division
for any reason you can substitute (M & (~M << 1)) for (M / -M), and
not have to handle M==0 separately as well.

Jason
• 05-29-2008, 02:55 PM
unix
Re: fastest transition from bitmask to the position of it's lowestSET bit
On May 29, 10:53 am, "jason.cipri...@gmail.com"
<jason.cipri...@gmail.com> wrote:[color=blue]
> for any reason you can substitute (M & (~M << 1)) for (M / -M), and
> not have to handle M==0 separately as well.[/color]

Ignore everything after ", and" in that sentence, sorry.
• 05-29-2008, 03:24 PM
unix
Re: fastest transition from bitmask to the position of it's lowestSET bit
Lars Uffmann wrote:[color=blue]
> I want to combine a bitmask M with a byte-aligned variable V to extract
> a (not necessarily) byte-aligned parameter P (from a network protocol).
>
> The bitmask of the same bitsize as V has the (successive) bits set to 1,
> where the parameter is stored, and the rest to 0.
>
> So basically I wanna apply this math:
>
> P = (V AND M) SHR O
>
> where AND is a bitwise AND, SHR is bitwise shift-right and O is the
> order of the lowest set bit in the bitmask.
>
> For example:
> V = b0110101000110010 (2 bytes)
> M = b0000011111110000 (2 bytes)
> O = ln (16) / ln (2) = 4
>
> P = b0110101000110010
> AND b0000011111110000
> SHR 4
> = b0000001000110000 SHR 4 = b0000000000100011 = b100011 (= d35)
>
> Now I am quite at a loss to determine that order O in a single
> mathematical operation other than the logarithm.
>
> I could of course do this operation in a loop
> while (M AND 1) == 0
> P = P SHR 1
> M = M SHR 1
> end while
>
> But this would definitely be slower if the parameter is a very long one.
>
> So - any suggestions to determine the (binary) order of the lowest set
> bit of a variable?[/color]

In most situations the masks are predetermined constants:
You know in advance that you're interested in bits 4-10 of a
word. If that's your situation, then I'd suggest turning the
problem around: Instead of starting with the mask and trying
to derive the 4, start with the 4 and 10:

#define HIBIT 10
#define LOBIT 4
P = (V >> LOBIT) & ~(-1u << (HIBIT - LOBIT));

(You could add special handling for HIBIT==31 and/or LOBIT==0.)

But if the masks actually do arrive "out of the blue" and
you need to deduce LOBIT, a loop is about the best you can do.
You might speed it up somewhat by biting off more than one
bit at a time:

for (m = M; (m & 0xF) == 0; m >>= 4)
p >>= 4;
for ( ; (m & 1) == 0; m >>= 1)
p >>= 1;

You could pursue this strategy further:

p = V & M;
m = M;
if ((m & 0xFFFF) == 0) {
m >>= 16;
p >>= 16;
}
if ((m & 0xFF) == 0) {
m >>= 8;
p >>= 8;
}
if ((m & 0xF) == 0) {
...

You could even resort to a full-scale binary search:

if ((M & 0x0000FFFF) == 0) {
if ((M & 0x00FF0000) == 0) {
if ((M & 0x0F000000) == 0) {
if ((M & 0x30000000) == 0) {
if ((M & 0x40000000) == 0)
p >>= 31;
else
p >>= 30;
}
else {
if ((M & 0x10000000) == 0)
p >>= 29;
else
p >>= 28;
}
}
else ...

It's not a given, though, that any of these will be faster
than a simple loop. Measure, measure, measure.

--
[email]Eric.Sosman@sun.com[/email]
• 05-29-2008, 03:32 PM
unix
Re: fastest transition from bitmask to the position of it's lowestSET bit
Hey Jason,

[email]jason.cipriani@gmail.com[/email] wrote:[color=blue]
> You want (V & M) / (M & -M):[/color]

Thank you very much! It's embarrassing but it took me like... 40 minutes
to fully understand the 2's complement in binary, why it looks exactly
the way you described and (M & -M) will get me the lowest bit on :) Your
solution is quite a bit easier than what I had figured for the lowest
bit on:
((M << 1) XOR M) AND M

But the crucial part was that of course - I simply have to divide my
masked value by the value of the lowest bit on. Thanks for pointing that
out, even though I feel like I cheated on a simple riddle by not solving
it myself :)

Best Regards,

Lars
• 05-29-2008, 03:56 PM
unix
Re: fastest transition from bitmask to the position of it's lowestSET bit
Eric Sosman wrote:[color=blue]
> [...]
> But if the masks actually do arrive "out of the blue" and
> you need to deduce LOBIT, a loop is about the best you can do.
> [...][/color]

Oooh! See Jason Cipriani's gimmick; it's so cool
it assuages my shame in being refuted.

--
[email]Eric.Sosman@sun.com[/email]
• 05-29-2008, 04:11 PM
unix
Re: fastest transition from bitmask to the position of it's lowest SET bit
Eric Sosman <Eric.Sosman@sun.com> writes:[color=blue]
> Eric Sosman wrote:[color=green]
>> [...]
>> But if the masks actually do arrive "out of the blue" and
>> you need to deduce LOBIT, a loop is about the best you can do.
>> [...][/color]
>
> Oooh! See Jason Cipriani's gimmick; it's so cool
> it assuages my shame in being refuted.[/color]

Hmm ... the first time I have seen this particular expression is
in the ARM-specific part of Linux 2.6 (include/arm/bitops.h):

#define fls(x) \
( __builtin_constant_p(x) ? constant_fls(x) : \
({ int __r; asm("clz\t%0, %1" : "=r"(__r) : "r"(x) : "cc"); 32-__r; }) )
#define ffs(x) ({ unsigned long __t = (x); fls(__t & -__t); })
#define __ffs(x) (ffs(x) - 1)

where it (IIRC) appeared shortly after a somewhat heated discussion
with R. King regarding wether it should be allowed that drivers for
ARM SoC-devices (in particular, C. Hohnstaedt's 2.6 driver for the
NPEs being part of Intel IXP4XX-chips) make use of ARM-assembly
instructions insofar they are particularly well-suited to solving a
specific problem[*].
[*] Insofar there is any doubt regarding this: Mr King's position on
this is/ was, of course, that assembly is arcane, cryptic and
just disgusting and must not be used in a driver for any reason and
the people suggestion to use it nevertheless just because it would
make technical sense are clearly 'non desirable inferior lifeforms to
stupid for this universe which Must Be Flamed Out Of Existence'
(Yes, this was me. The IPX4XX-driver I have written for the 2nd
generation [if ever] hardware of my employer still uses clz directly).
• 05-29-2008, 05:41 PM
unix
Re: fastest transition from bitmask to the position of it's lowestSET bit
Lars Uffmann wrote:[color=blue]
> I want to combine a bitmask M with a byte-aligned variable V to extract
> a (not necessarily) byte-aligned parameter P (from a network protocol).
>
> The bitmask of the same bitsize as V has the (successive) bits set to 1,
> where the parameter is stored, and the rest to 0.
>
> So basically I wanna apply this math:
>
> P = (V AND M) SHR O
>
> where AND is a bitwise AND, SHR is bitwise shift-right and O is the
> order of the lowest set bit in the bitmask.[/color]

The method that Jason presented, using division to do the shift, is
great if you have reasonably fast integer division on your system. If
you don't then you may want to consider other methods.

Most C libraries have an "ffs" function that "Finds the First Set" bit
in a value. (I have a feeling that it's non-standard but common.) gcc
has a builtin for this that expands to an appropriate single machine
instruction on platforms that have one (and many do). So you can write:

P = (V & M) >> (ffs(M)-1)

Try "man ffs" for details. Note the need for the -1.

But: unless you're sure that you need the extra performance, stick with
the obvious loop!

Phil.
• 05-30-2008, 07:08 AM
unix
Re: fastest transition from bitmask to the position of it's lowestSET bit
[email]jason.cipriani@gmail.com[/email] wrote:[color=blue]
> It should be apparent but I forgot to mention you'll have to handle it
> as a special case if M==0.[/color]
You never know - I might have slipped and missed that, but I'm not gonna
handle parameters that do not exist (i.e. bitmask = 0x0) :) - thanks for
mentioning anyways.
[color=blue]
> Also note that if you can guarantee that your mask bits are contiguous[/color]
Luckily can :)
[color=blue]
> and you want to avoid integer division for any reason you can substitute
> (M & (~M << 1)) for (M / -M)[/color]

Erm - I gather that ~M is the 2-complement of M, but then you probably meant
(M & (~M + 1)) for (M & -M)
?

For your original suggestion doesn't contain M/-M, unless I am just
unaware of the binary operator rules and the denominator of V&M/(M&-M)
can be picked apart somehow :)

Best Regards,

Lars
• 05-30-2008, 07:13 AM
unix
Re: fastest transition from bitmask to the position of it's lowestSET bit
Eric Sosman wrote:[color=blue]
> Oooh! See Jason Cipriani's gimmick; it's so cool
> it assuages my shame in being refuted.[/color]

:) Thanks for your time anyways, though I had the idea with the loop
myself - I definitely wanted to use integer arithmetic, for that is a
set amount (in this case 4) of cpu cycles for the calculation. In a loop
I would have had to do at least a binary search tree which would have -
as you demonstrated - involved LOTS more code AND have been slower :)

my resulting function now looks like this:

long extractParam (long *buf, long mask)
{
}

It really is *that* easy.

Cheers!

Lars

PS: I'm not handling mask = 0 on purpose, the functions domain just
doesn't include mask = 0 (undefined).
• 05-30-2008, 07:42 AM
unix
Re: fastest transition from bitmask to the position of it's lowest SET bit
Lars Uffmann <aral@nurfuerspam.de> writes:
[color=blue]
> [email]jason.cipriani@gmail.com[/email] wrote:[color=green]
>> It should be apparent but I forgot to mention you'll have to handle it
>> as a special case if M==0.[/color]
> You never know - I might have slipped and missed that, but I'm not
> gonna handle parameters that do not exist (i.e. bitmask = 0x0) :) -
> thanks for mentioning anyways.
>[color=green]
>> Also note that if you can guarantee that your mask bits are contiguous[/color]
> Luckily can :)
>[color=green]
>> and you want to avoid integer division for any reason you can substitute
>> (M & (~M << 1)) for (M / -M)[/color]
>
> Erm - I gather that ~M is the 2-complement of M, but then you probably meant[/color]

~M is the bitwise inverse of M.

--
Måns Rullgård
[email]mans@mansr.com[/email]
• 05-30-2008, 07:57 AM
unix
Re: fastest transition from bitmask to the position of it's lowestSET bit
Måns Rullgård wrote:[color=blue]
> ~M is the bitwise inverse of M.[/color]

Sorry - of course. ~M = (M XOR -1)
I thought the 2-complement was the same, as opposed to -M :)

Just checked wikipedia, cleared up what was still unclear. Thanks for
pointing that out! My formula correction still applies though, as I was
thinking 2-complement = bitwise inverse :) Whatever made me come to that
assumption...

Best Regards,

Lars

• 05-30-2008, 08:27 AM
unix
Re: fastest transition from bitmask to the position of it's lowestSET bit
Lars Uffmann wrote:[color=blue]
> my resulting function now looks like this:
>
> long extractParam (long *buf, long mask)
> {
> }[/color]

Actually I think I prefer

long extractParam (void *buf, long mask)
{
}

That way the function doesn't depend on long-aligned values. Can someone
tell me how a typecast to long* behaves on a
non-sizeof(long)-byte-aligned address if the system hardware doesn't
support this? Or is it always possible to use non-aligned variables with
C compilers if you accept slower processing due to software emulation?

Thanks!

Lars
• 05-30-2008, 12:18 PM
unix
Re: fastest transition from bitmask to the position of it's lowestSET bit
Phil wrote:
[color=blue]
> Most C libraries have an "ffs" function that "Finds the First Set" bit
> in a value. (I have a feeling that it's non-standard but common.)[/color]

It's in POSIX/SUS, but is part of the XSI option (i.e. required for
UNIX conformance, but optional for POSIX conformance).

--
Geoff Clare <netnews@gclare.org.uk>
• 05-30-2008, 12:36 PM
unix
Re: fastest transition from bitmask to the position of it's lowestSET bit
Lars Uffmann wrote:[color=blue]
> Eric Sosman wrote:[color=green]
>> Oooh! See Jason Cipriani's gimmick; it's so cool
>> it assuages my shame in being refuted.[/color]
>
> :) Thanks for your time anyways, though I had the idea with the loop
> myself - I definitely wanted to use integer arithmetic, for that is a
> set amount (in this case 4) of cpu cycles for the calculation. In a loop
> I would have had to do at least a binary search tree which would have -
> as you demonstrated - involved LOTS more code AND have been slower :)
>
>
> my resulting function now looks like this:
>
> long extractParam (long *buf, long mask)
> {
> }
>
> It really is *that* easy.[/color]

I'd suggest you use `unsigned long' rather than plain
`long', to avoid problems with negative numbers.

--
Eric Sosman
[email]esosman@ieee-dot-org.inva[/email]lid
• 05-30-2008, 12:40 PM
unix
Re: fastest transition from bitmask to the position of it's lowest SET bit
Lars Uffmann <aral@nurfuerspam.de> writes:
[color=blue]
> Måns Rullgård wrote:[color=green]
>> ~M is the bitwise inverse of M.[/color]
>
> Sorry - of course. ~M = (M XOR -1)[/color]

That is only true for two's complement representation of negative
numbers.
[color=blue]
> I thought the 2-complement was the same, as opposed to -M :)
>
> Just checked wikipedia, cleared up what was still unclear.[/color]

The wikipedia coverage of C related matters is atrocious. Many
statements made there are outright lies, and much of the remainder is
inaccurate or incomplete. If you want to learn C programming, get a
real book on the subject and a copy of the C99 standard.

--
Måns Rullgård
[email]mans@mansr.com[/email]
• 05-30-2008, 12:42 PM
unix
Re: fastest transition from bitmask to the position of it's lowest SET bit
Lars Uffmann <aral@nurfuerspam.de> writes:
[color=blue]
> Lars Uffmann wrote:[color=green]
>> my resulting function now looks like this:
>> long extractParam (long *buf, long mask)
>> {
>> }[/color]
>
> Actually I think I prefer
>
> long extractParam (void *buf, long mask)
> {
> return (*(long *)buf & mask) / (mask & -mask);
> }
>
> That way the function doesn't depend on long-aligned values. Can
> someone tell me how a typecast to long* behaves on a
> non-sizeof(long)-byte-aligned address if the system hardware doesn't
> support this?[/color]

Your programme with crash with a SIGBUS. On machines where this
doesn't happen, it will run slower, sometimes 100 times slower or
worse.
[color=blue]
> Or is it always possible to use non-aligned variables with C
> compilers if you accept slower processing due to software emulation?[/color]

No.

Again, I suggest getting a good book on C programming and a copy of
the standard.

--
Måns Rullgård
[email]mans@mansr.com[/email]
• 05-30-2008, 12:45 PM
unix
Re: fastest transition from bitmask to the position of it's lowestSET bit
Lars Uffmann wrote:[color=blue]
> Lars Uffmann wrote:[color=green]
>> my resulting function now looks like this:
>>
>> long extractParam (long *buf, long mask)
>> {
>> }[/color]
>
> Actually I think I prefer
>
> long extractParam (void *buf, long mask)
> {
> return (*(long *)buf & mask) / (mask & -mask);
> }
>
> That way the function doesn't depend on long-aligned values.[/color]

... except that it does! If `buf' points to a mis-aligned
address, converting it from `void*' to `long*' doesn't magically
fix the alignment.

[color=blue]
> Can someone
> tell me how a typecast to long* behaves on a
> non-sizeof(long)-byte-aligned address if the system hardware doesn't
> support this?[/color]

I think you're asking to be told how to do X, starting
with the assumption that X is impossible ...
[color=blue]
> Or is it always possible to use non-aligned variables with
> C compilers if you accept slower processing due to software emulation?[/color]

The effect of dereferencing an improperly-aligned pointer
is undefined; in theory, anything at all could happen. In
practice, I've encountered three behaviors:

- Hardware trap leading to program termination (most
common)

- Hardware trap leading to slow, many-instruction byte-
by-byte fixup code (less common)

- Hardware pretends the offending low-order address bits
were zero, and silently accesses "nearby" memory rather
than the intended memory (I've seen this once only)

I imagine you'd be unhappy with all three of these ...

--
Eric Sosman
[email]esosman@ieee-dot-org.inva[/email]lid
• 05-30-2008, 01:01 PM
unix
Re: fastest transition from bitmask to the position of it's lowest SET bit
Eric Sosman <esosman@ieee-dot-org.invalid> writes:
[color=blue]
> Lars Uffmann wrote:[color=green]
>> Lars Uffmann wrote:[color=darkred]
>>> my resulting function now looks like this:
>>>
>>> long extractParam (long *buf, long mask)
>>> {
>>> }[/color]
>> Actually I think I prefer
>> long extractParam (void *buf, long mask)
>> {
>> return (*(long *)buf & mask) / (mask & -mask);
>> }
>> That way the function doesn't depend on long-aligned values.[/color]
>
> ... except that it does! If `buf' points to a mis-aligned
> address, converting it from `void*' to `long*' doesn't magically
> fix the alignment.
>[color=green]
>> Can someone tell me how a typecast to long* behaves on a
>> non-sizeof(long)-byte-aligned address if the system hardware doesn't
>> support this?[/color]
>
> I think you're asking to be told how to do X, starting
> with the assumption that X is impossible ...
>[color=green]
>> Or is it always possible to use non-aligned variables with C
>> compilers if you accept slower processing due to software emulation?[/color]
>
> The effect of dereferencing an improperly-aligned pointer
> is undefined; in theory, anything at all could happen. In
> practice, I've encountered three behaviors:
>
> - Hardware trap leading to program termination (most
> common)
>
> - Hardware trap leading to slow, many-instruction byte-
> by-byte fixup code (less common)
>
> - Hardware pretends the offending low-order address bits
> were zero, and silently accesses "nearby" memory rather
> than the intended memory (I've seen this once only)[/color]

You missed some:

- Hardware silently reads a rounded address, and rotates the value
read according the least significant bits of the address. Some ARM
variants do this.

- Hardware does something unspecified, usually requiring multiple
reads, returning the correct value, only 2-3 times slower than an
aligned access. Intel x86 does this.

--
Måns Rullgård
[email]mans@mansr.com[/email]
Show 40 post(s) from this thread on one page
Page 1 of 2 1 2 Last