# fastest transition from bitmask to the position of it's lowest SETbit - Unix

This is a discussion on fastest transition from bitmask to the position of it's lowest SETbit - Unix ; I want to combine a bitmask M with a byte-aligned variable V to extract a (not necessarily) byte-aligned parameter P (from a network protocol). The bitmask of the same bitsize as V has the (successive) bits set to 1, where ...

# Thread: fastest transition from bitmask to the position of it's lowest SETbit

1. ## fastest transition from bitmask to the position of it's lowest SETbit

I want to combine a bitmask M with a byte-aligned variable V to extract
a (not necessarily) byte-aligned parameter P (from a network protocol).

The bitmask of the same bitsize as V has the (successive) bits set to 1,
where the parameter is stored, and the rest to 0.

So basically I wanna apply this math:

P = (V AND M) SHR O

where AND is a bitwise AND, SHR is bitwise shift-right and O is the
order of the lowest set bit in the bitmask.

For example:
V = b0110101000110010 (2 bytes)
M = b0000011111110000 (2 bytes)
O = ln (16) / ln (2) = 4

P = b0110101000110010
AND b0000011111110000
SHR 4
= b0000001000110000 SHR 4 = b0000000000100011 = b100011 (= d35)

Now I am quite at a loss to determine that order O in a single
mathematical operation other than the logarithm.

I could of course do this operation in a loop
while (M AND 1) == 0
P = P SHR 1
M = M SHR 1
end while

But this would definitely be slower if the parameter is a very long one.

So - any suggestions to determine the (binary) order of the lowest set
bit of a variable?

Lars

2. ## Re: fastest transition from bitmask to the position of it's lowestSET bit

On May 29, 9:40 am, Lars Uffmann wrote:
> For example:
> V = b0110101000110010 (2 bytes)
> M = b0000011111110000 (2 bytes)
> O = ln (16) / ln (2) = 4
>
> P = b0110101000110010
> AND b0000011111110000
> SHR 4
> = b0000001000110000 SHR 4 = b0000000000100011 = b100011 (= d35)
>
> Now I am quite at a loss to determine that order O in a single
> mathematical operation other than the logarithm.

You want (V & M) / (M & -M):

V = b0110101000110010
M = b0000011111110000
-M = b1111100000010000 <-- 2's complement
M&-M = b0000000000010000 <-- lowest bit on
/M&-M = b0000000000100011 <-- divide is like shr low_bit_pos

Jason

3. ## Re: fastest transition from bitmask to the position of it's lowestSET bit

On May 29, 10:33 am, "jason.cipri...@gmail.com"
wrote:
> M&-M = b0000000000010000 <-- lowest bit on

It should be apparent but I forgot to mention you'll have to handle it
as a special case if M==0. Also note that if you can guarantee that
your mask bits are contiguous and you want to avoid integer division
for any reason you can substitute (M & (~M << 1)) for (M / -M), and
not have to handle M==0 separately as well.

Jason

4. ## Re: fastest transition from bitmask to the position of it's lowestSET bit

On May 29, 10:53 am, "jason.cipri...@gmail.com"
wrote:
> for any reason you can substitute (M & (~M << 1)) for (M / -M), and
> not have to handle M==0 separately as well.

Ignore everything after ", and" in that sentence, sorry.

5. ## Re: fastest transition from bitmask to the position of it's lowestSET bit

Lars Uffmann wrote:
> I want to combine a bitmask M with a byte-aligned variable V to extract
> a (not necessarily) byte-aligned parameter P (from a network protocol).
>
> The bitmask of the same bitsize as V has the (successive) bits set to 1,
> where the parameter is stored, and the rest to 0.
>
> So basically I wanna apply this math:
>
> P = (V AND M) SHR O
>
> where AND is a bitwise AND, SHR is bitwise shift-right and O is the
> order of the lowest set bit in the bitmask.
>
> For example:
> V = b0110101000110010 (2 bytes)
> M = b0000011111110000 (2 bytes)
> O = ln (16) / ln (2) = 4
>
> P = b0110101000110010
> AND b0000011111110000
> SHR 4
> = b0000001000110000 SHR 4 = b0000000000100011 = b100011 (= d35)
>
> Now I am quite at a loss to determine that order O in a single
> mathematical operation other than the logarithm.
>
> I could of course do this operation in a loop
> while (M AND 1) == 0
> P = P SHR 1
> M = M SHR 1
> end while
>
> But this would definitely be slower if the parameter is a very long one.
>
> So - any suggestions to determine the (binary) order of the lowest set
> bit of a variable?

In most situations the masks are predetermined constants:
You know in advance that you're interested in bits 4-10 of a
word. If that's your situation, then I'd suggest turning the

#define HIBIT 10
#define LOBIT 4
P = (V >> LOBIT) & ~(-1u << (HIBIT - LOBIT));

(You could add special handling for HIBIT==31 and/or LOBIT==0.)

But if the masks actually do arrive "out of the blue" and
you need to deduce LOBIT, a loop is about the best you can do.
You might speed it up somewhat by biting off more than one
bit at a time:

for (m = M; (m & 0xF) == 0; m >>= 4)
p >>= 4;
for ( ; (m & 1) == 0; m >>= 1)
p >>= 1;

You could pursue this strategy further:

p = V & M;
m = M;
if ((m & 0xFFFF) == 0) {
m >>= 16;
p >>= 16;
}
if ((m & 0xFF) == 0) {
m >>= 8;
p >>= 8;
}
if ((m & 0xF) == 0) {
...

You could even resort to a full-scale binary search:

if ((M & 0x0000FFFF) == 0) {
if ((M & 0x00FF0000) == 0) {
if ((M & 0x0F000000) == 0) {
if ((M & 0x30000000) == 0) {
if ((M & 0x40000000) == 0)
p >>= 31;
else
p >>= 30;
}
else {
if ((M & 0x10000000) == 0)
p >>= 29;
else
p >>= 28;
}
}
else ...

It's not a given, though, that any of these will be faster
than a simple loop. Measure, measure, measure.

--
Eric.Sosman@sun.com

6. ## Re: fastest transition from bitmask to the position of it's lowestSET bit

Hey Jason,

jason.cipriani@gmail.com wrote:
> You want (V & M) / (M & -M):

Thank you very much! It's embarrassing but it took me like... 40 minutes
to fully understand the 2's complement in binary, why it looks exactly
the way you described and (M & -M) will get me the lowest bit on Your
solution is quite a bit easier than what I had figured for the lowest
bit on:
((M << 1) XOR M) AND M

But the crucial part was that of course - I simply have to divide my
masked value by the value of the lowest bit on. Thanks for pointing that
out, even though I feel like I cheated on a simple riddle by not solving
it myself

Best Regards,

Lars

7. ## Re: fastest transition from bitmask to the position of it's lowestSET bit

Eric Sosman wrote:
> [...]
> But if the masks actually do arrive "out of the blue" and
> you need to deduce LOBIT, a loop is about the best you can do.
> [...]

Oooh! See Jason Cipriani's gimmick; it's so cool
it assuages my shame in being refuted.

--
Eric.Sosman@sun.com

8. ## Re: fastest transition from bitmask to the position of it's lowest SET bit

Eric Sosman writes:
> Eric Sosman wrote:
>> [...]
>> But if the masks actually do arrive "out of the blue" and
>> you need to deduce LOBIT, a loop is about the best you can do.
>> [...]

>
> Oooh! See Jason Cipriani's gimmick; it's so cool
> it assuages my shame in being refuted.

Hmm ... the first time I have seen this particular expression is
in the ARM-specific part of Linux 2.6 (include/arm/bitops.h):

#define fls(x) \
( __builtin_constant_p(x) ? constant_fls(x) : \
({ int __r; asm("clz\t%0, %1" : "=r"(__r) : "r"(x) : "cc"); 32-__r; }) )
#define ffs(x) ({ unsigned long __t = (x); fls(__t & -__t); })
#define __ffs(x) (ffs(x) - 1)

where it (IIRC) appeared shortly after a somewhat heated discussion
with R. King regarding wether it should be allowed that drivers for
ARM SoC-devices (in particular, C. Hohnstaedt's 2.6 driver for the
NPEs being part of Intel IXP4XX-chips) make use of ARM-assembly
instructions insofar they are particularly well-suited to solving a
specific problem[*].
[*] Insofar there is any doubt regarding this: Mr King's position on
this is/ was, of course, that assembly is arcane, cryptic and
just disgusting and must not be used in a driver for any reason and
the people suggestion to use it nevertheless just because it would
make technical sense are clearly 'non desirable inferior lifeforms to
stupid for this universe which Must Be Flamed Out Of Existence'
(Yes, this was me. The IPX4XX-driver I have written for the 2nd
generation [if ever] hardware of my employer still uses clz directly).

9. ## Re: fastest transition from bitmask to the position of it's lowestSET bit

Lars Uffmann wrote:
> I want to combine a bitmask M with a byte-aligned variable V to extract
> a (not necessarily) byte-aligned parameter P (from a network protocol).
>
> The bitmask of the same bitsize as V has the (successive) bits set to 1,
> where the parameter is stored, and the rest to 0.
>
> So basically I wanna apply this math:
>
> P = (V AND M) SHR O
>
> where AND is a bitwise AND, SHR is bitwise shift-right and O is the
> order of the lowest set bit in the bitmask.

The method that Jason presented, using division to do the shift, is
great if you have reasonably fast integer division on your system. If
you don't then you may want to consider other methods.

Most C libraries have an "ffs" function that "Finds the First Set" bit
in a value. (I have a feeling that it's non-standard but common.) gcc
has a builtin for this that expands to an appropriate single machine
instruction on platforms that have one (and many do). So you can write:

P = (V & M) >> (ffs(M)-1)

Try "man ffs" for details. Note the need for the -1.

But: unless you're sure that you need the extra performance, stick with
the obvious loop!

Phil.

10. ## Re: fastest transition from bitmask to the position of it's lowestSET bit

jason.cipriani@gmail.com wrote:
> It should be apparent but I forgot to mention you'll have to handle it
> as a special case if M==0.

You never know - I might have slipped and missed that, but I'm not gonna
handle parameters that do not exist (i.e. bitmask = 0x0) - thanks for
mentioning anyways.

> Also note that if you can guarantee that your mask bits are contiguous

Luckily can

> and you want to avoid integer division for any reason you can substitute
> (M & (~M << 1)) for (M / -M)

Erm - I gather that ~M is the 2-complement of M, but then you probably meant
(M & (~M + 1)) for (M & -M)
?

For your original suggestion doesn't contain M/-M, unless I am just
unaware of the binary operator rules and the denominator of V&M/(M&-M)
can be picked apart somehow

Best Regards,

Lars

11. ## Re: fastest transition from bitmask to the position of it's lowestSET bit

Eric Sosman wrote:
> Oooh! See Jason Cipriani's gimmick; it's so cool
> it assuages my shame in being refuted.

Thanks for your time anyways, though I had the idea with the loop
myself - I definitely wanted to use integer arithmetic, for that is a
set amount (in this case 4) of cpu cycles for the calculation. In a loop
I would have had to do at least a binary search tree which would have -
as you demonstrated - involved LOTS more code AND have been slower

my resulting function now looks like this:

long extractParam (long *buf, long mask)
{
}

It really is *that* easy.

Cheers!

Lars

PS: I'm not handling mask = 0 on purpose, the functions domain just
doesn't include mask = 0 (undefined).

12. ## Re: fastest transition from bitmask to the position of it's lowest SET bit

Lars Uffmann writes:

> jason.cipriani@gmail.com wrote:
>> It should be apparent but I forgot to mention you'll have to handle it
>> as a special case if M==0.

> You never know - I might have slipped and missed that, but I'm not
> gonna handle parameters that do not exist (i.e. bitmask = 0x0) -
> thanks for mentioning anyways.
>
>> Also note that if you can guarantee that your mask bits are contiguous

> Luckily can
>
>> and you want to avoid integer division for any reason you can substitute
>> (M & (~M << 1)) for (M / -M)

>
> Erm - I gather that ~M is the 2-complement of M, but then you probably meant

~M is the bitwise inverse of M.

--
Måns Rullgård
mans@mansr.com

13. ## Re: fastest transition from bitmask to the position of it's lowestSET bit

Måns Rullgård wrote:
> ~M is the bitwise inverse of M.

Sorry - of course. ~M = (M XOR -1)
I thought the 2-complement was the same, as opposed to -M

Just checked wikipedia, cleared up what was still unclear. Thanks for
pointing that out! My formula correction still applies though, as I was
thinking 2-complement = bitwise inverse Whatever made me come to that
assumption...

Best Regards,

Lars

14. ## Re: fastest transition from bitmask to the position of it's lowestSET bit

Lars Uffmann wrote:
> my resulting function now looks like this:
>
> long extractParam (long *buf, long mask)
> {
> }

Actually I think I prefer

long extractParam (void *buf, long mask)
{
}

That way the function doesn't depend on long-aligned values. Can someone
tell me how a typecast to long* behaves on a
non-sizeof(long)-byte-aligned address if the system hardware doesn't
support this? Or is it always possible to use non-aligned variables with
C compilers if you accept slower processing due to software emulation?

Thanks!

Lars

15. ## Re: fastest transition from bitmask to the position of it's lowestSET bit

Phil wrote:

> Most C libraries have an "ffs" function that "Finds the First Set" bit
> in a value. (I have a feeling that it's non-standard but common.)

It's in POSIX/SUS, but is part of the XSI option (i.e. required for
UNIX conformance, but optional for POSIX conformance).

--
Geoff Clare

16. ## Re: fastest transition from bitmask to the position of it's lowestSET bit

Lars Uffmann wrote:
> Eric Sosman wrote:
>> Oooh! See Jason Cipriani's gimmick; it's so cool
>> it assuages my shame in being refuted.

>
> Thanks for your time anyways, though I had the idea with the loop
> myself - I definitely wanted to use integer arithmetic, for that is a
> set amount (in this case 4) of cpu cycles for the calculation. In a loop
> I would have had to do at least a binary search tree which would have -
> as you demonstrated - involved LOTS more code AND have been slower
>
>
> my resulting function now looks like this:
>
> long extractParam (long *buf, long mask)
> {
> }
>
> It really is *that* easy.

I'd suggest you use `unsigned long' rather than plain
`long', to avoid problems with negative numbers.

--
Eric Sosman
esosman@ieee-dot-org.invalid

17. ## Re: fastest transition from bitmask to the position of it's lowest SET bit

Lars Uffmann writes:

> Måns Rullgård wrote:
>> ~M is the bitwise inverse of M.

>
> Sorry - of course. ~M = (M XOR -1)

That is only true for two's complement representation of negative
numbers.

> I thought the 2-complement was the same, as opposed to -M
>
> Just checked wikipedia, cleared up what was still unclear.

The wikipedia coverage of C related matters is atrocious. Many
statements made there are outright lies, and much of the remainder is
inaccurate or incomplete. If you want to learn C programming, get a
real book on the subject and a copy of the C99 standard.

--
Måns Rullgård
mans@mansr.com

18. ## Re: fastest transition from bitmask to the position of it's lowest SET bit

Lars Uffmann writes:

> Lars Uffmann wrote:
>> my resulting function now looks like this:
>> long extractParam (long *buf, long mask)
>> {
>> }

>
> Actually I think I prefer
>
> long extractParam (void *buf, long mask)
> {
> }
>
> That way the function doesn't depend on long-aligned values. Can
> someone tell me how a typecast to long* behaves on a
> non-sizeof(long)-byte-aligned address if the system hardware doesn't
> support this?

Your programme with crash with a SIGBUS. On machines where this
doesn't happen, it will run slower, sometimes 100 times slower or
worse.

> Or is it always possible to use non-aligned variables with C
> compilers if you accept slower processing due to software emulation?

No.

Again, I suggest getting a good book on C programming and a copy of
the standard.

--
Måns Rullgård
mans@mansr.com

19. ## Re: fastest transition from bitmask to the position of it's lowestSET bit

Lars Uffmann wrote:
> Lars Uffmann wrote:
>> my resulting function now looks like this:
>>
>> long extractParam (long *buf, long mask)
>> {
>> }

>
> Actually I think I prefer
>
> long extractParam (void *buf, long mask)
> {
> }
>
> That way the function doesn't depend on long-aligned values.

... except that it does! If `buf' points to a mis-aligned
address, converting it from `void*' to `long*' doesn't magically
fix the alignment.

> Can someone
> tell me how a typecast to long* behaves on a
> non-sizeof(long)-byte-aligned address if the system hardware doesn't
> support this?

I think you're asking to be told how to do X, starting
with the assumption that X is impossible ...

> Or is it always possible to use non-aligned variables with
> C compilers if you accept slower processing due to software emulation?

The effect of dereferencing an improperly-aligned pointer
is undefined; in theory, anything at all could happen. In
practice, I've encountered three behaviors:

- Hardware trap leading to program termination (most
common)

- Hardware trap leading to slow, many-instruction byte-
by-byte fixup code (less common)

- Hardware pretends the offending low-order address bits
were zero, and silently accesses "nearby" memory rather
than the intended memory (I've seen this once only)

I imagine you'd be unhappy with all three of these ...

--
Eric Sosman
esosman@ieee-dot-org.invalid

20. ## Re: fastest transition from bitmask to the position of it's lowest SET bit

Eric Sosman writes:

> Lars Uffmann wrote:
>> Lars Uffmann wrote:
>>> my resulting function now looks like this:
>>>
>>> long extractParam (long *buf, long mask)
>>> {
>>> }

>> Actually I think I prefer
>> long extractParam (void *buf, long mask)
>> {
>> }
>> That way the function doesn't depend on long-aligned values.

>
> ... except that it does! If `buf' points to a mis-aligned
> address, converting it from `void*' to `long*' doesn't magically
> fix the alignment.
>
>> Can someone tell me how a typecast to long* behaves on a
>> non-sizeof(long)-byte-aligned address if the system hardware doesn't
>> support this?

>
> I think you're asking to be told how to do X, starting
> with the assumption that X is impossible ...
>
>> Or is it always possible to use non-aligned variables with C
>> compilers if you accept slower processing due to software emulation?

>
> The effect of dereferencing an improperly-aligned pointer
> is undefined; in theory, anything at all could happen. In
> practice, I've encountered three behaviors:
>
> - Hardware trap leading to program termination (most
> common)
>
> - Hardware trap leading to slow, many-instruction byte-
> by-byte fixup code (less common)
>
> - Hardware pretends the offending low-order address bits
> were zero, and silently accesses "nearby" memory rather
> than the intended memory (I've seen this once only)

You missed some:

variants do this.

- Hardware does something unspecified, usually requiring multiple
reads, returning the correct value, only 2-3 times slower than an
aligned access. Intel x86 does this.

--
Måns Rullgård
mans@mansr.com