fastest transition from bitmask to the position of it's lowest SETbit

I want to combine a bitmask M with a byte-aligned variable V to extract

a (not necessarily) byte-aligned parameter P (from a network protocol).

The bitmask of the same bitsize as V has the (successive) bits set to 1,

where the parameter is stored, and the rest to 0.

So basically I wanna apply this math:

P = (V AND M) SHR O

where AND is a bitwise AND, SHR is bitwise shift-right and O is the

order of the lowest set bit in the bitmask.

For example:

V = b0110101000110010 (2 bytes)

M = b0000011111110000 (2 bytes)

O = ln (16) / ln (2) = 4

P = b0110101000110010

AND b0000011111110000

SHR 4

= b0000001000110000 SHR 4 = b0000000000100011 = b100011 (= d35)

Now I am quite at a loss to determine that order O in a single

mathematical operation other than the logarithm.

I could of course do this operation in a loop

while (M AND 1) == 0

P = P SHR 1

M = M SHR 1

end while

But this would definitely be slower if the parameter is a very long one.

So - any suggestions to determine the (binary) order of the lowest set

bit of a variable?

Thank you in advance!

Lars

Re: fastest transition from bitmask to the position of it's lowestSET bit

On May 29, 9:40 am, Lars Uffmann <a...@nurfuerspam.de> wrote:[color=blue]

> For example:

> V = b0110101000110010 (2 bytes)

> M = b0000011111110000 (2 bytes)

> O = ln (16) / ln (2) = 4

>

> P = b0110101000110010

> AND b0000011111110000

> SHR 4

> = b0000001000110000 SHR 4 = b0000000000100011 = b100011 (= d35)

>

> Now I am quite at a loss to determine that order O in a single

> mathematical operation other than the logarithm.[/color]

You want (V & M) / (M & -M):

V = b0110101000110010

M = b0000011111110000

-M = b1111100000010000 <-- 2's complement

M&-M = b0000000000010000 <-- lowest bit on

V&M = b0000001000110000 <-- masked

/M&-M = b0000000000100011 <-- divide is like shr low_bit_pos

Jason

Re: fastest transition from bitmask to the position of it's lowestSET bit

On May 29, 10:33 am, "jason.cipri...@gmail.com"

<jason.cipri...@gmail.com> wrote:[color=blue]

> M&-M = b0000000000010000 <-- lowest bit on[/color]

It should be apparent but I forgot to mention you'll have to handle it

as a special case if M==0. Also note that if you can guarantee that

your mask bits are contiguous and you want to avoid integer division

for any reason you can substitute (M & (~M << 1)) for (M / -M), and

not have to handle M==0 separately as well.

Jason

Re: fastest transition from bitmask to the position of it's lowestSET bit

On May 29, 10:53 am, "jason.cipri...@gmail.com"

<jason.cipri...@gmail.com> wrote:[color=blue]

> for any reason you can substitute (M & (~M << 1)) for (M / -M), and

> not have to handle M==0 separately as well.[/color]

Ignore everything after ", and" in that sentence, sorry.

Re: fastest transition from bitmask to the position of it's lowestSET bit

Lars Uffmann wrote:[color=blue]

> I want to combine a bitmask M with a byte-aligned variable V to extract

> a (not necessarily) byte-aligned parameter P (from a network protocol).

>

> The bitmask of the same bitsize as V has the (successive) bits set to 1,

> where the parameter is stored, and the rest to 0.

>

> So basically I wanna apply this math:

>

> P = (V AND M) SHR O

>

> where AND is a bitwise AND, SHR is bitwise shift-right and O is the

> order of the lowest set bit in the bitmask.

>

> For example:

> V = b0110101000110010 (2 bytes)

> M = b0000011111110000 (2 bytes)

> O = ln (16) / ln (2) = 4

>

> P = b0110101000110010

> AND b0000011111110000

> SHR 4

> = b0000001000110000 SHR 4 = b0000000000100011 = b100011 (= d35)

>

> Now I am quite at a loss to determine that order O in a single

> mathematical operation other than the logarithm.

>

> I could of course do this operation in a loop

> while (M AND 1) == 0

> P = P SHR 1

> M = M SHR 1

> end while

>

> But this would definitely be slower if the parameter is a very long one.

>

> So - any suggestions to determine the (binary) order of the lowest set

> bit of a variable?[/color]

In most situations the masks are predetermined constants:

You know in advance that you're interested in bits 4-10 of a

word. If that's your situation, then I'd suggest turning the

problem around: Instead of starting with the mask and trying

to derive the 4, start with the 4 and 10:

#define HIBIT 10

#define LOBIT 4

P = (V >> LOBIT) & ~(-1u << (HIBIT - LOBIT));

(You could add special handling for HIBIT==31 and/or LOBIT==0.)

But if the masks actually do arrive "out of the blue" and

you need to deduce LOBIT, a loop is about the best you can do.

You might speed it up somewhat by biting off more than one

bit at a time:

for (m = M; (m & 0xF) == 0; m >>= 4)

p >>= 4;

for ( ; (m & 1) == 0; m >>= 1)

p >>= 1;

You could pursue this strategy further:

p = V & M;

m = M;

if ((m & 0xFFFF) == 0) {

m >>= 16;

p >>= 16;

}

if ((m & 0xFF) == 0) {

m >>= 8;

p >>= 8;

}

if ((m & 0xF) == 0) {

...

You could even resort to a full-scale binary search:

if ((M & 0x0000FFFF) == 0) {

if ((M & 0x00FF0000) == 0) {

if ((M & 0x0F000000) == 0) {

if ((M & 0x30000000) == 0) {

if ((M & 0x40000000) == 0)

p >>= 31;

else

p >>= 30;

}

else {

if ((M & 0x10000000) == 0)

p >>= 29;

else

p >>= 28;

}

}

else ...

It's not a given, though, that any of these will be faster

than a simple loop. Measure, measure, measure.

--

[email]Eric.Sosman@sun.com[/email]

Re: fastest transition from bitmask to the position of it's lowestSET bit

Hey Jason,

[email]jason.cipriani@gmail.com[/email] wrote:[color=blue]

> You want (V & M) / (M & -M):[/color]

Thank you very much! It's embarrassing but it took me like... 40 minutes

to fully understand the 2's complement in binary, why it looks exactly

the way you described and (M & -M) will get me the lowest bit on :) Your

solution is quite a bit easier than what I had figured for the lowest

bit on:

((M << 1) XOR M) AND M

But the crucial part was that of course - I simply have to divide my

masked value by the value of the lowest bit on. Thanks for pointing that

out, even though I feel like I cheated on a simple riddle by not solving

it myself :)

Best Regards,

Lars

Re: fastest transition from bitmask to the position of it's lowestSET bit

Eric Sosman wrote:[color=blue]

> [...]

> But if the masks actually do arrive "out of the blue" and

> you need to deduce LOBIT, a loop is about the best you can do.

> [...][/color]

Oooh! See Jason Cipriani's gimmick; it's so cool

it assuages my shame in being refuted.

--

[email]Eric.Sosman@sun.com[/email]

Re: fastest transition from bitmask to the position of it's lowest SET bit

Eric Sosman <Eric.Sosman@sun.com> writes:[color=blue]

> Eric Sosman wrote:[color=green]

>> [...]

>> But if the masks actually do arrive "out of the blue" and

>> you need to deduce LOBIT, a loop is about the best you can do.

>> [...][/color]

>

> Oooh! See Jason Cipriani's gimmick; it's so cool

> it assuages my shame in being refuted.[/color]

Hmm ... the first time I have seen this particular expression is

in the ARM-specific part of Linux 2.6 (include/arm/bitops.h):

#define fls(x) \

( __builtin_constant_p(x) ? constant_fls(x) : \

({ int __r; asm("clz\t%0, %1" : "=r"(__r) : "r"(x) : "cc"); 32-__r; }) )

#define ffs(x) ({ unsigned long __t = (x); fls(__t & -__t); })

#define __ffs(x) (ffs(x) - 1)

where it (IIRC) appeared shortly after a somewhat heated discussion

with R. King regarding wether it should be allowed that drivers for

ARM SoC-devices (in particular, C. Hohnstaedt's 2.6 driver for the

NPEs being part of Intel IXP4XX-chips) make use of ARM-assembly

instructions insofar they are particularly well-suited to solving a

specific problem[*].

[*] Insofar there is any doubt regarding this: Mr King's position on

this is/ was, of course, that assembly is arcane, cryptic and

just disgusting and must not be used in a driver for any reason and

the people suggestion to use it nevertheless just because it would

make technical sense are clearly 'non desirable inferior lifeforms to

stupid for this universe which Must Be Flamed Out Of Existence'

(Yes, this was me. The IPX4XX-driver I have written for the 2nd

generation [if ever] hardware of my employer still uses clz directly).

Re: fastest transition from bitmask to the position of it's lowestSET bit

Lars Uffmann wrote:[color=blue]

> I want to combine a bitmask M with a byte-aligned variable V to extract

> a (not necessarily) byte-aligned parameter P (from a network protocol).

>

> The bitmask of the same bitsize as V has the (successive) bits set to 1,

> where the parameter is stored, and the rest to 0.

>

> So basically I wanna apply this math:

>

> P = (V AND M) SHR O

>

> where AND is a bitwise AND, SHR is bitwise shift-right and O is the

> order of the lowest set bit in the bitmask.[/color]

The method that Jason presented, using division to do the shift, is

great if you have reasonably fast integer division on your system. If

you don't then you may want to consider other methods.

Most C libraries have an "ffs" function that "Finds the First Set" bit

in a value. (I have a feeling that it's non-standard but common.) gcc

has a builtin for this that expands to an appropriate single machine

instruction on platforms that have one (and many do). So you can write:

P = (V & M) >> (ffs(M)-1)

Try "man ffs" for details. Note the need for the -1.

But: unless you're sure that you need the extra performance, stick with

the obvious loop!

Phil.

Re: fastest transition from bitmask to the position of it's lowestSET bit

[email]jason.cipriani@gmail.com[/email] wrote:[color=blue]

> It should be apparent but I forgot to mention you'll have to handle it

> as a special case if M==0.[/color]

You never know - I might have slipped and missed that, but I'm not gonna

handle parameters that do not exist (i.e. bitmask = 0x0) :) - thanks for

mentioning anyways.

[color=blue]

> Also note that if you can guarantee that your mask bits are contiguous[/color]

Luckily can :)

[color=blue]

> and you want to avoid integer division for any reason you can substitute

> (M & (~M << 1)) for (M / -M)[/color]

Erm - I gather that ~M is the 2-complement of M, but then you probably meant

(M & (~M + 1)) for (M & -M)

?

For your original suggestion doesn't contain M/-M, unless I am just

unaware of the binary operator rules and the denominator of V&M/(M&-M)

can be picked apart somehow :)

Best Regards,

Lars

Re: fastest transition from bitmask to the position of it's lowestSET bit

Eric Sosman wrote:[color=blue]

> Oooh! See Jason Cipriani's gimmick; it's so cool

> it assuages my shame in being refuted.[/color]

:) Thanks for your time anyways, though I had the idea with the loop

myself - I definitely wanted to use integer arithmetic, for that is a

set amount (in this case 4) of cpu cycles for the calculation. In a loop

I would have had to do at least a binary search tree which would have -

as you demonstrated - involved LOTS more code AND have been slower :)

my resulting function now looks like this:

long extractParam (long *buf, long mask)

{

return (*buf & mask) / (mask & -mask);

}

It really is *that* easy.

Cheers!

Lars

PS: I'm not handling mask = 0 on purpose, the functions domain just

doesn't include mask = 0 (undefined).

Re: fastest transition from bitmask to the position of it's lowest SET bit

Lars Uffmann <aral@nurfuerspam.de> writes:

[color=blue]

> [email]jason.cipriani@gmail.com[/email] wrote:[color=green]

>> It should be apparent but I forgot to mention you'll have to handle it

>> as a special case if M==0.[/color]

> You never know - I might have slipped and missed that, but I'm not

> gonna handle parameters that do not exist (i.e. bitmask = 0x0) :) -

> thanks for mentioning anyways.

>[color=green]

>> Also note that if you can guarantee that your mask bits are contiguous[/color]

> Luckily can :)

>[color=green]

>> and you want to avoid integer division for any reason you can substitute

>> (M & (~M << 1)) for (M / -M)[/color]

>

> Erm - I gather that ~M is the 2-complement of M, but then you probably meant[/color]

~M is the bitwise inverse of M.

--

Måns Rullgård

[email]mans@mansr.com[/email]

Re: fastest transition from bitmask to the position of it's lowestSET bit

Måns Rullgård wrote:[color=blue]

> ~M is the bitwise inverse of M.[/color]

Sorry - of course. ~M = (M XOR -1)

I thought the 2-complement was the same, as opposed to -M :)

Just checked wikipedia, cleared up what was still unclear. Thanks for

pointing that out! My formula correction still applies though, as I was

thinking 2-complement = bitwise inverse :) Whatever made me come to that

assumption...

Best Regards,

Lars

Re: fastest transition from bitmask to the position of it's lowestSET bit

Lars Uffmann wrote:[color=blue]

> my resulting function now looks like this:

>

> long extractParam (long *buf, long mask)

> {

> return (*buf & mask) / (mask & -mask);

> }[/color]

Actually I think I prefer

long extractParam (void *buf, long mask)

{

return (*(long *)buf & mask) / (mask & -mask);

}

That way the function doesn't depend on long-aligned values. Can someone

tell me how a typecast to long* behaves on a

non-sizeof(long)-byte-aligned address if the system hardware doesn't

support this? Or is it always possible to use non-aligned variables with

C compilers if you accept slower processing due to software emulation?

Thanks!

Lars

Re: fastest transition from bitmask to the position of it's lowestSET bit

Phil wrote:

[color=blue]

> Most C libraries have an "ffs" function that "Finds the First Set" bit

> in a value. (I have a feeling that it's non-standard but common.)[/color]

It's in POSIX/SUS, but is part of the XSI option (i.e. required for

UNIX conformance, but optional for POSIX conformance).

--

Geoff Clare <netnews@gclare.org.uk>

Re: fastest transition from bitmask to the position of it's lowestSET bit

Lars Uffmann wrote:[color=blue]

> Eric Sosman wrote:[color=green]

>> Oooh! See Jason Cipriani's gimmick; it's so cool

>> it assuages my shame in being refuted.[/color]

>

> :) Thanks for your time anyways, though I had the idea with the loop

> myself - I definitely wanted to use integer arithmetic, for that is a

> set amount (in this case 4) of cpu cycles for the calculation. In a loop

> I would have had to do at least a binary search tree which would have -

> as you demonstrated - involved LOTS more code AND have been slower :)

>

>

> my resulting function now looks like this:

>

> long extractParam (long *buf, long mask)

> {

> return (*buf & mask) / (mask & -mask);

> }

>

> It really is *that* easy.[/color]

I'd suggest you use `unsigned long' rather than plain

`long', to avoid problems with negative numbers.

--

Eric Sosman

[email]esosman@ieee-dot-org.inva[/email]lid

Re: fastest transition from bitmask to the position of it's lowest SET bit

Lars Uffmann <aral@nurfuerspam.de> writes:

[color=blue]

> Måns Rullgård wrote:[color=green]

>> ~M is the bitwise inverse of M.[/color]

>

> Sorry - of course. ~M = (M XOR -1)[/color]

That is only true for two's complement representation of negative

numbers.

[color=blue]

> I thought the 2-complement was the same, as opposed to -M :)

>

> Just checked wikipedia, cleared up what was still unclear.[/color]

The wikipedia coverage of C related matters is atrocious. Many

statements made there are outright lies, and much of the remainder is

inaccurate or incomplete. If you want to learn C programming, get a

real book on the subject and a copy of the C99 standard.

--

Måns Rullgård

[email]mans@mansr.com[/email]

Re: fastest transition from bitmask to the position of it's lowest SET bit

Lars Uffmann <aral@nurfuerspam.de> writes:

[color=blue]

> Lars Uffmann wrote:[color=green]

>> my resulting function now looks like this:

>> long extractParam (long *buf, long mask)

>> {

>> return (*buf & mask) / (mask & -mask);

>> }[/color]

>

> Actually I think I prefer

>

> long extractParam (void *buf, long mask)

> {

> return (*(long *)buf & mask) / (mask & -mask);

> }

>

> That way the function doesn't depend on long-aligned values. Can

> someone tell me how a typecast to long* behaves on a

> non-sizeof(long)-byte-aligned address if the system hardware doesn't

> support this?[/color]

Your programme with crash with a SIGBUS. On machines where this

doesn't happen, it will run slower, sometimes 100 times slower or

worse.

[color=blue]

> Or is it always possible to use non-aligned variables with C

> compilers if you accept slower processing due to software emulation?[/color]

No.

Again, I suggest getting a good book on C programming and a copy of

the standard.

--

Måns Rullgård

[email]mans@mansr.com[/email]

Re: fastest transition from bitmask to the position of it's lowestSET bit

Lars Uffmann wrote:[color=blue]

> Lars Uffmann wrote:[color=green]

>> my resulting function now looks like this:

>>

>> long extractParam (long *buf, long mask)

>> {

>> return (*buf & mask) / (mask & -mask);

>> }[/color]

>

> Actually I think I prefer

>

> long extractParam (void *buf, long mask)

> {

> return (*(long *)buf & mask) / (mask & -mask);

> }

>

> That way the function doesn't depend on long-aligned values.[/color]

... except that it does! If `buf' points to a mis-aligned

address, converting it from `void*' to `long*' doesn't magically

fix the alignment.

[color=blue]

> Can someone

> tell me how a typecast to long* behaves on a

> non-sizeof(long)-byte-aligned address if the system hardware doesn't

> support this?[/color]

I think you're asking to be told how to do X, starting

with the assumption that X is impossible ...

[color=blue]

> Or is it always possible to use non-aligned variables with

> C compilers if you accept slower processing due to software emulation?[/color]

The effect of dereferencing an improperly-aligned pointer

is undefined; in theory, anything at all could happen. In

practice, I've encountered three behaviors:

- Hardware trap leading to program termination (most

common)

- Hardware trap leading to slow, many-instruction byte-

by-byte fixup code (less common)

- Hardware pretends the offending low-order address bits

were zero, and silently accesses "nearby" memory rather

than the intended memory (I've seen this once only)

I imagine you'd be unhappy with all three of these ...

--

Eric Sosman

[email]esosman@ieee-dot-org.inva[/email]lid

Re: fastest transition from bitmask to the position of it's lowest SET bit

Eric Sosman <esosman@ieee-dot-org.invalid> writes:

[color=blue]

> Lars Uffmann wrote:[color=green]

>> Lars Uffmann wrote:[color=darkred]

>>> my resulting function now looks like this:

>>>

>>> long extractParam (long *buf, long mask)

>>> {

>>> return (*buf & mask) / (mask & -mask);

>>> }[/color]

>> Actually I think I prefer

>> long extractParam (void *buf, long mask)

>> {

>> return (*(long *)buf & mask) / (mask & -mask);

>> }

>> That way the function doesn't depend on long-aligned values.[/color]

>

> ... except that it does! If `buf' points to a mis-aligned

> address, converting it from `void*' to `long*' doesn't magically

> fix the alignment.

>[color=green]

>> Can someone tell me how a typecast to long* behaves on a

>> non-sizeof(long)-byte-aligned address if the system hardware doesn't

>> support this?[/color]

>

> I think you're asking to be told how to do X, starting

> with the assumption that X is impossible ...

>[color=green]

>> Or is it always possible to use non-aligned variables with C

>> compilers if you accept slower processing due to software emulation?[/color]

>

> The effect of dereferencing an improperly-aligned pointer

> is undefined; in theory, anything at all could happen. In

> practice, I've encountered three behaviors:

>

> - Hardware trap leading to program termination (most

> common)

>

> - Hardware trap leading to slow, many-instruction byte-

> by-byte fixup code (less common)

>

> - Hardware pretends the offending low-order address bits

> were zero, and silently accesses "nearby" memory rather

> than the intended memory (I've seen this once only)[/color]

You missed some:

- Hardware silently reads a rounded address, and rotates the value

read according the least significant bits of the address. Some ARM

variants do this.

- Hardware does something unspecified, usually requiring multiple

reads, returning the correct value, only 2-3 times slower than an

aligned access. Intel x86 does this.

--

Måns Rullgård

[email]mans@mansr.com[/email]