[PATCH] x86: Change x86 to use generic find_next_bit - Kernel


Thread: [PATCH] x86: Change x86 to use generic find_next_bit

  1. [PATCH] x86: Change x86 to use generic find_next_bit

    x86: Change x86 to use the generic find_next_bit implementation

    The versions with inline assembly are in fact slower on the machines I
    tested them on (in userspace) (Athlon XP 2800+, p4-like Xeon 2.8GHz, AMD
    Opteron 270). The i386-version needed a fix similar to 06024f21 to avoid
    crashing the benchmark.

    Benchmark using: gcc -fomit-frame-pointer -Os. For each bitmap size
    1...512, for each possible bitmap with one bit set, for each possible
    offset: find the position of the first bit starting at offset. If you
    follow ;). Times include setup of the bitmap and checking of the
    results.

                  Athlon    Xeon      Opteron 32/64bit
    x86-specific: 0m3.692s  0m2.820s  0m3.196s / 0m2.480s
    generic:      0m2.622s  0m1.662s  0m2.100s / 0m1.572s

    If the bitmap size is not a multiple of BITS_PER_LONG, and no set
    (cleared) bit is found, find_next_bit (find_next_zero_bit) returns a
    value outside of the range [0,size]. The generic version always returns
    exactly size. The generic version also uses unsigned long everywhere,
    while the x86 versions use a mishmash of int, unsigned (int), long and
    unsigned long.

    Using the generic version does give a slightly bigger kernel, though.

    defconfig:       text    data     bss      dec     hex  filename
    x86-specific: 4738555  481232  626688  5846475  5935cb  vmlinux (32 bit)
    generic:      4738621  481232  626688  5846541  59360d  vmlinux (32 bit)
    x86-specific: 5392395  846568  724424  6963387  6a40bb  vmlinux (64 bit)
    generic:      5392458  846568  724424  6963450  6a40fa  vmlinux (64 bit)

    arch/x86/Kconfig | 3 ++
    arch/x86/lib/Makefile | 2 +-
    arch/x86/lib/bitops_32.c | 70 -------------------------------------------
    arch/x86/lib/bitops_64.c | 68 -----------------------------------------
    include/asm-x86/bitops.h | 6 ++++
    include/asm-x86/bitops_32.h | 16 ----------
    include/asm-x86/bitops_64.h | 2 -
    lib/find_next_bit.c | 2 +
    8 files changed, 12 insertions(+), 157 deletions(-)

    Signed-off-by: Alexander van Heukelum

    ---

    Hi,

    I think it would be a good idea to use the generic versions of
    find_next_bit and find_next_zero_bit. The i386 versions have
    a bug, and the generic ones are faster (in my probably
    not-so-representative micro-benchmark, that is).

    Patch is against -x86#testing. It compiles.

    Greetings,
    Alexander

    diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
    index ab85a04..29a5a38 100644
    --- a/arch/x86/Kconfig
    +++ b/arch/x86/Kconfig
    @@ -83,6 +83,9 @@ config GENERIC_BUG
    def_bool y
    depends on BUG

    +config GENERIC_FIND_NEXT_BIT
    + def_bool y
    +
    config GENERIC_HWEIGHT
    def_bool y

    diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
    index 25df1c1..4360932 100644
    --- a/arch/x86/lib/Makefile
    +++ b/arch/x86/lib/Makefile
    @@ -11,7 +11,7 @@ lib-y += memcpy_$(BITS).o
    ifeq ($(CONFIG_X86_32),y)
    lib-y += checksum_32.o
    lib-y += strstr_32.o
    - lib-y += bitops_32.o semaphore_32.o string_32.o
    + lib-y += semaphore_32.o string_32.o

    lib-$(CONFIG_X86_USE_3DNOW) += mmx_32.o
    else
    diff --git a/arch/x86/lib/bitops_32.c b/arch/x86/lib/bitops_32.c
    deleted file mode 100644
    index b654404..0000000
    --- a/arch/x86/lib/bitops_32.c
    +++ /dev/null
    @@ -1,70 +0,0 @@
    -#include <linux/bitops.h>
    -#include <linux/module.h>
    -
    -/**
    - * find_next_bit - find the next set bit in a memory region
    - * @addr: The address to base the search on
    - * @offset: The bitnumber to start searching at
    - * @size: The maximum size to search
    - */
    -int find_next_bit(const unsigned long *addr, int size, int offset)
    -{
    - const unsigned long *p = addr + (offset >> 5);
    - int set = 0, bit = offset & 31, res;
    -
    - if (bit) {
    - /*
    - * Look for nonzero in the first 32 bits:
    - */
    - __asm__("bsfl %1,%0\n\t"
    - "jne 1f\n\t"
    - "movl $32, %0\n"
    - "1:"
    - : "=r" (set)
    - : "r" (*p >> bit));
    - if (set < (32 - bit))
    - return set + offset;
    - set = 32 - bit;
    - p++;
    - }
    - /*
    - * No set bit yet, search remaining full words for a bit
    - */
    - res = find_first_bit (p, size - 32 * (p - addr));
    - return (offset + set + res);
    -}
    -EXPORT_SYMBOL(find_next_bit);
    -
    -/**
    - * find_next_zero_bit - find the first zero bit in a memory region
    - * @addr: The address to base the search on
    - * @offset: The bitnumber to start searching at
    - * @size: The maximum size to search
    - */
    -int find_next_zero_bit(const unsigned long *addr, int size, int offset)
    -{
    - const unsigned long *p = addr + (offset >> 5);
    - int set = 0, bit = offset & 31, res;
    -
    - if (bit) {
    - /*
    - * Look for zero in the first 32 bits.
    - */
    - __asm__("bsfl %1,%0\n\t"
    - "jne 1f\n\t"
    - "movl $32, %0\n"
    - "1:"
    - : "=r" (set)
    - : "r" (~(*p >> bit)));
    - if (set < (32 - bit))
    - return set + offset;
    - set = 32 - bit;
    - p++;
    - }
    - /*
    - * No zero yet, search remaining full bytes for a zero
    - */
    - res = find_first_zero_bit(p, size - 32 * (p - addr));
    - return (offset + set + res);
    -}
    -EXPORT_SYMBOL(find_next_zero_bit);
    diff --git a/arch/x86/lib/bitops_64.c b/arch/x86/lib/bitops_64.c
    index 0e8f491..0eeb704 100644
    --- a/arch/x86/lib/bitops_64.c
    +++ b/arch/x86/lib/bitops_64.c
    @@ -1,9 +1,7 @@
    #include <linux/bitops.h>

    #undef find_first_zero_bit
    -#undef find_next_zero_bit
    #undef find_first_bit
    -#undef find_next_bit

    static inline long
    __find_first_zero_bit(const unsigned long * addr, unsigned long size)
    @@ -57,39 +55,6 @@ long find_first_zero_bit(const unsigned long * addr, unsigned long size)
    return __find_first_zero_bit (addr, size);
    }

    -/**
    - * find_next_zero_bit - find the next zero bit in a memory region
    - * @addr: The address to base the search on
    - * @offset: The bitnumber to start searching at
    - * @size: The maximum size to search
    - */
    -long find_next_zero_bit (const unsigned long * addr, long size, long offset)
    -{
    - const unsigned long * p = addr + (offset >> 6);
    - unsigned long set = 0;
    - unsigned long res, bit = offset&63;
    -
    - if (bit) {
    - /*
    - * Look for zero in first word
    - */
    - asm("bsfq %1,%0\n\t"
    - "cmoveq %2,%0"
    - : "=r" (set)
    - : "r" (~(*p >> bit)), "r"(64L));
    - if (set < (64 - bit))
    - return set + offset;
    - set = 64 - bit;
    - p++;
    - }
    - /*
    - * No zero yet, search remaining full words for a zero
    - */
    - res = __find_first_zero_bit (p, size - 64 * (p - addr));
    -
    - return (offset + set + res);
    -}
    -
    static inline long
    __find_first_bit(const unsigned long * addr, unsigned long size)
    {
    @@ -136,40 +101,7 @@ long find_first_bit(const unsigned long * addr, unsigned long size)
    return __find_first_bit(addr,size);
    }

    -/**
    - * find_next_bit - find the first set bit in a memory region
    - * @addr: The address to base the search on
    - * @offset: The bitnumber to start searching at
    - * @size: The maximum size to search
    - */
    -long find_next_bit(const unsigned long * addr, long size, long offset)
    -{
    - const unsigned long * p = addr + (offset >> 6);
    - unsigned long set = 0, bit = offset & 63, res;
    -
    - if (bit) {
    - /*
    - * Look for nonzero in the first 64 bits:
    - */
    - asm("bsfq %1,%0\n\t"
    - "cmoveq %2,%0\n\t"
    - : "=r" (set)
    - : "r" (*p >> bit), "r" (64L));
    - if (set < (64 - bit))
    - return set + offset;
    - set = 64 - bit;
    - p++;
    - }
    - /*
    - * No set bit yet, search remaining full words for a bit
    - */
    - res = __find_first_bit (p, size - 64 * (p - addr));
    - return (offset + set + res);
    -}
    -
    #include <linux/module.h>

    -EXPORT_SYMBOL(find_next_bit);
    EXPORT_SYMBOL(find_first_bit);
    EXPORT_SYMBOL(find_first_zero_bit);
    -EXPORT_SYMBOL(find_next_zero_bit);
    diff --git a/include/asm-x86/bitops.h b/include/asm-x86/bitops.h
    index 1a23ce1..7fbf93a 100644
    --- a/include/asm-x86/bitops.h
    +++ b/include/asm-x86/bitops.h
    @@ -312,6 +312,12 @@ static int test_bit(int nr, const volatile unsigned long *addr);

    #undef ADDR

    +unsigned long find_next_bit(const unsigned long *addr,
    + unsigned long size, unsigned long offset);
    +unsigned long find_next_zero_bit(const unsigned long *addr,
    + unsigned long size, unsigned long offset);
    +
    +
    #ifdef CONFIG_X86_32
    # include "bitops_32.h"
    #else
    diff --git a/include/asm-x86/bitops_32.h b/include/asm-x86/bitops_32.h
    index e4d75fc..570f0fa 100644
    --- a/include/asm-x86/bitops_32.h
    +++ b/include/asm-x86/bitops_32.h
    @@ -38,14 +38,6 @@ static inline int find_first_zero_bit(const unsigned long *addr, unsigned size)
    }

    /**
    - * find_next_zero_bit - find the first zero bit in a memory region
    - * @addr: The address to base the search on
    - * @offset: The bit number to start searching at
    - * @size: The maximum size to search
    - */
    -int find_next_zero_bit(const unsigned long *addr, int size, int offset);
    -
    -/**
    * __ffs - find first bit in word.
    * @word: The word to search
    *
    @@ -81,14 +73,6 @@ static inline unsigned find_first_bit(const unsigned long *addr, unsigned size)
    }

    /**
    - * find_next_bit - find the first set bit in a memory region
    - * @addr: The address to base the search on
    - * @offset: The bit number to start searching at
    - * @size: The maximum size to search
    - */
    -int find_next_bit(const unsigned long *addr, int size, int offset);
    -
    -/**
    * ffz - find first zero in word.
    * @word: The word to search
    *
    diff --git a/include/asm-x86/bitops_64.h b/include/asm-x86/bitops_64.h
    index aaf1519..40bf0f8 100644
    --- a/include/asm-x86/bitops_64.h
    +++ b/include/asm-x86/bitops_64.h
    @@ -6,9 +6,7 @@
    */

    extern long find_first_zero_bit(const unsigned long *addr, unsigned long size);
    -extern long find_next_zero_bit(const unsigned long *addr, long size, long offset);
    extern long find_first_bit(const unsigned long *addr, unsigned long size);
    -extern long find_next_bit(const unsigned long *addr, long size, long offset);

    /* return index of first bet set in val or max when no bit is set */
    static inline long __scanbit(unsigned long val, unsigned long max)
    diff --git a/lib/find_next_bit.c b/lib/find_next_bit.c
    index 78ccd73..5820e07 100644
    --- a/lib/find_next_bit.c
    +++ b/lib/find_next_bit.c
    @@ -15,6 +15,8 @@
    #include

    #define BITOP_WORD(nr) ((nr) / BITS_PER_LONG)
    +#undef find_next_bit
    +#undef find_next_zero_bit

    /**
    * find_next_bit - find the next set bit in a memory region

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: [PATCH] x86: Change x86 to use generic find_next_bit


    * Alexander van Heukelum wrote:

    > --- a/lib/find_next_bit.c
    > +++ b/lib/find_next_bit.c
    > @@ -15,6 +15,8 @@
    > #include
    >
    > #define BITOP_WORD(nr) ((nr) / BITS_PER_LONG)
    > +#undef find_next_bit
    > +#undef find_next_zero_bit


    this bit looks weird - did you need it for testing?

    Ingo

  3. Re: [PATCH] x86: Change x86 to use generic find_next_bit


    * Alexander van Heukelum wrote:

    > x86: Change x86 to use the generic find_next_bit implementation
    >
    > The versions with inline assembly are in fact slower on the machines I
    > tested them on (in userspace) (Athlon XP 2800+, p4-like Xeon 2.8GHz,
    > AMD Opteron 270). The i386-version needed a fix similar to 06024f21 to
    > avoid crashing the benchmark.
    >
    > Benchmark using: gcc -fomit-frame-pointer -Os. For each bitmap size
    > 1...512, for each possible bitmap with one bit set, for each possible
    > offset: find the position of the first bit starting at offset. If you
    > follow . Times include setup of the bitmap and checking of the
    > results.
    >
    > Athlon Xeon Opteron 32/64bit
    > x86-specific: 0m3.692s 0m2.820s 0m3.196s / 0m2.480s
    > generic: 0m2.622s 0m1.662s 0m2.100s / 0m1.572s


    ok, that's rather convincing.

    the generic version in lib/find_next_bit.c is open-coded C which gcc can
    optimize pretty nicely.

    the hand-coded assembly versions in arch/x86/lib/bitops_32.c mostly use
    the special x86 'bit search forward' (BSF) instruction - which i know
    from the days when the scheduler relied on it has some non-trivial setup
    costs. So especially when there's _small_ bitmasks involved, it's more
    expensive.

    > If the bitmap size is not a multiple of BITS_PER_LONG, and no set
    > (cleared) bit is found, find_next_bit (find_next_zero_bit) returns a
    > value outside of the range [0,size]. The generic version always
    > returns exactly size. The generic version also uses unsigned long
    > everywhere, while the x86 versions use a mishmash of int, unsigned
    > (int), long and unsigned long.


    i'm not surprised that the hand-coded assembly versions had a bug ...

    [ this means we have to test it quite carefully though, as lots of code
    only ever gets tested on x86 so code could have built dependency on
    the buggy behavior. ]

    > Using the generic version does give a slightly bigger kernel, though.
    >
    > defconfig: text data bss dec hex filename
    > x86-specific: 4738555 481232 626688 5846475 5935cb vmlinux (32 bit)
    > generic: 4738621 481232 626688 5846541 59360d vmlinux (32 bit)
    > x86-specific: 5392395 846568 724424 6963387 6a40bb vmlinux (64 bit)
    > generic: 5392458 846568 724424 6963450 6a40fa vmlinux (64 bit)


    i'd not worry about that too much. Have you tried to build with:

    CONFIG_CC_OPTIMIZE_FOR_SIZE=y
    CONFIG_OPTIMIZE_INLINING=y

    (the latter only available in x86.git)

    > Patch is against -x86#testing. It compiles.


    i've picked it up into x86.git, lets see how it goes in practice.

    Ingo

  4. Re: [PATCH] x86: Change x86 to use generic find_next_bit

    Alexander van Heukelum writes:
    >
    > If the bitmap size is not a multiple of BITS_PER_LONG, and no set
    > (cleared) bit is found, find_next_bit (find_next_zero_bit) returns a
    > value outside of the range [0,size]. The generic version always returns
    > exactly size.


    With that change it is likely possible to remove the min(NR_CPUS, ...)
    in lib/cpumask.c's __first/__next_cpu. IIRC it was just a workaround for
    the x86 quirk. I suspect with such a change it would be possible to
    inline those again, possibly speeding up some loops that do cpumask
    walking (IIRC the scheduler used to do that frequently).

    -Andi


  5. Re: [PATCH] x86: Change x86 to use generic find_next_bit


    On Sun, 9 Mar 2008 21:11:52 +0100, "Ingo Molnar" said:
    > * Alexander van Heukelum wrote:
    > > --- a/lib/find_next_bit.c
    > > +++ b/lib/find_next_bit.c
    > > @@ -15,6 +15,8 @@
    > > #include
    > >
    > > #define BITOP_WORD(nr) ((nr) / BITS_PER_LONG)
    > > +#undef find_next_bit
    > > +#undef find_next_zero_bit

    >
    > this bit looks weird - did you need it for testing?


    Worse, it's needed to get x86_64 to compile.

    They are defined in include/asm-x86/bitops_64.h (which gets
    included). They are used to optimize the case where the
    bitmap size is known at compile time and not larger than
    BITS_PER_LONG. Undeffing them here is the easiest way to get
    things to compile.

    Greetings,
    Alexander

    --
    http://www.fastmail.fm - The way an email service should be

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  6. Re: [PATCH] x86: Change x86 to use generic find_next_bit


    * Alexander van Heukelum wrote:

    > > > #define BITOP_WORD(nr) ((nr) / BITS_PER_LONG)
    > > > +#undef find_next_bit
    > > > +#undef find_next_zero_bit

    > >
    > > this bit looks weird - did you need it for testing?

    >
    > Worse, it's needed to get x86_64 to compile.
    >
    > They are defined in include/asm-x86/bitops_64.h (which gets included).
    > They are used to optimize the case where the bitmap size is known at
    > compile time and not larger than BITS_PER_LONG. Undeffing them here is
    > the easiest way to get things to compile.


    ok - but this needs to be solved in a cleaner way. That build-time
    optimization needs to be pushed into generic code so that 32-bit x86 and
    other architectures can make use of it as well. The lib/find_next_bit.c
    functions should be named __find_next_bit() or so.

    Ingo

  7. Re: [PATCH] x86: Change x86 to use generic find_next_bit

    Ingo Molnar writes:
    >
    > the generic version in lib/find_next_bit.c is open-coded C which gcc can
    > optimize pretty nicely.
    >
    > the hand-coded assembly versions in arch/x86/lib/bitops_32.c mostly use
    > the special x86 'bit search forward' (BSF) instruction - which i know
    > from the days when the scheduler relied on it has some non-trivial setup


    ~14 cycles on K8 for memory, but if you stay in a register it is 8 cycles

    > costs. So especially when there's _small_ bitmasks involved, it's more
    > expensive.


    I had a patchkit some time ago to special-case the max_bit <= 63 case
    and always use directly inlined, streamlined single-instruction
    assembler for that. There was still some issue and I dropped it then,
    but doing something like that still makes sense. Even if the BSF is
    slightly slower than the open-coded version, just getting rid of the
    CALL will make it a win, and the value could also be kept in a register
    so you get the 8-cycle variant (which I doubt you can beat open-coded).

    The result would be that a standard for_each_cpu () in a NR_CPUS <= 64
    kernel wouldn't have any unnecessary calls.

    In general the problem of walking cpu masks is quite different
    from searching ext2 bitmaps, so they likely should be special-cased
    and optimized for each.

    -Andi


  8. Re: [PATCH] x86: Change x86 to use generic find_next_bit

    On Sun, 9 Mar 2008 21:10:16 +0100, "Ingo Molnar" said:
    > > Athlon Xeon Opteron 32/64bit
    > > x86-specific: 0m3.692s 0m2.820s 0m3.196s / 0m2.480s
    > > generic: 0m2.622s 0m1.662s 0m2.100s / 0m1.572s

    >
    > ok, that's rather convincing.
    >
    > the generic version in lib/find_next_bit.c is open-coded C which gcc can
    > optimize pretty nicely.
    >
    > the hand-coded assembly versions in arch/x86/lib/bitops_32.c mostly use
    > the special x86 'bit search forward' (BSF) instruction - which i know
    > from the days when the scheduler relied on it has some non-trivial setup
    > costs. So especially when there's _small_ bitmasks involved, it's more
    > expensive.


    Hi,

    BSF is fine, it doesn't need any special setup. The problem is probably
    that the old versions use find_first_bit and find_first_zero_bit,
    which are also hand-optimized versions... and they use "repe scasl/q".
    That's another little project.

    > > If the bitmap size is not a multiple of BITS_PER_LONG, and no set
    > > (cleared) bit is found, find_next_bit (find_next_zero_bit) returns a
    > > value outside of the range [0,size]. The generic version always
    > > returns exactly size. The generic version also uses unsigned long
    > > everywhere, while the x86 versions use a mishmash of int, unsigned
    > > (int), long and unsigned long.

    >
    > i'm not surprised that the hand-coded assembly versions had a bug ...


    Not surprised about the bug, but it was in fact noticed, and fixed
    in x86_64!

    > [ this means we have to test it quite carefully though, as lots of code
    > only ever gets tested on x86 so code could have built dependency on
    > the buggy behavior. ]


    Agreed.

    > > Using the generic version does give a slightly bigger kernel, though.
    > >
    > > defconfig: text data bss dec hex filename
    > > x86-specific: 4738555 481232 626688 5846475 5935cb vmlinux (32 bit)
    > > generic: 4738621 481232 626688 5846541 59360d vmlinux (32 bit)
    > > x86-specific: 5392395 846568 724424 6963387 6a40bb vmlinux (64 bit)
    > > generic: 5392458 846568 724424 6963450 6a40fa vmlinux (64 bit)

    >
    > i'd not worry about that too much. Have you tried to build with:


    I don't, but I needed to compile something to test the build anyhow.

    > CONFIG_CC_OPTIMIZE_FOR_SIZE=y
    > CONFIG_OPTIMIZE_INLINING=y


    This was the defconfig in -x86#testing; they were both already enabled.
    Here is what you get with those options turned off:

                     text    data     bss      dec     hex  filename
    x86-specific: 5543996  481232  626688  6651916  65800c  vmlinux (32 bit)
    generic:      5543880  481232  626688  6651800  657f98  vmlinux (32 bit)
    x86-specific: 6111834  846568  724424  7682826  753b0a  vmlinux (64 bit)
    generic:      6111882  846568  724424  7682874  753b3a  vmlinux (64 bit)

    (and I double-checked the i386 results)

    > (the latter only available in x86.git)
    >
    > > Patch is against -x86#testing. It compiles.

    >
    > i've picked it up into x86.git, lets see how it goes in practice.


    Thanks,
    Alexander

    > Ingo

    --
    Alexander van Heukelum
    heukelum@fastmail.fm

    --
    http://www.fastmail.fm - And now for something completely different…


  9. Re: [PATCH] x86: Change x86 to use generic find_next_bit

    Ingo Molnar writes:
    >
    > ok - but this needs to be solved in a cleaner way. That build-time
    > optimization needs to be pushed into generic code so that 32-bit x86 and


    It probably doesn't make sense in generic code because it will be
    too large open coded.

    -Andi



  10. Re: [PATCH] x86: Change x86 to use generic find_next_bit

    Andi Kleen writes:
    >
    > I had a patchkit some time ago to special case the max_bit <= 63 case


    ... never mind ... the patchkit is actually included. It was only
    dropped at some point, and fixed later.

    -Andi

  11. Re: [PATCH] x86: Change x86 to use generic find_next_bit

    Alexander van Heukelum writes:
    >
    > Benchmark using: gcc -fomit-frame-pointer -Os. For each bitmap size
    > 1...512, for each possible bitmap with one bit set, for each possible
    > offset: find the position of the first bit starting at offset. If you
    > follow ;). Times include setup of the bitmap and checking of the
    > results.


    BTW, another comment: it would make far more sense if you did
    some profiling of what bitmap sizes a real kernel uses
    (should be easy enough with some systemtap) and benchmarked
    only those.

    I doubt 1 bit bitmap searches are common for example ...

    -Andi

  12. Re: [PATCH] x86: Change x86 to use generic find_next_bit


    * Alexander van Heukelum wrote:

    > > ok, that's rather convincing.
    > >
    > > the generic version in lib/find_next_bit.c is open-coded C which gcc
    > > can optimize pretty nicely.
    > >
    > > the hand-coded assembly versions in arch/x86/lib/bitops_32.c mostly
    > > use the special x86 'bit search forward' (BSF) instruction - which i
    > > know from the days when the scheduler relied on it has some
    > > non-trivial setup costs. So especially when there's _small_ bitmasks
    > > involved, it's more expensive.

    >
    > Hi,
    >
    > BSF is fine, it doesn't need any special setup. [...]


    under "setup costs" i mean cycles spent by the CPU itself - the
    instruction itself is simple (of course) and needs no setup. If you look
    at BSF performance you'll see that it has nontrivial overhead.

    Ingo

  13. [RFC/PATCH] x86: Optimize find_next_(zero_)bit for small constant-size bitmaps

    x86: Optimize find_next_(zero_)bit for small constant-size bitmaps

    NOTE: This is not well tested. I'm also quite unsure if this makes the
    pre-processor situation any better.

    On an i386 defconfig (the x86#testing one), the size of vmlinux hardly
    changes with this applied. I have observed only four places where this
    optimization avoids a call into find_next_bit:

    In the functions return_unused_surplus_pages, alloc_fresh_huge_page,
    and adjust_pool_surplus, this patch avoids a call for a 1-bit bitmap.
    In __next_cpu a call is avoided for a 32-bit bitmap. That's it.

    On x86_64 I observed the following (I did not look closely yet, though).

    Current #testing defconfig:
    146 x bsf, 27 x find_next_*bit
    text data bss dec hex filename
    5392637 846592 724424 6963653 6a41c5 vmlinux

    After removing the x86_64 specific optimization for find_next_*bit:
    94 x bsf, 79 x find_next_*bit
    text data bss dec hex filename
    5392358 846592 724424 6963374 6a40ae vmlinux

    After this patch (making the optimization generic):
    146 x bsf, 27 x find_next_*bit
    text data bss dec hex filename
    5392396 846592 724424 6963412 6a40d4 vmlinux

    Signed-off-by: Alexander van Heukelum

    ---

    > ok - but this needs to be solved in a cleaner way. That build-time
    > optimization needs to be pushed into generic code so that 32-bit x86 and
    > other architectures can make use of it as well. The lib/find_next_bit.c
    > functions should be named __find_next_bit() or so.
    >
    > Ingo


    I think this is what you had in mind? It seems to work.

    Alternative implementation ideas? Comments? In particular: does
    this break non-x86?

    Greetings,
    Alexander

    include/asm-x86/bitops.h | 4 +-
    include/asm-x86/bitops_64.h | 10 -------
    include/linux/bitops.h | 60 +++++++++++++++++++++++++++++++++++++++++++
    lib/find_next_bit.c | 10 +++----
    4 files changed, 66 insertions(+), 18 deletions(-)

    diff --git a/include/asm-x86/bitops.h b/include/asm-x86/bitops.h
    index 7fbf93a..cbbe329 100644
    --- a/include/asm-x86/bitops.h
    +++ b/include/asm-x86/bitops.h
    @@ -312,9 +312,9 @@ static int test_bit(int nr, const volatile unsigned long *addr);

    #undef ADDR

    -unsigned long find_next_bit(const unsigned long *addr,
    +unsigned long __find_next_bit(const unsigned long *addr,
    unsigned long size, unsigned long offset);
    -unsigned long find_next_zero_bit(const unsigned long *addr,
    +unsigned long __find_next_zero_bit(const unsigned long *addr,
    unsigned long size, unsigned long offset);


    diff --git a/include/asm-x86/bitops_64.h b/include/asm-x86/bitops_64.h
    index 40bf0f8..87e1a17 100644
    --- a/include/asm-x86/bitops_64.h
    +++ b/include/asm-x86/bitops_64.h
    @@ -20,21 +20,11 @@ static inline long __scanbit(unsigned long val, unsigned long max)
    (__scanbit(*(unsigned long *)addr,(size))) : \
    find_first_bit(addr,size)))

    -#define find_next_bit(addr,size,off) \
    -((__builtin_constant_p(size) && (size) <= BITS_PER_LONG ? \
    - ((off) + (__scanbit((*(unsigned long *)addr) >> (off),(size)-(off)))) : \
    - find_next_bit(addr,size,off)))
    -
    #define find_first_zero_bit(addr,size) \
    ((__builtin_constant_p(size) && (size) <= BITS_PER_LONG ? \
    (__scanbit(~*(unsigned long *)addr,(size))) : \
    find_first_zero_bit(addr,size)))

    -#define find_next_zero_bit(addr,size,off) \
    -((__builtin_constant_p(size) && (size) <= BITS_PER_LONG ? \
    - ((off)+(__scanbit(~(((*(unsigned long *)addr)) >> (off)),(size)-(off)))) : \
    - find_next_zero_bit(addr,size,off)))
    -
    static inline void set_bit_string(unsigned long *bitmap, unsigned long i,
    int len)
    {
    diff --git a/include/linux/bitops.h b/include/linux/bitops.h
    index 69c1edb..ba78ca1 100644
    --- a/include/linux/bitops.h
    +++ b/include/linux/bitops.h
    @@ -72,4 +72,64 @@ static inline unsigned fls_long(unsigned long l)
    return fls64(l);
    }

    +#ifdef __KERNEL__
    +#ifdef CONFIG_GENERIC_FIND_NEXT_BIT
    +unsigned long __find_next_bit(const unsigned long *addr,
    + unsigned long size, unsigned long offset);
    +
    +static __always_inline unsigned long
    +find_next_bit(const unsigned long *addr, unsigned long size,
    + unsigned long offset)
    +{
    + unsigned long value;
    +
    + /* Here we would like to use a version of ffs that */
    + /* returns BITS_PER_LONG if its argument is zero. */
    + if (__builtin_constant_p(size) && (size == BITS_PER_LONG)) {
    + value = (*addr) & ((~0ul) << offset);
    + return (value == 0) ? BITS_PER_LONG : __ffs(value);
    + }
    +
    + /* Less than BITS_PER_LONG: put in sentinel, then __ffs */
    + /* Here we would like to have a __set_bit/__ffs combo */
    + if (__builtin_constant_p(size) && (size < BITS_PER_LONG)) {
    + value = (*addr) & ((~0ul) << offset);
    + value |= (1ul << size);
    + return __ffs(value);
    + }
    +
    + /* size not constant or too big */
    + return __find_next_bit(addr, size, offset);
    +}
    +
    +unsigned long __find_next_zero_bit(const unsigned long *addr,
    + unsigned long size, unsigned long offset);
    +
    +static __always_inline unsigned long
    +find_next_zero_bit(const unsigned long *addr, unsigned long size,
    + unsigned long offset)
    +{
    + unsigned long value;
    +
    + /* Here we would like to use a version of ffs that */
    + /* returns BITS_PER_LONG if its argument is zero. */
    + if (__builtin_constant_p(size) && (size == BITS_PER_LONG)) {
    + value = (~(*addr)) & ((~0ul) << offset);
    + return (value == 0) ? BITS_PER_LONG : __ffs(value);
    + }
    +
    + /* Less than BITS_PER_LONG: put in sentinel, then __ffs */
    + /* Here we would like to have a __set_bit/__ffs combo. */
    + if (__builtin_constant_p(size) && (size < BITS_PER_LONG)) {
    + value = (~(*addr)) & ((~0ul) << offset);
    + value |= (1ul << size);
    + return __ffs(value);
    + }
    +
    + /* size not constant or too big */
    + return __find_next_zero_bit(addr, size, offset);
    +}
    +
    +#endif /* CONFIG_GENERIC_FIND_NEXT_BIT */
    +#endif /* __KERNEL__ */
    #endif
    diff --git a/lib/find_next_bit.c b/lib/find_next_bit.c
    index 5820e07..5249f4a 100644
    --- a/lib/find_next_bit.c
    +++ b/lib/find_next_bit.c
    @@ -15,8 +15,6 @@
    #include

    #define BITOP_WORD(nr) ((nr) / BITS_PER_LONG)
    -#undef find_next_bit
    -#undef find_next_zero_bit

    /**
    * find_next_bit - find the next set bit in a memory region
    @@ -24,8 +22,8 @@
    * @offset: The bitnumber to start searching at
    * @size: The maximum size to search
    */
    -unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
    - unsigned long offset)
    +unsigned long __find_next_bit(const unsigned long *addr,
    + unsigned long size, unsigned long offset)
    {
    const unsigned long *p = addr + BITOP_WORD(offset);
    unsigned long result = offset & ~(BITS_PER_LONG-1);
    @@ -69,8 +67,8 @@ EXPORT_SYMBOL(find_next_bit);
    * This implementation of find_{first,next}_zero_bit was stolen from
    * Linus' asm-alpha/bitops.h.
    */
    -unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
    - unsigned long offset)
    +unsigned long __find_next_zero_bit(const unsigned long *addr,
    + unsigned long size, unsigned long offset)
    {
    const unsigned long *p = addr + BITOP_WORD(offset);
    unsigned long result = offset & ~(BITS_PER_LONG-1);

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  14. Re: [RFC/PATCH] x86: Optimize find_next_(zero_)bit for small constant-size bitmaps


    * Alexander van Heukelum wrote:

    > x86: Optimize find_next_(zero_)bit for small constant-size bitmaps


    thanks, looks nice, except for a small detail:

    > +#ifdef CONFIG_GENERIC_FIND_NEXT_BIT
    > +unsigned long __find_next_bit(const unsigned long *addr,
    > + unsigned long size, unsigned long offset);
    > +
    > +static __always_inline unsigned long
    > +find_next_bit(const unsigned long *addr, unsigned long size,
    > + unsigned long offset)


    > +#endif /* CONFIG_GENERIC_FIND_NEXT_BIT */


    there should be an #else here i think, which maps find_next_bit to
    __find_next_bit / etc.

    small syntactic nit:

    > +unsigned long __find_next_bit(const unsigned long *addr,
    > + unsigned long size, unsigned long offset);


    should be explicitly "extern", so that we see that it's a prototype
    declaration.

    Ingo

  15. [PATCH] x86: Optimize find_next_(zero_)bit for small constant-size bitmaps

    x86: Optimize find_next_(zero_)bit for small constant-size bitmaps.

    This moves an optimization for searching constant-sized small
    bitmaps from x86_64-specific to generic code.

    On an i386 defconfig (the x86#testing one), the size of vmlinux hardly
    changes with this applied. I have observed only four places where this
    optimization avoids a call into find_next_bit:

    In the functions return_unused_surplus_pages, alloc_fresh_huge_page,
    and adjust_pool_surplus, this patch avoids a call for a 1-bit bitmap.
    In __next_cpu a call is avoided for a 32-bit bitmap. That's it.

    On x86_64, 52 locations are optimized with a minimal increase in
    code size:

    Current #testing defconfig:
    146 x bsf, 27 x find_next_*bit
    text data bss dec hex filename
    5392637 846592 724424 6963653 6a41c5 vmlinux

    After removing the x86_64 specific optimization for find_next_*bit:
    94 x bsf, 79 x find_next_*bit
    text data bss dec hex filename
    5392358 846592 724424 6963374 6a40ae vmlinux

    After this patch (making the optimization generic):
    146 x bsf, 27 x find_next_*bit
    text data bss dec hex filename
    5392396 846592 724424 6963412 6a40d4 vmlinux
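
    The sentinel trick used by the patch below can be sketched in
    userspace C. This is only an illustration: __ffs is approximated
    with GCC's __builtin_ctzl, and find_next_bit_small is a name
    invented for this sketch, not a kernel function. For a constant
    size smaller than the word size, a bit is OR'ed in at position
    'size', so the bit scan always finds something and returns
    exactly 'size' when the bitmap has no set bit at or after the
    offset:

    ```c
    #include <assert.h>

    /* Userspace stand-in for the kernel's __ffs (undefined for 0). */
    static unsigned long my_ffs(unsigned long x)
    {
            return (unsigned long)__builtin_ctzl(x);
    }

    /* Sentinel trick for a constant size smaller than the word size:
     * OR in a bit at position 'size' so my_ffs() returns 'size' when
     * no real bit is set at or after 'offset'. */
    static unsigned long find_next_bit_small(const unsigned long *addr,
                                             unsigned long size,
                                             unsigned long offset)
    {
            unsigned long value = *addr & (~0UL << offset);

            value |= 1UL << size;   /* sentinel */
            return my_ffs(value);
    }

    int main(void)
    {
            unsigned long map = 1UL << 5;

            assert(find_next_bit_small(&map, 8, 0) == 5);   /* finds bit 5 */
            assert(find_next_bit_small(&map, 8, 6) == 8);   /* none: size */
            return 0;
    }
    ```

    The not-found case costs no branch: the sentinel guarantees the
    scan instruction always has an input bit to find.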

    ---

    Hi Ingo,

    I have now tested (in user-space) the one-long-versions of the
    find_next_(zero_)bit in the patch. They give the expected
    results.

    > > +#ifdef CONFIG_GENERIC_FIND_NEXT_BIT
    > > +unsigned long __find_next_bit(const unsigned long *addr,
    > > + unsigned long size, unsigned long offset);
    > > +
    > > +static __always_inline unsigned long
    > > +find_next_bit(const unsigned long *addr, unsigned long size,
    > > + unsigned long offset)

    >
    > > +#endif /* CONFIG_GENERIC_FIND_NEXT_BIT */

    >
    > there should be an #else here i think, which maps find_next_bit to
    > __find_next_bit / etc.


    No, that is intentional: architectures are expected to either set
    CONFIG_GENERIC_FIND_NEXT_BIT, or provide their own implementation
    of find_next_bit.

    An alternative would be to have all architectures providing
    __find_next_bit/__find_next_zero_bit and do the optimization
    unconditionally.

    > > +unsigned long __find_next_bit(const unsigned long *addr,
    > > + unsigned long size, unsigned long offset);

    >
    > should be explicitly "extern", so that we see that it's a prototype
    > declaration.


    ok.

    This version runs fine on qemu.

    Greetings,
    Alexander

    diff --git a/include/asm-x86/bitops.h b/include/asm-x86/bitops.h
    index 7fbf93a..1a23ce1 100644
    --- a/include/asm-x86/bitops.h
    +++ b/include/asm-x86/bitops.h
    @@ -312,12 +312,6 @@ static int test_bit(int nr, const volatile unsigned long *addr);

    #undef ADDR

    -unsigned long find_next_bit(const unsigned long *addr,
    - unsigned long size, unsigned long offset);
    -unsigned long find_next_zero_bit(const unsigned long *addr,
    - unsigned long size, unsigned long offset);
    -
    -
    #ifdef CONFIG_X86_32
    # include "bitops_32.h"
    #else
    diff --git a/include/asm-x86/bitops_64.h b/include/asm-x86/bitops_64.h
    index 40bf0f8..87e1a17 100644
    --- a/include/asm-x86/bitops_64.h
    +++ b/include/asm-x86/bitops_64.h
    @@ -20,21 +20,11 @@ static inline long __scanbit(unsigned long val, unsigned long max)
    (__scanbit(*(unsigned long *)addr,(size))) : \
    find_first_bit(addr,size)))

    -#define find_next_bit(addr,size,off) \
    -((__builtin_constant_p(size) && (size) <= BITS_PER_LONG ? \
    - ((off) + (__scanbit((*(unsigned long *)addr) >> (off),(size)-(off)))) : \
    - find_next_bit(addr,size,off)))
    -
    #define find_first_zero_bit(addr,size) \
    ((__builtin_constant_p(size) && (size) <= BITS_PER_LONG ? \
    (__scanbit(~*(unsigned long *)addr,(size))) : \
    find_first_zero_bit(addr,size)))

    -#define find_next_zero_bit(addr,size,off) \
    -((__builtin_constant_p(size) && (size) <= BITS_PER_LONG ? \
    - ((off)+(__scanbit(~(((*(unsigned long *)addr)) >> (off)),(size)-(off)))) : \
    - find_next_zero_bit(addr,size,off)))
    -
    static inline void set_bit_string(unsigned long *bitmap, unsigned long i,
    int len)
    {
    diff --git a/include/linux/bitops.h b/include/linux/bitops.h
    index 69c1edb..15360f9 100644
    --- a/include/linux/bitops.h
    +++ b/include/linux/bitops.h
    @@ -72,4 +72,81 @@ static inline unsigned fls_long(unsigned long l)
    return fls64(l);
    }

    +#ifdef __KERNEL__
    +#ifdef CONFIG_GENERIC_FIND_NEXT_BIT
    +extern unsigned long __find_next_bit(const unsigned long *addr,
    + unsigned long size, unsigned long offset);
    +
    +/**
    + * find_next_bit - find the next set bit in a memory region
    + * @addr: The address to base the search on
    + * @offset: The bitnumber to start searching at
    + * @size: The bitmap size in bits
    + */
    +static __always_inline unsigned long
    +find_next_bit(const unsigned long *addr, unsigned long size,
    + unsigned long offset)
    +{
    + unsigned long value;
    +
    + /* Avoid a function call if the bitmap size is a constant */
    + /* and not bigger than BITS_PER_LONG. */
    +
    + /* insert a sentinel so that __ffs returns size if there */
    + /* are no set bits in the bitmap */
    + if (__builtin_constant_p(size) && (size < BITS_PER_LONG)) {
    + value = (*addr) & ((~0ul) << offset);
    + value |= (1ul << size);
    + return __ffs(value);
    + }
    +
    + /* the result of __ffs(0) is undefined, so it needs to be */
    + /* handled separately */
    + if (__builtin_constant_p(size) && (size == BITS_PER_LONG)) {
    + value = (*addr) & ((~0ul) << offset);
    + return (value == 0) ? BITS_PER_LONG : __ffs(value);
    + }
    +
    + /* size is not constant or too big */
    + return __find_next_bit(addr, size, offset);
    +}
    +
    +extern unsigned long __find_next_zero_bit(const unsigned long *addr,
    + unsigned long size, unsigned long offset);
    +
    +/**
    + * find_next_zero_bit - find the next cleared bit in a memory region
    + * @addr: The address to base the search on
    + * @offset: The bitnumber to start searching at
    + * @size: The bitmap size in bits
    + */
    +static __always_inline unsigned long
    +find_next_zero_bit(const unsigned long *addr, unsigned long size,
    + unsigned long offset)
    +{
    + unsigned long value;
    +
    + /* Avoid a function call if the bitmap size is a constant */
    + /* and not bigger than BITS_PER_LONG. */
    +
    + /* insert a sentinel so that __ffs returns size if there */
    + /* are no set bits in the bitmap */
    + if (__builtin_constant_p(size) && (size < BITS_PER_LONG)) {
    + value = (~(*addr)) & ((~0ul) << offset);
    + value |= (1ul << size);
    + return __ffs(value);
    + }
    +
    + /* the result of __ffs(0) is undefined, so it needs to be */
    + /* handled separately */
    + if (__builtin_constant_p(size) && (size == BITS_PER_LONG)) {
    + value = (~(*addr)) & ((~0ul) << offset);
    + return (value == 0) ? BITS_PER_LONG : __ffs(value);
    + }
    +
    + /* size is not constant or too big */
    + return __find_next_zero_bit(addr, size, offset);
    +}
    +#endif /* CONFIG_GENERIC_FIND_NEXT_BIT */
    +#endif /* __KERNEL__ */
    #endif
    diff --git a/lib/find_next_bit.c b/lib/find_next_bit.c
    index 5820e07..ce94c4c 100644
    --- a/lib/find_next_bit.c
    +++ b/lib/find_next_bit.c
    @@ -15,17 +15,12 @@
    #include

    #define BITOP_WORD(nr) ((nr) / BITS_PER_LONG)
    -#undef find_next_bit
    -#undef find_next_zero_bit
    -
    -/**
    - * find_next_bit - find the next set bit in a memory region
    - * @addr: The address to base the search on
    - * @offset: The bitnumber to start searching at
    - * @size: The maximum size to search
    +
    +/*
    + * Find the next set bit in a memory region.
    */
    -unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
    - unsigned long offset)
    +unsigned long __find_next_bit(const unsigned long *addr,
    + unsigned long size, unsigned long offset)
    {
    const unsigned long *p = addr + BITOP_WORD(offset);
    unsigned long result = offset & ~(BITS_PER_LONG-1);
    @@ -62,15 +57,14 @@ found_first:
    found_middle:
    return result + __ffs(tmp);
    }
    -
    -EXPORT_SYMBOL(find_next_bit);
    +EXPORT_SYMBOL(__find_next_bit);

    /*
    * This implementation of find_{first,next}_zero_bit was stolen from
    * Linus' asm-alpha/bitops.h.
    */
    -unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
    - unsigned long offset)
    +unsigned long __find_next_zero_bit(const unsigned long *addr,
    + unsigned long size, unsigned long offset)
    {
    const unsigned long *p = addr + BITOP_WORD(offset);
    unsigned long result = offset & ~(BITS_PER_LONG-1);
    @@ -107,8 +101,7 @@ found_first:
    found_middle:
    return result + ffz(tmp);
    }
    -
    -EXPORT_SYMBOL(find_next_zero_bit);
    +EXPORT_SYMBOL(__find_next_zero_bit);

    #ifdef __BIG_ENDIAN



  16. Re: [PATCH] x86: Optimize find_next_(zero_)bit for small constant-size bitmaps


    * Alexander van Heukelum wrote:

    > > > +#endif /* CONFIG_GENERIC_FIND_NEXT_BIT */

    > >
    > > there should be an #else here i think, which maps find_next_bit to
    > > __find_next_bit / etc.

    >
    > No, that is intentional: architectures are expected to either set
    > CONFIG_GENERIC_FIND_NEXT_BIT, or provide their own implementation of
    > find_next_bit.


    indeed, you are right.

    > This version runs fine on qemu.


    great - i've queued it up into x86.git.

    Ingo

  17. [RFC] non-x86: Optimize find_next_(zero_)bit for small constant-size bitmaps

    non-x86: Optimize small bitmap searches for the other archs.

    This concerns archs that have their own implementations of
    find_next_bit and find_next_zero_bit.

    For archs that define CONFIG_GENERIC_FIND_NEXT_BIT, a patch is
    submitted to be included in Ingo's x86-testing tree. It moves
    an optimization for searching small, constant-sized bitmaps
    from x86_64-specific to generic code. It works by creating
    a new __always_inline-marked function in linux/bitops.h, and
    renaming find_next_(zero_)bit to __find_next_(zero_)bit in
    the generic code. The new code is currently guarded by
    CONFIG_GENERIC_FIND_NEXT_BIT, so other archs are expected to
    work as before.

    To enable the optimization globally, all other archs need to
    implement __find_next_bit and __find_next_zero_bit either
    by exporting these symbols, or by implementing them as inline
    functions in their asm/bitops.h.

    It would also be nice if every implementation had the same
    declaration:

    unsigned long __find_next_(zero_)bit(unsigned long *,
    unsigned long, unsigned long);

    I added a totally untested patch for reference.

    Did I miss an arch?
    Does the assembly still match the (changed) prototype?
    Can we get consensus on doing the optimization in generic code?

    Greetings,
    Alexander

    arch/arm/lib/findbit.S | 6 ++++--
    arch/avr32/kernel/avr32_ksyms.c | 4 ++--
    arch/avr32/lib/findbit.S | 4 ++--
    include/asm-arm/bitops.h | 20 ++++++++++++--------
    include/asm-m68k/bitops.h | 8 ++++----
    include/asm-powerpc/bitops.h | 4 ----
    include/asm-s390/bitops.h | 12 ++++++------
    include/linux/bitops.h | 2 --
    8 files changed, 30 insertions(+), 30 deletions(-)

    diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S
    index a5ca024..5c92967 100644
    --- a/arch/arm/lib/findbit.S
    +++ b/arch/arm/lib/findbit.S
    @@ -36,7 +36,8 @@ ENTRY(_find_first_zero_bit_le)

    /*
    * Purpose : Find next 'zero' bit
    - * Prototype: int find_next_zero_bit(void *addr, unsigned int maxbit, int offset)
    + * Prototype: unsigned long find_next_zero_bit(unsigned long *addr,
    + * unsigned long maxbit, unsigned long offset);
    */
    ENTRY(_find_next_zero_bit_le)
    teq r1, #0
    @@ -70,7 +71,8 @@ ENTRY(_find_first_bit_le)

    /*
    * Purpose : Find next 'one' bit
    - * Prototype: int find_next_zero_bit(void *addr, unsigned int maxbit, int offset)
    + * Prototype: unsigned long find_next_zero_bit(unsigned long *addr,
    + * unsigned long maxbit, unsigned long offset);
    */
    ENTRY(_find_next_bit_le)
    teq r1, #0
    diff --git a/arch/avr32/kernel/avr32_ksyms.c b/arch/avr32/kernel/avr32_ksyms.c
    index 80f55f8..c228ed1 100644
    --- a/arch/avr32/kernel/avr32_ksyms.c
    +++ b/arch/avr32/kernel/avr32_ksyms.c
    @@ -51,9 +51,9 @@ EXPORT_SYMBOL(__const_udelay);

    /* Bit operations (lib/findbit.S) */
    EXPORT_SYMBOL(find_first_zero_bit);
    -EXPORT_SYMBOL(find_next_zero_bit);
    +EXPORT_SYMBOL(__find_next_zero_bit);
    EXPORT_SYMBOL(find_first_bit);
    -EXPORT_SYMBOL(find_next_bit);
    +EXPORT_SYMBOL(__find_next_bit);
    EXPORT_SYMBOL(generic_find_next_zero_le_bit);

    /* I/O primitives (lib/io-*.S) */
    diff --git a/arch/avr32/lib/findbit.S b/arch/avr32/lib/findbit.S
    index c6b91de..729deab 100644
    --- a/arch/avr32/lib/findbit.S
    +++ b/arch/avr32/lib/findbit.S
    @@ -29,7 +29,7 @@ ENTRY(find_first_zero_bit)
    * unsigned long size,
    * unsigned long offset)
    */
    -ENTRY(find_next_zero_bit)
    +ENTRY(__find_next_zero_bit)
    lsr r8, r10, 5
    sub r9, r11, r10
    retle r11
    @@ -93,7 +93,7 @@ ENTRY(find_first_bit)
    * unsigned long size,
    * unsigned long offset)
    */
    -ENTRY(find_next_bit)
    +ENTRY(__find_next_bit)
    lsr r8, r10, 5
    sub r9, r11, r10
    retle r11
    diff --git a/include/asm-arm/bitops.h b/include/asm-arm/bitops.h
    index 5c60bfc..08b3163 100644
    --- a/include/asm-arm/bitops.h
    +++ b/include/asm-arm/bitops.h
    @@ -158,9 +158,11 @@ extern int _test_and_set_bit_le(int nr, volatile unsigned long * p);
    extern int _test_and_clear_bit_le(int nr, volatile unsigned long * p);
    extern int _test_and_change_bit_le(int nr, volatile unsigned long * p);
    extern int _find_first_zero_bit_le(const void * p, unsigned size);
    -extern int _find_next_zero_bit_le(const void * p, int size, int offset);
    +extern unsigned long _find_next_zero_bit_le(const unsigned long * p,
    + unsigned long size, unsigned long offset);
    extern int _find_first_bit_le(const unsigned long *p, unsigned size);
    -extern int _find_next_bit_le(const unsigned long *p, int size, int offset);
    +extern unsigned long _find_next_bit_le(const unsigned long *p,
    + unsigned long size, unsigned long offset);

    /*
    * Big endian assembly bitops. nr = 0 -> byte 3 bit 0.
    @@ -172,9 +174,11 @@ extern int _test_and_set_bit_be(int nr, volatile unsigned long * p);
    extern int _test_and_clear_bit_be(int nr, volatile unsigned long * p);
    extern int _test_and_change_bit_be(int nr, volatile unsigned long * p);
    extern int _find_first_zero_bit_be(const void * p, unsigned size);
    -extern int _find_next_zero_bit_be(const void * p, int size, int offset);
    +extern unsigned long _find_next_zero_bit_be(const unsigned long * p,
    + unsigned long size, unsigned long offset);
    extern int _find_first_bit_be(const unsigned long *p, unsigned size);
    -extern int _find_next_bit_be(const unsigned long *p, int size, int offset);
    +extern unsigned long _find_next_bit_be(const unsigned long *p,
    + unsigned long size, unsigned long offset);

    #ifndef CONFIG_SMP
    /*
    @@ -208,9 +212,9 @@ extern int _find_next_bit_be(const unsigned long *p, int size, int offset);
    #define test_and_clear_bit(nr,p) ATOMIC_BITOP_LE(test_and_clear_bit,nr,p)
    #define test_and_change_bit(nr,p) ATOMIC_BITOP_LE(test_and_change_bit,nr,p)
    #define find_first_zero_bit(p,sz) _find_first_zero_bit_le(p,sz)
    -#define find_next_zero_bit(p,sz,off) _find_next_zero_bit_le(p,sz,off)
    +#define __find_next_zero_bit(p,sz,off) _find_next_zero_bit_le(p,sz,off)
    #define find_first_bit(p,sz) _find_first_bit_le(p,sz)
    -#define find_next_bit(p,sz,off) _find_next_bit_le(p,sz,off)
    +#define __find_next_bit(p,sz,off) _find_next_bit_le(p,sz,off)

    #define WORD_BITOFF_TO_LE(x) ((x))

    @@ -226,9 +230,9 @@ extern int _find_next_bit_be(const unsigned long *p, int size, int offset);
    #define test_and_clear_bit(nr,p) ATOMIC_BITOP_BE(test_and_clear_bit,nr,p)
    #define test_and_change_bit(nr,p) ATOMIC_BITOP_BE(test_and_change_bit,nr,p)
    #define find_first_zero_bit(p,sz) _find_first_zero_bit_be(p,sz)
    -#define find_next_zero_bit(p,sz,off) _find_next_zero_bit_be(p,sz,off)
    +#define __find_next_zero_bit(p,sz,off) _find_next_zero_bit_be(p,sz,off)
    #define find_first_bit(p,sz) _find_first_bit_be(p,sz)
    -#define find_next_bit(p,sz,off) _find_next_bit_be(p,sz,off)
    +#define __find_next_bit(p,sz,off) _find_next_bit_be(p,sz,off)

    #define WORD_BITOFF_TO_LE(x) ((x) ^ 0x18)

    diff --git a/include/asm-m68k/bitops.h b/include/asm-m68k/bitops.h
    index 83d1f28..c320c7a 100644
    --- a/include/asm-m68k/bitops.h
    +++ b/include/asm-m68k/bitops.h
    @@ -199,8 +199,8 @@ out:
    return ((long)p - (long)vaddr - 4) * 8 + res;
    }

    -static inline int find_next_zero_bit(const unsigned long *vaddr, int size,
    - int offset)
    +static inline unsigned long __find_next_zero_bit(const unsigned long *vaddr,
    + unsigned long size, unsigned long offset)
    {
    const unsigned long *p = vaddr + (offset >> 5);
    int bit = offset & 31UL, res;
    @@ -246,8 +246,8 @@ out:
    return ((long)p - (long)vaddr - 4) * 8 + res;
    }

    -static inline int find_next_bit(const unsigned long *vaddr, int size,
    - int offset)
    +static inline unsigned long __find_next_bit(const unsigned long *vaddr,
    + unsigned long size, unsigned long offset)
    {
    const unsigned long *p = vaddr + (offset >> 5);
    int bit = offset & 31UL, res;
    diff --git a/include/asm-powerpc/bitops.h b/include/asm-powerpc/bitops.h
    index 220d9a7..8ae6667 100644
    --- a/include/asm-powerpc/bitops.h
    +++ b/include/asm-powerpc/bitops.h
    @@ -317,8 +317,6 @@ static __inline__ int fls(unsigned int x)
    #include

    #define find_first_zero_bit(addr, size) find_next_zero_bit((addr), (size), 0)
    -unsigned long find_next_zero_bit(const unsigned long *addr,
    - unsigned long size, unsigned long offset);
    /**
    * find_first_bit - find the first set bit in a memory region
    * @addr: The address to start the search at
    @@ -328,8 +326,6 @@ unsigned long find_next_zero_bit(const unsigned long *addr,
    * containing a bit.
    */
    #define find_first_bit(addr, size) find_next_bit((addr), (size), 0)
    -unsigned long find_next_bit(const unsigned long *addr,
    - unsigned long size, unsigned long offset);

    /* Little-endian versions */

    diff --git a/include/asm-s390/bitops.h b/include/asm-s390/bitops.h
    index 965394e..13a5c33 100644
    --- a/include/asm-s390/bitops.h
    +++ b/include/asm-s390/bitops.h
    @@ -691,9 +691,9 @@ static inline unsigned long find_first_bit(const unsigned long * addr,
    * @offset: The bitnumber to start searching at
    * @size: The maximum size to search
    */
    -static inline int find_next_zero_bit (const unsigned long * addr,
    - unsigned long size,
    - unsigned long offset)
    +static inline unsigned long __find_next_zero_bit(const unsigned long * addr,
    + unsigned long size,
    + unsigned long offset)
    {
    const unsigned long *p;
    unsigned long bit, set;
    @@ -727,9 +727,9 @@ static inline int find_next_zero_bit (const unsigned long * addr,
    * @offset: The bitnumber to start searching at
    * @size: The maximum size to search
    */
    -static inline int find_next_bit (const unsigned long * addr,
    - unsigned long size,
    - unsigned long offset)
    +static inline unsigned long __find_next_bit(const unsigned long * addr,
    + unsigned long size,
    + unsigned long offset)
    {
    const unsigned long *p;
    unsigned long bit, set;
    diff --git a/include/linux/bitops.h b/include/linux/bitops.h
    index 15360f9..68be268 100644
    --- a/include/linux/bitops.h
    +++ b/include/linux/bitops.h
    @@ -73,7 +73,6 @@ static inline unsigned fls_long(unsigned long l)
    }

    #ifdef __KERNEL__
    -#ifdef CONFIG_GENERIC_FIND_NEXT_BIT
    extern unsigned long __find_next_bit(const unsigned long *addr,
    unsigned long size, unsigned long offset);

    @@ -147,6 +146,5 @@ find_next_zero_bit(const unsigned long *addr, unsigned long size,
    /* size is not constant or too big */
    return __find_next_zero_bit(addr, size, offset);
    }
    -#endif /* CONFIG_GENERIC_FIND_NEXT_BIT */
    #endif /* __KERNEL__ */
    #endif


  18. Re: [PATCH] x86: Change x86 to use generic find_next_bit

    On Sun, Mar 09, 2008 at 09:01:04PM +0100, Alexander van Heukelum wrote:
    > x86: Change x86 to use the generic find_next_bit implementation
    >
    > The versions with inline assembly are in fact slower on the machines I
    > tested them on (in userspace) (Athlon XP 2800+, p4-like Xeon 2.8GHz, AMD
    > Opteron 270). The i386-version needed a fix similar to 06024f21 to avoid
    > crashing the benchmark.
    >
    > Benchmark using: gcc -fomit-frame-pointer -Os. For each bitmap size
    > 1...512, for each possible bitmap with one bit set, for each possible
    > offset: find the position of the first bit starting at offset. If you
    > follow . Times include setup of the bitmap and checking of the
    > results.
    >
    > Athlon Xeon Opteron 32/64bit
    > x86-specific: 0m3.692s 0m2.820s 0m3.196s / 0m2.480s
    > generic: 0m2.622s 0m1.662s 0m2.100s / 0m1.572s
    >
    > If the bitmap size is not a multiple of BITS_PER_LONG, and no set
    > (cleared) bit is found, find_next_bit (find_next_zero_bit) returns a
    > value outside of the range [0,size]. The generic version always returns
    > exactly size. The generic version also uses unsigned long everywhere,
    > while the x86 versions use a mishmash of int, unsigned (int), long and
    > unsigned long.
    >


    This problem is observed on x86_64 and powerpc also. We need a
    long-aligned address for test_bit, set_bit, find_bit, etc. In ext4
    we have to make sure we align the address passed to

    ext4_test_bit
    ext4_set_bit
    ext4_find_next_zero_bit
    ext4_find_next_bit

    fs/ext4/mballoc.c has some examples.

    -aneesh

  19. Re: [PATCH] x86: Change x86 to use generic find_next_bit


    On Thu, 13 Mar 2008 18:14:24 +0530, "Aneesh Kumar K.V"
    said:
    > On Sun, Mar 09, 2008 at 09:01:04PM +0100, Alexander van Heukelum wrote:
    > > x86: Change x86 to use the generic find_next_bit implementation
    > >
    > > The versions with inline assembly are in fact slower on the machines I
    > > tested them on (in userspace) (Athlon XP 2800+, p4-like Xeon 2.8GHz, AMD
    > > Opteron 270). The i386-version needed a fix similar to 06024f21 to avoid
    > > crashing the benchmark.
    > >
    > > Benchmark using: gcc -fomit-frame-pointer -Os. For each bitmap size
    > > 1...512, for each possible bitmap with one bit set, for each possible
    > > offset: find the position of the first bit starting at offset. If you
    > > follow . Times include setup of the bitmap and checking of the
    > > results.
    > >
    > > Athlon Xeon Opteron 32/64bit
    > > x86-specific: 0m3.692s 0m2.820s 0m3.196s / 0m2.480s
    > > generic: 0m2.622s 0m1.662s 0m2.100s / 0m1.572s
    > >
    > > If the bitmap size is not a multiple of BITS_PER_LONG, and no set
    > > (cleared) bit is found, find_next_bit (find_next_zero_bit) returns a
    > > value outside of the range [0,size]. The generic version always returns
    > > exactly size. The generic version also uses unsigned long everywhere,
    > > while the x86 versions use a mishmash of int, unsigned (int), long and
    > > unsigned long.

    >
    > This problem is observed on x86_64 and powerpc also.


    I'm not entirely sure it is a problem. In some cases it
    certainly is an inconvenience, though. I mentioned the
    difference between the old and generic versions because
    existing code might depend on this behaviour.

    Indeed, I see for example (in fs/ext4/mballoc.c):

    bit = mb_find_next_zero_bit(bitmap_bh->b_data, end, bit);
    if (bit >= end)
            break;
    next = mb_find_next_bit(bitmap_bh->b_data, end, bit);
    if (next > end)
            next = end;
    free += next - bit;

    So here it needed to adjust the value.

    > We need a long
    > aligned address for test_bit, set_bit find_bit etc. In ext4 we have
    > to make sure we align the address passed to
    >
    > ext4_test_bit
    > ext4_set_bit
    > ext4_find_next_zero_bit
    > ext4_find_next_bit
    >
    > fs/ext4/mballoc.c have some examples.


    This is a different 'problem'. find_next_bit works on arrays of long,
    while the bitmaps in ext4_find_next_bit are of type void * and seem
    not to have any alignment restrictions. ext4 implements wrappers
    around find_next_bit to solve that 'problem'.
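
    Such a wrapper might look roughly like this in userspace C. This
    is a hypothetical sketch, not ext4's actual mb_find_next_bit: it
    assumes a little-endian host, and the bit-by-bit next_bit() loop
    is only a stand-in for the real word-at-a-time scan. The idea is
    to round an arbitrary byte pointer down to a long boundary and
    compensate in the bit offset:

    ```c
    #include <assert.h>
    #include <limits.h>
    #include <stdint.h>

    #define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

    /* Naive stand-in bit scan over a long-aligned bitmap. */
    static unsigned long next_bit(const unsigned long *addr,
                                  unsigned long size, unsigned long offset)
    {
            unsigned long i;

            for (i = offset; i < size; i++)
                    if (addr[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)))
                            return i;
            return size;
    }

    /* Hypothetical wrapper in the spirit of ext4's: align the byte
     * pointer down to a long boundary, shift the search window by the
     * skipped bits, and subtract them again from the result.
     * Assumes a little-endian byte bitmap. */
    static unsigned long aligned_next_bit(const void *addr,
                                          unsigned long size,
                                          unsigned long offset)
    {
            uintptr_t p = (uintptr_t)addr;
            unsigned long fix = (p & (sizeof(unsigned long) - 1)) * CHAR_BIT;
            const unsigned long *base = (const unsigned long *)
                    (p - (p & (sizeof(unsigned long) - 1)));

            return next_bit(base, size + fix, offset + fix) - fix;
    }

    int main(void)
    {
            /* Union guarantees long alignment of the backing storage. */
            union {
                    unsigned long l[4];
                    unsigned char c[4 * sizeof(unsigned long)];
            } u = { { 0 } };
            unsigned char *map = u.c + 1;   /* deliberately misaligned */

            map[1] |= 0x10;                 /* bit 12 of the byte bitmap */
            assert(aligned_next_bit(map, 64, 0) == 12);
            assert(aligned_next_bit(map, 64, 13) == 64);    /* none found */
            return 0;
    }
    ```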

    The question that arises is: do we want find_first_bit, find_next_bit,
    etc. to always return a value in the range [0, size], or do we want
    to allow implementations that return [0, size-1] if a bit is
    found and something else (roundup(size, bitsperlong) or ULONG_MAX,
    for example) if none were found?

    The current x86_64 versions of find_first_bit and find_next_bit return
    roundup(size,bitsperlong) if no bits were found, but on the other hand
    I guess most bitmaps are a multiple of bitsperlong bits in size, which
    hides the difference.
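
    The difference can be illustrated with a rough userspace sketch.
    This is not kernel code: the bit-by-bit loop merely stands in for
    the word-at-a-time implementations, and old_style_not_found() only
    models the not-found return value described above:

    ```c
    #include <assert.h>
    #include <limits.h>

    #define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

    /* Generic-style scan: always returns exactly 'size' when nothing
     * is found, even if size is not a multiple of BITS_PER_LONG. */
    static unsigned long generic_find_next_bit(const unsigned long *addr,
                                               unsigned long size,
                                               unsigned long offset)
    {
            unsigned long i;

            for (i = offset; i < size; i++)
                    if (addr[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)))
                            return i;
            return size;
    }

    /* What the old word-at-a-time loops effectively returned in the
     * not-found case: the bitmap size rounded up to a whole word. */
    static unsigned long old_style_not_found(unsigned long size)
    {
            return (size + BITS_PER_LONG - 1) & ~(BITS_PER_LONG - 1);
    }

    int main(void)
    {
            unsigned long map[4] = { 0 };   /* 70-bit bitmap, all clear */
            unsigned long size = 70;

            assert(generic_find_next_bit(map, size, 0) == size);
            assert(old_style_not_found(size) > size);   /* past the bitmap */
            return 0;
    }
    ```

    Callers written against the old behaviour thus need a clamp like
    the one in the mballoc.c snippet above.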

    Greetings,
    Alexander

    > -aneesh

    --
    Alexander van Heukelum
    heukelum@fastmail.fm

    --
    http://www.fastmail.fm - Or how I learned to stop worrying and
    love email again

