[PATCH] procfs: provide slub's /proc/slabinfo - Kernel

Thread: [PATCH] procfs: provide slub's /proc/slabinfo

  1. Re: [RFC PATCH] greatly reduce SLOB external fragmentation


    On Thu, 2008-01-10 at 08:13 -0800, Linus Torvalds wrote:
    >
    > On Thu, 10 Jan 2008, Pekka J Enberg wrote:
    > >
    > > We probably don't have the same version of GCC which perhaps affects
    > > memory layout (struct sizes) and thus allocation patterns?

    >
    > No, struct sizes will not change with compiler versions - that would
    > create binary incompatibilities for libraries etc.
    >
    > So apart from the kernel itself working around some old gcc bugs by making
    > spinlocks have different size depending on the compiler version, sizes of
    > structures should be the same (as long as the configuration is the same,
    > of course).
    >
    > However, a greedy first-fit allocator will be *very* sensitive to
    > allocation pattern differences, so timing will probably make a big
    > difference. In contrast, SLUB and SLAB both use fixed sizes per allocation
    > queue, which makes them almost totally impervious to any allocation
    > pattern from different allocation sizes (they still end up caring about
    > the pattern *within* one size, but those tend to be much stabler).
    >
    > There really is a reason why traditional heaps with first-fit are almost
    > never used for any real loads.
    >
    > (I'm not a fan of slabs per se - I think all the constructor/destructor
    > crap is just that: total crap - but the size/type binning is a big deal,
    > and I think SLOB was naïve to think a pure first-fit makes any sense. Now
    > you guys are size-binning by just two or three bins, and it seems to make
    > a difference for some loads, but compared to SLUB/SLAB it's a total hack).


    Here I'm going to differ with you. The premises of the SLAB concept
    (from the original paper) are:

    a) fragmentation of conventional allocators gets worse over time
    b) grouping objects of the same -type- (not size) together should mean
    they have similar lifetimes and thereby keep fragmentation low
    c) slabs can be O(1)
    d) constructors and destructors are cache-friendly

    There's some truth to (a), but the problem has been quite overstated,
    pre-SLAB Linux kernels and countless other systems run for years. And
    (b) is known to be false, you just have to look at our dcache and icache
    pinning. (c) of course is a whole separate argument and often trumps the
    others. And (d) is pretty much crap now too.

    And as it happens, SLOB basically always beats SLAB on size.

    SLUB only wins when it starts merging caches of different -types- based
    on -size-, effectively throwing out the whole (b) concept. Which is
    good, because its wrong. So SLUB starts to look a lot like a purely
    size-binned allocator.

    > I would suggest that if you guys are really serious about memory use, try
    > to do a size-based heap thing, and do best-fit in that heap. Or just some
    > really simple size-based binning, eg
    >
    > if (size > 2*PAGE_SIZE)
    > goto page_allocator;


    I think both SLOB and SLUB punt everything >= PAGE_SIZE off to the page
    allocator.

    > bin = lookup_bin[(size+31) >> 5]
    >
    > or whatever. Because first-fit is *known* to be bad.


    It is known to be crummy, but it is NOT known to actually be worse than
    the SLAB/SLUB approach when you consider both internal and external
    fragmentation. Power-of-two ala SLAB kmalloc basically guarantees your
    internal fragmentation will be 30% or so.
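    As a rough back-of-the-envelope illustration of that power-of-two cost (my numbers, not from the thread), consider a ~700-byte object, which a power-of-two kmalloc rounds up to the 1024-byte bucket:

    /*
     * Standalone userspace sketch: round a request up to the next power of
     * two, the way a power-of-two kmalloc bucket scheme does, and report
     * the internal fragmentation.  The 700-byte figure is just an example.
     */
    #include <stdio.h>

    static unsigned long roundup_pow2(unsigned long n)
    {
    	unsigned long p = 1;

    	while (p < n)
    		p <<= 1;
    	return p;
    }

    int main(void)
    {
    	unsigned long size = 700;			/* requested size   */
    	unsigned long bucket = roundup_pow2(size);	/* 1024-byte bucket */

    	printf("bucket %lu, wasted %lu bytes (%.0f%%)\n",
    	       bucket, bucket - size,
    	       100.0 * (bucket - size) / bucket);	/* prints ~32%      */
    	return 0;
    }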

    In fact, first-fit is known to be pretty good if the ratio of object
    size to arena size is reasonable. I've already shown a system booting
    with 6% overhead compared to SLUB's 35% (and SLAB's nearly 70%). The
    fragmentation measurements for the small object list are in fact quite
    nice. Not the best benchmark, but it certainly shows that there's
    substantial room for improvement.

    Where it hurts is larger objects (task structs, SKBs), which are also a
    problem for SLAB/SLUB and I don't think any variation on the search
    order is going to help there. If we threw 64k pages at it, those
    problems might very well disappear. Hell, it's pretty tempting to throw
    vmalloc at it, especially on small boxes..

    > At try to change it to address-ordered first-fit or something (which is
    > much more complex than just plain LIFO, but hey, that's life).


    Will think about that. Most of the literature here is of limited
    usefulness - even Knuth didn't look at 4k heaps, let alone collections
    of them.

    > I haven't checked much, but I *think* SLOB is just basic first-fit
    > (perhaps the "next-fit" variation?) Next-fit is known to be EVEN WORSE
    > than the simple first-fit when it comes to fragmentation (so no, Knuth was
    > not always right - let's face it, much of Knuth is simply outdated).


    The SLOB allocator is inherently two-level. I'm using a next-fit-like
    approach to decide which heap (page) to use and first-fit within that
    heap.
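
    (A simplified sketch of that two-level scheme, purely for illustration and not the actual mm/slob.c code; the two helpers declared below are hypothetical:)

    /*
     * Level 1: next-fit across partially free pages, resuming at a cursor.
     * Level 2: first-fit within the chosen page.
     */
    struct slob_page_sketch {
    	struct slob_page_sketch *next;	/* list of partially free pages */
    	unsigned int free;		/* free bytes left in this page */
    };

    /* Hypothetical helpers, assumed to exist for the sketch. */
    void *first_fit_in_page(struct slob_page_sketch *p, unsigned int size);
    void *carve_from_new_page(unsigned int size);

    static struct slob_page_sketch *page_list;	/* all partially free pages    */
    static struct slob_page_sketch *cursor;		/* next-fit: where we left off */

    void *slob_alloc_sketch(unsigned int size)
    {
    	struct slob_page_sketch *start, *p;

    	start = cursor ? cursor : page_list;
    	if (!start)
    		return carve_from_new_page(size);

    	p = start;
    	do {	/* level 1: walk pages starting at the rotating cursor */
    		if (p->free >= size) {
    			void *obj = first_fit_in_page(p, size);	/* level 2 */

    			if (obj) {
    				cursor = p;	/* resume here next time */
    				return obj;
    			}
    		}
    		p = p->next ? p->next : page_list;
    	} while (p != start);

    	return carve_from_new_page(size);	/* nothing fit: new page */
    }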

    --
    Mathematics is the supreme nostalgia of our time.


  2. Re: [RFC PATCH] greatly reduce SLOB external fragmentation

    > I would suggest that if you guys are really serious about memory use, try
    > to do a size-based heap thing, and do best-fit in that heap. Or just some


    iirc best fit usually also has some nasty long term fragmentation behaviour.
    That is why it is usually also not used.

    -Andi

  3. Re: [RFC PATCH] greatly reduce SLOB external fragmentation



    On Thu, 10 Jan 2008, Matt Mackall wrote:
    > >
    > > (I'm not a fan of slabs per se - I think all the constructor/destructor
    > > crap is just that: total crap - but the size/type binning is a big deal,
    > > and I think SLOB was naïve to think a pure first-fit makes any sense. Now
    > > you guys are size-binning by just two or three bins, and it seems to make
    > > a difference for some loads, but compared to SLUB/SLAB it's a total hack).

    >
    > Here I'm going to differ with you. The premises of the SLAB concept
    > (from the original paper) are:


    I really don't think we differ.

    The advantage of slab was largely the binning by type. Everything else was
    just a big crock. SLUB does the binning better, by really just making the
    type binning be about what really matters - the *size* of the type.

    So my argument was that the type/size binning makes sense (size more so
    than type), but the rest of the original Sun arguments for why slab was
    such a great idea were basically just the crap.

    Hard type binning was a mistake (but needed by slab due to the idiotic
    notion that constructors/destructors are "good for caches" - bleargh). I
    suspect that hard size binning is a mistake too (ie there are probably
    cases where you do want to split unused bigger size areas), but the fact
    that all of our allocators are two-level (with the page allocator acting
    as a size-agnostic free space) may help it somewhat.

    And yes, I do agree that any current allocator has problems with the big
    sizes that don't fit well into a page or two (like task_struct). That
    said, most of those don't have lots of allocations under many normal
    circumstances (even if there are uses that will really blow them up).

    The *big* slab users at least for me tend to be ext3_inode_cache and
    dentry. Everything else is orders of magnitude less. And of the two bad
    ones, ext3_inode_cache is the bad one at 700+ bytes or whatever (resulting
    in ~10% fragmentation just due to the page thing, regardless of whether
    you use an order-0 or order-1 page allocation).

    Of course, dentries fit better in a page (due to being smaller), but then
    the bigger number of dentries per page make it harder to actually free
    pages, so then you get fragmentation from that. Oh well. You can't win.

    Linus

  4. Re: [RFC PATCH] greatly reduce SLOB external fragmentation


    On Thu, 2008-01-10 at 10:28 -0800, Linus Torvalds wrote:
    >
    > On Thu, 10 Jan 2008, Matt Mackall wrote:
    > > >
    > > > (I'm not a fan of slabs per se - I think all the constructor/destructor
    > > > crap is just that: total crap - but the size/type binning is a big deal,
    > > > and I think SLOB was naïve to think a pure first-fit makes any sense. Now
    > > > you guys are size-binning by just two or three bins, and it seems to make
    > > > a difference for some loads, but compared to SLUB/SLAB it's a total hack).

    > >
    > > Here I'm going to differ with you. The premises of the SLAB concept
    > > (from the original paper) are:

    >
    > I really don't think we differ.
    >
    > The advantage of slab was largely the binning by type. Everything else was
    > just a big crock. SLUB does the binning better, by really just making the
    > type binning be about what really matters - the *size* of the type.
    >
    > So my argument was that the type/size binning makes sense (size more so
    > than type), but the rest of the original Sun arguments for why slab was
    > such a great idea were basically just the crap.
    >
    > Hard type binning was a mistake (but needed by slab due to the idiotic
    > notion that constructors/destructors are "good for caches" - bleargh). I
    > suspect that hard size binning is a mistake too (ie there are probably
    > cases where you do want to split unused bigger size areas), but the fact
    > that all of our allocators are two-level (with the page allocator acting
    > as a size-agnostic free space) may help it somewhat.
    >
    > And yes, I do agree that any current allocator has problems with the big
    > sizes that don't fit well into a page or two (like task_struct). That
    > said, most of those don't have lots of allocations under many normal
    > circumstances (even if there are uses that will really blow them up).
    >
    > The *big* slab users at least for me tend to be ext3_inode_cache and
    > dentry. Everything else is orders of magnitude less. And of the two bad
    > ones, ext3_inode_cache is the bad one at 700+ bytes or whatever (resulting
    > in ~10% fragmentation just due to the page thing, regardless of whether
    > you use an order-0 or order-1 page allocation).
    >
    > Of course, dentries fit better in a page (due to being smaller), but then
    > the bigger number of dentries per page make it harder to actually free
    > pages, so then you get fragmentation from that. Oh well. You can't win.


    One idea I've been kicking around is pushing the boundary for the buddy
    allocator back a bit (to 64k, say) and using SL*B under that. The page
    allocators would call into buddy for larger than 64k (rare!) and SL*B
    otherwise. This would let us greatly improve our handling of things like
    task structs and skbs and possibly also things like 8k stacks and jumbo
    frames. As SL*B would never be competing with the page allocator for
    contiguous pages (the buddy allocator's granularity would be 64k), I
    don't think this would exacerbate the page-level fragmentation issues.

    Crazy?

    --
    Mathematics is the supreme nostalgia of our time.


  5. Re: [RFC PATCH] greatly reduce SLOB external fragmentation

    On Thu, 10 Jan 2008, Matt Mackall wrote:

    > Here I'm going to differ with you. The premises of the SLAB concept
    > (from the original paper) are:
    >
    > a) fragmentation of conventional allocators gets worse over time


    Even fragmentation of SLAB/SLUB gets worse over time. That is why we need
    a defrag solution.

    > b) grouping objects of the same -type- (not size) together should mean
    > they have similar lifetimes and thereby keep fragmentation low


    I agree that is crap. The lifetimes argument is mostly only exploitable in
    benchmarks. That is why SLUB just groups them by size if possible.

    > d) constructors and destructors are cache-friendly


    I agree. Crap too. We removed the destructors. The constructors are needed
    so that objects in slab pages always have a definite state. That is f.e.
    necessary for slab defragmentation because it has to be able to inspect an
    object at an arbitrary time and either remove it or move it to another
    slab page.

    Constructors also make sense because the initialization of a cache object
    may be expensive. Initializing list heads and spinlocks can take some code
    and that code can be omitted if objects have a definite state when they
    are free. We saw that when measuring the buffer_head constructors effect
    on performance.
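
    (For readers unfamiliar with slab constructors, a hedged example of the kind of thing being discussed; the cache and object here are hypothetical, not code from the thread:)

    #include <linux/errno.h>
    #include <linux/init.h>
    #include <linux/list.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    /* Hypothetical object whose lock and list head are set up once per object. */
    struct foo {
    	spinlock_t lock;
    	struct list_head list;
    };

    static struct kmem_cache *foo_cache;

    /*
     * Runs when a slab page is populated, so every free object already has a
     * definite state and the allocation fast path can skip this work.
     */
    static void foo_ctor(void *obj)
    {
    	struct foo *f = obj;

    	spin_lock_init(&f->lock);
    	INIT_LIST_HEAD(&f->list);
    }

    static int __init foo_cache_init(void)
    {
    	foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo), 0,
    				      SLAB_HWCACHE_ALIGN, foo_ctor);
    	return foo_cache ? 0 : -ENOMEM;
    }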

  6. Re: [RFC PATCH] greatly reduce SLOB external fragmentation


    On Thu, 2008-01-10 at 11:16 -0800, Christoph Lameter wrote:
    > On Thu, 10 Jan 2008, Matt Mackall wrote:
    >
    > > Here I'm going to differ with you. The premises of the SLAB concept
    > > (from the original paper) are:
    > >
    > > a) fragmentation of conventional allocators gets worse over time

    >
    > Even fragmentation of SLAB/SLUB gets worse over time. That is why we need
    > a defrag solution.
    >
    > > b) grouping objects of the same -type- (not size) together should mean
    > > they have similar lifetimes and thereby keep fragmentation low

    >
    > I agree that is crap. The lifetimes argument is mostly only exploitable in
    > benchmarks. That is why SLUB just groups them by size if possible.
    >
    > > d) constructors and destructors are cache-friendly

    >
    > I agree. Crap too. We removed the destructors. The constructors are needed
    > so that objects in slab pages always have a definite state. That is f.e.
    > necessary for slab defragmentation because it has to be able to inspect an
    > object at an arbitrary time and either remove it or move it to another
    > slab page.


    Are you saying that the state of -freed- objects matters for your active
    defragmentation? That's odd.

    > Constructors also make sense because the initialization of a cache object
    > may be expensive. Initializing list heads and spinlocks can take some code
    > and that code can be omitted if objects have a definite state when they
    > are free. We saw that when measuring the buffer_head constructors effect
    > on performance.


    Hmm. SLOB proves that you don't need to segregate objects based on
    constructors, so you could combine even slabs that have constructors and
    just delay construction until allocation. I'm surprised constructors
    have measurable advantage..

    --
    Mathematics is the supreme nostalgia of our time.


  7. Re: [RFC PATCH] greatly reduce SLOB external fragmentation

    On Thu, 10 Jan 2008, Matt Mackall wrote:

    > One idea I've been kicking around is pushing the boundary for the buddy
    > allocator back a bit (to 64k, say) and using SL*B under that. The page
    > allocators would call into buddy for larger than 64k (rare!) and SL*B
    > otherwise. This would let us greatly improve our handling of things like
    > task structs and skbs and possibly also things like 8k stacks and jumbo
    > frames. As SL*B would never be competing with the page allocator for
    > contiguous pages (the buddy allocator's granularity would be 64k), I
    > don't think this would exacerbate the page-level fragmentation issues.


    This would create another large page size (and that would have my
    enthusiastic support). It would decrease listlock effect drastically for
    SLUB. Even the initial simplistic implementation of SLUB was superior on
    the database transaction tests (I think it was up ~1%) on IA64 from the
    get go. Likely due to the larger 16k page size there. The larger page size
    could also be used for the page cache (ducking and running.....)? A 64k
    page size that could be allocated without zone locks would be very good
    for SLUB.

    However, isn't this basically confessing that the page allocator is not
    efficient for 4k page allocations? I have seen some weaknesses there. The
    overhead in the allocation path in particular is bad and I was thinking
    about applying the same ideas used in SLUB also to the page allocator in
    order to bring the cycle count down from 500-1000 to 60 or so. Since both
    SLUB and SLOB use the page allocator for allocs >PAGE_SIZE this would not
    only benefit the general kernel but also the slab allocations.

  8. Re: [RFC PATCH] greatly reduce SLOB external fragmentation

    On Thu, 10 Jan 2008, Matt Mackall wrote:

    > > I agree. Crap too. We removed the destructors. The constructors are needed
    > > so that objects in slab pages always have a definite state. That is f.e.
    > > necessary for slab defragmentation because it has to be able to inspect an
    > > object at an arbitrary time and either remove it or move it to another
    > > slab page.

    >
    > Are you saying that the state of -freed- objects matters for your active
    > defragmentation? That's odd.


    The state of the object immediately after it is allocated matters for a
    defrag solution. A kmalloc leads to an object in an undetermined state if
    you have no constructor. Code will then initialize the object but defrag
    f.e. must be able to inspect the object before. This means either that the
    freed object has a defined state or that kmalloc establishes that state
    before the object is marked as allocated.

    > > Constructors also make sense because the initialization of a cache object
    > > may be expensive. Initializing list heads and spinlocks can take some code
    > > and that code can be omitted if objects have a definite state when they
    > > are free. We saw that when measuring the buffer_head constructors effect
    > > on performance.

    >
    > Hmm. SLOB proves that you don't need to segregate objects based on
    > constructors, so you could combine even slabs that have constructors and
    > just delay construction until allocation. I'm surprised constructors
    > have measurable advantage..


    That is not working if you need to inspect allocated objects at any time
    for a defrag solution. All objects in a defragmentable slab need to have a
    consistent object state if allocated. If you have some without
    constructors then these objects have no defined state and may contain
    arbitrary bytes.


  9. Re: [RFC PATCH] greatly reduce SLOB external fragmentation



    On Thu, 10 Jan 2008, Matt Mackall wrote:
    >
    > One idea I've been kicking around is pushing the boundary for the buddy
    > allocator back a bit (to 64k, say) and using SL*B under that. The page
    > allocators would call into buddy for larger than 64k (rare!) and SL*B
    > otherwise. This would let us greatly improve our handling of things like
    > task structs and skbs and possibly also things like 8k stacks and jumbo
    > frames.


    Yes, something like that may well be reasonable. It could possibly solve
    some of the issues for bigger page cache sizes too, but one issue is that
    many things actually end up having those power-of-two alignment
    constraints too - so an 8kB allocation would often still have to be
    naturally aligned, which then removes some of the freedom.

    > Crazy?


    It sounds like it might be worth trying out - there's just no way to know
    how well it would work. Buddy allocators sure as hell have problems too,
    no question about that. It's not like the page allocator is perfect.

    It's not even clear that a buddy allocator even for the high-order pages
    is at all the right choice. Almost nobody actually wants >64kB blocks, and
    the ones that *do* want bigger allocations tend to want *much* bigger
    ones, so it's quite possible that it could be worth it to have something
    like a three-level allocator:

    - huge pages (superpages for those crazy db people)

    Just a simple linked list of these things is fine, we'd never care
    about coalescing large pages together anyway.

    - "large pages" (on the order of ~64kB) - with *perhaps* a buddy bitmap
    setup to try to coalesce back into huge-pages, but more likely just
    admitting that you'd need something like migration to ever get back a
    hugepage that got split into large-pages.

    So maybe a simple bitmap allocator per huge-page for large pages. Say
    you have a 4MB huge-page, and just a 64-bit free-bitmap per huge-page
    when you split it into large pages.

    - slab/slub/slob for anything else, and "get_free_page()" ends up being
    just a shorthand for saying "naturally aligned kmalloc of size
    "PAGE_SIZE<
    and maybe it would all work out ok.

    Linus
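
    (To make the three levels above concrete, here is a minimal sketch of my own, not code from the thread; every name and constant is invented, there is no locking or freeing, and slxb_alloc() is a stand-in for slab/slub/slob:)

    #include <stddef.h>

    #define BASE_PAGE_SZ	4096UL
    #define LARGE_PAGE_SZ	(64UL * 1024)			/* "large" page        */
    #define HUGE_PAGE_SZ	(4UL * 1024 * 1024)		/* "huge" page         */
    #define LARGES_PER_HUGE	(HUGE_PAGE_SZ / LARGE_PAGE_SZ)	/* 64 -> 64-bit bitmap */

    struct huge_page {
    	struct huge_page *next;
    	unsigned long long large_map;	/* bit set = that 64kB slot is free */
    	char *base;
    };

    static struct huge_page *huge_free;	/* level 1: pristine huge pages         */
    static struct huge_page *huge_split;	/* level 2: huge pages carved into 64kB */

    void *slxb_alloc(unsigned long size);	/* level 3 stand-in (hypothetical)      */

    static void *alloc_huge(void)
    {
    	struct huge_page *hp = huge_free;

    	if (!hp)
    		return NULL;
    	huge_free = hp->next;		/* just a list pop, never coalesced */
    	return hp->base;
    }

    static void *alloc_large(void)
    {
    	struct huge_page *hp;
    	unsigned int bit;

    	/* First try huge pages that are already split into 64kB slots. */
    	for (hp = huge_split; hp; hp = hp->next)
    		for (bit = 0; bit < LARGES_PER_HUGE; bit++)
    			if (hp->large_map & (1ULL << bit)) {
    				hp->large_map &= ~(1ULL << bit);
    				return hp->base + bit * LARGE_PAGE_SZ;
    			}

    	/* Otherwise split a pristine huge page and hand out its first slot. */
    	hp = huge_free;
    	if (!hp)
    		return NULL;
    	huge_free = hp->next;
    	hp->large_map = ~1ULL;		/* slot 0 is handed out right here */
    	hp->next = huge_split;
    	huge_split = hp;
    	return hp->base;
    }

    void *three_level_alloc(unsigned long size)
    {
    	if (size > LARGE_PAGE_SZ)
    		return alloc_huge();	/* level 1: whole huge pages */
    	if (size > 2 * BASE_PAGE_SZ)
    		return alloc_large();	/* level 2: 64kB via bitmap  */
    	return slxb_alloc(size);	/* level 3: slab/slub/slob   */
    }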

  10. Re: [RFC PATCH] greatly reduce SLOB external fragmentation

    On Thu, 10 Jan 2008, Linus Torvalds wrote:

    > It's not even clear that a buddy allocator even for the high-order pages
    > is at all the right choice. Almost nobody actually wants >64kB blocks, and
    > the ones that *do* want bigger allocations tend to want *much* bigger
    > ones, so it's quite possible that it could be worth it to have something
    > like a three-level allocator:


    Excellent! I am definitely on board with this.

    > - huge pages (superpages for those crazy db people)
    >
    > Just a simple linked list of these things is fine, we'd never care
    > about coalescing large pages together anyway.
    >
    > - "large pages" (on the order of ~64kB) - with *perhaps* a buddy bitmap
    > setup to try to coalesce back into huge-pages, but more likely just
    > admitting that you'd need something like migration to ever get back a
    > hugepage that got split into large-pages.
    >
    > So maybe a simple bitmap allocator per huge-page for large pages. Say
    > you have a 4MB huge-page, and just a 64-bit free-bitmap per huge-page
    > when you split it into large pages.
    >
    > - slab/slub/slob for anything else, and "get_free_page()" ends up being
    > just a shorthand for saying "naturally aligned kmalloc of size
    > "PAGE_SIZE< >
    > and maybe it would all work out ok.


    Hmmm... a 3 level allocator? Basically we would have BASE_PAGE
    STANDARD_PAGE and HUGE_PAGE? We could simply extend the page allocator to
    have 3 pcp lists for these sizes and go from there?

    Thinking about the arches this would mean

               BASE_PAGE   STANDARD_PAGE   HUGE_PAGE
    x86_64     4k          64k             2M
    i386       4k          16k             4M
    ia64       16k         256k            1G

    ?


  11. Re: [RFC PATCH] greatly reduce SLOB external fragmentation


    On Thu, 2008-01-10 at 11:24 -0800, Christoph Lameter wrote:
    > On Thu, 10 Jan 2008, Matt Mackall wrote:
    >
    > > One idea I've been kicking around is pushing the boundary for the buddy
    > > allocator back a bit (to 64k, say) and using SL*B under that. The page
    > > allocators would call into buddy for larger than 64k (rare!) and SL*B
    > > otherwise. This would let us greatly improve our handling of things like
    > > task structs and skbs and possibly also things like 8k stacks and jumbo
    > > frames. As SL*B would never be competing with the page allocator for
    > > contiguous pages (the buddy allocator's granularity would be 64k), I
    > > don't think this would exacerbate the page-level fragmentation issues.

    >
    > This would create another large page size (and that would have my
    > enthusiastic support).


    Well, I think we'd still have the same page size, in the sense that we'd
    have a struct page for every hardware page and we'd still have hardware
    page-sized pages in the page cache. We'd just change how we allocated
    them. Right now we've got a stack that looks like:

    buddy / page allocator
    SL*B allocator
    kmalloc

    And we'd change that to:

    buddy allocator
    SL*B allocator
    page allocator / kmalloc

    So get_free_page() would still hand you back a hardware page, it would
    just do it through SL*B.

    > It would decrease listlock effect drastically for SLUB.


    Not sure what you're referring to here.

    > However, isn't this basically confessing that the page allocator is not
    > efficient for 4k page allocations?


    Well I wasn't thinking of doing this for any performance reasons. But
    there certainly could be some.
    --
    Mathematics is the supreme nostalgia of our time.


  12. Re: [RFC PATCH] greatly reduce SLOB external fragmentation

    > - huge pages (superpages for those crazy db people)
    >
    > Just a simple linked list of these things is fine, we'd never care
    > about coalescing large pages together anyway.


    I did essentially that for my GBpages hugetlbfs patchkit. GB pages are already
    beyond MAX_ORDER and increasing MAX_ORDER didn't seem attractive because
    it would require aligning the zones all to 1GB, which would be quite nasty.

    So it just grabbed them out of bootmem early and managed them in a
    per node list.

    Not sure it's a good idea for 2MB pages though.

    -Andi

  13. Re: [RFC PATCH] greatly reduce SLOB external fragmentation

    On Thu, 10 Jan 2008, Matt Mackall wrote:

    > Well, I think we'd still have the same page size, in the sense that we'd
    > have a struct page for every hardware page and we'd still have hardware
    > page-sized pages in the page cache. We'd just change how we allocated
    > them. Right now we've got a stack that looks like:


    We would not change the hardware page. Cannot do that. But we would have
    preferential treatment for 64k and 2M pages in the page allocator?

    > buddy / page allocator
    > SL*B allocator
    > kmalloc
    >
    > And we'd change that to:
    >
    > buddy allocator
    > SL*B allocator
    > page allocator / kmalloc
    >
    > So get_free_page() would still hand you back a hardware page, it would
    > just do it through SL*B.


    Hmm.... Not sure what effect this would have. We already have the pcp's
    that have a similar effect.

    > > It would decrease listlock effect drastically for SLUB.

    >
    > Not sure what you're referring to here.


    Allocations in 64k chunks mean 16 times fewer basic allocation blocks to
    manage for the slab allocators. So the metadata to be maintained by the
    allocators is reduced by that factor. SLUB only needs to touch the
    list_lock (in some situations like a free to a non cpu slab) if a block
    becomes completely empty or is going from fully allocated to partially
    allocated. The larger the block size, the more objects are in a block and
    the fewer of these actions that need a per node lock are needed.
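
    (A quick back-of-the-envelope check on that factor, using made-up object sizes rather than anything from the mail:)

    /* With 256-byte objects: 16 objects per 4kB block vs 256 per 64kB block,
     * so the empty/full transitions that take the per-node list_lock become
     * roughly 16x rarer per object allocated. */
    #include <stdio.h>

    int main(void)
    {
    	unsigned int obj = 256;

    	printf("objects per 4kB block:  %u\n", 4096 / obj);	/* 16  */
    	printf("objects per 64kB block: %u\n", 65536 / obj);	/* 256 */
    	return 0;
    }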


  14. Re: [RFC PATCH] greatly reduce SLOB external fragmentation

    On Thu, 10 Jan 2008, Andi Kleen wrote:

    > I did essentially that for my GBpages hugetlbfs patchkit. GB pages are already
    > beyond MAX_ORDER and increasing MAX_ORDER didn't seem attractive because
    > it would require aligning the zones all to 1GB, which would be quite nasty.


    I am very very interested in that work and I could not find it when I
    looked for it a couple of weeks back. Can you send me a copy?


  15. Re: [RFC PATCH] greatly reduce SLOB external fragmentation

    Hi Linus,

    (I'm replying to an old thread.)

    On Thu, 10 Jan 2008, Linus Torvalds wrote:
    > I would suggest that if you guys are really serious about memory use, try
    > to do a size-based heap thing, and do best-fit in that heap. Or just some
    > really simple size-based binning, eg
    >
    > if (size > 2*PAGE_SIZE)
    > goto page_allocator;
    > bin = lookup_bin[(size+31) >> 5];
    >
    > or whatever. Because first-fit is *known* to be bad.
    >
    > At try to change it to address-ordered first-fit or something (which is
    > much more complex than just plain LIFO, but hey, that's life).
    >
    > I haven't checked much, but I *think* SLOB is just basic first-fit
    > (perhaps the "next-fit" variation?) Next-fit is known to be EVEN WORSE
    > than the simple first-fit when it comes to fragmentation (so no, Knuth was
    > not always right - let's face it, much of Knuth is simply outdated).


    In case you're interested, it turns out that best-fit with binning gives
    roughly the same results as SLOB (or alternatively, I messed something up
    with the design and implementation). The interesting bit here is that
    BINALLOC is more stable than SLOB (almost as stable as SLUB in my tests).

    Pekka

    Subject: [PATCH] binalloc: best-fit allocation with binning
    From: Pekka Enberg

    As suggested by Linus, to optimize memory use, I have implemented a best-fit
    general purpose kernel memory allocator with binning. You can find the original
    discussion here:

    http://lkml.org/lkml/2008/1/10/229

    The results are as follows:

    [ the minimum, maximum, and average are captured from 25 individual runs ]

                      Total     Free (kB)                         Used (kB)
                      (kB)      min     max     average    sd     min   max   average
    SLOB              122372    117676  117768  117721.12  20.51  4604  4696  4650.88
    BINALLOC          122368    117672  117732  117699.68  16.74  4636  4696  4668.32
    SLUB (no debug)   122360    117284  117328  117308.96  15.27  5032  5076  5051.04

    Thanks to Vegard Nossum for his help with debugging and testing the allocator
    and to Johannes Weiner for fixing my bugs.

    Cc: Vegard Nossum
    Cc: Johannes Weiner
    Signed-off-by: Pekka Enberg
    ---
    diff --git a/include/linux/binalloc_def.h b/include/linux/binalloc_def.h
    new file mode 100644
    index 0000000..a6ea99e
    --- /dev/null
    +++ b/include/linux/binalloc_def.h
    @@ -0,0 +1,34 @@
    +#ifndef __LINUX_BINALLOC_DEF_H
    +#define __LINUX_BINALLOC_DEF_H
    +
    +void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
    +
    +static inline void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
    +{
    + return kmem_cache_alloc_node(cachep, flags, -1);
    +}
    +
    +void *__kmalloc_node(size_t size, gfp_t flags, int node);
    +
    +static inline void *kmalloc_node(size_t size, gfp_t flags, int node)
    +{
    + return __kmalloc_node(size, flags, node);
    +}
    +
    +/**
    + * kmalloc - allocate memory
    + * @size: how many bytes of memory are required.
    + * @flags: the type of memory to allocate (see kcalloc).
    + *
    + * kmalloc is the normal method of allocating memory
    + * in the kernel.
    + */
    +static inline void *kmalloc(size_t size, gfp_t flags)
    +{
    + return __kmalloc_node(size, flags, -1);
    +}
    +
    +void *__kmalloc(size_t size, gfp_t flags);
    +
    +#endif /* __LINUX_BINALLOC_DEF_H */
    +
    diff --git a/include/linux/slab.h b/include/linux/slab.h
    index 5ff9676..eeda03d 100644
    --- a/include/linux/slab.h
    +++ b/include/linux/slab.h
    @@ -124,6 +124,8 @@ size_t ksize(const void *);
    #include <linux/slub_def.h>
    #elif defined(CONFIG_SLOB)
    #include <linux/slob_def.h>
    +#elif defined(CONFIG_BINALLOC)
    +#include <linux/binalloc_def.h>
    #else
    #include <linux/slab_def.h>
    #endif
    @@ -186,7 +188,7 @@ static inline void *kcalloc(size_t n, size_t size, gfp_t flags)
    return __kmalloc(n * size, flags | __GFP_ZERO);
    }

    -#if !defined(CONFIG_NUMA) && !defined(CONFIG_SLOB)
    +#if !defined(CONFIG_NUMA) && !defined(CONFIG_SLOB) && !defined(CONFIG_BINALLOC)
    /**
    * kmalloc_node - allocate memory from a specific node
    * @size: how many bytes of memory are required.
    diff --git a/init/Kconfig b/init/Kconfig
    index 250e02c..b9a6325 100644
    --- a/init/Kconfig
    +++ b/init/Kconfig
    @@ -774,6 +774,13 @@ config SLOB
    allocator. SLOB is generally more space efficient but
    does not perform as well on large systems.

    +config BINALLOC
    + depends on EMBEDDED
    + bool "BINALLOC"
    + help
    + A best-fit general-purpose kernel memory allocator with binning
    + that tries to be as memory efficient as possible.
    +
    endchoice

    config PROFILING
    diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
    index e1d4764..29e253a 100644
    --- a/lib/Kconfig.debug
    +++ b/lib/Kconfig.debug
    @@ -290,6 +290,13 @@ config SLUB_STATS
    out which slabs are relevant to a particular load.
    Try running: slabinfo -DA

    +config BINALLOC_DEBUG
    + bool "Debug binalloc memory allocations"
    + default n
    + help
    + Say Y here to have the memory allocator do sanity checks for
    + allocation and deallocation.
    +
    config DEBUG_PREEMPT
    bool "Debug preemptible kernel"
    depends on DEBUG_KERNEL && PREEMPT && (TRACE_IRQFLAGS_SUPPORT || PPC64)
    diff --git a/mm/Makefile b/mm/Makefile
    index da4ccf0..94ed767 100644
    --- a/mm/Makefile
    +++ b/mm/Makefile
    @@ -28,6 +28,7 @@ obj-$(CONFIG_SLOB) += slob.o
    obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
    obj-$(CONFIG_SLAB) += slab.o
    obj-$(CONFIG_SLUB) += slub.o
    +obj-$(CONFIG_BINALLOC) += binalloc.o
    obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
    obj-$(CONFIG_FS_XIP) += filemap_xip.o
    obj-$(CONFIG_MIGRATION) += migrate.o
    diff --git a/mm/binalloc.c b/mm/binalloc.c
    new file mode 100644
    index 0000000..ecdfd33
    --- /dev/null
    +++ b/mm/binalloc.c
    @@ -0,0 +1,826 @@
    +/*
    + * A best-fit general purpose kernel memory allocator with binning.
    + *
    + * Copyright (C) 2008 Pekka Enberg
    + *
    + * This file is release under the GPLv2
    + *
    + * I. Overview
    + *
    + * This is a best-fit general purpose kernel memory allocator with binning. We
    + * use the page allocator to allocate one big contiguous chunk that is split
    + * into smaller chunks for successive kmalloc() calls until we run out of space
    + * after which a new page is allocated.
    + *
    + * This memory allocator bears close resemblance to the "Lea" memory allocator
    + * described in:
    + *
    + * http://g.oswego.edu/dl/html/malloc.html
    + *
    + * II. Anatomy of a chunk
    + *
    + * To detect page boundaries, a zero-sized sentinel chunk is placed at the end
    + * of a page. Objects are aligned by padding bytes at the beginning of a chunk
    + * after the boundary tag. The padding is included in the allocated object size
    + * so that neighbor boundary tags can be calculated. Likewise, boundary tags
    + * are aligned by padding bytes at the end of a chunk when splitting the chunk
    + * so that the first non-allocated byte is properly aligned.
    + *
    + * III. Operation
    + *
    + * In the following example, we assume a 64-byte page although on real machines
    + * the page size ranges from 4 KB to 64 KB. The size of the boundary tag in
    + * this example is 4 bytes and the mandatory alignment required by the machine
    + * is 4 bytes.
    + *
    + * Initially, the kernel memory allocator allocates one page from the page
    + * allocator and turns that into contiguous chunk with sentinel. You can see
    + * the boundary tag bytes marked as 'B' and the sentinel chunk boundary tag
    + * bytes as 'S' in the following diagram:
    + *
    + *  0       8       16      24      32      40      48      56
    + *  +-------+-------+-------+-------+-------+-------+-------+-------
    + *  BBBB                                                        SSSS
    + *
    + * Now let's assume a kmalloc() call comes in and wants to allocate 3 bytes. As
    + * we find a big enough chunk, we simply split it in two parts as follows:
    + *
    + *  0       8       16      24      32      40      48      56
    + *  +-------+-------+-------+-------+-------+-------+-------+-------
    + *  BBBBOOOPBBBB                                                SSSS
    + *
    + * As the pointer after the boundary tag of the chunk is already aligned to the
    + * mandatory alignment 4, the object marked as 'O' starts immediately after the
    + * boundary tag. However, to make sure the boundary tag of the next chunk is
    + * also aligned properly, one byte of padding is added to the end of the object
    + * marked as 'P'.
    + *
    + * Now assume that a kmem_cache_alloc() call comes in to allocate 16 bytes of
    + * memory with mandatory alignment of 16. Now, as the location after the
    + * boundary tag of the second chunk is not aligned to 16 byte boundary, we add
    + * 8 bytes of padding in front of the object as illustrated in the following
    + * diagram:
    + *
    + *  0       8       16      24      32      40      48      56
    + *  +-------+-------+-------+-------+-------+-------+-------+-------
    + *  BBBBOOOPBBBBPPPPOOOOOOOOOOOOOOOOBBBB                        SSSS
    + *
    + * Note that the boundary tag is naturally aligned due to the fact that the
    + * object size is already aligned to 4 byte boundary which is the mandatory
    + * alignment for this machine.
    + */
    +
    +#include <linux/kernel.h>
    +#include <linux/mm.h>
    +#include <linux/module.h>
    +#include <linux/rbtree.h>
    +#include <linux/rcupdate.h>
    +
    +/*
    + * Minimum alignments
    + */
    +#ifndef ARCH_KMALLOC_MINALIGN
    +#define ARCH_KMALLOC_MINALIGN __alignof__(unsigned long)
    +#endif
    +
    +#ifndef ARCH_SLAB_MINALIGN
    +#define ARCH_SLAB_MINALIGN __alignof__(unsigned long)
    +#endif
    +
    +#define MANDATORY_ALIGN max(ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN)
    +
    +struct kmem_rcu {
    + struct rcu_head head;
    + size_t size;
    +};
    +
    +#define KMEM_CHUNK_RESERVED 0xababababUL
    +#define KMEM_CHUNK_COALESCED 0xbcbcbcbcUL
    +#define KMEM_CHUNK_FREE 0xfefefefeUL
    +
    +/*
    + * Each chunk has a boundary tag at the beginning of the chunk. The tag
    + * contains the size of this chunk and the size of the previous chunk which is
    + * required by chunk coalescing when an object is freed.
    + */
    +struct kmem_boundary_tag {
    +#ifdef CONFIG_BINALLOC_DEBUG
    + unsigned long magic;
    +#endif
    + unsigned short prev_size;
    + unsigned short size;
    + unsigned short reserved;
    + unsigned short align;
    +};
    +
    +struct kmem_chunk {
    + struct kmem_boundary_tag tag;
    + /* The following fields are defined only if the chunk is available */
    + struct rb_node bin_tree;
    +};
    +
    +/*
    + * The code assumes that the end of a boundary tag is aligned by power of two
    + * for calculating the alignment of an object in a chunk.
    + */
    +#define BOUNDARY_TAG_SIZE roundup_pow_of_two(sizeof(struct kmem_boundary_tag))
    +
    +struct kmem_bin {
    + struct rb_root freelist;
    +};
    +
    +/*
    + * The chunk needs to be big enough to fit the freelist node pointers when it's
    + * available.
    + */
    +#define MIN_CHUNK_SIZE \
    + (sizeof(struct kmem_chunk) - sizeof(struct kmem_boundary_tag))
    +
    +#define BIN_SHIFT 5
    +#define NR_BINS ((PAGE_SIZE) / (1 << BIN_SHIFT))
    +
    +static struct kmem_bin bins[NR_BINS];
    +static DEFINE_SPINLOCK(kmem_lock);
    +
    +static unsigned long chunk_size(struct kmem_chunk *chunk)
    +{
    + return chunk->tag.size;
    +}
    +
    +static void __chunk_set_size(struct kmem_chunk *chunk, unsigned long size)
    +{
    + chunk->tag.size = size;
    +}
    +
    +static void ptr_set_align(void *p, unsigned short align)
    +{
    + unsigned short *buf = (void *)p - sizeof(unsigned short);
    +
    + *buf = align;
    +}
    +
    +static unsigned short ptr_get_align(const void *p)
    +{
    + unsigned short *buf = (void *)p - sizeof(unsigned short);
    +
    + return *buf;
    +}
    +
    +static struct kmem_chunk *rawptr_to_chunk(const void *p)
    +{
    + void *q = (void *)p - BOUNDARY_TAG_SIZE;
    + return q;
    +}
    +
    +static void *chunk_to_rawptr(struct kmem_chunk *chunk)
    +{
    + return (void *)chunk + BOUNDARY_TAG_SIZE;
    +}
    +
    +static struct kmem_chunk *ptr_to_chunk(const void *p, unsigned short align)
    +{
    + return rawptr_to_chunk(p - align);
    +}
    +
    +static void *chunk_to_ptr(struct kmem_chunk *chunk, unsigned short align)
    +{
    + void *p;
    +
    + p = chunk_to_rawptr(chunk);
    + return PTR_ALIGN(p, align);
    +}
    +
    +static struct kmem_chunk *prev_chunk(struct kmem_chunk *chunk)
    +{
    + void *p = rawptr_to_chunk((void *) chunk - chunk->tag.prev_size);
    +
    + BUG_ON(!virt_addr_valid(p));
    +
    + return p;
    +}
    +
    +static struct kmem_chunk *next_chunk(struct kmem_chunk *chunk)
    +{
    + return chunk_to_rawptr(chunk) + chunk_size(chunk);
    +}
    +
    +static bool chunk_is_first(struct kmem_chunk *b)
    +{
    + return b->tag.prev_size == 0;
    +}
    +
    +static bool chunk_is_last(struct kmem_chunk *b)
    +{
    + struct kmem_chunk *next = next_chunk(b);
    +
    + return chunk_size(next) == 0;
    +}
    +
    +static void chunk_set_size(struct kmem_chunk *chunk, unsigned long size)
    +{
    + BUG_ON(size < MIN_CHUNK_SIZE);
    + BUG_ON(size > PAGE_SIZE);
    +
    + __chunk_set_size(chunk, size);
    +
    + if (!chunk_is_last(chunk)) {
    + struct kmem_chunk *next = next_chunk(chunk);
    +
    + next->tag.prev_size = size;
    + }
    +}
    +
    +#ifdef CONFIG_BINALLOC_DEBUG
    +
    +#define DUMP_BYTES 128
    +
    +static void kmem_dump_chunk(struct kmem_chunk *chunk)
    +{
    + print_hex_dump(KERN_ERR, "kmem: ", DUMP_PREFIX_ADDRESS, 16, 1,
    + chunk, DUMP_BYTES, 1);
    +}
    +
    +static void kmem_verify_chunk(struct kmem_chunk *chunk, unsigned long magic)
    +{
    + if (chunk->tag.magic != magic) {
    + printk(KERN_ERR "kmem: bad chunk magic: %lx, expected: %lx\n",
    + chunk->tag.magic, magic);
    + kmem_dump_chunk(chunk);
    + BUG();
    + }
    +}
    +
    +static void chunk_set_magic(struct kmem_chunk *chunk, unsigned long magic)
    +{
    + chunk->tag.magic = magic;
    +}
    +
    +static void kmem_verify_page(struct page *page)
    +{
    + struct kmem_chunk *chunk = page_address(page);
    +
    + do {
    + BUG_ON(chunk_size(chunk) < MIN_CHUNK_SIZE);
    + BUG_ON(virt_to_page(chunk) != page);
    + chunk = next_chunk(chunk);
    + } while (chunk_size(chunk) != 0);
    +}
    +#else
    +
    +static inline void
    +kmem_verify_chunk(struct kmem_chunk *chunk, unsigned long magic)
    +{
    +}
    +
    +static inline void
    +chunk_set_magic(struct kmem_chunk *chunk, unsigned long magic)
    +{
    +}
    +
    +static inline void kmem_verify_page(struct page *page)
    +{
    +}
    +#endif /* CONFIG_BINALLOC_DEBUG */
    +
    +static bool chunk_is_available(struct kmem_chunk *chunk)
    +{
    + return chunk->tag.reserved != 1;
    +}
    +
    +static void chunk_mark_reserved(struct kmem_chunk *chunk)
    +{
    + chunk_set_magic(chunk, KMEM_CHUNK_RESERVED);
    + chunk->tag.reserved = 1;
    +}
    +
    +static void chunk_mark_available(struct kmem_chunk *chunk)
    +{
    + chunk_set_magic(chunk, KMEM_CHUNK_FREE);
    + chunk->tag.reserved = 0;
    +}
    +
    +#define ALLOC_MASK (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK)
    +
    +static void *kmem_alloc_large(size_t size, gfp_t gfpflags, int node)
    +{
    + struct page *page;
    + int order;
    +
    + order = get_order(size);
    + page = alloc_pages_node(node, gfpflags & ALLOC_MASK, order);
    + if (!page)
    + return NULL;
    + page->private = size;
    +
    + return page_address(page);
    +}
    +
    +static int lookup_bin_idx(unsigned long size)
    +{
    + return (size-1) >> BIN_SHIFT;
    +}
    +
    +static struct kmem_bin *lookup_bin(unsigned long size)
    +{
    + int idx = lookup_bin_idx(size);
    +
    + BUG_ON(idx >= NR_BINS);
    +
    + return &bins[idx];
    +}
    +
    +static struct kmem_bin *chunk_get_bin(struct kmem_chunk *chunk)
    +{
    + unsigned long size = chunk_size(chunk);
    +
    + return lookup_bin(size);
    +}
    +
    +static void __insert_to_freelist(struct kmem_bin *bin, struct kmem_chunk *chunk)
    +{
    + struct rb_node **p = &bin->freelist.rb_node;
    + struct rb_node *parent = NULL;
    +
    + while (*p) {
    + struct kmem_chunk *this;
    + parent = *p;
    +
    + this = rb_entry(parent, struct kmem_chunk, bin_tree);
    +
    + if (chunk_size(chunk) < chunk_size(this))
    + p = &(*p)->rb_left;
    + else
    + p = &(*p)->rb_right;
    + }
    + rb_link_node(&chunk->bin_tree, parent, p);
    + rb_insert_color(&chunk->bin_tree, &bin->freelist);
    +}
    +
    +static void insert_to_freelist(struct kmem_chunk *chunk)
    +{
    + struct kmem_bin *bin;
    +
    + kmem_verify_chunk(chunk, KMEM_CHUNK_FREE);
    +
    + bin = chunk_get_bin(chunk);
    + __insert_to_freelist(bin, chunk);
    +}
    +
    +static void remove_from_freelist(struct kmem_chunk *chunk)
    +{
    + struct kmem_bin *bin = chunk_get_bin(chunk);
    +
    + rb_erase(&chunk->bin_tree, &bin->freelist);
    +}
    +
    +static struct kmem_chunk *chunk_page_alloc(gfp_t gfpflags)
    +{
    + struct kmem_chunk *chunk, *sentinel;
    + struct page *page;
    +
    + page = alloc_pages_node(-1, gfpflags & ALLOC_MASK, 0);
    + if (!page)
    + return NULL;
    +
    + __SetPageSlab(page);
    +
    + chunk = page_address(page);
    + chunk->tag.prev_size = 0;
    + __chunk_set_size(chunk, PAGE_SIZE - BOUNDARY_TAG_SIZE*2);
    + chunk_mark_available(chunk);
    +
    + /*
    + * The sentinel chunk marks the end of a page and it's the only one
    + * that can have size zero.
    + */
    + sentinel = page_address(page) + PAGE_SIZE - BOUNDARY_TAG_SIZE;
    + sentinel->tag.prev_size = chunk_size(chunk);
    + __chunk_set_size(sentinel, 0);
    + chunk_mark_reserved(sentinel);
    +
    + return chunk;
    +}
    +
    +static void split_chunk(struct kmem_chunk *chunk, size_t new_size)
    +{
    + struct kmem_chunk *upper_half;
    + size_t size, remaining;
    +
    + BUG_ON(new_size < MIN_CHUNK_SIZE);
    + BUG_ON(new_size > PAGE_SIZE);
    +
    + kmem_verify_chunk(chunk, KMEM_CHUNK_FREE);
    +
    + size = chunk_size(chunk);
    + BUG_ON(size < new_size);
    +
    + remaining = size - new_size;
    +
    + /*
    + * Don't split if remaining half would end up too small.
    + */
    + if (remaining < (BOUNDARY_TAG_SIZE + MIN_CHUNK_SIZE))
    + return;
    +
    + chunk_set_size(chunk, new_size);
    +
    + upper_half = next_chunk(chunk);
    + upper_half->tag.prev_size = chunk_size(chunk);
    + chunk_set_size(upper_half, remaining - BOUNDARY_TAG_SIZE);
    +
    + chunk_mark_available(upper_half);
    + insert_to_freelist(upper_half);
    +}
    +
    +static struct kmem_chunk *__kmem_alloc(struct kmem_bin *bin, size_t size)
    +{
    + struct rb_node *n = bin->freelist.rb_node;
    + struct kmem_chunk *ret = NULL;
    +
    + while (n) {
    + struct kmem_chunk *chunk = rb_entry(n, struct kmem_chunk, bin_tree);
    +
    + if (chunk_size(chunk) < size) {
    + /*
    + * The chunk is not big enough.
    + */
    + n = n->rb_right;
    + } else {
    + /*
    + * Look up the smallest possible chunk that is big
    + * enough for us but bail out early if we find a
    + * perfect fit.
    + */
    + ret = chunk;
    + if (chunk_size(chunk) == size)
    + break;
    + n = n->rb_left;
    + }
    + }
    + if (ret)
    + remove_from_freelist(ret);
    +
    + return ret;
    +}
    +
    +/*
    + * This is the heart of the kernel memory allocator.
    + */
    +static void *
    +kmem_alloc(size_t size, gfp_t gfpflags, unsigned short align)
    +{
    + struct kmem_chunk *chunk;
    + unsigned long flags;
    + void *p;
    + int i;
    +
    + if (size < MIN_CHUNK_SIZE)
    + size = MIN_CHUNK_SIZE;
    +
    + /*
    + * The boundary tags are aligned to mandatory alignment so there's no
    + * need to reserve extra space if the user also requested the same
    + * mandatory alignment.
    + */
    + if (align != MANDATORY_ALIGN)
    + size += align;
    +
    + size = ALIGN(size, MANDATORY_ALIGN);
    +
    + spin_lock_irqsave(&kmem_lock, flags);
    + /*
    + * Look for available chunks from all bins starting from the smallest
    + * one that is big enough for the requested size.
    + */
    + for (i = lookup_bin_idx(size); i < NR_BINS; i++) {
    + struct kmem_bin *bin = &bins[i];
    +
    + chunk = __kmem_alloc(bin, size);
    +
    + /*
    + * If we found a free chunk, return a pointer to it; otherwise
    + * continue scanning.
    + */
    + if (chunk)
    + goto allocate_obj;
    + }
    +
    + /*
    + * Ok, we need more pages.
    + */
    + spin_unlock_irqrestore(&kmem_lock, flags);
    + chunk = chunk_page_alloc(gfpflags);
    + if (!chunk)
    + return NULL;
    + spin_lock_irqsave(&kmem_lock, flags);
    +
    +allocate_obj:
    + split_chunk(chunk, size);
    + chunk_mark_reserved(chunk);
    + spin_unlock_irqrestore(&kmem_lock, flags);
    +
    + p = chunk_to_ptr(chunk, align);
    + ptr_set_align(p, p - chunk_to_rawptr(chunk));
    +
    + BUG_ON(!IS_ALIGNED((unsigned long) p, align));
    + return p;
    +}
    +
    +static void *
    +kmem_alloc_node(size_t size, gfp_t gfpflags, unsigned short align, int node)
    +{
    + void *p;
    +
    + if (unlikely(!size))
    + return ZERO_SIZE_PTR;
    +
    + if (size < PAGE_SIZE)
    + p = kmem_alloc(size, gfpflags, align);
    + else
    + p = kmem_alloc_large(size, gfpflags, node);
    +
    + if ((gfpflags & __GFP_ZERO) && p)
    + memset(p, 0, size);
    +
    + if (size < PAGE_SIZE)
    + kmem_verify_page(virt_to_page(p));
    +
    + return p;
    +}
    +
    +void *__kmalloc_node(size_t size, gfp_t gfpflags, int node)
    +{
    + return kmem_alloc_node(size, gfpflags, MANDATORY_ALIGN, node);
    +}
    +EXPORT_SYMBOL(__kmalloc_node);
    +
    +void *__kmalloc(size_t size, gfp_t gfpflags)
    +{
    + return kmem_alloc_node(size, gfpflags, MANDATORY_ALIGN, -1);
    +}
    +EXPORT_SYMBOL(__kmalloc);
    +
    +static void
    +coalesce_chunks(struct kmem_chunk *chunk, struct kmem_chunk *right_neighbor)
    +{
    + unsigned long new_size;
    +
    + kmem_verify_chunk(right_neighbor, KMEM_CHUNK_FREE);
    + kmem_verify_chunk(chunk, KMEM_CHUNK_FREE);
    +
    + chunk_set_magic(right_neighbor, KMEM_CHUNK_COALESCED);
    +
    + new_size = chunk_size(chunk) + chunk_size(right_neighbor);
    +
    + /*
    + * The boundary tag of the right-side neighbor is coalesced into the
    + * chunk as well.
    + */
    + chunk_set_size(chunk, new_size + BOUNDARY_TAG_SIZE);
    +}
    +
    +static void kmem_free_chunk(struct kmem_chunk *chunk, struct page *page)
    +{
    + unsigned long flags;
    +
    + spin_lock_irqsave(&kmem_lock, flags);
    +
    + chunk_mark_available(chunk);
    +
    + /*
    + * Coalesce chunk with neighbor chunks that are not reserved.
    + */
    + if (likely(!chunk_is_first(chunk))) {
    + struct kmem_chunk *prev = prev_chunk(chunk);
    +
    + if (chunk_is_available(prev)) {
    + remove_from_freelist(prev);
    + coalesce_chunks(prev, chunk);
    + chunk = prev;
    + }
    + }
    + if (likely(!chunk_is_last(chunk))) {
    + struct kmem_chunk *next = next_chunk(chunk);
    +
    + if (chunk_is_available(next)) {
    + remove_from_freelist(next);
    + coalesce_chunks(chunk, next);
    + }
    + }
    +
    + /*
    + * If the chunk now covers a whole page, give it back to the page
    + * allocator; otherwise insert it to the freelist.
    + */
    + if (unlikely(chunk_size(chunk) == PAGE_SIZE-BOUNDARY_TAG_SIZE*2)) {
    + __ClearPageSlab(page);
    + __free_pages(page, 0);
    + } else
    + insert_to_freelist(chunk);
    +
    + spin_unlock_irqrestore(&kmem_lock, flags);
    +}
    +
    +static void kmem_free(const void *p, struct page *page)
    +{
    + struct kmem_chunk *chunk;
    + unsigned short align;
    +
    + kmem_verify_page(page);
    +
    + align = ptr_get_align(p);
    + chunk = ptr_to_chunk(p, align);
    +
    + kmem_free_chunk(chunk, page);
    +}
    +
    +static void __kfree(const void *p)
    +{
    + struct page *page = virt_to_page(p);
    +
    + if (PageSlab(page))
    + kmem_free(p, page);
    + else
    + put_page(page);
    +}
    +
    +void kfree(const void *p)
    +{
    + if (unlikely(ZERO_OR_NULL_PTR(p)))
    + return;
    +
    + __kfree(p);
    +}
    +EXPORT_SYMBOL(kfree);
    +
    +static size_t __ksize(const void *p)
    +{
    + struct kmem_chunk *chunk;
    + unsigned short align;
    +
    + align = ptr_get_align(p);
    + chunk = ptr_to_chunk(p, align);
    +
    + /*
    + * No need for locking here: the size of a reserved chunk can never
    + * change.
    + */
    + return chunk_size(chunk); /* XXX */
    +}
    +
    +size_t ksize(const void *p)
    +{
    + struct page *page;
    +
    + BUG_ON(!p);
    +
    + if (unlikely(ZERO_OR_NULL_PTR(p)))
    + return 0;
    +
    + page = virt_to_page(p);
    +
    + if (PageSlab(page))
    + return __ksize(p);
    +
    + return page->private;
    +}
    +EXPORT_SYMBOL(ksize);
    +
    +struct kmem_cache {
    + unsigned int size, align;
    + unsigned long gfpflags;
    + const char *name;
    + void (*ctor)(void *);
    +};
    +
    +struct kmem_cache *
    +kmem_cache_create(const char *name, size_t size, size_t align,
    +		  unsigned long gfpflags,
    +		  void (*ctor)(void *))
    +{
    +	struct kmem_cache *cache;
    +
    +	BUG_ON(size == 0);
    +
    +	cache = kmalloc(sizeof(*cache), GFP_KERNEL);
    +	if (cache) {
    +		cache->size = size;
    +		if (gfpflags & SLAB_DESTROY_BY_RCU) {
    +			/* leave room for rcu footer at the end of chunk */
    +			cache->size += sizeof(struct kmem_rcu);
    +		}
    +		cache->align = max(align, ARCH_SLAB_MINALIGN);
    +		cache->gfpflags = gfpflags;
    +		cache->name = name;
    +		cache->ctor = ctor;
    +	}
    +	return cache;
    +}
    +EXPORT_SYMBOL(kmem_cache_create);
    +
    +void kmem_cache_destroy(struct kmem_cache *cache)
    +{
    +	kfree(cache);
    +}
    +EXPORT_SYMBOL(kmem_cache_destroy);
    +
    +void *
    +kmem_cache_alloc_node(struct kmem_cache *cache, gfp_t gfpflags, int node)
    +{
    +	void *p;
    +
    +	p = kmem_alloc_node(cache->size, gfpflags, cache->align, node);
    +	if (!p)
    +		return NULL;
    +
    +	if (cache->ctor)
    +		cache->ctor(p);
    +	return p;
    +}
    +EXPORT_SYMBOL(kmem_cache_alloc_node);
    +
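    +/*
    + * For SLAB_DESTROY_BY_RCU caches, kmem_cache_create() reserves
    + * sizeof(struct kmem_rcu) extra bytes at the end of every object.
    + * kmem_cache_free() places an rcu_head and the cache size in that
    + * footer; the callback below walks back from the footer to the start
    + * of the object and releases the chunk.
    + */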
    +static void kmem_rcu_free(struct rcu_head *head)
    +{
    +	struct kmem_rcu *kmem_rcu = (struct kmem_rcu *)head;
    +	void *p = (void *)kmem_rcu - (kmem_rcu->size - sizeof(struct kmem_rcu));
    +
    +	__kfree(p);
    +}
    +
    +void kmem_cache_free(struct kmem_cache *cache, void *p)
    +{
    +	if (unlikely(ZERO_OR_NULL_PTR(p)))
    +		return;
    +
    +	if (unlikely(cache->gfpflags & SLAB_DESTROY_BY_RCU)) {
    +		struct kmem_rcu *kmem_rcu;
    +
    +		kmem_rcu = p + (cache->size - sizeof(struct kmem_rcu));
    +		INIT_RCU_HEAD(&kmem_rcu->head);
    +		kmem_rcu->size = cache->size;
    +		call_rcu(&kmem_rcu->head, kmem_rcu_free);
    +	} else
    +		__kfree(p);
    +}
    +EXPORT_SYMBOL(kmem_cache_free);
    +
    +unsigned int kmem_cache_size(struct kmem_cache *cache)
    +{
    +	return cache->size;
    +}
    +EXPORT_SYMBOL(kmem_cache_size);
    +
    +const char *kmem_cache_name(struct kmem_cache *cache)
    +{
    +	return cache->name;
    +}
    +EXPORT_SYMBOL(kmem_cache_name);
    +
    +int kmem_cache_shrink(struct kmem_cache *cache)
    +{
    +	return 0;
    +}
    +EXPORT_SYMBOL(kmem_cache_shrink);
    +
    +int kmem_ptr_validate(struct kmem_cache *cache, const void *p)
    +{
    +	return 0;
    +}
    +
    +static unsigned int kmem_ready __read_mostly;
    +
    +int slab_is_available(void)
    +{
    +	return kmem_ready;
    +}
    +
    +static void kmem_init_bin(struct kmem_bin *bin)
    +{
    +	bin->freelist = RB_ROOT;
    +}
    +
    +#define NR_ALLOCS 64
    +#define ALLOC_SIZE 128
    +
    +static void kmem_selftest(void)
    +{
    +	void *objs[NR_ALLOCS];
    +	int i;
    +
    +	for (i = 0; i < NR_ALLOCS; i++)
    +		objs[i] = kmalloc(ALLOC_SIZE, GFP_KERNEL);
    +
    +	for (i = 0; i < NR_ALLOCS; i++)
    +		kfree(objs[i]);
    +}
    +
    +void __init kmem_cache_init(void)
    +{
    +	int i;
    +
    +	for (i = 0; i < NR_BINS; i++)
    +		kmem_init_bin(&bins[i]);
    +	kmem_ready = 1;
    +
    +	kmem_selftest();
    +}
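
    A minimal caller-side sketch of the cache API implemented above (an
    editorial illustration, not part of the patch; "struct foo", "foo_cache",
    and the foo_* functions are made-up names, and the allocation goes through
    kmem_cache_alloc_node() with node == -1, matching what __kmalloc() does
    above):

    #include <linux/slab.h>
    #include <linux/init.h>
    #include <linux/errno.h>

    struct foo {
    	int bar;
    };

    static struct kmem_cache *foo_cache;

    static int __init foo_init(void)
    {
    	/* No special flags, no constructor. */
    	foo_cache = kmem_cache_create("foo", sizeof(struct foo), 0, 0, NULL);
    	if (!foo_cache)
    		return -ENOMEM;
    	return 0;
    }

    static void foo_use(void)
    {
    	struct foo *f = kmem_cache_alloc_node(foo_cache, GFP_KERNEL, -1);

    	if (f) {
    		f->bar = 42;
    		kmem_cache_free(foo_cache, f);
    	}
    }

    static void __exit foo_exit(void)
    {
    	kmem_cache_destroy(foo_cache);
    }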

  16. Re: [RFC PATCH] greatly reduce SLOB external fragmentation



    On Thu, 31 Jul 2008, Pekka J Enberg wrote:
    >
    > Subject: [PATCH] binalloc: best-fit allocation with binning
    > From: Pekka Enberg


    Shoot me now.

    > As suggested by Linus,


    I'm happy to hear that the thing worked, but I'm not sure how happy I
    should be about yet _another_ allocator. Will it ever end?

    But seriously, it looks simple and small enough, so in that sense there
    doesn't seem to be a problem. But I really don't look forward to another
    one of these, at least not without somebody deciding that yes, we can
    prune one of the old ones as never being better. Hmm?

    Linus

  17. Re: [RFC PATCH] greatly reduce SLOB external fragmentation

    Hi Linus,

    On Thu, Jul 31, 2008 at 1:00 AM, Linus Torvalds wrote:
    > I'm happy to hear that the thing worked, but I'm not sure how happy I
    > should be about yet _another_ allocator. Will it ever end?


    Oh, I didn't suggest this for merging. Just thought you'd be
    interested to know that best-fit doesn't really do that much better
    than what we have in the tree now. (Well, I was kinda hoping you'd
    tell me why my implementation is wrong and you were right all along.)

    On Thu, Jul 31, 2008 at 1:00 AM, Linus Torvalds wrote:
    > But seriously, it looks simple and small enough, so in that sense there
    > doesn't seem to be a problem. But I really don't look forward to another
    > one of these, at least not without somebody deciding that yes, we can
    > prune one of the old ones as never being better. Hmm?


    I think the current situation is a bit unfortunate. SLOB doesn't want to
    cater for everybody (and probably can't do that either), SLAB sucks
    for NUMA and embedded, and SLUB is too much of a "memory pig"
    (although much less so than SLAB) to replace SLOB, and it has that TPC
    regression we don't have a test case for.

    So while SLAB is (slowly) on its way out, we really don't have a
    strategy for SLOB/SLUB. I'm trying to come up with something that's
    memory efficient first and then try to tune that for SMP and NUMA.
    Looking at how tuned the fast paths of SLAB and SLUB are (due to
    their design), it seems unlikely that we can come up with anything
    that could compete with them. But that doesn't mean we can't have fun
    trying ;-).

    Pekka

  18. Re: [RFC PATCH] greatly reduce SLOB external fragmentation



    On Thu, 31 Jul 2008, Pekka Enberg wrote:
    >
    > Oh, I didn't suggest this for merging. Just thought you'd be
    > interested to know that best-fit doesn't really do that much better
    > than what we have in the tree now. (Well, I was kinda hoping you'd
    > tell me why my implementation is wrong and you were right all along.)


    Heh. Most allocators tend to work pretty well under normal load, and the
    real fragmentation problems all tend to happen under special patterns. The
    one in glibc, for example, sucks donkey dick when using threading, but is
    apparently ok otherwise.

    I wouldn't actually expect most "normal" kernel use to show any really bad
    patterns on any normal loads. Google for

    worst-case first-fit fragmentation

    (or 'next-fit' for that matter) to see some stuff. Of course, it is scary
    only if you can trigger it in practice (perhaps with certain games on
    packet sizes, or creating/removing files with pathname size patterns, etc).

    [ Of course, google probably mostly returns hits from all those ACM
    portals etc. I wonder why google does that - they're almost totally
    useless search results. Sad. If somebody knows how to turn those ACM
    pay-portals off in google, pls let me know ]

    Linus
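
    To make the worst case Linus mentions concrete, here is a small
    stand-alone user-space sketch (an editorial illustration, not from this
    thread): a naive first-fit free list over a 64 KiB arena, driven by an
    interleaved 64/192-byte allocation pattern. After the 64-byte objects
    are freed, a quarter of the arena is nominally free, yet no single hole
    can satisfy a 128-byte request:

    /* ffdemo.c -- hypothetical illustration of first-fit fragmentation. */
    #include <stdbool.h>
    #include <stdio.h>

    #define ARENA_SIZE	(64 * 1024)
    #define MAX_BLOCKS	4096

    struct block {
    	unsigned int offset;
    	unsigned int size;
    	bool used;
    };

    /* The whole arena starts out as one free block. */
    static struct block blocks[MAX_BLOCKS] = {
    	{ .offset = 0, .size = ARENA_SIZE, .used = false },
    };
    static int nr_blocks = 1;

    /* First fit: take the first free block that is big enough, split off the rest. */
    static int ff_alloc(unsigned int size)
    {
    	int i, j;

    	for (i = 0; i < nr_blocks; i++) {
    		if (blocks[i].used || blocks[i].size < size)
    			continue;
    		if (blocks[i].size > size && nr_blocks < MAX_BLOCKS) {
    			/* Make room at i + 1 for the unused remainder. */
    			for (j = nr_blocks; j > i + 1; j--)
    				blocks[j] = blocks[j - 1];
    			blocks[i + 1].offset = blocks[i].offset + size;
    			blocks[i + 1].size = blocks[i].size - size;
    			blocks[i + 1].used = false;
    			blocks[i].size = size;
    			nr_blocks++;
    		}
    		blocks[i].used = true;
    		return i;
    	}
    	return -1;
    }

    static void ff_free(int idx)
    {
    	/*
    	 * No coalescing is needed to show the effect: the holes freed below
    	 * are separated by live 192-byte objects, so they could not merge.
    	 */
    	blocks[idx].used = false;
    }

    static void report(const char *when)
    {
    	unsigned int free_total = 0, largest = 0;
    	int i;

    	for (i = 0; i < nr_blocks; i++) {
    		if (blocks[i].used)
    			continue;
    		free_total += blocks[i].size;
    		if (blocks[i].size > largest)
    			largest = blocks[i].size;
    	}
    	printf("%-32s free %6u bytes, largest hole %4u bytes\n",
    	       when, free_total, largest);
    }

    int main(void)
    {
    	int small[256];
    	int i;

    	/* Interleave 64-byte and 192-byte objects until the arena is full. */
    	for (i = 0; i < 256; i++) {
    		small[i] = ff_alloc(64);
    		ff_alloc(192);
    	}
    	report("after interleaved allocations:");

    	/* Free only the small objects: 16 KiB becomes free, in 64-byte holes. */
    	for (i = 0; i < 256; i++)
    		ff_free(small[i]);
    	report("after freeing 64-byte objects:");

    	/* A 128-byte request fails even though a quarter of the arena is free. */
    	printf("ff_alloc(128) %s\n", ff_alloc(128) < 0 ? "fails" : "succeeds");
    	return 0;
    }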

  19. Re: [RFC PATCH] greatly reduce SLOB external fragmentation


    On Wed, 2008-07-30 at 15:35 -0700, Linus Torvalds wrote:

    > [ Of course, google probably mostly returns hits from all those ACM
    > portals etc. I wonder why google does that - they're almost totally
    > useless search results. Sad. If somebody knows how to turn those ACM
    > pay-portals off in google, pls let me know ]


    Maybe you, as one of the better known developers in the world, could
    write an open letter suggesting that ACM get a clue and open up their
    archives. It's ridiculous that physicists, biologists, and historians
    should have more and better web resources than the computer scientists.

    I propose you start it with "Hey morons".

    --
    Mathematics is the supreme nostalgia of our time.


  20. Re: [RFC PATCH] greatly reduce SLOB external fragmentation


    On Wed, 2008-07-30 at 15:00 -0700, Linus Torvalds wrote:
    >
    > On Thu, 31 Jul 2008, Pekka J Enberg wrote:
    > >
    > > Subject: [PATCH] binalloc: best-fit allocation with binning
    > > From: Pekka Enberg

    >
    > Shoot me now.
    >
    > > As suggested by Linus,

    >
    > I'm happy to hear that the thing worked, but I'm not sure how happy I
    > should be about yet _another_ allocator. Will it ever end?


    I think you can relax: the logical limit is probably two. We want an
    allocator that is optimally fast and scalable on one end of the
    spectrum and one that is optimally space-efficient on the other, and
    we're unlikely to find a single allocator that is both at once. But I
    don't think there's much call for things in the middle of the spectrum.

    So if this new one (which I haven't looked at yet) beats SLOB in space
    usage and simplicity, I'll be happy to see it replace SLOB.

    Finally getting rid of SLAB is a much trickier proposition because SLUB
    still loses in a few important corner cases.

    --
    Mathematics is the supreme nostalgia of our time.

