Re: SLUB defrag pull request? - Kernel

This is a discussion on Re: SLUB defrag pull request? - Kernel ; On Monday 13 October 2008 22:42, Hugh Dickins wrote: > On Mon, 13 Oct 2008, Pekka Enberg wrote: > > On Mon, 2008-10-13 at 10:30 +0300, Pekka Enberg wrote: > > > Hi Christoph, > > > > > > ...

+ Reply to Thread
Page 1 of 3 1 2 3 LastLast
Results 1 to 20 of 47

Thread: Re: SLUB defrag pull request?

  1. Re: SLUB defrag pull request?

    On Monday 13 October 2008 22:42, Hugh Dickins wrote:
    > On Mon, 13 Oct 2008, Pekka Enberg wrote:
    > > On Mon, 2008-10-13 at 10:30 +0300, Pekka Enberg wrote:
    > > > Hi Christoph,
    > > >
    > > > (I'm ccing Andrew and Hugh as well in case they want to chip in.)

    >
    > Thanks for the headsup: and I've taken the liberty of adding Nick too,
    > it would be good to have his input.


    Thanks. I get too easily bored with picking nits, and high level design
    review is not very highly valued, but I'll throw in my 2c anyway since
    I've been asked May as well cc some lists.

    I think it's a pretty reasonable thing to do. I was always slightly
    irritated by the design, maybe due to the topsy turvy issue... you
    almost want the subsystems to be able to use the struct pages to
    queue into their own LRU algorithms, and then queue the unreferenced
    objects off those struct pages, turning it back around the right way.


    > > > I'm planning to send a pull request of SLUB defrag today. Is there
    > > > anything in particular you would like to be mentioned in the email to
    > > > Linus?

    >
    > I do fear that it'll introduce some hard-to-track-down bugs, either
    > in itself or in the subsystems it's defragging, because it's coming
    > at them from a new angle (isn't it?).


    In many cases, yes it seems to. And some of the approaches even if
    they work now seem like they *might* cause problematic constraints
    in the design... Have Al and Christoph reviewed the dentry and inode
    patches?


    > But I've no evidence for that, it's just my usual personal FUD, and
    > I'm really rather impressed with how well it appears to be working.
    > Just allow me to say "I told you so" when the first bug appears


    I've only looked closely at the buffer_head defrag part of the patch so
    far, and it looks like there is a real problem there. The problem is that
    pages are not always pinned until after buffer heads are freed. So it is
    possible to "get" a page that has since been freed and reallocated for
    something else. Boom? In the comments, I see assertions that "this is OK",
    but not much analysis of what concurrency is possible and why it is safe.

    A nasty, conceptual problem unfortunately that I'm worried about is that
    now there is a lot more potential for concurrency from the moment the
    data structure is allocated. Obviously this is why the ctor is needed,
    but there are a lot of codepaths leading to allocations of these objects.
    Have they all been audited for concurrency issues? For example, I see
    there is buffer head defragmenting, and I see there is ext3 defragmenting,
    but alloc_buffer_head calls from various filesystems don't seem to have
    been touched. Presumably they have been looked at, but there is no
    indication of why they are OK?

    As a broad question, is such an additional concurrency constraint reasonable?

    Another high level kind of issue is the efficiency of this thing. It can
    take locks very frequently in some cases.


    > May I repeat that question overlooked from when I sent my 1/3 fix?
    > I'm wondering whether kick_buffers() ought to check PageLRU: I've no
    > evidence it's needed, but coming to writepage() or try_to_free_buffers()
    > from this direction (by virtue of having a buffer_head in a partial slab
    > page) is new, and I worry it might cause some ordering problem; whereas
    > if the page is PageLRU, then it is already fair game for such reclaim.


    Hmm, I don't see an immediate problem there, but it could be an issue.
    Have filesystem developers been made aware of the change and OKed it?


    > I believe (not necessarily correctly!) that Andrew is tied up with the
    > Linux Foundation's End User conference in NY today and tomorrow. I'd
    > be happier if you waited for his goahead before asking Linus to pull.
    > As I recall, he was unhappy with Christoph adding such a feature to
    > SLUB and not SLAB while it's still undecided which way we go.


    The problem with this approach is that it relies on being able to lock
    out access from freeing an object purely from the slab layer. SLAB
    cannot do this because it has a per-cpu frontend queue that cannot be
    blocked by another CPU. It would involve adding atomic operations in
    the SLAB fastpath, or somehow IPIing all CPUs during defrag operations.

    AFAIKS, all my concerns, including SLAB, would be addressed if we would
    be able to instead group reclaimable objects into pages, and reclaim them
    by managing lists of pages.


    > I cannot remember where we stand in relation to the odd test where
    > SLUB performs or performed worse. I think it is true to say that the
    > vast majority of developers would prefer to go forward with SLUB than
    > SLAB. There was a time when I was very perturbed by SLUB's reliance
    > on higher order allocations, but have noticed nothing to protest
    > about since it became less profligate and more adaptive. And I do
    > think it's a bit unfair to ask Christoph to enhance his competition!


    SLAB unfortunately is still significantly faster than SLUB in some
    important workloads. I am also a little worried about SLUB's higher
    order allocations, but won't harp on about that.
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: SLUB defrag pull request?

    On Mon, 13 Oct 2008, Nick Piggin wrote:
    > In many cases, yes it seems to. And some of the approaches even if
    > they work now seem like they *might* cause problematic constraints
    > in the design... Have Al and Christoph reviewed the dentry and inode
    > patches?


    This d_invalidate() looks suspicious to me:

    +/*
    + * Slab has dropped all the locks. Get rid of the refcount obtained
    + * earlier and also free the object.
    + */
    +static void kick_dentries(struct kmem_cache *s,
    + int nr, void **v, void *private)
    +{
    + struct dentry *dentry;
    + int i;
    +
    + /*
    + * First invalidate the dentries without holding the dcache lock
    + */
    + for (i = 0; i < nr; i++) {
    + dentry = v[i];
    +
    + if (dentry)
    + d_invalidate(dentry);
    + }

    I think it's wrong to unhash dentries while they are possibly still
    being used. You can do the shrink_dcache_parent() here, but should
    leave the unhashing to be done by prune_one_dentry(), after it's been
    checked that there are no other users of the dentry.

    Miklos
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  3. Re: SLUB defrag pull request?

    On Mon, 13 Oct 2008, Miklos Szeredi wrote:
    > On Mon, 13 Oct 2008, Nick Piggin wrote:
    > > In many cases, yes it seems to. And some of the approaches even if
    > > they work now seem like they *might* cause problematic constraints
    > > in the design... Have Al and Christoph reviewed the dentry and inode
    > > patches?

    >
    > This d_invalidate() looks suspicious to me:


    And the things kick_inodes() does without any sort of locking look
    even more dangerous.

    It should be the other way round: first make sure nothing is
    referencing the inode, and _then_ start cleaning it up with
    appropriate locks held. See prune_icache().

    Miklos
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  4. Re: SLUB defrag pull request?

    Miklos Szeredi wrote:
    > I think it's wrong to unhash dentries while they are possibly still
    > being used. You can do the shrink_dcache_parent() here, but should
    > leave the unhashing to be done by prune_one_dentry(), after it's been
    > checked that there are no other users of the dentry.
    >
    >

    d_invalidate() calls shrink_dcache_parent() as needed and will fail if
    there are other users of the dentry.


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  5. Re: SLUB defrag pull request?

    On Mon, 13 Oct 2008, Christoph Lameter wrote:
    > Miklos Szeredi wrote:
    > > I think it's wrong to unhash dentries while they are possibly still
    > > being used. You can do the shrink_dcache_parent() here, but should
    > > leave the unhashing to be done by prune_one_dentry(), after it's been
    > > checked that there are no other users of the dentry.
    > >
    > >

    > d_invalidate() calls shrink_dcache_parent() as needed and will fail if
    > there are other users of the dentry.


    Only if it's a directory. Now unhashing an in-use non-directory is
    not fatal, but you'll get things like "filename (deleted)" in /proc,
    and suchlike. Don't do it.

    Miklos
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  6. Re: SLUB defrag pull request?

    Miklos Szeredi wrote:
    > And the things kick_inodes() does without any sort of locking look
    > even more dangerous.
    >
    > It should be the other way round: first make sure nothing is
    > referencing the inode, and _then_ start cleaning it up with
    > appropriate locks held. See prune_icache().
    >
    >

    kick_inodes() only works on inodes that first have undergone
    get_inodes() where we establish a refcount under inode_lock(). The final
    cleanup in kick_inodes() is done under iprune_mutex. You are looking at
    the loop that does writeback and invalidates attached dentries. This can
    fail for various reasons.

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  7. Re: SLUB defrag pull request?

    On Mon, 13 Oct 2008, Christoph Lameter wrote:
    > Miklos Szeredi wrote:
    > > And the things kick_inodes() does without any sort of locking look
    > > even more dangerous.
    > >
    > > It should be the other way round: first make sure nothing is
    > > referencing the inode, and _then_ start cleaning it up with
    > > appropriate locks held. See prune_icache().
    > >
    > >

    > kick_inodes() only works on inodes that first have undergone
    > get_inodes() where we establish a refcount under inode_lock(). The final
    > cleanup in kick_inodes() is done under iprune_mutex. You are looking at
    > the loop that does writeback and invalidates attached dentries. This can
    > fail for various reasons.


    Yes, but I'm not at all sure that calling remove_inode_buffers() or
    invalidate_mapping_pages() is OK on a live inode. They should be done
    after checking the refcount, just like prune_icache() does.

    Also, while d_invalidate() is not actually wrong here, because you
    check S_ISDIR(), but it's still the wrong function to use. You really
    just want to shrink the children. Invalidation means: the filesystem
    found out that the cached inode is invalid, so we want to throw it
    away. In the future it might actually be able to do it for
    directories as well, but currently it cannot because of possible
    mounts on the dentry.

    Miklos
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  8. Re: SLUB defrag pull request?

    And BTW the whole thing seems to be broken WRT umount. Getting a
    reference to a dentry or an inode without also getting reference to a
    vfsmount is going to result in "VFS: Busy inodes after unmount of
    %s. Self-destruct in 5 seconds. Have a nice day...\n". And getting a
    reference to the vfsmount will result in EBUSY when trying to umount,
    which is also not what we want.

    So it seemst that this two pass method will not work with dentries or
    inodes at all

    Miklos
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  9. Re: SLUB defrag pull request?

    Miklos Szeredi wrote:
    >
    >>> It should be the other way round: first make sure nothing is
    >>> referencing the inode, and _then_ start cleaning it up with
    >>> appropriate locks held. See prune_icache().


    The code was initially taken from prune_icache.

    >> kick_inodes() only works on inodes that first have undergone
    >> get_inodes() where we establish a refcount under inode_lock(). The final
    >> cleanup in kick_inodes() is done under iprune_mutex. You are looking at
    >> the loop that does writeback and invalidates attached dentries. This can
    >> fail for various reasons.

    >
    > Yes, but I'm not at all sure that calling remove_inode_buffers() or
    > invalidate_mapping_pages() is OK on a live inode. They should be done
    > after checking the refcount, just like prune_icache() does.


    Dont we do the same on a truncate?

    > Also, while d_invalidate() is not actually wrong here, because you
    > check S_ISDIR(), but it's still the wrong function to use. You really
    > just want to shrink the children. Invalidation means: the filesystem
    > found out that the cached inode is invalid, so we want to throw it
    > away. In the future it might actually be able to do it for
    > directories as well, but currently it cannot because of possible
    > mounts on the dentry.


    Thats the same issue as with the dentries. The new function could deal with
    both situations?


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  10. Re: SLUB defrag pull request?

    On Mon, 20 Oct 2008, Christoph Lameter wrote:
    > >> kick_inodes() only works on inodes that first have undergone
    > >> get_inodes() where we establish a refcount under inode_lock(). The final
    > >> cleanup in kick_inodes() is done under iprune_mutex. You are looking at
    > >> the loop that does writeback and invalidates attached dentries. This can
    > >> fail for various reasons.

    > >
    > > Yes, but I'm not at all sure that calling remove_inode_buffers() or
    > > invalidate_mapping_pages() is OK on a live inode. They should be done
    > > after checking the refcount, just like prune_icache() does.

    >
    > Dont we do the same on a truncate?


    Yes, with i_mutex and i_alloc_sem held.

    >
    > > Also, while d_invalidate() is not actually wrong here, because you
    > > check S_ISDIR(), but it's still the wrong function to use. You really
    > > just want to shrink the children. Invalidation means: the filesystem
    > > found out that the cached inode is invalid, so we want to throw it
    > > away. In the future it might actually be able to do it for
    > > directories as well, but currently it cannot because of possible
    > > mounts on the dentry.

    >
    > Thats the same issue as with the dentries. The new function could deal with
    > both situations?


    Sure.

    The big issue is dealing with umount. You could do something like
    grab_super() on sb before getting a ref on the inode/dentry. But I'm
    not sure this is a good idea. There must be a simpler way to achieve
    this...

    Miklos
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  11. Re: SLUB defrag pull request?

    Miklos Szeredi wrote:
    > On Mon, 20 Oct 2008, Christoph Lameter wrote:
    >>>> kick_inodes() only works on inodes that first have undergone
    >>>> get_inodes() where we establish a refcount under inode_lock(). The final
    >>>> cleanup in kick_inodes() is done under iprune_mutex. You are looking at
    >>>> the loop that does writeback and invalidates attached dentries. This can
    >>>> fail for various reasons.
    >>> Yes, but I'm not at all sure that calling remove_inode_buffers() or
    >>> invalidate_mapping_pages() is OK on a live inode. They should be done
    >>> after checking the refcount, just like prune_icache() does.

    >> Dont we do the same on a truncate?

    >
    > Yes, with i_mutex and i_alloc_sem held.


    There is another call to invalidate_mapping_pages() in prune_icache (that is
    where this code originates). No i_mutex and i_alloc. Only iprune_mutex held
    and that seems to be for the protection of the list. So just checking
    inode->i_count would do the trick?

    >>> Also, while d_invalidate() is not actually wrong here, because you
    >>> check S_ISDIR(), but it's still the wrong function to use. You really
    >>> just want to shrink the children. Invalidation means: the filesystem
    >>> found out that the cached inode is invalid, so we want to throw it
    >>> away. In the future it might actually be able to do it for
    >>> directories as well, but currently it cannot because of possible
    >>> mounts on the dentry.

    >> Thats the same issue as with the dentries. The new function could deal with
    >> both situations?

    >
    > Sure.
    >
    > The big issue is dealing with umount. You could do something like
    > grab_super() on sb before getting a ref on the inode/dentry. But I'm
    > not sure this is a good idea. There must be a simpler way to achieve
    > this...


    Taking a lock on vfsmount_lock? But that would make dentry reclaim a pain.

    We are only interested in the reclaim a dentry if its currently unused. If so
    then why does unmount matter? Both unmount and reclaim will attempt to remove
    the dentry.

    Have a look at get_dentries(). It takes the dcache_lock and checks the dentry
    state. Either the entry is ignored or dget_locked() removes it from the lru.
    If its off the LRU then it can no longer be reclaimed by umount.


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  12. Re: SLUB defrag pull request?

    On Mon, 20 Oct 2008, Christoph Lameter wrote:
    > Miklos Szeredi wrote:
    > >>> Yes, but I'm not at all sure that calling remove_inode_buffers() or
    > >>> invalidate_mapping_pages() is OK on a live inode. They should be done
    > >>> after checking the refcount, just like prune_icache() does.
    > >> Dont we do the same on a truncate?

    > >
    > > Yes, with i_mutex and i_alloc_sem held.

    >
    > There is another call to invalidate_mapping_pages() in prune_icache (that is
    > where this code originates). No i_mutex and i_alloc. Only iprune_mutex held
    > and that seems to be for the protection of the list. So just checking
    > inode->i_count would do the trick?


    Yes, that's what I was saying.

    > > The big issue is dealing with umount. You could do something like
    > > grab_super() on sb before getting a ref on the inode/dentry. But I'm
    > > not sure this is a good idea. There must be a simpler way to achieve
    > > this..

    >
    > Taking a lock on vfsmount_lock? But that would make dentry reclaim a pain.


    No, I mean simpler than having to do this two stage stuff.

    > We are only interested in the reclaim a dentry if its currently unused. If so
    > then why does unmount matter? Both unmount and reclaim will attempt to remove
    > the dentry.
    >
    > Have a look at get_dentries(). It takes the dcache_lock and checks the dentry
    > state. Either the entry is ignored or dget_locked() removes it from the lru.
    > If its off the LRU then it can no longer be reclaimed by umount.


    How is that better? You will still get busy inodes on umount.

    And anyway the dentry could be put back onto the LRU by somebody else
    between get_dentries() and kick_dentries(). So I don't even see how
    taking the dentry off the LRU helps _anything_.

    Miklos
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  13. Re: SLUB defrag pull request?

    Miklos Szeredi wrote:

    >> There is another call to invalidate_mapping_pages() in prune_icache (that is
    >> where this code originates). No i_mutex and i_alloc. Only iprune_mutex held
    >> and that seems to be for the protection of the list. So just checking
    >> inode->i_count would do the trick?

    >
    > Yes, that's what I was saying.


    Ok. Thats easy to do.

    >
    >>> The big issue is dealing with umount. You could do something like
    >>> grab_super() on sb before getting a ref on the inode/dentry. But I'm
    >>> not sure this is a good idea. There must be a simpler way to achieve
    >>> this..

    >> Taking a lock on vfsmount_lock? But that would make dentry reclaim a pain.

    >
    > No, I mean simpler than having to do this two stage stuff.


    How could it be simpler? First you need to establish a secure reference to the
    object so that it cannot vanish from under us. Then all the references can be
    checked and possibly removed. If we do not need a secure reference then the
    get_dentries() etc method can be NULL.

    >> We are only interested in the reclaim a dentry if its currently unused. If so
    >> then why does unmount matter? Both unmount and reclaim will attempt to remove
    >> the dentry.
    >>
    >> Have a look at get_dentries(). It takes the dcache_lock and checks the dentry
    >> state. Either the entry is ignored or dget_locked() removes it from the lru.
    >> If its off the LRU then it can no longer be reclaimed by umount.

    >
    > How is that better? You will still get busy inodes on umount.


    Those inodes are going to be freed by the reclaim code. Why would they be busy
    (unless the case below occurs of course).

    > And anyway the dentry could be put back onto the LRU by somebody else
    > between get_dentries() and kick_dentries(). So I don't even see how
    > taking the dentry off the LRU helps _anything_.


    get_dentries() gets a reference. dput will not put the dentry back onto the LRU.




    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  14. Re: SLUB defrag pull request?

    On Mon, 20 Oct 2008, Christoph Lameter wrote:
    > >>> The big issue is dealing with umount. You could do something like
    > >>> grab_super() on sb before getting a ref on the inode/dentry. But I'm
    > >>> not sure this is a good idea. There must be a simpler way to achieve
    > >>> this..
    > >> Taking a lock on vfsmount_lock? But that would make dentry reclaim a pain.

    > >
    > > No, I mean simpler than having to do this two stage stuff.

    >
    > How could it be simpler? First you need to establish a secure
    > reference to the object so that it cannot vanish from under us. Then
    > all the references can be checked and possibly removed. If we do not
    > need a secure reference then the get_dentries() etc method can be
    > NULL.


    So, isn't it possible to do without get_dentries()? What's the
    fundamental difference between this and regular cache shrinking?

    > Those inodes are going to be freed by the reclaim code. Why would
    > they be busy (unless the case below occurs of course).


    Case below was brainfart, please ignore. But that doesn't really
    help: the VFS assumes that you cannot umount while there are busy
    dentries/inodes. Usually it works this way: VFS first gets vfsmount
    ref, then gets dentry ref, and releases them in the opposite order.
    And umount is not allowed if vfsmount has a non-zero refcount (it's a
    bit more complicated, but the essense is the same).

    The current SLUB defrag violates this: it gets dentry or inode ref
    without getting a ref on the vfsmount or the super block as well.
    This means that the umount will succeed (that's OK), but also the
    super block will be going away and that's bad. See
    generic_shutdown_super().

    > > And anyway the dentry could be put back onto the LRU by somebody else
    > > between get_dentries() and kick_dentries(). So I don't even see how
    > > taking the dentry off the LRU helps _anything_.

    >
    > get_dentries() gets a reference. dput will not put the dentry back
    > onto the LRU.


    Right.

    Miklos
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  15. Re: SLUB defrag pull request?

    Miklos Szeredi wrote:
    > So, isn't it possible to do without get_dentries()? What's the
    > fundamental difference between this and regular cache shrinking?


    The fundamental difference is that slab defrag operates on sparsely populated
    dentries. It comes into effect when the density of dentries per page is low
    and lots of memory is wasted. It defragments by kicking out dentries in low
    density pages. These can then be reclaimed.


    > Case below was brainfart, please ignore. But that doesn't really
    > help: the VFS assumes that you cannot umount while there are busy
    > dentries/inodes. Usually it works this way: VFS first gets vfsmount
    > ref, then gets dentry ref, and releases them in the opposite order.
    > And umount is not allowed if vfsmount has a non-zero refcount (it's a
    > bit more complicated, but the essense is the same).


    The dentries that we get a ref on are candidates for removal. Their lifetime
    is limited. Unmounting while we are trying to remove dentries/inodes results
    in two mechanisms removing dentries/inodes.

    If we have obtained a reference then invalidate_list() will return the number
    of busy inodes which would trigger the printk in generic_shutdown_super(). But
    these are inodes currently being reclaimed by slab defrag. Just waiting a bit
    would remedy the situation.

    We would need some way to make generic_shutdown_super() wait until slab defrag
    is finished.

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  16. Re: SLUB defrag pull request?

    On Mon, 20 Oct 2008, Christoph Lameter wrote:
    > Miklos Szeredi wrote:
    > > So, isn't it possible to do without get_dentries()? What's the
    > > fundamental difference between this and regular cache shrinking?

    >
    > The fundamental difference is that slab defrag operates on sparsely
    > populated dentries. It comes into effect when the density of
    > dentries per page is low and lots of memory is wasted. It
    > defragments by kicking out dentries in low density pages. These can
    > then be reclaimed.


    OK, but why can't this be done in just one stage?

    AFAICS the problem is exactly the same as generic shrinking, except it
    wants to evict dentries selectively: only ones which are in very
    sparse slabs.

    So is the problem selecting these dentries? Would it be too expensive
    to do it the same as normal cache shrinking and walk the lru, but only
    evict the ones which are tagged as being in a sparse page?

    > The dentries that we get a ref on are candidates for removal. Their
    > lifetime is limited. Unmounting while we are trying to remove
    > dentries/inodes results in two mechanisms removing dentries/inodes.
    >
    > If we have obtained a reference then invalidate_list() will return
    > the number of busy inodes which would trigger the printk in
    > generic_shutdown_super(). But these are inodes currently being
    > reclaimed by slab defrag. Just waiting a bit would remedy the
    > situation.


    I guess so, but that's just a hack to work around the problem, and
    creates more interdependencies between VFS and the allocator with
    unforseeable consequences.

    Miklos
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  17. Re: SLUB defrag pull request?

    On Mon, Oct 20, 2008 at 02:53:40PM -0500, Christoph Lameter wrote:
    > Miklos Szeredi wrote:
    > > Case below was brainfart, please ignore. But that doesn't really
    > > help: the VFS assumes that you cannot umount while there are busy
    > > dentries/inodes. Usually it works this way: VFS first gets vfsmount
    > > ref, then gets dentry ref, and releases them in the opposite order.
    > > And umount is not allowed if vfsmount has a non-zero refcount (it's a
    > > bit more complicated, but the essense is the same).

    >
    > The dentries that we get a ref on are candidates for removal. Their lifetime
    > is limited. Unmounting while we are trying to remove dentries/inodes results
    > in two mechanisms removing dentries/inodes.
    >
    > If we have obtained a reference then invalidate_list() will return the number
    > of busy inodes which would trigger the printk in generic_shutdown_super(). But
    > these are inodes currently being reclaimed by slab defrag. Just waiting a bit
    > would remedy the situation.
    >
    > We would need some way to make generic_shutdown_super() wait until slab defrag
    > is finished.


    Seems to me that prune_dcache() handles this case by holding the sb->s_umount
    semaphore while pruning. The same logic applies here, right?

    Cheers,

    Dave.
    --
    Dave Chinner
    david@fromorbit.com
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  18. Re: SLUB defrag pull request?

    On Tue, 21 Oct 2008, Christoph Lameter wrote:
    > The only way that a secure reference can be established is if the
    > slab page is locked. That requires a spinlock. The slab allocator
    > calls the get() functions while the slab lock guarantees object
    > existence. Then locks are dropped and reclaim actions can start with
    > the guarantee that the slab object will not suddenly vanish.


    Yes, you've made up your mind, that you want to do it this way. But
    it's the _wrong_ way, this "want to get a secure reference for use
    later" leads to madness when applied to dentries or inodes. Try for a
    minute to think outside this template.

    For example dcache_lock will protect against dentries moving to/from
    d_lru. So you can do this:

    take dcache_lock
    check if d_lru is non-empty
    take sb->s_umount
    free dentry
    release sb->s_umount
    release dcache_lock

    Yeah, locking will be more complicated in reality. Still, much less
    complicated than trying to do the same across two separate phases.

    Why can't something like that work?

    Miklos
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  19. Re: SLUB defrag pull request?

    On Wed, 22 Oct 2008, Miklos Szeredi wrote:

    > On Tue, 21 Oct 2008, Christoph Lameter wrote:
    >> The only way that a secure reference can be established is if the
    >> slab page is locked. That requires a spinlock. The slab allocator
    >> calls the get() functions while the slab lock guarantees object
    >> existence. Then locks are dropped and reclaim actions can start with
    >> the guarantee that the slab object will not suddenly vanish.

    >
    > Yes, you've made up your mind, that you want to do it this way. But
    > it's the _wrong_ way, this "want to get a secure reference for use
    > later" leads to madness when applied to dentries or inodes. Try for a
    > minute to think outside this template.
    >
    > For example dcache_lock will protect against dentries moving to/from
    > d_lru. So you can do this:
    >
    > take dcache_lock
    > check if d_lru is non-empty


    The dentry could have been freed even before we take the dcache_lock. We
    cannot access d_lru without a stable reference to the dentry.

    > take sb->s_umount
    > free dentry
    > release sb->s_umount
    > release dcache_lock
    >
    > Yeah, locking will be more complicated in reality. Still, much less
    > complicated than trying to do the same across two separate phases.
    >
    > Why can't something like that work?


    Because the slab starts out with a series of objects left in a slab. It
    needs to do build a list of objects etc in a way that is independent as
    possible from the user of the slab page. It does that by locking the slab
    page so that free operations stall until the reference has been
    established. If it would not be shutting off frees then the objects could
    vanish under us.

    We could also avoid frees by calling some cache specific method that locks
    out frees before and after. But then frees would stall everywhere and
    every slab cache would have to check a global lock before freeing objects
    (there would be numerous complications with RCU free etc etc).

    Slab defrag only stops frees on a particular slab page.

    The slab defrag approach also allows the slab cache (dentry or inodes
    here) to do something else than free the object. It would be possible f.e.
    to move the object by allocating a new entry and moving the information to
    the new dentry. That would actually be better since it would preserve the
    objects and just move them into the same slab page.
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  20. Re: SLUB defrag pull request?

    On Wed, 22 Oct 2008, Christoph Lameter wrote:
    > On Wed, 22 Oct 2008, Miklos Szeredi wrote:
    >
    > > On Tue, 21 Oct 2008, Christoph Lameter wrote:
    > >> The only way that a secure reference can be established is if the
    > >> slab page is locked. That requires a spinlock. The slab allocator
    > >> calls the get() functions while the slab lock guarantees object
    > >> existence. Then locks are dropped and reclaim actions can start with
    > >> the guarantee that the slab object will not suddenly vanish.

    > >
    > > Yes, you've made up your mind, that you want to do it this way. But
    > > it's the _wrong_ way, this "want to get a secure reference for use
    > > later" leads to madness when applied to dentries or inodes. Try for a
    > > minute to think outside this template.
    > >
    > > For example dcache_lock will protect against dentries moving to/from
    > > d_lru. So you can do this:
    > >
    > > take dcache_lock
    > > check if d_lru is non-empty

    >
    > The dentry could have been freed even before we take the dcache_lock. We
    > cannot access d_lru without a stable reference to the dentry.


    Why? The kmem_cache_free() doesn't touch the contents of the object,
    does it?

    > > take sb->s_umount
    > > free dentry
    > > release sb->s_umount
    > > release dcache_lock
    > >
    > > Yeah, locking will be more complicated in reality. Still, much less
    > > complicated than trying to do the same across two separate phases.
    > >
    > > Why can't something like that work?

    >
    > Because the slab starts out with a series of objects left in a slab. It
    > needs to do build a list of objects etc in a way that is independent as
    > possible from the user of the slab page. It does that by locking the slab
    > page so that free operations stall until the reference has been
    > established. If it would not be shutting off frees then the objects could
    > vanish under us.


    It doesn't matter. All we care about is that the dentry is on the
    lru: it's cached but unused. Every other state (being created,
    active, being freed, freed) is uninteresting.

    > We could also avoid frees by calling some cache specific method that locks
    > out frees before and after. But then frees would stall everywhere and
    > every slab cache would have to check a global lock before freeing objects
    > (there would be numerous complications with RCU free etc etc).
    >
    > Slab defrag only stops frees on a particular slab page.
    >
    > The slab defrag approach also allows the slab cache (dentry or inodes
    > here) to do something else than free the object. It would be possible f.e.
    > to move the object by allocating a new entry and moving the information to
    > the new dentry. That would actually be better since it would preserve the
    > objects and just move them into the same slab page.


    Sure, and all that is possible without doing this messy 2 phase thing.
    Unless I'm still missing something obvious...

    Miklos
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

+ Reply to Thread
Page 1 of 3 1 2 3 LastLast