Re: SLUB defrag pull request?
On Monday 13 October 2008 22:42, Hugh Dickins wrote:[color=blue]
> On Mon, 13 Oct 2008, Pekka Enberg wrote:[color=green]
> > On Mon, 2008-10-13 at 10:30 +0300, Pekka Enberg wrote:[color=darkred]
> > > Hi Christoph,
> > >
> > > (I'm ccing Andrew and Hugh as well in case they want to chip in.)[/color][/color]
>
> Thanks for the headsup: and I've taken the liberty of adding Nick too,
> it would be good to have his input.[/color]
Thanks. I get too easily bored with picking nits, and high level design
review is not very highly valued, but I'll throw in my 2c anyway since
I've been asked :) May as well cc some lists.
I think it's a pretty reasonable thing to do. I was always slightly
irritated by the design, maybe due to the topsy turvy issue... you
almost want the subsystems to be able to use the struct pages to
queue into their own LRU algorithms, and then queue the unreferenced
objects off those struct pages, turning it back around the right way.
[color=blue][color=green][color=darkred]
> > > I'm planning to send a pull request of SLUB defrag today. Is there
> > > anything in particular you would like to be mentioned in the email to
> > > Linus?[/color][/color]
>
> I do fear that it'll introduce some hard-to-track-down bugs, either
> in itself or in the subsystems it's defragging, because it's coming
> at them from a new angle (isn't it?).[/color]
In many cases, yes it seems to. And some of the approaches even if
they work now seem like they *might* cause problematic constraints
in the design... Have Al and Christoph reviewed the dentry and inode
patches?
[color=blue]
> But I've no evidence for that, it's just my usual personal FUD, and
> I'm really rather impressed with how well it appears to be working.
> Just allow me to say "I told you so" when the first bug appears ;)[/color]
I've only looked closely at the buffer_head defrag part of the patch so
far, and it looks like there is a real problem there. The problem is that
pages are not always pinned until after buffer heads are freed. So it is
possible to "get" a page that has since been freed and reallocated for
something else. Boom? In the comments, I see assertions that "this is OK",
but not much analysis of what concurrency is possible and why it is safe.
A nasty, conceptual problem unfortunately that I'm worried about is that
now there is a lot more potential for concurrency from the moment the
data structure is allocated. Obviously this is why the ctor is needed,
but there are a lot of codepaths leading to allocations of these objects.
Have they all been audited for concurrency issues? For example, I see
there is buffer head defragmenting, and I see there is ext3 defragmenting,
but alloc_buffer_head calls from various filesystems don't seem to have
been touched. Presumably they have been looked at, but there is no
indication of why they are OK?
As a broad question, is such an additional concurrency constraint reasonable?
Another high level kind of issue is the efficiency of this thing. It can
take locks very frequently in some cases.
[color=blue]
> May I repeat that question overlooked from when I sent my 1/3 fix?
> I'm wondering whether kick_buffers() ought to check PageLRU: I've no
> evidence it's needed, but coming to writepage() or try_to_free_buffers()
> from this direction (by virtue of having a buffer_head in a partial slab
> page) is new, and I worry it might cause some ordering problem; whereas
> if the page is PageLRU, then it is already fair game for such reclaim.[/color]
Hmm, I don't see an immediate problem there, but it could be an issue.
Have filesystem developers been made aware of the change and OKed it?
[color=blue]
> I believe (not necessarily correctly!) that Andrew is tied up with the
> Linux Foundation's End User conference in NY today and tomorrow. I'd
> be happier if you waited for his goahead before asking Linus to pull.
> As I recall, he was unhappy with Christoph adding such a feature to
> SLUB and not SLAB while it's still undecided which way we go.[/color]
The problem with this approach is that it relies on being able to lock
out access from freeing an object purely from the slab layer. SLAB
cannot do this because it has a per-cpu frontend queue that cannot be
blocked by another CPU. It would involve adding atomic operations in
the SLAB fastpath, or somehow IPIing all CPUs during defrag operations.
AFAIKS, all my concerns, including SLAB, would be addressed if we would
be able to instead group reclaimable objects into pages, and reclaim them
by managing lists of pages.
[color=blue]
> I cannot remember where we stand in relation to the odd test where
> SLUB performs or performed worse. I think it is true to say that the
> vast majority of developers would prefer to go forward with SLUB than
> SLAB. There was a time when I was very perturbed by SLUB's reliance
> on higher order allocations, but have noticed nothing to protest
> about since it became less profligate and more adaptive. And I do
> think it's a bit unfair to ask Christoph to enhance his competition![/color]
SLAB unfortunately is still significantly faster than SLUB in some
important workloads. I am also a little worried about SLUB's higher
order allocations, but won't harp on about that.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email]majordomo@vger.kernel.org[/email]
More majordomo info at [url]http://vger.kernel.org/majordomo-info.html[/url]
Please read the FAQ at [url]http://www.tux.org/lkml/[/url]
Re: SLUB defrag pull request?
On Mon, 13 Oct 2008, Nick Piggin wrote:[color=blue]
> In many cases, yes it seems to. And some of the approaches even if
> they work now seem like they *might* cause problematic constraints
> in the design... Have Al and Christoph reviewed the dentry and inode
> patches?[/color]
This d_invalidate() looks suspicious to me:
+/*
+ * Slab has dropped all the locks. Get rid of the refcount obtained
+ * earlier and also free the object.
+ */
+static void kick_dentries(struct kmem_cache *s,
+ int nr, void **v, void *private)
+{
+ struct dentry *dentry;
+ int i;
+
+ /*
+ * First invalidate the dentries without holding the dcache lock
+ */
+ for (i = 0; i < nr; i++) {
+ dentry = v[i];
+
+ if (dentry)
+ d_invalidate(dentry);
+ }
I think it's wrong to unhash dentries while they are possibly still
being used. You can do the shrink_dcache_parent() here, but should
leave the unhashing to be done by prune_one_dentry(), after it's been
checked that there are no other users of the dentry.
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email]majordomo@vger.kernel.org[/email]
More majordomo info at [url]http://vger.kernel.org/majordomo-info.html[/url]
Please read the FAQ at [url]http://www.tux.org/lkml/[/url]
Re: SLUB defrag pull request?
On Mon, 13 Oct 2008, Miklos Szeredi wrote:[color=blue]
> On Mon, 13 Oct 2008, Nick Piggin wrote:[color=green]
> > In many cases, yes it seems to. And some of the approaches even if
> > they work now seem like they *might* cause problematic constraints
> > in the design... Have Al and Christoph reviewed the dentry and inode
> > patches?[/color]
>
> This d_invalidate() looks suspicious to me:[/color]
And the things kick_inodes() does without any sort of locking look
even more dangerous.
It should be the other way round: first make sure nothing is
referencing the inode, and _then_ start cleaning it up with
appropriate locks held. See prune_icache().
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email]majordomo@vger.kernel.org[/email]
More majordomo info at [url]http://vger.kernel.org/majordomo-info.html[/url]
Please read the FAQ at [url]http://www.tux.org/lkml/[/url]
Re: SLUB defrag pull request?
Miklos Szeredi wrote:[color=blue]
> I think it's wrong to unhash dentries while they are possibly still
> being used. You can do the shrink_dcache_parent() here, but should
> leave the unhashing to be done by prune_one_dentry(), after it's been
> checked that there are no other users of the dentry.
>
>[/color]
d_invalidate() calls shrink_dcache_parent() as needed and will fail if
there are other users of the dentry.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email]majordomo@vger.kernel.org[/email]
More majordomo info at [url]http://vger.kernel.org/majordomo-info.html[/url]
Please read the FAQ at [url]http://www.tux.org/lkml/[/url]
Re: SLUB defrag pull request?
On Mon, 13 Oct 2008, Christoph Lameter wrote:[color=blue]
> Miklos Szeredi wrote:[color=green]
> > I think it's wrong to unhash dentries while they are possibly still
> > being used. You can do the shrink_dcache_parent() here, but should
> > leave the unhashing to be done by prune_one_dentry(), after it's been
> > checked that there are no other users of the dentry.
> >
> >[/color]
> d_invalidate() calls shrink_dcache_parent() as needed and will fail if
> there are other users of the dentry.[/color]
Only if it's a directory. Now unhashing an in-use non-directory is
not fatal, but you'll get things like "filename (deleted)" in /proc,
and suchlike. Don't do it.
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email]majordomo@vger.kernel.org[/email]
More majordomo info at [url]http://vger.kernel.org/majordomo-info.html[/url]
Please read the FAQ at [url]http://www.tux.org/lkml/[/url]
Re: SLUB defrag pull request?
Miklos Szeredi wrote:[color=blue]
> And the things kick_inodes() does without any sort of locking look
> even more dangerous.
>
> It should be the other way round: first make sure nothing is
> referencing the inode, and _then_ start cleaning it up with
> appropriate locks held. See prune_icache().
>
>[/color]
kick_inodes() only works on inodes that first have undergone
get_inodes() where we establish a refcount under inode_lock(). The final
cleanup in kick_inodes() is done under iprune_mutex. You are looking at
the loop that does writeback and invalidates attached dentries. This can
fail for various reasons.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email]majordomo@vger.kernel.org[/email]
More majordomo info at [url]http://vger.kernel.org/majordomo-info.html[/url]
Please read the FAQ at [url]http://www.tux.org/lkml/[/url]
Re: SLUB defrag pull request?
On Mon, 13 Oct 2008, Christoph Lameter wrote:[color=blue]
> Miklos Szeredi wrote:[color=green]
> > And the things kick_inodes() does without any sort of locking look
> > even more dangerous.
> >
> > It should be the other way round: first make sure nothing is
> > referencing the inode, and _then_ start cleaning it up with
> > appropriate locks held. See prune_icache().
> >
> >[/color]
> kick_inodes() only works on inodes that first have undergone
> get_inodes() where we establish a refcount under inode_lock(). The final
> cleanup in kick_inodes() is done under iprune_mutex. You are looking at
> the loop that does writeback and invalidates attached dentries. This can
> fail for various reasons.[/color]
Yes, but I'm not at all sure that calling remove_inode_buffers() or
invalidate_mapping_pages() is OK on a live inode. They should be done
after checking the refcount, just like prune_icache() does.
Also, while d_invalidate() is not actually wrong here, because you
check S_ISDIR(), but it's still the wrong function to use. You really
just want to shrink the children. Invalidation means: the filesystem
found out that the cached inode is invalid, so we want to throw it
away. In the future it might actually be able to do it for
directories as well, but currently it cannot because of possible
mounts on the dentry.
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email]majordomo@vger.kernel.org[/email]
More majordomo info at [url]http://vger.kernel.org/majordomo-info.html[/url]
Please read the FAQ at [url]http://www.tux.org/lkml/[/url]
Re: SLUB defrag pull request?
And BTW the whole thing seems to be broken WRT umount. Getting a
reference to a dentry or an inode without also getting reference to a
vfsmount is going to result in "VFS: Busy inodes after unmount of
%s. Self-destruct in 5 seconds. Have a nice day...\n". And getting a
reference to the vfsmount will result in EBUSY when trying to umount,
which is also not what we want.
So it seemst that this two pass method will not work with dentries or
inodes at all :(
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email]majordomo@vger.kernel.org[/email]
More majordomo info at [url]http://vger.kernel.org/majordomo-info.html[/url]
Please read the FAQ at [url]http://www.tux.org/lkml/[/url]
Re: SLUB defrag pull request?
Miklos Szeredi wrote:[color=blue]
>[color=green][color=darkred]
>>> It should be the other way round: first make sure nothing is
>>> referencing the inode, and _then_ start cleaning it up with
>>> appropriate locks held. See prune_icache().[/color][/color][/color]
The code was initially taken from prune_icache.
[color=blue][color=green]
>> kick_inodes() only works on inodes that first have undergone
>> get_inodes() where we establish a refcount under inode_lock(). The final
>> cleanup in kick_inodes() is done under iprune_mutex. You are looking at
>> the loop that does writeback and invalidates attached dentries. This can
>> fail for various reasons.[/color]
>
> Yes, but I'm not at all sure that calling remove_inode_buffers() or
> invalidate_mapping_pages() is OK on a live inode. They should be done
> after checking the refcount, just like prune_icache() does.[/color]
Dont we do the same on a truncate?
[color=blue]
> Also, while d_invalidate() is not actually wrong here, because you
> check S_ISDIR(), but it's still the wrong function to use. You really
> just want to shrink the children. Invalidation means: the filesystem
> found out that the cached inode is invalid, so we want to throw it
> away. In the future it might actually be able to do it for
> directories as well, but currently it cannot because of possible
> mounts on the dentry.[/color]
Thats the same issue as with the dentries. The new function could deal with
both situations?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email]majordomo@vger.kernel.org[/email]
More majordomo info at [url]http://vger.kernel.org/majordomo-info.html[/url]
Please read the FAQ at [url]http://www.tux.org/lkml/[/url]
Re: SLUB defrag pull request?
On Mon, 20 Oct 2008, Christoph Lameter wrote:[color=blue][color=green][color=darkred]
> >> kick_inodes() only works on inodes that first have undergone
> >> get_inodes() where we establish a refcount under inode_lock(). The final
> >> cleanup in kick_inodes() is done under iprune_mutex. You are looking at
> >> the loop that does writeback and invalidates attached dentries. This can
> >> fail for various reasons.[/color]
> >
> > Yes, but I'm not at all sure that calling remove_inode_buffers() or
> > invalidate_mapping_pages() is OK on a live inode. They should be done
> > after checking the refcount, just like prune_icache() does.[/color]
>
> Dont we do the same on a truncate?[/color]
Yes, with i_mutex and i_alloc_sem held.
[color=blue]
>[color=green]
> > Also, while d_invalidate() is not actually wrong here, because you
> > check S_ISDIR(), but it's still the wrong function to use. You really
> > just want to shrink the children. Invalidation means: the filesystem
> > found out that the cached inode is invalid, so we want to throw it
> > away. In the future it might actually be able to do it for
> > directories as well, but currently it cannot because of possible
> > mounts on the dentry.[/color]
>
> Thats the same issue as with the dentries. The new function could deal with
> both situations?[/color]
Sure.
The big issue is dealing with umount. You could do something like
grab_super() on sb before getting a ref on the inode/dentry. But I'm
not sure this is a good idea. There must be a simpler way to achieve
this...
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email]majordomo@vger.kernel.org[/email]
More majordomo info at [url]http://vger.kernel.org/majordomo-info.html[/url]
Please read the FAQ at [url]http://www.tux.org/lkml/[/url]
Re: SLUB defrag pull request?
Miklos Szeredi wrote:[color=blue]
> On Mon, 20 Oct 2008, Christoph Lameter wrote:[color=green][color=darkred]
>>>> kick_inodes() only works on inodes that first have undergone
>>>> get_inodes() where we establish a refcount under inode_lock(). The final
>>>> cleanup in kick_inodes() is done under iprune_mutex. You are looking at
>>>> the loop that does writeback and invalidates attached dentries. This can
>>>> fail for various reasons.
>>> Yes, but I'm not at all sure that calling remove_inode_buffers() or
>>> invalidate_mapping_pages() is OK on a live inode. They should be done
>>> after checking the refcount, just like prune_icache() does.[/color]
>> Dont we do the same on a truncate?[/color]
>
> Yes, with i_mutex and i_alloc_sem held.[/color]
There is another call to invalidate_mapping_pages() in prune_icache (that is
where this code originates). No i_mutex and i_alloc. Only iprune_mutex held
and that seems to be for the protection of the list. So just checking
inode->i_count would do the trick?
[color=blue][color=green][color=darkred]
>>> Also, while d_invalidate() is not actually wrong here, because you
>>> check S_ISDIR(), but it's still the wrong function to use. You really
>>> just want to shrink the children. Invalidation means: the filesystem
>>> found out that the cached inode is invalid, so we want to throw it
>>> away. In the future it might actually be able to do it for
>>> directories as well, but currently it cannot because of possible
>>> mounts on the dentry.[/color]
>> Thats the same issue as with the dentries. The new function could deal with
>> both situations?[/color]
>
> Sure.
>
> The big issue is dealing with umount. You could do something like
> grab_super() on sb before getting a ref on the inode/dentry. But I'm
> not sure this is a good idea. There must be a simpler way to achieve
> this...[/color]
Taking a lock on vfsmount_lock? But that would make dentry reclaim a pain.
We are only interested in the reclaim a dentry if its currently unused. If so
then why does unmount matter? Both unmount and reclaim will attempt to remove
the dentry.
Have a look at get_dentries(). It takes the dcache_lock and checks the dentry
state. Either the entry is ignored or dget_locked() removes it from the lru.
If its off the LRU then it can no longer be reclaimed by umount.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email]majordomo@vger.kernel.org[/email]
More majordomo info at [url]http://vger.kernel.org/majordomo-info.html[/url]
Please read the FAQ at [url]http://www.tux.org/lkml/[/url]
Re: SLUB defrag pull request?
On Mon, 20 Oct 2008, Christoph Lameter wrote:[color=blue]
> Miklos Szeredi wrote:[color=green][color=darkred]
> >>> Yes, but I'm not at all sure that calling remove_inode_buffers() or
> >>> invalidate_mapping_pages() is OK on a live inode. They should be done
> >>> after checking the refcount, just like prune_icache() does.
> >> Dont we do the same on a truncate?[/color]
> >
> > Yes, with i_mutex and i_alloc_sem held.[/color]
>
> There is another call to invalidate_mapping_pages() in prune_icache (that is
> where this code originates). No i_mutex and i_alloc. Only iprune_mutex held
> and that seems to be for the protection of the list. So just checking
> inode->i_count would do the trick?[/color]
Yes, that's what I was saying.
[color=blue][color=green]
> > The big issue is dealing with umount. You could do something like
> > grab_super() on sb before getting a ref on the inode/dentry. But I'm
> > not sure this is a good idea. There must be a simpler way to achieve
> > this..[/color]
>
> Taking a lock on vfsmount_lock? But that would make dentry reclaim a pain.[/color]
No, I mean simpler than having to do this two stage stuff.
[color=blue]
> We are only interested in the reclaim a dentry if its currently unused. If so
> then why does unmount matter? Both unmount and reclaim will attempt to remove
> the dentry.
>
> Have a look at get_dentries(). It takes the dcache_lock and checks the dentry
> state. Either the entry is ignored or dget_locked() removes it from the lru.
> If its off the LRU then it can no longer be reclaimed by umount.[/color]
How is that better? You will still get busy inodes on umount.
And anyway the dentry could be put back onto the LRU by somebody else
between get_dentries() and kick_dentries(). So I don't even see how
taking the dentry off the LRU helps _anything_.
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email]majordomo@vger.kernel.org[/email]
More majordomo info at [url]http://vger.kernel.org/majordomo-info.html[/url]
Please read the FAQ at [url]http://www.tux.org/lkml/[/url]
Re: SLUB defrag pull request?
Miklos Szeredi wrote:
[color=blue][color=green]
>> There is another call to invalidate_mapping_pages() in prune_icache (that is
>> where this code originates). No i_mutex and i_alloc. Only iprune_mutex held
>> and that seems to be for the protection of the list. So just checking
>> inode->i_count would do the trick?[/color]
>
> Yes, that's what I was saying.[/color]
Ok. Thats easy to do.
[color=blue]
>[color=green][color=darkred]
>>> The big issue is dealing with umount. You could do something like
>>> grab_super() on sb before getting a ref on the inode/dentry. But I'm
>>> not sure this is a good idea. There must be a simpler way to achieve
>>> this..[/color]
>> Taking a lock on vfsmount_lock? But that would make dentry reclaim a pain.[/color]
>
> No, I mean simpler than having to do this two stage stuff.[/color]
How could it be simpler? First you need to establish a secure reference to the
object so that it cannot vanish from under us. Then all the references can be
checked and possibly removed. If we do not need a secure reference then the
get_dentries() etc method can be NULL.
[color=blue][color=green]
>> We are only interested in the reclaim a dentry if its currently unused. If so
>> then why does unmount matter? Both unmount and reclaim will attempt to remove
>> the dentry.
>>
>> Have a look at get_dentries(). It takes the dcache_lock and checks the dentry
>> state. Either the entry is ignored or dget_locked() removes it from the lru.
>> If its off the LRU then it can no longer be reclaimed by umount.[/color]
>
> How is that better? You will still get busy inodes on umount.[/color]
Those inodes are going to be freed by the reclaim code. Why would they be busy
(unless the case below occurs of course).
[color=blue]
> And anyway the dentry could be put back onto the LRU by somebody else
> between get_dentries() and kick_dentries(). So I don't even see how
> taking the dentry off the LRU helps _anything_.[/color]
get_dentries() gets a reference. dput will not put the dentry back onto the LRU.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email]majordomo@vger.kernel.org[/email]
More majordomo info at [url]http://vger.kernel.org/majordomo-info.html[/url]
Please read the FAQ at [url]http://www.tux.org/lkml/[/url]
Re: SLUB defrag pull request?
On Mon, 20 Oct 2008, Christoph Lameter wrote:[color=blue][color=green][color=darkred]
> >>> The big issue is dealing with umount. You could do something like
> >>> grab_super() on sb before getting a ref on the inode/dentry. But I'm
> >>> not sure this is a good idea. There must be a simpler way to achieve
> >>> this..
> >> Taking a lock on vfsmount_lock? But that would make dentry reclaim a pain.[/color]
> >
> > No, I mean simpler than having to do this two stage stuff.[/color]
>
> How could it be simpler? First you need to establish a secure
> reference to the object so that it cannot vanish from under us. Then
> all the references can be checked and possibly removed. If we do not
> need a secure reference then the get_dentries() etc method can be
> NULL.[/color]
So, isn't it possible to do without get_dentries()? What's the
fundamental difference between this and regular cache shrinking?
[color=blue]
> Those inodes are going to be freed by the reclaim code. Why would
> they be busy (unless the case below occurs of course).[/color]
Case below was brainfart, please ignore. But that doesn't really
help: the VFS assumes that you cannot umount while there are busy
dentries/inodes. Usually it works this way: VFS first gets vfsmount
ref, then gets dentry ref, and releases them in the opposite order.
And umount is not allowed if vfsmount has a non-zero refcount (it's a
bit more complicated, but the essense is the same).
The current SLUB defrag violates this: it gets dentry or inode ref
without getting a ref on the vfsmount or the super block as well.
This means that the umount will succeed (that's OK), but also the
super block will be going away and that's bad. See
generic_shutdown_super().
[color=blue][color=green]
> > And anyway the dentry could be put back onto the LRU by somebody else
> > between get_dentries() and kick_dentries(). So I don't even see how
> > taking the dentry off the LRU helps _anything_.[/color]
>
> get_dentries() gets a reference. dput will not put the dentry back
> onto the LRU.[/color]
Right.
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email]majordomo@vger.kernel.org[/email]
More majordomo info at [url]http://vger.kernel.org/majordomo-info.html[/url]
Please read the FAQ at [url]http://www.tux.org/lkml/[/url]
Re: SLUB defrag pull request?
Miklos Szeredi wrote:[color=blue]
> So, isn't it possible to do without get_dentries()? What's the
> fundamental difference between this and regular cache shrinking?[/color]
The fundamental difference is that slab defrag operates on sparsely populated
dentries. It comes into effect when the density of dentries per page is low
and lots of memory is wasted. It defragments by kicking out dentries in low
density pages. These can then be reclaimed.
[color=blue]
> Case below was brainfart, please ignore. But that doesn't really
> help: the VFS assumes that you cannot umount while there are busy
> dentries/inodes. Usually it works this way: VFS first gets vfsmount
> ref, then gets dentry ref, and releases them in the opposite order.
> And umount is not allowed if vfsmount has a non-zero refcount (it's a
> bit more complicated, but the essense is the same).[/color]
The dentries that we get a ref on are candidates for removal. Their lifetime
is limited. Unmounting while we are trying to remove dentries/inodes results
in two mechanisms removing dentries/inodes.
If we have obtained a reference then invalidate_list() will return the number
of busy inodes which would trigger the printk in generic_shutdown_super(). But
these are inodes currently being reclaimed by slab defrag. Just waiting a bit
would remedy the situation.
We would need some way to make generic_shutdown_super() wait until slab defrag
is finished.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email]majordomo@vger.kernel.org[/email]
More majordomo info at [url]http://vger.kernel.org/majordomo-info.html[/url]
Please read the FAQ at [url]http://www.tux.org/lkml/[/url]
Re: SLUB defrag pull request?
On Mon, 20 Oct 2008, Christoph Lameter wrote:[color=blue]
> Miklos Szeredi wrote:[color=green]
> > So, isn't it possible to do without get_dentries()? What's the
> > fundamental difference between this and regular cache shrinking?[/color]
>
> The fundamental difference is that slab defrag operates on sparsely
> populated dentries. It comes into effect when the density of
> dentries per page is low and lots of memory is wasted. It
> defragments by kicking out dentries in low density pages. These can
> then be reclaimed.[/color]
OK, but why can't this be done in just one stage?
AFAICS the problem is exactly the same as generic shrinking, except it
wants to evict dentries selectively: only ones which are in very
sparse slabs.
So is the problem selecting these dentries? Would it be too expensive
to do it the same as normal cache shrinking and walk the lru, but only
evict the ones which are tagged as being in a sparse page?
[color=blue]
> The dentries that we get a ref on are candidates for removal. Their
> lifetime is limited. Unmounting while we are trying to remove
> dentries/inodes results in two mechanisms removing dentries/inodes.
>
> If we have obtained a reference then invalidate_list() will return
> the number of busy inodes which would trigger the printk in
> generic_shutdown_super(). But these are inodes currently being
> reclaimed by slab defrag. Just waiting a bit would remedy the
> situation.[/color]
I guess so, but that's just a hack to work around the problem, and
creates more interdependencies between VFS and the allocator with
unforseeable consequences.
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email]majordomo@vger.kernel.org[/email]
More majordomo info at [url]http://vger.kernel.org/majordomo-info.html[/url]
Please read the FAQ at [url]http://www.tux.org/lkml/[/url]
Re: SLUB defrag pull request?
On Mon, Oct 20, 2008 at 02:53:40PM -0500, Christoph Lameter wrote:[color=blue]
> Miklos Szeredi wrote:[color=green]
> > Case below was brainfart, please ignore. But that doesn't really
> > help: the VFS assumes that you cannot umount while there are busy
> > dentries/inodes. Usually it works this way: VFS first gets vfsmount
> > ref, then gets dentry ref, and releases them in the opposite order.
> > And umount is not allowed if vfsmount has a non-zero refcount (it's a
> > bit more complicated, but the essense is the same).[/color]
>
> The dentries that we get a ref on are candidates for removal. Their lifetime
> is limited. Unmounting while we are trying to remove dentries/inodes results
> in two mechanisms removing dentries/inodes.
>
> If we have obtained a reference then invalidate_list() will return the number
> of busy inodes which would trigger the printk in generic_shutdown_super(). But
> these are inodes currently being reclaimed by slab defrag. Just waiting a bit
> would remedy the situation.
>
> We would need some way to make generic_shutdown_super() wait until slab defrag
> is finished.[/color]
Seems to me that prune_dcache() handles this case by holding the sb->s_umount
semaphore while pruning. The same logic applies here, right?
Cheers,
Dave.
--
Dave Chinner
[email]david@fromorbit.com[/email]
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email]majordomo@vger.kernel.org[/email]
More majordomo info at [url]http://vger.kernel.org/majordomo-info.html[/url]
Please read the FAQ at [url]http://www.tux.org/lkml/[/url]
Re: SLUB defrag pull request?
On Tue, 21 Oct 2008, Christoph Lameter wrote:[color=blue]
> The only way that a secure reference can be established is if the
> slab page is locked. That requires a spinlock. The slab allocator
> calls the get() functions while the slab lock guarantees object
> existence. Then locks are dropped and reclaim actions can start with
> the guarantee that the slab object will not suddenly vanish.[/color]
Yes, you've made up your mind, that you want to do it this way. But
it's the _wrong_ way, this "want to get a secure reference for use
later" leads to madness when applied to dentries or inodes. Try for a
minute to think outside this template.
For example dcache_lock will protect against dentries moving to/from
d_lru. So you can do this:
take dcache_lock
check if d_lru is non-empty
take sb->s_umount
free dentry
release sb->s_umount
release dcache_lock
Yeah, locking will be more complicated in reality. Still, much less
complicated than trying to do the same across two separate phases.
Why can't something like that work?
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email]majordomo@vger.kernel.org[/email]
More majordomo info at [url]http://vger.kernel.org/majordomo-info.html[/url]
Please read the FAQ at [url]http://www.tux.org/lkml/[/url]
Re: SLUB defrag pull request?
On Wed, 22 Oct 2008, Miklos Szeredi wrote:
[color=blue]
> On Tue, 21 Oct 2008, Christoph Lameter wrote:[color=green]
>> The only way that a secure reference can be established is if the
>> slab page is locked. That requires a spinlock. The slab allocator
>> calls the get() functions while the slab lock guarantees object
>> existence. Then locks are dropped and reclaim actions can start with
>> the guarantee that the slab object will not suddenly vanish.[/color]
>
> Yes, you've made up your mind, that you want to do it this way. But
> it's the _wrong_ way, this "want to get a secure reference for use
> later" leads to madness when applied to dentries or inodes. Try for a
> minute to think outside this template.
>
> For example dcache_lock will protect against dentries moving to/from
> d_lru. So you can do this:
>
> take dcache_lock
> check if d_lru is non-empty[/color]
The dentry could have been freed even before we take the dcache_lock. We
cannot access d_lru without a stable reference to the dentry.
[color=blue]
> take sb->s_umount
> free dentry
> release sb->s_umount
> release dcache_lock
>
> Yeah, locking will be more complicated in reality. Still, much less
> complicated than trying to do the same across two separate phases.
>
> Why can't something like that work?[/color]
Because the slab starts out with a series of objects left in a slab. It
needs to do build a list of objects etc in a way that is independent as
possible from the user of the slab page. It does that by locking the slab
page so that free operations stall until the reference has been
established. If it would not be shutting off frees then the objects could
vanish under us.
We could also avoid frees by calling some cache specific method that locks
out frees before and after. But then frees would stall everywhere and
every slab cache would have to check a global lock before freeing objects
(there would be numerous complications with RCU free etc etc).
Slab defrag only stops frees on a particular slab page.
The slab defrag approach also allows the slab cache (dentry or inodes
here) to do something else than free the object. It would be possible f.e.
to move the object by allocating a new entry and moving the information to
the new dentry. That would actually be better since it would preserve the
objects and just move them into the same slab page.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email]majordomo@vger.kernel.org[/email]
More majordomo info at [url]http://vger.kernel.org/majordomo-info.html[/url]
Please read the FAQ at [url]http://www.tux.org/lkml/[/url]
Re: SLUB defrag pull request?
On Wed, 22 Oct 2008, Christoph Lameter wrote:[color=blue]
> On Wed, 22 Oct 2008, Miklos Szeredi wrote:
>[color=green]
> > On Tue, 21 Oct 2008, Christoph Lameter wrote:[color=darkred]
> >> The only way that a secure reference can be established is if the
> >> slab page is locked. That requires a spinlock. The slab allocator
> >> calls the get() functions while the slab lock guarantees object
> >> existence. Then locks are dropped and reclaim actions can start with
> >> the guarantee that the slab object will not suddenly vanish.[/color]
> >
> > Yes, you've made up your mind, that you want to do it this way. But
> > it's the _wrong_ way, this "want to get a secure reference for use
> > later" leads to madness when applied to dentries or inodes. Try for a
> > minute to think outside this template.
> >
> > For example dcache_lock will protect against dentries moving to/from
> > d_lru. So you can do this:
> >
> > take dcache_lock
> > check if d_lru is non-empty[/color]
>
> The dentry could have been freed even before we take the dcache_lock. We
> cannot access d_lru without a stable reference to the dentry.[/color]
Why? The kmem_cache_free() doesn't touch the contents of the object,
does it?
[color=blue][color=green]
> > take sb->s_umount
> > free dentry
> > release sb->s_umount
> > release dcache_lock
> >
> > Yeah, locking will be more complicated in reality. Still, much less
> > complicated than trying to do the same across two separate phases.
> >
> > Why can't something like that work?[/color]
>
> Because the slab starts out with a series of objects left in a slab. It
> needs to do build a list of objects etc in a way that is independent as
> possible from the user of the slab page. It does that by locking the slab
> page so that free operations stall until the reference has been
> established. If it would not be shutting off frees then the objects could
> vanish under us.[/color]
It doesn't matter. All we care about is that the dentry is on the
lru: it's cached but unused. Every other state (being created,
active, being freed, freed) is uninteresting.
[color=blue]
> We could also avoid frees by calling some cache specific method that locks
> out frees before and after. But then frees would stall everywhere and
> every slab cache would have to check a global lock before freeing objects
> (there would be numerous complications with RCU free etc etc).
>
> Slab defrag only stops frees on a particular slab page.
>
> The slab defrag approach also allows the slab cache (dentry or inodes
> here) to do something else than free the object. It would be possible f.e.
> to move the object by allocating a new entry and moving the information to
> the new dentry. That would actually be better since it would preserve the
> objects and just move them into the same slab page.[/color]
Sure, and all that is possible without doing this messy 2 phase thing.
Unless I'm still missing something obvious...
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email]majordomo@vger.kernel.org[/email]
More majordomo info at [url]http://vger.kernel.org/majordomo-info.html[/url]
Please read the FAQ at [url]http://www.tux.org/lkml/[/url]