[PATCH 0/4] ksm - dynamic page sharing driver for linux - Kernel
This is a discussion on [PATCH 0/4] ksm - dynamic page sharing driver for linux - Kernel ; Jonathan Corbet wrote:
> [Let's see if I can get through the rest without premature sends...]
>
> On Wed, 12 Nov 2008 00:17:39 +0200
> Izik Eidus wrote:
>
>
>>> Actually, it occurs to me that there's no ...
-
Re: [PATCH 3/4] add ksm kernel shared memory driver
Jonathan Corbet wrote:
> [Let's see if I can get through the rest without premature sends...]
>
> On Wed, 12 Nov 2008 00:17:39 +0200
> Izik Eidus wrote:
>
>
>>> Actually, it occurs to me that there's no sanity checks on any of
>>> the values passed in by ioctl(). What happens if the user tells
>>> KSM to scan a bogus range of memory?
>>>
>>>
>> Well get_user_pages() run in context of the process, therefore it
>> should fail in "bogus range of memory"
>>
>
> But it will fail in a totally silent and mysterious way. Doesn't it
> seem better to verify the values when you can return a meaningful error
> code to the caller?
>
>
Well I dont mind insert it (the above for sure is not a bug)
but even with that, the user can still free the memory that he gave to us
so this check if "nice to have check", we have nothing to do but to relay on
get_user_pages return value 
> The other ioctl() calls have the same issue; you can start the thread
> with nonsensical values for the number of pages to scan and the sleep
> time.
>
well about this i agree, here it make alot of logic to check the values!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
-
Re: [PATCH 3/4] add ksm kernel shared memory driver
Jonathan Corbet wrote:
> On Wed, 12 Nov 2008 00:17:39 +0200
> Izik Eidus wrote:
>
>
>>>> +static int ksm_dev_open(struct inode *inode, struct file *filp)
>>>> +{
>>>> + try_module_get(THIS_MODULE);
>>>> + return 0;
>>>> +}
>>>> +
>>>> +static int ksm_dev_release(struct inode *inode, struct file *filp)
>>>> +{
>>>> + module_put(THIS_MODULE);
>>>> + return 0;
>>>> +}
>>>> +
>>>> +static struct file_operations ksm_chardev_ops = {
>>>> + .open = ksm_dev_open,
>>>> + .release = ksm_dev_release,
>>>> + .unlocked_ioctl = ksm_dev_ioctl,
>>>> + .compat_ioctl = ksm_dev_ioctl,
>>>> +};
>>>>
>>>>
>>> Why do you roll your own module reference counting? Is there a
>>> reason you don't just set .owner and let the VFS handle it?
>>>
>>>
>> Yes, I am taking get_task_mm() if the module will be removed before i
>> free the mms, things will go wrong
>>
>
> But...if you set .owner, the VFS will do the try_module_get() *before*
> calling into your module (as an added bonus, it will actually check the
> return value too).
Ohhh i see what you mean
you are right i had at least needed to check for the return value of
try_module_get(),
anyway will check this issue for V2.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
-
Re: [PATCH 3/4] add ksm kernel shared memory driver
On Tue, 11 Nov 2008 15:03:45 MST, Jonathan Corbet said:
> > +#define PAGECMP_OFFSET 128
> > +#define PAGEHASH_SIZE (PAGECMP_OFFSET ? PAGECMP_OFFSET : PAGE_SIZE)
> > +/* hash the page */
> > +static void page_hash(struct page *page, unsigned char *digest)
>
> So is this really saying that you only hash the first 128 bytes, relying on
> full compares for the rest? I assume there's a perfectly good reason for
> doing it that way, but it's not clear to me from reading the code. Do you
> gain performance which is not subsequently lost in the (presumably) higher
> number of hash collisions?
Seems reasonably sane to me - only doing the first 128 bytes rather than
a full 4K page is some 32 times faster. Yes, you'll have the *occasional*
case where two pages were identical for 128 bytes but then differed, which is
why there's buckets. But the vast majority of the time, at least one bit
will be different in the first part.
In fact, I'd not be surprised if only going for 64 bytes works well...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Exmh version 2.5 07/13/2001
iD8DBQFJGgnQcC3lWbTT17ARAlPaAKCJ13v2zxBAEeCBOwc54F DvcFVHAQCfXw1N
mk26Q65TvGS4aSjLOuWvCfU=
=Vsok
-----END PGP SIGNATURE-----
-
Re: [PATCH 3/4] add ksm kernel shared memory driver
Izik Eidus wrote:
>> Any benchmarks on the runtime cost of having KSM running?
>>
>
> This one is problematic, ksm can take anything from 0% to 100% cpu
> its all depend on how fast you run it.
> it have 3 parameters:
> number of pages to scan before it go to sleep
> maximum number of pages to merge while we scanning the above pages
> (merging is expensive)
> time to sleep (when runing from userspace using /dev/ksm, we actually
> do it there (userspace)
The scan process priority also has its effect. One strategy would be to
run it at idle priority as long as you have enough free memory, and
increase the priority as memory starts depleting.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
-
Re: [PATCH 3/4] add ksm kernel shared memory driver
Jonathan Corbet wrote:
>> +static struct list_head slots;
>>
>
> Some of these file-static variable names seem a little..terse...
>
While ksm was written to be independent of a certain TLA-named kernel
subsystem developed two rooms away, they share some naming... this
refers to kvm 'memory slots' which correspond to DIMM banks.
I guess it should be renamed to merge_ranges or something.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
-
Re: [PATCH 3/4] add ksm kernel shared memory driver
Hi Jonathan,
On Tue, Nov 11, 2008 at 03:30:28PM -0700, Jonathan Corbet wrote:
> But it will fail in a totally silent and mysterious way. Doesn't it
> seem better to verify the values when you can return a meaningful error
> code to the caller?
I think you're right, but just because find_extend_vma will have the
effect of growing the kernel stack down. We clearly don't set it on a
stack with KVM as there's nothing to share on the stack usually - we
only set it in the guest physical memory range. And things are safe
regardless as get_user_pages is verifying the values for us. Problem
is it's using find_extend_vma because it behaves like a page fault. We
must not behave like a pagefault, we're much closer to follow_page
only than a page fault. Not a big deal, but it can be improved by
avoiding to extend the stack somehow (likely simplest is to call
find_vma twice, first time externally, we hold mmap_sem externally so
all right).
> What about things like cache effects from scanning all those pages? My
> guess is that, if you're trying to run dozens of Windows guests, cache
> usage is not at the top of your list of concerns, but I could be
> wrong. Usually am...
Oh that's not an issue. This is all about trading some CPU for lots of
free memory. It pays off big as so many more VM can run. With desktop
virtualization and 1G systems, you reach a RAM bottleneck much quicker
than a CPU bottleneck. Perhaps not so quick on server virtualization
but the point is this is intentional. It may be possible to compute
the jhash (that's where the cpu is spent) with instructions that don't
pollute the cpu cache but I doubt it's going to make much an huge
difference.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
-
Re: [PATCH 3/4] add ksm kernel shared memory driver
Jonathan Corbet wrote:
>
> What about things like cache effects from scanning all those pages? My
> guess is that, if you're trying to run dozens of Windows guests, cache
> usage is not at the top of your list of concerns, but I could be
> wrong. Usually am...
>
Ok, ksm does make the cache of the cpu dirty when scanning the pages
(but the scanning happen slowly slowly and cache usually get dirty much
faster)
But infact it improve the cache by the fact that it make many ptes point
to the same page
so if before we had 12 process touching 12 diffrent physical page they
would dirty the page
but now they will touch just one...
so i guess it depend on how you see it...
> Thanks,
>
> jon
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
-
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Tue, Nov 11, 2008 at 04:30:22PM -0600, Christoph Lameter wrote:
> On Tue, 11 Nov 2008, Andrea Arcangeli wrote:
>
> > this page_count check done with only the tree_lock won't prevent a
> > task to start O_DIRECT after page_count has been read in the above line.
> >
> > If a thread starts O_DIRECT on the page, and the o_direct is still in
> > flight by the time you copy the page to the new page, the read will
> > not be represented fully in the newpage leading to userland data
> > corruption.
>
> O_DIRECT does not take a refcount on the page in order to prevent this?
It definitely does, it's also the only thing it does.
The whole point is that O_DIRECT can start the instruction after
page_count returns as far as I can tell.
If you check the three emails I linked in answer to Andrew on the
topic, we agree the o_direct can't start under PT lock (or under
mmap_sem in write mode but migrate.c rightefully takes the read
mode). So the fix used in ksm page_wrprotect and in fork() is to check
page_count vs page_mapcount inside PT lock before doing anything on
the pte. If you just mark the page wprotect while O_DIRECT is in
flight, that's enough for fork() to generate data corruption in the
parent (not the child where the result would be undefined). But in the
parent the result of the o-direct is defined and it'd never corrupt if
this was a cached-I/O. The moment the parent pte is marked readonly, a thread
in the parent could write to the last 512bytes of the page, leading to
the first 512bytes coming with O_DIRECT from disk being lost (as the
write will trigger a cow before I/O is complete and the dma will
complete on the oldpage).
We do the check in page_wprotect because that's the _first_ place
where we mark a pte readonly. Same as fork. And we can't convert a pte
from writeable to wrprotected without doing this check inside PT lock
or we generate data corruption with o_direct (again same as the bug in
fork).
We don't have to check the page_count vs mapcount later in
replace_page because we know if anybody started an O_DIRECT read from
disk, it would have triggered a cow, and the pte_same check that we
have to do for other reasons would take care of bailing out the
replace_page.
> Oh they could be migrated if you had a callback to the devices method for
> giving up references. Same as slab defrag.
The oldpage is a regular anonymous page for us, not the 'strange'
object called PageKSM. And we need to migrate many anonymous pages
having destination a single PageKSM page already preallocated and not
being an anonymous page. We never migrate PageKSM to anything, that's
always the destination.
> Seems that we are tinkering around with the concept of what an anonymous
> page is? Doesnt shmem have some means of converting pages to file backed?
> Swizzling?
Anonymous page is anything with page->mapping pointing to an anon_vma
or swapcache_space instead of an address_space of a real inode backed
by the vfs.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
-
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Wed, Nov 12, 2008 at 12:17:22AM +0100, Andrea Arcangeli wrote:
> We don't have to check the page_count vs mapcount later in
> replace_page because we know if anybody started an O_DIRECT read from
> disk, it would have triggered a cow, and the pte_same check that we
> have to do for other reasons would take care of bailing out the
> replace_page.
Ah, for completeness: above I didn't mention the case of O_DIRECT
writes to disk, because we never need to care about those. They're
never a problem. If the page is replaced and the cpu writes to the
page and by doing so triggers a cow that lead to the CPU write to go
in a different page (not the one under dma) it'll be like if the write
to disk completed before the cpu overwritten the page, so result is
undefined. I don't think we've to define the case of somebody doing a
direct read from a location where there's still an o_direct write in
flight either.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
-
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Wed, 12 Nov 2008, Andrea Arcangeli wrote:
> > O_DIRECT does not take a refcount on the page in order to prevent this?
>
> It definitely does, it's also the only thing it does.
Then page migration will not occur because there is an unresolved
reference.
> The whole point is that O_DIRECT can start the instruction after
> page_count returns as far as I can tell.
But there must still be reference for the bio and whatever may be going on
at the time in order to perform the I/O operation.
> If you check the three emails I linked in answer to Andrew on the
> topic, we agree the o_direct can't start under PT lock (or under
> mmap_sem in write mode but migrate.c rightefully takes the read
> mode). So the fix used in ksm page_wrprotect and in fork() is to check
> page_count vs page_mapcount inside PT lock before doing anything on
> the pte. If you just mark the page wprotect while O_DIRECT is in
> flight, that's enough for fork() to generate data corruption in the
> parent (not the child where the result would be undefined). But in the
> parent the result of the o-direct is defined and it'd never corrupt if
> this was a cached-I/O. The moment the parent pte is marked readonly, a thread
> in the parent could write to the last 512bytes of the page, leading to
> the first 512bytes coming with O_DIRECT from disk being lost (as the
> write will trigger a cow before I/O is complete and the dma will
> complete on the oldpage).
Have you actually seen corruption or this conjecture? AFACT the page
count is elevated while I/O is in progress and thus this is safe.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/