[PATCH 0/4] ksm - dynamic page sharing driver for linux - Kernel

This is a discussion on [PATCH 0/4] ksm - dynamic page sharing driver for linux - Kernel ; Jonathan Corbet wrote: > [Let's see if I can get through the rest without premature sends...] > > On Wed, 12 Nov 2008 00:17:39 +0200 > Izik Eidus wrote: > > >>> Actually, it occurs to me that there's no ...

+ Reply to Thread
Page 3 of 3 FirstFirst 1 2 3
Results 41 to 50 of 50

Thread: [PATCH 0/4] ksm - dynamic page sharing driver for linux

  1. Re: [PATCH 3/4] add ksm kernel shared memory driver

    Jonathan Corbet wrote:
    > [Let's see if I can get through the rest without premature sends...]
    >
    > On Wed, 12 Nov 2008 00:17:39 +0200
    > Izik Eidus wrote:
    >
    >
    >>> Actually, it occurs to me that there's no sanity checks on any of
    >>> the values passed in by ioctl(). What happens if the user tells
    >>> KSM to scan a bogus range of memory?
    >>>
    >>>

    >> Well get_user_pages() run in context of the process, therefore it
    >> should fail in "bogus range of memory"
    >>

    >
    > But it will fail in a totally silent and mysterious way. Doesn't it
    > seem better to verify the values when you can return a meaningful error
    > code to the caller?
    >
    >


    Well I dont mind insert it (the above for sure is not a bug)
    but even with that, the user can still free the memory that he gave to us
    so this check if "nice to have check", we have nothing to do but to relay on
    get_user_pages return value

    > The other ioctl() calls have the same issue; you can start the thread
    > with nonsensical values for the number of pages to scan and the sleep
    > time.
    >


    well about this i agree, here it make alot of logic to check the values!



    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: [PATCH 3/4] add ksm kernel shared memory driver

    Jonathan Corbet wrote:
    > On Wed, 12 Nov 2008 00:17:39 +0200
    > Izik Eidus wrote:
    >
    >
    >>>> +static int ksm_dev_open(struct inode *inode, struct file *filp)
    >>>> +{
    >>>> + try_module_get(THIS_MODULE);
    >>>> + return 0;
    >>>> +}
    >>>> +
    >>>> +static int ksm_dev_release(struct inode *inode, struct file *filp)
    >>>> +{
    >>>> + module_put(THIS_MODULE);
    >>>> + return 0;
    >>>> +}
    >>>> +
    >>>> +static struct file_operations ksm_chardev_ops = {
    >>>> + .open = ksm_dev_open,
    >>>> + .release = ksm_dev_release,
    >>>> + .unlocked_ioctl = ksm_dev_ioctl,
    >>>> + .compat_ioctl = ksm_dev_ioctl,
    >>>> +};
    >>>>
    >>>>
    >>> Why do you roll your own module reference counting? Is there a
    >>> reason you don't just set .owner and let the VFS handle it?
    >>>
    >>>

    >> Yes, I am taking get_task_mm() if the module will be removed before i
    >> free the mms, things will go wrong
    >>

    >
    > But...if you set .owner, the VFS will do the try_module_get() *before*
    > calling into your module (as an added bonus, it will actually check the
    > return value too).

    Ohhh i see what you mean
    you are right i had at least needed to check for the return value of
    try_module_get(),
    anyway will check this issue for V2.
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  3. Re: [PATCH 3/4] add ksm kernel shared memory driver

    On Tue, 11 Nov 2008 15:03:45 MST, Jonathan Corbet said:

    > > +#define PAGECMP_OFFSET 128
    > > +#define PAGEHASH_SIZE (PAGECMP_OFFSET ? PAGECMP_OFFSET : PAGE_SIZE)
    > > +/* hash the page */
    > > +static void page_hash(struct page *page, unsigned char *digest)

    >
    > So is this really saying that you only hash the first 128 bytes, relying on
    > full compares for the rest? I assume there's a perfectly good reason for
    > doing it that way, but it's not clear to me from reading the code. Do you
    > gain performance which is not subsequently lost in the (presumably) higher
    > number of hash collisions?


    Seems reasonably sane to me - only doing the first 128 bytes rather than
    a full 4K page is some 32 times faster. Yes, you'll have the *occasional*
    case where two pages were identical for 128 bytes but then differed, which is
    why there's buckets. But the vast majority of the time, at least one bit
    will be different in the first part.

    In fact, I'd not be surprised if only going for 64 bytes works well...

    -----BEGIN PGP SIGNATURE-----
    Version: GnuPG v1.4.9 (GNU/Linux)
    Comment: Exmh version 2.5 07/13/2001

    iD8DBQFJGgnQcC3lWbTT17ARAlPaAKCJ13v2zxBAEeCBOwc54F DvcFVHAQCfXw1N
    mk26Q65TvGS4aSjLOuWvCfU=
    =Vsok
    -----END PGP SIGNATURE-----


  4. Re: [PATCH 3/4] add ksm kernel shared memory driver

    Izik Eidus wrote:
    >> Any benchmarks on the runtime cost of having KSM running?
    >>

    >
    > This one is problematic, ksm can take anything from 0% to 100% cpu
    > its all depend on how fast you run it.
    > it have 3 parameters:
    > number of pages to scan before it go to sleep
    > maximum number of pages to merge while we scanning the above pages
    > (merging is expensive)
    > time to sleep (when runing from userspace using /dev/ksm, we actually
    > do it there (userspace)


    The scan process priority also has its effect. One strategy would be to
    run it at idle priority as long as you have enough free memory, and
    increase the priority as memory starts depleting.

    --
    I have a truly marvellous patch that fixes the bug which this
    signature is too narrow to contain.

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  5. Re: [PATCH 3/4] add ksm kernel shared memory driver

    Jonathan Corbet wrote:
    >> +static struct list_head slots;
    >>

    >
    > Some of these file-static variable names seem a little..terse...
    >


    While ksm was written to be independent of a certain TLA-named kernel
    subsystem developed two rooms away, they share some naming... this
    refers to kvm 'memory slots' which correspond to DIMM banks.

    I guess it should be renamed to merge_ranges or something.

    --
    I have a truly marvellous patch that fixes the bug which this
    signature is too narrow to contain.

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  6. Re: [PATCH 3/4] add ksm kernel shared memory driver

    Hi Jonathan,

    On Tue, Nov 11, 2008 at 03:30:28PM -0700, Jonathan Corbet wrote:
    > But it will fail in a totally silent and mysterious way. Doesn't it
    > seem better to verify the values when you can return a meaningful error
    > code to the caller?


    I think you're right, but just because find_extend_vma will have the
    effect of growing the kernel stack down. We clearly don't set it on a
    stack with KVM as there's nothing to share on the stack usually - we
    only set it in the guest physical memory range. And things are safe
    regardless as get_user_pages is verifying the values for us. Problem
    is it's using find_extend_vma because it behaves like a page fault. We
    must not behave like a pagefault, we're much closer to follow_page
    only than a page fault. Not a big deal, but it can be improved by
    avoiding to extend the stack somehow (likely simplest is to call
    find_vma twice, first time externally, we hold mmap_sem externally so
    all right).

    > What about things like cache effects from scanning all those pages? My
    > guess is that, if you're trying to run dozens of Windows guests, cache
    > usage is not at the top of your list of concerns, but I could be
    > wrong. Usually am...


    Oh that's not an issue. This is all about trading some CPU for lots of
    free memory. It pays off big as so many more VM can run. With desktop
    virtualization and 1G systems, you reach a RAM bottleneck much quicker
    than a CPU bottleneck. Perhaps not so quick on server virtualization
    but the point is this is intentional. It may be possible to compute
    the jhash (that's where the cpu is spent) with instructions that don't
    pollute the cpu cache but I doubt it's going to make much an huge
    difference.
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  7. Re: [PATCH 3/4] add ksm kernel shared memory driver

    Jonathan Corbet wrote:
    >
    > What about things like cache effects from scanning all those pages? My
    > guess is that, if you're trying to run dozens of Windows guests, cache
    > usage is not at the top of your list of concerns, but I could be
    > wrong. Usually am...
    >


    Ok, ksm does make the cache of the cpu dirty when scanning the pages
    (but the scanning happen slowly slowly and cache usually get dirty much
    faster)
    But infact it improve the cache by the fact that it make many ptes point
    to the same page
    so if before we had 12 process touching 12 diffrent physical page they
    would dirty the page
    but now they will touch just one...

    so i guess it depend on how you see it...

    > Thanks,
    >
    > jon
    >


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  8. Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

    On Tue, Nov 11, 2008 at 04:30:22PM -0600, Christoph Lameter wrote:
    > On Tue, 11 Nov 2008, Andrea Arcangeli wrote:
    >
    > > this page_count check done with only the tree_lock won't prevent a
    > > task to start O_DIRECT after page_count has been read in the above line.
    > >
    > > If a thread starts O_DIRECT on the page, and the o_direct is still in
    > > flight by the time you copy the page to the new page, the read will
    > > not be represented fully in the newpage leading to userland data
    > > corruption.

    >
    > O_DIRECT does not take a refcount on the page in order to prevent this?


    It definitely does, it's also the only thing it does.

    The whole point is that O_DIRECT can start the instruction after
    page_count returns as far as I can tell.

    If you check the three emails I linked in answer to Andrew on the
    topic, we agree the o_direct can't start under PT lock (or under
    mmap_sem in write mode but migrate.c rightefully takes the read
    mode). So the fix used in ksm page_wrprotect and in fork() is to check
    page_count vs page_mapcount inside PT lock before doing anything on
    the pte. If you just mark the page wprotect while O_DIRECT is in
    flight, that's enough for fork() to generate data corruption in the
    parent (not the child where the result would be undefined). But in the
    parent the result of the o-direct is defined and it'd never corrupt if
    this was a cached-I/O. The moment the parent pte is marked readonly, a thread
    in the parent could write to the last 512bytes of the page, leading to
    the first 512bytes coming with O_DIRECT from disk being lost (as the
    write will trigger a cow before I/O is complete and the dma will
    complete on the oldpage).

    We do the check in page_wprotect because that's the _first_ place
    where we mark a pte readonly. Same as fork. And we can't convert a pte
    from writeable to wrprotected without doing this check inside PT lock
    or we generate data corruption with o_direct (again same as the bug in
    fork).

    We don't have to check the page_count vs mapcount later in
    replace_page because we know if anybody started an O_DIRECT read from
    disk, it would have triggered a cow, and the pte_same check that we
    have to do for other reasons would take care of bailing out the
    replace_page.

    > Oh they could be migrated if you had a callback to the devices method for
    > giving up references. Same as slab defrag.


    The oldpage is a regular anonymous page for us, not the 'strange'
    object called PageKSM. And we need to migrate many anonymous pages
    having destination a single PageKSM page already preallocated and not
    being an anonymous page. We never migrate PageKSM to anything, that's
    always the destination.

    > Seems that we are tinkering around with the concept of what an anonymous
    > page is? Doesnt shmem have some means of converting pages to file backed?
    > Swizzling?


    Anonymous page is anything with page->mapping pointing to an anon_vma
    or swapcache_space instead of an address_space of a real inode backed
    by the vfs.
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  9. Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

    On Wed, Nov 12, 2008 at 12:17:22AM +0100, Andrea Arcangeli wrote:
    > We don't have to check the page_count vs mapcount later in
    > replace_page because we know if anybody started an O_DIRECT read from
    > disk, it would have triggered a cow, and the pte_same check that we
    > have to do for other reasons would take care of bailing out the
    > replace_page.


    Ah, for completeness: above I didn't mention the case of O_DIRECT
    writes to disk, because we never need to care about those. They're
    never a problem. If the page is replaced and the cpu writes to the
    page and by doing so triggers a cow that lead to the CPU write to go
    in a different page (not the one under dma) it'll be like if the write
    to disk completed before the cpu overwritten the page, so result is
    undefined. I don't think we've to define the case of somebody doing a
    direct read from a location where there's still an o_direct write in
    flight either.
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  10. Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

    On Wed, 12 Nov 2008, Andrea Arcangeli wrote:

    > > O_DIRECT does not take a refcount on the page in order to prevent this?

    >
    > It definitely does, it's also the only thing it does.


    Then page migration will not occur because there is an unresolved
    reference.

    > The whole point is that O_DIRECT can start the instruction after
    > page_count returns as far as I can tell.


    But there must still be reference for the bio and whatever may be going on
    at the time in order to perform the I/O operation.

    > If you check the three emails I linked in answer to Andrew on the
    > topic, we agree the o_direct can't start under PT lock (or under
    > mmap_sem in write mode but migrate.c rightefully takes the read
    > mode). So the fix used in ksm page_wrprotect and in fork() is to check
    > page_count vs page_mapcount inside PT lock before doing anything on
    > the pte. If you just mark the page wprotect while O_DIRECT is in
    > flight, that's enough for fork() to generate data corruption in the
    > parent (not the child where the result would be undefined). But in the
    > parent the result of the o-direct is defined and it'd never corrupt if
    > this was a cached-I/O. The moment the parent pte is marked readonly, a thread
    > in the parent could write to the last 512bytes of the page, leading to
    > the first 512bytes coming with O_DIRECT from disk being lost (as the
    > write will trigger a cow before I/O is complete and the dma will
    > complete on the oldpage).


    Have you actually seen corruption or this conjecture? AFACT the page
    count is elevated while I/O is in progress and thus this is safe.

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

+ Reply to Thread
Page 3 of 3 FirstFirst 1 2 3