[PATCH 0/4] ksm - dynamic page sharing driver for linux - Kernel

This is a discussion on [PATCH 0/4] ksm - dynamic page sharing driver for linux - Kernel ; On Tue, 11 Nov 2008 21:38:06 +0100 Andrea Arcangeli wrote: > > > + * set all the ptes pointed to a page as read only, > > > + * odirect_sync is set to 0 in case we cannot ...

+ Reply to Thread
Page 2 of 3 FirstFirst 1 2 3 LastLast
Results 21 to 40 of 50

Thread: [PATCH 0/4] ksm - dynamic page sharing driver for linux

  1. Re: [PATCH 1/4] rmap: add page_wrprotect() function,

    On Tue, 11 Nov 2008 21:38:06 +0100
    Andrea Arcangeli wrote:

    > > > + * set all the ptes pointed to a page as read only,
    > > > + * odirect_sync is set to 0 in case we cannot protect against race with odirect
    > > > + * return the number of ptes that were set as read only
    > > > + * (ptes that were read only before this function was called are couned as well)
    > > > + */

    > >
    > > But it isn't.

    >
    > What isn't?


    This code comment had the kerneldoc marker ("/**") but it isn't a
    kerneldoc comment.

    > > I don't understand this odirect_sync thing. What race? Please expand
    > > this comment to make the function of odirect_sync more understandable.

    >
    > I should have answered this one with the above 3 links.


    OK, well can we please update the code so these things are clearer.

    (It's a permanent problem I have. I ask "what is this", but I really
    mean "the code should be changed so that readers will know what this is")

    > > What do you think about making all this new code dependent upon some
    > > CONFIG_ switch which CONFIG_KVM can select?

    >
    > I like that too.

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: [PATCH 1/4] rmap: add page_wrprotect() function,

    On Tue, Nov 11, 2008 at 01:01:49PM -0800, Andrew Morton wrote:
    > On Tue, 11 Nov 2008 21:38:06 +0100
    > Andrea Arcangeli wrote:
    >
    > > > > + * set all the ptes pointed to a page as read only,
    > > > > + * odirect_sync is set to 0 in case we cannot protect against race with odirect
    > > > > + * return the number of ptes that were set as read only
    > > > > + * (ptes that were read only before this function was called are couned as well)
    > > > > + */
    > > >
    > > > But it isn't.

    > >
    > > What isn't?

    >
    > This code comment had the kerneldoc marker ("/**") but it isn't a
    > kerneldoc comment.


    8) never mind, I thought it was referred to the quoted comment, like
    if the comment pretended something the function wasn't doing.

    > OK, well can we please update the code so these things are clearer.


    Sure. Let's say this o_direct fix was done after confirmation that
    this was the agreed fix to solve those kind of problems (same fix that
    fork will need as in the third link).

    > (It's a permanent problem I have. I ask "what is this", but I really
    > mean "the code should be changed so that readers will know what this is")


    Agreed, this deserves much more commentary. But not much effort was
    done in this area, because fork (and likely migrate) has the same race
    and it isn't fixed yet so this is still a work-in-progress area. The
    fix has to be the same for all places that have this bug, and we have
    not agreed on a way to fix gup_fast yet (all I provided as suggestion
    that will work fine is brlock but surely Nick will try to find a way
    that remains non-blocking, which is the core of the problem, if it
    can't block, we've to use RCU but we can't wait there as far as I can
    tell as all sort of synchronous memory regions are involved).

    I think once the problem will get fixed and patches will go in
    mainline, more docs will emerge (I hope . We'll most certainly need
    changes to cover gup_fast (including if we use brlock, the read side
    of the lock will have to be taken around it). This was a fix we did at
    the last minute just to reflect the latest status.

    I CC Nick to this thread btw.
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  3. Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

    On Tue, 11 Nov 2008, Izik Eidus wrote:

    > yes but it replace it with kernel allocated page.
    > > page migration already kinda does that. Is there common ground?
    > >
    > >

    > page migration as far as i saw cant migrate anonymous page into kernel page.
    > if you want we can change page_migration to do that, but i thought you will
    > rather have ksm changes separate.


    What do you mean by kernel page? The kernel can allocate a page and then
    point a user space pte to it. That is how page migration works.
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  4. Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

    Christoph Lameter wrote:
    >> page migration as far as i saw cant migrate anonymous page into kernel page.
    >> if you want we can change page_migration to do that, but i thought you will
    >> rather have ksm changes separate.
    >>

    >
    > What do you mean by kernel page? The kernel can allocate a page and then
    > point a user space pte to it. That is how page migration works.
    >

    i mean filebacked page (!AnonPage())
    ksm need the pte inside the vma to point from anonymous page into
    filebacked page
    can migrate.c do it without changes?
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  5. Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

    On Tue, 11 Nov 2008, Andrea Arcangeli wrote:

    > btw, page_migration likely is buggy w.r.t. o_direct too (and now
    > unfixable with gup_fast until the 2.4 brlock is added around it or
    > similar) if it does the same thing but without any page_mapcount vs
    > page_count check.


    Details please?

    > page_migration does too much for us, so us calling into migrate.c may
    > not be ideal. It has to convert a fresh page to a VM page. In KSM we
    > don't convert the newpage to be a VM page, we just replace the anon
    > page with another page. The new page in the KSM case is not a page
    > known by the VM, not in the lru etc...


    A VM page as opposed to pages not in the VM? ???

    page migration requires the page to be on the LRU. That could be changed
    if you have a different means of isolating a page from its page tables.

    > The way to go could be to change the page_migration to use
    > replace_page (or __replace_page if called in some shared inner-lock
    > context) after preparing the newpage to be a regular VM page. If we
    > can do that, migrate.c will get the o_direct race fixed too for free.


    Define a regular VM page? A page on the LRU?

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  6. Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

    On Tue, Nov 11, 2008 at 03:21:45PM -0600, Christoph Lameter wrote:
    > What do you mean by kernel page? The kernel can allocate a page and then
    > point a user space pte to it. That is how page migration works.


    Just to make an example, remove_migration_pte adds the page back to
    rmap layer. We can't do that right now as rmap for the ksm pages will
    be built inside ksm, or alternatively rmap.c will have to learn to
    handle nonlinear anon-vma.

    Migration simply migrates the page. The new page is identical to the
    original one, just backed by different physical memory.

    For us the new page is an entirely different beast that we build
    ourself (we can't let migrate.c to pretend dealing with the newpage
    like if it resembled the old page like it's doing now).

    We replace a linear anon page with something that isn't an anonymous
    page at all right now (in the future it may become a nonlinear anon
    page if VM learns about it, or still an unknown page
    external-rmappable if we go the external-rmap way).

    There's clearly something to share, but the replace_page seem to be
    the one that could be called from migrate.c. What is different is that
    we don't need the migration pte placeholder, we never block releasing
    locks, all atomic with pte wrprotected, and a final pte_same check
    under PT lock before we replace the page. There isn't a whole lot to
    share after all, but surely it'd be nice to share if we can. Us
    calling into migrate.c isn't feasible right now without some
    significant change to migrate.c where it would be misplaced IMHO as to
    share we'd need migrate.c to call into VM core instead.
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  7. Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

    On Tue, 11 Nov 2008, Izik Eidus wrote:

    > > What do you mean by kernel page? The kernel can allocate a page and then
    > > point a user space pte to it. That is how page migration works.
    > >

    > i mean filebacked page (!AnonPage())


    ok.

    > ksm need the pte inside the vma to point from anonymous page into filebacked
    > page
    > can migrate.c do it without changes?


    So change anonymous to filebacked page?

    Currently page migration assumes that the page will continue to be part
    of the existing file or anon vma.

    What you want sounds like assigning a swap pte to an anonymous page? That
    way a anon page gains membership in a file backed mapping.


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  8. Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

    Christoph Lameter wrote:
    >
    >
    > Currently page migration assumes that the page will continue to be part
    > of the existing file or anon vma.
    >


    exactly, and ksm really need it to get out of the existing anon vma!

    > What you want sounds like assigning a swap pte to an anonymous page? That
    > way a anon page gains membership in a file backed mapping.
    >
    >
    >


    No, i want pte that is found inside vma and point to anonymous page,
    will stop point into the anonymous page
    and will point to a whole diffrent page that i chose (for ksm it is
    needed because this way we are mapping alot
    of ptes into the same write_protected page and save memory)

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  9. Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

    On Tue, 11 Nov 2008, Avi Kivity wrote:

    > Christoph Lameter wrote:
    > > page migration requires the page to be on the LRU. That could be changed
    > > if you have a different means of isolating a page from its page tables.
    > >

    >
    > Isn't rmap the means of isolating a page from its page tables? I guess I'm
    > misunderstanding something.


    In order to migrate a page one first has to make sure that userspace and
    the kernel cannot access the page in any way. User space must be made to
    generate page faults and all kernel references must be accounted for and
    not be in use.

    The user space portion involves changing the page tables so that faults
    are generated.

    The kernel portion isolates the page from the LRU (to exempt it from
    kernel reclaim handling etc).

    Only then can the page and its metadata be copied to a new location.

    Guess you already have the LRU portion done. So you just need the user
    space isolation portion?

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  10. Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

    Christoph Lameter wrote:
    > page migration requires the page to be on the LRU. That could be changed
    > if you have a different means of isolating a page from its page tables.
    >


    Isn't rmap the means of isolating a page from its page tables? I guess
    I'm misunderstanding something.

    --
    I have a truly marvellous patch that fixes the bug which this
    signature is too narrow to contain.

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  11. Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

    Christoph Lameter wrote:
    > On Tue, 11 Nov 2008, Avi Kivity wrote:
    >
    >
    >> Christoph Lameter wrote:
    >>
    >>> page migration requires the page to be on the LRU. That could be changed
    >>> if you have a different means of isolating a page from its page tables.
    >>>
    >>>

    >> Isn't rmap the means of isolating a page from its page tables? I guess I'm
    >> misunderstanding something.
    >>

    >
    > In order to migrate a page one first has to make sure that userspace and
    > the kernel cannot access the page in any way. User space must be made to
    > generate page faults and all kernel references must be accounted for and
    > not be in use.
    >

    This is really not the case for ksm,
    in ksm we allow the page to be accessed all the time, we dont have to
    swap the page
    like migrate.c is doing...

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  12. Re: [PATCH 3/4] add ksm kernel shared memory driver

    I don't claim to begin to really understand the deep VM side of this
    patch, but I can certainly pick nits as I work through it...sorry for
    the lack of anything more substantive.

    > +static struct list_head slots;


    Some of these file-static variable names seem a little..terse...

    > +#define PAGECMP_OFFSET 128
    > +#define PAGEHASH_SIZE (PAGECMP_OFFSET ? PAGECMP_OFFSET : PAGE_SIZE)
    > +/* hash the page */
    > +static void page_hash(struct page *page, unsigned char *digest)


    So is this really saying that you only hash the first 128 bytes, relying on
    full compares for the rest? I assume there's a perfectly good reason for
    doing it that way, but it's not clear to me from reading the code. Do you
    gain performance which is not subsequently lost in the (presumably) higher
    number of hash collisions?

    > +static int ksm_scan_start(struct ksm_scan *ksm_scan, int scan_npages,
    > + int max_npages)
    > +{
    > + struct ksm_mem_slot *slot;
    > + struct page *page[1];
    > + int val;
    > + int ret = 0;
    > +
    > + down_read(&slots_lock);
    > +
    > + scan_update_old_index(ksm_scan);
    > +
    > + while (scan_npages > 0 && max_npages > 0) {


    Should this loop maybe check kthread_run too? It seems like you could loop
    for a long time after kthread_run has been set to zero.

    > +static int ksm_dev_open(struct inode *inode, struct file *filp)
    > +{
    > + try_module_get(THIS_MODULE);
    > + return 0;
    > +}
    > +
    > +static int ksm_dev_release(struct inode *inode, struct file *filp)
    > +{
    > + module_put(THIS_MODULE);
    > + return 0;
    > +}
    > +
    > +static struct file_operations ksm_chardev_ops = {
    > + .open = ksm_dev_open,
    > + .release = ksm_dev_release,
    > + .unlocked_ioctl = ksm_dev_ioctl,
    > + .compat_ioctl = ksm_dev_ioctl,
    > +};


    Why do you roll your own module reference counting? Is there a reason you
    don't just set .owner and let the VFS handle it?

    Given that the KSM_REGISTER_MEMORY_REGION ioctl() creates unswappable
    memory, should there be some sort of capability check done there? A check
    for starting/stopping the thread might also make sense. Or is that
    expected to be handled via permissions on /dev/ksm?

    Actually, it occurs to me that there's no sanity checks on any of the
    values passed in by ioctl(). What happens if the user tells KSM to scan a
    bogus range of memory?

    Any benchmarks on the runtime cost of having KSM running?

    jon
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  13. Re: [PATCH 3/4] add ksm kernel shared memory driver

    On Tue, Nov 11, 2008 at 12:38:51PM -0800, Andrew Morton wrote:
    > Please fully document that interface in the changelog so that we can
    > review your decisions here. This is by far the most important
    > consideration - we can change all the code, but interfaces are for
    > ever.


    Yes, this is the most important point in my view. Even after we make
    the ksm pages swappable it'll remain an invisible change to anybody
    but us (it'll work better under VM pressure, but that's about it).

    > uh-oh, ioctls.


    Yes, it's all ioctl based. In very short, it assigns the task and
    memory region to scan, and allows to start/stop the kernel thread that
    does the scan while selecting how many pages to execute per scan and
    how many scans to execute per second. The more pages per scan and the
    more scans per second, the higher cpu utilization of the kernel thread.

    It would also be possible to submit ksm in a way that has no API at
    all (besides kernel module params tunable later by sysfs to set
    pages-per-scan and scan-frequency). Doing that would allow us to defer
    the API decisions. But then all anonymous memory in the system will be
    scanned unconditionally even if there may be little to share for
    certain tasks. It would perform quite well, usually the sharable part
    is the largest part so the rest wouldn't generate an huge amount of
    cpu waste. There's some ram waste too, as some memory has to be
    allocated for every page we want to possibly share.

    In some ways removing the API would make it simpler to use for non
    virtualization environments where they may want to enable it
    system-wide.

    > ooh, a comment!


    8)

    > > + kpage = alloc_page(GFP_KERNEL | __GFP_HIGHMEM);

    >
    > Stray whitepace.
    >
    > Replace with GFP_HIGHUSER.


    So not a cleanup, but an improvement (I agree highuser is better, this
    isn't by far any higher-priority kernel alloc and it deserves to have
    lower prio in the watermarks).

    > The term "shared memory" has specific meanings in Linux/Unix. I'm
    > suspecting that its use here was inappropriate but I have insufficient
    > information to be able to suggest alternatives.


    We could call it create_shared_anonymous_memory but then what if it
    ever becomes capable of sharing pagecache too? (I doubt it will, not
    in the short term at least

    Usually when we do these kind of tricks we use the word cloning, so
    perhaps also create_cloned_memory_area, so you can later say cloned
    anonymous memory or cloned shared memory page instead of just KSM
    page. But then this module would become KCM and not KSM 8). Perhaps we
    should just go recursive and use create_ksm_memory_area.

    > Generally: this review was rather a waste of your time and of mine
    > because the code is inadequately documented.


    Well, this was a great review considering how little the code was
    documented, definitely not a waste of time on our end, it was very
    helpful and the good thing is I don't see any controversial stuff.

    The two inner loops are the core of the ksm scan, and they're aren't
    very simple I've to say. Documenting won't be trivial but it's surely
    possible, Izik already working on it I think! Apologies if any time
    was wasted on your side!!
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  14. Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

    On Tue, Nov 11, 2008 at 03:26:57PM -0600, Christoph Lameter wrote:
    > On Tue, 11 Nov 2008, Andrea Arcangeli wrote:
    >
    > > btw, page_migration likely is buggy w.r.t. o_direct too (and now
    > > unfixable with gup_fast until the 2.4 brlock is added around it or
    > > similar) if it does the same thing but without any page_mapcount vs
    > > page_count check.

    >
    > Details please?


    spin_lock_irq(&mapping->tree_lock);

    pslot = radix_tree_lookup_slot(&mapping->page_tree,
    page_index(page));

    expected_count = 2 + !!PagePrivate(page);
    if (page_count(page) != expected_count ||

    this page_count check done with only the tree_lock won't prevent a
    task to start O_DIRECT after page_count has been read in the above line.

    If a thread starts O_DIRECT on the page, and the o_direct is still in
    flight by the time you copy the page to the new page, the read will
    not be represented fully in the newpage leading to userland data
    corruption.

    > > page_migration does too much for us, so us calling into migrate.c may
    > > not be ideal. It has to convert a fresh page to a VM page. In KSM we
    > > don't convert the newpage to be a VM page, we just replace the anon
    > > page with another page. The new page in the KSM case is not a page
    > > known by the VM, not in the lru etc...

    >
    > A VM page as opposed to pages not in the VM? ???


    Yes, you migrate old VM pages to new VM pages. We replace VM pages
    with unknown stuff called KSM pages. So in the inner function where you
    replace the pte-migration-placeholder with a pte pointing to the
    newpage, you also rightfully do unconditional stuff we can't be doing
    like page_add_*_rmap. It's VM pages you're dealing with. Not for us.

    > page migration requires the page to be on the LRU. That could be changed
    > if you have a different means of isolating a page from its page tables.


    Yes we'd need to change those bits to be able to use migrate.c.

    > Define a regular VM page? A page on the LRU?


    Yes, pages owned, allocated and worked on by the VM. So they can be
    swapped, collected, migrated etc... You can't possibly migrate a
    device driver page for example and infact those device driver pages
    can't be migrated either.

    The KSM page initially is a driver page, later we'd like to teach the
    VM how to swap it by introducing rmap methods and adding it to the
    LRU. As long as it's only anonymous memory that we're sharing/cloning,
    we won't have to patch pagecache radix tree and other stuff. BTW, if
    we ever decice to clone pagecache we could generate immense metadata
    ram overhead in the radix tree with just a single page of data. All
    issues that don't exist for anon ram.
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  15. Re: [PATCH 3/4] add ksm kernel shared memory driver

    Jonathan Corbet wrote:
    > I don't claim to begin to really understand the deep VM side of this
    > patch, but I can certainly pick nits as I work through it...sorry for
    > the lack of anything more substantive.
    >
    >
    >> +static struct list_head slots;
    >>

    >
    > Some of these file-static variable names seem a little..terse...
    >
    >
    >> +#define PAGECMP_OFFSET 128
    >> +#define PAGEHASH_SIZE (PAGECMP_OFFSET ? PAGECMP_OFFSET : PAGE_SIZE)
    >> +/* hash the page */
    >> +static void page_hash(struct page *page, unsigned char *digest)
    >>

    >
    > So is this really saying that you only hash the first 128 bytes, relying on
    > full compares for the rest? I assume there's a perfectly good reason for
    > doing it that way, but it's not clear to me from reading the code. Do you
    > gain performance which is not subsequently lost in the (presumably) higher
    > number of hash collisions?
    >
    >
    >> +static int ksm_scan_start(struct ksm_scan *ksm_scan, int scan_npages,
    >> + int max_npages)
    >> +{
    >> + struct ksm_mem_slot *slot;
    >> + struct page *page[1];
    >> + int val;
    >> + int ret = 0;
    >> +
    >> + down_read(&slots_lock);
    >> +
    >> + scan_update_old_index(ksm_scan);
    >> +
    >> + while (scan_npages > 0 && max_npages > 0) {
    >>

    >
    > Should this loop maybe check kthread_run too? It seems like you could loop
    > for a long time after kthread_run has been set to zero.
    >
    >
    >> +static int ksm_dev_open(struct inode *inode, struct file *filp)
    >> +{
    >> + try_module_get(THIS_MODULE);
    >> + return 0;
    >> +}
    >> +
    >> +static int ksm_dev_release(struct inode *inode, struct file *filp)
    >> +{
    >> + module_put(THIS_MODULE);
    >> + return 0;
    >> +}
    >> +
    >> +static struct file_operations ksm_chardev_ops = {
    >> + .open = ksm_dev_open,
    >> + .release = ksm_dev_release,
    >> + .unlocked_ioctl = ksm_dev_ioctl,
    >> + .compat_ioctl = ksm_dev_ioctl,
    >> +};
    >>

    >
    > Why do you roll your own module reference counting? Is there a reason you
    > don't just set .owner and let the VFS handle it?
    >


    Yes, I am taking get_task_mm() if the module will be removed before i
    free the mms, things will go wrong

    > Given that the KSM_REGISTER_MEMORY_REGION ioctl() creates unswappable
    > memory, should there be some sort of capability check done there? A check
    > for starting/stopping the thread might also make sense. Or is that
    > expected to be handled via permissions on /dev/ksm?
    >


    Well, I think giving premmision to /dev/ksm default as root is enough.
    No?

    > Actually, it occurs to me that there's no sanity checks on any of the
    > values passed in by ioctl(). What happens if the user tells KSM to scan a
    > bogus range of memory?
    >


    Well get_user_pages() run in context of the process, therefore it should
    fail in "bogus range of memory"

    > Any benchmarks on the runtime cost of having KSM running?
    >


    This one is problematic, ksm can take anything from 0% to 100% cpu
    its all depend on how fast you run it.
    it have 3 parameters:
    number of pages to scan before it go to sleep
    maximum number of pages to merge while we scanning the above pages
    (merging is expensive)
    time to sleep (when runing from userspace using /dev/ksm, we actually do
    it there (userspace)

    > jon
    > --
    > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    > the body of a message to majordomo@vger.kernel.org
    > More majordomo info at http://vger.kernel.org/majordomo-info.html
    > Please read the FAQ at http://www.tux.org/lkml/
    >


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  16. Re: [PATCH 3/4] add ksm kernel shared memory driver

    On Wed, 12 Nov 2008 00:17:39 +0200
    Izik Eidus wrote:

    > >> +static int ksm_dev_open(struct inode *inode, struct file *filp)
    > >> +{
    > >> + try_module_get(THIS_MODULE);
    > >> + return 0;
    > >> +}
    > >> +
    > >> +static int ksm_dev_release(struct inode *inode, struct file *filp)
    > >> +{
    > >> + module_put(THIS_MODULE);
    > >> + return 0;
    > >> +}
    > >> +
    > >> +static struct file_operations ksm_chardev_ops = {
    > >> + .open = ksm_dev_open,
    > >> + .release = ksm_dev_release,
    > >> + .unlocked_ioctl = ksm_dev_ioctl,
    > >> + .compat_ioctl = ksm_dev_ioctl,
    > >> +};
    > >>

    > >
    > > Why do you roll your own module reference counting? Is there a
    > > reason you don't just set .owner and let the VFS handle it?
    > >

    >
    > Yes, I am taking get_task_mm() if the module will be removed before i
    > free the mms, things will go wrong


    But...if you set .owner, the VFS will do the try_module_get() *before*
    calling into your module (as an added bonus, it will actually check the
    return value too). All you've succeeded in doing here is adding a
    microscopic race to the module reference counting; otherwise the end
    result is the same.

    jon
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  17. Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

    On Tue, Nov 11, 2008 at 03:31:18PM -0600, Christoph Lameter wrote:
    > > ksm need the pte inside the vma to point from anonymous page into filebacked
    > > page
    > > can migrate.c do it without changes?

    >
    > So change anonymous to filebacked page?
    >
    > Currently page migration assumes that the page will continue to be part
    > of the existing file or anon vma.
    >
    > What you want sounds like assigning a swap pte to an anonymous page? That
    > way a anon page gains membership in a file backed mapping.


    KSM needs to convert anonymous pages to PageKSM, which means a page
    owned by ksm.c and only known by ksm.c. The Linux VM will free this
    page in munmap but that's about it, all we do is to match the number
    of anon-ptes pointing to the page with the page_count. So besides
    freeing the page when the last user exit()s or cows it, the VM will do
    nothing about it. Initially. Later it can swap it in a nonlinear way.
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  18. Re: [PATCH 3/4] add ksm kernel shared memory driver

    [Let's see if I can get through the rest without premature sends...]

    On Wed, 12 Nov 2008 00:17:39 +0200
    Izik Eidus wrote:

    > > Actually, it occurs to me that there's no sanity checks on any of
    > > the values passed in by ioctl(). What happens if the user tells
    > > KSM to scan a bogus range of memory?
    > >

    >
    > Well get_user_pages() run in context of the process, therefore it
    > should fail in "bogus range of memory"


    But it will fail in a totally silent and mysterious way. Doesn't it
    seem better to verify the values when you can return a meaningful error
    code to the caller?

    The other ioctl() calls have the same issue; you can start the thread
    with nonsensical values for the number of pages to scan and the sleep
    time.

    >
    > > Any benchmarks on the runtime cost of having KSM running?
    > >

    >
    > This one is problematic, ksm can take anything from 0% to 100% cpu
    > its all depend on how fast you run it.
    > it have 3 parameters:
    > number of pages to scan before it go to sleep
    > maximum number of pages to merge while we scanning the above pages
    > (merging is expensive)
    > time to sleep (when runing from userspace using /dev/ksm, we actually
    > do it there (userspace)


    What about things like cache effects from scanning all those pages? My
    guess is that, if you're trying to run dozens of Windows guests, cache
    usage is not at the top of your list of concerns, but I could be
    wrong. Usually am...

    Thanks,

    jon
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  19. Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

    Christoph Lameter wrote:
    > On Tue, 11 Nov 2008, Avi Kivity wrote:
    >
    >
    >> Christoph Lameter wrote:
    >>
    >>> page migration requires the page to be on the LRU. That could be changed
    >>> if you have a different means of isolating a page from its page tables.
    >>>
    >>>

    >> Isn't rmap the means of isolating a page from its page tables? I guess I'm
    >> misunderstanding something.
    >>

    >
    > In order to migrate a page one first has to make sure that userspace and
    > the kernel cannot access the page in any way. User space must be made to
    > generate page faults and all kernel references must be accounted for and
    > not be in use.
    >
    > The user space portion involves changing the page tables so that faults
    > are generated.
    >
    > The kernel portion isolates the page from the LRU (to exempt it from
    > kernel reclaim handling etc).
    >
    >


    Thanks.

    > Only then can the page and its metadata be copied to a new location.
    >
    > Guess you already have the LRU portion done. So you just need the user
    > space isolation portion?
    >


    We don't want to limit all faults, just writes. So we write protect the
    page before merging.

    What do you mean by page metadata? obviously the dirty bit (Izik?),
    accessed bit and position in the LRU are desirable (the last is quite
    difficult for ksm -- the page occupied *two* positions in the LRU).

    --
    I have a truly marvellous patch that fixes the bug which this
    signature is too narrow to contain.

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  20. Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

    On Tue, 11 Nov 2008, Andrea Arcangeli wrote:

    > this page_count check done with only the tree_lock won't prevent a
    > task to start O_DIRECT after page_count has been read in the above line.
    >
    > If a thread starts O_DIRECT on the page, and the o_direct is still in
    > flight by the time you copy the page to the new page, the read will
    > not be represented fully in the newpage leading to userland data
    > corruption.


    O_DIRECT does not take a refcount on the page in order to prevent this?

    > > Define a regular VM page? A page on the LRU?

    >
    > Yes, pages owned, allocated and worked on by the VM. So they can be
    > swapped, collected, migrated etc... You can't possibly migrate a
    > device driver page for example and infact those device driver pages
    > can't be migrated either.


    Oh they could be migrated if you had a callback to the devices method for
    giving up references. Same as slab defrag.

    > The KSM page initially is a driver page, later we'd like to teach the
    > VM how to swap it by introducing rmap methods and adding it to the
    > LRU. As long as it's only anonymous memory that we're sharing/cloning,
    > we won't have to patch pagecache radix tree and other stuff. BTW, if
    > we ever decice to clone pagecache we could generate immense metadata
    > ram overhead in the radix tree with just a single page of data. All
    > issues that don't exist for anon ram.


    Seems that we are tinkering around with the concept of what an anonymous
    page is? Doesnt shmem have some means of converting pages to file backed?
    Swizzling?

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

+ Reply to Thread
Page 2 of 3 FirstFirst 1 2 3 LastLast