
Thread: [PATCH 0/4] ksm - dynamic page sharing driver for linux

  1. [PATCH 0/4] ksm - dynamic page sharing driver for linux

    KSM is a Linux driver that allows dynamically sharing identical memory
    pages between one or more processes.

    Unlike traditional page sharing, which is established when the memory
    is allocated, KSM does it dynamically after the memory has been created.
    Memory is periodically scanned; identical pages are identified and merged.
    The sharing is unnoticeable by the processes that use the memory.
    (The shared pages are marked read-only, and in case of a write,
    do_wp_page() takes care of creating a new copy of the page.)

    This driver is very useful for KVM, since when running multiple guest
    operating systems of the same type, many pages are sharable.
    The driver can be useful for OpenVZ as well.

    Right now KSM scans just the memory that was registered with it; it
    does not scan the whole system memory. (This can be changed, but the
    chances of finding identical pages on a normal Linux system that
    doesn't run multiple guests are lower.)

    KSM can run as a kernel thread or as a userspace application (or both;
    it is allowed to run more than one scanner at a time).
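
    A userspace scanner drives the same engine through a SCAN fd; here is
    a minimal sketch of that path (illustrative values only, error
    handling trimmed):

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include "ksm.h"

    int scan_once(void)
    {
            struct ksm_user_scan scan;
            int fd;
            int scan_fd;

            scan.pages_to_scan = 256;      /* example batch size */
            scan.max_pages_to_merge = 256; /* example merge limit */

            fd = open("/dev/ksm", O_RDWR | O_TRUNC, (mode_t)0600);
            if (fd == -1)
                    return -1;

            /* KSM_CREATE_SCAN returns a new fd that accepts KSM_SCAN */
            scan_fd = ioctl(fd, KSM_CREATE_SCAN);
            if (scan_fd == -1)
                    return -1;

            /* scan (and merge within) one batch of registered pages */
            return ioctl(scan_fd, KSM_SCAN, &scan);
    }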

    An example of how to control the kernel thread:


    ksmctl.c

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <errno.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/ioctl.h>
    #include "ksm.h"

    int main(int argc, char *argv[])
    {
            int fd;
            int used = 0;
            int fd_start;
            struct ksm_kthread_info info;

            if (argc < 2) {
                    fprintf(stderr, "usage: %s {start npages sleep | stop | info}\n",
                            argv[0]);
                    exit(1);
            }

            fd = open("/dev/ksm", O_RDWR | O_TRUNC, (mode_t)0600);
            if (fd == -1) {
                    fprintf(stderr, "could not open /dev/ksm\n");
                    exit(1);
            }

            if (!strncmp(argv[1], "start", strlen(argv[1]))) {
                    used = 1;
                    if (argc < 5) {
                            fprintf(stderr, "usage: %s start npages_to_scan ",
                                    argv[0]);
                            fprintf(stderr, "npages_max_merge sleep\n");
                            exit(1);
                    }
                    info.pages_to_scan = atoi(argv[2]);
                    info.max_pages_to_merge = atoi(argv[3]);
                    info.sleep = atoi(argv[4]);
                    info.running = 1;

                    fd_start = ioctl(fd, KSM_START_STOP_KTHREAD, &info);
                    if (fd_start == -1) {
                            fprintf(stderr, "KSM_START_STOP_KTHREAD failed\n");
                            exit(1);
                    }
                    printf("created scanner\n");
            }

            if (!strncmp(argv[1], "stop", strlen(argv[1]))) {
                    used = 1;
                    info.running = 0;
                    fd_start = ioctl(fd, KSM_START_STOP_KTHREAD, &info);
                    if (fd_start == -1) {
                            fprintf(stderr, "KSM_START_STOP_KTHREAD failed\n");
                            exit(1);
                    }
                    printf("stopped scanner\n");
            }

            if (!strncmp(argv[1], "info", strlen(argv[1]))) {
                    used = 1;
                    fd_start = ioctl(fd, KSM_GET_INFO_KTHREAD, &info);
                    if (fd_start == -1) {
                            fprintf(stderr, "KSM_GET_INFO_KTHREAD failed\n");
                            exit(1);
                    }
                    printf("running %d, pages_to_scan %d pages_max_merge %d ",
                           info.running, info.pages_to_scan,
                           info.max_pages_to_merge);
                    printf("sleep_time %d\n", info.sleep);
            }

            if (!used)
                    fprintf(stderr, "unknown command %s\n", argv[1]);

            return 0;
    }
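
    For example (arbitrary numbers): "ksmctl start 1000 10000 50" starts
    the kernel thread with pages_to_scan=1000, max_pages_to_merge=10000
    and a 50-microsecond sleep between scan cycles; "ksmctl stop" halts
    it, and "ksmctl info" prints the current settings.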


    An example of how to register qemu (or any userspace application) with ksm:

    diff --git a/qemu/vl.c b/qemu/vl.c
    index 4721fdd..7785bf9 100644
    --- a/qemu/vl.c
    +++ b/qemu/vl.c
    @@ -21,6 +21,7 @@
    * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
    * DEALINGS IN
    * THE SOFTWARE.
    */
    +#include "ksm.h"
    #include "hw/hw.h"
    #include "hw/boards.h"
    #include "hw/usb.h"
    @@ -5799,6 +5800,37 @@ static void termsig_setup(void)

    #endif

    +int ksm_register_memory(void)
    +{
    + int fd;
    + int ksm_fd;
    + int r = 1;
    + struct ksm_memory_region ksm_region;
    +
    + fd = open("/dev/ksm", O_RDWR | O_TRUNC, (mode_t)0600);
    + if (fd == -1)
    + goto out;
    +
    + ksm_fd = ioctl(fd, KSM_CREATE_SHARED_MEMORY_AREA);
    + if (ksm_fd == -1)
    + goto out_free;
    +
    + ksm_region.npages = phys_ram_size / TARGET_PAGE_SIZE;
    + ksm_region.addr = phys_ram_base;
    + r = ioctl(ksm_fd, KSM_REGISTER_MEMORY_REGION, &ksm_region);
    + if (r)
    + goto out_free1;
    +
    + return r;
    +
    +out_free1:
    + close(ksm_fd);
    +out_free:
    + close(fd);
    +out:
    + return r;
    +}
    +
    int main(int argc, char **argv)
    {
    #ifdef CONFIG_GDBSTUB
    @@ -6735,6 +6767,8 @@ int main(int argc, char **argv)
    /* init the dynamic translator */
    cpu_exec_init_all(tb_size * 1024 * 1024);

    + ksm_register_memory();
    +
    bdrv_init();

    /* we always create the cdrom drive, even if no disk is there */

  2. [PATCH 1/4] rmap: add page_wrprotect() function,

    From: Izik Eidus

    This function is useful for cases where you want to compare a page and
    know that its value won't change while you are comparing it.

    It works by walking over the whole rmap of the page and marking every
    pte related to the page as write-protected.

    The odirect_sync parameter is used to notify the caller of
    page_wrprotect() if one pte or more could not be marked read-only, in
    order to avoid a race with O_DIRECT.
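
    To illustrate the intended usage, here is a minimal sketch of a
    hypothetical caller (not part of this patch):

    /*
     * Hypothetical caller: write-protect both pages, then compare them
     * knowing neither can change under us.  If odirect_sync was cleared,
     * an extra reference (e.g. an in-flight O_DIRECT transfer) may still
     * write to a page, so the comparison cannot be trusted and we back off.
     */
    static int pages_identical(struct page *page1, struct page *page2)
    {
            int odirect_sync;
            char *addr1, *addr2;
            int r;

            if (!page_wrprotect(page1, &odirect_sync) || !odirect_sync)
                    return 0;
            if (!page_wrprotect(page2, &odirect_sync) || !odirect_sync)
                    return 0;

            addr1 = kmap_atomic(page1, KM_USER0);
            addr2 = kmap_atomic(page2, KM_USER1);
            r = !memcmp(addr1, addr2, PAGE_SIZE);
            kunmap_atomic(addr2, KM_USER1);
            kunmap_atomic(addr1, KM_USER0);

            return r;
    }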

    Signed-off-by: Izik Eidus
    ---
    include/linux/rmap.h | 7 ++++
    mm/rmap.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++
    2 files changed, 104 insertions(+), 0 deletions(-)

    diff --git a/include/linux/rmap.h b/include/linux/rmap.h
    index 89f0564..2a37fb7 100644
    --- a/include/linux/rmap.h
    +++ b/include/linux/rmap.h
    @@ -121,6 +121,8 @@ static inline int try_to_munlock(struct page *page)
    }
    #endif

    +int page_wrprotect(struct page *page, int *odirect_sync);
    +
    #else /* !CONFIG_MMU */

    #define anon_vma_init() do {} while (0)
    @@ -135,6 +137,11 @@ static inline int page_mkclean(struct page *page)
    return 0;
    }

    +static inline int page_wrprotect(struct page *page, int *odirect_sync)
    +{
    + return 0;
    +}
    +

    #endif /* CONFIG_MMU */

    diff --git a/mm/rmap.c b/mm/rmap.c
    index 1099394..3684edd 100644
    --- a/mm/rmap.c
    +++ b/mm/rmap.c
    @@ -576,6 +576,103 @@ int page_mkclean(struct page *page)
    }
    EXPORT_SYMBOL_GPL(page_mkclean);

    +static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma,
    + int *odirect_sync)
    +{
    + struct mm_struct *mm = vma->vm_mm;
    + unsigned long address;
    + pte_t *pte;
    + spinlock_t *ptl;
    + int ret = 0;
    +
    + address = vma_address(page, vma);
    + if (address == -EFAULT)
    + goto out;
    +
    + pte = page_check_address(page, mm, address, &ptl, 0);
    + if (!pte)
    + goto out;
    +
    + if (pte_write(*pte)) {
    + pte_t entry;
    +
    + if (page_mapcount(page) != page_count(page)) {
    + *odirect_sync = 0;
    + goto out_unlock;
    + }
    + flush_cache_page(vma, address, pte_pfn(*pte));
    + entry = ptep_clear_flush_notify(vma, address, pte);
    + entry = pte_wrprotect(entry);
    + set_pte_at(mm, address, pte, entry);
    + }
    + ret = 1;
    +
    +out_unlock:
    + pte_unmap_unlock(pte, ptl);
    +out:
    + return ret;
    +}
    +
    +static int page_wrprotect_file(struct page *page, int *odirect_sync)
    +{
    + struct address_space *mapping;
    + struct prio_tree_iter iter;
    + struct vm_area_struct *vma;
    + pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
    + int ret = 0;
    +
    + mapping = page_mapping(page);
    + if (!mapping)
    + return ret;
    +
    + spin_lock(&mapping->i_mmap_lock);
    +
    + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
    + ret += page_wrprotect_one(page, vma, odirect_sync);
    +
    + spin_unlock(&mapping->i_mmap_lock);
    +
    + return ret;
    +}
    +
    +static int page_wrprotect_anon(struct page *page, int *odirect_sync)
    +{
    + struct vm_area_struct *vma;
    + struct anon_vma *anon_vma;
    + int ret = 0;
    +
    + anon_vma = page_lock_anon_vma(page);
    + if (!anon_vma)
    + return ret;
    +
    + list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
    + ret += page_wrprotect_one(page, vma, odirect_sync);
    +
    + page_unlock_anon_vma(anon_vma);
    +
    + return ret;
    +}
    +
    +/**
    + * set all the ptes pointed to a page as read only,
    + * odirect_sync is set to 0 in case we cannot protect against race with odirect
    + * return the number of ptes that were set as read only
    + * (ptes that were read only before this function was called are counted as well)
    + */
    +int page_wrprotect(struct page *page, int *odirect_sync)
    +{
    + int ret = 0;
    +
    + *odirect_sync = 1;
    + if (PageAnon(page))
    + ret = page_wrprotect_anon(page, odirect_sync);
    + else
    + ret = page_wrprotect_file(page, odirect_sync);
    +
    + return ret;
    +}
    +EXPORT_SYMBOL(page_wrprotect);
    +
    /**
    * __page_set_anon_rmap - setup new anonymous rmap
    * @page: the page to add the mapping to
    --
    1.6.0.3


  3. Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

    On Tue, 11 Nov 2008 15:21:37 +0200 Izik Eidus wrote:

    > KSM is a Linux driver that allows dynamically sharing identical memory
    > pages between one or more processes.
    >
    > Unlike traditional page sharing, which is established when the memory
    > is allocated, KSM does it dynamically after the memory has been created.
    > Memory is periodically scanned; identical pages are identified and merged.
    > The sharing is unnoticeable by the processes that use the memory.
    > (The shared pages are marked read-only, and in case of a write,
    > do_wp_page() takes care of creating a new copy of the page.)
    >
    > This driver is very useful for KVM, since when running multiple guest
    > operating systems of the same type, many pages are sharable.
    > The driver can be useful for OpenVZ as well.


    These benefits should be quantified, please. Also any benefits to any
    other workloads should be identified and quantified.

    The whole approach seems wrong to me. The kernel lost track of these
    pages and then we run around post-facto trying to fix that up again.
    Please explain (for the changelog) why the kernel cannot get this right
    via the usual sharing, refcounting and COWing approaches.

  4. Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

    Andrew Morton wrote:
    > The whole approach seems wrong to me. The kernel lost track of these
    > pages and then we run around post-facto trying to fix that up again.
    > Please explain (for the changelog) why the kernel cannot get this right
    > via the usual sharing, refcounting and COWing approaches.
    >


    For kvm, the kernel never knew those pages were shared. They are loaded
    from independent (possibly compressed and encrypted) disk images. These
    images are different; but some pages happen to be the same because they
    came from the same installation media.

    For OpenVZ the situation is less clear, but if you allow users to
    independently upgrade their chroots you will eventually arrive at the
    same scenario (unless of course you apply the same merging strategy at
    the filesystem level).

    --
    I have a truly marvellous patch that fixes the bug which this
    signature is too narrow to contain.


  5. Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

    Andrew Morton wrote:
    > On Tue, 11 Nov 2008 15:21:37 +0200 Izik Eidus wrote:
    >
    >
    >> KSM is a Linux driver that allows dynamically sharing identical
    >> memory pages between one or more processes.
    >>
    >> Unlike traditional page sharing, which is established when the memory
    >> is allocated, KSM does it dynamically after the memory has been created.
    >> Memory is periodically scanned; identical pages are identified and merged.
    >> The sharing is unnoticeable by the processes that use the memory.
    >> (The shared pages are marked read-only, and in case of a write,
    >> do_wp_page() takes care of creating a new copy of the page.)
    >>
    >> This driver is very useful for KVM, since when running multiple guest
    >> operating systems of the same type, many pages are sharable.
    >> The driver can be useful for OpenVZ as well.
    >>

    >
    > These benefits should be quantified, please. Also any benefits to any
    > other workloads should be identified and quantified.
    >

    Sure.
    We have used KSM in production for about half a year, and the numbers
    that came from our QA are: using KSM for desktops (KSM was tested just
    for a Windows desktop workload) you can run as many as 52 Windows XP
    guests with 1 GB of RAM each on a server with just 16 GB of RAM (52 GB
    of guest RAM on a 16 GB host is more than 300% overcommit).
    The reason is that most of the kernel/DLL pages of these guests are
    shared, and in addition we share the Windows zero pages (Windows keeps
    zeroing all of its free memory, so every time Windows releases memory
    we take the page back to the host).
    There is a slide that gives these numbers; you can find it at:
    http://kvm.qumranet.com/kvmwiki/KvmF...=kdf2008_3.pdf
    (slide 27)
    Besides that, I gave a presentation about KSM that can be found at:
    http://kvm.qumranet.com/kvmwiki/KvmF...kdf2008_12.pdf

    If more numbers are wanted for other workloads, I can test them.
    (The idea of KSM is to run it slowly at low priority and let it merge
    pages when no one needs the CPU.)

  6. Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

    Avi Kivity wrote:
    > Andrew Morton wrote:
    >> The whole approach seems wrong to me. The kernel lost track of these
    >> pages and then we run around post-facto trying to fix that up again.
    >> Please explain (for the changelog) why the kernel cannot get this right
    >> via the usual sharing, refcounting and COWing approaches.
    >>

    >
    > For kvm, the kernel never knew those pages were shared. They are
    > loaded from independent (possibly compressed and encrypted) disk
    > images. These images are different; but some pages happen to be the
    > same because they came from the same installation media.


    As Avi said, in kvm we cannot know how the guest is going to map its
    pages; we have no choice but to scan for the identical pages.
    (You can have shared pages that live at completely different offsets
    inside each guest.)

    >
    > For OpenVZ the situation is less clear, but if you allow users to
    > independently upgrade their chroots you will eventually arrive at the
    > same scenario (unless of course you apply the same merging strategy at
    > the filesystem level).
    >



  7. Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

    On Tue, 11 Nov 2008 20:48:16 +0200
    Avi Kivity wrote:

    > Andrew Morton wrote:
    > > The whole approach seems wrong to me. The kernel lost track of these
    > > pages and then we run around post-facto trying to fix that up again.
    > > Please explain (for the changelog) why the kernel cannot get this right
    > > via the usual sharing, refcounting and COWing approaches.
    > >

    >
    > For kvm, the kernel never knew those pages were shared. They are loaded
    > from independent (possibly compressed and encrypted) disk images. These
    > images are different; but some pages happen to be the same because they
    > came from the same installation media.


    What userspace-only changes could fix this? Identify the common data,
    write it to a flat file and mmap it, something like that?

    > For OpenVZ the situation is less clear, but if you allow users to
    > independently upgrade their chroots you will eventually arrive at the
    > same scenario (unless of course you apply the same merging strategy at
    > the filesystem level).


    hm.

    There has been the occasional discussion about identifying all-zeroes
    pages and scavenging them, repointing them at the zero page. Could
    this infrastructure be used for that? (And how much would we gain from
    it?)

    [I'm looking for reasons why this is more than a muck-up-the-vm-for-kvm
    thing here ]

  8. Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

    Andrew Morton wrote:
    > On Tue, 11 Nov 2008 20:48:16 +0200
    > Avi Kivity wrote:
    >
    >
    >> Andrew Morton wrote:
    >>
    >>> The whole approach seems wrong to me. The kernel lost track of these
    >>> pages and then we run around post-facto trying to fix that up again.
    >>> Please explain (for the changelog) why the kernel cannot get this right
    >>> via the usual sharing, refcounting and COWing approaches.
    >>>
    >>>

    >> For kvm, the kernel never knew those pages were shared. They are loaded
    >> from independent (possibly compressed and encrypted) disk images. These
    >> images are different; but some pages happen to be the same because they
    >> came from the same installation media.
    >>

    >
    > What userspace-only changes could fix this? Identify the common data,
    > write it to a flat file and mmap it, something like that?
    >
    >
    >> For OpenVZ the situation is less clear, but if you allow users to
    >> independently upgrade their chroots you will eventually arrive at the
    >> same scenario (unless of course you apply the same merging strategy at
    >> the filesystem level).
    >>

    >
    > hm.
    >
    > There has been the occasional discussion about identifying all-zeroes
    > pages and scavenging them, repointing them at the zero page. Could
    > this infrastructure be used for that? (And how much would we gain from
    > it?)
    >
    > [I'm looking for reasons why this is more than a muck-up-the-vm-for-kvm
    > thing here ]

    KSM is a separate driver; it doesn't change anything in the VM apart
    from adding two helper functions.

  9. Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

    On Tue, 11 Nov 2008 21:07:10 +0200
    Izik Eidus wrote:

    > We have used KSM in production for about half a year, and the numbers
    > that came from our QA are: using KSM for desktops (KSM was tested
    > just for a Windows desktop workload) you can run as many as 52
    > Windows XP guests with 1 GB of RAM each on a server with just 16 GB
    > of RAM (more than 300% overcommit).
    > The reason is that most of the kernel/DLL pages of these guests are
    > shared, and in addition we share the Windows zero pages (Windows
    > keeps zeroing all of its free memory, so every time Windows releases
    > memory we take the page back to the host).
    > There is a slide that gives these numbers; you can find it at:
    > http://kvm.qumranet.com/kvmwiki/KvmF...=kdf2008_3.pdf
    > (slide 27)
    > Besides that, I gave a presentation about KSM that can be found at:
    > http://kvm.qumranet.com/kvmwiki/KvmF...kdf2008_12.pdf


    OK, 300% isn't chicken feed.

    It is quite important that information such as this be prepared, added to
    the patch changelogs and maintained. For a start, without this basic
    information, there is no reason for anyone to look at any of the code!

  10. Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

    Andrew Morton wrote:
    >> For kvm, the kernel never knew those pages were shared. They are loaded
    >> from independent (possibly compressed and encrypted) disk images. These
    >> images are different; but some pages happen to be the same because they
    >> came from the same installation media.
    >>

    >
    > What userspace-only changes could fix this? Identify the common data,
    > write it to a flat file and mmap it, something like that?
    >
    >


    This was considered. You can't scan the image, because it may be
    encrypted/compressed/offset (typical images _are_ offset because the
    first partition starts at sector 63...). The data may come from the
    network and not a disk image. You can't scan in userspace because the
    images belong to different users and contain sensitive data. Pages may
    come from several images (multiple disk images per guest), so you end
    up with one vma per page.

    So you have to scan memory after the guest has retrieved it from
    disk/network/manufactured it somehow, decompressed and decrypted it,
    and written it to the offset it wants. You can't scan from userspace
    since it's sensitive data, and of course the actual merging needs to
    be done atomically, which can only be done from the holy of holies,
    the vm.

    >> For OpenVZ the situation is less clear, but if you allow users to
    >> independently upgrade their chroots you will eventually arrive at the
    >> same scenario (unless of course you apply the same merging strategy at
    >> the filesystem level).
    >>

    >
    > hm.
    >
    > There has been the occasional discussion about identifying all-zeroes
    > pages and scavenging them, repointing them at the zero page. Could
    > this infrastructure be used for that?


    Yes, trivially. ksm may be overkill for this, though.

    > (And how much would we gain from
    > it?)
    >


    A lot of zeros.

    > [I'm looking for reasons why this is more than a muck-up-the-vm-for-kvm
    > thing here ]
    >


    I sympathize -- us too. Consider the typical multiuser gnome
    minicomputer with all 150 users reading lwn.net at the same time instead
    of working. You could share the firefox rendered page cache, reducing
    memory utilization drastically.

    --
    I have a truly marvellous patch that fixes the bug which this
    signature is too narrow to contain.


  11. Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

    On Tue, 11 Nov 2008 21:18:23 +0200
    Izik Eidus wrote:

    > > hm.
    > >
    > > There has been the occasional discussion about identifying all-zeroes
    > > pages and scavenging them, repointing them at the zero page. Could
    > > this infrastructure be used for that? (And how much would we gain from
    > > it?)
    > >
    > > [I'm looking for reasons why this is more than a muck-up-the-vm-for-kvm
    > > thing here ]


    ^^ this?

    > KSM is a separate driver; it doesn't change anything in the VM apart
    > from adding two helper functions.


    What, you mean I should actually read the code? Oh well, OK.


  12. Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

    On Tue, 11 Nov 2008 15:21:39 +0200
    Izik Eidus wrote:

    > From: Izik Eidus
    >
    > This function is needed in cases where you want to change the
    > userspace virtual mapping to point to a different physical page,


    Not sure that I understand that description. We want to replace a live
    page in an anonymous VMA with a different one?

    It looks that way.

    page migration already kinda does that. Is there common ground?

    > KSM needs this for merging identical pages.
    >
    > This function works by removing the old page from the rmap and
    > calling put_page() on it, and by setting the pte at the virtual
    > address to point to the new page.
    > (Note that the new page (the page that the pte is changed to map)
    > cannot be an anonymous page.)
    >


    I don't understand the restrictions on anonymous pages. Please expand
    the changelog so that reviewers can understand the reasons for this
    restriction.


    > ---
    > include/linux/mm.h | 3 ++
    > mm/memory.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++
    > 2 files changed, 71 insertions(+), 0 deletions(-)
    >
    > diff --git a/include/linux/mm.h b/include/linux/mm.h
    > index ffee2f7..4da7fa8 100644
    > --- a/include/linux/mm.h
    > +++ b/include/linux/mm.h
    > @@ -1207,6 +1207,9 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
    > int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
    > unsigned long pfn);
    >
    > +int replace_page(struct vm_area_struct *vma, struct page *oldpage,
    > + struct page *newpage, pte_t orig_pte, pgprot_t prot);
    > +
    > struct page *follow_page(struct vm_area_struct *, unsigned long address,
    > unsigned int foll_flags);
    > #define FOLL_WRITE 0x01 /* check pte is writable */
    > diff --git a/mm/memory.c b/mm/memory.c
    > index 164951c..b2c542c 100644
    > --- a/mm/memory.c
    > +++ b/mm/memory.c
    > @@ -1472,6 +1472,74 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
    > }
    > EXPORT_SYMBOL(vm_insert_mixed);
    >
    > +/**
    > + * replace _page - replace the pte mapping related to vm area between two pages


    s/replace _page/replace_page/

    > + * (from oldpage to newpage)
    > + * NOTE: you should take into consideration the impact on the VM when replacing
    > + * anonymous pages with kernel non swappable pages.
    > + */


    This _is_ a kerneldoc comment, but kerneldoc comments conventionally
    document the arguments and the return value also.

    > +int replace_page(struct vm_area_struct *vma, struct page *oldpage,
    > + struct page *newpage, pte_t orig_pte, pgprot_t prot)
    > +{
    > + struct mm_struct *mm = vma->vm_mm;
    > + pgd_t *pgd;
    > + pud_t *pud;
    > + pmd_t *pmd;
    > + pte_t *ptep;
    > + spinlock_t *ptl;
    > + unsigned long addr;
    > + int ret;
    > +
    > + BUG_ON(PageAnon(newpage));
    > +
    > + ret = -EFAULT;
    > + addr = page_address_in_vma(oldpage, vma);
    > + if (addr == -EFAULT)
    > + goto out;
    > +
    > + pgd = pgd_offset(mm, addr);
    > + if (!pgd_present(*pgd))
    > + goto out;
    > +
    > + pud = pud_offset(pgd, addr);
    > + if (!pud_present(*pud))
    > + goto out;
    > +
    > + pmd = pmd_offset(pud, addr);
    > + if (!pmd_present(*pmd))
    > + goto out;
    > +
    > + ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
    > + if (!ptep)
    > + goto out;
    > +
    > + if (!pte_same(*ptep, orig_pte)) {
    > + pte_unmap_unlock(ptep, ptl);
    > + goto out;
    > + }
    > +
    > + ret = 0;
    > + get_page(newpage);
    > + page_add_file_rmap(newpage);
    > +
    > + flush_cache_page(vma, addr, pte_pfn(*ptep));
    > + ptep_clear_flush(vma, addr, ptep);
    > + set_pte_at(mm, addr, ptep, mk_pte(newpage, prot));
    > +
    > + page_remove_rmap(oldpage, vma);
    > + if (PageAnon(oldpage)) {
    > + dec_mm_counter(mm, anon_rss);
    > + inc_mm_counter(mm, file_rss);
    > + }
    > + put_page(oldpage);
    > +
    > + pte_unmap_unlock(ptep, ptl);
    > +
    > +out:
    > + return ret;
    > +}
    > +EXPORT_SYMBOL(replace_page);


    Again, we could make the presence of this code selectable by subsystems
    which want it.


  13. Re: [PATCH 1/4] rmap: add page_wrprotect() function,

    On Tue, 11 Nov 2008 15:21:38 +0200
    Izik Eidus wrote:

    > From: Izik Eidus
    >
    > this function is useful for cases you want to compare page and know
    > that its value wont change during you compare it.
    >
    > this function is working by walking over the whole rmap of a page
    > and mark every pte related to the page as write_protect.
    >
    > the odirect_sync paramter is used to notify the caller of
    > page_wrprotect() if one pte or more was not marked readonly
    > in order to avoid race with odirect.


    The grammar is a bit mangled. Missing apostrophes. Sentences start
    with capital letters.

    Yeah, it's a nit, but we don't actually gain anything from deliberately
    mangling the language.

    >
    > ...
    >
    > --- a/mm/rmap.c
    > +++ b/mm/rmap.c
    > @@ -576,6 +576,103 @@ int page_mkclean(struct page *page)
    > }
    > EXPORT_SYMBOL_GPL(page_mkclean);
    >
    > +static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma,
    > + int *odirect_sync)
    > +{
    > + struct mm_struct *mm = vma->vm_mm;
    > + unsigned long address;
    > + pte_t *pte;
    > + spinlock_t *ptl;
    > + int ret = 0;
    > +
    > + address = vma_address(page, vma);
    > + if (address == -EFAULT)
    > + goto out;
    > +
    > + pte = page_check_address(page, mm, address, &ptl, 0);
    > + if (!pte)
    > + goto out;
    > +
    > + if (pte_write(*pte)) {
    > + pte_t entry;
    > +
    > + if (page_mapcount(page) != page_count(page)) {
    > + *odirect_sync = 0;
    > + goto out_unlock;
    > + }
    > + flush_cache_page(vma, address, pte_pfn(*pte));
    > + entry = ptep_clear_flush_notify(vma, address, pte);
    > + entry = pte_wrprotect(entry);
    > + set_pte_at(mm, address, pte, entry);
    > + }
    > + ret = 1;
    > +
    > +out_unlock:
    > + pte_unmap_unlock(pte, ptl);
    > +out:
    > + return ret;
    > +}


    OK. I think. We need to find a way of provoking Hugh to look at it.

    > +static int page_wrprotect_file(struct page *page, int *odirect_sync)
    > +{
    > + struct address_space *mapping;
    > + struct prio_tree_iter iter;
    > + struct vm_area_struct *vma;
    > + pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
    > + int ret = 0;
    > +
    > + mapping = page_mapping(page);


    What pins *mapping in memory? Usually this is done by requiring that
    the caller has locked the page. But no such precondition is documented
    here.

    > + if (!mapping)
    > + return ret;
    > +
    > + spin_lock(&mapping->i_mmap_lock);
    > +
    > + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
    > + ret += page_wrprotect_one(page, vma, odirect_sync);
    > +
    > + spin_unlock(&mapping->i_mmap_lock);
    > +
    > + return ret;
    > +}
    > +
    > +static int page_wrprotect_anon(struct page *page, int *odirect_sync)
    > +{
    > + struct vm_area_struct *vma;
    > + struct anon_vma *anon_vma;
    > + int ret = 0;
    > +
    > + anon_vma = page_lock_anon_vma(page);
    > + if (!anon_vma)
    > + return ret;
    > +
    > + list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
    > + ret += page_wrprotect_one(page, vma, odirect_sync);
    > +
    > + page_unlock_anon_vma(anon_vma);
    > +
    > + return ret;
    > +}
    > +
    > +/**


    This token means "this is a kerneldoc comment".

    > + * set all the ptes pointed to a page as read only,
    > + * odirect_sync is set to 0 in case we cannot protect against race with odirect
    > + * return the number of ptes that were set as read only
    > + * (ptes that were read only before this function was called are counted as well)
    > + */


    But it isn't.

    I don't understand this odirect_sync thing. What race? Please expand
    this comment to make the function of odirect_sync more understandable.

    > +int page_wrprotect(struct page *page, int *odirect_sync)
    > +{
    > + int ret = 0;
    > +
    > + *odirect_sync = 1;
    > + if (PageAnon(page))
    > + ret = page_wrprotect_anon(page, odirect_sync);
    > + else
    > + ret = page_wrprotect_file(page, odirect_sync);
    > +
    > + return ret;
    > +}
    > +EXPORT_SYMBOL(page_wrprotect);


    What do you think about making all this new code dependent upon some
    CONFIG_ switch which CONFIG_KVM can select?


  14. Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

    Hi Andrew,

    thanks for looking into this.

    On Tue, Nov 11, 2008 at 11:11:10AM -0800, Andrew Morton wrote:
    > What userspace-only changes could fix this? Identify the common data,
    > write it to a flat file and mmap it, something like that?


    The whole idea is to do something that works transparently and isn't
    tailored for kvm. The mmu notifier change_pte method can be dropped as
    well if you want (I recommended not having it in the first submission,
    but Izik preferred to keep it because it optimizes away a kvm shadow
    pte minor fault the first time kvm accesses the page after sharing
    it). page_wrprotect and replace_page can also be embedded in ksm.

    So the idea is that while we could do something specific to ksm that
    keeps most of the code in userland, it'd be more tricky, as it'd
    require some communication with the core VM anyway (we can't just do
    it in userland with mprotect, memcpy and mmap(MAP_PRIVATE): first, it
    wouldn't be atomic, and second, it'd be inefficient in terms of vma
    buildup, for the same reason nonlinear vmas exist). But most
    important: it wouldn't work for all other regular processes. With KSM
    we can share anonymous memory for the whole system; KVM is just a
    random user.

    This implementation is on the simple side because it can't swap.
    Swapping, and perhaps the limitation of sharing only anonymous memory,
    is the only trouble here, but those will be addressed in the future.
    ksm is a new device driver, so it's like /dev/mem; the lack of
    swapping isn't a blocker here.

    By sharing anon pages, in short, we're making anonymous vmas nonlinear,
    and this isn't supported by the current rmap code. So swapping can't
    work unless we mark those anon-vmas nonlinear and we either build the
    equivalent of the old pte_chains on demand just for those nonlinear
    shared pages, or we do a full scan of all ptes in the nonlinear
    anon-vmas. An external rmap infrastructure can allow ksm to build
    whatever it wants inside ksm.c to track the nonlinear anon pages
    inside a regular anon-vma, and rmap.c can invoke those methods to find
    the ptes for those nonlinear pages. The core VM won't get more
    complex, and ksm can decide whether to do a full nonlinear scan of the
    vma or to build the equivalent of pte_chains. This again has to be
    added later, and once everybody sees ksm, it'll be easier to agree on
    an external-rmap API to allow it to swap. While pte_chains are very
    inefficient for reversing regular anonymous mappings, they're an
    efficient solution as an exception for the shared KSM pages that get
    scattered over the linear anon-vmas.

    It's a bit like the initial kvm that was merged despite not being able
    to swap. Then we added mmu notifiers, and now kvm can swap. So we add
    ksm now without swapping, and later we build an external rmap to allow
    ksm to swap, after we agree ksm is useful and people start using it.

    > There has been the occasional discussion about identifying all-zeroes
    > pages and scavenging them, repointing them at the zero page. Could
    > this infrastructure be used for that? (And how much would we gain from
    > it?)


    Zero pages make a lot of difference for Windows, but they're totally
    useless for Linux. With current ksm all guest pagecache is 100% shared
    across guests, so when you start an app the .text runs on the same
    physical memory on both guests. This works fine and the code is quite
    simple in this round. Once we add swapping it'll be a bit more complex
    in VM terms, as it'll have to handle nonlinear anon-vmas.

    If we ever decide to share MAP_SHARED pagecache it'll be even more
    complicated than just adding the external rmap... I think this can be
    done incrementally, if needed at all. OpenVZ, if the install is smart
    enough, could share the pagecache by just hardlinking the equal
    binaries, but AFAIK they don't do that normally. For the anon ram they
    need this too; they can't solve equal anon ram in userland, as it has
    to be handled atomically at runtime.

    The folks at CERN LHC (I was visiting last month) badly need KSM too,
    for certain apps they're running that allocate huge (aligned) arrays
    in anon memory which are mostly equal across all processes. They tried
    to work around it with fork, but it's not working well; KSM would
    solve their problem (it'd solve it both on the same OS, and across
    OSes with kvm as the virtualization engine running on the same host).

    So I think this is good stuff, and I'd focus discussions and reviews
    on the /dev/ksm API: if merged, it will be longstanding and much more
    troublesome to change later than the rest of the code (if we change
    the ksm internals at any time, nobody will notice). Post-merging we
    can focus on the external rmap to make KSM pages first-class citizens
    in VM terms. But then anything can be changed here, so suggestions
    welcome!

    Thanks!

  15. Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

    Andrew Morton wrote:
    > On Tue, 11 Nov 2008 21:18:23 +0200
    > Izik Eidus wrote:
    >
    >
    >>> hm.
    >>>
    >>> There has been the occasional discussion about identifying all-zeroes
    >>> pages and scavenging them, repointing them at the zero page. Could
    >>> this infrastructure be used for that? (And how much would we gain from
    >>> it?)
    >>>
    >>> [I'm looking for reasons why this is more than a muck-up-the-vm-for-kvm
    >>> thing here ]
    >>>

    >
    > ^^ this?
    >
    >
    >> KSM is a separate driver; it doesn't change anything in the VM apart
    >> from adding two helper functions.
    >>

    >
    > What, you mean I should actually read the code? Oh well, OK.
    >

    Andrea, I think what is happening here is my fault.
    I will try to give much more information about KSM here.

    First the bad things:
    KSM shared pages are right now unswappable (we have a patch that can
    change this, but we want to wait with it).
    This means that the entire memory of the guest is swappable, except
    for the pages that are shared.
    (When the shared pages are split back by COW they become anonymous
    again with the help of do_wp_page().)
    The reason the pages are not swappable is the way the Linux rmap
    works: it does not allow us to create nonlinear anonymous pages.
    (We don't want to use a nonlinear vma for kvm, as it would make
    swapping for kvm very slow.)
    The reason ksm pages need a nonlinear reverse mapping is that an
    identical page can be found at a completely different offset in one
    guest than in another (from the userspace VM point of view).

    The rest is quite simple:
    KSM walks over the entire guest memory (or only some of it) and scans
    for identical pages using a hash table, then merges each set of
    identical pages into one single write-protected page.
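
    In outline, one scan cycle looks something like this (a sketch only:
    hash_lookup()/hash_insert() and pages_identical() are stand-in names
    for what ksm.c actually does, and locking, error handling and pte
    tracking are omitted):

    for (i = 0; i < slot->npages; i++) {
            unsigned long addr = slot->addr + i * PAGE_SIZE;
            pte_t orig_pte;     /* pte sampled for addr; tracking omitted */
            struct page *page;
            struct page *shared;

            page = follow_page(vma, addr, FOLL_GET);
            if (!page)
                    continue;
            shared = hash_lookup(page_hash_items, page);
            if (shared && pages_identical(page, shared))
                    /* point the pte at the already-shared page */
                    replace_page(vma, page, shared, orig_pte,
                                 vma->vm_page_prot);
            else
                    hash_insert(page_hash_items, page);
            put_page(page);
    }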

    Numbers for ksm are something I have just for desktops, and just the
    numbers I gave you. What I do know is:
    a big overcommit like 300% is possible only when you take into account
    that some of the guest memory will be free.
    We are mostly sharing the DLL / kernel / zero pages; the DLL and
    kernel pages will likely never break, but zero pages will break when
    Windows allocates them, and will come back when Windows frees the
    memory.
    (I wouldn't suggest 300% overcommit for server workloads, because you
    can end up swapping in that case, but for desktops, after running in
    production and passing some serious QA stress tests, it seems like
    300% is a real number that can be used.)

    I just ran a test on two Fedora 8 guests (using GNOME in both of them)
    and got these results:

    9959 root 15 0 730m 537m 281m S 8 3.4 0:44.28 kvm
    9956 root 15 0 730m 537m 246m S 4 3.4 0:41.43 kvm

    As you can see, the physical sharing was 281 MB and 246 MB (kernel
    pages are counted as shared).
    There is a small lie in these numbers: a page that was shared across
    the two guests and then split by a write from guest number 1 will
    still have one reference to it, and will still be a kernel page (until
    the other guest, number 2, writes to it as well).


    Anyway, I am willing to do much better testing, or whatever else is
    needed for these patches to be merged (just tell me what and I will
    do it).

    Besides that, you should know that patch 4 is not a must; it is just a
    nice optimization...

    thanks.


  16. Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

    Izik Eidus wrote:
    > Andrew Morton wrote:
    >> On Tue, 11 Nov 2008 21:18:23 +0200
    >> Izik Eidus wrote:
    >>
    >>
    >>>> hm.
    >>>>
    >>>> There has been the occasional discussion about identifying all-zeroes
    >>>> pages and scavenging them, repointing them at the zero page. Could
    >>>> this infrastructure be used for that? (And how much would we gain
    >>>> from
    >>>> it?)
    >>>>
    >>>> [I'm looking for reasons why this is more than a
    >>>> muck-up-the-vm-for-kvm
    >>>> thing here ]
    >>>>

    >>
    >> ^^ this?
    >>
    >>
    >>> KSM is a separate driver; it doesn't change anything in the VM
    >>> apart from adding two helper functions.
    >>>

    >>
    >> What, you mean I should actually read the code? Oh well, OK.
    >>

    > Andrea, I think what is happening here is my fault.

    Sorry, I meant to write Andrew here :-)
    > I will try to give much more information about KSM here.
    >
    > First the bad things:
    > KSM shared pages are right now unswappable (we have a patch that can
    > change this, but we want to wait with it).
    > This means that the entire memory of the guest is swappable, except
    > for the pages that are shared.
    > (When the shared pages are split back by COW they become anonymous
    > again with the help of do_wp_page().)
    > The reason the pages are not swappable is the way the Linux rmap
    > works: it does not allow us to create nonlinear anonymous pages.
    > (We don't want to use a nonlinear vma for kvm, as it would make
    > swapping for kvm very slow.)
    > The reason ksm pages need a nonlinear reverse mapping is that an
    > identical page can be found at a completely different offset in one
    > guest than in another (from the userspace VM point of view).
    >
    > The rest is quite simple:
    > KSM walks over the entire guest memory (or only some of it) and
    > scans for identical pages using a hash table, then merges each set
    > of identical pages into one single write-protected page.
    >
    > Numbers for ksm are something I have just for desktops, and just the
    > numbers I gave you. What I do know is:
    > a big overcommit like 300% is possible only when you take into
    > account that some of the guest memory will be free.
    > We are mostly sharing the DLL / kernel / zero pages; the DLL and
    > kernel pages will likely never break, but zero pages will break when
    > Windows allocates them, and will come back when Windows frees the
    > memory.
    > (I wouldn't suggest 300% overcommit for server workloads, because
    > you can end up swapping in that case, but for desktops, after
    > running in production and passing some serious QA stress tests, it
    > seems like 300% is a real number that can be used.)
    >
    > I just ran a test on two Fedora 8 guests (using GNOME in both of
    > them) and got these results:
    >
    > 9959 root 15 0 730m 537m 281m S 8 3.4 0:44.28 kvm
    > 9956 root 15 0 730m 537m 246m S 4 3.4 0:41.43 kvm
    >
    > As you can see, the physical sharing was 281 MB and 246 MB (kernel
    > pages are counted as shared).
    > There is a small lie in these numbers: a page that was shared across
    > the two guests and then split by a write from guest number 1 will
    > still have one reference to it, and will still be a kernel page
    > (until the other guest, number 2, writes to it as well).
    >
    > Anyway, I am willing to do much better testing, or whatever else is
    > needed for these patches to be merged (just tell me what and I will
    > do it).
    >
    > Besides that, you should know that patch 4 is not a must; it is just
    > a nice optimization...
    >
    > thanks.



  17. Re: [PATCH 1/4] rmap: add page_wrprotect() function,

    On Tue, Nov 11, 2008 at 11:39:48AM -0800, Andrew Morton wrote:
    > > +static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma,
    > > + int *odirect_sync)
    > > +{
    > > + struct mm_struct *mm = vma->vm_mm;
    > > + unsigned long address;
    > > + pte_t *pte;
    > > + spinlock_t *ptl;
    > > + int ret = 0;
    > > +
    > > + address = vma_address(page, vma);
    > > + if (address == -EFAULT)
    > > + goto out;
    > > +
    > > + pte = page_check_address(page, mm, address, &ptl, 0);
    > > + if (!pte)
    > > + goto out;
    > > +
    > > + if (pte_write(*pte)) {
    > > + pte_t entry;
    > > +
    > > + if (page_mapcount(page) != page_count(page)) {
    > > + *odirect_sync = 0;
    > > + goto out_unlock;
    > > + }
    > > + flush_cache_page(vma, address, pte_pfn(*pte));
    > > + entry = ptep_clear_flush_notify(vma, address, pte);
    > > + entry = pte_wrprotect(entry);
    > > + set_pte_at(mm, address, pte, entry);
    > > + }
    > > + ret = 1;
    > > +
    > > +out_unlock:
    > > + pte_unmap_unlock(pte, ptl);
    > > +out:
    > > + return ret;
    > > +}

    >
    > OK. I think. We need to find a way of provoking Hugh to look at it.


    Yes. Please focus on the page_mapcount != page_count check, which is
    likely missing from migrate.c too; in turn, page migration currently
    breaks O_DIRECT, just as fork() is buggy as well, as discussed here:

    http://marc.info/?l=linux-mm&m=122236799302540&w=2
    http://marc.info/?l=linux-mm&m=122524107519182&w=2
    http://marc.info/?l=linux-mm&m=122581116713932&w=2

    The fix implemented in ksm currently handles older kernels (like
    rhel/sles), not current mainline, which does get_user_pages_fast.
    get_user_pages_fast is unfixable yet (see my last email to Nick,
    linked above, asking for a way to block gup_fast).

    The fix proposed by Nick, plus my additional fix, should stop the
    corruption in fork the same way the above check fixes it for ksm. But
    to date gup_fast remains unfixable.

    > > +static int page_wrprotect_file(struct page *page, int *odirect_sync)
    > > +{
    > > + struct address_space *mapping;
    > > + struct prio_tree_iter iter;
    > > + struct vm_area_struct *vma;
    > > + pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
    > > + int ret = 0;
    > > +
    > > + mapping = page_mapping(page);

    >
    > What pins *mapping in memory? Usually this is done by requiring that
    > the caller has locked the page. But no such precondition is documented
    > here.


    Looks buggy, but we never call it from ksm 8). I guess Izik added it
    for completeness when preparing for mainline submission. We have the
    option to get rid of page_wrprotect_file entirely and only implement a
    page_wrprotect_anon! Otherwise we can add a BUG_ON(!PageLocked(page))
    before the above page_mapping call to protect against truncate.

    > > + * set all the ptes pointed to a page as read only,
    > > + * odirect_sync is set to 0 in case we cannot protect against race with odirect
    > > + * return the number of ptes that were set as read only
    > > + * (ptes that were read only before this function was called are counted as well)
    > > + */

    >
    > But it isn't.


    What isn't?

    > I don't understand this odirect_sync thing. What race? Please expand
    > this comment to make the function of odirect_sync more understandable.


    I should have answered this one with the above 3 links.

    > What do you think about making all this new code dependent upon some
    > CONFIG_ switch which CONFIG_KVM can select?


    I like that too.

  18. Re: [PATCH 3/4] add ksm kernel shared memory driver

    On Tue, 11 Nov 2008 15:21:40 +0200
    Izik Eidus wrote:

    > From: Izik Eidus
    >
    > ksm is a driver that allows merging identical pages between one or
    > more applications, in a way invisible to the applications that use it.
    > Pages that are merged are marked as read-only and are COWed when any
    > application tries to change them.
    >
    > ksm works by walking over the memory pages of the applications it
    > scans in order to find identical pages. It uses a hash table to find
    > the identical pages in an effective way.
    >
    > When ksm finds two identical pages, it marks them as read-only and
    > merges them into a single page. After that, Linux will treat these
    > pages as normal copy-on-write pages and will fork them when a write
    > access happens.
    >
    > ksm scans just the memory areas that were registered to be scanned by it.
    >


    This driver appears to implement a new userspace interface in /dev/ksm.

    Please fully document that interface in the changelog so that we can
    review your decisions here. This is by far the most important
    consideration - we can change all the code, but interfaces are for
    ever.

    > diff --git a/drivers/Kconfig b/drivers/Kconfig
    > index d38f43f..c1c701f 100644
    > --- a/drivers/Kconfig
    > +++ b/drivers/Kconfig
    > @@ -105,4 +105,9 @@ source "drivers/uio/Kconfig"
    > source "drivers/xen/Kconfig"
    >
    > source "drivers/staging/Kconfig"
    > +
    > +config KSM
    > + bool "KSM driver support"
    > + help
    > + ksm is driver for merging identical pages between applciations


    s/is/is a/

    "applications" is misspelt.

    > endmenu
    > diff --git a/include/linux/ksm.h b/include/linux/ksm.h
    > new file mode 100644
    > index 0000000..f873502
    > --- /dev/null
    > +++ b/include/linux/ksm.h
    > @@ -0,0 +1,53 @@
    > +#ifndef __LINUX_KSM_H
    > +#define __LINUX_KSM_H
    > +
    > +/*
    > + * Userspace interface for /dev/ksm - kvm shared memory
    > + */
    > +
    > +#include <linux/types.h>
    > +#include <linux/ioctl.h>
    > +
    > +#define KSM_API_VERSION 1
    > +
    > +/* for KSM_REGISTER_MEMORY_REGION */
    > +struct ksm_memory_region {
    > + __u32 npages; /* number of pages to share */
    > + __u32 pad;
    > + __u64 addr; /* the beginning of the virtual address range */
    > +};
    > +
    > +struct ksm_user_scan {
    > + __u32 pages_to_scan;
    > + __u32 max_pages_to_merge;
    > +};
    > +
    > +struct ksm_kthread_info {
    > + __u32 sleep; /* number of microseconds to sleep */
    > + __u32 pages_to_scan; /* number of pages to scan */
    > + __u32 max_pages_to_merge;
    > + __u32 running;
    > +};
    > +
    > +#define KSMIO 0xAB
    > +
    > +/* ioctls for /dev/ksm */
    > +#define KSM_GET_API_VERSION _IO(KSMIO, 0x00)
    > +#define KSM_CREATE_SHARED_MEMORY_AREA _IO(KSMIO, 0x01) /* return SMA fd */
    > +#define KSM_CREATE_SCAN _IO(KSMIO, 0x02) /* return SCAN fd */
    > +#define KSM_START_STOP_KTHREAD _IOW(KSMIO, 0x03,\
    > + struct ksm_kthread_info)
    > +#define KSM_GET_INFO_KTHREAD _IOW(KSMIO, 0x04,\
    > + struct ksm_kthread_info)
    > +
    > +
    > +/* ioctls for SMA fds */
    > +#define KSM_REGISTER_MEMORY_REGION _IOW(KSMIO, 0x20,\
    > + struct ksm_memory_region)
    > +#define KSM_REMOVE_MEMORY_REGION _IO(KSMIO, 0x21)
    > +
    > +/* ioctls for SCAN fds */
    > +#define KSM_SCAN _IOW(KSMIO, 0x40,\
    > + struct ksm_user_scan)
    > +
    > +#endif


    uh-oh, ioctls.

    Please do document that interface for us.

    > diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
    > index 26433ec..adc2435 100644
    > --- a/include/linux/miscdevice.h
    > +++ b/include/linux/miscdevice.h
    > @@ -30,6 +30,7 @@
    > #define TUN_MINOR 200
    > #define HPET_MINOR 228
    > #define KVM_MINOR 232
    > +#define KSM_MINOR 233
    >
    > struct device;
    >
    > diff --git a/mm/Kconfig b/mm/Kconfig
    > index 5b5790f..e7f0061 100644
    > --- a/mm/Kconfig
    > +++ b/mm/Kconfig
    > @@ -222,3 +222,6 @@ config UNEVICTABLE_LRU
    >
    > config MMU_NOTIFIER
    > bool
    > +
    > +config KSM
    > + bool
    > diff --git a/mm/Makefile b/mm/Makefile
    > index c06b45a..9722afe 100644
    > --- a/mm/Makefile
    > +++ b/mm/Makefile
    > @@ -26,6 +26,7 @@ obj-$(CONFIG_TMPFS_POSIX_ACL) += shmem_acl.o
    > obj-$(CONFIG_TINY_SHMEM) += tiny-shmem.o
    > obj-$(CONFIG_SLOB) += slob.o
    > obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
    > +obj-$(CONFIG_KSM) += ksm.o
    > obj-$(CONFIG_SLAB) += slab.o
    > obj-$(CONFIG_SLUB) += slub.o
    > obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
    > diff --git a/mm/ksm.c b/mm/ksm.c
    > new file mode 100644
    > index 0000000..977eb37
    > --- /dev/null
    > +++ b/mm/ksm.c
    > @@ -0,0 +1,1202 @@
    > +/*
    > + * Memory merging driver for Linux
    > + *
    > + * This module enables dynamic sharing of identical pages found in different
    > + * memory areas, even if they are not shared by fork()
    > + *
    > + * Copyright (C) 2008 Red Hat, Inc.
    > + * Authors:
    > + * Izik Eidus
    > + * Andrea Arcangeli
    > + * Chris Wright
    > + *
    > + * This work is licensed under the terms of the GNU GPL, version 2.
    > + */
    > +
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +
    > +#include
    > +
    > +MODULE_AUTHOR("Red Hat, Inc.");
    > +MODULE_LICENSE("GPL");
    > +
    > +static int page_hash_size;
    > +module_param(page_hash_size, int, 0);
    > +MODULE_PARM_DESC(page_hash_size, "Hash table size for the pages checksum");
    > +
    > +static int rmap_hash_size;
    > +module_param(rmap_hash_size, int, 0);
    > +MODULE_PARM_DESC(rmap_hash_size, "Hash table size for the reverse mapping");
    > +
    > +static int sha1_hash_size;
    > +module_param(sha1_hash_size, int, 0);
    > +MODULE_PARM_DESC(sha1_hash_size, "Hash table size for the sha1 caching");
    > +
    > +struct ksm_mem_slot {
    > + struct list_head link;
    > + struct list_head sma_link;
    > + struct mm_struct *mm;
    > + unsigned long addr; /* the begining of the virtual address */
    > + int npages; /* number of pages to share */
    > +};
    > +
    > +/*
    > + * sma - shared memory area, each process have its own sma that contain the
    > + * information about the slots that it own
    > + */
    > +struct ksm_sma {
    > + struct list_head sma_slots;
    > +};
    > +
    > +struct ksm_scan {
    > + struct ksm_mem_slot *slot_index; /* the slot we are scanning now */
    > + int page_index; /* the page inside sma that is now being scanned */
    > +};
    > +
    > +struct page_hash_item {
    > + struct hlist_node link;
    > + struct mm_struct *mm;
    > + unsigned long addr;
    > +};
    > +
    > +struct rmap_item {
    > + struct hlist_node link;
    > + struct page_hash_item *page_hash_item;
    > + unsigned long oldindex;
    > +};
    > +
    > +struct sha1_item {
    > + unsigned char sha1val[SHA1_DIGEST_SIZE];
    > + unsigned long pfn;
    > +};
    > +
    > +static struct list_head slots;


    Use LIST_HEAD(), remove the INIT_LIST_HEAD() from ksm_init().

    > +static struct rw_semaphore slots_lock;


    Use DECLARE_RWSEM(), remove the init_rwsem() from ksm_init().

    > +static DEFINE_MUTEX(sha1_lock);
    > +
    > +static int npages_hash;
    > +static struct hlist_head *page_hash_items;
    > +static int nrmaps_hash;
    > +static struct hlist_head *rmap_hash;
    > +static int nsha1s_hash;
    > +static struct sha1_item *sha1_hash;
    > +
    > +static struct kmem_cache *page_hash_item_cache;
    > +static struct kmem_cache *rmap_item_cache;
    > +
    > +static int kthread_sleep;
    > +static int kthread_pages_to_scan;
    > +static int kthread_max_npages;
    > +static struct ksm_scan kthread_ksm_scan;
    > +static int kthread_run;
    > +static struct task_struct *kthread;
    > +static wait_queue_head_t kthread_wait;


    initialise at compile-time.

    > +static struct rw_semaphore kthread_lock;


    initialise at compile-time.

    > +static struct crypto_hash *tfm;
    > +static unsigned char hmac_key[SHA1_DIGEST_SIZE];
    > +static DEFINE_MUTEX(tfm_mutex);
    > +
    > +static spinlock_t hash_lock;


    initialise at compile-time.

    I'd suggest that the above globals have a decent amount of
    documentation attached to them - describe what they're for, what is
    their role in life. Documenting the data structures and the
    relationships between them is an excellent way of helping readers to
    understand the overall code.
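    Concretely, the compile-time initialisations suggested above would look
    something like this (a sketch using the standard kernel macros):

    static LIST_HEAD(slots);
    static DECLARE_RWSEM(slots_lock);
    static DECLARE_RWSEM(kthread_lock);
    static DEFINE_SPINLOCK(hash_lock);
    static DECLARE_WAIT_QUEUE_HEAD(kthread_wait);

    The matching INIT_LIST_HEAD(), init_rwsem(), spin_lock_init() and
    init_waitqueue_head() calls can then be dropped from ksm_init().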

    > +static int ksm_tfm_init(void)
    > +{
    > + struct crypto_hash *hash;
    > + int rc = 0;
    > +
    > + mutex_lock(&tfm_mutex);
    > + if (tfm)
    > + goto out;
    > +
    > + /* Must be called from user context before starting any scanning */
    > + hash = crypto_alloc_hash("hmac(sha1)", 0, CRYPTO_ALG_ASYNC);
    > + if (IS_ERR(hash)) {
    > + rc = PTR_ERR(hash);
    > + goto out;
    > + }
    > +
    > + get_random_bytes(hmac_key, sizeof(hmac_key));
    > +
    > + rc = crypto_hash_setkey(hash, hmac_key, SHA1_DIGEST_SIZE);
    > + if (rc) {
    > + crypto_free_hash(hash);
    > + goto out;
    > + }
    > + tfm = hash;
    > +out:
    > + mutex_unlock(&tfm_mutex);
    > + return rc;
    > +}


    Did the Kconfig have the appropriate `depends on' lines for this?

    > +static int ksm_slab_init(void)
    > +{
    > + int ret = 1;
    > +
    > + page_hash_item_cache = kmem_cache_create("ksm_page_hash_item",
    > + sizeof(struct page_hash_item), 0, 0,
    > + NULL);
    > + if (!page_hash_item_cache)
    > + goto out;
    > +
    > + rmap_item_cache = kmem_cache_create("ksm_rmap_item",
    > + sizeof(struct rmap_item), 0, 0,
    > + NULL);
    > + if (!rmap_item_cache)
    > + goto out_free;


    You might be able to use KMEM_CACHE() here. I don't like KMEM_CACHE()
    much, but let's be consistent here. We change the arguments to
    kmem_cache_create() regularly, and KMEM_CACHE helps to reduce the
    churn.
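    With KMEM_CACHE() the two allocations would reduce to something like:

    page_hash_item_cache = KMEM_CACHE(page_hash_item, 0);
    rmap_item_cache = KMEM_CACHE(rmap_item, 0);

    (the macro derives the cache name, size and alignment from the struct
    itself, which is why it survives argument churn).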

    > + return 0;
    > +
    > +out_free:
    > + kmem_cache_destroy(page_hash_item_cache);
    > +out:
    > + return ret;
    > +}
    > +
    > +static void ksm_slab_free(void)
    > +{
    > + kmem_cache_destroy(rmap_item_cache);
    > + kmem_cache_destroy(page_hash_item_cache);
    > +}
    > +
    > +static struct page_hash_item *alloc_page_hash_item(void)
    > +{
    > + void *obj;
    > +
    > + obj = kmem_cache_zalloc(page_hash_item_cache, GFP_KERNEL);
    > + return (struct page_hash_item *)obj;


    Unneeded and undesirable cast of void*.

    > +}


    But this whole function can be reduced to

    static inline struct page_hash_item *alloc_page_hash_item(void)
    {
    return kmem_cache_zalloc(page_hash_item_cache, GFP_KERNEL);
    }

    > +static void free_page_hash_item(struct page_hash_item *page_hash_item)
    > +{
    > + kmem_cache_free(page_hash_item_cache, page_hash_item);
    > +}


    Suggest that this be inlined (the compiler might have decided to do
    that anyway, but it won't hurt to give gcc some help)

    > +static struct rmap_item *alloc_rmap_item(void)
    > +{
    > + void *obj;
    > +
    > + obj = kmem_cache_zalloc(rmap_item_cache, GFP_KERNEL);
    > + return (struct rmap_item *)obj;
    > +}


    static inline struct rmap_item *alloc_rmap_item(void)
    {
    return kmem_cache_zalloc(rmap_item_cache, GFP_KERNEL);
    }

    > +static void free_rmap_item(struct rmap_item *rmap_item)
    > +{
    > + kmem_cache_free(rmap_item_cache, rmap_item);
    > +}


    Inline this.

    > +static inline int PageKsm(struct page *page)
    > +{
    > + return !PageAnon(page);
    > +}


    What's going on here? Please fully document the _meaning_ of
    PageKsm(page) and then write something which helps readers understand
    how and why that identically maps onto !PageAnon.

    Please don't just toss stuff like this in there and expect people to
    somehow understand what it's doing.
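    For what it's worth, our reading of the patch (a guess, not the
    author's words) is roughly:

    /*
     * A KSM page is a driver-allocated page that replace_page() has
     * mapped in place of an anonymous page.  KSM only scans anonymous
     * memory, so within a registered region any pte that maps a
     * !PageAnon page can only have been installed by KSM itself.
     */
    static inline int PageKsm(struct page *page)
    {
            return !PageAnon(page);
    }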

    > +static int page_hash_init(void)
    > +{
    > + if (!page_hash_size) {
    > + struct sysinfo sinfo;
    > +
    > + si_meminfo(&sinfo);
    > + page_hash_size = sinfo.totalram;
    > + page_hash_size /= 40;
    > + }


    This will go wrong on large i386 highmem machines. The hash should be
    sized by the amount of lowmem, not by the total amount of memory.

    > + npages_hash = page_hash_size;
    > + page_hash_items = vmalloc(npages_hash * sizeof(struct page_hash_item));
    > + if (!page_hash_items)
    > + return 1;
    > +
    > + memset(page_hash_items, 0, npages_hash * sizeof(struct page_hash_item));
    > + return 0;
    > +}


    Review of this code would be considerably more productive if
    page_hash_items had had good documentation. As it stands, I'm left to
    vaguely guess at its function.

    > +static void page_hash_free(void)
    > +{
    > + int i;
    > + struct hlist_head *bucket;
    > + struct hlist_node *node, *n;
    > + struct page_hash_item *page_hash_item;
    > +
    > + for (i = 0; i < npages_hash; ++i) {
    > + bucket = &page_hash_items[i];
    > + hlist_for_each_entry_safe(page_hash_item, node, n, bucket, link) {
    > + hlist_del(&page_hash_item->link);
    > + free_page_hash_item(page_hash_item);
    > + }
    > + }
    > + vfree(page_hash_items);
    > +}
    > +
    > +static int rmap_hash_init(void)
    > +{
    > + if (!rmap_hash_size) {
    > + struct sysinfo sinfo;
    > +
    > + si_meminfo(&sinfo);
    > + rmap_hash_size = sinfo.totalram;
    > + rmap_hash_size /= 40;
    > + }
    > + nrmaps_hash = rmap_hash_size;
    > + rmap_hash = vmalloc(nrmaps_hash *
    > + sizeof(struct hlist_head));


    unneeded linebreak.

    > + if (!rmap_hash)
    > + return 1;
    > + memset(rmap_hash, 0, nrmaps_hash * sizeof(struct hlist_head));
    > + return 0;
    > +}


    There are more modern ways than si_meminfo() of getting the info you
    want. nr_free_buffer_pages() is one, and there are similar functions
    around there.
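    For example, sizing the hash by lowmem could look something like this
    (a sketch that keeps the existing one-bucket-per-40-pages ratio):

    if (!page_hash_size)
            page_hash_size = nr_free_buffer_pages() / 40;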

    > +static void rmap_hash_free(void)
    > +{
    > + int i;
    > + struct hlist_head *bucket;
    > + struct hlist_node *node, *n;
    > + struct rmap_item *rmap_item;
    > +
    > + for (i = 0; i < nrmaps_hash; ++i) {
    > + bucket = &rmap_hash[i];
    > + hlist_for_each_entry_safe(rmap_item, node, n, bucket, link) {
    > + hlist_del(&rmap_item->link);
    > + free_rmap_item(rmap_item);
    > + }
    > + }
    > + vfree(rmap_hash);
    > +}
    > +
    > +static int sha1_hash_init(void)
    > +{
    > + if (!sha1_hash_size) {
    > + struct sysinfo sinfo;
    > +
    > + si_meminfo(&sinfo);
    > + sha1_hash_size = sinfo.totalram;
    > + sha1_hash_size /= 128;
    > + }
    > + nsha1s_hash = sha1_hash_size;
    > + sha1_hash = vmalloc(nsha1s_hash *
    > + sizeof(struct sha1_item));


    unneeded linebreak.

    > + if (!sha1_hash)
    > + return 1;
    > + memset(sha1_hash, 0, nsha1s_hash * sizeof(struct sha1_item));
    > + return 0;
    > +}
    > +
    > +static void sha1_hash_free(void)
    > +{
    > + vfree(sha1_hash);
    > +}
    > +
    > +static inline u32 calc_hash_index(struct page *page)
    > +{
    > + u32 hash;
    > + void *addr = kmap_atomic(page, KM_USER0);
    > + hash = jhash(addr, PAGE_SIZE, 17);
    > + kunmap_atomic(addr, KM_USER0);
    > + return hash % npages_hash;
    > +}
    > +
    > +static void remove_page_from_hash(struct mm_struct *mm, unsigned long addr)
    > +{
    > + struct rmap_item *rmap_item;
    > + struct hlist_head *bucket;
    > + struct hlist_node *node, *n;
    > +
    > + bucket = &rmap_hash[addr % nrmaps_hash];
    > + hlist_for_each_entry_safe(rmap_item, node, n, bucket, link) {
    > + if (mm == rmap_item->page_hash_item->mm &&
    > + rmap_item->page_hash_item->addr == addr) {
    > + hlist_del(&rmap_item->page_hash_item->link);
    > + free_page_hash_item(rmap_item->page_hash_item);
    > + hlist_del(&rmap_item->link);
    > + free_rmap_item(rmap_item);
    > + return;
    > + }
    > + }
    > +}


    ah, so we hashed the pages by their contents.

    Avoiding races in here would be a challenge. Alas, we are given no
    description at all of how these problems have been solved.

    > +static int ksm_sma_ioctl_register_memory_region(struct ksm_sma *ksm_sma,
    > + struct ksm_memory_region *mem)
    > +{
    > + struct ksm_mem_slot *slot;
    > + int ret = -1;
    > +
    > + if (!current->mm)
    > + goto out;


    Why? Needs a comment.

    > + slot = kzalloc(sizeof(struct ksm_mem_slot), GFP_KERNEL);
    > + if (!slot)
    > + goto out;


    This duplicates the above `if (!current->mm)' test.

    > + slot->mm = get_task_mm(current);
    > + slot->addr = mem->addr;
    > + slot->npages = mem->npages;
    > +
    > + down_write(&slots_lock);
    > +
    > + list_add_tail(&slot->link, &slots);
    > + list_add_tail(&slot->sma_link, &ksm_sma->sma_slots);
    > +
    > + up_write(&slots_lock);
    > + ret = 0;
    > +out:
    > + return ret;
    > +}
    > +
    > +static void remove_mm_from_hash(struct mm_struct *mm)
    > +{
    > + struct ksm_mem_slot *slot;
    > + int pages_count = 0;
    > +
    > + list_for_each_entry(slot, &slots, link)
    > + if (slot->mm == mm)
    > + break;
    > + if (!slot)
    > + BUG();


    BUG_ON(!slot)

    > + spin_lock(&hash_lock);
    > + while (pages_count < slot->npages) {
    > + remove_page_from_hash(mm, slot->addr + pages_count * PAGE_SIZE);
    > + pages_count++;
    > + }
    > + spin_unlock(&hash_lock);
    > + list_del(&slot->link);
    > +}
    > +
    > +static int ksm_sma_ioctl_remove_memory_region(struct ksm_sma *ksm_sma)
    > +{
    > + struct ksm_mem_slot *slot, *node;
    > +
    > + down_write(&slots_lock);
    > + list_for_each_entry_safe(slot, node, &ksm_sma->sma_slots, sma_link) {
    > + remove_mm_from_hash(slot->mm);
    > + mmput(slot->mm);
    > + list_del(&slot->sma_link);
    > + kfree(slot);
    > + }
    > + up_write(&slots_lock);
    > + return 0;
    > +}
    > +
    > +static int ksm_sma_release(struct inode *inode, struct file *filp)
    > +{
    > + struct ksm_sma *ksm_sma = filp->private_data;
    > + int r;
    > +
    > + r = ksm_sma_ioctl_remove_memory_region(ksm_sma);
    > + kfree(ksm_sma);
    > + return r;
    > +}
    > +
    > +static long ksm_sma_ioctl(struct file *filp,
    > + unsigned int ioctl, unsigned long arg)
    > +{
    > + struct ksm_sma *sma = filp->private_data;
    > + void __user *argp = (void __user *)arg;
    > + int r = EINVAL;
    > +
    > + switch (ioctl) {
    > + case KSM_REGISTER_MEMORY_REGION: {
    > + struct ksm_memory_region ksm_memory_region;
    > +
    > + r = -EFAULT;
    > + if (copy_from_user(&ksm_memory_region, argp,
    > + sizeof ksm_memory_region))
    > + goto out;
    > + r = ksm_sma_ioctl_register_memory_region(sma,
    > + &ksm_memory_region);
    > + break;
    > + }
    > + case KSM_REMOVE_MEMORY_REGION:
    > + r = ksm_sma_ioctl_remove_memory_region(sma);
    > + break;
    > + }
    > +
    > +out:
    > + return r;
    > +}
    > +
    > +static int insert_page_to_hash(struct ksm_scan *ksm_scan,
    > + unsigned long hash_index,
    > + struct page_hash_item *page_hash_item,
    > + struct rmap_item *rmap_item)
    > +{
    > + struct ksm_mem_slot *slot;
    > + struct hlist_head *bucket;
    > +
    > + slot = ksm_scan->slot_index;
    > + page_hash_item->addr = slot->addr + ksm_scan->page_index * PAGE_SIZE;
    > + page_hash_item->mm = slot->mm;
    > + bucket = &page_hash_items[hash_index];
    > + hlist_add_head(&page_hash_item->link, bucket);
    > +
    > + rmap_item->page_hash_item = page_hash_item;
    > + rmap_item->oldindex = hash_index;
    > + bucket = &rmap_hash[page_hash_item->addr % nrmaps_hash];
    > + hlist_add_head(&rmap_item->link, bucket);
    > + return 0;
    > +}
    > +
    > +static void update_hash(struct ksm_scan *ksm_scan,
    > + unsigned long hash_index)
    > +{
    > + struct rmap_item *rmap_item;
    > + struct ksm_mem_slot *slot;
    > + struct hlist_head *bucket;
    > + struct hlist_node *node, *n;
    > + unsigned long addr;
    > +
    > + slot = ksm_scan->slot_index;;


    double semicolon

    > + addr = slot->addr + ksm_scan->page_index * PAGE_SIZE;
    > + bucket = &rmap_hash[addr % nrmaps_hash];
    > + hlist_for_each_entry_safe(rmap_item, node, n, bucket, link) {
    > + if (slot->mm == rmap_item->page_hash_item->mm &&
    > + rmap_item->page_hash_item->addr == addr) {
    > + if (hash_index != rmap_item->oldindex) {
    > + hlist_del(&rmap_item->page_hash_item->link);
    > + free_page_hash_item(rmap_item->page_hash_item);
    > + hlist_del(&rmap_item->link);
    > + free_rmap_item(rmap_item);
    > + }
    > + return;
    > + }
    > + }
    > +}
    > +
    > +static unsigned long addr_in_vma(struct vm_area_struct *vma, struct page *page)
    > +{
    > + pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
    > + unsigned long addr;
    > +
    > + addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
    > + if (unlikely(addr < vma->vm_start || addr >= vma->vm_end))
    > + return -EFAULT;
    > + return addr;
    > +}
    > +
    > +static pte_t *get_pte(struct mm_struct *mm, unsigned long addr)
    > +{
    > + pgd_t *pgd;
    > + pud_t *pud;
    > + pmd_t *pmd;
    > + pte_t *ptep = NULL;
    > +
    > + pgd = pgd_offset(mm, addr);
    > + if (!pgd_present(*pgd))
    > + goto out;
    > +
    > + pud = pud_offset(pgd, addr);
    > + if (!pud_present(*pud))
    > + goto out;
    > +
    > + pmd = pmd_offset(pud, addr);
    > + if (!pmd_present(*pmd))
    > + goto out;
    > +
    > + ptep = pte_offset_map(pmd, addr);
    > +out:
    > + return ptep;
    > +}
    > +
    > +static int is_present_pte(struct mm_struct *mm, unsigned long addr)
    > +{
    > + pte_t *ptep;
    > +
    > + ptep = get_pte(mm, addr);
    > + if (!ptep)
    > + return 0;
    > +
    > + if (pte_present(*ptep))
    > + return 1;
    > + return 0;
    > +}


    I'm all lost now. Are we dealing with only anonymous vma's here? Or
    can they be file-backed as well or what?

    A meta-question is: why did I get lost reading this code?

    > +#define PAGECMP_OFFSET 128
    > +#define PAGEHASH_SIZE (PAGECMP_OFFSET ? PAGECMP_OFFSET : PAGE_SIZE)
    > +/* hash the page */
    > +static void page_hash(struct page *page, unsigned char *digest)
    > +{
    > + struct scatterlist sg;
    > + struct hash_desc desc;
    > +
    > + sg_init_table(&sg, 1);
    > + sg_set_page(&sg, page, PAGEHASH_SIZE, 0);
    > + desc.tfm = tfm;
    > + desc.flags = 0;
    > + crypto_hash_digest(&desc, &sg, PAGEHASH_SIZE, digest);
    > +}
    > +
    > +/* pages_identical
    > + * calculate sha1 hash of each page, compare results,
    > + * and return 1 if identical, 0 otherwise.
    > + */


    ooh, a comment!

    > +static int pages_identical(struct page *oldpage, struct page *newpage, int new)
    > +{
    > + int r;
    > + unsigned char old_digest[SHA1_DIGEST_SIZE];
    > + struct sha1_item *sha1_item;
    > +
    > + page_hash(oldpage, old_digest);
    > + /*
    > + * If new = 1, it is never safe to use the sha1 value that is
    > + * inside the cache, the reason is that the page can be released
    > + * and then recreated and have diffrent sha1 value.
    > + * (swapping as for now is not an issue, beacuse KsmPages cannot be


    "because"

    > + * swapped)
    > + */
    > + if (new) {
    > + mutex_lock(&sha1_lock);
    > + sha1_item = &sha1_hash[page_to_pfn(newpage) % nsha1s_hash];
    > + page_hash(newpage, sha1_item->sha1val);
    > + sha1_item->pfn = page_to_pfn(newpage);
    > + r = !memcmp(old_digest, sha1_item->sha1val, SHA1_DIGEST_SIZE);
    > + mutex_unlock(&sha1_lock);
    > + } else {
    > + mutex_lock(&sha1_lock);
    > + sha1_item = &sha1_hash[page_to_pfn(newpage) % nsha1s_hash];
    > + if (sha1_item->pfn != page_to_pfn(newpage)) {
    > + page_hash(newpage, sha1_item->sha1val);
    > + sha1_item->pfn = page_to_pfn(newpage);
    > + }
    > + r = !memcmp(old_digest, sha1_item->sha1val, SHA1_DIGEST_SIZE);
    > + mutex_unlock(&sha1_lock);
    > + }
    > + if (PAGECMP_OFFSET && r) {
    > + char *old_addr, *new_addr;
    > + old_addr = kmap_atomic(oldpage, KM_USER0);
    > + new_addr = kmap_atomic(newpage, KM_USER1);
    > + r = !memcmp(old_addr+PAGECMP_OFFSET, new_addr+PAGECMP_OFFSET, PAGE_SIZE-PAGECMP_OFFSET);
    > + kunmap_atomic(old_addr, KM_USER0);
    > + kunmap_atomic(new_addr, KM_USER1);
    > + }
    > + return r;
    > +}
    > +
    > +/*
    > + * try_to_merge_one_page - take two pages and merge them into one
    > + * note:
    > + * oldpage should be anon page while newpage should be file mapped page
    > + */


    The function's name implies that the function can fail. But the
    function's description implies that it always succeeds.

    > +static int try_to_merge_one_page(struct mm_struct *mm,
    > + struct vm_area_struct *vma,
    > + struct page *oldpage,
    > + struct page *newpage,
    > + pgprot_t newprot,
    > + int new)
    > +{
    > + int ret = 0;
    > + int odirect_sync;
    > + unsigned long page_addr_in_vma;
    > + pte_t orig_pte, *orig_ptep;
    > +
    > + get_page(newpage);
    > + get_page(oldpage);
    > +
    > + down_read(&mm->mmap_sem);
    > +
    > + page_addr_in_vma = addr_in_vma(vma, oldpage);
    > + if (page_addr_in_vma == -EFAULT)
    > + goto out_unlock;
    > +
    > + orig_ptep = get_pte(mm, page_addr_in_vma);
    > + if (!orig_ptep)
    > + goto out_unlock;
    > + orig_pte = *orig_ptep;
    > + if (!pte_present(orig_pte))
    > + goto out_unlock;
    > + if (page_to_pfn(oldpage) != pte_pfn(orig_pte))
    > + goto out_unlock;
    > + /*
    > + * page_wrprotect check if the page is swapped or in swap cache,
    > + * in the future we might want to run here if_present_pte and then
    > + * swap_free
    > + */
    > + if (!page_wrprotect(oldpage, &odirect_sync))
    > + goto out_unlock;
    > + if (!odirect_sync)
    > + goto out_unlock;
    > +
    > + orig_pte = pte_wrprotect(orig_pte);
    > +
    > + ret = 1;
    > + if (pages_identical(oldpage, newpage, new))
    > + ret = replace_page(vma, oldpage, newpage, orig_pte, newprot);
    > +
    > + if (!ret)
    > + ret = 1;
    > + else
    > + ret = 0;


    ret = !ret?

    > +out_unlock:
    > + up_read(&mm->mmap_sem);
    > + put_page(oldpage);
    > + put_page(newpage);
    > + return ret;
    > +}


    By now I've lost track of what the return value of this function is on
    success versus failure and what the semantic meaning of "failure" is.
    And none of this was documented.

    It is usual for kernel functions to return a -ve errno on failure.

    > +static int try_to_merge_two_pages(struct mm_struct *mm1, struct page *page1,
    > + struct mm_struct *mm2, struct page *page2,
    > + unsigned long addr1, unsigned long addr2)
    > +{
    > + struct vm_area_struct *vma;
    > + pgprot_t prot;
    > + int ret = 0;
    > +
    > + /*
    > + * If page2 isn't shared (it isn't PageKsm) we have to allocate a new
    > + * file mapped page and make the two ptes of mm1(page1) and mm2(page2)
    > + * point to it. If page2 is shared, we can just make the pte of
    > + * mm1(page1) point to page2
    > + */
    > + if (PageKsm(page2)) {
    > + vma = find_vma(mm1, addr1);
    > + if (!vma)
    > + return ret;
    > + prot = vma->vm_page_prot;
    > + pgprot_val(prot) &= ~VM_WRITE;
    > + ret = try_to_merge_one_page(mm1, vma, page1, page2, prot, 0);
    > + } else {
    > + struct page *kpage;
    > +
    > + kpage = alloc_page(GFP_KERNEL | __GFP_HIGHMEM);


    Stray whitespace.

    Replace with GFP_HIGHUSER.

    > + if (!kpage)
    > + return ret;


    hm, silent failure. What are the consequences of this? (Would prefer
    that this question be answered via code comment!)
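    From the code, the consequence appears to be that the merge is simply
    skipped, so a comment along these lines (our inference, not the
    author's) would answer the question in place:

    /*
     * Allocation failure just means this merge opportunity is skipped;
     * both pages stay unmerged and a later scan pass can retry them.
     */
    if (!kpage)
            return ret;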

    > + vma = find_vma(mm1, addr1);
    > + if (!vma) {
    > + page_cache_release(kpage);


    put_page() is the preferred way of releasing non-pagecache pages.

    > + return ret;
    > + }
    > + prot = vma->vm_page_prot;
    > + pgprot_val(prot) &= ~VM_WRITE;
    > +
    > + copy_highpage(kpage, page1);


    Are you sure we didn't need copy_user_highpage()?

    > + ret = try_to_merge_one_page(mm1, vma, page1, kpage, prot, 1);
    > +
    > + if (ret) {
    > + vma = find_vma(mm2, addr2);
    > + if (!vma) {
    > + page_cache_release(kpage);
    > + return ret;
    > + }
    > +
    > + prot = vma->vm_page_prot;
    > + pgprot_val(prot) &= ~VM_WRITE;
    > +
    > + ret = try_to_merge_one_page(mm2, vma, page2, kpage,
    > + prot, 0);
    > + }
    > + page_cache_release(kpage);


    put_page(), please.

    > + }
    > + return ret;


    please document the return value.

    > +}
    > +
    > +static int cmp_and_merge_page(struct ksm_scan *ksm_scan, struct page *page)
    > +{
    > + struct hlist_head *bucket;
    > + struct hlist_node *node, *n;
    > + struct page_hash_item *page_hash_item;
    > + struct ksm_mem_slot *slot;
    > + unsigned long hash_index;
    > + unsigned long addr;
    > + int ret = 0;
    > + int used = 0;
    > +
    > + hash_index = calc_hash_index(page);
    > + bucket = &page_hash_items[hash_index];
    > +
    > + slot = ksm_scan->slot_index;
    > + addr = slot->addr + ksm_scan->page_index * PAGE_SIZE;


    Hang on, ->page_index is an `int'. This code will blow up at 16TB (and
    maybe even 8TB, when things go negative): with 4kB pages, 2^31 pages is
    8TB, so a signed int index wraps there and an unsigned one at 16TB.
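    A sketch of the obvious fix (our assumption about intent, not the
    author's eventual change):

    struct ksm_scan {
            struct ksm_mem_slot *slot_index; /* the slot we are scanning now */
            unsigned long page_index; /* page inside the slot being scanned */
    };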

    > + spin_lock(&hash_lock);
    > + /*
    > + * update_hash must be called every time because there is a chance
    > + * that the data in the page has changed since the page was inserted
    > + * into the hash table to avoid inserting the page more than once.
    > + */
    > + update_hash(ksm_scan, hash_index);
    > + spin_unlock(&hash_lock);
    > +
    > + hlist_for_each_entry_safe(page_hash_item, node, n, bucket, link) {
    > + int npages;
    > + struct page *hash_page[1];
    > +
    > + if (slot->mm == page_hash_item->mm && addr == page_hash_item->addr) {
    > + used = 1;
    > + continue;
    > + }
    > +
    > + down_read(&page_hash_item->mm->mmap_sem);
    > + /*
    > + * If the page is swapped out or in swap cache we don't want to
    > + * scan it (it is just for performance).
    > + */
    > + if (!is_present_pte(page_hash_item->mm, page_hash_item->addr)) {
    > + up_read(&page_hash_item->mm->mmap_sem);
    > + continue;
    > + }
    > + npages = get_user_pages(current, page_hash_item->mm,
    > + page_hash_item->addr,
    > + 1, 0, 0, hash_page, NULL);
    > + up_read(&page_hash_item->mm->mmap_sem);
    > + if (npages != 1)
    > + break;
    > +
    > + /* Recalculate the page's hash index in case it has changed. */
    > + if (calc_hash_index(hash_page[0]) == hash_index) {
    > +
    > + ret = try_to_merge_two_pages(slot->mm, page,
    > + page_hash_item->mm,
    > + hash_page[0], addr,
    > + page_hash_item->addr);
    > + if (ret) {
    > + page_cache_release(hash_page[0]);
    > + return ret;
    > + }
    > + }
    > + page_cache_release(hash_page[0]);
    > + }
    > + /* If node is NULL and used=0, the page is not in the hash table. */
    > + if (!node && !used) {
    > + struct page_hash_item *page_hash_item;
    > + struct rmap_item *rmap_item;
    > +
    > + page_hash_item = alloc_page_hash_item();
    > + if (!page_hash_item)
    > + return ret;
    > +
    > + rmap_item = alloc_rmap_item();
    > + if (!rmap_item) {
    > + free_page_hash_item(page_hash_item);
    > + return ret;
    > + }
    > +
    > + spin_lock(&hash_lock);
    > + update_hash(ksm_scan, hash_index);
    > + insert_page_to_hash(ksm_scan, hash_index, page_hash_item, rmap_item);
    > + spin_unlock(&hash_lock);
    > + }
    > +
    > + return ret;
    > +}
    > +
    > +/* return 1 - no slots registered, nothing to be done */
    > +static int scan_get_next_index(struct ksm_scan *ksm_scan, int nscan)
    > +{
    > + struct ksm_mem_slot *slot;
    > +
    > + if (list_empty(&slots))
    > + return 1;
    > +
    > + slot = ksm_scan->slot_index;
    > +
    > + /* Are there pages left in this slot to scan? */
    > + if ((slot->npages - ksm_scan->page_index - nscan) > 0) {
    > + ksm_scan->page_index += nscan;
    > + return 0;
    > + }
    > +
    > + list_for_each_entry_from(slot, &slots, link) {
    > + if (slot == ksm_scan->slot_index)
    > + continue;
    > + ksm_scan->page_index = 0;
    > + ksm_scan->slot_index = slot;
    > + return 0;
    > + }
    > +
    > + /* look like we finished scanning the whole memory, starting again */
    > + ksm_scan->page_index = 0;
    > + ksm_scan->slot_index = list_first_entry(&slots,
    > + struct ksm_mem_slot, link);
    > + return 0;
    > +}
    > +
    > +/*
    > + * update slot_index so it point to vaild data, it is possible that by
    > + * the time we are here the data that ksm_scan was pointed to was released
    > + * so we have to call this function every time after taking the slots_lock
    > + */
    > +static void scan_update_old_index(struct ksm_scan *ksm_scan)
    > +{
    > + struct ksm_mem_slot *slot;
    > +
    > + if (list_empty(&slots))
    > + return;
    > +
    > + list_for_each_entry(slot, &slots, link) {
    > + if (ksm_scan->slot_index == slot)
    > + return;
    > + }
    > +
    > + ksm_scan->slot_index = list_first_entry(&slots,
    > + struct ksm_mem_slot, link);
    > + ksm_scan->page_index = 0;
    > +}
    > +
    > +static int ksm_scan_start(struct ksm_scan *ksm_scan, int scan_npages,
    > + int max_npages)
    > +{
    > + struct ksm_mem_slot *slot;
    > + struct page *page[1];
    > + int val;
    > + int ret = 0;
    > +
    > + down_read(&slots_lock);
    > +
    > + scan_update_old_index(ksm_scan);
    > +
    > + while (scan_npages > 0 && max_npages > 0) {
    > + if (scan_get_next_index(ksm_scan, 1)) {
    > + /* we have no slots, another ret value should be used */
    > + goto out;
    > + }
    > +
    > + slot = ksm_scan->slot_index;
    > + down_read(&slot->mm->mmap_sem);
    > + /*
    > + * If the page is swapped out or in swap cache, we don't want to
    > + * scan it (it is just for performance).
    > + */
    > + if (is_present_pte(slot->mm, slot->addr +
    > + ksm_scan->page_index * PAGE_SIZE)) {
    > + val = get_user_pages(current, slot->mm, slot->addr +
    > + ksm_scan->page_index * PAGE_SIZE ,
    > + 1, 0, 0, page, NULL);
    > + up_read(&slot->mm->mmap_sem);
    > + if (val == 1) {
    > + if (!PageKsm(page[0]))
    > + max_npages -=
    > + cmp_and_merge_page(ksm_scan, page[0]);
    > + page_cache_release(page[0]);
    > + }
    > + } else
    > + up_read(&slot->mm->mmap_sem);
    > + scan_npages--;
    > + }
    > +
    > + scan_get_next_index(ksm_scan, 1);
    > +out:
    > + up_read(&slots_lock);
    > + return ret;
    > +}


    I'm rather lost now. The code sorely needs documentation.

    > +static int ksm_scan_ioctl_start(struct ksm_scan *ksm_scan,
    > + struct ksm_user_scan *scan)
    > +{
    > + return ksm_scan_start(ksm_scan, scan->pages_to_scan,
    > + scan->max_pages_to_merge);
    > +}
    > +
    > +static int ksm_scan_release(struct inode *inode, struct file *filp)
    > +{
    > + struct ksm_scan *ksm_scan = filp->private_data;
    > +
    > + kfree(ksm_scan);
    > + return 0;
    > +}
    > +
    > +static long ksm_scan_ioctl(struct file *filp,
    > + unsigned int ioctl, unsigned long arg)
    > +{
    > + struct ksm_scan *ksm_scan = filp->private_data;
    > + void __user *argp = (void __user *)arg;
    > + int r = EINVAL;
    > +
    > + switch (ioctl) {
    > + case KSM_SCAN: {
    > + struct ksm_user_scan scan;
    > +
    > + r = -EFAULT;
    > + if (copy_from_user(&scan, argp,
    > + sizeof(struct ksm_user_scan)))
    > + break;
    > +
    > + r = ksm_scan_ioctl_start(ksm_scan, &scan);
    > + }
    > + }
    > + return r;
    > +}
    > +
    > +static struct file_operations ksm_sma_fops = {
    > + .release = ksm_sma_release,
    > + .unlocked_ioctl = ksm_sma_ioctl,
    > + .compat_ioctl = ksm_sma_ioctl,
    > +};
    > +
    > +static int ksm_dev_ioctl_create_shared_memory_area(void)
    > +{
    > + int fd = -1;
    > + struct ksm_sma *ksm_sma;
    > +
    > + ksm_sma = kmalloc(sizeof(struct ksm_sma), GFP_KERNEL);
    > + if (!ksm_sma)
    > + goto out;
    > +
    > + INIT_LIST_HEAD(&ksm_sma->sma_slots);
    > +
    > + fd = anon_inode_getfd("ksm-sma", &ksm_sma_fops, ksm_sma, 0);
    > + if (fd < 0)
    > + goto out_free;
    > +
    > + return fd;
    > +out_free:
    > + kfree(ksm_sma);
    > +out:
    > + return fd;
    > +}


    The term "shared memory" has specific meanings in Linux/Unix. I'm
    suspecting that its use here was inappropriate but I have insufficient
    information to be able to suggest alternatives.

    > +static struct file_operations ksm_scan_fops = {
    > + .release = ksm_scan_release,
    > + .unlocked_ioctl = ksm_scan_ioctl,
    > + .compat_ioctl = ksm_scan_ioctl,
    > +};
    > +
    > +static struct ksm_scan *ksm_scan_create(void)
    > +{
    > + struct ksm_scan *ksm_scan;
    > +
    > + ksm_scan = kzalloc(sizeof(struct ksm_scan), GFP_KERNEL);
    > + return ksm_scan;
    > +}


    Convert to one-liner.
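    i.e. something like:

    static struct ksm_scan *ksm_scan_create(void)
    {
            return kzalloc(sizeof(struct ksm_scan), GFP_KERNEL);
    }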

    > +static int ksm_dev_ioctl_create_scan(void)
    > +{
    > + int fd;
    > + struct ksm_scan *ksm_scan;
    > +
    > + if (!tfm) {
    > + fd = ksm_tfm_init();
    > + if (fd)
    > + goto out;
    > + }
    > +
    > + fd = -ENOMEM;
    > + ksm_scan = ksm_scan_create();
    > + if (!ksm_scan)
    > + goto out;
    > +
    > + fd = anon_inode_getfd("ksm-scan", &ksm_scan_fops, ksm_scan, 0);
    > + if (fd < 0)
    > + goto out_free;
    > + return fd;
    > +
    > +out_free:
    > + kfree(ksm_scan);
    > +out:
    > + return fd;
    > +}


    What the heck is all this doing?

    > +static int ksm_dev_ioctl_start_stop_kthread(struct ksm_kthread_info *info)
    > +{
    > + int rc = 0;
    > +
    > + /* Make sure crypto tfm is initialized before starting scanning */
    > + if (info->running && !tfm) {
    > + rc = ksm_tfm_init();
    > + if (rc)
    > + goto out;
    > + }
    > +
    > + down_write(&kthread_lock);
    > +
    > + kthread_sleep = info->sleep;
    > + kthread_pages_to_scan = info->pages_to_scan;
    > + kthread_max_npages = info->max_pages_to_merge;
    > + kthread_run = info->running;
    > +
    > + up_write(&kthread_lock);
    > +
    > + if (kthread_run)
    > + wake_up_interruptible(&kthread_wait);
    > +
    > +out:
    > + return rc;
    > +}


    We have a well-known kernel-wide macro called "kthread_run()". The
    creation of a file-wide global of the same name is unfortunate.

    > +static int ksm_dev_ioctl_get_info_kthread(struct ksm_kthread_info *info)
    > +{
    > + down_read(&kthread_lock);
    > +
    > + info->sleep = kthread_sleep;
    > + info->pages_to_scan = kthread_pages_to_scan;
    > + info->max_pages_to_merge = kthread_max_npages;
    > + info->running = kthread_run;
    > +
    > + up_read(&kthread_lock);
    > + return 0;
    > +}
    > +
    > +static long ksm_dev_ioctl(struct file *filp,
    > + unsigned int ioctl, unsigned long arg)
    > +{
    > + void __user *argp = (void __user *)arg;
    > + long r = -EINVAL;
    > +
    > + switch (ioctl) {
    > + case KSM_GET_API_VERSION:
    > + r = KSM_API_VERSION;
    > + break;
    > + case KSM_CREATE_SHARED_MEMORY_AREA:
    > + r = ksm_dev_ioctl_create_shared_memory_area();
    > + break;
    > + case KSM_CREATE_SCAN:
    > + r = ksm_dev_ioctl_create_scan();
    > + break;
    > + case KSM_START_STOP_KTHREAD: {
    > + struct ksm_kthread_info info;
    > +
    > + r = -EFAULT;
    > + if (copy_from_user(&info, argp,
    > + sizeof(struct ksm_kthread_info)))
    > + break;
    > +
    > + r = ksm_dev_ioctl_start_stop_kthread(&info);
    > + break;
    > + }
    > + case KSM_GET_INFO_KTHREAD: {
    > + struct ksm_kthread_info info;
    > +
    > + r = ksm_dev_ioctl_get_info_kthread(&info);
    > + if (r)
    > + break;
    > + r = -EFAULT;
    > + if (copy_to_user(argp, &info,
    > + sizeof(struct ksm_kthread_info)))
    > + break;
    > + r = 0;
    > + }
    > + default:
    > + return r;
    > + }
    > + return r;
    > +}


    Interface should be documented.

    > +static int ksm_dev_open(struct inode *inode, struct file *filp)
    > +{
    > + try_module_get(THIS_MODULE);
    > + return 0;
    > +}
    > +
    > +static int ksm_dev_release(struct inode *inode, struct file *filp)
    > +{
    > + module_put(THIS_MODULE);
    > + return 0;
    > +}


    What's going on here?

    > +static struct file_operations ksm_chardev_ops = {
    > + .open = ksm_dev_open,
    > + .release = ksm_dev_release,
    > + .unlocked_ioctl = ksm_dev_ioctl,
    > + .compat_ioctl = ksm_dev_ioctl,
    > +};
    > +
    > +static struct miscdevice ksm_dev = {
    > + KSM_MINOR,
    > + "ksm",
    > + &ksm_chardev_ops,
    > +};
    > +
    > +int kthread_ksm_scan_thread(void *nothing)
    > +{
    > + while (!kthread_should_stop()) {
    > + if(kthread_run) {


    checkpatch?

    > + down_read(&kthread_lock);
    > + ksm_scan_start(&kthread_ksm_scan,
    > + kthread_pages_to_scan,
    > + kthread_max_npages);
    > + up_read(&kthread_lock);
    > + schedule_timeout_interruptible(usecs_to_jiffies(kthread_sleep));
    > + } else
    > + wait_event_interruptible(kthread_wait, kthread_run);
    > + }
    > + return 0;
    > +}
    > +
    > +static int __init ksm_init(void)
    > +{
    > + int r = 1;
    > +
    > + r = ksm_slab_init();
    > + if (r)
    > + goto out;
    > +
    > + r = page_hash_init();
    > + if (r)
    > + goto out_free1;
    > +
    > + r = rmap_hash_init();
    > + if (r)
    > + goto out_free2;
    > +
    > + r = sha1_hash_init();
    > + if (r)
    > + goto out_free3;
    > +
    > + INIT_LIST_HEAD(&slots);
    > + init_rwsem(&slots_lock);
    > + spin_lock_init(&hash_lock);
    > + init_rwsem(&kthread_lock);
    > + init_waitqueue_head(&kthread_wait);
    > +
    > + kthread = kthread_run(kthread_ksm_scan_thread, NULL, "kksmd");
    > + if (IS_ERR(kthread)) {
    > + printk(KERN_ERR "ksm: creating kthread failed\n");
    > + goto out_free4;
    > + }
    > +
    > + r = misc_register(&ksm_dev);
    > + if (r) {
    > + printk(KERN_ERR "ksm: misc device register failed\n");
    > + goto out_free5;
    > + }
    > +
    > + printk(KERN_WARNING "ksm loaded\n");
    > + return 0;
    > +
    > +out_free5:
    > + kthread_stop(kthread);
    > +out_free4:
    > + sha1_hash_free();
    > +out_free3:
    > + rmap_hash_free();
    > +out_free2:
    > + page_hash_free();
    > +out_free1:
    > + ksm_slab_free();
    > +out:
    > + return r;
    > +}
    > +
    > +static void __exit ksm_exit(void)
    > +{
    > + misc_deregister(&ksm_dev);
    > + kthread_run = 1;
    > + kthread_stop(kthread);
    > + if (tfm)
    > + crypto_free_hash(tfm);
    > + sha1_hash_free();
    > + rmap_hash_free();
    > + page_hash_free();
    > + ksm_slab_free();
    > +}
    > +
    > +module_init(ksm_init)
    > +module_exit(ksm_exit)


    Generally: this review was rather a waste of your time and of mine
    because the code is inadequately documented.


  19. Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

    Andrew Morton wrote:
    > On Tue, 11 Nov 2008 15:21:39 +0200
    > Izik Eidus wrote:
    >
    >
    >> From: Izik Eidus
    >>
    >> this function is needed in cases you want to change the userspace
    >> virtual mapping into diffrent physical page,
    >>

    >
    > Not sure that I understand that description. We want to replace a live
    > page in an anonymous VMA with a different one?
    >
    > It looks that way.
    >

    Yes, but it replaces it with a kernel-allocated page.
    > page migration already kinda does that. Is there common ground?
    >
    >

    Page migration, as far as I saw, can't migrate an anonymous page into a
    kernel page. If you want, we can change page_migration to do that, but
    I thought you would rather have the ksm changes separate.

    >> KSM need this for merging the identical pages.
    >>
    >> this function is working by removing the oldpage from the rmap and
    >> calling put_page on it, and by setting the virtual address pte
    >> to point into the new page.
    >> (note that the new page (the page that we change the pte to map to)
    >> cannot be anonymous page)
    >>
    >>

    >
    > I don't understand the restrictions on anonymous pages. Please expand
    > the changelog so that reviewers can understand the reasons for this
    > restriction.
    >

    The page that we are going to map into the pte is going to be nonlinear
    from the point of view of the anon-vma, therefore it cannot be
    anonymous.

    >
    >
    >> ---
    >> include/linux/mm.h | 3 ++
    >> mm/memory.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++
    >> 2 files changed, 71 insertions(+), 0 deletions(-)
    >>
    >> diff --git a/include/linux/mm.h b/include/linux/mm.h
    >> index ffee2f7..4da7fa8 100644
    >> --- a/include/linux/mm.h
    >> +++ b/include/linux/mm.h
    >> @@ -1207,6 +1207,9 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
    >> int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
    >> unsigned long pfn);
    >>
    >> +int replace_page(struct vm_area_struct *vma, struct page *oldpage,
    >> + struct page *newpage, pte_t orig_pte, pgprot_t prot);
    >> +
    >> struct page *follow_page(struct vm_area_struct *, unsigned long address,
    >> unsigned int foll_flags);
    >> #define FOLL_WRITE 0x01 /* check pte is writable */
    >> diff --git a/mm/memory.c b/mm/memory.c
    >> index 164951c..b2c542c 100644
    >> --- a/mm/memory.c
    >> +++ b/mm/memory.c
    >> @@ -1472,6 +1472,74 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
    >> }
    >> EXPORT_SYMBOL(vm_insert_mixed);
    >>
    >> +/**
    >> + * replace _page - replace the pte mapping related to vm area between two pages
    >>

    >
    > s/replace _page/replace_page/
    >
    >
    >> + * (from oldpage to newpage)
    >> + * NOTE: you should take into consideration the impact on the VM when replacing
    >> + * anonymous pages with kernel non swappable pages.
    >> + */
    >>

    >
    > This _is_ a kerneldoc comment, but kernedoc comments conventionally
    > document the arguments and the return value also.
    >
    >
    >> +int replace_page(struct vm_area_struct *vma, struct page *oldpage,
    >> + struct page *newpage, pte_t orig_pte, pgprot_t prot)
    >> +{
    >> + struct mm_struct *mm = vma->vm_mm;
    >> + pgd_t *pgd;
    >> + pud_t *pud;
    >> + pmd_t *pmd;
    >> + pte_t *ptep;
    >> + spinlock_t *ptl;
    >> + unsigned long addr;
    >> + int ret;
    >> +
    >> + BUG_ON(PageAnon(newpage));
    >> +
    >> + ret = -EFAULT;
    >> + addr = page_address_in_vma(oldpage, vma);
    >> + if (addr == -EFAULT)
    >> + goto out;
    >> +
    >> + pgd = pgd_offset(mm, addr);
    >> + if (!pgd_present(*pgd))
    >> + goto out;
    >> +
    >> + pud = pud_offset(pgd, addr);
    >> + if (!pud_present(*pud))
    >> + goto out;
    >> +
    >> + pmd = pmd_offset(pud, addr);
    >> + if (!pmd_present(*pmd))
    >> + goto out;
    >> +
    >> + ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
    >> + if (!ptep)
    >> + goto out;
    >> +
    >> + if (!pte_same(*ptep, orig_pte)) {
    >> + pte_unmap_unlock(ptep, ptl);
    >> + goto out;
    >> + }
    >> +
    >> + ret = 0;
    >> + get_page(newpage);
    >> + page_add_file_rmap(newpage);
    >> +
    >> + flush_cache_page(vma, addr, pte_pfn(*ptep));
    >> + ptep_clear_flush(vma, addr, ptep);
    >> + set_pte_at(mm, addr, ptep, mk_pte(newpage, prot));
    >> +
    >> + page_remove_rmap(oldpage, vma);
    >> + if (PageAnon(oldpage)) {
    >> + dec_mm_counter(mm, anon_rss);
    >> + inc_mm_counter(mm, file_rss);
    >> + }
    >> + put_page(oldpage);
    >> +
    >> + pte_unmap_unlock(ptep, ptl);
    >> +
    >> +out:
    >> + return ret;
    >> +}
    >> +EXPORT_SYMBOL(replace_page);
    >>

    >
    > Again, we could make the presence of this code selectable by subsystems
    > which want it.
    >


    sure.
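    For reference, a kerneldoc block of the shape Andrew asks for might
    look like this (argument and return descriptions are our inference
    from the code above, not the author's wording):

    /**
     * replace_page - replace the pte mapping related to a vm area
     * @vma:      vma that holds the pte pointing at @oldpage
     * @oldpage:  the page we are replacing (an anonymous page)
     * @newpage:  the page we replace it with (must not be anonymous)
     * @orig_pte: the original pte value, used to detect a racing change
     * @prot:     page protection bits for the new pte
     *
     * Returns 0 on success, -EFAULT if the address or pte could no longer
     * be resolved to @oldpage.
     *
     * NOTE: you should take into consideration the impact on the VM when
     * replacing anonymous pages with kernel, non-swappable pages.
     */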


  20. Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

    On Tue, Nov 11, 2008 at 11:45:55AM -0800, Andrew Morton wrote:
    > page migration already kinda does that. Is there common ground?


    btw, page_migration likely is buggy w.r.t. o_direct too (and now
    unfixable with gup_fast until the 2.4 brlock is added around it or
    similar) if it does the same thing but without any page_mapcount vs
    page_count check.

    page_migration does too much for us, so us calling into migrate.c may
    not be ideal. It has to convert a fresh page to a VM page. In KSM we
    don't convert the newpage to be a VM page, we just replace the anon
    page with another page. The new page in the KSM case is not a page
    known by the VM, not in the lru etc...

    The way to go could be to change the page_migration to use
    replace_page (or __replace_page if called in some shared inner-lock
    context) after preparing the newpage to be a regular VM page. If we
    can do that, migrate.c will get the o_direct race fixed too for free.

    > I don't understand the restrictions on anonymous pages. Please expand
    > the changelog so that reviewers can understand the reasons for this
    > restriction.


    That's a good question, but I don't have a definitive answer, as I
    haven't yet worked out exactly how complex it would be.

    Suppose a file has 0-4k equal to 4k-8k. A MAP_SHARED that maps both
    pages with the same physical page sounds tricky. The shared pages are
    KSM-owned and invisible to the VM (later they could be made visible
    with an external-rmap), but pagecache pages can't be just KSM-owned: at
    the least they must go into the pagecache radix tree, so that requires
    patching the radix tree etc... All things we don't need for anon ram. I
    think the first thing to extend is to add an external-rmap and make ksm
    swappable; then we can think about whether something can be done about
    MAP_SHARED/MAP_PRIVATE too (MAP_PRIVATE post-COW already works, the
    question is pre-COW). One excuse for not having thought too much about
    it yet is that KSM is in effect mostly useful for anon ram: the
    pagecache case can be solved in userland with hardlinks (as openvz
    does), and shared libs already do all they can to share .text (what
    isn't shared after dynamic linking should be covered by ksm already).

    > Again, we could make the presence of this code selectable by subsystems
    > which want it.


    Indeed!!
