[patch 0/9] [RFC] EMM Notifier V2 - Kernel

Thread: [patch 0/9] [RFC] EMM Notifier V2

  1. Re: EMM: disable other notifiers before register and unregister

    On Wed, Apr 02, 2008 at 06:24:15PM -0700, Christoph Lameter wrote:
    > Ok lets forget about the single threaded thing to solve the registration
    > races. As Andrea pointed out this still has issues with other subscribed
    > subsystems (and also try_to_unmap). We could do something like what
    > stop_machine_run does: First disable all running subsystems before
    > registering a new one.
    >
    > Maybe this is a possible solution.


    It still doesn't solve this kernel crash.

    CPU0                                      CPU1
    range_start (mmu notifier chain is empty)
    range_start returns
                                              mmu_notifier_register
                                              kvm_emm_stop (how kvm can ever know
                                              the other cpu is in the middle of
                                              the critical section?)
                                              kvm page fault (kvm thinks
                                              mmu_notifier_register serialized)
    zap ptes
    free_page mapped by spte/GRU and not pinned -> crash


    There's no way the lowlevel can stop mmu_notifier_register and if
    mmu_notifier_register returns, then sptes will be instantiated and
    it'll corrupt memory the same way.

    The seqlock was fine; what is wrong is the assumption that the
    lowlevel driver can handle a range_end happening without a
    range_begin before it. By design it can't, and that is the core
    kernel crashing problem we have (it's a kernel crashing issue only
    for drivers that don't pin the pages, so XPMEM wouldn't crash, but it
    would still leak memory, which is a more graceful failure than random
    mm corruption).

    The basic trouble is that sometime range_begin/end critical sections
    run outside the mmap_sem (see try_to_unmap_cluster in #v10 or even
    try_to_unmap_one only in EMM-V2).

    My attempt to fix this once and for all is to walk all vmas of the
    "mm" inside mmu_notifier_register and take all anon_vma locks and
    i_mmap_locks in virtual address order in a row. It's ok to take those
    inside the mmap_sem. Supposedly if anybody will ever take a double
    lock it'll do in order too. Then I can dump all the other locking and
    remove the seqlock, and the driver is guaranteed there will be a
    single call of range_begin followed by a single call of range_end the
    whole time and no race could ever happen, and there won't be repeated
    calls of range_begin that would screw up a recursive semaphore
    locking. The patch won't be pretty, I guess I'll vmalloc an array of
    pointers to locks to reorder them. It doesn't need to be fast. Also
    the locks can't go away from under us while we hold the
    down_write(mmap_sem) because the vmas can be altered only with
    down_write(mmap_sem) (modulo vm_start/vm_end that can be modified with
    only down_read(mmap_sem) + page_table_lock like in growsdown page
    faults). So it should be ok to take all those locks inside the
    mmap_sem and implement a lock_vm(mm) unlock_vm(mm). I'll think more
    about this hammer approach while I try to implement it...
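    A minimal sketch of the approach described above (an editor's addition,
    not part of the thread): collect every anon_vma lock and i_mmap_lock of
    the mm into a vmalloc'ed array, sort the pointers, and take each lock
    once in ascending address order. All names here are hypothetical, and
    the #v11 patch posted later in the thread ends up using a repeated
    linear scan instead of sorting.

    #include <linux/mm.h>
    #include <linux/fs.h>
    #include <linux/rmap.h>
    #include <linux/sort.h>
    #include <linux/vmalloc.h>

    static int lock_cmp(const void *a, const void *b)
    {
            unsigned long la = (unsigned long)*(spinlock_t * const *)a;
            unsigned long lb = (unsigned long)*(spinlock_t * const *)b;

            return la < lb ? -1 : la > lb;
    }

    /*
     * Take mmap_sem and then every anon_vma/i_mmap lock of "mm", each
     * exactly once, in ascending address order.  Returns the sorted array
     * so a matching unlock can walk it the same way (vmalloc failure
     * handling is omitted in this sketch).
     */
    static spinlock_t **mm_lock_sketch(struct mm_struct *mm, int *nr)
    {
            struct vm_area_struct *vma;
            spinlock_t **locks;
            int n = 0, i;

            down_write(&mm->mmap_sem);
            /* a vma may contribute both an anon_vma and an i_mmap lock */
            locks = vmalloc(2 * mm->map_count * sizeof(*locks));
            for (vma = mm->mmap; vma; vma = vma->vm_next) {
                    if (vma->anon_vma)
                            locks[n++] = &vma->anon_vma->lock;
                    if (vma->vm_file && vma->vm_file->f_mapping)
                            locks[n++] = &vma->vm_file->f_mapping->i_mmap_lock;
            }
            sort(locks, n, sizeof(*locks), lock_cmp, NULL);
            for (i = 0; i < n; i++)
                    /* shared anon_vmas/mappings appear more than once */
                    if (i == 0 || locks[i] != locks[i - 1])
                            spin_lock(locks[i]);
            *nr = n;
            return locks;
    }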

  2. Re: EMM: Fixup return value handling of emm_notify()

    On Thu, 3 Apr 2008, Peter Zijlstra wrote:

    > It seems to me that common code can be shared using functions? No need
    > to stuff everything into a single function. We have method vectors all
    > over the kernel, we could do a_ops as a single callback too, but we
    > dont.
    >
    > FWIW I prefer separate methods.


    Ok. It seems that I already added some new methods which do not use all
    parameters. So lets switch back to the old scheme for the next release.

  3. Re: EMM: disable other notifiers before register and unregister

    On Thu, 3 Apr 2008, Andrea Arcangeli wrote:

    > My attempt to fix this once and for all is to walk all vmas of the
    > "mm" inside mmu_notifier_register and take all anon_vma locks and
    > i_mmap_locks in virtual address order in a row. It's ok to take those
    > inside the mmap_sem. Supposedly if anybody will ever take a double
    > lock it'll do in order too. Then I can dump all the other locking and


    What about concurrent mmu_notifier registrations from two mm_structs
    that have shared mappings? Isnt there a potential deadlock situation?

    > faults). So it should be ok to take all those locks inside the
    > mmap_sem and implement a lock_vm(mm) unlock_vm(mm). I'll think more
    > about this hammer approach while I try to implement it...


    Well good luck. Hopefully we will get to something that works.


  4. Re: EMM: disable other notifiers before register and unregister

    On Thu, 3 Apr 2008, Christoph Lameter wrote:

    > > faults). So it should be ok to take all those locks inside the
    > > mmap_sem and implement a lock_vm(mm) unlock_vm(mm). I'll think more
    > > about this hammer approach while I try to implement it...

    >
    > Well good luck. Hopefully we will get to something that works.


    Another hammer to use may be the freezer from software suspend. With that
    you can get all tasks of a process into a definite state. Then take the
    mmap_sem writably. But then there is still try_to_unmap and friends that
    can race.



  5. Re: EMM: disable other notifiers before register and unregister

    On Thu, Apr 03, 2008 at 12:20:41PM -0700, Christoph Lameter wrote:
    > On Thu, 3 Apr 2008, Andrea Arcangeli wrote:
    >
    > > My attempt to fix this once and for all is to walk all vmas of the
    > > "mm" inside mmu_notifier_register and take all anon_vma locks and
    > > i_mmap_locks in virtual address order in a row. It's ok to take those
    > > inside the mmap_sem. Supposedly if anybody will ever take a double
    > > lock it'll do in order too. Then I can dump all the other locking and

    >
    > What about concurrent mmu_notifier registrations from two mm_structs
    > that have shared mappings? Isnt there a potential deadlock situation?


    No, the lock ordering avoids that. Here's a snippet.

    /*
    * This operation locks against the VM for all pte/vma/mm related
    * operations that could ever happen on a certain mm. This includes
    * vmtruncate, try_to_unmap, and all page faults. The holder
    * must not hold any mm related lock. A single task can't take more
    * than one mm lock in a row or it would deadlock.
    */

    So you can't do:

    mm_lock(mm1);
    mm_lock(mm2);

    But if two different tasks run mm_lock everything is ok. Each task
    in the system can lock at most one mm at a time.

    > Well good luck. Hopefully we will get to something that works.


    Looks good so far but I didn't finish it yet.

  6. [PATCH] mmu notifier #v11

    This should guarantee that nobody can register while any of the mmu
    notifiers is running, avoiding all the races, including guaranteeing
    that range_start is not missed. I'll adapt the other patches to provide
    the sleeping-feature on top of this (only needed by XPMEM) soon. KVM
    seems to run fine on top of this one.

    Andrew can you apply this to -mm?

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter

    diff --git a/include/linux/mm.h b/include/linux/mm.h
    --- a/include/linux/mm.h
    +++ b/include/linux/mm.h
    @@ -1050,6 +1050,9 @@
    unsigned long addr, unsigned long len,
    unsigned long flags, struct page **pages);

    +extern void mm_lock(struct mm_struct *mm);
    +extern void mm_unlock(struct mm_struct *mm);
    +
    extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);

    extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
    diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
    --- a/include/linux/mm_types.h
    +++ b/include/linux/mm_types.h
    @@ -225,6 +225,9 @@
    #ifdef CONFIG_CGROUP_MEM_RES_CTLR
    struct mem_cgroup *mem_cgroup;
    #endif
    +#ifdef CONFIG_MMU_NOTIFIER
    + struct hlist_head mmu_notifier_list;
    +#endif
    };

    #endif /* _LINUX_MM_TYPES_H */
    diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
    new file mode 100644
    --- /dev/null
    +++ b/include/linux/mmu_notifier.h
    @@ -0,0 +1,175 @@
    +#ifndef _LINUX_MMU_NOTIFIER_H
    +#define _LINUX_MMU_NOTIFIER_H
    +
    +#include <linux/list.h>
    +#include <linux/spinlock.h>
    +#include <linux/mm_types.h>
    +
    +struct mmu_notifier;
    +struct mmu_notifier_ops;
    +
    +#ifdef CONFIG_MMU_NOTIFIER
    +
    +struct mmu_notifier_ops {
    + /*
    + * Called when nobody can register any more notifier in the mm
    + * and after the "mn" notifier has been disarmed already.
    + */
    + void (*release)(struct mmu_notifier *mn,
    + struct mm_struct *mm);
    +
    + /*
    + * clear_flush_young is called after the VM is
    + * test-and-clearing the young/accessed bitflag in the
    + * pte. This way the VM will provide proper aging to the
    + * accesses to the page through the secondary MMUs and not
    + * only to the ones through the Linux pte.
    + */
    + int (*clear_flush_young)(struct mmu_notifier *mn,
    + struct mm_struct *mm,
    + unsigned long address);
    +
    + /*
    + * Before this is invoked any secondary MMU is still ok to
    + * read/write to the page previously pointed by the Linux pte
    + * because the old page hasn't been freed yet. If required
    + * set_page_dirty has to be called internally to this method.
    + */
    + void (*invalidate_page)(struct mmu_notifier *mn,
    + struct mm_struct *mm,
    + unsigned long address);
    +
    + /*
    + * invalidate_range_start() and invalidate_range_end() must be
    + * paired. Multiple invalidate_range_start/ends may be nested
    + * or called concurrently.
    + */
    + void (*invalidate_range_start)(struct mmu_notifier *mn,
    + struct mm_struct *mm,
    + unsigned long start, unsigned long end);
    + void (*invalidate_range_end)(struct mmu_notifier *mn,
    + struct mm_struct *mm,
    + unsigned long start, unsigned long end);
    +};
    +
    +struct mmu_notifier {
    + struct hlist_node hlist;
    + const struct mmu_notifier_ops *ops;
    +};
    +
    +static inline int mm_has_notifiers(struct mm_struct *mm)
    +{
    + return unlikely(!hlist_empty(&mm->mmu_notifier_list));
    +}
    +
    +extern void mmu_notifier_register(struct mmu_notifier *mn,
    + struct mm_struct *mm);
    +extern void __mmu_notifier_release(struct mm_struct *mm);
    +extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
    + unsigned long address);
    +extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
    + unsigned long address);
    +extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
    + unsigned long start, unsigned long end);
    +extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
    + unsigned long start, unsigned long end);
    +
    +
    +static inline void mmu_notifier_release(struct mm_struct *mm)
    +{
    + if (mm_has_notifiers(mm))
    + __mmu_notifier_release(mm);
    +}
    +
    +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
    + unsigned long address)
    +{
    + if (mm_has_notifiers(mm))
    + return __mmu_notifier_clear_flush_young(mm, address);
    + return 0;
    +}
    +
    +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
    + unsigned long address)
    +{
    + if (mm_has_notifiers(mm))
    + __mmu_notifier_invalidate_page(mm, address);
    +}
    +
    +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
    + unsigned long start, unsigned long end)
    +{
    + if (mm_has_notifiers(mm))
    + __mmu_notifier_invalidate_range_start(mm, start, end);
    +}
    +
    +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
    + unsigned long start, unsigned long end)
    +{
    + if (mm_has_notifiers(mm))
    + __mmu_notifier_invalidate_range_end(mm, start, end);
    +}
    +
    +static inline void mmu_notifier_mm_init(struct mm_struct *mm)
    +{
    + INIT_HLIST_HEAD(&mm->mmu_notifier_list);
    +}
    +
    +#define ptep_clear_flush_notify(__vma, __address, __ptep) \
    +({ \
    + pte_t __pte; \
    + struct vm_area_struct *___vma = __vma; \
    + unsigned long ___address = __address; \
    + __pte = ptep_clear_flush(___vma, ___address, __ptep); \
    + mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \
    + __pte; \
    +})
    +
    +#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
    +({ \
    + int __young; \
    + struct vm_area_struct *___vma = __vma; \
    + unsigned long ___address = __address; \
    + __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
    + __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
    + ___address); \
    + __young; \
    +})
    +
    +#else /* CONFIG_MMU_NOTIFIER */
    +
    +static inline void mmu_notifier_release(struct mm_struct *mm)
    +{
    +}
    +
    +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
    + unsigned long address)
    +{
    + return 0;
    +}
    +
    +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
    + unsigned long address)
    +{
    +}
    +
    +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
    + unsigned long start, unsigned long end)
    +{
    +}
    +
    +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
    + unsigned long start, unsigned long end)
    +{
    +}
    +
    +static inline void mmu_notifier_mm_init(struct mm_struct *mm)
    +{
    +}
    +
    +#define ptep_clear_flush_young_notify ptep_clear_flush_young
    +#define ptep_clear_flush_notify ptep_clear_flush
    +
    +#endif /* CONFIG_MMU_NOTIFIER */
    +
    +#endif /* _LINUX_MMU_NOTIFIER_H */
    diff --git a/kernel/fork.c b/kernel/fork.c
    --- a/kernel/fork.c
    +++ b/kernel/fork.c
    @@ -53,6 +53,7 @@
    #include
    #include
    #include
    +#include <linux/mmu_notifier.h>

    #include
    #include
    @@ -362,6 +363,7 @@

    if (likely(!mm_alloc_pgd(mm))) {
    mm->def_flags = 0;
    + mmu_notifier_mm_init(mm);
    return mm;
    }

    diff --git a/mm/Kconfig b/mm/Kconfig
    --- a/mm/Kconfig
    +++ b/mm/Kconfig
    @@ -193,3 +193,7 @@
    config VIRT_TO_BUS
    def_bool y
    depends on !ARCH_NO_VIRT_TO_BUS
    +
    +config MMU_NOTIFIER
    + def_bool y
    + bool "MMU notifier, for paging KVM/RDMA"
    diff --git a/mm/Makefile b/mm/Makefile
    --- a/mm/Makefile
    +++ b/mm/Makefile
    @@ -33,4 +33,5 @@
    obj-$(CONFIG_SMP) += allocpercpu.o
    obj-$(CONFIG_QUICKLIST) += quicklist.o
    obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
    +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o

    diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
    --- a/mm/filemap_xip.c
    +++ b/mm/filemap_xip.c
    @@ -194,7 +194,7 @@
    if (pte) {
    /* Nuke the page table entry. */
    flush_cache_page(vma, address, pte_pfn(*pte));
    - pteval = ptep_clear_flush(vma, address, pte);
    + pteval = ptep_clear_flush_notify(vma, address, pte);
    page_remove_rmap(page, vma);
    dec_mm_counter(mm, file_rss);
    BUG_ON(pte_dirty(pteval));
    diff --git a/mm/fremap.c b/mm/fremap.c
    --- a/mm/fremap.c
    +++ b/mm/fremap.c
    @@ -15,6 +15,7 @@
    #include
    #include
    #include
    +#include <linux/mmu_notifier.h>

    #include
    #include
    @@ -214,7 +215,9 @@
    spin_unlock(&mapping->i_mmap_lock);
    }

    + mmu_notifier_invalidate_range_start(mm, start, start + size);
    err = populate_range(mm, vma, start, size, pgoff);
    + mmu_notifier_invalidate_range_end(mm, start, start + size);
    if (!err && !(flags & MAP_NONBLOCK)) {
    if (unlikely(has_write_lock)) {
    downgrade_write(&mm->mmap_sem);
    diff --git a/mm/hugetlb.c b/mm/hugetlb.c
    --- a/mm/hugetlb.c
    +++ b/mm/hugetlb.c
    @@ -14,6 +14,7 @@
    #include
    #include
    #include
    +#include <linux/mmu_notifier.h>

    #include
    #include
    @@ -799,6 +800,7 @@
    BUG_ON(start & ~HPAGE_MASK);
    BUG_ON(end & ~HPAGE_MASK);

    + mmu_notifier_invalidate_range_start(mm, start, end);
    spin_lock(&mm->page_table_lock);
    for (address = start; address < end; address += HPAGE_SIZE) {
    ptep = huge_pte_offset(mm, address);
    @@ -819,6 +821,7 @@
    }
    spin_unlock(&mm->page_table_lock);
    flush_tlb_range(vma, start, end);
    + mmu_notifier_invalidate_range_end(mm, start, end);
    list_for_each_entry_safe(page, tmp, &page_list, lru) {
    list_del(&page->lru);
    put_page(page);
    diff --git a/mm/memory.c b/mm/memory.c
    --- a/mm/memory.c
    +++ b/mm/memory.c
    @@ -51,6 +51,7 @@
    #include
    #include
    #include
    +#include <linux/mmu_notifier.h>

    #include
    #include
    @@ -611,6 +612,9 @@
    if (is_vm_hugetlb_page(vma))
    return copy_hugetlb_page_range(dst_mm, src_mm, vma);

    + if (is_cow_mapping(vma->vm_flags))
    + mmu_notifier_invalidate_range_start(src_mm, addr, end);
    +
    dst_pgd = pgd_offset(dst_mm, addr);
    src_pgd = pgd_offset(src_mm, addr);
    do {
    @@ -621,6 +625,11 @@
    vma, addr, next))
    return -ENOMEM;
    } while (dst_pgd++, src_pgd++, addr = next, addr != end);
    +
    + if (is_cow_mapping(vma->vm_flags))
    + mmu_notifier_invalidate_range_end(src_mm,
    + vma->vm_start, end);
    +
    return 0;
    }

    @@ -897,7 +906,9 @@
    lru_add_drain();
    tlb = tlb_gather_mmu(mm, 0);
    update_hiwater_rss(mm);
    + mmu_notifier_invalidate_range_start(mm, address, end);
    end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
    + mmu_notifier_invalidate_range_end(mm, address, end);
    if (tlb)
    tlb_finish_mmu(tlb, address, end);
    return end;
    @@ -1463,10 +1474,11 @@
    {
    pgd_t *pgd;
    unsigned long next;
    - unsigned long end = addr + size;
    + unsigned long start = addr, end = addr + size;
    int err;

    BUG_ON(addr >= end);
    + mmu_notifier_invalidate_range_start(mm, start, end);
    pgd = pgd_offset(mm, addr);
    do {
    next = pgd_addr_end(addr, end);
    @@ -1474,6 +1486,7 @@
    if (err)
    break;
    } while (pgd++, addr = next, addr != end);
    + mmu_notifier_invalidate_range_end(mm, start, end);
    return err;
    }
    EXPORT_SYMBOL_GPL(apply_to_page_range);
    @@ -1675,7 +1688,7 @@
    * seen in the presence of one thread doing SMC and another
    * thread doing COW.
    */
    - ptep_clear_flush(vma, address, page_table);
    + ptep_clear_flush_notify(vma, address, page_table);
    set_pte_at(mm, address, page_table, entry);
    update_mmu_cache(vma, address, entry);
    lru_cache_add_active(new_page);
    diff --git a/mm/mmap.c b/mm/mmap.c
    --- a/mm/mmap.c
    +++ b/mm/mmap.c
    @@ -26,6 +26,7 @@
    #include
    #include
    #include
    +#include <linux/mmu_notifier.h>

    #include
    #include
    @@ -1747,11 +1748,13 @@
    lru_add_drain();
    tlb = tlb_gather_mmu(mm, 0);
    update_hiwater_rss(mm);
    + mmu_notifier_invalidate_range_start(mm, start, end);
    unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
    vm_unacct_memory(nr_accounted);
    free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
    next? next->vm_start: 0);
    tlb_finish_mmu(tlb, start, end);
    + mmu_notifier_invalidate_range_end(mm, start, end);
    }

    /*
    @@ -2037,6 +2040,7 @@
    unsigned long end;

    /* mm's last user has gone, and its about to be pulled down */
    + mmu_notifier_release(mm);
    arch_exit_mmap(mm);

    lru_add_drain();
    @@ -2242,3 +2246,69 @@

    return 0;
    }
    +
    +static void mm_lock_unlock(struct mm_struct *mm, int lock)
    +{
    + struct vm_area_struct *vma;
    + spinlock_t *i_mmap_lock_last, *anon_vma_lock_last;
    +
    + i_mmap_lock_last = NULL;
    + for (;;) {
    + spinlock_t *i_mmap_lock = (spinlock_t *) -1UL;
    + for (vma = mm->mmap; vma; vma = vma->vm_next)
    + if (vma->vm_file && vma->vm_file->f_mapping &&
    + (unsigned long) i_mmap_lock >
    + (unsigned long)
    + &vma->vm_file->f_mapping->i_mmap_lock &&
    + (unsigned long)
    + &vma->vm_file->f_mapping->i_mmap_lock >
    + (unsigned long) i_mmap_lock_last)
    + i_mmap_lock =
    + &vma->vm_file->f_mapping->i_mmap_lock;
    + if (i_mmap_lock == (spinlock_t *) -1UL)
    + break;
    + i_mmap_lock_last = i_mmap_lock;
    + if (lock)
    + spin_lock(i_mmap_lock);
    + else
    + spin_unlock(i_mmap_lock);
    + }
    +
    + anon_vma_lock_last = NULL;
    + for (;;) {
    + spinlock_t *anon_vma_lock = (spinlock_t *) -1UL;
    + for (vma = mm->mmap; vma; vma = vma->vm_next)
    + if (vma->anon_vma &&
    + (unsigned long) anon_vma_lock >
    + (unsigned long) &vma->anon_vma->lock &&
    + (unsigned long) &vma->anon_vma->lock >
    + (unsigned long) anon_vma_lock_last)
    + anon_vma_lock = &vma->anon_vma->lock;
    + if (anon_vma_lock == (spinlock_t *) -1UL)
    + break;
    + anon_vma_lock_last = anon_vma_lock;
    + if (lock)
    + spin_lock(anon_vma_lock);
    + else
    + spin_unlock(anon_vma_lock);
    + }
    +}
    +
    +/*
    + * This operation locks against the VM for all pte/vma/mm related
    + * operations that could ever happen on a certain mm. This includes
    + * vmtruncate, try_to_unmap, and all page faults. The holder
    + * must not hold any mm related lock. A single task can't take more
    + * than one mm lock in a row or it would deadlock.
    + */
    +void mm_lock(struct mm_struct * mm)
    +{
    + down_write(&mm->mmap_sem);
    + mm_lock_unlock(mm, 1);
    +}
    +
    +void mm_unlock(struct mm_struct *mm)
    +{
    + mm_lock_unlock(mm, 0);
    + up_write(&mm->mmap_sem);
    +}
    diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
    new file mode 100644
    --- /dev/null
    +++ b/mm/mmu_notifier.c
    @@ -0,0 +1,100 @@
    +/*
    + * linux/mm/mmu_notifier.c
    + *
    + * Copyright (C) 2008 Qumranet, Inc.
    + * Copyright (C) 2008 SGI
    + * Christoph Lameter
    + *
    + * This work is licensed under the terms of the GNU GPL, version 2. See
    + * the COPYING file in the top-level directory.
    + */
    +
    +#include <linux/mmu_notifier.h>
    +#include <linux/module.h>
    +#include <linux/mm.h>
    +
    +/*
    + * No synchronization. This function can only be called when only a single
    + * process remains that performs teardown.
    + */
    +void __mmu_notifier_release(struct mm_struct *mm)
    +{
    + struct mmu_notifier *mn;
    +
    + while (unlikely(!hlist_empty(&mm->mmu_notifier_list))) {
    + mn = hlist_entry(mm->mmu_notifier_list.first,
    + struct mmu_notifier,
    + hlist);
    + hlist_del(&mn->hlist);
    + if (mn->ops->release)
    + mn->ops->release(mn, mm);
    + }
    +}
    +
    +/*
    + * If no young bitflag is supported by the hardware, ->clear_flush_young can
    + * unmap the address and return 1 or 0 depending if the mapping previously
    + * existed or not.
    + */
    +int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
    + unsigned long address)
    +{
    + struct mmu_notifier *mn;
    + struct hlist_node *n;
    + int young = 0;
    +
    + hlist_for_each_entry(mn, n, &mm->mmu_notifier_list, hlist) {
    + if (mn->ops->clear_flush_young)
    + young |= mn->ops->clear_flush_young(mn, mm, address);
    + }
    +
    + return young;
    +}
    +
    +void __mmu_notifier_invalidate_page(struct mm_struct *mm,
    + unsigned long address)
    +{
    + struct mmu_notifier *mn;
    + struct hlist_node *n;
    +
    + hlist_for_each_entry(mn, n, &mm->mmu_notifier_list, hlist) {
    + if (mn->ops->invalidate_page)
    + mn->ops->invalidate_page(mn, mm, address);
    + }
    +}
    +
    +void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
    + unsigned long start, unsigned long end)
    +{
    + struct mmu_notifier *mn;
    + struct hlist_node *n;
    +
    + hlist_for_each_entry(mn, n, &mm->mmu_notifier_list, hlist) {
    + if (mn->ops->invalidate_range_start)
    + mn->ops->invalidate_range_start(mn, mm, start, end);
    + }
    +}
    +
    +void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
    + unsigned long start, unsigned long end)
    +{
    + struct mmu_notifier *mn;
    + struct hlist_node *n;
    +
    + hlist_for_each_entry(mn, n, &mm->mmu_notifier_list, hlist) {
    + if (mn->ops->invalidate_range_end)
    + mn->ops->invalidate_range_end(mn, mm, start, end);
    + }
    +}
    +
    +/*
    + * Must not hold mmap_sem nor any other VM related lock when calling
    + * this registration function.
    + */
    +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
    +{
    + mm_lock(mm);
    + hlist_add_head(&mn->hlist, &mm->mmu_notifier_list);
    + mm_unlock(mm);
    +}
    +EXPORT_SYMBOL_GPL(mmu_notifier_register);
    diff --git a/mm/mprotect.c b/mm/mprotect.c
    --- a/mm/mprotect.c
    +++ b/mm/mprotect.c
    @@ -21,6 +21,7 @@
    #include
    #include
    #include
    +#include <linux/mmu_notifier.h>
    #include
    #include
    #include
    @@ -198,10 +199,12 @@
    dirty_accountable = 1;
    }

    + mmu_notifier_invalidate_range_start(mm, start, end);
    if (is_vm_hugetlb_page(vma))
    hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
    else
    change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
    + mmu_notifier_invalidate_range_end(mm, start, end);
    vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
    vm_stat_account(mm, newflags, vma->vm_file, nrpages);
    return 0;
    diff --git a/mm/mremap.c b/mm/mremap.c
    --- a/mm/mremap.c
    +++ b/mm/mremap.c
    @@ -18,6 +18,7 @@
    #include
    #include
    #include
    +#include <linux/mmu_notifier.h>

    #include
    #include
    @@ -74,7 +75,11 @@
    struct mm_struct *mm = vma->vm_mm;
    pte_t *old_pte, *new_pte, pte;
    spinlock_t *old_ptl, *new_ptl;
    + unsigned long old_start;

    + old_start = old_addr;
    + mmu_notifier_invalidate_range_start(vma->vm_mm,
    + old_start, old_end);
    if (vma->vm_file) {
    /*
    * Subtle point from Rajesh Venkatasubramanian: before
    @@ -116,6 +121,7 @@
    pte_unmap_unlock(old_pte - 1, old_ptl);
    if (mapping)
    spin_unlock(&mapping->i_mmap_lock);
    + mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
    }

    #define LATENCY_LIMIT (64 * PAGE_SIZE)
    diff --git a/mm/rmap.c b/mm/rmap.c
    --- a/mm/rmap.c
    +++ b/mm/rmap.c
    @@ -49,6 +49,7 @@
    #include
    #include
    #include
    +#include <linux/mmu_notifier.h>

    #include

    @@ -287,7 +288,7 @@
    if (vma->vm_flags & VM_LOCKED) {
    referenced++;
    *mapcount = 1; /* break early from loop */
    - } else if (ptep_clear_flush_young(vma, address, pte))
    + } else if (ptep_clear_flush_young_notify(vma, address, pte))
    referenced++;

    /* Pretend the page is referenced if the task has the
    @@ -456,7 +457,7 @@
    pte_t entry;

    flush_cache_page(vma, address, pte_pfn(*pte));
    - entry = ptep_clear_flush(vma, address, pte);
    + entry = ptep_clear_flush_notify(vma, address, pte);
    entry = pte_wrprotect(entry);
    entry = pte_mkclean(entry);
    set_pte_at(mm, address, pte, entry);
    @@ -717,14 +718,14 @@
    * skipped over this mm) then we should reactivate it.
    */
    if (!migration && ((vma->vm_flags & VM_LOCKED) ||
    - (ptep_clear_flush_young(vma, address, pte)))) {
    + (ptep_clear_flush_young_notify(vma, address, pte)))) {
    ret = SWAP_FAIL;
    goto out_unmap;
    }

    /* Nuke the page table entry. */
    flush_cache_page(vma, address, page_to_pfn(page));
    - pteval = ptep_clear_flush(vma, address, pte);
    + pteval = ptep_clear_flush_notify(vma, address, pte);

    /* Move the dirty bit to the physical page now the pte is gone. */
    if (pte_dirty(pteval))
    @@ -849,12 +850,12 @@
    page = vm_normal_page(vma, address, *pte);
    BUG_ON(!page || PageAnon(page));

    - if (ptep_clear_flush_young(vma, address, pte))
    + if (ptep_clear_flush_young_notify(vma, address, pte))
    continue;

    /* Nuke the page table entry. */
    flush_cache_page(vma, address, pte_pfn(*pte));
    - pteval = ptep_clear_flush(vma, address, pte);
    + pteval = ptep_clear_flush_notify(vma, address, pte);

    /* If nonlinear, store the file page offset in the pte. */
    if (page->index != linear_page_index(vma, address))
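    To make the interface above concrete, here is a minimal, hypothetical
    user of it (an editor's addition, not part of the thread). Everything
    named demo_* is invented for illustration; only struct mmu_notifier,
    struct mmu_notifier_ops and mmu_notifier_register() come from the
    patch, and callbacks left NULL (release, clear_flush_young) are simply
    skipped by the dispatch loops in mm/mmu_notifier.c.

    #include <linux/mm.h>
    #include <linux/mmu_notifier.h>

    /* Tear down the secondary-mmu mapping for one virtual address. */
    static void demo_invalidate_page(struct mmu_notifier *mn,
                                     struct mm_struct *mm,
                                     unsigned long address)
    {
            /* driver-specific: drop the spte/TLB entry for "address" */
    }

    /* Stop secondary-mmu faults on [start, end) and drop its mappings. */
    static void demo_range_start(struct mmu_notifier *mn, struct mm_struct *mm,
                                 unsigned long start, unsigned long end)
    {
    }

    /* Pages are gone by now; re-enable secondary-mmu faults on the range. */
    static void demo_range_end(struct mmu_notifier *mn, struct mm_struct *mm,
                               unsigned long start, unsigned long end)
    {
    }

    static const struct mmu_notifier_ops demo_ops = {
            .invalidate_page        = demo_invalidate_page,
            .invalidate_range_start = demo_range_start,
            .invalidate_range_end   = demo_range_end,
    };

    static struct mmu_notifier demo_mn = { .ops = &demo_ops };

    /*
     * Must be called with no mm locks held: mmu_notifier_register() takes
     * mm_lock()/mm_unlock() internally, so registration cannot race with
     * any running invalidate_range_start/end critical section.
     */
    static void demo_attach(struct mm_struct *mm)
    {
            mmu_notifier_register(&demo_mn, mm);
    }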

  7. Re: [PATCH] mmu notifier #v11

    I am always the guy doing the cleanup after Andrea it seems. Sigh.

    Here is the mm_lock/mm_unlock logic separated out for easier review.
    Adds some comments. Still objectionable is the multiple ways of
    invalidating pages in #v11. Callout now has similar locking to emm.

    From: Christoph Lameter
    Subject: mm_lock: Lock a process against reclaim

    Provide a way to lock an mm_struct against reclaim (try_to_unmap
    etc). This is necessary for the invalidate notifier approaches so
    that they can reliably add and remove a notifier.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Christoph Lameter

    ---
    include/linux/mm.h | 10 ++++++++
    mm/mmap.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++
    2 files changed, 76 insertions(+)

    Index: linux-2.6/include/linux/mm.h
    ===================================================================
    --- linux-2.6.orig/include/linux/mm.h 2008-04-02 11:41:47.741678873 -0700
    +++ linux-2.6/include/linux/mm.h 2008-04-04 15:02:17.660504756 -0700
    @@ -1050,6 +1050,16 @@ extern int install_special_mapping(struc
    unsigned long addr, unsigned long len,
    unsigned long flags, struct page **pages);

    +/*
    + * Locking and unlocking an mm against reclaim.
    + *
    + * mm_lock will take mmap_sem writably (to prevent additional vmas from being
    + * added) and then take all mapping locks of the existing vmas. With that
    + * reclaim is effectively stopped.
    + */
    +extern void mm_lock(struct mm_struct *mm);
    +extern void mm_unlock(struct mm_struct *mm);
    +
    extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);

    extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
    Index: linux-2.6/mm/mmap.c
    ===================================================================
    --- linux-2.6.orig/mm/mmap.c 2008-04-04 14:55:03.477593980 -0700
    +++ linux-2.6/mm/mmap.c 2008-04-04 14:59:05.505395402 -0700
    @@ -2242,3 +2242,69 @@ int install_special_mapping(struct mm_st

    return 0;
    }
    +
    +static void mm_lock_unlock(struct mm_struct *mm, int lock)
    +{
    + struct vm_area_struct *vma;
    + spinlock_t *i_mmap_lock_last, *anon_vma_lock_last;
    +
    + i_mmap_lock_last = NULL;
    + for (;;) {
    + spinlock_t *i_mmap_lock = (spinlock_t *) -1UL;
    + for (vma = mm->mmap; vma; vma = vma->vm_next)
    + if (vma->vm_file && vma->vm_file->f_mapping &&
    + (unsigned long) i_mmap_lock >
    + (unsigned long)
    + &vma->vm_file->f_mapping->i_mmap_lock &&
    + (unsigned long)
    + &vma->vm_file->f_mapping->i_mmap_lock >
    + (unsigned long) i_mmap_lock_last)
    + i_mmap_lock =
    + &vma->vm_file->f_mapping->i_mmap_lock;
    + if (i_mmap_lock == (spinlock_t *) -1UL)
    + break;
    + i_mmap_lock_last = i_mmap_lock;
    + if (lock)
    + spin_lock(i_mmap_lock);
    + else
    + spin_unlock(i_mmap_lock);
    + }
    +
    + anon_vma_lock_last = NULL;
    + for (;;) {
    + spinlock_t *anon_vma_lock = (spinlock_t *) -1UL;
    + for (vma = mm->mmap; vma; vma = vma->vm_next)
    + if (vma->anon_vma &&
    + (unsigned long) anon_vma_lock >
    + (unsigned long) &vma->anon_vma->lock &&
    + (unsigned long) &vma->anon_vma->lock >
    + (unsigned long) anon_vma_lock_last)
    + anon_vma_lock = &vma->anon_vma->lock;
    + if (anon_vma_lock == (spinlock_t *) -1UL)
    + break;
    + anon_vma_lock_last = anon_vma_lock;
    + if (lock)
    + spin_lock(anon_vma_lock);
    + else
    + spin_unlock(anon_vma_lock);
    + }
    +}
    +
    +/*
    + * This operation locks against the VM for all pte/vma/mm related
    + * operations that could ever happen on a certain mm. This includes
    + * vmtruncate, try_to_unmap, and all page faults. The holder
    + * must not hold any mm related lock. A single task can't take more
    + * than one mm lock in a row or it would deadlock.
    + */
    +void mm_lock(struct mm_struct * mm)
    +{
    + down_write(&mm->mmap_sem);
    + mm_lock_unlock(mm, 1);
    +}
    +
    +void mm_unlock(struct mm_struct *mm)
    +{
    + mm_lock_unlock(mm, 0);
    + up_write(&mm->mmap_sem);
    +}


  8. Re: [PATCH] mmu notifier #v11

    On Fri, Apr 04, 2008 at 03:06:18PM -0700, Christoph Lameter wrote:
    > Adds some comments. Still objectionable is the multiple ways of
    > invalidating pages in #v11. Callout now has similar locking to emm.


    range_begin exists because range_end is called after the page has
    already been freed. invalidate_page is called _before_ the page is
    freed but _after_ the pte has been zapped.

    In short when working with single pages it's a waste to block the
    secondary-mmu page fault, because it's zero cost to invalidate_page
    before put_page. Not even GRU need to do that.

    Instead, for the multiple-pte zapping we have to call range_end _after_
    the pages are already freed. This is so that there is a single range_end
    call for a huge amount of address space. So we need a range_begin for
    the subsystems that don't use page pinning, for example. When working with
    single pages (try_to_unmap_one, do_wp_page) invalidate_page avoids
    blocking the secondary mmu page fault, and is in turn faster.

    Besides avoiding the need to serialize the secondary mmu page fault,
    invalidate_page also reduces the overhead when the mmu notifiers are
    disarmed (i.e. kvm not running).
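    Schematically, the two orderings contrasted above follow directly from
    the #v11 patch (an editor's sketch, not part of the thread; the demo_*
    function names are invented, the notifier calls are the ones the patch
    wires into mm/rmap.c and mm/memory.c):

    #include <linux/mm.h>
    #include <linux/mmu_notifier.h>

    /*
     * Single pte (try_to_unmap_one, do_wp_page): the notifier runs after
     * the pte is zapped but before the page is freed, so a driver that
     * does not pin pages need not block its secondary-mmu fault path.
     */
    static void demo_zap_one(struct vm_area_struct *vma, unsigned long address,
                             pte_t *ptep, struct page *page)
    {
            pte_t pteval;

            /* zap the pte and notify: the spte/GRU mapping is gone here... */
            pteval = ptep_clear_flush_notify(vma, address, ptep);
            if (pte_dirty(pteval))
                    set_page_dirty(page);
            put_page(page);         /* ...before the page can get freed */
    }

    /*
     * Many ptes (zap_page_range, unmap_region): pages are freed inside the
     * section, so the secondary mmu must stop faulting at range_start;
     * range_end only re-arms it after everything is gone.
     */
    static void demo_zap_range(struct mm_struct *mm,
                               unsigned long start, unsigned long end)
    {
            mmu_notifier_invalidate_range_start(mm, start, end);
            /* ... zap the ptes and free the pages in [start, end) ... */
            mmu_notifier_invalidate_range_end(mm, start, end);
    }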

  9. Re: [PATCH] mmu notifier #v11

    On Sat, 5 Apr 2008, Andrea Arcangeli wrote:

    > In short when working with single pages it's a waste to block the
    > secondary-mmu page fault, because it's zero cost to invalidate_page
    > before put_page. Not even GRU need to do that.


    That depends on what the notifier is being used for. Some serialization
    with the external mappings has to be done anyways. And its cleaner to have
    one API that does a lock/unlock scheme. Atomic operations can easily lead
    to races.


  10. Re: [PATCH] mmu notifier #v11

    On Sun, Apr 06, 2008 at 10:45:41PM -0700, Christoph Lameter wrote:
    > That depends on what the notifier is being used for. Some serialization
    > with the external mappings has to be done anyways. And its cleaner to have


    As far as I can tell no, you don't need to serialize against the
    secondary mmu page fault in invalidate_page, like you instead have to
    do in range_begin if you don't unpin the pages in range_end.

    > one API that does a lock/unlock scheme. Atomic operations can easily lead
    > to races.


    What races? Note that if you don't want to optimize, XPMEM and GRU can
    feel free to implement their own invalidate_page like this:

    invalidate_page(mm, addr) {
            range_begin(mm, addr, addr+PAGE_SIZE)
            range_end(mm, addr, addr+PAGE_SIZE)
    }

    There's zero risk of adding races if they do this, but I doubt they
    want to run as slow as with EMM so I guess they'll exploit the
    optimization by going lock-free vs the spte page fault in
    invalidate_page.
