
Thread: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

  1. [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

    The invalidation of address ranges in a mm_struct needs to be
    performed when pages are removed or permissions etc. change.
    Most of the VM address space changes can use the range invalidate
    callback.

    invalidate_range() is generally called with mmap_sem held but
    no spinlocks active. If invalidate_range() is called with locks
    held, then we pass a flag into invalidate_range().

    Comments state that mmap_sem must be held for
    remap_pfn_range(), but various drivers do not seem to do this.
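
    The call sites below invoke the notifier as
    mmu_notifier(invalidate_range, mm, start, end, atomic); the callback
    itself is defined in patch 1/6 of this series. A rough sketch of the
    signature the call sites assume (inferred from their arguments, not
    quoted from 1/6):

	/* Sketch only -- inferred from the call sites in this patch. */
	struct mmu_notifier_ops {
		/* Called once the ptes for [start, end) have been
		 * zapped. 'atomic' is nonzero when the caller holds
		 * spinlocks (e.g. i_mmap_lock), in which case the
		 * notifier must not sleep. */
		void (*invalidate_range)(struct mmu_notifier *mn,
					 struct mm_struct *mm,
					 unsigned long start,
					 unsigned long end,
					 int atomic);
	};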

    Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
    Signed-off-by: Robin Holt <holt@sgi.com>
    Signed-off-by: Christoph Lameter <clameter@sgi.com>

    ---
    mm/fremap.c  |    2 ++
    mm/hugetlb.c |    2 ++
    mm/memory.c  |   11 +++++++++--
    mm/mmap.c    |    1 +
    4 files changed, 14 insertions(+), 2 deletions(-)

    Index: linux-2.6/mm/fremap.c
    ===================================================================
    --- linux-2.6.orig/mm/fremap.c 2008-01-25 19:31:05.000000000 -0800
    +++ linux-2.6/mm/fremap.c 2008-01-25 19:32:49.000000000 -0800
    @@ -15,6 +15,7 @@
     #include <linux/rmap.h>
     #include <linux/module.h>
     #include <linux/syscalls.h>
    +#include <linux/mmu_notifier.h>
     
     #include <asm/mmu_context.h>
     #include <asm/cacheflush.h>
    @@ -211,6 +212,7 @@ asmlinkage long sys_remap_file_pages(uns
     		spin_unlock(&mapping->i_mmap_lock);
     	}
     
    +	mmu_notifier(invalidate_range, mm, start, start + size, 0);
     	err = populate_range(mm, vma, start, size, pgoff);
     	if (!err && !(flags & MAP_NONBLOCK)) {
     		if (unlikely(has_write_lock)) {
    Index: linux-2.6/mm/memory.c
    ===================================================================
    --- linux-2.6.orig/mm/memory.c 2008-01-25 19:31:05.000000000 -0800
    +++ linux-2.6/mm/memory.c 2008-01-25 19:32:49.000000000 -0800
    @@ -50,6 +50,7 @@
     #include <linux/delayacct.h>
     #include <linux/init.h>
     #include <linux/writeback.h>
    +#include <linux/mmu_notifier.h>
     
     #include <asm/pgalloc.h>
     #include <asm/uaccess.h>
    @@ -891,6 +892,8 @@ unsigned long zap_page_range(struct vm_a
     	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
     	if (tlb)
     		tlb_finish_mmu(tlb, address, end);
    +	mmu_notifier(invalidate_range, mm, address, end,
    +		(details ? (details->i_mmap_lock != NULL) : 0));
     	return end;
     }

    @@ -1319,7 +1322,7 @@ int remap_pfn_range(struct vm_area_struc
     {
     	pgd_t *pgd;
     	unsigned long next;
    -	unsigned long end = addr + PAGE_ALIGN(size);
    +	unsigned long start = addr, end = addr + PAGE_ALIGN(size);
     	struct mm_struct *mm = vma->vm_mm;
     	int err;

    @@ -1360,6 +1363,7 @@ int remap_pfn_range(struct vm_area_struc
     		if (err)
     			break;
     	} while (pgd++, addr = next, addr != end);
    +	mmu_notifier(invalidate_range, mm, start, end, 0);
     	return err;
     }
     EXPORT_SYMBOL(remap_pfn_range);
    @@ -1443,7 +1447,7 @@ int apply_to_page_range(struct mm_struct
     {
     	pgd_t *pgd;
     	unsigned long next;
    -	unsigned long end = addr + size;
    +	unsigned long start = addr, end = addr + size;
     	int err;
     
     	BUG_ON(addr >= end);
    @@ -1454,6 +1458,7 @@ int apply_to_page_range(struct mm_struct
     		if (err)
     			break;
     	} while (pgd++, addr = next, addr != end);
    +	mmu_notifier(invalidate_range, mm, start, end, 0);
     	return err;
     }
     EXPORT_SYMBOL_GPL(apply_to_page_range);
    @@ -1634,6 +1639,8 @@ gotten:
     	/*
     	 * Re-check the pte - we dropped the lock
     	 */
    +	mmu_notifier(invalidate_range, mm, address,
    +			address + PAGE_SIZE - 1, 0);
     	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
     	if (likely(pte_same(*page_table, orig_pte))) {
     		if (old_page) {
    Index: linux-2.6/mm/mmap.c
    ===================================================================
    --- linux-2.6.orig/mm/mmap.c 2008-01-25 19:31:05.000000000 -0800
    +++ linux-2.6/mm/mmap.c 2008-01-25 19:32:49.000000000 -0800
    @@ -1748,6 +1748,7 @@ static void unmap_region(struct mm_struc
     	free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
     				 next? next->vm_start: 0);
     	tlb_finish_mmu(tlb, start, end);
    +	mmu_notifier(invalidate_range, mm, start, end, 0);
     }

    /*
    Index: linux-2.6/mm/hugetlb.c
    ===================================================================
    --- linux-2.6.orig/mm/hugetlb.c 2008-01-25 19:33:58.000000000 -0800
    +++ linux-2.6/mm/hugetlb.c 2008-01-25 19:34:13.000000000 -0800
    @@ -14,6 +14,7 @@
     #include <linux/mempolicy.h>
     #include <linux/cpuset.h>
     #include <linux/mutex.h>
    +#include <linux/mmu_notifier.h>
     
     #include <asm/page.h>
     #include <asm/pgtable.h>
    @@ -763,6 +764,7 @@ void __unmap_hugepage_range(struct vm_ar
     	}
     	spin_unlock(&mm->page_table_lock);
     	flush_tlb_range(vma, start, end);
    +	mmu_notifier(invalidate_range, mm, start, end, 1);
     	list_for_each_entry_safe(page, tmp, &page_list, lru) {
     		list_del(&page->lru);
     		put_page(page);
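
    A notifier user keys off the last argument to decide whether it may
    sleep. A minimal sketch of a consumer (hypothetical driver code, with
    made-up my_* helpers; not part of this patch):

	static void my_invalidate_range(struct mmu_notifier *mn,
					struct mm_struct *mm,
					unsigned long start,
					unsigned long end,
					int atomic)
	{
		if (atomic)
			/* Caller holds spinlocks (e.g. i_mmap_lock in
			 * the hugetlb path above): must not sleep, so
			 * defer the remote flush. */
			my_queue_remote_flush(mn, start, end);
		else
			/* Sleeping context: flush remote TLBs
			 * synchronously. */
			my_flush_remote_tlbs(mn, start, end);
	}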


  2. Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

    On Tue, Jan 29, 2008 at 04:22:46PM -0800, Christoph Lameter wrote:
    > That is only partially true. ptes are created write-protected in
    > order to track dirty state these days. The first write will lead to
    > a fault that switches the pte to writable. When the page undergoes
    > writeback, the page again becomes write-protected. Thus our need to
    > effectively deal with page_mkclean.


    Well I was talking about anonymous memory.
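
    For context, the write-protect/dirty-tracking cycle described above
    for file-backed pages, sketched as simplified pseudo-code (not actual
    kernel source):

	/* Simplified sketch of dirty accounting for shared file
	 * mappings; illustrative only. */

	/* 1. The pte is installed write-protected even in a writable
	 *    vma, so the first store traps. */
	entry = mk_pte(page, vma->vm_page_prot);	/* !pte_write() */

	/* 2. The first write faults; the wp fault handler marks the
	 *    pte writable and dirty, so later stores don't fault. */
	entry = pte_mkwrite(pte_mkdirty(entry));

	/* 3. During writeback, page_mkclean() write-protects the pte
	 *    again, so the next store faults and re-dirties the page. */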

  3. Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges


    On Wed, 2008-01-30 at 01:59 +0100, Andrea Arcangeli wrote:
    > On Tue, Jan 29, 2008 at 04:22:46PM -0800, Christoph Lameter wrote:
    > > That is only partially true. ptes are created write-protected in
    > > order to track dirty state these days. The first write will lead
    > > to a fault that switches the pte to writable. When the page
    > > undergoes writeback, the page again becomes write-protected. Thus
    > > our need to effectively deal with page_mkclean.

    >
    > Well I was talking about anonymous memory.


    Just to be absolutely clear on this (I lost track of what exactly we
    are talking about here): nonlinear mappings do not do the dirty
    accounting, and are not allowed on a backing store that would require
    dirty accounting.




  4. Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

    On Tue, Jan 29, 2008 at 06:28:05PM -0600, Jack Steiner wrote:
    > On Tue, Jan 29, 2008 at 04:20:50PM -0800, Christoph Lameter wrote:
    > > On Wed, 30 Jan 2008, Andrea Arcangeli wrote:
    > >
    > > > > invalidate_range after populate allows access to memory for which ptes
    > > > > were zapped and the refcount was released.
    > > >
    > > > The last refcount is released by the invalidate_range itself.

    > >
    > > That is true for your implementation and to address Robin's issues. Jack:
    > > Is that true for the GRU?

    >
    > I'm not sure I understand the question. The GRU never (currently) takes
    > a reference on a page. It has no mechanism for tracking pages that
    > were exported to the external TLBs.


    If you don't have a pin, then things like invalidate_range in
    remap_file_pages can't be safe, as writes through the external TLBs
    can keep going on pages in the freelist. For you to be safe w/o a
    page-pin, you need to move in the direction of invalidate_page
    inside ptep_clear_flush (or anyway before
    page_cache_release/__free_page/put_page...). You're generally not
    safe with any invalidate_range that may run after the page pointed
    to by the pte has been freed (or can be freed by the VM at any time
    because it is unpinned cache).
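
    To illustrate the ordering problem (a schematic sketch using this
    series' notifier calls; not code from any posted patch):

	/* Unsafe for a pin-less user like the GRU: the range callout
	 * runs after the page may already be back on the freelist. */
	ptep_clear_flush(vma, address, ptep);	/* zap the pte */
	page_cache_release(page);		/* page may be freed */
	mmu_notifier(invalidate_range, mm, start, end, 0);
	/* ...the external TLB could still have been writing above. */

	/* Safe without a page-pin: drop the external TLB entry before
	 * the page can be freed, via invalidate_page next to
	 * ptep_clear_flush. */
	ptep_clear_flush(vma, address, ptep);
	mmu_notifier(invalidate_page, mm, address);	/* remote TLB gone */
	page_cache_release(page);		/* now safe to free */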

  5. Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

    On Wed, Jan 30, 2008 at 06:04:52PM +0100, Andrea Arcangeli wrote:
    > On Wed, Jan 30, 2008 at 10:11:24AM -0600, Robin Holt wrote:

    ....
    > > The three issues we need to simultaneously solve are revoking the
    > > remote page table/tlb information while still in a sleepable
    > > context and not having the remote faulters become out of sync with
    > > the granting process.

    ....
    > > Could we consider doing a range-based recall and lock callout before
    > > clearing the process's page tables/TLBs, then use the _page or _range
    > > callouts from Andrea's patch to clear the mappings, finally make a
    > > range-based unlock callout. The mmu_notifier user would usually use ops
    > > for either the recall+lock/unlock family of callouts or the _page/_range
    > > family of callouts.

    >
    > invalidate_page/age_page can run inside ptep_clear_flush/young, and
    > Jack will need that too. In fact Jack will also need an
    > invalidate_page inside ptep_get_and_clear. And the range callout
    > will always be done in a sleeping context, and it'll rely on the
    > page-pin to be safe (when details->i_mmap_lock != NULL,
    > invalidate_range shouldn't be called inside zap_page_range but
    > before returning from unmap_mapping_range_vma, before the
    > cond_resched). This will make everything a bit simpler and less
    > prone to breakage IMHO, plus it'll have a chance to work for Jack
    > w/o page-pin, without additional cluttering of mm/*.c.


    I don't think I saw the answer to my original question. I assume your
    original patch, extended in a way similar to what Christoph has done,
    can be made to work to cover both the KVM and GRU (Jack's) cases.

    XPMEM, however, does not look to be solvable due to the three
    simultaneous issues above. To address that, I think I am coming to
    the conclusion that we need an accompanying but separate pair of
    callouts. The first will ensure the remote page tables and TLBs are
    cleared and all page information is returned back to the process that
    is granting access to its address space. That will include an
    implicit block on the address range so no further faults will be
    satisfied by the remote accessor (forgot the KVM name for this,
    sorry). Any faults will be held off, and only the process's page
    tables/TLBs are in play. Once the normal processing of the kernel is
    complete, an unlock callout would be made for the range, and then
    faulting may occur on behalf of the process again.
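
    Sketched concretely, such a pair of callouts might look like the
    following (the names are hypothetical, inferred from the description
    above; no such patch exists yet):

	/* Hypothetical lock/unlock pair; names illustrative only. */
	struct mmu_notifier_ops {
		/* Recall all remote TLB/page-table state for
		 * [start, end) and block remote faults on the range
		 * until the unlock callout below. */
		void (*invalidate_range_begin)(struct mmu_notifier *mn,
					       struct mm_struct *mm,
					       unsigned long start,
					       unsigned long end);

		/* The kernel is done modifying the range; remote
		 * faulting on behalf of the process may resume. */
		void (*invalidate_range_end)(struct mmu_notifier *mn,
					     struct mm_struct *mm,
					     unsigned long start,
					     unsigned long end);
	};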

    Currently, this is the only direct solution that I can see as a
    possibility. My question is twofold: does this seem like a reasonable
    means to solve the three simultaneous issues above, and if so, does
    it seem like the most reasonable means?

    Thanks,
    Robin
