[patch 0/9] [RFC] EMM Notifier V2 - Kernel


Thread: [patch 0/9] [RFC] EMM Notifier V2

  1. Re: EMM: Fix rcu handling and spelling

    On Wed, Apr 02, 2008 at 11:43:02AM -0700, Christoph Lameter wrote:
    > Subject: EMM: Fix rcu handling and spelling
    >
    > Fix the way rcu_dereference is done.


    Acked-by: Paul E. McKenney

    > Signed-off-by: Christoph Lameter
    >
    > ---
    > include/linux/rmap.h | 2 +-
    > mm/rmap.c | 4 ++--
    > 2 files changed, 3 insertions(+), 3 deletions(-)
    >
    > Index: linux-2.6/include/linux/rmap.h
    > ===================================================================
    > --- linux-2.6.orig/include/linux/rmap.h 2008-04-02 11:41:58.737866596 -0700
    > +++ linux-2.6/include/linux/rmap.h 2008-04-02 11:42:08.282029661 -0700
    > @@ -91,7 +91,7 @@ static inline void page_dup_rmap(struct
    > * when the VM removes references to pages.
    > */
    > enum emm_operation {
    > - emm_release, /* Process existing, */
    > + emm_release, /* Process exiting, */
    > emm_invalidate_start, /* Before the VM unmaps pages */
    > emm_invalidate_end, /* After the VM unmapped pages */
    > emm_referenced /* Check if a range was referenced */
    > Index: linux-2.6/mm/rmap.c
    > ===================================================================
    > --- linux-2.6.orig/mm/rmap.c 2008-04-02 11:41:58.737866596 -0700
    > +++ linux-2.6/mm/rmap.c 2008-04-02 11:42:08.282029661 -0700
    > @@ -303,7 +303,7 @@ EXPORT_SYMBOL_GPL(emm_notifier_register)
    > int __emm_notify(struct mm_struct *mm, enum emm_operation op,
    > unsigned long start, unsigned long end)
    > {
    > - struct emm_notifier *e = rcu_dereference(mm)->emm_notifier;
    > + struct emm_notifier *e = rcu_dereference(mm->emm_notifier);
    > int x;
    >
    > while (e) {
    > @@ -317,7 +317,7 @@ int __emm_notify(struct mm_struct *mm, e
    > * emm_notifier contents (e) must be fetched after
    > * the retrival of the pointer to the notifier.
    > */
    > - e = rcu_dereference(e)->next;
    > + e = rcu_dereference(e->next);
    > }
    > return 0;
    > }
    >
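
    The point of the fix: rcu_dereference() must wrap the load of the pointer that
    was published with rcu_assign_pointer() -- here mm->emm_notifier -- not the
    structure containing it. A minimal sketch with hypothetical types
    (example_mm/example_notifier stand in for mm_struct and emm_notifier):

    #include <linux/rcupdate.h>

    struct example_notifier {
            struct example_notifier *next;
    };

    struct example_mm {
            struct example_notifier *notifier;      /* set with rcu_assign_pointer() */
    };

    static struct example_notifier *first_notifier(struct example_mm *m)
    {
            /*
             * Wrong:  rcu_dereference(m)->notifier -- 'm' itself is not the
             *         RCU-protected pointer, so the ->notifier load is unordered.
             * Right:  fetch the RCU-protected member through rcu_dereference().
             */
            return rcu_dereference(m->notifier);
    }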


  2. EMM: Fixup return value handling of emm_notify()

    On Wed, 2 Apr 2008, Christoph Lameter wrote:

    > Here f.e. We can add a special emm_age() function that iterates
    > differently and does the | for you.


    Well, maybe not really necessary. How about this fix? It's likely a problem
    to stop callbacks if one callback returned an error.


    Subject: EMM: Fixup return value handling of emm_notify()

    Right now we stop calling additional subsystems if one callback returned
    an error. That has the potential for causing additional trouble with the
    subsystems that do not receive the callbacks they expect if one has failed.

    So change the handling of error code to continue callbacks to other
    subsystems but return the first error code encountered.

    If a callback returns a positive value then add up the values from all
    the calls. That can be used to establish how many references exist
    (xpmem may want this feature at some point) or to ensure that the
    aging works the way Andrea wants it to (KVM and XPMEM so far do not
    care too much).

    Signed-off-by: Christoph Lameter

    ---
    mm/rmap.c | 28 +++++++++++++++++++++++-----
    1 file changed, 23 insertions(+), 5 deletions(-)

    Index: linux-2.6/mm/rmap.c
    ===================================================================
    --- linux-2.6.orig/mm/rmap.c 2008-04-02 11:46:20.738342852 -0700
    +++ linux-2.6/mm/rmap.c 2008-04-02 12:03:57.672494320 -0700
    @@ -299,27 +299,45 @@ void emm_notifier_register(struct emm_no
    }
    EXPORT_SYMBOL_GPL(emm_notifier_register);

    -/* Perform a callback */
    +/*
    + * Perform a callback
    + *
    + * The return of this function is either a negative error of the first
    + * callback that failed or a consolidated count of all the positive
    + * values that were returned by the callbacks.
    + */
    int __emm_notify(struct mm_struct *mm, enum emm_operation op,
    unsigned long start, unsigned long end)
    {
    struct emm_notifier *e = rcu_dereference(mm->emm_notifier);
    int x;
    + int result = 0;

    while (e) {
    -
    if (e->callback) {
    x = e->callback(e, mm, op, start, end);
    - if (x)
    - return x;
    +
    + /*
    + * Callback may return a positive value to indicate a count
    + * or a negative error code. We keep the first error code
    + * but continue to perform callbacks to other subscribed
    + * subsystems.
    + */
    + if (x && result >= 0) {
    + if (x >= 0)
    + result += x;
    + else
    + result = x;
    + }
    }
    +
    /*
    * emm_notifier contents (e) must be fetched after
    * the retrival of the pointer to the notifier.
    */
    e = rcu_dereference(e->next);
    }
    - return 0;
    + return result;
    }
    EXPORT_SYMBOL_GPL(__emm_notify);
    #endif
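
    With this change a caller of emm_notify() sees either the first negative error
    or the sum of all positive counts. A hedged sketch of how a caller might
    consume that, assuming emm_notify() is the inline wrapper around __emm_notify()
    from the original patchset (sample_range_referenced() is a made-up name):

    /*
     * Hypothetical caller: any positive value means at least one secondary
     * MMU referenced the range; a negative value is the first error and is
     * simply treated as "not referenced" here -- that policy is up to the
     * caller.
     */
    static int sample_range_referenced(struct mm_struct *mm,
                                       unsigned long start, unsigned long end)
    {
            int ret = emm_notify(mm, emm_referenced, start, end);

            return ret > 0;
    }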

  3. EMM: Require single threadedness for registration.

    Here is a patch to require single threaded execution during emm_register.
    This also allows an easy implementation of an unregister function and gets
    rid of the races that Andrea worried about.

    The approach here is similar to what was used in selinux for security
    context changes (see selinux_setprocattr).

    Is it okay for the users of emm to require single threadedness for
    registration?



    Subject: EMM: Require single threaded execution for register and unregister

    We can avoid the concurrency issues arising at registration if we
    only allow registration of notifiers when the process has only a single
    thread. That even allows us to avoid the use of RCU.

    Signed-off-by: Christoph Lameter

    ---
    mm/rmap.c | 46 +++++++++++++++++++++++++++++++++++++---------
    1 file changed, 37 insertions(+), 9 deletions(-)

    Index: linux-2.6/mm/rmap.c
    ===================================================================
    --- linux-2.6.orig/mm/rmap.c 2008-04-02 13:53:46.002473685 -0700
    +++ linux-2.6/mm/rmap.c 2008-04-02 14:03:05.872199896 -0700
    @@ -286,20 +286,48 @@ void emm_notifier_release(struct mm_stru
    }
    }

    -/* Register a notifier */
    +/*
    + * Register a notifier
    + *
    + * mmap_sem is held writably.
    + *
    + * Process must be single threaded.
    + */
    void emm_notifier_register(struct emm_notifier *e, struct mm_struct *mm)
    {
    + BUG_ON(atomic_read(&mm->mm_users) != 1);
    +
    e->next = mm->emm_notifier;
    - /*
    - * The update to emm_notifier (e->next) must be visible
    - * before the pointer becomes visible.
    - * rcu_assign_pointer() does exactly what we need.
    - */
    - rcu_assign_pointer(mm->emm_notifier, e);
    + mm->emm_notifier = e;
    }
    EXPORT_SYMBOL_GPL(emm_notifier_register);

    /*
    + * Unregister a notifier
    + *
    + * mmap_sem is held writably
    + *
    + * Process must be single threaded
    + */
    +void emm_notifier_unregister(struct emm_notifier *e, struct mm_struct *mm)
    +{
    + struct emm_notifier *p = mm->emm_notifier;
    +
    + BUG_ON(atomic_read(&mm->mm_users) != 1);
    +
    + if (e == p)
    + mm->emm_notifier = e->next;
    + else {
    + while (p->next != e)
    + p = p->next;
    +
    + p->next = e->next;
    + }
    + e->callback(e, mm, emm_release, 0, TASK_SIZE);
    +}
    +EXPORT_SYMBOL_GPL(emm_notifier_unregister);
    +
    +/*
    * Perform a callback
    *
    * The return of this function is either a negative error of the first
    @@ -309,7 +337,7 @@ EXPORT_SYMBOL_GPL(emm_notifier_register)
    int __emm_notify(struct mm_struct *mm, enum emm_operation op,
    unsigned long start, unsigned long end)
    {
    - struct emm_notifier *e = rcu_dereference(mm->emm_notifier);
    + struct emm_notifier *e = mm->emm_notifier;
    int x;
    int result = 0;

    @@ -335,7 +363,7 @@ int __emm_notify(struct mm_struct *mm, e
    * emm_notifier contents (e) must be fetched after
    * the retrival of the pointer to the notifier.
    */
    - e = rcu_dereference(e->next);
    + e = e->next;
    }
    return result;
    }
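
    A rough usage sketch under the rules this patch sets up (driver names are made
    up): both calls happen with mmap_sem held for write and while mm_users == 1,
    otherwise the BUG_ON above fires.

    /* Stub callback; a real driver would act on 'op' here. Hypothetical. */
    static int sample_emm_callback(struct emm_notifier *e, struct mm_struct *mm,
                                   enum emm_operation op,
                                   unsigned long start, unsigned long end)
    {
            return 0;
    }

    static struct emm_notifier sample_notifier = {
            .callback = sample_emm_callback,
    };

    static void sample_attach(struct mm_struct *mm)
    {
            down_write(&mm->mmap_sem);
            emm_notifier_register(&sample_notifier, mm);
            up_write(&mm->mmap_sem);
    }

    static void sample_detach(struct mm_struct *mm)
    {
            down_write(&mm->mmap_sem);
            /* also delivers a final emm_release to this notifier */
            emm_notifier_unregister(&sample_notifier, mm);
            up_write(&mm->mmap_sem);
    }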

  4. Re: EMM: Fixup return value handling of emm_notify()

    On Wed, Apr 02, 2008 at 12:03:50PM -0700, Christoph Lameter wrote:
    > + /*
    > + * Callback may return a positive value to indicate a count
    > + * or a negative error code. We keep the first error code
    > + * but continue to perform callbacks to other subscribed
    > + * subsystems.
    > + */
    > + if (x && result >= 0) {
    > + if (x >= 0)
    > + result += x;
    > + else
    > + result = x;
    > + }
    > }
    > +


    Now think of when one of the kernel janitors will micro-optimize
    PG_dirty to be returned by invalidate_page so that a single set_page_dirty
    will be invoked... Keep in mind this is a kernel-internal API; ask
    Greg if we can change it in order to optimize later in the future. I
    think my #v9 is optimal enough while being simple at the same time,
    but anyway it's silly to be hardwired to an interface that, worst
    of all, requires switch statements instead of proper pointers to
    functions and a fixed set of parameters and retval semantics for all
    methods.

  5. Re: EMM: Fixup return value handling of emm_notify()

    On Wed, 2 Apr 2008, Andrea Arcangeli wrote:

    > but anyway it's silly to be hardwired to such an interface that worst
    > of all requires switch statements instead of proper pointer to
    > functions and a fixed set of parameters and retval semantics for all
    > methods.


    The EMM API with a single callback is the simplest approach at this point.
    A common callback for all operations allows the driver to implement common
    entry and exit code as seen in XPMem.

    I guess we can complicate this more by switching to a different API or
    adding additional emm_xxx() callbacks if need be, but I really want to have
    a strong case for why this would be needed. There is the danger of
    adding frills with special callbacks in this and that situation that could
    make the notifier complicated and specific to a certain usage scenario.

    Having this generic simple interface will hopefully avoid such things.
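
    The shape being described -- one callback funnelling every operation through
    shared entry/exit code -- would look roughly like the hypothetical skeleton
    below; all sample_* helpers are invented, only the callback signature comes
    from the EMM patches.

    static int sample_emm_callback(struct emm_notifier *e, struct mm_struct *mm,
                                   enum emm_operation op,
                                   unsigned long start, unsigned long end)
    {
            int ret = 0;

            sample_driver_lock(mm);                 /* common entry code */

            switch (op) {
            case emm_release:
                    sample_teardown(mm);
                    break;
            case emm_invalidate_start:
                    sample_flush_secondary_tlb(mm, start, end);
                    break;
            case emm_invalidate_end:
                    sample_allow_new_faults(mm);
                    break;
            case emm_referenced:
                    ret = sample_range_was_referenced(mm, start, end);
                    break;
            }

            sample_driver_unlock(mm);               /* common exit code */
            return ret;
    }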



  6. Re: [patch 1/9] EMM Notifier: The notifier calls

    On Wed, Apr 02, 2008 at 10:59:50AM -0700, Christoph Lameter wrote:
    > Did I see #v10? Could you start a new subject when you post please? Do
    > not respond to some old message otherwise the threading will be wrong.


    I wasn't clear enough, #v10 was in the works... I was thinking about
    the last two issues before posting it.

    > How exactly does the GRU corrupt memory?


    Jack added synchronize_rcu, I assume for a reason.

    >
    > > Another less obviously safe approach is to allow the register
    > > method to succeed only when mm_users=1 and the task is single
    > > threaded. This way if all the places where the mmu notifers aren't
    > > invoked on the mm not by the current task, are only doing
    > > invalidates after/before zapping ptes, if the istantiation of new
    > > ptes is single threaded too, we shouldn't worry if we miss an
    > > invalidate for a pte that is zero and doesn't point to any physical
    > > page. In the places where current->mm != mm I'm using
    > > invalidate_page 99% of the time, and that only follows the
    > > ptep_clear_flush. The problem are the range_begin that will happen
    > > before zapping the pte in places where current->mm !=
    > > mm. Unfortunately in my incremental patch where I move all
    > > invalidate_page outside of the PT lock to prepare for allowing
    > > sleeping inside the mmu notifiers, I used range_begin/end in places
    > > like try_to_unmap_cluster where current->mm != mm. In general
    > > this solution looks more fragile than the seqlock.

    >
    > Hmmm... Okay that is one solution that would just require a BUG_ON in the
    > registration methods.


    Perhaps you didn't notice that this solution can't work if you call
    range_begin/end not in the "current" context and try_to_unmap_cluster
    does exactly that for both my patchset and yours. Missing an _end is
    ok, missing a _begin is never ok.

    > Well doesn't the requirement of just one execution thread also deal with
    > that issue?


    Yes, except again it can't work for try_to_unmap_cluster.

    This solution is only applicable to #v10 if I fix try_to_unmap_cluster
    to only call invalidate_page (relying on the fact that the VM holds a pin
    and a lock on any page that is being mmu-notifier-invalidated).

    You can't use the single threaded approach to solve either 1 or 2,
    because your _begin call is called anywhere and that's where you call
    the secondary-tlb flush and it's fatal to miss it.

    invalidate_page is always called after, so it enforces the tlb flush
    to be called _after_, and so it's inherently safe.

  7. Re: [patch 5/9] Convert anon_vma lock to rw_sem and refcount

    On Wed, Apr 02, 2008 at 11:15:26AM -0700, Christoph Lameter wrote:
    > On Wed, 2 Apr 2008, Andrea Arcangeli wrote:
    >
    > > On Tue, Apr 01, 2008 at 01:55:36PM -0700, Christoph Lameter wrote:
    > > > This results in f.e. the Aim9 brk performance test to got down by 10-15%.

    > >
    > > I guess it's more likely because of overscheduling for small critical
    > > sections, did you count the total number of context switches? I
    > > guess there will be a lot more with your patch applied. That
    > > regression is a showstopper and it is the reason why I've suggested
    > > before to add a CONFIG_XPMEM or CONFIG_MMU_NOTIFIER_SLEEP config
    > > option to make the VM locks sleep capable only when XPMEM=y
    > > (PREEMPT_RT will enable it too). Thanks for doing the benchmark work!

    >
    > There are more context switches if locks are contended.
    >
    > But that has actually also some good aspects because we avoid busy loops
    > and can potentially continue work in another process.


    That would be the case if the "wait time" were longer than the
    scheduling time. The whole point is that with anon-vma the write side
    is so fast it's likely never worth scheduling (probably not even with
    preempt-rt for the write side; the read side is an entirely different
    matter, but the read side can run concurrently if the system is heavily
    paging), hence the slowdown. What you benchmarked is the write side,
    which is also the fast path when the system is heavily CPU bound. I've
    to say aim is a great benchmark to test this regression.

    But I think a config option will solve all of this.

  8. Re: [patch 1/9] EMM Notifier: The notifier calls

    On Wed, 2 Apr 2008, Andrea Arcangeli wrote:

    > > Hmmm... Okay that is one solution that would just require a BUG_ON in the
    > > registration methods.

    >
    > Perhaps you didn't notice that this solution can't work if you call
    > range_begin/end not in the "current" context and try_to_unmap_cluster
    > does exactly that for both my patchset and yours. Missing an _end is
    > ok, missing a _begin is never ok.


    If you look at the patch you will see a requirement of holding a
    writelock on mmap_sem which will keep out get_user_pages().

  9. Re: [patch 5/9] Convert anon_vma lock to rw_sem and refcount

    On Wed, 2 Apr 2008, Andrea Arcangeli wrote:

    > paging), hence the slowdown. What you benchmarked is the write side,
    > which is also the fast path when the system is heavily CPU bound. I've
    > to say aim is a great benchmark to test this regression.


    I am a bit surprised that brk performance is that important. There may be
    other measurements that have to be made to assess how this would impact a
    real load.




  10. Re: EMM: Require single threadedness for registration.

    On Wed, Apr 02, 2008 at 02:05:28PM -0700, Christoph Lameter wrote:
    > Here is a patch to require single threaded execution during emm_register.
    > This also allows an easy implementation of an unregister function and gets
    > rid of the races that Andrea worried about.


    That would work for #v10 if I remove the invalidate_range_start from
    try_to_unmap_cluster, it can't work for EMM because you've
    emm_invalidate_start firing anywhere outside the context of the
    current task (even regular rmap code, not just nonlinear corner case
    will trigger the race). In short the single threaded approach would be
    workable only thanks to the fact #v10 has the notion of
    invalidate_page for flushing the tlb _after_ and to avoid blocking the
    secondary page fault during swapping. In the kvm case I don't want to
    block the page fault for anything but madvise, which is strictly only
    used after the guest inflated the balloon; the existence of
    invalidate_page allows that optimization, and allows us not to serialize
    against the kvm page fault during regular page faults, when
    invalidate_page is called while the page is pinned by the VM.

    The requirement for invalidate_page is that the pte and linux tlb are
    flushed _before_ and the page is freed _after_ the invalidate_page
    method. that's not the case for _begin/_end. The page is freed well
    before _end runs, hence the need of _begin and to block the secondary
    mmu page fault during the vma-mangling operations.

    #v10 takes care of all this, and although I could perhaps fix the
    remaining two issues using the single-threaded enforcement I
    suggested, I preferred to go safe and spend an unsigned per-mm in case
    anybody needs to attach at runtime; the single-threaded restriction
    didn't look very clean.
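
    The ordering being described for invalidate_page can be sketched as follows
    (hypothetical sample_* helpers; this is not literal try_to_unmap code): the
    primary pte and TLB are flushed before the callback and the page is only
    freed afterwards, which _begin/_end cannot promise.

    static void sample_unmap_one(struct vm_area_struct *vma, unsigned long addr,
                                 pte_t *ptep, struct page *page)
    {
            /* 1. primary pte and linux TLB are gone first */
            ptep_clear_flush(vma, addr, ptep);

            /* 2. secondary MMU is invalidated while the VM still holds the
             *    page, so the callback operates on a stable page */
            sample_notify_invalidate_page(vma->vm_mm, addr);

            /* 3. only now is the reference dropped and the page allowed
             *    to be freed */
            sample_drop_page_reference(page);
    }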

  11. Re: EMM: Require single threadedness for registration.

    On Thu, 3 Apr 2008, Andrea Arcangeli wrote:

    > That would work for #v10 if I remove the invalidate_range_start from
    > try_to_unmap_cluster, it can't work for EMM because you've
    > emm_invalidate_start firing anywhere outside the context of the
    > current task (even regular rmap code, not just nonlinear corner case
    > will trigger the race). In short the single threaded approach would be


    But in that case it will be firing for a callback to another mm_struct.
    The notifiers are bound to mm_structs and keep separate contexts.

    > The requirement for invalidate_page is that the pte and linux tlb are
    > flushed _before_ and the page is freed _after_ the invalidate_page
    > method. that's not the case for _begin/_end. The page is freed well
    > before _end runs, hence the need of _begin and to block the secondary
    > mmu page fault during the vma-mangling operations.


    You could flush in _begin and free on _end? I thought you are taking a
    refcount on the page? You can drop the refcount only on _end to ensure
    that the page does not go away before.
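
    What is being suggested here, sketched with hypothetical sample_* helpers: pin
    the pages in _begin, flush, and only drop the pins in _end, so the pages
    cannot be freed in between even though the VM drops its own reference earlier.

    static void sample_invalidate_start(struct mm_struct *mm,
                                        unsigned long start, unsigned long end)
    {
            sample_pin_range(mm, start, end);       /* get_page() on each mapped page */
            sample_flush_secondary_tlb(mm, start, end);
    }

    static void sample_invalidate_end(struct mm_struct *mm,
                                      unsigned long start, unsigned long end)
    {
            sample_unpin_range(mm, start, end);     /* put_page(); pages may now be freed */
    }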


  12. Re: EMM: Require single threadedness for registration.

    On Wed, Apr 02, 2008 at 03:06:19PM -0700, Christoph Lameter wrote:
    > On Thu, 3 Apr 2008, Andrea Arcangeli wrote:
    >
    > > That would work for #v10 if I remove the invalidate_range_start from
    > > try_to_unmap_cluster, it can't work for EMM because you've
    > > emm_invalidate_start firing anywhere outside the context of the
    > > current task (even regular rmap code, not just nonlinear corner case
    > > will trigger the race). In short the single threaded approach would be

    >
    > But in that case it will be firing for a callback to another mm_struct.
    > The notifiers are bound to mm_structs and keep separate contexts.


    Why can't it fire on the mm_struct where GRU just registered? That
    mm_struct existed way before GRU registered, and VM is free to unmap
    it w/o mmap_sem if there was any memory pressure.

    > You could flush in _begin and free on _end? I thought you are taking a
    > refcount on the page? You can drop the refcount only on _end to ensure
    > that the page does not go away before.


    we're going to lock + flush on _begin and unlock on _end w/o
    refcounting, to micro-optimize. Freeing is done by
    unmap_vmas/madvise/munmap at will. That's a very slow path; inflating
    the balloon is not problematic. But invalidate_page allows us to avoid
    blocking page faults during swapping, so minor faults can happen and
    refresh the pte young bits etc... When the VM unmaps the page while
    holding the page pin, there's no race, and that's where invalidate_page
    is being used to generate lower invalidation overhead.

  13. Re: [patch 5/9] Convert anon_vma lock to rw_sem and refcount

    On Wed, Apr 02, 2008 at 02:56:25PM -0700, Christoph Lameter wrote:
    > I am a bit surprised that brk performance is that important. There may be


    I think it's not brk but fork that is being slowed down; did you
    oprofile? AIM forks a lot... The write-side fast path generating the
    overscheduling, I guess, is when the new vmas are created for the child
    and queued in the parent anon-vma in O(1), so immediate that even
    preempt-rt would be ok with it spinning and not scheduling; it's just
    a list_add (much faster than schedule() indeed). Every time there's a
    collision because multiple children fork simultaneously and they all try
    to queue in the same anon-vma, things will slow down.

  14. Re: [patch 1/9] EMM Notifier: The notifier calls

    On Wed, Apr 02, 2008 at 02:54:52PM -0700, Christoph Lameter wrote:
    > On Wed, 2 Apr 2008, Andrea Arcangeli wrote:
    >
    > > > Hmmm... Okay that is one solution that would just require a BUG_ON in the
    > > > registration methods.

    > >
    > > Perhaps you didn't notice that this solution can't work if you call
    > > range_begin/end not in the "current" context and try_to_unmap_cluster
    > > does exactly that for both my patchset and yours. Missing an _end is
    > > ok, missing a _begin is never ok.

    >
    > If you look at the patch you will see a requirement of holding a
    > writelock on mmap_sem which will keep out get_user_pages().


    I said try_to_unmap_cluster, not get_user_pages.

    CPU0                                    CPU1
    try_to_unmap_cluster:
      emm_invalidate_start in EMM (or
        mmu_notifier_invalidate_range_start in #v10)
      walking the list by hand in EMM
        (or with hlist cleaner in #v10)
      xpmem method invoked
      schedule for a long while inside
        invalidate_range_start while skbs are sent
                                            gru registers
                                            synchronize_rcu (sorry useless now)
                                            single threaded, so taking a page fault
                                            secondary tlb instantiated
      xpmem method returns
      end of the list (didn't notice that it
        has to restart to flush the gru)
      zap pte
      free the page
                                            gru corrupts memory

    CPU 1 was single threaded, CPU0 doesn't hold any mmap_sem or any other
    lock that could ever serialize against the GRU as far as I can tell.

    In general my #v10 solution mixing seqlock + rcu looks more robust and
    allows multithreaded attachment of mmu notifiers as well. I could have
    fixed it with the single-threaded approach, thanks to the fact that the
    only place outside the mm->mmap_sem is try_to_unmap_cluster for me, but
    it wasn't simple to convert, nor worth it, given nonlinear isn't worth
    optimizing for (not even the core VM cares about try_to_unmap_cluster,
    which is in fact the only place in the VM with O(N) complexity for
    each try_to_unmap call, where N is the size of the mapping divided by
    page_size).

  15. Re: EMM: Require single threadedness for registration.

    On Thu, 3 Apr 2008, Andrea Arcangeli wrote:

    > Why can't it fire on the mm_struct where GRU just registered? That
    > mm_struct existed way before GRU registered, and VM is free to unmap
    > it w/o mmap_sem if there was any memory pressure.


    Right. Hmmm... Bad situation. We would have invalidate_start take
    a lock to prevent registration until _end has run.

    We could use stop_machine_run to register the notifier.... ;-).


  16. Re: [patch 1/9] EMM Notifier: The notifier calls

    On Thu, 3 Apr 2008, Andrea Arcangeli wrote:

    > I said try_to_unmap_cluster, not get_user_pages.
    >
    > CPU0 CPU1
    > try_to_unmap_cluster:
    > emm_invalidate_start in EMM (or mmu_notifier_invalidate_range_start in #v10)
    > walking the list by hand in EMM (or with hlist cleaner in #v10)
    > xpmem method invoked
    > schedule for a long while inside invalidate_range_start while skbs are sent
    > gru registers
    > synchronize_rcu (sorry useless now)


    All of this would be much easier if you could stop the drivel. The sync
    rcu was for an earlier release of the mmu notifier. Why the sniping?

    > single threaded, so taking a page fault
    > secondary tlb instantiated


    The driver must not allow faults to occur between start and end. The
    trouble here is that GRU and xpmem are mixed. If CPU0 had been
    running GRU instead of XPMEM then the fault would not have occurred,
    because the gru would have noticed that a range op is active. If both
    systems had run xpmem then the same would have worked.

    I guess this means that an address space cannot reliably be registered to
    multiple subsystems if some of those do not take a refcount. If all
    drivers were required to take a refcount then this would also not
    occur.

    > In general my #v10 solution mixing seqlock + rcu looks more robust and
    > allows multithreaded attachment of mmu notifiers as well. I could have


    Well, it's easy to say that if no one else has looked at it yet. I expressed
    some concerns in reply to your post of #v10.


  17. EMM: disable other notifiers before register and unregister

    Ok, let's forget about the single threaded thing to solve the registration
    races. As Andrea pointed out this still has issues with other subscribed
    subsystems (and also try_to_unmap). We could do something like what
    stop_machine_run does: first disable all running subsystems before
    registering a new one.

    Maybe this is a possible solution.


    Subject: EMM: disable other notifiers before register and unregister

    As Andrea has pointed out: There are races during registration if other
    subsystem notifiers are active while we register a callback.

    Solve that issue by adding two new notifiers:

    emm_stop
    Stops the notifier operations. Notifier must block on
    invalidate_start and emm_referenced from this point on.
    If an invalidate_start has not been completed by a call
    to invalidate_end then the driver must wait until the
    operation is complete before returning.

    emm_start
    Restart notifier operations.

    Before registration all other subscribed subsystems are stopped.
    Then the new subsystem is subscribed and things can get running
    without consistency issues.

    Subsystems are restarted after the lists have been updated.

    This also works for unregistering. If we can get all subsystems
    to stop then we can also reliably unregister a subsystem. So
    provide that callback.

    Signed-off-by: Christoph Lameter

    ---
    include/linux/rmap.h | 10 +++++++---
    mm/rmap.c | 30 ++++++++++++++++++++++++++++++
    2 files changed, 37 insertions(+), 3 deletions(-)

    Index: linux-2.6/include/linux/rmap.h
    ===================================================================
    --- linux-2.6.orig/include/linux/rmap.h 2008-04-02 18:16:07.906032549 -0700
    +++ linux-2.6/include/linux/rmap.h 2008-04-02 18:17:10.291070009 -0700
    @@ -94,7 +94,9 @@ enum emm_operation {
    emm_release, /* Process exiting, */
    emm_invalidate_start, /* Before the VM unmaps pages */
    emm_invalidate_end, /* After the VM unmapped pages */
    - emm_referenced /* Check if a range was referenced */
    + emm_referenced, /* Check if a range was referenced */
    + emm_stop, /* Halt all faults/invalidate_starts */
    + emm_start, /* Restart operations */
    };

    struct emm_notifier {
    @@ -126,13 +128,15 @@ static inline int emm_notify(struct mm_s

    /*
    * Register a notifier with an mm struct. Release occurs when the process
    - * terminates by calling the notifier function with emm_release.
    + * terminates by calling the notifier function with emm_release or when
    + * emm_notifier_unregister is called.
    *
    * Must hold the mmap_sem for write.
    */
    extern void emm_notifier_register(struct emm_notifier *e,
    struct mm_struct *mm);
    -
    +extern void emm_notifier_unregister(struct emm_notifier *e,
    + struct mm_struct *mm);

    /*
    * Called from mm/vmscan.c to handle paging out
    Index: linux-2.6/mm/rmap.c
    ===================================================================
    --- linux-2.6.orig/mm/rmap.c 2008-04-02 18:16:09.378057062 -0700
    +++ linux-2.6/mm/rmap.c 2008-04-02 18:16:10.710079201 -0700
    @@ -289,16 +289,46 @@ void emm_notifier_release(struct mm_stru
    /* Register a notifier */
    void emm_notifier_register(struct emm_notifier *e, struct mm_struct *mm)
    {
    + /* Bring all other notifiers into a quiescent state */
    + emm_notify(mm, emm_stop, 0, TASK_SIZE);
    +
    e->next = mm->emm_notifier;
    +
    /*
    * The update to emm_notifier (e->next) must be visible
    * before the pointer becomes visible.
    * rcu_assign_pointer() does exactly what we need.
    */
    rcu_assign_pointer(mm->emm_notifier, e);
    +
    + /* Continue notifiers */
    + emm_notify(mm, emm_start, 0, TASK_SIZE);
    }
    EXPORT_SYMBOL_GPL(emm_notifier_register);

    +/* Unregister a notifier */
    +void emm_notifier_unregister(struct emm_notifier *e, struct mm_struct *mm)
    +{
    + struct emm_notifier *p;
    +
    + emm_notify(mm, emm_stop, 0, TASK_SIZE);
    +
    + p = mm->emm_notifier;
    + if (e == p)
    + mm->emm_notifier = e->next;
    + else {
    + while (p->next != e)
    + p = p->next;
    +
    + p->next = e->next;
    + }
    + e->next = mm->emm_notifier;
    +
    + emm_notify(mm, emm_start, 0, TASK_SIZE);
    + e->callback(e, mm, emm_release, 0, TASK_SIZE);
    +}
    +EXPORT_SYMBOL_GPL(emm_notifier_unregister);
    +
    /*
    * Perform a callback
    *


  18. Re: EMM: disable other notifiers before register and unregister

    On Wed, 2008-04-02 at 18:24 -0700, Christoph Lameter wrote:
    > Ok, let's forget about the single threaded thing to solve the registration
    > races. As Andrea pointed out this still has issues with other subscribed
    > subsystems (and also try_to_unmap). We could do something like what
    > stop_machine_run does: first disable all running subsystems before
    > registering a new one.
    >
    > Maybe this is a possible solution.
    >
    >
    > Subject: EMM: disable other notifiers before register and unregister
    >
    > As Andrea has pointed out: There are races during registration if other
    > subsystem notifiers are active while we register a callback.
    >
    > Solve that issue by adding two new notifiers:
    >
    > emm_stop
    > Stops the notifier operations. Notifier must block on
    > invalidate_start and emm_referenced from this point on.
    > If an invalidate_start has not been completed by a call
    > to invalidate_end then the driver must wait until the
    > operation is complete before returning.
    >
    > emm_start
    > Restart notifier operations.


    Please use pause and resume or something like that. stop-start is an
    unnatural order; we usually start before we stop, whereas we pause first
    and resume later.

    > Before registration all other subscribed subsystems are stopped.
    > Then the new subsystem is subscribed and things can get running
    > without consistency issues.
    >
    > Subsystems are restarted after the lists have been updated.
    >
    > This also works for unregistering. If we can get all subsystems
    > to stop then we can also reliably unregister a subsystem. So
    > provide that callback.
    >
    > Signed-off-by: Christoph Lameter
    >
    > ---
    > include/linux/rmap.h | 10 +++++++---
    > mm/rmap.c | 30 ++++++++++++++++++++++++++++++
    > 2 files changed, 37 insertions(+), 3 deletions(-)
    >
    > Index: linux-2.6/include/linux/rmap.h
    > ===================================================================
    > --- linux-2.6.orig/include/linux/rmap.h 2008-04-02 18:16:07.906032549 -0700
    > +++ linux-2.6/include/linux/rmap.h 2008-04-02 18:17:10.291070009 -0700
    > @@ -94,7 +94,9 @@ enum emm_operation {
    > emm_release, /* Process exiting, */
    > emm_invalidate_start, /* Before the VM unmaps pages */
    > emm_invalidate_end, /* After the VM unmapped pages */
    > - emm_referenced /* Check if a range was referenced */
    > + emm_referenced, /* Check if a range was referenced */
    > + emm_stop, /* Halt all faults/invalidate_starts */
    > + emm_start, /* Restart operations */
    > };
    >
    > struct emm_notifier {
    > @@ -126,13 +128,15 @@ static inline int emm_notify(struct mm_s
    >
    > /*
    > * Register a notifier with an mm struct. Release occurs when the process
    > - * terminates by calling the notifier function with emm_release.
    > + * terminates by calling the notifier function with emm_release or when
    > + * emm_notifier_unregister is called.
    > *
    > * Must hold the mmap_sem for write.
    > */
    > extern void emm_notifier_register(struct emm_notifier *e,
    > struct mm_struct *mm);
    > -
    > +extern void emm_notifier_unregister(struct emm_notifier *e,
    > + struct mm_struct *mm);
    >
    > /*
    > * Called from mm/vmscan.c to handle paging out
    > Index: linux-2.6/mm/rmap.c
    > ===================================================================
    > --- linux-2.6.orig/mm/rmap.c 2008-04-02 18:16:09.378057062 -0700
    > +++ linux-2.6/mm/rmap.c 2008-04-02 18:16:10.710079201 -0700
    > @@ -289,16 +289,46 @@ void emm_notifier_release(struct mm_stru
    > /* Register a notifier */
    > void emm_notifier_register(struct emm_notifier *e, struct mm_struct *mm)
    > {
    > + /* Bring all other notifiers into a quiescent state */
    > + emm_notify(mm, emm_stop, 0, TASK_SIZE);
    > +
    > e->next = mm->emm_notifier;
    > +
    > /*
    > * The update to emm_notifier (e->next) must be visible
    > * before the pointer becomes visible.
    > * rcu_assign_pointer() does exactly what we need.
    > */
    > rcu_assign_pointer(mm->emm_notifier, e);
    > +
    > + /* Continue notifiers */
    > + emm_notify(mm, emm_start, 0, TASK_SIZE);
    > }
    > EXPORT_SYMBOL_GPL(emm_notifier_register);
    >
    > +/* Unregister a notifier */
    > +void emm_notifier_unregister(struct emm_notifier *e, struct mm_struct *mm)
    > +{
    > + struct emm_notifier *p;
    > +
    > + emm_notify(mm, emm_stop, 0, TASK_SIZE);
    > +
    > + p = mm->emm_notifier;
    > + if (e == p)
    > + mm->emm_notifier = e->next;
    > + else {
    > + while (p->next != e)
    > + p = p->next;
    > +
    > + p->next = e->next;
    > + }
    > + e->next = mm->emm_notifier;
    > +
    > + emm_notify(mm, emm_start, 0, TASK_SIZE);
    > + e->callback(e, mm, emm_release, 0, TASK_SIZE);
    > +}
    > +EXPORT_SYMBOL_GPL(emm_notifier_unregister);
    > +
    > /*
    > * Perform a callback
    > *
    >



  19. Re: EMM: Fixup return value handling of emm_notify()

    On Wed, 2008-04-02 at 14:33 -0700, Christoph Lameter wrote:
    > On Wed, 2 Apr 2008, Andrea Arcangeli wrote:
    >
    > > but anyway it's silly to be hardwired to such an interface that worst
    > > of all requires switch statements instead of proper pointer to
    > > functions and a fixed set of parameters and retval semantics for all
    > > methods.

    >
    > The EMM API with a single callback is the simplest approach at this point.
    > A common callback for all operations allows the driver to implement common
    > entry and exit code as seen in XPMem.


    It seems to me that common code can be shared using functions? No need
    to stuff everything into a single function. We have method vectors all
    over the kernel, we could do a_ops as a single callback too, but we
    don't.

    FWIW I prefer separate methods.

    > I guess we can complicate this more by switching to a different API or
    > adding additional emm_xxx() callback if need be but I really want to have
    > a strong case for why this would be needed. There is the danger of
    > adding frills with special callbacks in this and that situation that could
    > make the notifier complicated and specific to a certain usage scenario.
    >
    > Having this generic simple interface will hopefully avoid such things.
    >
    >



  20. Re: EMM: Fixup return value handling of emm_notify()

    On Thu, Apr 03, 2008 at 12:40:46PM +0200, Peter Zijlstra wrote:
    > It seems to me that common code can be shared using functions? No need
    > FWIW I prefer separate methods.


    The kvm patch using mmu notifiers shares 99% of the code between the
    two different methods it implements, indeed. Code sharing is the same, and
    if anything pointers to functions will be faster if gcc isn't smart or
    can't create a compile-time hash to jump to the right address
    without having to check every case.
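
    The alternative being argued for here is the usual kernel method-vector style:
    a struct of function pointers, one per operation, called directly instead of
    switched on. A sketch follows; the names are illustrative, not an existing
    API (the style mirrors mmu_notifier_ops rather than EMM's single callback).

    struct sample_notifier;

    struct sample_notifier_ops {
            void (*release)(struct sample_notifier *n, struct mm_struct *mm);
            int  (*referenced)(struct sample_notifier *n, struct mm_struct *mm,
                               unsigned long start, unsigned long end);
            void (*invalidate_range_start)(struct sample_notifier *n,
                                           struct mm_struct *mm,
                                           unsigned long start, unsigned long end);
            void (*invalidate_range_end)(struct sample_notifier *n,
                                         struct mm_struct *mm,
                                         unsigned long start, unsigned long end);
    };

    struct sample_notifier {
            const struct sample_notifier_ops *ops;
            struct sample_notifier *next;
    };

    /* call site: a direct indirect call, no switch on an operation code */
    static void sample_notify_range_start(struct sample_notifier *n,
                                          struct mm_struct *mm,
                                          unsigned long start, unsigned long end)
    {
            if (n->ops->invalidate_range_start)
                    n->ops->invalidate_range_start(n, mm, start, end);
    }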
