
Thread: [PATCH 3/3] wait_task_inactive: don't use the dummy version when !SMP && PREEMPT

  1. [PATCH 3/3] wait_task_inactive: don't use the dummy version when !SMP && PREEMPT

    wait_task_inactive() must return success only when/if the task leaves
    the runqueue, it doesn't work correctly when !SMP && PREEMPT because
    in that case we use the "dummy" version which doesn't check .on_rq.

    Change the "#if" around the full-blown version from SMP to SMP || PREEMPT.

    The patch looks monstrous because it moves the (unchanged) definition
    of wait_task_inactive() outside of "#ifdef CONFIG_SMP", but it is quite
    trivial.

    Signed-off-by: Oleg Nesterov

    include/linux/sched.h | 2
    kernel/sched.c | 210 +++++++++++++++++++++++++-------------------------
    2 files changed, 107 insertions(+), 105 deletions(-)

    --- 27/include/linux/sched.h~3_WTI_PREEMPT 2008-07-30 20:28:31.000000000 +0400
    +++ 27/include/linux/sched.h 2008-07-30 20:47:46.000000000 +0400
    @@ -1874,7 +1874,7 @@ struct task_struct *fork_idle(int);
    extern void set_task_comm(struct task_struct *tsk, char *from);
    extern char *get_task_comm(char *to, struct task_struct *tsk);

    -#ifdef CONFIG_SMP
    +#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
    extern unsigned long wait_task_inactive(struct task_struct *, long match_state);
    #else
    static inline unsigned long wait_task_inactive(struct task_struct *p,
    --- 27/kernel/sched.c~3_WTI_PREEMPT 2008-07-30 19:47:56.000000000 +0400
    +++ 27/kernel/sched.c 2008-07-30 20:43:41.000000000 +0400
    @@ -1864,110 +1864,6 @@ migrate_task(struct task_struct *p, int
    return 1;
    }

    -/*
    - * wait_task_inactive - wait for a thread to unschedule.
    - *
    - * If @match_state is nonzero, it's the @p->state value just checked and
    - * not expected to change. If it changes, i.e. @p might have woken up,
    - * then return zero. When we succeed in waiting for @p to be off its CPU,
    - * we return a positive number (its total switch count). If a second call
    - * a short while later returns the same number, the caller can be sure that
    - * @p has remained unscheduled the whole time.
    - *
    - * The caller must ensure that the task *will* unschedule sometime soon,
    - * else this function might spin for a *long* time. This function can't
    - * be called with interrupts off, or it may introduce deadlock with
    - * smp_call_function() if an IPI is sent by the same process we are
    - * waiting to become inactive.
    - */
    -unsigned long wait_task_inactive(struct task_struct *p, long match_state)
    -{
    - unsigned long flags;
    - int running, on_rq;
    - unsigned long ncsw;
    - struct rq *rq;
    -
    - for (;;) {
    - /*
    - * We do the initial early heuristics without holding
    - * any task-queue locks at all. We'll only try to get
    - * the runqueue lock when things look like they will
    - * work out!
    - */
    - rq = task_rq(p);
    -
    - /*
    - * If the task is actively running on another CPU
    - * still, just relax and busy-wait without holding
    - * any locks.
    - *
    - * NOTE! Since we don't hold any locks, it's not
    - * even sure that "rq" stays as the right runqueue!
    - * But we don't care, since "task_running()" will
    - * return false if the runqueue has changed and p
    - * is actually now running somewhere else!
    - */
    - while (task_running(rq, p)) {
    - if (match_state && unlikely(p->state != match_state))
    - return 0;
    - cpu_relax();
    - }
    -
    - /*
    - * Ok, time to look more closely! We need the rq
    - * lock now, to be *sure*. If we're wrong, we'll
    - * just go back and repeat.
    - */
    - rq = task_rq_lock(p, &flags);
    - running = task_running(rq, p);
    - on_rq = p->se.on_rq;
    - ncsw = 0;
    - if (!match_state || p->state == match_state)
    - ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
    - task_rq_unlock(rq, &flags);
    -
    - /*
    - * If it changed from the expected state, bail out now.
    - */
    - if (unlikely(!ncsw))
    - break;
    -
    - /*
    - * Was it really running after all now that we
    - * checked with the proper locks actually held?
    - *
    - * Oops. Go back and try again..
    - */
    - if (unlikely(running)) {
    - cpu_relax();
    - continue;
    - }
    -
    - /*
    - * It's not enough that it's not actively running,
    - * it must be off the runqueue _entirely_, and not
    - * preempted!
    - *
    - * So if it was still runnable (but just not actively
    - * running right now), it's preempted, and we should
    - * yield - it could be a while.
    - */
    - if (unlikely(on_rq)) {
    - schedule_timeout_uninterruptible(1);
    - continue;
    - }
    -
    - /*
    - * Ahh, all good. It wasn't running, and it wasn't
    - * runnable, which means that it will never become
    - * running in the future either. We're all done!
    - */
    - break;
    - }
    -
    - return ncsw;
    -}
    -
    /***
    * kick_process - kick a running thread to enter/exit the kernel
    * @p: the to-be-kicked thread
    @@ -2176,6 +2072,112 @@ static int sched_balance_self(int cpu, i

    #endif /* CONFIG_SMP */

    +#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
    +/*
    + * wait_task_inactive - wait for a thread to unschedule.
    + *
    + * If @match_state is nonzero, it's the @p->state value just checked and
    + * not expected to change. If it changes, i.e. @p might have woken up,
    + * then return zero. When we succeed in waiting for @p to be off its CPU,
    + * we return a positive number (its total switch count). If a second call
    + * a short while later returns the same number, the caller can be sure that
    + * @p has remained unscheduled the whole time.
    + *
    + * The caller must ensure that the task *will* unschedule sometime soon,
    + * else this function might spin for a *long* time. This function can't
    + * be called with interrupts off, or it may introduce deadlock with
    + * smp_call_function() if an IPI is sent by the same process we are
    + * waiting to become inactive.
    + */
    +unsigned long wait_task_inactive(struct task_struct *p, long match_state)
    +{
    + unsigned long flags;
    + int running, on_rq;
    + unsigned long ncsw;
    + struct rq *rq;
    +
    + for (;;) {
    + /*
    + * We do the initial early heuristics without holding
    + * any task-queue locks at all. We'll only try to get
    + * the runqueue lock when things look like they will
    + * work out!
    + */
    + rq = task_rq(p);
    +
    + /*
    + * If the task is actively running on another CPU
    + * still, just relax and busy-wait without holding
    + * any locks.
    + *
    + * NOTE! Since we don't hold any locks, it's not
    + * even sure that "rq" stays as the right runqueue!
    + * But we don't care, since "task_running()" will
    + * return false if the runqueue has changed and p
    + * is actually now running somewhere else!
    + */
    + while (task_running(rq, p)) {
    + if (match_state && unlikely(p->state != match_state))
    + return 0;
    + cpu_relax();
    + }
    +
    + /*
    + * Ok, time to look more closely! We need the rq
    + * lock now, to be *sure*. If we're wrong, we'll
    + * just go back and repeat.
    + */
    + rq = task_rq_lock(p, &flags);
    + running = task_running(rq, p);
    + on_rq = p->se.on_rq;
    + ncsw = 0;
    + if (!match_state || p->state == match_state)
    + ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
    + task_rq_unlock(rq, &flags);
    +
    + /*
    + * If it changed from the expected state, bail out now.
    + */
    + if (unlikely(!ncsw))
    + break;
    +
    + /*
    + * Was it really running after all now that we
    + * checked with the proper locks actually held?
    + *
    + * Oops. Go back and try again..
    + */
    + if (unlikely(running)) {
    + cpu_relax();
    + continue;
    + }
    +
    + /*
    + * It's not enough that it's not actively running,
    + * it must be off the runqueue _entirely_, and not
    + * preempted!
    + *
    + * So if it was still runnable (but just not actively
    + * running right now), it's preempted, and we should
    + * yield - it could be a while.
    + */
    + if (unlikely(on_rq)) {
    + schedule_timeout_uninterruptible(1);
    + continue;
    + }
    +
    + /*
    + * Ahh, all good. It wasn't running, and it wasn't
    + * runnable, which means that it will never become
    + * running in the future either. We're all done!
    + */
    + break;
    + }
    +
    + return ncsw;
    +}
    +#endif
    +
    /***
    * try_to_wake_up - wake up a thread
    * @p: the to-be-woken-up thread

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/
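
    The `ncsw = p->nvcsw | LONG_MIN; /* sets MSB */` line in the patch relies
    on sign-bit tagging. A minimal userspace sketch of why that works
    (`tag_switch_count` is an illustrative name, not a kernel API):

    ```c
    #include <assert.h>
    #include <limits.h>
    #include <stdio.h>

    /* Mimics "ncsw = p->nvcsw | LONG_MIN" from the patch above.  Setting
     * the sign bit guarantees a nonzero result, so the caller can use 0
     * as the unambiguous "state changed, bailed out" return value while
     * still comparing two successful return values for equality. */
    static unsigned long tag_switch_count(unsigned long nvcsw)
    {
            return nvcsw | LONG_MIN;        /* sets the MSB of the word */
    }

    int main(void)
    {
            /* Even a zero switch count becomes nonzero once tagged. */
            assert(tag_switch_count(0) != 0);

            /* Equal raw counts still compare equal after tagging, and
             * different counts still differ - exactly what the
             * "call twice and compare" idiom in the comment needs. */
            assert(tag_switch_count(42) == tag_switch_count(42));
            assert(tag_switch_count(42) != tag_switch_count(43));

            printf("ok\n");
            return 0;
    }
    ```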

  2. Re: [PATCH 3/3] wait_task_inactive: don't use the dummy version when !SMP && PREEMPT



    On Wed, 30 Jul 2008, Oleg Nesterov wrote:
    >
    > The patch looks monstrous because it moves the (unchanged) definition
    > of wait_task_inactive() outside of "#ifdef CONFIG_SMP", but it is quite
    > trivial.


    Hmm. Doesn't this just deadlock in UP (PREEMPT) if wait_task_inactive()
    is ever called from a no-preempt context?

    And if that's never the case, the comment should be updated to reflect
    that (right now it says that it's only invalid to call it with interrupts
    disabled to avoid cross-IPI deadlocks).

    Oh, and shouldn't it do a "yield()" instead of a cpu_relax() on UP?

    Inquiring minds want to know. That function was very much expressly
    designed for SMP, not for preemption, and I want to understand why it's
    ok (_if_ it's ok).

    Linus

  3. Re: [PATCH 3/3] wait_task_inactive: don't use the dummy version when !SMP && PREEMPT

    On 07/30, Linus Torvalds wrote:
    >
    > On Wed, 30 Jul 2008, Oleg Nesterov wrote:
    > >
    > > The patch looks monstrous because it moves the (unchanged) definition
    > > of wait_task_inactive() outside of "#ifdef CONFIG_SMP", but it is quite
    > > trivial.

    >
    > Hmm. Doesn't this just deadlock in UP (PREEMPT) if wait_task_inactive()
    > is ever called from a no-preempt context?


    Given that it calls schedule_timeout_uninterruptible(), it can't be used
    from a no-preempt context.

    > And if that's never the case, the comment should be updated to reflect
    > that (right now it says that it's only invalid to call it with interrupts
    > disabled to avoid cross-IPI deadlocks).


    Yes, I think this function should be might_sleep().

    > Oh, and shouldn't it do a "yield()" instead of a cpu_relax() on UP?


    I _think_ that rq->curr must be == current without CONFIG_SMP, but

    > and I want to understand why it's
    > ok (_if_ it's ok).


    me too.

    Hopefully Ingo can ack/nack.

    Oleg.


  4. Re: [PATCH 3/3] wait_task_inactive: don't use the dummy version when !SMP && PREEMPT

    2008/7/30 Linus Torvalds :
    >
    >
    > On Wed, 30 Jul 2008, Oleg Nesterov wrote:
    >>
    >> The patch looks monstrous because it moves the (unchanged) definition
    >> of wait_task_inactive() outside of "#ifdef CONFIG_SMP", but it is quite
    >> trivial.

    >
    > Hmm. Doesn't this just deadlock in UP (PREEMPT) if wait_task_inactive()
    > is ever called from a no-preempt context?
    >
    > And if that's never the case, the comment should be updated to reflect
    > that (right now it says that it's only invalid to call it with interrupts
    > disabled to avoid cross-IPI deadlocks).
    >
    > Oh, and shouldn't it do a "yield()" instead of a cpu_relax() on UP?


    This part could have been skipped for UP. task_running(rq, p) just
    can't give 'true' for UP (otherwise it's a bug). The only relevant
    part is "on_rq = p->se.on_rq".

    >
    > Inquiring minds want to know. That function was very much expressly
    > designed for SMP,


    It looks so. Otherwise its behavior is not symmetric and, I think,
    either [1] it shouldn't be a "nop" for !SMP or [2] there shouldn't be
    a version for !SMP at all --> so no one can make false assumptions.

    (if [1], then I think a separate function for PREEMPT would look
    better, i.e. without parts with task_running())


    e.g. consider this code from kthread_bind():

    /* Must have done schedule() in kthread() before we set_task_cpu */
    wait_task_inactive(k, 0);

    set_task_cpu(k, cpu);
    k->cpus_allowed = cpumask_of_cpu(cpu);
    k->rt.nr_cpus_allowed = 1;
    k->flags |= PF_THREAD_BOUND;

    set_task_cpu(k, cpu) is safe _only_ if 'k' is not on the run-queue
    (and can't be placed onto it behind our back -- heh, a bit subtle).

    Now, for !SMP + PREEMPT that's not the case. set_task_cpu() may be called
    while 'k' is still on the run-queue (more precisely, while 'k' is
    preempted in kthread() between complete(&create->started); and schedule()).

    Yes, set_task_cpu() is a "nop" for UP, so that's ok in this particular
    case. But suppose another use-case were introduced with false
    assumptions, causing trouble for !SMP.
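
    The hazard can be mocked up in plain userspace C (all types and function
    names below are illustrative stand-ins, not the kernel's):

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Re-enactment of the window described above: a "kthread" signals
     * completion, can then be preempted while still on the runqueue, and
     * only later calls schedule().  A dummy wait_task_inactive() that
     * ignores on_rq lets the binder proceed inside that window. */

    enum kstate { SIGNALED_COMPLETION, PREEMPTED_ON_RQ, SCHEDULED_OUT };

    struct fake_kthread {
            enum kstate state;
            int on_rq;              /* mirrors p->se.on_rq */
    };

    /* The !SMP dummy: reports success without looking at on_rq. */
    static int dummy_wait_task_inactive(struct fake_kthread *k)
    {
            (void)k;
            return 1;
    }

    /* The full version: succeeds only once the task left the runqueue. */
    static int full_wait_task_inactive(struct fake_kthread *k)
    {
            return !k->on_rq;
    }

    int main(void)
    {
            /* The kthread completed, got preempted, still runnable. */
            struct fake_kthread k = { .state = PREEMPTED_ON_RQ, .on_rq = 1 };

            /* The dummy would let kthread_bind() run set_task_cpu()
             * too early... */
            assert(dummy_wait_task_inactive(&k) == 1);

            /* ...while the on_rq-checking version correctly refuses. */
            assert(full_wait_task_inactive(&k) == 0);

            k.state = SCHEDULED_OUT;
            k.on_rq = 0;
            assert(full_wait_task_inactive(&k) == 1);

            printf("ok\n");
            return 0;
    }
    ```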


    > not for preemption, and I want to understand why it's
    > ok (_if_ it's ok).
    >
    > Linus



    --
    Best regards,
    Dmitry Adamushko

  5. Re: [PATCH 3/3] wait_task_inactive: don't use the dummy version when !SMP && PREEMPT

    I'm not convinced we need this at all (as per prior thread
    "Q: wait_task_inactive() and !CONFIG_SMP && CONFIG_PREEMPT").

    Let's not get hung up on what the explanation of the function's semantics
    supposedly is. We can change the comments to match the reality. We can
    rename the function to better reflect its true guarantee, if you like.

    There are two uses of wait_task_inactive. One is kthread_bind.
    We have established that a nop is fine for !SMP there.

    The other uses are for tracing. Presently that is the traditional use in
    ptrace, and task_current_syscall.

    I've described that the requirement of the traditional ptrace call is quite
    weak. A nop is fine for !SMP. Preemption is not an issue because all that
    matters is that no access to pt_regs/thread_struct can race with the actual
    context switch (which by definition has to be on another CPU to be
    simultaneous).

    For task_current_syscall, it also doesn't matter that it's left the run
    queue. It just matters that the call before looking at pt_regs and the
    call after indicate whether it stayed switched out for the duration. As
    per the prior thread on this, I think that's covered by using nivcsw+nvcsw
    as I did originally rather than nvcsw alone. So a short inline is still
    fine.

    The other uses I anticipate (e.g. utrace) follow the same model as
    task_current_syscall (check before and after) and have only the same
    requirements.
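
    The check-before-and-after model can be sketched in plain C (a
    hypothetical userspace mock: fake_task, total_switches and the pt_regs
    stand-in are illustrative, not kernel API):

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Instead of guaranteeing the task is off the runqueue, the caller
     * samples a total switch count (nivcsw + nvcsw), inspects the state,
     * then samples again; equal samples mean the task stayed switched
     * out for the whole window, otherwise the caller must retry. */

    struct fake_task {
            unsigned long nvcsw;    /* voluntary context switches   */
            unsigned long nivcsw;   /* involuntary context switches */
            long regs;              /* stand-in for pt_regs */
    };

    static unsigned long total_switches(const struct fake_task *t)
    {
            /* Counting involuntary switches too is what catches
             * preemption on UP, per the nivcsw+nvcsw suggestion above. */
            return t->nivcsw + t->nvcsw;
    }

    int main(void)
    {
            struct fake_task t = { .nvcsw = 10, .nivcsw = 3, .regs = 0x1234 };

            unsigned long before = total_switches(&t);
            long snapshot = t.regs;         /* "look at pt_regs" */
            unsigned long after = total_switches(&t);

            /* Task never ran in between: snapshot is trustworthy. */
            assert(before == after && snapshot == 0x1234);

            /* Now simulate a preemption (involuntary switch) mid-read. */
            before = total_switches(&t);
            t.nivcsw++;                     /* the task got scheduled */
            after = total_switches(&t);
            assert(before != after);        /* caller must retry */

            printf("ok\n");
            return 0;
    }
    ```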


    Thanks,
    Roland

  6. Re: [PATCH 3/3] wait_task_inactive: don't use the dummy version when !SMP && PREEMPT

    On 07/30, Dmitry Adamushko wrote:
    >
    > 2008/7/30 Linus Torvalds :
    > >
    > > Oh, and shouldn't it do a "yield()" instead of a cpu_relax() on UP?

    >
    > This part could have been skipped for UP. task_running(rq, p) just
    > can't give 'true' for UP (otherwise it's a bug). The only relevant
    > part is "on_rq = p->se.on_rq".


    That was my understanding, thanks for your confirmation.

    > (if [1], then I think a separate function for PREEMPT would look
    > better, i.e. without parts with task_running())


    in that case we can also eliminate task_rq_lock() afaics.

    > e.g. consider this code from kthread_bind():
    >
    > /* Must have done schedule() in kthread() before we set_task_cpu */
    > wait_task_inactive(k, 0);
    >
    > set_task_cpu(k, cpu);
    > k->cpus_allowed = cpumask_of_cpu(cpu);
    > k->rt.nr_cpus_allowed = 1;
    > k->flags |= PF_THREAD_BOUND;
    >
    > set_task_cpu(k, cpu) is safe _only_ if 'k' is not on the run-queue
    > (and can't be placed onto it behind our back -- heh, a bit subtle).
    >
    > Now, for !SMP + PREEMPT it's not a case. set_task_cpu() may be called
    > while 'k' is still on the run-queue (more precisely, preempted in
    > kthread() between complete(&create->started); and schedule().
    >
    > Yes, set_task_cpu() is a "nop" for UP so that's ok in this particular
    > case. But let's suppose, another use-case would be introduced with
    > 'false' assumptions causing troubles for !SMP.


    Completely agreed.


    Currently the only user which can suffer on UP && PREEMPT is
    task_current_syscall(). As Roland suggested, we can fix it if we change
    the !SMP version to return ->nivcsw + ->nvcsw. But personally I don't
    like the fact that we will have a subtle difference in behaviour
    depending on CONFIG_SMP. We have to wait for .on_rq == 0 on SMP; imho
    it is better to do the same on UP even if none of the current callers
    needs this.

    That said, I think this problem is minor and I don't (and can't) have
    a strong opinion on how to fix it. I'd better wait for the authoritative
    verdict.

    Oleg.

