[PATCH 0/4] pending scheduler updates


  1. [PATCH 0/4] pending scheduler updates

    Please apply

  2. [PATCH 2/4] sched: fair scheduler should not resched rt tasks

    With the use of ftrace, Steven noticed that some RT tasks got rescheduled
    due to sched_fair interaction.

    What happens is that we reprogram the hrtick from enqueue/dequeue_fair_task()
    because that can change nr_running, and thus the current task's ideal runtime.
    However, it's possible the current task isn't a fair_sched_class task, and thus
    doesn't have an hrtick set to change.

    Fix this by wrapping those hrtick_start_fair() calls in a hrtick_update()
    function, which will check for the right conditions.

    Reported-by: Steven Rostedt
    Signed-off-by: Peter Zijlstra
    Acked-by: Steven Rostedt
    ---
    kernel/sched_fair.c | 28 ++++++++++++++++++++++++----
    1 file changed, 24 insertions(+), 4 deletions(-)

    Index: linux-2.6/kernel/sched_fair.c
    ===================================================================
    --- linux-2.6.orig/kernel/sched_fair.c
    +++ linux-2.6/kernel/sched_fair.c
    @@ -73,6 +73,8 @@ unsigned int sysctl_sched_wakeup_granula

    const_debug unsigned int sysctl_sched_migration_cost = 500000UL;

    +static const struct sched_class fair_sched_class;
    +
    /**************************************************************
    * CFS operations on generic schedulable entities:
    */
    @@ -848,11 +850,31 @@ static void hrtick_start_fair(struct rq
    hrtick_start(rq, delta);
    }
    }
    +
    +/*
    + * called from enqueue/dequeue and updates the hrtick when the
    + * current task is from our class and nr_running is low enough
    + * to matter.
    + */
    +static void hrtick_update(struct rq *rq)
    +{
    + struct task_struct *curr = rq->curr;
    +
    + if (curr->sched_class != &fair_sched_class)
    + return;
    +
    + if (cfs_rq_of(&curr->se)->nr_running < sched_nr_latency)
    + hrtick_start_fair(rq, curr);
    +}
    #else /* !CONFIG_SCHED_HRTICK */
    static inline void
    hrtick_start_fair(struct rq *rq, struct task_struct *p)
    {
    }
    +
    +static inline void hrtick_update(struct rq *rq)
    +{
    +}
    #endif

    /*
    @@ -873,7 +895,7 @@ static void enqueue_task_fair(struct rq
    wakeup = 1;
    }

    - hrtick_start_fair(rq, rq->curr);
    + hrtick_update(rq);
    }

    /*
    @@ -895,7 +917,7 @@ static void dequeue_task_fair(struct rq
    sleep = 1;
    }

    - hrtick_start_fair(rq, rq->curr);
    + hrtick_update(rq);
    }

    /*
    @@ -1001,8 +1023,6 @@ static inline int wake_idle(int cpu, str

    #ifdef CONFIG_SMP

    -static const struct sched_class fair_sched_class;
    -
    #ifdef CONFIG_FAIR_GROUP_SCHED
    /*
    * effective_load() calculates the load change as seen from the root_task_group


  3. [PATCH 3/4] sched: revert back to per-rq vruntime

    Vatsa rightly points out that having the runqueue weight in the vruntime
    calculations can cause unfairness in the face of task joins/leaves.

    Suppose: dv = dt * rw / w

    Then take 10 tasks t_n, each of similar weight. If the first task runs for 1
    unit, its vruntime will increase by 10. Now, if the next 8 tasks each leave
    after having run their 1 unit, then the last task will get a vruntime increase
    of only 2 after having run its 1 unit.

    That leaves us with 2 tasks of equal weight and equal runtime, one of which
    will not be scheduled for 8/2 = 4 units of time.

    Ergo, we cannot do that and must use: dv = dt / w.

    This means we cannot have a global vruntime based on effective priority, but
    must instead go back to the vruntime per rq model we started out with.
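
    To make the arithmetic concrete, here is a small user-space sketch (purely
    illustrative, not part of the patch) that replays the numbers above for 10
    tasks of weight 1:

    #include <stdio.h>

    int main(void)
    {
            int w = 1;                      /* per-task weight */
            int dt = 1;                     /* each task runs for 1 unit */

            /* old rule: dv = dt * rw / w, with rw the runqueue weight */
            int dv_first = dt * 10 / w;     /* rw == 10 -> vruntime += 10 */
            int dv_last  = dt *  2 / w;     /* rw ==  2 -> vruntime +=  2 */

            /* new rule: dv = dt / w, independent of rw */
            int dv_any = dt / w;

            /*
             * With 2 runnable tasks left, the old rule advances vruntime at
             * rate rw == 2, so closing the 8-unit gap takes 8/2 = 4 wall-time
             * units, during which the task that got the smaller increase keeps
             * running and the other one waits.
             */
            printf("old rule: first +%d, last +%d, gap %d -> %d units unfair\n",
                   dv_first, dv_last, dv_first - dv_last,
                   (dv_first - dv_last) / 2);
            printf("new rule: every task +%d, no gap\n", dv_any);
            return 0;
    }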

    This patch was lightly tested by starting while loops on each nice level and
    observing their execution time, and with a simple group scenario of 1:2:3
    pinned to a single cpu.

    Signed-off-by: Peter Zijlstra
    ---
    kernel/sched_fair.c | 32 +++++++++++++++-----------------
    1 file changed, 15 insertions(+), 17 deletions(-)

    Index: linux-2.6/kernel/sched_fair.c
    ===================================================================
    --- linux-2.6.orig/kernel/sched_fair.c
    +++ linux-2.6/kernel/sched_fair.c
    @@ -336,7 +336,7 @@ int sched_nr_latency_handler(struct ctl_
    #endif

    /*
    - * delta *= w / rw
    + * delta *= P[w / rw]
    */
    static inline unsigned long
    calc_delta_weight(unsigned long delta, struct sched_entity *se)
    @@ -350,15 +350,13 @@ calc_delta_weight(unsigned long delta, s
    }

    /*
    - * delta *= rw / w
    + * delta /= w
    */
    static inline unsigned long
    calc_delta_fair(unsigned long delta, struct sched_entity *se)
    {
    - for_each_sched_entity(se) {
    - delta = calc_delta_mine(delta,
    - cfs_rq_of(se)->load.weight, &se->load);
    - }
    + if (unlikely(se->load.weight != NICE_0_LOAD))
    + delta = calc_delta_mine(delta, NICE_0_LOAD, &se->load);

    return delta;
    }
    @@ -388,26 +386,26 @@ static u64 __sched_period(unsigned long
    * We calculate the wall-time slice from the period by taking a part
    * proportional to the weight.
    *
    - * s = p*w/rw
    + * s = p*P[w/rw]
    */
    static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
    {
    - return calc_delta_weight(__sched_period(cfs_rq->nr_running), se);
    + unsigned long nr_running = cfs_rq->nr_running;
    +
    + if (unlikely(!se->on_rq))
    + nr_running++;
    +
    + return calc_delta_weight(__sched_period(nr_running), se);
    }

    /*
    * We calculate the vruntime slice of a to be inserted task
    *
    - * vs = s*rw/w = p
    + * vs = s/w
    */
    -static u64 sched_vslice_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
    +static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
    {
    - unsigned long nr_running = cfs_rq->nr_running;
    -
    - if (!se->on_rq)
    - nr_running++;
    -
    - return __sched_period(nr_running);
    + return calc_delta_fair(sched_slice(cfs_rq, se), se);
    }

    /*
    @@ -630,7 +628,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
    * stays open at the end.
    */
    if (initial && sched_feat(START_DEBIT))
    - vruntime += sched_vslice_add(cfs_rq, se);
    + vruntime += sched_vslice(cfs_rq, se);

    if (!initial) {
    /* sleeps upto a single latency don't count. */


  4. [PATCH 4/4] sched: fix wakeup preemption

    In my recent wakeup preempt rework I messed up the asym wakeup.
    The idea is that it should be easier to preempt lighter tasks
    but not harder to preempt heavier tasks.
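
    As a purely illustrative user-space sketch (not kernel code; the
    sched_feat(ASYM_GRAN) check is omitted and the nice-level weights are the
    usual CFS values, quoted from memory), this is the asymmetry being aimed
    for: a smaller granularity for tasks lighter than NICE_0_LOAD, the default
    for everything else:

    #include <stdio.h>

    #define NICE_0_LOAD     1024UL
    #define NICE_0_SHIFT    10

    /* scale the wakeup granularity down for light (+nice) tasks only */
    static unsigned long wakeup_gran(unsigned long gran, unsigned long weight)
    {
            if (weight < NICE_0_LOAD)
                    gran = (gran * weight) >> NICE_0_SHIFT;

            return gran;    /* heavier tasks keep the full granularity */
    }

    int main(void)
    {
            unsigned long gran = 5000000UL;         /* e.g. 5ms in ns */

            printf("nice +5 (weight  335): gran %lu\n", wakeup_gran(gran, 335));
            printf("nice  0 (weight 1024): gran %lu\n", wakeup_gran(gran, 1024));
            printf("nice -5 (weight 3121): gran %lu\n", wakeup_gran(gran, 3121));
            return 0;
    }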

    Signed-off-by: Peter Zijlstra
    ---
    kernel/sched_fair.c | 6 +++---
    1 file changed, 3 insertions(+), 3 deletions(-)

    Index: linux-2.6/kernel/sched_fair.c
    ===================================================================
    --- linux-2.6.orig/kernel/sched_fair.c
    +++ linux-2.6/kernel/sched_fair.c
    @@ -1243,8 +1243,8 @@ static unsigned long wakeup_gran(struct
    * More easily preempt - nice tasks, while not making it harder for
    * + nice tasks.
    */
    - if (sched_feat(ASYM_GRAN))
    - gran = calc_delta_mine(gran, NICE_0_LOAD, &se->load);
    + if (sched_feat(ASYM_GRAN) && se->load.weight < NICE_0_LOAD)
    + gran = (gran * se->load.weight) >> NICE_0_SHIFT;

    return gran;
    }
    @@ -1296,7 +1296,7 @@ static void check_preempt_wakeup(struct
    }

    delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
    - if (delta_exec > wakeup_gran(pse))
    + if (delta_exec > wakeup_gran(se))
    resched_task(curr);
    }



  5. [PATCH 1/4] sched: optimize group load balancer

    I noticed that tg_shares_up() unconditionally takes rq-locks for all cpus
    in the sched_domain. This hurts.

    We need the rq-locks whenever we change the weight of the per-cpu group sched
    entities. To alleviate this a little, only change the weight when the new
    weight is at least shares_thresh away from the old value.

    This avoids the rq-lock for the top level entries, since those will never
    be re-weighted, and fuzzes the lower level entries a little to gain performance
    in semi-stable situations.
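
    A rough, hypothetical user-space sketch (not kernel code) of why the
    threshold pays off in semi-stable situations: when the recomputed shares
    only jitter a little around the current weight, a threshold of 4 lets
    nearly every update skip the remote rq-lock:

    #include <stdio.h>
    #include <stdlib.h>

    #define SHARES_THRESH   4       /* mirrors the new sysctl's default */

    int main(void)
    {
            long weight = 512;      /* weight the group entity currently has */
            int locked = 0, skipped = 0;

            for (int i = 0; i < 1000; i++) {
                    /* semi-stable load: the new shares jitter by at most +/-2 */
                    long shares = 512 + (rand() % 5 - 2);

                    if (labs(shares - weight) > SHARES_THRESH) {
                            weight = shares;        /* would take rq->lock here */
                            locked++;
                    } else {
                            skipped++;              /* rq->lock avoided */
                    }
            }

            printf("locked updates: %d, skipped: %d\n", locked, skipped);
            return 0;
    }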

    Signed-off-by: Peter Zijlstra
    CC: Chris Friesen
    ---
    include/linux/sched.h | 1 +
    kernel/sched.c | 45 +++++++++++++++++++++++++--------------------
    kernel/sysctl.c | 10 ++++++++++
    3 files changed, 36 insertions(+), 20 deletions(-)

    Index: linux-2.6/include/linux/sched.h
    ===================================================================
    --- linux-2.6.orig/include/linux/sched.h
    +++ linux-2.6/include/linux/sched.h
    @@ -1660,6 +1660,7 @@ extern unsigned int sysctl_sched_feature
    extern unsigned int sysctl_sched_migration_cost;
    extern unsigned int sysctl_sched_nr_migrate;
    extern unsigned int sysctl_sched_shares_ratelimit;
    +extern unsigned int sysctl_sched_shares_thresh;

    int sched_nr_latency_handler(struct ctl_table *table, int write,
    struct file *file, void __user *buffer, size_t *length,
    Index: linux-2.6/kernel/sched.c
    ===================================================================
    --- linux-2.6.orig/kernel/sched.c
    +++ linux-2.6/kernel/sched.c
    @@ -818,6 +818,13 @@ const_debug unsigned int sysctl_sched_nr
    unsigned int sysctl_sched_shares_ratelimit = 250000;

    /*
    + * Inject some fuzzyness into changing the per-cpu group shares
    + * this avoids remote rq-locks at the expense of fairness.
    + * default: 4
    + */
    +unsigned int sysctl_sched_shares_thresh = 4;
    +
    +/*
    * period over which we measure -rt task cpu usage in us.
    * default: 1s
    */
    @@ -1453,8 +1460,8 @@ static void __set_se_shares(struct sched
    * Calculate and set the cpu's group shares.
    */
    static void
    -__update_group_shares_cpu(struct task_group *tg, int cpu,
    - unsigned long sd_shares, unsigned long sd_rq_weight)
    +update_group_shares_cpu(struct task_group *tg, int cpu,
    + unsigned long sd_shares, unsigned long sd_rq_weight)
    {
    int boost = 0;
    unsigned long shares;
    @@ -1485,19 +1492,23 @@ __update_group_shares_cpu(struct task_gr
    *
    */
    shares = (sd_shares * rq_weight) / (sd_rq_weight + 1);
    + shares = clamp_t(unsigned long, shares, MIN_SHARES, MAX_SHARES);

    - /*
    - * record the actual number of shares, not the boosted amount.
    - */
    - tg->cfs_rq[cpu]->shares = boost ? 0 : shares;
    - tg->cfs_rq[cpu]->rq_weight = rq_weight;
    + if (abs(shares - tg->se[cpu]->load.weight) >
    + sysctl_sched_shares_thresh) {
    + struct rq *rq = cpu_rq(cpu);
    + unsigned long flags;

    - if (shares < MIN_SHARES)
    - shares = MIN_SHARES;
    - else if (shares > MAX_SHARES)
    - shares = MAX_SHARES;
    + spin_lock_irqsave(&rq->lock, flags);
    + /*
    + * record the actual number of shares, not the boosted amount.
    + */
    + tg->cfs_rq[cpu]->shares = boost ? 0 : shares;
    + tg->cfs_rq[cpu]->rq_weight = rq_weight;

    - __set_se_shares(tg->se[cpu], shares);
    + __set_se_shares(tg->se[cpu], shares);
    + spin_unlock_irqrestore(&rq->lock, flags);
    + }
    }

    /*
    @@ -1526,14 +1537,8 @@ static int tg_shares_up(struct task_grou
    if (!rq_weight)
    rq_weight = cpus_weight(sd->span) * NICE_0_LOAD;

    - for_each_cpu_mask(i, sd->span) {
    - struct rq *rq = cpu_rq(i);
    - unsigned long flags;
    -
    - spin_lock_irqsave(&rq->lock, flags);
    - __update_group_shares_cpu(tg, i, shares, rq_weight);
    - spin_unlock_irqrestore(&rq->lock, flags);
    - }
    + for_each_cpu_mask(i, sd->span)
    + update_group_shares_cpu(tg, i, shares, rq_weight);

    return 0;
    }
    Index: linux-2.6/kernel/sysctl.c
    ===================================================================
    --- linux-2.6.orig/kernel/sysctl.c
    +++ linux-2.6/kernel/sysctl.c
    @@ -277,6 +277,16 @@ static struct ctl_table kern_table[] = {
    },
    {
    .ctl_name = CTL_UNNUMBERED,
    + .procname = "sched_shares_thresh",
    + .data = &sysctl_sched_shares_thresh,
    + .maxlen = sizeof(unsigned int),
    + .mode = 0644,
    + .proc_handler = &proc_dointvec_minmax,
    + .strategy = &sysctl_intvec,
    + .extra1 = &zero,
    + },
    + {
    + .ctl_name = CTL_UNNUMBERED,
    .procname = "sched_child_runs_first",
    .data = &sysctl_sched_child_runs_first,
    .maxlen = sizeof(unsigned int),


  6. Re: [PATCH 0/4] pending scheduler updates


    * Peter Zijlstra wrote:

    > Please apply


    applied these to tip/sched/urgent:

    f9c0b09: sched: revert back to per-rq vruntime
    a4c2f00: sched: fair scheduler should not resched rt tasks
    ffda12a: sched: optimize group load balancer

    (will wait with 4/4 until you and Mike come to a verdict.)

    thanks Peter,

    Ingo

  7. Re: [PATCH 4/4] sched: fix wakeup preemption

    Peter Zijlstra wrote:

    > In my recent wakeup preempt rework I messed up the asym wakeup.
    > The idea is that it should be easier to preempt lighter tasks
    > but not harder to preempt heavier tasks.


    > Index: linux-2.6/kernel/sched_fair.c
    > ===================================================================
    > --- linux-2.6.orig/kernel/sched_fair.c
    > +++ linux-2.6/kernel/sched_fair.c
    > @@ -1243,8 +1243,8 @@ static unsigned long wakeup_gran(struct
    > * More easily preempt - nice tasks, while not making it harder for
    > * + nice tasks.
    > */
    > - if (sched_feat(ASYM_GRAN))
    > - gran = calc_delta_mine(gran, NICE_0_LOAD, &se->load);
    > + if (sched_feat(ASYM_GRAN) && se->load.weight < NICE_0_LOAD)
    > + gran = (gran * se->load.weight) >> NICE_0_SHIFT;
    >
    > return gran;
    > }


    Setting aside whether the asym wakeup is desirable, the code looks
    reasonable but I think you need to change the code comments as well.

    The proposed code only affects tasks with a weight of less than NICE_0_LOAD,
    i.e. +nice tasks. The comment suggests the opposite.

    Chris

  8. Re: [PATCH 0/4] pending scheduler updates

    On Mon, Oct 20, 2008 at 02:05:38PM +0200, Ingo Molnar wrote:
    >
    > * Peter Zijlstra wrote:
    >
    > > Please apply

    >
    > applied these to tip/sched/urgent:
    >
    > f9c0b09: sched: revert back to per-rq vruntime
    > a4c2f00: sched: fair scheduler should not resched rt tasks
    > ffda12a: sched: optimize group load balancer
    >
    > (will wait with 4/4 until you and Mike come to a verdict.)


    Is there any conclusion on Patch 4/4? It looks sane to me!

    --
    Regards,
    vatsa

  9. Re: [PATCH 0/4] pending scheduler updates


    * Srivatsa Vaddagiri wrote:

    > On Mon, Oct 20, 2008 at 02:05:38PM +0200, Ingo Molnar wrote:
    > >
    > > * Peter Zijlstra wrote:
    > >
    > > > Please apply

    > >
    > > applied these to tip/sched/urgent:
    > >
    > > f9c0b09: sched: revert back to per-rq vruntime
    > > a4c2f00: sched: fair scheduler should not resched rt tasks
    > > ffda12a: sched: optimize group load balancer
    > >
    > > (will wait with 4/4 until you and Mike come to a verdict.)

    >
    > Is there any conclusion on Patch 4/4? It looks sane to me!


    Mike?

    Ingo

  10. Re: [PATCH 0/4] pending scheduler updates

    On Wed, 2008-10-22 at 11:40 +0200, Ingo Molnar wrote:
    > * Srivatsa Vaddagiri wrote:
    >
    > > On Mon, Oct 20, 2008 at 02:05:38PM +0200, Ingo Molnar wrote:
    > > >
    > > > * Peter Zijlstra wrote:
    > > >
    > > > > Please apply
    > > >
    > > > applied these to tip/sched/urgent:
    > > >
    > > > f9c0b09: sched: revert back to per-rq vruntime
    > > > a4c2f00: sched: fair scheduler should not resched rt tasks
    > > > ffda12a: sched: optimize group load balancer
    > > >
    > > > (will wait with 4/4 until you and Mike come to a verdict.)

    > >
    > > Is there any conclusion on Patch 4/4? It looks sane to me!

    >
    > Mike?


    Your call of course, but I don't think it's a good trade. In testing,
    it does serious injury to mysql+oltp peak throughput, and slows down
    things which need frequent preemption in order to compete effectively
    against hogs. (includes desktop, though that's not heavily tested)

    The patch totally kills the attached starvation testcase, distilled from a
    real app and posted to lkml a few years ago. It's a worst-case scenario to
    be sure, but that worst case is pretty dramatic. I guesstimated it would
    take ~12 hours for the testcase to do 10M signals with wakeup preemption so
    restricted, vs 43 seconds with preemption as is. To me, that suggests that
    anything ultra fast/light will pay through the nose.

    It has positive effects too, but IMHO, the bad outweigh the good.

    -Mike


  11. Re: [PATCH 0/4] pending scheduler updates

    On Wed, 2008-10-22 at 12:03 +0200, Mike Galbraith wrote:

    > It has positive effects too, but IMHO, the bad outweigh the good.


    BTW, the most dramatic case on the other end of the spectrum is pgsql+oltp.
    With preemption as is, it collapses as load climbs to heavy with the
    preemption knobs at stock. Postgres uses user-land spinlocks and _appears_
    to wake others while these are still held. For this load, there is such a
    thing as too much short-term fairness: preempting the lock holder creates a
    nasty gaggle of contended lock spinners. It's curable with knobs, and I
    think it's postgres's own fault, but I may be wrong.

    With that patch, pgsql+oltp scales perfectly.

    -Mike


  12. Re: [PATCH 0/4] pending scheduler updates


    * Mike Galbraith wrote:

    > On Wed, 2008-10-22 at 12:03 +0200, Mike Galbraith wrote:
    >
    > > It has positive effects too, but IMHO, the bad outweigh the good.

    >
    > BTW, most dramatic on the other end of the spectrum is pgsql+oltp.
    > With preemption as is, it collapses as load climbs to heavy with
    > preemption knobs at stock. Postgres uses user-land spinlocks and
    > _appears_ to wake others while these are still held. For this load,
    > there is such a thing as too much short-term fairness, preempting lock
    > holder creates nasty gaggle of contended lock spinners. It's curable
    > with knobs, and I think it's postgres's own fault, but may be wrong.
    >
    > With that patch, pgsql+oltp scales perfectly.


    hm, tempting.

    Have you tried to hack/fix pgsql to do proper wakeups?

    Right now pgsql punishes schedulers that preempt it while it is
    holding totally undeclared (to the kernel) user-space spinlocks ...

    Hence postgresql is rewarding a _bad_ scheduler policy in essence. And
    pgsql scalability seems to fall totally apart above 16 cpus - regardless
    of scheduler policy.

    Ingo

  13. Re: [PATCH 0/4] pending scheduler updates

    On Wed, 2008-10-22 at 14:10 +0200, Ingo Molnar wrote:
    > * Mike Galbraith wrote:
    >
    > > On Wed, 2008-10-22 at 12:03 +0200, Mike Galbraith wrote:
    > >
    > > > It has positive effects too, but IMHO, the bad outweigh the good.

    > >
    > > BTW, most dramatic on the other end of the spectrum is pgsql+oltp.
    > > With preemption as is, it collapses as load climbs to heavy with
    > > preemption knobs at stock. Postgres uses user-land spinlocks and
    > > _appears_ to wake others while these are still held. For this load,
    > > there is such a thing as too much short-term fairness, preempting lock
    > > holder creates nasty gaggle of contended lock spinners. It's curable
    > > with knobs, and I think it's postgres's own fault, but may be wrong.
    > >
    > > With that patch, pgsql+oltp scales perfectly.

    >
    > hm, tempting.


    I disagree. Postgres's scaling problem is trivially corrected by
    twiddling knobs (or whatnot). With that patch, you can't twiddle mysql
    throughput back, or disk intensive loads for that matter. You can tweak
    the preempt number, but it has nothing to do with lag, so anybody can
    preempt anybody else as you turn the knob toward zero. Chaos.

    > Have you tried to hack/fix pgsql to do proper wakeups?


    No, I tried building without spinlocks to verify, but the build croaked.
    I never went back to slogging through the code.

    > Right now pgsql it punishes schedulers that preempt it while it is
    > holding totally undeclared (to the kernel) user-space spinlocks ...
    >
    > Hence postgresql is rewarding a _bad_ scheduler policy in essence. And
    > pgsql scalability seems to fall totally apart above 16 cpus - regardless
    > of scheduler policy.


    If someone gives me that problem, and a credit card for the electric
    company, I'll do my very extra special best to defeat it ;-)

    -Mike


  14. Re: [PATCH 0/4] pending scheduler updates


    * Mike Galbraith wrote:

    > > > With that patch, pgsql+oltp scales perfectly.

    > >
    > > hm, tempting.

    >
    > I disagree. Postgres's scaling problem is trivially corrected by
    > twiddling knobs (or whatnot). [...]


    okay, then we need to document it a bit more: what knobs need twiddling
    to make it scale perfectly?

    > [...] With that patch, you can't twiddle mysql throughput back, or
    > disk intensive loads for that matter. You can tweak the preempt
    > number, but it has nothing to do with lag, so anybody can preempt
    > anybody else as you turn the knob toward zero. Chaos.


    okay, convinced.

    > > Have you tried to hack/fix pgsql to do proper wakeups?

    >
    > No, I tried to build without spinlocks to verify, but build croaked.
    > Never went back to slogging through the code.


    If it falls back to IPC semaphores, that's a bad trade from a performance
    POV. The best would be if it used proper futexes (i.e. pthread_mutex() and
    friends), not some home-grown user-space spinlock thing.

    Ingo

  15. Re: [PATCH 0/4] pending scheduler updates

    On Wed, 2008-10-22 at 14:42 +0200, Ingo Molnar wrote:
    > * Mike Galbraith wrote:
    >
    > > > > With that patch, pgsql+oltp scales perfectly.
    > > >
    > > > hm, tempting.

    > >
    > > I disagree. Postgres's scaling problem is trivially corrected by
    > > twiddling knobs (or whatnot). [...]

    >
    > okay, then we need to document it a bit more: what knobs need twiddling
    > to make it scale perfectly?


    Twiddle sched_wakeup_granularity_ns or just turn FAIR_SLEEPERS off for the
    best pgsql+oltp scalability. The pgsql+oltp collapse is delicate; it
    doesn't take much to send it one way or the other. Cut preempt a wee bit,
    and it can snap right into shape.

    I think we need to penalize sleepers a wee bit as load climbs in general
    though; everybody begins to like preemption less as load increases.
    Mysql+oltp improves as well, and it loves preempt at modest load.

    -Mike


  16. Re: [PATCH 0/4] pending scheduler updates

    On Wed, 2008-10-22 at 12:32 +0200, Mike Galbraith wrote:
    > On Wed, 2008-10-22 at 12:03 +0200, Mike Galbraith wrote:
    >
    > > It has positive effects too, but IMHO, the bad outweigh the good.

    >
    > BTW, most dramatic on the other end of the spectrum is pgsql+oltp. With
    > preemption as is, it collapses as load climbs to heavy with preemption
    > knobs at stock. Postgres uses user-land spinlocks and _appears_ to wake
    > others while these are still held. For this load, there is such a thing
    > as too much short-term fairness, preempting lock holder creates nasty
    > gaggle of contended lock spinners. It's curable with knobs, and I think
    > it's postgres's own fault, but may be wrong.
    >
    > With that patch, pgsql+oltp scales perfectly.


    Are we talking about this patch, which reinstates the vruntime-based
    wakeup preemption?

    ---
    Subject: sched: fix wakeup preemption
    From: Peter Zijlstra

    Signed-off-by: Peter Zijlstra
    ---
    kernel/sched_fair.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++----
    1 file changed, 92 insertions(+), 6 deletions(-)

    Index: linux-2.6/kernel/sched_fair.c
    ===================================================================
    --- linux-2.6.orig/kernel/sched_fair.c
    +++ linux-2.6/kernel/sched_fair.c
    @@ -143,6 +143,49 @@ static inline struct sched_entity *paren
    return se->parent;
    }

    +/* return depth at which a sched entity is present in the hierarchy */
    +static inline int depth_se(struct sched_entity *se)
    +{
    + int depth = 0;
    +
    + for_each_sched_entity(se)
    + depth++;
    +
    + return depth;
    +}
    +
    +static void
    +find_matching_se(struct sched_entity **se, struct sched_entity **pse)
    +{
    + int se_depth, pse_depth;
    +
    + /*
    + * preemption test can be made between sibling entities who are in the
    + * same cfs_rq i.e who have a common parent. Walk up the hierarchy of
    + * both tasks until we find their ancestors who are siblings of common
    + * parent.
    + */
    +
    + /* First walk up until both entities are at same depth */
    + se_depth = depth_se(*se);
    + pse_depth = depth_se(*pse);
    +
    + while (se_depth > pse_depth) {
    + se_depth--;
    + *se = parent_entity(*se);
    + }
    +
    + while (pse_depth > se_depth) {
    + pse_depth--;
    + *pse = parent_entity(*pse);
    + }
    +
    + while (!is_same_group(*se, *pse)) {
    + *se = parent_entity(*se);
    + *pse = parent_entity(*pse);
    + }
    +}
    +
    #else /* CONFIG_FAIR_GROUP_SCHED */

    static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
    @@ -193,6 +236,11 @@ static inline struct sched_entity *paren
    return NULL;
    }

    +static inline void
    +find_matching_se(struct sched_entity **se, struct sched_entity **pse)
    +{
    +}
    +
    #endif /* CONFIG_FAIR_GROUP_SCHED */


    @@ -1244,13 +1292,42 @@ static unsigned long wakeup_gran(struct
    * More easily preempt - nice tasks, while not making it harder for
    * + nice tasks.
    */
    - if (sched_feat(ASYM_GRAN))
    - gran = calc_delta_mine(gran, NICE_0_LOAD, &se->load);
    + if (!sched_feat(ASYM_GRAN) || se->load.weight > NICE_0_LOAD)
    + gran = calc_delta_fair(sysctl_sched_wakeup_granularity, se);

    return gran;
    }

    /*
    + * Should 'se' preempt 'curr'.
    + *
    + * |s1
    + * |s2
    + * |s3
    + * g
    + * |<--->|c
    + *
    + * w(c, s1) = -1
    + * w(c, s2) = 0
    + * w(c, s3) = 1
    + *
    + */
    +static int
    +wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
    +{
    + s64 gran, vdiff = curr->vruntime - se->vruntime;
    +
    + if (vdiff < 0)
    + return -1;
    +
    + gran = wakeup_gran(curr);
    + if (vdiff > gran)
    + return 1;
    +
    + return 0;
    +}
    +
    +/*
    * Preempt the current task with a newly woken task if needed:
    */
    static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int sync)
    @@ -1258,7 +1335,6 @@ static void check_preempt_wakeup(struct
    struct task_struct *curr = rq->curr;
    struct cfs_rq *cfs_rq = task_cfs_rq(curr);
    struct sched_entity *se = &curr->se, *pse = &p->se;
    - s64 delta_exec;

    if (unlikely(rt_prio(p->prio))) {
    update_rq_clock(rq);
    @@ -1296,9 +1372,19 @@ static void check_preempt_wakeup(struct
    return;
    }

    - delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
    - if (delta_exec > wakeup_gran(pse))
    - resched_task(curr);
    + find_matching_se(&se, &pse);
    +
    + while (se) {
    + BUG_ON(!pse);
    +
    + if (wakeup_preempt_entity(se, pse) == 1) {
    + resched_task(curr);
    + break;
    + }
    +
    + se = parent_entity(se);
    + pse = parent_entity(pse);
    + }
    }

    static struct task_struct *pick_next_task_fair(struct rq *rq)



  17. Re: [PATCH 0/4] pending scheduler updates

    On Wed, 2008-10-22 at 19:38 +0200, Peter Zijlstra wrote:
    > On Wed, 2008-10-22 at 12:32 +0200, Mike Galbraith wrote:
    > > On Wed, 2008-10-22 at 12:03 +0200, Mike Galbraith wrote:
    > >
    > > > It has positive effects too, but IMHO, the bad outweigh the good.

    > >
    > > BTW, most dramatic on the other end of the spectrum is pgsql+oltp. With
    > > preemption as is, it collapses as load climbs to heavy with preemption
    > > knobs at stock. Postgres uses user-land spinlocks and _appears_ to wake
    > > others while these are still held. For this load, there is such a thing
    > > as too much short-term fairness, preempting lock holder creates nasty
    > > gaggle of contended lock spinners. It's curable with knobs, and I think
    > > it's postgres's own fault, but may be wrong.
    > >
    > > With that patch, pgsql+oltp scales perfectly.

    >
    > Are we talking about this patch, which re-instates the vruntime based
    > wakeup-preemption ?


    No, if it were that one, I'd be tinkering with mysql+oltp. Everything else
    I tested with that patch (limited time, but a fairly wide spectrum) was
    fine, including interactivity. Caveat: I'm not comfortable with the
    tbench/netperf results; I'd need to backport to 26 to feel at all
    confident about those. (fwiw)

    -Mike

