[patch] restore sched_exec load balance heuristics - Kernel


Thread: [patch] restore sched_exec load balance heuristics

  1. [patch] restore sched_exec load balance heuristics

    We've seen a long-standing performance regression in sys_execve across
    several upstream kernels, largely on workloads that do heavy execve.
    The main reason for the regression is a change in the sched_exec load
    balance heuristics. For example, on the 2.6.11 kernel, the exec'ing
    task will run on the same CPU if it is the only task running there.
    However, 2.6.13 and later kernels walk the sched-domain looking for
    the most idle CPU (and do not treat the exec'ing task's CPU as idle),
    thus bouncing the exec'ing task all over the place, which leads to
    poor CPU cache and NUMA locality. (The workload happens to share
    common data between successively exec'd programs.)

    This execve heuristic was removed from the upstream kernel by this git commit:

    commit 68767a0ae428801649d510d9a65bb71feed44dd1
    Author: Nick Piggin
    Date: Sat Jun 25 14:57:20 2005 -0700

    [PATCH] sched: schedstats update for balance on fork
    Add SCHEDSTAT statistics for sched-balance-fork.

    From the commit description, it appears that deleting the heuristics
    was an accident, as the commit is supposedly just for schedstats.

    So, restore the sched-exec load balancing when the exec'ing task is
    the only task running on that specific CPU. The logic makes sense: a
    newly exec'd program should continue to run on the current CPU, since
    staying put neither changes the load imbalance nor gains anything by
    bouncing to another idle CPU. Keeping it on the same CPU preserves
    cache and NUMA locality.

    Signed-off-by: Ken Chen

    diff --git a/kernel/sched.c b/kernel/sched.c
    index e8819bc..4ad1907 100644
    --- a/kernel/sched.c
    +++ b/kernel/sched.c
    @@ -2873,7 +2873,12 @@ out:
      */
     void sched_exec(void)
     {
    -        int new_cpu, this_cpu = get_cpu();
    +        int new_cpu, this_cpu;
    +
    +        if (this_rq()->nr_running <= 1)
    +                return;
    +
    +        this_cpu = get_cpu();
             new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC);
             put_cpu();
             if (new_cpu != this_cpu)

  2. Re: [patch] restore sched_exec load balance heuristics


    * Ken Chen wrote:

    > We've seen a long-standing performance regression in sys_execve across
    > several upstream kernels, largely on workloads that do heavy execve.
    > The main reason for the regression is a change in the sched_exec load
    > balance heuristics. For example, on the 2.6.11 kernel, the exec'ing
    > task will run on the same CPU if it is the only task running there.
    > However, 2.6.13 and later kernels walk the sched-domain looking for
    > the most idle CPU (and do not treat the exec'ing task's CPU as idle),
    > thus bouncing the exec'ing task all over the place, which leads to
    > poor CPU cache and NUMA locality. (The workload happens to share
    > common data between successively exec'd programs.)
    >
    > This execve heuristic was removed from the upstream kernel by this git commit:
    >
    > commit 68767a0ae428801649d510d9a65bb71feed44dd1
    > Author: Nick Piggin
    > Date: Sat Jun 25 14:57:20 2005 -0700
    >
    > [PATCH] sched: schedstats update for balance on fork
    > Add SCHEDSTAT statistics for sched-balance-fork.
    >
    > From the commit description, it appears that deleting the heuristics
    > was an accident, as the commit is supposedly just for schedstats.
    >
    > So, restore the sched-exec load balancing when the exec'ing task is
    > the only task running on that specific CPU. The logic makes sense: a
    > newly exec'd program should continue to run on the current CPU, since
    > staying put neither changes the load imbalance nor gains anything by
    > bouncing to another idle CPU. Keeping it on the same CPU preserves
    > cache and NUMA locality.
    >
    > Signed-off-by: Ken Chen
    >
    > diff --git a/kernel/sched.c b/kernel/sched.c
    > index e8819bc..4ad1907 100644
    > --- a/kernel/sched.c
    > +++ b/kernel/sched.c
    > @@ -2873,7 +2873,12 @@ out:
    >   */
    >  void sched_exec(void)
    >  {
    > -        int new_cpu, this_cpu = get_cpu();
    > +        int new_cpu, this_cpu;
    > +
    > +        if (this_rq()->nr_running <= 1)
    > +                return;
    > +
    > +        this_cpu = get_cpu();
    >          new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC);
    >          put_cpu();
    >          if (new_cpu != this_cpu)


    ok, this should be solved - but rather at the level of
    sched_balance_self(): it should never migrate this task over to
    another cpu, it should take away this task's load from the current
    CPU's load when considering migration.

    Ingo
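
    As a rough illustration of what Ingo is suggesting (a minimal sketch,
    not actual kernel code; biased_cpuload() is a hypothetical helper
    built on the existing weighted_cpuload() helper and the kernel's
    min() macro):

    /*
     * When sizing up the exec'ing task's own CPU, subtract the task's
     * own weight first, so that its own presence never makes the
     * current CPU look busier than a truly idle CPU elsewhere.
     */
    static unsigned long biased_cpuload(int cpu, int this_cpu,
                                        unsigned long self_weight)
    {
            unsigned long load = weighted_cpuload(cpu);

            if (cpu == this_cpu)
                    load -= min(load, self_weight);

            return load;
    }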

  3. Re: [patch] restore sched_exec load balance heuristics

    On Thu, Nov 6, 2008 at 12:07 PM, Ingo Molnar wrote:
    > ok, this should be solved - but rather at the level of
    > sched_balance_self(): it should never migrate this task over to
    > another cpu, it should take away this task's load from the current
    > CPU's load when considering migration.


    There are two callers of sched_balance_self(). In the sched_fork
    path, sched_balance_self() will balance the newly forked task. I
    think it is OK to bounce a newly forked task to another CPU, since
    the current CPU will be busy when fork returns in the parent process.

    And if sched_balance_self() needs to differentiate between fork /
    exec load balancing, it has to check a flag passed as a function
    argument; I think it is better to just short-circuit in sched_exec()
    directly.

    - Ken

  4. Re: [patch] restore sched_exec load balance heuristics


    * Ken Chen wrote:

    > On Thu, Nov 6, 2008 at 12:07 PM, Ingo Molnar wrote:
    > > ok, this should be solved - but rather at the level of
    > > sched_balance_self(): it should never migrate this task over to
    > > another cpu, it should take away this task's load from the current
    > > CPU's load when considering migration.

    >
    > There are two callers of sched_balance_self(). In the sched_fork
    > path, sched_balance_self() will balance the newly forked task. I
    > think it is OK to bounce a newly forked task to another CPU, since
    > the current CPU will be busy when fork returns in the parent process.
    >
    > And if sched_balance_self() needs to differentiate between fork /
    > exec load balancing, it has to check a flag passed as a function
    > argument; I think it is better to just short-circuit in sched_exec()
    > directly.


    yes, but the problem is deeper than that, and your fix only addresses
    the most obvious case: when a single task is exec()-ing. If we exec
    while there are two tasks on this CPU, and one task on every other
    CPU, we bounce the "new" task around unnecessarily just as much.

    So the best solution is to pass in not a flag, but a 'load bias'
    offset - which is 0 in the fork case and -self-weight in the exec
    case.

    Ok?

    Ingo
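
    In code, the interface Ingo is proposing might look like this (a
    sketch only; the third parameter is hypothetical at this point, and
    Peter's patch further down the thread implements it in full):

    /* fork case: the child is genuinely new load, so no discount */
    cpu = sched_balance_self(cpu, SD_BALANCE_FORK, 0);

    /* exec case: discount the exec'ing task's own weight */
    new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC,
                                 current->se.load.weight);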

  5. Re: [patch] restore sched_exec load balance heuristics

    Ken Chen wrote:

    > There are two callers of sched_balance_self(). In the sched_fork
    > path, sched_balance_self() will balance the newly forked task. I
    > think it is OK to bounce a newly forked task to another CPU, since
    > the current CPU will be busy when fork returns in the parent process.


    What about vfork()?

    > And if sched_balance_self() needs to differentiate between fork /
    > exec load balancing, it has to check a flag passed as a function
    > argument; I think it is better to just short-circuit in sched_exec()
    > directly.


    From a cleanliness perspective, it makes more sense to me for the
    decision as to whether or not to balance to be made in the "balance"
    function, not the "exec" function.

    Chris
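
    For reference on Chris's vfork() point: vfork() suspends the parent
    until the child calls execve() or _exit(), so the "current CPU will
    be busy when fork returns in the parent" assumption does not hold
    there. A small userspace illustration:

    #include <unistd.h>

    int main(void)
    {
            pid_t pid = vfork();

            if (pid == 0) {
                    /* child: the parent stays blocked until this execve() */
                    execlp("true", "true", (char *)NULL);
                    _exit(1);       /* only reached if execlp() fails */
            }
            /* parent: resumes here only after the child has exec'd */
            return 0;
    }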

  6. Re: [patch] restore sched_exec load balance heuristics

    On Thu, 2008-11-06 at 21:07 +0100, Ingo Molnar wrote:

    > ok, this should be solved - but rather at the level of
    > sched_balance_self(): it should never migrate this task over to
    > another cpu, it should take away this task's load from the current
    > CPU's load when considering migration.


    How's this?

    (compile tested only)

    ---
    Subject: sched: fix sched_exec
    From: Peter Zijlstra
    Date: Mon Nov 10 09:47:54 CET 2008

    When deciding placement for an execve() task, subtract the effect of the
    current task.

    Signed-off-by: Peter Zijlstra
    Index: linux-2.6/kernel/sched.c
    ===================================================================
    --- linux-2.6.orig/kernel/sched.c
    +++ linux-2.6/kernel/sched.c
    @@ -2035,7 +2035,8 @@ static unsigned long target_load(int cpu
      * domain.
      */
     static struct sched_group *
    -find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
    +find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu,
    +                  int org_cpu, unsigned long org_weight)
     {
             struct sched_group *idlest = NULL, *this = NULL, *group = sd->groups;
             unsigned long min_load = ULONG_MAX, this_load = 0;
    @@ -2063,6 +2064,9 @@ find_idlest_group(struct sched_domain *s
                     else
                             load = target_load(i, load_idx);

    +                if (i == org_cpu)
    +                        load -= min(org_weight, load);
    +
                     avg_load += load;
             }

    @@ -2089,7 +2093,7 @@ find_idlest_group(struct sched_domain *s
      */
     static int
     find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu,
    -                cpumask_t *tmp)
    +                cpumask_t *tmp, int org_cpu, unsigned long org_weight)
     {
             unsigned long load, min_load = ULONG_MAX;
             int idlest = -1;
    @@ -2101,6 +2105,9 @@ find_idlest_cpu(struct sched_group *grou
             for_each_cpu_mask_nr(i, *tmp) {
                     load = weighted_cpuload(i);

    +                if (i == org_cpu)
    +                        load -= org_weight;
    +
                     if (load < min_load || (load == min_load && i == this_cpu)) {
                             min_load = load;
                             idlest = i;
    @@ -2121,10 +2128,11 @@ find_idlest_cpu(struct sched_group *grou
      *
      * preempt must be disabled.
      */
    -static int sched_balance_self(int cpu, int flag)
    +static int sched_balance_self(int cpu, int flag, unsigned long org_weight)
     {
             struct task_struct *t = current;
             struct sched_domain *tmp, *sd = NULL;
    +        int org_cpu = cpu;

             for_each_domain(cpu, tmp) {
                     /*
    @@ -2150,13 +2158,14 @@ static int sched_balance_self(int cpu, i
                     }

                     span = sd->span;
    -                group = find_idlest_group(sd, t, cpu);
    +                group = find_idlest_group(sd, t, cpu, org_cpu, org_weight);
                     if (!group) {
                             sd = sd->child;
                             continue;
                     }

    -                new_cpu = find_idlest_cpu(group, t, cpu, &tmpmask);
    +                new_cpu = find_idlest_cpu(group, t, cpu, &tmpmask,
    +                                          org_cpu, org_weight);
                     if (new_cpu == -1 || new_cpu == cpu) {
                             /* Now try balancing at a lower domain level of cpu */
                             sd = sd->child;
    @@ -2365,7 +2374,7 @@ void sched_fork(struct task_struct *p, i
             __sched_fork(p);

     #ifdef CONFIG_SMP
    -        cpu = sched_balance_self(cpu, SD_BALANCE_FORK);
    +        cpu = sched_balance_self(cpu, SD_BALANCE_FORK, 0);
     #endif
             set_task_cpu(p, cpu);

    @@ -2856,7 +2865,14 @@ out:
     void sched_exec(void)
     {
             int new_cpu, this_cpu = get_cpu();
    -        new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC);
    +        struct task_group *tg;
    +        long weight, eload;
    +
    +        tg = task_group(current);
    +        weight = current->se.load.weight;
    +        eload = -effective_load(tg, this_cpu, -weight, -weight);
    +
    +        new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC, eload);
             put_cpu();
             if (new_cpu != this_cpu)
                     sched_migrate_task(current, new_cpu);
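
    To see the effect of the discount in isolation, here is a small
    standalone userspace sketch (not kernel code; CPU loads are just an
    array, and 1024 stands in for a default task weight):

    #include <stdio.h>

    #define NR_CPUS 4

    /* Pick the least-loaded CPU, discounting the exec'ing task's own
     * weight on its current CPU, analogous to find_idlest_cpu() above. */
    static int pick_idlest(const unsigned long *load,
                           int org_cpu, unsigned long org_weight)
    {
            unsigned long min_load = (unsigned long)-1;
            int idlest = -1, i;

            for (i = 0; i < NR_CPUS; i++) {
                    unsigned long l = load[i];

                    if (i == org_cpu)
                            l -= (l < org_weight) ? l : org_weight;

                    if (l < min_load) {
                            min_load = l;
                            idlest = i;
                    }
            }
            return idlest;
    }

    int main(void)
    {
            /* every CPU runs exactly one task; CPU 0's task is exec'ing */
            unsigned long load[NR_CPUS] = { 1024, 1024, 1024, 1024 };

            /* prints 0: after the discount, CPU 0 looks idle, so the
             * exec'ing task stays put instead of bouncing away */
            printf("idlest = %d\n", pick_idlest(load, 0, 1024));
            return 0;
    }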



  7. Re: [patch] restore sched_exec load balance heuristics


    * Peter Zijlstra wrote:

    >  void sched_exec(void)
    >  {
    >          int new_cpu, this_cpu = get_cpu();
    > -        new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC);
    > +        struct task_group *tg;
    > +        long weight, eload;
    > +
    > +        tg = task_group(current);
    > +        weight = current->se.load.weight;
    > +        eload = -effective_load(tg, this_cpu, -weight, -weight);
    > +
    > +        new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC, eload);


    okay, I think this will work.

    it feels somewhat backwards though on a conceptual level.

    There's nothing particularly special about exec-balancing: the load
    picture is in equilibrium - it is in essence a rebalancing pass done
    not in the scheduler tick but in a special place in the middle of
    exec() where the old-task / new-task cross section is at a minimum
    level.

    _fork_ balancing is what is special: there we'll get a new context so
    we have to take the new load into account. It's a bit like wakeup
    balancing. (just done before the new task is truly woken up)

    OTOH, triggering the regular busy-balance at exec() time isn't
    totally straightforward either: the 'old' task is the current task,
    so it cannot be balanced away. We have to trigger all the
    active-migration logic - which again makes exec() balancing special.

    So maybe this patch is the best solution after all. Ken, does it do
    the trick for your workload, when applied against v2.6.28-rc4?

    You might also want to confirm that your testcase still works fine
    if you elevate the load average by +1.0 on every CPU, by starting
    infinite CPU-eater loops via this bash one-liner:

    for ((i=0;i<2;i++)); do while :; do :; done & done

    (change the '2' to '4' if you test this on a quad, not on a dual-core
    box)

    the desired behavior would be for your "exec hopper" testcase not to
    hop between CPUs, but to stick to the same CPU most of the time.

    Ingo

  8. Re: [patch] restore sched_exec load balance heuristics

    On Mon, 2008-11-10 at 10:29 +0100, Ingo Molnar wrote:
    > * Peter Zijlstra wrote:
    >
    > >  void sched_exec(void)
    > >  {
    > >          int new_cpu, this_cpu = get_cpu();
    > > -        new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC);
    > > +        struct task_group *tg;
    > > +        long weight, eload;
    > > +
    > > +        tg = task_group(current);
    > > +        weight = current->se.load.weight;
    > > +        eload = -effective_load(tg, this_cpu, -weight, -weight);
    > > +
    > > +        new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC, eload);

    >
    > okay, i think this will work.
    >
    > it feels somewhat backwards though on a conceptual level.
    >
    > There's nothing particularly special about exec-balancing: the load
    > picture is in equilibrium - it is in essence a rebalancing pass done
    > not in the scheduler tick but in a special place in the middle of
    > exec() where the old-task / new-task cross section is at a minimum
    > level.
    >
    > _fork_ balancing is what is special: there we'll get a new context so
    > we have to take the new load into account. It's a bit like wakeup
    > balancing. (just done before the new task is truly woken up)
    >
    > OTOH, triggering the regular busy-balance at exec() time isnt totally
    > straightforward either: the 'old' task is the current task so it
    > cannot be balanced away. We have to trigger all the active-migration
    > logic - which again makes exec() balancing special.
    >
    > So maybe this patch is the best solution after all.


    Even worse: you want to balance current, but the generic load
    balancer might pick two CPUs to balance, neither of which has current
    on it. And even if it did pick the queue with current on it as
    busiest, there is no guarantee you would actually end up moving
    current.

    So this specialized form of moving current to a possibly more idle
    CPU is, AFAICS, the best solution for balancing a particular task.

