[RFC PATCH v2 0/5] sched: modular find_busiest_group() - Kernel


Thread: [RFC PATCH v2 0/5] sched: modular find_busiest_group()

  1. [RFC PATCH v2 0/5] sched: modular find_busiest_group()

    Hi,

    I have been building tunable sched_mc=N patches on top of existing
    sched_mc_power_savings code and adding more stuff to
    find_busiest_group().

    Reference:

    [1] Making power policy just work
    http://lwn.net/Articles/287924/

    [2] [RFC v1] Tunable sched_mc_power_savings=n
    http://lwn.net/Articles/287882/

    [3] [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n
    http://lwn.net/Articles/297306/

    Peter Zijlstra had suggested that it is a good idea to clean up the
    current code in find_busiest_group() before building on the existing
    power saving balance infrastructure [4]. This becomes even more
    important given that there have been recent bugs in the power
    savings code that were hard to detect and fix [5][6].

    [4] http://lkml.org/lkml/2008/9/8/103

    Reference to bugs:

    [5] sched: Fix __load_balance_iterator() for cfq with only one task
    http://lkml.org/lkml/2008/9/5/135

    [6] sched: arch_reinit_sched_domains() must destroy domains to force rebuild
    http://lkml.org/lkml/2008/8/29/191
    http://lkml.org/lkml/2008/8/29/343

    In an attempt to modularize the find_busiest_group() function and make
    it extensible to more complex load balance decisions, I have defined
    new data structures and functions and made the find_busiest_group()
    function small and readable.

    *** This is an RFC patch, with limited testing ***

    ChangeLog:
    ---------

    v2: Fixed most coding errors; able to run kernbench on a 32-bit Intel
    SMP system. Fixed errors in comments.

    v1: Initial post http://lkml.org/lkml/2008/9/24/201

    Please let me know if the approach is correct. I will test further and
    ensure the expected functionality.

    Thanks,
    Vaidy

    Signed-off-by: Vaidyanathan Srinivasan

    After applying the patch series, the function will look like this:

    /*
     * find_busiest_group finds and returns the busiest CPU group within the
     * domain. It calculates and returns the amount of weighted load which
     * should be moved to restore balance via the imbalance parameter.
     */
    static struct sched_group *
    find_busiest_group(struct sched_domain *sd, int this_cpu,
    		   unsigned long *imbalance, enum cpu_idle_type idle,
    		   int *sd_idle, const cpumask_t *cpus, int *balance)
    {
    	struct sched_group *group = sd->groups;
    	unsigned long max_pull;
    	int load_idx;
    	struct group_loads gl;
    	struct sd_loads sdl;

    	memset(&sdl, 0, sizeof(sdl));
    	sdl.sd = sd;

    	/* Get the load index corresponding to cpu idle state */
    	load_idx = get_load_idx(sd, idle);

    	do {
    		int need_balance;

    		need_balance = get_group_loads(group, this_cpu, cpus, idle,
    					       load_idx, &gl);

    		if (*sd_idle && gl.nr_running)
    			*sd_idle = 0;

    		if (!need_balance && balance) {
    			*balance = 0;
    			*imbalance = 0;
    			return NULL;
    		}

    		/* Compare groups and find busiest non-local group */
    		update_sd_loads(&sdl, &gl);
    		/* Compare groups and find power saving candidates */
    		update_powersavings_group_loads(&sdl, &gl, idle);

    		group = group->next;
    	} while (group != sd->groups);

    	if (!sdl.busiest.group ||
    	    sdl.local.load_per_cpu >= sdl.max_load ||
    	    sdl.busiest.nr_running == 0)
    		goto out_balanced;

    	sdl.load_per_cpu = (SCHED_LOAD_SCALE * sdl.load) / sdl.cpu_power;

    	if (sdl.local.load_per_cpu >= sdl.load_per_cpu ||
    	    100*sdl.busiest.load_per_cpu <=
    			sd->imbalance_pct*sdl.local.load_per_cpu)
    		goto out_balanced;

    	if (sdl.busiest.group_imbalance)
    		sdl.busiest.avg_load_per_task =
    			min(sdl.busiest.avg_load_per_task, sdl.load_per_cpu);

    	/*
    	 * We're trying to get all the cpus to the average_load, so we don't
    	 * want to push ourselves above the average load, nor do we wish to
    	 * reduce the max loaded cpu below the average load, as either of these
    	 * actions would just result in more rebalancing later, and ping-pong
    	 * tasks around. Thus we look for the minimum possible imbalance.
    	 * Negative imbalances (*we* are more loaded than anyone else) will
    	 * be counted as no imbalance for these purposes -- we can't fix that
    	 * by pulling tasks to us. Be careful of negative numbers as they'll
    	 * appear as very large values with unsigned longs.
    	 */
    	if (sdl.busiest.load_per_cpu <= sdl.busiest.avg_load_per_task)
    		goto out_balanced;

    	/*
    	 * In the presence of smp nice balancing, certain scenarios can have
    	 * max load less than avg load (as we skip the groups at or below
    	 * its cpu_power, while calculating max_load..)
    	 * In this condition attempt to adjust the imbalance parameter
    	 * in the small_imbalance functions.
    	 *
    	 * Now if max_load is more than avg load, balancing is needed,
    	 * find the exact number of tasks to be moved.
    	 */
    	if (sdl.busiest.load_per_cpu >= sdl.load_per_cpu) {

    		/*
    		 * Don't want to pull so many tasks that
    		 * a group would go idle
    		 */
    		max_pull = min(sdl.busiest.load_per_cpu - sdl.load_per_cpu,
    			       sdl.busiest.load_per_cpu -
    			       sdl.busiest.avg_load_per_task);

    		/* How much load to actually move to equalise the imbalance */
    		*imbalance = min(max_pull * sdl.busiest.group->__cpu_power,
    				 (sdl.load_per_cpu - sdl.local.load_per_cpu) *
    				 sdl.local.group->__cpu_power) /
    				 SCHED_LOAD_SCALE;

    		/* If we have adjusted the required imbalance, then return */
    		if (*imbalance >= sdl.busiest.avg_load_per_task)
    			return sdl.busiest.group;
    	}

    	/*
    	 * if *imbalance is less than the average load per runnable task
    	 * there is no guarantee that any tasks will be moved so we'll have
    	 * a think about bumping its value to force at least one task to be
    	 * moved
    	 */
    	*imbalance = 0; /* Will be adjusted below */

    	if (small_imbalance_one_task(&sdl, imbalance))
    		return sdl.busiest.group;

    	/* Further look for effective cpu power utilisation */
    	small_imbalance_optimize_cpu_power(&sdl, imbalance);

    	/*
    	 * Unconditional return, we have tried all possible means to adjust
    	 * the imbalance for effective task move
    	 */
    	return sdl.busiest.group;

    out_balanced:
    	/* Try opportunity for power save balance */
    	return powersavings_balance_group(&sdl, &gl, idle, imbalance);
    }

    ---

    Vaidyanathan Srinivasan (5):
    sched: split find_busiest_group()
    sched: small imbalance corrections
    sched: collect statistics required for powersave balance
    sched: calculate statistics for current load balance domain
    sched: load calculation for each group in sched domain


    kernel/sched.c | 627 ++++++++++++++++++++++++++++++++++----------------------
    1 files changed, 384 insertions(+), 243 deletions(-)

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. [RFC PATCH v2 5/5] sched: split find_busiest_group()

    Make use of the defined helper functions and data structures and
    split the find_busiest_group() function into smaller modular functions.

    Signed-off-by: Vaidyanathan Srinivasan
    ---

    kernel/sched.c | 328 +++++++++++---------------------------------------------
    1 files changed, 67 insertions(+), 261 deletions(-)

    diff --git a/kernel/sched.c b/kernel/sched.c
    index cf1aae1..004c065 100644
    --- a/kernel/sched.c
    +++ b/kernel/sched.c
    @@ -3398,7 +3398,6 @@ static struct sched_group *powersavings_balance_group(struct sd_loads *sdl,
    return NULL;
    }
    #endif
    -
    /*
    * find_busiest_group finds and returns the busiest CPU group within the
    * domain. It calculates and returns the amount of weighted load which
    @@ -3409,208 +3408,56 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
    unsigned long *imbalance, enum cpu_idle_type idle,
    int *sd_idle, const cpumask_t *cpus, int *balance)
    {
    - struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups;
    - unsigned long max_load, avg_load, total_load, this_load, total_pwr;
    + struct sched_group *group = sd->groups;
    unsigned long max_pull;
    - unsigned long busiest_load_per_task, busiest_nr_running;
    - unsigned long this_load_per_task, this_nr_running;
    - int load_idx, group_imb = 0;
    -#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
    - int power_savings_balance = 1;
    - unsigned long leader_nr_running = 0, min_load_per_task = 0;
    - unsigned long min_nr_running = ULONG_MAX;
    - struct sched_group *group_min = NULL, *group_leader = NULL;
    -#endif
    + int load_idx;
    + struct group_loads gl;
    + struct sd_loads sdl;

    - max_load = this_load = total_load = total_pwr = 0;
    - busiest_load_per_task = busiest_nr_running = 0;
    - this_load_per_task = this_nr_running = 0;
    + memset(&sdl, 0, sizeof(sdl));
    + sdl.sd = sd;

    - if (idle == CPU_NOT_IDLE)
    - load_idx = sd->busy_idx;
    - else if (idle == CPU_NEWLY_IDLE)
    - load_idx = sd->newidle_idx;
    - else
    - load_idx = sd->idle_idx;
    + /* Get the load index corresponding to cpu idle state */
    + load_idx = get_load_idx(sd, idle);

    do {
    - unsigned long load, group_capacity, max_cpu_load, min_cpu_load;
    - int local_group;
    - int i;
    - int __group_imb = 0;
    - unsigned int balance_cpu = -1, first_idle_cpu = 0;
    - unsigned long sum_nr_running, sum_weighted_load;
    - unsigned long sum_avg_load_per_task;
    - unsigned long avg_load_per_task;
    -
    - local_group = cpu_isset(this_cpu, group->cpumask);
    -
    - if (local_group)
    - balance_cpu = first_cpu(group->cpumask);
    + int need_balance;

    - /* Tally up the load of all CPUs in the group */
    - sum_weighted_load = sum_nr_running = avg_load = 0;
    - sum_avg_load_per_task = avg_load_per_task = 0;
    -
    - max_cpu_load = 0;
    - min_cpu_load = ~0UL;
    -
    - for_each_cpu_mask_nr(i, group->cpumask) {
    - struct rq *rq;
    -
    - if (!cpu_isset(i, *cpus))
    - continue;
    + need_balance = get_group_loads(group, this_cpu, cpus, idle,
    + load_idx, &gl);

    - rq = cpu_rq(i);
    + if (*sd_idle && gl.nr_running)
    + *sd_idle = 0;

    - if (*sd_idle && rq->nr_running)
    - *sd_idle = 0;
    -
    - /* Bias balancing toward cpus of our domain */
    - if (local_group) {
    - if (idle_cpu(i) && !first_idle_cpu) {
    - first_idle_cpu = 1;
    - balance_cpu = i;
    - }
    -
    - load = target_load(i, load_idx);
    - } else {
    - load = source_load(i, load_idx);
    - if (load > max_cpu_load)
    - max_cpu_load = load;
    - if (min_cpu_load > load)
    - min_cpu_load = load;
    - }
    -
    - avg_load += load;
    - sum_nr_running += rq->nr_running;
    - sum_weighted_load += weighted_cpuload(i);
    -
    - sum_avg_load_per_task += cpu_avg_load_per_task(i);
    - }
    -
    - /*
    - * First idle cpu or the first cpu(busiest) in this sched group
    - * is eligible for doing load balancing at this and above
    - * domains. In the newly idle case, we will allow all the cpu's
    - * to do the newly idle load balance.
    - */
    - if (idle != CPU_NEWLY_IDLE && local_group &&
    - balance_cpu != this_cpu && balance) {
    + if (!need_balance && balance) {
    *balance = 0;
    - goto ret;
    + *imbalance = 0;
    + return NULL;
    }

    - total_load += avg_load;
    - total_pwr += group->__cpu_power;
    + /* Compare groups and find busiest non-local group */
    + update_sd_loads(&sdl, &gl);
    + /* Compare groups and find power saving candidates */
    + update_powersavings_group_loads(&sdl, &gl, idle);

    - /* Adjust by relative CPU power of the group */
    - avg_load = sg_div_cpu_power(group,
    - avg_load * SCHED_LOAD_SCALE);
    -
    -
    - /*
    - * Consider the group unbalanced when the imbalance is larger
    - * than the average weight of two tasks.
    - *
    - * APZ: with cgroup the avg task weight can vary wildly and
    - * might not be a suitable number - should we keep a
    - * normalized nr_running number somewhere that negates
    - * the hierarchy?
    - */
    - avg_load_per_task = sg_div_cpu_power(group,
    - sum_avg_load_per_task * SCHED_LOAD_SCALE);
    -
    - if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task)
    - __group_imb = 1;
    -
    - group_capacity = group->__cpu_power / SCHED_LOAD_SCALE;
    -
    - if (local_group) {
    - this_load = avg_load;
    - this = group;
    - this_nr_running = sum_nr_running;
    - this_load_per_task = sum_weighted_load;
    - } else if (avg_load > max_load &&
    - (sum_nr_running > group_capacity || __group_imb)) {
    - max_load = avg_load;
    - busiest = group;
    - busiest_nr_running = sum_nr_running;
    - busiest_load_per_task = sum_weighted_load;
    - group_imb = __group_imb;
    - }
    -
    -#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
    - /*
    - * Busy processors will not participate in power savings
    - * balance.
    - */
    - if (idle == CPU_NOT_IDLE ||
    - !(sd->flags & SD_POWERSAVINGS_BALANCE))
    - goto group_next;
    -
    - /*
    - * If the local group is idle or completely loaded
    - * no need to do power savings balance at this domain
    - */
    - if (local_group && (this_nr_running >= group_capacity ||
    - !this_nr_running))
    - power_savings_balance = 0;
    -
    - /*
    - * If a group is already running at full capacity or idle,
    - * don't include that group in power savings calculations
    - */
    - if (!power_savings_balance || sum_nr_running >= group_capacity
    - || !sum_nr_running)
    - goto group_next;
    -
    - /*
    - * Calculate the group which has the least non-idle load.
    - * This is the group from where we need to pick up the load
    - * for saving power
    - */
    - if ((sum_nr_running < min_nr_running) ||
    - (sum_nr_running == min_nr_running &&
    - first_cpu(group->cpumask) <
    - first_cpu(group_min->cpumask))) {
    - group_min = group;
    - min_nr_running = sum_nr_running;
    - min_load_per_task = sum_weighted_load /
    - sum_nr_running;
    - }
    -
    - /*
    - * Calculate the group which is almost near its
    - * capacity but still has some space to pick up some load
    - * from other group and save more power
    - */
    - if (sum_nr_running <= group_capacity - 1) {
    - if (sum_nr_running > leader_nr_running ||
    - (sum_nr_running == leader_nr_running &&
    - first_cpu(group->cpumask) >
    - first_cpu(group_leader->cpumask))) {
    - group_leader = group;
    - leader_nr_running = sum_nr_running;
    - }
    - }
    -group_next:
    -#endif
    group = group->next;
    } while (group != sd->groups);

    - if (!busiest || this_load >= max_load || busiest_nr_running == 0)
    + if (!sdl.busiest.group ||
    + sdl.local.load_per_cpu >= sdl.max_load ||
    + sdl.busiest.nr_running == 0)
    goto out_balanced;

    - avg_load = (SCHED_LOAD_SCALE * total_load) / total_pwr;
    + sdl.load_per_cpu = (SCHED_LOAD_SCALE * sdl.load) / sdl.cpu_power;

    - if (this_load >= avg_load ||
    - 100*max_load <= sd->imbalance_pct*this_load)
    + if (sdl.local.load_per_cpu >= sdl.load_per_cpu ||
    + 100*sdl.busiest.load_per_cpu <=
    + sd->imbalance_pct*sdl.local.load_per_cpu)
    goto out_balanced;

    - busiest_load_per_task /= busiest_nr_running;
    - if (group_imb)
    - busiest_load_per_task = min(busiest_load_per_task, avg_load);
    + if (sdl.busiest.group_imbalance)
    + sdl.busiest.avg_load_per_task =
    + min(sdl.busiest.avg_load_per_task, sdl.load_per_cpu);

    /*
    * We're trying to get all the cpus to the average_load, so we don't
    @@ -3623,104 +3470,63 @@ group_next:
    * by pulling tasks to us. Be careful of negative numbers as they'll
    * appear as very large values with unsigned longs.
    */
    - if (max_load <= busiest_load_per_task)
    + if (sdl.busiest.load_per_cpu <= sdl.busiest.avg_load_per_task)
    goto out_balanced;

    /*
    * In the presence of smp nice balancing, certain scenarios can have
    * max load less than avg load(as we skip the groups at or below
    * its cpu_power, while calculating max_load..)
    + * In this condition attempt to adjust the imbalance parameter
    + * in the small_imbalance functions.
    + *
    + * Now if max_load is more than avg load, balancing is needed,
    + * find the exact number of tasks to be moved.
    */
    - if (max_load < avg_load) {
    - *imbalance = 0;
    - goto small_imbalance;
    - }
    + if (sdl.busiest.load_per_cpu >= sdl.load_per_cpu) {
    +
    + /* Don't want to pull so many tasks that
    + * a group would go idle
    + */
    + max_pull = min(sdl.busiest.load_per_cpu - sdl.load_per_cpu,
    + sdl.busiest.load_per_cpu -
    + sdl.busiest.avg_load_per_task);
    +
    + /* How much load to actually move to equalise the imbalance */
    + *imbalance = min(max_pull * sdl.busiest.group->__cpu_power,
    + (sdl.load_per_cpu - sdl.local.load_per_cpu) *
    + sdl.local.group->__cpu_power) /
    + SCHED_LOAD_SCALE;

    - /* Don't want to pull so many tasks that a group would go idle */
    - max_pull = min(max_load - avg_load, max_load - busiest_load_per_task);
    + /* If we have adjusted the required imbalance, then return */
    + if (*imbalance >= sdl.busiest.avg_load_per_task)
    + return sdl.busiest.group;

    - /* How much load to actually move to equalise the imbalance */
    - *imbalance = min(max_pull * busiest->__cpu_power,
    - (avg_load - this_load) * this->__cpu_power)
    - / SCHED_LOAD_SCALE;
    + }

    /*
    * if *imbalance is less than the average load per runnable task
    - * there is no gaurantee that any tasks will be moved so we'll have
    + * there is no guarantee that any tasks will be moved so we'll have
    * a think about bumping its value to force at least one task to be
    * moved
    */
    - if (*imbalance < busiest_load_per_task) {
    - unsigned long tmp, pwr_now, pwr_move;
    - unsigned int imbn;
    -
    -small_imbalance:
    - pwr_move = pwr_now = 0;
    - imbn = 2;
    - if (this_nr_running) {
    - this_load_per_task /= this_nr_running;
    - if (busiest_load_per_task > this_load_per_task)
    - imbn = 1;
    - } else
    - this_load_per_task = cpu_avg_load_per_task(this_cpu);
    -
    - if (max_load - this_load + 2*busiest_load_per_task >=
    - busiest_load_per_task * imbn) {
    - *imbalance = busiest_load_per_task;
    - return busiest;
    - }
    + *imbalance = 0; /* Will be adjusted below */

    - /*
    - * OK, we don't have enough imbalance to justify moving tasks,
    - * however we may be able to increase total CPU power used by
    - * moving them.
    - */
    + if (small_imbalance_one_task(&sdl, imbalance))
    + return sdl.busiest.group;

    - pwr_now += busiest->__cpu_power *
    - min(busiest_load_per_task, max_load);
    - pwr_now += this->__cpu_power *
    - min(this_load_per_task, this_load);
    - pwr_now /= SCHED_LOAD_SCALE;
    -
    - /* Amount of load we'd subtract */
    - tmp = sg_div_cpu_power(busiest,
    - busiest_load_per_task * SCHED_LOAD_SCALE);
    - if (max_load > tmp)
    - pwr_move += busiest->__cpu_power *
    - min(busiest_load_per_task, max_load - tmp);
    -
    - /* Amount of load we'd add */
    - if (max_load * busiest->__cpu_power <
    - busiest_load_per_task * SCHED_LOAD_SCALE)
    - tmp = sg_div_cpu_power(this,
    - max_load * busiest->__cpu_power);
    - else
    - tmp = sg_div_cpu_power(this,
    - busiest_load_per_task * SCHED_LOAD_SCALE);
    - pwr_move += this->__cpu_power *
    - min(this_load_per_task, this_load + tmp);
    - pwr_move /= SCHED_LOAD_SCALE;
    + /* Further look for effective cpu power utilisation */
    + small_imbalance_optimize_cpu_power(&sdl, imbalance);

    - /* Move if we gain throughput */
    - if (pwr_move > pwr_now)
    - *imbalance = busiest_load_per_task;
    - }
    -
    - return busiest;
    + /*
    + * Unconditional return, we have tried all possible means to adjust
    + * the imbalance for effective task move
    + */
    + return sdl.busiest.group;

    out_balanced:
    -#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
    - if (idle == CPU_NOT_IDLE || !(sd->flags & SD_POWERSAVINGS_BALANCE))
    - goto ret;
    -
    - if (this == group_leader && group_leader != group_min) {
    - *imbalance = min_load_per_task;
    - return group_min;
    - }
    -#endif
    -ret:
    - *imbalance = 0;
    - return NULL;
    + /* Try opportunity for power save balance */
    + return powersavings_balance_group(&sdl, &gl, idle, imbalance);
    }

    /*


  3. [RFC PATCH v2 4/5] sched: small imbalance corrections

    Add functions to bump up the imbalance to eventually initiate
    a task move to balance across groups.

    Signed-off-by: Vaidyanathan Srinivasan
    ---

    kernel/sched.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    1 files changed, 74 insertions(+), 0 deletions(-)

    diff --git a/kernel/sched.c b/kernel/sched.c
    index c99b5bd..cf1aae1 100644
    --- a/kernel/sched.c
    +++ b/kernel/sched.c
    @@ -3229,6 +3229,80 @@ void update_sd_loads(struct sd_loads *sdl, struct group_loads *gl)
    }
    }

    +/* Bump up imbalance to one task so that some task movement can happen */
    +
    +int small_imbalance_one_task(struct sd_loads *sdl, unsigned long *imbalance)
    +{
    + unsigned int imbn;
    + imbn = 2;
    + if (sdl->local.nr_running) {
    + if (sdl->busiest.avg_load_per_task >
    + sdl->local.avg_load_per_task)
    + imbn = 1;
    + }
    +
    + if (sdl->busiest.load_per_cpu - sdl->local.load_per_cpu +
    + 2*sdl->busiest.avg_load_per_task >=
    + sdl->busiest.avg_load_per_task * imbn) {
    + *imbalance = sdl->busiest.avg_load_per_task;
    + return 1;
    + }
    + return 0;
    +}
    +
    +/*
    + * Adjust imbalance to move task if the result of the move will
    + * yield better use of cpu power
    + */
    +
    +void small_imbalance_optimize_cpu_power(struct sd_loads *sdl,
    + unsigned long *imbalance)
    +{
    + unsigned long tmp, pwr_now, pwr_move;
    + pwr_move = pwr_now = 0;
    +
    + /*
    + * OK, we don't have enough imbalance to justify moving tasks,
    + * however we may be able to increase total CPU power used by
    + * moving them.
    + */
    +
    + pwr_now += sdl->busiest.group->__cpu_power *
    + min(sdl->busiest.avg_load_per_task,
    + sdl->busiest.load_per_cpu);
    + pwr_now += sdl->local.group->__cpu_power *
    + min(sdl->local.avg_load_per_task,
    + sdl->local.load_per_cpu);
    + pwr_now /= SCHED_LOAD_SCALE;
    +
    + /* Amount of load we'd subtract */
    + tmp = sg_div_cpu_power(sdl->busiest.group,
    + sdl->busiest.avg_load_per_task * SCHED_LOAD_SCALE);
    + if (sdl->busiest.load_per_cpu > tmp)
    + pwr_move += sdl->busiest.group->__cpu_power *
    + min(sdl->busiest.avg_load_per_task,
    + sdl->busiest.load_per_cpu - tmp);
    +
    + /* Amount of load we'd add */
    + if (sdl->busiest.load_per_cpu * sdl->busiest.group->__cpu_power <
    + sdl->busiest.avg_load_per_task * SCHED_LOAD_SCALE)
    + tmp = sg_div_cpu_power(sdl->local.group,
    + sdl->busiest.load_per_cpu *
    + sdl->busiest.group->__cpu_power);
    + else
    + tmp = sg_div_cpu_power(sdl->local.group,
    + sdl->busiest.avg_load_per_task * SCHED_LOAD_SCALE);
    + pwr_move += sdl->local.group->__cpu_power *
    + min(sdl->local.avg_load_per_task,
    + sdl->local.load + tmp);
    + pwr_move /= SCHED_LOAD_SCALE;
    +
    + /* Move if we gain throughput */
    + if (pwr_move > pwr_now)
    + *imbalance = sdl->busiest.avg_load_per_task;
    +
    +}
    +
    #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
    void update_powersavings_group_loads(struct sd_loads *sdl,
    struct group_loads *gl,


  4. [RFC PATCH v2 3/5] sched: collect statistics required for powersave balance

    Update sched domain level statistics with the minimum load and
    group leader who can pull more tasks. Also suggest a powersave
    movement if the domain is otherwise balanced.

    Signed-off-by: Vaidyanathan Srinivasan
    ---

    kernel/sched.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    1 files changed, 96 insertions(+), 0 deletions(-)

    diff --git a/kernel/sched.c b/kernel/sched.c
    index cfd83d9..c99b5bd 100644
    --- a/kernel/sched.c
    +++ b/kernel/sched.c
    @@ -3229,6 +3229,102 @@ void update_sd_loads(struct sd_loads *sdl, struct group_loads *gl)
    }
    }

    +#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
    +void update_powersavings_group_loads(struct sd_loads *sdl,
    + struct group_loads *gl,
    + enum cpu_idle_type idle)
    +{
    + int group_capacity = gl->group->__cpu_power / SCHED_LOAD_SCALE;
    +
    + /*
    + * Busy processors will not participate in power savings
    + * balance.
    + */
    + if (idle == CPU_NOT_IDLE ||
    + !(sdl->sd->flags & SD_POWERSAVINGS_BALANCE))
    + return;
    +
    + /*
    + * If this group is idle or completely loaded, then there
    + * is no opportunity to do power savings balance with this group
    + */
    + if (gl->nr_running >= group_capacity || gl->nr_running == 0)
    + return;
    +
    + /*
    + * Calculate the group which has the least non-idle load.
    + * This is the group from where we need to pick up the load
    + * for saving power
    + */
    + if (!sdl->min_load_group.group)
    + sdl->min_load_group = *gl;
    + else {
    + if (gl->nr_running < sdl->min_load_group.nr_running)
    + sdl->min_load_group = *gl;
    + /* If the loads are equal, then prefer the cpu with
    + * less logical number
    + */
    + else if (gl->nr_running == sdl->min_load_group.nr_running &&
    + first_cpu(gl->group->cpumask) <
    + first_cpu(sdl->min_load_group.group->cpumask))
    + sdl->min_load_group = *gl;
    + }
    +
    + /*
    + * Calculate the group which is almost near its
    + * capacity but still has some space to pick up some load
    + * from other group and save more power
    + */
    +
    + if (gl->nr_running > 0 && gl->nr_running <= group_capacity - 1) {
    + if (!sdl->power_save_leader_group.group)
    + sdl->power_save_leader_group = *gl;
    + else {
    + if (gl->nr_running >
    + sdl->power_save_leader_group.nr_running)
    + sdl->power_save_leader_group = *gl;
    + else if (gl->nr_running ==
    + sdl->power_save_leader_group.nr_running &&
    + first_cpu(gl->group->cpumask) <
    + first_cpu(sdl->min_load_group.group->cpumask))
    + sdl->power_save_leader_group = *gl;
    + }
    + }
    +}
    +
    +static struct sched_group *powersavings_balance_group(struct sd_loads *sdl,
    + struct group_loads *gl, enum cpu_idle_type idle,
    + unsigned long *imbalance)
    +{
    + *imbalance = 0;
    + if (idle == CPU_NOT_IDLE || !(sdl->sd->flags & SD_POWERSAVINGS_BALANCE))
    + return NULL;
    +
    + if (sdl->local.group == sdl->power_save_leader_group.group &&
    + sdl->power_save_leader_group.group !=
    + sdl->min_load_group.group) {
    + *imbalance = sdl->min_load_group.avg_load_per_task;
    + return sdl->min_load_group.group;
    + }
    +
    + return NULL;
    +}
    +#else
    +void update_powersavings_group_loads(struct sd_loads *sdl,
    + struct group_loads *gl, enum cpu_idle_type idle)
    +{
    + return;
    +}
    +
    +static struct sched_group *powersavings_balance_group(struct sd_loads *sdl,
    + struct group_loads *gl, enum cpu_idle_type idle,
    + unsigned long *imbalance)
    +{
    + *imbalance = 0;
    + return NULL;
    +}
    +#endif
    +
    /*
    * find_busiest_group finds and returns the busiest CPU group within the
    * domain. It calculates and returns the amount of weighted load which


  5. [RFC PATCH v2 2/5] sched: calculate statistics for current load balance domain

    Add data structures and a function to collect per sched domain
    statistics required for load balance decisions.

    Signed-off-by: Vaidyanathan Srinivasan
    ---

    kernel/sched.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
    1 files changed, 51 insertions(+), 0 deletions(-)

    diff --git a/kernel/sched.c b/kernel/sched.c
    index ab77937..cfd83d9 100644
    --- a/kernel/sched.c
    +++ b/kernel/sched.c
    @@ -3178,6 +3178,57 @@ int get_group_loads(struct sched_group *group, int this_cpu,
    return need_balance;
    }

    +/* Struct to hold sched domain level loads */
    +
    +struct sd_loads {
    + struct sched_domain *sd;
    + unsigned long load; /* Total decay load */
    + unsigned long cpu_power; /* Total cpu_power */
    + unsigned long max_load; /* load of busiest group */
    + unsigned long load_per_cpu; /* Decay load per cpu in this sd
    + * (load/cpu_power)
    + */
    + struct group_loads local; /* this_cpu's group,
    + * need to pull from busiest
    + */
    + struct group_loads busiest; /* Candidate group to pull */
    + /* sched_mc power_save balance groups */
    + struct group_loads min_load_group; /* Group that is least loaded and
    + * can fit the tasks into the
    + * leader group
    + */
    + struct group_loads power_save_leader_group; /* Group that is almost
    + * full but can still pull
    + * tasks from
    + * min_load_group
    + */
    +};
    +
    +/* Function to aggregate per group loads to sched domain loads */
    +
    +void update_sd_loads(struct sd_loads *sdl, struct group_loads *gl)
    +{
    + int group_capacity = gl->group->__cpu_power / SCHED_LOAD_SCALE;
    + sdl->max_load = 0;
    +
    + if (gl->local_group)
    + sdl->local = *gl;
    + else {
    + sdl->load += gl->load;
    + sdl->cpu_power += gl->group->__cpu_power;
    +
    +
    + /* Find the busiest */
    + if (gl->load > sdl->max_load &&
    + (gl->nr_running > group_capacity ||
    + gl->group_imbalance)) {
    +
    + sdl->max_load = gl->load;
    + sdl->busiest = *gl;
    + }
    + }
    +}
    +
    /*
    * find_busiest_group finds and returns the busiest CPU group within the
    * domain. It calculates and returns the amount of weighted load which


  6. Re: [RFC PATCH v2 0/5] sched: modular find_busiest_group()


    Hi,

    I've looked over the last posting, but so far failed to respond :/

    While I like the clean-up, I'm atm a bit reluctant to apply, as I'm
    trying to rework the sched_domain stuff and the HT balancing and I'm not
    quite sure how this will interact with your patches.

    I hope to share my proposal (in talk - no patches yet) soonish (this
    weekend, early next week) so that we can discuss it.

    Thanks for your patience,

    Peter


  7. Re: [RFC PATCH v2 0/5] sched: modular find_busiest_group()

    * Peter Zijlstra [2008-10-09 16:19:28]:

    >
    > Hi,
    >
    > I've looked over the last posting, but so far failed to respond :/
    >
    > While I like the clean-up, I'm atm a bit reluctant to apply, as I'm
    > trying to rework the sched_domain stuff and the HT balancing and I'm not
    > quite sure how this will interact with your patches.


    Hi Peter,

    Thanks for reviewing the changes. I just wanted initial feedback
    before I spend too much time on this approach. I am interested in HT
    balancing stuff and I wanted to study the behaviour and improve the
    power saving heuristics there as well.

    > I hope to share my proposal (in talk - no patches yet) soonish (this
    > weekend, early next week) so that we can discuss it.


    I am looking forward to it. We can definitely rework the
    implementation to seamlessly take on power saving heuristics.

    Thanks,
    Vaidy

  8. Re: [RFC PATCH v2 0/5] sched: modular find_busiest_group()


    Hi,

    So the basic issue is sched_group::cpu_power should become more dynamic.

    There are two driving factors:
    - RT time consumption feedback into CFS
    - dynamic per-cpu power management like Intel's Dynamic Speed
    Technology (formerly known as Turbo Mode).

    We currently have sched_group::cpu_power to model SMT. We say that
    multiple threads that share a core are not as powerful as two cores.

    Therefore, we move a task when doing so results in more of that power
    being utilized, resulting in preferring to run tasks on full cores
    instead of SMT siblings.


    RT time
    -------

    So the basic issue is that we cannot know how much cpu-time will be
    consumed by RT tasks (we used to relate that to the number of running
    tasks, but that's utter nonsense).

    Therefore the only way is to measure it and assume the near future looks
    like the near past.

    So why is this an issue.. suppose we have 2 cpus, and 1 cpu is consumed
    for 50% by RT tasks, while the other is fully available to regular
    tasks.

    In that case we'd want to load-balance such that the cpu affected by the
    RT task(s) gets half the load the other cpu has.

    [ I tried modelling this by scaling the load of cpus up, but that fails
    to handle certain cases - for instance 100% RT gets real funny, and it
    fails to properly skip the RT-loaded cores in the low-load situation ]
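One way to model the alternative, purely as a sketch (the helper names `cfs_capacity` and `fair_share` are hypothetical, not from the patch set), is to scale a cpu's capacity down by its measured RT consumption and hand out load in proportion to the remaining capacity:

```c
#include <assert.h>

#define SCHED_LOAD_SCALE 1024UL

/* Hypothetical sketch: scale a cpu's SCHED_OTHER capacity down by
 * measured RT time instead of scaling its load up.  rt_avg and period
 * are in the same time units over the last balancing window. */
static unsigned long cfs_capacity(unsigned long rt_avg, unsigned long period)
{
	if (rt_avg >= period)
		return 0;	/* 100% RT: nothing left for SCHED_OTHER */
	return SCHED_LOAD_SCALE * (period - rt_avg) / period;
}

/* Split a total load between cpus in proportion to remaining capacity. */
static unsigned long fair_share(unsigned long total_load,
				unsigned long cap, unsigned long cap_sum)
{
	if (!cap_sum)
		return 0;
	return total_load * cap / cap_sum;
}
```

With a 50%-RT cpu and a fully free cpu, the RT-loaded one retains half the capacity and therefore receives half the load of the other, matching the target above; the 100%-RT case degenerates cleanly to zero capacity rather than producing funny load values.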

    Dynamic Speed Technology
    ------------------------

    With cpus actively fiddling with their processing capacity we get into
    similar issues. Again we can measure this, but this would require the
    addition of a clock that measures work instead of time.

    Having that, we can even accurately measure the old SMT case, which has
    always been approximated by a static percentage - even though the actual
    gain is very workload dependent.

    The idea is to introduce sched_work_clock() so that:

    work_delta / time_delta gives the power for a cpu. <1 means we
    did less work than a dedicated pipeline, >1 means we did more.
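A minimal fixed-point sketch of that ratio (the helper name is hypothetical; a kernel implementation would avoid floating point and scale by SCHED_LOAD_SCALE so that 1024 represents a dedicated full-speed pipeline):

```c
#include <assert.h>

#define SCHED_LOAD_SCALE 1024UL

/* Hypothetical sketch of the proposed sched_work_clock() idea:
 * power = work_delta / time_delta, fixed-point scaled so that
 * SCHED_LOAD_SCALE (1024) means "one dedicated pipeline".
 * Values below 1024 mean less work than a dedicated pipeline,
 * above 1024 mean more. */
static unsigned long cpu_power_from_work(unsigned long work_delta,
					 unsigned long time_delta)
{
	if (!time_delta)
		return SCHED_LOAD_SCALE;	/* no data: assume nominal */
	return work_delta * SCHED_LOAD_SCALE / time_delta;
}
```

For the boost example below, a core doing 2.2 units of work per 2.0 units of time comes out above 1024 while its 1.8GHz sibling comes out below, and an SMT thread at 0.6 work per unit time lands under 1024 even though two such siblings together exceed it.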

    So, if for example our core's canonical freq would be 2.0GHz but we get
    boosted to 2.2GHz while the other core would get lowered to 1.8GHz we
    can observe and attribute this asymmetric power balance.

    [ This assumes that the total power is preserved for non-idle situations
    - is that true? If not, this gets real interesting ]

    Also, an SMT thread, when sharing the core with its sibling will get <1,
    but together they might be >1.


    Funny corner cases
    ------------------

    Like mentioned in the RT time note, there is the possibility that a core
    has 0 power (left) for SCHED_OTHER. This has a consequence for the
    balance cpu. Currently we let the first cpu in the domain do the
    balancing, however if that CPU has 0 power it might not be the best
    choice (esp since part of the balancing can be done from softirq context
    - which would basically starve that group).



    Sched domains
    -------------

    There seems to be a use-case where we need both the cache and the
    package levels. So I wanted to have both levels in there.

    Currently each domain level can only be one of:

    SD_LV_NONE = 0,
    SD_LV_SIBLING,
    SD_LV_MC,
    SD_LV_CPU,
    SD_LV_NODE,
    SD_LV_ALLNODES,

    So to avoid a double domain with 'native' multi-core chips where the
    cache and package level have the same span, I want to encode this
    information in the sched_domain::flags as bits, which means a level can
    be both cache and package.
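A sketch of such an encoding (the `_BIT` names and the model struct are made up for illustration; the real change would live in sched_domain::flags):

```c
#include <assert.h>

/* Hypothetical encoding: domain levels as flag bits rather than an
 * exclusive enum, so one level can be both the cache (MC) and the
 * package (CPU) domain on 'native' multi-core chips. */
#define SD_LV_SIBLING_BIT	(1 << 0)
#define SD_LV_MC_BIT		(1 << 1)	/* cache level */
#define SD_LV_CPU_BIT		(1 << 2)	/* package level */
#define SD_LV_NODE_BIT		(1 << 3)
#define SD_LV_ALLNODES_BIT	(1 << 4)

struct sched_domain_model {
	unsigned int flags;
};

/* True if the domain is (at least) all of the requested levels. */
static int domain_is(const struct sched_domain_model *sd, unsigned int bits)
{
	return (sd->flags & bits) == bits;
}
```

A chip whose cache and package spans coincide then gets a single domain carrying both bits instead of two stacked domains with identical spans.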


    Over balancing
    --------------

    Lastly, we might need to introduce SD_OVER_BALANCE, which toggles the
    over-balance logic. While over-balancing brings better fairness for a
    number of cases, it's also hard on power savings.





  9. Re: [RFC PATCH v2 0/5] sched: modular find_busiest_group()

    * Peter Zijlstra [2008-10-14 14:09:13]:

    >
    > Hi,
    >
    > So the basic issue is sched_group::cpu_power should become more dynamic.


    Hi Peter,

    This is a good idea. Dynamically increasing the cpu power of some groups
    will automatically help power savings when we want to consolidate tasks
    onto one cpu package while overall system utilisation is very low.

    > There are two driving factors:
    > - RT time consumption feedback into CFS
    > - dynamic per-cpu power management like Intel's Dynamic Speed
    > Technology (formerly known as Turbo Mode).
    >
    > We currently have sched_group::cpu_power to model SMT. We say that
    > multiple threads that share a core are not as powerful as two cores.
    >
    > Therefore, we move a task when doing so results in more of that power
    > being utilized, resulting in preferring to run tasks on full cores
    > instead of SMT siblings.
    >
    >
    > RT time
    > -------
    >
    > So the basic issue is that we cannot know how much cpu-time will be
    > consumed by RT tasks (we used to relate that to the number of running
    > tasks, but that's utter nonsense).
    >
    > Therefore the only way is to measure it and assume the near future looks
    > like the near past.
    >
    > So why is this an issue.. suppose we have 2 cpus, and 1 cpu is consumed
    > for 50% by RT tasks, while the other is fully available to regular
    > tasks.
    >
    > In that case we'd want to load-balance such that the cpu affected by the
    > RT task(s) gets half the load the other cpu has.
    >
    > [ I tried modelling this by scaling the load of cpus up, but that fails
    > to handle certain cases - for instance 100% RT gets real funny, and it
    > fails to properly skip the RT-loaded cores in the low-load situation ]
    >
    > Dynamic Speed Technology
    > ------------------------
    >
    > With cpus actively fiddling with their processing capacity we get into
    > similar issues. Again we can measure this, but this would require the
    > addition of a clock that measures work instead of time.
    >
    > Having that, we can even accurately measure the old SMT case, which has
    > always been approximated by a static percentage - even though the actual
    > gain is very workload dependent.
    >
    > The idea is to introduce sched_work_clock() so that:
    >
    > work_delta / time_delta gives the power for a cpu. <1 means we
    > did less work than a dedicated pipeline, >1 means we did more.


    The challenge here is measurement of 'work'. What will be the
    parameter that will be fair for most workloads and easy to measure on
    most systems?

    * Instructions completion count
    * APERF or similar CPU specific counter on x86
    * POWER has PURR and SPURR to have a measure of relative work done


    > So, if for example our core's canonical freq would be 2.0GHz but we get
    > boosted to 2.2GHz while the other core would get lowered to 1.8GHz we
    > can observe and attribute this asymmetric power balance.
    >
    > [ This assumes that the total power is preserved for non-idle situations
    > - is that true? If not, this gets real interesting ]


    I would assume total compute power will be preserved over a long
    period of time. But certain workloads can benefit more from
    acceleration on the same system, which challenges the above assumption.

    > Also, an SMT thread, when sharing the core with its sibling will get <1,
    > but together they might be >1.


    In this case, what is the normalised value of '1'? It is difficult to
    estimate the nominal cpu power with threads. If we assume the
    normalised value to be the theoretical max, then the sum of both
    threads can be less than 1 and will never reach 1 in practice.

    > Funny corner cases
    > ------------------
    >
    > Like mentioned in the RT time note, there is the possibility that a core
    > has 0 power (left) for SCHED_OTHER. This has a consequence for the
    > balance cpu. Currently we let the first cpu in the domain do the
    > balancing, however if that CPU has 0 power it might not be the best
    > choice (esp since part of the balancing can be done from softirq context
    > - which would basically starve that group).


    Agreed, but relatively easy to solve compared to other challenges.

    > Sched domains
    > -------------
    >
    > There seems to be a use-case where we need both the cache and the
    > package levels. So I wanted to have both levels in there.
    >
    > Currently each domain level can only be one of:
    >
    > SD_LV_NONE = 0,
    > SD_LV_SIBLING,
    > SD_LV_MC,
    > SD_LV_CPU,
    > SD_LV_NODE,
    > SD_LV_ALLNODES,
    >
    > So to avoid a double domain with 'native' multi-core chips where the
    > cache and package level have the same span, I want to encode this
    > information in the sched_domain::flags as bits, which means a level can
    > be both cache and package.


    This will help power savings balance and make the implementation
    clean. You have suggested this previously also.
    Similarly, could we collapse the NODE level if it is redundant?

    > Over balancing
    > --------------
    >
    > Lastly, we might need to introduce SD_OVER_BALANCE, which toggles the
    > over-balance logic. While over-balancing brings better fairness for a
    > number of cases, it's also hard on power savings.


    I did not understand this over-balancing. Can you please explain?

    Thanks,
    Vaidy


  10. Re: [RFC PATCH v2 0/5] sched: modular find_busiest_group()

    On Tue, 2008-10-14 at 18:37 +0530, Vaidyanathan Srinivasan wrote:
    > * Peter Zijlstra [2008-10-14 14:09:13]:
    >
    > >
    > > Hi,
    > >
    > > So the basic issue is sched_group::cpu_power should become more dynamic.

    >
    > Hi Peter,
    >
    > This is a good idea. Dynamically increasing the cpu power of some groups
    > will automatically help power savings when we want to consolidate tasks
    > onto one cpu package while overall system utilisation is very low.


    Ah, yes another use case of this ;-)

    > > Dynamic Speed Technology
    > > ------------------------
    > >
    > > With cpus actively fiddling with their processing capacity we get into
    > > similar issues. Again we can measure this, but this would require the
    > > addition of a clock that measures work instead of time.
    > >
    > > Having that, we can even accurately measure the old SMT case, which has
    > > always been approximated by a static percentage - even though the actual
    > > gain is very workload dependent.
    > >
    > > The idea is to introduce sched_work_clock() so that:
    > >
    > > work_delta / time_delta gives the power for a cpu. <1 means we
    > > did less work than a dedicated pipeline, >1 means we did more.

    >
    > The challenge here is measurement of 'work'. What will be the
    > parameter that will be fair for most workloads and easy to measure on
    > most systems?
    >
    > * Instructions completion count
    > * APERF or similar CPU specific counter on x86
    > * POWER has PURR and SPURR to have a measure of relative work done


    Right - I was hoping for some feedback from the arch folks (maybe I
    should have CC'ed linux-arch) on this issue.

    > > So, if for example our core's canonical freq would be 2.0GHz but we get
    > > boosted to 2.2GHz while the other core would get lowered to 1.8GHz we
    > > can observe and attribute this asymmetric power balance.
    > >
    > > [ This assumes that the total power is preserved for non-idle situations
    > > - is that true? If not, this gets real interesting ]

    >
    > I would assume total compute power will be preserved over a long
    > period of time. But certain workloads can benefit more from
    > acceleration on the same system, which challenges the above assumption.


    Yes, trouble is that as soon as it's not a constant, we get into a
    generic optimisation problem, which I'd rather not try to solve in the
    CPU scheduler.

    > > Also, an SMT thread, when sharing the core with its sibling will get <1,
    > > but together they might be >1.

    >
    > In this case, what is the normalised value of '1'? It is difficult to
    > estimate the nominal cpu power with threads. If we assume the
    > normalised value to be the theoretical max, then the sum of both
    > threads can be less than 1 and will never reach 1 in practice.


    Agreed, getting a normalized value is possibly non-trivial. If we'd look
    at things like (avg) cpu-speed over a measured time interval it's doable,
    but once we start looking at instructions completed and similar things
    this might be a little more difficult.

    Then again, we could perhaps re-normalize the value such that the avg
    value over all cpus ends up being 1 - then again, SMT might ruin this
    scheme.
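The re-normalization step could look roughly like this (hypothetical helper; fixed-point with SCHED_LOAD_SCALE = 1024 standing for an average power of 1):

```c
#include <assert.h>

#define SCHED_LOAD_SCALE 1024UL

/* Hypothetical re-normalization: scale raw per-cpu work rates so that
 * their average across the domain becomes SCHED_LOAD_SCALE, i.e. the
 * average cpu has power 1.  As noted, SMT may ruin this scheme since
 * sibling threads do not sum to independent full pipelines. */
static void normalize_power(unsigned long *power, int nr_cpus)
{
	unsigned long sum = 0;
	int i;

	for (i = 0; i < nr_cpus; i++)
		sum += power[i];
	if (!sum)
		return;		/* avoid division by zero: leave as-is */
	for (i = 0; i < nr_cpus; i++)
		power[i] = power[i] * nr_cpus * SCHED_LOAD_SCALE / sum;
}
```

For raw rates 2200 and 1800 this yields roughly 1126 and 921, whose average is 1024 up to integer rounding.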

    > > Funny corner cases
    > > ------------------
    > >
    > > Like mentioned in the RT time note, there is the possibility that a core
    > > has 0 power (left) for SCHED_OTHER. This has a consequence for the
    > > balance cpu. Currently we let the first cpu in the domain do the
    > > balancing, however if that CPU has 0 power it might not be the best
    > > choice (esp since part of the balancing can be done from softirq context
    > > - which would basically starve that group).

    >
    > Agreed, but relatively easy to solve compared to other challenges.


    Yes, I just tossed it in to not forget about it ;-)

    The thing is, I know I wanted to write about two corner cases, but I've
    already forgotten one.. I'm still hoping it will again occur to me :-)

    > > Sched domains
    > > -------------
    > >
    > > There seems to be a use-case where we need both the cache and the
    > > package levels. So I wanted to have both levels in there.
    > >
    > > Currently each domain level can only be one of:
    > >
    > > SD_LV_NONE = 0,
    > > SD_LV_SIBLING,
    > > SD_LV_MC,
    > > SD_LV_CPU,
    > > SD_LV_NODE,
    > > SD_LV_ALLNODES,
    > >
    > > So to avoid a double domain with 'native' multi-core chips where the
    > > cache and package level have the same span, I want to encode this
    > > information in the sched_domain::flags as bits, which means a level can
    > > be both cache and package.

    >
    > This will help power savings balance and make the implementation
    > clean. You have suggested this previously also.
    > Similarly, could we collapse the NODE level if it is redundant?


    Exactly, using a bitmask type system allows doing that more easily,
    because then a domain can be multiple types at once.

    We can even consider adding more information like: shared_l1, shared_l2
    and shared_l3. So that we have the full cache hierarchy available.

    > > Over balancing
    > > --------------
    > >
    > > Lastly, we might need to introduce SD_OVER_BALANCE, which toggles the
    > > over-balance logic. While over-balancing brings better fairness for a
    > > number of cases, it's also hard on power savings.

    >
    > I did not understand this over-balancing. Can you please explain?


    Ah, lets assume a 2 cpu system.

    When presented with 2 tasks of weight 1024 and 1536, this constitutes an
    infeasible weight distribution.

    There is no work-conserving way to schedule those two tasks, such that
    their received cpu-time is proportionally fair.

    However, when presented with 3 tasks each of weight 1024, this is
    statically-infeasible but dynamically-feasible.

    That is, there is no static distribution of tasks such that each tasks
    receives a proportionally fair share of the cpu-time. However, by
    rotating the excess task between the 2 cpus, we can ensure fairness.

    In the latter case, we always have an imbalance between runqueues which
    is smaller than 1 task.

    In this case we can do one of two things. We can choose not to
    schedule, in which case we maintain the static distribution; this is
    called under-scheduling.

    Or we can move 1 task despite the fact that we'll tip the imbalance
    the other way around; this is called over-scheduling.

    As you can see, over-scheduling allows for fairness in more cases, but
    at some expense.

    A lot of people prefer the added determinism of the extra fairness, but
    not everybody. Hence the tunable.
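The arithmetic behind the example: with 2 cpus and 3 tasks of weight 1024, a static placement gives the lone task a full cpu and the two sharing tasks half a cpu each, while rotating the excess task gives every task 2/3 of a cpu. A tiny sketch (hypothetical helpers; shares are in thousandths of one cpu's time):

```c
#include <assert.h>

/* Worked example of the over-balance case: 3 tasks of equal weight on
 * 2 cpus.  All shares are in 1/1000 of one cpu's time. */

/* Static placement: a task's share is one cpu divided among however
 * many tasks sit on its runqueue. */
static long static_share(int tasks_on_my_cpu)
{
	return 1000 / tasks_on_my_cpu;
}

/* Rotating the excess task: each of nr_tasks equal-weight tasks ends
 * up with nr_cpus / nr_tasks of a cpu over time. */
static long rotated_share(int nr_cpus, int nr_tasks)
{
	return 1000 * nr_cpus / nr_tasks;
}
```

Statically, one task gets 1000 while the other two get 500 each (unfair for equal weights); with rotation, all three converge on 666, at the cost of the extra migrations that make over-balancing hard on power savings.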


  11. Re: [RFC PATCH v2 0/5] sched: modular find_busiest_group()

    * Peter Zijlstra [2008-10-14 15:25:03]:

    > On Tue, 2008-10-14 at 18:37 +0530, Vaidyanathan Srinivasan wrote:
    > > * Peter Zijlstra [2008-10-14 14:09:13]:
    > >
    > > >
    > > > Hi,
    > > >
    > > > So the basic issue is sched_group::cpu_power should become more dynamic.

    > >
    > > Hi Peter,
    > >
    > > This is a good idea. Dynamically increasing cpu power to some groups
    > > will automatically help power savings when we want to consolidate
    > > better to one cpu package when overall system utilisation is very low.

    >
    > Ah, yes another use case of this ;-)
    >
    > > > Dynamic Speed Technology
    > > > ------------------------
    > > >
    > > > With cpus actively fiddling with their processing capacity we get into
    > > > similar issues. Again we can measure this, but this would require the
    > > > addition of a clock that measures work instead of time.
    > > >
    > > > Having that, we can even accurately measure the old SMT case, which has
    > > > always been approximated by a static percentage - even though the actual
    > > > gain is very workload dependent.
    > > >
    > > > The idea is to introduce sched_work_clock() so that:
    > > >
    > > > work_delta / time_delta gives the power for a cpu. <1 means we
    > > > did less work than a dedicated pipeline, >1 means we did more.

    > >
    > > The challenge here is measurement of 'work'. What will be the
    > > parameter that will be fair for most workloads and easy to measure on
    > > most systems?
    > >
    > > * Instructions completion count
    > > * APERF or similar CPU specific counter on x86
    > > * POWER has PURR and SPURR to have a measure of relative work done

    >
    > Right - I was hoping for some feedback from the arch folks (maybe I
    > should have CC'ed linux-arch) on this issue.


    Hi Peter,

    Do you want to post this RFD again to get more feedback?

    --Vaidy
