
Thread: [RFC PATCH v3 0/5] Tunable sched_mc_power_savings=n

  1. [RFC PATCH v3 0/5] Tunable sched_mc_power_savings=n

    Hi,

    The existing power saving load balancer, CONFIG_SCHED_MC, attempts to
    run the workload in the system on a minimum number of CPU packages
    and tries to keep the rest of the CPU packages idle for longer
    durations. Consolidating workloads onto fewer packages thus helps the
    other packages stay idle and save power. The current implementation
    is very conservative and does not work effectively across different
    workloads. The initial idea of a tunable sched_mc_power_savings=n was
    proposed to enable tuning of the power saving load balancer based on
    the system configuration, workload characteristics and end user
    requirements.

    The power savings and performance of a given workload in an
    underutilised system can be controlled by writing 0, 1 or 2 to
    /sys/devices/system/cpu/sched_mc_power_savings, with 0 being highest
    performance (and least power savings) and level 2 indicating maximum
    power savings, even at the cost of slight performance degradation.
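
    For reference, patch 1/5 in this series introduces these levels as
    an enum along roughly the following lines (a sketch of the framework
    patch; comments paraphrased):

    enum powersavings_balance_level {
    	POWERSAVINGS_BALANCE_NONE = 0,	/* no power-saving load balance */
    	POWERSAVINGS_BALANCE_BASIC,	/* fill one thread/core/package
    					 * first for long-running threads */
    	POWERSAVINGS_BALANCE_WAKEUP,	/* also bias task wakeups to
    					 * semi-idle cpu packages */
    	MAX_POWERSAVINGS_BALANCE_LEVELS
    };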

    Please refer to the following discussions and article for details.

    [1]Making power policy just work
    http://lwn.net/Articles/287924/

    [2][RFC v1] Tunable sched_mc_power_savings=n
    http://lwn.net/Articles/287882/

    [3][RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n
    http://lwn.net/Articles/297306/

    The following patch series demonstrates the basic framework for
    tunable sched_mc_power_savings.

    This version of the patch incorporates comments and feedback received
    on the previous post. Thanks to Peter Zijlstra for the review and
    comments.

    Changes from v2:
    ----------------

    * Fixed locking order issue in active-balance-new-idle
    * Moved the wakeup biasing code to the wake_idle() function and
      preserved the wake_affine function. The previous version would
      break wake_affine in order to aggressively consolidate tasks
    * Removed the sched_mc_preferred_wakeup_cpu global variable and moved
      it to doms_cur/dattr_cur, and added a per-cpu pointer to the
      appropriate storage in the partitioned sched domain. This change is
      needed to preserve functionality in case of partitioned sched
      domains
    * Patch is against the 2.6.28-rc3 kernel

    Notes:
    ------

    * The patch has been tested on x86 with basic cpusets and kernbench.
      Correct functionality in the case of partitioned sched domains
      needs to be analysed.

    Results:
    --------

    Basic functionality of the code has not changed and the power vs
    performance benefits for kernbench are similar to the ones posted
    earlier.

    KERNBENCH Runs: make -j4 on an x86 8-core system (dual socket,
    quad-core CPU packages)

    SchedMC  Run Time  Package Idle      Energy   Power
    0        81.61s    53.07%  52.81%    1.00x J  1.00y W
    1        81.52s    40.83%  65.45%    0.96x J  0.96y W
    2        74.66s    22.20%  83.94%    0.90x J  0.98y W

    *** This is RFC code and not for inclusion ***

    Please feel free to test, and let me know your comments and feedback.

    Thanks,
    Vaidy

    Signed-off-by: Vaidyanathan Srinivasan

    ---

    Gautham R Shenoy (1):
    sched: Framework for sched_mc/smt_power_savings=N

    Vaidyanathan Srinivasan (4):
    sched: activate active load balancing in new idle cpus
    sched: bias task wakeups to preferred semi-idle packages
    sched: nominate preferred wakeup cpu
    sched: favour lower logical cpu number for sched_mc balance


    include/linux/sched.h |   12 +++++++
    kernel/sched.c        |   89 ++++++++++++++++++++++++++++++++++++++++++++++---
    kernel/sched_fair.c   |   17 +++++++++
    3 files changed, 112 insertions(+), 6 deletions(-)


  2. [RFC PATCH v3 4/5] sched: bias task wakeups to preferred semi-idle packages

    The preferred wakeup cpu (from a semi-idle package) has been
    nominated in find_busiest_group() in the previous patch. Use
    this information, held in sched_mc_preferred_wakeup_cpu, in
    wake_idle() to bias task wakeups if the following conditions
    are satisfied:

    - The present cpu that is trying to wake up the process is
      idle, and waking the target process on this cpu would
      potentially wake up a completely idle package
    - The previous cpu on which the target process ran is
      also idle, and hence selecting the previous cpu may
      wake up a semi-idle cpu package
    - The task being woken up is allowed to run on the
      nominated cpu (cpu affinity and restrictions)

    Basically, if both the current cpu and the previous cpu on
    which the task ran are idle, select the nominated cpu from the
    semi-idle cpu package to run the task that is waking up.

    Cache hotness is considered since the actual biasing happens
    in wake_idle() only if the application is cache cold.

    This technique will effectively consolidate short running bursty
    jobs in a mostly idle system.

    Wakeup biasing for power savings gets automatically disabled if
    system utilisation increases, because the probability of finding
    both this_cpu and prev_cpu idle decreases.

    Signed-off-by: Vaidyanathan Srinivasan
    ---

    kernel/sched_fair.c | 17 +++++++++++++++++
    1 files changed, 17 insertions(+), 0 deletions(-)

    diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
    index ce514af..ad5269a 100644
    --- a/kernel/sched_fair.c
    +++ b/kernel/sched_fair.c
    @@ -1026,6 +1026,23 @@ static int wake_idle(int cpu, struct task_struct *p)
     	cpumask_t tmp;
     	struct sched_domain *sd;
     	int i;
    +	int this_cpu;
    +	unsigned int *chosen_wakeup_cpu;
    +
    +	/*
    +	 * At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu
    +	 * are idle and this is not a kernel thread and this task's affinity
    +	 * allows it to be moved to preferred cpu, then just move!
    +	 */
    +
    +	this_cpu = smp_processor_id();
    +	chosen_wakeup_cpu = per_cpu(sched_mc_preferred_wakeup_cpu, this_cpu);
    +
    +	if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
    +	    chosen_wakeup_cpu &&
    +	    idle_cpu(cpu) && idle_cpu(this_cpu) && p->mm &&
    +	    cpu_isset(*chosen_wakeup_cpu, p->cpus_allowed))
    +		return *chosen_wakeup_cpu;

     	/*
     	 * If it is idle, then it is the best cpu to run this task.


  3. Re: [RFC PATCH v3 0/5] Tunable sched_mc_power_savings=n


    a quick response, I'll read them more carefully tomorrow:

    - why are the preferred cpu things pointers? afaict using just the cpu
    number is both smaller and clearer to the reader.

    - in patch 5/5 you do:

    +	spin_unlock(&this_rq->lock);
    +	double_rq_lock(this_rq, busiest);

    we call that double_lock_balance()

    - comments go like:

    /*
     * this is a multi-
     * line comment
     */




  4. [RFC PATCH v3 5/5] sched: activate active load balancing in new idle cpus

    Active load balancing is a process by which the migration thread
    is woken up on the target CPU in order to pull a currently
    running task on another package into this newly idle
    package.

    This method is already in use with normal load_balance();
    this patch introduces the method to newly idle cpus when
    sched_mc is set to POWERSAVINGS_BALANCE_WAKEUP.

    This logic provides effective consolidation of short running
    daemon jobs in an almost idle system.

    The side effect of this patch may be ping-ponging of tasks
    if the system is moderately utilised. We may need to adjust the
    number of failed balance iterations before triggering.

    Signed-off-by: Vaidyanathan Srinivasan
    ---

    kernel/sched.c | 35 +++++++++++++++++++++++++++++++++++
    1 files changed, 35 insertions(+), 0 deletions(-)

    diff --git a/kernel/sched.c b/kernel/sched.c
    index 16c5e1f..4d99509 100644
    --- a/kernel/sched.c
    +++ b/kernel/sched.c
    @@ -3687,10 +3687,45 @@ redo:
     	}

     	if (!ld_moved) {
    +		int active_balance = 0;
    +		unsigned long flags;
    +
     		schedstat_inc(sd, lb_failed[CPU_NEWLY_IDLE]);
     		if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
     		    !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
     			return -1;
    +
    +		if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
    +			return -1;
    +
    +		if (sd->nr_balance_failed++ < 2)
    +			return -1;
    +
    +		/* Release this_rq lock and take in correct order */
    +		spin_unlock(&this_rq->lock);
    +		double_rq_lock(this_rq, busiest);
    +
    +		/* don't kick the migration_thread, if the curr
    +		 * task on busiest cpu can't be moved to this_cpu
    +		 */
    +		if (!cpu_isset(this_cpu, busiest->curr->cpus_allowed)) {
    +			double_rq_unlock(this_rq, busiest);
    +			spin_lock(&this_rq->lock);
    +			all_pinned = 1;
    +			return ld_moved;
    +		}
    +
    +		if (!busiest->active_balance) {
    +			busiest->active_balance = 1;
    +			busiest->push_cpu = this_cpu;
    +			active_balance = 1;
    +		}
    +
    +		double_rq_unlock(this_rq, busiest);
    +		if (active_balance)
    +			wake_up_process(busiest->migration_thread);
    +
    +		spin_lock(&this_rq->lock);
     	} else
     		sd->nr_balance_failed = 0;



  5. Re: [RFC PATCH v3 0/5] Tunable sched_mc_power_savings=n

    * Peter Zijlstra [2008-11-10 19:50:16]:

    >
    > a quick response, I'll read them more carefully tomorrow:


    Hi Peter,

    Thanks for the quick review.

    >
    > - why are the preferred cpu things pointers? afaict using just the cpu
    > number is both smaller and clearer to the reader.


    I would need each cpu within a partitioned sched domain to point to
    the _same_ preferred wakeup cpu. The preferred cpu will be updated in
    one place, in find_busiest_group(), and used by wake_idle().

    If I had a per-cpu value, then updating it for each cpu in the
    partitioned sched domain would be slow.

    The actual number of preferred_wakeup_cpu variables will be equal to
    the number of partitions. If there are no partitions in the sched
    domains, then all per-cpu pointers will point to the same variable.
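
    A minimal sketch of that arrangement (declarations assumed to match
    patch 3/5 of this series; first_idle_cpu stands for whatever cpu
    find_busiest_group() nominates):

    /* each cpu holds a pointer into shared per-partition storage */
    static DEFINE_PER_CPU(unsigned int *, sched_mc_preferred_wakeup_cpu);

    /* find_busiest_group() updates the shared word once per partition */
    *per_cpu(sched_mc_preferred_wakeup_cpu, this_cpu) = first_idle_cpu;

    /* wake_idle() on any cpu of the partition then reads the same word */
    wakeup_cpu = *per_cpu(sched_mc_preferred_wakeup_cpu, this_cpu);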

    > - in patch 5/5 you do:
    >
    > +	spin_unlock(&this_rq->lock);
    > +	double_rq_lock(this_rq, busiest);
    >
    > we call that double_lock_balance()


    Will fix this. I did not look for such a routine.

    > - comments go like:
    >
    > /*
    >  * this is a multi-
    >  * line comment
    >  */


    Will fix this too.

    Thanks,
    Vaidy


  6. Re: [RFC PATCH v3 5/5] sched: activate active load balancing in new idle cpus

    On Tue, 2008-11-11 at 00:03 +0530, Vaidyanathan Srinivasan wrote:
    > Active load balancing is a process by which migration thread
    > is woken up on the target CPU in order to pull current
    > running task on another package into this newly idle
    > package.
    >
    > This method is already in use with normal load_balance(),
    > this patch introduces this method to new idle cpus when
    > sched_mc is set to POWERSAVINGS_BALANCE_WAKEUP.
    >
    > This logic provides effective consolidation of short running
    > daemon jobs in an almost idle system
    >
    > The side effect of this patch may be ping-ponging of tasks
    > if the system is moderately utilised. May need to adjust the
    > iterations before triggering.


    OK, I'm so not getting this patch..

    if normal newly idle balancing fails that means the other runqueue has
    only a single task on it (or some other really stubborn stuff), so then
    you go move that one task that is already running, from one cpu to
    another.

    _why_?

    The only answer I can come up with is that you prefer one cpu's
    idle-ness over another - which makes sense, as you try to get whole
    packages idle.

    But I'm not seeing where that package logic is hidden..


  7. Re: [RFC PATCH v3 3/5] sched: nominate preferred wakeup cpu

    On Tue, 2008-11-11 at 00:03 +0530, Vaidyanathan Srinivasan wrote:
    > When the system utilisation is low and more cpus are idle,
    > then the process waking up from sleep should prefer to
    > wakeup an idle cpu from semi-idle cpu package (multi core
    > package) rather than a completely idle cpu package which
    > would waste power.
    >
    > Use the sched_mc balance logic in find_busiest_group() to
    > nominate a preferred wakeup cpu.
    >
    > This info can be stored in appropriate sched_domain, but
    > updating this info in all copies of sched_domain is not
    > practical. For now let's try with a per-cpu variable
    > pointing to a common storage in partition sched domain
    > attribute. Global variable may not work in partitioned
    > sched domain case.


    Would it make sense to place the preferred_wakeup_cpu stuff in the
    root_domain structure we already have?


  8. Re: [RFC PATCH v3 3/5] sched: nominate preferred wakeup cpu

    Peter Zijlstra wrote:
    > On Tue, 2008-11-11 at 00:03 +0530, Vaidyanathan Srinivasan wrote:
    >
    >> When the system utilisation is low and more cpus are idle,
    >> then the process waking up from sleep should prefer to
    >> wakeup an idle cpu from semi-idle cpu package (multi core
    >> package) rather than a completely idle cpu package which
    >> would waste power.
    >>
    >> Use the sched_mc balance logic in find_busiest_group() to
    >> nominate a preferred wakeup cpu.
    >>
    >> This info can be stored in appropriate sched_domain, but
    >> updating this info in all copies of sched_domain is not
    >> practical. For now let's try with a per-cpu variable
    >> pointing to a common storage in partition sched domain
    >> attribute. Global variable may not work in partitioned
    >> sched domain case.
    >>

    >
    > Would it make sense to place the preferred_wakeup_cpu stuff in the
    > root_domain structure we already have?
    >


    From the description, this is exactly what the root-domains were created
    to solve.

    Vaidyanathan, just declare your object in "struct root_domain" and
    initialize it in init_rootdomain() in kernel/sched.c, and then access it
    via rq->rd to take advantage of this infrastructure. It will
    automatically follow any partitioning that happens to be configured.
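
    In code, the suggestion amounts to something like the sketch below
    (the member name preferred_wakeup_cpu is assumed for illustration,
    not taken from the posted patches):

    struct root_domain {
    	/* ... existing members ... */
    	unsigned int preferred_wakeup_cpu;	/* assumed member */
    };

    static void init_rootdomain(struct root_domain *rd)
    {
    	/* ... existing initialisation ... */
    	rd->preferred_wakeup_cpu = 0;		/* sketch */
    }

    /* consumers then read it through the runqueue's root domain */
    wakeup_cpu = cpu_rq(this_cpu)->rd->preferred_wakeup_cpu;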

    -Greg





  9. Re: [RFC PATCH v3 3/5] sched: nominate preferred wakeup cpu

    On Tue, 2008-11-11 at 20:51 +0530, Srivatsa Vaddagiri wrote:
    > On Tue, Nov 11, 2008 at 09:07:58AM -0500, Gregory Haskins wrote:
    > > > Would it make sense to place the preferred_wakeup_cpu stuff in the
    > > > root_domain structure we already have?
    > > >

    > >
    > > From the description, this is exactly what the root-domains were created
    > > to solve.
    > >
    > > Vaidyanathan, just declare your object in "struct root_domain" and
    > > initialize it in init_rootdomain() in kernel/sched.c, and then access it
    > > via rq->rd to take advantage of this infrastructure. It will
    > > automatically follow any partitioning that happens to be configured.

    >
    > If I understand correctly, we may want to have more than one preferred
    > cpu in a given sched domain, taking into account node topology, i.e. if a
    > given sched domain encompasses two nodes, then we may like to designate
    > two preferred wakeup_cpus, one per node. If that is the case, then
    > root_domain may not be of use here?


    Agreed, in which case this sched_domain_attr stuff might work out better
    - but I'm not sure I fully get that.. will stare at that a bit more.


  10. Re: [RFC PATCH v3 3/5] sched: nominate preferred wakeup cpu

    On Tue, Nov 11, 2008 at 09:07:58AM -0500, Gregory Haskins wrote:
    > > Would it make sense to place the preferred_wakeup_cpu stuff in the
    > > root_domain structure we already have?
    > >

    >
    > From the description, this is exactly what the root-domains were created
    > to solve.
    >
    > Vaidyanathan, just declare your object in "struct root_domain" and
    > initialize it in init_rootdomain() in kernel/sched.c, and then access it
    > via rq->rd to take advantage of this infrastructure. It will
    > automatically follow any partitioning that happens to be configured.


    If I understand correctly, we may want to have more than one preferred
    cpu in a given sched domain, taking into account node topology, i.e. if
    a given sched domain encompasses two nodes, then we may like to
    designate two preferred wakeup_cpus, one per node. If that is the case,
    then root_domain may not be of use here?

    - vatsa

  11. Re: [RFC PATCH v3 3/5] sched: nominate preferred wakeup cpu

    * Peter Zijlstra [2008-11-11 14:43:39]:

    > On Tue, 2008-11-11 at 00:03 +0530, Vaidyanathan Srinivasan wrote:
    > > When the system utilisation is low and more cpus are idle,
    > > then the process waking up from sleep should prefer to
    > > wakeup an idle cpu from semi-idle cpu package (multi core
    > > package) rather than a completely idle cpu package which
    > > would waste power.
    > >
    > > Use the sched_mc balance logic in find_busiest_group() to
    > > nominate a preferred wakeup cpu.
    > >
    > > This info can be stored in appropriate sched_domain, but
    > > updating this info in all copies of sched_domain is not
    > > practical. For now let's try with a per-cpu variable
    > > pointing to a common storage in partition sched domain
    > > attribute. Global variable may not work in partitioned
    > > sched domain case.

    >
    > Would it make sense to place the preferred_wakeup_cpu stuff in the
    > root_domain structure we already have?


    Yep, that will be a good idea. We can get to root_domain from each
    CPU's rq and we can get rid of the per-cpu pointers for
    preferred_wakeup_cpu as well. I will change the implementation and
    re-post.

    Thanks,
    Vaidy


  12. Re: [RFC PATCH v3 3/5] sched: nominate preferred wakeup cpu

    Vaidyanathan Srinivasan wrote:
    > * Peter Zijlstra [2008-11-11 14:43:39]:
    >
    >> On Tue, 2008-11-11 at 00:03 +0530, Vaidyanathan Srinivasan wrote:
    >>> When the system utilisation is low and more cpus are idle,
    >>> then the process waking up from sleep should prefer to
    >>> wakeup an idle cpu from semi-idle cpu package (multi core
    >>> package) rather than a completely idle cpu package which
    >>> would waste power.
    >>>
    >>> Use the sched_mc balance logic in find_busiest_group() to
    >>> nominate a preferred wakeup cpu.
    >>>
    >>> This info can be stored in appropriate sched_domain, but
    >>> updating this info in all copies of sched_domain is not
    >>> practical. For now let's try with a per-cpu variable
    >>> pointing to a common storage in partition sched domain
    >>> attribute. Global variable may not work in partitioned
    >>> sched domain case.

    >> Would it make sense to place the preferred_wakeup_cpu stuff in the
    >> root_domain structure we already have?

    >
    > Yep, that will be a good idea. We can get to root_domain from each
    > CPU's rq and we can get rid of the per-cpu pointers for
    > preferred_wakeup_cpu as well. I will change the implementation and
    > re-post.


    Did you see Vatsa's comments? root_domain will not work if you have
    more than one preferred_wakeup_cpu per domain.

    --
    Balbir

  13. Re: [RFC PATCH v3 5/5] sched: activate active load balancing in new idle cpus

    * Peter Zijlstra [2008-11-11 14:47:15]:

    > On Tue, 2008-11-11 at 00:03 +0530, Vaidyanathan Srinivasan wrote:
    > > Active load balancing is a process by which migration thread
    > > is woken up on the target CPU in order to pull current
    > > running task on another package into this newly idle
    > > package.
    > >
    > > This method is already in use with normal load_balance(),
    > > this patch introduces this method to new idle cpus when
    > > sched_mc is set to POWERSAVINGS_BALANCE_WAKEUP.
    > >
    > > This logic provides effective consolidation of short running
    > > daemon jobs in an almost idle system
    > >
    > > The side effect of this patch may be ping-ponging of tasks
    > > if the system is moderately utilised. May need to adjust the
    > > iterations before triggering.

    >
    > OK, I'm so not getting this patch..
    >
    > if normal newly idle balancing fails that means the other runqueue has
    > only a single task on it (or some other really stubborn stuff), so then
    > you go move that one task that is already running, from one cpu to
    > another.
    >
    > _why_?
    >
    > The only answer I can come up with is that you prefer one cpu's
    > idle-ness over another - which makes sense, as you try to get whole
    > packages idle.


    Your answer is correct. We want to move that one task from a non-idle
    cpu to this cpu that is just about to go idle.

    This is the same method used to move tasks in load_balance(); I have
    extended it to load_balance_newidle() to make the consolidation
    faster at sched_mc=2.


    > But I'm not seeing where that package logic is hidden..



    The package logic comes from find_busiest_group(). If there is no
    imbalance, then find_busiest_group() will return NULL. However, when
    sched_mc={1,2}, find_busiest_group() will select a group
    from which a running task may be pulled to this cpu in order to make
    the other package idle. If there is no opportunity to make a package
    idle and there is no imbalance, then find_busiest_group() will
    return NULL and no action will be taken in load_balance_newidle().

    Under normal task pull operation due to imbalance, there will be more
    than one task in the source run queue and move_tasks() will succeed.
    ld_moved will be true and the active balance code will not be
    triggered.

    If we enter a scenario where we are moving the only running task from
    another cpu, then this should have been suggested by
    find_busiest_group's sched_mc balance logic, and thus moving that task
    will potentially free up the source package.
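
    A simplified sketch of the flow described above (not the literal
    patch; names follow load_balance_newidle() in 2.6.28):

    group = find_busiest_group(sd, this_cpu, &imbalance,
    			       CPU_NEWLY_IDLE, &sd_idle, cpus, NULL);
    if (!group)
    	goto out_balanced;	/* no imbalance and no package to free */

    busiest = find_busiest_queue(group, CPU_NEWLY_IDLE, imbalance, cpus);
    ld_moved = move_tasks(this_rq, this_cpu, busiest, imbalance,
    			  sd, CPU_NEWLY_IDLE, &all_pinned);
    if (!ld_moved) {
    	/*
    	 * busiest holds only its one running task; at sched_mc=2 we
    	 * still wake its migration thread so the task is pushed here
    	 * and the source package can go fully idle.
    	 */
    }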

    Thanks for the careful review.

    --Vaidy


  14. Re: [RFC PATCH v3 3/5] sched: nominate preferred wakeup cpu

    * Peter Zijlstra [2008-11-11 16:26:14]:

    > On Tue, 2008-11-11 at 20:51 +0530, Srivatsa Vaddagiri wrote:
    > > On Tue, Nov 11, 2008 at 09:07:58AM -0500, Gregory Haskins wrote:
    > > > > Would it make sense to place the preferred_wakeup_cpu stuff in the
    > > > > root_domain structure we already have?
    > > > >
    > > >
    > > > From the description, this is exactly what the root-domains were created
    > > > to solve.
    > > >
    > > > Vaidyanathan, just declare your object in "struct root_domain" and
    > > > initialize it in init_rootdomain() in kernel/sched.c, and then access it
    > > > via rq->rd to take advantage of this infrastructure. It will
    > > > automatically follow any partitioning that happens to be configured.

    > >
    > > If I understand correctly, we may want to have more than one preferred
    > > cpu in a given sched domain, taking into account node topology, i.e. if a
    > > given sched domain encompasses two nodes, then we may like to designate
    > > two preferred wakeup_cpus, one per node. If that is the case, then
    > > root_domain may not be of use here?

    >
    > Agreed, in which case this sched_domain_attr stuff might work out better
    > - but I'm not sure I fully get that.. will stare at that a bit more.


    The current code that I posted assumes one preferred_wakeup_cpu per
    partitioned domain. Moving the variable to root_domain is a good idea
    for this implementation.

    In future, when we need one preferred_wakeup_cpu per node per
    partitioned domain, we will need an array for each partitioned domain.
    Having the array in root_domain is better than having it in dattr.

    Depending upon experimental results, we may choose to have only one
    preferred_wakeup_cpu per partitioned domain. When the system
    utilisation is quite low, it is better to move all movable tasks from
    each node to a selected node (0). This will free up all CPUs in other
    nodes. Just that we need to consider cache hotness and cross-node
    memory access more carefully before crossing a node boundary for
    consolidation.
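
    A sketch of that possible future extension (hypothetical, not part
    of this series):

    struct root_domain {
    	/* ... existing members ... */
    	/* one nominated wakeup cpu per node within this partition */
    	unsigned int preferred_wakeup_cpu[MAX_NUMNODES];
    };

    /* lookup would then index by the waking cpu's node */
    struct root_domain *rd = cpu_rq(this_cpu)->rd;
    wakeup_cpu = rd->preferred_wakeup_cpu[cpu_to_node(this_cpu)];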

    --Vaidy


  15. Re: [RFC PATCH v3 3/5] sched: nominate preferred wakeup cpu

    * Gregory Haskins [2008-11-11 09:07:58]:

    > Peter Zijlstra wrote:
    > > On Tue, 2008-11-11 at 00:03 +0530, Vaidyanathan Srinivasan wrote:
    > >
    > >> When the system utilisation is low and more cpus are idle,
    > >> then the process waking up from sleep should prefer to
    > >> wakeup an idle cpu from semi-idle cpu package (multi core
    > >> package) rather than a completely idle cpu package which
    > >> would waste power.
    > >>
    > >> Use the sched_mc balance logic in find_busiest_group() to
    > >> nominate a preferred wakeup cpu.
    > >>
    > >> This info can be stored in appropriate sched_domain, but
    > >> updating this info in all copies of sched_domain is not
    > >> practical. For now let's try with a per-cpu variable
    > >> pointing to a common storage in partition sched domain
    > >> attribute. Global variable may not work in partitioned
    > >> sched domain case.
    > >>

    > >
    > > Would it make sense to place the preferred_wakeup_cpu stuff in the
    > > root_domain structure we already have?
    > >

    >
    > From the description, this is exactly what the root-domains were created
    > to solve.
    >
    > Vaidyanathan, just declare your object in "struct root_domain" and
    > initialize it in init_rootdomain() in kernel/sched.c, and then access it
    > via rq->rd to take advantage of this infrastructure. It will
    > automatically follow any partitioning that happens to be configured.


    Yep, I agree. I will use root_domain for this purpose in the next
    revision.

    Thanks,
    Vaidy

  16. Re: [RFC PATCH v3 3/5] sched: nominate preferred wakeup cpu

    * Balbir Singh [2008-11-11 22:19:46]:

    > Vaidyanathan Srinivasan wrote:
    > > * Peter Zijlstra [2008-11-11 14:43:39]:
    > >
    > >> On Tue, 2008-11-11 at 00:03 +0530, Vaidyanathan Srinivasan wrote:
    > >>> When the system utilisation is low and more cpus are idle,
    > >>> then the process waking up from sleep should prefer to
    > >>> wakeup an idle cpu from semi-idle cpu package (multi core
    > >>> package) rather than a completely idle cpu package which
    > >>> would waste power.
    > >>>
    > >>> Use the sched_mc balance logic in find_busiest_group() to
    > >>> nominate a preferred wakeup cpu.
    > >>>
    > >>> This info can be stored in appropriate sched_domain, but
    > >>> updating this info in all copies of sched_domain is not
    > >>> practical. For now let's try with a per-cpu variable
    > >>> pointing to a common storage in partition sched domain
    > >>> attribute. Global variable may not work in partitioned
    > >>> sched domain case.
    > >> Would it make sense to place the preferred_wakeup_cpu stuff in the
    > >> root_domain structure we already have?

    > >
    > > Yep, that will be a good idea. We can get to root_domain from each
    > > CPU's rq and we can get rid of the per-cpu pointers for
    > > preferred_wakeup_cpu as well. I will change the implementation and
    > > re-post.

    >
    > Did you see Vatsa's comments? root_domain will no work if you have more than one
    > preferred_wakeup_cpu per domain.


    Hi Balbir,

    I just saw Vatsa's comments. We have a similar limitation with the
    current implementation also: sched_domain_attr (dattr) is also per
    partitioned domain and not per NUMA node.

    In the current implementation we can get rid of the per-cpu variables
    and use root_domain. Later we can have an array in root_domain and
    index it based on the cpu's node.

    Thanks,
    Vaidy

  17. Re: [RFC PATCH v3 5/5] sched: activate active load balancing in new idle cpus

    * Peter Zijlstra [2008-11-11 18:21:50]:

    > On Tue, 2008-11-11 at 22:34 +0530, Vaidyanathan Srinivasan wrote:
    > > * Peter Zijlstra [2008-11-11 14:47:15]:
    > >
    > > > On Tue, 2008-11-11 at 00:03 +0530, Vaidyanathan Srinivasan wrote:
    > > > > Active load balancing is a process by which migration thread
    > > > > is woken up on the target CPU in order to pull current
    > > > > running task on another package into this newly idle
    > > > > package.
    > > > >
    > > > > This method is already in use with normal load_balance(),
    > > > > this patch introduces this method to new idle cpus when
    > > > > sched_mc is set to POWERSAVINGS_BALANCE_WAKEUP.
    > > > >
    > > > > This logic provides effective consolidation of short running
    > > > > daemon jobs in an almost idle system
    > > > >
    > > > > The side effect of this patch may be ping-ponging of tasks
    > > > > if the system is moderately utilised. May need to adjust the
    > > > > iterations before triggering.
    > > >
    > > > OK, I'm so not getting this patch..
    > > >
    > > > if normal newly idle balancing fails that means the other runqueue has
    > > > only a single task on it (or some other really stubborn stuff), so then
    > > > you go move that one task that is already running, from one cpu to
    > > > another.
    > > >
    > > > _why_?
    > > >
    > > > The only answer I can come up with is that you prefer one cpu's
    > > > idle-ness over another - which makes sense, as you try to get whole
    > > > packages idle.

    > >
    > > Your answer is correct. We want to move that one task from a non-idle
    > > cpu to this cpu that is just going to be idle.
    > >
    > > This is the same method used to move tasks in load_balance(); I have
    > > extended it to load_balance_newidle() to make the consolidation
    > > faster at sched_mc=2.
    > >
    > >
    > > > But I'm not seeing where that package logic is hidden..

    > >
    > >
    > > The package logic comes from find_busiest_group(). If there is no
    > > imbalance, then find_busiest_group() will return NULL. However, when
    > > sched_mc={1,2}, find_busiest_group() will select a group
    > > from which a running task may be pulled to this cpu in order to make
    > > the other package idle. If there is no opportunity to make a package
    > > idle and there is no imbalance, then find_busiest_group() will
    > > return NULL and no action will be taken in load_balance_newidle().
    > >
    > > Under normal task pull operation due to imbalance, there will be more
    > > than one task in the source run queue and move_tasks() will succeed.
    > > ld_moved will be true and the active balance code will not be
    > > triggered.
    > >
    > > If we enter a scenario where we are moving the only running task from
    > > another cpu, then this should have been suggested by
    > > find_busiest_group's sched_mc balance logic, and thus moving that task
    > > will potentially free up the source package.
    > >
    > > Thanks for the careful review.

    >
    > Ah, right, thanks!
    >
    > Could you clarify this by adding a comment to this effect right before
    > the added code?


    Sure. Will add detailed comments.

    --Vaidy


  18. Re: [RFC PATCH v3 5/5] sched: activate active load balancing in new idle cpus

    On Tue, 2008-11-11 at 22:34 +0530, Vaidyanathan Srinivasan wrote:
    > * Peter Zijlstra [2008-11-11 14:47:15]:
    >
    > > On Tue, 2008-11-11 at 00:03 +0530, Vaidyanathan Srinivasan wrote:
    > > > Active load balancing is a process by which migration thread
    > > > is woken up on the target CPU in order to pull current
    > > > running task on another package into this newly idle
    > > > package.
    > > >
    > > > This method is already in use with normal load_balance(),
    > > > this patch introduces this method to new idle cpus when
    > > > sched_mc is set to POWERSAVINGS_BALANCE_WAKEUP.
    > > >
    > > > This logic provides effective consolidation of short running
    > > > daemon jobs in an almost idle system
    > > >
    > > > The side effect of this patch may be ping-ponging of tasks
    > > > if the system is moderately utilised. May need to adjust the
    > > > iterations before triggering.

    > >
    > > OK, I'm so not getting this patch..
    > >
    > > if normal newly idle balancing fails that means the other runqueue has
    > > only a single task on it (or some other really stubborn stuff), so then
    > > you go move that one task that is already running, from one cpu to
    > > another.
    > >
    > > _why_?
    > >
    > > The only answer I can come up with is that you prefer one cpu's
    > > idle-ness over another - which makes sense, as you try to get whole
    > > packages idle.

    >
    > Your answer is correct. We want to move that one task from a non-idle
    > cpu to this cpu that is just going to be idle.
    >
    > This is the same method used to move tasks in load_balance(); I have
    > extended it to load_balance_newidle() to make the consolidation
    > faster at sched_mc=2.
    >
    >
    > > But I'm not seeing where that package logic is hidden..

    >
    >
    > The package logic comes from find_busiest_group(). If there is no
    > imbalance, then find_busiest_group() will return NULL. However, when
    > sched_mc={1,2}, find_busiest_group() will select a group
    > from which a running task may be pulled to this cpu in order to make
    > the other package idle. If there is no opportunity to make a package
    > idle and there is no imbalance, then find_busiest_group() will
    > return NULL and no action will be taken in load_balance_newidle().
    >
    > Under normal task pull operation due to imbalance, there will be more
    > than one task in the source run queue and move_tasks() will succeed.
    > ld_moved will be true and the active balance code will not be
    > triggered.
    >
    > If we enter a scenario where we are moving the only running task from
    > another cpu, then this should have been suggested by
    > find_busiest_group's sched_mc balance logic, and thus moving that task
    > will potentially free up the source package.
    >
    > Thanks for the careful review.


    Ah, right, thanks!

    Could you clarify this by adding a comment to this effect right before
    the added code?

