Thread: [RFC v1] Tunable sched_mc_power_savings=n

  1. [RFC v1] Tunable sched_mc_power_savings=n

    Hi,

    The existing power saving load balancer CONFIG_SCHED_MC attempts to run
    the workload in the system on a minimum number of CPU packages and tries
    to keep the rest of the CPU packages idle for longer durations. Thus
    consolidating workloads onto fewer packages helps the other packages
    stay idle and save power.

    echo 1 > /sys/devices/system/cpu/sched_mc_power_savings is used to
    turn on this feature.

    When enabled, this tunable influences the load balancer decision
    in find_busiest_group(). Two parameters are extracted at this
    time: group_leader is the group that is almost full and has just
    enough capacity to pull a few (one) tasks, while group_min is the group
    that has too few tasks; if we can move them to group_leader, then
    this group can go completely idle.

    The default criteria to select group_leader and group_min catch
    long-running threads on various packages and pull them to a single
    package. The group_capacity limits the number of tasks that are
    pulled; we expect to have one task per core in a package, with
    all the cores in a package loaded.
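
    To illustrate, here is a standalone toy of the selection logic in C
    (made-up numbers and a deliberate simplification for discussion;
    this is not the actual sched.c code):

        /* Toy model of the group_min/group_leader selection described
         * above. Each "group" is one CPU package; nr_running counts its
         * tasks and capacity allows one task per core. */
        #include <stdio.h>

        struct group { int nr_running; int capacity; };

        int main(void)
        {
                struct group g[] = { {4, 4}, {3, 4}, {1, 4} };
                int n = sizeof(g) / sizeof(g[0]);
                int leader = -1, min = -1;

                for (int i = 0; i < n; i++) {
                        /* group_min: fewest tasks, candidate to go idle */
                        if (min < 0 || g[i].nr_running < g[min].nr_running)
                                min = i;
                        /* group_leader: most loaded group that still has
                         * spare capacity to pull a few (one) tasks */
                        if (g[i].nr_running < g[i].capacity &&
                            (leader < 0 ||
                             g[i].nr_running > g[leader].nr_running))
                                leader = i;
                }
                printf("group_leader=%d group_min=%d\n", leader, min);
                return 0;       /* prints group_leader=1 group_min=2 */
        }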

    These default selection criteria for sched_mc_power_savings=1 strike
    a good balance between power savings and minimal performance impact.
    The conservative approach taken towards consolidation makes the
    selection criteria workload dependent: long-running steady-state
    workloads are placed correctly, but bursty workloads are not.

    The idea being proposed is to enhance the tunable with varied degrees
    of consolidation that can work best for different workload
    characteristics. echo 2 > /sys/.../sched_mc_power_savings could
    enable more aggressive consolidation than the default.

    I am presently working on different criteria that can help consolidate
    different types of workload with varied degrees of power savings and
    performance impact.

    Advantages:

    * Enterprise workloads on large hardware configurations may need an
    aggressive consolidation strategy
    * The performance impact on servers differs from that on desktops or
    laptops. Interactivity is less of a concern on large enterprise
    servers, while workload response times and performance per watt are
    more significant
    * Aggressive power savings even with a marginal performance penalty
    is a useful tunable for servers since it may provide good
    performance-per-watt at low utilisation
    * This tunable can influence other parts of the scheduler like wakeup
    biasing for overall task consolidation

    Proposed changes:

    * Add more values to the sched_mc_power_savings tunable (bit flags?
    see the sketch after this list)
    * Enable different consolidation strategies based on the value
    * Evaluate the different strategies against different workloads and
    design heuristics for auto tuning
    * Modify the selection of group_leader by changing the spare capacity
    evaluation
    * Increase the group capacity of the group leader to avoid pulling
    tasks away from group_leader within a short time
    * Choose a different load_idx while evaluating and selecting the load
    * Use the sched_mc_power_savings settings outside of the load
    balancer, e.g. in task wakeup biasing
    * Design the power saving load balancer in combination with process
    wakeup biasing in order to consolidate bursty and short running jobs
    onto fewer CPU packages in an idle or under-utilised system.
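
    As an illustration of the bit-flag idea above, the values could
    hypothetically be encoded along these lines (names and bit positions
    are invented here purely for discussion, not a proposed ABI):

        /* Hypothetical sched_mc_power_savings values -- illustration only */
        #define SCHED_MC_OFF            0        /* no consolidation */
        #define SCHED_MC_CONSERVATIVE   (1 << 0) /* today's =1 behaviour */
        #define SCHED_MC_AGGRESSIVE     (1 << 1) /* pack beyond 1 task/core */
        #define SCHED_MC_WAKEUP_BIAS    (1 << 2) /* bias task wakeups too */

    Writing 6 (AGGRESSIVE | WAKEUP_BIAS) to the sysfs file would then
    select both behaviours at once.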

    Disadvantages:

    * More tunable settings will lead to sub-optimal performance if not
    exploited correctly. Once the tunable criteria are established and
    we have good heuristics, we can have a default setting that
    automatically chooses the right technique.

    I will send the changes in criteria and their impact in subsequent
    RFCs. I would like to solicit feedback on the overall idea and inputs
    from people who have already attempted similar changes.

    Thanks,
    Vaidy



  2. Re: [RFC v1] Tunable sched_mc_power_savings=n

    Vaidyanathan Srinivasan writes:
    >
    > The idea being proposed is to enhance the tunable with varied degrees
    > of consolidation that can work best for different workload
    > characteristics. echo 2 > /sys/.../sched_mc_power_savings could
    > enable more aggressive consolidation than the default.


    It would be better to fix the single power saving default to work
    better with bursty workloads too than to add more tunables. Tunables
    are basically "we give up, let's push the problem to the user"
    which is not nice. I suspect a lot of users won't even know if their
    workloads are bursty or not. Or they might have workloads which
    are both bursty and not bursty.

    Or did you try that and failed?

    -Andi

  3. Re: [RFC v1] Tunable sched_mc_power_savings=n

    On Thu, Jun 26, 2008 at 03:49:01PM +0200, Andi Kleen wrote:
    > Vaidyanathan Srinivasan writes:
    > >
    > > The idea being proposed is to enhance the tunable with varied degrees
    > > of consolidation that can work best for different workload
    > > characteristics. echo 2 > /sys/.../sched_mc_power_savings could
    > > enable more aggressive consolidation than the default.

    >
    > It would be better to fix the single power saving default to work
    > better with bursty workloads too than to add more tunables. Tunables
    > are basically "we give up, let's push the problem to the user"
    > which is not nice. I suspect a lot of users won't even know if their
    > workloads are bursty or not. Or they might have workloads which
    > are both bursty and not bursty.
    >
    > Or did you try that and failed?


    I think we have a reasonable default with sched_mc_power_savings=1.
    Beyond that, it is hard to figure out how much work you can group together
    and run in a small number of physical CPU packages. The approach
    we are taking is to let system administrators decide what level
    of power savings they want. If they want power savings at the cost
    of performance, they should be able to do so using a higher
    value of sched_mc_power_savings. If they see that they can pack
    more work without affecting their transaction time, they should
    be able to adjust the level of packing. Beyond a sane default,
    it is hard to do this inside the kernel.

    Thanks
    Dipankar

  4. Re: [RFC v1] Tunable sched_mc_power_savings=n

    Andi Kleen wrote:
    > Vaidyanathan Srinivasan writes:
    >> The idea being proposed is to enhance the tunable with varied degrees
    >> of consolidation that can work best for different workload
    >> characteristics. echo 2 > /sys/.../sched_mc_power_savings could
    >> enable more aggressive consolidation than the default.

    >
    > It would be better to fix the single power saving default to work
    > better with bursty workloads too than to add more tunables. Tunables
    > are basically "we give up, let's push the problem to the user"
    > which is not nice. I suspect a lot of users won't even know if their
    > workloads are bursty or not. Or they might have workloads which
    > are both bursty and not bursty.
    >
    > Or did you try that and failed?
    >


    A user could be an application and certain applications can predict their
    workload. For example, a database, a file indexer, etc can predict their workload.

    Policies are best known in user land and best controlled from there.
    Consider a case where the end user might select a performance based policy or a
    policy to aggressively save power (during peak tariff times). With
    virtualization, the whole concept of an application is changing; the OS by itself
    could be an application.


    --
    Warm Regards,
    Balbir Singh
    Linux Technology Center
    IBM, ISTL

  5. Re: [RFC v1] Tunable sched_mc_power_savings=n


    > A user could be an application and certain applications can predict their
    > workload.


    So you expect the applications to run suid root and change a sysctl?
    And what happens when two applications run that do that and they have differing
    requirements? Will they fight over the sysctl?

    > For example, a database, a file indexer, etc can predict their workload.



    A file indexer should run with a high nice level and low priority and would ideally always
    prefer power saving. But it doesn't currently. Perhaps it should?

    >
    > Policies are best known in user land and best controlled from there.
    > Consider a case where the end user might select a performance based policy or a
    > policy to aggressively save power (during peak tariff times). With


    How many users are going to do that? Seems like an unrealistic case to me.

    -Andi

  6. Re: [RFC v1] Tunable sched_mc_power_savings=n

    * Dipankar Sarma [2008-06-26 20:31:00]:

    > On Thu, Jun 26, 2008 at 03:49:01PM +0200, Andi Kleen wrote:
    > > Vaidyanathan Srinivasan writes:
    > > >
    > > > The idea being proposed is to enhance the tunable with varied degrees
    > > > of consolidation that can work best for different workload
    > > > characteristics. echo 2 > /sys/.../sched_mc_power_savings could
    > > > enable more aggressive consolidation than the default.

    > >
    > > It would be better to fix the single power saving default to work
    > > better with bursty workloads too than to add more tunables. Tunables
    > > are basically "we give up, let's push the problem to the user"
    > > which is not nice. I suspect a lot of users won't even know if their
    > > workloads are bursty or not. Or they might have workloads which
    > > are both bursty and not bursty.
    > >
    > > Or did you try that and failed?

    >
    > I think we have a reasonable default with sched_mc_power_savings=1.
    > Beyond that, it is hard to figure out how much work you can group together
    > and run in a small number of physical CPU packages. The approach
    > we are taking is to let system administrators decide what level
    > of power savings they want. If they want power savings at the cost
    > of performance, they should be able to do so using a higher
    > value of sched_mc_power_savings. If they see that they can pack
    > more work without affecting their transaction time, they should
    > be able to adjust the level of packing. Beyond a sane default,
    > it is hard to do this inside the kernel.


    Hi Andi,

    Aggressive grouping and consolidation may hurt performance to some
    extent depending on the workload. The default setting could have the least
    performance impact and moderate power savings. We certainly need
    user/application input on how much 'potential' performance hit the
    application is willing to take in order to save considerable power
    under low system utilisation. As Dipankar has mentioned, the proposed
    idea is to use sched_mc_power_savings as a power-savings and
    performance trade-off tunable parameter.

    We tried tweaking the wakeup logic to move tasks to one package when
    the system is idle; it works great at idle, but could potentially
    cause too much redundant load balancing at certain system utilisation
    levels. Every technique used to consolidate tasks has its benefits at
    a particular utilisation level and also depends on the nature of the
    workload. I agree that we should avoid tunables as far as possible,
    but we still need to make the changes available to the community so
    that we can compare the different methods across various workloads
    and system configurations. One of the settings in the tunable can
    very well be 'let the kernel decide what is best'.
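
    For reference, a standalone toy of the wakeup biasing we
    experimented with (the cpu/package layout and names are made up and
    the logic is heavily simplified; this is not the actual patch):

        /* Toy wakeup biasing: prefer an idle core inside an already-busy
         * package before waking up a fully idle package. */
        #include <stdio.h>

        #define PACKAGES 2
        #define CORES    2

        int busy[PACKAGES][CORES] = { {1, 0}, {0, 0} };

        int pick_wakeup_cpu(void)
        {
                /* First pass: idle core in a package that has work. */
                for (int p = 0; p < PACKAGES; p++) {
                        int pkg_busy = 0;
                        for (int c = 0; c < CORES; c++)
                                pkg_busy |= busy[p][c];
                        if (!pkg_busy)
                                continue;
                        for (int c = 0; c < CORES; c++)
                                if (!busy[p][c])
                                        return p * CORES + c;
                }
                /* Fall back: any idle core, waking a new package. */
                for (int p = 0; p < PACKAGES; p++)
                        for (int c = 0; c < CORES; c++)
                                if (!busy[p][c])
                                        return p * CORES + c;
                return 0;
        }

        int main(void)
        {
                /* cpu 1 shares a package with the busy cpu 0 */
                printf("wake task on cpu %d\n", pick_wakeup_cpu());
                return 0;
        }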

    --Vaidy


  7. Re: [RFC v1] Tunable sched_mc_power_savings=n

    * Andi Kleen [2008-06-26 20:08:41]:

    >
    > > A user could be an application and certain applications can predict their
    > > workload.

    >
    > So you expect the applications to run suid root and change a sysctl?
    > And what happens when two applications run that do that and they have differing
    > requirements? Will they fight over the sysctl?


    System management software and workload monitoring and managing
    software can potentially control the tunable on behalf of the
    applications for best overall power savings and performance.

    Applications with conflicting goals should resolve among themselves.
    The application with highest performance requirement should win. The
    power QoS framework set_acceptable_latency() ensures that the lowest
    latency set across the system wins. This tunable can also be based on
    a similar approach.
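
    As a toy illustration of that 'lowest request wins' arbitration
    (requester names and values are made up here):

        /* Each requester states the worst latency it can tolerate; the
         * most demanding (lowest) request becomes the system target. */
        #include <limits.h>
        #include <stdio.h>

        struct request { const char *who; int usecs; };

        int effective_latency(const struct request *reqs, int n)
        {
                int best = INT_MAX;     /* no constraint yet */

                for (int i = 0; i < n; i++)
                        if (reqs[i].usecs < best)
                                best = reqs[i].usecs;
                return best;
        }

        int main(void)
        {
                struct request reqs[] = {
                        { "audio",   1000 },    /* wants 1ms wakeups */
                        { "network", 5000 },
                };

                printf("system latency target: %d usecs\n",
                       effective_latency(reqs, 2)); /* 1000, audio wins */
                return 0;
        }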


    > > For example, a database, a file indexer, etc can predict their workload.

    >
    >
    > A file indexer should run with a high nice level and low priority and would ideally always
    > prefer power saving. But it doesn't currently. Perhaps it should?


    Power management settings affect the entire system. It may not be
    based on per application priority or nice value. However if the
    priorities of all the applications currently running in the system
    indicate power savings, then the kernel can go to a more aggressive power
    saving state.

    > >
    > > Policies are best known in user land and best controlled from there.
    > > Consider a case where the end user might select a performance based policy or a
    > > policy to aggressively save power (during peak tariff times). With

    >
    > How many users are going to do that? Seems like an unrealistic case to me.


    System management software should do this. Certainly manual
    intervention to change these settings will not be popular. Given the
    trends in virtualisation and modular systems, most datacenters will
    use some form of systems management software and infrastructure that
    is empowered to make policy based decisions on provisioning and
    systems configuration.

    In small-scale datacenters, peak and off-peak hour settings can be
    potentially done through simple cron jobs.

    --Vaidy

  8. Re: [RFC v1] Tunable sched_mc_power_savings=n

    Vaidyanathan Srinivasan wrote:
    > * Andi Kleen [2008-06-26 20:08:41]:
    >
    >
    >>>A user could be an application and certain applications can predict their
    >>>workload.

    >>
    >>So you expect the applications to run suid root and change a sysctl?
    >>And what happens when two applications run that do that and they have differing
    >>requirements? Will they fight over the sysctl?


    There are cases where Oracle does this, to ensure the (critical!) log writer
    isn't starved by cpu-hungry query optimizer processes...


    > System management software and workload monitoring and managing
    > software can potentially control the tunable on behalf of the
    > applications for best overall power savings and performance.
    >
    > Applications with conflicting goals should resolve among themselves.
    > The application with highest performance requirement should win. The
    > power QoS framework set_acceptable_latency() ensures that the lowest
    > latency set across the system wins. This tunable can also be based on
    > a similar approach.


    This is what the IBM zOS "WLM" does: a godlike service runs, records
    the delays of workloads on the system, and then adjusts tuning
    parameters to speed up processes which are running slower than their
    service levels call for, taking the resources from processes which
    are running faster than service agreements require.

    Look for goal-directed resource management and "workload manager" in
    Redbooks. Better, ask some of the IBM folks here (;-))


    >>>For example, a database, a file indexer, etc can predict their workload.

    >>
    >>
    >>A file indexer should run with a high nice level and low priority and would ideally always
    >>prefer power saving. But it doesn't currently. Perhaps it should?

    >
    >
    > Power management settings affect the entire system. It may not be
    > based on per application priority or nice value. However if the
    > priorities of all the applications currently running in the system
    > indicate power savings, then the kernel can go to a more aggressive power
    > saving state.
    >
    >
    >>>Policies are best known in user land and best controlled from there.
    >>>Consider a case where the end user might select a performance based policy or a
    >>>policy to aggressively save power (during peak tariff times). With

    >>
    >>How many users are going to do that? Seems like an unrealistic case to me.


    It's just another policy you could have in your workload management
    set: a friend and I were discussing that just the other day!

    > System management software should do this. Certainly manual
    > intervention to change these settings will not be popular. Given the
    > trends in virtualisation and modular systems, most datacenters will
    > use some form of systems management software and infrastructure that
    > is empowered to make policy based decisions on provisioning and
    > systems configuration.
    >
    > In small-scale datacenters, peak and off-peak hour settings can be
    > potentially done through simple cron jobs.
    >
    > --Vaidy


    --dave
    --
    David Collier-Brown | Always do right. This will gratify
    Sun Microsystems, Toronto | some people and astonish the rest
    davecb@sun.com | -- Mark Twain
    (905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583
    bridge: (877) 385-4099 code: 506 9191#

  9. Re: [RFC v1] Tunable sched_mc_power_savings=n

    Vaidyanathan Srinivasan wrote:

    Playing devil's advocate here.


    > * Andi Kleen [2008-06-26 20:08:41]:
    >
    >>> A user could be an application and certain applications can predict their
    >>> workload.

    >> So you expect the applications to run suid root and change a sysctl?
    >> And what happens when two applications run that do that and they have differing
    >> requirements? Will they fight over the sysctl?

    >
    > System management software and workload monitoring and managing
    > software can potentially control the tunable on behalf of the
    > applications for best overall power savings and performance.


    Does it have the needed information for that? e.g. real time information
    on what the system does? I don't think anybody is in a better position
    to control that than the kernel.

    > Applications with conflicting goals should resolve among themselves.


    That sounds wrong to me. Negotiating between conflicting requirements
    from different applications is something that kernels are supposed
    to do.

    > The application with highest performance requirement should win.


    That is right, but the kernel can do that based on nice levels
    and possibly other information, can't it?


    > The
    > power QoS framework set_acceptable_latency() ensures that the lowest
    > latency set across the system wins.


    But that only helps kernel drivers, not user space, doesn't it?

    > Power management settings affect the entire system. It may not be
    > based on per application priority or nice value. However if the
    > priorities of all the applications currently running in the system
    > indicate power savings, then the kernel can go to a more aggressive power
    > saving state.


    That's what I meant, yes. So if only the file system indexer is running
    overnight, all niced, it will run as power-efficiently as possible.

    > In small-scale datacenters, peak and off-peak hour settings can be
    > potentially done through simple cron jobs.


    Is there any real drawback from only controlling it through nice levels?

    Anyways I think the main thing I object to in your proposal is that
    your tunable is system global, not per process. I'm also not
    sure if a tunable is really a good idea and if the kernel couldn't
    do a better job.

    -Andi

  10. Re: [RFC v1] Tunable sched_mc_power_savings=n

    On Thu, Jun 26, 2008 at 10:17:00PM +0200, Andi Kleen wrote:
    > Vaidyanathan Srinivasan wrote:
    > > System management software and workload monitoring and managing
    > > software can potentially control the tunable on behalf of the
    > > applications for best overall power savings and performance.

    >
    > Does it have the needed information for that? e.g. real time information
    > on what the system does? I don't think anybody is in a better position
    > to control that than the kernel.


    Some workload managers already do that - they provision cpu and memory
    resources based on request rates and response times. Such software is
    in a better position to make a decision whether they can live with
    reduced performance due to power saving mode or not. The point I am
    making is that the kernel doesn't have any notion of transactional
    performance - so if an administrator wants to run unimportant
    transactions on a slower but low-power system, he/she should have
    the option of doing so.

    > > Applications with conflicting goals should resolve among themselves.

    >
    > That sounds wrong to me. Negotiating between conflicting requirements
    > from different applications is something that kernels are supposed
    > to do.


    Agreed. However that is a difficult problem to solve and not the
    intention of this idea. Global power setting is a simple first step.
    I don't think we have a good understanding of cases where conflicting
    power requirements from multiple applications need to be addressed.
    We will have to look at that when the issue arises.

    > > In small-scale datacenters, peak and off-peak hour settings can be
    > > potentially done through simple cron jobs.

    >
    > Is there any real drawback from only controlling it through nice levels?


    In a system with more than a couple of sockets, it is more beneficial
    (power-wise) to pack all work into a small number of processors
    and let the other processors go to very low power sleep. Compared
    to running tasks slowly and spreading them all over the processors.

    > Anyways I think the main thing I object to in your proposal is that
    > your tunable is system global, not per process. I'm also not
    > sure if a tunable is really a good idea and if the kernel couldn't
    > do a better job.


    While it would be nice to have a per process tunable, I am not sure
    we are ready for that yet. A global setting is easy to implement
    and we have immediate use for it. The kernel already does a decent
    job conservatively - by packing one task per core in a package
    when sched_mc_power_savings=1 is set. Any further packing may affect
    performance and should not therefore be the default behavior.

    Thanks
    Dipankar

  11. Re: [RFC v1] Tunable sched_mc_power_savings=n

    Dipankar Sarma wrote:

    > Some workload managers already do that - they provision cpu and memory
    > resources based on request rates and response times. Such software is
    > in a better position to make a decision whether they can live with
    > reduced performance due to power saving mode or not. The point I am
    > making is that the kernel doesn't have any notion of transactional
    > performance


    The kernel definitely knows about burstiness vs non burstiness at least
    (although it currently has no long term memory for that). Does it need
    more than that for this? Anyways if nice levels were used that is not
    even needed, because it's ok to run niced processes slower.

    And your workload manager could just nice processes. It should probably
    do that anyways to tell ondemand you don't need full frequency.

    > - so if an administrator wants to run unimportant
    > transactions on a slower but low-power system, he/she should have
    > the option of doing so.
    >
    >>> Applications with conflicting goals should resolve among themselves.

    >> That sounds wrong to me. Negotiating between conflicting requirements
    >> from different applications is something that kernels are supposed
    >> to do.

    >
    > Agreed. However that is a difficult problem to solve and not the
    > intention of this idea. Global power setting is a simple first step.
    > I don't think we have a good understanding of cases where conflicting


    Always the guy who needs the most performance wins? And if only
    niced processes are running it's ok to be slower.

    It would be similar to nice levels. In fact nice levels could probably be
    used directly (similar to how ionice coopts them too)

    Or another case that already uses it is cpufreq/ondemand: when only niced
    processes run the CPU is not cranked up to the highest frequency.

    I don't see why that information couldn't be used by the load balancer
    either to optimize socket use for power. Ok except that the load balancer
    is already very tricky. But it would still probably be better to have
    some more complex code that does DTRT automatically than another tunable.

    >>> In small-scale datacenters, peak and off-peak hour settings can be
    >>> potentially done through simple cron jobs.

    >> Is there any real drawback from only controlling it through nice levels?

    >
    > In a system with more than a couple of sockets, it is more beneficial
    > (power-wise) to pack all work into a small number of processors
    > and let the other processors go to very low power sleep. Compared
    > to running tasks slowly and spreading them all over the processors.


    You answered a different question?

    > While it would be nice to have a per process tunable, I am not sure
    > we are ready for that yet.


    Can you please elaborate what you think is missing?

    -Andi

  12. Re: [RFC v1] Tunable sched_mc_power_savings=n

    On Thu, 2008-06-26 at 23:37 +0200, Andi Kleen wrote:
    > Dipankar Sarma wrote:
    >
    > > Some workload managers already do that - they provision cpu and memory
    > > resources based on request rates and response times. Such software is
    > > in a better position to make a decision whether they can live with
    > > reduced performance due to power saving mode or not. The point I am
    > > making is that the kernel doesn't have any notion of transactional
    > > performance

    >
    > The kernel definitely knows about burstiness vs non burstiness at least
    > (although it currently has no long term memory for that). Does it need
    > more than that for this? Anyways if nice levels were used that is not
    > even needed, because it's ok to run niced processes slower.
    >
    > And your workload manager could just nice processes. It should probably
    > do that anyways to tell ondemand you don't need full frequency.


    Except that I want my nice 19 distcc processes to utilize as much cpu as
    possible, but just not bother any other stuff I might be doing...




  13. Re: [RFC v1] Tunable sched_mc_power_savings=n

    Peter Zijlstra wrote:

    >> And your workload manager could just nice processes. It should probably
    >> do that anyways to tell ondemand you don't need full frequency.

    >
    > Except that I want my nice 19 distcc processes to utilize as much cpu as
    > possible, but just not bother any other stuff I might be doing...


    They already won't do that if you run ondemand and cpufreq. It won't
    crank up the frequency for niced processes.

    Extending that existing policy to socket load balancing would be only
    natural.

    -Andi


  14. Re: [RFC v1] Tunable sched_mc_power_savings=n

    Andi Kleen wrote:
    >> A user could be an application and certain applications can predict their
    >> workload.

    >
    > So you expect the applications to run suid root and change a sysctl?
    > And what happens when two applications run that do that and they have differing
    > requirements? Will they fight over the sysctl?
    >


    We expect the system administrator to set an overall policy. The administrators
    should have some flexibility in deciding how aggressive they want their power
    savings to be.

    >> For example, a database, a file indexer, etc can predict their workload.

    >
    >
    > A file indexer should run with a high nice level and low priority and would ideally always
    > prefer power saving. But it doesn't currently. Perhaps it should?
    >


    Replace the file indexer with a data warehouse. What if I have several instances of
    these workloads running in parallel? The administrator should be able to decide
    when to consolidate for power and when to spread for performance.


    >> Policies are best known in user land and best controlled from there.
    >> Consider a case where the end user might select a performance based policy or a
    >> policy to aggressively save power (during peak tariff times). With

    >
    > How many users are going to do that? Seems like an unrealistic case to me.


    Two generic comments about the users part:

    1. The fact that we have sched_mc_power_savings is an indication that there are
    users trying to use it for power savings
    2. Users demand features, but they can only use them once we provide the tunables.

    It might seem unrealistic for a one-machine scenario, but consider a data center
    hosting thousands of servers. Depending on the utilization, the administrator
    might decide to use different policies for different servers.



    --
    Warm Regards,
    Balbir Singh
    Linux Technology Center
    IBM, ISTL

  15. Re: [RFC v1] Tunable sched_mc_power_savings=n

    On Thu, Jun 26, 2008 at 11:37:08PM +0200, Andi Kleen wrote:
    > Dipankar Sarma wrote:
    >
    > > Some workload managers already do that - they provision cpu and memory
    > > resources based on request rates and response times. Such software is
    > > in a better position to make a decision whether they can live with
    > > reduced performance due to power saving mode or not. The point I am
    > > making is that the kernel doesn't have any notion of transactional
    > > performance

    >
    > The kernel definitely knows about burstiness vs non burstiness at least
    > (although it currently has no long term memory for that). Does it need
    > more than that for this? Anyways if nice levels were used that is not
    > even needed, because it's ok to run niced processes slower.
    >
    > And your workload manager could just nice processes. It should probably
    > do that anyways to tell ondemand you don't need full frequency.


    The current usage we are looking at requires system-wide
    settings. That means nicing every process running on the system.
    That seems a little messy. Secondly, even if you nice the processes
    they are still going to be spread all over the CPU packages
    running at lower frequencies due to nice. The point I am making
    is that it is more effective to push work into a smaller number
    of cpu packages and let others go to low-power sleep state.

    > > Agreed. However that is a difficult problem to solve and not the
    > > intention of this idea. Global power setting is a simple first step.
    > > I don't think we have a good understanding of cases where conflicting

    >
    > Always the guy who needs the most performance wins? And if only
    > niced processes are running it's ok to be slower.
    >
    > It would be similar to nice levels. In fact nice levels could probably be
    > used directly (similar to how ionice coopts them too)
    >
    > Or another case that already uses it is cpufreq/ondemand: when only niced
    > processes run the CPU is not cranked up to the highest frequency.


    Using nice, you can force lowering of frequency - but you can do that
    using the userspace governor as well - no need to mess with process
    priorities. We are talking about a different optimization here - something
    that will give more benefits in powersave mode when you have large
    systems.

    > >>> In small-scale datacenters, peak and off-peak hour settings can be
    > >>> potentially done through simple cron jobs.
    > >> Is there any real drawback from only controlling it through nice levels?

    > >
    > > In a system with more than a couple of sockets, it is more beneficial
    > > (power-wise) to pack all work into a small number of processors
    > > and let the other processors go to very low power sleep. Compared
    > > to running tasks slowly and spreading them all over the processors.

    >
    > You answered a different question?


    The point is that grouping tasks into a small number of sockets is
    more effective than nicing, which may still spread the tasks all
    over the sockets. Think of this as light-weight CPU hotplug:
    something that can compact and expand CPU capacity fast and
    extends an existing power management interface / logic.

    Thanks
    Dipankar


  16. Re: [RFC v1] Tunable sched_mc_power_savings=n

    * Andi Kleen [2008-06-27 00:38:53]:

    > Peter Zijlstra wrote:
    >
    > >> And your workload manager could just nice processes. It should probably
    > >> do that anyways to tell ondemand you don't need full frequency.

    > >
    > > Except that I want my nice 19 distcc processes to utilize as much cpu as
    > > possible, but just not bother any other stuff I might be doing...

    >
    > They already won't do that if you run ondemand and cpufreq. It won't
    > crank up the frequency for niced processes.


    This may not provide the best power savings if the workload is bursty.
    Finishing the job quickly and entering sleep states has a better
    impact. This is the race-to-idle problem, where we want to maximise
    sleep state utilisation rather than reduce the frequency. The
    benefit of this technique is certainly workload specific. However,
    even in this particular case, running at the lowest frequency is the
    safest option from the OS point of view for power savings. But for
    maximum power savings, increasing sleep state utilisation has the
    following advantages:

    * Sleep states are per core, while voltage and frequency control is
    for multiple cores in a multi-core package. Hence freq change
    decisions need to be taken at the package level. Though ondemand
    makes the decision based on per-core utilisation and process
    priority, the actual effect in hardware is the highest freq
    recommended among all cores. A per-core decision is actually only
    a recommendation or a vote.

    * Moving tasks to a smaller number of CPU packages in a multi-socket
    system will provide maximum savings, since even shared resources on
    the idle sockets can be in low power states.

    Multi-socket systems with multi-core CPUs have power-saving controls
    that were previously not available on single-core systems.
    Automatically making the right decision is an ideal solution.
    However, since there are trade-offs, we would like users to
    experiment with what suits them best. The rationale is similar to
    why we provide different cpufreq governors and tunables.

    If we discover a good automatic technique to choose the right power
    saving strategy that is widely acceptable, then certainly we will go
    for it. Can we build the stepping stones to get there? Can we
    consider these tunables as enablers for end users to try the
    strategies out easily and provide feedback?

    >
    > Extending that existing policy to socket load balancing would be only
    > natural.


    Consolidation based on task priority seems to be the challenge here.
    However, this is a good point: priority is certainly a parameter for
    auto tuning, if only we can overcome the challenges in using it for
    task consolidation.

    --Vaidy


  17. Re: [RFC v1] Tunable sched_mc_power_savings=n

    * David Collier-Brown [2008-06-26 15:37:06]:

    > Vaidyanathan Srinivasan wrote:
    >> * Andi Kleen [2008-06-26 20:08:41]:
    >>
    >>
    >>>> A user could be an application and certain applications can predict their
    >>>> workload.
    >>>
    >>> So you expect the applications to run suid root and change a sysctl?
    >>> And what happens when two applications run that do that and they have differing
    >>> requirements? Will they fight over the sysctl?

    >
    > There are cases where Oracle does this, to ensure the (critical!) log writer
    > isn't starved by cpu-hungry query optimizer processes...


    Good, here is an example of the use case we are proposing.

    >
    >
    >> System management software and workload monitoring and managing
    >> software can potentially control the tunable on behalf of the
    >> applications for best overall power savings and performance.
    >>
    >> Applications with conflicting goals should resolve among themselves.
    >> The application with highest performance requirement should win. The
    >> power QoS framework set_acceptable_latency() ensures that the lowest
    >> latency set across the system wins. This tunable can also be based on
    >> a similar approach.

    >
    > This is what the IBM zOS "WLM" does: a godlike service runs, records
    > the delays of workloads on the system, and then adjusts tuning
    > parameters to speed up processes which are running slower than their
    > service levels call for, taking the resources from processes which
    > are running faster than service agreements require.
    >
    > Look for goal-directed resource management and "workload manager" in
    > Redbooks. Better, ask some of the IBM folks here (;-))


    This tunable can certainly be very useful for such WLM software.
    However, this can be useful in simple system deployments as well. If
    the purpose of the system and its workload characteristics are easily
    determined and there is little runtime variation, then the
    administrator can easily choose the correct tunable.

    >
    >>>> For example, a database, a file indexer, etc can predict their workload.
    >>>
    >>>
    >>> A file indexer should run with a high nice level and low priority and would ideally always
    >>> prefer power saving. But it doesn't currently. Perhaps it should?

    >>
    >>
    >> Power management settings affect the entire system. It may not be
    >> based on per application priority or nice value. However if the
    >> priorities of all the applications currently running in the system
    >> indicate power savings, then the kernel can go to a more aggressive power
    >> saving state.
    >>
    >>
    >>>> Policies are best known in user land and best controlled from there.
    >>>> Consider a case where the end user might select a performance based policy or a
    >>>> policy to aggressively save power (during peak tariff times). With
    >>>
    >>> How many users are going to do that? Seems like an unrealistic case to me.

    >
    > It's just another policy you could have in your workload management
    > set: a friend and I were discussing that just the other day!


    A power policy across the datacenter that takes into account customer
    priority class and the current cost of power (peak vs non-peak time).

    >> System management software should do this. Certainly manual
    >> intervention to change these settings will not be popular. Given the
    >> trends in virtualisation and modular systems, most datacenters will
    >> use some form of systems management software and infrastructure that
    >> is empowered to make policy based decisions on provisioning and
    >> systems configuration.
    >>
    >> In small-scale datacenters, peak and off-peak hour settings can be
    >> potentially done through simple cron jobs.
    >>
    >> --Vaidy



  18. Re: [RFC v1] Tunable sched_mc_power_savings=n

    * Andi Kleen [2008-06-26 22:17:00]:

    > Vaidyanathan Srinivasan wrote:
    >
    > Playing devil's advocate here.
    >


    [...]

    > > The
    > > power QoS framework set_acceptable_latency() ensures that the lowest
    > > latency set across the system wins.

    >
    > But that only helps kernel drivers, not user space, doesn't it?


    Yes, the QoS notification is mainly for kernel drivers, but
    applications can control them using the /dev/[...,network_latency,...]
    interface as documented in Documentation/power/pm_qos_interface.txt.

    The device drivers are expected to get feedback (tunable?) from
    applications that are dependent on those drivers and set the correct
    power saving level. Multimedia applications are expected to make use
    of this interface to set/communicate the correct power saving levels
    for audio drivers.

    Many applications can set different latency requirements, but the
    lowest will win. Here the PM-QoS framework in the kernel arbitrates
    between applications and resolves conflicts by choosing the lowest
    latency, i.e. the most conservative power saving mode.
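
    A minimal sketch of how an application might hold a latency
    requirement through this interface, assuming the binary s32 write
    described in pm_qos_interface.txt (the 2ms value and the choice of
    /dev/network_latency are illustrative):

        /* Register a network latency requirement with PM-QoS.  The
         * requirement stays in force while the fd is open and is
         * dropped automatically when the process closes it or exits. */
        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                int32_t usecs = 2000;   /* tolerate at most 2ms */
                int fd = open("/dev/network_latency", O_RDWR);

                if (fd < 0) {
                        perror("open");
                        return 1;
                }
                if (write(fd, &usecs, sizeof(usecs)) != sizeof(usecs)) {
                        perror("write");
                        close(fd);
                        return 1;
                }
                /* ... latency-sensitive work; the kernel arbitrates our
                 * value against other requesters, lowest wins ... */
                pause();        /* hold the requirement open */
                return 0;
        }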

    --Vaidy

    [...]

  19. Re: [RFC v1] Tunable sched_mc_power_savings=n

    On Fri, 2008-06-27 at 00:38 +0200, Andi Kleen wrote:
    > Peter Zijlstra wrote:
    >
    > >> And your workload manager could just nice processes. It should probably
    > >> do that anyways to tell ondemand you don't need full frequency.

    > >
    > > Except that I want my nice 19 distcc processes to utilize as much cpu as
    > > possible, but just not bother any other stuff I might be doing...

    >
    > They already won't do that if you run ondemand and cpufreq. It won't
    > crank up the frequency for niced processes.
    >
    > Extending that existing policy to socket load balancing would be only
    > natural.


    There used to be an option for them to also ramp up on niced load.
    If that disappeared then I'd call that a huge usability regression,
    basically making ondemand useless.

    /me checks,..

    Yeah, on F9, my opteron runs at 1GHz when idle, but when I start
    distcc, which as I said runs at nice 19, the cpu speed goes up to
    2.4GHz.

    And it uses the ondemand governor.


  20. Re: [RFC v1] Tunable sched_mc_power_savings=n

    Dipankar Sarma wrote:
    > On Thu, Jun 26, 2008 at 11:37:08PM +0200, Andi Kleen wrote:
    >> Dipankar Sarma wrote:
    >>
    >>> Some workload managers already do that - they provision cpu and memory
    >>> resources based on request rates and response times. Such software is
    >>> in a better position to make a decision whether they can live with
    >>> reduced performance due to power saving mode or not. The point I am
    >>> making is that the kernel doesn't have any notion of transactional
    >>> performance

    >> The kernel definitely knows about burstiness vs non burstiness at least
    >> (although it currently has no long term memory for that). Does it need
    >> more than that for this? Anyways if nice levels were used that is not
    >> even needed, because it's ok to run niced processes slower.
    >>
    >> And your workload manager could just nice processes. It should probably
    >> do that anyways to tell ondemand you don't need full frequency.

    >
    > The current usage we are looking at requires system-wide
    > settings. That means nicing every process running on the system.
    > That seems a little messy.


    Is it less messy than letting applications negotiate for the best
    policy by themselves, as someone else suggested on the thread?

    > Secondly, even if you nice the processes
    > they are still going to be spread all over the CPU packages
    > running at lower frequencies due to nice.


    My point was that this could be fixed and you could use nice
    (or another per process parameter if you prefer)
    as an input to load balancer decisions.

    > Using nice, you can force lowering of frequency - but you can do that
    > using the userspace governor as well - no need to mess with process
    > priorities.



    > We are talking about a different optimization here - something
    > that will give more benefits in powersave mode when you have large
    > systems.


    Yes, it's a different optimization (although the overall theme -- power
    saving -- is the same), but is there a real reason it cannot be driven
    from the same per process heuristics instead of your ugly global sysctl?

    >>>>> In small-scale datacenters, peak and off-peak hour settings can be
    >>>>> potentially done through simple cron jobs.
    >>>> Is there any real drawback from only controlling it through nice levels?
    >>> In a system with more than a couple of sockets, it is more beneficial
    >>> (power-wise) to pack all work into a small number of processors
    >>> and let the other processors go to very low power sleep. Compared
    >>> to running tasks slowly and spreading them all over the processors.

    >> You answered a different question?

    >
    > The point is that grouping tasks into a small number of sockets is
    > more effective than nicing, which may still spread the tasks all
    > over the sockets.


    Sorry you completely misunderstood me. I know the principle
    behind the socket grouping. And yes it's a different mechanism
    from cpu frequency scaling.

    My point was just that the heuristics
    used by one power saving mechanism (ondemand) could be used
    for the other too (socket grouping) -- and it would certainly be
    a far saner interface than a global sysctl!

    -Andi
