sched: deep power-saving states - Kernel


Thread: sched: deep power-saving states

  1. sched: deep power-saving states

    Hi Arjan,
    I was giving some thought to that topic you brought up at our
    LF-end-user session on RT w.r.t. deep power state wakeup adding latency.

    As Steven mentioned, we currently have this thing called "cpupri"
    (kernel/sched_cpupri.c) in the scheduler which allows us to classify
    each core (on a per disjoint cpuset basis) as being either IDLE,
    SCHED_OTHER, or RT1 - RT99. (Note that currently we lump both IDLE and
    SCHED_OTHER together as SCHED_OTHER because we don't yet care to
    differentiate between them, but I have patches to fix this that I can
    submit).

    What I was thinking is that a simple mechanism to quantify the
    power-state penalty would be to add those states as priority levels in
    the cpupri namespace. E.g. We could substitute IDLE-RUNNING for IDLE,
    and add IDLE-PS1, IDLE-PS2, .. IDLE-PSn, OTHER, RT1, .. RT99. This
    means the scheduler would favor waking an IDLE-RUNNING core over an
    IDLE-PS1..PSn core, etc. The question in my mind is: can the power
    states be determined statically, so that we know which value to assign
    to the idle state before we enter it? Or is it more dynamic (e.g. the
    longer it is in an MWAIT, the deeper the sleep gets)?
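
The ordering proposed above can be illustrated with a small user-space
sketch. It is only a model: the enum names, the choice of three idle power
states, and pick_cpu() are hypothetical and are not the kernel's cpupri
code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical priority levels, ordered as in the proposal: a lower value
 * means the CPU is a cheaper target for a wakeup. Three idle power states
 * are assumed purely for illustration.
 */
enum model_pri {
    PRI_IDLE_RUNNING = 0,
    PRI_IDLE_PS1,
    PRI_IDLE_PS2,
    PRI_IDLE_PS3,
    PRI_OTHER,
    PRI_RT1,                    /* RT1..RT99 occupy PRI_RT1 .. PRI_RT99 */
    PRI_RT99 = PRI_RT1 + 98
};

/*
 * Return the index of the CPU with the lowest level that is still below
 * task_pri, or -1 if every CPU runs at task_pri or above. Ties go to the
 * lowest CPU index.
 */
int pick_cpu(const int *cpu_pri, size_t ncpus, int task_pri)
{
    int best = -1;

    assert(cpu_pri != NULL);
    for (size_t i = 0; i < ncpus; i++) {
        if (cpu_pri[i] >= task_pri)
            continue;           /* CPU is busy with equal or higher priority */
        if (best < 0 || cpu_pri[i] < cpu_pri[best])
            best = (int)i;
    }
    return best;
}
```

With this ordering a CPU in any idle power state still ranks below one
running a SCHED_OTHER task, so a sleeping core is woken before a busy one
is preempted; only the choice among idle cores changes.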

    If it's dynamic, is there a deterministic algorithm that could be applied
    so that, say, a timer on a different CPU (the BSP makes sense to me) could
    advance the IDLE-PSx state in cpupri on behalf of the low-power core as
    time goes on?

    Thoughts?
    -Greg
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: sched: deep power-saving states

    On Wed, 22 Oct 2008 09:42:52 -0400
    Gregory Haskins wrote:

    > What I was thinking is that a simple mechanism to quantify the
    > power-state penalty would be to add those states as priority levels in
    > the cpupri namespace. E.g. We could substitute IDLE-RUNNING for IDLE,
    > and add IDLE-PS1, IDLE-PS2, .. IDLE-PSn, OTHER, RT1, .. RT99. This
    > means the scheduler would favor waking an IDLE-RUNNING core over an
    > IDLE-PS1-PSn, etc. The question in my mind is: can the power-states
    > be determined in a static fashion such that we know what value to
    > quantify the idle state before we enter it? Or is it more dynamic
    > (e.g. the longer it is in an MWAIT, the deeper the sleep gets).


    it's a little dynamic, but just assuming the worst will be a very good
    approximation of reality. And we know what we're getting into in that
    sense.

    --
    Arjan van de Ven Intel Open Source Technology Centre
    For development, discussion and tips for power savings,
    visit http://www.lesswatts.org

  3. Re: sched: deep power-saving states

    Arjan van de Ven wrote:
    > it's a little dynamic, but just assuming the worst will be a very good
    > approximation of reality. And we know what we're getting into in that
    > sense.
    >


    Ok, but if we just assume the worst case always, how do I differentiate
    between, say, IDLE-RUNNING and IDLE-PSn? If I assign them all to
    IDLE-PSn a priori, it's no better than the basic single IDLE state we
    support today. Or am I misunderstanding you?

    -Greg





  4. Re: sched: deep power-saving states

    On Wed, 22 Oct 2008 10:05:21 -0400
    Gregory Haskins wrote:

    >
    > Ok, but if we just assume the worst case always, how do I
    > differentiate between, say, IDLE-RUNNING and IDLE-PSn? If I assign
    > them all to IDLE-PSn a priori, it's no better than the basic single
    > IDLE state we support today. Or am I misunderstanding you?


    eh yes I wasn't very clear; it's pre-coffee time here

    we know, *for each C state* we go into, what its maximum latency is.
    Now, that is the *maximum*; there are times when it'll be less
    (there are several steps for going into a C-state hardware-wise, and if
    an interrupt comes in before they're all completed, getting out of it
    means not having to undo ALL the steps, so it'll be faster)



    --
    Arjan van de Ven Intel Open Source Technology Centre
    For development, discussion and tips for power savings,
    visit http://www.lesswatts.org

  5. Re: sched: deep power-saving states

    Arjan van de Ven wrote:
    >
    > eh yes I wasn't very clear; it's pre-coffee time here
    >
    > we know *for each C state* we go in, what its maximum latency is.
    > Now, that is the *maximum*; there are times where it'll be less
    > (there are several steps for going into a C-state hardware wise, and if
    > an interrupt comes in before they're all completed, getting out of it
    > means not having to undo ALL the steps, so it'll be faster)
    >


    [Adding Peter Zijlstra to the thread]

    Ah, yes of course! That makes sense. So I have to admit I am fairly
    ignorant of the ACPI C-state stuff, so I just read up on it. In the
    context of what you said, it makes perfect sense to me now.

    IIUC, the OS selects which C-state it will enter at idle points based on
    some internal criteria (TBD). All we have to do is remap the cpupri
    "IDLE" state to something like IDLE-C1, IDLE-C2, ..., IDLE-Cn and have
    the cpupri map get updated coincident with the pm_idle() call. Then the
    scheduler will naturally favor cores that are in lighter sleep over
    cores in deep sleep.
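
A rough user-space model of that remapping, assuming three C-states and a
flat per-CPU table. Every name here is hypothetical; in the kernel the
update would happen on the idle path and go through cpupri itself.

```c
#include <assert.h>

#define MODEL_NR_CPUS    4
#define MODEL_MAX_CSTATE 3      /* assumption: C1..C3 only */

/* Hypothetical levels: deeper sleep ranks higher, i.e. is less preferred. */
enum model_level { LVL_IDLE_C1 = 1, LVL_IDLE_C2, LVL_IDLE_C3, LVL_OTHER };

/* Per-CPU level table; every CPU starts out busy. */
static int model_map[MODEL_NR_CPUS] = {
    LVL_OTHER, LVL_OTHER, LVL_OTHER, LVL_OTHER
};

/* Called just before the CPU enters the chosen C-state. */
void model_idle_enter(int cpu, int cstate)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    assert(cstate >= 1 && cstate <= MODEL_MAX_CSTATE);
    model_map[cpu] = LVL_IDLE_C1 + (cstate - 1);
}

/* Called when the CPU leaves idle and starts running a task again. */
void model_idle_exit(int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    model_map[cpu] = LVL_OTHER;
}

/* Return the idle CPU in the lightest sleep, or -1 if none is idle. */
int model_shallowest_idle(void)
{
    int best = -1;

    for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
        if (model_map[cpu] >= LVL_OTHER)
            continue;           /* not idle */
        if (best < 0 || model_map[cpu] < model_map[best])
            best = cpu;
    }
    return best;
}
```

The table is written only at idle entry and exit, so the lookup on the
wakeup path stays a plain scan over already-published levels.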

    I am not sure if this is exactly what you were getting at during the
    conf, since it doesn't really consider deep-sleep latency times
    directly. But I think this is a step in the right direction.

    -Greg





  6. Re: sched: deep power-saving states

    On Wed, 22 Oct 2008 10:26:49 -0400
    Gregory Haskins wrote:
    >
    > [Adding Peter Zijlstra to the thread]
    >
    > Ah, yes of course! That makes sense. So I have to admit I am fairly
    > ignorant of the ACPI C-state stuff, so I just read up on it. In the
    > context of what you said, it makes perfect sense to me now.
    >
    > IIUC, the OS selects which C-state it will enter at idle points based
    > on some internal criteria (TBD). All we have to do is remap the
    > cpupri "IDLE" state to something like IDLE-C1, IDLE-C2, ..., IDLE-Cn
    > and have the cpupri map get updated coincident with the pm_idle()
    > call. Then the scheduler will naturally favor cores that are in
    > lighter sleep over cores in deep sleep.
    >
    > I am not sure if this is exactly what you were getting at during the
    > conf, since it doesn't really consider deep-sleep latency times
    > directly. But I think this is a step in the right direction.


    it for sure is a step in the right direction.
    the actual exit costs are an optional parameter in this sense:
    the steps between C states are non-linear (more like exponential),
    so the actual numbers could be put to use. but even if you don't
    use them, it still makes sense and is a very good first-order behavior.
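
If the actual exit costs were taken into account, the selection could
compare latencies instead of state indices. The sketch below assumes each
CPU publishes the maximum exit latency of the state it is in; the struct,
the function name, and the latency budget are hypothetical illustrations,
not anything proposed in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-CPU record: idle flag plus worst-case exit latency. */
struct model_cpu {
    int idle;                       /* non-zero if the CPU is idle */
    unsigned int exit_latency_us;   /* maximum wakeup cost of its C-state */
};

/*
 * Return the idle CPU with the smallest worst-case exit latency that does
 * not exceed budget_us, or -1 if no idle CPU fits the budget.
 */
int pick_by_latency(const struct model_cpu *cpus, size_t n,
                    unsigned int budget_us)
{
    int best = -1;

    assert(cpus != NULL);
    for (size_t i = 0; i < n; i++) {
        if (!cpus[i].idle || cpus[i].exit_latency_us > budget_us)
            continue;
        if (best < 0 ||
            cpus[i].exit_latency_us < cpus[best].exit_latency_us)
            best = (int)i;
    }
    return best;
}
```

Because the costs grow roughly exponentially with depth, a budget like
this separates "cheap" from "expensive" states more sharply than the bare
state index does.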



    --
    Arjan van de Ven Intel Open Source Technology Centre
    For development, discussion and tips for power savings,
    visit http://www.lesswatts.org

  7. Re: sched: deep power-saving states

    On Wed, 2008-10-22 at 07:36 -0700, Arjan van de Ven wrote:
    >
    > it for sure is a step in the right direction.
    > the actual exit costs are an optional parameter in this sense,
    > the steps between C states are non-linear (more like exponential)
    > so knowing the actual numbers could be used. but even if you don't
    > use it, it still makes sense and is a very good first order behavior.


    This still leaves us with the worst-case IRQ response as given by the
    deepest C state, which might be undesirable.

    jcm was, once upon a time, working on dynamically changing the idle
    routine, so that people who care about wakeup latency can run idle=poll
    while their application runs, and the acpi C state stuff when nobody
    cares.

    This could of course then be tied into the PM QoS stuff Mark has been
    doing.

    Fact of life is, for some RT apps, anything but idle=poll is too much.

    But yes, when C states are in play, it makes sense to try and wake a cpu
    that's not deep over a very deep idle one.

  8. Re: sched: deep power-saving states

    On Wed, 22 Oct 2008 21:49:52 +0200
    Peter Zijlstra wrote:
    >
    > This still leaves us with the worst case IRQ response as given by the
    > deepest C state. Which might be un-desirable.


    that's a different problem in a different problem space.
    >
    > jcm was, once upon a time, working on dynamically changing the idle
    > routine, so that people who care about wakeup latency can run
    > idle=poll while their application runs, and the acpi C state stuff
    > when nobody cares.
    >
    > This could of course then be tied into the PM QoS stuff Mark has been
    > doing.


    in fact you already have this *exactly* today; this isn't future
    technology.


    --
    Arjan van de Ven Intel Open Source Technology Centre
    For development, discussion and tips for power savings,
    visit http://www.lesswatts.org

  9. Re: sched: deep power-saving states

    On Wed, 2008-10-22 at 12:55 -0700, Arjan van de Ven wrote:
    > On Wed, 22 Oct 2008 21:49:52 +0200
    > Peter Zijlstra wrote:
    > >
    > > This still leaves us with the worst case IRQ response as given by the
    > > deepest C state. Which might be un-desirable.

    >
    > that's a different problem in a different problem space.


    Ah right, so the only point was trying to wake shallow cpus so as to try
    and let deep cpus idle longer?

    > > jcm was, once upon a time, working on dynamically changing the idle
    > > routine, so that people who care about wakeup latency can run
    > > idle=poll while their application runs, and the acpi C state stuff
    > > when nobody cares.
    > >
    > > This could of course then be tied into the PM QoS stuff Mark has been
    > > doing.

    >
    > in fact you already have this *exactly* today; this isn't future
    > technology.


    Interesting, what knob do I turn to get idle=poll dynamically?

  10. Re: sched: deep power-saving states

    On Wed, 22 Oct 2008 22:05:25 +0200
    Peter Zijlstra wrote:

    > >
    > > in fact you already have this *exactly* today; this isn't future
    > > technology.

    >
    > Interesting, what knob do I turn to get idle=poll dynamically?


    you ask PM QoS for a 0 usec latency, and you just get idle=poll behavior;
    you can do this from the kernel or, as root, from userland.
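
From userland, one way to make such a request is to write the latency to
the PM QoS character device and keep the file descriptor open. The sketch
below assumes the device is /dev/cpu_dma_latency and that it accepts a
32-bit microsecond value; neither detail is stated in this thread, and the
helper names are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Open the given PM QoS device node and request a latency bound in
 * microseconds. The request is assumed to stay in force for as long as
 * the returned descriptor remains open. Returns the fd, or -1 on failure.
 */
int latency_request_open(const char *path, int32_t usec)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    if (write(fd, &usec, sizeof(usec)) != (ssize_t)sizeof(usec)) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Drop the request by closing the descriptor. */
void latency_request_close(int fd)
{
    if (fd >= 0)
        close(fd);
}
```

A latency-sensitive application would call
latency_request_open("/dev/cpu_dma_latency", 0) as root at startup, hold
the descriptor while it runs, and call latency_request_close() on exit so
that deeper C states become available again.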


    --
    Arjan van de Ven Intel Open Source Technology Centre
    For development, discussion and tips for power savings,
    visit http://www.lesswatts.org
