[tbench regression fixes]: digging out smelly deadmen.

Thread: [tbench regression fixes]: digging out smelly deadmen.

  1. [tbench regression fixes]: digging out smelly deadmen.


    Hi.

    It was reported recently that tbench has a long history of regressions,
    starting at least from the 2.6.23 kernel. I verified that in my test
    environment tbench 'lost' more than 100 MB/s, dropping from 470 to 355
    between at least 2.6.24 and 2.6.27. The 2.6.26-2.6.27 performance
    regression on my machines roughly corresponds to a drop from 375 to 355 MB/s.

    I spent several days on various tests and bisections (unfortunately,
    bisect cannot always point to the 'right' commit), and found the
    following problems.

    First, related to the network, as lots of people expected: TSO/GSO over
    loopback with the tbench workload eats about 5-10 MB/s, since the TSO/GSO
    frame-creation overhead is not paid back by the gains of optimized
    super-frame processing. Since it brings a really impressive improvement
    for big-packet workloads, it was (likely) decided not to add a patch for
    this; instead, one can disable TSO/GSO via ethtool. This change went in
    during the 2.6.27 window, so it accounts for part of that regression.
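
    For reference, a rough sketch of how the offloads can be toggled with
    ethtool (here assuming the loopback device used by the tbench runs;
    exact flag support depends on the ethtool and kernel versions):

    # show current offload settings on lo
    ethtool -k lo
    # disable TSO and GSO on lo
    ethtool -K lo tso off gso off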

    The second part of the 26-27 window regression (which, as a reminder, is
    about 20 MB/s) is related to the scheduler changes, as another group of
    people expected. I tracked it down to commit
    a7be37ac8e1565e00880531f4e2aff421a21c803, which, when reverted, returns
    2.6.27 tbench performance to the highest (for 2.6.26-2.6.27) 365 MB/s
    mark. I also tested a tree stopped at that commit itself, i.e. not
    2.6.27, and got 373 MB/s, so other changes in that merge likely ate a
    couple of megabytes per second. A patch against 2.6.27 is attached.

    A curious reader may ask: where did we lose the other 100 MB/s? This
    issue was not detected (or at least not reported on netdev@ with a
    provocative enough subject), and it happens to live somewhere in the
    2.6.24-2.6.25 changes. I was lucky enough to 'guess' (after just a couple
    of hundred compilations) that it corresponds to commit
    8f4d37ec073c17e2d4aa8851df5837d798606d6f about high-resolution timers;
    the attached patch against 2.6.25 brings tbench performance for the
    2.6.25 kernel tree to 455 MB/s.

    There are still about 20 MB/s missing, but 2.6.24 has 475 MB/s, so the
    remaining bug likely lives between 2.6.24 and the above 8f4d37ec073 commit.

    I can test your patches for the 2.6.27 tree tomorrow (the most
    interesting attached one does not apply cleanly to the current tree);
    it is past 3 A.M. in Moscow.

    P.S. I'm not currently subscribed to any of the mentioned lists (and am
    writing from a long-unused email address), so I cannot find the
    appropriate subject and reply into the existing thread.

    --
    Evgeniy Polyakov


  2. Re: [tbench regression fixes]: digging out smelly deadmen.

    On Fri, 2008-10-10 at 03:17 +0400, Evgeniy Polyakov wrote:

    > I was lucky enough to 'guess' (after just a couple of hundred compilations)
    > that it corresponds to commit 8f4d37ec073c17e2d4aa8851df5837d798606d6f about
    > high-resolution timers; the attached patch against 2.6.25 brings tbench
    > performance for the 2.6.25 kernel tree to 455 MB/s.


    can you try

    echo NO_HRTICK > /debug/sched_features

    on .27-like kernels?
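
    (This assumes debugfs is mounted at /debug; if it is not, something like
    the following should expose the sched_features file first:)

    mkdir -p /debug
    mount -t debugfs nodev /debug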

    Also, what clocksource do those machines use?

    cat /sys/devices/system/clocksource/clocksource0/current_clocksource

    As to a7be37ac8e1565e00880531f4e2aff421a21c803, could you try
    tip/master? I reworked some of the wakeup preemption code in there.

    Thanks for looking into this issue!


  3. Re: [tbench regression fixes]: digging out smelly deadmen.

    Hi Peter.

    I've enabled the kernel hacking and scheduler debugging options, turned
    off hrticks, and performance jumped to 382 MB/s:

    vanilla 27: 347.222
    no TSO/GSO: 357.331
    no hrticks: 382.983

    I use the tsc clocksource; acpi_pm and jiffies are also available. With
    acpi_pm, performance is even lower (I stopped the test after it dropped
    below the 340 MB/s mark), and jiffies does not work at all: it looks like
    sockets get stuck in the time_wait state when that clock source is used,
    although that may be a different issue.
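
    (Clock sources can be listed and switched at runtime through the
    standard sysfs interface, for example:)

    cat /sys/devices/system/clocksource/clocksource0/available_clocksource
    echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource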

    So I think hrticks are guilty, but this is still not as good as the .25
    tree without the mentioned changes (455 MB/s) or .24 (475 MB/s).

    --
    Evgeniy Polyakov

  4. Re: [tbench regression fixes]: digging out smelly deadmen.


    hi Evgeniy,

    * Evgeniy Polyakov wrote:

    > Hi Peter.
    >
    > I've enabled the kernel hacking and scheduler debugging options, turned
    > off hrticks, and performance jumped to 382 MB/s:
    >
    > vanilla 27: 347.222
    > no TSO/GSO: 357.331
    > no hrticks: 382.983
    >
    > I use the tsc clocksource; acpi_pm and jiffies are also available. With
    > acpi_pm, performance is even lower (I stopped the test after it dropped
    > below the 340 MB/s mark), and jiffies does not work at all: it looks like
    > sockets get stuck in the time_wait state when that clock source is used,
    > although that may be a different issue.
    >
    > So I think hrticks are guilty, but this is still not as good as the .25
    > tree without the mentioned changes (455 MB/s) or .24 (475 MB/s).


    i'm glad that you are looking into this! That is an SMP box, right? If
    yes then could you try this sched-domains tuning utility i have written
    yesterday (incidentally):

    http://redhat.com/~mingo/cfs-schedul...-sched-domains

    just run it without options to see the current sched-domains options. On
    a test system i have, it displays this:

    # tune-sched-domains
    usage: tune-sched-domains
    current val on cpu0/domain0:
    SD flag: 47
    + 1: SD_LOAD_BALANCE: Do load balancing on this domain
    + 2: SD_BALANCE_NEWIDLE: Balance when about to become idle
    + 4: SD_BALANCE_EXEC: Balance on exec
    + 8: SD_BALANCE_FORK: Balance on fork, clone
    - 16: SD_WAKE_IDLE: Wake to idle CPU on task wakeup
    + 32: SD_WAKE_AFFINE: Wake task to waking CPU
    - 64: SD_WAKE_BALANCE: Perform balancing at task wakeup

    then could you check what effects it has if you turn off
    SD_BALANCE_NEWIDLE? On my box i did it via:

    # tune-sched-domains $[47-2]
    changed /proc/sys/kernel/sched_domain/cpu0/domain0/flags: 47 => 45
    SD flag: 45
    + 1: SD_LOAD_BALANCE: Do load balancing on this domain
    - 2: SD_BALANCE_NEWIDLE: Balance when about to become idle
    + 4: SD_BALANCE_EXEC: Balance on exec
    + 8: SD_BALANCE_FORK: Balance on fork, clone
    - 16: SD_WAKE_IDLE: Wake to idle CPU on task wakeup
    + 32: SD_WAKE_AFFINE: Wake task to waking CPU
    - 64: SD_WAKE_BALANCE: Perform balancing at task wakeup
    changed /proc/sys/kernel/sched_domain/cpu0/domain1/flags: 1101 => 45
    SD flag: 45
    + 1: SD_LOAD_BALANCE: Do load balancing on this domain
    - 2: SD_BALANCE_NEWIDLE: Balance when about to become idle
    + 4: SD_BALANCE_EXEC: Balance on exec
    + 8: SD_BALANCE_FORK: Balance on fork, clone
    - 16: SD_WAKE_IDLE: Wake to idle CPU on task wakeup
    + 32: SD_WAKE_AFFINE: Wake task to waking CPU
    - 64: SD_WAKE_BALANCE: Perform balancing at task wakeup

    and please, when tuning such scheduler bits, could you run latest
    tip/master:

    http://people.redhat.com/mingo/tip.git/README

    and you need to have CONFIG_SCHED_DEBUG=y enabled for the tuning knobs.

    so that it's all in sync with upcoming scheduler changes/tunings/fixes.

    It will also make it much easier for us to apply any fix patches you
    might send :-)

    For advanced tuners: you can specify two or more domain flags options as
    well on the command line - that will be put into domain1/domain2/etc. I
    usually tune these flags via something like:

    tune-sched-domains $[1*1+1*2+1*4+1*8+0*16+1*32+1*64]

    that makes it easy to set/clear each of the flags.

    Ingo

  5. Re: [tbench regression fixes]: digging out smelly deadmen.

    On Fri, 2008-10-10 at 03:17 +0400, Evgeniy Polyakov wrote:
    > Hi.


    Greetings. Glad to see someone pursuing this.

    > It was reported recently that tbench has a long history of regressions,
    > starting at least from the 2.6.23 kernel. I verified that in my test
    > environment tbench 'lost' more than 100 MB/s, dropping from 470 to 355
    > between at least 2.6.24 and 2.6.27. The 2.6.26-2.6.27 performance
    > regression on my machines roughly corresponds to a drop from 375 to 355 MB/s.
    >
    > I spent several days on various tests and bisections (unfortunately,
    > bisect cannot always point to the 'right' commit), and found the
    > following problems.
    >
    > First, related to the network, as lots of people expected: TSO/GSO over
    > loopback with the tbench workload eats about 5-10 MB/s, since the TSO/GSO
    > frame-creation overhead is not paid back by the gains of optimized
    > super-frame processing. Since it brings a really impressive improvement
    > for big-packet workloads, it was (likely) decided not to add a patch for
    > this; instead, one can disable TSO/GSO via ethtool. This change went in
    > during the 2.6.27 window, so it accounts for part of that regression.


    Partly; disabling TSO/GSO doesn't do enough here. See the test log below.

    > The second part of the 26-27 window regression (which, as a reminder, is
    > about 20 MB/s) is related to the scheduler changes, as another group of
    > people expected. I tracked it down to commit
    > a7be37ac8e1565e00880531f4e2aff421a21c803, which, when reverted, returns
    > 2.6.27 tbench performance to the highest (for 2.6.26-2.6.27) 365 MB/s
    > mark. I also tested a tree stopped at that commit itself, i.e. not
    > 2.6.27, and got 373 MB/s, so other changes in that merge likely ate a
    > couple of megabytes per second. A patch against 2.6.27 is attached.


    a7be37a adds some math overhead: calls to calc_delta_mine() per
    wakeup/context switch for tasks of all weights, whereas previously these
    calls were only made for tasks which were not nice 0. It also shifts
    performance a bit in favor of loads which dislike wakeup preemption;
    this effect lessens as the task count increases. Per testing, overhead is
    not the primary factor in the throughput loss. I believe clock accuracy
    to be a more important factor than overhead by a very large margin.

    Reverting a7be37a (and the two asym fixes) didn't do a whole lot for me
    either. I'm still ~8% down from 2.6.26 for netperf, and ~3% for tbench,
    and the 2.6.26 numbers are gcc-4.1, which are a little lower than
    gcc-4.3. Along the way, I've reverted 100% of scheduler and ilk 26->27
    and been unable to recover throughput. (Too bad I didn't know about
    that TSO/GSO thingy, would have been nice.)

    I can achieve nearly the same improvement for tbench with a little
    tinkering, and _more_ for netperf than reverting these changes delivers;
    see the last log entry, where the experiment cut the math overhead by
    less than 1/3.

    For the full cfs history, even with those three reverts, I'm ~6% down on
    tbench, and ~14% for netperf, and haven't found out where it went.

    > A curious reader may ask: where did we lose the other 100 MB/s? This
    > issue was not detected (or at least not reported on netdev@ with a
    > provocative enough subject), and it happens to live somewhere in the
    > 2.6.24-2.6.25 changes. I was lucky enough to 'guess' (after just a couple
    > of hundred compilations) that it corresponds to commit
    > 8f4d37ec073c17e2d4aa8851df5837d798606d6f about high-resolution timers;
    > the attached patch against 2.6.25 brings tbench performance for the
    > 2.6.25 kernel tree to 455 MB/s.


    I have highres timers disabled in my kernels because, per testing, they
    do cost a lot at high frequency, but primarily because they're not
    available throughout the test group; same for nohz. A patchlet went into
    2.6.27 to neutralize the cost of the hrtick when it's not active. Per
    re-test, 2.6.27 should be zero impact with the hrtick disabled.

    > There are still about 20 MB/s missing, but 2.6.24 has 475 MB/s, so the
    > remaining bug likely lives between 2.6.24 and the above 8f4d37ec073 commit.


    I lost some at 24, got it back at 25 etc. Some of it is fairness /
    preemption differences, but there's a bunch I can't find, and the massive
    amount of time I spent bisecting was a waste.

    My annotated test log. File under fwiw.

    Note: 2.6.23 CFS was apparently a bad-hair day for high-frequency
    switchers. Anyone entering the way-back machine to test 2.6.23 should
    probably use cfs-24.1, which is the 2.6.24 scheduler minus one line that
    has zero impact for nice 0 loads.

    -------------------------------------------------------------------------
    UP config, no nohz or highres timers except as noted.

    60 sec localhost network tests: tbench with 1 client and 1 netperf TCP_RR
    pair. Use ring-test -t 2 -w 0 -s 0 to see roughly how heavy the full
    ~0-work fast path is: vmstat 10 ctx/s fed to bc (close enough for
    government work). ring-test args: -t NR_tasks -w work_ms -s sleep_ms

    sched_wakeup_granularity_ns always set to 0 for all tests to maximize
    context switches.

    Why? O(1) preempts very aggressively with dissimilar task loads, which
    both tbench and netperf are. With O(1), the sleepier component preempts
    the less sleepy component on each and every wakeup. CFS preempts based on
    lag (sleepiness) as well, but short term vs. long term. A granularity of
    zero was as close to apples/apples as I could get... apples/pineapples.
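
    (On the CFS kernels the granularity knob is the sysctl exposed under
    CONFIG_SCHED_DEBUG, set roughly like so:)

    echo 0 > /proc/sys/kernel/sched_wakeup_granularity_ns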

    2.6.22.19-up
    ring-test - 1.204 us/cycle = 830 KHz (gcc-4.1)
    ring-test - doorstop (gcc-4.3)
    netperf - 147798.56 rr/s = 295 KHz (hmm, a bit unstable, 140K..147K rr/s)
    tbench - 374.573 MB/sec

    2.6.22.19-cfs-v24.1-up
    ring-test - 1.098 us/cycle = 910 KHz (gcc-4.1)
    ring-test - doorstop (gcc-4.3)
    netperf - 140039.03 rr/s = 280 KHz = 3.57us - 1.10us sched = 2.47us/packet network
    tbench - 364.191 MB/sec

    2.6.23.17-up
    ring-test - 1.252 us/cycle = 798 KHz (gcc-4.1)
    ring-test - 1.235 us/cycle = 809 KHz (gcc-4.3)
    netperf - 123736.40 rr/s = 247 KHz sb 268 KHZ / 134336.37 rr/s
    tbench - 355.906 MB/sec

    2.6.23.17-cfs-v24.1-up
    ring-test - 1.100 us/cycle = 909 KHz (gcc-4.1)
    ring-test - 1.074 us/cycle = 931 KHz (gcc-4.3)
    netperf - 135847.14 rr/s = 271 KHz sb 280 KHz / 140039.03 rr/s
    tbench - 364.511 MB/sec

    2.6.24.7-up
    ring-test - 1.100 us/cycle = 909 KHz (gcc-4.1)
    ring-test - 1.068 us/cycle = 936 KHz (gcc-4.3)
    netperf - 122300.66 rr/s = 244 KHz sb 280 KHz / 140039.03 rr/s
    tbench - 341.523 MB/sec

    2.6.25.17-up
    ring-test - 1.163 us/cycle = 859 KHz (gcc-4.1)
    ring-test - 1.129 us/cycle = 885 KHz (gcc-4.3)
    netperf - 132102.70 rr/s = 264 KHz sb 275 KHz / 137627.30 rr/s
    tbench - 361.71 MB/sec

    retest 2.6.25.18-up, gcc = 4.3

    2.6.25.18-up
    push patches/revert_hrtick.diff
    ring-test - 1.127 us/cycle = 887 KHz
    netperf - 132123.42 rr/s
    tbench - 358.964 361.538 361.164 MB/sec
    (all is well, zero impact as expected, enable highres timers)

    2.6.25.18-up
    pop patches/revert_hrtick.diff
    push patches/hrtick.diff (cut overhead when hrtick disabled patchlet in .27)

    echo 7 > sched_features = nohrtick
    ring-test - 1.183 us/cycle = 845 KHz
    netperf - 131976.23 rr/s
    tbench - 361.17 360.468 361.721 MB/sec

    echo 15 > sched_features = default = hrtick
    ring-test - 1.333 us/cycle = 750 KHz - .887
    netperf - 120520.67 rr/s - .913
    tbench - 344.092 344.569 344.839 MB/sec - .953

    (yeah, that's why i turned highres timers off while testing high-frequency throughput)

    2.6.26.5-up
    ring-test - 1.195 us/cycle = 836 KHz (gcc-4.1)
    ring-test - 1.179 us/cycle = 847 KHz (gcc-4.3)
    netperf - 131289.73 rr/s = 262 KHZ sb 272 KHz / 136425.64 rr/s
    tbench - 354.07 MB/sec

    2.6.27-rc8-up
    ring-test - 1.225 us/cycle = 816 KHz (gcc-4.1)
    ring-test - 1.196 us/cycle = 836 KHz (gcc-4.3)
    netperf - 118090.27 rr/s = 236 KHz sb 270 KHz / 135317.99 rr/s
    tbench - 329.856 MB/sec

    retest of 2.6.27-final-up, gcc = 4.3. tbench/netperf numbers above here
    are all gcc-4.1 except for 2.6.25 retest.

    2.6.27-final-up
    ring-test - 1.193 us/cycle = 838 KHz (gcc-4.3)
    tbench - 337.377 MB/sec tso/gso on
    tbench - 340.362 MB/sec tso/gso off
    netperf - TCP_RR 120751.30 rr/s tso/gso on
    netperf - TCP_RR 121293.48 rr/s tso/gso off

    2.6.27-final-up
    push revert_weight_and_asym_stuff.diff
    ring-test - 1.133 us/cycle = 882 KHz (gcc-4.3)
    tbench - 340.481 MB/sec tso/gso on
    tbench - 343.472 MB/sec tso/gso off
    netperf - 119486.14 rr/s tso/gso on
    netperf - 121035.56 rr/s tso/gso off

    2.6.27-final-up-tinker
    ring-test - 1.141 us/cycle = 876 KHz (gcc-4.3)
    tbench - 339.095 MB/sec tso/gso on
    tbench - 340.507 MB/sec tso/gso off
    netperf - 122371.59 rr/s tso/gso on
    netperf - 124650.09 rr/s tso/gso off



  6. Re: [tbench regression fixes]: digging out smelly deadmen.

    Hi Ingo.

    On Fri, Oct 10, 2008 at 11:15:11AM +0200, Ingo Molnar (mingo@elte.hu) wrote:

    > >
    > > I use the tsc clocksource; acpi_pm and jiffies are also available. With
    > > acpi_pm, performance is even lower (I stopped the test after it dropped
    > > below the 340 MB/s mark), and jiffies does not work at all: it looks like
    > > sockets get stuck in the time_wait state when that clock source is used,
    > > although that may be a different issue.
    > >
    > > So I think hrticks are guilty, but this is still not as good as the .25
    > > tree without the mentioned changes (455 MB/s) or .24 (475 MB/s).

    >
    > i'm glad that you are looking into this! That is an SMP box, right? If
    > yes then could you try this sched-domains tuning utility i have written
    > yesterday (incidentally):
    >
    > http://redhat.com/~mingo/cfs-schedul...-sched-domains


    I've removed SD_BALANCE_NEWIDLE:
    # ./tune-sched-domains $[191-2]
    changed /proc/sys/kernel/sched_domain/cpu0/domain0/flags: 191 => 189
    SD flag: 189
    + 1: SD_LOAD_BALANCE: Do load balancing on this domain
    - 2: SD_BALANCE_NEWIDLE: Balance when about to become idle
    + 4: SD_BALANCE_EXEC: Balance on exec
    + 8: SD_BALANCE_FORK: Balance on fork, clone
    + 16: SD_WAKE_IDLE: Wake to idle CPU on task wakeup
    + 32: SD_WAKE_AFFINE: Wake task to waking CPU
    - 64: SD_WAKE_BALANCE: Perform balancing at task wakeup
    + 128: SD_SHARE_CPUPOWER: Domain members share cpu power
    changed /proc/sys/kernel/sched_domain/cpu0/domain1/flags: 47 => 189
    SD flag: 189
    + 1: SD_LOAD_BALANCE: Do load balancing on this domain
    - 2: SD_BALANCE_NEWIDLE: Balance when about to become idle
    + 4: SD_BALANCE_EXEC: Balance on exec
    + 8: SD_BALANCE_FORK: Balance on fork, clone
    + 16: SD_WAKE_IDLE: Wake to idle CPU on task wakeup
    + 32: SD_WAKE_AFFINE: Wake task to waking CPU
    - 64: SD_WAKE_BALANCE: Perform balancing at task wakeup
    + 128: SD_SHARE_CPUPOWER: Domain members share cpu power

    And got a noticeable improvement (each new line includes the fixes from the previous one):

    vanilla 27: 347.222
    no TSO/GSO: 357.331
    no hrticks: 382.983
    no balance: 389.802

    > and please, when tuning such scheduler bits, could you run latest
    > tip/master:
    >
    > http://people.redhat.com/mingo/tip.git/README
    >
    > and you need to have CONFIG_SCHED_DEBUG=y enabled for the tuning knobs.
    >
    > so that it's all in sync with upcoming scheduler changes/tunings/fixes.


    Ok, I've started to pull it down; I will reply back when things are
    ready.

    --
    Evgeniy Polyakov

  7. Re: [tbench regression fixes]: digging out smelly deadmen.


    * Evgeniy Polyakov wrote:

    > > i'm glad that you are looking into this! That is an SMP box, right?
    > > If yes then could you try this sched-domains tuning utility i have
    > > written yesterday (incidentally):
    > >
    > > http://redhat.com/~mingo/cfs-schedul...-sched-domains

    >
    > I've removed SD_BALANCE_NEWIDLE:
    > # ./tune-sched-domains $[191-2]


    > And got a noticeable improvement (each new line includes the fixes from the previous one):
    >
    > vanilla 27: 347.222
    > no TSO/GSO: 357.331
    > no hrticks: 382.983
    > no balance: 389.802


    okay. The target is 470 MB/sec, right? (Assuming the workload is sane
    and 'fixing' it does not mean we have to schedule worse.)

    We are still way off from 470 MB/sec.

    Ingo

  8. Re: [tbench regression fixes]: digging out smelly deadmen.


    * Evgeniy Polyakov wrote:

    > Hi Ingo.
    >
    > On Fri, Oct 10, 2008 at 11:15:11AM +0200, Ingo Molnar (mingo@elte.hu) wrote:
    >
    > > >
    > > > I use the tsc clocksource; acpi_pm and jiffies are also available. With
    > > > acpi_pm, performance is even lower (I stopped the test after it dropped
    > > > below the 340 MB/s mark), and jiffies does not work at all: it looks like
    > > > sockets get stuck in the time_wait state when that clock source is used,
    > > > although that may be a different issue.
    > > >
    > > > So I think hrticks are guilty, but this is still not as good as the .25
    > > > tree without the mentioned changes (455 MB/s) or .24 (475 MB/s).

    > >
    > > i'm glad that you are looking into this! That is an SMP box, right? If
    > > yes then could you try this sched-domains tuning utility i have written
    > > yesterday (incidentally):
    > >
    > > http://redhat.com/~mingo/cfs-schedul...-sched-domains

    >
    > I've removed SD_BALANCE_NEWIDLE:
    > # ./tune-sched-domains $[191-2]
    > changed /proc/sys/kernel/sched_domain/cpu0/domain0/flags: 191 => 189
    > SD flag: 189
    > + 1: SD_LOAD_BALANCE: Do load balancing on this domain
    > - 2: SD_BALANCE_NEWIDLE: Balance when about to become idle
    > + 4: SD_BALANCE_EXEC: Balance on exec
    > + 8: SD_BALANCE_FORK: Balance on fork, clone
    > + 16: SD_WAKE_IDLE: Wake to idle CPU on task wakeup
    > + 32: SD_WAKE_AFFINE: Wake task to waking CPU
    > - 64: SD_WAKE_BALANCE: Perform balancing at task wakeup
    > + 128: SD_SHARE_CPUPOWER: Domain members share cpu power
    > changed /proc/sys/kernel/sched_domain/cpu0/domain1/flags: 47 => 189
    > SD flag: 189
    > + 1: SD_LOAD_BALANCE: Do load balancing on this domain
    > - 2: SD_BALANCE_NEWIDLE: Balance when about to become idle
    > + 4: SD_BALANCE_EXEC: Balance on exec
    > + 8: SD_BALANCE_FORK: Balance on fork, clone
    > + 16: SD_WAKE_IDLE: Wake to idle CPU on task wakeup
    > + 32: SD_WAKE_AFFINE: Wake task to waking CPU
    > - 64: SD_WAKE_BALANCE: Perform balancing at task wakeup
    > + 128: SD_SHARE_CPUPOWER: Domain members share cpu power
    >
    > And got a noticeable improvement (each new line includes the fixes from the previous one):
    >
    > vanilla 27: 347.222
    > no TSO/GSO: 357.331
    > no hrticks: 382.983
    > no balance: 389.802
    >
    > > and please, when tuning such scheduler bits, could you run latest
    > > tip/master:
    > >
    > > http://people.redhat.com/mingo/tip.git/README
    > >
    > > and you need to have CONFIG_SCHED_DEBUG=y enabled for the tuning knobs.
    > >
    > > so that it's all in sync with upcoming scheduler changes/tunings/fixes.

    >
    > Ok, I've started to pull it down; I will reply back when things are
    > ready.


    make sure you have this fix in tip/master already:

    5b7dba4: sched_clock: prevent scd->clock from moving backwards

    Note: Mike is 100% correct in suggesting that a very good cpu_clock() is
    needed for precise scheduling.

    i've also Cc:-ed Nick.

    Ingo

  9. Re: [tbench regression fixes]: digging out smelly deadmen.

    On Fri, Oct 10, 2008 at 01:42:45PM +0200, Ingo Molnar (mingo@elte.hu) wrote:
    > > vanilla 27: 347.222
    > > no TSO/GSO: 357.331
    > > no hrticks: 382.983
    > > no balance: 389.802

    >
    > okay. The target is 470 MB/sec, right? (Assuming the workload is sane
    > and 'fixing' it does not mean we have to schedule worse.)


    Well, that's where I started/stopped, so maybe we will even move
    further?

    --
    Evgeniy Polyakov

  10. Re: [tbench regression fixes]: digging out smelly deadmen.


    * Evgeniy Polyakov wrote:

    > On Fri, Oct 10, 2008 at 01:42:45PM +0200, Ingo Molnar (mingo@elte.hu) wrote:
    > > > vanilla 27: 347.222
    > > > no TSO/GSO: 357.331
    > > > no hrticks: 382.983
    > > > no balance: 389.802

    > >
    > > okay. The target is 470 MB/sec, right? (Assuming the workload is sane
    > > and 'fixing' it does not mean we have to schedule worse.)

    >
    > Well, that's where I started/stopped, so maybe we will even move
    > further?


    that's the right attitude

    Ingo

  11. Re: [tbench regression fixes]: digging out smelly deadmen.

    On Fri, Oct 10, 2008 at 01:40:42PM +0200, Ingo Molnar (mingo@elte.hu) wrote:
    > make sure you have this fix in tip/master already:
    >
    > 5b7dba4: sched_clock: prevent scd->clock from moving backwards
    >
    > Note: Mike is 100% correct in suggesting that a very good cpu_clock() is
    > needed for precise scheduling.


    The last commit is 5dc64a3442b98eaa, and the aforementioned changeset is
    included. The result is quite bad:

    vanilla 27: 347.222
    no TSO/GSO: 357.331
    no hrticks: 382.983
    no balance: 389.802
    tip: 365.576

    --
    Evgeniy Polyakov

  12. Re: [tbench regression fixes]: digging out smelly deadmen.

    Hi Mike.

    On Fri, Oct 10, 2008 at 12:13:43PM +0200, Mike Galbraith (efault@gmx.de) wrote:
    > a7be37a adds some math overhead: calls to calc_delta_mine() per
    > wakeup/context switch for tasks of all weights, whereas previously these
    > calls were only made for tasks which were not nice 0. It also shifts
    > performance a bit in favor of loads which dislike wakeup preemption;


    I believe everyone dislikes this.

    > this effect lessens as the task count increases. Per testing, overhead is
    > not the primary factor in the throughput loss. I believe clock accuracy
    > to be a more important factor than overhead by a very large margin.


    In my tests it was not just overhead, it was a disaster. Stopping just
    before this commit regained 20 MB/s of the 30 MB/s lost in the 26-27
    window. No matter what accuracy it brings, it is just wrong to assume
    that such a performance drop in some workloads is justified. What is
    this accuracy needed for?

    > Reverting a7be37a (and the two asym fixes) didn't do a whole lot for me
    > either. I'm still ~8% down from 2.6.26 for netperf, and ~3% for tbench,
    > and the 2.6.26 numbers are gcc-4.1, which are a little lower than
    > gcc-4.3. Along the way, I've reverted 100% of scheduler and ilk 26->27
    > and been unable to recover throughput. (Too bad I didn't know about
    > that TSO/GSO thingy, would have been nice.)
    >
    > I can achieve nearly the same improvement for tbench with a little
    > tinkering, and _more_ for netperf than reverting these changes delivers;
    > see the last log entry, where the experiment cut the math overhead by
    > less than 1/3.


    Yeah, that's what I like

    > For the full cfs history, even with those three reverts, I'm ~6% down on
    > tbench, and ~14% for netperf, and haven't found out where it went.
    >
    > > A curious reader may ask: where did we lose the other 100 MB/s? This
    > > issue was not detected (or at least not reported on netdev@ with a
    > > provocative enough subject), and it happens to live somewhere in the
    > > 2.6.24-2.6.25 changes. I was lucky enough to 'guess' (after just a couple
    > > of hundred compilations) that it corresponds to commit
    > > 8f4d37ec073c17e2d4aa8851df5837d798606d6f about high-resolution timers;
    > > the attached patch against 2.6.25 brings tbench performance for the
    > > 2.6.25 kernel tree to 455 MB/s.

    >
    > I have highres timers disabled in my kernels because, per testing, they
    > do cost a lot at high frequency, but primarily because they're not
    > available throughout the test group; same for nohz. A patchlet went into
    > 2.6.27 to neutralize the cost of the hrtick when it's not active. Per
    > re-test, 2.6.27 should be zero impact with the hrtick disabled.


    Well, yes, disabling it should bring performance back, but since highres
    timers are actually enabled everywhere and the trick with debugfs is not
    widely known, this is actually a red flag.

    > > There are still about 20 MB/s missing, but 2.6.24 has 475 MB/s, so the
    > > remaining bug likely lives between 2.6.24 and the above 8f4d37ec073 commit.

    >
    > I lost some at 24, got it back at 25 etc. Some of it is fairness /
    > preemption differences, but there's a bunch I can't find, and the massive
    > amount of time I spent bisecting was a waste.


    Yup, but since I was slacking with a bit of beer after the POHMELFS
    release, I did not regret it too much.

    > My annotated test log. File under fwiw.
    >
    > Note: 2.6.23 CFS was apparently a bad-hair day for high-frequency
    > switchers. Anyone entering the way-back machine to test 2.6.23 should
    > probably use cfs-24.1, which is the 2.6.24 scheduler minus one line that
    > has zero impact for nice 0 loads.
    >
    > -------------------------------------------------------------------------
    > UP config, no nohz or highres timers except as noted.


    UP may actually explain the difference in our results: I have a 4-way (2
    physical and 2 logical (HT enabled) CPUs) 32-bit old Xeon box with highmem
    enabled. I also tried low-latency preemption and no preemption (server)
    without much difference.

    --
    Evgeniy Polyakov

  13. Re: [tbench regression fixes]: digging out smelly deadmen.

    On Sat, 2008-10-11 at 17:13 +0400, Evgeniy Polyakov wrote:
    > Hi Mike.
    >
    > On Fri, Oct 10, 2008 at 12:13:43PM +0200, Mike Galbraith (efault@gmx.de) wrote:
    > > a7be37a adds some math overhead: calls to calc_delta_mine() per
    > > wakeup/context switch for tasks of all weights, whereas previously these
    > > calls were only made for tasks which were not nice 0. It also shifts
    > > performance a bit in favor of loads which dislike wakeup preemption;

    >
    > I believe everyone dislikes this.
    >
    > > this effect lessens as the task count increases. Per testing, overhead is
    > > not the primary factor in the throughput loss. I believe clock accuracy
    > > to be a more important factor than overhead by a very large margin.

    >
    > In my tests it was not just overhead, it was a disaster. Stopping just
    > before this commit regained 20 MB/s of the 30 MB/s lost in the 26-27
    > window. No matter what accuracy it brings, it is just wrong to assume
    > that such a performance drop in some workloads is justified. What is
    > this accuracy needed for?


    a7be37a's purpose is group scheduling, where it provides the means to
    calculate things in a uniform metric.

    If you take the following scenario:

            R
           /|\
          A 1 B
         /|\   \
        2 3 4   5

    Where letters denote supertasks/groups and digits are tasks.

    We used to look at a single level only, so if you want to compute a
    task's ideal runtime, you'd take:

    runtime_i = period * w_i / \Sum_j w_j

    So, in the above example, assuming all entries have an equal weight,
    we'd want to run A for p/3. But then we'd also want to run 2 for p/3.
    IOW, all of A's tasks together would run in p time.

    Which is contrary to the expectation that all tasks in the scenario
    would run within p.

    So what the patch does is change the calculation to:

    runtime_i = period * \Prod_l (w_l,i / \Sum_j w_l,j)

    Which, for task 2, ends up being: p * 1/3 * 1/3 = p/9.
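
    (Spelling the new formula out for every task in the example tree, with
    equal weights everywhere:)

    task 1:        p * 1/3       = p/3
    tasks 2, 3, 4: p * 1/3 * 1/3 = p/9 each
    task 5:        p * 1/3 * 1/1 = p/3

    total: p/3 + 3 * p/9 + p/3 = p

    so all tasks in the scenario together run in exactly one period p, as
    expected.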

    Now the thing that causes the extra math in the !group case is that for
    the single level case, we can avoid doing that division by the sum,
    because that is equal for all tasks (we then compensate for it at some
    other place).

    However, for the nested case, we cannot do that.

    That said, we can probably still avoid the division for the top level
    stuff, because the sum of the top level weights is still invariant
    between all tasks.

    I'll have a stab at doing so... I initially didn't do this because my
    first try gave some real ugly code, but we'll see - these numbers are a
    very convincing reason to try again.


  14. Re: [tbench regression fixes]: digging out smelly deadmen.

    On Sat, 2008-10-11 at 16:39 +0200, Peter Zijlstra wrote:

    > That said, we can probably still avoid the division for the top level
    > stuff, because the sum of the top level weights is still invariant
    > between all tasks.


    Less math would be nice of course...

    > I'll have a stab at doing so... I initially didn't do this because my
    > first try gave some real ugly code, but we'll see - these numbers are a
    > very convincing reason to try again.


    ...but the numbers I get on a Q6600 don't pin the tail on the math donkey.

    Update to UP test log.

    2.6.27-final-up
    ring-test - 1.193 us/cycle = 838 KHz (gcc-4.3)
    tbench - 337.377 MB/sec tso/gso on
    tbench - 340.362 MB/sec tso/gso off
    netperf - 120751.30 rr/s tso/gso on
    netperf - 121293.48 rr/s tso/gso off

    2.6.27-final-up
    patches/revert_weight_and_asym_stuff.diff
    ring-test - 1.133 us/cycle = 882 KHz (gcc-4.3)
    tbench - 340.481 MB/sec tso/gso on
    tbench - 343.472 MB/sec tso/gso off
    netperf - 119486.14 rr/s tso/gso on
    netperf - 121035.56 rr/s tso/gso off

    2.6.28-up
    ring-test - 1.149 us/cycle = 870 KHz (gcc-4.3)
    tbench - 343.681 MB/sec tso/gso off
    netperf - 122812.54 rr/s tso/gso off

    My SMP log, updated to account for the TSO/GSO monkey-wrench.

    (A truckload of time was wasted chasing the unbisectable TSO gizmo.)

    SMP config, same as UP kernels tested, except SMP.

    tbench -t 60 4 localhost followed by four 60 sec netperf
    TCP_RR pairs, each pair on its own core of my Q6600.
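
    (Roughly, the commands behind the log below; the netperf flags shown are
    just the usual TCP_RR invocation, not necessarily the exact one used:)

    tbench -t 60 4 localhost
    netperf -H 127.0.0.1 -t TCP_RR -l 60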

    2.6.22.19

    Throughput 1250.73 MB/sec 4 procs 1.00

    16384 87380 1 1 60.01 111272.55 1.00
    16384 87380 1 1 60.00 104689.58
    16384 87380 1 1 60.00 110733.05
    16384 87380 1 1 60.00 110748.88

    2.6.22.19-cfs-v24.1

    Throughput 1213.21 MB/sec 4 procs .970

    16384 87380 1 1 60.01 108569.27 .992
    16384 87380 1 1 60.01 108541.04
    16384 87380 1 1 60.00 108579.63
    16384 87380 1 1 60.01 108519.09

    2.6.23.17

    Throughput 1200.46 MB/sec 4 procs .959

    16384 87380 1 1 60.01 95987.66 .866
    16384 87380 1 1 60.01 92819.98
    16384 87380 1 1 60.01 95454.00
    16384 87380 1 1 60.01 94834.84

    2.6.23.17-cfs-v24.1

    Throughput 1238.68 MB/sec 4 procs .990

    16384 87380 1 1 60.01 105871.52 .969
    16384 87380 1 1 60.01 105813.11
    16384 87380 1 1 60.01 106106.31
    16384 87380 1 1 60.01 106310.20

    2.6.24.7

    Throughput 1204 MB/sec 4 procs .962

    16384 87380 1 1 60.00 99599.27 .910
    16384 87380 1 1 60.00 99439.95
    16384 87380 1 1 60.00 99556.38
    16384 87380 1 1 60.00 99500.45

    2.6.25.17

    Throughput 1223.16 MB/sec 4 procs .977
    16384 87380 1 1 60.00 101768.95 .930
    16384 87380 1 1 60.00 101888.46
    16384 87380 1 1 60.01 101608.21
    16384 87380 1 1 60.01 101833.05

    2.6.26.5

    Throughput 1183.47 MB/sec 4 procs .945

    16384 87380 1 1 60.00 100837.12 .922
    16384 87380 1 1 60.00 101230.12
    16384 87380 1 1 60.00 100868.45
    16384 87380 1 1 60.00 100491.41

    numbers above here are gcc-4.1, below gcc-4.3

    2.6.26.6

    Throughput 1177.18 MB/sec 4 procs

    16384 87380 1 1 60.00 100896.10
    16384 87380 1 1 60.00 100028.16
    16384 87380 1 1 60.00 101729.44
    16384 87380 1 1 60.01 100341.26

    TSO/GSO off

    2.6.27-final

    Throughput 1177.39 MB/sec 4 procs

    16384 87380 1 1 60.00 98830.65
    16384 87380 1 1 60.00 98722.47
    16384 87380 1 1 60.00 98565.17
    16384 87380 1 1 60.00 98633.03

    2.6.27-final
    patches/revert_weight_and_asym_stuff.diff

    Throughput 1167.67 MB/sec 4 procs

    16384 87380 1 1 60.00 97003.05
    16384 87380 1 1 60.00 96758.42
    16384 87380 1 1 60.00 96432.01
    16384 87380 1 1 60.00 97060.98

    2.6.28.git

    Throughput 1173.14 MB/sec 4 procs

    16384 87380 1 1 60.00 98449.33
    16384 87380 1 1 60.00 98484.92
    16384 87380 1 1 60.00 98657.98
    16384 87380 1 1 60.00 98467.39




  15. Re: [tbench regression fixes]: digging out smelly deadmen.

    On Sun, 2008-10-12 at 08:02 +0200, Mike Galbraith wrote:

    > Data:
    >
    > read/write requests/sec per client count
    > clients:               1      2      4      8     16     32     64    128    256
    > 2.6.26.6.mysql      7978  19856  37238  36652  34399  33054  31608  27983  23411
    > 2.6.27.mysql        9618  18329  37128  36504  33590  31846  30719  27685  21299
    > 2.6.27.rev.mysql   10944  19544  37349  36582  33793  31744  29161  25719  21026
    > 2.6.28.git.mysql    9518  18031  30418  33571  33330  32797  31353  29139  25793
    >
    > 2.6.26.6.pgsql     14165  27516  53883  53679  51960  49694  44377  35361  32879
    > 2.6.27.pgsql       14146  27519  53797  53739  52850  47633  39976  30552  28741
    > 2.6.27.rev.pgsql   14168  27561  53973  54043  53150  47900  39906  31987  28034
    > 2.6.28.git.pgsql   14404  28318  55124  55010  55002  54890  53745  53519  52215


    P.S. all knobs stock, TSO/GSO off.


  16. Re: [tbench regression fixes]: digging out smelly deadmen.

    On Friday, 10 of October 2008, Ingo Molnar wrote:
    >
    > * Evgeniy Polyakov wrote:
    >
    > > On Fri, Oct 10, 2008 at 01:42:45PM +0200, Ingo Molnar (mingo@elte.hu) wrote:
    > > > > vanilla 27: 347.222
    > > > > no TSO/GSO: 357.331
    > > > > no hrticks: 382.983
    > > > > no balance: 389.802
    > > >
    > > > okay. The target is 470 MB/sec, right? (Assuming the workload is sane
    > > > and 'fixing' it does not mean we have to schedule worse.)

    > >
    > > Well, that's where I started/stopped, so maybe we will even move
    > > further?

    >
    > that's the right attitude


    Can anyone please tell me if there was any conclusion of this thread?

    Thanks,
    Rafael

  17. Re: [tbench regression fixes]: digging out smelly deadmen.

    From: "Rafael J. Wysocki"
    Date: Sat, 25 Oct 2008 00:25:34 +0200

    > On Friday, 10 of October 2008, Ingo Molnar wrote:
    > >
    > > * Evgeniy Polyakov wrote:
    > >
    > > > On Fri, Oct 10, 2008 at 01:42:45PM +0200, Ingo Molnar (mingo@elte.hu) wrote:
    > > > > > vanilla 27: 347.222
    > > > > > no TSO/GSO: 357.331
    > > > > > no hrticks: 382.983
    > > > > > no balance: 389.802
    > > > >
    > > > > okay. The target is 470 MB/sec, right? (Assuming the workload is sane
    > > > > and 'fixing' it does not mean we have to schedule worse.)
    > > >
    > > > Well, that's where I started/stopped, so maybe we will even move
    > > > further?

    > >
    > > that's the right attitude

    >
    > Can anyone please tell me if there was any conclusion of this thread?


    I did some more analysis in private with Ingo and Peter Z. and found
    that the tbench decreases correlate pretty much directly with the
    ongoing increase in the CPU cost of wake_up() and friends in the fair
    scheduler.

    The largest increase in the computational cost of wakeups came in 2.6.27
    when the hrtimer bits got added; it more than tripled the cost of a wakeup.
    In 2.6.28-rc1 the hrtimer feature has been disabled, but I think that
    should be backported into the 2.6.27-stable branch.

    So I think that should be backported; meanwhile, I'm spending some
    time in the background trying to replace the fair scheduler's RB tree
    crud with something faster, so maybe at some point we can recover all
    of the regressions in this area caused by the CFS code.

  18. Re: [tbench regression fixes]: digging out smelly deadmen.

    On Sat, 2008-10-25 at 00:25 +0200, Rafael J. Wysocki wrote:
    > On Friday, 10 of October 2008, Ingo Molnar wrote:
    > >
    > > * Evgeniy Polyakov wrote:
    > >
    > > > On Fri, Oct 10, 2008 at 01:42:45PM +0200, Ingo Molnar (mingo@elte.hu) wrote:
    > > > > > vanilla 27: 347.222
    > > > > > no TSO/GSO: 357.331
    > > > > > no hrticks: 382.983
    > > > > > no balance: 389.802
    > > > >
    > > > > okay. The target is 470 MB/sec, right? (Assuming the workload is sane
    > > > > and 'fixing' it does not mean we have to schedule worse.)
    > > >
    > > > Well, that's where I started/stopped, so maybe we will even move
    > > > further?

    > >
    > > that's the right attitude

    >
    > Can anyone please tell me if there was any conclusion of this thread?


    Part of the .27 regression was added scheduler overhead going from .26
    to .27. The scheduler overhead is now gone, but an unidentified source
    of localhost throughput loss remains for both SMP and UP configs.

    -Mike

    My last test data, updated to reflect recent commits:

    Legend:
    clock = v2.6.26..5052696 + 5052696..v2.6.27-rc7 sched clock changes
    weight = a7be37a + c9c294a + ced8aa1 (adds math overhead)
    buddy = 103638d (adds math overhead)
    buddy_overhead = b0aa51b (removes math overhead of buddy)
    revert_to_per_rq_vruntime = f9c0b09 (+2 lines, removes math overhead of weight)

    2.6.26.6-up virgin
    ring-test - 1.169 us/cycle = 855 KHz 1.000
    netperf - 130967.54 131143.75 130914.96 rr/s avg 131008.75 rr/s 1.000
    tbench - 357.593 355.455 356.048 MB/sec avg 356.365 MB/sec 1.000

    2.6.26.6-up + clock + buddy + weight (== .27 scheduler)
    ring-test - 1.234 us/cycle = 810 KHz .947 [cmp1]
    netperf - 128026.62 128118.48 127973.54 rr/s avg 128039.54 rr/s .977
    tbench - 342.011 345.307 343.535 MB/sec avg 343.617 MB/sec .964

    2.6.26.6-up + clock + buddy + weight + revert_to_per_rq_vruntime + buddy_overhead
    ring-test - 1.174 us/cycle = 851 KHz .995 [cmp2]
    netperf - 133928.03 134265.41 134297.06 rr/s avg 134163.50 rr/s 1.024
    tbench - 358.049 359.529 358.342 MB/sec avg 358.640 MB/sec 1.006

    versus .26 counterpart
    2.6.27-up virgin
    ring-test - 1.193 us/cycle = 838 KHz 1.034 [vs cmp1]
    netperf - 121293.48 121700.96 120716.98 rr/s avg 121237.14 rr/s .946
    tbench - 340.362 339.780 341.353 MB/sec avg 340.498 MB/sec .990

    2.6.27-up + revert_to_per_rq_vruntime + buddy_overhead
    ring-test - 1.122 us/cycle = 891 KHz 1.047 [vs cmp2]
    netperf - 119353.27 118600.98 119719.12 rr/s avg 119224.45 rr/s .900
    tbench - 338.701 338.508 338.562 MB/sec avg 338.590 MB/sec .951

    SMP config

    2.6.26.6-smp virgin
    ring-test - 1.575 us/cycle = 634 KHz 1.000
    netperf - 400487.72 400321.98 404165.10 rr/s avg 401658.26 rr/s 1.000
    tbench - 1178.27 1177.18 1184.61 MB/sec avg 1180.02 MB/sec 1.000

    2.6.26.6-smp + clock + buddy + weight + revert_to_per_rq_vruntime + buddy_overhead
    ring-test - 1.575 us/cycle = 634 KHz 1.000
    netperf - 412191.70 411873.15 414638.27 rr/s avg 412901.04 rr/s 1.027
    tbench - 1193.18 1200.93 1199.61 MB/sec avg 1197.90 MB/sec 1.015

    versus 26.6 plus
    2.6.27-smp virgin
    ring-test - 1.674 us/cycle = 597 KHz .941
    netperf - 382536.26 380931.29 380552.82 rr/s avg 381340.12 rr/s .923
    tbench - 1151.47 1143.21 1154.17 MB/sec avg 1149.616 MB/sec .959

    2.6.27-smp + revert_to_per_rq_vruntime + buddy_overhead
    ring-test - 1.570 us/cycle = 636 KHz 1.003
    netperf - 386487.91 389858.00 388180.91 rr/s avg 388175.60 rr/s .940
    tbench - 1179.52 1184.25 1180.18 MB/sec avg 1181.31 MB/sec .986




  19. Re: [tbench regression fixes]: digging out smelly deadmen.

    On Fri, 2008-10-24 at 16:31 -0700, David Miller wrote:
    > From: "Rafael J. Wysocki"
    > Date: Sat, 25 Oct 2008 00:25:34 +0200
    >
    > > On Friday, 10 of October 2008, Ingo Molnar wrote:
    > > >
    > > > * Evgeniy Polyakov wrote:
    > > >
    > > > > On Fri, Oct 10, 2008 at 01:42:45PM +0200, Ingo Molnar (mingo@elte.hu) wrote:
    > > > > > > vanilla 27: 347.222
    > > > > > > no TSO/GSO: 357.331
    > > > > > > no hrticks: 382.983
    > > > > > > no balance: 389.802
    > > > > >
    > > > > > okay. The target is 470 MB/sec, right? (Assuming the workload is sane
    > > > > > and 'fixing' it does not mean we have to schedule worse.)
    > > > >
    > > > > Well, that's where I started/stopped, so maybe we will even move
    > > > > further?
    > > >
    > > > that's the right attitude

    > >
    > > Can anyone please tell me if there was any conclusion of this thread?

    >
    > I did some more analysis in private with Ingo and Peter Z. and found
    > that the tbench decreases correlate pretty much directly with the
    > ongoing increase in the CPU cost of wake_up() and friends in the fair
    > scheduler.
    >
    > The largest increase in the computational cost of wakeups came in 2.6.27
    > when the hrtimer bits got added; it more than tripled the cost of a wakeup.
    > In 2.6.28-rc1 the hrtimer feature has been disabled, but I think that
    > should be backported into the 2.6.27-stable branch.
    >
    > So I think that should be backported; meanwhile, I'm spending some
    > time in the background trying to replace the fair scheduler's RB tree
    > crud with something faster, so maybe at some point we can recover all
    > of the regressions in this area caused by the CFS code.


    My test data indicates (to me anyway) that there is another source of
    localhost throughput loss in .27. In that data, there is no hrtick
    overhead since I didn't have highres timers enabled, and computational
    costs added in .27 were removed. Dunno where it lives, but it does
    appear to exist.

    -Mike


  20. Re: [tbench regression fixes]: digging out smelly deadmen.

    From: Mike Galbraith
    Date: Sat, 25 Oct 2008 05:37:28 +0200

    > Part of the .27 regression was added scheduler overhead going from .26
    > to .27. The scheduler overhead is now gone, but an unidentified source
    > of localhost throughput loss remains for both SMP and UP configs.


    It has to be the TSO thingy Evgeniy hit too, right?

    If not, please bisect this.
