Thread: Regression in gdm-2.18 since 2.6.24

  1. Regression in gdm-2.18 since 2.6.24

    Third attempt, with luck this time I've managed to find what really
    broke it. Sorry, this is going to be a long mail to explain my
    current attribution of 'blame'.

    Summary: kernels newer than 2.6.24 break gdm's shutdown (and
    restart) for me.

    Action to replicate:
    choose 'shutdown' or 'restart' from gdm, and confirm

    Expected behaviour: X disappears and I'm back at a tty window
    watching my bootscripts change to runlevel 0 or 6.

    Actual behaviour: many times (with 2.6.24.X 'mostly', with 2.6.25-rc
    'often') the gdm window disappears but the background remains and
    the box stays in runlevel 5.

    This only happens when this box is running a 'pure64' x86_64
    system, when it runs with a rather different 32-bit config it is
    fine. The system is now somewhat old (gcc-4.1.2, binutils-2.17,
    glibc-2.5), and the parts of gnome that I use are 2.20 except for
    gdm which is 2.18 (because I want to see the shutdown messages, in
    case things fail.)

    I first saw this on 2.6.24.2, but by that time I was mostly using
    x86 or other arches (I was behind on list mail, and missed the
    security fix in 2.6.24.1 among the other changes there). The problem
    seemed consistent on the few occasions I used this system with
    2.6.24.2. I still had a large amount of debugging info from gdm,
    and (from an earlier posting where I mistook the cause of this
    problem) I had the following:

    Mar 24 13:49:29 bluesbreaker gdm[2554]: Handling user message:
    'GET_CONFIG greeter/SetPosition :0'
    Mar 24 13:49:29 bluesbreaker gdmlogin[2995]: Got response: 'OK
    false'
    Mar 24 13:49:29 bluesbreaker gdmlogin[2995]: Sending command:
    'CLOSE'
    Mar 24 13:49:29 bluesbreaker gdm[2554]: Handling user message:
    'CLOSE'
    Mar 24 13:49:29 bluesbreaker gdm[2562]: gdm_slave_wait_for_login: In
    loop
    Mar 24 13:49:35 bluesbreaker gdm[2562]: gdm_slave_wait_for_login:
    end verify for ''
    Mar 24 13:49:35 bluesbreaker gdm[2562]: gdm_slave_wait_for_login: No
    login/Bad login
    Mar 24 13:49:35 bluesbreaker gdm[2562]: gdm_slave_wait_for_login: In
    loop
    Mar 24 13:49:35 bluesbreaker gdm[2562]: gdm_slave_wait_for_login:
    end verify for ''
    Mar 24 13:49:35 bluesbreaker gdm[2562]: gdm_slave_wait_for_login: No
    login/Bad login
    ... about 165 repeats of these 3 lines ...
    The messages seemed to stop of their own accord; nothing more
    happened until I shut down from a tty.

    On my first attempt to find the cause, I was under the impression
    that it happened every time (in 2.6.24.2 and 2.6.24.4). Speculatively
    reverting some of the patches, combined with an error where I forgot
    to set an extraversion and so overwrote the modules, plus a later
    successful shutdown from 2.6.24.4, led me to erroneously point the
    finger at either the drm patches or i2c-viapro. In fact, the problem
    doesn't appear every time, and I needed to make 10 attempts (a mix of
    5 shutdowns and 5 restarts) before saying that a kernel seemed to be
    ok.

    In my second attempt, I tried to bisect (v2.6.24 good, v2.6.25-rc1
    bad) and ended up in 2.6.24-rc4. I haven't had any replies to my
    post yesterday about that, so I conclude that 'git bisect' is
    another "flexible and powerful tool" which will bite non-experts like
    me.
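
    (For the record, the sequence I was attempting was the standard one;
    a sketch, assuming a clone of Linus' tree and that each candidate
    kernel gets the full set of 10 shutdown/restart tests before being
    marked:

    $ git bisect start
    $ git bisect bad v2.6.25-rc1
    $ git bisect good v2.6.24
    ... build and test the kernel git checks out, then ...
    $ git bisect good          # or 'git bisect bad', as appropriate
    ... repeat until git names the first bad commit, then ...
    $ git bisect reset

    Presumably it bit me because I marked an intermittently-failing
    kernel 'good' after too few tests.)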

    For my third attempt (yesterday evening, and today) I established
    that 2.6.24 shuts down perfectly on this system, but anything
    newer is "variable". Hence, the mix of 5 restarts and 5 shutdowns
    before believing a particular kernel is ok.

    I used 2.6.24.x for this third attempt. After confirming that
    2.6.24 was rock solid for this, I tried some of the patches applied
    in 2.6.24.{1,2}. This was a little tricky, because security fixes
    meant the normal stable "we'll apply these patches unless somebody
    objects" considerations didn't apply and I didn't get to see which
    individual changes were being applied to stable.

    For the first pass, I cherry-picked the stable fixes for
    fs/eventpoll.c, fs/splice.c, kernel/sched_fair.c, and then
    include/linux/wait.h to make eventpoll compile. That kernel
    restarted once, then failed (I'm no longer certain if the second
    attempt was a restart or a shutdown). At that point, I had
    confirmed that even in 2.6.24-stable the failure didn't happen all
    the time, so I reverted to extended testing.
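
    (For anyone reproducing this: one way to pull the per-file pieces
    out of a stable patch is with patchutils' filterdiff, assuming it is
    installed; the output filename is just an example:

    $ filterdiff -i '*/kernel/sched_fair.c' patch-2.6.24.1 > sched_fair.patch
    $ cd linux-2.6.24
    $ patch -p1 --dry-run < ../sched_fair.patch
    $ patch -p1 < ../sched_fair.patch
    )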

    First up was the pair of changes to fs/splice.c. They were fine.
    Then I added eventpoll.c and wait.h and ran a few tests - seemed fine.
    After that I added the change to sched_fair and things became
    interesting - all the restarts were ok, all the shutdowns failed.

    At that point I tried 2.6.24.4 and reverted what should be the first
    attachment for sched_fair. That passed all my tests for restart and
    shutdown.

    Next I went forward to 2.6.25-rc8. Here, I found that 'patch'
    would not revert the first hunk of that attachment because of a
    context change. So, I tried reverting only the second hunk (I didn't
    know why it had been changed, so maybe they were to fix different
    problems) - interestingly, that passed all 5 attempts to restart,
    and failed all 5 attempts to shutdown. I then tried the second
    attachment (which reverts both hunks from rc8) and all of my tests
    passed. Probably, there is some option for patch to ignore context,
    and I have no idea what problem(s) the original change was supposed
    to fix. For me, reverting the original would be wonderful but if
    that will cause problems for others then I'm willing to test any
    suggested changes.
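
    (For the record, 'patch' can be told to tolerate some changed
    context; a sketch, assuming the revert lives in a file called
    revert-sched_fair.patch. The default fuzz factor is 2, and 3 is the
    most that helps with 3 lines of context, so this may still not be
    enough:

    $ cd linux-2.6.25-rc8
    $ patch -p1 --fuzz=3 --dry-run < ../revert-sched_fair.patch
    $ patch -p1 --fuzz=3 < ../revert-sched_fair.patch
    )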

    My .config for 2.6.25 is the third attachment. Clearly, I'd like
    this to be fixed in both 25 and stable. Thanks for reading this far.

    Ken
    --
    das eine Mal als Tragödie, das andere Mal als Farce

  2. Re: Regression in gdm-2.18 since 2.6.24

    On Thursday, 3 of April 2008, Ken Moffat wrote:
    > Third attempt, with luck this time I've managed to find what really
    > broke it. Sorry, this is going to be a long mail to explain my
    > current attribution of 'blame'.
    > [...]


    The attachments are missing.

    Can you please just provide us with the name of the git commit that you think
    breaks things for you?

    Also, is there CONFIG_FAIR_GROUP_SCHED set in your .config?

    Thanks,
    Rafael

  3. Re: Regression in gdm-2.18 since 2.6.24

    On Thu, Apr 03, 2008 at 09:56:53PM +0200, Rafael J. Wysocki wrote:
    > On Thursday, 3 of April 2008, Ken Moffat wrote:
    > > Third attempt, with luck this time I've managed to find what really
    > > broke it. Sorry, this is going to be a long mail to explain my
    > > current attribution of 'blame'.

    [...]
    >
    > The attachments are missing.
    >
    > Can you please just provide us with the name of the git commit that you think
    > breaks things for you?
    >
    > Also, is there CONFIG_FAIR_GROUP_SCHED set in your .config?
    >
    > Thanks,
    > Rafael

    Oh! Maybe they are there this time. Apologies.

    For the commit, all I've got is the truncated line out of
    patch-2.6.24.1:
    index da7c061..2288ad8 100644

    Unfortunately, git log doesn't seem to find da7c061.

    And from the config
    CONFIG_FAIR_GROUP_SCHED=y

    Ken
    --
    das eine Mal als Tragödie, das andere Mal als Farce


  4. Re: Regression in gdm-2.18 since 2.6.24

    On Thursday, 3 of April 2008, Ken Moffat wrote:
    > On Thu, Apr 03, 2008 at 09:56:53PM +0200, Rafael J. Wysocki wrote:
    > > [...]

    > Oh! Maybe they are there this time. Apologies.
    >
    > For the commit, all I've got is the truncated line out of
    > patch-2.6.24.1:
    > index da7c061..2288ad8 100644
    >
    > Unfortunately, git log doesn't seem to find da7c061.
    >
    > And from the config
    > CONFIG_FAIR_GROUP_SCHED=y


    Please unset CONFIG_GROUP_SCHED and retest. It is known to cause problems
    (see http://bugzilla.kernel.org/show_bug.cgi?id=9969, for example).
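
    (One way to do that from the command line, as a sketch; the option
    is called CONFIG_FAIR_GROUP_SCHED in 2.6.24 and CONFIG_GROUP_SCHED
    in 2.6.25, so handle both:

    $ sed -i -e 's/^CONFIG_GROUP_SCHED=y/# CONFIG_GROUP_SCHED is not set/' \
             -e 's/^CONFIG_FAIR_GROUP_SCHED=y/# CONFIG_FAIR_GROUP_SCHED is not set/' \
             .config
    $ yes '' | make oldconfig     # accept defaults for any dependent options
    $ grep GROUP_SCHED .config    # confirm both are off before rebuilding
    )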

    Thanks,
    Rafael

  5. Re: Regression in gdm-2.18 since 2.6.24

    On Thu, Apr 03, 2008 at 11:29:21PM +0200, Rafael J. Wysocki wrote:
    >
    > Please unset CONFIG_GROUP_SCHED and retest. It is known to cause problems
    > (see http://bugzilla.kernel.org/show_bug.cgi?id=9969, for example).
    >
    > Thanks,
    > Rafael

    Thanks for the suggestion. Tried it on 2.6.24.4, and
    double-checked /proc/config.gz after booting -

    # CONFIG_FAIR_GROUP_SCHED is not set

    Unfortunately, the first test reboot failed in the same way.

    Ken
    --
    das eine Mal als Tragödie, das andere Mal als Farce

  6. Re: Regression in gdm-2.18 since 2.6.24

    On Friday, 4 of April 2008, Ken Moffat wrote:
    > On Thu, Apr 03, 2008 at 11:29:21PM +0200, Rafael J. Wysocki wrote:
    > >
    > > Please unset CONFIG_GROUP_SCHED and retest. It is known to cause problems
    > > (see http://bugzilla.kernel.org/show_bug.cgi?id=9969, for example).
    > >
    > > Thanks,
    > > Rafael

    > Thanks for the suggestion. Tried it on 2.6.24.4, and
    > double-checked /proc/config.gz after booting -
    >
    > # CONFIG_FAIR_GROUP_SCHED is not set
    >
    > Unfortunately, the first test reboot failed in the same way.


    Please unset CONFIG_GROUP_SCHED too.

    Thanks,
    Rafael

  7. Re: Regression in gdm-2.18 since 2.6.24

    On Fri, Apr 04, 2008 at 02:20:00AM +0200, Rafael J. Wysocki wrote:
    > [...]

    >
    > Please unset CONFIG_GROUP_SCHED too.
    >
    > Thanks,
    > Rafael


    Yet again, I'm confused. I thought you'd missed a word in what you
    asked, because the config for 2.6.24.4 doesn't show this option.
    ken@bluesbreaker ~ $zgrep GROUP /proc/config.gz
    # CONFIG_CGROUPS is not set
    # CONFIG_FAIR_GROUP_SCHED is not set
    ken@bluesbreaker ~ $

    I _can_ see it in the 2.6.25 configs (except for rc1). Will retry
    rc8 with them both turned off.

    Ken
    --
    das eine Mal als Tragödie, das andere Mal als Farce

  8. Re: Regression in gdm-2.18 since 2.6.24

    On Thu, Apr 03, 2008 at 08:19:16PM +0100, Ken Moffat wrote:
    > Next I went forward to 2.6.25-rc8. Here, I found that 'patch'
    > would not revert the first hunk of that attachment because of a
    > context change. So, I tried reverting only the second hunk (I didn't
    > know why it had been changed, so maybe they were to fix different
    > problems) - interestingly, that passed all 5 attempts to restart,
    > and failed all 5 attempts to shutdown. I then tried the second
    > attachment (which reverts both hunks from rc8) and all of my tests
    > passed.


    Just to confirm, are you saying you applied the patch below on top of
    2.6.25-rc8 and it solved your shutdown issues?


    ---
    kernel/sched_fair.c | 4 ++--
    1 files changed, 2 insertions(+), 2 deletions(-)

    Index: current/kernel/sched_fair.c
    ===================================================================
    --- current.orig/kernel/sched_fair.c
    +++ current/kernel/sched_fair.c
    @@ -510,7 +510,7 @@

         if (!initial) {
             /* sleeps upto a single latency don't count. */
    -        if (sched_feat(NEW_FAIR_SLEEPERS)) {
    +        if (sched_feat(NEW_FAIR_SLEEPERS) && entity_is_task(se)) {
                 vruntime -= calc_delta_fair(sysctl_sched_latency,
                                             &cfs_rq->load);
             }
    @@ -1145,7 +1145,7 @@
          * More easily preempt - nice tasks, while not making
          * it harder for + nice tasks.
          */
    -    if (unlikely(se->load.weight > NICE_0_LOAD))
    +    if (unlikely(se->load.weight != NICE_0_LOAD))
             gran = calc_delta_fair(gran, &se->load);

        if (pse->vruntime + gran < se->vruntime)


    The reverse of the above patch was required to solve latency issues
    reported by several folks (e.g.
    http://ozlabs.org/pipermail/linuxppc...ry/050355.html ).

    I don't see any obvious reason why this patch affects shutdown. Is
    there any way you can get more debug data? Basically, when the
    machine enters the problem state, I want to see dmesg,
    /proc/sched_debug output and SysRq-T output. You could get this by
    logging into the system over the network or (if the network is not
    working in that state) by running this script:

    [First ensure that syslogd is capturing kernel messages in
    /var/log/messages:

    Edit /etc/syslog.conf to ensure it has this line uncommented:

    kern.* /var/log/messages

    Restart syslog after any changes
    ]

    >>>>>>>>>>>>>>>>>>>>>>>>>>>>>


    #!/bin/bash

    # Rotate /var/log/messages so that we capture only fresh kernel output.
    /etc/init.d/syslog stop
    mv /var/log/messages /var/log/messages.old.$$
    touch /var/log/messages
    /etc/init.d/syslog start

    sleep 10

    echo "Process List" > /tmp/sched-log
    ps -elf >> /tmp/sched-log
    echo >> /tmp/sched-log
    echo "Sched debug" >> /tmp/sched-log
    cat /proc/sched_debug >> /tmp/sched-log

    # Dump all task stack traces to the kernel log (SysRq-T).
    echo "Process stack trace" >> /tmp/sched-log
    echo 1 > /proc/sys/kernel/sysrq
    echo t > /proc/sysrq-trigger

    # Give syslog a moment to write the SysRq output, then collect it.
    sleep 5
    echo "dmesg output" >> /tmp/sched-log
    cat /var/log/messages >> /tmp/sched-log


    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

    The sched-log file will be large. You could send it to me privately
    or host it on a website and send a pointer to it.

    - vatsa

  9. Re: Regression in gdm-2.18 since 2.6.24

    On Fri, Apr 04, 2008 at 12:48:39AM +0100, Ken Moffat wrote:
    > On Thu, Apr 03, 2008 at 11:29:21PM +0200, Rafael J. Wysocki wrote:
    > >
    > > Please unset CONFIG_GROUP_SCHED and retest. It is known to cause problems
    > > (see http://bugzilla.kernel.org/show_bug.cgi?id=9969, for example).
    > >
    > > Thanks,
    > > Rafael

    > Thanks for the suggestion. Tried it on 2.6.24.4, and
    > double-checked /proc/config.gz after booting -
    >
    > # CONFIG_FAIR_GROUP_SCHED is not set
    >
    > Unfortunately, the first test reboot failed in the same way.
    >


    Just to confirm: the bug happens even when CONFIG_FAIR_GROUP_SCHED is
    unset?

    --
    regards,
    Dhaval

  10. Re: Regression in gdm-2.18 since 2.6.24

    On Fri, Apr 04, 2008 at 01:37:17PM +0100, Ken Moffat wrote:
    > On Fri, Apr 04, 2008 at 02:20:00AM +0200, Rafael J. Wysocki wrote:
    > > [...]
    > > Please unset CONFIG_GROUP_SCHED too.
    > >
    > > Thanks,
    > > Rafael

    >
    > Yet again, I'm confused. I thought you'd missed a word in what you
    > asked, because the config for 2.6.24.4 doesn't show this option.
    > ken@bluesbreaker ~ $zgrep GROUP /proc/config.gz
    > # CONFIG_CGROUPS is not set
    > # CONFIG_FAIR_GROUP_SCHED is not set
    > ken@bluesbreaker ~ $
    >
    > I _can_ see it in the 2.6.25 configs (except for rc1). Will retry
    > rc8 with them both turned off.
    >


    Right, in 2.6.24 we called it CONFIG_FAIR_GROUP_SCHED since group
    scheduling was available only for CFS. In 2.6.25 we have group
    scheduling for RT as well, so it is called CONFIG_GROUP_SCHED now.

    --
    regards,
    Dhaval

  11. Re: Regression in gdm-2.18 since 2.6.24

    On Fri, Apr 04, 2008 at 08:17:12PM +0530, Dhaval Giani wrote:
    > [...]

    >
    > Just to confirm: the bug happens even when CONFIG_FAIR_GROUP_SCHED is
    > unset?
    >
    > --
    > regards,
    > Dhaval

    Yes.

    Ken
    --
    das eine Mal als Tragödie, das andere Mal als Farce

  12. Re: Regression in gdm-2.18 since 2.6.24

    On Fri, Apr 04, 2008 at 08:18:25PM +0530, Dhaval Giani wrote:
    > [...]

    >
    > Right, in 2.6.24 we called it CONFIG_FAIR_GROUP_SCHED since group
    > scheduling was available only for CFS. In 2.6.25 we have group
    > scheduling for RT as well, so it is called CONFIG_GROUP_SCHED now.
    >
    > --
    > regards,
    > Dhaval

    OK, thanks for the explanation. Just gave -rc8 the first test with
    CONFIG_GROUP_SCHED unset, and again it fails.

    Ken
    --
    das eine Mal als Tragödie, das andere Mal als Farce

  13. Re: Regression in gdm-2.18 since 2.6.24

    On Fri, Apr 04, 2008 at 04:32:32PM +0100, Ken Moffat wrote:
    > > Just to confirm, are you saying you applied the patch below on top of
    > > 2.6.25-rc8 and it solved your shutdown issues?
    > >

    > Yes.


    Thanks for confirming that the patch I sent was what you had tried,
    and that it fixed your problem. That patch, however, is not something
    we want to apply for 2.6.25-rc8 (since it will worsen interactivity
    for other cases).

    Given that you seem to be seeing the problem even without
    CONFIG_GROUP_SCHED, only the second hunk of the patch seems to be
    making a difference for your problem, i.e. just the hunk below,
    applied to kernel/sched_fair.c in 2.6.25-rc8, should fix your
    problem too:

    @@ -1145,7 +1145,7 @@ static void check_preempt_wakeup(struct
         * More easily preempt - nice tasks, while not making
         * it harder for + nice tasks.
         */
    -    if (unlikely(se->load.weight > NICE_0_LOAD))
    +    if (unlikely(se->load.weight != NICE_0_LOAD))
             gran = calc_delta_fair(gran, &se->load);

        if (pse->vruntime + gran < se->vruntime)

    [The first hunk is a no-op under !CONFIG_GROUP_SCHED, since
    entity_is_task() is always 1 for !CONFIG_GROUP_SCHED]

    This second hunk changes how fast + or - niced tasks get preempted.

    2.6.25-rc8 (Bad case):
    Sets preempt granularity for + niced tasks at 5ms (1 CPU)

    2.6.25-rc8 + the hunk above (Good case):
    Sets preempt granularity for + niced tasks at >5ms


    So bumping up the preempt granularity for + niced tasks seems to make
    things work for you. IMO the deeper problem lies somewhere else
    (perhaps some race in gdm itself), which is easily exposed by
    2.6.25-rc8, since it lets + niced tasks be preempted quickly.
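
    (As a rough worked example, assuming the prio_to_weight table of this
    era, where nice 0 has weight 1024 and nice +5 has weight 335, the
    reverted check scales the granularity for a nice +5 current task to:

    gran' = gran * NICE_0_LOAD / se->load.weight
          = 5ms * 1024 / 335
          ~= 15ms

    i.e. a + niced task becomes roughly three times harder to preempt
    than in the bad case.)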

    To help validate this, can you let us know the result of tuning preempt
    granularity on native 2.6.25-rc8 (without any patches applied and
    CONFIG_GROUP_SCHED disabled)?

    # echo 100000000 > /proc/sys/kernel/sched_wakeup_granularity_ns

    To check that the echo command worked, do:

    # cat /proc/sys/kernel/sched_wakeup_granularity_ns

    It should return 100000000.

    Now try shutting down through gdm and please let me know if it makes
    a difference.
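
    (Equivalently, assuming procps' sysctl is installed:

    # sysctl -w kernel.sched_wakeup_granularity_ns=100000000

    and adding the same key to /etc/sysctl.conf would make the setting
    survive the reboots you are testing.)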

    --
    Regards,
    vatsa

  14. Re: Regression in gdm-2.18 since 2.6.24

    On Sat, Apr 05, 2008 at 08:10:43PM +0530, Srivatsa Vaddagiri wrote:
    >
    > Given that you seem to be seeing the problem even without
    > CONFIG_GROUP_SCHED, only the second hunk of the patch seems to be
    > making a difference for your problem, i.e. just the hunk below,
    > applied to kernel/sched_fair.c in 2.6.25-rc8, should fix your
    > problem too:
    >
    > [...]
    >
    > This second hunk changes how fast + or - niced tasks get preempted.
    >
    > 2.6.25-rc8 (Bad case):
    > Sets preempt granularity for + niced tasks at 5ms (1 CPU)
    >
    > 2.6.25-rc8 + the hunk above (Good case):
    > Sets preempt granularity for + niced tasks at >5ms
    >

    Well, I'm no longer sure exactly what was in the config, but after
    I had confirmed the reversion would fix 2.6.24.4 I _did_ try just
    the second part of the patch applied to 2.6.25-rc8 and it gave a 60%
    success rate across 10 tests.
    >
    > So bumping up preempt granularity for + niced tasks seems to make things
    > work for you. IMO the deeper problem lies somewhere else (perhaps is
    > some race issue in gdm itself), which is easily exposed with 2.6.25-rc8
    > which lets + niced tasks be preempted quickly.
    >


    I agree this is probably exposing a problem somewhere else.

    > To help validate this, can you let us know the result of tuning preempt
    > granularity on native 2.6.25-rc8 (without any patches applied and
    > CONFIG_GROUP_SCHED disabled)?
    >
    > # echo 100000000 > /proc/sys/kernel/sched_wakeup_granularity_ns
    >
    > To check if echo command worked, do:
    >
    > # cat /proc/sys/kernel/sched_wakeup_granularity_ns
    >
    > It should return 100000000.
    >
    > Now try shutting down thr' gdm and pls let me know if it makes a
    > difference.
    >
    > --
    > Regards,
    > vatsa


    Will do, but it might be a day or so before I can get to this.

    Thanks.

    Ken
    --
    das eine Mal als Tragödie, das andere Mal als Farce

  15. Re: Regression in gdm-2.18 since 2.6.24

    On Sat, Apr 05, 2008 at 10:03:47PM +0100, Ken Moffat wrote:
    > > To help validate this, can you let us know the result of tuning preempt
    > > granularity on native 2.6.25-rc8 (without any patches applied and
    > > CONFIG_GROUP_SCHED disabled)?
    > [...]


    Well, I found your analysis convincing. Unfortunately, my hardware
    disagreed. Testing -rc8 with CONFIG_GROUP_SCHED disabled (a test is
    a mixture of 5 attempts to restart and 5 to shutdown):

    1. the base version success is 4/10

    2. increasing the granularity by a factor of 10 as you requested,
    success is 8/10

    3. applying the second part of the patch (and not altering the
    granularity) success is 3/10

    4. applying both parts of the patch (and not altering the
    granularity), success is 5/10.

    Clearly, 3/10 and 5/10 may not be meaningfully different on such a
    small sample size (but, 10 attempts is probably as much as my mind
    and blood-pressure can stand!). Whether 8/10 is meaningfully better
    I don't know, the point is that it still failed some of the time.

    At this point, I started to doubt my previous results, so I
    retested rc8 with CONFIG_GROUP_SCHED=y and both parts of the patch,
    and again success is 10/10. So, that combination has run through at
    least 20 shutdowns or restarts without a problem.

    Summary: if I apply the patch to revert both hunks, AND use
    CONFIG_GROUP_SCHED, everything is good. All other variations fail
    sooner or later within 10 tests (for the little it's worth, the
    longest string of successful runs between failures is 6, so a
    minimum of 10 tests is probably necessary before saying a version
    seems ok).

    If I was confused earlier, I guess I must be dazed and confused
    now!

    Ken
    --
    das eine Mal als Tragödie, das andere Mal als Farce

  16. Re: Regression in gdm-2.18 since 2.6.24

    On Mon, Apr 07, 2008 at 12:48:33AM +0100, Ken Moffat wrote:
    > Well, I found your analysis convincing. Unfortunately, my hardware
    > disagreed. Testing -rc8 with CONFIG_GROUP_SCHED disabled (a test is
    > a mixture of 5 attempts to restart and 5 to shutdown):
    >
    > 1. the base version success is 4/10
    >
    > 2. increasing the granularity by a factor of 10 as you requested,
    > success is 8/10


    This makes me think that we are just exposing a timing-related
    problem in gdm here.

    How about a larger factor?

    # echo 200000000 > /proc/sys/kernel/sched_wakeup_granularity_ns

    Does that make it 10/10?!

    Anyway, it would be interesting to analyze the failure scenario more
    (with help from the gdm developers). Can you get some more debug data
    in this regard?

    Before you shut down:

    # strace -p <pid of first gdm process> 2>/tmp/gdmlog1 &
    # strace -p <pid of second gdm process> 2>/tmp/gdmlog2 &

    Now shut down and wait a few minutes to confirm it's not working.
    Send me the strace log files. Hopefully this will give a hint about
    what they are deadlocked on (in the last log you sent, I can see both
    gdm binaries in sleep state; whether that was a momentary state or
    whether they are actually deadlocked will be confirmed by the strace
    logs above).
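
    (A sketch of one way to attach to both at once, assuming pgrep is
    available and that both processes show up as 'gdm'; adjust the name
    if they appear as gdm-binary:

    for pid in $(pgrep -x gdm); do
        strace -f -o /tmp/gdmlog.$pid -p "$pid" &
    done
    )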

    > If I was confused earlier, I guess I must be dazed and confused
    > now!


    me too!

    Ingo/Peter, Any other suggestions you have?


    --
    Regards,
    vatsa

  17. Re: Regression in gdm-2.18 since 2.6.24

    On Tue, 2008-04-08 at 14:20 +0530, Srivatsa Vaddagiri wrote:
    > [...]
    > Ingo/Peter, Any other suggestions you have?


    Sounds like a race condition to me; none of these changes affect
    correctness, strictly speaking.




  18. Re: Regression in gdm-2.18 since 2.6.24

    On Tue 8.Apr'08 at 14:20:27 +0530, Srivatsa Vaddagiri wrote:
    > On Mon, Apr 07, 2008 at 12:48:33AM +0100, Ken Moffat wrote:
    > > Well, I found your analysis convincing. Unfortunately, my hardware
    > > disagreed. Testing -rc8 with CONFIG_GROUP_SCHED disabled (a test is
    > > a mixture of 5 attempts to restart and 5 to shutdown):
    > >
    > > 1. the base version success is 4/10
    > >
    > > 2. increasing the granularity by a factor of 10 as you requested,
    > > success is 8/10

    >
    > This makes me think that we are just exposing a timing-related
    > problem in gdm here.


    I see the same issue on my desktop with kdm, but I've never had the
    energy to bisect or look further into it, so I just type halt or
    reboot in an xterm :-(

    The fact is that a few days ago I tried rebooting through the kdm
    button, but it left me with a console login screen (and no reboot).

    It has also happened a few times before (and I am always running an
    up-to-date 2.6.25-rcX kernel).


  19. Re: Regression in gdm-2.18 since 2.6.24

    On Tue, Apr 08, 2008 at 02:20:27PM +0530, Srivatsa Vaddagiri wrote:
    > On Mon, Apr 07, 2008 at 12:48:33AM +0100, Ken Moffat wrote:
    > > Well, I found your analysis convincing. Unfortunately, my hardware
    > > disagreed. Testing -rc8 with CONFIG_GROUP_SCHED disabled (a test is
    > > a mixture of 5 attempts to restart and 5 to shutdown):
    > >
    > > 1. the base version success is 4/10
    > >
    > > 2. increasing the granularity by a factor of 10 as you requested,
    > > success is 8/10

    >
    > This makes me think that we are just exposing a timing-related
    > problem in gdm here.
    >
    > How about a larger factor?
    >
    > # echo 200000000 > /proc/sys/kernel/sched_wakeup_granularity_ns
    >
    > Does that make it 10/10?!
    >

    [ snipping the suggestion to run strace, I've already sent the
    results off-list ]

    Yes, it does. The system seems to run adequately too (a little audio
    playback, a little YouTube, some untarring and compiling).
    Understanding the real problem would be nice, but for me this seems
    to be an adequate workaround. This is with CONFIG_GROUP_SCHED turned
    off.

    Thanks

    Ken
    --
    das eine Mal als Tragödie, das andere Mal als Farce
