sticking in killall5 in halt? - Mandrake

  1. sticking in killall5 in halt?

    I'm fighting an ACPI issue in Mandriva 2007 on a Tyan S2466
    motherboard (after a poweroff the front panel switch is
    inactive, so it cannot be used to restart the system). This
    is documented here:

    http://bugzilla.kernel.org/show_bug.cgi?id=7961

    Along the way the system has shown a propensity for
    locking up in /etc/rc.d/init.d/halt at one or the other of
    the killall5 lines. During a reboot (invariably) or a poweroff
    (sometimes) it gets down to:

    "Sending all processes the KILL signal"

    (or much less often, the "TERM signal" line) and locks. It seems
    to do this primarily when the command is given from the console
    or from an rlogin session. The system seems to be happier to
    shut down from a remote command like:

    rsh nodename poweroff

    This is with Vanilla kernel 2.6.19.3. However killall5 is from
    Mandriva 2007. Before the upgrade neither this system, nor any of
    its 19 identical twins, refused to complete a "reboot" or "poweroff"
    no matter where it was issued. That was when they were running
    Mandrake 10.1 with a 2.6.8.1 kernel though.

    I made one BIOS change after this problem was noticed, turning off
    console redirection (lets one change BIOS settings on a serial line)
    but it made no difference.

    Any ideas what could possibly be causing "killall5 -9" or "killall5 -15"
    to lock?

    Thanks,

    David Mathog

  2. Re: sticking in killall5 in halt is BUG in chkconfig

    David Mathog wrote:

    > Any ideas what could possibly be causing "killall5 -9" or "killall5 -15"
    > to lock?


    What a long and twisty road this has been, but at the end there is a
    real bug in Mandriva 2007. Please bear with me.

    I made a modified killall5 and had /etc/rc.d/init.d/halt use that. It
    was locking up in readproc() on the KILL stage. Then I put a "ps -ef"
    in between the two stages and found that the sge_execd process was still
    running. (Sun Grid Engine.) Bizarre, it starts and stops fine with

    service sge start
    service sge stop

    When I did:

    service sge stop
    reboot

    The system rebooted without a problem!

    Then I noticed something: in rc0.d and rc6.d the keytable, netfs, sshd
    and ypbind entries are all at K00. That's wrong; none of the
    /etc/rc.d/init.d scripts for those specifies a zero K value.
    So I tried:

    chkconfig --del ypbind
    chkconfig --add ypbind

    and it did NOT change it from K00ypbind. To my mind this is a
    bug in chkconfig since it screws up on at least 4 different init files.
    Relevant versions are:

    chkconfig-1.3.25-2mdv2007.0
    ypbind-1.19.1-4mdv2007.0

    So what's happening here? The shutdown order for some services is
    wrong in Mandriva 2007. I checked another system and it had the same
    bogus values for sshd and keytable; that system doesn't use the others.
    Chkconfig is supposed to set these from the /etc/rc.d/init.d/whatever
    file, but is not doing so correctly. SGE uses a yp map and so
    it doesn't shut down properly if ypbind goes before it does. (I don't
    know why exactly, but there it is, it still shows an OK.) This stuck
    sge_execd process then causes killall5 to hang up in readproc(), again
    I can't say why exactly.

    Dominoes, butterfly wings, and houses of cards come to mind.

    Could somebody else please check your Mandriva 2007 system and see if it
    also has bogus rc0.d and rc6.d K00 values present?
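
    For anyone running that check, a minimal sketch follows. The function
    name list_k00 is made up for illustration, and the default root of
    /etc/rc.d is assumed from the layout described in this post.

```shell
#!/bin/sh
# list_k00 is a hypothetical helper, not part of chkconfig.
# It prints every kill link sitting at priority 00 in the rc0.d and
# rc6.d directories below the given root (default: /etc/rc.d).
list_k00() {
    root=${1:-/etc/rc.d}
    for dir in "$root/rc0.d" "$root/rc6.d"; do
        for link in "$dir"/K00*; do
            # Skip the unexpanded pattern when nothing matches.
            [ -e "$link" ] || [ -L "$link" ] || continue
            echo "${dir##*/}/${link##*/}"
        done
    done
}
```

    Calling list_k00 with no argument looks under /etc/rc.d; an empty
    result means no K00 links were found in either directory.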

    Thanks,

    David Mathog

  3. Re: sticking in killall5 in halt is BUG in chkconfig

    This is worse than I thought. Here's the chkconfig line
    from ypbind:

    # chkconfig: 345 17 83

    and here are the /etc/rc.d values:

    .../init.d/ypbind
    .../rc0.d/K00ypbind
    .../rc1.d/K00ypbind
    .../rc2.d/K00ypbind
    .../rc3.d/S53ypbind
    .../rc4.d/S53ypbind
    .../rc5.d/S53ypbind
    .../rc6.d/K00ypbind

    Both the S and K values are wrong!
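
    For reference, the old-style line reads "# chkconfig: <runlevels>
    <start priority> <stop priority>", so "345 17 83" asks for S17 in
    runlevels 3, 4 and 5 and K83 everywhere else. A rough sketch that spells
    out the links such a line calls for is below. The name expected_links is
    hypothetical, and the sketch deliberately ignores any LSB INIT INFO block.

```shell
#!/bin/sh
# expected_links is a hypothetical helper. It reads only the old-style
# "# chkconfig: <runlevels> <start> <stop>" comment of an init script
# and prints the link name that line asks for in each runlevel.
expected_links() {
    awk -v name="${1##*/}" '
        /^#[ \t]*chkconfig:/ {
            # Strip the prefix so that $1=runlevels, $2=start, $3=stop.
            sub(/^#[ \t]*chkconfig:[ \t]*/, "")
            for (rl = 0; rl <= 6; rl++) {
                if (index($1, rl))
                    printf "rc%d.d/S%s%s\n", rl, $2, name
                else
                    printf "rc%d.d/K%s%s\n", rl, $3, name
            }
            exit
        }' "$1"
}
```

    Running expected_links /etc/rc.d/init.d/ypbind and comparing the output
    against the listing above shows the mismatch at a glance.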

    Regards,

    David Mathog

  4. Re: sticking in killall5 in halt is BUG in chkconfig

    More info (columns are:
    name of /etc/rc.d/init.d file
    chkconfig value for that rc directory
    actual entry in that directory):

    rc0.d entries
    atd 60 K60atd
    athcool 90 K90athcool
    authd 80 K80authd
    crond 60 K60crond
    dm 09 K09dm
    gmond 80 K80gmond
    iptables 92 K92iptables
    keytable 05 K00keytable
    kheader 20 K20kheader
    lm_sensors 74 K74lm_sensors
    netfs 75 K00netfs
    network 90 K90network
    nfslock 86 K48nfslock
    ntpd 10 K10ntpd
    numlock 15 K15numlock
    partmon 20 K20partmon
    portmap 89 K49portmap
    sensor_monitor 90 K90sensor_monitor
    sge 36 K36sge
    smartd 40 K40smartd
    sshd 25 K00sshd
    syslog 88 K88syslog
    xfs 10 K10xfs
    xinetd 50 K49xinetd
    ypbind 83 K00ypbind

    rc3.d entries:

    atd 40 S40atd
    athcool 10 S10athcool
    authd 14 S14authd
    crond 90 S90crond
    dm 09 K09dm
    gmond 20 S20gmond
    iptables 03 S03iptables
    keytable 14 S53keytable
    kheader 95 S95kheader
    lm_sensors 26 S26lm_sensors
    netfs 25 S52netfs
    network 10 S10network
    nfslock 14 S52nfslock
    ntpd 56 S56ntpd
    numlock 29 S29numlock
    partmon 13 S13partmon
    portmap 11 S51portmap
    sensor_monitor 10 S10sensor_monitor
    sge 95 S95sge
    smartd 40 S40smartd
    sshd 55 S55sshd
    syslog 12 S12syslog
    xfs 20 S51xfs
    xinetd 56 S56xinetd
    ypbind 17 S53ypbind


    rc6.d entries
    atd 60 K60atd
    athcool 90 K90athcool
    authd 80 K80authd
    crond 60 K60crond
    dm 09 K09dm
    gmond 80 K80gmond
    iptables 92 K92iptables
    keytable 05 K00keytable
    kheader 20 K20kheader
    lm_sensors 74 K74lm_sensors
    netfs 75 K00netfs
    network 90 K90network
    nfslock 86 K48nfslock
    ntpd 10 K10ntpd
    numlock 15 K15numlock
    partmon 20 K20partmon
    portmap 89 K49portmap
    sensor_monitor 90 K90sensor_monitor
    sge 36 K36sge
    smartd 40 K40smartd
    sshd 25 K00sshd
    syslog 88 K88syslog
    xfs 10 K10xfs
    xinetd 50 K49xinetd
    ypbind 83 K00ypbind
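
    A table like the ones above can be generated rather than typed. The
    sketch below is one way to do it, under the assumption that each init
    script carries an old-style "# chkconfig:" line; audit_rc_dir is a
    made-up name, and the fourth column (ok or DIFF) is an addition for
    spotting mismatches.

```shell
#!/bin/sh
# audit_rc_dir is a hypothetical helper.
# Arguments: an rcN.d directory (no trailing slash) and the init.d
# directory holding the scripts. For every S/K link it prints:
#   name  priority-from-chkconfig-line  actual-link  ok|DIFF
audit_rc_dir() {
    rcdir=$1 initd=$2
    level=${rcdir##*/rc}     # ".../rc3.d" -> "3.d"
    level=${level%.d}        # "3.d"       -> "3"
    for link in "$rcdir"/[SK][0-9][0-9]*; do
        [ -e "$link" ] || [ -L "$link" ] || continue
        actual=${link##*/}
        name=${actual#???}   # drop the S/K letter and the two digits
        [ -r "$initd/$name" ] || continue
        # Start priority if this runlevel is listed, else stop priority.
        want=$(awk -v rl="$level" '
            /^#[ \t]*chkconfig:/ {
                sub(/^#[ \t]*chkconfig:[ \t]*/, "")
                print (index($1, rl) ? $2 : $3)
                exit
            }' "$initd/$name")
        [ -n "$want" ] || continue
        case $actual in
            [SK]"$want$name") flag=ok ;;
            *)                flag=DIFF ;;
        esac
        echo "$name $want $actual $flag"
    done
}
```

    For example, audit_rc_dir /etc/rc.d/rc6.d /etc/rc.d/init.d would cover
    the reboot runlevel. Only the two-digit priority is compared, not
    whether the link is an S or a K.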

    Regards,

    David Mathog

  5. Re: sticking in killall5 in halt is BUG in chkconfig

    Copied the chkconfig binary from a Mandriva 2006 system. It worked
    properly on the original system. However it doesn't on Mandriva 2007,
    for ypbind it makes K74 and S26 instead of the expected 83 and 17. No
    obvious library mismatch to account for the difference as indicated by
    ldd.

    Stranger and stranger,

    David Mathog


  6. Re: sticking in killall5 in halt is BUG in chkconfig

    On Fri, 09 Feb 2007 17:46:46 -0500, David Mathog wrote:

    > Copied the chkconfig binary from a Mandriva 2006 system. It worked
    > properly on the original system. However it doesn't on Mandriva 2007,
    > for ypbind it makes K74 and S26 instead of the expected 83 and 17. No
    > obvious library mismatch to account for the difference as indicated by
    > ldd.


    From the description of --add in man chkconfig ...
    Note that default entries in LSB-delimited 'INIT INFO'
    sections take precedence over the default runlevels in the
    initscript

    See http://refspecs.freestandards.org/LS...crcomconv.html or
    http://wiki.debian.org/LSBInitScripts/
    for the format of the "INIT INFO".
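
    To see what a given script declares in that section, something along
    these lines may help. It is only a sketch: lsb_field is a hypothetical
    name, and it assumes keyword lines of the form "# Keyword: value"
    between the BEGIN and END markers.

```shell
#!/bin/sh
# lsb_field is a hypothetical helper: print the value of one keyword
# (for example Required-Stop) from the INIT INFO block of a script.
# It prints nothing if the block or the keyword is missing.
lsb_field() {
    awk -v key="$2" '
        /^### BEGIN INIT INFO/ { inside = 1; next }
        /^### END INIT INFO/   { exit }
        inside && index($0, "# " key ":") == 1 {
            sub(/^# [^:]*:[ \t]*/, "")
            print
        }' "$1"
}
```

    Usage would be, say, lsb_field /etc/rc.d/init.d/ypbind Required-Stop.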

    Regards, Dave Hodgins

    --
    Change nomail.afraid.org to ody.ca to reply by email.
    (nomail.afraid.org has been set up specifically for
    use in usenet. Feel free to use it yourself.)

  7. Re: sticking in killall5 in halt is BUG in chkconfig

    David W. Hodgins wrote:
    > From the description of --add in man chkconfig ...
    > Note that default entries in LSB-delimited 'INIT INFO'
    > sections take precedence over the default runlevels in the
    > initscript
    >


    Sorry to be a Luddite, but it was a lot easier for me to control
    the start and kill order for services when they would stay
    where I put them!

    Anyway, I finally figured that LSB was needed. But chkconfig is still
    brain dead, even with my sge script like this:

    # chkconfig: 345 95 36
    # description: Sun Grid Engine queuing system
    # processname: sge_qmaster
    # pidfile: /var/run/sge.pid
    ### BEGIN INIT INFO
    # Provides: sge
    # Required-Start: $local_fs $remote_fs $network ypbind $portmap
    # Required-Stop: $local_fs $remote_fs $network ypbind $portmap
    # Default-Start: 3 4 5
    # Short-Description: Sun Grid Engine
    # Description: Sun Grid Engine
    ### END INIT INFO
    # config: /usr/SGE/*

    chkconfig insists on mashing many of the rc0.d and rc6.d actions into
    the K00 positions. It does a reasonable job with the S## in the other
    rc*.d directories. For example:

    ls -al /etc/rc.d/rc0.d | head
    total 2
    drwxr-xr-x 2 root root 1024 Feb 9 16:22 ./
    drwxr-xr-x 10 root root 1024 Feb 2 03:43 ../
    lrwxrwxrwx 1 root root 18 Feb 9 16:22 K00keytable -> ../init.d/keytable*
    lrwxrwxrwx 1 root root 15 Feb 9 16:22 K00netfs -> ../init.d/netfs*
    lrwxrwxrwx 1 root root 13 Feb 9 16:22 K00sge -> ../init.d/sge*
    lrwxrwxrwx 1 root root 14 Feb 9 16:22 K00sshd -> ../init.d/sshd*
    lrwxrwxrwx 1 root root 16 Feb 9 16:22 K00ypbind -> ../init.d/ypbind*
    lrwxrwxrwx 1 root root 12 Feb 9 16:22 K09dm -> ../init.d/dm*
    lrwxrwxrwx 1 root root 14 Feb 2 03:53 K10ntpd -> ../init.d/ntpd*

    As far as I can tell the only thing that is going to let sge shut off
    before ypbind is the coincidence that it's earlier in the collating
    order. I took out keytable, netfs, sge, sshd, and ypbind
    and added them back in various orders. It seemed sane until ypbind went in:

    [root@monkey01 init.d]# chkconfig --add netfs
    [root@monkey01 init.d]# ls -al /etc/rc.d/rc6.d | head
    total 2
    drwxr-xr-x 2 root root 1024 Feb 9 16:28 ./
    drwxr-xr-x 10 root root 1024 Feb 2 03:43 ../
    lrwxrwxrwx 1 root root 13 Feb 9 16:28 K00sge -> ../init.d/sge*
    lrwxrwxrwx 1 root root 12 Feb 9 16:28 K09dm -> ../init.d/dm*
    lrwxrwxrwx 1 root root 14 Feb 2 03:53 K10ntpd -> ../init.d/ntpd*
    lrwxrwxrwx 1 root root 13 Feb 9 16:28 K10xfs -> ../init.d/xfs*
    lrwxrwxrwx 1 root root 17 Feb 2 03:57 K15numlock -> ../init.d/numlock*
    lrwxrwxrwx 1 root root 17 Feb 2 03:48 K20kheader -> ../init.d/kheader*
    lrwxrwxrwx 1 root root 17 Feb 2 03:43 K20partmon -> ../init.d/partmon*

    (netfs is way down the list at K48)

    [root@monkey01 init.d]# chkconfig --add ypbind
    [root@monkey01 init.d]# ls -al /etc/rc.d/rc6.d | head
    total 2
    drwxr-xr-x 2 root root 1024 Feb 9 16:28 ./
    drwxr-xr-x 10 root root 1024 Feb 2 03:43 ../
    lrwxrwxrwx 1 root root 15 Feb 9 16:28 K00netfs -> ../init.d/netfs*
    lrwxrwxrwx 1 root root 13 Feb 9 16:28 K00sge -> ../init.d/sge*
    lrwxrwxrwx 1 root root 16 Feb 9 16:28 K00ypbind -> ../init.d/ypbind*
    lrwxrwxrwx 1 root root 12 Feb 9 16:28 K09dm -> ../init.d/dm*
    lrwxrwxrwx 1 root root 14 Feb 2 03:53 K10ntpd -> ../init.d/ntpd*
    lrwxrwxrwx 1 root root 13 Feb 9 16:28 K10xfs -> ../init.d/xfs*
    lrwxrwxrwx 1 root root 17 Feb 2 03:57 K15numlock -> ../init.d/numlock*


    I don't get this. ypbind REQUIRES netfs which provides $remote_fs, yet
    when ypbind is --added it moves netfs to the front of the list, so that
    it will turn off BEFORE ypbind. ypbind also requires $portmap, but it
    didn't cause that to move.

    [root@monkey01 init.d]# chkconfig --add netfs
    [root@monkey01 init.d]# ls -al /etc/rc.d/rc6.d | grep netfs
    lrwxrwxrwx 1 root root 15 Feb 9 16:34 K48netfs -> ../init.d/netfs*
    [root@monkey01 init.d]# ls -al /etc/rc.d/rc6.d | grep portmap
    lrwxrwxrwx 1 root root 17 Feb 9 16:34 K49portmap -> ../init.d/portmap*
    [root@monkey01 init.d]# chkconfig --add ypbind
    [root@monkey01 init.d]# ls -al /etc/rc.d/rc6.d | head
    total 2
    drwxr-xr-x 2 root root 1024 Feb 9 16:35 ./
    drwxr-xr-x 10 root root 1024 Feb 2 03:43 ../
    lrwxrwxrwx 1 root root 15 Feb 9 16:35 K00netfs -> ../init.d/netfs*
    lrwxrwxrwx 1 root root 13 Feb 9 16:35 K00sge -> ../init.d/sge*
    lrwxrwxrwx 1 root root 16 Feb 9 16:35 K00ypbind -> ../init.d/ypbind*
    lrwxrwxrwx 1 root root 12 Feb 9 16:35 K09dm -> ../init.d/dm*
    lrwxrwxrwx 1 root root 14 Feb 2 03:53 K10ntpd -> ../init.d/ntpd*
    lrwxrwxrwx 1 root root 13 Feb 9 16:35 K10xfs -> ../init.d/xfs*
    lrwxrwxrwx 1 root root 17 Feb 2 03:57 K15numlock -> ../init.d/numlock*

    I'll stick with the BUG description. The arbitrary K and S numbers
    are apparently a "feature" of LSB, but stuffing all of the K values
    into K00 isn't right. I put a Required-Start and Required-Stop for
    ypbind in the local ssh init script and, as expected, both were still
    in K00, and the specified condition wasn't needed because K00sshd
    collates earlier than K00ypbind. (The order in rc3.d was as desired, at
    S35ypbind and S55sshd.)
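
    When several links share K00, which one stops first comes down to name
    order. The sketch below checks that for a pair of services. It assumes
    the rc script walks the K* links in sorted order and that the C locale
    is a fair stand-in for that sort; stops_before is a made-up name.

```shell
#!/bin/sh
# stops_before is a hypothetical helper: succeed (exit status 0) if the
# kill link for service $2 sorts ahead of the kill link for service $3
# in directory $1. Fails if the link for $2 is missing.
stops_before() {
    first=$(ls "$1" | grep -E "^K[0-9][0-9]($2|$3)\$" | LC_ALL=C sort | head -n 1)
    case $first in
        K[0-9][0-9]"$2") return 0 ;;
        *)               return 1 ;;
    esac
}
```

    For instance, "stops_before /etc/rc.d/rc6.d sge ypbind && echo fine"
    prints "fine" only when sge would be stopped first.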

    Also, although chkconfig --del won't let you take out an entry that has
    been put in with --add if there is a dependency, chkconfig --add won't
    automatically add a dependency when one is required by the LSB. That
    is, if I trim everything down past ypbind, then --add sge, it doesn't
    automatically place ypbind or netfs, or even offer to do so or provide a
    command line switch to offer to do so.

    Regards,

    David Mathog

  8. Re: sticking in killall5 in halt is BUG in chkconfig

    On Fri, 09 Feb 2007 19:47:55 -0500, David Mathog wrote:

    > David W. Hodgins wrote:
    >> From the description of --add in man chkconfig ...
    >> Note that default entries in LSB-delimited 'INIT INFO'
    >> sections take precedence over the default runlevels in the
    >> initscript

    >
    > Sorry to be a Luddite, but it was a lot easier for me to control
    > the start and kill order for services when they would stay
    > where I put them!


    The easiest way to force the script to go where you want it
    would be to remove the INIT INFO section. Then the old-style
    chkconfig lines would control which symlinks get created.
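
    A cautious way to try that, sketched with a hypothetical helper name
    (strip_init_info), is to write a stripped copy to stdout and leave the
    original script alone until the result has been looked over:

```shell
#!/bin/sh
# strip_init_info is a hypothetical helper: print an init script with
# everything from "### BEGIN INIT INFO" through "### END INIT INFO"
# removed. The input file itself is not modified.
strip_init_info() {
    sed '/^### BEGIN INIT INFO/,/^### END INIT INFO/d' "$1"
}
```

    For example, strip_init_info /etc/rc.d/init.d/sge > /tmp/sge.new, then
    review the copy, install it, and redo chkconfig --del and --add.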

    I haven't looked into the LSB format, and dependency tree of
    the LSB controlled init scripts, to figure out what needs to
    be corrected.

    Regards, Dave Hodgins

    --
    Change nomail.afraid.org to ody.ca to reply by email.
    (nomail.afraid.org has been set up specifically for
    use in usenet. Feel free to use it yourself.)
