sticking in killall5 in halt? - Mandrake
-
sticking in killall5 in halt?
I'm fighting an ACPI issue in Mandriva 2007 on a Tyan S2466
motherboard (after a poweroff the front panel switch is
inactive, so it cannot be used to restart the system). This
is documented here:
http://bugzilla.kernel.org/show_bug.cgi?id=7961
Along the way the system has shown a propensity for
locking up in /etc/rc.d/init.d/halt at one or the other of
the killall5 lines. During a reboot (invariably) or a poweroff
(sometimes) it gets down to:
"Sending all processes the KILL signal"
(or much less often, the "TERM signal" line) and locks. It seems
to do this primarily when the command is given from the console
or from an rlogin session. The system seems to be happier to
shut down when given a remote command like:
rsh nodename poweroff
This is with Vanilla kernel 2.6.19.3. However killall5 is from
Mandriva 2007. Before the upgrade neither this system, nor any of
its 19 identical twins, refused to complete a "reboot" or "poweroff"
no matter where it was issued. That was when they were running
Mandrake 10.1 with a 2.6.8.1 kernel though.
I made one BIOS change after this problem was noticed, turning off
console redirection (lets one change BIOS settings on a serial line)
but it made no difference.
Any ideas what could possibly be causing "killall5 -9" or "killall5 -15"
to lock?
Thanks,
David Mathog
-
Re: sticking in killall5 in halt is BUG in chkconfig
David Mathog wrote:
> Any ideas what could possibly be causing "killall5 -9" or "killall5 -15"
> to lock?
What a long and twisty road this has been, but at the end there is a
real bug in Mandriva 2007. Please bear with me.
I made a modified killall5 and had /etc/rc.d/init.d/halt use that. It
was locking up in readproc() on the KILL stage. Then I put a "ps -ef"
in between the two stages and found that the sge_execd process was still
running. (Sun Grid Engine.) Bizarre, it starts and stops fine with
service sge start
service sge stop
When I did:
service sge stop
reboot
The system rebooted without a problem!
Then I noticed something: in rc0.d and rc6.d the keytable, netfs, sshd
and ypbind entries are all at K00. That's wrong; none of the
/etc/rc.d/init.d scripts for those specify a K value of zero.
So I tried:
chkconfig --del ypbind
chkconfig --add ypbind
and it did NOT change it from K00ypbind. To my mind this is a
bug in chkconfig since it screws up on at least 4 different init files.
Relevant versions are:
chkconfig-1.3.25-2mdv2007.0
ypbind-1.19.1-4mdv2007.0
So what's happening here? The shutdown order for some services is
wrong in Mandriva 2007. I checked another system and it had the same
bogus values for sshd and keytable; that system doesn't use the others.
Chkconfig is supposed to set these from the /etc/rc.d/init.d/whatever
file, but is not doing so correctly. SGE uses a yp map and so
it doesn't shut down properly if ypbind goes before it does. (I don't
know why exactly, but there it is, it still shows an OK.) This stuck
sge_execd process then causes killall5 to hang up in readproc(), again
I can't say why exactly.
Dominoes, butterfly wings, and houses of cards come to mind.
Could somebody else please check your Mandriva 2007 system and see if it
also has bogus rc0.d and rc6.d K00 values present?
Thanks,
David Mathog
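
For anyone who wants to run the check requested above, here is a minimal sketch. It assumes the Mandriva layout with rc0.d and rc6.d under /etc/rc.d; the function name list_k00 is made up for illustration and is not part of chkconfig.

```shell
# list_k00 [RCROOT]
# Print every K00* entry found in rc0.d and rc6.d under RCROOT.
# RCROOT defaults to /etc/rc.d (an assumption about the layout).
list_k00() {
    local root="${1:-/etc/rc.d}"
    local dir entry
    for dir in rc0.d rc6.d; do
        for entry in "$root/$dir"/K00*; do
            # Skip the unexpanded pattern when nothing matches;
            # the -L test keeps dangling symlinks in the report.
            [ -e "$entry" ] || [ -L "$entry" ] || continue
            printf '%s/%s\n' "$dir" "${entry##*/}"
        done
    done
}
```

Running `list_k00` with no argument inspects /etc/rc.d. No output means no K00 entries were found; a K00 entry is only suspicious if the service's init script asks for a different stop priority.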
-
Re: sticking in killall5 in halt is BUG in chkconfig
This is worse than I thought. Here's the chkconfig line
from ypbind:
# chkconfig: 345 17 83
and here are the /etc/rc.d values:
.../init.d/ypbind
.../rc0.d/K00ypbind
.../rc1.d/K00ypbind
.../rc2.d/K00ypbind
.../rc3.d/S53ypbind
.../rc4.d/S53ypbind
.../rc5.d/S53ypbind
.../rc6.d/K00ypbind
Both the S and K values are wrong!
Regards,
David Mathog
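
To make the mismatch concrete, the sketch below prints the link names that the old-style header line alone would imply. It assumes that a runlevel listed in the header gets an S link at the start priority and every other runlevel gets a K link at the stop priority. It ignores any LSB INIT INFO section, and the function name expected_links is hypothetical.

```shell
# expected_links SCRIPT
# Print the rcN.d link names implied by the "# chkconfig: LEVELS START STOP"
# line of SCRIPT. Assumption: listed runlevels get S<START>, all other
# runlevels get K<STOP>. LSB INIT INFO sections are not consulted.
expected_links() {
    local script="$1"
    local name="${script##*/}"
    local header levels start stop lvl
    # Take the first old-style header line; fail if there is none.
    header=$(grep -m 1 '^#[[:space:]]*chkconfig:' "$script") || return 1
    # Drop everything up to the colon, then split the rest into words.
    set -- ${header#*chkconfig:}
    levels=$1 start=$2 stop=$3
    for lvl in 0 1 2 3 4 5 6; do
        case $levels in
            *"$lvl"*) echo "rc$lvl.d/S$start$name" ;;
            *)        echo "rc$lvl.d/K$stop$name" ;;
        esac
    done
}
```

For the ypbind header quoted above, this prints K83ypbind for rc0.d, rc1.d, rc2.d and rc6.d, and S17ypbind for rc3.d through rc5.d, which is what the listing was expected to show.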
-
Re: sticking in killall5 in halt is BUG in chkconfig
More info (columns are:
name of /etc/rc.d/init.d file
chkconfig value for that rc directory
actual entry in that directory):
rc0.d entries:
atd             60  K60atd
athcool         90  K90athcool
authd           80  K80authd
crond           60  K60crond
dm              09  K09dm
gmond           80  K80gmond
iptables        92  K92iptables
keytable        05  K00keytable
kheader         20  K20kheader
lm_sensors      74  K74lm_sensors
netfs           75  K00netfs
network         90  K90network
nfslock         86  K48nfslock
ntpd            10  K10ntpd
numlock         15  K15numlock
partmon         20  K20partmon
portmap         89  K49portmap
sensor_monitor  90  K90sensor_monitor
sge             36  K36sge
smartd          40  K40smartd
sshd            25  K00sshd
syslog          88  K88syslog
xfs             10  K10xfs
xinetd          50  K49xinetd
ypbind          83  K00ypbind
rc3.d entries:
atd             40  S40atd
athcool         10  S10athcool
authd           14  S14authd
crond           90  S90crond
dm              09  K09dm
gmond           20  S20gmond
iptables        03  S03iptables
keytable        14  S53keytable
kheader         95  S95kheader
lm_sensors      26  S26lm_sensors
netfs           25  S52netfs
network         10  S10network
nfslock         14  S52nfslock
ntpd            56  S56ntpd
numlock         29  S29numlock
partmon         13  S13partmon
portmap         11  S51portmap
sensor_monitor  10  S10sensor_monitor
sge             95  S95sge
smartd          40  S40smartd
sshd            55  S55sshd
syslog          12  S12syslog
xfs             20  S51xfs
xinetd          56  S56xinetd
ypbind          17  S53ypbind
rc6.d entries:
atd             60  K60atd
athcool         90  K90athcool
authd           80  K80authd
crond           60  K60crond
dm              09  K09dm
gmond           80  K80gmond
iptables        92  K92iptables
keytable        05  K00keytable
kheader         20  K20kheader
lm_sensors      74  K74lm_sensors
netfs           75  K00netfs
network         90  K90network
nfslock         86  K48nfslock
ntpd            10  K10ntpd
numlock         15  K15numlock
partmon         20  K20partmon
portmap         89  K49portmap
sensor_monitor  90  K90sensor_monitor
sge             36  K36sge
smartd          40  K40smartd
sshd            25  K00sshd
syslog          88  K88syslog
xfs             10  K10xfs
xinetd          50  K49xinetd
ypbind          83  K00ypbind
Regards,
David Mathog
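
A listing like the one above can be generated rather than assembled by hand. The sketch below is one way to do it, assuming the links are named S or K plus two digits plus the init script name. It compares only against the old-style "# chkconfig:" line, so a script with an LSB INIT INFO section may be flagged even though chkconfig placed it deliberately. The function name audit_rc is hypothetical.

```shell
# audit_rc INITDIR RCDIR
# For every S##name / K##name entry in RCDIR, print:
#   name  priority-from-chkconfig-line  actual-entry  [MISMATCH]
# Entries whose init script is missing or has no chkconfig line are skipped.
audit_rc() {
    local initdir="$1" rcdir="$2"
    local link base kind name header start stop want
    for link in "$rcdir"/[SK][0-9][0-9]*; do
        [ -e "$link" ] || [ -L "$link" ] || continue
        base="${link##*/}"
        case $base in
            S*) kind=S ;;
            *)  kind=K ;;
        esac
        name="${base#???}"          # strip S/K and the two digits
        header=$(grep -m 1 '^#[[:space:]]*chkconfig:' "$initdir/$name" 2>/dev/null) || continue
        set -- ${header#*chkconfig:}   # words: LEVELS START STOP
        start=$2 stop=$3
        if [ "$kind" = S ]; then want=$start; else want=$stop; fi
        if [ "$base" = "$kind$want$name" ]; then
            echo "$name $want $base"
        else
            echo "$name $want $base MISMATCH"
        fi
    done
}
```

A call such as `audit_rc /etc/rc.d/init.d /etc/rc.d/rc6.d` would reproduce the three columns used above, with the disagreeing rows marked.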
-
Re: sticking in killall5 in halt is BUG in chkconfig
Copied the chkconfig binary from a Mandriva 2006 system. It worked
properly on the original system. However it doesn't on Mandriva 2007,
for ypbind it makes K74 and S26 instead of the expected 83 and 17. No
obvious library mismatch to account for the difference as indicated by
ldd.
Stranger and stranger,
David Mathog
-
Re: sticking in killall5 in halt is BUG in chkconfig
On Fri, 09 Feb 2007 17:46:46 -0500, David Mathog wrote:
> Copied the chkconfig binary from a Mandriva 2006 system. It worked
> properly on the original system. However it doesn't on Mandriva 2007,
> for ypbind it makes K74 and S26 instead of the expected 83 and 17. No
> obvious library mismatch to account for the difference as indicated by
> ldd.
From the description of --add in man chkconfig ...
Note that default entries in LSB-delimited 'INIT INFO'
sections take precedence over the default runlevels in the
initscript
See http://refspecs.freestandards.org/LS...crcomconv.html or
http://wiki.debian.org/LSBInitScripts/
for the format of the "INIT INFO".
Regards, Dave Hodgins
--
Change nomail.afraid.org to ody.ca to reply by email.
(nomail.afraid.org has been set up specifically for
use in usenet. Feel free to use it yourself.)
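
Since the man page says an INIT INFO section takes precedence, it helps to see what a given init script declares there. The following is a small sketch that prints the contents of that section with the comment markers removed. It assumes the section is delimited by the usual "### BEGIN INIT INFO" and "### END INIT INFO" lines; the function name lsb_init_info is made up for illustration.

```shell
# lsb_init_info SCRIPT
# Print the lines between "### BEGIN INIT INFO" and "### END INIT INFO",
# without the delimiter lines and without the leading "# ".
# Prints nothing if SCRIPT has no such section.
lsb_init_info() {
    sed -n '/^### BEGIN INIT INFO/,/^### END INIT INFO/{
        /^### /d
        s/^#[[:space:]]*//p
    }' "$1"
}
```

An empty result for a script suggests that only its old-style chkconfig line is in play.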
-
Re: sticking in killall5 in halt is BUG in chkconfig
David W. Hodgins wrote:
> From the description of --add in man chkconfig ...
> Note that default entries in LSB-delimited 'INIT INFO'
> sections take precedence over the default runlevels in the
> initscript
>
Sorry to be a Luddite, but it was a lot easier for me to control
the start and kill order for services when they would stay
where I put them!
Anyway, I finally figured that LSB was needed. But chkconfig is still
brain dead, even with my sge script like this:
# chkconfig: 345 95 36
# description: Sun Grid Engine queuing system
# processname: sge_qmaster
# pidfile: /var/run/sge.pid
### BEGIN INIT INFO
# Provides: sge
# Required-Start: $local_fs $remote_fs $network ypbind $portmap
# Required-Stop: $local_fs $remote_fs $network ypbind $portmap
# Default-Start: 3 4 5
# Short-Description: Sun Grid Engine
# Description: Sun Grid Engine
### END INIT INFO
# config: /usr/SGE/*
chkconfig insists on mashing many of the rc0.d and rc6.d actions into
the K00 positions. It does a reasonable job with the S## in the other
rc*.d directories. For example:
ls -al /etc/rc.d/rc0.d | head
total 2
drwxr-xr-x 2 root root 1024 Feb 9 16:22 ./
drwxr-xr-x 10 root root 1024 Feb 2 03:43 ../
lrwxrwxrwx 1 root root 18 Feb 9 16:22 K00keytable -> ../init.d/keytable*
lrwxrwxrwx 1 root root 15 Feb 9 16:22 K00netfs -> ../init.d/netfs*
lrwxrwxrwx 1 root root 13 Feb 9 16:22 K00sge -> ../init.d/sge*
lrwxrwxrwx 1 root root 14 Feb 9 16:22 K00sshd -> ../init.d/sshd*
lrwxrwxrwx 1 root root 16 Feb 9 16:22 K00ypbind -> ../init.d/ypbind*
lrwxrwxrwx 1 root root 12 Feb 9 16:22 K09dm -> ../init.d/dm*
lrwxrwxrwx 1 root root 14 Feb 2 03:53 K10ntpd -> ../init.d/ntpd*
As far as I can tell the only thing that is going to let sge shut off
before ypbind is the coincidence that it's earlier in the collating
order. I took out keytable, netfs, sge, sshd, and ypbind
and added them back in various orders. It seemed sane until ypbind went in:
[root@monkey01 init.d]# chkconfig --add netfs
[root@monkey01 init.d]# ls -al /etc/rc.d/rc6.d | head
total 2
drwxr-xr-x 2 root root 1024 Feb 9 16:28 ./
drwxr-xr-x 10 root root 1024 Feb 2 03:43 ../
lrwxrwxrwx 1 root root 13 Feb 9 16:28 K00sge -> ../init.d/sge*
lrwxrwxrwx 1 root root 12 Feb 9 16:28 K09dm -> ../init.d/dm*
lrwxrwxrwx 1 root root 14 Feb 2 03:53 K10ntpd -> ../init.d/ntpd*
lrwxrwxrwx 1 root root 13 Feb 9 16:28 K10xfs -> ../init.d/xfs*
lrwxrwxrwx 1 root root 17 Feb 2 03:57 K15numlock -> ../init.d/numlock*
lrwxrwxrwx 1 root root 17 Feb 2 03:48 K20kheader -> ../init.d/kheader*
lrwxrwxrwx 1 root root 17 Feb 2 03:43 K20partmon -> ../init.d/partmon*
(netfs is way down the list at K48)
[root@monkey01 init.d]# chkconfig --add ypbind
[root@monkey01 init.d]# ls -al /etc/rc.d/rc6.d | head
total 2
drwxr-xr-x 2 root root 1024 Feb 9 16:28 ./
drwxr-xr-x 10 root root 1024 Feb 2 03:43 ../
lrwxrwxrwx 1 root root 15 Feb 9 16:28 K00netfs -> ../init.d/netfs*
lrwxrwxrwx 1 root root 13 Feb 9 16:28 K00sge -> ../init.d/sge*
lrwxrwxrwx 1 root root 16 Feb 9 16:28 K00ypbind -> ../init.d/ypbind*
lrwxrwxrwx 1 root root 12 Feb 9 16:28 K09dm -> ../init.d/dm*
lrwxrwxrwx 1 root root 14 Feb 2 03:53 K10ntpd -> ../init.d/ntpd*
lrwxrwxrwx 1 root root 13 Feb 9 16:28 K10xfs -> ../init.d/xfs*
lrwxrwxrwx 1 root root 17 Feb 2 03:57 K15numlock -> ../init.d/numlock*
I don't get this. ypbind REQUIRES netfs which provides $remote_fs, yet
when ypbind is --added it moves netfs to the front of the list, so that
it will turn off BEFORE ypbind. ypbind also requires $portmap, but it
didn't cause that to move.
[root@monkey01 init.d]# chkconfig --add netfs
[root@monkey01 init.d]# ls -al /etc/rc.d/rc6.d | grep netfs
lrwxrwxrwx 1 root root 15 Feb 9 16:34 K48netfs -> ../init.d/netfs*
[root@monkey01 init.d]# ls -al /etc/rc.d/rc6.d | grep portmap
lrwxrwxrwx 1 root root 17 Feb 9 16:34 K49portmap -> ../init.d/portmap*
[root@monkey01 init.d]# chkconfig --add ypbind
[root@monkey01 init.d]# ls -al /etc/rc.d/rc6.d | head
total 2
drwxr-xr-x 2 root root 1024 Feb 9 16:35 ./
drwxr-xr-x 10 root root 1024 Feb 2 03:43 ../
lrwxrwxrwx 1 root root 15 Feb 9 16:35 K00netfs -> ../init.d/netfs*
lrwxrwxrwx 1 root root 13 Feb 9 16:35 K00sge -> ../init.d/sge*
lrwxrwxrwx 1 root root 16 Feb 9 16:35 K00ypbind -> ../init.d/ypbind*
lrwxrwxrwx 1 root root 12 Feb 9 16:35 K09dm -> ../init.d/dm*
lrwxrwxrwx 1 root root 14 Feb 2 03:53 K10ntpd -> ../init.d/ntpd*
lrwxrwxrwx 1 root root 13 Feb 9 16:35 K10xfs -> ../init.d/xfs*
lrwxrwxrwx 1 root root 17 Feb 2 03:57 K15numlock -> ../init.d/numlock*
I'll stick with the BUG description. Although the arbitrary K and S
numbers are apparently a "feature" of LSB, stuffing all of the K values
into K00 isn't right. I put a Required-Start and Required-Stop for ypbind in
the local ssh init script and, as expected, both were still in K00,
and the condition specified wasn't needed because K00sshd collates
earlier than K00ypbind. (The order in rc3.d was as desired, at
S35ypbind and S55sshd.)
Also, although chkconfig --del won't let you take out an entry that has
been put in with --add if there is a dependency, chkconfig --add won't
automatically add a dependency when one is required by the LSB. That
is, if I trim everything down past ypbind, then --add sge, it doesn't
automatically place ypbind or netfs, or even offer to do so or provide a
command line switch to offer to do so.
Regards,
David Mathog
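
The collating-order point can be checked mechanically. The sketch below walks the K links of one rc directory in shell glob order and reports which of two services is reached first. It assumes the shutdown script visits the K links in that same glob order, which should be verified against the local /etc/rc.d/rc. The function name stops_before is hypothetical.

```shell
# stops_before RCDIR FIRST SECOND
# Walk RCDIR/K##* in glob (collating) order.
# Returns 0 if FIRST's K link is reached before SECOND's,
#         1 if SECOND's K link is reached before FIRST's,
#         2 if neither service has a K link in RCDIR.
stops_before() {
    local rcdir="$1" first="$2" second="$3"
    local entry base name
    for entry in "$rcdir"/K[0-9][0-9]*; do
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        base="${entry##*/}"
        name="${base#K??}"          # strip K and the two digits
        if [ "$name" = "$first" ]; then return 0; fi
        if [ "$name" = "$second" ]; then return 1; fi
    done
    return 2
}
```

With the rc6.d listing above, `stops_before /etc/rc.d/rc6.d sge ypbind` succeeds only because of the names, while `stops_before /etc/rc.d/rc6.d ypbind netfs` fails, which is the ordering problem described in this post.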
-
Re: sticking in killall5 in halt is BUG in chkconfig
On Fri, 09 Feb 2007 19:47:55 -0500, David Mathog wrote:
> David W. Hodgins wrote:
>> From the description of --add in man chkconfig ...
>> Note that default entries in LSB-delimited 'INIT INFO'
>> sections take precedence over the default runlevels in the
>> initscript
>
> Sorry to be a Luddite, but it was a lot easier for me to control
> the start and kill order for services when they would stay
> where I put them!
The easiest way to force the script to go where you want it
would be to remove the INIT INFO section. Then the old-style
chkconfig lines would control which symlinks get created.
I haven't looked into the LSB format and the dependency tree of
the LSB-controlled init scripts to figure out what needs to
be corrected.
Regards, Dave Hodgins
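
A sketch of that suggestion, under the assumption that the section is delimited by the usual "### BEGIN INIT INFO" and "### END INIT INFO" lines. It writes a stripped copy instead of editing the init script in place, so the original can be kept as a backup. The function name strip_init_info is made up for illustration.

```shell
# strip_init_info SRC DEST
# Copy SRC to DEST, leaving out the LSB INIT INFO section (delimiters
# included). SRC is not modified; DEST is overwritten.
strip_init_info() {
    sed '/^### BEGIN INIT INFO/,/^### END INIT INFO/d' "$1" > "$2"
}
```

After reviewing and installing the stripped copy, running the `chkconfig --del` and `chkconfig --add` commands shown earlier in the thread should recreate the links from the chkconfig line alone.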