Dying processes (inetd, cron, syslogd, sshd) - SCO
This is a discussion on Dying processes (inetd, cron, syslogd, sshd) - SCO ; Hi,
We are having problems on various SMP machines (5.0.6a + rs506a
installed) where at times of large load most of the running processes
just seem to stop (e.g. inetd, cron, syslogd, sshd,....) This always
seems to occur at times ...
-
Dying processes (inetd, cron, syslogd, sshd)
Hi,
We are having problems on various SMP machines (5.0.6a + rs506a
installed) where at times of large load most of the running processes
just seem to stop (e.g. inetd, cron, syslogd, sshd,....) This always
seems to occur at times of large stress to the disks, but we have never
managed to put our fingers on exactly what is causing it. When it does
happen not only does the inetd process die, but also cron and syslog
which makes it very tricky for us to put anything in place to try and
catch what is happening.
We are able to ping the machine when it does happed and also login at
the console and over a modem but not over a telnet of ssh connection.
We have had an issue open with SCO before who advised us to install
scodb and set it to trigger when the inetd process stops - and when it
does to get a sysdump. We have tried this, but the sysdump created was
too big for swap - do you know of any way from within scodb to reduce
the size of the sysdump created?
This machine (which has had the problem once a day for the last three,)
is used as a backup server in our office. All that runs on it is two
rsync's of our main machine - one for mail/uucp spools, and one for the
main data. The problem always has occured during these rsyncs, normally
when transferring a large file.
Any help on this would be much appreciated - or even advice as to which
way to look to find the answer to what is happening would be much
appreciated.
Again, thanks,
Keith
-
Re: Dying processes (inetd, cron, syslogd, sshd)
On Mon, 1 Aug 2005 17:13:24 +0100, keith@actual-systems.com wrote
(in message <1122912804.902870.308720@g44g2000cwa.googlegroups. com>):
> We are having problems on various SMP machines (5.0.6a + rs506a
> installed) where at times of large load most of the running processes
> just seem to stop (e.g. inetd, cron, syslogd, sshd,....) This always
> seems to occur at times of large stress to the disks, but we have never
> managed to put our fingers on exactly what is causing it. When it does
> happen not only does the inetd process die, but also cron and syslog
> which makes it very tricky for us to put anything in place to try and
> catch what is happening.
>
> We are able to ping the machine when it does happed and also login at
> the console and over a modem but not over a telnet of ssh connection.
When you say stop, do you mean that the process dies (quits) or is still
there but no longer doing anything ?
Can you kill a process which has stopped ?
It sounds similar to a problem we had at work when we first got our IBM
server (X250) with ServRaid controller. There was a known bug in the firmware
which would result in disk writes failing to complete in a random and
intermittent manner - it was fixed by a firmware update.
The symptoms were that every now and then (fortunately it came to light
during pre-changover testing) there would be an error in the raid controller
log, NOT the OS logs, and the write would never complete. So the system would
be in a position where the raid controller thinks it's had an error and never
completes the operation, and the OS is waiting for the controller to either
complete or report an error (which it never did). The result of this is that
whatever process was trying to write would halt waiting for the IO to
complete, and since it was waiting at kernel level (I assume) the process was
unkillable.
What's more, any other process that tried to access the block in question
would also halt waiting for the IO request to complete. One day, the fault
was somewhere important and over a period of a few hours, every process 'just
stopped' and the only thing left was to hit the power switch. When the thing
rebooted and we looked at the data, it became apparent that the system had
stopped writing data the day before (I guess the buffer flushing process got
stuck) !
Simon
-
Re: Dying processes (inetd, cron, syslogd, sshd)
When it happens the processes are no longer there - they just
disappear.
The machine is still running and we can login to the console and work
so it sounds different to your issue.
Keith
-
Re: Dying processes (inetd, cron, syslogd, sshd)
Anyone have any idea's on this problem?
What would be the outcome if you had one process that kept on wanting
more and more resource?
Do other processes hold onto the resource they have or will they
eventually get 'bullied' out of the resource they are using and
essentially stop (which theoretically would give the results I am
seeing.)
Any ideas? or does anyone have any idea's as to how I would track down
what was causing this to happen.
Thanks again,
Keith
-
Re: Dying processes (inetd, cron, syslogd, sshd)
keith@actual-systems.com wrote:
> Anyone have any idea's on this problem?
I posted on August 1st, but never saw it come back to me. This time I'm
Bcc'ing you so you'll see it even if USENET swallows it again...
keith@actual-systems.com wrote:
> We are having problems on various SMP machines (5.0.6a + rs506a
> installed) where at times of large load most of the running processes
> just seem to stop (e.g. inetd, cron, syslogd, sshd,....) This always
> seems to occur at times of large stress to the disks, but we have never
> managed to put our fingers on exactly what is causing it. When it does
> happen not only does the inetd process die, but also cron and syslog
> which makes it very tricky for us to put anything in place to try and
> catch what is happening.
>
> We are able to ping the machine when it does happed and also login at
> the console and over a modem but not over a telnet of ssh connection.
>
> We have had an issue open with SCO before who advised us to install
> scodb and set it to trigger when the inetd process stops - and when it
> does to get a sysdump. We have tried this, but the sysdump created was
> too big for swap - do you know of any way from within scodb to reduce
> the size of the sysdump created?
>
> This machine (which has had the problem once a day for the last three,)
> is used as a backup server in our office. All that runs on it is two
> rsync's of our main machine - one for mail/uucp spools, and one for the
> main data. The problem always has occured during these rsyncs, normally
> when transferring a large file.
scodb can't reduce the size of a crash dump, but you can force the dumps
to fit by limiting the amount of memory seen by the kernel. To do this,
append " mem=1m-100m" to DEFBOOTSTR in /etc/default/boot (substituting a
bit less than actual size of your dump area in place of "100m").
The load you describe would probably run in 12MB of RAM, but don't limit
memory more than you have to. The problem might be memory size-related.
You want to keep as much as you can of the machine's normal memory size.
[new material begins]
> What would be the outcome if you had one process that kept on wanting
> more and more resource?
There are some problem scenarios like that. A common one is a process
spinning out of control, allocating more and more memory. It will
eventually use all available memory; its next allocation attempt will
fail, and in most cases it will then die. Unless you have changed the
defaults, such a process usually writes a core dump. On OSR5, during
the dumping of a process's core, the process continues to own all of its
memory until the dump is complete. This means that the machine remains
critically out of memory for a long time. The process may have grown
nearly as large as your combined RAM + swap. To dump it, not only does
the kernel have to write that much data, it also may have to page a
large portion of it in from swap. This can take many minutes with large
memory and a slow disk...
During that period, other processes that try to allocate memory will
usually fail. Their subsequent behavior depends on their error
handling. Some will dump core, some will exit gracefully, some may even
stay up. And some may get into weird catatonic states.
> Do other processes hold onto the resource they have or will they
> eventually get 'bullied' out of the resource they are using and
> essentially stop (which theoretically would give the results I am
> seeing.)
For memory, a "hog" process will cause others to get written out to
swap, but those processes still "own" their memory (it will get paged
back in if they need to access it). The troubles happen when a process
tries to allocate more memory while the system is strapped.
There are probably other resources where similar things could happen.
> Any ideas? or does anyone have any idea's as to how I would track down
> what was causing this to happen.
If you had a process spin out and dump, it would leave a huge core file
that you would be able to find. If a process spins out and dies
_without_ leaving a dump, a more subtle trace is left. Normally, OSR5
doesn't use any swap at all; `swap -l` will have identical values in the
"blocks" and "free" columns. ("Normal" modern systems have enough RAM
that they never need to invoke the tremendous performance loss of
swapping.) After such an incident, `swap -l` will show quite a bit of
swap in use. This represents pages that got pushed out, and whose
processes have never actually needed to access them since the incident.
What does `crash` "p" show in the "EVENT" column for the hung processes?
>Bela<
-
Re: Dying processes (inetd, cron, syslogd, sshd)
Thanks Bela.
I will give those things a try and get back to you on the results.
Hope all is going well for you now that you have left SCO.
Keith
Bela Lubkin wrote:
> keith@actual-systems.com wrote:
>
> > Anyone have any idea's on this problem?
>
> I posted on August 1st, but never saw it come back to me. This time I'm
> Bcc'ing you so you'll see it even if USENET swallows it again...
>
> keith@actual-systems.com wrote:
>
> > We are having problems on various SMP machines (5.0.6a + rs506a
> > installed) where at times of large load most of the running processes
> > just seem to stop (e.g. inetd, cron, syslogd, sshd,....) This always
> > seems to occur at times of large stress to the disks, but we have never
> > managed to put our fingers on exactly what is causing it. When it does
> > happen not only does the inetd process die, but also cron and syslog
> > which makes it very tricky for us to put anything in place to try and
> > catch what is happening.
> >
> > We are able to ping the machine when it does happed and also login at
> > the console and over a modem but not over a telnet of ssh connection.
> >
> > We have had an issue open with SCO before who advised us to install
> > scodb and set it to trigger when the inetd process stops - and when it
> > does to get a sysdump. We have tried this, but the sysdump created was
> > too big for swap - do you know of any way from within scodb to reduce
> > the size of the sysdump created?
> >
> > This machine (which has had the problem once a day for the last three,)
> > is used as a backup server in our office. All that runs on it is two
> > rsync's of our main machine - one for mail/uucp spools, and one for the
> > main data. The problem always has occured during these rsyncs, normally
> > when transferring a large file.
>
> scodb can't reduce the size of a crash dump, but you can force the dumps
> to fit by limiting the amount of memory seen by the kernel. To do this,
> append " mem=1m-100m" to DEFBOOTSTR in /etc/default/boot (substituting a
> bit less than actual size of your dump area in place of "100m").
>
> The load you describe would probably run in 12MB of RAM, but don't limit
> memory more than you have to. The problem might be memory size-related.
> You want to keep as much as you can of the machine's normal memory size.
>
> [new material begins]
>
> > What would be the outcome if you had one process that kept on wanting
> > more and more resource?
>
> There are some problem scenarios like that. A common one is a process
> spinning out of control, allocating more and more memory. It will
> eventually use all available memory; its next allocation attempt will
> fail, and in most cases it will then die. Unless you have changed the
> defaults, such a process usually writes a core dump. On OSR5, during
> the dumping of a process's core, the process continues to own all of its
> memory until the dump is complete. This means that the machine remains
> critically out of memory for a long time. The process may have grown
> nearly as large as your combined RAM + swap. To dump it, not only does
> the kernel have to write that much data, it also may have to page a
> large portion of it in from swap. This can take many minutes with large
> memory and a slow disk...
>
> During that period, other processes that try to allocate memory will
> usually fail. Their subsequent behavior depends on their error
> handling. Some will dump core, some will exit gracefully, some may even
> stay up. And some may get into weird catatonic states.
>
> > Do other processes hold onto the resource they have or will they
> > eventually get 'bullied' out of the resource they are using and
> > essentially stop (which theoretically would give the results I am
> > seeing.)
>
> For memory, a "hog" process will cause others to get written out to
> swap, but those processes still "own" their memory (it will get paged
> back in if they need to access it). The troubles happen when a process
> tries to allocate more memory while the system is strapped.
>
> There are probably other resources where similar things could happen.
>
> > Any ideas? or does anyone have any idea's as to how I would track down
> > what was causing this to happen.
>
> If you had a process spin out and dump, it would leave a huge core file
> that you would be able to find. If a process spins out and dies
> _without_ leaving a dump, a more subtle trace is left. Normally, OSR5
> doesn't use any swap at all; `swap -l` will have identical values in the
> "blocks" and "free" columns. ("Normal" modern systems have enough RAM
> that they never need to invoke the tremendous performance loss of
> swapping.) After such an incident, `swap -l` will show quite a bit of
> swap in use. This represents pages that got pushed out, and whose
> processes have never actually needed to access them since the incident.
>
> What does `crash` "p" show in the "EVENT" column for the hung processes?
>
> >Bela<
-
Re: Dying processes (inetd, cron, syslogd, sshd)
Hi,
Just a couple of updates on the suggestions Bela gave.
I've done a 'find' for any core files and didn't find any new ones, so
they are not being created - I have yet to have the problem happen
again so I havn't done a 'swap -l' to see if any of it is being used
when the problem occurs.
While I've not used 'crash' to see what the EVENT column shows - I
don't actually think it will show anything as a ps done after the
problem has occured doesn't show the processes - they have just
disappeared.
I've set 'mem=1m-100m' on both my machines here and will try to force
the problem to happen tonight and hopefully I will have more questions
tommorow.
Thanks again,
Keith
-
Re: Dying processes (inetd, cron, syslogd, sshd)
keith@actual-systems.com wrote:
> I've done a 'find' for any core files and didn't find any new ones, so
> they are not being created - I have yet to have the problem happen
> again so I havn't done a 'swap -l' to see if any of it is being used
> when the problem occurs.
Do at least one before the system goes weird, so you'll know what's
normal.
> While I've not used 'crash' to see what the EVENT column shows - I
> don't actually think it will show anything as a ps done after the
> problem has occured doesn't show the processes - they have just
> disappeared.
Ah, I don't think I had understood properly. Disappearing processes
could have been killed by a low-memory situation, but also by another
process deliberately killing them, or all sorts of other things.
> I've set 'mem=1m-100m' on both my machines here and will try to force
> the problem to happen tonight and hopefully I will have more questions
> tommorow.
Is that the actual size of swap on the machines? Like I said, you
shouldn't restrain memory tighter than necessary: use "mem=1m-%dm",
filling in the actual size of swap on each machine (short it by 1MB to
be sure).
>Bela<
-
Re: Dying processes (inetd, cron, syslogd, sshd)
I have now had the problem happen again and have a sysdump created from
within scodb (I was triggered into debug mode when the inetd process
stopped.)
I have a few questions now.
1. From looking at this sysdump how can I tell what has caused this
process to stop running?
2. Bela, in your previous answer you mentioned that 'Disappearing
processes could have been killed by a low-memory situation' - is there
any way to tell if this is what has happened (or does the answer to the
first question give me the answer to the second....)
Again thanks for all your help,
Keith
-
Re: Dying processes (inetd, cron, syslogd, sshd)
keith@actual-systems.com wrote:
> I have now had the problem happen again and have a sysdump created from
> within scodb (I was triggered into debug mode when the inetd process
> stopped.)
>
> I have a few questions now.
> 1. From looking at this sysdump how can I tell what has caused this
> process to stop running?
Exactly how did you trigger scodb on death of the process? If you had
it catch the right thing, you should be able to see what signal caused
the death. If that signal was raised by the process's own action (e.g.
`kill(getpid(), SIGTERM)', or by trying to allocate memory when none was
available) -- the evidence will be on its own stack. If it was killed
by another process (`kill(1234, SIGTERM)'), that process's stack will
tell the story.
Start by doing a "stack -p 1234", where 1234 is the PID of inetd.
> 2. Bela, in your previous answer you mentioned that 'Disappearing
> processes could have been killed by a low-memory situation' - is there
> any way to tell if this is what has happened (or does the answer to the
> first question give me the answer to the second....)
It might. But you can also examine the system's memory state. The 4
most interesting variables are the ones shown by `sar -r`, although one
is spelled differently: freemem, freeswap, availrmem and availsmem. You
can display them by just entering their names:
scodb:1> freemem
48E4B
scodb:2> freeswap
100000
scodb:3> availrmem
5DBE6
scodb:4> availsmem
146949
That corresponds to:
$ sar -r 1 1
SCO_SV deeptht 3.2v5.0.6 i80386 08/12/2005
15:40:42 freemem freeswp availrmem availsmem (-r)
15:40:43 299572 8388608 383979 1337833
`scodb` output is in hexadecimal. All of the variables are in 4K
pages; `sar -r` output is in 4K pages except "freeswp", which is in
512-byte sectors, so we need to multiply it by 8:
$ bc
ibase=16
48E4B; 100000*8; 5DBE6; 146949
298571
8388608
383974
1337673
The numbers are slightly different because I'm looking at a live system.
If your system is deeply out of memory, `availsmem' will be near 0.
Anything under about 1000 is desperate. `freemem' and `freeswap' may or
may not be near 0, depending on exactly what the memory-eating process
is doing. `availrmem' is unlikely to be low.
Lack of memory is just one possible cause...
Show me the "stack -p ..." output for inetd. That'll give me some
idea...
>Bela<
-
Re: Dying processes (inetd, cron, syslogd, sshd)
Bela,
We were tripped into scodb as we had setup a watch point on the inetd
process.
We work out the pid of inetd using crash (p | grep inetd) - then we
convert it to hex and go into scodb.
-------------
# crash
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
> p | grep inetd
29 s 256 1 256 0 76 0 selwait inetd load nxec
> q
(PID 29=1D in hex)
Ctrl+alt+D
debug0:1> bp wl &proc[1d].p_utime
debug0:2> q
--------------------------------
Then once this problem occured we were in debug mode and I did the
following to get the dumpfile I have.
-------------------
debug0:4> bp dis *
debug0:5> sysdump()
debug0:6> bc *
debug0:7> q
# sysdump -i /dev/swap -fbhm -o /usr/tmp/dump
---------------------
Now the following is the output when I used scodb to look at the
dumpfile I have - I would have followed your advice about getting the
PID of inetd, but I think the process must have already stopped by the
time I had the dump so I just did 'stack' to give you everything and
then 'p' from within crash to get al the running processes.
----------------------------------------------------------------
back:/usr/tmp # scodb -d dump
dumpfile = dump
namelist = dump
stunfile = dump
varifile = dump
PID 0147: /etc/inetd
scodb:1>
scodb:1> stack
E0000DC0 exit2(0, 9) <- psig+18F
E0000DE8 psig(? 8057644, 0, 0, F01D4D98) <- systrap+39F
E0000E10 systrap(E0000E1C) <-
scall_nokentry+14
scodb:2> q
# crash -d dump
dumpfile = dump, namelist = dump, outfile = stdout
> p
PROC TABLE SIZE = 83
SLOT ST PID PPID PGRP UID PRI CPU EVENT NAME
FLAGS
0 s 0 0 0 0 95 0 runout sched
load sys lock nwak
1 r 1 0 0 0 80 0 init
load nwak
2 s 2 0 0 0 95 0 vhand vhand
load sys lock nwak nxec
3 r 3 0 0 0 95 0 bdflush
load sys lock nwak nxec
4 r 4 0 0 0 36 0 CPU1 idle proc
load sys lock nxec 2000000
5 p 5 0 0 0 36 0 CPU2 idle proc
load sys lock nxec 2000000
6 s 6 0 0 0 95 0 kmd_id kmdaemon
load sys lock nwak nxec
7 r 7 1 0 0 95 0 htepi_daemon
load sys lock nwak
8 s 8 0 0 0 95 0 pbintrpool strd
load sys lock nwak nxec
10 r 55 1 55 0 73 0 ifor_pmd
load nxec exit
11 z 57 55 55 0 76 0 zombie
nou nxec exit
12 r 52 1 50 0 80 0 syslogd
load nwak nxec exit
13 s 43 1 0 0 95 0 0xc1031150 htepi_daemon
load sys lock nwak
14 r 524 1 524 0 75 0 getty
load
15 r 85 1 71 0 75 0 strerr
load exit
18 r 525 1 525 0 75 0 getty
load
20 r 526 1 526 0 75 0 getty
load
21 r 276 1 276 0 81 0 cron
load nwak nxec exit
25 s 206 1 0 0 95 0 0xc1035150 htepi_daemon
load sys lock nwak
26 r 210 1 0 0 95 0 htepi_daemon
load sys lock nwak
29 z 568 1 0 204 76 0 zombie
nou exit
31 z 384 1 384 0 76 0 zombie
nou nxec exit
33 z 382 1 382 0 76 0 zombie
nou nxec exit
34 p 327 1 327 0 76 0 inetd
load nxec exit
35 r 17987 1 17987 0 80 0 getty
load nwak
41 r 392 1 384 0 75 0 lockd
load nxec exit
42 r 393 392 384 0 75 0 lockd
load nxec
43 r 394 392 384 0 75 0 lockd
load nxec
44 r 395 392 384 0 75 0 lockd
load nxec
45 r 396 392 384 0 75 0 lockd
load nxec
47 r 434 1 432 0 80 0 calserver
load nwak nxec exit
48 r 17869 1 276 0 76 0 rsync
load exit
49 r 442 1 442 0 81 0 caldaemon
load nwak exit
50 r 449 1 449 0 66 0 prngd
load nxec exit
53 r 527 1 527 0 75 0 getty
load
54 r 507 1 507 0 76 0 sshd
load nxec exit
55 r 528 1 528 0 75 0 getty
load
56 r 529 1 529 0 75 0 getty
load
57 r 530 1 530 0 75 0 getty
load
58 r 531 1 531 0 75 0 getty
load
59 r 532 1 532 0 75 0 getty
load
60 r 533 1 533 0 75 0 getty
load
61 r 534 1 534 0 75 0 getty
load
62 r 535 1 535 0 80 0 sdd
load nwak exit
63 z 17871 17869 276 0 88 0 zombie
nou exit
64 z 17240 276 276 0 73 0 zombie
nou exit
65 r 17872 1 276 0 75 0 rcmd
load nxec exit
68 s 17900 17869 276 0 95 0 0xfc2e3b90 rsync
load nwak nxec
> q
------------------------
And also looking at the memory as suggested shows it wasn't critically
close to it's limits (unless this is showing the freemem etc after the
inetd process has stopped.)
------------------------
scodb:2> freemem
6EEA0
scodb:3> freeswap
100000
scodb:4> availrmem
6FB0C
scodb:5> availsmem
16E941
------------------------
As always your help here is very much appreciated.
Thanks,
Keith
-
Re: Dying processes (inetd, cron, syslogd, sshd)
keith@actual-systems.com wrote:
> We were tripped into scodb as we had setup a watch point on the inetd
> process.
>
> We work out the pid of inetd using crash (p | grep inetd) - then we
> convert it to hex and go into scodb.
>
> -------------
> # crash
> dumpfile = /dev/mem, namelist = /unix, outfile = stdout
> > p | grep inetd
> 29 s 256 1 256 0 76 0 selwait inetd load nxec
> > q
>
> (PID 29=1D in hex)
>
> Ctrl+alt+D
> debug0:1> bp wl &proc[1d].p_utime
> debug0:2> q
Ok, that should work. You can skip the `crash` part, use "ps()" in
scodb and pick out inetd's PID there (already printed in hex). But
you can't pipe it to "grep inetd", so that's a bit of a negative...
> Then once this problem occured we were in debug mode and I did the
> following to get the dumpfile I have.
>
> -------------------
> debug0:4> bp dis *
> debug0:5> sysdump()
> debug0:6> bc *
> debug0:7> q
> # sysdump -i /dev/swap -fbhm -o /usr/tmp/dump
> ---------------------
>
> Now the following is the output when I used scodb to look at the
> dumpfile I have - I would have followed your advice about getting the
> PID of inetd, but I think the process must have already stopped by the
> time I had the dump so I just did 'stack' to give you everything and
> then 'p' from within crash to get al the running processes.
Again, "ps()" in scodb will do that.
The process was dying when the dump was taken.
Your three pieces of evidence here are inconsistent with each other.
You've set up to break when PID 0x1D is dying. The `scodb -d dump`
output from after the breakpoint says it's showing PID 0x147. And the
`crash -d dump` output has an inetd PID 0x34. So at best I have to
assume these are from three different events.
> back:/usr/tmp # scodb -d dump
> dumpfile = dump
> namelist = dump
> stunfile = dump
> varifile = dump
> PID 0147: /etc/inetd
> scodb:1>
> scodb:1> stack
> E0000DC0 exit2(0, 9) <- psig+18F
> E0000DE8 psig(? 8057644, 0, 0, F01D4D98) <- systrap+39F
> E0000E10 systrap(E0000E1C) <-
> scall_nokentry+14
> scodb:2> q
`exit2(0, 9)' is, I believe, delivering a signal 9 (SIGKILL) to the
process. There are few things inside the kernel that generate SIGKILL.
This suggests a user process is deliberately killing the dying
processes.
> # crash -d dump
> dumpfile = dump, namelist = dump, outfile = stdout
> > p
> PROC TABLE SIZE = 83
> SLOT ST PID PPID PRI EVENT NAME FLAGS
> 0 s 0 0 95 runout sched load sys lock nwak
> 1 r 1 0 80 init load nwak
> 2 s 2 0 95 vhand vhand load sys lock nwak nxec
> 3 r 3 0 95 bdflush load sys lock nwak nxec
> 4 r 4 0 36 CPU1 idle proc load sys lock nxec 2000000
> 5 p 5 0 36 CPU2 idle proc load sys lock nxec 2000000
> 6 s 6 0 95 kmd_id kmdaemon load sys lock nwak nxec
> 7 r 7 1 95 htepi_daemon load sys lock nwak
> 8 s 8 0 95 pbintrpool strd load sys lock nwak nxec
> 10 r 55 1 73 ifor_pmd <-- load nxec exit
> 11 z 57 55 76 zombie <-- nou nxec exit
> 12 r 52 1 80 syslogd <-- load nwak nxec exit
> 13 s 43 1 95 0xc1031150 htepi_daemon load sys lock nwak
> 14 r 524 1 75 getty load
> 15 r 85 1 75 strerr <-- load exit
> 18 r 525 1 75 getty load
> 20 r 526 1 75 getty load
> 21 r 276 1 81 cron <-- load nwak nxec exit
> 25 s 206 1 95 0xc1035150 htepi_daemon load sys lock nwak
> 26 r 210 1 95 htepi_daemon load sys lock nwak
> 29 z 568 1 76 zombie <-- nou exit
> 31 z 384 1 76 zombie <-- nou nxec exit
> 33 z 382 1 76 zombie <-- nou nxec exit
> 34 p 327 1 76 inetd <-- load nxec exit
> 35 r 17987 1 80 getty load nwak
> 41 r 392 1 75 lockd <-- load nxec exit
> 42 r 393 392 75 lockd load nxec
> 43 r 394 392 75 lockd load nxec
> 44 r 395 392 75 lockd load nxec
> 45 r 396 392 75 lockd load nxec
> 47 r 434 1 80 calserver <-- load nwak nxec exit
> 48 r 17869 1 76 rsync <-- load exit
> 49 r 442 1 81 caldaemon <-- load nwak exit
> 50 r 449 1 66 prngd <-- load nxec exit
> 53 r 527 1 75 getty load
> 54 r 507 1 76 sshd <-- load nxec exit
> 55 r 528 1 75 getty load
> 56 r 529 1 75 getty load
> 57 r 530 1 75 getty load
> 58 r 531 1 75 getty load
> 59 r 532 1 75 getty load
> 60 r 533 1 75 getty load
> 61 r 534 1 75 getty load
> 62 r 535 1 80 sdd <-- load nwak exit
> 63 z 17871 17869 88 zombie <-- nou exit
> 64 z 17240 276 73 zombie <-- nou exit
> 65 r 17872 1 75 rcmd <-- load nxec exit
> 68 s 17900 17869 95 0xfc2e3b90 rsync load nwak nxec
I deleted some less interesting columns from the table.
Notice the large number of processes with the "exit" flag. I marked
them with "<--" to make them visually obvious. These are all dying.
Does this list more or less match your experience of what dies in one of
these events? (Your mental list should be a little longer since there
are a few here that have gotten far enough along in dying that we can't
see what they were running, just "zombie").
To capture a better dump next time, I suggest rigging the system up so
that it breaks into scodb when _any_ of these processes die. That means
you'll catch it as early as possible -- possibly while the "attacker"
process is still issuing the SIGKILLs.
> And also looking at the memory as suggested shows it wasn't critically
> close to it's limits (unless this is showing the freemem etc after the
> inetd process has stopped.)
>
> scodb:2> freemem
> 6EEA0
> scodb:3> freeswap
> 100000
> scodb:4> availrmem
> 6FB0C
> scodb:5> availsmem
> 16E941
Right, that makes it look like it has nothing to do with memory.
> As always your help here is very much appreciated.
I'm wondering if we should take it offline -- I feel like I should be
looking directly at one of these dumps (hopefully one collected as early
as possible in an event).
Instead of setting up individual breakpoints for all the processes you
expect to die, you can use the fact that it's SIGKILL to your advantage.
Take a look at the code for kill():
scodb> u kill+175
kill+175 movl %eax,4(%edi)
kill+178 cmpl %eax,9
kill+17D jne kill+18C
kill+17F movb %al,-1(%ebp)
If your kernel matches the one I'm looking at here, kill+17F is an
address only reached during a `kill(pid, 9)' call. (If your kernel is
different: do "u kill", then type "/,9". It will scroll forward to the
right point in the disassembly. Hit a few times after that,
then "q" to exit the disassembler.)
Set a breakpoint on that ("bp kill+17F").
Test it by:
$ sleep 100 &
[1] 27388
$ kill -9 27388
.... and you should drop into the debugger. Quit.
There shouldn't be a lot of SIGKILLing going on in a normal working
system. Hopefully this won't trigger until the problem event. If it
does, you might have to find another way.
But if it does false trigger, you should also try to understand who is
killing -9 whom -- it might be an important clue. Suppose a particular
daemon has a habit of creating a child, letting it run for a while, then
killing it. Suppose its memory of the child PID is corrupted somehow.
e.g. what if it does:
do_something &
childpid=$!
...
ps -ef | grep $childpid | awk '{ print "kill -9 " $2 }' | sh
This is horribly bad code, do not use it as an example of anything
besides trouble...
So this "works fine" for a while, when $childpid is 12345. But one day
it runs its child right when the system PID counter has rolled over,
gets PID 23. `ps -ef | grep 23` is likely to match several processes
other than the intended victim.
Anyway, if you trap on delivery of signal 9, you're likely to catch the
problem happening.
And if it happens without hitting that trigger, you learn that it
_isn't_ a process-to-process kill, it's actually coming directly out of
the kernel -- also very interesting news.
>Bela<
-
Re: Dying processes (inetd, cron, syslogd, sshd)
I'm very confused by the difference in PID's showing for inetd from the
output from crash and scodb - they were both using the same dump file
(in the example of how it was setup was a different example.)
You are correct about the 'exiting' processes - they all seem to be the
processes that are the ones that are killed when this happens - so it
does look like the dump was triggered at the correct time.
I did a bit of testing on a machine here doing 'kill -9' to various
processes and the result was the 'exit2(0, 9)' as you predicted -
however when I setup the breakpoint to be 'bp kill+175' I can't seem to
get it to trigger no matter what processes I kill. I also tried to
trigger with lines above and below this but they either triggered at
every kill or not at all. Is there another line that we can setup as a
trigger?
On these servers the problem seems to happen when they are running
rsync's - so I'm going to grab the source for the version that we are
running and see if there is anything in there doing kill's similar to
your example.
Again thanks for your help.
Keith
-
Re: Dying processes (inetd, cron, syslogd, sshd)
keith@actual-systems.com wrote:
> I'm very confused by the difference in PID's showing for inetd from the
> output from crash and scodb - they were both using the same dump file
> (in the example of how it was setup was a different example.)
Hmmm. It's possible that inetd had forked at that time; but they really
look like they should be the same.
> You are correct about the 'exiting' processes - they all seem to be the
> processes that are the ones that are killed when this happens - so it
> does look like the dump was triggered at the correct time.
I'm sure it was triggered at the right time at a gross level, now we're
looking for finer precision.
> I did a bit of testing on a machine here doing 'kill -9' to various
> processes and the result was the 'exit2(0, 9)' as you predicted -
> however when I setup the breakpoint to be 'bp kill+175' I can't seem to
> get it to trigger no matter what processes I kill. I also tried to
> trigger with lines above and below this but they either triggered at
> every kill or not at all. Is there another line that we can setup as a
> trigger?
`kill+175' was the start of the disassembly I looked at:
scodb> u kill+175
kill+175 movl %eax,4(%edi)
kill+178 cmpl %eax,9
kill+17D jne kill+18C
kill+17F movb %al,-1(%ebp)
The correct breakpoint here is "bp kill+17F" -- immediately after it has
compared the signal number to 9 (and branched elsewhere if it wasn't 9).
> On these servers the problem seems to happen when they are running
> rsync's - so I'm going to grab the source for the version that we are
> running and see if there is anything in there doing kill's similar to
> your example.
Hmmm... what version is it? This appears in the OLDNEWS file in rsync
source:
" NEWS for rsync 2.5.5, aka Snowy River (2 Apr 2002)
" Protocol: 26 (unchanged)
" Changes since 2.5.4:
"
" BUG FIXES:
"
" * Fix situation where failure to fork (e.g. because out of process
" slots) would cause rsync to kill all processes owned by the
" current user. Yes, really! (Paul Haas, Martin Pool)
That seems like it could possibly have something to do with the
problem... ;-}
On the other hand, it looks like it would have killed even more
processes, but would have done so with SIGUSR1. So unless this is a
hacked version of rsync, it probably isn't precisely that problem.
>Bela<
-
Re: Dying processes (inetd, cron, syslogd, sshd)
Hi Bela,
I'm still having problems with the kill+17F line. If I set 'bp
kill+17F' I can kill -9 any process I want and it doesn't trigger.
I've put the output from 'u kill' below so you can make sure that it is
the same version you are running on - but from looking at the list of
lines 175-17F from your example before It looks to be the same.
scodb:1> u kill
kill pushl %ebp
kill+1 movl %eax,pid_stamp
kill+6 movl %ebp,%esp
kill+8 subb %esp,1C
kill+B pushl %edi
kill+C movl -14(%ebp),%eax
kill+F pushl %esi
kill+10 movl %edi,u.u_ap
kill+16 pushl %ebx
kill+17 movl %eax,4(%edi)
kill+1A testl %eax,%eax
kill+1C jl kill+25
kill+1E cmpl %eax,20
kill+23 jl kill+44
kill+25 popl %ebx
kill+26 popl %esi
kill+27 movb u.u_error,16
kill+2E popl %edi
kill+2F movl %esp,%ebp
kill+31 popl %ebp
kill+32 ret
kill+33 nop
kill+34 movb u.u_error,3
kill+3B popl %ebx
kill+3C popl %esi
kill+3D popl %edi
kill+3E movl %esp,%ebp
kill+40 popl %ebp
kill+41 ret
kill+42 nop
kill+43 nop
kill+44 movl %eax,u.u_renv
kill+49 testl %eax,3000000
kill+4E jne kill+61
kill+50 movl %eax,4(%edi)
kill+53 cmpl %eax,14
kill+58 jne kill+61
kill+5A movl 4(%edi),16
kill+61 movw %si,(%edi)
kill+64 testw %si,%si
kill+67 jle kill+70
kill+69 movl %ebx,&proc[1]
kill+6E jmp kill+75
kill+70 movl %ebx,&proc[2]
kill+75 movl %eax,u.u_procp
kill+7A testw %si,%si
kill+7D movl -C(%ebp),%eax
kill+80 jne kill+8B
kill+82 movw %ax,10(%eax)
kill+86 testw %ax,%ax
kill+89 je kill+34
kill+8B movb -1(%ebp),3
kill+8F movl %edx,v.ve_proc
kill+95 cmpl %ebx,%edx
kill+97 jnb kill+1EE
kill+9D movb %al,(%ebx)
kill+9F testb %al,%al
kill+A1 je kill+1DB
kill+A7 movw %ax,12(%ebx)
kill+AB testw %si,%si
kill+AE movw -6(%ebp),%ax
kill+B2 jle kill+BD
kill+B4 cmpw %ax,%si
kill+B7 jne kill+1DB
kill+BD testw %si,%si
kill+C0 jne kill+D3
kill+C2 movl %eax,-C(%ebp)
kill+C5 movw %ax,10(%eax)
kill+C9 cmpw 10(%ebx),%ax
kill+CD jne kill+1DB
kill+D3 cmpb %si,FFFFFFFF
kill+D7 jnl kill+EA
kill+D9 movsx %edx,%si
kill+DC negl %edx
kill+DE movsx %eax,10(%ebx)
kill+E2 cmpl %eax,%edx
kill+E4 jne kill+1DB
kill+EA movl %eax,50(%ebx)
kill+ED movl %ecx,-14(%ebp)
kill+F0 movl %eax,4C(%eax)
kill+F3 subl %eax,%ecx
kill+F5 jnle kill+1DB
kill+FB movl %eax,4(%edi)
kill+FE cmpl %eax,19
kill+103 jne kill+112
kill+105 movl %eax,-C(%ebp)
kill+108 movl %edx,C(%ebx)
kill+10B movl %ecx,C(%eax)
kill+10E cmpl %ecx,%edx
kill+110 je kill+18C
kill+112 movw %ax,A(%ebx)
kill+116 xorl %edx,%edx
kill+118 movw %dx,u.u_uid
kill+11F movw -4(%ebp),%ax
kill+123 testl %edx,%edx
kill+125 je kill+148
kill+127 xorl %eax,%eax
kill+129 movw %ax,8(%ebx)
kill+12D cmpl %edx,%eax
kill+12F je kill+148
kill+131 movl %eax,-4(%ebp)
kill+134 andl %eax,FFFF
kill+139 cmpl %edx,%eax
kill+13B movl %eax,0
kill+140 setnz %al
kill+143 jmp kill+14A
kill+145 nop
kill+146 nop
kill+147 nop
kill+148 xorl %eax,%eax
kill+14A testl %eax,%eax
kill+14C je kill+16D
kill+14E xorl %eax,%eax
kill+150 xorl %edx,%edx
kill+152 movw %ax,u.u_ruid
kill+158 movw %dx,8(%ebx)
kill+15C cmpl %eax,%edx
kill+15E je kill+16D
kill+160 movl %edx,-4(%ebp)
kill+163 andl %edx,FFFF
kill+169 cmpl %eax,%edx
kill+16B jne kill+17F
kill+16D cmpl %ebx,&proc[1]
kill+173 jne kill+18C
kill+175 movl %eax,4(%edi)
kill+178 cmpl %eax,9
kill+17D jne kill+18C
kill+17F int3
kill+180 incl %ebp
kill+181 decl %esp
kill+183 addl %esi,50(%ebp)
kill+186 movb -1(%ebp),1
kill+18A jmp kill+1D6
kill+18C movb -1(%ebp),0
kill+190 movl %eax,4(%edi)
kill+193 testl %eax,%eax
kill+195 je kill+1D6
kill+197 pushl %eax
kill+198 pushl %ebx
kill+199 call sigqueue_zalloc
kill+19E addb %esp,8
kill+1A1 movl %ecx,%eax
kill+1A3 testl %ecx,%ecx
kill+1A5 je kill+1C3
kill+1A7 movl %edx,4(%edi)
kill+1AA movl 4(%ecx),%edx
kill+1AD movl %edx,-C(%ebp)
kill+1B0 movw %dx,12(%edx)
kill+1B4 movw 10(%ecx),%dx
kill+1B8 movl %edx,-C(%ebp)
kill+1BB movw %dx,A(%edx)
kill+1BF movw 14(%ecx),%dx
kill+1C3 movsx %eax,-6(%ebp)
kill+1C7 pushl %eax
kill+1C8 pushl %ecx
kill+1C9 movl %ecx,4(%edi)
kill+1CC pushl %ecx
kill+1CD pushl %ebx
kill+1CE call psignalinfo
kill+1D3 addb %esp,10
kill+1D6 testw %si,%si
kill+1D9 jnle kill+1EE
kill+1DB movl %eax,v.ve_proc
kill+1E0 addl %ebx,158
kill+1E6 cmpl %ebx,%eax
kill+1E8 jb kill+9D
kill+1EE movb %al,-1(%ebp)
kill+1F1 testb %al,%al
kill+1F3 je kill+3B
kill+1F9 popl %ebx
kill+1FA popl %esi
kill+1FB popl %edi
kill+1FC movb u.u_error,%al
kill+201 movl %esp,%ebp
kill+203 popl %ebp
kill+204 ret
kill+205 nop
kill+206 nop
kill+207 nop
kill+208 nop
kill+209 nop
kill+20A nop
kill+20B nop
kill+20C nop
kill+20D nop
kill+20E nop
kill+20F nop
Unfortunately (or maybe fortunately,) the version of rsync we are using
is newer than 2.5.5, and it was built from the directly from the source
on the rsync site with no hacks.
back:/ # rsync --version
rsync version 2.6.3pre1 protocol version 28
Copyright (C) 1996-2004 by Andrew Tridgell and others
Capabilities: 32-bit files, socketpairs, hard links, symlinks,
batchfiles,
inplace, no IPv6, 32-bit system inums, 64-bit internal
inums
Thanks again.
Keith
P.S. I've also replied to your mail, but it was bounced back to me - so
if you still havn't received it in a few hours let me know.
-
Re: Dying processes (inetd, cron, syslogd, sshd)
keith@actual-systems.com wrote:
> I'm still having problems with the kill+17F line. If I set 'bp
> kill+17F' I can kill -9 any process I want and it doesn't trigger.
>
> I've put the output from 'u kill' below so you can make sure that it is
> the same version you are running on - but from looking at the list of
> lines 175-17F from your example before It looks to be the same.
>
> scodb:1> u kill
> kill pushl %ebp
> kill+1 movl %eax,pid_stamp
> kill+6 movl %ebp,%esp
....
> kill+175 movl %eax,4(%edi)
> kill+178 cmpl %eax,9
> kill+17D jne kill+18C
> kill+17F int3
Yeah, that looks like the right code. (Though it seems a little odd
that scodb disassembled the breakpoint as "int3". That _is_ the
instruction it uses to mark a breakpoint, but it normally hides
breakpoints in a disassembly.)
I need to try it live on my own system, which I am not near at the
moment. I'll try later today.
[later] Yeah, that doesn't work. So do this:
scodb:1> u kill+10
kill+10 movl %edi,u.u_ap
kill+16 pushl %ebx
kill+17 movl %eax,4(%edi)
kill+1A testl %eax,%eax [hit q]
scodb> bp mod kill+1a
[1] quitif %eax != 9 [hit ESC]
:q
This makes a breakpoint that is hit _every_ time the kill() system call
is invoked. "quitif %eax != 9" causes scodb to quit (i.e. return the
system to normal operation) if it wasn't a kill -9. This kind of
breakpoint sometimes has side effects; you should only do this on one of
your test systems.
When this breakpoint hits, you will be in the context of the process
issuing the kill -9, not the process being killed. This is OK. We know
that whatever process is doing it is killing a whole bunch of other
processes. You can see the PID to which the kill is being sent by
doing:
scodb> d %edi
If it's 0 or negative, it's a single call that kills multiple processes
(see kill(S) or kill(C)). Investigate that immediately.
If it's positive, it's a single-process kill. Quit and see if the
debugger pops up again immediately. Look at several of the PIDs,
quitting after each one. Then start investigating. If you're looking
at the problem event, you know it's going to issue 10-20 kills. None of
the processes it's killing will disappear while you quit the debugger
and cycle to the next kill -- the calling process is going to be able to
fire off a bunch of kill() system calls before any of the receiving
processes actually receive the signals and start dying. Remember the
kernel isn't running when you're sitting at the scodb prompt...
You might get some false hits (processes doing kill -9's that aren't
part of the problem). You should be able to recognize the real problem
case pretty easily.
Once you know what process is doing it, the rest will probably fall into
place pretty quickly...
>Bela<
-
Re: Dying processes (inetd, cron, syslogd, sshd)
Excellent, thanks Bela, I'll let you know the outcome.
Keith
-
Re: Dying processes (inetd, cron, syslogd, sshd)
Hi,
I've had this machine now tripinto debug mode, but unfortunately I am
remote from there so I had to have someone else run the commands you
wanted and give me the response.
>From what I can see it was triggered by an rsync - but I don't know how
to read the output from 'd %edi' to tell you if it was 0 or negative or
positive.
----------------------
BreakPoint: PID 000005BF /usr/bin/rcmd asa /usr/local/bin/rsync
--server--
F01D1676 kill+1A : kill+1a
---------------------
Then the output from 'd %edi' was as follows
u+1148 000005BF 00000009 00001000 000000000
Then when scodb was exited it popped straight back in and the output
for the first part was the same (what triggered it,) but the output
from 'd %edi' was as follows.
u+1148 0000052A 00000009 00001000 000000000
u+1148 00000536 00000009 00001000 000000000
u+1148 00000542 00000009 00001000 000000000
u+1148 00000536 00000009 00001000 000000000
u+1148 00000552 00000009 00001000 000000000
u+1148 000005BF 00000009 00001000 000000000
Does this mean anything to you?
Thanks,
Keith
-
Re: Dying processes (inetd, cron, syslogd, sshd)
keith@actual-systems.com wrote:
> I've had this machine now tripinto debug mode, but unfortunately I am
> remote from there so I had to have someone else run the commands you
> wanted and give me the response.
>
> >From what I can see it was triggered by an rsync - but I don't know how
> to read the output from 'd %edi' to tell you if it was 0 or negative or
> positive.
>
> ----------------------
> BreakPoint: PID 000005BF /usr/bin/rcmd asa /usr/local/bin/rsync
> --server--
>
> F01D1676 kill+1A : kill+1a
> ---------------------
> Then the output from 'd %edi' was as follows
>
> u+1148 000005BF 00000009 00001000 000000000
>
> Then when scodb was exited it popped straight back in and the output
> for the first part was the same (what triggered it,) but the output
> from 'd %edi' was as follows.
>
> u+1148 0000052A 00000009 00001000 000000000
> u+1148 00000536 00000009 00001000 000000000
> u+1148 00000542 00000009 00001000 000000000
> u+1148 00000536 00000009 00001000 000000000
> u+1148 00000552 00000009 00001000 000000000
> u+1148 000005BF 00000009 00001000 000000000
>
> Does this mean anything to you?
Yes, plenty. Each of those "u+1148" lines represents another trip into
kill(). The first time, we can see that the caller is PID 0x5BF (1471),
running `rcmd asa rsync`. %edi points to the arguments, here "5BF" and
"9": it's sending signal 9 to PID 0x5BF -- which is itself. That seems
odd... is there any chance that your guy at the keyboard mistyped that
one?
The next 6 calls show signal 9 being sent to PIDs 1322, 1334, 1346,
1334, 1362, 1471 (the last one again being the process doing the kills).
If we assume that the first one was a typo, we can say that PID 1471
(rcmd) is killing off a series of other PIDs, ending with itself.
I experimented with `rcmd`, learning that it does finish by killing
itself with signal 9. I didn't figure out how to trigger it into a
killing spree, but there obviously is a way. Since it's killing
unrelated processes, it's definitely a bug of some sort.
I'm curious about what's causing `rcmd` to malfunction that way, but
the main thing is to fix the problem on your system. To do that, let's
just eliminate the use of `rcmd`.
Find the script that's doing `rcmd asa rsync`; change it to use a
different remote command executor. For instance, change it to use `ssh
asa rsync`. If `rcmd` is adequate then you obviously don't feel the
need for the level of security offered by `ssh`, but the point here is
just to use a different tool that doesn't have this bug. You can
configure `ssh` in ways that allow passwordless remote execution. (It's
possible that `rsync` itself is directly invoking `rcmd`; in which case
you can use `rsync --rsh=/usr/bin/ssh` or something like that.)
>Bela<
-
Re: Dying processes (inetd, cron, syslogd, sshd)
Not a problem - I will do that today.
I started using rcmd as using ssh seemed to cause this problem to
happen more often! (and as both machines are on the same LAN I wasn't
worried about security.) As you said, it just a case of changing the
rsync command that is used.
Where you think there is a typo that is probably the case as the
details were transferred from screen to paper to email.
Thanks again,
Keith