Thread: Network not responding on idle SCO 5.0.5 system.

  1. Network not responding on idle SCO 5.0.5 system.

    I've a client with two SCO 5.0.5 boxes, one is the live
    server and the second is a hot spare. The servers have
    separate SCO 5.0.5 Enterprise licenses and 25-user
    licenses add-on.

    Both servers were rebooted due to inability to access the
    live server or the backup server via telnet on 8/1 (while
    I was on vacation). On 8/4 I was informed that the problem
    occurred again but as I was in the car between locations,
    I was unable to assist with diagnosing the problem.

    I advised the client to reboot the live system and leave the
    backup server alone until I returned to my office to attempt to
    connect to the backup server remotely.

    Both servers are 5.0.5 and fully patched with the latest
    patchck version:

    > Gathering patch information... Please wait...
    >
    > INSTALLED currently on failover.XXXX.com
    > --------------------------------------------------------------------
    > oss471e oss471e - OpenServer Supplement oss471e
    > oss497c Core OS supplement for 5.0.5
    > oss600a Year 2000 Supplement for 5.0.5
    > oss640a Bind supplement for 5.0.5
    > oss642a Cron supplement for 5.0.5
    > oss646c Processor supplement for 5.0.5
    > oss663a oss663a - OpenServer Supplement oss663a
    > rs505a Release Supplement for OSR5.0.5
    > system is up-to-date as of July 7, 2008


    When I returned to the office, I first logged in via ssh from
    my office to the live system, then tried to ssh from the live
    system to the backup system. ssh exited with the message
    "connection refused."

    Running "rcmd failover ps -ef" also failed (error message not saved).

    I called the client and had him try to stop and restart sshd, but
    found that prngd was not running either. I had him remove the
    /usr/local/var/prngd/prngd.lock file, restart prngd, then restart
    sshd. Once they were up, I was able to ssh to the backup system
    from the live system.
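    The recovery amounted to clearing prngd's stale lock and restarting
    the daemons. A minimal sketch of that logic, assuming the paths shown
    in this thread; a temp directory stands in for /usr/local/var/prngd
    so the stale-lock test can be exercised without touching a real box:

```shell
# Stand-in for /usr/local/var/prngd so nothing real is touched.
PRNGD_DIR=$(mktemp -d)
LOCK="$PRNGD_DIR/prngd.lock"
touch "$LOCK"   # simulate the lock file a dead prngd left behind

# If the lock exists but no prngd process is running, the lock is
# stale: remove it so prngd can start cleanly, then restart sshd.
# ([p]rngd keeps the grep from matching itself.)
if [ -f "$LOCK" ] && ! ps -ef | grep '[p]rngd' >/dev/null; then
    rm -f "$LOCK"
    echo "stale lock removed; now restart prngd, then sshd:"
    echo "  /usr/local/sbin/prngd /usr/local/var/prngd/prngd-pool"
    echo "  /usr/local/sbin/sshd"
fi
```

    The restart commands are left as echoes since they only make sense
    on the SCO box itself.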

    $ w
    8:45pm up 3 days, 2:26, 3 users, load average: 0.00, 0.00, 0.00
    User Tty Login@ Idle JCPU PCPU What
    root tty01 8:27pm 1 - - /bin/ksh
    root ttyp0 5:55pm - - - w
    smf ttyp0 8:45pm - - - w
    $ sar 5 5

    SCO_SV failover 3.2v5.0.5 i80386 08/04/2008

    20:46:00 %usr %sys %wio %idle (-u)
    20:46:05 0 0 0 100
    20:46:10 0 0 0 100
    20:46:15 0 0 0 100
    20:46:20 0 0 0 100
    20:46:25 0 0 0 100

    Average 0 0 0 100

    Running netstat -a I saw:

    Active Internet connections (including servers)
    Proto Recv-Q Send-Q Local Address Foreign Address (state)
    tcp 0 0 failover.29392 treal.1727 ESTABLISHED
    tcp 0 0 *.29392 *.* LISTEN
    tcp 0 0 *.scohelp *.* LISTEN
    udp 0 0 *.488 *.*
    udp 0 0 *.* *.*
    udp 0 0 *.syslog *.*
    Active UNIX domain sockets
    Address Type Recv-Q Send-Q Conn Addr
    fcfa4f78 dgram 0 0 fcfa4828
    fcfa69a0 stream 0 0 0 /usr/local/var/prngd/prngd-pool
    fcfa6910 stream 0 0 fcfa6880
    fcfa6880 stream 0 0 fcfa6910
    fcfa4318 stream 0 0 0 /usr/tmp/scohelp.socket
    fcfa4168 stream 0 0 fcfa41f8
    fcfa41f8 stream 0 0 fcfa4168
    fcfa4288 stream 0 0 fcfa45e8
    fcfa4558 stream 0 0 0 /pmd/PMDCT_pipe
    fcfa45e8 stream 0 0 fcfa4288
    fcfa4678 stream 0 0 0 /pmd/LST_pipe
    fcfa4708 stream 0 0 0 /pm

    Executing "telnet localhost" returned the "connection refused" message:

    > $ telnet localhost
    > Trying 127.0.0.1...
    > telnet: Unable to connect to remote host: Connection refused


    > ps -ef | pg:
    >
    > UID PID PPID C STIME TTY TIME CMD
    > root 0 0 0 Aug-01 ? 00:00:00 sched
    > root 1 0 0 Aug-01 ? 00:00:04 /etc/init -a
    > root 2 0 0 Aug-01 ? 00:00:00 vhand
    > root 3 0 0 Aug-01 ? 00:00:48 bdflush
    > root 4 0 0 Aug-01 ? 00:00:00 kmdaemon
    > root 5 1 0 Aug-01 ? 00:00:01 htepi_daemon /
    > root 6 0 0 Aug-01 ? 00:00:00 strd
    > root 10245 1 0 17:55:56 tty01 00:00:00 /bin/login root
    > root 43 1 0 Aug-01 ? 00:00:00 /etc/syslogd
    > root 47 1 0 Aug-01 ? 00:00:00 /etc/ifor_pmd
    > root 48 47 0 Aug-01 ? 00:00:01 /etc/ifor_pmd
    > root 36 1 0 Aug-01 ? 00:00:00 htepi_daemon /stand
    > root 10247 1 0 17:55:56 tty02 00:00:00 /etc/getty tty02 sc_m
    > root 10249 1 0 17:55:56 tty03 00:00:00 /etc/getty tty03 sc_m
    > root 56 48 0 Aug-01 ? 00:00:00 /etc/sco_cpd
    > root 57 48 0 Aug-01 ? 00:00:01 /etc/ifor_sld
    > root 10258 1 0 17:55:56 tty11 00:00:00 /etc/getty tty11 sc_m
    > root 10251 1 0 17:55:56 tty04 00:00:00 /etc/getty tty04 sc_m
    > root 10252 1 0 17:55:56 tty05 00:00:00 /etc/getty tty05 sc_m
    > root 676 1 0 Aug-01 ? 00:00:00 /var/scohttp/scohttpd -d /var/scohttp
    > root 10253 1 0 17:55:56 ? 00:00:00 /tcb/files/no_luid/sdd
    > root 10267 10245 0 20:27:55 tty01 00:00:00 /bin/ksh



    The last time the system was rebooted was 8/1 at 18:20, and /usr/adm/syslog shows
    that the network was up and running:

    Aug 1 18:20:17 failover syslogd: restart
    Aug 1 18:20:33 failover snmpd[356]: Agent started (pid 356)
    Aug 1 18:20:34 failover xntpd[359]: xntpd 3-5.92d Mon Jul 27 19:20:01 PDT 1998 (1)
    Aug 1 18:20:34 failover xntpd[359]: tickadj = 2500, tick = 10000, tvu_maxslew = 250000, est. hz = 100
    Aug 1 18:20:34 failover xntpd[359]: precision = 10000 usec
    Aug 1 18:20:34 failover xntpd[359]: read drift of -177.576 from /etc/driftfile
    Aug 1 18:20:34 failover sendmail[394]: starting daemon (8.8.8): SMTP+queueing@01:00:00
    Aug 1 18:20:34 failover upsd[398]: *** PowerChute PLUS Version 4.2.2 Started ***
    > Aug 1 18:20:38 failover prngd[414]: prngd 0.9.23 (17 August 2001) started up for user prngd
    > Aug 1 18:20:38 failover prngd[414]: have 7 out of 110 filedescriptors open
    > Aug 1 18:20:47 failover sshd[710]: Server listening on 0.0.0.0 port 29392.


    So sshd started OK on 8/1 at 18:20.

    Aug 1 18:20:48 failover upsd[398]: Communication established
    Aug 1 18:24:51 failover xntpd[359]: synchronized to LOCAL(1), stratum=3
    Aug 1 18:25:24 failover xntpd[359]: synchronized to 129.6.15.28, stratum=1
    Aug 1 23:11:45 failover xntpd[359]: synchronized to 192.5.41.40, stratum=1
    Aug 2 03:01:36 failover xntpd[359]: time reset (step) 0.163968 s
    Aug 2 03:01:36 failover xntpd[359]: synchronisation lost
    Aug 2 03:07:07 failover xntpd[359]: synchronized to LOCAL(1), stratum=3
    Aug 2 03:08:22 failover xntpd[359]: synchronized to 192.5.41.40, stratum=1


    The following "last" output, also from this morning, shows that someone was
    able to telnet to the machine at 11:30 on 8/4, before the problem occurred:

    # last | pg
    User Line Device PID Login time Elapsed Time Comments
    root p0 ttyp0 11636 Tue Aug 5 09:56 00:12 logged in
    root co tty01 11522 Tue Aug 5 08:38 00:00
    kevin2 p0 ttyp0 10732 Mon Aug 4 21:48 00:00
    smf typ0 ttyp0 10680 Mon Aug 4 21:43 00:00
    root p0 ttyp0 10636 Mon Aug 4 21:35 00:08
    smf p0 ttyp0 10605 Mon Aug 4 21:27 00:05 <-- I logged in via telnet
    root p1 ttyp1 10587 Mon Aug 4 21:25 00:00 <-- Client tested telnet login
    smf typ0 ttyp0 10423 Mon Aug 4 21:07 00:19
    smf typ0 ttyp0 10350 Mon Aug 4 20:45 00:15 <-- I logged in via ssh
    root co tty01 10245 Mon Aug 4 20:27 12:10 <-- Client logged in and restarted SSH
    root p0 ttyp0 10233 Mon Aug 4 17:55 02:49 ??
    root co tty01 5376 Mon Aug 4 17:53 00:02
    root p0 ttyp0 6823 Mon Aug 4 11:30 00:14 <-- Telnet was running on 8/4 before
    root co tty01 726 Mon Aug 4 08:46 00:00 the live system locked up.

    After executing "telinit Q" (no change) and then "telinit 2" (no change),
    I ran "/etc/tcp start", after which I was able to telnet into the system.
    netstat -a then showed:

    Active Internet connections (including servers)
    Proto Recv-Q Send-Q Local Address Foreign Address (state)
    tcp 0 4 failover.telnet treal.1752 ESTABLISHED
    tcp 0 0 *.printer *.* LISTEN
    tcp 0 0 *.time *.* LISTEN
    tcp 0 0 *.daytime *.* LISTEN
    tcp 0 0 *.chargen *.* LISTEN
    tcp 0 0 *.discard *.* LISTEN
    tcp 0 0 *.echo *.* LISTEN
    tcp 0 0 *.tcpmux *.* LISTEN
    tcp 0 0 *.shell *.* LISTEN
    tcp 0 0 *.ktelnet *.* LISTEN
    tcp 0 0 *.telnet *.* LISTEN
    tcp 0 0 *.smux *.* LISTEN
    tcp 0 0 *.29392 *.* LISTEN
    tcp 0 0 *.scohelp *.* LISTEN
    udp 0 0 localhost.ntp *.*
    udp 0 0 failover.ntp *.*
    udp 0 0 *.ntp *.*
    udp 0 0 *.time *.*
    udp 0 0 *.daytime *.*
    udp 0 0 *.chargen *.*
    udp 0 0 *.discard *.*
    udp 0 0 *.echo *.*
    udp 0 0 *.echo *.*
    udp 0 0 *.ntalk *.*
    udp 0 0 *.biff *.*
    udp 0 0 *.1269 *.*
    udp 0 0 *.snmp *.*
    udp 0 0 *.route *.*
    udp 0 0 *.488 *.*
    udp 0 0 *.* *.*
    udp 0 0 *.syslog *.*
    Active UNIX domain sockets
    Address Type Recv-Q Send-Q Conn Addr
    fcfa5200 stream 0 0 fcfa5170
    fcfa5170 stream 0 0 fcfa5200
    fcfa8f50 stream 0 0 0 /dev/printer
    fcfa7060 dgram 0 0 fcfa4828
    fcfa6448 dgram 0 0 fcfa4828
    fcfa7be8 dgram 0 0 fcfa4828
    fcfa4f78 dgram 0 0 fcfa4828
    fcfa69a0 stream 0 0 0 /usr/local/var/prngd/prngd-pool
    fcfa6910 stream 0 0 fcfa6880
    fcfa6880 stream 0 0 fcfa6910
    fcfa4318 stream 0 0 0 /usr/tmp/scohelp.socket
    fcfa4168 stream 0 0 fcfa41f8
    fcfa41f8 stream 0 0 fcfa4168
    fcfa4288 stream 0 0 fcfa45e8
    fcfa4558 stream 0 0 0 /pmd/PMDCT_pipe
    fcfa45e8 stream 0 0 fcfa4288
    fcfa4678 stream 0 0 0 /pmd/LST_pipe
    fcfa4708 stream 0 0 0 /pmd/IPCCT_pipe
    fcfa4798 dgram 0 0 fcfa4828
    fcfa4828 dgram 0 0 0 /dev/syslog

    And ps -ef shows:

    smf 10630 10569 2 21:32:15 ? 00:00:00 rshd
    root 195 1 0 Aug-01 ? 00:00:01 htepi_daemon /usr1
    root 199 1 0 Aug-01 ? 00:00:09 htepi_daemon /usr2
    root 203 1 0 Aug-01 ? 00:00:00 htepi_daemon /usr3
    root 207 1 0 Aug-01 ? 00:00:00 htepi_daemon /util
    root 10260 1 0 17:55:56 tty12 00:00:00 /etc/getty tty12 sc_m
    smf 10631 10630 0 21:32:15 ? 00:00:00 ps -ef
    root 10347 1 0 20:44:50 ? 00:00:00 /usr/local/sbin/sshd
    prngd 10319 1 0 20:43:44 ? 00:00:00 /usr/local/sbin/prngd /usr/local/var/prngd/prngd-pool
    > root 10573 1 0 21:24:28 ? 00:00:00 xntpd
    > root 10576 1 0 21:24:28 ? 00:00:00 /usr/lib/lpd
    > root 10565 1 0 21:24:30 ? 00:00:00 routed
    > root 10566 1 0 21:24:30 ? 00:00:00 /etc/snmpd
    > root 10569 1 2 21:24:30 ? 00:00:00 /etc/inetd


    The above shows that /etc/tcp start ran at 21:24 when I manually executed the command.

    root 425 1 0 Aug-01 ? 00:00:00 htepi_daemon /ramtmp
    root 736 1 0 Aug-01 tty06 00:00:00 /etc/getty tty06 sc_m
    root 738 1 0 Aug-01 tty07 00:00:00 /etc/getty tty07 sc_m
    root 740 1 0 Aug-01 tty08 00:00:00 /etc/getty tty08 sc_m
    root 742 1 0 Aug-01 tty09 00:00:00 /etc/getty tty09 sc_m
    root 744 1 0 Aug-01 tty10 00:00:00 /etc/getty tty10 sc_m

    Just for grins, I executed /etc/tcp start a second time to see what
    would happen, and saw in /usr/adm/syslog:

    Aug 4 21:44:13 failover inetd[10710]: telnet/tcp: bind: Address already in use
    Aug 4 21:44:13 failover inetd[10710]: ktelnet/tcp: bind: Address already in use
    Aug 4 21:44:13 failover inetd[10710]: shell/tcp: bind: Address already in use
    Aug 4 21:44:13 failover inetd[10710]: comsat/udp: bind: Address already in use
    Aug 4 21:44:13 failover inetd[10710]: ntalk/udp: bind: Address already in use
    Aug 4 21:44:13 failover inetd[10710]: tcpmux/tcp: bind: Address already in use
    Aug 4 21:44:13 failover inetd[10710]: echo/tcp: bind: Address already in use
    Aug 4 21:44:13 failover inetd[10710]: discard/tcp: bind: Address already in use
    Aug 4 21:44:13 failover inetd[10710]: chargen/tcp: bind: Address already in use
    Aug 4 21:44:13 failover inetd[10710]: daytime/tcp: bind: Address already in use
    Aug 4 21:44:13 failover inetd[10710]: time/tcp: bind: Address already in use
    Aug 4 21:44:13 failover inetd[10710]: echo/udp: bind: Address already in use
    Aug 4 21:44:13 failover inetd[10710]: discard/udp: bind: Address already in use
    Aug 4 21:44:13 failover inetd[10710]: chargen/udp: bind: Address already in use
    Aug 4 21:44:13 failover inetd[10710]: daytime/udp: bind: Address already in use
    Aug 4 21:44:13 failover inetd[10710]: time/udp: bind: Address already in use
    Aug 4 21:44:13 failover snmpd[10712]: start_snmpd_server: couldn't bind to req. address


    So for some reason, inetd had died and network functions stopped until I ran
    /etc/tcp start the first time at 21:24.

    What I'm looking for is pointers on how to find out why inetd died (or was killed).
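    Short of finding the cause, a cron-driven check could at least record
    when inetd disappears. A sketch, with the process-table test factored
    out so it can be exercised here against captured ps lines; the real
    script would run "ps -ef" live and follow up with /etc/tcp start or a
    page to the admin:

```shell
# check_inetd: succeed if the given `ps -ef` listing contains inetd.
check_inetd() {
    printf '%s\n' "$1" | grep -q '/etc/inetd$'
}

# Sample lines captured from the listings earlier in this thread.
with_inetd='root 10569 1 2 21:24:30 ? 00:00:00 /etc/inetd'
without_inetd='root 10245 1 0 17:55:56 tty01 00:00:00 /bin/login root'

check_inetd "$with_inetd"    && echo "inetd running"
check_inetd "$without_inetd" || echo "inetd gone - log it, then /etc/tcp start"
```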

    Since both the live and backup systems were affected at the same time on 8/4
    (telnet connections not working), and the live system had to be rebooted to
    get the warehouse workers back to work before I could log in to it:

    root 21890 21774 2 18:12:32 tty01 00:00:00 /etc/shutdown

    I was unable to determine whether "/etc/tcp start" would have worked on the
    live system.

    If "/etc/tcp start" had been tried on the live system and had worked,
    would it be safe to keep using the system without rebooting? Or might
    the system be in some unstable state that would make rebooting as soon
    as possible advisable?

    Netstat -m on 8/4 at 20:?? showed

    # netstat -m
    streams allocation:
    config alloc free total max fail
    stream 3072 30 3042 12302 112 0
    queues 566 60 506 24613 231 0
    mblks 1636 1551 85 19751160 1570 0
    buffer headers 1850 1774 76 19369 1777 0
    class 1, 64 bytes 64 1 63 2060906 9 0
    class 2, 128 bytes 64 0 64 8297616 45 0
    class 3, 256 bytes 48 0 48 993278 37 0
    class 4, 512 bytes 16 0 16 5429 10 0
    class 5, 1024 bytes 28 0 28 2347 25 0
    class 6, 2048 bytes 1546 1544 2 8351153 1546 0
    class 7, 4096 bytes 0 0 0 0 0 0
    class 8, 8192 bytes 1 0 1 17 1 0
    class 9, 16384 bytes 1 0 1 13 1 0
    class 10, 32768 bytes 0 0 0 0 0 0
    class 11, 65536 bytes 0 0 0 0 0 0
    class 12, 131072 bytes 0 0 0 0 0 0
    class 13, 262144 bytes 0 0 0 0 0 0
    class 14, 524288 bytes 0 0 0 0 0 0
    total configured streams memory: 20000.00KB
    streams memory in use: 3157.88KB
    maximum streams memory used: 3247.93KB
    #

    netstat -m on the system today shows:

    # netstat -m
    streams allocation:
    config alloc free total max fail
    stream 3072 65 3007 14881 112 0
    queues 566 133 433 29775 231 0
    mblks 1978 1799 179 26455515 1844 0
    buffer headers 2106 2038 68 24332 2041 0
    class 1, 64 bytes 64 1 63 2767816 12 0
    class 2, 128 bytes 64 0 64 11112903 50 0
    class 3, 256 bytes 64 0 64 1337334 58 0
    class 4, 512 bytes 16 0 16 5547 12 0
    class 5, 1024 bytes 28 0 28 2893 25 0
    class 6, 2048 bytes 1768 1766 2 11173618 1768 0
    class 7, 4096 bytes 0 0 0 0 0 0
    class 8, 8192 bytes 1 0 1 19 1 0
    class 9, 16384 bytes 1 0 1 14 1 0
    class 10, 32768 bytes 0 0 0 0 0 0
    class 11, 65536 bytes 0 0 0 0 0 0
    class 12, 131072 bytes 0 0 0 0 0 0
    class 13, 262144 bytes 0 0 0 0 0 0
    class 14, 524288 bytes 0 0 0 0 0 0
    total configured streams memory: 20000.00KB
    streams memory in use: 3618.72KB
    maximum streams memory used: 3709.54KB
    #

    So network traffic (mirroring data from the live to the backup system) is working
    as expected.
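    Logging these numbers over time (as done later in the thread with a
    five-minute cron job) reduces to pulling one line out of "netstat -m".
    A sketch of the extraction, run here against a captured sample rather
    than live output:

```shell
# Captured fragment of the `netstat -m` output shown above.
sample='total configured streams memory: 20000.00KB
streams memory in use: 3618.72KB
maximum streams memory used: 3709.54KB'

# Pull the current in-use figure; a cron job would timestamp this
# and append it to a log file instead of echoing it.
in_use=$(printf '%s\n' "$sample" | awk -F': ' '/streams memory in use/ {print $2}')
echo "$(date) streams memory in use: $in_use"
```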

    Any comments, suggestions, or advice you can provide will be welcome.

    --
    Steve Fabac
    S.M. Fabac & Associates
    816/765-1670

  2. Re: Network not responding on idle SCO 5.0.5 system.

    Steve M. Fabac, Jr. wrote:
    > I've a client with two SCO 5.0.5 boxes, one is the live
    > server and the second is a hot spare. The servers have
    > separate SCO 5.0.5 Enterprise licenses and 25-user
    > licenses add-on.
    >
    > Both servers were rebooted due to inability to access the
    > live server or the backup server via telnet on 8/1 (while
    > I was on vacation). On 8/4 I was informed that the problem
    > occurred again but as I was in the car between locations,
    > I was unable to assist with diagnosing the problem.
    >
    > I advised the client to reboot the live system and leave the
    > backup server alone until I returned to my office to attempt to
    > connect to the backup server remotely.
    >
    > Both servers are 5.0.5 and fully patched with the latest
    > patchck version:
    >
    >> [patch list snipped]

    >
    > When I returned to the office, I tried to log in via ssh from the
    > live system to the backup system after first logging in via SSH from
    > my office to the live system. Ssh exited with the message
    > "connection refused."
    >
    > Rcmd failover ps -ef also failed. (error message not saved)
    >


    Have you looked in syslog for evidence of hacking attempts?

    I've seen OS5 systems lose spooling and net functions after heavy
    automated login attempts from the internet - usually from our 'friends'
    working from Korean ISP's (at least on the West coast).

    I assume you have mapped SSH to some other port than 22?

    I don't see anything that pops out as a problem (other than, of course,
    prngd and sshd just not running).

    Possibly NIC's going flaky and re-transmitting like crazy. Ditto a bad
    port on your router causing flooding of the network.

    Squirrels/rats finding cable runs in the ceiling edible?

    Some neatness freak of doubtful intelligence straightening up cables by
    using a powered stapler to tack cables to plywood?

    The above actually happened to a client of a company I consult for - I
    was sent to Chicago from the West coast to trouble shoot an intermittent
    problem with garbage shooting across the screen (dumb terminals, back in
    the day).

    Some of the staples actually penetrated the wires, and when heavy trucks
    went by, the plywood flexed enough to make/break contact.

    Expensive lesson for the client needless to say.


    --
    ----------------------------------------------------
    Pat Welch, UBB Computer Services, a WCS Affiliate
    SCO Authorized Partner
    Microlite BackupEdge Certified Reseller
    Unix/Linux/Windows/Hardware Sales/Support
    (209) 745-1401 Cell: (209) 251-9120
    E-mail: patubb@inreach.com
    ----------------------------------------------------

  3. Re: Network not responding on idle SCO 5.0.5 system.

    Pat Welch wrote:
    > Steve M. Fabac, Jr. wrote:
    >> I've a client with two SCO 5.0.5 boxes, one is the live
    >> server and the second is a hot spare. The servers have
    >> separate SCO 5.0.5 Enterprise licenses and 25-user
    >> licenses add-on.
    >>
    >> Both servers were rebooted due to inability to access the
    >> live server or the backup server via telnet on 8/1 (while
    >> I was on vacation). On 8/4 I was informed that the problem
    >> occurred again but as I was in the car between locations,
    >> I was unable to assist with diagnosing the problem.
    >>
    >> I advised the client to reboot the live system and leave the
    >> backup server alone until I returned to my office to attempt to
    >> connect to the backup server remotely.
    >>
    >> Both servers are 5.0.5 and fully patched with the latest
    >> patchck version:
    >>
    >>> [patch list snipped]

    >>
    >> When I returned to the office, I tried to log in via ssh from the
    >> live system to the backup system after first logging in via SSH from
    >> my office to the live system. Ssh exited with the message
    >> "connection refused."
    >>
    >> Rcmd failover ps -ef also failed. (error message not saved)
    >>

    >
    > Have you looked in syslog for evidence of hacking attempts?


    Yes; none in evidence since the change to sshd.conf in April.


    >
    > I've seen OS5 systems lose spooling and net functions after heavy
    > automated login attempts from the internet - usually from our 'friends'
    > working from Korean ISP's (at least on the West coast).
    >
    > I assume you have mapped SSH to some other port than 22?


    Aug 1 18:20:47 failover sshd[710]: Server listening on 0.0.0.0 port 29392.

    >
    > I don't see anything that pops out as a problem (other than, of course,
    > prngd and sshd just not running).


    Further information: the live system went incommunicado after my post.
    I had the client restart prngd and sshd, and I logged in to the system
    before having them reboot. The live system showed the same type of
    netstat -a listing: only the sshd processes, scohelp, and something on
    port 488 were listening:

    Active Internet connections (including servers)
    Proto Recv-Q Send-Q Local Address Foreign Address (state)
    tcp 0 48 vet.29392 adsl-65-64-102-9.4957 ESTABLISHED
    tcp 0 0 *.29392 *.* LISTEN
    tcp 0 0 *.scohelp *.* LISTEN
    udp 0 0 *.488 *.*
    Active UNIX domain sockets
    Address Type Recv-Q Send-Q Conn Addr
    fcfa6b50 stream 0 0 0 /usr/local/var/prngd/prngd-pool
    fcfa6ac0 stream 0 0 fcfa6a30

    Executing /etc/tcp start turned on all the usual LISTEN services and
    the users were once again able to login to the system without rebooting:

    Active Internet connections (including servers)
    Proto Recv-Q Send-Q Local Address Foreign Address (state)
    tcp 0 0 *.printer *.* LISTEN
    tcp 0 0 *.smux *.* LISTEN
    tcp 0 0 *.imap *.* LISTEN
    tcp 0 0 *.pop3 *.* LISTEN
    tcp 0 0 *.time *.* LISTEN
    tcp 0 0 *.daytime *.* LISTEN
    tcp 0 0 *.chargen *.* LISTEN
    tcp 0 0 *.discard *.* LISTEN
    tcp 0 0 *.echo *.* LISTEN
    tcp 0 0 *.tcpmux *.* LISTEN
    tcp 0 0 *.shell *.* LISTEN
    tcp 0 0 *.telnet *.* LISTEN
    tcp 0 0 *.ftp *.* LISTEN
    tcp 0 48 vet.29392 adsl-65-64-102-9.4957 ESTABLISHED
    tcp 0 0 *.29392 *.* LISTEN
    tcp 0 0 *.scohelp *.* LISTEN
    udp 0 0 localhost.ntp *.*
    udp 0 0 vet.ntp *.*
    udp 0 0 vetreal.ntp *.*
    udp 0 0 *.ntp *.*
    udp 0 0 *.2086 *.*
    udp 0 0 *.snmp *.*
    udp 0 0 *.time *.*
    udp 0 0 *.daytime *.*
    udp 0 0 *.chargen *.*
    udp 0 0 *.discard *.*
    udp 0 0 *.echo *.*
    udp 0 0 *.ntalk *.*
    udp 0 0 *.biff *.*
    udp 0 0 *.tftp *.*
    udp 0 0 *.route *.*
    udp 0 0 *.488 *.*
    Active UNIX domain sockets
    Address Type Recv-Q Send-Q Conn Addr
    fcfa8920 stream 0 0 0 /dev/printer

    However, I soon got a call when they were unable to print.

    I logged back in and had to run /usr/lib/lpsched to restart the print
    spooler.

    For grins, I enabled process accounting; when I tried to start accounting,
    I was informed that cron was not running, so I had to restart cron as well:
    > crontab: cron may not be running - call your system administrator: No such device or address (error 6)
    > Accounting is now enabled for use
    > To start accounting, run: /usr/lib/acct/startup
    > # ps -ef | grep crontab
    > # ps -ef | grep cron
    > #
    > # /etc/rc2.d/P75cron start
    > # ! *** cron started *** pid = 3155 Tue Aug 5 13:37:39 2008



    I also noted that syslogd was not running. When I investigated, I found that
    syslogd had not started when the system was rebooted on 8/1 or on 8/4. After
    I managed to start syslogd manually, /usr/adm/syslog was updated with the
    boot-up information generated on 8/1, but logged as occurring at the time I
    manually executed /etc/syslogd:

    > # tail -f /usr/adm/syslog
    > Aug 1 16:43:02 treal ftpd[3298]: #2 open of pid file failed: No such file or directory
    > Aug 1 16:44:37 treal ftpd[3389]: #2 open of pid file failed: No such file or directory
    > Aug 1 16:44:38 treal ftpd[3390]: #2 open of pid file failed: No such file or directory
    > Aug 1 16:46:25 treal ftpd[3456]: #2 open of pid file failed: No such file or directory
    > Aug 1 16:46:25 treal ftpd[3457]: #2 open of pid file failed: No such file or directory
    > Aug 1 16:51:25 treal lockd[440]: term_nlm():
    > Aug 1 16:51:25 treal lockd[440]: nlm lock server died! exiting.
    > Aug 1 16:56:07 treal TLW param1=-1
    > Fri Aug 1 16:56:07 CDT 2008 reboot initated
    > Mon Aug 4 18:19:18 CDT 2008 shutdown initiated
    >
    > Wow! did not restart syslogd Friday or Monday reboot!!


    Strange: the "Mon Aug 4 18:19:18" entry above had to have been written to
    /usr/adm/syslog in real time, since it existed even though syslogd had not
    been running since the reboot on 8/1.

    After I manually restarted /etc/syslogd:

    Aug 1 16:51:25 treal lockd[440]: term_nlm():
    Aug 1 16:51:25 treal lockd[440]: nlm lock server died! exiting.
    Aug 1 16:56:07 treal TLW param1=-1
    Fri Aug 1 16:56:07 CDT 2008 reboot initated
    Mon Aug 4 18:19:18 CDT 2008 shutdown initiated
    Aug 5 14:26:25 treal syslogd: restart
    Aug 5 14:26:25 treal SCO OpenServer(TM) Release 5
    Aug 5 14:26:25 treal
    Aug 5 14:26:25 treal (C) 1976-1998 The Santa Cruz Operation, Inc.
    Aug 5 14:26:25 treal (C) 1980-1994 Microsoft Corporation
    Aug 5 14:26:25 treal All rights reserved.
    Aug 5 14:26:25 treal
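    Given that inetd, cron, lpsched, and syslogd all turned up dead on these
    boxes, a quick sweep of the process table for the expected daemons would
    shorten the diagnosis. A sketch, exercised against a captured ps listing;
    the daemon paths follow the listings in this thread except /etc/cron and
    /usr/lib/lpsched, which are assumptions:

```shell
# Two lines captured from the thread's ps output; syslogd and inetd
# are present, cron and lpsched are not.
ps_out='root 43 1 0 Aug-01 ? 00:00:00 /etc/syslogd
root 10569 1 2 21:24:30 ? 00:00:00 /etc/inetd'

# Collect every expected daemon that does not appear in the listing.
missing=""
for d in /etc/syslogd /etc/inetd /etc/cron /usr/lib/lpsched; do
    printf '%s\n' "$ps_out" | grep -q "$d" || missing="$missing $d"
done
echo "missing daemons:$missing"
```

    A live version would feed "ps -ef" in and log (or alert on) the missing list.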


    >
    > Possibly NIC's going flaky and re-transmitting like crazy. Ditto a bad
    > port on your router causing flooding of the network.


    We've been fighting a stream leak since March 2008 as shown in the
    data I log every five minutes from cron:

    Tue Mar 25 16:10:03 CDT 2008
    Tue Mar 25 16:10:17 CDT 2008 streams memory in use: 1700.73KB
    Tue Mar 25 16:15:00 CDT 2008 streams memory in use: 1706.34KB
    Tue Mar 25 16:20:00 CDT 2008 streams memory in use: 1708.18KB
    Tue Mar 25 16:25:00 CDT 2008 streams memory in use: 1717.02KB
    Tue Mar 25 16:30:00 CDT 2008 streams memory in use: 1721.02KB
    ....
    Wed Mar 26 02:55:00 CDT 2008 streams memory in use: 2119.16KB
    Wed Mar 26 03:00:00 CDT 2008 streams memory in use: 2121.18KB
    Wed Mar 26 03:05:01 CDT 2008 streams memory in use: 4016.57KB
    Wed Mar 26 03:10:00 CDT 2008 streams memory in use: 4016.57KB

    The cpio backup to the failover machine kicks off at 03:00.

    End of day:
    Tue Mar 25 23:55:00 CDT 2008 streams memory in use: 1443.95KB
    Wed Mar 26 23:55:00 CDT 2008 streams memory in use: 1450.17KB
    Thu Mar 27 23:55:00 CDT 2008 streams memory in use: 1445.48KB
    Fri Mar 28 23:55:00 CDT 2008 streams memory in use: 4533.59KB
    Sat Mar 29 23:55:00 CDT 2008 streams memory in use: 2013.02KB
    Sun Mar 30 23:55:00 CDT 2008 streams memory in use: 4869.45KB
    Mon Mar 31 23:55:00 CDT 2008 streams memory in use: 2027.55KB
    Tue Apr 1 23:55:00 CDT 2008 streams memory in use: 5061.66KB
    Wed Apr 2 23:55:00 CDT 2008 streams memory in use: 2181.63KB
    Thu Apr 3 23:05:00 CDT 2008 streams memory in use: 5895.07KB
    Fri Apr 4 23:55:00 CDT 2008 streams memory in use: 6900.07KB
    Sat Apr 5 23:55:00 CDT 2008 streams memory in use: 7244.71KB
    Sun Apr 6 23:55:00 CDT 2008 streams memory in use: 7425.77KB
    Mon Apr 7 23:55:00 CDT 2008 streams memory in use: 8831.78KB
    Tue Apr 8 23:50:00 CDT 2008 streams memory in use: 9934.26KB
    Wed Apr 9 23:55:01 CDT 2008 streams memory in use: 11022.41KB
    Thu Apr 10 23:55:00 CDT 2008 streams memory in use: 13063.88KB
    Fri Apr 11 23:55:00 CDT 2008 streams memory in use: 14749.66KB
    Sat Apr 12 23:55:00 CDT 2008 streams memory in use: 15181.69KB
    Sun Apr 13 23:55:00 CDT 2008 streams memory in use: 15583.30KB
    Mon Apr 14 23:55:00 CDT 2008 streams memory in use: 16522.23KB
    Tue Apr 15 23:55:00 CDT 2008 streams memory in use: 17334.53KB
    Wed Apr 16 23:55:00 CDT 2008 streams memory in use: 18487.81KB
    Thu Apr 17 16:10:00 CDT 2008 streams memory in use: 19391.11KB
    System rebooted
    Thu Apr 17 16:20:00 CDT 2008 streams memory in use: 1380.50KB
    ....
    Thu Jul 24 23:55:00 CDT 2008 streams memory in use: 9092.21KB
    Fri Jul 25 23:55:00 CDT 2008 streams memory in use: 9964.78KB
    Sat Jul 26 15:45:00 CDT 2008 streams memory in use: 10975.49KB
    System rebooted
    Sat Jul 26 15:50:00 CDT 2008 streams memory in use: 1406.35KB
    Sat Jul 26 23:55:01 CDT 2008 streams memory in use: 1474.53KB
    Sun Jul 27 23:55:00 CDT 2008 streams memory in use: 4086.12KB
    Mon Jul 28 23:55:00 CDT 2008 streams memory in use: 5621.95KB
    Tue Jul 29 23:55:00 CDT 2008 streams memory in use: 6879.91KB
    Wed Jul 30 23:55:00 CDT 2008 streams memory in use: 8781.20KB
    Thu Jul 31 23:55:00 CDT 2008 streams memory in use: 9941.51KB
    Fri Aug 1 23:05:00 CDT 2008 streams memory in use: 4020.41KB
    Fri Aug 1 16:00:00 CDT 2008 streams memory in use: 13261.82KB
    System rebooted
    Fri Aug 1 16:15:00 CDT 2008 streams memory in use: 1388.23KB
    Sat Aug 2 23:55:00 CDT 2008 streams memory in use: 4325.13KB
    Sun Aug 3 23:55:00 CDT 2008 streams memory in use: 4785.69KB
    Mon Aug 4 23:55:00 CDT 2008 streams memory in use: 4092.80KB
    Tue Aug 5 04:30:00 CDT 2008 streams memory in use: 4097.25KB

    System went down hard at 18:00 on 8/5. Both disks of RAID1
    dead. Replaced with single remaining spare disk and restored
    nightly backup from 03:15 8/5.

    Tue Aug 5 22:25:01 CDT 2008 streams memory in use: 1376.86KB

    System back up at 22:?? on 8/5.

    >
    > Squirrels/rats finding cable runs in the ceiling edible?
    >
    > Some neatness freak of doubtful intelligence straightening up cables by
    > using a powered stapler to tack cables to plywood?
    >
    > The above actually happened to a client of a company I consult for - I
    > was sent to Chicago from the West coast to trouble shoot an intermittent
    > problem with garbage shooting across the screen (dumb terminals, back in
    > the day).
    >
    > some of the staples actually penetrated the wires, and when heavy trucks
    > went by the plywood flexed enough to make/break contact.
    >
    > Expensive lesson for the client needless to say.
    >
    >


    Well, not only did the two SCO 5.0.5 boxes lose both RAID1 disks: a Dell
    box running MS SBS also lost both disks of its RAID1, and an IBM box lost
    2 of 5 disks. The Windows support technician decided that the RAID
    controller in the IBM box was bad as well, since more disks went off-line
    as he worked to restore it.

    I got the live SCO 5.0.5 box back up by restoring the nightly backup and the
    14:15 differential backup from the Buffalo NAS server where BackupEdge
    writes its backup files. With only one unused 146G disk remaining on the
    shelf, I did not restore the failover machine, so the live server is running
    on one disk without a RAID1 mirror.

    The customer has problems with building power to the server room, and the
    five APC UPS units (two of them new as of 6/26, on the SCO 5.0.5 servers)
    did not prevent the problem. Damn, Damn, Damn.

    --
    Steve Fabac
    S.M. Fabac & Associates
    816/765-1670

  4. Re: Network not responding on idle SCO 5.0.5 system.

    > The customer has problems with building power to the server room and the
    > 5 APC UPS (2 new units as of 6/26 on the SCO 5.0.5 servers) did not
    > prevent the problem. Damn, Damn, Damn.


    Not all APC are equal.
    Some do power conditioning (Smart-UPS or better), some do not (Back-UPS).

    --
    Brian K. White brian@aljex.com http://www.myspace.com/KEYofR
    +++++[>+++[>+++++>+++++++<<-]<-]>>+.>.+++++.+++++++.-.[>+<---]>++.
    filePro BBx Linux SCO FreeBSD #callahans Satriani Filk!


  5. Re: Network not responding on idle SCO 5.0.5 system.

    Steve M. Fabac, Jr. wrote:

    > I've a client with two SCO 5.0.5 boxes, one is the live
    > server and the second is a hot spare. The servers have
    > separate SCO 5.0.5 Enterprise licenses and 25-user
    > licenses add-on.
    >
    > Both servers were rebooted due to inability to access the
    > live server or the backup server via telnet on 8/1 (while
    > I was on vacation). On 8/4 I was informed that the problem
    > occurred again but as I was in the car between locations,
    > I was unable to assist with diagnosing the problem.


    With information in this and later posts, this sounds like a case of:



    This was a problem in which `prngd` was mistakenly killing most
    processes in the system. Read that and make sure `prngd` is updated
    past the bad versions I mentioned there. (The same problem could happen
    with other software running as root; `prngd` was the culprit in the one
    case that was fully chased down.)

    >Bela<


  6. Re: Network not responding on idle SCO 5.0.5 system.

    Bela Lubkin wrote:
    > Steve M. Fabac, Jr. wrote:
    >
    >> I've a client with two SCO 5.0.5 boxes, one is the live
    >> server and the second is a hot spare. The servers have
    >> separate SCO 5.0.5 Enterprise licenses and 25-user
    >> licenses add-on.
    >>
    >> Both servers were rebooted due to inability to access the
    >> live server or the backup server via telnet on 8/1 (while
    >> I was on vacation). On 8/4 I was informed that the problem
    >> occurred again but as I was in the car between locations,
    >> I was unable to assist with diagnosing the problem.

    >
    > With information in this and later posts, this sounds like a case of:
    >
    >
    >


    The article referenced indicates that the versions Bela found with the bug were
    up to 0.9.7, and notes that the latest, error-free release is 0.9.29.

    The client's system is running the SCO version 0.9.23, which I hope is patched
    and not a problem.


    || Netscape Communicator (ver 4.0.5b) * |
    || OpenSSH - Secure Shell remote access utilities (ver 3.4p1) * |
    || PRNGD - Pseudo Random Number Generator Daemon (ver 0.9.23) * |
    || SCO OpenServer Enterprise System (ver 5.0.5m)


    Checking ftp://ftp2.sco.com/pub/skunkware/osr5/vols/

    I see prngd-0.9.23-VOLS.tar and prngd-0.9.6-VOLS.tar

    I trust that 0.9.23 is later than 0.9.6.
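    It is, though plain string comparison would get it wrong, since "2"
    sorts before "6"; comparing the dotted components numerically settles
    it. A minimal sketch:

```python
# String comparison mis-orders dotted versions: "0.9.23" < "0.9.6"
# lexically, even though 23 > 6.  Compare numeric tuples instead.
def ver(s):
    return tuple(int(p) for p in s.split("."))

assert "0.9.23" < "0.9.6"            # misleading string order
assert ver("0.9.23") > ver("0.9.6")  # correct numeric order
print(ver("0.9.23") > ver("0.9.7"))  # also newer than the last buggy version
```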

    I have seen problems on systems where password aging is enforced and prngd
    has been locked out because its password has expired, but that is not
    the case on the client's system:

    prngd {pw_name prngd} {pw_uid 254} {loginGroup prngd} {pw_gid 101}
        {pw_dir /usr/local/etc/prngd} {pw_shell /bin/sh} {groups {}}
        {groupsForLogins {}} {auditFlags {0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0}}
        {mode 16877} {noPassword 0} {comment {PRNGD Pseudo Account}}
        {passwdSuccessfulChangeTime 1063812765} {lastSuccessfulLoginTime 1063812765}
        {administrativeLockApplied 0} {passwdMinChangeTime 0}
        {passwdExpirationTime 0} {passwdLifetime 0} {maxLoginAttempts 99}
        {passwdUnsuccessfulChangeTime 1218061023} {lastSuccessfulLoginTty {}}
        {lastSuccessfulLogoutTty {}} {lastSuccessfulLogoutTime {}}
        {lastUnsuccessfulLoginTime 0} {lastUnsuccessfulLoginTty {}}
        {unsuccessfulLoginAttempts {}} {passwdGeneratedLength 8}
        {passwdChooseOwn 1} {passwdRunGenerator 1} {passwdCheckedForObviousness 0}
        {passwdNullAllowed 1} {userType general} {owner {}} {nice 0}
        {passwdUser prngd} {suOnly 0} {privs {execsuid chmodsugid chown}}
        {auths {mem terminal audittrail su query space printqueue}}
        {defaultAttributes {nice privs auths passwdMinChangeTime
            passwdGeneratedLength passwdExpirationTime passwdLifetime
            passwdChooseOwn passwdRunGenerator passwdCheckedForObviousness
            passwdNullAllowed maxLoginAttempts}}
        {defaultedAttributes {nice privs auths passwdMinChangeTime
            passwdGeneratedLength passwdExpirationTime passwdLifetime
            passwdChooseOwn passwdRunGenerator passwdCheckedForObviousness
            passwdNullAllowed maxLoginAttempts}}
        {distributed 0} {isASUUser 0}
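    The aging fields in that dump can also be checked mechanically:
    passwdExpirationTime and passwdLifetime of 0 mean aging is disabled
    for the account, so an expired password cannot be the lockout cause.
    A small sketch that pulls the simple {key value} pairs out of a
    truncated copy of the dump above:

```python
import re

# A truncated sample of the SCO account-attribute dump, limited to the
# fields relevant to password aging.
record = ("prngd {pw_name prngd} {pw_uid 254} {passwdExpirationTime 0} "
          "{passwdLifetime 0} {administrativeLockApplied 0}")

def parse_attrs(rec):
    """Extract the flat {key value} pairs (nested braces are skipped)."""
    return dict(re.findall(r"\{(\w+) ([^{}]*)\}", rec))

attrs = parse_attrs(record)
# Both aging fields are 0: password aging is disabled for this account.
print(attrs["passwdExpirationTime"], attrs["passwdLifetime"])
```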

    > This was a problem in which `prngd` was mistakenly killing most
    > processes in the system. Read that and make sure `prngd` is updated
    > past the bad versions I mentioned there. (The same problem could happen
    > with other software running as root; `prngd` was the culprit in the one
    > case that was fully chased down.)


    If there is other software on the system susceptible to the same bug, it will
    likely never be tracked down without Bela's experience and knowledge: the
    techniques he mentioned in the article are beyond me.
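    For what it's worth, the failure mode Bela describes comes down to
    POSIX kill() semantics: a pid of -1 broadcasts the signal to every
    process the caller is permitted to signal, and 0 signals the caller's
    whole process group, so a root daemon passing a stale or zeroed pid
    can take down nearly the entire system. A defensive sketch (safe_kill
    is a hypothetical wrapper for illustration, not anything from prngd):

```python
import os
import signal

def safe_kill(pid, sig):
    """Refuse pid values that kill() treats as 'signal many processes'.

    pid -1 broadcasts to all signalable processes, pid 0 hits the
    caller's process group, and pid 1 is init -- none of which a daemon
    should ever signal by accident.  Returns True if the signal was
    sent, False if the pid was refused.
    """
    if pid <= 1:
        return False
    os.kill(pid, sig)
    return True

# Bad pid values are refused instead of wiping out the system;
# signal 0 is the standard harmless existence check.
print(safe_kill(-1, signal.SIGTERM), safe_kill(os.getpid(), 0))
```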

    >
    > >Bela<

    >
    >


    --
    Steve Fabac
    S.M. Fabac & Associates
    816/765-1670

+ Reply to Thread