Got a call this morning: my client's printers would not print and she
had an error message on the screen (not recorded, as it turned out not
to be relevant).

    SCO OpenServer Release 6.0.0 (ver 6.0.0Ni)
    OSS703A - X Upgrade for OpenServer 6.0.0 Maintenance Pack 2 (ver 1.
    OSS706C - OpenServer 6 Maintenance Pack 2 Supplement (ver 1.0.0)
    SCO OpenServer Release 6.0.0 Maintenance Pack 2 (ver 1.0.0Dy)

I was unable to log in via ssh. I did manage to log in via telnet (I have
got to get that port closed) and found that the root file system was out
of space:


/ : Disk space: 4.29 MB of 9000.99 MB available ( 0.05%).
/stand : Disk space: 33.74 MB of 39.99 MB available (84.37%).
/u : Disk space: 6646.34 MB of 9000.99 MB available (73.84%).
/tmp : Disk space: 97.69 MB of 126.62 MB available (77.16%).

The above is from my nightly backup script log, generated at 23:00 last
night. The free space at the time of the call was lower still, but not zero.
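Those nightly figures come from my backup script; the idea is nothing more than a cron job appending a timestamped disk-space snapshot to a log. A minimal sketch of that idea (this uses the portable df -k rather than SCO's dfspace, and the log path here is my own stand-in, not the script from the post):

```shell
#!/bin/sh
# Append a timestamped disk-space snapshot to a log file, so there is
# always a "last known good" set of free-space numbers to compare against.
LOG=/tmp/space.log    # hypothetical log location

log_space() {
    echo "==== `date` ====" >> "$LOG"
    df -k / /tmp >> "$LOG" 2>&1
}

log_space
tail -5 "$LOG"
```

Run from cron at 23:00 (e.g. "0 23 * * * /usr/local/bin/log_space") it gives exactly the kind of baseline that made the 4.29 MB figure above jump out.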

From syslog:

May 17 17:11:21 unix1 ntpd[1487]: time reset 0.152811 s
May 17 17:11:21 unix1 ntpd[1487]: synchronisation lost
May 17 18:05:25 unix1
May 17 18:05:25 unix1 NOTICE: msgcnt 1 vxfs: mesg 001: vx_nospace - /dev/root fi
May 17 18:05:25 unix1
May 17 18:09:00 unix1
May 17 18:09:00 unix1 NOTICE: msgcnt 2 vxfs: mesg 001: vx_nospace - /dev/root fi
.....
May 18 01:12:52 unix1 NOTICE: msg
May 18 01:12:52 unix1 NOTICE: msgcnt 25 vxfs: mesg 001: vx_nospace - /dev/root f
May 18 01:12:52 unix1
May 18 01:22:11 unix1 NOTICE: msg
May 18 01:22:11 unix1 NOTICE: msgcnt 26 vxfs: mesg 001: vx_nospace - /dev/root f
May 18 08:41:18 unix1 sco_pmd[986]: PMD started - PID 987
May 18 08:41:24 unix1 device address vec dma comment
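Pulling just the out-of-space notices from the log is a one-line grep. A self-contained demo against a couple of sample lines (on the real box you would point it at the system log itself, /usr/adm/syslog on SCO):

```shell
#!/bin/sh
# Extract and count the vxfs out-of-space notices.
# For a self-contained demo, write two sample lines to a temp file;
# on the real system, grep the actual syslog instead.
SYSLOG=/tmp/syslog.sample
cat > "$SYSLOG" <<'EOF'
May 17 18:05:25 unix1 NOTICE: msgcnt 1 vxfs: mesg 001: vx_nospace - /dev/root file system full
May 18 08:41:18 unix1 sco_pmd[986]: PMD started - PID 987
EOF

grep vx_nospace "$SYSLOG"        # the notices themselves
grep -c vx_nospace "$SYSLOG"     # how many times we hit the wall
```

The msgcnt field in the real log (1 through 26 here) tells the same story as the count: the filesystem kept reporting full from 18:05 until after 01:22.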

I see in the /usr/lib/edge/lists/simple_job directory that the backup completed
and the system ran out of space on the root during the verify pass:

-rw-r--r-- 1 root sys 821248 May 18 00:08 .debug
-rw-r--r-- 1 root sys 12206080 May 18 00:08 verify_master.log
-rw-r--r-- 1 root sys 0 May 17 23:48 changedfiles_master.log
-rw-r--r-- 1 root sys 3785 May 17 23:48 edge_progress.log
-rw-r--r-- 1 root sys 24759784 May 17 23:48 backup_master.log

It's amazing that BackupEDGE ran and created a backup while the root file
system was out of space. And also amazing that anyone could still log in
with the system out of disk space!

I found the following BIG files:

1132 ./var/adm/sa/sa10
1372 ./var/adm/sa/sa11
1400 ./var/adm/sa/sa14
1360 ./var/adm/sa/sa15
1312 ./var/adm/sa/sa16
1372 ./var/adm/sa/sa17
1579088 ./var/adm/sa/sar11
3315792 ./var/adm/sa/sar14
3315792 ./var/adm/sa/sar15
3315792 ./var/adm/sa/sar16
2054096 ./var/adm/sa/sar17
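The hunt above can be done directly with find, restricted to the root filesystem so it does not wander into /u. A self-contained sketch (it plants its own test files in a temp dir; on the real box the command in the comment is the one to run, remembering that -size counts 512-byte blocks by default):

```shell
#!/bin/sh
# Find oversized regular files. Self-contained demo: plant one big and
# one small file, then show that only the big one trips the size test.
DIR=/tmp/bigfile.demo
mkdir -p "$DIR"
dd if=/dev/zero of="$DIR/big"   bs=1024 count=2048 2>/dev/null  # ~2 MB
dd if=/dev/zero of="$DIR/small" bs=1024 count=1    2>/dev/null

# On the real system this would be something like:
#   find / -mount -type f -size +200000 -exec ls -l {} \;
# (+200000 512-byte blocks = roughly 100 MB; -mount keeps find on /)
find "$DIR" -type f -size +2000 -exec ls -s {} \;
```

With four sarXX files at 0.8-1.7 GB each, any threshold in that neighborhood would have flagged them immediately.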

After deleting sar14 (chosen because it was the second of the large files;
I preserved the first, sar11, in case it contains a hint as to the problem):

# rm sar14
# dfspace
/ : Disk space: 1569.35 MB of 9000.99 MB available (17.44%).
/stand : Disk space: 33.74 MB of 39.99 MB available (84.37%).
/u : Disk space: 6630.27 MB of 9000.99 MB available (73.66%).
/tmp : Disk space: 126.57 MB of 126.62 MB available (99.96%).
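One caveat when reclaiming space this way: if some process still holds the file open, rm removes the name but the blocks are not freed until the process exits. Truncating the file in place gives the space back immediately. A generic sketch of the technique (not something from this incident; here sa2 had long since died, so rm worked fine):

```shell
#!/bin/sh
# Truncate a runaway log in place rather than removing it, so the
# blocks come back even if a writer still has the file open.
F=/tmp/runaway.log
dd if=/dev/zero of="$F" bs=1024 count=512 2>/dev/null   # stand-in for sar14

: > "$F"        # truncate to zero length, keeping the same inode
ls -l "$F"      # size should now read 0
```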


# pwd
/var/adm/sa
# ls -lt
total 10274316
-rw------- 1 sys sys 71204 May 18 09:37 sa18
-rw------- 1 sys sys 700420 May 17 23:00 sa17
-rw------- 1 root sys 1051682816 May 17 18:05 sar17
-rw------- 1 sys sys 670320 May 16 23:00 sa16
-rw------- 1 root sys 1697677291 May 16 18:05 sar16
-rw------- 1 sys sys 695220 May 15 23:00 sa15
-rw------- 1 root sys 1697677296 May 15 18:05 sar15
-rw------- 1 sys sys 708420 May 14 23:00 sa14
-rw------- 1 sys sys 322020 May 13 23:00 sa13
-rw------- 1 sys sys 322320 May 12 23:00 sa12
-rw------- 1 sys sys 702020 May 11 23:00 sa11
-rw------- 1 root sys 808484856 May 11 18:05 sar11
-rw-r--r-- 1 sys sys 571080 May 10 23:00 sa10
-rw------- 1 root sys 82570 May 10 18:05 sar10


So it's not the sar data collector running amok, but sa2, the activity
reporting tool that cron runs at 18:05:

5 18 * * 1-5 /usr/lib/sa/sa2 -s 8:00 -e 18:01 -i 1200 -A

and that creates the sarXX files.
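For context, on most System V derivatives the pair works like this: sa1, run every 20 minutes or so from the sys crontab, appends binary samples to /var/adm/sa/saXX, and sa2 is essentially a wrapper that runs sar -f against that day's saXX file and writes the ASCII report to sarXX. A sketch of the typical crontab entries and the roughly equivalent manual command (the sa1 schedule and the exact sa2 internals are assumptions on my part, not taken from this system):

```shell
# Typical sys crontab entries for the sar pair:
0,20,40 * * * *  /usr/lib/sa/sa1                                # collect binary saXX
5 18 * * 1-5     /usr/lib/sa/sa2 -s 8:00 -e 18:01 -i 1200 -A    # format report to sarXX

# Roughly what sa2 does by hand, for day 11:
#   sar -s 8:00 -e 18:01 -i 1200 -A -f /var/adm/sa/sa11 > /var/adm/sa/sar11
```

That is consistent with the timestamps above: the saXX inputs are normal-sized and close at 23:00, while the bloated sarXX outputs all carry the 18:05 timestamp of the sa2 run.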

The end of the hd dump of sar11:

5380 20 20 20 20 20 20 20 0a 41 76 65 72 61 67 65 20 .Average
5390 20 20 20 20 20 20 31 2e 34 20 20 20 20 20 20 20 1.4
53a0 20 30 20 20 20 20 20 32 2e 30 20 20 20 20 20 20 0 2.0
53b0 20 30 20 20 20 20 20 20 20 20 20 20 20 20 20 20 0
53c0 20 20 0a 0a 30 38 3a 30 30 3a 30 30 20 20 70 72 ..08:00:00 pr
53d0 6f 63 2d 73 7a 20 20 20 20 20 66 61 69 6c 20 20 oc-sz fail
53e0 20 6c 77 70 20 20 20 66 61 69 6c 20 20 69 6e 6f lwp fail ino
53f0 64 2d 73 7a 20 20 20 20 20 66 61 69 6c 20 20 66 d-sz fail f
5400 69 6c 65 20 20 20 66 61 69 6c 20 20 6c 6f 63 6b ile fail lock
5410 0a 30 38 3a 32 30 3a 30 30 20 20 20 31 30 31 2f .08:20:00 101/
5420 39 38 39 39 20 20 20 20 20 20 30 20 20 20 32 39 9899 0 29
5430 38 20 20 20 20 20 20 30 20 20 20 20 20 20 20 20 8 0
5440 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
*
30307ff0 20 20 20 20 20 20 20 20
30307ff8

So whatever is processing the saXX files into sarXX files is having
problems: after the 08:20:00 sample the file is nothing but spaces,
padded out to some 770 MB.

I ran sar_enable -n to turn off sar until I resolve this issue.

I called the client back, had her shut the system down and reboot to
maintenance mode, and ran fsck -ofull.

fsck checked /u but reported the following for root:

Fsck: Cannot determine file system type of /dev/root

Executing fsck /dev/root ran OK and listed "log replay in progress"
and "replay complete - marking superblock as clean."

Executing fsck -ofull /dev/root ran OK as well.

But a bare "fsck -ofull" still checks /dev/u while reporting that it
cannot determine the file system type of /dev/root.
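On the fsck oddity: when run without an explicit device, fsck decides each filesystem's type from the filesystem table, so it may be worth checking that the root entry there still says vxfs; passing the type explicitly with -F also sidesteps the guess. A sketch of that check (the fstyp= keyword layout in the sample and the SVR4-style -F flag are assumptions on my part for OSv6, so verify against the manual pages first):

```shell
#!/bin/sh
# Check what filesystem type the tables claim for root. If the entry is
# missing or wrong, that would explain "Cannot determine file system type".
# With the type given explicitly, fsck need not guess:
#   fsck -F vxfs -ofull /dev/root    # SVR4-style invocation; an assumption
#
# Self-contained demo of the check against a sample filesys-style table:
TAB=/tmp/filesys.sample
cat > "$TAB" <<'EOF'
bdev=/dev/root cdev=/dev/rroot mountdir=/ fstyp=vxfs fsck=yes
bdev=/dev/u    cdev=/dev/ru    mountdir=/u fstyp=vxfs fsck=yes
EOF

grep 'mountdir=/ ' "$TAB"        # the root entry; fstyp should say vxfs
```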

Any comments or suggestions on what happened to sar? What's my next
step?

--

Steve Fabac
S.M. Fabac & Associates
816/765-1670