nfs high load results in server lockup - NFS
nfs high load results in server lockup
Hi everyone.
I have an NFS file server that serves a mail spool and the users' home
directories.
Every month or so the load on the file server climbs to 18 and stays
there, and lately this has been happening more often. When it does, the
mail server that has the mail spool mounted locks up, and the load on
the file server never comes down. If I issue a reboot command on the
file server, I get the 'server rebooting' message on the console but
nothing happens. I have tried stopping the nfs service, but that fails
too. I have to power cycle the machine (i.e. switch it off and on via
the power button) to regain control of the file server, and this
results in an fsck on boot. There should be no network bottleneck, as
all these machines are connected via a private gigabit ethernet. I have
checked the output of ifconfig on all the machines and none of the
interfaces are reporting errors, so I assume that the NICs are all
fine.
There are four machines mounting NFS shares from the file server:
1. mail server
2. web server
3. backup server
4. another machine
This is the exports file on the file server:
/home 192.168.1.2(rw,no_root_squash,sync,no_wdelay)
/home 192.168.1.3(rw,no_root_squash,sync,no_wdelay)
/home 192.168.1.4(rw,no_root_squash,sync,no_wdelay)
/home 192.168.1.5(rw,no_root_squash,sync,no_wdelay)
/data1/mail 192.168.1.2(rw,no_root_squash,sync,no_wdelay)
/data1/mail 192.168.1.8(rw,no_root_squash,sync,no_wdelay)
/data1/mail 192.168.1.3(rw,no_root_squash,sync,no_wdelay)
/data1/mail 192.168.1.4(rw,no_root_squash,sync,no_wdelay)
/data1/mail 192.168.1.5(rw,no_root_squash,sync,no_wdelay)
/home is 36G
/data1/mail is 30G
These are the relevant lines in /etc/fstab on the mail server:
fs1:/home /home nfs wsize=8192,rsize=8192,defaults,intr,tcp 0 0
fs1:/data1/mail /var/spool/mail nfs wsize=8192,rsize=8192,defaults,intr,tcp 0 0
I recently added the options 'wsize=8192,rsize=8192', but they have not
helped.
Any help/pointers will be much appreciated.
Thanks,
Faisal
-
Re: nfs high load results in server lockup
Some more info that I left out earlier:
I'm running a 2.4.21-20.EL.c0smp kernel on the file server and a
2.4.21-32.0.1.ELsmp kernel on the mail server.
The file server dmesg log always shows this:
svc: unknown version (1)
svc: unknown version (1)
svc: unknown version (1)
svc: unknown version (1)
svc: unknown version (1)
svc: unknown version (1)
svc: unknown version (1)
svc: unknown version (1)
svc: unknown version (1)
svc: unknown version (1)
I've searched for this message, and the most common explanation is
having a Sun machine as an NFS client. We do have three Netras, but
they are not mounting anything from this file server.
Some more info from the file server:
[root@fs1 root]# netstat -s
Ip:
5707180 total packets received
0 forwarded
0 incoming packets discarded
5706315 incoming packets delivered
8589197 requests sent out
Icmp:
23 ICMP messages received
0 input ICMP message failed.
ICMP input histogram:
destination unreachable: 22
echo requests: 1
23 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 22
echo replies: 1
Tcp:
14 active connections openings
1114 passive connection openings
0 failed connection attempts
10 connection resets received
14 connections established
5705110 segments received
8587162 segments send out
522 segments retransmited
0 bad segments received.
825 resets sent
Udp:
1999 packets received
0 packets to unknown port received.
0 packet receive errors
1999 packets sent
TcpExt:
122 packets pruned from receive queue because of socket buffer overrun
ArpFilter: 0
162 TCP sockets finished time wait in fast timer
20660 delayed acks sent
50 delayed acks further delayed because of locked socket
Quick ack mode was activated 2 times
37926 packets directly queued to recvmsg prequeue.
21175 packets directly received from backlog
1814173 packets directly received from prequeue
1614897 packets header predicted
27706 packets header predicted and directly queued to user
TCPPureAcks: 19263
TCPHPAcks: 4445016
TCPRenoRecovery: 0
TCPSackRecovery: 491
TCPSACKReneging: 0
TCPFACKReorder: 0
TCPSACKReorder: 0
TCPRenoReorder: 0
TCPTSReorder: 0
TCPFullUndo: 0
TCPPartialUndo: 0
TCPDSACKUndo: 0
TCPLossUndo: 4
TCPLoss: 37
TCPLostRetransmit: 0
TCPRenoFailures: 0
TCPSackFailures: 1
TCPLossFailures: 0
TCPFastRetrans: 469
TCPForwardRetrans: 25
TCPSlowStartRetrans: 0
TCPTimeouts: 7
TCPRenoRecoveryFail: 0
TCPSackRecoveryFail: 0
TCPSchedulerFailed: 0
TCPRcvCollapsed: 7734
TCPDSACKOldSent: 2
TCPDSACKOfoSent: 0
TCPDSACKRecv: 4
TCPDSACKOfoRecv: 0
TCPAbortOnSyn: 0
TCPAbortOnData: 1
TCPAbortOnClose: 1
TCPAbortOnMemory: 0
TCPAbortOnTimeout: 3
TCPAbortOnLinger: 0
TCPAbortFailed: 0
TCPMemoryPressures: 0
[root@fs1 root]# nfsstat -rc
Warning: No Client Stats (/proc/net/rpc/nfs: No such file or directory).
[root@fs1 root]# vmstat -n 1
procs         memory          swap     io      system       cpu
r  b swpd  free  buff   cache si so   bi   bo   in   cs us sy wa id
0 16 3424 17424 25172 1908364  0  1  583   48  321 1315  8 12 61 19
0 14 3424 17404 25160 1908388  0  0 4400 1100 2731 3293  0  3 97  0
0  1 3424 17296 25064 1908592  0 16 3696 1520 2465 2454  0  3 93  5
0 15 3424 21804 25100 1904096  0  4 4068 2036 2360 3501  0  3 94  3
1  1 3424 20712 25128 1905160  0  4 2336 1100 1604 1486  0  1 86 13
0 13 3424 18356 25180 1907464  0  0 4768  444 2009 1895  0  3 97  0
0 15 3424 17344 25200 1908464  0  0 5404  208 2343 2210  0  4 96  0
0 16 3424 17128 25216 1908688  0  0 3548  196 2045 1872  0  2 98  0
0 17 3424 17192 25264 1908584  0  0 1968  308 1217  777  0  4 96  0
0 14 3424 17212 25344 1908500  0  0 4008  336 2267 1586  0  1 99  0
0 13 3424 17320 25380 1908356  0  0 6824  400 3511 3198  0  6 94  0
0 14 3424 17296 25428 1908356  0  8 7580  500 3863 3377  0  6 94  0
1 15 3424 17144 25500 1908444  0  8 6572  228 3387 2167  0  3 97  0
0 17 3424 17228 25484 1908376  0  0 6620  252 3114 2467  0  4 96  0
0 13 3424 17212 25516 1908368  0 36 7488  316 3455 2590  0  7 93  0
0 10 3424 17376 25572 1908140  0 16 4932 1032 2899 2466  0  3 92  5
0 10 3424 17348 25584 1908172  0 36 4160  556 2137 1979  0  3 96  0
0 16 3424 17304 25588 1908216  0 16 6500  248 2989 2402  0  2 98  0
0 17 3424 17376 25592 1908140  0  0 6328  520 2975 2437  0  6 94  0
0 14 3424 17312 25640 1908148  0  0 4144  256 2514 2043  0  1 99  0
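In case it helps anyone who wants to summarise vmstat samples like these rather than read them by eye, here is a minimal Python sketch. It assumes each sample occupies a single line of 16 numeric fields in the column order shown in the vmstat header, and a reasonably recent Python 3, which this file server does not necessarily have. The names parse_vmstat and summarize are made up for illustration.

```python
from statistics import mean

# Column order taken from the vmstat header in this post (assumed fixed).
COLUMNS = ["r", "b", "swpd", "free", "buff", "cache", "si", "so",
           "bi", "bo", "in", "cs", "us", "sy", "wa", "id"]


def parse_vmstat(text):
    """Return one dict per sample row; header and malformed lines are skipped."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue
        if not all(field.isdigit() for field in fields):
            continue  # header row such as "r b swpd free ..."
        samples.append(dict(zip(COLUMNS, map(int, fields))))
    return samples


def summarize(samples):
    """Summarise blocked-process counts (b) and I/O wait (wa) over all samples."""
    if not samples:
        raise ValueError("no vmstat samples found")
    return {
        "samples": len(samples),
        "avg_blocked": mean(s["b"] for s in samples),
        "max_blocked": max(s["b"] for s in samples),
        "avg_iowait": mean(s["wa"] for s in samples),
    }
```

Calling summarize(parse_vmstat(text)) on the pasted output gives the average and peak of the b column (processes in uninterruptible sleep) next to the average wa percentage, which are the two columns that stand out in this capture.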
[root@fs1 root]# nfsstat
Server rpc stats:
calls badcalls badauth badclnt xdrcall
1838665 9 9 0 0
Server nfs v3:
null getattr setattr lookup access readlink
0 0% 95186 5% 7046 0% 39988 2% 63079 3% 6 0%
read write create mkdir symlink mknod
1585800 86% 35233 1% 2008 0% 1 0% 0 0% 0 0%
remove rmdir rename link readdir readdirplus
3775 0% 0 0% 2 0% 1795 0% 120 0% 433 0%
fsstat fsinfo pathconf commit
34 0% 6 0% 0 0% 1867 0%
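For comparing the operation mix between captures, a small Python sketch along these lines could pull the counts out of the v3 table. It assumes the input is only the alternating name and value rows shown above (without the "Server nfs v3:" line) and a reasonably recent Python 3. parse_nfsstat_ops and top_ops are hypothetical helper names, not part of nfsstat.

```python
def parse_nfsstat_ops(text):
    """Map operation name -> call count from alternating name/value rows."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    counts = {}
    for names, values in zip(rows[0::2], rows[1::2]):
        # A value row looks like "1585800 86% 35233 1% ...": keep the counts
        # and drop the percentage that follows each of them.
        numbers = [int(v) for v in values if not v.endswith("%")]
        counts.update(zip(names, numbers))
    return counts


def top_ops(counts, n=3):
    """Return the n busiest operations as (name, count, percent of total)."""
    total = sum(counts.values())
    if total == 0:
        return []
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(name, count, round(100.0 * count / total, 1))
            for name, count in ranked[:n]]
```

Run against the table above, the busiest operation comes out as read, which matches the 86% that nfsstat itself reports.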
[root@fs1 root]# nfsstat -r
Server rpc stats:
calls badcalls badauth badclnt xdrcall
1931116 9 9 0 0
10:38:58 up 1:31, 4 users, load average: 16.79, 16.41, 14.36
83 processes: 82 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total    0.0%    0.0%   12.4%   0.8%     1.6%  384.0%    0.0%
           cpu00    0.0%    0.0%    1.9%   0.9%     1.9%   95.0%    0.0%
           cpu01    0.0%    0.0%    2.9%   0.0%     0.0%   97.0%    0.0%
           cpu02    0.0%    0.0%    4.9%   0.0%     0.0%   95.0%    0.0%
           cpu03    0.0%    0.0%    2.9%   0.0%     0.0%   97.0%    0.0%
Mem:  2055460k av, 2038136k used,   17324k free,       0k shrd,   25748k buff
                   1371528k actv,  441804k in_d,   30564k in_c
Swap: 4192880k av,    3412k used, 4189468k free                 1907392k cached
 PID USER PRI NI SIZE  RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
2107 root  15  0    0    0     0 DW    2.9  0.0 0:15   1 nfsd
2106 root  15  0    0    0     0 DW    1.9  0.0 0:16   1 nfsd
2097 root  15  0    0    0     0 DW    0.9  0.0 0:15   1 nfsd
2098 root  15  0    0    0     0 DW    0.9  0.0 0:14   0 nfsd
2099 root  15  0    0    0     0 DW    0.9  0.0 0:15   2 nfsd
2102 root  15  0    0    0     0 DW    0.9  0.0 0:15   3 nfsd
2103 root  15  0    0    0     0 DW    0.9  0.0 0:15   3 nfsd
2108 root  15  0    0    0     0 DW    0.9  0.0 0:15   0 nfsd
2110 root  15  0    0    0     0 DW    0.9  0.0 0:15   2 nfsd
3929 root  20  0 1136 1136   904 R     0.9  0.0 0:00   2 top
   1 root  15  0  516  516   456 S     0.0  0.0 0:05   2 init
   2 root  RT  0    0    0     0 SW    0.0  0.0 0:00   0 migration/0
   3 root  RT  0    0    0     0 SW    0.0  0.0 0:00   1 migration/1
   4 root  RT  0    0    0     0 SW    0.0  0.0 0:00   2 migration/2
   5 root  RT  0    0    0     0 SW    0.0  0.0 0:00   3 migration/3
   6 root  15  0    0    0     0 SW    0.0  0.0 0:00   3 keventd
   7 root  34 19    0    0     0 SWN   0.0  0.0 0:00   0 ksoftirqd/0
   8 root  34 19    0    0     0 SWN   0.0  0.0 0:00   1 ksoftirqd/1
   9 root  34 19    0    0     0 SWN   0.0  0.0 0:00   2 ksoftirqd/2
  10 root  34 19    0    0     0 SWN   0.0  0.0 0:00   3 ksoftirqd/3
  13 root  25  0    0    0     0 SW    0.0  0.0 0:00   2 bdflush
  11 root  15  0    0    0     0 SW    0.0  0.0 0:06   0 kswapd
  12 root  15  0    0    0     0 SW    0.0  0.0 0:05   3 kscand
  14 root  15  0    0    0     0 SW    0.0  0.0 0:00   0 kupdated
  15 root  25  0    0    0     0 SW    0.0  0.0 0:00   2 mdrecoveryd
  22 root  15  0    0    0     0 SW    0.0  0.0 0:00   0 ahd_dv_0
  23 root  15  0    0    0     0 SW    0.0  0.0 0:00   3 ahd_dv_1
  24 root  25  0    0    0     0 SW    0.0  0.0 0:00   0 scsi_eh_0
  25 root  25  0    0    0     0 SW    0.0  0.0 0:00   3 scsi_eh_1
  27 root  21  0    0    0     0 SW    0.0  0.0 0:00   0 scsi_eh_2
  28 root  21  0    0    0     0 SW    0.0  0.0 0:00   0 aacraid
  30 root  21  0    0    0     0 SW    0.0  0.0 0:00   2 scsi_eh_2
  33 root  15  0    0    0     0 SW    0.0  0.0 0:00   3 kjournald
  87 root  25  0    0    0     0 SW    0.0  0.0 0:00   3 khubd
1623 root  18  0    0    0     0 SW    0.0  0.0 0:00   0 kjournald
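All of the nfsd threads in the listing above are in state D (uninterruptible sleep), and on Linux tasks in that state count towards the load average. To catch which tasks are stuck the next time this happens, something like the following Python sketch could be run from cron. It is only a sketch: it assumes a Python 3 interpreter is available on the box, it reads the standard /proc/<pid>/stat files, and parse_stat and blocked_tasks are names invented here.

```python
import os


def parse_stat(line):
    """Split the start of a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """Return sorted (pid, comm) pairs for every process in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the line was unexpected
        if state == "D":
            found.append((pid, comm))
    return sorted(found)
```

Logging the result of blocked_tasks() every minute or so would show whether the stuck tasks are only nfsd threads or whether kjournald and other disk-related threads are blocked as well.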