Thread: nfs high load results in server lockup

  1. nfs high load results in server lockup

    Hi everyone.

    I have an NFS file server that is serving a mail spool and the home
    directories of users.
    What is happening is that every month or so the load on the file server
    climbs to 18 and stays there, and this has been happening more
    frequently lately. At that point, the mail server that has mounted the
    mail spool locks up. The load on the file server never comes down. If I
    issue a reboot command to the file server, I get the 'server rebooting'
    message on the console but nothing happens. I have tried stopping the
    nfs service, but that fails too. I have to do a power cycle (i.e.
    switch off and on via the power button) to get control of the file
    server again, and this results in an fsck sequence on boot. There is no
    network bottleneck, as all these machines are connected via a private
    gigabit ethernet. I have checked the output of ifconfig on all the
    machines and none of the interfaces are reporting errors, so I assume
    the NICs are all fine.
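
    For reference, this is the kind of check I mean (a minimal sketch;
    ethtool only works if the driver supports it, and eth0 stands in for
    whatever the actual interface is):

    # per-interface error/drop counters
    ifconfig eth0 | grep -E 'errors|dropped'
    # driver-level counters, where supported
    ethtool -S eth0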

    There are four machines mounting NFS shares from the file server:
    1. mail server
    2. web server
    3. backup server
    4. another machine

    This is the exports file on the file server:
    /home 192.168.1.2(rw,no_root_squash,sync,no_wdelay)
    /home 192.168.1.3(rw,no_root_squash,sync,no_wdelay)
    /home 192.168.1.4(rw,no_root_squash,sync,no_wdelay)
    /home 192.168.1.5(rw,no_root_squash,sync,no_wdelay)
    /data1/mail 192.168.1.2(rw,no_root_squash,sync,no_wdelay)
    /data1/mail 192.168.1.8(rw,no_root_squash,sync,no_wdelay)
    /data1/mail 192.168.1.3(rw,no_root_squash,sync,no_wdelay)
    /data1/mail 192.168.1.4(rw,no_root_squash,sync,no_wdelay)
    /data1/mail 192.168.1.5(rw,no_root_squash,sync,no_wdelay)


    /home is 36G
    /data1/mail is 30G
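
    In case the export options matter here, the effective flags the kernel
    is actually using can be listed on the server (a sketch; exportfs is
    part of nfs-utils, and its verbose output shows whether sync and
    no_wdelay really applied to each client):

    [root@fs1 root]# exportfs -v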

    These are the relevant lines in /etc/fstab on the mail server:

    fs1:/home /home nfs wsize=8192,rsize=8192,defaults,intr,tcp 0 0
    fs1:/data1/mail /var/spool/mail nfs wsize=8192,rsize=8192,defaults,intr,tcp 0 0

    I recently added the options 'wsize=8192,rsize=8192', but they are not
    helping.
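
    To confirm the new rsize/wsize were actually negotiated rather than
    silently ignored, the live mount options can be checked on the mail
    server (a sketch; nfsstat -m just reads the mounted-filesystem info
    from /proc/mounts, and the hostname in the prompt is a placeholder):

    [root@mail root]# grep ' nfs ' /proc/mounts
    [root@mail root]# nfsstat -m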

    Any help/pointers will be much appreciated.

    Thanks,
    Faisal


  2. Re: nfs high load results in server lockup

    Some more info that I missed earlier:

    I'm running a 2.4.21-20.EL.c0smp kernel on the file server and a
    2.4.21-32.0.1.ELsmp kernel on the mail server.

    The file server dmesg log always shows this:

    svc: unknown version (1)
    svc: unknown version (1)
    svc: unknown version (1)
    svc: unknown version (1)
    svc: unknown version (1)
    svc: unknown version (1)
    svc: unknown version (1)
    svc: unknown version (1)
    svc: unknown version (1)
    svc: unknown version (1)

    I've searched for this message, and the most common explanation for it
    is having a Sun machine as an NFS client. We do have three Netras, but
    they are not mounting off this file server.
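
    A sketch of how this could be narrowed down from fs1 itself (rpcinfo
    and tcpdump are standard tools; eth0 is a placeholder for the private
    interface): rpcinfo lists the program versions the server has
    registered, and tcpdump shows which host keeps sending requests for a
    version the server does not have.

    [root@fs1 root]# rpcinfo -p localhost
    [root@fs1 root]# tcpdump -i eth0 -n port 2049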


    Some more info from the file server:

    [root@fs1 root]# netstat -s
    Ip:
    5707180 total packets received
    0 forwarded
    0 incoming packets discarded
    5706315 incoming packets delivered
    8589197 requests sent out
    Icmp:
    23 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
    destination unreachable: 22
    echo requests: 1
    23 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
    destination unreachable: 22
    echo replies: 1
    Tcp:
    14 active connections openings
    1114 passive connection openings
    0 failed connection attempts
    10 connection resets received
    14 connections established
    5705110 segments received
    8587162 segments send out
    522 segments retransmited
    0 bad segments received.
    825 resets sent
    Udp:
    1999 packets received
    0 packets to unknown port received.
    0 packet receive errors
    1999 packets sent
    TcpExt:
    122 packets pruned from receive queue because of socket buffer overrun
    ArpFilter: 0
    162 TCP sockets finished time wait in fast timer
    20660 delayed acks sent
    50 delayed acks further delayed because of locked socket
    Quick ack mode was activated 2 times
    37926 packets directly queued to recvmsg prequeue.
    21175 packets directly received from backlog
    1814173 packets directly received from prequeue
    1614897 packets header predicted
    27706 packets header predicted and directly queued to user
    TCPPureAcks: 19263
    TCPHPAcks: 4445016
    TCPRenoRecovery: 0
    TCPSackRecovery: 491
    TCPSACKReneging: 0
    TCPFACKReorder: 0
    TCPSACKReorder: 0
    TCPRenoReorder: 0
    TCPTSReorder: 0
    TCPFullUndo: 0
    TCPPartialUndo: 0
    TCPDSACKUndo: 0
    TCPLossUndo: 4
    TCPLoss: 37
    TCPLostRetransmit: 0
    TCPRenoFailures: 0
    TCPSackFailures: 1
    TCPLossFailures: 0
    TCPFastRetrans: 469
    TCPForwardRetrans: 25
    TCPSlowStartRetrans: 0
    TCPTimeouts: 7
    TCPRenoRecoveryFail: 0
    TCPSackRecoveryFail: 0
    TCPSchedulerFailed: 0
    TCPRcvCollapsed: 7734
    TCPDSACKOldSent: 2
    TCPDSACKOfoSent: 0
    TCPDSACKRecv: 4
    TCPDSACKOfoRecv: 0
    TCPAbortOnSyn: 0
    TCPAbortOnData: 1
    TCPAbortOnClose: 1
    TCPAbortOnMemory: 0
    TCPAbortOnTimeout: 3
    TCPAbortOnLinger: 0
    TCPAbortFailed: 0
    TCPMemoryPressures: 0


    [root@fs1 root]# nfsstat -rc
    Warning: No Client Stats (/proc/net/rpc/nfs: No such file or directory).
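
    That warning just means fs1 itself has no NFS client-side counters
    (/proc/net/rpc/nfs is only created by the client code), so the
    client retransmission stats have to be read on the machines doing the
    mounting instead, e.g. on the mail server:

    [root@mail root]# nfsstat -rc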


    [root@fs1 root]# vmstat -n 1
    procs                       memory      swap           io     system        cpu
     r  b   swpd   free   buff    cache   si  so    bi    bo   in    cs us sy wa id
     0 16   3424  17424  25172  1908364    0   1   583    48  321  1315  8 12 61 19
     0 14   3424  17404  25160  1908388    0   0  4400  1100 2731  3293  0  3 97  0
     0  1   3424  17296  25064  1908592    0  16  3696  1520 2465  2454  0  3 93  5
     0 15   3424  21804  25100  1904096    0   4  4068  2036 2360  3501  0  3 94  3
     1  1   3424  20712  25128  1905160    0   4  2336  1100 1604  1486  0  1 86 13
     0 13   3424  18356  25180  1907464    0   0  4768   444 2009  1895  0  3 97  0
     0 15   3424  17344  25200  1908464    0   0  5404   208 2343  2210  0  4 96  0
     0 16   3424  17128  25216  1908688    0   0  3548   196 2045  1872  0  2 98  0
     0 17   3424  17192  25264  1908584    0   0  1968   308 1217   777  0  4 96  0
     0 14   3424  17212  25344  1908500    0   0  4008   336 2267  1586  0  1 99  0
     0 13   3424  17320  25380  1908356    0   0  6824   400 3511  3198  0  6 94  0
     0 14   3424  17296  25428  1908356    0   8  7580   500 3863  3377  0  6 94  0
     1 15   3424  17144  25500  1908444    0   8  6572   228 3387  2167  0  3 97  0
     0 17   3424  17228  25484  1908376    0   0  6620   252 3114  2467  0  4 96  0
     0 13   3424  17212  25516  1908368    0  36  7488   316 3455  2590  0  7 93  0
     0 10   3424  17376  25572  1908140    0  16  4932  1032 2899  2466  0  3 92  5
     0 10   3424  17348  25584  1908172    0  36  4160   556 2137  1979  0  3 96  0
     0 16   3424  17304  25588  1908216    0  16  6500   248 2989  2402  0  2 98  0
     0 17   3424  17376  25592  1908140    0   0  6328   520 2975  2437  0  6 94  0
     0 14   3424  17312  25640  1908148    0   0  4144   256 2514  2043  0  1 99  0
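
    The 'wa' column above sits in the 90s while 'bo' stays modest, so the
    next thing worth checking is per-disk utilisation to see which spindle
    is saturated (a sketch, assuming the sysstat package is installed):

    [root@fs1 root]# iostat -x 1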


    [root@fs1 root]# nfsstat
    Server rpc stats:
    calls badcalls badauth badclnt xdrcall
    1838665 9 9 0 0
    Server nfs v3:
    null getattr setattr lookup access readlink
    0 0% 95186 5% 7046 0% 39988 2% 63079 3% 6 0%
    read write create mkdir symlink mknod
    1585800 86% 35233 1% 2008 0% 1 0% 0 0% 0 0%
    remove rmdir rename link readdir readdirplus
    3775 0% 0 0% 2 0% 1795 0% 120 0% 433 0%
    fsstat fsinfo pathconf commit
    34 0% 6 0% 0 0% 1867 0%


    [root@fs1 root]# nfsstat -r
    Server rpc stats:
    calls badcalls badauth badclnt xdrcall
    1931116 9 9 0 0


    10:38:58 up 1:31, 4 users, load average: 16.79, 16.41, 14.36
    83 processes: 82 sleeping, 1 running, 0 zombie, 0 stopped
    CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
               total    0.0%    0.0%   12.4%   0.8%     1.6%  384.0%    0.0%
               cpu00    0.0%    0.0%    1.9%   0.9%     1.9%   95.0%    0.0%
               cpu01    0.0%    0.0%    2.9%   0.0%     0.0%   97.0%    0.0%
               cpu02    0.0%    0.0%    4.9%   0.0%     0.0%   95.0%    0.0%
               cpu03    0.0%    0.0%    2.9%   0.0%     0.0%   97.0%    0.0%
    Mem:  2055460k av, 2038136k used,   17324k free,      0k shrd,  25748k buff
          1371528k actv,  441804k in_d,  30564k in_c
    Swap: 4192880k av,    3412k used, 4189468k free                1907392k cached

      PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM  TIME CPU COMMAND
     2107 root      15   0     0    0     0 DW    2.9  0.0  0:15   1 nfsd
     2106 root      15   0     0    0     0 DW    1.9  0.0  0:16   1 nfsd
     2097 root      15   0     0    0     0 DW    0.9  0.0  0:15   1 nfsd
     2098 root      15   0     0    0     0 DW    0.9  0.0  0:14   0 nfsd
     2099 root      15   0     0    0     0 DW    0.9  0.0  0:15   2 nfsd
     2102 root      15   0     0    0     0 DW    0.9  0.0  0:15   3 nfsd
     2103 root      15   0     0    0     0 DW    0.9  0.0  0:15   3 nfsd
     2108 root      15   0     0    0     0 DW    0.9  0.0  0:15   0 nfsd
     2110 root      15   0     0    0     0 DW    0.9  0.0  0:15   2 nfsd
     3929 root      20   0  1136 1136   904 R     0.9  0.0  0:00   2 top
        1 root      15   0   516  516   456 S     0.0  0.0  0:05   2 init
        2 root      RT   0     0    0     0 SW    0.0  0.0  0:00   0 migration/0
        3 root      RT   0     0    0     0 SW    0.0  0.0  0:00   1 migration/1
        4 root      RT   0     0    0     0 SW    0.0  0.0  0:00   2 migration/2
        5 root      RT   0     0    0     0 SW    0.0  0.0  0:00   3 migration/3
        6 root      15   0     0    0     0 SW    0.0  0.0  0:00   3 keventd
        7 root      34  19     0    0     0 SWN   0.0  0.0  0:00   0 ksoftirqd/0
        8 root      34  19     0    0     0 SWN   0.0  0.0  0:00   1 ksoftirqd/1
        9 root      34  19     0    0     0 SWN   0.0  0.0  0:00   2 ksoftirqd/2
       10 root      34  19     0    0     0 SWN   0.0  0.0  0:00   3 ksoftirqd/3
       13 root      25   0     0    0     0 SW    0.0  0.0  0:00   2 bdflush
       11 root      15   0     0    0     0 SW    0.0  0.0  0:06   0 kswapd
       12 root      15   0     0    0     0 SW    0.0  0.0  0:05   3 kscand
       14 root      15   0     0    0     0 SW    0.0  0.0  0:00   0 kupdated
       15 root      25   0     0    0     0 SW    0.0  0.0  0:00   2 mdrecoveryd
       22 root      15   0     0    0     0 SW    0.0  0.0  0:00   0 ahd_dv_0
       23 root      15   0     0    0     0 SW    0.0  0.0  0:00   3 ahd_dv_1
       24 root      25   0     0    0     0 SW    0.0  0.0  0:00   0 scsi_eh_0
       25 root      25   0     0    0     0 SW    0.0  0.0  0:00   3 scsi_eh_1
       27 root      21   0     0    0     0 SW    0.0  0.0  0:00   0 scsi_eh_2
       28 root      21   0     0    0     0 SW    0.0  0.0  0:00   0 aacraid
       30 root      21   0     0    0     0 SW    0.0  0.0  0:00   2 scsi_eh_2
       33 root      15   0     0    0     0 SW    0.0  0.0  0:00   3 kjournald
       87 root      25   0     0    0     0 SW    0.0  0.0  0:00   3 khubd
     1623 root      18   0     0    0     0 SW    0.0  0.0  0:00   0 kjournald
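
    All the nfsd threads are in uninterruptible sleep (the DW state
    above), which usually means they are stuck inside the kernel waiting
    on I/O. The next time it hangs, their kernel stacks can be dumped into
    the log for a closer look (a sketch; this needs a kernel built with
    CONFIG_MAGIC_SYSRQ):

    [root@fs1 root]# echo 1 > /proc/sys/kernel/sysrq
    [root@fs1 root]# echo t > /proc/sysrq-trigger
    [root@fs1 root]# dmesg | less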

