"Unpredictable" NFS errors? - NFS




  1. "Unpredictable" NFS errors?

    Hi, all,

    I'm hoping someone in this group can provide some insight or suggestions on
    how to solve a tricky NFS problem.

    I use a simple backup script to tar major file systems onto a USB hard drive
    using a weekly cron task. However, after upgrading my server to FC3 (and
    the 40 or so nodes to either FC2 or FC3), I encounter NFS statfs errors.
    Specifically, after the backup has been running for a few minutes, the nodes
    lose their NFS file handles for one specific file system. A "df" on a given
    node shows:

    Filesystem                1K-blocks     Used Available Use% Mounted on
    /dev/sda3                  14571488  2404276  11427020  18% /
    /dev/sda1                   1004024    21748    931272   3% /boot
    none                        1037288        0   1037288   0% /dev/shm
    /dev/sdb1                  70557052 38727272  28245684  58% /tmp1
    server:/home/users         70557056 47564176  19408784  72% /home/users
    server:/usr/local/athlon1         -        -         -    - /usr/local

    In /var/log/messages on the node I will see:

    Sep 13 12:44:20 node_name kernel: nfs_statfs: statfs error = 116

    This message will often be repeated many times.
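
A note on the number in that message: assuming it is an ordinary Linux errno value, 116 is ESTALE (stale file handle), which fits the lost file handles described above. The following is a minimal sketch for pulling these codes out of a log; the function name and the one-entry lookup table are illustrative assumptions, not part of any existing tool.

```python
import re

# Linux errno value relevant here (116 = ESTALE). This numbering is an
# assumption for other operating systems, where ESTALE may differ.
LINUX_ERRNO_NAMES = {116: "ESTALE"}

# Matches kernel lines such as:
#   Sep 13 12:44:20 node_name kernel: nfs_statfs: statfs error = 116
STATFS_ERROR = re.compile(r"nfs_statfs: statfs error = (\d+)")


def statfs_errors(lines):
    """Return the symbolic name (or the raw number) for each nfs_statfs line."""
    names = []
    for line in lines:
        match = STATFS_ERROR.search(line)
        if match:
            code = int(match.group(1))
            names.append(LINUX_ERRNO_NAMES.get(code, str(code)))
    return names


sample = ["Sep 13 12:44:20 node_name kernel: nfs_statfs: statfs error = 116"]
print(statfs_errors(sample))
```

Feeding it the lines of /var/log/messages instead of `sample` would show whether every repeat carries the same code.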

    Loss of access to /usr/local, where many executable programs reside, is
    fatal to calculations running on our batch system. Thus, every Sunday
    morning when the backup runs, our job queue is wiped, killing many long
    tasks.
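
One way a node could notice the lost mount before more jobs are dispatched is to probe it with statvfs, the same kind of call that df relies on. This is only a sketch under assumptions: the function name `probe_mount` is hypothetical, the two mount points are the ones from the df output, and nothing here addresses the underlying cause.

```python
import errno
import os


def probe_mount(path):
    """Call statvfs on path; return "ok", "stale", or the errno name otherwise."""
    try:
        os.statvfs(path)
    except OSError as exc:
        if exc.errno == errno.ESTALE:
            return "stale"
        # Any other failure is reported by its symbolic errno name.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return "ok"


# Mount points taken from the df listing; adjust for the actual node.
for mount_point in ("/usr/local", "/home/users"):
    print(mount_point, probe_mount(mount_point))
```

A batch system prologue could hold jobs on any node where the probe reports anything other than "ok", which would at least keep the queue from being drained into a broken /usr/local.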

    Interesting points to note:

    (1) We're running FC3 (kernel 2.6.9-1.667smp) on the server and either the
    same thing or FC2 (2.6.10-1.770_FC2smp) on the nodes.

    (2) The nodes use the following args in /etc/fstab for the NFS file systems:

    nfsvers=2,rsize=8192,wsize=8192,exec,dev,suid,rw

    (3) The server exports file systems with the following args in /etc/exports:

    rw,no_root_squash,insecure

    (4) Although the nodes lose their file handles for the /usr/local file
    system during the backup, the backup is actually working on a completely
    different file system when the statfs error occurs.

    (5) The problem only occurs with the /usr/local file system, and NOT with
    the /home/users file system, which is also NFS exported to the nodes. It
    doesn't matter which file system is first in /etc/fstab on the nodes; the
    problem always occurs with /usr/local.

    (6) It doesn't seem to matter that the backup is made to a (slow) USB disk.
    I also tried backing up to a file in /tmp on the server, but the same error
    occurred.


    Any insight you can provide would be greatly appreciated.

    -Daniel

    --
    T. Daniel Crawford Department of Chemistry
    crawdad[AT]vt.edu Virginia Tech
    www.chem.vt.edu/faculty/crawford.php
    --------------------
    PGP Public Key at: http://www.chem.vt.edu/chem-dept/crawford/publickey.txt


  2. Re: "Unpredictable" NFS errors?

    T. Daniel Crawford wrote:
    >
    >I'm hoping someone in this group can provide some insight or suggestions on
    >how to solve a tricky NFS problem.


    It is primarily an operating system one, so it would help if you
    said which. Actually, I recognise it, so I can tell it's Linux,
    but not which flavour or version.

    >I use a simple backup script to tar major file systems onto a USB hard drive
    >using a weekly cron task. However, after upgrading my server to FC3 (and
    >the 40 or so nodes to either FC2 or FC3), I encounter NFS statfs errors.
    >Specifically, after the backup has been running for a few minutes, the nodes
    >lose their NFS file handles for one specific file system. A "df" on a given
    >node shows:


    Yeah. And I would like to know why, too. We get it on our cluster,
    but it has not yet got hard enough to put the (major) effort into
    trying to track it down. Well, actually, it's not quite the same
    symptom - we get stale file handle messages.

    The cause of both is that Linux NFS clients have some broken error
    handling. Somewhere. Under some circumstances. God alone knows
    why. As Linux and its associated software are typical of modern
    junk in that they don't have a significant amount of internal
    checking, diagnostics or tracing, investigating such issues is a
    pain in the posterior.



    Regards,
    Nick Maclaren.
