"Unpredictable" NFS errors?
I'm hoping someone in this group can provide some insight or suggestions on
how to solve a tricky NFS problem.
I use a simple backup script to tar major file systems onto a USB hard drive
using a weekly cron task. However, after upgrading my server to FC3 (and
the 40 or so nodes to either FC2 or FC3), I encounter NFS statfs errors.
Specifically, after the backup has been running for a few minutes, the nodes
lose their NFS file handles for one specific file system. A "df" on a given
node then shows:
Filesystem          1K-blocks      Used Available Use% Mounted on
/dev/sda3            14571488   2404276  11427020  18% /
/dev/sda1             1004024     21748    931272   3% /boot
none                  1037288         0   1037288   0% /dev/shm
/dev/sdb1            70557052  38727272  28245684  58% /tmp1
server:/home/users   70557056  47564176  19408784  72% /home/users
                            -         -         -    - /usr/local
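The same condition can be checked from a script rather than by reading "df" output by eye. This is only a minimal sketch, assuming Python is available on the nodes; the function name probe_mount is made up for illustration, and the two mount points are the ones from the listing above. It calls statvfs on each mount point and reports the error name if the call fails.

```python
import errno
import os


def probe_mount(path):
    """Return (ok, detail) for a mount point by calling statvfs on it.

    probe_mount is a hypothetical helper, not part of any existing tool.
    """
    try:
        st = os.statvfs(path)
    except OSError as exc:
        # Translate the numeric errno into its symbolic name where known.
        name = errno.errorcode.get(exc.errno, "unknown")
        return False, f"{name} ({exc.errno})"
    # Space available to unprivileged users, in KiB, like df's "Available".
    available_kib = st.f_bavail * st.f_frsize // 1024
    return True, f"{available_kib} KiB available"


if __name__ == "__main__":
    # Mount points taken from the df listing; adjust for other setups.
    for mount in ("/usr/local", "/home/users"):
        ok, detail = probe_mount(mount)
        print(f"{mount}: {'ok' if ok else 'FAILED'} - {detail}")
```

Run from cron on a node during the backup window, something like this would at least record when the handle goes bad.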
In /var/log/messages on the node I will see:
Sep 13 12:44:20 node_name kernel: nfs_statfs: statfs error = 116
This message will often be repeated many times.
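For reference, errno 116 on Linux is ESTALE (stale file handle). A rough sketch of how the repeated messages could be counted, assuming Python on the node and log lines in exactly the format shown above; the function name and the small errno table are illustrative, and the table lists only a few Linux values rather than the full set:

```python
import re
from collections import Counter

# A few Linux errno values (asm-generic numbering); not a complete table.
LINUX_ERRNO_NAMES = {5: "EIO", 13: "EACCES", 116: "ESTALE"}

# Matches the kernel message format quoted in the log excerpt.
STATFS_RE = re.compile(r"kernel: nfs_statfs: statfs error = (\d+)")


def count_statfs_errors(lines):
    """Count nfs_statfs errors per errno name in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = STATFS_RE.search(line)
        if match:
            code = int(match.group(1))
            # Fall back to the raw number for codes not in the table.
            counts[LINUX_ERRNO_NAMES.get(code, str(code))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Sep 13 12:44:20 node_name kernel: nfs_statfs: statfs error = 116",
    ]
    print(dict(count_statfs_errors(sample)))
```

To use it on a real log, pass an open file object for /var/log/messages instead of the sample list.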
Loss of access to /usr/local, where many executable programs reside, is fatal
to calculations running on our batch system. Thus, every Sunday morning
when the backup would run, our job queue would be wiped, killing many
long-running jobs.
Interesting points to note:
(1) We're running FC3 (kernel 2.6.9-1.667smp) on the server and either the
same thing or FC2 (2.6.10-1.770_FC2smp) on the nodes.
(2) The nodes use the following args in /etc/fstab for the NFS file systems:
(3) The server exports file systems with the following args in /etc/exports:
(4) Although the nodes lose their file handles for the /usr/local file
system during the backup, the backup is actually working on a completely
different file system when the statfs error occurs.
(5) The problem only occurs with the /usr/local file system, and NOT with
the /home/users file system, which is also NFS exported to the nodes. It
doesn't matter which file system is first in /etc/fstab on the nodes; the
problem always occurs with /usr/local.
(6) It doesn't seem to matter that the backup is made to a (slow) USB disk.
I also tried backing up to a file in /tmp on the server, but the same error
occurred.
Any insight you can provide would be greatly appreciated.
T. Daniel Crawford Department of Chemistry
crawdad[AT]vt.edu Virginia Tech
PGP Public Key at: http://www.chem.vt.edu/chem-dept/crawford/publickey.txt
Re: "Unpredictable" NFS errors?
In article <BF4C8616.61C6email@example.com>,
T. Daniel Crawford <firstname.lastname@example.org> wrote:
>I'm hoping someone in this group can provide some insight or suggestions on
>how to solve a tricky NFS problem.
It is primarily an operating system one, so it would help if you
said which. Actually, I recognise it, so I can tell it's Linux,
but not which flavour or version.
>I use a simple backup script to tar major file systems onto a USB hard drive
>using a weekly cron task. However, after upgrading my server to FC3 (and
>the 40 or so nodes to either FC2 or FC3), I encounter NFS statfs errors.
>Specifically, after the backup has been running for a few minutes, the nodes
>lose their NFS file handles for one specific file system. A "df" on a given
Yeah. And I would like to know why, too. We get it on our cluster,
but it has not yet got hard enough to put the (major) effort into
trying to track it down. Well, actually, it's not quite the same
symptom - we get stale file handle messages.
The cause of both is that Linux NFS clients have some broken error
handling. Somewhere. Under some circumstances. God alone knows
why. As Linux and its associated software are typical of modern
junk in that they don't have a significant amount of internal
checking, diagnostics or tracing, investigating such issues is a
pain in the posterior.