We started getting NFS mount failures, with the following entries in /var/log/messages:

May 14 15:23:43 d0srv009 rpc.mountd: getfh failed: Operation not permitted
May 14 15:23:43 d0srv009 rpc.mountd: authenticated mount request from d0cs178.fnal.gov:950 for /projects/1007 (/projects/1007)

After running exportfs -u followed by exportfs -a, the same mount succeeds.
exportfs -r does not correct the situation.

The export was to quite a large netgroup (300 or so machines), and I changed it to:
/projects/1007 d0*.fnal.gov(rw,sync) *clued0.fnal.gov(rw,sync)
/projects/1008 d0*.fnal.gov(rw,sync) *clued0.fnal.gov(rw,sync)

but this didn't fix things.

What did make the problem "go away" was increasing the number of nfsd threads.
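One way to check whether the server was actually running short of nfsd threads is to read the `th` line in /proc/net/rpc/nfsd. The Python sketch below is an illustration, not part of nfs-utils: it assumes the 2.4-era layout of that line (thread count, number of times all threads were busy, then a ten-bucket usage histogram in seconds), and the function names are hypothetical. Later kernels may report zeros or omit these fields.

```python
from typing import List, Tuple

# Standard location of the NFS server statistics on Linux.
NFSD_STATS_PATH = "/proc/net/rpc/nfsd"


def parse_nfsd_threads(stats_text: str) -> Tuple[int, int, List[float]]:
    """Extract nfsd thread statistics from the text of /proc/net/rpc/nfsd.

    Assumes the 2.4-era 'th' line layout:
      th <threads> <times-all-threads-busy> <10 histogram buckets, in seconds>
    """
    for line in stats_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "th":
            threads = int(fields[1])
            all_busy = int(fields[2])
            histogram = [float(x) for x in fields[3:]]
            return threads, all_busy, histogram
    raise ValueError("no 'th' line found in nfsd stats")


def read_nfsd_threads(path: str = NFSD_STATS_PATH) -> Tuple[int, int, List[float]]:
    """Read the stats file on the NFS server and parse its 'th' line."""
    with open(path) as f:
        return parse_nfsd_threads(f.read())
```

If the second number keeps climbing under load, every nfsd thread was busy at some point, which would at least be consistent with extra threads helping. On Red Hat systems the thread count is typically set with RPCNFSDCOUNT in /etc/sysconfig/nfs.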

This is kernel-smp-2.4.21-15.EL and nfs-utils-1.0.6-21EL from RedHat.

Now that the crisis is over (exportfs -u caused stale file handles),
I want to look at the nfs-utils source. Any advice?

Does it make sense that more nfsd threads would forestall the corruption?

I would have expected mounts and unmounts to update the exportfs cache.

Kurt Ruthmansdorfer; kurtATfnal.gov; 630-840-8057; FCC2W, 250H
Fermi National Accelerator Lab; ms 369; PO Box 500; Batavia, Il 60510-0500