-
glitches in NFS
Hello all,
alright, I know this problem dates back to 1993 and beyond, but even
after a whole day of browsing mailing lists I haven't found anything
helpful. I'm experiencing NFS trouble between my debian 2.6.8-3-386
host (the NFS server) and a custom card here in my company (the
client). Every few minutes (no more than 5, in any case) I get a:
nfs: server 192.168.0.21 not responding, still trying
which is resolved after a few minutes by a:
nfs: server 192.168.0.21 OK
I have tried playing around with the rsize,wsize,tcp/udp and timeo
parameters of the NFS connection, to no avail :(. I would appreciate
*any* suggestions or ideas.
Thanks,
Avishai.
-
Re: glitches in NFS
In article <1168275531.035237.239870@38g2000cwa.googlegroups.com>,
"avishai" <avishai.hendel@gmail.com> writes:
|>
|> alright, I know this problem dates back to 1993 and beyond, but even
|> after a whole day of browsing mailing lists I haven't found anything
|> helpful. I'm experiencing NFS trouble between my debian 2.6.8-3-386
|> host (the NFS server) and a custom card here in my company (the
|> client). Every few minutes (no more than 5, in any case) I get a:
|> nfs: server 192.168.0.21 not responding, still trying
|>
|> which is resolved after a few minutes by a:
|> nfs: server 192.168.0.21 OK
|>
|> I have tried playing around with the rsize,wsize,tcp/udp and timeo
|> parameters of the NFS connection, to no avail :(. I would appreciate
|> *any* suggestions or ideas.
I am not sure of the last :-(
We hit that one, too, including with an AIX client to an AIX server
and a Linux client to a Solaris server. My belief is that it is due
to a design error or specification ambiguity in NFS, but God alone
knows what or where.
There is an equally ancient, obscure, generic and more serious one
that I know of in TCP, but I doubt that it is related.
Regards,
Nick Maclaren.
-
Re: glitches in NFS
avishai wrote:
> Hello all,
>
> alright, I know this problem dates back to 1993 and beyond, but even
> after a whole day of browsing mailing lists I haven't found anything
> helpful. I'm experiencing NFS trouble between my debian 2.6.8-3-386
> host (the NFS server) and a custom card here in my company (the
> client). Every few minutes (no more than 5, in any case) I get a:
> nfs: server 192.168.0.21 not responding, still trying
>
> which is resolved after a few minutes by a:
> nfs: server 192.168.0.21 OK
>
> I have tried playing around with the rsize,wsize,tcp/udp and timeo
> parameters of the NFS connection, to no avail :(. I would appreciate
> *any* suggestions or ideas.
> Thanks,
> Avishai.
I don't think it has to do with the NFS protocol.
To summarize your problem, you periodically see:
nfs: server 192.168.0.21 not responding
nfs: server 192.168.0.21 OK
nfs: server 192.168.0.21 not responding
nfs: server 192.168.0.21 OK
...
Right?
Every "not responding" means that the client has repeatedly failed to
contact the server. Specifically, the client has been sending NFS
requests to which it hasn't received any replies. Then you get the "OK"
message, which means that the client can now talk to the server.
This, most likely, implies a connectivity problem between the 2 hosts.
I'd suggest you run this continuously on the client for 10 minutes:
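while true
do
    ping -c 1 192.168.0.21    # one echo request: is the server reachable right now?
    arp -a 192.168.0.21       # which MAC address currently answers for the server's IP?
    sleep 2
done > /tmp/out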
Watch for 2 things:
(1) Do you ever lose connectivity?
(2) Does some other machine steal the server's IP?
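Once the loop has run for a while, the log can be boiled down to those two questions. The following is only a rough sketch: the function name `summarise_probe` is made up here, and it assumes the usual Linux output formats, i.e. that `ping` prints "100% packet loss" for a lost probe and that `arp -a` prints the MAC address as six colon-separated hex groups. Other systems may word things differently.

```shell
# Hypothetical helper: summarise the log written by the ping/arp loop.
# Assumes "100% packet loss" marks a lost probe and that MAC addresses
# appear as six colon-separated hex groups (typical Linux output).
summarise_probe() {
    log=${1:-/tmp/out}
    # Count probes that got no reply; grep -c prints 0 but exits non-zero
    # when nothing matches, hence the "|| true".
    lost=$(grep -c '100% packet loss' "$log" || true)
    # Count distinct MAC addresses seen for the server's IP.
    macs=$(grep -oiE '([0-9a-f]{1,2}:){5}[0-9a-f]{1,2}' "$log" \
        | tr 'A-F' 'a-f' | sort -u | wc -l | tr -d ' ')
    echo "lost_pings=$lost distinct_macs=$macs"
}
```

Calling `summarise_probe /tmp/out` prints one line. Any lost pings point at (1); more than one distinct MAC address points at (2).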
Cheers,
bc
-
Re: glitches in NFS
In article <1168301431.082335.70130@42g2000cwt.googlegroups.com>,
"bcwalrus" <bcwalrus@gmail.com> writes:
|>
|> Every "not responding" means that the client has repeatedly failed to
|> contact the server. Specifically, the client has been sending NFS
|> requests to which it hasn't received any replies. Then you get the "OK"
|> message, which means that the client can now talk to the server.
|>
|> This, most likely, implies a connectivity problem between the 2 hosts.
Nope. Not with the symptoms he described. It is possible, but
another explanation is more likely.
|> I'd suggest you run this continuously on the client for 10 minutes:
That is certainly reasonable, and should show up any gross errors.
It is all you can do for a small amount of effort, and so should be
done, as a start.
In most of the cases I have seen, it gave the all clear and the
symptoms persisted. On other grounds, I am certain that there was
no IP stealing (e.g. it happened on one point-to-point connexion!)
and I was 90% certain of no significant connectivity issues. That
meant it had to be software, somewhere :-(
Regards,
Nick Maclaren.
-
Re: glitches in NFS
avishai wrote:
> Hello all,
>
> alright, I know this problem dates back to 1993 and beyond, but even
> after a whole day of browsing mailing lists I haven't found anything
> helpful. I'm experiencing NFS trouble between my debian 2.6.8-3-386
> host (the NFS server) and a custom card here in my company (the
> client). Every few minutes (no more than 5, in any case) I get a:
> nfs: server 192.168.0.21 not responding, still trying
>
> which is resolved after a few minutes by a:
> nfs: server 192.168.0.21 OK
>
> I have tried playing around with the rsize,wsize,tcp/udp and timeo
> parameters of the NFS connection, to no avail :(. I would appreciate
> *any* suggestions or ideas.
> Thanks,
> Avishai.
Have a look at http://www.netapp.com/library/tr/3183.pdf for a
discussion of Linux NFS issues.
Following a suggestion in that paper we just resolved a similar symptom
by setting "flow control: On" in a switch between the client and server.
The server was gigabit ethernet, the client only 100 megabit. Looking
at our switches, it seems that all those purchased before last fall
defaulted flow control to "on", and those purchased last fall defaulted
to "off". We are still wondering if there is a downside - does anyone
here want to comment?
The paper suggests using tcp is a better solution, but we haven't been
able to change the client configuration yet.
In our case we could "ping" as long as we liked with never a lost
packet. The problem occurred only when multiple packets arrived too
quickly for the client to absorb them, and some were lost.
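For anyone wanting to check for this kind of burst loss when ping looks clean, here is a minimal sketch. It assumes the client is a Linux box that exposes /proc/net/snmp (a custom card may not), and the helper name `snmp_counter` is invented for illustration. The idea is that a large NFS-over-UDP read is split into several IP fragments, so a single lost fragment should show up as a rising `ReasmFails` count on the receiving side.

```shell
# Hypothetical helper: print one counter from a /proc/net/snmp style file.
# $1 = protocol (e.g. Ip, Udp), $2 = counter name, $3 = file (optional).
# Each protocol has a header line of names followed by a line of values;
# columns are looked up by name because their number varies across kernels.
snmp_counter() {
    awk -v proto="$1:" -v name="$2" '
        $1 == proto && !seen { for (i = 2; i <= NF; i++) col[$i] = i; seen = 1; next }
        $1 == proto && seen  { if (name in col) print $(col[name]); exit }
    ' "${3:-/proc/net/snmp}"
}
```

Run `snmp_counter Ip ReasmFails` before and after a spell of heavy NFS traffic. If the number climbs while ping stays clean, fragments are being lost somewhere on the path. `snmp_counter Udp InErrors` is worth a look as well.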
Daniel Feenberg
feenberg isat nber dotte org
-
Re: glitches in NFS
In article <1169210320.632247.320470@q2g2000cwa.googlegroups.com>,
feenberg@gmail.com writes:
|>
|> Following a suggestion in that paper we just resolved a similar symptom
|> by setting "flow control: On" in a switch between the client and server.
|> The server was gigabit ethernet, the client only 100 megabit. Looking
|> at our switches, it seems that all those purchased before last fall
|> defaulted flow control to "on", and those purchased last fall defaulted
|> to "off". We are still wondering if there is a downside - does anyone
|> here want to comment?
The downside is performance - but, as you lose vastly MORE performance
on a glitch, it can be a winner. This area is a right mess, and you
can have similar compatibility problems with simplex versus duplex.
I don't understand the details but have hit them.
|> The paper suggests using tcp is a better solution, but we haven't been
|> able to change the client configuration yet.
Don't bet on it. TCP's recovery from glitches is dire. It usually
does it, but can take ages, depending on which timeout goes off.
|> In our case we could "ping" as long as we liked with never a lost
|> packet. The problem occurred only when multiple packets arrived too
|> quickly for the client to absorb them, and some were lost.
Indeed :-) And the effect can be caused by apparently extraneous
events, such as I/O on other devices and even excessive amounts of
denormalised arithmetic. Ping has only a low probability of
spotting those, though.
Regards,
Nick Maclaren.
-
Re: glitches in NFS
Nick Maclaren wrote:
(snip regarding NFS server not responding)
> Nope. Not with the symptoms he described. It is possible, but
> another explanation is more likely.
>
> |> I'd suggest you run this continuously on the client for 10 minutes:
>
> That is certainly reasonable, and should show up any gross errors.
> It is all you can do for a small amount of effort, and so should be
> done, as a start.
I used to see it fairly often in the Sun3/SunOS days, and less
often later on. That was with all Sun systems, but much slower
machines than today and with only 10Mb/s ethernet. One that
would really slow down the net was when a machine would core
dump through NFS.
> In most of the cases I have seen, it gave the all clear and the
> symptoms persisted. On other grounds, I am certain that there was
> no IP stealing (e.g. it happened on one point-to-point connexion!)
> and I was 90% certain of no significant connectivity issues. That
> meant it had to be software, somewhere :-(
I always thought it came from fairly short time out values, in
combination with slow machines and networks.
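To put rough numbers on "fairly short", here is a back-of-envelope sketch. It assumes the behaviour commonly documented for Linux UDP mounts: timeo is in tenths of a second, the wait doubles after each retransmission up to a 60 second cap, and the "not responding" message appears once a fixed number of attempts have timed out. The exact attempt count and any adaptive timeout estimation differ between kernels and other systems, so treat the result as an estimate only. The function name `udp_wait_ds` is made up.

```shell
# Hypothetical estimate: total wait, in tenths of a second, before a run of
# timed-out attempts is exhausted, assuming the timeout doubles each time.
# $1 = timeo (deciseconds), $2 = number of attempts, $3 = cap (default 600).
udp_wait_ds() {
    t=$1
    attempts=$2
    cap=${3:-600}
    total=0
    i=0
    while [ "$i" -lt "$attempts" ]; do
        total=$((total + t))      # wait out the current timeout
        t=$((t * 2))              # assumed exponential backoff
        if [ "$t" -gt "$cap" ]; then
            t=$cap                # assumed upper bound on a single wait
        fi
        i=$((i + 1))
    done
    echo "$total"
}
```

With timeo=7 and three attempts, `udp_wait_ds 7 3` gives 49, i.e. under five seconds of silence before the client complains. On a slow machine or a busy 10Mb/s net that is not much slack.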
Then again, once we shipped away a machine that still had clients
with mounts on it. Those were going to have a long wait.
-- glen