glitches in NFS - NFS


  1. glitches in NFS

    Hello all,

    alright, I know this problem dates back to 1993 and beyond, but even
    after a whole day of browsing mailing lists I haven't found anything
    helpful. I'm experiencing NFS trouble between my debian 2.6.8-3-386
    host (the NFS server) and a custom card here in my company (the
    client). Every few minutes (no more than 5, in any case) I get a:
    nfs: server 192.168.0.21 not responding, still trying

    which is resolved after a few minutes by a:
    nfs: server 192.168.0.21 OK

    I have tried playing around with the rsize,wsize,tcp/udp and timeo
    parameters of the NFS connection, to no avail. I would appreciate
    *any* suggestions or ideas.
    Thanks,
    Avishai.


  2. Re: glitches in NFS


    In article <1168275531.035237.239870@38g2000cwa.googlegroups.com>,
    "avishai" writes:
    |>
    |> alright, I know this problem dates back to 1993 and beyond, but even
    |> after a whole day of browsing mailing lists I haven't found anything
    |> helpful. I'm experiencing NFS trouble between my debian 2.6.8-3-386
    |> host (the NFS server) and a custom card here in my company (the
    |> client). Every few minutes (no more than 5, in any case) I get a:
    |> nfs: server 192.168.0.21 not responding, still trying
    |>
    |> which is resolved after a few minutes by a:
    |> nfs: server 192.168.0.21 OK
    |>
    |> I have tried playing around with the rsize,wsize,tcp/udp and timeo
    |> parameters of the NFS connection, to no avail. I would appreciate
    |> *any* suggestions or ideas.

    I am not sure of the last :-(

    We hit that one, too, including with an AIX client to an AIX server
    and a Linux client to a Solaris server. My belief is that it is due
    to a design error or specification ambiguity in NFS, but God alone
    knows what or where.

    There is an equally ancient, obscure, generic and more serious one
    that I know of in TCP, but I doubt that it is related.


    Regards,
    Nick Maclaren.

  3. Re: glitches in NFS

    avishai wrote:
    > Hello all,
    >
    > alright, I know this problem dates back to 1993 and beyond, but even
    > after a whole day of browsing mailing lists I haven't found anything
    > helpful. I'm experiencing NFS trouble between my debian 2.6.8-3-386
    > host (the NFS server) and a custom card here in my company (the
    > client). Every few minutes (no more than 5, in any case) I get a:
    > nfs: server 192.168.0.21 not responding, still trying
    >
    > which is resolved after a few minutes by a:
    > nfs: server 192.168.0.21 OK
    >
    > I have tried playing around with the rsize,wsize,tcp/udp and timeo
    > parameters of the NFS connection, to no avail. I would appreciate
    > *any* suggestions or ideas.
    > Thanks,
    > Avishai.


    I don't think it has to do with the NFS protocol.

    To summarize your problem, you periodically see:
    nfs: server 192.168.0.21 not responding
    nfs: server 192.168.0.21 OK
    nfs: server 192.168.0.21 not responding
    nfs: server 192.168.0.21 OK
    ...
    Right?

    Every "not responding" means that the client has repeatedly failed to
    contact the server. Specifically, the client has been sending NFS
    requests to which it hasn't received any replies. Then you get the "OK"
    message, which means that the client can now talk to the server.

    This, most likely, implies a connectivity problem between the 2 hosts.
    I'd suggest you run this continuously on the client for 10 minutes:

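    while true
    do
        date                      # timestamp, to line samples up with the NFS messages
        ping -c 1 192.168.0.21    # one probe per pass; look for lost replies
        arp -a 192.168.0.21       # MAC address currently mapped to the server's IP
        sleep 2
    done > /tmp/out 2>&1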

    Watch for 2 things:
    (1) Do you ever lose connectivity?
    (2) Does some other machine steal the server's IP?
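
Reading ten minutes of that log by eye is tedious, so here is a minimal sketch of a summary for the two questions above. It assumes the log contains iputils-style ping summaries ("1 packets transmitted, ...") and net-tools-style arp lines ("? (ip) at MAC ..."); a client with a different ping or arp would need different patterns. The function name summarise_log is made up for this example.

```shell
# Summarise a ping/arp log such as /tmp/out.
# Assumption: iputils-style ping summary lines and net-tools-style arp lines.
summarise_log() {
    awk '
        # One summary line per ping run; a run that got no reply reports 100% loss.
        /packets transmitted/ { probes++; if ($0 ~ /100% packet loss/) lost++ }
        # arp lines: remember every MAC address seen after the word "at".
        / at [0-9A-Fa-f][0-9A-Fa-f]:/ {
            for (i = 1; i < NF; i++)
                if ($i == "at") macs[$(i + 1)] = 1
        }
        END {
            n = 0
            for (m in macs) n++
            printf "probes=%d lost=%d distinct_macs=%d\n", probes, lost, n
        }
    ' "$1"
}
```

Calling summarise_log /tmp/out after the loop has run prints one line. A non-zero lost count points at (1), and more than one distinct MAC for the server's address points at (2).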

    Cheers,
    bc


  4. Re: glitches in NFS


    In article <1168301431.082335.70130@42g2000cwt.googlegroups.com>,
    "bcwalrus" writes:
    |>
    |> Every "not responding" means that the client has repeatedly failed to
    |> contact the server. Specifically, the client has been sending NFS
    |> requests to which it hasn't received any replies. Then you get the "OK"
    |> message, which means that the client can now talk to the server.
    |>
    |> This, most likely, implies a connectivity problem between the 2 hosts.

    Nope. Not with the symptoms he described. It is possible, but
    another explanation is more likely.

    |> I'd suggest you run this continuously on the client for 10 minutes:

    That is certainly reasonable, and should show up any gross errors.
    It is all you can do for a small amount of effort, and so should be
    done, as a start.

    In most of the cases I have seen, it gave the all clear and the
    symptoms persisted. On other grounds, I am certain that there was
    no IP stealing (e.g. it happened on one point-to-point connexion!)
    and I was 90% certain of no significant connectivity issues. That
    meant it had to be software, somewhere :-(


    Regards,
    Nick Maclaren.

  5. Re: glitches in NFS

    avishai wrote:
    > Hello all,
    >
    > alright, I know this problem dates back to 1993 and beyond, but even
    > after a whole day of browsing mailing lists I haven't found anything
    > helpful. I'm experiencing NFS trouble between my debian 2.6.8-3-386
    > host (the NFS server) and a custom card here in my company (the
    > client). Every few minutes (no more than 5, in any case) I get a:
    > nfs: server 192.168.0.21 not responding, still trying
    >
    > which is resolved after a few minutes by a:
    > nfs: server 192.168.0.21 OK
    >
    > I have tried playing around with the rsize,wsize,tcp/udp and timeo
    > parameters of the NFS connection, to no avail. I would appreciate
    > *any* suggestions or ideas.
    > Thanks,
    > Avishai.



    Have a look at http://www.netapp.com/library/tr/3183.pdf for a
    discussion of Linux NFS issues.

    Following a suggestion in that paper we just resolved a similar symptom
    by setting "flow control: On" in a switch between the client and server.
    The server was gigabit ethernet, the client only 100 megabit. Looking
    at our switches, it seems that all those purchased before last fall
    defaulted flow control to "on", and those purchased last fall defaulted
    to "off". We are still wondering if there is a downside - does anyone
    here want to comment?

    The paper suggests using tcp is a better solution, but we haven't been
    able to change the client configuration yet.

    In our case we could "ping" as long as we liked with never a lost
    packet. The problem occurred only when multiple packets arrived too
    quickly for the client to absorb them, and some were lost.
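
Since ping does not show this kind of burst loss, one thing that may be worth watching on the client instead is the RPC retransmission counter. The sketch below assumes a Linux client that exposes /proc/net/rpc/nfs (the file nfsstat reads), where the "rpc" line holds calls, retransmissions and auth refreshes in that order. The function name rpc_retrans is made up for this example.

```shell
# Print the client-side RPC call and retransmission counters.
# Assumption: Linux client with /proc/net/rpc/nfs; the "rpc" line is
# "rpc <calls> <retransmissions> <auth refreshes>".
rpc_retrans() {
    awk '$1 == "rpc" { printf "calls=%s retrans=%s\n", $2, $3 }' \
        "${1:-/proc/net/rpc/nfs}"
}
```

Run it before and after a large transfer. If retrans climbs while single pings never fail, that fits packets being dropped under load rather than a plain connectivity fault.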


    Daniel Feenberg
    feenberg isat nber dotte org


  6. Re: glitches in NFS


    In article <1169210320.632247.320470@q2g2000cwa.googlegroups.com>,
    feenberg@gmail.com writes:
    |>
    |> Following a suggestion in that paper we just resolved a similar symptom
    |> by setting "flow control: On" in a switch between the client and server.
    |> The server was gigabit ethernet, the client only 100 megabit. Looking
    |> at our switches, it seems that all those purchased before last fall
    |> defaulted flow control to "on", and those purchased last fall defaulted
    |> to "off". We are still wondering if there is a downside - does anyone
    |> here want to comment?

    The downside is performance - but, as you lose vastly MORE performance
    on a glitch, it can be a winner. This area is a right mess, and you
    can have similar compatibility problems with simplex versus duplex.
    I don't understand the details but have hit them.

    |> The paper suggests using tcp is a better solution, but we haven't been
    |> able to change the client configuration yet.

    Don't bet on it. TCP's recovery from glitches is dire. It usually
    does it, but can take ages, depending on which timeout goes off.

    |> In our case we could "ping" as long as we liked with never a lost
    |> packet. The problem occurred only when multiple packets arrived too
    |> quickly for the client to absorb them, and some were lost.

    Indeed :-) And the effect can be caused by apparently extraneous
    events, such as I/O on other devices and even excessive amounts of
    denormalised arithmetic. Ping will usually spot those with low
    probability, though.


    Regards,
    Nick Maclaren.

  7. Re: glitches in NFS

    Nick Maclaren wrote:

    (snip regarding NFS server not responding)

    > Nope. Not with the symptoms he described. It is possible, but
    > another explanation is more likely.


    > |> I'd suggest you run this continuously on the client for 10 minutes:


    > That is certainly reasonable, and should show up any gross errors.
    > It is all you can do for a small amount of effort, and so should be
    > done, as a start.


    I used to see it fairly often in the Sun3/SunOS days, and less
    often later on. That was with all Sun systems, but much slower
    machines than today and with only 10Mb/s ethernet. One that
    would really slow down the net was when a machine would core
    dump through NFS.

    > In most of the cases I have seen, it gave the all clear and the
    > symptoms persisted. On other grounds, I am certain that there was
    > no IP stealing (e.g. it happened on one point-to-point connexion!)
    > and I was 90% certain of no significant connectivity issues. That
    > meant it had to be software, somewhere :-(


    I always thought it came from fairly short time out values, in
    combination with slow machines and networks.
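
To put a rough number on "fairly short", here is a back-of-envelope sketch. It assumes the behaviour described in older versions of the Linux nfs(5) man page for UDP mounts: the timeout starts at timeo tenths of a second, doubles after each minor timeout up to a 60 second cap, and the "not responding" message appears after retrans minor timeouts. Other clients and newer kernels may behave differently, and the function name major_timeout_ds is made up for this example.

```shell
# Tenths of a second until a major timeout ("server not responding").
# Assumption: doubling backoff from timeo, capped at 600 (60 seconds),
# with the major timeout after `retrans` minor timeouts.
major_timeout_ds() {
    timeo=$1      # initial timeout, in tenths of a second
    retrans=$2    # minor timeouts before the major timeout
    total=0
    t=$timeo
    i=0
    while [ "$i" -lt "$retrans" ]; do
        total=$((total + t))
        t=$((t * 2))
        if [ "$t" -gt 600 ]; then t=600; fi
        i=$((i + 1))
    done
    echo "$total"
}
```

With the historical UDP defaults of timeo=7 and retrans=3, major_timeout_ds 7 3 prints 49. That is under five seconds of silence before the message is logged, which a slow or briefly stalled server could plausibly exceed.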

    Then again, once we shipped away a machine that still had clients
    with mounts on it. Those were going to have a long wait.

    -- glen

