
Thread: Redhat Enterprise 4 and 15 second delays with NFS via TCP

  1. Redhat Enterprise 4 and 15 second delays with NFS via TCP

    We recently moved from RH9 to Redhat enterprise 4 (2.6.9++ kernel).
    Now we periodically see 15 second (or multiples of 15) delays.

    I've traced this down to code in the kernel (net/sunrpc/xprt.c)
    (look for RPC_REESTABLISH_TIMEOUT). This code inserts a hard
    15-second delay before re-connection if an RPC connection is
    dropped. The problem seems worst with Solaris servers.

    FWIW: This involves anything using RPC/TCP not just NFS, so other
    services can be similarly impacted.

    Anyone else having similar woes?
    --jbrandt

  2. Re: Redhat Enterprise 4 and 15 second delays with NFS via TCP

    Joe Brandt wrote:
    > We recently moved from RH9 to Redhat enterprise 4 (2.6.9++ kernel).
    > Now we periodically see 15 second (or multiples of 15) delays.
    >
    > I've traced this down to code in the kernel (net/sunrpc/xprt.c)
    > (look for RPC_REESTABLISH_TIMEOUT). This code inserts a
    > hard 15 second delay before re-connection if a RPC connection is
    > dropped. The problem seems worst with Solaris servers.
    >
    > FWIW: This involves anything using RPC/TCP not just NFS, so other
    > services can be similarly impacted.
    >
    > Anyone else having similar woes?


    What I've been told is that when the Linux 2.6 NFS client gets
    a connection disconnect, it will indeed wait 15 seconds before
    trying to reconnect. This is to deal with the case where
    an NFS server reboots while it has NFS connections. If every
    client simply went in a tight loop trying to re-connect while the
    machine the server was on was booting up, the net would be flooded
    with SYNs.

    Unfortunately, the Linux client (unlike the Solaris client) doesn't
    distinguish between disconnects caused by a FIN from the server
    versus an RST from the server. Ideally, the client should wait a bit
    if it gets an RST, but if it gets a FIN, reconnect as soon as possible.
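
    For illustration only, here is a minimal sketch, in C, of the policy
    just described: an orderly close (FIN) would allow an immediate
    reconnect, while an abortive close (RST) triggers the back-off. This
    is not the actual net/sunrpc/xprt.c logic; the function and constant
    names are hypothetical, and only the 15-second figure comes from the
    discussion above.

        #include <stdbool.h>

        /* Hypothetical back-off constant mirroring the 15-second delay
         * discussed above (the real kernel expresses it in jiffies). */
        #define REESTABLISH_DELAY_SECS 15

        /* Hypothetical policy: choose a reconnect delay based on how the
         * connection was torn down. */
        static int reconnect_delay_secs(bool closed_by_fin)
        {
            if (closed_by_fin)
                return 0;                  /* orderly close: retry at once */
            return REESTABLISH_DELAY_SECS; /* abortive close: back off */
        }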

    The real question is why is your Solaris
    NFS server breaking connections?

    On the Solaris server side, if a connection is unused for more than
    6 minutes, it will break the connection. On the Linux client side,
    I'm told that if the connection is idle for more than 5 minutes, the
    client will break the connection.

    By any chance have you changed the
    server side or client side connection timeouts?

    Also does the Solaris nfsd have the -c option
    set? If set, it will limit the number of
    connections to the value specified on the command
    line. If a new connection comes in, rather than reject it
    (and thus provide an obvious DoS) it closes the least
    recently created connection, and allows the new one in.
    This could cause the server to break connections
    before the idle timeout of 6 minutes is reached.
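
    For what it's worth, a rough sketch of that eviction behaviour,
    assuming a simple oldest-first array of descriptors; this is purely
    illustrative and not the Solaris nfsd source (the limit of 32 is just
    an example value, as if "-c 32" were in effect).

        #include <unistd.h>              /* close() */

        #define CONN_LIMIT 32            /* illustrative; think "nfsd -c 32" */

        static int conns[CONN_LIMIT];    /* connection fds, oldest first */
        static int nconns;

        /* Hypothetical admission routine: when the limit is reached, close
         * the least recently created connection instead of rejecting the
         * newcomer (rejecting would hand an attacker an easy DoS). */
        static void admit_connection(int newfd)
        {
            if (nconns == CONN_LIMIT) {
                close(conns[0]);                 /* oldest connection */
                for (int i = 1; i < nconns; i++) /* shift the rest down */
                    conns[i - 1] = conns[i];
                nconns--;
            }
            conns[nconns++] = newfd;
        }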

    > --jbrandt



  3. Re: Redhat Enterprise 4 and 15 second delays with NFS via TCP

    Most interesting. That could be related to the problem we get on
    occasion, and gives us a handle for investigating if it ever goes
    hard. That one is a failure (not just a message), "Bad File Handle",
    for no apparent reason, and that is with a Linux client and a Solaris
    server.

    Of course, if the people designing the code had half a clue, they
    would not think that the only two options are a tight loop and a
    delay before retry. I would have thought that it was obvious that
    both are seriously flawed strategies, and a very simple increasing
    delay is the best general approach. Plus, of course, that logging
    repeated failure just might be of some use to someone investigating
    problems in the area.

    But I am just an old fart, so what do I know?


    Regards,
    Nick Maclaren.

  4. Re: Redhat Enterprise 4 and 15 second delays with NFS via TCP

    Upon further investigation, the problem seems *much* worse with our
    Solaris server boxes, so I am beginning to think it is a configuration
    issue and not so much a Linux kernel problem, but I agree with your
    comments.

    I should note that in the 2.6.15.1 kernel, the hard 15 second timeout
    has been changed to an exponential backoff starting at 3 seconds.
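
    For comparison, a minimal sketch of that style of exponential
    backoff. The 3-second starting point is from the note above; the
    60-second cap and the names are assumptions, not taken from the
    2.6.15.1 source.

        #define REEST_INIT_SECS 3    /* initial delay, per the note above */
        #define REEST_MAX_SECS  60   /* cap; illustrative value only */

        /* Hypothetical backoff: double the reconnect delay after each
         * failed attempt; a caller would reset it to zero once a
         * connection is successfully re-established. */
        static int next_reestablish_delay(int current_secs)
        {
            if (current_secs == 0)
                return REEST_INIT_SECS;
            return (current_secs * 2 > REEST_MAX_SECS) ? REEST_MAX_SECS
                                                       : current_secs * 2;
        }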

    As far as I can see we run the Solaris nfsd like so:
    nfsd -a 32
    I might bump this up a bit.

    I did try running some Linux boxes using UDP and an r/wsize of 8196
    instead of 32768. The performance hit was negligible, and the number
    of retransmitted packets was reasonable.


  5. Re: Redhat Enterprise 4 and 15 second delays with NFS via TCP


    Nick Maclaren wrote:
    > Most interesting. That could be related to the problem we get on
    > occasion, and gives us a handle for investigating if it ever goes
    > hard. That one is a failure (not just a message) Bad File Handle
    > for no apparent reason, and that is a Linux client and Solaris server.


    This has nothing to do with Bad or Stale File Handles.

    > Of course, if the people designing the code had half a clue, they


    Since the Solaris client is somewhat of a model for the Linux client,
    that would include me.

    > would not think that the only two options are a tight loop and a
    > delay before retry. I would have thought that it was obvious that
    > both are seriously flawed strategies, and a very simple increasing


    I owned the Solaris NFS client's RPC/TCP code at one point. I've
    shared some of my experiences with the implementors of its analog in
    the Linux kernel. So your criticisms apply to me as well as to them.

    Yes, there are always more complex ways to recover from less than
    ideal situations. But more complexity leads to more possibilities for
    error (human error). If I've learned anything, it is that I'm not
    smart enough to anticipate all such possibilities.

    I won't apologize for my lack of intelligence; I am what I am.

    Maybe you should give CIFS and its recovery mechanisms a go. :-)

    Alternatively, Linux and Solaris are open source. Fix the
    code yourself and run it.

    > delay is the best general approach. Plus, of course, that logging


    I was told after my first response to this thread that later
    revisions of Linux 2.6 do just that.

    > repeated failure just might be of some use to someone investigating
    > problems in the area.


    Well, there's always a fine line between useful diagnostics and
    console/syslog spam. Anyway, the Linux kernel has tracing you can
    turn on:

    sysctl -w sunrpc.nfs_debug=65535
    sysctl -w sunrpc.rpc_debug=65535

    and with DTrace in Solaris 10, there might be run-time tracing that
    would help there too.

    > But I am just an old fart, so what do I know?


    I couldn't have said it better myself.


  6. Re: Redhat Enterprise 4 and 15 second delays with NFS via TCP

    The default for mounts in Redhat-9 is to use V3 with UDP. Apparently in
    RHE4 the default has changed to be TCP. The increased number of NFS/TCP
    connections to our servers seems like a smoking gun.


  7. Re: Redhat Enterprise 4 and 15 second delays with NFS via TCP


    mrbrandtexcite wrote:
    > Upon further investigation, the problem seems *much* worse with our
    > Solaris server boxes, so I am begining to think it is a configuration
    > issue, and not so much a linux kernel problem, but I agree with your
    > comments.
    >
    > I should note that in the 2.6.15.1 kernel, the hard 15 second timeout
    > has been changed to an exponential backoff starting at 3 seconds.
    >
    > As far as I can see we run the Solaris nfsd as so:
    > nfsd -a 32
    > I might bump this up a bit.


    Bump it to 1000 if you want. The Solaris server is self-throttling. See

    http://blogs.sun.com/roller/page/rob...ntly_called_to


  8. Re: Redhat Enterprise 4 and 15 second delays with NFS via TCP

    Mike Eisler wrote:
    > What I've been told is that when the Linux 2.6 NFS client gets a
    > connection disconnect, it will indeed wait 15 seconds before trying
    > to reconnect. This is to deal with the case where an NFS server
    > reboots while it has NFS connections. If every client simply went in
    > a tight loop trying to re-connect while the machine the server was
    > on was booting up, the net would be flooded with SYNs.


    > Unfortunately, the Linux client (unlike the Solaris client) doesn't
    > distinguish between disconnects caused by a FIN from the server,
    > versus a RST from the server. Ideally, the client should wait a bit
    > if it gets RST, but if it gets a FIN, reconnect as soon as possible.


    I'll show my ignorance - if a server was rebooting (gracefully) would
    it actually send out RSTs? I would have thought it would send out
    FINs. And if it went down hard, it wouldn't respond at all, right?

    I would think that one would only get RSTs if there was a problem at
    the TCP level, or perhaps if the server's FIN crossed a client's
    request on the network.

    It would seem that a FIN means things are being shut down, so wait,
    and an RST might mean a transient error, so try connecting again
    (modulo that crossed-segments bit...).

    > On the Solaris server side, if a connection is unused for more than
    > 6 minutes, it will break the connection. On the Linux client side,
    > I'm told that if the connect is idle for more than 5 minutes, the
    > client will break the connection.


    "Break" the connection as in RST's? Or do you mean it will close the
    connection as in send a FIN?

    rick jones
    --
    The glass is neither half-empty nor half-full. The glass has a leak.
    The real question is "Can it be patched?"
    these opinions are mine, all mine; HP might not want them anyway...
    feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

  9. Re: Redhat Enterprise 4 and 15 second delays with NFS via TCP


    Rick Jones wrote:
    > Mike Eisler wrote:
    > > What I've been told is that when the Linux 2.6 NFS client gets a
    > > connection disconnect, it will indeed wait 15 seconds before trying
    > > to reconnect. This is to deal with the case where an NFS server
    > > reboots while it has NFS connections. If every client simply went in
    > > a tight loop trying to re-connect while the machine the server was
    > > on was booting up, the net would be flooded with SYNs.

    >
    > > Unfortunately, the Linux client (unlike the Solaris client) doesn't
    > > distinguish between disconnects caused by a FIN from the server,
    > > versus a RST from the server. Ideally, the client should wait a bit
    > > if it gets RST, but if it gets a FIN, reconnect as soon as possible.

    >
    > I'll show my ignorance - if a server was rebooting (gracefully) would
    > it actually send-out RSTs? I would have thought it would send-out


    Yes.

    > FINs. And if it went-down hard, it wouldn't respond at all right?


    Can't send a FIN unless you already have a connection. The server has
    rebooted (and if it crashed, it didn't send a FIN before it halted).
    Unless it saves connections in stable storage (an awesome idea; all
    us NFS server implementors are obvious fools for not doing it :-), it
    has no connections.

    RFC 793:

    "In this case, the data arriving at
    TCP A from TCP B (line 2) is unacceptable because no such connection
    exists, so TCP A sends a RST. "

    When a server comes up, at first there's no listener on port 2049, so
    an RST (ECONNREFUSED) is the only response. Then there is a listener,
    but the connection queue, as dictated by the second parameter to the
    listen system call, limits how many partially completed connections
    there are. A server with a grid of connected clients (not just NFS)
    is going to be inundated with requests after reboot. Increasing the
    queue length is a good thing of course, but there are practical
    limits to doing so.
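
    To make the backlog point concrete, a hedged sketch of the listen()
    call in question. Port 2049 is the NFS port; the backlog of 128 is
    purely illustrative, and error handling is omitted for brevity.

        #include <arpa/inet.h>
        #include <netinet/in.h>
        #include <string.h>
        #include <sys/socket.h>

        int main(void)
        {
            struct sockaddr_in sin;
            int fd = socket(AF_INET, SOCK_STREAM, 0);

            memset(&sin, 0, sizeof(sin));
            sin.sin_family      = AF_INET;
            sin.sin_port        = htons(2049);    /* NFS over TCP */
            sin.sin_addr.s_addr = htonl(INADDR_ANY);
            bind(fd, (struct sockaddr *)&sin, sizeof(sin));

            /* The second argument bounds the queue of pending connections.
             * Until this call runs after a reboot, SYNs to port 2049 are
             * answered with RST; afterwards, only this many partially
             * completed connections are held while the server catches up. */
            listen(fd, 128);

            return 0;
        }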

    > I would think that one would only get RST's if there was a problem at
    > the TCP level, or perhaps if the server's FIN crossed a client's
    > request on the network.
    >
    > It would seem that a FIN means things are being shut-down, so wait,
    > and an RST might mean a transient error, so try connecting again
    > (modulo that crossed segments bit...)


    The RST is indeed a transient condition, so it means the client is
    better off backing off. The FIN means either the system is shutting
    down gracefully (in which case backing off is prudent, but we don't
    know that), or some other condition has caused the server to break
    the connection (which, BTW, shouldn't happen if the clients and
    server agree to the 5- and 6-minute respective idle timeout
    convention yours truly tried to establish last decade). If you get a
    FIN, it doesn't hurt to immediately try to re-connect, because if you
    do, you'll either get a SYN-ACK in response or an RST.

    > > On the Solaris server side, if a connection is unused for more than
    > > 6 minutes, it will break the connection. On the Linux client side,
    > > I'm told that if the connect is idle for more than 5 minutes, the
    > > client will break the connection.

    >
    > "Break" the connection as in RST's? Or do you mean it will close the
    > connection as in send a FIN?


    FIN.


  10. Re: Redhat Enterprise 4 and 15 second delays with NFS via TCP

    Mike Eisler wrote:
    > Rick Jones wrote:
    >> Mike Eisler wrote:
    >> > What I've been told is that when the Linux 2.6 NFS client gets a
    >> > connection disconnect, it will indeed wait 15 seconds before trying
    >> > to reconnect. This is to deal with the case where an NFS server
    >> > reboots while it has NFS connections. If every client simply went in
    >> > a tight loop trying to re-connect while the machine the server was
    >> > on was booting up, the net would be flooded with SYNs.

    >>
    >> > Unfortunately, the Linux client (unlike the Solaris client) doesn't
    >> > distinguish between disconnects caused by a FIN from the server,
    >> > versus a RST from the server. Ideally, the client should wait a bit
    >> > if it gets RST, but if it gets a FIN, reconnect as soon as possible.

    >>
    >> I'll show my ignorance - if a server was rebooting (gracefully) would
    >> it actually send-out RSTs? I would have thought it would send-out


    > Yes.


    >> FINs. And if it went-down hard, it wouldn't respond at all right?


    > Can't send a FIN unless you already have a connection. The server
    > has rebooted (and if it crashed, it didn't send a FIN before it
    > halted). Unless it saves connections in stable storage (an awesome
    > idea; all us NFS server implementors are obvious fools for not doing
    > it :-), it has no connections.


    Ah, I was thinking about when the machine was going down rather than
    coming up.

    > The FIN means either the system is shutting down gracefully (in
    > which case, back off is prudent but we don't know that), or some
    > other condition has caused the server to break the connection (which
    > BTW, shouldn't happen if the clients and server agree to the 5 and 6
    > minute respective idle timeout convention yours truly tried to
    > establish last decade).


    Isn't that sort of a "sleep synchronization" kind of thing?

    >> "Break" the connection as in RST's? Or do you mean it will close
    >> the connection as in send a FIN?

    > FIN.


    You and I must infer different things from the word "break"

    rick jones
    --
    a wide gulf separates "what if" from "if only"
    these opinions are mine, all mine; HP might not want them anyway...
    feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

  11. Re: Redhat Enterprise 4 and 15 second delays with NFS via TCP


    Rick Jones wrote:

    > > The FIN means either the system is shutting down gracefully (in
    > > which case, back off is prudent but we don't know that), or some
    > > other condition has caused the server to break the connection (which
    > > BTW, shouldn't happen if the clients and server agree to the 5 and 6
    > > minute respective idle timeout convention yours truly tried to
    > > establish last decade).

    >
    > Isn't that sort of a "sleep synchronization" kind of thing?


    I don't know what a sleep synchronization thing is.

    Anyway, if the server and client had the same value for the
    connection idle timer, the server could release the connection at the
    same time the client sends a request. That would kind of suck. By
    making the server idle timer much higher than the client idle timer,
    it is unlikely that the server's idle timer will ever fire (because
    the client will have disposed of the connection before the server
    does). Only if the client crashes will the server need to reap idle
    (leaked) connections.
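
    A trivial sketch of that asymmetry, using the 5- and 6-minute figures
    quoted earlier in the thread; the function itself is hypothetical.

        #define CLIENT_IDLE_SECS (5 * 60)   /* Linux client idle timeout */
        #define SERVER_IDLE_SECS (6 * 60)   /* Solaris server idle timeout */

        /* Hypothetical check run periodically on each side: whichever
         * timer expires first sends the FIN.  With the client's timer a
         * minute shorter, the server's timer normally never fires unless
         * the client has crashed and leaked the connection. */
        static int connection_should_close(int idle_secs, int is_server)
        {
            int limit = is_server ? SERVER_IDLE_SECS : CLIENT_IDLE_SECS;
            return idle_secs >= limit;
        }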

    > >> "Break" the connection as in RST's? Or do you mean it will close
    > >> the connection as in send a FIN?

    > > FIN.


    Though I'll add that if the FIN takes too long, I had the Solaris NFS
    server "pull the plug" and send an RST, using TLI (a t_ordrel()
    followed by a t_snddis()).

    http://cvs.opensolaris.org/source/xr...ll_cots_action
    lines 1387-1399


    >
    > You and I must infer different things from the word "break"



  12. Re: Redhat Enterprise 4 and 15 second delays with NFS via TCP

    Mike Eisler wrote:
    > Rick Jones wrote:
    >> Isn't that sort of a "sleep synchronization" kind of thing?


    > I don't know what a sleep synchronization thing is.


    Synchronizing two "threads" (whatever) with sleep() - ass-u-me-ing
    that the other guy will be "ready" by the time the sleep has finished
    rather than via some explicit means of synchronization.

    > Anyway, if the server and client had the same value for the connection
    > idle timer, the server could release the connection at the same time
    > the client sends a request. That would kind of suck.
    > By making the server idle timer much higher than the client idle
    > timer, it is unlikely that the server's idle timer will ever fire
    > (because the client will have disposed of the connection before the
    > server does). Only if the client crashes will the server need to
    > reap idle (leaked) connections.


    So, instead of having an explicit "I want to close this connection"
    synchronization mechanism, you have the two sides sleeping for
    different lengths of time.

    rick jones
    --
    oxymoron n, commuter in a gas-guzzling luxury SUV with an American flag
    these opinions are mine, all mine; HP might not want them anyway...
    feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

  13. Re: Redhat Enterprise 4 and 15 second delays with NFS via TCP


    Rick Jones wrote:

    > > Anyway, if the server and client had the same value for the connection
    > > idle timer, the server could release the connection at the same time
    > > the client sends a request. That would kind of suck.
    > > By making the server idle timer much higher than the client idle
    > > timer, it is unlikely that the server's idle timer will ever fire
    > > (because the client will have disposed of the connection before the
    > > server does). Only if the client crashes will the server need to
    > > reap idle (leaked) connections.

    >
    > So, instead of having an explicit "I want to close this connection"
    > synchronization mechanism, you have the two sides sleeping for
    > different lengths of time.


    You mean, instead of unilaterally adding a procedure to NFS, I have
    two sides sleeping for different lengths of time? What else could I
    have done?


  14. Re: Redhat Enterprise 4 and 15 second delays with NFS via TCP

    In article <1138310255.895817.116160@f14g2000cwb.googlegroups.com>,
    Mike Eisler wrote:
    >
    >Can't send a FIN unless you already have a connection. The server has
    >rebooted (and if it crashed, it didn't send a
    >FIN before it halted). Unless it saves connections in stable storage
    >(an awesome idea; all us NFS server implementors are obvious
    >fools for not doing it :-), it has no connections.


    Well, yes, but there's a bit more to it. I don't know NFS well
    enough, but I suspect that it may have the same misdesign as the
    X Windowing System in this respect. The question is what you
    should do when you get a message related to a connexion that you
    don't have.

    The traditional (pre-Unix) approach was to treat such things as
    errors, which they are. Now, in the 1970s, the connexionless
    networking people won out over the connexion ones (ARPA versus
    X.? being part of it), and it is VERY hard to do that if other
    specifications say that simply dropping connexions is acceptable.

    The problem about simply ignoring such things is that it means
    that certain classes of bug become undetectable, especially when
    combined with timeout-and-repeat recovery designs. And, as we
    all know, bugs left alone in dark corners breed and evolve.

    What I can't say is whether "sleep synchronisation" is hiding
    some bugs, which could be related to these problems, but I strongly
    suspect that it is. It CERTAINLY does in the middle of TCP, and
    in the X Windowing System.


    Regards,
    Nick Maclaren.

  15. Re: Redhat Enterprise 4 and 15 second delays with NFS via TCP


    Nick Maclaren wrote:
    > In article <1138310255.895817.116160@f14g2000cwb.googlegroups.com>,
    > Mike Eisler wrote:
    > >
    > >Can't send a FIN unless you already have a connection. The server has
    > >rebooted (and if it crashed, it didn't send a
    > >FIN before it halted). Unless it saves connections in stable storage
    > >(an awesome idea; all us NFS server implementors are obvious
    > >fools for not doing it :-), it has no connections.

    >
    > Well, yes, but there's a bit more to it. I don't know NFS well
    > enough, but I suspect that it may have the same misdesign as the
    > X Windowing System in this respect. The question is what you
    > should do when you get a message related to a connexion that you
    > don't have.


    The TCP protocol specification is clear. You return an RST.

    The difference between X and NFS is that when X gets a connection
    reset, the X application (e.g. xterm) terminates. The application
    does not bother to re-establish a TCP connection. Same deal with ftp,
    telnet, ssh, etc. And CIFS for that matter.

    With NFS, if mounted "hard", the client is required to re-establish
    the connection. Where the client fails to do so, that is an
    indication of bugs in the client and/or server.

    > The problem about simply ignoring such things is that it means
    > that certain classes of bug become undetectable, especially when
    > combined with timeout-and-repeat recovery designs. And, as we
    > all know, bugs left alone in dark corners breed and evolve.
    >
    > What I can't say is whether "sleep synchronisation" is hiding
    > some bugs, which could be related to these problems, but I strongly


    I agree that there's evidence that some clients can't deal with
    "unexpected" FINs from servers.

    > suspect that it is. It CERTAINLY does in the middle of TCP, and
    > in the X Windowing System.



  16. Re: Redhat Enterprise 4 and 15 second delays with NFS via TCP


    In article <1138378583.222231.278350@f14g2000cwb.googlegroups.com>,
    "Mike Eisler" writes:
    |>
    |> > Well, yes, but there's a bit more to it. I don't know NFS well
    |> > enough, but I suspect that it may have the same misdesign as the
    |> > X Windowing System in this respect. The question is what you
    |> > should do when you get a message related to a connexion that you
    |> > don't have.
    |>
    |> The TCP protocol specification is clear. You return a RST.

    That is only a partial answer. Yes, that is what you do at the
    TCP layer, but there is also the question of what else you should
    do. Such as log the fact.

    |> The difference between X and NFS is that when X gets a connection
    |> reset, the X application (e.g. xterm) terminates. The application does
    |> not bother to re-establish a TCP connection. Same deal with ftp,
    |> telnet, ssh, etc. And CIFS for that matter.

    No, that's not my point. The X disaster is in the handling of
    events, which is logically very similar to TCP, and what to do
    if nothing picks them up, they are directed to somewhere that
    can't be reached, or they get stuck.

    |> I agree that there's evidence that some clients can't deal with
    |> "unexpected" FINs from servers.

    It's not just the clients. Some servers get into horribly knotted
    states when there are network problems. Solaris is more robust
    than Linux, but still has problems.


    Regards,
    Nick Maclaren.

  17. Re: Redhat Enterprise 4 and 15 second delays with NFS via TCP


    Nick Maclaren wrote:
    > In article <1138378583.222231.278350@f14g2000cwb.googlegroups.com>,
    > "Mike Eisler" writes:
    > |>
    > |> > Well, yes, but there's a bit more to it. I don't know NFS well
    > |> > enough, but I suspect that it may have the same misdesign as the
    > |> > X Windowing System in this respect. The question is what you
    > |> > should do when you get a message related to a connexion that you
    > |> > don't have.
    > |>
    > |> The TCP protocol specification is clear. You return a RST.
    >
    > That is only a partial answer. Yes, that is what you do at the
    > TCP layer, but there is also the question of what else you should
    > do. Such as log the fact.


    Wow. An opportunity for my employer to store the voluminous logs.
    We should log UDP datagram drops too.

    >
    > |> The difference between X and NFS is that when X gets a connection
    > |> reset, the X application (e.g. xterm) terminates. The application does
    > |> not bother to re-establish a TCP connection. Same deal with ftp,
    > |> telnet, ssh, etc. And CIFS for that matter.
    >
    > No, that's not my point. The X disaster is in the handling of
    > events, which is logically very similar to TCP, and what to do
    > if nothing picks them up, they are directed to somewhere that
    > can't be reached, or they get stuck.


    Given that the filing aspect of NFS is stateless, I don't know what
    your point is.

    > |> I agree that there's evidence that some clients can't deal with
    > |> "unexpected" FINs from servers.
    >
    > It's not just the clients. Some servers get into horribly knotted
    > states when there are network problems. Solaris is more robust
    > than Linux, but still has problems.


    I doubt it.


  18. Re: Redhat Enterprise 4 and 15 second delays with NFS via TCP


    In article <1138379444.139920.10590@g47g2000cwa.googlegroups.com>,
    "Mike Eisler" writes:
    |>
    |> Wow. An opportunity for my employer to store the voluminous logs.
    |> We should log UDP datagram drops too.

    30+ years ago, we seemed to be able to log such failures without
    major problems on disks that were orders of magnitude smaller.
    You might find it informative to think why.

    In this case, if you get a lot of TCP failures relating to connexions
    you don't have, you would be well advised to worry.

    |> Given that the filing aspect of NFS is stateless, I don't know what
    |> your point is.

    You might be surprised at how easy it is for a theoretically stateless
    design to become stateful. But, in any case, being stateless is not
    a good reason to ignore errors just because you don't expect them.

    |> > It's not just the clients. Some servers get into horribly knotted
    |> > states when there are network problems. Solaris is more robust
    |> > than Linux, but still has problems.
    |>
    |> I doubt it.

    Try doing a problem search on it, then.


    Regards,
    Nick Maclaren.


  19. Re: Redhat Enterprise 4 and 15 second delays with NFS via TCP


    Nick Maclaren wrote:
    > In article <1138379444.139920.10590@g47g2000cwa.googlegroups.com>,
    > "Mike Eisler" writes:
    > |>
    > |> Wow. An opportunity for my employer to store the voluminous logs.
    > |> We should log UDP datagram drops too.
    >
    > 30+ years ago, we seemed to be able to log such failures without
    > major problems on disks that were orders of magnitude smaller.
    > You might find it informative to think why.


    It may be informative when one out of one NFS clients drops its TCP
    connection to an NFS server. It is useless when 30,000 out of 30,000
    do so several times over the day. TCP connections just aren't a big
    deal to NFS. What are you going to do with millions of such messages
    each day?

    >
    > In this case, if you get a lot of TCP failures relating to connexions


    I don't care about TCP disconnects. Although failures to re-connect
    are interesting, and last I checked there are stats and messages
    for those in most clients.

    > you don't have, you would be well advised to worry.
    >
    > |> Given that the filing aspect of NFS is stateless, I don't know what
    > |> your point is.
    >
    > You might be surprised at how easy it is for a theoretically stateless
    > design to become stateful. But, in any case, being stateless is not
    > a good reason to ignore errors just because you don't expect them.


    But I do expect TCP connections to drop. I put in idle timers to make
    it happen.

    > |> > It's not just the clients. Some servers get into horribly knotted
    > |> > states when there are network problems. Solaris is more robust
    > |> > than Linux, but still has problems.
    > |>
    > |> I doubt it.
    >
    > Try doing a problem search on it, then.


    Sorry, turned in my Sun badge in 2000. I only go by Usenet chatter.
    Other than yours (which commingles disparate issues like TCP
    connection idle timers with stale file handles), I'm not seeing it.


  20. Re: Redhat Enterprise 4 and 15 second delays with NFS via TCP

    Mike Eisler wrote:
    > You mean, instead of unilaterally adding a procedure to NFS, I have
    > two sides sleeping for different lengths of time?


    Yep

    > What else could I have done?


    Short of a rev of the protocol? Perhaps not have the NFS server code
    ever initiate connection close on idleness and rely instead on TCP
    level keepalives to cull connections from dead clients. Keepalive
    intervals would have to be set pretty low to deal with DoS I guess.
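
    A hedged sketch of what that could look like on a server-side socket,
    using the standard SO_KEEPALIVE option plus the Linux-specific
    per-connection tuning knobs; the interval values are illustrative
    only, not a recommendation.

        #include <netinet/in.h>
        #include <netinet/tcp.h>
        #include <sys/socket.h>

        /* Hypothetical helper: instead of an application-level idle timer,
         * let TCP keepalives detect dead clients and cull their
         * connections. */
        static void enable_keepalive(int fd)
        {
            int on = 1;
            int idle = 120;    /* seconds of idleness before probing */
            int intvl = 30;    /* seconds between probes */
            int cnt = 4;       /* unanswered probes before giving up */

            setsockopt(fd, SOL_SOCKET,  SO_KEEPALIVE,  &on,    sizeof(on));
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,  &idle,  sizeof(idle));
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,   &cnt,   sizeof(cnt));
        }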

    rick jones
    thinking as I type
    --
    The computing industry isn't as much a game of "Follow The Leader" as
    it is one of "Ring Around the Rosy" or perhaps "Duck Duck Goose."
    - Rick Jones
    these opinions are mine, all mine; HP might not want them anyway...
    feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
