patch 5653 and NFS timeouts? - SGI


Thread: patch 5653 and NFS timeouts?

  1. patch 5653 and NFS timeouts?

    Since installing patch 5653 on 4 systems running Irix 6.5.22m, I've
    noticed that if I issue a df command on a Tru64 Unix machine on which
    I have disk partitions from the 4 Irix machines mounted via NFS, I'm
    getting timeout messages such as:

    NFS3 RFS3_GETATTR failed for server systemname : RPC: Timed out

    Eventually df completes, after the remote systems seemingly wake up.
    Then they respond normally for a while. But if I wait a couple hours,
    then try again, the timeout messages return. Has anyone else noticed
    this behavior after installing patch 5653?

    --
    -Lynn

  2. Re: patch 5653 and NFS timeouts?

    >>>>> "rardin" == R Lynn Rardin writes:

    rardin> Since installing patch 5653 on 4 systems running Irix 6.5.22m, I've
    rardin> noticed that if I issue a df command on a Tru64 Unix machine on which
    rardin> I have disk partitions from the 4 Irix machines mounted via NFS, I'm
    rardin> getting timeout messages such as:

    rardin> NFS3 RFS3_GETATTR failed for server systemname : RPC: Timed out

    rardin> Eventually df completes, after the remote systems seemingly wake up.

    Are mounts over TCP or over UDP?

    max

  3. Re: patch 5653 and NFS timeouts?

    In article , Max Matveev
    wrote:

    > >>>>> "rardin" == R Lynn Rardin writes:

    >
    > rardin> Since installing patch 5653 on 4 systems running Irix 6.5.22m, I've
    > rardin> noticed that if I issue a df command on a Tru64 Unix machine on which
    > rardin> I have disk partitions from the 4 Irix machines mounted via NFS, I'm
    > rardin> getting timeout messages such as:
    >
    > rardin> NFS3 RFS3_GETATTR failed for server systemname : RPC: Timed out
    >
    > rardin> Eventually df completes, after the remote systems seemingly wake up.
    >
    > Are mounts over TCP or over UDP?


    UDP. FWIW, I removed patch 5653 from one of the systems and that seems
    to have alleviated the timeout problem on that system, while they
    continue to show up on the other 3 systems.

    --
    -Lynn

  4. Re: patch 5653 and NFS timeouts?

    >>>>> "rardin" == R Lynn Rardin writes:

    rardin> In article , Max Matveev
    rardin> wrote:

    >> >>>>> "rardin" == R Lynn Rardin writes:

    >>

    rardin> Since installing patch 5653 on 4 systems running Irix 6.5.22m, I've
    rardin> noticed that if I issue a df command on a Tru64 Unix machine on which
    rardin> I have disk partitions from the 4 Irix machines mounted via NFS, I'm
    rardin> getting timeout messages such as:
    >>

    rardin> NFS3 RFS3_GETATTR failed for server systemname : RPC: Timed out
    >>

    rardin> Eventually df completes, after the remote systems seemingly wake up.
    >>
    >> Are mounts over TCP or over UDP?


    rardin> UDP. FWIW, I removed patch 5653 from one of the systems and
    rardin> that seems to have alleviated the timeout problem on that
    rardin> system, while they continue to show up on the other 3
    rardin> systems.

    Sorry, should've asked it first time - was patch 5653 the only thing
    which changed? And if it was the only thing which changed, was it a
    virgin .22m or has it been patched before with any of 5397, 5518
    or 5526?

    Also, just so I understand it correctly: Irix is the server, Tru64 is
    the client, you see the messages on Tru64, not on Irix.

    The reason I'm asking all these questions is because I really cannot
    see anything in the patch which should affect the server, but if the
    system has been upgraded from, say, .20 to .22, then there is a chance
    you're running into problems with new client authentication mechanism
    which has been added in .21 and patch is just a bystander.

    BTW, did you talk to anybody in SGI Support about it?

    max

  5. Re: patch 5653 and NFS timeouts?

    In article , Max Matveev
    wrote:

    > >>>>> "rardin" == R Lynn Rardin writes:

    >
    > rardin> In article , Max Matveev
    > rardin> wrote:
    >
    > >> >>>>> "rardin" == R Lynn Rardin writes:
    > >>

    > rardin> Since installing patch 5653 on 4 systems running Irix 6.5.22m, I've
    > rardin> noticed that if I issue a df command on a Tru64 Unix machine on which
    > rardin> I have disk partitions from the 4 Irix machines mounted via NFS, I'm
    > rardin> getting timeout messages such as:
    > >>

    > rardin> NFS3 RFS3_GETATTR failed for server systemname : RPC: Timed out
    > >>

    > rardin> Eventually df completes, after the remote systems seemingly wake up.
    > >>
    > >> Are mounts over TCP or over UDP?

    >
    > rardin> UDP. FWIW, I removed patch 5653 from one of the systems and
    > rardin> that seems to have alleviated the timeout problem on that
    > rardin> system, while they continue to show up on the other 3
    > rardin> systems.
    >
    > Sorry, should've asked it first time - was patch 5653 the only thing
    > which changed?


    I believe so, yes. And as I noted above, removal of 5653 from one of
    the systems seems to have cured the problem in that case--the Tru64 Unix
    system is no longer reporting timeouts when I df the filesystem being
    served up by the SGI.

    > And if it was the only thing which changed, was it a virgin .22m or has
    > it been patched before with any of 5397, 5518 or 5526?


    It definitely wasn't a virgin .22m install. I've been applying patches
    as SGI releases them. If I recall correctly, I installed 6.5 on these
    systems, then immediately upgraded them to 6.5.20m. 6.5.21m and 6.5.22m
    were installed when they became available.

    > Also, just so I understand it correctly: Irix is the server, Tru64 is
    > the client, you see the messages on Tru64, not on Irix.


    That's correct.

    > The reason I'm asking all these questions is because I really cannot
    > see anything in the patch which should affect the server, but if the
    > system has been upgraded from, say, .20 to .22, then there is a chance
    > you're running into problems with new client authentication mechanism
    > which has been added in .21 and patch is just a bystander.


    You may be right. Is it odd that the systems eventually respond?

    > BTW, did you talk to anybody in SGI Support about it?


    No. I usually come here first. But I can touch base with them, too.

    --
    -Lynn

  6. Re: patch 5653 and NFS timeouts?

    >>>>> "rardin" == R Lynn Rardin writes:

    >> The reason I'm asking all these questions is because I really cannot
    >> see anything in the patch which should affect the server, but if the
    >> system has been upgraded from, say, .20 to .22, then there is a chance
    >> you're running into problems with new client authentication mechanism
    >> which has been added in .21 and patch is just a bystander.


    rardin> You may be right. Is it odd that the systems eventually respond?

    No, not really, if you consider how client access control is done: the
    kernel calls rpc.mountd and asks it to check whether a client from host
    192.168.0.1 can access /exports/foo. mountd then has to translate
    192.168.0.1 back to host.domain.someplace and check it against the
    access control list. The reverse name lookup can take time, and the
    server waits for the reply from mountd before processing the NFS call,
    which would explain why the server appears to "hang" and then starts
    responding again.
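
    As a rough way to check the reverse-lookup step on its own, here is a
    minimal sketch that times one lookup for a client address. The function
    name time_reverse_lookup and the default resolver command (getent hosts)
    are assumptions for illustration, not something from the patch or from
    mountd; pass whatever lookup tool the server actually has. It also
    assumes a POSIX shell and a date that supports +%s.

```shell
#!/bin/sh
# Sketch: time a single reverse lookup, roughly the work mountd has to do
# before it can answer the kernel. Hypothetical helper, not an IRIX tool.
time_reverse_lookup() {
    addr=$1
    shift
    # Default resolver command is an assumption; override it by passing
    # your own command after the address.
    [ $# -eq 0 ] && set -- getent hosts
    start=$(date +%s)
    if "$@" "$addr" >/dev/null 2>&1; then
        status=0
    else
        status=$?
    fi
    end=$(date +%s)
    # Whole seconds only, which is enough to spot multi-second stalls.
    echo "elapsed=$((end - start))s status=$status"
}
```

    For example, "time_reverse_lookup 192.168.0.1" on the server would show
    whether resolving that client takes seconds rather than milliseconds.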

    BTW, assuming that you do use any of the root=, rw= or access= options
    for exports, I think I have a reasonable explanation of what's
    happening with this patch: you might be suffering from the combination
    of DNS delays and the fact that exponential backoff has been enabled on
    TCP for the first time - it just takes longer for the server to get a
    reply from mountd, which could be blowing the timeout on the client, so
    you get the message and the delays on the client.

    What could be done? You could increase the timeout on the client via
    the timeo= mount option (I assume Tru64 has it); you could try to make
    sure reverse name lookups are quick for "impassioned" clients; or you
    could disable auth cache ageing on the server by setting the
    nfsauth_refresh systune to 0 (default is 1800 sec) and make sure you
    have enough space in the cache to hold info about all the clients
    which access your server (default is 512 entries, adjust via the
    nfsauth_cachesz systune).

    One thing to be careful about when setting nfsauth_refresh to 0 is
    that it will disable checks for changes in the netgroups if you use them.
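
    The two server-side tunables could be set along these lines. This is a
    dry-run sketch that only prints the commands; the helper name
    print_tuning_commands and the cache size of 1024 are hypothetical, and
    the "systune name value" form is my assumption about how systune is
    invoked, so check the man page before running anything as root.

```shell
#!/bin/sh
# Sketch: print (do not run) the systune commands discussed above.
# Hypothetical helper; the variable names come from the post, the
# cache size is an illustrative value.
print_tuning_commands() {
    cache_entries=${1:-1024}   # assumed value; the stated default is 512
    # 0 disables auth cache ageing (and netgroup change checks with it).
    echo "systune nfsauth_refresh 0"
    # Make room for every client that talks to this server.
    echo "systune nfsauth_cachesz $cache_entries"
}
```

    Printing first makes it easy to review the values, and the netgroup
    caveat, before applying them on the server.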

    To test client access control, try snooping the traffic on loopback
    using a command like this

    snoop -tr -d lo0 rpc 100231

    while the client accesses the server. Set nfsauth_refresh to some small
    value (e.g. 5 seconds), re-export the test filesystem on the server
    (make sure it uses root=, rw= or access=), then on the client do a loop
    like

    while true; do df /mnt/server; sleep 7; done

    You should see RPC calls to program 100231 and replies coming back -
    if there is a long delay between the call and the reply, then that is
    the time which mountd took to do reverse name resolution, and there
    lies your problem.
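
    On the client side, the df loop can be made to flag the slow runs so
    they are easy to match against the snoop trace. A minimal sketch,
    assuming a POSIX shell and a date that supports +%s (which may not hold
    on every Tru64 install); the helper name timed_run and the 3-second
    threshold in the usage note are made up for illustration.

```shell
#!/bin/sh
# Sketch: run a command, discard its output, and report whether it took
# at least <threshold> whole seconds. Hypothetical helper.
timed_run() {
    threshold=$1
    shift
    start=$(date +%s)
    # Ignore the command's own exit status; only the duration matters here.
    "$@" >/dev/null 2>&1 || true
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -ge "$threshold" ]; then
        echo "SLOW ${elapsed}s: $*"
    else
        echo "ok ${elapsed}s: $*"
    fi
}
```

    Used in place of the bare df, the loop becomes something like
    "while true; do timed_run 3 df /mnt/server; sleep 7; done".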

    >> BTW, did you talk to anybody in SGI Support about it?

    rardin> No. I usually come here first. But I can touch base with them, too.

    It's quite likely that a problem like this will eventually end up in
    my lap anyway, so I was just trying to save the support guys some time.

    max
