
Thread: Reboot > 1000 NFS clients

  1. Reboot > 1000 NFS clients

    I work in a large HPC environment. I am looking for a way to reboot
    potentially >1000 NFS clients within 20 minute span. When this many NFS
    clients reboot it results in a large number of mount requests against the
    NFS server. This could potentially swamp the NFS server. Is there a way to
    coordinate so that I don't overwhelm the NFS server?

    Thanks, Mike



  2. Re: Reboot > 1000 NFS clients

    Mike,

    Coordinating the reboot should be easy enough with ssh and 'echo init 6
    | at hh:mm', but it seems like the mounting should be done differently.
    Is this something that could be solved with proper application of
    something like automount so that mounts are deferred until needed?
    What about distributing the load between several NFS servers?
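
    For the scheduling piece, a rough sketch along those lines, assuming
    passwordless ssh as root and a hosts.txt file listing one client per
    line (both are made-up names here):

        #!/bin/bash
        # Spread 'echo init 6 | at ...' across a 20-minute window so the
        # clients don't all come back and remount at the same instant.
        WINDOW_MIN=20
        total=$(wc -l < hosts.txt)
        i=0
        while read -r host; do
            # Host i reboots at an evenly spaced minute offset in the window.
            offset=$(( i * WINDOW_MIN / total ))
            # -n keeps ssh from eating the rest of hosts.txt on stdin.
            ssh -n "$host" "echo init 6 | at now + ${offset} minutes"
            i=$(( i + 1 ))
        done < hosts.txt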

    I would be curious to know what type of HPC environment uses one NFS
    server to serve 1000+ clients without serious degradation in the first
    place.

    Cheers,
    Eric


  3. Re: Reboot > 1000 NFS clients

    Eric,

    Thanks for the reply. Actually I simplified the problem statement a bit: we
    have 8 NFS heads serving about 2000 clients. Automount is used extensively.
    However, there are some heavily used file systems that shouldn't be
    automounted. We're not 100% certain of the root cause, but our theory at
    this point is that the excessive automount activity is triggering a 'too
    many connections' condition on the busiest NFS server, which triggers
    another series of events. We are working to distribute that load across
    multiple heads and pursuing other work-arounds. Long term we see a static
    mount of this particular set of file systems as one part of the solution.
    But if we statically mount these 8 or so filesystems and need to restart
    the entire environment, we want to avoid swamping the NFS server with
    mount requests.
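
    As a sketch of what we have in mind for those static mounts, something
    like this /etc/fstab entry (nfshead1:/export/data is a made-up example),
    using 'bg' so a mount that can't reach the server at boot retries in the
    background instead of hanging the rest of the rc sequence:

        # Static NFS mount that retries in the background if the
        # server is swamped at boot time.
        nfshead1:/export/data  /data  nfs  bg,hard,intr,rsize=32768,wsize=32768  0 0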

    For the first round we are targeting a 1-hour restart. We carve up the
    last 48 minutes of that 1-hour window into four parts and let 1/4 of the
    servers reboot within their 12-minute window, each system randomly
    delaying the NFS mount portion of the rc sequence, roughly as sketched
    below.
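
    A rough sketch of that per-system delay (the hostname-hash slot
    assignment is just one illustrative way to pick the window):

        #!/bin/bash
        # Pick this host's 12-minute slot (4 slots cover the last 48
        # minutes) from a hash of its hostname, then sleep a random
        # amount within the slot before doing the NFS mounts.
        SLOT=$(( $(hostname | cksum | cut -d' ' -f1) % 4 ))   # 0..3
        SLOT_LEN=$(( 12 * 60 ))                               # 12 minutes in seconds
        sleep $(( SLOT * SLOT_LEN + RANDOM % SLOT_LEN ))
        mount -a -t nfs    # mount the statically defined NFS filesystems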

    Thanks for the reply

    Mike


  4. Re: Reboot > 1000 NFS clients

    First off, it sounds like you are really pressing NFS to its limits; you
    may want to investigate other options like http://www.panasas.com/ and
    see if they might have more to offer in your situation. I have been
    kicking around a similar project locally to tackle the ever-expanding
    storage problems that our research group runs into.

    You are right that automount won't solve the 'too many connections'
    problem if all nodes need to connect, and I assume you have already
    increased the number of daemons and done other kernel- and driver-level
    tuning to improve capacity. Have you investigated possibly using
    clustered file systems in order to provide a second or third NFS server
    for the same group of files?
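
    In case it helps, on Linux the daemon count can be raised at runtime
    like this (128 is just an example figure; the persistent knob varies
    by distro):

        # Raise the nfsd thread count on a running Linux NFS server.
        echo 128 > /proc/fs/nfsd/threads
        # On Red Hat-style systems, RPCNFSDCOUNT=128 in /etc/sysconfig/nfs
        # makes the change persistent across reboots.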

    Cheers,
    Eric


  5. Re: Reboot > 1000 NFS clients


    There are several cluster filesystems out there... in alphabetical order...

    HP's SFS (Scalable File Share)
    IBM's GPFS
    Lustre (from CFS)
    Panasas
    Polyserve
    RedHat GFS

    and I'm pretty sure there are many more that I have not
    listed.

    Enjoy,
    Postmaster




