GPFS detection of Disk Array Loss - AIX


Thread: GPFS detection of Disk Array Loss

  1. GPFS detection of Disk Array Loss

    I have a GPFS cluster comprising 2 nodes and a tiebreaker disk. There
    is only one filesystem, made from 3 NSDs, each in its own failure
    group. One NSD is descOnly and provides the file system descriptor
    quorum. The other two are replicated, with one NSD at each site. (I
    think this is a fairly typical setup.)
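
    For reference, a layout like this is typically created from NSD
    descriptors along the following lines - a hedged sketch only, with
    hdisk names and the mount point invented for illustration, and the
    descriptor format being the GPFS 3.1 one:

    # disk.desc: one data+metadata NSD per site plus a descOnly disk
    # format: DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName
    # (server fields left empty = disk is SAN attached to the nodes directly)
    hdisk10:::dataAndMetadata:1:sfs1
    hdisk20:::dataAndMetadata:2:sfs2
    hdisk30:::descOnly:3:fs1
    $ mmcrnsd -F disk.desc
    $ mmcrfs /sfs sfs -F disk.desc -m 2 -M 2 -r 2 -R 2   # two replicas of data and metadata, one per failure group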

    I have a couple of problems though. Firstly, I need to say that I only
    physically have TWO arrays, so my "site C" is in fact at "site B".
    Therefore my tiebreaker disk sits in the same array as one of my data
    NSDs, and the descOnly quorum disk is on that same array too.

    I am hoping that the nuts and bolts of maintaining service will not be
    affected by the co-location of the tiebreaker disk and the data disks
    at one site. Obviously, should that site/array fail, then the
    surviving site will go down as well. However, the site holding the
    tiebreaker disk should survive the loss of the non-tiebreaker site.
    Am I correct in this assumption?

    As a test I am powering off the array at the site without the
    tiebreaker disk. I envisaged that all nodes would survive, as there is
    a node and a tiebreaker disk still awake, together with disk quorum in
    the filesystem.
    What I am seeing is all nodes freezing on access to the filesystem;
    all FS commands hang/freeze. When I run mmlsdisk sfs, it reports that
    both the data NSDs and the tiebreaker NSD are available and "up". I
    know this is not the case, so what should happen? How does GPFS detect
    the array loss and maintain service?

    When I power up the array again, it sorts itself out within 10 to 20
    seconds; full service is restored and the filesystem is available
    again, with no intervention.

    As a side point, the Advanced GPFS admin guide mentions that internal,
    non-shared disks can be used at tiebreaker sites. Is this true?
    References on the IBM site seem to contradict it.

    Any pointers gratefully received!

    Rob

  2. Re: GPFS detection of Disk Array Loss

    On 28 Nov., 00:24, openstream rob wrote:
    ....
    > As a test I am powering off the Array at the site without the
    > tiebreaker disks. I envisaged that all nodes would survive as there is
    > a node and tiebreaker still awake, together with a disk quorum in the
    > filesystem.
    > What I am seeing is all nodes freezing on access to the filesystem,
    > all FS commands hang/freeze. When I run mmlsdisk sfs, it returns that
    > both data and the tiebreake NSDs are available and "up". I know this
    > is not the case, so what should happen? How does GPFS detect the array
    > loss? and maintain service?

    Which GPFS version?
    Do you use NSD servers?

    One cause might be:
    it depends on how long the fcs/fscsi subsystem tries to access a lost
    path before failing over to another path, which will also be dead.
    Until then all I/O is blocked, and thus so is GPFS. As soon as the
    fcs/fscsi subsystem reports that all paths are dead, GPFS will shut
    down or switch over to using an NSD server. So check your error log
    for errors during that time.

    See http://www-1.ibm.com/support/docview...20desr_lpp_bos
    for details on fast failover and so on.
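
    For reference, the fast-fail and dynamic-tracking settings that
    document describes are attributes of the fscsi devices; a hedged
    sketch, using fscsi0 as an example:

    $ lsattr -El fscsi0 -a fc_err_recov -a dyntrk                  # show the current settings
    $ chdev -l fscsi0 -a fc_err_recov=fast_fail -a dyntrk=yes -P   # -P stages the change in the ODM for the next reboot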

    ....
    > As a side point, the Advanced GPFS admin guide mentions that internal,
    > non shared disks can be used at tiebreaker sites. is this true?

    It's true. You might get an error message from the other cluster nodes
    that they cannot access the disk, which of course is true, but the
    functionality still works.

    hth
    Hajo

  3. Re: GPFS detection of Disk Array Loss

    Thanks Hajo

    The version of GPFS is 3.1.0.7, AIX is 5.3 ML04

    I'm looking at the document you linked now.

    Thanks

    Rob

  4. Re: GPFS detection of Disk Array Loss

    On Nov 28, 11:29 am, openstream rob wrote:
    > Thanks Hajo
    >
    > The version of GPFS is 3.1.0.7, AIX is 5.3 ML04

    GPFS PTF 16 is already out and is needed for integration with GPFS
    3.2. It fixes problems with:
    - AIX ML06 and higher
    - disks between 1 TB and 2 TB in size not being seen correctly
    and many others.

    hth
    Hajo

  5. Re: GPFS detection of Disk Array Loss

    Hi Hajo

    I've just run some "extended" tests and left the array off for 10
    minutes. Eventually the failover occurred and everything is looking
    good. I'm going to implement fast_fail and see if the timing improves.
    (I'm not exactly sure how long the failover took, probably not the
    full 10 minutes.)

    Rob

  6. Re: GPFS detection of Disk Array Loss

    Thanks again!

    I'm downloading the ptf now.

    BTW: for "fast_fail" mode, what system state do I need to be in to
    allow the chdev to work and not complain about child devices being
    active? Do I need to unplug the hardware, or can I detach a child
    device? (How do I identify the dependent devices?)

    I currently have
    2 fcsnet
    2 fscsi
    2 dar
    8 dac
    2 fcs

    Rob

  7. Re: GPFS detection of Disk Array Loss

    On 28 Nov., 13:20, openstream rob wrote:
    > Thanks again!
    >
    > I'm downloading the ptf now.
    >
    > BTW: In the "fast_fail" mode, what system state do I need to be in to
    > allow the chdev to work, and not complain of child devices being
    > active? Do I need to unplug the hardware? Or can I detach a child
    > device. (how do I identify the dependant devices?)
    >
    > I currently have
    > 2 fcsnet
    > 2 fscsi
    > 2 dar
    > 8 dac
    > 2 fcs
    >
    > Rob



    You can either use the -P option with chdev and reboot, or, if
    supported by all devices:
    $ lsdev -S available | egrep "fcs0|fscsi0|dar|dac"   # check the state of the devices
    $ rmdev -p fcs0                                      # put all children into the Defined state
    $ chdev -l fscsi0 ...                                # reconfigure your fscsi (e.g. fc_err_recov, dyntrk)
    $ chdev -l fscsi0 ...
    $ cfgmgr -l fcs0                                     # make the adapter and all its children Available again
    $ lsdev -S available | egrep "fcs0|fscsi0|dar|dac"   # recheck the state of the devices

    Repeat it for all required FC adapters. No downtime and no loss of
    disk access - of course only if you have more than one FC adapter in
    use.
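
    If you need to identify the child devices below an adapter before
    running rmdev, one way (fcs0/fscsi0 as examples):

    $ lsdev -p fcs0     # children of the adapter (the fscsi device)
    $ lsdev -p fscsi0   # devices below the fscsi device (dac/dar/hdisk)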

    hth
    Hajo

  8. Re: GPFS detection of Disk Array Loss

    Hi Hajo

    Thanks for the hints. I ran the commands for fcs0 and they worked
    perfectly. For fcs2 (connected to a second switch and on to a second
    disk controller) the command failed when attempting to rmdev the dac4
    device: it complained that the device was busy.

    Does this show that all traffic is passing only through the second
    fabric? Hopefully only until there is a failure, when it will be
    passed to the first?

    How do I swap things over so that I can adjust the fscsi2 device?

    Rob

  9. Re: GPFS detection of Disk Array Loss

    On Nov 29, 5:26 pm, openstream rob wrote:
    > Hi Hajo
    >
    > Thanks for the hints. I ran the commands for fcs0 and it worked
    > perfectly. for fcs2 ( connected to a second switch and on to a second
    > disk controller) the command failed when attempting to rmdev the dac4
    > device. It complained that the device was busy.


    I am not familiar with IBM SAN storage. Please check your
    documentation.
    At least, if everything is working, you can unplug the cable for dac4,
    which should lead to a failover to the other one. Then you should be
    able to change the adapter and plug the cable back in.
    Or use the -P option as already mentioned and reboot.

    >
    > Is this showing that all traffic is passing only through the second
    > fabric? Hopefully until there is a failure, then it will be passed to
    > the first?
    >
    > How do I swap it over to adjust the fscsi2 device?


    A simple way to see which fcX/fscsiX device is in use:
    $ nmon
    $ iostat -a 1 | egrep "fc|fscsi|dac"

    hth
    Hajo

  10. Re: GPFS detection of Disk Array Loss

    Hajo

    I used the -P option and it was successful. All adapters are now
    running in "fast_fail".

    This has improved the overall failover time from 14 minutes to 8
    minutes.

    It's curious: the server consoles report the loss via errpt much more
    quickly (1 minute max), but GPFS doesn't react for quite a while.
    After 5 minutes the GPFS log gives:
    "Local access to sfs2 failed with EIO, will attempt to access the disk
    remotely."

    mmlsdisk sfs doesn't report sfs2 as "down" for a further 3 minutes.
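
    For reference, one way to watch all three of these at once during the
    test (the GPFS log path below is the usual default location):

    $ errpt | head                           # AIX error log - the path/disk errors appear here first
    $ tail -f /var/adm/ras/mmfs.log.latest   # GPFS log - the EIO / "down" messages turn up here later
    $ mmlsdisk sfs                           # disk availability as GPFS currently sees it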

    Any ideas?

    Rob


  11. Re: GPFS detection of Disk Array Loss

    On Nov 30, 11:14 am, openstream rob wrote:
    > Hajo
    >
    > I used the -P option and it was successful. All adapters are now
    > running in "fast_fail".


    And you rebooted the node where you made this change? Otherwise the
    ODM has been changed but not the adapter itself.
    BTW: have you also enabled dynamic tracking?
    $ chdev -l fscsiX -a dyntrk=yes

    >
    > This has improved the failover times to 8 minutes, from 14 minutes.
    > overall.

    What do you mean by failover? Depending on your GPFS version and
    configuration, the gpfs process will stay alive as long as it can
    communicate with the other cluster members. Only if communication and
    quorum are lost will the gpfs process shut down.

    If local disks can no longer be accessed, nothing much should happen
    except that the filesystems on those disks will be unmounted.
    If you have NSD servers, the local gpfs process will stop direct disk
    access and use the remote NSD server for accessing the data.
    As soon as the local disks are available again, the node switches back
    to local data access.
    The failover from local disk to network should happen within seconds.
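
    A hedged way to check during such a test whether GPFS has actually
    marked any disk as failed, and which nodes still have the filesystem
    mounted:

    $ mmlsdisk sfs -e    # lists only the disks that are not "up"/"ready"
    $ mmlsmount sfs -L   # nodes that currently have sfs mounted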

    >
    > It's curiuos as the server consoles report the loss via errpt a lot
    > quicker ( 1 minute max), GPFS doesn't react for quite a while.
    > After 5 minutes GPFS log gives:
    > "Local access to sfs2 failed with EIO, will attempt to access the disk
    > remotely."

    Maybe nothing was using the GPFS filesystem, so there was no I/O and
    thus no error reported to GPFS. You should do some testing, i.e. put
    some load on the system.


    >
    > mmlsdisk sfs, doesn't report sfs as "down" for a further 3 minutes.
    >


    Like I said already, you might have to reboot your GPFS nodes to be
    sure that all settings are applied.

    hth
    Hajo

  12. Re: GPFS detection of Disk Array Loss

    Hi again

    Okay, I forgot to do the dynamic tracking change, but I did reboot
    after the -P change to the fscsi devices.

    When I said "failover" I wasn't being precise. The scenario is below:


    Array1:
    nsd sfs1: data+meta, failure grp 1
    nsd fs1: descOnly, failure grp 1

    Array2:
    nsd sfs2: data+meta, failure grp 2

    Arrays 1 and 2 have twin controllers, each connected to 2 switches to
    provide resilience to 6 servers, all with 2 fcs adapters.

    There is only one filesystem, sfs, comprising the NSDs sfs1, sfs2 and
    fs1.
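
    For reference, a hedged way to confirm this layout from the GPFS side:

    $ mmlsdisk sfs -L   # NSD name, failure group, availability and status per disk
    $ mmlsnsd -f sfs    # NSDs belonging to sfs and their NSD servers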

    My test is to power off Array2, losing sfs2, and I was hoping that
    all the servers would maintain filesystem access via the quorum disk
    (fs1) and sfs1, both still available on Array1.

    At this point, I have to wait 8 minutes before any/all servers are
    given access to the filesystem again.

    Can you spot anything else to look at? Will the fibre network topology
    be relevant? Basically, servers 1 to 3 are each linked to 2 switches
    (a and b) (1 per fcs), and servers 4 to 6 are similarly linked to 2
    switches (c and d).
    Switches a and c are joined, and b and d are also joined. Each of
    these extended fabrics is connected to each disk array (which have
    twin controllers).

    It's not too different from the setup shown in the GPFS planning
    manual (page 19, figure 11), except that each switch in their diagram
    represents a joined pair in our scenario.

    Do you think the above test should "fail over" instantly? This is what
    I had anticipated.

    Rob


  13. Re: GPFS detection of Disk Array Loss

    On Nov 30, 4:51 pm, openstream rob wrote:
    > Hi again
    >
    > Okay, I forgot to do the dynamic tracking change. But I did reboot
    > after the -P change to the fscsi devices.
    >
    > When I say failover I'm not being precise. The scenario is below:
    >
    > Array1:
    > nsd sfs1:data+meta: failure grp 1
    > nsd fs1:descOnly: failuregrp 1
    >
    > Arrary2:
    > nsd sfs2:data+meta: failure grp 2
    >
    > Array1 and 2 have twin controllers, connected to 2 switches each to
    > provide resilience to 6 servers, all with 2 fcs adapters
    >
    > there is only one filesystem sfs, comprised of nsds: sfs1 sfs2 and fs1
    >
    > My test is to power off Array2, loosing sfs2, and I was hoping that
    > all the servers would maintain filesystem access via the quorum disk
    > (fs1) and sfs1 being still available on Array 1.
    >
    > At this point, I have to wait 8minutes before any/all servers are
    > given access to the filesystem.


    So you are saying that for 8 minutes no filesystem access is possible
    at all, but afterwards it is working?

    Please provide the following output.

    Since you use 2 failure groups - have you set up replication to 2 for
    data and metadata?
    $ mmlsfs sfs -r -R -m -M

    What is the value of unmountOnDiskFail?
    $ mmlsconfig | grep -i unmount

    Do you have NSD servers?
    $ mmlsnsd -m





  14. Re: GPFS detection of Disk Array Loss

    On 30 Nov, 16:17, Hajo Ehlers wrote:
    > On Nov 30, 4:51 pm, openstream rob wrote:
    ....
    > So you are saying that for 8 minutes no filesystem access is possible
    > at all but afterwards it is working ?


    Exactly. Pull the plug, wait 8 minutes, and then all I/O returns.

    >
    > Please provide the following output
    >
    > Since you use 2 failure groups - have you setup replication to 2 for
    > data and metadata ?
    > $ mmlsfs sfs -r -R -m -M


    Yes this is done.

    >
    > What the value for unmountOnDiskFail
    > $ mmlsconfig | grep -i unmount


    Oh. This isn't set.

    > Do you have nsd server


    fs1 db1 direct
    fs1 db2 primary
    fs2 db1 direct
    fs2 db6 primary
    sfs1 db1 direct
    sfs1 db2 primary
    sfs1 db3 backup
    sfs2 db1 direct
    sfs2 db5 primary
    sfs2 db6 backup
    tb1 db1 direct
    tb2 db1 direct

    db1, db2 and db3 are in one rack;
    db4, db5 and db6 are in another.

    tb1 and tb2 are tiebreaker disks. Only one is used; the other is kept
    for DR recovery.

    Thanks Hajo. I'm looking at the unmountOnDiskFail configurable.
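
    For reference, checking the setting and - if it ever turned out to be
    wanted - changing it would presumably look something like this (the
    node names are placeholders, not a recommendation to enable it):

    $ mmlsconfig | grep -i unmountOnDiskFail        # nothing listed means it is at the default
    $ mmchconfig unmountOnDiskFail=yes -N db1,db2   # example only: per-node change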

    Rob





  15. Re: GPFS detection of Disk Array Loss

    Hi Hajo

    I thought I'd mention again that although I'm implementing a
    "standard" 3-site resilient cluster, my 3rd site is actually located
    in the secondary rack. I recognise that this rack is now a point of
    failure, but I'm allowing for manual intervention to reconfigure the
    tiebreaker and fs quorum disks onto the other rack. (Hence the tb2 and
    fs2 NSDs in my last post.)
    All my testing referenced in these posts is only against failure of
    disks in the rack WITHOUT the tiebreaker and fs quorum.
    HOWEVER, is this setup potentially causing the timing problem?

    I looked at unmountOnDiskFail and I'm not sure it should be enabled. I
    would say not, particularly as the descOnly fs quorum NSD is SAN based
    in my case.

    I really appreciate the help - what do you think?

  16. Re: GPFS detection of Disk Array Loss

    On Nov 30, 4:51 pm, openstream rob wrote:
    > Hi again

    ....
    >
    > Array1:
    > nsd sfs1:data+meta: failure grp 1
    > nsd fs1:descOnly: failuregrp 1
    >
    > Arrary2:
    > nsd sfs2:data+meta: failure grp 2
    >
    > Array1 and 2 have twin controllers, connected to 2 switches each to
    > provide resilience to 6 servers, all with 2 fcs adapters
    >
    > there is only one filesystem sfs, comprised of nsds: sfs1 sfs2 and fs1
    >
    > My test is to power off Array2, loosing sfs2, and I was hoping that
    > all the servers would maintain filesystem access via the quorum disk
    > (fs1) and sfs1 being still available on Array 1.
    >
    > At this point, I have to wait 8minutes before any/all servers are
    > given access to the filesystem.
    >
    > Can you spot anything else to look at? Will the fibre network topology
    > be relevant. Basically, servers 1 to 3 are each linked to 2 switches(a
    > nd b) ( 1 per fcs) and servers 4 to 6 are similarly linked to 2
    > switches (c+d)
    > switches a and c are joined and b and d are also joined. Each of these
    > extended fabrics is connected to each Disk array, (that have twin
    > controllers) .


    If I understand your SAN configuration correctly, your
    servers [1,2,3] can directly SEE the disks on array2 as well as their
    own on array1,
    while
    servers [4,5,6] can directly SEE the disks on array1 as well as their
    own on array2.

    If the above is true, a failure of array2 will not lead to the use of
    an NSD server - meaning you should not see high network traffic
    between the nodes. Of course, only if you have put load on one of
    servers [4,5,6].

    So for troubleshooting:
    shut down array2, put some load on one of servers [4,5,6] and see
    whether you get any high network traffic on the test node. If not, it
    simply means that the node can access the data via the SAN on array1.
    If this is correct, I believe that your SAN fabric just needs too long
    to stabilize, meaning that for some reason it takes pretty long to
    find an active path to the remaining disks.
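
    A hedged sketch of that check - the file name, mount point and network
    adapter below are just examples:

    $ dd if=/dev/zero of=/sfs/loadtest bs=1024k count=1024 &   # put some write load on the GPFS filesystem
    $ iostat -a 5 | egrep "fcs|dac"                            # local SAN path activity
    $ netstat -I en0 5                                         # LAN traffic - high values would suggest NSD (network) access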

    BTW:
    I would really simplify your current configuration.
    1) In case a server can see all disks on array1 and array2:
    - only one GPFS server in Rack1 - with tiebreaker disk
    - only one GPFS server in Rack2
    - no disk is configured for NSD server usage - all direct attached, no
    primary or backup server
    - keep your current SAN configuration
    I assume that
    - server1 sees array1 & array2
    - server4 sees array2 & array1

    1.1) Now do your tests to see what happens if array2 is not
    available.
    If you still have this 8-minute delay, I would rethink your fabric
    configuration.
    If not, the problem might be triggered by your type of NSD server
    configuration.

    BTW:
    If you are going to have NSD servers, then ONLY one server in each
    rack should be an NSD server. Thus server1 in Rack1 is the primary
    server and server4 in Rack2 is the backup server.
    This also applies in case a server can only see one array.


    Thus a simple configuration would be:
    2)
    - only one GPFS server in Rack1 - with tiebreaker disk, primary NSD
    server
    - only one GPFS server in Rack2 - backup NSD server
    - all disks are configured for NSD server usage
    - keep your current SAN configuration
    I assume that
    - server1 sees only array1
    - server4 sees only array2

    2.1)
    Now do your tests to see what happens if array2 is not available.

    You can extend the test by adding a third GPFS server from Rack2
    which can only see array2.
    Redo the test - turn off array2 - and the new GPFS server should
    switch to NSD mode, meaning it accesses the GPFS filesystem through
    the primary NSD server.
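
    If you end up changing which nodes serve a given NSD, that is done
    with mmchnsd; a hedged sketch, assuming the GPFS 3.1 descriptor format
    DiskName:PrimaryServer:BackupServer and placeholder server names (the
    filesystem may need to be unmounted first - check the mmchnsd manual
    page):

    $ mmchnsd "sfs1:server1:server4"   # server1 primary, server4 backup
    $ mmchnsd "sfs2:server4:server1"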

    hth
    Hajo


  17. Re: GPFS detection of Disk Array Loss

    On Nov 30, 6:23 pm, openstream rob wrote:
    > Hi Hajo
    >
    > I thought I'd mention again that although I'm implementing a
    > "standard" 3 site resilient cluster. My 3rd site is actually located
    > on the secondary rack. I recognise that this rack is now a point of
    > failure, but I'm allowing for manual intervention to reconfigure the
    > tiebreaker and fs quorum node to the other stack. (Hence the tb2 and
    > fs2 nsds in my last post)
    > All my testing referenced in these posts is only against failure of
    > disks in the rack WITHOUT the tiebreaker and fs quorum.

    That is clear to me, but see my previous post regarding your
    configuration.

    > HOWEVER, is this setup potentially causing the timing problem?

    It depends which part of the setup you mean ;-)
    - GPFS setup: NSD servers y/n, network
    - SAN setup: how many disks, how many paths
    - SAN storage: what type of storage, failover mode

    Like I said:
    simplify your current configuration and then start to extend it.

    hth
    Hajo


  18. Re: GPFS detection of Disk Array Loss

    On 30 Nov, 17:47, Hajo Ehlers wrote:
    ....


    A quick bit of further information:

    Yes, all servers 1 to 6 can see both arrays directly.

    The NSD servers are used because (sorry, I didn't mention this) there
    is another tier of 8 servers fully connected to the 6 already
    mentioned. These are only connected over Gb LAN, and they also use the
    NSD servers for access to sfs.

    I'll drop the server count as you suggest (configuration 2) and repeat
    the test.

    When the array is dropped, there is NO I/O to the remaining array on
    any server (for 8 minutes); e.g. a "touch filename" will hang. After 8
    minutes it succeeds.

    Thanks Hajo. I'll need to do the tests on Monday as the building is
    closing.

    Have a good weekend

    Rob


  19. Re: GPFS detection of Disk Array Loss

    Not so quick!!

    I just did one more test run, as I was rebooting the cluster after
    setting dyntrk=yes while we were chatting.

    I pulled the wire on the array and all servers carried on immediately.
    mmlsdisk sfs shows an instant state of "down" for sfs2, and I/O is
    normal to the remaining array.

    Problem solved.

    I will do a review of the NSD servers anyway, and I still need to
    check the manual failover mechanism in case array1/rack1 trips over.

    Thanks for your time Hajo. That's about the 10th problem of mine
    you've sorted.

    Another case to add to your solved file!

    Cheers
    Rob

  20. Re: GPFS detection of Disk Array Loss

    On Nov 30, 11:14 am, openstream rob wrote:

    ....
    > quicker ( 1 minute max), GPFS doesn't react for quite a while.
    > After 5 minutes GPFS log gives:
    > "Local access to sfs2 failed with EIO, will attempt to access the disk
    > remotely."

    ....
    Sometimes it is really good to reread a thread:
    the above message says to me that your server in rack2 could not use
    the disk in array1 for some reason.

    From
    http://publib.boulder.ibm.com/infoce...l1pdg1176.html
    ....
    If a disk is defined to have a local connection, as well as being
    connected to primary and secondary NSD servers, and the local
    connection fails, GPFS bypasses the broken local connection and uses
    the NSD servers to maintain disk access. The following error message
    appears in the MMFS log:

    6027-361
    Local access to disk failed with EIO, switching to access the disk
    remotely.

    This is the default behavior, and can be changed with the useNSDserver
    file system mount option. See General Parallel File System: Concepts,
    Planning, and Installation Guide and search for NSD server
    considerations.

    For a file system using the default mount option
    useNSDserver=asneeded, disk access fails over from local access to
    remote NSD access. Once local access is restored, GPFS detects this
    fact and switches back to local access. The detection and switch over
    are not instantaneous, but occur at approximately five minute
    intervals.
    ....

    So here you have your 5-minute interval. But this should happen only
    for a node which has local access and has lost it completely. Since in
    your setup all the servers see all the disks, this should not happen.
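
    For completeness, the mount option mentioned above can be passed when
    mounting the filesystem; a hedged example (the documented values are
    always, asfound, asneeded and never):

    $ mmmount sfs -o useNSDserver=never -a   # never fall back to the NSD servers for a locally attached disk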

    Have a nice weekend
    Hajo



