GPFS Lease expired causing node expulsion - Aix


Thread: GPFS Lease expired causing node expulsion

  1. GPFS Lease expired causing node expulsion

    Hi all

    We are developing an Oracle RAC 10g cluster system using SAN for
    database storage, but GPFS for the voting and OCR.
    We have had no particular problems, and all has been progressing until
    this week. Having made only minor changes to our build, we are getting
    installation failure on our Oracle application.

    The application install runs for 2 hours, but during the last 3
    installs, we are getting GPFS lease expires and the node running the
    oracle application load, fences due to the loss of GPFS shared
    filesystem. This has occurred at three random points during the
    install. On rerunning the application load, it has successfully
    reinstalled okay.

    The IBM problem guide suggests that the fault is solely related to
    comms failure, although we haven't witnessed any other comms issues
    and there have been no configuration changes.

    The GPFS master node expels and recovers the DB client node within 1
    second, probably indicating a temporary failure?

    Oracle unfortunately notices the loss of service from GPFS and fences,
    causing a reboot. When rebooted, all is normal again.

    Is the lease expiry a common event for GPFS, or does this indicate an
    underlying comms (or other) fault?
    Does anybody else use GPFS for holding OCR and Voting disks?

    Does the expelling of a node impact the status of files being accessed
    or written to on a GPFS filesystem? Does recovery get performed?

    Rob


  2. Re: GPFS Lease expired causing node expulsion

    On Aug 2, 5:20 pm, openstream rob wrote:
    > Hi all
    >
    > We are developing an Oracle RAC 10g cluster system using SAN for
    > database storage, but GPFS for the voting and OCR.

    So the databas

    > We have had no particular problems, and all has been progressing until
    > this week. Having made only minor changes to our build, we are getting
    > installation failure on our Oracle application.
    >
    > The application install runs for 2 hours, but during the last 3
    > installs, we are getting GPFS lease expires and the node running the
    > oracle application load, fences due to the loss of GPFS shared
    > filesystem. This has occurred at three random points during the
    > install. On rerunning the application load, it has successfully
    > reinstalled okay.
    >
    > The IBM problem guide suggests that the fault is soley related to
    > comms failure. Although we haven't witnessed any other comms issues,
    > and there have been no configuration changes.
    >
    > The GPFS master node expels and recovers the DB client node within 1
    > second, probably indicating a temporary failure?
    >
    > Oracle unfortunately notices the loss of service from GPFS and fences.
    > Causing a reboot. When rebooted, all is normal again.
    >
    > Is the lease expirey a common event for GPFS? or does this indicate an
    > underlying comms(or other) fault?
    > Does aybody else use GPFS for holding OCR and Voting disks?
    >
    > Does the expelling of a node impact the status of files being accessed/
    > written to a GPFS filesystem? Does recovery get perfomed?
    >
    > Rob


    Which GPFS and AIX version?
    What does the mmfslog say at the failure time, on the failing node and
    on the filesystem manager node?
    Are the latest patches installed for GPFS?
    Is the failing node a filesystem manager?

    BTW: What is OCR ?

    Hajo


  3. Re: GPFS Lease expired causing node expulsion

    Hi Hajo

    Aix 5.3ML04
    GPFS 3.1.0.7

    Our cluster consists of 3 database servers, which are also GPFS quorum
    mgrs. DB2 is the gpfs primary, DB3 is the secondary.
    The failure occurs on DB1, which is a plain quorum manager node.
    The fencing node, DB1, reports absolutely nothing in the errpt or mmfs
    logs (other than mmfs restarts 30 mins after DB2 (master) reports
    "DB1 being expelled due to expired lease.")
    DB2 reports the expelling of DB1 and the reintroduction of DB1 within
    1 second.
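
    For anyone comparing logs across nodes, here is a minimal Python
    sketch for pulling expel and lease messages out of a GPFS log. The
    match pattern is an assumption based on the message quoted above;
    the exact wording may differ between GPFS levels. The sample lines,
    including their timestamps, are made up for illustration. The usual
    log location on AIX is /var/adm/ras/mmfs.log.latest, but check your
    own install.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: expulsion-related messages mention "expel..." or "lease".
# Adjust the pattern to whatever your GPFS level actually writes.
PATTERN = re.compile(r"\bexpel\w*|\blease\b", re.IGNORECASE)


def find_expel_events(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line that matches PATTERN."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


# Hypothetical sample lines, only to show the function in use.
sample_log = [
    "Thu Aug  2 17:01:00 2007: mounted /gpfs1",
    "Thu Aug  2 17:05:12 2007: DB1 being expelled due to expired lease.",
    "Thu Aug  2 17:05:13 2007: resource released",
]

for number, text in find_expel_events(sample_log):
    print(f"{number}: {text}")
```

    To scan a real log, pass an open file object instead of the sample
    list, and run it against both the expelled node and the manager
    node so the two timelines can be lined up.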

    OCR= Oracle cluster Registry. This is an element of Oracle RAC
    clusterware that needs to be shared between all RAC nodes, as is the
    voting disk.

    I've been looking at Oracle certification and have just noticed that
    the May version of the certification table for Oracle RAC 10g
    specifically excludes GPFS 3.x.
    During our initial scoping, the certification table indicated that
    GPFS 2.3 or above was certified, with no mention of 3+. So we may need
    to adjust our design. Regardless of this, we have been stable for
    several months before this week.

    Any ideas?

    Thanks




  4. Re: GPFS Lease expired causing node expulsion

    On Aug 2, 4:55 pm, openstream rob wrote:
    > Hi Hajo
    >
    > Aix 5.3ML04
    > GPFS 3.1.0.7
    >
    > Our cluster consists of 3 database servers, which are also GPFS quorum
    > mgrs. DB2 is the gpfs primary, DB3 is the secondary.
    > The failure occurs on DB1, which is a plain quorum manager node.
    > The fencing node, DB1, reports absolutely nothing in the errpr or mmfs
    > logs. ( other than mmfs restarts 30 mins after DB2 (master) reports
    > "DB1 being expelled due to expired lease.")
    > DB2 reports the expelling of DB1 and the reintroduction of DB1 within
    > 1 second.
    >
    > OCR= Oracle cluster Registry. This is an element of Oracle RAC
    > clusterware that needs to be shared between all RAC nodes, as is the
    > voting disk.
    >
    > I've been looking at Oracle certification and have just noticed that
    > the May version of the certification table for Oracle RAC 10g
    > specifically excludes GPFS 3.x.
    > During our initial scoping, the certification table indicated that
    > GPFS 2.3 or above was certified, with no mention of 3+. So we may need
    > to adjust our design. Regardless of this, we have been stable for
    > several months before this week.
    >
    > Any ideas?
    >
    > Thanks


    I just ran into this same problem earlier this week with a two node
    config and was initially on GPFS 2.3, Oracle RAC 10.2.0.3, AIX 5.3. I
    upgraded to 3.1 with no change in behavior. The problem turned out to
    be an issue with the ethernet on the failing node. At first, IBM was
    thinking it was a problem with fiber connectivity to disk. So, I ran
    diag on the fiber cards with no problems detected. Then I swapped the
    fiber cables with brand new ones, no change. Then checked
    connectivity with the ethernet and that was fine, then diag on the
    ethernet cards and again no problems. So what's the problem? Well,
    there must have been something in the ODM that was hosing it up, as
    the only way I fixed it was to remove the ethernet device configs,
    rediscover with cfgmgr, and re-setup the cards. My failing node ceased being
    "expelled". BTW, I noticed that you have three nodes, is this for
    quorum? if so, it should not be necessary if you use a "Tie Breaker"
    disk.

    HTH,
    Pete's


  5. Re: GPFS Lease expired causing node expulsion

    Hi Pete

    This is interesting, I will investigate the comms. For our cluster, we
    have developed an installation mechanism whereby we can build from the
    ground up an entire cluster with application working, in under 10
    hours. This process is non interactive and is driven by a set of
    predetermined configuration files. This allows us to supply
    determinate builds for Test, secondary and DR clusters. What this
    essentially means is that the comms adapters will be rebuilt every
    time the cluster is built from scratch (approx. every other day during
    development).

    Interestingly though, of all the clusters we can install on, it only
    fails on one. And only on one node in that cluster. I'm going to get
    diags run on the cards. As it happens, we are using virtual IPs formed
    from a resilient pair of adapters. I'm not totally sure how this
    arrangement fails over in the event of a single card fault.

    We have three nodes for application reasons. Not totally sure what the
    justification is, however I do have it configured for 2 node quorum.
    This is currently a resilience issue I have to look at, and using a
    tie-breaker is something to look at. Any advice?
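
    As a quick illustration of the quorum arithmetic behind this
    question: GPFS node quorum is one plus half of the defined quorum
    nodes. The Python sketch below just works that rule through for a
    few cluster sizes; the function names are made up for illustration,
    and it does not model tiebreaker disks.

```python
def node_quorum(quorum_nodes: int) -> int:
    """Quorum nodes that must be up: one plus half of those defined."""
    if quorum_nodes < 1:
        raise ValueError("need at least one quorum node")
    return quorum_nodes // 2 + 1


def tolerated_failures(quorum_nodes: int) -> int:
    """How many quorum nodes can be lost before quorum is lost."""
    return quorum_nodes - node_quorum(quorum_nodes)


for n in (2, 3, 5):
    print(f"{n} quorum nodes: need {node_quorum(n)} up, "
          f"can lose {tolerated_failures(n)}")
```

    With plain node quorum, a two node cluster cannot lose either node,
    which is why a tiebreaker disk setup is usually suggested for two
    node configurations, while three quorum nodes can ride out one
    failure.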

    Do you use GPFS for voting and OCR as well?

    I have a few questions I'd like to ask regarding other issues of
    installation and upgrade. Are you open to having a chat on the phone/
    email, to share experiences?

    Thanks for your input

    Rob



  6. Re: GPFS Lease expired causing node expulsion

    On Aug 3, 9:16 am, openstream rob wrote:
    > Hi Pete
    >
    > This is interesting, I will investigate the comms. For our cluster, we
    > have developed an installation mechanism whereby we can build from the
    > ground up an entire cluster with application working, in under 10
    > hours. This process is non interactive and is driven by a set of
    > predetermined configuration files. This allows us to supply
    > determinate builds for Test, secondary and DR clusters. What this
    > essentially means is that the comms adapters will be rebuilt every
    > time the cluster is built from scratch. (apprx every other day during
    > development)
    >
    > Interestingly though, of all the clusters we can install on, it only
    > fails on one. And only on one node in that cluster. I'm going to get
    > diags run on the cards. As it happens, we are using virtual IPs formed
    > from a resilient pair of adapters, I'm not totally sure how this
    > arrangment fails over in event of a single card fault?
    >
    > We have three nodes for application reasons. Not totally sure what the
    > justification is, however I do have it configured for 2 node quorum.
    > This is currently a resilience issue I have to look at, and using tie-
    > breaker si something to look at. Any advice?
    >
    > Do you use GPFS for voting and OCR as well?
    >
    > I have a few questions I'd like to ask regarding other issues of
    > installation and upgrade. Are you open to having a chat on the phone/
    > email, to share experiences?
    >
    > Thanks for your input
    >
    > Rob


    We are using gpfs for the OCR & Voting Disks. My first install was
    with RAW as I was having issues and Oracle was strongly recommending
    it. They want to be sure those disks are there before crs starts. So
    it seems as though they had issues with GPFS. Back to the certified
    config, I'm looking into this now as I have a production setup going
    live in less than 2 weeks with GPFS 3.1. When I had checked
    earlier (as did you), 3.1 was supported. It's going to suck if I have
    to re-install 2.3. I am open to a chat, but I do not have time until
    later today, maybe 2:30p CDT. I'll email you my email from work, as
    the email that's listed is a personal account and I cannot check it
    from work.

    Pete's




  7. Re: GPFS Lease expired causing node expulsion

    Pete

    How did you get on with the certification issue?

    We are still hoping our problem is hardware related. I'm swapping
    servers about to see if the problem follows the hardware.
    It has developed though, as now we can install but during the first 5
    hours of running, the same node reboots. There are absolutely no log
    entries relating to why anywhere on the system. There's just a missing
    20 minutes of logs and a reboot entry when it comes up again. Ever
    seen that?
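
    One way to pin down exactly where a hole like this starts and ends
    is to look for gaps between consecutive log timestamps. This is a
    minimal Python sketch, assuming each log line starts with a
    timestamp in "YYYY-MM-DD HH:MM:SS" form; that format and the helper
    names are assumptions for illustration, so adapt the parsing to
    whatever your logs really use.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

# Assumed timestamp layout at the start of each line (19 characters).
STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_stamp(line: str) -> datetime:
    """Parse the leading timestamp of a log line (assumed format)."""
    return datetime.strptime(line[:19], STAMP_FORMAT)


def find_gaps(stamps: Iterable[datetime],
              threshold: timedelta) -> List[Tuple[datetime, datetime]]:
    """Return (before, after) pairs where the silence exceeds threshold."""
    ordered = sorted(stamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > threshold]


# Hypothetical log lines, only to show the functions in use.
sample = [
    "2007-08-08 10:00:00 service heartbeat",
    "2007-08-08 10:05:00 service heartbeat",
    "2007-08-08 10:25:00 system restart",
    "2007-08-08 10:26:00 service heartbeat",
]

for before, after in find_gaps(map(parse_stamp, sample), timedelta(minutes=10)):
    print(f"no entries between {before} and {after}")
```

    Running the same check over several logs on the same node shows
    whether everything went quiet at the same moment or only one
    subsystem did.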

    Rob





  8. Re: GPFS Lease expired causing node expulsion

    On Aug 8, 10:45 am, openstream rob wrote:
    > Pete
    >
    > How did you get on with the certification issue?
    >
    > We are still hoping our problem is hardware related. I'm swapping of
    > servers about to see if the problem follows the hardware.
    > It has developed though, as now we can install but during the first 5
    > hours of running, the same node reboots. There are absolutely no log
    > entries relating to why anywhere on the system. There's just a missing
    > 20 minutes of logs and a reboot entry when it comes up again. Ever
    > seen that?
    >
    > Rob



    I think you should go to the latest GPFS fix level, which is
    currently 3.1.0-13, in case you are still running GPFS 3.1.0.7.
    See
    https://www14.software.ibm.com/webap...nload/aix.html
    for details.

    In case you use etherchannel i would suggest not to use etherchannel
    to see if the problem is simply network related.
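
    To get some evidence either way on the network theory, a simple
    connect probe between the nodes can be left running during an
    install. Below is a minimal Python sketch that times one TCP
    connect and reports a failure as None. The function name is made up
    for illustration. Port 1191 is the usual GPFS daemon port, but
    treat that and the host name in the comment as assumptions for your
    own setup.

```python
import socket
import time
from typing import Optional


def probe(host: str, port: int, timeout: float = 2.0) -> Optional[float]:
    """Return the TCP connect time in seconds, or None if it failed."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return None


# Example use (hypothetical host name, assumed GPFS daemon port):
#   latency = probe("db2", 1191)
#   print("failed" if latency is None else f"{latency * 1000:.1f} ms")
```

    Calling this in a loop with a timestamped print, once with and once
    without the aggregated interface, gives a record to compare against
    the time of the next expel message.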

    hth
    Hajo


  9. Re: GPFS Lease expired causing node expulsion

    On Aug 8, 3:45 am, openstream rob wrote:
    > Pete
    >
    > How did you get on with the certification issue?
    >
    > We are still hoping our problem is hardware related. I'm swapping of
    > servers about to see if the problem follows the hardware.
    > It has developed though, as now we can install but during the first 5
    > hours of running, the same node reboots. There are absolutely no log
    > entries relating to why anywhere on the system. There's just a missing
    > 20 minutes of logs and a reboot entry when it comes up again. Ever
    > seen that?
    >
    > Rob


    I'm still working the certification issue, now with our Oracle 'Feel
    Good Person'. Oracle says IBM has to initiate the certify on gpfs 3.1
    and IBM says the opposite. Kinda getting circular. Rereading thru
    all cert related docs, see that gpfs 2.1, 2.2 & 2.3 are supported.

    Don't recall missing 20 minutes of logs. One thing I did with Oracle
    was upgraded to 10.2.0.3 and applied the latest patch cluster on top
    of that, I think PC6. I had to do a little searching on metalinks.
    If you would like to ask some more in-depth questions, email me at my
    profile email.

    HTH,
    Pete's


  10. Re: GPFS Lease expired causing node expulsion

    On Aug 8, 5:27 am, Hajo Ehlers wrote:
    > On Aug 8, 10:45 am, openstream rob wrote:
    >
    > > Pete

    >
    > > How did you get on with the certification issue?

    >
    > > We are still hoping our problem is hardware related. I'm swapping of
    > > servers about to see if the problem follows the hardware.
    > > It has developed though, as now we can install but during the first 5
    > > hours of running, the same node reboots. There are absolutely no log
    > > entries relating to why anywhere on the system. There's just a missing
    > > 20 minutes of logs and a reboot entry when it comes up again. Ever
    > > seen that?

    >
    > > Rob

    >
    > I think you should to go to the latest GPFS Fix level which is
    > currently 3.1.0-13 in case you are still running GPFS 3.1.0.7
    > See https://www14.software.ibm.com/webapp/set2/sas/f/gpfs/download/aix.html
    > for deatils
    >
    > In case you use etherchannel i would suggest not to use etherchannel
    > to see if the problem is simply network related.
    >
    > hth
    > Hajo


    I had already tried 3.1.0.13. Speaking with an IBM analyst, 3.2 is
    supposed to be out, but I'm not looking to go to it anytime soon.

    It could be etherchannel, but I have this running on several partitions
    and have only seen it on one that happened to be running 2.3.

    Pete's


  11. Re: GPFS Lease expired causing node expulsion

    > I'm still working the certification issue, now with our Oracle 'Feel
    > Good Person'. Oracle says IBM has to initiate the certify on gpfs 3.1
    > and IBM says the opposite. Kinda getting ciruclar. Rereading thru
    > all cert related docs, see that gpfs 2.1, 2.2 & 2.3 are supported.
    >
    > Don't recall missing 20 minutes of logs. One thing I did with Oracle
    > was upgraded to 10.2.0.3 and applied the latest patch cluster on top
    > of that, I think PC6. I had to do a little searching on metalinks.
    > If you would like to ask some more in-depth questions, email me at my
    > profile email.
    >
    > HTH,
    > Pete's


    Was thinking a little more about your 20 minutes of missing logs and
    then a reboot. One question though, what version of Oracle RAC are
    you using? i.e. I was testing 10.2.0.2 and was seeing spontaneous
    reboots. in troubleshooting the issue, Oracle had me install a Patch
    Cluster (3 or 4, I don't remember which)on top of 10.2.0.2, and,
    install their OS Watcher. After the patch cluster, I never saw the
    reboot issue. If I recall correctly, I saw in the $ORA_CRS_HOME/logs/
    logs that one of the clusterware processes was initiating
    the reboot for unknown reasons.

    HTH,
    Pete's


  12. Re: GPFS Lease expired causing node expulsion

    On 8 Aug, 11:27, Hajo Ehlers wrote:
    > On Aug 8, 10:45 am, openstream rob wrote:
    >
    > > Pete

    >
    > > How did you get on with the certification issue?

    >
    > > We are still hoping our problem is hardware related. I'm swapping of
    > > servers about to see if the problem follows the hardware.
    > > It has developed though, as now we can install but during the first 5
    > > hours of running, the same node reboots. There are absolutely no log
    > > entries relating to why anywhere on the system. There's just a missing
    > > 20 minutes of logs and a reboot entry when it comes up again. Ever
    > > seen that?

    >
    > > Rob

    >
    > I think you should to go to the latest GPFS Fix level which is
    > currently 3.1.0-13 in case you are still running GPFS 3.1.0.7
    > See https://www14.software.ibm.com/webapp/set2/sas/f/gpfs/download/aix.html
    > for deatils
    >
    > In case you use etherchannel i would suggest not to use etherchannel
    > to see if the problem is simply network related.
    >
    > hth
    > Hajo


    Hi Hajo

    I'll look into the gpfs update. Thanks

    We don't use etherchannel, we are using VIPA. I am looking into how
    easy it will be to deconfigure and retry.
    I'm also going to reset the database server numbering. And see if the
    fault follows the hardware. This will be a good test too.

    Thanks

    Rob


  13. Re: GPFS Lease expired causing node expulsion

    On 9 Aug, 03:43, Pete's wrote:
    > > I'm still working the certification issue, now with our Oracle 'Feel
    > > Good Person'. Oracle says IBM has to initiate the certify on gpfs 3.1
    > > and IBM says the opposite. Kinda getting ciruclar. Rereading thru
    > > all cert related docs, see that gpfs 2.1, 2.2 & 2.3 are supported.

    >
    > > Don't recall missing 20 minutes of logs. One thing I did with Oracle
    > > was upgraded to 10.2.0.3 and applied the latest patch cluster on top
    > > of that, I think PC6. I had to do a little searching on metalinks.
    > > If you would like to ask some more in-depth questions, email me at my
    > > profile email.

    >
    > > HTH,
    > > Pete's

    >
    > Was thinking a little more about your 20 minutes of missing logs and
    > then a reboot. One question though, what version of Oracle RAC are
    > you using? i.e. I was testing 10.2.0.2 and was seeing spontaneous
    > reboots. in troubleshooting the issue, Oracle had me install a Patch
    > Cluster (3 or 4, I don't remember which)on top of 10.2.0.2, and,
    > install their OS Watcher. After the patch cluster, I never saw the
    > reboot issue. If I recall correctly, I saw in the $ORA_CRS_HOME/logs/
    > logs that one of the clusterware processes was initiating
    > the reboot for unknown reasons.
    >
    > HTH,
    > Pete's


    Hi Pete

    We're up to 10.0.2.03 PC 6 already and still seeing the fault. I'll
    monitor the logs you mention, but we've seen nothing in them so far!
    I'm going to try and isolate this as a hardware fault by swapping the
    nodes about, I'll let you know.

    I'll email you regarding your setup for tiebreaker. cheers

    Thanks

    Rob


  14. Re: GPFS Lease expired causing node expulsion

    On Aug 9, 8:41 am, openstream rob wrote:
    > On 9 Aug, 03:43, Pete's wrote:
    >
    >
    >
    > > > I'm still working the certification issue, now with our Oracle 'Feel
    > > > Good Person'. Oracle says IBM has to initiate the certify on gpfs 3.1
    > > > and IBM says the opposite. Kinda getting ciruclar. Rereading thru
    > > > all cert related docs, see that gpfs 2.1, 2.2 & 2.3 are supported.

    >
    > > > Don't recall missing 20 minutes of logs. One thing I did with Oracle
    > > > was upgraded to 10.2.0.3 and applied the latest patch cluster on top
    > > > of that, I think PC6. I had to do a little searching on metalinks.
    > > > If you would like to ask some more in-depth questions, email me at my
    > > > profile email.

    >
    > > > HTH,
    > > > Pete's

    >
    > > Was thinking a little more about your 20 minutes of missing logs and
    > > then a reboot. One question though, what version of Oracle RAC are
    > > you using? i.e. I was testing 10.2.0.2 and was seeing spontaneous
    > > reboots. in troubleshooting the issue, Oracle had me install a Patch
    > > Cluster (3 or 4, I don't remember which)on top of 10.2.0.2, and,
    > > install their OS Watcher. After the patch cluster, I never saw the
    > > reboot issue. If I recall correctly, I saw in the $ORA_CRS_HOME/logs/
    > > logs that one of the clusterware processes was initiating
    > > the reboot for unknown reasons.

    >
    > > HTH,
    > > Pete's

    >
    > Hi Pete
    >
    > We're up to 10.0.2.03 PC 6 already and still seeing the fault. I'll
    > monitor the logs you mention, but we've seen nothing in them so far!
    > I'm going to try and isolate this as a hardware fault by swapping the
    > nodes about, I'll let you know.
    >
    > I'll email you regarding your setup for tiebreaker. cheers
    >
    > Thanks
    >
    > Rob


    No headway on certification yet but check the following out:

    http://download.oracle.com/docs/cd/B....htm#sthref388

    And on Metalink, Note Id: 302806.1.

    The install guide states 2.3 or higher, the Note id states 3.1 is not
    certified. I think Oracle needs to define which one is right.

    Pete's


  15. Re: GPFS Lease expired causing node expulsion

    On Aug 9, 11:36 am, Pete's wrote:
    > On Aug 9, 8:41 am, openstream rob wrote:
    >
    > > Hi Pete

    >
    > > We're up to 10.0.2.03 PC 6 already and still seeing the fault. I'll
    > > monitor the logs you mention, but we've seen nothing in them so far!
    > > I'm going to try and isolate this as a hardware fault by swapping the
    > > nodes about, I'll let you know.

    >
    > > I'll email you regarding your setup for tiebreaker. cheers

    >
    > > Thanks

    >
    > > Rob

    >
    > No headway on certification yet but check the following out:
    >
    > http://download.oracle.com/docs/cd/B...2/b14201/preai...
    >
    > And on Metalink, Note Id: 302806.1.
    >
    > The install guide states 2.3 or higher, the Note id states 3.1 is not
    > certified. I think Oracle needs to define which one is right.
    >
    > Pete's


    Not much more on certification of GPFS 3.1, found out that they are
    close but not sure how close, yet. I guess I'll be checking
    Metalink. I started working on reverting a Test Cluster from 3.1 to
    2.3, this one was easy. The concepts manual in the IBM Information
    Centers tell you how. This one worked because I started out with 2.3,
    upgraded to 3.1 and now back down to 2.3. My Production Cluster, I'm
    not so lucky. I started out with 3.1 and the only way to go back is
    to destroy the file systems and start over.

    If I find out more about certification, I'll post back.

    Pete's



  16. Re: GPFS Lease expired causing node expulsion

    Came across some info last week; looks like certification is very
    near for 3.1.

    HTH,
    Pete's


  17. Re: GPFS Lease expired causing node expulsion

    Hi Pete

    Been away. Sorry for the delay.

    We've got a few months to "go live", so I'm going to wait on 3.1
    certification.

    Recent builds have also been error free, which is puzzling. I did open
    up the server and reseat the comms cards. Maybe it is a hardware
    fault?
    I'm tracking it to see what happens.

    I'll keep you posted. Thanks for opening up to email, when I've got
    some specifics, I'll email you. Thanks again Pete

    Rob


  18. Re: GPFS Lease expired causing node expulsion

    On Aug 21, 4:59 am, openstream rob wrote:
    > Hi Pete
    >
    > Been away. Sorry for the delay.
    >
    > We've got a few months to "go live", so I'm going to wait on 3.1
    > certification.
    >
    > Recent builds have also been error free, which is puzzling. I did open
    > up the server and reseat the comms cards. Maybe it is a hardware
    > fault?
    > I'm tracking it to see what happens.
    >
    > I'll keep you posted. Thanks for opening up to email, when I've got
    > some specifics, I'll email you. Thanks again Pete
    >
    > Rob


    By the way, the certification matrix has been updated. GPFS 3.1 is
    supported with 10.2.0.3.

    Pete's

