cluster behavior when all local disks fail - Veritas Cluster Server




  1. cluster behavior when all local disks fail

    Hi,

    Is there documentation on what should and should not happen when all local
    disks fail on a cluster node?

    We had a problem with a two-node VCS 3.5 cluster running Solaris 8:
    two SF6800s with FC shared storage and a local D240 (with four disks) each.
    Due to mis-cabling, the D240 on the first node failed during a power outage.

    The log files don't really explain what happened, and I only have the logs
    from the second node (of course).

    Can someone explain what happened and how it could have been prevented?


    Some of the logs:
    sr005 failed; sr006 is the surviving node.
    sr005 lost its disks at about 10:27.
    engine_A log:
    nothing until this line:
    TAG_E 2004/12/29 10:33:08 VCS:10077:received new cluster membership
    TAG_D 2004/12/29 10:33:08 VCS:10080:System (sr006) - Membership: 0x2, Jeopardy: 0x1
    TAG_B 2004/12/29 10:33:08 VCS:10084:System (sr005) is in Jeopardy Membership - Membership: 0x2, Visible: 0x0
    TAG_B 2004/12/29 10:33:08 VCS:10322:System sr005 (Node '0') changed state from RUNNING to FAULTED
    TAG_D 2004/12/29 10:33:08 VCS:10449:Group Parallel_grp autodisabled on node sr005 until it is probed
    TAG_D 2004/12/29 10:33:08 VCS:10449:Group admin_grp autodisabled on node sr005 until it is probed
    TAG_D 2004/12/29 10:33:08 VCS:10449:Group bill_grp autodisabled on node sr005 until it is probed
    TAG_D 2004/12/29 10:33:08 VCS:10449:Group dir_grp autodisabled on node sr005 until it is probed
    TAG_D 2004/12/29 10:33:08 VCS:10449:Group mag_grp autodisabled on node sr005 until it is probed
    TAG_D 2004/12/29 10:33:08 VCS:10449:Group push_grp autodisabled on node sr005 until it is probed
    TAG_D 2004/12/29 10:33:08 VCS:10449:Group snmp_grp autodisabled on node sr005 until it is probed
    TAG_D 2004/12/29 10:33:08 VCS:10449:Group sweeper_grp autodisabled on node sr005 until it is probed
    TAG_D 2004/12/29 10:33:08 VCS:10446:Group Parallel_grp is offline on system sr005
    TAG_D 2004/12/29 10:33:08 VCS:10446:Group mag_grp is offline on system sr005
    TAG_D 2004/12/29 10:33:08 VCS:10446:Group snmp_grp is offline on system sr005
    TAG_D 2004/12/29 10:33:08 VCS:10446:Group sweeper_grp is offline on system sr005
    TAG_E 2004/12/29 11:01:26 VCS:50135:User root fired command: hagrp -online mag_grp sr006 from 127.0.0.1
    The online failed; I don't know why (I wasn't there personally).

    /var/adm/messages:
    Dec 29 10:33:08 sr006 gab: [ID 495678 kern.notice] GAB:20036: Port h gen e9123503 membership ;1
    Dec 29 10:33:08 sr006 gab: [ID 789970 kern.notice] GAB:20038: Port h gen e9123503 k_jeopardy 0
    Dec 29 10:33:08 sr006 gab: [ID 881600 kern.notice] GAB:20040: Port h gen e9123503 visible 0
    nothing else


    Thanks.

    --
    I am root. If you see me laughing, you better have a backup.

  2. Re: cluster behavior when all local disks fail

    The Service Group was auto disabled (see documentation on reasons why)

    What you should have done:

    hagrp -autoenable mag_grp -sys sr005
    hagrp -online mag_grp -sys sr006
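In script form, the recovery sequence above might look like this. This is only a sketch assuming the VCS 3.5 `hagrp` syntax shown in this thread; the group and system names are this cluster's, the state checks are optional but cheap, and the guard makes it a harmless no-op on hosts without VCS installed:

```shell
#!/bin/sh
# Sketch: clear the autodisable flag for the faulted node, then online
# the group on the surviving node. Assumes VCS 3.5 hagrp syntax.
GROUP=mag_grp
FAULTED=sr005
SURVIVOR=sr006

if command -v hagrp >/dev/null 2>&1; then
    hagrp -state "$GROUP"                      # expect autodisabled/offline
    hagrp -autoenable "$GROUP" -sys "$FAULTED"
    hagrp -online "$GROUP" -sys "$SURVIVOR"
    hagrp -state "$GROUP"                      # expect ONLINE on $SURVIVOR
else
    echo "hagrp not found; VCS commands shown for reference only"
fi
```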




    Andreas Wohlfeld wrote:
    > [snip: original post quoted above]


  3. Re: cluster behavior when all local disks fail

    Me wrote:

    Ok, that's more or less clear, but how did the first node get faulted? There
    is no message other than "RUNNING to FAULTED". No failing of resources,
    nothing. How did the other node decide that quickly that all groups were
    offline on node one? Did it probe? Did it panic the system?


    > The Service Group was auto disabled (see documentation on reasons why)
    >
    > What you should have done:
    >
    > hagrp -autoenable mag_grp -sys sr005
    > hagrp -online mag_grp -sys sr006
    >
    >
    >
    >
    > Andreas Wohlfeld wrote:
    >> [snip: original post quoted above]



    --
    I am root. If you see me laughing, you better have a backup.

  4. Re: cluster behavior when all local disks fail

    TAG_E 2004/12/29 10:33:08 VCS:10077:received new cluster membership

    TAG_D 2004/12/29 10:33:08 VCS:10080:System (sr006) - Membership: 0x2,
    Jeopardy: 0x1

    TAG_B 2004/12/29 10:33:08 VCS:10084:System (sr005) is in Jeopardy
    Membership - Membership: 0x2, Visible: 0x0



    Hmm, so let me think. When will membership change?
    when the links go down, or
    when the node goes down (some error on the node or in the VCS software)

    So the next message might give it away...


    TAG_B 2004/12/29 10:33:08 VCS:10322:System sr005 (Node '0') changed
    state from RUNNING to FAULTED


    and then the messages file........

    Dec 29 10:33:08 sr006 gab: [ID 495678 kern.notice] GAB:20036: Port h gen e9123503 membership ;1
    Dec 29 10:33:08 sr006 gab: [ID 789970 kern.notice] GAB:20038: Port h gen e9123503 k_jeopardy 0
    Dec 29 10:33:08 sr006 gab: [ID 881600 kern.notice] GAB:20040: Port h gen e9123503 visible 0


    OK, so port "h" went south.
    Port "h" is used by the "had" process (the main VCS engine) to
    communicate with GAB. It looks like "had" died here (or the whole box died).

    If you had a look at the node that failed, you might see that it did.
    (Most of the time, if a disk dies it will "hang" the system, because a
    lot of I/O gets blocked in the kernel and the kernel tries very hard to
    retry and get some I/O going; a lot of processes or threads may be
    waiting on I/O.)

    You must also remember that the main VCS process (had) runs in user
    land, so anything running in the kernel at that stage will get a lot
    more time on the CPU(s). GAB also runs in the kernel and will try to
    communicate with "had". If it cannot talk to "had", it will actually kill
    the process (and hashadow will then restart it, eventually). All of
    these messages get logged in the /var/adm/messages file (on Solaris).
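For reference, GAB port registrations on a healthy node can be inspected with gabconfig. The sample output below is sketched from memory and the exact format may vary by VCS release; the point is that port h is the had registration:

```shell
# List GAB port memberships (run on a cluster node as root):
gabconfig -a
# GAB Port Memberships
# ===============================================
# Port a gen e9123502 membership 01    <- GAB itself
# Port h gen e9123503 membership 01    <- had (the VCS engine)
# "membership 01" means nodes 0 and 1 are both registered; in the log
# above, port h shows only node 1 (sr006) remaining after sr005 dropped.
```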

    It is a shame you don't have any info regarding what happened on the
    sr005 system. Even a crash file could be analysed by Veritas support to
    tell you what happened on the other system.


    The other system will stay autodisabled until it gets autoenabled or
    "had" restarts (and communicates the state of the system and its
    resources back to the remaining node).

    When the hagrp -online command was executed on sr006, the sr005 system
    was not back yet (maybe at the ok prompt?). It is also a pity you don't
    have the log files beyond this point; the failure reason would have been
    stated (most likely that the group was still autodisabled on the sr005
    node).

    Now, lastly, the human factor. Most of the time when someone messes up,
    they try to cover their tracks (by modifying or deleting the
    /var/adm/messages and /var/VRTSvcs/log/engine_A.log files).

    There are some other indicators to look for (reboot times; other logs
    for other applications or agents in /var/VRTSvcs/log that all stop at
    the same time and resume later, after a reboot or "go" at the ok prompt).

    Also search to see if any core or crash files were generated, and get
    them analysed.
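The "logs stop at the same time and resume later" check can be done mechanically. Here is a minimal sketch; the sample log is fabricated for the example, real use would point it at /var/adm/messages or the agent logs in /var/VRTSvcs/log, and it naively assumes all lines fall on the same day:

```shell
#!/bin/sh
# Sketch: flag long silences in a syslog-style file, i.e. the "logs stop
# and resume later" pattern described above. Sample data is fabricated.
log=$(mktemp)
cat > "$log" <<'EOF'
Dec 29 10:33:08 sr006 gab: membership change
Dec 29 10:33:08 sr006 gab: jeopardy notice
Dec 29 11:01:26 sr006 vcs: online attempt
EOF
awk '{
    split($3, t, ":")                         # HH:MM:SS -> seconds
    s = t[1]*3600 + t[2]*60 + t[3]
    if (prev != "" && s - prev > 600)         # flag gaps longer than 10 min
        printf "gap of %d seconds before: %s\n", s - prev, $0
    prev = s
}' "$log"
rm -f "$log"
```

On the sample above it flags the roughly 28-minute silence (1698 seconds) between 10:33:08 and 11:01:26.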



    Sorry for the long post, but I hope that explains it a bit more. I
    really suggest you get your hands on more log files or other evidence
    from the sr005 system.

    Andreas Wohlfeld wrote:
    > [snip: previous posts quoted above]


  5. Re: cluster behavior when all local disks fail

    Hi,

    thanks for your explanation.


    --
    I am root. If you see me laughing, you better have a backup.
