lost system drive, no failover - Veritas Cluster Server




  1. lost system drive, no failover


    We have a VCS cluster, version 3.5. No heartbeat disk configured. We lost
    the primary node system disk due to some array issue (boot from SAN). The
    other node did not initiate the failover. Neither node recorded any information
    about the secondary node trying to recover after the failure.

    My questions are:
    1. Is this a split brain? Was it caused because the secondary node could not
    tell whether the primary node was down or all private network links were down?
    2. Should the secondary node attempt a recovery? Why is there nothing in
    the log?
    3. Would a heartbeat disk or I/O fencing in 4.0 prevent this from happening?

    Thanks!

  2. Re: lost system drive, no failover

    Did the node drop to the OK prompt (without shutting down normally)?

    If that is the case, you will see some jeopardy messages, and some
    messages about Service Groups being autodisabled. This occurs when the
    second node does not know the state of the primary node, and thus does
    not know the state of the service groups (and resources) on the primary
    node.

    The main reason the secondary node did not attempt recovery is that it
    did not know the state of the primary node. Suppose only the links had
    failed: if the secondary had taken over, the same diskgroups and
    filesystems would have been mounted on both machines, and you would
    have corruption (and you would scream down the phone to Veritas because
    VCS caused data loss).
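
    The reasoning above can be sketched as a toy decision rule. This is only
    an illustration of the argument in this post, not how VCS is implemented;
    the function name, its parameters, and the returned strings are all made
    up for the example.

```python
def failover_action(heartbeats_seen: bool, peer_confirmed_down: bool) -> str:
    """Toy model of the survivor's choice (hypothetical, not VCS code).

    heartbeats_seen     -- the peer is still heard on the private network
    peer_confirmed_down -- the survivor has positive knowledge the peer is dead
    """
    if heartbeats_seen:
        # Peer is alive and reachable: nothing to do.
        return "no action"
    if peer_confirmed_down:
        # Safe to take over: nobody else can have the diskgroups imported.
        return "fail over"
    # Silence alone is ambiguous (dead node or dead links). Taking over
    # could mount the same filesystems on both machines, so hold back.
    return "autodisable"
```

    The case described in the thread corresponds to
    failover_action(False, False): the survivor hears nothing and cannot
    prove the peer is down, so it refuses to take over.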

    What I suggest is to use the Notifier resource to alert you when this
    happens, and then take action (by autoenabling the service groups if
    you know that the whole machine is down and will stay down for a while).

    So, in answer to your specific questions:

    1. A split brain happens if both nodes are still running (some
    service groups on one node and some service groups on the other), but
    there is no communication on the private network. (It sounds like that
    was not true in your case: the primary node went to the OK prompt.)

    2. As I explained, this is the autodisable feature of VCS.

    3. A heartbeat disk would not have helped (unless the primary machine
    was still running). I/O fencing would have left some keys on the data
    disks, and prevented the secondary node from taking over (if the primary
    went down hard). I/O fencing will help in a situation where there is a
    split brain and someone tries to force import a diskgroup onto a machine
    (while it is already imported on a second).



    Hope that explains it a bit.

    I suggest you check the /var/VRTSvcs/log/engine_A.log file around the
    time of the failure and look for "jeopardy" and "autodisable".
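
    If the log is large, a small script can pull out the relevant lines. The
    following is a minimal sketch; the function names are invented for this
    example, and it makes no assumption about the log line format beyond
    plain text containing those two words (in any letter case).

```python
import re

# Case-insensitive match; "autodisable" also catches "autodisabled".
PATTERN = re.compile(r"jeopardy|autodisable", re.IGNORECASE)


def find_events(lines):
    """Return (line_number, text) pairs for lines mentioning either keyword."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PATTERN.search(line)
    ]


def scan_log(path="/var/VRTSvcs/log/engine_A.log"):
    """Print matching lines from the engine log at the given path."""
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(path, errors="replace") as log:
        for number, text in find_events(log):
            print(f"{number}: {text}")
```

    Calling scan_log() on the surviving node prints each matching line with
    its line number, which makes it easy to compare against the time of the
    failure.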



    Jerry wrote:
    > We have a VCS cluster, version 3.5. No heartbeat disk configured. We lost
    > the primary node system disk due to some array issue (boot from SAN). The
    > other node did not initiate the failover. Neither node recorded any information
    > about the secondary node trying to recover after the failure.
    >
    > My questions are:
    > 1. Is this a split brain? Was it caused because the secondary node could not
    > tell whether the primary node was down or all private network links were down?
    > 2. Should the secondary node attempt a recovery? Why is there nothing in
    > the log?
    > 3. Would a heartbeat disk or I/O fencing in 4.0 prevent this from happening?
    >
    > Thanks!


  3. Re: lost system drive, no failover


    Thanks! That explains a lot.
