Cluster node panics - Veritas Cluster Server



Thread: Cluster node panics

  1. Cluster node panics


    Does anyone know this phenomenon? On our Solaris 8 node with VCS 1.3 we have
    3 heartbeats.
    Sometimes we get the following messages, and afterwards one node panics:

    1) VCS:13027:monitor procedure did not complete within the expected time.
    2) VCS:10023:Agent DiskGroup not sending alive messages since ...
    3) VCS:10009:Agent DiskGroup has faulted 6 times in less than 950 seconds
    -- Will not attempt to restart
    then we can read in the messages:
    panic[cpu8]/thread=2a100085d40:
    Mar 3 19:53:03 pbludge1 unix: [ID 354483 kern.notice] GAB: Port h halting
    system due to client process failure
    Mar 3 19:53:03 pbludge1 unix: [ID 100000 kern.notice]
    Mar 3 19:53:03 pbludge1
    Mar 3 19:53:03 pbludge1 genunix: [ID 723222 kern.notice] 000002a1000856c0
    gab:gab_halt+2f0 (21670507, 0, 22, 1, 2a100085d44, 0)
    We didn't have a heavy load on our systems, nor any problems on our network.

    thanks in advance
    Carolin



  2. Re: Cluster node panics

    Hi,

    panic is a kernel entry point that can also be called by software. Your
    panic string looks like the cluster committed suicide. Does the active
    or passive node panic? Is one of your heartbeats disk based?

    Do you have any idea why the monitoring of the disk group fails
    (for example, instability of the corresponding storage)?
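
    To check the heartbeat links and the DiskGroup agent's own log, something
    like the following should work on a standard install (only a sketch; log
    names and paths may differ slightly on VCS 1.3):

    # lltstat -nvv | more                     (state of each LLT heartbeat link)
    # gabconfig -a                            (GAB port memberships; port h = had)
    # more /var/VRTSvcs/log/engine_A.log      (VCS engine log)
    # more /var/VRTSvcs/log/DiskGroup_A.log   (DiskGroup agent log)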

    Carolin Maurer wrote:
    >
    > Does anyone know this phenomenon? On our Solaris 8 node with VCS 1.3 we have
    > 3 heartbeats.
    > Sometimes we get the following messages, and afterwards one node panics:
    >
    > 1) VCS:13027:monitor procedure did not complete within the expected time.
    > 2) VCS:10023:Agent DiskGroup not sending alive messages since ...
    > 3) VCS:10009:Agent DiskGroup has faulted 6 times in less than 950 seconds
    > -- Will not attempt to restart
    > then we can read in the messages:
    > panic[cpu8]/thread=2a100085d40:
    > Mar 3 19:53:03 pbludge1 unix: [ID 354483 kern.notice] GAB: Port h halting
    > system due to client process failure
    > Mar 3 19:53:03 pbludge1 unix: [ID 100000 kern.notice]
    > Mar 3 19:53:03 pbludge1
    > Mar 3 19:53:03 pbludge1 genunix: [ID 723222 kern.notice] 000002a1000856c0
    > gab:gab_halt+2f0 (21670507, 0, 22, 1, 2a100085d44, 0)
    > We didn't have a heavy load on our systems, nor any problems on our network.
    >
    > thanks in advance
    > Carolin


  3. Re: Cluster node panics


    I am experiencing the same problem, except that my database folks told me
    that they were starting up another Sybase instance when the system crashed.
    So far we have been able to recreate this once and have it work two other
    times. The message is always a CPU panic. Has anyone resolved this?



    Nov 5 14:05:42 hiinvest1 gab: [ID 184552 kern.notice] GAB:20035: Port h
    attempting to kill process due to client process failure
    Nov 5 14:05:57 hiinvest1 last message repeated 1 time
    Nov 5 14:06:27 hiinvest1 unix: [ID 836849 kern.notice]
    Nov 5 14:06:27 hiinvest1 ^Mpanic[cpu4]/thread=2a1000d7d40:
    Nov 5 14:06:28 hiinvest1 unix: [ID 354483 kern.notice] GAB: Port h halting
    system due to client process failure
    Nov 5 14:06:28 hiinvest1 unix: [ID 100000 kern.notice]
    Nov 5 14:06:28 hiinvest1 genunix: [ID 723222 kern.notice] 000002a1000d76c0
    gab:gab_halt+2f0 (21670507, 0, 23, 1, 8, 0)
    Nov 5 14:06:28 hiinvest1 genunix: [ID 179002 kern.notice] %l0-3: 000002a1000d777c
    000002a1000d779e 0000000000000047 0000000000800000
    Nov 5 14:06:28 hiinvest1 %l4-7: 00000300003197a8 0000000000000083 000000000057e8c0
    0000000000000000
    Nov 5 14:06:28 hiinvest1 genunix: [ID 723222 kern.notice] 000002a1000d77d0
    gab:gab_kill_process+160 (10507, 2a1000d7d44, 22, 1, 2a1000d7d44, 2a1000d7d3c)
    Nov 5 14:06:28 hiinvest1 genunix: [ID 179002 kern.notice] %l0-3: 0000000021670507
    0000000000000507 0000000000000007 000000007834ab10
    Nov 5 14:06:28 hiinvest1 %l4-7: 000000001102ff40 00000300056330d0 0000000000000000
    000002a1000d7880
    Nov 5 14:06:28 hiinvest1 genunix: [ID 723222 kern.notice] 000002a1000d78c0
    gab:gab_timerscan+764 (0, 2a1000d7d40, 20, 10423a00, 781ed138, 0)
    Nov 5 14:06:28 hiinvest1 genunix: [ID 179002 kern.notice] %l0-3: 000000007837f650
    0000030004d26a90 00000000781edd20 000002a100193d40
    Nov 5 14:06:28 hiinvest1 %l4-7: 000003000a323cf8 0000000000000008 0000000000000000
    000002a100193a00
    Nov 5 14:06:28 hiinvest1 genunix: [ID 723222 kern.notice] 000002a1000d79d0
    genunix:callout_execute+90 (bffffffffed8206f, 1, 300002b7038, 1ecd25d, 300002b6038,
    0)
    Nov 5 14:06:28 hiinvest1 genunix: [ID 179002 kern.notice] %l0-3: 0000000078358b80
    8000000000000000 0000000000000009 00000300002b7320
    Nov 5 14:06:28 hiinvest1 %l4-7: 0000000001ecd25d 00000300002b6000 000003000a473f00
    000002a1000d79e0
    Nov 5 14:06:29 hiinvest1 genunix: [ID 723222 kern.notice] 000002a1000d7a80
    genunix:taskq_thread+18c (30002793b38, 0, 10423a00, 10000, 30002793b6a, 30002793b90)
    Nov 5 14:06:29 hiinvest1 genunix: [ID 179002 kern.notice] %l0-3: 0000000010072774
    0000030002793b68 0000030002793b60 0000030002793b38
    Nov 5 14:06:29 hiinvest1 %l4-7: 0000030002793b58 0000030002791bc8 00000300002158f0
    0000000000000002
    Nov 5 14:06:29 hiinvest1 unix: [ID 100000 kern.notice]
    Nov 5 14:06:29 hiinvest1 genunix: [ID 672855 kern.notice] syncing file systems...
    Nov 5 14:06:29 hiinvest1 genunix: [ID 733762 kern.notice] 12
    Nov 5 14:06:31 hiinvest1 genunix: [ID 733762 kern.notice] 3
    Nov 5 14:06:42 hiinvest1 last message repeated 7 times
    Nov 5 14:06:43 hiinvest1 genunix: [ID 616637 kern.notice] cannot sync --
    giving up
    Nov 5 14:06:44 hiinvest1 genunix: [ID 353387 kern.notice] dumping to /dev/dsk/c0t10d0s1,
    offset 1932460032
    Nov 5 14:08:38 hiinvest1 genunix: [ID 409368 kern.notice] ^M100% done: 178263
    pages dumped, compression ratio 3.31,
    Nov 5 14:08:38 hiinvest1 genunix: [ID 851671 kern.notice] dump succeeded


  4. Re: Cluster node panics


    GAB is a kernel module. If the real-time process "had" (the VCS engine) does
    not heartbeat to GAB, GAB will first kill the "had" process, which will subsequently
    be restarted by "hashadow". If "had" is hung in the kernel and GAB is unable
    to kill it, the node is immediately halted with the message you posted.
    See p. 236 of the VCS 2.0 User's Guide.
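
    For reference, whether "had" is registered with GAB can be checked with
    "gabconfig -a"; on a healthy two-node cluster the output looks roughly like
    the sketch below (generation numbers and membership bits will differ):

    # gabconfig -a
    GAB Port Memberships
    ===============================================================
    Port a gen   a36e0003 membership 01
    Port h gen   fd570002 membership 01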

    If the system is inadvertently dropped to the boot monitor (ok prompt) by a
    break sequence or Stop-A, and "go" is later issued on the console, GAB will
    note the lack of communication with "had" and halt the node. Appendix A of the VCS
    User's Guide recommends disabling the break sequence on cluster nodes to eliminate
    this risk. See p. 302.
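
    For illustration only (Appendix A describes the recommended procedure): on
    Solaris 8 the keyboard abort sequence can be disabled by setting the entry
    below in /etc/default/kbd and applying it with "kbd -i" (or at the next
    reboot):

    # /etc/default/kbd -- prevent Stop-A/break from dropping the node to the
    # ok prompt (an accidental "go" afterwards would trigger the GAB halt above)
    KEYBOARD_ABORT=disable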

    Whenever a system leaves the cluster membership, for whatever reason, the membership
    identifier is incremented on the surviving nodes to denote the change of membership.
    It takes 16 seconds for the LLT peer-inactive timeout plus 5 seconds for the GAB
    stable timeout, 21 seconds in total. If the same node attempts to rejoin within
    this timeframe, the joining system is sent an iofence message with reason "quick
    reopen". This results in "had" being killed and recycled by "hashadow". Using the
    -b option in /etc/gabtab panics the box immediately if "had" is unable to heartbeat
    to GAB. See pp. 235-237. There are numerous other tunables to enhance or modify
    the default behavior.
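
    As a sketch of where those knobs live (verify the exact syntax against the
    pages cited above for your VCS version): the 16-second figure corresponds to
    LLT's peerinact timer, which can be set in /etc/llttab in 1/100-second units,
    and the -b flag goes on the gabconfig line in /etc/gabtab, e.g. for a
    two-node cluster:

    # /etc/llttab (excerpt) -- 1600 x 0.01s = 16s peer-inactive timeout
    set-timer peerinact:1600

    # /etc/gabtab -- seed with 2 nodes; -b halts the node immediately if "had"
    # cannot heartbeat to GAB (the default is to try to kill "had" first)
    /sbin/gabconfig -c -n2 -b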

    If an intermittent heartbeat network problem exists, the logs will indicate a
    change in membership just prior to the halt, and the subsequent restoration of
    the links will cause the higher-numbered node to panic in a two-node cluster.
    See p. 291.
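
    A quick illustrative check (message text varies between versions) is to
    search the system log for GAB/LLT membership and link changes just before
    the halt:

    # egrep -i "GAB|LLT|membership" /var/adm/messages | more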

    In short, it's best to open a support case for this issue. Once you obtain
    the case number, tar up and send in your vmcore.x and unix.x files from the
    panic (always use the case ID in the filenames), fetch our vxexplore script,
    and ftp the resulting file back to ftp.veritas.com:/incoming . The vxexplore
    script may be retrieved here:

    ftp://ftp.veritas.com/pub/support/vxexplore.tar.Z

    We would need to see the vxexplore output from each cluster node to properly
    diagnose the cause of the issue. One corefile (the latest) should be sufficient
    unless the conditions or root cause are suspected to differ between panics.
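
    As an illustration of packaging the dump files (this assumes the default
    Solaris savecore location, and 12345678 is a placeholder case number):

    # savecore writes the panic dump pair to /var/crash/<hostname>
    cd /var/crash/`hostname`
    ls -l unix.* vmcore.*
    # tar and compress the latest pair (unix.0/vmcore.0 here stand for the most
    # recent dump), tagging the filename with the case ID
    tar cvf 12345678-`hostname`-crash.tar unix.0 vmcore.0
    compress 12345678-`hostname`-crash.tar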

    Regards,
    -Bryan.

    "Tom Pietschmann" wrote:
    >
    >I am experiencing the same problem, except that my database folks told me
    >that they were starting up another Sybase instance when the system crashed.
    >So far we have been able to recreate this once and have it work two other
    >times. The message is always a CPU panic. Has anyone resolved this?
    >
    >[panic log snipped]

