GAB Timeouts when UFSDUMP is run. - Veritas Cluster Server


  1. GAB Timeouts when UFSDUMP is run.


    All right, here is a real brain tickler. I have a crontab job that runs every
    Wednesday at 0400 hrs and copies the boot device on my server from its
    primary boot disk to its alternate using ufsdump / ufsrestore. It is simply
    a ufsdump string that performs this operation, nothing funky.
    Every time this job runs, I get the following error messages:

    Jan 15 04:33:41 gab: [ID 524258 kern.notice] GAB:20057: Port h process 7394 inactive 7 sec
    Jan 15 04:33:42 gab: [ID 524258 kern.notice] GAB:20057: Port h process 7394 inactive 8 sec
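To see how long the stall gets on each run, messages like the two above can be summarized with a small script. This is only a sketch: the regular expression is inferred from the two lines quoted in this post, not from any Veritas specification, and the function name `max_inactivity` is made up for illustration.

```python
import re
from collections import defaultdict

# Pattern inferred from the two GAB:20057 lines quoted in the post (an assumption,
# not an official message format).
GAB_INACTIVE = re.compile(
    r"GAB:20057: Port (?P<port>\w+) process (?P<pid>\d+) inactive (?P<secs>\d+) sec"
)


def max_inactivity(lines):
    """Return {(port, pid): longest inactive seconds seen} for GAB:20057 lines."""
    worst = defaultdict(int)
    for line in lines:
        m = GAB_INACTIVE.search(line)
        if m:
            key = (m.group("port"), int(m.group("pid")))
            worst[key] = max(worst[key], int(m.group("secs")))
    return dict(worst)


# The two messages from the post, each joined onto a single line.
SAMPLE = [
    "Jan 15 04:33:41 gab: [ID 524258 kern.notice] GAB:20057: Port h process 7394 inactive 7 sec",
    "Jan 15 04:33:42 gab: [ID 524258 kern.notice] GAB:20057: Port h process 7394 inactive 8 sec",
]

if __name__ == "__main__":
    for (port, pid), secs in max_inactivity(SAMPLE).items():
        print(f"port {port} pid {pid}: inactive up to {secs} sec")
```

Feeding it the system log from the backup window instead of `SAMPLE` would show whether the inactivity is creeping toward the GAB time-out from week to week.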

    I have other servers that perform the same operation but do not seem to be
    affected by this condition. The unique characteristic of this server, in
    comparison with the others, is the load average. This server runs an extremely
    large OLTP Oracle database.

    I'm simply wondering if anyone else has run into this before.

    Regards,
    John Williams


  2. Re: SOLUTION: GAB Timeouts when UFSDUMP is run.


    All,
    Here is what I have found regarding the GAB problems. Earlier I alluded
    to this problem seeming to be more prevalent on servers with a higher load
    average. I no longer think load is the cause. It contributes to the problem,
    but it is not the culprit. The problem resides solely within the ufsdump command.

    There is a thread associated with the ufsdump command called "ufs_log worker".
    This thread performs a persistent flush of the filesystem being backed up,
    which in turn takes a mutex on the master inode of that filesystem.
    Given that we are backing up root this way, you can imagine the impact.
    The duration of this mutex depends upon how active the filesystem is. If
    this is the root partition of a heavily utilized server (e.g. a production
    database server), the mutex is held for a long time (1 to 2 minutes).
    While it is held, GAB can no longer communicate with HAD and subsequently
    attempts to reset HAD. Luckily the lock duration has not exceeded the GAB
    time-out limit. If it had, the server would have panicked. Hence, Sun does
    not recommend or support using ufsdump on an active filesystem.

    I have run this past Sun, and they have found other customer cases supporting
    our findings. The recommended solution is to discontinue using ufsdump to
    back up the root filesystem. In its place we have a number of options: cpio,
    FS snapshots, GNU tar, etc.
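For the snapshot option, one possible shape is to snapshot the live filesystem with `fssnap` and run the dump against the snapshot device rather than the mounted root. The sketch below only prints the commands, it does not run them. Every path in it is an assumption for illustration (`/snapstore` as a backing store, which has to sit on a different filesystem from the one being snapshotted, `/dev/rfssnap/0` as the raw snapshot device, `/mnt/altroot` as the mounted alternate disk), and the helper name `snapshot_copy_plan` is made up. Whether this actually avoids the GAB stalls on a given server is not something this thread confirms.

```python
import shlex


def snapshot_copy_plan(mount_point, backing_store, snap_raw_device, target_dir):
    """Return the shell commands, in order, for a snapshot-based copy (dry run only)."""
    return [
        # Create a UFS snapshot of the live filesystem; fssnap reports the snapshot device.
        f"fssnap -F ufs -o bs={shlex.quote(backing_store)} {shlex.quote(mount_point)}",
        # Dump the snapshot (not the live filesystem) and restore it onto the alternate disk.
        f"ufsdump 0f - {shlex.quote(snap_raw_device)} | "
        f"(cd {shlex.quote(target_dir)} && ufsrestore rf -)",
        # Delete the snapshot once the copy has finished.
        f"fssnap -d {shlex.quote(mount_point)}",
    ]


if __name__ == "__main__":
    # All four arguments are hypothetical values, not taken from the original post.
    for cmd in snapshot_copy_plan("/", "/snapstore", "/dev/rfssnap/0", "/mnt/altroot"):
        print(cmd)
```

Printing the plan first makes it easy to review the device names by hand before anything touches the boot disks.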

    From the ufsdump man page:

    DESCRIPTION
    ufsdump backs up all files specified by files_to_dump (normally either
    a whole file system or files within a file system changed after a certain
    date) to magnetic tape, diskette, or disk file. When running ufsdump,
    the file system must be inactive; otherwise, the output of ufsdump may
    be inconsistent and restoring files correctly may be impossible. A file
    system is inactive when it is unmounted or the system is in single user
    mode. A file system is not considered inactive if one tree of the file system
    is quiescent while another tree has files or directories being modified.

    Regards,
    John Williams