fuser -k: hardly useful - HP UX

  1. fuser -k: hardly useful

    hi!

    "fuser -k" sends a SIGKILL (not SIGTERM) to processes having a specific
    file(system) open. This feature is used by ServiceGuard to free up busy
    mountpoints (or at least try to).

    However this is unfortunate:

    1) Many programs have a cleanup handler that would be triggered for SIGTERM,
    but SIGKILL terminates the process immediately. I had a problem with SAP
    R/3 being terminated that way. Only after having manually removed the IPC
    structures the system would come up again (didn't want to reboot).

    2) Even when sending a SIGKILL, some multi-megabyte process (that could be
    paged out) will take a few seconds to terminate.

    I suggest:

    1) User-selectable signal, or even better: try a SIGTERM, wait a defined
    amount of time for the processes to terminate, and then send SIGKILL to the
    remaining processes, and then wait again until they terminated.

    2) freeup_busy_mountpoint() should wait a bit after trying "fuser -k" before
    trying to unmount filesystems.

    Regards,
    Ulrich

  2. Re: fuser -k: hardly useful

    Ulrich Windl writes:

    > "fuser -k" sends a SIGKILL (not SIGTERM) to processes having a specific
    > file(system) open. This feature is used by ServiceGuard to free up busy
    > mountpoints (or at least try to).


    As I understand, the use of "fuser -k" in ServiceGuard is not intended
    for "normal" shutdown of the packaged application, but as a last-ditch
    way to make sure the mountpoint is free for mount/umount operations.

    > However this is unfortunate:
    >
    > 1) Many programs have a cleanup handler that would be triggered for SIGTERM,
    > but SIGKILL terminates the process immediately. I had a problem with SAP
    > R/3 being terminated that way. Only after having manually removed the IPC
    > structures the system would come up again (didn't want to reboot).


    In the ServiceGuard control script, there is a place for "customer
    defined halt commands". You should put there all the commands required
    to shut the packaged application down in as orderly a fashion as
    possible. See the functions named "customer_defined_halt_cmds" and the
    corresponding "customer_defined_run_cmds".

    You can even program an escalation: try the "normal"
    shutdown method(s) first, wait for their completion (or arrange for a
    timeout), and if that did not work use sterner measures as applicable.

    The halt commands may also report a failure to the control script:
    this is taken to mean that the package cannot then be restarted
    without manual assistance. This is useful if auto-starting after a bad
    shutdown would cause data corruption. If you don't need to worry about
    that, you must ensure the halt commands will always return a
    "successful" error code.
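
    A minimal sketch of such an escalation with a timeout, in POSIX shell.
    It assumes the halt commands signal failure through a non-zero exit
    status. The names app_stop, STOP_TIMEOUT, run_with_timeout and
    halt_application are hypothetical and only serve as illustration; they
    are not part of ServiceGuard.

```shell
#!/bin/sh
# Sketch only: app_stop, STOP_TIMEOUT, run_with_timeout and halt_application
# are hypothetical names, not ServiceGuard functions or parameters.

STOP_TIMEOUT=${STOP_TIMEOUT:-120}   # seconds to allow for an orderly shutdown

# Run a command in the background and poll until it exits or the limit passes.
# Usage: run_with_timeout SECONDS command [args...]
run_with_timeout() {
    limit=$1
    shift
    "$@" &
    cmd_pid=$!
    waited=0
    while kill -0 "$cmd_pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            kill -s TERM "$cmd_pid" 2>/dev/null   # give up on the stop command
            return 124                            # arbitrary "timed out" status
        fi
        sleep 1
        waited=$((waited + 1))
    done
    wait "$cmd_pid"   # propagate the command's own exit status
}

# Decision point: 0 = application is down and safe to move,
# non-zero = manual attention is needed.
halt_application() {
    if run_with_timeout "$STOP_TIMEOUT" app_stop; then
        return 0
    fi
    echo "orderly shutdown failed or timed out" >&2
    return 1
}
```

    The idea would be to call something like halt_application from the
    customer defined halt commands and let its exit status decide whether
    the halt is reported as successful.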

    > I suggest:


    > 1) User-selectable signal, or even better: try a SIGTERM, wait a defined
    > amount of time for the processes to terminate, and then send
    > SIGKILL to the remaining processes, and then wait again until
    > they terminated.


    You can, and should, implement this in customer_defined_halt_cmds
    IF and ONLY IF it is an appropriate method to shut down the
    application in the ServiceGuard package. It should be simple to
    script: collect the PIDs with fuser -c (without -k) to see if there is
    anything to stop; if any, send the necessary signals, wait a suitable
    amount. Then collect a new set of PIDs and do the same with SIGKILL or
    just use fuser -k. Then wait a while and check again.
    (Do not repeat this forever! A process might be stuck as unkillable
    because of an I/O error. If you cannot kill the process in a
    reasonable time with SIGKILL, you should timeout and abort the package
    move, because there are some pending I/O operations that are not
    getting to the package disk.)
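
    The steps above could be sketched roughly like this in POSIX shell. It
    assumes that fuser prints the PIDs on standard output and everything
    else on standard error. The function names list_pids, wait_until_free
    and free_mountpoint are hypothetical, and the wait times are arbitrary
    arguments, not ServiceGuard settings.

```shell
#!/bin/sh
# Sketch only: list_pids, wait_until_free and free_mountpoint are
# hypothetical helper names, not ServiceGuard functions.

# Print the PIDs using the file system on one line (empty if none).
# Assumes fuser writes PIDs to stdout and its annotations to stderr.
list_pids() {
    echo $(fuser -c "$1" 2>/dev/null)
}

# Poll once per second until the file system is free or $2 seconds passed.
wait_until_free() {
    waited=0
    while [ -n "$(list_pids "$1")" ]; do
        [ "$waited" -ge "$2" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Usage: free_mountpoint MOUNT_POINT TERM_WAIT KILL_WAIT
# Returns 0 if the file system is free, 1 if something survived SIGKILL.
free_mountpoint() {
    for sig in TERM KILL; do
        [ -z "$(list_pids "$1")" ] && return 0
        for pid in $(list_pids "$1"); do
            [ "$pid" = "$$" ] && continue   # never signal this script itself
            kill -s "$sig" "$pid" 2>/dev/null
        done
        if [ "$sig" = TERM ]; then limit=$2; else limit=$3; fi
        wait_until_free "$1" "$limit" && return 0
    done
    return 1   # bounded: do not loop forever on an unkillable process
}
```

    A call such as "free_mountpoint /some/mountpoint 30 10" (path and
    times made up) returns non-zero when processes remain after the
    SIGKILL wait, which is the point where the package move should be
    aborted instead of retried.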

    As I understand it, ServiceGuard's automatic "fuser -k" is meant to
    clear out any random obstacles to (u)mounting the disk, e.g.
    administrator sessions that happened to be cd'd to the package disk,
    processes from OVO or another monitoring system trying to check the
    status of the application, and the like.

    IMO, the shutdown of the packaged application should NEVER be left for
    ServiceGuard's "fuser -k", since ServiceGuard's design assumes that
    the application should already be shut down (either in a controlled
    way, or forcefully if the sysadmin has scripted a known-good way to
    forcefully shut down the application) at that point.

    The end of "customer_defined_halt_cmds" is a decision point for
    ServiceGuard: the application shutdown should be complete and the
    return code should tell whether the application is safe for a move to
    another node or not.

    If the answer is yes, ServiceGuard's next goal is to get the package
    disk(s) (and IP addresses, if any) detached from this node so that
    another node may start using them. Any processes still using the
    package resources are assumed to be incidental and unrelated to the
    application. Thus, rough handling (fuser -k) of those processes is
    permitted.

    If the answer is no, ServiceGuard assumes there might be an unexpected
    problem that might cause data loss if the package disks were moved.
    In this situation, ServiceGuard sets the package to be runnable on
    this node *only*: this must be undone manually by
    "cmmodpkg -e -n " after the sysadmin has ensured the
    package is in fact safe to move.

    If you let the "fuser -k" at the umount phase take care of shutting
    down the application, you are skipping this decision point completely.
    If you choose to do so, you should know what you're doing.

    > 2) freeup_busy_mountpoint() should wait a bit after trying "fuser -k" before
    > trying to unmount filesystems.


    I agree on this. As a workaround, you could increase the
    FS_UMOUNT_COUNT value, so that ServiceGuard will make more attempts to
    unmount the filesystem before giving up. I've found the value of 3 to
    be reliable for most cases when the application shutdown is already
    handled elsewhere.
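
    To illustrate the idea of pausing between "fuser -k" and the next
    unmount attempt, here is a small sketch in POSIX shell. It is not the
    code of freeup_busy_mountpoint(); retry_umount and its arguments are
    hypothetical, and the attempt count merely plays the role that
    FS_UMOUNT_COUNT plays in the control script.

```shell
#!/bin/sh
# Sketch only: retry_umount and its arguments are hypothetical; this is not
# the code of ServiceGuard's freeup_busy_mountpoint().

# Usage: retry_umount MOUNT_POINT ATTEMPTS PAUSE_SECONDS
retry_umount() {
    attempt=1
    while [ "$attempt" -le "$2" ]; do
        umount "$1" 2>/dev/null && return 0
        fuser -k -c "$1" >/dev/null 2>&1   # SIGKILL whatever still holds the mount
        sleep "$3"                          # let the killed processes exit
        attempt=$((attempt + 1))
    done
    return 1   # still busy after all attempts
}
```

    The pause gives large or paged-out processes time to actually exit
    before the next umount is tried.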

    --
    Matti.Kurkela@welho.com

  3. Re: fuser -k: hardly useful

    Matti Juhani Kurkela writes:

    > Ulrich Windl writes:
    >
    > > "fuser -k" sends a SIGKILL (not SIGTERM) to processes having a specific
    > > file(system) open. This feature is used by ServiceGuard to free up busy
    > > mountpoints (or at least try to).

    >
    > As I understand, the use of "fuser -k" in ServiceGuard is not intended
    > for "normal" shutdown of the packaged application, but as a last-ditch
    > way to make sure the mountpoint is free for mount/umount operations.
    >
    > > However this is unfortunate:
    > >
    > > 1) Many programs have a cleanup handler that would be triggered for SIGTERM,
    > > but SIGKILL terminates the process immediately. I had a problem with SAP
    > > R/3 being terminated that way. Only after having manually removed the IPC
    > > structures the system would come up again (didn't want to reboot).

    >
    > In the ServiceGuard control script, there is a place for "customer
    > defined halt commands". You should put there all the commands required
    > to shut the packaged application down in as orderly a fashion as
    > possible. See the functions named "customer_defined_halt_cmds" and the
    > corresponding "customer_defined_run_cmds".


    Have you ever tried sending a kill to an application that uses >6GB of virtual
    memory, and in addition has a handler to do database roll-backs? (Not to mention
    users who want to do GB commits.) While you start the command to shut down
    the application cleanly, things may hang. In addition, some user may have
    started some background process, keeping a filesystem busy.

    >
    > You can even program an escalation: try the "normal"
    > shutdown method(s) first, wait for their completion (or arrange for a
    > timeout), and if that did not work use sterner measures as applicable.


    I'm afraid I'll have to do that.

    >
    > The halt commands may also report a failure to the control script:
    > this is taken to mean that the package cannot then be restarted
    > without manual assistance. This is useful if auto-starting after a bad
    > shutdown would cause data corruption. If you don't need to worry about
    > that, you must ensure the halt commands will always return a
    > "successful" error code.
    >
    > > I suggest:

    >
    > > 1) User-selectable signal, or even better: try a SIGTERM, wait a defined
    > > amount of time for the processes to terminate, and then send
    > > SIGKILL to the remaining processes, and then wait again until
    > > they terminated.

    >
    > You can, and should, implement this in customer_defined_halt_cmds
    > IF and ONLY IF it is an appropriate method to shut down the
    > application in the ServiceGuard package. It should be simple to
    > script: collect the PIDs with fuser -c (without -k) to see if there is
    > anything to stop; if any, send the necessary signals, wait a suitable
    > amount. Then collect a new set of PIDs and do the same with SIGKILL or
    > just use fuser -k. Then wait a while and check again.
    > (Do not repeat this forever! A process might be stuck as unkillable
    > because of an I/O error. If you cannot kill the process in a
    > reasonable time with SIGKILL, you should timeout and abort the package
    > move, because there are some pending I/O operations that are not
    > getting to the package disk.)


    OK!

    >
    > As I understand it, ServiceGuard's automatic "fuser -k" is meant to
    > clear out any random obstacles to (u)mounting the disk, e.g.
    > administrator sessions that happened to be cd'd to the package disk,
    > processes from OVO or another monitoring system trying to check the
    > status of the application, and the like.
    >
    > IMO, the shutdown of the packaged application should NEVER be left for
    > ServiceGuard's "fuser -k", since ServiceGuard's design assumes that
    > the application should already be shut down (either in a controlled
    > way, or forcefully if the sysadmin has scripted a known-good way to
    > forcefully shut down the application) at that point.


    I agree. Life's not simple all the time, however.

    >
    > The end of "customer_defined_halt_cmds" is a decision point for
    > ServiceGuard: the application shutdown should be complete and the
    > return code should tell whether the application is safe for a move to
    > another node or not.
    >
    > If the answer is yes, ServiceGuard's next goal is to get the package
    > disk(s) (and IP addresses, if any) detached from this node so that
    > another node may start using them. Any processes still using the
    > package resources are assumed to be incidental and unrelated to the
    > application. Thus, rough handling (fuser -k) of those processes is
    > permitted.
    >
    > If the answer is no, ServiceGuard assumes there might be an unexpected
    > problem that might cause data loss if the package disks were moved.
    > In this situation, ServiceGuard sets the package to be runnable on
    > this node *only*: this must be undone manually by
    > "cmmodpkg -e -n " after the sysadmin has ensured the
    > package is in fact safe to move.
    >
    > If you let the "fuser -k" at the umount phase take care of shutting
    > down the application, you are skipping this decision point completely.
    > If you choose to do so, you should know what you're doing.
    >
    > > 2) freeup_busy_mountpoint() should wait a bit after trying "fuser -k" before
    > > trying to unmount filesystems.

    >
    > I agree on this. As a workaround, you could increase the
    > FS_UMOUNT_COUNT value, so that ServiceGuard will make more attempts to
    > unmount the filesystem before giving up. I've found the value of 3 to
    > be reliable for most cases when the application shutdown is already
    > handled elsewhere.


    Well, as always: It looks like work to do ;-)

    Regards,
    Ulrich
