Subsystem heartbeat - IBM AS400

Thread: Subsystem heartbeat

  1. Subsystem heartbeat

    Hello,

    I would like to know if there is a way I could write a CL program or
    something similar to let me know whether a particular subsystem is up
    across 37 systems. I have a product that needs 3 subsystems up and
    running in order to work properly, but sometimes those subsystems
    drop (usually on test boxes) and I don't find out until after the
    fact. So I'd like to be able to run a program, or even schedule it,
    to go out to these 37 machines and check whether the subsystem is
    active.

    I'm not an expert at programming in the AS400 world, so please be kind
    and dumb down your answers for me...

    Thanks,

    Chris

  2. Re: Subsystem heartbeat

    You could use the QWDRSBSD (Retrieve Subsystem Information) API.
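The API hands its answer back in a binary receiver variable, so whatever calls it has to pick the status field out of that buffer. Below is a minimal Python sketch of that decoding step only. It assumes the SBSI0100 format with the subsystem status as a 10-character field at offset 28, and it assumes EBCDIC CCSID 37; both should be checked against IBM's API reference for your release. The helper `fake_receiver` is hypothetical and exists only to build demo data. On the system itself the call would be made from CL or RPG; Python is used here purely to illustrate the layout.

```python
import struct

# Assumed SBSI0100 header layout (verify against IBM's API reference):
#   offset 0   BINARY(4)  bytes returned
#   offset 4   BINARY(4)  bytes available
#   offset 8   CHAR(10)   subsystem description name
#   offset 18  CHAR(10)   subsystem description library
#   offset 28  CHAR(10)   subsystem status, e.g. *ACTIVE or *INACTIVE
SBSI0100_HEADER = struct.Struct(">ii10s10s10s")
EBCDIC = "cp037"  # assumption: the system uses CCSID 37


def decode_sbsi0100(receiver):
    """Decode the fixed header of an SBSI0100 receiver variable."""
    _returned, _available, name, library, status = SBSI0100_HEADER.unpack_from(receiver)
    return {
        "name": name.decode(EBCDIC).rstrip(),
        "library": library.decode(EBCDIC).rstrip(),
        "status": status.decode(EBCDIC).rstrip(),
    }


def is_active(receiver):
    """True when the decoded status field reads *ACTIVE."""
    return decode_sbsi0100(receiver)["status"] == "*ACTIVE"


def fake_receiver(name, library, status):
    """Hypothetical helper: build demo bytes, not real API output."""
    def pad(text):
        return text.ljust(10).encode(EBCDIC)
    size = SBSI0100_HEADER.size
    return SBSI0100_HEADER.pack(size, size, pad(name), pad(library), pad(status))


if __name__ == "__main__":
    demo = fake_receiver("QBATCH", "QSYS", "*ACTIVE")
    print(decode_sbsi0100(demo), is_active(demo))
```

Anything other than *ACTIVE is treated as "not up" here, which seems the safer default for a heartbeat check.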

  3. Re: Subsystem heartbeat

    On Apr 30, 10:13 am, Chris wrote:
    > [quoted post trimmed; see post 1 above]


    If you have the TAA Tools, this command may help you:

    Convert WRKSBS output - TAA (CVTWRKSBS)

    This is the content of my sample output file (from it you can easily
    check whether the subsystem is active):

    Subsystem    Subsystem   Active   Status   Total
    name         Number      Jobs              K Stg
    QBATCH       218289               ACTIVE
    QCMN         218285               ACTIVE
    QCTL         218263               ACTIVE
    QINTER       220480               ACTIVE
    QPGMR        220481               ACTIVE
    QSERVER      218288               ACTIVE
    QSNADS       220483               ACTIVE
    QSPL         218299      1        ACTIVE
    QSVCDRCTR    218279               ACTIVE
    QSYSWRK      218264               ACTIVE
    Q1PGSCH      218286               ACTIVE
    RBTSLEEPER   218310               ACTIVE
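
Checking a listing like the one above for the subsystems you care about could be sketched as follows. This is only an illustration in Python, assuming the listing has been exported as plain text with whitespace-separated columns; the real command writes to an output file, and the subsystem name `MYAPPSBS` is made up for the example.

```python
# A few rows copied from the sample listing above.
SAMPLE = """\
Subsystem    Subsystem   Active   Status   Total
name         Number      Jobs              K Stg
QBATCH       218289               ACTIVE
QINTER       220480               ACTIVE
QSPL         218299      1        ACTIVE
"""


def active_subsystems(listing):
    """Return the names of all subsystems whose row shows ACTIVE."""
    active = set()
    for line in listing.splitlines():
        fields = line.split()
        # Rows look like: name, number, [active jobs], status, [storage].
        # The header rows never contain the upper-case word ACTIVE.
        if len(fields) >= 3 and "ACTIVE" in fields[2:]:
            active.add(fields[0])
    return active


def missing_subsystems(listing, required):
    """Return the required subsystems that are not shown as ACTIVE."""
    return sorted(set(required) - active_subsystems(listing))


if __name__ == "__main__":
    # MYAPPSBS is a hypothetical subsystem that is not in the listing.
    print(missing_subsystems(SAMPLE, ["QBATCH", "QINTER", "MYAPPSBS"]))
```

An empty result means everything required is up; anything else is the list to alert on.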

  4. Re: Subsystem heartbeat


    "Chris" wrote in message
    news:bf5561bd-18db-4277-988d-9949775cd759@f63g2000hsf.googlegroups.com...
    | [quoted post trimmed; see post 1 above]


    Chris,

    I solved a similar problem several years ago on a ten-machine network
    with a two-part solution.

    1) I put a submit job command in the system start-up CL to run a
    never-ending supervisor job. That supervisor job would check the
    system status every X minutes to verify that all other necessary jobs
    and subsystems were running as expected. If they were not, the
    supervisor job issued the necessary commands to restart them and then
    sent a network message, to a special user on one machine designated
    as the "network master", saying that it had to restart a production
    subsystem or job.

    2) Just to be sure that the supervisor program was always running, I
    made the supervisor job also execute a SBMRMTCMD to all other machines
    to run a CL program that checked whether the supervisor job on each of
    those machines was running. If it was not, or if the SBMRMTCMD failed
    because of communication problems or because the remote machine was
    not up, a network message was sent to the network master.

    A review of messages in the special user message queue on the network
    master provided diagnostic information on reliability of
    communications across our network, availability of our applications,
    and information about unexpected operator intervention on production
    subsystems and jobs.
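
One pass of a supervisor job like the one described above might be sketched as below. This is Python used only to show the control flow, not the original CL. The three callables are placeholders: on the system they would wrap whatever you use to test a subsystem, restart it, and message the network master. The subsystem names in the demo are invented.

```python
import time


def supervise_once(required, is_active, restart, notify):
    """Check each required subsystem once; restart and report any that are down.

    Returns the list of subsystems that had to be restarted.
    """
    restarted = []
    for subsystem in required:
        if is_active(subsystem):
            continue
        restart(subsystem)
        notify("Restarted subsystem " + subsystem)
        restarted.append(subsystem)
    return restarted


def supervise_forever(required, is_active, restart, notify,
                      interval_minutes=10, sleep=time.sleep):
    """Never-ending loop: run one pass, then wait X minutes and repeat."""
    while True:
        supervise_once(required, is_active, restart, notify)
        sleep(interval_minutes * 60)


if __name__ == "__main__":
    # In-memory stand-ins so the sketch can be run anywhere.
    state = {"APPSBS1": True, "APPSBS2": False}
    messages = []
    supervise_once(
        ["APPSBS1", "APPSBS2"],
        is_active=lambda name: state[name],
        restart=lambda name: state.update({name: True}),
        notify=messages.append,
    )
    print(messages)
```

Keeping the single pass separate from the loop makes it easy to run the same check once from a job scheduler instead of as a never-ending job.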

    If you want to follow a similar approach, I have three
    recommendations:

    1) In my network X was 10 minutes; you may want X to be shorter or
    longer.

    2) You should have a small table with one record defining each
    machine in your network so that you can easily add or drop systems
    without recoding the supervisor program.

    3) In your environment, having every one of 37 machines issue a
    remote command on all 36 other systems every X minutes may be massive
    overkill. If so, you can have the supervisor job do the remote
    command step only if it is itself running on the network master or on
    one of a small number of trusted primary systems.
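
Recommendations 2 and 3 could be combined roughly as follows. Again this is a Python sketch rather than CL, and everything named here is an assumption: the machine names, the `is_primary` flag, and the `run_remote_check` callable, which stands in for the remote command and is assumed to raise `OSError` when the target cannot be reached.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    """One record per system; add or drop rows instead of changing code."""
    name: str
    is_primary: bool = False


# Hypothetical table; on the system this would be a small database file.
MACHINES = [Machine("SYSA", is_primary=True), Machine("SYSB"), Machine("SYSC")]


def remote_targets(local_name, machines):
    """Only primary systems check the others; everyone else checks nobody."""
    local = next((m for m in machines if m.name == local_name), None)
    if local is None or not local.is_primary:
        return []
    return [m.name for m in machines if m.name != local_name]


def check_remotes(local_name, machines, run_remote_check, notify):
    """Run the remote check on each target and report every problem found."""
    problems = []
    for target in remote_targets(local_name, machines):
        try:
            ok = run_remote_check(target)
            reason = "supervisor job not running"
        except OSError as exc:
            ok = False
            reason = "unreachable (%s)" % exc
        if not ok:
            notify("%s: %s" % (target, reason))
            problems.append(target)
    return problems


if __name__ == "__main__":
    print(remote_targets("SYSA", MACHINES))  # the master checks the rest
    print(remote_targets("SYSB", MACHINES))  # a non-primary checks nobody
```

With one primary, each interval costs 36 remote commands instead of 37 x 36.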

    I hope this is useful. It may sound a little complex, but CL is so
    powerful that you will find it does not take very much code to build
    a system with all these features.

    Mike Sicilian

  5. Re: Subsystem heartbeat

    On Apr 30, 2:09 pm, scl...@aol.com wrote:
    > [quoted post trimmed; see post 3 above]


    This sounds cool in concept, but unfortunately we do not have the TAA
    Tools.

  6. Re: Subsystem heartbeat

    On Apr 30, 4:22 pm, "Mike" wrote:
    > [quoted post trimmed; see post 4 above]


    This sounds like a very viable solution. Thank you, Mike, for your
    explanation; I will talk to my teammates and see how we can implement
    this in our environment.

    Thanks again...

  7. Re: Subsystem heartbeat

    On May 1, 2:38 pm, Chris wrote:
    > [quoted posts trimmed; see posts 4 and 6 above]


    You could have an autostart job entry in each of the subsystems
    themselves that sends a heartbeat (system name and so on). You could
    hook that up via MQ or data queues, poll them, and then restart
    whatever falls down.

    Ron
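
The polling side of that idea might look something like this. It is a Python sketch in which a standard-library `queue.Queue` stands in for the MQ or data queue; the message shape (system name, subsystem name, timestamp) and the 60-second limit are assumptions made for the example.

```python
import queue


def drain_heartbeats(heartbeats, last_seen):
    """Pull every waiting heartbeat and remember the newest time per sender."""
    while True:
        try:
            sysname, subsystem, stamp = heartbeats.get_nowait()
        except queue.Empty:
            return last_seen
        key = (sysname, subsystem)
        last_seen[key] = max(stamp, last_seen.get(key, stamp))


def stale(expected, last_seen, now, max_age_seconds):
    """Return the expected senders with no heartbeat newer than max_age_seconds."""
    never = float("-inf")  # a sender that has never reported is always stale
    return sorted(
        key for key in expected
        if now - last_seen.get(key, never) > max_age_seconds
    )


if __name__ == "__main__":
    q = queue.Queue()
    q.put(("SYSA", "APPSBS1", 100.0))
    q.put(("SYSB", "APPSBS1", 40.0))
    seen = drain_heartbeats(q, {})
    expected = [("SYSA", "APPSBS1"), ("SYSB", "APPSBS1"), ("SYSC", "APPSBS1")]
    print(stale(expected, seen, now=130.0, max_age_seconds=60))
```

Whatever comes back from `stale` is the list of subsystems to restart or to alert on; a system that is down entirely shows up the same way, because it simply stops sending.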
