Ideas to write watchdog - Linux




  1. Ideas to write watchdog

    Hi all!

    I'm writing a kind of watchdog that will kill a process if it
    notices that something is going wrong.
    At the moment I'm checking the number of file handles and the amount of
    memory allocated by the process. Does anyone have an idea what other
    process parameters could be checked?
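A minimal sketch of the two checks mentioned above, assuming a Linux /proc filesystem. The function names and the limit policy are hypothetical, not taken from the thread. /proc/<pid>/status also exposes fields such as Threads and State, which the same parser style could read.

```python
import os

def open_fd_count(pid):
    """Count open file descriptors by listing /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))

def status_int(status_text, field):
    """Return the integer value of a field such as VmRSS (in kB) or Threads
    from the text of /proc/<pid>/status, or None if the field is absent."""
    for line in status_text.splitlines():
        if line.startswith(field + ":"):
            return int(line.split()[1])
    return None

def resident_kb(pid):
    """Resident set size in kB, read from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as f:
        return status_int(f.read(), "VmRSS")

def over_limits(pid, max_fds, max_rss_kb):
    """Hypothetical policy: True when either limit is exceeded."""
    rss = resident_kb(pid)
    return open_fd_count(pid) > max_fds or (rss is not None and rss > max_rss_kb)
```

A watchdog loop would call over_limits(pid, ...) periodically and decide what to do when it returns True. The limits themselves have to come from observing the application under normal load.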

    Regards
    Jarek

  2. Re: Ideas to write watchdog



    On Jan 23, 5:46 pm, Jarek wrote:
    > Hi all!
    >
    > I'm writing a kind of watchdog that will kill a process if it
    > notices that something is going wrong.
    > At the moment I'm checking the number of file handles and the amount of
    > memory allocated by the process. Does anyone have an idea what other
    > process parameters could be checked?


    This depends on whether you can modify the source code of the processes.
    >
    > Regards
    > Jarek



  3. Re: Ideas to write watchdog

    Bin Chen wrote:
    >
    > This depends on whether you can modify the source code of the processes.


    I can. This is my code and it contains a lot of self-checking procedures
    which are also verified by an external watchdog. Now I'm trying to find all
    OS-level abnormalities.
    I'm trying to create a totally bullet-proof application.

    regards
    Jarek

  4. Re: Ideas to write watchdog



    On Jan 23, 7:05 pm, Jarek wrote:
    > Bin Chen wrote:
    >
    > > This depends on whether you can modify the source code of the processes.
    >
    > I can. This is my code and it contains a lot of self-checking procedures
    > which are also verified by an external watchdog. Now I'm trying to find all
    > OS-level abnormalities.

    What do you mean by OS-level abnormality? Falling into a while(1) loop or
    a deadlock, or quitting abnormally (segfault etc.)?
    A common way is to add a new message type to your messaging
    infrastructure that must be acked within a timeout, otherwise the
    process is considered dead. Of course I suppose your applications are
    event driven and have a messaging system already.

    For the watchdog process:

    send_heartbeat_message();
    acked = wait_ack_with_timeout();
    if (!acked)              /* timeout expired without an ack */
        process_is_dead();
    ....
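The heartbeat idea above can be sketched as follows. This is only an illustration under assumptions: it uses a multiprocessing Pipe as the message channel and a thread as a stand-in for the monitored process, and the names is_alive, responder, "ping" and "ack" are hypothetical.

```python
import threading
from multiprocessing import Pipe

def is_alive(conn, timeout):
    """Send a heartbeat and wait up to `timeout` seconds for the ack."""
    conn.send("ping")
    if not conn.poll(timeout):      # nothing arrived within the timeout
        return False
    return conn.recv() == "ack"

def responder(conn):
    """Stand-in for the monitored process's message loop (answers one ping)."""
    if conn.recv() == "ping":
        conn.send("ack")

# Healthy case: the other end of the pipe answers the heartbeat.
watchdog_end, app_end = Pipe()
threading.Thread(target=responder, args=(app_end,), daemon=True).start()
healthy = is_alive(watchdog_end, timeout=1.0)

# Hung case: nobody reads the other end, so the ack never arrives.
hung_watchdog_end, hung_app_end = Pipe()
hung = is_alive(hung_watchdog_end, timeout=0.2)
```

A real implementation would probably tag each heartbeat with a sequence number, so that a late ack for an old ping is not mistaken for the answer to the current one.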

    > I'm trying to create a totally bullet-proof application.
    >
    > regards
    > Jarek



  5. Re: Ideas to write watchdog

    Bin Chen wrote:
    > What do you mean by OS-level abnormality? Falling into a while(1) or


    The application I'm developing is changing very fast. There is no way to
    test it deeply after every change, but it has to work all the time.
    I want to avoid the cases like file handle or memory leaks or possibly
    other programming errors.
    Such things can happen in specific conditions, after days or weeks of
    operation, and I want to detect them ASAP. It may even happen that the
    application works properly, processing data as expected, but "eats" some
    system resources (I once had such a case: fclose was inside a try...catch,
    and when a specific exception occurred the file handle remained open. After
    two months of operation the application had eaten all handles and crashed).
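The leak described above can be reproduced in a few lines. The sketch below uses raw descriptors from Python's os module instead of the original fclose code, and the function names and the "DATA" header check are made up for illustration. The leaky version closes the descriptor only on the success path, while the safe one closes it in a finally clause.

```python
import os

def read_header_leaky(path):
    fd = os.open(path, os.O_RDONLY)
    try:
        header = os.read(fd, 4)
        if header != b"DATA":
            raise ValueError("bad header")
        os.close(fd)        # skipped whenever the exception is raised
        return header
    except ValueError:
        return None         # fd is still open: one descriptor leaked per bad file

def read_header_safe(path):
    fd = os.open(path, os.O_RDONLY)
    try:
        header = os.read(fd, 4)
        if header != b"DATA":
            raise ValueError("bad header")
        return header
    except ValueError:
        return None
    finally:
        os.close(fd)        # runs on every path, including the error path
```

Each call to read_header_leaky on a bad file leaves one more descriptor open, which is the kind of slow growth an external descriptor count can reveal long before the limit is reached.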

    > A common way is to add a new message type to your messaging
    > infrastructure that must be acked within a timeout, otherwise the
    > process is considered dead. Of course I suppose your applications are
    > event driven and have a messaging system already.


    I'm doing it with several counters: if a counter stops growing, it means
    that a specific thread has a problem. The counters are in a shared memory
    segment and are verified by the watchdog.
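The watchdog side of such a counter check might look like the sketch below. It covers only the "has this counter stopped growing" logic. How the values are read from the shared memory segment is left out, and the class name and the max_idle_seconds parameter are hypothetical.

```python
import time

class StallDetector:
    """Flags a counter whose value has not changed for too long."""

    def __init__(self, max_idle_seconds):
        self.max_idle = max_idle_seconds
        self.last = {}  # counter name -> (value, time the value last changed)

    def update(self, name, value, now=None):
        """Record a reading; return True if the counter has been stuck too long."""
        now = time.monotonic() if now is None else now
        prev = self.last.get(name)
        if prev is None or prev[0] != value:
            self.last[name] = (value, now)   # progress was made, restart the clock
            return False
        return now - prev[1] > self.max_idle
```

The watchdog would call update() for every counter on each polling cycle and treat a True result as a sign that the corresponding thread is stuck. The idle limit has to be longer than the slowest legitimate pause of that thread.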

    regards
    Jarek

  6. Re: Ideas to write watchdog

    Jarek wrote:
    > Bin Chen wrote:
    >
    >> What do you mean by OS-level abnormality? Falling into a while(1) or
    >
    > The application I'm developing is changing very fast. There is no way to
    > test it deeply after every change, but it has to work all the time.


    So you mean you have no time to develop your application properly but
    you have time to design, develop, write and test a watchdog application
    instead?

    > I want to avoid the cases like file handle or memory leaks or possibly
    > other programming errors.


    Then design and program your application properly.

    > Such things can happen in specific conditions, after days or weeks of
    > operation, and I want to detect them ASAP. It may even happen that the


    Such things can happen in unspecific conditions, after only a few
    milliseconds or even microseconds of operation and you will be unable to
    detect them in time. An application mallocing without freeing or opening
    without closing may do so very fast, giving you no chance to intervene.

    You might be able to catch "well-misbehaving" applications, but you
    won't catch really misbehaving applications. Also you will be able to
    catch those problems that you have foreseen, but you will fail to catch
    those problems that you haven't foreseen.

    > application works properly, processing data as expected, but "eats" some
    > system resources (I once had such a case: fclose was inside a try...catch,
    > and when a specific exception occurred the file handle remained open. After
    > two months of operation the application had eaten all handles and crashed).


    That's why I love C++ and programming with exceptions ... not.

    --
    Josef Möllers (Pinguinpfleger bei FSC)
    If failure had no penalty success would not be a prize
    -- T. Pratchett


  7. Re: Ideas to write watchdog

    Josef Moellers wrote:
    > So you mean you have no time to develop your application properly but
    > you have time to design, develop, write and test a watchdog application
    > instead?


    It is not exactly like this. Sometimes I need to make a change in the
    application and deploy it immediately. Then I may have a month or more
    until the next change.

    >> I want to avoid the cases like file handle or memory leaks or possibly
    >> other programming errors.

    >
    > Then design and program your application properly.


    Very good idea. If it is so easy, why are there no bug-free programs?

    > Such things can happen in unspecific conditions, after only a few
    > milliseconds or even microseconds of operation and you will be unable to
    > detect them in time. An application mallocing without freeing or opening
    > without closing may do so very fast, giving you no chance to intervene.


    If it happens that fast, I can detect it immediately during
    preliminary testing with tools like valgrind. The problem is with the
    rare conditions. I can spend a few days or a week on testing, but not months.

    Jarek

  8. Re: Ideas to write watchdog

    Jarek wrote:
    > Josef Moellers wrote:


    >> Then design and program your application properly.

    >
    >
    > Very good idea. If it is so easy, why are there no bug-free programs?


    I never said that it was easy. Programming is a difficult job, a VERY
    difficult job, requiring numerous skills.
    I merely wondered why you have the opportunity to spend time writing
    this monitoring application but do not have time to design your
    application properly.

    There are numerous reasons why there are very few bug-free programs. A
    lot of books are written on the subject.
    In my experience, the main reasons are:
    - the problem to be solved is not fully understood and/or specified
    - the problem to be solved changed during/after programming
    - the programmer lacks some skill(s) (e.g. doesn't know the programming
    language used)
    - communication between parties in a large project is bad
    - third party products (components written by others) are undocumented
    or badly documented.

    Let me reply with another question:
    If yours is a good idea, why are there not more programs that rely on a
    monitoring application?

    You're fighting the symptoms, not curing the disease.

    --
    Josef Möllers (Pinguinpfleger bei FSC)
    If failure had no penalty success would not be a prize
    -- T. Pratchett


  9. Re: Ideas to write watchdog

    Josef Moellers wrote in news:ep7cul$mch$1@nntp.fujitsu-siemens.com:

    > If yours is a good idea, why are there not more programs that rely on a
    > monitoring application?


    I have worked on a number of applications that use watchdog / monitoring
    programs, though perhaps there are not so many in the open source world.

    Any critical application that requires a high degree of availability would
    be well advised to implement such a facility. No matter how perfect you
    think your application may be, it is always possible that you've missed
    something that may show up in actual deployment. The fact that you are
    allowing for the possibility of a bug in your program does not mean that
    you failed to "design your application properly."

    A watchdog is just another form of "defense in depth". Is there some
    reason why you are so hostile to this idea?

    GH

  10. Re: Ideas to write watchdog

    Gil Hamilton wrote:
    > Josef Moellers wrote in news:ep7cul$mch$1@nntp.fujitsu-siemens.com:
    >
    >
    >>If yours is a good idea, why are there not more programs that rely on a
    >>monitoring application?

    >
    >
    > I have worked on a number of applications that use watchdog / monitoring
    > programs, though perhaps there are not so many in the open source world.
    >
    > Any critical application that requires a high degree of availability would
    > be well advised to implement such a facility. No matter how perfect you
    > think your application may be, it is always possible that you've missed
    > something that may show up in actual deployment. The fact that you are
    > allowing for the possibility of a bug in your program does not mean that
    > you failed to "design your application properly."


    OK, I can accept that.

    > A watchdog is just another form of "defense in depth". Is there some
    > reason why you are so hostile to this idea?


    I've worked on quite a number of projects where the main focus was on
    testing, rather than on developing. And I have worked with a number of
    people who, rather than think twice, write code twice.

    OTOH I have worked on a project where our team first drew up a detailed
    problem analysis, then made a detailed implementation specification,
    both of which were discussed with and approved by the customer, then we
    coded to the implementation specification. The project was finished on
    time, runs extremely stably, and the customer was more than satisfied.

    From the message, I got the
    impression that Jarek was just trying to fight symptoms.
    --
    Josef Möllers (Pinguinpfleger bei FSC)
    If failure had no penalty success would not be a prize
    -- T. Pratchett

