ntpd fscked up again - Networking

Thread: ntpd fscked up again

  1. ntpd fscked up again

    Here I had another case of why ntpd is finicky and cannot be relied
    on.

    ntp.conf specified a timeserver which is in datacenter.

    Datacenter had power outage.

    Power is restored.

    ntp client comes up before network switch and before timeserver.

    ntp does not see the timeserver. It quietly drops it from the list,
    and basically sits there doing nothing, since there is nothing else
    on the list. (and having more stuff on the list would not help)

    I get no notice from it whatsoever, nothing in log files, etc. It just
    sits there, doing nothing, not letting me know, time is drifting. I
    have 26 servers acting in this manner.

    Come on people. That's not how highly available programs behave. It
    should keep on trying, or at least exit if it cannot do its job. If it
    exited, my cron job with ntpdate would take over and fix things.

    This is on Ubuntu Hardy.
    --
    Due to extreme spam originating from Google Groups, and their inattention
    to spammers, I and many others block all articles originating
    from Google Groups. If you want your postings to be seen by
    more readers you will need to find a different means of
    posting on Usenet.
    http://improve-usenet.org/

  2. Re: ntpd fscked up again

    Ignoramus24166 wrote:
    > Here I had another case of why ntpd is finicky and cannot be relied
    > on.
    >
    > ntp.conf specified a timeserver which is in datacenter.
    >
    > Datacenter had power outage.
    >
    > Power is restored.
    >
    > ntp client comes up before network switch and before timeserver.
    >
    > ntp does not see the timeserver. It quietly drops it from the list,
    > and basically sits there doing nothing, since there is nothing else
    > on the list. (and having more stuff on the list would not help)
    >
    > I get no notice from it whatsoever, nothing in log files, etc. It just
    > sits there, doing nothing, not letting me know, time is drifting. I
    > have 26 servers acting in this manner.
    >
    > Come on people. That's not how highly available programs behave. It
    > should keep on trying, or at least exit if it cannot do its job. If it
    > exited, my cron job with ntpdate would take over and fix things.
    >
    > This is on Ubuntu Hardy.

    well the source is there..rewrite it yourself and submit..


  3. Re: ntpd fscked up again

    In comp.os.linux.misc Ignoramus24166 wrote:
    > ntp does not see the timeserver. It quietly drops it from the list,
    > and basically sits there doing nothing [...]


    > Come on people. That's not how highly available programs behave. It
    > should keep on trying, or at least exit if it cannot do its job. If it
    > exited, my cron job with ntpdate would take over and fix things.


    Unfortunately, it seems "they" are having a big discussion about this
    over in the ntpd corner. It's probably one of the most important feature
    requests for the application and they want to design it properly before
    releasing it. (This seems fair enough to me, but I must admit I'm a little
    bemused over the wrangling. Server doesn't respond? Lookup the DNS again
    maybe to get a different address. Try again. Run out of addresses? Start
    from the top. Lost all servers? Scream. (How?))

    Chris

  4. Re: ntpd fscked up again

    Chris Davies wrote:
    > In comp.os.linux.misc Ignoramus24166
    > wrote:
    >> ntp does not see the timeserver. It quietly drops it from the list, and
    >> basically sits there doing nothing [...]


    I have never noticed this, though it may well be true. I have a very simple
    system, 2 computers on my LAN, with the main machine connected to the
    Internet directly. The other machine uses the main machine as the time
    server. But my main machine is almost always up, except during power
    failures or when I am booting a new kernel. It is the other one that goes up
    and down (dual boot). So I am not likely to notice this except when a power
    failure happens and the other machine comes up first.
    >
    >> Come on people. That's not how highly available programs behave. It
    >> should keep on trying, or at least exit if it cannot do its job. If it
    >> exited, my cron job with ntpdate would take over and fix things.

    >
    > Unfortunately, it seems "they" are having a big discussion about this
    > over in the ntpd corner. It's probably one of the most important feature
    > requests for the application and they want to design it properly before
    > releasing it. (This seems fair enough to me, but I must admit I'm a
    > little bemused over the wrangling. Server doesn't respond? Lookup the DNS
    > again maybe to get a different address. Try again. Run out of addresses?
    > Start from the top. Lost all servers? Scream. (How?))
    >

    The how part seems easy to me: write it in /var/log/messages.


    --
    .~. Jean-David Beyer Registered Linux User 85642.
    /V\ PGP-Key: 9A2FC99A Registered Machine 241939.
    /( )\ Shrewsbury, New Jersey http://counter.li.org
    ^^-^^ 11:50:01 up 34 days, 17:56, 4 users, load average: 4.67, 4.52, 4.53

  5. Re: ntpd fscked up again

    Ignoramus24166 writes:

    >Here I had another case of why ntpd is finicky and cannot be relied
    >on.


    >ntp.conf specified a timeserver which is in datacenter.


    >Datacenter had power outage.


    >Power is restored.


    >ntp client comes up before network switch and before timeserver.


    >ntp does not see the timeserver. It quietly drops it from the list,
    >and basically sits there doing nothing, since there is nothing else
    >on the list. (and having more stuff on the list would not help)


    >I get no notice from it whatsoever, nothing in log files, etc. It just
    >sits there, doing nothing, not letting me know, time is drifting. I
    >have 26 servers acting in this manner.


    >Come on people. That's not how highly available programs behave. It
    >should keep on trying, or at least exit if it cannot do its job. If it
    >exited, my cron job with ntpdate would take over and fix things.


    >This is on Ubuntu Hardy.


    I am not at all sure that you are right that it is doing nothing. It is
    probably waiting. The behaviour of ntpd is to wait at least the poll
    interval (which by default has a minimum poll exponent of 6, i.e. 64 s)
    and try again. If that does not work, it backs off the poll interval,
    eventually waiting for poll exponent 10 (1024 s, roughly 17 min) before
    trying again. This is to prevent swamping the servers in just your case.
    Imagine you serve 10000 clients (which some of the main servers do). You
    go down and come up again, and then all 10000 clients bombard you with a
    request per second or a request per millisecond since they have not heard
    from you. You have a very, very sick server on your hands.
    I.e., how long have you let it run before deciding it is doing nothing?
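The poll values mentioned in the post above are exponents of two, in seconds, so the back-off ladder is easy to tabulate. A minimal sketch, assuming the usual defaults of minpoll 6 and maxpoll 10; the function name `poll_seconds` is made up for illustration:

```shell
#!/bin/bash
# Convert an ntpd poll exponent to seconds: interval = 2^exponent.
poll_seconds() {
    echo $((1 << $1))
}

# Print the back-off ladder from poll exponent 6 up to poll exponent 10.
for p in 6 7 8 9 10; do
    printf 'poll %2d = %4d s\n' "$p" "$(poll_seconds "$p")"
done
```

Even at the top of the ladder the interval is 1024 s, so a client that is still polling should have retried many times within a day.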

  6. Re: ntpd fscked up again

    On Sep 10, 5:10 am, Ignoramus24166 wrote:

    > Come on people. That's not how highly available programs behave. It
    > should keep on trying, or at least exit if it cannot do its job. If it
    > exited, my cron job with ntpdate would take over and fix things.


    For every person who complains that it doesn't track a change in DNS,
    there's a person who complains because it changes servers
    unpredictably. As for exiting if it cannot do its job, it is still
    available to synchronize the time with any server that might connect
    to it.

    This is a well-known problem with many daemons. You need to follow a
    practiced and tested power-up sequence if you lose power to more than
    one component. It is often necessary to power cycle some devices more
    than once to ensure that every device powers up with everything it
    needs to start up actually working.

    DS

  7. Re: ntpd fscked up again

    On 2008-09-10, Chris Davies wrote:
    > In comp.os.linux.misc Ignoramus24166 wrote:
    >> ntp does not see the timeserver. It quietly drops it from the list,
    >> and basically sits there doing nothing [...]

    >
    >> Come on people. That's not how highly available programs behave. It
    >> should keep on trying, or at least exit if it cannot do its job. If it
    >> exited, my cron job with ntpdate would take over and fix things.

    >
    > Unfortunately, it seems "they" are having a big discussion about this
    > over in the ntpd corner. It's probably one of the most important feature
    > requests for the application and they want to design it properly before
    > releasing it. (This seems fair enough to me, but I must admit I'm a little
    > bemused over the wrangling. Server doesn't respond? Lookup the DNS again
    > maybe to get a different address. Try again. Run out of addresses? Start
    > from the top. Lost all servers? Scream. (How?))
    >


    To me, the answer "just exit with nonzero exit code" and "do not lock
    up the socket you cannot make use of" would suffice. It would not be
    perfect, but it will work. "Keep trying if you cannot find anything"
    would be even better. Right now the result is the worst imaginable.


  8. Re: ntpd fscked up again

    On Wed, 10 Sep 2008 07:10:51 -0500, Ignoramus24166 wrote:
    > Here I had another case of why ntpd is finicky and cannot be relied
    > on.
    >
    > ntp.conf specified a timeserver which is in datacenter.
    >
    > Datacenter had power outage.
    >
    > Power is restored.
    >
    > ntp client comes up before network switch and before timeserver.
    >
    > ntp does not see the timeserver. It quietly drops it from the list,
    > and basically sits there doing nothing, since there is nothing else
    > on the list. (and having more stuff on the list would not help)


    cfengine could save your bacon here...

    --
    * John Oliver http://www.john-oliver.net/ *

  9. Re: ntpd fscked up again

    > Ignoramus24166 wrote:
    > > Here I had another case of why ntpd is finicky and cannot be relied
    > > on.
    > >
    > > ntp.conf specified a timeserver which is in datacenter.
    > >
    > > Datacenter had power outage.
    > >
    > > Power is restored.
    > >
    > > ntp client comes up before network switch and before timeserver.


    OK, perhaps the client should deal with that more gracefully, but it
    can be equally well argued that the power-on sequence for the data
    center should see to it that the servers are all up and running before
    the clients get restarted, and that the switches are all up before the
    servers.

    > > ntp does not see the timeserver. It quietly drops it from the list,
    > > and basically sits there doing nothing, since there is nothing else
    > > on the list. (and having more stuff on the list would not help)


    I thought the whole point about NTP server selection was to select
    multiple time sources such that no single point of failure could take
    them all out from the client's grasp.

    rick jones
    --
    firebug n, the idiot who tosses a lit cigarette out his car window
    these opinions are mine, all mine; HP might not want them anyway...
    feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

  10. Re: ntpd fscked up again

    On 2008-09-10, John Oliver wrote:
    > On Wed, 10 Sep 2008 07:10:51 -0500, Ignoramus24166 wrote:
    >> Here I had another case of why ntpd is finicky and cannot be relied
    >> on.
    >>
    >> ntp.conf specified a timeserver which is in datacenter.
    >>
    >> Datacenter had power outage.
    >>
    >> Power is restored.
    >>
    >> ntp client comes up before network switch and before timeserver.
    >>
    >> ntp does not see the timeserver. It quietly drops it from the list,
    >> and basically sits there doing nothing, since there is nothing else
    >> on the list. (and having more stuff on the list would not help)

    >
    > cfengine could save your bacon here...
    >


    How? I am very curious about cfengine.


  11. Re: ntpd fscked up again

    On 2008-09-10, Rick Jones wrote:
    >> Ignoramus24166 wrote:
    >> > Here I had another case of why ntpd is finicky and cannot be relied
    >> > on.
    >> >
    >> > ntp.conf specified a timeserver which is in datacenter.
    >> >
    >> > Datacenter had power outage.
    >> >
    >> > Power is restored.
    >> >
    >> > ntp client comes up before network switch and before timeserver.

    >
    > OK, perhaps the client should deal with that more gracefully, but it
    > can be equally well argued that the power-on sequence for the data
    > center should see to it that the servers are all up and running before
    > the clients get restarted, and that the switches are all up before the
    > servers.
    >
    >> > ntp does not see the timeserver. It quietly drops it from the list,
    >> > and basically sits there doing nothing, since there is nothing else
    >> > on the list. (and having more stuff on the list would not help)

    >
    > I thought the whole point about NTP server selection was to select
    > multiple time sources such that no single point of failure could take
    > them all out from the client's grasp.


    The server became available minutes later. After a whole day, ntpd was still
    not talking to it and had to be restarted by /etc/init.d/ntp
    restart. Then it started talking.


  12. Re: ntpd fscked up again

    > > I thought the whole point about NTP server selection was to select
    > > multiple time sources such that no single point of failure could
    > > take them all out from the client's grasp.


    > The server became available minutes later. After a whole day, ntpd
    > was still not talking to it and had to be restarted by
    > /etc/init.d/ntp restart. Then it started talking.


    _The_ server. I'll grant that what the client ntpd was doing wasn't
    terribly friendly, but I also think that configuring only a single
    source of time is brittle at best, even if the client ntpd were going
    to recover from starting before the time server was reachable.

    That said, I wonder what the effect of enabling iburst and setting the
    minpoll interval higher might be in such a situation.

    rick jones
    --
    denial, anger, bargaining, depression, acceptance, rebirth...
    where do you want to be today?
    these opinions are mine, all mine; HP might not want them anyway...
    feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
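For reference, a sketch of what the ntp.conf lines being wondered about in the post above could look like. The addresses are documentation placeholders from 192.0.2.0/24, not real servers, and nothing in this thread confirms that these options cure the startup problem. `iburst` sends a quick burst of packets while a server is unreachable, and `minpoll`/`maxpoll` are exponents of two in seconds. Numeric addresses at least remove the dependency on DNS being reachable when ntpd starts.

```
# Hypothetical ntp.conf fragment; the addresses are placeholders.
server 192.0.2.10 iburst minpoll 6 maxpoll 10
server 192.0.2.11 iburst minpoll 6 maxpoll 10
server 192.0.2.12 iburst minpoll 6 maxpoll 10
```

Listing several sources does not help while the only switch is down, but it does cover the case where a single time server stays dead after the network returns.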

  13. Re: ntpd fscked up again

    On 2008-09-10, Rick Jones wrote:
    >> > I thought the whole point about NTP server selection was to select
    >> > multiple time sources such that no single point of failure could
    >> > take them all out from the client's grasp.

    >
    >> The server became available minutes later. After a whole day, ntpd
    >> was still not talking to it and had to be restarted by
    >> /etc/init.d/ntp restart. Then it started talking.

    >
    > _The_ server. I'll grant that what the client ntpd was doing wasn't
    > terribly friendly, but I also think that configuring only a single
    > source of time is brittle at best, even if the client ntpd were going
    > to recover from starting before the time server was reachable.


    At that time, as I said, the core switch was down also and NO servers
    were reachable. If I had 100 servers in my config, it would not help.

    > That said, I wonder what the effect of enabling iburst and setting the
    > minpoll interval higher might be in such a situation.


    Whatever the poll interval was, it was surely less than a day.

  14. Re: ntpd fscked up again

    In comp.os.linux.networking Ignoramus24166 wrote:
    > On 2008-09-10, Rick Jones wrote:


    > > _The_ server. I'll grant that what the client ntpd was doing
    > > wasn't terribly friendly, but I also think that configuring only a
    > > single source of time is brittle at best, even if the client ntpd
    > > were going to recover from starting before the time server was
    > > reachable.


    > At that time, as I said, the core switch was down also and NO
    > servers were reachable. If I had 100 servers in my config, it would
    > not help.


    I suppose that is true if there is only the one core switch as well
    and only one path out of each client you are pretty much toast until
    everything is back up again no matter how many servers you configure
    nor how often the cron job fires.

    > > That said, I wonder what the effect of enabling iburst and setting
    > > the minpoll interval higher might be in such a situation.


    > Whatever the poll interval was, it was surely less than a day.


    Since the initial minpoll is 64 (?) seconds, it probably gave-up on
    the order of minutes. That is based on my ass-u-me-ing it tries a
    fixed number of times to reach the server and then gives-up.

    rick jones
    --
    a wide gulf separates "what if" from "if only"
    these opinions are mine, all mine; HP might not want them anyway...
    feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

  15. Re: ntpd fscked up again

    Rick Jones writes:

    >In comp.os.linux.networking Ignoramus24166 wrote:
    >> On 2008-09-10, Rick Jones wrote:


    >> > _The_ server. I'll grant that what the client ntpd was doing
    >> > wasn't terribly friendly, but I also think that configuring only a
    >> > single source of time is brittle at best, even if the client ntpd
    >> > were going to recover from starting before the time server was
    >> > reachable.


    >> At that time, as I said, the core switch was down also and NO
    >> servers were reachable. If I had 100 servers in my config, it would
    >> not help.


    >I suppose that is true if there is only the one core switch as well
    >and only one path out of each client you are pretty much toast until
    >everything is back up again no matter how many servers you configure
    >nor how often the cron job fires.


    >> > That said, I wonder what the effect of enabling iburst and setting
    >> > the minpoll interval higher might be in such a situation.


    >> Whatever the poll interval was, it was surely less than a day.


    >Since the initial minpoll is 64 (?) seconds, it probably gave-up on
    >the order of minutes. That is based on my ass-u-me-ing it tries a
    >fixed number of times to reach the server and then gives-up.



    There is a difference in the behaviour between the case where the server
    has been reached before and the case where it has never been reached and
    the DNS returns no address. There is a "dynamic" proposal, but it is still
    not in place.

    So what you can do is to run a cron job which runs every 5 min after
    bootup. It looks at ntpq -p and sees if there are any servers in the list.
    If there are none, it restarts ntpd.


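    #!/bin/bash
    # Restart ntpd when "ntpq -p" prints no peer line after the ===== separator.
    if [ "$(ntpq -p 2>/dev/null | awk 'flag==1 { print "OK"; exit } /=====/ { flag=1 }')" != 'OK' ]; then
        # The service name varies by distribution; on Ubuntu Hardy the init
        # script used earlier in this thread is /etc/init.d/ntp.
        service ntpd restart
    fi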

    (not positive it will work)
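A possible refinement of the check above, offered as a sketch and not as anything tested on the poster's machines: besides asking whether any peer is listed, count the peers whose reach register is non-zero, since a peer can be listed and still never answer. This assumes the usual `ntpq -p` layout in which `reach` is the seventh column; the function name and the sample data are made up for illustration.

```shell
#!/bin/bash
# Count peers in "ntpq -p" style output (read from stdin) whose reach
# register (column 7, octal) is non-zero, i.e. that have answered recently.
count_reachable_peers() {
    awk 'seen && NF >= 7 && $7 != "0" { n++ }
         /^=+$/ { seen = 1 }
         END { print n + 0 }'
}

# Demo on canned output (placeholder addresses): one peer is reachable,
# the other has never answered.
count_reachable_peers <<'EOF'
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*192.0.2.10      .GPS.            1 u   33   64  377    0.412    0.021   0.008
 192.0.2.11      .INIT.          16 u    -   64    0    0.000    0.000   0.000
EOF
```

In a cron job the function would be fed from `ntpq -pn`, with a restart triggered when the count is 0. A restart every five minutes while the network is still down is harmless but noisy, so logging the event is probably worth adding.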


  16. Re: ntpd fscked up again

    Bill Unruh writes:

    > Rick Jones writes:
    >
    >>In comp.os.linux.networking Ignoramus24166 wrote:
    >>> On 2008-09-10, Rick Jones wrote:

    >
    >>> > _The_ server. I'll grant that what the client ntpd was doing
    >>> > wasn't terribly friendly, but I also think that configuring only a
    >>> > single source of time is brittle at best, even if the client ntpd
    >>> > were going to recover from starting before the time server was
    >>> > reachable.

    >
    >>> At that time, as I said, the core switch was down also and NO
    >>> servers were reachable. If I had 100 servers in my config, it would
    >>> not help.

    >
    >>I suppose that is true if there is only the one core switch as well
    >>and only one path out of each client you are pretty much toast until
    >>everything is back up again no matter how many servers you configure
    >>nor how often the cron job fires.

    >
    >>> > That said, I wonder what the effect of enabling iburst and setting
    >>> > the minpoll interval higher might be in such a situation.

    >
    >>> Whatever the poll interval was, it was surely less than a day.

    >
    >>Since the initial minpoll is 64 (?) seconds, it probably gave-up on
    >>the order of minutes. That is based on my ass-u-me-ing it tries a
    >>fixed number of times to reach the server and then gives-up.

    >
    >
    > There is a difference in the behaviour if the server has been reached and
    > has never been reached and the dns returns no address. There is a "dynamic"
    > proposal but it is still not in place.
    >
    > So what you can do is to run a cron job which runs every 5 min after
    > bootup. It looks at ntpq -p and sees if there are any servers in the list.
    > If there are none, it restarts ntpd.
    >
    >
    > #!/bin/bash
    > if [ "`ntpq -p|awk 'flag==1 { print "OK";exit} $0 ~ /=====/ {flag=1}'`" != 'OK' ];then
    > service ntpd restart
    > fi
    >
    > (not positive it will work)


    I use this and therefore know it works:

    > cat /root/bin/ntp-check

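    # Start ntpd if its init script reports it stopped or dead, then mail a report.
    x=`/etc/rc.d/init.d/ntpd status`
    case $x in
    *stopped*|*dead*)
        status=`/etc/rc.d/init.d/ntpd start 2>&1`
        # Pass the values as arguments so a stray % in them cannot break printf.
        printf 'had to start ntpd on %s, state <%s>\nrestart got status\n %s\n' \
            "`uname -n`" "$x" "$status" | mail -s cron myemailaddr
        ;;
    esac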


    (replace myemailaddr with your own)

  17. Re: ntpd fscked up again

    On Thu, 11 Sep 2008 01:00:27 +0000, Bill Unruh wrote:

    > So what you can do is to run a cron job which runs every 5 min after
    > bootup. It looks at ntpq -p and sees if there are any servers in the
    > list. If there are none, it restarts ntpd.


    Alternatively, how about

    server 127.127.1.0 # local clock
    fudge 127.127.1.0 stratum 12

    in ntp.conf? Is this still reasonable?

    See http://www.meinberg.de/english/info/ntp.htm

    Additionally, there should be an entry for the local clock which
    can be used as a fallback resource if no other time source is available.
    Since the local clock is not very accurate, it should be fudged to a low
    stratum


    --
    Regards
    Alex

    http://www.badphorm.co.uk/

  18. Re: ntpd fscked up again

    On 2008-09-11, Bill Unruh wrote:
    > Rick Jones writes:
    >
    >>In comp.os.linux.networking Ignoramus24166 wrote:
    >>> On 2008-09-10, Rick Jones wrote:

    >
    >>> > _The_ server. I'll grant that what the client ntpd was doing
    >>> > wasn't terribly friendly, but I also think that configuring only a
    >>> > single source of time is brittle at best, even if the client ntpd
    >>> > were going to recover from starting before the time server was
    >>> > reachable.

    >
    >>> At that time, as I said, the core switch was down also and NO
    >>> servers were reachable. If I had 100 servers in my config, it would
    >>> not help.

    >
    >>I suppose that is true if there is only the one core switch as well
    >>and only one path out of each client you are pretty much toast until
    >>everything is back up again no matter how many servers you configure
    >>nor how often the cron job fires.

    >
    >>> > That said, I wonder what the effect of enabling iburst and setting
    >>> > the minpoll interval higher might be in such a situation.

    >
    >>> Whatever the poll interval was, it was surely less than a day.

    >
    >>Since the initial minpoll is 64 (?) seconds, it probably gave-up on
    >>the order of minutes. That is based on my ass-u-me-ing it tries a
    >>fixed number of times to reach the server and then gives-up.

    >
    >
    > There is a difference in the behaviour if the server has been reached and
    > has never been reached and the dns returns no address. There is a "dynamic"
    > proposal but it is still not in place.
    >
    > So what you can do is to run a cron job which runs every 5 min after
    > bootup. It looks at ntpq -p and sees if there are any servers in the list.
    > If there are none, it restarts ntpd.
    >
    >
    > #!/bin/bash
    > if [ "`ntpq -p|awk 'flag==1 { print "OK";exit} $0 ~ /=====/ {flag=1}'`" != 'OK' ];then
    > service ntpd restart
    > fi


    I like this a lot.
    Thanks.

    i
    > (not positive it will work)
    >



  19. Re: ntpd fscked up again

    Alex Potter writes:

    >On Thu, 11 Sep 2008 01:00:27 +0000, Bill Unruh wrote:


    >> So what you can do is to run a cron job which runs every 5 min after
    >> bootup. It looks at ntpq -p and sees if there are any servers in the
    >> list. If there are none, it restarts ntpd.


    >Alternatively, how about


    > server 127.127.1.0 # local clock
    > fudge 127.127.1.0 stratum 12


    >in ntp.conf? Is this still reasonable?


    >See http://www.meinberg.de/english/info/ntp.htm


    >Additionally, there should be an entry for the local clock which
    >can be used as a fallback resource if no other time source is available.
    >Since the local clock is not very accurate, it should be fudged to a low
    >stratum


    That of course does absolutely nothing for your time accuracy. You cannot
    get accurate time by comparing your wristwatch to itself. And it will not
    wake up and use the external ones when they come online.



    >--
    >Regards
    >Alex


    >http://www.badphorm.co.uk/


  20. Re: ntpd fscked up again

    On Thu, 11 Sep 2008 06:00:38 +0000, Unruh wrote:

    > That of course does absolutely nothing for your time accuracy. You
    > cannot get accurate time by comparing your wristwatch to itself.


    Of course not.

    >And it
    > will not wake up and use the external ones when they come online.


    Ah, thanks for correcting my mis-apprehension.

    --
    Regards
    Alex

    http://www.badphorm.co.uk/
