2 NTP Servers with diverging clocks and how to avoid stepping backwards in time (repost) - NTP

  1. 2 NTP Servers with diverging clocks and how to avoid stepping backwards in time (repost)

    I am doing post-mortem analysis on an NTP related problem in which one
    host running ntp-4.1.2 gets in a state where it seems to be making large
    step corrections to its local clock.

    When I look at the NTP stats file, I can see that something was terribly
    wrong with one or more of the NTP servers this host was using. Sometime
    around 18 August, the clocks of NTP servers 192.168.0.1 and 192.168.0.2
    began to gradually diverge reaching a difference of over 800 seconds by
    8 September. Compounding this problem, the peerstats also shows one of
    the NTP servers periodically (period of ~900s) being detected as
    unreachable over the whole duration. The other NTP server had a few
    sporadic incidences of being unreachable.

    I have captured all of the ntp configuration and the stats files. Also,
    I prepared a graph
    (http://dingo.dogpad.net/ntpProblem/reachableScatter.png) showing the
    offset of each peer as a function of time. All the stats and config
    (and the graph) can be found at http://dingo.dogpad.net/ntpProblem.

    I am a little bit interested in understanding what could have happened
    with the NTP servers on 18 August. I know that on 8 September, someone
    changed the configuration of one of the NTP servers (Note: the servers
    are probably not ntp.org's implementation), which apparently fixed the
    problem.

    I am more interested, however, in how my node handled this problem.
    Before I started digging into the problem, I was under the impression
    that ntp.org's ntpd never stepped the clock, but only slewed it to
    correct it. Now I see this is not the default behavior, but I can
    achieve this using tinker step 0. However, I read a thread on this
    newsgroup from Feb 2005 in which David Mills suggested this could
    produce large offsets and other unpredictable errors.

    How can I avoid the large clock stepping in this scenario? Is it
    related to the "prefer" keyword used for 192.168.0.1?
    Can I safely use "tinker step 0" along with "disable kernel" to prevent
    step corrections altogether?
    Can anyone tell me what they think happened to cause the two NTP servers
    to diverge so quickly?


    ---
    Joe Harvell

  2. Re: 2 NTP Servers with diverging clocks and how to avoid stepping backwards in time (repost)

    Joseph Harvell wrote:
    > I am doing post-mortem analysis on an NTP related problem in which one
    > host running ntp-4.1.2 gets in a state where it seems to be making large
    > step corrections to its local clock.
    >
    > When I look at the NTP stats file, I can see that something was terribly
    > wrong with one or more of the NTP servers this host was using. Sometime
    > around 18 August, the clocks of NTP servers 192.168.0.1 and 192.168.0.2
    > began to gradually diverge reaching a difference of over 800 seconds by
    > 8 September. Compounding this problem, the peerstats also shows one of
    > the NTP servers periodically (period of ~900s) being detected as
    > unreachable over the whole duration. The other NTP server had a few
    > sporadic incidences of being unreachable.
    >
    > I have captured all of the ntp configuration and the stats files. Also,
    > I prepared a graph
    > (http://dingo.dogpad.net/ntpProblem/reachableScatter.png) showing the
    > offset of each peer as a function of time. All the stats and config
    > (and the graph) can be found at http://dingo.dogpad.net/ntpProblem.
    >
    > I am a little bit interested in understanding what could have happened
    > with the NTP servers on 18 August. I know that on 8 September, someone
    > changed the configuration of one of the NTP servers (Note: the servers
    > are probably not ntp.org's implementation), which apparently fixed the
    > problem.
    >
    > I am more interested, however, in how my node handled this problem.
    > Before I started digging into the problem, I was under the impression
    > that ntp.org's ntpd never stepped the clock, but only slewed it to
    > correct it. Now I see this is not the default behavior, but I can
    > achieve this using tinker step 0. However, I read a thread on this
    > newsgroup from Feb 2005 in which David Mills suggested this could
    > produce large offsets and other unpredictable errors.


    Ntpd will step the clock if the error exceeds 128 ms (the step
    threshold) but is less than 1000 seconds (the default panic threshold).
    If the error is greater than the panic threshold, it declares the
    situation hopeless and exits.
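
    The threshold behavior described above can be sketched roughly as
    follows. This is only an illustration, not ntpd's code: the function
    name and constants are made up for the example, the panic threshold is
    left as a parameter (the ntpd documentation gives 1000 s as the
    default), and the real daemon also waits out a stepout interval before
    it actually steps.

```python
# Illustrative sketch only; names are hypothetical, not taken from ntpd.
STEP_THRESHOLD_S = 0.128     # step threshold (128 ms)
PANIC_THRESHOLD_S = 1000.0   # panic threshold (assumed default, see lead-in)


def classify_offset(offset_s, step=STEP_THRESHOLD_S, panic=PANIC_THRESHOLD_S):
    """Return 'slew', 'step', or 'panic' for a clock offset in seconds."""
    magnitude = abs(offset_s)
    if magnitude > panic:
        return "panic"   # daemon gives up and exits
    if magnitude > step:
        return "step"    # clock is set in one jump
    return "slew"        # clock is adjusted gradually


if __name__ == "__main__":
    for offset in (0.05, -0.5, 800.0, -1500.0):
        print(f"{offset:+10.3f} s -> {classify_offset(offset)}")
```

    With "tinker step 0" the middle branch is effectively disabled, so an
    offset of several hundred seconds would have to be slewed out instead.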

    >
    > How can I avoid the large clock stepping in this scenario? Is it
    > related to the "prefer" keyword used for 192.168.0.1?
    > Can I safely use "tinker step 0" along with "kernel disable" to prevent
    > step corrections altogether?


    Safely?? Probably not!!!! Far better to fix the problem, whatever it
    might be.

    If you configure four servers and one fails somehow (wrong time, crash,
    etc.) ntpd will happily continue with the remaining three servers. If
    you configure five servers, two can fail without ill effect. Two
    servers is the worst possible configuration; when the two differ, as
    they inevitably will, ntpd has no means of determining which one is more
    nearly correct! Three servers degenerates too easily to the two server
    case!
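
    To make the "two servers is the worst case" point concrete, here is a
    simplified majority check over correctness intervals. It is a stand-in
    for the idea behind ntpd's selection algorithm, not the actual
    implementation; the function names and the interval values in the
    example are assumptions chosen for illustration.

```python
# Simplified sketch: each server contributes an interval (low, high), in
# seconds, that should contain the true offset. A majority exists only if
# more than half of the intervals share a common point.


def largest_agreeing_group(intervals):
    """Return the largest number of intervals that overlap at one point."""
    events = []
    for low, high in intervals:
        events.append((low, 0))    # 0 = interval opens (sorts before a close)
        events.append((high, 1))   # 1 = interval closes
    events.sort()
    best = current = 0
    for _, kind in events:
        if kind == 0:
            current += 1
            best = max(best, current)
        else:
            current -= 1
    return best


def has_majority(intervals):
    """True if more than half of the servers agree with each other."""
    return largest_agreeing_group(intervals) > len(intervals) / 2


if __name__ == "__main__":
    two_diverged = [(-0.05, 0.05), (799.95, 800.05)]
    four_one_bad = [(-0.05, 0.05), (-0.03, 0.06), (-0.04, 0.04), (799.95, 800.05)]
    print("two servers, 800 s apart:", has_majority(two_diverged))
    print("four servers, one bad:   ", has_majority(four_one_bad))
```

    With two servers 800 s apart neither one can outvote the other, while
    three agreeing servers out of four still form a majority.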

    One usual cause of persistent stepping on Linux systems is the local
    clock being updated 1000 times per second instead of 100 (kernel
    parameter HZ needs to be set to 100). The other usual cause of
    persistent stepping is a local clock frequency error greater than 500
    parts per million. The only cure for this is to repair or replace the
    local clock (usually means replacing the motherboard).
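
    For a sense of scale, the arithmetic behind a parts-per-million figure
    is sketched below. The helper names are made up for this example. The
    21-day span is an assumption taken from the dates in the original post
    (18 August to 8 September), and the 800 s figure is the divergence
    between the two servers, not a measured error of the local clock.

```python
SECONDS_PER_DAY = 86_400


def drift_per_day_s(freq_error_ppm):
    """Seconds gained or lost per day at a given frequency error in ppm."""
    return freq_error_ppm * 1e-6 * SECONDS_PER_DAY


def freq_error_ppm(drift_s, elapsed_s):
    """Average frequency error in ppm implied by a drift over an interval."""
    return drift_s / elapsed_s * 1e6


if __name__ == "__main__":
    print(f"500 ppm is {drift_per_day_s(500):.1f} s per day")
    rate = freq_error_ppm(800, 21 * SECONDS_PER_DAY)
    print(f"800 s over 21 days is about {rate:.0f} ppm")
```

    A clock at the 500 ppm limit drifts about 43 s per day, and the two
    servers in this thread separated at an average relative rate of
    roughly 441 ppm.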

    ntp-4.1.2 is well behind the current stable version. Upgrade and take
    advantage of the fixes and new features.

  3. Re: 2 NTP Servers with diverging clocks and how to avoid stepping backwards in time (repost)

    Richard B. Gilbert wrote:
    > Joseph Harvell wrote:
    >> I am doing post-mortem analysis on an NTP related problem in which one
    >> host running ntp-4.1.2 gets in a state where it seems to be making large
    >> step corrections to its local clock.
    >>
    >> When I look at the NTP stats file, I can see that something was terribly
    >> wrong with one or more of the NTP servers this host was using. Sometime
    >> around 18 August, the clocks of NTP servers 192.168.0.1 and 192.168.0.2
    >> began to gradually diverge reaching a difference of over 800 seconds by
    >> 8 September. Compounding this problem, the peerstats also shows one of
    >> the NTP servers periodically (period of ~900s) being detected as
    >> unreachable over the whole duration. The other NTP server had a few
    >> sporadic incidences of being unreachable.
    >>
    >> I have captured all of the ntp configuration and the stats files. Also,
    >> I prepared a graph
    >> (http://dingo.dogpad.net/ntpProblem/reachableScatter.png) showing the
    >> offset of each peer as a function of time. All the stats and config
    >> (and the graph) can be found at http://dingo.dogpad.net/ntpProblem.
    >>
    >> I am a little bit interested in understanding what could have happened
    >> with the NTP servers on 18 August. I know that on 8 September, someone
    >> changed the configuration of one of the NTP servers (Note: the servers
    >> are probably not ntp.org's implementation), which apparently fixed the
    >> problem.
    >>
    >> I am more interested, however, in how my node handled this problem.
    >> Before I started digging into the problem, I was under the impression
    >> that ntp.org's ntpd never stepped the clock, but only slewed it to
    >> correct it. Now I see this is not the default behavior, but I can
    >> achieve this using tinker step 0. However, I read a thread on this
    >> newsgroup from Feb 2005 in which David Mills suggested this could
    >> produce large offsets and other unpredictable errors.

    >
    > Ntpd will step the clock if the error exceeds 128ms but is less than
    > 1024 seconds. If the error is greater than 1024 seconds it declares the
    > situation hopeless and commits suicide.
    >
    >>
    >> How can I avoid the large clock stepping in this scenario? Is it
    >> related to the "prefer" keyword used for 192.168.0.1?
    >> Can I safely use "tinker step 0" along with "kernel disable" to prevent
    >> step corrections altogether?

    >
    > Safely?? Probably not!!!! Far better to fix the problem, whatever it
    > might be.
    >


    Yes, I agree I need to fix the reachability problem. I think
    configuring more servers is definitely a good idea.

    The reason I ask about the "prefer" keyword is I think it has the effect
    that if the prefer server survives the clustering algorithm, its
    clock alone will be used to correct the local clock; whereas if no
    server is a prefer server, the clocks of all survivors of the clustering
    algorithm will be used for clock corrections. Note the bands in the
    graph that suggest the local clock was repeatedly stepped back and forth
    between the two servers' clocks. What I am looking for is to see how
    much the "prefer" keyword is contributing to the frequency and magnitude
    of step corrections in this scenario.

    Also, I recognize that there are failures in which the local host can
    end up with only one server reachable, and that this can flip flop
    between two servers with clocks that are between 128ms and 1024s apart.
    So in this scenario, the local ntpd will step the clock back and forth
    unless I use tinker step 0.

    My application is more sensitive to stepping than it is to the time
    being correct. So I would really like someone to explain to me why NOT
    to use tinker step 0. The February post I was referring to suggested it
    could maybe be done safely along with 'disable kernel'.
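
    One cost of never stepping is the time a slew-only correction takes.
    A rough estimate, assuming ntpd's documented maximum slew rate of
    500 ppm and a local oscillator with no frequency error of its own
    (the function name is made up for this example):

```python
MAX_SLEW_PPM = 500.0      # assumed maximum slew rate
SECONDS_PER_DAY = 86_400


def slew_duration_s(offset_s, slew_ppm=MAX_SLEW_PPM):
    """Seconds needed to remove an offset by slewing at a constant rate."""
    return abs(offset_s) / (slew_ppm * 1e-6)


if __name__ == "__main__":
    for offset in (0.128, 1.0, 800.0):
        seconds = slew_duration_s(offset)
        days = seconds / SECONDS_PER_DAY
        print(f"{offset:8.3f} s offset: {seconds:12.0f} s ({days:.2f} days)")
```

    At that rate each second of offset takes 2000 s to amortize, so an
    800 s error would need about 18.5 days of continuous slewing, during
    which the clock stays wrong but never jumps.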

    My plan is to have something scanning the logs or stats to detect when
    the offset is so large that the clock needs to be stepped. In this
    case, I plan to shut down the application that is sensitive to this,
    step the clock myself, and then resume.
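
    A minimal sketch of such a scanner is shown below. It assumes the
    usual peerstats layout (MJD day, seconds past midnight, peer address,
    status word, offset in seconds, then further fields); the function
    name, the 0.128 s threshold, and the sample records are invented for
    illustration and would need checking against the real stats files.

```python
def large_offsets(lines, threshold_s=0.128):
    """Yield (mjd, seconds, peer, offset) for records over the threshold."""
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or truncated records
        try:
            mjd = int(fields[0])
            seconds = float(fields[1])
            offset = float(fields[4])
        except ValueError:
            continue  # skip records that do not parse as numbers
        if abs(offset) > threshold_s:
            yield mjd, seconds, fields[2], offset


if __name__ == "__main__":
    # Made-up sample records, not taken from the poster's files.
    sample = [
        "53996 1200.123 192.168.0.1 9614 0.000512 0.00093 0.00188 0.00021",
        "53996 1264.456 192.168.0.2 9414 412.331870 0.00101 0.00190 0.00033",
    ]
    for mjd, seconds, peer, offset in large_offsets(sample):
        print(f"day {mjd} +{seconds:.0f}s: {peer} offset {offset:.3f} s")
```

    The same loop could read a file object in place of the sample list and
    trigger the shutdown, manual step, and restart sequence described above.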

    > If you configure four servers and one fails somehow (wrong time, crash,
    > etc.) ntpd will happily continue with the remaining three servers. If
    > you configure five servers, two can fail without ill effect. Two
    > servers is the worst possible configuration; when the two differ, as
    > they inevitably will, ntpd has no means of determining which one is more
    > nearly correct! Three servers degenerates too easily to the two server
    > case!
    >
    > One usual cause of persistent stepping on Linux systems is the local
    > clock being updated 1000 times per second instead of 100 (kernel
    > parameter HZ needs to be set to 100). The other usual cause of
    > persistent stepping is a local clock frequency error greater than 500
    > parts per million. The only cure for this is to repair or replace the
    > local clock (usually means replacing the motherboard).
    >
    > ntp-4.1.2 is well behind the current stable version. Upgrade and take
    > advantage of the fixes and new features.


  4. Re: 2 NTP Servers with diverging clocks and how to avoid stepping backwards in time (repost)

    Joseph Harvell wrote:
    > Richard B. Gilbert wrote:
    >
    >>Joseph Harvell wrote:
    >>
    >>>I am doing post-mortem analysis on an NTP related problem in which one
    >>>host running ntp-4.1.2 gets in a state where it seems to be making large
    >>>step corrections to its local clock.


    >>>How can I avoid the large clock stepping in this scenario? Is it
    >>>related to the "prefer" keyword used for 192.168.0.1?
    >>>Can I safely use "tinker step 0" along with "kernel disable" to prevent
    >>>step corrections altogether?

    >>
    >>Safely?? Probably not!!!! Far better to fix the problem, whatever it
    >>might be.
    >>

    >
    >
    > Yes, I agree I need to fix the reachability problem. I think
    > configuring more servers is definitely a good idea.
    >
    > The reason I ask about the "prefer" keyword is I think it has the effect
    > that if the prefer server survives the clustering algorithm, its
    > clock alone will be used to correct the local clock; whereas if no
    > server is a prefer server, the clocks of all survivors of the clustering
    > algorithm will be used for clock corrections. Note the bands in the
    > graph that suggest the local clock was repeatedly stepped back and forth
    > between the two servers' clocks. What I am looking for is to see how
    > much the "prefer" keyword is contributing to the frequency and magnitude
    > of step corrections in this scenario.


    The prefer keyword, as I understand it, tells ntpd to choose this server
    if it is possible to do so; e.g. the server is responding, it is
    synchronized, the numbers look okay, etc. If the "prefer" server's
    numbers look really bad (high jitter, synchronization distance, etc) I
    believe the prefer keyword is ignored.
    >
    > Also, I recognize that there are failures in which the local host can
    > end up with only one server reachable, and that this can flip flop
    > between two servers with clocks that are between 128ms and 1024s apart.
    > So in this scenario, the local ntpd will step the clock back and forth
    > unless I use tinker step 0.
    >
    > My application is more sensitive to stepping than it is to the time
    > being correct. So I would really like someone to explain to me why NOT
    > to use tinker step 0. The February post I was referring to suggested it
    > could maybe be done safely along with 'disable kernel'.
    >

    I have no personal experience with such a procedure. It seems to me
    that this is bending ntpd all out of shape to prevent something that
    shouldn't be happening in the first place. A properly configured ntpd
    with a properly functioning local clock; e.g. frequency within the 500
    PPM tolerance, should never NEED to step except, possibly, during startup.

    Try four, five or seven (protects against one, two, or three
    falsetickers) servers. Four are probably sufficient for most purposes.

    If your servers are "in house" and serving their unsynchronized local
    clocks, it's a very poor idea. And this is the only way I can imagine
    two servers drifting more than 128 milliseconds apart. If, for some
    reason, you can't connect to the internet, invest $85 each in one or
    more Garmin GPS18-LVC timing receivers and use them to synchronize
    either your NTP servers or your application server.

  5. Re: 2 NTP Servers with diverging clocks and how to avoid stepping backwards in time (repost)

    On Tue, 19 Sep 2006 18:49:48 GMT, Joseph Harvell
    wrote for the entire planet to see:

    >I am doing post-mortem analysis on an NTP related problem in which one
    >host running ntp-4.1.2 gets in a state where it seems to be making large
    >step corrections to its local clock.
    >
    >When I look at the NTP stats file, I can see that something was terribly
    >wrong with one or more of the NTP servers this host was using. Sometime
    >around 18 August, the clocks of NTP servers 192.168.0.1 and 192.168.0.2
    >began to gradually diverge reaching a difference of over 800 seconds by
    >8 September. Compounding this problem, the peerstats also shows one of
    >the NTP servers periodically (period of ~900s) being detected as
    >unreachable over the whole duration. The other NTP server had a few
    >sporadic incidences of being unreachable.


    >How can I avoid the large clock stepping in this scenario? Is it
    >related to the "prefer" keyword used for 192.168.0.1?


    Hi Joe -

    I have had a couple of bad experiences with the "prefer" keyword. I
    would not recommend using it at all.

    In my case using PREFER originally seemed to reduce clock hopping and
    improve the quality of our time. Then the reference clock that was
    PREFERed went out of kilter (relating to a leap-second bug) and
    because of PREFER my Stratum 1 NTPD stayed with the bad clock. I had
    a stratum 2 that PREFERed that local S1 so it stayed with the bad S1
    server. All told, about half of my ntpd servers and clients synced on
    the incorrect time (~1 sec off) until the PREFERs were removed,
    whereupon the insane clocks were ignored and the ntpd processes
    reconverged.

    If you are having issues of time steps and unreachable servers, PREFER
    would likely cause those problems to magnify and spread.

    - Eric



