Too high steps in time reset - NTP



  1. Too high steps in time reset

    Hello everybody.

    For years, I've run an NTP server to keep thousands of hosts aligned in a
    private network where time is vital.
    But in the last month I've had a problem: my NTP server resets the time with
    large steps (5-20 seconds), and this causes problems in my network. I don't
    understand how this can happen.

    My NTP server runs Red Hat Linux 7.3 and, in short, /etc/ntp.conf is
    configured this way:
    --------------------------------------------------------------------------------------------------------------
    restrict default nomodify notrap noquery
    restrict 127.0.0.1
    driftfile /var/lib/ntp/drift
    broadcastdelay 0.008
    keys /etc/ntp/keys
    broadcastclient
    server 172.31.1.90
    server 193.204.114.232
    server 127.127.1.0
    fudge 127.127.1.0 stratum 10
    restrict 172.31.1.90 mask 255.255.255.255 nomodify notrap noquery
    --------------------------------------------------------------------------------------------------------------
    the first server (172.31.1.90) is a DCF77 stratum 11 server in my LAN,
    synchronizing itself 2-3 times per day
    the second server is an Internet NTP server, stratum 1 (IEN Galileo
    Ferraris)
    both servers are OK and the difference between them is negligible (i.e.
    milliseconds)

    this is today's log file:
    --------------------------------------------------------------------------------------------------------------
    Apr 22 00:15:09 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
    Apr 22 00:32:16 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
    Apr 22 00:49:16 gecssrv1 ntpd[20177]: synchronized to LOCAL(0), stratum 10
    Apr 22 01:06:23 gecssrv1 ntpd[20177]: synchronized to LOCAL(0), stratum 10
    Apr 22 01:23:29 gecssrv1 ntpd[20177]: synchronized to LOCAL(0), stratum 10
    Apr 22 01:57:36 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
    Apr 22 02:14:46 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
    Apr 22 02:48:51 gecssrv1 ntpd[20177]: synchronized to LOCAL(0), stratum 10
    Apr 22 03:40:07 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
    Apr 22 03:57:19 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
    Apr 22 04:14:17 gecssrv1 ntpd[20177]: synchronized to LOCAL(0), stratum 10
    Apr 22 04:31:24 gecssrv1 ntpd[20177]: synchronized to LOCAL(0), stratum 10
    Apr 22 05:22:36 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
    Apr 22 05:39:47 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
    Apr 22 05:56:46 gecssrv1 ntpd[20177]: synchronized to LOCAL(0), stratum 10
    Apr 22 06:13:49 gecssrv1 ntpd[20177]: synchronized to LOCAL(0), stratum 10
    Apr 22 06:48:03 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
    Apr 22 07:22:16 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
    Apr 22 07:22:26 gecssrv1 ntpd[20177]: time reset +9.470501 s
    Apr 22 07:26:44 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
    Apr 22 07:29:59 gecssrv1 ntpd[20177]: synchronized to LOCAL(0), stratum 10
    Apr 22 08:13:54 gecssrv1 ntpd[20177]: synchronized to LOCAL(0), stratum 10
    Apr 22 08:30:58 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
    --------------------------------------------------------------------------------------------------------------
    You can see that at 7:22 I got a time reset of +9.4 seconds; that's HUGE. It
    happens often.
    The DCF77 synchronized at 5:40.
    Can anyone tell me how this can happen?

    Thank you in advance.
    Massimo

  2. Re: Too high steps in time reset

    massimo.musso@gmail.com wrote:

    --------------------------------------------------------------------------------------------------------------
    > the first server (172.31.1.90) is a DCF77 stratum 11 server in my LAN,
    > synchronizing itself 2-3 times per day


    I was going to say that that never has valid time, but actually it is
    never going to be used as the server of record, even though it has valid
    time, because the local clock will always win. More later.

    > Apr 22 07:22:26 gecssrv1 ntpd[20177]: time reset +9.470501 s


    Positive steps on Red Hat are usually the result of lost clock
    interrupts. I think that is in the known issues documents mentioned in
    another thread.

    If there is any other problem, it is more or less essential that you
    provide ntpq peers output.

    > Apr 22 07:26:44 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232,
    > stratum 1
    > Apr 22 07:29:59 gecssrv1 ntpd[20177]: synchronized to LOCAL(0),
    > stratum 10


    Synchronizing to LOCAL should be considered a fault condition,
    equivalent to a total loss of synchronisation. LOCAL should be an
    active choice, not done by default, but if you use it, you should ensure
    that you have enough real servers to outvote it. The DCF server is
    useless because of its stratum, and would be of questionable value
    because of the large root dispersions it will accumulate between updates
    (these are not fundamental limitations of DCF as a clock source).

    Many people would say that you need at least four independent sources of
    true time, and I would suggest that that needs to be in excess over the
    number of LOCAL clock sources (direct and indirect) that you have.
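
    For illustration only, here is a minimal sketch of what the server section
    might look like with four real sources and the local clock pushed out of the
    vote (the hostnames are placeholders, not recommendations of specific servers):
    --------------------------------------------------------------------------------------------------------------
    # four independent real time sources (placeholder names)
    server ntp1.example.net
    server ntp2.example.net
    server ntp3.example.net
    server ntp4.example.net
    # keep the local clock only if this host must go on serving time when isolated,
    # and fudge it above every real source so it can never win
    server 127.127.1.0
    fudge 127.127.1.0 stratum 12
    --------------------------------------------------------------------------------------------------------------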

    > You can see that at 7:22 I got a time reset of +9.4 seconds; that's HUGE. It
    > happens often.


    Does that correlate with some heavy disk based job? (Backup?)

    > The DCF77 synchronized at 5:40.


    The switch to the stratum 1, at about that time, may be because the
    error band on DCF has collapsed and the intersection of it and the
    stratum one now excludes the local clock value, thus outvoting the local
    clock. When it has run for a long time with no update, the error bounds
    will increase and any local clock value within them will be acceptable,
    even if that conflicts with the stratum one.

    Also note that any time reset events indicate a problem that should be
    investigated. Again see the other thread.

  3. Re: Too high steps in time reset

    Thank you for your detailed answer, David. I'll try to give more
    information.

    David Woolley wrote:

    > massimo.musso@gmail.com wrote:
    >
    > --------------------------------------------------------------------------------------------------------------
    > > the first server (172.31.1.90) is a DCF77 stratum 11 server in my LAN,
    > > synchronizing itself 2-3 times per day

    >
    > I was going to say that that never has valid time, but actually it is
    > never going to be used as the server of record, even though it has valid
    > time, because the local clock will always win. More later.
    >
    > > Apr 22 07:22:26 gecssrv1 ntpd[20177]: time reset +9.470501 s

    >
    > Positive steps on Red Hat are usually the result of lost clock
    > interrupts. I think that is in the known issues documents mentioned in
    > another thread.
    >

    I experience large steps in both directions, positive and negative.

    > If there is any other problem, it is more or less essential that you
    > provide ntpq peers output.
    >

    [root@gecssrv1 log]# ntpq -p
         remote           refid      st t when poll reach   delay   offset  jitter
    ==============================================================================
    xdcf77           LOCAL(0)        11 u  130 1024  377    5.676  1307.74 320.974
    x193.204.114.232 .UTCI.           1 u  137 1024  377   20.074  511.544 152.824
    *LOCAL(0)        LOCAL(0)        10 l   44   64  377    0.000    0.000   0.008

    > > Apr 22 07:26:44 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232,
    > > stratum 1
    > > Apr 22 07:29:59 gecssrv1 ntpd[20177]: synchronized to LOCAL(0),
    > > stratum 10

    >
    > Synchronizing to LOCAL should be considered a fault condition,
    > equivalent to a total loss of synchronisation. LOCAL should be an
    > active choice, not done by default, but if you use it, you should ensure
    > that you have enough real servers to outvote it. The DCF server is
    > useless because of its stratum, and would be of questionable value
    > because of the large root dispersions it will accumulate between updates
    > (these are not fundamental limitations of DCF as a clock source).
    >

    The DCF server worked well for 8 years, synchronizing my NTP server
    (the most distant hosts in my WAN reach stratum 16, but they have always
    been well synchronized). The problem started 1 month ago, and I added the
    stratum 1 Internet server only a few days ago, but it didn't give me any
    improvement...

    > Many people would say that you need at least four independent sources of
    > true time, and I would suggest that that needs to be in excess over the
    > number of LOCAL clock sources (direct and indirect) that you have.
    >
    > > You can see that at 7:22 I got a time reset of +9.4 seconds; that's HUGE.
    > > It happens often.

    >
    > Does that correlate with some heavy disk based job? (Backup?)

    No jobs at all, and the time resets are "time independent".
    >
    > > The DCF77 synchronized at 5:40.

    >
    > The switch to the stratum 1, at about that time, may be because the
    > error band on DCF has collapsed and the intersection of it and the
    > stratum one now excludes the local clock value, thus outvoting the local
    > clock. When it has run for a long time with no update, the error bounds
    > will increase and any local clock value within them will be acceptable,
    > even if that conflicts with the stratum one.
    >
    > Also note that any time reset events indicate a problem that should be
    > investigated. Again see the other thread.


  4. Re: Too high steps in time reset

    On 2008-04-22, massimo.musso@gmail.com wrote:

    > [root@gecssrv1 log]# ntpq -p
    >      remote           refid      st t when poll reach   delay   offset  jitter
    > ==============================================================================
    > xdcf77           LOCAL(0)        11 u  130 1024  377    5.676  1307.74 320.974
    > x193.204.114.232 .UTCI.           1 u  137 1024  377   20.074  511.544 152.824
    > *LOCAL(0)        LOCAL(0)        10 l   44   64  377    0.000    0.000   0.008


    The Undisciplined Local Clock (LOCAL) is just a hack which allows ntpd
    to claim to be synchronized to something when no real time sources are
    available. An ntpd claiming to be "synchronized" to LOCAL is actually
    just free-wheeling.

    You don't need, or want, to use LOCAL unless this ntpd is serving time
    to others.

    If you _really_ need to use LOCAL you should fudge it to a stratum that is
    greater than that of all of your real time sources. In this case I'd use
    stratum 12.
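
    A sketch of what that fudge would look like in /etc/ntp.conf, assuming you
    keep LOCAL at all:
    --------------------------------------------------------------------------------------------------------------
    server 127.127.1.0              # Undisciplined Local Clock
    fudge 127.127.1.0 stratum 12    # above every real source, so it can never win the vote
    --------------------------------------------------------------------------------------------------------------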

    This ntpd only has two real time sources; it needs at least one more.

    When you have only two clocks you have no way of determining which is
    correct. When you have three or more clocks, you can use the majority
    that agree (which is, in fact, exactly what ntpd does).

    --
    Steve Kostecke
    NTP Public Services Project - http://support.ntp.org/

  5. Re: Too high steps in time reset

    massimo.musso@gmail.com writes:

    >Hello everybody.


    >For years, I've run an NTP server to keep thousands of hosts aligned in a
    >private network where time is vital.
    >But in the last month I've had a problem: my NTP server resets the time with
    >large steps (5-20 seconds), and this causes problems in my network. I don't
    >understand how this can happen.


    >My NTP server runs Red Hat Linux 7.3 and, in short, /etc/ntp.conf is
    >configured this way:


    Might I suggest you upgrade, both the computer and the software. It sounds
    like both are over 5 years old.

    I think you are having hardware problems.



  6. Re: Too high steps in time reset

    massimo.musso@gmail.com wrote:


    > [root@gecssrv1 log]# ntpq -p
    >      remote           refid      st t when poll reach   delay   offset  jitter
    > ==============================================================================
    > xdcf77           LOCAL(0)        11 u  130 1024  377    5.676  1307.74 320.974
    > x193.204.114.232 .UTCI.           1 u  137 1024  377   20.074  511.544 152.824
    > *LOCAL(0)        LOCAL(0)        10 l   44   64  377    0.000    0.000   0.008


    You have serious problems. It looks like both of your proper sources of
    time are being rejected as having false time. Also, the difference
    between them is so large that at least one of them has to be broken. (Hand
    tuned clocks will usually track to about 30 seconds a year, so getting
    out by 600 ms in a quarter of a day, or so, is totally unreasonable.)

    I'm going to guess that the DCF system isn't a real NTP server. I
    suspect it is a machine synchronised to its local clock and having that
    local clock stepped to DCF on each update. A real DCF based NTP server
    would correct for the frequency error. NTP assumes that time errors
    accumulate smoothly, e.g. as the result of temperature changes or
    crystal aging. It is not optimised to handle time that jumps by half a
    second without warning.

    Actually, looking back at the DCF machine, it is openly admitting that
    it is using the local clock. One of the problems with the local clock
    is that it reports an error band consistent with a real, locally
    attached, reference clock, so it is very easy for other machines to go
    outside of the error band. In this case, all three machines will have
    irreconcilable times.


    Assuming this is six hours since the last DCF read, we are talking 27
    ppm. That's the drift you expect from a completely uncorrected
    motherboard of slightly below average quality. You should be expecting
    uncorrected frequency errors of more like 0.1ppm, ranging to 1-2ppm if
    there have been violent temperature swings.
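
    As a rough check of that figure, assuming the ~0.6 s disagreement built up
    over the six hours since the last DCF correction:
    --------------------------------------------------------------------------------------------------------------
    0.6 s / (6 x 3600 s) = 0.6 / 21600 = 2.8e-5, i.e. about 28 ppm
    --------------------------------------------------------------------------------------------------------------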

    You need to install a proper DCF driver on the DCF machine, and delete
    its local clock line. You should probably also delete the local clock
    line on the other machine. Finally you need to add properly
    synchronised servers sufficient that you can reliably outvote any broken
    clock. The problem here is that all three are voting for incompatible
    times, so no time can have a majority.
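
    If it helps, a rough sketch of what the DCF machine's ntp.conf might contain
    once a real reference clock driver replaces the local clock. This assumes the
    receiver is one supported by ntpd's parse (type 8) driver; the mode number
    below is a placeholder and has to match your actual DCF77 receiver (see the
    refclock_parse documentation):
    --------------------------------------------------------------------------------------------------------------
    # DCF77 receiver via the parse refclock driver (127.127.8.u)
    server 127.127.8.0 mode 0       # mode is a placeholder; pick the one for your receiver
    fudge 127.127.8.0 time1 0.0     # per-site delay correction, to be calibrated

    # no "server 127.127.1.0" line: the local clock entry is removed entirely
    --------------------------------------------------------------------------------------------------------------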

    Note: this doesn't solve your large step problem, but you need to get a
    valid configuration before you start worrying about that. One of the
    things that seems to have confused matters is that you have finally
    introduced a well behaved NTP time source into the system.


    If you really can't use a proper DCF driver, you should still delete the
    local clock on the non-DCF machine, and you should hand-calibrate the
    drift file on the DCF machine. Properly calibrated, it shouldn't drift by
    more than about 100 ms a day. However, because it is using the local
    clock driver, other systems will only allow for the drift it could have
    accumulated since the last time they polled it, not for the whole day, so
    for most of the day it is still likely to give a time that is incompatible
    with that from any other time server. So, on balance, if you can't use a
    DCF ntpd driver, don't use the DCF hardware.
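
    For what it's worth, a hand-calibrated drift file is nothing more than a
    single frequency correction in parts per million, one number on one line;
    the value in this sketch of /var/lib/ntp/drift is only an example, not a
    measurement:
    --------------------------------------------------------------------------------------------------------------
    -27.813
    --------------------------------------------------------------------------------------------------------------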
