Too high steps in time reset
Hello everybody.
For years, I have run an NTP server that keeps thousands of hosts
aligned on a private network where time is vital.
In the last month, however, my NTP server has started resetting its
time in large steps (5-20 seconds), which causes problems on my
network. I don't understand how this can happen.
My NTP server runs Red Hat Linux 7.3 and, in short, /etc/ntp.conf
is configured as follows:
--------------------------------------------------------------------------------------------------------------
restrict default nomodify notrap noquery
restrict 127.0.0.1
driftfile /var/lib/ntp/drift
broadcastdelay 0.008
keys /etc/ntp/keys
broadcastclient
server 172.31.1.90
server 193.204.114.232
server 127.127.1.0
fudge 127.127.1.0 stratum 10
restrict 172.31.1.90 mask 255.255.255.255 nomodify notrap noquery
--------------------------------------------------------------------------------------------------------------
The first server (172.31.1.90) is a DCF77 server at stratum 11 on my LAN,
which synchronizes itself two or three times per day.
The second server is a stratum 1 Internet NTP server, IEN Galileo
Ferraris.
Both servers are OK and the time difference between them is negligible
(i.e. milliseconds).
This is today's log file:
--------------------------------------------------------------------------------------------------------------
Apr 22 00:15:09 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
Apr 22 00:32:16 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
Apr 22 00:49:16 gecssrv1 ntpd[20177]: synchronized to LOCAL(0), stratum 10
Apr 22 01:06:23 gecssrv1 ntpd[20177]: synchronized to LOCAL(0), stratum 10
Apr 22 01:23:29 gecssrv1 ntpd[20177]: synchronized to LOCAL(0), stratum 10
Apr 22 01:57:36 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
Apr 22 02:14:46 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
Apr 22 02:48:51 gecssrv1 ntpd[20177]: synchronized to LOCAL(0), stratum 10
Apr 22 03:40:07 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
Apr 22 03:57:19 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
Apr 22 04:14:17 gecssrv1 ntpd[20177]: synchronized to LOCAL(0), stratum 10
Apr 22 04:31:24 gecssrv1 ntpd[20177]: synchronized to LOCAL(0), stratum 10
Apr 22 05:22:36 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
Apr 22 05:39:47 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
Apr 22 05:56:46 gecssrv1 ntpd[20177]: synchronized to LOCAL(0), stratum 10
Apr 22 06:13:49 gecssrv1 ntpd[20177]: synchronized to LOCAL(0), stratum 10
Apr 22 06:48:03 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
Apr 22 07:22:16 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
Apr 22 07:22:26 gecssrv1 ntpd[20177]: time reset +9.470501 s
Apr 22 07:26:44 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
Apr 22 07:29:59 gecssrv1 ntpd[20177]: synchronized to LOCAL(0), stratum 10
Apr 22 08:13:54 gecssrv1 ntpd[20177]: synchronized to LOCAL(0), stratum 10
Apr 22 08:30:58 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
--------------------------------------------------------------------------------------------------------------
You can see that at 7:22 I got a time reset of +9.4... that's HUGE. It
happens often.
The DCF77 server synchronized at 5:40.
Can anyone tell me how this can happen?
Thank you in advance.
Massimo
Re: Too high steps in time reset
massimo.musso@gmail.com wrote:
> The first server (172.31.1.90) is a DCF77 server at stratum 11 on my LAN,
> which synchronizes itself two or three times per day.
I was going to say that that never has valid time, but actually it is
never going to be used as the server of record, even though it has valid
time, because the local clock will always win. More later.
> Apr 22 07:22:26 gecssrv1 ntpd[20177]: time reset +9.470501 s
Positive steps on Red Hat are usually the result of lost clock
interrupts. I think that is in the known issues documents mentioned in
another thread.
If there is any other problem, it is more or less essential that you
provide ntpq peers output.
> Apr 22 07:26:44 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
> Apr 22 07:29:59 gecssrv1 ntpd[20177]: synchronized to LOCAL(0), stratum 10
Synchronizing to LOCAL should be considered a fault condition,
equivalent to a total loss of synchronisation. LOCAL should be an
active choice, not done by default, but if you use it, you should ensure
that you have enough real servers to outvote it. The DCF server is
useless because of its stratum, and would be of questionable value
because of the large root dispersions it will accumulate between updates
(these are not fundamental limitations of DCF as a clock source).
Many people would say that you need at least four independent sources of
true time, and I would suggest that that needs to be in excess over the
number of LOCAL clock sources (direct and indirect) that you have.
> You can see that at 7:22 I got a time reset of +9.4... that's HUGE. It
> happens often.
Does that correlate with some heavy disk based job? (Backup?)
> The DCF77 server synchronized at 5:40.
The switch to the stratum 1 server at about that time may be because the
error band on the DCF server has collapsed, so that its intersection with
the stratum 1 server's band now excludes the local clock value, thus
outvoting the local clock. When the DCF server has run for a long time
with no update, its error bounds will increase and any local clock value
within them will be acceptable, even if that conflicts with the stratum 1
server.
Also note that any time reset events indicate a problem that should be
investigated. Again see the other thread.
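As a rough aid to that investigation, the sketch below pulls every "time reset" event out of an ntpd syslog so the steps can be compared against cron jobs, backups, or source switches. It assumes the log line format shown in the original post; the function name and the `threshold` parameter are made up for illustration.

```python
import re

# Matches ntpd syslog lines in the format quoted in this thread, e.g.
#   Apr 22 07:22:26 gecssrv1 ntpd[20177]: time reset +9.470501 s
RESET_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+ntpd\[\d+\]:\s+"
    r"time reset\s+(?P<step>[+-]?\d+(?:\.\d+)?)\s*s"
)

def find_time_resets(lines, threshold=0.0):
    """Return (timestamp, step_seconds) for each reset larger than threshold."""
    resets = []
    for line in lines:
        match = RESET_RE.match(line.strip())
        if match and abs(float(match.group("step"))) > threshold:
            resets.append((match.group("stamp"), float(match.group("step"))))
    return resets

# Three lines taken from the log in the original post.
sample = [
    "Apr 22 07:22:16 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1",
    "Apr 22 07:22:26 gecssrv1 ntpd[20177]: time reset +9.470501 s",
    "Apr 22 07:26:44 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1",
]

for stamp, step in find_time_resets(sample):
    print(f"{stamp}: stepped by {step:+.6f} s")
```

Feeding it a whole day's log (for example the lines of /var/log/messages) gives a list of step times and sizes that can be lined up against other activity on the machine.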
Re: Too high steps in time reset
Thank you for your detailed answer, David. I will try to give more
information.
David Woolley wrote:
> massimo.musso@gmail.com wrote:
>
> > The first server (172.31.1.90) is a DCF77 server at stratum 11 on my LAN,
> > which synchronizes itself two or three times per day.
>
> I was going to say that that never has valid time, but actually it is
> never going to be used as the server of record, even though it has valid
> time, because the local clock will always win. More later.
>
> > Apr 22 07:22:26 gecssrv1 ntpd[20177]: time reset +9.470501 s
>
> Positive steps on Red Hat are usually the result of lost clock
> interrupts. I think that is in the known issues documents mentioned in
> another thread.
I experience both positive and negative large steps.
> If there is any other problem, it is more or less essential that you
> provide ntpq peers output.
[root@gecssrv1 log]# ntpq -p
     remote           refid      st t when poll reach    delay   offset  jitter
===============================================================================
xdcf77           LOCAL(0)        11 u  130 1024   377    5.676  1307.74 320.974
x193.204.114.232 .UTCI.           1 u  137 1024   377   20.074  511.544 152.824
*LOCAL(0)        LOCAL(0)        10 l   44   64   377    0.000    0.000   0.008
> > Apr 22 07:26:44 gecssrv1 ntpd[20177]: synchronized to 193.204.114.232, stratum 1
> > Apr 22 07:29:59 gecssrv1 ntpd[20177]: synchronized to LOCAL(0), stratum 10
>
> Synchronizing to LOCAL should be considered a fault condition,
> equivalent to a total loss of synchronisation. LOCAL should be an
> active choice, not done by default, but if you use it, you should ensure
> that you have enough real servers to outvote it. The DCF server is
> useless because of its stratum, and would be of questionable value
> because of the large root dispersions it will accumulate between updates
> (these are not fundamental limitations of DCF as a clock source).
The DCF server worked well for 8 years, synchronizing my NTP server
(the most distant hosts on my WAN reach stratum 16, but they have always
been well synchronized). The problem appeared one month ago. I added the
stratum 1 Internet server a few days ago, but it did not bring any
improvement...
> Many people would say that you need at least four independent sources of
> true time, and I would suggest that that needs to be in excess over the
> number of LOCAL clock sources (direct and indirect) that you have.
>
> > You can see that at 7:22 I got a time reset of +9.4... that's HUGE. It
> > happens often.
>
> Does that correlate with some heavy disk based job? (Backup?)
No jobs at all, and the time resets are "time independent".
>
> > The DCF77 server synchronized at 5:40.
>
> The switch to the stratum 1 server at about that time may be because the
> error band on the DCF server has collapsed, so that its intersection with
> the stratum 1 server's band now excludes the local clock value, thus
> outvoting the local clock. When the DCF server has run for a long time
> with no update, its error bounds will increase and any local clock value
> within them will be acceptable, even if that conflicts with the stratum 1
> server.
>
> Also note that any time reset events indicate a problem that should be
> investigated. Again see the other thread.
Re: Too high steps in time reset
On 2008-04-22, massimo.musso@gmail.com wrote:
> [root@gecssrv1 log]# ntpq -p
>      remote           refid      st t when poll reach    delay   offset  jitter
> ===============================================================================
> xdcf77           LOCAL(0)        11 u  130 1024   377    5.676  1307.74 320.974
> x193.204.114.232 .UTCI.           1 u  137 1024   377   20.074  511.544 152.824
> *LOCAL(0)        LOCAL(0)        10 l   44   64   377    0.000    0.000   0.008
The Undisciplined Local Clock (LOCAL) is just a hack which allows ntpd
to claim to be synchronized to something when no real time sources are
available. An ntpd claiming to be "synchronized" to LOCAL is actually
just free-wheeling.
You don't need, or want, to use LOCAL unless this ntpd is serving time
to others.
If you _really_ need to use LOCAL you should fudge it to a stratum that is
greater than that of all of your real time sources. In this case I'd use
stratum 12.
This ntpd only has two real time sources; it needs at least one more.
When you have only two clocks you have no way of determining which is
correct. When you have three, or more clocks, you can use the majority
which agree (which is, in fact, exactly what ntpd does).
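The majority idea can be illustrated with a small interval-overlap sketch. This is a simplified illustration, not ntpd's actual selection code: each source is reduced to an offset plus or minus an error half-width, and the largest group of sources whose intervals share a common point is taken as the agreeing set. The offsets come from the ntpq output quoted above; the 100 ms half-widths and the names `clock_a`, `clock_b`, `clock_c` are assumptions chosen only for the example.

```python
def best_agreement(sources):
    """sources maps name -> (offset_ms, half_width_ms).

    Returns (count, names) for the largest group of sources whose
    intervals [offset - half, offset + half] share a common point.
    """
    edges = []
    for name, (offset, half) in sources.items():
        edges.append((offset - half, 0, name))  # interval opens (sorts before a close)
        edges.append((offset + half, 1, name))  # interval closes
    edges.sort()

    active, best = set(), set()
    for _, kind, name in edges:
        if kind == 0:
            active.add(name)
            if len(active) > len(best):
                best = set(active)
        else:
            active.discard(name)
    return len(best), sorted(best)

def has_majority(sources):
    """True when more than half of the sources agree with each other."""
    count, _ = best_agreement(sources)
    return count > len(sources) / 2

# Offsets from the ntpq output above; half-widths are assumed values.
observed = {
    "dcf77": (1307.74, 100.0),
    "193.204.114.232": (511.544, 100.0),
    "LOCAL(0)": (0.0, 100.0),
}
# Hypothetical set of three clocks where two agree and one is off.
hypothetical = {
    "clock_a": (10.0, 20.0),
    "clock_b": (15.0, 20.0),
    "clock_c": (500.0, 20.0),
}

print(best_agreement(observed), has_majority(observed))
print(best_agreement(hypothetical), has_majority(hypothetical))
```

With the assumed half-widths, none of the three observed intervals overlap, so no group larger than one exists. In the hypothetical set, two of three clocks overlap and can outvote the third.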
--
Steve Kostecke <kostecke@ntp.org>
NTP Public Services Project - http://support.ntp.org/
Re: Too high steps in time reset
massimo.musso@gmail.com writes:
>Hello everybody.
>For years, I have run an NTP server that keeps thousands of hosts
>aligned on a private network where time is vital.
>In the last month, however, my NTP server has started resetting its
>time in large steps (5-20 seconds), which causes problems on my
>network. I don't understand how this can happen.
>My NTP server runs Red Hat Linux 7.3 and, in short, /etc/ntp.conf
>is configured as follows:
Might I suggest you upgrade, both the computer and the software? It sounds
like both are over 5 years old.
I think you are having hardware problems.
Re: Too high steps in time reset
massimo.musso@gmail.com wrote:
> [root@gecssrv1 log]# ntpq -p
>      remote           refid      st t when poll reach    delay   offset  jitter
> ===============================================================================
> xdcf77           LOCAL(0)        11 u  130 1024   377    5.676  1307.74 320.974
> x193.204.114.232 .UTCI.           1 u  137 1024   377   20.074  511.544 152.824
> *LOCAL(0)        LOCAL(0)        10 l   44   64   377    0.000    0.000   0.008
You have serious problems. It looks like both of your proper sources of
time are being rejected as having a false time. Also the difference
between them is so high that at least one of them has to be broken. (Hand
tuned clocks will usually track to about 30 seconds a year, so getting
out by 600ms in a quarter of a day, or so, is totally unreasonable.)
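The rejection is visible in the first column of the ntpq output: an `x` marks a source discarded as a falseticker and `*` marks the currently selected peer. A minimal parsing sketch follows, assuming the ten-column `ntpq -p` row layout quoted above; the function and dictionary names are illustrative only. On the printed offsets, the two rejected sources differ by roughly 0.8 s in this snapshot.

```python
# Common tally codes from the first column of `ntpq -p`.
TALLY_MEANING = {
    "*": "system peer (currently selected)",
    "+": "candidate",
    "-": "outlier",
    "x": "falseticker (rejected as inconsistent)",
}

def parse_peer(line):
    """Split one `ntpq -p` row into its tally code and the fields used here."""
    tally, fields = line[0], line[1:].split()
    remote, refid, stratum, _type, _when, _poll, _reach, _delay, offset, jitter = fields
    return {
        "tally": tally,
        "remote": remote,
        "refid": refid,
        "stratum": int(stratum),
        "offset_ms": float(offset),
        "jitter_ms": float(jitter),
    }

# Rows copied from the ntpq output quoted above.
ROWS = [
    "xdcf77 LOCAL(0) 11 u 130 1024 377 5.676 1307.74 320.974",
    "x193.204.114.232 .UTCI. 1 u 137 1024 377 20.074 511.544 152.824",
    "*LOCAL(0) LOCAL(0) 10 l 44 64 377 0.000 0.000 0.008",
]

peers = [parse_peer(row) for row in ROWS]
for peer in peers:
    meaning = TALLY_MEANING.get(peer["tally"], "other")
    print(f"{peer['remote']:<16} stratum {peer['stratum']:>2}  "
          f"offset {peer['offset_ms']:>9.3f} ms  {meaning}")

# Disagreement between the sources flagged as falsetickers.
rejected = [peer["offset_ms"] for peer in peers if peer["tally"] == "x"]
spread_ms = max(rejected) - min(rejected)
print(f"rejected sources disagree by {spread_ms:.3f} ms")
```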
I'm going to guess that the DCF system isn't a real NTP server. I
suspect it is a machine synchronised to its local clock and having that
local clock stepped to DCF on each update. A real DCF based ntp server
would correct for the frequency error. NTP assumes that time errors
accumulate smoothly, e.g. as the result of temperature changes or
crystal aging. It is not optimised to handle time that jumps by half a
second, without warning.
Actually, looking back at the DCF machine, it is openly admitting that
it is using the local clock. One of the problems with the local clock
is that it reports an error band consistent with a real, locally
attached, reference clock, so it is very easy for other machines to go
outside of the error band. In this case, all three machines will have
irreconcilable times.
Assuming this is six hours since the last DCF read, we are talking 27
ppm. That's the drift you expect from a completely uncorrected
motherboard of slightly below average quality. You should be expecting
uncorrected frequency errors of more like 0.1ppm, ranging to 1-2ppm if
there have been violent temperature swings.
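The arithmetic behind these figures is just accumulated error divided by elapsed time. A small sketch, using the numbers quoted in this post (600 ms over an assumed six hours, 30 seconds a year for a hand tuned clock, and the 100 ms a day mentioned further down); the helper name `drift_ppm` is made up for the example.

```python
def drift_ppm(error_seconds, interval_seconds):
    """Average frequency error, in parts per million, implied by
    accumulating error_seconds of offset over interval_seconds."""
    return error_seconds / interval_seconds * 1e6

HOUR = 3600
DAY = 24 * HOUR
YEAR = 365 * DAY

print(f"600 ms in 6 hours : {drift_ppm(0.600, 6 * HOUR):.1f} ppm")
print(f"30 s in a year    : {drift_ppm(30, YEAR):.2f} ppm")
print(f"100 ms in a day   : {drift_ppm(0.100, DAY):.2f} ppm")
```

This prints about 27.8 ppm, 0.95 ppm and 1.16 ppm respectively, which is where the "27 ppm" above comes from and why it stands out against a tuned clock.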
You need to install a proper DCF driver on the DCF machine, and delete
its local clock line. You should probably also delete the local clock
line on the other machine. Finally you need to add properly
synchronised servers sufficient that you can reliably outvote any broken
clock. The problem here is that all three are voting for incompatible
times, so no time can have a majority.
Note. This doesn't solve your large step problem, but you need to get a
valid configuration before you start worrying about that. One of the
things that seems to have confused things is that you have finally
introduced a well behaved NTP time source into the system.
If you really can't use a proper DCF driver, you should still delete the
local clock on the non-DCF machine and you should hand calibrate the drift
file on the DCF machine. Properly calibrated, it shouldn't drift by
more than about 100ms a day. However, because it is using the local
clock driver, other systems will only think it can have drifted over the
time since they last polled it, not over the whole day, so for
most of the day it is still likely to give a time that is incompatible
with that from any other time server. So on balance, if you can't use a
DCF ntpd driver, don't use the DCF hardware.