drift value very large and very unstable - NTP
-
drift value very large and very unstable
I realize this is long, but I tried to include the whole story. I did
work earnestly to solve this on my own, but unfortunately I've been
spinning my wheels the last few days. Thanks for any help.
I am having a problem with drift values approaching and, on occasion,
reaching +/-500ppm. My time source setup:
GPS --> XL-GPS::IRIGB --> SBC0::IRIGB --> SBC0::NTP
The XL-GPS is synchronized with GPS time and outputs an IRIG-B signal.
The processor board, SBC0, is a single board computer housing a
Symmetricom BC635PMC IRIG-B receiver. Three different SBC0s and three
different BC635 PMCs were tested and all produced the same results. The
BC635 IRIG-B receiver is the only time source for NTP (see "BC635" conf
file below) using NTP's Bancomm reference clock support. This is the
"target system".
The drift using this configuration is typically near +/- 500ppm. I say
"+/-" because from one run of NTP to the next it may completely swing,
for example, from +486ppm to -490ppm on the same processor board. Most
of the time this wild swing only happens following a reboot, but I've
observed it on at least two occasions when ntpd was simply stopped and
then restarted (with no conf file changes and no reboots between).
To make matters more interesting, the drift consistently settles at
~100ppm when using only a local NTP server that is synchronized with
other public stratum 2 NTP servers (such as ntp.idealab.com, zagbot.com,
etc). In other words, when syncing with public NTP Internet servers, the
drift does not swing from positive to negative and it always settles at
a reasonable value (<100 ppm).
I've done several tests, including one that uses the 1Hz timestamp printout
feature of the XL-GPS. The timestamp is synchronized with the system's 1PPS,
so it comes out almost exactly once every second. I wrote a script
that waits for the 1Hz timestamp. When the timestamp is printed, the
script runs a C program that reads IRIG-B time from the BC635 PMC,
reads system time using clock_gettime(), and prints both timestamps.
I then combine the three timestamps (XL-GPS, BC635, and system) into a
log file, one line per 1Hz sample. This test seems to prove the stability of
the XL-GPS, BC635, and SBC0's system clock (which is not being
disciplined by NTP during the test). In particular, the test showed
SBC0's drift is in line with the 100ppm value seen when syncing with a
network time source. The test results were also consistent with the
claimed accuracy of the SBC0 oscillator, 30ppm. In other words, the
500ppm value seems to be a completely bogus fabrication of NTP.
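For readers who want to reproduce this kind of check, here is a minimal sketch of how a drift figure could be extracted from such a log. It assumes each sample has already been reduced to a (reference time, system time) pair in seconds; the function name and the synthetic data are illustrative and are not taken from the original test.

```python
def drift_ppm(samples):
    """Least-squares slope of (system - reference) versus reference time, in ppm.

    samples: iterable of (reference_seconds, system_seconds) pairs.
    """
    points = [(ref, system - ref) for ref, system in samples]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in points) / n
    mean_o = sum(o for _, o in points) / n
    num = sum((t - mean_t) * (o - mean_o) for t, o in points)
    den = sum((t - mean_t) ** 2 for t, _ in points)
    return num / den * 1e6


# Synthetic data (an assumption for illustration): a system clock that gains
# 100 microseconds per second and starts 0.5 s ahead of the reference.
samples = [(float(t), t * (1 + 100e-6) + 0.5) for t in range(0, 3600, 60)]
print(round(drift_ppm(samples), 3))
```

A fitted slope is less sensitive to a single noisy sample than simply dividing the first-to-last offset change by the elapsed time.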
Another piece of evidence is that the IRIG-B PMC was used on two
different single board computers (one was a Concurrent PP110, the other
a Concurrent VP315) where the drift was stable and settled at reasonable
values on both of these boards. In this case, the BC635 IRIG-B PMC did
not have a time reference, instead the time was set manually on the
BC635 and the BC635 operated in flywheel mode (i.e. the IRIG-B time
drifted with the clock on the BC635). This was the "development
system". Several weeks of testing on this system always produced stable
results. Drift values always stabilized at the same reasonable value,
for example, ~20ppm for one of these "other" SBCs. It was only after
several weeks of running on these boards that we then moved to the
"target system", SBC0, and then began experiencing the problem with drift.
The "target system" summary:
- SBC0 (2 Intel CPUs)
- GPS --> XL-GPS::IRIGB --> SBC0::IRIGB --> SBC0::NTP
- Concurrent RedHawk 4.2 (Hanoi)
- Linux sbc9 2.6.18.8-RedHawk-4.2-trace #1 SMP PREEMPT Tue May 29
12:44:24 EDT 2007 i686 i686 i386 GNU/Linux
The "development system" summary:
- SBC1: Concurrent PP110, Pentium III-M (1 CPU)
- PP110::IRIGB --> PP110::NTP
- SBC2: Concurrent VP315, Pentium M (1 CPU)
- VP315::IRIGB --> VP315::NTP
- Enterprise Linux, Version 4 (original release), kernel version:
- Linux ntp1 2.6.9-5.EL #1 Wed Jan 5 19:22:18 EST 2005 i686 i686 i386
GNU/Linux
Common items between "target system" and "development system":
- ntpd - NTP daemon program - Ver. 4.2.4p0
- BC635PMC hardware (i.e. exact same pieces of hardware)
- BC635PMC v6.5.0 driver from Symmetricom
Some other notes and thoughts on this problem:
- I have searched the web and NTP mailing list and have found various
instances of problems with large drift values, but none fit my situation
exactly or the instances were resolved by some means not applicable here.
- There is "no" activity on SBC0 when this problem occurs. By "no" I
mean no additional applications except whatever may be running as a cron
job (which isn't much). By "no" I also mean that there is no additional
hardware causing a heavy interrupt load on the system.
- The drift has _always_ gone near or equal to +/-500ppm -- i.e. it has
never stabilized at a reasonable value when running with the BC635 IRIGB
time source.
- I've tested the "target system" with and without the XL-GPS time
source, in which case the BC635 IRIG-B PMC runs in "flywheel" mode. In
"flywheel" mode, the drift problem is the same.
- The system has only a CompactFlash card for a local disk and, as
such, the kernel is configured without swap space.
- The "target system" has various requirements that necessitate running
in the fashion in which we are running. For example, there is no
connection to the Internet, nor can there be reliance on "other" network
time sources. The system must be completely self sufficient with one
local IRIG-B synced board serving as a local stratum 1 NTP server for
several other local SBC0 boards.
- I am not sure what other run-time NTP information is useful so I
didn't include any. Just let me know what you would like to see. It is
not easily possible to run tests with the BC635 and the VP315 or PP110
since those pieces of hardware are no longer co-located.
- I have tested ntp-4.2.4.p4 and ntp-dev-4.2.5p113 distributions and the
drift problem is the same (although, I only ran these versions once,
long enough to see the drift go above 400ppm).
- The drift file was deleted prior to almost every run of NTP. I say
"almost" because for some tests I wanted to see what NTP would do
when starting with a large drift value.
- There have been in the neighborhood of 50 different test runs
(probably more, but I'm not counting).
- One other test we plan to run is installing RedHat Enterprise Linux
Version 4, Update 4 on SBC0. This is a software environment more
similar to ones on which the BC635 driver was developed.
Andy
/*******************************************************/
NTP conf file for BC635 IRIG-B PMC
/*******************************************************/
# Base conf file for all normal operation and initial sync for both server
# and client
# Debug stuff
statistics clockstats peerstats loopstats
statsdir /var/lib/ntp/log/
filegen clockstats file stats.clock type pid link enable
filegen peerstats file stats.peer type pid link enable
filegen loopstats file stats.loop type pid link enable
restrict default nomodify notrap noquery
restrict 127.0.0.1
tinker panic 0 # don't let daemon exit for any time difference
driftfile /var/lib/ntp/drift
# Base conf file for normal operation for both server and client
tinker step 0 # disable stepping, so that we only slew time
# Conf file for normal operation of a server
server 127.127.16.0 prefer mode 2 minpoll 4 iburst burst # Symmetricom BC635
tos orphan 6
/*******************************************************/
NTP conf file for network NTP server
/*******************************************************/
# Base conf file for all normal operation and initial sync for both server
# and client
# Debug stuff
statistics clockstats peerstats loopstats
statsdir /var/lib/ntp/log/
filegen clockstats file stats.clock type pid link enable
filegen peerstats file stats.peer type pid link enable
filegen loopstats file stats.loop type pid link enable
restrict default nomodify notrap noquery
restrict 127.0.0.1
tinker panic 0 # don't let daemon exit for any time difference
driftfile /var/lib/ntp/drift
# Base conf file for normal operation for both server and client
tinker step 0 # disable stepping, so that we only slew time
# Conf file for initial sync of a client
server 192.168.2.90 prefer iburst burst minpoll 5 maxpoll 9
-
Re: drift value very large and very unstable
Andy,
All I have at the moment is to make sure you have seen the known hardware
and OS issues pages at support.ntp.org/Support/TroubleshootingNTP.
It looks like there was some more information on APIC and ACPI, but those
links are currently broken.
--
Harlan Stenn
http://ntpforum.isc.org - be a member!
-
Re: drift value very large and very unstable
On Mon, 3 Mar 2008, Andy Helten wrote:
-snippage-
> I am having a problem with drift values approaching and, on occasion,
> reaching +/-500ppm.
>
-snippage-
> NTP conf file for BC635 IRIG-B PMC
> /*******************************************************/
>
> tinker panic 0 # don't let daemon exit for any time difference
-snippage--
>
> # Base conf file for normal operation for both server and client
> tinker step 0 # disable stepping, so that we only slew time
>
> # Conf file for normal operation of a server
>
> server 127.127.16.0 prefer mode 2 minpoll 4 iburst burst # Symmetricom BC635
> tos orphan 6
-- snippage --
Lose the 'iburst burst' on 16.
With the two tinker commands above you give ntpd the requirement
to amortize the offset entirely with frequency control.
Are you giving it long enough to do so?
If possible, toss those tinker options and try again.
ntpq -p, ntpq -c as -c "rv &x" (where x is the association index
for the refclock 16) and ntpq -crv would be useful.
Rob
-
Re: drift value very large and very unstable
>>> In article , Harlan Stenn writes:
Harlan> It looks like there was some more information on APIC and ACPI, but
Harlan> those links are currently broken. -- Harlan Stenn
Harlan> http://ntpforum.isc.org - be a member!
Those links point to the page on "Configuring Trimble...Refclocks".
--
Harlan Stenn
http://ntpforum.isc.org - be a member!
-
Re: drift value very large and very unstable
Rob Neal wrote:
> On Mon, 3 Mar 2008, Andy Helten wrote:
> -snippage-
>
>> I am having a problem with drift values approaching and, on occasion,
>> reaching +/-500ppm.
>>
>>
> -snippage-
>
>> NTP conf file for BC635 IRIG-B PMC
>> /*******************************************************/
>>
>> tinker panic 0 # don't let daemon exit for any time difference
>>
> -snippage--
>
>> # Base conf file for normal operation for both server and client
>> tinker step 0 # disable stepping, so that we only slew time
>>
>> # Conf file for normal operation of a server
>>
>> server 127.127.16.0 prefer mode 2 minpoll 4 iburst burst # Symmetricom BC635
>> tos orphan 6
>>
> -- snippage --
>
> Lose the 'iburst burst' on 16.
>
> With the two tinker commands above you give ntpd the requirement
> to amortize the offset entirely with frequency control.
>
> Are you giving it long enough to do so?
>
> If possible, toss those tinker options and try again.
>
> ntpq -p, ntpq -c as -c "rv &x" (where x is the association index
> for the refclock 16) and ntpq -crv would be useful.
>
> Rob
>
>
Rob,
In this case, the purpose of 'iburst burst' is to decrease startup time so
that ntp will begin servicing sync requests within a reasonable amount
of time. I'm not sure that both are necessary, but definitely one of
them (along with minpoll 4) decreases startup time from several minutes
to about 20 seconds. I seem to recall reading somewhere in the NTP docs
that burst and iburst have no effect on reference clocks, but that simply
isn't true for the BC635 (refclock_bancomm.c). Removing them is still
worth a try and I will run like that overnight. In fact, I started
running ntpd with the ntp.conf below (after making the suggested
ntp.conf changes) and the ntpq output below is after only about 25
minutes of ntp operation. This is running the Redhawk 2.6.18 linux
kernel on the same exact hardware as was used last night on the Redhat
2.6.9-42 kernel (the relevance of this kernel is mentioned below).
I think I have been giving it enough time to stabilize -- any test I
consider legitimate was allowed to run for at least 8 hours. Most tests
ran overnight for 18-24 hours and some tests ran over weekends for
nearly 72 hours. Results were always the same (very large drift). In
fact, if allowed to run long enough, the drift almost always reached the
+/-500 max.
The tinker commands are also necessary (at least disabling the step) due
to some commercial software that has serious problems with backward time
steps. This problem should be fixed in a future version, but that may
not be soon enough for us. Even then, we may not want time to step
backwards.
I should also provide an update for a test that ran last night in which
the base RedHat EL4 Update 4 distribution (2.6.9-42 kernel) was used
with ntp 4.2.4p0 and the exact same single board computer and exact same
BC635 hardware. This test stabilized at a drift of -35ppm with a very
small offset (0.021 milliseconds). This test ran overnight and by late
morning the drift was changing only by a few hundredths at a time. In
other words, everything was working as expected. So, whatever the
problem, it almost definitely is software related (and most likely is a
problem with the kernel?).
Regarding the kernel's HZ value and its relation to time loss/gain, is
there a way to determine the actual value at runtime? I want the value
of HZ that is actually in use in the running kernel. I wasn't able to
find a way to do this. By the HZ macro in /usr/include, I get a value
of 100 and by the "/boot/config-*" file I see a value of 250. This is
why I would like a sysctl type value or /proc entry with the actual HZ
value, not a macro or config file. Any ideas?
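One rough way to approach the HZ question empirically is to count timer interrupts over a known interval. The sketch below is only that, a sketch: it assumes the periodic tick shows up on the IRQ 0 row of /proc/interrupts, which may not hold on every kernel or SMP configuration (some count it on a local-timer row instead, and tickless kernels do not tick at a fixed rate). As far as I know, the HZ macro visible under /usr/include is the userspace value (USER_HZ), which may explain why it reads 100 while the kernel config says 250. The function names are made up for this example.

```python
import time


def timer_ticks(proc_interrupts_text, label="0"):
    """Sum the per-CPU counts on the /proc/interrupts row with the given label."""
    for line in proc_interrupts_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == label:
            total = 0
            for field in rest.split():
                if not field.isdigit():
                    break  # counts end where the controller/device names begin
                total += int(field)
            return total
    raise KeyError(f"no row labelled {label!r}")


def estimate_hz(interval=5.0, label="0", path="/proc/interrupts"):
    """Estimate the tick rate by sampling the timer interrupt count twice."""
    with open(path) as f:
        before = timer_ticks(f.read(), label)
    start = time.monotonic()
    time.sleep(interval)
    elapsed = time.monotonic() - start
    with open(path) as f:
        after = timer_ticks(f.read(), label)
    return (after - before) / elapsed


# Made-up sample text in the usual /proc/interrupts layout, for a dry run.
sample = """\
           CPU0       CPU1
  0:     250000     250000    IO-APIC-edge  timer
  1:         10          0    IO-APIC-edge  i8042
"""
print(timer_ticks(sample))
# On the target system: print(estimate_hz())
```

On a multi-CPU board the summed count may need to be divided by the number of CPUs that actually take the tick, so the result is best treated as a sanity check against the values 100 and 250 rather than an exact measurement.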
Thanks,
Andy
/**************************************/
new ntp.conf
/**************************************/
# Debug stuff
statistics clockstats peerstats loopstats
statsdir /var/lib/ntp/log/
filegen clockstats file stats.clock type pid link enable
filegen peerstats file stats.peer type pid link enable
filegen loopstats file stats.loop type pid link enable
restrict default nomodify notrap noquery
restrict 127.0.0.1
driftfile /var/lib/ntp/drift
server 127.127.16.0 prefer mode 2 minpoll 4 # Symmetricom BC635
tos orphan 6
/**************************************/
ntpq output
/**************************************/
sbc1 root 31->ntpq
ntpq> pe
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*GPS_BANC(0)     .BTFP.           0 l    4   16  377    0.000    9.121   3.489
ntpq> as
ind assID status  conf reach auth condition  last_event cnt
===========================================================
  1 13451  9614   yes   yes  none  sys.peer   reachable  1
ntpq> rv &1
assID=13451 status=9614 reach, conf, sel_sys.peer, 1 event, event_reach,
srcadr=GPS_BANC(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=00,
stratum=0, precision=-21, rootdelay=0.000, rootdispersion=0.000,
refid=BTFP, reach=377, unreach=0, hmode=3, pmode=4, hpoll=4, ppoll=10,
flash=00 ok, keyid=0, ttl=64, offset=9.121, delay=0.000,
dispersion=0.236, jitter=3.489,
reftime=c0311460.c183a17a Wed, Mar 6 2002 17:19:12.755,
org=c0311460.c183a17a Wed, Mar 6 2002 17:19:12.755,
rec=c0311460.c18428f8 Wed, Mar 6 2002 17:19:12.755,
xmt=c0311460.c1831775 Wed, Mar 6 2002 17:19:12.755,
filtdelay= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00,
filtoffset= 9.12 9.76 10.44 11.20 12.02 12.93 13.86 14.90,
filtdisp= 0.00 0.24 0.48 0.74 0.99 1.26 1.52 1.79
ntpq> cv
assID=0 status=0000 clk_okay, last_clk_okay,
type=16, timecode="065 22:19:27.764471000 0", poll=110, noreply=0,
badformat=0, baddata=0, fudgetime1=0.000, stratum=0, refid=BTFP,
flags=0
ntpq>
-
Re: drift value very large and very unstable
iburst is good. burst, very likely not.
--
Harlan Stenn
http://ntpforum.isc.org - be a member!
-
Re: drift value very large and very unstable
On Wed, 5 Mar 2008, Andy Helten wrote:
>
> I think I have been giving it enough time to stabilize -- any test I
> consider legitimate was allowed to run for at least 8 hours. Most tests
> ran overnight for 18-24 hours and some tests ran over weekends for
> nearly 72 hours. Results were always the same (very large drift). In
> fact, if allowed to run long enough, the drift almost always reached the
> +/-500 max.
The drift only tells part of the story. What is the offset
doing? Does it cross zero and continue to diverge, or is
it still headed to zero?
>
> The tinker commands are also necessary (at least disabling the step) due
> to some commercial software that has serious problems with backward time
> steps. This problem should be fixed in a future version, but that may
> not be soon enough for us. Even then, we may not want time to step
> backwards.
There is a reason they are options. Try setting your clock
by hand an hour or so off, and starting ntpd. Watch the
time it reports while it plays catch-up. Scary.
Your call, but it would probably fail an audit.
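To put numbers on the catch-up time: ntpd's documentation gives the maximum slew rate as 500 ppm (0.5 ms per second), so with stepping disabled every second of offset needs at least 2000 seconds to amortize. The following is a small worked calculation under that assumption; the helper name is invented for the example.

```python
MAX_SLEW_PPM = 500.0  # documented ntpd maximum slew rate (0.5 ms per second)


def slew_seconds(offset_seconds, slew_ppm=MAX_SLEW_PPM):
    """Minimum wall-clock time needed to slew away an offset at the given rate."""
    return abs(offset_seconds) / (slew_ppm * 1e-6)


for label, offset in [("128 ms", 0.128), ("1 s", 1.0), ("1 hour", 3600.0)]:
    t = slew_seconds(offset)
    print(f"{label:>7}: {t:12.0f} s  ({t / 86400:.2f} days)")
```

An offset of one hour works out to roughly 83 days of slewing at the maximum rate, which is the scenario described above.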
>
> Regarding the kernel's HZ value and its relation to time loss/gain, is
> there a way to determine the actual value at runtime? I want the value
> of HZ that is actually in use in the running kernel. I wasn't able to
> find a way to do this. By the HZ macro in /usr/include, I get a value
> of 100 and by the "/boot/config-*" file I see a value of 250. This is
> why I would like a sysctl type value or /proc entry with the actual HZ
> value, not a macro or config file. Any ideas?
Kernel sysctl of some sort. Consult the Linux kernelmongers.
>
> Thanks,
> Andy
>
> sbc1 root 31->ntpq
> ntpq> pe
>      remote           refid      st t when poll reach   delay   offset  jitter
> ==============================================================================
> *GPS_BANC(0)     .BTFP.           0 l    4   16  377    0.000    9.121   3.489
Jitter is ugly for an attached refclock. You have something
bad happening, this should be much lower.
> ntpq> as
>
> ind assID status  conf reach auth condition  last_event cnt
> ===========================================================
>   1 13451  9614   yes   yes  none  sys.peer   reachable  1
> ntpq> rv &1
> assID=13451 status=9614 reach, conf, sel_sys.peer, 1 event, event_reach,
> srcadr=GPS_BANC(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=00,
> stratum=0, precision=-21, rootdelay=0.000, rootdispersion=0.000,
> refid=BTFP, reach=377, unreach=0, hmode=3, pmode=4, hpoll=4, ppoll=10,
> flash=00 ok, keyid=0, ttl=64, offset=9.121, delay=0.000,
> dispersion=0.236, jitter=3.489,
> reftime=c0311460.c183a17a Wed, Mar 6 2002 17:19:12.755,
> org=c0311460.c183a17a Wed, Mar 6 2002 17:19:12.755,
> rec=c0311460.c18428f8 Wed, Mar 6 2002 17:19:12.755,
> xmt=c0311460.c1831775 Wed, Mar 6 2002 17:19:12.755,
> filtdelay= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00,
> filtoffset= 9.12 9.76 10.44 11.20 12.02 12.93 13.86 14.90,
> filtdisp= 0.00 0.24 0.48 0.74 0.99 1.26 1.52 1.79
Looks like the offset is still trending to zero, with a
long way to go.
Rob
-
Re: drift value very large and very unstable
Harlan Stenn wrote:
> iburst is good. burst, very likely not.
>
Actually, I just ran a test and, at least for refclock_bancomm, burst is
the only one that matters. However, even with 'burst', it would still
take 64 seconds for ntp to declare a peer if minpoll is not also
decreased. I attributed this to the fact that 'burst' only applies to
normal ops, so at least one "normal" polling period is required.
Andy
-
Re: drift value very large and very unstable
> From: Harlan Stenn
> Date: Thu, 06 Mar 2008 06:33:17 +0000
> Sender: questions-bounces+oberman=es.net@lists.ntp.org
>
>
> iburst is good. burst, very likely not.
In general, I agree, but the context is for a connected reference clock.
I would set maxpoll to 4 as there is no reason NOT to update from the
reference clock on a frequent (16 second) basis, but I'm less certain if
iburst is really appropriate. (Nor am I sure that it's inappropriate.)
--
R. Kevin Oberman, Network Engineer
Energy Sciences Network (ESnet)
Ernest O. Lawrence Berkeley National Laboratory (Berkeley Lab)
E-mail: oberman@es.net Phone: +1 510 486-8634
Key fingerprint:059B 2DDF 031C 9BA3 14A4 EADA 927D EBB3 987B 3751
-
Re: drift value very large and very unstable
The good news is that "new ntp.conf" appears to work! This is the first
configuration that has produced reasonable results, granted it could
still be a fluke since the drift was rather unpredictable (but _always_
settled near +/-500ppm). The bad news is that we _require_ some of the
commands removed from ntp.conf (at least burst and step). After letting
ntp run with the "new ntp.conf" for at least 16 hours, the drift had
stabilized around 33ppm:
sbc1 root 1->ntpq -crv
assID=0 status=0444 leap_none, sync_uhf_clock, 4 events,
event_peer/strat_chg,
version="ntpd 4.2.4p0@1.1472 Tue Jan 8 16:23:44 UTC 2008 (1)",
processor="i686", system="Linux/2.6.18.8-RedHawk-4.2-trace", leap=00,
stratum=1, precision=-20, rootdelay=0.000, rootdispersion=0.272,
peer=13451, refid=BTFP,
reftime=c0320bd4.c1843a15 Thu, Mar 7 2002 10:55:00.755, poll=4,
clock=c0320bd5.6dfc379d Thu, Mar 7 2002 10:55:01.429, state=4,
offset=0.029, frequency=33.562, jitter=0.002, noise=0.002,
stability=0.001
This test ran with the previously problematic Redhawk kernel and all of
the same hardware. To further isolate the problem, I've added the
'burst' command back into ntp.conf, removed the drift file, and
restarted ntp.
Andy
-
Re: drift value very large and very unstable
Kevin,
1. As per the advice in the documentation, do not use iburst with
reference clock drivers. The only reason you might want to do this is to
reduce the time to set the clock on initial start. This is unnecessary
as the clock is now set on the first reply received. Using iburst anyway
will screw up the radio protocol in some drivers.
2. The temptation to reduce the poll interval below the default can be
counterproductive. The driver interface uses a median filter to clean up
nominal jitter due to serial port and interrupt latencies. Generally,
the jitter is reduced as the number of stages and the poll interval are
increased. There are cases, in particular with kernel PPS signals, where
a smaller poll interval can result in marginally better performance, but
in other cases it is generally not a good idea.
3. The burst mode is designed for use when the poll interval of
necessity must be very long, like at least 1024 s. The current design
will rate-limit if burst is used with a poll interval of 512 s or less.
This is to protect busy servers with a minimum average default headway
of 16 s. Some operators might set the headway higher.
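As an illustration of point 2, here is a generic median filter applied to a run of offset samples. This is not ntpd's actual driver code, only a sketch of why a median over several stages suppresses occasional latency spikes; the sample values are hypothetical.

```python
import statistics


def median_filter(samples, stages):
    """Median of each consecutive, non-overlapping window of `stages` samples."""
    return [
        statistics.median(samples[i:i + stages])
        for i in range(0, len(samples) - stages + 1, stages)
    ]


# Hypothetical offsets in milliseconds: a steady 0.02 ms reading with two
# latency spikes mixed in.
offsets = [0.02, 0.02, 3.5, 0.02, 0.02, 0.02, 0.02, 9.1, 0.02, 0.02, 0.02, 0.02]
print(median_filter(offsets, 4))   # spikes do not survive the median
print(statistics.mean(offsets))    # a plain average is pulled up by the spikes
```

More stages per output tolerate more outliers, at the cost of needing more raw samples for each value handed to the clock discipline.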
Dave
Kevin Oberman wrote:
>>From: Harlan Stenn
>>Date: Thu, 06 Mar 2008 06:33:17 +0000
>>Sender: questions-bounces+oberman=es.net@lists.ntp.org
>>
>>
>>iburst is good. burst, very likely not.
>
>
> In general, I agree, but the context is for a connected reference clock.
>
> I would set maxpoll to 4 as there is no reason NOT to update from the
> reference clock on a frequent (16 second) basis, but I'm less certain if
> iburst is really appropriate. (Nor am I sure that it's inappropriate.)
-
Re: drift value very large and very unstable
>>> In article <20080306162227.5985F45047@ptavv.es.net>, oberman@es.net (Kevin Oberman) writes:
>> From: Harlan Stenn Date: Thu, 06 Mar 2008 06:33:17 +0000
>> Sender: questions-bounces+oberman=es.net@lists.ntp.org
>>
>> iburst is good. burst, very likely not.
Kevin> In general, I agree, but the context is for a connected reference
Kevin> clock.
I missed that - sorry, and thanks Kevin!
--
Harlan Stenn
http://ntpforum.isc.org - be a member!
-
Re: drift value very large and very unstable
Andy wrote:
> The good news is that "new ntp.conf" appears to work! This is the first
> configuration that has produced reasonable results, granted it could
> still be a fluke since the drift was rather unpredictable (but _always_
> settled near +/-500ppm). The bad news is that we _require_ some of the
> commands removed from ntp.conf (at least burst and step). After letting
> ntp run with the "new ntp.conf" for at least 16 hours, the drift had
> stabilized around 33ppm:
>
> sbc1 root 1->ntpq -crv
> assID=0 status=0444 leap_none, sync_uhf_clock, 4 events,
> event_peer/strat_chg,
> version="ntpd 4.2.4p0@1.1472 Tue Jan 8 16:23:44 UTC 2008 (1)",
> processor="i686", system="Linux/2.6.18.8-RedHawk-4.2-trace", leap=00,
> stratum=1, precision=-20, rootdelay=0.000, rootdispersion=0.272,
> peer=13451, refid=BTFP,
> reftime=c0320bd4.c1843a15 Thu, Mar 7 2002 10:55:00.755, poll=4,
> clock=c0320bd5.6dfc379d Thu, Mar 7 2002 10:55:01.429, state=4,
> offset=0.029, frequency=33.562, jitter=0.002, noise=0.002,
> stability=0.001
>
>
> This test ran with the previously problematic Redhawk kernel and all of
> the same hardware. To further isolate the problem, I've added the
> 'burst' command back into ntp.conf, removed the drift file, and
> restarted ntp.
>
> Andy
>
This may seem like a long email, but it mostly consists of two cut &
paste jobs interspersed with brilliant analysis. The two cut & paste
chunks are from two different runs of ntp, one that shows ntp working
correctly and one that shows it "failing". The _only_ difference
between the two runs is that time stepping was left to default behavior
in the working run and time stepping was disabled in the "failing" run.
So, please don't be discouraged by the length of the email, you may find
it an intriguing read...
As I mentioned in a previous email, I was going to run ntp while adding
back in some of the features I removed. After doing this, at least
superficially, I've isolated the problem to time step being disabled.
It doesn't matter whether I specify 'tinker step 0' in ntp.conf or use
the '-x' argument on the command line. Both result in drift that
approaches 500ppm. First the working configuration. Below is ntpq
output and the ntp.conf file after running just over 12 hours with step
_enabled_ in which everything works correctly. Below the ntpq/ntp.conf
information is the beginning and ending of the stats.loop file for the
same run.
Keep in mind, NTP runs perfectly on this same set of hardware when
running the 2.6.9-42 linux kernel with time stepping *disabled*. If the
true drift of this system is around 33ppm (two different runs with
stepping disabled have settled near 33ppm), then would the clock offset
ever get larger than 128ms and require a step? In fact, the answer
seems to be "no, a time step is never even required". As proof,
stats.loop shows the largest offset to be 0.023634 seconds and that is
the third entry in the file. The offset only goes down after that,
eventually achieving microsecond accuracy. Yes, this is with time
stepping *enabled*, but still the point is that no step was even needed
to keep time accurately and to establish a reasonable (but apparently
accurate) drift value.
/*****************************************/
ntpq for working configuration, stepping enabled
/*****************************************/
sbc1 root 3->ntpq
ntpq> pe
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*GPS_BANC(0)     .BTFP.           0 l    8   16  377    0.000   -0.006   0.001
ntpq> as
ind assID status  conf reach auth condition  last_event cnt
===========================================================
  1 19400  9614   yes   yes  none  sys.peer   reachable  1
ntpq> rv &1
assID=19400 status=9614 reach, conf, sel_sys.peer, 1 event, event_reach,
srcadr=GPS_BANC(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=00,
stratum=0, precision=-21, rootdelay=0.000, rootdispersion=0.000,
refid=BTFP, reach=377, unreach=0, hmode=3, pmode=4, hpoll=4, ppoll=10,
flash=00 ok, keyid=0, ttl=64, offset=-0.006, delay=0.000,
dispersion=0.105, jitter=0.001,
reftime=cb7bc62f.16499b2e Fri, Mar 7 2008 8:48:31.087,
org=cb7bc62f.16499b2e Fri, Mar 7 2008 8:48:31.087,
rec=cb7bc62f.164a14e6 Fri, Mar 7 2008 8:48:31.087,
xmt=cb7bc62f.16490801 Fri, Mar 7 2008 8:48:31.087,
filtdelay= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00,
filtoffset= -0.01 -0.01 -0.01 -0.01 -0.01 -0.01 -0.01 -0.01,
filtdisp= 0.00 0.20 0.21 0.23 0.24 0.26 0.27 0.47
ntpq> rv
assID=0 status=0444 leap_none, sync_uhf_clock, 4 events,
event_peer/strat_chg,
version="ntpd 4.2.4p0@1.1472 Tue Jan 8 16:23:44 UTC 2008 (1)",
processor="i686", system="Linux/2.6.18.8-RedHawk-4.2-trace", leap=00,
stratum=1, precision=-20, rootdelay=0.000, rootdispersion=0.037,
peer=19400, refid=BTFP,
reftime=cb7bc634.1649ad18 Fri, Mar 7 2008 8:48:36.087, poll=4,
clock=cb7bc635.182e76f5 Fri, Mar 7 2008 8:48:37.094, state=4,
offset=-0.003, frequency=33.551, jitter=0.002, noise=0.002,
stability=0.001
ntpq> quit
sbc1 root 10->ps x|grep ntpd
9554 ? Ss 0:00 ntpd -c /etc/ntp_debug.conf
sbc1 root 4->cat /etc/ntp_debug.conf
# Debug stuff
statistics clockstats peerstats loopstats
statsdir /var/lib/ntp/log/
filegen clockstats file stats.clock type pid link enable
filegen peerstats file stats.peer type pid link enable
filegen loopstats file stats.loop type pid link enable
restrict default nomodify notrap noquery
restrict 127.0.0.1
driftfile /var/lib/ntp/drift
tinker panic 0
server 127.127.16.0 prefer mode 2 minpoll 4 burst # Symmetricom BC635
tos orphan 6
/*****************************************/
stats.loop for working configuration, stepping enabled
/*****************************************/
54532 8012.087 0.014105000 0.000 0.004986901 0.000000 6
54532 8922.088 0.023610000 10.445 0.007191059 3.692976 6
54532 8937.087 0.023634000 10.465 0.006726625 3.454469 6
54532 8952.088 0.023633000 10.484 0.006292182 3.231368 6
54532 8970.087 0.023631000 10.512 0.005885797 3.022683 6
54532 8986.087 0.023628000 10.535 0.005505659 2.827473 6
54532 9004.087 0.023625000 10.559 0.005150073 2.644872 6
54532 9021.088 0.023622000 10.582 0.004817452 2.474065 6
54532 9036.088 0.023620000 10.605 0.004506314 2.314291 5
54532 9053.087 0.023616000 10.699 0.004215271 2.165075 5
54532 9068.087 0.023292000 10.781 0.003944691 2.025449 5
54532 9085.088 0.022913000 10.875 0.003692351 1.894924 5
54532 9101.088 0.022565000 10.961 0.003456068 1.772800 4
54532 9119.090 0.022186000 11.344 0.003235634 1.663816 4
54532 9136.087 0.021169000 11.688 0.003047943 1.561096 4
54532 9151.087 0.020285000 11.977 0.002868169 1.463843 4
54532 9167.088 0.019392000 12.273 0.002701435 1.373317 4
54532 9182.087 0.018601000 12.539 0.002542389 1.288048 4
54532 9197.088 0.017851000 12.793 0.002392922 1.208199 4
54532 9213.087 0.017095000 13.055 0.002254277 1.133948 4
54532 9228.088 0.016425000 13.289 0.002121941 1.063943 4
54532 9246.090 0.015669000 13.559 0.002002814 0.999779 4
54532 9262.088 0.015036000 13.789 0.001886778 0.938751 4
54532 9278.088 0.014437000 14.008 0.001777582 0.881520 4
54532 9293.087 0.013905000 14.207 0.001673378 0.827590 4
54532 9310.088 0.013337000 14.422 0.001578123 0.777857 4
54532 9326.087 0.012832000 14.617 0.001486967 0.730888 4
54532 9344.088 0.012297000 14.832 0.001403735 0.687889 4
54532 9359.088 0.011876000 15.000 0.001321477 0.646196 4
54532 9377.088 0.011401000 15.195 0.001247494 0.608393 4
54532 9392.087 0.011025000 15.352 0.001174473 0.571774 4
54532 9409.087 0.010623000 15.527 0.001107767 0.538445 4
54532 9427.088 0.010225000 15.703 0.001045725 0.507488 4
54532 9442.087 0.009909000 15.844 0.000984548 0.477309 4
54532 49946.087 0.000006000 33.555 0.000001419 0.001337 4
54532 49963.088 0.000006000 33.555 0.000001369 0.001251 4
54532 49979.087 0.000008000 33.555 0.000001526 0.001170 4
54532 49996.087 0.000008000 33.555 0.000001467 0.001095 4
54532 50012.087 0.000010000 33.555 0.000001538 0.001024 4
54532 50030.088 0.000012000 33.555 0.000001646 0.000958 4
54532 50045.088 0.000013000 33.555 0.000001576 0.000896 4
54532 50063.088 0.000015000 33.555 0.000001635 0.000838 4
54532 50079.087 0.000016000 33.555 0.000001570 0.000784 4
54532 50095.087 0.000017000 33.555 0.000001554 0.000733 4
54532 50110.088 0.000019000 33.555 0.000001618 0.000686 4
54532 50126.087 0.000022000 33.555 0.000001775 0.000642 4
/*****************************************/
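For anyone who wants to repeat this check on their own stats.loop file, here is a minimal Python sketch. It assumes the seven loopstats columns are MJD, seconds past midnight, offset (s), frequency (ppm), jitter (s), wander (ppm), and the loop time constant; that is my reading of the ntpd monitoring documentation, not something stated in this thread. The names LoopSample, parse_loopstats and STEP_THRESHOLD_S are made up for illustration, and the four sample lines are copied from the listing above.

```python
from typing import List, NamedTuple


class LoopSample(NamedTuple):
    """One loopstats record (field meanings are an assumption, see lead-in)."""
    mjd: int
    seconds: float
    offset_s: float
    freq_ppm: float
    jitter_s: float
    wander_ppm: float
    time_const: int


# ntpd's default step threshold of 128 ms, as mentioned in the post.
STEP_THRESHOLD_S = 0.128


def parse_loopstats(text: str) -> List[LoopSample]:
    """Parse loopstats text, silently skipping lines that do not fit."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue
        try:
            samples.append(LoopSample(
                int(fields[0]), float(fields[1]), float(fields[2]),
                float(fields[3]), float(fields[4]), float(fields[5]),
                int(fields[6])))
        except ValueError:
            continue  # not a numeric record
    return samples


# Four records copied from the working run (stepping enabled).
SAMPLE = """\
54532 8012.087 0.014105000 0.000 0.004986901 0.000000 6
54532 8937.087 0.023634000 10.465 0.006726625 3.454469 6
54532 9442.087 0.009909000 15.844 0.000984548 0.477309 4
54532 50126.087 0.000022000 33.555 0.000001775 0.000642 4
"""

samples = parse_loopstats(SAMPLE)
worst = max(samples, key=lambda s: abs(s.offset_s))
print("largest offset (s):", worst.offset_s)
print("final frequency (ppm):", samples[-1].freq_ppm)
print("step would be needed:", abs(worst.offset_s) > STEP_THRESHOLD_S)
```

To run it against a real file, read the file contents into a string and pass that to parse_loopstats instead of SAMPLE.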
Now here are some ntpq and stats.loop values for the exact same
hardware/software configuration as above, except with stepping disabled
via 'tinker step 0'. There was no reboot between these runs, only the
tinker step was added back into ntp.conf and the drift file was
deleted. This test was allowed to run for just over two hours and the
drift value was still increasing, but experience with this setup
indicates the drift was not going to come back down if the test were
allowed to run longer. The stats.loop output shows the beginning of the
file, shows where offset reaches its maximum, and then shows the end of
the file. As you can see, the offset max is 0.093575450 seconds, so no
time step is required nor is one taken. Yet, the drift runs out of control.
/*****************************************/
ntpq for failing configuration, stepping disabled
/*****************************************/
sbc1 root 27->ntpq
ntpq> pe
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*GPS_BANC(0)     .BTFP.           0 l    3   16  377    0.000   34.313   0.323
ntpq> as
ind assID status  conf reach auth condition  last_event cnt
===========================================================
  1 52112  9614   yes   yes  none  sys.peer   reachable  1
ntpq> rv &1
assID=52112 status=9614 reach, conf, sel_sys.peer, 1 event, event_reach,
srcadr=GPS_BANC(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=00,
stratum=0, precision=-21, rootdelay=0.000, rootdispersion=0.000,
refid=BTFP, reach=377, unreach=0, hmode=3, pmode=4, hpoll=4, ppoll=10,
flash=00 ok, keyid=0, ttl=64, offset=34.313, delay=0.000,
dispersion=0.017, jitter=0.323,
reftime=cb7b0304.fe74557e Thu, Mar 6 2008 18:55:48.993,
org=cb7b0304.fe74557e Thu, Mar 6 2008 18:55:48.993,
rec=cb7b0304.fe74aa8a Thu, Mar 6 2008 18:55:48.993,
xmt=cb7b0304.fe74063f Thu, Mar 6 2008 18:55:48.993,
filtdelay= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00,
filtoffset= 34.31 34.28 34.25 34.21 34.18 34.15 33.75 33.72,
filtdisp= 0.00 0.02 0.03 0.05 0.06 0.08 0.26 0.27
ntpq> rv
assID=0 status=0444 leap_none, sync_uhf_clock, 4 events,
event_peer/strat_chg,
version="ntpd 4.2.4p0@1.1472 Tue Jan 8 16:23:44 UTC 2008 (1)",
processor="i686", system="Linux/2.6.18.8-RedHawk-4.2-trace", leap=00,
stratum=1, precision=-20, rootdelay=0.000, rootdispersion=34.818,
peer=52112, refid=BTFP,
reftime=cb7b0304.fe74aa8a Thu, Mar 6 2008 18:55:48.993, poll=4,
clock=cb7b0310.9f487fd7 Thu, Mar 6 2008 18:56:00.622, state=4,
offset=34.313, frequency=368.414, jitter=0.323, noise=0.952,
stability=0.526
ntpq>
ntpq> quit
sbc1 root 28->date
Thu Mar 6 18:56:35 EST 2008
sbc1 root 29->cat /etc/ntp_debug.conf
# Debug stuff
statistics clockstats peerstats loopstats
statsdir /var/lib/ntp/log/
filegen clockstats file stats.clock type pid link enable
filegen peerstats file stats.peer type pid link enable
filegen loopstats file stats.loop type pid link enable
restrict default nomodify notrap noquery
restrict 127.0.0.1
driftfile /var/lib/ntp/drift
tinker step 0
server 127.127.16.0 prefer mode 2 minpoll 4 burst # Symmetricom BC635
tos orphan 6
sbc1 root 30->
/*****************************************/
stats.loop for failing configuration, stepping disabled
/*****************************************/
54531 78843.994 0.000851046 0.000 0.000300891 0.000000 6
54531 79747.994 0.030705963 33.578 0.026475415 11.871458 6
54531 79765.994 0.031304524 33.611 0.024766387 11.104738 6
54531 79780.994 0.031796892 33.640 0.023167488 10.387536 6
54531 79797.994 0.032358327 33.672 0.021672109 9.716657 6
54531 79815.994 0.032955986 33.708 0.020273503 9.089109 7
54531 79833.994 0.033550503 33.717 0.018965291 8.502084 7
54531 79849.994 0.034074021 33.725 0.017741370 7.952972 7
54531 79865.994 0.034603163 33.733 0.016596587 7.439324 7
54531 79882.994 0.035167590 33.742 0.015525968 6.958851 7
54531 79898.994 0.035694285 33.751 0.014524407 6.509410 8
54531 79916.994 0.036289215 33.753 0.013587967 6.088996 8
54531 79932.994 0.036820295 33.755 0.012711766 5.695734 8
54531 79950.994 0.037417341 33.758 0.011892642 5.327871 8
54531 79965.994 0.037910326 33.760 0.011125913 4.983767 9
54531 79981.994 0.038438966 33.760 0.010409017 4.661888 9
54531 79998.994 0.039003336 33.761 0.009738788 4.360796 9
54531 80013.994 0.039500613 33.762 0.009111498 4.079152 9
54531 80030.994 0.040060894 33.762 0.008525328 3.815697 8
54531 80046.994 0.040590507 33.765 0.007976912 3.569258 8
54531 80063.994 0.041151342 33.767 0.007464352 3.338735 7
54531 80079.994 0.041677784 33.777 0.006984742 3.123103 7
54531 80096.994 0.042239126 33.788 0.006536642 2.921397 7
54531 80113.994 0.042804284 33.799 0.006117732 2.732720 6
54531 80131.994 0.043397711 33.845 0.005726459 2.556278 6
54531 80149.994 0.043991734 33.892 0.005360728 2.391238 6
54531 80166.994 0.044557325 33.938 0.005018487 2.236855 5
54531 80183.994 0.045117876 34.120 0.004698547 2.093385 5
54531 80201.994 0.045714207 34.317 0.004400142 1.959410 5
54531 80216.994 0.046208335 34.482 0.004119662 1.833791 5
54531 80233.994 0.046769870 34.671 0.003858701 1.716664 4
54531 80250.994 0.047327945 35.394 0.003614874 1.625964 4
54531 80268.994 0.047925094 36.125 0.003387989 1.542768 4
54531 80285.994 0.048483893 36.865 0.003175326 1.466640 4
54531 80300.994 0.048980676 37.565 0.002975434 1.394102 4
54531 80317.994 0.049539934 38.321 0.002790278 1.331168 4
54531 80333.994 0.050070750 39.085 0.002616804 1.274155 4
54531 80350.994 0.050633071 39.858 0.002455857 1.222764 4
54531 80368.994 0.051225187 40.640 0.002306763 1.176702 4
54531 81596.994 0.091688138 120.566 0.000560630 1.336847 4
54531 81614.994 0.092277665 121.974 0.000564323 1.345953 4
54531 81631.994 0.092832915 123.390 0.000563197 1.354974 4
54531 81647.994 0.093355741 124.815 0.000558310 1.363858 4
54531 81665.994 0.093944241 126.248 0.000562173 1.372754 4
54531 81680.994 0.094438060 127.599 0.000554090 1.370047 4
54531 81698.994 0.095025116 129.049 0.000558317 1.380290 4
54531 81713.994 0.095517171 130.416 0.000550471 1.378559 4
54531 81729.994 0.095041514 131.866 0.000541684 1.387719 4
54531 81745.994 0.094563218 133.309 0.000534172 1.394739 4
54531 81760.994 0.094058731 134.654 0.000530552 1.388682 4
54531 81775.994 0.093548859 135.992 0.000528012 1.382476 4
54531 81791.994 0.093575450 137.420 0.000493999 1.388228 4
54531 81808.994 0.093133732 138.841 0.000487771 1.392381 4
54531 81825.994 0.092695262 140.256 0.000481884 1.395154 4
54531 81843.994 0.092282190 141.664 0.000473829 1.396781 4
54531 81860.994 0.091845491 143.065 0.000469349 1.397366 4
54531 81876.994 0.091866474 144.467 0.000439098 1.397917 4
54531 81893.994 0.090927669 145.855 0.000528087 1.396613 4
54531 81909.994 0.090950463 147.242 0.000494046 1.395513 4
54531 81926.994 0.090506990 148.623 0.000488011 1.393711 4
54531 86165.994 0.032871409 368.915 0.001025820 0.522859 4
54531 86182.994 0.033430722 369.425 0.000979730 0.521283 4
54531 86198.994 0.033955849 369.943 0.000935071 0.520889 4
54531 86215.994 0.032518157 370.440 0.001011648 0.517866 4
54531 86232.994 0.033075338 370.944 0.000966597 0.516237 4
54531 86248.994 0.033601583 371.457 0.000923113 0.515799 4
54531 86264.994 0.032128572 371.947 0.001008385 0.512674 4
54531 86282.994 0.032720790 372.447 0.000966217 0.511019 4
54531 86298.994 0.032747589 372.946 0.000903863 0.509617 4
54531 86314.994 0.032778294 373.446 0.000845556 0.508444 4
54531 86332.994 0.031872478 373.933 0.000853321 0.505733 4
54531 86349.994 0.032433287 374.428 0.000822466 0.504391 4
/*****************************************/
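One thing that can be read off the listing above is how fast the offset itself is growing. The Python sketch below takes two records from that stats.loop excerpt (the second one shown and the last one before the first gap) and computes the slope of the offset in ppm. The function name offset_slope_ppm is made up for illustration. The result comes out at roughly 33 ppm, close to the drift the working runs settled on, even though the reported frequency correction rose from about 33.6 to 40.6 ppm over the same span. That would be consistent with the corrections not reaching the clock, but it is only a hint, not proof.

```python
def offset_slope_ppm(t0: float, off0: float, t1: float, off1: float) -> float:
    """Rate of change of the offset between two samples, in ppm.

    t0/t1 are seconds past midnight (same MJD), off0/off1 are offsets
    in seconds. 1 ppm means the offset changes by 1 microsecond per second.
    """
    return (off1 - off0) / (t1 - t0) * 1e6


# Two records copied from the stepping-disabled stats.loop excerpt.
slope = offset_slope_ppm(79747.994, 0.030705963, 80368.994, 0.051225187)
print(f"offset is growing at about {slope:.1f} ppm")
```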
So, the summary is that drift goes to 500ppm when stepping is disabled
but runs normally when stepping is enabled and both situations never
require a time step. This makes no sense to me. By the way, as
mentioned previously, we require that time does not step backward due to
a problem in some commercial software that cannot currently tolerate
time moving backwards.
Quite frankly, I don't think it's unreasonable that a system require
time to monotonically increase. Clearly this isn't the first system
that requires such behavior (i.e. time step disable was not added for
me). I understand it takes 14 days to recover from an offset of 600
seconds, but I also understand that if we have an offset of more than
10ms in this system, then something isn't working correctly. I'm going
to be bold and say that we simply will _never_ have an offset of 600
seconds in this system. If we do, they will have a recovery procedure
that involves rebooting the system, which will force a quick sync during
startup. If they continue to have a problem, it will be fixed, most
likely by swapping hardware until the problem is fixed or flying someone
in to work on the system.
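The 14-day figure follows directly from the 500ppm slew limit. A small Python sketch of the arithmetic, assuming the clock is slewed at the full 500ppm for the whole recovery (the names MAX_SLEW_PPM and slew_time_seconds are made up for illustration):

```python
MAX_SLEW_PPM = 500.0  # maximum slew rate assumed for this estimate


def slew_time_seconds(offset_s: float, slew_ppm: float = MAX_SLEW_PPM) -> float:
    """Seconds needed to remove offset_s when slewing at slew_ppm."""
    return abs(offset_s) / (slew_ppm * 1e-6)


days_for_600s = slew_time_seconds(600.0) / 86400.0
print(f"600 s offset: {days_for_600s:.1f} days")
print(f"128 ms offset: {slew_time_seconds(0.128):.0f} s")
print(f"10 ms offset: {slew_time_seconds(0.010):.0f} s")
```

So a 600 second offset takes just under 14 days to slew out, while offsets in the range this system should ever see are gone within seconds to minutes.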
To summarize, we really need to disable time stepping to keep time from
moving backwards. Maybe the commercial software will be fixed before
this problem is solved, but I don't want to rely on that and, even then,
monotonically increasing time may remain a requirement.
Andy
-
Re: drift value very large and very unstable
Rob Neal wrote:
> On Wed, 5 Mar 2008, Andy Helten wrote:
>
>
>> I think I have been giving it enough time to stabilize -- any test I
>> consider legitimate was allowed to run for at least 8 hours. Most tests
>> ran overnight for 18-24 hours and some tests ran over weekends for
>> nearly 72 hours. Results were always the same (very large drift). In
>> fact, if allowed to run long enough, the drift almost always reached the
>> +/-500 max.
>>
> The drift only tells part of the story. What is the offset
> doing? Does it cross zero and continue to diverge, or is
> it still headed to zero?
>
In one run that I looked at closely (which was the subject of another
email I just sent), the offset increased to about 90ms and then started
to decrease. I stopped the run before it had time to fully converge to
zero and the drift value was still increasing, so it's not the perfect
example. A quick look at the logs from another run shows the offset
reaching 115ms. This test ran for several hours, the drift eventually
reached 500ppm at which point the offset bounced around from 1 to 3ms
(i.e. the offset was very unstable). I guess you would expect an
unstable offset if you are banging up against the upper end of the
drift. Here are some lines from that stats.loop:
54525 62788.352 0.023098393 0.000 0.008166515 0.000000 6 <-BEGIN
54525 62791.352 0.023381710 0.004 0.007639732 0.001478 6
54525 62809.352 0.025086955 0.031 0.007171702 0.009616 6
54525 62825.352 0.026604190 0.056 0.006729925 0.012703 6
54525 62840.352 0.028026520 0.082 0.006315321 0.014822 6
54525 62856.352 0.029544500 0.110 0.005931771 0.017072 6
54525 62872.352 0.031063284 0.139 0.005574586 0.019098 6
54525 62889.352 0.032676090 0.172 0.005245631 0.021358 6
54525 62905.352 0.034194599 0.205 0.004936122 0.023067 5
54525 62922.354 0.035808330 0.350 0.004652435 0.055665 5
54525 62937.352 0.037232161 0.483 0.004380973 0.070196 5
54525 62955.354 0.038941107 0.650 0.004142326 0.088332 5
54525 62972.352 0.040555202 0.815 0.003916589 0.101018 4
54525 62990.352 0.042264909 1.460 0.003713166 0.246815 4
54525 63663.352 0.106036189 47.865 0.001520471 1.434006 4
54525 63678.352 0.107457194 49.402 0.001508397 1.447306 4
54525 63696.352 0.109161199 51.068 0.001534212 1.476368 4
54525 63712.352 0.110677398 52.756 0.001531972 1.504564 4
54525 63727.352 0.112099421 54.360 0.001518664 1.517296 4
54525 63743.352 0.113614091 56.094 0.001518165 1.545992 4
54525 63758.354 0.115036493 57.739 0.001506528 1.558793 4 <-MAX
54525 63773.352 0.113958623 59.369 0.001459845 1.567895 4
54525 63791.352 0.114164639 61.111 0.001367501 1.590703 4
54525 63808.352 0.113274943 62.840 0.001317288 1.608565 4
54525 63825.352 0.113386776 64.570 0.001232844 1.624260 4
54525 63843.352 0.113093372 66.296 0.001157876 1.637280 4
54525 63860.352 0.112205284 68.008 0.001127688 1.646820 4
54525 63875.352 0.111627571 69.605 0.001074448 1.640657 4
54525 63891.352 0.111143832 71.300 0.001019502 1.647666 4
54525 63906.352 0.111067353 72.889 0.000954040 1.640427 4
54525 63924.352 0.110773822 74.580 0.000898437 1.646740 4
54525 63942.352 0.110479720 76.265 0.000846819 1.651672 4
54525 63958.352 0.109498471 77.936 0.000864766 1.654077 4
54528 52785.354 0.002121099 500.000 0.000561826 0.014400 10
54528 52803.352 0.002828893 500.000 0.000582078 0.017494 10
54528 52821.352 0.003035901 500.000 0.000549381 0.016695 10
54528 52838.352 0.003150695 500.000 0.000515499 0.015727 9
54528 52854.354 0.001669291 500.000 0.000711928 0.014711 9
54528 52869.352 0.001417609 500.000 0.000671866 0.013761 9
54528 52886.354 0.001530270 500.000 0.000629734 0.012872 9
54528 52904.352 0.001413156 500.000 0.000590516 0.012041 10
54528 52921.354 0.002351464 500.000 0.000644339 0.018575 10
54528 52940.352 0.003152638 500.000 0.000665967 0.021479 10
54528 52956.352 0.003170197 500.000 0.000622986 0.020094 10
54528 52973.352 0.002956953 499.991 0.000587606 0.019083 9 <-END
>> The tinker commands are also necessary (at least disabling the step) due
>> to some commercial software that has serious problems with backward time
>> steps. This problem should be fixed in a future version, but that may
>> not be soon enough for us. Even then, we may not want time to step
>> backwards.
>>
> There is a reason they are options. Try setting your clock
> by hand an hour or so off, and starting ntpd. Watch the
> time it reports while it plays catch-up. Scary.
> Your call, but it would probably fail an audit.
>
I understand your point and it is indeed scary to consider the
catastrophic failures enabled by preventing time steps. My
counter-point is that no one is going to be setting time on these
systems and time should never jump. If IRIG-B time is jumping around
wildly, then no other subsystem will work correctly if it relies on
accurate time in any way. It would need to be fixed. In fact, if
IRIG-B time jumps more than a certain amount, our subsystem stops using
it for synchronizing system time. It is better for us to drift from
IRIG-B time, so long as the various boards in our system remain
synchronized with each other.
In reality, we must be able to assume IRIG-B time is stable. With that
assumption, we must also be able to assume we can keep system time
within a few milliseconds of IRIG-B time. We use IRIG-B time directly
(i.e. read it from the IRIG PMC's registers) on boards that require
highly accurate time synchronization, but not all boards have an IRIG-B
PMC. We use ntp-disciplined system time for timestamps that aren't so
critical. The NTP synchronization requirements are TBD, but will
probably be on the order of 1-50ms accuracy between the IRIG-B synced
NTP server and the various NTP clients. I don't think this is
unreasonable and is achievable in all the testing I've done on *other*
hardware and software.
Andy
-
Re: drift value very large and very unstable
> So, the summary is that drift goes to 500ppm when stepping is disabled
> but runs normally when stepping is enabled and both situations never
> require a time step. This makes no sense to me. By the way, as
> mentioned previously, we require that time does not step backward due to
> a problem in some commercial software that cannot currently tolerate
> time moving backwards.
>
> Quite frankly, I don't think it's unreasonable that a system require
> time to monotonically increase.
Forgive me if this answer misses a point in the earlier details, or shows my
ignorance of NTP, but a few ideas/thoughts.
Oscillators and drift can go in either direction, fast or slow; it's a
physics-based situation. You can't write code around that and provide a
software solution that is monotonic at all times. However, a single negative
step just at the start may be required before going monotonic after that
event. (Not an expert, but that is my understanding).
With this ref clock and a GPS-driven IRIG source, you may only see a single
negative step when NTP first begins running on a new system with no drift
file, or a system that has been powered off a long time with a
battery-driven clock drifting over that long time. Once NTP is humming along
after the initial step and some updates, you shouldn't see a step again.
This makes me think that you should insert a delay in launching your
sensitive application, or block the application at some point, so it does
not see the (possible) first time step.
Fran Horan
JHU/APL
-
Re: drift value very large and very unstable
Fran Horan wrote:
>
>
>
>> So, the summary is that drift goes to 500ppm when stepping is disabled
>> but runs normally when stepping is enabled and both situations never
>> require a time step. This makes no sense to me. By the way, as
>> mentioned previously, we require that time does not step backward due to
>> a problem in some commercial software that cannot currently tolerate
>> time moving backwards.
>>
>> Quite frankly, I don't think it's unreasonable that a system require
>> time to monotonically increase.
>>
>
> Forgive me if this answer misses a point in the earlier details, or shows my
> ignorance of NTP, but a few ideas/thoughts.
>
> Oscillators and drift can go in either direction, fast or slow; it's a
> physics-based situation. You can't write code around that and provide a
> software solution that is monotonic at all times. However, a single negative
> step just at the start may be required before going monotonic after that
> event. (Not an expert, but that is my understanding).
>
> With this ref clock and a GPS-driven IRIG source, you may only see a single
> negative step when NTP first begins running on a new system with no drift
> file, or a system that has been powered off a long time with a
> battery-driven clock drifting over that long time. Once NTP is humming along
> after the initial step and some updates, you shouldn't see a step again.
> This makes me think that you should insert a delay in launching your
> sensitive application, or block the application at some point, so it does
> not see the (possible) first time step.
>
> Fran Horan
> JHU/APL
>
>
Hey Fran,
Yes, exactly, we do perform an initial time sync with stepping enabled.
This is done prior to initializing the commercial software and so it
does not cause problems if time moves backwards. And, yes, if we are
below the step threshold after the initial sync (which should always be
the case), then we should stay below that threshold until the end of
time. Following this logic, we should allow time steps and be comforted
knowing they will never occur in a normally functioning system. I agree
this is reasonable and does not conflict with my own rant that "if we
have an offset of more than 10ms in this system, then something isn't
working correctly".
This approach is definitely worth considering and I'll bring it up with
the decision makers. However, there is always concern that months or
years from now someone will say -- "Hey, some dumbass left time stepping
enabled, let's disable it on all systems immediately". Surely this
wouldn't be done without some regression testing, but then again such a
mundane change shouldn't need exhaustive testing, right? Riiiiight.
I guess I was just hoping someone would say, "Oh, right, that's a known
problem. You need to do 'X' to fix it."
Andy
-
Re: drift value very large and very unstable
On Fri, Mar 07, 2008 at 09:13:14AM -0600, Andy Helten wrote:
> As I mentioned in a previous email, I was going to run ntp while adding
> back in some of the features I removed. After doing this, at least
> superficially, I've isolated the problem to time step being disabled.
> It doesn't matter whether I specify 'tinker step 0' in ntp.conf or use
> the '-x' argument on the command line.
This looks like a kernel bug. When the step threshold is 0 or larger
than 0.5s, kernel time discipline is disabled and time is adjusted
only by adjtime(). I've recently seen a similar problem on a PowerPC
machine where adjtime() called with delta smaller than 1ms was
ignored.
--
Miroslav Lichvar
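To put numbers on the adjtime() case, here is a rough Python sketch. It assumes that with kernel discipline disabled ntpd hands its correction to adjtime() about once per second; that is my understanding of the daemon and is not stated in this thread. The 1ms cutoff is the PowerPC behaviour described above and has not been confirmed for the RedHawk kernel. The names per_second_delta_us and IGNORED_BELOW_US are made up for illustration.

```python
# Cutoff below which adjtime() deltas were ignored in the PowerPC case
# described above (1 ms). Whether RedHawk behaves this way is unverified.
IGNORED_BELOW_US = 1000.0


def per_second_delta_us(freq_ppm: float, interval_s: float = 1.0) -> float:
    """Microseconds of correction needed per interval at freq_ppm.

    1 ppm over 1 second is 1 microsecond.
    """
    return freq_ppm * interval_s


for ppm in (33.5, 500.0):
    delta = per_second_delta_us(ppm)
    dropped = delta < IGNORED_BELOW_US
    print(f"{ppm:6.1f} ppm -> {delta:6.1f} us per call, ignored: {dropped}")
```

Under those assumptions, both a 33.5ppm correction and one at the 500ppm limit are smaller than 1ms per call, so a kernel with that kind of bug would drop every one of them.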
-
Re: drift value very large and very unstable
Miroslav Lichvar wrote:
> On Fri, Mar 07, 2008 at 09:13:14AM -0600, Andy Helten wrote:
>
>> As I mentioned in a previous email, I was going to run ntp while adding
>> back in some of the features I removed. After doing this, at least
>> superficially, I've isolated the problem to time step being disabled.
>> It doesn't matter whether I specify 'tinker step 0' in ntp.conf or use
>> the '-x' argument on the command line.
>>
>
> This looks like a kernel bug. When the step threshold is 0 or larger
> than 0.5s, kernel time discipline is disabled and time is adjusted
> only by adjtime(). I've recently seen a similar problem on a PowerPC
> machine where adjtime() called with delta smaller than 1ms was
> ignored.
>
Thank you for this pointer! I'm fuzzy at best on the relationship
between NTP and the kernel, but I did find and read through a thread on
the mailing list about 'tinker step 0' and the kernel time discipline:
http://lists.ntp.isc.org/pipermail/q...er/011531.html
I read through it quickly, so I am still unsure why disabling the kernel
time discipline is necessary if stepping is disabled but no step would
have occurred even if stepping were enabled. Is there an "official" NTP
page that covers the topic of kernel time discipline? At any rate, I
will look closer at the kernel and the problems with adjtime.
Andy
-
Re: drift value very large and very unstable
Andy Helten wrote:
> Andy wrote:
>
>>The good news is that "new ntp.conf" appears to work! This is the first
>>configuration that has produced reasonable results, granted it could
>>still be a fluke since the drift was rather unpredictable (but _always_
>>settled near +/-500ppm). The bad news is that we _require_ some of the
>>commands removed from ntp.conf (at least burst and step). After letting
>>ntp run with the "new ntp.conf" for at least 16 hours, the drift had
>>stabilized around 33ppm:
>>
>>sbc1 root 1->ntpq -crv
>>assID=0 status=0444 leap_none, sync_uhf_clock, 4 events,
>>event_peer/strat_chg,
>>version="ntpd 4.2.4p0@1.1472 Tue Jan 8 16:23:44 UTC 2008 (1)",
>>processor="i686", system="Linux/2.6.18.8-RedHawk-4.2-trace", leap=00,
>>stratum=1, precision=-20, rootdelay=0.000, rootdispersion=0.272,
>>peer=13451, refid=BTFP,
>>reftime=c0320bd4.c1843a15 Thu, Mar 7 2002 10:55:00.755, poll=4,
>>clock=c0320bd5.6dfc379d Thu, Mar 7 2002 10:55:01.429, state=4,
>>offset=0.029, frequency=33.562, jitter=0.002, noise=0.002,
>>stability=0.001
>>
>>
>>This test ran with the previously problematic Redhawk kernel and all of
>>the same hardware. To further isolate the problem, I've added the
>>'burst' command back into ntp.conf, removed the drift file, and
>>restarted ntp.
>>
>>Andy
>>
>
>
> This may seem like a long email, but it mostly consists of two cut &
> paste jobs interspersed with brilliant analysis. The two cut & paste
> chunks are from two different runs of ntp, one that shows ntp working
> correctly and one that shows it "failing". The _only_ difference
> between the two runs is that time stepping was left to default behavior
> in the working run and time stepping was disabled in the "failing" run.
> So, please don't be discouraged by the length of the email, you may find
> it an intriguing read...
>
>
> As I mentioned in a previous email, I was going to run ntp while adding
> back in some of the features I removed. After doing this, at least
> superficially, I've isolated the problem to time step being disabled.
> It doesn't matter whether I specify 'tinker step 0' in ntp.conf or use
> the '-x' argument on the command line. Both result in drift that
> approaches 500ppm. First the working configuration. Below is ntpq
> output and the ntp.conf file after running just over 12 hours with step
> _enabled_ in which everything works correctly. Below the ntpq/ntp.conf
> information is the beginning and ending of the stats.loop file for the
> same run.
>
> Keep in mind, NTP runs perfectly on this same set of hardware when
> running the 2.6.9-42 linux kernel with time stepping *disabled*. If the
> true drift of this system is around 33ppm (two different runs with
> stepping disabled have settled near 33ppm), then would the clock offset
> ever get larger than 128ms and require a step? In fact, the answer
> seems to be "no, a time step is never even required". As proof,
> stats.loop shows the largest offset to be 0.023634 seconds and that is
> the third entry in the file. The offset only goes down after that,
> eventually achieving microsecond accuracy. Yes, this is with time
> stepping *enabled*, but still the point is that no step was even needed
> to keep time accurately and to establish a reasonable (but apparently
> accurate) drift value.
>
>
> /*****************************************/
> ntpq for working configuration, stepping enabled
> /*****************************************/
> sbc1 root 3->ntpq
> ntpq> pe
>      remote           refid      st t when poll reach   delay   offset  jitter
> ==============================================================================
> *GPS_BANC(0)     .BTFP.           0 l    8   16  377    0.000   -0.006   0.001
> ntpq> as
>
> ind assID status  conf reach auth condition  last_event cnt
> ===========================================================
>   1 19400  9614   yes   yes  none  sys.peer   reachable  1
> ntpq> rv &1
> assID=19400 status=9614 reach, conf, sel_sys.peer, 1 event, event_reach,
> srcadr=GPS_BANC(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=00,
> stratum=0, precision=-21, rootdelay=0.000, rootdispersion=0.000,
> refid=BTFP, reach=377, unreach=0, hmode=3, pmode=4, hpoll=4, ppoll=10,
> flash=00 ok, keyid=0, ttl=64, offset=-0.006, delay=0.000,
> dispersion=0.105, jitter=0.001,
> reftime=cb7bc62f.16499b2e Fri, Mar 7 2008 8:48:31.087,
> org=cb7bc62f.16499b2e Fri, Mar 7 2008 8:48:31.087,
> rec=cb7bc62f.164a14e6 Fri, Mar 7 2008 8:48:31.087,
> xmt=cb7bc62f.16490801 Fri, Mar 7 2008 8:48:31.087,
> filtdelay= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00,
> filtoffset= -0.01 -0.01 -0.01 -0.01 -0.01 -0.01 -0.01 -0.01,
> filtdisp= 0.00 0.20 0.21 0.23 0.24 0.26 0.27 0.47
> ntpq> rv
> assID=0 status=0444 leap_none, sync_uhf_clock, 4 events,
> event_peer/strat_chg,
> version="ntpd 4.2.4p0@1.1472 Tue Jan 8 16:23:44 UTC 2008 (1)",
> processor="i686", system="Linux/2.6.18.8-RedHawk-4.2-trace", leap=00,
> stratum=1, precision=-20, rootdelay=0.000, rootdispersion=0.037,
> peer=19400, refid=BTFP,
> reftime=cb7bc634.1649ad18 Fri, Mar 7 2008 8:48:36.087, poll=4,
> clock=cb7bc635.182e76f5 Fri, Mar 7 2008 8:48:37.094, state=4,
> offset=-0.003, frequency=33.551, jitter=0.002, noise=0.002,
> stability=0.001
> ntpq> quit
> sbc1 root 10->ps x|grep ntpd
> 9554 ? Ss 0:00 ntpd -c /etc/ntp_debug.conf
> sbc1 root 4->cat /etc/ntp_debug.conf
> # Debug stuff
> statistics clockstats peerstats loopstats
> statsdir /var/lib/ntp/log/
> filegen clockstats file stats.clock type pid link enable
> filegen peerstats file stats.peer type pid link enable
> filegen loopstats file stats.loop type pid link enable
>
> restrict default nomodify notrap noquery
> restrict 127.0.0.1
>
> driftfile /var/lib/ntp/drift
>
> tinker panic 0
>
> server 127.127.16.0 prefer mode 2 minpoll 4 burst # Symmetricom BC635
> tos orphan 6
>
> /*****************************************/
> stats.loop for working configuration, stepping enabled
> /*****************************************/
> 54532 8012.087 0.014105000 0.000 0.004986901 0.000000 6
> 54532 8922.088 0.023610000 10.445 0.007191059 3.692976 6
> 54532 8937.087 0.023634000 10.465 0.006726625 3.454469 6
> 54532 8952.088 0.023633000 10.484 0.006292182 3.231368 6
> 54532 8970.087 0.023631000 10.512 0.005885797 3.022683 6
> 54532 8986.087 0.023628000 10.535 0.005505659 2.827473 6
> 54532 9004.087 0.023625000 10.559 0.005150073 2.644872 6
> 54532 9021.088 0.023622000 10.582 0.004817452 2.474065 6
> 54532 9036.088 0.023620000 10.605 0.004506314 2.314291 5
> 54532 9053.087 0.023616000 10.699 0.004215271 2.165075 5
> 54532 9068.087 0.023292000 10.781 0.003944691 2.025449 5
> 54532 9085.088 0.022913000 10.875 0.003692351 1.894924 5
> 54532 9101.088 0.022565000 10.961 0.003456068 1.772800 4
> 54532 9119.090 0.022186000 11.344 0.003235634 1.663816 4
> 54532 9136.087 0.021169000 11.688 0.003047943 1.561096 4
> 54532 9151.087 0.020285000 11.977 0.002868169 1.463843 4
> 54532 9167.088 0.019392000 12.273 0.002701435 1.373317 4
> 54532 9182.087 0.018601000 12.539 0.002542389 1.288048 4
> 54532 9197.088 0.017851000 12.793 0.002392922 1.208199 4
> 54532 9213.087 0.017095000 13.055 0.002254277 1.133948 4
> 54532 9228.088 0.016425000 13.289 0.002121941 1.063943 4
> 54532 9246.090 0.015669000 13.559 0.002002814 0.999779 4
> 54532 9262.088 0.015036000 13.789 0.001886778 0.938751 4
> 54532 9278.088 0.014437000 14.008 0.001777582 0.881520 4
> 54532 9293.087 0.013905000 14.207 0.001673378 0.827590 4
> 54532 9310.088 0.013337000 14.422 0.001578123 0.777857 4
> 54532 9326.087 0.012832000 14.617 0.001486967 0.730888 4
> 54532 9344.088 0.012297000 14.832 0.001403735 0.687889 4
> 54532 9359.088 0.011876000 15.000 0.001321477 0.646196 4
> 54532 9377.088 0.011401000 15.195 0.001247494 0.608393 4
> 54532 9392.087 0.011025000 15.352 0.001174473 0.571774 4
> 54532 9409.087 0.010623000 15.527 0.001107767 0.538445 4
> 54532 9427.088 0.010225000 15.703 0.001045725 0.507488 4
> 54532 9442.087 0.009909000 15.844 0.000984548 0.477309 4
>
>
>
> 54532 49946.087 0.000006000 33.555 0.000001419 0.001337 4
> 54532 49963.088 0.000006000 33.555 0.000001369 0.001251 4
> 54532 49979.087 0.000008000 33.555 0.000001526 0.001170 4
> 54532 49996.087 0.000008000 33.555 0.000001467 0.001095 4
> 54532 50012.087 0.000010000 33.555 0.000001538 0.001024 4
> 54532 50030.088 0.000012000 33.555 0.000001646 0.000958 4
> 54532 50045.088 0.000013000 33.555 0.000001576 0.000896 4
> 54532 50063.088 0.000015000 33.555 0.000001635 0.000838 4
> 54532 50079.087 0.000016000 33.555 0.000001570 0.000784 4
> 54532 50095.087 0.000017000 33.555 0.000001554 0.000733 4
> 54532 50110.088 0.000019000 33.555 0.000001618 0.000686 4
> 54532 50126.087 0.000022000 33.555 0.000001775 0.000642 4
> /*****************************************/
>
>
> Now here are some ntpq and stats.loop values for the exact same
> hardware/software configuration as above, except with stepping disabled
> via 'tinker step 0'. There was no reboot between these runs, only the
> tinker step was added back into ntp.conf and the drift file was
> deleted. This test was allowed to run for just over two hours and the
> drift value was still increasing, but experience with this setup
> indicates the drift was not going to come back down if the test were
> allowed to run longer. The stats.loop output shows the beginning of the
> file, shows where offset reaches its maximum, and then shows the end of
> the file. As you can see, the offset max is 0.095517171 seconds, so no
> time step is required nor is one taken. Yet, the drift runs out of control.
>
>
> /*****************************************/
> ntpq for the same configuration, stepping disabled ('tinker step 0')
> /*****************************************/
> sbc1 root 27->ntpq
> ntpq> pe
>      remote           refid      st t when poll reach   delay   offset  jitter
> ==============================================================================
> *GPS_BANC(0)     .BTFP.           0 l    3   16  377    0.000   34.313   0.323
> ntpq> as
>
> ind assID status  conf reach auth condition  last_event cnt
> ===========================================================
>   1 52112  9614   yes   yes  none  sys.peer   reachable   1
> ntpq> rv &1
> assID=52112 status=9614 reach, conf, sel_sys.peer, 1 event, event_reach,
> srcadr=GPS_BANC(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=00,
> stratum=0, precision=-21, rootdelay=0.000, rootdispersion=0.000,
> refid=BTFP, reach=377, unreach=0, hmode=3, pmode=4, hpoll=4, ppoll=10,
> flash=00 ok, keyid=0, ttl=64, offset=34.313, delay=0.000,
> dispersion=0.017, jitter=0.323,
> reftime=cb7b0304.fe74557e Thu, Mar 6 2008 18:55:48.993,
> org=cb7b0304.fe74557e Thu, Mar 6 2008 18:55:48.993,
> rec=cb7b0304.fe74aa8a Thu, Mar 6 2008 18:55:48.993,
> xmt=cb7b0304.fe74063f Thu, Mar 6 2008 18:55:48.993,
> filtdelay= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00,
> filtoffset= 34.31 34.28 34.25 34.21 34.18 34.15 33.75 33.72,
> filtdisp= 0.00 0.02 0.03 0.05 0.06 0.08 0.26 0.27
> ntpq> rv
> assID=0 status=0444 leap_none, sync_uhf_clock, 4 events,
> event_peer/strat_chg,
> version="ntpd 4.2.4p0@1.1472 Tue Jan 8 16:23:44 UTC 2008 (1)",
> processor="i686", system="Linux/2.6.18.8-RedHawk-4.2-trace", leap=00,
> stratum=1, precision=-20, rootdelay=0.000, rootdispersion=34.818,
> peer=52112, refid=BTFP,
> reftime=cb7b0304.fe74aa8a Thu, Mar 6 2008 18:55:48.993, poll=4,
> clock=cb7b0310.9f487fd7 Thu, Mar 6 2008 18:56:00.622, state=4,
> offset=34.313, frequency=368.414, jitter=0.323, noise=0.952,
> stability=0.526
> ntpq>
> ntpq> quit
> sbc1 root 28->date
> Thu Mar 6 18:56:35 EST 2008
>
> sbc1 root 29->cat /etc/ntp_debug.conf
> # Debug stuff
> statistics clockstats peerstats loopstats
> statsdir /var/lib/ntp/log/
> filegen clockstats file stats.clock type pid link enable
> filegen peerstats file stats.peer type pid link enable
> filegen loopstats file stats.loop type pid link enable
>
> restrict default nomodify notrap noquery
> restrict 127.0.0.1
>
> driftfile /var/lib/ntp/drift
>
> tinker step 0
>
> server 127.127.16.0 prefer mode 2 minpoll 4 burst # Symmetricom BC635
> tos orphan 6
> sbc1 root 30->
>
>
> /*****************************************/
> stats.loop for the same configuration, stepping disabled ('tinker step 0')
> /*****************************************/
> 54531 78843.994 0.000851046 0.000 0.000300891 0.000000 6
> 54531 79747.994 0.030705963 33.578 0.026475415 11.871458 6
> 54531 79765.994 0.031304524 33.611 0.024766387 11.104738 6
> 54531 79780.994 0.031796892 33.640 0.023167488 10.387536 6
> 54531 79797.994 0.032358327 33.672 0.021672109 9.716657 6
> 54531 79815.994 0.032955986 33.708 0.020273503 9.089109 7
> 54531 79833.994 0.033550503 33.717 0.018965291 8.502084 7
> 54531 79849.994 0.034074021 33.725 0.017741370 7.952972 7
> 54531 79865.994 0.034603163 33.733 0.016596587 7.439324 7
> 54531 79882.994 0.035167590 33.742 0.015525968 6.958851 7
> 54531 79898.994 0.035694285 33.751 0.014524407 6.509410 8
> 54531 79916.994 0.036289215 33.753 0.013587967 6.088996 8
> 54531 79932.994 0.036820295 33.755 0.012711766 5.695734 8
> 54531 79950.994 0.037417341 33.758 0.011892642 5.327871 8
> 54531 79965.994 0.037910326 33.760 0.011125913 4.983767 9
> 54531 79981.994 0.038438966 33.760 0.010409017 4.661888 9
> 54531 79998.994 0.039003336 33.761 0.009738788 4.360796 9
> 54531 80013.994 0.039500613 33.762 0.009111498 4.079152 9
> 54531 80030.994 0.040060894 33.762 0.008525328 3.815697 8
> 54531 80046.994 0.040590507 33.765 0.007976912 3.569258 8
> 54531 80063.994 0.041151342 33.767 0.007464352 3.338735 7
> 54531 80079.994 0.041677784 33.777 0.006984742 3.123103 7
> 54531 80096.994 0.042239126 33.788 0.006536642 2.921397 7
> 54531 80113.994 0.042804284 33.799 0.006117732 2.732720 6
> 54531 80131.994 0.043397711 33.845 0.005726459 2.556278 6
> 54531 80149.994 0.043991734 33.892 0.005360728 2.391238 6
> 54531 80166.994 0.044557325 33.938 0.005018487 2.236855 5
> 54531 80183.994 0.045117876 34.120 0.004698547 2.093385 5
> 54531 80201.994 0.045714207 34.317 0.004400142 1.959410 5
> 54531 80216.994 0.046208335 34.482 0.004119662 1.833791 5
> 54531 80233.994 0.046769870 34.671 0.003858701 1.716664 4
> 54531 80250.994 0.047327945 35.394 0.003614874 1.625964 4
> 54531 80268.994 0.047925094 36.125 0.003387989 1.542768 4
> 54531 80285.994 0.048483893 36.865 0.003175326 1.466640 4
> 54531 80300.994 0.048980676 37.565 0.002975434 1.394102 4
> 54531 80317.994 0.049539934 38.321 0.002790278 1.331168 4
> 54531 80333.994 0.050070750 39.085 0.002616804 1.274155 4
> 54531 80350.994 0.050633071 39.858 0.002455857 1.222764 4
> 54531 80368.994 0.051225187 40.640 0.002306763 1.176702 4
>
>
>
> 54531 81596.994 0.091688138 120.566 0.000560630 1.336847 4
> 54531 81614.994 0.092277665 121.974 0.000564323 1.345953 4
> 54531 81631.994 0.092832915 123.390 0.000563197 1.354974 4
> 54531 81647.994 0.093355741 124.815 0.000558310 1.363858 4
> 54531 81665.994 0.093944241 126.248 0.000562173 1.372754 4
> 54531 81680.994 0.094438060 127.599 0.000554090 1.370047 4
> 54531 81698.994 0.095025116 129.049 0.000558317 1.380290 4
> 54531 81713.994 0.095517171 130.416 0.000550471 1.378559 4
> 54531 81729.994 0.095041514 131.866 0.000541684 1.387719 4
> 54531 81745.994 0.094563218 133.309 0.000534172 1.394739 4
> 54531 81760.994 0.094058731 134.654 0.000530552 1.388682 4
> 54531 81775.994 0.093548859 135.992 0.000528012 1.382476 4
> 54531 81791.994 0.093575450 137.420 0.000493999 1.388228 4
> 54531 81808.994 0.093133732 138.841 0.000487771 1.392381 4
> 54531 81825.994 0.092695262 140.256 0.000481884 1.395154 4
> 54531 81843.994 0.092282190 141.664 0.000473829 1.396781 4
> 54531 81860.994 0.091845491 143.065 0.000469349 1.397366 4
> 54531 81876.994 0.091866474 144.467 0.000439098 1.397917 4
> 54531 81893.994 0.090927669 145.855 0.000528087 1.396613 4
> 54531 81909.994 0.090950463 147.242 0.000494046 1.395513 4
> 54531 81926.994 0.090506990 148.623 0.000488011 1.393711 4
>
>
>
> 54531 86165.994 0.032871409 368.915 0.001025820 0.522859 4
> 54531 86182.994 0.033430722 369.425 0.000979730 0.521283 4
> 54531 86198.994 0.033955849 369.943 0.000935071 0.520889 4
> 54531 86215.994 0.032518157 370.440 0.001011648 0.517866 4
> 54531 86232.994 0.033075338 370.944 0.000966597 0.516237 4
> 54531 86248.994 0.033601583 371.457 0.000923113 0.515799 4
> 54531 86264.994 0.032128572 371.947 0.001008385 0.512674 4
> 54531 86282.994 0.032720790 372.447 0.000966217 0.511019 4
> 54531 86298.994 0.032747589 372.946 0.000903863 0.509617 4
> 54531 86314.994 0.032778294 373.446 0.000845556 0.508444 4
> 54531 86332.994 0.031872478 373.933 0.000853321 0.505733 4
> 54531 86349.994 0.032433287 374.428 0.000822466 0.504391 4
> /*****************************************/
>
>
> So, the summary is that drift goes to 500ppm when stepping is disabled
> but runs normally when stepping is enabled, and neither situation ever
> requires a time step. This makes no sense to me. By the way, as
> mentioned previously, we require that time does not step backward due to
> a problem in some commercial software that cannot currently tolerate
> time moving backwards.
>
> Quite frankly, I don't think it's unreasonable that a system require
> time to monotonically increase. Clearly this isn't the first system
> that requires such behavior (i.e. time step disable was not added for
> me). I understand it takes 14 days to recover from an offset of 600
> seconds, but I also understand that if we have an offset of more than
> 10ms in this system, then something isn't working correctly. I'm going
> to be bold and say that we simply will _never_ have an offset of 600
> seconds in this system. If we do, they will have a recovery procedure
> that involves rebooting the system, which will force a quick sync during
> startup. If they continue to have a problem, it will be fixed, most
> likely by swapping hardware until the problem is fixed or flying someone
> in to work on the system.
>
> To summarize, we really need to disable time stepping to keep time from
> moving backwards. Maybe the commercial software will be fixed before
> this problem is solved, but I don't want to rely on that and, even then,
> monotonically increasing time may remain a requirement.
>
> Andy
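For reference, the 14-day figure quoted above follows from ntpd's maximum slew rate of 500 ppm, i.e. 0.5 ms of correction per second of elapsed time. A minimal sketch of that arithmetic; the function name `slew_time_seconds` is made up for illustration and is not part of NTP:

```python
MAX_SLEW_PPM = 500.0  # ntpd's maximum slew rate, in parts per million


def slew_time_seconds(offset_s, slew_ppm=MAX_SLEW_PPM):
    """Seconds needed to remove `offset_s` by slewing at `slew_ppm`."""
    # 1 ppm corrects 1 microsecond per second, so convert the offset
    # to microseconds and divide by the slew rate.
    return abs(offset_s) * 1e6 / slew_ppm


def slew_time_days(offset_s, slew_ppm=MAX_SLEW_PPM):
    """Same as slew_time_seconds, expressed in days."""
    return slew_time_seconds(offset_s, slew_ppm) / 86400.0
```

With these numbers a 600 second offset takes about 13.9 days to slew out, while an offset at the default step threshold of 0.128 seconds takes roughly 256 seconds and the 10ms "something is broken" limit mentioned above about 20 seconds.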
If ntpd won't work with stepping disabled it's probably a bug and you
should report it as such.
OTOH, ntpd clearly DOES work with stepping enabled so run it that way!
Ntpd does not step the time unless something is badly broken somewhere.
Once it has the correct time from the reference clock, it will stay
synchronized until it's shut down. The only case I can recall where
ntpd has regularly stepped the time was when running under Linux with
heavy disk activity, sufficient to cause the loss of timer interrupts.
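One way to check whether a given run ever came close to needing a step is to scan the loopstats file for the largest offset and compare it with the step threshold (0.128 s by default). A rough sketch, assuming the usual loopstats layout (MJD, seconds past midnight, offset in seconds, frequency in ppm, then further fields that are ignored here); `summarize_loopstats` is a hypothetical helper, not something shipped with NTP:

```python
STEP_THRESHOLD_S = 0.128  # ntpd's default step threshold, in seconds


def summarize_loopstats(lines, step_threshold=STEP_THRESHOLD_S):
    """Summarize offset and frequency from an iterable of loopstats lines.

    Lines that do not parse (blank lines, headings, quote markers only)
    are skipped. Returns None if no usable line was found.
    """
    offsets, freqs = [], []
    for line in lines:
        # Tolerate mail-style "> " quoting in front of pasted lines.
        fields = line.lstrip("> ").split()
        if len(fields) < 4:
            continue
        try:
            offset, freq = float(fields[2]), float(fields[3])
        except ValueError:
            continue
        offsets.append(offset)
        freqs.append(freq)
    if not offsets:
        return None
    max_abs_offset = max(abs(o) for o in offsets)
    return {
        "samples": len(offsets),
        "max_abs_offset_s": max_abs_offset,
        "min_freq_ppm": min(freqs),
        "max_freq_ppm": max(freqs),
        # True only if some sample exceeded the step threshold.
        "step_needed": max_abs_offset > step_threshold,
    }
```

It can be fed an open file, for example `summarize_loopstats(open("/var/lib/ntp/log/stats.loop"))` with the statsdir from the posted config. Run over the 'tinker step 0' excerpt posted above, it would show a largest offset of about 0.0955 s with the frequency climbing past 374 ppm, and no sample above the step threshold.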
If your application will not tolerate time steps, don't start it until
after ntpd acquires the correct time. In most cases ntpd can synch up
within thirty to sixty seconds so wait sixty seconds after starting ntpd
before starting your application.
I believe that there are scripts or programs that will monitor ntpd and
start an application after ntpd has acquired synch.
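As an illustration of the kind of monitor meant here, the sketch below polls `ntpq -c rv` until ntpd reports that it is synchronized. It rests on an assumption: an unsynchronized ntpd shows leap=11 and/or stratum=16 in its system variables, and anything else counts as synchronized. The function names are invented for this example. The NTP distribution also includes an `ntp-wait` script for the same job, if your packaging installs it.

```python
import re
import subprocess
import time


def is_synchronized(rv_text):
    """Return True if `ntpq -c rv` style output looks synchronized.

    Assumption: unsynchronized means leap=11 (alarm) or stratum=16.
    Missing fields are treated as "not synchronized".
    """
    leap = re.search(r"\bleap=([01]{2})\b", rv_text)
    stratum = re.search(r"\bstratum=(\d+)\b", rv_text)
    if leap is None or stratum is None:
        return False
    return leap.group(1) != "11" and int(stratum.group(1)) < 16


def read_rv():
    """Ask the local ntpd for its system variables via ntpq."""
    result = subprocess.run(
        ["ntpq", "-c", "rv"], capture_output=True, text=True, check=False
    )
    return result.stdout


def wait_for_sync(read=read_rv, timeout=300.0, interval=5.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until synchronized; return False if `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        if is_synchronized(read()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A startup script would call `wait_for_sync()` after launching ntpd and start the time-sensitive application only if it returns True. The timeout and poll interval defaults are arbitrary and should be tuned to the system.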
-
Re: drift value very large and very unstable
Andy Helten wrote:
> Fran Horan wrote:
>
>>
>>>So, the summary is that drift goes to 500ppm when stepping is disabled
>>>but runs normally when stepping is enabled, and neither situation ever
>>>requires a time step. This makes no sense to me. By the way, as
>>>mentioned previously, we require that time does not step backward due to
>>>a problem in some commercial software that cannot currently tolerate
>>>time moving backwards.
>>>
>>>Quite frankly, I don't think it's unreasonable that a system require
>>>time to monotonically increase.
>>>
>>
>>Forgive me if this answer misses a point in the earlier details, or shows my
>>ignorance of NTP, but a few ideas/thoughts.
>>
>>Oscillators and drift can go in either direction, fast or slow; it's a
>>physics-based situation. You can't write code around that and provide a
>>software solution that is monotonic at all times. However, a single negative
>>step just at the start may be required before going monotonic after that
>>event. (Not an expert, but that is my understanding).
>>
>>With this ref clock and a GPS-driven IRIG source, you may only see a single
>>negative step when NTP first begins running on a new system with no drift
>>file, or a system that has been powered off a long time with a
>>battery-driven clock drifting over that long time. Once NTP is humming along
>>after the initial step and some updates, you shouldn't see a step again.
>>This makes me think that you should insert a delay in launching your
>>sensitive application, or block the application at some point, so it does
>>not see the (possible) first time step.
>>
>>Fran Horan
>>JHU/APL
>>
>>
>
> Hey Fran,
>
> Yes, exactly, we do perform an initial time sync with stepping enabled.
> This is done prior to initializing the commercial software and so it
> does not cause problems if time moves backwards. And, yes, if we are
> below the step threshold after the initial sync (which should always be
> the case), then we should stay below that threshold until the end of
> time. Following this logic, we should allow time steps and be comforted
> knowing they will never occur in a normally functioning system. I agree
> this is reasonable and does not conflict with my own rant that "if we
> have an offset of more than 10ms in this system, then something isn't
> working correctly".
>
> This approach is definitely worth considering and I'll bring it up with
> the decision makers. However, there is always concern that months or
> years from now someone will say -- "Hey, some dumbass left time stepping
> enabled, let's disable it on all systems immediately". Surely this
> wouldn't be done without some regression testing, but then again such a
> mundane change shouldn't need exhaustive testing, right? Riiiiight.
>
> I guess I was just hoping someone would say, "Oh, right, that's a known
> problem. You need to do 'X' to fix it."
>
> Andy
A comment in ntp.conf and/or the startup file, explaining WHY stepping
is enabled, should go a long way toward solving the "dumbass" problem.
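Something along these lines would do, based on the working ntp_debug.conf posted earlier in the thread. The wording of the comment is only a suggestion, and the figures in it are the ones reported in this thread:

```
# NOTE: time stepping is deliberately left ENABLED here (no 'tinker step 0').
# With 'tinker step 0', ntpd 4.2.4p0 on this board let the frequency run
# away toward +/-500 ppm with the BC635 refclock; with stepping enabled it
# settles near 33.5 ppm and reaches microsecond-level offsets.
# No step is expected in normal operation: the initial sync is done before
# the time-sensitive application starts, and offsets stay far below the
# step threshold afterwards.
# Do NOT add 'tinker step 0' without repeating the drift tests.
tinker panic 0

server 127.127.16.0 prefer mode 2 minpoll 4 burst   # Symmetricom BC635
```

The same note can be repeated next to the line in the startup script that launches ntpd, so it is seen by whoever edits either file.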