drift value very large and very unstable - NTP


  1. drift value very large and very unstable

    I realize this is long, but I tried to include the whole story. I did
    work earnestly to solve this on my own, but unfortunately I've been
    spinning my wheels the last few days. Thanks for any help.

    I am having a problem with drift values approaching and, on occasion,
    reaching +/-500ppm. My time source setup:

    GPS --> XL-GPS::IRIGB --> SBC0::IRIGB --> SBC0::NTP

    The XL-GPS is synchronized with GPS time and outputs an IRIG-B signal.
    The processor board, SBC0, is a single board computer housing a
    Symmetricom BC635PMC IRIG-B receiver. Three different SBC0s and three
    different BC635 PMCs were tested and all produced the same results. The
    BC635 IRIG-B receiver is the only time source for NTP (see "BC635" conf
    file below) using NTP's Bancomm reference clock support. This is the
    "target system".

    The drift using this configuration is typically near +/- 500ppm. I say
    "+/-" because from one run of NTP to the next it may completely swing,
    for example, from +486ppm to -490ppm on the same processor board. Most
    of the time this wild swing only happens following a reboot, but I've
    observed it on at least two occasions when ntpd was simply stopped and
    then restarted (with no conf file changes and no reboots between).

    To make matters more interesting, the drift consistently settles at
    ~100ppm when using only a local NTP server that is synchronized with
    other public stratum 2 NTP servers (such as ntp.idealab.com, zagbot.com,
    etc). In other words, when syncing with public NTP Internet servers, the
    drift does not swing from positive to negative and it always settles at
    a reasonable value (<100 ppm).

    I've done several tests, including one that uses the 1Hz timestamp
    printout feature of the XL-GPS. The timestamp is synchronized with the
    system's 1PPS, so it comes out almost exactly once every second. I wrote
    a script that waits for the 1Hz timestamp. When the timestamp print
    occurs, the script runs a C program that grabs IRIG-B time from the
    BC635 PMC, grabs system time using clock_gettime(), and prints these two
    timestamps. I then combine all three timestamps into a log file (one
    line for each 1Hz sample). This test seems to prove the stability of
    the XL-GPS, BC635, and SBC0's system clock (which is not being
    disciplined by NTP during the test). In particular, the test showed
    SBC0's drift is in line with the 100ppm value seen when syncing with a
    network time source. These test results were also consistent with the
    claimed accuracy of the SBC0 oscillator, 30ppm. In other words, the
    500ppm value seems to be a completely bogus fabrication of NTP.
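
    For reference, the free-running comparison described above boils down
    to a slope estimate. Below is a minimal sketch in Python, assuming each
    log line has already been reduced to a pair of values in seconds
    (reference time from the BC635, system time from clock_gettime()). The
    function name and the sample data are made up for illustration and are
    not taken from the actual test.

```python
def drift_ppm(ref_times, sys_times):
    """Least-squares slope of (system - reference) against reference time, in ppm."""
    n = len(ref_times)
    if n < 2 or n != len(sys_times):
        raise ValueError("need at least two paired samples")
    errors = [s - r for r, s in zip(ref_times, sys_times)]
    mean_t = sum(ref_times) / n
    mean_e = sum(errors) / n
    num = sum((t - mean_t) * (e - mean_e) for t, e in zip(ref_times, errors))
    den = sum((t - mean_t) ** 2 for t in ref_times)
    if den == 0:
        raise ValueError("reference timestamps must not all be equal")
    return num / den * 1e6


# Hypothetical data: a system clock running 30 ppm fast, sampled once per second
# for ten minutes, with a constant 0.25 s starting offset.
ref = [float(i) for i in range(600)]
sys_clock = [t * (1 + 30e-6) + 0.25 for t in ref]
print(round(drift_ppm(ref, sys_clock), 3))
```

    A constant offset between the two clocks does not affect the result;
    only the rate at which the difference grows does.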

    Another piece of evidence is that the IRIG-B PMC was used on two
    different single board computers (one was a Concurrent PP110, the other
    a Concurrent VP315) where the drift was stable and settled at reasonable
    values on both of these boards. In this case, the BC635 IRIG-B PMC did
    not have a time reference, instead the time was set manually on the
    BC635 and the BC635 operated in flywheel mode (i.e. the IRIG-B time
    drifted with the clock on the BC635). This was the "development
    system". Several weeks of testing on this system always produced stable
    results. Drift values always stabilized at the same reasonable value,
    for example, ~20ppm for one of these "other" SBCs. It was only after
    several weeks of running on these boards that we then moved to the
    "target system", SBC0, and then began experiencing the problem with drift.


    The "target system" summary:

    - SBC0 (2 Intel CPUs)
    - GPS --> XL-GPS::IRIGB --> SBC0::IRIGB --> SBC0::NTP
    - Concurrent RedHawk 4.2 (Hanoi)
    - Linux sbc9 2.6.18.8-RedHawk-4.2-trace #1 SMP PREEMPT Tue May 29
    12:44:24 EDT 2007 i686 i686 i386 GNU/Linux


    The "development system" summary:

    - SBC1: Concurrent PP110, Pentium III-M (1 CPU)
    - PP110::IRIGB --> PP110::NTP
    - SBC2: Concurrent VP315, Pentium M (1 CPU)
    - VP315::IRIGB --> VP315::NTP
    - Enterprise Linux, Version 4 (original release), kernel version:
    - Linux ntp1 2.6.9-5.EL #1 Wed Jan 5 19:22:18 EST 2005 i686 i686 i386
    GNU/Linux


    Common items between "target system" and "development system":

    - ntpd - NTP daemon program - Ver. 4.2.4p0
    - BC635PMC hardware (i.e. exact same pieces of hardware)
    - BC635PMC v6.5.0 driver from Symmetricom


    Some other notes and thoughts on this problem:

    - I have searched the web and NTP mailing list and have found various
    instances of problems with large drift values, but none fit my situation
    exactly or the instances were resolved by some means not applicable here.
    - There is "no" activity on SBC0 when this problem occurs. By "no" I
    mean no additional applications except whatever may be running as a cron
    job (which isn't much). By "no" I also mean that there is no additional
    hardware causing a heavy interrupt load on the system.
    - The drift has _always_ gone near or equal to +/-500ppm -- i.e. it has
    never stabilized at a reasonable value when running with the BC635 IRIGB
    time source.
    - I've tested the "target system" with and without the XL-GPS time
    source, in which case the BC635 IRIG-B PMC runs in "flywheel" mode. In
    "flywheel" mode, the drift problem is the same.
    - The linux kernel has only a CompactFlash for a local disk and, as
    such, the kernel is configured without swap space.
    - The "target system" has various requirements that necessitate running
    in the fashion in which we are running. For example, there is no
    connection to the Internet, nor can there be reliance on "other" network
    time sources. The system must be completely self sufficient with one
    local IRIG-B synced board serving as a local stratum 1 NTP server for
    several other local SBC0 boards.
    - I am not sure what other run-time NTP information is useful so I
    didn't include any. Just let me know what you would like to see. It is
    not easily possible to run tests with the BC635 and the VP315 or PP110
    since those pieces of hardware are no longer co-located.
    - I have tested ntp-4.2.4.p4 and ntp-dev-4.2.5p113 distributions and the
    drift problem is the same (although, I only ran these versions once,
    long enough to see the drift go above 400ppm).
    - The drift file was deleted prior to almost every run of NTP. I say
    "almost" because for some tests I wanted to see what NTP would do
    when starting with a large drift value.
    - There have been in the neighborhood of 50 different test runs
    (probably more, but I'm not counting).
    - One other test we plan to run is installing RedHat Enterprise Linux
    Version 4, Update 4 on SBC0. This is a software environment more
    similar to ones on which the BC635 driver was developed.

    Andy


    /*******************************************************/
    NTP conf file for BC635 IRIG-B PMC
    /*******************************************************/

    # Base conf file for all normal operation and initial sync for both server
    # and client

    # Debug stuff
    statistics clockstats peerstats loopstats
    statsdir /var/lib/ntp/log/
    filegen clockstats file stats.clock type pid link enable
    filegen peerstats file stats.peer type pid link enable
    filegen loopstats file stats.loop type pid link enable

    restrict default nomodify notrap noquery
    restrict 127.0.0.1

    tinker panic 0 # don't let daemon exit for any time difference

    driftfile /var/lib/ntp/drift

    # Base conf file for normal operation for both server and client
    tinker step 0 # disable stepping, so that we only slew time

    # Conf file for normal operation of a server

    server 127.127.16.0 prefer mode 2 minpoll 4 iburst burst # Symmetricom BC635
    tos orphan 6




    /*******************************************************/
    NTP conf file for network NTP server
    /*******************************************************/
    # Base conf file for all normal operation and initial sync for both server
    # and client

    # Debug stuff
    statistics clockstats peerstats loopstats
    statsdir /var/lib/ntp/log/
    filegen clockstats file stats.clock type pid link enable
    filegen peerstats file stats.peer type pid link enable
    filegen loopstats file stats.loop type pid link enable

    restrict default nomodify notrap noquery
    restrict 127.0.0.1

    tinker panic 0 # don't let daemon exit for any time difference

    driftfile /var/lib/ntp/drift

    # Base conf file for normal operation for both server and client
    tinker step 0 # disable stepping, so that we only slew time

    # Conf file for initial sync of a client

    server 192.168.2.90 prefer iburst burst minpoll 5 maxpoll 9

  2. Re: drift value very large and very unstable

    Andy,

    All I have at the moment is to make sure you have seen the known hardware
    and OS issues pages at support.ntp.org/Support/TroubleshootingNTP.

    It looks like there was some more information on APIC and ACPI, but those
    links are currently broken.
    --
    Harlan Stenn
    http://ntpforum.isc.org - be a member!

  3. Re: drift value very large and very unstable



    On Mon, 3 Mar 2008, Andy Helten wrote:
    -snippage-
    > I am having a problem with drift values approaching and, on occasion,
    > reaching +/-500ppm.
    >

    -snippage-
    > NTP conf file for BC635 IRIG-B PMC
    > /************************************************** *****/
    >
    > tinker panic 0 # don't let daemon exit for any time difference

    -snippage--
    >
    > # Base conf file for normal operation for both server and client
    > tinker step 0 # disable stepping, so that we only slew time
    >
    > # Conf file for normal operation of a server
    >
    > server 127.127.16.0 prefer mode 2 minpoll 4 iburst burst # Symmetricom
    > BC635
    > tos orphan 6

    -- snippage --

    Lose the 'iburst burst' on 16.

    With the two tinker commands above you give ntpd the requirement
    to amortize the offset entirely with frequency control.

    Are you giving it long enough to do so?

    If possible, toss those tinker options and try again.
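
    To put a rough number on "long enough": with stepping disabled, any
    offset has to be slewed away, and ntpd slews at no more than 500 ppm.
    The sketch below (Python, with hypothetical names and example offsets)
    only computes the lower bound that this limit implies. It assumes a
    constant slew at the maximum rate, which the real clock discipline
    does not necessarily sustain.

```python
MAX_SLEW_PPM = 500.0  # assumed slew limit; the same bound the drift values keep hitting


def min_slew_seconds(offset_seconds, slew_ppm=MAX_SLEW_PPM):
    """Shortest time needed to slew away an offset at a constant slew rate."""
    if slew_ppm <= 0:
        raise ValueError("slew_ppm must be positive")
    return abs(offset_seconds) / (slew_ppm * 1e-6)


# Example offsets in seconds, chosen only for illustration.
for offset in (0.128, 1.0, 3600.0):
    seconds = min_slew_seconds(offset)
    print(f"{offset:>8.3f} s offset -> at least {seconds:,.0f} s ({seconds / 86400:.2f} days)")
```

    One second of offset needs at least 2000 seconds of slewing, so an
    initial error of minutes or more takes days to remove this way.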

    ntpq -p, ntpq -c as -c "rv &x" (where x is the association index
    for the refclock 16) and ntpq -crv would be useful.

    Rob

  4. Re: drift value very large and very unstable

    >>> In article , Harlan Stenn writes:

    Harlan> It looks like there was some more information on APIC and ACPI, but
    Harlan> those links area currently broken. -- Harlan Stenn
    Harlan> http://ntpforum.isc.org - be a member!

    Those links point to the page on "Configuring Trimble...Refclocks".

    --
    Harlan Stenn
    http://ntpforum.isc.org - be a member!

  5. Re: drift value very large and very unstable

    Rob Neal wrote:
    > On Mon, 3 Mar 2008, Andy Helten wrote:
    > -snippage-
    >
    >> I am having a problem with drift values approaching and, on occasion,
    >> reaching +/-500ppm.
    >>
    >>

    > -snippage-
    >
    >> NTP conf file for BC635 IRIG-B PMC
    >> /************************************************** *****/
    >>
    >> tinker panic 0 # don't let daemon exit for any time difference
    >>

    > -snippage--
    >
    >> # Base conf file for normal operation for both server and client
    >> tinker step 0 # disable stepping, so that we only slew time
    >>
    >> # Conf file for normal operation of a server
    >>
    >> server 127.127.16.0 prefer mode 2 minpoll 4 iburst burst # Symmetricom
    >> BC635
    >> tos orphan 6
    >>

    > -- snippage --
    >
    > Lose the 'iburst burst' on 16.
    >
    > With the two tinker commands above you give ntpd the requirement
    > to amortize the offset entirely with frequency control.
    >
    > Are you giving it long enough to do so?
    >
    > If possible, toss those tinker options and try again.
    >
    > ntpq -p, ntpq -c as -c "rv &x" (where x is the association index
    > for the refclock 16) and ntpq -crv would be useful.
    >
    > Rob
    >
    >

    Rob,

    In this case, the purpose of 'iburst burst' is to decrease startup time
    so that ntp will begin servicing sync requests within a reasonable amount
    of time. I'm not sure that both are necessary, but definitely one of
    them (along with minpoll 4) decreases startup time from several minutes
    to about 20 seconds. I seem to recall reading somewhere in the NTP docs
    that burst and iburst have no effect on reference clocks -- it simply
    isn't true for the BC635 (refclock_bancomm.c). Removing them is still
    worth a try and I will run like that overnight. In fact, I started
    running ntpd with the ntp.conf below (after making the suggested
    ntp.conf changes) and the ntpq output below is after only about 25
    minutes of ntp operation. This is running the Redhawk 2.6.18 linux
    kernel on the same exact hardware as was used last night on the Redhat
    2.6.9-42 kernel (the relevance of this kernel is mentioned below).

    I think I have been giving it enough time to stabilize -- any test I
    consider legitimate was allowed to run for at least 8 hours. Most tests
    ran overnight for 18-24 hours and some tests ran over weekends for
    nearly 72 hours. Results were always the same (very large drift). In
    fact, if allowed to run long enough, the drift almost always reached the
    +/-500 max.

    The tinker commands are also necessary (at least disabling the step) due
    to some commercial software that has serious problems with backward time
    steps. This problem should be fixed in a future version, but that may
    not be soon enough for us. Even then, we may not want time to step
    backwards.

    I should also provide an update for a test that ran last night in which
    the base RedHat EL4 Update 4 distribution (2.6.9-42 kernel) was used
    with ntp 4.2.4p0 and the exact same single board computer and exact same
    BC635 hardware. This test stabilized at a drift of -35ppm with a very
    small offset (0.021 milliseconds). This test ran overnight and by late
    morning the drift was changing only by a few hundredths at a time. In
    other words, everything was working as expected. So, whatever the
    problem, it almost definitely is software related (and most likely is a
    problem with the kernel?).

    Regarding the kernel's HZ value and its relation to time loss/gain, is
    there a way to determine the actual value at runtime? I want the value
    of HZ that is actually in use in the running kernel. I wasn't able to
    find a way to do this. By the HZ macro in /usr/include, I get a value
    of 100 and by the "/boot/config-*" file I see a value of 250. This is
    why I would like a sysctl type value or /proc entry with the actual HZ
    value, not a macro or config file. Any ideas?
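
    One rough way to check this from userspace is sketched below in
    Python. It rests on two assumptions that may not hold everywhere: that
    the kernel is not tickless, and that the IRQ 0 line in
    /proc/interrupts is the periodic timer tick. The function names are
    made up for this example. Note that sysconf(_SC_CLK_TCK), like the HZ
    macro in the userspace headers, reports USER_HZ (the value presented
    to userspace) rather than the kernel's internal tick rate, which is one
    explanation for seeing 100 there and 250 in the kernel config.

```python
import os
import time


def user_hz():
    """USER_HZ as reported to userspace; not necessarily the kernel's internal HZ."""
    return os.sysconf("SC_CLK_TCK")


def irq0_count(interrupts_text):
    """Sum the per-CPU counters on the IRQ 0 line of /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "0:":
            total = 0
            for field in fields[1:]:
                if not field.isdigit():
                    break  # first non-numeric field starts the controller/device label
                total += int(field)
            return total
    raise ValueError("no IRQ 0 line found")


def estimate_hz(interval=10.0, path="/proc/interrupts"):
    """Estimate the tick rate by counting IRQ 0 interrupts over `interval` seconds."""
    with open(path) as f:
        before = irq0_count(f.read())
    start = time.monotonic()
    time.sleep(interval)
    with open(path) as f:
        after = irq0_count(f.read())
    return (after - before) / (time.monotonic() - start)
```

    Calling estimate_hz() on an idle system should return a value close to
    the configured tick rate (for example, near 250) if the assumptions
    above hold.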

    Thanks,
    Andy

    /**************************************/
    new ntp.conf
    /**************************************/
    # Debug stuff
    statistics clockstats peerstats loopstats
    statsdir /var/lib/ntp/log/
    filegen clockstats file stats.clock type pid link enable
    filegen peerstats file stats.peer type pid link enable
    filegen loopstats file stats.loop type pid link enable

    restrict default nomodify notrap noquery
    restrict 127.0.0.1

    driftfile /var/lib/ntp/drift

    server 127.127.16.0 prefer mode 2 minpoll 4 # Symmetricom BC635
    tos orphan 6



    /**************************************/
    ntpq output
    /**************************************/

    sbc1 root 31->ntpq
    ntpq> pe
         remote           refid      st t when poll reach   delay   offset  jitter
    ==============================================================================
    *GPS_BANC(0)     .BTFP.           0 l    4   16  377    0.000    9.121   3.489
    ntpq> as

    ind assID status  conf reach auth condition  last_event cnt
    ===========================================================
      1 13451  9614   yes   yes  none  sys.peer   reachable  1
    ntpq> rv &1
    assID=13451 status=9614 reach, conf, sel_sys.peer, 1 event, event_reach,
    srcadr=GPS_BANC(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=00,
    stratum=0, precision=-21, rootdelay=0.000, rootdispersion=0.000,
    refid=BTFP, reach=377, unreach=0, hmode=3, pmode=4, hpoll=4, ppoll=10,
    flash=00 ok, keyid=0, ttl=64, offset=9.121, delay=0.000,
    dispersion=0.236, jitter=3.489,
    reftime=c0311460.c183a17a Wed, Mar 6 2002 17:19:12.755,
    org=c0311460.c183a17a Wed, Mar 6 2002 17:19:12.755,
    rec=c0311460.c18428f8 Wed, Mar 6 2002 17:19:12.755,
    xmt=c0311460.c1831775 Wed, Mar 6 2002 17:19:12.755,
    filtdelay= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00,
    filtoffset= 9.12 9.76 10.44 11.20 12.02 12.93 13.86 14.90,
    filtdisp= 0.00 0.24 0.48 0.74 0.99 1.26 1.52 1.79
    ntpq> cv
    assID=0 status=0000 clk_okay, last_clk_okay,
    type=16, timecode="065 22:19:27.764471000 0", poll=110, noreply=0,
    badformat=0, baddata=0, fudgetime1=0.000, stratum=0, refid=BTFP,
    flags=0
    ntpq>

  6. Re: drift value very large and very unstable

    iburst is good. burst, very likely not.
    --
    Harlan Stenn
    http://ntpforum.isc.org - be a member!

  7. Re: drift value very large and very unstable



    On Wed, 5 Mar 2008, Andy Helten wrote:

    >
    > I think I have been giving it enough time to stabilize -- any test I
    > consider legitimate was allowed to run for at least 8 hours. Most tests
    > ran overnight for 18-24 hours and some tests ran over weekends for
    > nearly 72 hours. Results were always the same (very large drift). In
    > fact, if allowed to run long enough, the drift almost always reached the
    > +/-500 max.

    The drift only tells part of the story. What is the offset
    doing? Does it cross zero and continue to diverge, or is
    it still headed to zero?
    >
    > The tinker commands are also necessary (at least disabling the step) due
    > to some commercial software that has serious problems with backward time
    > steps. This problem should be fixed in a future version, but that may
    > not be soon enough for us. Even then, we may not want time to step
    > backwards.

    There is a reason they are options. Try setting your clock
    by hand an hour or so off, and starting ntpd. Watch the
    time it reports while it plays catch-up. Scary.
    Your call, but it would probably fail an audit.

    >
    > Regarding the kernel's HZ value and its relation to time loss/gain, is
    > there a way to determine the actual value at runtime? I want the value
    > of HZ that is actually in use in the running kernel. I wasn't able to
    > find a way to do this. By the HZ macro in /usr/include, I get a value
    > of 100 and by the "/boot/config-*" file I see a value of 250. This is
    > why I would like a sysctl type value or /proc entry with the actual HZ
    > value, not a macro or config file. Any ideas?

    Kernel sysctl of some sort. Consult the Linux kernelmongers.
    >
    > Thanks,
    > Andy
    >
    > sbc1 root 31->ntpq
    > ntpq> pe
    >      remote           refid      st t when poll reach   delay   offset  jitter
    > ==============================================================================
    > *GPS_BANC(0)     .BTFP.           0 l    4   16  377    0.000    9.121   3.489

    Jitter is ugly for an attached refclock. You have something
    bad happening, this should be much lower.
    > ntpq> as
    >
    > ind assID status  conf reach auth condition  last_event cnt
    > ===========================================================
    >   1 13451  9614   yes   yes  none  sys.peer   reachable  1
    > ntpq> rv &1
    > assID=13451 status=9614 reach, conf, sel_sys.peer, 1 event, event_reach,
    > srcadr=GPS_BANC(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=00,
    > stratum=0, precision=-21, rootdelay=0.000, rootdispersion=0.000,
    > refid=BTFP, reach=377, unreach=0, hmode=3, pmode=4, hpoll=4, ppoll=10,
    > flash=00 ok, keyid=0, ttl=64, offset=9.121, delay=0.000,
    > dispersion=0.236, jitter=3.489,
    > reftime=c0311460.c183a17a Wed, Mar 6 2002 17:19:12.755,
    > org=c0311460.c183a17a Wed, Mar 6 2002 17:19:12.755,
    > rec=c0311460.c18428f8 Wed, Mar 6 2002 17:19:12.755,
    > xmt=c0311460.c1831775 Wed, Mar 6 2002 17:19:12.755,
    > filtdelay= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00,
    > filtoffset= 9.12 9.76 10.44 11.20 12.02 12.93 13.86 14.90,
    > filtdisp= 0.00 0.24 0.48 0.74 0.99 1.26 1.52 1.79

    Looks like the offset is still trending to zero, with a
    long way to go.
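
    The trend can be read straight off the filtoffset line. A small
    Python sketch follows, assuming the samples are listed newest first
    (consistent with filtdisp growing from left to right) and spaced one
    16-second poll apart. The helper names are invented for this example.

```python
def parse_filter_line(line):
    """Return the numeric samples from an ntpq 'filtoffset= ...' style line."""
    _, _, values = line.partition("=")
    return [float(v) for v in values.replace(",", " ").split()]


def offset_trend_ppm(filtoffset_ms, poll_seconds):
    """Average rate of change of the offset across the filter window, in ppm.

    Samples are assumed newest-first and evenly spaced one poll interval apart.
    A negative result means the offset is shrinking.
    """
    if len(filtoffset_ms) < 2:
        raise ValueError("need at least two samples")
    span_seconds = (len(filtoffset_ms) - 1) * poll_seconds
    delta_ms = filtoffset_ms[0] - filtoffset_ms[-1]  # newest minus oldest
    return delta_ms / span_seconds * 1000.0  # 1 ms/s equals 1000 ppm


samples = parse_filter_line("filtoffset= 9.12 9.76 10.44 11.20 12.02 12.93 13.86 14.90,")
print(round(offset_trend_ppm(samples, poll_seconds=16), 1))
```

    For the eight samples quoted above this works out to roughly -52 ppm,
    i.e. the offset was shrinking by about 52 microseconds per second over
    that window.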

    Rob

  8. Re: drift value very large and very unstable

    Harlan Stenn wrote:
    > iburst is good. burst, very likely not.
    >


    Actually, I just ran a test and, at least for refclock_bancomm, burst is
    the only one that matters. However, even with 'burst', it would still
    take 64 seconds for ntp to declare a peer if minpoll is not also
    decreased. I attributed this to the fact that 'burst' only applies to
    normal ops, so at least one "normal" polling period is required.
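
    For anyone following the numbers: minpoll and maxpoll are exponents of
    two, in seconds. A tiny Python illustration (the helper name is
    hypothetical; the defaults shown are ntpd's usual minpoll 6 and
    maxpoll 10):

```python
def poll_seconds(exponent):
    """Convert an ntp.conf minpoll/maxpoll exponent to seconds (interval = 2**exponent)."""
    return 2 ** exponent


for name, exponent in (("minpoll 4", 4), ("default minpoll", 6),
                       ("maxpoll 9", 9), ("default maxpoll", 10)):
    print(f"{name:>15}: {poll_seconds(exponent)} s")
```

    This is where the 64 seconds comes from when minpoll is left at its
    default, and the 16 seconds with minpoll 4.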

    Andy

  9. Re: drift value very large and very unstable

    > From: Harlan Stenn
    > Date: Thu, 06 Mar 2008 06:33:17 +0000
    > Sender: questions-bounces+oberman=es.net@lists.ntp.org
    >
    >
    > iburst is good. burst, very likely not.


    In general, I agree, but the context is for a connected reference clock.

    I would set maxpoll to 4 as there is no reason NOT to update from the
    reference clock on a frequent (16 second) basis, but I'm less certain if
    iburst is really appropriate. (Nor am I sure that it's inappropriate.)
    --
    R. Kevin Oberman, Network Engineer
    Energy Sciences Network (ESnet)
    Ernest O. Lawrence Berkeley National Laboratory (Berkeley Lab)
    E-mail: oberman@es.net Phone: +1 510 486-8634
    Key fingerprint:059B 2DDF 031C 9BA3 14A4 EADA 927D EBB3 987B 3751

  10. Re: drift value very large and very unstable

    The good news is that "new ntp.conf" appears to work! This is the first
    configuration that has produced reasonable results, granted it could
    still be a fluke since the drift was rather unpredictable (but _always_
    settled near +/-500ppm). The bad news is that we _require_ some of the
    commands removed from ntp.conf (at least burst and step). After letting
    ntp run with the "new ntp.conf" for at least 16 hours, the drift had
    stabilized around 33ppm:

    sbc1 root 1->ntpq -crv
    assID=0 status=0444 leap_none, sync_uhf_clock, 4 events,
    event_peer/strat_chg,
    version="ntpd 4.2.4p0@1.1472 Tue Jan 8 16:23:44 UTC 2008 (1)",
    processor="i686", system="Linux/2.6.18.8-RedHawk-4.2-trace", leap=00,
    stratum=1, precision=-20, rootdelay=0.000, rootdispersion=0.272,
    peer=13451, refid=BTFP,
    reftime=c0320bd4.c1843a15 Thu, Mar 7 2002 10:55:00.755, poll=4,
    clock=c0320bd5.6dfc379d Thu, Mar 7 2002 10:55:01.429, state=4,
    offset=0.029, frequency=33.562, jitter=0.002, noise=0.002,
    stability=0.001
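
    Since these runs get compared against each other a lot, here is a
    sketch of how the loop values could be pulled out of "ntpq -crv"
    output and checked against the +/-500ppm ceiling. It is written in
    Python; the function names and the idea of a "headroom" figure are
    illustrative only and are not part of ntpq.

```python
import re

FREQ_LIMIT_PPM = 500.0  # the +/-500 ppm ceiling the earlier runs kept hitting


def loop_stats(rv_text):
    """Pull offset (ms), frequency (ppm) and jitter (ms) out of 'ntpq -c rv' output."""
    stats = {}
    for key in ("offset", "frequency", "jitter"):
        # The lookbehind keeps 'offset' from matching inside e.g. 'filtoffset'.
        match = re.search(rf"(?<![A-Za-z_]){key}=(-?\d+(?:\.\d+)?)", rv_text)
        if match:
            stats[key] = float(match.group(1))
    return stats


def frequency_headroom(stats, limit=FREQ_LIMIT_PPM):
    """How far the reported frequency is from the limit, in ppm."""
    return limit - abs(stats["frequency"])


rv = "offset=0.029, frequency=33.562, jitter=0.002, noise=0.002, stability=0.001"
stats = loop_stats(rv)
print(stats, round(frequency_headroom(stats), 3))
```

    A headroom near zero would correspond to the failing runs, where the
    frequency was pinned at or near the limit.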


    This test ran with the previously problematic Redhawk kernel and all of
    the same hardware. To further isolate the problem, I've added the
    'burst' command back into ntp.conf, removed the drift file, and
    restarted ntp.

    Andy



  11. Re: drift value very large and very unstable

    Kevin,

    1. As per the advice in the documentation, do not use iburst with
    reference clock drivers. The only reason you might want to do this is to
    reduce the time to set the clock on initial start. This is unnecessary
    as the clock is now set on the first reply received. Using iburst anyway
    will screw up the radio protocol in some drivers.

    2. The temptation to reduce the poll interval below the default can be
    counterproductive. The driver interface uses a median filter to clean up
    nominal jitter due to serial port and interrupt latencies. Generally,
    the jitter is reduced as the number of stages and the poll interval are
    increased. There are cases, in particular with kernel PPS signals, where
    a smaller poll interval can result in marginally better performance, but
    in other cases it is generally not a good idea.

    3. The burst mode is designed for use when the poll interval of
    necessity must be very long, like at least 1024 s. The current design
    will rate-limit if burst is used with a poll interval of 512 s or less.
    This is to protect busy servers with a minimum average default headway
    of 16 s. Some operators might set the headway higher.
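
    To illustrate the median filter idea in point 2, here is a simplified
    sketch in Python. It is not the code in the driver interface: the
    function name, the 60% keep fraction and the sample values are
    assumptions chosen only to show how discarding the samples furthest
    from the median suppresses latency spikes.

```python
def trimmed_median_filter(samples, keep_fraction=0.6):
    """Discard the samples furthest from the median, then average what is left."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    keep = max(1, round(len(ordered) * keep_fraction))
    lo, hi = 0, len(ordered)
    while hi - lo > keep:
        median = ordered[(lo + hi) // 2]
        # Drop whichever end lies further from the current median.
        if median - ordered[lo] > ordered[hi - 1] - median:
            lo += 1
        else:
            hi -= 1
    kept = ordered[lo:hi]
    return sum(kept) / len(kept)


# Hypothetical offsets in ms: mostly about 9 ms, with two latency spikes.
offsets = [9.1, 9.0, 9.2, 14.9, 9.1, 8.9, 9.0, 3.2, 9.1, 9.0]
print(round(trimmed_median_filter(offsets), 3))
```

    With more samples per poll interval, a larger share of outliers can be
    thrown away before averaging, which is the sense in which more stages
    reduce the jitter.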

    Dave

    Kevin Oberman wrote:

    >>From: Harlan Stenn
    >>Date: Thu, 06 Mar 2008 06:33:17 +0000
    >>Sender: questions-bounces+oberman=es.net@lists.ntp.org
    >>
    >>
    >>iburst is good. burst, very likely not.

    >
    >
    > In general, I agree, but the context is for a connected reference clock.
    >
    > I would set maxpoll to 4 as there is no reason NOT to update from the
    > reference clock on a frequent (16 second) basis, but I'm less certain if
    > iburst is really appropriate. (Nor am I sure that it's inappropriate.)


  12. Re: drift value very large and very unstable

    >>> In article <20080306162227.5985F45047@ptavv.es.net>, oberman@es.net (Kevin Oberman) writes:

    >> From: Harlan Stenn Date: Thu, 06 Mar 2008 06:33:17 +0000
    >> Sender: questions-bounces+oberman=es.net@lists.ntp.org
    >>
    >> iburst is good. burst, very likely not.


    Kevin> In general, I agree, but the context is for a connected reference
    Kevin> clock.

    I missed that - sorry, and thanks Kevin!
    --
    Harlan Stenn
    http://ntpforum.isc.org - be a member!

  13. Re: drift value very large and very unstable


    Andy wrote:
    > The good news is that "new ntp.conf" appears to work! This is the first
    > configuration that has produced reasonable results, granted it could
    > still be a fluke since the drift was rather unpredictable (but _always_
    > settled near +/-500ppm). The bad news is that we _require_ some of the
    > commands removed from ntp.conf (at least burst and step). After letting
    > ntp run with the "new ntp.conf" for at least 16 hours, the drift had
    > stabilized around 33ppm:
    >
    > sbc1 root 1->ntpq -crv
    > assID=0 status=0444 leap_none, sync_uhf_clock, 4 events,
    > event_peer/strat_chg,
    > version="ntpd 4.2.4p0@1.1472 Tue Jan 8 16:23:44 UTC 2008 (1)",
    > processor="i686", system="Linux/2.6.18.8-RedHawk-4.2-trace", leap=00,
    > stratum=1, precision=-20, rootdelay=0.000, rootdispersion=0.272,
    > peer=13451, refid=BTFP,
    > reftime=c0320bd4.c1843a15 Thu, Mar 7 2002 10:55:00.755, poll=4,
    > clock=c0320bd5.6dfc379d Thu, Mar 7 2002 10:55:01.429, state=4,
    > offset=0.029, frequency=33.562, jitter=0.002, noise=0.002,
    > stability=0.001
    >
    >
    > This test ran with the previously problematic Redhawk kernel and all of
    > the same hardware. To further isolate the problem, I've added the
    > 'burst' command back into ntp.conf, removed the drift file, and
    > restarted ntp.
    >
    > Andy
    >


    This may seem like a long email, but it mostly consists of two cut &
    paste jobs interspersed with brilliant analysis. The two cut & paste
    chunks are from two different runs of ntp, one that shows ntp working
    correctly and one that shows it "failing". The _only_ difference
    between the two runs is that time stepping was left to default behavior
    in the working run and time stepping was disabled in the "failing" run.
    So, please don't be discouraged by the length of the email, you may find
    it an intriguing read...


    As I mentioned in a previous email, I was going to run ntp while adding
    back in some of the features I removed. After doing this, at least
    superficially, I've isolated the problem to time step being disabled.
    It doesn't matter whether I specify 'tinker step 0' in ntp.conf or use
    the '-x' argument on the command line. Both result in drift that
    approaches 500ppm. First the working configuration. Below is ntpq
    output and the ntp.conf file after running just over 12 hours with step
    _enabled_ in which everything works correctly. Below the ntpq/ntp.conf
    information is the beginning and ending of the stats.loop file for the
    same run.

    Keep in mind, NTP runs perfectly on this same set of hardware when
    running the 2.6.9-42 linux kernel with time stepping *disabled*. If the
    true drift of this system is around 33ppm (two different runs with
    stepping disabled have settled near 33ppm), then would the clock offset
    ever get larger than 128ms and require a step? In fact, the answer
    seems to be "no, a time step is never even required". As proof,
    stats.loop shows the largest offset to be 0.023634 seconds and that is
    the third entry in the file. The offset only goes down after that,
    eventually achieving microsecond accuracy. Yes, this is with time
    stepping *enabled*, but still the point is that no step was even needed
    to keep time accurately and to establish a reasonable (and apparently
    accurate) drift value.
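
The 128 ms question above can be checked mechanically against a loopstats file. The following is only a minimal sketch: it assumes the usual loopstats column order (MJD, seconds past midnight, offset in seconds, frequency in ppm, jitter, wander, time constant) and ntpd's default step threshold of 128 ms. The function names are hypothetical, and the sample lines are a few entries copied from the stats.loop listing in this post.

```python
# Minimal sketch: scan loopstats lines and report the largest absolute offset.
# Assumed column order: MJD, seconds past midnight, offset (s), frequency (ppm),
# jitter (s), wander (ppm), loop time constant.

STEP_THRESHOLD_S = 0.128  # ntpd's default step threshold (128 ms)


def parse_loopstats(lines):
    """Yield (mjd, seconds, offset_s, freq_ppm) for each well-formed line."""
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        yield int(fields[0]), float(fields[1]), float(fields[2]), float(fields[3])


def max_abs_offset(lines):
    """Return the largest |offset| in seconds, or None if there are no samples."""
    offsets = [abs(off) for _, _, off, _ in parse_loopstats(lines)]
    return max(offsets) if offsets else None


# A few lines copied from the working run's stats.loop (stepping enabled).
SAMPLE = """\
54532 8012.087 0.014105000 0.000 0.004986901 0.000000 6
54532 8922.088 0.023610000 10.445 0.007191059 3.692976 6
54532 8937.087 0.023634000 10.465 0.006726625 3.454469 6
54532 8952.088 0.023633000 10.484 0.006292182 3.231368 6
54532 50126.087 0.000022000 33.555 0.000001775 0.000642 4
""".splitlines()

worst = max_abs_offset(SAMPLE)
print(f"max |offset| = {worst:.6f} s, step needed: {worst > STEP_THRESHOLD_S}")
```

Run against a complete file (for example by passing `open(path)` instead of `SAMPLE`), the same function shows whether any sample ever came near the step threshold.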


    /*****************************************/
    ntpq for working configuration, stepping enabled
    /*****************************************/
    sbc1 root 3->ntpq
    ntpq> pe
         remote           refid      st t when poll reach   delay   offset  jitter
    ==============================================================================
    *GPS_BANC(0)     .BTFP.           0 l    8   16  377    0.000   -0.006   0.001
    ntpq> as

    ind assID status  conf reach auth condition  last_event cnt
    ===========================================================
      1 19400  9614   yes   yes  none  sys.peer   reachable  1
    ntpq> rv &1
    assID=19400 status=9614 reach, conf, sel_sys.peer, 1 event, event_reach,
    srcadr=GPS_BANC(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=00,
    stratum=0, precision=-21, rootdelay=0.000, rootdispersion=0.000,
    refid=BTFP, reach=377, unreach=0, hmode=3, pmode=4, hpoll=4, ppoll=10,
    flash=00 ok, keyid=0, ttl=64, offset=-0.006, delay=0.000,
    dispersion=0.105, jitter=0.001,
    reftime=cb7bc62f.16499b2e Fri, Mar 7 2008 8:48:31.087,
    org=cb7bc62f.16499b2e Fri, Mar 7 2008 8:48:31.087,
    rec=cb7bc62f.164a14e6 Fri, Mar 7 2008 8:48:31.087,
    xmt=cb7bc62f.16490801 Fri, Mar 7 2008 8:48:31.087,
    filtdelay= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00,
    filtoffset= -0.01 -0.01 -0.01 -0.01 -0.01 -0.01 -0.01 -0.01,
    filtdisp= 0.00 0.20 0.21 0.23 0.24 0.26 0.27 0.47
    ntpq> rv
    assID=0 status=0444 leap_none, sync_uhf_clock, 4 events,
    event_peer/strat_chg,
    version="ntpd 4.2.4p0@1.1472 Tue Jan 8 16:23:44 UTC 2008 (1)",
    processor="i686", system="Linux/2.6.18.8-RedHawk-4.2-trace", leap=00,
    stratum=1, precision=-20, rootdelay=0.000, rootdispersion=0.037,
    peer=19400, refid=BTFP,
    reftime=cb7bc634.1649ad18 Fri, Mar 7 2008 8:48:36.087, poll=4,
    clock=cb7bc635.182e76f5 Fri, Mar 7 2008 8:48:37.094, state=4,
    offset=-0.003, frequency=33.551, jitter=0.002, noise=0.002,
    stability=0.001
    ntpq> quit
    sbc1 root 10->ps x|grep ntpd
    9554 ? Ss 0:00 ntpd -c /etc/ntp_debug.conf
    sbc1 root 4->cat /etc/ntp_debug.conf
    # Debug stuff
    statistics clockstats peerstats loopstats
    statsdir /var/lib/ntp/log/
    filegen clockstats file stats.clock type pid link enable
    filegen peerstats file stats.peer type pid link enable
    filegen loopstats file stats.loop type pid link enable

    restrict default nomodify notrap noquery
    restrict 127.0.0.1

    driftfile /var/lib/ntp/drift

    tinker panic 0

    server 127.127.16.0 prefer mode 2 minpoll 4 burst # Symmetricom BC635
    tos orphan 6

    /*****************************************/
    stats.loop for working configuration, stepping enabled
    /*****************************************/
    54532 8012.087 0.014105000 0.000 0.004986901 0.000000 6
    54532 8922.088 0.023610000 10.445 0.007191059 3.692976 6
    54532 8937.087 0.023634000 10.465 0.006726625 3.454469 6
    54532 8952.088 0.023633000 10.484 0.006292182 3.231368 6
    54532 8970.087 0.023631000 10.512 0.005885797 3.022683 6
    54532 8986.087 0.023628000 10.535 0.005505659 2.827473 6
    54532 9004.087 0.023625000 10.559 0.005150073 2.644872 6
    54532 9021.088 0.023622000 10.582 0.004817452 2.474065 6
    54532 9036.088 0.023620000 10.605 0.004506314 2.314291 5
    54532 9053.087 0.023616000 10.699 0.004215271 2.165075 5
    54532 9068.087 0.023292000 10.781 0.003944691 2.025449 5
    54532 9085.088 0.022913000 10.875 0.003692351 1.894924 5
    54532 9101.088 0.022565000 10.961 0.003456068 1.772800 4
    54532 9119.090 0.022186000 11.344 0.003235634 1.663816 4
    54532 9136.087 0.021169000 11.688 0.003047943 1.561096 4
    54532 9151.087 0.020285000 11.977 0.002868169 1.463843 4
    54532 9167.088 0.019392000 12.273 0.002701435 1.373317 4
    54532 9182.087 0.018601000 12.539 0.002542389 1.288048 4
    54532 9197.088 0.017851000 12.793 0.002392922 1.208199 4
    54532 9213.087 0.017095000 13.055 0.002254277 1.133948 4
    54532 9228.088 0.016425000 13.289 0.002121941 1.063943 4
    54532 9246.090 0.015669000 13.559 0.002002814 0.999779 4
    54532 9262.088 0.015036000 13.789 0.001886778 0.938751 4
    54532 9278.088 0.014437000 14.008 0.001777582 0.881520 4
    54532 9293.087 0.013905000 14.207 0.001673378 0.827590 4
    54532 9310.088 0.013337000 14.422 0.001578123 0.777857 4
    54532 9326.087 0.012832000 14.617 0.001486967 0.730888 4
    54532 9344.088 0.012297000 14.832 0.001403735 0.687889 4
    54532 9359.088 0.011876000 15.000 0.001321477 0.646196 4
    54532 9377.088 0.011401000 15.195 0.001247494 0.608393 4
    54532 9392.087 0.011025000 15.352 0.001174473 0.571774 4
    54532 9409.087 0.010623000 15.527 0.001107767 0.538445 4
    54532 9427.088 0.010225000 15.703 0.001045725 0.507488 4
    54532 9442.087 0.009909000 15.844 0.000984548 0.477309 4



    54532 49946.087 0.000006000 33.555 0.000001419 0.001337 4
    54532 49963.088 0.000006000 33.555 0.000001369 0.001251 4
    54532 49979.087 0.000008000 33.555 0.000001526 0.001170 4
    54532 49996.087 0.000008000 33.555 0.000001467 0.001095 4
    54532 50012.087 0.000010000 33.555 0.000001538 0.001024 4
    54532 50030.088 0.000012000 33.555 0.000001646 0.000958 4
    54532 50045.088 0.000013000 33.555 0.000001576 0.000896 4
    54532 50063.088 0.000015000 33.555 0.000001635 0.000838 4
    54532 50079.087 0.000016000 33.555 0.000001570 0.000784 4
    54532 50095.087 0.000017000 33.555 0.000001554 0.000733 4
    54532 50110.088 0.000019000 33.555 0.000001618 0.000686 4
    54532 50126.087 0.000022000 33.555 0.000001775 0.000642 4
    /*****************************************/


    Now here are some ntpq and stats.loop values for the exact same
    hardware/software configuration as above, except with stepping disabled
    via 'tinker step 0'. There was no reboot between these runs, only the
    tinker step was added back into ntp.conf and the drift file was
    deleted. This test was allowed to run for just over two hours and the
    drift value was still increasing, but experience with this setup
    indicates the drift was not going to come back down if the test were
    allowed to run longer. The stats.loop output shows the beginning of the
    file, shows where offset reaches its maximum, and then shows the end of
    the file. As you can see, the offset max is 0.095517171 seconds, so no
    time step is required nor is one taken. Yet, the drift runs out of control.


    /*****************************************/
    ntpq for failing configuration, stepping disabled
    /*****************************************/
    sbc1 root 27->ntpq
    ntpq> pe
         remote           refid      st t when poll reach   delay   offset  jitter
    ==============================================================================
    *GPS_BANC(0)     .BTFP.           0 l    3   16  377    0.000   34.313   0.323
    ntpq> as

    ind assID status  conf reach auth condition  last_event cnt
    ===========================================================
      1 52112  9614   yes   yes  none  sys.peer   reachable  1
    ntpq> rv &1
    assID=52112 status=9614 reach, conf, sel_sys.peer, 1 event, event_reach,
    srcadr=GPS_BANC(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=00,
    stratum=0, precision=-21, rootdelay=0.000, rootdispersion=0.000,
    refid=BTFP, reach=377, unreach=0, hmode=3, pmode=4, hpoll=4, ppoll=10,
    flash=00 ok, keyid=0, ttl=64, offset=34.313, delay=0.000,
    dispersion=0.017, jitter=0.323,
    reftime=cb7b0304.fe74557e Thu, Mar 6 2008 18:55:48.993,
    org=cb7b0304.fe74557e Thu, Mar 6 2008 18:55:48.993,
    rec=cb7b0304.fe74aa8a Thu, Mar 6 2008 18:55:48.993,
    xmt=cb7b0304.fe74063f Thu, Mar 6 2008 18:55:48.993,
    filtdelay= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00,
    filtoffset= 34.31 34.28 34.25 34.21 34.18 34.15 33.75 33.72,
    filtdisp= 0.00 0.02 0.03 0.05 0.06 0.08 0.26 0.27
    ntpq> rv
    assID=0 status=0444 leap_none, sync_uhf_clock, 4 events,
    event_peer/strat_chg,
    version="ntpd 4.2.4p0@1.1472 Tue Jan 8 16:23:44 UTC 2008 (1)",
    processor="i686", system="Linux/2.6.18.8-RedHawk-4.2-trace", leap=00,
    stratum=1, precision=-20, rootdelay=0.000, rootdispersion=34.818,
    peer=52112, refid=BTFP,
    reftime=cb7b0304.fe74aa8a Thu, Mar 6 2008 18:55:48.993, poll=4,
    clock=cb7b0310.9f487fd7 Thu, Mar 6 2008 18:56:00.622, state=4,
    offset=34.313, frequency=368.414, jitter=0.323, noise=0.952,
    stability=0.526
    ntpq>
    ntpq> quit
    sbc1 root 28->date
    Thu Mar 6 18:56:35 EST 2008

    sbc1 root 29->cat /etc/ntp_debug.conf
    # Debug stuff
    statistics clockstats peerstats loopstats
    statsdir /var/lib/ntp/log/
    filegen clockstats file stats.clock type pid link enable
    filegen peerstats file stats.peer type pid link enable
    filegen loopstats file stats.loop type pid link enable

    restrict default nomodify notrap noquery
    restrict 127.0.0.1

    driftfile /var/lib/ntp/drift

    tinker step 0

    server 127.127.16.0 prefer mode 2 minpoll 4 burst # Symmetricom BC635
    tos orphan 6
    sbc1 root 30->


    /*****************************************/
    stats.loop for failing configuration, stepping disabled
    /*****************************************/
    54531 78843.994 0.000851046 0.000 0.000300891 0.000000 6
    54531 79747.994 0.030705963 33.578 0.026475415 11.871458 6
    54531 79765.994 0.031304524 33.611 0.024766387 11.104738 6
    54531 79780.994 0.031796892 33.640 0.023167488 10.387536 6
    54531 79797.994 0.032358327 33.672 0.021672109 9.716657 6
    54531 79815.994 0.032955986 33.708 0.020273503 9.089109 7
    54531 79833.994 0.033550503 33.717 0.018965291 8.502084 7
    54531 79849.994 0.034074021 33.725 0.017741370 7.952972 7
    54531 79865.994 0.034603163 33.733 0.016596587 7.439324 7
    54531 79882.994 0.035167590 33.742 0.015525968 6.958851 7
    54531 79898.994 0.035694285 33.751 0.014524407 6.509410 8
    54531 79916.994 0.036289215 33.753 0.013587967 6.088996 8
    54531 79932.994 0.036820295 33.755 0.012711766 5.695734 8
    54531 79950.994 0.037417341 33.758 0.011892642 5.327871 8
    54531 79965.994 0.037910326 33.760 0.011125913 4.983767 9
    54531 79981.994 0.038438966 33.760 0.010409017 4.661888 9
    54531 79998.994 0.039003336 33.761 0.009738788 4.360796 9
    54531 80013.994 0.039500613 33.762 0.009111498 4.079152 9
    54531 80030.994 0.040060894 33.762 0.008525328 3.815697 8
    54531 80046.994 0.040590507 33.765 0.007976912 3.569258 8
    54531 80063.994 0.041151342 33.767 0.007464352 3.338735 7
    54531 80079.994 0.041677784 33.777 0.006984742 3.123103 7
    54531 80096.994 0.042239126 33.788 0.006536642 2.921397 7
    54531 80113.994 0.042804284 33.799 0.006117732 2.732720 6
    54531 80131.994 0.043397711 33.845 0.005726459 2.556278 6
    54531 80149.994 0.043991734 33.892 0.005360728 2.391238 6
    54531 80166.994 0.044557325 33.938 0.005018487 2.236855 5
    54531 80183.994 0.045117876 34.120 0.004698547 2.093385 5
    54531 80201.994 0.045714207 34.317 0.004400142 1.959410 5
    54531 80216.994 0.046208335 34.482 0.004119662 1.833791 5
    54531 80233.994 0.046769870 34.671 0.003858701 1.716664 4
    54531 80250.994 0.047327945 35.394 0.003614874 1.625964 4
    54531 80268.994 0.047925094 36.125 0.003387989 1.542768 4
    54531 80285.994 0.048483893 36.865 0.003175326 1.466640 4
    54531 80300.994 0.048980676 37.565 0.002975434 1.394102 4
    54531 80317.994 0.049539934 38.321 0.002790278 1.331168 4
    54531 80333.994 0.050070750 39.085 0.002616804 1.274155 4
    54531 80350.994 0.050633071 39.858 0.002455857 1.222764 4
    54531 80368.994 0.051225187 40.640 0.002306763 1.176702 4



    54531 81596.994 0.091688138 120.566 0.000560630 1.336847 4
    54531 81614.994 0.092277665 121.974 0.000564323 1.345953 4
    54531 81631.994 0.092832915 123.390 0.000563197 1.354974 4
    54531 81647.994 0.093355741 124.815 0.000558310 1.363858 4
    54531 81665.994 0.093944241 126.248 0.000562173 1.372754 4
    54531 81680.994 0.094438060 127.599 0.000554090 1.370047 4
    54531 81698.994 0.095025116 129.049 0.000558317 1.380290 4
    54531 81713.994 0.095517171 130.416 0.000550471 1.378559 4
    54531 81729.994 0.095041514 131.866 0.000541684 1.387719 4
    54531 81745.994 0.094563218 133.309 0.000534172 1.394739 4
    54531 81760.994 0.094058731 134.654 0.000530552 1.388682 4
    54531 81775.994 0.093548859 135.992 0.000528012 1.382476 4
    54531 81791.994 0.093575450 137.420 0.000493999 1.388228 4
    54531 81808.994 0.093133732 138.841 0.000487771 1.392381 4
    54531 81825.994 0.092695262 140.256 0.000481884 1.395154 4
    54531 81843.994 0.092282190 141.664 0.000473829 1.396781 4
    54531 81860.994 0.091845491 143.065 0.000469349 1.397366 4
    54531 81876.994 0.091866474 144.467 0.000439098 1.397917 4
    54531 81893.994 0.090927669 145.855 0.000528087 1.396613 4
    54531 81909.994 0.090950463 147.242 0.000494046 1.395513 4
    54531 81926.994 0.090506990 148.623 0.000488011 1.393711 4



    54531 86165.994 0.032871409 368.915 0.001025820 0.522859 4
    54531 86182.994 0.033430722 369.425 0.000979730 0.521283 4
    54531 86198.994 0.033955849 369.943 0.000935071 0.520889 4
    54531 86215.994 0.032518157 370.440 0.001011648 0.517866 4
    54531 86232.994 0.033075338 370.944 0.000966597 0.516237 4
    54531 86248.994 0.033601583 371.457 0.000923113 0.515799 4
    54531 86264.994 0.032128572 371.947 0.001008385 0.512674 4
    54531 86282.994 0.032720790 372.447 0.000966217 0.511019 4
    54531 86298.994 0.032747589 372.946 0.000903863 0.509617 4
    54531 86314.994 0.032778294 373.446 0.000845556 0.508444 4
    54531 86332.994 0.031872478 373.933 0.000853321 0.505733 4
    54531 86349.994 0.032433287 374.428 0.000822466 0.504391 4
    /*****************************************/


    So, the summary is that drift goes to 500ppm when stepping is disabled
    but runs normally when stepping is enabled and both situations never
    require a time step. This makes no sense to me. By the way, as
    mentioned previously, we require that time does not step backward due to
    a problem in some commercial software that cannot currently tolerate
    time moving backwards.

    Quite frankly, I don't think it's unreasonable that a system require
    time to monotonically increase. Clearly this isn't the first system
    that requires such behavior (i.e. time step disable was not added for
    me). I understand it takes 14 days to recover from an offset of 600
    seconds, but I also understand that if we have an offset of more than
    10ms in this system, then something isn't working correctly. I'm going
    to be bold and say that we simply will _never_ have an offset of 600
    seconds in this system. If we do, they will have a recovery procedure
    that involves rebooting the system, which will force a quick sync during
    startup. If they continue to have a problem, it will be fixed, most
    likely by swapping hardware until the problem is fixed or flying someone
    in to work on the system.
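
For reference, the 14-day figure follows from ntpd's maximum slew rate of 500 ppm. The sketch below is only a back-of-the-envelope check: it assumes the full 500 ppm is available for correcting the offset (that is, the clock's own frequency error is ignored), and the function name is hypothetical.

```python
# Rough check of how long slewing takes when stepping is disabled.
# Assumption: the full 500 ppm slew rate goes toward removing the offset.

MAX_SLEW_PPM = 500.0  # ntpd's maximum slew rate


def slew_recovery_seconds(offset_s, slew_ppm=MAX_SLEW_PPM):
    """Seconds needed to slew away offset_s at slew_ppm parts per million."""
    return abs(offset_s) / (slew_ppm * 1e-6)


for offset in (600.0, 0.010):
    seconds = slew_recovery_seconds(offset)
    print(f"offset {offset} s -> {seconds:.0f} s ({seconds / 86400.0:.2f} days)")
```

Under these assumptions a 600 s offset takes about 1,200,000 s, or roughly 13.9 days, to remove, while a 10 ms offset takes about 20 s.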

    To summarize, we really need to disable time stepping to keep time from
    moving backwards. Maybe the commercial software will be fixed before
    this problem is solved, but I don't want to rely on that and, even then,
    monotonically increasing time may remain a requirement.

    Andy

  14. Re: drift value very large and very unstable


    Rob Neal wrote:
    > On Wed, 5 Mar 2008, Andy Helten wrote:
    >
    >
    >> I think I have been giving it enough time to stabilize -- any test I
    >> consider legitimate was allowed to run for at least 8 hours. Most tests
    >> ran overnight for 18-24 hours and some tests ran over weekends for
    >> nearly 72 hours. Results were always the same (very large drift). In
    >> fact, if allowed to run long enough, the drift almost always reached the
    >> +/-500 max.
    >>

    > The drift only tells part of the story. What is the offset
    > doing? Does it cross zero and continue to diverge, or is
    > it still headed to zero?
    >


    In one run that I looked at closely (which was the subject of another
    email I just sent), the offset increased to about 90ms and then started
    to decrease. I stopped the run before it had time to fully converge to
    zero and the drift value was still increasing, so it's not the perfect
    example. A quick look at the logs from another run shows the offset
    reaching 115ms. This test ran for several hours, the drift eventually
    reached 500ppm at which point the offset bounced around from 1 to 3ms
    (i.e. the offset was very unstable). I guess you would expect an
    unstable offset if you are banging up against the upper end of the
    drift. Here are some lines from that stats.loop:

    54525 62788.352 0.023098393 0.000 0.008166515 0.000000 6 <-BEGIN
    54525 62791.352 0.023381710 0.004 0.007639732 0.001478 6
    54525 62809.352 0.025086955 0.031 0.007171702 0.009616 6
    54525 62825.352 0.026604190 0.056 0.006729925 0.012703 6
    54525 62840.352 0.028026520 0.082 0.006315321 0.014822 6
    54525 62856.352 0.029544500 0.110 0.005931771 0.017072 6
    54525 62872.352 0.031063284 0.139 0.005574586 0.019098 6
    54525 62889.352 0.032676090 0.172 0.005245631 0.021358 6
    54525 62905.352 0.034194599 0.205 0.004936122 0.023067 5
    54525 62922.354 0.035808330 0.350 0.004652435 0.055665 5
    54525 62937.352 0.037232161 0.483 0.004380973 0.070196 5
    54525 62955.354 0.038941107 0.650 0.004142326 0.088332 5
    54525 62972.352 0.040555202 0.815 0.003916589 0.101018 4
    54525 62990.352 0.042264909 1.460 0.003713166 0.246815 4

    54525 63663.352 0.106036189 47.865 0.001520471 1.434006 4
    54525 63678.352 0.107457194 49.402 0.001508397 1.447306 4
    54525 63696.352 0.109161199 51.068 0.001534212 1.476368 4
    54525 63712.352 0.110677398 52.756 0.001531972 1.504564 4
    54525 63727.352 0.112099421 54.360 0.001518664 1.517296 4
    54525 63743.352 0.113614091 56.094 0.001518165 1.545992 4
    54525 63758.354 0.115036493 57.739 0.001506528 1.558793 4 <-MAX
    54525 63773.352 0.113958623 59.369 0.001459845 1.567895 4
    54525 63791.352 0.114164639 61.111 0.001367501 1.590703 4
    54525 63808.352 0.113274943 62.840 0.001317288 1.608565 4
    54525 63825.352 0.113386776 64.570 0.001232844 1.624260 4
    54525 63843.352 0.113093372 66.296 0.001157876 1.637280 4
    54525 63860.352 0.112205284 68.008 0.001127688 1.646820 4
    54525 63875.352 0.111627571 69.605 0.001074448 1.640657 4
    54525 63891.352 0.111143832 71.300 0.001019502 1.647666 4
    54525 63906.352 0.111067353 72.889 0.000954040 1.640427 4
    54525 63924.352 0.110773822 74.580 0.000898437 1.646740 4
    54525 63942.352 0.110479720 76.265 0.000846819 1.651672 4
    54525 63958.352 0.109498471 77.936 0.000864766 1.654077 4

    54528 52785.354 0.002121099 500.000 0.000561826 0.014400 10
    54528 52803.352 0.002828893 500.000 0.000582078 0.017494 10
    54528 52821.352 0.003035901 500.000 0.000549381 0.016695 10
    54528 52838.352 0.003150695 500.000 0.000515499 0.015727 9
    54528 52854.354 0.001669291 500.000 0.000711928 0.014711 9
    54528 52869.352 0.001417609 500.000 0.000671866 0.013761 9
    54528 52886.354 0.001530270 500.000 0.000629734 0.012872 9
    54528 52904.352 0.001413156 500.000 0.000590516 0.012041 10
    54528 52921.354 0.002351464 500.000 0.000644339 0.018575 10
    54528 52940.352 0.003152638 500.000 0.000665967 0.021479 10
    54528 52956.352 0.003170197 500.000 0.000622986 0.020094 10
    54528 52973.352 0.002956953 499.991 0.000587606 0.019083 9 <-END


    >> The tinker commands are also necessary (at least disabling the step) due
    >> to some commercial software that has serious problems with backward time
    >> steps. This problem should be fixed in a future version, but that may
    >> not be soon enough for us. Even then, we may not want time to step
    >> backwards.
    >>

    > There is a reason they are options. Try setting your clock
    > by hand an hour or so off, and starting ntpd. Watch the
    > time it reports while it plays catch-up. Scary.
    > Your call, but it would probably fail an audit.
    >


    I understand your point and it is indeed scary to consider the
    catastrophic failures enabled by preventing time steps. My
    counter-point is that no one is going to be setting time on these
    systems and time should never jump. If IRIG-B time is jumping around
    wildly, then no other subsystem will work correctly if it relies on
    accurate time in any way. It would need to be fixed. In fact, if
    IRIG-B time jumps more than a certain amount, our subsystem stops using
    it for synchronizing system time. It is better for us to drift from
    IRIG-B time, so long as the various boards in our system remain
    synchronized with each other.

    In reality, we must be able to assume IRIG-B time is stable. With that
    assumption, we must also be able to assume we can keep system time
    within a few milliseconds of IRIG-B time. We use IRIG-B time directly
    (i.e. read it from the IRIG PMC's registers) on boards that require
    highly accurate time synchronization, but not all boards have an IRIG-B
    PMC. We use ntp-disciplined system time for timestamps that aren't so
    critical. The NTP synchronization requirements are TBD, but will
    probably be on the order of 1-50ms accuracy between the IRIG-B synced
    NTP server and the various NTP clients. I don't think this is
    unreasonable and is achievable in all the testing I've done on *other*
    hardware and software.

    Andy

  15. Re: drift value very large and very unstable



    > So, the summary is that drift goes to 500ppm when stepping is disabled
    > but runs normally when stepping is enabled and both situations never
    > require a time step. This makes no sense to me. By the way, as
    > mentioned previously, we require that time does not step backward due to
    > a problem in some commercial software that cannot currently tolerate
    > time moving backwards.
    >
    > Quite frankly, I don't think it's unreasonable that a system require
    > time to monotonically increase.


    Forgive me if this answer misses a point in the earlier details, or shows my
    ignorance of NTP, but a few ideas/thoughts.

    Oscillators and drift can go in either direction, fast or slow; it's a
    physics-based situation. You can't write code around that and provide a
    software solution that is monotonic at all times. However, a single negative
    step just at the start may be required before going monotonic after that
    event. (Not an expert, but that is my understanding).

    With this ref clock and a GPS-driven IRIG source, you may only see a single
    negative step when NTP first begins running on a new system with no drift
    file, or a system that has been powered off a long time with a
    battery-driven clock drifting over that long time. Once NTP is humming along
    after the initial step and some updates, you shouldn't see a step again.
    This makes me think that you should insert a delay in launching your
    sensitive application, or block the application at some point, so it does
    not see the (possible) first time step.

    Fran Horan
    JHU/APL



  16. Re: drift value very large and very unstable


    Fran Horan wrote:
    >
    >
    >
    >> So, the summary is that drift goes to 500ppm when stepping is disabled
    >> but runs normally when stepping is enabled and both situations never
    >> require a time step. This makes no sense to me. By the way, as
    >> mentioned previously, we require that time does not step backward due to
    >> a problem in some commercial software that cannot currently tolerate
    >> time moving backwards.
    >>
    >> Quite frankly, I don't think it's unreasonable that a system require
    >> time to monotonically increase.
    >>

    >
    > Forgive me if this answer misses a point in the earlier details, or shows my
    > ignorance of NTP, but a few ideas/thoughts.
    >
    > Oscillators and drift can go in either direction, fast or slow; it's a
    > physics-based situation. You can't write code around that and provide a
    > software solution that is monotonic at all times. However, a single negative
    > step just at the start may be required before going monotonic after that
    > event. (Not an expert, but that is my understanding).
    >
    > With this ref clock and a GPS-driven IRIG source, you may only see a single
    > negative step when NTP first begins running on a new system with no drift
    > file, or a system that has been powered off a long time with a
    > battery-driven clock drifting over that long time. Once NTP is humming along
    > after the initial step and some updates, you shouldn't see a step again.
    > This makes me think that you should insert a delay in launching your
    > sensitive application, or block the application at some point, so it does
    > not see the (possible) first time step.
    >
    > Fran Horan
    > JHU/APL
    >
    >

    Hey Fran,

    Yes, exactly, we do perform an initial time sync with stepping enabled.
    This is done prior to initializing the commercial software and so it
    does not cause problems if time moves backwards. And, yes, if we are
    below the step threshold after the initial sync (which should always be
    the case), then we should stay below that threshold until the end of
    time. Following this logic, we should allow time steps and be comforted
    knowing they will never occur in a normally functioning system. I agree
    this is reasonable and does not conflict with my own rant that "if we
    have an offset of more than 10ms in this system, then something isn't
    working correctly".

    This approach is definitely worth considering and I'll bring it up with
    the decision makers. However, there is always concern that months or
    years from now someone will say -- "Hey, some dumbass left time stepping
    enabled, let's disable it on all systems immediately". Surely this
    wouldn't be done without some regression testing, but then again such a
    mundane change shouldn't need exhaustive testing, right? Riiiiight.

    I guess I was just hoping someone would say, "Oh, right, that's a known
    problem. You need to do 'X' to fix it."

    Andy

  17. Re: drift value very large and very unstable

    On Fri, Mar 07, 2008 at 09:13:14AM -0600, Andy Helten wrote:
    > As I mentioned in a previous email, I was going to run ntp while adding
    > back in some of the features I removed. After doing this, at least
    > superficially, I've isolated the problem to time step being disabled.
    > It doesn't matter whether I specify 'tinker step 0' in ntp.conf or use
    > the '-x' argument on the command line.


    This looks like a kernel bug. When the step threshold is 0 or larger
    than 0.5s, kernel time discipline is disabled and time is adjusted
    only by adjtime(). I've recently seen a similar problem on a PowerPC
    machine where adjtime() called with delta smaller than 1ms was
    ignored.
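
To see why such a threshold would matter, here is a rough arithmetic sketch. It rests on assumptions not confirmed in this thread: that with the kernel discipline disabled ntpd applies its correction through adjtime() about once per second, and that a 1 ms cutoff like the one seen on the PowerPC machine also applied here. The function name and the constant are hypothetical.

```python
# Rough arithmetic: how large is each adjtime() delta if the correction is
# applied about once per second (an assumption, not confirmed in this thread)?

IGNORED_BELOW_US = 1000.0  # hypothetical cutoff: the 1 ms seen on the PowerPC box


def per_call_delta_us(freq_ppm, interval_s=1.0):
    """Microseconds of correction per call needed to cancel freq_ppm of drift."""
    return freq_ppm * interval_s  # ppm times seconds gives microseconds


for ppm in (33.5, 500.0):
    delta = per_call_delta_us(ppm)
    print(f"{ppm} ppm -> {delta} us per call, below cutoff: {delta < IGNORED_BELOW_US}")
```

Under these assumptions even the maximum 500 ppm correction amounts to only 500 us per call, so every frequency correction would fall below such a cutoff.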

    --
    Miroslav Lichvar

  18. Re: drift value very large and very unstable

    Miroslav Lichvar wrote:
    > On Fri, Mar 07, 2008 at 09:13:14AM -0600, Andy Helten wrote:
    >
    >> As I mentioned in a previous email, I was going to run ntp while adding
    >> back in some of the features I removed. After doing this, at least
    >> superficially, I've isolated the problem to time step being disabled.
    >> It doesn't matter whether I specify 'tinker step 0' in ntp.conf or use
    >> the '-x' argument on the command line.
    >>

    >
    > This looks like a kernel bug. When the step threshold is 0 or larger
    > than 0.5s, kernel time discipline is disabled and time is adjusted
    > only by adjtime(). I've recently seen a similar problem on a PowerPC
    > machine where adjtime() called with delta smaller than 1ms was
    > ignored.
    >


    Thank you for this pointer! I'm fuzzy at best on the relationship
    between NTP and the kernel, but I did read through a thread on
    the mailing list about 'tinker step 0' and the kernel time discipline:

    http://lists.ntp.isc.org/pipermail/q...er/011531.html

    I read through it quickly, so I am still unsure why disabling the kernel
    time discipline is necessary if stepping is disabled but no step would
    have occurred even if stepping were enabled. Is there an "official" NTP
    page that covers the topic of kernel time discipline? At any rate, I
    will look closer at the kernel and the problems with adjtime.

    Andy

  19. Re: drift value very large and very unstable

    Andy Helten wrote:
    > Andy wrote:
    >
    >>The good news is that "new ntp.conf" appears to work! This is the first
    >>configuration that has produced reasonable results, granted it could
    >>still be a fluke since the drift was rather unpredictable (but _always_
    >>settled near +/-500ppm). The bad news is that we _require_ some of the
    >>commands removed from ntp.conf (at least burst and step). After letting
    >>ntp run with the "new ntp.conf" for at least 16 hours, the drift had
    >>stabilized around 33ppm:
    >>
    >>sbc1 root 1->ntpq -crv
    >>assID=0 status=0444 leap_none, sync_uhf_clock, 4 events,
    >>event_peer/strat_chg,
    >>version="ntpd 4.2.4p0@1.1472 Tue Jan 8 16:23:44 UTC 2008 (1)",
    >>processor="i686", system="Linux/2.6.18.8-RedHawk-4.2-trace", leap=00,
    >>stratum=1, precision=-20, rootdelay=0.000, rootdispersion=0.272,
    >>peer=13451, refid=BTFP,
    >>reftime=c0320bd4.c1843a15 Thu, Mar 7 2002 10:55:00.755, poll=4,
    >>clock=c0320bd5.6dfc379d Thu, Mar 7 2002 10:55:01.429, state=4,
    >>offset=0.029, frequency=33.562, jitter=0.002, noise=0.002,
    >>stability=0.001
    >>
    >>
    >>This test ran with the previously problematic Redhawk kernel and all of
    >>the same hardware. To further isolate the problem, I've added the
    >>'burst' command back into ntp.conf, removed the drift file, and
    >>restarted ntp.
    >>
    >>Andy
    >>

    >
    >
    > This may seem like a long email, but it mostly consists of two cut &
    > paste jobs interspersed with brilliant analysis. The two cut & paste
    > chunks are from two different runs of ntp, one that shows ntp working
    > correctly and one that shows it "failing". The _only_ difference
    > between the two runs is that time stepping was left to default behavior
    > in the working run and time stepping was disabled in the "failing" run.
    > So, please don't be discouraged by the length of the email, you may find
    > it an intriguing read...
    >
    >
    > As I mentioned in a previous email, I was going to run ntp while adding
    > back in some of the features I removed. After doing this, at least
    > superficially, I've isolated the problem to time step being disabled.
    > It doesn't matter whether I specify 'tinker step 0' in ntp.conf or use
    > the '-x' argument on the command line. Both result in drift that
    > approaches 500ppm. First the working configuration. Below is ntpq
    > output and the ntp.conf file after running just over 12 hours with step
    > _enabled_ in which everything works correctly. Below the ntpq/ntp.conf
    > information is the beginning and ending of the stats.loop file for the
    > same run.
    >
    > Keep in mind, NTP runs perfectly on this same set of hardware when
    > running the 2.6.9-42 linux kernel with time stepping *disabled*. If the
    > true drift of this system is around 33ppm (two different runs with
    > stepping disabled have settled near 33ppm), then would the clock offset
    > ever get larger than 128ms and require a step? In fact, the answer
    > seems to be "no, a time step is never even required". As proof,
    > stats.loop shows the largest offset to be 0.023634 seconds and that is
    > the third entry in the file. The offset only goes down after that,
    > eventually achieving microsecond accuracy. Yes, this is with time
    > stepping *enabled*, but still the point is that no step was even needed
    > to keep time accurately and to establish a reasonable (and apparently
    > accurate) drift value.
    >
    >
    > /*****************************************/
    > ntpq for working configuration, stepping enabled
    > /*****************************************/
    > sbc1 root 3->ntpq
    > ntpq> pe
    >      remote           refid      st t when poll reach   delay   offset  jitter
    > ==============================================================================
    > *GPS_BANC(0)     .BTFP.           0 l    8   16  377    0.000   -0.006   0.001
    > ntpq> as
    >
    > ind assID status  conf reach auth condition  last_event cnt
    > ===========================================================
    >   1 19400  9614   yes   yes  none  sys.peer   reachable  1
    > ntpq> rv &1
    > assID=19400 status=9614 reach, conf, sel_sys.peer, 1 event, event_reach,
    > srcadr=GPS_BANC(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=00,
    > stratum=0, precision=-21, rootdelay=0.000, rootdispersion=0.000,
    > refid=BTFP, reach=377, unreach=0, hmode=3, pmode=4, hpoll=4, ppoll=10,
    > flash=00 ok, keyid=0, ttl=64, offset=-0.006, delay=0.000,
    > dispersion=0.105, jitter=0.001,
    > reftime=cb7bc62f.16499b2e Fri, Mar 7 2008 8:48:31.087,
    > org=cb7bc62f.16499b2e Fri, Mar 7 2008 8:48:31.087,
    > rec=cb7bc62f.164a14e6 Fri, Mar 7 2008 8:48:31.087,
    > xmt=cb7bc62f.16490801 Fri, Mar 7 2008 8:48:31.087,
    > filtdelay= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00,
    > filtoffset= -0.01 -0.01 -0.01 -0.01 -0.01 -0.01 -0.01 -0.01,
    > filtdisp= 0.00 0.20 0.21 0.23 0.24 0.26 0.27 0.47
    > ntpq> rv
    > assID=0 status=0444 leap_none, sync_uhf_clock, 4 events,
    > event_peer/strat_chg,
    > version="ntpd 4.2.4p0@1.1472 Tue Jan 8 16:23:44 UTC 2008 (1)",
    > processor="i686", system="Linux/2.6.18.8-RedHawk-4.2-trace", leap=00,
    > stratum=1, precision=-20, rootdelay=0.000, rootdispersion=0.037,
    > peer=19400, refid=BTFP,
    > reftime=cb7bc634.1649ad18 Fri, Mar 7 2008 8:48:36.087, poll=4,
    > clock=cb7bc635.182e76f5 Fri, Mar 7 2008 8:48:37.094, state=4,
    > offset=-0.003, frequency=33.551, jitter=0.002, noise=0.002,
    > stability=0.001
    > ntpq> quit
    > sbc1 root 10->ps x|grep ntpd
    > 9554 ? Ss 0:00 ntpd -c /etc/ntp_debug.conf
    > sbc1 root 4->cat /etc/ntp_debug.conf
    > # Debug stuff
    > statistics clockstats peerstats loopstats
    > statsdir /var/lib/ntp/log/
    > filegen clockstats file stats.clock type pid link enable
    > filegen peerstats file stats.peer type pid link enable
    > filegen loopstats file stats.loop type pid link enable
    >
    > restrict default nomodify notrap noquery
    > restrict 127.0.0.1
    >
    > driftfile /var/lib/ntp/drift
    >
    > tinker panic 0
    >
    > server 127.127.16.0 prefer mode 2 minpoll 4 burst # Symmetricom BC635
    > tos orphan 6
    >
    > /*****************************************/
    > stats.loop for working configuration, stepping enabled
    > /*****************************************/
    > 54532 8012.087 0.014105000 0.000 0.004986901 0.000000 6
    > 54532 8922.088 0.023610000 10.445 0.007191059 3.692976 6
    > 54532 8937.087 0.023634000 10.465 0.006726625 3.454469 6
    > 54532 8952.088 0.023633000 10.484 0.006292182 3.231368 6
    > 54532 8970.087 0.023631000 10.512 0.005885797 3.022683 6
    > 54532 8986.087 0.023628000 10.535 0.005505659 2.827473 6
    > 54532 9004.087 0.023625000 10.559 0.005150073 2.644872 6
    > 54532 9021.088 0.023622000 10.582 0.004817452 2.474065 6
    > 54532 9036.088 0.023620000 10.605 0.004506314 2.314291 5
    > 54532 9053.087 0.023616000 10.699 0.004215271 2.165075 5
    > 54532 9068.087 0.023292000 10.781 0.003944691 2.025449 5
    > 54532 9085.088 0.022913000 10.875 0.003692351 1.894924 5
    > 54532 9101.088 0.022565000 10.961 0.003456068 1.772800 4
    > 54532 9119.090 0.022186000 11.344 0.003235634 1.663816 4
    > 54532 9136.087 0.021169000 11.688 0.003047943 1.561096 4
    > 54532 9151.087 0.020285000 11.977 0.002868169 1.463843 4
    > 54532 9167.088 0.019392000 12.273 0.002701435 1.373317 4
    > 54532 9182.087 0.018601000 12.539 0.002542389 1.288048 4
    > 54532 9197.088 0.017851000 12.793 0.002392922 1.208199 4
    > 54532 9213.087 0.017095000 13.055 0.002254277 1.133948 4
    > 54532 9228.088 0.016425000 13.289 0.002121941 1.063943 4
    > 54532 9246.090 0.015669000 13.559 0.002002814 0.999779 4
    > 54532 9262.088 0.015036000 13.789 0.001886778 0.938751 4
    > 54532 9278.088 0.014437000 14.008 0.001777582 0.881520 4
    > 54532 9293.087 0.013905000 14.207 0.001673378 0.827590 4
    > 54532 9310.088 0.013337000 14.422 0.001578123 0.777857 4
    > 54532 9326.087 0.012832000 14.617 0.001486967 0.730888 4
    > 54532 9344.088 0.012297000 14.832 0.001403735 0.687889 4
    > 54532 9359.088 0.011876000 15.000 0.001321477 0.646196 4
    > 54532 9377.088 0.011401000 15.195 0.001247494 0.608393 4
    > 54532 9392.087 0.011025000 15.352 0.001174473 0.571774 4
    > 54532 9409.087 0.010623000 15.527 0.001107767 0.538445 4
    > 54532 9427.088 0.010225000 15.703 0.001045725 0.507488 4
    > 54532 9442.087 0.009909000 15.844 0.000984548 0.477309 4
    >
    >
    >
    > 54532 49946.087 0.000006000 33.555 0.000001419 0.001337 4
    > 54532 49963.088 0.000006000 33.555 0.000001369 0.001251 4
    > 54532 49979.087 0.000008000 33.555 0.000001526 0.001170 4
    > 54532 49996.087 0.000008000 33.555 0.000001467 0.001095 4
    > 54532 50012.087 0.000010000 33.555 0.000001538 0.001024 4
    > 54532 50030.088 0.000012000 33.555 0.000001646 0.000958 4
    > 54532 50045.088 0.000013000 33.555 0.000001576 0.000896 4
    > 54532 50063.088 0.000015000 33.555 0.000001635 0.000838 4
    > 54532 50079.087 0.000016000 33.555 0.000001570 0.000784 4
    > 54532 50095.087 0.000017000 33.555 0.000001554 0.000733 4
    > 54532 50110.088 0.000019000 33.555 0.000001618 0.000686 4
    > 54532 50126.087 0.000022000 33.555 0.000001775 0.000642 4
    > /*****************************************/
    >
    >
    > Now here are some ntpq and stats.loop values for the exact same
    > hardware/software configuration as above, except with stepping disabled
    > via 'tinker step 0'. There was no reboot between these runs, only the
    > tinker step was added back into ntp.conf and the drift file was
    > deleted. This test was allowed to run for just over two hours and the
    > drift value was still increasing, but experience with this setup
    > indicates the drift was not going to come back down if the test were
    > allowed to run longer. The stats.loop output shows the beginning of the
    > file, shows where offset reaches its maximum, and then shows the end of
    > the file. As you can see, the offset max is 0.095517171 seconds, so no
    > time step is required nor is one taken. Yet, the drift runs out of control.
    >
    >
    > /*****************************************/
    > ntpq for "failing" configuration, stepping disabled
    > /*****************************************/
    > sbc1 root 27->ntpq
    > ntpq> pe
    >      remote           refid      st t when poll reach   delay   offset  jitter
    > ==============================================================================
    > *GPS_BANC(0)     .BTFP.           0 l    3   16  377    0.000   34.313   0.323
    > ntpq> as
    >
    > ind assID status  conf reach auth condition  last_event cnt
    > ===========================================================
    >   1 52112  9614   yes   yes  none  sys.peer   reachable  1
    > ntpq> rv &1
    > assID=52112 status=9614 reach, conf, sel_sys.peer, 1 event, event_reach,
    > srcadr=GPS_BANC(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=00,
    > stratum=0, precision=-21, rootdelay=0.000, rootdispersion=0.000,
    > refid=BTFP, reach=377, unreach=0, hmode=3, pmode=4, hpoll=4, ppoll=10,
    > flash=00 ok, keyid=0, ttl=64, offset=34.313, delay=0.000,
    > dispersion=0.017, jitter=0.323,
    > reftime=cb7b0304.fe74557e Thu, Mar 6 2008 18:55:48.993,
    > org=cb7b0304.fe74557e Thu, Mar 6 2008 18:55:48.993,
    > rec=cb7b0304.fe74aa8a Thu, Mar 6 2008 18:55:48.993,
    > xmt=cb7b0304.fe74063f Thu, Mar 6 2008 18:55:48.993,
    > filtdelay= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00,
    > filtoffset= 34.31 34.28 34.25 34.21 34.18 34.15 33.75 33.72,
    > filtdisp= 0.00 0.02 0.03 0.05 0.06 0.08 0.26 0.27
    > ntpq> rv
    > assID=0 status=0444 leap_none, sync_uhf_clock, 4 events,
    > event_peer/strat_chg,
    > version="ntpd 4.2.4p0@1.1472 Tue Jan 8 16:23:44 UTC 2008 (1)",
    > processor="i686", system="Linux/2.6.18.8-RedHawk-4.2-trace", leap=00,
    > stratum=1, precision=-20, rootdelay=0.000, rootdispersion=34.818,
    > peer=52112, refid=BTFP,
    > reftime=cb7b0304.fe74aa8a Thu, Mar 6 2008 18:55:48.993, poll=4,
    > clock=cb7b0310.9f487fd7 Thu, Mar 6 2008 18:56:00.622, state=4,
    > offset=34.313, frequency=368.414, jitter=0.323, noise=0.952,
    > stability=0.526
    > ntpq>
    > ntpq> quit
    > sbc1 root 28->date
    > Thu Mar 6 18:56:35 EST 2008
    >
    > sbc1 root 29->cat /etc/ntp_debug.conf
    > # Debug stuff
    > statistics clockstats peerstats loopstats
    > statsdir /var/lib/ntp/log/
    > filegen clockstats file stats.clock type pid link enable
    > filegen peerstats file stats.peer type pid link enable
    > filegen loopstats file stats.loop type pid link enable
    >
    > restrict default nomodify notrap noquery
    > restrict 127.0.0.1
    >
    > driftfile /var/lib/ntp/drift
    >
    > tinker step 0
    >
    > server 127.127.16.0 prefer mode 2 minpoll 4 burst # Symmetricom BC635
    > tos orphan 6
    > sbc1 root 30->
    >
    >
    > /*****************************************/
    > stats.loop for "failing" configuration, stepping disabled
    > /*****************************************/
    > 54531 78843.994 0.000851046 0.000 0.000300891 0.000000 6
    > 54531 79747.994 0.030705963 33.578 0.026475415 11.871458 6
    > 54531 79765.994 0.031304524 33.611 0.024766387 11.104738 6
    > 54531 79780.994 0.031796892 33.640 0.023167488 10.387536 6
    > 54531 79797.994 0.032358327 33.672 0.021672109 9.716657 6
    > 54531 79815.994 0.032955986 33.708 0.020273503 9.089109 7
    > 54531 79833.994 0.033550503 33.717 0.018965291 8.502084 7
    > 54531 79849.994 0.034074021 33.725 0.017741370 7.952972 7
    > 54531 79865.994 0.034603163 33.733 0.016596587 7.439324 7
    > 54531 79882.994 0.035167590 33.742 0.015525968 6.958851 7
    > 54531 79898.994 0.035694285 33.751 0.014524407 6.509410 8
    > 54531 79916.994 0.036289215 33.753 0.013587967 6.088996 8
    > 54531 79932.994 0.036820295 33.755 0.012711766 5.695734 8
    > 54531 79950.994 0.037417341 33.758 0.011892642 5.327871 8
    > 54531 79965.994 0.037910326 33.760 0.011125913 4.983767 9
    > 54531 79981.994 0.038438966 33.760 0.010409017 4.661888 9
    > 54531 79998.994 0.039003336 33.761 0.009738788 4.360796 9
    > 54531 80013.994 0.039500613 33.762 0.009111498 4.079152 9
    > 54531 80030.994 0.040060894 33.762 0.008525328 3.815697 8
    > 54531 80046.994 0.040590507 33.765 0.007976912 3.569258 8
    > 54531 80063.994 0.041151342 33.767 0.007464352 3.338735 7
    > 54531 80079.994 0.041677784 33.777 0.006984742 3.123103 7
    > 54531 80096.994 0.042239126 33.788 0.006536642 2.921397 7
    > 54531 80113.994 0.042804284 33.799 0.006117732 2.732720 6
    > 54531 80131.994 0.043397711 33.845 0.005726459 2.556278 6
    > 54531 80149.994 0.043991734 33.892 0.005360728 2.391238 6
    > 54531 80166.994 0.044557325 33.938 0.005018487 2.236855 5
    > 54531 80183.994 0.045117876 34.120 0.004698547 2.093385 5
    > 54531 80201.994 0.045714207 34.317 0.004400142 1.959410 5
    > 54531 80216.994 0.046208335 34.482 0.004119662 1.833791 5
    > 54531 80233.994 0.046769870 34.671 0.003858701 1.716664 4
    > 54531 80250.994 0.047327945 35.394 0.003614874 1.625964 4
    > 54531 80268.994 0.047925094 36.125 0.003387989 1.542768 4
    > 54531 80285.994 0.048483893 36.865 0.003175326 1.466640 4
    > 54531 80300.994 0.048980676 37.565 0.002975434 1.394102 4
    > 54531 80317.994 0.049539934 38.321 0.002790278 1.331168 4
    > 54531 80333.994 0.050070750 39.085 0.002616804 1.274155 4
    > 54531 80350.994 0.050633071 39.858 0.002455857 1.222764 4
    > 54531 80368.994 0.051225187 40.640 0.002306763 1.176702 4
    >
    >
    >
    > 54531 81596.994 0.091688138 120.566 0.000560630 1.336847 4
    > 54531 81614.994 0.092277665 121.974 0.000564323 1.345953 4
    > 54531 81631.994 0.092832915 123.390 0.000563197 1.354974 4
    > 54531 81647.994 0.093355741 124.815 0.000558310 1.363858 4
    > 54531 81665.994 0.093944241 126.248 0.000562173 1.372754 4
    > 54531 81680.994 0.094438060 127.599 0.000554090 1.370047 4
    > 54531 81698.994 0.095025116 129.049 0.000558317 1.380290 4
    > 54531 81713.994 0.095517171 130.416 0.000550471 1.378559 4
    > 54531 81729.994 0.095041514 131.866 0.000541684 1.387719 4
    > 54531 81745.994 0.094563218 133.309 0.000534172 1.394739 4
    > 54531 81760.994 0.094058731 134.654 0.000530552 1.388682 4
    > 54531 81775.994 0.093548859 135.992 0.000528012 1.382476 4
    > 54531 81791.994 0.093575450 137.420 0.000493999 1.388228 4
    > 54531 81808.994 0.093133732 138.841 0.000487771 1.392381 4
    > 54531 81825.994 0.092695262 140.256 0.000481884 1.395154 4
    > 54531 81843.994 0.092282190 141.664 0.000473829 1.396781 4
    > 54531 81860.994 0.091845491 143.065 0.000469349 1.397366 4
    > 54531 81876.994 0.091866474 144.467 0.000439098 1.397917 4
    > 54531 81893.994 0.090927669 145.855 0.000528087 1.396613 4
    > 54531 81909.994 0.090950463 147.242 0.000494046 1.395513 4
    > 54531 81926.994 0.090506990 148.623 0.000488011 1.393711 4
    >
    >
    >
    > 54531 86165.994 0.032871409 368.915 0.001025820 0.522859 4
    > 54531 86182.994 0.033430722 369.425 0.000979730 0.521283 4
    > 54531 86198.994 0.033955849 369.943 0.000935071 0.520889 4
    > 54531 86215.994 0.032518157 370.440 0.001011648 0.517866 4
    > 54531 86232.994 0.033075338 370.944 0.000966597 0.516237 4
    > 54531 86248.994 0.033601583 371.457 0.000923113 0.515799 4
    > 54531 86264.994 0.032128572 371.947 0.001008385 0.512674 4
    > 54531 86282.994 0.032720790 372.447 0.000966217 0.511019 4
    > 54531 86298.994 0.032747589 372.946 0.000903863 0.509617 4
    > 54531 86314.994 0.032778294 373.446 0.000845556 0.508444 4
    > 54531 86332.994 0.031872478 373.933 0.000853321 0.505733 4
    > 54531 86349.994 0.032433287 374.428 0.000822466 0.504391 4
    > /*****************************************/
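
Reading the peak offset and the frequency trend out of stats.loop by eye is error-prone, so a short script can do it. The following is a minimal sketch. It assumes the first four loopstats columns are MJD, seconds past midnight, clock offset in seconds, and frequency in ppm. It checks the peak offset against ntpd's default 128 ms step threshold. The names `LoopSample` and `parse_loopstats` are made up for this example.

```python
from typing import List, NamedTuple

STEP_THRESHOLD_S = 0.128  # ntpd's default step threshold, in seconds


class LoopSample(NamedTuple):
    mjd: int          # Modified Julian Day
    second: float     # seconds past midnight
    offset_s: float   # clock offset, seconds
    freq_ppm: float   # frequency (drift) estimate, ppm


def parse_loopstats(text: str) -> List[LoopSample]:
    """Parse loopstats rows, tolerating '>' quote prefixes and skipping
    anything that is not a data row."""
    samples = []
    for line in text.splitlines():
        fields = line.lstrip("> ").split()
        if len(fields) < 4:
            continue
        try:
            samples.append(LoopSample(int(fields[0]), float(fields[1]),
                                      float(fields[2]), float(fields[3])))
        except ValueError:
            continue  # header or separator line
    return samples


# Four rows copied from the stepping-disabled run above.
SAMPLE = """\
> 54531 78843.994 0.000851046 0.000 0.000300891 0.000000 6
> 54531 79747.994 0.030705963 33.578 0.026475415 11.871458 6
> 54531 81713.994 0.095517171 130.416 0.000550471 1.378559 4
> 54531 86349.994 0.032433287 374.428 0.000822466 0.504391 4
"""

samples = parse_loopstats(SAMPLE)
peak = max(samples, key=lambda s: abs(s.offset_s))
print(f"peak offset {peak.offset_s:.9f} s at second {peak.second:.3f}")
print("step would be needed:", abs(peak.offset_s) > STEP_THRESHOLD_S)
print(f"last frequency {samples[-1].freq_ppm:.3f} ppm")
```

Run against the full file rather than the four sample rows, the same script shows whether the offset ever crossed the step threshold while the frequency was climbing.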
    >
    >
    > So, the summary is that drift goes to 500ppm when stepping is disabled
    > but runs normally when stepping is enabled and both situations never
    > require a time step. This makes no sense to me. By the way, as
    > mentioned previously, we require that time does not step backward due to
    > a problem in some commercial software that cannot currently tolerate
    > time moving backwards.
    >
    > Quite frankly, I don't think it's unreasonable that a system require
    > time to monotonically increase. Clearly this isn't the first system
    > that requires such behavior (i.e. time step disable was not added for
    > me). I understand it takes 14 days to recover from an offset of 600
    > seconds, but I also understand that if we have an offset of more than
    > 10ms in this system, then something isn't working correctly. I'm going
    > to be bold and say that we simply will _never_ have an offset of 600
    > seconds in this system. If we do, they will have a recovery procedure
    > that involves rebooting the system, which will force a quick sync during
    > startup. If they continue to have a problem, it will be fixed, most
    > likely by swapping hardware until the problem is fixed or flying someone
    > in to work on the system.
    >
    > To summarize, we really need to disable time stepping to keep time from
    > moving backwards. Maybe the commercial software will be fixed before
    > this problem is solved, but I don't want to rely on that and, even then,
    > monotonically increasing time may remain a requirement.
    >
    > Andy
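
The "14 days to recover from an offset of 600 seconds" figure in the email above follows from ntpd's maximum slew rate of 500 ppm. A minimal sketch of the arithmetic, with a hypothetical helper name, and assuming the clock is slewed at the maximum rate for the whole correction:

```python
MAX_SLEW_PPM = 500.0  # ntpd's maximum slew rate, parts per million


def slew_time_seconds(offset_s: float, slew_ppm: float = MAX_SLEW_PPM) -> float:
    """Seconds needed to remove offset_s by slewing at slew_ppm."""
    return abs(offset_s) / (slew_ppm * 1e-6)


for offset in (600.0, 0.128, 0.010):
    t = slew_time_seconds(offset)
    print(f"offset {offset:8.3f} s -> {t:12.1f} s ({t / 86400.0:.2f} days)")
```

At that rate a 600 s offset takes about 1.2 million seconds, just under 14 days, while the 10 ms offset mentioned above is gone in about 20 seconds.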


    If ntpd won't work with stepping disabled it's probably a bug and you
    should report it as such.

    OTOH, ntpd clearly DOES work with stepping enabled so run it that way!
    Ntpd does not step the time unless something is badly broken somewhere.
    Once it has the correct time from the reference clock, it will stay
    synchronized until it's shut down. The only case I can recall where
    ntpd has regularly stepped the time was when running under Linux with
    heavy disk activity, sufficient to cause the loss of timer interrupts.

    If your application will not tolerate time steps, don't start it until
    after ntpd acquires the correct time. In most cases ntpd can synch up
    within thirty to sixty seconds so wait sixty seconds after starting ntpd
    before starting your application.

    I believe that there are scripts or programs that will monitor ntpd and
    start an application after ntpd has acquired synch.
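
Such a monitor can be sketched in a few lines of Python. This is an illustration only, not an existing tool: the function names, the 10 ms offset limit, and the timeout values are assumptions. It treats ntpd as synchronized when the `ntpq -c rv` output reports a leap value other than 11, a stratum below 16, and a small offset (ntpq reports offset in milliseconds). The `ntp-wait` script shipped with some NTP distributions serves the same purpose where it is available.

```python
import re
import subprocess
import time


def parse_rv(text: str) -> dict:
    """Extract key=value pairs from `ntpq -c rv` output."""
    return dict(re.findall(r'(\w+)=("[^"]*"|[^,\s]+)', text))


def is_synchronized(rv_text: str, max_offset_ms: float = 10.0) -> bool:
    fields = parse_rv(rv_text)
    try:
        leap = fields["leap"]
        stratum = int(fields["stratum"])
        offset_ms = float(fields["offset"])
    except (KeyError, ValueError):
        return False  # incomplete or unparsable output
    return leap != "11" and stratum < 16 and abs(offset_ms) <= max_offset_ms


def wait_for_sync(fetch=None, timeout_s: float = 300.0,
                  interval_s: float = 5.0) -> bool:
    """Poll until synchronized or timed out. `fetch` returns rv text;
    by default it runs `ntpq -c rv` on the local host."""
    if fetch is None:
        fetch = lambda: subprocess.run(["ntpq", "-c", "rv"],
                                       capture_output=True, text=True).stdout
    deadline = time.monotonic() + timeout_s
    while True:
        if is_synchronized(fetch()):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# System variables copied from the working run earlier in the thread.
SAMPLE_RV = '''assID=0 status=0444 leap_none, sync_uhf_clock, 4 events,
version="ntpd 4.2.4p0@1.1472 Tue Jan 8 16:23:44 UTC 2008 (1)",
processor="i686", system="Linux/2.6.18.8-RedHawk-4.2-trace", leap=00,
stratum=1, precision=-20, rootdelay=0.000, rootdispersion=0.037,
offset=-0.003, frequency=33.551, jitter=0.002, noise=0.002'''

print("synchronized:", is_synchronized(SAMPLE_RV))
```

A startup script would call `wait_for_sync()` and launch the time-sensitive application only if it returns True.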


  20. Re: drift value very large and very unstable

    Andy Helten wrote:
    > Fran Horan wrote:
    >
    >>
    >>
    >>
    >>
    >>>So, the summary is that drift goes to 500ppm when stepping is disabled
    >>>but runs normally when stepping is enabled and both situations never
    >>>require a time step. This makes no sense to me. By the way, as
    >>>mentioned previously, we require that time does not step backward due to
    >>>a problem in some commercial software that cannot currently tolerate
    >>>time moving backwards.
    >>>
    >>>Quite frankly, I don't think it's unreasonable that a system require
    >>>time to monotonically increase.
    >>>

    >>
    >>Forgive me if this answer misses a point in the earlier details, or shows my
    >>ignorance of NTP, but a few ideas/thoughts.
    >>
    >>Oscillators and drift can go in either direction, fast or slow; it's a
    >>physics-based situation. You can't write code around that and provide a
    >>software solution that is monotonic at all times. However, a single negative
    >>step just at the start may be required before going monotonic after that
    >>event. (Not an expert, but that is my understanding).
    >>
    >>With this ref clock and a GPS-driven IRIG source, you may only see a single
    >>negative step when NTP first begins running on a new system with no drift
    >>file, or a system that has been powered off a long time with a
    >>battery-driven clock drifting over that long time. Once NTP is humming along
    >>after the initial step and some updates, you shouldn't see a step again.
    >>This makes me think that you should insert a delay in launching your
    >>sensitive application, or block the application at some point, so it does
    >>not see the (possible) first time step.
    >>
    >>Fran Horan
    >>JHU/APL
    >>
    >>

    >
    > Hey Fran,
    >
    > Yes, exactly, we do perform an initial time sync with stepping enabled.
    > This is done prior to initializing the commercial software and so it
    > does not cause problems if time moves backwards. And, yes, if we are
    > below the step threshold after the initial sync (which should always be
    > the case), then we should stay below that threshold until the end of
    > time. Following this logic, we should allow time steps and be comforted
    > knowing they will never occur in a normally functioning system. I agree
    > this is reasonable and does not conflict with my own rant that "if we
    > have an offset of more than 10ms in this system, then something isn't
    > working correctly".
    >
    > This approach is definitely worth considering and I'll bring it up with
    > the decision makers. However, there is always concern that months or
    > years from now someone will say -- "Hey, some dumbass left time stepping
    > enabled, let's disable it on all systems immediately". Surely this
    > wouldn't be done without some regression testing, but then again such a
    > mundane change shouldn't need exhaustive testing, right? Riiiiight.
    >
    > I guess I was just hoping someone would say, "Oh, right, that's a known
    > problem. You need to do 'X' to fix it."
    >
    > Andy


    A comment in ntp.conf and/or the startup file, explaining WHY stepping
    is enabled should go a long way toward solving the "dumbass" problem.

