ntpd PLL and clock overshoot - NTP

Thread: ntpd PLL and clock overshoot

  1. ntpd PLL and clock overshoot

    As is well known, the PLL implemented in ntpd/ntp_loopfilter.c can
    overshoot. The source code of ntpd 4.1.2 documents this overshoot as
    less than 5 percent, and measurements I performed confirm this.

    The PLL algorithms in ntpd 4.1.x and 4.2.x are different. The source
    code of ntpd 4.2.0 states that UDel TR 97-4-3 documents the ntpd 4.2
    PLL algorithm. Does anyone know where I can find that document? I could
    not find it via Google, and it's not in David Mills' list of papers
    (http://www.cis.udel.edu/~mills/papers.html). I'm asking because for
    ntpd 4.2.0 I measured an overshoot of about 100 percent. Can anyone
    confirm this?


  2. Re: ntpd PLL and clock overshoot

    Bart Van Assche wrote:

    > As known the PLL implemented in ntpd/ntp_loopfilter.c can overshoot. In
    > the source code of ntpd 4.1.2 it is documented that this overshoot is
    > less than 5 percent. Measurements I performed confirm this.
    >
    > The PLL algorithms in ntpd 4.1.x and are 4.2.x different. The source
    > code of ntpd 4.2.0 specifies that document UDel TR 97-4-3 documents the
    > ntpd 4.2 PLL algorithm. Anyone any idea where I can find that document ?
    > I could not find it via Google, and it's not in David Mills list of
    > papers (http://www.cis.udel.edu/~mills/papers.html). I'm asking this
    > because for ntpd 4.2.0 I measured an overshoot of about 100 percent. Can
    > anyone confirm this ?
    >


    I can't confirm the 100 percent but the current version doesn't work too
    well with my GPS reference clock at startup! I had something like a 90
    millisecond offset when I started ntpd. Over the next few minutes it
    corrected that offset but didn't stop, or even slow down, when it hit
    the zero line. It kept right on going until it had a -9 millisecond
    offset. It took about thirty minutes to lock in tightly! I could have
    figured it out with pencil and paper in less time than that and, as a
    mathematician, I can't count to twenty with my shoes on!

  3. Re: ntpd PLL and clock overshoot

    Richard,

    The overshoot is the result of a misdirected Solaris/Linux design of the
    adjtime() system call. The design added a poll to the transfer function
    in order to speed the response to a programmed offset. The result
    completely torpedoes the PLL transient response and there is nothing
    that can be done to correct it in ntpd itself. What you see is what you
    get. The problem is most apparent with large initial frequency offsets,
    which are guaranteed to result in pinball behavior. Note the overshoots
    do not occur in FreeBSD or Tru64, which have a reasonable design.

    Dave

    Richard B. Gilbert wrote:

    > Bart Van Assche wrote:
    >
    >> As known the PLL implemented in ntpd/ntp_loopfilter.c can overshoot.
    >> In the source code of ntpd 4.1.2 it is documented that this overshoot
    >> is less than 5 percent. Measurements I performed confirm this.
    >>
    >> The PLL algorithms in ntpd 4.1.x and are 4.2.x different. The source
    >> code of ntpd 4.2.0 specifies that document UDel TR 97-4-3 documents
    >> the ntpd 4.2 PLL algorithm. Anyone any idea where I can find that
    >> document ? I could not find it via Google, and it's not in David Mills
    >> list of papers (http://www.cis.udel.edu/~mills/papers.html). I'm
    >> asking this because for ntpd 4.2.0 I measured an overshoot of about
    >> 100 percent. Can anyone confirm this ?
    >>

    >
    > I can't confirm the 100 percent but the current version doesn't work too
    > well with my GPS reference clock at startup! I had something like a 90
    > millisecond offset when I started ntpd. Over the next few minutes it
    > corrected that offset but didn't stop, or even slow down, when it hit
    > the zero line. It kept right on going until it had a -9 millisecond
    > offset. It took about thirty minutes lock in tightly! I could have
    > figured it out with pencil and paper in less time than that and, as a
    > mathematician, I can't count to twenty with my shoes on!



  4. Re: ntpd PLL and clock overshoot

    user@domain.invalid wrote:

    > Richard,
    >
    > The overshoot is the result of a misdirected Solaris/Linux design of the
    > adjtime() system call. The design added a poll to the transfer function
    > in order to speed the response to a programmed offset. The result
    > completely torpedoes the PLL transient response and there is nothing
    > that can be done to correct it in ntpd itself. What you see is what you
    > get. The problem is most apparent with large initial frequency offsets,
    > which are guaranteed to result in pinball behavior. Note the overshoots
    > do not occur in FreeBSD or Tru64, which have a reasonable design.
    >
    > Dave


    Dave,

    Have you reported this to Sun as a bug?

    It seems to me that they should fix it and, perhaps, create a new
    function, say, adjtime_fast() that would add the functionality without
    breaking ntpd. It might not get you anywhere but he who does not ask
    does not get!

  5. Re: ntpd PLL and clock overshoot

    user@domain.invalid wrote:

    > Richard,
    >
    > The overshoot is the result of a misdirected Solaris/Linux design of the
    > adjtime() system call. The design added a poll to the transfer function
    > in order to speed the response to a programmed offset.


    I think you mean "pole" rather than "poll"! You are talking about
    poles and zeros, are you not?

  6. Re: ntpd PLL and clock overshoot

    Richard,

    I know why they changed the design and why Linux did, too. It allows
    quick adjustments with large offsets when done manually. From what
    evidence I have, they use an exponential algorithm, which is simple to
    implement. However, the extra pole that algorithm introduces conflicts
    with the NTP PLL. There really should be a way to turn the algorithm on
    and off, but I don't expect much sympathy from the kernelmongers.
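
    The effect of one extra pole can be seen in a toy model. The sketch
    below is not ntpd or kernel code: it assumes a generic
    proportional-plus-integral phase loop with made-up gains (kp, ki), and
    models the exponential adjtime() behaviour as a first-order lag with a
    hypothetical time constant lag_tau. It only illustrates why a loop that
    is well damped on its own can ring badly once such a lag is inserted.

```python
def peak_overshoot(lag_tau=None, kp=2.0, ki=1.0, dt=0.001, t_end=60.0):
    """Return the peak overshoot as a fraction of the initial offset.

    Toy model only: kp, ki and lag_tau are illustrative assumptions,
    not values taken from ntpd or from any kernel.
    """
    x = 1.0       # phase error, normalized so the initial offset is 1
    integ = 0.0   # integral of the phase error (the frequency term)
    v = 0.0       # slew rate actually applied to the clock
    worst = 0.0   # most negative phase error seen so far
    for _ in range(int(t_end / dt)):
        u = kp * x + ki * integ            # slew rate the loop asks for
        if lag_tau is None:
            v = u                          # clock follows the request at once
        else:
            v += (u - v) * dt / lag_tau    # extra pole: exponential lag
        x -= v * dt                        # the slew reduces the phase error
        integ += x * dt
        worst = min(worst, x)
    return -worst


plain = peak_overshoot()
lagged = peak_overshoot(lag_tau=1.0)
print(f"no lag: {plain:.1%}   with lag: {lagged:.1%}")
```

    With these assumed values the lagged loop overshoots several times
    further than the plain one. The exact figures depend entirely on the
    chosen gains and time constant.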

    Dave

    Richard B. Gilbert wrote:
    > user@domain.invalid wrote:
    >
    >> Richard,
    >>
    >> The overshoot is the result of a misdirected Solaris/Linux design of
    >> the adjtime() system call. The design added a poll to the transfer
    >> function in order to speed the response to a programmed offset. The
    >> result completely torpedoes the PLL transient response and there is
    >> nothing that can be done to correct it in ntpd itself. What you see is
    >> what you get. The problem is most apparent with large initial
    >> frequency offsets, which are guaranteed to result in pinball behavior.
    >> Note the overshoots do not occur in FreeBSD or Tru64, which have a
    >> reasonable design.
    >>
    >> Dave

    >
    > Dave,
    >
    > Have you reported this to Sun as a bug?
    >
    > It seems to me that they should fix it and, perhaps, create a new
    > function, say, adjtime_fast() that would add the functionality without
    > breaking ntpd. It might not get you anywhere but he who does not ask
    > does not get!


  7. Re: ntpd PLL and clock overshoot

    Richard,

    Pole, of course, as used in polynominols and confingered flactions. That
    too.

    Dave

    Richard B. Gilbert wrote:
    > user@domain.invalid wrote:
    >
    >> Richard,
    >>
    >> The overshoot is the result of a misdirected Solaris/Linux design of
    >> the adjtime() system call. The design added a poll to the transfer
    >> function in order to speed the response to a programmed offset.

    >
    > I think you mean "pole" rather than "poll"! You are talking about
    > poles and zeros, are you not?


  8. Re: ntpd PLL and clock overshoot

    Richard B. Gilbert wrote:

    > I can't confirm the 100 percent but the current version doesn't work too
    > well with my GPS reference clock at startup! I had something like a 90
    > millisecond offset when I started ntpd. Over the next few minutes it
    > corrected that offset but didn't stop, or even slow down, when it hit
    > the zero line. It kept right on going until it had a -9 millisecond


    That's only a 10% overshoot, which is only twice the design target, so
    is a different problem.

    The problem you are seeing is that (ignoring its ability to modify the
    loop time constants) ntpd uses a simple analogue process controller
    type mechanism to control the phase, based on measured phase errors.

    Such processes don't have any prior knowledge of the amount of noise in
    the phase error signal, whereas a human does. The human realises that,
    for example, 89.9 out of the initial 90ms are the initial transient, whereas
    the ntpd control loop assumes it could all be a random excursion and the
    actual clock may be correct. (Note that some instances of ntpd may be
    operating in contexts where all the 90ms is phase noise.)

    Such linear control can either overshoot and converge quickly, or can
    be over- or critically damped but take longer to converge in the first
    place.

    My feeling is that there is scope for ntpd to learn the likely phase
    noise and to use a fast, dead-beat way of getting into the noise band
    before applying the linear control loop. Once the systematic errors
    have been removed, the Gaussian noise assumptions that underlie the
    analysis of the behaviour of the current algorithm may well apply and
    it may then be the best algorithm for maintaining lock.

    I think there may well be a good case for using Nick McClaren's
    statistics-based least squares fit, at least during initial acquisition,
    rather than the linear controller that is currently used.
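
    As a rough illustration of the least squares idea (a generic
    straight-line fit, not Nick McClaren's actual method and not anything
    in ntpd), the sketch below fits offset = phase + drift * t to a handful
    of hypothetical offset samples. The sample values and the function name
    are invented for the example.

```python
def fit_phase_and_drift(times, offsets):
    """Ordinary least squares fit of offsets = phase + drift * times."""
    n = len(times)
    mean_t = sum(times) / n
    mean_o = sum(offsets) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (o - mean_o) for t, o in zip(times, offsets))
    drift = sxy / sxx                  # seconds of offset per second
    phase = mean_o - drift * mean_t    # offset at t = 0
    return phase, drift


# Hypothetical samples: 90 ms initial offset, clock drifting at 50 ppm,
# polled every 64 s, with no measurement noise to keep the check simple.
times = [0.0, 64.0, 128.0, 192.0, 256.0]
offsets = [0.090 + 50e-6 * t for t in times]
phase, drift = fit_phase_and_drift(times, offsets)
print(f"phase {phase * 1e3:.3f} ms, drift {drift * 1e6:.3f} ppm")
```

    With real, noisy samples the fit would return estimates rather than
    the exact inputs, and the scatter of the residuals would give a
    measure of the phase noise.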

    Note, it may be necessary to ensure that time is not served before the
    error is within the noise region, as that may cause downstream servers
    to do their initial acquisition based on the initial catch up of their
    server, rather than the true time, and might cause instabilities in the
    network, taken as a whole.

    A related problem is that ntpd has no built-in knowledge that crystals
    only vary by 1 or 2 ppm with temperature, so when presented with
    transients it can end up believing it needs a long term frequency
    correction of 500 ppm.
    My feeling is that there should be a coarse control loop that can cope
    with long term changes (including sudden ones like changing motherboards)
    and a fine control loop with only a limited control range (although maybe
    the whole range can be used for the phase correction).

    A real life example of this is a CD drive, where fast, fine, tracking
    is applied to the read head itself, using voice coils, and longer term
    corrections are applied using the head positioner.

  9. Re: ntpd PLL and clock overshoot

    David,

    The modern NTP feedback loop is much more intricate than you report. It
    is represented as a hybrid phase/frequency feedback loop with a
    state-machine driven initial frequency measurement. Details are in the
    book and in the documents recently posted for the IETF NTP Working Group.
    See the stuff linked from the NTP project page.

    There are lots of nasty little approximations in the PLL/FLL code due to
    imprecise measurement of some time intervals. While the design target for
    overshoot is 5-6 percent, I would not be surprised if in some cases it
    is 10 percent.

    Dave

    David Woolley wrote:

    > In article ,
    > Richard B. Gilbert wrote:
    >
    >
    >>I can't confirm the 100 percent but the current version doesn't work too
    >>well with my GPS reference clock at startup! I had something like a 90
    >>millisecond offset when I started ntpd. Over the next few minutes it
    >>corrected that offset but didn't stop, or even slow down, when it hit
    >>the zero line. It kept right on going until it had a -9 millisecond

    >
    >
    > That's only a 10% overshoot, which is only twice the design target, so
    > is a different problem.
    >
    > The problem you are seeing is that (ignoring its ability to modify the
    > loop time constants) ntpd uses a simple analogue process controller
    > type mechanism to control the phase, based on measured phase errors.
    >
    > Such processes don't have any prior knowledge of the amount of noise in
    > the phase error signal, whereas a human does. The human realises that,
    > for example, 89.9 out of the initial 90ms are the initial transient, whereas
    > the ntpd control loop assumes it could all be a random excursion and the
    > actual clock may be correct. (Note that some instances of ntpd may be
    > operating in contexts where all the 90ms is phase noise.)
    >
    > Such linear control can either overshoot and converge quickly, or can
    > be over, or critcally damped, but take longer to converge in the first
    > place.
    >
    > My feeling is that there is scope for ntpd to learn the likely phase
    > noise and to use a fast and dead beat way of getting into the noise band
    > before applying the linear control loop. Once the systematic errors
    > have been removed, the gaussian noise assumptions that underly the
    > analysis of the behaviour of the current algorithm may well apply and
    > it may then be the best algorithm for maintaining lock.
    >
    > I think there may well be a good case for using Nick McClaren's, statistics
    > based, leased squares fit, at least during initial acquisition, rather than
    > the linear controller that is currently used.
    >
    > Note, it may be necessary to ensure that time is not served before the
    > error is within the noise region, as that may cause downstream servers
    > to do their initial acquisition based on the initial catch up of their
    > server, rather than the true time, and might cause instabilities in the
    > network, taken as a whole.
    >
    > A related problem is that ntpd has no built in knowledge that crystals
    > only vary by 1 or 2 ppm with temperature, so when presented with transients
    > can end up believing it needs a long term frequency correction of 500ppm.
    > My feeling is that there should be a coarse control loop that can cope
    > with long term changes (including sudden ones like changing motherboards)
    > and a fine control loop with only a limited control range (although maybe
    > the whole range can be used for the phase correction).
    >
    > A real life example of this is a CD drive, where fast, fine, tracking
    > is applied to the read head itself, using voice coils, and longer term
    > corrections are applied using the head positioner.



  10. Re: ntpd PLL and clock overshoot

    user@domain.invalid (probably David Mills with an IT department that is
    overzealous about preventing spam) wrote:

    > The modern NTP feedback loop is much more intricate than you report. It
    > is represented as a hybrid phase/frequency feedback loop with a


    There may be various finesses, but it is still the essentially analogue
    nature of the process that causes people to complain about overshoots
    and runaway frequency excursions.

    > state-machine driven initial frequency measurement. Details are in the


    As I understand it, the initial frequency measurement is only applied
    when cold started (no ntp.drift). Moreover, the perceived problem being
    reported here is about the initial phase correction. It is normal
    to have to make phase corrections many times the mean phase error
    on a restart, even though it isn't normal to have to do a significant
    frequency correction.

    > There are lots of nasty little approximations in the PLL/FLL code due to
    > imprecise measurement of some time intervals. While the design targe for
    > overshoot is 5-6 percent, I would not be surprised if in some cases it
    > is 10 percent.


    I think the problem here is that a human trying to manually control the
    effective frequency might have overshot by only 0.1%. They would have
    slewed the phase in at the maximum acceptable rate and then made a
    step change in frequency at the moment they crossed a measured phase
    error of zero, stepping by minus the average rate of phase change
    during the slew in. Only then would they start operating anything
    like the current algorithm.

    What they are seeing is 10% of the original error after about an hour,
    when they know that they could have achieved 0.1% in under 10 minutes,
    assuming a 500ppm slew rate limit. (They'd probably need some automation
    to time the transition accurately enough to get to 100 microseconds, as
    assumed here.)

    The best way of implementing this is probably to provide the system with
    memory about the likely phase measurement noise, but a simpler approach
    of detecting the first zero crossing would probably work quite well.
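
    The arithmetic behind those figures can be checked with a few lines.
    This is only a back-of-the-envelope sketch: the function names are made
    up, and the 500 ppm limit, 90 ms offset and 100 microsecond target are
    the values assumed in the post above.

```python
def slew_time_s(offset_s, slew_ppm=500.0):
    """Seconds needed to remove offset_s at a constant slew rate."""
    return abs(offset_s) / (slew_ppm * 1e-6)


def timing_tolerance_s(residual_s, slew_ppm=500.0):
    """How precisely the slew must be stopped to leave at most residual_s."""
    return residual_s / (slew_ppm * 1e-6)


slew_in = slew_time_s(0.090)             # 90 ms initial offset
tolerance = timing_tolerance_s(100e-6)   # 100 microsecond target
print(f"slew-in takes about {slew_in:.0f} s")
print(f"the stop must be timed to within about {tolerance:.2f} s")
```

    The slew itself takes about three minutes, well under the 10 minutes
    quoted, and the zero crossing has to be caught to within a fraction of
    a second, which is why some automation would be needed.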


  11. Re: ntpd PLL and clock overshoot

    David,

    You are a victim of faulty engineering intuition. See Chapter 4 in The
    Book. See the graphs therein showing the response to initial
    phase/frequency excursions compiled both in simulation and practice with
    the current algorithm. If your experience is markedly different from
    these data, then suspect something in the operating system, in
    particular unexpected behavior in the adjtime() system call.

    Your scenario where the operator slings the frequency as the response
    crosses zero is equivalent to a frequency-lock model which disregards
    the initial phase error. This is in fact the model for the initial
    frequency estimate when the frequency file has not yet been created.
    This is most important when the initial poll interval is very long, as
    it must be with the telephone modem driver.

    With all of the machines here, including FreeBSD, Solaris, HP-UX, SunOS
    and Tru64, the loop response in steady state is as I reported
    earlier. The results with Linux are highly suspect, as at least in some
    cases the timer interrupt frequency has been changed significantly
    without compensation in the kernel parameters. I have recommended
    avoiding Linux in any case involving precision timekeeping.

    Dave

    David Woolley wrote:
    > In article ,
    > user@domain.invalid (probably David Mills with an IT department that is
    > overzealous about preventing spam) wrote:
    >
    >
    >>The modern NTP feedback loop is much more intricate than you report. It
    >>is represented as a hybrid phase/frequency feedback loop with a

    >
    >
    > There may be various finesses, but it is still the essentially analogue
    > nature of the process that causes people to complain about overshoots
    > and runaway frequency excursions.
    >
    >
    >>state-machine driven initial frequency measurement. Details are in the

    >
    >
    > As I understand it, the initial frequency measurement is only applied
    > when cold started (no ntp.drift). Moreover, the perceived problem being
    > reported here is about the initial phase correction. It is normal
    > to have to make phase corrections many times the mean phase error
    > on a restart, even though it isn't normal to have to do a signficant
    > frequency correction.
    >
    >
    >>There are lots of nasty little approximations in the PLL/FLL code due to
    >>imprecise measurement of some time intervals. While the design targe for
    >>overshoot is 5-6 percent, I would not be surprised if in some cases it
    >>is 10 percent.

    >
    >
    > I think the problem here is that a human trying to manually control the
    > effective frequency might have overshot by only 0.1%. They would have
    > slewed the phase in at the maximum acceptable rate and then made a
    > step change in frequency at the moment they crossed a measured phase
    > error of zero, stepping by minus the average rate of phase change
    > during the slew in. Only then would they start operating anything
    > like the current algorithm.
    >
    > What they are seeing is 10% of the original error after about an hour,
    > when they know that they could have achieved 0.1% in under 10 minutes,
    > assuming a 500ppm slew rate limit. (They'd probably need some automation
    > to time the transition accurately enough to get to 100 microseconds, as
    > assumed here.)
    >
    > The best way of implementing this is probably to provide the system with
    > memory about the likely phase measurement noise, but a simpler approach
    > of detecting the first zero crossing would probably work quite well.
    >


  12. Re: ntpd PLL and clock overshoot

    David Woolley wrote:

    > In article ,
    > user@domain.invalid (probably David Mills with an IT department that is
    > overzealous about preventing spam) wrote:
    >
    >
    >>The modern NTP feedback loop is much more intricate than you report. It
    >>is represented as a hybrid phase/frequency feedback loop with a

    >
    >
    > There may be various finesses, but it is still the essentially analogue
    > nature of the process that causes people to complain about overshoots
    > and runaway frequency excursions.
    >
    >
    >>state-machine driven initial frequency measurement. Details are in the

    >
    >
    > As I understand it, the initial frequency measurement is only applied
    > when cold started (no ntp.drift). Moreover, the perceived problem being
    > reported here is about the initial phase correction. It is normal
    > to have to make phase corrections many times the mean phase error
    > on a restart, even though it isn't normal to have to do a signficant
    > frequency correction.
    >
    >
    >>There are lots of nasty little approximations in the PLL/FLL code due to
    >>imprecise measurement of some time intervals. While the design targe for
    >>overshoot is 5-6 percent, I would not be surprised if in some cases it
    >>is 10 percent.

    >
    >
    > I think the problem here is that a human trying to manually control the
    > effective frequency might have overshot by only 0.1%. They would have
    > slewed the phase in at the maximum acceptable rate and then made a
    > step change in frequency at the moment they crossed a measured phase
    > error of zero, stepping by minus the average rate of phase change
    > during the slew in. Only then would they start operating anything
    > like the current algorithm.
    >
    > What they are seeing is 10% of the original error after about an hour,
    > when they know that they could have achieved 0.1% in under 10 minutes,
    > assuming a 500ppm slew rate limit. (They'd probably need some automation
    > to time the transition accurately enough to get to 100 microseconds, as
    > assumed here.)
    >
    > The best way of implementing this is probably to provide the system with
    > memory about the likely phase measurement noise, but a simpler approach
    > of detecting the first zero crossing would probably work quite well.
    >


    I believe that Dave Mills has already explained that the problem is due
    to changes in the adjtime() routine in both Sun Solaris and Linux.
    This being the case, the choices would seem to be:
    a. Live with it.
    b. Get Sun and the Linux developers to back out the change to adjtime()
    that broke ntpd.
    c. Provide a custom adjtime() for each platform affected. I suspect
    that the routine in question runs in kernel mode and may be part of the
    kernel so that this may be easier said than done!

  13. Re: ntpd PLL and clock overshoot

    In article <9tCdnQXAi99MsKzYnZ2dnUVZ_vKdnZ2d@comcast.com>,
    Richard B. Gilbert wrote:

    > I believe that Dave Mills has already explained that the problem is due
    > to changes in the adjtime() routine in both Sun Solaris and Unix.


    No. He has explained that the 100% overshoot problem is due to that. We
    are talking about a 10% overshoot here. Dave Mills has said that a 5 to 6%
    overshoot represents the intended normal behaviour and that a 10% overshoot
    could well be the result of approximations to that in the actual
    implementation.

    (This is on a subthread about 10% overshoots, but something produced a
    one entry References line, so it is possible that the threading is broken.)

    > b. Get Sun and the Linux developers to back out the change to adjtime()
    > that broke ntpd.


    Have they violated their own documentation? If not they have no
    reason to do so.

    > c. Provide a custom adjtime() for each platform affected. I suspect


    That's more or less done anyway.

    > that the routine in question runs in kernel mode and may be part of the
    > kernel so that this may be easier said than done!


    From what I understand from what Dave Mills said recently about reasons
    for not using large minimum step values, the problem only happens when
    adjtime is used, i.e. the kernel time discipline is disabled or not used,
    so the ntpd code involved is in user space. (Also, the API for the kernel
    code is not adjtime, but adjtimex, ntp_adjtime, etc.)

  14. Re: ntpd PLL and clock overshoot

    [100% overshoot]

    >I believe that Dave Mills has already explained that the problem is due
    > to changes in the adjtime() routine in both Sun Solaris and Unix.
    >This being the case, the choices would seem to be:
    >a. Live with it.
    >b. Get Sun and the Linux developers to back out the change to adjtime()
    >that broke ntpd.
    >c. Provide a custom adjtime() for each platform affected. I suspect
    >that the routine in question runs in kernel mode and may be part of the
    >kernel so that this may be easier said than done!


    I assume the fix is something simple like replacing a select with
    a simple assignment.

    For Linux, it would help some of us if somebody would track down
    the place that needs fixing and publish a diff. I took a quick
    scan and didn't find it, but my kernel may be before somebody added
    that tweak.



  15. Re: ntpd PLL and clock overshoot

    Hal Murray wrote:
    > [100% overshoot]
    >
    >
    >>I believe that Dave Mills has already explained that the problem is due
    >> to changes in the adjtime() routine in both Sun Solaris and Unix.
    >>This being the case, the choices would seem to be:
    >>a. Live with it.
    >>b. Get Sun and the Linux developers to back out the change to adjtime()
    >>that broke ntpd.
    >>c. Provide a custom adjtime() for each platform affected. I suspect
    >>that the routine in question runs in kernel mode and may be part of the
    >>kernel so that this may be easier said than done!

    >
    >
    > I assume the fix is something simple like replacing a select with
    > a simple assignment.
    >
    > For Linux, it would help some of us if somebody would track down
    > the place that needs fixing and publish a diff. I took a quick
    > scan and didn't find it, but my kernel may be before somebody added
    > that tweak.
    >

    Any specifics on where to look?

    The publicly visible adjtime(x) seems to live in glibc.

    uwe

  16. Re: ntpd PLL and clock overshoot

    A few days ago I reported that I measured a larger overshoot with ntpd
    4.2.0a than with ntpd 4.1.0. Since then I have performed several tests
    with ntpd 4.2.2p3, and it seems to perform significantly better than
    previous versions: overshoot is within spec and it is more accurate than
    the other versions I tried (all tests were performed with a Linux
    kernel, versions 2.4.20, 2.6.10 and 2.6.13).

    Dave, can you tell me what is wrong with Linux with regard to precision
    timekeeping?

    Other people asked where the Linux implementation of the adjtimex()
    system call can be found. Its implementation resides in source file
    kernel/time.c, functions sys_adjtimex() and do_adjtimex(). See e.g.
    http://www.kernelhq.cc/browse.py?css=taichi


    David L. Mills wrote:

    > With all of the machines here, including FreeBSD, Solaris, HP-UX, SunOS,
    > Tru64 and HP-UX, the loop response in steady state is as I reported
    > earlier. The results with Linux are highly suspect, as at least in some
    > cases the timer interrupt frequency has been changed significantly
    > without compensation in the kernel parameters. I have recommended to
    > avoid Linux in any case involving precision timekeeping.



  17. Re: ntpd PLL and clock overshoot

    David L. Mills wrote:

    > With all of the machines here, including FreeBSD, Solaris, HP-UX, SunOS,
    > Tru64 and HP-UX, the loop response in steady state is as I reported
    > earlier. The results with Linux are highly suspect, as at least in some
    > cases the timer interrupt frequency has been changed significantly
    > without compensation in the kernel parameters. I have recommended to
    > avoid Linux in any case involving precision timekeeping.


    Hello Dave,
    there is at least one issue with APIC-routed interrupts on Linux running
    on nVidia nForce 1 and 2 based boards, resulting in a clock that is too
    fast and irregular.

    It seems the timer interrupt is handled _twice_ on occasion.

    I have A7N266-E and A7N8X-E boards that show this problem with various
    kernels in the 2.6.1n range.

    The same boards ran ntpd just fine on Linux 2.4 without APIC routing.

    Symptoms:
    The clock starts to run ahead by ~800-900 ppm, resulting in hard
    corrections of -.5 to -1.5 seconds every couple of hours. Adjusting the
    system tick value results in symmetric corrections of +.5 .. -.5, which
    would indicate an extremely unstable clock.

    This started out for me as a problem with ntpd not syncing
    BUT is now Linux/hardware related, with ntpd being the whistle-blower.

    One of the reasons I started reading this group some weeks ago.

    uwe

  18. Re: ntpd PLL and clock overshoot

    In article <45325369.2060102@gmail.com>,
    Bart Van Assche wrote:

    > Other people asked where the Linux implementation of the adjtimex()
    > system call can be found. Its implementation resides in source file
    > kernel/time.c, functions sys_adjtimex() and do_adjtimex(). See e.g.
    > http://www.kernelhq.cc/browse.py?css=taichi


    No. They asked where tickadj was implemented, and the important part of
    that was originally in sched.c and for 2.4.26 was in timer.c. For 2.4,
    at least, it looks like adjtime is implemented using the one shot mode
    of adjtimex. The relevant code in timer.c is update_wall_time_one_tick().

    This version does not have a non-linear behaviour, but, what might confuse
    ntpd, is that the slew rate is only exactly 500 ppm if the HZ value
    exactly divides 500. In particular, for HZ=1000, it is clamped to one
    microsecond per tick, i.e. 1000ppm. (I don't think the HZ=1000 case is a
    problem, for ntpd, but non-divisors of 500, below 500, may mean that the
    maximum slew rate is less than ntpd assumes, e.g. 500/200 is 2, giving
    a maximum slew rate of 400ppm if HZ is 200.) Note that all of this only
    applies if you disable the kernel discipline or use tinker options that don't
    allow it to be used.
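
    The arithmetic above can be checked with a short sketch. It assumes, as
    the post describes, that the per-tick adjustment is a whole number of
    microseconds, truncated and clamped to at least one. The function name
    and its parameters are made up for illustration and are not taken from
    the kernel source.

```python
def max_slew_ppm(hz, nominal_ppm=500):
    """Effective maximum slew rate in ppm for a given timer frequency.

    Hypothetical helper: models the per-tick adjustment as an integer
    number of microseconds, as described in the post.
    """
    # Whole microseconds per tick (integer division), never less than 1.
    us_per_tick = max(1, nominal_ppm // hz)
    # Microseconds adjusted per second of wall time is the rate in ppm.
    return us_per_tick * hz


for hz in (100, 200, 250, 1000):
    print(hz, max_slew_ppm(hz))
```

    With these assumptions HZ=100 and HZ=250 give 500ppm, HZ=200 gives
    400ppm, and HZ=1000 gives 1000ppm, matching the figures quoted above.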

    It's possible that 2.6 versions do do something more clever. It's also
    possible that the routine is in yet another source file.

  19. Re: ntpd PLL and clock overshoot

    David,

    I beg to seriously differ. To the extent the z transform of the impulse
    response preserves the poles and zeros of the polynomial
    representation, the digital NTP loop behaves as an analog loop, at least
    in the linear regime of operation.

    The initial frequency estimate measures only the frequency offset; the
    phase offset is left to fend for itself, which could indeed result in an
    initial offset exceeding the "overshoot" target. The purpose of the
    state machine in the first place is to allow large initial poll
    intervals, as with telephone modem services.

    Look more carefully at the clock state machine. If the initial offset is
    greater than the step threshold, the clock is set, which results in an
    initial offset of zero. After the stepout interval the frequency is
    measured and corrected, but the phase at that time could be quite large,
    even greater than the step threshold, in which case the clock is set
    again, resulting in no initial offset at all. On the other hand, if the
    resulting offset when the frequency is measured is less than the step
    threshold, ordinary linear mode results. This is NOT overshoot, just an
    initial phase error.

    If the initial offset is less than the step threshold, the frequency
    measurement is made, but the phase is disciplined normally during the
    stepout interval. Again, this could result in a phase offset after the
    stepout interval, but this is NOT overshoot.
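
    A minimal sketch of the startup behaviour described in the last two
    paragraphs, for readers who find it easier to follow as code. This is
    not the ntpd implementation: the function and event names are invented,
    and the default values (128 ms step threshold, 900 s stepout interval)
    are assumptions that do not come from this thread.

```python
def startup_events(initial_offset, freq_error_ppm,
                   stepout=900.0, step_threshold=0.128):
    """Return (events, residual_phase) for a simplified startup model.

    Offsets are in seconds. residual_phase is None where this sketch
    does not model it.
    """
    events = []
    if abs(initial_offset) > step_threshold:
        events.append("step")  # clock is set, offset becomes zero
        # The uncorrected frequency error runs for the stepout interval.
        phase = freq_error_ppm * 1e-6 * stepout
        events.append("measure_frequency")
        if abs(phase) > step_threshold:
            events.append("step")  # set again, no initial offset at all
            phase = 0.0
    else:
        # Phase is disciplined normally while the frequency is measured;
        # the leftover phase is not modelled here.
        events.append("discipline_phase")
        events.append("measure_frequency")
        phase = None
    events.append("linear_mode")
    return events, phase


print(startup_events(1.0, 100))
print(startup_events(1.0, 200))
print(startup_events(0.05, 100))
```

    In the first call the phase left over when linear mode begins is about
    90 ms. In this model that is an initial phase error carried into the
    loop, not something the loop itself produced.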

    Dave

    David Woolley wrote:
    > In article ,
    > user@domain.invalid (probably David Mills with an IT department that is
    > overzealous about preventing spam) wrote:
    >
    >
    >>The modern NTP feedback loop is much more intricate than you report. It
    >>is represented as a hybrid phase/frequency feedback loop with a

    >
    >
    > There may be various finesses, but it is still the essentially analogue
    > nature of the process that causes people to complain about overshoots
    > and runaway frequency excursions.
    >
    >
    >>state-machine driven initial frequency measurement. Details are in the

    >
    >
    > As I understand it, the initial frequency measurement is only applied
    > when cold started (no ntp.drift). Moreover, the perceived problem being
    > reported here is about the initial phase correction. It is normal
    > to have to make phase corrections many times the mean phase error
    > on a restart, even though it isn't normal to have to do a significant
    > frequency correction.
    >
    >
    >>There are lots of nasty little approximations in the PLL/FLL code due to
    >>imprecise measurement of some time intervals. While the design target for
    >>overshoot is 5-6 percent, I would not be surprised if in some cases it
    >>is 10 percent.

    >
    >
    > I think the problem here is that a human trying to manually control the
    > effective frequency might have overshot by only 0.1%. They would have
    > slewed the phase in at the maximum acceptable rate and then made a
    > step change in frequency at the moment they crossed a measured phase
    > error of zero, stepping by minus the average rate of phase change
    > during the slew in. Only then would they start operating anything
    > like the current algorithm.
    >
    > What they are seeing is 10% of the original error after about an hour,
    > when they know that they could have achieved 0.1% in under 10 minutes,
    > assuming a 500ppm slew rate limit. (They'd probably need some automation
    > to time the transition accurately enough to get to 100 microseconds, as
    > assumed here.)
    >
    > The best way of implementing this is probably to provide the system with
    > memory about the likely phase measurement noise, but a simpler approach
    > of detecting the first zero crossing would probably work quite well.
    >
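
    The figures in the quoted text can be checked with a little arithmetic.
    The sketch below assumes a constant slew at the rate limit and infers an
    original error of 100 ms from the quoted "0.1%" and "100 microseconds";
    that inference and the function name are assumptions, not something
    stated in the thread.

```python
def slew_time_seconds(initial_offset_s, slew_ppm=500):
    """Time to remove a phase offset by slewing at a constant rate."""
    return abs(initial_offset_s) / (slew_ppm * 1e-6)


residual = 100e-6             # 100 microseconds, from the quoted post
initial = residual / 0.001    # 0.1% of the original error -> about 0.1 s
print(initial, slew_time_seconds(initial))
```

    Under these assumptions the slew takes about 200 seconds, which is
    consistent with the "under 10 minutes" claim; at 500ppm anything up to
    about 0.3 s can be slewed in within 10 minutes.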


