refclock use causes core dump of ntpd - NTP

This is a discussion on refclock use causes core dump of ntpd - NTP ; Hello all I have just built a new FreeBSD 6.2 box with ntp=dev-4.2.5p8 to test some refclocks I have. One of the clocks is a TrueTime TL-3, a semi-consumer WWV based clock. Its data output format matches some/most of the ...

+ Reply to Thread
Page 1 of 2 1 2 LastLast
Results 1 to 20 of 30

Thread: refclock use causes core dump of ntpd

  1. refclock use causes core dump of ntpd

    Hello all

    I have just built a new FreeBSD 6.2 box with ntp=dev-4.2.5p8 to test
    some refclocks I have.

    One of the clocks is a TrueTime TL-3, a semi-consumer WWV based
    clock. Its data output format matches
    some/most of the other TT clocks that use the refclock_true driver.
    (driver 5). 99 times out of 100, ntpd
    dumps core when the serial line is attached. I did get it to stay up
    exactly once, and the driver worked
    like a charm, but the other 99+ times its started it coredumps.
    Here's the best I've got at this point:

    >From /var/log/messages:

    Feb 19 19:29:58 lunar kernel: pid 27288 (ntpd), uid 0: exited on
    signal 6 (core dumped)

    Using gdb (which I know next to nothing about, btw) I did get this:
    Program received signal SIGABRT, Aborted.
    0x282aeecb in kill () from /lib/libc.so.6
    (gdb)

    So, does anyone have any suggestions on where to go now?
    I am going to try one or all of the following:
    1) Try an older FBSD release like 5.4 or even 4.8
    2) Try an older released version of ntpd.
    3) See if running in DEBUG has any other hints.

    Any suggestions appreciated,

    Roger, Harvey Mudd College


  2. Re: refclock use causes core dump of ntpd

    Roger,

    I assume you are using driver 36?

    Please fire up gdb again, and when it drops core type "where" to show the
    complete stack backtrace.

    It will probably take more work, but that will be a good start.

    H

  3. Re: refclock use causes core dump of ntpd

    On Feb 19, 10:40 pm, Harlan Stenn wrote:
    > Roger,
    >
    > I assume you are using driver 36?
    >
    > Please fire up gdb again, and when it drops core type "where" to show the
    > complete stack backtrace.
    >
    > It will probably take more work, but that will be a good start.
    >
    > H



    Harlan,

    No, this is driver 5, the TrueTime serial port driver.

    Here is the current billboard, without the receiver serial line
    connected.

    lunar# ntpq -p
    remote refid st t when poll reach delay
    offset jitter
    ================================================== ============================
    TRUETIME(0) .WWV. 0 l - 64 0 0.000
    0.000 0.002
    *SPECTRACOM(0) .WWVB. 0 l 23 64 377 0.000
    -0.552 1.793
    +Remote1 .GPS. 1 u 285 512 377 24.693
    4.798 2.152
    -Remote2 .WWV. 1 u 331 512 377 23.261
    5.017 0.894
    +Remote3 .GPS. 1 u 328 512 377 23.546
    4.760 1.068

    As you can see, it seems to be quite happy with just the NetClock/2
    connected. (driver 4)
    If I were to connect the serial line from the TT-3 TrueTime, it will
    coredump.
    The TrueTime refid (WWV) is from a config fudge line.

    I'll attach gdb to the process and see what a 'where' produces after
    the dump.

    Roger



  4. Re: refclock use causes core dump of ntpd

    On Feb 19, 10:40 pm, Harlan Stenn wrote:
    >
    > Please fire up gdb again, and when it drops core type "where" to show the
    > complete stack backtrace.
    >
    > It will probably take more work, but that will be a good start.
    >
    > H


    OK, here is a start:

    Program received signal SIGABRT, Aborted.
    0x282aeecb in kill () from /lib/libc.so.6
    (gdb) where
    #0 0x282aeecb in kill () from /lib/libc.so.6
    #1 0x282aee68 in raise () from /lib/libc.so.6
    #2 0x282adb78 in abort () from /lib/libc.so.6
    #3 0x0807c87f in true_doevent (peer=0x80ad2b0, event=e_Poll) at
    refclock_true.c:788
    #4 0x0807cc53 in true_receive (rbufp=0x71ff) at refclock_true.c:540
    #5 0x08053046 in ntpdmain (argc=0, argv=0xbfbfeb50) at ntpd.c:1104
    #6 0x080532ff in main (argc=1, argv=0xbfbfec7c) at ntpd.c:317
    (gdb) exit

    I'll be taking a look at refclock_true in the indicated area, but
    I don't pretend to be much of a programmer.

    Roger


  5. Re: refclock use causes core dump of ntpd

    On Feb 20, 7:46 am, "wa6zvp" wrote:
    > On Feb 19, 10:40 pm, Harlan Stenn wrote:
    >
    >
    >
    > > Please fire up gdb again, and when it drops core type "where" to show the
    > > complete stack backtrace.

    >
    > > It will probably take more work, but that will be a good start.

    >
    > > H

    >
    > OK, here is a start:
    >
    > Program received signal SIGABRT, Aborted.
    > 0x282aeecb in kill () from /lib/libc.so.6
    > (gdb) where
    > #0 0x282aeecb in kill () from /lib/libc.so.6
    > #1 0x282aee68 in raise () from /lib/libc.so.6
    > #2 0x282adb78 in abort () from /lib/libc.so.6
    > #3 0x0807c87f in true_doevent (peer=0x80ad2b0, event=e_Poll) at
    > refclock_true.c:788
    > #4 0x0807cc53 in true_receive (rbufp=0x71ff) at refclock_true.c:540
    > #5 0x08053046 in ntpdmain (argc=0, argv=0xbfbfeb50) at ntpd.c:1104
    > #6 0x080532ff in main (argc=1, argv=0xbfbfec7c) at ntpd.c:317
    > (gdb) exit
    >
    > I'll be taking a look at refclock_true in the indicated area, but
    > I don't pretend to be much of a programmer.
    >
    > Roger


    >From the looks of it, refclock_true got into a state where it

    thought that it should know what type of clock it was dealing
    with, (up->type) being valid, yet it didn't. The switch on this
    variable ended up falling through to the default, which is to
    abort. (crash and burn).

    While I realize that this particular clock (a TT-3) isn't advertised
    as being supported, I don't think the driver should get into this
    state.

    I don't think putting any random serial data on the input should
    cause an abort.

    I'll keep on trying. Any hints on getting the driver
    to log debug data, or where I should go next would be appreciated.

    Roger


  6. Re: refclock use causes core dump of ntpd

    Roger,

    > Program received signal SIGABRT, Aborted.
    > 0x282aeecb in kill () from /lib/libc.so.6
    > (gdb) where
    > #0 0x282aeecb in kill () from /lib/libc.so.6
    > #1 0x282aee68 in raise () from /lib/libc.so.6
    > #2 0x282adb78 in abort () from /lib/libc.so.6
    > #3 0x0807c87f in true_doevent (peer=0x80ad2b0, event=e_Poll) at refclock_true.c:788
    > #4 0x0807cc53 in true_receive (rbufp=0x71ff) at refclock_true.c:540
    > #5 0x08053046 in ntpdmain (argc=0, argv=0xbfbfeb50) at ntpd.c:1104
    > #6 0x080532ff in main (argc=1, argv=0xbfbfec7c) at ntpd.c:317
    > (gdb) exit


    Well, the good news is the problem is now obvious.

    We know it is from the call to abort() at line 788 of refclock_true.c.

    Please open an issue a http://bugs.ntp.isc.org with this information.

    The bad news is that refclock_true.c does not have a maintainer - anybody
    want to volunteer?

    H

  7. Re: refclock use causes core dump of ntpd

    Roger,

    By my read of the code if you set flag2 it will generate a debug file in
    /tmp/trueX (where X is the unit number you chose).

    H

  8. Re: refclock use causes core dump of ntpd

    On Feb 20, 8:35 pm, Harlan Stenn wrote:
    > Roger,
    >
    > By my read of the code if you set flag2 it will generate a debug file in
    > /tmp/trueX (where X is the unit number you chose).
    >
    > H


    * Yep, theoretically. I have that flag set and the file gets created,
    but
    nothing ever gets written to it. That would be the next bug.

    Roger


  9. Re: refclock use causes core dump of ntpd

    On Feb 20, 11:02 am, Harlan Stenn wrote:

    > Well, the good news is the problem is now obvious.
    >
    > We know it is from the call to abort() at line 788 of refclock_true.c.


    Yep. Unfortunately, the code should never get there. Yea, right.

    >
    > Please open an issue ahttp://bugs.ntp.isc.orgwith this information.
    >
    > The bad news is that refclock_true.c does not have a maintainer - anybody
    > want to volunteer?
    >
    > H


    Its probably necessary to own one of the beasts that this code was
    designed to handle in order to do anything with it. The likelihood
    that many folks with programming skills have one any more is slim, so
    its unlikely you will find a volunteer.

    Please don't mark me as "maintainer", but I will spend some time on
    solving this.
    It appears that the state machine gets severely confused.
    I have confirmed that the first string returned to the driver is
    actually
    a valid truetime timecode which certainly shouldn't cause a call to
    abort.

    The abort call is the default action at the end of a switch/case block
    where
    the value of the switch must contain something not recognized. Now to
    find it.

    I suspect I'll be putting lots of debug print statements in the code.
    Wish me luck.

    Roger




  10. Re: refclock use causes core dump of ntpd

    On Feb 20, 10:51 pm, "wa6zvp" wrote:
    > On Feb 20, 11:02 am, Harlan Stenn wrote:
    >
    > > Well, the good news is the problem is now obvious.

    >
    > > We know it is from the call to abort() at line 788 of refclock_true.c.

    >
    > Yep. Unfortunately, the code should never get there. Yea, right.
    >


    OK, I got warm and cuddly with gdb, at least enough to set some
    breakpoints
    and look at variables.

    The main culprit looks like line 540 (in refclock_true). This is in
    the
    received data function. It calls true_doevent with a parameter of
    e_Poll.
    Event e_Poll is never handled anywhere in doevent, so is very state
    dependant.

    Even replacing line 788, the original abort call location with a
    break;,
    the program would abort at other unhandled places in doevent.

    I suppose its possible that all the 'intended' refclocks for this
    driver
    behaved in a way that this issue never cropped up. But this is the
    logical
    driver for the clock I have, since it has the exact timestamp parsing
    needed.

    Knowing I'm probably the only person in the world trying to attach
    this
    clock to NTP, there is not much reward in spending hours and hours to
    incorporate the clock specific things necessary into the code. So I
    took a different route. I added 'break' statements before each
    'abort'
    statement in this driver. It now performs quite nicely, thank you.

    Here is the end result:

    lunar# ntpq -p
    remote refid st t when poll reach delay
    offset jitter
    ================================================== ============================
    *TRUETIME(0) .WWV. 0 l 34 64 377 0.000
    2.586 1.311
    +SPECTRACOM(0) .WWVB. 0 l 36 64 377 0.000
    -19.930 8.259
    -Remote 1 .GPS. 1 u 94 256 377 24.973
    2.271 1.014
    +Remote 2 .WWV. 1 u 98 256 377 23.715
    1.790 2.170
    -Remote 3 .GPS. 1 u 35 512 377 24.024
    2.364 2.465

    Note that the WWVB rx is not performing well at this moment. This is
    the worst
    time of the day (first few hours after sunset) for 60Khz propagation
    to my QTH.
    It will settle right in with the others in another hour or two.

    Thanks Harlan for your suggestions.

    Roger


  11. Re: refclock use causes core dump of ntpd

    Roger,

    The downside of your approach is that you'll have to keep making those
    changes with every upgrade.

    If you are going to use a break statement, would it be worth printing out
    the "offending" values in each switch statement? Tracking this information
    (especially if the data format for your model is documented somewhere) would
    make it much easier for somebody to upgrade the driver to support your
    model.

    And I again invite folks with Truetime clocks to add themselves to

    http://ntp.isc.org/Support/ConfiguringTrueTimeRefclocks

    http://ntp.isc.org/Support/TrueTimeRefclockUsers

    H

  12. Re: refclock use causes core dump of ntpd

    Harlan Stenn wrote:

    > The downside of your approach is that you'll have to keep making those
    > changes with every upgrade.


    Maybe you should just incorporate his changes and hope for the best :-/

    Is it appropriate anyway for a refclock driver to terminate ntpd if it
    isn't happy? From a quick look, only refclock_true and refclock_jupiter
    call abort(), and only refclock_oncore and refclock_mx4200 call exit().
    All the others presumably mark the refclock "bad" in some way and allow
    things to carry on, ignoring it.

    --
    Ronan Flood

  13. Re: refclock use causes core dump of ntpd

    Ronan Flood wrote:
    > Harlan Stenn wrote:
    >
    >
    >>The downside of your approach is that you'll have to keep making those
    >>changes with every upgrade.

    >
    >
    > Maybe you should just incorporate his changes and hope for the best :-/
    >
    > Is it appropriate anyway for a refclock driver to terminate ntpd if it
    > isn't happy? From a quick look, only refclock_true and refclock_jupiter
    > call abort(), and only refclock_oncore and refclock_mx4200 call exit().
    > All the others presumably mark the refclock "bad" in some way and allow
    > things to carry on, ignoring it.
    >


    Maybe there should be a specification, perhaps in the new RFC, that says
    what to do if a refclock failure is detected.

    Marking the refclock "insane" comes to mind as a reasonable response.


  14. Re: refclock use causes core dump of ntpd

    Richard B. gilbert wrote:

    > Maybe there should be a specification, perhaps in the new RFC, that says
    > what to do if a refclock failure is detected.
    >
    > Marking the refclock "insane" comes to mind as a reasonable response.
    >

    Hi,

    enumerate configured sources, mark them OK or gone
    be verbose about it in the logfile
    die if no source is found?
    Or could i even get usefull info
    by pokeing with ntpq or such?

    I really like 'make -k' like things
    find all oopses in one go if possible.

    uwe

  15. Re: refclock use causes core dump of ntpd

    "Richard B. gilbert" wrote in message
    news:45DD97C3.6040800@comcast.net...
    [...]
    > Maybe there should be a specification, perhaps in the new RFC, that
    > says what to do if a refclock failure is detected.
    >
    > Marking the refclock "insane" comes to mind as a reasonable response.


    Congratulations, Richard. I think that's a first. Not a single
    exclamation point anywhere!

    Groetjes,
    Maarten Wiltink



  16. Re: refclock use causes core dump of ntpd

    Maarten Wiltink wrote:
    > "Richard B. gilbert" wrote in message
    > news:45DD97C3.6040800@comcast.net...
    > [...]
    >
    >>Maybe there should be a specification, perhaps in the new RFC, that
    >>says what to do if a refclock failure is detected.
    >>
    >>Marking the refclock "insane" comes to mind as a reasonable response.

    >
    >
    > Congratulations, Richard. I think that's a first. Not a single
    > exclamation point anywhere!
    >
    > Groetjes,
    > Maarten Wiltink
    >
    >


    When did I ever use a SINGLE exclamation point? ;-)


  17. Re: refclock use causes core dump of ntpd

    "wa6zvp" wrote:

    > > > We know it is from the call to abort() at line 788 of refclock_true.c.

    > >
    > > Yep. Unfortunately, the code should never get there. Yea, right.

    >
    > OK, I got warm and cuddly with gdb, at least enough to set some
    > breakpoints
    > and look at variables.
    >
    > The main culprit looks like line 540 (in refclock_true). This is in
    > the
    > received data function. It calls true_doevent with a parameter of
    > e_Poll.
    > Event e_Poll is never handled anywhere in doevent, so is very state
    > dependant.
    >
    > Even replacing line 788, the original abort call location with a
    > break;,
    > the program would abort at other unhandled places in doevent.


    That's more understandable, but looking at the code I don't see how it
    got to line 788, since that's the default on a switch(up->type) which
    should only ever be one of t_unknown, t_goes, t_omega, t_tm, or t_tcu
    as they are the only values ever assigned to it, and they all have
    matching cases in the switch. What was the value of up->type when
    it got to line 788? And up->state?

    What I'd expect is that the state machine starts with t_unknown and
    s_Base then sees e_Init, from true_start() lines 290-292, which takes
    it into ss_InqGOES. If it then gets e_Poll from true_receive(), it
    would abort at line 726. Various other scenarios I have not looked
    at exhaustively, but getting to line 788 is puzzling ...

    --
    Ronan Flood

  18. Re: refclock use causes core dump of ntpd

    On Feb 22, 7:04 am, Ronan Flood wrote:
    > "wa6zvp" wrote:
    > > > > We know it is from the call to abort() at line 788 of refclock_true.c.

    >
    > > > Yep. Unfortunately, the code should never get there. Yea, right.

    >
    > > OK, I got warm and cuddly with gdb, at least enough to set some
    > > breakpoints
    > > and look at variables.

    >
    > > The main culprit looks like line 540 (in refclock_true). This is in
    > > the
    > > received data function. It calls true_doevent with a parameter of
    > > e_Poll.
    > > Event e_Poll is never handled anywhere in doevent, so is very state
    > > dependant.

    >
    > > Even replacing line 788, the original abort call location with a
    > > break;,
    > > the program would abort at other unhandled places in doevent.

    >
    > That's more understandable, but looking at the code I don't see how it
    > got to line 788, since that's the default on a switch(up->type) which
    > should only ever be one of t_unknown, t_goes, t_omega, t_tm, or t_tcu
    > as they are the only values ever assigned to it, and they all have
    > matching cases in the switch. What was the value of up->type when
    > it got to line 788? And up->state?


    * My recollection is that up->type actually had t_unknown in it,
    making
    it even more puzzling. Don't remember state.

    > What I'd expect is that the state machine starts with t_unknown and
    > s_Base then sees e_Init, from true_start() lines 290-292, which takes
    > it into ss_InqGOES. If it then gets e_Poll from true_receive(), it
    > would abort at line 726. Various other scenarios I have not looked
    > at exhaustively, but getting to line 788 is puzzling ...


    * It certainly is. I'll fiddle with more gdb tonight, maybe doing
    some
    instruction tracing from true_recieve.

    I can't do much from work, since I can't disconnect the serial data
    line.
    If I start ntpd with gdb, it just says 'normal completion', meanwhile
    the forked process crashes. Is there a way to get gdb to follow into
    the forked process? If not, I have to get it running without the data
    and
    attach to the running process. This will have to wait till tonight.

    My feeling is that a refclock driver should _never_ cause ntpd to die.
    I think it should just do verbose debugging and continue on as best it
    can.
    The fact that it never gets into a reached status would be a clue that
    its not working right. In this case, however, continuing makes it
    work.
    This happens because the serial data is actually parsed correctly.

    More later.

    Roger




  19. Re: refclock use causes core dump of ntpd

    "wa6zvp" wrote:

    > > matching cases in the switch. What was the value of up->type when
    > > it got to line 788? And up->state?

    >
    > * My recollection is that up->type actually had t_unknown in it,
    > making it even more puzzling. Don't remember state.


    Hmm, I'm wondering if the fact that there's no "break" after the
    call to msyslog at line 784 could be significant? Surely not.

    > If I start ntpd with gdb, it just says 'normal completion', meanwhile
    > the forked process crashes. Is there a way to get gdb to follow into
    > the forked process? If not, I have to get it running without the data
    > and attach to the running process. This will have to wait till tonight.


    The -n option to ntpd will keep it in the foreground, not sure how
    you pass that to it from gdb, um

    gdb ntpd
    run -n

    maybe?

    --
    Ronan Flood

  20. Re: refclock use causes core dump of ntpd

    On Feb 22, 10:05 am, Ronan Flood wrote:
    > "wa6zvp" wrote:
    > > > matching cases in the switch. What was the value of up->type when
    > > > it got to line 788? And up->state?

    >
    > > * My recollection is that up->type actually had t_unknown in it,
    > > making it even more puzzling. Don't remember state.


    ** Yes. I remembered correctly. up-type = t_unknown, up->state =
    s_Base, and event = e_Poll.
    Very bizarre. See below.

    > Hmm, I'm wondering if the fact that there's no "break" after the
    > call to msyslog at line 784 could be significant? Surely not.


    * Don't think so.

    > > If I start ntpd with gdb, it just says 'normal completion', meanwhile
    > > the forked process crashes. Is there a way to get gdb to follow into
    > > the forked process? If not, I have to get it running without the data
    > > and attach to the running process. This will have to wait till tonight.

    >
    > The -n option to ntpd will keep it in the foreground, not sure how
    > you pass that to it from gdb, um
    >
    > gdb ntpd
    > run -n
    >
    > maybe?


    **
    gdb ntpd
    (gdb) set args -n
    run

    * Despite the fact that doevent is entered with valid params for type
    and state,
    that is, t_unknown and t_Base, it still gets to the abort statement at
    the end
    of the function whenever its called with event = e_Poll.

    Nowhere in this function is this event handled or used. Not even
    mentioned.

    So, I backed out my hack from last night, and now have
    simply commented out the one and only line that calls
    doevent with event=e_Poll. Line 540 in function true_receive.

    Thanks, Ronan, for giving me the encouragement to at least learn
    a little bit about gdb.

    Maybe if gdb can do a conditional break, and stop at the top of
    doevent
    if event=e_Poll. Then I could try stepping it and see what happens.
    Or not.

    Roger


+ Reply to Thread
Page 1 of 2 1 2 LastLast