move_fd() causing bad behavior on AIX5.3 - NTP

  1. move_fd() causing bad behavior on AIX5.3

    ntp-4.2.4p4 / AIX 5.3

    This was annoying to chase down because I guess it's also screwing up
    the fds used to log errors. The symptom is that ntpd exits silently
    due to a bind() error, but only when daemonized. Running under some
    debugging tools, such as Aprobe, changes the behavior (no failure),
    possibly because of additional fd usage.

    This sounds like bug #604
    (https://support.ntp.org/bugs/show_bug.cgi?id=614). Was that ever
    confirmed fixed for AIX5?

    I haven't looked into exactly how move_fd() is screwing up, but ntpd
    is happy on my test box with the call to it removed entirely.

  2. Re: move_fd() causing bad behavior on AIX5.3

    >>> brandon.phillips@lmco.com writes:

    brandon> ntp-4.2.4p4 / AIX 5.3 This was annoying to chase down because I
    brandon> guess it's also screwing up the fds used to log errors. The
    brandon> symptom is that ntpd exits silently due to a bind() error, but only
    brandon> when daemonized. Running under some debugging tools, such as
    brandon> Aprobe, changes the behavior (no failure), possibly because of
    brandon> additional fd usage.

    brandon> This sounds like bug #604
    brandon> (https://support.ntp.org/bugs/show_bug.cgi?id=614). Was that
    brandon> ever confirmed fixed for AIX5?

    Did you mean bug 604 or 614? Regardless, both are marked FIXED, and 604 has
    been VERIFIED.

    Another reason why bugs sometimes do not appear under a debugger is that
    the debugger initializes variables.

    brandon> I haven't looked into exactly how move_fd() is screwing up, but
    brandon> ntpd is happy on my test box with the call to it removed entirely.

    We don't do much testing under AIX because we don't have easy access to a
    box.

    An additional approach would be to use some of the assertion stuff in
    include/ntp_assert.h and see if you can get something to violate an
    assertion.
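
    As a rough illustration of that approach, here is a minimal sketch that
    uses the standard assert() rather than the macros in
    include/ntp_assert.h (which are not reproduced here). The helper names
    fd_is_open() and require_open_fd() are hypothetical and do not come from
    the ntpd source.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if fd refers to an open descriptor, 0 otherwise. */
static int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1;
}

/*
 * Hypothetical checkpoint: call this before bind() or before writing a
 * log message, so that a closed or clobbered descriptor aborts right
 * away instead of failing silently later.
 */
static void require_open_fd(int fd)
{
    assert(fd >= 0);
    assert(fd_is_open(fd));
}
```

    Placing a check like this immediately before and after the move_fd()
    call might show whether the descriptor is already bad going in or only
    coming out.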

    And if you can shed any light on bugs 135, 309, 598, or 716, that would be
    swell, too.

    H

  3. Re: move_fd() causing bad behavior on AIX5.3

    Harlan Stenn wrote:
    > Did you mean bug 604 or 614? Regardless, both are marked FIXED, and 604 has
    > been VERIFIED.


    Oops, 614 is what I meant. I did see that it is closed, but the
    comments on it left me with some doubt about whether it had been
    verified on AIX5 specifically.

    > We don't do much testing under AIX because we don't have easy access to a
    > box.
    >
    > An additional approach would be to use some of the assertion stuff in
    > include/ntp_assert.h and see if you can get something to violate an
    > assertion.


    I will take a look at that. We were interested in NTPv4 for the
    ability to really always slew (tinker step 0) since our software is
    allergic to time stepping. We may abandon the idea though, due to
    lack of confidence in the maturity of NTPv4 on AIX5.

    > And if you can shed any light on bugs 135, 309, 598, or 716, that would be
    > swell, too.


    135, 716: still exist. I had to explicitly compile away IPv6 support
    to resolve the issue.

    309: I don't think our setup would hit this so can't comment.

    598: This is interesting since it also complains about the xntpd IBM
    ships (which we currently use as well). We have had some issues with
    the clocks jumping back to 0:00...1970 after reboots; I believe we
    finally convinced IBM there was a problem. The other issues discussed
    in this bug I am not sure about; we'll have to investigate and see if
    they are contributing to time-related peculiarities.


  4. Re: move_fd() causing bad behavior on AIX5.3

    brandon.phillips@lmco.com wrote:
    > Harlan Stenn wrote:
    >> Did you mean bug 604 or 614? Regardless, both are marked FIXED, and 604 has
    >> been VERIFIED.

    >
    > Oops, 614 is what I meant. I did see that it is closed, but the
    > comments on it left me with some doubt about whether it had been
    > verified on AIX5 specifically.
    >
    >> We don't do much testing under AIX because we don't have easy access to a
    >> box.
    >>


    One thing that we could do for you is to make the move_fd() code do
    nothing by creating an #ifdef NO_MOVE_FD option so that AIX can specify
    to ignore it.
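
    A minimal sketch of what such an option could look like, assuming the
    helper works by duplicating low-numbered descriptors to a higher range
    with fcntl(F_DUPFD). This is not the ntpd code: move_fd_sketch() and
    the SOCKET_FLOOR value are made up for illustration, and only the
    NO_MOVE_FD name comes from the suggestion above.

```c
#include <fcntl.h>
#include <unistd.h>

/* Assumed value for illustration; not taken from the ntpd source. */
#define SOCKET_FLOOR 64

/*
 * Sketch of a move_fd()-style helper. With NO_MOVE_FD defined it returns
 * the descriptor untouched; otherwise it tries to duplicate a low-numbered
 * descriptor to SOCKET_FLOOR or above and closes the original.
 */
static int move_fd_sketch(int fd)
{
#if defined(NO_MOVE_FD) || !defined(F_DUPFD)
    return fd;                      /* platform opted out: do nothing */
#else
    int newfd;

    if (fd < 0 || fd >= SOCKET_FLOOR)
        return fd;                  /* invalid or already high enough */

    newfd = fcntl(fd, F_DUPFD, SOCKET_FLOOR);
    if (newfd == -1)
        return fd;                  /* could not move: keep the original */

    close(fd);
    return newfd;
#endif
}
```

    Building with -DNO_MOVE_FD would then turn the call into a no-op
    without touching any of the call sites.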

    >> An additional approach would be to use some of the assertion stuff in
    >> include/ntp_assert.h and see if you can get something to violate an
    >> assertion.

    >
    > I will take a look at that. We were interested in NTPv4 for the
    > ability to really always slew (tinker step 0) since our software is
    > allergic to time stepping. We may abandon the idea though, due to
    > lack of confidence in the maturity of NTPv4 on AIX5.
    >


    It is mature, but the idiosyncrasies of the AIX implementation of the IP
    stacks are causing unnecessary problems.

    >> And if you can shed any light on bugs 135, 309, 598, or 716, that would be
    >> swell, too.

    >
    > 135, 716: still exist. I had to explicitly compile away IPv6 support
    > to resolve the issue.
    >


    I think these are basically the same thing. The real problem here is
    that the IPv6 wildcard socket is trying to grab the IPv4 wildcard too,
    and since that is already bound, the bind fails. This is a bug in the
    O/S since it shouldn't be doing that.
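
    On stacks that implement the IPV6_V6ONLY socket option, the usual way
    to keep the two wildcards apart is to set that option before binding
    the IPv6 socket. The following is a sketch only: whether AIX 5.3
    honors the option, and whether it would cure bugs 135 and 716, is an
    assumption that has not been tested here. The function names are
    hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Restrict an AF_INET6 socket to IPv6 traffic only, so that binding it
 * to the IPv6 wildcard (::) does not also claim the IPv4 wildcard.
 * Returns 0 on success, -1 on failure or if the option is unavailable.
 */
static int set_v6only(int fd)
{
#ifdef IPV6_V6ONLY
    int on = 1;
    return setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &on, sizeof(on));
#else
    (void)fd;
    return -1;
#endif
}

/*
 * Open a UDP socket bound to the IPv6 wildcard on the given port
 * (host byte order). Returns the descriptor, or -1 on any failure.
 */
static int open_v6_wildcard(unsigned short port)
{
    struct sockaddr_in6 sin6;
    int fd = socket(AF_INET6, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    if (set_v6only(fd) != 0) {
        close(fd);
        return -1;
    }

    memset(&sin6, 0, sizeof(sin6));
    sin6.sin6_family = AF_INET6;
    sin6.sin6_addr = in6addr_any;   /* the IPv6 wildcard, "::" */
    sin6.sin6_port = htons(port);

    if (bind(fd, (struct sockaddr *)&sin6, sizeof(sin6)) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

    Passing the NTP port (123) needs root; passing 0 lets the kernel pick
    a port, which is enough to try the option out on a test box.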

    > 309: I don't think our setup would hit this so can't comment.
    >


    This looks almost identical to #965.

    > 598: This is interesting since it also complains about the xntpd IBM
    > ships (which we currently use as well). We have had some issues with
    > the clocks jumping back to 0:00...1970 after reboots; I believe we
    > finally convinced IBM there was a problem. The other issues discussed
    > in this bug I am not sure about; we'll have to investigate and see if
    > they are contributing to time-related peculiarities.


    This is probably an O/S issue where NTP is not able to retrieve or set
    the clock properly.

    Danny

  6. Time server monitor by Meinberg - Vista

    Any target date for a Vista compatible version?
    --tony

  7. Re: Time server monitor by Meinberg - Vista

    Tony Rutkowski wrote:
    > Any target date for a Vista compatible version?
    > --tony


    The current version works now. No need for a special Vista version. Look
    for Dave Taylor's recent posts. Was there some reason to believe that it
    didn't work?

    Danny

  8. Re: Time server monitor by Meinberg - Vista

    Hi Danny,

    The Meinberg NTP version works fine. It's the *monitor 1.0* that
    seems unable to find the NTP install on Vista. I think Dave's posts
    just addressed the NTP install.
    --tony

    >Tony Rutkowski wrote:
    > > Any target date for a Vista compatible version?
    > > --tony

    >
    >The current version works now. No need for a special Vista version. Look
    >for Dave Taylor's recent posts. Was there some reason to believe that it
    >didn't work?
    >
    >Danny


  9. Re: Time server monitor by Meinberg - Vista

    Tony Rutkowski wrote:
    > Hi Danny,
    >
    > The Meinberg NTP version works fine. It's the *monitor 1.0* that
    > seems unable to find the NTP install on Vista. I think Dave's posts
    > just addressed the NTP install.
    > --tony


    Correct - I haven't tried the monitor as yet. Using the monitor (0.9n or
    1.0) on XP it can talk to a remote Vista NTP server correctly.

    On Vista, 1.0 cannot find the local ntp server, but can talk to remote
    servers. It can even get status from the local server (but not service
    details) if you define the local server as an External server on the
    Configuration page. BTW: the first time I needed to edit the registry to
    get the remote nodes in!

    Cheers,
    David



  10. Re: Time server monitor by Meinberg - Vista

    David J Taylor wrote:
    > Tony Rutkowski wrote:
    >> Hi Danny,
    >>
    >> The Meinberg NTP version works fine. It's the *monitor 1.0* that
    >> seems unable to find the NTP install on Vista. I think Dave's posts
    >> just addressed the NTP install.
    >> --tony

    >
    > Correct - I haven't tried the monitor as yet. Using the monitor (0.9n or
    > 1.0) on XP it can talk to a remote Vista NTP server correctly.
    >
    > On Vista, 1.0 cannot find the local ntp server, but can talk to remote
    > servers. It can even get status from the local server (but not service
    > details) if you define the local server as an External server on the
    > Configuration page. BTW: the first time I needed to edit the registry to
    > get the remote nodes in!


    It seems that the routine which detects whether the NTP service is installed
    may require modification to work on Vista.

    My colleague who maintains the time server monitor will have to check this.

    Martin
    --
    Martin Burnicki

    Meinberg Funkuhren
    Bad Pyrmont
    Germany

  11. Re: Time server monitor by Meinberg - Vista

    Martin Burnicki wrote:
    []
    > It seems that the routine which detects whether the NTP service is
    > installed may require modification to work on Vista.
    >
    > My colleague who maintains the time server monitor will have to check
    > this.
    >
    > Martin


    Thanks, Martin. A lot of things in Vista now require more privilege than
    before, at least when run from a user-mode program. It could be something
    as simple as ensuring that you only try to access the HKLM part of the
    registry with read-only access.

    By the way, I am seeing rather poor performance of NTP on Vista, and yet
    Heiko (?) reported good performance earlier this year (IIRC). Any tests
    done at Meinberg? Here are my results:

    http://www.david-taylor.myby.co.uk/mrtg/gemini_ntp.html

    Periods of stability and periods of oscillation. Why? The service is
    started as:

    C:\Tools\NTP\bin\ntpd.exe -M -g -c "C:\Tools\NTP\etc\ntp.conf"

    on that system - the image path in HKLM\CurrentControlSet\services\ntp\.

    Cheers,
    David



  12. Re: move_fd() causing bad behavior on AIX5.3

    Danny Mayer wrote:
    > One thing that we could do for you is to make the move_fd() code do
    > nothing by creating an #ifdef NO_MOVE_FD option so that AIX can specify
    > to ignore it.


    This seems like a good idea since it's so broken right now. You probably
    want to default IPv6 support off for AIX5 as well.

    > It is mature but the idiosyncrasies of the AIX implementation of the IP
    > stacks are causing unnecessary problems.


    > I think that basically these are the same thing. The real problem here
    > is that the IPv6 wildcard socket is trying to grab the IPv4 wildcard too
    > and it's already bound which is what is causing it to fail. This is a
    > bug in the O/S since it shouldn't be doing that.


    If we can identify specific problems, we will try to open PMRs with IBM
    to correct/track the issues.

    >> 598: This is interesting since it also complains about the xntpd IBM
    >> ships (which we currently use as well). We have had some issues with
    >> the clocks jumping back to 0:00...1970 after reboots; I believe we
    >> finally convinced IBM there was a problem. The other issues discussed
    >> in this bug I am not sure about; we'll have to investigate and see if
    >> they are contributing to time-related peculiarities.


    > This is probably an O/S issue where NTP is not able to retrieve or set
    > the clock properly.


    FYI, the reset-to-1970 issue we encountered (and got IBM to fix) would
    infrequently occur after a sudden power loss; it would not update later
    as is suggested by #598.

    The only other "ntp is unreliable" issue we have had with the old xntpd
    was tracked down to a priority issue (xntpd was getting starved for CPU,
    messing up the algorithms).

    Now that I have a v4 ntpd hacked into functioning on AIX5, we plan to
    deploy it in some of our test runs to see how things behave.

  13. Re: move_fd() causing bad behavior on AIX5.3

    There is a good reason for the move_fd code. Frank would probably know
    better, and he's on vacation for another 2-3 weeks' time.

    I think it has to do with making sure there is room to open different sorts
    of files, and may only be important if one has refclocks.

    But it could be Bad to disable move_fd() in general for AIX.

    As for IPv6, are there any versions of AIX where IPv6 is working?

    H

  14. Re: move_fd() causing bad behavior on AIX5.3

    Harlan Stenn wrote:
    > There is a good reason for the move_fd code. Frank would probably know
    > better, and he's on vacation for another 2-3 weeks' time.
    >


    Actually I was the one that wrote it and yes there are good reasons for
    doing it. However, if for some reason this causes a problem on his
    particular O/S we can create a conditional macro to have it not use it.

    > I think it has to do with making sure there is room to open different sorts
    > of files, and may only be important if one has refclocks.
    >
    > But it could be Bad to disable move_fd() in general for AIX.
    >


    We don't know the general case to be able to answer that one.

    > As for IPv6, are there any versions of AIX where IPv6 is working?


    Yes, IPv6 does work on AIX. It's just that it's being confused by 6over4
    and 4in6, and at least some versions of the O/S are not keeping the
    address space separate. It should be running as a dual-stack or at least
    not trying to play tricks with the address space.

    Danny

  15. Re: move_fd() causing bad behavior on AIX5.3

    Danny wrote:
    > Harlan Stenn wrote:
    > > There is a good reason for the move_fd code. Frank would probably know
    > > better, and he's on vacation for another 2-3 weeks' time.


    > Actually I was the one that wrote it and yes there are good reasons for
    > doing it.


    Cool - thanks for letting me know those things.

    Would you please *add* the good reasons to a comment in the code just
    before the definition of move_fd() when you get a chance? There are
    *plenty* of undocumented cases where we do something for a good reason
    nobody can remember anymore.

    > However, if for some reason this causes a problem on his
    > particular O/S we can create a conditional macro to have it not use it.


    And until somebody besides you knows the good reason for move_fd(),
    nobody else will know if disabling move_fd() for AIX will be a good
    thing or if it will trade one problem for another (in this case, the
    underlying reason for having move_fd()).

    > > I think it has to do with making sure there is room to open different sorts
    > > of files, and may only be important if one has refclocks.
    > >
    > > But it could be Bad to disable move_fd() in general for AIX.
    > >

    >
    > We don't know the general case to be able to answer that one.


    OK, so again, when the underlying reasons for having move_fd() are
    documented (and therefore better known) in general, we have a chance of
    coming up with a better answer for this, too.

    > > As for IPv6, are there any versions of AIX where IPv6 is working?

    >
    > Yes, IPv6 does work on AIX. It's just that it's being confused by 6over4
    > and 4in6 and at least some versions of the O/S are not keeping the
    > address space separate. It should be running as a dual-stack or at least
    > not trying to play tricks with the address space.


    OK, sounds to me like that is an instance of "not working", at least as
    far as we are concerned.

    If there are known problems/issues with things like this, I would
    strongly request that people add this sort of information somewhere. It
    could be the code, or it could be at http://support.ntp.org/Dev .

    If we can figure out which versions of the OS do what, and if we can
    determine what OS version we have at runtime, we can have a single AIX
    executable DTRT depending on the OS version (or patch level).
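
    A minimal sketch of such a runtime check, using the POSIX uname() call.
    The struct and function names are hypothetical. The policy in
    skip_move_fd() is only a placeholder that encodes the single data point
    from this thread (AIX 5.3); it says nothing about other versions or
    patch levels.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/utsname.h>

struct os_level {
    char sysname[65];
    int  major;
    int  minor;
};

/*
 * Fill in the OS name and numeric level from uname(). On AIX the
 * "version" field holds the major number and "release" the minor one
 * (5 and 3 on AIX 5.3). Many other systems put "major.minor" in
 * "release", so that form is parsed everywhere else.
 * Returns 0 on success, -1 if uname() fails.
 */
static int get_os_level(struct os_level *out)
{
    struct utsname u;

    if (uname(&u) < 0)
        return -1;

    strncpy(out->sysname, u.sysname, sizeof(out->sysname) - 1);
    out->sysname[sizeof(out->sysname) - 1] = '\0';

    if (strcmp(u.sysname, "AIX") == 0) {
        out->major = atoi(u.version);
        out->minor = atoi(u.release);
    } else {
        char *dot = strchr(u.release, '.');
        out->major = atoi(u.release);
        out->minor = dot ? atoi(dot + 1) : 0;
    }
    return 0;
}

/*
 * Hypothetical policy: skip move_fd() only on the one level reported
 * as broken in this thread. Other levels are untested.
 */
static int skip_move_fd(const struct os_level *os)
{
    return strcmp(os->sysname, "AIX") == 0 &&
           os->major == 5 && os->minor == 3;
}
```

    Patch level is not available from uname(), so distinguishing fixed from
    unfixed 5.3 systems would need some other source of information.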

    H

  16. Re: move_fd() causing bad behavior on AIX5.3

    Harlan Stenn wrote:
    > Danny wrote:
    >> Harlan Stenn wrote:
    >>> There is a good reason for the move_fd code. Frank would probably know
    >>> better, and he's on vacation for another 2-3 weeks' time.

    >
    >> Actually I was the one that wrote it and yes there are good reasons for
    >> doing it.

    Yes, you wrote the initial version. I then refined that to cover
    more corner cases and added a 28-line text to explain the reasons for the
    code. What information is missing? Or are you looking at a version
    different from 4.2.4pX?

    >
    > Cool - thanks for letting me know those things.
    >
    > Would you please *add* the good reasons to a comment in the code just
    > before the definition of move_fd() when you get a chance? There are
    > *plenty* of undocumented cases where we do something for a good reason
    > nobody can remember anymore.


    Do you see the comment starting "On Unix systems the stdio library..."?
    This is supposed to be the explanation of the move_fd() implementation.

    >
    >> However, if for some reason this causes a problem on his
    >> particular O/S we can create a conditional macro to have it not use it.

    >
    > And until somebody besides you knows the good reason for move_fd(),
    > nobody else will know if disabling move_fd() for AIX will be a good
    > thing or if it will trade one problem for another (in this case, the
    > underlying reason for having move_fd()).


    Good point! That's why I'd rather find out what goes wrong with
    move_fd() on AIX. We may uncover either an ntpd implementation bug,
    an AIX specialty, or an AIX library bug.

    >
    >>> I think it has to do with making sure there is room to open different sorts
    >>> of files, and may only be important if one has refclocks.
    >>>
    >>> But it could be Bad to disable move_fd() in general for AIX.
    >>>

    >> We don't know the general case to be able to answer that one.

    >
    > OK, so again, when the underlying reasons for having move_fd() are
    > documented (and therefore better known) in general, we have a chance of
    > coming up with a better answer for this, too.
    >
    >>> As for IPv6, are there any versions of AIX where IPv6 is working?

    >> Yes, IPv6 does work on AIX. It's just that it's being confused by 6over4
    >> and 4in6 and at least some versions of the O/S are not keeping the
    >> address space separate. It should be running as a dual-stack or at least
    >> not trying to play tricks with the address space.

    >
    > OK, sounds to me like that is an instance of "not working", at least as
    > far as we are concerned.
    >
    > If there are known problems/issues with things like this, I would
    > strongly request that people add this sort of information somewhere. It
    > could be the code, or it could be at http://support.ntp.org/Dev .
    >
    > If we can figure out which versions of the OS do what, and if we can
    > determine what OS version we have at runtime, we can have a single AIX
    > executable DTRT depending on the OS version (or patch level).
    >
    > H


    Frank
