ntpd fscked up again - Networking
-
ntpd fscked up again
Here I had another case of why ntpd is finicky and cannot be relied
on.
ntp.conf specified a timeserver which is in datacenter.
Datacenter had power outage.
Power is restored.
ntp client comes up before network switch and before timeserver.
ntp does not see the timeserver. It quietly drops it from the list,
and basically sits there doing nothing, since there is nothing else
on the list. (and having more stuff on the list would not help)
I get no notice from it whatsoever, nothing in log files, etc. It just
sits there, doing nothing, not letting me know, time is drifting. I
have 26 servers acting in this manner.
Come on people. That's not how highly available programs behave. It
should keep on trying, or at least exit if it cannot do its job. If it
exited, my cron job with ntpdate would take over and fix things.
This is on Ubuntu Hardy.
--
Due to extreme spam originating from Google Groups, and their inattention
to spammers, I and many others block all articles originating
from Google Groups. If you want your postings to be seen by
more readers you will need to find a different means of
posting on Usenet.
http://improve-usenet.org/
-
Re: ntpd fscked up again
Ignoramus24166 wrote:
> Here I had another case of why ntpd is finicky and cannot be relied
> on.
>
> ntp.conf specified a timeserver which is in datacenter.
>
> Datacenter had power outage.
>
> Power is restored.
>
> ntp client comes up before network switch and before timeserver.
>
> ntp does not see the timeserver. It quietly drops it from the list,
> and basically sits there doing nothing, since there is nothing else
> on the list. (and having more stuff on the list would not help)
>
> I get no notice from it whatsoever, nothing in log files, etc. It just
> sits there, doing nothing, not letting me know, time is drifting. I
> have 26 servers acting in this manner.
>
> Come on people. That's not how highly available programs behave. It
> should keep on trying, or at least exit if it cannot do its job. If it
> exited, my cron job with ntpdate would take over and fix things.
>
> This is on Ubuntu Hardy.
well the source is there..rewrite it yourself and submit..
-
Re: ntpd fscked up again
In comp.os.linux.misc Ignoramus24166 wrote:
> ntp does not see the timeserver. It quietly drops it from the list,
> and basically sits there doing nothing [...]
> Come on people. That's not how highly available programs behave. It
> should keep on trying, or at least exit if it cannot do its job. If it
> exited, my cron job with ntpdate would take over and fix things.
Unfortunately, it seems "they" are having a big discussion about this
over in the ntpd corner. It's probably one of the most important feature
requests for the application and they want to design it properly before
releasing it. (This seems fair enough to me, but I must admit I'm a little
bemused over the wrangling. Server doesn't respond? Look up the DNS again
maybe to get a different address. Try again. Run out of addresses? Start
from the top. Lost all servers? Scream. (How?))
Chris
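The retry order described above (re-resolve, try each address, start again from the top, complain when everything fails) can be sketched in shell. This is only an illustration of that idea and not how ntpd itself behaves. The function names `resolve`, `probe` and `try_servers` are made up for this sketch, and it assumes `getent`, `ntpdate` and `logger` are installed.

```shell
#!/bin/sh
# resolve HOST: print one address per line for HOST, looked up afresh each call.
resolve() {
    getent hosts "$1" | awk '{ print $1 }'
}

# probe ADDR: succeed if ADDR answers an NTP query.
# "ntpdate -q" only queries the server; it does not set the clock.
probe() {
    ntpdate -q "$1" >/dev/null 2>&1
}

# try_servers HOST ROUNDS: re-resolve HOST on every round and probe each
# address in turn. Print the first address that answers. If nothing answers
# after ROUNDS rounds, log an error to syslog and fail.
try_servers() {
    host=$1
    rounds=$2
    i=0
    while [ "$i" -lt "$rounds" ]; do
        for addr in $(resolve "$host"); do
            if probe "$addr"; then
                echo "$addr"
                return 0
            fi
        done
        i=$((i + 1))
    done
    logger -p daemon.err "no NTP server reachable for $host after $rounds rounds"
    return 1
}
```

Something like `try_servers ntp.example.com 3 || exit 1` (the host name is a placeholder) covers the "scream" part: the failure lands in syslog and the non-zero exit status lets a cron job or wrapper react.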
-
Re: ntpd fscked up again
Chris Davies wrote:
> In comp.os.linux.misc Ignoramus24166
> wrote:
>> ntp does not see the timeserver. It quietly drops it from the list, and
>> basically sits there doing nothing [...]
I have never noticed this, though it may well be true. I have a very simple
system, 2 computers on my LAN, with the main machine connected to the
Internet directly. The other machine uses the main machine as the time
server. But my main machine is almost always up, except during power
failures or when I am booting a new kernel. It is the other one that goes up
and down (dual boot). So I am not likely to notice this except when a power
failure happens and the other machine comes up first.
>
>> Come on people. That's not how highly available programs behave. It
>> should keep on trying, or at least exit if it cannot do its job. If it
>> exited, my cron job with ntpdate would take over and fix things.
>
> Unfortunately, it seems "they" are having a big discussion about this
> over in the ntpd corner. It's probably one of the most important feature
> requests for the application and they want to design it properly before
> releasing it. (This seems fair enough to me, but I must admit I'm a
> little bemused over the wrangling. Server doesn't respond? Lookup the DNS
> again maybe to get a different address. Try again. Run out of addresses?
> Start from the top. Lost all servers? Scream. (How?))
>
The how part seems easy to me: write it in /var/log/messages.
--
.~. Jean-David Beyer Registered Linux User 85642.
/V\ PGP-Key: 9A2FC99A Registered Machine 241939.
/( )\ Shrewsbury, New Jersey http://counter.li.org
^^-^^ 11:50:01 up 34 days, 17:56, 4 users, load average: 4.67, 4.52, 4.53
-
Re: ntpd fscked up again
Ignoramus24166 writes:
>Here I had another case of why ntpd is finicky and cannot be relied
>on.
>ntp.conf specified a timeserver which is in datacenter.
>Datacenter had power outage.
>Power is restored.
>ntp client comes up before network switch and before timeserver.
>ntp does not see the timeserver. It quietly drops it from the list,
>and basically sits there doing nothing, since there is nothing else
>on the list. (and having more stuff on the list would not help)
>I get no notice from it whatsoever, nothing in log files, etc. It just
>sits there, doing nothing, not letting me know, time is drifting. I
>have 26 servers acting in this manner.
>Come on people. That's not how highly available programs behave. It
>should keep on trying, or at least exit if it cannot do its job. If it
>exited, my cron job with ntpdate would take over and fix things.
>This is on Ubuntu Hardy.
I am not at all sure that you are right that it is doing nothing. It is
probably waiting. The behaviour of ntpd is to wait at least the poll interval
(the default minimum poll is 6, which is 2^6 = 64 sec) and try again.
If that does not work, it backs off the poll interval, eventually waiting
for poll interval 10 (2^10 = 1024 sec, about 17 min) before trying again.
This is to prevent swamping the servers in just your case. Imagine you serve
10000 clients (which some of the main servers do). You go down and come up
again, and then all 10000 clients bombard you with a request per second or a
request per millisecond since they have not heard from you. You have a very,
very sick server on your hands.
I.e., how long have you let it run before deciding it is doing nothing?
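For reference, the poll values in ntp.conf (`minpoll`, `maxpoll`) are exponents of two, in seconds. A small sketch of the arithmetic, with the helper names `poll_seconds` and `poll_table` invented for this example:

```shell
#!/bin/sh
# poll_seconds EXP: print the poll interval in seconds for a poll exponent,
# i.e. 2 raised to the power EXP.
poll_seconds() {
    echo $((1 << $1))
}

# poll_table: print the intervals for the default range, poll 6 through 10.
poll_table() {
    p=6
    while [ "$p" -le 10 ]; do
        echo "poll $p = $(poll_seconds "$p") s"
        p=$((p + 1))
    done
}
```

Calling `poll_table` lists 64, 128, 256, 512 and 1024 seconds, so even at the slowest default rate an unreachable server should be retried roughly every 17 minutes.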
-
Re: ntpd fscked up again
On Sep 10, 5:10 am, Ignoramus24166 wrote:
> Come on people. That's not how highly available programs behave. It
> should keep on trying, or at least exit if it cannot do its job. If it
> exited, my cron job with ntpdate would take over and fix things.
For every person who complains that it doesn't track a change in DNS,
there's a person who complains because it changes servers
unpredictably. As for exiting if it cannot do its job, it is still
available to synchronize the time with any server that might connect
to it.
This is a well-known problem with many daemons. You need to follow a
practiced and tested power-up sequence if you lose power to more than
one component. It is often necessary to power cycle some devices more
than once to ensure that every device powers up with everything it
needs to start up actually working.
DS
-
Re: ntpd fscked up again
On 2008-09-10, Chris Davies wrote:
> In comp.os.linux.misc Ignoramus24166 wrote:
>> ntp does not see the timeserver. It quietly drops it from the list,
>> and basically sits there doing nothing [...]
>
>> Come on people. That's not how highly available programs behave. It
>> should keep on trying, or at least exit if it cannot do its job. If it
>> exited, my cron job with ntpdate would take over and fix things.
>
> Unfortunately, it seems "they" are having a big discussion about this
> over in the ntpd corner. It's probably one of the most important feature
> requests for the application and they want to design it properly before
> releasing it. (This seems fair enough to me, but I must admit I'm a little
> bemused over the wrangling. Server doesn't respond? Lookup the DNS again
> maybe to get a different address. Try again. Run out of addresses? Start
> from the top. Lost all servers? Scream. (How?))
>
To me, the answer "just exit with nonzero exit code" and "do not lock
up the socket you cannot make use of" would suffice. It would not be
perfect, but it will work. "Keep trying if you cannot find anything"
would be even better. Right now the result is the worst imaginable.
-
Re: ntpd fscked up again
On Wed, 10 Sep 2008 07:10:51 -0500, Ignoramus24166 wrote:
> Here I had another case of why ntpd is finicky and cannot be relied
> on.
>
> ntp.conf specified a timeserver which is in datacenter.
>
> Datacenter had power outage.
>
> Power is restored.
>
> ntp client comes up before network switch and before timeserver.
>
> ntp does not see the timeserver. It quietly drops it from the list,
> and basically sits there doing nothing, since there is nothing else
> on the list. (and having more stuff on the list would not help)
cfengine could save your bacon here...
--
* John Oliver http://www.john-oliver.net/ *
-
Re: ntpd fscked up again
> Ignoramus24166 wrote:
> > Here I had another case of why ntpd is finicky and cannot be relied
> > on.
> >
> > ntp.conf specified a timeserver which is in datacenter.
> >
> > Datacenter had power outage.
> >
> > Power is restored.
> >
> > ntp client comes up before network switch and before timeserver.
OK, perhaps the client should deal with that more gracefully, but it
can be equally well argued that the power-on sequence for the data
center should see to it that the servers are all up and running before
the clients get restarted, and that the switches are all up before the
servers.
> > ntp does not see the timeserver. It quietly drops it from the list,
> > and basically sits there doing nothing, since there is nothing else
> > on the list. (and having more stuff on the list would not help)
I thought the whole point about NTP server selection was to select
multiple time sources such that no single point of failure could take
them all out from the client's grasp.
rick jones
--
firebug n, the idiot who tosses a lit cigarette out his car window
these opinions are mine, all mine; HP might not want them anyway... 
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
-
Re: ntpd fscked up again
On 2008-09-10, John Oliver wrote:
> On Wed, 10 Sep 2008 07:10:51 -0500, Ignoramus24166 wrote:
>> Here I had another case of why ntpd is finicky and cannot be relied
>> on.
>>
>> ntp.conf specified a timeserver which is in datacenter.
>>
>> Datacenter had power outage.
>>
>> Power is restored.
>>
>> ntp client comes up before network switch and before timeserver.
>>
>> ntp does not see the timeserver. It quietly drops it from the list,
>> and basically sits there doing nothing, since there is nothing else
>> on the list. (and having more stuff on the list would not help)
>
> cfengine could save your bacon here...
>
How? I am very curious about cfengine.
-
Re: ntpd fscked up again
On 2008-09-10, Rick Jones wrote:
>> Ignoramus24166 wrote:
>> > Here I had another case of why ntpd is finicky and cannot be relied
>> > on.
>> >
>> > ntp.conf specified a timeserver which is in datacenter.
>> >
>> > Datacenter had power outage.
>> >
>> > Power is restored.
>> >
>> > ntp client comes up before network switch and before timeserver.
>
> OK, perhaps the client should deal with that more gracefully, but it
> can be equally well argued that the power-on sequence for the data
> center should see to it that the servers are all up and running before
> the clients get restarted, and that the switches are all up before the
> servers.
>
>> > ntp does not see the timeserver. It quietly drops it from the list,
>> > and basically sits there doing nothing, since there is nothing else
>> > on the list. (and having more stuff on the list would not help)
>
> I thought the whole point about NTP server selection was to select
> multiple time sources such that no single point of failure could take
> them all out from the client's grasp.
The server became available minutes later. After a whole day, ntpd was still
not talking to it and had to be restarted by /etc/init.d/ntp
restart. Then it started talking.
-
Re: ntpd fscked up again
> > I thought the whole point about NTP server selection was to select
> > multiple time sources such that no single point of failure could
> > take them all out from the client's grasp.
> The server became available minutes later. After a whole day, ntpd
> was still not talking to it and had to be restarted by
> /etc/init.d/ntp restart. Then it started talking.
_The_ server. I'll grant that what the client ntpd was doing wasn't
terribly friendly, but I also think that configuring only a single
source of time is brittle at best, even if the client ntpd were going
to recover from starting before the time server was reachable.
That said, I wonder what the effect of enabling iburst and setting the
minpoll interval higher might be in such a situation.
rick jones
--
denial, anger, bargaining, depression, acceptance, rebirth...
where do you want to be today?
these opinions are mine, all mine; HP might not want them anyway... 
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
-
Re: ntpd fscked up again
On 2008-09-10, Rick Jones wrote:
>> > I thought the whole point about NTP server selection was to select
>> > multiple time sources such that no single point of failure could
>> > take them all out from the client's grasp.
>
>> The server became available minutes later. After a whole day, ntpd
>> was still not talking to it and had to be restarted by
>> /etc/init.d/ntp restart. Then it started talking.
>
> _The_ server. I'll grant that what the client ntpd was doing wasn't
> terribly friendly, but I also think that configuring only a single
> source of time is brittle at best, even if the client ntpd were going
> to recover from starting before the time server was reachable.
At that time, as I said, the core switch was down also and NO servers
were reachable. If I had 100 servers in my config, it would not help.
> That said, I wonder what the effect of enabling iburst and setting the
> minpoll interval higher might be in such a situation.
Whatever the poll interval was, it was surely less than a day.
-
Re: ntpd fscked up again
In comp.os.linux.networking Ignoramus24166 wrote:
> On 2008-09-10, Rick Jones wrote:
> > _The_ server. I'll grant that what the client ntpd was doing
> > wasn't terribly friendly, but I also think that configuring only a
> > single source of time is brittle at best, even if the client ntpd
> > were going to recover from starting before the time server was
> > reachable.
> At that time, as I said, the core switch was down also and NO
> servers were reachable. If I had 100 servers in my config, it would
> not help.
I suppose that is true. If there is only the one core switch and only
one path out of each client, you are pretty much toast until
everything is back up again, no matter how many servers you configure
or how often the cron job fires.
> > That said, I wonder what the effect of enabling iburst and setting
> > the minpoll interval higher might be in such a situation.
> Whatever the poll interval was, it was surely less than a day.
Since the initial minpoll is 64 (?) seconds, it probably gave up on
the order of minutes. That is based on my ass-u-me-ing it tries a
fixed number of times to reach the server and then gives up.
rick jones
--
a wide gulf separates "what if" from "if only"
these opinions are mine, all mine; HP might not want them anyway... 
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
-
Re: ntpd fscked up again
Rick Jones writes:
>In comp.os.linux.networking Ignoramus24166 wrote:
>> On 2008-09-10, Rick Jones wrote:
>> > _The_ server. I'll grant that what the client ntpd was doing
>> > wasn't terribly friendly, but I also think that configuring only a
>> > single source of time is brittle at best, even if the client ntpd
>> > were going to recover from starting before the time server was
>> > reachable.
>> At that time, as I said, the core switch was down also and NO
>> servers were reachable. If I had 100 servers in my config, it would
>> not help.
>I suppose that is true if there is only the one core switch as well
>and only one path out of each client you are pretty much toast until
>everything is back up again no matter how many servers you configure
>nor how often the cron job fires.
>> > That said, I wonder what the effect of enabling iburst and setting
>> > the minpoll interval higher might be in such a situation.
>> Whatever the poll interval was, it was surely less than a day.
>Since the initial minpoll is 64 (?) seconds, it probably gave-up on
>the order of minutes. That is based on my ass-u-me-ing it tries a
>fixed number of times to reach the server and then gives-up.
There is a difference in behaviour between a server that has been reached
before and one that has never been reached because the DNS returned no
address. There is a "dynamic" proposal, but it is still not in place.
So what you can do is to run a cron job which runs every 5 min after
bootup. It looks at ntpq -p and sees if there are any servers in the list.
If there are none, it restarts ntpd.
#!/bin/bash
# Restart ntpd if "ntpq -p" prints no peer lines below the ===== separator.
if [ "$(ntpq -p | awk 'flag == 1 { print "OK"; exit } $0 ~ /=====/ { flag = 1 }')" != "OK" ]; then
    service ntpd restart
fi
(not positive it will work)
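A possible variant, offered as a sketch rather than a tested fix: instead of only checking that peers are listed, look at the reach column, so that a peer which is listed but has never answered also counts as a failure. This assumes the usual `ntpq -p` layout, with reach as the seventh column. The function names `reachable_peers` and `needs_restart` are invented here.

```shell
#!/bin/sh
# reachable_peers: read "ntpq -pn" output on stdin and print how many peers
# have a non-zero reach register (7th column, shown in octal).
reachable_peers() {
    awk '
        seen && NF >= 7 && $7 != "0" { n++ }   # peer line with some replies
        /^=+$/ { seen = 1 }                     # separator under the header
        END { print n + 0 }
    '
}

# needs_restart: succeed when no peer is listed or none has answered recently.
needs_restart() {
    [ "$(reachable_peers)" -eq 0 ]
}
```

From cron every five minutes this could be used as `ntpq -pn | needs_restart && service ntpd restart`. Note that while the servers are genuinely down it will restart ntpd on every run, so it trades extra restarts for quicker recovery.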
-
Re: ntpd fscked up again
Bill Unruh writes:
> Rick Jones writes:
>
>>In comp.os.linux.networking Ignoramus24166 wrote:
>>> On 2008-09-10, Rick Jones wrote:
>
>>> > _The_ server. I'll grant that what the client ntpd was doing
>>> > wasn't terribly friendly, but I also think that configuring only a
>>> > single source of time is brittle at best, even if the client ntpd
>>> > were going to recover from starting before the time server was
>>> > reachable.
>
>>> At that time, as I said, the core switch was down also and NO
>>> servers were reachable. If I had 100 servers in my config, it would
>>> not help.
>
>>I suppose that is true if there is only the one core switch as well
>>and only one path out of each client you are pretty much toast until
>>everything is back up again no matter how many servers you configure
>>nor how often the cron job fires.
>
>>> > That said, I wonder what the effect of enabling iburst and setting
>>> > the minpoll interval higher might be in such a situation.
>
>>> Whatever the poll interval was, it was surely less than a day.
>
>>Since the initial minpoll is 64 (?) seconds, it probably gave-up on
>>the order of minutes. That is based on my ass-u-me-ing it tries a
>>fixed number of times to reach the server and then gives-up.
>
>
> There is a difference in the behaviour if the server has been reached and
> has never been reached and the dns returns no address. There is a "dynamic"
> proposal but it is still not in place.
>
> So what you can do is to run a cron job which runs every 5 min after
> bootup. It looks at ntpq -p and sees if there are any servers in the list.
> If there are none, it restarts ntpd.
>
>
> #!/bin/bash
> if [ "`ntpq -p|awk 'flag==1 { print "OK";exit} $0 ~ /=====/ {flag=1}'`" != 'OK' ];then
> service ntpd restart
> fi
>
> (not positive it will work)
I use this and therefore know it works:
> cat /root/bin/ntp-check
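# Start ntpd if the init script reports it stopped or dead, then mail a report.
x=$(/etc/rc.d/init.d/ntpd status)
case $x in
*stopped*|*dead*)
    status=$(/etc/rc.d/init.d/ntpd start 2>&1)
    # Pass the values as printf arguments so a stray % in them is harmless.
    printf 'had to start ntpd on %s, state <%s>\nrestart got status\n %s\n' \
        "$(uname -n)" "$x" "$status" | mail -s cron myemailaddr
    ;;
esac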
(replace myemailaddr with your own)
-
Re: ntpd fscked up again
On Thu, 11 Sep 2008 01:00:27 +0000, Bill Unruh wrote:
> So what you can do is to run a cron job which runs every 5 min after
> bootup. It looks at ntpq -p and sees if there are any servers in the
> list. If there are none, it restarts ntpd.
Alternatively, how about
server 127.127.1.0 # local clock
fudge 127.127.1.0 stratum 12
in ntp.conf? Is this still reasonable?
See http://www.meinberg.de/english/info/ntp.htm
Additionally, there should be an entry for the local clock which
can be used as a fallback resource if no other time source is available.
Since the local clock is not very accurate, it should be fudged to a low
stratum
--
Regards
Alex
http://www.badphorm.co.uk/
-
Re: ntpd fscked up again
On 2008-09-11, Bill Unruh wrote:
> Rick Jones writes:
>
>>In comp.os.linux.networking Ignoramus24166 wrote:
>>> On 2008-09-10, Rick Jones wrote:
>
>>> > _The_ server. I'll grant that what the client ntpd was doing
>>> > wasn't terribly friendly, but I also think that configuring only a
>>> > single source of time is brittle at best, even if the client ntpd
>>> > were going to recover from starting before the time server was
>>> > reachable.
>
>>> At that time, as I said, the core switch was down also and NO
>>> servers were reachable. If I had 100 servers in my config, it would
>>> not help.
>
>>I suppose that is true if there is only the one core switch as well
>>and only one path out of each client you are pretty much toast until
>>everything is back up again no matter how many servers you configure
>>nor how often the cron job fires.
>
>>> > That said, I wonder what the effect of enabling iburst and setting
>>> > the minpoll interval higher might be in such a situation.
>
>>> Whatever the poll interval was, it was surely less than a day.
>
>>Since the initial minpoll is 64 (?) seconds, it probably gave-up on
>>the order of minutes. That is based on my ass-u-me-ing it tries a
>>fixed number of times to reach the server and then gives-up.
>
>
> There is a difference in the behaviour if the server has been reached and
> has never been reached and the dns returns no address. There is a "dynamic"
> proposal but it is still not in place.
>
> So what you can do is to run a cron job which runs every 5 min after
> bootup. It looks at ntpq -p and sees if there are any servers in the list.
> If there are none, it restarts ntpd.
>
>
> #!/bin/bash
> if [ "`ntpq -p|awk 'flag==1 { print "OK";exit} $0 ~ /=====/ {flag=1}'`" != 'OK' ];then
> service ntpd restart
> fi
I like this a lot.
Thanks
i
> (not positive it will work)
>
-
Re: ntpd fscked up again
Alex Potter writes:
>On Thu, 11 Sep 2008 01:00:27 +0000, Bill Unruh wrote:
>> So what you can do is to run a cron job which runs every 5 min after
>> bootup. It looks at ntpq -p and sees if there are any servers in the
>> list. If there are none, it restarts ntpd.
>Alternatively, how about
> server 127.127.1.0 # local clock
> fudge 127.127.1.0 stratum 12
>in ntp.conf? Is this still reasonable?
>See http://www.meinberg.de/english/info/ntp.htm
>Additionally, there should be an entry for the local clock which
>can be used as a fallback resource if no other time source is available.
>Since the local clock is not very accurate, it should be fudged to a low
>stratum
That of course does absolutely nothing for your time accuracy. You cannot
get accurate time by comparing your wristwatch to itself. And it will not
wake up and use the external ones when they come online.
>--
>Regards
>Alex
>http://www.badphorm.co.uk/
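If the local clock entry is kept in ntp.conf anyway, it may be worth at least detecting when a machine has fallen back to it. A rough sketch, assuming `ntpq -p` (without `-n`) marks the system peer with a leading `*` and names the local clock driver `LOCAL(0)`; the function names `system_peer` and `on_local_clock` are invented for this example.

```shell
#!/bin/sh
# system_peer: read "ntpq -p" output on stdin and print the remote name of
# the peer marked with "*" (the current system peer), if there is one.
system_peer() {
    awk '/^\*/ { print substr($1, 2); exit }'
}

# on_local_clock: succeed when the system peer is the local clock driver,
# which means the machine is only following its own clock.
on_local_clock() {
    case "$(system_peer)" in
        "LOCAL("*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A cron job could run `ntpq -p | on_local_clock && logger -p daemon.warning "ntpd is on the local clock only"` to get the notice in the logs that the original post asked for.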
-
Re: ntpd fscked up again
On Thu, 11 Sep 2008 06:00:38 +0000, Unruh wrote:
> That of course does absolutely nothing for your time accuracy. You
> cannot get accurate time by comparing your wristwatch to itself.
Of course not.
>And it
> will not wake up and use the external ones when they come online.
Ah, thanks for correcting my mis-apprehension.
--
Regards
Alex
http://www.badphorm.co.uk/