Making traffic to decommissioned pool servers stop sooner

I’m working on a change to the reference implementation of NTP, ntpd, to improve the pool directive to automatically prune non-responsive and non-surviving pool sources and replace them with others. This is a win for users of the pool, but it’s also a win for pool server operators, as it will cause clients using the new code to stop polling servers configured with pool which are no longer serving NTP.

Please consider testing the new code on any pool clients you have. You can also test it on pool servers by configuring the pool with noselect, for example:

pool 2.pool.ntp.org iburst noselect

See Testing needed for ntpd pool change to whittle outliers - #2 by davehart

Thanks,
Dave Hart

P.S. I’ve made a pointless edit to make this post visible after someone unkindly marked it as “inappropriate”.


Why? Chrony is about 1000 times better.

NTPD is old, outdated and pretty useless.

I stopped using NTPD a long time ago; it held on for a long time, but it has become irrelevant.

Chrony has noselect already. Sorry mate, you are wasting your time in my opinion.

It is old, yes, as the original NTP implementation. It is still being maintained, so I’m not sure why you claim it’s outdated. It is far from useless. If you’re happy with Chrony, more power to you, but some folks prefer ntpd, particularly those for whom it is the only full-featured NTP implementation, such as those using Windows, and those using less-mainstream systems for which NTPsec and Chrony have no support.

Please don’t hijack this thread into a debate about which NTP implementation you’d like to see people using.


I’m a bit mystified at why Dave’s post is flagged and temporarily hidden, but here’s what I got of it from the email notification I received:

I’m working on a change to the reference implementation of NTP, ntpd, to improve the pool directive to automatically prune non-responsive and non-surviving pool sources and replace them with others. This is a win for users of the pool, but it’s also a win for pool server operators, as it will cause clients using the new code to stop polling servers configured with pool which are no longer serving NTP.

My question to @davehart is: doesn’t the pool directive already do this? My ntpd-based pool server has several local peers, including two stratum 1 hosts, but also includes a number of pools in its configuration. Within a short time (I’d guess 2^10 seconds, but can’t be sure without measuring) after ntpd starts up on this system, the pools are discarded from its peers and it uses only the local peers and static servers.

When you say discarded, do you mean they disappear from the ntpq -p billboard, or simply that they aren’t contributing to the time solution (for example, have a tally code of - in the first character of the billboard)?

With this change, ntpd slowly (by default no more often than every 10 minutes) cycles out one non-surviving pool server at a time, replacing it by soliciting another pool server. Over time, that causes all the pool servers showing tally code - to be replaced with servers which survive, showing tally code + or #.
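To make the tally codes concrete, here is a schematic ntpq -p billboard (the hostnames and values are invented, not real output). The first character of each line is the tally code: * marks the system peer, + marks a survivor included by the combine algorithm, # marks a survivor kept as a backup, and - marks an outlier discarded by the clustering algorithm. It is the - entries (and unreachable sources) that the new code gradually cycles out:

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*a.example.net   192.0.2.10       2 u   33   64  377   12.345    0.123   0.456
+b.example.net   198.51.100.7     2 u   51   64  377   23.456   -0.234   0.567
#c.example.net   198.51.100.9     2 u   47   64  377   28.901    0.345   0.678
-d.example.net   203.0.113.5      3 u   12   64  377   45.678    1.234   0.890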

chrony is indeed very good!

However, there are so few full NTP implementations, and ntpd is still in widespread use, so improving it is welcome. It also has a different license than chrony. Either ntpd or chrony is, in my experience, much better for timekeeping than some of the SNTP-only implementations that are configured by default on some Linux distributions.


I’d suggest having the rate be a number of hours rather than “every 10 minutes” (though mostly out of concern for how many more DNS queries this will create!)

Another suggestion would be to have a rate at which all servers are replaced (maybe measured in days and defaulted to weeks). This would keep the load on the NTP servers better balanced and allow a server to leave the system without turning off the NTP service.

ntpd hangs onto the full list of servers it gets from each DNS query and works its way through them. For example, with a dual-homed client using 2.pool.ntp.org, ntpd gets 8 addresses and will requery DNS at most every 80 minutes. In practice it will be less frequent, as the replacement doesn’t happen at exactly 10 minutes but rather when a server hasn’t survived for 10 poll intervals. The replacement slows as the poll interval increases, and stops once a cohort of survivors has been found.
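As a quick way to see the addresses behind that arithmetic, you can query the pool DNS yourself (plain dig usage; the addresses returned will of course vary):

dig +short A 2.pool.ntp.org
dig +short AAAA 2.pool.ntp.org

Each query typically returns four addresses, so a dual-stack client has eight candidates to work through; at no more than one replacement per 10 minutes, that is at least 80 minutes before another DNS query is needed.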

With the code as it is, a system can leave the pool, and once it stops serving NTP, existing clients running the latest code will stop querying it in just under 3 hours (ten polls at the default maxpoll 10, i.e. 10 × 1024 s = 10240 seconds).

Adding another mechanism to replace even responsive pool servers is an interesting idea. It will result in more DNS queries when such replacement is triggered, so I think a long timeframe such as weeks would be a reasonable default. I would also limit such replacement to one server at a time, no more often than every several hours by default, to give the algorithms time to digest the change and possibly demobilize servers that do not survive with the changed cohort.

They disappear from the ntpq -p billboard entirely. This is on ntpd 4.2.8p15@1.3728-o Wed Sep 23 11:46:38 UTC 2020 (1), the default on Debian bullseye (current stable).

I could not disagree more. Checking all servers every 10 minutes is quite stupid,
as the round-robin DNS will assign 4 (or more) servers at every request anyway.

Most DNS queries will be cached for 3600 seconds (1 hour) anyway, so it doesn’t need much more checking.

The problem some don’t seem to grasp is that if 1 (or 10?) monitors can reach your server, there is no need to do extensive (extra) testing just to make you happy with a 20/20 on all testing servers.

The point of the monitors is to establish whether your server can be reached and whether it’s on time; when it is, it will be listed (maybe tested by a few others), but it will be used.

The total score only means ALL monitors were able to reach you. However, with the single monitor in the past, we already know this is impossible for some.

As for traffic to decommissioned servers, DNS caching is to blame. Just close port 123 and your traffic is gone.

What is the point to this discussion anyway?

If you sign up for testing, just state “monitor only”. The NTP Pool DNS servers have no control over DNS caching.

DNS queries are cached based on the TTL value returned by the authoritative name servers to the recursive resolvers.
In the pool’s case, the ntpns.org nameservers return a TTL of 150 seconds:

dig pool.ntp.org @f.ntpns.org.

; <<>> DiG 9.18.14 <<>> pool.ntp.org @f.ntpns.org.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 43417
;; flags: qr aa rd; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: c2925b27a11b0b29 (echoed)
;; QUESTION SECTION:
;pool.ntp.org.                  IN      A

;; ANSWER SECTION:
pool.ntp.org.           150     IN      A       162.159.200.123
pool.ntp.org.           150     IN      A       129.250.35.251
pool.ntp.org.           150     IN      A       108.61.23.93
pool.ntp.org.           150     IN      A       64.111.99.224

;; Query time: 163 msec
;; SERVER: 31.3.105.98#53(f.ntpns.org.) (UDP)
;; WHEN: Thu Jun 29 12:08:24 PDT 2023
;; MSG SIZE  rcvd: 165

Per RFC 8767, recursive resolvers may only return stale data when the authoritative server is unresponsive.
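If you want to see that behaviour from the client side, query a recursive resolver twice and watch the TTL column count down (1.1.1.1 is just an example public resolver; substitute whatever resolver you actually use):

dig +noall +answer pool.ntp.org @1.1.1.1
sleep 60
dig +noall +answer pool.ntp.org @1.1.1.1

The second answer shows a smaller TTL; once it hits zero the resolver goes back to the pool’s authoritative servers, so a well-behaved cache stops handing out an address at most 150 seconds after the pool DNS stops returning it.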


A certain percentage of resolvers misbehave, but yeah.

I wonder if ntpq -clpeers would show them. By default, the peers billboard elides dynamic peers which are unreachable. I find it confusing and have more than once wondered if that should be changed.