Testing needed for ntpd pool change to whittle outliers

For over a decade, ntpd has had support generally described as “automatic server discovery,” designed not only to select servers automatically but also to refine the selection over time: sticking with servers that contribute to the time solution while preempting and replacing outliers.

This code is used both for manycastclient and for pool associations. Unfortunately, for over a decade it has been effectively inoperative except in the rare situation where fewer than tos minclock (default 3) servers survive the clock clustering and selection process.

After a fair amount of testing on my own, I’d appreciate others donating some time to try out the new code and help work out any remaining kinks. Pool clients using ntpd are obviously prime candidates for testing. Those running ntpd as pool servers could also help by configuring pool associations with noselect, or by using prefer on their explicitly configured sources.
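For example, a plain pool client needs only a pool line, while a pool server operator might monitor pool sources without selecting them and prefer an explicit source (the hostnames below are placeholders):

# pool client: let ntpd discover and manage its sources
pool pool.ntp.org iburst

# pool server operator: monitor pool sources without using them for sync,
# and prefer an explicitly configured source
server ntp.example.net iburst prefer
pool pool.ntp.org iburst noselect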

The basic logic is that once a server has not survived the selection for 10 consecutive poll intervals, it becomes a candidate for preemption. Only the lowest-scoring servers are removed, and only once there are already maxclock associations. Preempted servers are automatically replaced within a poll interval or two by newly solicited sources.

You may want to adjust tos maxclock from its default of 10 to see how it works when combining more or fewer sources.
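For example, to have ntpd combine fewer sources you might add this to ntp.conf:

tos maxclock 7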

The code is available in:

https://people.nwtime.org/hart/ntp-dev-3792-msm.tar.gz

One concern to watch for is the possibility of increased queries to the pool.ntp.org DNS servers. That effect fades as the client settles on a set of sources that all survive and potentially contribute to the time solution.

Feedback here or to hart@ntp.org or questions@lists.ntp.org is most welcome, even if only to say you’ve tried it and didn’t have any issues, so I get a feel for how much testing has been done.

Thanks in advance!

Dave Hart


Feedback from a helpful tester who didn’t see peers being removed in their log prompts me to add that you may want to turn on logging of peer events in ntp.conf:

logconfig +peerall
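If the peer events are hard to spot in the log, periodically checking the peer billboard also shows sources being added and dropped, e.g.:

ntpq -p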


Some people have reported failures building the tarball I provided, with an error message suggesting the use of CFLAGS=-fPIC. I have an updated tarball that should fix that problem; it also includes a couple of other bits of fine-tuning I’ve done.

https://people.nwtime.org/hart/ntp-dev-3792-msm-v2.tar.gz
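For anyone who hasn’t built ntpd from a tarball before, the usual autoconf sequence should work (the extracted directory name is assumed here; adjust it to match the tarball contents):

tar xzf ntp-dev-3792-msm-v2.tar.gz
cd ntp-dev-3792-msm-v2
./configure
make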

All build and run testing is appreciated, particularly with pool in ntp.conf.

Cheers.
Dave Hart

Maybe I don’t understand the change correctly, but I suspect this breaks the pool’s load balancing. If all pool clients keep replacing servers until all of their servers pass the clustering algorithm (i.e. have similar offsets), I think some servers’ traffic would depend more on their location and network asymmetries relative to other servers than on their speed setting.


I feel there should be a mechanism to remove the oldest ephemeral associations, on the premise that a server might still serve accurate and precise time even though it has moved out of the pool for some reason. I made at least one post somewhere[1] suggesting this should happen roughly weekly.

[1] Thoughts on networking and threads...

Is that necessarily wrong, though? Surely on an NTP client’s priority list the goals of reducing the overall load on the pool or obeying some notion of load fairness cannot rank ahead of providing accurate time… It seems to me this is one case where the goals of the pool’s operators and consumers don’t quite line up.

I kind of agree, but accidentally DDoSing high quality pool servers isn’t good for operators or consumers either.


I don’t think we’d see servers being DDoSed by clients homing in on those that provide better service to a particular client. Different clients with different network paths are going to see different servers as high quality.

Please consider giving the code a try yourself and commenting on the changes before they become part of a general release. This is the best time to make design changes.

I’m interested in any problems seen building the code, of course, but I’m also willing to post Windows binaries if anyone wants to try it on Windows (Bas?) but doesn’t have the tools to build it from source.


I would guess it’s probably not a major problem, but I could imagine that a stratum 1 server in a good data center might be strongly preferred over, say, a decent stratum 4 server on a home cable connection, even if they both provide generally acceptable service.

I could imagine that strange things might happen in a medium-sized country with a small, uneven distribution of NTP servers.

:person_shrugging:


As a general comment, the NTP Pool isn’t the only user of the pool setting. Operators of services like ntp.ubuntu.com or internal corporate NTP servers might have their own opinions for or against changes.


Edit: Personal opinion: I worry that replacing servers after 10 poll intervals could lead to a client with, say, 10 roughly equal servers selecting different ones every few minutes as network jitter changes, replacing several of them every 640 seconds, and getting stuck at a 64-second poll interval without ever settling down. It would effectively degenerate into a basic SNTP client: always using the newest servers, generating more DNS and NTP traffic, and never achieving proper clock stability.

If that’s possible, it could significantly change dynamics, and lead to higher traffic from some clients.

Also you know exactly how the code works and I don’t, and you have been thinking about this a lot longer than I have. So. :person_shrugging:

Unless things have changed dramatically since I was on the team that maintains ntp.ubuntu.com (I left in 2019), I can tell you that this sort of algorithm tweak isn’t really on their radar. They only serve out of a very limited number of locations, so if, for example, their servers in Europe are preferred by European clients and their US servers by American clients, it would likely be viewed as a positive.

It seems like capturing some network traffic and doing some statistical comparisons on the before & after states would answer that fairly definitively. @davehart any plans for that?


But the only way to know how widespread deployment will work is with kinda widespread deployment, and then you’re kinda stuck with it. :grimacing:

Edit: I’m sorry if I sound like I’m crapping on everything. This is interesting and good work! There’s just a certain, kind of unavoidable degree of risk.

You definitely don’t sound like that. A degree of caution is always warranted when changes can affect an Internet’s worth of client systems. :smiley:


The increase in DNS traffic is difficult to predict too, so I think at first this should be a new option, not something enabled by default.

Even from the client’s point of view, I’m not yet convinced this is generally a good thing to do. It might improve the stability of the combined offset (though not its accuracy), but couldn’t it also lead to a less robust selection, e.g. because the surviving servers are more likely to be in the same network or to share upstream sources?


You’re worrying about something the proposed change doesn’t do. Please give it a whirl:

https://people.nwtime.org/hart/ntp-dev-3792-msm-v2.tar.gz

Specifically, it preempts one server when there are maxclock associations and it’s been at least ten minutes (default) since the last preemption. Multiple servers are not ejected at the same time.

Cheers,
Dave Hart


Not so far. I’m not in a good position to capture much traffic. Some things could be learned from looking at pool project DNS query rates, but uptake of new code in ntpd can be slow.

I anticipated and fixed the ntpdc -c monlist amplification issue in 2010 by removing monlist, adding ntpq -c mrulist, and, not long after, defaulting to disabling all responses to ntpdc/mode 7 queries in ntpd. That code was in ntp-dev and didn’t make it to ntp-stable until 2014, still before the wave of monlist abuse, but not far enough ahead to get substantial uptake in time. Now all sorts of people have the impression that all ntpq queries are dangerous and should be blocked from everything but localhost…

It will take quite a while to see the effect of this change become widespread, though hopefully the code won’t languish in ntp-dev for as long before making it into a stable release.
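For reference, the kind of localhost-only lockdown many operators now apply looks roughly like this in ntp.conf (illustrative, not a recommendation):

# refuse queries (among other things) from everyone...
restrict default kod limited nomodify notrap nopeer noquery
# ...except localhost
restrict 127.0.0.1
restrict ::1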

Cheers,
Dave Hart

I think the pool keyword (in ntpd, chrony and others) should “rotate servers” in two ways:

  1. When a server is markedly unhealthy (not responding, falseticker, …) for a period of time.

  2. On a MUCH slower schedule, take the oldest IP obtained via the “pool” keyword and replace it with a new one. The schedule could be set so that every IP is replaced every 8 weeks, for example; with 4 IPs, that means rotating one every 2 weeks.

The objective would be to help the load feedback mechanism (so old servers don’t accumulate ntpd clients) and to allow servers to leave the system without having to turn off service.


If you’re going in that direction, it would also make sense to limit caching of pool DNS results. (Maybe to 1 week, to pick a random number without thinking it through.) I just had an NTP server try to solicit a “new” pool IP that I deleted from DNS last year!

Edit: This example might multiply DNS traffic from “a couple queries a year” to “scores of queries a year”, though…

Pool DNS records have a 150s TTL. Try: dig pool.ntp.org

ntpd doesn’t see the TTL when resolving hostnames. In fact, not all hostname resolution even uses DNS. And ntpd hangs on to extra IP addresses until it can use them, so if that ntpd had been running for a year, this is expected behavior as far as I’m concerned.

Yeah. But if you make a change like the one in Ask’s post to recycle servers more rapidly, the current IP caching would partly work against it.

It doesn’t matter, because the TTL is for the DNS server to decide whether to query again or not.
A client never checks the TTL of a DNS record. Hostname resolution is always a BIND or DNS-cache concern, never something a client should worry about.

Sorry mate, you are wrong on this.
Also, you can tell ntpd to use IPs instead of resolving hostnames; then you don’t have this problem.
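For example, explicitly configured numeric sources (the addresses below are documentation placeholders) bypass hostname resolution entirely:

server 192.0.2.10 iburst
server 198.51.100.20 iburst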