Gradually add/remove server to/from pool in parallel to score increase/decrease

The point of this exercise was not to get a global view of the pool system. It was to prove the point that if using the global pool name pool.ntp.org, a client still only gets allocated servers from the local country zone, maybe a bit beyond. But not really from the overall global pool. Precisely due to “how the pool DNS service accounts for the locale of the query source, as seen by the pool.ntp.org authoritative nameservers”, modulated by how/whether intermediate nameservers pass through a client’s locale to the authoritative nameservers.

I.e., if a country zone is under-served, using the global zone does not help/make much of a difference as compared to using the country zone. I.e., the load is not spread across servers beyond the country zone (at least not to the degree that has been claimed, e.g., earlier in this thread).

I.e., clients in a country may still get bad service if there are too few servers in the zone and those are overloaded. And servers in a country may still get overloaded, even if all clients were using the global zone, because clients from the country still only get assigned to servers from the country zone, rather than servers being drawn from all servers available in the global zone. The latter is the target that Ask aims for with the changes he is mulling over; it is not yet reality.

Please note that the original finding was made as part of the paper and associated blog post referred to earlier. I was just giving my results as a practical hands-on example of the effect reported in the paper, and encouraging people to try it for themselves if they don’t believe the finding: there is a simple, very illustrative Python script contained in the blog post. And/or read the full paper.

Only difference is, if I recall correctly, that for the paper, the nameservers for the pool were asked directly, while I was going through intermediaries. But the effect is essentially the same, except in the latter case, due to how some intermediaries operate, one may get slightly more servers. Or, if the intermediaries pass the source information through to the GeoDNS servers, one gets exactly the same servers as from the GeoDNS servers themselves.
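
For anyone who wants to try the "through intermediaries" variant, here is a minimal sketch of such a tally. It is not the script from the blog post. It assumes the operating system's configured resolver is the intermediary, and the names `system_resolver` and `tally_answers` are made up for this example. Local caches (e.g., a stub resolver on the same machine) will reduce the variety of answers seen.

```python
import socket
import time
from collections import Counter


def system_resolver(name):
    """Return the set of addresses the OS resolver hands out for `name`."""
    infos = socket.getaddrinfo(name, 123, proto=socket.IPPROTO_UDP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address for both IPv4 and IPv6.
    return {info[4][0] for info in infos}


def tally_answers(name, rounds, resolve=system_resolver, pause=1.0):
    """Query `name` `rounds` times and count how often each address appears."""
    seen = Counter()
    for i in range(rounds):
        seen.update(resolve(name))
        if pause and i < rounds - 1:
            time.sleep(pause)  # roughly one query per `pause` seconds
    return seen


# Example use (performs real DNS lookups):
#   for address, count in tally_answers("pool.ntp.org", rounds=60).most_common():
#       print(count, address)
```

The `resolve` parameter is there so a different lookup method (e.g., one that asks a specific nameserver) can be swapped in without touching the counting logic.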

My original point with this thread was to propose a mechanism that could help servers in under-served zones. And I was mentioning the limitation of the global zone only because the claim was made earlier in this thread that there was no such thing as an under-served zone, if only every client were to use the global zone.

This is indeed what happens.

Alternative solution: instead of changing scores and adding a complex mechanism, why not simply change the zone creation algorithm?

  1. Why not simply get rid of country zones and have everyone fall back to the continent zones?
  2. For small continent zones (Africa, South America), add extra servers from elsewhere.
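
To make the two steps concrete, here is a rough sketch. It is only an illustration of the idea, not how the pool builds zones: the data structures, the `min_size` threshold, and the alphabetical choice of "extra" servers are all assumptions made for the example (a real implementation would presumably pick extras by network distance or capacity).

```python
def build_zone(country, servers, continent_of, min_size=4):
    """Pick the servers a client in `country` would be served from.

    servers:      dict mapping server address -> country code of the server
    continent_of: dict mapping country code -> continent name
    min_size:     hypothetical threshold below which a continent is "small"
    """
    continent = continent_of[country]
    # Step 1: ignore the country zone and use the whole continent instead.
    zone = sorted(a for a, c in servers.items() if continent_of[c] == continent)
    # Step 2: top up small continent zones with servers from elsewhere.
    if len(zone) < min_size:
        extras = sorted(a for a in servers if a not in zone)
        zone += extras[: min_size - len(zone)]
    return zone


# Made-up example data: two African servers, three European ones.
example_servers = {"a": "za", "b": "ke", "c": "de", "d": "nl", "e": "fr"}
example_continents = {"za": "africa", "ke": "africa",
                      "de": "europe", "nl": "europe", "fr": "europe"}
```

With this data, a client in `za` would get both African servers plus two from elsewhere, while a client in `de` would get the three European servers when `min_size` is 3.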

Ask has commented about this before.

Here are some of the servers that I would get from the continent zone if I were using the pool myself from Singapore (being a pool server, I use hand-picked static servers myself, so don’t care):

     remote                                   refid      st t when poll reach   delay   offset   jitter
=======================================================================================================
*218.186.3.36                            .GPS.            1 u   13   64  377   1.8075   0.0739   0.0543
#162.159.200.1                           10.69.8.5        3 u   27   64  377  34.2213   5.9066   0.0572
#162.159.200.123                         10.71.8.242      3 u   54   64  377  34.3150   6.0285   0.0946
-2606:4700:f1::1                         10.120.8.100     3 u   46   64  377   1.2941  -0.0307   0.0297
-2606:4700:f1::123                       10.241.8.163     3 u    1   64  377   1.1362  -0.2071   0.0310
+17.253.60.125                           .GPSs.           1 u   17   64  377   1.1089  -0.0861   0.9035
#51.16.235.6                             80.250.149.52    3 u   16   64  377 352.6277  -3.8519   0.6007
#62.228.228.9                            193.11.166.8     2 u   64   64  376 326.7766 -43.4005   7.0785
#194.225.150.25                          194.190.168.1    2 u    5   64  377 422.8539  -0.9276   0.2350
#115.84.157.6                            202.28.117.7     2 u    6   64  377  57.3866  -5.3612   2.0517
-103.214.22.185                          252.128.123.109  3 u   60   64  377  30.7888   0.4180   0.1407

So it goes without saying that this requires a strong reliance on the client implementation automagically sorting out the (perceived) falsetickers over time, as the pool command does, and the server command with chronyd, and what @davehart is planning for the server command on ntpd classic (I am not sufficiently familiar with NTPsec to comment on its behavior).

Yeah, nothing new, really. It is roughly clear in which direction the system needs to move/what needs to be done. Ask just needs to get around to doing it, considering the balance between his resources and his priorities.

Also, the more the challenged zones are backfilled with capacity from large Internet players, either truly locally or based on local anycast instances (as recently happened for the China zone), the less urgent potentially stability-sensitive changes to the core of the system become.

Yes, at least ntpd is designed for that, right? And systems that really care about precision should apply this to all NTP servers regardless of the servers’ location, or not use the pool at all, given that it’s a free service with no guarantees.
(but I may be missing something here)

We did a longer experiment lasting for a week (section 6.1 in here), using 131 clients against 6 NTP servers – most offsets were OK – and these were clients from Africa and other countries with few NTP servers in their pool country zone

The instructions for using the pool still suggest use of the server command for adding sources. To use the continent zones, typically, manual intervention is needed, so at least a few users will follow the guidance and set things up with the server command. With ntpd, use of the server command currently means that you’re stuck “forever” with the server that got selected upon command execution, even if it turns out to be bad, or goes away entirely.

EDIT: I now realize that the proposal wasn’t for people to explicitly configure continent zones instead of country or global zone, but for that to happen transparently on the infrastructure side. So the above is moot, assuming that most distributions will nowadays pre-configure the time daemons with the pool command, despite the guidance in the pool’s usage instructions (and fewer and fewer distributions use ntpd classic as the default time daemon anyway).

Sure, if one needs a certain level of service, then one needs to consider whether the pool is the right source.

Still, the pool commits to at least provide a certain minimum level of service, in the sense of having implemented certain safeguards (even though not explicitly spelling out, e.g., quantitative service guarantees).

E.g., the current zoning concept, however arguably considered outdated and even dysfunctional nowadays, was about giving clients “better” service by reducing the likelihood of path asymmetries, which by simple statistics goes up the longer the path is.

Or take the monitoring approach, which has recently been refined so that a server’s performance is now evaluated, at least at a rough level, from multiple vantage points. It is intended to prevent clients from getting assigned to servers serving bad time, having bad connectivity, or simply vanishing completely, and so is about at least a basic level of quality assurance.

So while the client is obviously always ultimately responsible for assessing whether the service it gets meets its needs, now simply saying that we allow assigning clients to servers where there is an elevated likelihood of suboptimal service would to some extent void all the other efforts to provide at least some minimal level of service.

I.e., I fully understand Ask not taking this lightly and just mapping some local zones to bigger zones, but taking a bit more time to think things through thoroughly, and to consider how some features that the pool currently provides, or some equivalent, can be maintained/reproduced still in a different configuration/setup.

It’s also about allowing clients to get assigned servers that do have bad international/intercontinental connectivity due to Internet backbone providers rate-limiting NTP traffic. (That was as much the “fault” of the ISPs used by the original monitoring servers, though.)

Would be interesting to conduct similar tests for AAAA records (IPv6 addresses). Too bad pool.ntp.org hasn’t got any, but 2.pool.ntp.org has.

Do you suspect it would behave any differently? Perhaps due to the lower number of IPv6 servers in Norway?

I was more thinking about how GeoDNS can handle IPv6 locations.

(On the other hand; if the validation method is as good or bad as GeoDNS, it may be hard to measure)

Here you go

391 2a03:94e0:ffff:185:181:61:0:91 NO, Norway
359 2a0e:dc0:4:9187::123 US, United States
339 2a01:799:191e:b700::1 NO, Norway
319 2001:470:7826::123 NO, Norway
240 2001:67c:24e4:33::56 NO, Norway
238 2a0d:5600:30:3e::bade NO, Norway
217 2606:4700:f1::1 US, United States
165 2606:4700:f1::123 US, United States
65 2001:67c:24e4:33::36 NO, Norway
59 2a02:ed06::197 NO, Norway
20 2001:67c:558::43 SE, Sweden

Worth noting: one of the American IP addresses it shows (2a0e:dc0:4:9187::123) is mine, and it is very much Norwegian. I believe 2606:4700:f1::1 and 2606:4700:f1::123 are Cloudflare.
The one Swedish server is only in the Norwegian zone, so my GeoIP was probably wrong on that too.

Edit:
Edited the script to scrape zones from the pool’s website rather than using GeoIP (but ignores Cloudflare)

391 2a03:94e0:ffff:185:181:61:0:91 @ europe no
359 2a0e:dc0:4:9187::123 @ europe no
339 2a01:799:191e:b700::1 @ europe no
319 2001:470:7826::123 @ europe no
240 2001:67c:24e4:33::56 @ europe no
238 2a0d:5600:30:3e::bade @ europe no
217 2606:4700:f1::1 Cloudflare
165 2606:4700:f1::123 Cloudflare
65 2001:67c:24e4:33::36 @ europe no
59 2a02:ed06::197 @ europe no
20 2001:67c:558::43 @ europe no

Again, very smart idea!

Here’s my view from Singapore over the last 30 hours or so, querying 8.8.4.4 for AAAA answers for 2.pool.ntp.org at roughly 1 second intervals:

65692 2001:470:ec42:60:49e2:5be6:b73:7449  US, United States @ asia sg
63416 2402:1f00:8000:800::36bb  SG, Singapore @ asia sg
50863 2606:4700:f1::123  US, United States @ ao ar asia at au be bg bh br ca cl cn cy cz de dk dz es europe fi fr gd ge gh gr hk ie in is it jp ke kh kz lk lu lv mg mn nc nl no north-america np nz oceania pl ps pt py qa ro ru se sg south-america sr tr ua uk us
49581 2606:4700:f1::1  US, United States @ ao ar asia at au be bg bh br ca cl cn cy cz de dk dz es europe fi fr gd ge gh gr hk ie in is it jp ke kh kz lk lu lv mg mn nc nl no north-america np nz oceania pl ps pt py qa ro ru se sg south-america sr tr ua uk us
28629 2406:2000:e4:a1f::1000  TW, Taiwan @ asia sg
28529 2406:da18:6d1:a600::be00:4  SG, Singapore @ asia sg
28307 2406:2000:e4:a1f::1001  TW, Taiwan @ asia sg
27959 2400:6180:0:d1::745:6001  SG, Singapore @ asia sg
27768 2406:da18:6d1:a600::be00:5  SG, Singapore @ asia sg
27411 2400:8901:e001:2ff:0:a789:b456:c123  SG, Singapore @ asia sg
27100 2001:678:8::123  NL, Netherlands @ africa asia au br ca de es europe fr in jp kr nl north-america oceania pl se sg south-america uk us za
14325 2001:df1:800:a003::123  SG, Singapore @ asia sg
7531 2406:da18:2b5:3807:7a9d:4fb9:536f:7dcb  SG, Singapore @ asia sg
3129 2403:cfc0:1113:a87d::1  SG, Singapore @ asia sg
2942 2405:fc00::123  SG, Singapore @ asia sg
1558 2001:19f0:4401:a75:5400:2ff:feeb:523e  SG, Singapore @ asia sg
371 2405:fc00::1  SG, Singapore @ asia sg
337 2400:6180:0:d0::1333:b001  SG, Singapore @ asia sg
326 2400:6180:0:d1::532:2001  SG, Singapore @ asia sg
320 2a04:3543:1000:2310:d862:f5ff:fe4e:1077  SG, Singapore @ asia sg
296 2400:6180:0:d0::14c8:a001  SG, Singapore @ asia sg
60 2403:2500:300::1a7  SG, Singapore @ asia sg
47 2a04:5201:8018::71  SC, Seychelles @ asia sg
20 2603:c024:4504:e8e8::beef  SG, Singapore asia sg
13 2406:da18:cdf:e900:9420:6c3:f3:7abc  SG, Singapore @ asia sg
10 2400:6180:0:d0::12:6001  SG, Singapore asia sg
4 2400:6180:0:d0::b9:3001  SG, Singapore asia sg

I.e., while GeoIP shows some addresses as located outside Singapore, all servers are registered in the pool’s Singapore zone (and a few additional country zones in some cases).
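
The check behind that statement is simple enough to sketch. This assumes the zone registrations per address have already been collected somehow (e.g., from the pool's website, as in the edited script above); the function name `outside_zone` is made up, and the sample data is just three rows copied from the listing.

```python
def outside_zone(zones_by_address, local_zone):
    """Return the addresses that are NOT registered in `local_zone`."""
    return sorted(addr for addr, zones in zones_by_address.items()
                  if local_zone not in zones)


# Three rows from the Singapore listing above: address -> registered zones.
observed = {
    "2402:1f00:8000:800::36bb": {"asia", "sg"},
    "2a04:5201:8018::71": {"asia", "sg"},
    "2001:678:8::123": set(
        "africa asia au br ca de es europe fr in jp kr nl north-america "
        "oceania pl se sg south-america uk us za".split()
    ),
}

# An empty result means every address seen is registered in the local zone.
not_in_sg = outside_zone(observed, "sg")
```

Running the same check against a different zone (say `de`) on the same data shows which of the observed servers are registered only locally.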

There are currently 26 IPv6 servers listed as active for the Singapore zone. Until an hour or two ago or so, I had seen exactly that many distinct IPv6 addresses. Now, there are 27 addresses, but still a good match.

A similar exercise for a vantage point in Germany, with 1.0.0.1 et al. as upstream resolvers of a local caching DNS resolver, yields 278 distinct IPv6 addresses over some 30 hours or so. Out of those, 21 are not in Germany according to GeoIP, but all are listed in the pool’s Germany zone (among potentially further country zones for some servers). There are currently 397 IPv6 servers listed as active on the pool’s website.

Here are the ones GeoIP locates outside Germany, or doesn’t list a specific country for:

4959 2a0e:fd45:d34::123  EU, Europe @ de europe
3703 2606:4700:f1::123  US, United States @ ao ar asia at au be bg bh br ca cl cn cy cz de dk dz es europe fi fr gd ge gh gr hk ie in is it jp ke kh kz lk lu lv mg mn nc nl no north-america np nz oceania pl ps pt py qa ro ru se sg south-america sr tr ua uk us
3527 2606:4700:f1::1  US, United States @ ao ar asia at au be bg bh br ca cl cn cy cz de dk dz es europe fi fr gd ge gh gr hk ie in is it jp ke kh kz lk lu lv mg mn nc nl no north-america np nz oceania pl ps pt py qa ro ru se sg south-america sr tr ua uk us
2577 2603:c020:800c:5600:2d86:2529:ebb0:376a  US, United States @ de europe
2236 2001:418:3ff::1:53  US, United States @ asia au de europe fr jp nl north-america uk us
1846 2001:418:3ff::53  US, United States @ asia au de europe fr jp nl north-america uk us
1637 2a12:edc0:4:be15::1  IP Address not found @ de europe
1618 2001:678:8::123  UA, Ukraine @ africa asia au br ca de es europe fr in jp kr nl north-america oceania pl se sg south-america uk us za
1428 2a12:8d02:2100:293:5054:ff:fe3a:161a  IP Address not found @ de europe
1405 2001:67c:dac::1  IP Address not found @ de europe
1395 2001:41d0:704:7800::1  FR, France @ de europe
1380 2a13:7840:ba5e::2  IP Address not found @ de europe
1305 2a13:7840:c0de::2  IP Address not found @ de europe
1046 2001:41d0:700:1324::  FR, France @ de europe
830 2603:c020:8009:b300:9dee:c6dd:1031:dd6d  US, United States @ de europe
725 2a06:e881:7300:1::123  EU, Europe @ de europe
483 2001:678:d50::dead:beef  IP Address not found @ de europe
453 2a06:e881:7301:1::123  EU, Europe @ de europe
108 2603:c020:800c:d707:7f91:116c:e90a:a5b0  US, United States @ de europe
99 2001:41d0:700:5bda::1:1  FR, France @ de europe
95 2603:c020:800b:d010:2::  US, United States @ de europe

Makes 100% sense on this one

Update on this: After now running for three and a half days or so, tallying 1150360+ AAAA records received from 8.8.4.4 for the global zone name 2.pool.ntp.org, the situation is still the same:

27 individual server IPv6 addresses seen, matching the peak of 27 servers listed as active in the Singapore zone over the last few days.

Similarly unchanged: Every server returned is registered (at least) in the Singapore zone. No server exclusively from outside the local zone was returned from the nameserver despite querying for the global zone.

This one (any.time.nl) is ours and I can guarantee you that it is not in UA. :face_with_monocle:

Interesting…

BTW, why do you ask for 2.pool.ntp.org at resolvers? If you do that every second, caching might be in the way, right? Why not ask one of the authoritatives, such as a.ntpns.org directly?

Because I wanted to see more what an actual client would see.

It is clear that a big service such as Google DNS is not monolithic, both due to the anycast aspect, as well as likely the internal realization of the service (e.g., multitude of individual server instances behind the anycast address even within a local service instance). So I found it interesting to see that in practice as well. E.g., even queries at very short intervals would return different addresses in subsequent queries (if the zone is large enough). And the TTL values would vary wildly, even across sequential queries.

That is different from a simple, local resolver, where the set of servers in sequential queries stays the same until the local cache entries expire, while one can observe the TTL decrease with each query. That’s what I see from my vantage point in Germany with a different setup. And that behavior is more in line with what one would expect from a “normal” caching resolver.
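
To illustrate what I mean by "normal" caching behavior, here is a toy model. It is not a real DNS client: `CachingResolver` and its `upstream` callable are made up for the example, and time is passed in explicitly so the behavior is easy to follow.

```python
class CachingResolver:
    """Toy model of a recursive resolver with one simple cache."""

    def __init__(self, upstream):
        self.upstream = upstream  # callable: name -> (addresses, ttl_seconds)
        self.cache = {}           # name -> (addresses, expiry_time)

    def query(self, name, now):
        """Return (addresses, remaining_ttl) for `name` at time `now`."""
        addresses, expires = self.cache.get(name, (None, 0))
        if addresses is None or now >= expires:
            # Cache miss or expired entry: ask upstream and store the answer.
            addresses, ttl = self.upstream(name)
            expires = now + ttl
            self.cache[name] = (addresses, expires)
        # Cache hit: same addresses as before, with the TTL counted down.
        return addresses, expires - now
```

Sequential queries return the same address set with a shrinking TTL until the entry expires, and only then does a new set show up. A large anycast service with many independent cache instances would instead look like many such resolvers being picked at random per query, which would explain the varying answers and TTLs.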

Though, the point of this exercise was specifically to see how intermediate servers impact how clients “see” the pool through those intermediaries. Google DNS seems to pass the client location info through as is. At least the effect is that a client only sees pool servers from the zone it itself is located in. Just as if it were querying the authoritatives directly.

I still need to check what the behavior is with Quad9. In earlier tests, I had seen slightly more servers than active for the local zone. That seems to indicate that it might be mixing in at least some servers from outside the local zone. But need to go back to better see why that is, where those servers came from (using the two geolocation methods suggested by @Badeand, especially looking at the zones a server is registered in). Though even here, at least the numbers suggest that this mixing in is a far cry from encompassing the full 1.6k something breadth of the global zone for IPv6.

1.1.1.1 seems similar to Google, in that it also only returns servers from the local zone of the client. But as I have a local cache in between in this case, this test needs to run longer to offset the impact of the caching (which limits the variability of responses, even if 1.1.1.1 itself would vary the same way Google does).

Just for completeness: Re-running the test with Quad9 over the past two days, looking at the number of unique server addresses seen during that time, and zones that server addresses found are registered in confirms the previous impression of how this recursive resolver impacts clients’ view of the pool’s global zone:

23 server addresses registered in the Singapore zone seen, slightly below the number of servers listed as active in that zone right now. Exact match for Hong Kong, with the number of 9 servers listed as currently active matching the number of addresses seen and registered in Hong Kong. And with 20 vs. 21 for Japan also a pretty good match. Overall, 47 unique server addresses seen, with that number being smaller than the sum of the above as servers registered in multiple zones are counted multiple times in above numbers.

The above to some extent is in line with expectations: Quad9 does not support EDNS Client Subnet (ECS) for privacy reasons (except on one specific service address dedicated for that purpose). I.e., the pool’s authoritative nameservers cannot rely on the geo-information associated with each client’s IP address, but can only indirectly infer the client’s country zone from the IP address of the resolver contacting the authoritatives. And those, by design or just how it works out based on Quad9’s infrastructure (or maybe even just erroneous data in the GeoIP database) are associated with the three country zones mentioned.
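
A simplified sketch of that decision, to show why the resolver's own address ends up determining the zone when ECS is absent. This is an assumption about the general logic, not the pool's GeoDNS code: the prefix table uses documentation address ranges with made-up country assignments, and real ECS (RFC 7871) carries a truncated client subnet rather than a full address.

```python
import ipaddress

# Hypothetical GeoIP table: prefix -> country zone (documentation prefixes).
EXAMPLE_GEO = {
    "192.0.2.0/24": "sg",
    "198.51.100.0/24": "hk",
    "2001:db8::/32": "jp",
}


def locate(address, geo_table):
    """Look up the country for `address` in a prefix -> country table."""
    ip = ipaddress.ip_address(address)
    for prefix, country in geo_table.items():
        net = ipaddress.ip_network(prefix)
        if ip.version == net.version and ip in net:
            return country
    return None  # unknown location


def zone_for_query(resolver_ip, geo_table, ecs_client_ip=None):
    """Pick the country zone for one incoming DNS query."""
    # With ECS, the authoritative learns about the client's network.
    # Without ECS, all it sees is the recursive resolver's address.
    source = ecs_client_ip if ecs_client_ip is not None else resolver_ip
    return locate(source, geo_table)
```

In this model, a client in one country using a resolver that does not send ECS gets whatever zone the resolver's egress address maps to, which matches seeing a handful of "neighboring" zones rather than the whole global pool.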

Interesting that Japan and Hong Kong are the only two other countries seen in this test. But thereby even more strongly underlining: Even when using the global pool zone, clients will only ever be assigned servers from their own country zone. And only in exceptional cases (e.g., use of a recursive resolver that does not support ECS) will servers from a very limited number of other zones be assigned to clients. 47 certainly is a far cry from the full 1.5k+ servers listed in the global IPv6 pool.

I.e., this again underlines that use of the global zone pool.ntp.org (or its IPv6-enabled variant 2.pool.ntp.org) does not tap into the full global pool of servers, but behaves exactly the same as the client’s own country zone, maybe extended very slightly by one or more “close-by” country zones.

I.e., pool.ntp.org does not provide enough servers for countries that have few servers of their own, because even when using the global zone, the pool itself currently limits clients to the servers of their own country, including in countries that have few servers of their own.

This topic reminds me of something I believe I’ve mentioned before. And although it might only be slightly related to this topic, perhaps it can’t hurt to mention it again:

There are probably a lot of IoT-like devices without an NTP daemon that only do an SNTP query at regular intervals, e.g., at the top of every hour or so (the DNS Queries graph seems to confirm this).

Suppose this happens in the network of a large ISP that does CGNAT for IPv4. And suppose these devices query for a name (like pool.ntp.org) that only has IPv4 records. Then what might happen is this: they all contact the ISP’s resolver at almost the same time. The resolver does a lookup and returns (from the cache) the same answer to all devices, which subsequently all issue an SNTP request to the same small set of IPv4 addresses. Some clients might get a reply, but for a number of them rate limiting might kick in.
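
A back-of-the-envelope model of that scenario, just to put numbers on it. Everything here is an assumption for illustration: a single shared CGNAT address, one cached DNS answer, an even spread of devices over the returned addresses, and a flat per-source limit per burst (real rate limiting is time-based and varies per server).

```python
from collections import Counter


def simulate_burst(num_devices, cached_addresses, limit_per_source):
    """Return (answered, dropped) for one synchronized burst of requests.

    All devices sit behind one CGNAT address, so each NTP server sees the
    whole burst as coming from a single source IP and applies its
    per-source limit to it.
    """
    requests = Counter()
    for device in range(num_devices):
        # Every device got the same cached DNS answer; spread them evenly
        # over the returned addresses.
        server = cached_addresses[device % len(cached_addresses)]
        requests[server] += 1
    answered = sum(min(n, limit_per_source) for n in requests.values())
    return answered, num_devices - answered
```

For example, 1000 devices sharing a four-address answer with a hypothetical limit of 50 requests per source would see 200 requests answered and 800 dropped in this model.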

What do you think? Is this a realistic scenario? How would it fit into this thread, or is it off-topic?

I agree it’s realistic and probably a problem but I don’t think there’s much the Pool can or should do about it. (Except for enabling IPv6, which would help but not fully solve it.) It’s a fundamental problem with IP-based limits, DNS and NAT. Lower TTLs, higher rate limits and returning more IP addresses might have an effect but there are tradeoffs and that would be a complicated discussion.

On the other hand, I don’t think we’ve heard anyone from the other side come and say “hi, I rebooted 10,000 VMs behind NAT simultaneously and it was almost impossible for them to get time,” so maybe it’s not enough of a problem for anyone to notice.