Regulating the load of the NTP servers

There is a need to regulate the load an NTP server receives, and not relative to other NTP servers as it is today, but in absolute terms. An NTP server owner must be able to declare the maximum number of queries their server is willing to handle during a given time interval. This would guarantee that the volunteer’s infrastructure does not get overloaded, which could otherwise eventually lead the volunteer to leave the pool.

That would also require a regulatory loop implemented in the pool infrastructure. The basic information required for this feedback loop is the absolute load of the NTP servers, already discussed in detail in this thread:

The ratio between the actual load and the declared maximum load of an NTP server would be a key factor for the geoDNS server when deciding how frequently the IP address of that server appears in DNS replies to pool queries.
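To illustrate the idea, a sketch of how such a ratio could translate into a selection weight; the names and the simple linear back-off are made up, not anything the pool actually implements:

    # Hypothetical sketch: load ratio -> geoDNS selection weight.
    def selection_weight(actual_qps: float, declared_max_qps: float) -> float:
        """Weight for handing out a server's IP in DNS answers: proportional
        to remaining headroom, zero once the declared maximum is reached."""
        if declared_max_qps <= 0:
            return 0.0
        return max(0.0, 1.0 - actual_qps / declared_max_qps)

    # A server declared for 2000 queries/s, currently serving 1500 queries/s,
    # keeps a quarter of its weight relative to an idle server:
    print(selection_weight(1500, 2000))  # 0.25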

I am proposing an action plan item: discuss the implementation details of NTP server load feedback in the already quoted thread.


There are many questions to be answered, and challenges to be overcome, but if it were possible, I think it would be a good thing. It would help servers in some areas not get bored at the “3Gbit” setting, while helping servers in other areas not get overloaded by many Mbit/s of traffic even at the currently lowest setting of “512kbit”. It could help grow the number of smaller servers added to the pool, despite structural challenges in some regions.

At the same time, this does not address one crucial challenge:

What to do when the demand is higher than the available capacity as per such fixed settings? Where does the excess go? Can it just be ignored? Maybe hand out not four records per IP protocol version, but only one, so clients use fewer servers simultaneously (as far as the implementation supports it, and users don’t manually try to override it)?

Or…?


I do not think it is a crucial question, but it is still a valid one. There are multiple possible ways to go. One option, a radical one, is to return DNS answers without address records for the excess requests. Protecting the volunteers’ infrastructure is more important than providing the time service.

Indeed that would arguably be a somewhat radical approach, to essentially refuse service to clients.

But it could simplify the implementation, at the cost of not being exact in the sense of having a feedback loop controlling the actual traffic. Using the rate of DNS requests as a rough measure of current demand, periodically calibrated against the actual traffic seen by a subset of servers reporting their load, a mechanism similar to today’s could be used. Only it would not hand out server IP addresses in proportion to the current netspeed setting vs. the sum of all netspeed values, but would match them against the estimated rate.

I think some of the work done by Ask already goes somewhat in that direction, e.g., recording the number of times a server’s IP address is returned in responses, and comparing that to the netspeed fraction (the relatively new “Client distribution” section on a server’s management page). That number of DNS responses containing a server’s IP address would then just need to be mapped to the traffic rate it is estimated to cause on the server.
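Roughly, the calibration could look something like this sketch; the factor, the sample numbers, and the function names are invented for illustration:

    # Sketch: estimate the NTP load a given DNS answer rate causes, calibrated
    # against a subset of servers that report their actual traffic.
    def calibration_factor(samples: list[tuple[int, float]]) -> float:
        """samples: (dns_answers_per_hour, observed_ntp_qps) per reporting server.
        Returns the average NTP queries/s caused per DNS answer/hour."""
        total_answers = sum(answers for answers, _ in samples)
        total_qps = sum(qps for _, qps in samples)
        return total_qps / total_answers if total_answers else 0.0

    def estimated_qps(dns_answers_per_hour: int, factor: float) -> float:
        """Rough NTP query rate expected on a non-reporting server."""
        return dns_answers_per_hour * factor

    # Invented numbers: three reporting servers calibrate the zone, the factor
    # is then applied to a server that only knows its DNS answer count.
    factor = calibration_factor([(36000, 800.0), (18000, 420.0), (9000, 190.0)])
    print(round(estimated_qps(12000, factor), 1))  # ~268.6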


Some DNS requests would result in one NTP client, others in thousands of clients (or orders of magnitude more), depending on whether the query originates from someone’s own DNS resolver or from a big DNS resolver like 1.1.1.1 or 8.8.8.8.
Measuring the DNS traffic with calibration servers is probably better than nothing, but I am not sure about its reliability.


Just brainstorming further. What if server operators could define the frequency with which the IP address of their server may appear in DNS answers, instead of defining the bandwidth relative to the global load as it is today? Server operators could then really fine-tune their own load, eliminating the need for complex logic to create a proxy for the actual load of the NTP servers.

Yes, sure, that is what is going on behind the scenes. But this fine-grained picture isn’t what is actually driving this. Given a large enough population, as can be assumed for the pool, including most zones, this will statistically boil down to some average value.

Sure, a closed-loop control system will typically be more accurate than an open-loop one. But it is also more effort and more complex.

Sounds like what I proposed, except that the input target value has a different unit: “frequency the IP address of their server could appear in DNS answers” rather than “bitrate value scaled to the frequency the IP address of their server could appear in DNS answers”. I.e., similar to what is currently called “netspeed”, misleadingly given in some kbit, Mbit, or Gbit, but with a better name, and not relative.

That would get rid of the (complexity of the) automatic feedback loop, making the feedback loop manual operator intervention instead, which could be good enough.

In a way, that is what we have today, except that currently part of the equation is a bit of a moving target, at least in underserved zones, and the granularity of the inputs doesn’t really give operators full control. But in well-served zones, that is pretty much the current situation: set a “netspeed” value, observe the traffic, and adjust the value until you get the desired throughput. In a well-served zone, that should remain sufficiently stable for some time, until traffic patterns change, e.g., the effect you criticised above, where the mapping of DNS rate to actual traffic shifts as the resolver landscape/usage changes.


Sorry that I overlooked that you already had the same idea.

Thinking about it some more, it is not just the frequency with which the IP address can appear in DNS answers, but that frequency multiplied by the validity of the DNS answer.
Explanation: if the geoDNS server answers a big DNS resolver with double the previous TTL value, it will likely generate double the quantity of NTP client accesses.
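A toy calculation to make the point; the notion of an “exposure” value and the numbers are purely illustrative:

    # Sketch: a server's exposure is not just how often its address is handed
    # out, but that rate multiplied by how long each answer stays cached.
    def exposure(answers_per_hour: float, ttl_seconds: float) -> float:
        return answers_per_hour * ttl_seconds

    # Same answer rate, doubled TTL: roughly double the expected client traffic.
    print(exposure(1000, 150))  # 150000
    print(exposure(1000, 300))  # 300000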

Keep in mind the big DNS resolvers are anycast, with the actual resolvers spread around the world in many different data centers in cities that are major Internet hubs. In the US, for example, that’s typically at least the SF Bay area, Seattle, Chicago, Dallas, Miami, DC, and New York. It’s probably more than those 7 for both 1.1.1.1 and 8.8.8.8 (and 9.9.9.9, quad9.net). Those actual resolvers will be querying the authoritative geoDNS servers from the region of the country they serve; it’s not like all 1.1.1.1 queries come from one city, and similarly for the others mentioned.

Also keep in mind some global open recursive DNS servers pass the client subnet along to the authoritative server to allow better selection of responses close network-wise to the client subnet. I think Google’s 8.8.8.8 doesn’t for client privacy reasons, and quad9.net has multiple sets of anycast IPs so you can choose whether they pass on client subnet or not. I’m not sure about Cloudflare’s 1.1.1.1. Again, for all of them the queries will come from the nearest datacenter to you network-wise where they operate thanks to the anycast architecture, so the answer will be ballpark for the right region even without the EDNS client-subnet.

As for the suggestions here to allow server operators to get a guarantee of actual traffic levels, the pool doesn’t have the information to make that possible. It would require the servers to open up their management protocol to some set of IP addresses representing the geoDNS servers or some central pool infrastructure, so the pool could query the traffic level. For ntpd, that’s not difficult to do with ntpq and restrict statements limiting who can use its mode 6 queries, if the pool were to decide to implement that approach. Something similar could be configured for chrony via chronyc, I bet; by default that is available to localhost only via Unix domain sockets, but I believe it can be opened up to a limited set of IP addresses over the network as well for this type of implementation. Still, it’s a chunk of work for @Ask to implement, for just those two NTP server implementations, plus the geoDNS changes to go with it.
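For illustration only, a minimal sketch of what that opening-up might look like, assuming a hypothetical pool monitoring address of 192.0.2.10; nothing like this is currently required by the pool:

    # ntp.conf: keep mode 6/7 queries blocked by default, but let the
    # hypothetical monitoring address run read-only ntpq queries
    restrict default kod nomodify notrap nopeer noquery limited
    restrict -6 default kod nomodify notrap nopeer noquery limited
    restrict 192.0.2.10 nomodify notrap nopeer

    # chrony.conf: listen for command packets beyond localhost and allow
    # the same hypothetical monitoring address
    bindcmdaddress 0.0.0.0
    cmdallow 192.0.2.10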

I speculate he’s going to prioritize his existing plans to improve the zone-collapse situation by essentially making all geographic zones equivalent to the global zone, and by making geoDNS better at returning a full set of servers as close (network-wise) to the recursive resolver or the client IP subnet as possible, rather than restricted to the country zone of the recursive DNS server or of the client indicated via EDNS client subnet, based on GeoIP lookup rather than on BGP as with the anycast services.


It is possible to guarantee a stable average traffic level based on the DNS answer rate for the IP address of an NTP server. However, because the DNS answer rate is just a proxy for the actual load, and has high variability depending on which DNS resolver is querying, the resulting traffic level will have low accuracy. Sure, the pool system would have to know the actual load of the NTP server to make the traffic level precise.
The main point is eliminating the dependency of a given server’s load on the number of NTP servers available. That would remove the chance of the avalanche effect we have already seen within multiple country zones.

For what it’s worth, from what I understand of what @Ask is planning based on his comments on this site, it too will help with the zone collapse issue, and I expect he’s thinking about that problem as well and may have ideas about including more distant servers in the response when the number of active servers nearby is below some threshold.

To be clear, I am not involved in operation of the pool. I saw at least one comment elsewhere that seemed to suggest the writer thought I was.

That is a good thing. However, we should avoid the whole pool worldwide getting into trouble because an action makes the situation better locally (in a country) but worse globally. The code must be ready to handle the already mentioned theoretical edge case where there is simply not enough capacity: the existing NTP servers must not be overloaded due to the shortage of overall capacity.


Actually they do.

    dig +short whoami-ecs.lua.powerdns.org TXT @8.8.8.8


Thanks for that pointer. At some point I came across a webpage offering similar functionality. I searched for it fruitlessly as I was responding. For the record, 1.1.1.1 and 9.9.9.9 do not share the client subnet with the authoritative server, at least not the nodes I’m reaching from Maryland, US.

As an aside, for folks running Unix, consider making your internal clients use a shared resolver which itself queries one of the big three over an encrypted channel. For those on Windows 11, MS has made it really easy to manually configure a DNS override and use DNS over HTTPS to contact well-known DoH resolvers like these, by suggesting the correct template when you flip the encrypted DNS switch on.

Encrypting your DNS traffic before it hits your ISP deprives them of the ability, widely used in the US, to snoop on DNS traffic and feed that information back to advertisers or data brokers as an additional revenue stream.


The odious systemd allows easy configuration of its DNS client/server (systemd-resolved) to use DNS over TLS on Linux. It does add latency to queries, though that can be mitigated a bit by its caching of replies.
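For example, a minimal sketch; the upstream resolvers here are just placeholders, pick your own:

    # /etc/systemd/resolved.conf
    [Resolve]
    DNS=1.1.1.1#cloudflare-dns.com 9.9.9.9#dns.quad9.net
    DNSOverTLS=yes

    # then: systemctl restart systemd-resolved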

Setting the DNS answer rate for a given server may not always work properly to regulate the load on that server. An event similar to the current one with Yandex’s device (the collapse of the Russia zone, ru.pool.ntp.org) might drastically increase the load on a server without any change in the DNS answer rate.
However, continuous per-zone calibration may still help.
