Optimizing server distribution across country zones

Hi all,

As far as I am aware, the current default server distribution works like this: each active server is included in the zone for the country it is in, the zone for the continent it is on, and (given sufficient netspeed) the global zone.

When a client asks for servers, the response is (roughly sketched in code after this list):

  • if there is at least one server in the country zone of the country the client is asking from, up to four servers from that country zone,
  • if there are no servers in the country zone but there are some in the encompassing continental zone, up to four servers from the continental zone,
  • if there are no servers in either the country or the continental zone, up to four servers from the global zone.
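For illustration, the chain could look roughly like this (zone contents are made up, and the real GeoDNS weights its random selection by netspeed, which I leave out here):

```python
import random

# Hypothetical zone -> server mapping. The real pool weights the
# selection by netspeed; plain sampling is used here for brevity.
zones = {
    "de": ["s1", "s2", "s3", "s4", "s5"],
    "europe": ["s1", "s2", "s3", "s4", "s5", "s6", "s7"],
    "@": ["s1", "s4", "s6", "s8"],  # global zone
}

def select_servers(country, continent, count=4):
    """Walk the country -> continent -> global chain and return up to
    `count` servers from the first zone that has any."""
    for zone in (country, continent, "@"):
        candidates = zones.get(zone, [])
        if candidates:
            return random.sample(candidates, min(count, len(candidates)))
    return []

print(select_servers("de", "europe"))  # answered from the country zone
print(select_servers("pl", "europe"))  # falls through to the continent
```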

The main issue for this thread is that the different countries do not balance each other out. In countries with few servers, these are often overwhelmed, and sometimes an entire zone collapses when the servers can’t keep up with the request rate and get demoted by the monitoring. Meanwhile, in countries with lots of servers, the servers still have unused capacity. As an interesting side effect, coverage in countries with no servers is sometimes more stable than in countries with a small number of servers, since the load gets distributed across the entire continent. All of this has been discussed in different threads here before.

The goal of this thread is to collect and discuss different approaches on how to distribute the available servers across the country zones, so we achieve a better coverage and better utilization of the resources made available in the pool.

Parameters to think about:

  • How do we select servers to support underserved zones? Simply by continental grouping? Geographic adjacency of countries?
  • What is the impact on the servers used as support? How do we prevent a scenario where we degrade a not yet underserved zone by adding too much load from nearby underserved zones?
  • How do we handle the difference between zones? Does an optimization approach fit for large and small zones alike?
  • How do we balance higher availability from adding more servers to a zone against higher latency from adding servers that are farther away?
  • How do we define which zones are actually “underserved”? A certain number of DNS requests per server? A certain number of DNS requests in proportion to netspeed? The number of servers?
  • How much complexity do we want to introduce?

I want to explicitly exclude the topic of IPv4 and IPv6 from the scope of this thread. Any server distribution algorithm should work well for both protocol versions, as I assume there will be both IPv4-only and IPv6-only clients for the foreseeable future, where the distribution of servers in one protocol version is irrelevant to the quality of distribution in the other.

I want to present two options that I have thought about. Both are designed purely on the NTP Pool management level without having to introduce new features to the GeoDNS servers.

The first approach was suggested in Minor new features on the website - #9 by ask and is to define a minimum number of servers that should be in a zone. If there are not enough servers in a given zone, then servers from the surrounding zones could be added.

For the specific implementation: the Pool could calculate the average netspeed of the servers directly mapped to the zone, then add all servers (or at least all that are not themselves in an underserved zone) from the surrounding continent to the zone and scale their netspeeds down so that, in the end, they add a total netspeed equal to what the “missing” servers would contribute if they all had the average netspeed.
Pros: Low complexity, works with information already present in the core Pool database, no change for zones that are already empty or have a huge number of servers.
Cons: Static, might overload other zones in the same continent.
Variation: Define a list with the minimum server count per zone based on the DNS statistics, to account for differences in Pool usage by country.
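A minimal sketch of this scaling, assuming servers are simply (name, netspeed) pairs (all names and numbers are made up):

```python
def support_weights(zone_servers, continent_servers, min_servers):
    """Sketch of the minimum-server-count approach. Returns the
    per-server weights for one zone."""
    if not zone_servers:
        return {}  # empty zones keep the current continental fallback

    missing = min_servers - len(zone_servers)
    if missing <= 0:
        return dict(zone_servers)  # the zone is not underserved

    # The support servers should together add as much netspeed as the
    # "missing" servers would at the zone's average netspeed.
    average = sum(ns for _, ns in zone_servers) / len(zone_servers)
    target = missing * average
    support_total = sum(ns for _, ns in continent_servers)
    scale = target / support_total

    weights = dict(zone_servers)
    for name, netspeed in continent_servers:
        weights[name] = weights.get(name, 0) + netspeed * scale
    return weights

# A zone with two servers and a minimum of five:
zone = [("pl1", 10), ("pl2", 30)]
continent = [("de1", 100), ("fr1", 50), ("nl1", 250)]
print(support_weights(zone, continent, min_servers=5))
# The three continental servers together add a netspeed of 60,
# i.e. three "missing" servers at the zone average of 20.
```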

Another approach could be to calculate the “load” on a zone based on the number of DNS requests the zone gets in proportion to the “amount” of netspeed it has. Servers from zones with below-average load would then be added to nearby (= same continent?) zones with above-average load, again with a scaled-down netspeed, to support the above-average-loaded zones without needlessly dominating them. This could again be mapped on a continental level, and could even weight countries with a lower relative load to take on more of the support load.
Pros: Dynamic, scales with zone usage and server counts.
Cons: Needs DNS metrics in the Pool management algorithms, zone generation now depends on an “external” factor, might degrade service quality for “below average” zones that were already well served.
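As a sketch of the load metric (the DNS rates and netspeeds here are made-up numbers, not real statistics):

```python
def zone_load(dns_rate, total_netspeed):
    """Load of a zone: DNS requests per unit of netspeed."""
    return dns_rate / total_netspeed

# Hypothetical zones: {name: (DNS requests/s, total netspeed)}
zones = {
    "de": (500, 5000),  # load 0.1
    "fr": (300, 1500),  # load 0.2
    "pl": (400, 400),   # load 1.0
}
loads = {z: zone_load(*v) for z, v in zones.items()}
average = sum(loads.values()) / len(loads)

donors = [z for z, load in loads.items() if load < average]
needy = [z for z, load in loads.items() if load > average]
print(donors, needy)  # ['de', 'fr'] would support ['pl']
```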

For any approach, instead of just using all available servers of the continental zone, we could define a distance matrix indicating the distance between countries as a simple number abstracting geographical distance and internet interconnections. If a zone needs support, the support is provided by servers from all countries, but scaled so that servers from nearby countries get a higher netspeed.
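One way such a matrix could be applied, as a sketch (the distances and the inverse-distance falloff are made up, any decreasing function would do):

```python
# Hypothetical distance matrix: a single number per country pair,
# abstracting geography and interconnection quality (smaller = closer).
distance = {
    ("pl", "de"): 1,
    ("pl", "fr"): 2,
    ("pl", "pt"): 4,
}

def support_netspeed(netspeed, donor_country, target_zone):
    """Scale a donor server's netspeed by the inverse of the distance
    between its country and the supported zone."""
    d = distance.get((target_zone, donor_country),
                     distance.get((donor_country, target_zone), 1))
    return netspeed / (1 + d)

for donor in ("de", "fr", "pt"):
    print(donor, support_netspeed(100, donor, "pl"))
# de 50.0, fr ~33.3, pt 20.0 -> nearby countries carry more weight
```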

What are your thoughts on this topic? Do you have further input or ideas on how to improve the server distribution?

Just an observation: my server is in the United States but had clients from Brazil, India, the Dominican Republic, and Canada (among others, presumably).

My guess on those is that these are requests explicitly requesting another zone. So clients that have us.pool.ntp.org or north-america.pool.ntp.org configured, but are currently in another country.

While I totally support your effort, the reality of the past years is that little effort appears to have been made to correct the overload situation in some countries. Some ideas have been suggested, but were not implemented. Ask has been reluctant to change anything…

Only the NTP server operators know how much load their own server can bear. Allow the server operators to scale the load on their server.
Today, by default, each server is in its respective zones with the selected bandwidth. Viewed abstractly, each server is in fact in all zones, but with zero bandwidth in all except one or a few of them.

Let’s give server operators a UI to decide which zones to be in, with what bandwidth. There would then be no single bandwidth value per server, but a bandwidth value per server+zone pair.
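A sketch of the data model this implies (the hostname and numbers are made up):

```python
# Instead of one netspeed per server, a netspeed per (server, zone)
# pair. Unlisted pairs default to zero, matching the "in all zones,
# but mostly with zero bandwidth" view above.
server_zone_netspeed = {
    ("ntp1.example.net", "de"): 1000,
    ("ntp1.example.net", "europe"): 250,
    ("ntp1.example.net", "pl"): 100,  # operator opted in to support pl
}

def netspeed(server, zone):
    return server_zone_netspeed.get((server, zone), 0)

print(netspeed("ntp1.example.net", "pl"))  # 100
print(netspeed("ntp1.example.net", "us"))  # 0
```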


My intent is to facilitate the first step of this change: brainstorming, collecting, and discussing what a better system might look like. I understand the reluctance and hope that with this exchange of ideas we can contribute to a well-defined and thought-through suggestion. In order to implement anything, there first needs to be a plan for what to actually implement.

Interesting idea. While this would enable maximum operator control, I’m not sure many would alter the defaults or dive into the complexity of fine-tuning the 200+ zones for each server…

What do you think about a checkbox “support nearby underserved zones” in the server management where operators can opt in or out of a balancing system?

Hi Sebhoster and everyone,

This is a great initiative. While geographic segmentation (by country) is the current standard, it doesn’t always align with network reality. I’d like to add a third dimension to the brainstorming: Network Visibility/Topology.

Political borders are often “thick” for people but “thin” for packets. Using external data sources (like BGP paths or latency maps) could allow for more efficient distribution.

The “Network Proximity” Case: Take the example of major European providers like OVH. They have massive capacity and excellent peering in Poland. If the Polish zone is underserved, an OVH server located in France or Germany might actually offer better latency and stability for a Polish client than a local “in-country” server on a poorly connected domestic ISP.

Points to consider:

  • Logical vs. Physical: Instead of just “neighboring countries” (static), we could look at “neighboring networks” (dynamic).

  • Integration: This would require integration with external datasets (like MaxMind’s ISP data or RIPE Atlas), which adds complexity but significantly improves the quality of the “pool” for the end-user.

Looking forward to seeing if we can incorporate “Network Distance” alongside geographic distance in the proposed models.

It seems to me that the current algorithm assumes that all countries have the same demand, or the same number of devices. Perhaps the mix of local and continental servers should be weighted by the local and continental population sizes as indirect measures of the device population, though device penetration varies wildly among countries. Yet device penetration is not that rare a statistic for a country.

By the way, are there actual metrics of the load on a zone? I mean, two countries with the same number of servers with the same bandwidth values can have significantly different load conditions if the client populations are significantly different.

Yes, they are a bit hidden but exist - at least on the DNS level. They are used for example to generate the Client Distribution statistics on the server management pages, like the one @ccb056 posted above. I’m currently drafting a longer separate post on those numbers, but for this thread here it is sufficient to say that the numbers on how many DNS requests originated from which country are available.

Mind you, this is purely on the DNS level - it is not a direct measure of the actual NTP request load on the servers in a zone. But since we do not have those metrics, and since it would be quite the undertaking to build and establish a mechanism to collect them, I would consider comparing the DNS load a good proxy.


NTP does become less accurate as distance increases, and a lot of the time server operators don’t know how bad it would be when they serve a faraway zone.

You are definitely right. As the country and global load numbers are available on the server operators’ pages, I believe the NTP Pool DNS system does have the metrics by country and zone ready, but they are hidden at this moment.

Why not just handle countries / zones with not enough servers as if they have 0 servers and use the current algorithm for that?

That would indeed be an easy way to address some of the problems with really low complexity, at the cost of a lower quality of service.

If we look at it from a weight distribution view, that would be like adding all servers from the continent to the underserved zone without any adjustment to the server netspeeds. By scaling the netspeeds, we can “prioritize” the local servers without overwhelming them.
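A quick worked example with made-up numbers shows the difference:

```python
local_total = 2 * 50     # two local servers, netspeed 50 each
support_raw = 20 * 100   # twenty continental servers, netspeed 100 each

# Unscaled: the local servers answer only ~5% of the zone's queries.
print(local_total / (local_total + support_raw))      # ~0.048

# Support scaled down to add as much netspeed as the locals have:
# the local servers keep half the zone while still being relieved.
support_scaled = local_total
print(local_total / (local_total + support_scaled))   # 0.5
```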

The current algorithm simply does not care about demand, just about the countries the servers reside in.

That would work if the Pool had global coverage and enough servers in all zones.

That is not required to get a rough estimate of the demand in a zone. Figures such as Wikipedia’s List of countries by number of Internet users could be used as an alternative basis for an estimate of the demand to be expected in a zone. Also not perfect, because those numbers themselves already carry some uncertainty as to their accuracy, and as @ebahapo also mentions, there is further uncertainty as to how they map into actual traffic demand.

But that is similar to the DNS request numbers carrying some uncertainty as to how they relate to actual NTP traffic demand.

I guess the DNS numbers might be closer to the actual demand to be expected, but the number of Internet users might be easier to get started with in a first step, given they are static, while the dynamic DNS numbers can be used in a potential second step.
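As a sketch of such a first step (the user counts below are made-up placeholders, not actual figures):

```python
# Hypothetical Internet-user counts (e.g. from the Wikipedia list)
# as a static proxy for expected NTP demand per zone.
internet_users = {"de": 78_000_000, "pl": 34_000_000, "nl": 16_000_000}

total = sum(internet_users.values())
demand_share = {z: users / total for z, users in internet_users.items()}
print(demand_share)  # de ~0.61, pl ~0.27, nl ~0.12
```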

And I think simplicity is key to moving forward on this at all, avoiding that perfect becomes the enemy of the good. I’d prefer a solution that is not perfect but that addresses the most important issues with the current implementation sooner rather than later, over never seeing a “perfect” solution and the current issues never getting addressed.


Fully agree.
But the data modelling behind it should still be properly designed.
As an example: for the load of an NTP server, today the operator has the choice of the “bandwidth”. However, this is a value relative to the other servers sharing the load in the same zone. Also, the same overall bandwidth in two different countries may still mean a different load on the server.
Instead of being relative, we should make this value as absolute as we can.
An appropriate choice (the best proxy we have today for the actual load) would be the geoDNS reply rate for the given zone/geolocation. The operator decides an exponent; two to that power is the permitted maximum number of replies per second in which the IP address of their NTP server may appear in geoDNS replies. For example:

-1 ->        testing mode, no reply allowed
 0 ->        1 permitted maximal geoDNS reply/sec
10 ->     1024 permitted maximal geoDNS replies/sec
16 ->    65536 permitted maximal geoDNS replies/sec
24 -> 16777216 permitted maximal geoDNS replies/sec

(Bonus: we could get a good metric of how much free capacity or overload we have for a given zone: the sum of the permitted reply rates of all NTP servers in the zone divided by the geoDNS request rate.)

Prefill the values with some heuristics, then allow operators to customize them as they wish.
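A small sketch of the mapping and the bonus metric (function names and inputs are made up):

```python
def max_reply_rate(exponent):
    """Map the operator-chosen exponent to the permitted maximum
    number of geoDNS replies per second; -1 is testing mode."""
    return 0 if exponent < 0 else 2 ** exponent

def zone_headroom(exponents, dns_request_rate):
    """Free capacity of a zone: the sum of its servers' permitted
    reply rates divided by the zone's geoDNS request rate.
    Above 1 means spare capacity, below 1 means overload."""
    return sum(max_reply_rate(e) for e in exponents) / dns_request_rate

print(max_reply_rate(10))                  # 1024
print(zone_headroom([10, 10, 16], 40000))  # ~1.69: some spare capacity
```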

Taking the ideas so far together, we might get something like this:

  • Each server operator defines a maximum tolerated NTP request rate for each server. This replaces netspeed. The operator can also opt out of providing any support to adjacent zones.
  • From empirical data yet to be collected, we define an assumed ratio between the number of NTP requests a server experiences and the number of DNS requests in which the server is announced. This ratio might differ per zone. Now we can calculate an approximate maximum DNS request rate for each server.
  • With this maximum in mind, we distribute servers (a rough sketch in code follows this list):
      1. All servers are added to their main country zone. Total DNS rate capacity for each country is calculated and compared to the expected demand (plus some safety margin).
      2. For each continent, all servers from zones with spare capacity are added to all zones where demand is not met. When total available capacity on a continent exceeds the demand, the server weights are scaled down to only meet the demand. When total available capacity on a continent is not enough, the server weights are scaled down to not exceed their respective DNS rate limits.
      3. All servers from zones that still have capacity left are added to all zones that still have unmet demand. When total available capacity exceeds the demand, the server weights are scaled down to only meet the demand. When total available capacity is not enough, the server weights are scaled down to not exceed their respective DNS rate limits.
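Here is a rough sketch of these rounds, under a heavily simplified data model (all names and numbers are made up, and capacity consumed by support in earlier rounds is not tracked):

```python
def unmet(zone, zone_weights):
    """Expected demand not yet covered by the zone's total weight."""
    return max(zone["demand"] - sum(zone_weights.values()), 0)

def distribute(zones, rounds):
    """`zones` maps a zone name to {"demand": expected DNS rate,
    "servers": {name: max DNS rate}}; `rounds` is a list of zone
    groupings applied in order (continents first, then one global
    group). Returns per-zone server weights."""
    # Round 1: every server is added to its own country zone.
    weights = {name: dict(z["servers"]) for name, z in zones.items()}

    for group in rounds:
        donors = [z for z in group if unmet(zones[z], weights[z]) == 0]
        needy = [z for z in group if unmet(zones[z], weights[z]) > 0]
        gap = sum(unmet(zones[z], weights[z]) for z in needy)
        capacity = sum(cap for z in donors
                       for cap in zones[z]["servers"].values())
        if not needy or capacity == 0:
            continue
        # Scale down to just meet the demand when there is enough
        # capacity; otherwise cap donors at their configured maximum.
        scale = min(gap / capacity, 1.0)
        for target in needy:
            share = unmet(zones[target], weights[target]) / gap
            for z in donors:
                for name, cap in zones[z]["servers"].items():
                    weights[target][name] = cap * scale * share
    return weights

# Two-zone example: "de" has spare capacity, "pl" has unmet demand.
zones = {
    "de": {"demand": 100, "servers": {"de1": 500, "de2": 300}},
    "pl": {"demand": 400, "servers": {"pl1": 50}},
}
print(distribute(zones, [["de", "pl"]]))
# pl keeps pl1 at full weight and gets de1/de2 scaled down so the
# zone's total weight just meets its expected demand of 400.
```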

This follows the current Country - Continent - Global chain, but it could be altered to use more distribution rounds and different metrics for how late in the process zones start to support certain other zones, with an arbitrary number of rounds and without hard groups like the continents. Here we could consider connectivity statistics: countries with better network connectivity between them would be prioritized to support each other.

Caveat: since we only limit by predicting the expected demand and weighting servers down, the maximum will not be enforced. If the total configured capacity of the pool does not meet the total demand, at least some servers will see more traffic than their configured maximum.