I take a big hit in queries

Demand = Roughly the volume of requests that all the NTP clients in the zone generate. Or similar definitions. Thing is, how much “traffic” does the zone generate, in whatever unit is relevant to you: Packets per second, bits/second, GB/month, …

Your server has a certain netspeed. So do all others in the same zone. Add up all those netspeeds. Then calculate what percentage your netspeed is of the overall sum of netspeeds. That is the second permyriad (see below for permyriad vs. percent).

By definition, the baseline here is 100%, or 10000 permyriads. I.e., the sum of all netspeeds of the servers in a specific zone is considered 100%. And each server will then have a fraction thereof, e.g., 13% for each of NTPman’s servers. Or the respective permyriad fraction out of the total of 10000 permyriads (it’s just a factor of 100 between the two).

As there are zones with many servers, the unit is not percent, because the percent value might be too small/have too many zeros after the decimal point for those countries, and/or for small netspeed values. Thus, permyriad was chosen. E.g., your IPv4 server currently has 4.813 ‱, which would be 0.04813%. The other way round, in zones with fewer servers, you get bigger permyriad values, because the 100% is divided among fewer servers. I.e., each server gets a larger share of the 100%.
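To make the netspeed share concrete, here is a minimal sketch of the arithmetic described above. The function names and the zone's netspeed values are made up for illustration; this is not how the pool itself is implemented.

```python
def share_permyriad(own_netspeed, zone_netspeeds):
    """Return own_netspeed as a share of the zone total, in permyriad (1/10000)."""
    total = sum(zone_netspeeds)
    if total <= 0:
        raise ValueError("zone total netspeed must be positive")
    return 10000 * own_netspeed / total


def permyriad_to_percent(value):
    """Permyriad and percent differ only by a factor of 100."""
    return value / 100


# Hypothetical zone: netspeeds in kbit/s, own server is the first entry.
zone = [512, 1_000_000, 3_000_000, 250_000]
own = share_permyriad(zone[0], zone)
print(f"{own:.3f} permyriad = {permyriad_to_percent(own):.5f} %")
```

With the value quoted above, `permyriad_to_percent(4.813)` gives 0.04813, matching the conversion in the text.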

Similar with the DNS queries. Assume the pool got 100,000 DNS queries from Belgium in the last three days (not a realistic number, but for illustration). Then the first permyriad means that that specific server has been getting about 13% of those 100,000 DNS queries, or about 13,000 queries. And yours has gotten 0.54405 % in roughly the same timeframe, or 54.405 ‱.
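The same arithmetic for the DNS share, as a small sketch. The 100,000 figure is the illustrative number from above, not real data, and the function name is made up.

```python
def queries_from_share(total_queries, share_permyriad):
    """Number of DNS answers that included a server, given its share in permyriad."""
    return total_queries * share_permyriad / 10000


total = 100_000  # illustrative zone total from the example, not a real number

print(queries_from_share(total, 1300))    # a server with a 13 % share
print(queries_from_share(total, 54.405))  # a server with 54.405 permyriad
```

So a 13 % share corresponds to about 13,000 of the 100,000 queries, and 54.405 ‱ to roughly 544.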

We do not know what normal numbers are. There is no baseline. Germany has e.g. 600 servers, Belgium 29.

Sure we see numbers, but they mean nothing.

Do you have any idea what you are trying to say?

Nobody knows the size of a country, the number of requests…bla bla bla.

A better number would be: Req’s PER server regardless of a country. That would instantly show if a server is DDOS’ed or not.

Then a second number should be, traffic-setting….so you can see if 512Kbit or 3GB means a lot.

We need to know how much a server is getting requests…all the rest is (sorry) rubbish.

Again, I do not understand any of the meaning of the current numbers, they do NOT show requests per server….or anything useful. @ask can you please fix this mistake?

Sorry to hear that. Currently don’t understand what you don’t understand. E.g., what “baseline” you are looking for.

Generally, there are two things:

  • What the permyriad values “mean”, i.e., plainly just how they are defined. Not much to “understand”, just accept the definition.
  • How that relates to load on an individual server within a zone.

And for various reasons, the second item is not as straight-forward, or simple, as many of us would like it to be, so we have to make do with what we have.

Yes, at least I believe so.

Sure. But

a) because of how the pool currently operates, that number is specific to each country, whether one likes it or not, and

b) while there have been calls to actually make the pool aware of how many requests each server gets, there is no clear solution for how that could realistically be achieved anytime soon with the available resources and given various other constraints and open questions.

Sure, that has been discussed before, that the setting should be an absolute one. But there are various challenges in the way of doing that, and it’s non-trivial, and thus again there is no realistic prospect of getting something like this implemented anytime soon.

Sure, but it is all we factually have right now, and for the foreseeable future, whether we like it or not. So we have to make do with what we have.

I think that is because you want them to mean what you would find useful, and that is just not what they mean, for various reasons. They have a somewhat clear definition, and that is what they mean, and whether you find that useful to you is another matter.

I think Bas’s issue is that the current stats are relative to an unknown, whereas absolute values might be more useful to size your server. It is sort of annoying to have to do a tcpdump to figure out what your requests/s are.

There are 66 servers in my zone, are they super computers, or Raspberry Pis?

Yes, we get the portion of bandwidth that our server contributes, but that doesn’t help if we don’t know what type of hardware is contributing the rest.

I think Bas would like to see the absolute numbers, total requests for zone per second, requests served by you per second, along with percentages. Those absolute values must be there as they are used to calculate the percentages?

Yes exactly…you read my mind :+1:

If you want to see NTP requests per second, the only place is on your NTP server, or if you have a firewall in front of it, maybe there. If you want to see something visual, set up graphing for it. There have been threads on how to do that.
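For chrony users, one way to get a requests/s figure on the server itself is to sample the counters twice and take the difference. The sketch below assumes the `Name : value` layout that `chronyc serverstats` prints; the two samples are invented numbers, and the function names are only for illustration.

```python
import re


def parse_counters(text):
    """Parse 'Name : value' lines into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        match = re.match(r"^\s*(.+?)\s*:\s*(\d+)\s*$", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def rate_per_second(before, after, key, interval_s):
    """Average rate of one counter between two samples taken interval_s apart."""
    return (after[key] - before[key]) / interval_s


# Two invented samples, assumed to be taken 60 seconds apart.
# On a real server the text would come from running `chronyc serverstats`.
SAMPLE_T0 = """NTP packets received       : 1598
NTP packets dropped        : 8
"""
SAMPLE_T1 = """NTP packets received       : 13598
NTP packets dropped        : 68
"""

t0 = parse_counters(SAMPLE_T0)
t1 = parse_counters(SAMPLE_T1)
print(rate_per_second(t0, t1, "NTP packets received", 60), "requests/s")
print(rate_per_second(t0, t1, "NTP packets dropped", 60), "drops/s")
```

Feeding the resulting rates into whatever graphing tool you already use gives the visual view without needing tcpdump.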

The “central NTP pool infrastructure” does not have that information. The DNS queries do not have a 1:1 correlation to NTP queries. Here are a few examples:

  • If I manually do a DNS query of one of the pool names, I will receive 4 IPv4 addresses, or if I do it for the 2.ntppool name, I’ll receive 4 IPv4 addresses and 4 IPv6 addresses. I might not do an NTP query after that.
  • A simple SNTP setup might periodically run a service that does a DNS query, receives the 4 addresses, but only does a one-off NTP query to one of those addresses.
  • An NTP server that starts up with a “ntpd server x.x.x.x” kind of setup will do a single DNS query, receive 4 IPv4 addresses, but use only one of them, and keep on using it until the machine is rebooted or the service restarted.
  • An NTP server that starts up with a “ntpd pool x.x.x.x” kind of setup might do a single DNS query, receive 4 IPv4 addresses, start to use all 4 and after a while stop using the worst ones, but keep on using the good ones, again until the machine is rebooted or the service is restarted.

So the point is, the “central infrastructure” cannot know how many NTP queries a DNS query will result in.

There is one more thing: most people do not query the pool DNS servers directly, they use their ISP’s DNS servers or one of the well-known DNS servers, like Google’s 8.8.8.8. Currently the NTP pool DNS servers give responses a TTL of 120, which means a DNS server in that chain may cache the result for 120 seconds. So for the next 120 seconds, it can just respond with the same response, without contacting the pool DNS servers.
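A rough sketch of what that caching means for the numbers the pool DNS servers see. It assumes a single resolver with one shared cache and a single pool name; real resolvers often run several cache nodes, so treat the result as an approximation. The function name and the client counts are made up.

```python
import math


def max_upstream_lookups(window_s, ttl_s):
    """Approximate upper bound on lookups one caching resolver forwards
    for one name within a window, no matter how many clients ask it."""
    return math.ceil(window_s / ttl_s)


TTL_S = 120            # TTL mentioned above
WINDOW_S = 3600        # one hour
clients = 10_000       # hypothetical clients behind the resolver
lookups_per_client = 1  # hypothetical: each client resolves once per hour

client_side = clients * lookups_per_client
pool_side = max_upstream_lookups(WINDOW_S, TTL_S)
print(f"clients sent {client_side} lookups, the pool saw at most about {pool_side}")
```

In this scenario the pool DNS servers see at most around 30 queries per hour from that resolver, whether 10 or 10,000 clients sit behind it, which is one more reason DNS counts cannot be converted into NTP load.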

I proposed once a solution for that: https://community.ntppool.org/t/load-feedback-from-the-ntp-servers-to-the-pool-infrastructure , but there wasn’t any real interest in it. The idea is simple, NTP servers could report their load to the pool infrastructure.

That is a good idea, as we have verified servers now, that didn’t happen before.

So the report can be seen as correct.

Or add a token to the config like we have with the monitors.

I don’t think there was lack of interest as such, just that it is probably not as easy to realize as it sounds, and that even other pressing topics don’t get addressed due to bottlenecks in the project.

Thus the discussion just petered out without any real conclusion/way forward in view, like so many others (IPv6, underserved zones, vendor zones, delays and noticeable unreliability in processing of mails sent to server-owner-help@ntppool.org, …).

@Bas Is it any better now with number of requests?

Sending server load stats from the server to a centralized host would not be possible for NTP appliances.

Right, but a firewall/router in front may provide that information if it is not possible to query the appliance for statistics. Anyhow, there would be no need for all NTP servers to provide the information.

Yes it’s a lot better.

Numbers dropped to normal levels.

I agree. But we do need better numbers to judge what our servers get for a load, so we can see if there are problems.

What we have now makes no sense at all.

I generally concur that “better” numbers are always good, e.g., them getting more predictable/comparable across zones. In a way, the most straight-forward approach in that direction would be the eventual dissolution of the current strict, country zone-centric GeoDNS approach. When the load is distributed more evenly, it is easier to correlate a specific setting with an actual load that it would cause, obviating the need for complex feedback mechanisms, additional requirements on at least a sufficiently large subset of server operators, and the handling and processing/analysing of what would be vast amounts of data, and a “simple” calibration could be all that is needed.

Not sure though how having that data available would have helped much in your case. Unless you knew all along, e.g., from device specs, what the load is that your system could handle. Then you could obviously have determined rather quickly that the load is more than your system can handle.

But having followed your laboring throughout the process until now, or also other cases over the years, I don’t think that the numbers would have helped any. Because the capabilities of devices and systems are typically not known, e.g., consumer or even semi-professional routers, due to the specifics of pool NTP traffic. (Dedicated NTP appliances being one notable exception because that is a core benchmark for them, and sufficiently professional routers/network devices for which the relevant numbers are given for when DNAT, or some other function needing stateful flow tracking, is active.)

Rather, I think really accepting that the load is what caused your issues was the hard part, and that Belgium simply is/was an underserved zone, i.e., that the lowest netspeed setting was still attracting too much traffic for your system.

Or would you really have said, “ok, I know my system can handle, e.g., 400 packets/s, so I need to set the pool to below 400 packets/s, and everything will be fine” (assuming that such functionality were available)?

Or wouldn’t anyhow have experimenting, and iteratively approaching the threshold where your system starts to slide from being fine towards not being fine anymore, been the approach that was needed, and that you actually followed, or at least tried?

Because I think the main issue here was not that there was no direct setting in, e.g., packets/second, or that the pool did not provide the right numbers. Rather, that, like many others, you tried the aforementioned iterative approach, but the inability to reduce the load even further, below the current lowest load-sharing setting (regardless of what unit it takes), basically broke that, preventing you from moving forward on your own. (And the issue that due to the strict zone-based GeoDNS mechanisms, underserved zones exist in the first place.)

Just FYI. After moving my NTP server to its own port on my firewall, I can now pull packets/s stats that do not get mixed up with LAN/general internet traffic, without resorting to tcpdump:

This is with OPNsense firewall software.

See the massive attacks, but they only last a short time and do not happen every day.

You can see them, as they have a huge RX but no TX response, because they are dropped.

Some joker or very bad client is doing this. Above is router data, below is server data.

You see the big number of drops….Now going to search for the IP causing this and ban it.

Hi Bas,
Thanks for sharing. Few observations:

  • Time period looks different for the two graphs. Are you showing the same period?
  • Peak no. of requests in top graph is 3.75Mbyte/s. Looks like a serious peak, but in absolute numbers it is not a lot. Is your connection/router not able to handle it? Would be interested to see the no of dropped packets during the peak.
  • I may misinterpret the second graph, but the statistics look like your server is accepting command packets? As per chrony documentation (link) command packets are normally only accepted at localhost address (127.0.0.1 or ::1), unless a different address is specifically mentioned using the “bindcmdaddress” option in chrony.conf.
    I assume this is rather uncommon as setting this to a publicly accessible address opens the possibility for amplification attacks? Could you please check?
    PS: my chrony.conf does not contain such a command.

Found something nasty about Chrony, when you have a high maxloglimit and you execute

chronyc -n clients |sort -rn -k 3,3 |head

The number of packets goes thru the roof….ok, it’s not an attack, my mistake, reading the wrong color. :face_with_spiral_eyes:

That issue is solved, so I lowered the maxloglimit, as I only need to know the abusers.

I did find them.

root@server:~# chronyc -n clients |sort -rn -k 3,3 |head
194.78.124.15               29365   4523   3   2   261       0      0   -     -
91.183.19.198                4997   1451   3   1  129m       0      0   -     -
94.72.74.97                  2145   1284   9  12  339m       0      0   -     -
194.78.68.110                1527   1045  12  -3  274m       0      0   -     -
78.24.169.1                  1114    735  13  12  300m       0      0   -     -
87.66.140.213                 865    616  -4  12   51h       0      0   -     -
213.181.51.130               1432    566   2   6  167m       0      0   -     -
81.246.64.102                 784    536  -3  -1   28h       0      0   -     -
84.195.57.233                 759    488  10  12   30h       0      0   -     -
213.211.165.252               747    467   0   1  342m       0      0   -     -
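If you want to slice that listing differently, a small parser helps. This sketch assumes the usual `chronyc clients` column order (address, NTP packets, NTP drops, log2 of the average interval in seconds, ...). Under that assumption, `sort -k 3,3` above ranks by dropped packets, not by total packets. The sample rows are copied from the output above; the function names are made up.

```python
SAMPLE = """\
194.78.124.15               29365   4523   3   2   261       0      0   -     -
91.183.19.198                4997   1451   3   1  129m       0      0   -     -
87.66.140.213                 865    616  -4  12   51h       0      0   -     -
213.181.51.130               1432    566   2   6  167m       0      0   -     -
"""


def parse_clients(text):
    """Turn `chronyc -n clients` rows into dicts; non-data lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            ntp, drop, interval_log2 = (int(f) for f in fields[1:4])
        except ValueError:
            continue  # header or separator line
        rows.append({
            "address": fields[0],
            "ntp": ntp,
            "drop": drop,
            # assumed: the Int column is log2 of the average interval in seconds
            "interval_s": 2.0 ** interval_log2,
        })
    return rows


def heaviest(rows, key="ntp", limit=10):
    """Rows sorted by the chosen counter, largest first."""
    return sorted(rows, key=lambda row: row[key], reverse=True)[:limit]


rows = parse_clients(SAMPLE)
for row in heaviest(rows, key="ntp"):
    print(row["address"], row["ntp"], row["drop"], row["interval_s"])
```

Ranking by `ntp` instead of `drop` moves 213.181.51.130 up the list, since it sent more packets than some clients with more drops.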

Especially the top one, it hits my server all the time, every second or more.

So I knocked all these IP’s in my firewall…..be gone! :grin:

I did not find any other systems that abuse polling.

But know this, we got a lot of help from bigger servers in BE, including my own in Germany.

That helps BIG TIME! :+1:

I’ll have a look to see if the same IPs hit my server.