Bas, I understand this topic is already fairly long, but I would still suggest reading it from the beginning. He gets 250-300 Mbps of NTP traffic with the lowest 512 kbps netspeed setting. This is a problem for all of Russia at the moment, and many Russian servers are currently struggling under the extra load. Providing even some rate-limited service is better than dropping the traffic entirely, because it may ease the burden on some of the other Russian NTP servers in the pool.
Maybe as a next step our Russian NTP server operators could provide some more detailed data about the traffic; the overall query rate alone isn't that useful. Most importantly: does the majority of the traffic come from a small set of IP addresses, or is it spread out over a larger set? Preferably measured over at least a few minutes, not seconds. That way we wouldn't need to guess when giving advice.
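For example (a sketch assuming a Linux server with tcpdump; `eth0` is a placeholder interface name), something like this would show whether a handful of sources dominates a five-minute window:

```
# Count inbound NTP queries per source IP over 5 minutes, top 20 talkers.
# (IPv4 only: tcpdump appends the source port as a fifth dot-separated field.)
timeout 300 tcpdump -lni eth0 'udp dst port 123' 2>/dev/null |
  awk '{print $3}' | cut -d. -f1-4 | sort | uniq -c | sort -rn | head -20
```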
If there are specific problematic IP addresses, contacting the IP address owner’s abuse contact might be useful (see whois). It is also possible that the source IP address is forged, but that’s hard to prove.
Yes, it’s not rocket science. It’s not been spelled out here before because of the poor practice of some server operators of configuring their setup to treat the monitors differently than all other traffic. That’s counterproductive as the intent is for the monitors to see exactly the same service as any client would see. If, for example, a server operator exempts the monitors from their firewall rule to reject all traffic from outside the US, the pool would continue to direct clients from around the world to the server which works only for US IP addresses. You might imagine other scenarios where special treatment of monitors creates broken pool NTP service. Please consider that before helping non-rocket-scientist server operators figure out the monitor addresses, and definitely don’t post monitor IP addresses.
The discard average value is a power-of-two exponent, so average 1 causes ntpd to stop responding to IP addresses that average more than one query every 2^1 = 2 seconds, except for occasionally replying with a KoD packet that asks the source to reduce its query rate and carries no useful time information. A value of 5 would mean each IP address can query only once every 2^5 = 32 seconds and still get a useful reply.
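For anyone following along, a minimal ntp.conf sketch (for the reference ntpd; the discard thresholds only apply to sources matched by a restrict line carrying the `limited` flag, and `kod` is what enables the Kiss-o'-Death replies):

```
# Rate-limit all sources ("limited") and send occasional KoD replies ("kod");
# nomodify/noquery are the usual hardening for a public server.
restrict default kod limited nomodify noquery
restrict -6 default kod limited nomodify noquery

# Refuse queries from any IP averaging more than one per 2^1 = 2 seconds.
discard average 1
```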
As @Bas pointed out, in many cases it's better to rate-limit in the NTP daemon rather than in the router, so that monitors and normal, low-volume NTP clients are not affected. However, if your NTP server is on a private network behind a NAT or a stateful firewall, the flood may be overwhelming the NAT/firewall's connection-tracking capacity (packets per second, table size, processing time) even while your server itself could handle more NTP traffic.
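If the NAT/firewall box is Linux-based, one mitigation sketch (untested here) is to exempt NTP from connection tracking in the raw table; note you then need stateless accept rules for NTP replies, since ESTABLISHED matching no longer applies to them:

```
# Skip conntrack for NTP in both directions; the raw table is evaluated
# before connection tracking, so the flood never touches the state table.
iptables -t raw -A PREROUTING -p udp --dport 123 -j CT --notrack
iptables -t raw -A OUTPUT -p udp --sport 123 -j CT --notrack
```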
In particular, rate-limiting in the NTP daemon means the daemon has to handle each packet just to reject it, which takes nearly as many resources as responding does, and your router/NAT/firewall still has to pass the entire flood. When it's an intentional DoS flood, rate-limiting in the router (or better, at your ISP's border routers) is likely wise.
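As an illustration on a Linux router (the threshold here is made up for the example; use the FORWARD chain instead of INPUT if the NTP server sits behind the router), the hashlimit match can enforce a per-source cap similar in spirit to ntpd's discard:

```
# Drop sources sending more than ~1 NTP query per second (burst of 5),
# tracked per source IP.
iptables -A INPUT -p udp --dport 123 \
  -m hashlimit --hashlimit-mode srcip --hashlimit-name ntp \
  --hashlimit-above 1/second --hashlimit-burst 5 -j DROP
```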
Keep in mind that even without NAT or other connection tracking, routers that can keep up with, say, 1 Gbps of typical traffic may not be able to handle 1 Gbps of NTP traffic, due to the tiny packets and therefore much higher packet rate. You could say PPS, but that can be confused with the pulse-per-second from a GPS or other frequency standard used by most stratum-1 NTP servers; P/S might be better in an NTP discussion context.
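Back-of-envelope: a client-mode NTP packet carries 48 bytes of UDP payload, roughly 90 bytes on the wire with UDP, IP, and Ethernet headers, so 1 Gbps of pure NTP traffic is on the order of 1.4 million packets per second, while the same 1 Gbps in full-size 1518-byte frames is only about 82,000 packets per second.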
My pool server in the us and north-america zones is set to a pool netspeed of 250 Mbps, but that's just a relative weighting against all the other servers in a given zone. In practice, it's getting about 1 Mbps of NTP traffic, around 1,400 packets per second. It's running under Windows 11 on a circa-2010 HP Z400 workstation with a Xeon W3520 @ 2.66 GHz base speed and 4 cores/8 hyperthreads, built on a 45 nm process with a CPUmark around 3,000, where AMD's current high-end Threadripper CPUs score 33,000-65,000. I haven't spent much time micro-optimizing it because my focus is on maintaining and improving ntpd, not running the fastest pool server, but when I push up over ~2.5 Mbps, pool monitoring starts to show degradation.
@davehart Thank you, that was the config option I couldn't find tonight.
I’ve added discard average 1 to my config now.
@NTPman I'm using the script you mentioned and it's working now, but it hasn't reported anything so far.
What I can see right now is that when the server score is >10, the request rate goes through the roof. When the score drops below 10, the traffic sinks almost immediately (OK, it took about 5 minutes) and looks like floor noise.
The problem with the China zone in the past was different: even after a server had been in the pool and the monitors kicked it out, the traffic was still there.
Indeed I did, and indeed nobody did, for a reason.
But I think that when an entire country zone has collapsed, and people are hurting because their systems get overwhelmed, giving pointers in a desperate attempt to stabilize the zone is legitimate when the pool is failing them.
I'd prefer the pool to fix this kind of issue so that it actually works as intended everywhere, or to add IPv6 to more zones, rather than fine-tuning aspects such as:

- figuring out whether an IPv4 address and an IPv6 address are actually the same server, so that only one of them is handed out to clients, when in some places neither works properly;
- tuning the monitoring so that monitors don't monitor servers in their own network region, while servers elsewhere are overrun by traffic and dropping like flies from the pool;
- optimizing server allocation by estimating the actual RTT from a client area to a server, while clients in some areas would be happy to get any service at all, and servers there would be happy if they could provide good service instead of fighting a desperate, losing battle against being overrun by traffic.
Again, in case it hasn't registered yet: entire zones are collapsing, or continuously teetering on the brink of collapse. As long as that is not being addressed, people who nonetheless believe in the project will try to make do with what they have and can do, however desperate, and probably fruitless in the end, as they obviously cannot overcome the structural issues of the current pool system.
That's interesting, because normally with a score below 10 a server is removed from the pool's DNS. Is there perhaps some lingering DNS traffic? Or is there some kind of 'shadow pool' active? I've seen this before, where hostnames resolve to IP addresses that (partly) overlap with the ones in the NTP pool (like 'cn.ntp.org.cn', 'us.ntp.org.cn', 'de.ntp.org.cn', etc.).
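One quick way to check (output will vary by vantage point) is to compare what such names resolve to against the pool's own zone names, e.g.:

```
# Compare a suspected "shadow pool" name against the real pool zone.
dig +short cn.ntp.org.cn A
dig +short ru.pool.ntp.org A
```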
I think that this would not help. I don't have a dedicated internal NTP server; the MikroTik RB4011 replies to queries itself, and unfortunately it has no tunables for that. I think that if I NATted the packets to an internal VM capable of handling such traffic, it would still overwhelm the router and disrupt its main services. I set netspeed to 512 kbps and get up to 500 Mbps (!) of NTP traffic alone. That's wrong, and I must protect myself, as NTP is an auxiliary, volunteer service, not the main one.
And not limiting at the router results in bandwidth and CPU overload: the router starts dropping packets (including useful ones, not just NATted ones), which leads to bad scoring results, but it does not disrupt its main services.
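One thing I'm considering (untested, and I'm not sure how much CPU it would actually save on the RB4011) is using the raw firewall table, available since RouterOS 6.36, to exempt NTP from connection tracking, so the flood at least stops consuming NAT/conntrack resources before I resort to dropping it outright:

```
/ip firewall raw add chain=prerouting protocol=udp dst-port=123 action=notrack
/ip firewall raw add chain=output protocol=udp src-port=123 action=notrack
```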
Thanks. We've implemented Russia-only monitoring; so far it has shown no losses.