I run a single NTP server (physical box, sole task, native IPv4 and native IPv6 connections through the same line), and especially over the last few days I’ve noticed that the IPv4 monitoring has been a bit crazy while the IPv6 monitoring has been rock solid. Given that there’s only one server, one physical connection, etc., is there something specifically different about the way the monitoring works between the two addressing schemes?
Hi, most likely a dodgy IPv4 router somewhere between the monitor and yourself. You could try an "mtr --udp --port 123 " to the monitor, but you’d have to be looking at just the right moment to see the problem given the issue looks intermittent. The monitor IPs are here: https://dev.ntppool.org/monitoring/network-debugging/
I’d like to know those too, for they sometimes also error out with “RATE” as the reason in the CSV log. I assume that it’s because I have the restriction “kod” by default, so they may need a special rule without this restriction. At least the production monitors seem happy with the new rule specific for them.
At the moment, I have only the IPv6 server running, so NAT is not an issue.
Also, do the monitor servers abide by this best practice in RFC4330 §10?
“A client MUST NOT under any conditions use a poll interval less than 15 seconds.”
Why would they be getting a RATE KOD?
However, how much sense does KOD make? Only canonical clients would honor it, but canonical clients usually don’t violate the RATE in the first place. Rather, I suppose that only non-canonical clients would abuse the RATE, and those probably just ignore KOD.
Should or should not KOD be used in restriction rules?
I use it in my configurations. Yes, there are a small number of devices out there that don’t honor it, but it is better to have it enabled for the majority. At the very least, they are not receiving time when they get rate limited.
@ebahapo what was the time period for the 1562 requests?
Does adding limited / kod actually help or does it just make the server send the same number of responses, but now some of them rate limited?
The monitoring system is set up to send up to 4 or 5 queries 2 seconds apart. (Steve has spotted in the tcpdump diagnostics he’s doing that it sometimes seems to double up the queries; I haven’t had time to debug this and fix it though.)
Is it doubling up the number of queries or doubling up the delay between the queries? Why am I asking this? Reading the code of the monitoring, in case one query times out (2 sec elapsed time) the calling procedure still sleeps for an additional 2 seconds before the next query to the same server.
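To make the timing question concrete, here is a minimal Python sketch of the schedule as described in this post. It is not the monitor's actual code: the function name `send_times` and its parameters are made up for illustration, and the 2-second timeout and 2-second pause are simply the figures quoted above.

```python
def send_times(responses, pause=2.0, timeout=2.0):
    """Return the send time (seconds from burst start) of each query.

    responses: one entry per query, either the round-trip time in
    seconds or None when no reply arrives before the timeout.
    Assumption: the caller always sleeps `pause` seconds after a query
    finishes, whether it was answered or timed out.
    """
    times = []
    now = 0.0
    for rtt in responses:
        times.append(now)
        # An unanswered (or too slow) query costs the full timeout.
        waited = timeout if rtt is None or rtt > timeout else rtt
        now += waited + pause
    return times


# Third query gets no reply, so the fourth goes out 4 s later, not 2 s.
print(send_times([0.0, 0.0, None, 0.0]))
```

Under these assumptions a lost reply would stretch the gap to the next query to about 4 seconds, which looks like a doubled delay rather than doubled queries.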
I just grabbed that information when I posted. Here’s another snapshot as of now:
remote address             port  local address              count m ver rstr avgint lstint
xxxx:xxxx:xxxx:xxxx::xxxx  51252 xxxx:xxxx:xxxx:xxxx::xxxx    909 3   4  158      6      1
xxxx:xxxx:xxxx:xxxx::xxxx  35674 xxxx:xxxx:xxxx:xxxx::xxxx    733 3   4  158      6      3
These are rogue clients. Obviously, they ignore the KODs and just bang the system for over 1h. I suspect that they are malicious bots or drones. I should rate limit them in the firewall though, as ntpd assumes compliant clients and cannot properly deal with them.
However, since I added rate exceptions for the IP addresses of the monitoring servers, they have not been sent KODs anymore.
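For anyone who wants to pick such clients out of a longer listing, here is a rough Python sketch. It assumes the nine-column layout shown in the header above (remote address, port, local address, count, m, ver, rstr, avgint, lstint) with whitespace-separated fields; the function name `fast_pollers` and the sample addresses (from the 2001:db8::/32 documentation range) are hypothetical. The 15-second threshold is the RFC 4330 figure quoted earlier in the thread.

```python
RFC4330_MIN_POLL = 15  # seconds, per the RFC 4330 section 10 quote above


def fast_pollers(listing, min_interval=RFC4330_MIN_POLL):
    """Return (remote, count, avgint) for clients polling too quickly.

    Assumes each data row has exactly nine whitespace-separated fields
    in the order given by the header; other lines are skipped.
    """
    offenders = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a nine-field data row.
        if len(fields) != 9 or not fields[1].isdigit():
            continue
        remote, _port, _local, count, _m, _ver, _rstr, avgint, _lst = fields
        if int(avgint) < min_interval:
            offenders.append((remote, int(count), int(avgint)))
    return offenders


sample = """remote address port local address count m ver rstr avgint lstint
2001:db8::1 51252 2001:db8::123 909 3 4 158 6 1
2001:db8::2 40000 2001:db8::123 12 3 4 158 64 30
"""
for remote, count, avgint in fast_pollers(sample):
    print(remote, count, avgint)
```

The addresses it prints could then be fed into a firewall rule, since ntpd itself only asks such clients politely.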
Right, not all clients will respect a KOD, and in fact poorly coded clients that don’t receive a proper time reply will simply start querying more! Thankfully NTP packets are tiny, so it’s not really a burden, just an annoyance at someone else’s ignorance.
However, there are a few legitimate reasons some IPs might end up querying more than expected; one that comes to mind is a proxy. In fact, I had issues with some IPs that, after some communication back & forth, turned out to be used by Tesla. Not only were they proxying through a small subnet, but their client had a bug causing a greater rate than it should have been. But then I’ve also had some clients doing sub-second querying that came from AWS… which I eventually blocked.
Another factor is whether you are using NTPD or Chrony. With NTPD, even when a client exceeds the rate limit, a percentage of its packets still get a correct reply instead of a KOD. I don’t know how Chrony behaves.
I use ‘hashlimit’ with iptables in order to rate-limit by source IP. Be aware that, depending on how many QPS you get, you might have to bump up your conntrack_max in sysctl.conf & hashsize in modprobe.conf… I have my hashlimit set to a burst of 8 and an average of 4/min with a 2/min expire (that’s all that is really necessary). I find most clients are either very well behaved or wildly abusive… There’s no in-between… lol. About 1/6th of the traffic gets dropped with the above settings.
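To show roughly what a burst of 8 with an average of 4/min means per client, here is a simplified token-bucket model in Python. It is only a sketch of the idea, not how hashlimit is implemented in the kernel, and it does not model the expire setting; the class name `PerSourceLimiter` and the example addresses are made up.

```python
class PerSourceLimiter:
    """Simplified per-source token bucket (illustrative, not hashlimit itself)."""

    def __init__(self, rate_per_min=4.0, burst=8):
        self.rate = rate_per_min / 60.0  # tokens added per second
        self.burst = float(burst)        # bucket capacity
        self.state = {}                  # source IP -> (tokens, last_time)

    def allow(self, src, now):
        """Return True if a packet from `src` at time `now` (seconds) passes."""
        # A source seen for the first time starts with a full bucket.
        tokens, last = self.state.get(src, (self.burst, now))
        # Refill for the elapsed time, never beyond the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self.state[src] = (tokens, now)
        return allowed


limiter = PerSourceLimiter()
# An abusive client sending 100 packets in the same instant.
passed = sum(limiter.allow("2001:db8::bad", 0.0) for _ in range(100))
print(passed)
```

In this model a client polling every 64 seconds never runs out of tokens, while a flood only gets its first 8 packets through before being held to the average rate, which fits the "well behaved or wildly abusive" split.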
Chrony has the same behavior. At the maximum rate limit setting, still 1 in every 16 NTP packets is responded to. The manual defends this behavior as a way to prevent completely cutting off a DDOS-ed address.
This is necessary to prevent an attacker who is sending requests with a spoofed source address from completely blocking responses to that address.
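The "1 in 16" figure can be pictured with a small Python sketch. This is a toy model, not chrony's code: it assumes the leak is expressed as a power of 1/2, so a value of 4 gives 1/16, and the function name `respond` is hypothetical.

```python
import random


def respond(over_limit, leak=4, rng=random):
    """Decide whether to answer a request.

    Requests within the limit are always answered. Requests over the
    limit are still answered with probability 1 / 2**leak (assumed
    model), so a spoofed flood cannot cut a victim off completely.
    """
    if not over_limit:
        return True
    return rng.randrange(2 ** leak) == 0


rng = random.Random(0)  # fixed seed so the run is repeatable
answered = sum(respond(True, leak=4, rng=rng) for _ in range(16000))
print(answered / 16000)
```

The printed fraction should land near 1/16, which is the trickle of replies a rate-limited address still receives.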
The patterns are interesting. Polls for some servers come in groups of 3 separated by 2 seconds (if response arrives) or 5 seconds (no response).
The other major pattern has polls in groups of 3 alternating with polls in groups of 5. [This is a simplification.] About 10% of the hosts have this second pattern. I suggested that a second monitor of older vintage might inadvertently be running.
If it’s intermittent and it’s bugging you, I would work out when the checks from the monitor are due (you can watch them come in with the appropriate tcpdump recipe), then fire up mtr around the same time and see whether the monitor packets arrive, whether they get an answer, or whether there’s an obvious drop along the route at that time.