Hello, I’m fairly new to running NTP servers and would like some help troubleshooting one of my servers. I switched it from NTPsec to Chrony, and now the score is all over the place, especially on IPv6. I keep getting emails about the score being waaay into the negatives, but if I ignore it for long enough, it’ll be back up to nearly 20 in due time, and then promptly drop again soon enough. It’s not nearly as bad on IPv4, but something seems to be going on there too, though at least the score doesn’t drop below 10.
Hello Badeand, welcome to the pool. Your username indicates it is where you belong.
A good first place to look is the CSV log, linked at the bottom of the score page of each server. It shows you the detailed log entries for all the measurements of the different monitors, including possible error messages. You can extend the time range of the returned results by increasing the number behind “limit” in the URL.
Looking at the log of your IPv6 server, during the time your score dropped, a lot of monitors show “i/o timeout”, which means either your server was not reachable or it did not respond to the monitor’s NTP query.
Another good first step for troubleshooting is to look at the general metrics of your server: how high is the utilization of CPU, RAM and bandwidth, how many NTP packets per second are you serving, and are there any anomalies in those metrics during the time the score dropped? And of course check the server logs to see if chrony or any system processes have thrown any interesting-looking errors or warnings.
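To get a quick overview of such a log, the error messages can be tallied per monitor. Here is a minimal Python sketch of that idea. The column names monitor_name and error, and the sample rows, are assumptions made up for illustration; check the header row of your actual CSV and adjust.

```python
import csv
import io
from collections import Counter


def count_errors(csv_text, monitor_col="monitor_name", error_col="error"):
    """Count rows per (monitor, error message) where the error column is non-empty."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        error = (row.get(error_col) or "").strip()
        if error:
            counts[(row.get(monitor_col) or "unknown", error)] += 1
    return counts


# Hypothetical sample data, invented for this example (not a real pool log).
SAMPLE = """ts_epoch,score,monitor_name,error
1700000000,19.5,monitor-a,
1700000060,-5.0,monitor-b,i/o timeout
1700000120,-10.2,monitor-b,i/o timeout
1700000180,18.9,monitor-c,
"""

if __name__ == "__main__":
    for (monitor, error), n in count_errors(SAMPLE).most_common():
        print(f"{monitor}: {n} x {error}")
```

If the timeouts cluster in one time window across many monitors, that points at the server or its network rather than at a single monitor.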
If your server is running fine and not at capacity, I would check if HostHatch deploys any form of DDoS protections that might get triggered by the amount of ntp traffic.
Thank you! As a rubber duck, I do indeed enjoy it in the pool!
I did have this ratelimit in place, which I commented out:
ratelimit interval 500 burst 2000
Could that potentially be causing this? I figured this was a pretty high limit?
Doing tcpdump -ni eth0 udp dst port 123 | grep Client | pv -l -r >/dev/null tells me it gets around 600-800 packets per second currently.
The score dropped while I was sleeping or doing other stuff, so I didn’t really have a chance to look at the resource usage at the time. I don’t really record resource usage such as bandwidth, CPU and memory usage anywhere, so I have no idea if it might’ve been overloaded at the time. Right now the CPU usage is mostly sitting around 10-20%.
I guess the memory usage is a bit high? Could that be it?
       total   used    free    shared   buff/cache   available
Mem:   1.9Gi   1.6Gi   70Mi    12Mi     405Mi        301Mi
Swap:  1.0Gi   524Ki   1.0Gi
Just created a swap file, in case that’s it. If that starts filling up, I guess I’ll either have to move the NTP server out to its own VM, or get more memory for it. It didn’t have any swap before.
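Since nothing is being recorded at the moment, even a tiny logger would show whether the box was short on memory or busy when the score dropped. A rough Python sketch, assuming a Linux host with /proc/meminfo; the function names and the log line format are made up for this example:

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def sample_line(meminfo_text, load1):
    """Format one log line: timestamp, 1-minute load average, available memory."""
    mem = parse_meminfo(meminfo_text)
    avail_pct = 100.0 * mem["MemAvailable"] / mem["MemTotal"]
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    return f"{stamp} load1={load1:.2f} mem_available={avail_pct:.1f}%"


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    # Linux only: read the kernel's memory counters once and print one line.
    with open("/proc/meminfo") as f:
        print(sample_line(f.read(), os.getloadavg()[0]))
```

Run from cron once a minute with the output appended to a file, this gives something to compare against the timestamps in the score log.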
I didn’t manage to find any clues in the logs.
I took a look around their website but found nothing about any DDoS protections. I also have another NTP server in the same datacenter, and they even seem to be part of the same IPv4 subnet, and the other server has no such issues. The main difference is probably that the other server is running NTPsec rather than Chrony, and has less CPU load (less other non-NTP stuff going on I guess), yet is handling even more NTP traffic (around 1500 packets per second).
I think we found your problem. The interval option specifies how long a client has to wait between requests in order not to get rate limited. A higher value corresponds to a longer time, which means more restrictive rate limiting.
In this particular case, since you set it higher than the maximum of 12, chrony will use 12. It interprets the interval as a power of 2, so 2^12 comes out at 4096. The unit for this is “seconds between packets”. So you limited all clients to one packet roughly every 68 minutes. No wonder your score got messed up!
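The arithmetic is easy to check. A small Python sketch, assuming the clamping behaviour described above; the lower bound of -19 is taken from recent versions of chrony.conf(5) and may differ on older releases, and the function name is just for this example:

```python
# Allowed range for "ratelimit interval" (assumption: recent chrony.conf(5)).
INTERVAL_MIN, INTERVAL_MAX = -19, 12


def effective_interval_seconds(configured):
    """Clamp the configured value to the allowed range and return 2^value seconds."""
    clamped = max(INTERVAL_MIN, min(INTERVAL_MAX, configured))
    return 2.0 ** clamped


if __name__ == "__main__":
    seconds = effective_interval_seconds(500)
    print(f"{seconds:.0f} s = {seconds / 60:.1f} min")  # 4096 s = 68.3 min
```

The same function gives 8 seconds for the default of 3, and 0.25 seconds (four packets per second) for -2.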
The default for interval is 3 (one packet every 8 seconds), but lower values are not unusual in order to not block clients that share a public IP address. Negative values are also allowed, which permits multiple requests per second per client.
Your “burst” parameter was also set to higher than the maximum (which is 255), but here a higher value means more packets are allowed to skirt the interval rule. So higher burst - less restrictive.
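To get a feel for how interval and burst interact, here is a generic token-bucket model in Python. This is only an illustration under simplifying assumptions: it is not chrony's actual implementation, it ignores chrony's leak option, and the function name is made up.

```python
def answered_requests(interval, burst, request_times):
    """Count answered requests in a plain token-bucket model.

    The bucket starts full with `burst` tokens and refills at one token
    per 2^interval seconds. Each answered request costs one token.
    """
    refill_per_second = 2.0 ** -interval
    tokens = float(burst)
    last = None
    answered = 0
    for t in request_times:
        if last is not None:
            # Refill for the time elapsed since the previous request, capped at burst.
            tokens = min(float(burst), tokens + (t - last) * refill_per_second)
        last = t
        if tokens >= 1.0:
            tokens -= 1.0
            answered += 1
    return answered


if __name__ == "__main__":
    one_per_second = range(20)  # a client sending one request per second for 20 s
    print(answered_requests(3, 8, one_per_second))  # 10
    print(answered_requests(3, 1, one_per_second))  # 3
```

With interval 3, a burst of 8 lets the first handful of packets through before the 8-second spacing takes over, while a burst of 1 enforces the spacing from the start.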
I recommend you read the chrony config documentation: chrony – chrony.conf(5)
It’s a lot of text but really comprehensive.