Maybe a newbie question here. Here are my findings on my server since it entered the CN zone:
At 30k QPS, dropped packets at ~100/s, bandwidth usage at 30~50Mbps, ntpd CPU ~10%;
At 80k QPS, dropped packets at ~1.5k/s, bandwidth usage at 80~90Mbps, ntpd CPU ~30%;
At 110k QPS, dropped packets at ~20k/s, bandwidth usage at 110~130Mbps, ntpd CPU ~70%.
I guess the sudden increase in dropped packets indicates that my server is not healthy. I wonder what I should adjust, or maybe more info is needed to figure out the bottleneck? I have already disabled conntrack as other posts suggested.
Where are you measuring your dropped packets? If it’s in Ethernet receive, you may need to look into NIC buffer & coalesce settings. A great resource for this (and lots of other network troubleshooting) on Linux is Monitoring and Tuning the Linux Networking Stack: Receiving Data.
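If it does turn out to be NIC-level drops, something along these lines is a quick first check (a rough sketch; the interface name eth0 is just a placeholder for whatever your server actually uses):

```
# NIC-level drop/discard counters (names vary by driver)
ethtool -S eth0 | grep -i drop
# current vs. maximum RX/TX ring buffer sizes
ethtool -g eth0
# interrupt coalescing settings
ethtool -c eth0
# example: enlarge the RX ring, if the hardware supports it
ethtool -G eth0 rx 4096
```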
It’s from “ntpq -c iostats”. When it starts dropping packets massively, I see no problem with the server’s networking otherwise, e.g. no packet loss in TCP/ICMP.
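(In case it helps anyone else: something like the following can be used to watch that counter over time and see exactly when the drops spike; the 60-second interval is arbitrary.)

```
watch -n 60 "ntpq -c iostats"
```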
I have tried rsntp, which seems to send replies at 200 Mbps, but the score just drops again and again, with empty lines showing up in the CSV logs. Not sure what I did wrong.
In a quick scan through the code, it looks like there are two places where ntpd can drop a packet due to insufficient memory buffers being available, and the counter might also include packets dropped due to failed sanity/rate-limit checks (the docs refer to this, but I couldn’t find where in the code it happens). If the docs are correct, then a certain number of drops is expected on a pool server due to dumb/malicious clients. There don’t seem to be any buffer tuning options in the config file, so someone more experienced in ntpd development would have to comment on what to do about it.
But it sounds like chrony is doing the job, so I guess that’s all academic now.
As for the dropped packets, I believe that if you have the “limited” keyword in your “restrict” config option, packets dropped because of this rule will be counted in the “dropped packets” counter. The effect of this can be significant. The “discard” config option controls this. I believe the default is “discard average 3 minimum 2”, which means the minimum average interpacket spacing is 2^3 seconds, i.e. 8 seconds, with a minimum of 2 seconds between individual requests.
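For reference, a hedged sketch of what that looks like in ntp.conf, with the documented defaults written out explicitly (the other restrict flags shown are just a common combination, not something taken from your config):

```
# "limited" makes this restrict rule subject to the "discard" rate limits
restrict default limited nomodify notrap nopeer noquery
# the documented defaults, spelled out:
#   average 3 -> minimum average interpacket spacing of 2^3 = 8 s
#   minimum 2 -> at least 2 s between consecutive requests from one client
discard average 3 minimum 2
```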
Just took a look at my config - I indeed have “limited” in the “restrict” line. I don’t have “discard” so it should be using the default.
If I understand it correctly, these are limits on a single client’s query interval, so statistically the packet drops they cause should scale roughly linearly with QPS. That doesn’t seem to be the case for me, because the drop rate is growing much faster than linearly here.
Adding one more data point:
At 130k QPS, dropped packets at ~40k/s, bandwidth usage still at 110~130Mbps, ntpd CPU ~70%.
My guess is that it was responding with “unsynchronized” packets. Was there a synchronized NTP client/server listening on localhost on port 11123?
You could check the output of rsntp -d | grep 'Client received' and see if the leap is 0.
That might work ok with 200 kpps, but probably not much more than that. rsntp should be able to do that with multiple CPU cores. It should be built with cargo build --release; a debug binary is slower.
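Roughly something like this (just a sketch — the clone URL is from memory, and any server-specific options for running rsntp are omitted here):

```
git clone https://github.com/mlichvar/rsntp.git
cd rsntp
# release build; a plain "cargo build" produces a slower debug binary
cargo build --release
# run the optimized binary; -d enables the debug output mentioned above
./target/release/rsntp -d
```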
My guess would be that it’s because many clients in CN are behind NAT, so lots of clients share a single source IP and trip the per-IP rate limit together.
And from the traffic graph on my pool server, I often see large traffic spikes at certain times of day, like 08:00 or so. They may be from ntpdate commands in cron jobs and scripts, rather than from NTP daemons.
The problem is I don’t have any reliable way to verify that.