Dropped packets

Maybe a newbie question here. Here are my findings on my server since it entered the cn zone:

At 30k QPS, dropped packets at ~100/s, bandwidth usage at 30~50Mbps, ntpd CPU ~10%;
At 80k QPS, dropped packets at ~1.5k/s, bandwidth usage at 80~90Mbps, ntpd CPU ~30%;
At 110k QPS, dropped packets at ~20k/s, bandwidth usage at 110~130Mbps, ntpd CPU ~70%.

I guess the sudden increase in dropped packets indicates that my server is not healthy. What should I adjust, or is more info needed to figure out the bottleneck? I have already disabled conntrack, as other posts suggested.

Thanks for any help!

Where are you measuring your dropped packets? If it’s in Ethernet receive, you may need to look into NIC buffer & coalesce settings. A great resource for this (and lots of other network troubleshooting) on Linux is Monitoring and Tuning the Linux Networking Stack: Receiving Data.

It’s from “ntpq -c iostats”. When it starts dropping packets massively, I see no problem in the server’s networking, like no packet loss in tcp/icmp etc.

I have also tried rsntp, which seems to send replies at 200Mbps, but the score just keeps dropping and empty lines keep getting added to the CSV logs. Not sure what I did wrong.

I have switched to plain chrony for now.

With chrony my server is currently serving 120k qps at 130Mbps without a single dropped packet.

This seems to be a problem with ntpd, I guess…

In a quick scan through the code, it looks like there are two places where ntpd can drop a packet due to insufficient memory buffers available, and it might also include packets dropped due to failed sanity/rate limit checks (the docs refer to this, but I couldn’t find where in the code it does so). If the docs are correct, then a certain number of drops is expected on a pool server due to dumb/malicious clients. There don’t seem to be any buffer tuning options in the config file, so someone more experienced in ntpd development would have to comment on what to do about it.

But it sounds like chrony is doing the job, so I guess that’s all academic now.

As for the dropped packets, I believe that if you have the “limited” keyword in your “restrict” config option, the packets dropped because of this rule will be counted in the “dropped packets” counter. The effect of this can be significant. The “discard” config option controls this. I believe the default is “discard average 3 minimum 2”, which means a minimum average interpacket spacing of 2^3 seconds, i.e. 8 seconds, with at least 2 seconds between any two individual requests.
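To make the “average” and “minimum” idea concrete, here is a rough sketch of a per-address limiter using the defaults quoted above. This is NOT ntpd’s actual algorithm: the class name, the smoothing factor of 0.25, and the exact drop rule are assumptions made up for illustration. It only shows how a guard time plus an average-spacing threshold can turn fast pollers into “dropped” packets.

```python
class RateLimiter:
    """Toy per-address limiter (hypothetical, not ntpd's real code)."""

    def __init__(self, average_log2=3, minimum=2.0):
        # "discard average 3" -> 2**3 = 8 s minimum average spacing
        self.avg_limit = float(2 ** average_log2)
        # "discard minimum 2" -> 2 s guard time between two packets
        self.minimum = float(minimum)
        # source address -> (time of last packet, smoothed interval)
        self.clients = {}

    def allow(self, addr, now):
        """Return True if the packet would be answered, False if dropped."""
        if addr not in self.clients:
            # First packet from this address: start at the average limit.
            self.clients[addr] = (now, self.avg_limit)
            return True
        last_seen, smoothed = self.clients[addr]
        interval = now - last_seen
        # Exponentially smoothed spacing; the 0.25 weight is an assumption.
        smoothed = 0.75 * smoothed + 0.25 * interval
        self.clients[addr] = (now, smoothed)
        return interval >= self.minimum and smoothed >= self.avg_limit


limiter = RateLimiter()
# A client polling every 64 s is always answered.
polite = [limiter.allow("192.0.2.1", t) for t in range(0, 640, 64)]
# A client sending once per second is dropped after its first packet.
burst = [limiter.allow("192.0.2.2", t) for t in range(0, 10)]
print(polite.count(False), burst.count(False))
```

Note that the sketch keys only on the source address, so in this model anything that makes one address send often ends up counted as drops.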

Just took a look at my config - I indeed have “limited” in the “restrict” line. I don’t have “discard” so it should be using the default.

If I understand it correctly, these are limits on a single client’s query interval, so statistically the packet drops they cause should grow almost linearly with the QPS number. This doesn’t seem to be the case for me, because the drops are growing much faster than that.

Adding one more data point:

At 130k QPS, dropped packets at ~40k/s, bandwidth usage still at 110~130Mbps, ntpd CPU ~70%.
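For what it’s worth, a quick way to see how far from linear this is: divide drops by QPS for each data point in this thread. The sketch below assumes the QPS figures count all incoming queries (including the dropped ones), which the thread doesn’t actually state.

```python
# Data points reported in this thread: (queries per second, drops per second)
data_points = [
    (30_000, 100),
    (80_000, 1_500),
    (110_000, 20_000),
    (130_000, 40_000),
]


def drop_ratio(qps, drops_per_sec):
    """Fraction of queries that were dropped at a given load."""
    return drops_per_sec / qps


ratios = [drop_ratio(qps, drops) for qps, drops in data_points]

for (qps, drops), ratio in zip(data_points, ratios):
    print(f"{qps:>7} QPS: {drops:>6} drops/s = {ratio:.2%} of queries")
```

If drops were purely from per-client rate limiting with a stable client mix, the ratio should stay roughly constant. Instead it climbs from about 0.3% to about 31%, which points at something that saturates with load.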

My guess is that it was responding with “unsynchronized” packets. Was there a synchronized NTP client/server listening on localhost on port 11123?

You could check the output of rsntp -d | grep 'Client received' and see if the leap is 0.

That might work ok with 200 kpps, but probably not much more than that. rsntp should be able to do that with multiple CPU cores. It should be built with cargo build --release; a debug binary is slower.

There is. I started chrony on port 11123 a few minutes before starting rsntp.

Thanks. I’ll try again after my new servers are added to the cn pool.

Yes, I compiled it using --release. The performance is great, but the CPU usage is about 2x that of chrony. Not a big problem for me though.

My guess is that it’s because many clients in CN are behind NAT.

And from the traffic graph on my pool server, I often see traffic spikes at certain times of day, like 08:00 or so. They may come from ntpdate commands in cron jobs and scripts rather than from NTP daemons.

The problem is I don’t have any reliable way to verify that.
