The issue of NTP request traffic exceeding available bandwidth

Hello!
I have seen a drastic decrease in the Russian NTP server count starting 2 months ago (the same time I started to experience problems). Maybe somebody is trying to DDoS Russian timeservers, as described here, for example. It would be interesting to know the reason. Bandwidth is not the only issue; packet count is too.
pool.ntp.org: Statistics for 46.188.16.150 - are these metrics too high?
I’ll try to decrease my netspeed setting to the minimum possible while still providing time service (I switched from 500 Mbps to 50 Mbps with no effect), but if the enormous traffic continues, I’ll very reluctantly have to remove my server from the project.
BTW, I have one VDS that I could use as an NTP server too, but 2 servers from one person cannot serve the whole zone. =(

In addition to the reduced server count discussed above, there is also the possibility of an actual attack using forged source IP addresses in NTP queries. I hope you’re using some sort of rate limiting. My chrony.conf has “ratelimit interval 3 burst 8 leak 2”, for example (please don’t stray off topic by discussing what exactly those limits should be set to; people tend to have different opinions, so use whatever suits your use case).
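For readers wanting to try the same thing, here is a minimal chrony.conf fragment with exactly those values (per the chrony documentation, `interval` and `leak` are log2 values):

```
# chrony.conf: per-client rate limiting of NTP responses.
# interval 3 -> at most one response per 2^3 = 8 seconds per client
# burst 8    -> but allow an initial burst of up to 8 responses
# leak 2     -> randomly let 1 in 2^2 = 4 limited requests through,
#              so misbehaving clients can still sync eventually
ratelimit interval 3 burst 8 leak 2
```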

Until ~2 months ago it was not an issue. I had my netspeed set to the actual 500 Mbps (I have a 500 Mbps Ethernet connection), and NTP queries were answered by my edge MikroTik RB4011 without any problems. But now I think I should enable rate limiting, or maybe NAT the NTP packets to an internal VM running a full-featured ntpd or chrony plus iptables, in order to tune the load.
I’ll investigate where packets originate from, too.
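If the rate limiting ends up being done with iptables on that internal VM, one possible sketch is the `hashlimit` match, which keeps a per-source-IP token bucket (the limits below are illustrative only, not a recommendation):

```
# Accept up to ~1 NTP request per source IP every 8 seconds on average,
# with a burst allowance of 8; drop the excess. Illustrative values only.
iptables -A INPUT -p udp --dport 123 \
    -m hashlimit --hashlimit-name ntp --hashlimit-mode srcip \
    --hashlimit-upto 8/minute --hashlimit-burst 8 -j ACCEPT
iptables -A INPUT -p udp --dport 123 -j DROP
```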

There’s always the option to leave the server in the pool with a “monitoring only” setting and observe how the situation evolves, occasionally “dipping your toe in” to test the water, I mean, the pool :-) . No need to remove it completely right away. E.g., with “monitoring only” at this time, you could quickly determine whether the traffic is likely legitimate and only due to overload, or likely an attack.

When reducing to zero (via the “monitoring only” setting), the traffic should start to die down soon, and noticeably (though as it is way above normal right now, this might not be immediately visible). If the traffic does not go down noticeably, it is likely an attack: the pool should no longer be directing new clients to your server, yet someone is holding on to your address and hammering it, either maliciously or unwittingly, e.g., through some configuration error somewhere (the kind of which triggered the creation of the pool in the first place).

E.g., it could be some DNS server somewhere that for some reason is holding on to a snapshot of the pool’s response at a specific point in time, rather than reflecting the rotation of DNS records that should happen on the pool side over time. I think I observed something like that myself recently, when the load on one of my servers went up noticeably despite it being in said “monitoring only” mode by default, having only “dipped its toe” into the pool for a short moment some time before the increase. I.e., these things do happen even without any malice intended.

I opened the valve a bit (512 kbps) and everything seems to be OK; I will continue monitoring. I am a long-time member of the pool, and leaving it is the last resort.
Maybe it’s a good sign from the Universe that I should implement Zabbix monitoring at home.

By the way, owning one of 8 timeservers in the greatest country is a reason to be proud ^^,

You’re probably much better off leaving the NTP service to the Mikrotik, as NAT/Port Forwarding to an internal server will stress the NAT mapping table and drive up router CPU and memory use. See Monitors belgg1-19sfa9p and belgg2-19sfa9p having hiccups? - #20 by davehart

You could still have the MikroTik use an internal chrony or ntpd server as its NTP source, so you can make sure it’s serving the best time you can craft, with careful source selection and more tools than the router’s NTP implementation likely offers.

Glad that things have improved, especially since it suggests it likely was overload only, not some malicious activity. As your score is currently below 10, you are not getting new traffic from the pool right now; so if the traffic dropped, that means nothing else is continuing to send traffic in your direction independently of the pool (apart from residual background load).

So it will be interesting to see what happens when your score gets above 10 again.

Though your “netspeed” fraction being only 0.00465 % of the sum of all netspeed values configured for the Russia zone makes me wonder where the remaining 99.995 % are going.
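For a rough sense of scale (the client count below is an invented assumption; only the 0.00465 % share is from the zone page), such a share translates into a tiny fraction of the zone’s clients:

```python
# What a 0.00465 % netspeed share means, with an assumed client count.
share = 0.00465 / 100                # fraction of the zone's summed netspeed
assumed_zone_clients = 1_000_000     # invented for illustration
expected_clients = assumed_zone_clients * share
print(f"{expected_clients:.1f} of the assumed million clients")  # 46.5
print(f"remaining share for everyone else: {1 - share:.5f}")     # 0.99995
```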

And it would of course be interesting to know where all the other servers that were active some 60 days ago have gone. Maybe it’s just an accounting issue:

Regarding the “netspeed” fraction, my impression is that servers with a score lower than ten, or even servers in their last days before being removed from the pool, are also taken into account. I am not sure how the per-zone server count handles these servers. If they are not counted there while their score is below 10, that could mean there are still more servers in the zone than those 9, just not active right now for one reason or another.

Case in point: the count is only 8 this very moment. Maybe, just maybe, that is because your server isn’t currently included in the DNS rotation, and the number will go up again when its score exceeds 10 again. But it could just as well be that 99 servers dropped out of the DNS rotation while 98 others started to be included again…

Just looking at what DNS has returned during the last few minutes, there are at least 17 servers in the zone, among them the Cloudflare ones, which likely soak up a lot of the traffic.

AFAIK our communications authority has warned Cloudflare several times about local law violations, so it may be slowed down, banned, or experience other problems.

Me too. An abrupt fall in server count.
I looked through the history of our IT chat and found that I first noticed problems with NTP service overload on October 25th.
In any case, 8 servers for a whole country are too few.

While the zone page currently says there are only six servers in the zone right now, I’ve been seeing 47 since yesterday.

Your server is back now, very good. But it is showing the prototypical signs of periodic overload: the score slowly rises above 10, then soon drops below 10 again. That means that once the server is put back into DNS rotation, the incoming traffic causes packets to be dropped at a higher rate, which the monitors then detect as well. Typically, that is due to overload somewhere.

Try reducing the netspeed setting a bit.

Though the score dropping below 10 obviously does not mean the server isn’t serving clients anymore; it just isn’t getting new ones from the pool until the load situation has resolved itself. So this is a very crude mechanism to regulate the load on your server, keeping it at an acceptable level on a rough average. And it can be an acceptable way to run the server.

Though it depends on the particular circumstances whether that is the most efficient way, i.e., whether the server is serving the optimal number of clients that way. Since recovery from scores below 10 takes a while even after the peak of the overload has subsided, the server is likely serving fewer clients during that time than it could. And while overloaded, it is not giving good service.

Thus, setting a lower netspeed could avoid the “sawtooth” in load and score, potentially leading to a higher average number of clients served, and better service.
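The feedback loop described above can be sketched as a toy model (all constants are invented and the real pool scoring algorithm is different; this only illustrates the cycle): while the score is at or above 10 the pool directs traffic at the server, overload causes drops, drops push the score down, the server recovers, and the cycle repeats.

```python
# Toy model of the score "sawtooth". All constants are invented; the real
# pool scoring algorithm differs. This only illustrates the feedback cycle.
def simulate(steps, capacity_pps, pool_influx_pps, decay=0.7):
    score, load, history = 15.0, 0.0, []
    for _ in range(steps):
        # The pool only directs new clients to servers with score >= 10.
        load = load * decay + (pool_influx_pps if score >= 10 else 0.0)
        drop_rate = max(0.0, (load - capacity_pps) / load) if load else 0.0
        # Dropped probes hurt the score much faster than it recovers.
        score = min(20.0, score + (1.0 if drop_rate < 0.05 else -4.0))
        history.append(score)
    return history

scores = simulate(60, capacity_pps=200_000, pool_influx_pps=300_000)
print(min(scores), max(scores))  # oscillates below and above 10
```

With influx above capacity, the score never settles: it climbs to 10, gets flooded, crashes, recovers, and repeats, which is exactly the “wave” pattern described in the thread.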

I guess the drop was gradual over time, from some 122 servers 60 days ago, to 13 two weeks ago, to 6 right now (though the value is probably slightly higher during the off-peak hours at night).

But still, indeed, I wonder why there are so few servers in some areas compared to others. In China, I learned in this forum, the regulatory environment prevents citizens from running servers in their living rooms, and the cross-subsidizing of home connections by business connections makes bandwidth in datacenters expensive, as @summer76527 explains in his opening post to this thread. There are other potential factors as well, making, e.g., bandwidth in the USA slightly more expensive than, e.g., in Europe (at least around the time that blog post was written; things may obviously have changed since then, though some fundamental aspects are probably still similar).

I’d be curious to learn about the situation in Russia, in case you have any insights, and are willing to share.

Down to 5.

The decline is visible in the graph on the zone page as well, though the temporal resolution is not optimal to discern details especially in the most recent past. I think the thin, teal-colored line could be the number of registered IPv4 servers, and the thick green one could be the “active” servers. I think the dotted lines are for IPv6. Not sure yet what the brownish lines are, could be the difference between registered and active, i.e., inactive count, seeing how it kind of mirrors the “active” line.

EDIT: Looking at how the graph is generated, the brownish line is indeed the difference between “registered” and “active” counts.

We have a similar situation in the ru zone of the pool. We have fewer than 10 active servers left in the ru zone (war and security reasons, NTP DDoS reflection attacks, etc.; maybe some pool monitoring troubles as well). Now if I add my server to the pool and set the bandwidth to the minimal 512k, within 15 minutes I get “waves” of requests going from the “normal” 1-10-50-100 kpps up to 1500 kpps (yes, 1,500,000 requests per second). My old server can handle about 200-250 kpps without losing responses, at ~80-90% CPU load, but 1.5 million is too much (it’s above 500-600 Mbit/s of bandwidth). Any Raspberry Pi or embedded server simply dies.
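For scale, the bandwidth figure can be checked with simple packet arithmetic (assuming a plain 48-byte NTP packet plus UDP/IPv4 headers; Ethernet framing would add even more):

```python
# Rough bandwidth for 1.5 Mpps of NTP traffic.
# 48-byte NTP payload + 8-byte UDP + 20-byte IPv4 header = 76 bytes/packet.
pps = 1_500_000
bytes_per_packet = 48 + 8 + 20
mbit_per_s = pps * bytes_per_packet * 8 / 1e6
print(mbit_per_s)  # 912.0, indeed above 500-600 Mbit/s even before framing
```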

When the server reaches 100% CPU, it can’t handle that many requests, it loses score, and the pool removes it. The CPU load then decreases, the server starts to cope, the score grows, the pool makes the server active again and sends a million requests its way again. Again the server can’t handle all the requests, the score decreases, and so on in a circle (or in waves).

Another pool member from the ru segment has the same problem: his server can handle ~0.5-1 million requests per second, but it impacts his other services. Unfortunately, we are forced to disconnect from the pool too :(

We have observed the servers in the ru zone and see that the score is floating in waves for all servers in the zone. Apparently there are too many requests, and the ru-zone servers simply overload; as a result, they drop out of the pool in a circle. All of this is a bad situation for the pool and the timekeeping community.

We suggest that if there are few servers in a zone, the pool could distribute clients to a “more global” zone more actively. Otherwise, the entire overloaded zone will see permanent score/server flapping and work poorly.

That’s a long-standing, well-known issue that comes up on a regular basis in this forum, for different zones. There have been a number of different proposals, some similar to yours, but there is no visible progress on addressing it; at least I am not aware of anything specific. There are some signs that something might be happening, but it is unclear what, and when it will come to fruition.

One mitigation, not a fix, would be to tell clients not to use the country zone, or the so-called global zone, because both of them will limit clients to servers from the country zone only.

Rather, tell clients to use the continent zone.
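For example, a client could be pointed at a continent zone like this (chrony syntax; the zone name is just an example, pick the continent where most of the devices will live):

```
# chrony.conf: use a continent zone instead of a country or "global" zone
pool europe.pool.ntp.org iburst
```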

But that doesn’t help the servers, obviously, at least not in the near term, as they can hardly influence how the clients are configured. And in many cases, the users of the clients aren’t even aware how their client is configured, or even that it is using the pool, or even aware their system is getting time from the Internet.

The question is, though, what specifically happened in the Russia zone. I am not saying that things were perfect, but it seems that up until around October 15, 2024, things were at least stable: around the mid-130s registered servers, around the mid-120s active servers.

Then, starting on October 16, 2024, numbers started to drop.

So it would be interesting to know what happened on, or around that date, that may have triggered this, directly or indirectly.

The red line is registered servers, the yellow one active ones.

Maybe a few, very big servers left the pool, and the remaining ones weren’t able to absorb the additional load, leading to the eventual collapse of the zone, with servers dropping out like dominoes.

EDIT: Indeed, it seems, for one reason or another, server capacity left the pool, which might have caused the zone to cross a tipping point.

Date        Registered servers  Active servers  Capacity
2024-10-10  135                 128             26725468
2024-10-11  135                 127             27224956
2024-10-12  135                 128             27220968
2024-10-13  135                 128             27216968
2024-10-14  135                 126             27165968
2024-10-15  135                 125             26165968
2024-10-16  134                 119             23518956
2024-10-17  130                  76             13902468
2024-10-18  122                  53             12642492
2024-10-19  121                  54             15778492
2024-10-20  118                  63             13144492

The (relative) capacity seems to have dropped by about 10 % from October 15 to October 16, much more than the relative drop in server count. So it might be that a big contributor dropped from the pool, or lowered their netspeed setting, increasing the relative share of the remaining servers. Maybe that overloaded some of the remaining servers, causing them to drop out, which in turn increased the relative share of those remaining, etc.
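The ~10 % figure follows directly from the capacity column of the table above:

```python
# Relative drops from 2024-10-15 to 2024-10-16, values taken from the table.
cap_oct15, cap_oct16 = 26_165_968, 23_518_956
act_oct15, act_oct16 = 125, 119

capacity_drop = (cap_oct15 - cap_oct16) / cap_oct15
active_drop = (act_oct15 - act_oct16) / act_oct15

print(f"capacity: -{capacity_drop:.1%}, active servers: -{active_drop:.1%}")
# capacity: -10.1%, active servers: -4.8%
```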

Yes, I understand. Good point about the clients. I’m in contact with one regional equipment manufacturer and have asked them to use the global pool address in their CPE firmware.

The score originally reflected the accuracy of the time and the suitability of the server for synchronization. Now the situation is such that score and accuracy are lost due to server overload, not due to “timekeeping accuracy”. That’s the question.

Your patch is an obvious improvement. But allowing sub-speeds/priorities like 5-10 kbps is only necessary for zones where there are few servers (and many requests); otherwise everyone in the pool will set their minimum speeds and the situation will repeat itself.

Hmm, I think it also implied reachability/availability. Consider that the penalty for lack of accuracy is very low compared to the penalty for not being reachable, whatever the reasons for the latter.

To me, the scores and the monitors are just the canary. I.e., they are supposed to reflect what actual clients are seeing, and manage the distribution of clients to servers such that clients get “good” service: accurate, but even more importantly, available. When I don’t get a reply from the server, what good is it that it may have perfect accuracy?

One issue also is that a common reaction to issues with a server is to remove it from the pool. That precipitates the trend.

Yes, the netspeed setting is only relative, and if there is not enough capacity, adjusting relative weights does not help.

Yes, indeed. That is why I suggested that the range of available netspeeds be a function of the ratio of the number of servers in a zone to the assumed number of clients. But seeing as the capacity to get anything implemented at all is very low, my hope is that a simpler solution has a higher chance of getting implemented than a more sophisticated one requiring more effort.
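Purely as an illustration of that suggestion (nothing like this exists in the pool code; the function name, thresholds, and client counts are all invented), the floor on selectable netspeeds could be derived from a zone’s clients-per-server ratio:

```python
# Hypothetical policy sketch: allow very low netspeed choices only in zones
# with few servers and many clients. All names and thresholds are invented.
def min_selectable_netspeed_kbps(active_servers: int,
                                 assumed_clients: int) -> int:
    if active_servers == 0:
        return 512
    clients_per_server = assumed_clients / active_servers
    # Well-provisioned zone: keep the normal 512 kbps floor.
    if clients_per_server < 10_000:
        return 512
    # Underserved zone: permit much smaller shares, so that modest servers
    # can join without immediately being overwhelmed.
    if clients_per_server < 100_000:
        return 64
    return 8

print(min_selectable_netspeed_kbps(100, 500_000))  # 512
print(min_selectable_netspeed_kbps(8, 5_000_000))  # 8
```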

But on its own, this does not solve the problem if the imbalance is too high, as it might be in the Russia zone. There always needs to be enough capacity in place so that even a low setting does not result in too high a load.

That is why there might be a need for a synchronized approach: line up enough new servers, but do not add them one by one, as each would get overwhelmed in turn. Add them all at once, so that the load is evenly shared among them.

NO! (Sorry for being direct.)

That is exactly part of the problem: It is called “global”, but what it actually does, currently, is lock the client into the local country zone.

I guess the naming “global” comes from the fact that it can be configured anywhere in the world and will give you servers. And the intention is that it will give you servers that are “close”.

But the way it is currently implemented is that “close” means “from the same zone only as the client” (as best as the system can determine the zone the client is located in).

For details, see, e.g., this post, and the materials it refers to.

And what is then proposed for client settings? Specify servers of another zone? I think that equipment buyers will be surprised and will not see any logic in this.

Yes, fully agree. But that is the current situation until the problem is solved.

So indeed, the current approach would be to use a specific country zone, or perhaps better, as it is more generic, a continent zone that is known to have enough servers in it and that is “close” to the location where most clients are expected to reside.

And if clients with that configuration are going to be distributed across the world, then some will not get “local” service. But at least they will get service, and there is a common view in this forum that latency, as such, is not as much of an issue as many people think. And in some cases, “local” may even mean higher latencies.

I work at a small ISP. For some time, for testing, we served a pool zone from our own DNS servers and directed our users to our “local, nearby” stratum-1 NTP servers (one of them was in the pool). Some time ago we removed these DNS records, and now our clients’ equipment sends requests to the common pool.
Perhaps it’s time to return to that decision. It is rather controversial, but a good one for the clients and the pool right now. It will not solve the problem of zone congestion, of course.

I am not sure I fully understand, but if you run your own NTP server(s) that are currently not in the pool and are only used by your own customers, maybe adding them to the NTP pool would provide the capacity that is needed to help the zone recover.

But not sure I understood correctly, or how much capacity you could add to the NTP pool, and that would obviously add a lot of clients to your servers that are not your customers, which may or may not be ok.

I am not sure how big you are, i.e., what would happen the other way round, as you seem to suggest: if your clients no longer connect to the NTP pool but only to your own servers, could that free enough resources in the NTP pool for it to recover?

And if someone has access to a datacenter with compute and bandwidth capacity, it might also help to add servers there. They might not be stratum 1, but right now, the critical issue is capacity in the zone, not accuracy. If there is enough capacity added, that would also help smaller servers to join, and slowly increase the overall capacity that way. (Which is what my patch proposal is supposed to help with as well.)

But there’s obviously a lot of "if"s in there, i.e., not sure this can be pulled off without common coordination, like was done for the China zone some time ago.

And you are responsible towards your customers, obviously, so directing them to your servers instead of the pool might be the way to go if the NTP pool doesn’t work anymore. And there’s also voices in this forum that say that (commercial) organizations responsible for a large number of clients should indeed run their own servers instead of tapping into the NTP pool with its often volunteer-run servers.
