Collapse of Russia country zone

Here’s a server that is scheduled for removal from the pool, i.e., it is not included in the pool’s DNS rotation anymore. Note how the score starts to stabilize at about the time the removal from DNS rotation likely became effective, and/or when the server was set to “monitoring only” mode:

https://www.ntppool.org/scores/46.0.192.91

The same goes for other servers that are in “monitoring only” mode, e.g.:

https://www.ntppool.org/scores/195.35.71.4
https://www.ntppool.org/scores/195.112.113.253

I.e., it is unlikely that widespread blocking of the monitors is the issue.

This zone collapse was triggered by a high request rate from 3-5 IPs. I was really busy at the time, so I didn’t record those IPs.
The request rate from each of those IPs was high enough to exhaust the conntrack connection limit of a typically configured Linux firewall, after which the monitor excluded the server from the pool of active servers.
Step by step, the monitoring system redirected traffic to a new victim, amplifying the attack with traffic from well-behaved clients.

1 Like

Hi @timz, welcome to the forum!

This can happen. It doesn’t need to be malicious; it can be a misconfigured or badly programmed client, or it could be caused by CGNAT concentrating too many actual clients behind a single IPv4 address.

It is generally suggested to exempt incoming NTP traffic from connection tracking, at least for those lucky enough to be able to configure that on the device, as in the case of an open Linux system :slight_smile: . In this context, connection tracking of incoming traffic towards the server serves no purpose that couldn’t be addressed by stateless filtering.
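For example, on Linux with iptables that could look roughly like the following (a minimal sketch; the same is expressible in nftables, and rule placement has to fit your existing ruleset):

# NOTRACK in the raw table is evaluated before connection tracking,
# so NTP packets never create conntrack entries
iptables -t raw -A PREROUTING -p udp --dport 123 -j NOTRACK
iptables -t raw -A OUTPUT -p udp --sport 123 -j NOTRACK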

That is actually the job of the system: to distribute traffic among available servers, e.g., considering their availability. It doesn’t have to be actual amplification; if the ratio of clients to available servers is too high, then the result is high traffic on each server, no amplification needed.

Next time something like that happens, I encourage you to raise it here in the forum, and people can weigh in on how to deal with such clients, e.g., see whether other people are affected as well.

It’s easy to find the addresses. You turn on “monitoring only” in the pool and collect the IPs from which the requests were made (using any Linux traffic accounting tool, or the NTP server’s log). Then you get the query times from https://www.ntppool.org/scores/x.x.x.x/log… and look into your own query log. Got it :slight_smile:
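As a sketch of that second step (the exact log URL parameters are elided above; this assumes the /log endpoint returns the CSV format quoted later in this thread), extracting the probe timestamps to grep for in your own logs could look like:

# print the timestamp column of the monitors' probe log,
# to correlate with entries in the local query log
curl -s https://www.ntppool.org/scores/x.x.x.x/log | awk -F, 'NR > 1 { print $2 }'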

Yes. The graphs look like my own server’s. I am not blocking the monitoring; my server just dies under load if I join the pool. And the server primarily serves clients other than the pool, but the pool is important too, because almost every device now syncs from the pool: smartphones, smart TVs, CPEs, smart homes, etc., even refrigerators :slight_smile:

1 Like

Ah, sorry, didn’t mean to put you (or anyone else) on the spot.

Ah, true, forgot about that. Might be a bit of work if the server is very busy, but otherwise indeed a viable option.

I can see your point, but personally I would not bother. The rate limiting would affect the clients in any case. When the server’s score drops below 10, the traffic decreases and the chance of a monitor probe getting dropped by the rate limiting decreases, thus increasing the score. When the score rises above 10, the server gets included in the pool again and the traffic increases, causing more monitor probes to get dropped in the process. So in the end a rate-limited server’s score would hover around 10, which is perfectly fine in this situation.

Edit: Generally speaking, while there is a way to figure out the monitoring server IP addresses, that is nearly always the wrong approach to the problem.

It all depends on what one wants, and how much effort to spend. In this case, I had in mind that one server was also serving direct customers of the operator, so that might be an incentive not to rate limit wholesale, but to be more focused and at least exempt one’s own customers. Those might be identifiable by IP address range, which would be another, probably simpler approach to limiting the impact in that case.
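A rough nftables sketch of that idea (the inet filter input table/chain, the rate, and the customer range 198.51.100.0/24 are placeholders for whatever the actual setup uses):

# let own customers through unconditionally, rate limit everyone else
nft add rule inet filter input ip saddr 198.51.100.0/24 udp dport 123 accept
nft add rule inet filter input udp dport 123 limit rate over 5000/second drop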

And it could help to start reducing the load before it is actually necessary, without resorting to indiscriminate rate limiting.

But I fully concede: more effort for potentially limited benefit, especially given the high-latency control loop.

And wasn’t it you who at some point published a set of scripts that were intended to achieve just that, limiting traffic without actually dropping any user traffic? :wink:

At least this guy wanted to shut down the RU zone through malicious activity. I think he is not the only one who has thought about that.

1 Like

Sure, I didn’t mean to say there aren’t bad actors out there. But often enough, it is just carelessness or ignorance. And seeing that traffic seems to actually drop once a server is removed from the pool suggests that in the cases we’ve seen in this thread, probably no malice was intended.

And despite what the poster writes, actually creating that much traffic is not easy. That is where the amplification potential of some NTP messages came into the picture in the past. But I think most servers nowadays are no longer susceptible to that.
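If you want to check whether a given server still is, the classic amplification vector was the mode-7 “monlist” query. A quick test (ntpdc is deprecated and may no longer be packaged for your system, and 192.0.2.1 is just a placeholder address):

# a reply listing recent clients means monlist is still enabled;
# a timeout usually means it is disabled or filtered
ntpdc -n -c monlist 192.0.2.1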

  1. Has anybody analyzed the traffic in the ru zone? Are there maybe one or more IPs doing a (D)DoS?

  2. Does anybody have connections to an ISP, university, or any other institution that could provide some NTP server capacity to the pool?

2 Likes

@PoolMUC

As an ISP, we can do terrible things if we need to.
All our customers use our DNS servers. OK. We can hijack the DNS zone *.pool.ntp.org and resolve these names to our NTP servers’ IPs. Any server hardware easily handles 1-2 kpps for us. Customers will be happy, and the pool will get fewer queries.
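For illustration only, in Unbound such an override could be a redirect zone roughly like this (192.0.2.1 standing in for one of our NTP servers):

# unbound.conf: answer pool.ntp.org and everything beneath it with our own address
local-zone: "pool.ntp.org." redirect
local-data: "pool.ntp.org. A 192.0.2.1"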

But… DNS hijacking is a bad decision. Very bad. But if the pool’s ru zone starts to respond really badly, or dies, there will be no other way out. This is an extreme case, and it has not happened yet. Let’s continue to observe.

1 Like

I block nothing, but when checking… it seems that this is at the DNS level.

bas@workstation:~$ nslookup ru.pool.ntp.org
Server:		127.0.0.53
Address:	127.0.0.53#53

Non-authoritative answer:
Name:	ru.pool.ntp.org
Address: 192.36.143.130
Name:	ru.pool.ntp.org
Address: 51.250.9.134
Name:	ru.pool.ntp.org
Address: 162.159.200.123
Name:	ru.pool.ntp.org
Address: 162.159.200.1

bas@workstation:~$ nslookup 51.250.9.134
;; Got SERVFAIL reply from 127.0.0.53
** server can't find 134.9.250.51.in-addr.arpa: SERVFAIL

bas@workstation:~$ nslookup 51.250.9.134 1.1.1.1
** server can't find 134.9.250.51.in-addr.arpa: NXDOMAIN

The pool isn’t blocking them, see here:

https://www.ntppool.org/scores/51.250.9.134

Scores are bad, yet the server is handed out by the pool… @ask, why?

1 Like

Sure, thus I hope my own ISP wields this power responsibly :slight_smile:

Then I wonder why one would go through the complicated route of hijacking the DNS zone rather than adding those powerful servers to the pool directly. OK, that would not only help your customers, you’d support other people as well. But if you have the capacity…

But as said before, your first responsibility is obviously making your paying customers happy.

I don’t think there’s a need to ping ask for everything. There are other people in here who can answer just as well. For this specific example, have a look at the CSV log. An excerpt:

ts_epoch,ts,offset,step,score,monitor_id,monitor_name,leap,error
1731689974,2024-11-15 16:59:34,,1,6.041157246,24,recentmedian,,
1731689974,2024-11-15 16:59:34,0.00029242,1,6.041157246,41,belgg2-19sfa9p,,
1731689911,2024-11-15 16:58:31,,-5,5.306481361,24,recentmedian,,network: i/o timeout
1731689911,2024-11-15 16:58:31,0.000367793,1,8.395638466,59,descn2-19sfa9p,,
1731689857,2024-11-15 16:57:37,,-5,5.306481361,24,recentmedian,,network: i/o timeout
1731689857,2024-11-15 16:57:37,,-5,1.176772952,25,inblr1-1a6a7hp,,network: i/o timeout
1731689786,2024-11-15 16:56:26,,-5,5.306481361,24,recentmedian,,network: i/o timeout
1731689786,2024-11-15 16:56:26,,-5,5.220339298,67,fihel1-2trgvm8,,network: i/o timeout
1731689781,2024-11-15 16:56:21,,1,7.784883022,24,recentmedian,,
1731689781,2024-11-15 16:56:21,,-5,7.935919285,32,fihel1-z4ytm9,,network: i/o timeout
1731689676,2024-11-15 16:54:36,,1,7.784883022,24,recentmedian,,
1731689676,2024-11-15 16:54:36,,-5,1.96946907,21,deksf1-1a6a7hp,,network: i/o timeout
1731689661,2024-11-15 16:54:21,,1,7.784883022,24,recentmedian,,
1731689661,2024-11-15 16:54:21,,-5,5.306481361,41,belgg2-19sfa9p,,network: i/o timeout
1731689577,2024-11-15 16:52:57,,1,**10.758252144**,24,recentmedian,,

… meaning that only about 7 minutes earlier the score was above 10 and the server was included in the pool. There’s also some delay before the scores get propagated to the actual pool DNS entries.
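To trace this yourself, it can help to filter the CSV for just the recentmedian rows, which carry the aggregate score that decides pool inclusion. A sketch, assuming the /log endpoint serves the CSV format excerpted above:

# print timestamp and aggregate score only (columns 2 and 5)
curl -s https://www.ntppool.org/scores/51.250.9.134/log | awk -F, '$7 == "recentmedian" { print $2, $5 }'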

2 Likes

If you would care to look more closely yourself, you’d see that the server is periodically rotating into and out of the pool as the score goes up and down. Just as the system is supposed to work.

You still seem to cling to the notion that the response to RDNS queries has any relevance in this, either determining the country that a server resides in (as in a previous thread), or whether it is alive/healthy/… (as here).

RDNS has no bearing whatsoever on any of those aspects.

2 Likes

k = kilo = ×1000, so 1-2 kpps = 1000-2000 queries per second. That’s very, very low; I think an 18-year-old Core 2 Duo could serve that at 5% CPU. Right now we get 1.5 million requests per second from the pool. Even a Xeon(R) E-2246 (6 cores, 12 threads @ 3.60 GHz, running several Chrony daemons for multithreading) goes to full 100% load and only manages to respond to 2/3 of the requests.

2 Likes

OK, sorry, I didn’t look at the numbers in detail, I just reacted to the general message that hijacking the DNS entries and redirecting the NTP traffic to your servers wouldn’t be a problem for you. But that would then probably only work if it is just your own customers.

Yeah, if the pool doesn’t help bootstrap out of this capacity crunch, you’re maybe even lucky that you can offer your customers that way out.

As for analyzing the traffic, here’s a useful one-liner that works on Rocky Linux 9; other operating systems and versions may vary. Adjust the “10000” packet count limit as you see fit.

tcpdump -n -c 10000 inbound and ip and udp and dst port 123 | cut -d" " -f3 | cut -d. -f1-4 | sort | uniq -c | sort -rn | head

3 Likes

You are right, my mistake.
Didn’t think of that.

Loads of stuff on my mind.

Should have checked with ntpdate.
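E.g., a query-only check that doesn’t touch the local clock:

# query the server without setting the clock
ntpdate -q 51.250.9.134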

1 Like

I’ll possibly try this when I get back into the pool again, thanks. :metal: