Collapse of Russia country zone

Hope you’re making progress with restoring your Internet connection and getting your monitors back up and running properly.

Well so far it’s stable, but just running IPv4.

It has not dropped or affected the monitor, nor my own chrony-servers.

It’s been running for 2 days with the ‘old’ firmware 07.62.

Hopefully the problem is with the firmware, as the mess started when AVM released 07.8x and higher.

No matter which modem I use.

If it’s still good by Monday, the next step is activating IPv6 again.

1 Like

I thought chronyd itself is multi-threaded so one instance is enough, but I might be wrong.

Take a look at GitHub - mlichvar/rsntp: High-performance NTP server written in Rust
Only one chrony instance is needed.

tcpdump -n -c 10000 ip and udp and dst port 123 | cut -d" " -f3 | cut -d. -f1-4 | sort | uniq -c | sort -rn | head
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
10000 packets captured
10853 packets received by filter
13 packets dropped by kernel
   4438 46.29.197.7
      8 109.252.180.248
      7 85.140.117.56
      6 37.190.52.246
      6 37.190.52.221
      6 37.190.52.122
      5 79.139.151.131
      5 217.66.159.69
      4 95.73.157.119
      4 85.140.4.138 

A short time later:

time tcpdump -n -c 20000 ip and udp and dst port 123 | cut -d" " -f3 | cut -d. -f1-4 | sort | uniq -c | sort -rn | head
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
20000 packets captured
20901 packets received by filter
0 packets dropped by kernel
     23 213.87.144.48
     18 95.216.147.216
     16 213.87.144.114
     15 91.195.204.104
     15 213.87.139.39
     14 91.195.204.66
     14 195.191.175.244
     13 119.143.46.115
     12 46.138.163.167
     12 213.87.139.106

real    0m8.701s
user    0m0.232s
sys     0m0.032s

Try that one also:

I am using its output to block the flooding IP addresses. It reduces the load drastically.
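For illustration only (a sketch, not the actual script referenced above; the 100-packet threshold and the use of iptables are assumptions), something along these lines captures a sample of inbound NTP requests, picks the heaviest sources, and drops further queries from them:

tcpdump -n -c 20000 udp and dst port 123 2>/dev/null \
  | cut -d" " -f3 | cut -d. -f1-4 | sort | uniq -c | sort -rn \
  | awk '$1 > 100 {print $2}' \
  | while read -r ip; do
      # drop further NTP queries from this heavy source
      iptables -A INPUT -s "$ip" -p udp --dport 123 -j DROP
    done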

3 Likes

Here are some additional monitoring statistics from a Finnish server, in case you find them useful:
https://hirvi.miuku.net/stats/ping-46.188.16.150.html
https://hirvi.miuku.net/stats/ntppacketloss-46.188.16.150.html
I started the monitoring just recently, so it may take a while to get meaningful results.

Huh? Did you do rate limiting at the router? It looks to me like your connection isn’t stable and simply drops.

Have you ‘limited’ the rate on the DNS side, in the ntppool.org config tool?

You should limit it there, not in your router. Judging from your page, the DNS limiter (netspeed) is set to the maximum on the pool page.

If you just want to serve e.g. Russia, then set it to the minimum. The rate will still be high, but normally it won’t overload your connection.

You should not limit it in your own router; that’s a bad idea.

Bas, I understand this topic is already fairly long, but I would still suggest reading it from the beginning. He gets 250-300 Mbps of NTP traffic with the lowest 512 kbps setting. This is a problem for all of Russia at the moment. Many Russian servers are currently struggling under the extra load. Providing even some rate-limited service is better than entirely dropping the traffic, because it may ease the burden from some of the other Russian NTP servers in the pool.

5 Likes

I understand, but limiting at the router will give bad results in scoring.

Using chrony with ratelimit is a better way, as it only drops ‘abusers’ and won’t block the monitors, since they don’t query that often.

Typical CGNAT will hit you hard. Monitors do not.

Ergo, use Chrony and set ratelimit, and see what happens.

In my opinion a better option.
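For reference, a minimal chrony.conf sketch of such rate limiting (the values are assumptions to be tuned, not settings taken from this thread):

# interval 1: addresses averaging more than one query per 2 seconds get limited
# burst 8: short bursts are still answered
# leak 2: about 1 in 4 limited requests still gets a reply, so occasional
#         monitor queries are not cut off completely
ratelimit interval 1 burst 8 leak 2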

Your doubling down on your statements suggests otherwise.

Maybe as a next step our Russian NTP server operators could provide us some more detailed data about the traffic. The overall query rate isn’t that useful. Most importantly: does the majority of the traffic seem to come from a small set of IP addresses, or is the traffic spread out among a larger set of IP addresses? Preferably as measured over at least a few minutes, not seconds. This way we wouldn’t need to guess about the advice we give.

If there are specific problematic IP addresses, contacting the IP address owner’s abuse contact might be useful (see whois). It is also possible that the source IP address is forged, but that’s hard to prove.
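For example, one rough way to gather that kind of data (a sketch; eth0 and the 5-minute window are assumptions):

# capture NTP request sources for 5 minutes, then summarize
timeout 300 tcpdump -n -i eth0 udp and dst port 123 2>/dev/null \
  | cut -d" " -f3 | cut -d. -f1-4 | sort | uniq -c | sort -rn > ntp-sources.txt
head ntp-sources.txt        # the heaviest source addresses
wc -l < ntp-sources.txt     # how many distinct source addresses in total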

2 Likes

I was trying to find my problem, but it turned out to be a Fritzbox firmware problem.

A totally different matter. It’s hard to find problems when the code is closed.

I do code myself, and have solved problems in open source before, as you can compile and test again.

AVM is being stupid and their latest firmwares are crap. As such, I tried whatever I could to find it.

Found it…next :slight_smile:

1 Like

Yes, it’s not rocket science. It’s not been spelled out here before because of the poor practice of some server operators of configuring their setup to treat the monitors differently than all other traffic. That’s counterproductive as the intent is for the monitors to see exactly the same service as any client would see. If, for example, a server operator exempts the monitors from their firewall rule to reject all traffic from outside the US, the pool would continue to direct clients from around the world to the server which works only for US IP addresses. You might imagine other scenarios where special treatment of monitors creates broken pool NTP service. Please consider that before helping non-rocket-scientist server operators figure out the monitor addresses, and definitely don’t post monitor IP addresses.

ntpd and ntpsec also have rate limiting controls.

For example from my pool server ntp.conf:

restrict default kod [...]
discard average 1

The discard average value is in powers of two, so this causes ntpd to not respond to IP addresses that average more than one query every 2 seconds, except occasionally replying with a KoD packet requesting the source to reduce its query rate and providing no useful time information. Using a value of 5 would mean each IP address can only query once every 32 seconds and get a useful reply.
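As a sketch of what a stricter variant could look like (the restrict flags here are illustrative assumptions, not the elided ones from the config above):

# 'limited' + 'kod' enable rate limiting with occasional KoD replies;
# 'average 5' allows a useful reply at most every 2^5 = 32 seconds per address,
# 'minimum 2' requires at least 2 seconds between any two packets from one address
restrict default kod limited nomodify noquery notrap
discard average 5 minimum 2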

1 Like

As @Bas pointed out, in many cases it’s better to rate-limit in the NTP daemon rather than in the router, so monitors and normal, low-volume NTP clients are not affected. However, if your NTP server is on a private network behind a NAT, or behind a stateful firewall, the flood of traffic may be overwhelming the NAT/firewall’s connection-tracking capacity (packets per second, table size, and processing time) even while your server itself could handle more NTP traffic.

In particular, rate-limiting in the NTP daemon means the daemon has to handle each packet just to reject it, which takes much of the same resources as responding does, and your router/NAT/firewall still has to be able to handle the flood. When it’s an intentional DoS flood, rate limiting in the router (or better, at your ISP’s border routers) is likely wise.
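If the bottleneck is connection tracking rather than raw bandwidth, one common mitigation on a Linux-based NAT/firewall (a sketch using iptables, not something from this thread) is to exempt NTP from connection tracking entirely:

# skip connection tracking for NTP so the flood cannot fill the conntrack table
# (newer iptables releases spell this '-j CT --notrack')
iptables -t raw -A PREROUTING -p udp --dport 123 -j NOTRACK
iptables -t raw -A OUTPUT -p udp --sport 123 -j NOTRACK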

Keep in mind that even without NAT or other connection tracking, routers that can keep up with, say, 1 Gbps of typical traffic may not be able to handle 1 Gbps of NTP traffic due to the tiny packets and therefore higher packets per second. You could say PPS, but that can be confused with the pulse-per-second from a GPS or other frequency standard used by most stratum-1 NTP servers. P/S might be better in an NTP discussion context.
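As a rough illustration of why (assuming a 48-byte NTP payload plus about 42 bytes of UDP/IP/Ethernet overhead, i.e. roughly 90 bytes per request on the wire):

# 1 Gbps of minimum-size NTP requests is on the order of a million packets per second
echo $(( 1000000000 / (90 * 8) ))   # prints 1388888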

My pool server in the us and north-america zones is set to a pool netspeed of 250 Mbps, but that’s just a relative setting compared to all the other servers in a given zone. In practice, it’s getting about 1 Mbps of NTP traffic, around 12,500 packets per second. It’s running under Windows 11 on a circa-2010 HP Z400 workstation with a Xeon W3520 @ 2.66 GHz base speed and 4 cores/8 hyperthreads. It’s built with a 45 nm process and has a CPUmark around 3000, where AMD’s current high-end Threadripper CPUs are 33,000 - 65,000. I haven’t spent too much time micro-optimizing it because my focus is on maintaining and improving ntpd, not running the fastest pool server, but when I push up over ~2.5 Mbps, pool monitoring starts to show degradation.

@davehart Thank you, that was the config option I hadn’t found tonight :slight_smile:
I’ve added discard average 1 to my config now.

@NTPman I’m using the script you mention and it’s working now, but it hasn’t reported anything so far.

What I can see right now is that if the server score is >10, the request rate goes through the roof. If the score goes below 10, the traffic sinks immediately (OK, it took about 5 minutes) down to what looks like noise-floor level.

The problem in the past with the China zone was different: even when a server was in the pool and the monitors then kicked it out, the traffic was still there.

If desired, I can provide a tcpdump.

Indeed I did, and indeed nobody did, for a reason.

But I think that when an entire country zone has collapsed, and people are hurting because their systems get overwhelmed, giving pointers to help them when the pool is failing them, in a desperate attempt to stabilize the zone, is legitimate.

I’d prefer that the pool fixed this kind of issue so the pool actually works as intended everywhere, or added IPv6 to more zones, rather than fine-tuning aspects such as trying to figure out whether an IPv4 address and an IPv6 address are actually the same server so that only one of them is handed out to clients, when in some places neither even works properly. Or considering fine-tuning the monitoring so that monitors don’t monitor servers in the same network region, while servers are being overrun by traffic and dropping like flies from the pool in some areas. Or optimizing server allocation to clients by trying to figure out the actual RTT from a client area to a server, while clients in some areas would be happy to get service at all, and servers there would be happy if they could provide good service instead of spending their efforts fighting a desperate, losing battle against being overrun by traffic.

Again, in case that hasn’t registered yet: entire zones are collapsing, or continuously teetering on the brink of collapse. As long as those things are not addressed, people who nonetheless believe in the project will try to make do with what they have and can do, however desperate, and probably fruitless in the end, as they obviously cannot overcome the structural issues of the current pool system.

1 Like

That’s interesting, because normally with a score below 10, a server is removed from the pool’s DNS. Is there perhaps some lingering DNS traffic? Or is there some kind of ‘shadow pool’ active? I’ve seen this before, where hostnames resolve to IP addresses that (partly) overlap with the ones in the NTP pool (like ‘cn.ntp.org.cn’, ‘us.ntp.org.cn’, ‘de.ntp.org.cn’, etc.).

I think that this would not help. I don’t have a dedicated internal NTP server; the MikroTik 4011 replies to queries itself, and unfortunately it has no tunables. I think that if I NATed packets to an internal VM capable of handling such traffic, it would still smoke the router and interrupt its main services. I set the netspeed to 512 kbps and get up to 500 Mbps (!) of NTP traffic alone. That’s wrong, and I must self-protect, as NTP is an auxiliary, volunteer service here, not the main one.

And not limiting at the router results in bandwidth and CPU overload of the router: it starts dropping packets (even useful ones, not only NATed traffic) and gets bad results in scoring, though it does not interrupt its main services.

Thanks. We’ve implemented Russia-only monitoring; it has shown no losses so far.

1 Like