Collapse of Russia country zone

Bas · November 17, 2024, 7:20pm

Have you considered they are blocking the monitors of the pool?

And as such the number of good servers drops, while the rest end up with all the requests?

If there is a firewall in place, maybe you should point the monitoring-system out to them and tell them it’s only there to keep GOOD NTP-servers online.

But if they block us, they are hurting only their own NTP-systems by doing so.

I wonder: Do they block monitors? Do they block NTP traffic?

I do not know the answer…

Bas · November 17, 2024, 7:32pm

Same issue. I can only speak for my monitor: it’s unlimited and unregulated, and it will check any NTP server it’s asked to check.

The question should be, is China blocking our monitors?
If so, can we ask them to stop blocking us and explain what the monitors do?

Sorry, but I doubt the problem is the monitors.

When I check by hand:

bas@workstation:~$ ntpdate 139.199.215.251
ntpdig: no eligible servers

It’s not my side blocking access. Same for Russia in my opinion.
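For anyone who wants to repeat such a manual check without ntpdate, here is a minimal Python sketch of a single SNTP query with a timeout. The function names are made up for illustration, and the packet layout follows my reading of the NTPv4 header (48 bytes, transmit timestamp in the last 8 bytes). Note that a timeout only says no reply arrived in time; it cannot tell blocking apart from overload or rate limiting on the server side.

```python
import socket
import struct

NTP_PORT = 123
NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def build_request() -> bytes:
    """Return a 48-byte client request (LI=0, VN=4, Mode=3), rest zeroed."""
    first_byte = (0 << 6) | (4 << 3) | 3
    return bytes([first_byte]) + bytes(47)


def parse_response(data: bytes) -> dict:
    """Extract mode, stratum and transmit time (Unix seconds) from a reply."""
    if len(data) < 48:
        raise ValueError("short NTP packet")
    mode = data[0] & 0x07
    stratum = data[1]
    seconds, fraction = struct.unpack("!II", data[40:48])
    unix_time = seconds - NTP_EPOCH_OFFSET + fraction / 2**32
    return {"mode": mode, "stratum": stratum, "transmit_unix": unix_time}


def probe(host: str, timeout: float = 2.0):
    """Send one request; return the parsed reply, or None if nothing usable came back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_request(), (host, NTP_PORT))
            data, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts and unreachable-network errors
            return None
    try:
        return parse_response(data)
    except ValueError:
        return None

# Example (needs network access):
#   print(probe("139.199.215.251"))
```

Running the probe several times, spread over a few minutes and from more than one network, gives a better picture than a single attempt.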

PoolMUC · November 17, 2024, 7:50pm

Sorry I misunderstood. That quote also contained a part of a message of mine, so I inferred that was also subject of your comment.

Not sure many more are needed. Some more diversity would be welcome, e.g., also in the zone with the single largest estimated number of Internet users, or in that region in general.

E.g., some fellow timekeepers in China had enlisted a university’s resources to set up a monitor. But their requests for help with setting it up, both here in this forum and, as I understand it, via more direct channels with those responsible for the pool, apparently never got anywhere, so I guess they eventually gave up.

If that change you propose would make it easier to set up monitors without going through a single bottleneck, that would indeed be helpful and welcome. My VPSs in Germany are underutilized as servers, but instead of decommissioning them when the first fixed-term subscription is up, at least one of them might make a well-connected monitor (though that wouldn’t really boost the diversity aspect I highlight above, just adding another city in Germany as monitor location).

There’s obviously always things to optimize, especially if that means that it subsequently frees resources to do other important stuff. So I hope this will precipitate work on removing the lock-in of clients into their (assumed) respective country zones in order to spread the load wider especially in under-served zones like the ones this thread is about, with the titular consequences for the zone.

Sure, with that backdrop, the China zone will likely never have the vibrant assortment of living-room hosted servers contributing to the pool as there is in Europe or the USA. But the parallel thread shows that people are still willing to set up servers in data-centers with capacity similar to living-room hosted servers elsewhere, but structural issues with the pool prevent that. Because as it is right now, the entry barrier for such small servers is just too high, much higher than, e.g., in Europe or the USA. And as you say, there’s only so much incentive for big players to add on top of the resources they also contribute to make that chicken-and-egg problem go away.

I thought there previously had been somewhat broad consensus in this forum that the tight lock-in of clients to servers only from their own country zone (as best as that can be determined by the pool) is “bad”. E.g., a while back, when that was discussed I think in the context of the paper by @giovane, @marco.davids, et al., or around that time, Ask had mentioned that he is preparing something like that. Because beyond the threats discussed in that paper, this lock-in is causing multiple different but real-life issues, documented time and time again on this forum, but then subsiding for a while at least because I guess people give up, not getting anywhere. That is not specific to the China zone, or the Russia zone, or any other. But it would be something that the pool can do to improve the situation, also in China and Russia, and I understood endeavored to do so at least in general. It now just needs to happen some time…

And there’s been no evidence that I am aware of that either of those conjectures would be true, but some evidence that both of them most likely are not.

As I wrote before, why is it so hard to grasp that when an estimated 132,000,000 Internet users want to get time from fewer than 10 active servers, that is not going to work for smaller servers like those of the operators who contribute to this thread?

I understand to some extent how difficult it might be if not experienced oneself, as I hardly get 2 Mbit/s traffic at a 3 Gbit netspeed setting on my servers in Germany myself. I guess the situation might be similar in the USA.

But when I set up my first server in Singapore, I realized what it really means to have a server in an under-served zone, because I am getting peaks above 2Mbit/s already at the 512 kbit netspeed setting (and as that is the lowest setting currently available, I had to remove that server from the pool because of that the other day). In South Korea, it is similar. So I can only imagine what the situation in China, or now in Russia must be.
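To illustrate why even the lowest netspeed setting can attract a lot of traffic in a thin zone: as far as I understand it, the pool uses netspeed as a relative weight when choosing servers for DNS answers, not as a bandwidth cap. A rough Python sketch follows. The proportional-weight model is an assumption, and both zone compositions are made-up numbers for illustration only.

```python
def traffic_share(netspeeds_kbit, index):
    """Fraction of a zone's DNS answers expected to point at one server,
    assuming the selection probability is proportional to netspeed."""
    return netspeeds_kbit[index] / sum(netspeeds_kbit)


# Hypothetical zones; "our" server (512 kbit setting) is the last entry.
well_served = [1_000_000] * 1000 + [512]   # 1000 other servers at 1 Gbit
under_served = [1_000_000] * 9 + [512]     # 9 other servers at 1 Gbit

share_a = traffic_share(well_served, -1)
share_b = traffic_share(under_served, -1)
ratio = share_b / share_a

print(f"share in well-served zone:  {share_a:.2e}")
print(f"share in under-served zone: {share_b:.2e}")
print(f"ratio: {ratio:.0f}x")
```

With the same setting, the share grows roughly in proportion to how few other servers are left, and the absolute traffic additionally scales with the number of clients in the zone.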

I never said it was. In fact, I chose this example in that context precisely because of that, because like with the other examples, I had the impression from related contributions, and based on who contributed, that it might get some traction. And despite it being so challenging, if its implementation were to help with the issues discussed, e.g., in this thread, I’d be happy to see it implemented. But as I see similar challenges to what you mention, my fear is that, if it were implemented, it could further delay implementation of some fixes, or at least some mitigations, to the main problem: clients being locked to servers of their own zone only.

But obviously, what is being worked on, or isn’t, is rather opaque to me at least, the last few features that came out were nothing I had on the radar before they came out. If you happen to have better insights, I’d be happy to hear, and many others as well I guess. I guess part of the frustration with all these issues people are having is because there is pretty much no communication on what is being worked on, and what the plans are, to at least get a perspective as to when things might improve.

PoolMUC · November 17, 2024, 7:56pm

This question has been answered many times:

NO

Welcome to the club! May I respectfully refer you to @avij’s previous post, and invite you to reread, and take to heart. It is still applicable in this current thread.

Bas · November 17, 2024, 8:17pm

Sure? I can not get time from the IP given.
How can you be sure they don’t block us? I do not know, my command gives no answer. The monitor will answer the same.
If NO, how come I can’t reach it by hand?

CGNAT is a problem, I have it too.
To counter that you can either block those IPs, rate-limit them, or simply contact those abusive CGNAT providers and tell them to intercept port 123 and run their own servers.

To be blunt, I hate lazy CGNAT ISPs too, and trust me, my servers are being attacked by them too.

I do put rate limits on those idiots. Chrony has a good directive for it: ratelimit

PoolMUC · November 17, 2024, 8:20pm

May I respectfully invite you to re-read, and take to heart, what @avij wrote previously:

Bas · November 17, 2024, 8:34pm

I have read it…but do you understand his problem?

He’s having a major load on his server, I get it.

All the rest too.

So why can’t I get time from his server when I test? As that is what the monitor does.

How come the number of ntp-servers in Russia dropped?

Do they drop monitor requests? You say NO, I’m not so sure.

PoolMUC · November 17, 2024, 8:49pm

Because it may be offline/having strict rate limiting in place in an attempt to somehow deal with the issue?

If you look at the offset/score graph, you’ll see that it was active during earlier periods, and apparently is not reachable anymore just somewhat recently.

Same with other servers, once removed from the pool, the scores recover.

So it is unlikely that there is some broad blocking of monitors. The same goes for China, by the way.

That’s the proverbial million dollar question.

When looking at the data that is shared as part of this thread, you can see that it kind of started when a big chunk of capacity was removed from the zone, for whatever reason, but the number of servers did not drop in proportion to that.

The hypothesis is that the traffic that that supposedly “big” server was getting originally was now redistributed among all the other servers still in the pool.

As the zone was possibly on edge already as it was, that additional load overloaded some other servers, so they dropped, first just in score, then more and more actually removed from the pool, perhaps people got fed up by too much traffic, as participants in this thread mention they are considering, or know of others who did.

So as servers dropped like flies from the pool, the situation now is that some 10 servers serve an estimated 132,000,000 users (exact numbers not important, just the relation, e.g., to other zones).
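To put that ratio into rough numbers, here is a back-of-the-envelope sketch in Python. Everything beyond the 132,000,000 users and 10 servers is an assumption for illustration: that each user is one client polling every 1024 s, that load is split evenly, and that a request is 76 bytes at the IP level (48-byte NTP payload plus UDP and IPv4 headers). Real clients often poll more frequently, and the pool does not split load evenly.

```python
def per_server_load(users, servers, poll_interval_s=1024, packet_bytes=76):
    """Return (users per server, requests/s, Mbit/s in one direction)."""
    users_per_server = users / servers
    requests_per_s = users_per_server / poll_interval_s
    mbit_per_s = requests_per_s * packet_bytes * 8 / 1e6
    return users_per_server, requests_per_s, mbit_per_s


users_each, req_s, mbit_s = per_server_load(132_000_000, 10)
print(f"{users_each:,.0f} users per server")
print(f"{req_s:,.0f} requests/s per server")
print(f"{mbit_s:.2f} Mbit/s inbound per server (replies add about the same outbound)")
```

Even under these optimistic assumptions, the result is several times more than the 2 Mbit/s peaks reported for small servers elsewhere in this thread.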

DDoS attacks were considered as well, but given that the traffic load seems responsive to a server being removed from the pool, either by score dropping below 10, or being put in “monitoring only” mode, that seems unlikely (though, as always, not impossible).

Bas · November 17, 2024, 8:59pm

I do know Putin said there would be a Russian Internet/Firewall.

No, I’m not going into politics, but it might explain why they drop our monitors.

Same for China.

I also do not know if all servers dropped, they could still be working.

Has anyone asked Putin why it happened? Does anyone have his email?

PoolMUC · November 17, 2024, 9:08pm

Please, there are a lot of conjectures and rumors, one could almost say conspiracy theories.

But so far, there is no data to support that.

On the other hand, there is data/evidence to the effect that there is no filtering or anything of that sort to the extent it would cause the problems discussed here and the parallel thread.

If you have actual data, or other evidence that can be looked at, please share. But peddling hearsay and conjectures does not help.

summer76527 · November 18, 2024, 1:54am

I don’t think so. China’s firewall operates on a blacklist mechanism. It only blocks websites that are on the blacklist (such as Google, YouTube, etc.) or shuts down high-risk traffic (such as vless, a VPN protocol, etc.). It’s clear that the NTP protocol is not within the scope of what is blocked. I believe there might be other reasons for the monitor’s inability to connect. As for the undersea cables between China and the US, I found that there are two, one with a bandwidth of 80 Gbps and the other with 5.12 Tbps (the information I found might be outdated; if anyone has more recent news, please let me know). However, China has too many users, and during peak afternoon hours there might be data transmission far exceeding the capacity of the cables, and UDP from the monitor might be discarded (note that this is just my speculation, but I don’t think it has a direct relationship with network blocking).

PoolMUC · November 18, 2024, 7:24am

How come, then, that when the server is not in the pool, its score nicely recovers, and stays well above 10?

The graph is not perfect, and I am not saying there aren’t issues with connectivity at all, and things couldn’t be better. But if the speculation were true, then the score would need to actually go up and down noticeably throughout the day even for a server that is in “monitor only” mode.

When, however, local traffic is added, the problems start, in your case, because it’s easily too much of it.

But also, e.g., in case of Tencent. From probes within China, there is about 10% packet loss. I am not talking about the monitors, but actual clients inside China. How do you figure that international connections being overloaded cause traffic inside China to be affected that way? Or, the other way round, how should monitors within China help if there is 10% packet loss for traffic within China?

And I am not saying connectivity within China is bad, pings to one of the Tencent servers show pretty much no packet loss.

So my personal conclusion is that even the Tencent servers are overloaded, until further evidence rather than speculation is presented that hints in other directions. Or a plausible line of reasoning based on existing evidence but leading to a different conclusion is presented. Or flaws in the above line of reasoning, or interpretation of current evidence (which is far from being definitive proof).

And repeating such speculation also doesn’t help to get the problem solved, because it blocks the view to what I’ve seen it boil down to here and elsewhere:

Adding more capacity is needed, and as that presents a chicken-and-egg problem, and faces structural obstacles in some places, spreading the current load more widely is needed, because this lock-in of clients to servers of their own zone is not only causing issues here, but similar, and other issues elsewhere as well.

As others have pointed out as well, blaming the monitors against all current evidence is not helpful. I am not saying they are perfect, and having more diversity would be welcome. But they are not the decisive issue, either in China or in Russia (based on current evidence).

And the gradual relaxation/removal of the lock-in has already been pretty much agreed upon in this forum because it is “bad” for many reasons. Now, it just needs to be implemented - contingent, unfortunately, on the resolution of some non-trivial challenges in the process.

timz · November 18, 2024, 9:49am

On Saturday I saw over 1600 packets/second from 213.183.x.y.
I think the rate could be higher if the bandwidth were not exhausted by other ntp-clients.
I submitted an abuse report to the abuser’s ISP but have still not received a reply.

PoolMUC · November 18, 2024, 10:33am

Thanks for sharing! Very interesting. And while NTP itself then seems not in scope, it is likely that just inspecting a large number of packets for some perceived “risk” indirectly may affect latency-sensitive NTP traffic as well to contribute to the wide spread and variability in offsets seen in the graph. Though even then, some monitors seem to see a rather steady offset despite that, perhaps because of shorter network paths for them.

Bas · November 18, 2024, 3:35pm

What if the system auto-blacklists the monitors? That would prevent servers behind the firewall from being checked and listed in the pool.
We would then score all those NTP servers as unreachable, ergo 0 points.

The monitors do ‘hit’ their NTP servers, always with the same source and target, so they could see that as an ‘attack’.

I mean, I use fail2ban myself to counter attacks on my servers. It is very effective, but it depends on logging and counters: when you hit the counter too often, the door is closed for a period of time.
If they use that mechanism in their firewall, it would explain the ‘packet loss’.

I fail to see how congestion would cause such major drops. It will slow things down, sure, but it won’t block them. Of course UDP has no resend option… but it should not block every attempt.

My 2 cents.

kkursor · November 18, 2024, 7:54pm

the same for me in Moscow.

It’s not as strict as people tend to think. It blocks only foreign propaganda that says that we have all failed as a nation and should beg for forgiveness all our lives. Ordinary foreign sites (except the UA zone) work as they used to. Maybe someday Cheburnet will evolve… but not now.

That’s funny, but writing to Putin is the BEST way to solve a problem that does not solve itself.
For example, you write to the grid company with a complaint about the electricity at your country house. They start playing football with you, passing you back and forth. That may go on forever, until you write to Putin.
The President’s administration forwards your complaint to the same grid company, but now they start running around you, calling you 30 times a day to get your positive feedback, and so on.

Could you please share full IP or whois data?

I am trying to bring up a Chrony CT on my internal Proxmox server. Maybe NATting queries to it will help me survive, as my Mikrotik dies under load. Proxmox has a hard limit per container; I’ve set it to 50M/sec.

davehart · November 18, 2024, 9:41pm

PoolMUC:

davehart:

one or both of two issues brought up in this thread

And there’s been no evidence that I am aware of that either of those conjectures would be true, but some evidence that both of them most likely are not.

As I wrote before, why is it so hard to grasp that when an estimated 132,000,000 Internet users want to get time from fewer than 10 active servers, that is not going to work for smaller servers like those of the operators who contribute to this thread?

I understand to some extent how difficult it might be if not experienced oneself, as I hardly get 2 Mbit/s traffic at a 3 Gbit netspeed setting on my servers in Germany myself. I guess the situation might be similar in the USA.

But when I set up my first server in Singapore, I realized what it really means to have a server in an under-served zone, because I am getting peaks above 2Mbit/s already at the 512 kbit netspeed setting (and as that is the lowest setting currently available, I had to remove that server from the pool because of that the other day). In South Korea, it is similar. So I can only imagine what the situation in China, or now in Russia must be.

I notice you claim there is evidence my two conjectures are most likely wrong, yet point to no such evidence. I should just dismiss your claim, but I will point out that apuls provided evidence which, while it does not make clear whether the server in question is in the .ru zone, backs up (assuming it is) the idea of a DoS of that zone causing servers to be kicked out by monitoring. The link is immediately below; don’t be confused, as it looks like a quote of the message without line breaks. Click on it to jump to the message.

And I never said you said it was. I said “as you might think” and you either suffered a language barrier or decided to put words in my mouth and respond at length to something I didn’t say.

[EDITED: Removed a few comments because I was confused regarding the person I was responding to. Apologies to those who participate via email and may not see my edits.]

davehart · November 18, 2024, 9:54pm

That fits perfectly with a DDoS against a *.ru.pool.ntp.org DNS name: When a server is attacked it gets overloaded and the monitor removes it from the zone for failing to reliably respond. Once it’s removed from the zone, the DDoS attacks against that zone no longer reach it and it recovers.

kkursor · November 18, 2024, 10:26pm

I brought up my NTP service again. Not on the border Mikrotik, but on my home server in an LXC container. Router rate limiting is configured as 10 p/s per IP with a burst of 20 p/s. The LXC container is rate-limited at 50 Mbps. Chrony is configured as @Bas recommended.

root@ntp:~# cat /etc/chrony/conf.d/ratelimit.conf
allow all
ratelimit interval 3 burst 8 leak 4
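As I read the chrony.conf documentation, interval 3 means a minimum average interval of 2^3 = 8 seconds between responses to one client address, burst 8 allows up to 8 responses in a burst, and leak 4 randomly lets through roughly one in 2^4 = 16 requests that would otherwise be dropped. The Python sketch below is a simplified token-bucket model of that behaviour for a single client. It is not chrony's actual code, and the class name is made up.

```python
import random


class RateLimiter:
    """Simplified per-client model of 'ratelimit interval I burst B leak L'."""

    def __init__(self, interval=3, burst=8, leak=4, rng=None):
        self.seconds_per_token = 2.0 ** interval  # interval is log2 of seconds
        self.burst = burst                        # bucket capacity
        self.leak_prob = 0.5 ** leak              # chance to answer when over limit
        self.tokens = float(burst)
        self.last = None
        self.rng = rng or random.Random()

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time 'now' gets a response."""
        if self.last is not None:
            refill = (now - self.last) / self.seconds_per_token
            self.tokens = min(self.burst, self.tokens + refill)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        # Over the limit: respond only occasionally, so that a spoofed
        # source address cannot completely cut off a legitimate client.
        return self.rng.random() < self.leak_prob


limiter = RateLimiter(interval=3, burst=8, leak=4)
answered = sum(limiter.allow(t * 0.1) for t in range(1000))  # 10 req/s for 100 s
print(f"answered {answered} of 1000 requests from one abusive client")
```

A well-behaved client polling every 64 s or slower never runs out of tokens in this model, while a client sending 10 requests per second gets only a small fraction answered.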

Now I am able to capture traffic with tcpdump, and it’s less likely that the Mikrotik will fail to handle packets: it doesn’t need to think about them anymore, just forward them to the internal network and forget.

davehart · November 18, 2024, 10:37pm

In fact, it needs to remember the mapping from external IP/port to internal IP/port so that it can forward the response back through. It will need that mapping only once but will probably remember it for 30 seconds because the handling is the same for all UDP traffic. This means the router will consume memory for the last 30 seconds of NTP traffic and consume CPU adding and removing and looking up those mappings.
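The size of that mapping table can be estimated as the rate of new flows times the timeout. A small Python sketch, where the 30-second timeout is taken from the post above, while the request rate and the bytes per entry are placeholder assumptions that differ between routers:

```python
def nat_table_estimate(new_flows_per_s, timeout_s=30, bytes_per_entry=320):
    """Return (concurrent NAT mappings, approximate memory in MB).

    Assumes every request comes from a distinct source IP/port pair, so each
    one creates a mapping that lives for timeout_s seconds.
    """
    entries = new_flows_per_s * timeout_s
    return entries, entries * bytes_per_entry / 1e6


entries, megabytes = nat_table_estimate(5000)  # hypothetical 5000 requests/s
print(f"{entries:,} mappings, about {megabytes:.0f} MB")
```

Where the router supports it, a shorter UDP timeout for port 123 or exempting NTP from connection tracking shrinks this table.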
