Collapse of Russia country zone

Just for information: the Russia country zone appears to be the latest to have collapsed (based on the number of active servers, yellow line), or to be in the process of collapsing (based on the trend in the number of registered servers, red line).

See symptoms described in this post, but also earlier ones in that thread (which originally was about challenges adding a small server to the China zone).

I configured a netspeed of 512 kbps yesterday and waited for my score to rise to 10.
Once it reached 10, I got 250-300 Mbps of NTP traffic in spite of that setting, and my Mikrotik 4011 (a very powerful device!) started smoking and dropping packets. =(
I am sorry to suspend my servers’ presence in the Pool. I need uninterrupted service on my home internet connection (we have a barrier in the yard, and if the internet connection fails, nobody can enter or leave the yard). Let’s see how and when the situation resolves. I suspect that the servers leaving the pool are not actually leaving, but that there is a problem with server monitoring (cross-border communications are about to be closed off by iron curtains on both sides, and the monitoring servers simply cannot reach our internal servers, even though they are running normally and keeping correct time inside Russia).

This claim keeps coming up for the China country zone and other zones as well, but there is so far no data to support that.

On the other hand, at least in the case of the China country zone, there is data that suggests that the monitors, or international connections, are not the issue, at least not the primary one. And your description above suggests the same to be true for the Russia zone.

When you get 250-300 Mbps of NTP traffic, and your device, or line, or some other equipment starts dropping packets, that is the problem, and the monitors are only detecting that.

The point of the monitors is to check that servers are keeping good time, but even more so, that they are available. If a server starts dropping packets, its availability is degraded, so it will eventually be removed from the pool’s DNS rotation until it recovers enough to no longer drop packets. It will then be added to the DNS rotation again once its score, the metric for availability/reachability, crosses the threshold of 10 again.

So, the root cause of the packet drops must be addressed, e.g., by reducing the load on the server, for instance by distributing the load more widely than the pool currently does.

I am speaking about the reason why the zone lost 95% of its active servers. We live in a more hostile IT world than 20 years ago, when you could simply open up your home WiFi without fearing that somebody would hack something or do something else improper. So many admins I know whose companies are oriented towards Russia only simply block foreign traffic. For example, leroymerlin.ru is not accessible from abroad. This may be the reason why monitoring marks servers as missing (though they continue to work for Russian IP addresses), raising the load on the remaining servers that do not implement GeoIP protection.

Sure, if despite there nowadays being 10+ monitors world-wide, they are all being blocked, that could cause some servers to be dropped without cause (in the sense of providing good service to the local clients that the pool assigns to them). But for that to trigger the chain reaction seen in this case, the stability of the zone likely was somewhat precarious already to begin with.

So again, more capacity is needed to begin with, e.g., by distributing the load more widely than within the client’s own zone only.
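The chain reaction described above can be illustrated with a toy model. This is only a sketch under strong assumptions: the load is split equally among active servers (the real pool weights by netspeed), servers that drop out never return, and the capacities and loads are made-up numbers, not measurements from the Russia zone.

```python
def simulate_cascade(capacities, total_load):
    """Return the number of active servers after each round.

    Toy model: the total load is split equally among the active servers,
    and every server whose share exceeds its capacity drops out.
    """
    active = list(capacities)
    history = [len(active)]
    while active:
        share = total_load / len(active)
        survivors = [c for c in active if c >= share]
        if len(survivors) == len(active):
            break  # stable: every remaining server can carry its share
        active = survivors
        history.append(len(active))
    return history


# Hypothetical zone: six small servers, three medium ones, one large one.
EXAMPLE_CAPACITIES = [10] * 6 + [50] * 3 + [200]

if __name__ == "__main__":
    for load in (90, 120, 220):
        print(load, simulate_cascade(EXAMPLE_CAPACITIES, load))
```

In this model a zone with some headroom stays stable, while a zone that is only slightly overloaded sheds its weakest servers first and can then lose the rest in a few rounds, which is why the starting capacity matters so much.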

If explicit, outright blocking is in play, as you suggest, then a monitor within the zone could help to start overcoming this, if only to assess the level of blocking.

But, e.g., setting up monitors in China has not been fruitful so far, despite local people willing to host one, so I am not hopeful in this case either. But who knows…

If the monitor servers’ addresses are not secret, I could test their availability from several ISPs.

They’re not exactly secret, but unlike in the past, also not on a known, fixed subnet.

Any server registered in the pool will see the probes from the monitors coming in, but if the server was in the pool before, or otherwise had clients, the monitors’ packets will not easily be identifiable. But if there were a server to be registered that never had any NTP clients before, that might be different…

I don’t know why people always tend to blame the monitors. They are generally working fine. Don’t shoot the messenger.

I would like to draw your attention to the red line. It shows the number of registered pool servers. The drop in the red line means that those NTP server operators have actively removed their servers from the pool, for whatever reason. Maybe because of the increase in traffic, as widely speculated earlier.

They work fine if they are not counteracted by something like the Great Firewall of China. We have a similar system.

Looking at the graph above, there are currently around 60 servers that are still registered but are not active. It would be very interesting to know what these servers are so that someone could check if they respond to queries both from Russia and from abroad. If they respond to queries from Russia but not from abroad, then there’s indeed some sort of filtering in place somewhere, which may cause issues.
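Such a check could be scripted with a minimal SNTP-style probe. The sketch below assumes plain NTPv4 over IPv4/UDP port 123 without authentication; the function names are made up for illustration, and it only tells whether a reply arrives, not whether the time is good.

```python
import socket
import struct

NTP_PORT = 123
NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def build_request():
    """Build a minimal 48-byte client request (LI=0, VN=4, Mode=3)."""
    return bytes([0x23]) + bytes(47)


def parse_response(data):
    """Extract mode, stratum and server transmit time (Unix seconds)."""
    if len(data) < 48:
        raise ValueError("short NTP packet")
    mode = data[0] & 0x07
    stratum = data[1]
    seconds, fraction = struct.unpack("!II", data[40:48])
    unix_time = seconds - NTP_EPOCH_OFFSET + fraction / 2**32
    return {"mode": mode, "stratum": stratum, "unix_time": unix_time}


def probe(host, timeout=2.0):
    """Return the parsed reply, or None if no usable answer arrived in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_request(), (host, NTP_PORT))
            data, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts and unreachable hosts
            return None
    try:
        return parse_response(data)
    except ValueError:
        return None
```

Running `probe(address)` for the same address from a host inside Russia and from one abroad, and comparing which calls return `None`, would show whether a server answers only domestic queries.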

Here’s the current score distribution of the 47 servers I’ve seen since yesterday:

      1 -38.5
      1 -11.7
      1 2.3
      1 3.5
      1 5.7
      1 5.9
      1 6.1
      1 6.3
      1 6.5
      1 6.6
      1 6.7
      1 7.3
      2 7.5
      1 7.7
      1 7.9
      2 8.0
      2 8.1
      1 8.3
      1 8.4
      2 8.6
      1 8.7
      4 8.8
      1 9.0
      1 9.4
      2 9.6
      3 9.8
      1 10.1
      2 10.4
      1 17.0
      1 18.2
      1 19.5
      1 19.7
      4 20.0

I think positive scores, with some margin for error obviously, hint at servers that seem reachable by a sufficient number of monitors, but are out of the pool due to overload. That is based on the scoring behavior of overloaded servers that I’ve seen so far, e.g., how far they drop below 10 in case of overload before the score increases again. I think if those servers were affected by widespread blocking of monitors, I wouldn’t have seen them.
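To make the split explicit, the list above can be tallied against the threshold of 10 mentioned earlier in the thread. This is a small sketch; treating a score of exactly 10 as "in rotation" is an assumption, and the category names are made up for illustration.

```python
SCORE_THRESHOLD = 10.0  # threshold from the thread; ">=" is an assumption

# Count/score pairs copied from the list above.
DISTRIBUTION = """
1 -38.5
1 -11.7
1 2.3
1 3.5
1 5.7
1 5.9
1 6.1
1 6.3
1 6.5
1 6.6
1 6.7
1 7.3
2 7.5
1 7.7
1 7.9
2 8.0
2 8.1
1 8.3
1 8.4
2 8.6
1 8.7
4 8.8
1 9.0
1 9.4
2 9.6
3 9.8
1 10.1
2 10.4
1 17.0
1 18.2
1 19.5
1 19.7
4 20.0
"""


def tally(text, threshold=SCORE_THRESHOLD):
    """Group servers by score: in rotation, positive but below, non-positive."""
    counts = {"in_rotation": 0, "positive_below": 0, "non_positive": 0}
    for line in text.split("\n"):
        if not line.strip():
            continue
        count, score = line.split()
        count, score = int(count), float(score)
        if score >= threshold:
            counts["in_rotation"] += count
        elif score > 0:
            counts["positive_below"] += count
        else:
            counts["non_positive"] += count
    return counts


if __name__ == "__main__":
    print(tally(DISTRIBUTION))
```

For the list above this gives 11 servers at or above the threshold, 34 with a positive score below it, and 2 with a negative score, 47 in total.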

Certainly, complete blocking of only a subset of monitors would make the server more susceptible to issues with the non-blocked monitors.

Obviously, we don’t know what is going on with the other 13 servers. My not having seen them could indicate that their score is constantly so low that they are completely out of rotation, e.g., because they block the monitors. Or their “netspeed” setting is so low that they don’t get rotated in often enough to be seen within a ~20-hour time window. But I haven’t looked into that yet.

This is also true; one of the reasons is that NTP is used as a DDoS amplifier in the hacker war.

If GeoIP filtering is enabled on servers, it means that those servers are not in the pool and get no queries from pool users.

Many companies have simply stopped being as open to the Internet as before. Some companies are under sanctions and have cut all ties with foreign projects. At the same time, there are still a number of servers that are ready to serve requests. But we physically cannot do this. NTP servers are a hobby and a volunteer project; often this is weak and old hardware (or low-power hardware like a Raspberry) that cannot handle 1,500,000 requests per second = 500-600 Mbps (I get real traffic of 150,000 to 1,500,000 pps if I set the 512k netspeed).
Or these are corporate servers on which NTP is not a priority task, and it is unacceptable for NTP to overload the server by 200%.
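As a rough check of these figures, packets per second can be converted to bandwidth. The sketch assumes plain NTP packets of 48 bytes without extension fields, carried over UDP and IPv4 without IP options; link-layer overhead is ignored, and only one direction is counted.

```python
NTP_PAYLOAD_BYTES = 48   # NTP packet without extension fields
UDP_HEADER_BYTES = 8
IPV4_HEADER_BYTES = 20   # IPv4 header without options


def mbps(pps, bytes_per_packet):
    """Convert a packet rate to megabits per second (1 Mbps = 10**6 bit/s)."""
    return pps * bytes_per_packet * 8 / 1_000_000


if __name__ == "__main__":
    ip_packet = NTP_PAYLOAD_BYTES + UDP_HEADER_BYTES + IPV4_HEADER_BYTES
    for pps in (150_000, 1_500_000):
        print(
            f"{pps} pps: {mbps(pps, NTP_PAYLOAD_BYTES):.1f} Mbps payload only, "
            f"{mbps(pps, ip_packet):.1f} Mbps including UDP/IPv4 headers"
        )
```

Under these assumptions, 1,500,000 pps corresponds to 576 Mbps of NTP payload alone, consistent with the 500-600 Mbps quoted above, and to about 912 Mbps once UDP and IPv4 headers are included.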

Ultimately, there are millions of users and devices that have a pool address registered for time synchronization. All of them will soon lose normal pool operation. If the pool can distribute the load to servers in other zones, that will help.

I found two servers (operated by the same person/group) that are registered but whose score is a solid -99:

https://www.ntppool.org/scores/37.193.156.169
https://www.ntppool.org/scores/89.189.177.241

It would be interesting to hear if these servers answer NTP queries when queried from Russia.

I also found some .ru servers that are periodically unreachable, keep bad time, or seem to collapse under load, but they’re less interesting for diagnostics.

No


> ntpdate -q 89.189.177.241
server 89.189.177.241, stratum 0, offset 0.000000, delay 0.00000
15 Nov 18:32:11 ntpdate[72894]: no server suitable for synchronization found
> ntpdate -q 37.193.156.169
server 37.193.156.169, stratum 0, offset 0.000000, delay 0.00000
15 Nov 18:32:28 ntpdate[72899]: no server suitable for synchronization found

No ping, no NTP; in other words, unreachable.

My server is pool.ntp.org: Statistics for 46.188.16.150
Traffic makes my Mikrotik RB4011 smoke. Maybe it is possible to rate-limit NTP packets to answer at most 50k PPS, for example. But this will cause problems with monitoring, too.

Good, so at least in this case the monitors seem to work fine.

Quite frankly, I haven’t seen any evidence that the monitors are to blame for the problem. What the .ru zone needs is simply more capacity. Maybe ask large universities or large companies whether they have some bandwidth to spare?

Yes, rate limiting would cause such a server to drop out of the pool periodically, but in my opinion even a rate-limited server would help the general situation, i.e., it would decrease the load on other servers.
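To illustrate what such a limit does, here is a token-bucket sketch. It is only a model of the idea, not router configuration: the class and function names are made up, the burst size of one tenth of the rate is an arbitrary choice, and in practice the limit would be enforced in the router or firewall rather than in Python.

```python
class TokenBucket:
    """Allow `rate` packets per second on average, with bursts up to `burst`."""

    def __init__(self, rate, burst):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.last = None

    def allow(self, now):
        """Return True if a packet arriving at time `now` (seconds) is answered."""
        if self.last is not None:
            elapsed = now - self.last
            self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def answered_fraction(offered_pps, limit_pps, seconds=1.0):
    """Fraction of evenly spaced packets answered under the given limit."""
    bucket = TokenBucket(rate=limit_pps, burst=limit_pps / 10)
    total = int(offered_pps * seconds)
    answered = sum(bucket.allow(i / offered_pps) for i in range(total))
    return answered / total


if __name__ == "__main__":
    # Hypothetical example: 50,000 pps offered against a 5,000 pps limit.
    print(answered_fraction(50_000, 5_000))
```

With ten times more traffic offered than the limit allows, roughly one query in ten gets an answer. The monitors' probes are dropped at about the same rate as everyone else's, which is what pulls the score down.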

But as I argued in the other thread, this is not a problem in the current situation. I.e., right now, your server will get overloaded almost no matter what. Rate limiting will protect your server and line, and prevent them from going up in smoke, or your line not being usable for anything else anymore.

In a way, the score dropping is a very crude load management mechanism. Use it to keep the load in check, and keep your server in the pool with whatever it can provide, rather than removing even more capacity from the zone.

And if you can figure out the IP addresses of the monitors as hinted at above, maybe enlisting someone else’s help with an IP address not “beleaguered” by NTP clients, you could selectively drop those if the load gets too high, minimizing impact on your actual clients.

The good servers in the pool are servers of commercial companies, and they also serve the “working needs” of the company. When such a server is overloaded and does not respond, commercial problems arise, so the admins will simply remove the server from the pool.
For some, 500 Mbit/s of NTP query traffic overloads a 100 Mbit/s channel. Therefore, simply rate limiting at the server does not help.

It is not the rate limiting directly that is supposed to help, but the indirect route via the scores dropping when packets are dropped, thus reducing the load coming from the pool.

As an ISP, you might be able to more easily find out the IP addresses of the monitors, and block just those when needed.

Or look at the names of the monitors, which encode the location including the country, and use selective, temporary geo-blocking to force your scores to drop when/shortly before the load gets too high.

I’ve opened the NTP port on the router, but rate-limited it to 5000 pps or 10 Mbps. Let’s see how it goes. It’s better than nothing.
