Collapse of Russia country zone

Yes, and that is also happening in this case, as described previously. But when the traffic while in the pool reaches upper triple-digit Mbps (at a 512 kbit netspeed setting), a corresponding proportion of residual traffic remains even after dropping out of the pool, and it takes time to die down.

And yes, there are other effects as well. For example, I’ve seen traffic go up this morning on a server which I removed from the pool last night, before it continued to go down as expected. This was even more pronounced on another system not long ago, both in rate/volume and in duration, before it eventually subsided again.

No need for “shadow DNS”; the actual DNS has enough quirks and occasional malfunctions to cause this, not to speak of client implementations. But as was discussed throughout this thread and the previous one, this is unlikely to be the issue here.

Why is it so hard to grasp that in a zone with roughly 132,000,000 Internet users and less than 10 active servers, each one of them gets too much traffic?

I know, I can set the netspeed of my server in Germany to 3 Gbit and will still hardly get 2 Mbit/s of traffic. But that is different in other regions of the world, where people can only dream of a client/server ratio like in Europe or the USA. This thread is about what the findings of your paper mean in practice when the pool limits clients to servers from their own zone only.

Got you.

How can this problem be addressed in general? More servers in underserved countries, of course. But since that is not the case right now and might not happen any time soon, should the distribution logic be modified? What do you suggest?

Also: do users experience problems? Or will Cloudflare remain the only server left, handling all NTP requests nicely?

I updated the script so that it is capable of debug output. Just try the new version with the -d 2 -p 20 options.


That any of the multitude of proposals that have been discussed in this forum time and time again gets implemented. I almost don’t care which one, as long as something is being done. And that discussions actually have meaning and can come to a conclusion, because those without whose input all of this is moot give us at least a glimpse of what their thinking is, and of where community input is needed or desired to shape it.

You weighed in on this in the context of IPv6, but that part of the discussion was wider than just IPv6; at least in my understanding, it also encompassed the topic of “geographic” allocation of clients to servers.

I mean, I don’t even mind if the system eventually gets the ability to allocate servers to clients based on almost individual RTT; on the contrary. As long as that doesn’t mean we have to wait yet another few years until even the basic problems are addressed by then-perfect solutions.

The problem is that there were so many proposals, but without feedback, the discussions never came to a conclusion. E.g., one idea was to slowly start mapping requests to a country zone to the encompassing continent zone in a transparent fashion.


Just looking at the score graphs of big servers like Tencent’s in China, I don’t think so. If we are serious about the monitors indeed reflecting the service quality that clients get, then I’d say no, judging by what we already see today in other zones, where we can observe how even big servers cope, or don’t, based on score history that isn’t obscured by anycast networking (as it is in Cloudflare’s case).

And interspersed user reports here and there.

Or reports from, I think, the Philippines, where there were/are a few servers which, while scoring well with some monitors, don’t seem to actually serve many local clients. Not necessarily a load issue, but again a consequence of locking clients in to servers of their own zone.

Or me using a fellow server in Singapore to get time locally, which periodically becomes unavailable even though it is just low single-digit milliseconds (typically less than two) away. Where would sustained packet loss in that context come from, if not from overload, or from overzealous protection mechanisms triggering on that same condition, even as a false positive, as I experience myself?

Thanks @NTPman for the script modifications.
Thanks @davehart, I looked for the option but didn’t find it tonight.

Could it be that this is a router firmware bug, like in the Fortinet firewalls?

I see

  • 99% is IPv4 traffic
  • 99% coming from ADSL, VDSL, and phone providers (ISPs)
  • ~85% of the clients aren’t running into a rate limit or making excessive requests

Traffic since 15-11-2024

 eth0                                                                     11:37
  ^                                            r
  |                                            r                          r
  |                                      r     r        r                 r
  |                                      r     r        r                 r
  |                                      r     r     r  r                 r
  |                             r  r     r     r     r  r     r           r
  |                             r  r     r     r     r  r     r  r  r     r
  |                             r  r     r     r     r  r  r  r  r  r     r
  |                             r  r     r     rt    r  r  r  r  r  r     r
  |                             r  rt    rt    rt r  rt rt rt rt r  r     rt
 -+--------------------------------------------------------------------------->
  |  12 13 14 15 16 17 18 19 20 21 22 23 00 01 02 03 04 05 06 07 08 09 10 11

 h  rx (MiB)   tx (MiB)  ][  h  rx (MiB)   tx (MiB)  ][  h  rx (MiB)   tx (MiB)
12        0.0        0.0 ][ 20       35.8       13.2 ][ 04    5,447.4    1,095.4
13        0.0        0.0 ][ 21    4,447.9      758.2 ][ 05    7,403.5    1,576.1
14        0.0        0.0 ][ 22    4,913.2      971.2 ][ 06    2,965.2    1,180.6
15        0.0        0.0 ][ 23      262.4       90.9 ][ 07    4,534.8    1,082.0
16        0.0        0.0 ][ 00    6,966.1    1,389.7 ][ 08    4,255.8      762.7
17        0.0        0.0 ][ 01      362.8      120.1 ][ 09    3,531.4      764.9
18        0.0        0.0 ][ 02    8,707.5    2,032.1 ][ 10      657.5      246.7
19       90.3       36.0 ][ 03    1,468.5      545.2 ][ 11    8,185.2    1,599.5

PS: made it into the pool for 17 minutes :see_no_evil:


I would concentrate on the 15% that does get rate limited.

I see that your server does not respond to a significant chunk of inbound traffic. While it is expected that rx>tx in a rate limited configuration, your ratio seems unusually uneven. Why is that?

For comparison, here’s a similar graph for a server in Singapore (94.237.79.110) with “ratelimit interval 3 burst 8 leak 4” in chrony.conf:

$ vnstat -hg
 eth0                                                                     12:40
  ^  r
  |  rt
  |  rt rt r
  |  rt rt rt
  |  rt rt rt                                  r  r                 rt rt
  |  rt rt rt rt                               rt rt rt rt    rt rt rt rt
  |  rt rt rt rt                         rt rt rt rt rt rt rt rt rt rt rt rt
  |  rt rt rt rt rt                      rt rt rt rt rt rt rt rt rt rt rt rt
  |  rt rt rt rt rt                rt rt rt rt rt rt rt rt rt rt rt rt rt rt
  |  rt rt rt rt rt rt rt rt rt rt rt rt rt rt rt rt rt rt rt rt rt rt rt rt
 -+--------------------------------------------------------------------------->
  |  13 14 15 16 17 18 19 20 21 22 23 00 01 02 03 04 05 06 07 08 09 10 11 12

 h  rx (MiB)   tx (MiB)  ][  h  rx (MiB)   tx (MiB)  ][  h  rx (MiB)   tx (MiB)
13    9,993.2    9,485.9 ][ 21    1,157.6    1,111.8 ][ 05    5,657.1    5,320.9
14    8,791.9    8,397.4 ][ 22    1,479.8    1,414.0 ][ 06    5,531.5    5,211.9
15    8,150.1    7,775.0 ][ 23    2,191.1    2,092.4 ][ 07    4,503.2    4,248.4
16    5,580.8    5,380.3 ][ 00    2,957.3    2,750.0 ][ 08    5,854.8    5,600.3
17    3,210.1    3,113.0 ][ 01    4,835.9    4,508.7 ][ 09    5,381.9    5,169.8
18    1,941.3    1,878.6 ][ 02    4,633.7    4,372.5 ][ 10    6,373.8    6,107.7
19    1,405.5    1,352.8 ][ 03    6,081.9    5,721.5 ][ 11    6,477.4    6,254.7
20    1,185.2    1,133.8 ][ 04    6,130.8    5,831.2 ][ 12    4,583.5    4,413.3
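
For anyone not familiar with chrony’s rate limiting, that directive breaks down roughly as follows (parameter meanings per the chrony documentation):

  # excerpt from chrony.conf on the Singapore server
  ratelimit interval 3 burst 8 leak 4
  # interval 3 -> enforce a minimum mean interval of 2^3 = 8 s per client
  # burst 8    -> allow up to 8 responses in a row before limiting kicks in
  # leak 4     -> still answer roughly 1 in 2^4 = 16 limited requests, so
  #               clients behind spoofed or shared addresses aren't cut off entirely

With these settings, a client polling more often than every 8 seconds on average gets most of its excess requests dropped, which is why tx stays somewhat below rx here as well.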

That’s a good question.
Even at the start yesterday, without the “discard average 1” option and without a firewall, the traffic was asymmetric.

It’s just a small VM with 1 CPU & 2 GB RAM.

Installed chrony now with your ratelimit settings.
4096 clients,
3 dropped by the ratelimit,
and about 50-80 with a negative interval between NTP packets, most from the same IP range 95.26.x.y.


In the past I have suggested that the pool DNS servers have a configured minimum number of servers per country zone (maybe scaling this value with the number of queries for that zone). When the available server count falls below the minimum, start handing out servers from neighbouring country zones or even worldwide.

The idea being that service levels are maintained, albeit perhaps with servers that have a higher RTT, and that in-country volunteers are protected from being overwhelmed.

Currently I understand the volunteer “contract” to be that volunteer servers will generally get clients from the same region. A change like this would mean that any volunteer server anywhere can end up getting clients from anywhere, and an overall increase in baseline pps for everyone (because everyone is bearing the load of CN/RU instead of a few).

If this is seen as a problem, maybe it could be opt-in, with documentation explaining why one might want to be generous with this, and a call put out for help by enabling it.

This is however the only way I see under-served country zones getting more servers.

Is it time for us (i.e. not Ask or anyone else currently burdened with running pool.ntp.org) to consider setting up a “testbed” ntp pool, maybe ipv6-only, to try out such suggestions in the hope that what works might be implemented in pool.ntp.org?

I have no problem with your message about determining pool monitor addresses. You provided a hint, without spelling out the details. I replied to and quoted a message from @umike which did spell out exact steps to determine monitor IP addresses, though with a bit more work than the method you hinted at.

Agreed completely. That’s why I was glad you only hinted, as messages on that topic will be seen by many who might be tempted to use the knowledge for ill-advised special handling of monitors. I suspect @kkursor didn’t pick up on the hinted method based on his subsequent posts, whether due to language barrier or less familiarity with pool monitor operations.

Agreed regarding IPv6. I’m not aware of any work trying to avoid giving the same server’s IPv4 and IPv6 address in a single response, though I might have missed it. I believe I saw it requested, but my impression is that’s not likely to consume pool developer focus anytime soon.

I made the same-subnet suggestion because I believe it’s trivial to implement and could enable many more monitors to be brought up by current pool server operators. Making that change shouldn’t deflect much @Ask time from other issues, particularly if he does it while making other changes to the central monitor infrastructure.

Turning to the issue of collapsing zones:

For China, the problem is structural: residential internet is subsidized by data center internet users, combined with disallowing servers on those residential connections. I don’t see that there’s much the pool can do to improve that situation beyond continuing to try to encourage big players in China to add monitors. But it’s an uphill sell, because that’s asking companies to spend money, by increasing their bandwidth use at premium pricing, for altruistic reasons, which as profit-focused enterprises they’re naturally disinclined to do.

For the Russia zone collapse, it sounds to me like the problem could be one or both of two issues brought up in this thread:

  1. State actors may have blocked Cloudflare as punishment for some violation of “communications” regulations or law. If Cloudflare was handling a huge proportion of .ru.pool.ntp.org traffic, that might explain the explosion of traffic for other pool server operators in Russia.

  2. Anti-Russia digital vigilantes (or some might say hacktivists) organized a (D)DoS of NTP traffic to pool servers in Russia, explicitly intending to break .ru.pool.ntp.org. Investigating this on victim Russian pool servers with ntpq -c "mrulist sort=avgint" for ntpd, or something similar for chrony (assuming it has functionality to show the IP addresses of recent clients along with the average interval between requests; see the sketch below), could point to the abusive IP addresses. These could then be filtered at the router or, more productively, blackholed by the ISP or its upstreams so the traffic never reaches pool server operators like @kkursor’s routers to begin with.
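
As a concrete sketch of that second investigation (the chrony side is my assumption, based on chronyc’s clients report, which lists client addresses, packet counts, and the log2 of the mean interval between requests):

  # ntpd: recently seen clients, sorted by average inter-request interval
  ntpq -c "mrulist sort=avgint"

  # chrony: no built-in sorting, but a rough equivalent is to sort the client
  # table on its "Int" column (column 4, log2 of the mean request interval),
  # so the most aggressive clients end up at the top
  chronyc -n clients | tail -n +3 | sort -n -k 4 | head -20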

This is not as simple as you might think. First, geographic location and network location can be very different things. In my small town in Maryland, US, population about 5,000, there are three residential ISPs: the legacy wired telephone company Verizon’s DSL service[1], the cable TV provider, and a rural fiber broadband startup. There’s a different fiber startup for government and libraries, and finally Comcast for a few large institutions like a college, a hospital, and one manufacturer. So in my town, there are 5 different ISPs with substantial use, and each one has a different set of paths to reach the wider Internet. Knowing something about the latency from one has little to do with the experience on the others. Reaching across town between these ISPs generally involves a hop to a major internet exchange a 1.5 to 2 hour drive away, in Philadelphia, Baltimore, or Ashburn, VA.

Moreover, the pool simply doesn’t have a way to determine latency from a given IP address querying its DNS servers to various pool servers. If you can think of how this might be implemented, please speak up.

[1] A DSL service they’ve been increasing the price of and degrading the speeds of for years, as they try to move away from their copper plant, with its heavy regulation requiring servicing every home and business in their territory and high maintenance costs due to unionized labor and a sprawling, but now much less used, copper plant as people moved to cellphones.

Cloudflare 1
Top Countries
us 247.03 ‱ 189.39 ‱ (1.30x)
cn 1008.19 ‱ 3064.54 ‱ (0.33x)
ru 1802.84 ‱ 3959.22 ‱ (0.46x)

Cloudflare 2
Top Countries
us 246.67 ‱ 189.39 ‱ (1.30x)
cn 1011.67 ‱ 3064.54 ‱ (0.33x)
ru 1802.61 ‱ 3959.22 ‱ (0.46x)

As I wrote above, there are many clients with an interval from -5 to -1.
For example:

Hostname                      NTP   Drop Int IntL Last     Cmd   Drop Int  Last
===============================================================================

37.99.x.y                1191    918  -2  -1     2       0      0   -     -
31.29.x.y                1030    838  -4   1     2       0      0   -     -
89.191.x.y               1000    809  -2   1     0       0      0   -     -
178.141.x.y               988    330  -3  -1     4       0      0   -     -
46.34.x.y                 984    413  -5   1     0       0      0   -     -
...

@avij it looks like your ratelimit settings work.
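
For reference, the check for those clients can be scripted with something like this; in chronyc’s clients report, column 4 (“Int”) is the log2 of the mean interval between NTP requests, so a negative value means more than one request per second on average (e.g. -5 ≈ one request every 31 ms):

  # list clients whose mean NTP request interval is below one second
  chronyc -n clients | awk 'NR > 2 && $4 ~ /^-[0-9]/ { print }'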


Of course, special handling of monitors is cheating and discredits the very idea of monitoring.
Now I am out in the country building a water supply for my country house, far from Moscow and NTP problems. I have allowed 50k NTP packets per second on the router and am waiting for the score to rise enough to rejoin the pool. No problems with my services now, about 100-150 pps.
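
The exact router platform doesn’t matter much here; purely as an illustration, on a Linux-based router a cap like that could be expressed with nftables along these lines (table and chain names are made up):

  # drop inbound NTP beyond roughly 50,000 packets per second
  nft add table inet ntpguard
  nft add chain inet ntpguard input '{ type filter hook input priority 0; }'
  nft add rule inet ntpguard input udp dport 123 limit rate over 50000/second drop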

Go ahead and say “we need more servers in North Korea”. It will mean the same thing.

Reached the end of the thread. A waste of time. Not a single soul noted that russia is killing my country. “Deep concern” and ill values. They are the only ones responsible for this issue, not “monitors”, not the “pool”, not bad glitchy firewalls. My guess is they are cutting themselves off and being DoSed altogether. We have big volunteer groups coordinating DoS attacks on russia, and only the lazy are not participating. It happens that local datacenters and national VPN providers were even OK with giving approval to bombard them. Not “new servers” but “stop killing people” is the only solution; the rest are consequences.

New tests are pending to cut the russian cheburnet (this is how they ironically call it) off from the global net entirely, so don’t save a drowning man who wants this himself; their pleas for help are false, since >80% have supported the (current/new) war at every point throughout its history.


History will judge who is killing whom there.
I think it won’t be long now.

Cheburnet (from Cheburashka and Internet) is running on chips from washing machines and air conditioners, powered by toilets. That’s the reason why it’s slow and ugly.

As described here, there are no fixed monitors; they are selected for each server automatically. That’s good and can possibly mitigate cross-border problems.


Post must be at least 20 characters?

I’m sure we all have our own opinions about the current political situation, but let’s keep politics out of this.


Got an email tonight from the network abuse team

IN Attack notification
Threshold Packets 1,000,000 packets/s
Sum 370,515,000 packets/300s (1,235,050 packets/s), 74,057 flows/300s (246 flows/s), 26.420 GByte/300s (721 MBit/s)
External 213.183.x.y, 85,000 packets/300s (283 packets/s), 2 flows/300s (0 flows/s), 0.006 GByte/300s (0 MBit/s)
External 80.73.x.y, 75,000 packets/300s (250 packets/s), 2 flows/300s (0 flows/s), 0.005 GByte/300s (0 MBit/s)
External 77.66.x.y, 50,000 packets/300s (166 packets/s), 1 flows/300s (0 flows/s), 0.004 GByte/300s (0 MBit/s)

and many many more lines…


Can we leave politics out of this please?

The question is: why is Russia starved of NTP servers?

The same goes for China, which has been reported many times too.

How come this is happening? It’s just time. Nothing special about it.

Makes no sense…as time can easily be taken from the sky…this only hurts normal people.

Time signals can be received by a 5 euro device…it simply makes no sense to block NTP or even to DDoS time servers.

That is my opinion.

But please, no politics…the pool isn’t about that.