Adding servers to the China zone

Could one of the problems be that time servers actually in China, like the ones xushuang listed above, are being dropped from the pool because the pool’s monitoring (which potentially has to traverse the Great Firewall of China) is seeing packet loss? They might be absolutely fine within China, but from the vantage point of the quality monitoring they appear to have problems because of filtering or DPI capacity limits.

http://www.pool.ntp.org/scores/120.132.6.211
http://www.pool.ntp.org/scores/120.132.6.225

Certainly I’ve seen UDP traffic behaviour suddenly change in interesting ways, presumably because of Chinese filters adapting to different VPN services, etc. Would that explain the sudden drop-off in China? Has anybody got access to the pool monitoring data of any in-China NTP servers which have fallen out of the pool to confirm?

And if this is the case, we haven’t made things better for Chinese users at all — because the NTP servers which they would be able to reach have fallen out of the pool, and NTP services we are offering them from outside China (which are suffering packet loss as UDP 123 transits the GFW) are actually a degraded service, even before you consider how overloaded they are :wink:

@tomli took the initiative to talk about this earlier in this thread, but I am not sure whether anything came of it.

Personally I think the pool needs to change its setup to have multiple monitoring stations and to accept that not all NTP servers will be reachable from all monitoring stations.
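To make the idea concrete, here is a minimal sketch of how scores from several monitoring stations might be combined. It is not how the pool works today: the station names, the `min_good` parameter and the rule itself are assumptions for illustration. Only the threshold of 10 comes from this thread.

```python
# Hypothetical rule: a server stays in the pool if enough monitoring
# stations see it as healthy, even if others cannot reach it at all.
POOL_THRESHOLD = 10.0  # score needed to be served by the pool DNS (per this thread)


def eligible(scores_by_monitor, threshold=POOL_THRESHOLD, min_good=2):
    """Return True if at least `min_good` stations score the server >= threshold."""
    good = sum(1 for score in scores_by_monitor.values() if score >= threshold)
    return good >= min_good


# Station names are made up. A server behind a lossy path to one station
# is still accepted because two other stations reach it fine.
scores = {"monitor-us": -20.0, "monitor-cn-1": 18.0, "monitor-cn-2": 17.0}
print(eligible(scores))
```

With a single station, one bad path is enough to drop a server. With a rule like this, a server only drops out when most vantage points agree it is unhealthy.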

It looks like some servers are recovering:

16 (+8) active 1 day ago
15 (+9) active 7 days ago

http://www.pool.ntp.org/zone/cn

Yeah, that’s the plan. It’s mostly already supported on the beta system. The beta code is running in Kubernetes and there’s a bit more work to do before everything is working properly (so I can move the production code to the same branch and run it in the same way). There’s also (still…) some work to do to better manage the increase in monitoring data.

I’ve been focused on an update to the DNS server; we rolled it out to most of the servers over the last couple of weeks so I should soon be able to focus on this again. (And the work we’ve talked about elsewhere on the forum around “backfill servers”).

The Asia pool now faces the same problem as 8 days ago: the continental pool lost one third of its IPv4 servers in a day, 157 -> 107. Notable country pools:

  • China: 19 -> 7
  • India: 8 -> 3
  • Japan: 27 -> 17
  • Taiwan: 7 -> 2
  • Hong Kong: 9 -> 2

Is there something wrong with the monitoring system?

Indeed, after a bit of a recovery (maybe almost 20 servers in the zone), .cn is now back to single digits!

There are 9 active servers in this zone.
19 (-10) active 1 day ago

Is there any way to find out which 10 NTP servers left the zone, and if so, whether they themselves opted to leave the zone or whether monitoring booted them?

Speaking only for myself, but at the current rate it looked like my server in Singapore (also in the .cn pool) would have sent some 6.5TB of NTP traffic this month and that would have exceeded my 4TB quota. I’ve grown tired of watching the traffic amounts manually, so I’ve now scripted things a bit and my server will now drop probes from the pool monitoring servers if the estimated monthly traffic (according to vnstat) exceeds 3.90 TB. This does not affect other clients, but will cause the score of my server to drop below 10, thus dropping the server from the pool DNS. When the estimated monthly traffic drops below 3.90 TB, the server will start responding to probes from the pool monitoring servers again. If a hard limit of 3.95TB of monthly transmitted data is exceeded, all NTP traffic will be dropped until next month and new quota limits.
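For anyone wanting to do something similar, a minimal sketch of the decision logic described above could look like this. The 3.90 TB and 3.95 TB limits are the ones from this post; the function names, the `Action` enum and the linear extrapolation are assumptions (vnstat produces its own estimate, and the actual firewall rules are not shown here).

```python
from enum import Enum


class Action(Enum):
    SERVE_ALL = "serve all clients and monitor probes"
    DROP_MONITOR_PROBES = "drop pool monitor probes only"
    DROP_ALL_NTP = "drop all NTP traffic until next month"


def estimate_monthly_tb(sent_tb, days_elapsed, days_in_month=30):
    """Naive linear extrapolation of this month's traffic (an assumption)."""
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    return sent_tb / days_elapsed * days_in_month


def choose_action(sent_tb, estimated_tb, soft_limit_tb=3.90, hard_limit_tb=3.95):
    """Pick what the server should do given traffic sent so far and the estimate."""
    if sent_tb > hard_limit_tb:
        # Hard limit: the quota is effectively used up.
        return Action.DROP_ALL_NTP
    if estimated_tb > soft_limit_tb:
        # Soft limit: let the pool score fall below 10 so DNS stops listing us.
        return Action.DROP_MONITOR_PROBES
    return Action.SERVE_ALL


# Ten days into the month with 2.2 TB already sent.
print(choose_action(2.2, estimate_monthly_tb(2.2, 10)))
```

Dropping only the monitor probes keeps existing clients served while new clients stop being pointed at the server.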

My U.S. server (also in .cn pool) has a similar setup, but the numbers are different. It has a 3TB monthly quota and at the current rate it would have sent 4.68 TB this month. Both of these servers are configured as 384kbit/s in the pool.

So, the current low scores for my servers are intentional:
http://www.pool.ntp.org/scores/94.237.64.20
http://www.pool.ntp.org/scores/173.255.246.13

As of now, the China zone says:

“IPv4: There are 5 active servers in this zone”

This might be bad. Those five servers may get quite a lot of traffic at the moment. One of my servers will join the .cn pool again in a few hours when its score reaches 10 again (see above message for why it was below 10), so I’ll get to see this myself. Stats here: http://biisoni.miuku.net/stats/ntppackets.html

Edit: 4 active servers now.

“IPv4: There are 5 active servers in this zone”

Yeah… just starting to see the bandwidth of incoming packets get rather brutal!

There are 8 active servers right now and 5.103.139.163 has on average been receiving more than 60Mbps over the last 24 hours.

edit: wow… just went down to 5 and traffic increased about 20Mbps

Seems like the zone collapsed now. It is down to three servers.

I speculate that, in addition to GFW potentially affecting the monitoring, one of the problems now is that it’s very easy to get a level of traffic that a single-threaded ntpd process cannot handle. This happened to us yesterday and because that single core became overloaded, it started sending fewer replies. The result of that was that we got even more queries. At one point our NTP server’s inbound traffic was about six times the outbound. Normally we monitor the PPS and bytes in:out ratio to see if we are being abused for DDoS type attacks. In this instance it looked like we were being attacked.
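The in:out check mentioned above can be sketched roughly as follows. The function names and the 1.5 alert threshold are assumptions for illustration, and reading the packet counters from the host or firewall is left out.

```python
def in_out_ratio(pps_in, pps_out):
    """Ratio of inbound to outbound packets per second."""
    if pps_out <= 0:
        return float("inf") if pps_in > 0 else 0.0
    return pps_in / pps_out


def replies_falling_behind(pps_in, pps_out, max_ratio=1.5):
    """True when far more queries arrive than replies leave.

    The threshold is an assumed value; pick one that matches your
    normal traffic mix.
    """
    return in_out_ratio(pps_in, pps_out) > max_ratio


# Roughly the situation described above: six times more in than out.
print(in_out_ratio(60000, 10000), replies_falling_behind(60000, 10000))
```

A healthy server answers nearly every query, so the ratio stays close to 1. A ratio that keeps climbing suggests ntpd is not keeping up, whatever the cause.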

The problems for us all started around the time that the DNS master went down — 09:30 UTC on 30th June — but got really bad around 03:30 UTC on 1st July, and then I started to get alerts from our network monitoring setup.

The times on this PPS graph are UTC+02, with inbound in blue and outbound in green. The “flat cap” to the blue is when we start hitting the PPS limit of the firewall!

Thankfully it didn’t take that long to convert one quad-core VM (of which only one core was being used by ntpd) into four single-core VMs. Within our network we are now using ECMP on our routers to spread the incoming requests across what has become a four-node NTP “anycast cluster” serving our IP address that is in the CN Pool. The result was almost instantaneous (just before 07:00 on that graph). We are now seeing under 50% CPU usage on each VM, and our inbound traffic has subsided by about 80%. We’re still seeing about a 2:1 ratio of in:out, but that’s probably just a ton of misconfigured clients we have picked up.
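As a rough illustration of how ECMP spreads load, routers typically hash fields of each packet's header and use the result to pick one of the equal-cost next hops. The sketch below mimics that in Python. The node names are hypothetical, and CRC32 is only a stand-in for whatever hash a given router actually uses.

```python
import zlib

# Hypothetical names for the four single-core VMs behind the shared IP.
NODES = ["ntp-vm-1", "ntp-vm-2", "ntp-vm-3", "ntp-vm-4"]


def pick_node(src_ip, src_port, dst_ip, dst_port=123, proto="udp", nodes=NODES):
    """Map a flow's 5-tuple to one node, the same way for every packet of the flow."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{proto}".encode()
    return nodes[zlib.crc32(key) % len(nodes)]


# Addresses are from the documentation ranges, not real clients.
print(pick_node("192.0.2.17", 40123, "198.51.100.10"))
```

Because the choice depends only on the header fields, a given client keeps hitting the same VM, while many different clients spread across all four.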

Anyway, it could be that some servers have fallen out of the CN zone not because of bandwidth but because of CPU throttling — which turns into a vicious cycle that causes bandwidth problems, and then NTP Pool Monitoring throws the node out. And that just piles the load back on to the rest…

Can I have this server 5.103.128.88 added to the CN pool?

Thanks,

May I politely suggest that the problems with the China zone and its disappearing servers are - at least to some degree - also caused by the pool monitoring system and its flaky network connection?

We have 2 perfectly well working servers in a perfectly well working network that nonetheless keep getting bumped out of the pool several times a month simply because the monitoring system seems to have trouble reaching them. And I’m not the only one with that problem.

And while we sit in Europe, I would not be surprised to learn this affects Chinese networks as well. And this is a way to lose servers - if an operator sees his server getting removed again and again and again without being able to do anything about it, he might very well conclude that the NTP Pool Project has such an ample supply of server capacity that it can afford this largesse and doesn’t need his servers anyway.

Chinese zone collapsed - 3 hosts left. My own server will also be pushed out at some point since it now receives more traffic than it can answer (100+ Mbit)

…aaaand we’ve just gone below score of 10 as well.

“IPv4 There is 1 active server in this zone.”

We’re now at the stage where e.g. LeoBodnar’s GPS devices are bouncing in and out of the pool because as soon as they get added they’re ramping up to >100Mbit/sec traffic, start dropping packets, and their reputation slides back down again. There are too few members of the pool to share the load adequately. As @Hedberg says, the zone has collapsed :frowning:

@Ask - Could the monitor be configured to be slightly more forgiving for hosts in the CN zone? E.g. when a host doesn’t reply every time?
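To show what "more forgiving" could mean in numbers, here is a toy scoring model. It is an assumption for illustration, not the pool's confirmed algorithm: each probe decays the score and then adds +1 for a reply or subtracts a penalty for a miss. Only the threshold of 10 is taken from this thread.

```python
POOL_THRESHOLD = 10.0  # score needed to stay in the pool DNS (per this thread)


def update_score(score, replied, miss_penalty, decay=0.95, max_score=20.0):
    """One probe: decay the old score, then reward a reply or punish a miss."""
    step = 1.0 if replied else -miss_penalty
    return min(max_score, score * decay + step)


def simulate(miss_penalty, probes=200, miss_every=5):
    """Score after `probes` probes when every `miss_every`-th probe is lost."""
    score = 0.0
    for i in range(probes):
        replied = (i % miss_every) != miss_every - 1
        score = update_score(score, replied, miss_penalty)
    return score


# Same server, same 20% probe loss, two different penalties.
print(simulate(miss_penalty=5.0), simulate(miss_penalty=1.0))
```

In this toy model a server that loses one probe in five never reaches 10 with the harsher penalty, but stays above 10 with the milder one, so the penalty size alone can decide whether a lossy-but-working server stays in the zone.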

1 Like

@hedberg - sounds like a very simple (maybe temporary) fix for the problems that we are currently theorising the .cn zone to have had - e.g. earlier posts by @ChrisW, myself, and others.

What happens when there are no servers in a zone? Do requests just get bumped up one level?