Oh dear, this explains it. 6 servers left in CN zone. There were 50 just a few months ago.
We saw LeoBodnar’s tweet yesterday and so added a stratum-2 node (preferring our LeoNTP GPS receiver on the same network) to the China pool, with a notional bandwidth setting of 1Mbit/sec — I understand that this is a “weight” and I didn’t want to pick a setting which would pull loads of traffic straight to our new node. That’s also why we run this NTP server on a separate IP address — in case we needed to nullroute it with upstreams due to DDoS.
With 1Mbit/sec set, we’re currently receiving around 10k requests per second, peaking at 40k requests per second. That’s a bandwidth usage of around 25Mbit/sec at 95th percentile (we are an ISP, we pay for transit links with CDRs and 95%iles) — which equates to about 8Tbytes/month for people billed that way by VPS providers.
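For anyone wanting to check that conversion for their own link, here is a minimal sketch of the arithmetic. It assumes a 30-day month, decimal terabytes, and that the 95th-percentile rate is sustained all month, so it gives an upper-ish estimate rather than an exact bill. The function name `monthly_terabytes` is just illustrative.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month


def monthly_terabytes(mbit_per_sec: float) -> float:
    """Convert a sustained rate in Mbit/s to decimal terabytes per month."""
    bytes_per_sec = mbit_per_sec * 1_000_000 / 8
    return bytes_per_sec * SECONDS_PER_MONTH / 1e12


if __name__ == "__main__":
    # 25 Mbit/s sustained for 30 days
    print(f"{monthly_terabytes(25):.1f} TB/month")
```

With 25 Mbit/s this comes out at roughly 8 TB, which matches the figure above.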
Could one of the problems be that time servers actually in China, like the ones xushuang listed above, are being dropped from the pool because the pool’s monitoring (which potentially has to traverse the Great Firewall of China) is seeing packet loss? They might be absolutely fine within China, but from the vantage point of the quality monitoring they appear to have problems because of filtering or DPI capacity limits.
Certainly I’ve seen some interesting things happen when UDP traffic suddenly changes, presumably because of Chinese filters adapting to different VPN services, etc. Would that explain the sudden drop-off in China? Has anybody got access to the pool monitoring data of any in-China NTP servers which have fallen out of the pool to confirm?
And if this is the case, we haven’t made things better for Chinese users at all — because the NTP servers which they would be able to reach have fallen out of the pool, and NTP services we are offering them from outside China (which are suffering packet loss as UDP 123 transits the GFW) are actually a degraded service, even before you consider how overloaded they are
@tomli took the initiative to talk about this earlier in this thread, but I am not sure whether anything came of it.
Personally I think the pool needs to change the setup to have multiple monitoring stations, and to accept that not all NTP servers will be reachable from all monitoring stations.
It looks like some servers are recovering:
16 (+8) active 1 day ago
15 (+9) active 7 days ago
Yeah, that’s the plan. It’s mostly already supported on the beta system. The beta code is running in kubernetes and there’s a bit more work to do before everything is working properly (so I can move the production code to the same branch and to run in the same way). There’s also (still…) some work to do to better manage the increase in monitoring data.
I’ve been focused on an update to the DNS server; we rolled it out to most of the servers over the last couple of weeks so I should soon be able to focus on this again. (And the work we’ve talked about elsewhere on the forum around “backfill servers”).
The Asia pool now faces the same problem as 8 days ago: the continental pool lost one third of its IPv4 servers in a day, 157 -> 107. Notable country pools:
- China: 19 -> 7
- India: 8 -> 3
- Japan: 27 -> 17
- Taiwan: 7 -> 2
- Hong Kong: 9 -> 2
Is there something wrong with the monitoring system?
Indeed, after a bit of a recovery (maybe almost 20 servers in the zone), .cn is now back to single digits!
There are 9 active servers in this zone.
19 (-10) active 1 day ago
Is there any way to find out which 10 NTP servers left the zone, and if so, whether they themselves opted to leave the zone or whether monitoring booted them?
Speaking only for myself, but at the current rate it looked like my server in Singapore (also in the .cn pool) would have sent some 6.5TB of NTP traffic this month and that would have exceeded my 4TB quota. I’ve grown tired of watching the traffic amounts manually, so I’ve now scripted things a bit and my server will now drop probes from the pool monitoring servers if the estimated monthly traffic (according to vnstat) exceeds 3.90 TB. This does not affect other clients, but will cause the score of my server to drop below 10, thus dropping the server from the pool DNS. When the estimated monthly traffic drops below 3.90 TB, the server will start responding to probes from the pool monitoring servers again. If a hard limit of 3.95TB of monthly transmitted data is exceeded, all NTP traffic will be dropped until next month and new quota limits.
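The decision logic described above could be sketched roughly like this. The actual script is not shown in the post, so this is only an illustration: the names `Action` and `decide` are hypothetical, and the two traffic figures are assumed to be supplied by whatever reads vnstat. Only the 3.90 TB and 3.95 TB thresholds come from the post.

```python
from enum import Enum


class Action(Enum):
    SERVE_ALL = "serve clients and pool monitoring probes"
    DROP_MONITOR_PROBES = "serve clients, drop pool monitoring probes"
    DROP_ALL_NTP = "drop all NTP traffic until next month"


SOFT_LIMIT_TB = 3.90  # estimated monthly traffic threshold from the post
HARD_LIMIT_TB = 3.95  # hard cap on data actually transmitted this month


def decide(estimated_month_tb: float, sent_so_far_tb: float) -> Action:
    """Pick an action from the projected and the actual monthly traffic.

    Both inputs are assumed to come from an external source such as vnstat.
    """
    if sent_so_far_tb > HARD_LIMIT_TB:
        return Action.DROP_ALL_NTP
    if estimated_month_tb > SOFT_LIMIT_TB:
        # Score falls below 10, so the pool DNS stops handing out this server.
        return Action.DROP_MONITOR_PROBES
    return Action.SERVE_ALL


if __name__ == "__main__":
    print(decide(estimated_month_tb=6.5, sent_so_far_tb=1.2).value)
```

The point of the soft limit is that existing clients keep getting answers while the pool gradually stops sending new ones.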
My U.S. server (also in .cn pool) has a similar setup, but the numbers are different. It has a 3TB monthly quota and at the current rate it would have sent 4.68 TB this month. Both of these servers are configured as 384kbit/s in the pool.
As of now, the China zone says:
“IPv4: There are 5 active servers in this zone”
This might be bad. Those five servers may get quite a lot of traffic at the moment. One of my servers will join the .cn pool again in a few hours when its score reaches 10 again (see above message for why it was below 10), so I’ll get to see this myself. Stats here: http://biisoni.miuku.net/stats/ntppackets.html
Edit: 4 active servers now.
“IPv4: There are 5 active servers in this zone”
Yeah… just starting to see the bandwidth of incoming packets get rather brutal!
There are 8 active servers right now, and 184.108.40.206 has on average been receiving more than 60Mbps over the last 24 hours.
edit: wow… just went down to 5 and traffic increased about 20Mbps
Seems like the zone collapsed now. It is down to three servers.
I speculate that, in addition to GFW potentially affecting the monitoring, one of the problems now is that it’s very easy to get a level of traffic that a single-threaded ntpd process cannot handle. This happened to us yesterday and because that single core became overloaded, it started sending fewer replies. The result of that was that we got even more queries. At one point our NTP server’s inbound traffic was about six times the outbound. Normally we monitor the PPS and bytes in:out ratio to see if we are being abused for DDoS type attacks. In this instance it looked like we were being attacked.
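As a rough illustration of the in:out ratio check mentioned above, here is a minimal sketch. The function names and the default threshold of 3.0 are assumptions for the example, not what our monitoring actually uses. The packet counters are assumed to come from whatever already collects interface statistics.

```python
def in_out_ratio(in_pps: float, out_pps: float) -> float:
    """Ratio of inbound to outbound packets per second."""
    if out_pps <= 0:
        # No replies at all: treat as an unbounded ratio.
        return float("inf")
    return in_pps / out_pps


def looks_overloaded(in_pps: float, out_pps: float, threshold: float = 3.0) -> bool:
    """Flag the server when it answers far fewer queries than it receives.

    The threshold is a hypothetical value; tune it to your own baseline.
    """
    return in_out_ratio(in_pps, out_pps) > threshold


if __name__ == "__main__":
    # Roughly the situation described: inbound about six times outbound.
    print(looks_overloaded(in_pps=60_000, out_pps=10_000))
```

A healthy NTP server answers nearly every query, so the ratio sits close to 1:1. A ratio well above that means replies are being dropped somewhere.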
The problems for us all started around the time that the DNS master went down — 09:30 UTC on 30th June — but got really bad around 03:30 UTC on 1st July, and then I started to get alerts from our network monitoring setup.
The times on this PPS graph are UTC+02, with inbound in blue and outbound in green. The “flat cap” to the blue is when we start hitting the PPS limit of the firewall!
Thankfully it didn’t take that long to convert one quad-core VM (of which only one core was being used by ntpd) into four single-core VMs. Within our network we are now using ECMP on our routers to spread the incoming requests across what has become a four-node NTP “anycast cluster” serving our IP address that is in the CN Pool. The result was almost instantaneous (just before 07:00 on that graph). We are now getting under 50% CPU usage on each VM, and our inbound traffic has subsided by about 80%. We’re still seeing about a 2:1 ratio of in:out, but that’s probably just a ton of misconfigured clients we have picked up.
Anyway, it could be that some servers have fallen out of the CN zone not because of bandwidth but because of CPU throttling — which turns into a vicious cycle that causes bandwidth problems, and then NTP Pool Monitoring throws the node out. And that just piles the load back on to the rest…
Can I have this server 220.127.116.11 added to the CN pool?
May I politely suggest that the problems with the China zone and its disappearing servers are - at least to some degree - also caused by the pool monitoring system and its flaky network connection?
We have 2 perfectly well working servers in a perfectly well working network that nonetheless keep getting bumped out of the pool several times a month simply because the monitoring system seems to have trouble reaching them. And I’m not the only one with that problem.
And while we sit in Europe, I would not be surprised to learn this affects Chinese networks as well. And this is a way to lose servers - if an operator sees his server getting removed again and again and again without being able to do anything about it, he might very well conclude that the NTP Pool Project has such an ample supply of server capacity that it can afford this largesse and doesn’t need his servers anyway.
Chinese zone collapsed - 3 hosts left. My own server will also be pushed out at some point since it now receives more traffic than it can answer (100+ Mbit).
…aaaand we’ve just gone below score of 10 as well.
“IPv4 There is 1 active server in this zone.”
We’re now at the stage where e.g. LeoBodnar’s GPS devices are bouncing in and out of the pool because as soon as they get added they’re ramping up to >100Mbit/sec traffic, start dropping packets, and their reputation slides back down again. There are too few members of the pool to share the load adequately. As @Hedberg says, the zone has collapsed