CN pool collapses for a few hours every day

Hello.
I'm back. I was very active about 5 years ago, trying to get enough
servers for the CN pool. And for a few years it was working.

I still have 8 NTP servers in the pool, and a lot better hardware than
before. But during peak hours, that isn't enough.
Most likely, the demand is also higher in 2024.

And the scoring system doesn't take into account that the CN zone
is underserved, so my servers get scores below 10 and
are kicked out.

And the rest will take the hit.

And get kicked out.

And after some hours it's just a few servers, or 0 servers, in
the cn.pool.ntp.org pool.

And if my server gets above score 10 again, I will be the only
NTP server for ALL of China.

About 500-1000 Mbit/s of NTP traffic!
Not very funny, and this is starting to become a real problem,
as I can't work for a while.

It's not acceptable and I can't have it like this.

I have lowered the bandwidth setting on my servers, and some days it works
fine, but it's just random when it does and when it doesn't.
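
To put that figure into perspective, here is a rough back-of-the-envelope sketch of what 500-1000 Mbit/s means in packets per second. The packet sizes are assumptions (a plain 48-byte NTP packet without extension fields, over IPv4 and Ethernet), not something measured on these servers.

```python
# Rough estimate: how many NTP packets per second fit into a given bit rate.
# Assumption: a plain NTP packet without extension fields has 48 bytes of
# payload, plus 8 (UDP) + 20 (IPv4) + 14 (Ethernet header) bytes on the wire.
NTP_PAYLOAD = 48
UDP_HEADER = 8
IPV4_HEADER = 20
ETH_HEADER = 14
WIRE_BYTES = NTP_PAYLOAD + UDP_HEADER + IPV4_HEADER + ETH_HEADER


def packets_per_second(mbit_per_s: float) -> float:
    """Packets per second in one direction at the given bit rate."""
    return mbit_per_s * 1_000_000 / (WIRE_BYTES * 8)


for rate in (500, 1000):
    print(f"{rate} Mbit/s ~ {packets_per_second(rate):,.0f} packets/s")
```

Under these assumptions that is roughly 0.7 to 1.4 million packets per second in one direction.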

My suggestions for how to fix this

1

My servers are mostly in the zones “@ cn europe se”.
If I could set a different bandwidth for different zones,
I could for example have
@ - 1 Gbit
europe - 1 Gbit
se - 1 Gbit
cn - 10 Mbit
I have a lot of unused capacity that could be used in other
zones instead.
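
A minimal sketch of what such a per-zone setting could mean, assuming the share of DNS answers a server gets is proportional to its configured bandwidth. The data structures, the other servers and their settings are all hypothetical and only illustrate the idea; this is not how the pool stores or evaluates its configuration.

```python
import random

# Hypothetical per-zone bandwidth settings (Mbit/s) for one server.
# Zone names are taken from the post; the structure is an assumption.
my_server = {"@": 1000, "europe": 1000, "se": 1000, "cn": 10}

# Hypothetical other servers in the cn zone and their settings.
others_in_cn = {"server-a": 100, "server-b": 50, "server-c": 1000}


def share(weight: float, other_weights: list[float]) -> float:
    """Fraction of DNS answers a server would get if answers are
    drawn in proportion to the configured weight."""
    return weight / (weight + sum(other_weights))


def pick(weights: dict[str, float], rng: random.Random) -> str:
    """Draw one server name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


others = list(others_in_cn.values())
print(f"cn share at   10 Mbit: {share(my_server['cn'], others):.1%}")
print(f"cn share at 1000 Mbit: {share(1000, others):.1%}")
print("one sample draw:", pick({**others_in_cn, "mine": my_server["cn"]}, random.Random(0)))
```

With a single global setting, lowering the cn share also lowers the share in every other zone, which is the capacity that currently goes unused.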

2

It seems like peak times are 10:00 → 16:00 CET.
Around 14:00 CET is the worst time.
I have accepted that the zone might not work during the
entire day; the resources needed for that just aren't there.
But I don't want to have my connection “DDoS”:ed because of it.
So I want to opt out of the CN zone with my servers during
that time. Can something like this be arranged?
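
As a sketch of the rule being asked for, assuming a fixed UTC+1 offset for CET (summer time is ignored) and a hypothetical function name; nothing like this exists in the pool today as far as this thread shows:

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: "CET" is treated as a fixed UTC+1 offset, ignoring summer time.
CET = timezone(timedelta(hours=1), "CET")
PEAK_START = time(10, 0)
PEAK_END = time(16, 0)


def serve_cn_zone(now_utc: datetime) -> bool:
    """Hypothetical rule: stay in the cn zone only outside 10:00-16:00 CET."""
    local = now_utc.astimezone(CET).time()
    return not (PEAK_START <= local < PEAK_END)


# 13:00 UTC is 14:00 CET, the worst hour mentioned above.
print(serve_cn_zone(datetime(2024, 1, 15, 13, 0, tzinfo=timezone.utc)))
```

Any such switch would only take effect gradually, since clients and resolvers that already hold the server's address keep using it for a while.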

3

A more forgiving scoring system for the CN zone, so that not everyone is
kicked out. It's counterproductive.

4

If none of the above, nor any other suggestion, can ensure
that my connection isn't killed by the amount of NTP
packets, I must ask to have my servers removed from the CN
zone completely. Very sorry, but I can't have it like this.
Or, @ask, you could give me back the admin rights so I can remove
them myself. And maybe add them to some other pool where I can help
but that doesn't have as much traffic as the CN pool.

@ask - or does anyone else have any other suggestion for how to solve this?

You may want to run a monitor (or several) yourself, so that China has a monitor too.

Suggestions 1 and 2 are not possible, as the zone is regulated by DNS.
The load and access times can’t be regulated, because not all DNS servers cache the same way and not all clients query DNS every time they need the time.

I don't have any servers inside China, so I can't do that.

I mean adding these functions to the ntppool dist code so that it becomes possible.

But they can use pool.ntp.org or pool2.ntp.org

I fail to see the problem on this matter.

The time will be correct and plenty of servers will help them out.

I think this is missing the point. While I guess all operators strive to serve clients as best as possible, or otherwise maintain good scores on their servers, I understand the main issue here is the high volume of traffic this causes on a few servers. Yes, if clients used the global zones rather than geographical zones, that issue would go away (assuming the pool does good load balancing, even if that often means handing out non-optimal servers).

However, the way people configure their clients is out of the control of server operators. So unless we find a way to reach out to all those anonymous clients that followed historic, or, depending on language considered, even current guidance on the pool’s web pages, and tell them to change their client configuration, that suggestion does not help to reduce the load on individual servers in the near-term.

Rather, that could only be helped by changes of the pool itself, e.g., by not sticking to the respective region implied by a zone name and rather automatically/implicitly map it to the enclosing zone, e.g., continent zone, or even global zone right away.

I understood @ask is working on something like that, I’d be curious to hear what the latest status is on that topic.

Ask could simply remap all country zones to the global pool; then they would all point to the same DNS entry and the problem would be solved.

As for overloading, have you tried to find out which specific IPs are causing this? Maybe just a few are overloading your system; then you could block them and see if it helps.

Obviously for the op to consider, but putting the sheer number of potential clients in the China zone in relation with the number of servers in that zone, it would not have been my initial thought that only a handful of clients might be responsible for the overload. But who knows.

I don’t think there is a shortage of ideas how to potentially address this. And if the pool were configured statically with a simple zone file, like is likely for the pool that you set up with your own servers, that might indeed be one of the simpler approaches.

But the pool is a much more complex animal, with the zones constantly being updated as servers come and go, or are being reconfigured, and with interactions with various other systems in the ecosystem, e.g., the statistics. While I hope it ultimately will mostly be an issue of configuration, the sheer complexity of the overall system, and its criticality, warrant a more measured approach, e.g., trying it out on the beta site first.

And as mentioned elsewhere, while I don’t recall anymore what exactly they are, and how critical to implement before such a change, I understood @ask such that he had a few more things to do in preparation of such a change, e.g. optimization of the matching algorithms.

Also, it might not be as “simple” as just transparently mapping geographical zones to global ones. I faintly remember people in this forum saying they were not ok with their servers automatically being included in the global zone, as they weren’t previously aware that this happens if the bandwidth setting crosses a certain threshold. Or in some jurisdictions, some clients are required to use time that traces back to their national official time. While it could be argued that someone with such requirements should perhaps not use the pool, as the pool’s zone concept likely does not sufficiently fulfil such legal requirements even today, such issues at least deserve some consideration before, e.g., “simply” transparently mapping regional zones to global ones.

So while this topic keeps coming up time and time again, and would thus hopefully be addressed sooner rather than later, I think @ask, as the one with the best understanding of the inner workings of the pool, and of the effort as well as the risks such a change may entail for such a crucial system, is the one who needs to manage and execute the process (not least because he is probably the only one who could actually change things in the system).

So I’d be curious as to where his current thoughts are on this matter, e.g. what obstacles would need to be addressed before being able to move forward, apart from him finding the time to actually do so.

We tried to add multiple servers to the CN zone a few years ago (see the thread “Adding servers to the China zone”).

But as you can all understand, things decay naturally over time. Those servers did help a lot, but only a few of them still work today, so the situation of the CN pool is getting bad again.

I’ve tried to add a few more new nodes to the pool. However, it seems that recently the pool only allows servers physically located near China to be added, e.g., my JP nodes are ok but my EU ones are not.

The reality is that bandwidth in Asia is generally far more expensive than in the EU and US, so it’s hard for individuals to keep spending money on pool servers physically located in the region. So once some servers get kicked out of the pool, other servers collapse quickly, not only because of the network flood, but also because of the server bills.

There are some big companies & orgs in China providing public NTP services, but 1) they have no reason to join the pool and validate their IPs, and 2) someone added their IPs to the pool long ago anyway, but the monitoring system doesn’t like them and keeps kicking them out of the pool, so they can hardly help.

I think it would be a very useful feature to allow node operators to opt in to certain “extra zones” directly in their management pages. Then those who wish to help could easily add their servers to the CN zone, as well as to other zones with very few servers, and opt out as they desire. This would also avoid the considerable effort needed from pool admins, as today adding a non-China server to the CN zone requires manual operations.

I want to point out that the problem of the CN zone is not an issue of service quality, but of service availability. Requiring the best network quality (towards clients in China) from servers in the zone does not really make sense, and has proven to only lead to collapse of the zone and unavailability for those clients. A server located on the other side of the earth with slightly worse quality is far more useful than a server with perfect network access that is not able to answer. Besides, I haven’t noticed any problem with an NTP server ~300ms away, which may only add a few ms of error/deviation compared to a nearby one.

I believe some other Asian zones share the same problem. I know the TW zone had this issue a few years ago, and I see there is a post about the VN zone facing exactly the same problem as the CN zone (“Vn.pool.ntp.org is down!”).

Back in 2019 I did reach out to @ask and arranged for several interested parties to host monitor sites inside China. The “Monitors in China” thread is still in the forum’s messages. We have tried multiple times to push this forward, but unfortunately we never heard back.

My server is still in the CN zone and handles over 40T of traffic each month. I also have to respond to Hetzner’s automated netscan reports from time to time, although all the reported traffic looks like just normal NTP.
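
For scale, a quick conversion of that monthly volume into an average bit rate. This is only a sketch: it assumes “40T” means 40 terabytes of 10**12 bytes each, and a 30-day month.

```python
# Average bit rate implied by a monthly traffic volume.
# Assumptions: "40T" means 40 terabytes (10**12 bytes each), 30-day month.
SECONDS_PER_MONTH = 30 * 24 * 3600


def average_mbit_per_s(terabytes_per_month: float) -> float:
    """Mean rate in Mbit/s if the volume were spread evenly over the month."""
    bits = terabytes_per_month * 10**12 * 8
    return bits / SECONDS_PER_MONTH / 1_000_000


print(f"{average_mbit_per_s(40):.0f} Mbit/s on average")
```

That comes to about 123 Mbit/s averaged over the whole month; since the load is not spread evenly over the day, the rate at peak hours would be higher than this average.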

The server is currently being kicked out regularly, and I am not sure what makes the situation different from the past (it was never kicked out back in the time of the old single-node monitoring system): pool.ntp.org: Statistics for 78.46.102.180

The chronyd process only takes 40% CPU at the time and server stats look normal too:

$ chronyc serverstats 
NTP packets received       : 788147126053
NTP packets dropped        : 0

Ah, I do have one thing to add:

My server is experiencing frequent attacks recently, according to Hetzner:

The attacking IPs are from all around China. This was very rare in the past. Maybe the whole CN pool is experiencing similar challenges, and that contributes to the problem.

Did you ever consider that Hetzner is wrong?

Maybe your server is too small and can’t handle the load.

Just checked your graph and almost all monitors say it’s unstable.

I had this same problem in Amsterdam and moved the VPS to Germany, now it’s problem free.

You do know your server is listed as serving the world and China? That will produce a big load, and Hetzner may not like you doing this.

Bas.

Yes of course. Well I’m just trying to help. It’s already a dedicated server with a 1 Gbps port and unlimited traffic. If setups like this cannot handle the CN pool, I doubt that it’s still feasible for any volunteering party. After 5 years of waiting, I don’t really know if continuing to put in more resources like this is still worth it.

Anyway, I’ve been configuring my own servers to use my own geodns’d ntp.felixc.at which resolves to the major Chinese corp-owned public ntp servers (with my own monitoring) for Chinese IPs and just CNAME to pool.ntp.org for the rest of world.

I’m not saying your setup isn’t enough. It may be that Hetzner doesn’t like what you do, and that could trigger their firewall to block your traffic.
You could easily see if your system is overloaded by checking other ports: if they fail too, then your system could be overloaded.
If they don’t fail, then a firewall is knocking port 123 offline.
Also, check ‘sudo nethogs’ and ‘top’ to see if your system is in trouble or not.
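
To check port 123 itself from a machine outside the datacenter, something like the following minimal probe could be used. It is a sketch using only the Python standard library: it sends one plain client-mode NTP request over IPv4 and reports whether a reply came back. The function names are made up for this example.

```python
import socket
import struct

NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def build_request() -> bytes:
    """48-byte client request: LI=0, VN=4, Mode=3 (client), rest zero."""
    first_byte = (0 << 6) | (4 << 3) | 3
    return bytes([first_byte]) + bytes(47)


def parse_response(data: bytes) -> dict:
    """Extract mode, stratum and transmit timestamp from a server reply."""
    if len(data) < 48:
        raise ValueError("short NTP packet")
    mode = data[0] & 0x07
    stratum = data[1]
    seconds, fraction = struct.unpack("!II", data[40:48])
    unix_time = seconds - NTP_EPOCH_OFFSET + fraction / 2**32
    return {"mode": mode, "stratum": stratum, "transmit_unix": unix_time}


def probe(host: str, port: int = 123, timeout: float = 2.0):
    """Send one request; return the parsed reply, or None if nothing came back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_request(), (host, port))
            data, _ = sock.recvfrom(512)
        except OSError:  # timeouts and resolver errors are OSError subclasses
            return None
    return parse_response(data)
```

Calling `probe("your.server.example")` (a placeholder name) repeatedly during the bad hours, next to a check of some other port on the same machine, would show whether only NTP goes silent. A single lost reply proves nothing, since UDP packets get dropped now and then anyway.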

I doubt it’s your server, as it doesn’t happen all the time AND Hetzner is warning you.
I would talk to the datacenter and ask them if they block it.

If they do, you know where the problem is.

My datacenter confirmed they blocked me in Amsterdam, so I moved the server to a datacenter where they don’t block. Simple as that. :rofl:

You could ask them to whitelist your server for port 123; then it couldn’t be blocked on high loads/traffic.

The total traffic of the CN zone is far, far beyond what any single server could handle, and any sane ISP could easily label it as DDoS. This is exactly the issue this thread is about, and it has nothing to do with whether any specific server is powerful enough or configured optimally.

That just doesn’t matter at all when a whole country is querying your server; you always get flooded anyway. The only way to solve this is to reduce the traffic sent to single servers in the zone. While the total traffic is not something the pool can control (and it’s growing over time), the only reasonable way is to increase the number of servers in the zone, and that’s what we’re talking about.

I agree with you, as I would never set my server to serve such a big country on my own.
Especially if you are the only one.

Also, I do not understand how a country with 1 billion people can not run their own time servers.
It’s my belief they do run them, but probably not published or accessible outside their great firewall.

I mean, China has manufacturing plants of companies such as MSI, Foxconn, Asus, Apple; surely they run time servers.
Won’t be that hard to do for them.

Technology on large scale is China, yet a time server is impossible to run? Nah, don’t believe it.

Also, China has a different calendar, maybe they run their own system? I do not know.
But such a big country without time-servers, sorry, don’t buy it.

Bas.

We do. They are kicked out by ntppool because of network connectivity issues from outside China.

Welcome to the forum, @lilydjwg!

Hmm, that sounds like an easy answer, but I am not sure it holds up. Taking the number of servers in other regions per users in those regions, extrapolating from there to how many servers there should be in the China zone given the number of users in that zone, and comparing that to the number of actual servers in the zone, the discrepancy is too big to blame it only on connectivity issues from monitors outside the zone.

Similarly, why aren’t there any monitors in the China zone? Again, extrapolating from how many monitors per users there are in other regions, there should be a few in the China zone, but there aren’t.

Instead, I believe that, for whatever reasons that have led to this, we now have a chicken and egg problem: There are too few servers in the zone, so those that are in it get huge amounts of traffic. So any servers that are getting added to the zone immediately get huge amounts of traffic, in turn preventing/discouraging new servers being added to the zone. This thread was started because the high costs associated with that high load are unbearable for a volunteer, not primarily because of the bad scores. The monitoring system does not cause this high load or associated cost. It maybe adds a bit to the issue, as it causes fluctuations in the traffic on servers that keep reinforcing themselves and the issues for the server operators. But the root cause is the lack of servers, which in turn prevents adding new servers gradually.
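
The chain reaction described here can be illustrated with a toy model. This is a deliberately simplified sketch with made-up numbers: load is split evenly among the active servers, and every server whose share exceeds its capacity drops out before the next round. The real pool weights servers by their bandwidth setting and removes them via monitor scores, so this only shows the shape of the feedback loop.

```python
def simulate_cascade(capacities: dict, total_load: float) -> list:
    """Return the surviving server names after each round.

    Simplified model (an assumption, not the pool's real behaviour):
    load is split evenly among active servers, and every server whose
    share exceeds its capacity drops out before the next round.
    """
    active = dict(capacities)
    history = [sorted(active)]
    while active:
        per_server = total_load / len(active)
        overloaded = [name for name, cap in active.items() if cap < per_server]
        if not overloaded:
            break
        for name in overloaded:
            del active[name]
        history.append(sorted(active))
    return history


# Hypothetical capacities and loads, in the same arbitrary unit.
servers = {"a": 100, "b": 200, "c": 300, "d": 1000}
for load in (350, 800, 1200):
    print(load, simulate_cascade(servers, load))
```

In this model a load of 350 leaves all four servers in place, 800 strips the zone down to the single largest server, and 1200 empties it entirely. In the model, adding one more small server at the higher loads only gives the cascade one more server to knock out.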

So short of adding at once a sufficient amount of servers to that zone that can immediately absorb the traffic load originating from the zone without going into overload right away, two options keep getting discussed in this forum:

  • Suggest to people to use global zones instead of regional ones so the load gets distributed among a larger pool of servers that way. Word-of-mouth might not be sufficient to reach a large enough audience. Updating the instructions on the web site might help in the mid term at least, steering new users to a better configuration.
  • Update the pool implementation so that it automatically spreads the load of the regional zones across larger regions. Discussion and implementation will take time, but once implemented, the effect should be noticeable in the short term.

Note that the China zone is not alone in having too few servers for the number of users, other zones have similar issues. Except maybe that the magnitude of the discrepancy is unique simply due to the sheer number of potential users in the zone.

And it certainly wouldn’t hurt to have one or more monitors in the China zone. But that needs to be managed by people with a presence in that zone. Obviously, people who don’t have a presence there cannot set up and run monitors. So it needs volunteers from the China zone willing to run monitors inside that zone. (Though as mentioned before, I personally am currently not convinced just having monitors inside the China zone, or in other zones with similar issues, would on their own solve the issues discussed in this and other similar threads.)

We do have quite a few servers in mainland China, but they are not usable from the pool, e.g. from the servers I’m currently using (from ntp.felixc.at or cn.ntp.org.cn pool)

  • for this tencent server all scorers are far away and there are a few timeouts in the log.
  • for this aliyun server it scores ok but the page says it’s unavailable.
  • and this one (https://www.ntppool.org/scores/58.220.133.132) is scheduled for deletion.

Also some people want to help by adding servers to the cn pool but are rejected because their servers are far from China.

felixonmars has already offered to run a monitor, but there has been no further communication.

One problem perhaps unique to mainland China is that the network connectivity is usually not good from outside. Mainland China services that work fine inside may be slow to reach from, e.g., servers in Germany. This is because of the very small per-capita international bandwidth, and perhaps also the ICP license policy that prevents global CDNs from operating in mainland China.

There are many high-bandwidth servers in China (mainly provided by large companies such as Tencent or Alibaba, as well as national institutions such as the National Time Service Center and the National Academy of Metrology) that have been kicked out due to poor access to overseas monitors and low scores. But in fact, these servers work well and the service quality in mainland China is very good, and these servers are supposed to be the main force in processing requests.

This is a general statement that would be helpful to break down, i.e., what exactly do you mean by “not usable from the pool”? That they aren’t in the pool, that they get dropped from the DNS rotation of the pool, that they don’t respond well to client requests? See the following for some aspects I mean.

This indeed doesn’t look too good. But please note that in some cases, “far away” servers yield better scores than ones that I would expect to be “closer”, such as the ones in Hongkong, Singapore, or Taiwan. But admittedly, those are likely still “outside” the Chinese domestic network in terms of network distance, despite relative geographical proximity.
Regardless, the pool could do a better job in more aggressively promoting monitors with higher scores into the “active” monitor role. I’ve had similar issues with some of my own servers.

This server not being available is not the fault of the pool, but the decision of the person who added the server to the pool. It is the owners of a server who need to allow for a server to be added to the pool, and should ideally do so themselves (the verification mechanism will enforce that going forward). So just because there generally may be more servers in the China region, that doesn’t mean they are suitable to be added to the pool. That is entirely up to the owner/operator of each server, and a server certainly shouldn’t be added to the pool without the server operator’s consent.

Actually, it is already deleted. And again, that is entirely the decision of the person who added the server to the pool. There is nothing the pool can do if someone decides to take their server out of the pool.

I am obviously not directly involved in those cases, so cannot comment first hand. But it is not my impression that servers would be “rejected”, and that because they are “far from China”. Rather, adding servers to zones other than the one auto-detected for them (including cases where the auto-detection was wrong) is a manual process, and that is a huge bottleneck. See many other issues pending for a long time as well, e.g., vendor zones.

Again, I cannot comment first hand. I was just observing that there are servers in many places, sometimes multiple ones in rather small zones. So maybe the communication just fell between the cracks, as hinted at above happens in many, many cases. I could imagine multiple people in this forum who are already operating monitors being willing to help setting one up in the China zone, but at the end of the day, @ask will likely need to do something to add it to the system. No guarantee, but the more has been prepared, the less he’d need to do, the higher I’d expect the likelihood it could succeed.

I am not discounting that the monitoring from outside the zone is a contributing factor, but considering this server mentioned above, that doesn’t seem to be the general case. I.e., some servers seem to get good enough scores despite being monitored from outside the zone. And note as well that this server, not being in the pool, may also be seeing less traffic, which is again no proof, but a hint that load on the servers may be more of a factor than the monitoring.

That is obviously nothing I have recent first-hand experience with. But then, I don’t fully understand the issue, because then there should be enough capacity to handle all requests, even as some servers maybe move out of and back into the pool periodically. But seeing the issue that triggered this thread, I would say that there maybe isn’t enough capacity if one server dropping out of the zone can cause such a chain reaction of then further servers dropping out, and the situation only recovering when the load drops during the night.

Ok, that is a big difference to other zones, which do not rely on a few big servers, but where there are hundreds of small servers as well which have no issue operating in the pool without getting overloaded. This reliance on a few big servers, and the instability it causes if one of them temporarily drops out for whatever reason, suggests that the situation is not a healthy one. Even manually adding further servers from outside the zone would only ever be a temporary patch, as suggested by this topic resurfacing periodically, and that would hold even if manually adding servers were easier, e.g., if operators could add their own servers to other zones without need for admin intervention. So I still think that, beyond adding monitors in the China zone to counter at least to some degree the high reliance on a few big servers, broadening the server base by dissolving the regional zones would likely be the only lasting fix to the issue. That is being considered anyhow, because it will be beneficial in other zones as well.