Rofl, had to post this irony. I ran an NTP offset test, and that $25/year HK OpenVZ VPS showed up in the millionths of a second: an offset of 0.000003. Mostly a fluke, I'm sure.
Aha, I see my servers listed in @studentmain's testing results, and with quite a high loss rate. But that's actually intended.
I have implemented rate limiting on my server, in iptables rather than ntp.conf, to make sure my server gets removed from the pool periodically, so that I won't burn through my entire traffic quota and lose my other services in the middle of the month (or have to pay a thousand-dollar bill). I see @avij here is having a similar problem and using a similar workaround as well.
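For anyone wanting to do the same, a rate limit like that can be expressed purely in iptables. This is only an illustrative sketch; the 2000 pkt/s and burst figures are placeholders I made up, not the actual configuration described above:

```sh
# Illustrative iptables rules (numbers are placeholders, not the
# poster's actual settings): accept NTP up to 2000 packets/s with a
# burst allowance of 5000, and silently drop everything above that
# before it ever reaches ntpd.
iptables -A INPUT -p udp --dport 123 -m limit \
         --limit 2000/second --limit-burst 5000 -j ACCEPT
iptables -A INPUT -p udp --dport 123 -j DROP
```

Because the monitoring station's probes get dropped along with everything else once the limit kicks in, the server's score falls and it rotates out of the pool, which is exactly the intended effect here.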
Actually, I quite agree with @LeoBodnar that servers in the .cn pool should only be removed from the DNS response on total failure, not on a high packet drop rate. The root problem for the .cn pool is that we have insufficient resources to handle all the requests from clients in this zone; packet loss is simply impossible to prevent.
It's clearly better to have all clients suffer a relatively low loss rate on requests to the pool, with multiple servers staying in the pool to serve all the traffic (dropping some packets), than to have servers whose inbound traffic exceeds their capacity kicked out quickly, leaving only a very few serving all the traffic and every client suffering a high loss rate.
As I live in China myself, I know the pain on the client side very well. I have stratum-1 servers set up for my own use and for some of my servers inside China. But I'm just not able to offer those servers to the pool, because I can't afford the bandwidth and traffic bills.
This is a chicken-and-egg situation; the only way to solve it is to provide enough capacity to handle the entire zone, so that new servers can be added without being "DDoSed" to death. We tried that (see the OP), but it just didn't work out: when the pool doesn't really have enough resources, servers with packet loss (again, unavoidable in this circumstance) get kicked, the higher traffic leads to a higher loss rate on the remaining servers, then more servers are kicked… and the whole zone collapses.
So how about we keep a constant number of servers in the pool regardless of dropped packets (only removing those that don't answer any probes for a long time, while keeping ones that occasionally fail two or three probes), and gradually add more servers to handle the requests, until the total capacity exceeds the total demand and servers in the pool no longer need to run at full bandwidth, as in other zones like .us or .eu?
I’m afraid there is no simple answer. Connectivity is a complex thing, and packet loss to a monitoring station doesn’t always mean a traffic issue.
The latter would work if the other server operators are happy about the unusual flow of traffic. I have already seen quite a few complaints here about too much Chinese traffic hitting the sg, hk, or tw pools. It might work better if the traffic were offloaded to a "super spare" zone like europe, but I doubt the server admins there would like the idea…
The former, however, will also break when the server is truly overloaded. I once tried to add my tiny machine in China, and only managed to connect to it again some time after it dropped out of the pool. But I could understand it as an opt-in option for a server admin. The hard part would be proving that the server is really working. Of course, more hands would be needed this way…
I also want to add that I am a bit unsure whether the current net speed weight setting is working as intended. From the descriptive text I would expect a server set to 1 Mbps to get 1/1000 of the traffic of one set to 1000 Mbps. If that assumption were true, I could already add all my tiny boxes right now. But from the data posted above, a 384 Kbps node still serves 60 Mbps while my machine set to 1000 Mbps got 140 Mbps. The ratio doesn't really make a big difference for all those smaller boxes.
If the traffic is scheduled by DNS rotation (that's how I understand it), shouldn't a box set to 1 Mbps appear only once per 1000 resolves compared to one set to 1000 Mbps? I would expect this even if we had only 2 hosts left in the zone.
Please do correct me if I am wrong!
We had to temporarily disable China traffic from our server 184.108.40.206
Bandwidth is not a problem for us, but the high volume of UDP packets overloads our firewall. At the moment we have no way to bypass the firewall. If we can't find a reasonable solution this week to get the packets through without overloading the firewall, I'll post for server removal from the China zone.
This only works when there are tens of servers available. The current DNS rotation always presents 4 servers per query, unless the total available server count falls below 4. If we have only 4 servers available at that point, say one at 384 kbps and three at 1000 Mbps, they will all be presented regardless of their registered bandwidth. The 384 kbps one will then receive unexpectedly high traffic (about 1953× its weighted share) and fail.
Returning 4 servers per response is never a problem in healthy zones, but for zones with only a single-digit number of servers available, it effectively breaks the bandwidth setting.
Yes, and I am proposing to change it to prioritize the bandwidth weighting over always presenting 4 servers. I think this would resolve the issues of other broken zones as well.
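To illustrate what that weighting would mean, here is a toy simulation (the server names and netspeeds are made up for illustration; this is not the pool's actual selection code):

```python
import random

# Toy zone: one tiny box and three big ones, netspeed in kbps.
servers = [("tiny", 384), ("big1", 1_000_000),
           ("big2", 1_000_000), ("big3", 1_000_000)]

def weighted_draws(servers, trials=100_000):
    """Count how often each server would be handed out if every DNS
    response slot were chosen proportionally to declared netspeed,
    instead of a flat 4-server rotation."""
    names = [n for n, _ in servers]
    weights = [w for _, w in servers]
    counts = {n: 0 for n in names}
    for _ in range(trials):
        counts[random.choices(names, weights=weights)[0]] += 1
    return counts

counts = weighted_draws(servers)
# "tiny" should land in roughly 384 / 3_000_384 of the draws (~0.01%),
# rather than appearing in every single 4-server response.
```

Under that scheme the 384 kbps box would almost never be handed out while big servers are available, instead of taking a full quarter of the load the moment the zone shrinks to 4 servers.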
One issue that is rarely mentioned is the spike in traffic every time your IP is rotated into the DNS, even for a short time. It is usually 20-30 Mbps, and I have observed much higher values. No normal equipment behind a NAT device etc. can handle it, and despite the many NTP servers added yesterday, there are only 3 active servers right now.
If we don’t care about dropped packets, this would be “easy” to solve.
Many ISPs don't charge for incoming bandwidth, so just set up a few $5 virtual machines that swallow NTP packets at gigabit speeds but don't send anything back. This would also need a cronjob at the pool monitoring server to "update servers set score=20 where class='notmonitored';" or something to that effect. With a few of these in the pool, the traffic for the other servers should be tolerable.
BUT I don’t think doing that would be smart. I would rather try to use the capacity that we have elsewhere in the global zone.
To summarize what I believe are the challenges:
- There is a significant amount of traffic to the cn zone
- At least 150Mbit, probably more than 500Mbit based on reports in this thread
- There is significant packet loss international -> China
- This affects traffic from the monitoring server to NTP servers in China
- There is significant packet loss China -> international
- This affects client’s ability to reach international servers
- There is currently far more client traffic than NTP server capacity
- Typical VMs bought by individuals have 1TB/mo~20TB/mo bandwidth included
- This is around 3Mbit~60Mbit
- To handle this load with individual’s VMs, it would take around 170 of them
- Handling more than 10kpps (~7Mbit) of NTP requires special software or hardware
- For instance: Getting beyond 10k qps?
- Firewalls also need to be considered carefully at speeds higher than this
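The quota arithmetic in that list can be checked in a few lines (assumptions: a 30-day month, and the ~500 Mbit zone-wide load estimated earlier in the thread):

```python
# Back-of-the-envelope check of the VM quota figures above.
SECONDS_PER_MONTH = 30 * 24 * 3600  # ~2.6 million seconds

def quota_to_mbit(tb_per_month):
    """Convert a monthly transfer quota (TB) to a sustained rate in Mbit/s."""
    bits = tb_per_month * 1e12 * 8
    return bits / SECONDS_PER_MONTH / 1e6

print(round(quota_to_mbit(1), 1))     # 1 TB/mo sustains about 3.1 Mbit/s
print(round(quota_to_mbit(20), 1))    # 20 TB/mo sustains about 61.7 Mbit/s
print(round(500 / quota_to_mbit(1)))  # ~162 small VMs for a 500 Mbit zone
```

That lands close to the "around 170" figure, with a little headroom for uneven load.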
So, with all those challenges, what can we do?
- Drop traffic when client load is greater than NTP server capacity
- This could be done as an empty DNS response
- This has its own problems
- Have a much lower score threshold for the cn zone
- This would allow servers to stay in the pool longer in the face of packet loss and add some additional capacity
- This would also make it harder for servers to signal they wanted to leave due to billing or other reasons
- Recruit a mix of high bandwidth and low bandwidth servers to handle the load
- That’s how this thread started
- The low bandwidth users would want some sort of way to limit their max bandwidth due to billing
- Servers in China have a hard time staying in the pool due to issue #2
- Setup a monitoring server in China
- Who could provide a place for this to run?
What I would like to see is a combination of all four. This would require work and co-ordination to complete, and that is a problem on its own.
Short term: I think a good start would be to adjust the scores and penalty system, so you are not punished so hard for a missing packet now and then. Long term, look at a local monitoring server; the latter was also the plan last year.
In a local LUG discussion group, someone is willing to provide a server with direct access to a stratum-1 time source for a monitoring station. But it's unclear what the pool admins require a monitoring station to be equipped with, and they may not fully meet (or be willing to meet) those requirements.
I believe @felixonmars knows them better than me
I believe our approaches differ in that when I activate throttling, only queries from the monitoring server get dropped. This means I’m still answering queries from other clients even when the throttling is activated, but because my score drops below 10, I’ll get fewer requests.
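For contrast with the blanket rate limit discussed earlier, that monitor-only throttling could look something like this in iptables. This is only a sketch; 192.0.2.10 is a documentation-range placeholder standing in for the monitoring station's real address:

```sh
# Placeholder address; the real monitoring station IP would go here.
# When throttling is switched on, only the monitor's probes are
# dropped; ordinary clients are still answered as usual.
iptables -A INPUT -s 192.0.2.10 -p udp --dport 123 -j DROP
```

The score then falls below 10 and DNS stops sending new clients, without any real client ever seeing a dropped reply.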
I’ve been out for a while so it will take me some time to catch up but I wanted to say thanks to whoever added a few of my servers to the CN pool
To handle this load with individual’s VMs, it would take around 170 of them
The current DNS rotation always presents 4 servers in a query, unless the total available server count falls below 4
Put these together, and there's a possible solution: raise the DNS rotation minimum for the CN zone to 170. If there are fewer than 170 servers in the CN pool and a query comes in from CN, combine the CN pool with some pool that has excess capacity (*), so the CN nodes are 100% loaded instead of 5000% or 10000% loaded relative to their chosen netspeed. If people add a few servers capable of more than 10 kpps, that's great; it brings down the load for the smaller ones. But it might be better to keep the raw IP count high to keep the zone robust against major outages.
The minimum number of DNS entries should be different for each zone. 4 is probably fine for most zones in the world because most countries are relatively small, but the big ones like CN (or IN or US) will need much higher minimums. Even better would be to estimate both capacity (from netspeed settings) and request load (from DNS query data), and use that as the threshold rather than a raw IP count; however, a simple hardcoded "nodes_in_zone > 170" threshold might help bootstrap CN in the short term. This might even be a permanent solution: we don't expect the load from CN to go down in the foreseeable future, so why would we remove that code? If the CN zone gets big and robust enough, we won't need to raise the hardcoded limit either; the pool would naturally never be too small again.
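The capacity-vs-load threshold could be as small as a one-line function. A toy version (the load and per-server figures below are the thread's rough estimates, not measurements, and real servers are of course not identical):

```python
import math

def min_rotation_size(total_load_mbit, per_server_mbit, floor=4):
    """Smallest number of (identical, for simplicity) servers a zone
    needs in rotation so that none exceeds its declared netspeed."""
    return max(floor, math.ceil(total_load_mbit / per_server_mbit))

print(min_rotation_size(500, 3))    # ~167 small VMs for a 500 Mbit zone
print(min_rotation_size(500, 100))  # a handful of big servers also works
print(min_rotation_size(10, 100))   # healthy zones stay at the floor of 4
```

A healthy zone collapses back to the usual minimum of 4, so this only changes behavior where demand actually outstrips capacity.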
In theory, US needs a minimum number of servers much greater than 4 too, but US has always had hundreds, so this problem never comes up there.
China is a big place. It should be easy to get 170 volunteers to run small VM servers there, but before that can happen, the NTP pool DNS service has to get really, really good at never, ever hitting them with a super-sized DDoS cannon. I put my NTP servers in the US and CA pools because I was already running them for my own clients, so it cost me nothing to add them to the NTP pool and let other people use them too. If I ever got a bandwidth bill or a service outage from my participation in the NTP pool, I’d seriously reconsider my contribution, and probably drop at least some nodes out of the pool entirely. This kind of casual, zero-cost, opportunistic contribution that happens in other parts of the world should be possible in China too, and fixing the pool’s DNS server behavior when demand exceeds supply is a big part of making that happen.
(*) where’s the excess capacity? It’s tempting to say “hurl the packets at Europe/the rest of Asia/the world”, but if inter-zone packets are just going to get dropped, then it’s better to just blackhole them rather than impose costs on already stressed international network operators. In that case, the “larger pool” is just enough blackhole IPs to make the CN pool big enough that netspeed matches reality. Are there anycast blackhole IPs that could be used? 127.0.0.1?
You can remove our server 220.127.116.11 from china pool.
By the way, does anyone have tips for optimizing the number of requests you are able to handle? My server is running chrony on Ubuntu because I could not get ntpd to work, but I could try switching if it would be faster.
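Not the original poster, but one chrony.conf directive that is commonly suggested for heavily loaded servers is below. Treat this as a starting point to benchmark, not tuned advice:

```
# chrony.conf fragment (illustrative)

# Stop keeping per-client state; saves memory and CPU under heavy
# query load, at the cost of chrony's built-in per-client "ratelimit"
# feature, which needs that state to work.
noclientlog
```

If you want to keep per-client logging (e.g. to use ratelimit), the `clientloglimit` directive caps its memory use instead.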
please add to CN (remove from FR)
Some of them have score -100. I added two IPv4 servers. Thanks.
The problem with the .CN pool isn't so much bandwidth as packets per second; usually the hardware can't keep up with the packet rate. The peak to my servers has been 300 kpps. The zone is very underserved; we need to get back to around 50 servers for the .CN pool for it to be stable.
For plan B, I would suggest changing the monitoring score for the .CN pool so that you would need ten "bad scores" instead of one to lose a point in the monitoring. Then far fewer servers would be kicked out, and soon we would hopefully have around 50 servers, which is the bare minimum for the .CN pool. I guess that would be a quite easy fix that could be implemented right now and doesn't cost anyone money. What do you think about that, @ask?
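A rough sketch of what that plan B rule would mean. This is assumed semantics, not the pool's actual scoring code, and the constants come from the suggestion above, not from real configuration:

```python
def update_score(score, probe_ok, fail_count, fails_per_point=10):
    """Toy scoring rule: gain a point on a good probe (capped at 20),
    but lose a point only after `fails_per_point` consecutive failures.
    Returns the new (score, fail_count) pair."""
    if probe_ok:
        return min(score + 1, 20), 0
    fail_count += 1
    if fail_count >= fails_per_point:
        return score - 1, 0
    return score, fail_count

# Nine failed probes in a row: score unchanged, failures accumulate.
score, fails = 15, 0
for _ in range(9):
    score, fails = update_score(score, False, fails)
# The tenth consecutive failure finally costs a point.
score, fails = update_score(score, False, fails)
```

A server with occasional loss to the monitor would then stay in the pool, while one that is truly dead keeps failing probes and still drains its score, just ten times more slowly.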