The issue of NTP request load exceeding the available bandwidth

This is just a kind reminder, in case someone is not closely monitoring their server, to check whether everything is still working as intended. It is sent automatically when the score drops below a certain value and/or stays too low for too long. Note the part where it says “If you have resolved the problem or if it has resolved itself”. In your case, it is “expected” that this can happen, and as it is load-related, the “it has resolved itself” part is the relevant one. I.e., when it is load-related, it will typically “resolve itself” sooner or later. That is the control loop I mentioned previously:

  • score crosses 10, pool starts adding load to the server
  • load on server or infrastructure gets too high, packets start getting dropped, score falls
  • when score drops below 10, pool stops directing new traffic to the server
  • existing traffic goes down slowly as DNS entries pointing to the server expire, and clients start using other servers returned by pool DNS
  • as load subsides, fewer packets get dropped, and score starts increasing again
  • when score crosses 10, cycle begins anew

This keeps the load, on a very rough average, at a level that the server can handle. The latency of the feedback loop determines how high and how low the score goes, and how long it stays above/below 10.

The higher the netspeed, the more pronounced the amplitude of the cycle. A high netspeed means that when the score crosses 10, the load very quickly rises to high values before the dropping score reduces it again. And it means the load will stay high for longer, i.e., the score will drop to very low values.
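To illustrate, here is a toy simulation of that loop (not the pool’s actual algorithm; the thresholds, time constants, and the “netspeed_gain” parameter are made up for illustration). Running it with a larger gain produces deeper score dips and a longer recovery, i.e., a more pronounced cycle:

```python
# Toy model of the score/load feedback loop described above.
# All constants are illustrative, not the pool's real parameters.

def simulate(netspeed_gain, dns_ttl_steps=10, steps=200):
    score, load, capacity = 20.0, 0.0, 100.0
    history = []
    for t in range(steps):
        in_pool = score >= 10            # pool only hands out the IP while score >= 10
        target = netspeed_gain if in_pool else 0.0
        load += (target - load) / dns_ttl_steps   # traffic follows DNS decisions slowly
        if load > capacity:              # dropped packets push the score down
            score -= 0.5 * (load - capacity) / capacity
        else:                            # clean answers let it recover
            score = min(20.0, score + 0.2)
        history.append((t, round(score, 1), round(load)))
    return history

# Compare e.g. simulate(netspeed_gain=120) with simulate(netspeed_gain=300):
# the higher "netspeed" overshoots the capacity more and stays below 10 longer.
for t, score, load in simulate(netspeed_gain=300)[:30]:
    print(t, score, load)
```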

That is why you are seeing those very pronounced cycles with your server, and why you got the email from Ask.

Reducing the netspeed should dampen the cycle so that the score doesn’t drop too low anymore.

I think I understood. I assume that if they offer 20 Mbit/s, then they hopefully should support that. My guess, though, is that with your high “netspeed” setting, you are exceeding those 20 Mbit/s by far. This excess is what Alibaba may handle better. But the underlying issue is that the actual load is too high in either case.

So the suggestion is to go to “monitoring only” first, and then to 512 kbit as a first step. With the current netspeed, the whole system is completely out of control.

Usually, the CPU is the last bottleneck. Other components typically hit their limits way before the CPU.

Yes, that would be my guess.

Yes, that may be the case. But at your current netspeed setting, it is not possible to investigate, because the traffic is likely way beyond what you contracted.

That would be very helpful. E.g., what peak bitrate results from a “512 kbit” netspeed setting. From there, one can extrapolate upwards, e.g., to a 3 Mbit setting. But more importantly, it would help determine the lower bound that a server needs to be able to handle, i.e., the minimum bandwidth/packet rate that the server plus infrastructure should be able to cope with.
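A rough back-of-the-envelope sketch of that extrapolation, assuming an NTP request/response is about 90 bytes on the wire (48 B NTP + 8 B UDP + 20 B IPv4 + ~14 B Ethernet) and that traffic scales roughly linearly with the netspeed setting; the 0.8 Mbit/s measured peak below is a placeholder, not a real number:

```python
# Extrapolate from one measured peak to other netspeed settings.
# BYTES_PER_PACKET approximates an NTP packet on the wire (IPv4, no extensions).
BYTES_PER_PACKET = 90

def describe(mbit_per_s):
    pps = mbit_per_s * 1e6 / 8 / BYTES_PER_PACKET
    return f"{mbit_per_s:.2f} Mbit/s ≈ {pps:,.0f} packets/s per direction"

measured_peak_mbit = 0.8      # placeholder: peak seen at the "512 kbit" setting
measured_setting_kbit = 512

for setting_kbit in (512, 1000, 3000, 10000):
    est = measured_peak_mbit * setting_kbit / measured_setting_kbit
    print(f"netspeed {setting_kbit:>5} kbit -> ~{describe(est)}")
```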

But also the variation throughout the day. E.g., maybe the peak is reached only during one or two hours per day (or even a few more would still be ok). Then the score might drop during those periods, but be fine the rest of the day. Or it might still drop throughout the day, but the ups and downs would be acceptable on average.
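Seeing that would only require binning whatever traffic samples are available by hour of day, e.g. along these lines (the (timestamp, Mbit/s) sample format is an assumption about how the data was collected):

```python
# Group (unix_timestamp, mbit_per_s) samples by hour of day to show the daily profile.
from collections import defaultdict
from datetime import datetime, timezone

def hourly_profile(samples):
    buckets = defaultdict(list)
    for ts, mbit in samples:
        buckets[datetime.fromtimestamp(ts, tz=timezone.utc).hour].append(mbit)
    return {h: (sum(v) / len(v), max(v)) for h, v in sorted(buckets.items())}

# for hour, (avg, peak) in hourly_profile(samples).items():
#     print(f"{hour:02d}:00 UTC  avg {avg:.2f}  peak {peak:.2f} Mbit/s")
```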

Zones like China are challenging, because the current unfavorable ratio between clients and servers makes it difficult to add servers. But I think part of the problem may also be the expectation, too high under the current circumstances, that the “scores” should be good, especially never below 10, and people removing their servers right away when there is such trouble (focusing primarily on the score). If more of those stayed in the pool, that might help build the critical mass, slowly, very slowly.

But maybe the imbalance is just too big to be overcome by many small servers. And in the above, I am focusing on the scores only. If there are other issues, e.g., higher cost from too much traffic, negative impact on other (likely more important) services, frequent blocking due to protection mechanisms or traffic quotas kicking in, or the server being too often and/or too low in the scoring (where there is no hard definition of “too”), …, those may obviously be reasons why it doesn’t make sense to stay in the pool after all.

But without good data, e.g., how much traffic is generated at a 512 kbit netspeed setting, and the traffic profile throughout an entire day, it will be difficult to know.
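For collecting that data, something as simple as periodically sampling the interface counters from /sys/class/net (on Linux) would already give a usable profile. Note this counts all traffic on the interface, so it is only a rough proxy on a host that mostly serves NTP, and the interface name is an assumption to adapt:

```python
# Log outbound Mbit/s once a minute from the Linux interface counters.
import time

IFACE = "eth0"   # adjust to the actual interface

def tx_bytes():
    with open(f"/sys/class/net/{IFACE}/statistics/tx_bytes") as f:
        return int(f.read())

prev_b, prev_t = tx_bytes(), time.time()
while True:
    time.sleep(60)
    cur_b, cur_t = tx_bytes(), time.time()
    mbit = (cur_b - prev_b) * 8 / (cur_t - prev_t) / 1e6
    print(f"{int(cur_t)} {mbit:.3f}")   # "unix_timestamp Mbit/s", redirect to a file
    prev_b, prev_t = cur_b, cur_t
```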

E.g., one strategy could be for many, many people to set up servers, but keep them in monitoring-only mode. And then, when a good number of prospective servers has been reached, enable them all simultaneously (as well as that can be coordinated), so that it is not a single server that gets too high a share at once, but the overall capacity added at once is large enough that no single new server is overloaded. Kind of like the strategy previously used to bootstrap the zone by adding servers from outside the zone all at once, except that now the servers are from within the zone, and the concerted action is to switch out of “monitoring-only” at the same time.

Again, knowing the traffic profile to be expected would help figure out how many new servers would be needed, and whether such an approach would even be realistic.