NTP request traffic exceeding my server's bandwidth

Hello everyone,

I am based in China, where the cost of server bandwidth is quite high, often charged per Mbps. As a result, my server is limited to just 3 Mbps for both upload and download. Recently, I joined the NTP pool project and configured a connection speed limit of 512 Kbps. However, I have noticed that the bandwidth usage on my server can still be very high, sometimes even surpassing 180 Mbps, which is about 60 times my total available bandwidth!

This excessive bandwidth usage is severely impacting the performance of my other services. I would like to know if there are any solutions or best practices to better control the bandwidth used by the NTP Pool.

Any advice or suggestions would be greatly appreciated.

Thank you!

This has forced me to pause (or even consider leaving the project). The peak request rate from NTP has reached hundreds of Mbps, and even the average request rate almost saturates my entire available bandwidth.


:frowning:

This is a likely result of being a server in an under-served zone. The bandwidth setting, as mentioned in the documentation, is a relative measure of the share of requests directed to your server, not an absolute limit. My system is in the US pool, is set to 6 Mbps, and rarely sees the actual bandwidth climb above that. At the moment it is hovering around 105 Kbps.

I suspect that you will not have much success in the pool as it is currently configured.

Welcome to the forum, @summer76527, and thanks for considering adding your server to the pool!

As a starting point on the topic of what some people call "under-served zones", see, e.g., this post, and the references therein.

As one of my servers is affected by this as well, albeit to a much lesser extent, I just suggested an update to the pool's code that I've been mulling for a while now. It does not solve the underlying issue, but could help servers like yours or mine join the pool more easily (but no guarantees, obviously).

Thank you for your reply.
I have read several of the posts you mentioned (Network: i/o timeout - Server operators - NTP Pool Project, CN pool collapse a few hours every day - Server operators - NTP Pool Project), and I have a general understanding of the issue.

In fact, some large internet companies (such as Alibaba Cloud, Tencent Cloud) or the China National Time Service Center, etc., provide NTP services. Unfortunately, in China, there are a large number of smart devices (possibly in the hundreds of millions!) that come with the default NTP server set to pool.ntp.org. The NTP servers of large internet companies, due to their low scores, cannot join the pool to share the traffic load. As a result, only a few hundred servers are available in the China region, which have to handle the majority of the country's NTP requests. This is undoubtedly a daunting situation.

Based on this, is there somewhere I can submit a request to the administrators to include some of the NTP servers from large enterprises in the pool permanently (i.e., not removed due to low scores), so that small servers like mine only have to handle a minor share of the requests?

P.S.: In China, commercial bandwidth is very expensive. For example, with Alibaba Cloud, a 10 Mbps dedicated IP costs ¥525 (about $73) per month, and 100 Mbps bandwidth costs ¥7,725 ($1,084). It is almost impossible for individuals to bear such high bandwidth costs.


Thank you very much for your reply! I checked my server's score status and found that my server is handling 10.104 out of every million requests. According to this post (https://community.ntppool.org/t/adding-servers-to-the-china-zone/88), this means my server processes 100 requests per second, but in reality, it could be much higher, to the point where my service provider considers it a DDoS attack :(

I believe this is indeed related to the insufficient number of volunteer servers in the China region. I am trying to contact the administrators to address this issue.

Thank you very much!

I am not sure this would help, because it does not solve the underlying issue of the extreme imbalance between number of clients and number of servers.

Servers dropping out of the pool is mostly caused by elevated packet loss, which in turn is caused by overload of the server, or of some related infrastructure. Forcing such servers to stay in the pool would thus only reinforce the problem of the server becoming even more unavailable.

The other way round, the monitoring system is designed (though not necessarily perfectly) to reflect what ordinary clients would see. So when the monitors see elevated packet drops, so would ordinary clients. And the more overload there is, the worse the service that ordinary clients get as well. The monitoring system is just the "canary" indicating the problem.

And by dropping servers from the pool, the pool is in a way regulating the load on the server within bounds "acceptable" to the server, though only very coarsely. "Acceptable" in quotes because obviously, the upper bound is actually already in the overload area of the server.

That said, it would be interesting to see what would happen if servers, at least the "big" ones, were kept in the pool even when their scores drop below 10, as you suggest, and whether that would help in any way. Their service quality would go down, but them staying in the pool could be a welcome diversion, in the sense that they attract traffic, pulling it away from other servers, so that those would get less traffic. As long as clients getting lower service quality are still happy, and don't, e.g., move away from those servers, or start hammering them until they get an answer, potentially making the problem even worse.

But considering the dimensions we are talking about, I am not sure that would actually work in practice: 33,398,182 Internet users per server in the pool (as calculated in the post referred to earlier).

I may have caused some misunderstanding. By "large enterprise servers with low scores," I do not mean servers that cannot provide service (on the contrary, due to the ample resources of these enterprises, their latency and bandwidth are better than those of volunteer servers). Instead, I refer to servers that have been incorrectly assigned lower scores and thus removed from the pool.

I am not entirely clear on how the scores are calculated, but it could be due to various factors, such as the connection to the international internet, local carrier routes, and so on. These "large enterprise" NTP servers, although running well, are incorrectly given scores below ten, preventing them from joining the NTP pool. I am referring to such servers; if they were fixed in the pool, they could help alleviate the network pressure across the entire CN pool.

That is a claim that keeps coming up time and time again: that it is the scores/monitors/monitor locations that are the issue here, because they lead to "incorrectly" low scores. I don't rule that out completely, but from all the data points I've seen so far, that does not seem to be the issue, at least not the primary one whose resolution would solve the problem.

I see that you have scheduled your server for deletion from the pool. Please, if it is not too much to ask, reverse that deletion, and keep the server in the pool. Set it to "monitoring only" mode.

That way, we can see how the monitoring fares once the load on your server subsides, after a few days. And from that, we can then infer the contribution especially of the "international internet" to the scoring.

Again, two items: I don't think those are "incorrectly given", more on that in a second. And I don't think keeping them in the pool would help, see the previous post.

Besides looking forward to the outcome of the test of you keeping your server in the pool in "monitoring only" mode, I've also set up another test recently, looking at one of those "large enterprise NTP servers running well", namely time2.cloud.tencent.com. Not sure whether that qualifies in your sense; if not, just name another one, and I'll take a look.

Anyway, the test is from four vantage points inside China, chosen at random (from an admittedly limited set of options). And looking at the data, I calculate a packet loss of slightly above 6%. Within China, not via some international connections or anything like that.

Comparing that with other tests, that is rather high. Though obviously, it is not fully representative, due to its limited scope. But it is at least a hint that it really might be overload of the server, or some infrastructure around it, that is causing this.

E.g., you refer to "local carrier routes". That is something I (or anyone else from outside China, I guess) cannot assess from afar, obviously. But also, if that were part of the issue, that is what the monitoring system is supposed to take into account and reflect as well. But in case that is indeed a general issue locally, then maybe it shouldn't be taken into account. Not sure though that would help the service that local clients would get.

All that said, I think the large client population in China alone would warrant having a local monitor (or two, or...). I don't think it would really change much as to the problem we are talking about, but it certainly would at least better reflect the local view of the huge local client population. And we'd finally know for good whether it is the monitors' placement that is the issue, or not. And as such, I am personally quite unhappy that any efforts of volunteers trying to set up monitors in China have been fruitless so far.

Update 2024-11-04 ~10:00 UTC:

  • I have increased the number of vantage points on the original measurement to 10, to get a better sample size.
  • I realized that I had set up the original measurement with the name of the test subject, and that that name pointed to two different IP addresses over time so far. With the intended address being in the clear majority, I don't think it makes much of a difference with respect to the aspect we are interested in. Still, to be sure, I created a new measurement with the same vantage points, but explicitly using the intended IP address as target. So far, the measurement seems to be seeing a packet loss of slightly above 10%, but I guess it is currently the more busy part of the day. So I expect that to go down as more data is being collected also during the less busy parts of the day.
  • I took the liberty to similarly set up a measurement to your server, again with the same vantage points. It's on average 10 requests per minute, so should be negligible in comparison to all the other traffic you're seeing, but let me know if you don't want that for whatever reason, and I'll stop it right away. Anyhow, it seems one of the vantage points cannot reach your server at all for some reason, but the rest is looking good so far.
  • Your server in monitoring-only mode seems to be looking good from the pool monitoring point of view as well. While there are still some packet drops, those seem to be specific to a subset of the monitors only, which is not uncommon. There is a sufficiently large set of monitors which do not have noticeable issues reaching your server so that the overall score is converging towards the possible maximum.
    [Screenshot from 2024-11-04 11-42-01]
    If not already known, depending on your device, one can hover over or tap on the "overall score" row in the table below the graph to better see its trend in the graph. The offsets observed have a somewhat wider spread than on other servers I am familiar with, but the sample of servers I mostly deal with is somewhat biased and not representative. More importantly, the offsets observed are still small enough to not affect the score, which is the object of interest for our topic.

Thank you very much for your detailed reply!

I see that you have scheduled the removal of the server from the pool. If it's not too much trouble, please cancel the deletion and keep the server in the pool. Set it to "monitor-only" mode.

I have already canceled the deletion request and set it to "monitor-only." From the network statistics panel, I can see that the traffic has significantly decreased. Although there are still NTP requests, they are within the range that the server can handle.

I recently set up another test to check one of the "well-running large enterprise NTP servers," specifically time2.cloud.tencent.com. I'm not sure if this aligns with what you meant, but if not, just let me know, and I'll look into another one.

In any case, the test was conducted from four vantage points within China, randomly selected (from an admittedly limited set of options). Reviewing the data, I calculated a packet loss rate slightly over 6%. This was within China, not through any international links or similar.

I just registered an account on ripe.net, but unfortunately, I discovered that tests require points. I currently don't have any points, so I used some online ping tools from Chinese websites to check. Here are some results:

Since I am not particularly familiar with the underlying principles of NTP, I am not sure if there is a correlation between ping and NTP quality at the network transmission level (admittedly, the clock configuration of the server itself is also important, but I will assume that the time inside the server is completely standard). In China, a large number of smart devices also rely on the network to synchronize time, such as Android-based smartwatches and smart speakers. There may be multiple such devices in each household, which adds to the pressure on the NTP pool, not just from computers and smartphones, but also from a large number of smart devices.

I took the liberty of setting up similar measurements for your server, from the same vantage points. ...but if you don't want this for any reason, please let me know, and I will stop immediately.

I very much welcome this. In fact, I am also learning about the underlying principles of NTP.

For some reason, one of the vantage points seems unable to reach your server at all.

I find this strange. When I use ping (123.57.221.61: online ping, multi-location ping, multi-route ping, continuous ping, network latency test, server latency test) to measure my server's network latency, it is very good, and my server has not experienced any form of failure in the past 24 hours.

If you didn't know, depending on your device, you can hover your mouse over the "overall score" row in the table below the chart or click on it.

Thank you very much! I didn't know this was possible.

Next, I plan to temporarily upgrade my bandwidth to 10 Mbps and add the server to the pool to see if there is a correlation between traffic peaks and time periods.

I'm sorry, I am a non-native English speaker. I am using translation software and ChatGPT to communicate, so please ignore my grammar mistakes :(


Good, thanks! The score trend visible is very interesting. There are a few monitors that have issues, but there is a sufficient number that work well enough to keep an almost perfect score. That seems to suggest that the issue with low scoring is (likely) not due to the location of the monitors and their traffic crossing international connections.

Yes, "full" NTP clients (e.g., chronyd, NTP classic, NTPsec), often running on servers or PCs, will keep polling a server for a while after initially getting the IP address of a server. They have mechanisms to continually track upstream servers, and to make continuous corrections to the local clock (called "disciplining" the clock) so it is smooth without jumps in time. That is why they hold on to the IP addresses they got initially across multiple queries to the upstream servers. That is what you are still seeing.

The IoT devices you mention tend to implement a simpler version of NTP called SNTP. Those typically don't continuously adjust their clocks, but poll upstream servers periodically, and then set the time according to that. That entails that when configured with a DNS name pointing to the pool, they keep re-resolving that name each time, and depending on the timing, get different IP addresses every time. So when a server leaves the pool, it will not get any requests from this type of client anymore.

But also non-IoT devices may have an SNTP client only, e.g., older versions of MS Windows (newer versions have "high precision timekeeping" support, which sounds like a full(er) NTP implementation), or systemd-timesyncd, which is becoming more common on Linux.
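To make the difference concrete: the entire on-wire exchange such a simple client needs is one request and one response, roughly along the lines of this sketch (a hypothetical Python illustration only, not how any particular device actually implements it; please don't run something like this in a tight loop against the pool):

```python
# A minimal, illustrative SNTP-style query (hypothetical example code, not taken
# from any real device firmware).
import socket
import struct
import time

NTP_TO_UNIX = 2208988800  # seconds between the NTP epoch (1900) and the Unix epoch (1970)

def sntp_query(server, timeout=2.0):
    """Send one client-mode (mode 3) NTP packet and return the server's transmit time."""
    request = b"\x23" + 47 * b"\x00"  # LI=0, VN=4, Mode=3; all other fields left at zero
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(request, (server, 123))
        response, _ = sock.recvfrom(512)
    # Transmit timestamp: 32-bit seconds + 32-bit fraction, big-endian, at offset 40.
    seconds, fraction = struct.unpack("!II", response[40:48])
    return seconds - NTP_TO_UNIX + fraction / 2**32

if __name__ == "__main__":
    # An SNTP-style client would simply set (step) its clock to this value.
    print(time.ctime(sntp_query("pool.ntp.org")))
```

A full client does much more around this same packet format: tracking several servers, filtering the samples, and steering the clock gradually instead of stepping it.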

I encourage you (and others) to set up your own RIPE Atlas software probe. You can earn points that way to eventually run your own tests, but you also help improve the coverage of the RIPE Atlas project so it becomes more useful for investigations like ours.

The additional load, both CPU as well as network wise, is rather small, and the network throughput can be limited to as low as 10 kbit/s. Only thing is that the tools are currently a tad chatty as far as system logging is concerned...

It really depends on circumstances. Pings can be a good rough indicator of network transmission conditions that also NTP packets will encounter, but that always needs to be taken with a grain of salt.

Obviously, ICMP is not NTP, so network operators may treat them differently, and, e.g., ping does not assess the performance of the NTP server itself, e.g., whether it is overloaded or otherwise dropping packets (e.g., rate limiting).

So if one sees some network transmission behavior for ICMP, it is likely that NTP might be affected similarly, or vice versa, giving an indication as to how to potentially proceed further in the investigation. Always keeping in mind that it might also be quite different.

Case in point: While the NTP measurement to time2.cloud.tencent.com clearly shows packet loss, a ping measurement from the same four original vantage points as the original NTP measurement doesn't indicate any relevant level of packet loss. In other cases, the correlation is nicely visible.

Since it is quite easy to use, this is typically the first tool I use when troubleshooting a wide range of networking issues (provided the target hopefully doesn't block it).
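For anyone wanting a quick spot check of NTP packet loss without RIPE Atlas credits, something along these lines works from a single vantage point (an illustrative sketch only; one vantage point is far less meaningful than a distributed measurement, and the query rate should be kept low so the check itself doesn't add load):

```python
# Rough single-vantage-point NTP reachability spot check (illustrative sketch only).
import socket
import time

def ntp_loss(server, count=20, interval=2.0, timeout=1.0):
    """Send `count` client-mode NTP queries and return the fraction that got no reply."""
    lost = 0
    for _ in range(count):
        request = b"\x23" + 47 * b"\x00"  # LI=0, VN=4, Mode=3 (client)
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            try:
                sock.sendto(request, (server, 123))
                sock.recvfrom(512)
            except OSError:  # timeout (or other socket error) counts as a lost reply
                lost += 1
        time.sleep(interval)  # keep the rate modest; this is a spot check, not a stress test
    return lost / count

print("loss:", ntp_loss("time2.cloud.tencent.com"))
```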

Sorry, I didn't mean to imply there was anything wrong with your server. This can simply happen if two network operators don't cooperate very well, or some operator implements some filtering of NTP traffic (there was a time when NTP servers could be abused for amplification attacks).

E.g., my Alibaba-hosted server in Singapore for some reason seems unreachable from the monitor in Finland, at least via NTP.

Good idea, looking forward to the outcome of the experiment. I hope 10 Mbps will be sufficient to cope with the potential peaks you'll see, keeping my fingers crossed :crossed_fingers: :slight_smile:

Variations in the Singapore zone are quite pronounced during the daily cycle.

There's been a proposal for an API to be able to programmatically control the "netspeed" setting, e.g., to automatically manage traffic volume/rate in the face of bandwidth limits/traffic quotas. But the right combination of know-how and resources/time hasn't been available so far to make that happen.

Kind of a sledgehammer method to regulate the load on a server would be to expressly use the monitoring system for that purpose, like it anyhow implicitly regulates the traffic by dropping servers from the pool when they get overloaded. It is a bit more difficult to use these days, now that we (thankfully) have a much larger and more diverse set of monitors. And the feedback loop might be a bit too slow in severely underserved zones if the traffic rate is to be managed (i.e., when there isn't a "hard" bandwidth limitation, but, e.g., too high a bitrate could trigger protection mechanisms).

Same here :wink:

No worries! :grin:

Same in China. I have a fixed public IPv4 address for my whole company's office network, and I joined the pool using that IP address some years ago.

What bothers me is that the stateful NAT tracking of UDP flows easily runs out of my router's memory and then blocks all internet access. It's not possible to serve NTP directly on the router or to put my NTP server on this IP, and a fixed IPv6 address is also very expensive in China.

Is there an established practice for configuring stateless NAT on commercial or consumer-grade routers? I've tried the tc command, but it doesn't work on my Asus router.


It all depends on the device, what OS it is running, and the UI it offers to interact with the OS and other functionality of the device. Your mention of tc and your device being an Asus one suggest it might be running some Linux with CLI access, e.g., OpenWrt or DD-WRT.

If that is true, this thread has a good discussion of potential options, including tuning the connection tracking system a bit if it cannot be disabled entirely.
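To make the connection-tracking option a bit more concrete, here is a sketch of the "exempt NTP from conntrack" approach, assuming a Linux/nftables-based firmware (e.g., a recent OpenWrt build); the table name is made up and the rules would need adapting to the device's existing ruleset:

```
# Illustrative only: bypass connection tracking for NTP so those UDP flows
# never occupy conntrack table entries (raw-priority hooks run before conntrack).
table inet ntp_notrack {
    chain prerouting {
        type filter hook prerouting priority raw; policy accept;
        udp dport 123 notrack
    }
    chain output {
        type filter hook output priority raw; policy accept;
        udp sport 123 notrack
    }
}
```

If tracking cannot be bypassed on a given device, the usual fallbacks are shrinking the UDP conntrack timeouts (e.g., the net.netfilter.nf_conntrack_udp_timeout sysctl) and/or raising net.netfilter.nf_conntrack_max, so that NTP's short one-shot flows expire before they fill the table.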

Note though that, while this is a very common issue, device performance does not seem to be the primary issue in the case that triggered this thread.

Rather, as far as I understood, the high traffic volume/bitrate was triggering DDoS protection mechanisms on the cloud provider side, and generally exceeding the contracted bandwidth by far, both of which would in turn impact other services as well.

But infrastructure performance (like in this case of what I understood to be a cloud-based instance), or device performance (in a physical setup like yours seems to be) would be likely next potential bottlenecks.

You mentioned that IoT devices tend to implement a simpler version of NTP called SNTP.

I never knew there was a simpler version of NTP. I had always assumed that NTP, like DNS queries, only uses one packet and that all devices follow the same logic. Thank you for the explanation!

I encourage you (and others) to set up your own RIPE Atlas software probes.

I reviewed the documentation, and it does seem like a fantastic project. I am trying to join, but I found some articles online stating that the review process can take a very long time (some people even waited a year to get approval). I hope my application won't take that long.

Clearly, ICMP is not NTP, so network operators might treat them differently, for example, ping does not evaluate the performance of the NTP server itself, such as whether it is overloaded or dropping packets for other reasons (e.g., rate limiting).

Yes, it is obvious that network operators treat ICMP packets and UDP packets differently. I hadn't considered this point. You are right; the server at time2.cloud.tencent.com might experience overload at certain times, which ICMP pings cannot detect.

Good idea, looking forward to the experiment results. I hope 10 Mbps will be sufficient to handle the potential peaks. Letā€™s hope for the best.

I purchased a server from a small IDC located in Zhejiang Province (eastern China, with relatively good internet latency to most areas), with a specified bandwidth of 20 Mbps both upstream and downstream.

NTP Pool Scores
Traffic Monitor

Unfortunately, I noticed that every time the score exceeds 10 and the server starts accepting NTP packets, the score quickly drops. This is quite different from what I experienced with Alibaba Cloud (where my server maintained a good score after joining the pool until it triggered bandwidth cleanup due to far exceeding the bandwidth limit). I suspect this might be due to the lower pps (packets per second) capability of the network interface card provided by the small IDC. I will observe for a few more days, and if this server continues to struggle with NTP requests, I will shut it down.

Yeah, the basic packet is always the same, but the logic differs, e.g., as to how to send one, how to fill the fields in the packet, how to deal with the response, what to do in between polls, when to send the next poll, and whether to maintain any kind of state (in a wide sense) from one poll to the next. Initially, there used to be two separate documents for SNTP and NTP, but in the last document revision, SNTP was absorbed as just a short chapter in the NTP specification, basically saying that such clients don't implement the full internal timekeeping logic, and describing how that may reflect on the on-wire protocol (not the format, but how and which fields are populated, and when packets are sent).

And it's not even black and white; there are various shades in between. NTP describes/prescribes in some detail also the internal algorithms of timekeeping. That is where some clients deviate from the specs and implement varying degrees of simplified mechanisms. And some people even contest whether it is the place of the protocol specification to prescribe the internal behavior in such detail, and whether an implementation not implementing all the details is to be considered compliant or not, even if that often would not be discernible when looking just at individual packets/packet exchanges on the wire.

There are two types of probes, hardware and software. Hardware probes may indeed take a long time, even just because the current stock has run out. And as they are provided for free, but still cost money to manufacture, there are criteria for how to allocate them, e.g., preferring areas/networks where there aren't that many yet, like in China.

Software probes only take as long as it takes to install the software, and fill in and send the form online. In the case of software probes, it's just a registration (in the sense of notification), not an application (in the sense of requiring approval). As far as I understand, there are currently no criteria that are being checked before accepting a registration. And even if there were, seeing as there are only a few probes in China as compared to the country size, or in Asia as a whole, I don't think locations, e.g., in Asia would currently be rejected.

Ah, very nice!

Your graphs show the prototypical signs of overload.

There may be many factors that can cause this, some potentially under your control, like local firewall configuration. And others outside your direct control, like the uplink properties and protections such as with Alibaba.

I note you have a "netspeed" setting of 12 Mbit or above. That is the threshold at which the server also gets formally included in the "global" pool. I don't think that makes any difference with respect to your current situation (the traffic from outside will be many orders of magnitude smaller than the local traffic), but it does make an "@" appear in the zone list.

Before removing the server from the pool, I encourage you to try a few things. Start by setting it to "monitor-only". That way, you can see what the "baseline" is, i.e., what the general properties of the setup are, both the VM and its underlying infrastructure, as well as the Internet connection.

After this has settled a bit after some days, start with the lowest "netspeed" setting, and see what happens, both how much traffic you actually get, and how the system reacts to it. If it is not clearly too much traffic, i.e., beyond the available 20 Mbit/s, it may be worthwhile investigating a bit, and tuning things that are under your control, or that you can perhaps indirectly influence.

And even if the score is still not perfect at that setting, it may still be worthwhile keeping the server. We could discuss that in more detail once we get there, but as hinted at, the score dropping below ten could be considered a normal mechanism to regulate the load on the server, and one just needs to accept the not-so-beautiful aspects of the graph :slight_smile: (see the Tencent graph, or many, many others in the pool - I can show you other examples if you like). As mentioned, the monitors are much more sensitive than actual clients (on purpose), especially than the many typical IoT devices you mention.

The server might still be a worthwhile addition to the pool, and if there were more of that, the problem might even get better over time as more capacity is added that way, in small increments.

Unfortunately, I received an email from Ask:

This is an automated email for foxhank.

The NTP Pool system has marked your timeserver as either unreachable or keeping bad time. Until the issue has been addressed, it will not be included in the DNS zones for discovery by the NTP users.

If you have resolved the problem or if it has resolved itself, there's no need to reply to this email; the monitoring system will notice and after a short time include the server in the system again.

(You can follow the status on the URL listed next to the server IP).

The NTP system is based on IPs rather than the DNS of your server. If the IP changes, it is considered a new server.


110.42.35.211 (score: -20.9) https://www.ntppool.org/s/59733

I may have caused some misunderstanding: my new server (110.42.35.211) is not hosted on Alibaba Cloud; it is hosted by a small IDC provider that primarily caters to gamers (for example, renting Minecraft servers). Therefore, their network infrastructure might not be as robust as that of large cloud service providers.

My virtual machine has a 2-core, 2048 MB configuration, and the backend monitoring shows that the load is not very high (CPU usage is around 10% or even lower). So, the issue might be with the network. It appears that this small IDC's network is not ideal, and they may not have optimized their network for UDP traffic.

I have re-enabled my server which is based on Alibaba Cloud (https://www.ntppool.org/scores/123.57.221.61). Let's see what happens:
Network Monitoring: http://123.57.221.61:3000/?5-f

This server also hosts other applications (like my blog), but it doesn't receive much traffic. There are probably only bot and CDN refresh requests :rofl:. By observing the traffic patterns, I think I can estimate the request characteristics of NTP in the China region.

The time required for software probe setup is the same as the time needed to install the software, fill out, and submit the online form. In the case of a software probe, it is just a registration (in the sense of notification) and not an application (in the sense of requiring approval). ...I don't think they would reject, for example, locations in Asia.

Thank you for the information. I am reading the documentation and trying to join the probe program. My server is usually idle anyway, so it's a good opportunity to put it to use.

The server might still be a worthy addition to the pool, and with more servers, the issues might even improve over time as more capacity is added in smaller increments.

Ah, I completely agree with your idea! In fact, I am considering how to add more servers. November 11th is a shopping festival in China, and some cloud service providers might offer special deals (for example, a lightweight server for 36 RMB, approximately $5). It comes with 2 cores, 2GB of RAM, and 6 Mbps of upload and download bandwidth. Although it has some limitations compared to regular cloud servers (such as a monthly traffic limit of 2048GB and simpler security group rules), I believe it could handle some NTP requests. If I can get this server, I will deploy NTP service and monitor the request trends.


This is just a kind reminder in case someone is not closely monitoring their server, for them to check whether everything is still as it is intended. It is sent automatically when the score drops below a certain value and/or stays too low for too long. Note the part where it says "If you have resolved the problem or if it has resolved itself". In your case, it is "expected" that this can happen, and as it is load-related, the "it has resolved itself" part is the relevant one. I.e., when it is load-related, it will typically "resolve itself" sooner or later. That is the control loop I mentioned previously:

  • score crosses 10, pool starts adding load to the server
  • load on server or infrastructure gets too high, packets start getting dropped, score falls
  • when score drops below 10, pool stops directing new traffic to the server
  • existing traffic goes down slowly as DNS entries pointing to the server expire, and clients start using other servers returned by pool DNS
  • as load subsides, packet drops will become less, and score starts increasing again
  • when score crosses 10, cycle begins anew

This controls the load on a very rough average level that can be handled by the server. The feedback loop latency determines how high, and how low the score goes, and how long it stays above/below 10.

The higher the netspeed, the more pronounced the amplitude of the cycle. High netspeed means that when crossing 10, the load increases very fast to high values before the dropping score reduces the load. And it means the load will stay for longer, i.e., the score will drop to very low values.
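Just to illustrate that last point, here is a deliberately simplified toy model of the cycle (all constants and formulas are invented purely for illustration; this is not the pool's actual scoring or DNS logic):

```python
# Toy model of the score/load feedback loop sketched above. NOT the pool's algorithm;
# capacity, gains, and thresholds are made-up values purely to show the dynamics.
def simulate(netspeed_load, capacity=100.0, steps=96):
    score, load, minimum = 20.0, 0.0, 20.0
    for _ in range(steps):
        target = netspeed_load if score >= 10 else 0.0  # in the pool only while score >= 10
        load += 0.3 * (target - load)                   # traffic follows DNS/TTLs only slowly
        loss = max(0.0, (load - capacity) / load) if load > 0 else 0.0
        if loss == 0:
            score = min(20.0, score + 1.0)              # healthy: score creeps back up
        else:
            score = max(-100.0, score - 25.0 * loss)    # overloaded: score drops fast
        minimum = min(minimum, score)
    return minimum

for load_at_full_netspeed in (120, 1000):  # modest vs. far-too-high netspeed setting
    print(load_at_full_netspeed, "-> lowest score reached:",
          round(simulate(load_at_full_netspeed), 1))
```

With these made-up numbers, the modest setting dips only a few points below 10 and recovers within a few steps, while the far-too-high one plunges well below zero and takes a long time to climb back.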

That is why you are seeing those very pronounced cycles with your server, and why you got the email from Ask.

Reducing the netspeed should dampen the cycle so that the score doesn't drop too low anymore.

I think I understood. I assume that if they offer 20 Mbit/s, then they hopefully should support that. My guess is though that with your high "netspeed" setting, you are exceeding those 20 Mbit/s by far. This excess is what Alibaba may handle better. But the underlying issue is that the actual load is too high in either case.

So the suggestion is to go to "monitoring-only", and then go to 512 kbit in a first step. With the current netspeed, the whole system is completely out of control.

Usually, the CPU is the last bottleneck. Other components hit their limits typically way before the CPU.

Yes, that would be my guess.

Yes, that may be the case. But at your current netspeed setting, it is not possible to investigate, because the traffic is likely way beyond what you contracted.

That would be very helpful. E.g., what peak bitrate is caused by a "512 kbit" netspeed setting. From there, one can then extrapolate upwards, e.g., to a 3 Mbit setting. But more importantly, determine the lower bound that a server needs to be able to handle, i.e., the minimum bandwidth/packet rate that the server plus infrastructure should be able to cope with.
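As a rough aid for relating such bitrate figures to packet rates (relevant because NIC pps limits were suspected earlier), here is a back-of-envelope conversion, assuming plain 48-byte NTP packets over IPv4/UDP with no extensions and ignoring layer-2 framing:

```python
# Back-of-envelope NTP bitrate <-> packet rate conversion.
# Assumption: 48 B NTP payload + 8 B UDP + 20 B IPv4 = 76 B per packet
# (no NTS/extensions, no layer-2 framing; requests and responses are the same size).
BYTES_PER_PACKET = 76

def mbps_to_pps(mbps: float) -> float:
    return mbps * 1_000_000 / 8 / BYTES_PER_PACKET

for mbps in (0.512, 3, 10, 20):
    print(f"{mbps:>6} Mbit/s ~ {mbps_to_pps(mbps):,.0f} packets/s each way")
```

Under that assumption, the 3 Mbit/s mentioned at the start of the thread corresponds to roughly 5,000 packets per second in each direction, and 20 Mbit/s to roughly 33,000.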

But also the variation throughout the day matters. E.g., maybe the peak is reached only during one or two (or even a few more would still be ok) hours per day. Then the score might drop during those periods, but be fine the rest of the day. Or it might still drop throughout the day, but the ups and downs are ok on average.

Zones like China are challenging, because the current unfavorable ratio between clients and servers makes it difficult to add servers. But I think part of it is also the somewhat too high expectation, under the current circumstances, that the "scores" should be good, and especially not drop below 10, with people removing their servers right away if there is such trouble (focusing primarily on the scoring). If more of those stayed in the pool, that might help build the critical mass, slowly, very slowly.

But maybe the imbalance is just too big to be overcome by many small servers. And in the above, I am focusing on the scores only. If there are other issues, e.g., higher cost from too much traffic, negative impact on other (likely more important) services, frequent blocking due to protection mechanisms or traffic quotas kicking in, or the server being too often and/or too low in the scoring (where there is no hard definition of "too"), ..., those may obviously be reasons why it doesn't make sense to stay in the pool after all.

But without good data, e.g., how much traffic is generated at a 512 kbit netspeed setting, and what the traffic profile looks like throughout an entire day, it will be difficult to know.

E.g., one strategy could be for many, many people to set up servers, but keep them in monitoring-only mode. And then, when a good number of prospective servers has been reached, enable them all simultaneously (as well as that can be coordinated), so that it is not a single server that gets too high a share at once, but the overall capacity added at once is large enough that no single new server is overloaded. Kind of like the strategy previously used to bootstrap the zone by adding servers from outside the zone at once, except that now, the servers are from within the zone, and the concerted action is to switch out of "monitoring-only" at the same time.

Again, knowing the traffic profile to be expected would help figure out how many new servers would be needed, and whether such an approach would even be realistic.

Just some quick feedback: Looking at the traffic monitor you kindly shared, the currently limiting factor seems not to be that the system wouldn't be able to handle the load. Rather, it seems someone (the IDC operator?) is explicitly dropping packets. I suspect again some kind of "protection" mechanism kicking in.

Note how in the 5 minute graph, after the traffic peaks and then suddenly stops, the traffic goes to almost zero. And after a while, it goes back up again to near 1 Mbit/s. This is also nicely visible in the 5 minute table, except that it has too short a history, and at this very moment, the near-zero period (actually ~36 kbit/s upstream and downstream combined) is about to move out of the table scope.

If it were just overload in some component, e.g., the network interface card or a connection tracking entity not able to deal with the bitrate or packet rate anymore, the drop would not be so sudden, but the traffic would remain somewhat steady at a high level for a bit before starting to drop (when the load decreases because the server has been taken out of the pool). And it wouldn't drop to near zero, and stay there for some time. Note the jump back up to almost 1 Mbit/s. At that time, the server is not in the pool, so it wouldn't get any new traffic. So the jump is when the residual NTP traffic comes back to the VM once the explicit block of the traffic is lifted. It is as if a switch were flipped off, stopping NTP traffic, and then some time later flipped back on, allowing traffic again.

So, if you're interested in pursuing this, you could try to find out what is causing this block, e.g., whether the IDC has some protection mechanisms in place, and whether that can perhaps be disabled somehow, or the trigger threshold increased.

But there is obviously always the risk that even if this current issue can be overcome, the next one may come up (e.g., actual capacity limits). So it may take some more investigative steps before your NTP server perhaps works as part of the pool, and there is always the risk that an issue, even when identified, cannot be overcome.

EDIT: For documentation purposes, I took the liberty of taking a screenshot of the graph referred to above to preserve the pattern described (the above links point to graphs being updated continuously). Note that the very last peak is tapering off more slowly (continuing outside the time window shown in the screenshot) because the server was manually taken out of the pool at about that time.

I began to experience the same thing in the RU zone about a month ago.
I have had an NTP service on my home Mikrotik 4011 router for ages and never noticed it, but about a month ago the load became enormous, about 1-2 million (!!!) packets per second, using all available bandwidth (500 Mbps).
I had to block NTP traffic on the firewall because it looks like a DDoS, and my near-enterprise-level router starts dropping packets for the services I need.


Welcome to the forum, @kkursor !

With an estimated 132,000,000 Internet users potentially accessing 9 servers, there is a ratio of about 14,666,667 users per server. More than, e.g., in South Korea, but less than in China (these are rough numbers only, the actual numbers may differ, but I think the order of magnitude gives a good enough indication). So it's likely that it's not only the typical issue of an underserved zone.

Looking at the history of the number of servers in the Russia zone, there seems to be a clear drop from about 123 servers 60 days ago to the current 9. So very roughly (depending on the "netspeed" settings of the servers that left), that could explain an increase by a factor of above ten. Which would mean you might have gotten about 30 Mbps before the drastic drop in the number of servers. Which seems somewhat high, but is not impossible, depending on how you "positioned" your server in the pool. E.g., if this was not at home on a DSL line, but maybe in some datacenter, or a professional network setting.

Anyway, it is not clear what your "netspeed" setting is, but that would be the first knob to tweak. I.e., either gradually decrease it until the load is acceptable. Or, since the load currently is way beyond that, my recommendation would again be to go to the 512 kbit/s setting to see what the "baseline" is, and work your way up from there until you reach the maximum target load. Or probably somewhat below what would be acceptable, owing to the granularity of the steps available for the "netspeed" setting.

Note that in the China zone, it seems the 512 kbit setting can produce peaks of almost 10 Mbit/s, so it should be less in the Russia zone.

If you don't mind sharing, knowing which server we're talking about wouldn't hurt. Though it is not as relevant as in the case of the China zone, because it seems clear from your description that in this case, the issue is not suspected to be due to the placement of monitors, or some international connections.