Network: i/o timeout

Hi, I just set up my first NTP server.
But I get red spots on the score graph periodically. How do I fix this?

I turned off UniFi's Intrusion Detection and Prevention function in case it was the cause,
but the symptoms are the same.
What could be the problem?

There are relatively few servers in Korea,
so it's possible that my server or local firewall becomes overwhelmed when I increase the "netspeed" setting.

1728847171,2024-10-13 19:19:31,-0.003045615,1,7.514120579,50,kricn1-1a6a7hp,,
1728847119,2024-10-13 19:18:39,,1,5.56763792,24,recentmedian,,
1728847119,2024-10-13 19:18:39,0.005339929,1,5.56763792,63,ausyd2-2trgvm8,,
1728846940,2024-10-13 19:15:40,,-5,4.808040142,24,recentmedian,,network: i/o timeout
1728846940,2024-10-13 19:15:40,,-5,-2.039467096,25,inblr1-1a6a7hp,,network: i/o timeout
1728846859,2024-10-13 19:14:19,,-5,4.808040142,24,recentmedian,,network: i/o timeout
1728846859,2024-10-13 19:14:19,,-5,6.85696888,50,kricn1-1a6a7hp,,network: i/o timeout
1728846853,2024-10-13 19:14:13,,-5,4.808040142,24,recentmedian,,network: i/o timeout
1728846853,2024-10-13 19:14:13,,-5,8.796831131,62,usdaa1-1tcp71g,,network: i/o timeout
1728846850,2024-10-13 19:14:10,,-5,4.808040142,24,recentmedian,,network: i/o timeout
1728846850,2024-10-13 19:14:10,,-5,4.732137203,9,ussjc1-1a6a7hp,,network: i/o timeout
1728846797,2024-10-13 19:13:17,,1,10.244354248,24,recentmedian,,
1728846797,2024-10-13 19:13:17,,-5,4.808040142,63,ausyd2-2trgvm8,,network: i/o timeout
1728846604,2024-10-13 19:10:04,,1,10.324253082,24,recentmedian,,
1728846604,2024-10-13 19:10:04,,-5,3.116350412,25,inblr1-1a6a7hp,,network: i/o timeout
1728846599,2024-10-13 19:09:59,,1,10.324253082,24,recentmedian,,
1728846599,2024-10-13 19:09:59,,-5,-6.149111271,56,plszy1-361g4fk,,network: i/o timeout
1728846514,2024-10-13 19:08:34,,1,10.324253082,24,recentmedian,,
1728846514,2024-10-13 19:08:34,-0.003126107,1,12.481019974,50,kricn1-1a6a7hp,,
1728846511,2024-10-13 19:08:31,,1,10.324253082,24,recentmedian,,
1728846511,2024-10-13 19:08:31,-0.00068212,1,10.244354248,9,ussjc1-1a6a7hp,,
1728846461,2024-10-13 19:07:41,,1,10.324253082,24,recentmedian,,
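For reference, each line of that log appears to be a row from the pool's monitor-log CSV export. The following is a quick sketch for tallying timeouts per monitor; the column meanings (epoch, timestamp, offset, step, score, monitor ID, monitor name, leap, error) are my guess from the data above:

```python
import csv
import io

# Two lines copied from the log above; the last field is the error message.
sample = (
    "1728846940,2024-10-13 19:15:40,,-5,-2.039467096,25,inblr1-1a6a7hp,,network: i/o timeout\n"
    "1728847171,2024-10-13 19:19:31,-0.003045615,1,7.514120579,50,kricn1-1a6a7hp,,\n"
)

# Count "i/o timeout" errors per monitor name (column 7).
timeouts = {}
for row in csv.reader(io.StringIO(sample)):
    monitor, error = row[6], row[8]
    if "i/o timeout" in error:
        timeouts[monitor] = timeouts.get(monitor, 0) + 1

print(timeouts)  # {'inblr1-1a6a7hp': 1}
```

Running this over the full log export (e.g., the `?limit=20000` URL mentioned below) would show which monitors see the timeouts and how often.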

You have loads of network timeouts.

Meaning your server can't be reached.

Looks like a poor or very unstable network connection to me.

Could be a firewall dropping requests because it thinks it's being attacked.

Whatever it is, it means an unstable connection that doesn't reply to requests.

Looking at the pattern @ https://www.ntppool.org/scores/175.117.50.8/log?limit=20000&monitor=recentmedian, that would be my guess. The system is only assigning clients to the server when the score is 10 or above. Note how, whenever the score crosses that boundary, the score drops below 10 again rather quickly in many cases (not all, but I guess you might have been playing with the netspeed setting).

Also, seeing that you seem to have your netspeed set to a rather high value supports that. See the “Client distribution” on your server’s page, which suggests that you have set your netspeed to a value that should give you roughly 4.5 % of the load in the Korea zone. Seeing that there are only 9 active servers in that zone suggests that you might indeed be getting too much traffic, which is clogging some part of your system: your line, your firewall/router, your NTP server, or…

I think you are right.
Now my server's page says it is responsible for 25% of the traffic in Korea.
I will lower the traffic from 1 Gbit to 12 Mbit and wait.

The values are in permyriad (parts per ten thousand), so the fraction of DNS queries you get from Korea would actually be just 0.25 %. The second value is the fraction of the netspeed setting. Note that the two values are updated differently, which is why the relation between the two is currently so skewed: the fraction of DNS queries is averaged over a few days, while the fraction of netspeed is a near-instantaneous value. So since you set up your server just recently, and also if it was more out of the pool than in it since then, that might explain the low DNS fraction value.
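To make the unit conversion concrete, a trivial sketch of the permyriad arithmetic:

```python
def permyriad_to_percent(value):
    """1 permyriad = 1/10000; 1 percent = 1/100, so divide by 100."""
    return value / 100

# A "25" shown in permyriad is only a quarter of a percent:
print(permyriad_to_percent(25))  # 0.25
```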

Many zones in Asia are what some call underserved. I.e., too few servers for too many clients (and the current distribution mechanism of the pool unfortunately kind of locks clients into their zone, be it when they use the global pool zone, be it when they use the country zone). In Germany, with a netspeed setting of 3Gbit/s, I get just a little bit over 2Mbit/s. In Singapore, with a netspeed setting of 3Mbit/s, I get peaks above 8Mbit/s.

Note also that it may not be the "speed" of the incoming traffic that is the issue: NTP traffic consists of many rather small packets, which many firewall/router devices cannot handle very well, especially consumer devices. (There are tips on how to optimize, e.g., turning off connection tracking if you use a Linux-based router/firewall, but that does not make the problem go away completely.)
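As one concrete example of the connection-tracking tip for Linux-based routers, here is a sketch using nftables; the table and chain names are arbitrary, and details depend on your distribution and existing ruleset:

```shell
# Exempt NTP (UDP port 123) from connection tracking so each packet does
# not create a conntrack entry. The notrack rules must sit in a chain on
# the prerouting hook at "raw" priority, i.e., before conntrack runs.
nft add table inet rawfilter
nft add chain inet rawfilter prerouting '{ type filter hook prerouting priority raw; }'
nft add rule inet rawfilter prerouting udp dport 123 notrack
nft add rule inet rawfilter prerouting udp sport 123 notrack
```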

And by the way, welcome to the forum and the project!


I've had my netspeed set to 512 Kbit for about 12 hours and I still get timeouts.
Is it my ISP's problem? I don't even have a firewall.
Also, looking at the history, I get timeouts periodically.
I wonder if I should think about switching ISPs.

Could you describe your infrastructure, please? What router and what NTP server software are you using?

I already see that you have UniFi's Intrusion Detection and Prevention function switched off.

How much traffic are you getting with that setting?

Again looking at the monitor logs, the pattern of the score ramping up to slightly above 10, then quickly dropping below 10 again, suggests that the issue is most likely related to the traffic volume (actually, probably more the packet rate) you get.

As @NTPman asked, info about your local infrastructure and uplink type would be of interest.

I note that ICMP echo requests (pings) to your IP address are being blocked, which limits a bit the ability to troubleshoot from the outside.

But you can run a ping to some typically well-connected, nearby destination yourself. My guess would be that latency, jitter, and possibly also packet drop would increase in line with the server exceeding a score of 10.

I’ve quickly compiled a list of Internet users per number of servers in a country, and unfortunately, South Korea does not have a good ratio. (Note that this is only a very rough indication, as, e.g., users may have multiple devices, clients often use multiple servers, some devices use non-pool servers, it looks at IPv4 only, …)

Country       Internet users*   Pool servers**   Users per server
Belgium            10021242           20              501062
China            1102140000           33            33398182
Germany            77794405          498              156214
Philippines        73003313            9             8111479
Singapore           4821119           25              192845
South Korea        49421084            9             5491232

* List of countries by number of Internet users - Wikipedia
** as of about 2024-10-14 ~19:15 UTC
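The "users per server" column is just the quotient of the two preceding columns; a quick sketch reproducing it (numbers taken from the table above):

```python
# Internet users and active pool server counts from the table above.
zones = {
    "Belgium":     (10021242, 20),
    "China":       (1102140000, 33),
    "Germany":     (77794405, 498),
    "Philippines": (73003313, 9),
    "Singapore":   (4821119, 25),
    "South Korea": (49421084, 9),
}

for country, (users, servers) in zones.items():
    print(f"{country}: {round(users / servers)} users per server")
# South Korea ends up around 5,491,232 users per server,
# roughly 35x worse than Germany's ~156,214.
```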

So might be a chicken-and-egg problem like in other countries: There are too few servers in the zone, so a single server gets too much traffic. That prevents adding more servers to the zone, which would be needed to alleviate the problem. (Unless there are other reasons as well that hamper people running their own servers, like in China.)

As finally happened in China recently, big infrastructure providers would currently need to step in to take the brunt of the load. That anyhow seems to be the case in Korea, where four servers seem to take most of the load (at least judging by the frequency with which server IP addresses are returned by the pool).

The issue of underserved zones is actually one of the most pressing issues that in my view needs to get solved by the pool: That users are not strictly locked to only access servers in their own zone by default. There are multiple cases documented in this forum that highlight different kinds of issues this is causing.

It’s been silent on that front recently, not sure whether that is because the issue was solved (like for the China zone, where a big infrastructure provider stepped in), or people just gave up because of lack of response from the pool (like possibly for the Philippines or Vietnam zones).

Note that the widespread suggestion to simply use the global pool pool.ntp.org does NOT address the issue of clients by default being locked to servers in their own zone only. That fact is just not that obvious as with explicit use of a country zone, but well-documented. Also, that suggestion would at best help clients get better service, but not server operators of overwhelmed servers, as they have no way to control how clients are getting configured.

ISP - UDM SE - NTP server
My ISP guarantees 1 Gbit/s round-trip speed over fiber. On average, I get about 950 Mbps both up and down.

My server is an NTP stratum 1 server configured with GPS+PPS.
(It's still being tested, so it is currently syncing from another NTP stratum 1 server.)

Yes, that's correct.
The moment the score reaches 10 points, the timeouts start. As soon as it drops back below 10 points, the timeouts stop.
This cycle repeats itself indefinitely.

I also changed the netspeed to "Monitoring only" and waited about 12 hours. I never had a single timeout during those 12 hours.

Maybe this happens because my server gets many requests due to the lack of NTP servers in Korea, as you guessed.
I think I need to find a solution to this.

This speed is typically for large packets, e.g., file upload/download, video streaming, etc. NTP is many small packets, which, when handled on the CPU (vs. with hardware acceleration), creates a bottleneck.
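To illustrate why a 950 Mbps speed-test result says little about NTP capacity, here is a rough packets-per-second sketch (the frame sizes are approximations: 48-byte NTP payload plus UDP/IPv4/Ethernet headers, ignoring preamble and inter-frame gap):

```python
NTP_FRAME_BYTES = 48 + 8 + 20 + 14   # NTP + UDP + IPv4 + Ethernet ~ 90 bytes
BULK_FRAME_BYTES = 1500              # typical large frame in a speed test

def packets_per_second(bits_per_second, frame_bytes):
    """How many frames of a given size fit into a given bitrate."""
    return bits_per_second / (frame_bytes * 8)

# The same 1 Gbit/s line carries vastly more NTP packets than bulk frames,
# and each packet may hit the router's CPU path (NAT, firewall, conntrack):
print(round(packets_per_second(1e9, BULK_FRAME_BYTES)))  # ~83,333 pps
print(round(packets_per_second(1e9, NTP_FRAME_BYTES)))   # ~1,388,889 pps
```

So a device that comfortably routes 1 Gbit/s of bulk traffic may choke far below that bitrate once the traffic is all small NTP packets.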

E.g., if your uplink is based on some “tunnel” protocol (such as PPPoE) vs., e.g., native IP over Ethernet, or if the device implements NAT or IDS/IPS or a firewall, that would typically run on the CPU on many types of devices.

E.g., note the difference in throughput for routed traffic between large and small packets for an example device, even with L3HW-offloaded routing. This is just to illustrate the general issue, not to make a statement about that particular device or vendor; they simply happen to publish that kind of information, and the issue affects more or less all devices. The (scant) publicly available documentation of your device also hints that it is not much different in this respect.

IDS/IPS/firewall are disabled as you write, what about NAT? Even if it is not used/needed, it might still be active. And even just plain L3 forwarding between IP subnets might be an “issue”, see example above.

So what would be interesting is the NTP traffic throughput you see on your server, e.g., at the 512 Kbit/s "netspeed" setting.

The device you use seems to be enterprise grade, so hopefully more performant and functionally more capable than a consumer device. Can you measure the traffic throughput on the outside of the device, on the uplink side (towards your ISP)? Then one could see a potential discrepancy between what comes from your ISP and how much of that gets through your device to your NTP server (rather than getting dropped by the device itself).

What is the HW/SW (OS, NTP implementation) of your NTP server?

In some way, the server phasing in and out of the pool might to some extent be considered a cosmetic issue only (red dots in the graph, and lines going up and down obviously don’t look good).

Sure, when there is overload, not only will the monitors notice packet drops, but so will actual clients. But clients are more tolerant to this than the monitors are.

And the pool system kind of implements a control loop to manage the acceptable load on the server, even if it is rather coarse right now, on and off only. (I previously suggested smoothing the adding and removing of traffic to servers a bit for just that reason, though I understand that seems not so easy to implement, based on feedback by Ask towards the (current) end of the thread.)

At least in the Singapore zone, from patterns observed there, it is my impression that some servers are run with too high a “netspeed” setting, and thus keep phasing into and out of the pool continuously. But one can still get good time from them. Not pretty, but still good enough for many/most clients.

This all doesn’t matter.

Your ISP could have a router/firewall in place that thinks your server is being attacked and as such blocks IPs.

There is only 1 way to be sure: Call your ISP and ask :laughing:

I have had this before and called my provider to tell them my server was providing services and wasn't under any attack.
They then white-listed my IPs and I haven't had any problems since.

As said, call them, explain what you are doing.

Setting the netspeed won’t change matters if your ISP ‘protects’ you.

The ISP answered that they don't know anything. Funny.
I think it's probably Korea's internet network-fee issue. This is a very big issue; Cloudflare has also covered it, and the Twitch service in Korea ended this year because of it.

The ISP I'm using is one that doesn't have undersea cables out of Korea.
So I think they have set traffic restrictions on overseas access. I have therefore decided to switch next month to an ISP that has undersea cables and no traffic restrictions on overseas access.

Until then, I have set up my firewall so the server is accessible only from some Asian countries.

Maybe they simply don’t have any end user protection mechanisms in place. Those may not be mandatory, but at the discretion of each individual ISP.

Also, such protections typically either block ports entirely, i.e., no traffic goes through at all; e.g., NetBIOS or SMTP might be blocked flat out.

Or they detect a certain pattern, e.g., the traffic volume/bitrate of a certain traffic type exceeding a certain limit. Then they would typically block all traffic that is considered a threat, not let a bit of it through once protections are activated.

The NTP port is reachable in general, so it is not the first type of protection.

Also, when packet drops start, it seems as if some packets still get through. So I think it unlikely (not impossible) that it would be the second type of protection.

And if it were, reducing the netspeed could help because it might keep the traffic under the detection threshold.

But given the client/server ratio in South Korea, even the lowest setting might still cause enough traffic to trigger the protection mechanism and/or simply overwhelm some part of the system, causing those packet drops.

So it would still be of interest to learn how much traffic you are getting, e.g., at the 512 Kbit/s “netspeed” setting, regardless of your already scheduled move to another ISP.

Ok, I can’t assess how that might affect the NTP traffic in the way it manifests.

You would mostly get traffic from South Korea only, due to how the pool assigns clients to servers. While there might be the stray client from elsewhere, e.g., because someone outside South Korea explicitly chose to configure their client with the country zone or uses a privacy-minded DNS service, or because the pool’s GeoDNS falsely locates a client in South Korea, that should be rather marginal, and only affect those users. As the monitors’ probe packets get through as long as the score is below 10, that also suggests this is unlikely (but again, not impossible, and be it through some side effect) to be the issue here.

But again, YMMV, can’t seriously assess the local situation from afar.

OK, then any further analysis at this point is best deferred until you have been moved over, to rule out one possible area of concern.

Maybe you’ve already found out, but that would likely also block the monitors’ probe packets, making your score drop - unless you make enough individual or broad enough exceptions to let traffic from a sufficient number of monitors through. Which in turn is typically frowned upon as the monitors’ view should as truthfully as possible reflect what “real” clients would see.

Also, as re-iterated above, due to the way the pool currently assigns clients to servers, the traffic volume from clients outside your zone would likely be too small to make a relevant difference (YMMV).

Keep us updated, as it will help others if you get it sorted.

That UniFi Dream Machine SE is acting as a firewall, isn't it? In a later post, you mentioned blocking all traffic except a few countries (US and Korea, I'm guessing); that implies a firewall.

Take a look at the CPU and memory usage of the UDM. You can see it in the side panel under the insights tab for the router itself. Start at, e.g. https://[UDM_IP]/network/default/devices and click on the router, then on the side panel that pops up, the middle tab is insights. Expand the “System statistics” at the bottom to see it:

Turning off the geolocation features of the UDM should help with both CPU load and marginally the NTP delay.

You might be able to further decrease the router CPU load and the NTP delay by adding a firewall rule to allow all UDP destination port 123 in to all addresses and source port 123 out to all to short-circuit the firewall filtering.


Ah, very true. Running a GeoIP lookup on all incoming NTP packets probably hurts the cause way more than just accepting those (relatively) few packets from out of zone ever could…


I could not reach the time.ravnus.com NTP server from my NTP monitors in London, Bangalore, and Frankfurt. Other NTP servers in Korea were reachable.
time.ravnus.com was reachable from my US-based monitors, but had higher losses than the other Korean servers.

I tried using traceroute to localize the losses, but either:
– traceroute was disabled (routers don't emit TTL-expired ICMPs),
– router links used RFC 1918 addresses (cannot be mapped to an ASN), or
– router links used RFC 6598 addresses (same problem).
Any other suggestions?

I currently block access from countries other than Korea, Japan, Taiwan, Singapore, and the United States, so it is expected that access from other countries fails.
I'm waiting for the ISP change now, because I think my current ISP is restricting traffic going out of the country. It takes about a month to change ISPs.

I have now turned off the firewall so that the server can be reached from all countries again.
Can you tell me how it performs from countries other than Korea?
I want to compare before and after the ISP change.