"i/o timeout" from different monitoring stations

Well, it’s not Telenet for sure; see what happens when I query your machine:

root@workstation:~# ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset -0.002199, delay 0.04863
16 Oct 08:55:36 ntpdate[4863]: adjust time server 81.83.23.34 offset -0.002199 sec
root@workstation:~#

I’m coming from EDPnet.

It works fine.

I checked your monitor status again.
It looks like some router (somewhere) is blocking UDP traffic: roughly once per hour the monitor passes, then it is blocked again.
This pattern seems to repeat.
However, I tried the ntpdate command repeatedly and it passes every time, no problem at all.

It’s not Telenet doing it, just look:

bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset 0.000722, delay 0.04694
16 Oct 12:07:56 ntpdate[6898]: adjust time server 81.83.23.34 offset 0.000722 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset 0.000110, delay 0.04807
16 Oct 12:08:00 ntpdate[6899]: adjust time server 81.83.23.34 offset 0.000110 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset 0.000640, delay 0.04805
16 Oct 12:08:05 ntpdate[6900]: adjust time server 81.83.23.34 offset 0.000640 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset 0.004339, delay 0.05460
16 Oct 12:08:09 ntpdate[6903]: adjust time server 81.83.23.34 offset 0.004339 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset 0.000970, delay 0.04657
16 Oct 12:08:12 ntpdate[6904]: adjust time server 81.83.23.34 offset 0.000970 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset 0.000113, delay 0.04800
16 Oct 12:08:17 ntpdate[6905]: adjust time server 81.83.23.34 offset 0.000113 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset 0.000592, delay 0.04706
16 Oct 12:08:20 ntpdate[6906]: adjust time server 81.83.23.34 offset 0.000592 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset -0.001216, delay 0.05058
16 Oct 12:08:21 ntpdate[6907]: adjust time server 81.83.23.34 offset -0.001216 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset 0.001088, delay 0.04745
16 Oct 12:08:23 ntpdate[6908]: adjust time server 81.83.23.34 offset 0.001088 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset 0.000454, delay 0.04713
16 Oct 12:08:24 ntpdate[6909]: adjust time server 81.83.23.34 offset 0.000454 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset 0.000248, delay 0.04843
16 Oct 12:08:26 ntpdate[6910]: adjust time server 81.83.23.34 offset 0.000248 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset -0.000791, delay 0.05057
16 Oct 12:09:49 ntpdate[6958]: adjust time server 81.83.23.34 offset -0.000791 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset 0.000299, delay 0.05293
16 Oct 12:09:50 ntpdate[6959]: adjust time server 81.83.23.34 offset 0.000299 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset 0.000899, delay 0.04782
16 Oct 12:09:53 ntpdate[6960]: adjust time server 81.83.23.34 offset 0.000899 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset -0.000032, delay 0.04848
16 Oct 12:09:54 ntpdate[6962]: adjust time server 81.83.23.34 offset -0.000032 sec
bas@workstation:~/Bureaublad$

It passes every time.

I also tested your port:

root@server:~# nmap -sU -p123 81.83.23.34

Starting Nmap 7.40 ( https://nmap.org ) at 2019-10-16 13:38 CEST
Nmap scan report for kenny.savanger.be (81.83.23.34)
Host is up (0.021s latency).
PORT    STATE SERVICE
123/udp open  ntp

Nmap done: 1 IP address (1 host up) scanned in 0.69 seconds
root@server:~#

Bas.

Dear All,

The same situation seems to prevail when trying to add servers from Germany hosted in networks like Unitymedia, Vodafone, or Deutsche Telekom. Trying to add 130.180.22.90, I also see about 90% of port 123 UDP packets blocked by Telia:

9.|-- nyk-bb2-link.telia.net 90.0% 10 109.4 109.4 109.4 109.4 0.0

I wonder whether this is a mistake or a deliberate action to prevent DDoS attacks, by either Telia or the local WAN provider.

Without any contractual relationship to Telia, this seems to imply that in practice one is unable to contribute to the NTP Pool by joining the network in such a situation, correct? Against my real intention, I will therefore delete the IP.

Regards,

Michael


Many of us understand and share your frustration. However, when traffic conditions are better your server may be included in the pool temporarily, and there is a chance that the monitoring or the Internet connectivity situation improves one day.

In the beta system the New Jersey monitor now retries the query on timeouts (and other network errors) and will report success if just one of three queries succeeds.

I’m curious if this will perform better or if the network trouble is “all on” and “all off” for longer than the 4-7 seconds a test takes. (cc @stevesommars)
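For anyone who wants to picture the logic, here is a rough shell sketch of that “one of three” retry idea, using nothing but the ntpdate command already shown in this thread and the server address quoted above; it is only an illustration under those assumptions, not the monitor’s actual code:

#!/bin/sh
# Rough sketch: treat the whole check as OK if any one of up to three
# queries succeeds, waiting about 2 seconds between attempts.
# 81.83.23.34 is simply the server quoted earlier in this thread.
server=81.83.23.34
for attempt in 1 2 3; do
    if ntpdate -q "$server" >/dev/null 2>&1; then
        echo "check OK on attempt $attempt"
        exit 0
    fi
    sleep 2
done
echo "check failed: all three attempts timed out or errored"
exit 1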


I’m running a server in the German DFN (German research network) and I’m also having issues with negative scores and the like.

https://www.ntppool.org/scores/193.175.73.20

Currently I’m back to a positive score, but it oscillates…

Same update on the Amsterdam beta monitor since a few minutes ago.


The timeouts seem to be probabilistic, not on/off. I’ll keep looking.

I wonder if doing ten or twenty queries over a minute would help then…
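For anyone who wants to try that idea by hand, a simple sketch along those lines (again only an illustration with ntpdate against the server quoted earlier, not the monitor’s logic) would be to spread the probes out and count successes:

#!/bin/sh
# Sketch of the "many queries over a minute" idea: 20 probes roughly
# 3 seconds apart, counting how many succeed.
server=81.83.23.34
ok=0
for i in $(seq 1 20); do
    if ntpdate -q "$server" >/dev/null 2>&1; then
        ok=$((ok+1))
    fi
    sleep 3
done
echo "$ok of 20 probes succeeded"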

Feedback:
A much better monitoring result for the same host via the beta system than via the production one:

[two screenshots: score graphs for the same host on the beta and production systems]

Ok, that graph is promising, but unless I am missing something I don’t see it clearly in the monitoring data. It was about 09:00 on the 15th that I made the change.

-- Hourly ratio of probes with a non-positive score step ("nope") versus a
-- positive step ("ok") for monitor_id 20.
select
  d, h,
  (100/count(*))*sum(nope) as bad_ratio,
  sum(ok), sum(nope)
from
(
  select date(ts) as d, hour(ts) as h,
         if(step > 0, 1, 0) as ok,
         if(step <= 0, 1, 0) as nope
  from log_scores
  where ts > '2020-01-13'
    and monitor_id = 20
    and score > 0
) as t2
group by d, h;

Gives something like this (full CSV attached):

d h bad_ratio sum(ok) sum(nope)
2020-01-15 0 1.1168 1151 13
2020-01-15 1 2.4893 1371 35
2020-01-15 2 1.9708 1343 27
2020-01-15 3 0.0742 1347 1
2020-01-15 4 0.5352 1115 6
2020-01-15 5 0.4441 1345 6
2020-01-15 6 1.0424 1329 14
2020-01-15 7 0.8627 1264 11
2020-01-15 8 0.503 989 5
2020-01-15 9 1.992 738 15
2020-01-15 10 1.3605 725 10
2020-01-15 11 2.8289 687 20
2020-01-15 12 1.108 714 8
2020-01-15 13 2.0633 712 15
2020-01-15 14 1.2312 722 9
2020-01-15 15 2.2059 665 15
2020-01-15 16 1.3267 595 8
2020-01-15 17 1.0654 650 7
2020-01-15 18 1.4837 664 10
2020-01-15 19 1.3333 666 9
2020-01-15 20 0.5722 695 4
2020-01-15 21 1.0072 688 7
2020-01-15 22 0.7225 687 5
2020-01-15 23 2.0649 664 14
2020-01-16 0 1.3393 663 9
2020-01-16 1 1.585 683 11
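
As a sanity check on the bad_ratio column: for 2020-01-15 hour 0 it works out to 100 * 13 / (1151 + 13) ≈ 1.1168, matching the value in the first row.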

For what it’s worth, the similar BigQuery query is this:

select 
  d, h,
  (100/count(*))*sum(nope) as bad_ratio,
  sum(ok),sum(nope)
from
(
  select date(ts) as d, extract(hour from ts) as h, if(step > 0, 1, 0) as ok, if(step <= 0, 1, 0) as nope
  from ntppool.ntpbeta.log_scores where ts > '2020-01-13'
  and monitor_id = 20
  and score > 0
) as t2
group by d,h
order by d,h

Despite not seeing clearly in the data that it made a difference, and since it also didn’t get worse, I have now updated the production monitor to use the new retry logic as well, to gather more data.

Analysis of the data set via BigQuery, as shown above (replace ‘ntpbeta’ with ‘ntppool’), is very welcome.
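
For reference, one hypothetical way to run that query from a shell, assuming the Google Cloud SDK (bq) is installed, you have read access to the dataset, and the query above has been saved to a file called bad_ratio.sql (the file name is only illustrative):

# Run the saved BigQuery query using standard SQL syntax.
bq query --use_legacy_sql=false < bad_ratio.sql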


At first look it’s better, but the polling interval has increased and it now takes much longer to recover.

1579289104,"2020-01-17 19:25:04",-0.001040629,1,14.6,6,"Newark, NJ, US",0,
1579289104,"2020-01-17 19:25:04",-0.001040629,1,14.6,,,0,
1579273070,"2020-01-17 14:57:50",0.00027202,1,14.3,6,"Newark, NJ, US",0,
1579273070,"2020-01-17 14:57:50",0.00027202,1,14.3,,,0,
1579256967,"2020-01-17 10:29:27",0,-5,14,6,"Newark, NJ, US",,"i/o timeout"
1579256967,"2020-01-17 10:29:27",0,-5,14,,,,"i/o timeout"
1579252473,"2020-01-17 09:14:33",0.000585068,1,20,6,"Newark, NJ, US",0,
1579252473,"2020-01-17 09:14:33",0.000585068,1,20,,,0,
1579251566,"2020-01-17 08:59:26",0.000619647,1,20,6,"Newark, NJ, US",0,
1579251566,"2020-01-17 08:59:26",0.000619647,1,20,,,0,
1579250657,"2020-01-17 08:44:17",0.00257311,1,20,6,"Newark, NJ, US",0,

[screenshot, 2020-01-18]

Yup, seeing the same: the polling interval in the live system is now roughly 4.5 hours instead of 15 minutes.

[screenshot: live-system monitoring graph showing the longer polling interval]

Whoops! I completely missed that while staring at log files.

Each check has a 2-second delay between queries, as per the RFC. When only one query was done per server, the checks were serial but ran immediately after each other. With a 2-second delay between 3 samples, each check now takes just over 4 seconds (two 2-second gaps plus the queries themselves), instead of the typical few milliseconds to a few hundred.

I also fixed a bug in the change from a few days ago, so the new logic that ignores timeouts when there is at least one good response now actually works.


Morning,

I’m running a LeoNTP stratum 1 server on a 1 Gbps FTTP connection in the UK. It has been rock solid for a year: about 4k responses per second on average, and always a constant score of 20.

Since the 15th it has been all over the place:

[screenshot: score graph, 19 Jan 2020]

I have checked everything, even changed the router (the NTP server is the only device on that feed/router), latest firmware, etc.

I’m not sure what has caused this change. Checks, apart from Newark, show the system is responding correctly and in a timely manner. I am going to try the beta system. Does it actually contribute to the pool, or is it purely for testing new monitoring software and a different location from Newark?

Does anyone have any advice, or anything else to try?

Many thanks in advance.

Dave.

The beta pool is strictly for monitoring.

You can try running mtr from your IP to the monitor (be sure to use UDP/123) and see if the packets are getting lost along the way.

https://dev.ntppool.org/monitoring/network-debugging/

The monitor IP is in the link above; you can also have it run a traceroute back to you, so you can test both directions…
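
For illustration, a minimal mtr invocation along those lines; the address below is only a placeholder from the documentation range, so substitute the monitor IP from the network-debugging page above:

# Placeholder address; replace with the real monitor IP from the page above.
MONITOR_IP=203.0.113.10
# Send 100 UDP probes to port 123 and print the result as a report.
mtr --udp --port 123 --report --report-cycles 100 "$MONITOR_IP"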

Thank you very much indeed for these pointers, and the result is worrying:

It would appear that something is wrong between the UK and the US, at zayo.com, am I correct? Or is that normal at that point, i.e. the core routers are too busy to answer low-priority traffic? I have left it running for 2000 packets so far (more than the appended image shows) and see zero loss end to end, if that makes sense.

Here are the latest results, live:

[screenshot: live-system score graph, 19 Jan 2020]

and the beta system (note that the first drop-out was ‘intentional’; I re-flashed the LeoNTP firmware, just to be sure):

[screenshot: beta-system score graph, 19 Jan 2020]

I will of course let the Beta run longer. I will try to find the IP address of the Beta monitor and run an MTR on that…

I am no expert at interpreting traceroute/mtr output. What do you make of the outputs shown above? It does seem strange that everything was perfect for a year and then, bang, all over the place from the middle of this week onwards, with absolutely no change whatsoever at this end.

Many thanks in advance, littlejason; I really appreciate your time and help.

Dave.

Update: the mtr results after 150 minutes of continuous monitoring:

Many thanks in advance…

Hi Dave, it looks like the issue mostly went away early this morning. Both the live and the beta system show only one or two drops (apart from when you were re-flashing) since about 9am today.