"i/o timeout" from different monitoring stations

Well, it’s not Telenet for sure; see what happens when I query your machine:

root@workstation:~# ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset -0.002199, delay 0.04863
16 Oct 08:55:36 ntpdate[4863]: adjust time server 81.83.23.34 offset -0.002199 sec
root@workstation:~#

I’m coming from EDPnet.

It works fine.

I checked your monitor status again.
It looks like some router (somewhere) is blocking UDP traffic: roughly once per hour the monitor passes, then it is blocked again.
This pattern seems to repeat.
However, I tried the ntpdate command repeatedly and it passes every time, no problem at all.

It’s not Telenet doing it, just look:

bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset 0.000722, delay 0.04694
16 Oct 12:07:56 ntpdate[6898]: adjust time server 81.83.23.34 offset 0.000722 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset 0.000110, delay 0.04807
16 Oct 12:08:00 ntpdate[6899]: adjust time server 81.83.23.34 offset 0.000110 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset 0.000640, delay 0.04805
16 Oct 12:08:05 ntpdate[6900]: adjust time server 81.83.23.34 offset 0.000640 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset 0.004339, delay 0.05460
16 Oct 12:08:09 ntpdate[6903]: adjust time server 81.83.23.34 offset 0.004339 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset 0.000970, delay 0.04657
16 Oct 12:08:12 ntpdate[6904]: adjust time server 81.83.23.34 offset 0.000970 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset 0.000113, delay 0.04800
16 Oct 12:08:17 ntpdate[6905]: adjust time server 81.83.23.34 offset 0.000113 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset 0.000592, delay 0.04706
16 Oct 12:08:20 ntpdate[6906]: adjust time server 81.83.23.34 offset 0.000592 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset -0.001216, delay 0.05058
16 Oct 12:08:21 ntpdate[6907]: adjust time server 81.83.23.34 offset -0.001216 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset 0.001088, delay 0.04745
16 Oct 12:08:23 ntpdate[6908]: adjust time server 81.83.23.34 offset 0.001088 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset 0.000454, delay 0.04713
16 Oct 12:08:24 ntpdate[6909]: adjust time server 81.83.23.34 offset 0.000454 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset 0.000248, delay 0.04843
16 Oct 12:08:26 ntpdate[6910]: adjust time server 81.83.23.34 offset 0.000248 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset -0.000791, delay 0.05057
16 Oct 12:09:49 ntpdate[6958]: adjust time server 81.83.23.34 offset -0.000791 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset 0.000299, delay 0.05293
16 Oct 12:09:50 ntpdate[6959]: adjust time server 81.83.23.34 offset 0.000299 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset 0.000899, delay 0.04782
16 Oct 12:09:53 ntpdate[6960]: adjust time server 81.83.23.34 offset 0.000899 sec
bas@workstation:~/Bureaublad$ ntpdate -q 81.83.23.34
server 81.83.23.34, stratum 2, offset -0.000032, delay 0.04848
16 Oct 12:09:54 ntpdate[6962]: adjust time server 81.83.23.34 offset -0.000032 sec
bas@workstation:~/Bureaublad$

It passes every time.

I also tested your port:

root@server:~# nmap -sU -p123 81.83.23.34

Starting Nmap 7.40 ( https://nmap.org ) at 2019-10-16 13:38 CEST
Nmap scan report for kenny.savanger.be (81.83.23.34)
Host is up (0.021s latency).
PORT    STATE SERVICE
123/udp open  ntp

Nmap done: 1 IP address (1 host up) scanned in 0.69 seconds
root@server:~#

Bas.

Dear All,

The same situation seems to prevail when trying to add servers from Germany hosted in networks like Unitymedia, Vodafone, or Deutsche Telekom. Trying to add 130.180.22.90, I also see about 90% of port 123 UDP packets blocked by Telia:

9.|-- nyk-bb2-link.telia.net 90.0% 10 109.4 109.4 109.4 109.4 0.0

I wonder whether this is a mistake or a deliberate action to prevent DDoS attacks, by either Telia or the local WAN provider.

Without any contractual relationship to Telia, this seems to imply that in practice one is unable to contribute to the NTP Pool by joining the network in such a situation, correct? Against my real intention, I will therefore delete the IP.

Regards,

Michael


Many of us understand and share your frustration. However, when traffic conditions are better your server may be included in the pool temporarily, and there is a chance that the monitoring or the Internet connectivity situation improves one day.

In the beta system the New Jersey monitor now retries the query on timeouts (and other network errors) and will report success if just one of three queries succeeds.

I’m curious if this will perform better or if the network trouble is “all on” and “all off” for longer than the 4-7 seconds a test takes. (cc @stevesommars)
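For anyone who wants to picture the logic, here is a rough shell sketch of that “one of three” retry idea, using nothing but the ntpdate command already shown in this thread and the server address quoted above; it is only an illustration under those assumptions, not the monitor’s actual code:

#!/bin/sh
# Rough sketch: treat the whole check as OK if any one of up to three
# queries succeeds, waiting about 2 seconds between attempts.
# 81.83.23.34 is simply the server quoted earlier in this thread.
server=81.83.23.34
for attempt in 1 2 3; do
    if ntpdate -q "$server" >/dev/null 2>&1; then
        echo "check OK on attempt $attempt"
        exit 0
    fi
    sleep 2
done
echo "check failed: all three attempts timed out or errored"
exit 1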


I’m running a server in the German DFN (German research network) and I’m also having issues with negative scores and the like.

https://www.ntppool.org/scores/193.175.73.20

Currently I’m back to a positive score, but it oscillates…

Same update on the Amsterdam beta monitor since a few minutes ago.


The timeouts seem to be probabilistic, not on/off. I’ll keep looking.

I wonder if doing ten or twenty queries over a minute would help then…
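For anyone who wants to try that idea by hand, a simple sketch along those lines (again only an illustration with ntpdate against the server quoted earlier, not the monitor’s logic) would be to spread the probes out and count successes:

#!/bin/sh
# Sketch of the "many queries over a minute" idea: 20 probes roughly
# 3 seconds apart, counting how many succeed.
server=81.83.23.34
ok=0
for i in $(seq 1 20); do
    if ntpdate -q "$server" >/dev/null 2>&1; then
        ok=$((ok+1))
    fi
    sleep 3
done
echo "$ok of 20 probes succeeded"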

Feedback:
A much better monitoring result for the same host via the beta system than via the production one:

[two screenshots: score graphs for the same host on the beta and production systems]

Ok, that graph is promising, but unless I am missing something I don’t see it clearly in the monitoring data. It was about 09:00 on the 15th that I made the change.

-- Hourly ratio of probes with a non-positive score step ("nope") versus a
-- positive step ("ok") for monitor_id 20.
select
  d, h,
  (100/count(*))*sum(nope) as bad_ratio,
  sum(ok), sum(nope)
from
(
  select date(ts) as d, hour(ts) as h,
         if(step > 0, 1, 0) as ok,
         if(step <= 0, 1, 0) as nope
  from log_scores
  where ts > '2020-01-13'
    and monitor_id = 20
    and score > 0
) as t2
group by d, h;

Gives something like this (full CSV attached):

d h bad_ratio sum(ok) sum(nope)
2020-01-15 0 1.1168 1151 13
2020-01-15 1 2.4893 1371 35
2020-01-15 2 1.9708 1343 27
2020-01-15 3 0.0742 1347 1
2020-01-15 4 0.5352 1115 6
2020-01-15 5 0.4441 1345 6
2020-01-15 6 1.0424 1329 14
2020-01-15 7 0.8627 1264 11
2020-01-15 8 0.503 989 5
2020-01-15 9 1.992 738 15
2020-01-15 10 1.3605 725 10
2020-01-15 11 2.8289 687 20
2020-01-15 12 1.108 714 8
2020-01-15 13 2.0633 712 15
2020-01-15 14 1.2312 722 9
2020-01-15 15 2.2059 665 15
2020-01-15 16 1.3267 595 8
2020-01-15 17 1.0654 650 7
2020-01-15 18 1.4837 664 10
2020-01-15 19 1.3333 666 9
2020-01-15 20 0.5722 695 4
2020-01-15 21 1.0072 688 7
2020-01-15 22 0.7225 687 5
2020-01-15 23 2.0649 664 14
2020-01-16 0 1.3393 663 9
2020-01-16 1 1.585 683 11
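
As a sanity check on the bad_ratio column: for 2020-01-15 hour 0 it works out to 100 * 13 / (1151 + 13) ≈ 1.1168, matching the value in the first row.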

For what it’s worth, the similar BigQuery query is this:

select 
  d, h,
  (100/count(*))*sum(nope) as bad_ratio,
  sum(ok),sum(nope)
from
(
  select date(ts) as d, extract(hour from ts) as h, if(step > 0, 1, 0) as ok, if(step <= 0, 1, 0) as nope
  from ntppool.ntpbeta.log_scores where ts > '2020-01-13'
  and monitor_id = 20
  and score > 0
) as t2
group by d,h
order by d,h

Despite not seeing clearly in the data that it made a difference, and since it also didn’t get worse, I have now updated the production monitor to use the new retry logic as well, to gather more data.

Analysis of the data set via BigQuery, as shown above (replace ‘ntpbeta’ with ‘ntppool’), is very welcome.
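
For reference, one hypothetical way to run that query from a shell, assuming the Google Cloud SDK (bq) is installed, you have read access to the dataset, and the query above has been saved to a file called bad_ratio.sql (the file name is only illustrative):

# Run the saved BigQuery query using standard SQL syntax.
bq query --use_legacy_sql=false < bad_ratio.sql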


At first look it’s better, but the polling interval has increased and it now takes much longer to recover.

1579289104,"2020-01-17 19:25:04",-0.001040629,1,14.6,6,"Newark, NJ, US",0,
1579289104,"2020-01-17 19:25:04",-0.001040629,1,14.6,,,0,
1579273070,"2020-01-17 14:57:50",0.00027202,1,14.3,6,"Newark, NJ, US",0,
1579273070,"2020-01-17 14:57:50",0.00027202,1,14.3,,,0,
1579256967,"2020-01-17 10:29:27",0,-5,14,6,"Newark, NJ, US",,"i/o timeout"
1579256967,"2020-01-17 10:29:27",0,-5,14,,,,"i/o timeout"
1579252473,"2020-01-17 09:14:33",0.000585068,1,20,6,"Newark, NJ, US",0,
1579252473,"2020-01-17 09:14:33",0.000585068,1,20,,,0,
1579251566,"2020-01-17 08:59:26",0.000619647,1,20,6,"Newark, NJ, US",0,
1579251566,"2020-01-17 08:59:26",0.000619647,1,20,,,0,
1579250657,"2020-01-17 08:44:17",0.00257311,1,20,6,"Newark, NJ, US",0,

[screenshot, 2020-01-18]

Yup, seeing the same: the polling interval in the live system is now roughly 4.5 hours instead of 15 minutes.

[screenshot: live-system monitoring graph showing the longer polling interval]

Whoops! I completely missed that while staring at log files.

Each check has a 2-second delay between queries, as per the RFC. When only one query was done per server, the checks were serial but ran immediately after each other. With a 2-second delay between 3 samples, each check now takes just over 4 seconds (two 2-second gaps plus the queries themselves), instead of the typical few milliseconds to a few hundred.

I also fixed a bug in the change from a few days ago, so the new logic that ignores timeouts when there is at least one good response now actually works.


Morning,

I’m running a LeoNTP stratum 1 server on a 1 Gbps FTTP connection in the UK. It has been rock solid for a year: about 4k responses per second on average, and always a constant score of 20.

Since the 15th it has been all over the place:

[screenshot: score graph, 19 Jan 2020]

I have checked everything, even changed the router (the NTP server is the only device on that feed/router), latest firmware, etc.

I’m not sure what has caused this change. Checks, apart from Newark, show the system is responding correctly and in a timely manner. I am going to try the beta system. Does it actually contribute to the pool, or is it purely for testing new monitoring software and a different location from Newark?

Does anyone have any advice, or anything else to try?

Many thanks in advance.

Dave.

The beta pool is strictly for monitoring.

You can try running mtr from your IP to the monitor (be sure to use UDP/123) and see if the packets are getting lost along the way.

https://dev.ntppool.org/monitoring/network-debugging/

The monitor IP is in the link above; you can also have it run a traceroute back to you, so you can test both directions…
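
For illustration, a minimal mtr invocation along those lines; the address below is only a placeholder from the documentation range, so substitute the monitor IP from the network-debugging page above:

# Placeholder address; replace with the real monitor IP from the page above.
MONITOR_IP=203.0.113.10
# Send 100 UDP probes to port 123 and print the result as a report.
mtr --udp --port 123 --report --report-cycles 100 "$MONITOR_IP"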

Thank you very much indeed for these pointers, and the result is worrying:

It would appear that something is wrong between the UK and the US, at zayo.com, am I correct? Or is that normal at that point, i.e. the core routers are too busy to answer low-priority traffic? I have left it running for 2000 packets so far (more than the appended image shows) and see zero loss end to end, if that makes sense.

Here are the latest results, live:

[screenshot: live-system score graph, 19 Jan 2020]

and the beta system (note that the first drop-out was ‘intentional’; I re-flashed the LeoNTP firmware, just to be sure):

[screenshot: beta-system score graph, 19 Jan 2020]

I will of course let the Beta run longer. I will try to find the IP address of the Beta monitor and run an MTR on that…

I am no expert at interpreting traceroute/mtr output. What do you make of the outputs shown above? It does seem strange that everything was perfect for a year and then, bang, all over the place from the middle of this week onwards, with absolutely no change whatsoever at this end.

Many thanks in advance, littlejason; I really appreciate your time and help.

Dave.

Update: the mtr results after 150 minutes of continuous monitoring:

Many thanks in advance…

Hi Dave, it looks like the issue mostly went away early this morning. Both the live and the beta system show only one or two drops (apart from when you were re-flashing) since about 9am today.