"i/o timeout" from different monitoring stations

Many of us understand and share your frustration. However, when traffic conditions are better, your server may be included in the pool temporarily, and there is a chance that the monitoring or the Internet connectivity situation will improve one day.

In the beta system the New Jersey monitor now retries the query on timeouts (and other network errors) and will report success if just one of three queries succeeds.
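
Roughly, the new behavior looks like this minimal sketch (not the actual monitor code; it assumes the github.com/beevik/ntp client library, and the attempt count, timeout, and addresses here are illustrative):

package main

import (
	"fmt"
	"time"

	"github.com/beevik/ntp"
)

// checkServer queries host up to three times; a single good response
// counts as success, and timeouts/network errors trigger a retry.
func checkServer(host string) (*ntp.Response, error) {
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		resp, err := ntp.QueryWithOptions(host, ntp.QueryOptions{
			Timeout: 3 * time.Second, // illustrative timeout
		})
		if err == nil {
			return resp, nil // one success is enough
		}
		lastErr = err
	}
	return nil, lastErr
}

func main() {
	resp, err := checkServer("192.0.2.1") // placeholder address
	if err != nil {
		fmt.Println("check failed:", err)
		return
	}
	fmt.Println("offset:", resp.ClockOffset)
}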

I’m curious whether this will perform better, or whether the network trouble is “all on” or “all off” for stretches longer than the 4-7 seconds a test takes. (cc @stevesommars)

2 Likes

I’m running a server in the German DFN (German research network) and I’m also having issues with negative scores and the like.

https://www.ntppool.org/scores/193.175.73.20

Currently I’m back to a positive score, but it oscillates…

The same update has been live on the Amsterdam beta monitor since a few minutes ago.

2 Likes

The timeouts seem to be probabilistic, not on/off. I’ll keep looking.

I wonder if doing ten or twenty queries over a minute would help then…

Feedback: much better monitoring results for the same host via the beta system than via the production one:

(screenshots: beta vs. production score graphs)

Ok, that graph is promising, but unless I am missing something I don’t see the improvement clearly in the monitoring data. I made the change at about 09:00 on the 15th.

select
  d, h,
  (100/count(*))*sum(nope) as bad_ratio,
  sum(ok), sum(nope)
from
(
  select date(ts) as d, hour(ts) as h,
    if(step > 0, 1, 0) as ok, if(step <= 0, 1, 0) as nope
  from log_scores
  where ts > '2020-01-13'
    and monitor_id = 20
    and score > 0
) as t2
group by d, h
order by d, h;

Gives something like this, where bad_ratio is the percentage of checks in that hour with a non-positive score step (full CSV attached):

d h bad_ratio(%) sum(ok) sum(nope)
2020-01-15 0 1.1168 1151 13
2020-01-15 1 2.4893 1371 35
2020-01-15 2 1.9708 1343 27
2020-01-15 3 0.0742 1347 1
2020-01-15 4 0.5352 1115 6
2020-01-15 5 0.4441 1345 6
2020-01-15 6 1.0424 1329 14
2020-01-15 7 0.8627 1264 11
2020-01-15 8 0.503 989 5
2020-01-15 9 1.992 738 15
2020-01-15 10 1.3605 725 10
2020-01-15 11 2.8289 687 20
2020-01-15 12 1.108 714 8
2020-01-15 13 2.0633 712 15
2020-01-15 14 1.2312 722 9
2020-01-15 15 2.2059 665 15
2020-01-15 16 1.3267 595 8
2020-01-15 17 1.0654 650 7
2020-01-15 18 1.4837 664 10
2020-01-15 19 1.3333 666 9
2020-01-15 20 0.5722 695 4
2020-01-15 21 1.0072 688 7
2020-01-15 22 0.7225 687 5
2020-01-15 23 2.0649 664 14
2020-01-16 0 1.3393 663 9
2020-01-16 1 1.585 683 11

For what it’s worth, the equivalent BigQuery query is this:

select 
  d, h,
  (100/count(*))*sum(nope) as bad_ratio,
  sum(ok),sum(nope)
from
(
  select date(ts) as d, extract(hour from ts) as h, if(step > 0, 1, 0) as ok, if(step <= 0, 1, 0) as nope
  from ntppool.ntpbeta.log_scores where ts > '2020-01-13'
  and monitor_id = 20
  and score > 0
) as t2
group by d,h
order by d,h

Although I can’t see clearly in the data that the change made a difference, it also didn’t make things worse, so I have now updated the production monitor to use the new retry logic too, to gather more data.

Analysis of the data set via BigQuery, as seen above (replace ‘ntpbeta’ with ‘ntppool’ for the production data), is very welcome.

1 Like

At first look it’s better, but the polling interval has increased and it now takes much longer to recover.

1579289104,"2020-01-17 19:25:04",-0.001040629,1,14.6,6,"Newark, NJ, US",0,
1579289104,"2020-01-17 19:25:04",-0.001040629,1,14.6,,,0,
1579273070,"2020-01-17 14:57:50",0.00027202,1,14.3,6,"Newark, NJ, US",0,
1579273070,"2020-01-17 14:57:50",0.00027202,1,14.3,,,0,
1579256967,"2020-01-17 10:29:27",0,-5,14,6,"Newark, NJ, US",,"i/o timeout"
1579256967,"2020-01-17 10:29:27",0,-5,14,,,,"i/o timeout"
1579252473,"2020-01-17 09:14:33",0.000585068,1,20,6,"Newark, NJ, US",0,
1579252473,"2020-01-17 09:14:33",0.000585068,1,20,,,0,
1579251566,"2020-01-17 08:59:26",0.000619647,1,20,6,"Newark, NJ, US",0,
1579251566,"2020-01-17 08:59:26",0.000619647,1,20,,,0,
1579250657,"2020-01-17 08:44:17",0.00257311,1,20,6,"Newark, NJ, US",0,

(screenshot: Anmerkung 2020-01-18 004504)

Yup, seeing the same - polling interval in the live system is now roughly 4.5 hours instead of 15 minutes.

(screenshot: score history showing the longer polling interval)

Whoops! I completely missed that while staring at log files.

There is a 2-second delay between queries, as per the RFC. When only one query was done per server, the checks ran serially, immediately after each other. With a 2-second delay between the 3 samples, each check now takes just over 4 seconds (instead of the typical few milliseconds to a few hundred).
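
A toy sketch of that timing (the sleep stands in for the inter-sample delay; the actual NTP query is elided):

package main

import (
	"fmt"
	"time"
)

func main() {
	start := time.Now()
	for i := 0; i < 3; i++ {
		if i > 0 {
			time.Sleep(2 * time.Second) // 2-second gap between samples
		}
		// an NTP query (typically a few ms to a few hundred ms) goes here
	}
	fmt.Println("check duration:", time.Since(start)) // just over 4 seconds
}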

I also fixed a bug in the change from a few days ago, so the new logic that ignores timeouts when there is a good response now actually works.

1 Like

Morning,

Running a LeoNTP stratum 1 server on a 1 Gbps FTTP connection, located in the UK. Rock solid for one year: averaging 4k responses/second and always holding a constant score of 20.

Since the 15th it has been all over the place:

(screenshot: NTP 19Jan2020)

Have checked everything, even changed the router box (the NTP server is the only device on the feed/router), latest firmware, etc.

Not sure what has caused this change. Checks, apart from Newark’s, show the system is responding correctly and in a timely manner. I am going to try the beta system. Does it actually contribute to the pool, or is it purely for testing new monitoring software from a different location to Newark?

Does anyone have any advice, or anything else to try?

Many thanks in advance.

Dave.

The beta pool is strictly for monitoring.

You can try running mtr from your IP to the monitor (be sure to use UDP/123) and see if the packets are getting lost along the way.
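
For example, something like this (a sketch; the cycle count is arbitrary, and <monitor-ip> stands for the monitor address from the page linked below):

mtr --udp --port 123 --report --report-cycles 100 <monitor-ip>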

https://dev.ntppool.org/monitoring/network-debugging/

The monitor IP is in the link above; you can also have it run a traceroute to you, so you can test both directions…

Thank you very much indeed for these pointers, and the result is worrying:

It would appear that something is wrong between the UK and US, at zayo.com. Am I correct? Or is that normal at that point, with the core routers simply too busy to acknowledge low-priority traffic? I’ve left it running for 2,000 packets so far (more than the appended image shows) with zero loss end to end, if that makes sense.

Here are the latest results, live:

(screenshot: NTP 19Jan2020-2)

and beta (note the first dropout was ‘intentional’: I re-flashed the LeoNTP firmware, just to be sure):

(screenshot: NTP 19Jan2020-Beta)

I will of course let the Beta run longer. I will try to find the IP address of the Beta monitor and run an MTR on that…

I am no expert at interpreting traceroute/MTR output. What do you make of the output shown above? It does seem strange that everything was perfect for a year and then, bang, it’s all over the place, with absolutely no change whatsoever at this end, from the middle of this week onwards.

Many thanks in advance littlejason, I really appreciate your time and help.

Dave.

Update: the MTR results after 150 minutes of continuous monitoring:

Many thanks in advance…

Hi Dave, it looks like the issue mostly went away early this morning. Both the live and beta systems show only one or two drops (apart from when you were re-flashing) since 9am-ish today.

Likely someone at Zayo made a network change that had a negative impact. If the issue doesn’t resolve itself in the next few days, you can try contacting Zayo tech support, explaining the issue and sending that MTR report so they know which routers are causing problems.

Thank you elljay and littlejason - very much appreciated.

So I will sit on my hands and wait. Perhaps that was the best move all along, but after a year of never having had this problem I was concerned; it has always been rock solid, and a problem persisting from early Wednesday until today didn’t strike me as something that was going to go away. Latest stats, production:

(screenshot: NTP 19Jan2020-3)

And Beta:

I will report back tomorrow, hopefully to say case closed.

Once again, many thanks,

Dave.

1 Like

Thank you for helping test and diagnose! (And of course for having a server in the pool) :slight_smile: