Monitoring stations time out to our NTP servers

Hi @ialbarki,
If you are a member of RIPE Atlas and host a probe or an anchor, you earn credits, and with these credits you can run measurements from other probes around the world. Measurement types include ping, traceroute and also NTP queries.
When you create a measurement, you can select candidate probes based on various criteria (AS, region, etc.).
If you want, I can run a measurement against some of your servers and make the results public, so you can see the reachability of your server at a point in time from an international perspective.
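For instance, creating such a measurement through the RIPE Atlas v2 REST API looks roughly like the sketch below. It assumes you have an API key with enough credits; the field names follow the public API documentation, but double-check them at https://atlas.ripe.net/docs/ before relying on this:

```python
# Minimal sketch: create a one-off NTP measurement via the RIPE Atlas v2 API.
# ATLAS_API_KEY and TARGET are placeholders you must fill in yourself.
import requests

ATLAS_API_KEY = "your-api-key-here"   # needs measurement-creation credits
TARGET = "203.0.113.10"               # the NTP server you want tested

payload = {
    "definitions": [{
        "type": "ntp",                # measurement type: NTP query
        "af": 4,                      # address family, 4 or 6
        "target": TARGET,
        "description": "pool server reachability check",
    }],
    "probes": [{
        "requested": 10,              # how many probes to use
        "type": "area",               # select by area/asn/country/prefix
        "value": "WW",                # worldwide
    }],
    "is_oneoff": True,                # run once instead of periodically
}

resp = requests.post(
    "https://atlas.ripe.net/api/v2/measurements/",
    json=payload,
    headers={"Authorization": f"Key {ATLAS_API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())                    # contains the new measurement ID(s)
```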
Regards,
Antoine

@ialbarki
The IP of your NTP server belongs to an IP pool of Level3, a.k.a. CenturyLink, so routers belonging to Level3 are being traversed (of course).
As stated in my post here, I “proved” that there are some weird NTP problems which can be traced back to aggressive rate limiting of NTP traffic done on Level3 routers.
At least on traffic between Europe and the U.S., Level3 is eating some of the NTP packets.

As of today, I have had no luck getting NTP working again via Level3. My pool server is online again only because I changed the outgoing routing towards the pool monitor to go via Cogent :frowning:
Maybe you could open a ticket with your host/ISP to get in touch with Level3 engineers to stop that stupid filtering.
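If you want to check for this kind of rate limiting yourself, here is a minimal sketch that fires plain NTP client queries at a server over UDP and reports the loss rate. The server address is a placeholder; steady partial loss on one path but not another is a hint that something in between is dropping packets:

```python
# Minimal sketch: measure NTP packet loss to a server by sending plain
# mode-3 (client) queries over UDP and counting the replies. Steady partial
# loss on one path but not another is a hint of rate limiting in between.
import socket
import time

SERVER = "203.0.113.10"   # placeholder: the pool server to test
ATTEMPTS = 50
TIMEOUT = 2.0             # seconds to wait for each reply

# 48-byte NTP packet: LI=0, VN=3, Mode=3 (client) in the first byte (0x1b).
query = b"\x1b" + 47 * b"\x00"

received = 0
for _ in range(ATTEMPTS):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(TIMEOUT)
    try:
        sock.sendto(query, (SERVER, 123))
        sock.recvfrom(512)          # any well-formed reply counts
        received += 1
    except socket.timeout:
        pass                        # lost, filtered, or rate limited
    finally:
        sock.close()
    time.sleep(0.5)                 # pace the queries; don't hammer the server

loss = 100.0 * (ATTEMPTS - received) / ATTEMPTS
print(f"{received}/{ATTEMPTS} replies, {loss:.0f}% loss")
```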

What you see in WHOIS is the technical contact address. This contact is not necessarily tied to the physical location of the servers.
I am Tech-C for some IP networks which are located over 300 km away from where I work :wink:

According to a traceroute, the IP 139.178.64.42 is located in Newark, because the last two hops before the target are named
0.xe-1-0-0.bbr1.ewr1.packet.net
0.ae12.dsr2.ewr1.packet.net

EWR is the IATA code for Newark Liberty International Airport :wink:
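For what it's worth, this trick is easy to automate, since many carriers embed the IATA code as its own dot-separated label in reverse-DNS names. A rough sketch; the label pattern is a heuristic, not a standard, and the code list here is a tiny illustrative subset, not a full airport list:

```python
# Rough sketch: guess a router's location from an IATA airport code embedded
# in its reverse-DNS name, e.g. "0.ae12.dsr2.ewr1.packet.net" -> "ewr".
# The label pattern is a common carrier convention, not a standard, and
# KNOWN_IATA is a tiny illustrative subset, not a full airport list.
import re

IATA_LABEL = re.compile(r"^([a-z]{3})\d*$")
KNOWN_IATA = {"ewr", "jfk", "lga", "iad", "ord", "sjc", "ams", "fra", "lhr"}

def guess_iata(hostname: str) -> str | None:
    for label in hostname.lower().split("."):
        m = IATA_LABEL.match(label)
        if m and m.group(1) in KNOWN_IATA:
            return m.group(1)
    return None

for hop in ("0.xe-1-0-0.bbr1.ewr1.packet.net", "0.ae12.dsr2.ewr1.packet.net"):
    print(hop, "->", guess_iata(hop))   # both print "ewr" (Newark)
```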

I added a monitoring system in Holland (Amsterdam) to the beta system tonight: https://web.beta.grundclock.com/user/ask

Still todo:

  • Finish packaging the software so others can run it more easily
  • Figure out scoring system for the monitors and how to distribute the checks
  • Self-service sign-up for monitors

That’s very good news for us Europeans, stuck in the middle of nowhere between America and Asia…

Do you mean it will be possible to run probes without needing a dedicated server you have control over?

Pretty simple to me: if one probe is still above 10, it means someone, somewhere is still able to ping this particular server, regardless of the connectivity of each particular probe or the connectivity of the server to each particular probe.

If you go the way described above, we wouldn’t need to sign up with each probe: you just monitor all servers from all probes and include any server that scores above 10 on at least one probe. Done.
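Made concrete, the rule proposed here is just the following; a sketch of the idea, not how the pool actually decides today:

```python
# Sketch of the "above 10 on at least one probe" rule proposed above.
# Not the pool's actual logic, just the idea made concrete.
THRESHOLD = 10

def include_server(probe_scores: list[float]) -> bool:
    """Keep a server in the pool if any probe scores it above THRESHOLD."""
    return any(score > THRESHOLD for score in probe_scores)

print(include_server([20, -100, 3]))    # True: one probe still reaches it
print(include_server([-100, 4, 9.5]))   # False: no probe above 10
```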

NTP is resilient enough to cope with one or even two bad servers served by the pool’s DNS, but I think we can’t afford to lose servers from difficult areas like Asia or India just because one or more probes go rogue.

With a “probe software package” that is easy to set up, we could have a dozen monitors, obviously self-excluding, as you don’t want a probe to monitor NTP on localhost.

Distributing the burden of the monitoring would make the system more robust against mechanical failure, and against human failure as well, as you seem to be the sole manager (I wish you a long and healthy life, BTW :wink:).

I think the median score from all monitors would be a good start. With a larger number of monitors it could be shifted lower, maybe to the 25th percentile, meaning a server needs to be considered good by at least 75% of all monitors in order to be offered to clients.
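Concretely, with Python’s statistics module the aggregation could look like this; a sketch of the proposal, not existing pool code:

```python
# Sketch of the proposed aggregation: the median to start with, or the 25th
# percentile once there are enough monitors, so a server must look good to
# at least 75% of them. Not existing pool code.
import statistics

def aggregate(scores: list[float], percentile: int = 50) -> float:
    """percentile=50 is the median; 25 requires 3/4 of monitors to agree."""
    cuts = statistics.quantiles(scores, n=100, method="inclusive")
    return cuts[percentile - 1]

scores = [20, 20, 18.5, 20, -100]        # one monitor can't reach the server
print(aggregate(scores))                 # median: 20, the outlier is ignored
print(aggregate(scores, percentile=25))  # stricter: 18.5
```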

It could also mean that one monitor is just handing out 10+ scores to every IP address it looks at. Monitors can fail too, and if self-serve users can sign them up then there are lots of new QA problems for the pool.

Ideally, if there’s a monitor in the same zone(s) as the pool server, its score should dominate the aggregated score, presuming that intra-zone communication is more reliable and desirable than inter-zone communication.

e.g. If a server is intended to serve CN only, and it’s inside CN and mostly unreachable from the rest of the world, then a score from a monitor outside of CN is entirely irrelevant: clients inside CN don’t care about other clients, and clients outside CN would be routed to other zone server pools.

Taken to its logical extreme, this would mean that every host could have one score for each zone (if the zone has a monitor inside it), which determines its inclusion in each zone. When there’s no monitor in the same zone, it would use a best guess based on average or 25th percentile or whatever.
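A sketch of that zone-aware rule; the data shapes and the 25th-percentile fallback are assumptions taken from the discussion above, not anything the pool implements today:

```python
# Sketch of the zone-aware rule described above: an in-zone monitor's score
# dominates; with no in-zone monitor, fall back to the 25th percentile of
# all monitors. The data shapes here are hypothetical.
import statistics

def zone_score(server_zone: str, score_by_monitor_zone: dict[str, float]) -> float:
    if server_zone in score_by_monitor_zone:
        # Intra-zone reachability is what the zone's clients actually see.
        return score_by_monitor_zone[server_zone]
    # Best guess from all monitors: the first quartile (25th percentile).
    return statistics.quantiles(score_by_monitor_zone.values(), n=4,
                                method="inclusive")[0]

scores = {"us": -5.0, "de": -2.8, "cn": 19.5}
print(zone_score("cn", scores))   # 19.5: the in-zone monitor decides
print(zone_score("jp", scores))   # no jp monitor: 25th-percentile fallback
```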

Hello,

I have the same issue here on my servers.

CU
Jörg

We run three stratum 1 servers and our experience with the V6 monitoring is spotty at best. IPv4 runs fine, no issues. IPv6 on the same servers has repeatedly become unreachable according to the NTP monitoring system over the last several months (each currently sitting at -100). We have validated global reachability using multiple servers located across the globe, looking glasses and RIPE tools, but the NTP monitoring system will still report our V6 as down when all other methods we test with say otherwise. We do not use tunnels for V6, we use multiple peering and transit connections (including HE and Cogent, given their V6 networks are effectively separate as of the time of writing this). I honestly do not know what else we can do.

Following this up: thanks to an above post listing the V4 address of the monitoring system, I was able to look up the V6 address of the system and then troubleshoot the return path from our side. It turns out one of our peers appears to be blackholing the return traffic from us to them, so we’ve implemented filtering accordingly, and the monitoring system scores for our V6 addresses are now increasing. And yes, we’ve opened a ticket with that peer so they can sort out the issue on their end.
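For anyone who wants to repeat that lookup: one way to go from a host’s IPv4 address to its IPv6 address is a reverse-DNS (PTR) lookup followed by an AAAA query on the resulting hostname. A sketch that only works if the operator published both records for the same name; 8.8.8.8 is used purely as a well-known example address:

```python
# Sketch: find a host's IPv6 address starting from its IPv4 address, via a
# reverse-DNS (PTR) lookup and then an AAAA lookup on the resulting name.
# Only works if the operator published both records for the same hostname.
import socket

def v6_from_v4(ipv4: str) -> list[str]:
    hostname, _, _ = socket.gethostbyaddr(ipv4)                # PTR lookup
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET6)
    return sorted({info[4][0] for info in infos})              # AAAA records

print(v6_from_v4("8.8.8.8"))   # well-known example: resolves via dns.google
```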

For my servers (also in Switzerland) the connection from the production monitoring in Newark is horrible over IPv4. Over IPv6, it is just perfect.

Is there something somebody can do to ameliorate the situation?

Same here,

Server with IPv4 & IPv6.
The IPv6 score drops below 14 while the IPv4 score is stable at 20.

@ask, where are you with this now?
Is the code accessible somewhere, so that one can submit possible fixes?

I tried to add my servers to this monitoring development, but it does not seem to work.

I am also having the same issue.
On the production ntppool:
https://www.pool.ntp.org/scores/45.76.111.149
https://www.pool.ntp.org/scores/2401:c080:1000:4175:5400:2ff:fe32:f445

Server is based in Japan.
IPv6 is solid; IPv4 keeps oscillating.
I was almost pulling my hair out, since all my tests from multiple servers and locations (US, IN, Japan) were 100% positive.

Now I have added these to the beta pool.
https://web.beta.grundclock.com/scores/45.76.111.149
https://web.beta.grundclock.com/scores/2401:c080:1000:4175:5400:2ff:fe32:f445

These were only added today, so they haven’t yet reached the perfect 20. However, the trend is very clearly visible, especially for the IPv4 interface. Newark keeps reporting a timeout; not sure if it is behaving any better than the production pool?

Any ideas how to improve the situation? Due to this issue, I guess the pool’s effective capacity is much less than it should be. If the transport-related issues are not easily solvable, is it an option to tweak the score downgrade that the Newark monitor applies, so that servers are not downgraded so aggressively? What is also puzzling is why the transport is OK for some polls and not for others.

I have the same problem here in Italy, but only from the ISPs Fastweb and Tiscali:
IPv4 score floating between 10 and 14
IPv6 score 20

From the ISP Wind I have:
IPv4 score 20
IPv6 score 20

1565817537,"2019-08-14 21:18:57",0.00379887,1,7.3,,,0,
1565816626,"2019-08-14 21:03:46",0,-5,6.6,6,"Newark, NJ, US",,"i/o timeout"
1565816626,"2019-08-14 21:03:46",0,-5,6.6,,,,"i/o timeout"
1565815642,"2019-08-14 20:47:22",0.001398844,1,12.3,6,"Newark, NJ, US",0,
1565815642,"2019-08-14 20:47:22",0.001398844,1,12.3,,,0,
1565814669,"2019-08-14 20:31:09",0.001399639,1,11.9,6,"Newark, NJ, US",0,
1565814669,"2019-08-14 20:31:09",0.001399639,1,11.9,,,0,
1565813742,"2019-08-14 20:15:42",0.002030171,1,11.4,6,"Newark, NJ, US",0,
1565813742,"2019-08-14 20:15:42",0.002030171,1,11.4,,,0,
1565812824,"2019-08-14 20:00:24",0.002168236,1,11,6,"Newark, NJ, US",0,
1565812824,"2019-08-14 20:00:24",0.002168236,1,11,,,0,
1565811916,"2019-08-14 19:45:16",0.002224739,1,10.5,6,"Newark, NJ, US",0,
1565811916,"2019-08-14 19:45:16",0.002224739,1,10.5,,,0,
1565810982,"2019-08-14 19:29:42",0.001927572,1,10,6,"Newark, NJ, US",0,
1565810982,"2019-08-14 19:29:42",0.001927572,1,10,,,0,
1565809981,"2019-08-14 19:13:01",0.002043029,1,9.5,6,"Newark, NJ, US",0,
1565809981,"2019-08-14 19:13:01",0.002043029,1,9.5,,,0,
1565808980,"2019-08-14 18:56:20",0.001458194,1,8.9,6,"Newark, NJ, US",0,
1565808980,"2019-08-14 18:56:20",0.001458194,1,8.9,,,0,
1565808019,"2019-08-14 18:40:19",0.001889322,1,8.3,6,"Newark, NJ, US",0,
1565808019,"2019-08-14 18:40:19",0.001889322,1,8.3,,,0,
1565807110,"2019-08-14 18:25:10",0,-5,7.7,6,"Newark, NJ, US",,"i/o timeout"
1565807110,"2019-08-14 18:25:10",0,-5,7.7,,,,"i/o timeout"
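For anyone puzzling over those columns: the fourth field is the step (+1 for a good response, -5 for an “i/o timeout”) and the fifth is the resulting score. The numbers in the excerpt are consistent with the previous score being decayed by a factor of about 0.95 before the step is added, with a cap at 20; note that the 0.95 decay and the cap are inferred from the data here, not taken from pool source code. A small sketch that replays the excerpt:

```python
# Sketch of the scoring rule the excerpt above is consistent with:
# new_score = old_score * 0.95 + step, capped at 20, where step is +1 for a
# good response and -5 for an "i/o timeout". The 0.95 decay and the cap are
# inferred from the data, not taken from pool source code.
def next_score(score: float, ok: bool) -> float:
    step = 1.0 if ok else -5.0
    return min(score * 0.95 + step, 20.0)

score = 7.7                      # the score right after the 18:25 timeout
for _ in range(9):               # the nine good checks that follow
    score = next_score(score, True)
    print(round(score, 1))       # 8.3, 8.9, 9.5, 10.0, ... (the last two
                                 # drift by 0.1 since we start from the
                                 # rounded 7.7, not the stored raw value)
print(round(next_score(score, False), 1))   # ~6.6 after the 21:03 timeout
```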

The beta site also has a monitor in Amsterdam; I’m curious if (anecdotally) the Amsterdam one is getting different/better results. (Also, the beta site has a bazillion other changes around adding and managing the servers that could use some testing!)

https://manage-beta.grundclock.com/manage/servers

When I click on that link, it gives me a:

500 - Server Error

Ouch! That didn’t work, our server hit a bad gear.

I am logged into the beta site…

Me too. I’m sure it was working yesterday :upside_down_face:

Hi,
I’m happy to report that the IPv4 score of the server is now much, much better and as solid as the IPv6 side.
The last reported timeout from the Newark monitoring server was at the time below, and the most recent successful monitoring request was at 2019-08-18 07:05:41; so after ~5 days the score is now a solid 20.

1565687391,"2019-08-13 09:09:51",0,-5,-2.8,6,"Newark, NJ, US","i/o timeout"

I have not changed anything impacting this on my end, so something else has changed for the better, and hopefully it stays that way :slight_smile:!

Thanks!

Hi,

since yesterday at 21:00 CEST (MESZ), I have been getting this on all of my servers:

1566222048,"2019-08-19 13:40:48",0,-5,-8.7,6,"Newark, NJ, US","i/o timeout"

CU
Jörg

Hi Jörg, tracerouting to and from the monitoring servers may help you work out where the routing is broken. This page https://dev.ntppool.org/monitoring/network-debugging/ gives the monitoring server IPs and a tool to traceroute back to your servers (replace the 8.8.8.8 with your IP).
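If it helps, the forward direction can be scripted. A minimal sketch that traceroutes from your server toward the Newark monitor’s v4 address quoted earlier in this thread; the return path still needs the tool on that page, since it has to originate at the monitor:

```python
# Minimal sketch: run the forward traceroute from your server toward the
# monitor and print the hops. The return path still has to be checked with
# the tool on the page above, since it must originate at the monitor.
# 139.178.64.42 is the monitor v4 address quoted earlier in this thread.
import subprocess

MONITOR_IP = "139.178.64.42"

result = subprocess.run(
    ["traceroute", "-n", MONITOR_IP],   # -n: numeric output, no reverse DNS
    capture_output=True, text=True, timeout=120,
)
print(result.stdout)
```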