Monitoring stations time out to our NTP servers

No, sorry, I don’t think the monitoring works fine.
If a monitoring server’s purpose is to monitor UDP port 123, but it is connected via networks that drop UDP port 123 packets, then the server fails at its only job.

Your MTR shows that the packet loss starts in the routing path, and the server owner has no influence on that. Since the project assigns servers based on DNS geolocation, it’s also somewhat uninteresting what a station in NY thinks the status of servers in the EU might be.

The project’s monitoring will probably be fine once the station in Amsterdam is live and connected via a network that fits the requirements.

In this case the problematic point is closer to the NTP server than to the monitoring host, and most of the servers in the UK zone don’t seem to have this problem, so I wouldn’t blame the monitoring host’s network connection. FWIW, my mtr test was run from within the EU.

There may be nothing the server admin can do except move to a different network. It’s bad luck. The Internet is becoming more and more hostile to NTP. Some of my servers were affected too.

Yes, it would be nice to have a monitoring host in each zone, so servers could be selectively enabled/disabled per zone.

Ideally, instead of having a few specialized monitoring hosts, I’d like each NTP server to have an option to monitor other servers. It could be a simple shell script (e.g. running as a cron job) that would fetch a list of servers, measure its offset against them, and upload the results. With that much data it would be easier to decide whether a server should be enabled in a zone or not.
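For illustration only, here is a minimal sketch of what such a cron-driven script could look like. The server-list and upload URLs, file layout, and the use of ntpdate -q are all assumptions, not anything the pool provides today:

```
#!/bin/sh
# Hypothetical peer-monitoring sketch: fetch a server list, measure our
# offset against each server, and upload the results.
# SERVER_LIST_URL and UPLOAD_URL are made-up placeholders.
SERVER_LIST_URL="https://example.invalid/servers.txt"
UPLOAD_URL="https://example.invalid/upload"
RESULTS=$(mktemp)

for server in $(curl -fsS "$SERVER_LIST_URL"); do
    if line=$(ntpdate -q "$server" 2>/dev/null); then
        # ntpdate -q typically prints a line like:
        #   server 1.2.3.4, stratum 2, offset 0.001234, delay 0.02569
        offset=$(printf '%s\n' "$line" | awk '/^server/ {gsub(",",""); print $6; exit}')
        echo "$(date -u +%s) $server $offset" >> "$RESULTS"
    else
        echo "$(date -u +%s) $server timeout" >> "$RESULTS"
    fi
done

# Upload the collected measurements (endpoint is hypothetical).
curl -fsS -X POST --data-binary "@$RESULTS" "$UPLOAD_URL"
rm -f "$RESULTS"
```

Run from cron every few minutes on many servers, that would give the pool many vantage points instead of one.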

So the issue here isn’t the server (it’s our prototype, serial number #001 LeoNTP, that has been running fine for years) or its connectivity (it’s on a 1 Gbps leased line rate-limited to 50 Mbit/s for pool purposes). I did some tests from a few hosts within the UK and there was no loss to the server on port 123. Somewhere between the monitoring station and the server, an intermediate host is dropping NTP packets. This is likely outside both my control and the monitoring station’s control.
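For anyone who wants to run a similar spot check, a loop along these lines would do it (the target address, repeat count, and the choice of ntpdate -q are all just example assumptions):

```
# Repeatedly query the server on UDP port 123 and count failed responses.
# 192.0.2.1 is a placeholder; substitute the NTP server's real address.
TARGET=192.0.2.1
FAIL=0
for i in $(seq 1 50); do
    ntpdate -q "$TARGET" >/dev/null 2>&1 || FAIL=$((FAIL + 1))
    sleep 1
done
echo "$FAIL of 50 NTP queries to $TARGET failed"
```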

(Side note: can we get an MTR from the monitoring station to my host to see where the issue lies?)

However, the monitoring relies on a single result to dictate whether a host is “suitable” for inclusion in the pool. This is crazy for the reasons highlighted above. There needs to be some sanity checking: tests from multiple servers in different locations before a host is deemed unreliable.

Anthony


Monitoring from multiple locations (as on the beta site) is a feature in the works.

NTT seems to pop up a lot when people have issues with NTP traffic being dropped…

http://trace.ntppool.org/traceroute/51.155.16.62

Yes, they time out, in the sense that the UDP packets are dropped somewhere along the path.
As a result you get a bad score.

If you try the beta server, it monitors from Amsterdam, and the results are different.
A lot different.

Do not trust the NJ monitoring station, as it does not have a clear UDP path to all servers.

Greetings Bas.

How do I use the beta server?

Here they are: https://web.beta.grundclock.com/nl/

We all have the same problem with the Newark monitor.
You are not alone and people are working to find the cause of the problems.

I just did an 8-hour tcpdump for Steven, to compare what my server sees with what the monitor reports.
Since the monitor reports my server as bad all the time, the blue line jumps up and down like a dancer for Newark, whereas LA and Amsterdam report everything to be fine.
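For reference, a capture restricted to NTP traffic needs nothing fancier than something like this (the interface name, rotation interval, and file names are assumptions, not the exact command used):

```
# Capture only NTP traffic (UDP port 123), rotating the file every hour
# so an 8-hour run ends up as eight manageable pcap files.
tcpdump -ni eth0 -G 3600 -W 8 -w 'ntp-%Y%m%d-%H%M.pcap' udp port 123
```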

Similar problems here at byrpi2.dynu.net: IPv6 is always super stable (once the server config is final), while IPv4 fluctuates wildly, with lots of timeouts in between…
Any hints on what I could try on my side?
Thanks,
Michael

@mby Your server doesn’t look too terrible NTP-wise, although there are some issues with ICMP.

IPv4 NTP: https://atlas.ripe.net/measurements/23910167/#!probes
IPv6 NTP: https://atlas.ripe.net/measurements/23910168/#!probes

IPv4 Trace: https://atlas.ripe.net/measurements/23910177/#!probes
IPv6 Trace: https://atlas.ripe.net/measurements/23910179/#!probes

IPv4 Ping: https://atlas.ripe.net/measurements/23910187/#!probes
IPv6 Ping: https://atlas.ripe.net/measurements/23910192/#!probes

When you say that IPv4 is fluctuating wildly, what do you mean by that?

Thanks, and I mean that IPv4 connections seem to have many more timeouts when probed by the pool than IPv6, which works nearly flawlessly.

Just two side questions:

  1. Are these IPs static? The reverse lookup shows up as a normal ISP.
  2. Is the line symmetric or asymmetric?

It’s a regular ISP connection, but the IPs should never change; and it’s an asymmetric line.

OK, I thought it was like a dial-up connection where the IP changes.

@mby What sort of traffic levels (NTP packets per second) are you seeing during the times when the score is below 10?
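(If you don’t have a number handy, a rough way to get one is to count packets over a fixed window; the interface name below is just a guess:)

```
# Count inbound NTP packets for 60 seconds and print a rough packets-per-second figure.
COUNT=$(timeout 60 tcpdump -ni eth0 'udp dst port 123' 2>/dev/null | wc -l)
echo "approx $((COUNT / 60)) NTP packets per second"
```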

Thanks Steve, and it is always super-low traffic; what’s appalling is that only the IPv4 probes break down while IPv6 is alive and kicking…
But anyway, I’ve monitored the accuracy of my server quite extensively and done some testing, so it does not appear to be a local problem. I consider this closed; thank you, everyone.

P.S.: Looking at the beta site data, I see the following scores per monitoring station, so it does look like something I can’t influence:
IPv4: Newark, NJ, US (7.9); Los Angeles, CA (-49.3); Amsterdam (-0.3)
IPv6: Newark, NJ, US (19.5); Amsterdam (19.4)

@mby those are delightfully consistent problems!

@stevesommars might be able to help analyze some tcpdump captures if you can work with him. He’s been working out which [network links / transit providers / etc.] are causing trouble and looking at the failure patterns.

Thanks @ask, I’m currently working with @stevesommars on it; we’ll keep you posted.