Monitoring stations timeout to our NTP servers

Hi,

Our NTP servers have always had a score of 20+, but it suddenly dropped to -70. When I checked the logs here
https://www.ntppool.org/scores/212.26.18.41/log?limit=5800&monitor=*

we noticed that the drop happened when the monitoring station changed from “Los Angeles, CA (old)” to “Newark, NJ, US”:

1557090329,"2019-05-05 21:05:29",0,-5,-1.2,,,,"i/o timeout"
1557089265,"2019-05-05 20:47:45",0,-5,4,6,"Newark, NJ, US",,"i/o timeout"
1557089265,"2019-05-05 20:47:45",0,-5,4,,,,"i/o timeout"
1557088349,"2019-05-05 20:32:29",-0.013459226,1,9.4,6,"Newark, NJ, US",0,
1557088349,"2019-05-05 20:32:29",-0.013459226,1,9.4,,,0,
1557087443,"2019-05-05 20:17:23",-0.005148073,1,8.9,6,"Newark, NJ, US",0,
1557087443,"2019-05-05 20:17:23",-0.005148073,1,8.9,,,0,
1557086497,"2019-05-05 20:01:37",0,-5,8.3,6,"Newark, NJ, US",,"i/o timeout"
1557086497,"2019-05-05 20:01:37",0,-5,8.3,,,,"i/o timeout"
1557085547,"2019-05-05 19:45:47",0,-5,14,6,"Newark, NJ, US",,"i/o timeout"
1557085547,"2019-05-05 19:45:47",0,-5,14,,,,"i/o timeout"
1557082571,"2019-05-05 18:56:11",-0.00556996,1,20,1,"Los Angeles, CA (old)",0,
1557082571,"2019-05-05 18:56:11",-0.00556996,1,20,,,0,
1557081336,"2019-05-05 18:35:36",-0.006961309,1,20,1,"Los Angeles, CA (old)",0,
1557081336,"2019-05-05 18:35:36",-0.006961309,1,20,,,0,
1557079681,"2019-05-05 18:08:01",-0.008754776,1,20,1,"Los Angeles, CA (old)",0,
1557079681,"2019-05-05 18:08:01",-0.008754776,1,20,,,0,
1557078305,"2019-05-05 17:45:05",-0.009837079,1,20,1,"Los Angeles, CA (old)",0,
1557078305,"2019-05-05 17:45:05",-0.009837079,1,20,,,0,

After I checked this community post here, we added our servers to the new beta system, which has two monitoring stations, and we got this score:
Monitoring Station: Newark, NJ, US (-74.2) Los Angeles, CA (19.4)

I am not sure what we should do. Please advise.
Regards
Ibrahim
ialbarki@isu.net.sa
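
For anyone who wants to tally the log excerpt above per monitoring station, here is a minimal Python sketch. The column names are an assumption inferred from the nine fields in each line, and the sample is only a few of the lines quoted in the post; rows with no monitor name are skipped because they appear to repeat the same check without a station.

```python
import csv
import io

# Column names are an assumption inferred from the nine fields per log line.
COLUMNS = ["ts_epoch", "ts", "offset", "step", "score",
           "monitor_id", "monitor_name", "leap", "error"]

# A few lines copied from the log excerpt in the post.
SAMPLE = '''1557089265,"2019-05-05 20:47:45",0,-5,4,6,"Newark, NJ, US",,"i/o timeout"
1557089265,"2019-05-05 20:47:45",0,-5,4,,,,"i/o timeout"
1557088349,"2019-05-05 20:32:29",-0.013459226,1,9.4,6,"Newark, NJ, US",0,
1557082571,"2019-05-05 18:56:11",-0.00556996,1,20,1,"Los Angeles, CA (old)",0,
'''

def failures_by_monitor(text):
    """Return {monitor_name: (checks, timeouts)} for rows that name a monitor."""
    stats = {}
    for row in csv.DictReader(io.StringIO(text), fieldnames=COLUMNS):
        name = row["monitor_name"]
        if not name:
            continue  # row without a station name; skipped in this tally
        checks, timeouts = stats.get(name, (0, 0))
        timed_out = 1 if row["error"] == "i/o timeout" else 0
        stats[name] = (checks + 1, timeouts + timed_out)
    return stats

result = failures_by_monitor(SAMPLE)
for name, (checks, timeouts) in result.items():
    print(f"{name}: {timeouts}/{checks} checks timed out")
```

Feeding it the full CSV from the log URL instead of `SAMPLE` should show at a glance whether the timeouts are confined to one station.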

Hi Ibrahim,

Exactly the same for me here in Switzerland.

The IPv4 score has been awful since 5.5 @19:43, the moment the monitoring switched from the LA station to Newark. IPv6 is still OK.

It seems that a transit provider is rate limiting NTP packets on its backbone. Other operators here in Switzerland are seeing the same behavior.

I run NTP measurements with RIPE Atlas probes (100 probes worldwide) and my server is perfectly reachable over IPv4.

Regards,
Antoine

That’s why we need more than one probe, on different continents.

@jacota thanks for your reply, can you share with me how to do NTP measurements with RIPE Atlas probes?

Is there any way to find the bad traceroute? I can reach the Los Angeles monitoring station, and I expect its IP address is 207.171.3.17, but I am not sure what the IP address of the Newark monitoring station is.

Regards
Ibrahim
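
A quick way to check from any vantage point whether a server answers NTP at all is to send it a single client request over UDP port 123. Below is a minimal sketch using only the Python standard library. It tests the path from wherever you run it, not the path the monitoring station uses, so a successful reply does not rule out filtering elsewhere. The function names are made up for this example.

```python
import socket
import struct

NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01

def build_request():
    """Return a 48-byte NTP client request (LI=0, VN=3, Mode=3)."""
    return bytes([0x1B]) + bytes(47)

def parse_response(data):
    """Extract stratum and transmit time (Unix seconds) from an NTP reply."""
    if len(data) < 48:
        raise ValueError("short NTP packet")
    stratum = data[1]
    # The transmit timestamp occupies bytes 40..47: seconds, then fraction.
    seconds, fraction = struct.unpack("!II", data[40:48])
    return stratum, seconds - NTP_EPOCH_OFFSET + fraction / 2**32

def query(host, timeout=2.0):
    """Send one request; return (stratum, unix_time), or None on timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_request(), (host, 123))
            data, _ = sock.recvfrom(512)
        except socket.timeout:
            return None
    return parse_response(data)
```

Calling `query("212.26.18.41")` from several hosts on different networks gives a rough picture of where the replies get lost; `None` corresponds to the "i/o timeout" entries in the log.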

According to previous posts, the current server should be 139.178.64.42

Thanks, @alica,

I am not sure, but I know that there are multiple monitoring stations, one in Los Angeles, CA (old) and another in Newark, NJ, US. I can’t find any details about the IP addresses of these two monitoring stations.

If you do a quick lookup for the IP address 207.171.3.17, you will find it in Los Angeles, but the IP address 139.178.64.42 is in neither Los Angeles nor Newark. According to the lookup here, it is in New York City.

Regards
Ibrahim

Hi @ialbarki,
If you are a member of RIPE Atlas and you host a probe or an anchor, you earn credits, and with these credits you can run measurements from other probes around the world. Measurement types include ping, traceroute and also NTP queries.
When you create a measurement you can select candidate probes based on various criteria (AS, region etc…)
If you want, I can run a measurement for some of your servers and then make the results public, so you can see the reachability of your server at a given point in time from an international perspective.
Regards,
Antoine
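
For reference, a measurement like the one described above can also be requested through the RIPE Atlas REST API instead of the web form. The sketch below only builds the JSON body; the field names follow my understanding of the v2 measurements API and should be checked against the official documentation before use. The helper name and the description text are made up for this example, and the body would be sent as a POST to https://atlas.ripe.net/api/v2/measurements/ with your own API key.

```python
import json

def ntp_measurement_request(target, probes=100, af=4):
    """Build the body for a one-off NTP measurement (field names are assumed)."""
    return {
        "definitions": [{
            "type": "ntp",          # measurement type
            "af": af,               # address family: 4 or 6
            "target": target,
            "description": f"NTP reachability of {target}",
        }],
        # "WW" is assumed to select probes worldwide.
        "probes": [{"type": "area", "value": "WW", "requested": probes}],
        "is_oneoff": True,
    }

body = ntp_measurement_request("212.26.18.41")
print(json.dumps(body, indent=2))
```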

@ialbarki
The IP of your NTP server belongs to an IP pool of Level3, a.k.a. CenturyLink, so routers belonging to Level3 are of course being traversed.
As stated in my post here, I “proved” that there are some weird NTP problems which can be tracked down to aggressive rate limiting of NTP traffic on Level3 routers.
At least on traffic between Europe and the U.S., Level3 is eating some of the NTP packets.

As of today I have had no luck getting NTP working again via Level3. My pool server is online again only because I changed the outgoing routing towards the pool monitor to go via Cogent :frowning:
Maybe you could open a ticket with your host/ISP to get in touch with Level3 engineers and have that stupid filtering stopped.

What you see in WHOIS is the technical contact address. This contact is not necessarily tied to the physical location of the servers.
I am Tech-C for some IP networks which are located over 300 km away from where I work :wink:

According to a traceroute, the IP 139.178.64.42 is located in Newark - because the last two hops in front of the target are named
0.xe-1-0-0.bbr1.ewr1.packet.net
0.ae12.dsr2.ewr1.packet.net

EWR is the IATA code for Newark Liberty International Airport :wink:
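
The same trick can be scripted for a whole traceroute. A minimal sketch, assuming the operator embeds a three-letter IATA code followed by digits in one of the hostname labels (as in `ewr1` above); the lookup table is a tiny hand-made sample, not a real database:

```python
import re

# Tiny illustrative lookup table; extend it with the codes you care about.
IATA = {"ewr": "Newark, NJ", "lax": "Los Angeles, CA", "ams": "Amsterdam"}

def site_from_hostname(hostname):
    """Return (IATA code, city) for the first label like 'ewr1', else None."""
    for label in hostname.lower().split("."):
        match = re.fullmatch(r"([a-z]{3})\d+", label)
        if match and match.group(1) in IATA:
            code = match.group(1)
            return code.upper(), IATA[code]
    return None

for hop in ("0.xe-1-0-0.bbr1.ewr1.packet.net", "0.ae12.dsr2.ewr1.packet.net"):
    print(hop, "->", site_from_hostname(hop))
```

Labels such as `bbr1` or `dsr2` match the pattern but are not in the table, so they are skipped. Naming conventions differ between operators, so treat the result as a hint only.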

I added a monitoring system in Holland (Amsterdam) to the beta system tonight: https://web.beta.grundclock.com/user/ask

Still todo:

  • Finish packaging the software so others can run it more easily
  • Figure out scoring system for the monitors and how to distribute the checks
  • Self-service sign-up for monitors

That’s very good news for us Europeans, stuck in the middle of nowhere between America and Asia…

Do you mean it will be possible to run probes without needing a dedicated server that you control?

Pretty simple to me: if one probe is still above 10 it means someone, somewhere is still able to ping this particular server, regardless of the connectivity of each particular probe or the connectivity of the server to each particular probe.

If you go the way above, we wouldn’t need to sign up to each probe: you just monitor all servers from all probes and include the ones that score above 10 on at least one probe. Done.
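
The rule proposed above fits in a few lines. This is only a sketch of the suggestion, not how the pool actually decides; the threshold of 10 comes from the post, and the scores are the two beta-system values quoted earlier in the thread.

```python
THRESHOLD = 10  # score above which a server counts as good, per the post

def include_in_pool(scores_by_monitor, threshold=THRESHOLD):
    """Include the server if at least one monitor scores it above the threshold."""
    return any(score > threshold for score in scores_by_monitor.values())

beta_scores = {"Newark, NJ, US": -74.2, "Los Angeles, CA": 19.4}
print(include_in_pool(beta_scores))  # True: Los Angeles still sees the server
```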

NTP is resilient enough to cope with one or even two bad servers served by the pool’s DNS, but I think we can’t afford to lose servers from difficult areas like Asia or India because one or more probes are going rogue.

With a “probe software package” that is easy to set up, we could have a dozen monitors, obviously self-excluding, as you don’t want a probe to monitor NTP on localhost.

Distributing the burden of the monitoring would make the system more robust against a mechanical failure, and against a human failure as well, as you seem to be the sole manager (I wish you a long and healthy life BTW :wink:)

I think median score from all monitors would be a good start. With a larger number of monitors it could be shifted lower, maybe 25th percentile, meaning a server needs to be considered good by at least 75% of all monitors in order to be offered to clients.
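
To make the idea concrete, here is a minimal sketch of both variants. The four scores are hypothetical (the thread only has two real ones), and the "good" threshold of 10 is taken from the earlier post. Instead of computing a 25th percentile, the second function directly checks the stated meaning, i.e. that at least 75% of monitors consider the server good, which avoids arguing about percentile interpolation with few monitors.

```python
import statistics

THRESHOLD = 10  # a monitor considers the server good above this score

def median_rule(scores, threshold=THRESHOLD):
    """Include the server if the median monitor score is above the threshold."""
    return statistics.median(scores) > threshold

def fraction_rule(scores, required=0.75, threshold=THRESHOLD):
    """Include the server if at least `required` of the monitors score it good."""
    good = sum(1 for s in scores if s > threshold)
    return good / len(scores) >= required

# Hypothetical scores from four monitors, one of which cannot reach the server.
scores = [20, 19.4, 18, -74.2]
print(median_rule(scores), fraction_rule(scores))
```

With these numbers both rules keep the server in the pool even though one monitor reports it as unreachable.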

It could also mean that one monitor is just handing out 10+ scores to every IP address it looks at. Monitors can fail too, and if self-serve users can sign them up then there are lots of new QA problems for the pool.

Ideally, if there’s a monitor in the same zone(s) as the pool server, its score should dominate the aggregated score, presuming that intra-zone communication is more reliable and desirable than inter-zone communication.

e.g. If a server is intending to serve CN only, and it’s inside CN and mostly unreachable from the rest of the world, then a score from a monitor outside of CN is entirely irrelevant–clients inside CN don’t care about other clients, and clients outside CN would be routed to other zone server pools.

Taken to its logical extreme, this would mean that every host could have one score for each zone (if the zone has a monitor inside it), which determines its inclusion in each zone. When there’s no monitor in the same zone, it would use a best guess based on average or 25th percentile or whatever.
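
A rough sketch of that per-zone scoring, under loud assumptions: the monitor zones and scores below are invented, and the fallback uses the median purely as one of the possible "best guess" choices mentioned above.

```python
import statistics

def zone_score(monitor_scores, zone):
    """monitor_scores is a list of (monitor_zone, score) pairs.

    Monitors inside `zone` decide the score on their own; if the zone has
    no monitor, fall back to the median over all monitors (an assumption).
    """
    local = [score for monitor_zone, score in monitor_scores if monitor_zone == zone]
    if local:
        return statistics.median(local)
    return statistics.median([score for _, score in monitor_scores])

# Invented example: a server reachable inside CN but poorly from elsewhere.
monitor_scores = [("cn", 18.0), ("us", -60.0), ("nl", -40.0)]
print(zone_score(monitor_scores, "cn"))  # decided by the CN monitor alone
print(zone_score(monitor_scores, "jp"))  # no JP monitor, falls back to all
```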

Hello,

I have the same issue here on my servers.

CU
Jörg

We run three stratum 1 servers and our experience with the V6 monitoring is spotty at best. IPv4 runs fine, no issues. IPv6 on the same servers has repeatedly become unreachable according to the NTP monitoring system over the last several months (each currently sitting at -100). We have validated global reachability using multiple servers located across the globe, looking glasses and RIPE tools, but the NTP monitoring system will still report our V6 as down when all other methods we test with say otherwise. We do not use tunnels for V6, we use multiple peering and transit connections (including HE and Cogent, given their V6 networks are effectively separate as of the time of writing this). I honestly do not know what else we can do.