Monitoring stations time out to our NTP servers

FWIW, I have zero loss right up until it falls apart at 2604:1380:2:63::1. Not much my ISP can do about that. There’s also horrible latency & jitter once the trace reaches 2604:1380::

I disagree with your conclusion that “the issue is the monitor, not all of the other servers”. Unless “all the other servers” are connected directly to the monitor, there’s some amount of internet between each server and the monitor.

People add servers to the pool and take servers out of the pool all the time. At the North America level there are 11% fewer IPv6 servers than 14 days ago, but at the global level it’s 6%. Both levels have more active IPv6 servers than yesterday. That doesn’t sound like an issue at the monitor end. Do you have specific knowledge of the 30-some and why they’re no longer in the pool? As volunteer admins we (currently) have no access to any more information than you do.

NTP uses UDP packets, which aren’t guaranteed to be delivered. Routers may be overloaded and drop packets. They may have traffic shaping or limiting rules in place. A router may be misconfigured.
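To illustrate, here is a minimal SNTP client query, a sketch in Python rather than anything the pool actually runs: a packet dropped anywhere along the path is indistinguishable from a dead server, because UDP does not retransmit.

import socket

def sntp_query(server, timeout=2.0):
    """Send one SNTP client request; return server time (Unix seconds) or None on timeout."""
    packet = bytearray(48)
    packet[0] = 0x23  # LI=0, version 4, mode 3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(packet, (server, 123))
            data, _ = s.recvfrom(512)
        except socket.timeout:
            return None  # a drop anywhere on the path looks exactly like this
    # Transmit timestamp: bytes 40-47, seconds since 1900 plus a 32-bit fraction
    secs = int.from_bytes(data[40:44], "big")
    frac = int.from_bytes(data[44:48], "big") / 2**32
    return secs + frac - 2208988800  # convert NTP epoch (1900) to Unix epoch (1970)

print(sntp_query("pool.ntp.org"))  # prints None when the packet is lost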

Tracerouting to your server from the monitor three times gave me three different routes of either 7 or 8 hops; none of the traceroutes reached as far as your server. In two cases the last hop has “tunnel” in the name, which implies the route might then be tunnelled.

For me the available evidence points to the issue being a routing problem between you and the monitor. As you have the relationship with your ISP to deliver traffic, my suggestion is for you to log a ticket with your ISP, who will have access to diagnostics that neither you nor I do.

Big surprise, my server is back and the count is +29 over yesterday.

Servers are being added and removed all the time. I myself am known to add and remove servers from the pool, change IPs (which is the same as removing and then adding again), etc., so it’s not surprising to see the count change. As for your comment that the count is +29 just because your server is back: that’s 0.6% of the total servers in the pool, which is a pretty small basis for still blaming everything on the monitoring server.

Again, this was the count of US IPv6 servers, a figure which has been fairly steady for the past 6 months. In each of these episodes, when people report that servers which have been solid for literally years suddenly fall out of the pool for days, that count of US IPv6 servers shows a significant and sudden decline. When the problem resolves itself, the count goes back to the steady-state number it was at before. Sure, 15% or 40% of US IPv6 servers might all be removed simultaneously and then all be replaced a couple of days later, right when people are reporting connectivity issues to fully operational servers, but Occam would suggest that’s a ridiculous proposition. Clearly, though, denial is more important than figuring out what’s wrong with the monitoring, so I won’t worry about it any more.

And clearly a passive-aggressive response, one that ignores that I think we all want more active servers, will help improve things :roll_eyes:

Time for a cuppa… :tea: :slight_smile:

The monitoring station in NY clearly has NTP IPv4 and sometimes IPv6 connectivity issues.

The zone https://www.ntppool.org/zone/de constantly flaps between minus and plus 100 servers, and so does my personal server. In reality my server is fully operational and clients use it without issues.

Since we are talking about ~10% of the servers in the pool, the IPv4 traffic to the servers flaps accordingly. Munin statistics for reference: https://abload.de/img/ntp_ip-yearajkt8.png

If the issue gets worse, the overall pool stability is questionable and the operations team should have a look.

@aubergine:

I can confirm it, too.

My server is listed with both its IPv4 and IPv6 addresses. For the last two weeks (or so) the monitoring has shown massive ups and downs with a lot of timeouts:
https://www.ntppool.org/scores/193.31.26.254/log?limit=200&monitor=*

During this time, the monitoring of the IPv6 address didn’t show any problems:
https://www.ntppool.org/scores/2a03:4000:2b:1348::123:254/log?limit=200&monitor=*

The server is monitored by Pingdom from different locations around the world (not ntpd directly, but the web server and name server on this machine), and both connections were tested every 60 seconds. Between 09/01/2019 and 09/30/2019, Pingdom recorded an uptime of 100%.

In my opinion, it’s clearly a problem with the monitoring server. The operations team really should take a look at it.

Dear @aubergine, @Jens and @ntppool,
What you see as a problem with the monitoring server itself is a network issue. There are servers even across the Atlantic Ocean from Newark which never show failures, for example: https://www.ntppool.org/user/cp8t2gudho2eojnkjni . If the outbound packets from the monitoring server take the NTT network, that network sometimes drops them. Other networks do not have this malicious behavior. The details of how that serious network problem was found are here: Holy assymetrical routes, Batman! Is this an as7012.net routing issue?


Hi all, maybe it’s helpful to quote a couple of @Ask’s posts here.

(TL;DR: the plan is to increase the number of monitors in the pool and have them in different countries. Testing is underway on the beta site.)

This thread also gives some details of the plans.


To be clear, I don’t think there’s a problem with a monitoring server in itself, but rather with its connectivity.

You have to remember NTP traffic is UDP, and just because TCP traffic is traversing the network okay does not mean UDP traffic is. Any monitoring other than actual monitoring of your NTP service is not the same thing. UDP is best effort, with no retransmission or delivery guarantee. There is also the issue that NTP traffic is often rate limited on networks, or even completely blocked by some “bad” network operators, because of the NTP DDoS attacks that left a bad taste in some admins’ mouths.
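For checking NTP reachability specifically, rather than a TCP service on the same machine, one rough approach is to send repeated client-mode queries to UDP port 123 and count the timeouts. This is a sketch only; the target name, probe count, and pacing below are arbitrary choices, and the pacing matters precisely because of the rate limiting mentioned above.

import socket, time

def ntp_loss_rate(server, attempts=50, timeout=1.0):
    """Fraction of SNTP probes to UDP port 123 that got no reply."""
    lost = 0
    for _ in range(attempts):
        packet = bytearray(48)
        packet[0] = 0x23  # version 4, client mode
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.settimeout(timeout)
            try:
                s.sendto(packet, (server, 123))
                s.recvfrom(512)
            except socket.timeout:
                lost += 1
        time.sleep(2)  # pace the probes to stay clear of NTP rate limiting
    return lost / attempts

print(f"{ntp_loss_rate('pool.ntp.org'):.0%} loss")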

Again (as mentioned in previous posts), the NTP services are fine as seen from other points on the internet. I don’t understand why people keep bringing up TCP and explaining basic networking, as though everyone experiencing intermittent connectivity issues from the monitoring service has found themselves at the rodeo for the first time.

I can see that ntppool monitoring is as broken as it’s ever been:
https://www.ntppool.org/scores/51.155.16.62
I’ll check back in a year or so.
Leo

Also have an issue here in the UK:
https://www.ntppool.org/scores/51.155.16.62/log?limit=200&monitor=*

Our connection is solid, with no reports of anything else having issues.
P.S. Can we have a 95Mbit option? Selecting 100Mbit with a 100Mbit connection generally starts to cause issues, as the units end up hitting line speed.

From here, I see about 70% packet loss to that server, so I’d say the monitoring is working fine. As usual lately, only port 123 is affected. Other ports are OK.

$ mtr -u -P 123 51.155.16.62
 Host                                    Loss%   Snt   Last   Avg  Best  Wrst StDev
 ...
 7. de-fra04d-rc1-ae-6-0.aorta.net        0.0%   147   22.8  21.4  18.5  35.7   2.3
 8. de-fra02a-ri1-ae-1-0.aorta.net        0.0%   147   22.2  21.9  17.9  38.6   3.5
 9. ae-21.r00.frnkge13.de.bb.gin.ntt.net 63.0%   147   22.0  21.3  18.8  25.1   1.3
10. ae-14.r24.frnkge08.de.bb.gin.ntt.net 72.6%   147   22.3  21.5  19.0  25.4   1.7
11. ae-5.r24.londen12.uk.bb.gin.ntt.net  72.1%   147   45.3  42.5  38.7  48.8   2.2
12. ae-3.r05.londen12.uk.bb.gin.ntt.net  74.7%   147   41.9  43.1  38.7  50.3   2.3
    ae-1.r04.londen12.uk.bb.gin.ntt.net
13. ae-1.a00.londen12.uk.bb.gin.ntt.net  70.7%   147   45.0  43.2  38.6  58.0   3.9
    ae-0.a00.londen12.uk.bb.gin.ntt.net
14. ???
15. vl-50.ae6.cr1.th-lon.zen.net.uk      75.3%   147   54.7  45.4  38.3  74.3   8.3
16. ae4-0.cr1.wh-man.zen.net.uk          74.7%   147   50.3  47.2  39.6  77.2   9.7
17. ae0-0.ar6.wh-man.zen.net.uk          72.6%   147   48.2  48.1  44.6  57.9   2.3
18. 88-98-152-106.dsl.zen.co.uk          71.9%   147   61.5  61.7  58.4  64.8   1.7

It’s a UK server, serving UK clients, taken out of the pool by a single station in the USA due to packet loss somewhere in Frankfurt.
This is such a waste of goodwill.
Leo


No, sorry, I don’t think the monitoring works fine.
If a monitoring server’s purpose is to monitor UDP port 123 and it is connected via networks that drop UDP port 123 packets, then the server fails at its only job.

Your mtr shows that the packet loss starts in the routing path, and the server owner has no influence on that. Since the project works DNS-location based, it’s also kind of irrelevant what a station in NY thinks the status of servers in the EU might be.
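For reference, each zone is served under its own DNS name (e.g. uk.pool.ntp.org), so a client only sees servers from its zone’s rotation. A quick way to peek at a zone’s current rotation, as a small Python sketch:

import socket

# Each zone has its own DNS name; clients get servers from their zone's rotation.
for zone in ("uk.pool.ntp.org", "de.pool.ntp.org"):
    addrs = sorted({ai[4][0] for ai in socket.getaddrinfo(zone, 123, type=socket.SOCK_DGRAM)})
    print(zone, addrs)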

The project’s monitoring will probably be OK once the station in Amsterdam is live and connected via a network that fits the requirements.

In this case the problematic point is closer to the NTP server than to the monitoring host, and most of the servers in the UK zone don’t seem to have this problem, so I wouldn’t blame the network connection of the monitoring host. FWIW, my mtr test was run from within the EU.

There may be nothing the server admin can do except move to a different network. It’s bad luck. The internet is becoming more and more hostile to NTP. Some of my servers were affected too.

Yes, it would be nice to have a monitoring host in each zone, so servers could be selectively enabled/disabled per zone.

Ideally, instead of having a few specialized monitoring hosts, I’d like each NTP server to have an option to monitor other servers. It could be a simple shell script (e.g. running as a cronjob) that would fetch a list of servers, measure the local clock’s offset against them, and upload the results. With so much data it would be easier to decide whether a server should be enabled in a zone or not.
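A minimal sketch of that idea, in Python rather than shell for clarity. The server-list URL and upload endpoint below are hypothetical placeholders, not real pool APIs, and the offset comes from the standard ((t2 - t1) + (t3 - t4)) / 2 formula using the client’s own send and receive times.

import json, socket, time, urllib.request

LIST_URL = "https://example.org/servers.txt"  # hypothetical: one server per line
REPORT_URL = "https://example.org/report"     # hypothetical upload endpoint

def ntp_ts(data, i):
    """Decode the 64-bit NTP timestamp at byte offset i into Unix seconds."""
    return (int.from_bytes(data[i:i+4], "big") - 2208988800
            + int.from_bytes(data[i+4:i+8], "big") / 2**32)

def ntp_offset(server, timeout=2.0):
    """Local clock offset vs server via ((t2-t1)+(t3-t4))/2; None on timeout."""
    packet = bytearray(48)
    packet[0] = 0x23  # version 4, client mode
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        t1 = time.time()
        try:
            s.sendto(packet, (server, 123))
            data, _ = s.recvfrom(512)
        except socket.timeout:
            return None
        t4 = time.time()
    t2, t3 = ntp_ts(data, 32), ntp_ts(data, 40)  # server receive/transmit times
    return ((t2 - t1) + (t3 - t4)) / 2

servers = urllib.request.urlopen(LIST_URL).read().decode().split()
results = {s: ntp_offset(s) for s in servers}
urllib.request.urlopen(REPORT_URL, data=json.dumps(results).encode())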


So the issue here isn’t the server (it’s our prototype, serial number #001 LeoNTP, that’s been running fine for years) or its connectivity (it’s on a 1Gbps leased line set to 50Mbit for pool purposes). I did some tests from a few hosts within the UK and there was no loss to the server on port 123. Somewhere between the monitoring and the server, an intermediate host is dropping NTP packets. This is likely out of my control and the monitoring station’s control.

(Side note: can we get an mtr from the monitoring station to my host to see where the issue lies?)

However, the monitoring relies on a single result to dictate whether a host is “suitable” for inclusion in the pool. This is crazy for the reasons highlighted above. There needs to be some sanity checking: tests from multiple servers in different locations before a host is deemed unreliable, as sketched below.
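For illustration, the sanity check could be as simple as requiring a quorum of monitors in different locations to agree before a host is scored unreachable. This is a hypothetical sketch, not how the pool’s scoring actually works; the probe functions are stubs.

def reachable(server, probes, quorum=2):
    """Count a server as down only if at least `quorum` monitors fail to reach it."""
    failures = sum(1 for probe in probes if not probe(server))
    return failures < quorum

# Stub probes standing in for monitors in different locations:
monitors = [lambda s: False,  # e.g. Newark, behind a lossy transit path
            lambda s: True,   # e.g. Amsterdam
            lambda s: True]   # e.g. a UK vantage point
print(reachable("51.155.16.62", monitors))  # True: only one of three failed

With real probes distributed across zones, a single lossy transit path like the NTT one shown above would no longer be enough to eject a server.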

Anthony
