Suggestions for monitors, as Newark fails a lot and the scores are dropped too quickly

Everything you said is correct. A patch to the site to add a legend to this graph would be welcome… One extra bit of information: The graph changed in 2011 when IPv6 got counted separately. I believe until then they were just counted in the inactive set.

The JSON for the graph is available at https://www.ntppool.org/zone/..json?limit=240 (replace the dot with a zone name, for example https://www.ntppool.org/zone/asia.json?limit=240 – or change the limit number to get more granular data).
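For anyone who wants to poke at those numbers programmatically, here is a minimal Python sketch of fetching that JSON. The zone name and limit are just the example values from the URL above, and the response’s field names aren’t documented in this thread, so the script only inspects the structure rather than assuming a schema.

```python
import json
import urllib.request

zone = "asia"   # any zone name, e.g. "de", "europe", "pt"
limit = 240     # number of data points, as in the example URL above
url = f"https://www.ntppool.org/zone/{zone}.json?limit={limit}"

# Fetch and parse the per-zone graph data.
with urllib.request.urlopen(url, timeout=10) as resp:
    data = json.load(resp)

# Print the top-level structure without assuming particular field names.
print(type(data).__name__, list(data) if isinstance(data, dict) else len(data))
```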

Thinking off the wall… This issue seems to involve various peers, so maybe if each monitor looked at which peers were en route to “bad” servers, it could figure out whether a peer was a common factor, and then avoid sending repeated messages about outages caused by that peer and/or flag to that peer that their route is broken. Maybe it would also help in deciding where additional monitors should be positioned.
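A hedged sketch of what that peer analysis could look like, using made-up traceroute data (this is not anything the pool actually runs): count how often each intermediate hop shows up on paths toward currently “bad” servers but not on paths toward “good” ones, and treat a frequent bad-only hop as the likely common factor. Alerts could then be grouped per suspect hop instead of per server.

```python
from collections import Counter

def suspect_hops(bad_paths, good_paths, threshold=0.8):
    """bad_paths / good_paths: lists of hop-address lists from traceroutes."""
    bad_counts = Counter(hop for path in bad_paths for hop in set(path))
    good_hops = {hop for path in good_paths for hop in path}
    return [
        hop for hop, n in bad_counts.items()
        if n / max(len(bad_paths), 1) >= threshold and hop not in good_hops
    ]

# Made-up example: one transit hop appears on every path to a failing server.
bad = [["192.0.2.1", "198.51.100.1", "203.0.113.10"],
       ["192.0.2.1", "198.51.100.1", "203.0.113.20"]]
good = [["192.0.2.1", "198.51.100.2", "203.0.113.30"]]
print(suspect_hops(bad, good))   # -> ['198.51.100.1']
```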

Yes, the traceroute debug tool actually has a “json mode” that was the start on that (a really, really long time ago).

https://trace.ntppool.org/traceroute/8.8.8.8?json=1

The motivation for writing it was to have a tool that could output traceroutes in a format the computer could think about.
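Assuming the endpoint still returns JSON when called with json=1 (the output format isn’t described in this thread), a minimal fetch could look like this; it just pretty-prints whatever comes back:

```python
import json
import urllib.request

url = "https://trace.ntppool.org/traceroute/8.8.8.8?json=1"
with urllib.request.urlopen(url, timeout=30) as resp:
    print(json.dumps(json.load(resp), indent=2))
```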


Ooh, nice! (A long time in a galaxy far, far away… :smile:)

Hi ask,

the monitoring issue is back, and the .de zone flapped by -10 to -20% over the last 24 hours.
https://www.ntppool.org/zone/de

Current score 0, with the monitor reporting connectivity problems for a day now:
https://www.ntppool.org/scores/37.120.164.45

Beta does a better job:
https://web.beta.grundclock.com/scores/37.120.164.45

Regards

Aubergine supplied some traceroutes. Packets are being dropped near the NTP server, probably by Zayo.

Currently an interesting situation even in the beta system.

Monitoring Station: Newark, NJ, US (-3), Los Angeles, CA (20), Amsterdam (20)

This gives an overall rating of 2.1, which is hard to understand. It seems more like the one monitoring node should be treated as faulty, since the majority of monitoring members think the NTP server is working perfectly.
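To illustrate the point (this is not the pool’s actual scoring formula, just the combination rule being argued for here): a simple median across monitors would let the two healthy vantage points outvote the single faulty one.

```python
from statistics import median

# Scores quoted above, one per monitoring station.
monitor_scores = {"Newark": -3, "Los Angeles": 20, "Amsterdam": 20}
print(median(monitor_scores.values()))   # -> 20: the lone outlier is ignored
```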

I’m seeing the same issue here in Finland. A lot of the servers I manage are suddenly seeing scores of -15 to -33 even though I can’t find anything wrong with them.


Cool, more than a year later and the problem is still there.
Stuggi, the problem isn’t your server, it really isn’t.

I took all my servers out of the pool a long time ago.
You can also list them without being in the pool; they will still be used.

The pool flapped by -20% over the last few days because of the monitoring.

Almost 2 years later, let me add:
(background:)

  • I maintain 2 servers in Portugal, each with v4 and v6 addresses;
  • Both v6 addresses are always at score 20 except at maintenance reboots (less
    than once a year);
  • Both v4 addresses have lately always been out of the pool, below 10. In recent weeks I have
    received almost daily emails saying “NTP Pool: Problems with your NTP server”.

It’s unfortunate, since there are few servers in pt.pool.ntp.org and these 2 stratum 2 servers do work (in Europe, at least).

I’ll just leave this here for the deniers:

So yes, there is nothing we server maintainers can do; the Newark probe really is surrounded by a Zayo moat full of NTP packet eaters.
That disclaimer is still effectively enforced. Note the final words:

It is our evaluation at this time that providing unfiltered NTP traffic is now more of a risk than benefit.

Example of current scores on the beta site for one IP:

Monitoring Station: Newark, NJ, US (4.5), Los Angeles, CA (19.9), Amsterdam (20)

Typical. QED, etc.

now: Monitoring Station: Newark, NJ, US (-60)
beta: Monitoring Station: Newark, NJ, US (-32.9), Los Angeles, CA (20), Amsterdam (19.6)

?..

Not sure if this is related or relevant, but a while ago I added a stratum 1 server that was behind NAT (at a residential connection served by an ‘el cheapo’ CPE with port forwarding enabled). It seemed to work well, but as I gradually increased the ‘net speed’ setting for IPv4 and IPv6, I noticed that IPv4 quickly collapsed at some point. Even though the load was not spectacular in terms of bandwidth, the CPE apparently couldn’t handle the many UDP packets per second (400 or so) coming in. IPv6 (no NAT, obviously) did not suffer and kept working quite well.

I just wanted to share this experience in light of this discussion, because I can imagine that quite a number of pool volunteers are running their service with a similar setup. Perhaps their service also suffers from the same limitations, causing the monitoring system to drop them below 10 on a regular basis.

As a disclaimer I’d like to note that I have also noticed weird, unexpected monitoring drops in a 1 Gbit/s enterprise environment, so I’m not claiming that observed issues on the monitoring side aren’t real.


OTOH, in my case above, I’m talking about 2 machines with direct 10 Gbps uplinks.
CPU is also not an issue; they can deal with the 30k-40k clients (when in the pool) without breaking a sweat.

Net testing determined that the packets from monewr1 all arrive perfectly. It’s the answers that are randomly discarded by Zayo. Depending on the time of day, their drop rate reaches 90%. The drops only affect port UDP/123, and only over IPv4: IPv6 maintains a perfect 20.
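For anyone who wants to reproduce that kind of measurement from another vantage point, here is a rough sketch of a loss probe: it sends plain SNTP client queries to UDP/123 and counts how many answers come back. The server address and probe count are placeholders, and it measures the whole round trip rather than any particular network segment.

```python
import socket
import time

SERVER = "192.0.2.123"   # placeholder: put the NTP server's IPv4 address here
PROBES = 50
QUERY = b"\x23" + 47 * b"\x00"   # LI=0, VN=4, Mode=3 (client) SNTP request

received = 0
for _ in range(PROBES):
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(2.0)
        try:
            s.sendto(QUERY, (SERVER, 123))
            s.recvfrom(512)
            received += 1
        except socket.timeout:
            pass
    time.sleep(0.5)

print(f"{received}/{PROBES} replies, "
      f"{100 * (PROBES - received) / PROBES:.0f}% lost")
```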


Actually that’s bad. Theoretically, if the monitoring fails, it can take the entire pool down with it, right?

Yes, it does appear that way with the current single-point-of-failure monitor.
The “beta” monitor is far superior in that respect. And others…

This sounds like connection tracking is still enabled somewhere (most likely on the NAT device). Even a really shitty consumer router will generally route thousands of requests per second without issue, provided conntrack is disabled for both inbound and outbound NTP traffic.

If conntrack is not disabled, then it’ll struggle under the very diverse client base of the pool, which causes massive bloating of the conntrack state tables and correspondingly poor performance in much the same manner you describe above.
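For reference, a common way to take connection tracking out of the NTP path on a Linux box is to mark port 123 traffic as NOTRACK in the raw table; the exact rules depend on the setup, so treat this as a sketch rather than a drop-in config:

```sh
# Skip conntrack for NTP in both directions (IPv4 shown; repeat with
# ip6tables for IPv6). Adjust chains/interfaces to match your firewall.
iptables -t raw -A PREROUTING -p udp --dport 123 -j NOTRACK
iptables -t raw -A OUTPUT     -p udp --sport 123 -j NOTRACK
```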


Scores now are particularly interesting. Example:

  • Official - Monitoring Station: Newark, NJ, US (-74.2)

  • Beta - Monitoring Station: Newark, NJ, US (-46.8), Los Angeles, CA (20), Amsterdam (20)

:frowning:

I have precisely the same impression. My server’s score has never been this low:

[score graph]


Same pattern as in my example; the drops intensified yesterday, but the score at Newark has been low for some weeks now.

(Apparently they only discard NTP traffic from Europe, judging by the complaints here.)
