Suggestions for monitors, as Newark fails a lot and the scores are dropped too quickly

It’s really crazy that there are still discussions about the NJ network monitoring quality for NTP.
Like this is not about alternate facts or feelings.

Has it gotten worse compared to LA => Yes
Did many partners complain about monitoring before the move => No
Can partners do something => No

I agree. We need more monitoring stations, and we should disregard the results of those that have no relevance to the client set a server actually serves. For example: my server is in ch.pool.ntp.org and has no connectivity issues with the Swiss NTP client set (I believe so). The very bad results from the monitoring station in Newark have absolutely no relevance to the connectivity between my servers and my clients, since my NTP servers have no clients located in North America.

The current situation is annoying, everybody must agree with me. I feel this forum is the most appropriate place to express my frustration.

By the way, it is not the geography but the network topology that counts. My servers get results from the Amsterdam test monitoring station that are as bad as those from Newark, even though Amsterdam is on the same continent. On the other hand, from LA the score is flat at the top, at 20, even though the west coast of the USA is farther away from Switzerland than Amsterdam or Newark.

I have to admit as well: if there were no unnecessary NTP packet drops on the Internet, a single monitoring station would be enough to track the availability of NTP servers all over the world. So I suggest collecting the most comprehensive information we can about those ISPs which are inadvertently or even maliciously dropping NTP packets, and putting them into a hall of shame.

That is not what I’m saying.
The scale should decrease but if e.g. a Belgian monitor sees you as good but Newark doesn’t, that should not mean you need to be taken out of a Belgian or even European pool.
Also, we can not trust all monitors to be right when checking as we see now.

The monitoring system should be solid enough to know whether it is a monitor or the server that is flawed.
A server should not be marked bad because a monitor has problems.
That is what I’m suggesting.

Before I joined the forum I spent weeks troubleshooting my server because of all the emails marking it bad.
But all the tests I did from different locations didn’t show any losses other than a few over many days.

Only to find later on that my wasted time on finding why my server is marked bad is due to the monitoring system.

The tracing we are doing at the moment is continuous and structured, thanks to Steve; it is also checked proactively from multiple sides and servers.
It is established that the servers marked bad are no different than those marked good.

The only thing we do not know yet is why it happens. It doesn’t look like congestion as it happens on multiple routes not just mine.
Other “hidden” monitors are running and they show a different story.

Bas.

FYI, on the charts the lines (best I can determine) are:

Solid Light Blue: Total IPv4
Solid Dark Green: IPv4 serving time (Good Servers)
Solid Yellow: IPv4 not in rotation (Bad Servers)

Dotted lines are the same, but IPv6.

The IPv4 dips you see in those charts were before the move. I don’t recall exactly what the issues were way back then, but I’m sure there are posts about it with some of the troublesome routes.

If you look at all the continental zones, on average it’s about 7-10% that are not in rotation. That’s not bad considering all the various factors (that the pool has no control over)…

IPv6 doesn’t have this problem, and you see nobody complain about IPv6, just IPv4.

I’ve been complaining about IPv6. Every time my IPv6 server gets kicked by the new monitor it’s accompanied by a massive loss of other US-based IPv6 servers. I can’t duplicate any issue testing from any other point I have access to; just the ntppool.org monitor seems to have problems.

Sorry, my mistake, most people reporting problems that I saw said IPv6 was fine and IPv4 was not.

The common thing is the monitor has problems, so we can include IPv6 as well.

It’s getting better and better and better with this location.

For others who want to know: last week my system was marked bad, again.
But the analysis of the data shows no signs of congestion; apart from a spike, the round-trip timing is consistent.

This isn’t conclusive in any way, just to show that we are testing the crap out of everything to find the cause.
Thanks Steve for the picture showing a week of testing:

[image: roundtrip-last-week]

As you see, if there were congestion problems it would show a lot more than a vertical line; it would be all over the place.

If we compare this to the monitor…ooooops…

That is a big difference in outcome. Same locations, same network-path.

Jason, I want to comment on this, because I’m talking about bad score because a monitor can’t reach the NTP-server.
This isn’t much of an issue if other monitors can.

However, if the time is OFF big time, and seen by just 1 monitor, yes then it should be taken out and marked bad.

But if it’s just a time-out on a monitor (sadly it’s just 1 monitor in the pool) then no, it should confirm if it’s really out.

As you know the pool is made out of 4 servers in the DNS all the time, so the client will go to the next in line.
Because of the 4 best servers, the chance of getting no connection at all is very small.
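
As a back-of-the-envelope sketch of that claim: if each of the 4 servers in a DNS answer were unreachable independently and with the same probability, the chance of a client reaching none of them is that probability to the 4th power. Both the independence and the sample probabilities below are assumptions for illustration, not pool measurements; failures that share a network path are correlated, so real numbers will be worse.

```python
def chance_all_unreachable(p_unreachable: float, servers: int = 4) -> float:
    """Probability that every server in one DNS answer is unreachable.

    Assumes each server fails independently with the same probability,
    which is an optimistic simplification (not measured pool data).
    """
    if not 0.0 <= p_unreachable <= 1.0:
        raise ValueError("p_unreachable must be between 0 and 1")
    if servers < 1:
        raise ValueError("servers must be at least 1")
    return p_unreachable ** servers


# Illustrative per-server failure probabilities (made up for the example).
for p in (0.01, 0.05, 0.10):
    print(f"{p:.0%} per-server failure -> "
          f"{chance_all_unreachable(p):.6%} chance that all 4 fail")
```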

There is nothing wrong with a double-check if the server is really out.
But there is something really wrong if the server is very bad at keeping the right time.

2 different things. Also 2 different ways to monitor those.

That is why I suggested moving a server to the best monitor for a “ping”…but off-time is a no-no instantly.
Time-out on 1 monitor isn’t bad, bad-time-tracking is.

Bas.

I agree more with this version than with your original proposal, which assigned scores to the monitors themselves. No individual monitor can reach 100% of the NTP pool. Each individual NTP server should be scored by the monitors that can best reach that particular server.

If one entity is operating multiple monitors then a per-monitor score might be useful to manage them. If there are two monitors with identical scores that reach similar subsets of the Internet, then one of them should be shut down and moved elsewhere to improve diversity. This wouldn’t be useful for selecting NTP servers for the pool, only to help owners of multiple monitors determine where to put their resources.

We don’t?

Look at network operators along the routes. I found 3 different network operators in probes from several endpoints (not counting the endpoints, these are Telia, GTT, and Zayo) to edpnet. Along the edge to edpnet, two of those operators are fine, the third one isn’t, sometimes for hours at a time. Even if we can’t use that information to say the problem is definitely Zayo, we can say the problem is not the pool monitor, the pool server, or several other network operators between them.

We know from past experience (and academic papers, and commercial reports, and ops monitoring data…) that issues like this happen all over the Internet all the time. We expect 5-10% of the Internet to be unreachable from a single IP at any time (on average–much worse when crossing certain zone boundaries).

Or do you mean we don’t know as in “we don’t know which of 20 different possible root causes for Newark’s bad behavior is the correct one”? I don’t see what we’d gain from knowing that–it’s not particularly actionable information. We already know the final answer is not likely to be a problem with the monitor host itself since it is consistently scoring 90% of NTP hosts well. This excludes most of the possible root causes Ask could fix directly.

If it turns out that the Zayo-edpnet edge is the root cause, an edpnet customer (i.e. Bas) can complain to edpnet about Zayo’s poor performance. If it turns out that the Packet-Zayo edge is the root cause, someone will have to become a Packet customer long enough to complain, because Ask won’t have time to do it. :wink:

The “deploy a diverse array of monitors” action doesn’t need any information about individual routing failures. It only needs to know that failures happen in broad statistical terms, and have a plan for coping with them.

Congestion doesn’t necessarily add latency. One way to keep queue times down is to just drop all the packets you don’t like (e.g. the UDP ones) whenever the queue starts to fill up.

Sorry, that is rubbish. There are papers enough, and they don’t mean anything other than producing funds to get a department subsidized.
Do not go there, the Internet is not that bad, in fact it’s 99% fine all the time.

Trust me, I run a website with UDP-realtime-traffic 24/7 and the response is great, no drops or jitter or anything.

I have been at the start of the Internet when it was called Arpa-net…comparing conditions back then to now is ridiculous.
At the time Fidonet was still in place.

Today any IP that is online is about 99.99% reachable all the time, and that is where we are today.
Bad peers happen, but not long.

I do not agree. Sorry.

And give me papers that prove that in real-life today. As it is not the case.

Bas.

99.99% is 2, maybe 3 orders of magnitude better than reality.

RIPE reports global reachability of root DNS servers at about 98.5% (99.0% in US and EU, lower everywhere else). Root DNS servers are the most reachable IPs on the Internet (heavily replicated anycast hosts that are often local to their clients), so 98.5% is an upper bound for global IP reachability. Normal non-anycast IP addresses will be worse. Normal non-anycast IP addresses outside of well-connected countries will be much worse.

Internet researchers appear to have stopped bothering to write articles about this because…

  • it hasn’t changed much in the last 20 years. Standards for performance, robustness, and latency have all improved, but they’ve improved everywhere at once, so the fraction of the Internet failing to meet current standards at any time is the same.
  • it’s the kind of information that a commercial ISP already collects about itself in real time. Academic papers have nothing to contribute any more.
  • it’s never been easier or cheaper to rent, buy, or borrow IP addresses all over the world for a few hours to run your own global monitoring reachability study.

The 90-95% figure is an easy result to replicate. Pick target hosts in 30 locations outside EU and US at random, configure monitor hosts in 10 locations at random to ping them and count replies, then wait for disagreements in the counts to appear. This is pretty boring in the US, CA, UK, NL, half the EU, where reachability is 98-99.97% and all the monitors agree within 1%. It gets more interesting with target IPs in the rest of the world.

After collecting data for a day, I get:

  • target IPs 1-23 are reachable from all monitors, >90% packets received.
  • target IP 24 went down for about 20 minutes, all monitors reported 100% packet loss at the same time. Went back to 99.7% everywhere when the target IP came back up.
  • target IPs 25-28 produced assorted different results from 0 to 50% packet loss across all the monitors. Variations occurred over time as well, i.e. at some times more monitors agreed, but at other times they diverged again. Each IP was unreachable to different monitors according to ASNs used in the route.
  • target IP 29 was 99% reachable from 6 monitors, but 0% reachable from the other 4 monitors, for a continuous 30 minutes during the test (a textbook netsplit).
  • target IP 30 was 91% reachable from one monitor, 74% from another, 0% from the remaining 8. No substantial changes over the test interval, i.e. some monitors could never reach the target IP, and even the successful monitors had a lot of packet loss all the time.
  • 5 monitors produced nearly identical results. It turned out those monitors had nearly identical uplink paths, so there was no routing diversity. I had sufficient diversity from the other 5 to complete the experiment.

All monitors agreed on the reachability of only 80% of the IPs they were monitoring at any time, and produced divergent results for the other 20% at least some of the time. My original guesstimate that a single monitor can measure 90-95% of global reachability was too optimistic!
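
The "how often do all monitors agree" number can be computed along these lines. This is a minimal sketch: the data layout, the 5% tolerance, and the sample values are all assumptions for illustration and are not the measurements from the test above.

```python
from typing import Dict, List


def agreement_fraction(received: Dict[str, List[float]],
                       tolerance: float = 0.05) -> float:
    """Fraction of targets on which every monitor reports about the same
    reachability (spread between best and worst monitor <= tolerance).

    `received` maps a target name to the fraction of packets received,
    one entry per monitor. The tolerance is an arbitrary choice.
    """
    if not received:
        raise ValueError("no targets given")
    agreed = sum(
        1 for per_monitor in received.values()
        if max(per_monitor) - min(per_monitor) <= tolerance
    )
    return agreed / len(received)


# Made-up sample: three monitors, four targets.
sample = {
    "target-a": [0.99, 0.99, 0.98],  # fine everywhere
    "target-b": [0.00, 0.00, 0.00],  # down for everyone: monitors still agree
    "target-c": [0.99, 0.00, 0.99],  # netsplit: one monitor cannot reach it
    "target-d": [0.91, 0.74, 0.00],  # lossy and partly unreachable
}
print(agreement_fraction(sample))
```

Note that a target that is down for every monitor counts as agreement; only the targets where monitors disagree say anything about the monitors' network paths.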

Comparing traceroutes between monitors experiencing success and failure identifies peer ASNs that are involved in the failure. In my data set these all correlate, i.e. one ASN is common to all failing routes and does not appear in any successful routes to a particular target IP. These failing routes started working again within an hour in most cases.
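
That comparison is essentially set arithmetic on the AS paths. A minimal sketch, assuming each traceroute has already been reduced to a list of ASNs (the paths below use private-use ASNs and are invented for the example):

```python
from typing import Iterable, List, Set


def suspect_asns(failing_routes: Iterable[List[int]],
                 working_routes: Iterable[List[int]]) -> Set[int]:
    """ASNs that appear on every failing route and on no working route."""
    failing = [set(route) for route in failing_routes]
    if not failing:
        return set()
    on_every_failure = set.intersection(*failing)
    on_any_success = set().union(*(set(route) for route in working_routes))
    return on_every_failure - on_any_success


# Hypothetical AS paths from four monitors to one target IP.
failing = [[64512, 64520, 64530], [64513, 64520, 64530]]
working = [[64512, 64521, 64530], [64514, 64521, 64530]]
print(suspect_asns(failing, working))  # only 64520 is unique to the failures
```

An empty result means no single ASN explains all the failures, which is useful to know as well.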

According to the monitor graph you posted and the random day of Internet activity I monitored, when bad peers happen, they usually get resolved in less than an hour. So I agree bad peers are individually short-lived, but the short lives of multiple bad peers get aggregated together when they interact with the NTP pool monitor.

At any given time there could be dozens of bad peers in the Internet, and many of them would be in the path of a monitor trying to reach all 4300 nodes in the NTP server pool from a single IP address. If the average lifetime of a bad peer is 30 minutes, each bad peer will eat two NTP monitor poll packets for every NTP pool server on the other side of the peer. One bad peer can make half of Europe disappear from monitoring.
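
The "two poll packets" figure works out if the monitor polls each server roughly every 15 minutes. That interval is inferred from the numbers in this post, not a documented value, and the server count in the example is hypothetical.

```python
def polls_lost(outage_minutes: float,
               poll_interval_minutes: float = 15.0) -> int:
    """Whole monitor polls that fall inside one outage, per server.

    The 15-minute default is an assumption inferred from
    "30 minutes eats two polls"; it is not a documented pool setting.
    """
    if poll_interval_minutes <= 0:
        raise ValueError("poll interval must be positive")
    return int(outage_minutes // poll_interval_minutes)


def total_polls_lost(outage_minutes: float, servers_behind_peer: int,
                     poll_interval_minutes: float = 15.0) -> int:
    """Lost polls summed over every pool server behind the bad peer."""
    return polls_lost(outage_minutes, poll_interval_minutes) * servers_behind_peer


# A 30-minute bad peer with a hypothetical 1000 pool servers behind it.
print(polls_lost(30), total_polls_lost(30, 1000))
```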

The solution is simple.

If an NTP-server fails at 1 monitor it’s handed to the next for testing.
So if you have 10 monitors, it should test 10 times if it’s really failing.

If any monitor reaches it before all 10 have failed, then it’s online and should be marked as such.
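
A minimal sketch of that hand-off, with hypothetical probe functions standing in for real monitors. This is not how the pool's monitor is implemented; it only shows the rule "down means every monitor failed", and it checks reachability only, not whether the served time is correct.

```python
from typing import Callable, Iterable

# A probe takes a server address and returns True if it got an NTP reply.
Probe = Callable[[str], bool]


def confirmed_down(server: str, monitors: Iterable[Probe]) -> bool:
    """Return True only if every monitor fails to reach the server.

    Stops at the first monitor that succeeds, so a healthy server
    usually costs a single probe.
    """
    tried = 0
    for probe in monitors:
        tried += 1
        if probe(server):
            return False  # someone can reach it: it is online
    if tried == 0:
        raise ValueError("need at least one monitor")
    return True


# Hypothetical monitors: two that time out and one that gets an answer.
def times_out(server: str) -> bool:
    return False


def gets_reply(server: str) -> bool:
    return True


print(confirmed_down("192.0.2.1", [times_out, times_out, gets_reply]))
```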

Nobody can reach a server all the time, this is true, but one may expect the monitoring system to at least try its best to make sure whether it’s online or not.
At the moment it fails often and marks you bad instantly, sorry but that is a failing monitoring system.

I have 4 IPv4 servers online now; all of them fail in monitoring from time to time, but none has actually been offline…that’s not right.
And 2 of them also serve IPv6 and score perfectly there.

That alone makes the current monitoring system flawed. It should not make these mistakes, regardless of whether some path drops UDP or not. In that case it should use a second, third, fourth etc. monitor to be 100% sure the server is gone.

Currently it doesn’t do that, it marks you bad instantly.

The purpose of the NTP Pool is to give high quality service to users, not high scores to servers. A server that’s barely reachable isn’t helpful to users.

Factors need to be balanced – like having failsafes to stop the entire Pool getting disabled if there’s a nasty monitoring failure, or balancing the value of marginal servers in underserved zones – but the pool shouldn’t make extraordinary effort to include broken servers.

I would be upset if I was getting false “your server isn’t reachable” emails all the time, though.

It’s not a problem for users to have a handful of mostly working servers disabled in well-served zones.

But you make a mistake.

We have tested many servers that the monitor marks as bad, and then you get mailed.
Where “ntpdate -q” gave a perfect response.

Also the pool gives at any time 4 servers via a DNS-request.

We have PROOF the servers are NOT broken, yet the monitor marks them bad, that is the problem.

In short, the pool itself is being handicapped because the monitor marks far too many servers as bad, when they are not bad at all.

Most of those servers being in Europe, the biggest pool of them all, but the monitor can’t monitor them properly.

You are missing the point, the servers in question ARE reachable…just the monitor is getting a timeout, but the timeout is false to the rest of the world.

The purpose of a monitor is to check if servers work, but if the monitor is false…then what? And the monitor is falsely marking good servers as bad.

4 months now with at least one false-positive monitoring email per week.
This project looks inactively managed.

@de IPv4 zone flapping by 10-20% of total capacity and nobody cares (https://www.ntppool.org/zone/de)

The beta system has active monitors in Amsterdam, Los Angeles and Newark (New Jersey). The Los Angeles one is on a different network than the previous Los Angeles one (though in the same physical building) and is IPv4 only (I haven’t taken time to get the routers in the rack there set up with IPv6).

The current system (on the beta site) tries to spread out the queries so the monitoring systems don’t all go around the same time. I haven’t really given consideration to doing it the opposite way as you suggest here. I’ll have to ponder that a bit.

The DNS data updates every minute or so (I forget the exact interval) and gets distributed over a couple minutes (it’s intentionally not immediate or synchronized).

N is “3” for the beta site – how many monitoring systems would we need to have meaningful data that could be processed like this?

I appreciate that you are bringing this up, there are a lot of details like this.

  1. Experience has shown that it’s very impractical to “write rules” around this for specific countries, so the system really needs to “figure it out”.

  2. I’m a lot more concerned about the “outliers” than about the well-served, generally well-working geographies. Someone commented (in frustration) on how Germany lost a double-digit percentage of active servers for some days last month. I haven’t looked at it for a while, but whenever I’ve been tinkering with how to improve service in the poorly served countries, Germany usually stood out in the data as the best served country, measured as server capacity relative to queries. (So … having even 20% less capacity for a few days has much less impact than some countries might have from losing two servers).

Haha, I know it’s a frustrating topic (obviously, dozens and dozens of messages later), but that was a hilarious image in my head. Thanks for the laugh!

My hypothesis is that most of the NTP servers running old “DDOS-abusable” software don’t support IPv6 so the network filtering some networks are doing is (mostly?) IPv4 only.

There were lots of complaints before the move, actually!

Though catching up on this thread (thanks for starting it, @Bas, even if I disagree with your wish for passing a server that shows any sign of life): I agree with @Zygo (and others who expressed similar sentiment) that for the protection of server operators and the experience of the NTP clients we have to also make sure broken servers are promptly removed.

On the beta system with three monitors: is it “unfairly” kicking out working servers because the overall score is basically a composite (rather than a percentile of the 3 monitors, as @Zygo has suggested)?

I’m guessing that’d be a relatively easy change to make. I’d like to rewrite the scoring in Go, but for now it’s here: https://github.com/abh/ntppool/blob/master/lib/NTPPool/Control/Monitor.pm#L161
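
To illustrate the composite-versus-percentile difference, here is a minimal sketch. It assumes "composite" behaves like a plain average and uses the median as the percentile; neither has been checked against Monitor.pm. The three scores and the in-rotation threshold of 10 are also assumptions for the example.

```python
from statistics import mean, median
from typing import Sequence

IN_ROTATION_THRESHOLD = 10  # assumed cut-off for being served by the pool


def composite_score(scores: Sequence[float]) -> float:
    """Average of all monitors: one bad path drags the result down."""
    return mean(scores)


def percentile_score(scores: Sequence[float]) -> float:
    """Median of all monitors: a single outlier monitor is ignored."""
    return median(scores)


# Hypothetical server: two monitors see it as perfect, one cannot reach it.
scores = [20, 20, -15]
for name, fn in (("composite", composite_score), ("median", percentile_score)):
    value = fn(scores)
    state = "in rotation" if value >= IN_ROTATION_THRESHOLD else "removed"
    print(f"{name}: {value:.1f} -> {state}")
```

With three monitors the median needs two of them to agree that a server is bad, so a genuinely broken server is still removed while a single broken path is not enough.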

Yeah, I’m not really sure what to do about those, but they do suck. I feel adjusting the system to send them less would just be plastering over a legitimate problem (that the monitoring system has false negatives from the perspective of the server operator and of clients on a different network path than the monitor), but it would … help?

Everything you said is correct. A patch to the site to add a legend to this graph would be welcome… One extra bit of information: The graph changed in 2011 when IPv6 got counted separately. I believe until then they were just counted in the inactive set.

The JSON for the graph is available at https://www.ntppool.org/zone/..json?limit=240 (replace the dot with a zone name, for example https://www.ntppool.org/zone/asia.json?limit=240 – or change the limit number to get more granular data).
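
A minimal sketch for pulling that data with only the Python standard library. The function names are made up for this example, and since the layout of the returned JSON isn't described here, the sketch simply hands back whatever the server sends.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "https://www.ntppool.org/zone"


def zone_json_url(zone: str, limit: int = 240) -> str:
    """Build the graph-data URL for a zone name such as "asia" or "de"."""
    return f"{BASE_URL}/{quote(zone, safe='')}.json?limit={int(limit)}"


def fetch_zone_history(zone: str, limit: int = 240, timeout: float = 10.0):
    """Download and decode the JSON for one zone (needs network access)."""
    with urlopen(zone_json_url(zone, limit), timeout=timeout) as response:
        return json.load(response)


print(zone_json_url("asia"))
```

Calling `fetch_zone_history("de", limit=500)` would then return the decoded data for the German zone.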

Thinking off the wall… This issue seems to be with various peers, so maybe if the (each) monitor looked at which peers were en route to “bad” servers it could figure out if a peer was a common factor so not send repeated messages to outages caused by that peer &/ flag up to that peer that their route is broken. Maybe it would also help in deciding where additional monitors should be positioned.

Yes, the traceroute debug tool actually has a “json mode” that was the start on that (a really really long time ago).

https://trace.ntppool.org/traceroute/8.8.8.8?json=1

The motivation for writing it was to have a tool that could output traceroutes in a format the computer could think about.

Ooh, nice! (A long time in a galaxy far, far away… :smile:)