Suggestions for monitors, as Newark fails a lot and the scores are dropped too quickly

Bas · October 15, 2019, 10:50am

Bjorn asked me to post my suggestion here on how monitoring should work to avoid false negatives:

So here it goes.

I checked the monitor URL for several ntp-servers, and it’s my belief there is a problem with the algorithm that determines the score.

When 1 monitor has a time-out the score goes down immediately.

Even on the beta-server.

Instead you should code it differently.

Start with monitor Newark = ok = score / if fail start next monitor / no score yet.

Then monitor LA = ok = score / if fail start next monitor / no score yet

Then monitor Amsterdam = ok = score / if fail => combine all 3 monitors and score is bad after each 3 tries.

However, average the best monitor results and dedicate the best monitor to the NTP server until it fails.

Not only do you have a good monitor-system but it will also show what monitor is flawed, as a bad monitor will hardly test any ntp-servers at all.

Also, make a score go bad only after e.g. 9 tests, if the 10th is still bad then and only then count all scores and show it.

At the moment just a few time-outs are enough to make your score negative, regardless of whether other monitors can reach you.

Just an idea, based on what I see, the score goes bad without a multiple check, 1 bad monitor and it goes down.

The reason for this is the beta-website monitor output, as you can see my NTP-server is fine but 1 monitor fails a lot of times, marking my server bad but that is simply wrong, see for yourself:

https://web.beta.grundclock.com/scores/77.109.90.72/log?limit=200&monitor=*

When 1 monitor has a time-out the score is bad, the other monitors report my server to be fine.
Sadly the official site only uses 1 monitor and that is Newark, and if you are unlucky like me, the score of your server is terrible and unfair.

As such I suggest a better algorithm should be in place as explained above.

Greetings Bas.

littlejason99 · October 15, 2019, 3:23pm

There is no LA monitor anymore, the servers were moved from LA to NJ…

There are over 4,000 servers in the pool, and at any given time there are a few that have connectivity issues (usually due to some 3rd party connection, not directly tied to the server or monitor station).

Servers are pulled out of rotation quickly if they are unreachable because it would be worse to “assume” the server is still up and give that IP to hundreds, thousands, or even tens of thousands of requests and have it fail to reply to them.

Adding multiple monitor stations into the production site is on the to-do list, Ask has been traveling a lot (for his job) and not able to dedicate much time to pool development lately.

Bas · October 15, 2019, 4:31pm

Did you look into the matter?
Newark is dropping on me all the time, several times a day, where the other monitors do not.
This means it’s a bad monitor, as it repeats on the normal site and on the beta site.

As a result of this bad monitor I get mailed every day, sometimes multiple mails a day.
After that I got Steve after me, suggesting a lot.
I do not blame him, but the problem is the monitor-system, it’s not funny to explain all the time that the monitor has a problem.

I’m in contact with Bjorn and he suggested I should post it.
Too many people are told to contact their ISP, as if they have nothing better to do.

The Beta with (was?) 3 monitors does show that Newark (=via packet) drops UDP, the other monitors are fine.

You simply can not tell people their system is faulty when the monitor is the problem. Yet you state you can’t do anything.
I’m close to leaving the pool for what it is, as I’m tired of the daily mails blaming me for not being a good server.
When I’m not!

I feel like talking to a brick wall, where everything is denied as a problem.

Did you look at this link at all?

https://web.beta.grundclock.com/scores/77.109.90.72/log?limit=200&monitor=*

It clearly shows the Newark monitor is flawed, and tracing with “mtr --udp” shows the problem, all the time.

How come your OWN log and my Trace show exactly the same?

NTPman · October 16, 2019, 10:04am

I have just the opposite opinion on that.
To find a common denominator, we should analyze the different failure scenarios and how to deal with them. And all of that not from the viewpoint of the time server itself, but from the viewpoint of the clients; at the end of the day, the time service is provided to the clients.

1. Server is under maintenance, let’s say for a couple of hours. Even if a client tries to synchronize to it, it is not a big issue. Most likely the client will synchronize immediately to one of the other three servers. Later on, when the server is back, it will synchronize to that one too.
2. Network outage for about the same time: similar to the previous case.
3. Server is permanently down. A rare case, and that is important. The clients will synchronize to the other 3 servers. Not so good, but not a show stopper. Even if the permanently unreachable server gets removed from the pool only after many hours, it is enough. When a client gets restarted, there is a big chance that all 4 IP addresses will be good.
4. The many times discussed random NTP packet drop due to malicious misconfiguration of a transit network. The NTP protocol is redundant enough to survive high packet loss. That should not be a reason to get removed from the pool, at all.

My conclusion is that the rule to get in and out of the pool should be just the opposite of what we have today. Even one successful test round trip should raise the score drastically. One packet drop should lower the score just a little bit. A permanently down system will be out of service only after many hours, but that is enough to keep the pool clean in the longer term.

Even if you are very aggressive in removing a server from the pool, as today, it does not help the clients much to get better service. I had many cases where my clients got synchronized to only two or three servers. The other NTP servers were unreachable. Why is that? The state of the pool counts for server selection only at startup of my client’s time synchronizing daemon. If the uptime of my client is long, like a couple of weeks or months, state changes in the pool are no longer reflected in the running NTP daemon.

Zygo · October 16, 2019, 5:52pm

You’re missing case #5:

User configures a bandwidth setting on their server that is far higher than it can handle, and DDoSes themselves or nearby third parties.

This isn’t such a big deal in well-served zones like US, where a server is lucky to get only 20% of its requested traffic level. Half of the pool can disappear with little effect on pool users or server operators. Arguably, it would be better to tune the scoring for the bottom 10% of the pool to be rejected all the time, i.e. silently drop the servers below the 10th percentile accuracy to raise the average accuracy of the pool. I’m not proposing we do that, just pointing out that “include as many nodes in the pool as possible” optimizes one quality metric at the expense of another.

In underserved zones like Asia, a pool node can get many times its requested traffic level. A naive user can easily trigger a devastating tsunami of packets to land on their NTP host–or their entire country. Fast eviction from the pool is important, because the DDoS effect gets worse for the target IP address the longer it’s in the pool. We can’t rely on the target removing themselves from the pool because:

1. the target might be unable to reach the ntppool admin interface because they are flooded by NTP packets
2. the affected entity might be a third party (wrong IP address or shared network infrastructure) who has no idea what the NTP pool is or how to make the NTP traffic stop

The #2 case is especially ugly for the NTP pool–abuse complaints, multi-ISP cooperative investigations–all the expensive, non-revenue-generating things ISPs hate to do. In the worst cases, ISPs start dropping NTP packets at their borders and evicting NTP pool servers from their hosting services because they’re too much hassle.

When in doubt, refrain from pointing a firehose of network traffic at inexperienced strangers, and be ready to turn it off immediately at the first sign of trouble.

Start with monitor Newark…[lots of conditionals and communication between monitor servers]

That sounds far more complicated than it needs to be. There are only 4000 servers in the pool to monitor, and each monitor can score them all.

“Run N monitor servers, use the score from the monitor in the same zone if there is one, and the highest score if there isn’t” is probably good enough for small N. A truly unreachable or broken host will have a low score in every monitor, while a host with half-broken-half-working peering will have a high score in at least one monitor. A bad monitor will have a much lower average score for its pool nodes than its peers. If N is 10 or more, then the median or 90th percentile score can be used to weed out false positives or negatives. The DNS service can choose the percentile to change the tolerance for network partitioning failures.
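As a rough illustration of that selection rule, here is a minimal Python sketch. The function name, the monitor names, the zone labels, and the scores are all made up for the example, and taking the best in-zone score when several monitors share the server’s zone is an assumption the post does not spell out.

```python
from typing import Dict, Optional


def select_score(scores_by_monitor: Dict[str, float],
                 monitor_zone: Dict[str, str],
                 server_zone: str) -> Optional[float]:
    """Pick one score for a server out of the scores several monitors gave it.

    Prefer a monitor in the server's own zone; otherwise fall back to the
    highest score seen by any monitor. Returns None if nobody scored it.
    """
    if not scores_by_monitor:
        return None
    in_zone = [score for monitor, score in scores_by_monitor.items()
               if monitor_zone.get(monitor) == server_zone]
    if in_zone:
        # Assumption: with several in-zone monitors, take the best of them.
        return max(in_zone)
    return max(scores_by_monitor.values())


# Hypothetical data: one server seen badly from Newark, fine from Amsterdam.
zones = {"newark": "us", "amsterdam": "eu"}
scores = {"newark": -3.2, "amsterdam": 19.8}

print(select_score(scores, zones, "eu"))    # in-zone monitor wins
print(select_score(scores, zones, "asia"))  # no in-zone monitor: highest score
```

Note that a server in the same zone as a bad monitor still gets the bad score under this rule, which is why the median or percentile variant only becomes attractive once N is large enough.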

The scoring algorithm itself doesn’t need to change, other than to pick a favorable observation point to measure the score from. If the monitoring servers can reach the NTP pool hosting networks reliably, most pool nodes will have no problems staying in the pool. Dropping out of the pool for an hour or two of maintenance and score ramp-up barely affects the query rate, and ntpdate users will be glad that servers undergoing maintenance are not included in the pool for their one-shot queries.

N > 1 monitors has been work in progress entering its 4th year this month. It’s fun to talk about how we’d use 10 or 50 monitoring stations, but it’s a moot point while there’s still only one monitor running in production.

Bas · October 16, 2019, 6:41pm

The idea behind it is to monitor every pool server from the start, and if a monitor fails, use another.
But let that monitor check until there is an error.
However at the start of monitoring let the monitor test the ntp-server with the lowest delay/ping? And keep doing that until there is a failure.
Then and only then try other monitors, if they fail too then it’s a reasonable assumption the server is gone.
You do not do this for an ntp-server that is monitored trouble-free; it doesn’t need to be checked by others as it’s working fine.

Also with this approach you instantly know what monitor is having problems (all the time) as it loses servers to check, so there must be a problem.

It isn’t much communication, as you can keep the monitor information in a database, and if a server is checked OK, there is no need for another monitor to check it.
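A minimal sketch of that sequential idea, in Python, might look like the following. Everything here is hypothetical: `check_server`, the `assigned` dictionary standing in for the database, and the `probe` callback standing in for a real NTP query. The order of the `monitors` list stands in for "lowest delay first".

```python
from typing import Callable, Dict, List


def check_server(ip: str,
                 monitors: List[str],
                 assigned: Dict[str, str],
                 probe: Callable[[str, str], bool]) -> bool:
    """Return True if some monitor can reach the server.

    The monitor that last succeeded is tried first; others are only
    asked after it fails. `assigned` remembers the dedicated monitor.
    """
    preferred = assigned.get(ip)
    order = ([preferred] if preferred in monitors else []) + \
            [m for m in monitors if m != preferred]
    for monitor in order:
        if probe(monitor, ip):
            assigned[ip] = monitor   # dedicate this monitor until it fails
            return True
    assigned.pop(ip, None)           # every monitor failed: presume it is gone
    return False


# Hypothetical probe: pretend Newark can never reach this server.
def fake_probe(monitor: str, ip: str) -> bool:
    return monitor != "newark"


assigned: Dict[str, str] = {}
ok = check_server("192.0.2.1", ["newark", "amsterdam"], assigned, fake_probe)
print(ok, assigned)
```

The sketch also makes the cost visible: a failure is only confirmed after every monitor has been tried in turn.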

Also, they do not check 4000 servers constantly, seeing the log it checks only once every 15 minutes per server.

15 minutes is 900 seconds, so 1 monitor checks about 4.4 servers per second (4000 / 900), and if you have 5 monitors sharing the work that’s not even 1 per second each.
Sorry that is not a very big load as it’s only a few bytes per server.
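The arithmetic can be checked with a few lines of Python. The numbers (4000 servers, one check per server every 15 minutes, 5 monitors) come from the thread; the function name is just for illustration, and the split across monitors assumes each server is checked by only one of them.

```python
def checks_per_second(servers: int, interval_seconds: int, monitors: int = 1) -> float:
    """Average checks per second for each monitor when the servers are
    divided evenly over `monitors` and each is checked once per interval."""
    return servers / interval_seconds / monitors


single = checks_per_second(4000, 15 * 60)
split = checks_per_second(4000, 15 * 60, monitors=5)
print(f"{single:.2f} checks/s for one monitor, {split:.2f} checks/s each across 5")
# prints "4.44 checks/s for one monitor, 0.89 checks/s each across 5"
```

If instead every monitor checks every server, each one stays at the single-monitor rate of roughly 4.4 checks per second.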

Greetings Bas.

littlejason99 · October 16, 2019, 7:41pm

Your 4 bullet points make the assumption that the end-user is running a full NTP implementation. Most requests come from basic SNTP clients (presumably embedded / mobile devices), querying one IP, at regular intervals. Giving a server the “benefit of the doubt” that it is reachable or will be at some point in the future would have disastrous results on all these one-shot requests.

Also an overwhelming majority of the NTPD distributions running today still are using either the ‘server’ directive, or have an early implementation of the ‘pool’ directive that does not automatically re-query the hostname to fetch fresh IPs when a server becomes unresponsive.

Bas · October 16, 2019, 8:33pm

I do agree with this, but the pool should make sure a server is really bad before removing it.
As such I suggest it does a few more checks if a server seems to be faulty.
Today the check isn’t there and the server is taken out, but far worse, the NTP admins get flooded with emails telling them their system is bad.

What is worse? A 1-shot (stupid) device missing a tick?

Or a 1-shot monitor that starts mailing their NTP-server is BAD! to the people that want to help others?

My bet is the latter is worse.

Zygo · October 16, 2019, 9:07pm

But let that monitor check until there is an error.

It’s the “until” part that confuses me–it’s a conditional, which makes something that should be parallel become sequential. Why wouldn’t every monitor test every pool NTP server all the time (it’s “not a very big load as it’s only a few bytes per server” after all), determine the pool server IP’s score independently, and then send that score to the DNS service for aggregation with data from other monitor servers? Why work in series when it’s easier and better in parallel?

It looks like you’re searching through the monitors until you find one with a non-fail result, since you’re rejecting input from each monitor in turn every time that monitor detects a failure for some IP, and you stop when you get a pass from a monitor on that IP. There’s latency implied there–if an IP stops responding, you don’t know it’s really down until you’ve tried all the other monitors. If an IP is up, you don’t know how much of the world can reach it because you stopped checking. That complexity and latency and information scarcity isn’t required if you can test everything from everywhere all the time–and with only 4000 servers, we certainly can. Get the score tables from all the monitors at once, and determine that a pool server is really down–or not–immediately, based on simultaneous observations from a diverse set of networks. So the DNS server would just fetch 4000 scores from each monitor, sort the score values, and everything with at least N scores above 10.0 is in the pool for the next 15 minutes or whatever the interval is.
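To make the “at least N scores above 10.0” rule concrete, here is a small Python sketch. The 10.0 threshold is the one mentioned in the post; the function name, the monitor names, and the score tables (using documentation IP addresses) are invented for the example and are not the pool’s actual data model.

```python
from typing import Dict, List

PASS_THRESHOLD = 10.0  # score a server must exceed, as quoted in the thread


def pool_members(score_tables: Dict[str, Dict[str, float]],
                 min_passing: int) -> List[str]:
    """Return the server IPs that at least `min_passing` monitors score
    above PASS_THRESHOLD. `score_tables` maps monitor -> {ip: score}."""
    passing: Dict[str, int] = {}
    for table in score_tables.values():
        for ip, score in table.items():
            passing.setdefault(ip, 0)
            if score > PASS_THRESHOLD:
                passing[ip] += 1
    return sorted(ip for ip, count in passing.items() if count >= min_passing)


# Hypothetical score tables fetched from three monitors in one cycle.
tables = {
    "newark":    {"192.0.2.1": -12.0, "192.0.2.2": 20.0, "192.0.2.3": -40.0},
    "amsterdam": {"192.0.2.1": 19.5,  "192.0.2.2": 20.0, "192.0.2.3": -35.0},
    "third":     {"192.0.2.1": 18.0,  "192.0.2.2": 19.0, "192.0.2.3": 2.0},
}

print(pool_members(tables, min_passing=1))  # lenient: any one monitor is enough
print(pool_members(tables, min_passing=3))  # strict: every monitor must agree
```

Raising or lowering `min_passing` is the knob that trades pool size against how widely reachable each server has to be.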

The highest score is always a pass result iff any pass result exists, so for a simple binary response there’s no difference between “at least one passes” and “the 100th percentile (highest) score passes”. If you’re more concerned about pool quality (e.g. you want all servers to be reachable by at least 3 out of 8 monitors) then you’d use a lower percentile score, which is available if you’re testing every server IP from every monitor on every testing cycle.

If you’re picking the median score, individual broken monitors fall out of consideration immediately and automatically (either because they’re too positive or too negative). Monitor the monitors to see how many times their scores deviate from the average (or any other choice of statistical consensus function) and you get the quality metric for monitors.
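A sketch of both halves of that idea, in Python: the median as the consensus score per server, and a per-monitor count of how often it strays from the consensus. The function names, the `tolerance` of 5 points, and the sample scores are assumptions made for the example.

```python
from statistics import median
from typing import Dict


def consensus_scores(score_tables: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Median score per server IP across all monitors that scored it."""
    per_ip: Dict[str, list] = {}
    for table in score_tables.values():
        for ip, score in table.items():
            per_ip.setdefault(ip, []).append(score)
    return {ip: median(values) for ip, values in per_ip.items()}


def monitor_deviation(score_tables: Dict[str, Dict[str, float]],
                      tolerance: float = 5.0) -> Dict[str, int]:
    """For each monitor, count the servers where its score differs from
    the consensus by more than `tolerance` points (a hypothetical cutoff)."""
    consensus = consensus_scores(score_tables)
    return {
        monitor: sum(1 for ip, score in table.items()
                     if abs(score - consensus[ip]) > tolerance)
        for monitor, table in score_tables.items()
    }


# Hypothetical scores: one monitor disagrees with the other two on two servers.
tables = {
    "newark":    {"192.0.2.1": -20.0, "192.0.2.2": 18.0, "192.0.2.3": -15.0},
    "amsterdam": {"192.0.2.1": 19.0,  "192.0.2.2": 20.0, "192.0.2.3": 18.0},
    "third":     {"192.0.2.1": 20.0,  "192.0.2.2": 19.0, "192.0.2.3": 17.0},
}

print(consensus_scores(tables))
print(monitor_deviation(tables))
```

With only three monitors a single outlier cannot move the median, so the odd one out shows up in the deviation counts without affecting the consensus.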

Part of the point of having multiple monitors is that they don’t each need to reach 100% of the Internet–some monitors may have better connectivity to some zones than others–so it may be perfectly reasonable for some monitors to always give useless scores in some cases (e.g. a monitor inside CN vs another monitor outside CN would often disagree, so CN monitors should only be checked for consensus against other CN monitors, not US or EU ones). If you’re going to ignore out-of-zone results anyway, then maybe there’s no need to test the IPs, so in that case the monitors could be checking different IP lists based on zone–not reachability.

Zygo · October 16, 2019, 9:17pm

What is worse? A 1-shot (stupid) device missing a tick?

Or a 1-shot monitor that starts mailing their NTP-server is BAD! to the people that want to help others?

Well, the person in the second case made an informed choice to interact with the service, and has the technical skills and experience to deal with issues related to running network services, so I’d say that one is clearly better.

In the first case a random device factory configured to use the NTP pool just doesn’t work, and only the tech-savvy 1% would even begin to understand why, much less be able to fix it. That’s bad.

FWIW I just filter the NTP service alerts to a folder so I never have to see them. I run NTP servers for my own clients’ use, monitor them myself, and check the NTP pool’s status page a few times a month to make sure they’re still in there. If the NTP pool correctly detects that my servers are running and strangers benefit from them…well, that’s great. If they’re not, it’s the NTP pool’s problem to solve, not mine…

Bas · October 16, 2019, 9:33pm

Yes exactly. That is precisely what I mean.
You seem to forget that it’s UDP; as such it’s hit and miss, as UDP doesn’t have ACK packets.
TCP does, UDP does not.

Because ACK is missing you need a monitor that proves it’s there.

If you were talking about TCP packets then you are 100% right…but this is UDP.
UDP is broadcast without confirmation, missed = too bad.
TCP is broadcast with confirmation, missed = send again until it’s right.

As such, yes, if a monitor is right then a server is good, regardless of the others that missed it.

If ALL fail, then yes, you CAN PRESUME (could be still wrong!!) the server is offline.

UDP is hit or miss the target, you do not know.
TCP is hit, and if you miss you will be informed that you missed.

Ergo UDP will never be a sniper…it’s a blind man with a gun!

Bas · October 16, 2019, 9:35pm

100% wrong.
It’s clear that you do not know what UDP is compared to TCP.

Zygo · October 17, 2019, 3:47am

You seem to forget that it’s UDP; as such it’s hit and miss, as UDP doesn’t have ACK packets.

NTP clients don’t use TCP, so I don’t know why TCP keeps being mentioned. A proper NTP client will send a single packet every minute to 17 minutes and collect the last 8 responses to track the server’s time. Some clients send a burst of about 8 packets over a few seconds, either for a one-shot NTP query or to prefill the NTP PLL at startup. The client needs a reply to 4 or more of the last 8 query packets sent in order to sync, and will lose sync if 5 of 8 consecutive packets are lost.
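The “4 or more of the last 8” rule can be sketched as a sliding window in Python. This models only the rule as stated in the paragraph above, not the selection logic of any real NTP implementation; the class name and its parameters are invented for the example.

```python
from collections import deque


class ReachWindow:
    """Track whether the last `size` queries got replies."""

    def __init__(self, size: int = 8, needed: int = 4) -> None:
        self.replies = deque(maxlen=size)  # oldest entries fall off automatically
        self.needed = needed

    def record(self, got_reply: bool) -> None:
        self.replies.append(got_reply)

    def usable(self) -> bool:
        """True while at least `needed` of the remembered queries were answered."""
        return sum(self.replies) >= self.needed


window = ReachWindow()
for _ in range(8):          # a healthy server answers 8 queries in a row
    window.record(True)

for lost in range(1, 6):    # then replies stop arriving
    window.record(False)
    print(f"{lost} lost in a row -> usable: {window.usable()}")
```

In this model the server stays usable through four consecutive losses and is dropped on the fifth, which matches the 5-of-8 figure in the post.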

Thus it is correct for a monitor to report a score below 10.0 if it sends 4 packets a few minutes apart and there are no replies, based on the packet loss rates NTP clients will tolerate. As far as I can tell by playing with firewall rules and watching NTP monitor scores, 3 consecutive lost packets costs about 8 or 9 points in monitoring score, so the 4th lost packet would push the score below 10.0 as expected. No problems there (though maybe after the first 4 packets are lost, the accumulated penalties are a bit steep?).

The problem with the current Newark monitor setup is:

there’s only one monitor in the production pool and
it seems to be surrounded by the networking equivalent of a moat full of NTP-packet-eating alligators.

This means the current Newark monitor won’t work properly, no matter what packet-sending rules or scoring parameters are set for it, because it is in the wrong place to do its job.

If the monitoring station can be surrounded by a moat, it follows that a lot of NTP servers are also surrounded by moats, and they–correctly–shouldn’t be in the NTP pool until someone drains the moats. A host that has no routes around a moat network can’t meet the requirements for NTP service. Unlike TCP clients, NTP clients won’t just send more queries to compensate for packet loss–they’ll either drop the server and move on, or waste bandwidth forever throwing packets into the void without ever achieving sync.

Also, any NTP pool servers that happen to be on the same side of the moat as the monitoring stations–e.g. one of my own pool servers that is in the same Newark data center as the NTP pool monitor and has a perfect 20.0 score–will have incorrectly high scores that should be lowered to reflect reality. Nothing can fix that except monitoring NTP servers from both sides of the moats. There is no way that the Newark monitor alone can know if my pool server in the same data center is reachable by anyone outside of the data center.

If there are networks aggressively filtering out UDP packets (or NTP in particular), then sending more packets from the monitor to follow the ones that are lost is just going to make that problem even worse–monitoring gets less accurate, and the extra NTP monitoring pings will displace a handful of legitimate client packets, without providing more usable service to anyone.

Bas · October 19, 2019, 4:48pm

Not quite, I believe they use a non-UDP-aware loadbalancer/router for IPv4; as such it drops packets at random.

IPv6 doesn’t have this problem as its UDP packets have much more information.

If you run mtr on UDP for some time you will see a second EWR router pop up; if that router isn’t aware of the UDP packets being sent by the monitor it will drop any request on return.

UDP contains not much, just port and some info and the sender IP, but if you do loadbalancing the return message can hit the other router, which in this case is unaware of the port.
As such it refuses the answer = UDP packet drop.

And a bad score in the monitor.

Not saying this is the case, but it’s so random that it can’t be much else…they do loadbalancing/ip-bonding as far as I can see.

IPv6 doesn’t have this problem, and you see nobody complain about IPv6, just IPv4.

Bas.

Zygo · October 20, 2019, 5:44am

That would imply a stateful loadbalancer which isn’t likely. NTP monitors run on baremetal hardware with a public static IP. Adding a loadbalancer isn’t necessary (only ~5 pps in either direction) and would reduce the NTP timestamp measurement accuracy. Even if it was containerized, the container host would have zero trouble tracking ~300 concurrent peer associations.

There are multiple physical links used for peering, but since the monitor IP is static, the routers don’t have to recognize anything from one direction while routing in the other direction. They just need their usual routing table. There could be 10 routers or 1000, it makes no difference.

It seems more likely that one or more of the physical links are saturated and someone has decided to prioritize packets by class, with ipv4 NTP (or UDP in general) falling into some “only if the link’s idle” priority class. IPv6 would get through because all IPv6 combined so far is not enough traffic to bother dropping yet.

mtr with UDP shows no less than 8 distinct routes to monewr1.ntppool.org from various test nodes. This doesn’t happen with tcp or icmp traceroutes. mtr on UDP shows multiple paths everywhere, though…if I try various pool.ntp.org servers, some of them have 50 (!) distinct routes between the same two IPs over a five-minute period, and most of those are nowhere near the source or destination end of the route. This doesn’t seem to impair the UDP packets at either end of the connection in any way.

Bas · October 20, 2019, 10:03am

The route doesn’t matter but the last firewall/iptables, managed router etc. does.
As UDP opens temporary ports it must know about it on all ethernet ports, else the data port isn’t there and the NTP client doesn’t receive the answer the server has sent.
All routers along the way just pass the packet forward, but the last one that routes to the monitor (in this case) has to know the return port. If it doesn’t, the packet is dropped.
That is happening in my opinion and as a result people complain.
IPv6 has tackled this problem as the return port is in the packet or header; with IPv4 it is not.

We already tested that my server responds perfectly to other monitors; many dumps have been made and sent back and forth.

Something or someone must have changed something as I’m now flat-lining since yesterday.

flat-line

at the top, where it should be

We have to wait and see what they changed.

Zygo · October 20, 2019, 1:37pm

That is not how UDP works. Each UDP packet has complete src and dst specifications (including port), in both IPv6 and IPv4. No router along the path (including the last one) needs to know anything more than the IP address to deliver replies to the originating host. The originating host has a trivial workload–it has to remember about 5 UDP ports at any time, because the timeout for an acceptable NTP server is short (no more than a second, probably much less), and it’s not sharing resources with other tenants on the same hardware (per NTP pool monitor requirements).
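This is easy to see on a single machine. The Python sketch below sends one datagram over the loopback interface and has the receiver reply to whatever source address and port arrived with it; nothing in between needs to remember the client’s temporary port. The function name and payload are made up, and it uses plain UDP sockets rather than real NTP.

```python
import socket


def udp_round_trip(payload: bytes = b"ping"):
    """Send one datagram over loopback and reply to the address it came from."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        server.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        client.bind(("127.0.0.1", 0))   # the client's "temporary" source port
        server.settimeout(2.0)
        client.settimeout(2.0)
        server_addr = server.getsockname()
        client_addr = client.getsockname()

        client.sendto(payload, server_addr)
        # recvfrom returns the source (IP, port) carried in the packet itself.
        data, seen_source = server.recvfrom(1024)
        server.sendto(data, seen_source)
        reply, reply_from = client.recvfrom(1024)
        return client_addr, seen_source, server_addr, reply_from, reply
    finally:
        server.close()
        client.close()


client_addr, seen_source, server_addr, reply_from, reply = udp_round_trip()
print("client sent from", client_addr, "and the server saw source", seen_source)
print("reply came from", reply_from, "with payload", reply)
```

The source address the server sees is exactly the client’s own socket address, so the reply is addressed with information taken from the request packet alone.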

What you are describing is a NAT or stateful firewall setup more typical of a home WiFi or corporate edge router, not a baremetal hosting provider. In NAT cases the IP address is changed in transit, and to avoid collisions with other IP addresses behind the router the original port is changed as well. With ipv6 there are spare address bits that can be used to reverse the address change without using state in the router (if it’s necessary to change the address at all). With ipv4 the router must remember the original IP address and replace it while delivering the reply. None of this is relevant unless some router along the path is doing NAT–if nobody changes the src IP on the outgoing packet then the reply will have a routable dst IP. Stateful firewalls in a multipath setup that don’t share state across paths would be an obvious configuration failure that would break DNS and TCP too.

OTOH who knows, some hosting providers do have stateful firewall services that customers may opt in to–or may have to opt out of. Another good reason to aggregate multiple monitor scores is to work around upstream misconfigurations like that.

Bas · October 20, 2019, 2:28pm

Again, as I might not have explained it very well.
UDP sends a request to a server, say port 123, then the server responds to this request.
But the reply uses a temporarily opened port, and the network card/loadbalancer or whatever knows about this port and to what service it should go.
However, if you have multiple NICs it could be that the reply arrives at the other network card, but that card doesn’t know the temporarily opened port and sees something coming in without knowing what service it belongs to, so it drops the UDP packet entirely.
Makes more sense now?

More info here: https://en.wikipedia.org/wiki/UDP_hole_punching

Hopefully I’m more clear now.
I have seen tcpdumps and they are ports like 51000-51008; if one (or more) NICs/routers/loadbalancers are not aware of these ports they don’t do anything with it.

That is what I’m trying to say.
IPv6 doesn’t have this problem as the port doesn’t need to know about this, the header contains more information than IPv4.

I know they do not use NAT but there is more than a simple router, else it wouldn’t happen.
With VOIP we have this all the time and the solution is to open those temporary ports as static, then it passes to the right service all the time. If you don’t do that, one gets a connection but no audio in one direction, mostly incoming.
I see no difference with the way the NTP-monitor is handling the responses from the monitored servers.

Maybe it makes more sense now? Also we have tcpdumps and proof the UDP returns are dropped from time to time, the question is WHY?
They do not drop at the NTP-server-side, we know that already as other monitors and their tcpdumps already confirmed this.

Zygo · October 20, 2019, 5:03pm

From the first sentence on that page:

UDP hole punching is a commonly used technique employed in network address translation (NAT) applications

The NTP pool monitor does not use or require NAT. There is a single unicast packet query and a single reply to a necessarily public IP address (unlike the VOIP case where both addresses can be private IPs, NTP does not support servers that don’t have a public IP). NTP clients (including the monitor) don’t need hole punching to work.

You need to tcpdump longer. I have ports from 33250 to 60978 in the last 24 hours. Looks like fairly standard port ranges for Linux. The monitor will normally use a different outgoing source port for each query to weed out misconfigured servers that reply from a different address than the one that was sent to (or maybe because there was just never a need to bind to a specific port for NTP clients).

Fairly standard US-EU network congestion, possibly made worse by a few weak links on the US side? Packet loss is often periodic (usually tracking business hours) but there’s a lot of variation that can make patterns hard to see without long-term observation. There are times of day when our users in EU can’t reach some of our VPN gateways in US at all (95% packet loss on UDP, sometimes TCP will get through but crazy slow). It’s why we have so many VPN gateways–get a connection to any US gateway and we can take it from there, intra-US links are mostly fine.

littlejason99 · October 20, 2019, 5:24pm

Not to mention your loadbalancer/router theory would mean that all pool servers would have the same issue as you, which they don’t, because the issue is usually like Zygo said, US-EU network congestion. The further away a server is from the monitor, the more likely there are to be network issues…
