Suggestions for monitors, as Newark fails a lot and the scores are dropped too quickly

Bas · July 31, 2021, 9:46am

So far, so good. I also noticed on a Belgian time site that my jitter has dropped back to normal levels.
It used to be 10.xxx; now it’s around 2.xxx.
The pool also doesn’t show any timeouts at all; it has been a solid 20 on all 4 of my servers.
This never happened before I changed the TTL.

avij · July 31, 2021, 12:20pm

The TTL does not matter. Whatever has changed has nothing to do with your TTL changes.

Bas · July 31, 2021, 4:56pm

So why is this code in the NTP pool sources?

```perl
# if it hasn't changed for a while, cache it for longer
if (@$history && $history->[-1]->{ts} < time - 86400) {  # 86400 s = 24 hours
    $self->cache_control('maxage=28800');                  # 28800 s = 8 hours
}
```

Why would it have TTL code in there in several places?

Why is the self-cache longer than the default TTL? I’m not a programmer,
but I find it strange that it uses a TTL when, as you say, it should only be handling IPs.

Sorry, but I doubt it only works with IPs and ignores URLs.

Nothing changed on my side, and nothing changed at the ISP.
I only changed the TTL at my DNS provider from 3600 to 86400, and there are no timeouts anymore.

Maybe it’s just a bug in the code that causes this for some reason.
It may not be meant to use domain names, but perhaps it does and accesses the DNS records.

For me it seems solved.

A solid 20 on all 4 servers; before, that never lasted more than a day or two.

We can only confirm it if others try it too; if it works for them as well, then we have found the cause of the timeouts.

But it’s only been a week so far; let’s discuss it again in 4 weeks or so.

marco.davids · July 31, 2021, 9:52pm

That code seems to change the Cache-Control HTTP header of the scores page[1] when the situation has not changed for the past 24 hours or so. See the screenshot for an example.

This is not in any way related to DNS TTLs (and I agree with @avij on that), nor to the topic of this thread.

[1] The URL was showing this at the time of writing, but may no longer be showing it when you read this, hence the screenshot.
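To make the distinction concrete, the quoted Perl condition can be restated as a small Python function. This is only an illustrative sketch: the function name `cache_control_for` and the list-of-dicts shape of `history` are assumptions, and the returned string is simply what the snippet passes to `cache_control()`. The result only governs how long the page may be cached as an HTTP response; no DNS record is read or written.

```python
import time
from typing import Optional

ONE_DAY = 86400      # seconds: the age threshold in the quoted condition
EIGHT_HOURS = 28800  # seconds: the value the snippet passes to cache_control()


def cache_control_for(history: list, now: Optional[float] = None) -> Optional[str]:
    """Return the cache directive the quoted Perl branch would set,
    or None when that branch is not taken."""
    if now is None:
        now = time.time()
    # Mirrors: if (@$history && $history->[-1]->{ts} < time - 86400)
    if history and history[-1]["ts"] < now - ONE_DAY:
        return f"maxage={EIGHT_HOURS}"
    return None


now = 1_700_000_000
print(cache_control_for([{"ts": now - 2 * ONE_DAY}], now))  # maxage=28800
print(cache_control_for([{"ts": now - 3600}], now))         # None
```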

Clock · July 31, 2021, 11:42pm

I don’t see how DNS can impact the server score during monitoring.
What usually happens is that the monitoring station has the bad luck of querying the server at precisely a bad moment, when the server is handling many requests or the route is degraded.
Remember that even if a server has specific problems handling NTP requests, it will very likely stay at a score of 20 most of the time, because the monitoring station only performs one or two queries per hour. My server was also steady at 20 points yesterday, until the monitoring station made a request at a bad time.
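The thread does not spell out how the score is computed. As a rough sketch of why one badly timed query matters, the code below assumes a simple scheme in which each check multiplies the score by 0.95 and then adds +1 for a good reply or -5 for a timeout; these constants and the function names are assumptions for illustration, not values taken from the pool's code. Under those assumptions, 20 is the steady state of a run of good replies (0.95 × 20 + 1 = 20), and a single timeout drops the score straight to 14.

```python
DECAY = 0.95         # assumed: fraction of the old score kept per check
STEP_OK = 1.0        # assumed: added for a good reply
STEP_TIMEOUT = -5.0  # assumed: added for a timeout


def update_score(score: float, ok: bool) -> float:
    """Apply one monitoring result under the assumed scoring scheme."""
    return score * DECAY + (STEP_OK if ok else STEP_TIMEOUT)


def checks_to_recover(score: float, target: float) -> int:
    """Count consecutive good checks until the score rises above target."""
    if target >= STEP_OK / (1 - DECAY):
        # Good replies only approach the steady state (20), never exceed it.
        raise ValueError("target must be below the steady-state score")
    checks = 0
    while score <= target:
        score = update_score(score, ok=True)
        checks += 1
    return checks


after_timeout = update_score(20.0, ok=False)
print(after_timeout)                           # 14.0
print(checks_to_recover(after_timeout, 19.0))  # 35
```

With these assumed constants, climbing back from 14 to above 19 takes 35 consecutive good checks, so at one or two checks per hour a single timeout stays visible for roughly a day or longer.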

avij · August 1, 2021, 8:14am

If you still have doubts, you can do an experiment. Your servers seem to have ntp.heppen.be as the hostname. If you change the A records of ntp.heppen.be to some bogus IP addresses (such as 192.0.2.1, 192.0.2.2, 192.0.2.3 and 192.0.2.4) in your DNS zone file, you should notice a few things. First, the IP addresses of your servers as listed on your pool.ntp.org server page ("Bas's pool servers") remain unchanged, even if you wait a few days. Second, the amount of traffic to your servers will not change (unless you have hundreds of your own servers pointing to your own NTP servers, in which case there might be a small decrease). Third, your server scores remain unaffected. I can’t guarantee that they would stay at 20, but they certainly won’t drop to -100 immediately. If you do notice a drop in score, you will also notice that the drop is only temporary.

By now you should have learned that the A records in your DNS zone file do not matter past the initial setup, and by extension, the TTL value for those A records does not have an effect either.
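As an illustration of this point, here is a minimal SNTP client in Python that queries a server by IP address. It is not the pool monitor's code (which this thread does not show); the function names are hypothetical, and the packet layout follows the standard NTP header (RFC 5905). Because the destination is an IP literal, sending the query needs no DNS lookup at all, so the A records and their TTL never come into play.

```python
import socket
import struct

NTP_PORT = 123
NTP_TO_UNIX = 2208988800  # seconds between 1900-01-01 (NTP era 0) and 1970-01-01


def build_request() -> bytes:
    """48-byte client request: LI=0, VN=4, Mode=3 packed into 0x23."""
    return bytes([0x23]) + bytes(47)


def transmit_time(reply: bytes) -> float:
    """Unix time from the Transmit Timestamp field (bytes 40-47)."""
    seconds, fraction = struct.unpack("!II", reply[40:48])
    return seconds - NTP_TO_UNIX + fraction / 2**32


def query(ip: str, timeout: float = 2.0) -> float:
    """Send one request straight to an IP address and return the server time.

    The destination is an IP literal, so no DNS lookup is involved.
    """
    family = socket.AF_INET6 if ":" in ip else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_request(), (ip, NTP_PORT))
        reply, _addr = sock.recvfrom(512)
    if len(reply) < 48:
        raise ValueError("short NTP reply")
    return transmit_time(reply)
```

Called with a server's address, `query()` returns the server's transmit time as Unix seconds; if nothing answers within `timeout` seconds, it raises a timeout error (an `OSError` subclass).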

If you are concerned that changing the IP addresses to bogus values would break the NTP configuration of some of your own servers using ntp.heppen.be as the time source, you can also change the A records to some other nearby pool servers, so that querying ntp.heppen.be would give a proper response.

As for why your scores are now 20: There are a few different ISPs between your servers and the monitoring server. Maybe they have changed something, or the routing is done through different ISPs.

I’ve operated pool servers for 14 years and I think I’ve gathered some information on how things work under the hood during this time.

Bas · August 1, 2021, 2:54pm

I have changed the DNS entry for one Stratum 2 server in my pool to 192.0.2.1,
so the others still work as they should.
Let’s see what happens with the monitor.

BTW, all 4 servers are in different countries and locations, so my ISP changing something would affect only 1 server, not all 4.
And it did happen on all 4 servers, always on IPv4 but never on IPv6.

Let’s see. I also changed the TTL to 3600 for that server.

I tried to add it again, and it does test the IPs:

ntp.heppen.be
82.161.251.125
82.161.251.125 is already registered in the system.
77.109.90.72
77.109.90.72 is already registered in the system.
192.0.2.1
Bad IP address: TEST-NET
176.9.206.139
176.9.206.139 is already registered in the system.

So it does check DNS every time a server is added. It shouldn’t need to, as the hostname and IPs were already known, but it does.
It seems to do a real-time DNS lookup that bypasses the DNS cache, which is strange.
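The messages above suggest a check along these lines when a hostname is (re-)added: resolve the name, then test each address. The Python sketch below is hypothetical (the function names, the `registered` set and the "accepted" result are invented); only the two messages are copied from the output above, and the TEST-NET ranges are the IPv4 documentation blocks from RFC 5737. A check like this runs only when a server is added, so it says nothing about how the monitor reaches the server afterwards.

```python
import ipaddress
import socket

# IPv4 documentation ranges (TEST-NET-1/2/3) reserved by RFC 5737.
TEST_NETS = [ipaddress.ip_network(net) for net in
             ("192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24")]


def resolve(hostname: str) -> list:
    """Look up all addresses for hostname via the system resolver.

    Whether the answer comes from a cache depends on the local resolver
    and on the TTL of the records; this code does not bypass any cache.
    """
    infos = socket.getaddrinfo(hostname, None, type=socket.SOCK_DGRAM)
    return sorted({info[4][0] for info in infos})


def check_address(addr: str, registered: set) -> str:
    """Classify one resolved address (hypothetical rules)."""
    ip = ipaddress.ip_address(addr)
    if any(ip in net for net in TEST_NETS):
        return "Bad IP address: TEST-NET"
    if str(ip) in registered:
        return f"{ip} is already registered in the system."
    return "accepted"


def check_hostname(hostname: str, registered: set) -> dict:
    """Resolve hostname once and check every address it returns."""
    return {addr: check_address(addr, registered) for addr in resolve(hostname)}
```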

avij · August 1, 2021, 3:18pm

How did you deduce it did a realtime DNS check, bypassing the DNS cache? Mind you, the last time the pool management did a DNS lookup for ntp.heppen.be was when you added your servers to the pool, which was several days ago. No wonder the TTL had expired.
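The TTL arithmetic behind this is simple: a resolver may reuse a cached answer only until the TTL runs out, and both values used in this thread (3600 s, one hour, and 86400 s, one day) are shorter than "several days". In the sketch below, `cache_expired` is a hypothetical helper and the 4-day gap is an assumed stand-in for "several days".

```python
DAY = 86400  # seconds


def cache_expired(cached_at: int, ttl: int, now: int) -> bool:
    """True once a cached DNS answer is older than its TTL."""
    return now - cached_at >= ttl


servers_added_at = 0  # time of the original lookup (t = 0)
readded_at = 4 * DAY  # assumed: "several days" later
for ttl in (3600, 86400):
    print(ttl, cache_expired(servers_added_at, ttl, readded_at))
# 3600 True
# 86400 True
```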

Yes, I’m aware that the management page does a DNS lookup when you (re-)add a server. This is also the method that can be used when a server gets an additional IP address, like an IPv6 address in addition to the IPv4 address.

I’m worried that you changed the TTL. By changing more than one variable (rather than just the A records, as I instructed), you can no longer be certain whether any change in your servers’ scores is a result of the TTL change or of the A record change. But suit yourself.

Zygo · August 1, 2021, 3:42pm

The change was at the monitor end. Many routes to the Newark monitor that used to go through Zayo six months ago go through Twelve99 today. I had NTP pool nodes that were continually failing at Zayo. The same nodes have been running with steady scores of 20 on Twelve99.

I haven’t been tracking this closely, so I don’t know the precise date of the change, but it probably wasn’t the only peering change in recent weeks. Throw a dart at a calendar, and it’ll always land on a date where peering changed for someone.

A single monitor can never reach the entire internet, so after each change, new complaints pop up from those who are no longer reachable via the new monitor routing.

marco.davids · August 1, 2021, 4:07pm

What you did is not quite what @avij suggested:

You introduced 192.0.2.1 in the DNS, without changing any other addresses.
You also changed the TTL, thereby introducing a second variable.
For your information: having different TTLs within an RRset violates RFC 2181, but that aside.
After this, you tried to add ntp.heppen.be to the pool again. This is basically where you ruined the experiment.

If you add it again, there will naturally be a DNS lookup for ntp.heppen.be.

But the idea of the experiment was not to prove that, but rather to prove that (the DNS of) heppen.be is no longer involved (for monitoring) once a server is added to the pool.

The addition was refused, by the way, because 192.0.2.1 is an invalid IP address. So perhaps @avij should have suggested other addresses to really get the experiment right.
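RFC 2181 (section 5.2) requires all records in an RRset, that is, all records with the same name, class and type, to share one TTL. A small Python check for that could look like the sketch below. The function name and tuple layout are hypothetical, the class is assumed to be IN throughout, and the example records reconstruct the situation described in this thread (three A records at 86400 and the 192.0.2.1 record at 3600) using the addresses from Bas's output.

```python
from collections import defaultdict

# Each record: (owner name, type, TTL in seconds, data); class IN assumed.


def mixed_ttl_rrsets(records: list) -> dict:
    """Return the RRsets whose records do not all share the same TTL."""
    ttls = defaultdict(set)
    for name, rtype, ttl, _data in records:
        # Owner names compare case-insensitively; ignore a trailing dot.
        ttls[(name.lower().rstrip("."), rtype.upper())].add(ttl)
    return {key: sorted(values) for key, values in ttls.items() if len(values) > 1}


zone = [
    ("ntp.heppen.be.", "A", 86400, "82.161.251.125"),
    ("ntp.heppen.be.", "A", 86400, "77.109.90.72"),
    ("ntp.heppen.be.", "A", 3600, "192.0.2.1"),
    ("ntp.heppen.be.", "A", 86400, "176.9.206.139"),
]
print(mixed_ttl_rrsets(zone))  # {('ntp.heppen.be', 'A'): [3600, 86400]}
```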

avij · August 1, 2021, 4:12pm

Well, in any case, we will get some useful data even with this setup. Let’s not complicate the situation any further. No changes on the pool.ntp.org statistics page for 5.196.189.119 yet, just as I expected.

Bas · August 2, 2021, 9:17am

Yep, you are right. At least we know for sure that DNS has nothing to do with it.
