
Unless the clock is very unstable, ntpd should normally reach the maximum poll interval and stay there.
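If you want to verify that, a quick check with classic ntpd (assuming the ntpq tool is installed) is:

```
# The "poll" column shows the current poll interval per source (in seconds in
# most versions); on a stabilized ntpd it should sit at the configured maximum
# (1024 by default).
ntpq -p
```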

ntpdate by default sends 4 requests per server, so running it once per hour would still generate slightly more NTP traffic than an already stabilized ntpd.
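To put rough numbers on it: hourly ntpdate at its default of 4 requests is 4 packets per server per hour, whereas ntpd at its default maxpoll of 10 (1024 s) sends about 3600/1024 ≈ 3.5 packets per server per hour once it has settled.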

minpoll and maxpoll can be increased to reduce the packet rate.
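For example, something like the following in ntp.conf (server name and values purely illustrative) keeps polling between roughly 17 and 68 minutes:

```
# Illustrative only: minpoll/maxpoll are log2 seconds, so 10 = 1024 s, 12 = 4096 s
server 0.pool.ntp.org iburst minpoll 10 maxpoll 12
```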

Thanks for sharing some background information. Some thoughts in response; whether any of them are helpful will obviously depend on further details of your environment.

ntpd can also step the clock if needed, and has some options to tweak related aspects. In the worst case, if your systems use systemd, you could deploy an override to automatically restart ntpd should it quit because its “panic threshold” was exceeded, stepping the time upon restart as a last resort.
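As a rough sketch, and assuming the unit is called ntpd.service on your distribution (names and paths vary), a drop-in like this would bring ntpd back up after it bails out; with the usual -g option (often already part of the distribution’s default daemon options) the restarted ntpd may then step the clock once, even beyond the panic threshold:

```
# /etc/systemd/system/ntpd.service.d/restart.conf  (unit name/path illustrative)
[Service]
Restart=on-failure
RestartSec=60
```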

On a Linux system, you could use SNAT to randomize source ports on outgoing client packets. I used that with good effect to make up for ntpd’s current lack of support for port randomization.
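For illustration, and not necessarily the exact rule I used (interface name and port range are placeholders), a MASQUERADE rule along these lines rewrites the source port of outgoing client packets:

```
# Randomize the source port of outgoing NTP client traffic;
# -o eth0 and the port range are placeholders, adapt to your setup.
# Use --random on older kernels that lack --random-fully.
iptables -t nat -A POSTROUTING -o eth0 -p udp --dport 123 \
    -j MASQUERADE --to-ports 49152-65535 --random-fully
```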

Or use chronyd, which does said randomization by default (but can also be told to use a static source port).
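In chrony.conf, the relevant knob is the acquisitionport directive: at its default of 0 the OS picks ephemeral (randomized) source ports, while a non-zero value pins a fixed port, e.g.:

```
# chrony.conf: use one fixed client source port instead of random ephemeral ports
# (port number purely illustrative; omit the directive to keep the default behaviour)
acquisitionport 1123
```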

In case it got lost in the thread, I’d like to highlight that @davehart’s suggestion to set up your own ntpd (or chronyd) instance, and have your clients sync to that one rather than the pool directly, was not only about load or accuracy. Rather, it would also shield your clients from the effects of some external server having issues, should you decide to stick with ntpdate.

For one, such a server would in effect “filter” those effects, applying more sophisticated vetting and selection mechanisms than ntpdate does.

For another, when something happens, you can more easily trace it. Rather than having to work out which random, anonymous server on the Internet a particular client of yours was querying at some point in the past, and what that server was doing at the time that made your client go off the rails, you will know that your clients queried only your server (or servers, for redundancy), and you can much more easily troubleshoot that server’s (or servers’) behavior.
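As a minimal sketch of such a setup (hostnames and network ranges are placeholders to adapt), the internal server could run chronyd with something like the following, and clients would then point only at it:

```
# --- chrony.conf on the internal "buffer" server (values illustrative) ---
pool pool.ntp.org iburst        # upstream sources; or a few hand-picked servers
allow 192.168.0.0/16            # networks allowed to query this server

# --- chrony.conf on each client ---
server ntp.internal.example iburst
```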


May I invite you to review especially the first item in the “Additional Notes” section on the “How to use the pool” page, as well as the Terms of Service, especially section 3, subsection (c).

The pool has a monitoring system in place to weed out bad servers, and is currently working on improving that system. But there is only so much that can be done at this scale. At the end of the day, users need to evaluate whether the pool is suitable for their needs, and either have mechanisms in place to deal with potential issues, or use sources better suited to those needs, e.g. a few hand-picked, reputable public servers. Or they can run their own upstream NTP server(s), with a wide range of options as to what that could look like: from merely using a self-operated server, or set of servers, as a “buffer” to shield downstream clients from the external world, all the way to running one’s own stratum 1 server (itself spanning a wide range of possible setups).

E.g., when a few pool servers reporting bad time leads to a widespread outage on your side, maybe the pool is not the right time source for your systems.

Or your infrastructure is not up to your requirements. You write that you have “typical ntp infra (which would have picked up the best time from across all reporting ntp instances and hosts)”. But the issues you report seem to contradict that statement.

As discussed above, picking up “the best time” is but one part of what needs to be done for robust timekeeping. Another part is to properly vet the time received from upstream sources, e.g., to check whether it is consistent and makes sense, and to refuse to update the clock if it is not, certainly when the consequence of bad time is a widespread outage.
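If chronyd ends up doing that vetting for you, its maxchange directive is one way to express such a sanity limit (values below are just an example; ntpd’s panic threshold serves a similar purpose):

```
# chrony.conf: allow the first update to be arbitrarily large, then refuse any
# measured offset above 100 s; after 3 such refused offsets, chronyd gives up and exits
maxchange 100 1 3
```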

And while it is not completely out of the question that multiple servers report bad time simultaneously, and similar enough bad time to form a consensus, a sensible NTP client would simply refuse to make clock corrections as big as you describe, even when received from multiple sources, unless the bad time is served consistently over an extended period rather than as a one-off or only briefly.

So I’d invite you to reassess whether your infrastructure as it stands, getting time from the pool, is suitable for your needs, and to adjust accordingly if not. This thread contains some suggestions for what that could look like, and there is more elsewhere in this forum. Or ask for advice.


I’ll add that when I’ve monitored popular “well known” servers, I’ve seen many problems: dropped packets, poor leap second handling, less than amazing accuracy (perhaps from being overloaded), and so on.

If I were setting up a system where I didn’t have “stratum 1” type equipment and extensive monitoring, I’d follow the suggestion from @davehart (and @MagicNTP) of having a few carefully monitored servers be “the time servers” (syncing from the NTP Pool and perhaps other sources), and then have the rest of the infrastructure sync from there.

Leap second errors (incorrectly setting leap second indicator to 1) have been uncommon in recent years. There are a couple of DCF-sync’d NTP servers in Germany that set leap=1 at the end of some months.

It’s difficult to separate network effects (packet delay and loss) from NTP server effects (time accuracy). Both affect time transfer, but the former depend on the client ↔ server path. The new NTP Pool monitoring system helps distinguish the two.

The network can cause partial, or total, losses for any duration. Path delay might normally run ~50 msec, but can spike to hundreds of msec.

Generally the well-known public NTP servers (including those in the pool) are either moderately accurate, or are consistently falsetickers. The latter are detected by the NTP Pool monitor. There is unfortunately a middle ground: NTP servers that are sometimes wrong. It isn’t unusual to see an NTP server temporarily deliver inaccurate results after a restart. Such an error isn’t readily detected by the NTP Pool monitors.

Even the best NTP servers sometimes send the wrong time. This includes national timing labs.
I echo the others’ suggestions, especially about using multiple NTP servers. Servers should be chosen so as to avoid simultaneous errors, such as GPS rollover.
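As an illustration of that kind of diversity (hostnames below are placeholders for servers you actually trust), mixing sources that derive time via different technologies and operators reduces the chance of a common-mode failure:

```
# ntp.conf sketch; all hostnames are placeholders
server ntp1.lab.example iburst     # e.g. a national timing lab
server ntp2.gnss.example iburst    # e.g. a GNSS-disciplined server run by another operator
server ntp3.dcf.example iburst     # e.g. a long-wave (DCF77) derived source
pool pool.ntp.org iburst           # plus pool servers for additional redundancy
```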
