Currently, the pool monitoring and scoring system's handling of time offset for an otherwise functional server boils down to a magic number: 125 milliseconds. A server persistently below this threshold will be included in the pool; a server persistently above it will drop out.
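To make the "persistently below/above" part concrete, here is a simplified sketch of a step-based score with exponential decay, roughly in the spirit of the pool's scoring. The constants (step sizes, decay factor, cap) are my assumptions for illustration, not necessarily the production values:

```python
# Illustrative sketch of a decaying health score with per-check steps.
# Constants here are assumptions, not the pool's actual values.

MAX_OFFSET = 0.125  # seconds; the "magic number" under discussion

def update_score(score: float, offset: float, reachable: bool = True) -> float:
    """Blend the previous score toward a per-check step value."""
    if reachable and abs(offset) < MAX_OFFSET:
        step = 1.0    # good check
    else:
        step = -4.0   # offset too large (or unreachable): heavy penalty
    return min(20.0, score * 0.95 + step)

# A server persistently above the threshold sinks quickly and drops out;
# persistently below it, the score climbs back toward the cap.
score = 15.0
for _ in range(30):
    score = update_score(score, offset=0.150)  # 150 ms: always a "bad" check
print(score < 0)  # True
```

The point of the decay is that a single bad measurement does not eject a server; only a sustained pattern does.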
In a recent pull request for the ntp pool code, @ask mentioned that it might be an opportunity to tighten this requirement.
So: What are your opinions on what the maximum allowed constant offset should be?
Some food for thought:
As mentioned above, the current value is 125ms. I was not able to trace back where this number came from - if anyone knows, that would be interesting.
I did some basic data science on an hour of monitoring data:
more than 90% of the monitored offsets were below 10ms
more than 95% of the monitored offsets were below 20ms
more than 98% of the monitored offsets were below 50ms
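The percentages above come from a computation along these lines (the sample offsets below are made up for illustration; in practice the input would be the full dump):

```python
# Fraction of measured offsets below a few thresholds. The sample data
# here is invented; real input would be the hour of monitoring data.
def fraction_below(offsets, threshold_s):
    """Share of measurements with |offset| under threshold_s seconds."""
    return sum(1 for o in offsets if abs(o) < threshold_s) / len(offsets)

sample = [0.002, -0.004, 0.011, 0.0005, -0.018, 0.047, 0.130, 0.003]

for ms in (10, 20, 50, 125):
    print(f"below {ms:>3} ms: {fraction_below(sample, ms / 1000):.0%}")
```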
I also read through this post that had pretty much the same question in it:
What level of accuracy should users expect from sources like time.windows.com, time.apple.com, or even our beloved NTP pool?
but I would summarize the answers there with “it depends”.
My personal take: if my servers showed offsets beyond 20ms, I'd investigate. Allowing some leeway for long routes and networking strangeness, I'd say a server in the pool should be able to keep its offset below 50ms.
The current generous allowance was from when there was just one monitor. I think we can have more confidence in the offsets now with more monitors.
@Sebhoster The system doesn’t directly track “active” monitors in the monitoring logs, but if you are willing to do more analytics I’m happy to help do what I can to make sure you have enough data.
@Kets_One longer round trip times don't necessarily mean the time offsets are higher (as you probably know!)
To clarify: if the ntp servers that I operate in the pool showed offsets beyond 20ms in the pool monitoring, I would investigate, since I know that this is outside the range they usually achieve.
The dataset was about 180,000 measurements. This fits expectations: the status page logs somewhere around 3,000 checks per minute; times 60 minutes, that comes to about 180,000.
I dumped one hour of monitoring offset measurements from the pool monitoring database, filtered out “null”, filtered out any measurements with an error message, and filtered out the monitor_id that is reserved for the overall score since that monitor does not do measurements itself. I’d consider it kind of representative, since it contains a lot of data from all monitors and all servers and the point in time was chosen at random. However, it is biased towards lower offsets since the monitors do more measurements for servers that they are “active” for, which is determined by better scores, which is influenced by lower offsets.
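The filtering steps amount to something like the following sketch, run over rows shaped like the monitoring log (the field names and the reserved monitor id are my assumptions about the schema):

```python
# Sketch of the filtering described above. Field names and the reserved
# monitor id are assumptions, not the actual database schema.
OVERALL_SCORE_MONITOR_ID = 1  # hypothetical id reserved for the overall score

rows = [
    {"monitor_id": 3, "offset": 0.004, "error": None},
    {"monitor_id": 5, "offset": None,  "error": None},          # no measurement
    {"monitor_id": 7, "offset": 0.200, "error": "i/o timeout"}, # failed check
    {"monitor_id": 1, "offset": 0.003, "error": None},          # overall-score row
]

usable = [
    r for r in rows
    if r["offset"] is not None
    and not r["error"]
    and r["monitor_id"] != OVERALL_SCORE_MONITOR_ID
]
print(len(usable))  # 1: only the first row survives all three filters
```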
I have not looked at ipv4 vs ipv6, or any other possible correlation. That is an interesting idea, but in my opinion not relevant for the question at hand since we will have both address types in the pool for the foreseeable future.
Thanks @Sebhoster. @ask Of course you are totally right. My mistake.
In that case I expect the offset to be much lower than 125ms for the servers under my control (extremes are within a bandwidth of +4ms to -4ms of GPS time). What is a good way of finding this out and monitoring it for myself?
The pool monitoring already does this for you. Just look at the page for your server under https://manage.ntppool.org . The graph there shows you the offset measurements from the different monitors. Additionally, at the bottom of the page there is a link to the “csv protocol”, which is basically an API where you can easily access the recent monitoring data from all monitors for this server.
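Once you have a csv response, checking your own worst recent offset is a few lines. The sample below is made up to illustrate the idea; the real column names may differ, so check the actual csv link on your server's page:

```python
import csv
import io

# Parse a (made-up) response in the style of the "csv protocol" linked
# at the bottom of a server's page on manage.ntppool.org. Real column
# names may differ from this sample.
sample_csv = """\
ts,offset,score,monitor_id
2024-05-01T12:00:00Z,0.0031,20.0,3
2024-05-01T12:10:00Z,-0.0044,20.0,5
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
worst = max(abs(float(r["offset"])) for r in rows)
print(f"worst recent offset: {worst * 1000:.1f} ms")  # 4.4 ms
```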
Ah thanks, that's what I already figured.
To come back to my first post in the conversation above: that's what I mistakenly referred to as “ping”; I meant offset, of course.
This has been thought out. First of all, the monitor servers should sync their clock from reliable sources, like from GPS or other good NTP servers. There’s also a safeguard mechanism in the monitoring software that periodically checks the time from a dozen (hardcoded?) well-known NTP servers. If there is a large enough time difference (>10ms) between the monitor server and several reference NTP servers, the monitor won’t check any pool NTP servers until the clock is in sync again.
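The safeguard described above can be sketched roughly like this. The 10ms limit is from the description; the "how many references must disagree" count and the function shape are my simplifications, not the monitoring software's actual logic:

```python
# Simplified sketch of the monitor's sanity check: if the local clock
# disagrees with several well-known reference NTP servers, pause pool
# checks. The agreement count is an assumption for illustration.
SANITY_LIMIT = 0.010  # 10 ms, per the description above
MIN_DISAGREEING = 3   # how many references must disagree before pausing

def should_pause(reference_offsets):
    """reference_offsets: local-clock offsets (s) measured against references."""
    bad = [o for o in reference_offsets if abs(o) > SANITY_LIMIT]
    return len(bad) >= MIN_DISAGREEING

# Local clock ~12 ms off according to most references: stop monitoring
# until the clock is back in sync.
print(should_pause([0.012, 0.013, 0.011, 0.002]))  # True
```

Requiring several references to agree protects against a single bad reference server falsely halting the monitor.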
To add to what @avij said, the quality of the time the monitors base their comparisons on is also vetted by screening potential monitor operators: they have to provide pool server(s) for some time first. (This part is speculation on my part.)
I think somewhere around 100 ms is probably appropriate. I see the role of the Pool as being “better than nothing”, or rather better than not using NTP (or similar) at all. Without NTP, a human administrator might set a system clock by hand from a time signal, and they can probably get within about 100 ms like that. So if an NTP server is better than a human sitting at a keyboard, I think it’s good enough for the Pool.
With increased computing demands (often) come increased timing demands.
In my opinion we should aim somewhat lower than 100ms, since that would hardly be an improvement over the current 125ms.
50 ms sounds really good to me. Having a very small number of servers sometimes drop out for a while because their offset is too bad sounds like a sensible price to pay for protecting pool users from unexpectedly high server offsets. If a server has offsets that are worse than more than 98 % of all measured offsets for a long enough time to actually drop out of the pool, the operator should really take a look at that.
What’s a reasonable worst-case scenario? What if, for example, a small country with some NTP servers and no monitors has a bad fiber cut and their international connectivity gets congested or is routed strangely?
Many of the monitors might show significant latency and offsets.
(Stratum 2 servers in the country using foreign upstream servers could also have significantly bad time.)
We could mitigate that, at the expense of making it more complicated to figure out what is going on, by for example tying the maximum allowed offset to the median (or maximum) latency of the active monitors, or something like that.
(Which monitors are “active” is in part based on their latency to the server.)
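One way the latency-relative idea could look, as a sketch: take a fixed floor and let the limit grow with the latency the active monitors actually observe toward the server. The floor value and the scaling factor are purely illustrative, not a proposal for specific constants:

```python
import statistics

# Sketch of a latency-relative offset limit: under normal connectivity
# a fixed floor applies; when routes to the server are congested or
# strange, the limit scales with observed latency. Constants are
# illustrative only.
BASE_LIMIT = 0.050  # 50 ms floor, one of the values discussed above

def allowed_offset(active_monitor_rtts):
    """Allow the larger of the base limit and half the median RTT (seconds)."""
    median_rtt = statistics.median(active_monitor_rtts)
    return max(BASE_LIMIT, median_rtt / 2)

# Normal connectivity: the 50 ms floor applies.
print(allowed_offset([0.020, 0.030, 0.040]))  # 0.05
# Fiber cut, congested international routes: the limit relaxes.
print(allowed_offset([0.180, 0.250, 0.400]))  # 0.125
```

Using the median rather than the maximum keeps one outlier monitor from relaxing the limit for everyone.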