More precise (sensible, sensitive) server monitoring score

I am wondering whether this retry logic still makes sense.
With the new multi-location monitoring system in place, servers get very good scores in general. We may want more precise measurement, even knowing about a single packet loss and not smoothing the offset value. So I suggest changing the line:

cfg.Samples = 3

to be:

cfg.Samples = 1

in monitor/client/monitor/monitor.go (main branch of the ntppool/monitor repository on GitHub)


servers get very good scores in general

Always a good reminder that NA/EU networking is pretty good, but outside of that, fluctuations will happen.

Maybe remove retries once more global PoPs exist.


Rather, there are fluctuations as soon as packets hop onto an undersea cable, even between the EU and NA.

The reason being that different probes may go through wildly different, long latency routes at different times.


I agree with @NTPman that more precise monitoring would be better. Under the assumption that the monitors represent the pool clients, any packet loss should be taken into account. Even if the cause of the packet drops is outside a server operator’s influence, it still potentially impacts clients.

To account for sporadic packet drops and the resiliency of NTP clients against such drops, the point penalty for a network timeout could be decreased if the decision is made not to retry unanswered queries.

Which leads to the more general question: how harshly should the monitoring punish packet drops?

  • After how many consecutive unanswered queries should a server be considered offline and dropped from the pool? (currently: 6, in 2 bursts of 3 packets each)
  • How many packet drops should be allowed on average until a server is not considered reliable enough for the pool? (currently: up to 2/3 of packets can be lost without consequence…)

Let me revive this old topic (the issue is painful to me).

My server has a score of 20 from 95 monitors, and a score below 20 from only 17 monitors.

A score of 20 from a particular monitor means there was no timeout over many, many monitoring runs.

I do not think Internet quality has improved so much since the introduction of the new monitoring system that NTP packet loss has become such a rare event.

I think the current version of the monitoring system hides valuable data.

In one particular run from one monitor to one server, multiple probes are sent (3 at the moment), and if any probe succeeds, the full set of samples is considered a success.

What data is hidden, or lost? The distinction between two monitors: one where all three probes always succeed, and another where only some of the three probes succeed for a given NTP server. Both monitors give a score of 20 in the long run, but that is unfair.
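A minimal sketch of that semantics in Go (the names are hypothetical, not the actual ntppool/monitor code):

package main

import "fmt"

// batchOK mirrors today's rule: a run counts as a success
// if any one of its probes got a reply.
func batchOK(replies []bool) bool {
    for _, ok := range replies {
        if ok {
            return true
        }
    }
    return false
}

func main() {
    perfect := []bool{true, true, true} // no loss
    lossy := []bool{true, false, false} // two of three replies lost
    fmt.Println(batchOK(perfect), batchOK(lossy)) // both print: true
}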

Up to this point it may sound theoretical, so let’s take an example from the real world.

I selected an NTP server that is reachable, but not perfectly, from my test monitor (frlys1-355n9ds) in the beta system: 111.198.57.33.

tumbleweed:~ # tcpdump -nn -r ntp1.pcap | grep -E '^(06:5|07:0).*111.198.57.33'
reading from file ntp1.pcap, link-type EN10MB (Ethernet), snapshot length 262144
06:51:20.083038 IP 192.168.1.2.53697 > 111.198.57.33.123: NTPv4, Client, length 48
06:51:20.223770 IP 111.198.57.33.123 > 192.168.1.2.53697: NTPv4, Server, length 48
06:51:22.224914 IP 192.168.1.2.37661 > 111.198.57.33.123: NTPv4, Client, length 48
06:51:22.483003 IP 111.198.57.33.123 > 192.168.1.2.37661: NTPv4, Server, length 48
06:51:24.484301 IP 192.168.1.2.60743 > 111.198.57.33.123: NTPv4, Client, length 48
06:51:24.730062 IP 111.198.57.33.123 > 192.168.1.2.60743: NTPv4, Server, length 48
06:55:34.676930 IP 192.168.1.2.38994 > 111.198.57.33.123: NTPv4, Client, length 48
06:55:34.886060 IP 111.198.57.33.123 > 192.168.1.2.38994: NTPv4, Server, length 48
06:55:36.887033 IP 192.168.1.2.34582 > 111.198.57.33.123: NTPv4, Client, length 48
06:55:37.044173 IP 111.198.57.33.123 > 192.168.1.2.34582: NTPv4, Server, length 48
06:55:39.045276 IP 192.168.1.2.43220 > 111.198.57.33.123: NTPv4, Client, length 48
06:55:39.240575 IP 111.198.57.33.123 > 192.168.1.2.43220: NTPv4, Server, length 48
06:59:42.921494 IP 192.168.1.2.51422 > 111.198.57.33.123: NTPv4, Client, length 48
06:59:43.068721 IP 111.198.57.33.123 > 192.168.1.2.51422: NTPv4, Server, length 48
06:59:45.069648 IP 192.168.1.2.45790 > 111.198.57.33.123: NTPv4, Client, length 48
06:59:50.072415 IP 192.168.1.2.56603 > 111.198.57.33.123: NTPv4, Client, length 48
06:59:50.219855 IP 111.198.57.33.123 > 192.168.1.2.56603: NTPv4, Server, length 48
07:04:14.799439 IP 192.168.1.2.51640 > 111.198.57.33.123: NTPv4, Client, length 48
07:04:14.950563 IP 111.198.57.33.123 > 192.168.1.2.51640: NTPv4, Server, length 48
07:04:16.951960 IP 192.168.1.2.54686 > 111.198.57.33.123: NTPv4, Client, length 48
07:04:17.165959 IP 111.198.57.33.123 > 192.168.1.2.54686: NTPv4, Server, length 48
07:04:19.167400 IP 192.168.1.2.38934 > 111.198.57.33.123: NTPv4, Client, length 48
07:04:19.383559 IP 111.198.57.33.123 > 192.168.1.2.38934: NTPv4, Server, length 48
07:08:21.774295 IP 192.168.1.2.55438 > 111.198.57.33.123: NTPv4, Client, length 48
07:08:21.931311 IP 111.198.57.33.123 > 192.168.1.2.55438: NTPv4, Server, length 48
07:08:23.931737 IP 192.168.1.2.40330 > 111.198.57.33.123: NTPv4, Client, length 48
07:08:24.082025 IP 111.198.57.33.123 > 192.168.1.2.40330: NTPv4, Server, length 48
07:08:26.082575 IP 192.168.1.2.33998 > 111.198.57.33.123: NTPv4, Client, length 48
07:08:26.294138 IP 111.198.57.33.123 > 192.168.1.2.33998: NTPv4, Server, length 48
tumbleweed:~ # 

and

tumbleweed:~ # curl -s 'https://web.beta.grundclock.com/scores/111.198.57.33/log?limit=200&monitor=frlys1-355n9ds' | grep -E ' (06:5|07:0)'
1765696106,2025-12-14 07:08:26,0.006772629,1,19.999845505,128,frlys1-355n9ds,150.288,,
1765695859,2025-12-14 07:04:19,-0.002198036,1,19.999837875,128,frlys1-355n9ds,151.12,,
1765695590,2025-12-14 06:59:50,-0.000376309,1,19.999828339,128,frlys1-355n9ds,147.315,,
1765695339,2025-12-14 06:55:39,0.003695062,1,19.999820709,128,frlys1-355n9ds,157.169,,
1765695085,2025-12-14 06:51:25,0.002075839,1,19.999811172,128,frlys1-355n9ds,140.879,,
tumbleweed:~ # 

The sample at 06:59:50 is considered good. However, in the packet capture you can see that the reply to the second (middle) probe is lost. (The default packet spacing is two seconds, plus three seconds waiting for the reply packet, which accounts for the 5-second spacing to the next probe: 06:59:50.072415 - 06:59:45.069648 = 5 seconds + 0.003 sec processing time.)

The score of the NTP server 111.198.57.33 is 20 from the monitor frlys1-355n9ds in the beta system, when it shouldn’t be.

I suggest the following change in the monitoring code: make the number of probes a run-time configurable parameter. The code should run properly both when this parameter equals three (as today) and when it equals one.

Then, as a next step, the production monitors would use the parameter value 3 (not affecting production), and the beta monitors would use the value 1 (to gain experience in the beta/test environment).
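A minimal sketch of what such a run-time parameter could look like, assuming a hypothetical MONITOR_SAMPLES environment variable (the actual configuration mechanism in ntppool/monitor may well differ):

package main

import (
    "fmt"
    "os"
    "strconv"
)

// sampleCount returns the number of probes per batch: 3 by default
// (today's behavior), overridable to 1 for the beta experiment.
// MONITOR_SAMPLES is a hypothetical variable name.
func sampleCount() int {
    if v := os.Getenv("MONITOR_SAMPLES"); v != "" {
        if n, err := strconv.Atoi(v); err == nil && n >= 1 {
            return n
        }
    }
    return 3
}

func main() {
    fmt.Println("probes per batch:", sampleCount())
}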

I think the parameter can already be configured at run time, i.e., it does not require changes to the client binaries.

But you’re still ignoring my concern from the other topic: by changing the number of queries to one, we would not be able to detect whether a server applies overly aggressive rate limiting. For example, chrony on Rocky Linux 9 with the iburst option seems to send five queries at a two-second interval at startup. Besides, our great leader Ask said: “One reason we do multiple queries is to detect servers with overly aggressive rate limits.”

I’d still rather not touch the current scoring algorithm or the number of queries sent in a batch. But if more data is desired, the monitors could report to the pool monitor management server the number of queries sent, the number of good responses received, and the number of error/timeout responses. This would be only for collecting more data, without touching the scoring at all yet.
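A rough sketch of the extra per-batch statistics this would mean reporting (the field names are made up for illustration; the real reporting protocol would differ):

package main

import "fmt"

// BatchReport carries per-batch counters alongside the existing
// result, without feeding into the scoring yet.
type BatchReport struct {
    Sent     int // queries sent in the batch
    Good     int // valid responses received
    Timeouts int // queries that got no answer
    Errors   int // KoD, malformed, or otherwise bad responses
}

func main() {
    // One lost reply out of three, as in the capture above.
    r := BatchReport{Sent: 3, Good: 2, Timeouts: 1}
    fmt.Printf("%+v\n", r)
}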


No, we don’t.
That would mean the monitors that are not in charge of your scoring have to work harder.
Why?

The monitors that score your server give enough data.

I do not want to see my monitor being overloaded just because of this.

An active monitor needs to examine whether you are a good ticker or not.
You cannot make it more precise this way, as the ping time is corrected for.

Making it ‘ping’ more won’t help. I fail to see your point.

The only way to make it more precise is by making the offset figure more strict.
But will this make it more accurate or just drop more servers?

Making scoring more sensible, yes, that may help: make the rms-freq-error more strict for scoring.

Will this help? Maybe.

Question is, how much are servers off-time at all?

Less work: one third the number of packets sent relative to today’s situation.


How? Currently the non-active monitors send about a third of the packets that the active monitors do, as you can see when you check the graphs.

However, how will this improve accuracy?

I’m trying to understand your point of view.

Polling more won’t change accuracy. I do not see how.

Three queries are sent, and if only one reply is received, the sample is considered perfect (if there is no KoD packet). I fail to see how aggressive rate limiting translates into a decrease in score.

That does not change.

We will see the difference between very good monitor-server connectivity and medium-quality monitor-server connectivity (such as some sporadically occurring packet loss). Today the second case is reported as a very good (no packet loss) situation.

Monitors are checked…

Why do you think the monitors are wrong/off?

Packet-loss is an internet problem…not a monitor problem.

I fail to see your issue.

I have to admit that I forgot the system was changed to behave that way some months ago. Previously, all the sent requests were expected to be replied to.

But still, how did you think you would detect rate-limiting issues with only one query? I’d think it would require several queries sent at two-second intervals or so.

What do you think of my proposal to keep the current number of queries in a batch, but report them independently to the pool management server? I’d think it’d be the best of both worlds, i.e., it would detect rate-limiting issues AND it would detect single requests getting dropped.

Hmm, strictly speaking, it means that the server meets the quality criteria currently encoded in the monitors and scoring system to the highest degree, but not necessarily that it is perfect. The NTP protocol is somewhat immune to some disturbance, and as the monitoring system’s role is to assess each server based on its suitability to serve time to clients, that is what the system reflects. A sanity check, if you will, e.g., that there isn’t so much packet loss that it would impair getting time, or too large an offset, or some other anomaly. And it is certainly not a beauty contest or a competition over who will get the best scores, though I concede that that probably plays a role in motivating server operators to add servers to the pool, and to maintain them in good condition.

I don’t see how that data would enhance the purpose of the monitoring system, which is to assess the suitability of a server to serve time and be included in the pool.

I understand the point, and think it is important to address this perceived unfairness. But I don’t think making each individual probe count is the way to go. From a Global North perspective, that may work very well. But as the pool still aspires to be a truly global project, conditions in the Global South, where infrastructure is in fact more expensive and less reliable, also need to be considered. I fear that would kick many a server out of the pool, aggravating an already bad situation in arguably the largest part of the world (by population). Just two or three lost packets would kick a server out of the pool.

Usually, I am all for just trying things out, which gives better answers than debating theoretically. But to be fair in that context, it would need a level playing field, and good representation from each place that the pool aspires to serve. And I see that even less with the beta system than with the production system.

Not related to a server, but as an analogy to highlight the problem: some of my monitors, particularly in Asia, more or less frequently take themselves out of the monitoring and thus don’t contribute to the scoring of local and regional servers, despite their time being fine compared to good local reference clocks. Why? Because they are being assessed against reference servers far away, sometimes halfway around the world, with one saying it is +20 ms off, the other -20 ms, and only one can be true. So already this introduces a serious bias into the whole system.

Though I recognize that differentiating a bit more would be considered fairer, so how about scaling the steps that the scores can take: e.g., subtract not the full 5 points when one out of three packets gets lost, but only about half a point. Or add less than a full point. See the sketch below.
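One possible shape of such scaled steps, as a sketch only (the 5-point figure is the timeout penalty mentioned above; the half-point-per-loss scaling is just an example):

package main

import "fmt"

// penalty returns the points to subtract for a batch: the full
// timeout penalty only when nothing answered, otherwise a small
// amount per lost packet.
func penalty(lost, sent int) float64 {
    if lost >= sent {
        return 5.0 // complete timeout: full penalty as today
    }
    return 0.5 * float64(lost) // e.g. half a point per lost packet
}

func main() {
    fmt.Println(penalty(0, 3), penalty(1, 3), penalty(3, 3)) // 0 0.5 5
}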

PS: I am currently seeing at least one of my monitors sending four probe packets, not three.


Preload the server with an arbitrary number of packets, spaced 2 seconds apart, and take only the last packet into consideration for scoring.
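A sketch of that idea, with ntpQuery as a hypothetical stand-in for the monitor’s real query routine:

package main

import (
    "fmt"
    "time"
)

// ntpQuery stands in for sending one client-mode NTP packet and
// waiting up to a few seconds for a valid reply.
func ntpQuery(server string) bool {
    // ... a real implementation would do the network I/O ...
    return true
}

// probeWithPreload sends warm-up queries spaced 2 seconds apart to
// provoke any rate limiting, then scores only the final query.
func probeWithPreload(server string, warmup int) bool {
    for i := 0; i < warmup; i++ {
        _ = ntpQuery(server) // replies intentionally ignored
        time.Sleep(2 * time.Second)
    }
    return ntpQuery(server) // only this reply counts for the score
}

func main() {
    fmt.Println(probeWithPreload("111.198.57.33", 2))
}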

In fact, a single packet loss is already reflected: the score decreased from 19.982591629 to 19.959260941. But that decrease is hardly visible.

Yes, but why is this a problem?
Nobody can score 20 at 100%.

As some routes are simply bad.

That is why we changed from 1 monitor to 50 or so monitors.

And only the monitors that score you well (presuming your server is ticking OK 24/7) will be selected to monitor you for scoring.

Those determine your score. All the rest don’t matter anymore.

Beware that NTP traffic is UDP, which is often lost; it’s a send-and-forget protocol, so packet loss isn’t uncommon. Typically, when routes are congested, the UDP packets get dropped.

In short, scores will never be perfect across all monitors. But in the past we had only 1 monitor, and if your packets were dropped too much, your system was taken out of the pool and you got emails, even when your server was perfectly fine.

The new system prevents this from happening, and you ONLY drop out of the pool if ALL monitors score you badly.

The current system keeps your server in the pool, regardless if packets are lost…that is the point of the multi-monitor system. The old system did take too many systems out because it had just 1 monitor and a few dropped packets got you removed and emailed.

The monitors have no other purpose; if you score 20, you are fine. It doesn’t matter how many monitors are active for you, it could be just 1 or 50; what matters is that they report your system to be fine.

Otherwise you get taken out of the pool. Forget the monitors that score badly, they do not matter.

Since you replied, let me come back to the topic. I realized that if one reply packet out of the three sent in a batch is lost, the score decreases a bit. If this happens continuously in every batch, typically due to misconfigured rate limiting, the score will slowly dive under 10.

On the graph a score of 20 is shown for most of the monitors. However, some are typically under 20 and above 19.95 (rounded up to 20 for display); one can check this in the log.

I made a measurement on one of my servers.

On the graph 99 monitors show score 20, and 13 monitors are displayed under score 20.

From the log, 71 monitors have a score of 20, 28 monitors’ scores are less than 20 and greater than or equal to 19.95, and 13 monitors are under 19.95.


Again, why is this a problem…see mine…

It’s the same, but these monitors don’t matter for my scoring.

The active ones do…

The testing ones do not.

So what is the problem? I really fail to see what the issue is.

We all have candidates that score badly, so what? Those scores matter nothing, just that they have a different score. But those scores mean NOTHING.

So I ask again, what are you after? Or what is the problem? I really do not understand the issue.

Nobody scores 20 at ALL servers, nobody. And besides, only active monitors score you; the rest are NOT selected and thus don’t score you to be in the pool.

I do not get your problem.

The quality of the Internet is not so good that so many monitors would have a score of 20. But my problem is (somewhat) solved, since many monitors’ scores aren’t really 20, just rounded up to 20 for display. On a single packet loss the score drops only a tiny bit. I would like a somewhat bigger decrement, to be able to visually differentiate from the monitors where there is really no packet loss. But that is not the way it is implemented now.

Yes, another thread. I am trying to stay organized.

I suggest having a target score based on the ratio of lost packets in a batch.
That target would be reached asymptotically if the packet loss ratio stays the same.

Number of reply packets received / number of monitoring packets sent in a batch => TargetScore

For example, for the batch size of three:

0/3 => -100
1/3 => -20
2/3 => 0 (must be lower than 10 for those NTP servers having too aggressive rate limiting)
3/3 => 20

Of course, the target score value could be influenced by other parameters, like a high-offset reply, a KoD packet received, a zero-size NTP reply packet received, and so on.

For practical purposes, I suggest treating those conditions as packet loss. In the case of a high offset, record the worst offset value in the log, not the best or the average.

Evaluation of the new score for a monitor after a test batch:
NewScore = (PreviousScore - TargetScore) / 2 + TargetScore
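A small simulation of this rule (the target table and the update formula are from the proposal above; the names and structure are just illustration):

package main

import "fmt"

// targetScore maps replies received out of a 3-packet batch to the
// proposed asymptotic target.
func targetScore(received int) float64 {
    switch received {
    case 0:
        return -100
    case 1:
        return -20
    case 2:
        return 0
    default:
        return 20
    }
}

// step applies NewScore = (PreviousScore - TargetScore)/2 + TargetScore.
func step(prev float64, received int) float64 {
    t := targetScore(received)
    return (prev-t)/2 + t
}

func main() {
    // One lost packet per batch (2 of 3 received): the score halves
    // toward the target 0, i.e. 20 -> 10 -> 5 -> 2.5 ...
    score := 20.0
    for batch := 1; batch <= 4; batch++ {
        score = step(score, 2)
        fmt.Printf("batch %d: %.3f\n", batch, score)
    }
}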

A server starting with a score of 20, after two batches each with one packet loss (2/3 received, so the target is 0), ends up at score 5 (20 → 10 → 5). That looks very aggressive, and it is. However, in the case of a correctly working server there will always be a handful of monitors where the score is bigger than 10, even 20. That will make the overall score bigger than 10, as we wish. Let’s have a simulation with one of my production servers.

The result of the change:

Before the change: 99 monitors on score 20, 13 monitors below 20.

After the change: 71 monitors on score 20, 41 monitors below 20.

I suggest trying this in the test pool.