Monitoring v4 score quorum for server inclusion in pool too lenient?

MagicNTP · August 6, 2025, 5:10pm

I am wondering whether the scoring quorum behavior described in the following is intended that way, or whether it might unintentionally be too lenient.

To manage traffic consumption when the monthly traffic quota risks being overrun in some underserved zones, I selectively block NTP requests from the monitors only to force the score to drop, and for the respective server to be temporarily removed from the pool.

While extremely crude and ugly (though the new score graph coloring makes the graphs way more palatable than the old coloring), it is quite effective in managing the traffic consumption*.

In this context I found that a single monitor scoring a server above 10 can overrule all remaining two dozen or so monitors that score that server below 10.

Is that intended?

I probably haven’t considered every angle yet, but intuitively, that doesn’t seem right to me.

(Unfortunately, it seems I have lost some screenshots that showed a single monitor above 10 leading to the overall score of above 10 as well, while all the others scored below 10. But I think the following picture with a single monitor scoring above zero determining the overall score when all other monitors score below zero gives an idea as to what that would have looked like.)

* Blocking the monitors’ packets does not work for managing bitrate limits, though. Such limits are typically lower in underserved zones than the limits available at the same cost elsewhere (where bitrate limits are likely not even relevant for a pool server’s NTP traffic relative to the respective demand). Nor does it work for managing packet rate limits, which may affect devices almost anywhere. The feedback loop that drops the score, and subsequently reduces the bitrate/packet rate, is far too slow compared to how quickly traffic surges when a server in an underserved zone enters the pool in an off/on fashion as its score rises from below 10 to above that threshold.

MagicNTP · August 6, 2025, 6:37pm

Different server (this one got kicked out of the pool “naturally” - or, in my view, should have been kicked out), but still nicely showing how a single perfect score of 20 can supersede all other scores being below 10.

ask · August 8, 2025, 5:16am

As we’ve discussed before, what you’re attempting isn’t supported by the system[1].

That said, I really appreciate how this exposes edge cases. This behavior likely stems from design choices that made sense when we had far fewer monitors (and when users frequently complained about servers being temporarily removed from the pool).

I’ll give this more thought, but the solution probably involves requiring a minimum number of successful active monitors (3? 5?) to agree before calculating a median from those results. If I recall correctly, the system first attempts to get a median from monitors marked as active, then falls back to other monitors with recent data if that fails. I can refine this logic.
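To make the idea concrete, here is a minimal Python sketch of a quorum-gated median with the fallback described above. It is not the pool's actual code: the function names, the default quorum of 3, and the exact handling of a score of exactly 10 are assumptions made for illustration only. Each list holds the scores of monitors that returned usable results.

```python
from statistics import median

POOL_THRESHOLD = 10.0  # inclusion threshold discussed in this thread
MIN_QUORUM = 3         # hypothetical default; the post floats 3 or 5


def overall_score(active, fallback, min_quorum=MIN_QUORUM):
    """Return the median of the first group that reaches the quorum.

    `active` holds scores from monitors marked active, `fallback` holds
    scores from other monitors with recent data. Returns None when
    neither group has at least `min_quorum` scores.
    """
    for group in (active, fallback):
        if len(group) >= min_quorum:
            return median(group)
    return None  # no quorum: the caller decides what to do


def in_pool(score):
    # Assumption: a score of exactly 10 counts as "in".
    return score is not None and score >= POOL_THRESHOLD


# One active monitor at 20, two dozen others at 5.
lenient = overall_score([20.0], [5.0] * 24, min_quorum=1)
strict = overall_score([20.0], [5.0] * 24)
print(lenient, in_pool(lenient))
print(strict, in_pool(strict))
```

With a quorum of 1, the single monitor at 20 decides the outcome, which matches the behavior reported above. With a quorum of 3, the lone active monitor is skipped and the median of the remaining monitors keeps the server out.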

[1] You’ll be pleased to know the system now has an internal HTTP API for updating net speed, plus a complete API key system. I built this with your expressed needs in mind. I need to wire up the API key system in the Web UI, move the API to be accessible and work out some of the internal consequences of netspeeds potentially changing more often.

MagicNTP · August 8, 2025, 11:32am

At the end of the day, it is simply leveraging what happens in many cases “naturally”, unpredictably. Just making it happen in a more controlled way. But yeah, it’s not what it was intended for, and it’s crude and ugly and cumbersome and costly (to get it right), but we have to use the means that the pool gives us to deal with a wide range of challenges, creatively when no more direct route is available. So I am happy to learn that we’ll soon get a new tool for our toolbox to deal with a few more of the very diverse challenges that servers can face in different zones with wildly varying circumstances.

That was the only reason why I raised it here. My use case was just informative background info as to how I found the behavior, but as it is indeed not a common use case wouldn’t have warranted raising this as a larger issue. Apologies if my mentioning it diluted the focus.

But the second picture I posted was from a server that was actually not reachable from pretty much anywhere due to external issues (I guess some misguided DDoS protection having kicked in somewhere, at least there was a correlation time-wise between reducing the netspeed, and traffic flow resuming). I.e., it should have been kicked out of the pool in my view, which is why I raised this.

I fully understand the frustration around this. What vexes me in this context is that those complaints completely disregard the purpose of this intentional behavior, and ignore that the monitoring system is only the canary in the coal mine: it indicates an issue elsewhere, outside the control of the monitors, but the monitoring itself is not the issue.

I cannot speak to earlier design decisions, but going forward, I very much feel that using the monitoring system and the scoring mechanism to deal with the issues that ultimately lead to the complaints you mention would miss the actual target, i.e., be costly to implement, but fragile, and ultimately likely to not even effectively address the actual challenges. Just like the complaints confuse the symptoms with the actual issue. More on that below.

Hmm, based on behavior observed with the old system, I always assumed that that is how it was designed to work. I don’t have any statistics, but just from my recollection, it always seemed to me that at least for new servers, monitor selection typically ended up with 5 active monitors just around the time it was decided the server could join the pool. Sometimes, there were maybe only 3 or 4 monitors at that time, but not long after, 5 monitors were selected to be active. And I don’t recall a significant number of cases where the number of active monitors would thereafter drop below 5 (but obviously, exceptions are possible here and there).

Yes, that sounds very good indeed! It will be a nice tool to deal with some challenges in some circumstances. But it is just one additional tool in the tool box, and will not magically resolve all challenges.

The various zones around the globe are too diverse for a one-size-fits-all solution. Zones on different continents certainly may be quite different, but even zones on the same continent, even adjacent ones, while having similar, or related challenges, may still have somewhat differing variations of the same general challenges, potentially needing different tools from the tool box, or in different combinations.

And while some tools may not be needed in some zones/areas, that should not prevent a specific tool from being made available for use in other zones/areas.

And what holds for different zones/continents is certainly true also for individual servers within a zone/area, let alone across zones or continents.

Other tools I would be happy to welcome in our toolbox have variously been mentioned throughout various threads in this forum before:

1. Apart from the benefit of being able to automate the setting of the netspeed, being able to set values at a different granularity will be a benefit, allowing operators to maximize utilization of an individual server’s capacity, increasing overall zone capacity. I have a few servers where I could almost double the utilization, but choosing the next higher available netspeed would be just that small bit too much, so I run them at the lower netspeed value, wasting a large part of their capacity, as it was just too cumbersome to tweak the web interface for every case. Being able to flexibly set a value in between through the API would make it possible to strike the right balance (but requires some willingness to explore this, and obviously nobody is forced to do such fine-tuning).
2. Allow lower netspeed values, all the way down to 1kbit. This might be ineffective in some low-traffic zones, but doesn’t really hurt, either. At the same time, in many high-traffic zones, it will help to bring additional capacity online. In small steps, obviously, but with the hope that a tipping point can be reached that will allow more and more smaller servers to be added. Again, I have a few servers in the pool at 1kbit netspeed in zones where that still generates enough traffic to be useful (“every server counts”), but not so much as to force me to remove them from the pool entirely, some due to bitrate limits, others due to monthly traffic quotas. And while such a low setting may be useless in some zones (i.e., produce no meaningful traffic), or ineffective in others (e.g., when a server is the only one, or one of just a very small number), that should not prevent it from being made available for use in zones where it can be helpful.
3. Gradual load increase in correlation with the score when it crosses the threshold of 10, vs. the current “off/on”. I’d be happy to make a proposal how to implement that, at least for illustration purposes. But this would help a large number of servers, and the zones they are in, because it would keep the load on a server just around the level that is right, and reduce the huge load fluctuations on other servers in the zone, with their negative impact. (In some zones with relatively few servers and high traffic load, servers need to be dimensioned for peak load, e.g., due to physical or contractual bandwidth limits, or hardware packet rate restrictions, so smoothing the peaks could allow them to run at a higher average traffic load.)
4. Slowly blurring the current strict zone-based client-to-server assignment. This would probably be the biggest piece of work, but it is not as if it hadn’t been on the to-do list already at some point, or as if there were a shortage of ideas on how this could be done and how the transition could be managed.
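As a rough illustration of the gradual load increase mentioned in the list above, the following Python sketch contrasts the current "off/on" behavior with a linear ramp. Everything here is hypothetical: the function names, the choice of a linear ramp, and the ramp end point of 20 (the maximum score) are assumptions, not a description of how the pool computes DNS weights.

```python
def on_off_weight(score, netspeed, threshold=10.0):
    """Current behavior as described in the thread: all or nothing."""
    return float(netspeed) if score >= threshold else 0.0


def ramped_weight(score, netspeed, ramp_start=10.0, ramp_end=20.0):
    """Hypothetical alternative: scale the netspeed linearly with score.

    No traffic share below ramp_start, the full configured netspeed at
    ramp_end and above, and a proportional share in between.
    """
    if score < ramp_start:
        return 0.0
    if score >= ramp_end:
        return float(netspeed)
    fraction = (score - ramp_start) / (ramp_end - ramp_start)
    return netspeed * fraction


for score in (9.9, 10.0, 12.5, 15.0, 20.0):
    print(score, on_off_weight(score, 1000), ramped_weight(score, 1000))
```

With the ramp, a server whose score has just recovered to 12.5 would carry a quarter of its configured netspeed instead of jumping straight to full load, which is the kind of smoothing the item above argues for.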

I listed the items in order of my assessment as to the effort it would cost, and especially the benefit/cost ratio. I.e., some smaller items might not fully solve everything, but their implementation is cheap enough to just go ahead with it, and reap some benefits, until the “big” solution will be available.

So I hope some of the smaller items could be implemented soon due to their low effort, e.g., restoring the lower bound of the netspeed setting to 1kbit (it seems the internal API currently accepts 256kbit as the lowest value only). And the API seems in the works already anyhow.

In many cases, I would expect that this would not be used to constantly fiddle with the netspeed. One aspect is that the API hopefully allows setting better-fitting netspeed values even in between the granularity of the UI. I.e., once the “sweet spot” has been found (provided that an operator is even interested in fine-tuning things that way), it would be relatively stable. In most of my use cases, this mechanism is mostly just a backstop, i.e., traffic consumption should remain within the quota, but gets close to it, so the mechanism just “short-circuits” the traffic consumption in unforeseen circumstances. Obviously, I don’t know what other people’s use cases might be, or whether they might come up with wildly differing ones.

On the other hand, I would expect this to take some load off the overall system, namely with the many servers that currently constantly move into and out of the pool. Provided the operator is willing to explore this, a “sweet spot” could potentially be identified which keeps the respective server in the pool for longer, rather than zoning in and out, limited only by how long it takes for the score to drop and then recover.

ask · August 10, 2025, 12:01pm

Slowly blurring the current strict zone-based client-to-server assignment.

That is my plan (related to the IPv6 enablement, too)!

For now I think I fixed the most obvious failure (removing most active monitors when they were failing, leaving the system in a fussy state). I deployed it to the beta system. Would you be able to run your test there? The system should keep 5 monitors even if all/most are having poor performance.

Also on beta, but not deployed to production yet, are changes to switch monitor statuses more liberally[1] to pick “the best ones”. Before these changes (what’s in production) the monitors won’t as easily get swapped without a (temporary) performance blip. The beta system also uses a longer time horizon (24 hours instead of 12) to evaluate the monitor performance for each server.
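A small Python sketch of what such a selection could look like, under stated assumptions: the `Sample` type, the success-ratio ranking, and the function name are invented for illustration, and only the 24-hour horizon and the target of 5 monitors come from this post. The point it shows is that ranking and keeping the top entries never leaves the active set empty just because every monitor performs poorly.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    monitor: str      # monitor name (hypothetical identifier)
    age_hours: float  # how long ago the check ran
    ok: bool          # whether the check succeeded


def pick_active(samples, horizon_hours=24.0, keep=5):
    """Rank monitors by success ratio within the horizon, keep the best."""
    totals = {}
    for s in samples:
        if s.age_hours <= horizon_hours:
            ok, n = totals.get(s.monitor, (0, 0))
            totals[s.monitor] = (ok + int(s.ok), n + 1)
    # Best ratio first; ties are broken by name to keep the result stable.
    ranked = sorted(totals, key=lambda m: (-totals[m][0] / totals[m][1], m))
    return ranked[:keep]


# Six monitors that all fail: five are still kept as active.
failing = [Sample(name, 1.0, False) for name in "abcdef"]
print(pick_active(failing))
```

Widening `horizon_hours` from 12 to 24 only changes which samples are counted, so a monitor with a brief recent blip is less likely to be swapped out on that basis alone.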

MagicNTP · August 10, 2025, 12:29pm

Thanks!

Sure, will try to replicate. Though it may take some time to see anything, as my “test” gives slightly more control over when a server might get kicked out but still involves a bunch of uncontrolled variables/moving parts. And even more so for the actually relevant case of such an incident occurring naturally rather than when trying to force it.

MagicNTP · August 10, 2025, 12:35pm

Great! Good to hear! And it is often not so important that stuff actually moves forward right then and there (when there is other visible stuff going on), but just hearing what the latest status is, or simply that things are still on the agenda at all while other stuff is being worked on for longer.

And keeping in mind that small but meaningful improvements are possible with very little effort, so can be realized without having to wait for the “big” solution (the eventual blurring of the country zone boundaries) to get realized (which by its nature will take some time).
