At the end of the day, it simply leverages what in many cases happens “naturally” and unpredictably anyway, just making it happen in a more controlled way. But yeah, it’s not what it was intended for, and it’s crude, ugly, cumbersome, and costly (to get right). Still, we have to use the means that the pool gives us to deal with a wide range of challenges, creatively when no more direct route is available. So I am happy to learn that we’ll soon get a new tool for our toolbox to deal with a few more of the very diverse challenges that servers can face in different zones with wildly varying circumstances.
That was the only reason why I raised it here. My use case was just background on how I found the behavior; as it is indeed not a common use case, it wouldn’t by itself have warranted raising this as a larger issue. Apologies if my mentioning it diluted the focus.
But the second picture I posted was from a server that was actually not reachable from pretty much anywhere due to external issues (I guess some misguided DDoS protection having kicked in somewhere, at least there was a correlation time-wise between reducing the netspeed, and traffic flow resuming). I.e., it should have been kicked out of the pool in my view, which is why I raised this.
I fully understand the frustration around this. What vexes me in this context is that those complaints completely disregard the purpose of this intentional behavior, and ignore that the monitoring system is only the canary in the coal mine: it indicates an issue elsewhere, outside the monitors’ control, while the monitoring itself is not the issue.
I cannot speak to earlier design decisions, but going forward, I very much feel that using the monitoring system and the scoring mechanism to deal with the issues that ultimately lead to the complaints you mention would miss the actual target: it would be costly to implement, yet fragile, and ultimately unlikely to even address the actual challenges effectively, just as the complaints confuse the symptoms with the actual issue. More on that below.
Hmm, based on behavior observed with the old system, I always assumed that that is how it was designed to work. I don’t have any statistics, but just from my recollection, it always seemed to me that at least for new servers, monitor selection typically ended up with 5 active monitors just around the time it was decided the server could join the pool. Sometimes, there were maybe only 3 or 4 monitors at that time, but not long after, 5 monitors were selected to be active. And I don’t recall a significant number of cases where the number of active monitors would thereafter drop below 5 (but obviously, exceptions are possible here and there).
Yes, that sounds very good indeed! It will be a nice tool to deal with some challenges in some circumstances. But it is just one additional tool in the toolbox, and will not magically resolve all challenges.
The various zones around the globe are too diverse for a one-size-fits-all solution. Zones on different continents certainly may be quite different, but even zones on the same continent, even adjacent ones, may face somewhat differing variations of the same general challenges, and so may need different tools from the toolbox, or the same tools in different combinations.
And while some tools may not be needed in some zones/areas, that should not prevent a specific tool from being made available for use in other zones/areas.
And what holds for different zones/continents is certainly true also for individual servers within a zone/area, let alone across zones or continents.
Other tools I would be happy to welcome in our toolbox have been mentioned in various threads in this forum before:
- Apart from the benefit of being able to automate the setting of the netspeed, being able to set values at a finer granularity will be a benefit, allowing operators to maximize utilization of an individual server’s capacity and thereby increase overall zone capacity. I have a few servers where I could almost double the utilization, but choosing the next higher available netspeed would be just that small bit too much, so I run them at the lower netspeed value, wasting a large part of their capacity, as it was just too cumbersome to tweak the web interface for every case. Being able to flexibly set a value in between through the API would make it possible to strike the right balance (but requires some willingness to explore this, and obviously nobody is forced to do such fine-tuning).
- Allow lower netspeed values, all the way down to 1kbit. This might be ineffective in some low-traffic zones, but doesn’t really hurt, either. At the same time, in many high-traffic zones, it will help to bring additional capacity online. In small steps, obviously, but with the hope that a tipping point can be reached that will allow more and more smaller servers to be added. Again, I have a few servers in the pool at 1kbit netspeed in zones where that still generates enough traffic to be useful (“every server counts”), but not so much as to force me to remove them from the pool entirely, some due to bitrate limits, others due to monthly traffic quotas. And while such a low setting may be useless in some zones (i.e., produce no meaningful traffic), or ineffective in others (e.g., when a server is the only one, or one of just a very small number), that should not prevent it from being made available for use in zones where it can be helpful.
- Gradual load increase in correlation with the score once it crosses the threshold of 10, vs. the current “off/on”. I’d be happy to make a proposal for how to implement that, at least for illustration purposes. This would help a large number of servers, and the zones they are in, because it would keep the load on a server just around the level that is right for it, and reduce the huge load fluctuations for other servers in the zone, with their negative impact. (In some zones with relatively few servers and high traffic load, servers need to be dimensioned for peak load, e.g., due to physical or contractual bandwidth limits, or hardware packet rate restrictions, so smoothing the peaks could allow them to run at a higher average traffic load.)
- Slowly blurring the current strict zone-based client-to-server assignment. This would probably be the biggest piece of work, but it is not as if it hadn’t been on the to-do list at some point already, or as if there were a shortage of ideas for how this could be done and how the transition could be managed.
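
To illustrate the gradual load increase item above, here is a minimal sketch of one way it could work. Everything in it is an assumption for illustration only: the function name `effective_weight`, the linear ramp, the use of 20 as the score at which full load is reached, and the idea that the ramp simply scales the configured netspeed. The pool’s actual selection logic may well look different.

```python
def effective_weight(netspeed_kbit: float, score: float,
                     threshold: float = 10.0,
                     full_score: float = 20.0) -> float:
    """Hypothetical selection weight that ramps up with the score.

    At or below the threshold the weight is zero, so the server gets no
    traffic. Above it, the weight grows linearly and reaches the full
    configured netspeed at full_score (assumed here to be 20).
    """
    if score <= threshold:
        return 0.0
    # Fraction of the configured netspeed to use, capped at 1.0.
    fraction = min(1.0, (score - threshold) / (full_score - threshold))
    return netspeed_kbit * fraction


if __name__ == "__main__":
    for s in (9, 10, 12.5, 15, 20):
        print(s, effective_weight(1000, s))
```

With a ramp like this, a server whose score hovers just above 10 would receive only a small share of its configured load instead of the full amount at once, which is the smoothing effect described in the list item.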
I listed the items in order of my assessment of the effort they would cost, and especially of their benefit/cost ratio. I.e., some smaller items might not fully solve everything, but their implementation is cheap enough to just go ahead with it and reap some benefits until the “big” solution is available.
So I hope some of the smaller items could be implemented soon due to their low effort, e.g., restoring the lower bound of the netspeed setting to 1kbit (it seems the internal API currently accepts nothing lower than 256kbit). And the API seems to be in the works already anyhow.
In many cases, I would expect that this would not be used to constantly fiddle with the netspeed. One aspect is that the API hopefully allows setting better-fitting netspeed values, even in between the steps offered by the UI. I.e., once the “sweet spot” has been found (provided that an operator is even interested in fine-tuning things that way), it would be relatively stable. In most of my use cases, this mechanism would mostly just serve as a backstop: traffic consumption should normally remain within the quota, but it gets close to it, so the mechanism would just “short-circuit” the traffic consumption in unforeseen circumstances. Obviously, I don’t know what other people’s use cases might be, or whether they might come up with wildly differing ones.
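As a rough illustration of the backstop idea, a minimal sketch of the decision logic could look like the following. The function name `backstop_netspeed`, its parameters, and the simple linear projection are all hypothetical. The actual call that would apply the value is deliberately left out, since the API’s interface is not described here.

```python
def backstop_netspeed(used_bytes: int, quota_bytes: int,
                      day_of_month: int, days_in_month: int,
                      normal_kbit: int, floor_kbit: int) -> int:
    """Return the netspeed to request (hypothetical helper).

    Keeps the normal setting unless the traffic used so far, projected
    linearly to the end of the month, would exceed the monthly quota.
    day_of_month is expected to be 1 or greater.
    """
    # Average daily usage so far, extrapolated to the whole month.
    projected = used_bytes / day_of_month * days_in_month
    return floor_kbit if projected > quota_bytes else normal_kbit


if __name__ == "__main__":
    # 400 GB used after 10 of 30 days against a 1 TB quota: projection
    # is 1.2 TB, so the sketch falls back to the floor value.
    print(backstop_netspeed(400 * 10**9, 10**12, 10, 30, 3000, 1))
```

In normal operation the function keeps returning the usual value, so the netspeed would stay stable and only change when the projection crosses the quota.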
On the other hand, I would expect this to take some load off the overall system, namely for the many servers that currently constantly move into and out of the pool. Provided the operator is willing to explore this, a “sweet spot” could potentially be identified which keeps the respective server in the pool for longer, rather than having it cycle in and out, limited only by how long it takes for the score to drop and then recover.