Beta system monitoring testing

Ahh ok. Thank you Dave. I just thought it was a drop-down list :slight_smile:
Will try it out for the next monitor.

As I see it, three NTP packet samples are collected in one run from the monitoring point to a given server, and only the best packet is taken into account.
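
If I sketch that selection in code (hypothetical names, just my reading of the behavior, not the actual monitor code), it amounts to something like:

```go
// Hypothetical sketch of "best of N" sample selection, not the actual
// monitor code: each check sends a few queries and keeps only the
// response with the lowest round-trip time before scoring the offset.
package main

import (
	"fmt"
	"time"
)

// Sample is an assumed stand-in for one query/response pair.
type Sample struct {
	RTT    time.Duration // round-trip time of this query
	Offset time.Duration // measured clock offset
	Lost   bool          // true if no response arrived
}

// bestSample returns the response with the lowest RTT, or false if
// every query in the run was lost.
func bestSample(samples []Sample) (Sample, bool) {
	var best Sample
	found := false
	for _, s := range samples {
		if s.Lost {
			continue
		}
		if !found || s.RTT < best.RTT {
			best, found = s, true
		}
	}
	return best, found
}

func main() {
	run := []Sample{
		{RTT: 40 * time.Millisecond, Offset: 3 * time.Millisecond},
		{Lost: true}, // a dropped packet is simply ignored
		{RTT: 25 * time.Millisecond, Offset: 1 * time.Millisecond},
	}
	if best, ok := bestSample(run); ok {
		fmt.Printf("scoring offset %v (rtt %v); the other samples are discarded\n",
			best.Offset, best.RTT)
	}
}
```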

That workaround made a lot of sense when there was only a single monitoring point for the whole pool, as some servers were out of the pool due to bad connectivity conditions towards the only existing monitoring station.

However, the situation is different now with multiple monitoring points. This method of sampling destroys valuable information. Let me give an example: it is not possible to distinguish between two monitoring stations probing a given NTP server when one has 100% reachability and the other 95%. Another example: the three-sample monitoring reports a better offset (or rather dispersion?) value than reality.

We should not hide information at collection time. It would still be possible to average at a later stage, taking into account the best run result from the last 3, but I do not think that is needed at all.
I suggest going back to monitoring with a single NTP packet sample.

I can see why you’d think so, but in fact I’d argue the pool monitor code gets this right even with many monitors spread across many networks.
If every NTP client were one-shot SNTP steering the clock based on a single sample, your analysis would be pretty good. However, one-shot SNTP clients are getting low-quality service over a WAN anyway, subject to being tens or even hundreds of milliseconds off systematically. If the pool gets a one-shot SNTP client within a second or two of correct time, that’s more than good enough, and it does better than that thanks to requiring consistent accuracy well under 100ms. I think it’s under 20ms with the beta monitoring code.
On the other hand, a full-fledged NTP client like Chrony or ntpd can tolerate even 75% loss and still do a credible job of estimating the server time, thanks to repeated queries over time. I know ntpd/ntpsec well, but I think Chrony is similar in broad terms in the respects that matter here.
ntpd keeps the last 8 samples, and ignores all but the one with the lowest RTT, while keeping an error budget that accounts for the age of that sample, both to refine the clustering and combining algorithms and to relay a good error estimate down the chain of NTP servers. By comparison, the pool monitoring implementation is essentially more like a one-shot SNTP client, and benefits from choosing the best of three samples close in time before acting upon the apparent difference between the monitor’s known-good clock and the sampled server’s.
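
As a rough sketch of that filtering idea (the constants and names here are assumptions for illustration, not ntpd’s actual code):

```go
// A rough sketch of the ntpd-style clock filter described above, not
// ntpd's real implementation: keep a window of recent samples per
// server (ntpd keeps the last 8) and use the one with the lowest
// delay, while the error estimate of an old sample grows with its age.
package main

import (
	"fmt"
	"time"
)

// phi is an assumed frequency tolerance (s/s) used to age samples.
const phi = 15e-6

type filterSample struct {
	offset float64   // measured clock offset, seconds
	delay  float64   // round-trip delay, seconds
	when   time.Time // when the sample was taken
}

// bestOf returns the sample with the lowest delay plus an error term
// that grows with the sample's age, so a fresh mediocre sample can
// eventually beat a stale good one.
func bestOf(samples []filterSample, now time.Time) (filterSample, float64) {
	best := samples[0]
	bestErr := best.delay/2 + phi*now.Sub(best.when).Seconds()
	for _, s := range samples[1:] {
		e := s.delay/2 + phi*now.Sub(s.when).Seconds()
		if e < bestErr {
			best, bestErr = s, e
		}
	}
	return best, bestErr
}

func main() {
	now := time.Now()
	window := []filterSample{
		{offset: 0.004, delay: 0.050, when: now.Add(-7 * time.Minute)},
		{offset: 0.002, delay: 0.120, when: now.Add(-2 * time.Minute)},
		{offset: 0.003, delay: 0.060, when: now},
	}
	best, errBudget := bestOf(window, now)
	fmt.Printf("using offset %.3fs, error budget %.4fs\n", best.offset, errBudget)
}
```
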
If the path between a full client and a server is lossy, the use of 8 samples means the quality of the estimate degrades but not the ability to synchronize. Does that make sense to you?

I don’t think that sort of post-processing is compatible with the architecture of the monitoring code.

1 Like

The description you provided of the time synchronization quality difference between a simple SNTP client and a full client is correct. However, that is irrelevant for the NTP monitoring, since the purpose of the monitoring is not to provide time synchronization, but rather to have the most detailed picture possible of the state of the NTP server, including the quality of the network connecting the server and the monitor.

Maybe not at this time, but as I wrote earlier, I believe there is no need to modify anything in that respect.

Indeed. But shouldn’t that then reflect how an actual NTP client would assess the NTP server? Such a client would look at multiple measurements in a more summary fashion, as Dave so aptly describes, and would not give a single sample the outsized influence it would have in “the most detailed picture” you suggest. That risks missing the forest for the trees.

And in fact, the current implementation is already more sensitive to disturbances than an actual NTP client would be. For good reason, to allow “really” bad servers to be dropped quickly, while balancing that against the instability that overly eager dropping could cause.

One could argue that taking a single sample per pass could be compensated for by subsequent filtering in the post-processing, as Dave mentions. But contrary to what you state, that would likely not go without modifications to the current code architecture.

So I am not sure why you think the three samples are a workaround for something (and I think most monitors in the current system even collect four samples each round). Rather, in line with Dave’s description, I think that was fully intended.

Also, regardless of what one thinks the final approach should be, the effect on the pool and its stability would need to be considered. For well-served zones, like those most of the more active people in this forum can be happy to have their servers reside in, it might be ok to tighten the criteria for a server to be in the pool. But that perspective seems highly biased, as much of the global client population, if not the majority, resides in less-well-served or even underserved zones. I.e., zones where there aren’t enough servers as it is, and many of them are struggling. In many zones, I see servers constantly phasing in and out of the pool because they are overloaded (the typical sawtooth pattern coming up time and time again). Tightening the screws on those already strained servers has the potential to push more of them out, and thus push already precarious zones further over the edge.

The new implementation takes great care to grandfather existing monitors into its new, somewhat tighter constraint regime. Similarly, care must be taken not to further destabilize already precarious zones by tightening the criteria on already strained servers and thus driving even more servers out of the pool, or keeping them out, which is already too much the case today.

1 Like

The current situation is that most of the monitors give a score of 20. That is incorrect; only some of them should give 20, while the others (with slightly worse connectivity) should fluctuate around a smaller number, like 18. This would give better active monitor selections. The average of the monitor results should count towards the inclusion of a server in the pool, not the best of several samples (currently 3), with the other two samples, which carry valuable information, dropped.

Or just the reverse: overloaded servers would be out of the pool faster, allowing them quicker recovery because the overload would not have had much time to build up. In other words, the overall monitoring would sense a potential overload situation earlier. With three samples, the score only starts to drop when packet loss reaches around 80%. With one sample, the score would start dropping at 10% packet loss. I read what you mentioned as an additional good argument for having only one monitoring sample.
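
As a back-of-the-envelope illustration (assuming each packet is lost independently, which real networks do not guarantee), a check with n queries only fails outright when all n queries are lost, i.e. with probability p^n:

```go
// Back-of-the-envelope sketch (not pool code) of how the number of
// queries per check changes sensitivity to packet loss: with loss
// probability p per packet, a check with n queries sees no reply at
// all with probability p^n.
package main

import (
	"fmt"
	"math"
)

func main() {
	for _, p := range []float64{0.10, 0.50, 0.80} {
		for _, n := range []float64{1, 3} {
			fmt.Printf("loss %.0f%%, %.0f queries/check: whole check fails %.1f%% of the time\n",
				p*100, n, math.Pow(p, n)*100)
		}
	}
}
```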

I have to revise my opinion on that. The code is already in place in the sense that 5 active monitors are selected from many other monitors. That selection would replace picking the one best packet from the 3 samples.

Anyhow, one experiment is worth a million speculations, so I suggest trying this out in the beta system. I guess there is zero risk; no one is using the beta monitoring in production, are they?

You seem to think that there would be no downsides to using only one sample per check. I can think of one: there are legitimate SNTP clients that send a few queries at an interval of a few seconds and then exit. I think this is a valid use case for the pool.

Now consider an NTP server that is configured to allow only one request per 10 seconds from a single IP address. The example SNTP client would only receive one response to its queries, but the monitoring would be oblivious to this restriction if the monitoring used a single query.
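
A toy model of that scenario (made-up names and addresses, not pool code) would look like this:

```go
// Toy sketch of the scenario described above: a server that silently
// drops all but one request per 10 seconds from a given source
// address, and what a burst of client queries sees.
package main

import (
	"fmt"
	"time"
)

// rateLimiter answers at most one query per window per source address.
type rateLimiter struct {
	window   time.Duration
	lastSeen map[string]time.Time
}

func (r *rateLimiter) answer(src string, at time.Time) bool {
	if last, ok := r.lastSeen[src]; ok && at.Sub(last) < r.window {
		return false // silently dropped
	}
	r.lastSeen[src] = at
	return true
}

func main() {
	srv := &rateLimiter{window: 10 * time.Second, lastSeen: map[string]time.Time{}}
	start := time.Now()
	// An SNTP client sending 4 queries a couple of seconds apart gets
	// only the first answer; a monitor sending a single query per check
	// would never observe that restriction.
	for i := 0; i < 4; i++ {
		t := start.Add(time.Duration(2*i) * time.Second)
		fmt.Printf("query %d answered: %v\n", i+1, srv.answer("198.51.100.7", t))
	}
}
```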

I think the current implementation works just fine for our purposes. I have not seen any compelling evidence to think otherwise.

2 Likes

Sure, if the server is in a well-served zone. Elsewhere, the picture is rather different.

I don’t think the system is about scores being “correct” or not. Rather, it should balance quickly dropping bad servers with system stability.

That sounds nice, but I am not sure how relevant it is. A server is either in or out, and as it currently stands, with the steps in the downward direction being as large as they are, the differentiation will easily get lost.

There was once a proposal to use the score to scale the share of clients a server gets, to allow a smooth transition into and out of the pool instead of the current on/off behavior, where “on” gets the full netspeed-derived share. In that context, more differentiated scores would be more useful, in the sense of having some effect, as would finer granularity in the steps up and down. There is already some scaling based on offset values. But how should a missing sample be taken into account? If the step is -5, as now, the logic may become oversensitive. If it is less, it may hinder timely dropping of an unreachable server.
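
To make the discussion concrete, here is a toy version of the kind of scoring recurrence being discussed; the decay factor and step values are assumptions for illustration, not the production formula:

```go
// Toy scoring recurrence (assumed constants, not the pool's actual
// code): the score decays toward a per-check step value, so steady
// success converges near 20 and a missed check pulls it down sharply.
package main

import "fmt"

// nextScore applies one check result to the running score.
func nextScore(score, step float64) float64 {
	return score*0.95 + step // assumed decay factor and step handling
}

func main() {
	score := 20.0
	steps := []float64{1, 1, -5, -5, 1, 1, 1} // 1 = good check, -5 = missed check
	for i, s := range steps {
		score = nextScore(score, s)
		fmt.Printf("check %d (step %+.0f): score %.2f\n", i+1, s, score)
	}
}
```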

I am not sure what value this information would have in this context. Generally, it is interesting, fully agreed. But as mentioned above, I think the “expressiveness” of that information, i.e., how differentiated the response to such differentiated information could be, is too limited, at least when weighing the effort to make this work against the potential benefit.

Again, that would require tight balancing of the control loop, i.e., how fast to drop, how fast to get back in, etc. I don’t think using that as a control loop for managing load on overloaded servers would work well; if anything, it could destabilize the zone further, because the control currently is on/off only. Any transition between low traffic (out of the pool) and full load (in the pool) is only modulated by the delay it takes for the IP address to be pushed to the DNS servers and for clients to get fresh responses from the DNS servers including those IP addresses. In the opposite direction, there is the additional delay of the intervals between monitoring samples, the aggregate score across multiple monitors dropping sufficiently, then the DNS entries timing out, and clients (mostly SNTP-type clients) forgetting the IP address. From my study of underserved zones, that does not provide a fast enough feedback loop to achieve the fine-grained load control desired.

On the other hand, it destabilizes the zone further, because when a server drops, the other servers get hit by the load that server sheds. That can be quite significant, kicking other servers out in turn, and the entire zone remains in constant oscillation. I see that in multiple zones. Instead, the goal should be stabilization.

It would be different if the onslaught of load were modulated, as per the earlier proposal to have the share increase gradually, scaled by the score, rather than the current on/off.

Perhaps, but the overall control loop is too slow/too coarse to make effective use of that.

I don’t think such an interpretation of what I wrote would be valid.

I generally agree, except that this is more than speculation on my part. While I obviously cannot predict the exact behavior under such a change, experience gained over many years of looking at underserved zones, also practically (i.e., having servers there and observing them), suggests that a useful outcome of such an experiment is somewhat unlikely. Especially since the overall resources to get anything done in this project are spread so thin, they should be focused on the most promising aspects, where the most benefit can be gained for the effort spent. And based on my experience, I don’t think trying to leverage the score system to manage load in underserved zones is a good enough use of the few resources available.

But that is part of the problem. Just as the issues of underserved zones are underrepresented in this forum, so they are in the beta system. I.e., I doubt the beta system is a good enough reflection of the actual system that any insights could be transferred risk-free to the production system where such fundamental changes are concerned. And I think collecting and using only one sample per cycle is a more fundamental change than the current change of adding more monitors.

On the other hand, there are other aspects that seem more promising, i.e., more effective and definitely better value for money. E.g., smoothly moving in and out of the pool as a function of score, or lower netspeed values (I hope I’ll be able to post more on that soon).

Or other topics altogether, such as IPv6 or vendor zones, or addressing underserved zones more fundamentally, as had already been planned at some point, and not only because of the terrible user experience the current situation gives to clients and server operators alike. Sure, playing in the beta system only is somewhat risk-free, but it still takes resources away from other topics that promise a much better return on investment.

Such a server would be seen the same way whether one sample or three samples are sent: only one reply packet would be seen either way.

The change would be one line only: changing a constant from 3 to 1. That takes less time than we are spending on this conversation.

1 Like

Well, yes. I don’t think you got my point. In the scenario I presented the monitor would think everything’s fine, even if the monitored NTP server did not respond to all the valid queries.

The current monitoring with multiple queries will notice this anomaly and decrease the score of the NTP pool server accordingly.

No, because it impacts how the entire system behaves, so many more places would need to be touched and tweaked to make this work. Maybe the initial change is simple, but the expected effort is in dealing with the fallout, i.e., understanding how the system behaves under the change and tuning the system to adapt to it.

Not really. In the current code, if there is only one reply packet out of the three packets sent, the server is considered fine. The current monitoring would not notice that only one packet is allowed per 10 seconds.

That is why we have to test it in the beta. My guess is that no further tweaking is required, but let us see.

If it were some playground with no consequences, I’d agree. But while the beta system is a beta system, I don’t think we have the resources in the project to just break things (which is my expectation of the effect of that change, without further tuning at least, or even just understanding what is going on).

And even so, again, I still think the effort would be much better spent on other things, some of them similarly simple, but with more predictable, and predictably more positive outcome.

I guess I will need to examine the “dropped packets” case more closely. I may get back to this later on.

But if the NTP server responds with a rate limit message (which is more likely with multiple queries than with a single query), the monitoring will notice that and decrease the score accordingly. I’ll kindly provide you with some evidence: search for “RATE” in that CSV file.
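
For context, a rate-limit reply is easy to recognize: an NTP kiss-o’-death packet carries stratum 0 and the ASCII kiss code “RATE” in the reference ID field. A minimal sketch of such a check (hypothetical helper, not the pool monitor’s code):

```go
// Sketch of recognizing an explicit rate-limit reply: a kiss-o'-death
// packet has stratum 0 and the kiss code "RATE" in the reference ID.
package main

import "fmt"

// isRateKiss reports whether a reply is a RATE kiss-o'-death packet.
func isRateKiss(stratum uint8, refID [4]byte) bool {
	return stratum == 0 && string(refID[:]) == "RATE"
}

func main() {
	reply := struct {
		stratum uint8
		refID   [4]byte
	}{stratum: 0, refID: [4]byte{'R', 'A', 'T', 'E'}}

	if isRateKiss(reply.stratum, reply.refID) {
		fmt.Println("server asked us to back off; record it and lower the score")
	}
}
```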

I’ll also point out that the current monitoring usually sends four queries per check, not three. I don’t know where you got the three from. Admittedly this 3 vs. 4 difference does not really matter in this discussion, but let’s keep the facts straight anyway.

That is interesting.

Well, from the packet capture of my test monitor, I see three repeating queries to the same IP.

@avij and I were referring to the current production system, where most monitors send four queries per interval, but some also send just three (some/one of Ask’s own servers, so I figured he might have been trying that out on a small scale before, and for quite a while now).

I just verified this situation on my production monitor server. In one particular test case the monitor sent four queries and received two responses back (to the 2nd and 3rd queries). The result was recorded in the CSV log as “network: i/o timeout”, with an appropriate score decrease. In particular, the monitor did not cherry-pick the result from the two valid responses but instead chose to return the error.

1 Like

@apuls, @davehart if you send me the IPs that were giving poor suggestions (and what you expected) in an email or a DM, I can figure out why it didn’t work.

The system gets the location from the MaxMind data, and then I have a little service that makes a list of nearby airports from https://ourairports.com data. It tries to balance giving you big airports with also suggesting local ones if that makes more sense.

I don’t remember adding a feature for manually specifying the location code, so either that’s a bug or I forgot I did that. :slight_smile:

1 Like