Drop in number of checks per minute by monitoring probes

I note that the number of monitoring probes seems to have dropped slightly since the early hours of April 4. At the same time, the number of checks per minute performed by the monitoring probes seems to have dropped to about half its previous rate, with very slight recovery since then. This change seems to roughly correlate with the activation of the user information download and user deletion features, at least timewise.

Was this change intended, and if so, just out of curiosity, what is the background of that change?

Thanks!

Edit: I now realize that the number of probes hasn’t necessarily decreased, but there now seem to be more “active” monitors than before, and thus fewer monitors in “testing” mode, which triggered my initial impression of fewer monitors. I need to observe a bit longer to get a better impression of the apparent changes.

Edit 2: Of the 15 IPv6-enabled monitors listed on a server’s details page, only seven seem to be actually providing measurement results at this time. The number was even lower initially. So I guess the monitors may simply need software updates to be able to provide measurement results to the backend infrastructure. As those updates are applied over time, I guess more monitors will come back online, and checks per minute will go up again as well.

Hi @PoolMUC – you are right! From https://status.ntppool.org/

I did a minor upgrade of the API last weekend and some of the monitoring clients got updated as well, but that wasn’t actually the cause of this. Commits · ntppool/monitor · GitHub

Part of the upgrade was upgrading the MQTT library used in the client.

The monitors connect to an HTTP API (over mTLS) and to an MQTT server (also mTLS, plus a JWT token obtained from the HTTP API). The checks and returned data go over HTTP (gRPC to be specific), but the “instant” NTP checks that you can do on the website, and that the system does when adding a new server, are requested over MQTT.
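The split between the two channels can be summarized as a small lookup table. This is only an illustrative sketch: the operation names are hypothetical labels chosen for this example, not identifiers from the actual monitor client.

```python
from enum import Enum


class Channel(Enum):
    HTTP_GRPC = "HTTP API (gRPC over mTLS)"
    MQTT = "MQTT (mTLS plus JWT)"


# Operation names are hypothetical labels for this sketch,
# not identifiers from the real client or API.
CHANNEL_FOR = {
    "fetch_assigned_checks": Channel.HTTP_GRPC,
    "submit_check_results": Channel.HTTP_GRPC,
    "obtain_mqtt_jwt": Channel.HTTP_GRPC,  # the JWT for MQTT comes from the HTTP API
    "instant_check_from_website": Channel.MQTT,
    "instant_check_for_new_server": Channel.MQTT,
}


def operations_on(channel):
    """Return the sketch's operation labels carried by the given channel."""
    return sorted(op for op, ch in CHANNEL_FOR.items() if ch is channel)


if __name__ == "__main__":
    for channel in Channel:
        print(channel.value, "->", operations_on(channel))
```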

The new client needed some fixes that I worked through, and then called it a night.

What I didn’t realize then was that an upgrade of the MQTT server (unmentioned in the release notes, as best I can tell) changed part of the protocol for connecting clients in a way that made the client crash. I’d fixed this in the new monitor client version, but I thought it was related to the new client library, so I missed that all the old clients were crashing.

Fortunately I could fix it with a server setting, so they’re all back online now.

The system has a flag for “when did this monitor last check in”, but that got set in an API call before the client would crash. (I also monitor how many monitors are online with an MQTT connection, and those stats were oddly low, but I was wrapping up Sunday and explained it away in my head somehow after testing that the MQTT-dependent features on the site were working okay.)
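To illustrate why the check-in flag did not reveal the problem, here is a minimal simulation. It assumes the client checks in over HTTP first and then crashes at the MQTT connect step, and that a crashed client gets restarted periodically; the restart loop, the field names, and the five-minute threshold are all assumptions made for this sketch, not details of the real system.

```python
from dataclasses import dataclass


@dataclass
class Monitor:
    name: str
    last_checkin: float = 0.0     # hypothetical field: refreshed by an HTTP API call
    mqtt_connected: bool = False  # True only while the MQTT session is up


def start(monitor, now, mqtt_connect_ok):
    """Simulate one client start: HTTP check-in first, then the MQTT connect."""
    monitor.last_checkin = now          # step 1 succeeds and refreshes the flag
    monitor.mqtt_connected = mqtt_connect_ok
    return mqtt_connect_ok              # False means the client crashed at step 2


def looks_alive_by_checkin(monitor, now, max_age=300.0):
    """Liveness as judged only by the age of the last check-in (assumed threshold)."""
    return now - monitor.last_checkin <= max_age


def online_by_mqtt(monitors):
    """Liveness as judged by counting established MQTT connections."""
    return sum(m.mqtt_connected for m in monitors)


if __name__ == "__main__":
    old = Monitor("old-client")
    new = Monitor("new-client")
    for t in (0.0, 60.0, 120.0):        # assumed restart loop for the crashing client
        start(old, t, mqtt_connect_ok=False)
    start(new, 120.0, mqtt_connect_ok=True)
    print([looks_alive_by_checkin(m, 130.0) for m in (old, new)])
    print(online_by_mqtt([old, new]))
```

In this sketch both clients pass the check-in test, while only the MQTT connection count shows that one of them never gets past startup.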

Anyway – the system re-allocated checks as it could to the running monitors (first about 9, then 13 as more got upgraded), which is why things were working okay despite having half the monitors that are usually running. (IPv6 monitors went from 15 to 6.)

Thanks for pointing it out! I wouldn’t have noticed until next weekend probably.

Hello @ask,

Thanks for the fix as well as the extensive explanation, always interesting to learn more about the inner workings of the pool!

This also seems to explain another minor quirk I observed when trying to add a new server, where it would initially successfully access the server and show the “is this really your server” page, but then upon confirming the addition of the server would report that it wasn’t reachable anymore, just seconds later.

Here’s a graph of “tests per minute per monitor” showing how the system moved the active monitors around (and how it’s back to normal now).

For each server, the system prioritizes monitors that have better results and that are closer.

The code for this is in monitor/scorer/cmd/selector.go at main · ntppool/monitor · GitHub with the “how to pick a monitor” query in monitor/query.sql at 29c316960e7027eb4cd02ef20fee6db5eaf17b4e · ntppool/monitor · GitHub
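As a rough illustration of that kind of prioritization (not a reproduction of the linked selector.go or query.sql), the sketch below picks monitors for one server from whichever candidates are currently online. The field names, the score scale, the distance unit, and the exact ordering (score first, distance as tie-break) are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    online: bool
    score: float        # hypothetical scale: higher means better recent results
    distance_km: float  # hypothetical measure of how close the monitor is


def pick_monitors(candidates, wanted):
    """Return up to `wanted` online monitors: best score first, nearest as tie-break."""
    usable = [c for c in candidates if c.online]
    usable.sort(key=lambda c: (-c.score, c.distance_km))
    return [c.name for c in usable[:wanted]]


if __name__ == "__main__":
    candidates = [
        Candidate("a", True, 20.0, 500.0),
        Candidate("b", False, 20.0, 100.0),  # offline, e.g. a crashed client
        Candidate("c", True, 20.0, 200.0),
        Candidate("d", True, 15.0, 50.0),
    ]
    print(pick_monitors(candidates, wanted=2))
```

With monitor "b" offline, the selection falls back to the remaining online candidates; once "b" is back, it would be picked again because it ties on score and is nearer. That mirrors, in a very simplified way, how checks moved to the running monitors and then back.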
