Beta monitoring system testing

Ohh monitor overview now with statistics :smiley: :+1:

1 Like

Yes! beta.4 has been released.

I also fixed the bug @avij noticed where a mix of timeouts and good responses was counted as a timeout, as well as some confusing log messages when the API service was down (also noticed by @avij). The client now waits until both the IPv4 and IPv6 IPs have been recorded before showing the user the registration URL.

The website also has options to configure the account limits for monitors (@apuls and I have access to change them).

Most of the fixes in beta.4 were for the “selector” that adjusts which monitors are active for each server. It now filters out monitors from the same account as the server so they won’t be associated, avoids having multiple monitors on the same network monitoring one server, and generally limits how many monitors from one account can monitor a server.
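Purely as an illustration of those rules (this is not the actual selector code; the types and the per-account limit below are made up), a filter along these lines could look like:

```go
package selector

// Monitor, Server, and maxPerAccount are hypothetical stand-ins for
// illustration; they are not the real ntppool types or limits.
type Monitor struct {
	ID        int
	AccountID int
	Network   string // e.g. the monitor's /24 or /48 prefix
}

type Server struct {
	IP        string
	AccountID int
}

const maxPerAccount = 2 // made-up per-account cap

// eligibleMonitors drops monitors from the server's own account, keeps at
// most one monitor per network, and caps how many monitors from a single
// account can be associated with one server.
func eligibleMonitors(server Server, candidates []Monitor) []Monitor {
	seenNetwork := map[string]bool{}
	perAccount := map[int]int{}
	var out []Monitor
	for _, m := range candidates {
		if m.AccountID == server.AccountID {
			continue // same account as the monitored server
		}
		if seenNetwork[m.Network] {
			continue // already have a monitor on this network
		}
		if perAccount[m.AccountID] >= maxPerAccount {
			continue // too many monitors from this account already
		}
		seenNetwork[m.Network] = true
		perAccount[m.AccountID]++
		out = append(out, m)
	}
	return out
}
```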

2 Likes

Are there new FreeBSD tar files? Is there a way to find them without asking?

I thought the previous timeout / good response behaviour was a feature, not a bug :slight_smile:
After this change the monitored NTP servers are allowed to have up to 75% packet loss (or 66% depending on configuration) without a penalty. I’m not sure if that’s a good thing. Maybe the new score could be calculated from a combination of the good responses and the number of packets lost.
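For illustration only (this is not the pool’s actual scoring code, and the constants are invented), a combined step could scale the usual +1 by the loss ratio:

```go
package scoring

// scoreStep sketches a per-run score adjustment that mixes good responses
// and lost packets, instead of treating "at least one good response" the
// same as "all responses good". The constants are illustrative.
func scoreStep(good, lost int) float64 {
	total := good + lost
	if total == 0 {
		return -5 // no data at all: treat like a timeout step
	}
	lossRatio := float64(lost) / float64(total)
	// Full credit (+1) only with no loss, sliding toward a timeout-like
	// penalty as the loss ratio approaches 1.
	return 1 - 6*lossRatio
}
```

With one good response out of four queries that would give roughly -3.5 instead of the full +1, so heavy loss would still hurt the score.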

https://www.ntppool.org/ down? I get a 503.

UPDATE: NTP Pool System Status – Database outage

1 Like

@john1 without asking: only if you can figure out the build number, I think.

The latest amd64 release build is at https://builds.ntppool.dev/ntppool-agent/builds/release/506/ntppool-agent_4.0.0-beta.4_freebsd_amd64.tar.gz

There are other URLs here:

https://builds.ntppool.dev/ntppool-agent/builds/release/506/checksums.txt

I’ll add directory indexes in the web server for the release path so it’s easier to discover (now done: Index of /ntppool-agent/builds/release).

1 Like

Yeah, I’m not sure it’s a good thing either! But since it was a bug, I decided to fix it. (The behavior was inconsistent depending on which NTP queries got lost).

I appreciate the point made earlier in the thread that the new monitoring system is less likely to penalize servers that just happen to have a bumpy connection to an individual monitoring server. Some of the current design choices came from my frustration with all the threads complaining about the monitoring system being unreliable when it was actually just flagging those problematic connections.

One reason we do multiple queries is to detect servers with overly aggressive rate limits.

Since the system has multiple monitors (and associates monitors that get good results with specific servers), some options for varying monitoring patterns won’t work anymore.

Maybe the client should do a random number of samples from 1 to X instead of always doing X samples. Or randomly choose between a single sample and X samples. The penalty for a timeout is pretty steep, so that would still flag servers with significant packet loss. (And if the loss is just to an individual monitor, the system will replace that monitor).
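A minimal sketch of that idea, assuming a hypothetical sampleCount helper (the real agent is structured differently):

```go
package sampler

import "math/rand"

// sampleCount picks how many NTP queries to send in one monitoring run.
// Half the time it sends a single query, otherwise the configured maximum,
// so aggressive rate limiting still gets exercised on some runs while a
// lucky single response can't hide heavy packet loss forever.
// This is one possible reading of the idea above, not the agent's code.
func sampleCount(max int) int {
	if rand.Intn(2) == 0 {
		return 1
	}
	return max
}

// sampleCountUniform is the other variant: a uniformly random count
// between 1 and max.
func sampleCountUniform(max int) int {
	return 1 + rand.Intn(max)
}
```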

Suggestions and input are very welcome. As I think was discussed above (or in another thread), the monitoring system tries to behave like an average NTP client would. What we’re testing for is “does this server work well for a typical NTP client.” (You could argue the premise is flawed; despite how straightforward the low-level protocol is, clients have wildly different behaviors).

2 Likes

On my manage/monitors page, there is a blue block with statistics:
image

I only have one monitor, zakim1-yfhw4a, and its statistics lower down are:
image

So it looks like the blue, which I assume is the total, is double the monitor’s value. zakim1-yfhw4a does have both an IPv4 and IPv6 address registered. Does that cause the numbers to be doubled?

Thanks, this is exactly what I was recently referring to, in other words:

Sorry, I did not follow the recent code changes. Does that mean that if a single reply packet is lost in one monitoring run (3 or 4 or however many query packets, sent one after another with small time spacing), this shows up as a decrease in the NTP server’s score?

My opinion is that instead of trying to simulate any particular kind of NTP client, the objective of the monitoring should be to get the most precise picture possible of the NTP servers’ state, while sticking to the protocol specification.

The target is not to simulate a client as such. The target is to assess a server’s performance the way a client would see it, so as to assess its suitability for a client to get time off that server. To some extent, that obviously requires it to behave a bit like a client, but not fully simulate it, and certainly not just for the sake of it.

I still think that to some extent, that is beside the point of the monitoring. Again, the monitoring is about assessing how suitable a server is for clients to get time from it, not detailed performance monitoring (e.g., it isn’t even collecting detailed low-level information such as dispersion). And clients are rather resilient to some disturbances, which needs to be taken into account somehow.

Now whether that is done “early” in the assessment process, i.e. as now, shortly after gathering the samples, or later in the process, I don’t mind. I tend to agree that gathering more detailed information first, and then “smoothing” it somehow later to make it more realistic (in the sense above, i.e., suitability for clients to get time) might be preferable. I just have the impression that that would require touching more sensitive parts of the system, and isn’t easily reconcilable with the current architecture. And that it would further bias the system to primarily cater to servers in better-served parts of the world, where detailed nuances of performance would even be visible, i.e., evolve to double as a detailed performance monitoring system in those places. And continue to ignore, or even worsen, the conditions in less well-served parts of the world, where a client can be happy, and often is or has to be, to get a response to only every nth request, or with higher offset or larger delay, rather than not getting any usable service at all.

Can you elaborate on how the protocol specification would prevent more summarily taking a view as the monitors currently do, or as typical NTP clients do? Or how the current behavior violates the specification?

All in all, I agree that 75% packet loss should not be treated the same as 0% packet loss. I have a lot of sympathy for the proposal to reflect that difference via appropriate scaling/weighting of the scoring steps.

I understand the frustration. And I think that the complaints you refer to are a bit misleading, or stem from some misunderstandings. As has often been pointed out in the context of many such complaints, e.g. by @avij, it is not the monitoring itself, or the monitors, or their placement, that is causing the issues. Rather, the monitoring, as intended, only reflects what a typical client would see. And unfortunately, in many places of the world, that view isn’t good.

Thus, the way to head off those complaints perhaps isn’t tinkering with the monitoring, but rather addressing the underlying, well-known, and acknowledged bad situation in large parts of the world.

At one point, you mentioned having specific plans to address those issues (with some questions still needing further thought). Not sure where you stand on following up on those plans/ideas; it would be interesting to hear your latest thoughts on them.

The aim of the whole NTP pool is to provide acceptable NTP service even in those particular places of the world. However, these conditions should not influence the design and implementation of the NTP monitoring; rather, they should influence the way the monitoring result is used when selecting NTP servers for the NTP clients in those areas.

I see this not as adding more restrictions, but rather as a relief for the monitoring system design and implementation. For example, we should not feel obliged to take broken NTP client implementation(s) into account in the monitoring code.

Ok, then maybe we are more on the same page than it might have seemed so far. I.e., the discussion should not focus on the aspect of number of samples in isolation, but would need a broader scope.

Similar to what I understood one of Dave’s points to be, the concern was more about how wide the scope of the activity could be, and how much the current code architecture can be impacted.

So far, I had the impression the current changes were primarily about diversifying the sample collection process, i.e., getting more monitors online, and more easily, and evolving the monitor selection.

All of those, especially the last, obviously also impact the scoring. But the actual scoring algorithm was not changed. The changes you propose would start to affect that part of the system as well.

Ok, fully agree. Still not sure though how sticking to the specification relates specifically to the question of how many samples should be taken (which I had understood to be the main issue of the discussion so far).

My statement wasn’t related to the number of samples. It was rather related to the choice of NTP client. For example, relative to normal NTP clients (ntpd, chronyd), SNTP clients may provide bad quality time, but the NTP pool cannot help with that.

A few random data points. Unfortunately I don’t have a lab with all the possible client software and versions, but ntpdate in CentOS 7 seems to send 4 queries at two-second intervals. C7 has been EOL for a long time and can probably be mostly disregarded. chrony on Rocky Linux 9 with the iburst option seems to send five queries at two-second intervals at startup (in this example case, YMMV). On the other hand, clients that use iburst but do not receive responses to some of the queries will typically tolerate the situation.

I do think that we should still check for overly aggressive rate limiting. The cleanest approach might be that the monitors report to the pool monitor management server the number of queries sent, the number of good responses received and the number of error/timeout responses. Or just the number of good responses, with the error count deduced from the number of good responses and project configuration. This would enable making better score calculations later on.
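One possible shape for such a report, with hypothetical field names just to illustrate the idea:

```go
package monitor

import "time"

// ProbeReport is a hypothetical per-check report a monitor could send to
// the management server, so the score calculation can later weigh good
// responses against losses however the project configuration says.
type ProbeReport struct {
	ServerIP      string        `json:"server_ip"`
	QueriesSent   int           `json:"queries_sent"`
	GoodResponses int           `json:"good_responses"`
	Timeouts      int           `json:"timeouts"` // or derived from the two fields above
	BestOffset    time.Duration `json:"best_offset"`
	CheckedAt     time.Time     `json:"checked_at"`
}
```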

1 Like

Ah, ok, sorry I misunderstood.

Fully agree. My concern was a bit that getting only one sample would make the system as sensitive to disturbances as a simplistic SNTP client, while still not making up for the shortcomings of such clients, even with the most vigorous vetting of servers.

@john1
I need the tar.gz file for a special case. To get the build number, I have the agent installed on a Debian system too.
During the update process you will see the build number and can use it in the URL :slight_smile:

Forget what I wrote - just saw that Ask set up a directory listing :slight_smile:
Thank you @Ask :+1:

1 Like

I released beta.5 of the new monitor. More testers are still welcome!

Send a message to me and @apuls with the account information if you don’t have access to add a beta monitor at pool.ntp.org, and we can get you set up and added to the Discourse category for monitor operators.

1 Like

Finally getting around to try and test, just been so busy here, sorry all! But I’m getting the following error:

[root@web02 ~]# sudo -u ntpmon ntppool-agent setup -e test -a 2jp2meg
time=2025-07-07T12:10:24.593-07:00 level=INFO msg="using hostname for registration" env=test hostname=web02.versadns.com
ntppool-agent: error: Post "https://beta-api.ntppool.dev/monitor/api/registration": dial tcp6 [2a04:4e42::311]:443: connect: network is unreachable
[root@web02 ~]#

1 Like

@csweeney05 do you have IPv6 configured? Maybe the agent isn’t actually checking whether there’s IPv6 at all.

You can run it with --no-ipv6 and it won’t try the IPv6 interface.

I made it fail if a protocol it’s trying doesn’t work, to make the problem explicit, but as mentioned, maybe I’m not even checking whether the system has an IP address for that protocol.
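A rough sketch of the kind of pre-flight check that could be added (not the agent’s actual code) would be to scan the local interface addresses for a global unicast IPv6 address before attempting IPv6 registration:

```go
package netcheck

import "net"

// hasGlobalIPv6 reports whether any local interface has a global unicast
// (non-ULA) IPv6 address. Just a sketch of the check discussed above.
func hasGlobalIPv6() (bool, error) {
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		return false, err
	}
	for _, a := range addrs {
		ipnet, ok := a.(*net.IPNet)
		if !ok {
			continue
		}
		ip := ipnet.IP
		if ip.To4() != nil {
			continue // skip IPv4 addresses
		}
		if ip.IsGlobalUnicast() && !ip.IsPrivate() {
			return true, nil
		}
	}
	return false, nil
}
```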

2 Likes