Monitor belgg1-19sfa9p

Is anyone seeing the same? belgg1-19sfa9p keeps giving a low score (IPv6) and has never reached 20.0, while other monitors easily reach 20.0 and stay stable.

There may be NTP servers monitored over IPv6 by belgg1-19sfa9p that never reach a score of 20.0, but not all NTP servers are like that; for example, the servers I maintain:
https://www.ntppool.org/a/vm9onk3s5v9mcxq5dc4

That’s the point of multiple monitors. Not everyone is going to have a perfect path to your server, especially over IPv6. That was the problem with the old system for many: their server was fine, but the limited monitoring made it seem to the pool like something was wrong with their server.

2 Likes

On average that monitor actually has some of the highest scores!

In the future I hope we can use this data to better figure out where on the internet NTP packets are being dropped (basically the “industrial scale” version of the work @stevesommars has been doing on and off).

3 Likes

Thank you, Ask. Those are my monitors; they all end in ‘19sfa9p’.

I see a complaint about one of my monitors; that is exactly the reason I complained in the past.

Because we have more monitors now, it doesn’t matter if my monitor scores poorly for some people and well for others:
the NTP server will still be put to use, because it gets a good overall score from the other monitors.

It also reveals that not all internet peers have good and stable connections.

It reaches my other servers with a perfect 20 (in a datacenter, not in Belgium): pool.ntp.org: Statistics for 2001:41d0:203:654d::8f37:5630

This IPv6 monitor is a Raspberry Pi 4 serving IPv6 only, here in my home on a VDSL 100/30 Mbit connection, and NTP UDP traffic is prioritized as real-time in the router.

This is exactly why Ask set up the new system: so nobody depends on a single monitor.

1 Like

Seems my ‘poor’ monitor has no problem with this server of yours :slight_smile:

Nice 20 from what I see.

1 Like

Thank you. :blush:

I see a strange thing with the monitor selection (not specific to your monitor).
Two NTP servers are sitting on the same network.
https://www.ntppool.org/scores/2a00:7580:60:211::52: it has 5 active monitors, all scoring 20, and 4 testing monitors.

https://www.ntppool.org/scores/2a00:7580:60:211::46: it has only 4 active monitors, most of them selected from the other server’s testing monitors, which is a somewhat suboptimal selection, and the score is not always 20. This NTP server has no testing monitors.

Can someone explain this discrepancy?

Monitors that score well on your server are selected as active.
As far as I know, monitors that score poorly move to the lower section and do not count toward your median score.
I also believe they are given less load if they keep scoring your server badly.

It’s impossible to score well on all monitors; that is the point of this new system.
I would not worry too much; only worry if your NTP server drops low enough to be marked as bad.

This new system is in place precisely to prevent that from happening when your NTP server works fine.
In the past we had one monitor, and it marked many servers as bad when they were not; the monitor simply had timeouts or couldn’t reach them.

Ask has built a whole new algorithm to keep monitors from being overloaded and to keep them from hitting your server very hard once it has already passed as good.

The active monitors determine your score; the testing monitors just test from time to time without affecting your score. If the testing monitors score better than the active ones, they replace them.

In short, the system is designed to keep your server from being marked as bad when that isn’t warranted.
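
To make that concrete, here is a minimal Go sketch of the idea as I understand it. The type and function names (Monitor, medianScore) and the sample scores are my own invention for illustration; the pool’s actual code surely differs in detail.

```go
package main

import (
	"fmt"
	"sort"
)

// Monitor is a stand-in for a pool monitor and the score it
// currently gives one NTP server.
type Monitor struct {
	Name   string
	Score  float64
	Active bool
}

// medianScore takes the median of the ACTIVE monitors' scores only;
// testing monitors measure too, but don't count here.
func medianScore(monitors []Monitor) float64 {
	var scores []float64
	for _, m := range monitors {
		if m.Active {
			scores = append(scores, m.Score)
		}
	}
	if len(scores) == 0 {
		return 0
	}
	sort.Float64s(scores)
	n := len(scores)
	if n%2 == 1 {
		return scores[n/2]
	}
	return (scores[n/2-1] + scores[n/2]) / 2
}

func main() {
	monitors := []Monitor{
		{"usewr1-example", 20.0, true}, // made-up monitor names and scores
		{"defra1-example", 20.0, true},
		{"sgsin2-1a6a7hp", 14.2, true},  // weak active monitor
		{"belgg1-19sfa9p", 19.8, false}, // testing monitor scoring better
	}
	fmt.Printf("server score (median of active): %.1f\n", medianScore(monitors))
	// A testing monitor that consistently outscores the worst active
	// monitor would be a candidate to replace it as active.
}
```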

1 Like

Thanks for the explanation. However, if there is no testing monitor, there is no chance that any of the active monitors will eventually be replaced with a better one. All of the non-active monitors should become testing monitors from time to time, to give them a chance to replace a worse-quality active monitor, if there is any.

In my case, I am quite sure that your monitor belgg1-19sfa9p would be better than sgsin2-1a6a7hp.

The non-active monitors ARE the testing monitors.
They do test your server from time to time; they are just not active in the scoring mechanism.

When the ‘active’ monitors start to fail or score lower, the system checks the ‘non-active’ ones and promotes one or more of those that score well/better to become the new ‘active’ monitor(s) for calculating your score.

It is not the case that the ‘non-active’ monitors do nothing; they monitor, but their results are not accepted for your score because they often fail or consistently score lower.

ALL listed monitors measure your system, but only the well-scoring monitors determine the score that decides whether your server is listed in the pool or not.

Monitors that can’t reach your system for some time will be delisted and not shown at all.

This is how I understand Ask has programmed it all. We tested for more than a year before he replaced the old, flawed system with this (almost) perfect testing system.

1 Like

I understand that the new monitoring system should work like that. However, there may be a bug in the implementation. I see that in the meantime, testing monitors have started to show up for 2a00:7580:60:211::46.

Yes, they show up; they managed to measure your server.

https://www.ntppool.org/scores/2a00:7580:60:211::46

However, it’s still hard to reach.

The question is: why is your server so hard to reach for so many monitors?

Some reach it, but most do not. You should wonder why that is.

Why not ask @stevesommars to point a monitor at your system himself? Maybe he can work out what is wrong.

From all I see, your server isn’t easy to reach, or isn’t responding.

It could be a firewall, or some setting you have made in the config.

Trust me, the monitoring system isn’t to blame; I’m 100% sure of that in your case.

Two of my monitors showed a near-100% response rate for the month of May, as did monsjc2.
The polling rates from monsjc2 were variable though; see below, beginning May 13.
Both servers responded at about 100% with good timestamps. Is this expected behavior?

polls/day  date        server
57 2023-05-01 2a00:7580:60:211::46
54 2023-05-01 2a00:7580:60:211::52
69 2023-05-02 2a00:7580:60:211::46
72 2023-05-02 2a00:7580:60:211::52
72 2023-05-03 2a00:7580:60:211::46
72 2023-05-03 2a00:7580:60:211::52
72 2023-05-04 2a00:7580:60:211::46
69 2023-05-04 2a00:7580:60:211::52
69 2023-05-05 2a00:7580:60:211::46
72 2023-05-05 2a00:7580:60:211::52
72 2023-05-06 2a00:7580:60:211::46
72 2023-05-06 2a00:7580:60:211::52
69 2023-05-07 2a00:7580:60:211::46
69 2023-05-07 2a00:7580:60:211::52
72 2023-05-08 2a00:7580:60:211::46
69 2023-05-08 2a00:7580:60:211::52
72 2023-05-09 2a00:7580:60:211::46
72 2023-05-09 2a00:7580:60:211::52
69 2023-05-10 2a00:7580:60:211::46
72 2023-05-10 2a00:7580:60:211::52
72 2023-05-11 2a00:7580:60:211::46
72 2023-05-11 2a00:7580:60:211::52
72 2023-05-12 2a00:7580:60:211::46
69 2023-05-12 2a00:7580:60:211::52
210 2023-05-13 2a00:7580:60:211::46
72 2023-05-13 2a00:7580:60:211::52
465 2023-05-14 2a00:7580:60:211::46
72 2023-05-14 2a00:7580:60:211::52
462 2023-05-15 2a00:7580:60:211::46
72 2023-05-15 2a00:7580:60:211::52
464 2023-05-16 2a00:7580:60:211::46
69 2023-05-16 2a00:7580:60:211::52
462 2023-05-17 2a00:7580:60:211::46
72 2023-05-17 2a00:7580:60:211::52
246 2023-05-18 2a00:7580:60:211::46
39 2023-05-18 2a00:7580:60:211::52
33 2023-05-19 2a00:7580:60:211::46
18 2023-05-19 2a00:7580:60:211::52
447 2023-05-20 2a00:7580:60:211::46
72 2023-05-20 2a00:7580:60:211::52
459 2023-05-21 2a00:7580:60:211::46
69 2023-05-21 2a00:7580:60:211::52
468 2023-05-22 2a00:7580:60:211::46
72 2023-05-22 2a00:7580:60:211::52
456 2023-05-23 2a00:7580:60:211::46
69 2023-05-23 2a00:7580:60:211::52

1 Like

On May 13, the switch that directly connects 2a00:7580:60:211::46 crashed, so the IP became unreachable for about two hours. However, more than a week has elapsed since then, and I would expect the monitoring configuration to return to what it was before the switch crash.

Willem’s servers are tested by MY monitor nearly perfectly:

https://www.ntppool.org/a/bb6a3cf3e3d79ed7dbc913a39399343c

I cannot answer this one.

https://www.ntppool.org/a/vm9onk3s5v9mcxq5dc4

But checking, the path looks poor:

traceroute 2a00:7580:60:211::46
traceroute to 2a00:7580:60:211::46 (2a00:7580:60:211::46), 30 hops max, 80 byte packets
 1  fritz.box (2a02:578:440e:0:464e:6dff:fefa:37c2)  0.833 ms  1.081 ms  1.298 ms
 2  2a02:578:1000:5::1 (2a02:578:1000:5::1)  35.554 ms  36.446 ms  36.485 ms
 3  2a02:578:1:4c::1 (2a02:578:1:4c::1)  23.726 ms  25.235 ms  23.398 ms
 4  router01.bruix.be.edpnet.net (2a02:578:1:46::1)  23.355 ms router02.bruix.be.edpnet.net (2a02:578:1:1c::2)  23.577 ms router01.bruix.be.edpnet.net (2a02:578:1:46::1)  23.712 ms
 5  * * *
 6  2001:730:2301:7::d52a:a235 (2001:730:2301:7::d52a:a235)  35.331 ms  18.014 ms  18.099 ms
 7  2001:730:2300::5474:80a1 (2001:730:2300::5474:80a1)  41.732 ms  43.669 ms  44.047 ms
 8  * * *
 9  de-fra11b-rc1-lo0-0.v6.aorta.net (2001:730:2d00::5474:80d4)  47.410 ms  115.688 ms  122.360 ms
10  ch-zrh03a-rc2-lo0-0.v6.aorta.net (2001:730:2c00::5474:801d)  124.620 ms  124.698 ms  127.014 ms
11  2001:730:2700::5474:806f (2001:730:2700::5474:806f)  127.183 ms  127.091 ms  126.971 ms
12  * * *
13  2001:1700:2300:1::2 (2001:1700:2300:1::2)  134.665 ms  134.889 ms  135.689 ms
14  * * *
15  * * *
16  * * *
17  * * *
18  * * *
19  * * *
20  * * *
21  * * *
22  * * *
23  * * *
24  * * *
25  * * *
26  * * *
27  * * *
28  * * *
29  * * *
30  * * *

Takes way too much time to reach.

Traceroute is disabled to that host.

Thanks. I found (and fixed) a couple of bugs that caused this. It took a little while to make sure I had the right fix, run it on the beta site, etc.

The system has a pretty simplistic “scheduler” for running tests, partly for historical reasons and partly to be simple to reason about when something doesn’t work as expected (temporary failures, monitors coming and going, etc.).

The logic (until last night) was that an active monitor could test a server up to every 8 minutes, a testing monitor up to once an hour, and the system would try to keep consecutive tests 2 minutes apart.

Each server starts out with no active monitors and just slowly gets a few probes from the testing monitors (all of them), and as they collect enough test results they become eligible to be “active”. The system limits the rate at which it will change monitors for a server (which is a little goofy in this state), so there was a period where your server had only 4 active monitors. With the intervals listed above, the active monitors then basically did a test every 8 minutes each, spread out by two minutes and a few seconds, never leaving “room” for the testing monitors to get a test in. I don’t totally get why this didn’t happen more often (or maybe it does and it’s just harder to see when casually looking at the graphs?).
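
To illustrate, here is a toy simulation of my own (not the pool’s scheduler; only the intervals come from the description above, everything else is made up). With 4 active monitors allowed every 8 minutes and slots every 2 minutes, the active monitors soak up every slot and a testing monitor never gets a turn:

```go
package main

import "fmt"

// Toy simulation of the old intervals: 4 active monitors may each
// test every 8 minutes, testing monitors once an hour, and the
// scheduler keeps consecutive tests 2 minutes apart. An active
// monitor that is due takes the slot before any testing monitor.
func main() {
	const (
		activeInterval  = 8  // minutes between tests per active monitor (old value)
		testingInterval = 60 // minutes between tests per testing monitor (old value)
		slotSpacing     = 2  // minutes between consecutive tests (old value)
	)
	activeLast := []int{-8, -8, -8, -8} // 4 active monitors, all due at t=0
	testingLast := []int{-60, -60}      // 2 testing monitors, also due at t=0

	for t := 0; t < 60; t += slotSpacing {
		took := false
		for i, last := range activeLast {
			if t-last >= activeInterval {
				activeLast[i] = t
				fmt.Printf("t=%2dm: active monitor %d tests\n", t, i)
				took = true
				break
			}
		}
		if took {
			continue // 4 actives x 2-minute slots = 8 minutes: one is always due
		}
		for i, last := range testingLast {
			if t-last >= testingInterval {
				testingLast[i] = t
				fmt.Printf("t=%2dm: testing monitor %d tests\n", t, i)
				break
			}
		}
	}
	// Output: only active monitors ever run; the testing monitors are
	// due the whole hour but never find a free slot.
}
```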

Another bug (fixed today, but not pushed to production yet) sorted the servers to be tested by a monitor by “longest since this server had a test” rather than “longest since this monitor tested this server”, so a monitor “testing” a server wouldn’t prioritize that server even if it was hours (or days!) overdue.
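
In other words (again a hypothetical sketch with made-up field names, not the real schema): the queue was ordered by when anyone last tested the server instead of when this monitor last tested it.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Hypothetical fields purely for illustration; the real schema differs.
type serverEntry struct {
	server         string
	lastTestByAny  time.Time // when ANY monitor last tested this server
	lastTestByThis time.Time // when THIS monitor last tested this server
}

func main() {
	now := time.Now()
	queue := []serverEntry{
		// Recently tested by some active monitor, but days overdue for us:
		{"2a00:7580:60:211::46", now.Add(-5 * time.Minute), now.Add(-72 * time.Hour)},
		// Tested by us half an hour ago:
		{"2a00:7580:60:211::52", now.Add(-20 * time.Minute), now.Add(-30 * time.Minute)},
	}

	// Buggy ordering: sort by the server's last test from anyone, so a
	// server other monitors just tested sinks down the queue even when
	// this monitor is days overdue.
	sort.Slice(queue, func(i, j int) bool {
		return queue[i].lastTestByAny.Before(queue[j].lastTestByAny)
	})
	fmt.Println("buggy first pick:", queue[0].server)

	// Fixed ordering: sort by how long ago THIS monitor tested the server.
	sort.Slice(queue, func(i, j int) bool {
		return queue[i].lastTestByThis.Before(queue[j].lastTestByThis)
	})
	fmt.Println("fixed first pick:", queue[0].server)
}
```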

I changed the intervals to every 9 minutes for an active monitor, 45 minutes between tests from a testing monitor, and 75 seconds between consecutive tests overall. After I put this in production, your server had the full complement of active and testing monitors for IPv6.

3 Likes

Yeah, I think that’s just the active/testing intervals (previously every ~10 minutes for active monitors and a little less than hourly for testing monitors) times 3 queries per test.
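
If I’m reading the numbers right, that matches: ~72 polls/day ÷ 3 queries per test ≈ 24 tests/day, roughly one per hour, which fits a testing monitor, while ~465 polls/day ÷ 3 ≈ 155 tests/day, about one every 9 minutes, which fits an active monitor.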

Thanks, Ask, for the fix! If I understand correctly, from now on all existing monitors will show up, in either active or testing mode. In other words, every monitor will check every server, at least at some minimal frequency.

1 Like

I wouldn’t call these bugs, just glitches :laughing: :ok_hand:

1 Like