Monitor scoring glitch when ntpd isn't listening?

I futzed an update of ntpd to a development testing version on t2.davehart.net and noticed what appears to be a bug in the recentmedian scoring. After bringing ntpd back up, I noticed while all the active monitors had scores close to the minimum, the recentmedian was not. Take a look:

I snapshotted the 200-row CSV immediately after that screenshot and posted it for comparison at:

https://www.davehart.net/ntp/pool/t2.davehart.net-mon-csv-20251217-1231.txt

Normal as the non-active servers don’t check your server as much.
So the active ones see it more quick you are down.

I do not know what the thresshold is, but I doubt the system will switch monitors in seconds to make your server look good again.

The active monitors score >20x as fast, so yes they drop faster too.

What happened when you where back after say 60 minutes? Did they up the score again?

Typical the selected servers bring it right to the top when the server is back.

By the looks of it, it took some time to bring it up again. Looks normal to me.

You where gone 12 hours… :face_with_spiral_eyes:

Yeah, there’s various glitches in there, e.g., a single good score (for whatever reasons, e.g., monitor not in service while the server was down, so the monitor is considered with the last score it recorded for the server) determining the overall score, at least for a short period of time.

Plus, as you observe, that single good score not even needing to be from an ā€œActiveā€ monitor, but from ā€œjustā€ a ā€œTestingā€ monitor. Though, in the case of your example, due to the long outage, the system is in the process of reshuffling the monitors in the two upper categories (only 5 ā€œActiveā€ monitors), in which case this behavior is also common (I can’t find the post right now, but I think @avij described something like that very recently happening when a new server has been added, and I keep seeing that frequently as well).

1 Like

The server was gone for 12+ hours.

The logs look good again, it should be ok in a few hours.

Question is, why did all monitors got rejected? As that is what happened.

In your log…

1765991741,2025-12-17 17:15:41,-0.000162714,1,-58.961532593,360,cayul1-h046f4,19.736,
1765991691,2025-12-17 17:14:51,-0.002033749,1,-18.668172836,364,ausyd1-h046f4,202.762,
1765991691,2025-12-17 17:14:51,-5,-63.117401123,24,recentmedian,network: read: connection refused
1765991571,2025-12-17 17:12:51,-5,-63.117401123,24,recentmedian,network: read: connection refused
1765991571,2025-12-17 17:12:51,0.002736924,1,-26.732477188,93,frmrs1-3strqkc,99.343,
1765991486,2025-12-17 17:11:26,-0.000801631,1,-19.201631546,302,rusvo1-r6np0t,146.619,
1765990846,2025-12-17 17:00:46,-5,-63.117401123,24,recentmedian,network: read: connection refused
1765990846,2025-12-17 17:00:46,-0.000659294,1,-65.460952759,311,usdfw1-762zq1,34.807,
1765990838,2025-12-17 17:00:38,-5,-63.117401123,24,recentmedian,network: read: connection refused
1765990838,2025-12-17 17:00:38,-0.000499089,1,-29.412374496,313,ussjc1-762zq1,64.731,
1765990275,2025-12-17 16:51:15,-5,-63.117401123,24,recentmedian,network: read: connection refused
1765990275,2025-12-17 16:51:15,-0.026093988,0.993031383,-18.899791718,338,aumel1-1fcy378,260.192,
1765989797,2025-12-17 16:43:17,-5,-24.668172836,285,twtpe3-2kg9ezv,0,network: read: connection refused
1765989797,2025-12-17 16:43:17,-5,-92.855377197,24,recentmedian,network: read: connection refused
1765989729,2025-12-17 16:42:09,-5,-92.855377197,24,recentmedian,network: read: connection refused
1765989729,2025-12-17 16:42:09,-5,-47.185142517,134,usday4-1tcp71g,0,network: read: connection refused
1765989667,2025-12-17 16:41:07,-5,-92.855377197,24,recentmedian,network: read: connection refused
1765989667,2025-12-17 16:41:07,-5,-35.965759277,137,usday5-1tcp71g,0,network: read: connection refused
1765989391,2025-12-17 16:36:31,-5,-92.747650146,24,recentmedian,network: read: connection refused
1765989391,2025-12-17 16:36:31,-5,-25.12055397,322,brrte1-k2cg7w,0,network: read: connection refused
1765989254,2025-12-17 16:34:14,-5,-92.747650146,24,recentmedian,network: read: connection refused
1765989254,2025-12-17 16:34:14,-5,-63.117401123,360,cayul1-h046f4,0,network: read: connection refused
1765989201,2025-12-17 16:33:21,-5,-92.747650146,24,recentmedian,network: read: connection refused
1765989201,2025-12-17 16:33:21,-5,-24.819412231,327,chzrh1-1xmhyma,0,network: read: connection refused
1765989009,2025-12-17 16:30:09,-5,-92.855377197,24,recentmedian,network: read: connection refused
1765989009,2025-12-17 16:30:09,-5,-20.721843719,159,sgsin1-z4ytm9,0,network: read: connection refused
1765988891,2025-12-17 16:28:11,-5,-92.855377197,24,recentmedian,network: read: connection refused
1765988891,2025-12-17 16:28:11,-5,-24.71121788,350,bymsq1-2zkagd8,0,network: read: connection refused
1765988526,2025-12-17 16:22:06,-5,-92.855377197,24,recentmedian,network: read: connection refused
1765988526,2025-12-17 16:22:06,-5,-13.042416573,90,sgsin1-1a6a7hp,0,network: read: connection refused
1765987994,2025-12-17 16:13:14,-5,-92.855377197,24,recentmedian,network: read: connection refused
1765987994,2025-12-17 16:13:14,-5,-24.775602341,357,itblq1-3mf5y85,0,network: read: connection refused
1765987840,2025-12-17 16:10:40,-5,-92.855377197,24,recentmedian,network: read: connection refused
1765987840,2025-12-17 16:10:40,-5,-92.855377197,309,ustul1-762zq1,0,network: read: connection refused
1765987674,2025-12-17 16:07:54,-5,-24.683494568,307,deham1-2zg3vnt,0,network: read: connection refused
1765987608,2025-12-17 16:06:48,-5,-25.861923218,97,depad1-3strqkc,0,network: read: connection refused
1765987600,2025-12-17 16:06:40,-5,-24.668172836,191,frory1-1q8hbq5,0,network: read: connection refused
1765987545,2025-12-17 16:05:45,-5,-24.668172836,84,hkhkg1-1a6a7hp,0,network: read: connection refused
1765987541,2025-12-17 16:05:41,-5,-92.855377197,24,recentmedian,network: read: connection refused
1765987541,2025-12-17 16:05:41,-5,-24.668172836,95,sgsin3-1a6a7hp,0,network: read: connection refused
1765987487,2025-12-17 16:04:47,-5,-92.855377197,24,recentmedian,network: read: connection refused
1765987487,2025-12-17 16:04:47,-5,-92.855377197,163,usbos1-33hme3c,0,network: read: connection refused
1765987484,2025-12-17 16:04:44,-5,-24.668172836,126,defra1-3strqkc,0,network: read: connection refused
1765987483,2025-12-17 16:04:43,-5,-24.683862686,346,sgsin1-1fcy378,0,network: read: connection refused
1765987433,2025-12-17 16:03:53,-5,-92.747650146,24,recentmedian,network: read: connection refused
1765987433,2025-12-17 16:03:53,-5,-24.668172836,286,debre1-3drzsgk,0,network: read: connection refused
1765987382,2025-12-17 16:03:02,-5,-92.747650146,24,recentmedian,network: read: connection refused
1765987382,2025-12-17 16:03:02,-5,-92.855377197,94,usewr1-1a6a7hp,0,network: read: connection refused
1765987378,2025-12-17 16:02:58,-5,-24.668172836,281,twrmq1-2c0kgjp,0,network: read: connection refused
1765987322,2025-12-17 16:02:02,-5,-92.654052734,24,recentmedian,network: read: connection refused
1765987322,2025-12-17 16:02:02,-5,-24.668172836,265,gblhr1-32yzyvc,0,network: read: connection refused
1765987309,2025-12-17 16:01:49,-5,-92.654052734,24,recentmedian,network: read: connection refused
1765987309,2025-12-17 16:01:49,-5,-20.909481049,116,inbom2-3strqkc,0,network: read: connection refused
1765987259,2025-12-17 16:00:59,-5,-32.595535278,137,usday5-1tcp71g,0,network: read: connection refused
1765987256,2025-12-17 16:00:56,-5,-24.668172836,152,ussfo1-1a6a7hp,0,network: read: connection refused
1765987256,2025-12-17 16:00:56,-5,-92.654052734,24,recentmedian,network: read: connection refused
1765987256,2025-12-17 16:00:56,-5,-26.989795685,109,dehhn1-3strqkc,0,network: read: connection refused
1765987208,2025-12-17 16:00:08,-5,-92.654052734,24,recentmedian,network: read: connection refused
1765987208,2025-12-17 16:00:08,-5,-24.668172836,236,frcqf1-3strqkc,0,network: read: connection refused
1765987206,2025-12-17 16:00:06,-5,-92.654052734,24,recentmedian,network: read: connection refused
1765987206,2025-12-17 16:00:06,-5,-27.748226166,221,clkna2-k2cg7w,0,network: read: connection refused
1765987158,2025-12-17 15:59:18,-5,-24.683862686,99,esmad1-3strqkc,0,network: read: connection refused
1765987153,2025-12-17 15:59:13,-5,-92.654052734,24,recentmedian,network: read: connection refuse

Just from memory without inspecting the code: I think this might be a quirk of the system slowing down monitoring when the score for a server/monitor drops to under -90 and then the scorer (the component that calculates the scores (ā€œrecentmedianā€ is the only score, but it could calculate others) falls back to a broader median of recent checks because the active monitors ā€œaren’t workingā€.

Basically the active monitors decided to hibernate, so temporarily the testing monitors became active.

2 Likes

If you would read the thread title or understand the first sentence and a half of my first message, you would know that’s not the question. ntpd was inadvertently stopped. It was not listening on UDP port 123. The monitoring system noted that correctly.
The glitch was that the active monitors were all showing less than -90 for the score shortly after ntpd came back up, but the recentmedian ā€œCurrent score:ā€ was higher at -59. I expected that only the active monitors would determine the score.

OK, thank you for the clear explanation. Everything is back to normal after all the interacting automatons did their dance.

A couple different paths to improve it:

  • Make the scorer more sophisticated to avoid it slipping into the testing pool. I’m not too fond of this idea; I suspect it’ll be complicated in other ways / scenarios.
  • When the score drops below X, force move a monitor to testing. If I remember right the system hesitates in this situation currently because there aren’t any good monitors (so it tries to keep status quo?
    • If a server lose all ā€œactiveā€ monitors the system should basically behave like it’s a new server (except the scores are lower so it’ll take a bit to recover) and work to quickly add active servers as the scores indicate health.
3 Likes

That was what puzzled me, the log didn’t say timeout but rejected.

So the log is wrong, or ntpd was still running and rejecting requests.

Or is rejected the new term for no-response?

And reading Ask’s his reply, it makes sense the system thinks of a new system after being gone for a long time.

I’d also prefer less sophistication, rather than more. So while it sounds easy, but if making the scorer refrain from picking the median from the testing pool would need more sophistication, I think pushing the monitors out of the active pool back into the testing pool more aggressively would be preferable - especially if that means not increasing the level of sophistication, or perhaps reducing it ever so slightly.

As it is not a technical issue (the scores are all rather bad, anyway, so how the median is picked is not really relevant), but rather one of optics, that should help to make it clear to a user that the median comes from the testing pool, and not some mix of both pools, or ignoring the monitors in the active pool and evaluating the testing pool only (not sure which it is technically), which can be confusing.

First time I saw what Dave described, I thought the system is going somewhat haywire when picking a median value that is not even represented in the active pool, until I realized it was an actual value by a monitor, but from the testing pool.

To make the server recover faster, maybe when the active pool is empty could the monitors in the testing pool poll more aggressively, so as to gather more data more quickly so as to move the score along quicker? That could also help speed up initial addition of a new server, which takes rather long these days in my impression…

1 Like

I support this idea, but only after say being gone 12 hours to name something.
Then simply reset, as being a new ntp-server (should be same for all) but test them faster.

So new (returned) severs get to >10 status more fast.
Beware, >10 means you are listed in the pool, it doesn’t need to be 20.

And now? Any better?

That is weird.
I do believe it’s in the pool, but it seems it has no assinged active servers.

This looks strange to me.

Maybe tomorrow? Are you serving time, in x 0/000? something yet, or is that still zero?

I would expect new/old servers to restore in less then 24 hours. This takes a bit too long.

Maybe @ask has to refine the parameters, to restore old/new/out severs faster.

Don’t know the parameters, but they seem a bit slow on ā€˜recovery’.

It should have assigned monitors by now…weird.