Monitoring upgrade

You could compare monitors against each other to separate the good ones from the bad.

I mean, I have a monitor (IPv4 and IPv6); both consistently produce good results for my other NTP servers, which are spread all over Europe.
However, I see a few other monitors that score really poorly for all my servers.

Wouldn’t it be a good idea to score the monitor itself?

I mean, if a monitor persistently marks servers as bad while others don’t, wouldn’t it be a good idea to inform the monitor operator so they can reconsider whether their monitor is actually useful?

Nobody benefits from poorly performing monitors. If mine performed badly all the time, I would remove it.

Rate them just like you do NTP scoring, but based on their testing performance compared to other monitors.
I mean, you can see their scoring numbers; if they are consistently not on par with other monitors, that may be a good reason to remove them.
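
To sketch what I mean (a minimal example with hypothetical monitor names and scores, not the pool’s actual code): take the median score that all monitors give a server as the consensus, and flag any monitor whose average deviation from that consensus stays large:

```go
package main

import (
	"fmt"
	"sort"
)

// median returns the median of a non-empty slice (copies, then sorts).
func median(xs []float64) float64 {
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	n := len(s)
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

// monitorBias measures how far each monitor's score for a server
// deviates, on average, from the consensus (median) of all monitors
// for that server. scores maps monitor -> server -> latest score.
func monitorBias(scores map[string]map[string]float64) map[string]float64 {
	// Collect all scores per server to form the consensus.
	perServer := map[string][]float64{}
	for _, servers := range scores {
		for server, s := range servers {
			perServer[server] = append(perServer[server], s)
		}
	}
	consensus := map[string]float64{}
	for server, xs := range perServer {
		consensus[server] = median(xs)
	}
	// Average each monitor's absolute deviation from the consensus.
	bias := map[string]float64{}
	for mon, servers := range scores {
		var sum float64
		for server, s := range servers {
			d := s - consensus[server]
			if d < 0 {
				d = -d
			}
			sum += d
		}
		bias[mon] = sum / float64(len(servers))
	}
	return bias
}

func main() {
	// Hypothetical data: "usxyz1" is persistently far below consensus.
	scores := map[string]map[string]float64{
		"belgg1": {"server-a": 19.8, "server-b": 20.0},
		"defra1": {"server-a": 19.5, "server-b": 19.9},
		"usxyz1": {"server-a": -5.0, "server-b": 3.0},
	}
	for mon, b := range monitorBias(scores) {
		fmt.Printf("%s: mean deviation %.1f\n", mon, b)
	}
}
```

A monitor like the made-up “usxyz1” above, which is persistently far below the consensus for every server, would stand out immediately, while normal route-to-route variation stays close to the median.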

From the technical description:

The new scoring calculation is called recent median. It works simply by choosing the median score of the ‘1-scores’ from “active” monitors in the last 20 minutes.
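
In code terms I read that description as something like this (a minimal sketch with made-up types, not the actual implementation):

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// scoreEntry is a made-up record of one monitor's latest "1-score".
type scoreEntry struct {
	Active bool      // is the monitor currently "active" for this server?
	When   time.Time // when the score was reported
	Score  float64   // the monitor's 1-score
}

// recentMedian picks the median 1-score from active monitors that
// reported within the window (20 minutes in the description above).
func recentMedian(entries []scoreEntry, now time.Time, window time.Duration) (float64, bool) {
	var xs []float64
	for _, e := range entries {
		if e.Active && now.Sub(e.When) <= window {
			xs = append(xs, e.Score)
		}
	}
	if len(xs) == 0 {
		return 0, false // no active, recent scores
	}
	sort.Float64s(xs)
	n := len(xs)
	if n%2 == 1 {
		return xs[n/2], true
	}
	return (xs[n/2-1] + xs[n/2]) / 2, true
}

func main() {
	now := time.Now()
	entries := []scoreEntry{
		{Active: true, When: now.Add(-5 * time.Minute), Score: 19.8},
		{Active: true, When: now.Add(-10 * time.Minute), Score: 20.0},
		{Active: false, When: now.Add(-2 * time.Minute), Score: -8.8}, // inactive: ignored
	}
	m, ok := recentMedian(entries, now, 20*time.Minute)
	fmt.Println(m, ok) // 19.9 true
}
```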

I do not understand how the recentmedian score ends up being -8.8 when the median value of the active monitors is 19.8 in the example.

The -8.8 score is really applied; this NTP server is out of the pool.

What’s weird is that my monitor isn’t reporting a number to your system.

I would expect it to monitor ALL servers when other monitors fail.

Mine is belgg1, and it doesn’t seem to test you.

Even after the Sunday bugfix I’m still getting continual emails telling me about “Problems” and wacky low scores.

Historically my server (71.191.185.32) was always about 4 ms +/- 2 ms offset from the San Jose monitor, and only in cases of complete loss of connectivity did it get a negative score and get removed from the pool.

It is interesting to look at the new graph and see that the green dots (offsets) have become much denser and have drifted, on average, much closer to 0 ms offset (with still several ms of scatter).

At the same time there are all kinds of red dots that I think are individual (new) monitor scores that seem to be dominating the results (which might be the “median” - assuming that the red dot is a score and not an offset!).

It is not at all obvious how I can map the server names such as “nlams1-1a6a7hp” to a geographic location.

I had never seen red dots on the monitoring history graph before, and I don’t know what they mean.

Has there been a change to what an “acceptable offset” is? Again, all my green dots recently seem to be within +/- 10 ms.

Link to my graphs: pool.ntp.org: Statistics for 71.191.185.32

Tim N3QE


Welcome to the forum tshoppa!

I suspect that the nlams1 server is in Amsterdam, the Netherlands (hence nl ams).
Could it be that the physical distance to your server (which I assume is in the US) causes these low scores?
Most server names seem to tell me what their physical locations are:
sgsin1: Singapore?
inblr1: India?
deksf1: Germany or Denmark?
fihel1: Finland?

BTW: isn’t it funny that my server (which is in the Netherlands) receives bad scores from some monitors in the US?

Just a side note:
The “code” represents the country (first two letters) followed by a three-letter IATA code (airport code) closest to the GeoIP location.
When setting up a monitor, the system does a GeoIP lookup and reports the nearest airport location(s). If the GeoIP DB is not correct, the location will also be wrong.
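
As a minimal sketch of that naming scheme (the helper below is made up for illustration, not the pool’s code):

```go
package main

import (
	"fmt"
	"strings"
)

// splitMonitorName is a hypothetical helper that splits a monitor
// code like "nlams1-1a6a7hp" into its two-letter country code and
// three-letter IATA airport code, ignoring the instance suffix.
func splitMonitorName(code string) (country, airport string, ok bool) {
	base, _, _ := strings.Cut(code, "-") // drop e.g. "-1a6a7hp"
	if len(base) < 5 {
		return "", "", false
	}
	return base[:2], base[2:5], true
}

func main() {
	for _, name := range []string{"nlams1-1a6a7hp", "sgsin1", "deksf1", "usday3"} {
		c, a, _ := splitMonitorName(name)
		fmt.Printf("%-16s -> country=%s airport=%s\n", name, c, a)
	}
}
```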

@ask
please correct me if I’m wrong :slight_smile:


That is kind of the point of multiple monitors, though: to show how different internet routes affect traffic. Saying only good-scoring monitors can be used defeats the purpose. Sure, there can be bad monitors that never score anything well, but the point is to have multiple monitors testing from all over, to eliminate the “your server is bad here, so it’s bad everywhere” issue. In other words, you can’t rate monitors the same way, since we expect to see different scores on different monitors; that’s kind of the point.

I have noticed, since we have more diversity in monitors, that Verizon FiOS accounts are having more issues. It could be upstream issues with the connection to Verizon. If it helps, here is a traceroute to port 123 over UDP from usday3; we are not seeing a response from the next hop, where we should at 152.179.136.9 from Verizon:

traceroute to 71.191.185.32 (71.191.185.32), 30 hops max, 60 byte packets
1 firewall1.versadns.com (10.10.81.1) 0.150 ms 0.112 ms 0.120 ms
2 gw.versadns.com (217.180.209.209) 0.626 ms 0.871 ms 0.854 ms
3 * * *
4 ve1104.core2.det1.he.net (184.105.30.101) 7.299 ms 7.874 ms 7.506 ms
5 * * *
6 * * *
7 * * *
8 * * *
9 * * *
10 * * *
11 * * *
12 * * *
13 * * *
14 * * *
15 * * *
16 * * *
17 * * *
18 * * *
19 * * *
20 * * *
21 * * *
22 * * *
23 * * *
24 * * *
25 * * *
26 * * *
27 * * *
28 * * *
29 * * *
30 * * *

The system does have a function to prefer server/monitor pairs that are more successful (see the “selector” section in the documentation linked above). It didn’t work because of a bug (now fixed in v3.4.0).

Oops, production was running a version one commit before I made the “only active servers” change! Now fixed.


Hello, I have two NTP servers in Taiwan, Asia, using the Taiwan Hinet and Seednet networks respectively. I am interested in joining the new monitoring network. Is it still possible to join now?

I’ll send you a separate note, thank you!

Hi Tim; lots of good questions!

  • The denser dots are just an artifact of how the graphing works. There are many more monitoring “points” now. As the old data rolls off it should show a shorter time period with slightly less data. (I’d love to fix this, but it’s temporary and I probably wouldn’t figure out the javascript before it’s sorted by itself …).
  • The red dots are when the score drops significantly (typically from an i/o error / timeout). See my next post for more on this!
  • The monitor names are [two letter country code][airport code].
  • The acceptable offset is currently 75 ms (!). With the improved monitoring system we can probably make this smaller (a rough sketch of the threshold check follows this list).
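
For illustration, roughly how such a threshold check could look (the names and the constant are hypothetical, not the actual monitor code):

```go
package main

import (
	"fmt"
	"time"
)

// maxAcceptableOffset mirrors the 75 ms threshold mentioned above;
// the constant name is made up for this sketch.
const maxAcceptableOffset = 75 * time.Millisecond

// offsetOK reports whether a measured offset (of either sign) is
// within the acceptable range.
func offsetOK(offset time.Duration) bool {
	if offset < 0 {
		offset = -offset
	}
	return offset <= maxAcceptableOffset
}

func main() {
	fmt.Println(offsetOK(4 * time.Millisecond))     // true
	fmt.Println(offsetOK(-2000 * time.Millisecond)) // false: well outside the limit
}
```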

Thanks everyone for your patience with this rolling out!

I found a bug in the monitoring client that I’d introduced in an unrelated change late one Sunday evening a couple of months ago. We missed it in testing because most of the beta monitors never got that particular version.

In each test the monitor sends multiple “probes” to a server. The intention is that it picks the best response it gets as the monitoring result. The bug I introduced made it so if any of the probes had an error the whole result got marked with that error, rather than ignoring it in favor of a successful response.
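
Roughly, the selection is supposed to work like this (a simplified sketch with made-up types, not the actual monitor code); the buggy version returned an error as soon as any single probe failed, instead of only when all of them did:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// probeResult is a hypothetical stand-in for one probe's outcome.
type probeResult struct {
	Offset time.Duration
	Err    error
}

// bestResult picks the best (smallest absolute offset) successful
// probe, and only returns an error if every probe failed.
func bestResult(probes []probeResult) (time.Duration, error) {
	var best *probeResult
	var lastErr error
	for i := range probes {
		p := &probes[i]
		if p.Err != nil {
			lastErr = p.Err // remember it, but keep looking for a success
			continue
		}
		if best == nil || abs(p.Offset) < abs(best.Offset) {
			best = p
		}
	}
	if best == nil {
		return 0, lastErr
	}
	return best.Offset, nil
}

func abs(d time.Duration) time.Duration {
	if d < 0 {
		return -d
	}
	return d
}

func main() {
	probes := []probeResult{
		{Err: errors.New("i/o timeout")},
		{Offset: 4 * time.Millisecond},
	}
	off, err := bestResult(probes)
	fmt.Println(off, err) // 4ms <nil>: the timeout is ignored
}
```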

It’s fixed now, and the monitors run by the project have been updated. The other monitors should get updated in the next ~12 hours or so.

Well, here is another one, a screenshot made a few minutes ago:

deksf1 is in Germany. Denmark would be a ‘dk’ prefix. :slight_smile:

During testing, and over the last few days of looking at graphs, I’ve seen many examples of different monitors seeing very different (and very consistent) offsets. I haven’t seen any clear pattern of some monitors consistently working well for everyone, but I have seen many examples of the same monitor getting “crazy” results for one server and excellent results for another. There might be some confirmation bias, but I think it validates the new design, with its many monitors and the system trying to choose which to focus on.

Indeed – the code that generates the location code options is here.

I’d like the monitor to have a traceroute feature built in and occasionally send traceroute data to the monitoring API to help on this sort of thing.

Over the weekend I added (made public) my NTP server, and the monitoring results are erratic and unstable. To be honest, I can’t believe it is that bad; pinging from outside the country gives perfect results.

BTW, it is an IPv6-only server …


Thanks @ask for jumping on this and troubleshooting it so quickly. My scores have started to stabilize and my servers are starting to spend time in the pool again. If you need anything else from me let me know. If you need another US monitor, I’m willing to volunteer for that as well.

No, it’s not that. The red dot by the ‘T’ of Tue represents an offset of 2000 ms, which is why the score dropped 5 points. It looks like only the last two digits of large offsets are being drawn on the axis label.

I’ve got a couple servers in Hyderabad, India I could run a monitor from.