Monitoring upgrade

Thanks @ryan1 & @n1zyy. I was looking at a handful of cases like yours this morning. It appears to happen to quite a few servers.

Thanks @csweeney05 for looking through some of these, too.

@rlaager, @Clock & @ChrisR: in your cases it looks like the system works as intended; even if some monitors are getting poor results, the majority of the monitors “win”. It sounds like the system that sends emails isn’t doing the right thing, though. I will look into that.

It works well; my point was simply about the graph. Every point on the left (offset) y axis is labelled “00”. That is not helpful.

Thanks, @ask. Let me know if you need me to look at anything on my side. It is strange that some monitors score time2 so poorly. I’m on Verizon enterprise and have NTP traffic assigned the highest QoS tier on my headend firewalls.

Oh, I totally missed that. I looked at the IP and went straight to the logs, missing the graph! Yeah, your server appears to do so well in the monitoring that it’s breaking the graph! :smiley:

I have some questions about the new monitoring system:

1- Why is each server being monitored by different monitoring stations?
2- What do “legacy score” and “recentmedian score” mean? Are both used to compose the overall score?
3- What are “testing” monitoring stations? Will they become active at a certain point?

@Clock Excellent questions, I wrote up some documentation for this.

@ryan1, @n1zyy, @rlaager (and others): Thank you all for the help; I found two bugs / misfeatures (read the documentation link above for context):

  • When there were fewer than the expected number of “good” monitors (5), the system wouldn’t add more “active” servers, meaning most monitors would only do an occasional test. Since we are still adding monitors, a lot of servers ended up with fewer than 5 “healthy” monitor candidates. I fixed this (about 5 hours ago).
  • The “median score” calculation would include all servers that had a probe in the last 20 minutes. Under normal circumstances these are overwhelmingly the active monitors, which are unlikely to have false errors; but when too few servers were marked “active”, there was a higher risk of a “bad” server providing the median score. I believe this is what caused the noisy email alerts to be sent out today. I fixed this around 8pm PDT (3am UTC).
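
The effect of the second bug can be illustrated with a small sketch. The names (`Probe`, `recent_median`), the monitor names, and the sample scores are all made up for illustration; this is not the pool's actual code, only the "median over the last 20 minutes" idea from the bullet above, with and without a filter on "active" status.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Probe:
    monitor: str   # hypothetical monitor name
    status: str    # "active" or "testing" (assumed status labels)
    score: float   # score reported by that monitor
    ts: datetime   # time of the probe


def recent_median(probes, now, active_only=True, window=timedelta(minutes=20)):
    """Median score of probes inside the window, optionally active monitors only."""
    scores = [
        p.score
        for p in probes
        if now - p.ts <= window and (not active_only or p.status == "active")
    ]
    return median(scores) if scores else None


now = datetime(2023, 1, 1, 12, 0)  # arbitrary reference time
probes = [
    Probe("active-a", "active", 19.5, now - timedelta(minutes=3)),
    Probe("testing-b", "testing", -7.0, now - timedelta(minutes=5)),
    Probe("testing-c", "testing", -12.0, now - timedelta(minutes=9)),
    Probe("active-d", "active", 20.0, now - timedelta(minutes=45)),  # outside window
]

buggy = recent_median(probes, now, active_only=False)  # every recent probe counts
fixed = recent_median(probes, now, active_only=True)   # only active monitors count
print(buggy, fixed)
```

With only one active monitor in the window, the unfiltered median lands on a poorly scoring testing monitor, while the filtered one follows the active monitor.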

Thanks for the update, @ask. I do see more monitors added to my time2 server (71.245.181.11). I still see only San Jose, CA, US listed as active, even though other monitors (sgsin1-1a6a7hp, for example) are giving it a perfect 20.

Yeah, yours and about 150 other servers are getting worse results. I’d expected that with the new diversity of monitors in the last day it’d have been resolved but it’s not! Some months ago I started adding a traceroute feature to the monitor but decided to focus on getting everything production ready instead. Right now it’d have been nice to have the traceroutes… :slight_smile:

The sgsin1-1a6a7hp monitor has random periods of local instability so it ends up not being elected as active. I’ve adjusted some parameters so maybe it will.

You could compare monitors between good and bad.

I mean, I have a monitor (IPv4 and IPv6), and it consistently scores my other NTP servers all over Europe well.
However, I see a few other monitors that score all my servers really poorly.

Wouldn’t it be a good idea to score the monitor itself?

I mean, if a monitor persistently marks servers as bad while others don’t, wouldn’t it be a good idea to ask the monitor operator to reconsider whether the monitor is useful?

Nobody benefits from poorly performing monitors. If mine performed badly all the time, I would remove it.

Rate monitors just like you score NTP servers, but on their testing performance compared to other monitors.
I mean, you can see their scoring numbers; if they are often out of line with other monitors, that may be a good reason to remove them.

From the technical description:

The new scoring calculation is called recent median. It works simply by choosing the median score of the ‘1-scores’ from “active” monitors in the last 20 minutes.

I do not understand how the recentmedian score ends up being -8.8 when the median value of the active monitors is 19.8 in the example.

The -8.8 score really is applied; this NTP server is out of the pool.

What is weird is that my monitor isn’t reporting a number for your system.

I would expect it to monitor ALL when other monitors fail.

Mine is belgg1; it doesn’t seem to test you.

Even after the Sunday bugfix I’m still getting continual emails telling me about “Problems” and wacky low scores.

Historically my server (71.191.185.32) was always about 4ms +/- 2ms offset from the San Jose monitor, and only in cases of complete loss of connectivity did it get a negative score and get removed from the pool.

It is interesting to look at the new graph and see that the green dots (offsets) have become much denser and have drifted much closer to an average of 0ms offset (still with several ms of scatter).

At the same time there are all kinds of red dots that I think are individual (new) monitor scores, and they seem to be dominating the results (which might be the “median”, assuming that a red dot is a score and not an offset!).

It is not at all obvious how I can map the server names such as “nlams1-1a6a7hp” to a geographic location.

I had never seen red dots on the monitoring history graph before, and I don’t know what they mean.

Has there been a change to what counts as an “acceptable offset”? Again, all my green dots recently seem to be within +/-10ms.

Link to my graphs: pool.ntp.org: Statistics for 71.191.185.32

Tim N3QE

Welcome to the forum tshoppa!

I suspect that the nlams1 monitor is in Amsterdam, the Netherlands (hence nl ams).
Could it be that the physical distance to your server (which I assume is in the US) causes these low scores?
Most monitor names seem to indicate their physical locations:
sgsin1: Singapore?
inblr1: India?
deksf1: Germany or Denmark?
fihel1: Finland?

BTW: isn’t it funny that my server (which is in the Netherlands) receives bad scores from some monitors in the US?

Just a side note:
The “code” represents the country (first two letters) followed by the three-letter IATA code (airport code) closest to the GeoIP location.
When a monitor is set up, the system does a GeoIP lookup and reports the nearest airport location(s). If the GeoIP DB is not correct, the location will also be wrong.

@ask
please correct me if I’m wrong :slight_smile:
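
The naming scheme described in this post can be sketched as a small parser. The pattern below is an assumption inferred from the names seen in this thread (two-letter country code, three-letter IATA code, a number, and an optional suffix after a hyphen); `parse_monitor_name` is a hypothetical helper, and the meaning of the number and the suffix is not stated here, so they are returned unchanged.

```python
import re

# Assumed layout: 2-letter country code, 3-letter IATA airport code,
# a number, and an optional "-suffix". Inferred from names in the thread,
# not from any official specification.
MONITOR_NAME = re.compile(r"^([a-z]{2})([a-z]{3})(\d+)(?:-([a-z0-9]+))?$")


def parse_monitor_name(name):
    """Split a monitor name into its assumed parts, or return None if it does not fit."""
    m = MONITOR_NAME.match(name.lower())
    if m is None:
        return None
    country, airport, number, suffix = m.groups()
    return {
        "country": country.upper(),
        "iata": airport.upper(),
        "number": int(number),
        "suffix": suffix,  # None when the name has no "-suffix" part
    }


for name in ["nlams1-1a6a7hp", "sgsin1-1a6a7hp", "belgg1", "usday3"]:
    print(name, parse_monitor_name(name))
```

Mapping the IATA code to a city would still need an airport table, which is left out here.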

That is kind of the point of multiple monitors, though: to show how different internet routes affect traffic. Saying that only well-scoring monitors can be used defeats the purpose. Sure, there can be bad monitors that never score anything well, but the point is to have multiple monitors testing from all over, to eliminate the “your server is bad here, so it’s bad everywhere” issue. In other words, you can’t rate monitors that way, because we expect to see different scores on different monitors; it’s kind of the point.

I have noticed that since we got more diversity in monitors, Verizon FiOS accounts are having more issues. It could be an upstream issue with the connection to Verizon. If it helps, here is a traceroute to port 123 over UDP from usday3; we are not seeing a response from the next hop, where we should see 152.179.136.9 from Verizon:

traceroute to 71.191.185.32 (71.191.185.32), 30 hops max, 60 byte packets
1 firewall1.versadns.com (10.10.81.1) 0.150 ms 0.112 ms 0.120 ms
2 gw.versadns.com (217.180.209.209) 0.626 ms 0.871 ms 0.854 ms
3 * * *
4 ve1104.core2.det1.he.net (184.105.30.101) 7.299 ms 7.874 ms 7.506 ms
5 * * *
6 * * *
7 * * *
8 * * *
9 * * *
10 * * *
11 * * *
12 * * *
13 * * *
14 * * *
15 * * *
16 * * *
17 * * *
18 * * *
19 * * *
20 * * *
21 * * *
22 * * *
23 * * *
24 * * *
25 * * *
26 * * *
27 * * *
28 * * *
29 * * *
30 * * *
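
When comparing traces like this from several monitors, it can help to pull out the last hop that answered. A minimal sketch, assuming the plain-text traceroute layout shown above (hop number first, `*` for a probe with no reply); `last_responding_hop` is a hypothetical helper, and the sample is just the first lines of the trace above.

```python
def last_responding_hop(lines):
    """Return (hop_number, rest_of_line) for the last hop that answered, or None."""
    last = None
    for line in lines:
        parts = line.split(maxsplit=1)
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # header line, blank line, or anything not starting with a hop number
        hop, rest = int(parts[0]), parts[1].strip()
        if rest.replace("*", "").strip():  # anything besides '*' means a reply came back
            last = (hop, rest)
    return last


sample = [
    "traceroute to 71.191.185.32 (71.191.185.32), 30 hops max, 60 byte packets",
    "1 firewall1.versadns.com (10.10.81.1) 0.150 ms 0.112 ms 0.120 ms",
    "2 gw.versadns.com (217.180.209.209) 0.626 ms 0.871 ms 0.854 ms",
    "3 * * *",
    "4 ve1104.core2.det1.he.net (184.105.30.101) 7.299 ms 7.874 ms 7.506 ms",
    "5 * * *",
]

hop = last_responding_hop(sample)
print(hop)
```

A silent hop does not by itself prove packet loss, since many routers simply do not answer traceroute probes, so the result is only a hint about where to look.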

The system does have a function to use server/monitor pairs that are more successful (see the “selector” section in the documentation linked above). It didn’t work because of a bug (now fixed in v3.4.0).

Oops, production is running a version one commit before I made the “only active servers” change! Now fixed.

Hello, I have two NTP servers in Taiwan, Asia, using the Taiwan Hinet and Seednet networks respectively. I am interested in joining the new monitoring network. Is it still possible to join now?

I’ll send you a separate note, thank you!