Thanks @ryan1 & @n1zyy. I was looking at a handful of cases like yours this morning. It appears to happen to quite a few servers.
Thanks @csweeney05 for looking through some of these, too.
@rlaager, @Clock & @ChrisR in your cases it looks like the system works as intended; even if some monitors are getting poor results, the majority of the monitors “win”. It sounds like maybe the system that sends emails isn’t doing the right thing though; I will look into that.
Thanks @ask. Let me know if you need me to look at anything on my side. It’s strange that some monitors have time2 marked so poorly. I’m on Verizon enterprise and have NTP traffic assigned the highest QoS tier on my headend firewalls.
Oh, I totally missed that. I looked at the IP and went straight to the logs, missing the graph! Yeah, your server appears to do so well in the monitoring that it’s breaking the graph!
I have some questions about the new monitoring system:
1- Why is each server being monitored by different monitoring stations?
2- What do “legacy score” and “recentmedian score” mean? Are both used to compose the overall score?
3- What are “testing” monitoring stations? Will they become active at a certain point?
@ryan1, @n1zyy, @rlaager (and others): Thank you all for the help; I found two bugs / misfeatures (read the documentation link above for context):
When there were fewer than the expected number of “good” monitors (5), the system wouldn’t promote more monitors to “active”, meaning most monitors would only do an occasional test. Since we are still adding monitors, a lot of servers ended up with fewer than 5 “healthy” monitor candidates. I fixed this (about 5 hours ago).
The “median score” calculation would include every monitor that had probed the server in the last 20 minutes. Under normal circumstances these are overwhelmingly the active monitors, which are unlikely to report false errors; but when too few monitors were marked “active” there was a higher risk of a “bad” monitor’s score becoming the median. I believe this is what caused the noisy email alerts that went out today. I fixed this around 8pm PDT (3am UTC).
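Roughly, the fixed calculation looks like this (a simplified sketch, not the production code; the types, names and tie-breaking are just for illustration): only probes from “active” monitors within the last 20 minutes go into the median.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

type probe struct {
	monitor string
	status  string    // "active", "testing", ...
	score   float64   // this monitor's current score for the server
	when    time.Time // when the probe happened
}

// recentMedian takes only probes from "active" monitors within the last 20
// minutes and returns their median score. Before the fix, probes from
// non-active monitors were counted too, so a single "bad" monitor could end
// up being the median.
func recentMedian(probes []probe, now time.Time) (float64, bool) {
	cutoff := now.Add(-20 * time.Minute)
	var scores []float64
	for _, p := range probes {
		if p.status == "active" && p.when.After(cutoff) {
			scores = append(scores, p.score)
		}
	}
	if len(scores) == 0 {
		return 0, false
	}
	sort.Float64s(scores)
	mid := len(scores) / 2
	if len(scores)%2 == 1 {
		return scores[mid], true
	}
	// How the real code handles an even number of scores is an assumption here.
	return (scores[mid-1] + scores[mid]) / 2, true
}

func main() {
	now := time.Now()
	probes := []probe{
		{"sjc1", "active", 20, now.Add(-2 * time.Minute)},
		{"nlams1", "active", 19.5, now.Add(-5 * time.Minute)},
		{"sgsin1", "testing", -5, now.Add(-1 * time.Minute)}, // ignored: not active
	}
	if m, ok := recentMedian(probes, now); ok {
		fmt.Printf("recent median: %.1f\n", m)
	}
}
```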
Thanks for the update, @ask. I do see more monitors added to my time2 server (71.245.181.11). I still see only one monitor marked active (San Jose, CA, US), even though other monitors (sgsin1-1a6a7hp, for example) are giving it a perfect 20.
Yeah, yours and about 150 other servers are getting worse results. I’d expected that the new diversity of monitors added in the last day would have resolved it, but it hasn’t! Some months ago I started adding a traceroute feature to the monitor but decided to focus on getting everything production ready instead. Right now it would have been nice to have the traceroutes…
The sgsin1-1a6a7hp monitor has random periods of local instability, so it ends up not being elected as active. I’ve adjusted some parameters, so maybe it will be now.
I mean, I have a monitor (IPv4 and IPv6), and both consistently get good results monitoring my other NTP servers, which are spread all over Europe.
However, I see a few other monitors that give really poor results for all my servers.
Wouldn’t it be a good idea to score the monitor itself?
I mean, if a monitor persistently marks servers as bad while other monitors don’t, wouldn’t it be a good idea to inform the monitor operator so they can reconsider whether their monitor is actually useful?
Nobody benefits from poorly performing monitors. If mine performed badly all the time, I would remove it.
Rate monitors just like you do NTP scoring, but based on their testing performance compared to the other monitors.
I mean, you can see their scoring numbers; if they are consistently not on par with the other monitors, that may be a good reason to remove them.
The new scoring calculation is called recent median. It works simply by choosing the median score of the ‘1-scores’ from “active” monitors in the last 20 minutes.
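For example (made-up numbers): if the most recent 1-scores from the active monitors are 20, 19.8, 18.5, -5 and 20, the recent median is 19.8, so a single bad reading doesn’t drag the overall score down.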
Even after the Sunday bugfix I’m still getting continual emails telling me about “Problems” and wacky low scores.
Historically my server (71.191.185.32) was always about 4ms +/- 2ms offset from the San Jose monitor, and only in cases of complete loss of connectivity did it get a negative score and get removed from the pool.
It is interesting to look at the new graph and see that the green dots (offsets) have become much denser and have drifted much closer to an average of 0ms offset (still with several ms of scatter).
At the same time there are all kinds of red dots, which I think are individual (new) monitor scores, that seem to be dominating the results (which might be the “median” - assuming that the red dot is a score and not an offset!).
It is not at all obvious how I can map the server names such as “nlams1-1a6a7hp” to a geographic location.
I had never seen red dots on the monitoring history graph before, and I don’t know what they mean.
Has there been a change to what an “acceptable offset” is? Again, all my green dots recently seem to be within +/-10ms.
I suspect that the nlams1 server is in Amsterdam, the Netherlands (hence nl ams).
Could it be that the physical distance to your server (which I assume is in the US) causes these low scores?
Most server names seem to tell me what their physical locations are:
sgsin1: Singapore?
inblr1: India?
deksf1: Germany or Denmark?
fihel1: Finland?
BTW: isn’t it funny that my server (which is in the Netherlands) receives bad scores from some monitors in the US?
Just a side note:
The “code” represents the country (first two letters) followed by a three-letter IATA code (airport code) closest to the GeoIP location.
When setting up a monitor, the system does a GeoIP lookup and reports the nearest airport location(s). If the GeoIP DB is not correct, the reported location will be wrong as well.
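As a rough illustration of reading a name like “nlams1-1a6a7hp” (just a sketch; the trailing number and the suffix after the dash aren’t documented here, so splitting them off is an assumption on my part):

```go
package main

import (
	"fmt"
	"strings"
)

// parseMonitorName pulls the two-letter country code and the three-letter
// IATA airport code out of a name like "nlams1-1a6a7hp". The trailing digit
// and the part after the dash are just how the names appear in this thread;
// what they mean exactly isn't documented here.
func parseMonitorName(name string) (country, airport string, ok bool) {
	base := strings.SplitN(name, "-", 2)[0]
	if len(base) < 5 {
		return "", "", false
	}
	return base[:2], base[2:5], true
}

func main() {
	for _, n := range []string{"nlams1-1a6a7hp", "sgsin1-1a6a7hp", "fihel1", "inblr1"} {
		if cc, iata, ok := parseMonitorName(n); ok {
			fmt.Printf("%s -> country %q, airport %q\n", n, cc, iata)
		}
	}
}
```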
That is kind of the point of multiple monitors though: to show how different internet routes affect traffic. Saying that only good-scoring monitors can be used defeats the purpose. Sure, there can be bad monitors that never score anything well, but the point is to have multiple monitors from all over testing, to eliminate the “your server is bad from here, so it’s bad everywhere” issue. In other words, you can’t rate monitors the same way, because we expect to see different scores on different monitors; that’s kind of the point.
I have noticed, since we have more diversity in monitors, that Verizon FiOS accounts are having more issues. It could be upstream issues with the connection to Verizon. If it helps, here is a traceroute to port 123 over UDP from usday3; we are not seeing a response from the next hop where we should, at 152.179.136.9 from Verizon:
The system does have a function to prefer server/monitor pairs that are more successful (see the “selector” section in the documentation linked above). It didn’t work because of a bug (now fixed in v3.4.0).
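Roughly, the idea is something like this (a simplified sketch, not the actual selector code; the names and the success metric here are just for illustration): for each server, rank the candidate monitors by how well their recent probes of that server have gone and prefer the best pairs.

```go
package main

import (
	"fmt"
	"sort"
)

// pairStats is an invented structure holding how often one monitor has
// recently probed one particular server successfully.
type pairStats struct {
	monitor   string
	probes    int
	successes int
}

func (p pairStats) successRate() float64 {
	if p.probes == 0 {
		return 0
	}
	return float64(p.successes) / float64(p.probes)
}

// rankMonitors orders the candidate monitors for one server, best pair first,
// so the better-performing pairs would be preferred as "active".
func rankMonitors(stats []pairStats) []pairStats {
	sort.Slice(stats, func(i, j int) bool {
		return stats[i].successRate() > stats[j].successRate()
	})
	return stats
}

func main() {
	for _, s := range rankMonitors([]pairStats{
		{"sjc1", 100, 99},
		{"sgsin1", 100, 60},
		{"nlams1", 100, 95},
	}) {
		fmt.Printf("%s: %.0f%% success for this server\n", s.monitor, s.successRate()*100)
	}
}
```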
Hello, I have two NTP servers in Taiwan, Asia, using the Taiwan Hinet and Seednet networks respectively. I am interested in joining the new monitoring network. Is it still possible to join now?