Monitoring upgrade

I’ve started upgrading the production site to the new monitoring system that’s been in testing and baking since last spring.

Those of you with servers in the beta system will have noticed that the performance of the new system is much improved.

The website might be unavailable at times over the next few hours, and the monitoring system might pause for a short while.

Update 1 (Sunday): A couple of bugs were found and fixed Sunday (see below); apologies for the noisy “your server score dropped” emails that were sent out to some of you the last ~40 hours.

I also added a description of how the monitoring system works.

Update 2 (Monday evening): The monitors had a bug (now fixed and being rolled out) that kept a single timeout from being ignored as intended.
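As an aside, one way to “ignore” a single timeout is to retry the probe once before counting the sample as a failure. That is an assumption about the intended behaviour, not the monitor’s actual code; a minimal Go sketch, with a hypothetical queryServer standing in for the real NTP probe:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

// errTimeout stands in for whatever timeout error the real NTP query returns.
var errTimeout = errors.New("timeout waiting for NTP response")

// queryServer is a placeholder for the monitor's actual NTP probe; here it
// just returns a fake offset.
func queryServer(host string) (time.Duration, error) {
	// ... send an NTP packet and wait for the reply ...
	return 2 * time.Millisecond, nil
}

// probe retries once on a timeout, so a single dropped packet does not turn
// into a failed sample (the behaviour the bug fix above is restoring).
func probe(host string) (time.Duration, error) {
	offset, err := queryServer(host)
	if errors.Is(err, errTimeout) {
		// One timeout is ignored; only a repeated timeout counts as a failure.
		offset, err = queryServer(host)
	}
	return offset, err
}

func main() {
	offset, err := probe("time.example.com")
	if err != nil {
		fmt.Fprintln(os.Stderr, "probe failed:", err)
		return
	}
	fmt.Println("offset:", offset)
}
```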


This is done – there might be some things not working right (I didn’t test adding a new server, for example), but the production system is running with multiple monitors now!

The new scoring system will kick in properly once a few more monitors are added to the system.


Thanks, Ask. Please keep us posted on your progress.


South America and Oceania are not represented. As fairly geographically isolated continents, their resident servers are being monitored from across an ocean, which is neither reliable nor consistent. Perhaps qualifying local volunteers could be encouraged to offer their help.

Since this post, my server scores have become erratic. time(dot)santichen(dot)com is fairly stable, but my Pittsburgh node, time2(dot)santichen(dot)com, is all over the place. One moment it’s 20, the next it’s -20, and I’m getting emails. Is this expected with the monitoring upgrade?

Thanks,
Ryan


I’ve seen something similar to @ryan1. I got emails about two servers today after a long stretch of not hearing anything about any of them.

It looks like there are a number of “testing” monitoring servers distributed across the planet, which is promising. My server (pool.ntp.org: Statistics for 144.126.242.176) is in Singapore, and it has a much better score with the Singapore monitoring server than with the others.

Some of my other servers have a score of 20 on all monitoring servers, though. What contributes to a score dropping? Should “well, you’re monitoring it from a different continent” cause issues or not? I would expect any of my servers in the pool to serve adequate-ish time globally, but I wouldn’t be shocked to learn that my server in Korea shows high jitter when measured from the US.

After the monitoring system update, one of my servers hosted at Digital Ocean in New York had a drop in score, including from the San Jose monitoring station, which had always kept the score at 20.

Anyway, my server located in Azure, which routes through Microsoft’s global network, remains at a fixed score of 20 from all monitoring stations. I also noticed an inconsistency in the graph’s axis labels (the x axis), where the latency values all appear as 00 ms.

If you look at your details in the stats, the old monitoring server only has you at a 2.9, so it might just be a fluke that your server is having some issues now and the monitoring changed at the same time.

https://www.ntppool.org/scores/144.126.242.176

I’m also seeing all values of 00 along the Y axis.



I’m running into the same issue on my two dual-stack servers.

FWIW, I am the network operator too, so I can troubleshoot routing/BGP/peering issues as needed.

Your servers are both scored at 20; there is no issue with them at all.

Yes, indeed. We need more monitors, in particular outside Europe and North America. The new system should be able to handle a relatively large number of monitors and should figure out which ones are “best” for each server (once there are enough working monitors).
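To illustrate what “figure out which ones are best” could mean in practice, here is a minimal Go sketch under assumptions (not the actual implementation): rank each server’s candidate monitors by recent health and latency and mark the top few as active. The Candidate fields and the pickActive helper are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Candidate is a hypothetical summary of one monitor's recent results
// against a single pool server.
type Candidate struct {
	Monitor     string
	ErrorRate   float64 // fraction of recent probes that failed
	MedianRTTms float64 // median round-trip time of recent probes
}

// pickActive returns up to n monitors, preferring low error rates and then
// low RTT - one way to pick the "best" monitors for a given server.
func pickActive(cands []Candidate, n int) []Candidate {
	sort.Slice(cands, func(i, j int) bool {
		if cands[i].ErrorRate != cands[j].ErrorRate {
			return cands[i].ErrorRate < cands[j].ErrorRate
		}
		return cands[i].MedianRTTms < cands[j].MedianRTTms
	})
	if len(cands) > n {
		cands = cands[:n]
	}
	return cands
}

func main() {
	cands := []Candidate{
		{"sgsin1", 0.00, 6},
		{"sjc1", 0.00, 180},
		{"ams1", 0.12, 250},
	}
	for _, c := range pickActive(cands, 2) {
		fmt.Printf("active: %s (err=%.0f%%, rtt=%.0fms)\n", c.Monitor, c.ErrorRate*100, c.MedianRTTms)
	}
}
```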

Thanks @ryan1 & @n1zyy. I was looking at a handful of cases like yours this morning. It appears to happen to quite a few servers.

Thanks @csweeney05 for looking through some of these, too.

@rlaager, @Clock & @ChrisR in your cases it looks like the system works as intended; even if some monitors are getting poor results, the majority of the monitors “win”. It sounds like maybe the system that sends emails isn’t doing the right thing, though; I will look into that.

It works well; my point was simply about the graph. Every point on the left (offset) y axis is labelled “00”. That is not helpful.



Thanks @ask. Let me know if you need me to look at anything on my side. It’s strange that some monitors have time2 scored so poorly. I’m on Verizon enterprise and have NTP traffic assigned to the highest QoS tier on my headend firewalls.

Oh, I totally missed that. I looked at the IP and went straight to the logs, missing the graph! Yeah, your server appears to do so well in the monitoring that it’s breaking the graph! :smiley:


I have some questions about the new monitoring system:

1- Why is each server being monitored by different monitoring stations?
2- What does “legacy score” and “recentmedian score” mean? Are both used to compose the overall score?
3- What are “testing” monitoring stations? Will they become active at a certain point?

@Clock Excellent questions; I wrote up some documentation for this.

@ryan1, @n1zyy, @rlaager (and others): Thank you all for the help; I found two bugs / misfeatures (read the documentation link above for context):

  • When there were fewer than the expected number of “good” monitors (5), the system wouldn’t promote more monitors to “active”, meaning most monitors would only do an occasional test. Since we are still adding monitors, a lot of servers ended up with fewer than 5 “healthy” monitor candidates. I fixed this (about 5 hours ago).
  • The “median score” calculation would include all monitors that had probed the server in the last 20 minutes. Under normal circumstances that is overwhelmingly the active monitors, which are unlikely to have false errors; but when too few monitors were marked “active” there was a higher risk of a “bad” monitor ending up as the median score (see the sketch below). I believe this is what caused the noisy email alerts to be sent out today. I fixed this around 8pm PDT (3am UTC).
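A minimal Go sketch of that second point, with hypothetical types and made-up numbers, showing how a median taken only over recently-probing monitors can end up being a single “bad” monitor when too few monitors are active:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Probe is a hypothetical record of one monitor's latest result for a server.
type Probe struct {
	Monitor string
	Score   float64
	Seen    time.Time
}

// recentMedian returns the median score across monitors that probed the
// server within the window (20 minutes in the post above).
func recentMedian(probes []Probe, now time.Time, window time.Duration) (float64, bool) {
	var scores []float64
	for _, p := range probes {
		if now.Sub(p.Seen) <= window {
			scores = append(scores, p.Score)
		}
	}
	if len(scores) == 0 {
		return 0, false
	}
	sort.Float64s(scores)
	return scores[len(scores)/2], true
}

func main() {
	now := time.Now()
	probes := []Probe{
		{"sgsin1", -20, now.Add(-5 * time.Minute)}, // struggling monitor, probed recently
		{"sjc1", 20, now.Add(-45 * time.Minute)},   // healthy, but hasn't probed recently
		{"ams1", 20, now.Add(-50 * time.Minute)},   // healthy, but hasn't probed recently
	}
	// Only sgsin1 falls inside the window, so its -20 becomes the "median"
	// even though most monitors rate the server as fine.
	if m, ok := recentMedian(probes, now, 20*time.Minute); ok {
		fmt.Println("recent median score:", m)
	}
}
```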

Thanks for the update, @ask. I do see more monitors added for my time2 server (71.245.181.11). I still only see one active monitor, San Jose, CA, US, even though other monitors (sgsin1-1a6a7hp, for example) are giving it a perfect 20.

Yeah, yours and about 150 other servers are getting worse results. I’d expected that with the new diversity of monitors in the last day it would have been resolved, but it hasn’t been! Some months ago I started adding a traceroute feature to the monitor but decided to focus on getting everything production ready instead. Right now it would have been nice to have the traceroutes… :slight_smile:
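For illustration only, and not how the monitor’s traceroute feature was actually being built: a minimal Go sketch that shells out to the system traceroute binary so a failing probe could attach path information. The traceroute helper and its timeout are assumptions.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"time"
)

// traceroute runs the system traceroute binary (-n skips DNS lookups) and
// returns its combined output, so it could be logged alongside a bad probe.
func traceroute(ctx context.Context, host string) (string, error) {
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	out, err := exec.CommandContext(ctx, "traceroute", "-n", host).CombinedOutput()
	return string(out), err
}

func main() {
	out, err := traceroute(context.Background(), "192.0.2.1")
	if err != nil {
		fmt.Fprintln(os.Stderr, "traceroute failed:", err)
	}
	fmt.Print(out)
}
```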

The sgsin1-1a6a7hp monitor has random periods of local instability, so it ends up not being elected as active. I’ve adjusted some parameters, so maybe it will be now.