Monitoring upgrade

ask · March 20, 2023, 4:26am

@Clock Excellent questions; I wrote up some documentation for this.

@ryan1, @n1zyy, @rlaager (and others): Thank you all for the help; I found two bugs / misfeatures (read the documentation link above for context):

When there were fewer than the expected number of “good” monitors (5), the system wouldn’t add more “active” servers, meaning most monitors would only do an occasional test. Since we are still adding monitors, a lot of servers ended up with fewer than 5 “healthy” monitor candidates. I fixed this (about 5 hours ago).
The “median score” calculation would include all servers that had a probe in the last 20 minutes. Under normal circumstances these are overwhelmingly the active monitors, which are unlikely to have false errors, but when too few servers were marked “active” there was a higher risk of a “bad” server being the median score. I believe this is what caused the noisy email alerts to be sent out today. I fixed this around 8pm PDT (3am UTC).
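
To make the two issues concrete, here is a rough Python sketch. It is not the pool's actual code: the `Monitor` class, the status labels, and both function names are made up for illustration, and restricting the median to active monitors is only one plausible way to fix the second issue. The post does not say how the fix was actually implemented.

```python
from dataclasses import dataclass
from statistics import median

TARGET_ACTIVE = 5  # expected number of "good" monitors, per the post


@dataclass
class Monitor:
    name: str
    status: str    # "active" or "testing" (hypothetical labels)
    healthy: bool
    score: float   # latest score this monitor gave the server


def top_up_active(monitors):
    """Promote healthy non-active monitors until TARGET_ACTIVE is reached
    or no healthy candidates remain (illustrates the first issue)."""
    active = [m for m in monitors if m.status == "active" and m.healthy]
    spare = [m for m in monitors if m.status != "active" and m.healthy]
    for m in spare[: max(0, TARGET_ACTIVE - len(active))]:
        m.status = "active"
        active.append(m)
    return active


def median_score(monitors):
    """Median over active monitors only (assumed fix for the second issue).
    Returns None when there are no active monitors."""
    scores = [m.score for m in monitors if m.status == "active"]
    return median(scores) if scores else None
```

With only two active monitors and a few healthy candidates, `top_up_active` promotes the candidates even though the target of 5 cannot be reached, and an unhealthy monitor with a very low score stays out of the median.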
