Monitoring upgrade

@Clock Excellent questions; I wrote up some documentation for this.

@ryan1, @n1zyy, @rlaager (and others): Thank you all for the help; I found two bugs / misfeatures (read the documentation link above for context):

  • When there were fewer than the expected number of “good” monitors (5), the system wouldn’t add more “active” servers, meaning most monitors would only do an occasional test. Since we are still adding monitors, a lot of servers ended up with fewer than 5 “healthy” monitor candidates. I fixed this (about 5 hours ago).
  • The “median score” calculation would include all servers that had a probe in the last 20 minutes. Under normal circumstances these are overwhelmingly the active monitors, which are unlikely to have false errors; but when too few servers were marked “active” there was a higher risk of the median score coming from a “bad” server. I believe this is what caused the noisy email alerts to be sent out today. I fixed this around 8pm PDT (3am UTC).
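
To make the two issues concrete, here is a rough Python sketch. It is not the actual monitoring code: the `Monitor` class, the status labels, and both function names are made up for illustration. Restricting the median to “active” monitors is only one plausible way to address the second issue; how the real fix works is not described here.

```python
from dataclasses import dataclass
from statistics import median

TARGET_ACTIVE = 5        # expected number of "good" monitors, from the post
RECENT_WINDOW_MIN = 20   # probes in the last 20 minutes count as recent


@dataclass
class Monitor:
    """Hypothetical record of one monitor's view of a single server."""
    name: str
    status: str                 # "active" or "candidate" (assumed labels)
    healthy: bool
    score: float
    minutes_since_probe: float


def promote_until_target(monitors, target=TARGET_ACTIVE):
    """Promote healthy candidates until `target` healthy monitors are active,
    or until no healthy candidates are left."""
    active = sum(1 for m in monitors if m.status == "active" and m.healthy)
    for m in monitors:
        if active >= target:
            break
        if m.status != "active" and m.healthy:
            m.status = "active"
            active += 1
    return monitors


def median_score(monitors, active_only=True, window=RECENT_WINDOW_MIN):
    """Median score over monitors with a recent probe.

    With active_only=False this mimics the old behaviour described above
    (every recent probe counts); with active_only=True only active
    monitors contribute. Returns None if nothing qualifies."""
    recent = [m for m in monitors if m.minutes_since_probe <= window]
    if active_only:
        recent = [m for m in recent if m.status == "active"]
    if not recent:
        return None
    return median(m.score for m in recent)


# Two active monitors see a good server; three occasional testers report errors.
sample = [
    Monitor("a1", "active", True, 20.0, 2),
    Monitor("a2", "active", True, 20.0, 3),
    Monitor("c1", "candidate", True, -5.0, 10),
    Monitor("c2", "candidate", True, -10.0, 15),
    Monitor("c3", "candidate", False, -8.0, 18),
]
old_median = median_score(sample, active_only=False)
new_median = median_score(sample, active_only=True)
```

With too few active monitors, the occasional testers outvote them in `old_median`, which is the situation that can produce a misleadingly low median and a false alert.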