As long as the DNS servers are still responding and the website and email notifications work, I'm not worried about it. Stats are only "nice to have" features.
Failure of NTHFs (nice-to-have features) is a significant datum that indicates other potential system failures. Absent an understanding of the complete system architecture and the root causes of this inconvenience, one cannot rule out other types of failures, up to and including total system collapse.
Admittedly, my viewpoint derives from my time engineering safety-critical systems, where “No anomalous event is trivial” is a guiding principle.
One of the ClickHouse database clusters (the one processing the DNS logs) ran out of disk space a few days ago. In the process of cleaning that up I upgraded ClickHouse itself, and the new version didn't support (or the operator configuring it disabled it?) the "default user" some components had been using. (All sorts of things went slightly sideways, and lots of improvements to the setup and disk usage were made.)
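For illustration, here's a minimal sketch of what authenticating to ClickHouse's HTTP interface with an explicit user looks like, instead of relying on the built-in "default" user. The host, user name, password, and query are all hypothetical, not the actual setup:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// Hypothetical endpoint and query; 8123 is ClickHouse's standard HTTP port.
	req, err := http.NewRequest("POST", "http://clickhouse.example.internal:8123/",
		strings.NewReader("SELECT count() FROM dns_logs"))
	if err != nil {
		panic(err)
	}
	// Authenticate explicitly rather than falling back to the "default" user.
	req.Header.Set("X-ClickHouse-User", "data_api")
	req.Header.Set("X-ClickHouse-Key", "password-from-vault")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```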
I only had time to fix the more critical parts right then and only got the data-api fixed today.
The code change was small; most of the changes were in the ClickHouse configuration, in Vault (which securely stores things like passwords), and in the Flux configuration that deploys the software.
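As a rough sketch of the Vault side, a component can fetch a database password from Vault's KV version 2 HTTP API. The address, token, and secret path below are made up for illustration; a real deployment would get the token from the environment or a Kubernetes auth method:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical Vault address and secret path.
	req, err := http.NewRequest("GET",
		"https://vault.example.internal:8200/v1/secret/data/clickhouse/data-api", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("X-Vault-Token", "s.example-token") // normally from the environment

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// KV v2 nests the secret fields under data.data.
	var out struct {
		Data struct {
			Data map[string]string `json:"data"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println("password present:", out.Data.Data["password"] != "")
}
```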
@MagicNTP I missed that the tool that updates the status page graphs also queries ClickHouse; I'll move that query into the data-api tool so it's consolidated.
Yeah, I was disappointed it took me so long to get it fixed. I usually have better luck judging risk, so I do maintenance and changes according to the time, energy, and awareness I have available to follow up if something goes wrong.
(As an example, I have some big updates to the monitoring system about ready, but I've been traveling, so the changes are safely held back in just the development system.)
I'm running into problems with NTP server checking (and am thus unable to add more servers to the pool).
It worked fine around six hours ago, and I added one server successfully. Now I’m trying to add a couple more and always end up with this message:
Could not decode NTP response from trace server
I ran ntpdate -q from two machines on opposite sides of the globe, and it worked fine. Moreover, the check fails even for time.cloudflare.com (162.159.200.1) and time.google.com (216.239.35.4), so the problem doesn't look like it's on my end.
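For context, a check like this amounts to sending an NTP client packet and decoding the 48-byte reply; if the reply is short or malformed, you get a "could not decode" style error. A minimal SNTP query in Go looks roughly like this (the server choice and error handling are illustrative, not the pool's actual checker):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net"
	"time"
)

func main() {
	conn, err := net.Dial("udp", "time.cloudflare.com:123")
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	conn.SetDeadline(time.Now().Add(5 * time.Second))

	// 48-byte client packet: LI=0, VN=4, Mode=3 (client) -> first byte 0x23.
	var req [48]byte
	req[0] = 0x23
	if _, err := conn.Write(req[:]); err != nil {
		panic(err)
	}

	var resp [48]byte
	n, err := conn.Read(resp[:])
	if err != nil || n < 48 {
		// Roughly where a checker would report it could not decode the response.
		panic(fmt.Sprintf("short or failed read: n=%d err=%v", n, err))
	}

	// Transmit timestamp: seconds since 1900-01-01, big-endian, at bytes 40-43.
	secs := binary.BigEndian.Uint32(resp[40:44])
	const ntpToUnix = 2208988800 // offset between the 1900 and 1970 epochs
	fmt.Println("server time:", time.Unix(int64(secs)-ntpToUnix, 0).UTC())
}
```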
Yeah, thanks for pointing it out (and to the operator who emailed the help address).
I'm not actually sure what happened. The production monitoring API got timeouts connecting to the MQTT broker. The beta and development instances were working fine, and the monitors were connecting to MQTT without trouble too. It hasn't happened before, and it's not clear why it did now.
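For anyone curious what a bounded MQTT connection attempt looks like, here's a sketch using the Eclipse Paho Go client. The broker address and client ID are placeholders, and this is not the monitoring API's actual code, just the general shape of making a hung broker surface as a timeout rather than a stall:

```go
package main

import (
	"fmt"
	"time"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
	// Hypothetical broker address; the real one sits inside the cluster.
	opts := mqtt.NewClientOptions().
		AddBroker("tls://mqtt.example.internal:8883").
		SetClientID("monitor-api-probe").
		SetConnectTimeout(5 * time.Second)

	client := mqtt.NewClient(opts)
	token := client.Connect()
	// Bound the wait so a dead broker produces an error we can act on.
	if !token.WaitTimeout(10 * time.Second) {
		fmt.Println("connect timed out")
		return
	}
	if err := token.Error(); err != nil {
		fmt.Println("connect failed:", err)
		return
	}
	fmt.Println("connected")
	client.Disconnect(250)
}
```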
The system has a fallback for checking NTP servers if the path via the monitoring system doesn’t work, but that didn’t work either (and that code path isn’t covered by tracing).
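A sketch of what covering that fallback path with tracing could look like, using OpenTelemetry's Go API. The tracer name, span name, and checkServerFallback function are hypothetical stand-ins, not the real code:

```go
package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// checkServerFallback is a hypothetical stand-in for the direct-check
// path; wrapping it in a span makes failures there visible in traces.
func checkServerFallback(ctx context.Context, server string) error {
	ctx, span := otel.Tracer("monitor-api").Start(ctx, "ntp.check.fallback")
	defer span.End()
	span.SetAttributes(attribute.String("ntp.server", server))

	// ... perform the direct NTP check here ...
	_ = ctx
	return nil
}

func main() {
	if err := checkServerFallback(context.Background(), "192.0.2.1"); err != nil {
		fmt.Println("fallback check failed:", err)
	}
}
```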