Issues with some pool backend systems?

Is it just me, or are others seeing symptoms as well that seem to indicate issues with some pool backend systems?

@Ask, seems some APIs don’t return proper data, e.g., for the offset/scores graphs and client distribution statistics of at least some servers:

And the API(s) providing data for the DNS Queries/Peak DNS Queries graphs on the status page seem to have stopped providing data some time last night:

All in all probably nothing critical is impacted, mostly “just” presentation of some metrics.

Indeed scores graphs for my servers are also not showing.

The data backend seems offline:

As long as the DNS servers are still responding and the website and email notifications work, I’m not worried about it. Stats are only “nice to have features” :slight_smile:

Failure of NTHFs is a significant datum that indicates other potential system failures. Absent an understanding of the complete system architecture and the root causes of this inconvenience, one cannot rule out other types of failures, up to and including total system collapse.

Admittedly, my viewpoint derives from my time engineering safety-critical systems, where “No anomalous event is trivial” is a guiding principle.

It’s back!

One of the ClickHouse database clusters (the one processing the DNS logs) ran out of disk space some days ago. In the process of cleaning that up I upgraded ClickHouse itself, and the new version didn’t support (or the operator configuring it disabled?) the “default user” some components had been using. (All sorts of things went slightly sideways; lots of improvements to the setup and disk usage were made.)

I only had time to fix the more critical parts right then and only got the data-api fixed today.

The code change was small; most of the changes were in the ClickHouse configuration, in Vault (which securely stores things like passwords), and in the Flux configuration that deploys the software.

Thanks @ask for your efforts and time!
Nevertheless slightly concerning…

I’m having some network issues with IPv6 myself after switching routers (OPNsense). :laughing:

@MagicNTP I missed that the tool that updates the status page graphs also queries ClickHouse; I’ll add that query to the data-api tool to have it consolidated.

Yeah, I was disappointed it took me so long to get it fixed. I usually have better luck judging risk, so I do maintenance and changes according to the time / energy / awareness I have available to follow up if something goes wrong.

(As an example, I have some big updates to the monitoring system about ready, but I’ve been traveling, so the changes are safely held back in just the development system.)

Excellent. Have you shared these ideas with the community? Would be nice to get some insight into what you are planning.

I’m running into problems with NTP server checking (and am thus unable to add more servers to the pool).

It worked fine around six hours ago, and I added one server successfully. Now I’m trying to add a couple more and always end up with this message:

Could not decode NTP response from trace server

I ran ntpdate -q from two machines on opposite sides of the globe, and it worked fine. Moreover, the check fails for time.cloudflare.com (162.159.200.1) and time.google.com (216.239.35.4).

Can you please look into it? Thanks!
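
For anyone who wants to cross-check a server independently of the pool's checker, here is a minimal sketch of what a query like ntpdate -q does at the packet level: send a 48-byte client request and decode the fixed NTP header of the reply. It is not the pool's trace server code, and it says nothing about why the "Could not decode" message appeared; the function names (build_request, decode_response, query) are made up for this example.

```python
import socket
import struct

NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01
NTP_PACKET_SIZE = 48           # fixed header size, no extension fields


def build_request(version: int = 4) -> bytes:
    """Build a 48-byte client request: LI=0, VN=version, Mode=3 (client)."""
    first_byte = (0 << 6) | (version << 3) | 3
    return bytes([first_byte]) + bytes(NTP_PACKET_SIZE - 1)


def decode_response(data: bytes) -> dict:
    """Decode the fixed NTP header; raise ValueError if it looks malformed."""
    if len(data) < NTP_PACKET_SIZE:
        raise ValueError(f"short packet: {len(data)} bytes")
    first, stratum, _poll, _precision = struct.unpack("!BBbb", data[:4])
    mode = first & 0x7
    if mode != 4:
        raise ValueError(f"unexpected mode {mode}, expected 4 (server)")
    # Transmit timestamp: 32-bit seconds + 32-bit fraction at bytes 40..47.
    tx_seconds, tx_fraction = struct.unpack("!II", data[40:48])
    return {
        "leap": first >> 6,
        "version": (first >> 3) & 0x7,
        "mode": mode,
        "stratum": stratum,
        "transmit_unix": tx_seconds - NTP_EPOCH_OFFSET + tx_fraction / 2**32,
    }


def query(host: str, timeout: float = 2.0) -> dict:
    """Send one request over IPv4/UDP port 123 and decode the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_request(), (host, 123))
        data, _addr = sock.recvfrom(512)
    return decode_response(data)


if __name__ == "__main__":
    print(query("time.cloudflare.com"))
```

If this decodes a reply from the same server the pool rejects, the problem is more likely on the checking side than on the server.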

The DNS query counts on the status page are updating again (via data-api changes and an update to the status page poster).

It started working again just a few minutes ago. Thanks!

Yeah, thanks for pointing it out (and to the operator who emailed the help address).

I’m not actually sure what happened. The production monitoring API got timeouts connecting to the MQTT broker. The beta and development instances were working fine, and the monitors were connecting fine to MQTT too. It hasn’t happened before and it’s not clear why it did now.

The system has a fallback for checking NTP servers if the path via the monitoring system doesn’t work, but that didn’t work either (and that code path isn’t covered by tracing).
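
To illustrate the general shape of such a fallback (not the pool's actual code, which isn't shown in this thread), here is a rough sketch: try each check path in order, record why each one failed, and log which path answered so the fallback is visible when it kicks in. check_with_fallback, the path names, and the stub check functions are all hypothetical.

```python
import logging
from typing import Callable, Dict, Sequence, Tuple

logger = logging.getLogger("ntp_check")

# A check path takes an IP address and returns a result dict, or raises.
CheckFn = Callable[[str], Dict]


def check_with_fallback(ip: str, paths: Sequence[Tuple[str, CheckFn]]) -> Dict:
    """Try each named check path in order and return the first success."""
    errors = []
    for name, check in paths:
        try:
            result = check(ip)
        except Exception as exc:
            # Log every failure so a broken fallback path is not silent.
            logger.warning("check path %s failed for %s: %s", name, ip, exc)
            errors.append(f"{name}: {exc}")
            continue
        logger.info("check path %s succeeded for %s", name, ip)
        return {"path": name, **result}
    raise RuntimeError("all check paths failed: " + "; ".join(errors))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)

    def via_monitor(ip: str) -> Dict:
        # Stand-in for the primary path timing out.
        raise TimeoutError("broker connection timed out")

    def direct_query(ip: str) -> Dict:
        # Stand-in for a direct NTP query from the API host.
        return {"stratum": 2}

    print(check_with_fallback("192.0.2.1", [("monitor", via_monitor),
                                            ("direct", direct_query)]))
```

Collecting the per-path errors into the final exception means that when everything fails, the message says which path failed and why, rather than only reporting the last error.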