As long as the DNS servers are still responding and the website and email notifications work, I'm not worried about it. Stats are only "nice to have" features.
Failure of NTHFs (nice-to-have features) is a significant datum that indicates other potential system failures. Absent an understanding of the complete system architecture and the root causes of this inconvenience, one cannot rule out other types of failures, up to and including total system collapse.
Admittedly, my viewpoint derives from my time engineering safety-critical systems, where “No anomalous event is trivial” is a guiding principle.
One of the ClickHouse database clusters (the one processing the DNS logs) ran out of disk space a few days ago. In the process of cleaning that up I upgraded ClickHouse itself, and the new version didn't support (or the operator configuring it disabled it?) the "default user" some components had been using. (All sorts of things went slightly sideways, and lots of improvements to the setup and disk usage were made.)
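For illustration, here's a minimal sketch of what authenticating to ClickHouse's HTTP interface with an explicit user looks like, instead of relying on the built-in "default" user. The host, user name, password, and query are all hypothetical, not the actual setup:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// Hypothetical endpoint and query; 8123 is ClickHouse's standard HTTP port.
	req, err := http.NewRequest("POST", "http://clickhouse.example.internal:8123/",
		strings.NewReader("SELECT count() FROM dns_logs"))
	if err != nil {
		panic(err)
	}
	// Authenticate explicitly rather than falling back to the "default" user.
	req.Header.Set("X-ClickHouse-User", "data_api")
	req.Header.Set("X-ClickHouse-Key", "password-from-vault")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```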
I only had time to fix the more critical parts right then and only got the data-api fixed today.
The code change was small; most of the changes were in the ClickHouse configuration, in Vault (which securely stores things like passwords), and in the Flux configuration that deploys the software.
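As a rough sketch of the Vault side, a component can fetch a database password from Vault's KV version 2 HTTP API. The address, token, and secret path below are made up for illustration; a real deployment would get the token from the environment or a Kubernetes auth method:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical Vault address and secret path.
	req, err := http.NewRequest("GET",
		"https://vault.example.internal:8200/v1/secret/data/clickhouse/data-api", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("X-Vault-Token", "s.example-token") // normally from the environment

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// KV v2 nests the secret fields under data.data.
	var out struct {
		Data struct {
			Data map[string]string `json:"data"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println("password present:", out.Data.Data["password"] != "")
}
```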
@MagicNTP I missed that the tool that updates the status page graphs also queries ClickHouse; I'll move that query into the data-api tool so it's consolidated.
Yeah, I was disappointed it took me so long to get it fixed. I usually have better luck judging risk, so I do maintenance and changes according to the time, energy, and awareness I have available to follow up if something goes wrong.
(As an example, I have some big updates to the monitoring system about ready, but I've been traveling, so the changes are safely held back in just the development system.)
I'm running into problems with NTP server checking (and am thus unable to add more servers to the pool).
It worked fine around six hours ago, and I added one server successfully. Now I’m trying to add a couple more and always end up with this message:
Could not decode NTP response from trace server
I ran ntpdate -q from two machines on opposite sides of the globe, and it worked fine. Moreover, the check fails even for time.cloudflare.com (162.159.200.1) and time.google.com (216.239.35.4), so the problem doesn't look like it's on my end.
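For context, a check like this amounts to sending an NTP client packet and decoding the 48-byte reply; if the reply is short or malformed, you get a "could not decode" style error. A minimal SNTP query in Go looks roughly like this (the server choice and error handling are illustrative, not the pool's actual checker):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net"
	"time"
)

func main() {
	conn, err := net.Dial("udp", "time.cloudflare.com:123")
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	conn.SetDeadline(time.Now().Add(5 * time.Second))

	// 48-byte client packet: LI=0, VN=4, Mode=3 (client) -> first byte 0x23.
	var req [48]byte
	req[0] = 0x23
	if _, err := conn.Write(req[:]); err != nil {
		panic(err)
	}

	var resp [48]byte
	n, err := conn.Read(resp[:])
	if err != nil || n < 48 {
		// Roughly where a checker would report it could not decode the response.
		panic(fmt.Sprintf("short or failed read: n=%d err=%v", n, err))
	}

	// Transmit timestamp: seconds since 1900-01-01, big-endian, at bytes 40-43.
	secs := binary.BigEndian.Uint32(resp[40:44])
	const ntpToUnix = 2208988800 // offset between the 1900 and 1970 epochs
	fmt.Println("server time:", time.Unix(int64(secs)-ntpToUnix, 0).UTC())
}
```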
Yeah, thanks for pointing it out (and to the operator who emailed the help address).
I'm not actually sure what happened. The production monitoring API got timeouts connecting to the MQTT broker. The beta and development instances were working fine, and the monitors were connecting to MQTT without trouble too. It hasn't happened before, and it's not clear why it did now.
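For anyone curious what a bounded MQTT connection attempt looks like, here's a sketch using the Eclipse Paho Go client. The broker address and client ID are placeholders, and this is not the monitoring API's actual code, just the general shape of making a hung broker surface as a timeout rather than a stall:

```go
package main

import (
	"fmt"
	"time"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
	// Hypothetical broker address; the real one sits inside the cluster.
	opts := mqtt.NewClientOptions().
		AddBroker("tls://mqtt.example.internal:8883").
		SetClientID("monitor-api-probe").
		SetConnectTimeout(5 * time.Second)

	client := mqtt.NewClient(opts)
	token := client.Connect()
	// Bound the wait so a dead broker produces an error we can act on.
	if !token.WaitTimeout(10 * time.Second) {
		fmt.Println("connect timed out")
		return
	}
	if err := token.Error(); err != nil {
		fmt.Println("connect failed:", err)
		return
	}
	fmt.Println("connected")
	client.Disconnect(250)
}
```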
The system has a fallback for checking NTP servers if the path via the monitoring system doesn’t work, but that didn’t work either (and that code path isn’t covered by tracing).
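A sketch of what covering that fallback path with tracing could look like, using OpenTelemetry's Go API. The tracer name, span name, and checkServerFallback function are hypothetical stand-ins, not the real code:

```go
package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// checkServerFallback is a hypothetical stand-in for the direct-check
// path; wrapping it in a span makes failures there visible in traces.
func checkServerFallback(ctx context.Context, server string) error {
	ctx, span := otel.Tracer("monitor-api").Start(ctx, "ntp.check.fallback")
	defer span.End()
	span.SetAttributes(attribute.String("ntp.server", server))

	// ... perform the direct NTP check here ...
	_ = ctx
	return nil
}

func main() {
	if err := checkServerFallback(context.Background(), "192.0.2.1"); err != nil {
		fmt.Println("fallback check failed:", err)
	}
}
```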