When the certificate actually expires in around 1h 20m, the agent might start having problems communicating. We’ll see. But yes, there are issues, it seems.
openssl x509 -in /var/lib/ntppool-agent/test/cert.pem -text -noout | grep Not
Not Before: Apr 19 11:03:34 2026 GMT
Not After : Apr 24 11:04:04 2026 GMT
date
Fri Apr 24 09:44:01 UTC 2026
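For anyone who wants to double-check the "around 1h 20m" figure, here is a minimal sketch of the arithmetic in Python. It only parses the two timestamps pasted above; the helper name parse_openssl_time is made up for this example and is not part of any tool mentioned in this thread.

```python
from datetime import datetime, timezone

def parse_openssl_time(value: str) -> datetime:
    """Parse a timestamp as printed by `openssl x509 -text` (always GMT)."""
    parsed = datetime.strptime(value, "%b %d %H:%M:%S %Y GMT")
    return parsed.replace(tzinfo=timezone.utc)

# Values copied from the openssl and date output above.
not_after = parse_openssl_time("Apr 24 11:04:04 2026 GMT")
now = datetime(2026, 4, 24, 9, 44, 1, tzinfo=timezone.utc)

remaining = not_after - now
print(remaining)  # 1:20:03
```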
Most of my test monitors have been reported as disconnected since 2026-04-22. Only two were still connected. I just restarted one of them as part of updating the ntppool-agent package, and it’s now gone as well.
So I guess as long as a back-end connection has been active since before the current outage on beta, it’ll keep working. But once it needs to be re-established, e.g., due to an agent restart, or newly established (as during registration), the daemon will refuse to run because it cannot successfully connect to the back-end.
It will be interesting to see whether the upcoming expiry of the certificate will also break the connection. I guess as long as it tries to obtain a new certificate over an existing connection, it might be fine. But should it get a new certificate and try to re-establish the connection, that might break things.
Then again, it seems likely that once certificate renewal starts working again, the other issue currently preventing connections will be solved as well. That assumes both have the same underlying cause: systems not yet being properly interconnected again after the recent migration of the beta clusters (and many other things before that) to new infrastructure.
Repeating much of what MagicNTP wrote above, but as I had this message already prepared I’ll paste it here anyway:
Ho hum, nothing happened when the certificate expired. I had left this test agent instance running, waiting for the cert to expire (the server where I tried to restart the agent was a different one). So this instance will keep testing for now.
But I get it, the agent keeps a connection open all the time and the cert gets checked only when establishing the connection. Restarting the service would probably trigger a certificate check. But restarting the service would currently fail anyway due to the issue mentioned above.
Restarting the agent that was already failing to start up (my previous message) but with a certificate that had expired in the meantime gave me this:
It looks like the backend has started now, so I can continue adding the test monitor.
The next issue: maybe I should open another topic for that?
The nearest airport (~10 km) is missing from the list of airports that I could select for the name of the monitor.
Maybe a new topic for the airport codes would be nice. I don’t think that mechanism has changed during this ongoing service move. Reading this message first may be informative.
As for this test monitor, I now get:
level=ERROR msg="could not get config, http error" env=test ip_version=v4 err="unavailable: tls: failed to verify certificate: x509: certificate is valid for localhost, ingress.local, not api.test.mon.ntppool.dev"
so something is still not quite working.
Edit: Additionally, api.test.mon.ntppool.dev has an IPv6 address as well but that seems to refuse connections.
level=ERROR msg="batch processing" env=test err="getting server list: unavailable: tls: failed to verify certificate: x509: certificate is valid for localhost, ingress.local, not api.test.mon.ntppool.dev"
level=INFO msg="detected certificate/connection error, flushing connection pool" pool-flusher.url=https://api.test.mon.ntppool.dev/monitor.v2.MonitorService/GetServers pool-flusher.error="tls: failed to verify certificate: x509: certificate is valid for localhost, ingress.local, not api.test.mon.ntppool.dev"
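To illustrate what the error in these log lines is saying, here is a rough Python sketch of the name check a TLS client performs. This is not the agent's code, and it is a simplified version of the real matching rules (exact match, or a single "*" standing in for the leftmost label). The function names san_matches and cert_covers are invented for this example; the names in the list are the ones the error message reports the certificate as valid for.

```python
def san_matches(hostname: str, pattern: str) -> bool:
    """Simplified name check: exact match, or '*' as the leftmost label."""
    host_labels = hostname.lower().rstrip(".").split(".")
    pattern_labels = pattern.lower().rstrip(".").split(".")
    if len(host_labels) != len(pattern_labels):
        return False
    if pattern_labels[0] == "*":
        # The wildcard stands in for exactly one label; the rest must be equal.
        return host_labels[1:] == pattern_labels[1:]
    return host_labels == pattern_labels

def cert_covers(hostname: str, names: list[str]) -> bool:
    """True if any name in the certificate covers the requested hostname."""
    return any(san_matches(hostname, name) for name in names)

# Names taken from the error message above.
names = ["localhost", "ingress.local"]
print(cert_covers("api.test.mon.ntppool.dev", names))  # False
print(cert_covers("ingress.local", names))             # True
```

So the server answering for api.test.mon.ntppool.dev appears to be presenting a certificate issued for other names, which would explain why the agent refuses the connection.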
Many test monitors seem to be unable to phone home now, based on the plots on e.g. 194.100.49.152.
Hmm.. The graph may be misleading in this situation because the newest (rightmost) plots are actually fairly old already. The graph adjusts the timescale based on the timestamps of the available measurements with the newest available measurement always on the right edge of the graph. If no new measurements come in, the graph does not change.
The CSV tells the truth: the data from the monitors stopped flowing at around 12:42:16, i.e. 6.5 hours ago.
EDIT: I would have written a new message, but “An error occurred: No more than 3 consecutive replies are allowed. Please edit your previous reply, or wait for someone to reply to you.”
What I meant to say: Looks like the certificate problem is fixed now and test monitors can send their measurements again.
EDIT2: IPv6 address of api.test.mon.ntppool.dev is still unreachable, though.
Hi everyone, apologies for the lack of updates. I wanted to get ready to move the production site this weekend, so I moved the beta site over some evenings in the last week. Then I had some busy work days and a 17-hour flight, so I didn’t realize until later (thanks for the messages here!) that renewed certificates didn’t work. I’d prepared and tested the certificate authority flip back in December, but … didn’t deploy the updated API server. Oops.
I think everything in your analysis here was accurate.
@NTPman The airport name issue could be related to the move, or a random timing of a MaxMind update maybe. I haven’t changed the locationcode tool for a while.