Still problems? Or is my install outdated (4.1.5): "...detected certificate/connection error, flushing connection pool"

I noticed a few days ago that my two monitors do not seem to “monitor and report” anymore, as the stats say 0.

Then I picked this up in my logs:

==> ntppool-agentd.log <==
May 9 18:45:51 sonne ntppool-agent[3119014]: level=INFO msg="detected certificate/connection error, flushing connection pool" pool-flusher.url=https://api.mon.ntppool.dev/monitor.v2.MonitorService/SubmitResults pool-flusher.error="local error: tls: bad record MAC" pool-flusher.status=0 pool-flusher.trace_id=b1968ed3e55d4e98499e9fcbfd2cc84b pool-flusher.span_id=97dadf6211170445
May 9 18:45:51 sonne ntppool-agent[3119014]: level=ERROR msg="batch processing" env=prod ip_version=v4 monitor_ip=85.215.122.93 err="SubmitResults: unavailable: net/http: HTTP/1.x transport connection broken: http: ContentLength=2715 with Body length 0" trace_id=b1968ed3e55d4e98499e9fcbfd2cc84b span_id=0fd0ebbc0ef26ef8

Googling the certificate error suggests that the cause may lie in the Go library itself. Does anybody know how to fix this on my end, or is it like this for everyone?

Ubuntu 24.04, ntppool-agent 4.1.5 from the repo. Thanks for any pointers.

Florian

I am also seeing these kinds of error messages on pretty much all monitors I have looked at so far:

2026-05-09T18:20:07.255716602Z time=2026-05-09T18:20:07.254Z level=ERROR msg="batch processing" env=prod ip_version=v4 err="SubmitResults: deadline_exceeded: Post \"https://api.mon.ntppool.dev/monitor.v2.MonitorService/SubmitResults\": net/http: timeout awaiting response headers" trace_id=306208535e926c03e9d407422f4e058e span_id=61a05a6c78a61445
2026-05-09T18:20:14.118333235Z time=2026-05-09T18:20:14.116Z level=INFO msg="detected certificate/connection error, flushing connection pool" pool-flusher.url=https://api.mon.ntppool.dev/monitor.v2.MonitorService/SubmitResults pool-flusher.error="local error: tls: bad record MAC" pool-flusher.status=0 pool-flusher.trace_id=787e8a54b85a36b182527c1653a3e076 pool-flusher.span_id=7836206be4c1fc82
2026-05-09T18:20:14.488379913Z time=2026-05-09T18:20:14.487Z level=ERROR msg="batch processing" env=prod ip_version=v6 err="SubmitResults: unavailable: net/http: HTTP/1.x transport connection broken: http: ContentLength=1811 with Body length 0" trace_id=787e8a54b85a36b182527c1653a3e076 span_id=fc2cc4d98a386b4c

And there are others as well, e.g., related to MQTT.
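For anyone who wants to see which kinds of errors dominate in their own agent log, here is a minimal sketch that tallies ERROR lines by the first two parts of the `err` field. It assumes the key=value format seen in the lines above; the helper names (`parse_fields`, `error_classes`) are made up for this example, and the sample lines are shortened copies from this thread with the trace and span IDs dropped.

```python
import shlex
from collections import Counter

# Shortened copies of log lines quoted in this thread (trace/span IDs removed).
SAMPLE = r'''
time=2026-05-09T18:20:07.254Z level=ERROR msg="batch processing" env=prod ip_version=v4 err="SubmitResults: deadline_exceeded: Post \"https://api.mon.ntppool.dev/monitor.v2.MonitorService/SubmitResults\": net/http: timeout awaiting response headers"
time=2026-05-09T18:20:14.116Z level=INFO msg="detected certificate/connection error, flushing connection pool" pool-flusher.error="local error: tls: bad record MAC" pool-flusher.status=0
time=2026-05-09T18:20:14.487Z level=ERROR msg="batch processing" env=prod ip_version=v6 err="SubmitResults: unavailable: net/http: HTTP/1.x transport connection broken: http: ContentLength=1811 with Body length 0"
'''


def parse_fields(line):
    """Return the key=value fields of one log line as a dict."""
    fields = {}
    try:
        # shlex keeps "quoted values" together and understands \" escapes.
        tokens = shlex.split(line)
    except ValueError:
        return fields  # unbalanced quotes: skip this line
    for token in tokens:
        key, sep, value = token.partition("=")
        if sep and key:
            fields[key] = value
    return fields


def error_classes(lines):
    """Count ERROR lines, grouped by the first two colon-separated parts of err."""
    counts = Counter()
    for line in lines:
        fields = parse_fields(line)
        if fields.get("level") != "ERROR" or "err" not in fields:
            continue
        parts = [part.strip() for part in fields["err"].split(":")]
        counts[": ".join(parts[:2])] += 1
    return counts


if __name__ == "__main__":
    for name, n in error_classes(SAMPLE.splitlines()).most_common():
        print(n, name)
```

Syslog prefixes such as `May 9 18:45:51 sonne ntppool-agent[...]:` contain no `=` and are simply ignored, so the same sketch should work on journal output as well, provided the quotes in the log are plain ASCII quotes.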

Ask mentioned that, for as yet unknown reasons, the database in the new system is slower than the one in the old system used to be. Looking at the monthly graphs on the status page, one can see the drop from around 14000 checks/minute to about 7000 checks/minute, which I think might be a deliberate measure to reduce load on the database. But there’s a further drop more recently, and also somewhat noticeable jitter as of late. I guess the database is still overloaded, hence the connection issues when more data is pushed towards the system than can be stored in the database, or when some other bottleneck is hit.

I guess the “bad record MAC” message might be due to the underlying connection error producing an incomplete TLS record (data missing at the end), so that MAC verification fails (not necessarily at the cryptographic level; a sanity check performed before the MAC verification proper could fail as well).
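To illustrate only the general principle behind that guess: any integrity check over a record fails as soon as bytes are missing. The sketch below uses a toy record format (payload followed by an HMAC-SHA256 tag) that is an assumption for illustration. It is not how TLS frames records, and it neither confirms nor refutes what actually happens inside the Go TLS stack here.

```python
import hashlib
import hmac

TAG_LEN = 32  # length of an HMAC-SHA256 tag in bytes


def seal(key, payload):
    """Toy record: payload followed by its HMAC-SHA256 tag."""
    return payload + hmac.new(key, payload, hashlib.sha256).digest()


def open_record(key, record):
    """Return the payload if the trailing tag verifies, else raise ValueError."""
    payload, tag = record[:-TAG_LEN], record[-TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("bad record MAC")
    return payload


if __name__ == "__main__":
    key = b"\x01" * 32
    record = seal(key, b"batch of monitoring results")
    print(open_record(key, record))  # intact record verifies
    try:
        open_record(key, record[:-5])  # last 5 bytes lost in transit
    except ValueError as exc:
        print("truncated record rejected:", exc)
```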

I think I fixed the performance problems (lock contention) in the MySQL database this evening with two commits: “fix(scorer): move UpdateServer out of batch transaction” (ntppool/monitor@e03c40a) and “bad_server_notifications: cut row-lock contention with subquery DELETE” (abh/ntppool@cc4a6fe).

I put it in production a little while ago and the number of monitoring checks is going back to normal. This took a bunch of time and several false leads to figure out, so I’ll get back to the other reported issues next.

Thanks! Looks like the error messages on the monitors stopped around 8 UTC, and I was able to successfully add a new server just now.

@ask, I am afraid something seems off: the recent_median (“monitor” 24) doesn’t seem to be running at the moment. At least there aren’t any recent entries in the CSV logs. Also:

  • recently added servers’ overall score is “stuck” at a low value even if all “testing” monitors are well above that now
  • promotion from “Testing” to “Active” doesn’t seem to happen anymore, while promotion from “Candidate” to “Testing” does, leading to an excess of “Testing” but no “Active” monitors for some servers

server #65791


server #65789

Or is that because the scorer is still catching up right now and processing/skipping over older records, but will resume again once it is all caught up?

It’s running but very slowly (I think ~60 log scores per second instead of ~300-400).

I’m adding tracing and more metrics to figure out what’s going on.

My apologies, but I am still a bit lost with regard to my two monitors. Both are listed as active and connected. But the stats are “Tests/min: 0.0 (1h), 0.0 (24h)” for both, despite my seeing

May 12 07:49:39 sonne ntppool-agent[3119014]: level=INFO msg="batch processing" env=prod ip_version=v4 monitor_ip=85.215.122.93 count=1
May 12 07:49:39 sonne ntppool-agent[3119014]: level=INFO msg="batch processing" env=prod ip_version=v6 monitor_ip=2a01:239:0:be::1 count=30
May 12 07:49:47 sonne ntppool-agent[3119014]: level=INFO msg="batch processing" env=prod ip_version=v6 monitor_ip=2a01:239:0:be::1 count=30
May 12 07:49:54 sonne ntppool-agent[3119014]: level=INFO msg="batch processing" env=prod ip_version=v6 monitor_ip=2a01:239:0:be::1 count=18
May 12 07:50:11 sonne ntppool-agent[3119014]: level=INFO msg="ntp error" env=prod ip_version=v4 batchID=01KRDBR40NV2HZCP0AY7TP9PBD server=212.138.170.134 err="network: i/o timeout"
May 12 07:50:12 sonne ntppool-agent[3119014]: level=INFO msg="batch processing" env=prod ip_version=v4 monitor_ip=85.215.122.93 count=28
May 12 07:50:19 sonne ntppool-agent[3119014]: level=INFO msg="batch processing" env=prod ip_version=v4 monitor_ip=85.215.122.93 count=3

So in theory, the agent runs its checks and logs them, but the stats don’t show up on the monitor stats page in my account. Is that still a leftover from the infrastructure move, or is there something I need to do to fix this on my monitors?
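As a rough local cross-check, independent of the stats page, the `count=` values in the “batch processing” lines can be summed per minute. This is only a sketch, and it rests on an assumption that is not confirmed anywhere in this thread: that `count` is the number of results in each submitted batch. The function name `results_per_minute` is made up for this example, and the sample is the log excerpt above with plain ASCII quotes.

```python
import re
from collections import defaultdict

# Syslog-style "batch processing" lines. Both straight and typographic quotes
# are accepted, because forum software may convert them when logs are pasted.
BATCH_RE = re.compile(
    r'^(?P<minute>\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s.*'
    r'msg=["“]batch processing["”].*\bcount=(?P<count>\d+)'
)

SAMPLE = """\
May 12 07:49:39 sonne ntppool-agent[3119014]: level=INFO msg="batch processing" env=prod ip_version=v4 monitor_ip=85.215.122.93 count=1
May 12 07:49:39 sonne ntppool-agent[3119014]: level=INFO msg="batch processing" env=prod ip_version=v6 monitor_ip=2a01:239:0:be::1 count=30
May 12 07:49:47 sonne ntppool-agent[3119014]: level=INFO msg="batch processing" env=prod ip_version=v6 monitor_ip=2a01:239:0:be::1 count=30
May 12 07:49:54 sonne ntppool-agent[3119014]: level=INFO msg="batch processing" env=prod ip_version=v6 monitor_ip=2a01:239:0:be::1 count=18
May 12 07:50:11 sonne ntppool-agent[3119014]: level=INFO msg="ntp error" env=prod ip_version=v4 batchID=01KRDBR40NV2HZCP0AY7TP9PBD server=212.138.170.134 err="network: i/o timeout"
May 12 07:50:12 sonne ntppool-agent[3119014]: level=INFO msg="batch processing" env=prod ip_version=v4 monitor_ip=85.215.122.93 count=28
May 12 07:50:19 sonne ntppool-agent[3119014]: level=INFO msg="batch processing" env=prod ip_version=v4 monitor_ip=85.215.122.93 count=3
"""


def results_per_minute(lines):
    """Sum count= over all "batch processing" lines, keyed by 'Mon DD HH:MM'."""
    totals = defaultdict(int)
    for line in lines:
        match = BATCH_RE.search(line)
        if match:  # lines without a count, e.g. "ntp error", are skipped
            totals[match.group("minute")] += int(match.group("count"))
    return dict(totals)


if __name__ == "__main__":
    for minute, total in results_per_minute(SAMPLE.splitlines()).items():
        print(minute, total)
```

A non-zero total here would only show that the agent is submitting batches locally. It says nothing about how the website computes its “Tests/min” figure.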

Thanks for clearing my head :slight_smile:

I am seeing the same thing with my monitors, and I think that someone also mentioned it before, though I don’t recall who, or where. Considering when it started, I would think this is also a glitch from the migration.

Looks like they didn’t stop entirely, just became a lot rarer:

May 12 06:56:35 vps ntppool-agent[2025996]: level=ERROR msg="batch processing" env=prod ip_version=v4 monitor_ip= err="SubmitResults: deadline_exceeded: Post \"https://api.mon.ntppool.dev/monitor.v2.MonitorService/SubmitResults\": net/http: timeout awaiting response headers" trace_id=bb88160f28fbd8ea4ef921a298d5e970 span_id=f3014e00887878be
May 12 07:54:34 vps ntppool-agent[2025996]: level=ERROR msg="batch processing" env=prod ip_version=v6 monitor_ip= err="SubmitResults: unavailable: unexpected EOF" trace_id=00a61f2e0f29942f1f831f92daab6e3b span_id=8e8986be9f3a16b3
May 12 07:54:38 vps ntppool-agent[2025996]: level=ERROR msg="batch processing" env=prod ip_version=v6 monitor_ip= err="getting server list: unavailable: unexpected EOF" trace_id=70f2ddd74ee2f4f6d8e8c778f54974d8 span_id=bc63ad817ef9c0a2

OK, it should catch up over the next 6-10 hours. I still don’t understand why, but MySQL in the new cluster is a lot slower than in the old one, in some surprising ways. Unfortunately I didn’t keep the tracing data from the old cluster, so it’s really time-consuming to figure out what got slower. :-/

For the “Tests/min” metrics, I think it’s just the ntppool app that’s not wired up correctly to the Prometheus/Mimir endpoint. I’ll put it on my list to check, thank you!

Did you check that the sluggering_down logging is disabled? :wink: I tend to forget that sometimes.