Thanks everyone. This was pretty puzzling. I don’t think even it’s a new bug. There was a race condition in the code where the sub-system talking to the MQTT clients (the ntppool-agents) for ad-hoc checks could lock-up.
The relevant code hasn’t changed for 3 years, so I’m not entirely sure if something in the new cluster triggered it or if it’s just a co-incidence.