So, over the last two days the NTP server at one of our locations seems to be having issues keeping a connection with the monitoring server. The server has been up for just shy of a year now, and previously there were no issues, and nothing in the setup has changed.
We have thousands of active connections, and the server (GPS - Stratum 1) is keeping good time, but I just can’t seem to keep a constant connection to the monitoring station.
We’re not experiencing any other network anomalies, and traffic is well below our line’s capacity.
I’ve been chasing my tail on this most of yesterday and today, but keep coming up empty. Just wondering if anyone has any thoughts on avenues to go down that I may have not yet tried.
So it looked like it was finally going to start behaving again on Sunday… but then the bottom dropped out.
I’m still not experiencing any other network issues, and when I point other external servers to this one, I’m not seeing any connection drops or issues.
Really scratching my head over here on where to look next.
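For anyone who wants to put numbers on drops from an outside vantage point, here is a minimal sketch of an SNTP probe. It is not how the pool's monitoring station works, just a rough client that sends standard 48-byte mode 3 requests to UDP port 123 and counts how many go unanswered. The names `build_request`, `parse_response`, and `probe` are made up for this example, and it assumes the target resolves to an IPv4 address.

```python
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def build_request():
    # First byte 0x23: leap indicator 0, version 4, mode 3 (client).
    # The remaining 47 bytes of the 48-byte header are left as zero.
    return b"\x23" + 47 * b"\x00"


def parse_response(data):
    # Pull out the mode, stratum, and transmit timestamp (bytes 40-47).
    if len(data) < 48:
        raise ValueError("short NTP packet")
    mode = data[0] & 0x07
    stratum = data[1]
    secs, frac = struct.unpack("!II", data[40:48])
    transmit_time = secs - NTP_EPOCH_OFFSET + frac / 2**32
    return {"mode": mode, "stratum": stratum, "transmit_time": transmit_time}


def probe(host, count=10, timeout=2.0, interval=1.0):
    # Send `count` requests and return (lost, count).
    lost = 0
    for _ in range(count):
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            try:
                sock.sendto(build_request(), (host, 123))
                data, _addr = sock.recvfrom(512)
                parse_response(data)
            except (OSError, ValueError):
                # Timeouts and resolution failures are OSError subclasses.
                lost += 1
        time.sleep(interval)
    return lost, count
```

Running `probe("your.server.example")` from a couple of different networks and comparing the loss counts should at least hint at whether the drops depend on the path the packets take.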
Well, that was interesting… I’m getting the error “Could not check NTP status” when I try to add the server. However, my two other servers added in just fine.
This has almost got to be an ISP issue (right?). I’ll try my contact there again.
Thanks @ask, I’ll keep trying to add the server to beta, and keep this updated as findings come in.
Just here to say “me too.” My Dallas Linode is doing ok (score above 10 but seeing more drops than usual), but my server connected through Comcast Business is reported as having all sorts of issues.
Given the heterogeneous nature of our affected servers and their myriad paths to the monitoring station, I wonder if it’s a problem with the monitoring station or its ISP.
I updated my first post with a graph that looks further into the past, but you might have to click it since it isn’t served over HTTPS. Both servers were reported as having issues shortly after midnight March 1, but the Linode is just less affected.
Hi.
I noticed the same thing myself with two servers in Ireland. One is an EC2 droplet, and the other is a server behind a business fibre connection. On looking at the monthly system metrics, it is clear that something happened on the 1st of March which affected enough IPv4 servers to register on the overall graph.
Time’s telling us it’s still having issues on the new system. If anyone’s on #ntp on freenode there’s been some talk of it there as well.
EDIT: So people don’t have to check both, I think we’ve identified several affected ISPs:
Linode in Dallas
Comcast in Illinois
Comcast in New Mexico
CenturyLink in Colorado
Two providers in Ireland
Someone in IRC mentioned that this happens occasionally, and the culprit is a routing/networking issue between the monitoring station’s ISP and parts of the Internet. If it’s not already being done, is there an admin who can work with the monitoring station’s ISP to look into that? Thanks!
EDIT 2018-03-08
My Comcast is doing better over the last 36+ hours, my Linode is still seesawing up and down.
Ok, more info: the problems seem consistently periodic. Things get better around midnight GMT. This seems consistent over the past four days, even on Wednesday which had a smaller peak.
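If anyone wants to check the time-of-day pattern against their own probe logs, a rough sketch like the following buckets results by UTC hour. The input format (pairs of Unix timestamp and a success flag) and the function name `loss_by_utc_hour` are assumptions for illustration, not anything the pool exports in that shape.

```python
from collections import defaultdict
from datetime import datetime, timezone


def loss_by_utc_hour(samples):
    """Return {utc_hour: fraction of failed probes} for (timestamp, ok) pairs."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for timestamp, ok in samples:
        hour = datetime.fromtimestamp(timestamp, tz=timezone.utc).hour
        totals[hour] += 1
        if not ok:
            failures[hour] += 1
    # Only hours that actually have samples appear in the result.
    return {hour: failures[hour] / totals[hour] for hour in sorted(totals)}
```

A loss fraction that dips in the hours around 0 would line up with things getting better around midnight GMT.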
We’ve definitely gotten more emails like this lately, too. Thank you for putting some dates on it and discussing it here. It’s pretty frustrating.
I haven’t had time to dig through the data to try to figure out what the pattern is. Nothing obvious is standing out. We have lost some servers, but it seems to go up and down day by day – http://www.pool.ntp.org/zone
Eventually I’ll get to adding an “automatic traceroute” feature and some code that’ll correlate network paths automatically. Someday; there’s a lot on the to-do list.
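To illustrate the path-correlation idea (purely a sketch, not the pool's code or its planned design): given traceroute hop lists collected separately for affected and healthy servers, count the hops that show up only on affected paths. The function name `suspect_hops` and the list-of-hop-strings input are hypothetical.

```python
from collections import Counter


def suspect_hops(affected_paths, healthy_paths):
    """Rank hops seen on affected paths but never on a healthy path."""
    healthy = {hop for path in healthy_paths for hop in path}
    counts = Counter()
    for path in affected_paths:
        # dict.fromkeys de-duplicates a path while keeping hop order,
        # so a hop that repeats within one trace is counted once.
        for hop in dict.fromkeys(path):
            if hop not in healthy:
                counts[hop] += 1
    # Most frequently shared hops first.
    return counts.most_common()
```

A hop that appears in most affected traces and in no healthy one is a reasonable first candidate to raise with the ISP, though shared hops alone do not prove where packets are being dropped.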