Monitoring station seems to hate my server all of a sudden

So, over the last two days the NTP server at one of our locations has been having trouble keeping a connection with the monitoring server. The server has been up for just shy of a year now with no previous issues, and nothing in the setup has changed.

We have thousands of active connections, and the server (GPS, Stratum 1) is keeping good time, but I just can’t seem to keep a constant connection to the monitoring station.

ntp-server-monitoring-graph

http://www.pool.ntp.org/scores/173.161.33.165/log?limit=75

We’re not experiencing any other network anomalies, and traffic is well below our line’s capacity.

I’ve been chasing my tail on this most of yesterday and today, but keep coming up empty. Just wondering if anyone has any thoughts on avenues to go down that I may have not yet tried.

Thanks in advance.

So it looked like it was finally going to start behaving again on Sunday… but then the bottom dropped out.

ntp-server-monitoring-graph-update-01

I’m still not experiencing any other network issues, and when I point other external servers to this one, I’m not seeing any connection drops or issues.
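
For anyone who wants to repeat that kind of external check without setting up a full NTP client, here is a minimal sketch of a one-shot SNTP query in Python. The function names (`build_request`, `parse_response`, `query`) are made up for illustration, and it only reads a few header fields; it is not how the pool monitor itself works.

```python
import socket
import struct

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800


def build_request() -> bytes:
    """Build a 48-byte client request: LI=0, VN=4, Mode=3 (first byte 0x23)."""
    return b"\x23" + 47 * b"\x00"


def parse_response(data: bytes) -> dict:
    """Pull a few header fields and the transmit timestamp out of a reply."""
    if len(data) < 48:
        raise ValueError("short NTP packet")
    first, stratum = data[0], data[1]
    # Transmit timestamp: 32-bit seconds + 32-bit fraction at bytes 40..47.
    tx_secs, tx_frac = struct.unpack("!II", data[40:48])
    return {
        "leap": first >> 6,
        "version": (first >> 3) & 0x7,
        "mode": first & 0x7,
        "stratum": stratum,
        "transmit_unix": tx_secs - NTP_EPOCH_OFFSET + tx_frac / 2**32,
    }


def query(host: str, timeout: float = 2.0) -> dict:
    """Send one request over IPv4/UDP port 123; raises socket.timeout on no reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(build_request(), (host, 123))
        data, _ = s.recvfrom(512)
    return parse_response(data)


# Example (needs network access):
# print(query("173.161.33.165"))
```

Running `query` in a loop from a few different networks and counting timeouts gives a rough idea of whether drops depend on the path rather than on the server.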

Really scratching my head over here on where to look next.

Hi @tebeve – can you try adding your server to the beta system? There’s a bit more information in the log there (and multiple monitors).

Well, that was interesting… I’m getting error “Could not check NTP status” when I try to add the server. However, my two other servers added in just fine.

This has almost got to be an ISP issue (right?). I’ll try my contact there again.

Thanks @ask , I’ll keep trying to add the server to beta, and keep this updated as findings come in.

One of my servers, tick.mattnordhoff.com, is also having trouble – with IPv4.

It’s with Linode in Texas, so it’s actually slightly closer to LA than my other servers.

http://www.pool.ntp.org/scores/45.79.1.70
http://www.pool.ntp.org/scores/2600:3c00::2:b401
https://web.beta.grundclock.com/scores/45.79.1.70
https://web.beta.grundclock.com/scores/2600:3c00::2:b401

I am in the same situation. A Linode in Texas, the IPv6 is fine, the IPv4 is tanked. I’ll look at the beta system and report back.

Same as @tebeve, “Could not check NTP status” when I try to add it to the beta system.

Just here to say “me too.” My Dallas Linode is doing ok (score above 10 but seeing more drops than usual), but my server connected through Comcast Business is reported as having all sorts of issues.

Given the heterogeneous nature of our affected servers and their myriad paths to the monitoring station, I wonder if it’s a problem with the monitoring station or its ISP.

http://www.pool.ntp.org/user/curby

Problems started happening right after midnight on March 1 according to the tracker:

Oh, I guess I forgot to mention that, thanks @curbynet … Comcast Business here as well - West Central IL.

However, my other two servers, located in Southeast and Central IA (both with small-ISP fiber connections to the interwebs), are solid as a rock.

I have a trouble ticket in with Comcast currently, so we’ll see how that goes.

EDIT: Also, your top graph is how mine started, and it just steadily went downhill.

I updated my first post with a graph that looks further into the past, but you might have to click it since it isn’t served over HTTPS. Both servers were reported as having issues shortly after midnight March 1, but the Linode is just less affected.

1519863696,"2018-03-01 00:21:36",-5,14,0
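
As a quick sanity check on the "shortly after midnight" timing, the sketch below parses that log line and compares the epoch field against the human-readable timestamp, assuming the latter is in UTC. The thread doesn't say what the remaining three fields mean, so the code leaves them uninterpreted.

```python
import csv
from datetime import datetime, timezone

# The log line quoted above, with plain ASCII quotes.
line = '1519863696,"2018-03-01 00:21:36",-5,14,0'

# csv.reader handles the quoted timestamp field.
row = next(csv.reader([line]))

epoch = int(row[0])
# Assumption: the textual timestamp is UTC.
stamp = datetime.strptime(row[1], "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
from_epoch = datetime.fromtimestamp(epoch, tz=timezone.utc)

# True if the two fields describe the same instant under the UTC assumption.
matches_utc = from_epoch == stamp
print(from_epoch.isoformat(), matches_utc)
```

The two fields agree under that assumption, so "midnight" in these logs means midnight UTC, not local time.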

Same here

I’ve added one of my servers to the beta site: https://web.beta.grundclock.com/scores/72.14.181.128

@ask , I finally got the server to actually get picked up by the beta monitors.

https://web.beta.grundclock.com/scores/173.161.33.165

The scores are trending up on the old system as well, though, so I guess time will tell.

fingers crossed

Hi.
I noticed the same thing myself with two servers in Ireland. One is an EC2 droplet, and the other is a server behind a business fibre connection. On looking at the monthly system metrics, it is clear that something happened on the 1st of March which affected enough IPv4 servers to register on the overall graph.

Time’s telling us it’s still having issues on the new system. If anyone’s on #ntp on freenode there’s been some talk of it there as well.

EDIT: So people don’t have to check both, I think we’ve identified several affected ISPs:

Linode in Dallas
Comcast in Illinois
Comcast in New Mexico
Centurylink in Colorado
Two providers in Ireland

Someone in IRC mentioned that this happens occasionally, and the culprit is a routing/networking issue between the monitoring station’s ISP and parts of the Internet. If it’s not already being done, is there an admin who can work with the monitoring station’s ISP to look into that? Thanks!

EDIT 2018-03-08

My Comcast is doing better over the last 36+ hours, my Linode is still seesawing up and down.

http://www.pool.ntp.org/user/curby

Curious that Linode Dallas has multiple IP ranges … I wonder if there’s some pattern to certain subnets that are more drastically affected.

By the way, only three replies per topic? Really?

My Dallas Linode’s score is up to 15.4. Last dropped packet was at 05:20.

http://www.pool.ntp.org/scores/45.79.1.70

Beta is a different story, but it’s trending upward, more or less.

https://web.beta.grundclock.com/scores/45.79.1.70

Ok, more info: the problems seem consistently periodic. Things get better around midnight GMT. This seems consistent over the past four days, even on Wednesday which had a smaller peak.
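
One rough way to check for that kind of daily pattern is to bucket monitor samples by UTC hour and look at the fraction of drops in each bucket. This is only a sketch: `drop_rate_by_hour` is a made-up helper, and it assumes you have already reduced each log entry to an `(epoch, responded)` pair yourself.

```python
from collections import defaultdict
from datetime import datetime, timezone


def drop_rate_by_hour(samples):
    """Return {utc_hour: fraction of samples with no response}.

    `samples` is an iterable of (epoch_seconds, responded) pairs, where
    `responded` is False for a dropped/timed-out check. How you derive
    that flag from the log is up to you (an assumption of this sketch).
    """
    totals = defaultdict(int)
    drops = defaultdict(int)
    for epoch, responded in samples:
        hour = datetime.fromtimestamp(epoch, tz=timezone.utc).hour
        totals[hour] += 1
        if not responded:
            drops[hour] += 1
    return {h: drops[h] / totals[h] for h in sorted(totals)}


# Hypothetical samples, for illustration only.
example = [
    (1519863696, False),         # 00:21:36 UTC, dropped
    (1519863696 + 60, True),     # 00:22:36 UTC, answered
    (1519863696 + 3600, True),   # 01:21:36 UTC, answered
]
print(drop_rate_by_hour(example))
```

If the drop rate consistently dips in the same hours across several days, that points at something time-dependent on the path or at the monitor rather than at any one server.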

Maybe just overload… Me too.

Hi everyone,

We’ve definitely gotten more emails like this lately, too. Thank you for putting some dates on it and discussing it here. It’s pretty frustrating.

I haven’t had time to dig through the data to try to figure out what the pattern is. Nothing obvious is standing out. We have lost some servers, but it seems to go up and down day by day – http://www.pool.ntp.org/zone

Remember you can do traceroutes from the Los Angeles network at https://trace.ntppool.org/traceroute/8.8.8.8

Eventually I’ll get to adding an “automatic traceroute” feature and some code that’ll correlate network paths automatically. Some day, but there’s a lot on the todo list.