Monitoring station seems to hate my server all of a sudden


#22

Just an update:

So I had a Comcast Business tech out yesterday… we checked the signal to the amp right outside our location (which looked good), and we also replaced everything line-wise from the ped to the modem, even a new tap.

While I did see some sort of an uptick for a few hours afterward, last night and today are proving that it was just a coincidence. Still seeing issues with this location: random ups and downs like the others.

http://www.pool.ntp.org/scores/173.161.33.165
https://web.beta.grundclock.com/scores/173.161.33.165
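
For what it's worth, a quick way to rule out the server itself is a raw SNTP probe from some outside host, independent of the pool monitor. This is just a minimal sketch (a plain UDP client request, not anything the monitor actually runs), pointed at the IP linked above:

```python
import socket
import struct
import time

NTP_EPOCH_DELTA = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def sntp_query(host, timeout=5):
    """Send one SNTP client request and return the server's transmit time (Unix epoch)."""
    packet = b'\x1b' + 47 * b'\0'  # LI=0, VN=3, Mode=3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(packet, (host, 123))
        data, _ = sock.recvfrom(512)
    tx_seconds = struct.unpack('!I', data[40:44])[0]  # transmit timestamp, seconds field
    return tx_seconds - NTP_EPOCH_DELTA


if __name__ == '__main__':
    remote = sntp_query('173.161.33.165')  # the server linked above
    print('server says:', time.ctime(remote), '| local clock:', time.ctime())
```

If that times out from several vantage points while the server looks fine locally, the problem is somewhere on the path; if it answers promptly everywhere, the monitor becomes the more likely suspect.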

I asked about upstream issues, but the tech was unaware of any, either locally or at our backbone connection in Chicago. This doesn’t really surprise me, as the standard answer from Comcast always seems to be “well, nobody else is complaining.” But I’m sure not many others are running the services we’re running either, soooo, I always take “nobody else” with a grain of salt.

@curbynet, I joined the #ntp channel but haven’t seen any more chatter about upstream IL CB issues. Have you seen anything else / any further developments on that elsewhere? Just curious whether anyone with enough “clout”, or an insider, has gotten wind of it yet.

Thanks all!


#23

Looking at my graphs and @tebeve’s, it seems like things are improving, albeit slowly. The monitoring station’s view of my Comcast server has been stable since the 8th, but my Linode is still having issues, although its score has been above 0 most of the time over the past few days.

I’m not privy to “behind the scenes” activity either. If anyone knows more, please let us know!


#24

Something is apparently up. Our NTP server time.dynet.no has been in the pool for only about a month, but yesterday its score plummeted to well below zero. I was initially confused, since I couldn’t find any obvious reason for it, but the probes graph on status.ntppool.org is telling: the big dip on March 12 coincides with our own dip on the same day.

Not sure what’s happening, but something’s broken somewhere.


#25

At this point, I don’t know whether this thread should be broken up into separate topics, as there seem to be multiple issues involved. The first was the apparent connectivity issues from the monitoring station(s) to several of our servers.

The second was the graphing issues that started on the 12th. If you look at your graphs or the graphs that other posters linked above, you’ll notice a conspicuous lack of green dots on the graphs during the “dip” times. I mention the 12th because the lack of dots was seen then as well.

Going back to the first problem addressed in this thread, my 173.10.246.233 server saw a score dip for the first time in several days today. It was registered just after the “missing dots” problem was resolved. Perhaps something was resetting/reinitializing itself and the score dips were side effects?

Linking again for reference: http://www.pool.ntp.org/user/curby


#26

Any news on this issue? The monitoring station is a little happier with my systems now, but they still see occasional drops (where there were none before this month). Thanks!


#27

Mine as well, but now it’s not just the one that hiccups, it’s all of them… at least it’s just small bumps now, though, and not the mountainous plummets from before.

http://www.pool.ntp.org/user/dd8ogjbcybjuvzbkievwq

The beta monitor however is a completely different story…

https://web.beta.grundclock.com/user/b2vaawpvxc3pmb4a7ak8


#28

I also have issues with the beta monitor:

https://web.beta.grundclock.com/scores/51.174.131.248

No problem with prod:
http://www.pool.ntp.org/scores/51.174.131.248

https://web.beta.grundclock.com/scores/51.174.131.248/log?limit=50&monitor=*


#29

Right, not just one. I’m seeing hiccups detected on both of my servers too.

@kennethr I wouldn’t necessarily say “no problem” with the prod monitor. As of this writing, the graph on the second link in your post shows three such hiccups. This is similar to what the monitor has been seeing on my servers.

Anyway, mostly just wanted to “bump” the thread to keep it alive. :crossed_fingers: for a solution.


#30

Maybe a monitoring station should not send any results when it sees that a large part of the pool has suddenly become unreachable, since that would suggest a local problem on the monitor’s side rather than a problem with the servers?
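
Roughly what I have in mind, as a minimal sketch: assume the monitor already tallies which servers answered in a probe run; the 50% threshold and all names here are made up for illustration, not the pool’s actual code.

```python
# Sketch only: hold back score submissions when the monitor itself looks broken.
# The threshold (0.5) and every name below are assumptions, not the pool's real code.

def should_submit_results(responses: dict[str, bool],
                          min_reachable_fraction: float = 0.5) -> bool:
    """Return True if enough probed servers answered that the results are
    probably meaningful, i.e. the outage is not on the monitor's own side."""
    if not responses:
        return False
    reachable = sum(1 for ok in responses.values() if ok)
    return reachable / len(responses) >= min_reachable_fraction


# Example: in a run where most servers time out, discard the whole run.
run = {"173.161.33.165": False, "51.174.131.248": False, "207.171.7.152": True}
if should_submit_results(run):
    print("submit scores to the pool")
else:
    print("probable monitor-side outage; discard this run")
```

Picking the threshold is of course the tricky part.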


#31

It’s not that unusual for a popular data center to have a power outage and cause a nontrivial number of pool servers to genuinely go down simultaneously.

It might be hard to find a balance between “monitoring outage” and “large real outage”.


#32

It would be easier with multiple monitoring stations in different areas, which is a feature of the new (beta) monitoring system, but that system seems even more unpredictable than the live one.


#33

I made the ‘csv log’ link add the ‘monitor=*’ parameter by default; it should show more information for requests that didn’t get a response (the “-5” score entries), for example:

https://web.beta.grundclock.com/scores/207.171.7.152/log?limit=200&monitor=*
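
If anyone wants to sift that log programmatically, here is a rough sketch that pulls the CSV and prints the probes that got no answer. The column names (“step”, “ts”, “monitor_name”, “error”) are an assumption about the CSV header, so adjust them to whatever the endpoint actually returns:

```python
# Sketch only: fetch the per-monitor CSV log linked above and list no-response probes.
import csv
import io
import urllib.request

URL = ("https://web.beta.grundclock.com/scores/207.171.7.152/log"
       "?limit=200&monitor=*")

with urllib.request.urlopen(URL) as resp:
    text = resp.read().decode("utf-8")

for row in csv.DictReader(io.StringIO(text)):
    try:
        step = float(row.get("step") or 0)
    except ValueError:
        continue  # skip rows without a numeric step value
    if step <= -4:  # the "-5" entries: the probe got no response at all
        print(row.get("ts", "?"), row.get("monitor_name", "?"), row.get("error", ""))
```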