So I had a Comcast Business tech out yesterday… we checked the signal to the amp right outside our location (which looked good), and we also replaced everything (line-wise) from the ped to the modem, even a new tap.
While I did see some sort of an uptick for a few hours after we did so, last night and today are proving that it was just a coincidence. Still seeing issues at this location, with random ups and downs like others have reported.
I asked about upstream issues, but the tech was unaware of any, either locally or at our backbone connection in Chicago. This doesn’t really surprise me, as the standard answer from Comcast always seems to be “well, nobody else is complaining.” But I’m sure not many others are running the services we’re running, so I always take “nobody else” with a grain of salt.
@curbynet, I joined the #ntp channel but haven’t seen any more chatter about upstream IL CB issues. Have you seen anything else, or any further developments on that elsewhere? Just curious whether anyone with enough “clout”, or an insider, has gotten wind of it yet.
Looking at my graphs and @tebeve’s, it seems like things are improving, albeit slowly. The monitoring station’s view of my Comcast server has been stable since the 8th, but my Linode is still having issues, although its score has been above 0 most of the time in the past few days.
I’m not privy to “behind the scenes” activity either. If anyone knows more, please let us know!
Something is apparently up. Our NTP server time.dynet.no has been in the pool for only about a month, but yesterday its score plummeted to well below zero. I was initially confused, since I couldn’t find any obvious reason for it, but if you look at status.ntppool.org, the probes graph is telling. The big dip on March 12 coincides with our own dip the same day.
Not sure what’s happening, but something’s broken somewhere.
At this point, I don’t know whether this thread should be broken up into separate topics, as there seem to be multiple issues involved. The first was the apparent connectivity problems from the monitoring station(s) to several of our servers.
The second was the graphing issues that started on the 12th. If you look at your graphs or the graphs that other posters linked above, you’ll notice a conspicuous lack of green dots on the graphs during the “dip” times. I mention the 12th because the lack of dots was seen then as well.
Going back to the first problem addressed in this thread, my 173.10.246.233 server saw a score dip today for the first time in several days. It registered just after the “missing dots” problem was resolved. Perhaps something was resetting/reinitializing itself and the score dips were side effects?
Any news on this issue? The monitoring station is a little happier with my systems now, but they still see occasional drops (where there were none before this month). Thanks!
Mine as well, but now it’s not just the one that hiccups, it’s all of them… at least they’re just small bumps now, though, and not the mountainous plummets from before.
Right, not just one. I too am seeing detected hiccups on both of my servers.
@kennethr I wouldn’t necessarily say “no problem” with the prod monitor. As of this writing, the graph on the second link in your post shows three such hiccups. This is similar to what the monitor has been seeing on my servers.
Anyway, mostly just wanted to “bump” the thread to keep it alive while we wait for a solution.
Maybe a monitoring station should not send any results when it sees that a large part of the pool is suddenly unreachable, since that would suggest a local problem rather than a problem with the servers?
It’s not that unusual for a popular data center to have a power outage and cause a nontrivial number of pool servers to genuinely go down simultaneously.
It might be hard to find a balance between “monitoring outage” and “large real outage”.
It would be easier with multiple monitoring stations in different areas, which is a feature of the new (beta) monitoring system, but that system seems even more unpredictable than the live one.
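To make the idea concrete, here’s a minimal sketch (an illustration only, in Python; not how the monitor actually works) of the kind of sanity check being suggested: withhold a round of results when most probes in that round failed, on the theory that a mostly-dead pool points at the monitor’s own connectivity rather than at the servers.

    # Illustrative sketch only; the names and the threshold are assumptions.
    def should_submit(results, max_failure_ratio=0.5):
        """results: one probe round, as (server_ip, got_response) pairs."""
        failed = sum(1 for _, got_response in results if not got_response)
        # If more than half the pool looks unreachable at once, assume a local
        # problem and withhold the round instead of scoring every server down.
        return failed <= max_failure_ratio * len(results)

The hard part, as noted above, is picking the threshold: too low and a big real outage (like a data-center power failure) gets masked; too high and a flaky monitor still scores everyone down.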
I made the ‘csv log’ link add the ‘monitor=*’ parameter by default; it should include more information for requests that didn’t get a response (the “-5” score), for example:
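A sketch of what such a request can look like, assuming the CSV log keeps its usual /scores/<IP>/log URL shape (the address below is a placeholder):

    curl -s "https://www.ntppool.org/scores/203.0.113.1/log?monitor=*"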
In case somebody still has issues with the LA monitoring station (my story): during the last few weeks (end of March 2019) I’ve had two issues involving UDP transmission from the US to Europe (Poland, actually):
the first one involved UDP DNS packets being dropped in transit from dnsviz.net (DNS-OARC) to NASK (the .pl registrar)
In both cases the issue was UDP-related, and in both cases, from the feedback provided to me, packets were being dropped somewhere in, or on the edge of, “Hurricane Electric LLC” (AS6939). So if somebody has similar problems (with NTP or UDP transit generally), please check both IPv4 and IPv6 with this “spell” (thanks, OVH!):
mtr -o "J M X LSR NA B W V" -wbzc 100 TARGET_IP
(The -o string picks the report columns: current/mean/worst jitter J/M/X, loss/sent/received L/S/R, newest/average RTT N/A, best/worst RTT B/W, and standard deviation V; -w gives a wide report, -b shows both hostnames and IPs, -z shows AS numbers, and -c 100 sends 100 probes.)
If the second (IPv6) run travels (almost) normally through AS6939 and the first (IPv4) one has “black holes” where AS6939 should normally be, then you should probably contact your AS operator (or ISP) and send them the mtr report. You can also get the view from the other direction via the monitoring network at:
dev.ntppool.org/monitoring/network-debugging
Don’t bank too much on mtr/ICMP reports. Big switches and routers don’t prioritize answers to ICMP requests, so you may get timeouts on those hops without any actual packet loss. You should check whether the later hops show packet loss; if they don’t, the switch is not actually having issues.
mtr and traceroute can be useful for debugging NTP issues. They have options to send UDP packets instead of ICMP, and it’s possible to specify the NTP port. mtr has to be patched to allow a source port < 1024.
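For example (assuming a reasonably recent mtr and the Linux traceroute; SERVER_IP is a placeholder):

    mtr -u -P 123 -wc 100 SERVER_IP
    traceroute -U -p 123 SERVER_IP

Here -u / -U switch the probes to UDP and -P / -p set the destination port to 123; mtr’s -L option sets the UDP source port, which is where the patch mentioned above comes in for ports below 1024.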
Two of my servers recently became unusable because something in the network (an anti-DDoS appliance?) seems to be rate-limiting NTP packets. The servers are receiving NTP requests, but their responses are dropped after a few hops. And this happens only on port 123.
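A quick way to check for that kind of port-specific filtering (same tool assumptions as the sketch above) is to compare a UDP trace to port 123 against one to a nearby, unremarkable port:

    mtr -u -P 123 -wc 100 SERVER_IP
    mtr -u -P 1123 -wc 100 SERVER_IP

If the second run is clean while the first shows loss from a particular hop onward, something on the path is treating NTP traffic specially.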
I have the same problem with the LA monitoring station: a broken score for IPv4 80.241.0.72 and no problem with IPv6 2a01:7640::72 on the same server.
The server is a Meinberg SyncFire 1100 using GPS+GLONASS for sync.