Monitoring station seems to hate my server all of a sudden

My Dallas Linode’s score is up to 15.4. Last dropped packet was at 05:20.

http://www.pool.ntp.org/scores/45.79.1.70

Beta is a different story, but it’s trending upward, more or less.

https://web.beta.grundclock.com/scores/45.79.1.70

Ok, more info: the problems seem consistently periodic. Things get better around midnight GMT. This seems consistent over the past four days, even on Wednesday which had a smaller peak.

Maybe just overload…

Me too.

Hi everyone,

We’ve definitely gotten more emails like this lately, too. Thank you for putting some dates on it and discussing it here. It’s pretty frustrating.

I haven’t had time to dig through the data to try to figure out what the pattern is. Nothing obvious is standing out. We have lost some servers, but it seems to go up and down day by day – http://www.pool.ntp.org/zone

Remember you can run traceroutes from the Los Angeles network at https://trace.ntppool.org/traceroute/8.8.8.8

Eventually I’ll get to adding an “automatic traceroute” feature and some code that’ll correlate network paths automatically. Someday; there’s a lot on the to-do list.

Just an update:

So I had a Comcast Business tech out yesterday… we checked signal to the amp right outside our location (which looked good), and we also replaced everything (line wise) from the ped to the modem, even a new tap.

While I did see some sort of uptick for a few hours after we did so, last night and today are proving that it was just a coincidence. Still seeing issues at this location. Random ups and downs, like others.

http://www.pool.ntp.org/scores/173.161.33.165
https://web.beta.grundclock.com/scores/173.161.33.165

I asked about upstream issues, but the tech was unaware of any either locally or at our backbone connection in Chicago. This doesn’t really surprise me, as the standard answer from Comcast always seems to be - “well, nobody else is complaining”. But I’m sure that not many others are running the services we are running either, soooo, I always take “nobody else” with a grain of salt.

@curbynet , I joined the #ntp channel but haven’t seen any more chatter about upstream IL CB issues. Have you seen anything else / any further developments on that elsewhere? Just curious whether anyone with enough “clout”, or an insider, has gotten wind of it yet?

Thanks all!

Looking at my graphs and @tebeve 's, it seems like things are improving, albeit slowly. The monitoring station’s view of my Comcast server has been stable since the 8th, but my Linode is still having issues, although the score has been above 0 most of the time in the past few days.

I’m not privy to “behind the scenes” activity either. If anyone knows more, please let us know!

Something is apparently up. Our NTP server time.dynet.no has been in the pool for only about a month, but yesterday its score plummeted to well below zero. I was initially confused, since I couldn’t find any obvious reason for it, but if you look at status.ntppool.org, the probes graph is telling. The big dip on March 12 coincides with our own dip the same day.

Not sure what’s happening, but something’s broken somewhere.

At this point, I don’t know whether this thread should be broken up into separate topics, as there seem to be multiple issues involved. The first was the apparent connectivity issues from the monitoring station(s) to several of our servers.

The second was the graphing issues that started on the 12th. If you look at your graphs or the graphs that other posters linked above, you’ll notice a conspicuous lack of green dots on the graphs during the “dip” times. I mention the 12th because the lack of dots was seen then as well.

Going back to the first problem addressed in this thread, my 173.10.246.233 server saw a score dip for the first time in several days today. It was registered just after the “missing dots” problem was resolved. Perhaps something was resetting/reinitializing itself and the score dips were side effects?

Linking again for reference: http://www.pool.ntp.org/user/curby

Any news on this issue? The monitoring station is a little happier with my systems now, but they still see occasional drops (where there were none before this month). Thanks!

Mine as well, but now it’s not just the one that hiccups, it’s all of them… at least it’s just small bumps now, though, and not the mountainous plummets it was before.

http://www.pool.ntp.org/user/dd8ogjbcybjuvzbkievwq

The beta monitor however is a completely different story…

https://web.beta.grundclock.com/user/b2vaawpvxc3pmb4a7ak8

I also have an issue with the beta monitor:

https://web.beta.grundclock.com/scores/51.174.131.248

No problem with prod:
http://www.pool.ntp.org/scores/51.174.131.248

https://web.beta.grundclock.com/scores/51.174.131.248/log?limit=50&monitor=*

Right, not just one. I too am seeing detected hiccups on both of my servers.

@kennethr I wouldn’t necessarily say “no problem” with the prod monitor. As of this writing, the graph on the second link in your post shows three such hiccups. This is similar to what the monitor has been seeing on my servers.

Anyway, mostly just wanted to “bump” the thread to keep it alive. :crossed_fingers: for a solution.


Maybe a monitoring station should not send any results when it sees that a large part of the pool is suddenly unreachable, meaning it’s a local problem and not a problem with the servers?

It’s not that unusual for a popular data center to have a power outage and cause a nontrivial number of pool servers to genuinely go down simultaneously.

It might be hard to find a balance between “monitoring outage” and “large real outage”.

It would be easier with multiple monitoring stations in different areas, which is a feature of the new (beta) monitoring system, but that system seems even more unpredictable than the live one.
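The suppression idea discussed above could be sketched as follows. This is a toy illustration only; the threshold, function names, and sample-size cutoff are invented for the example and are not how the pool monitor actually works:

```python
# Sketch of the suppression heuristic: if one monitoring pass finds an
# unusually large fraction of servers unreachable, assume the problem is
# local to the monitor and discard the whole batch instead of penalizing
# every server. All names and thresholds here are hypothetical.

def should_discard_batch(results, max_unreachable_fraction=0.5, min_sample=20):
    """results: list of (server_ip, reachable) pairs from one monitoring pass."""
    if len(results) < min_sample:
        return False  # too few samples to judge; submit scores normally
    unreachable = sum(1 for _, ok in results if not ok)
    return unreachable / len(results) > max_unreachable_fraction

# Example: 45 of 50 probed servers timing out at once looks like a
# monitor-side outage, not 45 independent server failures.
batch = [(f"192.0.2.{i}", i % 10 == 0) for i in range(50)]  # only 5 reachable
print(should_discard_batch(batch))  # True
```

The hard part, as noted above, is that a real data-center outage can also take down many servers at once, so any threshold like this risks masking genuine failures.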


I made the ‘csv log’ link add the ‘monitor=*’ parameter by default; it should include more information for requests that didn’t get a response (the “-5” score), for example:

https://web.beta.grundclock.com/scores/207.171.7.152/log?limit=200&monitor=*
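A log like the one above can be filtered for the no-response entries with a few lines of Python. Note the column names in the sample below are assumptions based on the description here, not a guaranteed schema; check the actual header of the CSV your server returns:

```python
import csv
import io

# Hypothetical sample mimicking the per-monitor CSV log described above;
# the real column names and values may differ.
SAMPLE = """ts_epoch,ts,offset,step,score,monitor_name
1554699600,2019-04-08 05:00:00,0.001,1,18.9,Los Angeles
1554700500,2019-04-08 05:15:00,,-5,12.9,Los Angeles
1554701400,2019-04-08 05:30:00,0.002,1,13.9,Los Angeles
"""

def no_response_rows(csv_text):
    """Return rows where the step is -5, i.e. the monitor got no reply."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row.get("step") == "-5"]

for row in no_response_rows(SAMPLE):
    print(row["ts"], row["monitor_name"], row["score"])
```

Grouping such rows by timestamp across several of your servers is a quick way to see whether the drops line up with the “dip” windows people reported earlier in the thread.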

In case somebody still has issues with the LA monitoring station (my story): during the last few weeks (end of March 2019) I’ve had two issues involving UDP transmission from the US to Europe (Poland, actually):

  1. the first involved UDP DNS packets dropped in transit from dnsviz.net (DNS-OARC) to NASK (the .pl registrar)

  2. the second involved UDP NTP packets dropped in transit from the LA monitoring station to the OVH datacenter in Warsaw (also PL), which occurs “from time to time” … transmission over IPv6 was OK, as can be seen/compared here:
    https://www.ntppool.org/scores/54.38.193.17
    https://www.ntppool.org/scores/2001:41d0:602:1811::
    (it is the same machine)

In both cases the issue was UDP-related, and in both cases, going by the feedback provided to me, packets were dropping somewhere in, or on the edge of, “Hurricane Electric LLC” (AS6939). So if somebody has similar problems (with NTP or UDP transit generally), please check both IPv4 and IPv6 with this “spell” (thanks OVH! :wink:):

mtr -o "J M X LSR NA B W V" -wbzc 100 TARGET_IP

If the second (IPv6) travels (almost normally) through AS6939 and the first (IPv4) has “black holes” where AS6939 normally should be, then you should probably contact your AS operator (or ISP) and send them this mtr report. You can also get information about the reverse path from the monitoring network at:
dev.ntppool.org/monitoring/network-debugging

Don’t bank too much on mtr/ICMP reports. Big switches and routers don’t prioritize answers to ICMP requests, so you may see timeouts or apparent loss at those hops without any actual packet loss. Check whether the later hops show packet loss; if they don’t, the intermediate device is not actually having issues.

For more details, check this page: https://www.linode.com/docs/networking/diagnostics/diagnosing-network-issues-with-mtr/#analyze-mtr-reports

mtr and traceroute can be useful for debugging NTP issues. They have options to send UDP packets instead of ICMP, and it’s possible to specify the NTP port. mtr has to be patched to allow a source port < 1024.

Two of my servers recently became unusable because something in the network (an anti-DDoS appliance?) seems to be rate-limiting NTP packets. The servers are getting NTP requests, but their responses are dropped after a few hops. And this happens only on port 123.
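A direct way to check port-123 reachability, as a complement to mtr, is to send a plain client-mode NTP query and see whether a reply makes it back. Below is a minimal SNTP-style probe in Python; it is a generic sketch, not the pool monitor’s actual code, and the host in the commented example is just a placeholder:

```python
import socket
import struct

def build_ntp_request():
    """Build a minimal 48-byte NTPv4 client request.

    First byte packs LI=0, VN=4, Mode=3 (client):
    0b00_100_011 = 0x23. The remaining 47 bytes may be zero.
    """
    return b"\x23" + b"\x00" * 47

def query_stratum(host, timeout=2.0):
    """Send one client request to UDP port 123; return the server's
    stratum (second byte of the reply), or None on timeout."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(build_ntp_request(), (host, 123))
        data, _ = sock.recvfrom(512)
        return data[1]
    except socket.timeout:
        return None
    finally:
        sock.close()

# Example (needs network access; replace with your own server):
# print(query_stratum("192.0.2.10"))
```

If this times out on port 123 while the same path answers mtr’s UDP probes on other ports, that points at exactly the kind of port-specific rate limiting described above.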

I have the same problem with the LA monitoring station: a broken score for IPv4 80.241.0.72 and no problem with IPv6 2a01:7640::72 on the same server.
The server is a Meinberg SyncFire 1100 using GPS+GLONASS for sync.

https://trace.ntppool.org/ntp/80.241.0.72
{"Time":"2019-04-08T08:50:00.374958939Z","ClockOffset":11385859,"RTT":278677201,"Precision":476,"Stratum":1,"ReferenceID":1397247232,"ReferenceTime":"2019-04-08T08:49:56.826358383Z","RootDelay":0,"RootDispersion":12191772,"RootDistance":151530372,"Leap":0,"MinError":0,"KissCode":"","Poll":8000000000}

https://trace.ntppool.org/traceroute/80.241.0.72
Traceroute to 80.241.0.72
1 gw-b.develooper.com (207.171.7.3) AS7012 0.332 0.286
2 gi1-9.r01.lax2.phyber.com (207.171.30.13) AS7012 0.804 0.800
3 te0-1-0-7.r04.lax02.as7012.net (207.171.30.61) AS7012 1.096 1.096
4 xe-0-1-0-30.r01.lsanca07.us.bb.gin.ntt.net (198.172.90.73) AS2914 1.192 1.184
5 (129.250.3.235) AS2914 165.265
5 ae-13.r01.lsanca07.us.bb.gin.ntt.net (129.250.2.187) AS2914 177.737
6 (129.250.3.237) AS2914 0.963
6 ae-3.r23.lsanca07.us.bb.gin.ntt.net (129.250.4.106) AS2914 0.993
7 ae-6.r22.asbnva02.us.bb.gin.ntt.net (129.250.3.188) AS2914 70.312 68.848
8 ae-0.r23.asbnva02.us.bb.gin.ntt.net (129.250.3.85) AS2914 68.781 68.793
9 ae-2.r25.amstnl02.nl.bb.gin.ntt.net (129.250.6.163) AS2914 144.494 141.149
10 ae-3.r24.amstnl02.nl.bb.gin.ntt.net (129.250.4.68) AS2914 141.667 150.016
11 ae-1.r01.stocse01.se.bb.gin.ntt.net (129.250.3.69) AS2914 178.006 177.544
12 (83.231.187.186) AS2914 171.683 180.700
13 (83.169.204.153) 208.944
13 (83.169.204.151) 214.169
14 * *
15 * *
16 * *
17 (188.170.164.50) AS31133 243.724 235.604
18 95.59.172.18.static.telecom.kz (95.59.172.18) AS9198 243.483 253.134
19 95.59.172.10.static.telecom.kz (95.59.172.10) AS9198 254.244 253.408
20 (95.59.172.37) AS9198 249.751
20 95.59.172.27.static.telecom.kz (95.59.172.27) AS9198 245.039
21 95.59.172.26.static.telecom.kz (95.59.172.26) AS9198 277.389 272.442
22 95.59.172.11.static.telecom.kz (95.59.172.11) AS9198 267.334 269.837
23 95.59.172.12.static.telecom.kz (95.59.172.12) AS9198 283.323 276.368
24 (92.47.151.240) AS50482 277.445 269.675
25 (88.204.208.2) AS9198 287.504 273.901
26 ntp.nic.kz (80.241.0.72) AS21282 281.406 272.951

MTR report: mtr -o "J M X LSR NA B W V" -u -P 123 -wbzc 100 207.171.7.3

  1. AS21282 semey-gw1.nic.kz (80.241.0.70) 0.1 6.8 200. 0.0% 100 100 0.2 3.6 0.2 200.4 23.4
  2. AS21282 80.241.0.254 148. 5.4 164. 0.0% 100 100 0.3 5.2 0.3 165.1 25.1
  3. AS21299 JZK.CRT.tnsplus.kz (80.241.35.173) 0.2 0.9 6.6 0.0% 100 100 1.2 1.6 0.9 9.4 1.3
  4. AS35168 85.29.131.0 0.0 0.2 1.7 0.0% 100 100 21.8 21.8 21.6 23.3 0.2
  5. AS35168 comp131-33.2day.kz (85.29.131.33) 0.1 0.1 1.3 0.0% 100 100 21.8 22.0 21.7 23.2 0.2
    AS21299 comp131-1.2day.kz (85.29.131.1)
  6. AS43727 178.210.33.138 0.2 1.9 34.5 0.0% 100 100 92.1 92.9 91.8 126.7 4.6
  7. AS??? 5.187.73.14 0.6 4.8 60.9 0.0% 100 100 124.1 126.7 123.4 185.3 10.2
    AS??? 5.187.73.36
    AS??? 5.187.73.48
    AS??? 5.187.73.31
    AS??? 5.187.73.24
    AS??? 5.187.73.33
  8. AS174 be5734.nr21.b015761-2.fra06.atlas.cogentco.com (149.14.210.81) 0.1 0.5 1.4 0.0% 100 100 131.2 131.3 130.8 132.5 0.4
  9. AS174 154.25.9.45 0.0 0.6 2.1 0.0% 100 100 124.5 125.0 124.4 126.6 0.6
  10. AS174 be2845.ccr41.fra03.atlas.cogentco.com (154.54.56.189) 0.0 0.5 1.6 0.0% 100 100 128.1 128.6 128.0 129.9 0.4
    AS174 be2846.ccr42.fra03.atlas.cogentco.com (154.54.37.29)
  11. AS174 be2800.ccr42.par01.atlas.cogentco.com (154.54.58.238) 0.3 1.5 4.1 0.0% 100 100 140.7 142.3 140.6 144.8 1.5
    AS174 be2799.ccr41.par01.atlas.cogentco.com (154.54.58.234)
  12. AS174 be3627.ccr41.jfk02.atlas.cogentco.com (66.28.4.197) 0.4 4.6 11.7 0.0% 100 100 213.3 209.0 202.6 214.5 4.7
    AS174 be3628.ccr42.jfk02.atlas.cogentco.com (154.54.27.169)
  13. AS174 be2807.ccr42.dca01.atlas.cogentco.com (154.54.40.110) 0.1 0.9 4.3 0.0% 100 100 216.0 215.7 214.0 218.8 0.8
    AS174 be2806.ccr41.dca01.atlas.cogentco.com (154.54.40.106)
  14. AS174 be2112.ccr41.atl01.atlas.cogentco.com (154.54.7.158) 0.7 2.8 7.4 0.0% 100 100 227.5 224.6 220.4 228.3 2.6
    AS174 be2113.ccr42.atl01.atlas.cogentco.com (154.54.24.222)
  15. AS174 be2690.ccr42.iah01.atlas.cogentco.com (154.54.28.130) 1.2 1.8 5.3 0.0% 100 100 236.4 239.4 236.4 242.2 1.8
    AS174 be2687.ccr41.iah01.atlas.cogentco.com (154.54.28.70)
  16. AS174 be2927.ccr21.elp01.atlas.cogentco.com (154.54.29.222) 1.1 0.8 2.8 0.0% 100 100 261.1 260.7 259.2 262.1 0.8
    AS174 be2928.ccr21.elp01.atlas.cogentco.com (154.54.30.162)
  17. AS174 be2930.ccr32.phx01.atlas.cogentco.com (154.54.42.77) 0.4 1.1 3.5 0.0% 100 100 263.6 263.2 261.2 265.4 1.0
    AS174 be2929.ccr31.phx01.atlas.cogentco.com (154.54.42.65)
  18. AS174 be2932.ccr42.lax01.atlas.cogentco.com (154.54.45.162) 1.1 0.7 2.5 0.0% 100 100 278.0 276.6 275.1 278.0 0.7
    AS174 be2931.ccr41.lax01.atlas.cogentco.com (154.54.44.86)
  19. AS174 be3360.ccr41.lax04.atlas.cogentco.com (154.54.25.150) 7.1 3.7 8.6 0.0% 100 100 277.8 273.4 269.4 279.1 3.2
    AS174 be3271.ccr41.lax04.atlas.cogentco.com (154.54.42.102)
  20. AS174 te0-1-0-0.410.r04.lax02.as7012.net (38.88.197.82) 1.4 0.8 2.5 0.0% 100 100 282.5 282.1 280.9 283.4 0.8
  21. AS7012 te7-4.r02.lax2.phyber.com (207.171.30.62) 1.3 1.6 15.5 0.0% 100 100 275.6 275.9 274.1 290.9 2.7
  22. AS??? ??? 0.0 0.0 0.0 100.0 100 0 0.0 0.0 0.0 0.0 0.0