I have run half of all the NTP servers in Costa Rica for many years now. These servers serve hundreds of devices on the internal ISP network they are part of and thousands more outside. For almost two months now, the responses they send either don’t get logged or don’t reach the Los Angeles monitoring station. This only happens over IPv4; over IPv6 it works fine.
Today I decided to take all the servers off the NTP pool, since apparently there is no way to fix this. Ask suggested I “see if someone else can monitor the server for a few days”.
So here are two of them: 200.59.16.50 and 200.59.19.5
If anyone can monitor these servers for the next couple of days, it would be very much appreciated.
I have some pool servers in Europe, and one of them also monitors a subset of other pool servers, so I have some data for 200.59.16.50. Over the last six months its reachability was 97.4%, which I think is pretty good. That is actually slightly better than the reachability of the server’s IPv6 address over the same period, which was 96.9%.
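The monitoring on my side is nothing sophisticated, by the way. A minimal sketch of the kind of periodic check that produces such reachability numbers (assuming ntpdate is installed; the log path is just a placeholder) would be:

# run from cron every few minutes; record whether the server answered an NTP query
ntpdate -q 200.59.16.50 >/dev/null 2>&1 \
  && echo "$(date -u +%FT%TZ) ok"   >> /var/log/ntp-reach.log \
  || echo "$(date -u +%FT%TZ) fail" >> /var/log/ntp-reach.log

The reachability for a period is then simply the fraction of “ok” entries.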
That particular server is a Stratum 1 and NTP is its only job. It had a hardware problem at the beginning of February and was taken out for a complete hardware upgrade during March; during that time a Stratum 2 server answered the queries. So yes, 97% sounds about right considering all the hardware issues it suffered. During April it performed just fine, but you can see the huge drop in traffic since it has been out of the pool most of the time.
I did some traffic engineering and now send the replies out via a different upstream provider.
The problem seems to be completely gone, but I wonder whether it was just packet loss on that link or something else. I still think that having at least three monitoring stations is critical to get a better picture of what is going on. The routers were sending the replies over the other link by default because that AS path was shorter, and changing inbound routes did not have any effect. So even though the monitoring station receives the replies fine now, others may not, and we won’t notice unless we monitor from many different locations.
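The traffic engineering itself is nothing exotic. My border routers are MikroTik, but on a plain Linux box the equivalent would be a policy-routing sketch along these lines (gateway address, interface and table number are placeholders):

# mark locally generated NTP replies (UDP source port 123)...
iptables -t mangle -A OUTPUT -p udp --sport 123 -j MARK --set-mark 0x7b
# ...and send marked packets out via the alternate upstream
ip rule add fwmark 0x7b lookup 123
ip route add default via 203.0.113.1 dev eth1 table 123

For comparison, the old route towards the monitoring station was: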
root@stratum1:~# traceroute 207.171.3.17
traceroute to 207.171.3.17 (207.171.3.17), 30 hops max, 60 byte packets
1 ether3.mikrotik-bb-zeus.fratec.net (200.59.16.33) 0.189 ms 0.160 ms 0.133 ms
2 sfp-sfpplus1.mikrotik-bb-router-gnd1.fratec.net (200.59.17.43) 0.573 ms 0.563 ms 0.545 ms
[AS52263 Telecable Economico S.A.]
3 rev185.125.nstelecablecr.com (190.113.125.185) 1.201 ms 1.327 ms 1.165 ms
4 172.31.1.9 (172.31.1.9) 1.158 ms 1.139 ms 1.195 ms
5 190.242.134.93 (190.242.134.93) 1.133 ms 1.174 ms 1.097 ms
[AS23520 Columbus Networks USA, Inc.]
6 xe-4-0-0.0-corozal-pamiami.fl.us.nama.pan-corzl-mx01.cwc.com (63.245.106.167) 12.524 ms 12.421 ms 12.415 ms
7 xe-3-0-0.0-boca-raton.fl.us.brx-teracore01.cwc.com (63.245.106.164) 62.754 ms xe-6-1-4.0-boca-raton.fl.us.nmi-teracore02.cwc.com (63.245.107.126) 44.313 ms 44.302 ms
8 xe-0-0-45-1.a00.miamfl02.us.bb.gin.ntt.net (129.250.198.217) 59.425 ms xe-0-0-47-2.a00.miamfl02.us.bb.gin.ntt.net (128.242.180.197) 54.737 ms 54.741 ms
[AS2914 NTT America, Inc.]
9 ae-7.r04.miamfl02.us.bb.gin.ntt.net (129.250.2.202) 131.008 ms 140.788 ms 130.972 ms
10 ae-3.r21.miamfl02.us.bb.gin.ntt.net (129.250.4.250) 56.116 ms 54.307 ms 47.513 ms
11 ae-4.r22.dllstx09.us.bb.gin.ntt.net (129.250.2.219) 93.713 ms 87.852 ms 84.646 ms
12 ae-5.r22.lsanca07.us.bb.gin.ntt.net (129.250.7.69) 131.629 ms 118.773 ms 126.571 ms
13 ae-1.r01.lsanca07.us.bb.gin.ntt.net (129.250.3.123) 144.666 ms 128.926 ms 136.790 ms
14 te7-1.r01.lax2.phyber.com (198.172.90.74) 137.752 ms 128.881 ms 141.051 ms
15 te7-4.r02.lax2.phyber.com (207.171.30.62) 300.943 ms 300.936 ms 300.906 ms
16 ntplax7.ntppool.net (207.171.3.17) 126.467 ms !X 131.171 ms !X 118.436 ms !X
The new and working route is:
root@stratum1:~# traceroute 207.171.3.17
traceroute to 207.171.3.17 (207.171.3.17), 30 hops max, 60 byte packets
1 ether3.mikrotik-bb-zeus.fratec.net (200.59.16.33) 0.158 ms 0.201 ms 0.115 ms
[AS20299 Newcom Limited]
2 186.176.7.73 (186.176.7.73) 0.819 ms 0.816 ms 0.802 ms
3 186.32.0.217 (186.32.0.217) 0.881 ms 0.867 ms 0.875 ms
[AS1299 Telia Company AB]
4 190.106.192.237 (190.106.192.237) 43.285 ms 43.254 ms 43.240 ms
5 mai-b1-link.telia.net (62.115.52.241) 41.705 ms 44.023 ms 44.028 ms
6 mai-b1-link.telia.net (62.115.138.160) 41.310 ms mai-b1-link.telia.net (80.91.250.236) 38.891 ms mai-b1-link.telia.net (80.91.253.220) 38.874 ms
7 ntt-ic-321350-mai-b1.c.telia.net (213.248.81.63) 41.066 ms 42.894 ms 42.899 ms
[AS2914 NTT America, Inc.]
8 ae-3.r21.miamfl02.us.bb.gin.ntt.net (129.250.4.250) 43.410 ms 43.405 ms 40.982 ms
9 ae-4.r22.dllstx09.us.bb.gin.ntt.net (129.250.2.219) 70.660 ms 75.940 ms 75.899 ms
10 ae-5.r22.lsanca07.us.bb.gin.ntt.net (129.250.7.69) 110.576 ms 100.157 ms 102.728 ms
11 ae-1.r01.lsanca07.us.bb.gin.ntt.net (129.250.3.123) 99.446 ms 103.431 ms 100.113 ms
12 te0-0-0-0.r04.lax02.as7012.net (198.172.90.74) 107.573 ms 99.088 ms 99.029 ms
13 te7-4.r02.lax2.phyber.com (207.171.30.62) 99.626 ms 99.427 ms 102.202 ms
14 ntplax7.ntppool.net (207.171.3.17) 100.043 ms !X 98.850 ms !X 100.048 ms !X
The first two transit providers changed, so possibly AS52263 Telecable Economico S.A. or AS23520 Columbus Networks USA, Inc. is dropping NTP packets.
In my case, HE tunnel problems began at the end of February, and by the middle of March I just killed the BGP session since it was not getting better. For now, native IPv6 traffic is reaching the Los Angeles monitoring station fine, but IPv4 is not.
I had this issue recently, at different times, with different servers that are not part of the pool.
Our monitoring system got no answer to its NTP queries, so I started capturing traffic on both ends.
The NTP server sent a valid reply, but it simply got lost. Other UDP services on the same machine were working fine.
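In case it is useful to anyone checking the same thing, the captures were nothing special; roughly this on both ends (interface names and the server address are placeholders):

# on the NTP server: capture everything on UDP port 123
tcpdump -ni eth0 -w server-ntp.pcap udp port 123
# on the monitoring side: the same, limited to the server in question
tcpdump -ni eth0 -w monitor-ntp.pcap udp port 123 and host 192.0.2.10

Comparing the two captures is what shows the reply leaving the server but never arriving.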
The packets traversed the networks of TATA Communications (AS6453) and then Cogent (AS174).
We did not manage to pinpoint the exact location of the culprit (traceroutes and MTRs on port 123 showed no loss).
But we eventually changed our BGP advertisements via Cogent to make them less attractive; packets then flowed via Level3/CenturyLink and the problems were gone.
The weird thing is that specifically NTP packets are eaten. We stopped NTP, ran the “echo” service on that port and tried to measure packet loss - but there was no loss!
With NTP, the server’s egress packets were eaten somewhere when routed via TATA/Cogent.
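For anyone who wants to repeat the echo experiment: after stopping ntpd, something along these lines is enough (a sketch; socat echoes every UDP datagram arriving on port 123 back to its sender, and nping from the nmap suite counts the replies - the server address is a placeholder):

# on the server, with ntpd stopped: echo UDP datagrams arriving on port 123
socat UDP4-RECVFROM:123,fork PIPE
# on the client: send probes with source and destination port 123 and count the replies
nping --udp --source-port 123 --dest-port 123 -c 100 192.0.2.10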
I’m having trouble with Cogent too. Have you tried traceroute/mtr with source port 123 and not just the destination port? On two servers in two different countries that recently became unusable, I can see significant packet loss at the border of the Cogent network. In the Italian zone it looks like this has caused a ~50% drop in the number of active servers.
That was tested when I started the “echo” service on that port to check whether I get responses to my packets. All those replies were sent with source port 123, of course.
I ran a traceroute from my site to an Italian pool server that is reachable from me via Cogent.
The last hop is the NTP server itself, which does not reply because I’m not sending valid NTP packets.
# traceroute -A -f 5 -U -p 123 --sport=123 -q 10 '212.45.144.88' '48'
traceroute to 212.45.144.88 (212.45.144.88), 30 hops max, 48 byte packets
5 te0-7-0-5.rcr21.dus01.atlas.cogentco.com (149.6.139.113) [AS174] 1.170 ms 1.613 ms 1.722 ms 1.372 ms 1.103 ms 1.069 ms 1.137 ms 1.087 ms 1.076 ms 1.099 ms
6 be2115.agr41.fra03.atlas.cogentco.com (154.54.37.37) [AS174] 5.037 ms 4.967 ms 5.104 ms 6.301 ms 4.889 ms 5.406 ms 4.832 ms 4.747 ms 5.062 ms 4.965 ms
7 be3187.ccr42.fra03.atlas.cogentco.com (130.117.1.118) [AS174] 4.849 ms 5.197 ms 4.829 ms 5.310 ms 4.802 ms 5.708 ms 5.164 ms 5.109 ms 5.119 ms 4.864 ms
8 be2960.ccr22.muc03.atlas.cogentco.com (154.54.36.254) [AS174] 10.455 ms 10.509 ms 10.447 ms 10.704 ms 11.138 ms 10.925 ms 11.076 ms 10.714 ms 10.754 ms 11.415 ms
9 be3073.ccr52.zrh02.atlas.cogentco.com (130.117.0.61) [AS174] 16.084 ms 16.360 ms 15.599 ms 15.914 ms 15.690 ms 15.630 ms 16.017 ms 15.809 ms 15.885 ms 16.603 ms
10 be3586.rcr21.mil01.atlas.cogentco.com (154.54.60.114) [AS174] 20.069 ms 19.954 ms 21.156 ms 19.866 ms 20.070 ms 19.945 ms 19.943 ms 21.001 ms 19.968 ms 20.045 ms
11 149.6.153.170 (149.6.153.170) [AS174] 20.858 ms 21.099 ms 20.880 ms 20.699 ms 21.033 ms 20.687 ms 20.798 ms 20.901 ms 20.725 ms 20.840 ms
12 lsr-tis1-te16.metrolink.it (212.45.159.57) [AS8816] 21.133 ms 21.452 ms 21.314 ms 21.334 ms 21.102 ms 21.249 ms 21.359 ms 21.087 ms 21.153 ms 21.225 ms
13 asr-sal1-te24.metrolink.it (212.45.137.189) [AS8816] 21.373 ms 21.586 ms 21.139 ms 21.130 ms 21.143 ms 21.061 ms 21.225 ms 21.564 ms 21.061 ms 21.327 ms
14 gw-mcs.metrolink.it (212.45.139.102) [AS8816] 20.314 ms 20.278 ms 20.301 ms 20.300 ms 20.543 ms 20.307 ms 20.462 ms 20.359 ms 20.332 ms 20.499 ms
15 * * * * * * * * * *
FWIW, in my case the problem is in the direction from the server, so mtr/traceroute needs to be run there. Client requests can reach the server without a problem, but the responses are dropped as they enter the Cogent network. When I change the port number it doesn’t happen, so it is specific to NTP. In mtr (patched to allow a source port < 1024) it looks like this:
That’s interesting…
Could you run this kind of MTR against one of my pool servers (217.144.138.234)? I would then run a traceroute back to you from that site.
With a bit of luck, the routing is symmetrical.
I tried a few different client machines, but I didn’t see any packet loss to that server. The packets don’t seem to go over the Cogent network. Also, it seems to have a perfect score, so I’m not sure what issue you are actually debugging here.
I had hoped it would be reached via Cogent - then we would have had the chance to run traces between the two sites to see where exactly the packets start getting dropped.
That’s simply what I wanted to see.
But I’m not entirely sure Cogent is the culprit of all these problems.
I have now had two events, around two weeks apart, where my pool server located at OVH in France got dropped out of the pool because the monitor couldn’t reach it.
On both sides, OVH and the monitoring station, Cogent was not involved; from both ends, packets traversed Coresite (“any2ix”). And it happened only to IPv4; IPv6 was running fine.
For debugging purposes I built a GRE tunnel to my router in Germany and started routing some subnets (e.g. 207.171.0.0/19, where I suspected the monitoring station to be) over that tunnel.
From there the traffic reached the monitoring subnet via Cogent, and my scores started rising steadily again.
So it might be totally unrelated and just traffic being erroneously filtered by OVH.
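For reference, the debugging setup was just a plain GRE tunnel plus a static route for the suspected monitoring prefix; roughly this on the OVH side (tunnel endpoints and inner addresses are placeholders):

# GRE tunnel from the OVH box to the router in Germany
ip tunnel add gre-dbg mode gre local 198.51.100.2 remote 192.0.2.1 ttl 255
ip link set gre-dbg up
ip addr add 10.255.0.1/30 dev gre-dbg
# route the suspected monitoring prefix through the tunnel
ip route add 207.171.0.0/19 via 10.255.0.2 dev gre-dbg

The German side then simply forwards that traffic out via its own transit, which in this case happened to be Cogent.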
I manage five servers (three in Paris and two on OVH dedicated servers), active since 2009 for the oldest:
One IPv6 only
Two IPv4 only
Two IPv4 + IPv6
For the past three months I have had the same monitoring problems from Los Angeles. And for the two IPv4+IPv6 servers, when the IPv4 address seems down to LA, the IPv6 one is OK.
I do some monitoring of these servers myself without any problem. So I have done nothing on the servers, but it seems there is a problem with the pool monitoring from LA.
traceroute -A -f 5 -U -p 123 -q 5 '212.45.144.88' '48' -n
traceroute to 212.45.144.88 (212.45.144.88), 30 hops max, 48 byte packets
5 78.254.242.162 [AS12322] 6.505 ms 6.584 ms 6.728 ms 6.883 ms 7.030 ms
6 78.254.248.146 [AS12322] 1.388 ms 1.532 ms 1.484 ms 1.612 ms 1.567 ms
7 194.149.160.105 [AS12322] 1.635 ms 1.586 ms 1.694 ms 1.689 ms 1.552 ms
8 194.149.166.54 [AS12322] 2.159 ms 2.119 ms 1.840 ms 1.904 ms 1.861 ms
9 * * * * *
10 4.69.211.21 [AS3356] 20.438 ms 20.399 ms 20.606 ms 20.622 ms 20.602 ms
11 4.69.211.21 [AS3356] 20.741 ms 197.056 ms 222.826 ms 222.892 ms 223.037 ms
12 213.242.125.46 [AS3356/AS9057] 21.434 ms 21.555 ms 21.508 ms 21.455 ms 21.399 ms
13 212.45.159.57 [AS8816] 21.949 ms 21.912 ms 21.930 ms 21.656 ms 21.689 ms
14 212.45.137.189 [AS8816] 22.076 ms 22.324 ms 22.355 ms 22.189 ms 22.065 ms
15 212.45.139.102 [AS8816] 20.862 ms 20.722 ms 20.560 ms 20.793 ms 20.886 ms
16 * * * * *
traceroute -A -f 5 -U -p 123 -q 5 '212.45.144.88' '48' -n
traceroute to 212.45.144.88 (212.45.144.88), 30 hops max, 48 byte packets
5 80.249.210.226 [AS1200] 24.379 ms 24.158 ms 24.029 ms 23.942 ms 24.098 ms
6 212.45.159.57 [AS8816] 24.307 ms 24.264 ms 24.401 ms 24.077 ms 24.547 ms
7 212.45.137.189 [AS8816] 25.556 ms 25.456 ms 25.367 ms 25.286 ms 25.206 ms
8 212.45.139.102 [AS8816] 23.264 ms 23.597 ms 23.456 ms 23.576 ms 23.496 ms
9 * * * * *
traceroute -A -f 5 -U -p 123 -q 5 '212.45.144.88' '48' -n
traceroute to 212.45.144.88 (212.45.144.88), 30 hops max, 48 byte packets
5 94.23.122.243 [AS16276] 5.904 ms 5.997 ms 178.33.100.161 [AS16276] 5.912 ms 94.23.122.243 [AS16276] 6.064 ms 178.33.100.161 [AS16276] 5.939 ms
6 80.249.210.226 [AS1200] 22.743 ms 22.849 ms 22.992 ms 22.946 ms 23.065 ms
7 212.45.159.57 [AS8816] 23.538 ms 23.520 ms 23.709 ms 23.898 ms 23.482 ms
8 212.45.137.189 [AS8816] 23.216 ms 23.199 ms 23.343 ms 23.331 ms 23.323 ms
9 212.45.139.102 [AS8816] 22.126 ms 22.160 ms 22.284 ms 22.177 ms 22.229 ms
10 * * * * *
11 * * * * *