Problems with the Los Angeles IPv4 monitoring station

I have run half of all the NTP servers in Costa Rica for many years now. These servers serve hundreds of devices on the internal ISP network they are part of and thousands more outside it. For almost 2 months now, the responses they send either don’t get logged or don’t reach the Los Angeles monitoring station. This only happens over IPv4; over IPv6 it works fine.

Today I decided to take all the servers out of the NTP pool, since apparently there is no way to fix this. Ask suggested that I “see if someone else can monitor the server for a few days”.

So here are 2 of them: 200.59.16.50 and 200.59.19.5

So, if anyone can monitor these servers for the next couple of days it would be very much appreciated.
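If anyone wants a quick-and-dirty way to do that from a Linux box, a loop like this is enough (just a sketch; ntpdate -q only sends client-mode queries, and the log file name is arbitrary):

# poll both servers once a minute and log every missing reply
while true; do
  for s in 200.59.16.50 200.59.19.5; do
    ntpdate -q "$s" >/dev/null 2>&1 \
      || echo "$(date -u '+%F %T') no reply from $s" >> ntp-monitor.log
  done
  sleep 60
done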

I have some pool servers in Europe, and one of them also monitors a subset of other pool servers, so I have some data for 200.59.16.50. Over the last six months its reachability was 97.4%, which I think is pretty good. It’s actually slightly better than the reachability of the server’s IPv6 address over the same period, which was 96.9%.


That server in particular is a Stratum 1 and NTP is its only job. It had a hardware problem at the beginning of February and was taken out for a complete hardware upgrade during March; during that time a Stratum 2 server answered the queries. So yes, 97% is certainly plausible considering all the hardware issues it suffered. During April it performed just fine, but you can see the huge drop in traffic since it has been out of the pool most of the time.

I did some traffic engineering and now send the replies using a different upstream provider.
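The idea is simply that packets leaving with source port 123 get routed via the other upstream. On a plain Linux box the equivalent policy routing would look roughly like this (a sketch only, not what runs on the actual routers; the gateway, routing table and mark are placeholders):

# mark locally generated NTP replies (UDP source port 123)
iptables -t mangle -A OUTPUT -p udp --sport 123 -j MARK --set-mark 0x1
# give marked packets their own default route via the alternate upstream
ip route add default via 203.0.113.1 table 100
ip rule add fwmark 0x1 lookup 100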

The problem seems to be completely gone, but I wonder if it’s just packet loss via that link or something else. I still think that having at least 3 monitoring stations is critical for getting a better picture of what is going on. By default the routers were sending the replies over the other link because its AS path was shorter, and changing inbound routes did not have any effect. So even though the monitoring station receives the replies fine now, others may not, and we won’t notice unless we monitor from a lot of different locations.

This is great. Can you tell what the upstream provider change was? I wonder if they are blocking (some?) NTP packets…

The old upstream route towards the monitor went:

root@stratum1:~# traceroute 207.171.3.17 
traceroute to 207.171.3.17 (207.171.3.17), 30 hops max, 60 byte packets
 1  ether3.mikrotik-bb-zeus.fratec.net (200.59.16.33)  0.189 ms  0.160 ms  0.133 ms
 2  sfp-sfpplus1.mikrotik-bb-router-gnd1.fratec.net (200.59.17.43)  0.573 ms  0.563 ms  0.545 ms
 [AS52263 Telecable Economico S.A.]
 3  rev185.125.nstelecablecr.com (190.113.125.185)  1.201 ms  1.327 ms  1.165 ms
 4  172.31.1.9 (172.31.1.9)  1.158 ms  1.139 ms  1.195 ms
 5  190.242.134.93 (190.242.134.93)  1.133 ms  1.174 ms  1.097 ms
 [AS23520 Columbus Networks USA, Inc.]
 6  xe-4-0-0.0-corozal-pamiami.fl.us.nama.pan-corzl-mx01.cwc.com (63.245.106.167)  12.524 ms  12.421 ms  12.415 ms
 7  xe-3-0-0.0-boca-raton.fl.us.brx-teracore01.cwc.com (63.245.106.164)  62.754 ms xe-6-1-4.0-boca-raton.fl.us.nmi-teracore02.cwc.com (63.245.107.126)  44.313 ms  44.302 ms
 8  xe-0-0-45-1.a00.miamfl02.us.bb.gin.ntt.net (129.250.198.217)  59.425 ms xe-0-0-47-2.a00.miamfl02.us.bb.gin.ntt.net (128.242.180.197)  54.737 ms  54.741 ms
 [AS2914 NTT America, Inc.]
 9  ae-7.r04.miamfl02.us.bb.gin.ntt.net (129.250.2.202)  131.008 ms  140.788 ms  130.972 ms
10  ae-3.r21.miamfl02.us.bb.gin.ntt.net (129.250.4.250)  56.116 ms  54.307 ms  47.513 ms
11  ae-4.r22.dllstx09.us.bb.gin.ntt.net (129.250.2.219)  93.713 ms  87.852 ms  84.646 ms
12  ae-5.r22.lsanca07.us.bb.gin.ntt.net (129.250.7.69)  131.629 ms  118.773 ms  126.571 ms
13  ae-1.r01.lsanca07.us.bb.gin.ntt.net (129.250.3.123)  144.666 ms  128.926 ms  136.790 ms
14  te7-1.r01.lax2.phyber.com (198.172.90.74)  137.752 ms  128.881 ms  141.051 ms
15  te7-4.r02.lax2.phyber.com (207.171.30.62)  300.943 ms  300.936 ms  300.906 ms
16  ntplax7.ntppool.net (207.171.3.17)  126.467 ms !X  131.171 ms !X  118.436 ms !X 

The new and working route is:

root@stratum1:~# traceroute 207.171.3.17
traceroute to 207.171.3.17 (207.171.3.17), 30 hops max, 60 byte packets
 1  ether3.mikrotik-bb-zeus.fratec.net (200.59.16.33)  0.158 ms  0.201 ms  0.115 ms
 [AS20299 Newcom Limited]
 2  186.176.7.73 (186.176.7.73)  0.819 ms  0.816 ms  0.802 ms
 3  186.32.0.217 (186.32.0.217)  0.881 ms  0.867 ms  0.875 ms
 [AS1299 Telia Company AB]
 4  190.106.192.237 (190.106.192.237)  43.285 ms  43.254 ms  43.240 ms
 5  mai-b1-link.telia.net (62.115.52.241)  41.705 ms  44.023 ms  44.028 ms
 6  mai-b1-link.telia.net (62.115.138.160)  41.310 ms mai-b1-link.telia.net (80.91.250.236)  38.891 ms mai-b1-link.telia.net (80.91.253.220)  38.874 ms
 7  ntt-ic-321350-mai-b1.c.telia.net (213.248.81.63)  41.066 ms  42.894 ms  42.899 ms
 [AS2914 NTT America, Inc.]
 8  ae-3.r21.miamfl02.us.bb.gin.ntt.net (129.250.4.250)  43.410 ms  43.405 ms  40.982 ms
 9  ae-4.r22.dllstx09.us.bb.gin.ntt.net (129.250.2.219)  70.660 ms  75.940 ms  75.899 ms
10  ae-5.r22.lsanca07.us.bb.gin.ntt.net (129.250.7.69)  110.576 ms  100.157 ms  102.728 ms
11  ae-1.r01.lsanca07.us.bb.gin.ntt.net (129.250.3.123)  99.446 ms  103.431 ms  100.113 ms
12  te0-0-0-0.r04.lax02.as7012.net (198.172.90.74)  107.573 ms  99.088 ms  99.029 ms
13  te7-4.r02.lax2.phyber.com (207.171.30.62)  99.626 ms  99.427 ms  102.202 ms
14  ntplax7.ntppool.net (207.171.3.17)  100.043 ms !X  98.850 ms !X  100.048 ms !X

The first two transit providers changed, so yes, possibly AS52263 Telecable Economico S.A. or AS23520 Columbus Networks USA, Inc. is dropping NTP packets.

I’ve had problems recently with IPv4 UDP (incl. NTP and DNS) transit through AS6939 “Hurricane Electric LLC” - my short story is written here:

When did all of the dropping begin? I was just wondering whether it is a result of the NTP reflection issue that occurred.

In my case, the HE tunnel problems began at the end of February, and by the middle of March I just killed the BGP session since it was not getting better. For now, native IPv6 traffic is reaching the Los Angeles monitoring station fine, but IPv4 is not.

Same server, but IPv4:

I had this issue recently at different times with different servers, which are not part of the pool.
Our monitoring system got no answer to its NTP queries, so I started capturing traffic on both ends.
The NTP server sent a valid reply, but it simply got lost. Other UDP services on the same machine were working fine.
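For anyone who wants to reproduce that kind of check, capturing on both ends with tcpdump is enough to see whether the reply leaves the server and whether it ever reaches the client (the interface name and the 192.0.2.20 / 203.0.113.10 addresses below are placeholders for the server and the client):

# on the server: does the reply to this client actually leave the box?
tcpdump -ni eth0 udp port 123 and host 203.0.113.10
# on the client: does anything with source port 123 ever arrive from the server?
tcpdump -ni eth0 udp src port 123 and host 192.0.2.20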

The packets traversed the networks of TATA Communications (AS6453) and then Cogent (AS174).
We did not manage to find the exact position of the culprit (traceroutes and MTRs on port 123 showed no loss),
but we eventually changed our BGP advertisements via Cogent to make them less attractive; packets then flowed via Level3/CenturyLink and the problems were gone.
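For illustration, the usual way to make announcements less attractive over one transit is AS-path prepending; a rough FRR sketch (the ASN 64500 and the neighbor address are placeholders, not the real session) would be:

# prepend our ASN three times on everything announced to the Cogent session
vtysh -c 'configure terminal' \
      -c 'route-map COGENT-OUT permit 10' \
      -c ' set as-path prepend 64500 64500 64500' \
      -c 'router bgp 64500' \
      -c ' address-family ipv4 unicast' \
      -c '  neighbor 192.0.2.1 route-map COGENT-OUT out'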

The weird thing is that exactly the NTP packets are eaten. We stopped NTP, ran the “echo” service on that port and tried to measure packet loss - but there was no loss!
With NTP, the egress packets of the server were eaten somewhere when routed via TATA/Cogent.
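For reference, that echo test is easy to replicate with socat standing in for the classic inetd echo service (the 192.0.2.20 address below is a placeholder for the server):

# on the server, after stopping ntpd: echo every UDP datagram back from port 123
socat -v UDP4-LISTEN:123,fork EXEC:/bin/cat
# from a client, send 100 probes and count how many come back
for i in $(seq 1 100); do
  echo "probe $i" | socat -T 2 - UDP4:192.0.2.20:123
done | wc -l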

At the moment, I assume there is a spreading firmware bug that leads to discarding NTP packets.
Such a thing happened several years ago (well, kind of) and maybe it’s coming again:
https://news.ntppool.org/2013/06/ipv6-monitoring-problems-for-german-servers/

I’m having trouble with Cogent too. Have you tried traceroute/mtr with source port 123 and not just the destination port? On two servers in two different countries that recently became unusable, I can see significant packet loss at the border of the Cogent network. In the Italian zone it looks like this has caused a ~50% drop in the number of active servers.

That has been tested when I started the “echo” service on that port to check whether I get responses to my packets. All those replies were sent with source port 123, of course :wink:

I ran a traceroute from my site to an Italian pool server which is reachable to me via Cogent.
The last hop is the NTP server itself; it does not reply because I’m not sending valid NTP packets.

# traceroute -A -f 5 -U -p 123 --sport=123 -q 10 '212.45.144.88' '48'
traceroute to 212.45.144.88 (212.45.144.88), 30 hops max, 48 byte packets
 5  te0-7-0-5.rcr21.dus01.atlas.cogentco.com (149.6.139.113) [AS174]  1.170 ms  1.613 ms  1.722 ms  1.372 ms  1.103 ms  1.069 ms  1.137 ms  1.087 ms  1.076 ms  1.099 ms
 6  be2115.agr41.fra03.atlas.cogentco.com (154.54.37.37) [AS174]  5.037 ms  4.967 ms  5.104 ms  6.301 ms  4.889 ms  5.406 ms  4.832 ms  4.747 ms  5.062 ms  4.965 ms
 7  be3187.ccr42.fra03.atlas.cogentco.com (130.117.1.118) [AS174]  4.849 ms  5.197 ms  4.829 ms  5.310 ms  4.802 ms  5.708 ms  5.164 ms  5.109 ms  5.119 ms  4.864 ms
 8  be2960.ccr22.muc03.atlas.cogentco.com (154.54.36.254) [AS174]  10.455 ms  10.509 ms  10.447 ms  10.704 ms  11.138 ms  10.925 ms  11.076 ms  10.714 ms  10.754 ms  11.415 ms
 9  be3073.ccr52.zrh02.atlas.cogentco.com (130.117.0.61) [AS174]  16.084 ms  16.360 ms  15.599 ms  15.914 ms  15.690 ms  15.630 ms  16.017 ms  15.809 ms  15.885 ms  16.603 ms
10  be3586.rcr21.mil01.atlas.cogentco.com (154.54.60.114) [AS174]  20.069 ms  19.954 ms  21.156 ms  19.866 ms  20.070 ms  19.945 ms  19.943 ms  21.001 ms  19.968 ms  20.045 ms
11  149.6.153.170 (149.6.153.170) [AS174]  20.858 ms  21.099 ms  20.880 ms  20.699 ms  21.033 ms  20.687 ms  20.798 ms  20.901 ms  20.725 ms  20.840 ms
12  lsr-tis1-te16.metrolink.it (212.45.159.57) [AS8816]  21.133 ms  21.452 ms  21.314 ms  21.334 ms  21.102 ms  21.249 ms  21.359 ms  21.087 ms  21.153 ms  21.225 ms
13  asr-sal1-te24.metrolink.it (212.45.137.189) [AS8816]  21.373 ms  21.586 ms  21.139 ms  21.130 ms  21.143 ms  21.061 ms  21.225 ms  21.564 ms  21.061 ms  21.327 ms
14  gw-mcs.metrolink.it (212.45.139.102) [AS8816]  20.314 ms  20.278 ms  20.301 ms  20.300 ms  20.543 ms  20.307 ms  20.462 ms  20.359 ms  20.332 ms  20.499 ms
15  * * * * * * * * * *

FWIW, in my case the problem is in the direction from the server, so mtr/traceroute needs to be running there. Client requests can reach the server without problem, but the responses are dropped as they enter the Cogent network. When I change the port number it doesn’t happen, so it is specific to NTP. In mtr (patched to allow source ports < 1024) it looks like this:

 2. 80.211.127.6          0.0%   102   17.5  22.1   1.8  90.8  17.7
 3. 62.149.185.100        0.0%   102    1.0   1.3   0.6   6.1   0.9
 4. 149.6.18.57          88.1%   102    1.0   1.3   1.0   3.1   0.6
 5. 154.54.36.225        85.1%   102    5.5   5.6   5.5   5.9   0.1
 6. 154.54.59.2          81.2%   102   10.2  10.4  10.1  11.5   0.3

That’s interesting…
Could you run this kind of MTR to one of my pool servers (217.144.138.234)? I would then run a traceroute back to you from that site.
With a bit of luck, the routing is symmetrical :slight_smile:

You can traceroute from the server (or one really close to it) with https://trace.ntppool.org/traceroute/8.8.8.8

(The debugger there also has a simple NTP client: https://trace.ntppool.org/ntp/17.253.4.125 )

I tried a few different client machines, but I didn’t see any packet loss to that server. The packets don’t seem to go over the Cogent network. Also, it seems to have a perfect score, so I’m not sure what issue you are actually debugging here. :slight_smile:

I hoped it would be reached via Cogent - then we would have had the possibility to run traces between the two sites to see where in particular packets start getting dropped.
That’s simply what I wanted to see :wink:

But I’m not entirely sure that Cogent is the culprit of all these problems.
I have now had two events, around two weeks apart, where my pool server located at OVH in France got dropped out of the pool because the monitor couldn’t reach it.
Cogent was not involved from either site, OVH or the monitoring station; from both ends, packets traversed CoreSite (“any2ix”). And it happened only to IPv4; IPv6 was running fine.

For debugging purposes I built a GRE tunnel to my router on my network in Germany and started routing some subnets (i.e. 207.171.0.0/19, where I suspected the monitoring station to be) over that tunnel.
From there the traffic used Cogent to reach the monitor subnet, and my scores started rising again steadily.
So, it might be totally unrelated and just erroneously filtered traffic at OVH.
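The GRE detour itself is only a handful of commands on Linux; a rough sketch, with placeholder tunnel endpoints and inner addresses (207.171.0.0/19 is the subnet mentioned above):

# tunnel from the OVH box to the router in Germany (endpoints are placeholders)
ip tunnel add gre-dbg mode gre local 192.0.2.20 remote 198.51.100.1 ttl 64
ip link set gre-dbg up
ip addr add 10.99.0.2/30 dev gre-dbg
# detour only the suspected monitoring subnet through the tunnel
ip route add 207.171.0.0/19 via 10.99.0.1 dev gre-dbg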

I’ve noticed that my IPv6 servers do have much better scores. This is what I see from NJ:

I manage 5 servers (three in Paris and two on OVH dedicated servers), active since 2009 for the oldest:
One IPv6 only
Two IPv4 only
Two IPv4 + IPv6
For the last 3 months I have had the same monitoring problems from Los Angeles. And for the two IPv4+IPv6 servers, when the IPv4 address seems down from LA, the IPv6 one is OK.
I do some monitoring of these servers myself without any problem, so I have changed nothing on them; it seems there is a problem with the pool monitoring from LA.

traceroute -A -f 5 -U -p 123  -q 5 '212.45.144.88' '48' -n
traceroute to 212.45.144.88 (212.45.144.88), 30 hops max, 48 byte packets
 5  78.254.242.162 [AS12322]  6.505 ms  6.584 ms  6.728 ms  6.883 ms  7.030 ms
 6  78.254.248.146 [AS12322]  1.388 ms  1.532 ms  1.484 ms  1.612 ms  1.567 ms
 7  194.149.160.105 [AS12322]  1.635 ms  1.586 ms  1.694 ms  1.689 ms  1.552 ms
 8  194.149.166.54 [AS12322]  2.159 ms  2.119 ms  1.840 ms  1.904 ms  1.861 ms
 9  * * * * *
10  4.69.211.21 [AS3356]  20.438 ms  20.399 ms  20.606 ms  20.622 ms  20.602 ms
11  4.69.211.21 [AS3356]  20.741 ms  197.056 ms  222.826 ms  222.892 ms  223.037 ms
12  213.242.125.46 [AS3356/AS9057]  21.434 ms  21.555 ms  21.508 ms  21.455 ms  21.399 ms
13  212.45.159.57 [AS8816]  21.949 ms  21.912 ms  21.930 ms  21.656 ms  21.689 ms
14  212.45.137.189 [AS8816]  22.076 ms  22.324 ms  22.355 ms  22.189 ms  22.065 ms
15  212.45.139.102 [AS8816]  20.862 ms  20.722 ms  20.560 ms  20.793 ms  20.886 ms
16  * * * * *

traceroute -A -f 5 -U -p 123  -q 5 '212.45.144.88' '48' -n
traceroute to 212.45.144.88 (212.45.144.88), 30 hops max, 48 byte packets
 5  80.249.210.226 [AS1200]  24.379 ms  24.158 ms  24.029 ms  23.942 ms  24.098 ms
 6  212.45.159.57 [AS8816]  24.307 ms  24.264 ms  24.401 ms  24.077 ms  24.547 ms
 7  212.45.137.189 [AS8816]  25.556 ms  25.456 ms  25.367 ms  25.286 ms  25.206 ms
 8  212.45.139.102 [AS8816]  23.264 ms  23.597 ms  23.456 ms  23.576 ms  23.496 ms
 9  * * * * *

traceroute -A -f 5 -U -p 123  -q 5 '212.45.144.88' '48' -n
traceroute to 212.45.144.88 (212.45.144.88), 30 hops max, 48 byte packets
 5  94.23.122.243 [AS16276]  5.904 ms  5.997 ms 178.33.100.161 [AS16276]  5.912 ms 94.23.122.243 [AS16276]  6.064 ms 178.33.100.161 [AS16276]  5.939 ms
 6  80.249.210.226 [AS1200]  22.743 ms  22.849 ms  22.992 ms  22.946 ms  23.065 ms
 7  212.45.159.57 [AS8816]  23.538 ms  23.520 ms  23.709 ms  23.898 ms  23.482 ms
 8  212.45.137.189 [AS8816]  23.216 ms  23.199 ms  23.343 ms  23.331 ms  23.323 ms
 9  212.45.139.102 [AS8816]  22.126 ms  22.160 ms  22.284 ms  22.177 ms  22.229 ms
10  * * * * *
11  * * * * *