Problems with the Los Angeles IPv4 Monitoring Station

I’ve had problems recently with IPv4 UDP (incl. NTP and DNS) transit through AS6939 “Hurricane Electric LLC” - my short story is written here:

When did all of the dropping begin? I was just wondering if it’s a result of the NTP reflection issue that occurred.

In my case, the HE tunnel problems began at the end of February, and by the middle of March I just killed the BGP session since it was not getting better. For now, native IPv6 traffic to the Los Angeles monitoring station is flowing fine, but IPv4 is not.

Same server, but IPv4:

Had this issue recently at different times with different servers, which are not part of the pool.
Our monitoring system got no answer to its NTP queries, so I started capturing traffic on both ends.
The NTP server sent a valid reply, but it simply got lost. Other UDP services on the same machine were working fine.
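
For reference, a capture along these lines on each end is enough to see the reply leaving the server and never arriving at the client (the interface name is just an example):

~# tcpdump -ni eth0 udp port 123    # -n: no name resolution, -i: capture interface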

The packets traversed the networks of TATA Communications (AS6453) and then Cogent (AS174).
We did not manage to pinpoint the exact location of the culprit (traceroutes and MTRs on port 123 showed no loss).
But we eventually changed our BGP advertisements via Cogent to make them less attractive; packets then flowed via Level3/CenturyLink and the problems were gone.

The weird thing is that it is specifically NTP packets that get eaten. We stopped NTP, ran the “echo” service on that port and tried to measure packet loss - but there was no loss!
With NTP, the server’s egress packets were eaten somewhere when routed via TATA/Cogent.
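
In case anyone wants to reproduce that test: with ntpd stopped, a throwaway UDP “echo” on port 123 can be run with socat, for example (just one way to do it, not necessarily how we did it back then):

~# socat UDP4-LISTEN:123,fork PIPE    # reflect every incoming datagram back to its sender

Loss can then be measured by comparing the packets sent with the echoes received, e.g. with tcpdump running on the client.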

At the moment, I suspect a spreading firmware bug that leads to NTP packets being discarded.
Something like this happened several years ago (well, kind of), and maybe it’s coming around again:
https://news.ntppool.org/2013/06/ipv6-monitoring-problems-for-german-servers/

I’m having trouble with Cogent too. Have you tried traceroute/mtr with source port 123, not just the destination port? On two servers in two different countries that recently became unusable, I can see significant packet loss at the border of the Cogent network. In the Italian zone it looks like this has caused a ~50% drop in the number of active servers.

That has been tested: when I started the “echo” service on that port to check whether I get responses to my packets, all those replies were of course sent with source port 123 :wink:

I ran a traceroute from my site to an Italian pool server that I reach via Cogent.
The last hop is the NTP server itself and does not reply, because I’m not sending valid NTP packets.

# traceroute -A -f 5 -U -p 123 --sport=123 -q 10 '212.45.144.88' '48'
traceroute to 212.45.144.88 (212.45.144.88), 30 hops max, 48 byte packets
 5  te0-7-0-5.rcr21.dus01.atlas.cogentco.com (149.6.139.113) [AS174]  1.170 ms  1.613 ms  1.722 ms  1.372 ms  1.103 ms  1.069 ms  1.137 ms  1.087 ms  1.076 ms  1.099 ms
 6  be2115.agr41.fra03.atlas.cogentco.com (154.54.37.37) [AS174]  5.037 ms  4.967 ms  5.104 ms  6.301 ms  4.889 ms  5.406 ms  4.832 ms  4.747 ms  5.062 ms  4.965 ms
 7  be3187.ccr42.fra03.atlas.cogentco.com (130.117.1.118) [AS174]  4.849 ms  5.197 ms  4.829 ms  5.310 ms  4.802 ms  5.708 ms  5.164 ms  5.109 ms  5.119 ms  4.864 ms
 8  be2960.ccr22.muc03.atlas.cogentco.com (154.54.36.254) [AS174]  10.455 ms  10.509 ms  10.447 ms  10.704 ms  11.138 ms  10.925 ms  11.076 ms  10.714 ms  10.754 ms  11.415 ms
 9  be3073.ccr52.zrh02.atlas.cogentco.com (130.117.0.61) [AS174]  16.084 ms  16.360 ms  15.599 ms  15.914 ms  15.690 ms  15.630 ms  16.017 ms  15.809 ms  15.885 ms  16.603 ms
10  be3586.rcr21.mil01.atlas.cogentco.com (154.54.60.114) [AS174]  20.069 ms  19.954 ms  21.156 ms  19.866 ms  20.070 ms  19.945 ms  19.943 ms  21.001 ms  19.968 ms  20.045 ms
11  149.6.153.170 (149.6.153.170) [AS174]  20.858 ms  21.099 ms  20.880 ms  20.699 ms  21.033 ms  20.687 ms  20.798 ms  20.901 ms  20.725 ms  20.840 ms
12  lsr-tis1-te16.metrolink.it (212.45.159.57) [AS8816]  21.133 ms  21.452 ms  21.314 ms  21.334 ms  21.102 ms  21.249 ms  21.359 ms  21.087 ms  21.153 ms  21.225 ms
13  asr-sal1-te24.metrolink.it (212.45.137.189) [AS8816]  21.373 ms  21.586 ms  21.139 ms  21.130 ms  21.143 ms  21.061 ms  21.225 ms  21.564 ms  21.061 ms  21.327 ms
14  gw-mcs.metrolink.it (212.45.139.102) [AS8816]  20.314 ms  20.278 ms  20.301 ms  20.300 ms  20.543 ms  20.307 ms  20.462 ms  20.359 ms  20.332 ms  20.499 ms
15  * * * * * * * * * *

FWIW, in my case the problem is in the direction from the server, so mtr/traceroute needs to be running there. Client requests can reach the server without a problem, but the responses are dropped as they enter the Cogent network. When I change the port number it doesn’t happen, so it is specific to NTP. In mtr (patched to allow a source port < 1024) it looks like this:

 2. 80.211.127.6          0.0%   102   17.5  22.1   1.8  90.8  17.7
 3. 62.149.185.100        0.0%   102    1.0   1.3   0.6   6.1   0.9
 4. 149.6.18.57          88.1%   102    1.0   1.3   1.0   3.1   0.6
 5. 154.54.36.225        85.1%   102    5.5   5.6   5.5   5.9   0.1
 6. 154.54.59.2          81.2%   102   10.2  10.4  10.1  11.5   0.3
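
(A note for anyone without a patched mtr: a stock Linux traceroute can send the same kind of probes from source port 123 when run as root - roughly like this, with the target address being a placeholder. It just doesn’t give you mtr’s running loss statistics.)

~# traceroute -U --sport=123 -q 10 192.0.2.1    # root is needed to bind the privileged source port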

That’s interesting…
Could you run this kind of MTR to one of my pool servers (217.144.138.234)? I would then run a traceroute back to you from that site.
With a bit of luck, the routing is symmetrical :slight_smile:

You can traceroute from the server (or one really close to it) with https://trace.ntppool.org/traceroute/8.8.8.8

(The debugger there also has a simple NTP client: https://trace.ntppool.org/ntp/17.253.4.125 )

I tried a few different client machines, but I didn’t see any packet loss to that server. The packets don’t seem to go over the Cogent network. Also, it seems to have a perfect score, so I’m not sure what issue you are actually debugging here. :slight_smile:

I had hoped it would be reached via Cogent - then we would have had the chance to run traces between the two sites to see where exactly the packets start getting dropped.
That’s simply what I wanted to see :wink:

But I’m not entirely sure Cogent is the culprit behind all of these problems.
I have now had two events, roughly two weeks apart, where my pool server located at OVH in France got dropped out of the pool because the monitor couldn’t reach it.
Cogent was not involved from either site, OVH or the monitoring station; from both ends, packets traversed CoreSite (“any2ix”). And it happened only to IPv4; IPv6 was running fine.

For debugging purposes I built a GRE tunnel to my router on my network in Germany and started routing some subnets (e.g. 207.171.0.0/19, where I suspected the monitoring station to be) over that tunnel.
From there the traffic used Cogent to reach the monitor subnet, and my scores started rising steadily again.
So it might be totally unrelated and just erroneously filtered traffic at OVH.
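
For the record, the tunnel setup was nothing fancy - roughly the usual iproute2 commands; the endpoint addresses here are placeholders, not my real ones:

~# ip tunnel add gre1 mode gre local 192.0.2.10 remote 198.51.100.20 ttl 255
~# ip link set gre1 up
~# ip route add 207.171.0.0/19 dev gre1    # send traffic for the suspected monitor subnet via Germany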

I’ve noticed that my IPv6 addresses have much better scores. This is what I see from NJ:

I manage 5 servers (three in Paris and two on dedicated OVH servers), active since 2009 for the oldest:
One IPv6 only
Two IPv4 only
Two IPv4 + IPv6
For the last 3 months I have had the same monitoring problems from Los Angeles. And for the two IPv4+IPv6 servers, when the IPv4 address seems down to LA, the IPv6 one is OK.
I do my own monitoring of these servers without any problem, so I have changed nothing on them, but it seems there is a problem with the pool monitoring from LA.

traceroute -A -f 5 -U -p 123  -q 5 '212.45.144.88' '48' -n
traceroute to 212.45.144.88 (212.45.144.88), 30 hops max, 48 byte packets
 5  78.254.242.162 [AS12322]  6.505 ms  6.584 ms  6.728 ms  6.883 ms  7.030 ms
 6  78.254.248.146 [AS12322]  1.388 ms  1.532 ms  1.484 ms  1.612 ms  1.567 ms
 7  194.149.160.105 [AS12322]  1.635 ms  1.586 ms  1.694 ms  1.689 ms  1.552 ms
 8  194.149.166.54 [AS12322]  2.159 ms  2.119 ms  1.840 ms  1.904 ms  1.861 ms
 9  * * * * *
10  4.69.211.21 [AS3356]  20.438 ms  20.399 ms  20.606 ms  20.622 ms  20.602 ms
11  4.69.211.21 [AS3356]  20.741 ms  197.056 ms  222.826 ms  222.892 ms  223.037 ms
12  213.242.125.46 [AS3356/AS9057]  21.434 ms  21.555 ms  21.508 ms  21.455 ms  21.399 ms
13  212.45.159.57 [AS8816]  21.949 ms  21.912 ms  21.930 ms  21.656 ms  21.689 ms
14  212.45.137.189 [AS8816]  22.076 ms  22.324 ms  22.355 ms  22.189 ms  22.065 ms
15  212.45.139.102 [AS8816]  20.862 ms  20.722 ms  20.560 ms  20.793 ms  20.886 ms
16  * * * * *

traceroute -A -f 5 -U -p 123  -q 5 '212.45.144.88' '48' -n
traceroute to 212.45.144.88 (212.45.144.88), 30 hops max, 48 byte packets
 5  80.249.210.226 [AS1200]  24.379 ms  24.158 ms  24.029 ms  23.942 ms  24.098 ms
 6  212.45.159.57 [AS8816]  24.307 ms  24.264 ms  24.401 ms  24.077 ms  24.547 ms
 7  212.45.137.189 [AS8816]  25.556 ms  25.456 ms  25.367 ms  25.286 ms  25.206 ms
 8  212.45.139.102 [AS8816]  23.264 ms  23.597 ms  23.456 ms  23.576 ms  23.496 ms
 9  * * * * *

traceroute -A -f 5 -U -p 123  -q 5 '212.45.144.88' '48' -n
traceroute to 212.45.144.88 (212.45.144.88), 30 hops max, 48 byte packets
 5  94.23.122.243 [AS16276]  5.904 ms  5.997 ms 178.33.100.161 [AS16276]  5.912 ms 94.23.122.243 [AS16276]  6.064 ms 178.33.100.161 [AS16276]  5.939 ms
 6  80.249.210.226 [AS1200]  22.743 ms  22.849 ms  22.992 ms  22.946 ms  23.065 ms
 7  212.45.159.57 [AS8816]  23.538 ms  23.520 ms  23.709 ms  23.898 ms  23.482 ms
 8  212.45.137.189 [AS8816]  23.216 ms  23.199 ms  23.343 ms  23.331 ms  23.323 ms
 9  212.45.139.102 [AS8816]  22.126 ms  22.160 ms  22.284 ms  22.177 ms  22.229 ms
10  * * * * *
11  * * * * *

I have the same issue with both OVH and Scaleway/Online. They worked for years, and all of a sudden my IPv4 score drops into oblivion while IPv6 remains top notch. I do not have any connectivity issues besides NTP.

Since yesterday, one of my servers suddenly has a very poor score on its IPv4 address. IPv6 is working fine. This server is not located at OVH.
I’m not sure whether this coincides with the current maintenance work or not; unfortunately I’m unable to do a traceroute from the monitor’s site at the moment.
From my site, I reach 207.171.7.0/24 via Cogent - and I’m quite sure Cogent was also used in the other direction a few weeks ago.

So it might be a Cogent-related problem. I’ll try to open a ticket there… :-/

JFTR:

~# traceroute -f 3 -4 -I -q 1 -U -p 123 trace.ntppool.org.
traceroute to trace.ntppool.org. (207.171.7.45), 30 hops max, 60 byte packets
 3  te0-7-0-5.rcr21.dus01.atlas.cogentco.com (149.6.139.113)  1.603 ms
 4  be2114.agr21.ams03.atlas.cogentco.com (130.117.48.61)  4.832 ms
 5  be2440.ccr42.ams03.atlas.cogentco.com (130.117.50.5)  82.810 ms
 6  be12488.ccr42.lon13.atlas.cogentco.com (130.117.51.41)  81.432 ms
 7  be2490.ccr42.jfk02.atlas.cogentco.com (154.54.42.85)  83.384 ms
 8  be2359.ccr41.jfk02.atlas.cogentco.com (154.54.43.109)  83.441 ms
 9  be2113.ccr42.atl01.atlas.cogentco.com (154.54.24.222)  100.892 ms
10  be2690.ccr42.iah01.atlas.cogentco.com (154.54.28.130)  115.310 ms
11  be2928.ccr21.elp01.atlas.cogentco.com (154.54.30.162)  128.409 ms
12  be2929.ccr31.phx01.atlas.cogentco.com (154.54.42.65)  139.057 ms
13  be2931.ccr41.lax01.atlas.cogentco.com (154.54.44.86)  149.081 ms
14  be3360.ccr41.lax04.atlas.cogentco.com (154.54.25.150)  150.413 ms
15  *
16  *
17  *
18  *

If you look at your offset graph, you will see the downturn happened when the monitoring station moved from Los Angeles to Newark, NJ. (Mine improved.)

This is linked to the maintenance. Also, trace. has not been migrated yet, so the traceroute is not showing the current path. You can try it against the URL of this site (community.ntppool.com), since that is already at the new location.

Using the HE looking glass (lg.he.net) seems fine, but I’m not sure how representative that is.

A major update on this!

My server still seems to be unreachable from the monitoring station since the maintenance weekend.
The very same server is reachable on IPv6 without any problems.

With tcpdump I can see that (presumably) the monitor is querying my server and replies are being sent - but apparently these are getting lost somewhere on the way from my server to the monitor.

These are the packets captured with tcpdump on 2019-05-09 (time is UTC):

22:32:28.678055 IP 139.178.64.42.54313 > 217.144.138.234.123: NTPv4, Client, length 48
22:32:28.678510 IP 217.144.138.234.123 > 139.178.64.42.54313: NTPv4, Server, length 48
22:47:41.217332 IP 139.178.64.42.41577 > 217.144.138.234.123: NTPv4, Client, length 48
22:47:41.217772 IP 217.144.138.234.123 > 139.178.64.42.41577: NTPv4, Server, length 48
23:02:53.329398 IP 139.178.64.42.49108 > 217.144.138.234.123: NTPv4, Client, length 48
23:02:53.329835 IP 217.144.138.234.123 > 139.178.64.42.49108: NTPv4, Server, length 48
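
(For completeness: the capture filter is nothing special, essentially just the NTP port and the monitor’s address, along these lines:)

~# tcpdump -ni eth0 udp port 123 and host 139.178.64.42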

The monitor’s CSV log shows “i/o timeout” for most of these queries:

ts_epoch,ts,offset,step,score,monitor_id,monitor_name,leap,error
1557442976,"2019-05-09 23:02:56",0,-5,-63.7,6,"Newark, NJ, US",,"i/o timeout"
1557442976,"2019-05-09 23:02:56",0,-5,-63.7,,,,"i/o timeout"
1557442061,"2019-05-09 22:47:41",-0.002267079,1,-61.8,6,"Newark, NJ, US",0,
1557442061,"2019-05-09 22:47:41",-0.002267079,1,-61.8,,,0,
1557441151,"2019-05-09 22:32:31",0,-5,-66.1,6,"Newark, NJ, US",,"i/o timeout"
1557441151,"2019-05-09 22:32:31",0,-5,-66.1,,,,"i/o timeout"
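
(Counting the failed probes in a saved copy of that CSV log - assuming it is in log.csv - is a quick one-liner:)

~# grep -c 'i/o timeout' log.csv    # number of probes the monitor logged as timed out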

While doing some traceroutes I noticed something weird:
the packets originate from source port 123/UDP, targeted at a random high port on the monitoring station (as seen in tcpdump).
It looks like these packets are getting rate-limited, probably by CenturyLink (f.k.a. Level3)!

~# traceroute -z 0.6 -w 0.5 -U --sport=123 -p 54313 -q 10 -t 0xb8 -A 139.178.64.42
traceroute to 139.178.64.42 (139.178.64.42), 30 hops max, 60 byte packets
 1  ipv4gate.ntwk-w.301-moved.de (217.144.138.225) [AS15987/AS8820]  0.473 ms  0.382 ms  0.277 ms  0.406 ms  0.322 ms  0.330 ms  0.324 ms  0.327 ms  0.336 ms  0.273 ms
 2  r4-pty.wup.tal.de (81.92.2.89) [AS8820]  0.463 ms  0.244 ms  0.397 ms  0.393 ms  0.321 ms  0.330 ms  1.973 ms  0.309 ms  0.296 ms  0.460 ms
 3  xe-9-1-2.edge4.dus1.level3.net (194.54.94.65) [AS41692]  1.532 ms  1.171 ms  0.867 ms  0.838 ms  0.820 ms  0.893 ms  1.015 ms  0.930 ms  0.896 ms  1.252 ms
 4  * * * * * * * * * *
 5  * nyc2-brdr-02.inet.qwest.net (63.235.42.101) [AS209]  78.080 ms  78.133 ms * *  78.113 ms * * * *
 6  dca-edge-22.inet.qwest.net (67.14.6.142) [AS209]  114.805 ms * *  87.070 ms * * *  87.064 ms  86.962 ms *
 7  * 72.165.161.86 (72.165.161.86) [AS209]  86.660 ms * * * *  86.746 ms *  86.777 ms  86.763 ms
 8  * * * * lag32.fr3.lga.llnw.net (68.142.88.157) [AS22822]  83.146 ms * *  83.050 ms *  83.028 ms
 9  * * * * * * * * * *
10  0.xe-1-0-0.bbr1.ewr1.packet.net (198.16.4.94) [AS5485/AS54825]  83.947 ms * *  89.428 ms *  84.074 ms  83.961 ms *  84.064 ms *
11  * * * * * * * * * *
12  * * * * * * * * * *
13  monewr1.ntppool.net (139.178.64.42) [AS54825]  83.946 ms * *  84.061 ms  84.089 ms *  84.031 ms * * *

When doing the same traceroute but changing only the source port to a random high port, the result looks fine:

~# traceroute -z 0.6 -w 0.5 -U --sport=51553 -p 54313 -q 10 -t 0xb8 -A 139.178.64.42
traceroute to 139.178.64.42 (139.178.64.42), 30 hops max, 60 byte packets
 1  ipv4gate.ntwk-w.301-moved.de (217.144.138.225) [AS15987/AS8820]  0.607 ms  0.281 ms  0.294 ms  0.286 ms  0.341 ms  0.305 ms  0.243 ms  0.255 ms  0.289 ms  0.262 ms
 2  r4-pty.wup.tal.de (81.92.2.89) [AS8820]  0.405 ms  2.329 ms  28.117 ms  4.922 ms  0.493 ms  0.314 ms  0.295 ms  0.198 ms  1.087 ms  0.272 ms
 3  xe-9-1-2.edge4.dus1.level3.net (194.54.94.65) [AS41692]  0.850 ms  0.974 ms  0.876 ms  0.877 ms  0.952 ms  0.946 ms  0.859 ms  1.212 ms  0.971 ms  1.540 ms
 4  * * * * * * * * * *
 5  nyc2-brdr-02.inet.qwest.net (63.235.42.101) [AS209]  78.133 ms  78.162 ms  81.005 ms  89.660 ms  77.895 ms  78.105 ms  77.973 ms  78.022 ms  78.158 ms  78.083 ms
 6  dca-edge-22.inet.qwest.net (67.14.6.142) [AS209]  86.963 ms  104.652 ms  86.935 ms  87.029 ms  87.147 ms  86.899 ms  90.714 ms  86.926 ms  87.130 ms  87.038 ms
 7  72.165.161.86 (72.165.161.86) [AS209]  86.726 ms  86.630 ms  86.687 ms  86.644 ms  86.707 ms  86.710 ms  86.751 ms  86.667 ms  86.647 ms  87.720 ms
 8  lag32.fr3.lga.llnw.net (68.142.88.157) [AS22822]  82.979 ms  82.989 ms  83.143 ms  83.084 ms  83.030 ms  82.964 ms  83.022 ms  83.035 ms  83.041 ms  82.956 ms
 9  * * * * * * * * * *
10  0.xe-1-0-0.bbr1.ewr1.packet.net (198.16.4.94) [AS5485/AS54825]  84.579 ms  84.652 ms  83.948 ms  83.885 ms  84.337 ms  92.428 ms  83.922 ms  84.265 ms  84.051 ms  83.884 ms
11  * * * * * * * * * *
12  * * * * * * * * * *
13  monewr1.ntppool.net (139.178.64.42) [AS54825]  84.023 ms  84.070 ms  83.924 ms  83.882 ms  83.872 ms  84.068 ms  84.052 ms  83.909 ms  83.814 ms  83.931 ms

(To be more accurate, I set ToS = 0xb8, since this is the tag my ntpd applies to all outgoing packets - but to be honest, it didn’t change anything.)
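
(The marking itself is easy to verify on the wire, by the way; with -v, tcpdump prints the ToS byte of each packet:)

~# tcpdump -v -ni eth0 udp src port 123    # look for "tos 0xb8" in the IP header line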

Evidently, the culprit is CenturyLink! Since the “Qwest” brand apparently is still used by CenturyLink nearly 10 years after acquiring them, there is simply no other provider between hops 3 and 5, where the packet loss starts.
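
(If in doubt whether those Qwest-branded hops really belong to CenturyLink: a plain whois on one of the hop addresses - it is AS209 either way - should show who owns that address space nowadays:)

~# whois 63.235.42.101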

I’ll try to open a ticket there and ask about rate-limiting on their network…

Just a short addendum:
This is not just happening towards the pool monitor, but whenever traffic traverses CenturyLink’s network.
Even the root nameservers are affected by this :wink:

~# traceroute -z 0.6 -w 0.5 -U --sport=123 -p 53 -q 5 -t 0xb8 -A h.root-servers.net.
traceroute to h.root-servers.net. (198.97.190.53), 30 hops max, 60 byte packets
 1  ipv4gate.ntwk-w.301-moved.de (217.144.138.225) [AS15987/AS8820]  0.422 ms  0.323 ms  0.307 ms  0.332 ms  0.274 ms
 2  r4-pty.wup.tal.de (81.92.2.89) [AS8820]  0.356 ms  35.711 ms  0.279 ms  0.206 ms  0.257 ms
 3  xe-9-1-2.edge4.dus1.level3.net (194.54.94.65) [AS41692]  2.085 ms  0.990 ms  0.989 ms  0.835 ms  0.918 ms
 4  * * * * *
 5  lsv2-agw1.inet.qwest.net (63.235.42.101) [AS209]  79.991 ms  79.973 ms *  78.107 ms  78.164 ms
 6  * * * * dcx2-edge-02.inet.qwest.net (67.14.28.138) [AS209]  86.949 ms
 7  * * * * *
 8  np-5-1-1-181-px-p2p.equinix-ord.core.dren.net (143.56.224.121) [AS668]  107.427 ms *  109.000 ms * *
 9  * * * * *
10  * * * 143.56.3.163 (143.56.3.163) [AS668]  129.952 ms *
11  * * * h.root-servers.net (198.97.190.53) [AS1508]  202.645 ms *


~# traceroute -z 0.6 -w 0.5 -U --sport=125 -p 53 -q 5 -t 0xb8 -A h.root-servers.net.
traceroute to h.root-servers.net. (198.97.190.53), 30 hops max, 60 byte packets
 1  ipv4gate.ntwk-w.301-moved.de (217.144.138.225) [AS15987/AS8820]  0.281 ms  0.377 ms  0.393 ms  0.375 ms  0.447 ms
 2  r4-pty.wup.tal.de (81.92.2.89) [AS8820]  0.265 ms  0.499 ms  0.328 ms  0.289 ms  0.528 ms
 3  xe-9-1-2.edge4.dus1.level3.net (194.54.94.65) [AS41692]  2.921 ms  0.900 ms  0.930 ms  0.883 ms  0.968 ms
 4  ae-1-3503.ear2.NewYork6.Level3.net (4.69.214.18) [AS3356]  78.230 ms  78.001 ms * * *
 5  63-235-42-101.dia.static.qwest.net (63.235.42.101) [AS209]  80.175 ms  78.247 ms  78.163 ms  81.879 ms  78.178 ms
 6  dcx2-edge-02.inet.qwest.net (67.14.28.138) [AS209]  86.960 ms  86.966 ms  86.938 ms  88.060 ms  86.934 ms
 7  * * * * *
 8  * * * * *
 9  * * * * *
10  143.56.3.163 (143.56.3.163) [AS668]  130.219 ms  129.877 ms  129.855 ms  129.993 ms  129.947 ms
11  h.root-servers.net (198.97.190.53) [AS1508]  129.386 ms  129.323 ms  129.230 ms  129.310 ms  129.310 ms

And this is no cosmetic issue, it is definitely affecting connectivity:

~# dig -b '217.144.138.XXX#123' @198.97.190.53 . SOA

; <<>> DiG 9.10.3-P4-Ubuntu <<>> -b 217.144.138.XXX#123 @198.97.190.53 . SOA
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
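
(For contrast: the same query with the source port left unpinned - i.e. binding only the source address and letting dig pick an ephemeral port - should come back fine if the filtering really is specific to source port 123:

~# dig -b '217.144.138.XXX' @198.97.190.53 . SOA

I’m only binding the source address here, not the port.)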