Problems with the Los Angeles IPv4 monitoring station

I have the same issue with both OVH and Scaleway/Online. It worked for years, and then all of a sudden my IPv4 score went to oblivion while IPv6 remains top notch. I don’t have any connection issues besides NTP.

Since yesterday, one of my servers has suddenly had a very poor score on its IPv4 address. IPv6 is working fine. This server is not located at OVH.
I’m not sure whether this coincides with the current maintenance work; unfortunately, I’m unable to run a traceroute from the monitor’s site at the moment.
From my site, I reach 207.171.7.0/24 via Cogent, and I’m quite sure Cogent was also used in the other direction a few weeks ago.

So, it might be a Cogent-related problem. I’ll try to open a ticket there… :-/

JFTR:

~# traceroute -f 3 -4 -I -q 1 -U -p 123 trace.ntppool.org.
traceroute to trace.ntppool.org. (207.171.7.45), 30 hops max, 60 byte packets
 3  te0-7-0-5.rcr21.dus01.atlas.cogentco.com (149.6.139.113)  1.603 ms
 4  be2114.agr21.ams03.atlas.cogentco.com (130.117.48.61)  4.832 ms
 5  be2440.ccr42.ams03.atlas.cogentco.com (130.117.50.5)  82.810 ms
 6  be12488.ccr42.lon13.atlas.cogentco.com (130.117.51.41)  81.432 ms
 7  be2490.ccr42.jfk02.atlas.cogentco.com (154.54.42.85)  83.384 ms
 8  be2359.ccr41.jfk02.atlas.cogentco.com (154.54.43.109)  83.441 ms
 9  be2113.ccr42.atl01.atlas.cogentco.com (154.54.24.222)  100.892 ms
10  be2690.ccr42.iah01.atlas.cogentco.com (154.54.28.130)  115.310 ms
11  be2928.ccr21.elp01.atlas.cogentco.com (154.54.30.162)  128.409 ms
12  be2929.ccr31.phx01.atlas.cogentco.com (154.54.42.65)  139.057 ms
13  be2931.ccr41.lax01.atlas.cogentco.com (154.54.44.86)  149.081 ms
14  be3360.ccr41.lax04.atlas.cogentco.com (154.54.25.150)  150.413 ms
15  *
16  *
17  *
18  *

If you look at your offset graph, you will see that the downturn happened when the monitoring station moved from Los Angeles to Newark, NJ (mine improved).

This is linked to the maintenance. Also, trace.ntppool.org has not been migrated yet, so its traceroute does not show the current path. You can try a traceroute to this site (community.ntppool.com) instead, since it is already at the new location.

Using the HE looking glass (lg.he.net), things seem fine, but I’m not sure how representative that is.

A major update on this!

Since the maintenance weekend, my server still seems to be unreachable from the monitoring station.
The very same server is reachable on IPv6 without any problems.

With tcpdump I can see what is presumably the monitor querying my server, and my replies going out, but apparently these replies get lost somewhere on the way from my server to the monitor.

These are the packets captured with tcpdump on 2019-05-09 (time is UTC):

22:32:28.678055 IP 139.178.64.42.54313 > 217.144.138.234.123: NTPv4, Client, length 48
22:32:28.678510 IP 217.144.138.234.123 > 139.178.64.42.54313: NTPv4, Server, length 48
22:47:41.217332 IP 139.178.64.42.41577 > 217.144.138.234.123: NTPv4, Client, length 48
22:47:41.217772 IP 217.144.138.234.123 > 139.178.64.42.41577: NTPv4, Server, length 48
23:02:53.329398 IP 139.178.64.42.49108 > 217.144.138.234.123: NTPv4, Client, length 48
23:02:53.329835 IP 217.144.138.234.123 > 139.178.64.42.49108: NTPv4, Server, length 48
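To make the pairing explicit, here is a small Python sketch (my own helper, not part of any pool tooling) that parses tcpdump lines in the format above and matches each client query to the server reply sent back to the same client address and port. The name `pair_queries` and the regex are illustrative assumptions; note that this only proves what left my server, not what arrived at the monitor.

```python
import re
from datetime import datetime

# Matches tcpdump lines such as:
# 22:32:28.678055 IP 139.178.64.42.54313 > 217.144.138.234.123: NTPv4, Client, length 48
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d+) IP "
    r"(?P<src>\d+\.\d+\.\d+\.\d+)\.(?P<sport>\d+) > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.(?P<dport>\d+): NTPv4, (?P<mode>Client|Server)"
)


def pair_queries(lines):
    """Return (query_time, reply_delay_seconds or None) for each client query."""
    pending = {}  # (client_ip, client_port) -> time the query was seen
    results = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        t = datetime.strptime(m["time"], "%H:%M:%S.%f")
        if m["mode"] == "Client":
            pending[(m["src"], int(m["sport"]))] = t
        else:
            # A server reply goes back to the client's address and port.
            key = (m["dst"], int(m["dport"]))
            if key in pending:
                q = pending.pop(key)
                results.append((q.strftime("%H:%M:%S"), (t - q).total_seconds()))
    # Queries for which no reply was captured at all.
    for q in pending.values():
        results.append((q.strftime("%H:%M:%S"), None))
    return results
```

Fed the six lines above, every query should come back paired with a reply sent well under a millisecond later, which is why I suspect the loss happens on the return path.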

The monitor’s CSV log reports “i/o timeout” for most of these queries:

ts_epoch,ts,offset,step,score,monitor_id,monitor_name,leap,error
1557442976,"2019-05-09 23:02:56",0,-5,-63.7,6,"Newark, NJ, US",,"i/o timeout"
1557442976,"2019-05-09 23:02:56",0,-5,-63.7,,,,"i/o timeout"
1557442061,"2019-05-09 22:47:41",-0.002267079,1,-61.8,6,"Newark, NJ, US",0,
1557442061,"2019-05-09 22:47:41",-0.002267079,1,-61.8,,,0,
1557441151,"2019-05-09 22:32:31",0,-5,-66.1,6,"Newark, NJ, US",,"i/o timeout"
1557441151,"2019-05-09 22:32:31",0,-5,-66.1,,,,"i/o timeout"
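A rough way to summarize such a CSV export is sketched below. It assumes the column layout shown above and treats rows with an empty `monitor_name` as the aggregated overall score (my interpretation, not documented behaviour); `summarize` is a hypothetical helper name.

```python
import csv
import io


def summarize(csv_text):
    """Count results and "i/o timeout" errors per monitor in the CSV log."""
    summary = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["monitor_name"]
        if not name:
            # Assumption: rows without a monitor are the aggregated score; skip them.
            continue
        stats = summary.setdefault(name, {"total": 0, "timeouts": 0})
        stats["total"] += 1
        if row["error"] == "i/o timeout":
            stats["timeouts"] += 1
    return summary
```

For the excerpt above this gives 3 results for "Newark, NJ, US", 2 of them timeouts, even though tcpdump shows a reply leaving my server for each of those queries.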

While running some traceroutes, I noticed something odd:
The replies originate from source port 123/UDP and are sent to a random high port on the monitoring station (as seen in tcpdump).
It looks like these packets are getting rate-limited, probably by CenturyLink (f.k.a. Level3)!
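To reproduce this with real NTP packets instead of traceroute probes, something like the following Python sketch could compare replies for queries sent from source port 123 versus an ephemeral port. It is only a sketch under assumptions: `ntp_probe` is a hypothetical helper, `server` should be some NTP server you reach through CenturyLink (not the monitor itself), and binding port 123 needs root and fails while ntpd already holds that address.

```python
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2208988800  # seconds from 1900-01-01 (NTP era 0) to 1970-01-01 (Unix)


def build_ntp_query(now=None):
    """Build a 48-byte NTPv4 client request: LI=0, VN=4, Mode=3 (client)."""
    if now is None:
        now = time.time()
    packet = bytearray(48)
    packet[0] = (0 << 6) | (4 << 3) | 3  # = 0x23
    # Transmit Timestamp occupies bytes 40..47 (32-bit seconds + 32-bit fraction).
    secs = int(now) + NTP_EPOCH_OFFSET
    frac = int((now % 1) * 2**32)
    struct.pack_into("!II", packet, 40, secs, frac)
    return bytes(packet)


def ntp_probe(server, sport=0, timeout=2.0):
    """Send one query from local UDP port `sport` (0 = ephemeral).

    Returns the round-trip time in seconds, or None if no reply arrived.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind(("", sport))
        s.settimeout(timeout)
        start = time.monotonic()
        s.sendto(build_ntp_query(), (server, 123))
        try:
            s.recvfrom(512)
        except socket.timeout:
            return None
        return time.monotonic() - start
```

Running, say, 20 probes with `sport=123` and 20 with `sport=0` against the same server and counting how many return `None` would show whether the source port alone makes the difference.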

~# traceroute -z 0.6 -w 0.5 -U --sport=123 -p 54313 -q 10 -t 0xb8 -A 139.178.64.42
traceroute to 139.178.64.42 (139.178.64.42), 30 hops max, 60 byte packets
 1  ipv4gate.ntwk-w.301-moved.de (217.144.138.225) [AS15987/AS8820]  0.473 ms  0.382 ms  0.277 ms  0.406 ms  0.322 ms  0.330 ms  0.324 ms  0.327 ms  0.336 ms  0.273 ms
 2  r4-pty.wup.tal.de (81.92.2.89) [AS8820]  0.463 ms  0.244 ms  0.397 ms  0.393 ms  0.321 ms  0.330 ms  1.973 ms  0.309 ms  0.296 ms  0.460 ms
 3  xe-9-1-2.edge4.dus1.level3.net (194.54.94.65) [AS41692]  1.532 ms  1.171 ms  0.867 ms  0.838 ms  0.820 ms  0.893 ms  1.015 ms  0.930 ms  0.896 ms  1.252 ms
 4  * * * * * * * * * *
 5  * nyc2-brdr-02.inet.qwest.net (63.235.42.101) [AS209]  78.080 ms  78.133 ms * *  78.113 ms * * * *
 6  dca-edge-22.inet.qwest.net (67.14.6.142) [AS209]  114.805 ms * *  87.070 ms * * *  87.064 ms  86.962 ms *
 7  * 72.165.161.86 (72.165.161.86) [AS209]  86.660 ms * * * *  86.746 ms *  86.777 ms  86.763 ms
 8  * * * * lag32.fr3.lga.llnw.net (68.142.88.157) [AS22822]  83.146 ms * *  83.050 ms *  83.028 ms
 9  * * * * * * * * * *
10  0.xe-1-0-0.bbr1.ewr1.packet.net (198.16.4.94) [AS5485/AS54825]  83.947 ms * *  89.428 ms *  84.074 ms  83.961 ms *  84.064 ms *
11  * * * * * * * * * *
12  * * * * * * * * * *
13  monewr1.ntppool.net (139.178.64.42) [AS54825]  83.946 ms * *  84.061 ms  84.089 ms *  84.031 ms * * *

Running the same traceroute but changing only the source port to a random high port gives a clean result:

~# traceroute -z 0.6 -w 0.5 -U --sport=51553 -p 54313 -q 10 -t 0xb8 -A 139.178.64.42
traceroute to 139.178.64.42 (139.178.64.42), 30 hops max, 60 byte packets
 1  ipv4gate.ntwk-w.301-moved.de (217.144.138.225) [AS15987/AS8820]  0.607 ms  0.281 ms  0.294 ms  0.286 ms  0.341 ms  0.305 ms  0.243 ms  0.255 ms  0.289 ms  0.262 ms
 2  r4-pty.wup.tal.de (81.92.2.89) [AS8820]  0.405 ms  2.329 ms  28.117 ms  4.922 ms  0.493 ms  0.314 ms  0.295 ms  0.198 ms  1.087 ms  0.272 ms
 3  xe-9-1-2.edge4.dus1.level3.net (194.54.94.65) [AS41692]  0.850 ms  0.974 ms  0.876 ms  0.877 ms  0.952 ms  0.946 ms  0.859 ms  1.212 ms  0.971 ms  1.540 ms
 4  * * * * * * * * * *
 5  nyc2-brdr-02.inet.qwest.net (63.235.42.101) [AS209]  78.133 ms  78.162 ms  81.005 ms  89.660 ms  77.895 ms  78.105 ms  77.973 ms  78.022 ms  78.158 ms  78.083 ms
 6  dca-edge-22.inet.qwest.net (67.14.6.142) [AS209]  86.963 ms  104.652 ms  86.935 ms  87.029 ms  87.147 ms  86.899 ms  90.714 ms  86.926 ms  87.130 ms  87.038 ms
 7  72.165.161.86 (72.165.161.86) [AS209]  86.726 ms  86.630 ms  86.687 ms  86.644 ms  86.707 ms  86.710 ms  86.751 ms  86.667 ms  86.647 ms  87.720 ms
 8  lag32.fr3.lga.llnw.net (68.142.88.157) [AS22822]  82.979 ms  82.989 ms  83.143 ms  83.084 ms  83.030 ms  82.964 ms  83.022 ms  83.035 ms  83.041 ms  82.956 ms
 9  * * * * * * * * * *
10  0.xe-1-0-0.bbr1.ewr1.packet.net (198.16.4.94) [AS5485/AS54825]  84.579 ms  84.652 ms  83.948 ms  83.885 ms  84.337 ms  92.428 ms  83.922 ms  84.265 ms  84.051 ms  83.884 ms
11  * * * * * * * * * *
12  * * * * * * * * * *
13  monewr1.ntppool.net (139.178.64.42) [AS54825]  84.023 ms  84.070 ms  83.924 ms  83.882 ms  83.872 ms  84.068 ms  84.052 ms  83.909 ms  83.814 ms  83.931 ms

(To be more accurate, I set ToS = 0xb8, since this is the marking my ntpd applies to all outgoing packets. To be honest, though, it didn’t change anything.)
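For reference, a ToS byte of 0xb8 splits into DSCP 46 (Expedited Forwarding) with both ECN bits zero. A minimal Python sketch of that split, plus setting the same ToS on an outgoing UDP socket via the `IP_TOS` socket option (available on Linux; `split_tos` and `udp_socket_with_tos` are just illustrative names):

```python
import socket


def split_tos(tos):
    """Split an IPv4 ToS byte into (DSCP, ECN): upper 6 bits, lower 2 bits."""
    return tos >> 2, tos & 0x03


def udp_socket_with_tos(tos=0xB8):
    """Create a UDP socket whose outgoing IPv4 packets carry the given ToS byte."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)
    return s
```

`split_tos(0xB8)` yields `(46, 0)`, i.e. EF, which is why setting `-t 0xb8` on traceroute mimics what ntpd sends.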

Evidently, the culprit is CenturyLink! CenturyLink still seems to use the “Qwest” brand nearly 10 years after acquiring it, so there is no other provider between hops 3 and 5, where the packet loss starts.
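To put numbers on where the loss starts, a small parser can count answered probes per hop in `traceroute -q N` output. This is a sketch under my own assumptions: it counts `… ms` entries as answered probes and treats hops that never answer (like hop 4 above) as ICMP-filtering routers rather than loss. `hop_loss` and `first_lossy_hop` are hypothetical helpers.

```python
import re

HOP_RE = re.compile(r"^\s*(\d+)\s+(.*)$")
RTT_RE = re.compile(r"\d+\.\d+ ms")


def hop_loss(traceroute_output, probes):
    """Return {hop: fraction of probes lost} from `traceroute -q <probes>` output."""
    loss = {}
    for line in traceroute_output.splitlines():
        m = HOP_RE.match(line)
        if not m:
            continue  # e.g. the "traceroute to ..." header line
        hop, rest = int(m.group(1)), m.group(2)
        answered = len(RTT_RE.findall(rest))
        loss[hop] = 1 - answered / probes
    return loss


def first_lossy_hop(loss):
    """First hop that answered some, but not all, probes (fully silent hops are ignored)."""
    for hop in sorted(loss):
        if 0 < loss[hop] < 1:
            return hop
    return None
```

Applied to the two traces above, the `--sport=123` run should report hop 5 (the first qwest.net router) as the first lossy hop, while the `--sport=51553` run should report `None`.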

I’ll try to open a ticket there and ask about rate-limiting on their network…

Just a short addendum:
This is not happening only to the pool monitor, but to any traffic traversing CenturyLink’s network.
Even the root nameservers are affected by this :wink:

~# traceroute -z 0.6 -w 0.5 -U --sport=123 -p 53 -q 5 -t 0xb8 -A h.root-servers.net.
traceroute to h.root-servers.net. (198.97.190.53), 30 hops max, 60 byte packets
 1  ipv4gate.ntwk-w.301-moved.de (217.144.138.225) [AS15987/AS8820]  0.422 ms  0.323 ms  0.307 ms  0.332 ms  0.274 ms
 2  r4-pty.wup.tal.de (81.92.2.89) [AS8820]  0.356 ms  35.711 ms  0.279 ms  0.206 ms  0.257 ms
 3  xe-9-1-2.edge4.dus1.level3.net (194.54.94.65) [AS41692]  2.085 ms  0.990 ms  0.989 ms  0.835 ms  0.918 ms
 4  * * * * *
 5  lsv2-agw1.inet.qwest.net (63.235.42.101) [AS209]  79.991 ms  79.973 ms *  78.107 ms  78.164 ms
 6  * * * * dcx2-edge-02.inet.qwest.net (67.14.28.138) [AS209]  86.949 ms
 7  * * * * *
 8  np-5-1-1-181-px-p2p.equinix-ord.core.dren.net (143.56.224.121) [AS668]  107.427 ms *  109.000 ms * *
 9  * * * * *
10  * * * 143.56.3.163 (143.56.3.163) [AS668]  129.952 ms *
11  * * * h.root-servers.net (198.97.190.53) [AS1508]  202.645 ms *

~# traceroute -z 0.6 -w 0.5 -U --sport=125 -p 53 -q 5 -t 0xb8 -A h.root-servers.net.
traceroute to h.root-servers.net. (198.97.190.53), 30 hops max, 60 byte packets
 1  ipv4gate.ntwk-w.301-moved.de (217.144.138.225) [AS15987/AS8820]  0.281 ms  0.377 ms  0.393 ms  0.375 ms  0.447 ms
 2  r4-pty.wup.tal.de (81.92.2.89) [AS8820]  0.265 ms  0.499 ms  0.328 ms  0.289 ms  0.528 ms
 3  xe-9-1-2.edge4.dus1.level3.net (194.54.94.65) [AS41692]  2.921 ms  0.900 ms  0.930 ms  0.883 ms  0.968 ms
 4  ae-1-3503.ear2.NewYork6.Level3.net (4.69.214.18) [AS3356]  78.230 ms  78.001 ms * * *
 5  63-235-42-101.dia.static.qwest.net (63.235.42.101) [AS209]  80.175 ms  78.247 ms  78.163 ms  81.879 ms  78.178 ms
 6  dcx2-edge-02.inet.qwest.net (67.14.28.138) [AS209]  86.960 ms  86.966 ms  86.938 ms  88.060 ms  86.934 ms
 7  * * * * *
 8  * * * * *
 9  * * * * *
10  143.56.3.163 (143.56.3.163) [AS668]  130.219 ms  129.877 ms  129.855 ms  129.993 ms  129.947 ms
11  h.root-servers.net (198.97.190.53) [AS1508]  129.386 ms  129.323 ms  129.230 ms  129.310 ms  129.310 ms

And this is not a cosmetic issue; it definitely affects connectivity:

~# dig -b '217.144.138.XXX#123' @198.97.190.53 . SOA

; <<>> DiG 9.10.3-P4-Ubuntu <<>> -b 217.144.138.XXX#123 @198.97.190.53 . SOA
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
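The same check can be scripted without dig. The sketch below sends a minimal `. SOA` query from a chosen source address and port and reports whether any reply comes back. `build_root_soa_query` and `dns_probe` are hypothetical helpers; binding to port 123 requires root and an address where ntpd is not already listening on that port.

```python
import socket
import struct


def build_root_soa_query(qid=0x1234):
    """Minimal DNS query for '. SOA' (roughly what `dig . SOA` sends)."""
    # Header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    # Question: root name (single zero byte), QTYPE=SOA (6), QCLASS=IN (1)
    question = b"\x00" + struct.pack("!HH", 6, 1)
    return header + question


def dns_probe(server, src_ip="", sport=0, timeout=3.0):
    """Send the query from (src_ip, sport); return True if any reply arrives."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind((src_ip, sport))
        s.settimeout(timeout)
        s.sendto(build_root_soa_query(), (server, 53))
        try:
            s.recvfrom(4096)
            return True
        except socket.timeout:
            return False
```

Calling `dns_probe("198.97.190.53", sport=123)` and `dns_probe("198.97.190.53", sport=125)` several times each would mirror the two traceroutes to h.root-servers.net above.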