A major update on this!
My server has still seemed unreachable from the monitoring station over IPv4 since the maintenance weekend.
The very same server is reachable on IPv6 without any problems.
With tcpdump I can see what is presumably the monitor querying my server, and I can see the replies being sent, but apparently these replies are getting lost somewhere on the way from my server to the monitor.
These are the packets captured with tcpdump on 2019-05-09 (time is UTC):
22:32:28.678055 IP 139.178.64.42.54313 > 217.144.138.234.123: NTPv4, Client, length 48
22:32:28.678510 IP 217.144.138.234.123 > 139.178.64.42.54313: NTPv4, Server, length 48
22:47:41.217332 IP 139.178.64.42.41577 > 217.144.138.234.123: NTPv4, Client, length 48
22:47:41.217772 IP 217.144.138.234.123 > 139.178.64.42.41577: NTPv4, Server, length 48
23:02:53.329398 IP 139.178.64.42.49108 > 217.144.138.234.123: NTPv4, Client, length 48
23:02:53.329835 IP 217.144.138.234.123 > 139.178.64.42.49108: NTPv4, Server, length 48
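To double-check that every query in the capture really got an answer, the lines can be paired up by client address and port. This is only a minimal sketch: `CAPTURE` is the capture above pasted into a string, and `reply_delays` is a helper name made up for this post, not part of tcpdump or any NTP tool.

```python
from datetime import datetime, timedelta

# The six tcpdump lines from above, pasted verbatim.
CAPTURE = """
22:32:28.678055 IP 139.178.64.42.54313 > 217.144.138.234.123: NTPv4, Client, length 48
22:32:28.678510 IP 217.144.138.234.123 > 139.178.64.42.54313: NTPv4, Server, length 48
22:47:41.217332 IP 139.178.64.42.41577 > 217.144.138.234.123: NTPv4, Client, length 48
22:47:41.217772 IP 217.144.138.234.123 > 139.178.64.42.41577: NTPv4, Server, length 48
22:32:53.329398 IP 139.178.64.42.49108 > 217.144.138.234.123: NTPv4, Client, length 48
22:32:53.329835 IP 217.144.138.234.123 > 139.178.64.42.49108: NTPv4, Server, length 48
"""

def reply_delays(capture):
    """Pair each client query with its server reply; return delays in microseconds."""
    pending = {}   # client "addr.port" -> timestamp of the query
    delays = []
    for line in capture.strip().splitlines():
        fields = line.split()
        ts = datetime.strptime(fields[0], "%H:%M:%S.%f")
        src, dst = fields[2], fields[4].rstrip(":")
        if "Client," in fields:
            pending[src] = ts
        elif "Server," in fields and dst in pending:
            delta = ts - pending.pop(dst)
            delays.append(delta // timedelta(microseconds=1))
    return delays

print(reply_delays(CAPTURE))
```

All three queries get a reply in under half a millisecond, so the server side of the exchange looks healthy. (The hour of the last pair is irrelevant for the delay, only the difference within each pair is used.)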
The monitor’s CSV log nevertheless reports “i/o timeout” for most of these queries:
ts_epoch,ts,offset,step,score,monitor_id,monitor_name,leap,error
1557442976,"2019-05-09 23:02:56",0,-5,-63.7,6,"Newark, NJ, US",,"i/o timeout"
1557442976,"2019-05-09 23:02:56",0,-5,-63.7,,,,"i/o timeout"
1557442061,"2019-05-09 22:47:41",-0.002267079,1,-61.8,6,"Newark, NJ, US",0,
1557442061,"2019-05-09 22:47:41",-0.002267079,1,-61.8,,,0,
1557441151,"2019-05-09 22:32:31",0,-5,-66.1,6,"Newark, NJ, US",,"i/o timeout"
1557441151,"2019-05-09 22:32:31",0,-5,-66.1,,,,"i/o timeout"
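For a longer log it is easier to count the timeouts than to eyeball them. The following is a rough sketch using the rows above; it assumes (my guess, not something the pool documents here) that the rows without a `monitor_id` are just overall-score duplicates and can be skipped. `summarize` is a made-up helper name.

```python
import csv
import io

# The CSV rows from above, pasted verbatim.
LOG = '''ts_epoch,ts,offset,step,score,monitor_id,monitor_name,leap,error
1557442976,"2019-05-09 23:02:56",0,-5,-63.7,6,"Newark, NJ, US",,"i/o timeout"
1557442976,"2019-05-09 23:02:56",0,-5,-63.7,,,,"i/o timeout"
1557442061,"2019-05-09 22:47:41",-0.002267079,1,-61.8,6,"Newark, NJ, US",0,
1557442061,"2019-05-09 22:47:41",-0.002267079,1,-61.8,,,0,
1557441151,"2019-05-09 22:32:31",0,-5,-66.1,6,"Newark, NJ, US",,"i/o timeout"
1557441151,"2019-05-09 22:32:31",0,-5,-66.1,,,,"i/o timeout"
'''

def summarize(log_text):
    """Return (checks, timeouts) counting only rows that name a monitor."""
    rows = [r for r in csv.DictReader(io.StringIO(log_text)) if r["monitor_id"]]
    timeouts = [r for r in rows if r["error"] == "i/o timeout"]
    return len(rows), len(timeouts)

checks, timeouts = summarize(LOG)
print(f"{timeouts} of {checks} monitor checks timed out")
```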
Running some traceroutes, I noticed something odd:
My server's replies originate from source port 123/UDP and are sent to a random high port on the monitoring station (as seen in the tcpdump capture).
It looks like these packets are getting rate-limited, probably by CenturyLink (which acquired Level 3)!
~# traceroute -z 0.6 -w 0.5 -U --sport=123 -p 54313 -q 10 -t 0xb8 -A 139.178.64.42
traceroute to 139.178.64.42 (139.178.64.42), 30 hops max, 60 byte packets
1 ipv4gate.ntwk-w.301-moved.de (217.144.138.225) [AS15987/AS8820] 0.473 ms 0.382 ms 0.277 ms 0.406 ms 0.322 ms 0.330 ms 0.324 ms 0.327 ms 0.336 ms 0.273 ms
2 r4-pty.wup.tal.de (81.92.2.89) [AS8820] 0.463 ms 0.244 ms 0.397 ms 0.393 ms 0.321 ms 0.330 ms 1.973 ms 0.309 ms 0.296 ms 0.460 ms
3 xe-9-1-2.edge4.dus1.level3.net (194.54.94.65) [AS41692] 1.532 ms 1.171 ms 0.867 ms 0.838 ms 0.820 ms 0.893 ms 1.015 ms 0.930 ms 0.896 ms 1.252 ms
4 * * * * * * * * * *
5 * nyc2-brdr-02.inet.qwest.net (63.235.42.101) [AS209] 78.080 ms 78.133 ms * * 78.113 ms * * * *
6 dca-edge-22.inet.qwest.net (67.14.6.142) [AS209] 114.805 ms * * 87.070 ms * * * 87.064 ms 86.962 ms *
7 * 72.165.161.86 (72.165.161.86) [AS209] 86.660 ms * * * * 86.746 ms * 86.777 ms 86.763 ms
8 * * * * lag32.fr3.lga.llnw.net (68.142.88.157) [AS22822] 83.146 ms * * 83.050 ms * 83.028 ms
9 * * * * * * * * * *
10 0.xe-1-0-0.bbr1.ewr1.packet.net (198.16.4.94) [AS5485/AS54825] 83.947 ms * * 89.428 ms * 84.074 ms 83.961 ms * 84.064 ms *
11 * * * * * * * * * *
12 * * * * * * * * * *
13 monewr1.ntppool.net (139.178.64.42) [AS54825] 83.946 ms * * 84.061 ms 84.089 ms * 84.031 ms * * *
Running the same traceroute with only the source port changed to a random high port, the result looks fine:
~# traceroute -z 0.6 -w 0.5 -U --sport=51553 -p 54313 -q 10 -t 0xb8 -A 139.178.64.42
traceroute to 139.178.64.42 (139.178.64.42), 30 hops max, 60 byte packets
1 ipv4gate.ntwk-w.301-moved.de (217.144.138.225) [AS15987/AS8820] 0.607 ms 0.281 ms 0.294 ms 0.286 ms 0.341 ms 0.305 ms 0.243 ms 0.255 ms 0.289 ms 0.262 ms
2 r4-pty.wup.tal.de (81.92.2.89) [AS8820] 0.405 ms 2.329 ms 28.117 ms 4.922 ms 0.493 ms 0.314 ms 0.295 ms 0.198 ms 1.087 ms 0.272 ms
3 xe-9-1-2.edge4.dus1.level3.net (194.54.94.65) [AS41692] 0.850 ms 0.974 ms 0.876 ms 0.877 ms 0.952 ms 0.946 ms 0.859 ms 1.212 ms 0.971 ms 1.540 ms
4 * * * * * * * * * *
5 nyc2-brdr-02.inet.qwest.net (63.235.42.101) [AS209] 78.133 ms 78.162 ms 81.005 ms 89.660 ms 77.895 ms 78.105 ms 77.973 ms 78.022 ms 78.158 ms 78.083 ms
6 dca-edge-22.inet.qwest.net (67.14.6.142) [AS209] 86.963 ms 104.652 ms 86.935 ms 87.029 ms 87.147 ms 86.899 ms 90.714 ms 86.926 ms 87.130 ms 87.038 ms
7 72.165.161.86 (72.165.161.86) [AS209] 86.726 ms 86.630 ms 86.687 ms 86.644 ms 86.707 ms 86.710 ms 86.751 ms 86.667 ms 86.647 ms 87.720 ms
8 lag32.fr3.lga.llnw.net (68.142.88.157) [AS22822] 82.979 ms 82.989 ms 83.143 ms 83.084 ms 83.030 ms 82.964 ms 83.022 ms 83.035 ms 83.041 ms 82.956 ms
9 * * * * * * * * * *
10 0.xe-1-0-0.bbr1.ewr1.packet.net (198.16.4.94) [AS5485/AS54825] 84.579 ms 84.652 ms 83.948 ms 83.885 ms 84.337 ms 92.428 ms 83.922 ms 84.265 ms 84.051 ms 83.884 ms
11 * * * * * * * * * *
12 * * * * * * * * * *
13 monewr1.ntppool.net (139.178.64.42) [AS54825] 84.023 ms 84.070 ms 83.924 ms 83.882 ms 83.872 ms 84.068 ms 84.052 ms 83.909 ms 83.814 ms 83.931 ms
(To be more accurate, I set ToS = 0xb8, since that is the value my ntpd applies to all outgoing packets, but to be honest, it didn’t change anything.)
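To put numbers on the difference between the two runs, the unanswered probes per hop can be counted from the traceroute output. This is a small sketch, not a real tool: `hop_loss` is a name I made up, and the sample lines are hops 3 and 5 copied from the two runs above. Keep in mind that a `*` at an intermediate hop can also mean the router limits its own ICMP replies, so the counts are a hint rather than proof.

```python
import re

# Hops 3 and 5 from the two traceroute runs above, keyed by source port.
RUNS = {
    123: [
        "3 xe-9-1-2.edge4.dus1.level3.net (194.54.94.65) [AS41692] 1.532 ms 1.171 ms 0.867 ms 0.838 ms 0.820 ms 0.893 ms 1.015 ms 0.930 ms 0.896 ms 1.252 ms",
        "5 * nyc2-brdr-02.inet.qwest.net (63.235.42.101) [AS209] 78.080 ms 78.133 ms * * 78.113 ms * * * *",
    ],
    51553: [
        "3 xe-9-1-2.edge4.dus1.level3.net (194.54.94.65) [AS41692] 0.850 ms 0.974 ms 0.876 ms 0.877 ms 0.952 ms 0.946 ms 0.859 ms 1.212 ms 0.971 ms 1.540 ms",
        "5 nyc2-brdr-02.inet.qwest.net (63.235.42.101) [AS209] 78.133 ms 78.162 ms 81.005 ms 89.660 ms 77.895 ms 78.105 ms 77.973 ms 78.022 ms 78.158 ms 78.083 ms",
    ],
}

def hop_loss(line):
    """Return (hop, unanswered, sent) for one line of traceroute output."""
    hop, rest = line.split(None, 1)
    # Each answered probe shows up as a round-trip time such as "78.080 ms".
    replies = len(re.findall(r"\d+\.\d+ ms", rest))
    # Each unanswered probe shows up as a standalone "*".
    unanswered = len(re.findall(r"(?<!\S)\*(?!\S)", rest))
    return int(hop), unanswered, replies + unanswered

for sport, lines in RUNS.items():
    for line in lines:
        hop, unanswered, sent = hop_loss(line)
        print(f"sport {sport}, hop {hop}: {unanswered}/{sent} probes unanswered")
```

With source port 123, hop 3 answers every probe while hop 5 leaves 7 of 10 unanswered; with source port 51553 both hops answer all 10.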
Evidently, the culprit is CenturyLink! Since CenturyLink apparently still uses the “Qwest” brand nearly 10 years after acquiring the company, there is no other provider between hops 3 and 5, which is where the packet loss starts.
I’ll try to open a ticket there and ask about rate limiting on their network…