Problems with OVH?

monitoring

#1

Hi,

today one of my servers hosted at OVH in France got kicked out of the pool because of a very low score.
But only on its IPv4 address: via IPv6 the server is happily serving 4k req/s right now, while the IPv4 score keeps getting worse.
The IPv4 score has been low for about 10 hours and is not recovering. The CSV log says it got I/O timeouts.

I was a bit curious and looked at the graph for pool servers in the FR zone, and it seems half of France dropped out of the pool (rightmost edge of the graph):
[graph: pool servers in the FR zone]

This is my IPv4 server:


https://www.ntppool.org/scores/188.165.236.162

That’s the IPv6 address, but the same server:

From a traceroute I can see that the monitoring station should be able to reach my server, and my server can reach the 207.171.7.0/24 network.

Maybe OVH established some sort of new Anti-DDoS system which is now eating most of the NTP packets (at least from the monitor address) or there is some other issue.

Does anyone else have the same problem?

Greetings Max


#2

Hi Max,
*** exactly *** the same issue for me. My server (188.165.194.26), hosted by OVH, was dropped from the pool this afternoon with a very low score (currently -34) and a monitoring curve very similar to yours, with exactly the same start time.
No explanation so far. I was also surprised by the graph of the FR server survey.
I reviewed my ntp & firewall configurations and restarted the ntp service, but the issue is still there.
Cheers
Ludo


(added/edited - 11/03-22:26)
From the CSV data, the incident starts at 01:58:

1541210285,"2018-11-03 01:58:05",0,-5,13.9,1,"Los Angeles","i/o timeout"
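As a side note, the incident start can be pulled straight out of the CSV export with standard tools. A minimal sketch; the file name and the exact field layout are assumptions based on the line quoted above:

```shell
# Recreate a one-line sample of the CSV export (layout assumed:
# epoch, date, offset, step, score, monitor id, monitor name, error).
cat > /tmp/scores.csv <<'EOF'
1541210285,"2018-11-03 01:58:05",0,-5,13.9,1,"Los Angeles","i/o timeout"
EOF

# Print the timestamp of the first "i/o timeout" entry.
grep -m1 'i/o timeout' /tmp/scores.csv | cut -d, -f2
# → "2018-11-03 01:58:05"
```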

This might be correlated with an OVH issue on their US backbone (the monitoring station is located in LA): http://travaux.ovh.net/?do=details&id=35051 (still in progress):

Details:
Our service provider has an outage on backbone network. They are working to find and fix the issue asap. Affected links have been isolated.


#3

I suspect some network issues (OVH?) …
Using tcpdump on my server I observed a lot of malformed UDP packets (tcpdump -vvv udp and port 123 | grep 'cksum').


#4

“bad cksum” messages are normal if your NIC is calculating checksums instead of the CPU.
In fact, you must also have seen your outgoing packets flagged with that message :wink:


#5

Yes, you’re absolutely right, disabling NIC offloading fixes the ‘problem’ (my bad).
Thanks for pointing this out :wink:
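For anyone else hitting this: checksum offloading can be inspected and toggled with ethtool, so tcpdump sees the checksums the wire actually carries. A sketch (the interface name eth0 is an assumption, substitute your own; requires root):

```shell
# Show the current checksum offload settings for the interface.
ethtool -k eth0 | grep checksum

# Temporarily disable TX/RX checksum offloading so tcpdump shows real checksums.
ethtool -K eth0 tx off rx off

# Re-enable afterwards; offloading is usually what you want in production.
ethtool -K eth0 tx on rx on
```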


#6

I watched for requests originating from the monitoring system and captured this:

15:43:55.777470 IP (tos 0x0, ttl 53, id 57643, offset 0, flags [DF], proto UDP (17), length 76)
    207.171.7.201.40057 > 188.165.236.162.123: [udp sum ok] NTPv4, length 48
        Client, Leap indicator: clock unsynchronized (192), Stratum 0 (unspecified), poll 0 (1s), precision 0
        Root Delay: 0.000000, Root dispersion: 0.000000, Reference-ID: (unspec)
          Reference Timestamp:  0.000000000
          Originator Timestamp: 0.000000000
          Receive Timestamp:    0.000000000
          Transmit Timestamp:   3666005283.744762496 (2016/03/03 15:48:03)
            Originator - Receive Timestamp:  0.000000000
            Originator - Transmit Timestamp: 3666005283.744762496 (2016/03/03 15:48:03)
15:43:55.777580 IP (tos 0xb8, ttl 64, id 42245, offset 0, flags [DF], proto UDP (17), length 76)
    188.165.236.162.123 > 207.171.7.201.40057: [bad udp cksum 0x8106 -> 0xda4d!] NTPv4, length 48
        Server, Leap indicator:  (0), Stratum 2 (secondary reference), poll 3 (8s), precision -23
        Root Delay: 0.011688, Root dispersion: 0.019668, Reference-ID: 131.188.3.223
          Reference Timestamp:  3750330480.845544731 (2018/11/04 15:28:00)
          Originator Timestamp: 3666005283.744762496 (2016/03/03 15:48:03)
          Receive Timestamp:    3750331435.777470096 (2018/11/04 15:43:55)
          Transmit Timestamp:   3750331435.777568105 (2018/11/04 15:43:55)
            Originator - Receive Timestamp:  +84326152.032707600
            Originator - Transmit Timestamp: +84326152.032805608

This looks fine, but there’s an oddity:
Traceroute on UDP (Source Port 123, Dst Port 40057, as in the original request):

~# traceroute -U -p 40057 -q 1 -A --sport=123 207.171.7.201
traceroute to 207.171.7.201 (207.171.7.201), 30 hops max, 60 byte packets
 1  vss-3b-6k.fr.eu (188.165.236.252) [AS16276]  1.622 ms
 2  10.95.69.66 (10.95.69.66) [*]  0.225 ms
 3  10.95.66.62 (10.95.66.62) [*]  0.162 ms
 4  10.95.64.2 (10.95.64.2) [*]  1.238 ms
 5  be100-1044.gsw-1-a9.fr.eu (94.23.122.215) [AS16276]  4.151 ms
 6  be100-1345.ash-1-a9.va.us (94.23.122.244) [AS16276]  78.626 ms
 7  be100-1366.lax-la1-bb1-a9.ca.us (178.32.135.157) [AS16276]  135.238 ms
 8  phyber.as7012.any2ix.coresite.com (206.72.210.50) [AS14365/AS4039/AS1784]  134.607 ms
 9  te7-4.r02.lax2.phyber.com (207.171.30.62) [AS7012]  134.153 ms
10  *
11  *
12  *
13  *
14  *

Traceroute on ICMP:

~# traceroute -I -q 1 -A 207.171.7.201
traceroute to 207.171.7.201 (207.171.7.201), 30 hops max, 60 byte packets
 1  vss-3-6k.fr.eu (188.165.236.253) [AS16276]  0.578 ms
 2  10.95.69.0 (10.95.69.0) [*]  0.262 ms
 3  10.95.66.56 (10.95.66.56) [*]  0.268 ms
 4  10.95.64.2 (10.95.64.2) [*]  1.395 ms
 5  be100-1042.ldn-5-a9.uk.eu (213.251.130.103) [AS16276]  4.532 ms
 6  be100-1298.nwk-5-a9.nj.us (192.99.146.133) [AS16276]  71.251 ms
 7  be100-1007.ash-5-a9.va.us (198.27.73.219) [AS16276]  78.826 ms
 8  be100-1367.lax-la1-bb1-a9.ca.us (178.32.135.160) [AS16276]  132.141 ms
 9  phyber.as7012.any2ix.coresite.com (206.72.210.50) [AS14365/AS4039/AS1784]  132.489 ms
10  te7-4.r02.lax2.phyber.com (207.171.30.62) [AS7012]  132.779 ms
11  perl.gi1-9.r01.lax2.phyber.com (207.171.30.14) [AS7012]  134.222 ms
12  b1.develooper.com (207.171.7.201) [AS7012]  132.625 ms

Here you can see that it takes a different path through OVH’s network, but both traces reach the “Phyber” network. There, however, two hops are missing in the UDP traceroute.

Could there be a problem on the monitoring network, too?
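One way to test the UDP path end-to-end, rather than inferring it from traceroute, is to hand-craft a minimal NTP client query and send it to port 123 from another vantage point. This is a sketch assuming bash, the OpenBSD variant of netcat, and xxd are available; the target is the affected server from above:

```shell
# Build a 48-byte NTPv4 client packet: first byte 0x23 = LI 0, version 4,
# mode 3 (client); the remaining 47 bytes can stay zero for a bare query.
{ printf '\x23'; head -c 47 /dev/zero; } > /tmp/ntp-query.bin

# Fire it at the server over UDP and dump whatever comes back: a 48-byte
# reply means the path works, while a 2 s timeout reproduces the i/o timeout.
nc -u -w 2 188.165.236.162 123 < /tmp/ntp-query.bin | xxd | head -4
```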


#7

From the monitoring curves, a slow recovery looks to be in progress since 13:30-14:00 (French time).

From the OVH task follow-up web page: http://travaux.ovh.net/?do=details&id=35051

Maybe that’s the issue … the date and time are more or less correlated …
Fingers crossed.

Ludo


#8

Hi Max,

Yes, I’ve had exactly the same issues for maybe a week: IPv6 all fine, IPv4 (same machine, same ISP/AS, same pieces of cable) all screwed up. https://www.pool.ntp.org/user/AlisonW puts the two side by side. When I see that sort of thing elsewhere I tend to wonder about transatlantic connectivity and/or interception bodies (eg. GCHQ, NSA, etc.) trying to monitor traffic and doing it very badly.

Regards
AlisonW


#9

Thank you for debugging this!

It does seem like there have been more “is monitoring broken?” questions here and in my email over the last 3-6 months, but I’ve never been able to identify any systemic problems.

I’d really like to have an “automatic traceroute on failures” feature in the monitoring system, but … only so much time and I’m just one person. :-/ (In other words, patches are welcome if they come with appropriate time for following up on them).


#10

My NTP server in Taiwan has the same problem.

When I traceroute to 207.171.7.201 using ICMP it’s fine, but using UDP the trace always stops at AS7012.

Traceroute on UDP

~# traceroute -P UDP 207.171.7.201
traceroute to 207.171.7.201 (207.171.7.201), 64 hops max, 40 byte packets
 1  61-216-153-254.HINET-IP.hinet.net (61.216.153.254)  0.942 ms  1.024 ms  1.279 ms
 2  tpe4-3302.hinet.net (168.95.229.94)  1.299 ms  1.289 ms  1.301 ms
 3  tpdt-3022.hinet.net (220.128.4.134)  1.661 ms  1.329 ms  1.671 ms
 4  r4101-s2.tp.hinet.net (220.128.14.97)  1.013 ms
    r4101-s2.tp.hinet.net (220.128.7.117)  1.219 ms
    r4101-s2.tp.hinet.net (220.128.14.97)  1.982 ms
 5  r4001-s2.tp.hinet.net (220.128.12.5)  1.368 ms  0.796 ms  1.033 ms
 6  r11-pa.us.hinet.net (202.39.91.37)  131.065 ms  130.331 ms  130.601 ms
 7  144.223.179.25 (144.223.179.25)  137.236 ms  136.838 ms  137.387 ms
 8  144.232.15.56 (144.232.15.56)  140.364 ms
    144.232.15.60 (144.232.15.60)  142.769 ms  143.778 ms
 9  144.232.9.176 (144.232.9.176)  137.588 ms  142.753 ms  141.073 ms
10  144.232.25.98 (144.232.25.98)  139.089 ms  139.696 ms  138.197 ms
11  ae-1.r02.snjsca04.us.bb.gin.ntt.net (129.250.3.59)  138.318 ms
    ae-1.r01.snjsca04.us.bb.gin.ntt.net (129.250.2.229)  137.937 ms  138.573 ms
12  ae-1.r22.snjsca04.us.bb.gin.ntt.net (129.250.3.26)  131.723 ms
    ae-11.r22.snjsca04.us.bb.gin.ntt.net (129.250.3.120)  139.694 ms  139.235 ms
13  ae-6.r23.lsanca07.us.bb.gin.ntt.net (129.250.4.151)  137.540 ms  135.627 ms  137.813 ms
14  ae-2.r01.lsanca07.us.bb.gin.ntt.net (129.250.4.107)  135.391 ms
    ae-2.r00.lsanca07.us.bb.gin.ntt.net (129.250.3.238)  136.926 ms  135.899 ms
15  ae-1.a02.lsanca07.us.bb.gin.ntt.net (129.250.3.234)  138.991 ms
    ae-0.a02.lsanca07.us.bb.gin.ntt.net (129.250.2.186)  137.942 ms  135.770 ms
16  te0-0-0-0.r04.lax02.as7012.net (198.172.90.74)  136.190 ms  136.806 ms  137.806 ms
17  te7-4.r02.lax2.phyber.com (207.171.30.62)  137.844 ms  136.657 ms  135.886 ms
18  * * *
19  * * *

Traceroute on ICMP

~# traceroute -I -q 1 207.171.7.201
traceroute to 207.171.7.201 (207.171.7.201), 64 hops max, 48 byte packets
 1  61-216-153-254.HINET-IP.hinet.net (61.216.153.254)  0.851 ms
 2  tpe4-3302.hinet.net (168.95.229.94)  1.702 ms
 3  tpdt-3022.hinet.net (220.128.4.134)  1.610 ms
 4  r4101-s2.tp.hinet.net (220.128.7.117)  1.324 ms
 5  r4001-s2.tp.hinet.net (220.128.12.5)  0.927 ms
 6  r11-pa.us.hinet.net (202.39.91.37)  130.908 ms
 7  144.223.166.161 (144.223.166.161)  136.616 ms
 8  144.232.15.56 (144.232.15.56)  143.638 ms
 9  144.232.9.176 (144.232.9.176)  142.703 ms
10  144.232.25.98 (144.232.25.98)  138.487 ms
11  ae-1.r02.snjsca04.us.bb.gin.ntt.net (129.250.3.59)  138.355 ms
12  ae-11.r22.snjsca04.us.bb.gin.ntt.net (129.250.3.120)  138.244 ms
13  ae-6.r23.lsanca07.us.bb.gin.ntt.net (129.250.4.151)  138.119 ms
14  ae-2.r01.lsanca07.us.bb.gin.ntt.net (129.250.4.107)  138.385 ms
15  ae-0.a02.lsanca07.us.bb.gin.ntt.net (129.250.2.186)  141.293 ms
16  te0-0-0-0.r04.lax02.as7012.net (198.172.90.74)  138.948 ms
17  te7-4.r02.lax2.phyber.com (207.171.30.62)  137.751 ms
18  perl.gi1-9.r01.lax2.phyber.com (207.171.30.14)  137.542 ms
19  b1.develooper.com (207.171.7.201)  139.361 ms

#11

From my (limited) experience debugging networks, it looks like there’s a routing problem for the monitoring station.

My own testing corroborates what @randy is seeing, with the extra data point that IPv6 works just fine from my systems regardless of whether it’s UDP or ICMP being used for traceroute.


#12

My ISP has helped solve the routing problems. It looks like everything is working right now.