Problems with OVH?

I suspect some network issues (on OVH’s side?) …
Using tcpdump on my server I observed a lot of malformed UDP packets (tcpdump -vvv udp and port 123 | grep 'cksum').

“bad cksum” messages are normal if your NIC is calculating checksums rather than the CPU.
Actually, you must have seen that message on your outgoing packets, too :wink:

Yes, you’re absolutely right, disabling NIC offloading fixed the ‘problem’ (my bad).
Thanks for pointing this out to me :wink:
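
To make the offload behaviour concrete: with checksum offload enabled, the kernel hands outgoing packets to the NIC with the checksum field still unfilled, and tcpdump captures them at that point, hence the “bad cksum” warnings. What the NIC eventually computes is the standard RFC 1071 ones’-complement checksum; here is a minimal Python sketch, checked against a well-known IPv4 header example:

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 Internet checksum: ones'-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"            # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return ~total & 0xFFFF

# Well-known IPv4 header example, with the checksum field (bytes 10-11) zeroed:
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
print(hex(internet_checksum(header)))  # 0xb861
```

(For UDP the checksum additionally covers a pseudo-header containing the source and destination addresses, which is why tcpdump can print both the wrong on-wire value and the value it expected.)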

I watched for requests originating from the monitoring system and captured this:

15:43:55.777470 IP (tos 0x0, ttl 53, id 57643, offset 0, flags [DF], proto UDP (17), length 76)
    207.171.7.201.40057 > 188.165.236.162.123: [udp sum ok] NTPv4, length 48
        Client, Leap indicator: clock unsynchronized (192), Stratum 0 (unspecified), poll 0 (1s), precision 0
        Root Delay: 0.000000, Root dispersion: 0.000000, Reference-ID: (unspec)
          Reference Timestamp:  0.000000000
          Originator Timestamp: 0.000000000
          Receive Timestamp:    0.000000000
          Transmit Timestamp:   3666005283.744762496 (2016/03/03 15:48:03)
            Originator - Receive Timestamp:  0.000000000
            Originator - Transmit Timestamp: 3666005283.744762496 (2016/03/03 15:48:03)
15:43:55.777580 IP (tos 0xb8, ttl 64, id 42245, offset 0, flags [DF], proto UDP (17), length 76)
    188.165.236.162.123 > 207.171.7.201.40057: [bad udp cksum 0x8106 -> 0xda4d!] NTPv4, length 48
        Server, Leap indicator:  (0), Stratum 2 (secondary reference), poll 3 (8s), precision -23
        Root Delay: 0.011688, Root dispersion: 0.019668, Reference-ID: 131.188.3.223
          Reference Timestamp:  3750330480.845544731 (2018/11/04 15:28:00)
          Originator Timestamp: 3666005283.744762496 (2016/03/03 15:48:03)
          Receive Timestamp:    3750331435.777470096 (2018/11/04 15:43:55)
          Transmit Timestamp:   3750331435.777568105 (2018/11/04 15:43:55)
            Originator - Receive Timestamp:  +84326152.032707600
            Originator - Transmit Timestamp: +84326152.032805608
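
As an aside, the NTP timestamps tcpdump decodes above count seconds since 1900-01-01; subtracting the 2,208,988,800-second offset between the NTP and Unix epochs recovers wall-clock time. A minimal sketch using the Receive Timestamp from the server response above (tcpdump printed local time; the conversion yields UTC):

```python
from datetime import datetime, timezone

NTP_UNIX_OFFSET = 2_208_988_800  # seconds between 1900-01-01 and 1970-01-01

def ntp_to_datetime(ntp_seconds: int) -> datetime:
    """Convert NTP era-0 seconds to an aware UTC datetime."""
    return datetime.fromtimestamp(ntp_seconds - NTP_UNIX_OFFSET, tz=timezone.utc)

# Receive Timestamp from the server response above:
# ntp_to_datetime(3750331435) → 2018-11-04 14:43:55+00:00
# (shown as 15:43:55 local time in the tcpdump output)
```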

This looks fine, but there’s an oddity:
Traceroute on UDP (Source Port 123, Dst Port 40057, as in the original request):

~# traceroute -U -p 40057 -q 1 -A --sport=123 207.171.7.201
traceroute to 207.171.7.201 (207.171.7.201), 30 hops max, 60 byte packets
 1  vss-3b-6k.fr.eu (188.165.236.252) [AS16276]  1.622 ms
 2  10.95.69.66 (10.95.69.66) [*]  0.225 ms
 3  10.95.66.62 (10.95.66.62) [*]  0.162 ms
 4  10.95.64.2 (10.95.64.2) [*]  1.238 ms
 5  be100-1044.gsw-1-a9.fr.eu (94.23.122.215) [AS16276]  4.151 ms
 6  be100-1345.ash-1-a9.va.us (94.23.122.244) [AS16276]  78.626 ms
 7  be100-1366.lax-la1-bb1-a9.ca.us (178.32.135.157) [AS16276]  135.238 ms
 8  phyber.as7012.any2ix.coresite.com (206.72.210.50) [AS14365/AS4039/AS1784]  134.607 ms
 9  te7-4.r02.lax2.phyber.com (207.171.30.62) [AS7012]  134.153 ms
10  *
11  *
12  *
13  *
14  *

Traceroute on ICMP:

~# traceroute -I -q 1 -A 207.171.7.201
traceroute to 207.171.7.201 (207.171.7.201), 30 hops max, 60 byte packets
 1  vss-3-6k.fr.eu (188.165.236.253) [AS16276]  0.578 ms
 2  10.95.69.0 (10.95.69.0) [*]  0.262 ms
 3  10.95.66.56 (10.95.66.56) [*]  0.268 ms
 4  10.95.64.2 (10.95.64.2) [*]  1.395 ms
 5  be100-1042.ldn-5-a9.uk.eu (213.251.130.103) [AS16276]  4.532 ms
 6  be100-1298.nwk-5-a9.nj.us (192.99.146.133) [AS16276]  71.251 ms
 7  be100-1007.ash-5-a9.va.us (198.27.73.219) [AS16276]  78.826 ms
 8  be100-1367.lax-la1-bb1-a9.ca.us (178.32.135.160) [AS16276]  132.141 ms
 9  phyber.as7012.any2ix.coresite.com (206.72.210.50) [AS14365/AS4039/AS1784]  132.489 ms
10  te7-4.r02.lax2.phyber.com (207.171.30.62) [AS7012]  132.779 ms
11  perl.gi1-9.r01.lax2.phyber.com (207.171.30.14) [AS7012]  134.222 ms
12  b1.develooper.com (207.171.7.201) [AS7012]  132.625 ms

Here you can see that it takes a different path to the destination on OVH’s network, but it always reaches the “Phyber” network. However, the last two hops are missing in the UDP traceroute.

Could there be a problem on the monitoring network, too?

From the monitoring curves, a slow recovery looks to be in progress since 13:30-14:00 (French time).

From OVH task follow-up web page: http://travaux.ovh.net/?do=details&id=35051&PHPSESSID=bb296553e38bbb2bb6a8a0a62013843c

Maybe that’s the issue … the date and time are more or less correlated …
Keep fingers crossed.

Ludo

Hi Max,

Yes, I’ve had exactly the same issues for maybe a week: IPv6 all fine, IPv4 (same machine, same ISP/AS, same pieces of cable) all screwed up. https://www.pool.ntp.org/user/AlisonW puts the two side by side. When I see that sort of thing elsewhere I tend to wonder about transatlantic connectivity and/or interception bodies (eg. GCHQ, NSA, etc.) trying to monitor traffic and doing it very badly.

Regards
AlisonW

Thank you for debugging this!

It does seem like there have been more “is monitoring broken?” questions here and in my email over the last 3-6 months, but I’ve never been able to identify any systemic problems.

I’d really like to have an “automatic traceroute on failures” feature in the monitoring system, but … only so much time and I’m just one person. :-/ (In other words, patches are welcome if they come with appropriate time for following up on them).
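
For anyone wanting to experiment with that feature, here is a rough sketch of what such a hook might look like. It rebuilds the traceroute invocations used earlier in this thread (Linux traceroute options); the function names and the failure hook are hypothetical, not part of the actual monitoring system:

```python
import subprocess

def traceroute_cmd(target: str, proto: str = "udp",
                   sport: int = 123, dport: int = 40057) -> list[str]:
    """Build a traceroute argument list matching the probes used in this thread."""
    if proto == "udp":
        return ["traceroute", "-U", "-p", str(dport), "-q", "1", "-A",
                f"--sport={sport}", target]
    return ["traceroute", "-I", "-q", "1", "-A", target]

def on_monitor_failure(target: str) -> None:
    """Hypothetical hook: capture both traces when a check fails."""
    for proto in ("udp", "icmp"):
        result = subprocess.run(traceroute_cmd(target, proto),
                                capture_output=True, text=True, timeout=120)
        # store result.stdout alongside the failed measurement
```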

My NTP server in Taiwan has the same problems.

When I traceroute to 207.171.7.201 using ICMP it’s fine, but using UDP the trace always stops at AS7012.

Traceroute on UDP

~# traceroute -P UDP 207.171.7.201
traceroute to 207.171.7.201 (207.171.7.201), 64 hops max, 40 byte packets
 1  61-216-153-254.HINET-IP.hinet.net (61.216.153.254)  0.942 ms  1.024 ms  1.279 ms
 2  tpe4-3302.hinet.net (168.95.229.94)  1.299 ms  1.289 ms  1.301 ms
 3  tpdt-3022.hinet.net (220.128.4.134)  1.661 ms  1.329 ms  1.671 ms
 4  r4101-s2.tp.hinet.net (220.128.14.97)  1.013 ms
    r4101-s2.tp.hinet.net (220.128.7.117)  1.219 ms
    r4101-s2.tp.hinet.net (220.128.14.97)  1.982 ms
 5  r4001-s2.tp.hinet.net (220.128.12.5)  1.368 ms  0.796 ms  1.033 ms
 6  r11-pa.us.hinet.net (202.39.91.37)  131.065 ms  130.331 ms  130.601 ms
 7  144.223.179.25 (144.223.179.25)  137.236 ms  136.838 ms  137.387 ms
 8  144.232.15.56 (144.232.15.56)  140.364 ms
    144.232.15.60 (144.232.15.60)  142.769 ms  143.778 ms
 9  144.232.9.176 (144.232.9.176)  137.588 ms  142.753 ms  141.073 ms
10  144.232.25.98 (144.232.25.98)  139.089 ms  139.696 ms  138.197 ms
11  ae-1.r02.snjsca04.us.bb.gin.ntt.net (129.250.3.59)  138.318 ms
    ae-1.r01.snjsca04.us.bb.gin.ntt.net (129.250.2.229)  137.937 ms  138.573 ms
12  ae-1.r22.snjsca04.us.bb.gin.ntt.net (129.250.3.26)  131.723 ms
    ae-11.r22.snjsca04.us.bb.gin.ntt.net (129.250.3.120)  139.694 ms  139.235 ms
13  ae-6.r23.lsanca07.us.bb.gin.ntt.net (129.250.4.151)  137.540 ms  135.627 ms  137.813 ms
14  ae-2.r01.lsanca07.us.bb.gin.ntt.net (129.250.4.107)  135.391 ms
    ae-2.r00.lsanca07.us.bb.gin.ntt.net (129.250.3.238)  136.926 ms  135.899 ms
15  ae-1.a02.lsanca07.us.bb.gin.ntt.net (129.250.3.234)  138.991 ms
    ae-0.a02.lsanca07.us.bb.gin.ntt.net (129.250.2.186)  137.942 ms  135.770 ms
16  te0-0-0-0.r04.lax02.as7012.net (198.172.90.74)  136.190 ms  136.806 ms  137.806 ms
17  te7-4.r02.lax2.phyber.com (207.171.30.62)  137.844 ms  136.657 ms  135.886 ms
18  * * *
19  * * *

Traceroute on ICMP

~# traceroute -I -q 1 207.171.7.201
traceroute to 207.171.7.201 (207.171.7.201), 64 hops max, 48 byte packets
 1  61-216-153-254.HINET-IP.hinet.net (61.216.153.254)  0.851 ms
 2  tpe4-3302.hinet.net (168.95.229.94)  1.702 ms
 3  tpdt-3022.hinet.net (220.128.4.134)  1.610 ms
 4  r4101-s2.tp.hinet.net (220.128.7.117)  1.324 ms
 5  r4001-s2.tp.hinet.net (220.128.12.5)  0.927 ms
 6  r11-pa.us.hinet.net (202.39.91.37)  130.908 ms
 7  144.223.166.161 (144.223.166.161)  136.616 ms
 8  144.232.15.56 (144.232.15.56)  143.638 ms
 9  144.232.9.176 (144.232.9.176)  142.703 ms
10  144.232.25.98 (144.232.25.98)  138.487 ms
11  ae-1.r02.snjsca04.us.bb.gin.ntt.net (129.250.3.59)  138.355 ms
12  ae-11.r22.snjsca04.us.bb.gin.ntt.net (129.250.3.120)  138.244 ms
13  ae-6.r23.lsanca07.us.bb.gin.ntt.net (129.250.4.151)  138.119 ms
14  ae-2.r01.lsanca07.us.bb.gin.ntt.net (129.250.4.107)  138.385 ms
15  ae-0.a02.lsanca07.us.bb.gin.ntt.net (129.250.2.186)  141.293 ms
16  te0-0-0-0.r04.lax02.as7012.net (198.172.90.74)  138.948 ms
17  te7-4.r02.lax2.phyber.com (207.171.30.62)  137.751 ms
18  perl.gi1-9.r01.lax2.phyber.com (207.171.30.14)  137.542 ms
19  b1.develooper.com (207.171.7.201)  139.361 ms

From my (limited) experience debugging networks, it looks like there’s a routing problem for the monitoring station.

My own testing corroborates what @randy is seeing, with the extra data point that IPv6 works just fine from my systems regardless of whether it’s UDP or ICMP being used for traceroute.

My ISP has helped solve the routing problems. It looks like everything is working right now.

My servers hosted at a small VPS provider have had bad scores for a long time now and they don’t seem to be able to fix it (it doesn’t actually look like a problem with the monitoring station), so I’m looking for a new VPS provider.

Are people still having problems with servers hosted at OVH?

I have tried to analyze the scores of pool servers hosted by popular cloud providers, and OVH actually seems to work best currently. I identified the servers by the autonomous systems corresponding to their IPv4 addresses; I’m not sure how accurate or specific that really is. I downloaded their last 200 scores from the pool website and checked whether more than half of the scores were below 10 (bad), and otherwise how good the average score was (ok < 15, good < 20, excellent == 20). I didn’t try to separate them by country zones.

Autonomous system              Servers	Bad 	Ok  	Good	Excellent
========================================================================
Amazon.com, Inc.                 49 	 10%	 18%	 33%	 39%	
Choopa, LLC (Vultr)              28 	 18%	 11%	 50%	 21%	
DigitalOcean, LLC                85 	  7%	  9%	 47%	 36%	
Hetzner Online GmbH             186 	  7%	  1%	 27%	 65%	
Linode, LLC                     103 	  6%	 14%	 26%	 54%	
OVH SAS                         219 	 10%	  0%	 14%	 77%	
Online S.a.s. (Scaleway)         58 	 98%	  2%	  0%	  0%
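
For reference, the bucketing described above can be sketched roughly like this (my reconstruction of the stated rules, not the original script):

```python
def classify(scores: list[float]) -> str:
    """Bucket a server by its recent pool scores, per the rules described above."""
    if sum(s < 10 for s in scores) > len(scores) / 2:
        return "bad"
    avg = sum(scores) / len(scores)
    if avg < 15:
        return "ok"
    if avg < 20:
        return "good"
    return "excellent"   # avg == 20, i.e. a constant perfect score
```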

Nice, that’s really fun. We should add the ASN to the monitoring data for this sort of analytics.


There were issues with OVH and Scaleway; they seem resolved since this weekend’s update/upgrade on the monitoring side. It’s quite obvious there was a network issue with the LA station, especially with Scaleway.

Here is the current situation with the new monitoring server.

Autonomous system              Servers	Bad 	Ok  	Good	Excellent
========================================================================
Amazon.com, Inc.                 49 	  6%	  6%	 61%	 27%	
Choopa, LLC                      28 	  7%	  7%	 21%	 64%	
DigitalOcean, LLC                85 	  4%	 24%	 21%	 52%	
Hetzner Online GmbH             186 	  6%	  0%	 15%	 79%	
Linode, LLC                     103 	  6%	  1%	 24%	 69%	
OVH SAS                         219 	  8%	  1%	 19%	 72%	
Online S.a.s.                    58 	  7%	  2%	 22%	 69%

Here is an update after a year.

    Autonomous system              Servers	Bad 	Ok  	Good	Excellent
    ========================================================================
    Amazon.com, Inc.                 37 	  5%	  5%	 35%	 54%	
    Choopa, LLC                      37 	  3%	  0%	 54%	 43%	
    DigitalOcean, LLC                69 	  3%	  1%	 51%	 45%	
    Hetzner Online GmbH             218 	  2%	  1%	 41%	 56%	
    Linode, LLC                      96 	  6%	  1%	 53%	 40%	
    OVH SAS                         238 	 12%	  2%	 15%	 71%	
    Online S.a.s.                    56 	  5%	  0%	 86%	  9%	

OVH still seems to have the most servers with a perfect score, but it also has the most servers with a score below 10.

My problem with OVH is its “anti-DDoS” protection, which is now frequently triggered by the broken Fortigate clients. I’m not sure what exactly it is supposed to protect when it practically disconnects the server from the Internet. Was anyone able to get support to disable this “feature”? I have asked at least 5 times now; they say they increased the threshold, but it’s still triggering.


A bit late, but…
I reached out to OVH and asked them to disable this DDoS protection for me (or, at least, to readjust it to high packet rates). Since then, it sometimes sends me e-mails about a detected attack, but it won’t filter my NTP traffic.


In my case they are still blocking the NTP traffic. Whenever the “protection” is activated, I see in my graphs a sharp drop to about 1% of the normal request rate.

As the Fortigate situation seems to be improving only very slowly, I’m considering switching to another VPS provider.

After another year, here are current statistics for the popular cloud providers:

    Autonomous system              Servers	Bad 	Ok  	Good	Excellent
    ========================================================================
    AMAZON                           33 	 18%	  9%	  9%	 64%	
    AS-CHOOPA                        47 	  6%	  0%	  9%	 85%	
    CONTABO                          30 	  3%	  7%	 10%	 80%	
    DIGITALOCEAN                     84 	  7%	  1%	 17%	 75%	
    HETZNER                         234 	  3%	  1%	 19%	 77%	
    LINODE                           91 	  4%	  0%	  0%	 96%	
    NETCUP                           51 	 14%	 18%	 69%	  0%	
    ORACLE                           30 	 43%	  3%	 10%	 43%	
    OVH                             190 	  7%	  8%	 15%	 71%	
    Online SAS                       45 	 13%	  0%	  0%	 87%	

I have NTP servers on Digital Ocean and Microsoft Azure.
Over time the monitoring station has rarely reported packet loss issues on the Azure server.
The server maintained at Digital Ocean has also demonstrated good results over several months, but they still have many network problems, at a frequency I consider high relative to industry availability standards. I’ve also had a problem with my Digital Ocean server in the past due to heavily overloaded hosts, causing high CPU steal rates and consequently loss of NTP requests. I had to re-provision my VPS several times until I found a minimally decent host to maintain an NTP server at a low cost.
I’ve tried adding a Google Cloud NTP server to the Pool, but the results were disastrous (for some reason the server remained stable for a while, then stopped working properly and was removed from the Pool; probably some security mechanism in the GCP network).

The Microsoft and Google clouds don’t seem to be popular choices for pool servers. In my data I see only 14 and 3 servers in their ASNs respectively:

    Autonomous system              Servers	Bad 	Ok  	Good	Excellent
    ========================================================================
    GOOGLE                            3 	 33%	  0%	  0%	 67%	
    MICROSOFT                        14 	 64%	  0%	  0%	 36%	

All those “bad” servers have a constant score of -100. I guess that means they were disabled recently, before the pool monitoring schedules them for deletion. Maybe I should count such servers separately.
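
Counting them separately would only need a small check like this sketch (assuming the same list-of-recent-scores input as before):

```python
def is_disabled(scores: list[float]) -> bool:
    """A server whose every recent score is -100 was likely disabled/deleted,
    not merely performing badly."""
    return bool(scores) and all(s == -100 for s in scores)
```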