Time server pool problems since mid-February

Hi all,

On or around February the 17th this year, my previously consistent pool score went haywire. It was reported by someone else here in this mailing list post:

https://lists.ntp.org/pipermail/pool/2017-February/008063.html

My stratum 1 time server is a Raspberry Pi 3 with a u-blox MAX-M8Q HAT, serving time with GPSD and Chrony. I have tried NTPsec and classic NTP, as well as different Pi boards and an Odroid-C2, with no change in results. My pool graph is here:

http://www.pool.ntp.org/scores/86.18.156.14
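
For context, the chrony side of that setup is the usual gpsd SHM-plus-PPS arrangement, roughly like this (a sketch only: the refids, offsets and PPS device name are illustrative rather than my exact configuration):

    # /etc/chrony/chrony.conf (relevant lines only)
    # coarse NMEA time handed over by gpsd via shared memory
    refclock SHM 0 refid NMEA offset 0.2 delay 0.2 noselect
    # PPS pulses from the MAX-M8Q, locked to the NMEA source
    refclock PPS /dev/pps0 lock NMEA refid PPS
    # answer NTP clients (restrict as appropriate)
    allow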

My ISP here in the UK is Virgin Media and my router is a Superhub 3 (a rebadged Arris with Intel’s stinky Puma chipset). In the same mailing list thread as above, someone notes that the problem may be with NTT, Virgin Media’s upstream provider. I’ve tried everything I can think of and can only assume that the problem is out of my hands.

Can anyone offer insight, suggestions, help, or anything to solve a problem that has essentially kept me out of the pool for five months?

Not short of changing ISP, no.
Is there a realistic alternative for you?
If you are in a cabled area I imagine Openreach also cover it with at least FTTC, which should allow you to use almost any ISP in the country.
But I don’t understand why those offsets result in such a low score - mine have been worse on a few occasions with the score still flat-lining along happily at 20.

It’s not the offset, it’s packet loss.

http://www.pool.ntp.org/scores/86.18.156.14/log?limit=50
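
The same data is available as plain text, which makes the missed responses easier to spot than on the graph. Something like this works (I'm assuming the log endpoint still returns CSV with an error column, and that lost probes are recorded as "i/o timeout", as they were last time I checked):

    # last 50 monitoring samples for the server, as CSV
    curl -s "http://www.pool.ntp.org/scores/86.18.156.14/log?limit=50" | head -n 20

    # count how many of those samples recorded an error
    curl -s "http://www.pool.ntp.org/scores/86.18.156.14/log?limit=50" | grep -c -i "timeout"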

It’s definitely packet loss, but outside my control. An ifconfig -a on my time server shows the following:


Some of it, I believe, is being caused by the latency spikes from Intel’s Puma 6 chipset in the Virgin Media Superhub router, but the majority is something to do with Virgin Media or its upstream provider. I have enjoyed being in the pool and it irritates me that I can no longer do that.
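
For anyone who wants to put a number on the loss, this is roughly how I’ve been measuring it from the server itself (using the LA monitor’s address, 207.171.3.17, and assuming it answers ICMP; any transatlantic target shows a similar pattern):

    # per-hop loss over 100 probes, numeric output only
    mtr -n -c 100 --report 207.171.3.17

    # cruder end-to-end loss percentage
    ping -q -c 100 207.171.3.17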

As another Virgin Media customer who has suffered the same issue since February 17th, I can empathise with your frustration.

As previously stated, the issue is packet loss transiting the gtt.net (GTT Communications) upstream provider that Virgin Media has used since February 17th, 2017. My traceroute from Dublin (Ireland) looks like this:

There was quite a bit of discussion regarding potential improvements to the monitoring system, but the issue remains for customers of Virgin Media for the foreseeable future. :frowning:

Hmm. I’ve just done a traceroute from my time server and, though I have a similarly crappy result, I don’t seem to pass through a gtt.net server along the way:


My delays seem to increase incrementally once they reach the first U.S. cogentco.com server until they arrive at their destination sweating and apologising for being late.

Health Warning: traceroute data doesn’t equal monitoring data…

Having said that, my experience was first with ntt.net (as I recall), which appears to have morphed into gtt.net. I am wondering if you’re on a transit provider on the same infrastructure?

FYI, my traceroutes originally went to Amsterdam (aorta.net, Virgin’s internal backbone), then transited across the Atlantic via ntt/gtt.net to New York City. Now the routing goes directly through a Dublin-Los Angeles link. Needless to say my transit times are actually worse.

For DR purposes I am considering a second WAN provider, and I may move my NTP traffic to the alternate provider. I still see ~1 Mbit/s of NTP requests on Virgin, down from 5-6 Mbit/s before the monitoring issues.

Hope that helps.

R

Just a reminder that you can traceroute the other way with

curl -s https://trace.ntppool.org/traceroute/8.8.8.8

More at Traceroute and looking glass server
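
In this case that would mean asking the monitoring side to trace back towards the server being discussed, e.g. (substituting the public IP from the score page above):

    curl -s https://trace.ntppool.org/traceroute/86.18.156.14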

I had this issue and it only got resolved when I changed the upstream route. I have a post here

What you could do is set up a tunnel to some other router so that traffic going to 207.171.3.17 gets re-routed via some other ISP.

If that works, then at least you’ll know that the other ISP has no problems sending NTP replies.

As a test I offer my route if you want to try it out (now that it’s working)

@jfrater You say in your other post that “I did some traffic engineering and now send the reply using a different upstream provider”. Is this something a relatively competent person could emulate, or is the fact that I even have to ask the question proof enough that I shouldn’t try?

Well, it’s just that we have multiple upstream fiber connections, so the “engineering” part was setting up a new rule on the routers so that the packets going to the monitor used a specific link. In my case we use the full Internet routing table via BGP, and we change any prefix inside the 207.171.0.0/16 net and lower the BGP local preference so that it chooses other routes if available.
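
For anyone with a similarly multi-homed setup, the gist is an inbound route-map on the session towards the problematic upstream. A rough sketch in FRR/Quagga style, driven from the shell via vtysh (the ASN, neighbour address and policy names here are made up for illustration, not our real config):

    # deprioritise routes covering the monitor's /16 as learned from the
    # "bad" upstream, so BGP prefers the same prefixes via the other link
    vtysh \
      -c 'configure terminal' \
      -c 'ip prefix-list POOL-MONITOR seq 5 permit 207.171.0.0/16 le 32' \
      -c 'route-map BAD-UPSTREAM-IN permit 10' \
      -c ' match ip address prefix-list POOL-MONITOR' \
      -c ' set local-preference 50' \
      -c 'route-map BAD-UPSTREAM-IN permit 20' \
      -c 'router bgp 64500' \
      -c ' address-family ipv4 unicast' \
      -c '  neighbor 192.0.2.1 route-map BAD-UPSTREAM-IN in' \
      -c 'end'

Since the default local preference is 100, setting 50 on that session makes the routes learned over the other upstream win whenever they exist.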

But if you have any other Internet provider, you can try it out just by explicitly telling the router that you want packets from your server to the monitor to go via that other provider, even if that other provider is only reachable via a tunnel that uses your current ISP. That’s because packets inside the tunnel should not be affected by any traffic shaping done by the ISP.

It’s similar to getting a VPN to hide your IP or access a corporate network.
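
And to make the “explicitly telling the router” part concrete: on a Linux box it can be as small as one tunnel interface and one host route. A rough iproute2 sketch (all addresses except the monitor’s are illustrative placeholders; note that the far end, and its ISP, have to be willing to forward replies that still carry your Virgin Media source address, which strict uRPF filtering may block):

    # GRE tunnel to a co-operating router on another ISP
    # (192.0.2.10 = your public IP, 198.51.100.1 = remote endpoint; both illustrative)
    ip tunnel add mon0 mode gre local 192.0.2.10 remote 198.51.100.1 ttl 255
    ip addr add 10.9.9.2/30 dev mon0
    ip link set mon0 up

    # only traffic towards the pool monitor takes the tunnel;
    # everything else keeps using the normal default route
    ip route add 207.171.3.17/32 via 10.9.9.1 dev mon0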

hi ncguk,

I was just interested in whether I could reproduce this packet loss to/from your station. Without running monitoring like the one available at ntppool.org, but simply by configuring your IP in my ntpd, I very rarely (or actually never) see a “377” for “reach” when running ntpq. My server is sitting in Vienna, Austria, so it must be something in your area, as I am on the opposite side from LA.
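
For anyone who wants to run the same kind of informal check from elsewhere, it only takes one line of ntp.conf and a look at the “reach” column (an octal bitmap of the last eight polls, so 377 means none were lost). A sketch, using the server IP from earlier in the thread:

    # /etc/ntp.conf on the test machine; "noselect" means the server is
    # polled and monitored but never used to set the local clock
    server 86.18.156.14 iburst noselect

    # then check the peers; a "reach" below 377 means recent polls went unanswered
    ntpq -pn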

// hans

Hans, thank you for your testing. My problem (and that of any pool member using Virgin Media as an ISP) seems to lie in the transatlantic hop using cogentco.com outgoing from my server in the UK, and aorta.net on the way back. From this I can see the argument for multiple monitoring stations, but also understand that that would be a complex task.

Uninformed question: is it possible for the timeout of the pool monitoring probe to be increased to compensate, or would that compromise the pool in some way?

This question you have to ask Ask :slight_smile:

But maybe these packets are not merely delayed, they are really lost. That could happen due to an overloaded router anywhere.

// hans

As suggested by @ask, I’ve just performed a reverse traceroute. The results aren’t pretty:


I’ve actually performed a few of these tests in the past couple of days and the delays always seem to come as the traffic hits aorta.net. This is obviously out of my hands, so I’m going to give serious consideration to switching ISPs.

To follow up on this - I’ve just received an automated email from Ask telling me my server is going to be removed from the Pool “due to unresolved problems”. I’m disappointed, but I understand why. Perhaps once Virgin Media issues a firmware fix for their Puma 6 infested router, I’ll try again.

Just revisiting this topic some time later - I too have been taken out of the pool due to unresolved problems. Interestingly, the new Luxembourg monitoring server isn’t much better for me (on the beta site).

Did you change ISP, or are you still with Virgin Media? I would have to go off-fibre if I moved to another provider here, with no assurance that the new provider wouldn’t have similar pool monitoring problems to Virgin Media.

I also encountered this starting about 3 months ago. My server is in Colorado. I had a solid 20.0 for probably 12 months, and then one day I started missing 1-3 check-ins a day from the pool server, so my score usually dips to 10 before it gets back to 20.0.

I think I am losing packets, but it is very sporadic. Mine is not a high-traffic server. If I can dig into some of the analysis done above, I will report back with what I find.

Interestingly, my two servers just started having this problem about 1-2 days ago, but ONLY on IPv4. IPv6 seems to have been rock solid.