Monitoring stations timeout to our NTP servers

Thanks @ask, I’m currently working with @stevesommars on it; we’ll keep you posted.

@ask would it be possible to get hostnames or IP addresses for the Amsterdam/L.A. monitoring stations? I’d like to compare routing issues across all 3 monitoring stations. I’m currently suffering from packets being dropped from the NJ monitor in the production pool (I believe this is Zayo, but our networks team will be able to dive deeper into this).

Any help would be great, as this is negatively affecting our polling and our NTP service is one of very few in IE.

I see you deleted your servers.

UDP packets can be dropped anywhere, and it happens often.
Cisco tells people that 5% packet loss on UDP is acceptable.

Yes, the pool score is based on one monitor and one monitor alone, and it only takes a few misses to kick you out of the pool.

As has often been said here, there should be more monitors and the score should at least be averaged over them; in fact, one monitor reporting you as GOOD should be enough.

The beta system does this, and my servers are marked as good by LA and dismissed by Newark (as usual).

Yet the system still isn’t fixed. It’s been 6 months now… the beta system is 10,000,000x better than this.

No @Ask, there is no NTP filtering; it’s simply not true. ISPs use NTP themselves to keep time!

@HEAnet LA is 45.54.12.11, the Amsterdam one is monams1.ntppool.net.


If someone says that, I’d like to see them configure their computer to drop 5% of UDP traffic and then open a web page that requires 300 DNS queries, or watch YouTube over QUIC.
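A back-of-envelope check makes the point concrete. Assuming an independent 5% drop chance per UDP packet, and one request plus one response per DNS query (so 600 packets for a 300-query page, which is an assumption about how the queries are counted):

```python
# Sketch: how often would a 300-query page load survive 5% UDP loss
# with no retransmissions at all?
loss = 0.05
queries = 300
packets = queries * 2              # request + response per query
p_clean = (1 - loss) ** packets    # probability no packet is dropped
expected_lost = packets * loss     # packets dropped on average per load
print(f"P(page load with no retries) = {p_clean:.1e}")
print(f"Expected dropped packets per load = {expected_lost:.0f}")
```

In other words, essentially every page load would need retries, with around 30 dropped packets each time.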

That statement makes several assumptions. Network-wide blocking isn’t common, but rate limiting is something else.


Not sure what’s going on

Looking at this section of my report

1583264160,"2020-03-03 19:36:00",0,-5,1.1,6,"Newark, NJ, US","i/o timeout"
1583264160,"2020-03-03 19:36:00",0,-5,1.1,"i/o timeout"
1583263174,"2020-03-03 19:19:34",0.000532588,1,6.4,6,"Newark, NJ, US",0,
1583263174,"2020-03-03 19:19:34",0.000532588,1,6.4,0,
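For anyone scripting against these logs, a minimal Python sketch, assuming the column layout inferred from the rows above (epoch, timestamp, offset, step, score, then optionally monitor id and monitor name, and finally the error field):

```python
import csv
import io

def parse_score_log(text):
    """Return (timestamp, score, error) tuples from score-log CSV text.

    The shorter rows without monitor columns still parse, because the
    error field is taken from the end of the row.
    """
    out = []
    for row in csv.reader(io.StringIO(text)):
        if row:
            out.append((row[1], float(row[4]), row[-1]))
    return out

sample = (
    '1583264160,"2020-03-03 19:36:00",0,-5,1.1,6,"Newark, NJ, US","i/o timeout"\n'
    '1583263174,"2020-03-03 19:19:34",0.000532588,1,6.4,6,"Newark, NJ, US",0\n'
)
for when, score, error in parse_score_log(sample):
    print(when, score, error)
```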

I can see 4 contacts: 2 at 19:19 and 2 at 19:36 (which apparently timed out).

I received and replied to 3 requests at 19:19 (not 2; not sure what’s going on with that?)

19:19:24.843963 IP monewr1.ntppool.net.46571 > raspberrypi.local.ntp: NTPv4, Client, length 48
19:19:24.844086 IP raspberrypi.local.ntp > monewr1.ntppool.net.46571: NTPv4, Server, length 48
19:19:29.842189 IP monewr1.ntppool.net.43195 > raspberrypi.local.ntp: NTPv4, Client, length 48
19:19:29.842356 IP raspberrypi.local.ntp > monewr1.ntppool.net.43195: NTPv4, Server, length 48
19:19:31.922697 IP monewr1.ntppool.net.41698 > raspberrypi.local.ntp: NTPv4, Client, length 48
19:19:31.922779 IP raspberrypi.local.ntp > monewr1.ntppool.net.41698: NTPv4, Server, length 48

Looking at the tcpdump, it seems the monitor actually sent 3 requests and I sent 3 responses at 19:35, but they did not make it back to the monitor. And why does it only show 2 in the logs on the monitor?

19:35:47.464786 IP monewr1.ntppool.net.48228 > raspberrypi.local.ntp: NTPv4, Client, length 48
19:35:47.464863 IP raspberrypi.local.ntp > monewr1.ntppool.net.48228: NTPv4, Server, length 48
19:35:52.465304 IP monewr1.ntppool.net.50627 > raspberrypi.local.ntp: NTPv4, Client, length 48
19:35:52.465422 IP raspberrypi.local.ntp > monewr1.ntppool.net.50627: NTPv4, Server, length 48
19:35:57.465701 IP monewr1.ntppool.net.40515 > raspberrypi.local.ntp: NTPv4, Client, length 48
19:35:57.465848 IP raspberrypi.local.ntp > monewr1.ntppool.net.40515: NTPv4, Server, length 48
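For reference, a capture like the above can be reproduced with something along these lines (the interface name `eth0` is an assumption to adjust per system; the address is monewr1’s, as shown in the traceroute further down):

```shell
# Capture NTP traffic exchanged with the Newark monitor; -n skips DNS
# lookups so addresses match what appears in the monitor's logs.
sudo tcpdump -n -i eth0 'udp port 123 and host 139.178.64.42'
```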

While this was running I had a ping/traceroute going every 5 seconds, which reported no changes and no dropped packets, other than at some hops, which may just be ICMP deprioritization.

On the beta site it seems the most reliable for me is

  1. LA
  2. Amsterdam
  3. Newark

Which is strange in itself for LA vs. Amsterdam; considering I’m in the UK, I thought Amsterdam would be the most reliable!
https://web.beta.grundclock.com/scores/81.174.133.68

Traceroutes to all 3, in the order Newark, Amsterdam, LA:

Tracing route to monewr1.ntppool.net [139.178.64.42]
over a maximum of 30 hops:

1 <1 ms <1 ms <1 ms router.local [192.168.93.1]
2 10 ms 10 ms 10 ms 195.166.130.248
3 23 ms 11 ms 11 ms 84.93.253.71
4 11 ms 11 ms 10 ms 195.99.125.140
5 10 ms 11 ms 11 ms core2-hu0-1-0-1-1.colindale.ukcore.bt.net [195.99.127.9]
6 11 ms 11 ms 11 ms 109.159.252.134
7 11 ms 10 ms 11 ms 166-49-209-194.eu.bt.net [166.49.209.194]
8 * * * Request timed out.
9 29 ms 12 ms 12 ms ae11.mpr2.lhr2.uk.zip.zayo.com [64.125.30.52]
10 * * * Request timed out.
11 83 ms 77 ms 78 ms ae5.cs3.lga5.us.eth.zayo.com [64.125.29.126]
12 78 ms 77 ms 77 ms ae15.er1.lga5.us.zip.zayo.com [64.125.29.221]
13 80 ms 79 ms 79 ms 64.125.54.26.available.above.net [64.125.54.26]
14 80 ms 80 ms 80 ms 0.et-0-0-1.bsr2.ewr1.packet.net [198.16.6.237]
15 95 ms 101 ms 98 ms 0.ae2.dsr2.ewr1.packet.net [198.16.4.215]
16 95 ms 98 ms 98 ms 147.75.98.107
17 82 ms 82 ms 82 ms monewr1.ntppool.net [139.178.64.42]

Trace complete.

C:\Users\Daniel>tracert monams1.ntppool.net

Tracing route to monams1.ntppool.net [147.75.84.170]
over a maximum of 30 hops:

1 <1 ms <1 ms <1 ms router.local [192.168.93.1]
2 10 ms 10 ms 10 ms 195.166.130.248
3 11 ms 11 ms 11 ms 84.93.253.71
4 11 ms 11 ms 10 ms 195.99.125.140
5 11 ms 11 ms 11 ms peer7-et-4-1-1.telehouse.ukcore.bt.net [194.72.16.134]
6 12 ms 14 ms 11 ms 166-49-128-32.eu.bt.net [166.49.128.32]
7 12 ms 12 ms 11 ms ldn-b1-link.telia.net [213.248.97.48]
8 18 ms 19 ms 18 ms ldn-bb3-link.telia.net [62.115.114.234]
9 18 ms 18 ms 18 ms adm-bb3-link.telia.net [213.155.136.99]
10 18 ms 20 ms 18 ms adm-b1-link.telia.net [62.115.136.195]
11 18 ms 18 ms 28 ms packethost-ic-346116-adm-b4.c.telia.net [62.115.176.233]
12 39 ms 32 ms 118 ms 198.16.6.37
13 17 ms 19 ms 17 ms 147.75.84.170

Trace complete.

C:\Users\Daniel>tracert 45.54.12.11

Tracing route to 11.12.54.45.ptr.anycast.net [45.54.12.11]
over a maximum of 30 hops:

1 <1 ms <1 ms <1 ms router.local [192.168.93.1]
2 10 ms 22 ms 10 ms 195.166.130.248
3 10 ms 10 ms 11 ms 84.93.253.67
4 11 ms 11 ms 10 ms ^C
C:\Users\Daniel>tracert 45.54.12.11

Tracing route to 11.12.54.45.ptr.anycast.net [45.54.12.11]
over a maximum of 30 hops:

1 <1 ms <1 ms <1 ms router.local [192.168.93.1]
2 10 ms 10 ms 10 ms 195.166.130.248
3 15 ms 11 ms 11 ms 84.93.253.67
4 11 ms 11 ms 29 ms 195.99.125.136
5 39 ms 18 ms 11 ms peer7-et-4-1-4.telehouse.ukcore.bt.net [194.72.16.140]
6 25 ms 11 ms 11 ms 166-49-128-32.eu.bt.net [166.49.128.32]
7 11 ms 29 ms 12 ms hu0-6-0-4.ccr22.lon01.atlas.cogentco.com [130.117.14.65]
8 39 ms 21 ms 12 ms be2870.ccr41.lon13.atlas.cogentco.com [154.54.58.173]
9 102 ms 135 ms 90 ms be12497.ccr41.par01.atlas.cogentco.com [154.54.56.130]
10 112 ms 115 ms 100 ms be3627.ccr41.jfk02.atlas.cogentco.com [66.28.4.197]
11 99 ms 98 ms 98 ms be2806.ccr41.dca01.atlas.cogentco.com [154.54.40.106]
12 109 ms 109 ms 109 ms be2112.ccr41.atl01.atlas.cogentco.com [154.54.7.158]
13 119 ms 117 ms 118 ms be2687.ccr41.iah01.atlas.cogentco.com [154.54.28.70]
14 172 ms 132 ms 181 ms be2927.ccr21.elp01.atlas.cogentco.com [154.54.29.222]
15 146 ms 151 ms 153 ms be2930.ccr32.phx01.atlas.cogentco.com [154.54.42.77]
16 180 ms 152 ms 152 ms be2932.ccr42.lax01.atlas.cogentco.com [154.54.45.162]
17 159 ms 158 ms 159 ms be2199.rcr21.b020604-0.lax01.atlas.cogentco.com [154.54.2.174]
18 152 ms 154 ms 152 ms 38.140.153.10
19 * * * Request timed out.
20 152 ms 152 ms 152 ms 192.73.255.218
21 152 ms 151 ms 168 ms 11.12.54.45.ptr.anycast.net [45.54.12.11]

Trace complete.


Some (many?) ISPs/IXPs are filtering UDP port 123 based on size, rate, and possibly other characteristics. By default traceroute does not use UDP port 123, so it often fails to detect filtering. On Linux I typically do:
traceroute -n -U -p 123 IP_address

Either the NTP request or NTP response may be lost.
A port 123 traceroute running from the NTP client may detect NTP request (mode 3) loss. A port 123 traceroute running from the NTP server may detect NTP response (mode 4) loss.
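A sketch of both directions, assuming the modern Linux traceroute (the `--sport` option requires root; the address is Newark’s, taken from the traceroutes above):

```shell
# From the NTP client: probe the request path (destination port 123).
traceroute -n -U -p 123 139.178.64.42

# From the NTP server: probe the response path. Filters keyed on a
# source port of 123 only trigger if the probe's source port is pinned.
sudo traceroute -n -U --sport=123 139.178.64.42
```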

Most of the NTP packet losses (timeouts) that I’ve worked on recently have been in the NTP server -> Newark direction.

It’s nice to detect which ISP/IXP drops the NTP packets, but that hasn’t helped so far. I can’t get any response from them.


ISPs do not filter NTP-over-UDP packets.
Dan is having the same problems as a lot of us.

Please stop this nonsense that ISPs drop packets on purpose; they do not.

ISPs use NTP themselves; it would be stupid to filter it.

While that may be true, the internet is never constant. It is possible that some network has deployed a form of DDoS protection (intentionally rate limiting UDP) or is under an active DDoS attack (https://www.digitalattackmap.com/ https://horizon.netscout.com/?filters=trigger.triggerName.UDP). It is also possible that network operators make a mistake from time to time, or simply that the buffer of some random piece of equipment is full and a packet had to be dropped. Routing issues… network upgrades…


The problem is that this system is badly managed: the monitor is dropping everybody while being the problem itself.
The beta system has proven this over and over again.
Management of this pool does nothing.

They always tell you: It’s you OR your provider.

ntppool.org is the problem and after 6 months it’s still not fixed.

The question is: How much time does @Ask need to fix it?
Can he fix it?

The beta system is better, but it never seems to replace the “normal” system.

PS: my servers have been deleted for a second time; I’m fed up with this crap. They can be used via my own pool, ntp.heppen.be, also round-robin and spot on time.

If he were the collective CTO of all of the backbone ISPs that rate limit NTP traffic, I imagine he would have adjusted their policies by now.


None of my servers have issues… Saying “everybody” is a very broad and inaccurate statement…


Hi,

I can see you’re hosting that on Zen; I’m with Plusnet but have the issue a lot worse than you (mine never gets into the pool).

If you compare with my traceroutes in Monitoring stations timeout to our NTP servers, which hops do we seem to have in common?

If it weren’t for LAX, these would be sensible figures: https://web.beta.grundclock.com/scores/37.49.54.12
“Newark, NJ, US (19.8)Los Angeles, CA (9.2)Amsterdam (17.9)”

I hope this observation helps. I’m getting timeouts from the Newark monitoring station (for the public pool) shortly after setting my Net Speed to 250 Mbps, but no problems when it’s set at 50 Mbps. And, the problem seems to resolve when I go back to 50 Mbps. Also, the Los Angeles and Amsterdam monitors (on beta) give me perfect scores although the Newark monitor (on beta) still drops my score (while included in the public pool at the higher speed). My IPv4-only server is in both pools at the same time with a constant 10 Mbps in the beta pool, so the only thing that is changing is the speed setting I’m offering to the public pool.

I don’t think my Debian system is the bottleneck. Load average is always below 1. The ntpd ramps up to about 6% CPU at the higher speed until the Newark monitor drops my score below 10. The process is at about 1.5% at the slower speed. This server is connected directly to the ISP (through their modem/router) using 1 Gbps ethernet to a 250 Mbps fiber connection. My server might run a lot of intermittent jobs, but ntpd is the biggest CPU user that I see during these tests. Most importantly, two of the three monitoring stations say I’m good at the higher speed.

I wonder if someone is rate limiting port 123 traffic on the path between my server and the Newark monitor? I doubt it’s my machine or my ISP, since the other two monitors are fine. Other postings have suggested filing a ticket with one’s ISP, but I don’t want to upset the table for something that is only a volunteer activity – I’ll save my chips for something more important to me. At this time, my plan is to find the highest speed that gives me a perfect score, and then I’ll live with that level of contribution to the public pool.

I was going to include a traceroute, but the system said it contained too many links for a first time user.
https://www.ntppool.org/scores/216.177.181.129
https://web.beta.grundclock.com/scores/216.177.181.129

A speed-setting-dependent monitoring failure is most likely a problem local to your site. That may involve any resource (network bandwidth, CPU, memory), but most likely it is some software resource limitation. It is typically connection tracking, which should be scaled up or, even better, switched off for NTP traffic in the firewalls (both external devices and the host firewall).

@DrRossJohnson Please check whether your case is similar to the one described in https://security.stackexchange.com/questions/43205/nf-conntrack-table-full-dropping-packet

I would recommend also increasing the net.netfilter.nf_conntrack_max parameter, if you have not done so already.

There is also some old, but helpful information here: https://www.pc-freak.net/blog/resolving-nf_conntrack-table-full-dropping-packet-flood-message-in-dmesg-linux-kernel-log/
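As a concrete starting point, a sketch of checking and raising the limit (paths are typical for recent kernels; the value is an example to adjust for your memory budget):

```shell
# Compare current conntrack usage against the table limit.
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Raise the limit for this boot; put the setting in /etc/sysctl.d/
# to make it persistent across reboots.
sysctl -w net.netfilter.nf_conntrack_max=262144
```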

My first thought too was connection tracking when I read that he started having issues at higher bandwidth settings…

If your ISP router is doing NAT, there probably isn’t any adjustment you can make to it. Still, disabling conntrack for UDP port 123 in/out on your Debian box is a good start, to ensure that its limit isn’t the issue.
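A minimal sketch of that, assuming iptables is in use on the box (nftables would need the equivalent `notrack` statement). Rules in the raw table run before connection tracking, so matching packets never consume a conntrack entry:

```shell
# Skip connection tracking for NTP: inbound client requests to port
# 123 and our outbound replies from port 123.
iptables -t raw -A PREROUTING -p udp --dport 123 -j NOTRACK
iptables -t raw -A OUTPUT     -p udp --sport 123 -j NOTRACK
```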

Hi,

the monitoring issues are back.
My server has had a rating between 0 and 10 for a week now, although there was no outage or problem.
https://www.ntppool.org/scores/37.120.164.45/log?limit=200&monitor=*

For some weeks before that (since the latest monitoring update) it was mostly fine.

Regards

Time passes by and nothing changes.
