I would like to do some mtr/traceroute testing, as I am having a hard time maintaining a stable score: pool.ntp.org: Statistics for 94.136.191.126
Server health is fine, but I need to debug connectivity with the provider.
Looks like the majority of the monitor nodes have problems querying your server. With a score graph like that, it should be fairly simple to reproduce the problem yourself by setting up some monitoring on your own from some other server, with a script like this one.
If you need a second (or third, or fourth…) opinion, here are some stats:
Singapore, Netherlands, Australia, Finland
You can do traceroutes to the respective server names.
As usual, before bothering your ISP too much, make sure your own server is fine. For example, see if there are odd messages in dmesg about conntrack tables filling up.
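A minimal sketch of those local checks (standard Linux tooling; monitor.example.net below is just a placeholder for one of the monitoring server names):

# Check kernel logs for conntrack table exhaustion
dmesg | grep -iE 'conntrack|nf_conntrack'

# If the conntrack module is loaded, compare current entries against the limit
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max 2>/dev/null

# Trace the path towards one of the monitoring locations (100 probes, report mode)
mtr -rwc 100 monitor.example.net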
Thanks so much @avij - looks like massive packet loss. Will investigate.
dmesg is clear - I had already disabled conntrack and even switched over to ntpd-rs for multi-core support. But the low-score issue was happening even during low traffic.
Apologies, the script I pointed to in my previous message was actually for ping while the linked graphs show NTP packet loss.
If you want to reproduce the NTP packet loss problem, here’s the script I use for that:
#!/bin/sh
# Probe an NTP server and print one loss figure per cycle:
# 0 = all four queries answered, 1000 = all four queries lost.
addr=$1
while true
do
    pktloss=0
    for i in 1 2 3 4
    do
        # A successful ntpdate query prints an offset line containing "+/-"
        if ! ntpdate -q -p1 "$addr" 2>&1 | grep -q "+/-"
        then
            pktloss=$((pktloss + 250))
        fi
        sleep 2
    done
    echo "$pktloss"
    # One sample roughly every five minutes
    sleep 288
done
This script works as-is on Rocky Linux 9. YMMV. On older versions of ntpdate I looked for the string “adjust time server” in the output instead. It may be beneficial to add an invocation of date to the script to record a timestamp for each sample.
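For example, assuming the script above is saved as ntploss.sh (the filename is arbitrary), logging each sample with a timestamp could look like this:

# Prefix every loss sample with a UTC timestamp and append it to a log file
./ntploss.sh 94.136.191.126 | while read -r loss
do
    echo "$(date -u '+%Y-%m-%d %H:%M:%S') $loss"
done >> ntploss-94.136.191.126.log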
One good way to troubleshoot this would be to set up a temporary server (the cheapest possible) at the same datacenter. If you then run the aforementioned monitoring scripts on the temporary server and measure packet loss between it and your NTP server, you will be able to narrow the search down to either “outside the provider’s datacenter” or “inside the provider’s datacenter”. Note that you will likely need to run the scripts for a longish time, like 24 hours, to spot the problems. There are periods in your server score graphs with no packet loss, likely correlating with traffic levels.
You may want to try this one for monitoring too: it shows the reachability/loss bursts visually.
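The linked script is not reproduced here, but as a rough sketch of the same idea (reusing the ntpdate check from the script above), printing one character per query makes loss bursts easy to spot:

#!/bin/sh
# Print "." for each answered NTP query and "x" for each lost/failed one
addr=$1
while true
do
    if ntpdate -q -p1 "$addr" 2>&1 | grep -q "+/-"
    then
        printf '.'
    else
        printf 'x'
    fi
    sleep 2
done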
$ ntpdate -q -d -p1 94.136.191.126
30 Dec 15:53:29 ntpdate[17869]: ntpdate 4.2.6p5@1.2349-o Tue Jun 23 15:38:19 UTC 2020 (1)
Looking for host 94.136.191.126 and service ntp
host found : in.ntp.nu
transmit(94.136.191.126)
receive(94.136.191.126)
94.136.191.126: Server dropped: Server has gone too long without sync
server 94.136.191.126, port 123
stratum 2, precision -18, leap 00, trust 000
refid [94.136.191.126], delay 0.31161, dispersion 0.00000
transmitted 1, in filter 1
reference time: 00000000.00000000 Mon, Jan 1 1900 1:39:49.000
originate timestamp: eb1d245b.2a01197c Mon, Dec 30 2024 15:53:31.164
transmit timestamp: eb1d245a.f0ca0481 Mon, Dec 30 2024 15:53:30.940
filter delay: 0.31161 0.00000 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000
filter offset: 0.076463 0.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000
delay 0.31161, dispersion 0.00000
offset 0.076463
I’m concerned about that “Server has gone too long without sync” part.
Edit: It seems to be a known issue in ntpd-rs. On the other hand, newer versions of ntpdate do not show this error. On the third hand, I don’t see a mention in the RFCs that the reftime field would be optional, so I’d still classify this as a bug. Oh well. Let’s not get ourselves stuck in this.
reference time: 00000000.00000000 Mon, Jan 1 1900 1:39:49.000
This line indicates that the service was restarted less than two hours before this test was taken and has not synchronized to an upstream source since. The 1st of January 1900 is the NTP timestamp equivalent of zero.
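For reference, NTP timestamps count seconds since 1900-01-01, while Unix time starts at 1970-01-01, a difference of 2208988800 seconds. That makes it easy to decode the non-zero timestamps above by hand (a sketch using GNU date and shell arithmetic; the first command prints local time, so it may differ from the transcript):

# Originate timestamp eb1d245b: subtract the 1900->1970 offset, feed to date
date -d "@$(( 0xeb1d245b - 2208988800 ))"
# NTP zero really is the 1900 epoch
date -u -d "@-2208988800"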
However, the error messages in the pool monitoring are all network-related. They also look a lot like the cycle of a server under too high a load: the overall score creeps upwards and slightly exceeds 10 points → the server gets included in the pool, a bit later network timeout errors appear, the score drops back below 10, the errors stop, and the cycle repeats.
@gunter :
@avij I do not see those issues when using ntpdig from another server, but right now there is packet loss as well, and IMO ntpd-rs is more fragile to lost updates than ntpsec, so I might switch back.
@Sebhoster - I have seen these score creeps be more of a red herring. Even if the server is disabled, the score will go up and down. The fact that it happens more often once the server is added to the pool might indicate a congested network or host machine - but the underlying issue remains: the network is unstable.
What is the netspeed of your server? Have you tried reducing it?
It was at 1G; I have now reduced it to 100M for both IPv4 and IPv6 (yes, both are on the same machine).
What is the upstream time source of the server? Is it synchronized?
Multiple stratum 1 servers - it is in sync based on a local “ntp-ctl status” and ntpdig from external servers. But generally, I am sure that once the packet loss kicks in, it is unable to sync.
If you say the issue is also happening during low traffic, how much traffic are we talking about?
Low traffic to me is anything below 10 Mbit/s.
The CPU has a lot of steal, but given that there is still headroom, I cannot imagine it being the issue:
As an additional data point, I started pinging the server too. Although these results are still preliminary, it looks like NTP packets get dropped more often than ICMP echo requests (pings).
Ping stats: Singapore, Netherlands, Australia, Finland.
Edit: There’s also a summary stats page.
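For anyone who wants to reproduce that comparison from their own vantage point, here is a rough sketch (not the script used for the graphs above) that sends the same number of ICMP and NTP probes to one host and prints the loss for each, assuming iputils ping and the same ntpdate check as earlier:

#!/bin/sh
# Compare ICMP and NTP packet loss towards a single host
addr=$1
count=${2:-100}

# ICMP: let ping do the counting and pull the loss figure from its summary line
icmp_loss=$(ping -q -c "$count" "$addr" | grep -o '[0-9.]*% packet loss')

# NTP: count ntpdate queries that do not return an offset ("+/-") line
ntp_fail=0
i=0
while [ "$i" -lt "$count" ]
do
    if ! ntpdate -q -p1 "$addr" 2>&1 | grep -q "+/-"
    then
        ntp_fail=$((ntp_fail + 1))
    fi
    i=$((i + 1))
    sleep 1
done

echo "ICMP: $icmp_loss"
echo "NTP:  $((ntp_fail * 100 / count))% packet loss ($ntp_fail of $count queries failed)"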
You said you had disabled connection tracking, but please verify with “conntrack -L” or “grep port=123 /proc/net/nf_conntrack” (or whatever works for you), and/or “iptables -L -t raw” which should probably show something like:
Chain PREROUTING (policy ACCEPT)
target prot opt source destination
CT udp -- anywhere anywhere udp dpt:ntp CT notrack
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
CT udp -- anywhere anywhere udp spt:ntp CT notrack
$ sudo iptables -L -t raw
Chain PREROUTING (policy ACCEPT)
target prot opt source destination
CT udp -- anywhere anywhere udp spt:ntp CT notrack
CT udp -- anywhere anywhere udp dpt:ntp CT notrack
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
CT udp -- anywhere anywhere udp spt:ntp CT notrack
CT udp -- anywhere anywhere udp dpt:ntp CT notrack
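For anyone who needs to add such rules rather than just verify them, something along these lines should produce the output above (plain iptables; repeat with ip6tables for IPv6, or translate to your own firewall tooling):

# Skip connection tracking for NTP in both directions, before any state is created
iptables -t raw -A PREROUTING -p udp --sport 123 -j CT --notrack
iptables -t raw -A PREROUTING -p udp --dport 123 -j CT --notrack
iptables -t raw -A OUTPUT -p udp --sport 123 -j CT --notrack
iptables -t raw -A OUTPUT -p udp --dport 123 -j CT --notrack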
There was much more throughput possible a couple of days earlier:
it looks like NTP packets get dropped more often than ICMP echo requests
This could also indicate that they are dropping packets before they reach the VM. I had this with another provider, where NTP traffic was triggering their anti-DDoS firewall.
At least on the VM side there are no dropped packets, and nothing is being throttled by ntpd-rs.
As much as I want to believe this is an issue with ntpd-rs or the server, it is highly likely just bad network quality - it's Contabo.
I will switch back to ntpd though, and test out GitHub - mlichvar/rsntp: High-performance NTP server written in Rust just to rule out some stuff…
I’m not sure if the situation has changed in the meantime, but I don’t think ntpd-rs has support for multiple cores.
It has. It runs a process per CPU thread when running in server mode. Aside from rsntp, it's the only NTP server that does.
In my opinion, bandwidth is not a good indicator, since NTP packets are quite small. 50 kpps is a lot, and you will definitely have to make sure that connection tracking is disabled all the way. Even then, a small server is likely to be overwhelmed by that load.
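Back-of-the-envelope, assuming a 48-byte NTP payload plus 28 bytes of UDP/IPv4 headers and ignoring Ethernet framing, 50 kpps is only about 30 Mbit/s, so a link can look nearly idle in bandwidth terms while the packet rate is already substantial:

# 48-byte NTP payload + 8-byte UDP header + 20-byte IPv4 header = 76 bytes/packet
echo $(( 50000 * 76 * 8 / 1000000 )) Mbit/s   # prints: 30 Mbit/s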
Have you contacted Contabo about the issue?
Seeing some other benchmarks and given the server specs, my conclusion would be that 50 kpps is not a lot.
That would indicate premium hardware and network, which I surely do not have right now.
At least one of the data points is GitHub - mlichvar/rsntp: High-performance NTP server written in Rust, where a 4-core machine from 2011 was able to max out a 1 Gbit NIC.
So yeah, I definitely want to see the limits of the server, but right now - given the data points - I am already failing due to network quality rather than any other hardware limits.
Have you contacted Contabo about the issue?
Yes, they are looking at it. Thanks again @avij for providing your monitoring.
If we can get this stable in India, that might help the region - it seems like the demand for an NTP server is definitely there.
@gunter both IPv6 and IPv4 monitoring for in.ntp.nu shows a distinct improvement starting about a day before the new year. Do you know what changed?
Yes, for now I limited the netspeed to 100 for IPv4 and 250 for IPv6. It seems the server can only handle peaks of 10 Mbit/s (~15 kpps); anything above that gets dropped (before reaching the server). As soon as I went over 15 Mbit/s, the score went down again, so this is at least reproducible:
I can only guess it's due to rate limiting/anti-DDoS or the host machine itself having limits.
Unfortunately this is not sustainable, as I know the server could handle much more. I do understand that Contabo does not have a high interest in improving this behavior.
It was confirmed by Contabo that there are indeed issues with one of the upstream providers and they are working on it.
So there was nothing I could have improved with routing or bandwidth. I will observe this over the next couple of weeks and see if I can increase the throughput; then it would indeed be an interesting PoP for me.
Nevertheless, I still think it would be great if we could have some sort of looking glass on the monitoring servers, as it would make debugging a lot easier.
Thanks all for your help.
There’s nothing about your experience that was specific to the perspective of the actual pool monitors. Any NTP “looking glass” would do as well.
RIPE offers this functionality to their members. I am in North America and am quite jealous of my European peers for the measurement tools available to them. IMO it makes ARIN look like a hopeless mess by contrast. If those acronyms puzzle you see Regional Internet registry - Wikipedia.
Sadly I am not a RIPE member.