I would like to do some mtr/traceroute testing, as I am having a hard time maintaining a stable score: pool.ntp.org: Statistics for 94.136.191.126
Server health is fine, but I need to debug connectivity with the provider.
Looks like the majority of the monitor nodes have problems querying your server. With a score graph like that, it should be fairly simple to reproduce the problem yourself by setting up some monitoring on your own from some other server, with a script like this one.
If you need a second (or third, or fourth…) opinion, here are some stats:
Singapore, Netherlands, Australia, Finland
You can do traceroutes to the respective server names.
As usual, before bothering your ISP too much, make sure your own server is fine. For example, see if there are odd messages in dmesg about conntrack tables filling up.
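A minimal sketch of those local checks (standard Linux tooling; monitor.example.net below is just a placeholder for one of the monitoring server names):

# Check kernel logs for conntrack table exhaustion
dmesg | grep -iE 'conntrack|nf_conntrack'

# If the conntrack module is loaded, compare current entries against the limit
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max 2>/dev/null

# Trace the path towards one of the monitoring locations (100 probes, report mode)
mtr -rwc 100 monitor.example.net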
Thanks so much @avij - looks like massive packet loss. Will investigate.
dmesg is clear - I had already disabled conntrack and even switched over to ntpd-rs for multi-core support. But the low-score issue was happening even during low traffic.
Apologies, the script I pointed to in my previous message was actually for ping while the linked graphs show NTP packet loss.
If you want to reproduce the NTP packet loss problem, here’s the script I use for that:
#!/bin/sh
# Probe an NTP server and print one loss figure per cycle:
# 0 = all four queries answered, 1000 = all four queries lost.
addr=$1
while true
do
    pktloss=0
    for i in 1 2 3 4
    do
        # A successful ntpdate query prints an offset line containing "+/-"
        if ! ntpdate -q -p1 "$addr" 2>&1 | grep -q "+/-"
        then
            pktloss=$((pktloss + 250))
        fi
        sleep 2
    done
    echo "$pktloss"
    # One sample roughly every five minutes
    sleep 288
done
This script works as-is on Rocky Linux 9. YMMV. On older versions of ntpdate I looked for the string “adjust time server” in the output instead. It may be beneficial to add an invocation of date to the script to record a timestamp for each sample.
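For example, assuming the script above is saved as ntploss.sh (the filename is arbitrary), logging each sample with a timestamp could look like this:

# Prefix every loss sample with a UTC timestamp and append it to a log file
./ntploss.sh 94.136.191.126 | while read -r loss
do
    echo "$(date -u '+%Y-%m-%d %H:%M:%S') $loss"
done >> ntploss-94.136.191.126.log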
One good way to troubleshoot this would be to set up a temporary server (the cheapest possible) at the same datacenter. If you then run the aforementioned monitoring scripts on the temporary server and measure packet loss between it and your NTP server, you will be able to narrow the search down to either “outside the provider’s datacenter” or “inside the provider’s datacenter”. Note that you will likely need to run the scripts for a longish time, like 24 hours, to spot the problems. There are periods in your server score graphs with no packet loss, likely correlating with traffic levels.
You may want to try this one for monitoring too: it shows the reachability/loss bursts visually.
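The linked script is not reproduced here, but as a rough sketch of the same idea (reusing the ntpdate check from the script above), printing one character per query makes loss bursts easy to spot:

#!/bin/sh
# Print "." for each answered NTP query and "x" for each lost/failed one
addr=$1
while true
do
    if ntpdate -q -p1 "$addr" 2>&1 | grep -q "+/-"
    then
        printf '.'
    else
        printf 'x'
    fi
    sleep 2
done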
$ ntpdate -q -d -p1 94.136.191.126
30 Dec 15:53:29 ntpdate[17869]: ntpdate 4.2.6p5@1.2349-o Tue Jun 23 15:38:19 UTC 2020 (1)
Looking for host 94.136.191.126 and service ntp
host found : in.ntp.nu
transmit(94.136.191.126)
receive(94.136.191.126)
94.136.191.126: Server dropped: Server has gone too long without sync
server 94.136.191.126, port 123
stratum 2, precision -18, leap 00, trust 000
refid [94.136.191.126], delay 0.31161, dispersion 0.00000
transmitted 1, in filter 1
reference time: 00000000.00000000 Mon, Jan 1 1900 1:39:49.000
originate timestamp: eb1d245b.2a01197c Mon, Dec 30 2024 15:53:31.164
transmit timestamp: eb1d245a.f0ca0481 Mon, Dec 30 2024 15:53:30.940
filter delay: 0.31161 0.00000 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000
filter offset: 0.076463 0.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000
delay 0.31161, dispersion 0.00000
offset 0.076463
I’m concerned about that “Server has gone too long without sync” part.
Edit: It seems to be a known issue in ntpd-rs. On the other hand, newer versions of ntpdate do not show this error. On the third hand, I don’t see a mention in the RFCs that the reftime field would be optional, so I’d still classify this as a bug. Oh well. Let’s not get ourselves stuck in this.
reference time: 00000000.00000000 Mon, Jan 1 1900 1:39:49.000
This line indicates that the service was restarted less than two hours before this test was taken and has not synchronized to an upstream source since. The 1st of January 1900 is the NTP timestamp equivalent of zero.
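For reference, NTP timestamps count seconds since 1900-01-01, while Unix time starts at 1970-01-01, a difference of 2208988800 seconds. That makes it easy to decode the non-zero timestamps above by hand (a sketch using GNU date and shell arithmetic; the first command prints local time, so it may differ from the transcript):

# Originate timestamp eb1d245b: subtract the 1900->1970 offset, feed to date
date -d "@$(( 0xeb1d245b - 2208988800 ))"
# NTP zero really is the 1900 epoch
date -u -d "@-2208988800"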
However, the error messages in the pool monitoring are all network-related. They also look a lot like the cycle of a server under too high a load: the overall score creeps upwards and slightly exceeds 10 points → the server gets included in the pool, a bit later network timeout errors appear, the score drops back below 10, the errors stop, and the cycle repeats.
@gunter :
@avij I do not see those issues when using ntpdig from another server, but right now there is packet loss as well, and IMO ntpd-rs is more fragile to lost updates than ntpsec, so I might switch back.
@Sebhoster - I have seen these score creeps be more of a red herring. Even if the server is disabled, the score will go up and down. The fact that it happens more often once the server is added to the pool might indicate a congested network or host machine - but the underlying issue remains: the network is unstable.
What is the netspeed of your server? Have you tried reducing it?
It was at 1G; I have now reduced it to 100M for both IPv4 and IPv6 (yes, both are on the same machine).
What is the upstream time source of the server? Is it synchronized?
Multiple stratum 1 servers - it is in sync based on a local “ntp-ctl status” and ntpdig from external servers. But generally, I am sure that once the packet loss kicks in, it is unable to sync.
If you say the issue is also happening during low traffic, how much traffic are we talking about?
Low traffic to me is anything below 10 Mbit/s.
The CPU has a lot of steal, but given that there is still headroom, I cannot imagine it being the issue:
As an additional data point, I started pinging the server too. Although these results are still preliminary, it looks like NTP packets get dropped more often than ICMP echo requests (pings).
Ping stats: Singapore, Netherlands, Australia, Finland.
Edit: There’s also a summary stats page.
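For anyone who wants to reproduce that comparison from their own vantage point, here is a rough sketch (not the script used for the graphs above) that sends the same number of ICMP and NTP probes to one host and prints the loss for each, assuming iputils ping and the same ntpdate check as earlier:

#!/bin/sh
# Compare ICMP and NTP packet loss towards a single host
addr=$1
count=${2:-100}

# ICMP: let ping do the counting and pull the loss figure from its summary line
icmp_loss=$(ping -q -c "$count" "$addr" | grep -o '[0-9.]*% packet loss')

# NTP: count ntpdate queries that do not return an offset ("+/-") line
ntp_fail=0
i=0
while [ "$i" -lt "$count" ]
do
    if ! ntpdate -q -p1 "$addr" 2>&1 | grep -q "+/-"
    then
        ntp_fail=$((ntp_fail + 1))
    fi
    i=$((i + 1))
    sleep 1
done

echo "ICMP: $icmp_loss"
echo "NTP:  $((ntp_fail * 100 / count))% packet loss ($ntp_fail of $count queries failed)"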
You said you had disabled connection tracking, but please verify with “conntrack -L” or “grep port=123 /proc/net/nf_conntrack” (or whatever works for you), and/or “iptables -L -t raw” which should probably show something like:
Chain PREROUTING (policy ACCEPT)
target prot opt source destination
CT udp -- anywhere anywhere udp dpt:ntp CT notrack
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
CT udp -- anywhere anywhere udp spt:ntp CT notrack
$ sudo iptables -L -t raw
Chain PREROUTING (policy ACCEPT)
target prot opt source destination
CT udp -- anywhere anywhere udp spt:ntp CT notrack
CT udp -- anywhere anywhere udp dpt:ntp CT notrack
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
CT udp -- anywhere anywhere udp spt:ntp CT notrack
CT udp -- anywhere anywhere udp dpt:ntp CT notrack
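For anyone who needs to add such rules rather than just verify them, something along these lines should produce the output above (plain iptables; repeat with ip6tables for IPv6, or translate to your own firewall tooling):

# Skip connection tracking for NTP in both directions, before any state is created
iptables -t raw -A PREROUTING -p udp --sport 123 -j CT --notrack
iptables -t raw -A PREROUTING -p udp --dport 123 -j CT --notrack
iptables -t raw -A OUTPUT -p udp --sport 123 -j CT --notrack
iptables -t raw -A OUTPUT -p udp --dport 123 -j CT --notrack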
There was much more throughput possible a couple of days earlier:
it looks like NTP packets get dropped more often than ICMP echo requests
This could also indicate that they are dropping packets before they reach the VM. I had this with another provider, where NTP traffic was triggering their anti-DDoS firewall.
At least on the VM side there are no dropped packets, and nothing is being throttled by ntpd-rs.
As much as I want to believe this is an issue with ntpd-rs or the server, it is highly likely just bad network quality - it's Contabo.
I will switch back to ntpd though, and test out GitHub - mlichvar/rsntp: High-performance NTP server written in Rust just to rule out some stuff…
I’m not sure if the situation has changed in the meantime, but I don’t think ntpd-rs has support for multiple cores.
It has. It runs a process per CPU thread when running in server mode. Aside from rsntp, it's the only NTP server that does.
In my opinion, bandwidth is not a good indicator, since NTP packets are quite small. 50 kpps is a lot, and you will definitely have to make sure that connection tracking is disabled all the way. Even then, a small server is likely to be overwhelmed by that load.
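Back-of-the-envelope, assuming a 48-byte NTP payload plus 28 bytes of UDP/IPv4 headers and ignoring Ethernet framing, 50 kpps is only about 30 Mbit/s, so a link can look nearly idle in bandwidth terms while the packet rate is already substantial:

# 48-byte NTP payload + 8-byte UDP header + 20-byte IPv4 header = 76 bytes/packet
echo $(( 50000 * 76 * 8 / 1000000 )) Mbit/s   # prints: 30 Mbit/s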
Have you contacted Contabo about the issue?
Seeing some other benchmarks and given the server specs, my conclusion would be that 50 kpps is not a lot.
That would indicate premium hardware and network, which I surely do not have right now.
At least one of the data points is GitHub - mlichvar/rsntp: High-performance NTP server written in Rust, where a 4-core machine from 2011 was able to max out a 1 Gbit NIC.
So yeah, I definitely want to see the limits of the server, but right now - given the data points - I am already failing due to network quality rather than any other hardware limits.
Have you contacted Contabo about the issue?
Yes, they are looking at it. Thanks again @avij for providing your monitoring.
If we can get this stable in India, that might help the region - it seems like the demand for an NTP server is definitely there.
@gunter both IPv6 and IPv4 monitoring for in.ntp.nu shows a distinct improvement starting about a day before the new year. Do you know what changed?
Yes, for now I limited the netspeed to 100 for IPv4 and 250 for IPv6. It seems the server can only handle peaks of 10 Mbit/s (~15 kpps); anything above that gets dropped (before reaching the server). As soon as I went over 15 Mbit/s, the score went down again, so this is at least reproducible:
I can only guess it's due to rate limiting/anti-DDoS or the host machine itself having limits.
Unfortunately this is not sustainable, as I know the server could handle much more. I do understand that Contabo does not have a high interest in improving this behavior.
It was confirmed by Contabo that there are indeed issues with one of the upstream providers and they are working on it.
So there was nothing I could have improved with routing or bandwidth. I will observe this over the next couple of weeks and see if I can increase the throughput; then it would indeed be an interesting PoP for me.
Nevertheless, I still think it would be great if we could have some sort of looking glass on the monitoring servers, as it would make debugging a lot easier.
Thanks all for your help.
There’s nothing about your experience that was specific to the perspective of the actual pool monitors. Any NTP “looking glass” would do as well.
RIPE offers this functionality to their members. I am in North America and am quite jealous of my European peers for the measurement tools available to them. IMO it makes ARIN look like a hopeless mess by contrast. If those acronyms puzzle you see Regional Internet registry - Wikipedia.
Sadly I am not a RIPE member.