My score used to be 19.9, and then on February 2nd it dropped down a lot. I don’t know what this pink dot is. Can somebody help me figure out what’s wrong?
Welcome!
Can you share the hyperlink to your server monitor page?
Here you go, and thanks for responding.
By the way, I’ve tested from an outside cloud server that’s synced to my server, and chrony shows it’s only about 26 milliseconds off. That server is based in Chicago and I’m in Texas; it’s actually synced to my personal NTP server.
The pink dots go with the left axis, i.e., the respective monitors occasionally see a high offset somewhere between -300ms and -400ms. It could be that something else is setting the clock with that offset, but seeing the unusually high RTT values (>>200ms) for pretty much all monitors, even for monitors that are relatively “close” to Chicago, I’d say something fishy is going on with the server’s uplink or transmit path, i.e., something is often causing high and asymmetric delay for individual monitor samples.
See, e.g., the pattern for the currently highest-scoring active monitor:
1775418305,2026-04-05 19:45:05,-0.000351315,1,12.663276672,129,usmci1-3strqkc,40.839,,
1775417969,2026-04-05 19:39:29,-0.354342039,-0.086821429,12.277133942,129,usmci1-3strqkc,750.576,,
The first sample looks “normal” from an offset and RTT point of view. The second has a -354ms offset and 750ms RTT. The curious thing that might give some clue to the origin is that the exceptionally high offsets fall in a rather narrow range (-300ms through -400ms, mostly around -360ms it seems), and there is nothing between that range and the more “normal” offsets: there is no relevant number of samples in the range -50ms through -300ms, while there is a noticeable number between 0ms and -50ms, and a large number around 0ms. If it were some traffic blocking the line, I would not expect the pattern to show such distinct ranges, but rather to be spread more evenly over the entire range between the “normal” values and the maximum negative offset.
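For anyone who wants to reproduce this from the raw CSV, here is a rough sketch that buckets the offset column (field 3, in seconds, per the format of the samples above) to make the gap visible. The file name is a placeholder.

```sh
# Rough sketch: bucket the offsets in the log CSV shown above to make the gap
# between "normal" samples and the ~-360 ms cluster visible.
# log_scores.csv is a placeholder file name.
awk -F, '
    $3 + 0 != $3 { next }                       # skip header / non-numeric rows
    {
        ms = $3 * 1000
        if      (ms > -50)  normal++
        else if (ms > -300) middle++            # the mostly empty middle range
        else if (ms > -400) cluster++           # the anomalous ~-360 ms cluster
        else                worse++
    }
    END {
        printf ">-50ms: %d   -50..-300ms: %d   -300..-400ms: %d   <-400ms: %d\n",
               normal, middle, cluster, worse
    }' log_scores.csv
```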
I have data from a couple of non-Pool monitors. Problems:
About 25% of the requests receive no responses. This alone is enough to cause a low score (a rough client-side probe loop is sketched after this list).
The RTT is anomalously high; see the bivalued RTTs below.
A similar plot taken from another monitor located in suburban Chicago has a peak at 35 msec (good) and at 780 msec (bad).
The extra delay is in the NTP response direction.
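For reference, a probe loop along these lines (not my actual monitor, just an illustrative sketch using ntpdate as the query tool; SERVER is a placeholder) could look like this:

```sh
#!/bin/sh
# Hypothetical probe loop: one query every ~8 s, counting lost responses and
# logging the delay reported by ntpdate. SERVER is a placeholder.
SERVER=ntp.example.org
sent=0; lost=0
while sleep 8; do
    sent=$((sent + 1))
    if out=$(ntpdate -q -p 1 -t 2 "$SERVER" 2>/dev/null); then
        delay=$(printf '%s\n' "$out" | awk '/delay/ {print $NF; exit}')
        echo "$(date -u +%FT%TZ) ok delay=${delay}s (loss ${lost}/${sent})"
    else
        lost=$((lost + 1))
        echo "$(date -u +%FT%TZ) LOST (loss ${lost}/${sent})"
    fi
done
```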
Can you tell us about the server and its IP connection?
I am not sure if it is 100% fixed yet but….
Root cause found — my UDM Pro (UniFi Dream Machine Pro) has a custom iptables chain UBIOS_WAN_LAN_USER that handles WAN→LAN traffic. My NTP server is on VLAN40 (192.168.40.x), which is routed via a static route through a downstream switch — not directly attached to the UDM Pro. Because of this, UniFi never auto-generated an ACCEPT rule for that subnet, so all pool monitor probes were silently DROPped at the firewall the entire time.
Added iptables -I UBIOS_WAN_LAN_USER 1 -s 0.0.0.0/0 -d 192.168.40.0/24 -j ACCEPT to my boot script so it persists across reboots and reprovisioning. Score immediately started recovering after the rule was inserted. No more timeouts — chronyc clients now shows ~30+ pool monitors hitting the server successfully, and the score has climbed back above 10.
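In case anyone wants to copy this, here is a slightly more defensive version of what went into the boot script (chain name and subnet taken from above); the -C check just keeps a re-run from stacking duplicate rules:

```sh
#!/bin/sh
# Insert the WAN->LAN ACCEPT rule for the NTP server's VLAN, but only if it
# isn't already present, so reboots/reprovisioning don't add duplicates.
iptables -C UBIOS_WAN_LAN_USER -s 0.0.0.0/0 -d 192.168.40.0/24 -j ACCEPT 2>/dev/null \
  || iptables -I UBIOS_WAN_LAN_USER 1 -s 0.0.0.0/0 -d 192.168.40.0/24 -j ACCEPT
```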
The dense pink dots were all from the 5-day pre-fix window when every probe was being dropped.
Regarding the server and connection:
The server runs Debian Linux with chrony, stratum 2, syncing from a local GPS-disciplined stratum 1 (CenterClick NTP270) on my LAN. The WAN connection is Vexus Fiber (T-Mobile XGS-PON) at 2Gbps symmetric, with NTP UDP 123 port-forwarded from WAN through the UDM Pro to the server on an internal 10Gbps LAN. The server tracks within ~1–2 microseconds of the GPS reference. The small offset penalties visible in the score log are from network path asymmetry between my ISP and distant monitors — not from inaccuracy on the server side.
chronyc tracking
Reference ID : C0A82814 (192.168.40.20)
Stratum : 2
System time : 0.000001391 seconds slow of NTP time
Last offset : -0.000000203 seconds
RMS offset : 0.000005151 seconds
Frequency : 1.890 ppm fast
Skew : 0.011 ppm
Root delay : 0.000365802 seconds
Root dispersion : 0.001018542 seconds
Leap status : Normal
chronyc sources
MS Name/IP address Stratum Poll Reach LastRx Last sample
^* 192.168.40.20 1 6 377 18 +142us[ +142us] +/- 1222us
Syncing from a local GPS stratum 1 (CenterClick NTP270) — offset within ~1–2 microseconds, RMS under 6 microseconds.
We should move the discussion back to the main thread. Others may be interested and could contribute too.
OK, we can move it here.
I ran captures on both sides of the router: UDM Pro forwarding is under 1ms, and the server responds to everything instantly. I couldn’t catch a 750ms event though; they seem too intermittent. Could it be upstream from my router, maybe something with my ISP’s UDP handling?
Here are some screenshots from when I first got the pool score running at 20, and then it dropped to 5 and stayed low.
We’ve exchanged a couple of DMs. I see both loss and delay from my NTP monitors. The delay is in the NTP response direction. A tcpdump at the NTP server may show which direction is dropping packets.
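Something along these lines should work for the capture (eth0 is a placeholder interface; the mode filter assumes the first NTP byte sits right after the 8-byte UDP header):

```sh
# Capture NTP traffic on the server, then count incoming requests (mode 3)
# versus outgoing responses (mode 4) in the capture.
tcpdump -i eth0 -n -w ntp-server-side.pcap udp port 123

tcpdump -r ntp-server-side.pcap -n 'udp[8] & 0x7 = 3' | wc -l    # requests seen
tcpdump -r ntp-server-side.pcap -n 'udp[8] & 0x7 = 4' | wc -l    # responses sent
```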
This is the delay seen from my monitoring client located in Ohio.
Not sure about your case, but I did experience a similar issue where the initial score was good and then started dropping as the load got higher. The reason (in my case) was that my OPNsense box (despite being loaded with a 12th-gen i7 CPU, 8GB RAM, and Intel NICs) had high latency under heavy load (>20K requests per second).
I believe the issue was in the OPNsense OS, so I switched back to MikroTik RouterOS (again), and the score has been very stable since then, even under very high load (>60K requests per second).
I believe if the score starts to drop when load increases, it’s probably caused by your server or your router.
In which direction (request or response) was the extra delay?
Hi buddy, if you are asking about my case, then the delay was in the response direction.
The server has not had a score of 10 or above for a while now, so there should not be significant load increases. Rather, load should be decreasing.
Load-related issues also typically have a very characteristic sawtooth pattern (depending on the nature of the system component being overloaded, e.g., a stateful flow-tracking table being full and thus dropping new flows): scores slowly ramp up until they cross the 10-point threshold, then suddenly drop way below 10 again due to excessive packet drops once the load jumps beyond sustainable levels.
If it were packet processing delay, I’d also expect the delay values seen to be distributed more continuously from “normal” levels to extreme levels. In this case here, there is a very characteristic gap between “normal” delay values (typically <100ms), and the anomalous delay (>600ms), with pretty much no samples with values in between. So something is really delaying those packets by some almost constant amount, vs., e.g., a buffer (e.g., network buffer or CPU scheduler queue) slowly filling as packets are added to the buffer at a higher rate than they are being removed.
My guess would be either hardware issue, or some routing/forwarding issue.
E.g., maybe some upstream is indeed taking issue with NTP packets, and diverting a portion of them through some remote scrubbing center (though a delay >500ms sounds rather high even for such a case, unless the rerouting happens via Asia or Oceania). I have a case where an ISP is blocking outgoing NTP packets as they consider them a reflection attack. Another ISP could be trying to “clean” the traffic in a scrubbing center instead of plain dropping.
Or some strange route flapping, e.g., I have two nodes in Australia with the network path going via the USA. In my case, it’s static like that, and bi-directional. But in other cases could be some flapping, i.e., the normal path being short, but anomalously alternating with a much longer path for just one direction.
I agree with your analysis. The ISP could be a factor, and the OP could contact them to see if they are doing something strange. This usually shouldn’t happen on business broadband connections or on a VPS, but we are not sure what kind of internet connection the OP is using.
Earlier today, I set up some RIPE Atlas measurements towards the server. It seems as if ICMP packets (ping) are not affected by the issue. I haven’t done a thorough analysis, but a cursory look at the data suggests that the maximum RTT measured for ICMP was about 64ms, while the NTP measurement reflects the previous findings, i.e., the “normal” RTT is in the range of about 39ms to about 67ms, and then the anomalous RTT values ranging from about 560ms to about 1s (a singular exception with 310ms).
So it seems the behavior is specific to the traffic type. The server, or some firewall in front of it, seems to be dropping any other incoming UDP-based traffic, so it’s difficult to say whether this is specific to NTP (somewhat more likely I think), or UDP-based traffic in general.
- Raw results data for the NTP measurement (note that the Atlas results have the sign on the offset value inverted, as found by @marco.davids)
- Raw results data for the ICMP (ping) measurement
If one drops the “results” part in the above URLs, one gets to a higher-level UI that also has some analysis tools.
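If anyone wants to pull the raw numbers themselves, a rough sketch against the Atlas API would be something like the following (the measurement ID is a placeholder, and the .result[].rtt field layout is my assumption about the current NTP result format):

```sh
# Fetch the raw results of a RIPE Atlas NTP measurement and list the largest
# reported RTTs. <MSM_ID> is a placeholder for the measurement ID.
curl -s "https://atlas.ripe.net/api/v2/measurements/<MSM_ID>/results/?format=json" \
  | jq -r '.[] | .result[]? | .rtt? // empty' \
  | sort -n | tail
```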
Update: Running a TCP-based traceroute (to a port where there is some sort of response) and looking at the RTT data, that seems to be affected by the same issue as NTP, see this graphical representation. At the same time, the ping measurement still doesn’t show anything out of the ordinary. Strange…
Update 2: At least one of the RIPE Atlas probes involved in the measurements used for the troubleshooting described above has been flagged as originating a “DDoS” attack. Since it’s been running fine for a long time, I guess this flagging is due to the aforementioned measurements I set up earlier today. I hope I can still prevent its suspension and salvage it for its roles as NTP server and Pool monitor.
Anyway, that leads me to my personal preliminary conclusion that some stupid, overzealous, misguided “security” system somewhere is quite likely causing the strange issues that are being seen.
Update 3: Finally had the chance to look deeper into the DDoS flagging, and found that it was not based on some third-party report; rather, it seems my own hoster detected and flagged this. I still hope that they accept my explanation and refrain from terminating the machine.
Anyway, need to withdraw my earlier preliminary personal conclusion. While it is still possible that some security function might be causing this, there’s now less evidence pointing in that direction from my point of view, thus the puzzle remains.
Did you catch a “750ms” event by now? According to the data shared by @stevesommars above and especially below, those events seem to occur quite regularly, and to last for some 10 minutes or so, so hopefully, you’d be able to catch one eventually.
I’ve plotted recent losses and RTT between my client (Ohio) and the NTP Server.
Each NTP request appears in the top graph as either 0 (response received) or 1 (no response received). The bottom graph shows the RTT in milliseconds when responses were received.
Note the correlation between the two graphs.
The edge router may be the culprit.
Sorry for going quiet — I hit the new user message limit (17/day) and had to continue the investigation with Steve over email. Here’s the full resolution:
Recap of what the thread found:
The thread correctly identified three problems: ~25% packet loss to pool monitors, bimodal RTT distribution with a cluster around 750-800ms, and timing anomalies. MagicNTP’s RIPE Atlas measurements showed ICMP was unaffected while NTP responses had much higher RTT, and Steve’s post 19 correctly pointed to the edge router as the culprit based on correlated graphs.
What we did over email:
Steve asked me to add his NTP server as a noselect source on smolnut so he could correlate timing from his side, and confirmed from a pcap I shared that smolnut itself was responding to every request with zero loss — the problem was happening after packets left the server.
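For context, the noselect addition is just a one-liner in chrony (the hostname here is a placeholder for Steve’s monitor); chrony then measures that source but never selects it for synchronization:

```sh
# Add Steve's monitor as a measured-but-never-selected source, then restart
# chronyd to pick up the change. monitor.example.net is a placeholder.
echo 'server monitor.example.net iburst noselect' >> /etc/chrony/chrony.conf
systemctl restart chrony
```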
He provided a custom mtr script probing ports 122, 123, and 124 to test for port-specific filtering. I ran it from smolnut targeting his Ohio monitor — all three ports showed ~20-27% loss uniformly across every hop, ruling out ISP port filtering.
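I can’t share Steve’s exact script, but the idea can be approximated with something like this (TARGET is a placeholder for the Ohio monitor):

```sh
# Approximation of the multi-port probing: UDP traceroutes to ports 122, 123,
# and 124, 100 probes each, to compare per-hop loss across the three ports.
TARGET=monitor.example.net
for port in 122 123 124; do
    echo "== UDP port $port =="
    mtr --udp --port "$port" --no-dns --report --report-cycles 100 "$TARGET"
done
```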
Steve then asked for a WAN-side capture. I took a 30-minute tcpdump on LongBean’s WAN interface and sent it to him. Steve analyzed the timestamps and showed the ~714ms bimodal delay was present in the T(WAN)−T3 difference — meaning the delay was being introduced between smolnut and LongBean, inside my own network. I had initially assumed LongBean was clean but Steve’s pcap plot proved otherwise.
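For anyone wanting to repeat that kind of analysis, extracting the two timestamps from the pcap might look roughly like this (the file name and the Wireshark dissector field names are assumptions on my part):

```sh
# Print the capture time and the NTP transmit timestamp (T3) of each server
# response in the WAN-side pcap, for offline T(WAN) - T3 comparison.
tshark -r wan.pcap -Y 'ntp.flags.mode == 4' \
       -T fields -e frame.time_epoch -e ntp.xmt
```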
Root cause:
The culprit was my core switch (UniFi Switch Pro XG-8 PoE, MIPS Linux 4.4.153) doing software L3 routing for VLAN 40 on its single MIPS CPU core. NAPI interrupt coalescing on the NIC was causing outbound UDP packets to sit in the receive queue for up to ~714ms — exactly the bimodal pattern visible in the pcap. smolnut itself was innocent (T3−T2 < 333µs on every packet). The Vexus ISP ticket was a red herring.
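If someone wants to check for the same pattern on their own gear, the coalescing settings can usually be inspected with ethtool (eth0 is a placeholder; whether the knobs exist depends on the driver, and the UniFi switch’s MIPS build may not expose them at all):

```sh
# Inspect / adjust NIC interrupt coalescing on a Linux box.
ethtool -c eth0                # show current rx/tx coalescing settings
ethtool -C eth0 rx-usecs 50    # example: limit how long frames wait before an IRQ
```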
Fix: moved VLAN 40 routing from the switch back to the UDM Pro via the UniFi API. The switch is now pure L2.
Result: pool score recovered from ~5.9 to 17.4 with all probes returning step=1.
Thanks to Steve and MagicNTP for the analysis — the T(WAN)−T3 plot and the RIPE Atlas data were what pointed us in the right direction.






