Score/network woes

pheleven · August 21, 2018, 5:32pm

One of my problem servers is on the beta test monitor in Zurich and tests flawlessly from there, across half the globe.

The issue is either at Phyber, Carksys, NTT, or Level3.

nightman · August 21, 2018, 5:33pm

Didn’t know about the beta. Thanks!

VladimirKhramov · August 23, 2018, 11:34pm

Load on the server is low. In the monitor logs I see many “i/o timeout” errors.

ask · August 23, 2018, 11:35pm

I would like to add a second production monitor (and either monitor from both or let server operators choose?) – however, I don’t have a reliable, lightly loaded server with a high quality time source nearby in Europe or Asia that’d be appropriate for this. :-/

debbiep · August 24, 2018, 1:37am

I’m getting similar behaviour with the LA monitor here with a Raspberry Pi Stratum 1 (GPS with PPS) server at my day job in Melbourne, Victoria, Australia (ntp.polyfoam.com.au). I changed ISPs (from TPG to Telstra) a week or so ago and haven’t had a good score from the LA monitor since. The beta Zurich monitor is giving me a score of 19 or higher.

Our new ISP connection is 100 Mbit symmetrical, and the Pi is under very low load. A tcpdump on the Pi shows the NTP request arriving and my Pi responding to it. Tracerouting the LA monitor turns to stars here:

12  te0-1-0-0.410.r04.lax02.as7012.net (38.88.197.82)  160.970 ms  160.997 ms  161.077 ms
13  te7-4.r02.lax2.phyber.com (207.171.30.62)  160.870 ms  161.154 ms  161.010 ms
14  * * *

Those times (160 ms or so) are typical for crossing the Pacific.
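When comparing traces from several vantage points, it can help to pull out the last hop that still answered. A minimal sketch, assuming classic `traceroute` text output on stdin; the function name `last_responding_hop` is made up for illustration. A hop that shows stars has not necessarily dropped NTP traffic, since many routers simply do not answer traceroute probes, so treat the result as a hint only.

```shell
# Print the number and name of the last hop that replied in traceroute output.
# Reads traceroute text on stdin. A hop whose first probe shows "*" is
# treated as silent, which is a simplification.
last_responding_hop() {
  awk '
    $1 ~ /^[0-9]+$/ && $2 != "*" { hop = $1; host = $2 }
    END { if (hop != "") print hop, host }
  '
}
```

Usage would be something like `traceroute <monitor address> | last_responding_hop`. Fed the three hops quoted above, it prints `13 te7-4.r02.lax2.phyber.com`.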

VladimirKhramov · August 24, 2018, 5:53am

Statistics for 2 days at https://web.beta.grundclock.com. Zurich (3 samples) score 20, Los Angeles, CA score 3.1

robertcope · August 25, 2018, 8:08pm

Just a thought, but if we assume that most of these scores are a result of bad network or whatever and not actually bad servers, maybe lowering the penalty from 5 to 2 or 3 would make sense? The idea being to let the good servers recover and not get pulled from the pool as often as they are right now.
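To get a feel for what a smaller penalty would change, here is a rough sketch. It assumes each check updates the score as `new = old * 0.95 + step`, where the step for a timeout is the negative penalty, and that a server drops out of rotation below a score of 10. The decay factor and the threshold are assumptions about the scoring, not something stated in this thread, and the function name is made up.

```shell
# Count consecutive timeouts needed to push a score below a threshold.
# ASSUMPTION: each timeout updates the score as old * 0.95 - penalty.
# usage: timeouts_until_below START_SCORE PENALTY THRESHOLD
timeouts_until_below() {
  awk -v score="$1" -v penalty="$2" -v threshold="$3" 'BEGIN {
    n = 0
    # Cap the loop so an unreachable threshold cannot spin forever.
    while (score >= threshold && n < 1000) {
      score = score * 0.95 - penalty
      n++
    }
    print n
  }'
}
```

Under those assumptions, a server sitting at 20 falls below 10 after 2 consecutive timeouts with a penalty of 5, after 3 with a penalty of 3, and after 4 with a penalty of 2.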

studentmain · August 27, 2018, 1:51pm

Hi, I just ran some tests in China; the results and code are here. The server is in south-west China, with no stratum 0 time reference and 500 Mbps upload/download bandwidth. Because of that, I only tried to monitor availability, not accuracy. You can use Excel or whatever you like to analyze the data.

I’m now sure that the monitor is the reason the cn zone doesn’t have enough servers. I’m trying to get a more stable server to use as a monitoring station, but I’m not sure how stable it will be.

I also have some ideas about the monitoring system. The accuracy of the monitoring station’s clock may not be a problem: once we have more data, we can use statistical techniques to find servers with bad time. Then we could set up a distributed monitoring system.
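One way the statistical idea could look, as a sketch only: if a station measures many servers, an error in the station's own clock shifts every offset by about the same amount, so comparing each server against the median offset cancels that shift. The input format (one `name offset_in_seconds` pair per line), the function name, and the choice of the median are all assumptions made for illustration.

```shell
# Flag servers whose offset is far from the median of all measured offsets.
# Reads "name offset" lines on stdin; prints the names of the outliers.
# usage: flag_outliers LIMIT_SECONDS
flag_outliers() {
  LC_ALL=C sort -k2,2n | awk -v limit="$1" '
    { name[NR] = $1; off[NR] = $2 }
    END {
      if (NR == 0) exit
      # Input is sorted by offset, so the median sits in the middle.
      median = (NR % 2) ? off[(NR + 1) / 2] : (off[NR / 2] + off[NR / 2 + 1]) / 2
      for (i = 1; i <= NR; i++) {
        d = off[i] - median
        if (d < 0) d = -d
        if (d > limit) print name[i]
      }
    }'
}
```

For example, with offsets of 0.010, 0.012, -0.008 and 0.450 seconds and a limit of 0.1, only the last server is flagged.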

nightman · August 30, 2018, 3:49pm

It looks like my score has stabilized significantly. No timeout errors in about 36 hours, score of 20 for almost 24 hours now. Was there a change made with the Los Angeles monitoring server?

NoahMcNallie · August 31, 2018, 2:04am

My problem seems to be fixed. It was caused by the clock source on the Gentoo VPS reverting to the default refined-jiffies instead of TSC, due to a security mislabel introduced by the SELinux installation. That made the offset too large and left the server unable to keep accurate time, which I think is what made the score fluctuate. Today I climbed from -90 to -5, and I expect tick.NJ.logiplex.net to be back in use tomorrow. It will be interesting to see whether I get any more fluctuation.
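For anyone wanting to check for the same thing, a minimal sketch, assuming a Linux host that exposes the clock source under sysfs. The function name and the `CLOCKSOURCE_DIR` override are made up for illustration.

```shell
# Default sysfs location on Linux; overridable so the check can be pointed elsewhere.
CLOCKSOURCE_DIR="${CLOCKSOURCE_DIR:-/sys/devices/system/clocksource/clocksource0}"

# Report whether the kernel's current clock source matches the wanted one.
# usage: check_clocksource tsc
check_clocksource() {
  wanted="$1"
  current="$(cat "$CLOCKSOURCE_DIR/current_clocksource")" || return 2
  if [ "$current" = "$wanted" ]; then
    echo "ok: clocksource is $current"
  else
    echo "warning: clocksource is $current, expected $wanted"
    return 1
  fi
}
```

Running `check_clocksource tsc` from cron or a monitoring agent would catch a silent fallback to refined-jiffies. The file `available_clocksource` in the same directory lists what the kernel is willing to use.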

robertcope · August 31, 2018, 12:08pm

There may have been an improvement; the score for my server in Dallas has consistently risen during the last 12 hours. But my Singapore servers are still showing terrible results. I’ve spent time instrumenting them all and am fairly sure it is not an issue on my side at this point.

( ntp2 and ntp4 are in Singapore, ntp3 is in Dallas )

avij · August 31, 2018, 7:50pm

Well, I’m not sure your .sg servers are working properly, at least that ntp4 server over IPv4. Here are some test results from another .sg host:

ntp2 (ok):

$ ntpdate -qu ntp2.bytestacker.com
server 2400:8901::f03c:91ff:fe01:da06, stratum 2, offset 0.086176, delay 0.21443
server 139.162.22.237, stratum 2, offset -0.007517, delay 0.02676
31 Aug 19:43:54 ntpdate[27841]: adjust time server 139.162.22.237 offset -0.007517 sec

ntp4:

$ ntpdate -qu ntp4.bytestacker.com
server 2001:19f0:4400:61e1:5400:1ff:fea4:8d90, stratum 2, offset 0.085316, delay 0.20190
server 149.28.156.244, stratum 0, offset 0.000000, delay 0.00000
31 Aug 19:44:13 ntpdate[27842]: adjust time server 2001:19f0:4400:61e1:5400:1ff:fea4:8d90 offset 0.085316 sec

$ ntpdate -qu ntp4.bytestacker.com
server 2001:19f0:4400:61e1:5400:1ff:fea4:8d90, stratum 2, offset 0.084876, delay 0.20076
server 149.28.156.244, stratum 0, offset 0.000000, delay 0.00000
31 Aug 19:44:44 ntpdate[27867]: adjust time server 2001:19f0:4400:61e1:5400:1ff:fea4:8d90 offset 0.084876 sec

$ ntpdate -qu ntp4.bytestacker.com
server 2001:19f0:4400:61e1:5400:1ff:fea4:8d90, stratum 2, offset 0.082369, delay 0.19537
server 149.28.156.244, stratum 0, offset 0.000000, delay 0.00000
31 Aug 19:45:14 ntpdate[27898]: adjust time server 2001:19f0:4400:61e1:5400:1ff:fea4:8d90 offset 0.082369 sec

$ ntpdate -qu ntp4.bytestacker.com
server 2001:19f0:4400:61e1:5400:1ff:fea4:8d90, stratum 2, offset 0.090107, delay 0.21078
server 149.28.156.244, stratum 0, offset 0.000000, delay 0.00000
31 Aug 19:45:30 ntpdate[28007]: adjust time server 2001:19f0:4400:61e1:5400:1ff:fea4:8d90 offset 0.090107 sec

$ ntpdate -qu ntp4.bytestacker.com
server 2001:19f0:4400:61e1:5400:1ff:fea4:8d90, stratum 2, offset 0.087315, delay 0.20488
server 149.28.156.244, stratum 0, offset 0.000000, delay 0.00000
31 Aug 19:46:00 ntpdate[28008]: adjust time server 2001:19f0:4400:61e1:5400:1ff:fea4:8d90 offset 0.087315 sec

$ ntpdate -qu 149.28.156.244
server 149.28.156.244, stratum 0, offset 0.000000, delay 0.00000
31 Aug 19:48:34 ntpdate[28016]: no server suitable for synchronization found
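The `stratum 0, offset 0.000000, delay 0.00000` lines appear to be what ntpdate prints for an address that never answered, which matches the final "no server suitable" message. A small sketch for picking those addresses out of `ntpdate -q` output, assuming the output format shown above; the function name is made up.

```shell
# Print the addresses that ntpdate reported as "stratum 0", i.e. no usable reply.
# Reads `ntpdate -q` output on stdin.
unresponsive_servers() {
  awk '$1 == "server" && $3 == "stratum" && $4 == "0," {
    sub(/,$/, "", $2)   # drop the trailing comma after the address
    print $2
  }'
}
```

Usage would be along the lines of `ntpdate -qu ntp4.bytestacker.com | unresponsive_servers`, which for the runs above would print `149.28.156.244`.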

robertcope · August 31, 2018, 8:20pm

That’s interesting, and thanks! I’m currently working on how to monitor them from .sg. My current monitoring setup has a main node in Dallas with agents on each machine. From localhost, everything is great and consistent. From the main node in Dallas, the .sg machines both look pretty awful. I guess I need to build a proxy node to monitor from somewhere in .sg, preferably at a third provider, maybe AWS.

(1 is good, 0 is bad; ntp3 is in Dallas as well)

robertcope · September 1, 2018, 3:31am

I spent some time setting up a proxy in Singapore this evening. I used an EC2 instance as it is neutral to the two vendors I’m using for ntp2 and ntp4. The proxy is running two tests, a simple NTP query that is built into Zabbix and an external script that runs ntpdate. Ironically, in my testing, ntp4 is stable and ntp2 is not:

From localhost, of course, all the servers continue to be stable.

avij · September 1, 2018, 5:22am

Well, yes, monitoring from localhost is not particularly useful. You will get more truthful results if you monitor your NTP servers from a different AS.

Here are some additional statistics as measured from .fi:
(edit: see the next messages for better links)

These show the packet loss, so 0 is good. From here, ntp2 shows some packet loss, while ntp3 and ntp4 are OK.

robertcope · September 1, 2018, 12:23pm

Monitoring from localhost is useful; it establishes that the system is working. In fact, when I started looking into why my servers were not scoring well, localhost would fail. I believe some kernel tuning has fixed that, though I’m still working to prove that, too.

I agree that testing from different networks is good, which is why I moved to a second and now third provider to run servers and tests from.

Thanks for providing more data points!

robertcope · September 1, 2018, 1:16pm

Also, can I ask what you’re using to measure packet loss? I’d love to set up similar monitors.

avij · September 1, 2018, 2:06pm

Sure, monitoring localhost may reveal some problems, but monitoring from elsewhere is much more effective.

I’m using MRTG with some custom scripts. Here’s the one that actually generates the data:

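#!/bin/bash
# Prints an NTP "packet loss" figure once per cycle:
# four queries, 250 added per failed query (0 = none lost, 1000 = all lost).
# Needs bash: {1..4} is not expanded by a plain POSIX sh.

addr=$1

while true
do
	pktloss=0
	for i in {1..4}
	do
		# A query counts as lost if ntpdate found no usable server.
		if ! ntpdate -q -p1 "$addr" | grep -q "adjust time server"
		then
			pktloss=$(( pktloss + 250 ))
		fi
		sleep 2
	done
	echo $pktloss
	# 4 x 2 s + 288 s = 296 s of sleep per cycle, plus the query time.
	sleep 288
done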

Looking at the script I remembered that I should not have used the hostname as the target, because your hosts are dual-stacked. I’ve now set up separate monitoring for each host’s IPv4 and IPv6 addresses and stopped the previous monitoring:

ntp2:
http://orava.miuku.net/stats/ntppacketloss-139.162.22.237.html
http://orava.miuku.net/stats/ntppacketloss-2400_8901__f03c_91ff_fe01_da06.html
ntp3:
http://orava.miuku.net/stats/ntppacketloss-149.28.248.90.html
http://orava.miuku.net/stats/ntppacketloss-2001_19f0_6401_922_5400_1ff_fea4_635b.html
ntp4:
http://orava.miuku.net/stats/ntppacketloss-149.28.156.244.html
http://orava.miuku.net/stats/ntppacketloss-2001_19f0_4400_61e1_5400_1ff_fea4_8d90.html

Those look rather empty at the moment, but they’ll show more data in the next few hours.
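For anyone setting up the same per-address split, a sketch of the two helper steps. It is not the actual setup behind these pages, only something that mirrors the naming visible in the URLs above. The function names are made up, and `list_addrs` assumes a system that has `getent` (glibc); elsewhere `dig` or `host` would be needed instead.

```shell
# List every address (IPv4 and IPv6) a hostname resolves to, one per line.
# ASSUMPTION: getent is available on this system.
list_addrs() {
  getent ahosts "$1" | awk '{ print $1 }' | sort -u
}

# Turn an address into the file-name form seen in the stats URLs above:
# colons become underscores, IPv4 addresses pass through unchanged.
stats_name() {
  printf 'ntppacketloss-%s.html\n' "$(printf '%s' "$1" | tr ':' '_')"
}
```

Something like `for a in $(list_addrs ntp2.bytestacker.com); do stats_name "$a"; done` would then give one target per address, so an IPv4-only failure is not hidden behind a working IPv6 path.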

robertcope · September 3, 2018, 2:18pm

I decided to throw another node up in Singapore at a third provider. This one has been perfect since it was fired up:

I also added monitoring of the ssh service to see what TCP vs UDP looked like. As you would expect, ssh has been perfect among all the hosts.

NoahMcNallie · September 4, 2018, 3:57am

Things are apparently evening out the last 24 hours. Holding firm.
