Additional monitoring servers (help wanted)

I really hope this gets done. All my servers are dropping out as soon as they score up. There is nothing wrong with them that they can’t handle, though.

Noah

All my servers are dropping out as soon as they score up. There is nothing wrong with them that they can’t handle though.

If they are only dropping out when they get added to the rotation (score >= 10), then the issue is unlikely to be monitoring, and there is probably something actually wrong on your end. If monitoring were the problem, the score would also keep dropping while it is below 10.

If the issue is not with the server, then I’d recommend taking a look at the network. In particular, check that connection tracking (and by implication NAT) is disabled for NTP packets, in both directions. The server is not the only place where NTP traffic can cause load problems. Depending on your network topology, this may need to be configured on more than one router.

Something else to consider: when I initially set up an NTP server on my current ISP, it tripped their DDoS-mitigation rules, and much of the traffic was dropped before it even reached me. This was resolved by simply calling them and having my server’s IP excluded from that monitoring. If you are certain that you have ruled out both your server and your network as potential causes, then talking to your upstream(s) may be a good idea.

It is. Quoting from my server’s monitor log:

1553776099,"2019-03-28 12:28:19",0.035440914,1,-1.9,1,"Los Angeles",0,
1553776099,"2019-03-28 12:28:19",0.035440914,1,-1.9,,,0,
1553775143,"2019-03-28 12:12:23",0,-5,-3.1,1,"Los Angeles",,"i/o timeout"
1553775143,"2019-03-28 12:12:23",0,-5,-3.1,,,,"i/o timeout"
1553774179,"2019-03-28 11:56:19",0.033143745,1,2,1,"Los Angeles",0,
1553774179,"2019-03-28 11:56:19",0.033143745,1,2,,,0,
1553773207,"2019-03-28 11:40:07",0.036694936,1,1.1,1,"Los Angeles",0,
1553773207,"2019-03-28 11:40:07",0.036694936,1,1.1,,,0,
1553772238,"2019-03-28 11:23:58",0.037754256,1,0.1,1,"Los Angeles",0,
1553772238,"2019-03-28 11:23:58",0.037754256,1,0.1,,,0,
1553771333,"2019-03-28 11:08:53",0.032810487,1,-1,1,"Los Angeles",0,
1553771333,"2019-03-28 11:08:53",0.032810487,1,-1,,,0,
1553770236,"2019-03-28 10:50:36",0,-5,-2.1,1,"Los Angeles",,"i/o timeout"
1553770236,"2019-03-28 10:50:36",0,-5,-2.1,,,,"i/o timeout"
1553769058,"2019-03-28 10:30:58",0.036298195,1,3.1,1,"Los Angeles",0,
1553769058,"2019-03-28 10:30:58",0.036298195,1,3.1,,,0,
1553767887,"2019-03-28 10:11:27",0,-5,2.2,1,"Los Angeles",,"i/o timeout"
1553767887,"2019-03-28 10:11:27",0,-5,2.2,,,,"i/o timeout"
1553766671,"2019-03-28 09:51:11",0.026999843,1,7.6,1,"Los Angeles",0,
1553766671,"2019-03-28 09:51:11",0.026999843,1,7.6,,,0,
1553765564,"2019-03-28 09:32:44",0.034973041,1,6.9,1,"Los Angeles",0,
1553765564,"2019-03-28 09:32:44",0.034973041,1,6.9,,,0,
1553764393,"2019-03-28 09:13:13",0.033346673,1,6.2,1,"Los Angeles",0,
1553764393,"2019-03-28 09:13:13",0.033346673,1,6.2,,,0,
1553763277,"2019-03-28 08:54:37",0,-5,5.5,1,"Los Angeles",,"i/o timeout"
1553763277,"2019-03-28 08:54:37",0,-5,5.5,,,,"i/o timeout"
1553762148,"2019-03-28 08:35:48",0.032394437,1,11.1,1,"Los Angeles",0,
1553762148,"2019-03-28 08:35:48",0.032394437,1,11.1,,,0,
1553760980,"2019-03-28 08:16:20",0.028899206,1,10.6,1,"Los Angeles",0,
1553760980,"2019-03-28 08:16:20",0.028899206,1,10.6,,,0,
1553759770,"2019-03-28 07:56:10",0.034931327,1,10.1,1,"Los Angeles",0,
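
For anyone who wants to put a number on a log like the one above, here is a minimal Python sketch that counts how many probes ended in "i/o timeout". The column names in `FIELDS` are an assumption (the quoted log has no header row), and the sketch assumes each probe is logged twice, once with the monitor name and once without, as the excerpt suggests.

```python
import csv
import io

# Column names are an assumption: the quoted log has no header row.
FIELDS = ["ts_epoch", "ts", "offset", "step", "score",
          "monitor_id", "monitor_name", "leap", "error"]

def timeout_rate(log_text):
    """Return (probes, timeouts, fraction), counting each timestamp once."""
    seen = set()
    probes = timeouts = 0
    for row in csv.reader(io.StringIO(log_text)):
        if not row:
            continue
        # Pad short rows so missing trailing fields read as empty strings.
        rec = dict(zip(FIELDS, row + [""] * (len(FIELDS) - len(row))))
        if rec["ts_epoch"] in seen:
            continue  # second copy of the same probe
        seen.add(rec["ts_epoch"])
        probes += 1
        if rec["error"] == "i/o timeout":
            timeouts += 1
    return probes, timeouts, (timeouts / probes if probes else 0.0)

# First four lines of the log quoted above.
SAMPLE = '''1553776099,"2019-03-28 12:28:19",0.035440914,1,-1.9,1,"Los Angeles",0,
1553776099,"2019-03-28 12:28:19",0.035440914,1,-1.9,,,0,
1553775143,"2019-03-28 12:12:23",0,-5,-3.1,1,"Los Angeles",,"i/o timeout"
1553775143,"2019-03-28 12:12:23",0,-5,-3.1,,,,"i/o timeout"
'''

print(timeout_rate(SAMPLE))  # prints (2, 1, 0.5)
```

Counting the full excerpt by hand gives 4 timeouts in 16 distinct probes, i.e. 25% from the Los Angeles monitor.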

Not everybody has the same issue. NoahMcNallie indicated that their server was only having issues once it was added to the rotation (i.e. score >= 10), which indicates that monitoring isn’t the issue in that case.

In your case, the issue appears to be different, as the monitor is still complaining about it regardless of whether or not the server is part of the pool rotation. This may or may not be a symptom of problems with the monitoring; all it tells us is that the monitor is frequently attempting to probe your server and receiving no reply. There are many possible reasons for that to occur. I note that you have been active in the China thread - be aware that there are issues with monitoring servers behind the Great Firewall, so if your server is inside China that may well be what is happening here. If your server is (or was) in the China zone, regardless of where it is geographically located, the load is also a bit of a special case.

Nope, I am in Taiwan, not behind the GFW, but I have also run into network problems on the American continent, as previously stated. I don’t think a normal America-to-Asia packet should traverse the American continent twice: LAX to ASH to NYC to LAX before leaving US soil.

There are multiple things going on, I think. Both VPSes, the one in China and the one in Canada, are getting quite a few thousand NTP requests a second, even though the China VPS has been set to 384KiBit, and both are falling out. They are serving a purpose though, and it should clear up, I think. My American VPS’s IPv4 score is very low too and looks to have fallen out recently. This one usually stays above. Also, the OVH VPS in Canada has been under DDoS, on and off, since I purchased it. Pretty sure they sold it to me like that :expressionless:

Noah

Not to keep this thread off topic any more than it should be… but montreal.ca.logiplex.net made it into the 11+ score range (11.4 right now) in CN at 100MiBit for the first time that I have noticed. I think this might be a sign of good things to come. These VPSes are configured the same as my other RHEL-family VPS servers (well, Fedora and two CentOS), which is basically `yum update`, iptables and ip6tables rules allowing port 123, a few sysctl parameters, and base ntpd with the upstream servers changed to stratum 1 servers that are local to them. There really is nothing more than this done to them. I don’t use them for anything else, and the other one was doing fine before, without being modified. That one I am kind of confused about, but I guess the Montreal one is suffering from a mixture of the DDoS it has been experiencing and the monitoring station. I’m not claiming beyond doubt that there is nothing … possibly … wrong with them. But I don’t think so. They query fine from around the world. If anything, there would be some sort of rate limiting happening in China and/or at the ISP.

Noah

One last thing: I do have a Sun T2000, which is one of the last machines Sun made before selling to Oracle. It has 64 Gigs of 16-channel DDR2, an eight-core UltraSPARC T1, and an unused 15K RPM SAS drive at 72GiByte. When I can get it up and running would be under discrepancy and it would not be anywhere that would have DDoS protection. I do plan on moving soon to somewhere I could probably have it running on something like a 10/100 D3 line.

Noah

I have the same thing from Sweden, not getting over 0 anymore…
My own external monitoring shows about 0.03% failure, from LA about 30% failure…

1554224910,"2019-04-02 17:08:30",-0.002919288,1,-25.5,1,"Los Angeles",0,
1554224910,"2019-04-02 17:08:30",-0.002919288,1,-25.5,0,
1554223802,"2019-04-02 16:50:02",0.001310674,1,-27.9,1,"Los Angeles",0,
1554223802,"2019-04-02 16:50:02",0.001310674,1,-27.9,0,
1554222744,"2019-04-02 16:32:24",-0.005134751,1,-30.4,1,"Los Angeles",0,
1554222744,"2019-04-02 16:32:24",-0.005134751,1,-30.4,0,
1554221679,"2019-04-02 16:14:39",0,-5,-33.1,1,"Los Angeles","i/o timeout"
1554221679,"2019-04-02 16:14:39",0,-5,-33.1,"i/o timeout"
1554220590,"2019-04-02 15:56:30",0,-5,-29.6,1,"Los Angeles","i/o timeout"
1554220590,"2019-04-02 15:56:30",0,-5,-29.6,"i/o timeout"
1554219516,"2019-04-02 15:38:36",0,-5,-25.9,1,"Los Angeles","i/o timeout"
1554219516,"2019-04-02 15:38:36",0,-5,-25.9,"i/o timeout"
1554218350,"2019-04-02 15:19:10",0,-5,-21.9,1,"Los Angeles","i/o timeout"
1554218350,"2019-04-02 15:19:10",0,-5,-21.9,"i/o timeout"
1554217239,"2019-04-02 15:00:39",-0.002086741,1,-17.8,1,"Los Angeles",0,
1554217239,"2019-04-02 15:00:39",-0.002086741,1,-17.8,0,
1554216104,"2019-04-02 14:41:44",0,-5,-19.8,1,"Los Angeles","i/o timeout"
1554216104,"2019-04-02 14:41:44",0,-5,-19.8,"i/o timeout"
1554214869,"2019-04-02 14:21:09",-0.003405503,1,-15.6,1,"Los Angeles",0,
1554214869,"2019-04-02 14:21:09",-0.003405503,1,-15.6,0,
1554213772,"2019-04-02 14:02:52",0,-5,-17.5,1,"Los Angeles","i/o timeout"
1554213772,"2019-04-02 14:02:52",0,-5,-17.5,"i/o timeout"
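
A side note on why a roughly 30% failure rate keeps a score pinned below zero. The score column in the logs quoted in this thread looks consistent with an update of the form new score = 0.95 × old score + step, with step +1 for a good reply and -5 for a timeout. That rule is an inference from the quoted numbers, not something taken from the pool’s source code, so treat the Python sketch below as an assumption-laden illustration.

```python
DECAY = 0.95  # assumption: inferred from the quoted logs, not from pool code

def next_score(score, step):
    """Apply the assumed update rule once."""
    return DECAY * score + step

def steady_state(failure_rate, ok_step=1.0, fail_step=-5.0):
    """Long-run average score under the assumed rule for a given failure rate."""
    mean_step = (1 - failure_rate) * ok_step + failure_rate * fail_step
    return mean_step / (1 - DECAY)

# (previous score, step, logged score) triples read off the logs in this thread.
observed = [
    (10.1, 1, 10.6),
    (11.1, -5, 5.5),
    (-27.9, 1, -25.5),
    (-29.6, -5, -33.1),
]

for prev, step, logged in observed:
    # The logs are rounded to one decimal, so compare at that precision.
    print(prev, step, round(next_score(prev, step), 1), logged)

print(steady_state(0.30))    # long-run score at 30% timeouts
print(steady_state(1 / 12))  # failure rate at which the long-run score is about 10
```

If the assumed rule holds, a server needs fewer than about 1 in 12 probes to time out for its long-run score to sit at 10 or above, and a 30% timeout rate settles around -16.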

I just tested logiplex.net on KVM with a tsc clocksource, and it differs by 0.001148 seconds from NIST. How accurate are you looking for, Ask?

I could pay for DDoS protection, but they only offer 10Gig last I knew. The VPS is throttled to 2.5 for scheduling purposes. It is Gentoo with a Gentoo 5.0.7 kernel. It is paid up for three years. Full cgroups and SELinux.

Besides just allowing ntp traffic in iptables, you also need to be sure to disable connection tracking.
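
As a rough illustration of the point above, the following Python sketch builds (but does not run) the kind of raw-table rules commonly used to exempt NTP traffic from connection tracking. The function name `notrack_rules` is hypothetical, and the rules assume a Linux host using iptables with the `CT` target available; adapt them to whatever firewall tooling is actually in use.

```python
NTP_PORT = 123

def notrack_rules(binary="iptables"):
    """Build commands that exempt NTP traffic from connection tracking."""
    return [
        # Requests arriving at the server (destination port 123).
        [binary, "-t", "raw", "-A", "PREROUTING", "-p", "udp",
         "--dport", str(NTP_PORT), "-j", "CT", "--notrack"],
        # Replies leaving the server (source port 123).
        [binary, "-t", "raw", "-A", "OUTPUT", "-p", "udp",
         "--sport", str(NTP_PORT), "-j", "CT", "--notrack"],
    ]

# Print the commands for review instead of executing them.
for tool in ("iptables", "ip6tables"):
    for cmd in notrack_rules(tool):
        print(" ".join(cmd))
```

Untracked packets do not match rules that rely on the ESTABLISHED/RELATED connection state, so the filter table needs plain accept rules for UDP port 123 in both directions.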

Willing to host a site here as well.

@ask: how can I participate in running the monitoring servers?
Chris

I’d be happy to run some monitors, if still needed. I have physical servers in Montreal (OVH BHS) and Nuremberg (Hetzner).

Hi Ask,
I think I have found the root of the IPv4 problems…namely loadbalancers.
Have a look here:

The current Newark monitor passes through 2 routers: if you run mtr with UDP packets for some time, you see a second router pop up.

The problem is that UDP traffic trying to pass through the second router is being dropped.

In short, what happens is this: if your NTP watchdog sends a packet from e.g. 10.0.0.10 and it passes router1 at 10.0.0.1, then when it reaches my server the request is answered.
However, the reply to that same IP could try to pass through router2 at 10.0.0.2, and this router isn’t expecting a UDP packet from me, as it never knew your monitor had sent one.
So it doesn’t know the destination and drops the UDP packet.
This only happens with IPv4 when IP binding/load balancing (multiple network connections) is done.

IPv6 doesn’t have this problem, as all servers have their own unique address, so there is no mistake about the destination.
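
The scenario described above can be sketched as a toy model in Python. This only illustrates the hypothesis (two stateful routers that do not share flow state); it is not a claim about how the actual routers in front of the monitor behave. The class `StatefulRouter` and the server address 192.0.2.1 are hypothetical.

```python
class StatefulRouter:
    """Toy router that only lets a reply through if it saw the request."""

    def __init__(self, name):
        self.name = name
        self.flows = set()  # (source, destination) pairs of requests seen

    def forward_request(self, src, dst):
        # Remember the outbound flow so the reply can be matched later.
        self.flows.add((src, dst))
        return True

    def forward_reply(self, src, dst):
        # A reply has source and destination swapped relative to the request.
        return (dst, src) in self.flows

monitor = "10.0.0.10"   # monitor address used in the post above
server = "192.0.2.1"    # hypothetical NTP server address

router1 = StatefulRouter("router1")
router2 = StatefulRouter("router2")

# The request leaves via router1 only.
router1.forward_request(monitor, server)

print(router1.forward_reply(server, monitor))  # True: router1 knows the flow
print(router2.forward_reply(server, monitor))  # False: router2 drops the reply
```

In this model the monitor sees a timeout whenever the reply is steered through the router that never saw the request, even though the server answered correctly.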

I’m not quite sure if this is your problem, but it looks very possible because IPv6 users have no problems but IPv4 users do.

Many Docker and other VM users complain about UDP drops at such hosting providers; it’s not uncommon.

I do know Cinfu.com has very cheap VPSes, and they have not had this problem since they updated their server VM software.
I added 2 of those to my NTP pool and they tick perfectly without problems.
You could ask them to run a monitor for you.

Or ask Packet to route your IP via just 1 router, without IP binding or load balancing; it should be fine then, if I’m correct.

Bas.

@Ask, I have contacted Cinfu and they are willing to supply you with a location for your purpose.
You have more details via email.

I have a Stratum 1 source and a server that keeps good time. Based in Australia.