Our servers suffer massive down-score - and I don't have the slightest clue why


#1

Hi

We’ve been running a pair of NTP servers for the pool for several years now. One of them carries a high load (2 Gb setting, ~6 Mbps average); the second is lightly loaded (and is basically meant to take over should #1 fail).

These boxes have been up 24/7 for several years now, in a fully redundant data center with pretty good connectivity, and with standard operational monitoring.

So imagine my surprise when Ask’s robot emailed me this morning telling me that Server #1 had been removed from the pool because of a low score. At first I assumed the box had crashed, but it hadn’t. It was up (> 1000 days uptime right now), and ntpd was up and serving. Synchronization was OK too: it was synchronized to a DCF77 Stratum 1 box ~200 km north, and had a GPS-based Stratum 1 ~200 km southeast as a candidate. Both of those Stratum 1s are reliable Meinberg boxes.

And it was reported to have a score of -13.3. When I checked the other box, it was also scored pretty badly, at plus 11.something (and has since fallen to 4.8).

Being completely out of ideas, I added another external Stratum 1 source and restarted ntpd on both boxes. And while box 1 is now very slowly creeping back to the 0 line, box 2 has since fallen way below the acceptability threshold…
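To get a feel for why the recovery looks so slow, here is a rough sketch of a decaying-average score. It is only an illustration under stated assumptions: the update rule `score * 0.95 + step`, the step of +1 per good check, and the threshold of 10 are assumptions of this sketch, not something confirmed in this thread. The only cross-check is that such a rule levels off at 20, which matches the scores of 20 reported further down.

```python
def update_score(score: float, step: float, decay: float = 0.95) -> float:
    """One monitoring check: decay the old score, then add the step.

    The decay factor and step values are assumptions of this sketch.
    """
    return score * decay + step


def checks_until(score: float, target: float, step: float = 1.0,
                 decay: float = 0.95, limit: int = 10_000) -> int:
    """Count consecutive good checks needed to lift `score` to `target`."""
    n = 0
    while score < target and n < limit:
        score = update_score(score, step, decay)
        n += 1
    return n


# Hypothetical usage: how many good checks from -13.3 up to an assumed
# threshold of 10?
# print(checks_until(-13.3, 10.0))
```

With these assumed numbers, climbing from -13.3 back above 10 takes 24 consecutive good checks, and any single failed check in between sets the score back again.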

Another thing I see is that they are monitored from the US West Coast. Both of these boxes are located in Central Europe (Frankfurt, Germany). Could it be that what we are seeing here are US West Coast connectivity problems, not problems with my boxes?

See for yourself:
http://www.pool.ntp.org/scores/195.50.171.101
http://www.pool.ntp.org/scores/195.50.171.102

Any idea? What can I do?


#2

Seems like a temporary network issue. Right now your NTP server is available worldwide - https://atlas.ripe.net/measurements/12444930/#!probes
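If you want to repeat that kind of reachability check from a host of your own, a minimal sketch could look like the one below. It sends a single client-mode NTP packet over IPv4 and decodes a few header fields of the reply. The function names and the 2-second timeout are made up for this example, and this is not how the pool's monitor works; it only shows whether a server answers from where you run it.

```python
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def build_request() -> bytes:
    """48-byte client request: LI=0, VN=4, Mode=3 (client), rest zero."""
    return b"\x23" + 47 * b"\x00"


def parse_response(packet: bytes) -> dict:
    """Decode a few header fields and the transmit timestamp of a reply."""
    if len(packet) < 48:
        raise ValueError("short NTP packet")
    first, stratum = packet[0], packet[1]
    tx_sec, tx_frac = struct.unpack("!II", packet[40:48])
    return {
        "leap": first >> 6,
        "version": (first >> 3) & 0x7,
        "mode": first & 0x7,          # 4 means "server"
        "stratum": stratum,
        "transmit_unix": tx_sec - NTP_EPOCH_OFFSET + tx_frac / 2**32,
    }


def query(host: str, timeout: float = 2.0) -> dict:
    """Send one request over IPv4/UDP port 123; raises on timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        sock.sendto(build_request(), (host, 123))
        data, _ = sock.recvfrom(512)
        rtt = time.monotonic() - start
    info = parse_response(data)
    info["rtt_seconds"] = rtt
    return info


# Hypothetical usage, run from the vantage point you want to test:
# print(query("195.50.171.101"))
```

A timeout here only tells you that the path from that one vantage point is broken; running it from several locations gives a picture closer to the Atlas measurement above.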


#3

Perhaps it’s related to this earlier issue?


#4

Very much so, yes. My server* is connected by IPv4 and (native) IPv6, so it obviously holds exactly the same time on both, yet the two score graphs regularly differ.

* UK, dedicated machine.

#5

It started again. The box fell to 4.6 this afternoon and is now slowly climbing back (now at 8.7).
Is the Californian monitoring station affected by weekend traffic overload?

What is their IP?
What is their connectivity? Carrier? AS? Anything?

And why is there no monitoring from Europe?


#6

I have 3 servers on 2 different ISPs here in Denmark, and they are all at 20 and have been so for days, so it is not a general problem.


#7

I am sorry guys, but this issue is still going on and I’m not closer to any solution.

Would somebody please answer my question? What are the IPs of the monitoring systems? What is their connectivity? Carrier?
I need this to have our peering people look into that.


#8

Hello Hedberg,

as your “answer” seems to have effectively smothered any further discussion here, I have to state that this is of course a non sequitur. The internet doesn’t work that way. The fact that one place/ISP in Denmark has good connectivity to the US West Coast monitoring stations has barely any implication for the connectivity of others there, or in neighboring countries.

And I still need the IP addresses of the monitoring stations.


#9

You can find the address of the LA monitoring station in this thread:
https://community.ntppool.org/t/problems-with-the-los-angeles-ipv4-monitoring-station

Other stations are running in the new beta pool:
https://web.beta.grundclock.com


#10

Also, network debugging information here: https://dev.ntppool.org/monitoring/network-debugging/

And yes, the beta site has more monitors and more details in the monitoring logs.