Score/network woes

monitoring

#1

So my servers, like apparently many others’, have been suffering from bouncing scores quite regularly for the last several weeks. My pair of pool servers are hosted on pretty decent circuits (multiple carriers, BGP) and I’ve never seen problems like this persist beyond tiny blips. Also of note, the servers are both in Northern California. I also ran an MTR for the last 24 hours and have 0% loss to ntplax7 and 1 ms standard deviation, yet my scores are currently ~5.

I’ve put one of them into the new monitoring and it’s seeing similar issues (and scores) but only from the Los Angeles monitoring station.

So, Ask, have you explored the possibility that Phyber is having issues (or a direct upstream of them)? That seems most likely to me at this point.


#2

For reference:
https://web.beta.grundclock.com/scores/198.169.208.142/log?limit=200&monitor=*


#3

Yeah, there’ve been some other similar threads, and my servers are affected as well. I wonder what constitutes a “failure” or red dot on the graphs … does that imply a complete failure to connect, some out-of-spec data point (e.g. network latency, offset, etc.), a combination of factors, or something else?


#4

As far as I know a red dot (big) can mean a network error (unable to retrieve the time from the NTP server in question). It’s UDP, after all.

A red dot (small/big) can also mean wrong time (offset) by some threshold (I think it’s around 100ms) from the monitoring server’s point of view.

There is also an orange dot (small) if the time is not quite right but not enough to mark it red.


Below the graph is a link “What do the graphs mean?”

The Score graph
A couple of times an hour the pool system checks the time from your server and compares it to the local time. Points are deducted if the server can’t be reached or if the time offset is more than 100ms (as measured from the monitoring systems). More points are deducted the bigger the offset is.

The graph is only meant as a tool to visualize trends. For more exact details of what the monitoring system found you can click on the CSV link.

The Offset graph
The monitoring system works roughly like an SNTP (RFC 2030) client, so it is more susceptible to random network latency between the server and the monitoring system than a regular ntpd client would be.

The monitoring system itself can be inaccurate by as much as 10ms or more.
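As an aside, the steady score of 20 that healthy servers sit at is what you’d expect from an exponentially weighted score. A toy sketch, assuming an update rule of the form new = old × 0.95 + step with +1 for a good check (the 0.95 decay matches the commonly cited pool behavior; the −5 step for a failed check is an illustrative guess, not from the docs):

```python
# Toy model of an exponentially weighted pool score.
# Assumed: new_score = old_score * 0.95 + step, with step = +1 for a
# good check; the -5 step for a failed check is an illustrative guess.

def update(score, step):
    return score * 0.95 + step

score = 0.0
for _ in range(200):         # a long run of good checks
    score = update(score, +1)
print(round(score, 1))       # converges to 1 / (1 - 0.95) = 20.0

for _ in range(5):           # a short burst of timeouts
    score = update(score, -5)
print(round(score, 1))       # already well below zero
```

The fixed point of x = 0.95x + 1 is 20, which is why a healthy server’s score flatlines there, and why even a handful of consecutive timeouts is enough to fall below the roughly 10-point threshold for staying in the pool.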


#5

Bringing this back up to the top because it never resolved. It’s been “good enough” for the last month (i.e. scores bounce around but not enough to drop out of the pool), but for the last 2 days my servers have been completely removed from the pool with horrible scores.

The symptoms are exactly the same. The lines are fine: no packet loss, minimal jitter, and it looks like only bursts of samples are failing, only from the LA monitoring station (Phyber). The beta test site, halfway around the world, is seeing 100% success.

My servers are a few hundred miles from Phyber and traverse a pretty simple path to get there (Level3 -> NTT -> Phyber) all within California. The problem appears limited to NTP and/or Phyber.

The alerting is getting pretty annoying, both mine and the pool’s. Either there’s a hardware failure or a configuration error somewhere, or someone is rate-limiting NTP close to or at Phyber.


#6

Well, my server score was a stable 20 for months on end after I joined the pool many years ago, back when it was just on a low-speed connection. About 3 years ago the score became erratic, so I tried different server machines, different operating systems, different routers, and even a direct connection with no router or firewall; all gave the same behavior, so I supposed it must be either a network issue or a scoring-system effect, neither of which I can do anything about.
Since the new monitoring started recently, there are even more i/o timeouts and the score remains below zero most of the time now.
I already was removed automatically by the beta pool system and I am thinking I might as well remove myself from the main pool too.
My current server score can be seen at www.ntppool.org/s/134


#7

Sorry to hear about that. My server has been at a steady 20 score for years, usually swinging between +1 / −1 ms only. I posted in another thread suggesting contacting RIPE Atlas about using their thousands of globally distributed probes for more distributed monitoring of NTP servers; I don’t know what ever came of that.


#9

I’m having the same issue. I see in the CSV logs that it’s some kind of network timeout. I am confident that it is not my issue. My issues started happening around the same time they migrated the infrastructure to a new setup per https://ntppool.statuspage.io/


#10

This issue definitely persists. Scores reach into the teens and sometimes plummet below zero, more so on IPv4 than IPv6. Scores are all over the place. It’s been this way for weeks, although I had thought it had been getting better.


#11

image

Mine started dipping a few days ago, I haven’t changed any config and I’m getting no alerts from my monitoring systems. Load on the server is not high.


#12

I can add that mine is on IPv4 as I don’t have IPv6.
Also, as can be seen from the link to my scores in my earlier post, all the drops are due to “i/o timeout” and nothing to do with server performance.
Does anyone know what constitutes a timeout for the monitoring system?


#13

Having the same issue as everyone else in this thread. Been running a server for many months with a score of 20, only having very minor (1-2 points) dips on occasion. For the past 1-2 months my score has been all over the place, frequently dipping below the threshold to actually be used in the pool. Looking at the monitor logs I have many “i/o timeout” errors. Hopefully someone with some insight into the monitoring system can chime in regarding what is going on.


#14

It’s hardcoded at 3 seconds.
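For context, each check is essentially a single SNTP-style UDP exchange with that 3-second deadline. A rough sketch (not the pool’s actual code; the example hostname and the deadline are placeholders):

```python
# Sketch of a single SNTP-style check (RFC 2030/4330).
# Missing the receive deadline is what shows up as
# "i/o timeout" in the CSV logs.
import socket
import struct

NTP_EPOCH_OFFSET = 2208988800  # seconds from 1900 (NTP era) to 1970 (Unix)

def build_request():
    # First byte 0x23: LI=0, VN=4, Mode=3 (client); rest of the
    # 48-byte header is zeroed.
    return b'\x23' + 47 * b'\x00'

def parse_transmit_time(reply):
    # Bytes 40-47 hold the server's transmit timestamp as
    # 32.32 fixed point, big-endian.
    secs, frac = struct.unpack('!II', reply[40:48])
    return secs - NTP_EPOCH_OFFSET + frac / 2**32

def sntp_query(host, timeout=3.0):
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)  # socket.timeout raised here = "i/o timeout"
        s.sendto(build_request(), (host, 123))
        reply, _ = s.recvfrom(512)
    return parse_transmit_time(reply)

# Example (one-shot; a real client would also correct for round-trip delay):
# import time
# print(sntp_query('0.pool.ntp.org') - time.time())
```

A single-shot probe like this takes the raw offset at face value, which is also why the offset graph is noisier than what a continuously disciplining ntpd would report.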


#15

Getting worse…

image


#16

I’ve been getting the same timeout errors from the monitoring stations as others have. I started looking more deeply at my own servers, and what I found is that I get timeouts making requests pretty regularly. I have not had much success at troubleshooting the issue, though.

Here is an strace of a simple example, ntpstat timing out. This is a relatively unloaded server; all it’s running is ntpd and sshd. It’s a fresh CentOS 7 build.

This is a server hosted at Linode, so a KVM guest.


execve("/bin/ntpstat", ["ntpstat"], [/* 18 vars */]) = 0
brk(NULL) = 0x5622839fc000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f18ef28a000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=17881, ...}) = 0
mmap(NULL, 17881, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f18ef285000
close(3) = 0
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P%\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2173512, ...}) = 0
mmap(NULL, 3981792, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f18eec9d000
mprotect(0x7f18eee60000, 2093056, PROT_NONE) = 0
mmap(0x7f18ef05f000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1c2000) = 0x7f18ef05f000
mmap(0x7f18ef065000, 16864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f18ef065000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f18ef284000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f18ef282000
arch_prctl(ARCH_SET_FS, 0x7f18ef282740) = 0
mprotect(0x7f18ef05f000, 16384, PROT_READ) = 0
stat("/etc/sysconfig/64bit_strstr_via_64bit_strstr_sse2_unaligned", 0x7ffec0c0a330) = -1 ENOENT (No such file or directory)
mprotect(0x562281af7000, 4096, PROT_READ) = 0
mprotect(0x7f18ef28b000, 4096, PROT_READ) = 0
munmap(0x7f18ef285000, 17881) = 0
socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(123), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
sendto(3, "\26\2\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 576, 0, NULL, 0) = 576
select(4, [3], NULL, NULL, {1, 0}) = 0 (Timeout)
write(2, "timeout\n", 8) = 8
exit_group(2) = ?
+++ exited with 2 +++
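The trace shows ntpstat sending a mode-6 (control) request to ntpd on 127.0.0.1:123 and giving up when the one-second select() window expires. A rough way to reproduce that probe by hand (a sketch; the port and timeout are parameters so it can be pointed anywhere):

```python
# Reproduce ntpstat's local probe: send an NTP mode-6 (control)
# request and see whether ntpd answers before the deadline.
import socket

def ntpd_responds(host='127.0.0.1', port=123, timeout=1.0):
    # 0x16 = LI 0, VN 2, Mode 6 (NTP control); 0x02 = read-variables
    # opcode, sequence number 1 -- the same leading bytes ("\26\2\0\1")
    # seen in the sendto() above.
    req = b'\x16\x02\x00\x01' + b'\x00' * 8
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(req, (host, port))
            s.recvfrom(512)       # any reply at all counts as "up"
            return True
        except socket.timeout:    # mirrors ntpstat's select() expiring
            return False
```

If this returns False against the local ntpd while the daemon is running, the problem is on the host itself (e.g. ntpd too busy to answer, or control queries restricted) rather than the network path to the monitor.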


#17

The overall level of servers being “healthy” seems pretty constant, so either whatever is happening isn’t enough to drop the score that low or it’s just the regular “internet noise” moving through the networks … (I wish I had more time to look into it, but it’s a little down the list of priorities as long as most servers are mostly “healthy enough”).

The network configuration in Los Angeles did change a few weeks ago, so it’s possible that the local firewall is causing the timeouts. :expressionless:


#18

One thing I’ve noticed is that my IPv6 scores are fairly consistent while my IPv4 scores are not. They’re the same machines (two machines, four addresses), so that’s kind of weird. Whether that’s an issue on my end or not, I have no idea.


#19

After trying to chase down my server issues and not making much progress, I decided the issue is likely on my provider’s side and not with my servers. I built a couple of new servers at a new provider that mimic the ones I already had.

So far, testing is showing that they are much more reliable. I have an EC2 instance that is polling all the servers with ntpdate and recording the results. ntp1 and ntp2 are the old servers (at Linode), ntp3 and ntp4 are the new servers (at Vultr):

[ec2-user@ip-172-30-0-163 ~]$ for H in 1 2 3 4; do echo "-- ntp${H}"; cat ntp${H}.bytestacker.com.log |grep strat|tail -10; done
-- ntp1
server 69.164.201.245, stratum 2, offset 0.060460, delay 0.18123
server 69.164.201.245, stratum 2, offset 0.003616, delay 0.08075
server 69.164.201.245, stratum 0, offset 0.000000, delay 0.00000
server 69.164.201.245, stratum 0, offset 0.000000, delay 0.00000
server 69.164.201.245, stratum 0, offset 0.000000, delay 0.00000
server 69.164.201.245, stratum 0, offset 0.000000, delay 0.00000
server 69.164.201.245, stratum 2, offset -0.011584, delay 0.08273
server 69.164.201.245, stratum 0, offset 0.000000, delay 0.00000
server 69.164.201.245, stratum 2, offset -0.000726, delay 0.05666
server 69.164.201.245, stratum 2, offset 0.001277, delay 0.06134
-- ntp2
server 172.104.187.12, stratum 0, offset 0.000000, delay 0.00000
server 172.104.187.12, stratum 0, offset 0.000000, delay 0.00000
server 172.104.187.12, stratum 2, offset -0.000296, delay 0.27011
server 172.104.187.12, stratum 2, offset -0.003444, delay 0.26508
server 172.104.187.12, stratum 0, offset 0.000000, delay 0.00000
server 172.104.187.12, stratum 2, offset 0.001656, delay 0.26199
server 172.104.187.12, stratum 0, offset 0.000000, delay 0.00000
server 172.104.187.12, stratum 2, offset -0.001426, delay 0.25827
server 172.104.187.12, stratum 0, offset 0.000000, delay 0.00000
server 172.104.187.12, stratum 0, offset 0.000000, delay 0.00000
-- ntp3
server 149.28.248.90, stratum 2, offset -0.001238, delay 0.05667
server 149.28.248.90, stratum 2, offset 0.000518, delay 0.05803
server 149.28.248.90, stratum 2, offset -0.001013, delay 0.05566
server 149.28.248.90, stratum 2, offset 0.000547, delay 0.05823
server 149.28.248.90, stratum 2, offset 0.000283, delay 0.05843
server 149.28.248.90, stratum 2, offset -0.001173, delay 0.05539
server 149.28.248.90, stratum 2, offset -0.001030, delay 0.05548
server 149.28.248.90, stratum 2, offset 0.000340, delay 0.05827
server 149.28.248.90, stratum 2, offset 0.000332, delay 0.05817
server 149.28.248.90, stratum 2, offset -0.001008, delay 0.05547
-- ntp4
server 149.28.156.244, stratum 2, offset -0.000084, delay 0.25479
server 149.28.156.244, stratum 2, offset -0.001035, delay 0.25525
server 149.28.156.244, stratum 2, offset -0.001262, delay 0.25494
server 149.28.156.244, stratum 2, offset -0.005068, delay 0.26225
server 149.28.156.244, stratum 2, offset -0.001136, delay 0.25523
server 149.28.156.244, stratum 2, offset -0.000106, delay 0.25693
server 149.28.156.244, stratum 2, offset -0.002571, delay 0.25841
server 149.28.156.244, stratum 2, offset -0.001345, delay 0.25487
server 149.28.156.244, stratum 2, offset -0.001028, delay 0.25371
server 149.28.156.244, stratum 2, offset -0.011647, delay 0.27618
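In logs like the ones above, ntpdate reports an unanswered query as stratum 0 with zero offset and delay, so those lines double as a failure counter. A quick sketch for tallying them:

```python
# Count ntpdate successes vs. failures in logs like the ones above:
# "stratum 0, offset 0.000000, delay 0.00000" means no usable reply.

def tally(lines):
    ok = fail = 0
    for line in lines:
        if 'stratum ' not in line:
            continue
        if 'stratum 0,' in line:
            fail += 1
        else:
            ok += 1
    return ok, fail

sample = [
    'server 69.164.201.245, stratum 2, offset 0.060460, delay 0.18123',
    'server 69.164.201.245, stratum 0, offset 0.000000, delay 0.00000',
]
print(tally(sample))  # (1, 1)
```

Run over the ten ntp1 lines above, this counts 5 failures out of 10; over the ntp3 lines, zero.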

The pool monitoring server is showing similar positive results. There was one timeout this morning to ntp4, which is in Shanghai… we’ll see if that trend continues, I guess.

ntp4 hasn’t taken any production traffic yet. We’ll see what happens when it does.


#20

Any way you could manually set the monitoring station to one other than Los Angeles for a day or two? This would help narrow down the source of the issue. My server has been getting about 10 timeout errors per day on average over the past couple of weeks. If it were to be monitored from another location for a day or two and the errors went away (or significantly reduced), we would at least know that the issue exists somewhere in the path between the monitor and the server.


#21

The beta service monitors from Los Angeles and Zurich.

https://web.beta.grundclock.com/
https://manage-beta.grundclock.com/manage

You have to sign up and add your servers separately though.