Score/network woes

#1

So my servers, like apparently many others, have been suffering from bouncing scores quite regularly for the last several weeks. My pair of pool servers is hosted on pretty decent circuits (multiple, with BGP), and I’ve never seen problems like this persist beyond tiny blips. Also of note, the servers are both in Northern California. I also ran MTR over the last 24 hours and saw 0% loss to ntplax7 with a 1ms standard deviation, yet my scores are currently ~5.

I’ve put one of them into the new monitoring system, and it’s seeing similar issues (and scores), but only from the Los Angeles monitoring station.

So, Ask, have you explored the possibility that Phyber (or a direct upstream of theirs) is having issues? That seems most likely to me at this point.


#2

For reference:
https://web.beta.grundclock.com/scores/198.169.208.142/log?limit=200&monitor=*


#3

Yeah, there’ve been some other similar threads, and my servers are affected as well. I wonder what constitutes a “failure” or red dot on the graphs … does that imply a complete failure to connect, some out-of-spec data point (e.g. network latency, offset, etc.), a combination of factors, or something else?


#4

As far as I know, a big red dot can mean a network error (the monitor was unable to retrieve the time from the NTP server in question). It’s UDP, after all.

A red dot (small or big) can also mean the time is wrong, i.e., offset by more than some threshold (I think it’s around 100ms) from the monitoring server’s point of view.

There is also a small orange dot for when the time is not quite right, but not off by enough to mark it red.


Below the graph is a link “What do the graphs mean?”

The Score graph
A couple of times an hour the pool system checks the time from your server and compares it to the local time. Points are deducted if the server can’t be reached or if the time offset is more than 100ms (as measured from the monitoring systems). More points are deducted the bigger the offset is.

The graph is only meant as a tool to visualize trends. For more exact details of what the monitoring system found you can click on the CSV link.

The Offset graph
The monitoring system works roughly like an SNTP (RFC 2030) client, so it is more susceptible to random network latency between the server and the monitoring system than a regular ntpd server would be.

The monitoring system can be inaccurate by as much as 10ms or more.
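
To make the offset check concrete, here is a rough sketch in Go of the RFC 2030 clock-offset formula with the ~100ms threshold applied. This is only an illustration based on the description above, not the pool’s actual code, and the sample timestamps are made up.

package main

import (
	"fmt"
	"time"
)

// sntpOffset is the RFC 2030 clock-offset formula:
//   offset = ((T2 - T1) + (T3 - T4)) / 2
// T1 = client send time, T2 = server receive time,
// T3 = server transmit time, T4 = client receive time.
func sntpOffset(t1, t2, t3, t4 time.Time) time.Duration {
	return (t2.Sub(t1) + t3.Sub(t4)) / 2
}

func main() {
	// Made-up timestamps, purely to exercise the formula.
	t1 := time.Now()
	t2 := t1.Add(45 * time.Millisecond) // server clock at receive
	t3 := t2.Add(1 * time.Millisecond)  // server clock at transmit
	t4 := t1.Add(60 * time.Millisecond) // client clock at receive

	off := sntpOffset(t1, t2, t3, t4)
	fmt.Printf("offset: %v\n", off)

	// Per the description above, an offset beyond roughly 100ms costs points.
	if off > 100*time.Millisecond || off < -100*time.Millisecond {
		fmt.Println("offset out of range; points would be deducted")
	}
}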


#5

Bringing this back up to the top because it was never resolved. It’s been “good enough” for the last month (i.e., scores bounce around but stay in the pool), but for the last 2 days my servers have been completely removed from the pool with horrible scores.

The symptoms are exactly the same. The lines are fine: no packet loss, minimal jitter. It looks like only burst samples are failing, and only from the LA monitoring station (Phyber); the beta test site, halfway around the world, is seeing 100% success.

My servers are a few hundred miles from Phyber and traverse a pretty simple path to get there (Level3 -> NTT -> Phyber), all within California. The problem appears limited to NTP and/or Phyber.

The alerting is getting pretty annoying, both mine and the pool’s. Either there’s a hardware failure, a configuration error, or someone is rate limiting NTP at or near Phyber.


#6

Well, my server’s score was a stable 20 for months on end ever since I joined the pool many years ago, and in the beginning it was just on a low-speed connection. About 3 years ago the score became erratic, so I tried different server machines, different operating systems, different routers, and even a direct connection with no router or firewall; all showed exactly the same behavior. So I figured it must be either a network issue or an effect of the scoring system, neither of which I can do anything about.
Since the new monitoring started recently, there are even more i/o timeouts, and the score now stays below zero most of the time.
I was already removed automatically by the beta pool system, and I’m thinking I might as well remove myself from the main pool too.
My current server score can be seen at www.ntppool.org/s/134


#7

Sorry to hear about that. My server has been at a steady score of 20 for years and usually swings only between +1 and -1 ms. I posted in another thread suggesting contacting RIPE Atlas about using their thousands of globally located probes for more distributed monitoring of NTP servers; I don’t know what ever came of that.


#9

I’m having the same issue. I see in the CSV logs that it’s some kind of network timeout. I am confident the problem is not on my end. My issues started around the same time the infrastructure was migrated to a new setup, per https://ntppool.statuspage.io/


#10

This issue definitely persists. Scores reach into the teens and sometimes plummet below zero, more so on IPv4 than on IPv6. Scores are all over the place. It’s been this way for weeks, although I had thought it was getting better.


#11

[screenshot: score graph]

Mine started dipping a few days ago, I haven’t changed any config and I’m getting no alerts from my monitoring systems. Load on the server is not high.


#12

I can add that mine is on IPv4 as I don’t have IPv6.
Also, as can be seen from the link to my scores in my earlier post, all the drops are due to “i/o timeout” errors and have nothing to do with server performance.
Does anyone know what constitutes a timeout for the monitoring system?


#13

Having the same issue as everyone else in this thread. I’ve been running a server for many months with a score of 20, with only very minor (1-2 point) dips on occasion. For the past 1-2 months my score has been all over the place, frequently dipping below the threshold to actually be used in the pool. Looking at the monitor logs, I have many “i/o timeout” errors. Hopefully someone with some insight into the monitoring system can chime in about what is going on.


#14

It’s hardcoded to 3 seconds.
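
For illustration, here is a minimal Go sketch (not the monitor’s actual code) of a UDP read running into a 3-second deadline. The target address is just a placeholder that never answers; the point is that when the deadline expires, Go reports the error as “i/o timeout”, which matches the wording in the CSV logs.

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// 192.0.2.1 is a TEST-NET-1 address that should never answer, so the
	// read below will normally run into the deadline.
	conn, err := net.Dial("udp", "192.0.2.1:123")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Minimal 48-byte NTP client request: LI 0, version 3, mode 3.
	req := make([]byte, 48)
	req[0] = 0x1B
	if _, err := conn.Write(req); err != nil {
		panic(err)
	}

	// The 3-second value matches the timeout mentioned above.
	conn.SetReadDeadline(time.Now().Add(3 * time.Second))

	buf := make([]byte, 48)
	if _, err := conn.Read(buf); err != nil {
		fmt.Println(err) // prints something like "read udp ...: i/o timeout"
	}
}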


#15

Getting worse…

[screenshot: score graph]


#16

I’ve been getting the same timeout errors from the monitoring stations as others have. I started looking more deeply at my own servers, and what I found is that I get timeouts making requests pretty regularly. I have not had much success at troubleshooting the issue, though.

Here is an strace of a simple example: ntpstat timing out. This is a relatively unloaded server; all it’s running is ntpd and sshd. It’s a fresh CentOS 7 build.

This is a server hosted at Linode, so it’s a KVM guest.


execve("/bin/ntpstat", ["ntpstat"], [/* 18 vars */]) = 0
brk(NULL) = 0x5622839fc000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f18ef28a000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=17881, ...}) = 0
mmap(NULL, 17881, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f18ef285000
close(3) = 0
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P%\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2173512, ...}) = 0
mmap(NULL, 3981792, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f18eec9d000
mprotect(0x7f18eee60000, 2093056, PROT_NONE) = 0
mmap(0x7f18ef05f000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1c2000) = 0x7f18ef05f000
mmap(0x7f18ef065000, 16864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f18ef065000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f18ef284000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f18ef282000
arch_prctl(ARCH_SET_FS, 0x7f18ef282740) = 0
mprotect(0x7f18ef05f000, 16384, PROT_READ) = 0
stat("/etc/sysconfig/64bit_strstr_via_64bit_strstr_sse2_unaligned", 0x7ffec0c0a330) = -1 ENOENT (No such file or directory)
mprotect(0x562281af7000, 4096, PROT_READ) = 0
mprotect(0x7f18ef28b000, 4096, PROT_READ) = 0
munmap(0x7f18ef285000, 17881) = 0
socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(123), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
sendto(3, "\26\2\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 576, 0, NULL, 0) = 576
select(4, [3], NULL, NULL, {1, 0}) = 0 (Timeout)
write(2, “timeout\n”, 8) = 8
exit_group(2) = ?
+++ exited with 2 +++
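
For what it’s worth, the trace shows ntpstat sending an NTP mode-6 control query to 127.0.0.1:123 and select() giving up after its 1-second window. As a separate test (my own idea, not what ntpstat or the pool monitor does), a plain mode-3 client query like the Go sketch below can help check whether ntpd answers ordinary time requests locally at all:

package main

import (
	"encoding/binary"
	"fmt"
	"net"
	"time"
)

func main() {
	conn, err := net.Dial("udp", "127.0.0.1:123")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Ordinary 48-byte client request: LI 0, version 3, mode 3.
	// (ntpstat in the trace above used a mode-6 control query instead.)
	req := make([]byte, 48)
	req[0] = 0x1B
	if _, err := conn.Write(req); err != nil {
		panic(err)
	}

	// Allow a bit longer than ntpstat's 1-second select() window.
	conn.SetReadDeadline(time.Now().Add(3 * time.Second))

	resp := make([]byte, 48)
	if _, err := conn.Read(resp); err != nil {
		fmt.Println("no reply:", err)
		return
	}

	// The transmit timestamp's seconds are in bytes 40-43, counted from 1900.
	const ntpToUnix = 2208988800 // seconds between the 1900 and 1970 epochs
	secs := binary.BigEndian.Uint32(resp[40:44])
	fmt.Println("ntpd answered; server time:", time.Unix(int64(secs)-ntpToUnix, 0).UTC())
}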


#17

The overall level of servers being “healthy” seems pretty constant, so either whatever is happening isn’t enough to drop the score that low or it’s just the regular “internet noise” moving through the networks … (I wish I had more time to look into it, but it’s a little down the list of priorities as long as most servers are mostly “healthy enough”).

The network configuration in Los Angeles did change a few weeks ago, so it’s possible that the local firewall is causing the timeouts. :expressionless: