At this moment I have three systems in the pool: one with both IPv4 and IPv6, and two with IPv6 only. All systems are connected to the same router.
The scores for the IPv6 servers are good: the overall score is a solid 20, and nearly all individual monitor scores are 20 as well. Maybe once or twice a day there is a single lost packet, mostly at monitors far away.
The IPv4 score is, let’s say, more dynamic. There seems to be a fair amount of packet loss. Most of the time the overall score is still at 20, but sometimes it drops to lower levels, and occasionally (briefly) even below 10.
And the strangest thing is: the monitors that are closest to me (both geographically and in terms of network latency) have the worst scores (4.2 and 5.9 at the time of writing). Most other monitors are at 20, but some distant ones are just below 20.
I don’t experience any packet loss with other protocols. I do see a spike in NTP traffic around the top of the hour, but in the CSV log I see that the timeouts are scattered around the clock. I don’t see a correlation with NTP traffic or other traffic through my router.
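For what it’s worth, this is roughly how I checked that the timeouts are spread around the clock rather than clustered at the top of the hour. It assumes the CSV log has a `ts` timestamp column (taken to be UTC) and an `error` column that is non-empty for failed probes; the actual column names may differ, so adjust to the real header.

```
import csv
from collections import Counter
from datetime import datetime

def timeouts_per_hour(path):
    """Count failed probes in the CSV score log, bucketed by hour of day."""
    buckets = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Assumed columns: 'ts' (timestamp, treated as UTC) and 'error'
            # (non-empty when the probe failed); adjust to the actual header.
            if not row.get("error"):
                continue
            ts = datetime.fromisoformat(row["ts"].replace("Z", "+00:00"))
            buckets[ts.hour] += 1
    return buckets

if __name__ == "__main__":
    for hour, count in sorted(timeouts_per_hour("log_scores.csv").items()):
        print(f"{hour:02d}:00  {count:4d}")
```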
I get approximately 2k NTP requests per second in total. I tried lowering the bandwidth setting to reduce the request rate, but this doesn’t seem to help with the scores of the nearby monitors.
Should I consider this normal and just ignore it?
Hi, what is your current bandwidth setting, and by how much did you decrease it?
Also, after lowering the setting it takes time for the traffic flow to subside.
At this moment, the IPv4 server is at 2 Gbps and the IPv6 servers are at 3 Gbps.
I tried as low as 100 Mbps. But even after several days, there is still some packet loss.
I’ll try again and report back.
You haven’t indicated which server(s) we are talking about (possibly on purpose), but in case that wasn’t deliberate, it would help if you could say which they are.
My guess is that the problem is in your router.
For IPv4 I guess you are using port forwarding/DNAT to a private IP on your network?
Then most likely the NAT table in your router overflows, because the router tries to keep connection state in memory. For NTP over UDP you can try disabling stateful forwarding, since it is not needed.
IPv6 does not have this problem because your machines have their own public IPs, and you have most likely set up just a few firewall rules that don’t need to keep state and don’t tax the router’s CPU and memory much.
This was also my first guess (and I did hit the limits at first), but after increasing the maximum number of states and decreasing the time a state is kept in memory, I can see that the number of states stays far below the limit.
At this moment (with the lowered bandwidth setting) there are around 12k states, while the limit is at 500k. With the previous higher bandwidth setting, the number of states was around 50k IIRC. Still quite low compared to the maximum.
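To make sure I’m not just looking at a quiet moment, I watch the state count over time instead of taking a single snapshot. A minimal sketch of what I run on the router; it parses the “current entries” line of `pfctl -si`, so it assumes OpenBSD’s pfctl output format and needs root:

```
import re
import subprocess
import time

def current_states():
    """Read the current pf state-table size from 'pfctl -si' (needs root)."""
    out = subprocess.run(["pfctl", "-si"], capture_output=True,
                         text=True, check=True).stdout
    match = re.search(r"current entries\s+(\d+)", out)
    return int(match.group(1)) if match else None

if __name__ == "__main__":
    # Log the state count once a minute, so peaks around the top of the hour
    # (or at any other time) show up instead of a single snapshot.
    while True:
        print(time.strftime("%Y-%m-%dT%H:%M:%S"), current_states(), flush=True)
        time.sleep(60)
```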
And yes, the IPv4 server is on a private IPv4 address on my network. Because of NAT I cannot disable the creation of states (state is needed for the translation of packets, at least on OpenBSD).
But even if I am overlooking something at the router (which is very possible), why does the packet loss occur mainly at the two nlams monitors closest to me?
Seeing that a few monitors have problems reaching your server while the other monitors give you a perfect score of 20, I find it unlikely that the problem is in your server or your router. As mlichvar wrote above, it’s probably some sort of routing/peering issue somewhere along the path. I would not be too concerned about this, as long as the overall score generally stays above 10.
The two monitors in the Netherlands have consecutive IP addresses, so they almost certainly sit in the same network and share the same path to you. That might explain why both give you bad scores, compared to a hypothetical, more diverse deployment within the Netherlands.
After about half a day on the lower bandwidth setting, the packet loss reported by the monitors shows no improvement, even though I do get significantly fewer NTP requests.
So I also think it has to do with routing/peering between the NL monitors and my ISP.
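To check where the forward paths to the two monitors run, I compare traceroutes and look at how far they overlap. A rough sketch; the monitor addresses below are placeholders, not the real ones, and the return path can of course differ from what traceroute shows:

```
import subprocess

# Placeholder addresses for the two nlams monitors; substitute the real ones.
MONITORS = {"nlams1": "192.0.2.10", "nlams2": "192.0.2.11"}

def hops(addr):
    """Run a numeric traceroute and return each hop's responding address ('*' if none)."""
    out = subprocess.run(["traceroute", "-n", addr],
                         capture_output=True, text=True).stdout
    path = []
    for line in out.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():   # hop lines start with the hop number
            path.append(fields[1] if len(fields) > 1 else "*")
    return path

if __name__ == "__main__":
    for name, addr in MONITORS.items():
        print(f"{name}: {' -> '.join(hops(addr))}")
```

If both traces follow the same route up to the peering point with the monitors’ network, that at least fits the peering theory.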
I will leave it as it is for now and will turn up the bandwidth again soon.