I’m running 3 NTP servers, two of which were added two weeks ago in the India zone. The third one was added two days ago. Here are a few things that I noticed:
The two servers (let’s say A and B) I added earlier show a sawtooth pattern in the monitoring graph.
Servers A and C are on Oracle Cloud. I run my projects on them and, on the side, these NTP servers. The projects consume very few resources, so a resource crunch is not an issue.
Server B is a Raspberry Pi 4B in my home. It has a static IPv6 address, and local monitoring shows that 100% of requests were answered; however, it too shows a sawtooth pattern in the pool monitoring.
I can't figure out the cause of this. I switched from ntpd to chrony, but it made no difference.
Is this an issue on the monitoring end, i.e. are servers in the India/Asia zones not being monitored correctly? After all, two identical servers in the same location with the same configuration, but in different zones, show different monitoring results.
Or are the servers actually becoming unstable in the India zone for some reason?
Also, two different servers in different locations and on different networks, but in the same zone, show the same sawtooth graph.
Any help or pointers in the right direction for debugging would be appreciated.
I asked the team to move server C (from above) to the India, Asia zone. Remember, server C had been working flawlessly: a perfect score of 20, without a single dip.
As soon as the zone was changed to India, I saw a slight dip within a few hours, followed by the result shown below.
I got an email saying that this server is removed from the pool, which makes sense.
I hope this new finding will help in narrowing down the problem.
This may be caused by an unstable connection from Oracle India to the NTP Pool monitoring station in San Jose (as you know, some routes are more congested than others). Even if you have stable connections to every other destination, if your server's connection to the San Jose monitor is unstable, the score will definitely show a sawtooth pattern.
P.S. The monitoring timeout is only 500 ms, rather than the usual 3000 or 5000 ms, so when you test network stability by sending ping packets from your NTP server toward the San Jose monitoring station, remember to set the timeout to 500 ms.
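As a rough way to check whether replies make it back within that 500 ms window, here is a minimal SNTP probe sketch in Python. The monitoring station's address isn't given in this thread, so the target host in the example is a placeholder you would replace; the 0.5 s default mirrors the monitor's timeout:

```python
import socket
import time

def sntp_rtt(server, timeout=0.5):
    """Send one SNTP client request (mode 3, version 4) and return the
    round-trip time in seconds, or None if no reply arrives in time."""
    packet = b'\x23' + 47 * b'\x00'  # LI=0, VN=4, Mode=3; rest zeroed
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        t0 = time.monotonic()
        sock.sendto(packet, (server, 123))
        sock.recvfrom(48)  # a valid SNTP reply is 48 bytes
        return time.monotonic() - t0
    except OSError:  # covers timeouts and unreachable networks
        return None
    finally:
        sock.close()

# 192.0.2.1 is an unroutable documentation address (RFC 5737),
# so this probe never gets an answer and returns None.
print(sntp_rtt("192.0.2.1", timeout=0.2))
```

Running this in a loop against the monitor's address would show whether the occasional reply drifts past 500 ms even when plain ping looks healthy.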
This is my new NTP server; it has been running for more than a day and its score only goes up, which suggests the monitoring station and the India zone themselves have no problem.
I have 3 servers running at Linode in Mumbai, which are also prone to some flakiness. This affects both IPv4 and IPv6; see, e.g., pool.ntp.org: Statistics for 2400:8904::f03c:92ff:fe80:f6ac
Also, routing from there to my home internet connection (Deutsche Telekom, AS3320) shows about 5% packet loss.
If you need traceroutes or other tests, I can happily help out with that.
@fxxkputin I was running another NTP server from my home, which had the same problem. It's not only Oracle: the ISPs Jio and Airtel show the same problems. By the way, can you share the address of the San Jose monitoring station? I'll share ping results from all the systems I've ever run an NTP server on.
Thanks @lordgurke, but I think I'll wait a few months until my current VPS subscription expires and we move to a dedicated host. In the meantime, I'll also investigate the network issues I may have missed earlier.
Okay, for the people on Oracle Cloud who are having this issue: it is caused by the conntrack table on the VNIC filling up and dropping packets. You can identify this by looking at the metrics under Compute > Instances > Instance Details > Attached VNICs > VNIC Details
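On the host side (as opposed to the VNIC metrics), you can check how close you are to the kernel's own conntrack limit by comparing its two counters. A minimal sketch, assuming a Linux host with the nf_conntrack module loaded; the `proc_root` parameter is only there so the function can be pointed at a test directory:

```python
from pathlib import Path

def conntrack_usage(proc_root="/proc/sys/net/netfilter"):
    """Return (current, maximum, percent_used) from the kernel's
    connection-tracking counters, or None if they are unavailable."""
    try:
        current = int(Path(proc_root, "nf_conntrack_count").read_text())
        maximum = int(Path(proc_root, "nf_conntrack_max").read_text())
    except (OSError, ValueError):
        return None
    return current, maximum, 100.0 * current / maximum

usage = conntrack_usage()
if usage:
    print("conntrack: %d / %d entries (%.1f%% used)" % usage)
```

If the percentage sits near 100 during busy periods, dropped NTP packets (and a sawtooth score) are exactly what you'd expect.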
To overcome this you have to enable stateless rules in your security list. I have connection tracking essentially bypassed in my case, since I use a firewall on the host itself.
Here’s what my setup looks like:
Ingress Rules
This took longer than I'd like to admit to figure out, but it solved my issue entirely.
Also, don't forget to increase the conntrack table size on the host itself; I've seen mine go as high as 1,000,000 entries on my busiest server.
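For reference, raising that limit is a sysctl change; a sketch, assuming a Linux host with nf_conntrack loaded (pick a value that fits your RAM, since each tracked connection costs a few hundred bytes of kernel memory, and the file name under /etc/sysctl.d/ is just an example):

```shell
# Check the current usage against the limit
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

# Raise the limit for the running kernel (not persistent across reboots)
sudo sysctl -w net.netfilter.nf_conntrack_max=1048576

# Persist it
echo 'net.netfilter.nf_conntrack_max = 1048576' | sudo tee /etc/sysctl.d/90-conntrack.conf
```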