Servers in India zone are unstable

Hi guys

I’m running 3 NTP servers, two of which were added two weeks ago in the India zone. The third one was added two days ago. Here are a few things that I noticed:

  1. The two servers (let’s say A and B) I added earlier show a sawtooth pattern in the monitoring graph.
  2. The third server (C), despite being in India (but autoconfigured to the US zone), is very stable and its score is only increasing.

Servers A and C are on Oracle Cloud. I run my own projects on them and, on the side, these NTP servers. The projects consume very few resources, so a resource crunch is not an issue.

Server B is a Raspberry Pi 4B at my home. It has a static IPv6 address, and local monitoring shows that 100% of requests were responded to; however, it too shows a sawtooth pattern in the pool's monitoring.

I’m not able to understand the cause of this. I switched from ntpd to chrony, but there is still no difference.

Is this an issue on the monitoring end, i.e. are zones in India/Asia not being monitored correctly? After all, two otherwise identical servers in the same location, with the same configuration but in different zones, are showing different monitoring results.
Or are the servers in the India zone actually becoming unstable for some other reason?

Also, two different servers in two different locations and on different networks, but in the same zone, show the same sawtooth graph.

Any help or a pointer in the right direction for debugging is appreciated :slight_smile:

It happens to many IPv4 servers.

@Ask is working on a new monitoring system to minimise this.

There’s nothing you can do or change to make it better.

It’s just UDP timeouts that happen along the way, and as IPv6 routing is faster, it happens far less there.

Therefore a new system is in the making, with more than one monitor and a better score-handling system.

Bas.

I’m working with Rishabh to collect traceroute, etc. This situation differs from the many others we’ve seen, as it affects IPv6.

1 Like

Hi guys

I have some new findings.

I asked the team to move server C (from above) to the India/Asia zone. Remember, server C had been working flawlessly, with a perfect score of 20 and no dips whatsoever.

As soon as the zone was changed to India, I saw a slight dip within a few hours, followed by the result shown below.
[screenshot: server C’s monitoring score graph after the zone change]

I got an email saying that this server was removed from the pool, which makes sense.

I hope this new finding will help in narrowing down the problem.

This may be caused by an unstable connection from Oracle India to the NTP Pool monitoring station in San Jose (as you know, some routes are more congested than others). So even if your server has a stable connection to all other destinations, if its connection to the San Jose monitor is unstable, the score will show a sawtooth pattern.

P.S. The monitoring timeout is only 500 ms rather than the usual 3000 ms or 5000 ms, so when you test network stability by sending ping packets from your NTP server to the San Jose monitoring station, remember to set the timeout to 500 ms.
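
If it helps, here is a rough Python sketch of the same idea done over NTP itself rather than ICMP: it fires a handful of bare SNTP client queries and counts how many miss the 500 ms window. The target address below is a placeholder, so substitute whichever server or monitor you want to probe.

```python
#!/usr/bin/env python3
# Sketch: send a few SNTP client queries with a 500 ms timeout,
# mirroring the monitor's limit. TARGET is a placeholder address.
import socket
import struct
import time

TARGET = "203.0.113.10"   # placeholder -- replace with the host you want to test
TIMEOUT = 0.5             # seconds, i.e. the monitor's 500 ms limit
ATTEMPTS = 20

# Minimal 48-byte NTPv4 client packet: LI=0, VN=4, Mode=3, everything else zero.
request = struct.pack("!B47x", (0 << 6) | (4 << 3) | 3)

late = 0
for i in range(ATTEMPTS):
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(TIMEOUT)
        start = time.monotonic()
        try:
            sock.sendto(request, (TARGET, 123))
            sock.recvfrom(48)
            print(f"query {i}: reply in {(time.monotonic() - start) * 1000:.1f} ms")
        except socket.timeout:
            late += 1
            print(f"query {i}: no reply within {TIMEOUT * 1000:.0f} ms")
    time.sleep(1)

print(f"{late}/{ATTEMPTS} queries missed the {TIMEOUT * 1000:.0f} ms window")
```

Anything on that path that regularly takes longer than about 500 ms would look like a timeout to the monitor, even if the server itself answered every query.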


This is my new NTP server; it has been running for more than a day and its score only increases, which suggests that the monitoring station and the India zone themselves have no problem.

I have 3 servers running at Linode in Mumbai, which are also prone to some flakiness. This affects both IPv4 and IPv6 - i.e. pool.ntp.org: Statistics for 2400:8904::f03c:92ff:fe80:f6ac
Also, routing from there to my home internet connection (Deutsche Telekom, AS3320) is about 5% lossy.
If you need traceroutes or other tests, I can happily help out with that.

@fxxkputin I was running another NTP server from my home, which had the same problem too. It’s not only Oracle; the ISPs Jio and Airtel have the same problems. By the way, can you share the address of the San Jose monitoring station? I’ll share the ping results from all the systems I’ve ever run an NTP server on.

Thanks @lordgurke, but I think I’ll wait a few months until my current VPS subscription expires and we move to a dedicated host. During that time, I’ll also investigate the network issues I may have missed earlier.

Okay, so for the people on Oracle Cloud who are having this issue: it is caused by the conntrack table on the VNIC filling up and dropping packets. You can identify this by looking at the metrics under Compute > Instances > Instance Details > Attached VNICs > VNIC Details.

To overcome this, you have to enable stateless rules in your security list. I essentially have the cloud firewall bypassed in my case, as I use a firewall on the host itself.
Here’s what my setup looks like:
Ingress rules: [screenshot]

Egress rules (these must match when a stateless firewall is used): [screenshot]

This took longer than I’d like to admit to figure out, but it solved my issue entirely.
Also, don’t forget to increase the conntrack table size on the host itself; I’ve seen mine go as high as 1,000,000 entries on my busiest server.
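
For anyone who wants to keep an eye on this from the host side, here is a minimal sketch (Linux only, assuming the nf_conntrack module is loaded; the 80% warning threshold is an arbitrary choice) that compares the live conntrack entry count against the configured maximum:

```python
#!/usr/bin/env python3
# Sketch: report how full the kernel's conntrack table currently is.
# These /proc paths are the standard ones exposed when nf_conntrack is loaded.

def read_int(path: str) -> int:
    with open(path) as f:
        return int(f.read().strip())

count = read_int("/proc/sys/net/netfilter/nf_conntrack_count")
maximum = read_int("/proc/sys/net/netfilter/nf_conntrack_max")

print(f"conntrack entries: {count} / {maximum} ({count / maximum:.1%} full)")
if count > 0.8 * maximum:
    # A persistent increase would go through sysctl
    # (net.netfilter.nf_conntrack_max), sized for your own traffic.
    print("warning: table is over 80% full -- consider raising nf_conntrack_max")
```

This only covers the host’s own table; drops on the VNIC side still show up in the OCI metrics mentioned above.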

1 Like