Adding New Monitor

Hello,

We currently run a number of dual-stack NTP servers.
However, we regularly see issues in the monitoring of 2 of these.
As most of the monitoring servers are in the US/EU, how can we sponsor a monitoring server in the country, closer to the servers?
There are 33 active servers in the country zone (17 of which are active on IPv6), so this should reduce the latency (from ~180 ms to the EU) and improve monitoring for the country zone.

We are happy to sponsor a monitor server in the country zone as this would be highly beneficial to all.
Please let me know how this can be achieved.

Thank you,

Hello ncryptd, and welcome to the forum!

When you say “sponsor”, would you be willing to host a monitor yourself, or do you want to compensate someone for running one?

Hello ncryptd,

Welcome! In which country are you thinking of setting up a monitoring server?

Yeah, we are missing the Country/Zone in here :stuck_out_tongue:

Hi All,

Thank you for the welcome!

This will be for South Africa. The ZA zone.
I’m happy to host a server in our network.

I’m not sure if we will add more operators to the monitoring beta, as the group (on the forum) is closed.

If that remains the case, you are welcome to create a VM and I will set it up with my account and the associated packages.

@ask any comments?
@ncryptd Nothing personal, but you are new on this board; could you perhaps share the account you are using on the pool?

Here it is

pool.ntp.org: ntp@ncryptd's pool servers // 1q66sz9

Long time ntp hoster / lurker… first time poster…


Thanks, I see.
If you want a quick fix, you can DM me; otherwise we need to wait for ask.

Yes, it would be a good idea to have a monitor in Africa.

However, before blaming monitors it’s always a good idea to make sure your own setup is fine. For example, let’s have a look at pool.ntp.org: Statistics for 102.222.156.150 and its CSV log. Parsing that log gives me the following results:

$ grep -v -e recentmedian -e "i/o timeout" -e ts_epoch -e INIT -e "connection refused" scores.txt | cut -d, -f3 | cut -c1-5 | sort | uniq -c
      6 -0.02
      5 -0.03
      6 -0.04
     29 -0.05
     51 -0.06
    821 -0.07
   3224 -0.08
    122 -0.09
     69 -0.10
      1 -0.12
      3 -0.13

In other words, the majority of the monitor nodes seem to think this server is 70-80 ms off. Do note that a long RTT in itself is not a problem; it only becomes a problem if the network path is highly asymmetrical. Off the top of my head, I think the monitors consider a server that is < 75 ms off to be acceptable. If the offset is more than 75 ms, the score will be reduced. I think this threshold is fairly generous.

As for why most monitors think the offset is 80ish ms, maybe the network path from this server to most of the world is indeed that asymmetrical. Another possibility is that this server’s clock is indeed around 80 ms off. One plausible scenario might be that this server in .za is forced to sync its clock to the “head office” time server in .nl, and the network path between those points is asymmetrical. It would be a good idea to check the time sources in use, and use a handful of local (<20 ms away) time sources instead of time sources on another continent.
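To illustrate that last point, a minimal sketch of the config change (the same server lines work in both ntp.conf and chrony.conf; the hostnames are placeholders for whatever nearby stratum 1/2 servers you pick, not actual recommendations):

# a handful of nearby (<20 ms) sources instead of intercontinental ones
server ntp1.example.za iburst
server ntp2.example.za iburst
server ntp3.example.za iburst
server ntp4.example.za iburst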


Addendum: Based on some traceroutes, I think the asymmetrical routing is the reason for the troubles. I’m sorry to say this, but your network might not be the best home for a monitoring node. While it would certainly work fine for monitoring your own pool NTP servers, and probably for a large number of African pool NTP servers as well (depending on how interconnected African countries are), I find it likely that it would show the same 80 ms offset for non-African pool NTP servers. You can try this yourself by running “ntpdate -qu ntp.example.org” or “chronyd -Q -t 3 ‘server ntp.example.org maxsamples 1’” from a host in your South Africa datacenter, replacing ntp.example.org with some other pool IP address like ntp.miuku.net (one of mine).
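For reference, the same two checks spelled out, with my reading of the flags (a sketch; ntp.miuku.net is just the example target mentioned above, and neither command touches the local clock):

# classic ntpdate: query only (-q) from an unprivileged source port (-u), prints offset and delay
$ ntpdate -qu ntp.miuku.net
# one-shot chrony measurement: -Q measures without setting the clock, -t 3 gives up after 3 seconds
$ chronyd -Q -t 3 'server ntp.miuku.net maxsamples 1'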

On the other hand, if there was a monitoring server somewhere else in South Africa at some other ISP, it would likely work fine for monitoring all South African pool NTP servers and it would probably work reasonably well for monitoring all pool NTP servers worldwide.


Your stratum 2 server typically uses 129.134.27.123 (time5.facebook.com) as a synchronization source. The Facebook NTP servers use anycast; I don’t know whether that is a factor. The root delay and root dispersion are quite high.

102.222.156.150

Try synchronizing to a stratum 1 with a lower RTT to your NTP server.
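One quick way to compare candidates before switching is the classic ntpdate query, whose “delay” column is the round-trip time to each server (a sketch; the hostnames are placeholders):

$ ntpdate -q stratum1-a.example.net stratum1-b.example.net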

Thanks @stevesommars.
Both servers have been updated to a stratum 1 with lower RTT.

It appears to have reduced the offset.
Will continue to monitor.

This helped, but there are still issues:

The second graph shows that the apparent offset (as seen from Germany) changed by about 80 ms, which is probably good. The third graph shows, however, that the root dispersion is erratic. The bottom graph shows that the stratum is usually 2 (as expected), but sometimes it is 0 or 3. When the stratum is 0, NTP clients will not synchronize to the server.

When the stratum is 0, the reference ID is INIT. When the stratum is 2, the reference ID bounces between five external stratum 1 NTP servers.

I don’t know what the cause is.
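One way to catch those intermittent stratum-0 responses from the outside is simply to repeat one-shot queries against the server (a sketch; the 8-second interval is arbitrary):

$ while true; do ntpdate -qu 102.222.156.150; sleep 8; done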


I take back my previous statement about your network not being suitable for a monitor. With the source server changes the offset seems to be fine.

As for the INIT problem mentioned above, I’m seeing it as well:

$ ntpdate -qu 102.222.156.150
ntpdig: 102.222.156.150: Response dropped: stratum 0, probable KOD packet
ntpdig: no eligible servers

This was from my Finnish NTP server (95.217.203.53). I don’t think there should have been a reason to deny this request, because I hadn’t queried your server at all in the last 12 hours. Many of the previous queries worked fine, though.


Could this be, e.g., chronyd switching between sources? I.e., temporarily not having a majority of agreeing sources, e.g., due to packet drops (as still seen from the monitors towards some servers, though the reverse direction should be affected similarly)?

I faintly remember having seen that myself before, and it being mentioned/hinted at in the context of some other issues discussed in this forum. And the servers “bouncing” between upstream servers seems to hint at the source selection not being very stable.

If it is indeed chronyd running on those servers, the logs should give a hint as to what is going on (I don’t recall off the top of my head how verbose ntpd is about the topic of source selection changes).
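For example, with chronyd the selected-source changes end up in syslog, so something like this shows how often it switches (a sketch; the unit name varies by distribution, e.g. chrony vs. chronyd):

$ journalctl -u chrony | grep -i "selected source"
$ chronyc sources -v        # current state and reachability of each source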

Looks like the INIT kiss code is not only occurring when the server is switching from one source to another, but sometimes even while sticking to the same source (or it’s switching at an even higher rate than the artificially high poll rate used for troubleshooting):

18:11:36.186495 IP (tos 0x0, ttl 53, id 1263, offset 0, flags [DF], proto UDP (17), length 76)
    102.222.156.150.123 > 192.168.178.173.33418: NTPv4, Server, length 48
        Leap indicator:  (0), Stratum 2 (secondary reference), poll 3 (8s), precision -24
        Root Delay: 0.001831, Root dispersion: 0.441558, Reference-ID: 0xc415bb02
          Reference Timestamp:  3922272562.527557385 (2024-04-16T16:09:22Z)
          Originator Timestamp: 3922272696.000077563 (2024-04-16T16:11:36Z)
          Receive Timestamp:    3922272696.095104358 (2024-04-16T16:11:36Z)
          Transmit Timestamp:   3922272696.095131051 (2024-04-16T16:11:36Z)
            Originator - Receive Timestamp:  +0.095026794
            Originator - Transmit Timestamp: +0.095053488
18:11:44.183670 IP (tos 0x0, ttl 53, id 2673, offset 0, flags [DF], proto UDP (17), length 76)
    102.222.156.150.123 > 192.168.178.173.33418: NTPv4, Server, length 48
        Leap indicator: clock unsynchronized (192), Stratum 0 (unspecified), poll 3 (8s), precision -24
        Root Delay: 0.000000, Root dispersion: 0.000030, Reference-ID: INIT
          Reference Timestamp:  0.000000000
          Originator Timestamp: 3922272704.000087325 (2024-04-16T16:11:44Z)
          Receive Timestamp:    3922272704.092050840 (2024-04-16T16:11:44Z)
          Transmit Timestamp:   3922272704.092084201 (2024-04-16T16:11:44Z)
            Originator - Receive Timestamp:  +0.091963514
            Originator - Transmit Timestamp: +0.091996875
18:11:52.184618 IP (tos 0x0, ttl 53, id 2734, offset 0, flags [DF], proto UDP (17), length 76)
    102.222.156.150.123 > 192.168.178.173.33418: NTPv4, Server, length 48
        Leap indicator:  (0), Stratum 2 (secondary reference), poll 3 (8s), precision -24
        Root Delay: 0.001861, Root dispersion: 0.938903, Reference-ID: 0xc415bb02
          Reference Timestamp:  3922272708.702318002 (2024-04-16T16:11:48Z)
          Originator Timestamp: 3922272712.000113408 (2024-04-16T16:11:52Z)
          Receive Timestamp:    3922272712.093305824 (2024-04-16T16:11:52Z)
          Transmit Timestamp:   3922272712.093336686 (2024-04-16T16:11:52Z)
            Originator - Receive Timestamp:  +0.093192415
            Originator - Transmit Timestamp: +0.093223277

The server is switching sources every few minutes and sending an INIT kiss code at about the same rate, resulting in a 4-point score drop on a receiving monitor. Packet loss, on the other hand, doesn’t seem to be more of an issue than in other cases.

Why does the server keep switching sources at this frequency?


It’s weird. Looking at my logs, the server seems to be switching between four close stratum 1 servers. The root delay is smaller than a few milliseconds.

It seems to be running ntpd. The large jumps in root dispersion are consistent with the ntpd clock filter using 16 seconds for missing samples.


In that case the operator might choose to make it easier for us to help debug the issue by allowing ntpq queries, at least temporarily. This is not necessarily insecure, depending on your definition. That said, it would allow folks to potentially DoS the system, as mode 6 queries are not rate-limited, and it would expose details like the OS, the peer IP addresses, and their sys.peer IPv4 addresses via the refid.

To open ntpd up to remote ntpq queries, ensure the restrict default line does not have noquery and you’ll probably want to add nomrulist.
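A sketch of what that could look like in ntp.conf (the other flags are just a common baseline, keep whatever you already have; the point is only that noquery is absent and nomrulist is present):

# allow read-only remote ntpq (mode 6) queries, but block config changes and MRU list dumps
restrict default kod limited nomodify nopeer nomrulist
restrict -6 default kod limited nomodify nopeer nomrulist

After restarting ntpd, “ntpq -p 102.222.156.150” from a remote host should then show the peer list.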


A local and global, ping-based, outside-in view of connectivity doesn’t show any significant general packet loss.

The inside-out view, especially an NTP-based one, might obviously be different, though. I.e., if missing responses from the sources are an issue, NTP-specific packet filtering/rate limiting towards/from/by the sources is a candidate to take a closer look at.
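Two quick ways to get at that inside-out view from the server itself (a sketch; the interface name and the source address are placeholders):

# per-source reachability register: octal, 377 means the last 8 polls were all answered
$ ntpq -p
# watch queries to one source and count the replies that actually come back
$ tcpdump -ni eth0 udp port 123 and host 192.0.2.1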