Erratic scoring by monitoring server (query backlog?)

I've been troubleshooting some connection problems with my servers and noticed that changes to netspeed take a long time to be processed (sometimes multiple hours). Also, the scoring from some monitors seems to be way off (e.g. monitor deksf3-z4ytm9, which still shows a score of 20 for server 2a02:a46d:7ac:1:213:95ff:fe1d:f6b1 24 hours after the server lost its connection).
Server 2a02:a46d:7ac:1:213:95ff:fe1d:f6b1 now shows a score of >15, but no pool traffic appears to be routed to it.

Is this the same problem we saw in early September?
This makes troubleshooting much more difficult.

If I remember correctly, that issue was caused by a backlog in the scoring process and database files growing bigger than anticipated.

Anybody else experiencing this behavior, or is it my account that's being 'punished'? :wink:
@ask Could you please check?

What do you mean by "processed"?

deksf3-z4yatm9 in particular seems broken. It does not report any new measurements and is stuck on the last value it reported. All others seem to work as expected.

The scoring also seems to be consistent with the reported values from the monitors. So the scorer is not the problem here.

Is that still the case? The client distribution statistics on the score page indicate that your server is returned in over 1% of NL pool DNS requests; if those clients don't reach your server, that would be a lot of unhappy clients…

Traffic and scoring of my IPv6 server did not show any anomalies. But I discovered that deksf3-z4yatm9 does not even appear in the monitor list for my server.

Looks like monitoring servers have caught up with the changes now.

Today the problem was very apparent again: due to connection issues my NTP servers lost connectivity. Correctly, the monitoring servers downscored my servers within a few minutes (scores down to -15). However, the traffic to my servers was not reduced at the same time. Around 250 requests/sec keep coming in even now (more than 45 minutes after the servers lost connectivity, and while the score is still around -15). How can this be? What is causing the delay? Shouldn't traffic be adjusted right when the score drops below 10?
@ask Can you please take a look?

Even now (>3 hours after the server dropped out of the pool) I'm still getting pool traffic routed my way! More than 130 req/s!
This used to be fine, but for a few weeks now it has not been.

Does your server still receive high traffic more than 3 hours later? I do not see your server being advertised by the geoDNS servers. I am checking this with the following command:

while :; do for p in a b c d e f g h i ;do dig AAAA 2.nl.pool.ntp.org @$p.ntpns.org. | grep -i 2a02:a46d:7ac:1:;done; sleep 1; done
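(This loops over the nine authoritative name servers a–i.ntpns.org and checks every second whether any of them returns your 2a02:a46d:7ac:1: prefix as an AAAA answer for 2.nl.pool.ntp.org.)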

No output.
It is likely leftover traffic. How many clients are generating the traffic?


I don't know how to tell how many clients are generating the traffic.
I'm using an appliance (Meinberg M300).
I can see >8000 unique addresses in the last 80 seconds or so.
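
(For reference, a rough way to count unique client addresses is to capture NTP traffic for a minute and count distinct source IPs. This is only a sketch and assumes you can run tcpdump somewhere in the traffic path, e.g. on a mirror port of the switch/router in front of the appliance; the interface name eth0 is just a placeholder:)

# capture ~60 seconds of incoming NTP traffic (UDP port 123); adjust the interface name
timeout 60 tcpdump -ni eth0 -w ntp.pcap 'udp dst port 123'
# count distinct client (source) addresses; the sed strips the trailing ".port" from field 3
tcpdump -nr ntp.pcap | awk '{print $3}' | sed 's/\.[^.]*$//' | sort -u | wc -l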

Ask has altered the TTL of the DNS from 140s to 90s.

My servers are starting to act strangely for the monitors too.

It's my belief that his DNS servers can't handle the extra load, or that other DNS servers don't accept the TTL.

Something went wrong in his 140 → 90 sec TTL change.

In my opinion @ask should set a 300 s TTL, not lower.
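
(For what it's worth, you can see the TTL currently being served for the pool names with a plain dig; the second column of the answer is the TTL in seconds. Query one of the authoritative servers, e.g. a.ntpns.org, to see the configured value rather than a resolver's counted-down copy:)

dig +noall +answer AAAA 2.nl.pool.ntp.org @a.ntpns.org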

This is kind of normal behaviour. Many clients resolve the IP address for their NTP service once at startup and then just query that IP directly.

Dropping out of the pool (whether voluntarily or through the monitoring) just means you don't get any new clients, but old ones might query you for a long time.

From pool.ntp.org, "Join the NTP Pool!":

Finally, I must emphasize that joining the pool is a long term commitment. We are happy to take you out of the pool again if your circumstances change, but because of how the ntp clients operate it will take weeks, months or even YEARS before the traffic completely goes away.


@Sebhoster @ask Thanks Seb. I understand and accept that some residual traffic is to be expected.
But I'm talking about something else: after I change the netspeed from 500 Mbit → 3 Gbit, the pool takes many hours to adjust. It literally takes 4-5 hours until the adjusted speed is reflected in the number of requests to my server.
That is not the advertised 10-30 minutes on the server management page!
(The other way round this holds as well.)

FWIW: As an experiment, I doubled the netspeed on a US IPv4 server (below 1 Gbps) and traffic changed within 5-15 minutes, although it didn't double.

A lot to keep up with in this thread!

According to the monitoring there weren't any hiccups on any of the DNS servers with updating the DNS configuration, or with the configuration generation, in the last day.

Usually I can query the raw logs for about 24 hours and I could have checked when the DNS servers stopped giving your IP as an answer, but this weekend part of that system was temporarily disabled.

The long-term tracking only does daily counts currently. I added a new table that'll track the DNS answer counts with 5-minute granularity for about 6 weeks.

If you can reproduce the behavior, let me know and I can lookup in the logs what happened.

Generally speaking, SNTP clients will quickly go away, but NTP clients (ntpd, chronyd, etc.) sometimes won't go away for a long time (recent or future versions will improve on this!).

I haven't seen any indicators from monitoring that the DNS servers aren't generally handling the queries. It's possible that some of them become overloaded briefly for a few seconds at peak moments. (I'll follow up in that thread.)

That's what I'd expect based on how the system works (and what I've seen when I have experimented, too).

Is there any rate limiting in place? After 20 monitoring packets spaced 8 seconds apart, all further packets go unanswered (output of the command
ntpmon 2a02:a46d:7ac:1:ee46:70ff:fe00:9292
from https://github.com/bruncsak/ntpmon):

12-18 08:04 19.43 @@@@@@@@@@@@@@@@@@@@............................................................................................
12-18 08:17 19.46 ................................................................................................................
12-18 08:33 17.85 ................................................................................................................
12-18 08:49 18.06 ................................................................................................................
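
(For context: on a plain ntpd this kind of behavior usually comes from the limited/kod flags on the restrict lines together with the discard directive. I don't know how the Meinberg web interface maps onto these, so the following is only a generic ntp.conf sketch, not the M300's actual configuration:)

# rate-limit thresholds: clients averaging faster than 2^3 = 8 s between packets
# (or sending packets less than 2 s apart) are considered abusive
discard average 3 minimum 2
# "limited" drops packets from clients exceeding those thresholds,
# "kod" additionally sends a Kiss-o'-Death response; remove both to disable rate limiting
restrict default limited kod nomodify notrap nopeer noquery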

@NTPman Sorry mate. I had rate limiting on to control some misbehaving clients. Turned it off for now.
@ask Thanks for looking into it. To test, I've changed the netspeed of 2a02:a46d:7ac:1:213:95ff:fe1d:f6b1 from 500 Mbit to 2 Gbit at 16:10 UTC.
At 16:00 the number of requests was ~100/sec (combined for two servers with netspeed at 500 Mbit).

Thanks @Kets_One – here are the data from the logs showing how many times (in each 5-minute period) that IP was returned by the authoritative DNS servers.

dt q
2023-12-18 15:20:00 830
2023-12-18 15:25:00 825
2023-12-18 15:30:00 816
2023-12-18 15:35:00 787
2023-12-18 15:40:00 810
2023-12-18 15:45:00 840
2023-12-18 15:50:00 885
2023-12-18 15:55:00 827
2023-12-18 16:00:00 880
2023-12-18 16:05:00 799
2023-12-18 16:10:00 926
2023-12-18 16:15:00 2698
2023-12-18 16:20:00 3194
2023-12-18 16:25:00 3148
2023-12-18 16:30:00 3280
2023-12-18 16:35:00 3054
2023-12-18 16:40:00 3043
2023-12-18 16:45:00 3055
2023-12-18 16:50:00 3025
2023-12-18 16:55:00 3142
2023-12-18 17:00:00 3258
2023-12-18 17:05:00 3073
SELECT
    dt,
    sum(queries) AS q
FROM geodns3.by_server_ip_5m
WHERE (ServerIP = '2a02:a46d:7ac:1:213:95ff:fe1d:f6b1') AND (dt > '2023-12-18 15:15:00') AND (dt < '2023-12-18 17:10:00')
GROUP BY dt
ORDER BY dt ASC
FORMAT Markdown

@ask Many thanks for checking this. Looks like the traffic volume increased within <10 minutes. How do I need to interpret your data? Prior to 16:00 my IP was being returned ~800 times per 5-minute period. That translates to roughly 2.7/sec, while I am receiving ~100 req/sec at my server?

Based on my experience (and that of others), the Fritzbox router does not seem well suited for high UDP traffic. Observed performance can be erratic, leading to erratic scoring from the monitoring servers (and possibly dropped packets).

I'm currently investigating replacing the Fritzbox router with something more robust, e.g. a Juniper SRX320.
I'm curious whether anyone has experience with this machine or (if not) has good suggestions for a router that is well suited for TCP as well as UDP traffic.
Also, please take into account that I'm just a computer enthusiast and not a networking expert.

For the client, the first step is to resolve "pool.ntp.org" to an IP address. That's the metric Ask posted: how many times your IP address was returned as an answer to the question "where is pool.ntp.org?".
The second step for the client is to send an NTP packet to the IP address(es) it got in step one. That's the request your server receives.

Since DNS requests are heavily cached along the way, many clients never get to ask the pool DNS servers but just get a cached answer from their DNS resolver. That way many clients get your IP without the pool DNS servers noticing.
And finally, clients can send multiple packets to the same IP address once they have one that works. In fact, many common NTP clients just resolve the IP address once at startup and keep using it.

That's also why no one knows exactly how many NTP clients and how much traffic the pool as a whole actually serves.
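
To put some purely illustrative numbers on that: ~800 DNS answers per 5-minute period is only about 2.7 answers/sec from the authoritative servers. If each of those answers is cached by a resolver and reused by several clients behind it, and every one of those clients then polls your IP directly (ntpd and chronyd typically poll every 64-1024 seconds) for hours or days afterwards, the accumulated client population can easily generate tens of times more NTP requests than the DNS answer rate alone would suggest.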


@Sebhoster Thanks! Clear.

Just to be clear: That is behavior that the user can select by using the server directive. It picks exactly one IP address from potentially multiple being returned by DNS, e.g., when accessing the pool system, and then sticks to it even if the server becomes unavailable *.

Another option nowadays supported by common NTP clients is the pool directive. Detailed client behavior depends on the specific implementation, but generally, the client initially interacts with all IP addresses that a DNS query returns. Those servers are monitored for some time, and after a while, the list is pruned, keeping only the best performing servers up to a (typically configurable) maximum number of servers. In the same manner, after the initial pruning has been completed, the servers keep being evaluated. And if one is becoming unavailable or shows ā€œbadā€ performance, re-resolution of the name is done, and unavailable/bad servers are replaced by fresh ones.

There have been various discussions about updating the "Use the pool" page's example configuration to use the pool directive rather than the currently mentioned server directive. Arguably, the former would be preferable, but the change is still pending.

* That's at least the behavior with ntpd classic and older chrony versions. Newer chrony versions seem to re-resolve the server name if the current server becomes unavailable or exhibits otherwise bad performance, and then replace the single server IP address used by a fresh one. Dave Hart is considering similar functionality for ntpd classic. I am not sure how NTPSec behaves in this respect.
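
(For anyone who wants to see the difference, a minimal chrony.conf sketch contrasting the two directives; you would use one or the other, not both, and iburst plus the maxsources value are just illustrative:)

# server: pick one address from the DNS answer and stick with it
server 2.nl.pool.ntp.org iburst
# pool: resolve the name to multiple addresses, keep up to maxsources of them,
# and re-resolve to replace sources that become unreachable or perform badly
pool 2.nl.pool.ntp.org iburst maxsources 4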


So here we are.
From the log:

1703142275,2023-12-21 07:04:35,,-5,8.3,24,recentmedian,,i/o timeout
1703142135,2023-12-21 07:02:15,,-5,14,24,recentmedian,,i/o timeout

The geoDNS servers were supposed to stop advertising it at 07:04:35 (UTC), when the score dropped from 14 to 8.3.

Thu 21 Dec 2023 08:00:53 AM CET
2.nl.pool.ntp.org.	130	IN	AAAA	2a02:a46d:7ac:1:ee46:70ff:fe00:9292
Thu 21 Dec 2023 08:01:56 AM CET
2.nl.pool.ntp.org.	130	IN	AAAA	2a02:a46d:7ac:1:213:95ff:fe1d:f6b1
Thu 21 Dec 2023 08:02:58 AM CET
2.nl.pool.ntp.org.	130	IN	AAAA	2a02:a46d:7ac:1:213:95ff:fe1d:f6b1
Thu 21 Dec 2023 08:04:01 AM CET
2.nl.pool.ntp.org.	130	IN	AAAA	2a02:a46d:7ac:1:213:95ff:fe1d:f6b1
2.nl.pool.ntp.org.	130	IN	AAAA	2a02:a46d:7ac:1:ee46:70ff:fe00:9292
Thu 21 Dec 2023 08:05:03 AM CET
2.nl.pool.ntp.org.	130	IN	AAAA	2a02:a46d:7ac:1:ee46:70ff:fe00:9292
Thu 21 Dec 2023 08:06:06 AM CET
Thu 21 Dec 2023 08:07:08 AM CET
Thu 21 Dec 2023 08:08:11 AM CET
Thu 21 Dec 2023 08:09:14 AM CET
Thu 21 Dec 2023 08:10:17 AM CET
Thu 21 Dec 2023 08:11:19 AM CET
Thu 21 Dec 2023 08:12:22 AM CET
Thu 21 Dec 2023 08:13:25 AM CET
Thu 21 Dec 2023 08:14:28 AM CET
Thu 21 Dec 2023 08:15:31 AM CET
Thu 21 Dec 2023 08:16:33 AM CET

The measurement shows that the geoDNS servers stopped advertising the server between 07:05:03 and 07:06:06 UTC. That is less than a one-minute delay after the score dropped below 10. All seems to be good with the scoring and geoDNS.
