Modem/routers slow down under heavy traffic

Hi all,

I have talked about this many times, and I have solved the heavy NAT problem.
But there is another problem, far more serious for me. The NAT-session issue is solved because other people jumped in and helped offload Belgium's traffic.

However, I ran into a new underlying problem that again bogs the modem down.
Believe it or not, it's the DNS server in the modem that gets into trouble, so badly that I get resolving timeouts, and it even fails to connect to remote resolvers like 1.1.1.1.

My solution is to install two DNS servers of my own with dnsmasq; however, I'm probably not the only one running into this trouble.

It turns out that chrony tries to resolve everything, which puts a huge load on the DNS servers, and the modems can't handle the number of requests.
My DrayTek flags DNS port 53 as being under a DDoS attack because of the request volume.

So my question is: how do I stop chrony from resolving clients? I think it's resolving every request, incoming or outgoing.

I have read the documentation and can't find a chronyd setting to disable resolving for clients.
I do not want to stop resolving for pool/servers, as they look nice in the stats and are not resolved that often.

Is it possible that chronyd is causing this, and it only shows now because my zone is offloaded a lot?

I can fix the DNS problem in the routers by rebooting them, but that's not a solution.
Does anybody know how to stop chronyd from resolving? I want to know whether this is the other cause of my resolving problems.

I tested resolving in various ways, and it seems the modem/routers can't handle it.

Anybody have a solution? Thanks.

Chrony can respond to NTP queries without doing DNS lookups for the clients’ IP addresses, and does so by default. If you run something like “chronyc clients -p 100000” (ie. without the -n option) periodically that will look up the names of the clients. Do you maybe have some sort of stats scripts that run “chronyc clients” commands periodically?
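If such a stats script exists, a quick grep through the usual cron locations will usually find it (the paths below are the common Linux ones, adjust for your system):

```shell
# chronyd itself answers NTP clients without PTR lookups; reverse DNS
# traffic comes from "chronyc clients" run WITHOUT the -n option.
# Look for such a job in the usual cron locations:
grep -r "chronyc clients" /etc/cron* 2>/dev/null || echo "no cron job found"
# If one turns up, add -n so client IPs are printed raw:
#   chronyc -n clients -p 100000
```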

Are the requests forward DNS queries (ie. look up A/AAAA records for time.apple.com) or reverse DNS queries (ie. look up PTR record for 253.84.253.17.in-addr.arpa.)?
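For reference, the reverse-lookup name is just the IPv4 octets reversed under in-addr.arpa; a small shell sketch using the IP from the example above:

```shell
# Build the PTR query name for a reverse lookup from an IPv4 address.
ip="17.253.84.253"
ptr="$(echo "$ip" | awk -F. '{print $4"."$3"."$2"."$1".in-addr.arpa."}')"
echo "$ptr"   # 253.84.253.17.in-addr.arpa.
# Forward query:  dig A time.apple.com
# Reverse query:  dig -x "$ip"   (asks for the PTR record of $ptr)
```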

I’ll note that if you’re running ntppool-agent, it will use a fair number of DNS queries to look up the IP addresses of the various reference NTP servers (although not at a scale that would really be a problem). A caching nameserver helps reduce the outgoing DNS queries in this case.

That is not what I mean. I suspect chronyd (not chronyc) of resolving every request.

The docs mention resolving for the server/pool uplinks,
so I suspect it resolves all clients too.

However, you could be right that it's the pool monitor; I have that running too.

But when I disabled NTP serving on my chrony server for testing (set it to monitor-only), the problems went away.

So, my concern is that chronyd does a lookup for every request. I could be wrong, but I would like a config setting to turn it off, either for everything or, preferably, only for clients.

Modems/routers cannot handle it, and I do know that removing pool-related access stops them from being overloaded.

BTW, I did have problems in the past too, before we could run a monitor, so I do not think the monitor is to blame. All in all, it just checks 5000 servers, the same ones every day; I'm sure that could and would be cached by a modem. Or should be.

That I can't say; I didn't check. I noticed my own PC's DNS requests timing out.
So I tried to fix the matter by using dnsmasq (which I had running on another machine), and that kept resolving for a while until the DrayTek decided it was a UDP DDoS.

Too many port 53 requests.

chronyd (the background service) only performs forward DNS lookups (name → IP) to find your time servers when it starts. To stop the daemon from using DNS entirely, use raw IP addresses in your /etc/chrony.conf file instead of hostnames (e.g., server 1.2.3.4 instead of server pool.ntp.org).

It does not perform reverse DNS lookups during normal operation; those only happen when you run the chronyc tool to view stats.
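A minimal sketch of the IP-only approach in /etc/chrony.conf (192.0.2.x are placeholder addresses; substitute your real upstream servers):

```
# /etc/chrony.conf -- literal addresses, so chronyd never needs DNS
server 192.0.2.10 iburst
server 192.0.2.11 iburst
# instead of a hostname-based line such as:
# pool 2.pool.ntp.org iburst
```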

Have you monitored your dnsmasq to see whether your chronyd makes too many requests?
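One quick way to check, assuming dnsmasq query logging is enabled (log-queries): count queries per requesting host. A sketch with sample log lines inlined; in practice you'd pipe in the output of something like journalctl -u dnsmasq instead:

```shell
# Count dnsmasq queries per requesting host (sample lines inlined).
awk '/query\[/ {print $NF}' <<'EOF' | sort | uniq -c | sort -rn
jan 07 19:27:10 server dnsmasq[1349826]: query[AAAA] api-buzz.mon.ntppool.dev from 127.0.0.1
jan 07 19:27:12 server dnsmasq[1349826]: query[AAAA] api-buzz.mon.ntppool.dev from 127.0.0.1
jan 07 19:27:12 server dnsmasq[1349826]: query[A] api-buzz.mon.ntppool.dev from 127.0.0.1
EOF
# A host issuing thousands of queries per minute will stand out at the top.
```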

It looks to be the pool-monitor…

it shows me this at every start:

jan 07 19:27:10 server dnsmasq[1349826]: cached hkgalb.develooper.com is 160.0.97.145
jan 07 19:27:10 server dnsmasq[1349826]: query[AAAA] api-buzz.mon.ntppool.dev from 127.0.0.1
jan 07 19:27:10 server dnsmasq[1349826]: cached api-buzz.mon.ntppool.dev is <CNAME>
jan 07 19:27:10 server dnsmasq[1349826]: cached hkgalb.develooper.com is 2620:171:5a:8010::1
jan 07 19:27:12 server dnsmasq[1349826]: query[AAAA] api-buzz.mon.ntppool.dev from 127.0.0.1
jan 07 19:27:12 server dnsmasq[1349826]: cached api-buzz.mon.ntppool.dev is <CNAME>
jan 07 19:27:12 server dnsmasq[1349826]: cached hkgalb.develooper.com is 2620:171:5a:8010::1
jan 07 19:27:12 server dnsmasq[1349826]: query[A] api-buzz.mon.ntppool.dev from 127.0.0.1
jan 07 19:27:12 server dnsmasq[1349826]: cached api-buzz.mon.ntppool.dev is <CNAME>
jan 07 19:27:12 server dnsmasq[1349826]: cached hkgalb.develooper.com is 160.0.97.145


Did notice this:

bas@workstation:~$ nslookup hkgalb.develooper.com
Server:		127.0.0.53
Address:	127.0.0.53#53

Non-authoritative answer:
Name:	hkgalb.develooper.com
Address: 160.0.97.145
Name:	hkgalb.develooper.com
Address: 2620:171:5a:8010::1

bas@workstation:~$ ping hkgalb.develooper.com
PING hkgalb.develooper.com (160.0.97.145) 56(84) bytes of data.
^C
--- hkgalb.develooper.com ping statistics ---
15 packets transmitted, 0 received, 100% packet loss, time 14371ms

It responds to nothing, so why is it there all the time?

The above corresponds to the following ntppool-agent log entries:

msg="failed to upload metrics: Post \"https://api-buzz.mon.ntppool.dev/v1/metrics\": reader collect and export timeout"
msg="traces export: Post \"https://api-buzz.mon.ntppool.dev/v1/traces\": processor export timeout"
msg="processor export timeout: retry-able request failure: Post \"https://api-buzz.mon.ntppool.dev/v1/logs\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

The server not responding to ping does not really matter, but what matters is that the above server does not respond to connections to its https (TCP 443) port. This service being unavailable does not seem to cause actual problems for the pool, though.

Well, I'm removing my monitors to see what that brings.

Keeps me off the street :wink:

You’ve demonstrated it doesn’t reply to ping (AKA ICMP echo request). I tried traceroute -4 and -6 from Linux and at least for my source IP, it also doesn’t respond with ICMP port unreachable to UDP traffic.
However, that doesn’t mean it responds to nothing. If you capture all the traffic involving the underlying IP addresses, I think you’ll see API request and response traffic.

The rest of your message is obviously correct, but the monitor software does use DNS for looking up IP addresses of the various “sanity check” NTP servers, ie. those that are used to determine if the monitor’s time is good enough. Maybe you don’t see those requests in your logs if you have a caching nameserver, but you’ll notice them if you restart your nameserver daemon (clearing the cache).

I wish I had an answer.

I changed routers, changed ISPs, changed loads of stuff, but my network keeps having problems when the NTP pool is active.

Last time I stopped everything, monitor and clients, and the network went back to normal.

Something is happening, but I can't put my finger on it. The weird thing is, the traffic load itself isn't high;
it's the number of sessions. As my DNS started acting weird and pages even refused to load, I looked into that.
So far the dnsmasq servers hold up, but pages load slowly.

I do remember that the early pool monitor used ntp or chrony as its measuring tool; maybe it still does, but without the '-n' parameter.
I have looked in the sources but can't find it.

My datacenter servers don't do NAT; they have no problems.

If your local DNS is indeed trying to resolve clients' IPs for some reason, it might help to either set the TTL to a low value (or zero) to free up DNS cache space, or alternatively set the TTL to a high value (e.g. 86400 seconds, the standard TTL that Unbound uses). A high TTL might prevent frequent lookups by the DNS server.
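For dnsmasq specifically, cache behavior can be nudged with options along these lines (option names from dnsmasq(8); the values are examples, and dnsmasq caps min-cache-ttl at one hour):

```
# /etc/dnsmasq.conf -- example cache tuning
cache-size=10000       # enlarge the cache (the default is small)
min-cache-ttl=3600     # don't expire cached records sooner than 1 hour
max-cache-ttl=86400    # upper bound on how long records are cached
```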

For the six pool servers I run, I have not seen DNS lookups like you describe.
The total number of NTP requests for all pool servers combined is around 1000/sec.

Well, I have dnsmasq logging enabled and restarted chronyd to see if it uses DNS a lot.
It doesn't, so I'm pretty sure this issue is not caused by chrony.

I will check the logs again in a few hours.
If it's still normal, I will temporarily remove my server from the pool and install a monitor again.

It has to be coming from something.

I do notice this often:

jan 08 16:55:18 server ntppool-agent[3508]: level=WARN msg="local-check failure" env=prod ip_version=v4 server=ntp.stupi.se ip=192.36.143.234 err="network: i/o timeout" trace_id=7084499213c29ad40e3c8aecaf72b8fd span_id=4668c44c6f2df33e

Then when I check it by hand:

bas@workstation:~$ ntpdate ntp.stupi.se
2026-01-08 16:55:04.490975 (+0100) +0.000634 +/- 0.020136 ntp.stupi.se 192.36.143.150 s1 no-leap
CLOCK: adj_systime: Operation not permitted
bas@workstation:~$ ntpdate 192.36.143.234
ntpdig: no eligible servers

Maybe it would be a good idea to check a few times a week whether the DNS name and IP still match.
That would save a lot of traffic, as I see this a lot.

Also, as you can see, the monitor does seem to resolve the IP.

As for Stupi, that hostname resolves to multiple IP addresses:

$ dig +short +noshort ntp.stupi.se
ntp.stupi.se.		86400	IN	A	192.36.143.234
ntp.stupi.se.		86400	IN	A	192.36.143.150
ntp.stupi.se.		86400	IN	A	192.36.143.151
ntp.stupi.se.		86400	IN	A	192.36.143.153

Of those, looks like the .234 one does not indeed respond to queries, and neither does its IPv6 counterpart 2001:440:1880:1000::20. I’ve sent the Stupi folks a note so they could address this, one way or another.

OK, but why is the monitor showing both? Is it resolving, or not?

Why is the hostname given to the monitor? Should it even be given?

Does it only use the first pool IP of ntp.stupi.se?

As I understand it, the monitor is only given the hostnames (not IP addresses) of the sanity check NTP servers and it is the responsibility of the monitor server to resolve the DNS names to IP addresses whenever needed. One of the reasons for that approach is that the list of sanity check NTP servers seems to include hostnames that resolve to different IP addresses depending on who’s asking. For example one of those server names is time.apple.com, which resolves to IP addresses in Europe when queried from Europe, and to IP addresses in Asia when queried from Asia.

Apparently the monitor software uses only the first IP address that is returned. I’d think that is usually sufficient and a reasonable assumption in this use case.
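That "first address only" behavior can be sketched with the system resolver (this is an assumption about what the monitor effectively does, not its actual code; getent is glibc's resolver front end):

```shell
# Return only the first IPv4 address the resolver lists for a host.
first_v4() { getent ahostsv4 "$1" | awk 'NR==1 {print $1}'; }

first_v4 localhost      # typically 127.0.0.1
# first_v4 ntp.stupi.se # whichever of the four A records is listed first
```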

Is it? I see the next one all the time too.

jan 08 18:23:59 server ntppool-agent[3508]: level=WARN msg="local-check failure" env=prod ip_version=v4 server=ntp.nict.jp ip=133.243.238.244 err="offset too large: 13.160382ms"

It's there all the time. I understand that correct local time must be checked.
But these keep popping up to no avail, as they error one way or another.

I would expect the monitor server to remove those and stop sending them.

I don't expect this to be my problem, but it looks strange that it's sending DNS names and not IPs, when the pool is said to work with IPs only.

I would expect the monitor to report the error, and if e.g. 10 monitors report it, the server in question to be removed from the local checks.
Or at least an email sent to the owner letting them know it's failing.