The DNS “time to live” has been 140 seconds most recently. I changed it to 90 seconds about 8 hours ago. I’m curious if those of you who graph NTP queries on your servers could see a difference in the traffic patterns.
On the DNS servers this increased the queries per second from about 95k to 115k during “normal” seconds and the peaks went from about 490k to 680k queries per second.
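To put those numbers in perspective, the relative increase can be checked quickly (a throwaway calculation using the approximate figures quoted above):

```python
# Relative load increase after the TTL change from 140s to 90s,
# using the approximate query rates quoted above.
normal_before, normal_after = 95_000, 115_000
peak_before, peak_after = 490_000, 680_000

normal_increase = (normal_after - normal_before) / normal_before
peak_increase = (peak_after - peak_before) / peak_before

print(f"normal: +{normal_increase:.0%}")  # roughly +21%
print(f"peak:   +{peak_increase:.0%}")    # roughly +39%
```

So the peaks grew proportionally almost twice as much as the baseline did.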
Ask, it will take some time before you notice.
However, because many people get DNS changes via DNS caching, the load on your servers won’t increase that much.
A lot of servers ignore the TTL anyway, but many client programs do not.
But I do not believe 140 → 90 seconds will do much except increase your load.
I would have expected it to be more than 300 seconds. Be aware that a DNS cache will only hold the record for that long after the initial request.
So that server will change/request again after x seconds.
I fail to see where 90 seconds will help over 140 seconds.
I would suggest you set it to 300 seconds. That means clients get new addresses every 5 minutes, so at worst it rotates 12 times an hour on the DNS server they use.
As most clients poll only once a day, or worse, once every restart of the OS, they won’t poll that much.
This is the majority of clients, and they rotate plenty.
The problem is the badly behaved clients that poll fast and all at the same time, where a local NTP caching server should have been installed and used.
Please consider setting it to 300 seconds; if the pattern doesn’t improve at 90 seconds, then we will have ‘tested’ all the settings to see what the loads do.
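To make the rotation argument concrete, here is a quick sketch (my own illustration, not project code) of how often a TTL-respecting cache would re-query for a handful of the TTL values discussed in this thread:

```python
# Upper bound on re-queries per hour for a DNS cache that respects the TTL,
# assuming it refreshes the record exactly when it expires.
for ttl in (90, 130, 140, 300, 900):
    print(f"TTL {ttl:>4}s -> up to {3600 // ttl} re-queries per hour")
```

At 300 seconds that gives the “12 times an hour” rotation mentioned above; at 90 seconds it is 40 times an hour.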
Traffic looks to me like I would have expected it to:

- Average traffic remains the same
- Maximum traffic remains the same
- Traffic spikes are shorter but happen more frequently
My assumption is that most clients and DNS servers out there work as intended. And I think there are some “firehose” DNS resolvers that serve a lot of clients. Every time the TTL expires on one of these, the firehose points somewhere else. And wherever they point, those NTP servers get a traffic spike. Reducing the TTL fits with shorter but more frequent spikes.
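A toy simulation of that theory (my own sketch with made-up numbers: one big resolver, four pool servers, the resolver repointing to a random server each time the TTL expires). Any single server then sees spikes whose length equals the TTL, and a shorter TTL means shorter but more frequent spikes:

```python
import random

def spike_windows(ttl, horizon=3600, n_servers=4, seed=1):
    """Return (start, end) windows during which server 0 receives the
    'firehose' resolver's traffic over one simulated hour, if the
    resolver repoints to a random server at every TTL expiry."""
    random.seed(seed)
    windows = []
    for start in range(0, horizon, ttl):
        if random.randrange(n_servers) == 0:  # resolver points at server 0
            windows.append((start, start + ttl))
    return windows

for ttl in (140, 90):
    w = spike_windows(ttl)
    print(f"TTL {ttl}s: {len(w)} spikes, each {ttl}s long")
```

With a 90-second TTL there are 40 repointing opportunities per hour instead of about 26, so each server is hit more often but for a shorter time, matching the observed pattern.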
Sorry, but from your post I once again only infer that your servers are supposedly impacted. That is far from “we all”, at least until others start reporting, in significant numbers, issues that they relate to the change in the DNS TTL value.
So that does not look like a drastic increase in score drops, or any increase at all. For reference, the change was somewhere in the middle of the graph, around 2023-12-15 19:00 (± an hour if I got the timezones wrong).
edit: new picture without monitor 24 because that one just repeats errors from the others
I have changed the TTL to 130 seconds since making it lower didn’t make any meaningful improvements on the traffic patterns.
@Sebhoster Thanks for the analytics! I’m happy others are looking at this, too. Usually it’s just me and @stevesommars.
@Bas The DNS system and the monitoring system are quite decoupled. The DNS servers won’t get updated health information if the monitoring system (and scoring) isn’t working right, but there aren’t any reasonable ways that the DNS system can negatively impact the monitoring or scoring system.
Processing the data from both does share resources in the “central system”, but all the components have safeguards around resource consumption to make that unlikely (Linux cgroups, for example, via Kubernetes resource limits).
I spot-checked the DNS servers and all of them generally seemed to handle the increased queries fine. The CPU monitoring samples only a few times a minute, so it might have missed brief periods of slow response times from some servers, but most of them have plenty of capacity, so the regular 5-6x increase in load at peak periods should be handled fine.
The DNS server log processing did run out of space for a bit (I currently keep ~24 hours / ~12 billion queries for diagnostics and debugging), but all that affected was the “what percentage of queries does my server get?” metric on the servers page and some of my dashboards.
I would be interested in seeing an experiment in the other direction, i.e. increasing the TTL. Case in point: I got a fairly big traffic spike on one of my servers yesterday, and the ISP apparently thought it was a DDoS attack and blocked inbound NTP to this server at their firewall. I’m working with them to get the restriction removed.
My theory is that a significantly longer TTL would, over time, smooth the traffic spikes. I’m talking about 30 minutes or so. The obvious drawback is that if a server goes down, the clients would not be directed to other NTP servers as quickly as before. Maybe try with 15 minutes (900 seconds) and see what happens?
I have a bad idea: What if the TTL depended on the scores of the IPs in the response? If every server’s score is 20, use a TTL of 15 minutes or whatever. If any of them are lower, use a low TTL. Under the theory that a server with a perfect score is unlikely to (or can’t?) be kicked out quickly, but servers with imperfect scores may have problems.
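As a sketch of that idea (a hypothetical function, not actual pool code; the perfect-score threshold of 20 and the TTL values are taken from the posts above):

```python
def response_ttl(scores, long_ttl=900, short_ttl=90, perfect=20):
    """Pick a DNS TTL for a response based on the scores of the servers
    in it: long if every server has a perfect score (unlikely to be
    rotated out soon), short if any of them might have problems."""
    return long_ttl if all(s >= perfect for s in scores) else short_ttl

print(response_ttl([20, 20, 20, 20]))  # -> 900
print(response_ttl([20, 17.5, 20]))    # -> 90
```

The trade-off is the same one noted above: a long TTL on a healthy set delays failover if one of those servers drops out right after the response is cached.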