I have changed the TTL to 130 seconds, since lowering it didn’t meaningfully improve the traffic patterns.
@Sebhoster Thanks for the analytics! I’m happy others are looking at this, too. Usually it’s just me and @stevesommars.
@Bas The DNS system and the monitoring system are quite decoupled. The DNS servers won’t get updated health information if the monitoring system (and scoring) isn’t working right, but there aren’t any reasonable ways that the DNS system can negatively impact the monitoring or scoring system.
Processing the data from both does share resources in the “central system”, but all the components have some safeguards around resource consumption (Linux cgroups, for example, via Kubernetes resources) to make it unlikely that one starves the other.
I spot-checked the DNS servers and all of them generally seemed to handle the increased queries fine. The CPU monitoring only samples a few times a minute, so it might have missed brief periods of slow response times on some servers, but most of them have plenty of capacity, so the regular 5-6x increase in load at peak periods should be handled fine.
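To give a rough feel for how easily coarse sampling can miss a short spike, here is a minimal sketch. It assumes instantaneous point samples at a fixed interval and a burst that starts at a random moment relative to the sampling clock; the 20-second interval and the burst lengths are hypothetical numbers chosen for illustration, not the actual monitoring configuration.

```python
def miss_probability(sample_interval_s: float, burst_s: float) -> float:
    """Chance that a burst falls entirely between two consecutive samples.

    Assumptions (illustrative only): samples are instantaneous and evenly
    spaced, and the burst start is uniformly random within an interval.
    """
    if sample_interval_s <= 0 or burst_s < 0:
        raise ValueError("interval must be positive and burst non-negative")
    # A burst covers a sample point with probability burst / interval,
    # capped at 1 once the burst is at least as long as the interval.
    return max(0.0, 1.0 - burst_s / sample_interval_s)


# Hypothetical example: sampling every 20 seconds (three times a minute).
for burst in (1, 5, 20):
    chance = miss_probability(20, burst)
    print(f"{burst:>2}s burst: {chance:.0%} chance of going unnoticed")
```

If the monitoring reports an average over each interval rather than a point sample, a short spike would be diluted rather than missed outright, so treat this only as an intuition aid.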
The DNS server log processing did run out of space for a bit (I currently keep ~24 hours / ~12 billion queries for diagnostics and debugging), but all that affected was the “what percentage of queries does my server get?” metric on the servers page and some of my dashboards.
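For a sense of scale, the retention figures above can be turned into an average query rate. This is back-of-the-envelope arithmetic on the approximate numbers quoted (~12 billion queries over ~24 hours), presumably across all DNS servers combined; the constant and function names are made up for the example.

```python
# Approximate figures from the post; both are rounded estimates.
RETAINED_QUERIES = 12_000_000_000
RETENTION_SECONDS = 24 * 60 * 60


def average_qps(queries: int, seconds: int) -> float:
    """Average queries per second over the retention window."""
    return queries / seconds


avg = average_qps(RETAINED_QUERIES, RETENTION_SECONDS)
print(f"roughly {avg:,.0f} queries per second on average")
```

This is only the mean over the window; actual load at peak periods would be well above it.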