I’ve seen some people here posting graphs of their server traffic and resource usage, presumably from Grafana.
I’m curious, what are you using, how have you set it up, and what exactly are you monitoring? Where is the data collected (Prometheus, InfluxDB, Graphite?), and by what method? Basically, what’s your setup?
Looking into it, I’m a little intimidated by the number of parts involved in such a setup. Might be a fun project though.
In my bit of the University of Cambridge, we generally run CollectD on each host and they all send metrics to a central Graphite server. Then we have a Grafana system that can display data from that Graphite server. I take advantage of this for my NTP server metrics.
On the CollectD side, there are four plugins that I use that are relevant for NTP:
The “iptables” plugin is combined with iptables rules that specifically match NTP traffic, so I can separate it from everything else the servers do. The temperature sensors are useful for telling when changes in clock frequency are caused by problems with the cooling system.
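As a sketch of that counting approach (in Python, with a hypothetical chain and rule layout — the actual iptables setup will differ), the per-rule packet and byte counters can be read back out of `iptables -nvxL` output like this:

```python
import subprocess

def parse_iptables_counters(output, match="dpt:123"):
    """Sum (packets, bytes) over counter lines of `iptables -nvxL`
    output whose rule text contains `match` (here: NTP port 123)."""
    packets = nbytes = 0
    for line in output.splitlines():
        fields = line.split()
        # Rule lines start with two numeric columns: pkts, bytes.
        if len(fields) >= 2 and fields[0].isdigit() and fields[1].isdigit() \
                and match in line:
            packets += int(fields[0])
            nbytes += int(fields[1])
    return packets, nbytes

def ntp_counters():
    # -n: numeric output, -v: show counters, -x: exact (unabbreviated) numbers
    out = subprocess.run(["iptables", "-nvxL"],
                         capture_output=True, text=True, check=True).stdout
    return parse_iptables_counters(out)
```

The CollectD plugin does this matching internally; the sketch just shows what it is counting.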
I then have a single Grafana dashboard that puts all this information together for the NTP servers under my control.
I use Zabbix to monitor the standard CPU, memory, disk, and network traffic like on any other host. For NTP-specific metrics I started with this Zabbix template. That only covers chrony client statistics, so I added server-side items.
For the NTP Pool Project itself it’s a lot of Grafana stuff plus custom software for data from the DNS servers and monitors.
Prometheus (with various exporters) for scraping / collecting data, mostly over HTTP though the monitors are “scraped” over MQTT. The DNS servers are scraped with mTLS (each DNS server gets a private certificate for this).
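For the mTLS scraping, a Prometheus scrape job along these lines would do it (a sketch only; the job name, target, and certificate paths here are made up):

```yaml
scrape_configs:
  - job_name: "dns-servers"        # hypothetical job name
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/ca.pem        # CA that signed the server certs
      cert_file: /etc/prometheus/client.pem  # per-scraper client certificate
      key_file: /etc/prometheus/client.key
    static_configs:
      - targets: ["dns1.example.net:9273"]   # placeholder target
```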
Longer term metrics storage in Mimir (from Grafana).
Application logs go via promtail into Loki (also Grafana).
System logs from the DNS servers are sent via vector (with mTLS) into the main cluster; also going into Loki.
DNS server (query) logs are serialized in Avro files and then sent to the cluster with a custom program, then to Kafka and from there into ClickHouse (and summarized; only about a day of queries are kept).
Alerts are made with a combination of Grafana and Prometheus + Alertmanager (and a few Loki rules).
Tracing (OpenTelemetry) is collected via various otel-collector instances and sent to Tempo (another Grafana product).
I haven’t figured out how to count queries from the (single) NTP server I run, though, which I guess is what the thread is about…
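For what it’s worth, chrony reports cumulative request counts via `chronyc serverstats`. A minimal Python sketch (assuming the usual `Key : value` output format) that turns that output into numbers a collector could scrape:

```python
import subprocess

def parse_serverstats(text):
    """Parse `chronyc serverstats` output ("NTP packets received : N"
    style lines) into a {metric_name: count} dict."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            stats[key.strip().lower().replace(" ", "_")] = int(value.strip())
    return stats

def serverstats():
    out = subprocess.run(["chronyc", "serverstats"],
                         capture_output=True, text=True, check=True).stdout
    return parse_serverstats(out)
```

Since the counts are cumulative, differencing two readings gives queries per interval.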
I use Prometheus and Grafana. Data is collected with node_exporter for general metrics like CPU load, the chrony exporter (with the unfinished serverstats PR) for chrony metrics, and a small Python script that processes the chrony client statistics and also exposes them for Prometheus.
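Such a script can be quite small. A sketch of the exposing half (the metric names here are made up; this just renders the Prometheus text exposition format that a /metrics endpoint would return):

```python
def to_prometheus(metrics, prefix="chrony"):
    """Render a flat {name: value} dict in the Prometheus text
    exposition format, one counter per metric."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {prefix}_{name} counter")
        lines.append(f"{prefix}_{name} {value}")
    return "\n".join(lines) + "\n"
```

Served from any tiny HTTP handler, Prometheus can then scrape it like any other exporter.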
The chrony portion of my dashboard: the yellow lines indicate dropped packets and clients that had at least one of their requests dropped.
I use nothing, but my servers see limited traffic because their pool speed setting is 512 Kbit, which limits how often they appear in the pool’s DNS responses.
However, in case of trouble I use nethogs and check chronyc clients for abusive polling. If there is an abusive client, I block its IP via the firewall; so far it has been just a few, not worth counting.
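That check can be semi-automated. A Python sketch that flags hosts above a request threshold (assuming the `chronyc clients` column layout of chrony 4.x: hostname first, NTP packet count second):

```python
def abusive_clients(text, min_packets=100000):
    """From `chronyc clients` output, return (host, ntp_packets)
    pairs whose request count is at or above min_packets."""
    flagged = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows are a hostname/address followed by a numeric NTP
        # count; this skips the header and "===" separator lines.
        if len(fields) >= 2 and fields[1].isdigit():
            host, ntp = fields[0], int(fields[1])
            if ntp >= min_packets:
                flagged.append((host, ntp))
    return flagged
```

The threshold is arbitrary; what counts as abusive depends on how long the counters have been accumulating.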
Late to the party, but I use Grafana, InfluxDB, telegraf, and my own script NTPmon. This handles chrony and ntpd (both traditional and NTPsec versions) with the same monitoring infrastructure.
One of my pool hosts runs ntpd-rs, which provides its own telemetry endpoint, so that is scraped periodically by telegraf and graphed using a different Grafana dashboard.
One of my recent blog posts links to a public Grafana dashboard where you can see the metrics I collect, and another explains why I chose InfluxDB & telegraf for NTP monitoring.
I don’t monitor the timekeeping performance of my various servers beyond manually using the tools shipped alongside the daemon, i.e., ntpq for my ntpd classic and NTPsec instances, and chronyc for chronyd. And the Pool’s monitoring system, of course.
I have basic traffic monitoring in place for my current main server in the pool, using darkstat and RIPE Atlas.
I have two darkstat instances, one for IPv4 and one for IPv6. IPv6 is somewhat boring, just 616 kbit/s or so with a 3 Gbit setting. IPv4 is more interesting: something in the tool or the system is too slow to keep up with the pcap-based packet capture, so the maximum bitrate shown is typically clipped at around 8 Mbit/s. The “captured” and “dropped” packet counters come from pcap, which stores them as u_int, so they wrap around rather often.
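A wrap of an unsigned 32-bit counter is harmless as long as readings are differenced modulo 2^32; a one-liner sketch:

```python
def counter_delta(old, new, width=32):
    """Increase between two readings of a wrapping unsigned counter
    (pcap stores its stats as u_int, typically 32 bits), assuming
    at most one wrap happened between the readings."""
    return (new - old) % (1 << width)
```

This only breaks down if the counter wraps more than once between two readings, i.e. if the polling interval is too long for the traffic rate.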
This is the overall interface traffic, so not only NTP, but other stuff as well. But since this server is primarily intended for NTP service, any non-NTP traffic (ICMP pings as discussed in a recent separate thread, HTTP(S) and other probing/scanning, me viewing darkstat, …) is basically negligible in comparison to the 2 Mbit/s and upwards NTP traffic.
My hypothesis is that the rather large variations throughout the day are caused by other servers in the same zone phasing in and out of the pool due to high load. (There are two monitors in the same zone, and local traffic seems to have a typical RTT of 1 or 2 ms, so the packet loss typically seen in other cases is rather unlikely to cause servers to drop out here.) Very short bursts could also be the occasional software or GeoDB update download.
With a 6 Mbit setting for IPv4, I have seen the occasional peak of up to 18 Mbit/s (which is roughly where the cloud provider’s DDoS detection typically kicks in), but otherwise, throughput typically varies between about 2 Mbit/s and 12 Mbit/s.
As an IPv6 evangelist, I feel an urgent need to say that it is not only due to the low number of IPv6 clients (in fact, there are quite a few nowadays), but also because @ask simply refuses to add more AAAA records besides the one at 2.pool.ntp.org that we have had since IPv6 Launch Day 2012 or so.
Though I have the impression that, despite the huge number of clients, IPv6 uptake in large parts of Asia (where this particular server is located, with a few exceptions such as India and North Korea) is for some reason a bit lower than, e.g., in Europe, according to this source.
Actually, taking a second look, Europe overall isn’t that great, either, in some parts…
If the traffic ratio between IPv4 and IPv6 that I see on my NTP servers matched these stats from APNIC, I would be a very happy man. But that can only happen if all [0123].pool.ntp.org FQDNs receive AAAA records.
But I’ll stop now, because this thread is not about IPv6.
Applied filter: 'udp and port 123'
Start time: Mon Jun 24 11:57:34 2024
End time: Mon Jun 24 12:05:02 2024
Total number of packets: 4,827,024.
IPv4: 4,556,013 packets (94.385 %), IPv6: 271,011 packets (5.615 %).
IPv4 bytes: 410,072,543, IPv6 bytes: 29,811,210.
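As a quick sanity check on such splits, the percentage shares follow directly from the raw counts:

```python
def shares(a, b):
    """Percentage split between two counters, rounded to 3 decimals."""
    total = a + b
    return round(100 * a / total, 3), round(100 * b / total, 3)

# Packet split from the capture above:
v4_share, v6_share = shares(4_556_013, 271_011)
```

Note that the packet and byte shares differ slightly, since IPv6 headers are larger per packet.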
The uptake of IPv6 could be higher, in my opinion; most of the IPv6 requests are from the known DS-Lite / IPv6-only internet providers, despite what the map says…
FWIW: I’m creating my own monitoring dashboard for gpsd + NTPsec.
There’s a demo at http://gateway.vanheusden.com:5000/
It’s a fairly new project (2 days old as I write this), so there may be bugs and a lot left to wish for. Also, at the time of writing, the GPS is located behind a coated window (I recently moved), so reception is spotty (for the demo).