What's your monitoring setup?

I’ve seen some people here posting graphs of their server traffic and resource usage, presumably from Grafana.
I’m curious: what are you using, how have you set it up, and what exactly are you monitoring? Where is the data stored (Prometheus, InfluxDB, Graphite?), and how is it collected? Basically, what’s your setup?

Looking into it, I’m a little intimidated by the number of parts involved in such a setup. Might be a fun project though.

2 Likes

I am using Telegraf on my time servers to send both NTP and system metrics into InfluxDB. Then I am using Grafana for dashboards.
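If it helps anyone trying a similar pipeline: Telegraf has a chrony input plugin, and its --test mode will print what that input would send to InfluxDB before you wire up the whole chain. A rough sketch, assuming the usual default config path and plugin names (adjust for your install):

    # one-off run that prints the chrony metrics Telegraf would collect,
    # without writing anything to InfluxDB
    telegraf --test --config /etc/telegraf/telegraf.conf --input-filter chrony

    # same idea for the general system metrics
    telegraf --test --config /etc/telegraf/telegraf.conf --input-filter cpu:mem:net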

2 Likes

In my bit of the University of Cambridge, we generally run CollectD on each host and they all send metrics to a central Graphite server. Then we have a Grafana system that can display data from that Graphite server. I take advantage of this for my NTP server metrics.

On the CollectD side, there are four plugins that I use that are relevant for NTP:

  • chrony, for timekeeping data from chrony
  • iptables, for traffic levels
  • processes, for chrony’s CPU usage
  • ipmi, for hardware temperature sensors

The “iptables” plugin is combined with IPTables rules that specifically match NTP traffic so I can separate that from everything else the servers do. The temperature sensors are useful for telling when changes in clock frequency are caused by problems with the cooling system.
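For reference, the counting rules look roughly like this (chain choice and comment names here are just an illustration, not my exact ruleset). Rules without a -j target only count matching packets and let them continue, and the comment gives the CollectD iptables plugin a stable name to report on:

    iptables  -A INPUT  -p udp --dport 123 -m comment --comment "ntp-in"
    iptables  -A OUTPUT -p udp --sport 123 -m comment --comment "ntp-out"
    ip6tables -A INPUT  -p udp --dport 123 -m comment --comment "ntp-in"
    ip6tables -A OUTPUT -p udp --sport 123 -m comment --comment "ntp-out"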

I then have a single Grafana dashboard that puts all this information together for the NTP servers under my control.

2 Likes

I’m using Netdata and Zabbix for monitoring stuff.

1 Like

I use Zabbix to monitor standard CPU, memory, disk, and network traffic like any other host. For NTP-specific metrics I started with this Zabbix template; it only covers chrony client statistics, so I added server-type items:

  • NTP packets received: system.run[sudo chronyc serverstats | grep "NTP packets received" | awk '{print $5}']
  • Client count: system.run[sudo chronyc -c -n clients | wc -l]

A few others follow that pattern. It makes for fine graphs.
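For instance, two more that follow the same pattern (the field names are taken straight from chronyc serverstats output, so check them against your chrony version):

  • NTP packets dropped: system.run[sudo chronyc serverstats | grep "NTP packets dropped" | awk '{print $5}']
  • Command packets received: system.run[sudo chronyc serverstats | grep "Command packets received" | awk '{print $5}']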

That is with my speed set to 25 Mbps, BTW.

2 Likes

Using munin with the chrony_drift and chrony_status plugins from here:

The resulting graphs :chart_with_upwards_trend: can be seen here:

https://beta.kenyonralph.com/munin/time-day.html
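For anyone new to Munin: third-party plugins like these are typically symlinked into the plugins directory and can be tested by hand before restarting munin-node. A minimal sketch, assuming typical Debian-style paths (the source path is a placeholder):

    # make the plugins active
    ln -s /path/to/chrony_drift  /etc/munin/plugins/chrony_drift
    ln -s /path/to/chrony_status /etc/munin/plugins/chrony_status

    # dry-run a plugin, then pick up the change
    munin-run chrony_drift
    systemctl restart munin-node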

2 Likes

For the NTP Pool Project itself it’s a lot of Grafana stuff plus custom software for data from the DNS servers and monitors.

  • Prometheus (with various exporters) for scraping / collecting data, mostly over HTTP though the monitors are “scraped” over MQTT. The DNS servers are scraped with mTLS (each DNS server gets a private certificate for this).
  • Longer term metrics storage in Mimir (from Grafana).
  • Application logs go via promtail into Loki (also Grafana).
  • System logs from the DNS servers are sent via vector (with mTLS) into the main cluster; also going into Loki.
  • DNS server (query) logs are serialized in Avro files and then sent to the cluster with a custom program, then to Kafka and from there into ClickHouse (and summarized; only about a day of queries are kept).
  • Alerts are made with a combination of Grafana and Prometheus + Alertmanager (and a few Loki rules).
  • Tracing (OpenTelemetry) is collected via various otel-collector instances and sent to Tempo (another Grafana product).

I haven’t figured out how to count queries from the (single) NTP server I run, though, which I guess is what the thread is about… :smiley:

4 Likes

I use Prometheus and Grafana. Data is collected with the node exporter for general metrics like CPU load, the chrony exporter (with the unfinished serverstats PR) for chrony metrics, and a small Python script that processes the chrony client statistics and also exposes them for Prometheus.

The chrony portion of my dashboard: the yellow lines indicate dropped packets and clients that had at least one of their requests dropped.

1 Like

I think it’s both relevant and interesting either way, so thanks for sharing ^^

2 Likes

I use nothing, but my servers only get a limited share of the pool DNS because they are set to 512K.
However, in case of trouble I use nethogs and check chronyc clients for abusive polling.
If there is one, I block the IP via the firewall; so far it’s only been a few, not worth counting :slight_smile:
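In case it’s useful, a rough sketch of that workflow (the column number assumes chronyc’s current CSV layout, and the addresses are just documentation examples):

    # clients sorted by number of NTP requests (2nd field of the CSV output)
    sudo chronyc -c -n clients | sort -t, -k2 -nr | head -20

    # block an abusive client, IPv4 and IPv6 variants
    sudo iptables  -A INPUT -s 192.0.2.1   -p udp --dport 123 -j DROP
    sudo ip6tables -A INPUT -s 2001:db8::1 -p udp --dport 123 -j DROP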

I use MRTG:

2 Likes

Late to the party, but I use Grafana, InfluxDB, telegraf, and my own script NTPmon. This handles chrony and ntpd (both traditional and NTPsec versions) with the same monitoring infrastructure.

One of my pool hosts runs ntpd-rs, which provides its own telemetry endpoint, so that is scraped periodically by telegraf and graphed using a different Grafana dashboard.

One of my recent blog posts links to a public Grafana dashboard where you can see the metrics I collect, and another explains why I chose InfluxDB & telegraf for NTP monitoring.

I use shell + rrdtool on an embedded device (a LuckFox Pico) with 128 MB of storage.
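Not the exact scripts, but a minimal sketch of the idea (the RRD layout, file names, and the serverstats field are assumptions for the example):

    # one-time: an RRD with a counter for NTP packets received,
    # 1-minute step, keeping a day of 1-minute averages
    rrdtool create /var/lib/ntp-stats.rrd --step 60 \
        DS:pkts:COUNTER:120:0:U \
        RRA:AVERAGE:0.5:1:1440

    # from cron, every minute: feed the current counter from chronyc
    PKTS=$(chronyc serverstats | awk '/NTP packets received/ {print $5}')
    rrdtool update /var/lib/ntp-stats.rrd N:"$PKTS"

    # render a daily graph
    rrdtool graph /var/www/ntp-day.png --start -1d \
        DEF:pkts=/var/lib/ntp-stats.rrd:pkts:AVERAGE \
        LINE1:pkts#0000ff:"NTP packets/s"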

I don’t monitor the timekeeping performance of my various servers beyond manually using the tools shipped alongside the daemon. I.e., ntpq for my ntpd classic and NTPsec instances, and chronyc for chronyd. And the Pool’s monitoring system of course.
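For reference, the kind of manual spot checks I mean:

    ntpq -p             # peer status and offsets for ntpd / NTPsec
    chronyc tracking    # current offset and frequency for chronyd
    chronyc sources -v  # per-source details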

I have basic traffic monitoring in place for my current main server in the pool, using darkstat and RIPE Atlas.

I have two darkstat instances, one for IPv4 and one for IPv6. IPv6 is somewhat boring, just 616 kbit/s or so with a 3 Gbit setting. IPv4 is more interesting: something in the tool or the system is too slow to handle the pcap-based packet capture, so the maximum bitrate shown is typically clipped at around 8 Mbit/s. The counters for “captured” and “dropped” packets come from pcap, which stores them as u_int, so they wrap around rather often.

This is the overall interface traffic, so not only NTP, but other stuff as well. But since this server is primarily intended for NTP service, any non-NTP traffic (ICMP pings as discussed in a recent separate thread, HTTP(S) and other probing/scanning, me viewing darkstat, …) is basically negligible in comparison to the 2 Mbit/s and upwards NTP traffic.

My hypothesis is that the rather large variations throughout the day are due to some other servers in the same zone phasing in and out of the pool rather often due to high load (there are two monitors in the same zone, and local traffic seems to have a typical RTT of 1 or 2 ms, so the typical packet loss seen in other cases is rather unlikely to cause servers to drop here). (Very short bursts could also be the occasional software or GeoDB update download.)

With a 6 Mbit setting for IPv4, I have seen the occasional peak of up to 18 Mbit/s (which is roughly where the cloud provider’s DDoS detection typically kicks in), but otherwise, throughput typically varies between about 2 Mbit/s and 12 Mbit/s.
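As an aside, for anyone who wants to see only the NTP share in darkstat: it accepts a pcap filter expression, so a dedicated instance can be limited to port 123. A sketch with placeholder interface and web ports:

    # IPv4 NTP only, web UI on port 667
    darkstat -i eth0 -f "ip and udp port 123" -p 667

    # IPv6 NTP only, second instance on another port
    darkstat -i eth0 -f "ip6 and udp port 123" -p 668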

1 Like

As an IPv6 evangelist, I feel an urgent need to say that it is not only due to the low number of IPv6 clients (in fact, there are quite a few nowadays), but also because @ask simply refuses to add more AAAA records besides the one at 2.pool.ntp.org that we have had since IPv6 Launch Day 2012 or so. :neutral_face:

4 Likes

Though I have the impression that, despite the huge number of clients, IPv6 uptake in large parts of Asia (where this particular server is located) is for some reason a bit lower than, e.g., in Europe, with a few exceptions such as India and North Korea, according to this source.

Actually, taking a second look, Europe overall isn’t that great, either, in some parts…

If the traffic ratio between IPv4 and IPv6 that I see on my NTP servers matched these stats from APNIC, I would be a very happy man. But that can only happen if all [0123].pool.ntp.org FQDNs receive AAAA records.

But I’ll stop now - because this thread was not about IPv6.

4 Likes

From my vantage point in Germany, no…

Applied filter: 'udp and port 123'
IPv4 bytes: 410,072,543 (94.385 %), IPv6 bytes: 29,811,210 (5.615 %).
Starttime: Mon Jun 24 11:57:34 2024
  Endtime: Mon Jun 24 12:05:02 2024
Total number of packets: 4,827,024.
IPv4: 4,556,013 packets, IPv6: 271,011 packets.

IPv6 uptake could be higher in my opinion; most of the IPv6 requests are from the known DS-Lite / IPv6-only internet providers :frowning: despite what the map says…

FWIW: I’m creating my own monitoring dashboard for gpsd + NTPsec.

There’s a demo at http://gateway.vanheusden.com:5000/
It’s a fairly new project (two days old as I write this), so there may be bugs and a lot to wish for :slight_smile: Also, at the time of writing, the GPS is located behind a coated window (recently moved), so reception is spotty (for the demo).

4 Likes