What's your monitoring setup?

I’ve seen some people here posting graphs of their server traffic and resource usage, presumably made with Grafana.
I’m curious, what are you using, how have you set it up, and what exactly are you monitoring? Where is the data collected (Prometheus, InfluxDB, Graphite?), and by what method? Basically, what’s your setup?

Looking into it, I’m a little intimidated by the number of parts involved in such a setup. Might be a fun project though.


I am using Telegraf on my time servers to send both NTP and system metrics into InfluxDB. Then I use Grafana for dashboards.
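That pipeline needs very little configuration. A minimal telegraf.conf sketch, assuming a local InfluxDB 1.x (inputs.chrony, inputs.cpu, inputs.mem and outputs.influxdb are real Telegraf plugins, but the URL and database name here are assumptions):

```toml
# Hypothetical minimal telegraf.conf; URL and database name are assumptions.
[[inputs.chrony]]   # NTP metrics (parsed from "chronyc tracking")

[[inputs.cpu]]      # system metrics
[[inputs.mem]]

[[outputs.influxdb]]
  urls = ["http://127.0.0.1:8086"]
  database = "telegraf"
```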


In my bit of the University of Cambridge, we generally run CollectD on each host and they all send metrics to a central Graphite server. Then we have a Grafana system that can display data from that Graphite server. I take advantage of this for my NTP server metrics.

On the CollectD side, there are four plugins that I use that are relevant for NTP:

  • chrony, for timekeeping data from chrony
  • iptables, for traffic levels
  • processes, for chrony’s CPU usage
  • ipmi, for hardware temperature sensors
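As a sketch, enabling those four in collectd.conf might look like this (the chrony plugin ships with collectd 5.9+; the processes match below is an assumption):

```
LoadPlugin chrony
LoadPlugin iptables
LoadPlugin processes
LoadPlugin ipmi

<Plugin processes>
  Process "chronyd"   # report chronyd's CPU usage
</Plugin>
```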

The “iptables” plugin is combined with iptables rules that specifically match NTP traffic, so I can separate it from everything else the servers do. The temperature sensors are useful for telling when changes in clock frequency are caused by problems with the cooling system.
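For reference, counting-only rules like these would do the matching (a sketch; the rule comments are assumptions — collectd’s iptables plugin can then be pointed at those rules to read their packet/byte counters):

```shell
# Hypothetical counting-only rules: no -j target, so they only accumulate
# packet/byte counters for NTP traffic without affecting it.
iptables -A INPUT  -p udp --dport 123 -m comment --comment "ntp-in"
iptables -A OUTPUT -p udp --sport 123 -m comment --comment "ntp-out"
```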

I then have a single Grafana dashboard that puts all this information together for the NTP servers under my control.


I’m using Netdata and Zabbix for monitoring.


I use Zabbix to monitor standard CPU, memory, disk, and network traffic like on any other host. For NTP-specific metrics I started with this Zabbix template. That covers only chrony client statistics, so I added server-side items:

  • NTP packets received: system.run[sudo chronyc serverstats | grep "NTP packets received" | awk '{print $5}']
  • Client count: system.run[sudo chronyc -c -n clients | wc -l]

A few other items follow that pattern. It makes fine graphs.
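For anyone unsure what those items extract, the pipeline boils down to this (sample output with a made-up number; the field layout matches chrony 4.x):

```shell
# "chronyc serverstats" prints lines like "NTP packets received : N";
# the counter is the fifth whitespace-separated field, hence awk's $5.
printf 'NTP packets received       : 1913200\n' \
  | grep "NTP packets received" | awk '{print $5}'
# prints 1913200
```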

That is with my net speed set to 25 Mbps, BTW.


Using munin with the chrony_drift and chrony_status plugins from here:

The resulting graphs :chart_with_upwards_trend: can be seen here:



For the NTP Pool Project itself it’s a lot of Grafana stuff plus custom software for data from the DNS servers and monitors.

  • Prometheus (with various exporters) for scraping / collecting data, mostly over HTTP though the monitors are “scraped” over MQTT. The DNS servers are scraped with mTLS (each DNS server gets a private certificate for this).
  • Longer term metrics storage in Mimir (from Grafana).
  • Application logs go via promtail into Loki (also Grafana).
  • System logs from the DNS servers are sent via vector (with mTLS) into the main cluster; also going into Loki.
  • DNS server (query) logs are serialized in Avro files and then sent to the cluster with a custom program, then to Kafka and from there into ClickHouse (and summarized; only about a day of queries are kept).
  • Alerts are made with a combination of grafana and prometheus + alertmanager (and a few Loki rules)
  • Tracing (OpenTelemetry) is collected via various otel-collector instances and sent to Tempo (another Grafana product).
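A scrape job with mTLS of the kind described above might look like this in prometheus.yml (a sketch; the job name, file paths, and target are assumptions, not the pool’s actual config):

```yaml
scrape_configs:
  - job_name: dns-servers
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/tls/ca.pem
      cert_file: /etc/prometheus/tls/client.pem   # per-server certificate
      key_file: /etc/prometheus/tls/client.key
    static_configs:
      - targets: ['dns1.example.net:443']
```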

I haven’t figured out how to count queries from the (single) NTP server I run though, which I guess is what the thread is about… :smiley:


I use Prometheus and Grafana. Data is collected with node_exporter for general metrics like CPU load, the chrony exporter (with the unfinished serverstats PR) for chrony metrics, and a small Python script that processes the chrony client statistics and also exposes them for Prometheus.
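The Python script itself isn’t shown; as a sketch of the same idea, a client count can also be exposed through node_exporter’s textfile collector (the directory path and metric name here are assumptions):

```shell
# "chronyc -c clients" prints one CSV line per client; write the count in
# Prometheus text format where node_exporter's textfile collector reads it.
count=$(chronyc -c clients | wc -l)
printf 'chrony_clients_total %s\n' $count \
  > /var/lib/node_exporter/chrony_clients.prom
```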

The chrony portion of my dashboard: the yellow lines indicate dropped packets and clients that had at least one of their requests dropped.


I think it’s both relevant and interesting either way, so thanks for sharing ^^


I use nothing; my servers get limited traffic from the pool DNS because their net speed is set to 512K.
However, in case of trouble I use nethogs and check chronyc clients for abusive polling.
If I find one, I block the IP via the firewall. So far just a few, not worth counting :slight_smile:
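As a sketch of that workflow (the IP is a placeholder, not a real offender):

```shell
# List clients sorted by NTP packet count (column 2), busiest first,
# then drop an abusive source address at the firewall.
chronyc -n clients | sort -k2 -nr | head
iptables -A INPUT -s 203.0.113.7 -p udp --dport 123 -j DROP
```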

I use MRTG:


Late to the party, but I use Grafana, InfluxDB, telegraf, and my own script NTPmon. This handles chrony and ntpd (both traditional and NTPsec versions) with the same monitoring infrastructure.

One of my pool hosts runs ntpd-rs, which provides its own telemetry endpoint, so that is scraped periodically by telegraf and graphed using a different Grafana dashboard.

One of my recent blog posts links to a public Grafana dashboard where you can see the metrics I collect, and another explains why I chose InfluxDB & telegraf for NTP monitoring.

I use shell scripts + rrdtool on an embedded device (a LuckFox Pico with 128 MB of storage).
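A minimal sketch of that approach, assuming a packet counter is fed in once a minute (the DS name, step, and retention are assumptions):

```shell
# One COUNTER data source, one day of 1-minute averages.
rrdtool create ntp.rrd --step 60 \
  DS:pkts:COUNTER:120:0:U \
  RRA:AVERAGE:0.5:1:1440
rrdtool update ntp.rrd N:12345          # e.g. a count from "chronyc serverstats"
rrdtool graph ntp.png --start -1h \
  DEF:p=ntp.rrd:pkts:AVERAGE \
  LINE1:p#0000ff:"NTP packets/s"
```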