Include monitor status (active vs. testing) in log entries

@ask, would it be possible to include the “status” of a monitor (active vs. testing) at the time a log entry is generated in the log entry itself?

I am currently trying to understand why the score of one of my servers is so volatile. I suspect there are some issues at the connectivity level, but it would help to know which monitors are active over time for a particular server, to try to detect some pattern.

Also, though not as critical for debugging, including the server’s stratum in a monitor’s log entries would be of interest, as would displaying it on the server detail page above the graph. Right now the stratum is shown on the “Manage Servers” main page, at least for most servers, but not on the detail page for a specific server, where it would be useful as well.

Thanks for your kind consideration!

@PoolMUC Originally I chose not to emphasize the stratum in the system because I didn’t want people to think of it as a metric of how good their server is (and motivate people to fuss with it), considering that in the protocol it primarily (?) exists for loop detection – as far as I understand, anyway.

The columns stored in the scoring log are a bit unpleasant to change across all the components and archives, so I want to be deliberate about it. (As a note to myself, this was the last time I added something: Add RTT data to archives when available · ntppool/archiver@95f213c · GitHub – the v2 monitor didn’t exist at the time.)

The testing/active status does make sense to store though. Are there any other metrics or attributes that’d be interesting to store? Adding the stratum will probably compress very well in ClickHouse, so that seems okay to add as well.
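
As a rough illustration of the proposal (a sketch only – these field names are hypothetical, not the actual scoring log schema):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical sketch of a scoring log row with the two proposed extra fields;
# the real column names and types in the pool's ClickHouse tables may differ.
@dataclass
class ScoreLogEntry:
    ts: datetime              # when the monitor checked the server
    server_ip: str
    monitor_id: int
    score: float
    rtt_ms: Optional[float]   # round-trip time, when available
    # proposed additions:
    monitor_status: str       # "active" or "testing" at the time of the check
    stratum: Optional[int]    # stratum from the NTP response; low cardinality,
                              # so it should compress well in a columnar store
```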

@stevesommars or @mlichvar, I think one of you suggested other metrics or values from the NTP response that’d be worth cataloging once upon a time?


@PoolMUC in terms of connectivity issues, I think the best thing I could add would be automatic traceroutes. I started on it a while ago[1], but got stuck overthinking how to store the results so it’d be possible to query the ClickHouse (for example) data to find common problems / connection paths. In an imaginary world where all the code existed, the monitors would traceroute at some appropriate interval when problems are encountered, and pool server operators would have a little daemon they could optionally install that’d do traceroutes in the opposite direction when certain conditions were met.

[1] I think more than 10 years ago, oops! I can’t find the original place this code lived, but the https://trace.ntppool.org/traceroute/8.8.8.8 site was originally set up to test the code I wrote for collecting the traceroute data. :slight_smile:

It is a bit off-topic here, but the theory is that you never store a byproduct of computation in an append-only log – only information that is generated at the edge of your computation realm, only data that you do not have control over. You should always be able to regenerate the required byproducts from the original data.

So just make the algorithm good enough to be able to regenerate the testing/active monitor status in a stable manner from the previous log entries.
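
A minimal sketch of that idea, assuming (hypothetically) that status changes are themselves recorded as events in the log – the actual promotion/demotion rules the pool uses are not reproduced here:

```python
# Sketch: derive a monitor's testing/active status at a point in time by
# replaying earlier log entries, instead of storing the status itself.
# The event names and the "monitors start in testing" rule are assumptions
# for illustration only.

def status_at(entries, monitor_id, ts):
    """entries: iterable of (time, monitor_id, event) tuples, in time order."""
    status = "testing"  # assumed starting state
    for when, mon, event in entries:
        if when > ts:
            break
        if mon != monitor_id:
            continue
        if event == "monitor_promoted":
            status = "active"
        elif event == "monitor_demoted":
            status = "testing"
    return status
```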

Root Dispersion is interesting. It is the calculated total error budget of the server’s time relative to the ultimate reference clock source(s); that is, the server believes correct UTC is within +/- rootdisp seconds. Root Delay might also be worth logging; it’s the cumulative round-trip time to the reference clock. Both values are signed 32-bit fixed point, with the most-significant 16 bits being seconds and the remainder being fractions of a second (so 0xffff would be 65535/65536 s).
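
For reference, here is a small sketch of converting those 16.16 wire-format values to seconds (the helper name is mine, and the value is treated as unsigned for simplicity):

```python
def ntp_short_to_seconds(value: int) -> float:
    """Convert a 32-bit NTP short-format value (root delay / root dispersion)
    to seconds: high 16 bits are whole seconds, low 16 bits are the fraction."""
    seconds = value >> 16
    fraction = value & 0xFFFF
    return seconds + fraction / 65536.0

# 0x0000FFFF -> 65535/65536 s, i.e. just under one second
assert abs(ntp_short_to_seconds(0x0000FFFF) - 65535 / 65536) < 1e-12
```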

A wrinkle is that those numbers accumulate up the chain of system peers (o or * in the ntpq peers billboard), but each ntpd actually calculates its time from multiple survivor sources, not its system peer alone.


All of the NTP fields are of interest; the base mode 3/4 fields require only 48 bytes. I’d store all responses plus the client Tx and Rx times (T1 and T4).
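
For context, those 48 bytes unpack into exactly the fields discussed below; a sketch following the RFC 5905 header layout (function and field names are mine):

```python
import struct

NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01

def parse_ntp_response(data: bytes) -> dict:
    """Unpack the 48-byte base NTP header (mode 3/4) per RFC 5905."""
    if len(data) < 48:
        raise ValueError("short NTP packet")
    (li_vn_mode, stratum, poll, precision,
     root_delay, root_disp, refid,
     ref_ts, origin_ts, rx_ts, tx_ts) = struct.unpack("!BBbb II 4s QQQQ", data[:48])
    to_unix = lambda ts: ts / 2**32 - NTP_EPOCH_OFFSET  # 64-bit NTP timestamp -> Unix seconds
    return {
        "leap": li_vn_mode >> 6,
        "version": (li_vn_mode >> 3) & 0x7,
        "mode": li_vn_mode & 0x7,
        "stratum": stratum,
        "poll": poll,
        "precision": precision,               # log2 seconds, usually negative
        "root_delay": root_delay / 65536.0,   # 16.16 fixed point, in seconds
        "root_dispersion": root_disp / 65536.0,
        "refid": refid,                       # 4 raw bytes
        "reference_ts": to_unix(ref_ts),
        "origin_ts": to_unix(origin_ts),      # T1 as echoed by the server
        "receive_ts": to_unix(rx_ts),         # T2
        "transmit_ts": to_unix(tx_ts),        # T3
    }
```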

Leap indicator is important. I don’t know if the monitor code detects stray LI=1 responses. I see examples every few months.

Without knowing the cause, I wouldn’t trust pool servers with stratum > 4.

Precision is important. There are some NTP servers that use the precision field as a measure of uncertainty. Prime examples are the LeoNTP appliances.

A reference ID set to FREE or LOCL/LCL or 127.0.0.1 might indicate that the server is running from an undisciplined clock.

I second Dave Hart’s comment about root delay and root dispersion.

Bad reference timestamps (timestamp too old, or timestamp > Tx timestamp) can cause clients to reject an NTP response.

Occasionally responses will have transmit timestamps < receive timestamps (a causality violation). One example is the stale T2 bug that was fixed long ago in ntpd.
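
Pulling together the checks mentioned in this thread, a hedged sketch of what a monitor could flag on a parsed response (`resp` is the dict from the parsing sketch above; the thresholds are illustrative, not the pool’s actual rules):

```python
def sanity_flags(resp: dict) -> list[str]:
    """Return a list of suspicious conditions found in one parsed NTP response."""
    flags = []
    if resp["leap"] == 3:
        flags.append("server unsynchronized (LI=3)")
    elif resp["leap"] == 1:
        flags.append("possible stray leap-second announcement (LI=1)")
    if resp["stratum"] == 0 or resp["stratum"] > 4:
        flags.append(f"questionable stratum {resp['stratum']}")
    if resp["refid"] in (b"FREE", b"LOCL", b"LCL\x00", bytes([127, 0, 0, 1])):
        flags.append("reference ID suggests an undisciplined local clock")
    if resp["reference_ts"] > resp["transmit_ts"]:
        flags.append("reference timestamp later than transmit timestamp")
    elif resp["transmit_ts"] - resp["reference_ts"] > 1024:  # arbitrary staleness bound
        flags.append("reference timestamp much older than transmit timestamp")
    if resp["transmit_ts"] < resp["receive_ts"]:
        flags.append("transmit before receive (causality violation, e.g. stale T2)")
    return flags
```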
