Holes in the CSV log

I had some hiccups in the upstream NTP synchronization on one of my servers today. While investigating this, I stumbled upon some strange holes in the CSV log of the NTP Pool score page - some lines in the log are just empty.

You can see it at https://www.ntppool.org/scores/82.165.57.232/log?limit=2000&monitor=* (increase the limit until you reach the times): today (2023-11-22) starting at 02:43:46, 07:34:43 and 14:03:00 - which is exactly when the hiccups mentioned above occurred.

Some context: I recently moved the server to chrony, and at those three points in time it logged the error “Can’t synchronise: no majority”. My question for this thread is what happened to the pool logs during that, and whether the disappearing log lines are a chrony issue or a pool monitoring issue.

It is still happening… Screenshot to show you what I mean:

[screenshot from https://www.ntppool.org/scores/82.165.57.232/log]

Interestingly, the missing lines still seem to be present on the scoring server, since they appear when filtering for individual monitors:

[screenshot of the log filtered to a single monitor]

I had a “Can’t synchronise: no majority” event at 05:09:22, so I’m pretty sure it has something to do with that. Still, whatever my NTP server is doing should have no effect on the ability of the pool to produce logs.

edit: I just found out that you can set the log output format to JSON, and in the JSON output the missing entries are present:

    {
      "ts": 1700801965,
      "monitor_id": "44",
      "step": 1,
      "offset": 0.00194526,
      "score": 18.9897543204644
    },
    {
      "score": 14.0402666044412,
      "ts": 1700802870,
      "offset": -0.106221033,
      "monitor_id": "44",
      "step": -4
    },
    {
      "score": 14.3382532742191,
      "ts": 1700803851,
      "step": 1,
      "monitor_id": "44",
      "offset": 0.002773747
    },

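(For anyone who wants to poke at the JSON output programmatically, here is a minimal sketch in Go. The exact URL/parameter for the JSON format and the top-level response shape are assumptions on my part - take the real link from the scores page - but the per-entry fields match the excerpts above.)

    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
    )

    // Entry mirrors the per-entry fields visible in the JSON log above.
    type Entry struct {
        TS        int64    `json:"ts"`
        MonitorID string   `json:"monitor_id"`
        Step      float64  `json:"step"`
        Offset    *float64 `json:"offset"` // can be null
        Score     float64  `json:"score"`
    }

    func main() {
        // Assumed URL shape - use whatever the scores page actually links to.
        url := "https://www.ntppool.org/scores/82.165.57.232/json?limit=500&monitor=*"

        resp, err := http.Get(url)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        // Assumed wrapper object; adjust if the endpoint returns a bare array.
        var payload struct {
            History []Entry `json:"history"`
        }
        if err := json.NewDecoder(resp.Body).Decode(&payload); err != nil {
            panic(err)
        }

        for _, e := range payload.History {
            if e.MonitorID == "44" { // one of the monitors in the excerpt above
                fmt.Println(e.TS, e.Step, e.Score)
            }
        }
    }
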
Just had this happening to me as well, albeit at much smaller scale:

CSV:

    1701118337,"2023-11-27 20:52:17",1,13.6952502613083,24,recentmedian,
    1701118335,"2023-11-27 20:52:15",1,15.4493213746965,24,recentmedian,
    1701118271,"2023-11-27 20:51:11",1,15.4493213746965,24,recentmedian,

JSON:

  {
  	"monitor_id": "24",
  	"step": 1,
  	"ts": 1701118271,
  	"offset": null,
  	"score": 15.4493213746965
  },
  {
  	"monitor_id": "51",
  	"ts": 1701118335,
  	"step": -4,
  	"score": 14.0019005096129,
  	"offset": -1765036.21504558
  },
  {
  	"step": 1,
  	"ts": 1701118335,
  	"offset": null,
  	"score": 15.4493213746965,
  	"monitor_id": "24"
  },
  {
  	"monitor_id": "42",
  	"ts": 1701118337,
  	"step": -4,
  	"offset": -1765036.23765379,
  	"score": 12.3764371800933
  },
  {
  	"score": 13.6952502613083,
  	"offset": null,
  	"ts": 1701118337,
  	"step": 1,
  	"monitor_id": "24"
  },

chronyd messages at that time (after an unusual, spontaneous device restart):

Tue Nov 7 10:35:01 2023 daemon.warn chronyd[669]: System clock wrong by 1765036.218890 seconds, adjustment started
Mon Nov 27 20:52:17 2023 daemon.warn chronyd[669]: System clock was stepped by 1765036.218890 seconds
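
For what it's worth, the reported step matches the gap between those two log lines exactly - i.e. the clock came back up at the pre-restart time and was later stepped forward by roughly 20.4 days. Quick sanity check (plain Go, nothing pool-specific):

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        // Timestamps from the two chronyd messages above (the time zone is
        // irrelevant for the difference).
        const layout = "2006-01-02 15:04:05"
        logged, _ := time.Parse(layout, "2023-11-07 10:35:01")
        stepped, _ := time.Parse(layout, "2023-11-27 20:52:17")

        gap := stepped.Sub(logged)
        step := 1765036.218890 // seconds, as reported by chronyd

        fmt.Printf("gap between messages: %v = %.0f s\n", gap, gap.Seconds())
        fmt.Printf("reported step:        %.0f s (~%.1f days)\n", step, step/86400)
    }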

Thus it indeed seems something in chronyd’s response in such circumstances is causing those holes in the CSV log.

Unfortunately, the CSV and JSON logs do not merely represent different formats of the same data; the JSON log seems to contain only a subset of the information in the CSV log.

E.g., in this case, the value of the leap indicator and any kiss code from the server would have been interesting, but those are missing from the JSON log. And since the issue to be investigated is precisely that the corresponding CSV entries are missing entirely, the CSV log obviously isn’t of much help here either…

Not relevant here, or perhaps at all, but I was curious to notice that the sort order differs between the two log variants: most recent entry first in the CSV log, but most recent entry last in the JSON log.


Apologies I didn’t get to this sooner, when it might have been more relevant.

The JSON API was really just meant to power the graphs, hence it shows a bit different data (and the CSV logs more as “logs”, I suppose).

You can find your server ID next to the CSV link on the scores page and then query the full data set in Google BigQuery:

SELECT
  EXTRACT(year FROM ts) AS year,
  EXTRACT(month FROM ts) AS month,
  AVG(score) AS avg_score,
  AVG(offset) AS avg_offset,
  COUNT(*) AS samples
FROM ntppool.ntppool.log_scores
WHERE server_id = 57259
GROUP BY 1, 2
ORDER BY 1, 2;

To include data from before 2020, use

from `ntppool.ntppool.log_scores*`

(I can’t figure out how to link to a working example; I think opening the “console” from that link, clicking “create SQL query”, and pasting the example above might work.)
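
(If the console route is too fiddly, the standard BigQuery Go client can run the same query programmatically; you need your own GCP project to bill the query to - the project name below is a placeholder.)

    package main

    import (
        "context"
        "fmt"

        "cloud.google.com/go/bigquery"
        "google.golang.org/api/iterator"
    )

    func main() {
        ctx := context.Background()

        // Placeholder: your own GCP project, used to run/bill the query.
        client, err := bigquery.NewClient(ctx, "your-gcp-project")
        if err != nil {
            panic(err)
        }
        defer client.Close()

        q := client.Query(`
            SELECT EXTRACT(year FROM ts) AS year, EXTRACT(month FROM ts) AS month,
                   AVG(score) AS avg_score, AVG(offset) AS avg_offset, COUNT(*) AS samples
            FROM ntppool.ntppool.log_scores
            WHERE server_id = 57259
            GROUP BY 1, 2
            ORDER BY 1, 2`)

        it, err := q.Read(ctx)
        if err != nil {
            panic(err)
        }
        for {
            var row []bigquery.Value
            err := it.Next(&row)
            if err == iterator.Done {
                break
            }
            if err != nil {
                panic(err)
            }
            fmt.Println(row)
        }
    }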

One thing I wanted to point out is that the log has both the monitors and the “scorer” (id 24). The scorer doesn’t add data for every monitoring data point; it skips adding another entry if it’s been less than 15 minutes AND the score is substantially the same.
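
In rough pseudocode the rule is something like the sketch below; the "substantially the same" threshold here is made up for illustration, the real cut-off lives in the scorer code.

    package main

    import (
        "fmt"
        "math"
        "time"
    )

    // shouldRecord sketches the behaviour described above: write a new
    // log_scores row for the scorer only if the previous one is at least
    // 15 minutes old OR the score has moved by more than some threshold.
    func shouldRecord(lastTS time.Time, lastScore float64, now time.Time, newScore float64) bool {
        if now.Sub(lastTS) >= 15*time.Minute {
            return true
        }
        return math.Abs(newScore-lastScore) > 0.5 // illustrative threshold
    }

    func main() {
        last := time.Now().Add(-10 * time.Minute)
        fmt.Println(shouldRecord(last, 15.4, time.Now(), 15.5)) // false: recent and barely changed
        fmt.Println(shouldRecord(last, 15.4, time.Now(), 14.0)) // true: score moved noticeably
    }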

Thanks! That’s exactly the info I needed.

I put in this query:

SELECT
  *
FROM ntppool.ntppool.log_scores
WHERE server_id = 57259
  AND error IS NOT NULL
  AND error != "i/o timeout";

And now I know: the error message for the missing lines is
bad stratum 0 (referenceID: 0x0, \u0000\u0000\u0000\u0000)
which seems to fit, since chrony’s source code looks to me like it will answer with stratum 0 when it can’t synchronize to an upstream time source.

That still leaves the question of why the NTP Pool has trouble showing this error message in the normal log. Probably a bug somewhere in the log content parsing? I took a look at the code, but I could not immediately figure out where it breaks.

The NTPv4 RFC indicates that a stratum value of zero in a packet on the wire “customarily” maps to the value 16 in an implementation’s internal state variables, which in turn indicates a non-synchronized state.

So the error message “bad stratum” seems not fully right: while zero does not represent an actual stratum value, it is a valid value as per figure 11 of the RFC, representing a special condition (namely the stratum being “unspecified or invalid”).

Though that figure arguably is a bit fuzzy, as it seems to show packet stratum values and what they mean, and as such it gives another definition of the value zero: according to the text following the figure, the value 16 “customarily” shouldn’t appear on the wire, and the value zero should appear instead. I.e., as per the description after the figure, the value zero should be interpreted as “unsynchronized” (various other places give a similar indication, e.g., the leap indicator field).
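
For reference, both fields sit in the first two bytes of the packet, so they are easy to inspect directly. A minimal sketch of the interpretation discussed above (the sample bytes are made up to look like an unsynchronized chrony response):

    package main

    import "fmt"

    // interpretHeader pulls the leap indicator and stratum out of a raw NTPv4
    // packet (RFC 5905): the top two bits of byte 0 are the LI, byte 1 is the
    // stratum.
    func interpretHeader(pkt []byte) (leap, stratum uint8, synced bool) {
        leap = pkt[0] >> 6
        stratum = pkt[1]
        // LI=3 ("alarm condition") or stratum 0 ("unspecified or invalid")
        // both signal that the server is not synchronized.
        synced = leap != 3 && stratum != 0
        return
    }

    func main() {
        // 0xe4 = LI 3, version 4, mode 4 (server); stratum 0.
        pkt := []byte{0xe4, 0x00}
        leap, stratum, synced := interpretHeader(pkt)
        fmt.Printf("leap=%d stratum=%d synced=%v\n", leap, stratum, synced)
    }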

I hope those responses also have leap=0x3 (0b11), which is how ntpd indicates it is unsynchronized.

Yes they do. (This is the first time I’ve seen anything in a leap field!)

Hm, curious! I wonder if the Perl code generating the CSV couldn’t handle the \u0000 bytes.
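
If that is what happened, the fix is presumably some variant of stripping control characters before the row is written out; a sketch of the idea (purely illustrative, not the pool's actual code):

    package main

    import (
        "encoding/csv"
        "os"
        "strings"
    )

    // sanitize removes NUL and other control characters that can trip up
    // CSV generators or consumers.
    func sanitize(s string) string {
        return strings.Map(func(r rune) rune {
            if r < 0x20 && r != '\t' {
                return -1 // drop the character
            }
            return r
        }, s)
    }

    func main() {
        w := csv.NewWriter(os.Stdout)
        // Illustrative row: an error column carrying the raw reference ID bytes.
        w.Write([]string{"1700801965", "-4", sanitize("bad stratum 0 (referenceID: 0x0, \x00\x00\x00\x00)")})
        w.Flush()
    }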

It was also surprisingly slow (mostly because of some of the abstractions I’d used for managing the data), so last weekend I switched to a Go based program for the CSV and JSON APIs. My main motivation was a slow-moving project to serve this data out of ClickHouse instead of the MySQL database (and eventually not have the scores go through the MySQL database at all…), but it being faster and maybe having fixed this bug is a nice bonus. 🙂

If you see unexpected data from the website or the HTTP APIs, sending me the Traceparent header from the response is helpful for diagnostics (within a week or so of making the request, while the telemetry and log data is still kept). This is new this year; for the main website the Traceparent header has only been there since sometime last month.
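
(For anyone wondering how to capture it: the header is on the HTTP response itself, so something like this works.)

    package main

    import (
        "fmt"
        "net/http"
    )

    func main() {
        resp, err := http.Get("https://www.ntppool.org/scores/82.165.57.232/log")
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        // Header lookup in net/http is case-insensitive.
        fmt.Println("Traceparent:", resp.Header.Get("Traceparent"))
    }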
