Monitoring seems broken

The actual score does not match the results of the active monitoring servers.

The server's reachability is very good. However, the server's clock regularly loses synchronization. This is reflected in the overall score with only a slight penalty.
There seems to be no problem with the monitoring.

The monitoring scores are all 19+ while the overall score is 17.6.

As far as I know, the overall score should be calculated as the median of the active monitors' scores, which would be 19.5.
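To illustrate the expectation, here is a minimal sketch of a median over the active monitors. The individual scores are hypothetical, chosen only to be "all 19+" with a median of 19.5, and `overall_score` is an illustrative helper, not the pool's actual scorer code.

```python
from statistics import median

def overall_score(active_scores):
    """Median of the active monitors' scores (illustrative helper)."""
    return median(active_scores)

# Hypothetical per-monitor scores; the thread only says they are all 19+.
active_scores = [19.2, 19.4, 19.5, 19.6, 19.8]

expected = overall_score(active_scores)
print(expected)
```

With monitor scores like these, a displayed overall score of 17.6 cannot be a current median, which suggests a stale value rather than a real penalty.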


The brief hole in the graphs that started at 03:15 today is probably not a good sign…

Yes, it should be. The CSV log linked to at the bottom of the page tells me what’s wrong. Normally every other line in that log is a “recentmedian” entry. In the default 200 CSV entries I see none of those.

I had to go back almost 15 hours to find the time of the breakage:

1698909520,"2023-11-02 07:18:40",0.00355772,1,18.3866995980403,32,fihel1-z4ytm9,0,
1698909388,"2023-11-02 07:16:28",0.002301398,1,18.4881010425801,35,sgsin3-1a6a7hp,0,
1698909230,"2023-11-02 07:13:50",0.006134328,1,18.4137887694582,19,usewr1-1a6a7hp,0,
1698900169,"2023-11-02 04:42:49",,1,17.5839130427706,24,recentmedian,,
1698900169,"2023-11-02 04:42:49",-0.010692684,1,17.9205346138958,44,nlams2-19sfa9p,0,
1698900158,"2023-11-02 04:42:38",,1,17.5839130427706,24,recentmedian,,
1698900158,"2023-11-02 04:42:38",-0.012126799,1,17.4806799201151,34,denue1-z4ytm9,0,
1698900091,"2023-11-02 04:41:31",,1,17.5839130427706,24,recentmedian,,
1698900091,"2023-11-02 04:41:31",-0.012241447,1,17.5839130427706,54,defra3-jsdnqw,0,
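The same check can be done with a few lines of code instead of scrolling through the log. This is only a sketch: it assumes the log is ordered newest first and that the monitor name sits in the seventh column, as in the excerpt above. `last_recentmedian` is a hypothetical helper name.

```python
import csv
import io

# The excerpt quoted above, newest entry first.
LOG = '''\
1698909520,"2023-11-02 07:18:40",0.00355772,1,18.3866995980403,32,fihel1-z4ytm9,0,
1698909388,"2023-11-02 07:16:28",0.002301398,1,18.4881010425801,35,sgsin3-1a6a7hp,0,
1698909230,"2023-11-02 07:13:50",0.006134328,1,18.4137887694582,19,usewr1-1a6a7hp,0,
1698900169,"2023-11-02 04:42:49",,1,17.5839130427706,24,recentmedian,,
1698900169,"2023-11-02 04:42:49",-0.010692684,1,17.9205346138958,44,nlams2-19sfa9p,0,
1698900158,"2023-11-02 04:42:38",,1,17.5839130427706,24,recentmedian,,
1698900158,"2023-11-02 04:42:38",-0.012126799,1,17.4806799201151,34,denue1-z4ytm9,0,
1698900091,"2023-11-02 04:41:31",,1,17.5839130427706,24,recentmedian,,
1698900091,"2023-11-02 04:41:31",-0.012241447,1,17.5839130427706,54,defra3-jsdnqw,0,
'''

NAME_COL = 6  # assumed position of the monitor name column

def last_recentmedian(log_text):
    """Return (rows_skipped, row) for the newest "recentmedian" row.

    rows_skipped counts the newer rows that precede it. If no such
    row exists, returns (total_rows, None).
    """
    rows = [r for r in csv.reader(io.StringIO(log_text)) if r]
    for i, row in enumerate(rows):
        if row[NAME_COL] == "recentmedian":
            return i, row
    return len(rows), None

skipped, row = last_recentmedian(LOG)
print(skipped, row[1], row[4])
```

On this excerpt, the newest "recentmedian" row is the one stamped 04:42:49 with a score of about 17.58, and every entry after it comes from an individual monitor only.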

Ugh, indeed – I’m not sure why monitoring didn’t send alerts (yellow is the production system).

Ok, monitoring fixed (at least for this scenario) and I restarted the scorer with debugging enabled and it’s happily working away. Thanks for the note(s), everyone!

There was an unrelated problem that I fixed yesterday that made some of the processes hang for a little bit. It didn’t give any indication that it wasn’t working (or kube would have restarted it), and I obviously didn’t look closely enough. (That particular process usually has logging essentially disabled because it’s super noisy otherwise; I’ll adjust it to some middle ground.)

@Sebhoster – that was a cool looking graph by the way 🙂

update: eh, no – it’s still broken. I will load up the code in a little bit and figure it out.

update 2: fixed! It will take a little bit to catch up.

