I would treat an incorrect sample as a time-out.
Additional request:
If bad samples are currently being filtered out when there are multiple samples in a measurement batch, please stop doing that, since we now have multiple monitoring locations. A single sample definitely gives a better picture of what is going on.
Out of curiosity, I did a bit of digging. It doesn’t change the fact that the monitor as such should catch this condition, but the bug actually seems to be imported from an external library used for the NTP interactions. The library code responsible for decoding the response uses the data length returned by a UDP packet read to shorten the receive buffer (initially 8192 octets, large enough to hold an NTP packet many times over) to the actual length of the data received, which I guess would be zero in this case. But that’s it; nothing else is done with the length, and in particular there’s no plausibility check (that I can see).
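Roughly, the pattern looks like this. This is only a sketch of what I mean, not the library’s actual code, and decodeNTP is a made-up stand-in for the real decoding step:

```go
package ntpsketch

import "net"

// decodeNTP is a hypothetical stand-in for the library's decoding step; it is
// only here so the sketch is self-contained.
func decodeNTP(data []byte) (any, error) { return nil, nil }

// readResponse sketches the pattern described above: a generously sized
// receive buffer is shortened to the number of octets the UDP read returned,
// and that slice is decoded straight away -- with no check that it could hold
// even a minimal packet.
func readResponse(conn net.Conn) (any, error) {
	buf := make([]byte, 8192) // many times larger than a plain 48-octet NTP packet

	n, err := conn.Read(buf)
	if err != nil {
		return nil, err
	}

	// For the misbehaving server, n is apparently 0; nothing below notices.
	return decodeNTP(buf[:n])
}
```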
I am not sufficiently familiar with Go and am still going to read up on it a bit, but given the outcome observed, it seems Go is happily “interpreting”/faking data that isn’t there. I.e., the buffer is interpreted as a struct, which in turn is used to access the various fields of the NTP packet without issue (probably using some default values for the data that isn’t actually there).
I understand Go is memory safe in the sense that it will not actually try to access memory locations that aren’t defined. But I would have expected at least some kind of feedback that one is trying to access data that isn’t actually there, rather than apparently working with some “default” values for such non-existent data.
Interesting.
Anyway, the upside is that the solution seems simple: just check that the amount of data returned from the UDP layer can actually hold at least a minimal NTP packet before proceeding to interpret that data as an NTP packet. The downside is that the issue is in an external library, so I’m not sure how fast we’ll get a fix, unless there is something that can be done on the NTP Pool monitoring code side.
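In code terms, I mean something along these lines. Again an illustrative sketch with made-up names, not the library’s actual code; 48 octets is the size of a plain NTPv4 packet without extension fields:

```go
package ntpsketch

import (
	"fmt"
	"net"
)

// minNTPPacketLen is the size of a plain NTPv4 packet without extension fields.
const minNTPPacketLen = 48

// readChecked sketches the proposed check: refuse to interpret the data as an
// NTP packet if the UDP read returned fewer octets than a minimal packet needs.
func readChecked(conn net.Conn) ([]byte, error) {
	buf := make([]byte, 8192)

	n, err := conn.Read(buf)
	if err != nil {
		return nil, err
	}
	if n < minNTPPacketLen {
		return nil, fmt.Errorf("short NTP response: %d octets, need at least %d", n, minNTPPacketLen)
	}
	return buf[:n], nil
}
```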
But I’m not sure how urgent a fix is; maybe the misbehaving server can be (temporarily) dropped from the pool by other means. Or whoever operates the server realizes what is going on and addresses it themselves (since it is known when the issue started, that might help reconstruct what triggered the misbehavior).
EDIT: I just realized that Ask has previously contributed to the upstream Go SNTP client library, though it seems it’s been a while since then. That should hopefully make it easier to get a fix.
My read is a bit different. I’m similarly incompetent with Go, but I think the error isn’t in the ntp library/module but in the monitor code. I think the library is returning io.EOF to the monitor code, which nonetheless looks at the response packet; that packet was zero-initialized at allocation and then left untouched by the attempted read. It’s just an educated guess, though.
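In code, the guess is roughly this shape. The names are made up and this is neither the actual library nor the actual monitor code, just an illustration:

```go
package main

import (
	"fmt"
	"io"
	"time"
)

// response stands in for the library's decoded NTP response struct; the real
// one has more fields, but these two are enough for the illustration.
type response struct {
	Stratum     uint8
	ClockOffset time.Duration
}

// query stands in for the library call in the guess above: the UDP read
// returns zero bytes, decoding reports io.EOF, and the response struct comes
// back still zero-initialized.
func query(server string) (*response, error) {
	return new(response), io.EOF
}

func main() {
	resp, err := query("bad.example.net")

	// If the caller only treats timeouts/network errors as failures, this
	// io.EOF slips through and the zero-valued response gets used: stratum 0,
	// offset 0s, which doesn't look like a failed measurement at all.
	_ = err
	fmt.Printf("stratum=%d offset=%s\n", resp.Stratum, resp.ClockOffset)
}
```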
Ah, right, I missed binary.Read() doing an implicit length check, thanks!
Makes much more sense that way…
So much the better when the fix doesn’t depend on an external project.
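For completeness, here is what that implicit check looks like in isolation (a standalone demonstration, not the library’s code): decoding an empty buffer fails with io.EOF, a truncated one with io.ErrUnexpectedEOF, so a zero-length UDP read does surface as an error from binary.Read.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// header is just an example fixed-size struct, loosely NTP-shaped.
type header struct {
	LiVnMode  uint8
	Stratum   uint8
	Poll      int8
	Precision int8
}

func main() {
	var h header

	// No data at all: binary.Read reports io.EOF.
	err := binary.Read(bytes.NewReader(nil), binary.BigEndian, &h)
	fmt.Println(err) // EOF

	// Some data, but less than the struct needs: io.ErrUnexpectedEOF.
	err = binary.Read(bytes.NewReader([]byte{0x23}), binary.BigEndian, &h)
	fmt.Println(err) // unexpected EOF
}
```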
Yeah, I think the monitor’s error handling expects a network error and ends up treating other errors as … not an error. Oops!
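In other words, something shaped roughly like this (illustrative only, not the actual monitor code):

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
)

// classify sketches the shape of the bug: only network errors (timeouts and
// the like) become failed samples, so anything else -- an io.EOF from a
// zero-length response, say -- falls through as if the query had worked.
func classify(err error) string {
	var netErr net.Error
	switch {
	case err == nil:
		return "ok"
	case errors.As(err, &netErr) && netErr.Timeout():
		return "timeout"
	default:
		return "ok" // oops: unexpected errors treated as ... not an error
	}
}

func main() {
	fmt.Println(classify(io.EOF)) // prints "ok" instead of a failure
}
```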
I disabled the mentioned server in the system and will get the software fix tested this weekend.
This should be fixed now; thanks everyone for helping find and diagnose it.
I added the erroneous server to the beta system and earlier today upgraded a few of the monitors there to a version with a fix.
Later I added a change to the monitoring API to make it recognize the data from not-yet-upgraded monitoring clients and score the server accordingly.
The production system had a few monitors upgraded earlier today too, and recently got the API upgrade as well.
In the ~100k tests done since the API was changed 30 minutes ago, no server other than the one reported in this thread showed this behavior.
Thanks for fixing this so quickly, @ask. Everything looks good now from my perspective.