Yes, it means there is not a leap second at the end of the current day. (Seeing as today is 2 January, this is expected.)
That flag is separate from the leapfile stuff, though; NTP can learn leap information from its upstream servers, and that will affect that flag. So, even if you were checking ntpq -c rv on the day of a leap second, you wouldn’t know if it said leap_add_sec because of a leap file or because it had learned it from other servers.
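If you want to watch that flag from a script, here is a minimal sketch that pulls the leap word out of the text printed by ntpq -c rv. The function name `leap_status` and the descriptions in the table are my own; only the four leap_* status words come from ntpq. As said above, the result cannot tell you whether the flag came from a leap file or from upstream servers.

```python
import re

# Leap-indicator words that ntpq prints in the system status line.
# The descriptions are paraphrases added for this example.
LEAP_WORDS = {
    "leap_none": "no leap second pending",
    "leap_add_sec": "a leap second will be inserted at the end of the day",
    "leap_del_sec": "a leap second will be deleted at the end of the day",
    "leap_alarm": "clock is not synchronized",
}


def leap_status(rv_output):
    """Return (word, description) for the first leap word found, else None."""
    for word in re.findall(r"leap_[a-z_]+", rv_output):
        if word in LEAP_WORDS:
            return word, LEAP_WORDS[word]
    return None
```

To use it, feed it the captured output of the command, for example `leap_status(subprocess.run(["ntpq", "-c", "rv"], capture_output=True, text=True).stdout)`.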
If you replaced the file more than a day ago and it hasn’t logged anything, I don’t think I have an answer for you, sorry, beyond updating to current software, asking someone else, or stracing ntpd’s file I/O.
Hm, those should only be easy to spot if they directly synchronize to smearing S1 servers.
If a machine synchronizes to an upstream server which in turn synchronizes to a smearing server, then this can’t be detected so easily. If the machine then gets smeared time from upstream and also has a leap second file configured, then the machine is probably 1 s off after the leap second / smearing, and I’d expect the machine’s time to be stepped at some point thereafter.
Thanks for this analysis, Ask, it’s really interesting.
Is there any way to estimate how many clients had bad time because of pool servers not doing leap second correctly? My naive guess is it’s about 1% of NTP Pool clients, or approximately 100,000 machines.
My math could easily be wrong though… Here’s how I got there. Clients frequently have 4 random pool servers configured (Linux defaults). So if they have 2 or more bad servers the consensus algorithm won’t help and they may have bad time. 442 servers had bad time, roughly 10% of the pool. So there’s a 10% * 10% = 1% chance of a client having at least 2 bad pool servers and therefore bad time. There’s 5–15M clients using the NTP pool, so 1% of 10M is 100,000. Like I said, could easily be wrong, but that’s what I wrote on my napkin here.
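For anyone who wants to check the napkin, here is a small sketch of the same estimate done with the binomial distribution. It keeps the assumptions from the post (4 servers per client, each independently bad with probability 10%, a client is in trouble with 2 or more bad servers, 10M clients). The helper name `p_at_least` is made up for this example. Note that 10% * 10% is the chance that two particular servers are both bad; counting every way of getting at least 2 bad out of 4 gives a larger figure.

```python
from math import comb


def p_at_least(k, n, p):
    """Probability of at least k successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


p_bad = 0.10          # assumed share of pool servers that got the leap second wrong
n_servers = 4         # servers configured per client (Linux default, per the post)
clients = 10_000_000  # midpoint of the 5-15M figure quoted in the post

p_client = p_at_least(2, n_servers, p_bad)
print(f"P(at least 2 of 4 bad) = {p_client:.4f}")  # about 0.0523
print(f"Affected clients: about {p_client * clients:,.0f}")
```

Under these assumptions the share comes out at roughly 5% rather than 1%. The independence assumption is doubtful if the bad servers are clustered in particular zones.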
That estimate seems a bit too low to me. On my server in a small country zone I see that it had more than 4M clients (unique IP addresses) in the last 24 hours. I’m not sure how much of that is from Snapchat, but globally I’d expect the number of clients to be much higher.
I suspect the true number of clients is in the hundreds of millions or more, most of them SNTP clients. They will likely all easily get messed up by a bad time stamp. Most of the clients are not very thoughtfully made, is my impression.
At least tens of millions of Linux installations (NTP clients) seems reasonable, too. For them your percentage estimate seems reasonable.
However I suspect the pain was unevenly distributed. I haven’t looked but likely the “got leap wrong” servers weren’t evenly distributed through the zones/regions.
Thanks for the feedback on my estimate. I forgot about the SNTP clients, they will certainly behave differently. But maybe 1% of ntpd installations is reasonable. Seems like a big enough number to contemplate preventing this problem next time in roughly two years.
I know it’s difficult to estimate the number of NTP Pool clients. I took the 5-15M number from the Wikipedia page which cites an NTP Pool page for it. But those numbers are from 2011!
If anyone else is interested, perhaps we could discuss the question of how to measure the client population in a new forum post. I’m out of date on techniques for this but would be glad to contribute some grunt work with data processing.
For the estimate to be realistic, you also need to take into account 2nd or even 3rd generation clients.
It’s not uncommon for networks to use a gateway machine running ntpd to then provide time to all the other machines on the network (usually using NAT, so we can’t really do more than guess how many of them there may be).
My current network is tiny, with only about a dozen machines (plus a few devices using sntp), and in the days before I had a GPS PPS source, I still only used the gateway to get time from the pool - everything else got time from the gateway.
Before retiring, I ran far bigger networks, and used a similar arrangement, although with a variable number of external gateways, depending on the need for redundancy. In those cases, there were 100s or 1000s of machines behind each gateway getting time from the pool - which I suggest is not unusual.