Leap second 2017 status

#12

For NTP clients, not handing out these servers in the last day before the leap is likely too late. Many ntpd instances will have been running for weeks or months already.

SNTP clients don’t care about the leap second, but it’d potentially protect them from talking to a confused server just after the leap second (until the monitoring system has kicked those out anyway).

Now, I’m not sure what the overlap between “didn’t announce the leap second” and “was off by a second after the event” was (it’s in the data, though, if anyone is up for trying to figure that out). I’m sure it’s less than 100%.

The downside to implementing something like that is that the pool system now has to get all this correct, beyond just having the right time. It wouldn’t be a lot of code to implement, but it doesn’t “feel good”. It’s something that’s hard to test and has to work – or the whole system would go kaboom[1].

Since it won’t help NTP clients (as opposed to SNTP clients), it doesn’t seem worth the complexity.

This leap second I was around, had time and a good internet connection, but generally I optimize for “will work without me sitting there paying attention”. The NTP Pool doesn’t exactly have a NOC or 24/7 staff. Or staff. :slight_smile:

I think there are other things that are more likely to be beneficial, in particular things that are actionable weeks or months in advance. For example we could track the refid/upstream of each server, learn which ones mess up the leap second and warn operators to not use those as upstream servers. Maybe.

[1] (Which I suppose is the story of leap seconds).

0 Likes

#13

Hrm, you’re right.

Agreed. :frowning:

It would be pretty much set-it-and-forget-it. Update leap-seconds.list occasionally, by hand or by cron. Then cross your fingers and hope it doesn’t go disastrously wrong. :smile:
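For what it’s worth, the “update by cron and hope” part can at least be sanity-checked. Here is a minimal sketch, assuming the usual leap-seconds.list layout (a `#@` line holding the expiry and data lines of NTP timestamp plus TAI offset, both counted from 1900). The sample values and the names `parse_leap_file` and `is_usable` are illustrative only, not anything the pool or ntpd ships:

```python
NTP_TO_UNIX = 2208988800  # seconds between 1900-01-01 and 1970-01-01

# Abbreviated sample in the leap-seconds.list layout (illustrative values;
# check them against a real copy of the file before relying on them).
SAMPLE = """\
#@ 3707596800
3644697600 36 # 1 Jul 2015
3692217600 37 # 1 Jan 2017
"""

def parse_leap_file(text):
    """Return (expiry as Unix time or None, [(leap instant as Unix time, TAI offset)])."""
    expires = None
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#@"):
            # Expiry line: NTP-era timestamp after the "#@" marker.
            expires = int(line[2:].split()[0]) - NTP_TO_UNIX
        elif line and not line.startswith("#"):
            fields = line.split()
            entries.append((int(fields[0]) - NTP_TO_UNIX, int(fields[1])))
    return expires, entries

def is_usable(text, now_unix):
    """True if the file parsed, has at least one entry, and has not expired."""
    expires, entries = parse_leap_file(text)
    return expires is not None and bool(entries) and now_unix < expires

expires, entries = parse_leap_file(SAMPLE)
print(expires, entries[-1])
```

A cron job could refuse to install a freshly downloaded file when `is_usable` returns false, which covers the “disastrously wrong” case of an empty or truncated download.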

Mm. That sounds very complex for something with limited accuracy, though.

Edit: And there’s no way to predict operators who use problematic upstreams but configure leap-seconds.list by hand. Or only use 49% problematic upstreams, so they’d still be alright.

0 Likes

#14

@ask

On the other side of the leap second, what to do about servers that are now 1 second off? Under the current scheme, it can take them about 100 minutes [edit: or about 110] to leave the pool. For example:

1483233900,"2017-01-01 01:25:00",0.987508058547974,-2,8.9,0
1483232417,"2017-01-01 01:00:17",0.991002321243286,-2,11.4,0
1483231040,"2017-01-01 00:37:20",0.99502968788147,-2,14.1,0
1483229629,"2017-01-01 00:13:49",0.999479174613953,-2,17,0
1483228172,"2016-12-31 23:49:32",-0.00108015537261963,1,20,0 

From Monitor.pm:

           if ($offset_abs > 3 or $status->{stratum} >= 8) {
               $step = -4;
               # [...]
           }
           elsif ($offset_abs > 0.75) {
               $step = -2;
           }

What if it was changed to take 5-10 points if they’re > 0.9 seconds off? Or 0.9 - 1.1?

That would be a simple change, and beneficial to SNTP clients from around 00:30-01:30 after a leap second.
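To put rough numbers on that, here is a small sketch. It rests on two assumptions that are not taken from the Monitor.pm source: the scores in the log lines above look like they follow `new = 0.95 * old + step` (20 → 17 → 14.1 → 11.4 → 8.9 fits that), and a server is taken to leave the pool once its score drops below 10. `checks_until_dropped` is a made-up helper name.

```python
def checks_until_dropped(score, step, threshold=10.0, decay=0.95, max_checks=1000):
    """Count monitoring checks until the score falls below the threshold.

    Assumes the recurrence score = decay * score + step, inferred from the
    quoted log lines, and a pool-inclusion threshold of 10 (an assumption).
    """
    checks = 0
    while score >= threshold and checks < max_checks:
        score = decay * score + step
        checks += 1
    return checks

for step in (-2, -5, -10):
    print(step, checks_until_dropped(20.0, step))
```

Under those assumptions a server starting at a score of 20 needs four bad checks at -2, two at -5, and one at -10. With checks roughly every 15 to 25 minutes, that is the difference between leaving after the fourth post-leap check and leaving after the first or second.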

More complicated rules might be better, but that should be good enough. (Only punish ~1.0 second servers on a leap second day? Don’t be so harsh if they already have a low score, so they don’t get to -100 in a few hours?)
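One way those ideas might fit together, purely as a sketch: the -4 and -2 branches mirror the excerpt quoted above, while the leap-day branch, its -10 value, the `leap_day` flag and the “go easy below 10” cutoff are all hypothetical. Branches not shown in the excerpt are left out and return `None`.

```python
def proposed_step(offset, stratum, score, leap_day):
    """Hypothetical scoring step; only the -4 and -2 cases come from the quoted excerpt."""
    offset_abs = abs(offset)
    if offset_abs > 3 or stratum >= 8:
        return -4.0  # as in the quoted excerpt
    if leap_day and 0.9 <= offset_abs <= 1.1:
        # Hypothetical rule: hit ~1.0 s offsets hard on a leap second day,
        # but fall back to the normal penalty once the score is already low.
        return -10.0 if score >= 10 else -2.0
    if offset_abs > 0.75:
        return -2.0  # as in the quoted excerpt
    return None  # remaining branches are not shown in the excerpt

print(proposed_step(0.9995, 2, 17.0, leap_day=True))
print(proposed_step(0.9995, 2, 5.0, leap_day=True))
```

Falling back to -2 for servers that are already below the cutoff keeps a stuck server from sinking much faster than it does today.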

Edit: Or have the monitoring system run a special check a few seconds after a leap second, instead of waiting up to ~15 minutes for the normal scheduled check.

Edit (06:46): I unwittingly edited out three of the more important lines of code.

0 Likes

#15

Having just recently read the NTP Best Current Practices document, I have plans to pull in the leap-seconds.list using cron. Does ntpd have to be restarted afterwards or does it occasionally take a peek over there to see if there is anything new?

0 Likes

#16

How can you see this?

0 Likes

#17

Modern versions (4.2.8; maybe 4.2.6) check the file automatically. (Once a day, i think; maybe once an hour.)

If 4.2.6 doesn’t do it automatically, you can still use ntpq (maybe ntpdc) to reload it, if you have the control key stuff set up.

0 Likes

#18

Okay, looks like I’m using 4.2.6 here (CentOS 7). Is there a way to know whether or not it’s looking? Is it in the logs or is there a command I can run to see?

0 Likes

#19

It’s been years since i ran 4.2.6 with a leap second file, if i ever did…

4.2.8 includes leap file information in ntpq -c rv or, more specifically, ntpq -c "rv 0 leapsec expire". Check if 4.2.6p5 does the same?

I think it also logs a message when the file has changed, but i’m not certain.

0 Likes

#20

Hmmm, I see “leap_none”, not sure if that means it’s not expecting a leap second or… ?

[root@server ~]# ntpq -c rv
associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
version="ntpd 4.2.6p5@1.2349-o Mon Nov 14 18:25:09 UTC 2016 (1)",
processor="x86_64", system="Linux/4.8.6-x86_64-linode78", leap=00,
stratum=2, precision=-24, rootdelay=1.484, rootdisp=63.680,
refid=200.98.196.212,
reftime=dc145294.b2a902b9  Mon, Jan  2 2017  4:16:20.697,
clock=dc145a48.cf2a1ab9  Mon, Jan  2 2017  4:49:12.809, peer=23189,
tc=10, mintc=3, offset=-3.633, frequency=-7.465, sys_jitter=20.045,
clk_jitter=11.233, clk_wander=1.035
[root@server ~]# ntpq -c "rv 0 leapsec expire"
No system variables returned

0 Likes

#21

Yes, it means there is not a leap second at the end of the current day. (Seeing as today is 2 January, this is expected.)

That flag is separate from the leapfile stuff, though; NTP can learn leap information from its upstream servers, and that will affect that flag. So, even if you were checking ntpq -c rv on the day of a leap second, you wouldn’t know if it said leap_add_sec because of a leap file or because it had learned it from other servers.
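If you want to watch that flag from a script rather than by eye, here is a rough sketch that pulls the `key=value` pairs out of `ntpq -c rv` output like the one pasted above. The two-bit meanings follow the NTP leap indicator; the regex, `parse_rv` and `LEAP_MEANING` are my own illustration, and I’m assuming ntpq prints `leap` as two binary digits, as it does in the paste.

```python
import re

# First lines of the output pasted earlier in the thread.
SAMPLE = '''associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
version="ntpd 4.2.6p5@1.2349-o Mon Nov 14 18:25:09 UTC 2016 (1)",
processor="x86_64", system="Linux/4.8.6-x86_64-linode78", leap=00,
stratum=2, precision=-24, rootdelay=1.484, rootdisp=63.680,
refid=200.98.196.212,'''

# NTP leap indicator values, written the way ntpq shows them.
LEAP_MEANING = {
    "00": "no leap second pending",
    "01": "insert a second at the end of the day",
    "10": "delete a second at the end of the day",
    "11": "clock not synchronized",
}

def parse_rv(text):
    """Collect key=value pairs; values are either quoted strings or bare tokens."""
    pairs = re.findall(r'(\w+)=("[^"]*"|[^,\s]+)', text)
    return {key: value.strip('"') for key, value in pairs}

variables = parse_rv(SAMPLE)
print(variables["leap"], LEAP_MEANING[variables["leap"]])
```

As noted above, this only tells you what the flag currently is, not whether it came from a leap file or from upstream servers.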

0 Likes

#22

Okay, well, thanks for the information. Perhaps next time there is a leap second I can do some testing. Guess I’ll be digging more into what ntpq can tell me.

0 Likes

#23

Mm.

If you replaced the file more than a day ago and it hasn’t logged anything, i don’t think i have an answer for you. Sorry. :confounded: Besides updating to current software, or asking someone else. :disappointed_relieved: Or running strace on ntpd to watch its file I/O.

0 Likes

#24

Ugh, it’s wayyyy too late for that. :sleeping:

0 Likes

#25

The system records the leap flag in the “monitor log” if it’s set. (And then promptly doesn’t do anything else with it; not yet, anyway.)

0 Likes

#26

Hm, those should only be easy to spot if they directly synchronize to smearing S1 servers.

If a machine synchronizes to an upstream server which in turn synchronizes to a smearing server, then this can’t be detected so easily. If the machine then gets smeared time from upstream and also has a leap second file configured, then the machine is probably 1 s off after the leap second / smearing, and I’d expect the machine’s time to be stepped at some point thereafter.

1 Like

#27

Thanks for this analysis, Ask, it’s really interesting.

Is there any way to estimate how many clients had bad time because of pool servers not handling the leap second correctly? My naive guess is it’s about 1% of NTP Pool clients, or approximately 100,000 machines.

My math could easily be wrong though… Here’s how I got there. Clients frequently have 4 random pool servers configured (Linux defaults). So if they have 2 or more bad servers the consensus algorithm won’t help and they may have bad time. 442 servers had bad time, roughly 10% of the pool. So there’s a 10% * 10% = 1% chance of a client having at least 2 bad pool servers and therefore bad time. There’s 5–15M clients using the NTP pool, so 1% of 10M is 100,000. Like I said, could easily be wrong, but that’s what I wrote on my napkin here.
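As a cross-check on the napkin, here is the same estimate done as a binomial. It assumes each of the 4 servers is bad independently with probability 10%, and that 2 or more bad servers out of 4 is enough to mislead a client; both are the simplifications from the paragraph above, not measured facts. `p_at_least` is just a helper name.

```python
from math import comb

def p_at_least(k, n, p):
    """Probability of at least k 'bad' picks out of n independent picks."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p_bad_client = p_at_least(2, 4, 0.10)
print(f"{p_bad_client:.4f}")               # chance of 2+ bad servers out of 4
print(round(p_bad_client * 10_000_000))    # scaled to an assumed 10M clients
```

Under those assumptions the chance comes out a bit above 5% rather than 1%, since 10% * 10% only covers one particular pair of servers and there are six pairs among four servers. Scaled to 10M clients that would be closer to 500,000 machines, with the same caveats about the inputs.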

1 Like

#28

That estimate seems a bit too low to me. On my server in a small country zone I see that it had more than 4M clients (unique IP addresses) in the last 24 hours. I’m not sure how much of that is from Snapchat, but globally I’d expect the number of clients to be much higher.

1 Like

#29

I suspect the true number of clients is in the hundreds of millions or more, most of them SNTP clients. They will likely all easily get messed up by a bad timestamp. Most of the clients are not very thoughtfully made, is my impression.

At least tens of millions of Linux installations (NTP clients) seems reasonable, too. For them your percentage estimate seems reasonable.

However I suspect the pain was unevenly distributed. I haven’t looked but likely the “got leap wrong” servers weren’t evenly distributed through the zones/regions.

1 Like

#30

Thanks for the feedback on my estimate. I forgot about the SNTP clients, they will certainly behave differently. But maybe 1% of ntpd installations is reasonable. Seems like a big enough number to contemplate preventing this problem next time in roughly two years.

I know it’s difficult to estimate the number of NTP Pool clients. I took the 5-15M number from the Wikipedia page which cites an NTP Pool page for it. But those numbers are from 2011!
If anyone else is interested, perhaps we could discuss the question of how to measure the client population in a new forum post. I’m out of date on techniques for this but would be glad to contribute some grunt work with data processing.

http://www.pool.ntp.org/en/vendors.html#pool-capacity

2 Likes

#31

For the estimate to be realistic, you also need to take into account 2nd or even 3rd generation clients.
It’s not uncommon for networks to use a gateway machine running ntpd to then provide time to all the other machines on the network (usually using NAT, so we can’t really do more than guess how many of them there may be).
My current network is tiny, with only about a dozen machines (plus a few devices using sntp), and in the days before I had a GPS PPS source, I still only used the gateway to get time from the pool - everything else got time from the gateway.
Before retiring, I ran far bigger networks, and used a similar arrangement, although with a variable number of external gateways, depending on the need for redundancy. In those cases, there were 100s or 1000s of machines behind each gateway getting time from the pool - which I suggest is not unusual.

1 Like