Leap second 2017 status

Thanks for sharing the stats.

Those 4 must be quite special flowers :slight_smile:

Last update on this, at least for today:

It dropped off pretty fast:

A little later (01:00-02:00) it was a little over 220 servers off by a second (plus/minus 50ms).
02-03: 164 servers.
03-04: 112.
04-05: 87.
05-06 (now): 66.

There are a few dozen more that are off by a significant amount but less than a second. At a glance they look like they are slowly recovering.

All are excluded from the DNS records, but old clients might be using them of course.

I didn’t look super carefully, but I also saw a few servers that were correct at midnight UTC, but were off by a second for a brief period hours later.

If there’s any conclusion, I think it is that leap second handling is still pretty much broken. Obviously not all operators of servers in the NTP Pool take the same care or interest, or have the same knowledge of all this, but my expectation is that NTP installations overall aren’t doing any better than this.

It’d be interesting to break this down by server software and versions, but we don’t have that information in the NTP Pool system.

Last night around 9pm in Taiwan (12/31 1pm UTC) I checked my server’s backup time sources and found that the government-run standard time servers in Korea and the Philippines had not published a leap second warning, while the standard time servers in Taiwan/Hong Kong/Japan/Indonesia had. I did not check further during the leap second itself, though, and have no screenshots. In any case, their time is correct by now. :imp:

Could this be due to the servers syncing to an upstream server that is implementing leap smearing?

See: http://www.n5tnl.com/time/leap_2016/index.html

I was monitoring a large part of the pool servers during the leap and I was curious to see how many servers that didn’t leap correctly share the same source. I tried to prepare a graph of synchronization from reference IDs from packets collected immediately after the leap.

https://mlichvar.fedorapeople.org/leap2016/badleap_refids.png

There are some clusters of servers that share the same source, but it seems the problem is spread quite uniformly. I suspect the main problem is configurations with a small number of sources, which can’t outvote the bad servers or reference clocks. About a quarter of the servers seem to be running openntpd, which doesn’t support leap seconds yet.

The largest cluster is synchronised to refid 7F7F01* (aka local driver). The recommendations for pool servers say the local driver shouldn’t be used, so maybe it would be good to send an email to the owners of the servers that did?
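A minimal sketch of how such servers could be picked out of collected packets, assuming the refid is available as the 8-digit hex string of the 32-bit header field. A refid of 7F7F01xx reads as the pseudo-address 127.127.1.x, which ntpd uses for the local clock driver. The function name is hypothetical.

```python
import ipaddress

# 127.127.1.x is the pseudo-address range of ntpd's local clock driver.
LOCAL_DRIVER_NET = ipaddress.ip_network("127.127.1.0/24")

def is_local_driver_refid(refid_hex: str) -> bool:
    """Return True if a hex refid (e.g. '7F7F0100') points at the local driver."""
    refid = ipaddress.ip_address(int(refid_hex, 16))
    return refid in LOCAL_DRIVER_NET

print(is_local_driver_refid("7F7F0100"))  # True  (127.127.1.0)
print(is_local_driver_refid("C862C4D4"))  # False (200.98.196.212)
```

This only makes sense for servers above stratum 1 reached over IPv4, where the refid is the upstream's address; other refids need different handling.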

Why should we tolerate this defective-by-design software live inside the pool???

Alica, I didn’t realize openntpd just ignores the leap second. Those probably aren’t appropriate for the pool then, at least not in regions of the world with plenty of servers.

That being said, it’s not super clear what to do to improve the situation. Exclude servers if they fail a leap second? It’s somewhat random depending on which upstream servers they have at the moment and lots can change until the next leap second (might be in 6 months, might be in 3 years).

No, those were pretty easy to spot. There were 9 in the system (that got kicked out after a few hours).

What if the monitoring code was aware of leap seconds? Something like:

  • If there’s a leap second today && the server is not announcing it:
    • $step -= 0.2;

If a server is consistently unaware of the leap second, and my math is right, that should dock about 20 points over the course of the day, disabling it around 12:00, while it can continue to contribute for the rest of the year.
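As a rough illustration of that rule, here is a sketch in Python rather than the monitor's Perl. The function name and parameters are made up for the example; it only covers an inserted leap second and relies on the 2-bit leap indicator from the NTP header (0 = no warning, 1 = last minute of the day has 61 seconds).

```python
LI_NO_WARNING = 0   # no leap second announced
LI_INSERT = 1       # last minute of the day has 61 seconds

def leap_aware_step(step: float, leap_today: bool, leap_indicator: int,
                    penalty: float = 0.2) -> float:
    """Reduce the score step when a server fails to announce today's leap second.

    Hypothetical helper: `leap_today` would come from the monitor's own
    leap second table, `leap_indicator` from the server's reply.
    """
    if leap_today and leap_indicator != LI_INSERT:
        return step - penalty
    return step

print(leap_aware_step(1.0, leap_today=True, leap_indicator=LI_NO_WARNING))
print(leap_aware_step(1.0, leap_today=True, leap_indicator=LI_INSERT))
```

How many points this removes over a day depends on how the monitor combines the step with the previous score, which the sketch does not model.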

For NTP clients, not handing out these servers on the last day before the leap is likely too late. Many ntpd’s will have been running for weeks or months already.

SNTP clients don’t care about the leap second, but it’d potentially protect them from talking to a confused server just after the leap second (until the monitoring system has kicked those out anyway).

Now, I’m not sure what the overlap between “didn’t announce the leap second” and “were off by a second after the event” was (it’s in the data though if anyone is up for trying to figure that out). I’m sure it’s less than 100%.

The downside to implementing something like that is that the pool system now has to get all this correct, beyond just having the right time. It wouldn’t be a lot of code to implement, but it doesn’t “feel good”. It’s something that’s hard to test and has to work – or the whole system would go kaboom[1].

Since it won’t help NTP clients (vs SNTP) it doesn’t seem worth the complexity.

For this leap second I was around and had time and a good internet connection, but generally I optimize for “will work without me sitting there paying attention”. The NTP Pool doesn’t exactly have a NOC or 24/7 staff. Or staff. :slight_smile:

I think there are other things that are more likely to be beneficial, in particular things that are actionable weeks or months in advance. For example we could track the refid/upstream of each server, learn which ones mess up the leap second and warn operators to not use those as upstream servers. Maybe.

[1] (Which I suppose is the story of leap seconds).

Hrm, you’re right.

Agreed. :frowning:

It would be pretty much set-it-and-forget-it. Update leap-seconds.list occasionally, by hand or by cron. Then cross your fingers and hope it doesn’t go disastrously wrong. :smile:
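For what it's worth, reading the file is the easy part. A sketch, assuming the usual leap-seconds.list layout: comment lines start with `#`, the `#@` line carries the expiry, and data lines start with an NTP timestamp (seconds since 1900-01-01) at which the TAI offset changes. The function names and the sample text are illustrative; the expiry value in the sample is just a placeholder.

```python
from datetime import date, datetime, timedelta, timezone

NTP_EPOCH = datetime(1900, 1, 1, tzinfo=timezone.utc)

def parse_leap_file(text):
    """Return (expiry, [datetimes at which the TAI offset changes])."""
    expiry, leaps = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#@"):
            # Expiry of the file, as an NTP timestamp.
            expiry = NTP_EPOCH + timedelta(seconds=int(line[2:].split()[0]))
        elif line and not line.startswith("#"):
            # Data line: "<NTP timestamp> <TAI-UTC offset> # comment"
            leaps.append(NTP_EPOCH + timedelta(seconds=int(line.split()[0])))
    return expiry, leaps

def leap_at_end_of(day, leaps):
    """True if an offset change takes effect at 00:00 UTC right after `day`."""
    midnight = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    return midnight + timedelta(days=1) in leaps

SAMPLE = """\
#@	3707596800
3644697600	36	# 1 Jul 2015
3692217600	37	# 1 Jan 2017
"""

expiry, leaps = parse_leap_file(SAMPLE)
print(leap_at_end_of(date(2016, 12, 31), leaps))  # True
print(leap_at_end_of(date(2017, 1, 1), leaps))    # False
```

The hard part remains what was said above: making sure the file is current and that a mistake here can't take the whole system down.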

Mm. That sounds very complex for something with limited accuracy, though.

Edit: And there’s no way to predict operators who use problematic upstreams but configure leap-seconds.list by hand. Or only use 49% problematic upstreams, so they’d still be alright.

@ask

On the other side of the leap second, what to do about servers that are now 1 second off? Under the current scheme, it can take them about 100 minutes [edit: or about 110] to leave the pool. For example:

1483233900,"2017-01-01 01:25:00",0.987508058547974,-2,8.9,0
1483232417,"2017-01-01 01:00:17",0.991002321243286,-2,11.4,0
1483231040,"2017-01-01 00:37:20",0.99502968788147,-2,14.1,0
1483229629,"2017-01-01 00:13:49",0.999479174613953,-2,17,0
1483228172,"2016-12-31 23:49:32",-0.00108015537261963,1,20,0 

From Monitor.pm:

           if ($offset_abs > 3 or $status->{stratum} >= 8) {
               $step = -4;
               # [...]
           }
           elsif ($offset_abs > 0.75) {
               $step = -2;
           }

What if it was changed to take 5-10 points if they’re > 0.9 seconds off? Or 0.9 - 1.1?

That would be a simple change, and beneficial to SNTP clients from around 00:30-01:30 after a leap second.
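To get a feel for the numbers, here is a rough simulation. It assumes each check updates the score as score * 0.95 + step, which reproduces the scores in the log excerpt above (20, 17, 14.1, 11.4, 8.9 with a step of -2), and that a server is only handed out while its score is at least 10. Both are assumptions for illustration, not quoted from Monitor.pm.

```python
def checks_until_removed(step, score=20.0, decay=0.95, threshold=10.0,
                         max_checks=1000):
    """Count monitoring checks until the score drops below the threshold.

    Returns None if the score never gets there within max_checks.
    """
    for checks in range(1, max_checks + 1):
        score = score * decay + step
        if score < threshold:
            return checks
    return None

for step in (-2, -5, -10):
    print(step, checks_until_removed(step))
# -2 4
# -5 2
# -10 1
```

With checks roughly 25 minutes apart, as in the excerpt, four checks is in line with the 100 minutes mentioned above. Under the same assumptions a step of -5 would cut that to two checks, and -10 to one.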

More complicated rules might be better, but that should be good enough. (Only punish ~1.0 second servers on a leap second day? Don’t be so harsh if they already have a low score, so they don’t get to -100 in a few hours?)

Edit: Or have the monitoring system run a special check a few seconds after a leap second, instead of waiting up to ~15 minutes for the normal scheduled check.

Edit (06:46): I unwittingly edited out three of the more important lines of code.

Having just recently read the NTP Best Current Practices document, I have plans to pull in the leap-seconds.list using cron. Does ntpd have to be restarted afterwards or does it occasionally take a peek over there to see if there is anything new?

How can you see this?

Modern versions (4.2.8; maybe 4.2.6) check the file automatically. (Once a day, I think; maybe once an hour.)

If 4.2.6 doesn’t do it automatically, you can still use ntpq (maybe ntpdc) to reload it, if you have the control key stuff set up.

Okay, looks like I’m using 4.2.6 here (CentOS 7). Is there a way to know whether or not it’s looking? Is it in the logs or is there a command I can run to see?

It’s been years since I ran 4.2.6 with a leap second file, if I ever did…

4.2.8 includes leap file information in ntpq -c rv or, more specifically, ntpq -c "rv 0 leapsec expire". Check if 4.2.6p5 does the same?

I think it also logs a message when the file has changed, but I’m not certain.

Hmmm, I see “leap_none”, not sure if that means it’s not expecting a leap second or… ?

[root@server ~]# ntpq -c rv
associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
version="ntpd 4.2.6p5@1.2349-o Mon Nov 14 18:25:09 UTC 2016 (1)",
processor="x86_64", system="Linux/4.8.6-x86_64-linode78", leap=00,
stratum=2, precision=-24, rootdelay=1.484, rootdisp=63.680,
refid=200.98.196.212,
reftime=dc145294.b2a902b9  Mon, Jan  2 2017  4:16:20.697,
clock=dc145a48.cf2a1ab9  Mon, Jan  2 2017  4:49:12.809, peer=23189,
tc=10, mintc=3, offset=-3.633, frequency=-7.465, sys_jitter=20.045,
clk_jitter=11.233, clk_wander=1.035
[root@server ~]# ntpq -c "rv 0 leapsec expire"
No system variables returned

Yes, it means there is not a leap second at the end of the current day. (Seeing as today is 2 January, this is expected.)

That flag is separate from the leapfile stuff, though; NTP can learn leap information from its upstream servers, and that will affect that flag. So, even if you were checking ntpq -c rv on the day of a leap second, you wouldn’t know if it said leap_add_sec because of a leap file or because it had learned it from other servers.
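If it helps when digging through ntpq output, the `status=0615` word in the output above can be split into its fields by hand. A small sketch, assuming the usual system status word layout (2 bits leap indicator, 6 bits clock source, 4 bits event count, 4 bits event code); the function name is made up.

```python
LEAP_NAMES = {0: "leap_none", 1: "leap_add_sec", 2: "leap_del_sec", 3: "leap_alarm"}

def decode_system_status(status_hex):
    """Split the 16-bit ntpq system status word into its four fields."""
    word = int(status_hex, 16)
    return {
        "leap": LEAP_NAMES[(word >> 14) & 0x3],  # top two bits
        "source": (word >> 8) & 0x3F,            # 6 = sync_ntp
        "event_count": (word >> 4) & 0xF,
        "event_code": word & 0xF,                # 5 = clock_sync
    }

print(decode_system_status("0615"))
# {'leap': 'leap_none', 'source': 6, 'event_count': 1, 'event_code': 5}
```

That matches the "leap_none, sync_ntp, 1 event, clock_sync" text ntpq printed. As noted, the leap field alone doesn't say whether the warning came from a leap file or from upstream servers.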

Okay, well, thanks for the information. Perhaps next time there is a leap second I can do some testing. Guess I’ll be digging more into what ntpq can tell me.