What happened here? Bad stratum 7, 13 and 15

Hello,

First post here, but I've been lurking for a while. I hope you can help.

I saw some weird stuff today, shown below.
These logs are from the beta monitoring system, but the same thing was seen on the production monitors.

1583683084,"2020-03-08 15:58:04",-0.000316214,1,-21.5,23,Amsterdam,0,
1583683084,"2020-03-08 15:58:04",-0.000316214,1,-59.7,,,0,
1583682723,"2020-03-08 15:52:03",0.005422785,-5,-26.3,22,"Los Angeles, CA",,"bad stratum 0 (referenceID: 0xa077d8c5, �w��)"
1583682723,"2020-03-08 15:52:03",0.005422785,-5,-63.9,,,,"bad stratum 0 (referenceID: 0xa077d8c5, �w��)"
1583682476,"2020-03-08 15:47:56",0.001815355,-5,-29.3,20,"Newark, NJ, US",,"bad stratum 0 (referenceID: 0xa077d8c5, �w��)"
1583682476,"2020-03-08 15:47:56",0.001815355,-5,-62,,,,"bad stratum 0 (referenceID: 0xa077d8c5, �w��)"
1583682191,"2020-03-08 15:43:11",0.002304153,-5,-23.7,23,Amsterdam,,"bad stratum 0 (referenceID: 0xa077d8c5, �w��)"
1583682191,"2020-03-08 15:43:11",0.002304153,-5,-60,,,,"bad stratum 0 (referenceID: 0xa077d8c5, �w��)"
1583681943,"2020-03-08 15:39:03",0.005632071,-5,-22.4,22,"Los Angeles, CA",,"bad stratum 0 (referenceID: 0xa077d8c5, �w��)"
1583681943,"2020-03-08 15:39:03",0.005632071,-5,-57.9,,,,"bad stratum 0 (referenceID: 0xa077d8c5, �w��)"
1583681639,"2020-03-08 15:33:59",0.001871698,-5,-25.6,20,"Newark, NJ, US",,"bad stratum 0 (referenceID: 0xa077d8c5, �w��)"
1583681639,"2020-03-08 15:33:59",0.001871698,-5,-55.6,,,,"bad stratum 0 (referenceID: 0xa077d8c5, �w��)"
1583681317,"2020-03-08 15:28:37",0.001873217,-5,-19.7,23,Amsterdam,,"bad stratum 0 (referenceID: 0xa077d8c5, �w��)"
1583681317,"2020-03-08 15:28:37",0.001873217,-5,-53.3,,,,"bad stratum 0 (referenceID: 0xa077d8c5, �w��)"
1583681054,"2020-03-08 15:24:14",0.006278902,-5,-18.3,22,"Los Angeles, CA",,"bad stratum 0 (referenceID: 0xa077d8c5, �w��)"
1583681054,"2020-03-08 15:24:14",0.006278902,-5,-50.8,,,,"bad stratum 0 (referenceID: 0xa077d8c5, �w��)"
1583680809,"2020-03-08 15:20:09",0.002927684,-4,-21.7,20,"Newark, NJ, US",,"bad stratum 15"
1583680809,"2020-03-08 15:20:09",0.002927684,-4,-48.3,,,,"bad stratum 15"
1583680513,"2020-03-08 15:15:13",0.00423773,-4,-15.5,23,Amsterdam,,"bad stratum 15"
1583680513,"2020-03-08 15:15:13",0.00423773,-4,-46.6,,,,"bad stratum 15"
1583680217,"2020-03-08 15:10:17",0.007069186,-4,-14,22,"Los Angeles, CA",,"bad stratum 13"
1583680217,"2020-03-08 15:10:17",0.007069186,-4,-44.8,,,,"bad stratum 13"
1583679969,"2020-03-08 15:06:09",0.003438599,1,-18.6,20,"Newark, NJ, US",,"bad stratum 7"
1583679969,"2020-03-08 15:06:09",0.003438599,1,-43,,,,"bad stratum 7"
1583679627,"2020-03-08 15:00:27",-0.011037552,1,-12.1,23,Amsterdam,0,
1583679627,"2020-03-08 15:00:27",-0.011037552,1,-46.3,,,0,
1583679354,"2020-03-08 14:55:54",0.000636382,-5,-10.5,22,"Los Angeles, CA",,",?Lb"
1583679354,"2020-03-08 14:55:54",0.000636382,-5,-49.8,,,,",?Lb"
1583679106,"2020-03-08 14:51:46",-0.003117401,-5,-20.6,20,"Newark, NJ, US",,",?Lb"
1583679106,"2020-03-08 14:51:46",-0.003117401,-5,-47.1,,,,",?Lb"
1583678827,"2020-03-08 14:47:07",0,-5,-13.8,23,Amsterdam,,"i/o timeout"
1583678827,"2020-03-08 14:47:07",0,-5,-44.4,,,,"i/o timeout"
1583678467,"2020-03-08 14:41:07",0.001463921,-5,-5.8,22,"Los Angeles, CA",,",?Lb"
1583678467,"2020-03-08 14:41:07",0.001463921,-5,-41.4,,,,",?Lb"
1583678134,"2020-03-08 14:35:34",-0.002037602,-5,-16.5,20,"Newark, NJ, US",,",?Lb"
1583678134,"2020-03-08 14:35:34",-0.002037602,-5,-38.4,,,,",?Lb"
1583677890,"2020-03-08 14:31:30",-0.001901663,-5,-9.2,23,Amsterdam,,",?Lb"
1583677890,"2020-03-08 14:31:30",-0.001901663,-5,-35.1,,,,",?Lb"
1583677643,"2020-03-08 14:27:23",0.001858051,-5,-0.9,22,"Los Angeles, CA",,",?Lb"
1583677643,"2020-03-08 14:27:23",0.001858051,-5,-31.7,,,,",?Lb"
1583677380,"2020-03-08 14:23:00",-0.001472616,-5,-12.1,20,"Newark, NJ, US",,",?Lb"
1583677380,"2020-03-08 14:23:00",-0.001472616,-5,-28.1,,,,",?Lb"
1583677115,"2020-03-08 14:18:35",-0.00104997,-5,-4.4,23,Amsterdam,,",?Lb"
1583677115,"2020-03-08 14:18:35",-0.00104997,-5,-24.3,,,,",?Lb"
1583676869,"2020-03-08 14:14:29",0.002604457,-5,4.4,22,"Los Angeles, CA",,",?Lb"
1583676869,"2020-03-08 14:14:29",0.002604457,-5,-20.3,,,,",?Lb"
1583676578,"2020-03-08 14:09:38",-0.014846272,-5,-7.4,20,"Newark, NJ, US",,",?Lb"

My Setup:
Stratum 1
3 x Stratum 1 servers (hidden masters) running ntpd on Raspberry Pi 4 boards, each with an Adafruit Ultimate GPS HAT and a ChronoDot RTC. Each Stratum 1 has redundant power supplies with UPS. The GPS signal is good, with ten satellites seen and a 3D fix. I do NOT use gpsd; I use driver 127.127.20.0, combining both PPS and GPS.

ntpq -pn -c ass normally shows very low offsets, as in the example below:

user@host: ntpq -pn -c ass
     remote           refid      st t when poll reach   delay   offset  jitter

o127.127.20.0    .GPS0.           0 l    5    8  377    0.000    0.000   0.002

Stratum 2
3 x Stratum 2 servers (published in pool) running chrony.

The concept is that the three Stratum 2 servers are connected to the three Stratum 1 servers and then peered (with keys) to each other. The Stratum 1 servers are not peered with each other.

I am in Kenya and am attempting to improve NTP Pool coverage in my neighbourhood. The Stratum 2 servers are connected via IP transit to the submarine cable and to the internet exchange points here. I use my own public AS number with fixed public IPv6 and IPv4 addresses.

Problem
Today all three of my Stratum 2 servers disappeared from the NTP Pool, with the logs shown above.

Two of the Stratum 1 servers were fine. The third started to throw large offsets of around 1 second; I don't know why, and the satellite fix was fine throughout. Restarting ntpd on that server restored it, though it's taking time to settle, as usual.

So why did the Stratum 2 servers not carry on and trust the remaining two Stratum 1 servers?

What do the errors in the log mean, specifically the "bad stratum" ones?

Thanks for any insights you all can offer.

Salaams,

Alex

Additional info:
Stratum 1 ntp.conf

driftfile /var/lib/ntp/ntp.drift
leapfile /usr/share/zoneinfo/leap-seconds.list
statistics loopstats peerstats clockstats
filegen loopstats file loopstats type day enable
filegen peerstats file peerstats type day enable
filegen clockstats file clockstats type day enable
server 127.127.20.0 mode 88 minpoll 3 iburst
fudge 127.127.20.0 stratum 0 flag1 1 flag2 0 flag3 1 flag4 0 time1 0.100 time2 0.0 refid GPS0
restrict -4 default kod notrap nomodify nopeer noquery limited
restrict -6 default kod notrap nomodify nopeer noquery limited
restrict 127.0.0.1
restrict ::1
restrict source notrap nomodify noquery

Stratum 2 chrony.conf

server ntp-s1-0.icolo.io iburst
server ntp-s1-1.icolo.io iburst
server ntp-s1-2.icolo.io iburst
peer ntp1.icolo.io key 1
peer ntp2.icolo.io key 2
keyfile /etc/chrony/chrony.keys
driftfile /var/lib/chrony/chrony.drift
logdir /var/log/chrony
maxupdateskew 100.0
rtcsync
makestep 1 3
allow
bindcmdaddress 127.0.0.1
bindcmdaddress ::1

Topology [image]

Stratum 1 build [image]

The peering of the three Stratum 2 servers may create havoc. If they decide to synchronize with each other, the Stratum 1 time is effectively abandoned as a time source and the three Stratum 2 servers will run away with ever-increasing stratum values.

The safe way to do this is to add the noselect parameter to those entries on the Stratum 2 servers:

server ntp1.icolo.io noselect key 1
server ntp2.icolo.io noselect key 2

I once had such an issue with my Stratum 2 servers, which were peering with each other and didn't clear the leap second flag because the selection algorithm decided they were the majority.

Hi Lammert,

Thank you for the feedback. According to the manuals:

" noselect

Marks the server as unused, except for display purposes. The server is discarded by the selection algorithm. This option is valid only with the server and peer commands."

What would the value be of peering if we put noselect then? I might as well just remove the peers.

I was using this concept as my template: NTP Advanced Configuration.

As I understand it, we WANT to use our peers in case of the upstream servers having issues.

I read that we need to have a minimum of 4 sources per NTP server, which in my case is 3 Stratum 1s and two peers.

I realise that the peers do create a loop, but chrony and ntpd are supposed to know how to deal with that.

Your continued thoughts appreciated.

Alex

You should use the peer command on your Stratum 2 servers. Please refer to the ntpd and chrony documentation for more details.

Hi Alicia,

As shown in my configs above, I AM using peer. Lammert, however, suggested using noselect with the peer entries.

Do you agree that I should use noselect with them?

Thanks for the feedback.

Alex

This is what I see:

[screenshot: Untitled]

No peer command inside.

The idea behind multiple Stratum 1 servers and peering is to have redundancy in the setup. Although your system has redundancy at the hardware level, it has no redundancy when it comes to time, because basically all time is derived from one GPS receiver. Instead of connecting the Stratum 2 servers to each other, I would advise looking for some good Stratum 1 or Stratum 2 servers on the internet to use as a secondary source.

In that case, if the GPS antenna fails (a lightning strike, for example), your Stratum 2 servers will continue to provide synchronized time.
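
For example, a couple of extra lines in each Stratum 2 chrony.conf would do; the hostnames below are only placeholders, so pick servers that are well connected from your location:

# Backup internet sources (placeholder hostnames, sketch only)
server ntp-backup1.example.net iburst
server ntp-backup2.example.net iburst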

I personally use just one Stratum 1 server to feed nine Stratum 2 servers, but obviously the power grid in my region (Western Europe) is more reliable than in Kenya and there are more secondary third party Stratum 1 servers in the neighborhood.

Hi Alicia,

Apologies, you have highlighted a typo in the pasted config for Stratum 2. I have fixed it; the peer commands are now shown.

Thanks,

Alex

Hi Lammert,

The system is installed in a high-quality data centre, and we have endeavoured to solve power, cooling, lightning protection, etc. (at least to 99.999% uptime). ;) The remaining work is a second peering session over a second submarine fibre.

Yes, I take your point about a single source of time, and in fact I have a fourth hardware S1 NTP server coming online shortly that is forced to the Galileo constellation, to balance this. This device is also in another physical location.

However, relying on another server elsewhere on the net is exactly the problem I am trying to solve, as RTT to those sites is really high. The best one I have tried is in Amsterdam, but it's ~200 ms away. Connectivity into Africa is still not great…

Best,

Alex

Hi Lammert,

I am not yet convinced about using noselect on the peers.

Alicia also seems to think that peer is good.

Alex

Perhaps I should set prefer on the servers and not on the peers, so that the peers are only selected when the servers are not available?

E.g.:

server ntp-s1-0.icolo.io iburst prefer
server ntp-s1-1.icolo.io iburst prefer
server ntp-s1-2.icolo.io iburst prefer
peer ntp1.icolo.io key 1
peer ntp2.icolo.io key 2

Thoughts?

Alex

Hi, I'm wondering if there was anything useful in the chrony logs on the Stratum 2 devices that shows why they chose each other over the two remaining Stratum 1 devices, or similar on the Stratum 1 device that had the large offset. The "it's taking time to settle as usual" statement in your first post sounds a bit odd.

Root dispersion on all servers is quite high, over 100 ms. Does the GPS use PPS, or just NMEA?
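
For reference, the root dispersion can be read directly on the servers themselves, for example:

chronyc tracking    # on the Stratum 2 servers; look at the "Root dispersion" line
ntpq -c rv          # on the Stratum 1 servers; look at the rootdisp value (in milliseconds)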

Looking at all three servers during this incident I see strata of 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 and 15. This matches Lammert's suggestion.

1 Like

GPS is using PPS and NMEA without GPSD.

With regard to Lammert's suggestion, could you please explain it to me? If I put noselect on the peers, is there any value to them at all? Shouldn't I just remove them?

Alicia earlier on was keen that peers should be there.

Many thanks for the insights.

Alex

I have been burnt once before by a runaway group of Stratum 2 servers. noselect will prevent this for sure. Maybe just setting the Stratum 1 servers as preferred, or using peer instead of server, is enough, but I didn't test it.

Also, my issues at that time were on ntpd, while I am now running chronyd. The two daemons may have different implementations of how the preferred upstream time providers are selected.

[offtopic]
Nice setup you have by the way. Must be one of the most professional time sources on the African Continent :+1:
[/offtopic]

Hi Lammert,

Thanks for the feedback.

Forgive me for digging in on the noselect. As I understand it, noselect will mean that the peer is never used for anything. Noselect is basically for monitoring purposes.

Seeing as I am not interested in monitoring a peer, I might as well remove it, right?

The trouble is, there is plenty of advice out there that I SHOULD peer; see Alicia's comment above.

Alex

I did some digging in the manuals.

https://chrony.tuxfamily.org/doc/3.4/chrony.conf.html

According to the documentation in the link above, the peer keyword is mainly useful when the other partner in the peering is running an implementation like ntpd that supports ephemeral symmetric associations; chronyd does not implement that logic.

This is mainly useful when the NTP implementation of the peer (e.g. ntpd ) supports ephemeral symmetric associations and does not need to be configured with an address of this host. chronyd does not support ephemeral associations.

And at the end of that documentation block

If two hosts should be able to synchronise to each other in both directions, it is recommended to use two separate client/server associations (specified by the server directive on both hosts) instead.

So, although peer is a recognized keyword in chronyd, it does not offer the full functionality it has in ntpd. And when you run bidirectional client/server associations, you either need enough independent time sources, or you need noselect, to prevent the block of peered Stratum 2 servers from becoming the majority in the selection process.
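
A minimal sketch of that bidirectional client/server arrangement between two of the Stratum 2 servers (hostnames taken from the configs above; this assumes the keys are already set up on both sides):

# In chrony.conf on ntp1.icolo.io:
server ntp2.icolo.io iburst key 1

# In chrony.conf on ntp2.icolo.io:
server ntp1.icolo.io iburst key 1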

Hi @alex,

You have to answer the following question for yourself:
What if all of your Stratum 1 servers fail?

  1. It is better not to provide time at all; the clients should look elsewhere for time.
  2. It is better to keep providing time to your clients, even if the time quality is not so good.

Depending on the answer, a different design fits better. In the first case, do not use peer between the Stratum 2 servers at all. In the second case, add the peer statements and also a different external time source for each server, and do not worry about the higher delay, since this situation is really a last resort.
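
For the second case, a sketch of what each Stratum 2 chrony.conf could add on top of the existing peer lines (the external hostname is only a placeholder, and each of the three servers should get a different one):

# One distinct external backup source per Stratum 2 server (placeholder name)
server ntp-ext-a.example.net iburst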

2 Likes

Some random thoughts about everything…

There is a script somewhere that ships with ntpd (I forgot the name) with which you can test whether you are indeed getting a PPS pulse and proper timestamps.
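
Regardless of that script, assuming the pps-tools package is installed, the PPS signal can at least be checked at the kernel level:

sudo ppstest /dev/pps0
# Should print a new "assert" timestamp roughly once per second
# if the PPS pulse is reaching the kernel.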

As has been mentioned, you have multiple single points of failure in your setup:

  • One GPS antenna feeding 3 devices
  • Likewise it appears one power supply feeding the same 3 devices
  • No secondary time source(s) (for S1 or S2)

I think @lammert found one issue, in that chrony doesn't implement 'peer' fully, so having it on your S2 servers isn't benefiting you.

When running ntpd and using peer, you need to set orphan mode if you want to keep running without any upstream sources, i.e. if you want your servers to drop to stratum 5:

tos orphan 5

More Info: https://www.eecis.udel.edu/~mills/ntp/html/orphan.html

Alternatively you can use the ‘local’ clock driver as a last resort time source if you don’t want to use orphan mode.

server 127.127.1.0     # local clock
fudge  127.127.1.0 stratum 10

In my past experience, running peer (with ntpd) never really ended up working all that great. While the concept seems great as a fallback if you lose your primary upstream time source, unless your network must remain isolated, you are better off adding additional remote time sources. Remember, NTP was developed back in the days of analog, dial-up, flaky connections… If you are worried about too much drift or network issues, lower the maxpoll to 9 for remote time sources (= 512 seconds = ~8.5 minute intervals).
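
In config terms that is just an extra option on the remote source lines, e.g. (placeholder hostname):

server ntp-remote.example.net iburst maxpoll 9   # cap the poll interval at 2^9 = 512 s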

You can set each of your S1 servers to also have a 'server' line for the other two S1s. It can serve as a little sanity check, and the stratum 0 refclock should automatically end up being the primary selection anyhow.
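
For example, on ntp-s1-0 that could look like this (a sketch using the hostnames from this thread):

# Cross-check against the other two Stratum 1 servers;
# the GPS/PPS refclock should still win the selection.
server ntp-s1-1.icolo.io iburst
server ntp-s1-2.icolo.io iburst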

Three time sources is okay, but when you lose one, there is no way for ntpd/chrony to determine which of the remaining two is 'more correct'…

If you want to maintain valid time even when you have a GPS outage, you should look into integrating a GPSDO (either commercial product or home-brew). (GPSDO = GPS Disciplined Oscillator)

If your S1 & S2 servers are located in the same facility, you might as well add minpoll 4 maxpoll 4 to the server lines on the S2s for each of the S1s. If/when you add remote servers, you will need to manually set the minpoll/maxpoll for them (to higher numbers, obviously), otherwise they will automatically get clamped to lower settings.
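
That is, on each S2, something like (hostnames as in your config):

server ntp-s1-0.icolo.io iburst minpoll 4 maxpoll 4   # 2^4 = 16 s poll interval
server ntp-s1-1.icolo.io iburst minpoll 4 maxpoll 4
server ntp-s1-2.icolo.io iburst minpoll 4 maxpoll 4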

2 Likes

I have seen this issue several times. It can happen when there are more than two peers polling each other (it doesn't matter whether they are specified with the server or peer directive), when they are very close to each other in the network, and when the upstream sources are less stable than the peers from their point of view.

NTP cannot detect loops between three or more sources. My recommendation is to make a line instead of a full mesh, i.e. configure two of the servers with only one peer each and the third with two peers (A <=> B <=> C).

If you don’t want to do that, you should configure the chronyd servers to prefer lower strata more strongly by setting stratumweight to 0.1 for instance. A loop may still form, e.g. when the stratum 1 servers become unreachable.
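
A sketch of that line topology plus the stratumweight setting; the hostnames below are placeholders for your three Stratum 2 servers, and the key IDs should match your key file:

# chrony.conf on server A:
peer s2-b.example.net key 1

# chrony.conf on server B (the middle of the line):
peer s2-a.example.net key 1
peer s2-c.example.net key 2

# chrony.conf on server C:
peer s2-b.example.net key 2

# Optionally, on all three, make lower-stratum sources win more easily:
stratumweight 0.1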