Oddities with rsntp on CentOS 7

#1

This is somewhere in between a blog post and a post asking questions. :innocent: I had a hard time finding concrete documentation on setting up rsntp in front of regular ntpd, so I’m trying to share my setup—but it also doesn’t work flawlessly, and perhaps someone here has the clues.

Background

I have a server at OVH (in their Canada data center), which sits in the pool dual-stack. (See its IPv4 and IPv6 pool listings if you’d like.) The IPv4 entry is also in the Brazil pool, which is pretty underserved, and thus the server sees a lot of traffic.

It’s running CentOS 7 on an older quad-core Xeon, and has a 100 Mbps unmetered network drop.

For reasons I don’t entirely understand, when I bump the IPv4 listing from the 25 Mbps pool setting to 50 Mbps, its score in the monitoring system starts to drop, and it periodically falls out of the pool. Server metrics look fine: even at peaks, ntpd doesn’t use more than about 3% CPU.

Dropped packets

One thing I did notice is that running ntpdc -c iostats shows thousands of dropped packets, though they’re a relatively small percentage. However, ntpdc -c sysstat shows a significantly higher percentage for “rate exceeded”. (I do have some basic rate limiting configured.) So it seems like packets are getting dropped by ntpd for some other reason.

I could/should go to the source and try to figure out exactly where that can happen, but I took it as an excuse to play with rsntp. (Also: I recently rebooted the box to apply some kernel updates, and then went right into rsntp setup, so there are currently no dropped packets showing.)

rsntp

I discovered it on this forum, but for those who haven’t seen it, rsntp is a multi-threaded NTP server written in Rust, which I’d almost liken to a cache. It sits in front of a “real” NTP server and bases most of its data on that, but does use real timestamps so its responses are good. The readme for the project shows it saturating a 1Gb network port with four cores.
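For anyone unfamiliar with what is actually on the wire, here is a minimal Python sketch of the kind of packet such a server answers. It is not rsntp’s code; the function names are made up for illustration, and it only assumes the standard 48-byte NTP header layout, with the transmit timestamp in the last 8 bytes.

```python
import struct

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_UNIX_OFFSET = 2208988800


def build_client_request() -> bytes:
    """Return a minimal 48-byte NTPv4 client request (all other fields zero)."""
    first_byte = (0 << 6) | (4 << 3) | 3  # LI=0, VN=4, mode=3 (client)
    return bytes([first_byte]) + bytes(47)


def to_ntp_timestamp(unix_time: float) -> bytes:
    """Encode a non-negative Unix time as a 64-bit NTP timestamp."""
    seconds = int(unix_time) + NTP_UNIX_OFFSET
    fraction = int((unix_time % 1) * 2**32)
    return struct.pack("!II", seconds & 0xFFFFFFFF, fraction)


def from_ntp_timestamp(raw: bytes) -> float:
    """Decode a 64-bit NTP timestamp into Unix time."""
    seconds, fraction = struct.unpack("!II", raw)
    return seconds - NTP_UNIX_OFFSET + fraction / 2**32


def transmit_time(response: bytes) -> float:
    """Extract the transmit timestamp (bytes 40..47) from a server reply."""
    if len(response) < 48:
        raise ValueError("short NTP packet")
    return from_ntp_timestamp(response[40:48])
```

Sending the output of build_client_request() over UDP to port 123 and passing the reply to transmit_time() is enough to check by hand that a server is answering with a sensible clock.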

Installation (CentOS 7)

With EPEL already set up, yum install rust cargo gave me all I needed for a Rust runtime and compiler.

I then cloned the git repo for rsntp, and ran make release in the checkout. That gave me an rsntp binary in target/release. For now I’m just running it in screen while I tinker, which is obviously not a good long-term solution.

ntpd configuration

By default, ntpd binds to *:123, which (1) is not where rsntp expects it to be, and (2) prevents rsntp from binding to port 123.

Frustratingly, ntpd doesn’t seem to present any means to change the port it listens on. In another thread on here, there was a recommendation to make ntpd listen on localhost only. For some reason, that configuration on my setup was overly restrictive, and NTP was unable to query any timeservers. :grimacing: (It’s as if it was locked down to only being able to use the loopback device, rather than just listening there.)

Since the overwhelming majority of my traffic is IPv4 and it’s a dual-stack box, I ended up compromising and going for this:

interface ignore wildcard
interface listen ipv6

The first line keeps it from binding to everything, and the latter makes it bind to IPv6 addresses only. (For reasons I don’t fully understand, it still listens on 127.0.0.1 as well, but that’s perfect for my needs.)

The problem with this is that now I can only use IPv6 servers. There are some good nearby IPv6 servers so it’s not a big deal, but I feel like the configuration isn’t quite what I expected.

Pointing rsntp at ntpd

My initial plan, with ntpd listening on IPv6, was to point rsntp at the IPv6 address. Unfortunately, for some reason, Rust doesn’t seem to like this:

[root@ns507230 release]# ./rsntp -4 3 -6 0 -a 192.99.2.8:123 -s [2607:5300:60:3308::1]:123
Server thread #1 started
Server thread #2 started
Server thread #3 started
thread 'main' panicked at 'Client failed to send packet: Address family not supported by protocol (os error 97)', src/main.rs:369:27
note: Run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

I suspect this is some oddity with the Rust runtime or something. (FWIW, I have 1.33.0.2.el7.) Using [::1] as an IP was not any better.

So I ended up just pointing it at localhost, with ./rsntp -4 3 -6 0 -a 192.99.2.8:123 -s 127.0.0.1:123. And now it works.

By the way, the only reason for using 3 threads is that my fingers didn’t seem to want to type -4 4 and I didn’t catch it until it was already running.

Open questions

(Some of these are things I’m just interested in researching when I have time; not necessarily questions I expect the group to answer.)

  • Under what circumstances does ntpd drop packets? Is a non-zero value indicative of a problem, or could it just be junk queries or the like?
  • Does the Rust package I have just not support IPv6?!
  • Is there a way to have ntpd bind to localhost-only without sacrificing the ability to connect to non-localhost timeservers? (Similarly, can I make it bind to IPv6 addresses only, but still query IPv4 servers?)
  • How do people monitor rsntp? I’d love to get some metrics but it doesn’t seem to expose any. (Maybe I need to roll up my sleeves and learn some Rust!)
  • (To be seen) Were my problems getting past 25 Mbps (pool setting) something that will be solved by rsntp, or was it something like a network issue?

#2

I’m having the same problem in that OVH datacenter, and elsewhere too. My New Jersey server at Vultr is the only one at the moment that is usually in the pool at all. My two Hong Kong servers are doing very poorly as well. This is becoming a widely known issue, and there seems to be an effort to put up another monitoring station, but since that would require some pretty large changes and resources, it has not happened yet. Things are functional, and nobody wants to take a large leap that might not provide the changes anyone wants, at least not immediately. And since the GPS rollover is so close, I imagine there is even more hesitation. This is my guess, anyway.

#3

No, the Rust package is not the problem. That was just a bug in rsntp: it expected an IPv4 address in the -s option. It should be fixed now.
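For context, os error 97 on Linux is EAFNOSUPPORT, which is what you get when an address of one family is handed to a socket of another, for example an IPv6 destination on an IPv4 socket. The general remedy is to choose the socket family from the target address. A minimal Python sketch of that idea follows; it is not the actual rsntp fix, and the function names are hypothetical.

```python
import ipaddress
import socket


def family_for(host: str) -> socket.AddressFamily:
    """Pick the socket family that matches a literal IP address."""
    ip = ipaddress.ip_address(host)
    return socket.AF_INET6 if ip.version == 6 else socket.AF_INET


def open_upstream_socket(host: str) -> socket.socket:
    """Create an unconnected UDP socket of the right family for `host`."""
    return socket.socket(family_for(host), socket.SOCK_DGRAM)
```

With this, an upstream of 127.0.0.1 gets an AF_INET socket and an upstream of ::1 gets an AF_INET6 one, so the send never mixes families.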

As for binding ntpd to localhost only while still reaching outside timeservers: not that I know of.

I have some experimental code that can print request rate on standard output, and I could clean it up and push it to the repo if you would like, but personally I prefer iptables for monitoring (e.g. the iptables plugin in collectd).

I don’t think rsntp will help. 25 Mbps of NTP traffic on real x86 hardware is not a problem for ntpd. (Edit: I see now that was just the pool setting, not the actual request rate.)

#4

Just curious, but have you added NOTRACK to your firewall configuration?

i.e.

/usr/sbin/iptables -t raw -A PREROUTING -p udp --dport 123 -j CT --notrack
/usr/sbin/iptables -t raw -A OUTPUT -p udp --sport 123 -j CT --notrack

Likewise, I’m sure you would need to do the same with ip6tables if you’re serving IPv6.

NTPD will track and drop packets if you have the limited keyword set on the restrict line. Packets are dropped if they are invalid, if they exceed the configured rate, or if the buffer is full (only noticeable under really high QPS).

#5

Thanks for all the great replies! I’ll follow up shortly. But in the meantime, I’ve now had rsntp die a few times with this error:

thread 'main' panicked at 'Client failed to send packet: Operation not permitted (os error 1)', src/main.rs:369:27
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:39
   1: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:70
             at src/libstd/sys_common/backtrace.rs:58
   2: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:200
   3: std::panicking::default_hook
             at src/libstd/panicking.rs:215
   4: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:478
   5: std::panicking::continue_panic_fmt
             at src/libstd/panicking.rs:385
   6: std::panicking::begin_panic_fmt
             at src/libstd/panicking.rs:340
   7: rsntp::main
   8: std::rt::lang_start::{{closure}}
   9: std::panicking::try::do_call
             at src/libstd/rt.rs:49
             at src/libstd/panicking.rs:297
  10: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:92
  11: std::rt::lang_start_internal
             at src/libstd/panicking.rs:276
             at src/libstd/panic.rs:388
             at src/libstd/rt.rs:48
  12: main
  13: __libc_start_main
  14:

This is slightly puzzling, because it looks like this is when it’s querying ntpd on localhost. I guess I don’t know the kernel networking code well enough to reason about why such an operation might fail. Has anyone else seen this?

#6

This does sound like a lot of work, and it’s not something I have the resources to help with right now, so I don’t want to complain too loudly about it. But it is sort of a frustrating experience to have a server that seems to be working reliably, but that the monitoring server dislikes for unclear reasons.

As for OVH, I did have trouble in the past with their DDoS mitigation kicking in, though I was able to request that they dial it back. It seemed to be treating normal bursts of activity as an attack. I never did figure out how much of my traffic it was actually dropping, but it was certainly annoying. Thankfully it hasn’t kicked in for some time now.

#7

Awesome, thanks!

The other thing I’ve considered here is running it in a container, but for the moment it’s too much bother.

I think iptables monitoring is fine. I actually just set that up last night. (I’d like to get collectd and some graphs going next.) It occurs to me that it’s really just a straight count with rsntp, since it doesn’t do the rate-limiting and all that ntpd does. So I don’t think I’ll need anything more than iptables.

(I also added a rule to drop requests with a source port of 0; for some reason I was getting a fair number of those.)

The 25 Mbps pool setting still works out to only 1-2 Mbps of actual traffic for me, so it’s extra-surprising that I’d have trouble.
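To put the 1-2 Mbps figure in terms of request rate, here is a rough back-of-the-envelope conversion in Python. It assumes plain 48-byte NTP packets over IPv4 with no extension fields or IP options, counts one direction only, and ignores Ethernet framing; the helper names are just for illustration.

```python
NTP_PAYLOAD_BYTES = 48   # basic NTP packet, no extension fields
UDP_HEADER_BYTES = 8
IPV4_HEADER_BYTES = 20   # no IP options

PACKET_BITS = (NTP_PAYLOAD_BYTES + UDP_HEADER_BYTES + IPV4_HEADER_BYTES) * 8


def qps_to_mbps(qps: float) -> float:
    """Approximate one-way IP-level bandwidth for a given query rate."""
    return qps * PACKET_BITS / 1_000_000


def mbps_to_qps(mbps: float) -> float:
    """Approximate query rate that fills a given one-way bandwidth."""
    return mbps * 1_000_000 / PACKET_BITS
```

Under those assumptions each packet is 76 bytes (608 bits), so 1 Mbps corresponds to roughly 1,600 queries per second.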

It’s not a very scientific study since I never figured out what was causing the score to drop at > 25 Mbps. It could have been some level of coincidence.

The only rules I have are a few ACCEPTs and REJECTs. Is conntrack enabled by default these days? It might be worth specifying NOTRACK manually in any case, I suppose.

This is the part that confused me. The “rate exceeded” count was significantly higher than the number of dropped packets, so I wasn’t entirely sure how it was counting.

Though with rsntp running, I see it complaining with some regularity that it’s receiving packets that are invalid, so this may explain the drops. I had initially assumed it was just a buffer issue, which I think is an unfounded conclusion on my part.

Thanks again, all, for the great feedback!

#8

Was ntpd stopped/restarted when this happened? Or maybe there was a firewall rule that rejected the request sent by rsntp? I guess the code should be changed to not abort on this error and just print a message.

#9

Yes, you should always have an inbound & outbound rule to drop any packets to/from port 0.

I’ve found that, more often than not, conntrack does get loaded by default. It’s not hard to check, and if you don’t need it you can unload it (just be sure to have some code somewhere to make sure it gets unloaded again after a reboot).

NTPD has some fuzzy bit of code so it doesn’t drop 100% of the packets that exceed the configured rate limit. It will still reply to a certain percentage of them, just to give a little leeway and not let the client think the server has gone silent.

#10

ntpd was running (and I was asleep). Looking at what might cause sendto to fail, more than a couple Google results have suggested that conntrack was getting in the way, though. Sure enough:

[root@ns507230 ~]# cat /proc/net/nf_conntrack | wc -l
6127

D’oh! This is with the server temporarily out of the pool and “only” doing about 300qps. Now I understand why it’s been failing with higher load!
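A quick way to keep an eye on this is to compare the current number of tracked connections against the table limit. The Python sketch below reads the two standard netfilter files under /proc/sys/net/netfilter; they only exist while the nf_conntrack module is loaded, and the function name is just something I made up.

```python
from pathlib import Path

# Standard locations while the nf_conntrack module is loaded.
COUNT_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_count")
MAX_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_max")


def conntrack_usage(count_path: Path = COUNT_PATH,
                    max_path: Path = MAX_PATH) -> float:
    """Return the fraction of the conntrack table currently in use (0.0 to 1.0)."""
    count = int(count_path.read_text().strip())
    limit = int(max_path.read_text().strip())
    return count / limit
```

A value creeping toward 1.0 would be consistent with the table filling up under load. With NOTRACK rules in place for port 123, NTP traffic should stop contributing to it.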

#11

I had the same problem and my solution was to use two public IPv4 addresses.
One for ntpd and one for rsntp.
