I’m not able to find an official document which suggests not implementing an NTP server on a virtual machine. I’d like to know the reasons, but especially which RFC contains this recommendation.
@RobertGamma, welcome to the community!
I guess there is no such RFC; using hardware for time service is not a hard requirement, it is just best practice not to run a time service on virtual machines. Scheduling of the guest for execution by the host introduces some uncertainty. For example, when my VMs go through vMotion, they typically lose 0.2 seconds.
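If you want to see that kind of jump for yourself, a rough sketch along these lines will log the local clock’s offset against an NTP server every few seconds. The server name is just a placeholder, and this is a bare-bones SNTP query for observation only, not a replacement for chrony or ntpd:

```python
#!/usr/bin/env python3
"""Rough sketch: poll an NTP server and log the local clock offset.

The server below is a placeholder; point it at your own server. This is a
minimal SNTP exchange, good enough to spot a step of a few hundred ms.
"""
import socket
import struct
import time

NTP_SERVER = "pool.ntp.org"    # placeholder
NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01

def sntp_offset(server: str) -> float:
    """Return (server time - local time) in seconds from one SNTP exchange."""
    packet = b"\x1b" + 47 * b"\0"          # LI=0, VN=3, Mode=3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(2.0)
        t1 = time.time()                   # client transmit
        s.sendto(packet, (server, 123))
        data, _ = s.recvfrom(48)
        t4 = time.time()                   # client receive
    # Server receive timestamp is at bytes 32-39, transmit at bytes 40-47.
    secs_rx, frac_rx = struct.unpack("!II", data[32:40])
    secs_tx, frac_tx = struct.unpack("!II", data[40:48])
    t2 = secs_rx + frac_rx / 2**32 - NTP_EPOCH_OFFSET
    t3 = secs_tx + frac_tx / 2**32 - NTP_EPOCH_OFFSET
    # Classic NTP offset estimate: ((t2 - t1) + (t3 - t4)) / 2
    return ((t2 - t1) + (t3 - t4)) / 2

if __name__ == "__main__":
    while True:
        try:
            print(f"{time.strftime('%H:%M:%S')} offset {sntp_offset(NTP_SERVER) * 1000:+.3f} ms")
        except OSError as exc:
            print(f"query failed: {exc}")
        time.sleep(10)
```

Left running across a vMotion, a step like that 0.2 seconds should show up as a sudden change in the printed offset.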
Hello @RobertGamma. Or is it Renzo Marengo? Or perhaps buckroger2011@gmail.com?
It might give your questions some more credibility if you didn’t post exactly the same question in a different place under a different pseudonym 2 weeks after posting it on the IETF mailing list.
There’s no official document which suggests not to implement NTP servers using virtual machines, because nowadays VMs are perfectly capable of serving time of the quality which most organisations require. Anyone who says anything different is selling time appliances.
Don’t just take my word for it, though: have a read of this detailed explanation of the Linux kernel mechanisms which enable accurate timekeeping in VMs.
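One piece of that puzzle is which clocksource the guest kernel ends up using (kvm-clock on KVM guests, or the TSC where the hypervisor exposes a stable one). If you want to check your own VMs, a quick sketch that just reads the standard sysfs files:

```python
#!/usr/bin/env python3
"""Quick check of which kernel clocksource a Linux guest is using.

The sysfs paths are standard on Linux; what counts as a "good" source
depends on the hypervisor (e.g. kvm-clock or tsc on KVM guests).
"""
from pathlib import Path

BASE = Path("/sys/devices/system/clocksource/clocksource0")

current = (BASE / "current_clocksource").read_text().strip()
available = (BASE / "available_clocksource").read_text().split()

print(f"current clocksource:    {current}")
print(f"available clocksources: {', '.join(available)}")
```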
Or take a look at one of the places where I’ve provided actual data showing that time sync in VMs works fine:
- The School for Sysadmins Who Can’t Timesync Good and Wanna Learn To Do Other Stuff Good Too, part 5 - myths, misconceptions, and best practices
- virtualization - What are the limits of running NTP servers in virtual machines? (2010) - Server Fault
- What’s the time, Mister Cloud? An introduction to and experimental comparison of time synchronisation in AWS and Azure, part 3
- AWS microsecond accurate time: a first look
> Anyone who says anything different is selling time appliances.

Or possibly they’re running VMware. Here’s what happened to one of my VMs when it got live-migrated at about 09:31:50 this morning:
(a graph showing an NTP client whose offset spikes to 178 ms slow, then 66 ms fast, before levelling out after about 8 minutes)
This is, in my experience, perfectly normal for vMotion, and is why my NTP servers are still on dedicated hardware.
(Ubuntu 20.04 LTS, Linux 5.4, chrony 3.5, VMware ESXi 7.0.3)
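For anyone who wants to collect this sort of data themselves, something along the lines of the sketch below (polling `chronyc tracking` and logging the offset and frequency) is enough to feed a graphing tool. The parsing is based on chrony’s human-readable output and may need adjusting for other versions:

```python
#!/usr/bin/env python3
"""Sketch: sample chrony's view of clock offset and frequency once a minute.

Assumes chrony is the local time daemon; the exact wording of the
`chronyc tracking` lines can vary between versions, so treat the
parsing as illustrative.
"""
import re
import subprocess
import time

def chrony_tracking() -> dict:
    out = subprocess.run(["chronyc", "tracking"],
                         capture_output=True, text=True, check=True).stdout
    stats = {}
    # e.g. "System time     : 0.000001878 seconds fast of NTP time"
    m = re.search(r"System time\s*:\s*([\d.]+) seconds (fast|slow)", out)
    if m:
        # Sign convention for this sketch: negative = local clock is ahead.
        stats["offset_s"] = (-1.0 if m.group(2) == "fast" else 1.0) * float(m.group(1))
    # e.g. "Frequency       : 9.403 ppm slow"
    m = re.search(r"Frequency\s*:\s*([\d.]+) ppm (fast|slow)", out)
    if m:
        # Negative = the clock would lose time if chrony stopped correcting it.
        stats["freq_ppm"] = (-1.0 if m.group(2) == "slow" else 1.0) * float(m.group(1))
    return stats

if __name__ == "__main__":
    while True:
        print(time.strftime("%Y-%m-%d %H:%M:%S"), chrony_tracking(), flush=True)
        time.sleep(60)
```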
Nice! I’m not surprised live migration has that sort of effect. Presumably the new host had different frequency error characteristics which meant that it took some time to sort out what adjustments to make.
I’m not trying to be too greedy for data, but do you happen to have a graph of a longer period (say 6-12 hours) where the client was static on one VM host? And one of the frequency error for a stable period and a vMotion?
> This is, in my experience, perfectly normal for vMotion, and is why my NTP servers are still on dedicated hardware.
Assuming that VMware is pretty stable when it’s not moving VMs around, one could surely achieve the same result by pinning the NTP servers to their hosts and, if hypervisor maintenance is required, simply letting them go down while the others in the pool take the load.
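If you go that route, it’s worth a client-side check that enough sources stay usable while one server is down for maintenance. A rough sketch, assuming chrony on the clients and parsing `chronyc -n sources` (the threshold is arbitrary):

```python
#!/usr/bin/env python3
"""Sketch: warn if an NTP client has fewer usable sources than expected.

Assumes chrony and parses `chronyc -n sources`; the column layout is
fairly stable in recent chrony versions, but treat this as illustrative.
"""
import subprocess

MIN_SOURCES = 3  # arbitrary threshold for this example

def usable_sources():
    out = subprocess.run(["chronyc", "-n", "sources"],
                         capture_output=True, text=True, check=True).stdout
    usable = []
    for line in out.splitlines():
        # Source lines start with a mode character (^, =, #) followed by a
        # state character; '*' and '+' mean the source is currently used.
        if len(line) >= 2 and line[0] in "^=#" and line[1] in "*+":
            usable.append(line.split()[1])
    return usable

if __name__ == "__main__":
    sources = usable_sources()
    print(f"{len(sources)} usable source(s): {', '.join(sources) or 'none'}")
    if len(sources) < MIN_SOURCES:
        print("WARNING: fewer usable NTP sources than expected")
```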
I occasionally see messages of the form `kernel: [1981116.560313] watchdog: BUG: soft lockup - CPU#0 stuck for 47s!`. When the CPU became unstuck, the time was off by 47 seconds. [This was a Digital Ocean single-core VM.] The problem was not reproducible on demand, and Digital Ocean support was of little help. Had this problem occurred on a non-VM, I would have expected a reboot.
Consider this a VM anecdote. I have a couple of other Digital Ocean VMs in the NTP Pool that don’t show similar bugs.
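For correlating that sort of jump with the lockup, something like this sketch can flag soft-lockup messages in the kernel log. It assumes `dmesg` is readable by the user running it, and the pattern is based on the message quoted above, so it may differ on other kernels:

```python
#!/usr/bin/env python3
"""Sketch: flag kernel soft-lockup messages so clock jumps can be correlated.

Assumes `dmesg` is readable by the invoking user; the message text matched
here is based on the log line quoted above and may vary between kernels.
"""
import re
import subprocess

def soft_lockups():
    out = subprocess.run(["dmesg"], capture_output=True, text=True, check=True).stdout
    pattern = re.compile(r"soft lockup - CPU#(\d+) stuck for (\d+)s", re.IGNORECASE)
    return [(m.group(1), int(m.group(2)))
            for m in map(pattern.search, out.splitlines()) if m]

if __name__ == "__main__":
    hits = soft_lockups()
    for cpu, secs in hits:
        print(f"CPU {cpu} was stuck for ~{secs}s; check whether the clock jumped around then")
    if not hits:
        print("no soft lockups in the current kernel log")
```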
> Nice! I’m not surprised live migration has that sort of effect. Presumably the new host had different frequency error characteristics which meant that it took some time to sort out what adjustments to make.
I think the big problem is actually that the VM is stunned for a few hundred milliseconds during the migration, so its clock is suddenly behind reality.
> I’m not trying to be too greedy for data, but do you happen to have a graph of a longer period (say 6-12 hours) where the client was static on one VM host? And one of the frequency error for a stable period and a vMotion?
Here’s the same VM over the last 48 hours (including two vMotions). Note this is a very different Y-axis.
The amount of variance I see there is slightly more than for systems on bare metal using the same NTP servers, but not by a lot.
And here’s the frequency variation over the same period:
As you might guess, the second migration was back to the same host the VM started on. Those wobbles in the bottom left have an amplitude of about 1 ppm, which is quite large compared with my bare-metal NTP servers.
> Assuming that VMware is pretty stable when it’s not moving VMs around, one could surely achieve the same result by pinning the NTP servers to their hosts and, if hypervisor maintenance is required, simply letting them go down while the others in the pool take the load.
Yes, I was probably being too flippant. It’s easier for me to run a good NTP service on bare metal, but I could do something adequate on VMware with a bit of administrative complication.
Some more actual data on VM timekeeping, using both standalone KVM hosts and AWS: