Monitors aren't bad, they just report strangely... so people get the wrong idea (my opinion)

Hi all,

Like many of you I had bad scores on my fastest servers but not on my VPS-servers.
Like many of you, I was plagued with Time-outs, orange and red dots.
Like many of you, I got CSV values that made no sense and/or timeouts.

First of all, the monitor is wrong… BUT only in how it reports the actual problem.
When you see Time-out, this is not because your server is offline, it’s because your server is false-ticking. Don’t get upset just yet!
The monitor should check, e.g. with a ping, whether the server is running, and then report false-ticks rather than Time-out. Well, actually the Time-out is correct, but it puts people off track in finding the problem.

It’s not UDP packets being deliberately dropped along the way.

My suggestion would be: if the monitor gets a time-out, it should ping the IP, so it can report that the server is there but NTP is not responding due to bad time.

Anyway, the problem is bad timekeeping on your part (mine as well), but the reason is not your fault.

Yes it is. I replaced my i5-4440 with an RPi2 and expected things to get even worse… well no… the score went up to nearly 20 and stayed there. WHAT?? How can this be?

The problem is the TSC, or in full the Time Stamp Counter. This is a free-running counter that never stops and should be stable, as NTP and Chrony base their time-correctness on this counter.
But here it comes… no joke… modern CPUs do SpeedStep. This is a problem to start with, as NTP doesn’t expect the CPU clock to change speed, but with changing speeds the TSC goes faster and slower. Chrony does take it into account but fails as well.

Well, disable SpeedStep… yep, that is a solution. But most modern CPUs also do thermal clock changes to stop parts of the CPU from overheating, and you cannot control this.
As such the TSC wobbles and NTP and Chrony have no clue what to do with it.
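
Whether the kernel itself believes the TSC ticks at a constant rate can be read from the CPU flags. A minimal sketch, assuming an x86 Linux box where /proc/cpuinfo has a "flags" line; the function name tsc_flags is made up for illustration. constant_tsc means the counter should not follow frequency changes, and nonstop_tsc means it keeps running in deep sleep states. If both say yes and the clock still wobbles, the cause may lie somewhere else.

```shell
# tsc_flags [FILE]: report whether the first CPU advertises a
# constant-rate and a non-stopping TSC. FILE defaults to /proc/cpuinfo.
tsc_flags() {
    file=${1:-/proc/cpuinfo}
    for flag in constant_tsc nonstop_tsc; do
        # Look only at the first "flags" line and match the whole word.
        if grep -m1 '^flags' "$file" | grep -qw "$flag"; then
            echo "$flag: yes"
        else
            echo "$flag: no"
        fi
    done
}

# Usage:
#   tsc_flags
```

On an ARM board such as the RPi there is no "flags" line, so both will simply read "no"; the check only says something on x86.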

The RPi2 doesn’t have this and its ticker keeps going forward at a steady pace.

Last year I changed from an old Celeron 1.1GHz (SpeedStep disabled) to an i5-4440 (set to max performance) and I never reached the 20 mark again. Whatever I did, it would not stay at a solid 20; it dropped to negative numbers many times.

The other day I was monitoring for hours and I noticed jumps of seconds, sometimes 1 to 2 seconds or more. When that happens NTP stops responding to protect the time network.
You cannot stop NTP from doing this, even with PPS, as it’s the TSC that messes everything up.
The problem is that everything is time-stamped from this ticker; if it stamps wrong, everything is off-time and NTP/Chrony get confused and simply stop sending time.

In short, they do not know what the correct time is or what source to trust!

Fix the CPU speed rock-solid and it should stay stable; in this case a slower CPU is better than the fastest on the planet.
In my case I’m going to kick the i5-4440 out and install the Intel J1900 again with SpeedStep disabled.
It also saves me a great deal on electricity, about 30 W.

Meinberg has a nice document explaining this all, but sadly I can’t upload it in this post.

It is my opinion, after long searching, that the CPU and its TSC are the problem, together with changing CPU speeds.

Hopefully it helps you too in solving the time-outs and bad scores.
See for yourself, I changed to the RPi yesterday and ever since not a single tick is missing or off!

[Screenshot, 2020-02-25 16:03:26]

Bas.

Update: I think I have tackled the problem, well, it looks like it for now. Just testing to be sure.
For the moment my blue line is rock-solid and I haven’t had a single timeout, not from Newark and not on the beta server.

I will start a new topic with my config and changes, then we can compare and hope it solves it for everybody.
Also, my Intel i5 is a very heavily used system, timekeeping is just a side-task.

The monitoring problem you described can be the cause for certain configurations, like yours. Your graph is full of red dots; there is even a 1.4 second offset measurement. For others, the reason is real packet loss; there is no big offset sample:

The biggest sample difference is 8 milliseconds on my server. It is even possible to differentiate two distinct groups of samples, probably alternate packet routes with different packet delays.

I beg to differ: the red spots you see on my graph that are not on the blue line are in fact wrong ticks, but within the limits of the monitor/NTP.
I thought I was spot on time too; the PPS values showed this all the time and so did the monitor.
But they go haywire when the TSC starts stamping wrong. The hard part is that this doesn’t have to last long; it resolves itself as soon as the TSC is ticking normally again, until the next off-event happens.
When this happens the samples can be off, but not by enough to look like big offsets. If they are big enough, though, the UDP packet is not sent at all, as NTP’s protective code stops it from sending in order to keep NTP time healthy.

Trust me, it’s the time-stamper that goes wrong. You are not losing packets, they are simply not sent.
As all the other monitors in the beta system check at different times, the condition has simply been corrected by the time they come to monitor. Ergo they show nothing is wrong.

The only way to see this happening is to monitor your server with ntpq -p and watch the variations in offsets; if you are lucky you spot the moment where it happens.
The cause is CPU clock variations; they cause all the trouble.
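
Watching ntpq -p by hand is tedious, so the offsets can be logged instead. A rough sketch, assuming the usual ntpq -p layout (two header lines, offset in milliseconds in the ninth column); the function name max_abs_offset is made up for illustration.

```shell
# max_abs_offset: read "ntpq -pn" output on stdin and print the largest
# absolute peer offset in milliseconds.
max_abs_offset() {
    awk 'NR > 2 && NF >= 10 {          # skip the two header lines
             off = $9 + 0              # ninth column is the offset
             if (off < 0) off = -off
             if (off > max) max = off
         }
         END { printf "%.3f\n", max + 0 }'
}

# Usage: log the worst offset once a minute, with a timestamp.
#   while sleep 60; do
#       printf '%s %s\n' "$(date +%T)" "$(ntpq -pn | max_abs_offset)"
#   done
```

A sudden spike in that log around the time of a reported time-out would support the bad-ticker explanation; a flat log would point somewhere else.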

Sure, you can miss a UDP packet sometimes, but not all the time and every day, as the beta system proves.
Have a look:

Amsterdam only missed once, that’s all that happened.
Also this server is running GPS without PPS to make sure the time isn’t corrected too fast.

Trust me, I was wrong in assuming the UDP packets are lost. Well, not quite: they are simply not sent, because a problematic TSC confuses NTP about what the right time is.
NTP and Chrony do not send UDP if they are not sure what time it is; that is what causes the time-outs.
The reason for that is the TSC not counting at a steady pace.
To make things worse, some CPUs have multiple TSCs that are not synchronised. In such a case it’s probably a good idea to run NTP on an isolated CPU core, to keep it from core-hopping and picking up a differently running TSC.

You could try the last core for that, see more info here: LinuxCNC Documentation Wiki: The Isolcpus Boot Parameter And GRUB2
Then run NTP on that core with the taskset command, see what happens.
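
For a quick trial without touching any startup script, an already running ntpd can be pinned in place. A sketch, assuming util-linux taskset and a daemon named ntpd; last_core and pin_ntpd are names made up here, and the core numbering is assumed to be contiguous from 0.

```shell
# Index of the highest-numbered core (nproc --all counts installed
# processors, including ones taken out of scheduling by isolcpus).
last_core() {
    echo $(( $(nproc --all) - 1 ))
}

# pin_ntpd [CORE]: pin every thread of the running ntpd to one core
# (default: the last core). Needs root.
pin_ntpd() {
    core=${1:-$(last_core)}
    pids=$(pidof ntpd) || { echo "ntpd is not running" >&2; return 1; }
    for pid in $pids; do
        taskset -a -cp "$core" "$pid"
    done
}

# Usage (as root):
#   pin_ntpd 3
#   cat /sys/devices/system/cpu/isolated   # cores isolated with isolcpus=
```

The pinning is lost when ntpd restarts, which makes this handy for testing only.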

It’s the simplest way, and within a day you see the result on the blue line.

Isolating a core on my I5-4440 is the next step I’m going to take before rebuilding my system with a different motherboard :slight_smile:

You are right that it is a valid problem. One always has to verify one’s own server: does it lose synchronization from time to time? When that happens, the NTP server may not answer queries.

That is not my case. When my server receives NTP query packets originating from the monitoring system, it always answers them. Still, a lot of regularly occurring time-outs are reported by the monitoring system. The only plausible explanation is packet loss (or a problem with the monitoring station itself, but I doubt that).

Well, I can only say this: I’m running on an isolated core now and it has not lost a single packet.
Not 1.
However, it was snowing heavily last night so I have a red dot, but not a timeout. You can see the snow cloud passing in the time values; PPS is very good at this :slight_smile:

How come you are so sure it sends UDP packets to the monitor?
NTP does send monitoring messages for ntpq and you see timings, but that is monitoring; the packet with the actual time is not sent and that causes the timeout.

Install irqbalance and taskset
Just look into your server and pick the core that is least used.

Then change /etc/init.d/ntp to this:

(
flock -w 180 9
# taskset -c 3 pins the daemon to core 3; "--" separates start-stop-daemon's options from ntpd's
/usr/bin/taskset -c 3 start-stop-daemon --start --quiet --oknodo --pidfile $PIDFILE --startas $DAEMON -- -p $PIDFILE $NTPD_OPTS
) 9>$LOCKFILE

Where -c 3 is the 4th core in my case, as cores are counted from 0; this way you make sure NTP stays on that core.
That makes NTP run isolated; however, it may not be enough and the core itself may need to be isolated.
I had an isolated core anyway and put it on there.
Irqbalance makes sure that every core handles its own IRQs instead of passing them via core 0 as Linux does by default.

Other than the snow passing by and giving a few higher values, I have not received 1 time-out (yet).

I am doing packet capture. All packets are in pairs: one from the monitoring station, and a reply packet immediately following, sent back to the monitoring station. No exception.
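
For anyone who wants to repeat this check, the capture can be boiled down to a count of requests versus replies. A sketch, assuming tcpdump's usual one-line NTP decoding where each line contains "NTPv4, Client" or "NTPv4, Server"; count_ntp_pairs is a hypothetical helper name and 192.0.2.10 is a placeholder for the monitor's address.

```shell
# count_ntp_pairs: read tcpdump text output on stdin and count
# NTP client requests and server replies.
count_ntp_pairs() {
    awk '/NTPv[0-9], Client/ { req++ }
         /NTPv[0-9], Server/ { rep++ }
         END { printf "requests=%d replies=%d\n", req, rep }'
}

# Usage (as root; stops after 20 packets):
#   tcpdump -n -c 20 -i any 'udp port 123 and host 192.0.2.10' | count_ntp_pairs
```

Equal counts mean the server answered every query it received; fewer replies than requests would mean the daemon really did stay silent.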

Just now, there was packet loss from my server towards the monitoring station.

https://www.ntppool.org/scores/156.106.214.52/log?limit=200&monitor=*

ts_epoch,ts,offset,step,score,monitor_id,monitor_name,leap,error
1582712977,"2020-02-26 10:29:37",0,-5,10.3,6,"Newark, NJ, US",,"i/o timeout"
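
To see how often this happens, the CSV log can be summarised. A small sketch, assuming the log keeps the format shown above, with a header line and "i/o timeout" as the last field of a failed check; timeout_summary is a name made up for illustration.

```shell
# timeout_summary: read the monitor CSV log on stdin and report how
# many checks ended in an i/o timeout.
timeout_summary() {
    awk 'NR == 1 { next }                  # skip the header line
         NF == 0 { next }                  # skip empty lines
         { total++ }
         /"i\/o timeout"$/ { timeouts++ }
         END { printf "%d of %d checks timed out\n", timeouts, total }'
}

# Usage:
#   curl -s 'https://www.ntppool.org/scores/156.106.214.52/log?limit=200&monitor=*' | timeout_summary
```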

I have seen three pairs of packets, with 5 seconds spacing. From this, I can even deduce that the packet loss was not toward my server, but from my server toward the monitoring station.

(By the way, from reading the source code of the monitoring, it is supposed to be 4 seconds spacing only, not 5. Isn’t it, @ask? Or am I wrong?)

I had the exact same timeouts and they happened randomly.
These are caused by your own NTP that stops sending UDP because it thinks it has bad time.

Do yourself a favour and put NTP on the last core of your CPU.
Also check if the ondemand governor is running; if so, set it to powersave or performance.
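
The governor of each core can be read from sysfs. A minimal sketch, assuming the standard cpufreq layout under /sys/devices/system/cpu; show_governors is a name made up here. Note that with some drivers (intel_pstate, for example) powersave still scales the frequency, so performance is probably the safer choice for this test.

```shell
# show_governors [BASE]: print the cpufreq governor of every core.
# BASE defaults to /sys/devices/system/cpu.
show_governors() {
    base=${1:-/sys/devices/system/cpu}
    for f in "$base"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue        # no cpufreq support: print nothing
        cpu=${f#"$base"/}              # strip the base directory
        cpu=${cpu%%/*}                 # keep only "cpuN"
        printf '%s %s\n' "$cpu" "$(cat "$f")"
    done
}

# Usage:
#   show_governors
#   # switch every core to performance (as root):
#   echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```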

See what happens, if the time-outs are gone then you have the answer.

NTP will not send time if it’s not sure it’s the correct time, and such conditions arise if the TSC is not counting at a steady pace.
Core-hopping creates such conditions, as do CPU-speed changes.

My money is mostly on core-hopping :slight_smile:

Try it and you know it.

That only means that it waits max 4 seconds for the UDP packet to arrive.
But it’s never sent, that is the problem.
Ask can make it 100 seconds, it won’t matter.

How can I make this clear: if NTP is unsure whether its time is correct, it will not send UDP time.
Instead it waits until the time is correct and only then answers queries again.

It waits only two seconds for the answer packet to arrive. Then it just sleeps an additional two seconds to prevent getting a KoD reply to the next query. I cannot account for the fifth second of delay I have seen. EDIT: I checked the code again. The timeout is set to 3 seconds, not two, so it correctly adds up to five.

That is exactly the point, the answer is never sent.
You can wait all you like.

If you are off too much for whatever reason it will present you a time-out.
That is not packet loss; it was never sent by your server in the first place.

NTPD and Chrony do not send UDP time if they are unsure about the time.
They ONLY send correct time within certain limits (that I do not know), else they do not respond at all.
No error packet or wrong-time message… they simply send nothing at all.

In short, if your time is wrong and the monitor comes, the NTPD doesn’t respond AT ALL.

Maybe something has changed since 2014, but the Meinberg document states this.

Since then I have been looking for a way to get rid of the wrong ticks, and it worked: I haven’t had a single time-out (not from any monitor). Some off-ticks, yes, that happens, but not a single time-out.

The tcpdump shows that the reply packet is sent. Do you think tcpdump shows a packet which isn’t there in reality?

You have a time-out every few hours.
Did you try any of the suggestions I made?

Or do you like arguing all day about it?

I do not care about tcpdump, sorry, I’m past that station.

My server gave the same dumps and same time-outs.
Then I replaced it with an RPi2 and all time-outs were gone.

If the internet is the problem, the RPi2 should have the same problems.
It did not. It was a perfect 20 all day long.

UDP does not get dropped that much. It doesn’t.

Update: it seems Intel CPUs have a TSC bug causing this:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1744988

The weird thing is, I have an unaffected kernel and an unlisted CPU but the behaviour is the same.

I changed my clock source to HPET instead of TSC, as some reported that HPET makes it stable.

https://access.redhat.com/solutions/18627 (works also for Debian and variants)
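
The clock source in use, and the ones the kernel offers, can be read from sysfs. A minimal sketch, assuming the standard Linux path /sys/devices/system/clocksource/clocksource0; show_clocksource is a name made up for illustration.

```shell
# show_clocksource [DIR]: print the current and available clock sources.
# DIR defaults to the kernel's clocksource0 directory in sysfs.
show_clocksource() {
    dir=${1:-/sys/devices/system/clocksource/clocksource0}
    [ -r "$dir/current_clocksource" ] || {
        echo "no clocksource info in $dir" >&2
        return 1
    }
    printf 'current:   %s\n' "$(cat "$dir/current_clocksource")"
    printf 'available: %s\n' "$(cat "$dir/available_clocksource")"
}

# Usage:
#   show_clocksource
#   # switch at runtime (as root), lost again at reboot:
#   echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource
#   # to make it permanent, add clocksource=hpet to the kernel command line
```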

Assuming you were running Linux, which clock source were you using, i.e. jiffies, acpi_pm, hpet, pit, tsc, etc.? And did you try changing it to see if that affected performance? Likewise, occasionally the BIOS will have some HPET-related settings that you need to make sure are enabled.

Yes I run Linux, is there any other? :slight_smile:

I was using TSC but changed to HPET, but it doesn’t seem to improve things.
Going to try acpi_pm to see if it’s any better.

Update:
I’m running 2 systems side by side, both with GPS, but it’s snowing heavily here at the moment.
It seems that acpi_pm does the job: the wobble is within 6 ms, with none of the strange jitter changes of up to 600 ms from when the TSC went nuts.
The RPi and i5-4440 seem to run with stable offsets.
Too bad we can’t put an extra monitor on e.g. port 124 to compare; I tried, but the pool won’t accept it.

Actually in terms of timekeeping, the FreeBSD kernel is far superior to the Linux kernel…

I know, I have NetBSD on a Cobalt Raq2, but I hate the system as it’s impractical and badly organised when it comes to installing packages.
I prefer the APT system.
For me BSD is a mess to work with, sorry.

But it’s not the Linux kernel that goes wrong, it’s the Intel CPU that is a mess.
They try to fix it via the Kernel but that doesn’t work.

And last year, around the time Ask moved the pool to Packet, they decided to change the clock source.
NTP has become a victim of bad tickers because the kernel time-source changes ran into the Intel bug.

At the same time I swapped my trusty old Celeron 847 for the J1900 and later the i5, both containing the bug.
From that moment I have been running circles trying to find the cause.
Who would expect that a time-stamper goes lunatic?

Also the monitor responses made things worse by declaring systems offline, but Ask is not to blame, he didn’t know.

I could have known, as my son-in-law complained his i5-4440 was slow and my AMD FX6350 was faster.
But he’s a musician and got into trouble with real-time filters etc, no wonder if the ticker can’t tick right. :slight_smile:

Anyway, the acpi_pm seems to work as should…fingers crossed.

Well, I changed back to the Intel CPU, and it shows within 3 monitor queries:

ts_epoch,ts,offset,step,score,monitor_id,monitor_name,leap,error
1582912792,"2020-02-28 17:59:52",0,-5,13.9,6,"Newark, NJ, US",,"i/o timeout"
1582912792,"2020-02-28 17:59:52",0,-5,13.9,,,,"i/o timeout"
1582911666,"2020-02-28 17:41:06",0.071213794,1,19.9,6,"Newark, NJ, US",0,
1582911666,"2020-02-28 17:41:06",0.071213794,1,19.9,,,0,
1582910559,"2020-02-28 17:22:39",0.066881733,1,19.9,6,"Newark, NJ, US",0,
1582910559,"2020-02-28 17:22:39",0.066881733,1,19.9,,,0,

A simple Raspberry Pi will hum along all day long.

I tried everything including a new kernel that was supposed to fix it.

Sorry to say, this Intel i5-4440 goes into the trash and I’ll get a modern AMD.
Intel’s TSC is bad and not synchronised. Tried HPET and ACPI_PM, same rubbish.

That is the reason people complain about the monitors.
Try it yourself, switch to a Raspberry Pi and see what happens.
You do not even need a GPS or PPS, it will go nuts in minutes!

The problem is Intel CPUs. If you have this problem and the CPU is Intel, well, install a Raspberry Pi for 50 euro/dollar and see what happens.

You do realize that HPET, PIT, RTC, and most other timers are on the MOTHERBOARD and not the CPU? It’s probably a firmware (BIOS) issue more than likely for that specific board, simply overlooked because nanosecond timekeeping is usually not a top priority for desktop boards…

But do what works for you…

I do not care where they are located.
The TSC is IN the CPU and should keep counting at a steady pace.
It does not.

I changed to other time sources but the cores are not in sync.

NTP and Chrony get confused and then the trouble starts.

The monitor(s) are not bad, there are just too few of them.
But when it goes bad, all monitors show the same.

1582913378,"2020-02-28 18:09:38",0,-5,13.8,23,Amsterdam,,"i/o timeout"
1582913378,"2020-02-28 18:09:38",0,-5,8.8,,,,"i/o timeout"

When I run the Raspberry Pi and there is a time-out, it is not confirmed by ALL monitors; it just happens at 1 monitor.

Trust me…it is the CPU. Also Intel did confirm the ticker is bad and never fixed it.

ALL systems show the same blue yo-yo… I’d bet you that they are all Intel CPUs.