Hi all,
Like many of you, I had bad scores on my fastest servers but not on my VPS servers.
Like many of you, I was plagued with Time-outs, orange and red dots.
Like many of you, I got CSV values that made no sense and/or timeouts.
First of all, the monitor is wrong…BUT only in reporting the actual problem.
When you see Time-Out, this is not because your server is offline, it’s because your server is false-ticking. Don’t get upset just yet!
The monitor should check e.g. with a ping if it’s running or not, then it should report false-ticks and not Time-out. Well actually the Time-out is correct but it puts people off-track in finding the problem.
It’s not UDP packets being dropped on purpose along the way.
My suggestion: if the monitor gets a time-out, it should ping the IP, so it can report that the server is there but NTP is not responding due to bad time.
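A rough sketch of how such a check could look in Python. This is not the pool monitor's actual code: the function names and status strings are made up for illustration, and the ping call assumes the Linux `ping` command with its `-c` and `-W` flags.

```python
import socket
import subprocess

NTP_PORT = 123


def ntp_answers(host, timeout=2.0):
    """Send one minimal NTPv4 client request; True if any plausible reply arrives."""
    packet = b"\x23" + 47 * b"\0"  # first byte: LI=0, VN=4, Mode=3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(packet, (host, NTP_PORT))
            data, _ = sock.recvfrom(512)
            return len(data) >= 48
        except OSError:  # timeouts, unreachable hosts, name lookup failures
            return False


def host_pings(host, timeout_s=2):
    """True if one ICMP echo gets a reply (assumes Linux-style ping flags)."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:  # ping binary missing or not executable
        return False
    return result.returncode == 0


def classify(ntp_ok, ping_ok):
    """Turn the two probe results into a status (hypothetical wording)."""
    if ntp_ok:
        return "ok"
    if ping_ok:
        return "host up, NTP not responding"
    return "host unreachable"


def check(host):
    """Only fall back to ping when the NTP query got no answer."""
    ntp_ok = ntp_answers(host)
    ping_ok = True if ntp_ok else host_pings(host)
    return classify(ntp_ok, ping_ok)
```

Calling `check("your.server.example")` would then give one of the three strings. Keep in mind that a firewall that blocks ICMP makes a healthy host look unreachable, so the ping result is only a hint.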
Anyway, the problem is bad timekeeping on your part (mine as well), but the reason is not your fault.
Yes it is. I replaced my i5-4440 with an RPi2 and expected it to get even worse…well no…the score went up to nearly 20 and stayed there. WHAT?? How can this be?
The problem is the TSC, or in long words the Time-Stamp-Counter. This is a free-running counter that never stops, and it has to be stable because NTP and Chrony base their timekeeping on it.
But here it comes…no joke…modern CPUs do speedstep. This is a problem to start with, as NTP doesn’t expect the CPU clock to change speed, but with changing speeds the TSC goes faster and slower. Chrony does take it into account but fails as well.
Well, disable speedstep…yep, that is a solution. But the most modern CPUs also change their clock for thermal reasons, to stop parts of the CPU from overheating, and you cannot control this.
As such the TSC wobbles and NTP and Chrony have no clue what to do with it.
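If you want to check which counter your own machine uses for timekeeping, here is a minimal sketch, assuming a Linux system: the kernel lists its clocksources under `/sys/devices/system/clocksource/clocksource0`. The function name is made up for illustration.

```python
from pathlib import Path

# Standard Linux sysfs location; other operating systems do not have it.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(base=CLOCKSOURCE_DIR):
    """Return (current clocksource, list of available clocksources)."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available
```

On a typical x86 box `read_clocksource()` reports `tsc` as the current source, with alternatives such as `hpet` or `acpi_pm` in the list, but what you see depends on your hardware and kernel.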
The RPi2 doesn’t have this and its ticker keeps going forward at a steady pace.
Last year I swapped an old Celeron 1.1GHz (speedstep disabled) for an i5-4440 (set to max performance) and I never reached the 20 mark again. Whatever I did, it would not stay at a solid 20 and dropped to negative numbers a lot of the time.
The other day I was monitoring for hours and I noticed jumps in seconds, sometimes 1 to 2 seconds or more. When that happens, NTP stops responding to protect the time-network.
You cannot stop NTP from doing this, even with PPS, as it’s the TSC that messes everything up.
The problem is that everything is time-stamped from this ticker. If it stamps wrong, everything is off-time, and NTP/Chrony get confused and simply stop sending time.
In short they do not know what time the correct time is or what source to trust!
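One possible way to log jumps like the ones described above, assuming they show up as steps of the system clock: sample the wall clock and the monotonic clock together and flag any interval where the two advanced by different amounts. The function names and the 0.5 s threshold are my own choices for illustration.

```python
import time


def sample():
    """Take one (monotonic, wall-clock) reading, both in seconds."""
    return time.monotonic(), time.time()


def find_steps(samples, threshold=0.5):
    """Return (index, jump_seconds) wherever wall time and monotonic time
    advanced by amounts that differ by more than `threshold` seconds."""
    steps = []
    for i in range(1, len(samples)):
        mono_delta = samples[i][0] - samples[i - 1][0]
        wall_delta = samples[i][1] - samples[i - 1][1]
        jump = wall_delta - mono_delta
        if abs(jump) > threshold:
            steps.append((i, jump))
    return steps
```

Collect a reading with `sample()` every second or so and run `find_steps` over the list afterwards. Both clocks are derived from the same hardware counter, so this only catches steps applied to the system clock, not the underlying wobble itself.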
Fix the CPU speed so it is rock-solid and the time should stay stable. In this case a slower CPU is better than the fastest one on the planet.
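To see whether the CPU speed is actually pinned, a small sketch, assuming Linux with the cpufreq interface in sysfs: it reads the scaling governor of every core and lists the ones that are not set to `performance`. The function names are made up for illustration.

```python
from pathlib import Path


def read_governors(cpu_root="/sys/devices/system/cpu"):
    """Map each core name (cpu0, cpu1, ...) to its cpufreq scaling governor."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for path in sorted(Path(cpu_root).glob(pattern)):
        governors[path.parts[-3]] = path.read_text().strip()
    return governors


def not_fixed(governors, wanted="performance"):
    """Return the cores whose governor differs from the wanted one."""
    return sorted(cpu for cpu, gov in governors.items() if gov != wanted)
```

An empty list from `not_fixed(read_governors())` means every core uses the `performance` governor. That alone does not guarantee a constant clock, since turbo and thermal throttling can still change the speed.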
In my case I’m going to kick the i5-4440 out and install the Intel J1900 again with disabled-speedstep.
It also saves me a great deal on electricity, about 30 W.
Meinberg has a nice document explaining this all, but sadly I can’t upload it in this post.
It is my opinion, after a long search, that the CPU and its TSC are the problem, together with changing CPU speeds.
Hopefully it helps you too in solving the time-out and bad-scores.
See for yourself, I changed to the RPi yesterday and ever since not a single tick is missing or off!
Bas.
Update: I think I have tackled the problem, well, it looks like it for now. Just testing to be sure.
For the moment my Blue-line is rock-solid and I haven’t had a single timeout, not from Newark and not on the Beta-server.
I will start a new topic with my config and changes, then we can compare and hope it solves it for everybody.
Also, my Intel i5 is a very heavily used system, timekeeping is just a side-task.