Bug 11027 - random forward time jumps
Summary: random forward time jumps
Status: RESOLVED OBSOLETE
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: PPC-32 (show other bugs)
Hardware: All Linux
: P1 high
Assignee: platform_ppc-32
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-07-02 10:49 UTC by Mario Frasca
Modified: 2012-10-30 14:54 UTC (History)
2 users (show)

See Also:
Kernel Version: 2.6.32
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments

Description Mario Frasca 2008-07-02 10:49:58 UTC
Latest working kernel version: 2.6.8

Earliest failing kernel version: 2.6.14

Distribution: Debian (sid)

Hardware Environment: 
motherboard	: PowerMac4,1 MacRISC2 MacRISC Power Macintosh
detected as	: 256 (iMac "Flower Power")
cpu		: 750CXe
MemTotal:       578160 kB


Software Environment:
Debian sid, being "safe-upgraded" each day automatically.

Problem Description:
I was using nptd, but it kept stopping, saying that the last correction was too large to be accepted as "normal", so I wrote a small bash line performing each minute a ntpdate clock correction and logging the correction to a logfile.  

while true; do /usr/sbin/ntpdate -b nl.pool.ntp.org 2>&1 | /bin/grep ntpdate >> /root/ntpdate.log 2>&1; sleep 60; done

the 'sleep 60' is also influenced by the time jumps ("time flies").

I'm not rotating this logfile because I want to keep the whole history of corrections.  I have reported this bug at Debian (Bug#426414) and at the moment I have collected over one year of corrections...  most of the values are below 0.01s, these are also the values I have on kernel 2.6.8, but if I start a later kernel and use grep to extract from the log anything above 1 second in the last 1440 minutes...  wait, I'll do it now...

mario@croese:~$ tail -n 1440 /root/ntpdate.log | egrep "offset.-?[1-9]" | wc
     53     583    4504

here are the last 10, just to give an idea...
mario@croese:~$ tail -n 1440 /root/ntpdate.log | egrep "offset.-?[1-9]" | tail
 2 Jul 16:06:06 ntpdate[13245]: step time server 83.98.155.30 offset -48.425601 sec
 2 Jul 16:22:14 ntpdate[13355]: step time server 85.17.224.88 offset -68.058509 sec
 2 Jul 16:39:58 ntpdate[13501]: step time server 194.171.167.130 offset -142.177247 sec
 2 Jul 16:47:41 ntpdate[13545]: step time server 131.211.182.93 offset -46.371272 sec
 2 Jul 17:10:11 ntpdate[13687]: step time server 131.211.84.189 offset -29.214178 sec
 2 Jul 17:42:22 ntpdate[13838]: step time server 85.17.224.88 offset -138.328958 sec
 2 Jul 18:08:16 ntpdate[15233]: step time server 145.24.129.5 offset -159.231000 sec
 2 Jul 18:42:44 ntpdate[18475]: step time server 192.87.106.2 offset -28.497221 sec
 2 Jul 19:03:21 ntpdate[20403]: step time server 204.9.53.11 offset -107.613813 sec
 2 Jul 19:13:50 ntpdate[21383]: step time server 213.206.85.19 offset -59.320549 sec

not to count the ones below one second but above the normal drift...

Steps to reproduce: install Debian sid on such an iMac, use it normally, just wait a few minutes (seldom longer than 30).

{{
please help!

the problem with this is when trying to record my favourite programs from the radio: it's extremely difficult to estimate how many seconds there are between 0:00 and 0:10 at night (sometimes just 600, but in some cases 800 or even more...).  for longer broadcasts I can't reliably do it unattended.
}}
Comment 1 Andrew Morton 2008-07-02 11:02:50 UTC
I marked this as a regression.
Comment 2 john stultz 2008-07-29 19:21:32 UTC
I believe this is due to the timebase frequency being incorrect.

I see similar behavior on my mac mini. My understanding, last I looked into it, is we're pulling the frequency from the firmware tables, but it is somehow incorrect. Trying to use the calibrated method also failed to improve the freq.

For instance: on my current 2.6.24-19-powerpc kernel, running on a 1.4ghz mac mini, I get an uncorrectable ~2300ppm drift.
Comment 3 Mario Frasca 2008-10-23 00:11:06 UTC
replying to John: to me it does not seem the same as a drift...  what happens on your system if you disable ntp and run the same bash one-liner given here in the report?  do you also see a "normal" behaviour and on top of it some random jumps?
Comment 4 john stultz 2008-10-29 08:05:36 UTC
Hmmm. Good point. So I see lots of jumps, but of a similar magnitude, since the drift is constant.

Yours is much more erratic. 

Can you attach a dmesg log?  I'm baffled as to what changes between 2.6.8 and 2.6.14 might have caused this. Is there any chance you can bisect the versions down a bit further?
Comment 5 Mario Frasca 2008-10-29 23:40:22 UTC
I'm running Debian on a very slow system...  recompiling a kernel costs quite some time and recompiling 7 will cost 7 times that...  I can't remember from where I had retrieved those precompiled kernels and I haven't got them any more (needed some space).  can you point me at some info about recompiling (in particular configuring) the kernel so that I get the same that is distributed by Debian?
Comment 6 john stultz 2010-01-05 01:44:09 UTC
Mario: Just wanted to check in and see if this issue still existed with more recent kernel versions?
Comment 7 Mario Frasca 2010-01-05 13:47:12 UTC
yes, it's still there, in kernel 2.6.30.

2.6.30-1-powerpc 2010-01-01 06:00:01 up 15 days, 9:07, 1 user, load average: 1.45, 0.67, 0.32
2.6.30-1-powerpc 2010-01-02 06:00:01 up 20 days, 5:19, 0 users, load average: 0.40, 0.11, 0.03
2.6.30-1-powerpc 2010-01-03 06:00:02 up 21 days, 7:14, 0 users, load average: 0.06, 0.07, 0.02
2.6.30-1-powerpc 2010-01-04 06:00:02 up 22 days, 8:42, 0 users, load average: 0.04, 0.01, 0.00
2.6.30-1-powerpc 2010-01-05 06:00:01 up 23 days, 13:50, 0 users, load average: 0.07, 0.02, 0.00

the HUGE amount of time jumps on the 1st of January were possibly linked to me doing a safe-upgrade over a ssh link (most jumps occur while load is high).
Comment 8 Mario Frasca 2010-03-22 11:45:38 UTC
update:

an update in January made the situation temporarily very close to acceptable.

the update was not to the kernel version, I would not know what was the part modified, maybe you can help me look into the logs, but anyway, on the 7th of January I rebooted the system into the same kernel as the last logs I gave here above and these are the first few lines:

2.6.30-1-powerpc 2010-01-07 08:23:22 REBOOT
2.6.30-1-powerpc 2010-01-08 06:00:01 up 21:54, 7 users, load average: 0.04, 0.29, 0.22
.
.
.
2.6.30-1-powerpc 2010-01-12 06:00:01 up 4 days, 21:54, 7 users, load average: 0.00, 0.05, 0.12
.
.
.
2.6.30-1-powerpc 2010-01-18 06:00:02 up 10 days, 21:58, 10 users, load average: 0.06, 0.33, 0.24
2.6.30-1-powerpc 2010-01-19 06:00:01 up 11 days, 22:03, 8 users, load average: 0.02, 0.24, 0.23


this would be good enough for me and did not change when I booted into kernel 
2.6.32-trunk-powerpc the 20th of February.

I updated and rebooted the system the 3rd of March.  the behaviour did not seem to change significantly, that is, the few values I collected remain within the limits of one or less minutes per day.  rebooted  and the 7th of March and things got worse, with daily cumulative jumps up to 20 minutes.  after an other update and reboot, I'm bact to this:

2.6.32-trunk-powerpc 2010-03-14 10:02:39 REBOOT
2.6.32-trunk-powerpc 2010-03-15 06:00:01 up 20:00, 5 users, load average: 0.04, 0.37, 0.45
2.6.32-trunk-powerpc 2010-03-16 06:00:01 up 1 day, 20:09, 3 users, load average: 0.71, 0.74, 0.53
2.6.32-trunk-powerpc 2010-03-17 06:00:01 up 2 days, 20:24, 3 users, load average: 2.30, 0.96, 0.56
2.6.32-trunk-powerpc 2010-03-18 06:00:01 up 3 days, 20:35, 3 users, load average: 0.00, 0.07, 0.26
2.6.32-trunk-powerpc 2010-03-19 06:00:01 up 4 days, 21:35, 5 users, load average: 0.00, 1.09, 1.66
2.6.32-trunk-powerpc 2010-03-20 06:00:01 up 5 days, 22:53, 3 users, load average: 2.00, 2.29, 2.66
2.6.32-trunk-powerpc 2010-03-21 06:00:33 up 7 days, 28 min, 3 users, load average: 18.64, 10.34, 5.88
Comment 9 Mario Frasca 2010-03-22 12:19:04 UTC
a bit more statistical info...

in the period 7th January - 20th February I had at least 21 random forward time jumps of entity 1s or above (I don't keep track of individual jumps, I correct them once each minute and so two jumps of 4 seconds in the same minute are to me indistinguishable from one of 8 seconds).  largest jump was 166s.

in the period 20th February to now I observed 500 more jumps, longest 856s.

there seem to be some preferred values, like around 170s...  whatever the cause might be.
Comment 10 john stultz 2012-05-22 16:37:46 UTC
Mario: My apologies for the embarrassingly slow response here. I was just reminded of this issue by Alan's kernel version update.

If you still are hitting this problem, could you please attach dmesg output?
Comment 11 Mario Frasca 2012-05-23 11:03:08 UTC
at the moment my server is hard-disk-dead, so I'm not experiencing the problem any more :(

Note You need to log in before you can comment on or make changes to this bug.