Latest working kernel version:2.6.27 Earliest failing kernel version:2.6.28-rc1 Distribution: RHEL 4.92 Hardware Environment: Intel Bensley/Clovertown Software Environment: RHEL 4.92 Problem Description: gettimeofday return a old time in mmbench Steps to reproduce: 1, set "mem=1G" in kernel cmdline 2, compile mmbench as gcc -o mmbench -lm -g mmbench.c 3, run cmd #mmbench for a while. or run mmbench in GDB directly. sometime, mmbench get a segmentation fault. and GDB report : (gdb) r Starting program: /bm/tmp/benchmarks/run_bmmmbench/mmbench/mmbench Program received signal SIGSEGV, Segmentation fault. 0x0000000000401925 in main (argc=1, argv=0x7fff0b3d1cf8) at mmbench.c:410 410 accs_num[diff_gran]++; // access time distribution (gdb) p diff_gran $2 = -42890 according to mmbench.c, the diff_gran is equal to (gettimeofday(current time) - gettimeofday(start time)) So, the negative value of diff_gran means gettimeofday get a old time. I also write a gettimeofday checking program gtod.c . but this simple program can not reproduce this issue.
I'll switch this to x86_64.
Created attachment 18723 [details] mmbench.c file it is mmbench.c file
== #include <sys/time.h> #include <stdio.h> long v[100000000]; main (){ struct timeval tv; int i,j ; for (i=0; i<100000000; i++){ gettimeofday(&tv, 0); v[i] = tv.tv_sec * 1000000 + tv.tv_usec; //printf("sec is %ld usec is %ld\n", tv.tv_sec,tv.tv_usec); } for (i=0;i<9999999;i++) if(v[i]>v[i+1]) printf("gettimeofday error %8d %8d\n", v[i], v[i+1]); } == it is my gettimeofday checking program. compile as gcc -o g gtod.c I have run it as $for ((i=0;i<100;i++)) do ./g & ./g & ./g; done it will run a long time. but not error print out.
Created attachment 18724 [details] dmesg log
Created attachment 18725 [details] kernel config file
just a quick question, does my time-warp test: http://people.redhat.com/mingo/time-warp-test/time-warp-test.c detect a GTOD time warp too? Ingo
Handled-By : Ingo Molnar <mingo@elte.hu>
Ingo: seems your test case detect some errors, your case output as below: 8 CPUs, running 8 parallel test-tasks. checking for time-warps via: - read time stamp counter (RDTSC) instruction (cycle resolution) - gettimeofday (TOD) syscall (usec resolution) - clock_gettime(CLOCK_MONOTONIC) syscall (nsec resolution) new TSC-warp maximum: -4294958692 cycles, 00000000fffffe22 -> 0000000000001fbe new TSC-warp maximum: -4294959196 cycles, 00000000ffffeab5 -> 0000000000000a59 new TSC-warp maximum: -4294963588 cycles, 00000000fffff839 -> 00000000000006b5 new TOD-warp maximum: -4294952 usecs, 00045b4b09861f8b -> 00045b4b09449663 new TOD-warp maximum: -4294965 usecs, 00045b4b09956228 -> 00045b4b0953d8f3 new TSC-warp maximum: -4294964029 cycles, 00000000fffffb6b -> 000000000000082e new TSC-warp maximum: -4294964047 cycles, 00000000fffffc72 -> 0000000000000923 new TSC-warp maximum: -4294965766 cycles, 00000000fffffea2 -> 000000000000049c new TOD-warp maximum: -4294966 usecs, 00045b4b2a116d9c -> 00045b4b29cfe466 new TSC-warp maximum: -4294966099 cycles, 00000000ffffff46 -> 00000000000003f3 new TSC-warp maximum: -4294966513 cycles, 00000000ffffff43 -> 0000000000000252 Alex
still exists on rc4
TSC warp exists on 2.6.27~2.6.28-rc4. but for TOD warp it just exist on 2.6.28-rcx serial kernel. Alex
Can you please provide the output of /sys/devices/system/clocksource/clocksource0/current_clocksource on 2.6.27 and 2.6.28-rc4 ?
all of kernel use tsc as clock source.
Ok, can reproduce now. Debugging
Can you please check /var/log/messages for NTP entries ?
we did not use NTP on this machine.
Ingo: we have checked the code time-warp-test.c: ... #define rdtscll(val) \ do { \ __asm__ __volatile__("cpuid; rdtsc" : "=A" (val):: "ebx","ecx","edx","memory" ); \ } while (0) ... In the intel archatecture. the rdtsc return its 64bits value in eax and edx, but your code just get the eax value. So we get lots of TSC warp warning: new TSC-warp maximum: -4294958692 cycles, 00000000fffffe22 -> 0000000000001fbe new TSC-warp maximum: -4294959196 cycles, 00000000ffffeab5 -> 0000000000000a59 new TSC-warp maximum: -4294963588 cycles, 00000000fffff839 -> 00000000000006b5 It mabye incorrect. So we modify this code as below: #define rdtscll(val) \ do { \ unsigned low, high; \ __asm__ __volatile__("cpuid; rdtsc" : "=a" (low), "=d" (high)); \ val = low | (unsigned long long)(high) << 32; \ } while (0)
Hmm. I was able to reproduce and debugged out that NTP was the reason for the wreckage. For some yet unknown reason NTP was not able to use the pre 2.6.28 /var/lib/ntp/drift value. I stopped NTP and set this to 0.0 manually, restarted NTP and it did not happen again in 9 hrs. According to your time-warp-test output your time is off by > 4 seconds. Can you verify that nothing calls settimeofday() ?
(In reply to comment #16) changing the delta calculation in test_TSC() - delta = t1-t0; + delta = (t1-t0) & 0xffffffff; works as well :)
OK, I will open oprofile to check the settimeofday, if the issue reproduced. I try to reproduce the TOD warp issue on my machine for 2.6.28-rc4. somtime it is not easy to reproduce. it seems the gettimeofday has a nature issue in theory before copy_to_user. Another gettimeofday system call can happen between do_gettimeofday and copy_to_user. asmlinkage long sys_gettimeofday(struct timeval __user *tv, struct timezone __user *tz) { if (likely(tv != NULL)) { struct timeval ktv; do_gettimeofday(&ktv); if (copy_to_user(tv, &ktv, sizeof(ktv))) return -EFAULT;
well, that does not matter with Ingos time-warp-test lock(&SHARED(LOCK)); rdtod(T1); T0 = SHARED_LL(TOD); SHARED_LL(TOD) = T1; SHARED_LL(NR_TOD_LOOPS)++; unlock(&SHARED(LOCK)); We have the lock around the rdtod and the update of the shared data structure. With this test we should never ever be able to see time going backwards with gettimeofday(). In an unlocked scenario there is no guarantee aside of one: A single task can never observe time going backwards even if it migrates from one CPU to another.
Thomas: you are reasonable... But I can not reproduce it from yesterday on my 3 kind of machines whenever with mmbench or time-warp-test. but however, I can sure there is no one call settimeofday etc systemcall to reset time during testing. I am going to retest more in this night to see whether it appears.
I have done some bisect for this issue, and found it was introduced by the commit: 99ebcf8285df28f32fd2d1c19a7166e70f00309c and sys_gettimeofday was changed from this commit. the last kerenl also has this issue.
Handled-By : Thomas Gleixner <tglx@linutronix.de> First-Bad-Commit : 99ebcf8285df28f32fd2d1c19a7166e70f00309c
> ------- Comment #22 from alex.shi@intel.com 2008-11-14 00:22 ------- > I have done some bisect for this issue, and found it was introduced by the > commit: > 99ebcf8285df28f32fd2d1c19a7166e70f00309c > and sys_gettimeofday was changed from this commit. the last kerenl also has > this issue. Hmm, this is not very helpful, as this is a merge commit. The code in kernel/time/timekeeping.c was changed by more than one commit. Thanks, tglx
I did some modifcation of time-warp-test and find the TOD get a old time while error happen. @@ -240,7 +241,7 @@ static inline void test_TOD(unsigned long *shared) { #if TEST_TOD - usecs_t T0, T1; + static usecs_t T0,T00, T1 ; long long delta; lock(&SHARED(LOCK)); @@ -256,10 +257,12 @@ SHARED(NR_TOD_WARPS)++; if (delta < SHARED_LL(WORST_TOD)) { SHARED_LL(WORST_TOD) = delta; - fprintf(stderr, "\rnew TOD-warp maximum: %9Ld usecs, %016Lx -> %016Lx\n", - delta, T0, T1); + fprintf(stderr, "\rnew TOD-warp maximum: %9Ld usecs, %016Lx -> %016Lx, T00 is %016Lx\n", + delta, T0, T1, T00); } unlock(&SHARED(LOCK)); + }else{ + T00=T0; } it is output: new TOD-warp maximum: -4294940 usecs, 00045bec66e17038 -> 00045bec669fe71c, T00 is 00045bec66e17030 another message,
A nother message is when I close the print_status in time-warp-test, I tried several times, do not see the warp happen.
I have added a static variable: oldns to check the xtime variable in timekeep.c: static s64 oldns; do { seq = read_seqbegin(&xtime_lock); *ts = xtime; newns = timespec_to_ns(ts); if((oldns - newns) > 0 ){ printk(KERN_ERR "Alex \n **** \n find error in %s oldns is %016llx, newns is %016llx \n **** \n", \ __func__, oldns, newns); dump_stack(); } /* read clocksource: */ cycle_now = clocksource_read(clock); /* calculate the delta since the last update_wall_time: */ cycle_delta = (cycle_now - clock->cycle_last) & clock->mask; if (cycle_now < clock->cycle_last){ printk(KERN_ERR "Alex \n **** \n find error in %s cycle_now %016llx, clock->cycle_last %016llx \n **** \n", \ __func__, cycle_now, clock->cycle_last); dump_stack(); } /* convert to nanoseconds: */ nsecs = cyc2ns(clock, cycle_delta); } while (read_seqretry(&xtime_lock, seq)); oldns = newns; and find some output. not clear it is useful: **** find error in getnstimeofday oldns is 11077b6db72423f1, newns is 11077b6db72423f0 **** Pid: 0, comm: swapper Not tainted 2.6.28-rc5 #2 Call Trace: <IRQ> [<ffffffff80211588>] read_tsc+0x9/0x1c [<ffffffff8024a938>] getnstimeofday+0x73/0x129 [<ffffffff80247cb5>] ktime_get_ts+0x22/0x4b [<ffffffff80247e23>] ktime_get+0xc/0x41 [<ffffffff80248faf>] sched_clock_tick+0x54/0x8a [<ffffffff8023c2e4>] run_timer_softirq+0x21/0x193 [<ffffffff80231d72>] scheduler_tick+0x3b/0xf3 [<ffffffff8023c8f3>] update_process_times+0x44/0x50 [<ffffffff8024cd50>] tick_periodic+0x68/0x73 [<ffffffff8024cd73>] tick_handle_periodic+0x18/0x51 [<ffffffff8021d18b>] smp_apic_timer_interrupt+0x8e/0xac [<ffffffff8020be1b>] apic_timer_interrupt+0x6b/0x70 <EOI> [<ffffffff80212242>] mwait_idle+0x41/0x44 [<ffffffff8020a08a>] cpu_idle+0x40/0x5e find error in getnstimeofday oldns is 11077b7793fe7106, newns is 11077b7793c16804 **** Pid: 3528, comm: time-warp-test Not tainted 2.6.28-rc5 #2 Call Trace: [<ffffffff8024a938>] getnstimeofday+0x73/0x129 [<ffffffff80247cb5>] ktime_get_ts+0x22/0x4b [<ffffffff8024432c>] posix_ktime_get_ts+0xc/0x10 [<ffffffff8020b117>] system_call_after_swapgs+0x17/0x65 [<ffffffff80244acf>] sys_clock_gettime+0x3c/0x8d [<ffffffff8020b17b>] system_call_fastpath+0x16/0x1b
oh, I guess the comments #27 explaination is: when one getnstimeofday running, another getnstimeoday use a newer xtime, so the oldns in second getnstimeofday was updated to a newer than the first one. Does it mean gettimeofday return a old data? seems not. oh, I should know clear about the read lock for xtime_lock first
> > oldns = newns; > > > and find some output. not clear it is useful: That's not useful at all as nothing protects oldns neither from cross cpu access nor from preemption. Thanks, tglx
Tglx: > [<ffffffff8024a938>] getnstimeofday+0x73/0x129 > [<ffffffff80247cb5>] ktime_get_ts+0x22/0x4b > [<ffffffff80247e23>] ktime_get+0xc/0x41 > [<ffffffff80248faf>] sched_clock_tick+0x54/0x8a > [<ffffffff8023c2e4>] run_timer_softirq+0x21/0x193 > [<ffffffff80231d72>] scheduler_tick+0x3b/0xf3 From above stack, getnstimeofday seems can be called in a irq process. so we need use read_seqbegin_irqsave to instead of read_seqbegin in getnstimeofday? I have tried it seems the error frequency is lower than before. I will tested again on 2.6.27 because 2.6.27 also does not use irqsave mode seqlock. BRGS Alex
(In reply to comment #30) > Tglx: > > > [<ffffffff8024a938>] getnstimeofday+0x73/0x129 > > [<ffffffff80247cb5>] ktime_get_ts+0x22/0x4b > > [<ffffffff80247e23>] ktime_get+0xc/0x41 > > [<ffffffff80248faf>] sched_clock_tick+0x54/0x8a > > [<ffffffff8023c2e4>] run_timer_softirq+0x21/0x193 > > [<ffffffff80231d72>] scheduler_tick+0x3b/0xf3 > > From above stack, getnstimeofday seems can be called in a irq process. so we > need use read_seqbegin_irqsave to instead of read_seqbegin in getnstimeofday? It doesn't matter as getnstimeofday just reads data instead of changing data I > have tried it seems the error frequency is lower than before. > > I will tested again on 2.6.27 because 2.6.27 also does not use irqsave mode > seqlock. > > > BRGS > Alex >
> Tglx: > > > [<ffffffff8024a938>] getnstimeofday+0x73/0x129 > > [<ffffffff80247cb5>] ktime_get_ts+0x22/0x4b > > [<ffffffff80247e23>] ktime_get+0xc/0x41 > > [<ffffffff80248faf>] sched_clock_tick+0x54/0x8a > > [<ffffffff8023c2e4>] run_timer_softirq+0x21/0x193 > > [<ffffffff80231d72>] scheduler_tick+0x3b/0xf3 > > >From above stack, getnstimeofday seems can be called in a irq process. so we > need use read_seqbegin_irqsave to instead of read_seqbegin in getnstimeofday? > I > have tried it seems the error frequency is lower than before. > > I will tested again on 2.6.27 because 2.6.27 also does not use irqsave mode > seqlock. No. That's wrong. There is no reason to use irqsave for that. The only place where we need irqs disabled is when we write lock xtime lock.
still exists on rc6 on my 2 machines. Below is my reproduce scripts which named "repd.sh" and locate at /home/alexs/. if you want to use it, pls modify its absolute location in script. hope it is helpful. ==== #!/bin/bash # #usage: /home/alexs/repd.sh 8 # timesf=/tmp/times #check reboot times in testing sed -i "/repd.sh/d" /etc/rc.local if [ ! -z "$1" ]; then echo $1 >$timesf times=$1 else times=`cat $timesf` fi #do testing untill timeout. $to is timeout seconds. to=300 if [ $times -gt 0 ];then ((times--)) echo $times > $timesf sync tmpf=/tmp/`date +%s` /home/alexs/time-warp-test 2> $tmpf & for((i=0; i<$to; i++)) do sleep 1; if grep TOD-warp $tmpf -q ;then echo "find TOD warp in this kernel `uname -r`" >> $tmpf sudo killall time-warp-test break; fi done if [ $i -ge $to ];then #write script into rc.local. needs a absolute script location. sudo sh -c "echo /home/alexs/repd.sh >> /etc/rc.local " sudo reboot exit 0 fi fi
I reproduced it on x86-64 machine. Steps: 1) Start Ingo's time-warp-test; 2) Start my own time test program. It's simple and just calls gettimeofday and compare the next value to make sure no warp. --- output after a couple of miniutes from my program: [ymzhang@linux-b03 bug]$ ./time_goback 1000000000000000 gettimeofday goes back at loop:[564350392]. diff_usec:[-4294967] stime.tv_sec=[1227252201] stime.tv_usec=[302334] gettimeofday goes back at loop:[602412232]. diff_usec:[-4294967] stime.tv_sec=[1227252207] stime.tv_usec=[302667] gettimeofday goes back at loop:[621441193]. diff_usec:[-4294967] stime.tv_sec=[1227252210] stime.tv_usec=[302833] It's interesting that the diff_usec is always -4294967. If comparing with the output posted by Alex, time-warps also says 4294967 or 4294966. I will instrument kernel to see if someone calls settimeofday.
Created attachment 18960 [details] My simple time check program Start time-warp-test, then start my time_goback. time_goback prints of many errors while time-warp-test doesn't. But if just starting time_goback, it seems no error outputed.
Handled-By : Yanmin Zhang <yanmin_zhang@linux.intel.com>
Created attachment 18996 [details] New time_goback The old one has a bug which causes coredump when there is no parameter.
In the update_wall_time function, I found if the 538 line was commented in the following code. the problem disappears in my testing with 8 times reboot, each of reboot running 300 seconds time-warp-test. seems the 531 line: clocksource_adjust(offset) do a incorrect xtime_nsec modifiction. that line was added in the first-bad commits. 502 clock->xtime_nsec = (s64)xtime.tv_nsec << clock->shift; 503 504 /* normally this loop will run just once, however in the 505 * case of lost or late ticks, it will accumulate correctly. 506 */ 507 while (offset >= clock->cycle_interval) { 508 /* accumulate one interval */ 509 offset -= clock->cycle_interval; 510 clock->cycle_last += clock->cycle_interval; 511 512 clock->xtime_nsec += clock->xtime_interval; 513 if (clock->xtime_nsec >= (u64)NSEC_PER_SEC << clock->shift) { 514 clock->xtime_nsec -= (u64)NSEC_PER_SEC << clock->shift; 515 xtime.tv_sec++; 516 second_overflow(); 517 } 518 519 clock->raw_time.tv_nsec += clock->raw_interval; 520 if (clock->raw_time.tv_nsec >= NSEC_PER_SEC) { 521 clock->raw_time.tv_nsec -= NSEC_PER_SEC; 522 clock->raw_time.tv_sec++; 523 } 524 525 /* accumulate error between NTP and clock interval */ 526 clock->error += tick_length; 527 clock->error -= clock->xtime_interval << (NTP_SCALE_SHIFT - clock->shift); 528 } 529 530 /* correct the clock when NTP error is too big */ 531 clocksource_adjust(offset); 532 533 /* store full nanoseconds into xtime after rounding it up and 534 * add the remainder to the error difference. 535 */ 536 xtime.tv_nsec = ((s64)clock->xtime_nsec >> clock->shift) + 1; 537 clock->xtime_nsec -= (s64)xtime.tv_nsec << clock->shift; 538 clock->error += clock->xtime_nsec << (NTP_SCALE_SHIFT - clock->shift); 539
Created attachment 19006 [details] Committed fix for timekeeping rounding issues Does reverting this patch resolve the issue?
Created attachment 19008 [details] Another patch to try If the reverting the previous patch didn't work, could you try applying this one?
I had tried 3 scenarios: 1, revert the patch 2, adding your patch 3, both of 1,2 and found in all of situations. the warp still exists.
Well, I think it's almost root casued. I instrumented kernel. 1) The bug is triggered when clock->mult is changed. Because it depends on cycle_interval and xtime_interval which are calculated when booting, so sometimes it can't be triggered, so we need reboot kernel often to trigger different initial mult and cycle_interval; 2) Sometimes, xtime.tv_nsec becomes -1 in function update_wall_time after statement xtime.tv_nsec = ((s64)clock->xtime_nsec >> clock->shift) + 1; As xtime.tc_nsec is a kind of long, but in do_vgettimeofday, nsec is unsigned long, so the time jumps forward, later gettimeofday returns a correct time, but the test program reports a go_back issue; 3) Why does xtime.tv_nsec become -1? clocksource_adjust does it at clock->xtime_nsec -= offset; My capture shows offset=9596319 while the original clock->xtime_nsec=3293928. So then clock->xtime_nsec is almost equal to -1UL and xtime.tv_nsec also becomes -1.
Perhaps just delete clock->xtime_nsec -= offset; from function clocksource_adjust? I will check it tomorrow.
(In reply to comment #43) > Perhaps just delete > clock->xtime_nsec -= offset; > from function clocksource_adjust? I'd not advise this, as it will just cause more issues. Especially with users of NTP. If this didn't happen in 2.6.27, there's only been 3 changes to the timekeeping.c file since then, so this shouldn't be hard to catch. We tried reverting one with the rounding issues patch, and that really seemed like the most likely to affect this, so I'm a bit confused. Never the less, I'll generate patches for the other ones and we can try to revert those as well.
(In reply to comment #44) > (In reply to comment #43) > > Perhaps just delete > > clock->xtime_nsec -= offset; > > from function clocksource_adjust? > > I'd not advise this, as it will just cause more issues. Especially with users > of NTP. > > If this didn't happen in 2.6.27, there's only been 3 changes to the > timekeeping.c file since then, so this shouldn't be hard to catch. We tried > reverting one with the rounding issues patch, and that really seemed like the > most likely to affect this, so I'm a bit confused. I confirm Alex that just reverting the one of rounding issue patch doesn't fix it. > > Never the less, I'll generate patches for the other ones and we can try to > revert those as well. >
Created attachment 19019 [details] clock-monotonic patch Here's another of the 2.6.28 changes to timekeeping.c Please revert attachment 19006 [details] (rounding fix), then this attachment.
Created attachment 19020 [details] clocksource forward now patch The last 2.6.28 change to timekeeping.c Please revert attachment 19006 [details], then 19019, and finally this patch.
(In reply to comment #47) > Created an attachment (id=19020) [details] > clocksource forward now patch > > The last 2.6.28 change to timekeeping.c > > Please revert attachment 19006 [details], then 19019, and finally this patch. > I reverted the 3 patches and didn't trigger the issue after running the testing for 1 hour. My debug statement didn't run into xtime.tv_nsec=-1.
I added a printk in do_settimeofday and didn't find the output after the testing is started. I checked patch mono_raw and clocksource_forward (19019&19020) manually. It seems these 2 patches don't change the logic of xtime changing. Perhaps this issue exists __theoretically__, but latest kernel triggers it. My debugging in update_wall_time shows: xtime.tv_nsec:[-1] is negative. xtime_nsec01[3234962] error01[6517645312] tick_length01[17179869184000000] tick_length[17179869184000000] old offset[18978759] offset[9378909] old mult[1747653] shift[22] cycle_interval[9599850]xtime_interval[16777206652050] xtime_nsec[18446744073708592274]error[-3054655488] Current mult[1747654] shift[22] cycle_interval[9599850]xtime_interval[16777216251900] xtime_nsec[18446744073703407669]error[6291401728] old xtime tv_sec[1227679454] tv_nsec[996000003] Current xtime tv_sec[1227679455] tv_nsec[-1] old offset is the offset initial value in update_wall_time, and offset is the value after the while loop. xtime_nsec01 is clock->xtime_nsec before calling clocksource_adjust. Others are just old xtime&clock when entering update_wall_time, or new xtime&clock after new xtime.tv_nsec becomes -1. So when the interrupt is sent after more than 1 tick, but less than 2 ticks, e.g., 1.999 ticks, clock->xtime_nsec might become less than offset.
(In reply to comment #48) > (In reply to comment #47) > > Created an attachment (id=19020) [details] [details] > > clocksource forward now patch > > > > The last 2.6.28 change to timekeeping.c > > > > Please revert attachment 19006 [details] [details], then 19019, and finally > this patch. > > > > I reverted the 3 patches and didn't trigger the issue after running the > testing > for 1 hour. My debug statement didn't run into xtime.tv_nsec=-1. Just to be sure, can you re-add attachment #19020 [details] and see if the issue reappears?
(In reply to comment #49) > Perhaps this issue exists __theoretically__, but latest kernel triggers it. > > My debugging in update_wall_time shows: Could you attach the patch that generated this output? thanks -john
Created attachment 19024 [details] A debugging patch to instrument kernel Apply it to 2.6.28-rc6. But We just trigger the issue on 2 machines. On other machines, we couldn't trigger it.
(In reply to comment #49) > My debugging in update_wall_time shows: > xtime.tv_nsec:[-1] is negative. xtime_nsec01[3234962] error01[6517645312] Hrm.. So xtime_nsec01 is xtime_nsec before clocksource_adjust(). But xtime_nsec01 >> 22 is 0, so that's less then a nanosecond (but not zero). Since we set it to be clock->xtime_nsec = (s64)xtime.tv_nsec << clock->shift, that means we must have hit the second overflow case. So I'm wondering if we're hitting the same rounding issue we had earlier at the nanosecond boundary (addressed by Roman's timekeeping rounding patch), but just on the second boundary. Its a little late, so I'm not confident there's no side effects, but I've got a patch you can try, basically deferring the second overflow till after we adjust the clocksource. Will attach shortly.
Created attachment 19028 [details] patch to do the second overflow accumulation after adjusting the clocksource. Here's a patch that moves the second overflow code to after the adjust_clocksource() call. This should avoid xtime_nsec from underflowing and rolling negative. I'm hoping this resolves the issue here. However, I'm not 100% confident this patch is right. I'm a little worried we might actually overflow xtime_nsec if the update_wall_time is deferred for too long. The proper fix may be to detect the underflow and propagate it back up to the xtime.tv_sec field. But I'm a bit tired for that right now.
Created attachment 19030 [details] fix negative underflow Ok, I think I have a better way to solve this (if I'm actually chasing the right issue). One thing I was having a hard time understanding is how we could be underflowing and going negative if we always round xtime.tv_nsec up by one. So instead of accumulating the second overflows later, we can instead just add 1 nanosecond to the xtime_nsec value before we do the shift (instead of after) and that should avoid the issue. This also avoids any problem with overflowing xtime_nsec. Fair warning: Its a little late here, so this may be wrong. I've not had the chance to test this myself. Hopefully its not totally off base. :)
tried just revert clock_forward_now patch on rc6. it also has this issue.
(In reply to comment #53) > (In reply to comment #49) > > My debugging in update_wall_time shows: > > xtime.tv_nsec:[-1] is negative. xtime_nsec01[3234962] error01[6517645312] > > Hrm.. So xtime_nsec01 is xtime_nsec before clocksource_adjust(). > > But xtime_nsec01 >> 22 is 0, so that's less then a nanosecond (but not zero). > Since we set it to be clock->xtime_nsec = (s64)xtime.tv_nsec << clock->shift, > that means we must have hit the second overflow case. > > So I'm wondering if we're hitting the same rounding issue we had earlier at > the > nanosecond boundary (addressed by Roman's timekeeping rounding patch), but > just > on the second boundary. > > Its a little late, so I'm not confident there's no side effects, but I've got > a > patch you can try, basically deferring the second overflow till after we > adjust > the clocksource. I tested it. The patch does fix it, from mathematics point of view. tick_length might be changed by second_overflow. > > Will attach shortly. >
(In reply to comment #56) > tried just revert clock_forward_now patch on rc6. it also has this issue. > It's a potential issue and triggered by the latest kernel. Let's resolve it with a new patch.
(In reply to comment #57) > (In reply to comment #53) > > (In reply to comment #49) > > > My debugging in update_wall_time shows: > > > xtime.tv_nsec:[-1] is negative. xtime_nsec01[3234962] error01[6517645312] > > > > Hrm.. So xtime_nsec01 is xtime_nsec before clocksource_adjust(). > > > > But xtime_nsec01 >> 22 is 0, so that's less then a nanosecond (but not > zero). > > Since we set it to be clock->xtime_nsec = (s64)xtime.tv_nsec << > clock->shift, > > that means we must have hit the second overflow case. > > > > So I'm wondering if we're hitting the same rounding issue we had earlier at > the > > nanosecond boundary (addressed by Roman's timekeeping rounding patch), but > just > > on the second boundary. > > > > Its a little late, so I'm not confident there's no side effects, but I've > got a > > patch you can try, basically deferring the second overflow till after we > adjust > > the clocksource. > I tested it. The patch does fix it, from mathematics point of view. I am really sorry. I found I replied too early. Now, my instrumented kernel printed out: xtime go back and diff_nsec[-2] xtime_nsec01[3338665954639872] old offset[9599399] offset[9599399] old mult[1747592] shift[22] cycle_interval[9600196]xtime_interval[16777225728032] xtime_nsec[18446744073706729512]error[10403569664] Current mult[1747593] shift[22] cycle_interval[9600196]xtime_interval[16777235328228] xtime_nsec[18446744073708340825]error[9162903552] old xtime tv_sec[1227688952] tv_nsec[795999993] Current xtime tv_sec[1227688952] tv_nsec[795999991] So the second check is triggered that xtime->tv_nsec gos back with 2 nsec. I know that user application never finds 2 nsec backward. But ... > > tick_length might be changed by second_overflow. > > > > > Will attach shortly. > > >
(In reply to comment #59) > > I tested it. The patch does fix it, from mathematics point of view. > I am really sorry. I found I replied too early. Now, my instrumented kernel > printed out: > > xtime go back and diff_nsec[-2] xtime_nsec01[3338665954639872] > old offset[9599399] offset[9599399] > old mult[1747592] shift[22] > cycle_interval[9600196]xtime_interval[16777225728032] > xtime_nsec[18446744073706729512]error[10403569664] > Current mult[1747593] shift[22] > cycle_interval[9600196]xtime_interval[16777235328228] > xtime_nsec[18446744073708340825]error[9162903552] > old xtime tv_sec[1227688952] tv_nsec[795999993] > Current xtime tv_sec[1227688952] tv_nsec[795999991] So what set of patches is this with? (Sorry, there have been quite a few). > So the second check is triggered that xtime->tv_nsec gos back with 2 nsec. I > know that user application never finds 2 nsec backward. But ... This could be ok. This part of the timekeeping code is pretty complicated. Let me try to explain. Timekeeping is calcualted as the following: gtod = xtime + cyc2ns(now - cycle_last) Where cyc2ns(x) = x*mult >> shift So think of time as two parts: 1) accumulated, and 2) not accumulated. { accumulated }{ not accumulated } |----------xtime------------|=== cyc2ns(now-cycle_last) ===| When slowing or speeding up the clock (for ntp or just fine error correction), we do so by changing the mult value. However, we want the speeding up or slowing down to only be in affect *after* we change it. In other words, before we change the mult value, and after we've changed mult value we still need to be at the same point in time. |----------xtime------------|=== cyc2ns(now-cycle_last) ===| So in the case above, the mult value is increased by one, which will cause cyc2ns() to return a larger value for the same now value. |----------xtime------------|=== cyc2ns(now-cycle_last) ========| So to keep time constant we reduce xtime by change in offset. |----------xtime-------|=== cyc2ns(now-cycle_last) ========| Does that make sense?
So after sleeping on the patches I sent out last night, while the first probably works, it likely causes side effects as Yanmin pointed out. The second I'm suspecting won't solve the issue, since we can reduce xtime_nsec by more then 1 nanosecond, but it will likely make it harder to trigger.
Comment on attachment 19030 [details] fix negative underflow Marking this one obsolete since I now don't think it will work.
Created attachment 19034 [details] catch nsec underflows and fix them using the error value Ok, here's a new shot at getting this right. Since we can't undo the second overflow or move it later in the function, we can just push time forward enough to avoid the underflow. To keep things sane, we store the amount we changed xtime_nsec into the error value, which will result in corrective action the next time around. Let me know if this works for you.
(In reply to comment #60) > (In reply to comment #59) > > > I tested it. The patch does fix it, from mathematics point of view. > > I am really sorry. I found I replied too early. Now, my instrumented kernel > > printed out: > > > > xtime go back and diff_nsec[-2] xtime_nsec01[3338665954639872] > > old offset[9599399] offset[9599399] > > old mult[1747592] shift[22] > > cycle_interval[9600196]xtime_interval[16777225728032] > > xtime_nsec[18446744073706729512]error[10403569664] > > Current mult[1747593] shift[22] > > cycle_interval[9600196]xtime_interval[16777235328228] > > xtime_nsec[18446744073708340825]error[9162903552] > > old xtime tv_sec[1227688952] tv_nsec[795999993] > > Current xtime tv_sec[1227688952] tv_nsec[795999991] > > So what set of patches is this with? (Sorry, there have been quite a few). > > > So the second check is triggered that xtime->tv_nsec gos back with 2 nsec. > I > > know that user application never finds 2 nsec backward. But ... > > This could be ok. > > This part of the timekeeping code is pretty complicated. Let me try to > explain. > > Timekeeping is calcualted as the following: > gtod = xtime + cyc2ns(now - cycle_last) > > Where cyc2ns(x) = x*mult >> shift > > > So think of time as two parts: 1) accumulated, and 2) not accumulated. > > { accumulated }{ not accumulated } > |----------xtime------------|=== cyc2ns(now-cycle_last) ===| > > > When slowing or speeding up the clock (for ntp or just fine error > correction), > we do so by changing the mult value. However, we want the speeding up or > slowing down to only be in affect *after* we change it. In other words, > before > we change the mult value, and after we've changed mult value we still need to > be at the same point in time. > > > |----------xtime------------|=== cyc2ns(now-cycle_last) ===| > > So in the case above, the mult value is increased by one, which will cause > cyc2ns() to return a larger value for the same now value. > > |----------xtime------------|=== cyc2ns(now-cycle_last) ========| > > So to keep time constant we reduce xtime by change in offset. > > |----------xtime-------|=== cyc2ns(now-cycle_last) ========| > > > Does that make sense? > Definitely. Thanks for your detailed explanation. yanmin
(In reply to comment #63) > Created an attachment (id=19034) [details] > catch nsec underflows and fix them using the error value > > Ok, here's a new shot at getting this right. Since we can't undo the second > overflow or move it later in the function, we can just push time forward > enough > to avoid the underflow. To keep things sane, we store the amount we changed > xtime_nsec into the error value, which will result in corrective action the > next time around. > > Let me know if this works for you. > I am testing it on 2 machines and didn't trigger the issue in 1 hour. In addition, I just saw once that: xtime go back and diff_nsec[-2] in update_wall_times. So the patch is better. Another new idea from my coworker: Can we add a checking in clocksource_adjust to abort the adjust when clock->xtime_nsec < offset? I will test the new idea.
Created attachment 19046 [details] Patch to fix it with a new idea John, Here is the patch with the new idea. It just aborts the adjust when (s64)clock->xtime_nsec < offset. I tested it for 1 hour on 2 machines and didn't trigger the issue. Yanmin
John, As the overflow happens rarely, it might be better to add a unlikely in your patch. Or add a likely in my patch. I ran the testing on 2 machines for more than 20 hours and didn't trigger the issue. yanmin
(In reply to comment #66) > Created an attachment (id=19046) [details] > Patch to fix it with a new idea > > Here is the patch with the new idea. It just aborts the adjust when > (s64)clock->xtime_nsec < offset. > > I tested it for 1 hour on 2 machines and didn't trigger the issue. Yep, that was another method I was considering, however your patch has one issue with it: offset may be made negative at that point. It would probably be better to just check and return at the very top of clocksource_adjust(). Either method should work. Although I think re-adding it to the error is more consistent to how we deal with other rounding issues. So I'll submit that to lkml today and should we get negative feedback, we'll propose a corrected variant of your patch.
I've sent my patch to lkml.
Thanks John. Your email to LKML is very clear with detailed explanation. Hope 2.6.28-rc7 merges the patch.
Fixed by: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6c9bacb41c10ba84ff68f238e234d96f35fb64f7