Bug 11970 - gettimeofday return a old time in mmbench
Summary: gettimeofday return a old time in mmbench
Status: CLOSED CODE_FIX
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64 (show other bugs)
Hardware: All Linux
: P1 high
Assignee: platform_x86_64@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks: 11808
  Show dependency tree
 
Reported: 2008-11-06 23:57 UTC by Alex Shi
Modified: 2008-12-07 12:25 UTC (History)
3 users (show)

See Also:
Kernel Version: 2.6.28-rc3
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
mmbench.c file (17.22 KB, text/plain)
2008-11-07 00:51 UTC, Alex Shi
Details
dmesg log (30.76 KB, text/plain)
2008-11-07 01:05 UTC, Alex Shi
Details
kernel config file (33.65 KB, text/plain)
2008-11-07 01:06 UTC, Alex Shi
Details
My simple time check program (1.04 KB, patch)
2008-11-20 23:54 UTC, Yanmin Zhang
Details | Diff
New time_goback (1.04 KB, patch)
2008-11-23 18:01 UTC, Yanmin Zhang
Details | Diff
Committed fix for timekeeping rounding issues (2.12 KB, patch)
2008-11-24 11:56 UTC, john stultz
Details | Diff
Another patch to try (1.10 KB, patch)
2008-11-24 14:52 UTC, john stultz
Details | Diff
clock-monotonic patch (6.44 KB, patch)
2008-11-25 19:11 UTC, john stultz
Details | Diff
clocksource forward now patch (6.10 KB, patch)
2008-11-25 19:12 UTC, john stultz
Details | Diff
A debugging patch to instrument kernel (4.39 KB, patch)
2008-11-25 23:20 UTC, Yanmin Zhang
Details | Diff
patch to do the second overflow accumulation after adjusting the clocksource. (1.36 KB, patch)
2008-11-26 00:18 UTC, john stultz
Details | Diff
fix negative underflow (1.45 KB, patch)
2008-11-26 00:39 UTC, john stultz
Details | Diff
catch nsec underflows and fix them using the error value (638 bytes, patch)
2008-11-26 12:34 UTC, john stultz
Details | Diff
Patch to fix it with a new idea (657 bytes, patch)
2008-11-26 23:35 UTC, Yanmin Zhang
Details | Diff

Description Alex Shi 2008-11-06 23:57:16 UTC
Latest working kernel version:2.6.27 
Earliest failing kernel version:2.6.28-rc1
Distribution: RHEL 4.92 
Hardware Environment: Intel Bensley/Clovertown
Software Environment: RHEL 4.92
Problem Description: 
gettimeofday return a old time in mmbench

Steps to reproduce:
1, set "mem=1G" in kernel cmdline 
2, compile mmbench as gcc -o mmbench -lm -g mmbench.c 
3, run cmd #mmbench for a while. or run mmbench in GDB directly. 
sometime, mmbench get a segmentation fault. and GDB report :
(gdb) r
Starting program: /bm/tmp/benchmarks/run_bmmmbench/mmbench/mmbench 
Program received signal SIGSEGV, Segmentation fault.
0x0000000000401925 in main (argc=1, argv=0x7fff0b3d1cf8) at mmbench.c:410
410                     accs_num[diff_gran]++;          // access time distribution
(gdb) p diff_gran
$2 = -42890
according to mmbench.c, the diff_gran is equal to (gettimeofday(current time) - gettimeofday(start time))

So, the negative value of diff_gran means gettimeofday get a old time. 

I also write a gettimeofday checking program gtod.c . but this simple program can not reproduce this issue.
Comment 1 Andrew Morton 2008-11-07 00:07:19 UTC
I'll switch this to x86_64.
Comment 2 Alex Shi 2008-11-07 00:51:43 UTC
Created attachment 18723 [details]
mmbench.c file 

it is mmbench.c file
Comment 3 Alex Shi 2008-11-07 00:59:49 UTC
==
#include <sys/time.h>
#include <stdio.h>
long v[100000000];
main (){
	struct timeval tv;
	int i,j ;
	for (i=0; i<100000000; i++){
		gettimeofday(&tv, 0);
		v[i] = tv.tv_sec * 1000000 + tv.tv_usec;
		//printf("sec is %ld usec is %ld\n", tv.tv_sec,tv.tv_usec);	
	}
	for (i=0;i<9999999;i++)
		if(v[i]>v[i+1])
			printf("gettimeofday error %8d %8d\n", v[i], v[i+1]);
			
}
==
it is my gettimeofday checking program. compile as 
gcc -o g gtod.c 
I have run it as 
$for ((i=0;i<100;i++)) do ./g & ./g & ./g; done 
it will run a long time. but not error print out. 
Comment 4 Alex Shi 2008-11-07 01:05:41 UTC
Created attachment 18724 [details]
dmesg log
Comment 5 Alex Shi 2008-11-07 01:06:54 UTC
Created attachment 18725 [details]
kernel config file
Comment 6 Ingo Molnar 2008-11-07 01:22:53 UTC
just a quick question, does my time-warp test:

    http://people.redhat.com/mingo/time-warp-test/time-warp-test.c

detect a GTOD time warp too?

	Ingo
Comment 7 Rafael J. Wysocki 2008-11-09 10:56:23 UTC
Handled-By : Ingo Molnar <mingo@elte.hu>
Comment 8 Alex Shi 2008-11-09 19:12:48 UTC
Ingo:
seems your test case detect some errors, your case output as below:

8 CPUs, running 8 parallel test-tasks.
checking for time-warps via:
- read time stamp counter (RDTSC) instruction (cycle resolution)
- gettimeofday (TOD) syscall (usec resolution) 
- clock_gettime(CLOCK_MONOTONIC) syscall (nsec resolution)

new TSC-warp maximum: -4294958692 cycles, 00000000fffffe22 -> 0000000000001fbe
new TSC-warp maximum: -4294959196 cycles, 00000000ffffeab5 -> 0000000000000a59
new TSC-warp maximum: -4294963588 cycles, 00000000fffff839 -> 00000000000006b5
new TOD-warp maximum:  -4294952 usecs,  00045b4b09861f8b -> 00045b4b09449663
new TOD-warp maximum:  -4294965 usecs,  00045b4b09956228 -> 00045b4b0953d8f3
new TSC-warp maximum: -4294964029 cycles, 00000000fffffb6b -> 000000000000082e
new TSC-warp maximum: -4294964047 cycles, 00000000fffffc72 -> 0000000000000923
new TSC-warp maximum: -4294965766 cycles, 00000000fffffea2 -> 000000000000049c
new TOD-warp maximum:  -4294966 usecs,  00045b4b2a116d9c -> 00045b4b29cfe466
new TSC-warp maximum: -4294966099 cycles, 00000000ffffff46 -> 00000000000003f3
new TSC-warp maximum: -4294966513 cycles, 00000000ffffff43 -> 0000000000000252

Alex 
Comment 9 Alex Shi 2008-11-09 19:54:53 UTC
still exists on rc4 
Comment 10 Alex Shi 2008-11-10 00:59:50 UTC
TSC warp exists on 2.6.27~2.6.28-rc4. but for TOD warp it just exist on 2.6.28-rcx serial kernel. 

Alex 
Comment 11 Thomas Gleixner 2008-11-10 01:07:30 UTC
Can you please provide the output of /sys/devices/system/clocksource/clocksource0/current_clocksource on 2.6.27 and 2.6.28-rc4 ?
Comment 12 Alex Shi 2008-11-10 01:43:15 UTC
all of kernel use tsc as clock source.  
Comment 13 Thomas Gleixner 2008-11-10 02:20:15 UTC
Ok, can reproduce now. Debugging
Comment 14 Thomas Gleixner 2008-11-10 15:38:10 UTC
Can you please check /var/log/messages for NTP entries ?
Comment 15 Alex Shi 2008-11-10 18:18:34 UTC
we did not use NTP on this machine. 
Comment 16 Alex Shi 2008-11-11 00:47:41 UTC
Ingo:
we have checked the code time-warp-test.c: 
...
#define rdtscll(val)					\
do {							\
	__asm__ __volatile__("cpuid; rdtsc" : "=A" (val):: "ebx","ecx","edx","memory" );	\
} while (0)
...
In the intel archatecture.  the rdtsc return its 64bits value in eax and edx, but your code just get the eax value. So we get lots of TSC warp warning: 
new TSC-warp maximum: -4294958692 cycles, 00000000fffffe22 -> 0000000000001fbe
new TSC-warp maximum: -4294959196 cycles, 00000000ffffeab5 -> 0000000000000a59
new TSC-warp maximum: -4294963588 cycles, 00000000fffff839 -> 00000000000006b5
It mabye incorrect. So we modify this code as below: 

#define rdtscll(val)                                    \
do {                                                    \
        unsigned low, high;                             \
        __asm__ __volatile__("cpuid; rdtsc" : "=a" (low), "=d" (high)); \
        val = low | (unsigned long long)(high) << 32;                   \
} while (0)
Comment 17 Thomas Gleixner 2008-11-11 00:59:36 UTC
Hmm. I was able to reproduce and debugged out that NTP was the reason for the
wreckage. For some yet unknown reason NTP was not able to use the pre 2.6.28
/var/lib/ntp/drift value. I stopped NTP and set this to 0.0 manually, restarted
NTP and it did not happen again in 9 hrs.

According to your time-warp-test output your time is off by > 4 seconds. 

Can you verify that nothing calls settimeofday() ?
Comment 18 Thomas Gleixner 2008-11-11 01:02:40 UTC
(In reply to comment #16)

changing the delta calculation in test_TSC()

-       delta = t1-t0;
+       delta = (t1-t0) & 0xffffffff;

works as well :)
Comment 19 Alex Shi 2008-11-11 01:30:37 UTC
OK, I will open oprofile to check the settimeofday, if the issue reproduced. 
I try to reproduce the TOD warp issue on my machine for 2.6.28-rc4. somtime it is not easy to reproduce. 
it seems the gettimeofday has a nature issue in theory before copy_to_user. Another gettimeofday system call can happen between do_gettimeofday and copy_to_user. 

asmlinkage long sys_gettimeofday(struct timeval __user *tv,
                                 struct timezone __user *tz)
{
        if (likely(tv != NULL)) {
                struct timeval ktv;
                do_gettimeofday(&ktv);
                if (copy_to_user(tv, &ktv, sizeof(ktv)))
                        return -EFAULT;
     
Comment 20 Thomas Gleixner 2008-11-11 01:55:46 UTC
well, that does not matter with Ingos time-warp-test
        lock(&SHARED(LOCK));                                                                                                                                             
rdtod(T1);                                                                                                                                                       
T0 = SHARED_LL(TOD);
SHARED_LL(TOD) = T1;
SHARED_LL(NR_TOD_LOOPS)++;                                                                                                                                       
unlock(&SHARED(LOCK));

We have the lock around the rdtod and the update of the shared data structure.
With this test we should never ever be able to see time going backwards with gettimeofday().

In an unlocked scenario there is no guarantee aside of one:
A single task can never observe time going backwards even if it migrates from one CPU to another.
Comment 21 Alex Shi 2008-11-11 22:34:22 UTC
Thomas:
you are reasonable...
But I can not reproduce it from yesterday on my 3 kind of machines whenever with mmbench or time-warp-test. 
but however, I can sure there is no one call settimeofday etc systemcall to reset time during testing.  

I am going to retest more in this night to see whether it appears. 
Comment 22 Alex Shi 2008-11-14 00:22:55 UTC
I have done some bisect for this issue, and found it was introduced by the commit: 
99ebcf8285df28f32fd2d1c19a7166e70f00309c 
and sys_gettimeofday was changed from this commit. the last kerenl also has this issue. 
Comment 23 Rafael J. Wysocki 2008-11-16 09:42:50 UTC
Handled-By : Thomas Gleixner <tglx@linutronix.de>

First-Bad-Commit : 99ebcf8285df28f32fd2d1c19a7166e70f00309c
Comment 24 Thomas Gleixner 2008-11-17 14:17:07 UTC
> ------- Comment #22 from alex.shi@intel.com  2008-11-14 00:22 -------
> I have done some bisect for this issue, and found it was introduced by the
> commit: 
> 99ebcf8285df28f32fd2d1c19a7166e70f00309c 
> and sys_gettimeofday was changed from this commit. the last kerenl also has
> this issue. 

Hmm, this is not very helpful, as this is a merge commit. The code in
kernel/time/timekeeping.c was changed by more than one commit.

Thanks,
	tglx
Comment 25 Alex Shi 2008-11-17 17:51:52 UTC
I did some modifcation of time-warp-test and find the TOD get a old time while error happen. 

@@ -240,7 +241,7 @@
 static inline void test_TOD(unsigned long *shared)
 {
 #if TEST_TOD
-       usecs_t T0, T1;
+       static usecs_t T0,T00, T1 ;
        long long delta;

        lock(&SHARED(LOCK));
@@ -256,10 +257,12 @@
                SHARED(NR_TOD_WARPS)++;
                if (delta < SHARED_LL(WORST_TOD)) {
                        SHARED_LL(WORST_TOD) = delta;
-                       fprintf(stderr, "\rnew TOD-warp maximum: %9Ld usecs,  %016Lx -> %016Lx\n",
-                               delta, T0, T1);
+                       fprintf(stderr, "\rnew TOD-warp maximum: %9Ld usecs,  %016Lx -> %016Lx, T00 is %016Lx\n",
+                               delta, T0, T1, T00);
                }
                unlock(&SHARED(LOCK));
+       }else{
+       T00=T0;
        }
it is output:
new TOD-warp maximum:  -4294940 usecs,  00045bec66e17038 -> 00045bec669fe71c, T00 is 00045bec66e17030

another message, 
Comment 26 Alex Shi 2008-11-17 17:58:43 UTC
A nother message is when I close the print_status in time-warp-test, I tried several times, do not see the warp happen. 
Comment 27 Alex Shi 2008-11-19 08:46:13 UTC
I have added a static variable: oldns to check the xtime variable in timekeep.c: 
        static s64 oldns;

        do {
                seq = read_seqbegin(&xtime_lock);

                *ts = xtime;
                newns =  timespec_to_ns(ts);

                if((oldns - newns) > 0 ){
                        printk(KERN_ERR "Alex \n **** \n find error in %s oldns is %016llx, newns is %016llx \n **** \n", \
                                __func__, oldns, newns);
                        dump_stack();
                }

                /* read clocksource: */
                cycle_now = clocksource_read(clock);

                /* calculate the delta since the last update_wall_time: */
                cycle_delta = (cycle_now - clock->cycle_last) & clock->mask;
                if (cycle_now < clock->cycle_last){
                        printk(KERN_ERR "Alex \n **** \n find error in %s cycle_now %016llx, clock->cycle_last %016llx \n **** \n", \
                                __func__, cycle_now, clock->cycle_last);
                        dump_stack();
                }
                /* convert to nanoseconds: */
                nsecs = cyc2ns(clock, cycle_delta);

        } while (read_seqretry(&xtime_lock, seq));

        oldns = newns;


and find some output. not clear it is useful:

 ****
 find error in getnstimeofday oldns is 11077b6db72423f1, newns is 11077b6db72423f0
 ****
Pid: 0, comm: swapper Not tainted 2.6.28-rc5 #2
Call Trace:
 <IRQ>  [<ffffffff80211588>] read_tsc+0x9/0x1c
 [<ffffffff8024a938>] getnstimeofday+0x73/0x129
 [<ffffffff80247cb5>] ktime_get_ts+0x22/0x4b
 [<ffffffff80247e23>] ktime_get+0xc/0x41
 [<ffffffff80248faf>] sched_clock_tick+0x54/0x8a
 [<ffffffff8023c2e4>] run_timer_softirq+0x21/0x193
 [<ffffffff80231d72>] scheduler_tick+0x3b/0xf3
 [<ffffffff8023c8f3>] update_process_times+0x44/0x50
 [<ffffffff8024cd50>] tick_periodic+0x68/0x73
 [<ffffffff8024cd73>] tick_handle_periodic+0x18/0x51
 [<ffffffff8021d18b>] smp_apic_timer_interrupt+0x8e/0xac
 [<ffffffff8020be1b>] apic_timer_interrupt+0x6b/0x70
 <EOI>  [<ffffffff80212242>] mwait_idle+0x41/0x44
 [<ffffffff8020a08a>] cpu_idle+0x40/0x5e
 find error in getnstimeofday oldns is 11077b7793fe7106, newns is 11077b7793c16804
 ****
Pid: 3528, comm: time-warp-test Not tainted 2.6.28-rc5 #2
Call Trace:
 [<ffffffff8024a938>] getnstimeofday+0x73/0x129
 [<ffffffff80247cb5>] ktime_get_ts+0x22/0x4b
 [<ffffffff8024432c>] posix_ktime_get_ts+0xc/0x10
 [<ffffffff8020b117>] system_call_after_swapgs+0x17/0x65
 [<ffffffff80244acf>] sys_clock_gettime+0x3c/0x8d
 [<ffffffff8020b17b>] system_call_fastpath+0x16/0x1b
Comment 28 Alex Shi 2008-11-19 09:11:39 UTC
oh, I guess the comments #27 explaination is:
when one getnstimeofday running, another getnstimeoday use a newer xtime, so the oldns in second getnstimeofday was updated to a newer than the first one. 
Does it mean gettimeofday return a old data? seems not. 

oh, I should know clear about the read lock for xtime_lock first 
Comment 29 Thomas Gleixner 2008-11-19 14:04:57 UTC
> 
>         oldns = newns;
> 
> 
> and find some output. not clear it is useful:

That's not useful at all as nothing protects oldns neither from cross
cpu access nor from preemption.

Thanks,

	tglx
Comment 30 Alex Shi 2008-11-19 21:21:11 UTC
Tglx: 

> [<ffffffff8024a938>] getnstimeofday+0x73/0x129
> [<ffffffff80247cb5>] ktime_get_ts+0x22/0x4b
> [<ffffffff80247e23>] ktime_get+0xc/0x41
> [<ffffffff80248faf>] sched_clock_tick+0x54/0x8a
> [<ffffffff8023c2e4>] run_timer_softirq+0x21/0x193
> [<ffffffff80231d72>] scheduler_tick+0x3b/0xf3

From above stack, getnstimeofday seems can be called in a irq process. so we need use read_seqbegin_irqsave to instead of read_seqbegin in getnstimeofday? I have tried it seems the error frequency is lower than before. 

I will tested again on 2.6.27 because 2.6.27 also does not use irqsave mode seqlock. 


BRGS
Alex 
Comment 31 Yanmin Zhang 2008-11-19 21:56:36 UTC
(In reply to comment #30)
> Tglx: 
> 
> > [<ffffffff8024a938>] getnstimeofday+0x73/0x129
> > [<ffffffff80247cb5>] ktime_get_ts+0x22/0x4b
> > [<ffffffff80247e23>] ktime_get+0xc/0x41
> > [<ffffffff80248faf>] sched_clock_tick+0x54/0x8a
> > [<ffffffff8023c2e4>] run_timer_softirq+0x21/0x193
> > [<ffffffff80231d72>] scheduler_tick+0x3b/0xf3
> 
> From above stack, getnstimeofday seems can be called in a irq process. so we
> need use read_seqbegin_irqsave to instead of read_seqbegin in getnstimeofday?
It doesn't matter as getnstimeofday just reads data instead of changing data


 I
> have tried it seems the error frequency is lower than before. 
> 
> I will tested again on 2.6.27 because 2.6.27 also does not use irqsave mode
> seqlock. 
> 
> 
> BRGS
> Alex 
> 
Comment 32 Thomas Gleixner 2008-11-20 08:46:06 UTC
> Tglx: 
> 
> > [<ffffffff8024a938>] getnstimeofday+0x73/0x129
> > [<ffffffff80247cb5>] ktime_get_ts+0x22/0x4b
> > [<ffffffff80247e23>] ktime_get+0xc/0x41
> > [<ffffffff80248faf>] sched_clock_tick+0x54/0x8a
> > [<ffffffff8023c2e4>] run_timer_softirq+0x21/0x193
> > [<ffffffff80231d72>] scheduler_tick+0x3b/0xf3
> 
> >From above stack, getnstimeofday seems can be called in a irq process. so we
> need use read_seqbegin_irqsave to instead of read_seqbegin in getnstimeofday?
> I
> have tried it seems the error frequency is lower than before. 
> 
> I will tested again on 2.6.27 because 2.6.27 also does not use irqsave mode
> seqlock. 

No. That's wrong. There is no reason to use irqsave for that. The only
place where we need irqs disabled is when we write lock xtime lock.
Comment 33 Alex Shi 2008-11-20 23:16:28 UTC
still exists on rc6 on my 2 machines. 

Below is my reproduce scripts which named "repd.sh" and locate at /home/alexs/. if you want to use it, pls modify its absolute location in script. hope it is helpful. 

====
#!/bin/bash
#
#usage: /home/alexs/repd.sh 8 
# 
timesf=/tmp/times

#check reboot times in testing
sed -i "/repd.sh/d" /etc/rc.local
if [ ! -z "$1" ]; then 
        echo $1 >$timesf
        times=$1
else 
        times=`cat $timesf`
fi 

#do testing untill timeout. $to is timeout seconds. 
to=300
if [ $times -gt 0 ];then 
        ((times--))
        echo $times > $timesf
        sync 

        tmpf=/tmp/`date +%s`
        /home/alexs/time-warp-test  2> $tmpf & 

        for((i=0; i<$to; i++))
        do
                sleep 1;
                if grep TOD-warp $tmpf -q ;then 
                        echo "find TOD warp in this kernel `uname -r`" >> $tmpf
                        sudo killall time-warp-test
                        break;
                fi
        done

        if [ $i -ge $to ];then 
                #write script into rc.local. needs a absolute script location. 
                sudo sh -c "echo /home/alexs/repd.sh >> /etc/rc.local "
                sudo reboot 
                exit 0 
        fi
fi
Comment 34 Yanmin Zhang 2008-11-20 23:50:59 UTC
I reproduced it on x86-64 machine.
Steps:
1) Start Ingo's time-warp-test;
2) Start my own time test program. It's simple and just calls gettimeofday and compare the next value to make sure no warp.

---
output after a couple of miniutes from my program:
[ymzhang@linux-b03 bug]$ ./time_goback 1000000000000000
gettimeofday goes back at loop:[564350392]. diff_usec:[-4294967] stime.tv_sec=[1227252201] stime.tv_usec=[302334]
gettimeofday goes back at loop:[602412232]. diff_usec:[-4294967] stime.tv_sec=[1227252207] stime.tv_usec=[302667]
gettimeofday goes back at loop:[621441193]. diff_usec:[-4294967] stime.tv_sec=[1227252210] stime.tv_usec=[302833]



It's interesting that the diff_usec is always -4294967. If comparing with the output posted by Alex, time-warps also says 4294967 or 4294966.

I will instrument kernel to see if someone calls settimeofday.
Comment 35 Yanmin Zhang 2008-11-20 23:54:14 UTC
Created attachment 18960 [details]
My simple time check program

Start time-warp-test, then start my time_goback. time_goback prints of many errors while time-warp-test doesn't. But if just starting time_goback, it seems no error outputed.
Comment 36 Rafael J. Wysocki 2008-11-22 13:34:18 UTC
Handled-By : Yanmin Zhang <yanmin_zhang@linux.intel.com>
Comment 37 Yanmin Zhang 2008-11-23 18:01:50 UTC
Created attachment 18996 [details]
New time_goback

The old one has a bug which causes coredump when there is no parameter.
Comment 38 Alex Shi 2008-11-23 18:43:02 UTC
In the update_wall_time function, I found if the 538 line was commented in the following code. the problem disappears in my testing with 8 times reboot, each of reboot running 300 seconds time-warp-test.  seems the 531 line: clocksource_adjust(offset) do a incorrect xtime_nsec modifiction. 
that line was added in the first-bad commits. 


502         clock->xtime_nsec = (s64)xtime.tv_nsec << clock->shift;
503 
504         /* normally this loop will run just once, however in the
505          * case of lost or late ticks, it will accumulate correctly.
506          */
507         while (offset >= clock->cycle_interval) {
508                 /* accumulate one interval */
509                 offset -= clock->cycle_interval;
510                 clock->cycle_last += clock->cycle_interval;
511 
512                 clock->xtime_nsec += clock->xtime_interval;
513                 if (clock->xtime_nsec >= (u64)NSEC_PER_SEC << clock->shift) {
514                         clock->xtime_nsec -= (u64)NSEC_PER_SEC << clock->shift;
515                         xtime.tv_sec++;
516                         second_overflow();
517                 }
518 
519                 clock->raw_time.tv_nsec += clock->raw_interval;
520                 if (clock->raw_time.tv_nsec >= NSEC_PER_SEC) {
521                         clock->raw_time.tv_nsec -= NSEC_PER_SEC;
522                         clock->raw_time.tv_sec++;
523                 }
524 
525                 /* accumulate error between NTP and clock interval */
526                 clock->error += tick_length;
527                 clock->error -= clock->xtime_interval << (NTP_SCALE_SHIFT - clock->shift);
528         }
529 
530         /* correct the clock when NTP error is too big */
531         clocksource_adjust(offset);
532 
533         /* store full nanoseconds into xtime after rounding it up and
534          * add the remainder to the error difference.
535          */
536         xtime.tv_nsec = ((s64)clock->xtime_nsec >> clock->shift) + 1;
537         clock->xtime_nsec -= (s64)xtime.tv_nsec << clock->shift;
538         clock->error += clock->xtime_nsec << (NTP_SCALE_SHIFT - clock->shift);
539 
Comment 39 john stultz 2008-11-24 11:56:35 UTC
Created attachment 19006 [details]
Committed fix for timekeeping rounding issues

Does reverting this patch resolve the issue?
Comment 40 john stultz 2008-11-24 14:52:51 UTC
Created attachment 19008 [details]
Another patch to try

If the reverting the previous patch didn't work, could you try applying this one?
Comment 41 Alex Shi 2008-11-24 18:08:27 UTC
I had tried 3 scenarios:
1, revert the patch 
2, adding your patch
3, both of 1,2 
and found in all of situations. the warp still exists. 
Comment 42 Yanmin Zhang 2008-11-25 01:01:46 UTC
Well, I think it's almost root casued.
I instrumented kernel.
1) The bug is triggered when clock->mult is changed. Because it depends on cycle_interval and xtime_interval which are calculated when booting, so sometimes it can't be triggered, so we need reboot kernel often to trigger different initial
mult and cycle_interval;
2) Sometimes, xtime.tv_nsec becomes -1 in function update_wall_time after statement
xtime.tv_nsec = ((s64)clock->xtime_nsec >> clock->shift) + 1;
As xtime.tc_nsec is a kind of long, but in do_vgettimeofday, nsec is unsigned long, so the time jumps forward, later gettimeofday returns a correct time, but the test program reports a go_back issue;
3) Why does xtime.tv_nsec become -1? clocksource_adjust does it at
clock->xtime_nsec -= offset;
My capture shows offset=9596319 while the original clock->xtime_nsec=3293928. So then clock->xtime_nsec is almost equal to -1UL and xtime.tv_nsec also becomes -1.
Comment 43 Yanmin Zhang 2008-11-25 01:20:43 UTC
Perhaps just delete
clock->xtime_nsec -= offset;
from function clocksource_adjust?

I will check it tomorrow.
Comment 44 john stultz 2008-11-25 09:41:41 UTC
(In reply to comment #43)
> Perhaps just delete
> clock->xtime_nsec -= offset;
> from function clocksource_adjust?

I'd not advise this, as it will just cause more issues. Especially with users of NTP.

If this didn't happen in 2.6.27, there's only been 3 changes to the timekeeping.c file since then, so this shouldn't be hard to catch. We tried reverting one with the rounding issues patch, and that really seemed like the most likely to affect this, so I'm a bit confused.

Never the less, I'll generate patches for the other ones and we can try to revert those as well.
Comment 45 Yanmin Zhang 2008-11-25 19:02:17 UTC
(In reply to comment #44)
> (In reply to comment #43)
> > Perhaps just delete
> > clock->xtime_nsec -= offset;
> > from function clocksource_adjust?
> 
> I'd not advise this, as it will just cause more issues. Especially with users
> of NTP.
> 
> If this didn't happen in 2.6.27, there's only been 3 changes to the
> timekeeping.c file since then, so this shouldn't be hard to catch. We tried
> reverting one with the rounding issues patch, and that really seemed like the
> most likely to affect this, so I'm a bit confused.
I confirm Alex that just reverting the one of rounding issue patch doesn't fix it.

> 
> Never the less, I'll generate patches for the other ones and we can try to
> revert those as well.
> 
Comment 46 john stultz 2008-11-25 19:11:23 UTC
Created attachment 19019 [details]
clock-monotonic patch

Here's another of the 2.6.28 changes to timekeeping.c

Please revert attachment 19006 [details] (rounding fix), then this attachment.
Comment 47 john stultz 2008-11-25 19:12:45 UTC
Created attachment 19020 [details]
clocksource forward now patch

The last 2.6.28 change to timekeeping.c

Please revert attachment 19006 [details], then 19019, and finally this patch.
Comment 48 Yanmin Zhang 2008-11-25 21:08:01 UTC
(In reply to comment #47)
> Created an attachment (id=19020) [details]
> clocksource forward now patch
> 
> The last 2.6.28 change to timekeeping.c
> 
> Please revert attachment 19006 [details], then 19019, and finally this patch.
> 

I reverted the 3 patches and didn't trigger the issue after running the testing for 1 hour. My debug statement didn't run into xtime.tv_nsec=-1.
Comment 49 Yanmin Zhang 2008-11-25 22:36:41 UTC
I added a printk in do_settimeofday and didn't find the output after the testing is started. 

I checked patch mono_raw and clocksource_forward (19019&19020) manually. It seems these 2 patches don't change the logic of xtime changing.

Perhaps this issue exists __theoretically__, but latest kernel triggers it.

My debugging in update_wall_time shows:
xtime.tv_nsec:[-1] is negative. xtime_nsec01[3234962] error01[6517645312] tick_length01[17179869184000000] tick_length[17179869184000000]
old offset[18978759] offset[9378909]
old mult[1747653] shift[22] cycle_interval[9599850]xtime_interval[16777206652050] xtime_nsec[18446744073708592274]error[-3054655488]
Current mult[1747654] shift[22] cycle_interval[9599850]xtime_interval[16777216251900] xtime_nsec[18446744073703407669]error[6291401728]
old xtime tv_sec[1227679454] tv_nsec[996000003]
Current xtime tv_sec[1227679455] tv_nsec[-1]


old offset is the offset initial value in update_wall_time, and offset is the value after the while loop. xtime_nsec01 is clock->xtime_nsec before calling clocksource_adjust. Others are just old xtime&clock when entering update_wall_time, or new xtime&clock after new xtime.tv_nsec becomes -1.

So when the interrupt is sent after more than 1 tick, but less than 2 ticks, e.g., 1.999 ticks, clock->xtime_nsec might become less than offset.
Comment 50 john stultz 2008-11-25 22:50:24 UTC
(In reply to comment #48)
> (In reply to comment #47)
> > Created an attachment (id=19020) [details] [details]
> > clocksource forward now patch
> > 
> > The last 2.6.28 change to timekeeping.c
> > 
> > Please revert attachment 19006 [details] [details], then 19019, and finally
> this patch.
> > 
> 
> I reverted the 3 patches and didn't trigger the issue after running the
> testing
> for 1 hour. My debug statement didn't run into xtime.tv_nsec=-1.

Just to be sure, can you re-add attachment #19020 [details] and see if the issue reappears?
Comment 51 john stultz 2008-11-25 23:11:56 UTC
(In reply to comment #49)
> Perhaps this issue exists __theoretically__, but latest kernel triggers it.
> 
> My debugging in update_wall_time shows:

Could you attach the patch that generated this output?

thanks
-john
Comment 52 Yanmin Zhang 2008-11-25 23:20:10 UTC
Created attachment 19024 [details]
A debugging patch to instrument kernel

Apply it to 2.6.28-rc6. But We just trigger the issue on 2 machines. On other machines, we couldn't trigger it.
Comment 53 john stultz 2008-11-25 23:34:53 UTC
(In reply to comment #49)
> My debugging in update_wall_time shows:
> xtime.tv_nsec:[-1] is negative. xtime_nsec01[3234962] error01[6517645312]

Hrm.. So xtime_nsec01 is xtime_nsec before clocksource_adjust().

But xtime_nsec01 >> 22 is 0, so that's less then a nanosecond (but not zero). Since we set it to be clock->xtime_nsec = (s64)xtime.tv_nsec << clock->shift, that means we must have hit the second overflow case.

So I'm wondering if we're hitting the same rounding issue we had earlier at the nanosecond boundary (addressed by Roman's timekeeping rounding patch), but just  on the second boundary.

Its a little late, so I'm not confident there's no side effects, but I've got a patch you can try, basically deferring the second overflow till after we adjust the clocksource.

Will attach shortly.
Comment 54 john stultz 2008-11-26 00:18:55 UTC
Created attachment 19028 [details]
patch to do the second overflow accumulation after adjusting the clocksource.

Here's a patch that moves the second overflow code to after the adjust_clocksource() call. This should avoid xtime_nsec from underflowing and rolling negative. I'm hoping this resolves the issue here.

However, I'm not 100% confident this patch is right. I'm a little worried we might actually overflow xtime_nsec if the update_wall_time is deferred for too long. The proper fix may be to detect the underflow and propagate it back up to the xtime.tv_sec field. But I'm a bit tired for that right now.
Comment 55 john stultz 2008-11-26 00:39:43 UTC
Created attachment 19030 [details]
fix negative underflow

Ok, I think I have a better way to solve this (if I'm actually chasing the right issue).

One thing I was having a hard time understanding is how we could be underflowing and going negative if we always round xtime.tv_nsec up by one.

So instead of accumulating the second overflows later, we can instead just add 1 nanosecond to the xtime_nsec value before we do the shift (instead of after) and that should avoid the issue. This also avoids any problem with overflowing xtime_nsec.

Fair warning: Its a little late here, so this may be wrong. I've not had the chance to test this myself. Hopefully its not totally off base. :)
Comment 56 Alex Shi 2008-11-26 00:45:11 UTC
tried just revert clock_forward_now patch on rc6. it also has this issue. 
Comment 57 Yanmin Zhang 2008-11-26 01:02:26 UTC
(In reply to comment #53)
> (In reply to comment #49)
> > My debugging in update_wall_time shows:
> > xtime.tv_nsec:[-1] is negative. xtime_nsec01[3234962] error01[6517645312]
> 
> Hrm.. So xtime_nsec01 is xtime_nsec before clocksource_adjust().
> 
> But xtime_nsec01 >> 22 is 0, so that's less then a nanosecond (but not zero).
> Since we set it to be clock->xtime_nsec = (s64)xtime.tv_nsec << clock->shift,
> that means we must have hit the second overflow case.
> 
> So I'm wondering if we're hitting the same rounding issue we had earlier at
> the
> nanosecond boundary (addressed by Roman's timekeeping rounding patch), but
> just
>  on the second boundary.
> 
> Its a little late, so I'm not confident there's no side effects, but I've got
> a
> patch you can try, basically deferring the second overflow till after we
> adjust
> the clocksource.
I tested it. The patch does fix it, from mathematics point of view.

tick_length might be changed by second_overflow.

> 
> Will attach shortly.
> 
Comment 58 Yanmin Zhang 2008-11-26 01:04:23 UTC
(In reply to comment #56)
> tried just revert clock_forward_now patch on rc6. it also has this issue. 
> 

It's a potential issue and triggered by the latest kernel. Let's resolve it with a new patch.
Comment 59 Yanmin Zhang 2008-11-26 01:28:39 UTC
(In reply to comment #57)
> (In reply to comment #53)
> > (In reply to comment #49)
> > > My debugging in update_wall_time shows:
> > > xtime.tv_nsec:[-1] is negative. xtime_nsec01[3234962] error01[6517645312]
> > 
> > Hrm.. So xtime_nsec01 is xtime_nsec before clocksource_adjust().
> > 
> > But xtime_nsec01 >> 22 is 0, so that's less then a nanosecond (but not
> zero).
> > Since we set it to be clock->xtime_nsec = (s64)xtime.tv_nsec <<
> clock->shift,
> > that means we must have hit the second overflow case.
> > 
> > So I'm wondering if we're hitting the same rounding issue we had earlier at
> the
> > nanosecond boundary (addressed by Roman's timekeeping rounding patch), but
> just
> >  on the second boundary.
> > 
> > Its a little late, so I'm not confident there's no side effects, but I've
> got a
> > patch you can try, basically deferring the second overflow till after we
> adjust
> > the clocksource.
> I tested it. The patch does fix it, from mathematics point of view.
I am really sorry. I found I replied too early. Now, my instrumented kernel printed out:

xtime go back and diff_nsec[-2] xtime_nsec01[3338665954639872]
old offset[9599399] offset[9599399]
old mult[1747592] shift[22] cycle_interval[9600196]xtime_interval[16777225728032] xtime_nsec[18446744073706729512]error[10403569664]
Current mult[1747593] shift[22] cycle_interval[9600196]xtime_interval[16777235328228] xtime_nsec[18446744073708340825]error[9162903552]
old xtime tv_sec[1227688952] tv_nsec[795999993]
Current xtime tv_sec[1227688952] tv_nsec[795999991]


So the second check is triggered that xtime->tv_nsec gos back with 2 nsec. I know that user application never finds 2 nsec backward. But ...

> 
> tick_length might be changed by second_overflow.
> 
> > 
> > Will attach shortly.
> > 
> 
Comment 60 john stultz 2008-11-26 11:20:42 UTC
(In reply to comment #59)
> > I tested it. The patch does fix it, from mathematics point of view.
> I am really sorry. I found I replied too early. Now, my instrumented kernel
> printed out:
> 
> xtime go back and diff_nsec[-2] xtime_nsec01[3338665954639872]
> old offset[9599399] offset[9599399]
> old mult[1747592] shift[22]
> cycle_interval[9600196]xtime_interval[16777225728032]
> xtime_nsec[18446744073706729512]error[10403569664]
> Current mult[1747593] shift[22]
> cycle_interval[9600196]xtime_interval[16777235328228]
> xtime_nsec[18446744073708340825]error[9162903552]
> old xtime tv_sec[1227688952] tv_nsec[795999993]
> Current xtime tv_sec[1227688952] tv_nsec[795999991]

So what set of patches is this with? (Sorry, there have been quite a few).
 
> So the second check is triggered that xtime->tv_nsec gos back with 2 nsec. I
> know that user application never finds 2 nsec backward. But ...

This could be ok.

This part of the timekeeping code is pretty complicated. Let me try to explain.

Timekeeping is calcualted as the following:
  gtod = xtime + cyc2ns(now - cycle_last)

Where cyc2ns(x) = x*mult >> shift


So think of time as two parts: 1) accumulated, and 2) not accumulated.

{     accumulated          }{      not accumulated         }
|----------xtime------------|=== cyc2ns(now-cycle_last) ===|


When slowing or speeding up the clock (for ntp or just fine error correction), we do so by changing the mult value.  However, we want the speeding up or slowing down to only be in affect *after* we change it. In other words, before we change the mult value, and after we've changed mult value we still need to be at the same point in time.


|----------xtime------------|=== cyc2ns(now-cycle_last) ===|

So in the case above, the mult value is increased by one, which will cause cyc2ns() to return a larger value for the same now value.

|----------xtime------------|=== cyc2ns(now-cycle_last) ========|

So to keep time constant we reduce xtime by change in offset.

|----------xtime-------|=== cyc2ns(now-cycle_last) ========|


Does that make sense?
Comment 61 john stultz 2008-11-26 11:46:07 UTC
So after sleeping on the patches I sent out last night, while the first probably works, it likely causes side effects as Yanmin pointed out.

The second I'm suspecting won't solve the issue, since we can reduce xtime_nsec by more then 1 nanosecond, but it will likely make it harder to trigger.
Comment 62 john stultz 2008-11-26 11:53:53 UTC
Comment on attachment 19030 [details]
fix negative underflow

Marking this one obsolete since I now don't think it will work.
Comment 63 john stultz 2008-11-26 12:34:17 UTC
Created attachment 19034 [details]
catch nsec underflows and fix them using the error value

Ok, here's a new shot at getting this right. Since we can't undo the second overflow or move it later in the function, we can just push time forward enough to avoid the underflow. To keep things sane, we store the amount we changed xtime_nsec into the error value, which will result in corrective action the next time around.

Let me know if this works for you.
Comment 64 Yanmin Zhang 2008-11-26 16:55:09 UTC
(In reply to comment #60)
> (In reply to comment #59)
> > > I tested it. The patch does fix it, from mathematics point of view.
> > I am really sorry. I found I replied too early. Now, my instrumented kernel
> > printed out:
> > 
> > xtime go back and diff_nsec[-2] xtime_nsec01[3338665954639872]
> > old offset[9599399] offset[9599399]
> > old mult[1747592] shift[22]
> > cycle_interval[9600196]xtime_interval[16777225728032]
> > xtime_nsec[18446744073706729512]error[10403569664]
> > Current mult[1747593] shift[22]
> > cycle_interval[9600196]xtime_interval[16777235328228]
> > xtime_nsec[18446744073708340825]error[9162903552]
> > old xtime tv_sec[1227688952] tv_nsec[795999993]
> > Current xtime tv_sec[1227688952] tv_nsec[795999991]
> 
> So what set of patches is this with? (Sorry, there have been quite a few).
> 
> > So the second check is triggered that xtime->tv_nsec gos back with 2 nsec.
> I
> > know that user application never finds 2 nsec backward. But ...
> 
> This could be ok.
> 
> This part of the timekeeping code is pretty complicated. Let me try to
> explain.
> 
> Timekeeping is calcualted as the following:
>   gtod = xtime + cyc2ns(now - cycle_last)
> 
> Where cyc2ns(x) = x*mult >> shift
> 
> 
> So think of time as two parts: 1) accumulated, and 2) not accumulated.
> 
> {     accumulated          }{      not accumulated         }
> |----------xtime------------|=== cyc2ns(now-cycle_last) ===|
> 
> 
> When slowing or speeding up the clock (for ntp or just fine error
> correction),
> we do so by changing the mult value.  However, we want the speeding up or
> slowing down to only be in affect *after* we change it. In other words,
> before
> we change the mult value, and after we've changed mult value we still need to
> be at the same point in time.
> 
> 
> |----------xtime------------|=== cyc2ns(now-cycle_last) ===|
> 
> So in the case above, the mult value is increased by one, which will cause
> cyc2ns() to return a larger value for the same now value.
> 
> |----------xtime------------|=== cyc2ns(now-cycle_last) ========|
> 
> So to keep time constant we reduce xtime by change in offset.
> 
> |----------xtime-------|=== cyc2ns(now-cycle_last) ========|
> 
> 
> Does that make sense?
> 
Definitely. Thanks for your detailed explanation.

yanmin
Comment 65 Yanmin Zhang 2008-11-26 19:30:32 UTC
(In reply to comment #63)
> Created an attachment (id=19034) [details]
> catch nsec underflows and fix them using the error value
> 
> Ok, here's a new shot at getting this right. Since we can't undo the second
> overflow or move it later in the function, we can just push time forward
> enough
> to avoid the underflow. To keep things sane, we store the amount we changed
> xtime_nsec into the error value, which will result in corrective action the
> next time around.
> 
> Let me know if this works for you.
> 
I am testing it on 2 machines and didn't trigger the issue in 1 hour. In addition, I just saw once that:
xtime go back and diff_nsec[-2]
in update_wall_times.

So the patch is better.

Another new idea from my coworker:
Can we add a checking in clocksource_adjust to abort the adjust when clock->xtime_nsec < offset?

I will test the new idea.
Comment 66 Yanmin Zhang 2008-11-26 23:35:18 UTC
Created attachment 19046 [details]
Patch to fix it with a new idea

John,

Here is the patch with the new idea. It just aborts the adjust when (s64)clock->xtime_nsec < offset.

I tested it for 1 hour on 2 machines and didn't trigger the issue.

Yanmin
Comment 67 Yanmin Zhang 2008-11-27 21:44:17 UTC
John,

As the overflow happens rarely, it might be better to add a unlikely in your patch. Or add a likely in my patch.

I ran the testing on 2 machines for more than 20 hours and didn't trigger the issue.

yanmin
Comment 68 john stultz 2008-12-01 11:50:31 UTC
(In reply to comment #66)
> Created an attachment (id=19046) [details]
> Patch to fix it with a new idea
> 
> Here is the patch with the new idea. It just aborts the adjust when
> (s64)clock->xtime_nsec < offset.
> 
> I tested it for 1 hour on 2 machines and didn't trigger the issue.

Yep, that was another method I was considering, however your patch has one issue with it: offset may be made negative at that point. It would probably be better to just check and return at the very top of clocksource_adjust().

Either method should work. Although I think re-adding it to the error is more consistent to how we deal with other rounding issues. So I'll submit that to lkml today and should we get negative feedback, we'll propose a corrected variant of your patch.
Comment 69 john stultz 2008-12-01 18:35:27 UTC
I've sent my patch to lkml.
Comment 70 Yanmin Zhang 2008-12-01 18:43:03 UTC
Thanks John. Your email to LKML is very clear with detailed explanation. Hope 2.6.28-rc7 merges the patch.

Note You need to log in before you can comment on or make changes to this bug.