This loop in kernel/sched.c:touch_cache():

	for (i = 0; i < size / 6; i += 8) {
		switch (i % 6) {
			case 0: cache[i]++;		/* 1st third, forward */
			case 1: cache[size-1-i]++;	/* 3rd third, backward */
			case 2: cache[chunk1-i]++;	/* 1st third, backward */
			case 3: cache[chunk1+i]++;	/* 2nd third, forward */
			case 4: cache[chunk2-i]++;	/* 2nd third, backward */
			case 5: cache[chunk2+i]++;	/* 3rd third, forward */
		}
	}

can only touch the first two thirds of the buffer, because `i % 6' will
always be even: `i' advances in steps of 8, so it is always even, and an
even number modulo 6 can only be 0, 2 or 4. In order to visit the odd
cases of the switch, the for loop's stride should be relatively prime to
the modulus, e.g. 7.

This at least partially explains <http://lkml.org/lkml/2005/6/21/473>.
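For reference, a minimal userland sketch (mine, not part of the report;
the buffer size is an arbitrary stand-in) that tallies which `case' labels
each stride can select, ignoring the switch's fall-through for the moment:

	#include <stdio.h>

	int main(void)
	{
		unsigned long size = 4096;	/* arbitrary stand-in buffer size */
		unsigned long hits8[6] = { 0 }, hits7[6] = { 0 };
		unsigned long i, c;

		for (i = 0; i < size / 6; i += 8)	/* the original stride */
			hits8[i % 6]++;
		for (i = 0; i < size / 6; i += 7)	/* a stride coprime to 6 */
			hits7[i % 6]++;

		for (c = 0; c < 6; c++)
			printf("case %lu: stride 8 -> %3lu  stride 7 -> %3lu\n",
			       c, hits8[c], hits7[c]);
		return 0;
	}

With stride 8 the counters for cases 1, 3 and 5 stay at zero; with stride
7 all six cases are selected in round-robin order.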
* Andrew Morton <akpm@osdl.org> wrote:

> This loop in kernel/sched.c:touch_cache():
>
>	for (i = 0; i < size / 6; i += 8) {
>		switch (i % 6) {
>			case 0: cache[i]++;		/* 1st third, forward */
>			case 1: cache[size-1-i]++;	/* 3rd third, backward */
>			case 2: cache[chunk1-i]++;	/* 1st third, backward */
>			case 3: cache[chunk1+i]++;	/* 2nd third, forward */
>			case 4: cache[chunk2-i]++;	/* 2nd third, backward */
>			case 5: cache[chunk2+i]++;	/* 3rd third, forward */
>		}
>	}
>
> can only touch the first two thirds of the buffer because `i % 6' will
> always be even. In order to visit the odd cases of the switch, the
> for loop's stride should be relatively prime to the modulus, e.g. 7.
>
> This at least partially explains <http://lkml.org/lkml/2005/6/21/473>.

indeed - good catch. The patch below should fix this. Boot-tested. Also
added some comments.

	Ingo

---------------------->
Subject: [patch] sched: fix migration cost estimator
From: Ingo Molnar <mingo@elte.hu>

blubkin@vmware.com found the following bug: touch_cache() only touched
2 thirds of the cache buffer (hence wasting memory and possibly making
calibration less accurate).

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |   24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

Index: linux-rt-new.q/kernel/sched.c
===================================================================
--- linux-rt-new.q.orig/kernel/sched.c
+++ linux-rt-new.q/kernel/sched.c
@@ -5721,14 +5721,24 @@ static void touch_cache(void *__cache, u
 	unsigned long *cache = __cache;
 	int i;
 
+	/*
+	 * We access the buffer via 6 independent 'streams' of
+	 * read/write access which are interleaved:
+	 *
+	 *    [---> <---|---> <---|---> <---]
+	 *
+	 * We touch every cacheline in the buffer. (iteration
+	 * step is 32 bytes on 32-bit systems, 64 bytes on
+	 * 64-bit systems)
+	 */
 	for (i = 0; i < size/6; i += 8) {
-		switch (i % 6) {
-			case 0: cache[i]++;
-			case 1: cache[size-1-i]++;
-			case 2: cache[chunk1-i]++;
-			case 3: cache[chunk1+i]++;
-			case 4: cache[chunk2-i]++;
-			case 5: cache[chunk2+i]++;
+		switch (i % 6*8) {
+			case 0*8: cache[i]++;
+			case 1*8: cache[size-1-i]++;
+			case 2*8: cache[chunk1-i]++;
+			case 3*8: cache[chunk1+i]++;
+			case 4*8: cache[chunk2-i]++;
+			case 5*8: cache[chunk2+i]++;
 		}
 	}
 }
Created attachment 9440: Test program to demonstrate touch_cache() algorithm flaws
The proposed patch (multiplying everything by 8) doesn't change anything:
C parses `i % 6*8' as `(i % 6) * 8', so instead of comparing even numbers
to 0,1,2,3,4,5 we now compare multiples of 16 to multiples [0..5] of 8.
Half the cases still can never be matched.

But it turns out that's irrelevant, because I also didn't notice the lack
of `break;' statements in the original code. Thanks to the fall-through,
the original _does_ touch all the cache -- by mistake. I tacked on an
attachment, a trivial C program that shows how the original and the
proposed patch are conceptually flawed (when implemented with `break;').

Meanwhile, the original loop goes from 0 to size/6 in steps of 8, so it
generates only 1/48 of the possible cache indices. `cache' is declared
`unsigned long *cache' in the function, so on 32-bit this averages one
32-bit write per 192 bytes of test cache -- one in 6 cache lines, if the
comment about cache line size is correct.

Conclusion: this needs a little work to ensure it is actually doing what
the comments think it's doing...
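To make the selector arithmetic concrete, here is a hypothetical
re-creation of that demonstration (my sketch, not attachment 9440; the
`i % (6*8)' column is a guess at what the patch intended), with the
selection isolated from fall-through:

	#include <stdio.h>

	int main(void)
	{
		unsigned long size = 4096, i, c;
		unsigned long orig[6] = { 0 }, patched[6] = { 0 }, parens[6] = { 0 };

		for (i = 0; i < size / 6; i += 8) {
			orig[i % 6]++;			/* original selector */
			for (c = 0; c < 6; c++) {
				if (i % 6*8 == c*8)	/* patch: parses as (i % 6) * 8 */
					patched[c]++;
				if (i % (6*8) == c*8)	/* presumably intended form */
					parens[c]++;
			}
		}
		for (c = 0; c < 6; c++)
			printf("case %lu: orig %3lu  patched %3lu  parenthesized %3lu\n",
			       c, orig[c], patched[c], parens[c]);
		return 0;
	}

The `orig' and `patched' columns stay at zero for cases 1, 3 and 5,
confirming that the patch only rescales the comparison; the parenthesized
form reaches all six cases -- though the missing `break;' statements make
the selection largely moot in the original.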
Ok, now I see. I think this code must have been right at one point; then
the `switch' got wrapped around it and made it wrong. This loop actually
touches all the cache lines (on 32-bit):

	for (i = 0; i < size/6; i += 8) {
		cache[i]++;
		cache[size-1-i]++;
		cache[chunk1-i]++;
		cache[chunk1+i]++;
		cache[chunk2-i]++;
		cache[chunk2+i]++;
	}

That is, it writes to at least one longword within each 8 longwords --
though not always at the same offsets, since `chunk1' and `chunk2' can be
a bit off and since `cache[size - 1]' is at the end of a cache line while
`cache[0]' is at the beginning.

Kenneth W Chen's observation that:

> I'm consistently getting a smaller than expected cache migration cost
> as measured by Ingo's scheduler-cache-hot-autodetect.patch currently
> in -mm tree.

makes even more sense now. The switch version was pulling at most 1/6 of
the test memory into cache on 32-bit, and either 1/6 or 1/12 on 64-bit.
Although my initial analysis was a bunch of bollocks, I was right when I
thought I smelled something...
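As a sanity check on that claim, here is a small userland sketch of my
own (assuming 4-byte longs and 32-byte cache lines, per the kernel
comment) that marks each line the un-switched loop touches:

	#include <stdio.h>

	#define SIZE	(48 * 1024)		/* longwords; divisible by 48 */

	static unsigned char touched[SIZE / 8];	/* one flag per 8-longword line */

	static void touch(unsigned long idx)
	{
		touched[idx / 8] = 1;
	}

	int main(void)
	{
		unsigned long chunk1 = SIZE / 3, chunk2 = 2 * SIZE / 3;
		unsigned long i, covered = 0;

		for (i = 0; i < SIZE / 6; i += 8) {
			touch(i);			/* 1st third, forward */
			touch(SIZE - 1 - i);		/* 3rd third, backward */
			touch(chunk1 - i);		/* 1st third, backward */
			touch(chunk1 + i);		/* 2nd third, forward */
			touch(chunk2 - i);		/* 2nd third, backward */
			touch(chunk2 + i);		/* 3rd third, forward */
		}
		for (i = 0; i < SIZE / 8; i++)
			covered += touched[i];
		printf("lines touched: %lu of %lu\n",
		       covered, (unsigned long)(SIZE / 8));
		return 0;
	}

With this size it reports 6142 of 6144 lines touched; the two it misses
sit right at the chunk seams, which matches the `chunk1 and chunk2 can be
a bit off' caveat above.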
Queued a fix from Ingo
Created attachment 9453: Test program with new test_cache() routine

This is a more robust touch_cache() routine, embedded in a test program.
The next attachment is a small shell script that runs touch_cache()
through its paces.
Created attachment 9454: Test script for updated touch_cache()
The patch in comment #1 doesn't actually do anything about this bug. The
2nd attachment ("Test program with new test_cache() routine") contains a
touch_cache() that is robust over various test buffer sizes and cache
line lengths.
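For comparison, here is one hypothetical shape such a routine might take
(my sketch, not the code in attachment 9453): drop the switch entirely,
step one cache line at a time, and give the backward streams symmetric
bounds:

	/*
	 * Hypothetical sketch, not attachment 9453. Six interleaved
	 * streams as in the original design, minus the broken switch:
	 *
	 *    [---> <---|---> <---|---> <---]
	 *
	 * Assuming 64-byte cache lines and __size a multiple of six
	 * lines, every line gets written exactly once.
	 */
	static void touch_cache_robust(void *__cache, unsigned long __size)
	{
		unsigned long size = __size / sizeof(unsigned long);
		unsigned long step = 64 / sizeof(unsigned long);
		unsigned long chunk1 = size / 3, chunk2 = 2 * size / 3;
		unsigned long *cache = __cache;
		unsigned long i;

		for (i = 0; i < size / 6; i += step) {
			cache[i]++;			/* 1st third, forward */
			cache[size - 1 - i]++;		/* 3rd third, backward */
			cache[chunk1 - 1 - i]++;	/* 1st third, backward */
			cache[chunk1 + i]++;		/* 2nd third, forward */
			cache[chunk2 - 1 - i]++;	/* 2nd third, backward */
			cache[chunk2 + i]++;		/* 3rd third, forward */
		}
	}

The `- 1' on the backward streams is what closes the seams at chunk1 and
chunk2 that the original loop could miss; the routine in the attachment
may well differ.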
Created attachment 9790: Test program with new test_cache() routine (updated)

This has been tested more, and made adequately compatible with the kernel
environment.
Created attachment 9791: Test script for updated touch_cache()
Created attachment 9792: 2.6.19-git18 patch

NOT TESTED IN KERNEL ENVIRONMENT, only in my userland test harness.
This activity reflects email conversation with Ingo Molnar & Andrew
Morton, whom I will now also Cc on the bug. It is meant to replace the
previous "sched-fix-migration-cost-estimator.patch", which had no actual
effect on the bug.
This is still broken in the newest 2.6.22 snapshots.
This bug should now be moot, because we removed the estimator from
2.6.23. The fix might still be worth applying to 2.6.22, although it
could change the calibration results and trigger a change in behavior.
If we introduce something like this again in the future, we'll pick up
your fix.