Bug 15618
| Summary: | 2.6.18->2.6.32->2.6.33 huge regression in performance | | |
|---|---|---|---|
| Product: | Process Management | Reporter: | Anton Starikov (ant.starikov) |
| Component: | Other | Assignee: | process_other |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | high | CC: | rjw |
| Priority: | P1 | | |
| Hardware: | x86-64 | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.32 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
| Attachments: | testcase | | |
Description
Anton Starikov
2010-03-23 16:13:16 UTC
Created attachment 25659 [details]
testcase
I attach the testcase here.
Unpack it and cd into regression-testcase.
Then run it as ./RUNME NTHREADS.
The test isn't long; for 2 threads it takes about 30 seconds on a 2.4 GHz Opteron.
(switched to email. Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Tue, 23 Mar 2010 16:13:25 GMT bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=15618
>
> Summary: 2.6.18->2.6.32->2.6.33 huge regression in performance
> Product: Process Management
> Version: 2.5
> Kernel Version: 2.6.32
> Platform: All
> OS/Version: Linux
> Tree: Mainline
> Status: NEW
> Severity: high
> Priority: P1
> Component: Other
> AssignedTo: process_other@kernel-bugs.osdl.org
> ReportedBy: ant.starikov@gmail.com
> Regression: No
>
> We have benchmarked some multithreaded code here on 16-core/4-way opteron 8356
> host on number of kernels (see below) and found strange results.
> Up to 8 threads we didn't see any noticeable differences in performance, but
> starting from 9 threads performance diverges substantially. I provide here
> results for 14 threads

lolz. Catastrophic meltdown. Thanks for doing all that work - at a guess
I'd say it's mmap_sem. Perhaps with some assist from the CPU scheduler.

If you change the config to set CONFIG_RWSEM_GENERIC_SPINLOCK=n,
CONFIG_RWSEM_XCHGADD_ALGORITHM=y does it help?

Anyway, there's a testcase in bugzilla and it looks like we got us some
work to do.

> 2.6.18-164.11.1.el5 (centos)
>
> user time: ~60 sec
> sys time: ~12 sec
>
> 2.6.32.9-70.fc12.x86_64 (fedora-12)
>
> user time: ~60 sec
> sys time: ~75 sec
>
> 2.6.33-0.46.rc8.git1.fc13.x86_64 (fedora-12 + rawhide kernel)
>
> user time: ~60 sec
> sys time: ~300 sec
>
> In all three cases real time regress corresponding to giving numbers.
>
> Binary used for all three cases is exactly the same (compiled on centos).
> Setups for all three cases so identical as possible (last two - the same
> fedora-12 setup booted with different kernels).
>
> What can be reason of this regress in performance? Is it possible to tune
> something to recover performance on 2.6.18 kernel?
>
> I perf'ed on 2.6.32.9-70.fc12.x86_64 kernel
>
> report (top part only):
>
>  43.64%  dve22lts-mc  [kernel]               [k] _spin_lock_irqsave
>  32.93%  dve22lts-mc  ./dve22lts-mc          [.] DBSLLlookup_ret
>   5.37%  dve22lts-mc  ./dve22lts-mc          [.] SuperFastHash
>   3.76%  dve22lts-mc  /lib64/libc-2.11.1.so  [.] __GI_memcpy
>   2.60%  dve22lts-mc  [kernel]               [k] clear_page_c
>   1.60%  dve22lts-mc  ./dve22lts-mc          [.] index_next_dfs
>
> stat:
>
>  129875.554435  task-clock-msecs  #   10.210 CPUs
>           1883  context-switches  #    0.000 M/sec
>             17  CPU-migrations    #    0.000 M/sec
>        2695310  page-faults       #    0.021 M/sec
>   298370338040  cycles            # 2297.356 M/sec
>   130581778178  instructions      #    0.438 IPC
>    42517143751  cache-references  #  327.368 M/sec
>      101906904  cache-misses      #    0.785 M/sec
>
> callgraph (top part only):
>
>  53.09%  dve22lts-mc  [kernel]  [k] _spin_lock_irqsave
>          |
>          |--49.90%-- __down_read_trylock
>          |           down_read_trylock
>          |           do_page_fault
>          |           page_fault
>          |           |
>          |           |--99.99%-- __GI_memcpy
>          |           |          |
>          |           |          |--84.28%-- (nil)
>          |           |          |
>          |           |          |--9.78%-- 0x100000000
>          |           |          |
>          |           |           --5.94%-- 0x1
>          |            --0.01%-- [...]
>          |
>          |--49.39%-- __up_read
>          |           up_read
>          |           |
>          |           |--100.00%-- do_page_fault
>          |           |            page_fault
>          |           |            |
>          |           |            |--99.99%-- __GI_memcpy
>          |           |            |          |
>          |           |            |          |--84.18%-- (nil)
>          |           |            |          |
>          |           |            |          |--10.13%-- 0x100000000
>          |           |            |          |
>          |           |            |           --5.69%-- 0x1
>          |           |             --0.01%-- [...]
>          |            --0.00%-- [...]
>           --0.72%-- [...]
>
> On 2.6.33 I see similar picture with spin-lock plus addition of a lot of time
> spent in cgroup related kernel calls.
>
> If it is necessary, I can attach binary for tests.
On Tue, 23 Mar 2010, Ingo Molnar wrote:
>
> It shows a very brutal amount of page fault invoked mmap_sem spinning
> overhead.
Isn't this already fixed? It's the same old "x86-64 rwsemaphores are using
the shit-for-brains generic version" thing, and it's fixed by
1838ef1 x86-64, rwsem: 64-bit xadd rwsem implementation
5d0b723 x86: clean up rwsem type system
59c33fa x86-32: clean up rwsem inline asm statements
NOTE! None of those are in 2.6.33 - they were merged afterwards. But they
are in 2.6.34-rc1 (and obviously current -git). So Anton would have to
compile his own kernel to test his load.
We could mark them as stable material if the load in question is a real
load rather than just a test-case. On one of the random page-fault
benchmarks the rwsem fix was something like a 400% performance
improvement, and it was apparently visible in real life on some crazy SGI
"initialize huge heap concurrently on lots of threads" load.
Side note: the reason the spinlock sucks is because of the fair ticket
locks, it really does all the wrong things for the rwsem code. That's why
old kernels don't show it - the old unfair locks didn't show the same kind
of behavior.
Linus
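For readers who haven't stared at the rwsem code, the difference being discussed can be illustrated with a heavily simplified user-space sketch in C (my own illustration; the struct layouts, names, and slow-path stubs below are invented and this is not the kernel source). The generic implementation guards its reader count with a spinlock, so every down_read()/up_read() in the page-fault path queues on the same fair ticket lock; the xadd implementation's uncontended reader path is a single atomic add on the count word.

```c
/*
 * Simplified sketch of the two rwsem reader fast paths discussed above.
 * Illustration only -- not the kernel code; names and layout are invented,
 * init and the sleeping slow paths are omitted.
 */
#include <pthread.h>
#include <stdatomic.h>

/* Generic flavour: every reader round-trips through one spinlock
 * (standing in for the kernel's ticket spinlock), even when no writer
 * exists.  N faulting CPUs all line up on this one cache line. */
struct generic_rwsem {
	pthread_spinlock_t wait_lock;	/* pthread_spin_init() omitted */
	int activity;			/* >0: readers, -1: one writer */
};

static void generic_down_read(struct generic_rwsem *sem)
{
	pthread_spin_lock(&sem->wait_lock);	/* serializes all readers */
	sem->activity++;			/* blocking on writers omitted */
	pthread_spin_unlock(&sem->wait_lock);
}

static void generic_up_read(struct generic_rwsem *sem)
{
	pthread_spin_lock(&sem->wait_lock);
	sem->activity--;
	pthread_spin_unlock(&sem->wait_lock);
}

/* xadd flavour: the uncontended reader path is one atomic add, no lock.
 * The slow path only runs when a writer holds or waits for the sem. */
#define READ_BIAS 1L

struct xadd_rwsem {
	atomic_long count;	/* 0 free, >0 readers, <0 writer involved */
};

static void slow_down_read(struct xadd_rwsem *sem) { (void)sem; /* sleep */ }
static void slow_wake(struct xadd_rwsem *sem)      { (void)sem; /* wakeup */ }

static void xadd_down_read(struct xadd_rwsem *sem)
{
	if (atomic_fetch_add(&sem->count, READ_BIAS) < 0)
		slow_down_read(sem);		/* writer active or queued */
}

static void xadd_up_read(struct xadd_rwsem *sem)
{
	if (atomic_fetch_sub(&sem->count, READ_BIAS) < 0)
		slow_wake(sem);			/* a writer may be waiting */
}
```

With a fair ticket lock, the generic version additionally makes each faulting CPU spin until its ticket comes up, which matches the _spin_lock_irqsave time dominating the profiles above.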
On Mar 23, 2010, at 6:45 PM, Linus Torvalds wrote:
>
> On Tue, 23 Mar 2010, Ingo Molnar wrote:
>>
>> It shows a very brutal amount of page fault invoked mmap_sem spinning
>> overhead.
>
> Isn't this already fixed? It's the same old "x86-64 rwsemaphores are using
> the shit-for-brains generic version" thing, and it's fixed by
>
> 1838ef1 x86-64, rwsem: 64-bit xadd rwsem implementation
> 5d0b723 x86: clean up rwsem type system
> 59c33fa x86-32: clean up rwsem inline asm statements
>
> NOTE! None of those are in 2.6.33 - they were merged afterwards. But they
> are in 2.6.34-rc1 (and obviously current -git). So Anton would have to
> compile his own kernel to test his load.

Thanks for info, I will try it now.

> We could mark them as stable material if the load in question is a real
> load rather than just a test-case. On one of the random page-fault
> benchmarks the rwsem fix was something like a 400% performance
> improvement, and it was apparently visible in real life on some crazy SGI
> "initialize huge heap concurrently on lots of threads" load.

It is not just a test-case, it is real-life code. With real-life problems on
2.6.32 and later :)

Anton.

On Mar 23, 2010, at 7:00 PM, Ingo Molnar wrote:
>> NOTE! None of those are in 2.6.33 - they were merged afterwards. But they
>> are in 2.6.34-rc1 (and obviously current -git). So Anton would have to
>> compile his own kernel to test his load.
>
> another option is to run the rawhide kernel via something like:
>
> yum update --enablerepo=development kernel
>
> this will give kernel-2.6.34-0.13.rc1.git1.fc14.x86_64, which has those
> changes included.
I will apply this commits to 2.6.32, I afraid current OFED (which I need also) will not work on 2.6.33+.
Anton.
On Tue, 23 Mar 2010 18:34:09 +0100 Ingo Molnar <mingo@elte.hu> wrote:
>
> It shows a very brutal amount of page fault invoked mmap_sem spinning
> overhead.
>

Yes. Note that we fall off a cliff at nine threads on a 16-way. As soon as
a core gets two threads scheduled onto it?

Probably triggered by an MM change, possibly triggered by a sched change
which tickled a preexisting MM shortcoming. Who knows.

Anton, we have an executable binary in the bugzilla report but it would be
nice to also have at least a description of what that code is actually
doing. A quick strace shows quite a lot of mprotect activity. A pseudo-code
walkthrough, perhaps?

Thanks.

On Mar 23, 2010, at 7:13 PM, Andrew Morton wrote:
> Anton, we have an executable binary in the bugzilla report but it would
> be nice to also have at least a description of what that code is
> actually doing. A quick strace shows quite a lot of mprotect activity.
> A pseudo-code walkthrough, perhaps?

Right now can't say too much about the code (we just gave a chance to
neighbor group to run their code on our cluster, so I'm totally unfriendly
with this code). I will forward your question to them. But probably right
now you can get more information (including sources) here:
http://fmt.cs.utwente.nl/tools/ltsmin/

Anton

On Tue, 23 Mar 2010 19:03:36 +0100 Anton Starikov <ant.starikov@gmail.com> wrote:
>
> On Mar 23, 2010, at 7:00 PM, Ingo Molnar wrote:
> >> NOTE! None of those are in 2.6.33 - they were merged afterwards. But they
> >> are in 2.6.34-rc1 (and obviously current -git). So Anton would have to
> >> compile his own kernel to test his load.
> >
> > another option is to run the rawhide kernel via something like:
> >
> > yum update --enablerepo=development kernel
> >
> > this will give kernel-2.6.34-0.13.rc1.git1.fc14.x86_64, which has those
> > changes included.
>
> I will apply this commits to 2.6.32, I afraid current OFED (which I need
> also) will not work on 2.6.33+.
>

You should be able to simply set CONFIG_RWSEM_GENERIC_SPINLOCK=n,
CONFIG_RWSEM_XCHGADD_ALGORITHM=y by hand, as I mentioned earlier?

On Mar 23, 2010, at 7:21 PM, Andrew Morton wrote:
>> I will apply this commits to 2.6.32, I afraid current OFED (which I need
>> also) will not work on 2.6.33+.
>>
>
> You should be able to simply set CONFIG_RWSEM_GENERIC_SPINLOCK=n,
> CONFIG_RWSEM_XCHGADD_ALGORITHM=y by hand, as I mentioned earlier?
Hm. I tried, but when I do "make oldconfig", then it gets rewritten, so I assume that it conflicts with some other setting from default fedora kernel config. trying to figure out which one exactly.
Anton.
* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 23 Mar 2010 18:34:09 +0100
> Ingo Molnar <mingo@elte.hu> wrote:
>
> > It shows a very brutal amount of page fault invoked mmap_sem spinning
> > overhead.
>
> Yes. Note that we fall off a cliff at nine threads on a 16-way. As soon as
> a core gets two threads scheduled onto it?

it's AMD Opterons so no SMT. My (wild) guess would be that 8 cpus can still
do cacheline ping-pong reasonably efficiently, but it starts breaking down
very seriously with 9 or more cores bouncing the same single cache-line.

Breakdowns in scalability are usually very non-linear, for hardware and
software reasons. '8 threads' sounds like a hw limit to me. From the
scheduler POV there's no big difference between 8 or 9 CPUs used [this is
non-HT] - with 8 or 7 cores still idle.

Ingo

* Andrew Morton <akpm@linux-foundation.org> wrote:

> lolz. Catastrophic meltdown. Thanks for doing all that work - at a guess
> I'd say it's mmap_sem. [...]

Looks like we dont need to guess, just look at the call graph profile
(a'ka the smoking gun):

> > I perf'ed on 2.6.32.9-70.fc12.x86_64 kernel
> >
> > [...]
> >
> > callgraph (top part only):
> >
> >  53.09%  dve22lts-mc  [kernel]  [k] _spin_lock_irqsave
> >          |
> >          |--49.90%-- __down_read_trylock
> >          |           down_read_trylock
> >          |           do_page_fault
> >          |           page_fault
> >          |           |
> >          |           |--99.99%-- __GI_memcpy
> >          |           |          |
> >          |           |          |--84.28%-- (nil)
> >          |           |          |
> >          |           |          |--9.78%-- 0x100000000
> >          |           |          |
> >          |           |           --5.94%-- 0x1
> >          |            --0.01%-- [...]
> >          |
> >          |--49.39%-- __up_read
> >          |           up_read
> >          |           |
> >          |           |--100.00%-- do_page_fault
> >          |           |            page_fault
> >          |           |            |
> >          |           |            |--99.99%-- __GI_memcpy
> >          |           |            |          |
> >          |           |            |          |--84.18%-- (nil)
> >          |           |            |          |
> >          |           |            |          |--10.13%-- 0x100000000
> >          |           |            |          |
> >          |           |            |           --5.69%-- 0x1
> >          |           |             --0.01%-- [...]

It shows a very brutal amount of page fault invoked mmap_sem spinning
overhead.

> Perhaps with some assist from the CPU scheduler.

Doesnt look like it, the perf stat numbers show that the scheduler is only
very lightly involved:

> >  129875.554435  task-clock-msecs  # 10.210 CPUs
> >            1883  context-switches  #  0.000 M/sec

a context switch only every ~68 milliseconds.

Ingo

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 23 Mar 2010, Ingo Molnar wrote:
> >
> > It shows a very brutal amount of page fault invoked mmap_sem spinning
> > overhead.
>
> Isn't this already fixed? It's the same old "x86-64 rwsemaphores are using
> the shit-for-brains generic version" thing, and it's fixed by
>
> 1838ef1 x86-64, rwsem: 64-bit xadd rwsem implementation
> 5d0b723 x86: clean up rwsem type system
> 59c33fa x86-32: clean up rwsem inline asm statements

Ah, indeed!

> NOTE! None of those are in 2.6.33 - they were merged afterwards. But they
> are in 2.6.34-rc1 (and obviously current -git). So Anton would have to
> compile his own kernel to test his load.

another option is to run the rawhide kernel via something like:

  yum update --enablerepo=development kernel

this will give kernel-2.6.34-0.13.rc1.git1.fc14.x86_64, which has those
changes included.

OTOH that kernel has debugging [lockdep] enabled so it might not be
comparable.

> We could mark them as stable material if the load in question is a real load
> rather than just a test-case. On one of the random page-fault benchmarks the
> rwsem fix was something like a 400% performance improvement, and it was
> apparently visible in real life on some crazy SGI "initialize huge heap
> concurrently on lots of threads" load.
>
> Side note: the reason the spinlock sucks is because of the fair ticket
> locks, it really does all the wrong things for the rwsem code. That's why
> old kernels don't show it - the old unfair locks didn't show the same kind
> of behavior.

Yeah.

Ingo
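Ingo's cacheline ping-pong point is easy to reproduce from user space. Here is a small toy program (my own illustration, not code from this thread): in "shared" mode every thread hammers one atomic counter, so its cache line bounces between cores; in "padded" mode each thread owns its own line. Timing the two at increasing thread counts typically shows the sharply non-linear breakdown he describes rather than a gentle slope.

```c
/* Toy cacheline ping-pong demo (illustrative only).
 * Build: gcc -O2 -pthread pingpong.c -o pingpong
 * Run:   time ./pingpong <nthreads> shared
 *        time ./pingpong <nthreads> padded
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ITERS		10000000L
#define MAX_THREADS	64

/* One counter per thread, each padded out to its own 64-byte cache line. */
struct padded_ctr { atomic_long v; char pad[64 - sizeof(atomic_long)]; };

static atomic_long shared_ctr;			/* everybody bounces this line */
static struct padded_ctr per_thread[MAX_THREADS];
static int use_shared = 1;

static void *worker(void *arg)
{
	long id = (long)arg;

	for (long i = 0; i < ITERS; i++) {
		if (use_shared)
			atomic_fetch_add(&shared_ctr, 1);	/* line ping-pongs */
		else
			atomic_fetch_add(&per_thread[id].v, 1);	/* stays local */
	}
	return NULL;
}

int main(int argc, char **argv)
{
	int nthreads = (argc > 1) ? atoi(argv[1]) : 4;
	pthread_t tid[MAX_THREADS];

	if (argc > 2 && strcmp(argv[2], "padded") == 0)
		use_shared = 0;
	if (nthreads < 1 || nthreads > MAX_THREADS)
		nthreads = 4;

	for (long i = 0; i < nthreads; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);
	for (int i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);

	printf("%d threads, %s counters done\n",
	       nthreads, use_shared ? "shared" : "padded");
	return 0;
}
```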
On Mar 23, 2010, at 6:45 PM, Linus Torvalds wrote:
>
>
> On Tue, 23 Mar 2010, Ingo Molnar wrote:
>>
>> It shows a very brutal amount of page fault invoked mmap_sem spinning
>> overhead.
>
> Isn't this already fixed? It's the same old "x86-64 rwsemaphores are using
> the shit-for-brains generic version" thing, and it's fixed by
>
> 1838ef1 x86-64, rwsem: 64-bit xadd rwsem implementation
> 5d0b723 x86: clean up rwsem type system
> 59c33fa x86-32: clean up rwsem inline asm statements
>
> NOTE! None of those are in 2.6.33 - they were merged afterwards. But they
> are in 2.6.34-rc1 (and obviously current -git). So Anton would have to
> compile his own kernel to test his load.
Applied mentioned patches. Things didn't improve too much.
before:
prog: Total exploration time 9.880 real 60.620 user 76.970 sys
after:
prog: Total exploration time 9.020 real 59.430 user 66.190 sys
perf report:
38.58% prog [kernel] [k] _spin_lock_irqsave
37.42% prog ./prog [.] DBSLLlookup_ret
6.22% prog ./prog [.] SuperFastHash
3.65% prog /lib64/libc-2.11.1.so [.] __GI_memcpy
2.09% prog ./anderson.6.dve2C [.] get_successors
1.75% prog [kernel] [k] clear_page_c
1.73% prog ./prog [.] index_next_dfs
0.71% prog [kernel] [k] handle_mm_fault
0.38% prog ./prog [.] cb_hook
0.33% prog ./prog [.] get_local
0.32% prog [kernel] [k] page_fault
Anton.
Reply-To: peterz@infradead.org

On Tue, 2010-03-23 at 20:14 +0100, Anton Starikov wrote:
> On Mar 23, 2010, at 6:45 PM, Linus Torvalds wrote:
> >
> > On Tue, 23 Mar 2010, Ingo Molnar wrote:
> >>
> >> It shows a very brutal amount of page fault invoked mmap_sem spinning
> >> overhead.
> >
> > Isn't this already fixed? It's the same old "x86-64 rwsemaphores are using
> > the shit-for-brains generic version" thing, and it's fixed by
> >
> > 1838ef1 x86-64, rwsem: 64-bit xadd rwsem implementation
> > 5d0b723 x86: clean up rwsem type system
> > 59c33fa x86-32: clean up rwsem inline asm statements
> >
> > NOTE! None of those are in 2.6.33 - they were merged afterwards. But they
> > are in 2.6.34-rc1 (and obviously current -git). So Anton would have to
> > compile his own kernel to test his load.
>
> Applied mentioned patches. Things didn't improve too much.
>
> before:
> prog: Total exploration time 9.880 real 60.620 user 76.970 sys
>
> after:
> prog: Total exploration time 9.020 real 59.430 user 66.190 sys
>
> perf report:
>
>  38.58%  prog  [kernel]               [k] _spin_lock_irqsave
>  37.42%  prog  ./prog                 [.] DBSLLlookup_ret
>   6.22%  prog  ./prog                 [.] SuperFastHash
>   3.65%  prog  /lib64/libc-2.11.1.so  [.] __GI_memcpy
>   2.09%  prog  ./anderson.6.dve2C     [.] get_successors
>   1.75%  prog  [kernel]               [k] clear_page_c
>   1.73%  prog  ./prog                 [.] index_next_dfs
>   0.71%  prog  [kernel]               [k] handle_mm_fault
>   0.38%  prog  ./prog                 [.] cb_hook
>   0.33%  prog  ./prog                 [.] get_local
>   0.32%  prog  [kernel]               [k] page_fault

Could you verify with a callgraph profile what that spin_lock_irqsave() is?

If those rwsem patches were successful mmap_sem should no longer have a
spinlock to contend on, in which case it might be another lock. If not,
something went wrong with backporting those patches.

On Mar 23, 2010, at 8:22 PM, Robin Holt wrote:
> On Tue, Mar 23, 2010 at 07:25:43PM +0100, Anton Starikov wrote:
>> On Mar 23, 2010, at 7:21 PM, Andrew Morton wrote:
>>>> I will apply this commits to 2.6.32, I afraid current OFED (which I need
>>>> also) will not work on 2.6.33+.
>>>>
>>>
>>> You should be able to simply set CONFIG_RWSEM_GENERIC_SPINLOCK=n,
>>> CONFIG_RWSEM_XCHGADD_ALGORITHM=y by hand, as I mentioned earlier?
>>
>> Hm. I tried, but when I do "make oldconfig", then it gets rewritten, so I
>> assume that it conflicts with some other setting from default fedora kernel
>> config. trying to figure out which one exactly.
>
> Have you tracked this down yet? I just got the patches applied against
> an older kernel and am running into the same issue.
I decided to not track down this issue and just applied patches. I understood that with this patches there is no need to change this config options. Am I wrong?
Anton
I attach the callgraph here. Also I checked the kernel source; the actual
code that was compiled is exactly what it should be after the patches.
Am I missing something?

On Mar 23, 2010, at 8:22 PM, Robin Holt wrote:
> On Tue, Mar 23, 2010 at 07:25:43PM +0100, Anton Starikov wrote:
>> On Mar 23, 2010, at 7:21 PM, Andrew Morton wrote:
>>>> I will apply this commits to 2.6.32, I afraid current OFED (which I need
>>>> also) will not work on 2.6.33+.
>>>>
>>>
>>> You should be able to simply set CONFIG_RWSEM_GENERIC_SPINLOCK=n,
>>> CONFIG_RWSEM_XCHGADD_ALGORITHM=y by hand, as I mentioned earlier?
>>
>> Hm. I tried, but when I do "make oldconfig", then it gets rewritten, so I
>> assume that it conflicts with some other setting from default fedora kernel
>> config. trying to figure out which one exactly.
>
> Have you tracked this down yet? I just got the patches applied against
> an older kernel and am running into the same issue.
I think you can prevent these options from being overwritten if you set them in arch/x86/configs/x86_64_defconfig.
Anton
On Tue, 23 Mar 2010, Andrew Morton wrote:
>
> You should be able to simply set CONFIG_RWSEM_GENERIC_SPINLOCK=n,
> CONFIG_RWSEM_XCHGADD_ALGORITHM=y by hand, as I mentioned earlier?
No. Doesn't work. The XADD code simply never worked on x86-64, which is
why those three commits I pointed at are required.
Oh, and you need one more commit (at least) in addition to the three I
already mentioned - the one that actually adds the x86-64 wrappers and
Kconfig option:
bafaecd x86-64: support native xadd rwsem implementation
so the minimal list of commits (on top of 2.6.33) is at least
59c33fa x86-32: clean up rwsem inline asm statements
5d0b723 x86: clean up rwsem type system
bafaecd x86-64: support native xadd rwsem implementation
1838ef1 x86-64, rwsem: 64-bit xadd rwsem implementation
and I just verified that they at least cherry-pick cleanly (in that
order). I _think_ it would be good to also do
0d1622d x86-64, rwsem: Avoid store forwarding hazard in __downgrade_write
but that one is a small detail, not anything fundamentally important.
Linus
On Tue, Mar 23, 2010 at 02:49:59PM -0500, Robin Holt wrote:
> On Tue, Mar 23, 2010 at 08:30:19PM +0100, Anton Starikov wrote:
> >
> > On Mar 23, 2010, at 8:22 PM, Robin Holt wrote:
> >
> > > On Tue, Mar 23, 2010 at 07:25:43PM +0100, Anton Starikov wrote:
> > >> On Mar 23, 2010, at 7:21 PM, Andrew Morton wrote:
> > >>>> I will apply this commits to 2.6.32, I afraid current OFED (which I
> need also) will not work on 2.6.33+.
> > >>>>
> > >>>
> > >>> You should be able to simply set CONFIG_RWSEM_GENERIC_SPINLOCK=n,
> > >>> CONFIG_RWSEM_XCHGADD_ALGORITHM=y by hand, as I mentioned earlier?
> > >>
> > >> Hm. I tried, but when I do "make oldconfig", then it gets rewritten, so
> I assume that it conflicts with some other setting from default fedora kernel
> config. trying to figure out which one exactly.
> > >
> > > Have you tracked this down yet? I just got the patches applied against
> > > an older kernel and am running into the same issue.
> >
> > I decided to not track down this issue and just applied patches. I
> understood that with this patches there is no need to change this config
> options. Am I wrong?
>
> We might need to also apply:
> bafaecd11df15ad5b1e598adc7736afcd38ee13d
For the record, these are the patches I have applied to a 2.6.32 kernel from a vendor:
59c33fa7791e9948ba467c2b83e307a0d087ab49
5d0b7235d83eefdafda300656e97d368afcafc9a
1838ef1d782f7527e6defe87e180598622d2d071
0d1622d7f526311d87d7da2ee7dd14b73e45d3fc
bafaecd11df15ad5b1e598adc7736afcd38ee13d
A quick look at the disassembly makes it look like we are using the
rwsem_64, et al.
Robin
On Tue, 23 Mar 2010, Anton Starikov wrote:
>
> On Mar 23, 2010, at 6:45 PM, Linus Torvalds wrote:
>
> >
> >
> > On Tue, 23 Mar 2010, Ingo Molnar wrote:
> >>
> >> It shows a very brutal amount of page fault invoked mmap_sem spinning
> >> overhead.
> >
> > Isn't this already fixed? It's the same old "x86-64 rwsemaphores are using
> > the shit-for-brains generic version" thing, and it's fixed by
> >
> > 1838ef1 x86-64, rwsem: 64-bit xadd rwsem implementation
> > 5d0b723 x86: clean up rwsem type system
> > 59c33fa x86-32: clean up rwsem inline asm statements
> >
> > NOTE! None of those are in 2.6.33 - they were merged afterwards. But they
> > are in 2.6.34-rc1 (and obviously current -git). So Anton would have to
> > compile his own kernel to test his load.
>
>
> Applied mentioned patches. Things didn't improve too much.
Yeah, I missed at least one commit, namely
bafaecd x86-64: support native xadd rwsem implementation
which is the one that actually makes x86-64 able to use the xadd version.
Linus
On Tue, Mar 23, 2010 at 07:25:43PM +0100, Anton Starikov wrote:
> On Mar 23, 2010, at 7:21 PM, Andrew Morton wrote:
> >> I will apply this commits to 2.6.32, I afraid current OFED (which I need
> also) will not work on 2.6.33+.
> >>
> >
> > You should be able to simply set CONFIG_RWSEM_GENERIC_SPINLOCK=n,
> > CONFIG_RWSEM_XCHGADD_ALGORITHM=y by hand, as I mentioned earlier?
>
> Hm. I tried, but when I do "make oldconfig", then it gets rewritten, so I
> assume that it conflicts with some other setting from default fedora kernel
> config. trying to figure out which one exactly.
Have you tracked this down yet? I just got the patches applied against
an older kernel and am running into the same issue.
Thanks,
Robin
On Tue, Mar 23, 2010 at 08:30:19PM +0100, Anton Starikov wrote:
>
> On Mar 23, 2010, at 8:22 PM, Robin Holt wrote:
>
> > On Tue, Mar 23, 2010 at 07:25:43PM +0100, Anton Starikov wrote:
> >> On Mar 23, 2010, at 7:21 PM, Andrew Morton wrote:
> >>>> I will apply this commits to 2.6.32, I afraid current OFED (which I need
> also) will not work on 2.6.33+.
> >>>>
> >>>
> >>> You should be able to simply set CONFIG_RWSEM_GENERIC_SPINLOCK=n,
> >>> CONFIG_RWSEM_XCHGADD_ALGORITHM=y by hand, as I mentioned earlier?
> >>
> >> Hm. I tried, but when I do "make oldconfig", then it gets rewritten, so I
> assume that it conflicts with some other setting from default fedora kernel
> config. trying to figure out which one exactly.
> >
> > Have you tracked this down yet? I just got the patches applied against
> > an older kernel and am running into the same issue.
>
> I decided to not track down this issue and just applied patches. I understood
> that with this patches there is no need to change this config options. Am I
> wrong?
We might need to also apply:
bafaecd11df15ad5b1e598adc7736afcd38ee13d
Robin
I think we got a winner!
Problem seems to be fixed.
Just for record, I used next patches:
59c33fa7791e9948ba467c2b83e307a0d087ab49
5d0b7235d83eefdafda300656e97d368afcafc9a
1838ef1d782f7527e6defe87e180598622d2d071
4126faf0ab7417fbc6eb99fb0fd407e01e9e9dfe
bafaecd11df15ad5b1e598adc7736afcd38ee13d
0d1622d7f526311d87d7da2ee7dd14b73e45d3fc
Thanks,
Anton.
On Mar 23, 2010, at 8:54 PM, Linus Torvalds wrote:
>
>
> On Tue, 23 Mar 2010, Anton Starikov wrote:
>
>>
>> On Mar 23, 2010, at 6:45 PM, Linus Torvalds wrote:
>>
>>>
>>>
>>> On Tue, 23 Mar 2010, Ingo Molnar wrote:
>>>>
>>>> It shows a very brutal amount of page fault invoked mmap_sem spinning
>>>> overhead.
>>>
>>> Isn't this already fixed? It's the same old "x86-64 rwsemaphores are using
>>> the shit-for-brains generic version" thing, and it's fixed by
>>>
>>> 1838ef1 x86-64, rwsem: 64-bit xadd rwsem implementation
>>> 5d0b723 x86: clean up rwsem type system
>>> 59c33fa x86-32: clean up rwsem inline asm statements
>>>
>>> NOTE! None of those are in 2.6.33 - they were merged afterwards. But they
>>> are in 2.6.34-rc1 (and obviously current -git). So Anton would have to
>>> compile his own kernel to test his load.
>>
>>
>> Applied mentioned patches. Things didn't improve too much.
>
> Yeah, I missed at least one commit, namely
>
> bafaecd x86-64: support native xadd rwsem implementation
>
> which is the one that actually makes x86-64 able to use the xadd version.
>
> Linus
Although the case is solved, I will post a description of the testcase program,
just in case someone wonders or would like to keep it for later tests.
------------------------------------------------------------------------
It is a parallel model checker. The command line you used does reachability
on the state space of the model anderson.6, meaning that it searches through
all possible states (int vectors). Each thread gets a vector from the queue,
calculates its successor states and puts them in a lock-less static hash
table (pseudo BFS exploration because the threads each have their own
queue).
How did Ingo run the binary? Because the static table size should be chosen
to fit into memory. "-s 27" allocates 2^27 * (|vector| + 1) * sizeof(int)
bytes. |vector| is equal to 19 for anderson.6, ergo the table size is 10GB.
This could explain the huge number of page faults Ingo gets.
But anyway, you can imagine that the code is quite jumpy and has a big
memory footprint, so the page faults may also be normal.
------------------------------------------------------------------------
On Mar 23, 2010, at 7:13 PM, Andrew Morton wrote:
> Anton, we have an executable binary in the bugzilla report but it would
> be nice to also have at least a description of what that code is
> actually doing. A quick strace shows quite a lot of mprotect activity.
> A pseudo-code walkthrough, perhaps?
>
> Thanks.
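The workload described above — many threads concurrently growing and first-touching a large table — is exactly the pattern that hammers mmap_sem, because every first touch of a fresh anonymous page takes a minor fault with the semaphore held for reading. A minimal stand-alone sketch of that pattern (my own illustration, not the LTSmin code) looks like this:

```c
/* Minimal page-fault pressure sketch (illustrative, not the LTSmin code).
 * Each thread mmap()s an anonymous region and first-touches every page,
 * so each touch is a minor fault taken with mmap_sem held for reading.
 * Build: gcc -O2 -pthread faultstorm.c -o faultstorm
 * Run:   time ./faultstorm <nthreads> <MB per thread>
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static size_t bytes_per_thread = 512UL << 20;	/* 512 MB default */

static void *fault_worker(void *arg)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	char *p;

	(void)arg;
	p = mmap(NULL, bytes_per_thread, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;
	for (size_t off = 0; off < bytes_per_thread; off += pagesize)
		p[off] = 1;			/* first touch -> page fault */
	munmap(p, bytes_per_thread);
	return NULL;
}

int main(int argc, char **argv)
{
	int nthreads = (argc > 1) ? atoi(argv[1]) : 14;
	pthread_t *tid;

	if (argc > 2)
		bytes_per_thread = (size_t)atol(argv[2]) << 20;
	if (nthreads < 1)
		nthreads = 1;

	tid = calloc(nthreads, sizeof(*tid));
	for (int i = 0; i < nthreads; i++)
		pthread_create(&tid[i], NULL, fault_worker, NULL);
	for (int i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);
	free(tid);
	return 0;
}
```

Run under time or perf stat at increasing thread counts, user time should stay roughly flat while sys time climbs once enough cores contend on the semaphore, which is the shape of the numbers reported at the top of this bug.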
Closing, since the problem has been solved in the current Linus' tree.

On Tue, 23 Mar 2010, Anton Starikov wrote:
>
> I think we got a winner!
>
> Problem seems to be fixed.
>
> Just for record, I used next patches:
>
> 59c33fa7791e9948ba467c2b83e307a0d087ab49
> 5d0b7235d83eefdafda300656e97d368afcafc9a
> 1838ef1d782f7527e6defe87e180598622d2d071
> 4126faf0ab7417fbc6eb99fb0fd407e01e9e9dfe
> bafaecd11df15ad5b1e598adc7736afcd38ee13d
> 0d1622d7f526311d87d7da2ee7dd14b73e45d3fc

Ok. If you have performance numbers for before/after these patches for
your actual workload, I'd suggest posting them to stable@kernel.org, and
maybe those rwsem fixes will get back-ported.

The patches are pretty small, and should be fairly safe. So they are
certainly stable material.

Linus

Tomorrow I will try to patch and check 2.6.33 and see whether these patches
are enough to restore performance, because on the 2.6.33 kernel the
performance issue also somehow involved cgroup business (and performance
was terrible even compared to the broken 2.6.32). If it does not fix 2.6.33,
I will ask to reopen the bug; otherwise I will post to stable@.
Thanks again for help,
Anton.
On Mar 24, 2010, at 12:04 AM, Linus Torvalds wrote:
>
>
> On Tue, 23 Mar 2010, Anton Starikov wrote:
>>
>> I think we got a winner!
>>
>> Problem seems to be fixed.
>>
>> Just for record, I used next patches:
>>
>> 59c33fa7791e9948ba467c2b83e307a0d087ab49
>> 5d0b7235d83eefdafda300656e97d368afcafc9a
>> 1838ef1d782f7527e6defe87e180598622d2d071
>> 4126faf0ab7417fbc6eb99fb0fd407e01e9e9dfe
>> bafaecd11df15ad5b1e598adc7736afcd38ee13d
>> 0d1622d7f526311d87d7da2ee7dd14b73e45d3fc
>
> Ok. If you have performance numbers for before/after these patches for
> your actual workload, I'd suggest posting them to stable@kernel.org, and
> maybe those rwsem fixes will get back-ported.
>
> The patches are pretty small, and should be fairly safe. So they are
> certainly stable material.
>
> Linus
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 23 Mar 2010, Anton Starikov wrote:
> >
> > I think we got a winner!
> >
> > Problem seems to be fixed.
> >
> > Just for record, I used next patches:
> >
> > 59c33fa7791e9948ba467c2b83e307a0d087ab49
> > 5d0b7235d83eefdafda300656e97d368afcafc9a
> > 1838ef1d782f7527e6defe87e180598622d2d071
> > 4126faf0ab7417fbc6eb99fb0fd407e01e9e9dfe
> > bafaecd11df15ad5b1e598adc7736afcd38ee13d
> > 0d1622d7f526311d87d7da2ee7dd14b73e45d3fc
>
> Ok. If you have performance numbers for before/after these patches for
> your actual workload, I'd suggest posting them to stable@kernel.org, and
> maybe those rwsem fixes will get back-ported.
>
> The patches are pretty small, and should be fairly safe. So they are
> certainly stable material.

We havent had any stability problems with them, except one trivial build
bug, so -stable would be nice.

Ingo

On Wed, 24 Mar 2010, Ingo Molnar wrote:
>
> We havent had any stability problems with them, except one trivial build bug,
> so -stable would be nice.
Oh, you're right. There was that UML build bug. But I think that was
included in the list of commits Anton had - commit 4126faf0ab ("x86: Fix
breakage of UML from the changes in the rwsem system").
Linus
Yes, it is included in my list.
When I submit it to stable, I will include it as well.
Anton
On Mar 24, 2010, at 12:55 AM, Linus Torvalds wrote:
>
>
> On Wed, 24 Mar 2010, Ingo Molnar wrote:
>>
>> We havent had any stability problems with them, except one trivial build
>> bug,
>> so -stable would be nice.
>
> Oh, you're right. There was that UML build bug. But I think that was
> included in the list of commits Anton had - commit 4126faf0ab ("x86: Fix
> breakage of UML from the changes in the rwsem system").
>
> Linus
On Wed, 24 Mar 2010, Andi Kleen wrote:
>
> It would be also nice to get that change into 2.6.32 stable. That is
> widely used on larger systems.
Looking at the changes to the files in question, it looks like it should
all apply cleanly to 2.6.32, so I don't see any reason not to backport
further back.
Somebody should double-check, though.
Linus
Reply-To: andi@firstfloor.org

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, 24 Mar 2010, Ingo Molnar wrote:
>>
>> We havent had any stability problems with them, except one trivial build
>> bug, so -stable would be nice.
>
> Oh, you're right. There was that UML build bug. But I think that was
> included in the list of commits Anton had - commit 4126faf0ab ("x86: Fix
> breakage of UML from the changes in the rwsem system").

It would be also nice to get that change into 2.6.32 stable. That is
widely used on larger systems.

-Andi

Reply-To: rdreier@cisco.com

> I will apply this commits to 2.6.32, I afraid current OFED (which I
> need also) will not work on 2.6.33+.

What do you need from OFED that is not in 2.6.34-rc1?

On Mar 24, 2010, at 5:40 PM, Roland Dreier wrote:
>> I will apply this commits to 2.6.32, I afraid current OFED (which I
>> need also) will not work on 2.6.33+.
>
> What do you need from OFED that is not in 2.6.34-rc1?
I didn't go to 2.6.34-rc1.
I tried 2.6.33; the mlx4 driver which comes with the kernel produces a panic on my hardware. And OFED-1.5 doesn't support this kernel (it can probably still be compiled, I didn't check).
Anton.
On Tue, 2010-03-23 at 10:22 -0400, Andrew Morton wrote:
> (switched to email. Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> On Tue, 23 Mar 2010 16:13:25 GMT bugzilla-daemon@bugzilla.kernel.org wrote:
>
> > https://bugzilla.kernel.org/show_bug.cgi?id=15618
> >
> > Summary: 2.6.18->2.6.32->2.6.33 huge regression in performance
> > Product: Process Management
> > Version: 2.5
> > Kernel Version: 2.6.32
> > Platform: All
> > OS/Version: Linux
> > Tree: Mainline
> > Status: NEW
> > Severity: high
> > Priority: P1
> > Component: Other
> > AssignedTo: process_other@kernel-bugs.osdl.org
> > ReportedBy: ant.starikov@gmail.com
> > Regression: No
> >
> > We have benchmarked some multithreaded code here on 16-core/4-way opteron 8356
> > host on number of kernels (see below) and found strange results.
> > Up to 8 threads we didn't see any noticeable differences in performance, but
> > starting from 9 threads performance diverges substantially. I provide here
> > results for 14 threads
>
> lolz. Catastrophic meltdown. Thanks for doing all that work - at a
> guess I'd say it's mmap_sem. Perhaps with some assist from the CPU
> scheduler.
>
> If you change the config to set CONFIG_RWSEM_GENERIC_SPINLOCK=n,
> CONFIG_RWSEM_XCHGADD_ALGORITHM=y does it help?
>
> Anyway, there's a testcase in bugzilla and it looks like we got us some
> work to do.
>
<snip>

I had an "opportunity" to investigate page fault behavior on 2.6.18+
[RHEL5.4] on an 8-socket Istanbul system earlier this year. When I saw this
mail, I collected up the data I had from that adventure and ran additional
tests on 2.6.33 and 2.6.34-rc1. I have attached plots of "per node" and
"system wide" page fault scalability.

The per node plot [#1] shows the page fault rate of 1 to 6
[nr_cores_per_socket] tasks [processes] and threads faulting in a fixed
GB/task at the same time on a single socket. The system wide plot [#3]
shows 1 to 48 [nr_sockets * nr_cores_per_socket] tasks and threads again
faulting in a fixed GB/task... For the latter test, I load one core per
socket at a time, then add the 2nd core per socket, ... In all cases, the
individual tasks/threads are fork()ed/pthread_create()d by a parent bound
to the cpu where they'll run to obtain node-local kernel data structures.
The tests run with SCHED_FIFO. I plot both "faults per wall clock
second"--the aggregate rate--and "faults per cpu second" or normalized
rate.

The per node scalability doesn't look all that different across the 3
releases, especially the faults per cpu second curves. However, in the
system wide multi-threaded tests, 2.6.33 is an anomaly compared to both
2.6.18+ and 2.6.34-rc1. The 2.6.18+ and 2.6.34-rc1 multi-threaded tests
show a lot of noise and, of course, a lot lower fault rate relative to the
multi-task tests. I aborted the 2.6.33 system wide multi-threaded test at
32 threads because it was just taking too long.

Unfortunately, with this many curves, the legends obscure much of the plot.
So, rather than bloat this message any more, I've packaged up the raw data
along with plots with and without legends and placed the tarball here:

  http://free.linux.hp.com/~lts/Pft/

That directory also contains the source for the version of the pft test
used, along with the scripts used to run the tests and plot the results.
Note that some manual editing of the "plot annotations" in the raw data was
required to generate several different plots from the same data.

The pft test is a highly, uh, "evolved" version of pft.c that Christoph
Lameter pointed me at a few years ago. This version requires a patched
libnuma with the v2 api. The required patch to the numactl-2.0.3 package is
included in the test tarball. [I've contacted Cliff about getting the patch
into 2.0.4.]

Lee
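For anyone reproducing Lee's setup, the "parent bound to the cpu where they'll run" trick can be sketched as follows (an illustrative guess at the pattern, not an excerpt from pft.c; the SCHED_FIFO setup and the actual faulting work are omitted): the parent pins itself to the target CPU before pthread_create(), so the per-task kernel structures allocated at creation time come from that CPU's node, and the new thread inherits the single-CPU affinity.

```c
/* Sketch of the bind-then-create pattern described above (illustrative,
 * not taken from pft.c).
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
	(void)arg;			/* fault in this worker's share here */
	return NULL;
}

static int spawn_on_cpu(int cpu, pthread_t *tid)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	/* Parent moves itself to the target CPU first ... */
	if (sched_setaffinity(0, sizeof(set), &set))
		return -1;
	/* ... so the child's task structures are allocated node-locally,
	 * and the child inherits the one-CPU affinity mask. */
	return pthread_create(tid, NULL, worker, NULL);
}

int main(void)
{
	enum { NR = 4 };		/* e.g. one worker per socket */
	pthread_t tid[NR];

	for (int cpu = 0; cpu < NR; cpu++) {
		if (spawn_on_cpu(cpu, &tid[cpu])) {
			fprintf(stderr, "spawn_on_cpu(%d) failed\n", cpu);
			return 1;
		}
	}
	for (int cpu = 0; cpu < NR; cpu++)
		pthread_join(tid[cpu], NULL);
	return 0;
}
```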
On Tue, Mar 23, 2010 at 08:00:54PM -0700, Linus Torvalds wrote:
>
>
> On Wed, 24 Mar 2010, Andi Kleen wrote:
> >
> > It would be also nice to get that change into 2.6.32 stable. That is
> > widely used on larger systems.
>
> Looking at the changes to the files in question, it looks like it should
> all apply cleanly to 2.6.32, so I don't see any reason not to backport
> further back.
>
> Somebody should double-check, though.
I have queued them all up for .33 and .32-stable kernel releases now.
thanks,
greg k-h