Distribution: any
Hardware Environment: especially Intel P4
Software Environment:
Problem Description:

Since SYSENTER/vsyscall support went into 2.5, the __switch_to/load_esp0 path does two WRMSRs to rewrite MSR_IA32_SYSENTER_CS and MSR_IA32_SYSENTER_ESP. This is hidden in processor.h:load_esp0. WRMSR is very slow (60+ cycles), especially on a Pentium 4, and slows down the context switch considerably. This is a trade-off between faster system calls using SYSENTER and slower context switches, but the context switches got unduly hit here.

The reason it rewrites SYSENTER_CS is, non-obviously, vm86, which doesn't guarantee that the MSR stays constant (sigh). I think this would be better handled by having a global flag or per-process flag that is set when any process uses vm86, and skipping the write when the flag is not set (as in 99% of all normal use cases).

It rewrites SYSENTER_ESP to point to the stack page of the current process. Previous implementations used a trampoline for that. The reason it was moved to the context switch was that an NMI could see the trampoline stack for one instruction, and if the NMI handler then calls current (very unlikely) and references the stack pointer, it doesn't get a valid task_struct. The obvious solution would be to somehow check for this case (e.g. by looking at esp) in the NMI slow path.

Steps to reproduce:
Benchmark __switch_to or the context switch. Note that lmbench is not reliable (numbers vary wildly); microbenchmarks of WRMSR show the problem clearly.
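To illustrate the global-flag idea for SYSENTER_CS, here is a rough userspace model. It is only a sketch, not kernel code and not the attached patch: WRMSR is privileged, so the MSR write is replaced by a counter. Everything prefixed model_ and the vm86_ever_used flag are hypothetical names, and the model simply assumes that a vm86 task wants 0 in SYSENTER_CS. Only the two MSR numbers are real.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural MSR numbers for the two registers discussed above. */
#define MSR_IA32_SYSENTER_CS  0x174
#define MSR_IA32_SYSENTER_ESP 0x175

/* Modeled per-CPU state: last values "written" plus a write counter. */
struct model_cpu {
    uint64_t sysenter_cs;
    uint64_t sysenter_esp;
    unsigned long wrmsr_count;
};

/* Modeled task: its kernel stack top and the SYSENTER_CS value it needs. */
struct model_task {
    uint64_t esp0;
    uint64_t sysenter_cs;
};

/* Hypothetical global flag: set once any task has entered vm86 mode. */
bool vm86_ever_used;

/* Stand-in for WRMSR: records the value and counts the (expensive) write. */
void model_wrmsr(struct model_cpu *cpu, uint32_t msr, uint64_t val)
{
    if (msr == MSR_IA32_SYSENTER_CS)
        cpu->sysenter_cs = val;
    else if (msr == MSR_IA32_SYSENTER_ESP)
        cpu->sysenter_esp = val;
    cpu->wrmsr_count++;
}

/* Assumption of this model: a vm86 task wants SYSENTER_CS cleared. */
void model_enter_vm86(struct model_task *task)
{
    vm86_ever_used = true;
    task->sysenter_cs = 0;
}

/* Modeled context switch: the CS write is skipped until vm86 is in play. */
void model_switch_to(struct model_cpu *cpu, const struct model_task *next)
{
    model_wrmsr(cpu, MSR_IA32_SYSENTER_ESP, next->esp0);
    if (vm86_ever_used)
        model_wrmsr(cpu, MSR_IA32_SYSENTER_CS, next->sysenter_cs);
}
```

As long as no task has called model_enter_vm86(), each model_switch_to() issues one modeled write instead of two. Once the flag is set, the model falls back to writing both registers on every switch. A per-process flag would be finer grained, but it would also need to restore the MSR when switching away from a vm86 task.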
Update: some patches and patch proposals for this exist now. The first one is from me, to fix the wrmsr for vm86 mode (attached). There are also some proposals from Linus and Jamie Lokier on how to eliminate the wrmsr for the sysenter stack (no patch yet, just proposals). See the surrounding thread also, including replies from Linus approving the idea: http://marc.theaimsgroup.com/?l=linux-kernel&m=104502360530576&w=2 (Jamie also has some ideas on how to speed up the ctx switch fast path further.) Someone just needs to implement it...
Created attachment 140: eliminate one wrmsr in i386 context switch
Linus fixed it in 2.5.65.