Bug 15385
| Summary: | CPU loses support for mwait instruction following suspend/resume | | |
|---|---|---|---|
| Product: | Platform Specific/Hardware | Reporter: | Alan Stern (stern) |
| Component: | i386 | Assignee: | drivers_video-dri-intel (drivers_video-dri-intel) |
| Status: | CLOSED CODE_FIX | | |
| Severity: | normal | CC: | hpa, mingo, rjw, shaohua.li, suresh.b.siddha, tglx, venki |
| Priority: | P1 | | |
| Hardware: | IA-32 | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.33-rc8 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Bug Depends on: | | | |
| Bug Blocks: | 7216 | | |
| Attachments: | Dmesg for boot followed by suspend/resume | | |
Description
Alan Stern
2010-02-24 16:57:04 UTC
I think this is an i915 issue, so reassigning.

Ouch, 865... it sounds like it may have panic'd at resume? Does it suspend/resume correctly if you boot with i915.modeset=0 (i.e. use the old suspend/resume code)?

Thanks Jesse, that suggestion was a big help! The system still panics, but with i915.modeset=0 the screen comes back so I can see what's going on.

It turns out the panic is caused by an invalid opcode exception in mwait_idle(). The bad instruction is the assembler __monitor() call; the offending IP points directly to the 0x0f,0x01,0xc8 bytes in the instruction stream.

For the record, I have set CONFIG_M686=y and:

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Celeron(R) CPU 2.53GHz
stepping        : 9
cpu MHz         : 2533.270
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe constant_tsc up pebs bts pni dtes64 monitor ds_cpl cid xtpr lahf_lm
bogomips        : 5053.30
clflush size    : 64
power management:

From dmesg:

CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 256K
CPU: Hyper-Threading is disabled
mce: CPU supports 4 MCE banks
CPU0: Thermal monitoring enabled (TM1)
using mwait in idle threads.

So it looks like the idle-routine selection logic is messed up. Accordingly, I am reclassifying this bug and adding hpa to the CC list in the hope that he can figure out what's going wrong.

Confirmed.
With the following patch applied, the resume worked correctly:

Index: 2.6.33-rc8/arch/x86/kernel/process.c
===================================================================
--- 2.6.33-rc8.orig/arch/x86/kernel/process.c
+++ 2.6.33-rc8/arch/x86/kernel/process.c
@@ -507,6 +507,9 @@ static int __cpuinit mwait_usable(const
 		return 0;
 
 	cpuid(MWAIT_INFO, &eax, &ebx, &ecx, &edx);
+	printk(KERN_INFO "cpuid: ecx %x edx %x\n", ecx, edx);
+	return 0;
+
 	/* Check, whether EDX has extended info about MWAIT */
 	if (!(ecx & MWAIT_ECX_EXTENDED_INFO))
 		return 1;

The output from the printk was:

[ 0.032024] cpuid: ecx 0 edx 0

This suggests the "return 1" in the last line above really should be "return 0". But I don't know enough about these processors to tell if that's the right solution.

No, this is clearly wrong. If ECX = EDX = 0, it simply means there are no extensions to MONITOR/MWAIT. The reason your patch "works" is that you're disabling MWAIT.

You're saying you have the "offending IP", but you're not giving any register values. In particular, if ECX != 0 at the point we're invoking MONITOR, that is buggy for this CPU, and would trigger a #GP(0).

The panic message lists the register contents as follows: EAX = 0xC1319008, EBX = 0xC13B0780, ECX = 0, EDX = 0. Also, the PID is 0.

By the way, do you have any idea why this exception should trigger during system resume and not during normal operation?

The "return 1;" in comment #4 is correct. The check is for mwait extended info, and basic mwait is supposed to work without that as well. And as you said, things are working OK on boot up. Something seems to be broken on resume. Maybe something related to microcode? Can you check whether CPUID.1.ECX.bit3 is set on resume? That bit says whether monitor/mwait is supported or not.

Adding Suresh.

Good guess! Before the suspend, cpuid(1) gives ECX = 0x441d, but afterward it gives ECX = 0x4415. What could cause this sort of thing?
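For reference, the two ECX values quoted above differ in exactly one bit, 0x8, which is bit 3 (the MONITOR/MWAIT feature flag in CPUID leaf 1). A minimal sketch of that comparison follows; the helper names are hypothetical and not taken from the kernel, and the inputs are simply the values reported in this thread:

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 1 reports MONITOR/MWAIT support in ECX bit 3. */
#define CPUID1_ECX_MONITOR (1u << 3)

/* Hypothetical helper: does this CPUID.1:ECX value advertise MONITOR/MWAIT? */
static bool ecx_has_monitor(uint32_t ecx)
{
    return (ecx & CPUID1_ECX_MONITOR) != 0;
}

/* Hypothetical helper: feature bits set before suspend but clear after resume. */
static uint32_t ecx_bits_lost(uint32_t before, uint32_t after)
{
    return before & ~after;
}
```

With before = 0x441d and after = 0x4415, ecx_bits_lost() yields 0x8, so the only feature that disappears across the suspend is the one mwait_idle() depends on.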
Usually failure to load microcode on the way out of suspend, OR control registers being changed. In this case, your CPU had monitor/mwait coming in, and not on the way out!

Not sure whether the microcode loaded by the BIOS or the one loaded by the OS is at fault here. Do you see any microcode-related messages in dmesg after boot? If yes, do you see a similar message after resume with the workaround in comment #4?

Created attachment 25327
Dmesg for boot followed by suspend/resume
I'm not sure what to look for, but attached is the complete dmesg log, starting from bootup and showing a successful suspend/resume transition. The only change to the vanilla 2.6.33-rc8 kernel was "return 0;" added near the start of mwait_usable().
I was expecting to see an "updated to revision" kind of message from arch/x86/kernel/microcode_intel.c. I don't see such a message in dmesg. So I think it is the BIOS that is loading a microcode update for this CPU at boot time and not loading it during resume.

There are no BIOS updates available from HP more recent than the one I have now. So it will be necessary to work around this problem somehow. Should I simply boot with "idle=poll" always?

I wonder if you can register the appropriate microcode with the microcode driver, even if it is the one currently loaded into the CPU; my understanding is that the microcode driver will check and re-load the microcode on resume.

My previous testing was all done without CONFIG_MICROCODE enabled, so anything that was loaded had to have been loaded by the BIOS. I just tried again with CONFIG_MICROCODE and CONFIG_MICROCODE_INTEL set (and with "idle=poll"). I also added "#define DEBUG" at the start of microcode_intel.c. It worked, but there's no indication that any microcode was actually loaded. The only new lines in the dmesg log are:

[ 0.607123] microcode: CPU0 sig=0xf49, pf=0x4, revision=0x3
[ 0.608874] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba

Well, you could check what CPUID does after suspend/resume. You should be able to *not* use idle=poll with the microcode driver.

Even with the microcode stuff enabled, the ecx value from cpuid(1) still changes from 0x441d to 0x4415. Not surprising, since the microcode driver never loads anything into the CPU.

I'm not at all familiar with the microcode driver. Are you saying that after it loads its data into the CPU, it stores a copy of all the resident microcode? And then it reloads that copy back into the CPU during a system resume? What functions should I look at to find where this is supposed to happen and fix it?

It turns out that microcode is not the answer. I finally got the microcode driver to do something.
The code in the file supplied by Intel was not getting loaded because the revision level in the file was the same as the CPU's current revision level. I changed the driver to force it to load the microcode anyway:

[ 47.985453] microcode: CPU0 updated to revision 0x3, date = 2005-04-21

It didn't help. The mwait-support flag (bit 0x8 of ecx following cpuid(1)) was completely unaffected:

Loading the microcode before doing a suspend left the flag turned on. Then after a suspend the flag was off, even though the microcode driver did reload the data into the CPU during early resume.

After rebooting, not loading any microcode, and doing a suspend, loading the microcode by hand left the flag turned off.

I'm at a loss for ideas as to the cause. Does anybody at Intel have a suggestion? Booting with "idle=halt" seems like a reasonable workaround, but it would be nice to actually fix the problem.

Fixed by commit 85a0e7539781dad4bfcffd98e72fa9f130f4e40d (PM / x86: Save/restore MISC_ENABLE register). The problem was caused by the fact that the MISC_ENABLE register was not getting saved and restored across the suspend by either the system or the BIOS.
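The idea behind the fix can be illustrated with a small user-space model. This is only a sketch and not the code from the commit. It assumes the relevant control is the "enable MONITOR FSM" bit (bit 18) of IA32_MISC_ENABLE (MSR 0x1a0), as documented in the Intel SDM. The register is replaced by a plain variable so the example can run outside the kernel, and every function and struct name here is hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural MSR number and bit position, per the Intel SDM. */
#define MSR_IA32_MISC_ENABLE   0x000001a0u
#define MISC_ENABLE_MWAIT_BIT  18          /* "enable MONITOR FSM" */

/* Hypothetical stand-in for the hardware register; a kernel would
 * read and write MSR_IA32_MISC_ENABLE with rdmsr/wrmsr instead. */
static uint64_t fake_misc_enable;

/* Hypothetical slice of the CPU state kept across a suspend. */
struct saved_ctx {
    uint64_t misc_enable;
    bool     misc_enable_saved;
};

/* Before suspend: remember the current register contents. */
static void ctx_save(struct saved_ctx *ctx)
{
    ctx->misc_enable = fake_misc_enable;
    ctx->misc_enable_saved = true;
}

/* On resume: write the remembered value back, if one was saved. */
static void ctx_restore(const struct saved_ctx *ctx)
{
    if (ctx->misc_enable_saved)
        fake_misc_enable = ctx->misc_enable;
}

/* Models the CPU coming back from suspend with the MONITOR bit cleared. */
static void simulate_resume_reset(void)
{
    fake_misc_enable &= ~(1ULL << MISC_ENABLE_MWAIT_BIT);
}

/* True if the modelled register currently enables MONITOR/MWAIT. */
static bool monitor_enabled(void)
{
    return ((fake_misc_enable >> MISC_ENABLE_MWAIT_BIT) & 1) != 0;
}
```

In this model, calling ctx_save() before simulate_resume_reset() and ctx_restore() afterwards brings the MONITOR bit back, which matches the behaviour described for the fix: without the save/restore step, nothing puts the bit back after resume.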