Latest failing kernel version: 2.6.26-rc6
Earliest failing kernel version: 2.6.24
Last working kernel: ?
Distribution: Ubuntu 8.04, x64
Hardware Environment: Acer Extensa 5220, Intel Celeron M 530, 2 GB RAM
Software Environment: Single User Mode
Problem Description: Usually within a minute or two of loading the processor.ko module, the CPU starts waking up around 45,000 to 50,000 times per second.

# acpi=noirq
# disable_8254_timer
Does not help; changes nothing.

# nohz=off
Reduces wakes to a constant ~1000; probably doesn't use C-states.

# acpi=off
or
# rm /lib/modules/$VER/kernel/drivers/acpi/processor.ko
Fixes the problem: wakes stay down, powertop doesn't show C-states.

# rmmod -wf processor
Same as acpi=off, but lsmod still shows "processor 26864 1".

# blacklist yenta_socket, pcmcia, rsrc
Does not help.

Tested in 2.6.24, 2.6.25.6 and 2.6.26-rc6.

The interesting thing on my system is that the problem *stops* after suspending and resuming from S3. The problem reappears on the next reboot.

Just let me know if I should post more information or try something. I will attach some logs now.
Created attachment 16478: 2.6.26-rc6 .config
Created attachment 16479: dmesg of normal boot. The hpet=off option is ignored; it would have to be hpet=disable to work.
Created attachment 16480: cpuinfo
Created attachment 16481: interrupts (10-second interval)
Created attachment 16482: powertop -d before the problem appears
Created attachment 16483: powertop -d after the problem appears
Created attachment 16484: timer_list (normal boot)
Created attachment 16485: lspci
Created attachment 16486: lspci -vv
Created attachment 16487: dmesg of acpi=off boot
It should also be noted that when my system starts, it does not use the C2 state at all. This changes at the point where the 40,000 wakeups per second (wps) appear. After suspending and resuming (S3), the problem disappears altogether: the wps are low and the CPU spends over 90% of its time in C2.
The bug seems to be triggered a lot less with hpet=disable. It appeared only once after boot, briefly, at 23,000.
If I first suspend and resume in 2.6.25 and then kexec into 2.6.26-rc6, the bug does not show. I will post logs of this scenario after this comment.
Created attachment 16500: 2.6.26-rc6 hpet=disable dmesg
Created attachment 16501: dmesg of kexec from 2.6.25.6 into 2.6.26. I restarted into 2.6.26 with kexec after suspending and resuming in 2.6.25.6. The bug does not show then.
Okay. The bug appears with hpet=disable too, but differently: it skips in and out a lot. Usually it stays at 40,000+ wps; now it jumps back and forth between ~200-500 and 20,000 or 40,000. The strange thing is still that, as under 2.6.25, the more wps, the more time is shown as spent in C2.
Created attachment 16502: powertop in 2.6.26 with hpet=disable. powertop ran for a longer time; the output was captured with: powertop >> powertop.log
> ACPI: CPU0 (power states: C1[C1] C2[C2])

This is likely a case of a C-state (C2) exported by the platform that returns immediately rather than actually entering idle. You should be able to make the symptom go away with processor.max_cstate=1. Please verify that you're running the latest BIOS for the system. Sometimes vendors get this wrong on release and then correct it... Then please check the BIOS SETUP for any options related to processor power management. However, what we really want to know is why C2 isn't working... Please attach the output from acpidump so we can see exactly which C-states the system claims to support.
processor.max_cstate=1 fixes the problem: no more 40,000+ wakes. The BIOS is the newest available version. It has no corresponding option; in fact it has hardly any options at all.
Created attachment 16515: acpidump output
Created attachment 16516: DSDT.dsl, created from the acpidump output above for convenience
Created attachment 16517: acpidump.out. I'm very sorry: it turns out a new BIOS was just released. It does not fix this problem, though.
Created attachment 16518: 2.6.26-rc6 dmesg with new BIOS
Created attachment 16520: try the debug patch. Will you please try the debug patch and see whether the problem still exists? After the patch is applied, please add the boot option "idle=nomwait". Please also attach the output of dmidecode. Thanks.
Created attachment 16521: dmidecode (new BIOS)
For idle=nomwait on the boot command line, dmesg tells me:
[ 0.000000] Malformed early option 'idle'
For insmod processor.ko idle=nomwait it tells me:
[ 261.132372] processor: `nomwait' invalid for parameter `idle'
As modinfo processor.ko says:
parm: idle:Disable the mwait for CPU idle (int)
I tried:
insmod processor.ko idle=1
This seems to fix the problem for me! Great!!! Thanks! The CPU now spends most of its time in the C3 state (which was not there before). I will test this for a while now.
Created attachment 16523: dmesg with idle patch. This is the dmesg with the above patch applied and the processor module loaded manually with idle=1.
Thanks for the test. From the test it seems that, according to powertop, your laptop works well once mwait is disabled for CPU C-states. I am sorry that I gave you the incorrect patch. In the updated processor_idle.patch, the boot parameter "idle=nomwait" is used instead of the processor module parameter. (Of course the old processor_idle.patch is also OK, but there the module parameter "processor.idle=1" is used.) The updated patch can be found at: http://bugzilla.kernel.org/show_bug.cgi?id=10807#c23 I will also add your laptop to the DMI check table. Thanks.
No problem. When do you think this will go into mainline? Should I close the bug then? I haven't seen the problem anymore since applying your patch.
Hi, Dionisus. The patch set has already been sent to the ACPI mailing list. It will take some time to merge it into the upstream kernel. Thanks.
Hi, Dionisus. Will you please attach the output of "lspci -vxxx" before suspend and after resume? Thanks.
Created attachment 16533: lspci -vxxxx before suspend, BIOS v1.34. I wish I could give you the lspci output after suspend, but at the moment my computer doesn't wake up at all anymore. No more resume under 2.6.25.6, 2.6.26-rc6, or Windows XP SP3. I tried flashing the old BIOS again, but it did not improve much (XP resumes with a Blue Screen of Death, KERNEL_DATA_INPAGE_ERROR). If you have any idea what I can do, let me know. Thanks.
Created attachment 16534: lspci -vxxxx before suspend, BIOS v1.31
The nomwait patch also works for me.
I was wondering: couldn't this problem be fixed with a generic patch of some kind? At least there should never be 20,000+ wakes per second, right? Or is this needed so that the bugs causing the high wakes can be found? Maybe a printk would be better suited?
Hi, Dionisus. The workaround patch works for you, but the root cause has not been found. According to the problem description, the interesting thing on your system is that the problem *stops* after suspending and resuming from S3 and reappears on the next reboot. Maybe the problem is related to the BIOS configuration. But as described in comment #32, your laptop can't resume from S3, so we can't compare the state before suspend and after resume. It would be great if you could provide the output of lspci -vxxx before suspend and after resume. Maybe this will help find the root cause. Thanks.
Hi ykzhao, I fixed my suspend problem in 2.6.25. I will now post the lspci outputs.
Created attachment 16544: lspci -vxxxx before suspend
Created attachment 16545: lspci -vxxxx after suspend
By the way: idle=halt does not do the same for me as processor.idle=1. The former does not activate the C3 state, and the CPU stays in it for shorter amounts of time.
Hi, Dionisus. What you said is right. The boot option idle=halt is totally different from the module parameter processor.idle=1 in comment #24. When that boot option is used, halt is used for CPU idle and there are no CPU C-states. Maybe that is not very reasonable. Of course I will rework this issue and try to refresh the patch. Thanks.
The patch works very well for me now. Did the new lspci output help you? Please let me know if I can do something else to help.
By the way, the patch also works well in 2.6.25.7 for me (applied to both the _32 and _64 versions of process.c).
I mistakenly changed the status myself. According to the docs, this should be done by QA if appropriate.
Hi, Dionisus. Thanks for the lspci -vxxx output from before and after suspend. But I still can't find the root cause of the 40,000+ wakeups per second. Anyway, please assign the bug to me and I will try to track this issue. Thanks.
Hi ykzhao, I'm sorry, but I don't know how to assign it to you. Actually, the patch helps much more than going into standby and resuming. If I can help in any way, please let me know. Thanks.
Please let me know if I can do anything else to help find the root cause. Thank you.
Does anyone have any idea how the cause of this problem could be found?
If I understand it correctly, patches that just went into Linus' tree should fix it. Can you confirm that 2.6.26-git5 (when it becomes available) fixes it?
The patch is already in the linux-git tree and should fix the problem. The remaining question is why the problem disappears after suspending and resuming from S3. I am analyzing why there are more than 40,000 wakeups per second when the system enters C-states. Unfortunately, I haven't found the root cause yet.
Thanks. I will need some time to test. I will probably be unreachable for around two weeks and unable to test the new kernel until then (no bandwidth for downloading the update). If you need me to test the kernel before that, please give me patches to apply (to 2.6.26).
Same problem with Acer Extensa 4220.
# rmmod -f processor
# insmod processor.ko max_cstate=1
reduces WPS from 50,000+ to only 50. I'm using 2.6.26.
Update: In 2.6.27-rc2 it basically still happens. But now, after startup, powertop shows only C1 and C3 and uses only C1 at first. A bit later it starts using C3 about 80% of the time (C2 is still nowhere to be found) and the wakes go up to ~50,000. I also read that there might be a similar problem on the Dell Vostro 1510.
Created attachment 17118: dmesg of 2.6.27-rc2
Hi, Dionisus. I added your laptop to the DMI check table in the following commit, in which MWAIT is disabled for CPU C-states:
> commit 2a2a64714d9c40f7705c4de1e79a5b855c7211a9
> Author: Zhao Yakui <yakui.zhao@intel.com>
> Date: Tue Jun 24 18:02:57 2008 +0800
> ACPI: Disable MWAIT via DMI on broken Compal board
From the dmesg in comment #53, I can't see that mwait is disabled for CPU C-states on your laptop. I am sorry for my mistake. The SYS_VENDOR reported by the BIOS is "Acer", but the SYS_VENDOR in the patch is "ACER". Because the matching in dmi_check_system() is case-sensitive, mwait isn't disabled for CPU C-states. Maybe you can continue to use the boot option "idle=nomwait". Will you please try the boot option "idle=nomwait" on the latest kernel (2.6.27-rc2) and see whether the problem still exists? Thanks.
Hi, Georgij. Will you please try the boot option "idle=nomwait" on the 2.6.27-rc2 kernel and see whether the problem still exists? If the problem disappears, please attach the output of dmidecode. Thanks.
Hi ykzhao, I checked at 30a2f3c60a84092c8084dfe788b710f8d0768cd4 (-rc3), and if I boot with idle=nomwait, the CPU is in C3 most of the time. So the problem is fixed if I pass the parameter. No problem, it's just a typo. Please let me know when I can test again. I've got Linus' current git tree available. Then I will post a new dmesg.
Created attachment 17253: Patch to fix typo for current git HEAD. I fixed the typo and tested it. It works. This is a patch against the current git HEAD. I hope it's okay.
Hi, Dionisus. Thanks for your work. After the patch in comment #58 is applied to the 2.6.27-rc2 kernel, the system works well and the CPU is in C3 most of the time. There is still another interesting issue from the problem description whose root cause I unfortunately can't find: before suspend, powertop shows more than 40,000 wakes per second, and the problem *stops* after suspending and resuming from S3. Anyway, the system works well after mwait is disabled for CPU C-states. IMO this bug can be marked as resolved. Thanks again for the testing and work.
Hi ykzhao, yes, after the patch in comment #58 the system works well and the CPU is in C3 most of the time. If you have any idea what other tests I could run to find out why the problem goes away after suspend and resume, let me know.
And you're very welcome of course, thank you for your help, your work and the patch!
Hi ykzhao, could you send the patch to Linus? Thanks
Created attachment 17273: acer 4220 dmidecode
(In reply to comment #56)
> Hi, Georgij
> Will you please try the boot option of "idle=nowmait" on the 2.6.27-rc2
>kernel and see whether the problem still exists?
The problem doesn't appear with "idle=nomwait" on 2.6.27-rc3. It still appears without that parameter. Here is my dmidecode.
I can confirm that the problem disappears after suspend to RAM and resume, too. S3 seems to be broken on 2.6.26, but it works on 2.6.27-rc3 (if you can call it "works": after resume the backlight on all VTs is off, though hopefully it is on in Xorg; I use xf86-video-intel with Xorg and vesa on the VTs because someone didn't add intel-agp as a dependency of intelfb, and I don't know how to make that work properly). I'm willing to help investigate why resuming after s2ram solves the problem, but maybe you shouldn't add "idle=nomwait" as a default quirk for my laptop until this investigation is done, unless I can use something like "idle=wait" to override it.
Hi, Georgij. Thanks for the test and help. Do you mean that your laptop has the same problem as the 5220 laptop, and that the problem disappears after suspend/resume? If so, your laptop won't be added to the DMI check table until the root cause is found. Of course we would appreciate your help investigating this issue. Thanks.
Hi, Dionisus. Thanks for the reminder. As the patch in comment #58 is not very urgent, can we defer it for some time? Once we find out why s2ram also solves the issue, the Acer 4220 laptop will also be added to the DMI check table, and of course the typo will be fixed. Until then, maybe the boot option "idle=nomwait" can be used on your laptop. Is that OK? Thanks.
Hi ykzhao, I think maybe we should do both. I think it is better to fix the problem now for most users; then, once we find the root cause, we can write a better fix. But we should change the patch so that debugging is still possible without recompiling: we could add a parameter "forcemwait", as Georgij suggested in #64. If you like, I can try to make such a patch. What do you think? Thanks.
I think we should override the DMI nomwait quirk if idle=mwait is given as a kernel parameter, roughly like: if (idle_mwait) idle_nomwait = 0; But I couldn't find the place where idle=mwait is processed.
Hi, Dionisus. Thanks for caring about this issue. In fact, when no option is given, the OS will try to use mwait for CPU C-states. Mwait is disabled for CPU C-states only when the laptop is in the DMI check table or when mwait is disabled in the BIOS. IMO it is inappropriate to add the boot option "idle=mwait": with this option, people would assume that mwait is always used for CPU C-states, but on some systems mwait is disabled in the BIOS (the BIOS returns C-states that don't use mwait). That doesn't sound reasonable. Will you please send the patch in comment #63 to the linux-acpi mailing list if you don't want to add the option "idle=nomwait"? (I ACK it.) Thanks.
Hi ykzhao, thanks for all your help. I agree: idle=mwait would not be a good name. What do you think about idle=forcemwait? I have submitted the patch.
The patch made it into the current linux head and it works fine now. So with 2.6.27 the problem is (hot)fixed for the Acer 5220.
Okay. Either the patch has not really made it in yet or it was reverted. The patch is not in the current head.
Hi, Dionisus. Thanks for caring about this issue. The patch is already in the ACPI test git tree, but we have to wait for some time before it hits the upstream kernel. Thanks.
Okay. This time it's really in the git tree. That means it's hotfixed in 2.6.27: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=3df8a905ed09341041a3d1c6309fdb18cc809297 Dear ykzhao, are you still researching the root cause of the bug, or should we close this bug now?
Hi, Dionisus. Sorry, but I am not going to spend more time on identifying the root cause. After checking the ICH8 and CPU datasheets, I still can't find it, so I have given up. Very sorry.
Hi ykzhao, no problem. I have only one idea left: could you write a patch that prints a warning when the CPU wakes up more than 10,000 times per second? This way more people could find out that they have the problem, and with their help it might be easier to find the cause(s). At least they would know that their battery life is affected by a bug. E.g.: "CPUX: Unusually high number of processor wake-ups per second: XX.XXX. If you're not currently using high-resolution timer software (e.g. multimedia players), this may well be a bug in a part of the kernel." Or something like: "CPU0 wakes over 10.000 times per second. This might be a bug, see: http://www.explanation.com." Of course I'm not sure how many wakeups per second are normal and when they're unusually high, except when I use mplayer. But I think I've never gotten over 1,000 without the bug, even when using e.g. mplayer. What do you think?
> CPU: Intel(R) Celeron(R) CPU 530 @ 1.73GHz stepping 01
Please paste the output from:
$ cat /proc/cpuinfo
Created attachment 18349: cat /proc/cpuinfo
Thanks for the cpuinfo. Looks like you've got one of these two: http://processorfinder.intel.com/details.aspx?sSpec=SLA2G http://processorfinder.intel.com/details.aspx?sSpec=SL9VA which are both: http://support.intel.com/support/processors/mobile/celeron/sb/CS-023760.htm And the datasheet does say that this processor supports MWAIT for C2 and C3. I don't see any errata related to MWAIT waking up prematurely, but this issue could be related to our MONITOR address triggering when we don't expect it to...
You're very welcome; if there's *anything* else I can do to help, please let me know. (In case someone really wants to hack on this: these notebooks are quite cheap (but good), sold for 330-400€, e.g. on Amazon.)
Created attachment 18357: test patch vs linux-2.6.27. Please apply this patch to linux-2.6.27 and boot with "idle=clflush". dmesg should show an additional line: "Enabling CLFLUSH on MWAIT". Please report whether this has any effect on the large number of wakeups/sec when you're using MWAIT in C2 or C3.
Oh, to perform the test in comment #81 using 2.6.27, you'll need to disable the DMI workaround. You can do that by commenting out this call in acpi_processor_init(): dmi_check_system(processor_idle_dmi_table); or by otherwise preventing set_no_mwait() from running on your system.
The patch in #81 with workaround 1 from #82 also works. The C3 state is available and used, just like with vanilla 2.6.27 as far as I can see. I will keep using it for now to check for possible side effects, right?
I didn't really deactivate the workaround; I'll have to test it again...
It looks like the bug is not reproducible in 2.6.27. No more 40,000 wakes/s, even if I disable the workaround. It just stays in C1 while idle:

Cn                Avg residency       P-states (frequencies)
C0 (cpu running)        (10.7%)   <- used to be here 99% on idle
C0                0.0ms ( 0.0%)
C1                8.1ms (89.0%)   <- used to be here a bit when *busy*
C3                0.0ms ( 0.3%)   <- the same; no C2, C3 hardly used

I got 12,000 wakes once, but that was caused by "ethstatus". So I can't really test whether clflush helps with the problem. If I enable clflush, I can no longer see which C-states are used, so there's nothing I could test. But hey, it's a good thing that it's fixed, right? Does that help? Is there anything else I can test?
Hmm, so the workaround didn't work because the problem had already vanished in linux-2.6.26. Yes, that is both good news and bad: good for 2.6.27 users, bad for 2.6.26 users... Any chance you can git-bisect to see what fixed this in .27 so we can perhaps back-port it for the benefit of .26 customers?
(In reply to comment #86)
> hmmm, so the workaround didn't work because the problem
> had already vanished in linux-2.6.26.
Yes. The one problem (high wakes) had disappeared, while the other (the CPU does not use C2/C3) had not. I thought you might also be trying to find a fix for the second problem.
> Any chance you can git-bisect to see what fixed this in .27
> so we can perhaps back-port that for the benefit of .26 customers?
It will probably take a while until I've got time to do that. But I'm also curious.
Hi, Dionisus. What is the result of git-bisect? From the description in comment #85, it seems that C1 is entered instead of C3 most of the time when the CPU is idle, although there are fewer wakes/s. Will you please use git-bisect to identify which commit brings this change? (Of course the DMI check workaround should be removed for your test.) Thanks.
git bisect complains because the good version is newer than the bad version. I will try exchanging good and bad:
git bisect start 2.6.27-rc5 2.6.27-rc1 does not work
git bisect start 2.6.27-rc1 2.6.27-rc5 does work
It turns out the one (high wakes) bug was fixed somewhere between 2.6.27-rc1 and 2.6.27-rc5. I will let you know when I find out more.
Okay. The "bad" commit (i.e. the one that makes the high wakes disappear, since I swapped good and bad) is 320eee... http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=320eee776357db52d6fcfb11cff985b1976a4595 Please add me as Bisected-by/Tested-by: Dennis Jansen <Dennis.Jansen@web.de>
P.S. The C2 mode still doesn't work, of course.
P.P.S. ykzhao's patch fixed both problems for me: the high wakes were gone, and the CPU could use all C-states and did use the right one. Commit 320eee "just" fixed the high wakes problem (only).
cc venki
A patch for the other problem is now available, too.
venki, should we send 320eee776357db52d6fcfb11cff985b1976a4595 to 2.6.26.stable?
Yes, 320eee is OK to go to stable.
Created attachment 20321: 2.6.27's 320eee refreshed to apply to 2.6.26.stable. It is sort of late for 2.6.26.y now, but I'll e-mail this patch to stable@kernel.org just to close the books on this one.