Bug 10914 (40000wps)

Summary: 40000 wake/s unless idle=nomwait - Acer Extensa 5220, Celeron 520
Product: ACPI
Reporter: Dionisus Torimens (djtm)
Component: Power-Processor
Assignee: ykzhao (yakui.zhao)
Status: CLOSED PATCH_ALREADY_AVAILABLE
Severity: normal
CC: acpi-bugzilla, bunk, smpuj, venki
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.26-rc6
Subsystem:
Regression: ---
Bisected commit-id:
Attachments: 2.6.26rc6 .config
dmesg of normal boot
cpuinfo
interrupts (10 seconds interval)
powertop -d before the problem appears
powertop -d after the problem appears
timer_list (normal boot)
lspci
lspci -vv
dmesg of acpi=off boot
2.6.26rc6 hpet=disable dmesg
kexec 2.6.25.6 into 2.6.26 dmesg
powertop in 2.6.26 hpet=disable
acpidump output
DSDT.dsl
acpidump.out
2.6.26rc6 dmesg with new BIOS
try the debug patch
dmidecode (new BIOS)
dmesg with idlepatch
lspci -vxxxx before suspend bios v. 1.34
lspci -vxxxx before suspend bios v. 1.31
lspci -vxxxx before suspend
lspci -vxxxx after suspend
dmesg of 2.6.27-rc2
Patch to fix typo for current git HEAD.
acer 4220 dmidecode
cat /proc/cpuinfo
test patch vs linux-2.6.27
2.6.27's 320eee refreshed to apply to 2.6.26.stable

Description Dionisus Torimens 2008-06-15 01:31:38 UTC
Latest failing kernel version: 2.6.26rc6
Earliest failing kernel version: 2.6.24
Last working kernel: ?
Distribution: Ubuntu 8.04, x64
Hardware Environment: Acer Extensa 5220
Intel Celeron M 530, 2 GB RAM
Software Environment: Single User Mode
Problem Description: Usually within a minute or two of loading the processor.ko module, the CPU starts waking up around 45,000 to 50,000 times per second.

# acpi=noirq
# disable_8254_timer
does not help
changes nothing

# nohz=off
reduces wakes to a constant ~1000
probably doesn't use C-states

# acpi=off
or # rm /lib/modules/$VER/kernel/drivers/acpi/processor.ko
fixes problem
wakes stay down, powertop doesn't show C-states

# rmmod -wf processor
same as acpi=off, but lsmod still shows
"processor              26864  1"

# blacklist yenta_socket, pcmcia, rsrc
does not help

tested in 2.6.24, 2.6.25.6 and 2.6.26-rc6.

The interesting thing for my system is that the problem *stops* after suspending and resuming from S3. The problem reappears on the next reboot.

Just let me know if I should post some more information or try something.

I will attach some logs now.
Comment 1 Dionisus Torimens 2008-06-15 01:35:49 UTC
Created attachment 16478 [details]
2.6.26rc6 .config
Comment 2 Dionisus Torimens 2008-06-15 01:37:13 UTC
Created attachment 16479 [details]
dmesg of normal boot

The hpet=off option is ignored; it would have to be hpet=disable to work.
Comment 3 Dionisus Torimens 2008-06-15 01:37:31 UTC
Created attachment 16480 [details]
cpuinfo
Comment 4 Dionisus Torimens 2008-06-15 01:37:56 UTC
Created attachment 16481 [details]
interrupts (10 seconds interval)
Comment 5 Dionisus Torimens 2008-06-15 01:38:22 UTC
Created attachment 16482 [details]
powertop -d before the problem appears
Comment 6 Dionisus Torimens 2008-06-15 01:38:40 UTC
Created attachment 16483 [details]
powertop -d after the problem appears
Comment 7 Dionisus Torimens 2008-06-15 01:39:22 UTC
Created attachment 16484 [details]
timer_list (normal boot)
Comment 8 Dionisus Torimens 2008-06-15 01:42:07 UTC
Created attachment 16485 [details]
lspci
Comment 9 Dionisus Torimens 2008-06-15 01:42:22 UTC
Created attachment 16486 [details]
lspci -vv
Comment 10 Dionisus Torimens 2008-06-15 01:42:53 UTC
Created attachment 16487 [details]
dmesg of acpi=off boot
Comment 11 Dionisus Torimens 2008-06-15 01:48:33 UTC
It should also be noted that when my system starts it does not use the C2-state at all. This changes at the point the 40000 wps appear.

After suspending and resuming (S3) the problem disappears altogether: The wps are low and it's over 90% in C2.
Comment 12 Dionisus Torimens 2008-06-16 01:09:01 UTC
It seems like the bug is triggered a lot less with hpet=disable. It appeared only briefly, once after boot, at 23,000 wakeups per second.
Comment 13 Dionisus Torimens 2008-06-16 01:14:01 UTC
If I first suspend and resume in 2.6.25 and then kexec into 2.6.26rc6, the bug does not show. I will post logs of this scenario after this comment.
Comment 14 Dionisus Torimens 2008-06-16 01:14:44 UTC
Created attachment 16500 [details]
2.6.26rc6 hpet=disable dmesg
Comment 15 Dionisus Torimens 2008-06-16 01:16:34 UTC
Created attachment 16501 [details]
kexec 2.6.25.6 into 2.6.26 dmesg

I restarted into 2.6.26 with kexec after suspending and resuming in 2.6.25.6. The bug does not show then.
Comment 16 Dionisus Torimens 2008-06-16 01:24:32 UTC
Okay. The bug also appears with hpet=disable, but differently. It skips in and out a lot. Normally it stays at 40000+ wps; now it jumps back and forth between ~200-500 and 20000 or 40000.

The crazy thing is still, that - as under 2.6.25 - the more wps, the more time is shown as spent in C2.
Comment 17 Dionisus Torimens 2008-06-16 01:27:42 UTC
Created attachment 16502 [details]
powertop in 2.6.26 hpet=disable

powertop running for a longer time. it's captured with 
powertop >> powertop.log
Comment 18 Len Brown 2008-06-16 19:24:24 UTC
> ACPI: CPU0 (power states: C1[C1] C2[C2])

this is likely a case of a C-state (C2) exported by the platform
that returns immediately rather than actually entering idle.
you should be able to make the symptom go away with
processor.max_cstate=1

Please verify that you're running the latest BIOS for the system.
Sometimes vendors get this wrong on release and then correct it...
Then please check the BIOS SETUP for any options related to
processor power management.

However, what we really want to know is why C2 isn't working...
please attach the output from acpidump so we can see
exactly what C-states the system claims to support.
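
The symptom described above can be put into numbers. The following is a minimal sketch in C, not part of any patch in this report; the function name avg_residency_us is hypothetical. It divides the time spent in a C-state by the number of wakeups to estimate how long each idle period lasts.

```c
/* Estimate the average time (in microseconds) spent in a C-state per
 * entry, given the fraction of wall-clock time in that state (0.0 to 1.0)
 * and the number of wakeups per second. Returns 0 when there are no
 * wakeups, to avoid dividing by zero. */
double avg_residency_us(double fraction_in_state, double wakeups_per_sec)
{
    if (wakeups_per_sec <= 0.0)
        return 0.0;
    return fraction_in_state * 1e6 / wakeups_per_sec;
}
```

With illustrative figures (not measurements from this report) of 40,000 wakeups per second and half of the time in C2, each entry would last only about 12.5 microseconds, which is consistent with a state that returns almost immediately.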
Comment 19 Dionisus Torimens 2008-06-16 21:40:09 UTC
processor.max_cstate=1 fixes the problem. No more 40,000 or more wakeups per second.

The BIOS is the newest available version. It has no corresponding option, hardly any options at all.
Comment 20 Dionisus Torimens 2008-06-16 21:41:02 UTC
Created attachment 16515 [details]
acpidump output
Comment 21 Dionisus Torimens 2008-06-16 21:52:58 UTC
Created attachment 16516 [details]
DSDT.dsl

created above from acpidump output for convenience
Comment 22 Dionisus Torimens 2008-06-16 22:49:47 UTC
Created attachment 16517 [details]
acpidump.out

I'm very sorry. Turns out there was just a new BIOS released. It does not fix this problem, though.
Comment 23 Dionisus Torimens 2008-06-16 23:00:00 UTC
Created attachment 16518 [details]
2.6.26rc6 dmesg with new BIOS
Comment 24 ykzhao 2008-06-16 23:57:42 UTC
Created attachment 16520 [details]
try the debug patch

Will you please try the debug patch and see whether the problem still exists?
After the patch is applied, please add the boot option of "idle=nomwait".

Please also attach the output of dmidecode.

Thanks.
Comment 25 Dionisus Torimens 2008-06-17 00:28:43 UTC
Created attachment 16521 [details]
dmidecode (new BIOS)
Comment 26 Dionisus Torimens 2008-06-17 01:05:05 UTC
for idle=nomwait dmesg tells me
[    0.000000] Malformed early option 'idle'

for insmod processor.ko idle=nomwait it tells me
[  261.132372] processor: `nomwait' invalid for parameter `idle'

As modinfo processor.ko says
parm:           idle:Disable the mwait for CPU idle (int)
I tried:
insmod processor.ko idle=1

This seems to fix the problem for me! Great!!! Thanks!
The CPU now spends most time in the C3 state (which was not there before).
I will test this for a while now.
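
Since a single letter decides whether the option is recognised, it can help to check the command line mechanically. The following is a userspace sketch in C and not the kernel's own parser; the helper names find_idle_option and idle_is_nomwait are hypothetical. It extracts the value of a standalone idle= option and compares it exactly against "nomwait".

```c
#include <string.h>
#include <stddef.h>

/* Copy the value of the first "idle=" option on a space-separated
 * command line into out (at most outlen-1 characters). Only options
 * that start with "idle=" count, so "processor.idle=1" is ignored.
 * Returns 1 if the option was found, 0 otherwise. */
int find_idle_option(const char *cmdline, char *out, size_t outlen)
{
    const char *p = cmdline;
    const char *key = "idle=";
    size_t keylen = strlen(key);

    if (outlen == 0)
        return 0;

    while (*p) {
        /* skip separators between options */
        while (*p == ' ')
            p++;
        if (strncmp(p, key, keylen) == 0) {
            size_t n = 0;
            p += keylen;
            while (p[n] && p[n] != ' ' && n + 1 < outlen) {
                out[n] = p[n];
                n++;
            }
            out[n] = '\0';
            return 1;
        }
        /* skip to the end of this option */
        while (*p && *p != ' ')
            p++;
    }
    return 0;
}

/* Returns 1 only for an exact, case-sensitive match of "nomwait". */
int idle_is_nomwait(const char *value)
{
    return strcmp(value, "nomwait") == 0;
}
```

A caller would typically read the contents of /proc/cmdline into a buffer and pass it to find_idle_option; a value such as "nowait" is found but rejected by idle_is_nomwait.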
Comment 27 Dionisus Torimens 2008-06-17 01:42:33 UTC
Created attachment 16523 [details]
dmesg with idlepatch

This is the dmesg with the above patch applied and processor module loaded manually with idle=1
Comment 28 ykzhao 2008-06-17 02:28:26 UTC
Thanks for the test.
From the test it seems that powertop shows good results after mwait is disabled for CPU C-states on your laptop. 

I am sorry that I gave you the wrong patch. The updated processor_idle.patch uses the boot parameter "idle=nomwait" instead of a processor module parameter. (The old processor_idle.patch also works, but it uses the module parameter "processor.idle=1".)

The updated patch can be found in :
   http://bugzilla.kernel.org/show_bug.cgi?id=10807#c23
And I will add your laptop into DMI check table.

Thanks.
Comment 29 Dionisus Torimens 2008-06-17 07:39:06 UTC
No problem. When do you think this will go into mainline?

Should I close the bug then? I've not seen the problem anymore after using your patch.
Comment 30 ykzhao 2008-06-17 17:59:17 UTC
Hi, Dionisus
    The patch set has now been sent to the ACPI mailing list. It will take some time for it to be merged into the upstream kernel.
    Thanks.
    
Comment 31 ykzhao 2008-06-17 19:34:54 UTC
Hi, Dionisus
    Will you please attach the output of "lspci -vxxx" before suspend and after resume? 
    Thanks.
Comment 32 Dionisus Torimens 2008-06-17 21:45:41 UTC
Created attachment 16533 [details]
lspci -vxxxx before suspend bios v. 1.34

I wish I could give you the lspci after suspend, but at the moment my computer doesn't wake up anymore at all. No more resume under 2.6.25.6, 2.6.26rc6, or Windows XP SP3. I tried flashing the old BIOS again, but it did not improve much (XP resumes with a Blue Screen of Death, KERNEL_DATA_INPAGE_ERROR).

If you have any idea what I can do let me know. Thanks.
Comment 33 Dionisus Torimens 2008-06-17 21:46:04 UTC
Created attachment 16534 [details]
lspci -vxxxx before suspend bios v. 1.31
Comment 34 Dionisus Torimens 2008-06-18 06:49:55 UTC
The nomwait patch also works for me.
Comment 35 Dionisus Torimens 2008-06-18 18:58:09 UTC
I was wondering...:

Couldn't this problem be fixed with a generic patch of some kind? At least there should never be 20.000 + wakes per second, right?

Or is this needed so the bugs causing the high wakes can be found? Maybe a printk would be better suited?
Comment 36 ykzhao 2008-06-18 19:15:45 UTC
Hi, Dionisus
    The workaround patch works for you, but the root cause has not been found. 
    In the problem description, the interesting point is that on your system the problem *stops* after suspending and resuming from S3 and reappears on the next reboot. Maybe the problem is related to the BIOS configuration.
    But according to comment #32 your laptop can no longer resume from S3, so we can't compare the state before suspend and after resume.
    It would be great if you could provide the output of lspci -vxxx before suspend and after resume. Maybe this will help to find the root cause.
    Thanks.
Comment 37 Dionisus Torimens 2008-06-19 03:53:17 UTC
Hi ykzhao,

I fixed my problem with suspend in 2.6.25. I will now post the lspcis.
Comment 38 Dionisus Torimens 2008-06-19 03:56:24 UTC
Created attachment 16544 [details]
lspci -vxxxx before suspend
Comment 39 Dionisus Torimens 2008-06-19 04:01:12 UTC
Created attachment 16545 [details]
lspci -vxxxx after suspend
Comment 40 Dionisus Torimens 2008-06-19 07:08:59 UTC
By the way: idle=halt does not do the same for me as processor.idle=1. The former does not activate the C3 state and the CPU stays for shorter amounts in it.
Comment 41 ykzhao 2008-06-20 07:14:25 UTC
Hi, Dionisus
    You are quite right. The boot option idle=halt is completely different from the module parameter processor.idle=1 from comment #24.
When idle=halt is used, halt is used for CPU idle and no CPU C-states are used.
    Maybe that is not very reasonable. I will rework this and try to refresh the patch.
    Thanks.
Comment 42 Dionisus Torimens 2008-06-20 19:49:29 UTC
The patch works very well for me now. Did the new lspci help you? 

Please let me know if I can do something else to help.
Comment 43 Dionisus Torimens 2008-06-20 20:03:34 UTC
Btw. the patch also works well in 2.6.25.7 for me. (applied for the _32 and _64 version of process.c)
Comment 44 Dionisus Torimens 2008-06-20 22:04:20 UTC
I mistakenly changed the status myself. According to the docs this should be done by QA if appropriate.
Comment 45 ykzhao 2008-06-22 19:59:27 UTC
Hi, Dionisus
    Thanks for the lspci -vxxx output from before and after suspend. So far I can't find the root cause of the 40000+ wakeups per second. Anyway, please assign the bug to me and I will try to track down this issue.
   Thanks.
    
Comment 46 Dionisus Torimens 2008-06-23 19:45:38 UTC
Hi ykzhao,

I'm sorry but I don't know how to assign it to you.
Actually the patch helps much better than going into standby and resuming.
If I can help in any way please let me know.

Thanks.
Comment 47 Dionisus Torimens 2008-07-09 00:47:07 UTC
Please let me know if I can do anything else to help find the root cause.

Thank you.
Comment 48 Dionisus Torimens 2008-07-11 22:46:16 UTC
Does anyone have any idea how the cause of this problem could be found?
Comment 49 Adrian Bunk 2008-07-16 15:39:18 UTC
If I understand it correctly patches that just went into Linus' tree should fix it.

Can you confirm that 2.6.26-git5 (when it will be available) fixes it?
Comment 50 ykzhao 2008-07-16 23:29:18 UTC
The patch is already in linux-git tree and should fix the problem.
Now the remaining question is why the problem disappears after suspending
and resuming from S3. I am analyzing why there are more than 40000 wakeups per second when the system enters the C-states. Unfortunately I haven't found the root cause yet.
Comment 51 Dionisus Torimens 2008-07-17 02:45:45 UTC
Thanks. I will need some time to test. I will probably be unreachable for around two weeks and unable to test the new kernel until after that (no bandwidth for downloading the update).

If you need me to test the kernel before, please give me patches to apply (to 2.6.26).
Comment 52 Orivej Desh 2008-07-24 13:16:30 UTC
Same problem with Acer Extensa 4220.
# rmmod -f processor
# insmod processor.ko max_cstate=1
reduces WPS from 50000+ to only 50.
I'm using 2.6.26.
Comment 53 Dionisus Torimens 2008-08-07 03:11:27 UTC
Update:
In 2.6.27-rc2 it basically still happens. But now after start powertop shows only C1 and C3 and uses only C1 at first. A bit later it starts using C3 about 80% (C2 still nowhere to be found) and the wakes go up to ~50.000.

And I read there might be a similar problem on the Dell Vostro 1510 as well.
Comment 54 Dionisus Torimens 2008-08-07 03:13:45 UTC
Created attachment 17118 [details]
dmesg of 2.6.27-rc2
Comment 55 ykzhao 2008-08-14 05:12:54 UTC
Hi, Dionisus
    I added your laptop to the DMI check table in the following commit, which disables mwait for CPU C-states. 
   > commit 2a2a64714d9c40f7705c4de1e79a5b855c7211a9
   > Author: Zhao Yakui <yakui.zhao@intel.com>
   > Date:   Tue Jun 24 18:02:57 2008 +0800
   >     ACPI: Disable MWAIT via DMI on broken Compal board
   However, the dmesg in comment #54 does not show mwait being disabled for CPU C-states on your laptop. 
   I am sorry, this is my fault. The SYS_VENDOR string in the BIOS is "Acer", but the SYS_VENDOR in the patch is "ACER". Because the comparison in dmi_check_system is case sensitive, mwait is not disabled for CPU C-states. For now you can continue to use the boot option "idle=nomwait". 
    
   Will you please try the boot option "idle=nomwait" on the latest kernel (2.6.27-rc2) and see whether the problem still exists?
   Thanks.
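
The typo described above comes down to a case-sensitive comparison. The following is a minimal sketch in C, assuming the table entry is compared against the BIOS string with a plain case-sensitive substring search; vendor_matches is a hypothetical helper, not the kernel's function.

```c
#include <string.h>

/* Case-sensitive substring match between the string reported by the
 * BIOS and the string stored in a quirk table entry. Returns 1 on a
 * match, 0 otherwise. */
int vendor_matches(const char *bios_string, const char *table_entry)
{
    return strstr(bios_string, table_entry) != NULL;
}
```

Under that assumption, a BIOS string of "Acer" matches a table entry of "Acer" but not "ACER", so the quirk is silently skipped.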
Comment 56 ykzhao 2008-08-14 05:26:11 UTC
Hi, Georgij
    Will you please try the boot option of "idle=nomwait" on the 2.6.27-rc2 kernel and see whether the problem still exists?
    If the problem disappears, please attach the output of dmidecode.
    Thanks.
Comment 57 Dionisus Torimens 2008-08-14 15:52:16 UTC
Hi ykzhao,

I checked at 30a2f3c60a84092c8084dfe788b710f8d0768cd4 (-rc3) and if I boot with idle=nomwait, the cpu is in C3 most of the time. So the problem is fixed if I pass the parameter.
No problem, just a typo. Please let me know when I can test again. I've got Linus' current git tree available. Then I will post a new dmesg.
Comment 58 Dionisus Torimens 2008-08-14 16:34:36 UTC
Created attachment 17253 [details]
Patch to fix typo for current git HEAD.

I fixed the typo and tested it. It works. This is a patch for the current git head. I hope it's okay.
Comment 59 ykzhao 2008-08-14 19:30:24 UTC
Hi, Dionisus
   Thanks for your work. After the patch in comment #58 is applied to the 2.6.27-rc2 kernel, the system works well and the CPU is in the C3 state most of the time. 
   
   Of course there is another interesting issue in the problem description, for which I unfortunately can't find the root cause. Before suspend, powertop shows more than 40000 wakes per second. The problem *stops* after suspending and resuming from S3.
   
   Anyway, the system works well after mwait is disabled for CPU C-states. IMO this bug can be marked as resolved.
   
   Thanks for the test and work again.
Comment 60 Dionisus Torimens 2008-08-15 01:41:57 UTC
Hi ykzhao,

yes, after the patch in comment #58 the system can work well and the CPU is in C3 state most of the time.

If you have any idea what other tests I could make to find the root cause that the problem goes away after suspend and resume let me know.
Comment 61 Dionisus Torimens 2008-08-15 01:42:43 UTC
And you're very welcome of course, thank you for your help, your work and the patch!
Comment 62 Dionisus Torimens 2008-08-15 07:02:06 UTC
Hi ykzhao,

could you send the patch to Linus?

Thanks
Comment 63 Orivej Desh 2008-08-16 02:02:51 UTC
Created attachment 17273 [details]
acer 4220 dmidecode

#56
> Hi, Georgij
>    Will you please try the boot option of "idle=nomwait" on the 2.6.27-rc2
>kernel and see whether the problem still exists?
The problem doesn't appear with "idle=nomwait" on 2.6.27-rc3. It still appears without that parameter.
Here is my dmidecode.
Comment 64 Orivej Desh 2008-08-16 13:32:05 UTC
I can confirm that the problem also disappears after suspend to RAM and resume. S3 seems to be broken on 2.6.26, but it works on 2.6.27-rc3 (if you can call it "works" when the backlight is off on all VTs after resume, though fortunately it comes back on in Xorg; I am using xf86-video-intel with Xorg and vesa with VTs because someone didn't add intel-agp as a dependency for intelfb and I don't know how to do this the way it works).
I'm willing to help investigate why resuming after s2ram solves the problem, but maybe you shouldn't add "idle=nomwait" as a default quirk for my laptop until this investigation is done, unless I can use something like "idle=wait" to override it.
Comment 65 ykzhao 2008-08-17 19:14:59 UTC
Hi, Georgij
    Thanks for the test and help. 
    Do you mean that the same problem exists on the 5520 laptop, and that it also disappears after suspend/resume?
    If so, your laptop won't be added to the DMI check table until the root cause is found. Of course we will appreciate your help in investigating this issue.
    Thanks.
   
Comment 66 ykzhao 2008-08-17 19:30:26 UTC
Hi, Dionisus
    Thanks for the reminder. 
    As the patch in comment #58 is not very urgent, can we defer it for some time? Once we find out why s2ram also solves the issue, the Acer 4220 laptop can be added to the DMI check table as well, and the typo will be fixed at the same time. Until then you could use the boot option "idle=nomwait" on your laptop. Is that OK?
    Thanks.
Comment 67 Dionisus Torimens 2008-08-18 07:28:22 UTC
Hi Ykzhao,

I think maybe we should do both. I think we should better fix the problem now for most users. Then once we found the root cause, we can write a better fix.

But we should change the patch so debugging is also still possible without recompiling: We could add a parameter "forcemwait", like Georgij suggested in #64. If you like I can try to make such a patch. What do you think?

Thanks.
Comment 68 Dionisus Torimens 2008-08-19 06:43:18 UTC
I think we should override the dmi nomwait if idle=mwait is given as a kernel parameter. roughly like 
if (idle_mwait) idle_nomwait=0;

But I couldn't find the place where idle=mwait is processed.
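
The override proposed above can be sketched as a small decision function. This is only an illustration in C under stated assumptions: use_mwait and its three flags are hypothetical names, and the rule that an explicit override beats everything else is the proposal from this comment, not existing kernel behaviour.

```c
/* Decide whether MWAIT should be used for C-states.
 * dmi_nomwait:  1 if a DMI quirk asked for MWAIT to be disabled
 * user_nomwait: 1 if the user passed an option disabling MWAIT
 * user_force:   1 if the user passed a (hypothetical) override option
 * The explicit override wins over both the quirk and user_nomwait. */
int use_mwait(int dmi_nomwait, int user_nomwait, int user_force)
{
    int nomwait = dmi_nomwait || user_nomwait;

    if (user_force)
        nomwait = 0;
    return !nomwait;
}
```

With this rule a machine listed in the quirk table could still be booted with MWAIT enabled for debugging, without recompiling the kernel.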
Comment 69 ykzhao 2008-08-19 18:10:42 UTC
Hi, Dionisus
   Thanks for caring about this issue. In fact, when no option is given, the OS tries to use mwait for CPU C-states. Mwait is disabled for CPU C-states only when the laptop is listed in the DMI check table or mwait is disabled in the BIOS. 
   IMO it is inappropriate to add the boot option "idle=mwait". With such an option, one would assume that mwait is always used for CPU C-states. In fact, on some systems mwait is disabled in the BIOS (the BIOS returns C-states that do not use mwait), so the option would be misleading.
   
   Will you please send the patch in comment #58 to the linux-acpi mailing list if you don't want to keep adding the option "idle=nomwait"? (I ACK it.)

   Thanks.
   
   
Comment 70 Dionisus Torimens 2008-08-20 08:36:24 UTC
Hi ykzhao,

thanks for all your help. I agree. idle=mwait would not be a good name.
What do you think about idle=forcemwait?

I have submitted the patch.
Comment 71 Dionisus Torimens 2008-08-24 12:24:58 UTC
The patch made it into the current linux head and it works fine now. So with 2.6.27 the problem is (hot)fixed for the Acer 5220.
Comment 72 Dionisus Torimens 2008-08-29 23:41:04 UTC
Okay. Either the patch has not really made it yet or it was reversed. The patch is not in the current head.
Comment 73 ykzhao 2008-08-31 19:37:38 UTC
Hi, Dionisus
    Thanks for caring about this issue. The patch is now in the ACPI test git tree, but we have to wait for some time before it hits the upstream kernel.
    Thanks.
Comment 74 Dionisus Torimens 2008-09-03 01:37:01 UTC
Okay. This time it's really in the git tree. That means it's hotfixed in 2.6.27:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=3df8a905ed09341041a3d1c6309fdb18cc809297

Dear Ykzhao, are you still researching on the root cause of the bug or should we close this bug now?
Comment 75 ykzhao 2008-09-09 18:19:02 UTC
Hi, Dionisus
    Sorry, I am no longer spending time on identifying the root cause. After checking the datasheets of the ICH8 and the CPU, I still can't find it, so I am giving up.
    Very sorry. 
Comment 76 Dionisus Torimens 2008-09-16 06:02:39 UTC
Hi ykzhao,

No problem. I have only one idea left:

Could you write a patch that prints a warning when a CPU wakes up more than 10,000 times per second? This way more people could find out that they have the problem, and with their help it might be easier to find the cause(s).
At least they would know that their battery life would be affected by a bug.

e.g. "CPUX: Unusually high amount of processor wake-ups per second: XX.XXX. If you're not currently using high resolution timer software (e.g. multimedia players), this may well be a bug in a part of the kernel."

or something like: "CPU0 wakes over 10.000 times per second. This might be a bug, see: http://www.explanation.com."

Of course I'm not sure how many times are normal and when they're usually high, except when I use mplayer. But I think I've never got over 1.000 without the bug, even when using e.g. mplayer.

What do you think?
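
The proposed warning could look roughly like this. It is a userspace sketch in C, not a kernel patch: the threshold of 10,000 comes from the suggestion above, while format_wakeup_warning, WAKEUP_WARN_THRESHOLD and the exact message text are hypothetical.

```c
#include <stdio.h>

#define WAKEUP_WARN_THRESHOLD 10000UL  /* assumed limit, from the proposal */

/* Write a warning into buf if wakeups_per_sec exceeds the threshold.
 * Returns 1 if a warning was written, 0 otherwise. */
int format_wakeup_warning(char *buf, size_t len, unsigned int cpu,
                          unsigned long wakeups_per_sec)
{
    if (wakeups_per_sec <= WAKEUP_WARN_THRESHOLD)
        return 0;
    snprintf(buf, len,
             "CPU%u: unusually high number of wake-ups per second: %lu",
             cpu, wakeups_per_sec);
    return 1;
}
```

Where the threshold should really sit is an open question, as noted above; the value here only mirrors the number mentioned in this comment.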
Comment 77 Len Brown 2008-10-16 22:06:39 UTC
> CPU: Intel(R) Celeron(R) CPU          530  @ 1.73GHz stepping 01

please paste the output from
$ cat /proc/cpuinfo
Comment 78 Dionisus Torimens 2008-10-17 07:13:05 UTC
Created attachment 18349 [details]
cat /proc/cpuinfo
Comment 79 Len Brown 2008-10-17 10:32:18 UTC
thanks for the cpuinfo

Looks like you've got one of these two:
http://processorfinder.intel.com/details.aspx?sSpec=SLA2G
http://processorfinder.intel.com/details.aspx?sSpec=SL9VA
which are both:
http://support.intel.com/support/processors/mobile/celeron/sb/CS-023760.htm

And the data sheet does say that this processor
does actually have MWAIT for C2 and C3.

I don't see any errata related to MWAIT waking up prematurely,
but this issue could be something related to our MONITOR address
triggering when we don't expect it to...
Comment 80 Dionisus Torimens 2008-10-17 11:05:44 UTC
You're very welcome, if there's *anything* else I can do to help, please let me know. 

(In case someone really wants to hack this: These notebooks are quite cheap(but good), sold for 330-400€, e.g. on Amazon.)
Comment 81 Len Brown 2008-10-17 14:19:33 UTC
Created attachment 18357 [details]
test patch vs linux-2.6.27

please apply this patch to linux-2.6.27
and boot with "idle=clflush"
dmesg should show an additional line:
"Enabling CLFLUSH on MWAIT"

Please report if this has any effect on the large number
of wakeups/sec when you're using MWAIT in C2 or C3.
Comment 82 Len Brown 2008-10-17 14:26:33 UTC
oh, to perform the test in comment #81 using 2.6.27,
you'll need to disable the DMI workaround.

you can do that by commenting out this call in
acpi_processor_init():

        dmi_check_system(processor_idle_dmi_table);

or otherwise disabling set_no_mwait() from running on your system.
Comment 83 Dionisus Torimens 2008-10-18 07:09:21 UTC
The patch in #81 with workaround 1 from #82 also works. The C3 state is available and used, just like with 2.6.27-vanilla as far as I can see.

I will keep using it for now to check for possible side-effects, right?
Comment 84 Dionisus Torimens 2008-10-18 08:02:19 UTC
I didn't really deactivate the workaround, I'll have to test it again...
Comment 85 Dionisus Torimens 2008-10-19 01:45:31 UTC
It looks like the bug is not reproducible in 2.6.27. No more 40.000 wake/s even if I disable the workaround. It just stays in C1 while idle:
Cn                Avg residency       P-states (frequencies)
C0 (cpu running)        (10.7%) <- used to be here 99% on idle
C0                0.0ms ( 0.0%)
C1                8.1ms (89.0%) <- used to be here a bit when *busy*
C3                0.0ms ( 0.3%) <- the same, no C2, C3 hardly used.
I got 12.000 wakes once, but it was caused by "ethstatus". So I can't really test if clflush helps the problem. If I enable clflush I can no longer see what C-states are used, so there's nothing I could test.

But hey, that it's fixed is a good thing, right?

Does that help? Is there anything else I can test?
Comment 86 Len Brown 2008-10-27 19:28:39 UTC
hmmm, so the workaround didn't work because the problem
had already vanished in linux-2.6.27.
Yes, that is both good news and bad.
Good for 2.6.27 users.
Bad for 2.6.26 users...

Any chance you can git-bisect to see what fixed this in .27
so we can perhaps back-port that for the benefit of .26 customers?
Comment 87 Dionisus Torimens 2008-10-28 02:09:38 UTC
(In reply to comment #86)
> hmmm, so the workaround didn't work because the problem
> had already vanished in linux-2.6.27.

Yes. The one problem (high wakes) had disappeared, while the other (CPU does not use C2-C3) did not. I thought you might also be trying to find a fix for the second problem.

> Any chance you can git-bisect to see what fixed this in .27
> so we can perhaps back-port that for the benefit of .26 customers?

It will probably take a while until I've got time to fix the issue. But I'm also curious.
Comment 88 ykzhao 2008-12-23 21:19:10 UTC
Hi, Dionisus
    How about the result of git-bisect?
    From the description in comment #85 it seems that C1 is entered instead of C3 most of the time when the CPU is idle, although there are fewer wakeups per second.
    Will you please use git-bisect to identify which commit brings this issue? (Of course the workaround about the dmi check should be removed in your test.)
    thanks.

    
    
Comment 89 Dionisus Torimens 2008-12-26 10:48:41 UTC
git bisect complains because the good version is newer than the bad version. I will try exchanging good and bad:
git bisect start 2.6.27-rc5 2.6.27-rc1 does not work
git bisect start 2.6.27-rc1 2.6.27-rc5 does work

It turns out the one (high wakes) bug was fixed somewhere between 2.6.27-rc1 and 2.6.27-rc5. I will let you know when I find out more.
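
Swapping the labels works because bisection only needs a single transition point in an ordered history; it does not care whether that point introduces or removes the bug. The following sketch in C shows the idea on an array instead of a git history; first_fixed_revision is a hypothetical name, and the single-transition precondition is an assumption.

```c
/* Binary search for the first revision where the bug is gone.
 * has_bug[i] is 1 if revision i shows the bug, 0 if not. Revisions are
 * ordered oldest to newest, has_bug[0] must be 1, has_bug[n-1] must be
 * 0, and there must be a single transition. Returns the index of the
 * first fixed revision. */
int first_fixed_revision(const int *has_bug, int n)
{
    int lo = 0;       /* newest revision known to have the bug */
    int hi = n - 1;   /* oldest revision known to be fixed */

    while (hi - lo > 1) {
        int mid = lo + (hi - lo) / 2;
        if (has_bug[mid])
            lo = mid;
        else
            hi = mid;
    }
    return hi;
}
```

In git terms, "fixed" plays the role of "bad" and "still broken" plays the role of "good" during such a search.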
Comment 90 Dennis Jansen 2008-12-26 13:02:40 UTC
Okay. The bad guy is 
commit 320eee...
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=320eee776357db52d6fcfb11cff985b1976a4595

please add me as bisected-by/tested-by Dennis Jansen <Dennis.Jansen@web.de>
Comment 91 Dennis Jansen 2008-12-26 13:03:49 UTC
ps. The C2 mode still doesn't work of course.
Comment 92 Dennis Jansen 2008-12-26 13:11:21 UTC
pps. ykzhao's patch fixed both problems for me: the high wakes were gone and the CPU could use all C-modes and did use the right one. 

commit 320eee "just" fixed (only) the high wakes problem.
Comment 93 Zhang Rui 2008-12-28 17:15:09 UTC
cc venki
Comment 94 Dionisus Torimens 2008-12-29 14:38:03 UTC
Patch for the other problem available too, now.
Comment 95 Len Brown 2009-02-02 20:04:08 UTC
venki,
should we send 320eee776357db52d6fcfb11cff985b1976a4595
to 2.6.26.stable?
Comment 96 Venkatesh Pallipadi 2009-02-04 14:31:06 UTC
Yes. 320eee OK to go to stable.
Comment 97 Len Brown 2009-02-21 09:03:28 UTC
Created attachment 20321 [details]
2.6.27's 320eee refreshed to apply to 2.6.26.stable

It is sort of late for 2.6.26.y now,
but I'll e-mail this patch to stable@kernel.org
just to close the books on this one.