Bug 11896

Summary: [2.6.28-rc2] EeePC ACPI errors & exceptions
Product: ACPI Reporter: Rafael J. Wysocki (rjw)
Component: ECAssignee: Alexey Starikovskiy (astarikovskiy)
Status: CLOSED CODE_FIX    
Severity: normal CC: acpi-bugzilla, astarikovskiy, james, maciej.rutecki, makalsky, yakui.zhao
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.28-rc2 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 11808    
Attachments: [Patch 1/4]: EC transaction is explictly divided into several phases
[Patch 2/4]: Add the 2us delay before starting check EC status while in polling mode
[Patch 3/4]: Add some delay on some broken BIOS to workaround EC interrupt storm
[patch 4/4]: EC cleanup:Delete the unused function and data declaration
use spinlock for enable/disable GPE
acpidump from hp/compaq nx6310
dmesg during boot (kernel 2.6.28-rc3 + patch from comment #16)
dmesg after boot (kernel 2.6.28-rc3 + patch from comment #16)
Decrease after-command timeout
Disable mode degradation
Dmesg from 2.6.26-rc3 +#11917 +#16 +#31
dmesg from 2.6.26-rc3 +#11917 +#16
Fragments of syslog
Try to restart failed read
kernel.log from latest git +#11917 (comment #3) +#11896 (comment #16) +#11896 (comment #43)
wait for last gpe
restart failed transaction
latest git +#11917 (comment #3) +#11896 (comment #16) +#11896 (comment #51)
latest git +#11917 (comment #3) +#11896 (comment #16) +#11896 (comment #52)
dmesg from 2.6.28-rc5 (EeePC 901)

Description Rafael J. Wysocki 2008-10-29 16:34:57 UTC
Subject    : [2.6.28-rc2] EeePC ACPI errors & exceptions
Submitter  : Darren Salt <linux@youmustbejoking.demon.co.uk>
Date       : 2008-10-27 22:52
References : http://marc.info/?l=linux-kernel&m=122514911328761&w=4
Handled-By : Alexey Starikovskiy <aystarik@gmail.com>
Handled-By : Zhao Yakui <yakui.zhao@intel.com>

This entry is being used for tracking a regression from 2.6.27.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 ykzhao 2008-10-30 01:56:42 UTC
Created attachment 18511 [details]
[Patch 1/4]: EC transaction is explictly divided into several phases
Comment 2 ykzhao 2008-10-30 01:57:56 UTC
Created attachment 18512 [details]
[Patch 2/4]: Add the 2us delay before starting check EC status while in polling mode
Comment 3 ykzhao 2008-10-30 01:58:42 UTC
Created attachment 18513 [details]
[Patch 3/4]: Add some delay on some broken BIOS to workaround EC interrupt storm
Comment 4 ykzhao 2008-10-30 02:02:24 UTC
Created attachment 18514 [details]
[patch 4/4]:  EC cleanup:Delete the unused function and data declaration
Comment 5 ykzhao 2008-10-30 02:04:12 UTC
Hi, 
   Can someone try the patch set on the latest kernel(2.6.28-rc2) and see whether the isse still exists?
   thanks.
Comment 6 Alexey Starikovskiy 2008-10-30 15:29:27 UTC
Created attachment 18530 [details]
use spinlock for enable/disable GPE

Darren,
Could you please check if this patch changes anything?
Comment 7 Alexey Starikovskiy 2008-10-31 08:56:49 UTC

*** This bug has been marked as a duplicate of bug 11917 ***
Comment 8 ykzhao 2008-11-02 18:07:36 UTC
Why is the bug marked as the duplicate of bug 11917?
   IMO this is a different bug with bug 11917.
   In the bug 11917 the problem is that Asus Eee PC hotkeys stop working after prolonged usage.
   But in this bug the problem is that EeePC ACPI errors & exceptions.
   
   So I will reopen this bug.
   thanks.
Comment 9 ykzhao 2008-11-02 18:53:18 UTC
From the http://marc.info/?l=linux-kernel&m=122519667129134&w=2 we can know that the warning message is gone and ACPI function also works well.

At the same time there exists the another similar warning message on the box owned by Macijie. 
   >http://marc.info/?l=linux-acpi&m=122513901711396&w=2

   From the http://marc.info/?l=linux-acpi&m=122522314522351&w=2 we can know that the problem is resolved by the patch.
   Thanks.
Comment 10 ykzhao 2008-11-03 00:53:08 UTC
After the patch set is applied, the asus EEEPC 901 and HP box can work well.
So the bug will be marked as resovled.
Comment 11 ykzhao 2008-11-03 06:36:07 UTC
As Alexey is the maintainer of EC driver, it will be better after someone can try the patch in comment #6 and see whether the problem still exists.
So the bug is reopened.
Thanks.
Comment 12 Maciej Rutecki 2008-11-03 11:01:36 UTC
I test patches from comment #1-4 and #6 and work fine on HP/Compaq nx6310.

Details: http://marc.info/?l=linux-acpi&m=122513901711396&w=2

Thanks for help
Comment 13 Alexey Starikovskiy 2008-11-03 11:11:27 UTC
Maciej,
Is it "and" or "or"?
Comment 14 Maciej Rutecki 2008-11-03 11:41:32 UTC
I was wrong: patch from comment #6 doesn't work:

CE: hpet increasing min_delta_ns to 15000 nsec
ACPI: EC: missing confirmations, switch off interrupt mode.
ACPI Exception (evregion-0419): AE_TIME, Returned by Handler for [EmbeddedControl] [20080926]
ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.C31E] (Node f7019858), AE_TIME
ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.TZ3_._TMP] (Node f7019ed0), AE_TIME

I'm going to test four patches from #1-4 again to be sure. 

Alexey: "or" :)
Comment 15 Alexey Starikovskiy 2008-11-04 00:33:28 UTC
This is bug is reported to be fixed by msleep revert patch.
Comment 16 Alexey Starikovskiy 2008-11-04 00:38:55 UTC
Reference: http://marc.info/?l=linux-acpi&m=122524653025029&w=2
Comment 17 ykzhao 2008-11-04 01:16:18 UTC
The issue can be avoided by using udelay. 
    But after the patch is applied, the performance will be affected. While doing the EC transaction, the CPU will do nothing but loop. IMO this is not very reasonable.
   
    In fact about the problem of AE_TIME I raise the above issue after you sent your patch.
    
    At the same time I raise another issue about this patch in the mail list. If preempt schedule happens in udelay, maybe the same issue will happen.
    Why left so many unstable factor?
    Thanks.
    
Comment 18 Alexey Starikovskiy 2008-11-04 01:28:46 UTC
Yakui,
I may start considering your arguments if you come with real data. Until now all of your projections were wrong, so you don't have credit any longer.
I also will not consider reverting "fast transaction", until there is a proof that it breaks something, so please don't attach your 4 patch series to every bug
in ACPI/EC.
Once again, if you don't have knowledge to do kernel programming, learn first and don't waste other people time.
Comment 19 ykzhao 2008-11-04 01:48:11 UTC
    I am not prepared to revert any patch from you.  What I cared is to make EC work more stably. In fact I point out what will be affected by your patch. 
    IMO when you disagree with other patch, please give some detailed reason or say which box will be broken by the patch.
     
    It is difficult to reproduce the issue about AE_TIME after the patch mentioned in comment #16 is applied. But we can't say that it is impossible to happen. 
    In fact I raise the issue of AE_TIME about your patch "fast transaction". But you ignore it and told me to put the real data. Now the problem appears.
    
    IMO the better patch is that we should get the root cause and the possible potential fault should be considered. It is not better that something is reverted when a problem happens. And it is reverted again when another problem happens. ( As I know, msleep is reverted at least twice in EC driver).
    Thanks.
    
    
Comment 20 Alexey Starikovskiy 2008-11-04 02:14:29 UTC
(In reply to comment #19)
>     I am not prepared to revert any patch from you.
Then why your patch series, which reverts it appears in every bug report and in linux-acpi mail list?
>  What I cared is to make EC work more stably.
You don't have a clue how to do it apparently.
> In fact I point out what will be affected by your patch. 
You don't.

> IMO when you disagree with other patch, please give some detailed reason or
> say which box will be broken by the patch.
With the "fast transaction" patch ASUS eeePC is reported to not trigger interrupt storm detection and does only 2-3 faux interrupts per transaction, compared to 20-30 without the fast transaction. Thus, by reverting fast transaction, you force EC driver to poll mode and degraded performance. Plus, to "limit" the interrupt count, you instead of disabling interrupt source try to
make interrupt servicing longer, thus disabling other interrupts and prohibiting scheduling to other more important tasks.
And please don't say that you read this for the first time -- all these messages are on the mailing list, you just don't read them. 
> 
>     It is difficult to reproduce the issue about AE_TIME after the patch
> mentioned in comment #16 is applied. But we can't say that it is impossible
> to
> happen. 
>     In fact I raise the issue of AE_TIME about your patch "fast transaction".
> But you ignore it and told me to put the real data. Now the problem appears.
msleep revert patch is on my queue since then (and I sent it to you more than a month ago), now then we have reason to apply it, thus it is sent to Len.
>     IMO the better patch is that we should get the root cause and the
>     possible
> potential fault should be considered.
Why did you not come up with root cause yet? I asked you to bring one and you replied with "fast transaction" revert series.

> It is not better that something is reverted when a problem happens. And it is
> > reverted again when another problem happens.
You are amazing... How the hell your 4-patch series did appear? 
> ( As I know, msleep is reverted at least twice in EC driver).
Use of msleep looks to be more promising power-vise until proved to either hurt performance or break things.
Now we know that it breaks things, so it will stay out of EC driver.
>     Thanks.
No, thank you for wasting my time and keeping me busy.
Comment 21 ykzhao 2008-11-04 08:15:12 UTC
a. AE_TIME issue that happens on Asus EEEPC & HP box
   I raise the AE_TIME issue about your patch. In fact I repeated my viewpoint several times about this issue.Why AE_TIME happens on Asus EEEPC & HP box is consistent with my analysis.
   The discussion can be found in :
   >http://marc.info/?l=linux-acpi&m=122103582406176&w=2

b. msleep is replaced by udelay(in comment #14)
   When there is no EC interrupt confirm in EC transaction or EC GPE storm is detected, it will be switched to polling mode.In such case the EC transaction will be done in the function of ec_poll. In the following EC transactions the CPU will do nothing but loop while waiting for the EC status. In such case the performance will be affected. At the same time the CPU can't enter idle state, which causes that more power is used.
    In fact the udelay is replaced by the msleep in the following commit:
     >commit 1b7fc5aae8867046f8d3d45808309d5b7f2e036a
     >Author: Alexey Starikovskiy <astarikovskiy@suse.de>
     >Date:   Fri Jun 6 11:49:33 2008 -0400
       >ACPI: EC: Use msleep instead of udelay while waiting for event.
       >http://bugzilla.kernel.org/show_bug.cgi?id=10724
    
    In the above commit there is no explanation why udelay is replaced msleep.
    Now when the problem appears, the commit is reverted again before getting the root cause. 
    To let more people be involved in the bug fix about EC, IMO it is very important to add the detailed explanation. Only when there is no other way to solve the problem, it is reasonable to consider reverting the commit. 

c. Why say that I waste your time?
   I know that you can write the patch very quickly. But sometimes there are too many errors after you sent your patch to ACPI mailing list. 
    For example: 
      There exists the fatal sync issue in the patch you pasted in http://bugzilla.kernel.org/show_bug.cgi?id=11549#C1. 
      >http://marc.info/?l=linux-acpi&m=122164185131664&w=2
    If no one raises any issue about your patch, more laptops will be broken by your patch.
     
   Even although someone raise some issues about your patch, there still exists another issue. For example: 
     There exist a bunch of the following warning message in bug 11930:
     > ACPI: EC: non-query interrupt received, switching to interrupt mode
     > __ratelimit: 40 callbacks suppressed
   It is harmless but it is confusing and annoying.  

d. About the patch set from me
   1)The EC transaction is done in process context. Maybe it is slow but it will be more stable. At the same time compared with the 2.6.26 kernel it is not changed too lot.  
   2)Add 2us delay before checking EC status while in polling mode.
   3)Add the delay to work around the EC GPE storm. This only affects some broken laptops. It will also be OK to disable EC GPE while doing EC transaction after EC GPE storm is detected. But it is possible to lose the EC notification event while doing EC transaction. So to avoid losing the EC notification event the udelay is added to workaround the EC GPE storm. 
Comment 22 Alexey Starikovskiy 2008-11-04 08:38:10 UTC
(In reply to comment #21)
> d. About the patch set from me
If you really look for "root cause" you don't sweep 500+ lines of code and say -- problem was there.
Comment 23 Maciej Rutecki 2008-11-04 09:33:26 UTC
Patches from comment #1-4 seem work on my laptop (tested 14 hours)
Comment 24 Alexey Starikovskiy 2008-11-04 09:40:54 UTC
Maciej,
Please test patch referenced in #16. Patches from #1-4 are not going to be used, so sorry for waisting your time.
Comment 25 Maciej Rutecki 2008-11-04 11:34:39 UTC
Patch from #16 (http://marc.info/?l=linux-acpi&m=122514341319748&w=4):

ACPI: EC: missing confirmations, switch off interrupt mode.
ACPI Exception (evregion-0419): AE_TIME, Returned by Handler for [EmbeddedControl] [20080926]
ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.TZ4_._TMP] (Node f7019f18), AE_TIME
ACPI Exception (evregion-0419): AE_TIME, Returned by Handler for [EmbeddedControl] [20080926]
ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.C31E] (Node f7019858), AE_TIME
ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.TZ3_._TMP] (Node f7019ed0), AE_TIME
ACPI Exception (evregion-0419): AE_TIME, Returned by Handler for [EmbeddedControl] [20080926]
ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.C31E] (Node f7019858), AE_TIME
ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.TZ3_._TMP] (Node f7019ed0), AE_TIME
ACPI Exception (evregion-0419): AE_TIME, Returned by Handler for [EmbeddedControl] [20080926]
ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.C31E] (Node f7019858), AE_TIME
ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.TZ3_._TMP] (Node f7019ed0), AE_TIME
ACPI Exception (evregion-0419): AE_TIME, Returned by Handler for [EmbeddedControl] [20080926]
ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.C31E] (Node f7019858), AE_TIME
ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.TZ3_._TMP] (Node f7019ed0), AE_TIME
ACPI Exception (evregion-0419): AE_TIME, Returned by Handler for [EmbeddedControl] [20080926]
ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.C31E] (Node f7019858), AE_TIME
ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.TZ3_._TMP] (Node f7019ed0), AE_TIME
Comment 26 Alexey Starikovskiy 2008-11-04 12:13:40 UTC
Maciej,
Could you please make attach full dmesg with time in printk? Could you also enable DEBUG in drivers/acpi/ec.c?
Comment 27 Alexey Starikovskiy 2008-11-04 12:14:50 UTC
please also run an acpidump and attach it here...
Comment 28 Maciej Rutecki 2008-11-04 13:52:57 UTC
Created attachment 18653 [details]
acpidump from hp/compaq nx6310
Comment 29 Maciej Rutecki 2008-11-04 13:54:47 UTC
Created attachment 18654 [details]
dmesg during boot (kernel 2.6.28-rc3 + patch from comment #16)
Comment 30 Maciej Rutecki 2008-11-04 13:55:37 UTC
Created attachment 18655 [details]
dmesg after boot (kernel 2.6.28-rc3 + patch from comment #16)
Comment 31 Alexey Starikovskiy 2008-11-04 14:18:00 UTC
Created attachment 18656 [details]
Decrease after-command timeout

please apply on top
Comment 32 Alexey Starikovskiy 2008-11-04 14:19:42 UTC
Created attachment 18657 [details]
Disable mode degradation

Please attach dmesg with and without this patch on top.
Thanks!
Comment 33 Maciej Rutecki 2008-11-04 14:29:26 UTC
(In reply to comment #32)
> Created an attachment (id=18657) [details]
> Disable mode degradation
> 
> Please attach dmesg with and without this patch on top.
> Thanks!
> 

On clean 2.6.28-rc3 or -rc3 + patches from #16 and #31?

Patch from #16 + #31 I will test tommorow, it's getting late (in Poland) :-)

Regards
Comment 34 Alexey Starikovskiy 2008-11-04 14:36:27 UTC
on top of #16.
it is _very_ late in Moscow too :)
Comment 35 Alexey Starikovskiy 2008-11-04 14:39:27 UTC
oh, and patch from #11917 is still required too.
Comment 36 Maciej Rutecki 2008-11-04 22:29:40 UTC
Created attachment 18684 [details]
Dmesg from 2.6.26-rc3 +#11917 +#16  +#31

Problem still exists
Comment 37 Maciej Rutecki 2008-11-04 23:07:56 UTC
Created attachment 18685 [details]
dmesg from 2.6.26-rc3 +#11917 +#16

maciek@gumis:~$ dmesg| grep Error
[  740.413357] ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.C31E] (Node f7019858), AE_TIME
[  740.413422] ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.TZ3_._TMP] (Node f7019ed0), AE_TIME
Comment 38 Alexey Starikovskiy 2008-11-04 23:26:40 UTC
Is it always C31E to fail?
Comment 39 Maciej Rutecki 2008-11-04 23:48:41 UTC
I'm not 100% sure, but I think: yes.

Always see this:
\_TZ_.C31E
\_TZ_.TZ3_._TMP

less often also:
\_TZ_.TZ4_._TMP

Already I test 2.6.28-rc3 +#22917 +#1 +#32 and seems be OK, but I want wait few hours to be sure.
Comment 40 Alexey Starikovskiy 2008-11-05 00:35:18 UTC
Maciej,
Could you please check if any of /var/log/kernel.log or syslog have the point where interrupt mode is switched off and beginning of failing read?
Comment 41 Maciej Rutecki 2008-11-05 08:00:12 UTC
Created attachment 18691 [details]
Fragments of syslog

I attach fragments of syslog. Full syslog has 75MB.


2.6.28-rc3 +#22917 +#1 +#32 seems be ok
Comment 42 Alexey Starikovskiy 2008-11-05 09:03:20 UTC
Thanks,
I hope you meant 2.6.28-rc3 +#11917 +#16 +#32 ?
From the fragments it looks like read command was never completed, while interrupt came and status was cleared instead of being set to OBF = 1. 
Comment 43 Alexey Starikovskiy 2008-11-05 09:26:05 UTC
Created attachment 18693 [details]
Try to restart failed read

Maciej,
Please check if this patch is able to restart failed read command...
Comment 44 Maciej Rutecki 2008-11-05 09:37:24 UTC
(In reply to comment #42)
> Thanks,
> I hope you meant 2.6.28-rc3 +#11917 +#16 +#32 ?

Yes #11917, sorry for mistake.
Comment 45 Maciej Rutecki 2008-11-05 09:39:07 UTC
(In reply to comment #43)
> Created an attachment (id=18693) [details]
> Try to restart failed read
> 
> Maciej,
> Please check if this patch is able to restart failed read command...
> 

To be sure: which kernel and patches also use?
Comment 46 Alexey Starikovskiy 2008-11-05 09:44:33 UTC
latest kernel git + patches from #11917 and c16
Comment 47 Maciej Rutecki 2008-11-05 12:10:01 UTC
Created attachment 18694 [details]
kernel.log from latest git +#11917 (comment #3) +#11896 (comment #16) +#11896 (comment #43)
Comment 48 Alexey Starikovskiy 2008-11-05 12:40:17 UTC
Looks like restart works... 
Why the temperature update happens so often? Do you poll it in tight loop?
Comment 49 Maciej Rutecki 2008-11-05 12:51:18 UTC
(In reply to comment #48)
> Looks like restart works... 
> Why the temperature update happens so often? Do you poll it in tight loop?
> 

It maybe can caused by software from userland (Sorry if I said something stupid, I'm not Linux programmer :))? I use kima [1] to read temperature form coretemp and thermal zone.  

[1] http://kde-apps.org/content/show.php/Kima+-+kicker+monitoring+applet?content=33257 - see comments: "Sorry, I should have mentioned that. I am using kima 0.7.4. Nevertheless, the number of additional interrupts is about 80-100/sec."
Comment 50 ykzhao 2008-11-06 01:13:59 UTC
Hi, Alexey 
    Can you explain why the EC restart mechanism is introduced?
    From the patch it seems that the status (0x0) is also used to restart the EC transaction.
    But according to the Spec the status(0x0) is also reasonable. It indicates that the input buffer is empty.
    At the same time it seems so confusing that no one can understand what you want to do. 
    Thanks.
Comment 51 Alexey Starikovskiy 2008-11-09 12:39:03 UTC
Created attachment 18755 [details]
wait for last gpe

Maciej,
Could you please check if this patch helps? (instead of #11896(comment #43))
Comment 52 Alexey Starikovskiy 2008-11-09 12:41:21 UTC
Created attachment 18756 [details]
restart failed transaction

If patch before does not help, please check if this patch (same restart) on top of it still helps.
Comment 53 Maciej Rutecki 2008-11-11 09:33:10 UTC
Created attachment 18803 [details]
latest git +#11917 (comment #3) +#11896 (comment #16) +#11896 (comment #51)

Still the same:

Nov 11 18:15:56 gumis kernel: [  321.460892] ACPI Exception (evregion-0419): AE_TIME, Returned by Handler for [EmbeddedControl] [20080926]
Nov 11 18:15:56 gumis kernel: [  321.460916] ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.C31E] (Node f7019858), AE_TIME
Nov 11 18:15:56 gumis kernel: [  321.461041] ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.TZ3_._TMP] (Node f7019ed0), AE_TIME

I will test next patch
Comment 54 Maciej Rutecki 2008-11-11 12:17:30 UTC
Created attachment 18809 [details]
latest git +#11917 (comment #3) +#11896 (comment #16) +#11896 (comment #52)

Works OK. Thanks for help.

PS. Sorry for delay, I have short vacation :)
Comment 55 Len Brown 2008-11-11 21:59:35 UTC
patch in comment #52 is in acpi tree:

commit a2f93aeadf97e870ff385030633a73e21146815d
    ACPI: EC: restart failed command

patch referenced in comment #16 is also in acpi tree:

commit 1cfe62c8010ac56e1bd3827e30386a87cc2f3594
Author: Alexey Starikovskiy <astarikovskiy@suse.de>
    ACPI: EC: revert msleep patch

patch in http://bugzilla.kernel.org/show_bug.cgi?id=11917#c3
is also in the acpi tree:
commit d21cf3c16b1191f3154a51e0b20c82bf851cc553
Author: Alexey Starikovskiy <astarikovskiy@suse.de>
    ACPI EC: Fix regression due to use of uninitialized variable
Comment 56 Len Brown 2008-11-12 21:30:24 UTC
above shipped in 2.6.28-rc4-git3

closed.
Comment 57 Darren Salt 2008-11-16 12:34:42 UTC
Created attachment 18881 [details]
dmesg from 2.6.28-rc5 (EeePC 901)

Seems fine with 2.6.28-rc5 on my EeePC 901, apart from some GPE storm stuff
Comment 58 Maciej Rutecki 2008-11-16 13:47:58 UTC
Also confirm, that -rc5 solves problem.

Thanks