Bug 11896
Description
Rafael J. Wysocki
2008-10-29 16:34:57 UTC
Created attachment 18511 [details]
[Patch 1/4]: EC transaction is explictly divided into several phases
Created attachment 18512 [details]
[Patch 2/4]: Add the 2us delay before starting check EC status while in polling mode
Created attachment 18513 [details]
[Patch 3/4]: Add some delay on some broken BIOS to workaround EC interrupt storm
Created attachment 18514 [details]
[patch 4/4]: EC cleanup:Delete the unused function and data declaration
Hi, Can someone try the patch set on the latest kernel(2.6.28-rc2) and see whether the isse still exists? thanks. Created attachment 18530 [details]
use spinlock for enable/disable GPE
Darren,
Could you please check if this patch changes anything?
*** This bug has been marked as a duplicate of bug 11917 *** Why is the bug marked as the duplicate of bug 11917? IMO this is a different bug with bug 11917. In the bug 11917 the problem is that Asus Eee PC hotkeys stop working after prolonged usage. But in this bug the problem is that EeePC ACPI errors & exceptions. So I will reopen this bug. thanks. From the http://marc.info/?l=linux-kernel&m=122519667129134&w=2 we can know that the warning message is gone and ACPI function also works well. At the same time there exists the another similar warning message on the box owned by Macijie. >http://marc.info/?l=linux-acpi&m=122513901711396&w=2 From the http://marc.info/?l=linux-acpi&m=122522314522351&w=2 we can know that the problem is resolved by the patch. Thanks. After the patch set is applied, the asus EEEPC 901 and HP box can work well. So the bug will be marked as resovled. As Alexey is the maintainer of EC driver, it will be better after someone can try the patch in comment #6 and see whether the problem still exists. So the bug is reopened. Thanks. I test patches from comment #1-4 and #6 and work fine on HP/Compaq nx6310. Details: http://marc.info/?l=linux-acpi&m=122513901711396&w=2 Thanks for help Maciej, Is it "and" or "or"? I was wrong: patch from comment #6 doesn't work: CE: hpet increasing min_delta_ns to 15000 nsec ACPI: EC: missing confirmations, switch off interrupt mode. ACPI Exception (evregion-0419): AE_TIME, Returned by Handler for [EmbeddedControl] [20080926] ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.C31E] (Node f7019858), AE_TIME ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.TZ3_._TMP] (Node f7019ed0), AE_TIME I'm going to test four patches from #1-4 again to be sure. Alexey: "or" :) This is bug is reported to be fixed by msleep revert patch. The issue can be avoided by using udelay. But after the patch is applied, the performance will be affected. While doing the EC transaction, the CPU will do nothing but loop. IMO this is not very reasonable. In fact about the problem of AE_TIME I raise the above issue after you sent your patch. At the same time I raise another issue about this patch in the mail list. If preempt schedule happens in udelay, maybe the same issue will happen. Why left so many unstable factor? Thanks. Yakui, I may start considering your arguments if you come with real data. Until now all of your projections were wrong, so you don't have credit any longer. I also will not consider reverting "fast transaction", until there is a proof that it breaks something, so please don't attach your 4 patch series to every bug in ACPI/EC. Once again, if you don't have knowledge to do kernel programming, learn first and don't waste other people time. I am not prepared to revert any patch from you. What I cared is to make EC work more stably. In fact I point out what will be affected by your patch. IMO when you disagree with other patch, please give some detailed reason or say which box will be broken by the patch. It is difficult to reproduce the issue about AE_TIME after the patch mentioned in comment #16 is applied. But we can't say that it is impossible to happen. In fact I raise the issue of AE_TIME about your patch "fast transaction". But you ignore it and told me to put the real data. Now the problem appears. IMO the better patch is that we should get the root cause and the possible potential fault should be considered. It is not better that something is reverted when a problem happens. And it is reverted again when another problem happens. ( As I know, msleep is reverted at least twice in EC driver). Thanks. (In reply to comment #19) > I am not prepared to revert any patch from you. Then why your patch series, which reverts it appears in every bug report and in linux-acpi mail list? > What I cared is to make EC work more stably. You don't have a clue how to do it apparently. > In fact I point out what will be affected by your patch. You don't. > IMO when you disagree with other patch, please give some detailed reason or > say which box will be broken by the patch. With the "fast transaction" patch ASUS eeePC is reported to not trigger interrupt storm detection and does only 2-3 faux interrupts per transaction, compared to 20-30 without the fast transaction. Thus, by reverting fast transaction, you force EC driver to poll mode and degraded performance. Plus, to "limit" the interrupt count, you instead of disabling interrupt source try to make interrupt servicing longer, thus disabling other interrupts and prohibiting scheduling to other more important tasks. And please don't say that you read this for the first time -- all these messages are on the mailing list, you just don't read them. > > It is difficult to reproduce the issue about AE_TIME after the patch > mentioned in comment #16 is applied. But we can't say that it is impossible > to > happen. > In fact I raise the issue of AE_TIME about your patch "fast transaction". > But you ignore it and told me to put the real data. Now the problem appears. msleep revert patch is on my queue since then (and I sent it to you more than a month ago), now then we have reason to apply it, thus it is sent to Len. > IMO the better patch is that we should get the root cause and the > possible > potential fault should be considered. Why did you not come up with root cause yet? I asked you to bring one and you replied with "fast transaction" revert series. > It is not better that something is reverted when a problem happens. And it is > > reverted again when another problem happens. You are amazing... How the hell your 4-patch series did appear? > ( As I know, msleep is reverted at least twice in EC driver). Use of msleep looks to be more promising power-vise until proved to either hurt performance or break things. Now we know that it breaks things, so it will stay out of EC driver. > Thanks. No, thank you for wasting my time and keeping me busy. a. AE_TIME issue that happens on Asus EEEPC & HP box I raise the AE_TIME issue about your patch. In fact I repeated my viewpoint several times about this issue.Why AE_TIME happens on Asus EEEPC & HP box is consistent with my analysis. The discussion can be found in : >http://marc.info/?l=linux-acpi&m=122103582406176&w=2 b. msleep is replaced by udelay(in comment #14) When there is no EC interrupt confirm in EC transaction or EC GPE storm is detected, it will be switched to polling mode.In such case the EC transaction will be done in the function of ec_poll. In the following EC transactions the CPU will do nothing but loop while waiting for the EC status. In such case the performance will be affected. At the same time the CPU can't enter idle state, which causes that more power is used. In fact the udelay is replaced by the msleep in the following commit: >commit 1b7fc5aae8867046f8d3d45808309d5b7f2e036a >Author: Alexey Starikovskiy <astarikovskiy@suse.de> >Date: Fri Jun 6 11:49:33 2008 -0400 >ACPI: EC: Use msleep instead of udelay while waiting for event. >http://bugzilla.kernel.org/show_bug.cgi?id=10724 In the above commit there is no explanation why udelay is replaced msleep. Now when the problem appears, the commit is reverted again before getting the root cause. To let more people be involved in the bug fix about EC, IMO it is very important to add the detailed explanation. Only when there is no other way to solve the problem, it is reasonable to consider reverting the commit. c. Why say that I waste your time? I know that you can write the patch very quickly. But sometimes there are too many errors after you sent your patch to ACPI mailing list. For example: There exists the fatal sync issue in the patch you pasted in http://bugzilla.kernel.org/show_bug.cgi?id=11549#C1. >http://marc.info/?l=linux-acpi&m=122164185131664&w=2 If no one raises any issue about your patch, more laptops will be broken by your patch. Even although someone raise some issues about your patch, there still exists another issue. For example: There exist a bunch of the following warning message in bug 11930: > ACPI: EC: non-query interrupt received, switching to interrupt mode > __ratelimit: 40 callbacks suppressed It is harmless but it is confusing and annoying. d. About the patch set from me 1)The EC transaction is done in process context. Maybe it is slow but it will be more stable. At the same time compared with the 2.6.26 kernel it is not changed too lot. 2)Add 2us delay before checking EC status while in polling mode. 3)Add the delay to work around the EC GPE storm. This only affects some broken laptops. It will also be OK to disable EC GPE while doing EC transaction after EC GPE storm is detected. But it is possible to lose the EC notification event while doing EC transaction. So to avoid losing the EC notification event the udelay is added to workaround the EC GPE storm. (In reply to comment #21) > d. About the patch set from me If you really look for "root cause" you don't sweep 500+ lines of code and say -- problem was there. Patches from comment #1-4 seem work on my laptop (tested 14 hours) Maciej, Please test patch referenced in #16. Patches from #1-4 are not going to be used, so sorry for waisting your time. Patch from #16 (http://marc.info/?l=linux-acpi&m=122514341319748&w=4): ACPI: EC: missing confirmations, switch off interrupt mode. ACPI Exception (evregion-0419): AE_TIME, Returned by Handler for [EmbeddedControl] [20080926] ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.TZ4_._TMP] (Node f7019f18), AE_TIME ACPI Exception (evregion-0419): AE_TIME, Returned by Handler for [EmbeddedControl] [20080926] ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.C31E] (Node f7019858), AE_TIME ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.TZ3_._TMP] (Node f7019ed0), AE_TIME ACPI Exception (evregion-0419): AE_TIME, Returned by Handler for [EmbeddedControl] [20080926] ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.C31E] (Node f7019858), AE_TIME ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.TZ3_._TMP] (Node f7019ed0), AE_TIME ACPI Exception (evregion-0419): AE_TIME, Returned by Handler for [EmbeddedControl] [20080926] ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.C31E] (Node f7019858), AE_TIME ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.TZ3_._TMP] (Node f7019ed0), AE_TIME ACPI Exception (evregion-0419): AE_TIME, Returned by Handler for [EmbeddedControl] [20080926] ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.C31E] (Node f7019858), AE_TIME ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.TZ3_._TMP] (Node f7019ed0), AE_TIME ACPI Exception (evregion-0419): AE_TIME, Returned by Handler for [EmbeddedControl] [20080926] ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.C31E] (Node f7019858), AE_TIME ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.TZ3_._TMP] (Node f7019ed0), AE_TIME Maciej, Could you please make attach full dmesg with time in printk? Could you also enable DEBUG in drivers/acpi/ec.c? please also run an acpidump and attach it here... Created attachment 18653 [details]
acpidump from hp/compaq nx6310
Created attachment 18654 [details] dmesg during boot (kernel 2.6.28-rc3 + patch from comment #16) Created attachment 18655 [details] dmesg after boot (kernel 2.6.28-rc3 + patch from comment #16) Created attachment 18656 [details]
Decrease after-command timeout
please apply on top
Created attachment 18657 [details]
Disable mode degradation
Please attach dmesg with and without this patch on top.
Thanks!
(In reply to comment #32) > Created an attachment (id=18657) [details] > Disable mode degradation > > Please attach dmesg with and without this patch on top. > Thanks! > On clean 2.6.28-rc3 or -rc3 + patches from #16 and #31? Patch from #16 + #31 I will test tommorow, it's getting late (in Poland) :-) Regards on top of #16. it is _very_ late in Moscow too :) oh, and patch from #11917 is still required too. Created attachment 18684 [details]
Dmesg from 2.6.26-rc3 +#11917 +#16 +#31
Problem still exists
Created attachment 18685 [details]
dmesg from 2.6.26-rc3 +#11917 +#16
maciek@gumis:~$ dmesg| grep Error
[ 740.413357] ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.C31E] (Node f7019858), AE_TIME
[ 740.413422] ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.TZ3_._TMP] (Node f7019ed0), AE_TIME
Is it always C31E to fail? I'm not 100% sure, but I think: yes. Always see this: \_TZ_.C31E \_TZ_.TZ3_._TMP less often also: \_TZ_.TZ4_._TMP Already I test 2.6.28-rc3 +#22917 +#1 +#32 and seems be OK, but I want wait few hours to be sure. Maciej, Could you please check if any of /var/log/kernel.log or syslog have the point where interrupt mode is switched off and beginning of failing read? Created attachment 18691 [details]
Fragments of syslog
I attach fragments of syslog. Full syslog has 75MB.
2.6.28-rc3 +#22917 +#1 +#32 seems be ok
Thanks, I hope you meant 2.6.28-rc3 +#11917 +#16 +#32 ? From the fragments it looks like read command was never completed, while interrupt came and status was cleared instead of being set to OBF = 1. Created attachment 18693 [details]
Try to restart failed read
Maciej,
Please check if this patch is able to restart failed read command...
(In reply to comment #42) > Thanks, > I hope you meant 2.6.28-rc3 +#11917 +#16 +#32 ? Yes #11917, sorry for mistake. (In reply to comment #43) > Created an attachment (id=18693) [details] > Try to restart failed read > > Maciej, > Please check if this patch is able to restart failed read command... > To be sure: which kernel and patches also use? latest kernel git + patches from #11917 and c16 Created attachment 18694 [details] kernel.log from latest git +#11917 (comment #3) +#11896 (comment #16) +#11896 (comment #43) Looks like restart works... Why the temperature update happens so often? Do you poll it in tight loop? (In reply to comment #48) > Looks like restart works... > Why the temperature update happens so often? Do you poll it in tight loop? > It maybe can caused by software from userland (Sorry if I said something stupid, I'm not Linux programmer :))? I use kima [1] to read temperature form coretemp and thermal zone. [1] http://kde-apps.org/content/show.php/Kima+-+kicker+monitoring+applet?content=33257 - see comments: "Sorry, I should have mentioned that. I am using kima 0.7.4. Nevertheless, the number of additional interrupts is about 80-100/sec." Hi, Alexey Can you explain why the EC restart mechanism is introduced? From the patch it seems that the status (0x0) is also used to restart the EC transaction. But according to the Spec the status(0x0) is also reasonable. It indicates that the input buffer is empty. At the same time it seems so confusing that no one can understand what you want to do. Thanks. Created attachment 18755 [details] wait for last gpe Maciej, Could you please check if this patch helps? (instead of #11896(comment #43)) Created attachment 18756 [details]
restart failed transaction
If patch before does not help, please check if this patch (same restart) on top of it still helps.
Created attachment 18803 [details] latest git +#11917 (comment #3) +#11896 (comment #16) +#11896 (comment #51) Still the same: Nov 11 18:15:56 gumis kernel: [ 321.460892] ACPI Exception (evregion-0419): AE_TIME, Returned by Handler for [EmbeddedControl] [20080926] Nov 11 18:15:56 gumis kernel: [ 321.460916] ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.C31E] (Node f7019858), AE_TIME Nov 11 18:15:56 gumis kernel: [ 321.461041] ACPI Error (psparse-0524): Method parse/execution failed [\_TZ_.TZ3_._TMP] (Node f7019ed0), AE_TIME I will test next patch Created attachment 18809 [details] latest git +#11917 (comment #3) +#11896 (comment #16) +#11896 (comment #52) Works OK. Thanks for help. PS. Sorry for delay, I have short vacation :) patch in comment #52 is in acpi tree: commit a2f93aeadf97e870ff385030633a73e21146815d ACPI: EC: restart failed command patch referenced in comment #16 is also in acpi tree: commit 1cfe62c8010ac56e1bd3827e30386a87cc2f3594 Author: Alexey Starikovskiy <astarikovskiy@suse.de> ACPI: EC: revert msleep patch patch in http://bugzilla.kernel.org/show_bug.cgi?id=11917#c3 is also in the acpi tree: commit d21cf3c16b1191f3154a51e0b20c82bf851cc553 Author: Alexey Starikovskiy <astarikovskiy@suse.de> ACPI EC: Fix regression due to use of uninitialized variable above shipped in 2.6.28-rc4-git3 closed. Created attachment 18881 [details]
dmesg from 2.6.28-rc5 (EeePC 901)
Seems fine with 2.6.28-rc5 on my EeePC 901, apart from some GPE storm stuff
Also confirm, that -rc5 solves problem. Thanks |