Bug 11892
Description
Mark
2008-10-29 15:33:46 UTC
Please provide dmesg and acpidump outputs.
Created attachment 18503 [details]
dmesg output
Created attachment 18504 [details]
acpidump output
Created attachment 18507 [details]
acpidump binary format
Hi, Mark
Will you please try the patch set in http://bugzilla.kernel.org/show_bug.cgi?id=11896#C1,2,3,4 on the latest kernel (2.6.28-rc2) and see whether the problem still exists? After the test, please attach the output of dmesg, /proc/acpi/thermal_zone/*/* and /proc/acpi/battery/*/state.
Thanks.
Created attachment 18529 [details]
remove msleep from poll mode
Mark, could you please check if this patch changes something?
please enable DEBUG at the beginning of drivers/acpi/ec.c.
Created attachment 18532 [details]
the dmesg with patch set from 11896
The 4 patches from the http://bugzilla.kernel.org/show_bug.cgi?id=11896#C1,2,3,4 bug report changed nothing in this case. Attached is the dmesg.
Created attachment 18533 [details]
/proc/acpi/battery/*/state output
this is the output of /proc/acpi/battery/*/state immediately after the boot, before the battery disappeared.
Created attachment 18534 [details]
output of /proc/acpi/battery/*/state half an hour later
in this output the battery is no longer there.
Created attachment 18535 [details]
therm output after the boot.
immediately after the boot, the thermal state reports correct values.
Created attachment 18536 [details]
therm output half an hour later.
half an hour after the boot, the thermal is stuck on unrealistic 146C.
Created attachment 18537 [details]
lshal -m log
This is a log of lshal -m. Here are the last two minutes before the battery disappears. The values reported at the end, voltage=10000 and reporting.current=-1, are always the same, after about 15 tries. Next I will try the patch from comment #6 and will report as soon as I can.
Created attachment 18538 [details]
dmesg with debug output and patch 18529
This patch didn't help either, the battery state is present: no
and the thermal temperature is 146C.
This dmesg is with debug enabled.
Seems like my EC just doesn't want to work in interrupt mode, only in polling.
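The "present: no" symptom reported above can be checked from a small program rather than by eye. The following is only a minimal sketch: it assumes the state file contains a line starting with the key "present:" followed by "yes" or "no", as quoted in this report, and the helper names line_says_present and battery_present are hypothetical, not part of any kernel or library API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Return true if one line of a battery state file reads "present: yes".
 * Padding after the colon is skipped. Lines such as "present rate:" do
 * not match because the key must be followed directly by a colon.
 */
bool line_says_present(const char *line)
{
    static const char key[] = "present:";
    size_t klen = sizeof(key) - 1;

    assert(line != NULL);
    if (strncmp(line, key, klen) != 0)
        return false;
    line += klen;
    while (*line == ' ' || *line == '\t')
        line++;
    return strncmp(line, "yes", 3) == 0;
}

/*
 * Scan a state file such as /proc/acpi/battery/BAT1/state.
 * Returns 1 if the battery is reported present, 0 if it is not (or the
 * key is missing), and -1 if the file cannot be opened.
 */
int battery_present(const char *path)
{
    char buf[128];
    int present = 0;
    FILE *f;

    assert(path != NULL);
    f = fopen(path, "r");
    if (!f)
        return -1;
    while (fgets(buf, sizeof(buf), f)) {
        if (line_says_present(buf)) {
            present = 1;
            break;
        }
    }
    fclose(f);
    return present;
}
```

Calling battery_present() once a minute and logging the first 0 would give a timestamp for when the EC stops answering, which could then be matched against the dmesg debug output.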
Hi, Mark
Will you please double check whether the box can work well on 2.6.26 kernel? Will you please not load the acpi_video driver on 2.6.27 kernel and see whether the problem still exists? (Please don't use the acpi_video driver.)
Thanks.
Hi, Ykzhao
The 2.6.26 is working as it should. Also, as I said in comment #1, if I revert this patch for the EC driver ( http://lkml.org/lkml/2008/10/11/43 ) on either 2.6.27 or 2.6.28, the kernel works fine, falling back to polling mode. With this patch, however, the ACPI EC is always in interrupt mode, which for some reason stops working properly some time after the boot. As for the second part, the acpi_driver: it's not a module, so how should I prevent this driver from loading? Sorry for this lame question, but I have never done this before.
Thanks,
I meant acpi_video not acpi_driver in previous comment, my bad.
Please deselect "CONFIG_ACPI_VIDEO" in the kernel configuration. Thanks.
Created attachment 18672 [details]
dmesg with acpi_video disabled
No joy here.
With acpi_video disabled I've the same results, so this changes nothing.
Attached is the dmesg, with debug still enabled in ec.c
Is there anything else I can do to help resolve this?
Mark, please uncomment DEBUG in drivers/acpi/ec.c, enable time stamps in printk and attach here dmesg before and after EC stopped working.
Mark,
Actually, there are a number of pending patches; you can see the list in bug #11896. Please check if the last set works for you...
(In reply to comment #21)
> Mark,
> Actually, there is a number of pending patches, you could see the list in bug
> #11896. Please check if the last set works for you...
>
Just to be sure, the debug is already uncommented in my ec.c file. And the patches you want me to try are:
git
+#11917 (comment #3)
+#11896 (comment #16)
+#11896 (comment #43)
Is this correct?
correct.
(In reply to comment #21)
> Mark,
> Actually, there is a number of pending patches, you could see the list in bug
> #11896. Please check if the last set works for you...
>
Sorry to report, but those 3 patches didn't help. Attached is the full dmesg. This time, after 5 reboots, the behaviour is the same: during the boot the ACPI detects battery, adapter and thermal correctly, as shown in dmesg, but by the time the boot finishes it stops working.
Created attachment 18696 [details]
dmesg with 3 suggested patches.
Created attachment 18700 [details]
Wait for EC to clean up
Please add this patch on top.
Hi, Mark
Will you please attach the dmesg on 2.6.26 stable kernel? From the log in comment #18 it seems that the EC crashed after reporting the following message. After the burst enable command is issued, 0x09 is returned instead of the 0x90 acknowledge.
>[ 19.921656] Adding 1951888k swap on /dev/sda3. Priority:-1 extents:1 across:1951888k
>[ 19.936409] ACPI: EC: transaction start
>[ 19.936410] ACPI: EC: <--- command = 0x82
>[ 19.937553] ACPI: EC: ~~~> interrupt
>[ 19.937557] ACPI: EC: ---> status = 0x19
>[ 19.937561] ACPI: EC: ---> data = 0x09
>[ 19.937575] ACPI: EC: transaction end
Why is the incorrect acknowledge returned by EC? It is very interesting. Will you please try the following four patches and see whether the problem still exists?
Thanks.
Created attachment 18701 [details]
[Patch 1/4]: the EC transaction is explicitly divided into several phases
yakui, in the "no acpi_video" dmesg it died 2 seconds earlier, at "[ 17.069461] ACPI: EC: transaction start". Notice that EC returns same 0x90 value as from BURST_ENABLE command. Hope it will help you with your investigation.
Created attachment 18702 [details]
[patch 2/4] add the 2us delay before checking EC status while in polling mode
Yakui, I told you several times, your 4-patches 500+ lines "try-it-please" series does not help.
Created attachment 18703 [details]
[patch 3/4]: Add some delay in ec gpe handler to workaround the GPE storm on some broken bios
Created attachment 18704 [details]
[patch 4/4]: Disable EC burst mode when returning incorrect Acknowledge byte
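The burst-enable exchange discussed in this thread (command 0x82, expected acknowledge 0x90, observed data 0x09 with status 0x19) can be checked mechanically when reading a debug dmesg. The following is a minimal sketch, not the code of any attached patch: the command, acknowledge and status-bit values are the ones defined by the ACPI specification for the embedded controller, while struct ec_log_entry and burst_enable_acked are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Command and acknowledge bytes from the ACPI embedded controller spec. */
#define EC_CMD_BURST_ENABLE 0x82
#define EC_BURST_ACK        0x90

/* EC status register bits (ACPI spec). */
#define EC_FLAG_OBF   0x01 /* output buffer full: a byte is ready for the host */
#define EC_FLAG_IBF   0x02 /* input buffer full: EC has not consumed the input */
#define EC_FLAG_CMD   0x08 /* last byte written by the host was a command */
#define EC_FLAG_BURST 0x10 /* EC reports that burst mode is active */

/* Hypothetical record of one logged transaction; not a kernel type. */
struct ec_log_entry {
    uint8_t command; /* "<--- command" line */
    uint8_t status;  /* "---> status" line  */
    uint8_t data;    /* "---> data" line    */
};

/*
 * Return true only if the entry is a burst-enable command that was
 * answered with the burst acknowledge byte while OBF was set.
 */
bool burst_enable_acked(const struct ec_log_entry *e)
{
    assert(e != NULL);
    return e->command == EC_CMD_BURST_ENABLE &&
           (e->status & EC_FLAG_OBF) != 0 &&
           e->data == EC_BURST_ACK;
}
```

Applied to the transaction quoted in the log above (command 0x82, status 0x19, data 0x09), the check fails even though status 0x19 has the OBF, CMD and BURST bits set, which matches the observation that the EC claims burst mode while returning the wrong acknowledge byte.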
Hi, Mark
Will you please try the attached patch set (#C28, 30, 32, 33) on the latest kernel (2.6.28-rc3) and see whether the problem still exists? (Of course the patch in http://bugzilla.kernel.org/show_bug.cgi?id=11917#C3 is also needed.) It will be great if you can add the debug function in EC.c and attach the output of dmesg. At the same time the acpi_video driver is still disabled. After the test, please attach the output of dmesg.
Thanks.
Hi, Alexey
Why do you think that my patch set is useless? Please note that it is a regression. In the 2.6.26 kernel the system can work well. And in the 2.6.26 kernel the EC transaction is done explicitly in several phases. Right? So IMO it is appropriate to use the previous working mode on the latest kernel when trying to get the root cause. It seems that several EC bugs are related to your patch (do EC transaction in interrupt context).
Thanks.
Alexey,
The fourth patch from comment #26 did the following, out of 6 reboots:
on 2 reboots, the EC stopped working somewhere during the boot process.
on 1 reboot, the EC stopped working about half an hour after reboot.
on 3 reboots, the EC continued to work after 1 hour.
Also, the last patch introduced very strange behaviour: during boot some of the boot scripts hang for 5-10 seconds, or until I press any key on the keyboard. And "time cat /proc/acpi/battery/BAT1/state" consistently took 1.2 seconds to complete, against the normal 0.4 seconds.
Created attachment 18705 [details]
the dmesg of 4 patches proposed by Alexey.
I'm going to try Yakui's patches now. I can see that you guys disagree a lot :)
but there is no harm in trying, is there?
Created attachment 18706 [details]
The correct dmesg for 4 patches from Alexey
(In reply to comment #34)
Well, so far the 4 patches Yakui provided here plus the patch from bug #11917 comment 3 are holding up: the EC works and all seems fine. Full dmesg is provided. I will reboot and run the laptop for a few hours again, just to be sure. Do you think that I should try with acpi_video enabled again? Thanks.
P.S. If you have any other stuff I should try, I will be glad to do it.
Created attachment 18707 [details]
the dmesg of 4 patches from Yakui and #11917C3
Hi, Mark
Thanks for the test. It is very lucky that my patch set can make your box work well. Please don't enable the acpi_video driver. After the acpi_video driver is enabled, the backlight interface will be registered. In such case the _BCM object will be evaluated to change the brightness. Unfortunately in the _BCM object the EC I/O port will be directly accessed. As there is no synchronization between the AML code and OS, maybe the system can't work well.
Thanks again.
Yakui, thanks. However, without acpi_video the screen won't turn off completely; it only goes blank and the backlight is still on, which is not such a good thing for a laptop :). I will try with acpi_video, even if the system becomes unstable. This is my testing laptop, so I don't mind. :) I will report if there is anything to report.
Mark,
Let me explain the situation from my POV. With this bug we have a regression, as you say, from 2.6.26. So the ultimate solution would be to revert all the changes to EC since 2.6.26, and by Linus's rules that is what could be done: no matter how many machines are fixed by a patch, a single regression weighs more. What I am trying to do now is to find a small patch which will resolve your regression while keeping all the other bugs fixed, and there are quite a lot of them. I say small because it is not possible to predict how many machines will be broken by a non-trivial patch. In many cases even a trivial patch will break something. Thus I say that the 500+ lines patch series from Yakui is hopeless. There is no way to find which part of this series removed (or masked) your regression, thus there is no point in evaluating it. Thanks for your understanding, hope you will not mind testing some more patches.
Alexey,
Being a programmer myself (applications, not system, and unfortunately not for Linux yet) I do understand what a few lines of code can do, especially for such critical parts as ACPI. So I don't mind at all testing as many patches as needed.
Thanks.
Let's drop last patch for now and check if poll mode still works for you. Please set ACPI_EC_STORM_THRESHOLD to something low (0-5?). This should force gpe to be disabled during transaction, essentially poll mode.
Alexey, without the last patch, using only the previous 3 patches, and with ACPI_EC_STORM_THRESHOLD set to 1, forcing the poll mode, all works ok.
Created attachment 18712 [details]
dmesg of forced poll mode.
(In reply to comment #45)
> Thanks.
> Let's drop last patch for now and check if poll mode still works for you.
> Please set ACPI_EC_STORM_THRESHOLD to something low (0-5?). This should force
> gpe to be disabled during transaction, essentially poll mode.
>
Well, I've also tested (I haven't anything better to do :) ) the 2.6.28-rc3 and 2.6.27 kernels without the 3 patches in a forced poll mode and they all work fine. So as long as it is poll mode it works fine, and it's the interrupt mode that I'm having problems with.
Well, this is why we are trying to detect interrupt storm... Could you find the max value at which detection reliably happens?
Well, at 1 it's happening of course. What would you suggest the median value should be? I think I will start with ACPI_EC_STORM_THRESHOLD = 10, which is halfway from the original 20. I will post it here.
P.S. Can I ask how this helps you? I mean, do you want to find some value that will force hardware like mine to work in poll mode, while all the others keep working in interrupt mode? Or is there something else? I just wanted to know, to understand it better, thanks.
Exactly. ASUS eeePC has the storm issue as well, but it looks less severe, or maybe the CPU there is not fast enough to service so many interrupts per transaction. On their hardware the current "fast transaction" is able to operate in interrupt mode regardless of the storm (2-3 false interrupts per transaction), thus if in your case we could trigger storm detection at any higher value, we could set it as a threshold and make everyone happy.
(In reply to comment #49)
> Well, this is why we are trying to detect interrupt storm... Could you find
> the
> max value at which detection reliably happens ?
>
So 8 is the magic number. At ACPI_EC_STORM_THRESHOLD = 8 the GPE storm is detected and the EC is forced into poll mode. At 9 the interrupt mode is chosen. I will redo the test over several reboots, just to be sure.
Definitely 8, after several complete shutdowns.
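The threshold experiment above can be pictured with a small model. This is only a sketch under stated assumptions, not the code in drivers/acpi/ec.c: it assumes the driver counts false interrupts per transaction and switches to poll mode when that count is strictly greater than the threshold, and struct ec_state and ec_note_transaction are hypothetical names. Under that assumption, a machine producing 9 false interrupts per transaction would be detected at a threshold of 8 but not at 9, which is consistent with the result reported here; the figure 9 itself is an illustration and was not measured in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model state; not the kernel's struct acpi_ec. */
struct ec_state {
    unsigned int threshold; /* plays the role of ACPI_EC_STORM_THRESHOLD */
    bool gpe_storm;         /* once set, transactions run in poll mode */
};

/*
 * Record the number of false interrupts seen during one transaction.
 * Returns true if this call switched the model into poll mode.
 * The strict ">" comparison is an assumption of this sketch.
 */
bool ec_note_transaction(struct ec_state *ec, unsigned int false_irqs)
{
    assert(ec != NULL);
    if (ec->gpe_storm)
        return false; /* already polling, nothing to change */
    if (false_irqs > ec->threshold) {
        ec->gpe_storm = true;
        return true;
    }
    return false;
}
```

With the same model, the 2-3 false interrupts per transaction mentioned for the ASUS eeePC stay below a threshold of 8, so such hardware would remain in interrupt mode, which is the outcome the threshold was tuned for.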
Handled-By : Alexey Starikovskiy <astarikovskiy@suse.de>
Is there anything else I can do for now?
Could you please check if patch from bug #11896 (comment #51) works/breaks your system?
Should I test this patch alone, or with some other patches? And should I revert ACPI_EC_STORM_THRESHOLD to the default 20 or leave it at 8, which is currently working for me?
Please try both 8 and 20 values -- Yakui thinks this patch may solve all our problems, so maybe 8 or the whole storm detection is not needed any longer...
(In reply to comment #56)
> Could you please check if patch from bug #11896 (comment #51) works/breaks
> your
> system?
>
Sorry, this patch changed nothing. When ACPI_EC_STORM_THRESHOLD was 20, and EC working in interrupt mode, it showed the same symptoms as before. When ACPI_EC_STORM_THRESHOLD was 8, and EC working in poll mode, all worked fine. So this patch neither fixed nor broke my system.
Thanks for checking that. I already made a proper patch for reducing threshold value to 8, and it should be picked up soon...
patch is in acpi tree:
commit 06cf7d3c7af902939cd1754abcafb2464060cba8
Author: Alexey Starikovskiy <astarikovskiy@suse.de>
Date: Sun Nov 9 19:01:06 2008 +0300
ACPI: EC: lower interrupt storm treshold
shipped in 2.6.28-rc4-git3
closed.
Hello,
I know that the bug is marked as closed, but I have the same effect as described by Mark with all recent 2.6.28 kernels with the ubuntu 9.04 distribution (tested on generic, server, 32bit, 64bit). I downloaded the kernel sources (from the ubuntu repository), played with ACPI_EC_STORM_THRESHOLD and set it to 1 and 0 (it has the value 8 initially). My battery "disappears" after a few minutes with the original kernels and after up to 1h with ACPI_EC_STORM_THRESHOLD=0/1. However, everything works perfectly with the kernel prepared by Leann Ogasawara here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/286169 (post #6). Any help more than appreciated.
Regards,
Marcin
Created attachment 22514 [details]
output from acpidump
Created attachment 22515 [details]
output from dmesg