Bug 11892

Summary: Battery information and status disappearing and wrong thermal status - MSI ex600 laptop
Product: ACPI Reporter: Mark (makalsky)
Component: EC Assignee: Alexey Starikovskiy (astarikovskiy)
Status: CLOSED CODE_FIX    
Severity: normal CC: acpi-bugzilla, florian, marcin, ole, rjw, yakui.zhao
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.28-rc2 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 11167    
Attachments: dmesg output
acpidump output
acpidump binary format
remove msleep from poll mode
the dmesg with patch set from 11896
/proc/acpi/battery/*/state output
output of /proc/acpi/battery/*/state half an hour later
therm output after the boot.
therm output half an hour later.
lshal -m log
dmesg with debug output and patch 18529
dmesg with acpi_video disabled
dmesg with 3 suggested patches.
Wait for EC to clean up
[Patch 1/4]: the EC transaction is explicitly divided into several phases
[patch 2/4] add the 2us delay before checking EC status while in polling mode
[patch 3/4]: Add some delay in ec gpe handler to workaround the GPE storm on some broken bios
[patch 4/4]: Disable EC burst mode when returning incorrect Acknowledge byte
the dmesg of 4 patches proposed by Alexey.
The correct dmesg for 4 patches from Alexey
the dmesg of 4 patches from Yakui and #11917C3
dmesg of forced poll mode.
output from acpidump
output from dmesg

Description Mark 2008-10-29 15:33:46 UTC
Latest working kernel version: 2.6.26
Earliest failing kernel version: 2.6.27
Distribution: Ubuntu
Hardware Environment: Intel Core 2 Duo cpu, MSI ex600 laptop.
Software Environment: x86_64
Problem Description: The latest patch for the EC driver from ver. 2.0 to 2.1 is causing the battery on my laptop to disappear within a few minutes after the boot.

cat /proc/acpi/battery/BAT1/state shows
present: no
In addition, cat /proc/acpi/thermal_zone/THRM/temperature shows the correct temperature of around 56C right after boot, but after a few minutes of work it always shows 146C.
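The two symptoms above can be checked mechanically. The following is only a rough sketch: it assumes the usual "present: yes/no" and "temperature: NN C" layout of the /proc/acpi files, and the helper names and the 100C plausibility limit are made up for illustration.

```python
import re


def battery_present(state_text):
    """Return True/False from the 'present:' line, or None if the line is missing."""
    m = re.search(r"^present:\s*(\w+)", state_text, re.MULTILINE)
    return None if m is None else m.group(1) == "yes"


def temperature_c(temp_text):
    """Return the first 'NN C' value found in the text as an int, or None."""
    m = re.search(r"(\d+)\s*C\b", temp_text)
    return None if m is None else int(m.group(1))


def ec_looks_stuck(state_text, temp_text, limit_c=100):
    """True if the battery is reported absent or the temperature is implausible.

    limit_c is an assumed plausibility limit, not a value taken from the kernel.
    """
    temp = temperature_c(temp_text)
    return battery_present(state_text) is False or (temp is not None and temp >= limit_c)
```

In practice the two arguments would be the contents of /proc/acpi/battery/BAT1/state and /proc/acpi/thermal_zone/THRM/temperature, read a few minutes apart.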

I don't know the exact patch number, however it is this one: 
http://lkml.org/lkml/2008/10/11/43

I initially reported this bug against Ubuntu; you can see it here: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/286169

But since then I've tried the mainline 2.6.28-rc2 kernel, with and without this patch, and the bug is there.

Without this patch the dmesg shows this:
[    0.587824] PCI: MCFG configuration 0: base e0000000 segment 0 buses 0 - 255
[    0.588162] ACPI Error (evregion-0315): No handler for Region [EC__] (ffff88007fb305e8) [EmbeddedControl] [20080609]
[    0.588168] ACPI Error (exfldio-0291): Region EmbeddedControl(3) has no handler [20080609]
[    0.588173] ACPI Error (psparse-0530): Method parse/execution failed [\_SB_.PCI0.SBRG.EC__.BAT1._STA] (Node ffff88007fb33040), AE_NOT_EXIST
[    0.588212] ACPI Error (uteval-0232): Method execution failed [\_SB_.PCI0.SBRG.EC__.BAT1._STA] (Node ffff88007fb33040), AE_NOT_EXIST
[    0.589684] ACPI Error (evregion-0315): No handler for Region [EC__] (ffff88007fb305e8) [EmbeddedControl] [20080609]
[    0.589689] ACPI Error (exfldio-0291): Region EmbeddedControl(3) has no handler [20080609]
[    0.589694] ACPI Error (psparse-0530): Method parse/execution failed [\_SB_.PCI0.SBRG.EC__.BAT1._STA] (Node ffff88007fb33040), AE_NOT_EXIST
[    0.589732] ACPI Error (uteval-0232): Method execution failed [\_SB_.PCI0.SBRG.EC__.BAT1._STA] (Node ffff88007fb33040), AE_NOT_EXIST
[    0.589824] PCI: MCFG area at e0000000 reserved in ACPI motherboard resources
[    0.598949] PCI: Using MMCONFIG at e0000000 - efffffff
[    0.600599] ACPI: EC: non-query interrupt received, switching to interrupt mode
[    0.600599] ACPI: EC: GPE storm detected, disabling EC GPE
[    0.669385] ACPI: EC: GPE = 0x17, I/O: command/status = 0x66, data = 0x62
[    0.669385] ACPI: EC: driver started in poll mode

and the ACPI is working fine in poll mode. However with this patch the dmesg shows this:
[    0.159226] PCI: MCFG configuration 0: base e0000000 segment 0 buses 0 - 255
[    0.156016] ACPI: EC: non-query interrupt received, switching to interrupt mode
[    0.172212] PCI: MCFG area at e0000000 reserved in ACPI motherboard resources
[    0.185424] PCI: Using MMCONFIG at e0000000 - efffffff
[    0.193727] ACPI: EC: GPE = 0x17, I/O: command/status = 0x66, data = 0x62
[    0.193727] ACPI: EC: driver started in interrupt mode

The ACPI in this case on my hardware stops working properly in interrupt mode after a few minutes and never returns to normal.

I would gladly help in any way I can and provide additional information if needed.
Comment 1 Alexey Starikovskiy 2008-10-29 15:40:35 UTC
please provide dmesg and acpidump outputs.
Comment 2 Mark 2008-10-29 16:10:52 UTC
Created attachment 18503 [details]
dmesg output
Comment 3 Mark 2008-10-29 16:11:47 UTC
Created attachment 18504 [details]
acpidump output
Comment 4 Mark 2008-10-29 17:49:27 UTC
Created attachment 18507 [details]
acpidump binary format
Comment 5 ykzhao 2008-10-30 02:09:21 UTC
Hi, Mark 
    Will you please try the patch set in http://bugzilla.kernel.org/show_bug.cgi?id=11896#C1,2,3,4 on the latest kernel(2.6.28-rc2) and see whether the problem still exists?
    After the test, please attach the output of dmesg, /proc/acpi/thermal_zone/*/* , /proc/acpi/battery/*/state
    thanks.
Comment 6 Alexey Starikovskiy 2008-10-30 15:26:53 UTC
Created attachment 18529 [details]
remove msleep from poll mode

Mark, could you please check if this patch changes something?
please enable DEBUG at the beginning of drivers/acpi/ec.c.
Comment 7 Mark 2008-10-30 16:22:38 UTC
Created attachment 18532 [details]
the dmesg with patch set from 11896

the 4 patches from the http://bugzilla.kernel.org/show_bug.cgi?id=11896#C1,2,3,4 bug report changed nothing in this case. attached are the dmesg.
Comment 8 Mark 2008-10-30 16:26:22 UTC
Created attachment 18533 [details]
/proc/acpi/battery/*/state output

this is the output of /proc/acpi/battery/*/state immediately after the boot, before the battery disappeared.
Comment 9 Mark 2008-10-30 16:27:32 UTC
Created attachment 18534 [details]
output of /proc/acpi/battery/*/state half an hour later

in this output the battery is no longer there.
Comment 10 Mark 2008-10-30 16:28:40 UTC
Created attachment 18535 [details]
therm output after the boot.

immediately after the boot, the thermal state reports correct values.
Comment 11 Mark 2008-10-30 16:29:48 UTC
Created attachment 18536 [details]
therm output half an hour later.

half an hour after the boot, the thermal is stuck on unrealistic 146C.
Comment 12 Mark 2008-10-30 16:33:43 UTC
Created attachment 18537 [details]
lshal -m log

this is a log of lshal -m 
Here are the last two minutes, before the battery disappears.
The values reported here at the end, voltage=10000 and reporting.current=-1 are always the same, after about 15 tries.

Next I will try the patch from comment #6. will report as soon as I can.
Comment 13 Mark 2008-10-30 18:07:54 UTC
Created attachment 18538 [details]
dmesg with debug output and patch 18529

This patch didn't help either, the battery state is present: no
and the thermal temperature is 146C.
This dmesg is with debug enabled.
Seems like my EC just doesn't want to work in interrupt mode, only in polling.
Comment 14 ykzhao 2008-10-30 19:34:41 UTC
Hi, Mark
    Will you please double check whether the box can work well on 2.6.26 kernel?
    Will you please not load the acpi_video driver on 2.6.27 kernel and see whether the problem still exists?( Please don't use the acpi_video driver.)

    thanks.
    
Comment 15 Mark 2008-10-31 01:51:32 UTC
Hi, Ykzhao
The 2.6.26 is working as it should.
Also, as I said in comment #1, if I revert this patch for the EC driver ( http://lkml.org/lkml/2008/10/11/43 ) on either 2.6.27 or 2.6.28, the kernel works fine, falling back to polling mode. With this patch, however, the ACPI EC is always in interrupt mode, which for some reason stops working properly some time after the boot.


As for the second part, the acpi_driver: it's not a module, so how should I prevent this driver from loading? Sorry for this lame question, but I have never done this before.
Thanks,
Comment 16 Mark 2008-10-31 04:08:58 UTC
I meant acpi_video, not acpi_driver, in the previous comment, my bad.
Comment 17 ykzhao 2008-11-02 17:12:42 UTC
Please deselect "CONFIG_ACPI_VIDEO" in the kernel configuration.
Thanks.
Comment 18 Mark 2008-11-04 15:52:48 UTC
Created attachment 18672 [details]
dmesg with acpi_video disabled

No joy here.
With acpi_video disabled I've the same results, so this changes nothing.
Attached is the dmesg, with debug still enabled in ec.c
Comment 19 Mark 2008-11-05 13:09:07 UTC
Is there anything else I can do to help resolve this?
Comment 20 Alexey Starikovskiy 2008-11-05 13:31:26 UTC
Mark,
please uncomment DEBUG in drivers/acpi/ec.c, enable time stamps in printk and attach here dmesg before and after EC stopped working.
Comment 21 Alexey Starikovskiy 2008-11-05 13:50:25 UTC
Mark,
Actually, there is a number of pending patches, you could see the list in bug #11896. Please check if the last set works for you...
Comment 22 Mark 2008-11-05 14:03:38 UTC
(In reply to comment #21)
> Mark,
> Actually, there is a number of pending patches, you could see the list in bug
> #11896. Please check if the last set works for you...
> 

just to be sure, the debug is already uncommented in my ec.c file.
And the patches you want me to try are:
git +#11917 (comment #3) +#11896 (comment #16) +#11896 (comment #43)

Is this correct?
Comment 23 Alexey Starikovskiy 2008-11-05 14:23:06 UTC
correct.
Comment 24 Mark 2008-11-05 16:17:36 UTC
(In reply to comment #21)
> Mark,
> Actually, there is a number of pending patches, you could see the list in bug
> #11896. Please check if the last set works for you...
> 
Sorry to report, but those 3 patches didn't help.
Attached is the full dmesg.
This time after 5 reboots the behaviour is the same, during the boot the ACPI detects battery, adapter and thermal correctly, as shown in dmesg, but by the time the boot finishes it stops working. 
Comment 25 Mark 2008-11-05 16:18:24 UTC
Created attachment 18696 [details]
dmesg with 3 suggested patches.
Comment 26 Alexey Starikovskiy 2008-11-05 23:14:22 UTC
Created attachment 18700 [details]
Wait for EC to clean up

Please add this patch on top.
Comment 27 ykzhao 2008-11-06 00:15:30 UTC
Hi, Mark
   Will you please attach the dmesg on 2.6.26 stable kernel?
   From the log in comment #18 it seems that the EC crashed after reporting the following message. After the burst enable command is issued, 0x09 is returned instead of the 0x90 acknowledge.
    >[   19.921656] Adding 1951888k swap on /dev/sda3.  Priority:-1 extents:1 across:1951888k
    >[   19.936409] ACPI: EC: transaction start
    >[   19.936410] ACPI: EC: <--- command = 0x82
    >[   19.937553] ACPI: EC: ~~~> interrupt
    >[   19.937557] ACPI: EC: ---> status = 0x19
    >[   19.937561] ACPI: EC: ---> data = 0x09
    >[   19.937575] ACPI: EC: transaction end
   
    Why is the incorrect acknowledge returned by EC? It is very interesting.
    Will you please try the following four patches and see whether the problem still exists?
    Thanks.
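A debug dmesg can be scanned for this kind of wrong acknowledge automatically. The sketch below assumes the debug line format shown in the excerpt above; 0x82 and 0x90 are the burst enable command and its acknowledge byte, while the function name is made up for illustration.

```python
import re

BURST_ENABLE = 0x82  # EC "burst enable" command byte
BURST_ACK = 0x90     # acknowledge byte expected in reply to burst enable

# Matches "[   19.936410] ACPI: EC: <message>", with or without a quoting ">" prefix.
LINE = re.compile(r"\[\s*(\d+\.\d+)\]\s+ACPI: EC: (.*)")

SAMPLE = """\
>[   19.936409] ACPI: EC: transaction start
>[   19.936410] ACPI: EC: <--- command = 0x82
>[   19.937553] ACPI: EC: ~~~> interrupt
>[   19.937557] ACPI: EC: ---> status = 0x19
>[   19.937561] ACPI: EC: ---> data = 0x09
>[   19.937575] ACPI: EC: transaction end
"""


def bad_burst_acks(dmesg_text):
    """Return (timestamp, data) for each burst enable answered with a byte other than 0x90."""
    results = []
    command = None
    for raw in dmesg_text.splitlines():
        m = LINE.search(raw)
        if not m:
            continue
        stamp, msg = float(m.group(1)), m.group(2).strip()
        if msg == "transaction start":
            command = None
        elif msg.startswith("<--- command ="):
            command = int(msg.split("=")[1].strip(), 16)
        elif msg.startswith("---> data =") and command == BURST_ENABLE:
            data = int(msg.split("=")[1].strip(), 16)
            if data != BURST_ACK:
                results.append((stamp, data))
            command = None  # only the first data byte is the acknowledge
    return results
```

For the excerpt above, bad_burst_acks(SAMPLE) reports the single 0x09 reply at 19.937561.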
Comment 28 ykzhao 2008-11-06 00:30:01 UTC
Created attachment 18701 [details]
[Patch 1/4]: the EC transaction is explicitly divided into several phases
Comment 29 Alexey Starikovskiy 2008-11-06 00:33:29 UTC
yakui,
in the "no acpi_video" dmesg it died 2 seconds earlier, at "[   17.069461] ACPI: EC: transaction start". Notice that the EC returns the same 0x90 value as from the BURST_ENABLE command. Hope it will help you with your investigation.
Comment 30 ykzhao 2008-11-06 00:35:03 UTC
Created attachment 18702 [details]
[patch 2/4] add the 2us delay before checking EC status while in polling mode
Comment 31 Alexey Starikovskiy 2008-11-06 00:35:51 UTC
Yakui,
I told you several times, your 4-patches 500+ lines "try-it-please" series does not help.
Comment 32 ykzhao 2008-11-06 00:39:38 UTC
Created attachment 18703 [details]
[patch 3/4]: Add some delay in ec gpe handler to workaround the GPE storm on some broken bios
Comment 33 ykzhao 2008-11-06 00:41:27 UTC
Created attachment 18704 [details]
[patch 4/4]: Disable EC burst mode when returning incorrect Acknowledge byte
Comment 34 ykzhao 2008-11-06 00:46:08 UTC
Hi, Mark
    Will you please try the attached patch set (#C28, 30, 32, 33) on the latest kernel (2.6.28-rc3) and see whether the problem still exists? (Of course the patch in http://bugzilla.kernel.org/show_bug.cgi?id=11917#C3 is also needed).
    It will be great if you can add the debug function in EC.c and attach the output of dmesg.
    At the same time the acpi_video driver is still disabled.
    After the test, please attach the output of dmesg.
    Thanks.
Comment 35 ykzhao 2008-11-06 00:54:04 UTC
Hi, Alexey
    Why do you think that my patch set is useless? 
    Please note that it is a regression. In the 2.6.26 kernel the system can work well. And in the 2.6.26 kernel the EC transaction is done explicitly in several phases. Right?
    So IMO it is appropriate to use the previous working mode on the latest kernel when trying to get the root cause.
    It seems that several EC bugs are related with your patch(do EC transaction in interrupt context). 
    Thanks.
    
    
Comment 36 Mark 2008-11-06 03:41:17 UTC
Alexey,
The fourth patch from comment #26 did the following:
out of 6 reboots,
on 2 reboots, the EC stopped working somewhere during the boot process.
on 1 reboot, the EC stopped working about half an hour after reboot.
on 3 reboots, the EC continued to work after 1 hour.
Also the last patch introduced very strange behaviour, during the boot time some of the boot scripts will hang for 5-10 seconds, or until I press any key on the keyboard.
And also the "time cat /proc/acpi/battery/BAT1/state" took consistently 1.2 seconds to complete, against normal 0.4 seconds.
Comment 37 Mark 2008-11-06 03:44:34 UTC
Created attachment 18705 [details]
the dmesg of 4 patches proposed by Alexey.

I'm going to try Yakui's patches now. I can see that you guys disagree a lot :)
but there is no harm in trying, is there?
Comment 38 Mark 2008-11-06 05:18:28 UTC
Created attachment 18706 [details]
The correct dmesg for 4 patches from Alexey
Comment 39 Mark 2008-11-06 06:04:38 UTC
(In reply to comment #34)
Well, so far the 4 patches Yakui provided here plus the patch from bug #11917 comment 3 are holding up, the EC works and all seems fine, full dmesg is provided.
I will reboot and run the laptop for a few hours again, just to be sure.
Do you think that I should try with acpi_video enabled again?
Thanks.

P.S. If you have any other stuff I should try, I will be glad to do it.
Comment 40 Mark 2008-11-06 06:06:03 UTC
Created attachment 18707 [details]
the dmesg of 4 patches from Yakui and #11917C3
Comment 41 ykzhao 2008-11-06 07:16:07 UTC
Hi, Mark
    Thanks for the test. It is very lucky that my patch set can make your box work well.
    Please don't enable the acpi_video driver. After the acpi_video driver is enabled, the backlight interface will be registered. In such case the _BCM object will be evaluated to change the brightness. Unfortunately in the _BCM object the EC I/O port will be directly accessed. As there is no synchronization between the AML code and OS, maybe the system can't work well.
    Thanks again.
    
Comment 42 Mark 2008-11-06 07:58:45 UTC
Yakui, thanks.
However without acpi_video the screen won't turn off completely, only go blank, the backlight is still on, which is not so good thing for a laptop :).
I will try with acpi_video, even if the system will become unstable, this is my testing laptop, so I don't mind. :)
I will report, if there will be anything to report.
Comment 43 Alexey Starikovskiy 2008-11-06 13:21:08 UTC
Mark,
Let me explain the situation from my POV. With this bug we have a regression, as you say, from 2.6.26. So the ultimate solution would be to revert all the changes to EC since 2.6.26, and by Linus's rules that could be done: no matter how many machines are fixed by a patch, a single regression weighs more. What I am trying to do now is to find a small patch which will resolve your regression while keeping all the other bugs fixed, and there are quite a lot of them. I say small because it is not possible to predict how many machines will be broken by a non-trivial patch. In many cases even a trivial patch will break something. Thus I say that the 500+ line patch series from Yakui is hopeless. There is no way to find which part of this series removed (or masked) your regression, thus there is no point in evaluating it.
Thanks for your understanding, hope you will not mind to test some more patches. 
Comment 44 Mark 2008-11-06 13:43:04 UTC
Alexey,
Being a programmer myself 
(applications not system and unfortunately not for Linux yet) 
I do understand what a few lines of code can do, especially for such critical parts as ACPI.
So I don't mind at all to test as many patches as needed.
Comment 45 Alexey Starikovskiy 2008-11-06 15:02:24 UTC
Thanks.
Let's drop last patch for now and check if poll mode still works for you.
Please set ACPI_EC_STORM_THRESHOLD to something low (0-5?). This should force gpe to be disabled during transaction, essentially poll mode.
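As a rough illustration of why a low threshold amounts to poll mode, here is a toy model of the storm check. It is a sketch only, not the code in drivers/acpi/ec.c: the class name, the method name and the strict ">" comparison are assumptions.

```python
class EcModel:
    """Toy model: count interrupts per transaction and latch a 'storm' flag."""

    def __init__(self, threshold):
        self.threshold = threshold  # stands in for ACPI_EC_STORM_THRESHOLD
        self.storm = False          # once set, the GPE stays disabled during transactions

    def run_transaction(self, interrupts_seen):
        """Return the mode used for this transaction, then update the storm flag."""
        mode = "poll" if self.storm else "interrupt"
        # Assumed comparison: a storm is declared when the count exceeds the threshold.
        if not self.storm and interrupts_seen > self.threshold:
            self.storm = True
        return mode
```

In this model, with a threshold of 1 the first transaction that sees a couple of interrupts latches the flag and every later transaction is polled, while with a threshold of 20 the same transactions stay in interrupt mode.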
Comment 46 Mark 2008-11-06 17:05:32 UTC
Alexey, 
without the last patch, using only the previous 3 patches, and with ACPI_EC_STORM_THRESHOLD set to 1, forcing the poll mode, all works ok.
Comment 47 Mark 2008-11-06 17:06:14 UTC
Created attachment 18712 [details]
dmesg of forced poll mode.
Comment 48 Mark 2008-11-08 12:00:08 UTC
(In reply to comment #45)
> Thanks.
> Let's drop last patch for now and check if poll mode still works for you.
> Please set ACPI_EC_STORM_THRESHOLD to something low (0-5?). This should force
> gpe to be disabled during transaction, essentially poll mode.
> 
Well, I've also tested (I haven't anything better to do :) ) the 2.6.28-rc3 and 2.6.27 kernels without the 3 patches in forced poll mode and they all work fine, so as long as it is poll mode it works fine, and it's the interrupt mode that I'm having problems with.
Comment 49 Alexey Starikovskiy 2008-11-08 12:17:23 UTC
Well, this is why we are trying to detect interrupt storm... Could you find the max value at which detection reliably happens ?
Comment 50 Mark 2008-11-08 12:34:59 UTC
Well, at 1 it's happening of course. What would you suggest the median value should be? 
I think I will start with ACPI_EC_STORM_THRESHOLD = 10, which is halfway from the original 20.
I will post it here.

P.S. Can I ask how does it help you?
I mean, do you want to find some value that will force a hardware like mine to work in poll mode, while all the others still working in interrupt mode?
Or is there something else?
I just wanted to know, to understand it better, thanks.
Comment 51 Alexey Starikovskiy 2008-11-08 13:01:18 UTC
Exactly. The ASUS eeePC has the storm issue as well, but it looks less severe, or maybe the CPU there is not fast enough to service so many interrupts per transaction.
On their hardware current "fast transaction" is able to operate in interrupt mode regardless of the storm (2-3 false interrupts per transaction), thus if in your case we could trigger storm detection at any higher value, we could set it as a threshold and make everyone happy. 
Comment 52 Mark 2008-11-08 16:43:51 UTC
(In reply to comment #49)
> Well, this is why we are trying to detect interrupt storm... Could you find
> the
> max value at which detection reliably happens ?
> 

So 8 is the magic number.
At ACPI_EC_STORM_THRESHOLD = 8 the GPE storm is detected and the EC is forced into poll mode. At 9 the interrupt mode is chosen.
I will redo the test over several reboots, just to be sure.
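The search for the largest detecting value can be written down as a bisection. This is only a sketch: it assumes detection is monotonic (if a threshold detects the storm, every lower one does too), and the detects callback is hypothetical; on real hardware each probe is a rebuild, a reboot and a look at dmesg.

```python
def max_detecting_threshold(detects, lo=1, hi=20):
    """Return the largest threshold in [lo, hi] for which detects(threshold) is true.

    Returns None if no value in the range triggers detection.
    """
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if detects(mid):
            best = mid      # storm detected here, try higher values
            lo = mid + 1
        else:
            hi = mid - 1    # interrupt mode chosen, try lower values
    return best
```

With a stand-in callback that mirrors the result above (storm detected at 8 and below, interrupt mode at 9 and above), the function returns 8.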
Comment 53 Mark 2008-11-08 17:18:28 UTC
Definitely 8, after several complete shutdowns.
Comment 54 Rafael J. Wysocki 2008-11-09 09:21:53 UTC
Handled-By : Alexey Starikovskiy <astarikovskiy@suse.de>
Comment 55 Mark 2008-11-10 05:04:54 UTC
Is there anything else I can do for now?
Comment 56 Alexey Starikovskiy 2008-11-10 08:25:04 UTC
Could you please check if patch from bug #11896 (comment #51) works/breaks your system?
Comment 57 Mark 2008-11-10 08:35:07 UTC
Should I test this patch alone, or with some other patches?
And should I revert the ACPI_EC_STORM_THRESHOLD to the default 20 or leave it at  8 that is currently working for me?
Comment 58 Alexey Starikovskiy 2008-11-10 08:38:29 UTC
Please try both 8 and 20 values -- Yakui thinks this patch may solve all our problems, so maybe 8, or the whole storm detection, is not needed any longer...
Comment 59 Mark 2008-11-10 12:02:47 UTC
(In reply to comment #56)
> Could you please check if patch from bug #11896 (comment #51) works/breaks
> your
> system?
> 

Sorry, this patch changed nothing.
When ACPI_EC_STORM_THRESHOLD was 20, and EC working in interrupt mode, it showed the same symptoms as before.
When ACPI_EC_STORM_THRESHOLD was 8, and EC working in poll mode, all worked fine.
So this patch neither fixed nor broke my system.
Comment 60 Alexey Starikovskiy 2008-11-10 12:07:11 UTC
Thanks for checking that. I already made a proper patch for reducing threshold value to 8, and it should be picked up soon... 
Comment 61 Len Brown 2008-11-11 22:01:46 UTC
patch is in acpi tree:

commit 06cf7d3c7af902939cd1754abcafb2464060cba8
Author: Alexey Starikovskiy <astarikovskiy@suse.de>
Date:   Sun Nov 9 19:01:06 2008 +0300

    ACPI: EC: lower interrupt storm treshold
Comment 62 Len Brown 2008-11-12 21:29:04 UTC
shipped in 2.6.28-rc4-git3

closed.
Comment 63 Marcin Inkielman 2009-07-27 18:15:00 UTC
Hello,

I know that the bug is marked as closed, but I see the same effect as described by Mark with all recent 2.6.28 kernels on the Ubuntu 9.04 distribution (tested on generic, server, 32bit, 64bit). I downloaded the kernel sources (from the Ubuntu repository), played with ACPI_EC_STORM_THRESHOLD and set it to 1 and 0 (its value is 8 initially). My battery "disappears" after a few minutes with the original kernels and after up to 1h with ACPI_EC_STORM_THRESHOLD=0/1.

However, everything works perfectly with the kernel prepared by Leann Ogasawara  here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/286169 (post #6).

Any help more than appreciated.

Regards,
Marcin
Comment 64 Marcin Inkielman 2009-07-27 18:17:47 UTC
Created attachment 22514 [details]
output from acpidump
Comment 65 Marcin Inkielman 2009-07-27 18:18:42 UTC
Created attachment 22515 [details]
output from dmesg