Bug 45151

Summary: [REGRESSION] [BISECTED] ACPI event stop to be detected
Product: ACPI Reporter: Francesco (trentini)
Component: ACPICA-CoreAssignee: acpi_acpica-core (acpi_acpica-core)
Status: CLOSED CODE_FIX    
Severity: normal CC: chris, feng.tang, florian, lenb, lv.zheng, Robert.Moore
Priority: P1    
Hardware: All   
OS: Linux   
URL: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/908825
Kernel Version: 2.6.39 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: Acpidump of my Notebook
gpe_debug.patch
dmesg output when running kernel with "gpe_debug.patch"
new gpe patch
dmesg output of test proposed in comment #7
dmesg output of test proposed in comment #9
dmesg outputs of comment #13 (good?)
dmesg outputs of comment #13 (bad?)
dmesg outputs of comment #14 (good?)
dmesg outputs of comment #14 (bad?)
test of comment #21
adding ec_irq_force param
make_threshold_parameter.patch
dmidecode of santech t45
add quirk for Clevo M720

Description Francesco 2012-07-26 14:20:48 UTC
This report follow ubuntu bug from here:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/908825

This is the resume of the launchpad report and the suspected problem:
I've a Clevo M720 notebook rebranded as Santech T45. All kernel from  2.6.38 and before support my laptop hotkey without extra set up. From 2.6.39 and following acpi support is broken. For example, with newer kernel, there are not recognized event like:
- the ac_adapter plug and unplug
- changing battery charging/discharging status event
- suspend button*
- backlight hotkey event*
- low power mode button*
- lid open/close

The bug is reproducible booting the laptop with a kernel from 2.6.38 and older and affect all most diffused distribution (debian, fedora, ubuntu, etc.) when using newer kernel. Using acpi_event command I've saved the output for low power mode button, backlight up and down and suspend button:
#: acpi_listen
processor CPU0 00000080 00000003
processor CPU0 00000081 00000000
processor CPU1 00000080 00000003
processor CPU1 00000081 00000000
processor CPU0 00000080 00000000
processor CPU0 00000081 00000000
processor CPU1 00000080 00000000
processor CPU1 00000081 00000000
video LCD 00000087 00000000
video LCD 00000086 00000000
button/sleep SLPB 00000080 00000002

After kernel bisection, I've found the first commit that have caused the regression:
6dfad339645247c9deb553e4d68e21211ddc61bd is the first bad commit
commit 6dfad339645247c9deb553e4d68e21211ddc61bd
Author: Lin Ming <email address hidden>
Date: Mon Feb 14 15:29:34 2011 +0800

    ACPICA: GPE detect optimization - ignore unused GPE registers

    This optimization will simply ignore GPE registers that contain
    no enabled GPEs - there is no need to read the register.
    ACPICA bugzilla 884.

    http://www.acpica.org/bugzilla/show_bug.cgi?id=884

    Signed-off-by: Lin Ming <email address hidden>
    Signed-off-by: Bob Moore <email address hidden>
    Signed-off-by: Len Brown <email address hidden>

:040000 040000 8c25abddf22c9740402624bd582f3c87bfe9fc4f e40ed2fa28b82990cc8fb147f61841fd6400e711 M drivers
Comment 1 Feng Tang 2012-07-30 09:28:48 UTC
could you post your acpidump binary?

The commit's logic looks ok, if you revert this patch like this, will there be some difference? thanks

diff --git a/drivers/acpi/acpica/evgpe.c b/drivers/acpi/acpica/evgpe.c
index afbd5cb..36da618 100644
--- a/drivers/acpi/acpica/evgpe.c
+++ b/drivers/acpi/acpica/evgpe.c
@@ -377,10 +377,12 @@ u32 acpi_ev_gpe_detect(struct acpi_gpe_xrupt_info * gpe_xrupt_list)
                         * Optimization: If there are no GPEs enabled within this
                         * register, we can safely ignore the entire register.
                         */
+                       /*
                        if (!(gpe_register_info->enable_for_run |
                              gpe_register_info->enable_for_wake)) {
                                continue;
                        }
+                       */

                        /* Read the Status Register */
Comment 2 Francesco 2012-07-31 20:48:35 UTC
Created attachment 76621 [details]
Acpidump of my Notebook
Comment 3 Francesco 2012-07-31 20:49:37 UTC
I've compiled the 3.2.21 kernel (the ubuntu precise kernel source) with and without the proposed patch. The results is that applying the  patch, thus comment out the if statement, solve the issue and the acpi restart working correctly on my notebook.

I don't know why, but those three lines cause the stop of acpi event on my notebook.

I've also attached the acpidump output of my notebook.

Thanks
Comment 4 Feng Tang 2012-08-02 09:05:16 UTC
Created attachment 76651 [details]
gpe_debug.patch

Hi,

Could you try to build a kernel with this patch, and enable CONFIG_ACPI_DEBUG in kernel config, then do:

1. bootup the kernel, switch to root user
2. run
     #echo 0x4 > /sys/module/acpi/parameters/debug_layer
     #echo 0x0800000c > /sys/module/acpi/parameters/debug_level
3. startup the "acpi_listen" and press _once the hotkey which doesn't work
4. send out the "dmesg"

Thanks,
Comment 5 Francesco 2012-08-02 21:09:07 UTC
Created attachment 76711 [details]
dmesg output when running kernel with "gpe_debug.patch"
Comment 6 Francesco 2012-08-02 21:11:48 UTC
I did as you written. I press only the "backlight down" hotkey that use cause an acpi event. The acpi listen output is:
video LCD 00000087 00000000

Thanks for your time
Comment 7 Robert Moore 2012-08-02 21:16:01 UTC
Very good, the output is useful.

Now, please comment out those three lines of code (as in comment #1), and get the same debug output so we can compare the two debug sequences.

Thanks.
bob
Comment 8 Francesco 2012-08-03 08:09:13 UTC
What does you mean exactly?

Option A: from a virgen kernel source I apply _only the patch in comment #1 _and enable CONFIG_ACPI_DEBUG in kernel config and do the same step of comment #4

Option B: apply _both patch, comment #1 and "gpe_debug.patch", and run the same step as comment #4 with CONFIG_ACPI_DEBUG enabled.

Which of two option do you mean?

I prefer to be precise because I'm very far to be an expert in kernel hacking, so I want to be sure that I'm doing the right thing.

Thanks
Comment 9 Feng Tang 2012-08-03 08:36:39 UTC
Created attachment 76731 [details]
new gpe patch

Hi  Francesco,

You already did a good job in testing kernel and providing the useful info.

I guess Bob's intention is to compare both the working and non-working case.
So could you try the new attached patch on a clean kernel without any of my previous patch, do the same test and provide the dmesg info?

Also you can add "log_buf_len=4M acpi.debug_layer=0x1f acpi.debug_level=0x8000000c" to your kernel cmdline by change grub.conf/grub.cfg, so that you don't need to do the 
     #echo 0x4 > /sys/module/acpi/parameters/debug_layer
     #echo 0x0800000c > /sys/module/acpi/parameters/debug_level.

thanks,
Comment 10 Francesco 2012-08-12 11:13:29 UTC
Ok, I've done both test. I've done the test as described in comment #4 and with the patch requested in comment #7 and comment #9. The results are:


Comment #7 test: in this case I've comment out the three lines of the if statement. The acpi subsystem works and acpi_listen gives the normal output as: "video LCD 00000087 00000000". The dmesg output are attached in "dmesg_comm7.txt" file.



Comment #9 test: using "new gpe patch" proposed in this comment doesn't resume the acpi subsystem and acpi_listen doesn't put any output. In any case, the relative dmesg is attached in "dmesg_comm9.txt" file.

I hope that it is the thinks what you want.

If more test is needed, there is no problem.

Thanks,
Comment 11 Francesco 2012-08-12 11:14:11 UTC
Created attachment 77431 [details]
dmesg output of test proposed in comment #7
Comment 12 Francesco 2012-08-12 11:14:29 UTC
Created attachment 77441 [details]
dmesg output of test proposed in comment #9
Comment 13 Feng Tang 2012-08-13 06:48:55 UTC
From the 2 dmesgs for bad/good situations, I can't find obvious clue. Could you add these parameters to your kernel command line (by editing your /boot/grub/grub.cfg)
 "debug log_buf_len=8M acpi.debug_layer=0x1009001F acpi.debug_level=0x082002FF", and try the same test process for both the good/bad kernels? thanks
Comment 14 Feng Tang 2012-08-14 02:23:51 UTC
Hi Francesco,

Since this bug is heavily bound with EC driver, especially when there is a GPE storm warning in your dmesg. please apply this patch to your EC driver for all your good/bad test kernels:


diff --git a/drivers/acpi/ec.c b/drivers/acpi/ec.c
index 7edaccc..1b70769 100644
--- a/drivers/acpi/ec.c
+++ b/drivers/acpi/ec.c
@@ -27,7 +27,7 @@
  */

 /* Uncomment next line to get verbose printout */
-/* #define DEBUG */
+#define DEBUG

 #include <linux/kernel.h>
 #include <linux/module.h>


thanks
Comment 15 Feng Tang 2012-08-20 02:57:22 UTC
Francesco,

Could you provide the debug info with the new debug method? I mean make the changes suggested in Comment 13/14 and do the test and post the dmesg for both the good and bad kernel case.
Comment 16 Francesco 2012-08-20 15:09:30 UTC
Here it is the results but something strange happens. I've check three times but I think I'm doing something wrong.

The strange behavior is that adding the string of comment #13 ("debug log_buf_len=8M  cpi.debug_layer=0x1009001F acpi.debug_level=0x082002FF") to kernel boot parameter,
the acpi subsystem work vice versa as booting without the string. This mean that the acpi works with the if statement whereas stop working when is comment out (thus applying the patch of comment #1). On the other hand, when all the kernels are booted without the previous string the acpi subsystem work as described in the bug. Therefore, the string seems to invert the behavior of the acpi subsystem.

Hence, I've reset all the kernel source folder and restart to compile all the kernel variant before doing the test. All test are done _WITH the boot string of comment #13 and acpi debug compiling option activated. The dmesg files included are:

dmesg_c1.txt: in this case _only patch of comment #1 is applied. In that case acpi_listen doesn't output anything, the acpi subsystem doesn't work.

dmesg_c9-new-gpe.txt: in this case _only patch of comment #9 is applied (called "new gpe patch"). In that case acpi work correctly.

dmesg_c14-c1.txt: in this case _both patch of comment #1 and comment #14 are applied. In that case acpi_listen doesn't output anything, the acpi subsystem doesn't work.

dmesg_c14-c9.txt: in this case _both patch of comment #9 and comment #14 are applied. In that case acpi work correctly.


I hope that this help.

Thanks,
Francesco
Comment 17 Francesco 2012-08-20 15:10:17 UTC
Created attachment 77981 [details]
dmesg outputs of comment #13 (good?)
Comment 18 Francesco 2012-08-20 15:10:40 UTC
Created attachment 77991 [details]
dmesg outputs of comment #13 (bad?)
Comment 19 Francesco 2012-08-20 15:10:59 UTC
Created attachment 78001 [details]
dmesg outputs of comment #14 (good?)
Comment 20 Francesco 2012-08-20 15:11:18 UTC
Created attachment 78011 [details]
dmesg outputs of comment #14 (bad?)
Comment 21 Feng Tang 2012-08-21 06:24:41 UTC
(In reply to comment #16)
> Here it is the results but something strange happens. I've check three times
> but I think I'm doing something wrong.
> 
> The strange behavior is that adding the string of comment #13 ("debug
> log_buf_len=8M  cpi.debug_layer=0x1009001F acpi.debug_level=0x082002FF") to
> kernel boot parameter,

You can enter /sys/module/acpi/parameters, check debug_level and debug_layer to
see if they are really what we wanted to set.

> the acpi subsystem work vice versa as booting without the string. This mean
> that the acpi works with the if statement whereas stop working when is
> comment
> out (thus applying the patch of comment #1). On the other hand, when all the
> kernels are booted without the previous string the acpi subsystem work as
> described in the bug. Therefore, the string seems to invert the behavior of
> the
> acpi subsystem.

Yes, that's weird, Comment #1 should work while Comment #9 should not which simply add some debug info. So the too many dmesg output must affect the time sequence of EC(Embedded Controller), the EC on your platform is fragile which force the driver to use polling mode for normal transaction, while normal EC will use driver's interrupt driven mode.

Since the dmesg is too many and its ring buffer has been overwritten (the same cmdline set doesn't have so many msg on my platform I can't find any useful info from the dmesg. So need your help to do more tests as follows:

1. add "log_buf_len=8M acpi.debug_layer=0x0000001F acpi.debug_level=0x080002FF" to cmdline
2.a  test comment #1 + comment #14
2.b  test comment #9 + comment #14
2.c  test comment #1
2.d  test comment #9

hope this time, the dmesg won't be wrapped.
Comment 22 Francesco 2012-08-21 13:00:36 UTC
Created attachment 78071 [details]
test of comment #21
Comment 23 Francesco 2012-08-21 13:01:15 UTC
(In reply to comment #21)
> You can enter /sys/module/acpi/parameters, check debug_level and debug_layer
> to
> see if they are really what we wanted to set.
I always check this files and all option are set up correctly.
 
> Yes, that's weird, Comment #1 should work while Comment #9 should not which
> simply add some debug info.
There is something that is worse, now in all four test the acpi work correctly! I do not understand how the "debug string" can change this behavior.


> 1. add "log_buf_len=8M acpi.debug_layer=0x0000001F
> acpi.debug_level=0x080002FF"
> to cmdline
> 2.a  test comment #1 + comment #14
> 2.b  test comment #9 + comment #14
> 2.c  test comment #1
> 2.d  test comment #9

The results are put together in the "test_c21.tar.bz2" file in comment #22
Comment 24 Feng Tang 2012-08-21 15:44:57 UTC
> 
> > Yes, that's weird, Comment #1 should work while Comment #9 should not which
> > simply add some debug info.
> There is something that is worse, now in all four test the acpi work
> correctly!
> I do not understand how the "debug string" can change this behavior.

I don't think these are worse :) As I said before, this issue is highly bound with EC driver, with these lots of ACPI debug info, the timing is changed, which makes your EC driver work "normally" in interrupt mode, not polling mode (you can't find "GPE storm" info any more in these dmesgs).

Could you try the following patch _only_ without any previou patches or those "debug string" in cmdline?

diff --git a/drivers/acpi/ec.c b/drivers/acpi/ec.c
index 7edaccc..9196ab6 100644
--- a/drivers/acpi/ec.c
+++ b/drivers/acpi/ec.c
@@ -71,7 +71,7 @@ enum ec_command {
 #define ACPI_EC_UDELAY_GLK     1000    /* Wait 1ms max. to get global lock */
 #define ACPI_EC_MSI_UDELAY     550     /* Wait 550us for MSI EC */

-#define ACPI_EC_STORM_THRESHOLD 8      /* number of false interrupts
+#define ACPI_EC_STORM_THRESHOLD 20     /* number of false interrupts
                                           per one transaction */

 enum {
Comment 25 Francesco 2012-09-05 11:36:00 UTC
With the proposed patch in comment #24 the acpi event are come back to dead. I've compiled twice kernel to be sure. On dmesg, there is no output.
Now I'm using the kernel with the patch as default on my notebook.
Thanks a lot for the help!
Francesco
Comment 26 Francesco 2012-09-05 14:02:25 UTC
Sorry for my confused English...
In the previous comment, I would mean that the patch fixes the bug.
Francesco
Comment 27 Feng Tang 2012-09-06 09:27:47 UTC
Created attachment 79351 [details]
adding ec_irq_force param

Glag to hear it fix the problem. But changing the threshold may not be accepted as some other platform with broken EC HW need it to be 8.

So could you try the attached patch cleanly, and add a "acpi.ec_irq_force=1" in the kernel cmdline? thanks,

If it works fine, hopefully we can push it to mainline kernel.

- Feng
Comment 28 Francesco 2012-09-11 19:30:37 UTC
No, I've check twice but the patch in comment #27 doesn't solve the issue, the bug is still present.

Thanks,
Comment 29 Feng Tang 2012-09-12 02:47:54 UTC
(In reply to comment #28)
> No, I've check twice but the patch in comment #27 doesn't solve the issue,
> the
> bug is still present.
> 

Thanks for the test, when you tried the patch, did you also add "acpi.ec_irq_force=1" in the kernel cmdline as in comment 27.


btw, we have another patch "make_threshold_module_param.patch", could you try it without any other patch, and add "acpi.ec_storm_threshold=20" in cmdline at the same time?
Comment 30 Feng Tang 2012-09-12 02:48:56 UTC
Created attachment 79881 [details]
make_threshold_parameter.patch
Comment 31 Francesco 2012-09-17 20:21:56 UTC
(In reply to comment #29)
> Thanks for the test, when you tried the patch, did you also add
> "acpi.ec_irq_force=1" in the kernel cmdline as in comment 27.
No, I in test of my comment #28 to add the string.

> btw, we have another patch "make_threshold_module_param.patch", could you try
> it without any other patch, and add "acpi.ec_storm_threshold=20" in cmdline
> at
> the same time?


Yes, I've tried patch of comment #27 and comment #30 WITH the respectively strings and both cases resumes the acpi, both resolve the bug.

Thanks a lot for all your work!
Comment 32 Feng Tang 2012-09-25 03:22:16 UTC
Francesco,

Could you run "sudo dmidecode" on your machine, and post the output here. We'd like to make a patch to change the threshold for your laptop based on dmidecode info. With that patch checked into future and stable kernel, your machine can just work without any action.
Comment 33 Francesco 2012-09-25 19:13:08 UTC
Created attachment 81041 [details]
dmidecode of santech t45

here it is, the dmidecode of my laptop.
Comment 34 Feng Tang 2012-09-26 03:04:48 UTC
Created attachment 81081 [details]
add quirk for Clevo M720

Francesco,

Pls test this patch plug the patch in comment 30, and make sure no "ec_storm_threshold" in the kernel cmdline. thanks,
Comment 35 Francesco 2012-09-28 07:17:39 UTC
ok, I've tested the new kernel with patch 30 and 34, without any extra kernel command line at boot up.
The acpi seems working correctly, all acpi event are detected by acpi_listen.

Thanks!
Francesco
Comment 36 Feng Tang 2012-09-28 07:20:24 UTC
Many thanks for test. Francesco.

I'll submit the 2 patches soon.
Comment 37 Len Brown 2012-10-06 18:52:27 UTC
patches staged for 3.7
Comment 38 Florian Mickler 2012-10-15 21:20:25 UTC
A patch referencing this bug report has been merged in Linux v3.7-rc1:

commit a520d52e99b14ba7db135e916348f12f2a6e09be
Author: Feng Tang <feng.tang@intel.com>
Date:   Fri Sep 28 15:22:00 2012 +0800

    ACPI: EC: Make the GPE storm threshold a module parameter
Comment 39 Florian Mickler 2012-10-15 21:28:41 UTC
A patch referencing this bug report has been merged in Linux v3.7-rc1:

commit 67bfa9b60bd689601554526d144b21d529f78a09
Author: Feng Tang <feng.tang@intel.com>
Date:   Fri Sep 28 15:22:01 2012 +0800

    ACPI: EC: Add a quirk for CLEVO M720T/M730T laptop
Comment 40 Lv Zheng 2015-05-13 08:23:55 UTC
Hi,

Can anyone here try the upstream kernel without this quirk.
We are about to remove the quirk as it is covered by the following commit:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ca37bfdf

Thanks in advance.

Best regards
-Lv