Bug 198169

Summary: Excessive GPE07 interrupts after plugging in power on MacBook Pro
Product: ACPI Reporter: Ronald (ronald)
Component: ECAssignee: Zhang Rui (rui.zhang)
Status: RESOLVED CODE_FIX    
Severity: normal CC: andrey+kernel, danielroschka+kernel, jboyens, lukas, peteryuchuang, rui.zhang, scott
Priority: P1    
Hardware: Intel   
OS: Linux   
Kernel Version: 4.14 (and earlier, at least back to 4.9) Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg output; enabled dynamic debug for ec.c after boot but before plugging in power; truncated after 1 second worth of debug output
dmidecode output
Decompiled DSDT
powertop output, power plugged in, gpe not masked
powertop output, running on battery, gpe not masked
powertop output, power plugged in, gpe masked
lsusb output, power plugged in, gpe not masked
lsusb output, running on battery, gpe not masked
lsusb output, power plugged in, gpe masked
powertop output, power plugged in, gpe masked (with ec/acpi processes)
dmesg output with DEBUG enabled in drivers/acpi/ec.c
Output of 'acpidump'
patch to make sbshc driver handle acpi notification
Patch to fix kernel oops when removing sbs or sbshc module
Patch to fix the GPE storm
Patch to fix kernel oops when removing sbs or sbshc module

Description Ronald 2017-12-15 07:13:56 UTC
Created attachment 261185 [details]
dmesg output; enabled dynamic debug for ec.c after boot but before plugging in power; truncated after 1 second worth of debug output

System is a MacBookPro13,3, though I believe the issue occurs on all other MacBookPro13,* and MacBookPro14,* . If I boot without the power supply plugged in, then running 'watch cat /sys/firmware/acpi/interrupts/gpe07' shows no activity; but as soon as I plug in the power the interrupt count starts to increase by about 5000/second. If I boot with the power supply plugged in, then I see this large number of interrupts as soon as I'm able to log in.

When this happens, the CPU starts spending about 20% in interrupts, and it leads to other issues such as unloading modules taking very long (1 minute) (the longer this has been going on, the longer the module unload takes).
Comment 1 Ronald 2017-12-15 07:14:45 UTC
Created attachment 261187 [details]
dmidecode output
Comment 2 Ronald 2017-12-15 07:15:31 UTC
Created attachment 261189 [details]
Decompiled DSDT
Comment 3 Zhang Rui 2017-12-18 08:51:42 UTC
please boot with boot option acpi_mask_gpe=0x07 and see if the problem is gone or not.
Comment 4 Ronald 2017-12-19 08:37:46 UTC
Well, it partially, but not really, solves the issue. Yes, the gpe is then masked and the counter in /sys/firmware/acpi/interrupts/gpe07 doesn't increase anymore; furthermore the CPU time spent in irq goes down to about 3%. However, that's still 3% as opposed to 1% when I boot with power unplugged, so something's still busy. Also, running powertop I still see the following (after booting with power plugged in and gpe07 masked):

  Power est.    Usage       Events/s    Category       Description
    6.26 W      0.0 pkts/s                Device         nic:virbr0
    3.29 W    100.0%                      Device         USB device: iBridge (Apple Inc.)
    1.49 W    623.9 ms/s      10.0        kWork          acpi_ec_event_processor
    916 mW      5.1 ms/s     260.8        Timer          tick_sched_timer
    809 mW    332.6 ms/s      10.0        kWork          acpi_os_execute_deferred
    488 mW      0.8 pkts/s                Device         Network interface: wlp0s20f0u7 (rtl8812au)
    122 mW     37.2 ms/s      10.1        kWork          acpi_ec_event_handler

(the power numbers are largely bogus; but Usage is interesting, and just the fact that 3 acpi_* methods show up near the top - these don't show up when booting with power unplugged). Also, the long time to remove some modules is still there too (and not present if booted with power unplugged).
Comment 5 Zhang Rui 2018-04-03 06:06:04 UTC
about " 3.29 W    100.0%                      Device         USB device: iBridge (Apple Inc.)", can you see this line with power unplugged?

I think you can attach "lsusb -v" and "powertop -r foo" output for
1. boot without power plugged
2. boot with power plugged
3. boot with power plugged and GPE masked
Comment 6 Zhang Rui 2018-05-07 06:20:56 UTC
ping...
Comment 7 Ronald 2018-05-08 04:44:54 UTC
Created attachment 275811 [details]
powertop output, power plugged in, gpe not masked
Comment 8 Ronald 2018-05-08 04:45:43 UTC
Created attachment 275813 [details]
powertop output, running on battery, gpe not masked
Comment 9 Ronald 2018-05-08 04:46:15 UTC
Created attachment 275815 [details]
powertop output, power plugged in, gpe masked
Comment 10 Ronald 2018-05-08 04:46:41 UTC
Created attachment 275817 [details]
lsusb output, power plugged in, gpe not masked
Comment 11 Ronald 2018-05-08 04:47:26 UTC
Created attachment 275819 [details]
lsusb output, running on battery, gpe not masked
Comment 12 Ronald 2018-05-08 04:47:45 UTC
Created attachment 275821 [details]
lsusb output, power plugged in, gpe masked
Comment 13 Ronald 2018-05-08 05:19:23 UTC
Apologies for the delay, and thanks for the ping - this slipped below my radar. I've attached the desired info now.

First of all, please ignore the iBridge device - that's just the "touchbar" on some of the current MBP's. As mentioned previously, the power value associated with it is quite bogus.

Second, looking at things again, it appears I can't reproduce my last reported case: with the gpe masked I now see no sign of any ec/acpi related tasks in powertop, and the irq usage is down to 1%, just like when booting with the power unplugged. I went back to kernel 4.13 to see if this was a kernel version issue, but same behaviour there. So it looks like masking the gpe does indeed fully resolve resource usage.

Question: what is the impact of masking this gpe?

Lastly, I mentioned the long module unload times (for at least one module): I traced that down to the module calling 'acpi_remove_gpe_handler()' which ends up calling 'acpi_os_wait_events_complete()'; in the latter, the 'flush_workqueue(kacpi_notify_wq);' is the one that is taking seconds to minutes to complete (depending on how long the system has been up - the longer the uptime, the greater the number of items in the workqueue). Again, this (increasing) delay here is seen even with the gpe masked, but not when booting with the power unplugged. I suppose this could be related to some other gpe event, but the only other one that shows any significant count is gpe17 which is used by the keyboard/touchpad, and from what I can tell is not affected by whether power is plugged in or not at boot.
Comment 14 Peter Y. Chuang 2018-05-08 07:58:16 UTC
I have a MacBookPro14,1, and I have been experimenting with acpi_mask_gpe=0x07 for quite a while now. From my observation, I think those acpi_* entries appear on powertop after changing power source. For instance, after masking gpe, there aren't any acpi_* entries on powertop when I boot the system on battery, but once I plug the power in while the laptop is still running, those acpi_* entries appear shortly after. If I then unplug the laptop, those acpi_* entries will stay there, and the power consumption will be slightly higher than before the plugging and unplugging.
Comment 15 Ronald 2018-05-09 04:02:53 UTC
Created attachment 275871 [details]
powertop output, power plugged in, gpe masked (with ec/acpi processes)
Comment 16 Ronald 2018-05-09 04:14:49 UTC
Peter: ah, good idea. I tried to reproduce this today, but what do you know: now every time I boot (with power plugged in and gpe masked) I have a bunch of ec/acpi processes! So, added another powertop output showing today's situation. Notice in particular the much higher number of wakeups.

When I next run into the case where it boots without them, I verify the unplugging/plugging triggers them again, though I have little doubt that it will.
Comment 17 Ronald 2018-05-21 00:02:16 UTC
I noticed recently that with the GPE masked, half the time my system boots with no battery present (/sys/class/power_supply/BAT0 is missing, and GNOME doesn't show any battery indicator either); this coincides with there being no acpi_ec_* entries in powertop (and visa versa: those entries are there when the battery is present). So this masking is not a good workaround.
Comment 18 Zhang Rui 2018-05-21 02:34:45 UTC
(In reply to Ronald from comment #17)
> I noticed recently that with the GPE masked, half the time my system boots
> with no battery present (/sys/class/power_supply/BAT0 is missing, and GNOME
> doesn't show any battery indicator either);

hmmm, can you please confirm the battery are 100% present if you boot without the GPE masked?


> this coincides with there being
> no acpi_ec_* entries in powertop (and visa versa: those entries are there
> when the battery is present). So this masking is not a good workaround.

please uncomment
/* #define DEBUG */
in drivers/acpi/ec.c and rebuild your kernel, boot with kernel parameter "log_buf_len=4M", and attach the dmesg output when you see the acpi_ec_* enteries in powertop.
I'd like to check which Query method keeps on executing.
Comment 19 Ronald 2018-05-21 04:21:55 UTC
(In reply to Zhang Rui from comment #18)
> (In reply to Ronald from comment #17)
> > I noticed recently that with the GPE masked, half the time my system boots
> > with no battery present (/sys/class/power_supply/BAT0 is missing, and GNOME
> > doesn't show any battery indicator either);
> 
> hmmm, can you please confirm the battery are 100% present if you boot
> without the GPE masked?

Yes, I can confirm that: I rebooted over 20 times without the GPE masked, and saw the battery every single time; then, just to be sure, I rebooted a dozen more times with the GPE masked and saw no battery in 5 of the cases.

> please uncomment
> /* #define DEBUG */
> in drivers/acpi/ec.c and rebuild your kernel, boot with kernel parameter
> "log_buf_len=4M", and attach the dmesg output when you see the acpi_ec_*
> enteries in powertop.
> I'd like to check which Query method keeps on executing.

I presume you wanted this with the GPE masked - attaching dmesg the output.
Comment 20 Ronald 2018-05-21 04:22:52 UTC
Created attachment 276077 [details]
dmesg output with DEBUG enabled in drivers/acpi/ec.c
Comment 21 Zhang Rui 2018-05-21 05:08:22 UTC
it is _Q10 that keeps on firing.

                    Method (_Q10, 0, NotSerialized)  // _Qxx: EC Query
                    {
                        If (OSDW ())
                        {
                            Notify (SMB0, 0x80) // Status Change
                        }
                        ElseIf (SSTS & 0x40)
                        {
                            If (!Acquire (^SMB0.SMTX, 0xFFFF))
                            {
                                Local0 = (SAAD >> One)
                                If (Local0 == 0x0A)
                                {
                                    ^^^^BAT0.BNOT (SADW)
                                }

                                SSTS = Zero
                                Release (^SMB0.SMTX)
                            }
                        }
                    }

can you please check if the problem is gone if you boot with acpi_osi="!Darwin"
Comment 22 Ronald 2018-05-23 01:35:59 UTC
Yes, booting with !Darwin (and no gpe mask) the excessive interrupts go away, as well as the acpi_ec_* calls in powertop. Of course it breaks other things...

Is this a case of some missing acknowledgement, or why would _Q10 keep firing like this?
Comment 23 Zhang Rui 2018-05-23 01:57:03 UTC
if !Darwin fixes the GPE 0x07 storm at the same time, it is probably because, in MacOS, there is some handler for GPE 0x07 that can clear that GPE, which is missing in Linux.

and as we can see from the _Q10 control method
if MacOS
   do something for SMB0
else
   do other tricks
In Linux, OSDW returns true, but apparently we didn't handle SMB0 notification 0x80 properly, and that's probably why _Q10 keeps firing.

This seems more like a gap rather than a bug. Please attach the full acpidump.
Comment 24 Ronald 2018-05-25 06:20:45 UTC
Created attachment 276177 [details]
Output of 'acpidump'
Comment 25 Ronald 2018-05-27 03:20:26 UTC
Btw., sorry for the bother, but how did you determine from the trace that is was _Q10 firing? Just trying to understand how things are working here.
Comment 26 Zhang Rui 2018-05-30 04:19:30 UTC
[  237.328346] ACPI: EC: ##### Query(0x10) started #####

Query(0x10) means _Q10 method is invoked.
Comment 27 Ronald 2018-06-04 03:47:56 UTC
Oh, duh! Thanks.

So I played around with the 'if (OSDW())' sections, and narrowed things down to two places:
----------------------------------------------------
--- dsdt.dsl.orig	2017-01-21 17:30:21.723222953 -0800
+++ dsdt.dsl	2018-06-03 20:24:29.418588639 -0700
@@ -5507,7 +5507,7 @@
                         Mutex (SMTX, 0x00)
                         Method (_STA, 0, NotSerialized)  // _STA: Status
                         {
-                            If (OSDW ())
+                            If (!OSDW ())
                             {
                                 Return (0x0F)
                             }
@@ -5741,7 +5741,7 @@
                         })
                         Method (_STA, 0, NotSerialized)  // _STA: Status
                         {
-                            If (OSDW ())
+                            If (!OSDW ())
                             {
                                 Return (Zero)
                             }
-------------------------------------------------------
With these the GPE07 interrupt is now well behaved, and the battery status changes immediately upon unplug/plug of the power-supply (this was another related issue, where it often took a minute or more for the battery status to change after unplugging/plugging in).

In particular the first change (in SMB0._STA()) actually avoids the excessive interrupts - with this I see just around 70 interrupts on an unplug and around 200 on a plug-in, but that's all. The second change (in BAT0._STA()) is required so the battery is visible (otherwise the first change causes the battery to disappear, similar to was reported above when masking the GPE07), and as mentioned also has the effect of making the status changes visible immediately.

Not sure now how to
Comment 28 Ronald 2018-06-04 04:00:53 UTC
(oops, hit the save button a bit early)

Not sure how to handle this in the kernel so we don't need to modify the DSDT. The SMB0.STA() change confuses me: I don't understand why returning a status of 0 (meaning not present, not enabled, etc) "fixes" the interrupt issue.
Comment 29 Zhang Rui 2018-06-25 06:58:29 UTC
that's because we have in kernel driver to handle the SMB0 device, which invokes _Q10 method.

                    Method (_Q10, 0, NotSerialized)  // _Qxx: EC Query
                    {
                        If (OSDW ())
                        {
                            Notify (SMB0, 0x80) // Status Change
                        }
                        ElseIf (SSTS & 0x40)
                        {
                            If (!Acquire (^SMB0.SMTX, 0xFFFF))
                            {
                                Local0 = (SAAD >> One)
                                If (Local0 == 0x0A)
                                {
                                    ^^^^BAT0.BNOT (SADW)
                                }

                                SSTS = Zero
                                Release (^SMB0.SMTX)
                            }
                        }
                    }

are the interrupts still there if you change this OSDW() to !OSDW()?
Comment 30 Ronald 2018-07-02 09:27:51 UTC
(In reply to Zhang Rui from comment #29)
> that's because we have in kernel driver to handle the SMB0 device, which
> invokes _Q10 method.

Well, yes, but it turns out it's not called. With the help of your hints so far I dug into the code for ec.c, sbshc.c, and sbs.c (the relevant drivers here AFAICT), and noticed the following.

First of all, the relevant DSDT snippets are:

                Device (EC)
                {
                    ...
                    Name (_GPE, 0x07)  // _GPE: General Purpose Events

                    Device (SMB0)
                    {
                        Name (_HID, "ACPI0001" /* SMBus 1.0 Host Controller */)  // _HID: Hardware ID
                        Name (_EC, 0x2010)  // _EC_: Embedded Controller
                        ...

                        Device (SBS0)
                        {
                            Name (_HID, "ACPI0002" /* Smart Battery Subsystem */)  // _HID: Hardware ID
                            Name (_SBS, One)  // _SBS: Smart Battery Subsystem
                        }
                        ...
                    }

                    Method (_Q10, 0, NotSerialized)  // _Qxx: EC Query
                    {
                        ...
                    }
                }

Now, in ec.c we have the following that gets run when the EC ACPI device is registered (just the relevant code snippets):

        ec_install_handlers():
            acpi_walk_namespace(ACPI_TYPE_METHOD, ec->handle, 1,
                                acpi_ec_register_query_methods,
                                NULL, ec, NULL);

        acpi_ec_register_query_methods():
            if (ACPI_SUCCESS(status) && sscanf(node_name, "_Q%x", &value) == 1)
                acpi_ec_add_query_handler(ec, value, handle, NULL, NULL);

So this registers handlers for all _Qxx methods defined for EC device.

However, we also have this in sbshc.c:

        acpi_smbus_hc_add()
            status = acpi_evaluate_integer(device->handle, "_EC", NULL, &val);
            hc->query_bit = val & 0xff;  <== 0x10
            acpi_ec_add_query_handler(hc->ec, hc->query_bit, NULL, smbus_alarm, ch)

This is triggered when the SMB0 device is registered; and thanks to 'Name (_EC, 0x2010)' this also registers a handler for query_bit 10; but since this happens after the registration of the EC device, this new handler replaces the earlier registered one that would've called _Q10. I.e. _Q10 is never called (or maybe only a few times in a short interval between the registration of the EC device and the SMB0 device).

> are the interrupts still there if you change this OSDW() to !OSDW()?

Yes. This is actually the first thing I tried. But due to the above any changes here make no difference.

Given the above knowledge, I tried a couple things in the sbshc.c driver: 1) not registering the sbshc handler so that the EC registered handler is used which calls _Q10, and 2) changing the sbshc handler to both do it's original work (which is calling acpi_sbs_callback()) and invoke the _Q10 method like the EC registered handler would do. But with 1) I got no battery device after boot, and with 2) nothing seemed to change, whether using the original DSDT or changing the OSDW check in _Q10.

Looks like I need to dig further.
Comment 31 Zhang Rui 2018-08-29 02:56:02 UTC
It has been some time after I read this thread last time, so let me refresh my memory.

1. GPE 0x07 interrupt storm by default
2. GPE 0x07 interrupt storm is gone, together with the battery device if acpi_osi="!Darwin"
3. Hacking the DSDT to disable SMB0, but enable BAT0 makes everything work.


To me, with default kernel, where OSDW() returns true, BAT0 is actually disabled, but we can still see the battery devices, and I guess it is the smart battery device.

So, IMO, the right fix should be make sbshc driver to handle charge change event correct, and use it for battery management, rather than disabling the sbshc device.
Comment 32 Zhang Rui 2018-08-29 02:57:36 UTC
Created attachment 278191 [details]
patch to make sbshc driver handle acpi notification

please first confirm the problem still exists in the latest upstream kernel, say, 4.19-rc1, and see please check if this patch helps or not.
Comment 33 Ronald 2018-09-04 07:44:16 UTC
Thanks taking another look at this. Sorry for the delay, but I've spent some time trying to understand things and debug the issue, with some success.

First of all, yes, the problem still exists (4.19-rc2), and in fact is even worse on recent kernels (4.17 through 4.19) where the cpu spends about 58% in irq.

Second, thanks for the patch, but it doesn't help any for the simple reason that smbus_alarm() is already being called due to it being registered as a query-handler for _Q10 - this line in sbshc.c:

        acpi_ec_add_query_handler(hc->ec, hc->query_bit, NULL, smbus_alarm, hc);

> 1. GPE 0x07 interrupt storm by default
> 2. GPE 0x07 interrupt storm is gone, together with the battery device if
> acpi_osi="!Darwin"
> 3. Hacking the DSDT to disable SMB0, but enable BAT0 makes everything work.
>
> To me, with default kernel, where OSDW() returns true, BAT0 is actually
> disabled, but we can still see the battery devices, and I guess it is the
> smart battery device.

Correct.

> So, IMO, the right fix should be make sbshc driver to handle charge change
> event correct, and use it for battery management, rather than disabling the
> sbshc device.

It already does this - see my note about smbus_alarm() above. However, due to things "working" with SBS0 disabled, there must be something happening (or not happening) in the sbs/sbshc drivers.

So I spent some time playing around with things here, until I finally discovered the apparent culprit: calling acpi_ac_get_present() from acpi_sbs_callback() (which in turn is queued for execution from smbus_alarm()). Commenting this out causes the GPE storm to go away (I see the counter for the GPE increase by almost 200 each time the power is plugged/unplugged, and the smbus_alarm() callback is called half a dozen times in quick succession, but then things go quiet again). I think somehow making that smbus-read operation triggers the SBS to fire the GPE again, causing an endless loop.

I printed out the status returned by that smbus-read and it's always 0xFFFF, no matter whether the power is plugged in or not. Looking at the spec, that looks bogus to me (besides the fact that bit 15 is always set, i.e. it's obviously not reflecting the presence of the charger). And besides, there's an ADP1 device present too. Therefore it looks to me like the fix is probably to just not create an sbs-charger device on these systems (MBP13,* and MBP14,* - don't know about others).

Here is a quick sample patch - does this look reasonable to you?
--------------------------------------------------------
diff --git a/drivers/acpi/sbs.c b/drivers/acpi/sbs.c
index 045369c431de..dc2017991e91 100644
--- a/drivers/acpi/sbs.c
+++ b/drivers/acpi/sbs.c
@@ -441,9 +441,13 @@ static int acpi_ac_get_present(struct acpi_sbs *sbs)
 
 	/*
 	 * The spec requires that bit 4 always be 1. If it's not set, assume
-	 * that the implementation doesn't support an SBS charger
+	 * that the implementation doesn't support an SBS charger.
+	 *
+	 * And on some MacBooks a status of 0xffff is always returned, no
+	 * matter whether the charger is plugged in or not, which is also
+	 * wrong, so ignore the SBS charger for those too.
 	 */
-	if (!((status >> 4) & 0x1))
+	if (!((status >> 4) & 0x1) || status == 0xffff)
 		return -ENODEV;
 
 	sbs->charger_present = (status >> 15) & 0x1;
--------------------------------------------------------

With this the only issue remaining is that in Gnome the battery indicator now shows the opposite state, i.e. shows on battery when plugged in and on power when unplugged - still need to chase that one down (though /sys/class/power_supply/ADP1/online shows the correct value).
Comment 34 Ronald 2018-09-04 08:45:17 UTC
Created attachment 278285 [details]
Patch to fix kernel oops when removing sbs or sbshc module

Discovered this while removing these modules during the GPE storm.
Comment 35 Ronald 2018-09-04 08:46:42 UTC
Created attachment 278287 [details]
Patch to fix the GPE storm

This is the same as the proposed patch, but in proper patch format.
Comment 36 Lukas Wunner 2018-09-04 09:24:45 UTC
FWIW, Matthew Garrett encountered SBS issues years ago:
https://mjg59.dreamwidth.org/29954.html

This lead to:
https://git.kernel.org/linus/9faf6136ff46 ("ACPI / SBS: Disable smart battery manager on Apple")
https://git.kernel.org/linus/3031cddea633 ("ACPI / SBS: Don't assume the existence of an SBS charger")
Comment 37 Ronald 2018-09-05 04:02:33 UTC
Created attachment 278303 [details]
Patch to fix kernel oops when removing sbs or sbshc module

Slight rework of the patch: moved the flushing of the queue into acpi_smbus_unregister_callback() as that keeps knowledge of how deferred execution is done all in one place (sbshc).
Comment 38 Ronald 2018-09-05 04:04:59 UTC
The issue with the Gnome battery indicator is a bug in upower (https://gitlab.freedesktop.org/upower/upower/issues/56) and fixed in version 0.99.8 thereof. I.e. not a kernel issue.
Comment 39 Ronald 2018-09-05 04:11:17 UTC
Thanks Lukas. Saw the commits (referenced 30341cdde in the commit message), but missed the blog entry - too bad,that might have helped debug this.
Comment 40 Zhang Rui 2018-09-29 14:08:04 UTC
@Ronald, the code change in comment #33 seems reasonable, can you please send it to linux-acpi mailing list?
Comment 41 Ronald 2018-10-01 02:57:43 UTC
(In reply to Zhang Rui from comment #40)
> @Ronald, the code change in comment #33 seems reasonable, can you please
> send it to linux-acpi mailing list?

Done. Thanks for reviewing and all your help here.