Bug 198169
Created attachment 261187 [details]
dmidecode output
Created attachment 261189 [details]
Decompiled DSDT
please boot with boot option acpi_mask_gpe=0x07 and see if the problem is gone or not. Well, it partially, but not really, solves the issue. Yes, the gpe is then masked and the counter in /sys/firmware/acpi/interrupts/gpe07 doesn't increase anymore; furthermore the CPU time spent in irq goes down to about 3%. However, that's still 3% as opposed to 1% when I boot with power unplugged, so something's still busy. Also, running powertop I still see the following (after booting with power plugged in and gpe07 masked): Power est. Usage Events/s Category Description 6.26 W 0.0 pkts/s Device nic:virbr0 3.29 W 100.0% Device USB device: iBridge (Apple Inc.) 1.49 W 623.9 ms/s 10.0 kWork acpi_ec_event_processor 916 mW 5.1 ms/s 260.8 Timer tick_sched_timer 809 mW 332.6 ms/s 10.0 kWork acpi_os_execute_deferred 488 mW 0.8 pkts/s Device Network interface: wlp0s20f0u7 (rtl8812au) 122 mW 37.2 ms/s 10.1 kWork acpi_ec_event_handler (the power numbers are largely bogus; but Usage is interesting, and just the fact that 3 acpi_* methods show up near the top - these don't show up when booting with power unplugged). Also, the long time to remove some modules is still there too (and not present if booted with power unplugged). about " 3.29 W 100.0% Device USB device: iBridge (Apple Inc.)", can you see this line with power unplugged? I think you can attach "lsusb -v" and "powertop -r foo" output for 1. boot without power plugged 2. boot with power plugged 3. boot with power plugged and GPE masked ping... Created attachment 275811 [details]
powertop output, power plugged in, gpe not masked
Created attachment 275813 [details]
powertop output, running on battery, gpe not masked
Created attachment 275815 [details]
powertop output, power plugged in, gpe masked
Created attachment 275817 [details]
lsusb output, power plugged in, gpe not masked
Created attachment 275819 [details]
lsusb output, running on battery, gpe not masked
Created attachment 275821 [details]
lsusb output, power plugged in, gpe masked
Apologies for the delay, and thanks for the ping - this slipped below my radar. I've attached the desired info now. First of all, please ignore the iBridge device - that's just the "touchbar" on some of the current MBP's. As mentioned previously, the power value associated with it is quite bogus. Second, looking at things again, it appears I can't reproduce my last reported case: with the gpe masked I now see no sign of any ec/acpi related tasks in powertop, and the irq usage is down to 1%, just like when booting with the power unplugged. I went back to kernel 4.13 to see if this was a kernel version issue, but same behaviour there. So it looks like masking the gpe does indeed fully resolve resource usage. Question: what is the impact of masking this gpe? Lastly, I mentioned the long module unload times (for at least one module): I traced that down to the module calling 'acpi_remove_gpe_handler()' which ends up calling 'acpi_os_wait_events_complete()'; in the latter, the 'flush_workqueue(kacpi_notify_wq);' is the one that is taking seconds to minutes to complete (depending on how long the system has been up - the longer the uptime, the greater the number of items in the workqueue). Again, this (increasing) delay here is seen even with the gpe masked, but not when booting with the power unplugged. I suppose this could be related to some other gpe event, but the only other one that shows any significant count is gpe17 which is used by the keyboard/touchpad, and from what I can tell is not affected by whether power is plugged in or not at boot. I have a MacBookPro14,1, and I have been experimenting with acpi_mask_gpe=0x07 for quite a while now. From my observation, I think those acpi_* entries appear on powertop after changing power source. For instance, after masking gpe, there aren't any acpi_* entries on powertop when I boot the system on battery, but once I plug the power in while the laptop is still running, those acpi_* entries appear shortly after. If I then unplug the laptop, those acpi_* entries will stay there, and the power consumption will be slightly higher than before the plugging and unplugging. Created attachment 275871 [details]
powertop output, power plugged in, gpe masked (with ec/acpi processes)
Peter: ah, good idea. I tried to reproduce this today, but what do you know: now every time I boot (with power plugged in and gpe masked) I have a bunch of ec/acpi processes! So, added another powertop output showing today's situation. Notice in particular the much higher number of wakeups. When I next run into the case where it boots without them, I verify the unplugging/plugging triggers them again, though I have little doubt that it will. I noticed recently that with the GPE masked, half the time my system boots with no battery present (/sys/class/power_supply/BAT0 is missing, and GNOME doesn't show any battery indicator either); this coincides with there being no acpi_ec_* entries in powertop (and visa versa: those entries are there when the battery is present). So this masking is not a good workaround. (In reply to Ronald from comment #17) > I noticed recently that with the GPE masked, half the time my system boots > with no battery present (/sys/class/power_supply/BAT0 is missing, and GNOME > doesn't show any battery indicator either); hmmm, can you please confirm the battery are 100% present if you boot without the GPE masked? > this coincides with there being > no acpi_ec_* entries in powertop (and visa versa: those entries are there > when the battery is present). So this masking is not a good workaround. please uncomment /* #define DEBUG */ in drivers/acpi/ec.c and rebuild your kernel, boot with kernel parameter "log_buf_len=4M", and attach the dmesg output when you see the acpi_ec_* enteries in powertop. I'd like to check which Query method keeps on executing. (In reply to Zhang Rui from comment #18) > (In reply to Ronald from comment #17) > > I noticed recently that with the GPE masked, half the time my system boots > > with no battery present (/sys/class/power_supply/BAT0 is missing, and GNOME > > doesn't show any battery indicator either); > > hmmm, can you please confirm the battery are 100% present if you boot > without the GPE masked? Yes, I can confirm that: I rebooted over 20 times without the GPE masked, and saw the battery every single time; then, just to be sure, I rebooted a dozen more times with the GPE masked and saw no battery in 5 of the cases. > please uncomment > /* #define DEBUG */ > in drivers/acpi/ec.c and rebuild your kernel, boot with kernel parameter > "log_buf_len=4M", and attach the dmesg output when you see the acpi_ec_* > enteries in powertop. > I'd like to check which Query method keeps on executing. I presume you wanted this with the GPE masked - attaching dmesg the output. Created attachment 276077 [details]
dmesg output with DEBUG enabled in drivers/acpi/ec.c
it is _Q10 that keeps on firing. Method (_Q10, 0, NotSerialized) // _Qxx: EC Query { If (OSDW ()) { Notify (SMB0, 0x80) // Status Change } ElseIf (SSTS & 0x40) { If (!Acquire (^SMB0.SMTX, 0xFFFF)) { Local0 = (SAAD >> One) If (Local0 == 0x0A) { ^^^^BAT0.BNOT (SADW) } SSTS = Zero Release (^SMB0.SMTX) } } } can you please check if the problem is gone if you boot with acpi_osi="!Darwin" Yes, booting with !Darwin (and no gpe mask) the excessive interrupts go away, as well as the acpi_ec_* calls in powertop. Of course it breaks other things... Is this a case of some missing acknowledgement, or why would _Q10 keep firing like this? if !Darwin fixes the GPE 0x07 storm at the same time, it is probably because, in MacOS, there is some handler for GPE 0x07 that can clear that GPE, which is missing in Linux. and as we can see from the _Q10 control method if MacOS do something for SMB0 else do other tricks In Linux, OSDW returns true, but apparently we didn't handle SMB0 notification 0x80 properly, and that's probably why _Q10 keeps firing. This seems more like a gap rather than a bug. Please attach the full acpidump. Created attachment 276177 [details]
Output of 'acpidump'
Btw., sorry for the bother, but how did you determine from the trace that is was _Q10 firing? Just trying to understand how things are working here. [ 237.328346] ACPI: EC: ##### Query(0x10) started ##### Query(0x10) means _Q10 method is invoked. Oh, duh! Thanks. So I played around with the 'if (OSDW())' sections, and narrowed things down to two places: ---------------------------------------------------- --- dsdt.dsl.orig 2017-01-21 17:30:21.723222953 -0800 +++ dsdt.dsl 2018-06-03 20:24:29.418588639 -0700 @@ -5507,7 +5507,7 @@ Mutex (SMTX, 0x00) Method (_STA, 0, NotSerialized) // _STA: Status { - If (OSDW ()) + If (!OSDW ()) { Return (0x0F) } @@ -5741,7 +5741,7 @@ }) Method (_STA, 0, NotSerialized) // _STA: Status { - If (OSDW ()) + If (!OSDW ()) { Return (Zero) } ------------------------------------------------------- With these the GPE07 interrupt is now well behaved, and the battery status changes immediately upon unplug/plug of the power-supply (this was another related issue, where it often took a minute or more for the battery status to change after unplugging/plugging in). In particular the first change (in SMB0._STA()) actually avoids the excessive interrupts - with this I see just around 70 interrupts on an unplug and around 200 on a plug-in, but that's all. The second change (in BAT0._STA()) is required so the battery is visible (otherwise the first change causes the battery to disappear, similar to was reported above when masking the GPE07), and as mentioned also has the effect of making the status changes visible immediately. Not sure now how to (oops, hit the save button a bit early) Not sure how to handle this in the kernel so we don't need to modify the DSDT. The SMB0.STA() change confuses me: I don't understand why returning a status of 0 (meaning not present, not enabled, etc) "fixes" the interrupt issue. that's because we have in kernel driver to handle the SMB0 device, which invokes _Q10 method. Method (_Q10, 0, NotSerialized) // _Qxx: EC Query { If (OSDW ()) { Notify (SMB0, 0x80) // Status Change } ElseIf (SSTS & 0x40) { If (!Acquire (^SMB0.SMTX, 0xFFFF)) { Local0 = (SAAD >> One) If (Local0 == 0x0A) { ^^^^BAT0.BNOT (SADW) } SSTS = Zero Release (^SMB0.SMTX) } } } are the interrupts still there if you change this OSDW() to !OSDW()? (In reply to Zhang Rui from comment #29) > that's because we have in kernel driver to handle the SMB0 device, which > invokes _Q10 method. Well, yes, but it turns out it's not called. With the help of your hints so far I dug into the code for ec.c, sbshc.c, and sbs.c (the relevant drivers here AFAICT), and noticed the following. First of all, the relevant DSDT snippets are: Device (EC) { ... Name (_GPE, 0x07) // _GPE: General Purpose Events Device (SMB0) { Name (_HID, "ACPI0001" /* SMBus 1.0 Host Controller */) // _HID: Hardware ID Name (_EC, 0x2010) // _EC_: Embedded Controller ... Device (SBS0) { Name (_HID, "ACPI0002" /* Smart Battery Subsystem */) // _HID: Hardware ID Name (_SBS, One) // _SBS: Smart Battery Subsystem } ... } Method (_Q10, 0, NotSerialized) // _Qxx: EC Query { ... } } Now, in ec.c we have the following that gets run when the EC ACPI device is registered (just the relevant code snippets): ec_install_handlers(): acpi_walk_namespace(ACPI_TYPE_METHOD, ec->handle, 1, acpi_ec_register_query_methods, NULL, ec, NULL); acpi_ec_register_query_methods(): if (ACPI_SUCCESS(status) && sscanf(node_name, "_Q%x", &value) == 1) acpi_ec_add_query_handler(ec, value, handle, NULL, NULL); So this registers handlers for all _Qxx methods defined for EC device. However, we also have this in sbshc.c: acpi_smbus_hc_add() status = acpi_evaluate_integer(device->handle, "_EC", NULL, &val); hc->query_bit = val & 0xff; <== 0x10 acpi_ec_add_query_handler(hc->ec, hc->query_bit, NULL, smbus_alarm, ch) This is triggered when the SMB0 device is registered; and thanks to 'Name (_EC, 0x2010)' this also registers a handler for query_bit 10; but since this happens after the registration of the EC device, this new handler replaces the earlier registered one that would've called _Q10. I.e. _Q10 is never called (or maybe only a few times in a short interval between the registration of the EC device and the SMB0 device). > are the interrupts still there if you change this OSDW() to !OSDW()? Yes. This is actually the first thing I tried. But due to the above any changes here make no difference. Given the above knowledge, I tried a couple things in the sbshc.c driver: 1) not registering the sbshc handler so that the EC registered handler is used which calls _Q10, and 2) changing the sbshc handler to both do it's original work (which is calling acpi_sbs_callback()) and invoke the _Q10 method like the EC registered handler would do. But with 1) I got no battery device after boot, and with 2) nothing seemed to change, whether using the original DSDT or changing the OSDW check in _Q10. Looks like I need to dig further. It has been some time after I read this thread last time, so let me refresh my memory. 1. GPE 0x07 interrupt storm by default 2. GPE 0x07 interrupt storm is gone, together with the battery device if acpi_osi="!Darwin" 3. Hacking the DSDT to disable SMB0, but enable BAT0 makes everything work. To me, with default kernel, where OSDW() returns true, BAT0 is actually disabled, but we can still see the battery devices, and I guess it is the smart battery device. So, IMO, the right fix should be make sbshc driver to handle charge change event correct, and use it for battery management, rather than disabling the sbshc device. Created attachment 278191 [details]
patch to make sbshc driver handle acpi notification
please first confirm the problem still exists in the latest upstream kernel, say, 4.19-rc1, and see please check if this patch helps or not.
Thanks taking another look at this. Sorry for the delay, but I've spent some time trying to understand things and debug the issue, with some success. First of all, yes, the problem still exists (4.19-rc2), and in fact is even worse on recent kernels (4.17 through 4.19) where the cpu spends about 58% in irq. Second, thanks for the patch, but it doesn't help any for the simple reason that smbus_alarm() is already being called due to it being registered as a query-handler for _Q10 - this line in sbshc.c: acpi_ec_add_query_handler(hc->ec, hc->query_bit, NULL, smbus_alarm, hc); > 1. GPE 0x07 interrupt storm by default > 2. GPE 0x07 interrupt storm is gone, together with the battery device if > acpi_osi="!Darwin" > 3. Hacking the DSDT to disable SMB0, but enable BAT0 makes everything work. > > To me, with default kernel, where OSDW() returns true, BAT0 is actually > disabled, but we can still see the battery devices, and I guess it is the > smart battery device. Correct. > So, IMO, the right fix should be make sbshc driver to handle charge change > event correct, and use it for battery management, rather than disabling the > sbshc device. It already does this - see my note about smbus_alarm() above. However, due to things "working" with SBS0 disabled, there must be something happening (or not happening) in the sbs/sbshc drivers. So I spent some time playing around with things here, until I finally discovered the apparent culprit: calling acpi_ac_get_present() from acpi_sbs_callback() (which in turn is queued for execution from smbus_alarm()). Commenting this out causes the GPE storm to go away (I see the counter for the GPE increase by almost 200 each time the power is plugged/unplugged, and the smbus_alarm() callback is called half a dozen times in quick succession, but then things go quiet again). I think somehow making that smbus-read operation triggers the SBS to fire the GPE again, causing an endless loop. I printed out the status returned by that smbus-read and it's always 0xFFFF, no matter whether the power is plugged in or not. Looking at the spec, that looks bogus to me (besides the fact that bit 15 is always set, i.e. it's obviously not reflecting the presence of the charger). And besides, there's an ADP1 device present too. Therefore it looks to me like the fix is probably to just not create an sbs-charger device on these systems (MBP13,* and MBP14,* - don't know about others). Here is a quick sample patch - does this look reasonable to you? -------------------------------------------------------- diff --git a/drivers/acpi/sbs.c b/drivers/acpi/sbs.c index 045369c431de..dc2017991e91 100644 --- a/drivers/acpi/sbs.c +++ b/drivers/acpi/sbs.c @@ -441,9 +441,13 @@ static int acpi_ac_get_present(struct acpi_sbs *sbs) /* * The spec requires that bit 4 always be 1. If it's not set, assume - * that the implementation doesn't support an SBS charger + * that the implementation doesn't support an SBS charger. + * + * And on some MacBooks a status of 0xffff is always returned, no + * matter whether the charger is plugged in or not, which is also + * wrong, so ignore the SBS charger for those too. */ - if (!((status >> 4) & 0x1)) + if (!((status >> 4) & 0x1) || status == 0xffff) return -ENODEV; sbs->charger_present = (status >> 15) & 0x1; -------------------------------------------------------- With this the only issue remaining is that in Gnome the battery indicator now shows the opposite state, i.e. shows on battery when plugged in and on power when unplugged - still need to chase that one down (though /sys/class/power_supply/ADP1/online shows the correct value). Created attachment 278285 [details]
Patch to fix kernel oops when removing sbs or sbshc module
Discovered this while removing these modules during the GPE storm.
Created attachment 278287 [details]
Patch to fix the GPE storm
This is the same as the proposed patch, but in proper patch format.
FWIW, Matthew Garrett encountered SBS issues years ago: https://mjg59.dreamwidth.org/29954.html This lead to: https://git.kernel.org/linus/9faf6136ff46 ("ACPI / SBS: Disable smart battery manager on Apple") https://git.kernel.org/linus/3031cddea633 ("ACPI / SBS: Don't assume the existence of an SBS charger") Created attachment 278303 [details]
Patch to fix kernel oops when removing sbs or sbshc module
Slight rework of the patch: moved the flushing of the queue into acpi_smbus_unregister_callback() as that keeps knowledge of how deferred execution is done all in one place (sbshc).
The issue with the Gnome battery indicator is a bug in upower (https://gitlab.freedesktop.org/upower/upower/issues/56) and fixed in version 0.99.8 thereof. I.e. not a kernel issue. Thanks Lukas. Saw the commits (referenced 30341cdde in the commit message), but missed the blog entry - too bad,that might have helped debug this. @Ronald, the code change in comment #33 seems reasonable, can you please send it to linux-acpi mailing list? (In reply to Zhang Rui from comment #40) > @Ronald, the code change in comment #33 seems reasonable, can you please > send it to linux-acpi mailing list? Done. Thanks for reviewing and all your help here. |
Created attachment 261185 [details] dmesg output; enabled dynamic debug for ec.c after boot but before plugging in power; truncated after 1 second worth of debug output System is a MacBookPro13,3, though I believe the issue occurs on all other MacBookPro13,* and MacBookPro14,* . If I boot without the power supply plugged in, then running 'watch cat /sys/firmware/acpi/interrupts/gpe07' shows no activity; but as soon as I plug in the power the interrupt count starts to increase by about 5000/second. If I boot with the power supply plugged in, then I see this large number of interrupts as soon as I'm able to log in. When this happens, the CPU starts spending about 20% in interrupts, and it leads to other issues such as unloading modules taking very long (1 minute) (the longer this has been going on, the longer the module unload takes).