Bug 205861 - Semi-regularly, 1 CPU core is entirely hogged by kworker/0:1+kacpid
Summary: Semi-regularly, 1 CPU core is entirely hogged by kworker/0:1+kacpid
Status: RESOLVED UNREPRODUCIBLE
Alias: None
Product: Drivers
Classification: Unclassified
Component: Other (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: drivers_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-12-15 11:20 UTC by Ralf
Modified: 2020-09-18 16:08 UTC (History)
1 user (show)

See Also:
Kernel Version: 5.3.9 (Debian)
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
"acpidump" output (576.04 KB, text/plain)
2020-01-02 11:59 UTC, Ralf
Details
diff -u config-5.4.19-good config-5.4.19-bad (56.84 KB, patch)
2020-04-14 21:49 UTC, Ralf
Details | Diff

Description Ralf 2019-12-15 11:20:20 UTC
Since the kernel update from 4.19.37-6 to 5.2.17-1 (Debian kernel version), I regularly have 100% load on one CPU core. "htop" shows a red bar for that load, which means it is "kernel" load. According to "top", the load is caused by "kworker/0:1+kacpid". "perf top" in one case showed some function names involving acpi, but when this just happened again, "perf top" showed no noticeable load at all.

The issue persists with the kernel I have currently installed:
$ uname -a
Linux r-thinktop 5.3.0-2-amd64 #1 SMP Debian 5.3.9-3 (2019-11-19) x86_64 GNU/Linux

This happens roughly once or twice a day.
It starts randomly during operation, at least I have not been able to find a pattern yet.
To get rid of this, I have to put the machine to suspend and resume it again.

I am using Debian testing on a ThinkPad P50.
Comment 1 Zhang Rui 2019-12-29 04:35:46 UTC
is there any ACPI interrupt storm when this happens?
you can confirm this by checking /sys/firmware/acpi/interrupts/* for a certain period.
Comment 2 Ralf 2020-01-01 14:30:19 UTC
The number in the files "sci", "gpe6F" and "gpe_all" keeps growing when this happens. So I guess that means yes, it is an interrupt storm?

$ cat /sys/firmware/acpi/interrupts/gpe6F
 2578152     STS enabled      unmasked
Comment 3 Zhang Rui 2020-01-02 03:09:29 UTC
yes, this is an interrupt storm issue.
please attach the acpidump output.
Comment 4 Ralf 2020-01-02 11:59:14 UTC
Created attachment 286577 [details]
"acpidump" output

Done. I recorded this when the system was "good" (no interrupt storm); should I try again to get the output during an interrupt storm?
Comment 5 Zhang Rui 2020-01-02 12:41:09 UTC
        Method (_L6F, 0, NotSerialized)  // _Lxx: Level-Triggered GPE, xx=0x00-0xFF
        {
            \_SB.UGPS ()
            If ((TBTS == 0x01))
            {
                If (\_SB.ISME (CPGN))
                {
                    \_SB.THDR ()
                }
            }

            \_SB.CGLS ()
        }

I dig into the ASL code a bit but didn't figure out what this GPE is for.

(In reply to Ralf from comment #0)
> Since the kernel update from 4.19.37-6 to 5.2.17-1 (Debian kernel version),
> I regularly have 100% load on one CPU core.

so this look like a kernel regression? it would be nice to confirm this.

> It starts randomly during operation, at least I have not been able to find a
> pattern yet.
> To get rid of this, I have to put the machine to suspend and resume it again.

this is a useful information, this is probably related with device runtime PM.
To confirm this, we can disable the runtime PM for all the devices, by "echo on > power/control" in the device node in sysfs, and see if the problem still exist.
If no, we can enable runtime PM for some of the devices to narrow down the problem to the specific device.
Comment 6 Zhang Rui 2020-01-02 12:41:41 UTC
BTW, please attach the dmesg output when the problem happens and we can see if there is anything bad already happens.
Comment 7 Zhang Rui 2020-01-02 13:05:45 UTC
(In reply to Ralf from comment #4)
> Created attachment 286577 [details]
> "acpidump" output
> 
> Done. I recorded this when the system was "good" (no interrupt storm);
> should I try again to get the output during an interrupt storm?

no, you don't
Comment 8 Ralf 2020-01-02 15:19:14 UTC
> so this look like a kernel regression? it would be nice to confirm this.

Looks like it to me, yes.
If I find the time, I'll build an uptream 4.19 kernel and boot that, to see if the issues remain.

Also potentially interesting, I was on a congress over the week-end where I used my laptop with hardly any peripherals (only the occasional USB mouse), and the issue didn't occur.

> To confirm this, we can disable the runtime PM for all the devices, by "echo
> on > power/control" in the device node in sysfs, and see if the problem still
> exist.

So what exactly do you want me to do? Which device would that be for? Also given that days can pass without the issue occurring, it's hard to test if doing anything fixes it...

As for dmesg, the issue just started again and for the last hour there's not much in the logs:

Jan 02 16:08:11 r-thinktop kernel: pci_bus 0000:02: Allocating resources
Jan 02 16:08:11 r-thinktop kernel: pci_bus 0000:05: Allocating resources
Jan 02 16:08:51 r-thinktop kernel: pci_bus 0000:02: Allocating resources
Jan 02 16:08:51 r-thinktop kernel: pci_bus 0000:05: Allocating resources
Jan 02 16:16:59 r-thinktop kernel: input: E8:07:BF:72:CC:89 as /devices/virtual/input/input42
Jan 02 16:17:09 r-thinktop kernel: usb 1-14: USB disconnect, device number 5
Comment 9 Zhang Rui 2020-01-03 03:31:29 UTC
(In reply to Ralf from comment #8)
> > so this look like a kernel regression? it would be nice to confirm this.
> 
> Looks like it to me, yes.
> If I find the time, I'll build an uptream 4.19 kernel and boot that, to see
> if the issues remain.
> 
> Also potentially interesting, I was on a congress over the week-end where I
> used my laptop with hardly any peripherals (only the occasional USB mouse),
> and the issue didn't occur.

this is possible.
Now the problem is that some GPE is triggered and it is not handled properly, and this leads to the interrupt storm.
But what triggers the GPE is still unknown, it is probably caused by some peripherals.

> 
> > To confirm this, we can disable the runtime PM for all the devices, by
> "echo
> > on > power/control" in the device node in sysfs, and see if the problem
> still
> > exist.
> 
> So what exactly do you want me to do? Which device would that be for?

I don't know, in theory, we should disable runtime PM for all the devices and see if it helps or not, and then we can narrow down the problems.

But maybe you can try the USB type-C and Thunderbolt device as the first step because the ASL shows that GPE 0x6F may be related with this device.

> Also
> given that days can pass without the issue occurring, it's hard to test if
> doing anything fixes it...

Well, agree. That's why these intermittent issues are hard to reproduce & fix.

> 
> As for dmesg, the issue just started again and for the last hour there's not
> much in the logs:
> 
> Jan 02 16:08:11 r-thinktop kernel: pci_bus 0000:02: Allocating resources
> Jan 02 16:08:11 r-thinktop kernel: pci_bus 0000:05: Allocating resources
> Jan 02 16:08:51 r-thinktop kernel: pci_bus 0000:02: Allocating resources
> Jan 02 16:08:51 r-thinktop kernel: pci_bus 0000:05: Allocating resources
> Jan 02 16:16:59 r-thinktop kernel: input: E8:07:BF:72:CC:89 as
> /devices/virtual/input/input42
> Jan 02 16:17:09 r-thinktop kernel: usb 1-14: USB disconnect, device number 5

So do you know this usb issue is related?
Maybe we can have a script to
1. clear the previous dmesg
2. get the current GPE 0x6F number
3. sleep for 60 seconds, or even shorter
4. check the current GPE 0x6F value, so that we can know if the interrupt storms happens or not.
5. save the new content in dmesg.
6. goto step1 and repeat the test until the interrupt storm happens.

In this way, maybe we can find out what triggers the interrupt storm.
Comment 10 Ralf 2020-01-31 22:32:33 UTC
> so this look like a kernel regression? it would be nice to confirm this.

I am now on vanilla 4.19.27 for a few weeks without this issue occurring. So yes, looks like a regression. I'll slowly move to newer kernel versions.

> But maybe you can try the USB type-C and Thunderbolt device as the first step
> because the ASL shows that GPE 0x6F may be related with this device.

I don't usually have any USB-3 or Thunderbird device connected, but I presume the controller itself could be related? I am not sure how I can figure out which device that would be in sysfs, though.

> In this way, maybe we can find out what triggers the interrupt storm.

For the log I posted, none of the events in the dmesg coincided with the begin of the increased CPU load, in terms of timing.

I *thought* I had an idea for reproducing this, because it seemed like whenever I used my bluetooth headsets in the evening and suspended the laptop for the night, the issue would come up the next morning not long after resume... but just briefly connecting and discconnecting bluetooth is not sufficient, so I am not sure if there is really a connection here. It is the case though that for the 4-5 days where the issue didn't come up, I also wasn't using bluetooth.
Comment 11 Ralf 2020-04-02 07:07:34 UTC
After a lot of testing, I was finally able to reproduce this with a vanilla kernel v5.4.19.  Interestingly, even though Debian kernel v5.3.9 was affected, I did not see any issue with vanilla kernel v5.3.18.

However, in upgrading from v5.3.18 to v5.4.19 I also changed the kernel config (.config file) from my own, stripped-down version (hoping to reduce build times) to something as close as I could make it to the one Debian uses. I am now building  v5.4.19 again with my stripped-down config to determine if the issue only arises in some configurations, or if maybe the Debian 5.3 kernel backported the problematic patch and vanilla it only landed in 5.4.
Comment 12 Ralf 2020-04-14 21:49:46 UTC
Created attachment 288457 [details]
diff -u config-5.4.19-good config-5.4.19-bad

Indeed it looks like 5.4.19 with one config has the issue and with another config does not. I attached the output of "diff -u config-5.4.19-good config-5.4.19-bad"; unfortunately I d not have a good idea how to go about narrowing down the option that could be causing the problem.
Comment 13 Ralf 2020-05-17 20:57:49 UTC
The issue persists with (Debian's) kernel 5.6.
Comment 14 Zhang Rui 2020-06-29 11:16:27 UTC
well, there are too many differences.
something that worth trying
Clear
CONFIG_HAMRADIO
CONFIG_CAN
CONFIG_NFC
disable wimax
CONFIG_ISDN
CONFIG_STAGING
CONFIG_IIO

Please try to figure out which Kconfig introduces this issue and then raise the problem to the driver owner.
Anyway, this is more like a kconfig/driver problem, move this report out of ACPI category.
Comment 15 Ralf 2020-07-11 08:20:19 UTC
> well, there are too many differences.

Yeah, I know. But I had no idea where to even start trying, and I figured you'd have some advise for which options are most likely to be relevant -- that seemed better than me blindly trying 1000s of configurations.

> move this report out of ACPI category.

To where, and how? I am not familiar with bugzilla.
Comment 16 Ralf 2020-08-09 13:49:11 UTC
I have done a bunch more testing, and I reduced the problem to the following config diff:

--- config-5.6.13-good.2        2020-08-09 15:47:13.676528957 +0200
+++ config-5.6.13-bad   2020-07-27 12:25:34.139836285 +0200
@@ -62,6 +62,7 @@
 CONFIG_IRQ_DOMAIN_HIERARCHY=y
 CONFIG_GENERIC_MSI_IRQ=y
 CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
+CONFIG_IRQ_MSI_IOMMU=y
 CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y
 CONFIG_GENERIC_IRQ_RESERVATION_MODE=y
 CONFIG_IRQ_FORCED_THREADING=y
@@ -5741,9 +5742,47 @@
 CONFIG_SND_SOC_INTEL_KBL_DA7219_MAX98357A_MACH=m
 # CONFIG_SND_SOC_INTEL_KBL_DA7219_MAX98927_MACH is not set
 # CONFIG_SND_SOC_INTEL_KBL_RT5660_MACH is not set
+# CONFIG_SND_SOC_INTEL_GLK_DA7219_MAX98357A_MACH is not set
+CONFIG_SND_SOC_INTEL_GLK_RT5682_MAX98357A_MACH=m
 CONFIG_SND_SOC_INTEL_SKL_HDA_DSP_GENERIC_MACH=m
+CONFIG_SND_SOC_INTEL_SOF_RT5682_MACH=m
 # CONFIG_SND_SOC_MTK_BTCVSD is not set
-# CONFIG_SND_SOC_SOF_TOPLEVEL is not set
+CONFIG_SND_SOC_SOF_TOPLEVEL=y
+CONFIG_SND_SOC_SOF_PCI=m
+# CONFIG_SND_SOC_SOF_ACPI is not set
+# CONFIG_SND_SOC_SOF_DEVELOPER_SUPPORT is not set
+CONFIG_SND_SOC_SOF=m
+CONFIG_SND_SOC_SOF_PROBE_WORK_QUEUE=y
+CONFIG_SND_SOC_SOF_INTEL_TOPLEVEL=y
+CONFIG_SND_SOC_SOF_INTEL_PCI=m
+CONFIG_SND_SOC_SOF_INTEL_HIFI_EP_IPC=m
+CONFIG_SND_SOC_SOF_INTEL_ATOM_HIFI_EP=m
+CONFIG_SND_SOC_SOF_INTEL_COMMON=m
+CONFIG_SND_SOC_SOF_MERRIFIELD_SUPPORT=y
+CONFIG_SND_SOC_SOF_MERRIFIELD=m
+CONFIG_SND_SOC_SOF_APOLLOLAKE_SUPPORT=y
+CONFIG_SND_SOC_SOF_APOLLOLAKE=m
+CONFIG_SND_SOC_SOF_GEMINILAKE_SUPPORT=y
+CONFIG_SND_SOC_SOF_GEMINILAKE=m
+CONFIG_SND_SOC_SOF_CANNONLAKE_SUPPORT=y
+CONFIG_SND_SOC_SOF_CANNONLAKE=m
+CONFIG_SND_SOC_SOF_COFFEELAKE_SUPPORT=y
+CONFIG_SND_SOC_SOF_COFFEELAKE=m
+CONFIG_SND_SOC_SOF_ICELAKE_SUPPORT=y
+CONFIG_SND_SOC_SOF_ICELAKE=m
+# CONFIG_SND_SOC_SOF_COMETLAKE_LP_SUPPORT is not set
+# CONFIG_SND_SOC_SOF_COMETLAKE_H_SUPPORT is not set
+# CONFIG_SND_SOC_SOF_TIGERLAKE_SUPPORT is not set
+# CONFIG_SND_SOC_SOF_ELKHARTLAKE_SUPPORT is not set
+# CONFIG_SND_SOC_SOF_JASPERLAKE_SUPPORT is not set
+CONFIG_SND_SOC_SOF_HDA_COMMON=m
+CONFIG_SND_SOC_SOF_HDA_LINK=y
+CONFIG_SND_SOC_SOF_HDA_AUDIO_CODEC=y
+# CONFIG_SND_SOC_SOF_HDA_ALWAYS_ENABLE_DMI_L1 is not set
+CONFIG_SND_SOC_SOF_HDA_COMMON_HDMI_CODEC=y
+CONFIG_SND_SOC_SOF_HDA_LINK_BASELINE=m
+CONFIG_SND_SOC_SOF_HDA=m
+CONFIG_SND_SOC_SOF_XTENSA=m
 
 #
 # STMicroelectronics STM32 SOC audio support
@@ -5846,6 +5885,7 @@
 CONFIG_SND_SOC_RT5670=m
 CONFIG_SND_SOC_RT5677=m
 CONFIG_SND_SOC_RT5677_SPI=m
+CONFIG_SND_SOC_RT5682=m
 # CONFIG_SND_SOC_SGTL5000 is not set
 # CONFIG_SND_SOC_SIMPLE_AMPLIFIER is not set
 # CONFIG_SND_SOC_SIRF_AUDIO_CODEC is not set
@@ -6423,7 +6463,24 @@
 CONFIG_MMC_CQHCI=m
 CONFIG_MMC_TOSHIBA_PCI=m
 # CONFIG_MMC_MTK is not set
-# CONFIG_MEMSTICK is not set
+CONFIG_MEMSTICK=m
+# CONFIG_MEMSTICK_DEBUG is not set
+
+#
+# MemoryStick drivers
+#
+# CONFIG_MEMSTICK_UNSAFE_RESUME is not set
+CONFIG_MSPRO_BLOCK=m
+# CONFIG_MS_BLOCK is not set
+
+#
+# MemoryStick Host Controller Drivers
+#
+CONFIG_MEMSTICK_TIFM_MS=m
+CONFIG_MEMSTICK_JMICRON_38X=m
+CONFIG_MEMSTICK_R592=m
+CONFIG_MEMSTICK_REALTEK_PCI=m
+CONFIG_MEMSTICK_REALTEK_USB=m
 CONFIG_NEW_LEDS=y
 CONFIG_LEDS_CLASS=y
 # CONFIG_LEDS_CLASS_FLASH is not set
@@ -6747,7 +6804,7 @@
 # CONFIG_STAGING is not set
 CONFIG_X86_PLATFORM_DEVICES=y
 CONFIG_ACER_WMI=m
-# CONFIG_ACER_WIRELESS is not set
+CONFIG_ACER_WIRELESS=m
 CONFIG_ACERHDF=m
 CONFIG_ALIENWARE_WMI=m
 CONFIG_ASUS_LAPTOP=m
@@ -6794,7 +6851,7 @@
 CONFIG_ASUS_WIRELESS=m
 CONFIG_ACPI_WMI=m
 CONFIG_WMI_BMOF=m
-# CONFIG_INTEL_WMI_THUNDERBOLT is not set
+CONFIG_INTEL_WMI_THUNDERBOLT=m
 # CONFIG_XIAOMI_WMI is not set
 CONFIG_MSI_WMI=m
 CONFIG_PEAQ_WMI=m
@@ -6802,7 +6859,7 @@
 CONFIG_TOSHIBA_BT_RFKILL=m
 CONFIG_TOSHIBA_HAPS=m
 CONFIG_TOSHIBA_WMI=m
-# CONFIG_ACPI_CMPC is not set
+CONFIG_ACPI_CMPC=m
 CONFIG_INTEL_INT0002_VGPIO=m
 CONFIG_INTEL_HID_EVENT=m
 CONFIG_INTEL_VBTN=m
@@ -6822,7 +6879,7 @@
 CONFIG_SURFACE_3_BUTTON=m
 # CONFIG_INTEL_PUNIT_IPC is not set
 # CONFIG_MLX_PLATFORM is not set
-# CONFIG_INTEL_TURBO_MAX_3 is not set
+CONFIG_INTEL_TURBO_MAX_3=y
 # CONFIG_INTEL_CHTDC_TI_PWRBTN is not set
 CONFIG_I2C_MULTI_INSTANTIATE=m
 CONFIG_INTEL_ATOMISP2_PM=m
@@ -6839,7 +6896,12 @@
 # CONFIG_SYSTEM76_ACPI is not set
 CONFIG_PMC_ATOM=y
 # CONFIG_MFD_CROS_EC is not set
-# CONFIG_CHROME_PLATFORMS is not set
+CONFIG_CHROME_PLATFORMS=y
+CONFIG_CHROMEOS_LAPTOP=m
+CONFIG_CHROMEOS_PSTORE=m
+# CONFIG_CHROMEOS_TBMC is not set
+# CONFIG_CROS_EC is not set
+CONFIG_CROS_KBD_LED_BACKLIGHT=m
 # CONFIG_MELLANOX_PLATFORM is not set
 CONFIG_CLKDEV_LOOKUP=y
 CONFIG_HAVE_CLK_PREPARE=y
@@ -6882,7 +6944,9 @@
 
 # CONFIG_IOMMU_DEBUGFS is not set
 # CONFIG_IOMMU_DEFAULT_PASSTHROUGH is not set
-# CONFIG_AMD_IOMMU is not set
+CONFIG_IOMMU_DMA=y
+CONFIG_AMD_IOMMU=y
+CONFIG_AMD_IOMMU_V2=y
 CONFIG_DMAR_TABLE=y
 CONFIG_INTEL_IOMMU=y
 CONFIG_INTEL_IOMMU_SVM=y
@@ -7113,7 +7177,13 @@
 CONFIG_OCFS2_FS_STATS=y
 CONFIG_OCFS2_DEBUG_MASKLOG=y
 # CONFIG_OCFS2_DEBUG_FS is not set
-# CONFIG_BTRFS_FS is not set
+CONFIG_BTRFS_FS=m
+CONFIG_BTRFS_FS_POSIX_ACL=y
+# CONFIG_BTRFS_FS_CHECK_INTEGRITY is not set
+# CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set
+# CONFIG_BTRFS_DEBUG is not set
+# CONFIG_BTRFS_ASSERT is not set
+# CONFIG_BTRFS_FS_REF_VERIFY is not set
 CONFIG_NILFS2_FS=m
 CONFIG_F2FS_FS=m
 CONFIG_F2FS_STAT_FS=y
@@ -7132,7 +7202,8 @@
 CONFIG_EXPORTFS_BLOCK_OPS=y
 CONFIG_FILE_LOCKING=y
 CONFIG_MANDATORY_FILE_LOCKING=y
-# CONFIG_FS_ENCRYPTION is not set
+CONFIG_FS_ENCRYPTION=y
+CONFIG_FS_ENCRYPTION_ALGS=m
 # CONFIG_FS_VERITY is not set
 CONFIG_FSNOTIFY=y
 CONFIG_DNOTIFY=y
@@ -7152,7 +7223,7 @@
 CONFIG_AUTOFS_FS=m
 CONFIG_FUSE_FS=m
 CONFIG_CUSE=m
-# CONFIG_VIRTIO_FS is not set
+CONFIG_VIRTIO_FS=m
 CONFIG_OVERLAY_FS=m
 # CONFIG_OVERLAY_FS_REDIRECT_DIR is not set
 CONFIG_OVERLAY_FS_REDIRECT_ALWAYS_FOLLOW=y
@@ -7275,7 +7346,13 @@
 CONFIG_UFS_FS=m
 # CONFIG_UFS_FS_WRITE is not set
 # CONFIG_UFS_DEBUG is not set
-# CONFIG_EROFS_FS is not set
+CONFIG_EROFS_FS=m
+# CONFIG_EROFS_FS_DEBUG is not set
+CONFIG_EROFS_FS_XATTR=y
+CONFIG_EROFS_FS_POSIX_ACL=y
+CONFIG_EROFS_FS_SECURITY=y
+CONFIG_EROFS_FS_ZIP=y
+CONFIG_EROFS_FS_CLUSTER_PAGE_LIMIT=1
 # CONFIG_VBOXSF_FS is not set
 CONFIG_NETWORK_FILESYSTEMS=y
 CONFIG_NFS_FS=m
@@ -7446,7 +7523,11 @@
 # CONFIG_SECURITY_LOADPIN is not set
 CONFIG_SECURITY_YAMA=y
 # CONFIG_SECURITY_SAFESETID is not set
-# CONFIG_SECURITY_LOCKDOWN_LSM is not set
+CONFIG_SECURITY_LOCKDOWN_LSM=y
+CONFIG_SECURITY_LOCKDOWN_LSM_EARLY=y
+CONFIG_LOCK_DOWN_KERNEL_FORCE_NONE=y
+# CONFIG_LOCK_DOWN_KERNEL_FORCE_INTEGRITY is not set
+# CONFIG_LOCK_DOWN_KERNEL_FORCE_CONFIDENTIALITY is not set
 CONFIG_INTEGRITY=y
 CONFIG_INTEGRITY_SIGNATURE=y
 CONFIG_INTEGRITY_ASYMMETRIC_KEYS=y
Comment 17 Ralf 2020-09-18 16:08:40 UTC
Sometime during my bisecting, some other change must have fixed this -- the remaining config diff I ended up with doesn't make much sense, and I cannot reproduce this with the Debian kernels any more either.

Note You need to log in before you can comment on or make changes to this bug.