Hardware: MSI Laptop
DMI: Micro-Star International Co., Ltd. Alpha 15 B5EEK/MS-158L, BIOS E158LAMS.107 11/10/2021

lspci:
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:02.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:02.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:02.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:02.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 51)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 7
01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch (rev c3)
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch
03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] (rev c3)
03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller
04:00.0 Network controller: MEDIATEK Corp. MT7921K (RZ608) Wi-Fi 6E 80MHz
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
06:00.0 Non-Volatile memory controller: Micron/Crucial Technology P1 NVMe PCIe SSD (rev 03)
07:00.0 Non-Volatile memory controller: Kingston Technology Company, Inc. Device 500c (rev 01)
08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne (rev c5)
08:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Renoir Radeon High Definition Audio Controller
08:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor
08:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1
08:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1
08:00.5 Multimedia controller: Advanced Micro Devices, Inc. [AMD] ACP/ACP3X/ACP6x Audio Coprocessor (rev 01)
08:00.6 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h/19h HD Audio Controller
08:00.7 Signal processing controller: Advanced Micro Devices, Inc. [AMD] Sensor Fusion Hub

Commit ed589c7d6485b0f4bacd1c4fb385b79176a33a73 leads to the kernel failing to boot: the system gets stuck at the grub screen and has to be reset. No messages in logs. Reverting this patch in next-20220923 leads to a booting kernel again.
Created attachment 301871: kernel config
Unfortunately the revert may have introduced a new bug that hangs the system without leaving traces in the logs. The hang seems to occur after 15 min to 1 h, under higher graphical load. Could this be a temperature issue caused by the revert?
This is a locking issue, the following patch made the kernel boot again:

diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 9b27211b806f..9179989fe920 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -1154,7 +1154,7 @@ int thermal_zone_get_crit_temp(struct thermal_zone_device *tz, int *temp)
 	if (!tz->trips)
 		return -EINVAL;
-	mutex_lock(&tz->lock);
+	//mutex_lock(&tz->lock);
 	for (i = 0; i < tz->num_trips; i++) {
 		if (tz->trips[i].type == THERMAL_TRIP_CRITICAL) {
@@ -1165,7 +1165,7 @@ int thermal_zone_get_crit_temp(struct thermal_zone_device *tz, int *temp)
 	ret = -EINVAL;
 out:
-	mutex_unlock(&tz->lock);
+	//mutex_unlock(&tz->lock);
 	return ret;
 }
@@ -1202,9 +1202,9 @@ int thermal_zone_get_trip(struct thermal_zone_device *tz, int trip_id,
 {
 	int ret;
-	mutex_lock(&tz->lock);
+	//mutex_lock(&tz->lock);
 	ret = __thermal_zone_get_trip(tz, trip_id, trip);
-	mutex_unlock(&tz->lock);
+	//mutex_unlock(&tz->lock);
 	return ret;
 }
@@ -1216,7 +1216,7 @@ int thermal_zone_set_trip(struct thermal_zone_device *tz, int trip_id,
 	struct thermal_trip t;
 	int ret = -EINVAL;
-	mutex_lock(&tz->lock);
+	//mutex_lock(&tz->lock);
 	if (!tz->ops->set_trip_temp && !tz->ops->set_trip_hyst && !tz->trips)
 		goto out;
@@ -1244,7 +1244,7 @@ int thermal_zone_set_trip(struct thermal_zone_device *tz, int trip_id,
 	tz->trips[trip_id] = *trip;
 out:
-	mutex_unlock(&tz->lock);
+	//mutex_unlock(&tz->lock);
 	if (!ret) {
 		thermal_notify_tz_trip_change(tz->id, trip_id, trip->type,
This is the important part, which removes the hang:

@@ -1202,9 +1202,9 @@ int thermal_zone_get_trip(struct thermal_zone_device *tz, int trip_id,
 {
 	int ret;
-	mutex_lock(&tz->lock);
+	//mutex_lock(&tz->lock);
 	ret = __thermal_zone_get_trip(tz, trip_id, trip);
-	mutex_unlock(&tz->lock);
+	//mutex_unlock(&tz->lock);
 	return ret;
 }
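The thread does not say which caller triggers the hang. One plausible reading, and it is only an assumption, is that some path calls thermal_zone_get_trip() while already holding tz->lock, so the non-recursive mutex is taken twice by the same task. The user-space sketch below illustrates that pattern with pthreads. It is not kernel code: struct zone and all function names are hypothetical stand-ins, and an error-checking mutex is used so the second lock attempt returns EDEADLK instead of hanging.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for struct thermal_zone_device (not kernel code). */
struct zone {
	pthread_mutex_t lock;
	int trip_temp;
};

/* Use an error-checking mutex: relocking by the owner returns EDEADLK
 * instead of blocking forever, which makes the problem observable. */
void zone_init(struct zone *z, int temp)
{
	pthread_mutexattr_t attr;
	int rc;

	rc = pthread_mutexattr_init(&attr);
	assert(rc == 0);
	rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	assert(rc == 0);
	rc = pthread_mutex_init(&z->lock, &attr);
	assert(rc == 0);
	pthread_mutexattr_destroy(&attr);
	(void)rc;
	z->trip_temp = temp;
}

/* Analogue of __thermal_zone_get_trip(): the caller must hold z->lock. */
int zone_get_trip_nolock(struct zone *z, int *temp)
{
	*temp = z->trip_temp;
	return 0;
}

/* Analogue of thermal_zone_get_trip(): takes the lock itself. */
int zone_get_trip(struct zone *z, int *temp)
{
	int rc = pthread_mutex_lock(&z->lock);

	if (rc)
		return rc;	/* EDEADLK if this thread already owns the lock */
	rc = zone_get_trip_nolock(z, temp);
	pthread_mutex_unlock(&z->lock);
	return rc;
}

/* Assumed problem pattern: lock held, then the locking getter is called. */
int update_with_lock_held(struct zone *z, int *temp)
{
	int rc;

	pthread_mutex_lock(&z->lock);
	rc = zone_get_trip(z, temp);	/* second lock attempt by the owner */
	pthread_mutex_unlock(&z->lock);
	return rc;
}

/* Same caller, using the variant that expects the lock to be held. */
int update_with_lock_held_fixed(struct zone *z, int *temp)
{
	int rc;

	pthread_mutex_lock(&z->lock);
	rc = zone_get_trip_nolock(z, temp);
	pthread_mutex_unlock(&z->lock);
	return rc;
}
```

With a default (non-error-checking) mutex the call in update_with_lock_held() would block forever, which would match a silent hang. Commenting out the lock in the getter, as in the hunk above, avoids this but also drops protection for callers that do not hold the lock. Calling the no-lock helper from paths that already hold the lock keeps both cases safe.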
CC'ing Stephen Rothwell
The ACPI thermal zone is the problem: setting CONFIG_ACPI_THERMAL_ZONE=n makes the kernel bootable again. CONFIG_ACPI_THERMAL=m postpones the hang until the module is loaded (3s into boot instead of 0.3s with =y).
The hang when booting is fixed by the changes in linux-20220927, but the instability continues. After about 15 min the system locks up with the Caps Lock LED flashing.
Since this instability also occurs when I revert the thermal patches, it must be caused by something else.