Bug 216528

Summary: commit ed589c7d6485b0f4bacd1c4fb385b79176a33a73 leads to silent hang on boot for MSI Laptop
Product: Drivers
Component: Other
Status: RESOLVED CODE_FIX
Severity: normal
Priority: P1
Hardware: All
OS: Linux
Kernel Version: next-20220923
Subsystem:
Regression: Yes
Bisected commit-id:
Reporter: spasswolf
Assignee: drivers_other
CC: apmbugs
Attachments: kernel config

Description spasswolf 2022-09-25 10:10:51 UTC
Hardware: MSI Laptop
DMI: Micro-Star International Co., Ltd. Alpha 15 B5EEK/MS-158L, BIOS E158LAMS.107 11/10/2021
lspci:
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:02.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:02.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:02.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:02.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 51)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 7
01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch (rev c3)
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch
03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] (rev c3)
03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller
04:00.0 Network controller: MEDIATEK Corp. MT7921K (RZ608) Wi-Fi 6E 80MHz
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
06:00.0 Non-Volatile memory controller: Micron/Crucial Technology P1 NVMe PCIe SSD (rev 03)
07:00.0 Non-Volatile memory controller: Kingston Technology Company, Inc. Device 500c (rev 01)
08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne (rev c5)
08:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Renoir Radeon High Definition Audio Controller
08:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor
08:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1
08:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1
08:00.5 Multimedia controller: Advanced Micro Devices, Inc. [AMD] ACP/ACP3X/ACP6x Audio Coprocessor (rev 01)
08:00.6 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h/19h HD Audio Controller
08:00.7 Signal processing controller: Advanced Micro Devices, Inc. [AMD] Sensor Fusion Hub
Commit ed589c7d6485b0f4bacd1c4fb385b79176a33a73 causes the kernel to fail to boot: the system gets stuck at the GRUB screen and has to be reset. No messages appear in the logs.
Reverting this commit on next-20220923 makes the kernel boot again.
Comment 1 spasswolf 2022-09-25 10:14:11 UTC
Created attachment 301871 [details]
kernel config
Comment 2 spasswolf 2022-09-25 12:23:59 UTC
Unfortunately, the revert may have introduced a new bug that hangs the system without leaving traces in the logs. The hang seems to occur after 15 min to 1 h and under higher graphical load. Could this be a temperature issue caused by the revert?
Comment 3 spasswolf 2022-09-25 15:49:16 UTC
This is a locking issue; the following patch made the kernel boot again:
diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 9b27211b806f..9179989fe920 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -1154,7 +1154,7 @@ int thermal_zone_get_crit_temp(struct thermal_zone_device *tz, int *temp)
 	if (!tz->trips)
 		return -EINVAL;
 
-	mutex_lock(&tz->lock);
+	//mutex_lock(&tz->lock);
 
 	for (i = 0; i < tz->num_trips; i++) {
 		if (tz->trips[i].type == THERMAL_TRIP_CRITICAL) {
@@ -1165,7 +1165,7 @@ int thermal_zone_get_crit_temp(struct thermal_zone_device *tz, int *temp)
 
 	ret = -EINVAL;
 out:
-	mutex_unlock(&tz->lock);
+	//mutex_unlock(&tz->lock);
 
 	return ret;
 }
@@ -1202,9 +1202,9 @@ int thermal_zone_get_trip(struct thermal_zone_device *tz, int trip_id,
 {
 	int ret;
 
-	mutex_lock(&tz->lock);
+	//mutex_lock(&tz->lock);
 	ret = __thermal_zone_get_trip(tz, trip_id, trip);
-	mutex_unlock(&tz->lock);
+	//mutex_unlock(&tz->lock);
 
 	return ret;
 }
@@ -1216,7 +1216,7 @@ int thermal_zone_set_trip(struct thermal_zone_device *tz, int trip_id,
 	struct thermal_trip t;
 	int ret = -EINVAL;
 
-	mutex_lock(&tz->lock);
+	//mutex_lock(&tz->lock);
 
 	if (!tz->ops->set_trip_temp && !tz->ops->set_trip_hyst && !tz->trips)
 		goto out;
@@ -1244,7 +1244,7 @@ int thermal_zone_set_trip(struct thermal_zone_device *tz, int trip_id,
 		tz->trips[trip_id] = *trip;
 
 out:
-	mutex_unlock(&tz->lock);
+	//mutex_unlock(&tz->lock);
 
 	if (!ret) {
 		thermal_notify_tz_trip_change(tz->id, trip_id, trip->type,
Comment 4 spasswolf 2022-09-25 16:13:55 UTC
This is the important part, which removes the hang:
@@ -1202,9 +1202,9 @@ int thermal_zone_get_trip(struct thermal_zone_device *tz, int trip_id,
 {
 	int ret;
 
-	mutex_lock(&tz->lock);
+	//mutex_lock(&tz->lock);
 	ret = __thermal_zone_get_trip(tz, trip_id, trip);
-	mutex_unlock(&tz->lock);
+	//mutex_unlock(&tz->lock);
 
 	return ret;
 }
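To illustrate why dropping the lock in thermal_zone_get_trip() can remove a hang, here is a minimal userspace sketch of a self-deadlock. It assumes (the report does not show the call path) that some caller already holds tz->lock when it calls the locking wrapper; kernel mutexes are not recursive, so a second mutex_lock() by the same task never returns. The sketch uses a pthread error-checking mutex so the relock fails with EDEADLK instead of hanging. All names (struct zone, zone_get_trip, zone_get_trip_locked, zone_update, zone_init) are hypothetical stand-ins, not kernel APIs.

```c
/* Userspace illustration only; this is NOT the kernel thermal code. */
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for a thermal zone with one trip point. */
struct zone {
	pthread_mutex_t lock;
	int trip_temp;
};

/* Lockless helper: the caller must already hold z->lock. */
static int zone_get_trip_locked(struct zone *z, int *temp)
{
	*temp = z->trip_temp;
	return 0;
}

/* Locking wrapper, analogous in shape to thermal_zone_get_trip(). */
static int zone_get_trip(struct zone *z, int *temp)
{
	int ret = pthread_mutex_lock(&z->lock);

	if (ret)
		return ret; /* EDEADLK if this thread already holds the lock */
	ret = zone_get_trip_locked(z, temp);
	pthread_mutex_unlock(&z->lock);
	return ret;
}

/* Assumed caller that takes the lock before reading the trip point. */
static int zone_update(struct zone *z, int *temp, int use_locked_helper)
{
	int ret;

	pthread_mutex_lock(&z->lock);
	if (use_locked_helper)
		ret = zone_get_trip_locked(z, temp); /* safe: lock is held once */
	else
		ret = zone_get_trip(z, temp);        /* relock by the same thread */
	pthread_mutex_unlock(&z->lock);
	return ret;
}

/* Initialise the zone with an error-checking mutex so a relock is reported. */
static int zone_init(struct zone *z, int trip_temp)
{
	pthread_mutexattr_t attr;
	int ret = pthread_mutexattr_init(&attr);

	if (ret)
		return ret;
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	ret = pthread_mutex_init(&z->lock, &attr);
	pthread_mutexattr_destroy(&attr);
	z->trip_temp = trip_temp;
	return ret;
}
```

Calling zone_update(&z, &t, 0) after zone_init() returns EDEADLK, while zone_update(&z, &t, 1) succeeds. If the assumption about the call path holds, the proper fix is for callers that already hold the lock to use the lockless helper, rather than commenting out the locking as in the debugging patch above.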
Comment 5 Artem S. Tashkinov 2022-09-25 19:19:25 UTC
CC'ing Stephen Rothwell
Comment 6 spasswolf 2022-09-27 08:44:32 UTC
The ACPI thermal zone is the problem: setting CONFIG_ACPI_THERMAL=n makes the kernel bootable again. CONFIG_ACPI_THERMAL=m postpones the hang until the module is loaded (3s into boot instead of 0.3s with =y).
Comment 7 spasswolf 2022-09-27 21:06:25 UTC
The hang when booting is fixed by the changes in linux-20220927, but the instability continues. After about 15 min the system locks up with the Caps Lock LED flashing.
Comment 8 spasswolf 2022-09-28 19:50:42 UTC
Since this instability also occurs when I revert the thermal patches, it must be caused by something else.