Bug 216528 - commit ed589c7d6485b0f4bacd1c4fb385b79176a33a73 leads to silent hang on boot for MSI Laptop
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_other
Reported: 2022-09-25 10:10 UTC by spasswolf
Modified: 2022-09-28 19:50 UTC
Kernel Version: next-20220923
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
kernel config (156.70 KB, text/plain)
2022-09-25 10:14 UTC, spasswolf

Description spasswolf 2022-09-25 10:10:51 UTC
Hardware: MSI Laptop
DMI: Micro-Star International Co., Ltd. Alpha 15 B5EEK/MS-158L, BIOS E158LAMS.107 11/10/2021
lspci:
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:02.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:02.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:02.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:02.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 51)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 7
01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch (rev c3)
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch
03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] (rev c3)
03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller
04:00.0 Network controller: MEDIATEK Corp. MT7921K (RZ608) Wi-Fi 6E 80MHz
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
06:00.0 Non-Volatile memory controller: Micron/Crucial Technology P1 NVMe PCIe SSD (rev 03)
07:00.0 Non-Volatile memory controller: Kingston Technology Company, Inc. Device 500c (rev 01)
08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne (rev c5)
08:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Renoir Radeon High Definition Audio Controller
08:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor
08:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1
08:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1
08:00.5 Multimedia controller: Advanced Micro Devices, Inc. [AMD] ACP/ACP3X/ACP6x Audio Coprocessor (rev 01)
08:00.6 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h/19h HD Audio Controller
08:00.7 Signal processing controller: Advanced Micro Devices, Inc. [AMD] Sensor Fusion Hub
Commit ed589c7d6485b0f4bacd1c4fb385b79176a33a73 causes the kernel to fail to boot: the system gets stuck at the GRUB screen and has to be reset. No messages appear in the logs.
Reverting this commit in next-20220923 yields a booting kernel again.
Comment 1 spasswolf 2022-09-25 10:14:11 UTC
Created attachment 301871
kernel config
Comment 2 spasswolf 2022-09-25 12:23:59 UTC
Unfortunately, the revert may have introduced a new bug that hangs the system without leaving traces in the logs. The hang seems to occur after 15 min to 1 h, under higher graphical load. Could this be a temperature issue caused by the revert?
Comment 3 spasswolf 2022-09-25 15:49:16 UTC
This is a locking issue; the following patch makes the kernel boot again:
diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 9b27211b806f..9179989fe920 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -1154,7 +1154,7 @@ int thermal_zone_get_crit_temp(struct thermal_zone_device *tz, int *temp)
 	if (!tz->trips)
 		return -EINVAL;
 
-	mutex_lock(&tz->lock);
+	//mutex_lock(&tz->lock);
 
 	for (i = 0; i < tz->num_trips; i++) {
 		if (tz->trips[i].type == THERMAL_TRIP_CRITICAL) {
@@ -1165,7 +1165,7 @@ int thermal_zone_get_crit_temp(struct thermal_zone_device *tz, int *temp)
 
 	ret = -EINVAL;
 out:
-	mutex_unlock(&tz->lock);
+	//mutex_unlock(&tz->lock);
 
 	return ret;
 }
@@ -1202,9 +1202,9 @@ int thermal_zone_get_trip(struct thermal_zone_device *tz, int trip_id,
 {
 	int ret;
 
-	mutex_lock(&tz->lock);
+	//mutex_lock(&tz->lock);
 	ret = __thermal_zone_get_trip(tz, trip_id, trip);
-	mutex_unlock(&tz->lock);
+	//mutex_unlock(&tz->lock);
 
 	return ret;
 }
@@ -1216,7 +1216,7 @@ int thermal_zone_set_trip(struct thermal_zone_device *tz, int trip_id,
 	struct thermal_trip t;
 	int ret = -EINVAL;
 
-	mutex_lock(&tz->lock);
+	//mutex_lock(&tz->lock);
 
 	if (!tz->ops->set_trip_temp && !tz->ops->set_trip_hyst && !tz->trips)
 		goto out;
@@ -1244,7 +1244,7 @@ int thermal_zone_set_trip(struct thermal_zone_device *tz, int trip_id,
 		tz->trips[trip_id] = *trip;
 
 out:
-	mutex_unlock(&tz->lock);
+	//mutex_unlock(&tz->lock);
 
 	if (!ret) {
 		thermal_notify_tz_trip_change(tz->id, trip_id, trip->type,
Comment 4 spasswolf 2022-09-25 16:13:55 UTC
This is the important part, which removes the hang:
@@ -1202,9 +1202,9 @@ int thermal_zone_get_trip(struct thermal_zone_device *tz, int trip_id,
 {
 	int ret;
 
-	mutex_lock(&tz->lock);
+	//mutex_lock(&tz->lock);
 	ret = __thermal_zone_get_trip(tz, trip_id, trip);
-	mutex_unlock(&tz->lock);
+	//mutex_unlock(&tz->lock);
 
 	return ret;
 }
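
Editorial sketch (not part of the original comment): one plausible reading of the hunk above, which this report does not confirm, is that some caller already holds tz->lock when it calls thermal_zone_get_trip(), so the second mutex_lock() on the same lock never returns. The userspace analogue below illustrates that pattern with an error-checking pthread mutex, which reports EDEADLK instead of hanging. struct zone and all zone_* functions are hypothetical names for illustration, not kernel APIs.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* for PTHREAD_MUTEX_ERRORCHECK under strict -std= modes */
#endif
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for struct thermal_zone_device; only the lock matters. */
struct zone {
	pthread_mutex_t lock;
	int trip_temp;
};

/* Use an error-checking mutex so a relock by the owner returns EDEADLK. */
int zone_init(struct zone *z, int trip_temp)
{
	pthread_mutexattr_t attr;
	int ret = pthread_mutexattr_init(&attr);

	if (ret)
		return ret;
	ret = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	if (!ret)
		ret = pthread_mutex_init(&z->lock, &attr);
	pthread_mutexattr_destroy(&attr);
	z->trip_temp = trip_temp;
	return ret;
}

/* Unlocked helper: the caller must already hold z->lock. */
int zone_get_trip_locked(struct zone *z, int *temp)
{
	*temp = z->trip_temp;
	return 0;
}

/* Locking wrapper, analogous in shape to thermal_zone_get_trip(). */
int zone_get_trip(struct zone *z, int *temp)
{
	int ret = pthread_mutex_lock(&z->lock);

	if (ret)
		return ret;	/* EDEADLK if this thread already owns the lock */
	ret = zone_get_trip_locked(z, temp);
	pthread_mutex_unlock(&z->lock);
	return ret;
}

/* Assumed bug pattern: call the locking wrapper while holding the lock. */
int zone_update_buggy(struct zone *z, int *temp)
{
	int ret = pthread_mutex_lock(&z->lock);

	if (ret)
		return ret;
	ret = zone_get_trip(z, temp);	/* relock attempt */
	pthread_mutex_unlock(&z->lock);
	return ret;
}

/* One way out: call the unlocked helper while holding the lock. */
int zone_update_fixed(struct zone *z, int *temp)
{
	int ret = pthread_mutex_lock(&z->lock);

	if (ret)
		return ret;
	ret = zone_get_trip_locked(z, temp);
	pthread_mutex_unlock(&z->lock);
	return ret;
}
```

With a kernel mutex there is no error return, so the same pattern would stall the calling thread. Commenting out the lock, as in the hunk above, avoids the relock but also drops the protection for callers that do not hold the lock; having locked callers use the unlocked helper keeps both.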
Comment 5 Artem S. Tashkinov 2022-09-25 19:19:25 UTC
CC'ing Stephen Rothwell
Comment 6 spasswolf 2022-09-27 08:44:32 UTC
The ACPI thermal zone driver is the problem: setting CONFIG_ACPI_THERMAL=n makes the kernel bootable again. CONFIG_ACPI_THERMAL=m postpones the hang until the module is loaded (3 s into boot instead of 0.3 s with =y).
Comment 7 spasswolf 2022-09-27 21:06:25 UTC
The boot hang is fixed by the changes in linux-20220927, but the instability continues. After about 15 min the system locks up with the Caps Lock LED flashing.
Comment 8 spasswolf 2022-09-28 19:50:42 UTC
Since this instability also occurs when I revert the thermal patches, it is caused by something else ...
