Bug 214989 - HW consistent power fault defect cause system hang on kernel 5.4
Summary: HW consistent power fault defect cause system hang on kernel 5.4
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: HotPlug
Hardware: Intel Linux
Importance: P1 normal
Assignee: Greg Kroah-Hartman
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-11-12 01:38 UTC by Linjun Bao
Modified: 2021-11-16 06:23 UTC
CC List: 6 users

See Also:
Kernel Version: 5.4.75
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Content shown on the display when the system hung; because of the hang, no other information is available (380 bytes, text/plain)
2021-11-12 01:38 UTC, Linjun Bao

Description Linjun Bao 2021-11-12 01:38:46 UTC
Created attachment 299545 [details]
Content shown on the display when the system hung; because of the hang, no other information is available

We recently encountered a system hang (dead spinlock) when moving to kernel linux-5.4.y.


We eventually used bisect to locate the suspicious commit: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=4667358dab9cc07da044d5bc087065545b1000df


Our system has a HW defect that wrongly sets PCI_EXP_SLTSTA_PFD high. With this commit, that leads to an infinite loop jumping to read_status: there is no chance to clear the PCI_EXP_SLTSTA_PFD status bit, since ctrl is not updated. I know this is our HW defect, but this commit traps the kernel in the isr function and hangs it, so the user cannot get any useful information about what went wrong. I do not think this is the expected behavior, so I would like to report it to you for discussion.

After discussion with Stuart (stuart.w.hayes@gmail.com), he provided a patch that compares only the status bits that are currently set against those same bits from the previous loop iteration. This patch eliminates the system hang.
Comment 1 Linjun Bao 2021-11-12 01:40:09 UTC
Add patch content.

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 3024d7e85e6a..bf8fe868a293 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -594,7 +594,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id)
  	struct controller *ctrl = (struct controller *)dev_id;
  	struct pci_dev *pdev = ctrl_dev(ctrl);
  	struct device *parent = pdev->dev.parent;
-	u16 status, events = 0;
+	u16 changed, status, events = 0;
  
  	/*
  	 * Interrupts only occur in D3hot or shallower and only if enabled
@@ -643,6 +643,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id)
  	if (ctrl->power_fault_detected)
  		status &= ~PCI_EXP_SLTSTA_PFD;
  
+	changed = status ^ (events & status);
  	events |= status;
  	if (!events) {
  		if (parent)
@@ -659,7 +660,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id)
  		 * So re-read the Slot Status register in case a bit was set
  		 * between read and write.
  		 */
-		if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode)
+		if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode && changed)
  			goto read_status;
  	}
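
For illustration only, here is a minimal user-space sketch of the re-read loop with and without the `changed` check. It is not the kernel code: `struct sim_slot`, `sim_read_status`, `sim_clear_status`, and `sim_isr_reads` are hypothetical names, the slot is modelled as a plain variable whose PFD bit is re-asserted after every write-1-to-clear, MSI is assumed to be enabled, and the `power_fault_detected` masking is left out. Only the `changed` expression is taken from the patch above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Power Fault Detected bit of the Slot Status register. */
#define SIM_SLTSTA_PFD 0x0002u

/* Hypothetical model of a slot; not a kernel structure. */
struct sim_slot {
	uint16_t status; /* current Slot Status contents */
	uint16_t stuck;  /* bits the faulty HW re-asserts after every clear */
};

static uint16_t sim_read_status(const struct sim_slot *s)
{
	return s->status;
}

/* Write-1-to-clear, except that stuck bits come straight back. */
static void sim_clear_status(struct sim_slot *s, uint16_t bits)
{
	s->status = (uint16_t)((s->status & ~bits) | s->stuck);
}

/*
 * Simplified model of the read_status loop.
 * Returns the number of status reads performed, or -1 if max_reads
 * was reached (standing in for the hang seen on real hardware).
 */
static int sim_isr_reads(struct sim_slot *s, bool use_changed, int max_reads)
{
	uint16_t changed, status, events = 0;
	int reads = 0;

	assert(max_reads > 0);

	for (;;) {
		if (reads == max_reads)
			return -1;

		status = sim_read_status(s);
		reads++;

		/* Bits set now that were not already recorded in events. */
		changed = status ^ (events & status);
		events |= status;

		if (!status)
			break; /* nothing pending, leave the loop */

		sim_clear_status(s, events);

		/* Patched behaviour: only re-read if a new bit appeared. */
		if (use_changed && !changed)
			break;
	}
	return reads;
}
```

Under these assumptions, a slot initialised as `{ SIM_SLTSTA_PFD, SIM_SLTSTA_PFD }` makes `sim_isr_reads(&slot, false, 1000)` hit the read limit and return -1, while `sim_isr_reads(&slot, true, 1000)` returns after 2 reads, because the second read shows no bit that was not already seen.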
Comment 2 Greg Kroah-Hartman 2021-11-12 07:00:04 UTC
On Fri, Nov 12, 2021 at 01:38:46AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=214989
> 
>             Bug ID: 214989
>            Summary: HW consistent power fault defect cause system hang on
>                     kernel 5.4
>            Product: Drivers
>            Version: 2.5
>     Kernel Version: 5.4.75

Very old kernel version, please try 5.15 and contact the developers on
the respective mailing list as listed in the MAINTAINERS file.
Comment 3 Linjun Bao 2021-11-15 01:55:27 UTC
Yes, 5.15 has the same issue, since the pciehp_isr function is the same as in 5.4. I have already added the developers.
Comment 4 Lukas Wunner 2021-11-16 06:23:14 UTC
Patch submitted:
https://lore.kernel.org/linux-pci/20211115192723.GA19161@wunner.de/
