Bug 215992
Summary: AER Multiple ERR_COR/FATAL/NONFATAL Received bits never cleared
Product: Drivers
Component: PCI
Reporter: Bjorn Helgaas (bjorn)
Assignee: drivers_pci (drivers_pci)
Status: NEW
Severity: normal
Priority: P1
Hardware: All
OS: Linux
URL: https://lore.kernel.org/r/20220418150237.1021519-1-sathyanarayanan.kuppuswamy@linux.intel.com
Kernel Version: 5.17
Subsystem:
Regression: No
Bisected commit-id:
Description
Bjorn Helgaas
2022-05-17 21:04:38 UTC
Opened on behalf of Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> to archive background and repro information from https://lore.kernel.org/r/20220418150237.1021519-1-sathyanarayanan.kuppuswamy@linux.intel.com

Here's the repro information, which I think came from Eric Badger <ebadger@purestorage.com>:

  This error can be reproduced by making the following changes to the
  aer_irq() function and by executing the given test commands.

   static irqreturn_t aer_irq(int irq, void *context)
           struct aer_err_source e_src = {};

           pci_read_config_dword(rp, aer + PCI_ERR_ROOT_STATUS, &e_src.status);
  +        pci_dbg(pdev->port, "Root Error Status: %04x\n",
  +                e_src.status);
           if (!(e_src.status & AER_ERR_STATUS_MASK))
                   return IRQ_NONE;

  +        mdelay(5000);

  # Prep injection data for a correctable error
  $ cd /sys/kernel/debug/apei/einj
  $ echo 0x00000040 > error_type
  $ echo 0x4 > flags
  $ echo 0x891000 > param4

  # Root Error Status is initially clear
  $ setpci -s <Dev ID> ECAP0001+0x30.w
  0000

  # Inject one error
  $ echo 1 > error_inject

  # Interrupt received
  pcieport <Dev ID>: AER: Root Error Status 0001

  # Inject another error (within 5 seconds)
  $ echo 1 > error_inject

  # You will get a new IRQ with only multiple ERR_COR bit set
  pcieport <Dev ID>: AER: Root Error Status 0002

  Currently, the above issue has only been reproduced on the ICL server
  platform.

The test in the aer_irq() snippet above should say:

           if (!(e_src.status & (PCI_ERR_ROOT_UNCOR_RCV|PCI_ERR_ROOT_COR_RCV)))
                   return IRQ_NONE;

since this is a repro case for v5.17. The *fix* for this issue changes that test to:

           if (!(e_src.status & AER_ERR_STATUS_MASK))