Bug 215992 - AER Multiple ERR_COR/FATAL/NONFATAL Received bits never cleared
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL: https://lore.kernel.org/r/20220418150...
Reported: 2022-05-17 21:04 UTC by Bjorn Helgaas
Modified: 2022-05-17 22:27 UTC
Kernel Version: 5.17
Regression: No


Description Bjorn Helgaas 2022-05-17 21:04:38 UTC
In some cases PCI_ERR_ROOT_MULTI_COR_RCV and PCI_ERR_ROOT_MULTI_UNCOR_RCV can be set but never cleared, e.g.,

  - hardware receives ERR_COR message
  - hardware sets PCI_ERR_ROOT_COR_RCV in PCI_ERR_ROOT_STATUS
  - aer_irq() entered
  - aer_irq(): status = pci_read_config_dword(PCI_ERR_ROOT_STATUS)
  - aer_irq(): now status == PCI_ERR_ROOT_COR_RCV
  - hardware receives second ERR_COR message
  - hardware sets PCI_ERR_ROOT_MULTI_COR_RCV in PCI_ERR_ROOT_STATUS
  - aer_irq(): pci_write_config_dword(PCI_ERR_ROOT_STATUS, status)
  - PCI_ERR_ROOT_STATUS now has PCI_ERR_ROOT_MULTI_COR_RCV set
  - aer_irq() entered again
  - aer_irq(): status = pci_read_config_dword(PCI_ERR_ROOT_STATUS)
  - aer_irq(): now status == PCI_ERR_ROOT_MULTI_COR_RCV
  - aer_irq() exits because (PCI_ERR_ROOT_UNCOR_RCV|PCI_ERR_ROOT_COR_RCV) not set
  - PCI_ERR_ROOT_STATUS still has PCI_ERR_ROOT_MULTI_COR_RCV set
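
The sequence above can be replayed in a small userspace model. This is only a sketch, not kernel code: `struct root_status_model`, `hw_receive_err_cor()`, `sw_write_status()`, `v517_handles()` and `replay_race()` are hypothetical names invented for illustration. The bit values are the ones in include/uapi/linux/pci_regs.h, and the register is modeled as write-1-to-clear.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values as defined in include/uapi/linux/pci_regs.h */
#define PCI_ERR_ROOT_COR_RCV         0x00000001u
#define PCI_ERR_ROOT_MULTI_COR_RCV   0x00000002u
#define PCI_ERR_ROOT_UNCOR_RCV       0x00000004u
#define PCI_ERR_ROOT_MULTI_UNCOR_RCV 0x00000008u

/* Hypothetical stand-in for the Root Error Status register */
struct root_status_model {
	uint32_t status;
};

/* Hardware side: an ERR_COR message arrives at the Root Port */
static void hw_receive_err_cor(struct root_status_model *m)
{
	assert(m);
	if (m->status & PCI_ERR_ROOT_COR_RCV)
		m->status |= PCI_ERR_ROOT_MULTI_COR_RCV; /* second and later */
	else
		m->status |= PCI_ERR_ROOT_COR_RCV;       /* first message */
}

/* Software side: config write, modeled as write-1-to-clear */
static void sw_write_status(struct root_status_model *m, uint32_t val)
{
	assert(m);
	m->status &= ~val;
}

/* The v5.17 test: only the two non-"multiple" bits are checked */
static int v517_handles(uint32_t status)
{
	return (status & (PCI_ERR_ROOT_UNCOR_RCV | PCI_ERR_ROOT_COR_RCV)) != 0;
}

/* Replays the sequence and returns the final register value */
static uint32_t replay_race(void)
{
	struct root_status_model m = { 0 };
	uint32_t status;

	hw_receive_err_cor(&m);      /* first ERR_COR */
	status = m.status;           /* aer_irq() reads the status */
	hw_receive_err_cor(&m);      /* second ERR_COR arrives mid-handler */
	sw_write_status(&m, status); /* aer_irq() writes back what it read */

	status = m.status;           /* aer_irq() entered again */
	if (v517_handles(status))
		sw_write_status(&m, status);

	return m.status;             /* what is left set in the register */
}
```

In this model `replay_race()` returns with only PCI_ERR_ROOT_MULTI_COR_RCV still set, matching the last step of the list.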
Comment 1 Bjorn Helgaas 2022-05-17 21:07:06 UTC
Opened on behalf of Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> to archive background and repro information from https://lore.kernel.org/r/20220418150237.1021519-1-sathyanarayanan.kuppuswamy@linux.intel.com

Here's the repro information, which I think came from Eric Badger <ebadger@purestorage.com>:

This error can be reproduced by making the following changes to the aer_irq()
function and then running the test commands below.

  static irqreturn_t aer_irq(int irq, void *context)
          struct aer_err_source e_src = {};

          pci_read_config_dword(rp, aer + PCI_ERR_ROOT_STATUS,
                                 &e_src.status);
  +       pci_dbg(pdev->port, "Root Error Status: %04x\n",
  +             e_src.status);
          if (!(e_src.status & AER_ERR_STATUS_MASK))
                  return IRQ_NONE;

  +       mdelay(5000);

  # Prep injection data for a correctable error
  $ cd /sys/kernel/debug/apei/einj
  $ echo 0x00000040 > error_type
  $ echo 0x4 > flags
  $ echo 0x891000 > param4

  # Root Error Status is initially clear
  $ setpci -s <Dev ID> ECAP0001+0x30.w
  0000

  # Inject one error
  $ echo 1 > error_inject

  # Interrupt received
  pcieport <Dev ID>: AER: Root Error Status 0001

  # Inject another error (within 5 seconds)
  $ echo 1 > error_inject

  # You will get a new IRQ with only multiple ERR_COR bit set
  pcieport <Dev ID>: AER: Root Error Status 0002

So far, this issue has only been reproduced on the ICL server
platform.
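
For reading the values printed above, here is a small decoder sketch for the low four bits of Root Error Status. `decode_root_status()` and the `root_status_bits` table are hypothetical helpers written for this note, not kernel or pciutils code. The bit values come from include/uapi/linux/pci_regs.h, and the descriptive names follow the wording in this bug's title.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Bit values as defined in include/uapi/linux/pci_regs.h */
#define PCI_ERR_ROOT_COR_RCV         0x00000001u
#define PCI_ERR_ROOT_MULTI_COR_RCV   0x00000002u
#define PCI_ERR_ROOT_UNCOR_RCV       0x00000004u
#define PCI_ERR_ROOT_MULTI_UNCOR_RCV 0x00000008u

/* Hypothetical name table for the four "Received" bits */
static const struct {
	uint32_t bit;
	const char *name;
} root_status_bits[] = {
	{ PCI_ERR_ROOT_COR_RCV,         "ERR_COR Received" },
	{ PCI_ERR_ROOT_MULTI_COR_RCV,   "Multiple ERR_COR Received" },
	{ PCI_ERR_ROOT_UNCOR_RCV,       "ERR_FATAL/NONFATAL Received" },
	{ PCI_ERR_ROOT_MULTI_UNCOR_RCV, "Multiple ERR_FATAL/NONFATAL Received" },
};

/*
 * Writes a comma-separated list of the set bits into buf and returns
 * how many names were written. Stops early if buf is too small.
 */
static int decode_root_status(uint32_t status, char *buf, size_t len)
{
	size_t used = 0;
	int count = 0;

	assert(buf && len > 0);
	buf[0] = '\0';

	for (size_t i = 0;
	     i < sizeof(root_status_bits) / sizeof(root_status_bits[0]); i++) {
		int n;

		if (!(status & root_status_bits[i].bit))
			continue;

		n = snprintf(buf + used, len - used, "%s%s",
			     count ? ", " : "", root_status_bits[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break; /* output truncated */

		used += (size_t)n;
		count++;
	}
	return count;
}
```

With this helper, the value 0001 from the first interrupt decodes to the ERR_COR Received bit alone, and 0002 from the second interrupt decodes to the Multiple ERR_COR Received bit alone.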
Comment 2 Bjorn Helgaas 2022-05-17 22:27:39 UTC
The status check in the repro patch in comment 1 should say:

  if (!(e_src.status & (PCI_ERR_ROOT_UNCOR_RCV|PCI_ERR_ROOT_COR_RCV)))
    return IRQ_NONE;

since this is a repro case for v5.17.  The *fix* for this issue changes that test to:

  if (!(e_src.status & AER_ERR_STATUS_MASK))
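
To illustrate why the changed test matters, the sketch below compares the two checks on the stuck value from the repro (0x0002). It is a userspace illustration only. The definition of AER_ERR_STATUS_MASK is an assumption here (all four Received bits), since this bug only names the macro; `handles_v517()` and `handles_fixed()` are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit values as defined in include/uapi/linux/pci_regs.h */
#define PCI_ERR_ROOT_COR_RCV         0x00000001u
#define PCI_ERR_ROOT_MULTI_COR_RCV   0x00000002u
#define PCI_ERR_ROOT_UNCOR_RCV       0x00000004u
#define PCI_ERR_ROOT_MULTI_UNCOR_RCV 0x00000008u

/* ASSUMPTION: the mask used by the fix covers all four Received bits */
#define AER_ERR_STATUS_MASK (PCI_ERR_ROOT_UNCOR_RCV |     \
			     PCI_ERR_ROOT_COR_RCV |       \
			     PCI_ERR_ROOT_MULTI_COR_RCV | \
			     PCI_ERR_ROOT_MULTI_UNCOR_RCV)

/* Compile-time check that the assumed mask includes the "multiple" bits */
static_assert((AER_ERR_STATUS_MASK & PCI_ERR_ROOT_MULTI_COR_RCV) != 0,
	      "mask must include Multiple ERR_COR Received");
static_assert((AER_ERR_STATUS_MASK & PCI_ERR_ROOT_MULTI_UNCOR_RCV) != 0,
	      "mask must include Multiple ERR_FATAL/NONFATAL Received");

/* v5.17 test: true if aer_irq() would go on to handle and clear status */
static bool handles_v517(uint32_t status)
{
	return (status & (PCI_ERR_ROOT_UNCOR_RCV | PCI_ERR_ROOT_COR_RCV)) != 0;
}

/* Test after the fix, under the mask assumption above */
static bool handles_fixed(uint32_t status)
{
	return (status & AER_ERR_STATUS_MASK) != 0;
}
```

Under that assumption, a status of 0x0002 makes the v5.17 test return IRQ_NONE without clearing anything, while the changed test lets the handler continue to the write that clears the bit.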
