Bug 59531

Summary: Error recovery slot_reset() callback called while device is frozen
Product: Drivers Reporter: Bjorn Helgaas (bjorn)
Component: PCIAssignee: drivers_pci (drivers_pci)
Status: NEW ---    
Severity: normal    
Priority: P1    
Hardware: All   
OS: Linux   
URL: https://lkml.kernel.org/r/F7B8FD780A346D46A0042F5C63B06AE785E3B2@SHSMSX102.ccr.corp.intel.com
Kernel Version: 3.10 Subsystem:
Regression: No Bisected commit-id:

Description Bjorn Helgaas 2013-06-10 17:20:46 UTC
The following report is from Joseph Liu <Joseph.Liu@emulex.com>:

I am working on a driver's PCI error recovery with AER. I have a question with one of my observations from platform AER driver's behavior. As your names and emails are listed to the PCIe AER driver coming with the kernel, I send this question to you:

During injecting AER Non-Correctable/Fatal error and PCIe error recovery process, I observed the following from our Low Level Driver (LLD):

1. The LLD's error_detected() callback got called and the PCI channel state passed in as "pci_channel_io_frozen", as expected;

2. The LLD's error_detected() callback function returned with PCI_ERS_RESULT_NEED_RESET, requesting a PCI slot reset;

3. The LLD's slot_reset() callback got called and it attempted to re-initialize the device hardware for the recovery. However, the PCI slot state was still in "pci_channel_io_frozen" and the pci_channel_offline() helper routine returned 1. This is not expected, and it actually prevented LLD in performing needed access to memory mapped I/O space for preparing the device for recovery;

4. Later, the LLD's resume() callback got called and the PCI slot state was set to "pci_channel_io_normal";

In the upstream Linux kernel 3.7.0's pci-error-recovery.txt, "STEP 4 Slot Reset", it stated after platform has performed PCI slot reset and then calls LLD's slot_reset() callback:

"This call gives drivers the chance to re-initialize the hardware (re-download firmware, etc.).  At this point, the driver may assume that the card is in a fresh state and is fully functional. The slot is unfrozen and the driver has full access to PCI config space, memory mapped I/O space and DMA. Interrupts (Legacy, MSI, or MSI-X) will also be available."

I looked into the kernel 3.7.0's AER driver's aerdrv_core.c and see that the PCI slot state was set to "pci_channel_io_frozen" on AER_FATAL serverity, and only set back to "pci_channel_io_normal" in report_resume() function. The PCI slot state was not set to "pci_channel_io_normal" when invoking report_slot_reset().

As a comparison, I also perform the EEH error recovery with the LLD driver on a PPC platform, which indeed set the PCI slot state to "pci_channel_io_normal" when calling LLD's slot_reset() callback function.

Can you please verify this platform AER driver's behavior is intended and will stay with the AER platform support, or it should be changed to be consistent with the PCI error recovery procedure described in the pci-error-recovery.txt documentation? I also noticed that before invoking the broadcast_error_message() function with "slot_reset", there is a comment in the 3.7.0 kernel source:

/*
 * TODO: Should call platform-specific
 * functions to reset slot before calling
 * drivers' slot_reset callbacks?
 */
And I do see that only the PCIe Link_Reset was performed, no PCI fundamental reset was performed even the LLD set the PCIe device's "pdev->needs_freset = 1", is this the area that later will be added for performing needed PCI hot reset or fundamental reset at this stage?