Bug 197055 - Unable to recover non-Fatal AER if any function under the same bus does not implement AER driver error detected callback
Summary: Unable to recover non-Fatal AER if any function under the same bus does not i...
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
Depends on:
Reported: 2017-09-27 16:00 UTC by Gabriele Paoloni
Modified: 2017-09-27 16:00 UTC (History)
0 users

See Also:
Kernel Version: 4.14-rc2
Regression: No
Bisected commit-id:

aer_fail_vs_fixed_log (5.51 KB, text/plain)
2017-09-27 16:00 UTC, Gabriele Paoloni

Description Gabriele Paoloni 2017-09-27 16:00:50 UTC
Created attachment 258617 [details]

Currently if an uncorrectable error is reported by an EP the AER
driver walks over all the devices connected to the upstream port
bus and in turns call the report_error_detected() callback.
If any of the devices connected to the bus does not implement
dev->driver->err_handler->error_detected() do_recovery() will fail
leaving all the bus hierarchy devices unrecovered.

According to the PCIe specs section " Non-Fatal Errors" 
<< Non-fatal errors are uncorrectable errors which cause a particular
transaction to be unreliable but the Link is otherwise fully functional.
Isolating Non-fatal from Fatal errors provides Requester/Receiver logic
in a device or system management software the opportunity to recover
from the error without resetting the components on the Link and
disturbing other transactions in progress. Devices not associated with
the transaction in error are not impacted by the error.>>
Therefore for non-fatal errors it seems not correct to walk over the upstream port bus invoking report_error_detected() for all the devices connected to te bus.

In HiSilicon Hi1620 we have some integrated SoC peripheral controllers that appear under the same bus. More in details we have the HiSilicon SAS controller implementing dev->driver->err_handler->error_detected(), whereas the SATA controller does not implement the callback.

RC---bus74---|-SAS ctrl
             |-SATA ctrl

Injecting AER errors leads to AER not being able to recover (please refer to attached "aer_fail_vs_fixed_log.txt")

Patching with the following code we fix the issue (see "aer_fail_vs_fixed_log.txt") 

diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index b1303b3..057465ad 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -390,7 +390,14 @@ static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
 		 * If the error is reported by an end point, we think this
 		 * error is related to the upstream link of the end point.
-		pci_walk_bus(dev->bus, cb, &result_data);
+		if (state == pci_channel_io_normal)
+			/*
+			 * the error is non fatal so the bus is ok, just invoke
+			 * the callback for the function that logged the error.
+			 */
+			cb(dev, &result_data);
+		else
+			pci_walk_bus(dev->bus, cb, &result_data);
 	return result_data.result;

Note You need to log in before you can comment on or make changes to this bug.