Created attachment 107068
dmesg log

Yijing reported a hotplug problem [1] with a Huawei rh5885 and a QLogic QLE2562 Fibre Channel HBA.

If a slot is empty at boot-time, BIOS configures the Root Port leading to the slot with a Max_Payload_Size (MPS) of 128. When a device is hot-added, it powers up with the default MPS=128, and Linux doesn't change that, so the card works correctly.

If the slot contains a QLogic FC card at boot-time, BIOS configures the Root Port and the FC card with MPS=256, and the card again works correctly. If the card is hot-removed and hot-added back, the card powers up with MPS=128. Linux doesn't change the MPS, so the card doesn't work because it has MPS=128 while the Root Port has MPS=256.

This problem is not specific to the Huawei platform or the QLogic device. Linux needs to make sure MPS settings are safe after the hot-add event.

[1] https://lkml.kernel.org/r/505456FD.6040801@huawei.com
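To make the two hot-add cases above concrete, here is a minimal sketch of the check Linux would need after a hot-add. It is not kernel code: the function names are hypothetical, and the assumption that the card supports MPS=256 comes only from the fact that BIOS programmed it that way at boot. The only hardware detail relied on is that the 3-bit Max_Payload_Size field encodes 128 << n bytes.

```python
# Hypothetical sketch, not a kernel API. Names are made up for illustration.

def decode_mps(field: int) -> int:
    """Decode a 3-bit Max_Payload_Size field into bytes (0 -> 128, 1 -> 256, ...)."""
    if not 0 <= field <= 5:
        raise ValueError("reserved Max_Payload_Size encoding: %d" % field)
    return 128 << field


def mps_after_hot_add(port_mps: int, dev_mps: int, dev_mps_supported: int) -> int:
    """Pick the MPS a hot-added device should run with, leaving the port alone.

    port_mps:          MPS currently programmed in the Root Port
    dev_mps:           MPS the device powered up with (128 by default)
    dev_mps_supported: largest MPS the device advertises support for
    """
    if dev_mps == port_mps:
        # Empty-slot-at-boot case: both sides already agree, nothing to do.
        return dev_mps
    if dev_mps_supported >= port_mps:
        # Card-present-at-boot case: bring the device up to the port's setting.
        return port_mps
    # The device cannot match the port; the port (and possibly everything
    # else below it) would have to be lowered, which this sketch does not do.
    raise RuntimeError("device cannot match port MPS=%d" % port_mps)


# Slot empty at boot: port and card both at 128.
print(mps_after_hot_add(port_mps=128, dev_mps=128, dev_mps_supported=256))
# Card present at boot, then hot-removed and hot-added: port 256, card back at 128.
print(mps_after_hot_add(port_mps=256, dev_mps=128, dev_mps_supported=256))
```

The sketch only adjusts the new device. Lowering the Root Port instead would also be safe, but it could affect other devices below the same port, so it is left out here.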
The reference discussion link: http://marc.info/?l=linux-scsi&m=134788365823217&w=2

Joe Jin <joe.jin@oracle.com> also reported the same problem with their E1000 device. Reference discussion link: http://marc.info/?l=e1000-devel&m=134182518500774&w=2

We can work around this issue by adding "pci=pcie_bus_safe" to the kernel boot command line, but that is not a smart solution. I will soon provide a patch to fix the problem when it is triggered by hotplug.
Joe is seeing a different problem. In Joe's case, there is no hotplug event, and in fact, the NIC involved is not hot-pluggable.

On Joe's machine, with the default BIOS settings as shown in http://marc.info/?l=e1000-devel&m=134207732924628&w=2, we have the following MPS settings in the path to the NIC:

  00:07.0 Root Port bridge to [bus 02-05]      MPS=256
  02:00.0 Upstream Port bridge to [bus 03-05]  MPS=256
  03:02.0 Downstream Port bridge to [bus 05]   MPS=128
  05:00.0 82571EB NIC                          MPS=128
  05:00.1 82571EB NIC                          MPS=128

The mismatch between MPS settings of 02:00.0 and 03:02.0 causes the problem. If 05:00.0 issues a DMA read, the response can be a TLP with 256-byte payload, which should be rejected as a malformed TLP by 03:02.0.

Joe's workaround was to change a BIOS setting so the max MPS value is 128, as shown in http://marc.info/?l=e1000-devel&m=134215430719726&w=2. Then we have this:

  00:07.0 Root Port bridge to [bus 02-05]      MPS=128
  02:00.0 Upstream Port bridge to [bus 03-05]  MPS=128
  03:02.0 Downstream Port bridge to [bus 05]   MPS=128
  05:00.0 82571EB NIC                          MPS=128
  05:00.1 82571EB NIC                          MPS=128

I think the fact that BIOS programmed the MPS settings of 02:00.0 and 03:02.0 differently is pretty clearly a BIOS bug. It's possible that Linux could issue a warning or even fix it up, but I don't think the patch proposed for this bug (bug 60671) will help Joe's issue.
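As a rough illustration of the kind of check that could drive such a warning, the sketch below walks the path from the Root Port to the NIC and reports the first hop where MPS drops going downstream. The function name and the list-of-tuples representation are hypothetical; the device addresses and MPS values are the ones listed above.

```python
from typing import List, Optional, Tuple

Hop = Tuple[str, int]  # (PCI address, MPS in bytes)


def first_mps_drop(path: List[Hop]) -> Optional[Tuple[str, str]]:
    """Return (upstream, downstream) for the first adjacent pair where the
    downstream MPS is smaller than the upstream MPS, or None if there is none.

    A TLP sized for the upstream setting can exceed the downstream setting,
    so the downstream device would treat it as malformed.
    """
    for (up_name, up_mps), (down_name, down_mps) in zip(path, path[1:]):
        if down_mps < up_mps:
            return (up_name, down_name)
    return None


# Path to 05:00.0 with the default BIOS settings.
default_bios = [("00:07.0", 256), ("02:00.0", 256), ("03:02.0", 128), ("05:00.0", 128)]
# Same path after Joe's BIOS workaround (max MPS limited to 128).
workaround = [("00:07.0", 128), ("02:00.0", 128), ("03:02.0", 128), ("05:00.0", 128)]

print(first_mps_drop(default_bios))  # ('02:00.0', '03:02.0')
print(first_mps_drop(workaround))    # None
```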
I opened bug 60799 for the issue Joe reported. Basically it just contains the sources from which I extracted the information in comment #2.
Yijing, did this get resolved? It sounds similar to the problem Keith is seeing: http://lkml.kernel.org/r/alpine.LRH.2.03.1406031308200.11244@AMR