Bug 210153

Summary: Intel P4600 (SSDPEDKE020T7) stalls and throws "timeout, completion polled" randomly
Product: Drivers
Reporter: Christian Theune (ct)
Component: Flash/Memory Technology Devices
Assignee: David Woodhouse (dwmw2)
Status: NEW
Severity: high    
Priority: P1    
Hardware: x86-64   
OS: Linux   
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=204887
https://bugzilla.kernel.org/show_bug.cgi?id=210177
Kernel Version: 4.19.157
Subsystem:
Regression: No
Bisected commit-id:

Description Christian Theune 2020-11-12 05:53:08 UTC
We're experiencing intermittent stalls on this device after moving it from an Intel system to an AMD system. This happens with 4 similar devices all moved from Intel to AMD. Over all 4 systems this 

We updated the kernel and all the firmware for the devices to no avail.

The new board is: H11SSL-i

The new CPU is: 1 AMD EPYC 7302P 16-Core Processor

The kernel now is: Linux cartman19 4.19.157 #1 SMP Thu Nov 12 00:29:39 CET 2020 x86_64 AMD EPYC 7302P 16-Core Processor AuthenticAMD GNU/Linux

The firmware now is:

- Intel SSD DC P4600 Series BTLE81750BBZ2P0IGN -

Bootloader : 0136
DevicePath : /dev/nvme3n1
DeviceStatus : Healthy
Firmware : QDV101D1
FirmwareUpdateAvailable : The selected drive contains current firmware as of this tool release.
Index : 3
ModelNumber : INTEL SSDPEDKE020T7
ProductFamily : Intel SSD DC P4600 Series
SerialNumber : BTLE81750BBZ2P0IGN

We're currently at a loss as to what's happening here, and neither Google nor the bug tracker has helped. We're trying to contact Intel and the board vendor in parallel.

There is one related bug in the tracker (#204887); however, we do not do resume, and there is nothing else in the kernel log when this happens.

Let me know if you need any more info. I can also be available in person on IRC or other interactive media if needed.
Comment 1 Christian Theune 2020-11-12 21:59:33 UTC
This might be related to another bug that we experienced on the same machine:
https://bugzilla.kernel.org/show_bug.cgi?id=210177

We have now deactivated the IOMMU in the BIOS and are monitoring whether this improves the situation. The last two hours have been promising.
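
For anyone wanting to monitor this the same way, here is a minimal sketch that counts the "timeout, completion polled" messages per controller in kernel log text. The message prefix ("nvme nvmeN: I/O <tag> QID <qid>") is an assumption about the log format rather than a line copied from our logs, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Assumed log format: "nvme nvme3: I/O 12 QID 5 timeout, completion polled".
# Only the trailing "timeout, completion polled" text is taken from the bug
# summary; the prefix layout is an assumption.
TIMEOUT_RE = re.compile(
    r"nvme (nvme\d+): I/O (\d+) QID (\d+) timeout, completion polled"
)


def count_polled_timeouts(log_lines):
    """Return a Counter mapping controller name to number of matching lines."""
    counts = Counter()
    for line in log_lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the lines of saved `dmesg` output before and after the BIOS change gives a rough per-controller count to compare.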
Comment 2 Christian Theune 2020-11-13 13:56:36 UTC
Update: no further stalls for 17+ hours.