Bug 210153 - Intel P4600 (SSDPEDKE020T7) stalls and throws "timeout, completion polled" randomly
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Flash/Memory Technology Devices
Hardware: x86-64 Linux
Importance: P1 high
Assignee: David Woodhouse
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-11-12 05:53 UTC by Christian Theune
Modified: 2020-11-13 13:56 UTC

See Also:
Kernel Version: 4.19.157
Subsystem:
Regression: No
Bisected commit-id:



Description Christian Theune 2020-11-12 05:53:08 UTC
We're experiencing intermittent stalls on this device after moving it from an Intel system to an AMD system. This happens with 4 similar devices all moved from Intel to AMD. Over all 4 systems this 

We updated the kernel and all the firmware for the devices to no avail.

The new board is: H11SSL-i

The new CPU is: 1 AMD EPYC 7302P 16-Core Processor

The kernel now is: Linux cartman19 4.19.157 #1 SMP Thu Nov 12 00:29:39 CET 2020 x86_64 AMD EPYC 7302P 16-Core Processor AuthenticAMD GNU/Linux

The firmware now is:

- Intel SSD DC P4600 Series BTLE81750BBZ2P0IGN -

Bootloader : 0136
DevicePath : /dev/nvme3n1
DeviceStatus : Healthy
Firmware : QDV101D1
FirmwareUpdateAvailable : The selected drive contains current firmware as of this tool release.
Index : 3
ModelNumber : INTEL SSDPEDKE020T7
ProductFamily : Intel SSD DC P4600 Series
SerialNumber : BTLE81750BBZ2P0IGN

We're currently at a loss as to what's happening here, and neither Google nor the bug tracker has helped. We're trying to contact Intel and the board vendor in parallel.

There is one related bug in the tracker (#204887); however, we do not use resume, and there is nothing else in the kernel log when this happens.
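
For anyone trying to compare frequency across machines, here is a minimal sketch of how the messages could be counted per controller from saved kernel log text. Only the "timeout, completion polled" wording comes from this report; the surrounding line shape ("nvme nvmeN: ...") and the helper name count_polled_timeouts are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed line shape, e.g.:
#   "nvme nvme3: I/O 12 QID 5 timeout, completion polled"
# Only the trailing "timeout, completion polled" text is taken from the report.
TIMEOUT_RE = re.compile(r"nvme (nvme\d+): .*timeout, completion polled")


def count_polled_timeouts(log_text):
    """Return a Counter mapping controller name -> number of matching lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

The function takes the log as a string, so it can be fed the output of dmesg or a saved log file; lines that do not match the assumed shape are ignored.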

Let me know if you need any more info. I can also be available in person on IRC or other interactive media if needed.
Comment 1 Christian Theune 2020-11-12 21:59:33 UTC
This might be related to another bug that we experienced on the same machine:
https://bugzilla.kernel.org/show_bug.cgi?id=210177

We have now deactivated the IOMMU in the BIOS and are monitoring whether this improves the situation. The last two hours have been promising.
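
A minimal sketch of how the BIOS change could be double-checked from the running system, assuming that /sys/kernel/iommu_groups contains one subdirectory per IOMMU group and is missing or empty when no IOMMU is active. The helper name iommu_group_count is hypothetical, and a count of 0 is only a hint, not proof that the IOMMU is off.

```python
from pathlib import Path


def iommu_group_count(root="/sys/kernel/iommu_groups"):
    """Count IOMMU group directories under root.

    Returns 0 if the directory does not exist or has no subdirectories,
    which is assumed here to indicate that no IOMMU is in use.
    """
    path = Path(root)
    if not path.is_dir():
        return 0
    return sum(1 for entry in path.iterdir() if entry.is_dir())
```

Calling iommu_group_count() with no arguments checks the default sysfs path; the root parameter exists only so the function can be pointed at another directory.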
Comment 2 Christian Theune 2020-11-13 13:56:36 UTC
Update: no further stalls for 17+ hours.
