Created attachment 280185 [details]
patch over v4.15 branch
SSD drive stops processing requests after a while.
With attached patch system is stable.
Exact SSD information:
#hdparm -I /dev/sda
ATA device, with non-removable media
Model Number: Samsung SSD 750 EVO 250GB
Serial Number: S33SNWBH536151Z
Firmware Revision: MAT01B6Q
Transport: Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Created attachment 280187 [details]
patch over master branch
Given https://bugzilla.kernel.org/show_bug.cgi?id=201693 and https://bugzilla.kernel.org/show_bug.cgi?id=203475
this patch should be expanded to match "Samsung *" again, as in your original https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1809972 patch.
Samsung 750, 850, 860, and 870 have all been reported to have the issue.
Please don't hobble the Samsung 860* on all controllers blindly. I have both the 860 Pro and the 860 EVO, and both are working great with an "Intel Corporation NM10/ICH7 Family SATA Controller [AHCI mode] (rev 01)". If you start disabling queued TRIM on these drives, I'll have to start patching my kernel to re-enable it, and it would be a travesty for everyone else who isn't aware that their drives are being quietly hobbled.
This issue seems to be specific to only some controllers, and I'm not so sure it's even a controller incompatibility, as I too have seen the "WRITE FPDMA QUEUED" failures on at least one of these drives in the past, but that problem went away after I replaced a failing CPU fan, replaced some blown capacitors on my motherboard, and replaced some faulty RAM. I suspect electrical noise as the trigger of this issue. Maybe there's an edge case related to SATA packet retransmissions that the controller vendors didn't fully test.
A suggestion to those affected by this issue: try switching to a *shielded* SATA cable, as short as you can manage. They're actually hard to find, as most SATA cables are cheap garbage.
Enough people see this with these particular models of Samsung SSDs on multiple brands of SATA controllers that it is unlikely to be a cabling issue. It's inadequate drive firmware. Other drives connected to the same cabling in the same drive bay do not have these problems.
My hardware: a professionally manufactured HP Microserver Gen10, with factory-installed and routed cabling. This isn't a custom build.
Nobody should be forced to chase down magical mythical special cables to work around one SSD manufacturer's problem.
Run fio random IO tests with ncq disabled and see if it even benefits you in the first place before assuming that it does. My own quick test suggested that disabling it was, surprisingly, beneficial.
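For anyone who wants to try this: NCQ can be disabled per-device at runtime by dropping the SCSI queue depth to 1. A rough sketch (the device name /dev/sda is a placeholder for the drive under test; needs root):

```shell
# Check the current queue depth (32 usually means NCQ is in use):
cat /sys/block/sda/device/queue_depth

# Reduce it to 1, which effectively disables NCQ for this device:
echo 1 > /sys/block/sda/device/queue_depth

# ...run your benchmarks, then restore the original depth:
echo 32 > /sys/block/sda/device/queue_depth
```

Note this setting does not persist across reboots, so it is safe to experiment with.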
(In reply to Gregory P. Smith from comment #4)
> Run fio random IO tests with ncq disabled and see if it even benefits you in
> the first place before assuming that it does.
I did, and it does. Reproducing here my results from Bug 201693 Comment 36:
«I just ran fio on my Samsung 860 EVO 2TB, random 4K reads with libaio engine, I/O depth 256, jobs 4, runtime 120 seconds.
With queue_depth=32 (default): read: IOPS=46.7k, BW=182MiB/s
With queue_depth=1: read: IOPS=13.3k, BW=51.0MiB/s
No messages in my kernel logs for the duration of these tests.»
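For reference, a fio command line matching the parameters described above (libaio engine, 4K random reads, I/O depth 256, 4 jobs, 120 seconds) would look roughly like this. The device path is a placeholder; --readonly keeps the run non-destructive:

```shell
fio --name=randread-test --filename=/dev/sdX --readonly --direct=1 \
    --rw=randread --bs=4k --ioengine=libaio --iodepth=256 \
    --numjobs=4 --runtime=120 --time_based --group_reporting
```

Repeating the run after writing 1 and then 32 to /sys/block/sdX/device/queue_depth gives the two IOPS figures to compare.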