Bug 202093

Summary: Samsung SSD 750 EVO doesn't support queued TRIM command
Product: Drivers Reporter: Victor Belyaevski (victorbely)
Component: Flash/Memory Technology Devices Assignee: David Woodhouse (dwmw2)
Status: NEW ---    
Severity: normal CC: greg, kai.heng.feng, kernel, victorbely
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.15.0 Subsystem:
Regression: No Bisected commit-id:
Attachments: patch over v4.15 branch
patch over master branch

Description Victor Belyaevski 2018-12-29 16:09:41 UTC
Created attachment 280185
patch over v4.15 branch

The SSD stops processing requests after a while.
With the attached patch, the system is stable.

Exact SSD information:

#hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
        Model Number:       Samsung SSD 750 EVO 250GB               
        Serial Number:      S33SNWBH536151Z     
        Firmware Revision:  MAT01B6Q
        Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
...
Comment 1 Victor Belyaevski 2018-12-29 16:10:32 UTC
Created attachment 280187
patch over master branch
Comment 2 Gregory P. Smith 2021-02-24 09:29:23 UTC
Given https://bugzilla.kernel.org/show_bug.cgi?id=201693 and https://bugzilla.kernel.org/show_bug.cgi?id=203475, this patch should be expanded to match "Samsung [78]*" again, as you had in your original patch at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1809972.

Samsung 750, 850, 860, and 870 have all been reported to have the issue.
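For readers unfamiliar with glob-style model matching, here is a minimal Python sketch of what widening the pattern would mean. It is not the kernel's matcher: it uses Python's `fnmatch`, which treats `[78]` and `*` in a similar way for these simple patterns. The pattern `Samsung SSD [78]*` is an assumption (the shorthand above written against the full model string that hdparm reports), and the model strings other than the 750 EVO are illustrative only.

```python
from fnmatch import fnmatchcase

# Assumed pattern: the shorthand "Samsung [78]*" from the comment above,
# written against the full model string as hdparm reports it.
PATTERN = "Samsung SSD [78]*"


def matches(model: str, pattern: str = PATTERN) -> bool:
    """Return True if the model string matches the glob pattern.

    hdparm pads the model field with trailing spaces, so strip first.
    """
    return fnmatchcase(model.strip(), pattern)


models = [
    "Samsung SSD 750 EVO 250GB               ",  # from the hdparm output in this report
    "Samsung SSD 860 EVO 2TB",                   # illustrative model string
    "Example SSD 500GB",                         # hypothetical non-Samsung drive
]

for model in models:
    print(f"{model.strip()!r}: {matches(model)}")
```

A pattern this broad catches every model whose number starts with 7 or 8, which is why it would also apply to drives that work fine on some controllers.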
Comment 3 Matt Whitlock 2021-02-24 15:34:04 UTC
Please don't hobble the Samsung 860* on all controllers blindly. I have both the 860 Pro and the 860 EVO, and both are working great with an "Intel Corporation NM10/ICH7 Family SATA Controller [AHCI mode] (rev 01)". If you start disabling queued TRIM on these drives, I'll have to start patching my kernel to re-enable it, and it would be a travesty for everyone else who isn't aware that their drives are being quietly hobbled.

This issue seems to be specific to only some controllers, and I'm not so sure it's even a controller incompatibility, as I too have seen the "WRITE FPDMA QUEUED" failures on at least one of these drives in the past, but that problem went away after I replaced a failing CPU fan, replaced some blown capacitors on my motherboard, and replaced some faulty RAM. I suspect electrical noise as the trigger of this issue. Maybe there's an edge case related to SATA packet retransmissions that the controller vendors didn't fully test.

A suggestion to those affected by this issue: try switching to a *shielded* SATA cable, as short in length as you are able. They're actually hard to find, as most SATA cables are cheap garbage.
Comment 4 Gregory P. Smith 2021-02-24 18:50:38 UTC
Enough people see this with these particular models of Samsung SSDs on multiple brands of SATA controllers that it is unlikely to be a cabling issue.  It's inadequate drive firmware.  Other drives connected to the same cabling in the same drive bay do not have these problems.

My hardware: Professionally manufactured, installed, and routed, HP Microserver Gen10.  This isn't a custom build.

Nobody should be forced to chase down magical mythical special cables to work around one SSD manufacturer's problem.

Run fio random I/O tests with NCQ disabled and see whether NCQ even benefits you in the first place before assuming that it does.  My own quick test suggested disabling it was surprisingly and unexpectedly beneficial.
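One possible way to set up such a comparison, sketched in Python. This only builds the sysfs path and the fio command line; it does not run anything. Assumptions not stated in this thread: the device name `sda`, the job name, `--direct=1`, `--iodepth=32`, and the 60-second runtime. Writing `1` to the device's `queue_depth` sysfs attribute (as root) is the usual way to effectively disable NCQ for a test run.

```python
import shlex


def queue_depth_path(dev: str) -> str:
    """sysfs attribute holding the device queue depth (1 means no queuing)."""
    return f"/sys/block/{dev}/device/queue_depth"


def fio_randread_argv(dev: str, runtime_s: int = 60) -> list[str]:
    """Build a read-only random 4K read job for fio.

    --readonly guards against accidental writes to the raw device.
    --direct=1, --iodepth=32 and the runtime are assumptions for this sketch.
    """
    return [
        "fio",
        "--name=randread",
        f"--filename=/dev/{dev}",
        "--readonly",
        "--rw=randread",
        "--bs=4k",
        "--ioengine=libaio",
        "--direct=1",
        "--iodepth=32",
        f"--runtime={runtime_s}",
        "--time_based",
        "--group_reporting",
    ]


dev = "sda"  # hypothetical device name; substitute your own
print("queue depth attribute:", queue_depth_path(dev))
print("fio command:", shlex.join(fio_randread_argv(dev)))
```

Run the same fio job once at the default queue depth and once after setting the attribute to 1, then compare the reported IOPS.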
Comment 5 Matt Whitlock 2021-02-24 19:46:07 UTC
(In reply to Gregory P. Smith from comment #4)
> Run fio random IO tests with ncq disabled and see if it even benefits you in
> the first place before assuming that it does.

I did, and it does. Reproducing here my results from Bug 201693 Comment 36:

«I just ran fio on my Samsung 860 EVO 2TB, random 4K reads with libaio engine, I/O depth 256, jobs 4, runtime 120 seconds.

With queue_depth=32 (default): read: IOPS=46.7k, BW=182MiB/s

With queue_depth=1: read: IOPS=13.3k, BW=51.0MiB/s

No messages in my kernel logs for the duration of these tests.»
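As a quick sanity check on the figures quoted above, the following Python sketch computes the ratio between the two runs and the bandwidth implied by the IOPS at a 4 KiB block size. The numbers are copied from the quoted results; nothing here is a new measurement.

```python
# Figures copied from the quoted fio results, keyed by queue_depth.
results = {
    32: {"iops": 46_700, "bw_mib_s": 182.0},
    1: {"iops": 13_300, "bw_mib_s": 51.0},
}


def speedup(metric: str) -> float:
    """Ratio of the queue_depth=32 run to the queue_depth=1 run."""
    return results[32][metric] / results[1][metric]


def implied_bw_mib_s(iops: int, block_kib: int = 4) -> float:
    """Bandwidth in MiB/s implied by an IOPS figure at a given block size."""
    return iops * block_kib / 1024


print(f"IOPS ratio:      {speedup('iops'):.2f}x")
print(f"Bandwidth ratio: {speedup('bw_mib_s'):.2f}x")
for depth, r in results.items():
    print(f"queue_depth={depth}: implied {implied_bw_mib_s(r['iops']):.1f} MiB/s, "
          f"reported {r['bw_mib_s']:.1f} MiB/s")
```

On this drive and controller, the queued run delivers roughly 3.5 times the random-read throughput of the unqueued run.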