Bug 203475 - Samsung 860 EVO queued TRIM issues
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
Hardware: All Linux
Importance: P1 normal
Assignee: Tejun Heo
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-05-01 22:00 UTC by Roman Mamedov
Modified: 2020-03-16 05:54 UTC (History)
CC List: 6 users

See Also:
Kernel Version: 4.14.114
Tree: Mainline
Regression: No


Attachments
dmesg of the errors occurring (14.52 KB, text/plain)
2019-05-01 22:00 UTC, Roman Mamedov
disable queued TRIM for Samsung 860 series SSDs (548 bytes, patch)
2019-05-01 22:01 UTC, Roman Mamedov

Description Roman Mamedov 2019-05-01 22:00:54 UTC
Created attachment 282579 [details]
dmesg of the errors occurring

I have a Samsung SSD 860 EVO mSATA 500GB connected via an ASMedia ASM1062 Serial ATA controller. fstrim (which runs during bootup on my system) causes lockups of 20-30 seconds, with messages such as:

[  332.792044] ata14.00: exception Emask 0x0 SAct 0x3fffe SErr 0x0 action 0x6 frozen
[  332.798271] ata14.00: failed command: SEND FPDMA QUEUED
[  332.804499] ata14.00: cmd 64/01:08:00:00:00/00:00:00:00:00/a0 tag 1 ncq dma 512 out
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  332.817145] ata14.00: status: { DRDY }

After disabling queued TRIM via the included patch, the issue disappears.
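
For readers unfamiliar with the log format: SAct is a bitmask of the NCQ tags that were outstanding when the port froze, one bit per tag. A minimal sketch for decoding it follows; the helper name `active_tags` is made up for illustration and is not a kernel function.

```python
from typing import List


def active_tags(sact: int) -> List[int]:
    """Return the NCQ tag numbers whose bits are set in an SAct value."""
    # NCQ allows at most 32 tags, so only bits 0..31 are examined.
    return [tag for tag in range(32) if sact & (1 << tag)]


if __name__ == "__main__":
    # SAct value taken from the log excerpt above.
    print(active_tags(0x3FFFE))
```

For the value above, bits 1 through 17 are set, so 17 commands were in flight, including the failed SEND FPDMA QUEUED on tag 1.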
Comment 1 Roman Mamedov 2019-05-01 22:01:44 UTC
Created attachment 282581 [details]
disable queued TRIM for Samsung 860 series SSDs
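
The attachment itself is not reproduced in this thread. As a rough illustration only: libata selects device quirks by matching the drive's model string against glob patterns, so a quirk for the whole 860 series could plausibly be keyed on a pattern like the one below. The pattern `Samsung SSD 860*` is an assumption inferred from the patch title, not the patch's confirmed contents, and `is_quirked` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# ASSUMPTION: pattern inferred from the patch title; the real patch may differ.
QUIRK_PATTERN = "Samsung SSD 860*"


def is_quirked(model: str, pattern: str = QUIRK_PATTERN) -> bool:
    """Return True if a drive model string matches the glob pattern."""
    # Case-sensitive glob match on the model string.
    return fnmatchcase(model, pattern)


if __name__ == "__main__":
    # Model string as reported by smartctl later in this thread.
    print(is_quirked("Samsung SSD 860 EVO 1TB"))
```

A series-wide pattern would cover both the mSATA and the 2.5-inch variants, whatever their capacity.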
Comment 2 Solomon Peachy 2019-07-13 12:29:27 UTC
This patch is still relevant for master.  Add my vote to merging this; I'd like to be able to re-enable NCQ on this SSD.
Comment 3 Jens Axboe 2019-07-14 16:57:43 UTC
This patch looks good - any chance you can email one with a proper commit log and signed-off-by etc to linux-ide@vger.kernel.org? And you can CC me, axboe@kernel.dk, and I'll get it queued up for the current kernel.
Comment 4 Roman Mamedov 2019-07-15 17:41:33 UTC
Jens, thanks, sent to https://marc.info/?l=linux-ide&m=156312691006716&w=2, it is now being discussed there.

Solomon: what model do you have that also has a problem with TRIM, 860 EVO mSATA too? And which firmware revision?
Comment 5 Solomon Peachy 2019-07-15 17:54:25 UTC
I have the 1TB SATA (not mSATA!) version.

smartctl -a dump:

Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 860 EVO 1TB
Serial Number:    S3Z8NB0K717690X
LU WWN Device Id: 5 002538 e4054049c
Firmware Version: RVT01B6Q
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Jul 15 13:47:44 2019 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

kernel log snippet: (Untainted Fedora 5.1.16-300.fc30.x86_64 kernel)

ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata1.00: supports DRM functions and may not be fully accessible
ata1.00: ATA-11: Samsung SSD 860 EVO 1TB, RVT01B6Q, max UDMA/133
ata1.00: 1953525168 sectors, multi 1: LBA48 NCQ (depth 32), AA
ata1.00: supports DRM functions and may not be fully accessible
ata1.00: configured for UDMA/133
scsi 0:0:0:0: Direct-Access     ATA      Samsung SSD 860  1B6Q PQ: 0 ANSI: 5
sd 0:0:0:0: Attached scsi generic sg0 type 0
ata1.00: Enabling discard_zeroes_data
sd 0:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata1.00: Enabling discard_zeroes_data
sda: sda1 sda2 sda3
ata1.00: Enabling discard_zeroes_data
sd 0:0:0:0: [sda] supports TCG Opal
sd 0:0:0:0: [sda] Attached SCSI disk
Comment 6 Solomon Peachy 2019-07-15 17:59:18 UTC
See also BZ #201693
Comment 7 Roman Mamedov 2019-07-15 18:38:08 UTC
> See also BZ #201693

Did you confirm that with my patch applied you have no problem with 860 EVO on the AMD SATA controller anymore? I thought that one is a hopeless matter and the issues extend to more than just TRIM, to regular (high-speed) reads/writes too. For that reason I moved mine to an ASMedia controller, and here it is clear-cut that only the queued TRIM fails, everything else works fine.
Comment 8 Solomon Peachy 2019-07-15 18:50:52 UTC
I'm building a Fedora kernel with the patch applied, and will get back to you later today.

But in the meantime I can confirm that by setting the drive's queue depth to 1, I have no timeout or corruption issues.  [[ echo 1 > /sys/block/sda/device/queue_depth ]]
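
The same workaround can be scripted. This is a minimal sketch, assuming the sysfs path from the command above; `set_queue_depth` and its `sysfs_root` parameter are hypothetical names introduced here (the latter only so the function can be tried against a scratch directory). Writing the real file requires root, and the setting does not survive a reboot.

```python
from pathlib import Path


def set_queue_depth(device: str, depth: int, sysfs_root: str = "/sys/block") -> int:
    """Write a queue depth for a block device and return the value read back."""
    # 32 is the NCQ depth the drive reports in the kernel log ("NCQ (depth 32)").
    if not 1 <= depth <= 32:
        raise ValueError("queue depth must be between 1 and 32")
    path = Path(sysfs_root) / device / "device" / "queue_depth"
    path.write_text(f"{depth}\n")
    # Read the value back to confirm what is now in effect.
    return int(path.read_text().strip())


# Example (as root): set_queue_depth("sda", 1)
```

A depth of 1 means only one command is outstanding at a time, which is why it sidesteps the queued-command failures at the cost of throughput.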
Comment 9 Solomon Peachy 2019-07-16 02:40:21 UTC
Finally got it built and booted up.. and it went kaboom.

Same kernel (Fedora 5.1.16-300) but with Roman's patch applied, yields much the same kernel log, with this addition:

ata1.00: disabling queued TRIM support

Unfortunately, about 30 seconds later, it went kaboom:

[   35.527148] ata1.00: exception Emask 0x10 SAct 0xfc000 SErr 0x0 action 0x6 frozen
[   35.527155] ata1.00: irq_stat 0x08000000, interface fatal error
[   35.527161] ata1.00: failed command: WRITE FPDMA QUEUED
[   35.527171] ata1.00: cmd 61/20:70:e0:a6:8b/00:00:25:00:00/40 tag 14 ncq dma 16384 out
                        res 40/00:70:e0:a6:8b/00:00:25:00:00/40 Emask 0x10 (ATA bus error)
[   35.527176] ata1.00: status: { DRDY }
[   35.527179] ata1.00: failed command: WRITE FPDMA QUEUED
[   35.527187] ata1.00: cmd 61/08:78:e0:ad:8b/00:00:25:00:00/40 tag 15 ncq dma 4096 out
                        res 40/00:70:e0:a6:8b/00:00:25:00:00/40 Emask 0x10 (ATA bus error)
[   35.527191] ata1.00: status: { DRDY }
[   35.527194] ata1.00: failed command: WRITE FPDMA QUEUED
[   35.527202] ata1.00: cmd 61/20:80:60:d0:91/00:00:25:00:00/40 tag 16 ncq dma 16384 out
                        res 40/00:70:e0:a6:8b/00:00:25:00:00/40 Emask 0x10 (ATA bus error)
[   35.527205] ata1.00: status: { DRDY }
[   35.527208] ata1.00: failed command: WRITE FPDMA QUEUED
[   35.527216] ata1.00: cmd 61/40:88:00:d1:91/00:00:25:00:00/40 tag 17 ncq dma 32768 out
                        res 40/00:70:e0:a6:8b/00:00:25:00:00/40 Emask 0x10 (ATA bus error)
[   35.527219] ata1.00: status: { DRDY }
[   35.527222] ata1.00: failed command: WRITE FPDMA QUEUED
[   35.527230] ata1.00: cmd 61/08:90:c0:51:92/00:00:25:00:00/40 tag 18 ncq dma 4096 out
                        res 40/00:70:e0:a6:8b/00:00:25:00:00/40 Emask 0x10 (ATA bus error)
[   35.527233] ata1.00: status: { DRDY }
[   35.527236] ata1.00: failed command: WRITE FPDMA QUEUED
[   35.527243] ata1.00: cmd 61/20:98:20:52:92/00:00:25:00:00/40 tag 19 ncq dma 16384 out
                        res 40/00:70:e0:a6:8b/00:00:25:00:00/40 Emask 0x10 (ATA bus error)
[   35.527246] ata1.00: status: { DRDY }
[   35.527252] ata1: hard resetting link
[   35.986132] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   35.986457] ata1.00: supports DRM functions and may not be fully accessible
[   35.987384] ata1.00: disabling queued TRIM support
[   35.989818] ata1.00: supports DRM functions and may not be fully accessible
[   35.990591] ata1.00: disabling queued TRIM support
[   35.992641] ata1.00: configured for UDMA/133
[   35.992670] ata1: EH complete
[   35.992941] ata1.00: Enabling discard_zeroes_data

So perhaps this SSD is simply incompatible with NCQ.  Sigh.
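
The `cmd` lines in the log above can be unpacked to see what each failed write targeted. The sketch below assumes the field order is command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device; that is my reading of the libata output, and it agrees with the logged values (for tag 14, 0x20 sectors x 512 bytes = 16384 and 0x70 >> 3 = 14). `decode_ncq_cmd` is a made-up helper name.

```python
import re

# ASSUMED field order of a libata "cmd" line (see the note above):
# cmd CMD/FEAT:NSECT:LBAL:LBAM:LBAH/HFEAT:HNSECT:HLBAL:HLBAM:HLBAH/DEV tag N
_HEX = r"([0-9a-f]{2})"
CMD_RE = re.compile(
    rf"cmd {_HEX}/{_HEX}:{_HEX}:{_HEX}:{_HEX}:{_HEX}/"
    rf"{_HEX}:{_HEX}:{_HEX}:{_HEX}:{_HEX}/{_HEX} tag (\d+)"
)


def decode_ncq_cmd(line: str) -> dict:
    """Decode a READ/WRITE FPDMA QUEUED taskfile from a libata error line."""
    m = CMD_RE.search(line)
    if m is None:
        raise ValueError("not a libata cmd line")
    (cmd, feat, nsect, lbal, lbam, lbah,
     hfeat, _hnsect, hlbal, hlbam, hlbah, _dev) = (int(x, 16) for x in m.groups()[:12])
    # 48-bit LBA: the "hob" (high order byte) fields carry bits 24..47.
    lba = (hlbah << 40) | (hlbam << 32) | (hlbal << 24) | (lbah << 16) | (lbam << 8) | lbal
    return {
        "command": cmd,
        "tag": int(m.group(13)),
        "tag_from_nsect": nsect >> 3,    # NCQ keeps the tag in bits 7:3 of nsect
        "sectors": (hfeat << 8) | feat,  # NCQ keeps the sector count in feature
        "lba": lba,
    }


if __name__ == "__main__":
    line = "ata1.00: cmd 61/20:70:e0:a6:8b/00:00:25:00:00/40 tag 14 ncq dma 16384 out"
    print(decode_ncq_cmd(line))
```

Decoded this way, the six failed commands are ordinary queued writes rather than TRIMs, which fits the "disabling queued TRIM support" line having had no effect on this failure.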
Comment 10 Roman Mamedov 2019-07-16 04:14:12 UTC
> So perhaps this SSD is simply incompatible with NCQ.

Not in general, only in combination with AMD SATA, as discussed in that other bugreport. And indeed there it's not only TRIM, but also regular writes. Any chance you could test on a different controller (ASMedia, Marvell, ...)?
Comment 11 Solomon Peachy 2019-07-16 12:03:54 UTC
It's frustrating that Samsung has demonstrated no interest in solving this problem properly.  It's not like AMD-based systems are _that_ rare.

Every system I have at home is AMD-based or has an incompatible form factor.  I'll see what I can dig up around the office.
Comment 12 Solomon Peachy 2019-07-25 23:50:42 UTC
I just swapped in an ASMedia-based SATA controller, and re-enabled NCQ (by using the default queue_depth).  The system is subjectively much, much faster and is (so far) error free.
