Bug 203475

Summary: Samsung 860 EVO queued TRIM issues
Product: IO/Storage Reporter: Roman Mamedov (rm+bko)
Component: Serial ATA Assignee: Tejun Heo (tj)
Status: RESOLVED CODE_FIX    
Severity: normal CC: agurenko, alexander, axboe, brice.simon, bugzilla, fweimer, johnsimcall, justin, jwrdegoede, kernelbugs, lnicola, mikko.rantalainen, ole, pizza, pjbrs, reg.kernelbugzilla.wad1w, rm+bko, siltal02, sitsofe, stathis, ushakov
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.14.114 Tree: Mainline
Regression: No
Attachments: dmesg of the errors occurring
disable queued TRIM for Samsung 860 series SSDs

Description Roman Mamedov 2019-05-01 22:00:54 UTC
Created attachment 282579
dmesg of the errors occurring

I have a Samsung SSD 860 EVO mSATA 500GB connected via an ASMedia ASM1062 Serial ATA Controller. It causes 20-30 second lockups on fstrim (which runs during bootup on my system), with messages such as:

[  332.792044] ata14.00: exception Emask 0x0 SAct 0x3fffe SErr 0x0 action 0x6 frozen
[  332.798271] ata14.00: failed command: SEND FPDMA QUEUED
[  332.804499] ata14.00: cmd 64/01:08:00:00:00/00:00:00:00:00/a0 tag 1 ncq dma 512 out
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  332.817145] ata14.00: status: { DRDY }

After disabling queued TRIM via the included patch, the issue disappears.
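
For readers decoding the exception line above: SAct is a bitmask of the NCQ tags that were outstanding when the port froze, with bit N standing for tag N. The following is a minimal sketch of that decoding, not part of the attached patch; the function name decode_sact is made up for illustration.

```python
from typing import List


def decode_sact(sact: int) -> List[int]:
    """Return the NCQ tags (0-31) whose bits are set in an SAct value."""
    return [tag for tag in range(32) if sact & (1 << tag)]


# SAct value taken from the exception line in the description.
print(decode_sact(0x3FFFE))  # tags 1 through 17
```

Here 0x3fffe means 17 commands (tags 1 to 17) were in flight, including the SEND FPDMA QUEUED on tag 1 that timed out.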
Comment 1 Roman Mamedov 2019-05-01 22:01:44 UTC
Created attachment 282581
disable queued TRIM for Samsung 860 series SSDs
Comment 2 Solomon Peachy 2019-07-13 12:29:27 UTC
This patch is still relevant for master.  Add my vote to merging this; I'd like to be able to re-enable NCQ on this SSD.
Comment 3 Jens Axboe 2019-07-14 16:57:43 UTC
This patch looks good - any chance you can email one with a proper commit log and signed-off-by etc to linux-ide@vger.kernel.org? And you can CC me, axboe@kernel.dk, and I'll get it queued up for the current kernel.
Comment 4 Roman Mamedov 2019-07-15 17:41:33 UTC
Jens, thanks. Sent to https://marc.info/?l=linux-ide&m=156312691006716&w=2; it is now being discussed there.

Solomon: what model do you have that also has a problem with TRIM, 860 EVO mSATA too? And which firmware revision?
Comment 5 Solomon Peachy 2019-07-15 17:54:25 UTC
I have the 1TB SATA (not mSATA!) version.

smartctl -a dump:

Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 860 EVO 1TB
Serial Number:    S3Z8NB0K717690X
LU WWN Device Id: 5 002538 e4054049c
Firmware Version: RVT01B6Q
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Jul 15 13:47:44 2019 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

kernel log snippet: (Untainted Fedora 5.1.16-300.fc30.x86_64 kernel)

ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata1.00: supports DRM functions and may not be fully accessible
ata1.00: ATA-11: Samsung SSD 860 EVO 1TB, RVT01B6Q, max UDMA/133
ata1.00: 1953525168 sectors, multi 1: LBA48 NCQ (depth 32), AA
ata1.00: supports DRM functions and may not be fully accessible
ata1.00: configured for UDMA/133
scsi 0:0:0:0: Direct-Access     ATA      Samsung SSD 860  1B6Q PQ: 0 ANSI: 5
sd 0:0:0:0: Attached scsi generic sg0 type 0
ata1.00: Enabling discard_zeroes_data
sd 0:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata1.00: Enabling discard_zeroes_data
sda: sda1 sda2 sda3
ata1.00: Enabling discard_zeroes_data
sd 0:0:0:0: [sda] supports TCG Opal
sd 0:0:0:0: [sda] Attached SCSI disk
Comment 6 Solomon Peachy 2019-07-15 17:59:18 UTC
See also BZ #201693
Comment 7 Roman Mamedov 2019-07-15 18:38:08 UTC
> See also BZ #201693

Did you confirm that with my patch applied you have no problem with 860 EVO on the AMD SATA controller anymore? I thought that one is a hopeless matter and the issues extend to more than just TRIM, to regular (high-speed) reads/writes too. For that reason I moved mine to an ASMedia controller, and here it is clear-cut that only the queued TRIM fails, everything else works fine.
Comment 8 Solomon Peachy 2019-07-15 18:50:52 UTC
I'm building a Fedora kernel with the patch applied, and will get back to you later today.

But in the meantime I can confirm that by setting the drive's queue depth to 1, I have no timeout or corruption issues.  [[ echo 1 > /sys/block/sda/device/queue_depth ]]
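
The queue_depth workaround above can also be scripted. This is a minimal sketch, assuming the sysfs layout shown in the command (/sys/block/<dev>/device/queue_depth); the function names and the sysfs_root parameter are hypothetical and exist only so the code can be tried against a scratch directory.

```python
from pathlib import Path


def queue_depth_path(device: str, sysfs_root: str = "/sys") -> Path:
    """Build the sysfs path that holds the device's queue depth."""
    return Path(sysfs_root) / "block" / device / "device" / "queue_depth"


def get_queue_depth(device: str, sysfs_root: str = "/sys") -> int:
    """Read the current queue depth (1 means commands are not queued)."""
    return int(queue_depth_path(device, sysfs_root).read_text().strip())


def set_queue_depth(device: str, depth: int, sysfs_root: str = "/sys") -> None:
    """Write a new queue depth; on a real system this needs root."""
    if depth < 1:
        raise ValueError("queue depth must be at least 1")
    queue_depth_path(device, sysfs_root).write_text(f"{depth}\n")
```

Calling set_queue_depth("sda", 1) as root would have the same effect as the echo command. The setting does not survive a reboot, so it has to be reapplied at boot.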
Comment 9 Solomon Peachy 2019-07-16 02:40:21 UTC
Finally got it built and booted up.. and it went kaboom.

Same kernel (Fedora 5.1.16-300) but with Roman's patch applied, yields much the same kernel log, with this addition:

ata1.00: disabling queued TRIM support

Unfortunately, about 30 seconds later, it went kaboom:

[   35.527148] ata1.00: exception Emask 0x10 SAct 0xfc000 SErr 0x0 action 0x6 frozen
[   35.527155] ata1.00: irq_stat 0x08000000, interface fatal error
[   35.527161] ata1.00: failed command: WRITE FPDMA QUEUED
[   35.527171] ata1.00: cmd 61/20:70:e0:a6:8b/00:00:25:00:00/40 tag 14 ncq dma 16384 out
                        res 40/00:70:e0:a6:8b/00:00:25:00:00/40 Emask 0x10 (ATA bus error)
[   35.527176] ata1.00: status: { DRDY }
[   35.527179] ata1.00: failed command: WRITE FPDMA QUEUED
[   35.527187] ata1.00: cmd 61/08:78:e0:ad:8b/00:00:25:00:00/40 tag 15 ncq dma 4096 out
                        res 40/00:70:e0:a6:8b/00:00:25:00:00/40 Emask 0x10 (ATA bus error)
[   35.527191] ata1.00: status: { DRDY }
[   35.527194] ata1.00: failed command: WRITE FPDMA QUEUED
[   35.527202] ata1.00: cmd 61/20:80:60:d0:91/00:00:25:00:00/40 tag 16 ncq dma 16384 out
                        res 40/00:70:e0:a6:8b/00:00:25:00:00/40 Emask 0x10 (ATA bus error)
[   35.527205] ata1.00: status: { DRDY }
[   35.527208] ata1.00: failed command: WRITE FPDMA QUEUED
[   35.527216] ata1.00: cmd 61/40:88:00:d1:91/00:00:25:00:00/40 tag 17 ncq dma 32768 out
                        res 40/00:70:e0:a6:8b/00:00:25:00:00/40 Emask 0x10 (ATA bus error)
[   35.527219] ata1.00: status: { DRDY }
[   35.527222] ata1.00: failed command: WRITE FPDMA QUEUED
[   35.527230] ata1.00: cmd 61/08:90:c0:51:92/00:00:25:00:00/40 tag 18 ncq dma 4096 out
                        res 40/00:70:e0:a6:8b/00:00:25:00:00/40 Emask 0x10 (ATA bus error)
[   35.527233] ata1.00: status: { DRDY }
[   35.527236] ata1.00: failed command: WRITE FPDMA QUEUED
[   35.527243] ata1.00: cmd 61/20:98:20:52:92/00:00:25:00:00/40 tag 19 ncq dma 16384 out
                        res 40/00:70:e0:a6:8b/00:00:25:00:00/40 Emask 0x10 (ATA bus error)
[   35.527246] ata1.00: status: { DRDY }
[   35.527252] ata1: hard resetting link
[   35.986132] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   35.986457] ata1.00: supports DRM functions and may not be fully accessible
[   35.987384] ata1.00: disabling queued TRIM support
[   35.989818] ata1.00: supports DRM functions and may not be fully accessible
[   35.990591] ata1.00: disabling queued TRIM support
[   35.992641] ata1.00: configured for UDMA/133
[   35.992670] ata1: EH complete
[   35.992941] ata1.00: Enabling discard_zeroes_data
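
The "cmd" lines in the log above can be decoded to check which command, tag and transfer size each failure refers to. The sketch below assumes the libata field order command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device and the NCQ convention that the count lives in the feature fields and the tag in bits 7:3 of nsect. Both agree with the "tag" and "dma" values printed in the logs here, but the helper itself is hypothetical.

```python
import re

# Opcodes that appear in this thread's logs.
NCQ_OPCODES = {
    0x60: "READ FPDMA QUEUED",
    0x61: "WRITE FPDMA QUEUED",
    0x64: "SEND FPDMA QUEUED",
}

FIVE_BYTES = r"((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
CMD_RE = re.compile(
    r"cmd ([0-9a-f]{2})/" + FIVE_BYTES + "/" + FIVE_BYTES + r"/([0-9a-f]{2})"
)


def decode_ncq_cmd(line: str) -> dict:
    """Decode the taskfile of an NCQ command from a libata 'cmd' log line."""
    match = CMD_RE.search(line)
    if match is None:
        raise ValueError("no 'cmd' taskfile found in line")
    opcode = int(match.group(1), 16)
    feature, nsect, lbal, lbam, lbah = [
        int(b, 16) for b in match.group(2).split(":")
    ]
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = [
        int(b, 16) for b in match.group(3).split(":")
    ]
    return {
        "command": NCQ_OPCODES.get(opcode, hex(opcode)),
        "tag": nsect >> 3,  # NCQ tag sits in bits 7:3
        "count": (hob_feature << 8) | feature,  # 512-byte blocks
        "lba": (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal,
    }


line = "ata1.00: cmd 61/20:70:e0:a6:8b/00:00:25:00:00/40 tag 14 ncq dma 16384 out"
print(decode_ncq_cmd(line))
```

For the first failed command above this gives tag 14 and a count of 32 blocks, which at 512 bytes per block matches the logged "dma 16384".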

So perhaps this SSD is simply incompatible with NCQ.  Sigh.
Comment 10 Roman Mamedov 2019-07-16 04:14:12 UTC
> So perhaps this SSD is simply incompatible with NCQ.

Not in general, only in combination with AMD SATA, as discussed in that other bug report. And indeed there it's not only TRIM, but also regular writes. Any chance you could test on a different controller (ASMedia, Marvell, ...)?
Comment 11 Solomon Peachy 2019-07-16 12:03:54 UTC
It's frustrating that Samsung has demonstrated no interest in solving this problem properly.  It's not like AMD-based systems are _that_ rare.

Every system I have at home is AMD-based or has an incompatible form factor.  I'll see what I can dig up around the office.
Comment 12 Solomon Peachy 2019-07-25 23:50:42 UTC
I just swapped in an ASMedia-based SATA controller, and re-enabled NCQ (by using the default queue_depth).  The system is subjectively much, much faster and is (so far) error free.
Comment 13 Simon Arlott 2020-07-04 09:15:00 UTC
I'm getting the same issue on 4.15..5.4.49 with an Intel ASRock Z170 Extreme4 SATA controller:

[389520.385306] ata2.00: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x6 frozen
[389520.385315] ata2.00: failed command: WRITE FPDMA QUEUED
[389520.385327] ata2.00: cmd 61/60:00:80:8e:20/00:00:98:00:00/40 tag 0 ncq dma 49152 out
                         res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
[389520.385332] ata2.00: status: { DRDY }
[389520.385336] ata2.00: failed command: WRITE FPDMA QUEUED
[389520.385345] ata2.00: cmd 61/20:08:00:8f:20/00:00:98:00:00/40 tag 1 ncq dma 16384 out
                         res 40/00:01:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[389520.385349] ata2.00: status: { DRDY }
[389520.385353] ata2.00: failed command: SEND FPDMA QUEUED
[389520.385364] ata2.00: cmd 64/01:10:00:00:00/00:00:00:00:00/a0 tag 2 ncq dma 512 out
                         res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
[389520.385370] ata2.00: status: { DRDY }
[389520.385374] ata2.00: failed command: WRITE FPDMA QUEUED
[389520.385382] ata2.00: cmd 61/e0:18:b8:ea:77/05:00:97:00:00/40 tag 3 ncq dma 770048 out
                         res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
[389520.385386] ata2.00: status: { DRDY }
[389520.385393] ata2: hard resetting link
[389520.699442] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[389520.701434] ata2.00: supports DRM functions and may not be fully accessible
[389520.704682] ata2.00: supports DRM functions and may not be fully accessible
[389520.707501] ata2.00: configured for UDMA/133
[389520.707511] ata2: EH complete
[389520.707742] ata2.00: Enabling discard_zeroes_data
[389551.093259] ata2.00: exception Emask 0x0 SAct 0x1fc0000 SErr 0x0 action 0x6 frozen
[389551.093261] ata2.00: failed command: WRITE FPDMA QUEUED
[389551.093264] ata2.00: cmd 61/d8:90:a8:bc:a0/09:00:97:00:00/40 tag 18 ncq dma 1290240 ou
                         res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
[389551.093265] ata2.00: status: { DRDY }
[389551.093266] ata2.00: failed command: WRITE FPDMA QUEUED
[389551.093267] ata2.00: cmd 61/e0:98:b8:ea:77/05:00:97:00:00/40 tag 19 ncq dma 770048 out
                         res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
[389551.093268] ata2.00: status: { DRDY }
[389551.093269] ata2.00: failed command: SEND FPDMA QUEUED
[389551.093271] ata2.00: cmd 64/01:a0:00:00:00/00:00:00:00:00/a0 tag 20 ncq dma 512 out
                         res 40/00:01:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[389551.093271] ata2.00: status: { DRDY }
[389551.093272] ata2.00: failed command: WRITE FPDMA QUEUED
[389551.093274] ata2.00: cmd 61/20:a8:00:8f:20/00:00:98:00:00/40 tag 21 ncq dma 16384 out
                         res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
[389551.093274] ata2.00: status: { DRDY }
[389551.093275] ata2.00: failed command: WRITE FPDMA QUEUED
[389551.093295] ata2.00: cmd 61/60:b0:80:8e:20/00:00:98:00:00/40 tag 22 ncq dma 49152 out
                         res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
[389551.093296] ata2.00: status: { DRDY }
[389551.093296] ata2.00: failed command: WRITE FPDMA QUEUED
[389551.093298] ata2.00: cmd 61/b0:b8:80:c6:a0/09:00:97:00:00/40 tag 23 ncq dma 1269760 ou
                         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[389551.093299] ata2.00: status: { DRDY }
[389551.093300] ata2.00: failed command: WRITE FPDMA QUEUED
[389551.093301] ata2.00: cmd 61/10:c0:f0:21:22/00:00:96:00:00/40 tag 24 ncq dma 8192 out
                         res 40/00:01:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[389551.093302] ata2.00: status: { DRDY }
[389551.093303] ata2: hard resetting link
[389551.407389] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[389551.409259] ata2.00: supports DRM functions and may not be fully accessible
[389551.412712] ata2.00: supports DRM functions and may not be fully accessible
[389551.415759] ata2.00: configured for UDMA/133
[389551.415773] ata2: EH complete
[389581.797243] ata2.00: exception Emask 0x0 SAct 0x3f80 SErr 0x0 action 0x6 frozen
[389581.797246] ata2.00: failed command: WRITE FPDMA QUEUED
[389581.797248] ata2.00: cmd 61/10:38:f0:21:22/00:00:96:00:00/40 tag 7 ncq dma 8192 out
                         res 40/00:01:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[389581.797249] ata2.00: status: { DRDY }
[389581.797250] ata2.00: failed command: WRITE FPDMA QUEUED
[389581.797252] ata2.00: cmd 61/b0:40:80:c6:a0/09:00:97:00:00/40 tag 8 ncq dma 1269760 ou
                         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[389581.797253] ata2.00: status: { DRDY }
[389581.797253] ata2.00: failed command: WRITE FPDMA QUEUED
[389581.797255] ata2.00: cmd 61/60:48:80:8e:20/00:00:98:00:00/40 tag 9 ncq dma 49152 out
                         res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[389581.797256] ata2.00: status: { DRDY }
[389581.797257] ata2.00: failed command: WRITE FPDMA QUEUED
[389581.797258] ata2.00: cmd 61/20:50:00:8f:20/00:00:98:00:00/40 tag 10 ncq dma 16384 out
                         res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[389581.797259] ata2.00: status: { DRDY }
[389581.797260] ata2.00: failed command: SEND FPDMA QUEUED
[389581.797262] ata2.00: cmd 64/01:58:00:00:00/00:00:00:00:00/a0 tag 11 ncq dma 512 out
                         res 40/00:01:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[389581.797262] ata2.00: status: { DRDY }
[389581.797263] ata2.00: failed command: WRITE FPDMA QUEUED
[389581.797265] ata2.00: cmd 61/e0:60:b8:ea:77/05:00:97:00:00/40 tag 12 ncq dma 770048 out
                         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[389581.797265] ata2.00: status: { DRDY }
[389581.797266] ata2.00: failed command: WRITE FPDMA QUEUED
[389581.797268] ata2.00: cmd 61/d8:68:a8:bc:a0/09:00:97:00:00/40 tag 13 ncq dma 1290240 ou
                         res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
[389581.797268] ata2.00: status: { DRDY }
[389581.797270] ata2: hard resetting link
[389582.111393] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[389582.113289] ata2.00: supports DRM functions and may not be fully accessible
[389582.116517] ata2.00: supports DRM functions and may not be fully accessible
[389582.119421] ata2.00: configured for UDMA/133
[389582.119438] ata2: EH complete
[389582.119715] ata2.00: Enabling discard_zeroes_data
[389582.120788] ata2.00: Enabling discard_zeroes_data
[389612.533285] ata2.00: NCQ disabled due to excessive errors
[389612.533292] ata2.00: exception Emask 0x0 SAct 0x7c00000f SErr 0x0 action 0x6 frozen
[389612.533301] ata2.00: failed command: WRITE FPDMA QUEUED
[389612.533313] ata2.00: cmd 61/b0:00:80:c6:a0/09:00:97:00:00/40 tag 0 ncq dma 1269760 ou
                         res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
[389612.533317] ata2.00: status: { DRDY }
[389612.533322] ata2.00: failed command: WRITE FPDMA QUEUED
[389612.533331] ata2.00: cmd 61/10:08:f0:21:22/00:00:96:00:00/40 tag 1 ncq dma 8192 out
                         res 40/00:01:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[389612.533335] ata2.00: status: { DRDY }
[389612.533339] ata2.00: failed command: READ FPDMA QUEUED
[389612.533347] ata2.00: cmd 60/18:10:c0:d3:00/00:00:00:00:00/40 tag 2 ncq dma 12288 in
                         res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
[389612.533351] ata2.00: status: { DRDY }
[389612.533354] ata2.00: failed command: READ FPDMA QUEUED
[389612.533363] ata2.00: cmd 60/20:18:80:b9:e7/00:00:58:00:00/40 tag 3 ncq dma 16384 in
                         res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
[389612.533366] ata2.00: status: { DRDY }
[389612.533371] ata2.00: failed command: WRITE FPDMA QUEUED
[389612.533380] ata2.00: cmd 61/d8:d0:a8:bc:a0/09:00:97:00:00/40 tag 26 ncq dma 1290240 ou
                         res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
[389612.533383] ata2.00: status: { DRDY }
[389612.533387] ata2.00: failed command: WRITE FPDMA QUEUED
[389612.533396] ata2.00: cmd 61/e0:d8:b8:ea:77/05:00:97:00:00/40 tag 27 ncq dma 770048 out
                         res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
[389612.533399] ata2.00: status: { DRDY }
[389612.533402] ata2.00: failed command: SEND FPDMA QUEUED
[389612.533410] ata2.00: cmd 64/01:e0:00:00:00/00:00:00:00:00/a0 tag 28 ncq dma 512 out
                         res 40/00:01:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[389612.533414] ata2.00: status: { DRDY }
[389612.533417] ata2.00: failed command: WRITE FPDMA QUEUED
[389612.533426] ata2.00: cmd 61/20:e8:00:8f:20/00:00:98:00:00/40 tag 29 ncq dma 16384 out
                         res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
[389612.533429] ata2.00: status: { DRDY }
[389612.533433] ata2.00: failed command: WRITE FPDMA QUEUED
[389612.533441] ata2.00: cmd 61/60:f0:80:8e:20/00:00:98:00:00/40 tag 30 ncq dma 49152 out
                         res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
[389612.533445] ata2.00: status: { DRDY }
[389612.533451] ata2: hard resetting link
[389612.851755] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[389612.853797] ata2.00: supports DRM functions and may not be fully accessible
[389612.857594] ata2.00: supports DRM functions and may not be fully accessible
[389612.860819] ata2.00: configured for UDMA/133
[389612.860879] ata2: EH complete
[389612.865362] ata2.00: Enabling discard_zeroes_data

This is during an fstrim, and it doesn't happen on the Samsung 850 EVO.

Device Model:     Samsung SSD 850 EVO 2TB
Firmware Version: EMT02B6Q

Device Model:     Samsung SSD 860 EVO 2TB
Firmware Version: RVT04B6Q

00:17.0 SATA controller: Intel Corporation Q170/Q150/B150/H170/H110/Z170/CM236 Chipset SATA Controller [AHCI Mode] (rev 31)
Comment 14 stathis 2020-12-01 20:58:37 UTC
Same issue, different controller:

System: FUJITSU PRIMERGY TX1310 M1/D3219-A1, BIOS V4.6.5.4 R1.11.0 for D3219-A1x 09/25/2018

Kernel: Linux server 5.4.72-gentoo-x86_64 #1 SMP Sat Oct 17 05:17:10 EET 2020 x86_64 Intel(R) Xeon(R) CPU E3-1226 v3 @ 3.30GHz GenuineIntel GNU/Linux

Controller: 00:1f.2 SATA controller: Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] (rev 04)

Device Model:     Samsung SSD 860 EVO 500GB
Firmware Version: RVT04B6Q

[395138.151251] ata6.00: exception Emask 0x10 SAct 0x40003fff SErr 0x400100 action 0x6 frozen
[395138.152011] ata6.00: irq_stat 0x08000008, interface fatal error
[395138.152755] ata6: SError: { UnrecovData Handshk }
[395138.153470] ata6.00: failed command: WRITE FPDMA QUEUED
[395138.154222] ata6.00: cmd 61/08:00:78:38:80/00:00:0e:00:00/40 tag 0 ncq dma 4096 out
                         res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error)
[395138.155801] ata6.00: status: { DRDY }
[395138.156579] ata6.00: failed command: WRITE FPDMA QUEUED
[395138.156581] ata6.00: cmd 61/08:08:18:26:81/00:00:0e:00:00/40 tag 1 ncq dma 4096 out
                         res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error)
[395138.156581] ata6.00: status: { DRDY }
[395138.156582] ata6.00: failed command: WRITE FPDMA QUEUED
[395138.156593] ata6.00: cmd 61/08:10:50:59:81/00:00:0e:00:00/40 tag 2 ncq dma 4096 out
                         res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error)
[395138.156594] ata6.00: status: { DRDY }
[395138.156594] ata6.00: failed command: WRITE FPDMA QUEUED
[395138.156596] ata6.00: cmd 61/08:18:90:6a:81/00:00:0e:00:00/40 tag 3 ncq dma 4096 out
                         res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error)
[395138.156596] ata6.00: status: { DRDY }
[395138.156597] ata6.00: failed command: WRITE FPDMA QUEUED
[395138.156598] ata6.00: cmd 61/08:20:58:b2:81/00:00:0e:00:00/40 tag 4 ncq dma 4096 out
                         res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error)
[395138.156599] ata6.00: status: { DRDY }
[395138.156599] ata6.00: failed command: WRITE FPDMA QUEUED
[395138.156601] ata6.00: cmd 61/08:28:b0:26:c0/00:00:0e:00:00/40 tag 5 ncq dma 4096 out
                         res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error)
[395138.156602] ata6.00: status: { DRDY }
[395138.171913] ata6.00: failed command: WRITE FPDMA QUEUED
[395138.171915] ata6.00: cmd 61/10:30:a0:27:c0/00:00:0e:00:00/40 tag 6 ncq dma 8192 out
                         res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error)
[395138.171916] ata6.00: status: { DRDY }
[395138.171916] ata6.00: failed command: WRITE FPDMA QUEUED
[395138.171919] ata6.00: cmd 61/08:38:50:2a:c0/00:00:0e:00:00/40 tag 7 ncq dma 4096 out
                         res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error)
[395138.176836] ata6.00: status: { DRDY }
[395138.176837] ata6.00: failed command: WRITE FPDMA QUEUED
[395138.176839] ata6.00: cmd 61/08:40:e8:49:c8/00:00:0e:00:00/40 tag 8 ncq dma 4096 out
                         res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error)
[395138.176839] ata6.00: status: { DRDY }
[395138.176840] ata6.00: failed command: WRITE FPDMA QUEUED
[395138.176841] ata6.00: cmd 61/08:48:58:08:80/00:00:0f:00:00/40 tag 9 ncq dma 4096 out
                         res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error)
[395138.176842] ata6.00: status: { DRDY }
[395138.183063] ata6.00: failed command: WRITE FPDMA QUEUED
[395138.183065] ata6.00: cmd 61/08:50:08:08:c0/00:00:13:00:00/40 tag 10 ncq dma 4096 out
                         res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error)
[395138.183075] ata6.00: status: { DRDY }
[395138.183076] ata6.00: failed command: WRITE FPDMA QUEUED
[395138.183077] ata6.00: cmd 61/08:58:80:08:c0/00:00:13:00:00/40 tag 11 ncq dma 4096 out
                         res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error)
[395138.183078] ata6.00: status: { DRDY }
[395138.189053] ata6.00: failed command: WRITE FPDMA QUEUED
[395138.189055] ata6.00: cmd 61/08:60:a8:12:c0/00:00:13:00:00/40 tag 12 ncq dma 4096 out
                         res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error)
[395138.189055] ata6.00: status: { DRDY }
[395138.189065] ata6.00: failed command: WRITE FPDMA QUEUED
[395138.189066] ata6.00: cmd 61/08:68:f8:12:c0/00:00:13:00:00/40 tag 13 ncq dma 4096 out
                         res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error)
[395138.189067] ata6.00: status: { DRDY }
[395138.189068] ata6.00: failed command: WRITE FPDMA QUEUED
[395138.189070] ata6.00: cmd 61/08:f0:90:2d:80/00:00:0e:00:00/40 tag 30 ncq dma 4096 out
                         res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error)
[395138.189071] ata6.00: status: { DRDY }
[395138.199031] ata6: hard resetting link
[395138.511140] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[395138.517064] ata6.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
[395138.519256] ata6.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
[395138.521402] ata6.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
[395138.523837] ata6.00: supports DRM functions and may not be fully accessible
[395138.529475] ata6.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
[395138.530403] ata6.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
[395138.531236] ata6.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
[395138.532417] ata6.00: supports DRM functions and may not be fully accessible
[395138.536106] ata6.00: configured for UDMA/133
[395138.537034] ata6: EH complete
[395138.537973] ata6.00: Enabling discard_zeroes_data


What's the recommended way to go? Disable NCQ?
Comment 15 Roman Mamedov 2020-12-05 09:12:16 UTC
> What's the recommended way to go? Disable NCQ?

I believe if you see "WRITE FPDMA QUEUED" messages, the issue is with NCQ in general, and yes, you should try disabling it for the device. But if you see only "SEND FPDMA QUEUED" as in the initial post, then you might get away with disabling just the queued TRIM.
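
That rule of thumb can be written down as a small log filter. This is only a sketch of the heuristic as stated in this comment, not a diagnostic tool; the function name and the returned strings are made up for illustration.

```python
def suggest_workaround(dmesg_lines):
    """Apply the rule of thumb above to libata 'failed command' lines."""
    marker = "failed command: "
    failed = set()
    for line in dmesg_lines:
        if marker in line:
            failed.add(line.split(marker, 1)[1].strip())

    if not failed:
        return "no failed commands found"
    if failed == {"SEND FPDMA QUEUED"}:
        # Only the queued TRIM command failed.
        return "try disabling queued TRIM only"
    if any(cmd.endswith("FPDMA QUEUED") for cmd in failed):
        # Regular queued reads or writes failed as well.
        return "try disabling NCQ for the device"
    return "not an NCQ failure; this rule of thumb does not apply"


log = ["[  332.798271] ata14.00: failed command: SEND FPDMA QUEUED"]
print(suggest_workaround(log))
```

Note that when a queued TRIM times out, every other command outstanding on the port is reported as failed too, so a mixed log taken during fstrim does not by itself prove that plain writes are affected.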

It is surprising to see that it even fails on Intel's controllers as well, all of this was mostly discussed with regard to AMD SATA.
Comment 16 Simon Arlott 2020-12-05 10:35:43 UTC
(In reply to Roman Mamedov from comment #15)
> It is surprising to see that it even fails on Intel's controllers as well,
> all of this was mostly discussed with regard to AMD SATA.

It's not surprising when you realise that queued trim used to be disabled on the Samsung 8* until Samsung's marketing department made an unsubstantiated claim that "the improved queued trim enhances Linux compatibility":

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca6bfcb2f6d9deab3924bf901e73622a94900473
Comment 17 Hans de Goede 2020-12-05 15:30:23 UTC
(In reply to Simon Arlott from comment #16)
> It's not surprising when you realise that queued trim used to be disabled on
> the Samsung 8* until Samsung's marketing department made an unsubstantiated
> claim that "the improved queued trim enhances Linux compatibility":
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca6bfcb2f6d9deab3924bf901e73622a94900473

So it sounds like we just need to revert that patch, or at least re-enable the ATA_HORKAGE_NO_NCQ_TRIM quirk for the 860 series?
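
To illustrate what re-enabling the quirk for the 860 series would mean: libata selects quirks by matching the drive's model string against a table of glob patterns. The sketch below mimics that lookup in Python. The table contents and the pattern "Samsung SSD 860*" are assumptions for illustration only, not a copy of the kernel's table.

```python
from fnmatch import fnmatchcase

# Hypothetical quirk table: (model glob, set of quirk names).
QUIRKS = [
    ("Samsung SSD 860*", {"NO_NCQ_TRIM"}),
]


def quirks_for(model: str) -> set:
    """Collect the quirks of every table entry whose glob matches the model."""
    flags = set()
    for pattern, entry_flags in QUIRKS:
        if fnmatchcase(model, pattern):
            flags |= entry_flags
    return flags


# Model strings as reported in this thread.
for model in ("Samsung SSD 860 EVO 1TB", "Samsung SSD 860 EVO 500GB"):
    print(model, quirks_for(model))
```

With such an entry, every 860 model reported in this thread would fall back to non-queued TRIM while keeping NCQ for reads and writes.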
Comment 18 Sitsofe Wheeler 2020-12-08 09:14:08 UTC
Hans: also see https://bugzilla.kernel.org/show_bug.cgi?id=201693 . My personal experience is detailed over on https://marc.info/?t=154644279600003&r=1&w=2 and happens on plain reads. I've been booting with the kernel param libata.force=2.00:noncq to disable NCQ on the second ATA port where the Samsung 860 is plugged in which seems to stabilize things.
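
The ID in that parameter corresponds to the "ataN.MM" prefix that libata prints in dmesg (ata2.00 becomes 2.00). A minimal sketch of deriving the parameter from a log prefix follows; the helper name is hypothetical, and only the noncq value used in this thread is exercised.

```python
import re


def libata_force_param(dmesg_id: str, value: str) -> str:
    """Turn a dmesg device ID such as 'ata2.00' into a libata.force option."""
    match = re.fullmatch(r"ata(\d+)\.(\d+)", dmesg_id)
    if match is None:
        raise ValueError("expected an ID of the form ataN.MM")
    port, device = match.groups()
    return f"libata.force={port}.{device}:{value}"


print(libata_force_param("ata2.00", "noncq"))  # libata.force=2.00:noncq
```

The kernel's parameter documentation also lists a noncqtrim value, which appears to be the queued-TRIM-only counterpart; check kernel-parameters.txt for your kernel version before relying on it.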
Comment 19 stathis 2020-12-08 21:14:45 UTC

(In reply to Sitsofe Wheeler from comment #18)
> Hans: also see https://bugzilla.kernel.org/show_bug.cgi?id=201693 . My
> personal experience is detailed over on
> https://marc.info/?t=154644279600003&r=1&w=2 and happens on plain reads.
> I've been booting with the kernel param libata.force=2.00:noncq to disable
> NCQ on the second ATA port where the Samsung 860 is plugged in which seems
> to stabilize things.


I disabled NCQ for the drive using the equivalent kernel parameter and have not seen these messages again (although they have only appeared once recently - after a few months of the SSD's operation). 

For what it's worth, performance of 4K random reads has seen a tenfold decline (from 380MB/s down to 38MB/s) without NCQ, which I guess is to be expected. Performance on other tests, with NCQ vs without NCQ, didn't seem to be affected much.
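
Expressed as I/O operations per second, the figures above work out as follows. This sketch assumes that 4K means 4096 bytes and MB means 10^6 bytes, neither of which is stated in the comment.

```python
def iops(throughput_mb_s: float, block_bytes: int = 4096) -> float:
    """Convert throughput in MB/s (10^6 bytes) to operations per second."""
    return throughput_mb_s * 1_000_000 / block_bytes


with_ncq = iops(380)
without_ncq = iops(38)
print(round(with_ncq), round(without_ncq), round(with_ncq / without_ncq))
```

That is roughly 93,000 reads per second with NCQ against roughly 9,300 without it.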
Comment 20 Andriy 2021-02-12 13:05:29 UTC
Intel controller, same issue.

Model: Samsung SSD 860 EVO 1TB
Firmware Revision:  RVT04B6Q

Machine: Dell Precision M4700
BIOS: A19, 11/30/2018

SATA controller: Intel Corporation 7 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)

Kernel: Linux  5.10.0-1-amd64 #1 SMP Debian 5.10.5-1 (2021-01-09) x86_64 GNU/Linux

Linux version 5.10.0-1-amd64 (debian-kernel@lists.debian.org) (gcc-10 (Debian 10.2.1-5) 10.2.1 20210108, GNU ld (GNU Binutils for Debian) 2.35.1) #1 SMP Debian 5.10.5-1 (2021-01-09)

ata1.00: exception Emask 0x10 SAct 0x7f80 SErr 0x440100 action 0x6 frozen
ata1.00: irq_stat 0x08000000, interface fatal error
ata1: SError: { UnrecovData CommWake Handshk }
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/00:38:20:16:02/0a:00:65:00:00/40 tag 7 ncq dma 1310720 ou
         res 40/00:40:20:20:02/00:00:65:00:00/40 Emask 0x10 (ATA bus error)
ata1.00: status: { DRDY }
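
The SErr value and the "SError: { ... }" line describe the same register. A minimal sketch of the mapping follows, limited to the three bits that occur in this thread (0x400100 and 0x440100); the bit positions are consistent with the decoded lines in these logs and, to my knowledge, with libata's definitions. All other bits are left undecoded.

```python
# Only the SError bits that appear in this thread's logs.
SERR_BITS = {
    8: "UnrecovData",
    18: "CommWake",
    22: "Handshk",
}


def decode_serr(serr: int) -> list:
    """List the names of the set SError bits, lowest bit first."""
    names = []
    for bit in range(32):
        if serr & (1 << bit):
            names.append(SERR_BITS.get(bit, f"bit{bit}"))
    return names


print(decode_serr(0x440100))  # ['UnrecovData', 'CommWake', 'Handshk']
```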

Disabling NCQ and setting link_power_management_policy to max_performance reduces the frequency of errors.    

echo 1 > /sys/block/sda/device/queue_depth
echo max_performance > /sys/class/scsi_host/host*/link_power_management_policy

I had some days without errors, but they still occur occasionally, mostly after updating or installing packages.
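
To check which link power management policy is currently in effect on each host before or after applying the commands above, something like the following could be used. It is a read-only sketch that assumes the sysfs path from the echo command; the function name and the sysfs_root parameter are hypothetical.

```python
from pathlib import Path


def read_lpm_policies(sysfs_root: str = "/sys") -> dict:
    """Map each SCSI host that exposes the attribute to its current policy."""
    hosts = Path(sysfs_root) / "class" / "scsi_host"
    policies = {}
    for attr in sorted(hosts.glob("host*/link_power_management_policy")):
        policies[attr.parent.name] = attr.read_text().strip()
    return policies
```

Hosts that do not expose link_power_management_policy are simply skipped. Like queue_depth, the value written with echo is lost on reboot.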
Comment 21 PJBrs 2021-02-19 15:52:04 UTC
I'm encountering this bug as well, on a ThinkPad T450s with a Samsung SSD 860 EVO 1TB (firmware RVT04B6Q), running Slackware 14.2 with the kernel upgraded to 5.10.15. I'm adding my info particularly because of my non-AMD SATA controller:

00:1f.2 SATA controller: Intel Corporation Wildcat Point-LP SATA Controller [AHCI Mode] (rev 03) (prog-if 01 [AHCI 1.0])
        Subsystem: Lenovo Wildcat Point-LP SATA Controller [AHCI Mode]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin B routed to IRQ 44
        Region 0: I/O ports at 30a8 [size=8]
        Region 1: I/O ports at 30b4 [size=4]
        Region 2: I/O ports at 30a0 [size=8]
        Region 3: I/O ports at 30b0 [size=4]
        Region 4: I/O ports at 3060 [size=32]
        Region 5: Memory at f123c000 (32-bit, non-prefetchable) [size=2K]
        Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
                Address: fee00298  Data: 0000
        Capabilities: [70] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [a8] SATA HBA v1.0 BAR4 Offset=00000004
        Kernel driver in use: ahci

I have two ext4 partitions mounted with discards on, one of which is encrypted. I see ATA errors just about every time I reboot my machine, and was able to easily provoke them manually by issuing fstrim on my root and home partitions.

I was (apparently) able to work around this bug both by issuing echo 1 > /sys/block/sda/device/queue_depth and by reverting https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca6bfcb2f6d9deab3924bf901e73622a94900473

Please let me know if there's anything else I can do to help. I personally was quite put off by the sudden onset of all these ata errors after I thought I had prolonged my laptop's life with a nice and big SSD. I'm happy to work around the issue, but it would be better to be able to use vanilla sources without failures.
Comment 22 Brice Simon 2021-03-08 17:13:12 UTC
Hi All, 

Same issue here w/ Intel Comet Lake SATA Controller (on a set of Intel NUCs).

By the look of it the kernel also tries to reduce the link speed from 6Gbps to 3Gbps but no joy.

[Mon Mar  8 17:04:24 2021] ata3.00: exception Emask 0x10 SAct 0x1000000 SErr 0x400100 action 0x6 frozen
[Mon Mar  8 17:04:24 2021] ata3.00: irq_stat 0x08000000, interface fatal error
[Mon Mar  8 17:04:24 2021] ata3: SError: { UnrecovData Handshk }
[Mon Mar  8 17:04:24 2021] ata3.00: failed command: WRITE FPDMA QUEUED
[Mon Mar  8 17:04:24 2021] ata3.00: cmd 61/00:c0:30:66:1c/02:00:1d:00:00/40 tag 24 ncq dma 262144 out
[Mon Mar  8 17:04:24 2021] ata3.00: status: { DRDY }
[Mon Mar  8 17:04:24 2021] ata3: hard resetting link
[Mon Mar  8 17:04:24 2021] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Mon Mar  8 17:04:24 2021] ata3.00: supports DRM functions and may not be fully accessible
[Mon Mar  8 17:04:24 2021] ata3.00: supports DRM functions and may not be fully accessible
[Mon Mar  8 17:04:24 2021] ata3.00: configured for UDMA/133
[Mon Mar  8 17:04:24 2021] ata3: EH complete
[Mon Mar  8 17:04:24 2021] ata3.00: Enabling discard_zeroes_data
[Mon Mar  8 17:04:24 2021] ata3.00: exception Emask 0x10 SAct 0x100000 SErr 0x400100 action 0x6 frozen
[Mon Mar  8 17:04:24 2021] ata3.00: irq_stat 0x08000000, interface fatal error
[Mon Mar  8 17:04:24 2021] ata3: SError: { UnrecovData Handshk }
[Mon Mar  8 17:04:24 2021] ata3.00: failed command: WRITE FPDMA QUEUED
[Mon Mar  8 17:04:24 2021] ata3.00: cmd 61/00:a0:30:14:1d/02:00:1d:00:00/40 tag 20 ncq dma 262144 out
[...]
[Mon Mar  8 17:04:25 2021] ata3.00: status: { DRDY }
[Mon Mar  8 17:04:25 2021] ata3: hard resetting link
[Mon Mar  8 17:04:25 2021] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[Mon Mar  8 17:04:25 2021] ata3.00: supports DRM functions and may not be fully accessible
[Mon Mar  8 17:04:25 2021] ata3.00: supports DRM functions and may not be fully accessible
[Mon Mar  8 17:04:25 2021] ata3.00: configured for UDMA/133
[Mon Mar  8 17:04:25 2021] ata3: EH complete
[Mon Mar  8 17:04:25 2021] ata3.00: Enabling discard_zeroes_data

So far I have:

- Updated NUC Bios to FNCML357
- Updated Samsung Disks FW to RVT04B6Q
- Updated Ubuntu 20.04 w/ Kernel 5.4.0-66-generic
- Tried on two different NUC servers
- Tried w/ two different Samsung drives (1TB and 500G)
- Tried different power settings on the NUC (and attempted to disable m2 and SDHC slots)
- Tried a few fresh installs of Ubuntu 20.04.2 as well


I have also raised a case w/ Samsung just in case. Intel have not helped so far.

But the error messages keep on going. 

Only disabling NCQ seems to have some level of impact. Aside from crippling performance of course.
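
The link speed drop can be read directly from the SStatus/SControl values in the log above. A small sketch, assuming the kernel prints both registers in hexadecimal and that the field layout follows the SATA register definition (DET in bits 3:0, SPD in bits 7:4, IPM in bits 11:8); the function names are illustrative:

```python
# Speed codes used in the SPD field (bits 7:4) of SStatus and SControl.
SPEEDS = {1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}


def decode_sstatus(hex_text):
    """Split an SStatus value such as '133' into its DET/SPD/IPM fields."""
    value = int(hex_text, 16)
    spd = (value >> 4) & 0xF
    return {
        "det": value & 0xF,          # device detection state
        "spd": spd,                  # negotiated speed code
        "ipm": (value >> 8) & 0xF,   # interface power management state
        "speed": SPEEDS.get(spd, "no link"),
    }


def scontrol_speed_limit(hex_text):
    """Return the speed limit encoded in SControl, or None if unrestricted."""
    spd = (int(hex_text, 16) >> 4) & 0xF
    return SPEEDS.get(spd)
```

With the values from the log, `SStatus 133 SControl 300` decodes to a 6.0 Gbps link with no limit, while `SStatus 123 SControl 320` decodes to a 3.0 Gbps link that error handling has capped at 3.0 Gbps.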
Comment 23 anonymous 2021-05-13 22:21:31 UTC
Hi all! Same issue here with AMD SB950 Controller (on motherboard Gigabyte GA-970A-UD3 rev. 1.0/1.1 with latest BIOS) and SSD Samsung 860 EVO 1TB (firmware RVT04B6Q).

Kernel version 5.11.17.

Workaround "libata.force=1.00:noncq" in cmdline works for me.
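
For reference, libata.force takes a comma-separated list of "[ID:]VAL" entries, where ID is PORT[.DEVICE], so "1.00:noncq" limits the workaround to device 00 on port 1. A sketch of how such a string is put together; the helper names `force_entry` and `libata_force` are made up for illustration:

```python
def force_entry(value, port=None, device=None):
    """Build one "[ID:]VAL" entry; without a port it applies to all devices."""
    if port is None:
        if device is not None:
            raise ValueError("a device number needs a port number")
        return value
    ident = str(port) if device is None else f"{port}.{device:02d}"
    return f"{ident}:{value}"


def libata_force(*entries):
    """Join entries into a libata.force= kernel command line parameter."""
    return "libata.force=" + ",".join(entries)


print(libata_force(force_entry("noncq", port=1, device=0)))
print(libata_force(force_entry("noncq"), force_entry("noncqtrim")))
```

Restricting the entry to one port keeps NCQ enabled for other drives in the same machine.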
Comment 24 Mike Kazantsev 2021-05-22 03:26:28 UTC
At some point I've tried to swap that Samsung SSD for SanDisk Ultra 3D (SDSSDH3) SSD, but even while setting up LVM partitions on it already got same errors, so wrote it off as this AMD SATA controller being buggy with any SSD.

Just now got Marvell 88SE9230 PCIe controller card and thought to try same 860 EVO with that - less than a minute after flipping ncq depth from 1 (set by workaround-script on boot) to 32 (as per "ata7.00: 976773168 sectors, multi 1: LBA48 NCQ (depth 32), AA"), got same litany of errors as with AMD controller before on f2fs trim requests (if I'm reading dmesg correctly), with just pretty much idle desktop.

There was a mention of potential power issues, but given that f2fs seems to say "F2FS-fs (dm-33): Issue discard(411486, 411486, 1) failed, ret: -5" specifically, it seems like a weird coincidence, and I have only a year-or-two-old Thermaltake TR2 650W PSU here, in this otherwise ~10yo machine that barely draws ~200W iirc, so idk if it's likely.

Link to dmesg output with all libata-related stuff from kernel init and full log of ssd/f2fs errors that happened with Linux 5.10.35 (built from kernel.org tarball) + Marvell 88SE9230 PCIe card + 860 EVO ~1min after switching ncq depth from 1 to 32:

  https://e.var.nz/2021-05-22.samsung-860-evo-marvell-88SE9230-trim-issue-dmesg.5kbcfhyvg03qm.log

Though it looks same-ish as was already posted above by other folks.

Might still try SanDisk SSD with Marvell 88SE9230 controller again, but pretty sure it's just some kind of linux issue at this point, unfortunately, given different ssd and sata controllers involved.
Comment 25 Mike Kazantsev 2021-05-22 03:28:10 UTC
In a failed copy-paste omitted the first paragraph for the message above, sorry:

Have what looks like same issue with AMD SB850 (M4A87TD EVO motherboard) and Samsung 860 EVO 500G for about a year with 5.4/5.10 kernels.
Tried patching linux quirks table to enable milder workarounds, but same as comments above suggest, only disabling NCQ seems to help.

...
Comment 26 Mikko Rantalainen 2021-06-11 18:58:20 UTC
Here's another system (running Ubuntu SMP PREEMPT kernel based on vanilla 5.4.86) with Samsung 860 EVO where I hit random freezes; when the system automatically recovers after a long delay, the journalctl output contains the following errors:

ata1.00: exception Emask 0x0 SAct 0xfffe3f00 SErr 0x0 action 0x6 frozen
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/08:40:c0:d4:a0/00:00:37:00:00/40 tag 8 ncq dma 4096 out
         res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/08:48:d0:d4:a0/00:00:37:00:00/40 tag 9 ncq dma 4096 out
         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/18:50:e0:d4:a0/00:00:37:00:00/40 tag 10 ncq dma 12288 out
         res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/10:58:88:d5:a0/00:00:37:00:00/40 tag 11 ncq dma 8192 out
         res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/10:60:b8:d5:a0/00:00:37:00:00/40 tag 12 ncq dma 8192 out
         res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/08:68:d8:d5:a0/00:00:37:00:00/40 tag 13 ncq dma 4096 out
         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/08:88:b8:d2:a0/00:00:37:00:00/40 tag 17 ncq dma 4096 out
         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
...
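
The "cmd" lines above can be decoded to see which blocks each timed-out write targeted. A sketch, assuming the libata print order command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device, the NCQ convention that the sector count lives in the feature bytes and the tag in bits 7:3 of nsect, and 512-byte logical sectors; `decode_ncq_cmd` is an illustrative name:

```python
import re

CMD_RE = re.compile(
    r"cmd (?P<cmd>\w\w)/(?P<feat>\w\w):(?P<nsect>\w\w):(?P<lbal>\w\w):"
    r"(?P<lbam>\w\w):(?P<lbah>\w\w)/(?P<hfeat>\w\w):(?P<hnsect>\w\w):"
    r"(?P<hlbal>\w\w):(?P<hlbam>\w\w):(?P<hlbah>\w\w)/(?P<dev>\w\w)"
)
OPCODES = {0x60: "READ FPDMA QUEUED", 0x61: "WRITE FPDMA QUEUED"}


def decode_ncq_cmd(line):
    """Decode tag, LBA and length from a READ/WRITE FPDMA QUEUED cmd line."""
    match = CMD_RE.search(line)
    if match is None:
        raise ValueError("no taskfile found in line")
    f = {name: int(text, 16) for name, text in match.groupdict().items()}
    if f["cmd"] not in OPCODES:
        raise ValueError("not a READ/WRITE FPDMA QUEUED command")
    # A count of 0 means 65536 sectors for these commands.
    sectors = ((f["hfeat"] << 8) | f["feat"]) or 65536
    lba = (f["hlbah"] << 40 | f["hlbam"] << 32 | f["hlbal"] << 24
           | f["lbah"] << 16 | f["lbam"] << 8 | f["lbal"])
    return {
        "op": OPCODES[f["cmd"]],
        "tag": f["nsect"] >> 3,
        "lba": lba,
        "bytes": sectors * 512,  # assumes 512-byte logical sectors
    }
```

For the first failed command above this gives tag 8 and 4096 bytes, matching the "tag 8 ncq dma 4096 out" text the kernel printed on the same line.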


Device Model:     Samsung SSD 860 EVO 500GB
Serial Number:    S3Z2NB0K836858L
LU WWN Device Id: 5 002538 e40709e42
Firmware Version: RVT01B6Q

I know that USB devices support turning on and off quirks of different devices.

Is there a runtime option to disable queued TRIM for a given SATA device?

The system is running intel chipset:

description: SATA controller
product: 7 Series/C210 Series Chipset Family 6-port SATA Controller [AHCI mode] [8086:1E02]
vendor: Intel Corporation [8086]
physical id: 1f.2
bus info: pci@0000:00:1f.2
version: 04
width: 32 bits
clock: 66MHz
capabilities: storage msi pm ahci_1.0 bus_master cap_list
configuration: driver=ahci latency=0

The motherboard is P8H77-M PRO in case it makes a difference.

It seems pretty safe to assume that the 860 EVO series is just broken and cannot cope with all commands combined with NCQ, no matter what Samsung marketing department says. I would rather not disable NCQ support because it would cause a major performance hit.
Comment 27 Logman 2021-08-01 09:45:07 UTC
Same issue. 

OS: Ubuntu 20.04.2 LTS

CPU: Ryzen 5 5600X,
Motherboard: ROG STRIX B550-E GAMING
Memory: Corsair DDR4 CMK32GX4M2A2400C14 32GB (2x16GB) 

Bios: Version: 2006
        Release Date: 03/19/2021

SSD: 
Device Model:     Samsung SSD 860 EVO 1TB
Firmware Version: RVT02B6Q
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)

AMD SATA controller: SATA controller: Advanced Micro Devices, Inc. [AMD] Device 43eb

Kernel: 5.4.0-80-generic

---
Jul 25 02:59:19 purkki kernel: [3651117.306707] ahci 0000:01:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0xd5c54000 flags=0x0000]
Jul 25 02:59:19 purkki kernel: [3651117.599827] ata2.00: exception Emask 0x10 SAct 0x1fe00000 SErr 0x0 action 0x6 frozen
Jul 25 02:59:19 purkki kernel: [3651117.599831] ata2.00: irq_stat 0x08000000, interface fatal error
Jul 25 02:59:19 purkki kernel: [3651117.599835] ata2.00: failed command: WRITE FPDMA QUEUED
Jul 25 02:59:19 purkki kernel: [3651117.599839] ata2.00: cmd 61/20:a8:60:df:2d/00:00:3b:00:00/40 tag 21 ncq dma 16384 out
Jul 25 02:59:19 purkki kernel: [3651117.599839]          res 40/00:e0:c0:0a:14/00:00:3b:00:00/40 Emask 0x10 (ATA bus error)
Jul 25 02:59:19 purkki kernel: [3651117.599842] ata2.00: status: { DRDY }
Jul 25 02:59:19 purkki kernel: [3651117.599844] ata2.00: failed command: WRITE FPDMA QUEUED
Jul 25 02:59:19 purkki kernel: [3651117.599847] ata2.00: cmd 61/10:b0:90:ab:2e/00:00:3b:00:00/40 tag 22 ncq dma 8192 out
Jul 25 02:59:19 purkki kernel: [3651117.599847]          res 40/00:e0:c0:0a:14/00:00:3b:00:00/40 Emask 0x10 (ATA bus error)
Jul 25 02:59:19 purkki kernel: [3651117.599851] ata2.00: status: { DRDY }
Jul 25 02:59:19 purkki kernel: [3651117.599852] ata2.00: failed command: WRITE FPDMA QUEUED
Jul 25 02:59:19 purkki kernel: [3651117.599855] ata2.00: cmd 61/10:b8:40:e4:2e/00:00:3b:00:00/40 tag 23 ncq dma 8192 out
Jul 25 02:59:19 purkki kernel: [3651117.599855]          res 40/00:e0:c0:0a:14/00:00:3b:00:00/40 Emask 0x10 (ATA bus error)
Jul 25 02:59:19 purkki kernel: [3651117.599858] ata2.00: status: { DRDY }
Jul 25 02:59:19 purkki kernel: [3651117.599859] ata2.00: failed command: WRITE FPDMA QUEUED
Jul 25 02:59:19 purkki kernel: [3651117.599862] ata2.00: cmd 61/10:c0:90:e7:2e/00:00:3b:00:00/40 tag 24 ncq dma 8192 out
Jul 25 02:59:19 purkki kernel: [3651117.599862]          res 40/00:e0:c0:0a:14/00:00:3b:00:00/40 Emask 0x10 (ATA bus error)
Jul 25 02:59:19 purkki kernel: [3651117.599865] ata2.00: status: { DRDY }
Jul 25 02:59:19 purkki kernel: [3651117.599867] ata2.00: failed command: WRITE FPDMA QUEUED
Jul 25 02:59:19 purkki kernel: [3651117.599870] ata2.00: cmd 61/10:c8:80:ed:2e/00:00:3b:00:00/40 tag 25 ncq dma 8192 out
Jul 25 02:59:19 purkki kernel: [3651117.599870]          res 40/00:e0:c0:0a:14/00:00:3b:00:00/40 Emask 0x10 (ATA bus error)
Jul 25 02:59:19 purkki kernel: [3651117.599873] ata2.00: status: { DRDY }
Jul 25 02:59:19 purkki kernel: [3651117.599874] ata2.00: failed command: WRITE FPDMA QUEUED
Jul 25 02:59:19 purkki kernel: [3651117.599877] ata2.00: cmd 61/10:d0:b0:ed:2e/00:00:3b:00:00/40 tag 26 ncq dma 8192 out
Jul 25 02:59:19 purkki kernel: [3651117.599877]          res 40/00:e0:c0:0a:14/00:00:3b:00:00/40 Emask 0x10 (ATA bus error)
Jul 25 02:59:19 purkki kernel: [3651117.599880] ata2.00: status: { DRDY }
Jul 25 02:59:19 purkki kernel: [3651117.599881] ata2.00: failed command: WRITE FPDMA QUEUED
Jul 25 02:59:19 purkki kernel: [3651117.599884] ata2.00: cmd 61/20:d8:50:ff:2e/00:00:3b:00:00/40 tag 27 ncq dma 16384 out
Jul 25 02:59:19 purkki kernel: [3651117.599884]          res 40/00:e0:c0:0a:14/00:00:3b:00:00/40 Emask 0x10 (ATA bus error)
Jul 25 02:59:19 purkki kernel: [3651117.599887] ata2.00: status: { DRDY }
Jul 25 02:59:19 purkki kernel: [3651117.599889] ata2.00: failed command: WRITE FPDMA QUEUED
Jul 25 02:59:19 purkki kernel: [3651117.599891] ata2.00: cmd 61/30:e0:c0:0a:14/00:00:3b:00:00/40 tag 28 ncq dma 24576 out
Jul 25 02:59:19 purkki kernel: [3651117.599891]          res 40/00:e0:c0:0a:14/00:00:3b:00:00/40 Emask 0x10 (ATA bus error)
Jul 25 02:59:19 purkki kernel: [3651117.599894] ata2.00: status: { DRDY }
Jul 25 02:59:19 purkki kernel: [3651117.599897] ata2: hard resetting link
Jul 25 02:59:20 purkki kernel: [3651118.075832] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jul 25 02:59:20 purkki kernel: [3651118.076267] ata2.00: supports DRM functions and may not be fully accessible
Jul 25 02:59:25 purkki kernel: [3651123.267858] ata2.00: qc timeout (cmd 0x47)
Jul 25 02:59:25 purkki kernel: [3651123.267867] ata2.00: READ LOG DMA EXT failed, trying PIO
Jul 25 02:59:25 purkki kernel: [3651123.267868] ata2.00: NCQ Send/Recv Log not supported
Jul 25 02:59:25 purkki kernel: [3651123.267870] ata2.00: failed to get Identify Device Data, Emask 0x40
Jul 25 02:59:25 purkki kernel: [3651123.267871] ata2.00: ATA Identify Device Log not supported
Jul 25 02:59:25 purkki kernel: [3651123.267872] ata2.00: Security Log not supported
Jul 25 02:59:25 purkki kernel: [3651123.267877] ata2.00: failed to set xfermode (err_mask=0x40)
Jul 25 02:59:25 purkki kernel: [3651123.267884] ata2: hard resetting link

Uptime: 23:16:52 up 43 days,  2:29,  1 user,  load average: 0.00, 0.01, 0.00


I did add: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash libata.force=noncq"
Comment 28 Logman 2021-08-01 09:47:58 UTC
(In reply to Logman from comment #27)


> I did add: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash libata.force=noncq"


So far looks ok.
Uptime: 12:46:27 up 2 days, 16:50, no errors.
Comment 29 Justin Clift 2021-08-22 06:09:15 UTC
As a data point, the Samsung 870 EVO appears to have either the same problem, or something closely related.

SSD info:

=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 870 EVO 1TB
Serial Number:    S5Y2NF0R128941E
LU WWN Device Id: 5 002538 f4112d21c
Firmware Version: SVT01B6Q
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Aug 22 15:50:09 2021 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Example kernel messages (from kernel 5.13.11):

*****************************
Aug 22 14:47:39 s3 kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x90000 action 0x6 frozen
Aug 22 14:47:39 s3 kernel: ata1: SError: { PHYRdyChg 10B8B }
Aug 22 14:47:39 s3 kernel: ata1.00: failed command: WRITE DMA EXT
Aug 22 14:47:39 s3 kernel: ata1.00: cmd 35/00:08:b8:29:00/00:00:29:00:00/e0 tag 7 dma 4096 out
                                    res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 22 14:47:39 s3 kernel: ata1.00: status: { DRDY }
Aug 22 14:47:39 s3 kernel: ata1: hard resetting link
Aug 22 14:47:40 s3 kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Aug 22 14:47:40 s3 kernel: ata1.00: supports DRM functions and may not be fully accessible
Aug 22 14:47:40 s3 kernel: ata1.00: supports DRM functions and may not be fully accessible
Aug 22 14:47:40 s3 kernel: ata1.00: configured for UDMA/133
Aug 22 14:47:40 s3 kernel: ata1.00: device reported invalid CHS sector 0
Aug 22 14:47:40 s3 kernel: sd 0:0:0:0: [sda] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=31s
Aug 22 14:47:40 s3 kernel: sd 0:0:0:0: [sda] tag#7 Sense Key : Illegal Request [current]
Aug 22 14:47:40 s3 kernel: sd 0:0:0:0: [sda] tag#7 Add. Sense: Unaligned write command
Aug 22 14:47:40 s3 kernel: sd 0:0:0:0: [sda] tag#7 CDB: Write(10) 2a 00 29 00 29 b8 00 00 08 00
Aug 22 14:47:40 s3 kernel: blk_update_request: I/O error, dev sda, sector 687876536 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
*****************************

Using "libata.force=noncq" isn't solving the problem though.  The above error is with NCQ disabled. :(

Looking at the kernel source here as a guide:

https://github.com/torvalds/linux/blob/9ff50bf2f2ff5fab01cac26d8eed21a89308e6ef/drivers/ata/libata-core.c#L3951-L3952

... it seems like two potential kernel parameters for libata.force would be needed. Both "noncq" and "noncqtrim".  I'll try that shortly, and see if it helps.

While looking at that kernel source, it seems like some other Samsung SSD's have trouble with their link state power management:

https://github.com/torvalds/linux/blob/9ff50bf2f2ff5fab01cac26d8eed21a89308e6ef/drivers/ata/libata-core.c#L3930-L3934

Not sure if there's a kernel parameter to disable that, but manually setting should work.

eg:

# echo 'max_performance' > /sys/class/scsi_host/host0/link_power_management_policy
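
A sketch of applying that setting to every SATA host instead of just host0. It assumes the sysfs layout shown in the echo command; `set_lpm_policy` and its `sysfs_root` parameter are illustrative names, and the write needs root on a real system.

```python
from pathlib import Path


def set_lpm_policy(policy, sysfs_root="/sys/class/scsi_host"):
    """Write `policy` to every host*/link_power_management_policy file.

    Returns a dict mapping host name to the policy read back afterwards.
    Hosts without the attribute (non-SATA hosts) are skipped by the glob.
    """
    applied = {}
    pattern = "host*/link_power_management_policy"
    for attr in sorted(Path(sysfs_root).glob(pattern)):
        attr.write_text(policy + "\n")
        applied[attr.parent.name] = attr.read_text().strip()
    return applied
```

Like the echo, this only lasts until the next reboot.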
Comment 30 Hans de Goede 2021-08-23 07:47:07 UTC
(In reply to Justin Clift from comment #29)
> Not sure if there's a kernel parameter to disable that, but manually setting
> should work.
> 
> eg:
> 
> # echo 'max_performance' >
> /sys/class/scsi_host/host0/link_power_management_policy

You can set the default to max_performance by setting the following on the kernel cmdline: "ahci.mobile_lpm_policy=0"

Note that as the name implies, the kernel only sets the policy to a different value by default on mobile (laptop) chipsets; on desktop chipsets the default is max_performance.

Have you tried setting the link_power_management_policy with your 870 EVO? (and does it help?).

Also I wonder if you could try replacing the SATA cable with a new one? Errors like this can also happen due to a bad SATA cable.
Comment 31 Hans de Goede 2021-08-23 07:53:21 UTC
(In reply to Justin Clift from comment #29)

Also I wonder about your PSU? Is it perhaps old? Or are you perhaps using a  converter to go from a molex power-connector to a sata power-connector? Those might be flaky too.

The reason why I'm asking this is that disabling NCQ drastically lowers the performance of the SSD which in turn drastically lowers its power-consumption. So there could be a power-supply issue (voltage-drop or spikes under load) which is causing issues with the power supplied to the SATA PHYs leading to these kinds of transfer errors. Such an issue would only show under heavy load; and disabling NCQ makes it impossible to cause a heavy load on the SSD, since now it will only process 1 request at a time.
Comment 32 Hans de Goede 2021-08-23 07:55:23 UTC
(In reply to Justin Clift from comment #29)

p.s.

What is the chipset-vendor of the SATA controller to which your 870 EVO is connected? All the troubles with the 860 EVO seem to be limited to AMD/Asmedia/Marvell SATA controllers. The Intel SATA controllers seem to work fine.
Comment 33 Simon Arlott 2021-08-23 08:06:10 UTC
(In reply to Hans de Goede from comment #32)
> (In reply to Justin Clift from comment #29)
> 
> What is the chipset-vendor of the SATA controller to which your 870 EVO is
> connected? All the troubles with the 860 EVO seem to be limited to
> AMD/Asmedia/Marvell SATA controllers. The Intel SATA controllers seem to
> work fine.

No they don't, please stop repeating this.

This has been a problem on the 840, the 850, the 860 and now the 870.

The 840 and 850 are still prevented from using queued TRIM by the kernel.

The 860 and 870 are not, based solely on marketing information from Samsung claiming that the problem is fixed.

The problem is not the SATA controllers (Intel is affected too), SATA cables (I've swapped mine) or the power supplies but the SSDs.
Comment 34 Hans de Goede 2021-08-23 09:12:47 UTC
(In reply to Simon Arlott from comment #33)
> > What is the chipset-vendor of the SATA controller to which your 870 EVO is
> > connected? All the troubles with the 860 EVO seem to be limited to
> > AMD/Asmedia/Marvell SATA controllers. The Intel SATA controllers seem to
> > work fine.
> 
> No they don't, please stop repeating this.
> 
> This has been a problem on the 840, the 850, the 860 and now the 870.
> 
> The 840 and 850 are still prevented from using queued TRIM by the kernel.
> 
> The 860 and 870 are not, based solely on marketing information from Samsung
> claiming that the problem is fixed.
> 
> The problem is not the SATA controllers (Intel is affected too), SATA cables
> (I've swapped mine) or the power supplies but the SSDs.

So after completely re-reading / analyzing both this bug as well as bug 201693 with a fresh pair of eyes (since the last time I did this was a long time ago) I agree. After careful reading / analysis it seems that there really are 2 different bugs here impacting both the 860 EVO and the 870 EVO:

1. Queued Trim commands are causing issues on Intel + ASmedia + Marvell controllers

2. Things are seriously broken on AMD controllers and only completely disabling NCQ altogether helps there.


I will submit a kernel patch (with a Fixes tag so that it gets backported to stable series) for 1. right away; and I've asked a colleague to start working on a new ATA horkage flag which disables NCQ on AMD SATA controllers only, so that we can add that flag (together with the ATA_HORKAGE_NO_NCQ_TRIM flag which my patch adds) to the 860 EVO and the 870 EVO to also resolve 2.

###

Note this still does not explain Justin's problem though, since Justin already has NCQ completely disabled. Justin, are you sure you actually have "libata.force=noncq" on the kernel commandline? You can check this with "cat /proc/cmdline", just adding it to /etc/default/grub file is not enough, you also need to generate grub2.cfg for changes to take effect.

Also are you perhaps using an out of tree kernel-driver?
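
The /proc/cmdline check can be sketched as below. The function names are illustrative, and the sketch assumes that if libata.force appears more than once, the last occurrence is the one that counts.

```python
def libata_force_options(cmdline):
    """Return the libata.force values found in a kernel command line.

    Assumption: when the parameter is repeated, only the last
    occurrence is reported.
    """
    options = []
    for token in cmdline.split():
        if token.startswith("libata.force="):
            values = token.split("=", 1)[1]
            options = [value for value in values.split(",") if value]
    return options


def running_libata_force(path="/proc/cmdline"):
    """Read the running kernel's command line and extract libata.force."""
    with open(path) as handle:
        return libata_force_options(handle.read())
```

An empty list means the option never reached the running kernel, which is what happens when /etc/default/grub was edited but the grub configuration was not regenerated.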
Comment 35 Hans de Goede 2021-08-23 09:13:28 UTC
p.s.

Sorry that it took me so long (much too long) to realize that we are dealing with 2 distinct bugs here.
Comment 37 Justin Clift 2021-08-23 15:04:41 UTC
Thanks Hans.  I'll check what I can now, though in-depth testing will have to be on the weekend. :)

Data that's likely relevant and useful:

* The computer this is happening on is an older model Acer Nitro 5 laptop.  Ryzen 7 2700U cpu, and RX 560X graphics.

Bought it before knowing that Ryzen (at least this one) needs a bunch of kernel command line options to even think about being stable. :/

So, my kernel command line currently is:

*******
$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.13.11-lp153.2.g8c13a2d-default root=UUID=8bde2e75-7e73-43a7-8a82-03d5b3b81afc splash=silent resume=/dev/mapper/cr_ata-KINGSTON_RBUSNS8180DS3128GJ_50026B768291D106-part3 splash=silent resume=/dev/mapper/cr_ata-KINGSTON_RBUSNS8180DS3128GJ_50026B768291D106-part3 quiet rcu_nocbs=0-7 pcie_aspm=off pcie_port_pm=off pci=noacpi ivrs_ioapic[4]=00.14.0 ivrs_ioapic[5]=00.00.2 idle=nomwait intel_idle.max_cstate=0 processor.max_cstate=1 iommu=pt libata.force=noncq,noncqtrim mitigations=auto
*******

The laptop has been turned on all day today, but not actually doing anything as I use a work provided macbook during the day.

Before starting work today I manually set (using the echo approach) the link state power management to max_performance for the Samsung 870 EVO:

*******
$ cat /sys/class/scsi_host/host0/link_power_management_policy
max_performance
*******

*No* errors have shown up in the meantime, which is a very, very good sign that "something" in one of those changes helped:

*******
sudo journalctl -k | grep ata1
Aug 23 07:54:49 s3 kernel: ata1: SATA max UDMA/133 abar m2048@0xff700000 port 0xff700100 irq 22
Aug 23 07:54:49 s3 kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Aug 23 07:54:49 s3 kernel: ata1.00: FORCE: horkage modified (noncq)
Aug 23 07:54:49 s3 kernel: ata1.00: FORCE: horkage modified (noncqtrim)
Aug 23 07:54:49 s3 kernel: ata1.00: supports DRM functions and may not be fully accessible
Aug 23 07:54:49 s3 kernel: ata1.00: ATA-11: Samsung SSD 870 EVO 1TB, SVT01B6Q, max UDMA/133
Aug 23 07:54:49 s3 kernel: ata1.00: 1953525168 sectors, multi 1: LBA48 NCQ (not used)
Aug 23 07:54:49 s3 kernel: ata1.00: supports DRM functions and may not be fully accessible
Aug 23 07:54:49 s3 kernel: ata1.00: configured for UDMA/133
Aug 23 07:54:49 s3 kernel: ata1.00: Enabling discard_zeroes_data
Aug 23 07:54:49 s3 kernel: ata1.00: Enabling discard_zeroes_data
Aug 23 07:54:49 s3 kernel: ata1.00: Enabling discard_zeroes_data
*******

Note the 'FORCE: horkage modified (noncqtrim)', so it's pretty clear that was picked up by the kernel. :)
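
Whether NCQ itself ended up in use can be read from the "sectors, multi" line in the same output. A sketch of extracting it, assuming the line format shown in this thread and 512-byte logical sectors; `ncq_state` is an illustrative name:

```python
import re

NCQ_RE = re.compile(
    r"(ata\d+\.\d+): (\d+) sectors, .*?NCQ \((not used|depth (\d+)[^)]*)\)"
)


def ncq_state(line):
    """Return device, capacity and NCQ depth from a libata probe line.

    A depth of 0 means the kernel reported "NCQ (not used)".
    Returns None if the line does not match.
    """
    match = NCQ_RE.search(line)
    if match is None:
        return None
    device, sectors, _state, depth = match.groups()
    return {
        "device": device,
        "bytes": int(sectors) * 512,  # assumes 512-byte logical sectors
        "ncq_depth": int(depth) if depth else 0,
    }
```

For the ata1.00 line above this reports depth 0 and 1,000,204,886,016 bytes, which agrees with the capacity smartctl printed for this drive in comment 29.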

That being said, it *is* also possible there's a cabling issue at play here too.  Unlike other people's laptops, this one has the cover over the ssd/hdd area removed and I'm running a sata + power extender cable to the front for easy access.  eg it lets me swap physical drives (when powered off) easily.

That being said, the previous drive (a crap Crucial BX500, ~500GB) didn't show an error with this cable.  And a Samsung 860 Evo 500GB with Win10 on it runs fine (even yesterday) off the same cable.

But still, it could be possible the Samsung 870 Evo is a bit more sensitive to something about that cable than the others.

---

For better testing (this weekend), I can:

1. Move the Samsung 870 Evo back into the laptop housing directly (without extender cable)
2. Try the kernel without any libata.force options.  eg test if the problem is really the cable
3. If problems occur, then try with libata.force=noncq
4. Ditto, but with just libata.force=noncqtrim
5. Try with libata.force=noncq,noncqtrim

If problems are still showing up, then try setting the link state power management to max_performance (not the default on this laptop).

---

Meanwhile, that kernel patch seems like it'll help people anyway. :)
Comment 38 Justin Clift 2021-08-23 15:08:58 UTC
Probably worth mentioning that this ncq problem occurs within a few minutes of starting the laptop (prior to settings change earlier today)

So it's pretty easy to notice when a change for the better has happened. :)
Comment 39 Laurentiu Nicola 2021-08-23 15:15:08 UTC
To add another data point, I can confirm that my 860 EVO works fine on my ASMedia controller with the exception of NCQ trim.
Comment 40 Justin Clift 2021-08-23 15:16:53 UTC
Laurentiu, does that mean you run it with (say) your kernel having `libata.force=noncqtrim`, or an equivalent approach?
Comment 41 Justin Clift 2021-08-23 15:17:48 UTC
Gah.  Sorry, just realised that should have been "Nicola, ..." instead. ;)
Comment 42 Laurentiu Nicola 2021-08-23 15:20:21 UTC
I used to disable NCQ periodically, trim, then turn it back on, because I didn't realize that noncqtrim exists. I haven't had any NCQ issues in a couple of years (since getting that drive).

(Don't worry about the name, you got it right the first time.)
Comment 43 Justin Clift 2021-08-23 16:55:11 UTC
Cool. :)

In the meantime, I've rebooted with `libata.force=noncqtrim` and left the link power management at its default setting:

*******
$ cat /sys/class/scsi_host/host0/link_power_management_policy
med_power_with_dipm

$ cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-5.13.11-lp153.2.g8c13a2d-default root=UUID=8bde2e75-7e73-43a7-8a82-03d5b3b81afc splash=silent resume=/dev/mapper/cr_ata-KINGSTON_RBUSNS8180DS3128GJ_50026B768291D106-part3 splash=silent resume=/dev/mapper/cr_ata-KINGSTON_RBUSNS8180DS3128GJ_50026B768291D106-part3 quiet rcu_nocbs=0-7 pcie_aspm=off pcie_port_pm=off pci=noacpi ivrs_ioapic[4]=00.14.0 ivrs_ioapic[5]=00.00.2 idle=nomwait intel_idle.max_cstate=0 processor.max_cstate=1 iommu=pt libata.force=noncqtrim mitigations=auto
*******

So far (only about 15 mins) things are working ok, with no weirdness from the ssd.  I'll update this issue either way tomorrow, after it's been running a bunch of hours.
Comment 44 Justin Clift 2021-08-24 00:48:20 UTC
As a data point, the `libata.force=noncqtrim` option by itself wasn't the complete solution for this system.

This morning, in order to try and trigger any weirdness I copied ~60GB of random files from one folder of the drive to another.

A few minutes later, ata errors started showing up:

*******
Aug 24 10:03:37 s3 kernel: ata1.00: exception Emask 0x0 SAct 0x70 SErr 0xd0000 action 0x6 frozen
Aug 24 10:03:37 s3 kernel: ata1: SError: { PHYRdyChg CommWake 10B8B }
Aug 24 10:03:37 s3 kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Aug 24 10:03:37 s3 kernel: ata1.00: cmd 61/08:20:68:b8:77/00:00:2d:00:00/40 tag 4 ncq dma 4096 out
Aug 24 10:03:37 s3 kernel: ata1.00: status: { DRDY }
Aug 24 10:03:37 s3 kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Aug 24 10:03:37 s3 kernel: ata1.00: cmd 61/08:28:90:bc:01/00:00:08:00:00/40 tag 5 ncq dma 4096 out
Aug 24 10:03:37 s3 kernel: ata1.00: status: { DRDY }
Aug 24 10:03:37 s3 kernel: ata1.00: failed command: READ FPDMA QUEUED
Aug 24 10:03:37 s3 kernel: ata1.00: cmd 60/08:30:90:af:72/00:00:2e:00:00/40 tag 6 ncq dma 4096 in
Aug 24 10:03:37 s3 kernel: ata1.00: status: { DRDY }
Aug 24 10:03:37 s3 kernel: ata1: hard resetting link
Aug 24 10:03:38 s3 kernel: ata1: SATA link down (SStatus 0 SControl 300)
Aug 24 10:03:38 s3 kernel: ata1: hard resetting link
Aug 24 10:03:38 s3 kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Aug 24 10:03:38 s3 kernel: ata1.00: supports DRM functions and may not be fully accessible
Aug 24 10:03:38 s3 kernel: ata1.00: disabling queued TRIM support
Aug 24 10:03:38 s3 kernel: ata1.00: supports DRM functions and may not be fully accessible
Aug 24 10:03:38 s3 kernel: ata1.00: disabling queued TRIM support
Aug 24 10:03:38 s3 kernel: ata1.00: configured for UDMA/133
Aug 24 10:03:38 s3 kernel: ata1: EH complete
Aug 24 10:03:38 s3 kernel: ata1.00: Enabling discard_zeroes_data
*******
(with many more repeats of above READ/WRITE FPDMA QUEUED errors)
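
The SErr value in the exception line is a bitmask, and the names in the "SError: { ... }" line are its set bits. A sketch covering only the bits that show up in this thread; the bit positions are taken from the SATA SError register layout and any other bit is reported by number rather than guessed at:

```python
# Only the SError bits seen in the logs of this bug are named here.
SERR_BITS = {
    8: "UnrecovData",
    16: "PHYRdyChg",
    18: "CommWake",
    19: "10B8B",
    22: "Handshk",
}


def decode_serr(serr_hex):
    """Return the names of the bits set in an SErr value, lowest bit first."""
    value = int(serr_hex, 16)
    return [
        SERR_BITS.get(bit, f"bit{bit}")
        for bit in range(32)
        if (value >> bit) & 1
    ]


print(decode_serr("0xd0000"))
```

SErr 0xd0000 decodes to PHYRdyChg, CommWake and 10B8B, matching the SError line above; all three describe events at the PHY/link level rather than in the command layer.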

Rebooting with the inclusion of `ahci.mobile_lpm_policy=0` on the kernel command line unexpectedly *didn't* leave the link state power management at `max_performance`:

*******
$ cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-5.13.11-lp153.2.g8c13a2d-default root=UUID=8bde2e75-7e73-43a7-8a82-03d5b3b81afc splash=silent resume=/dev/mapper/cr_ata-KINGSTON_RBUSNS8180DS3128GJ_50026B768291D106-part3 splash=silent resume=/dev/mapper/cr_ata-KINGSTON_RBUSNS8180DS3128GJ_50026B768291D106-part3 quiet rcu_nocbs=0-7 pcie_aspm=off pcie_port_pm=off pci=noacpi ivrs_ioapic[4]=00.14.0 ivrs_ioapic[5]=00.00.2 idle=nomwait intel_idle.max_cstate=0 processor.max_cstate=1 iommu=pt libata.force=noncqtrim ahci.mobile_lpm_policy=0 mitigations=auto

$ cat /sys/class/scsi_host/host0/link_power_management_policy
med_power_with_dipm
*******

I've manually changed it (using echo approach):

*******
# echo 'max_performance' > /sys/class/scsi_host/host0/link_power_management_policy

$ cat /sys/class/scsi_host/host0/link_power_management_policy
max_performance
*******

With that link power management change in place, copying around ~140GB of files on the drive has worked without error. :)

So far, that combination is looking decent.

After work tonight I'll probably try to figure out a udev rule as a workaround for setting the link power management (as per https://bugzilla.kernel.org/show_bug.cgi?id=201693#c13).  Assuming this combination is now functional, that's great.

I'll still try to break it on the weekend as above though, to figure out whether the extension cable (etc) is really causing issues and help diagnose things. :)
Comment 45 Hans de Goede 2021-08-24 08:31:20 UTC
Justin, so I just checked and ahci.mobile_lpm_policy does not do anything on your AMD based laptop, because none of the AMD chipsets are marked as being "mobile" in the PCI-device-id list in drivers/ata/ahci.c .

Are you perhaps using TLP or some other script to "improve" / tweak the power-management settings ? Then that script is likely setting the link_power_management_policy ...

As for testing with vs without libata.force=noncqtrim, notice that to actually test if this makes a difference you need to make sure that there are actually trim commands being sent to the disk. So you would need to cause heavy file-io (including erasing large files) and then run "fstrim" at the same time.
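
A rough sketch of such a test: one thread writes and deletes files while fstrim runs against the same filesystem. The function names, file sizes and round counts are arbitrary assumptions, `fstrim -v` needs root, and whether this is enough load to trigger the errors on a given system is untested.

```python
import os
import subprocess
import tempfile
import threading


def churn(directory, rounds, size):
    """Write, fsync and delete files so that freed blocks can be trimmed."""
    for _ in range(rounds):
        fd, path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as handle:
            handle.write(os.urandom(size))
            handle.flush()
            os.fsync(handle.fileno())
        os.remove(path)


def trim_under_load(mountpoint, scratch_dir, trim_cmd=None,
                    rounds=8, size=64 * 1024 * 1024):
    """Run the trim command while a background thread churns files."""
    cmd = trim_cmd or ["fstrim", "-v", mountpoint]
    worker = threading.Thread(target=churn, args=(scratch_dir, rounds, size))
    worker.start()
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
    finally:
        worker.join()
    return result.returncode, result.stdout.strip()
```

Watching the kernel log while this runs, once with and once without libata.force=noncqtrim, would show whether queued TRIM is what triggers the errors.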
Comment 46 Alexander Tsoy 2021-08-24 16:59:31 UTC
(In reply to Hans de Goede from comment #34)
...
> 2. Things are seriously broken on AMD controllers and only completely
> disabling NCQ altogether helps there.

Please note that even disabling NCQ doesn't solve this problem completely. I still had occasional I/O freezes with my AMD SP5100 (SB700S) chipset, but without any kernel messages.
I upgraded to AMD X570 based system several months ago and everything is completely stable now with NCQ *enabled*.
Comment 47 Justin Clift 2021-08-29 16:40:56 UTC
Apologies for the delay, this weekend got away from me with other things.  I'll have to get this tested sometime in the next few days or next weekend. :/
Comment 48 Hans de Goede 2021-08-30 15:15:02 UTC
As already mentioned in comment 34 we have been working towards a solution for this:

"""
So after completely re-reading / analyzing both this bug as well as bug 201693 with a fresh pair of eyes (since the last time I did this was a long time ago) I agree. After careful reading / analysis it seems that there really are 2 different bugs here impacting both the 860 EVO and the 870 EVO:

1. Queued Trim commands are causing issues on Intel + ASmedia + Marvell controllers

2. Things are seriously broken on AMD controllers and only completely disabling NCQ altogether helps there.
"""

A patch implementing 1. was submitted upstream a week ago here:
https://lore.kernel.org/linux-ide/20210823095220.30157-1-hdegoede@redhat.com/T/#u

And a patch implementing 2. was just submitted upstream:
https://lore.kernel.org/linux-ide/54f63e11-e421-0fa6-80e1-297287dc0974@redhat.com/

Together these should resolve (work around) this issue for most users.
Comment 49 Krzysztof Oledzki 2021-09-03 20:35:52 UTC
For clarification - we established in https://bugzilla.kernel.org/show_bug.cgi?id=201693 that the problem is limited to "ATI AMD" AHCI controllers - 0x1002, not "Modern AMD" - 0x1022.
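
For anyone unsure which group their controller falls into, here is a rough sketch that lists AHCI controllers with their PCI vendor ID. It assumes the standard `/sys/bus/pci/devices/*/vendor` and `class` attributes and the PCI class code 0x010601 for SATA controllers in AHCI mode; controllers running in a RAID mode report a different class and would be missed. The function and constant names are illustrative only.

```python
from pathlib import Path

ATI_VENDOR_ID = 0x1002  # "ATI AMD" controllers (affected)
AMD_VENDOR_ID = 0x1022  # "Modern AMD" controllers (not affected)
AHCI_CLASS = 0x010601   # mass storage / SATA / AHCI programming interface


def list_ahci_controllers(sysfs_root="/sys"):
    """Return a list of (pci_address, vendor_id, is_ati) tuples
    for every PCI device whose class code is AHCI."""
    results = []
    dev_dir = Path(sysfs_root) / "bus" / "pci" / "devices"
    if not dev_dir.is_dir():
        return results
    for dev in sorted(dev_dir.iterdir()):
        try:
            # sysfs prints both values as hex strings such as "0x1002".
            vendor = int((dev / "vendor").read_text().strip(), 16)
            pci_class = int((dev / "class").read_text().strip(), 16)
        except (OSError, ValueError):
            continue
        if pci_class != AHCI_CLASS:
            continue
        results.append((dev.name, vendor, vendor == ATI_VENDOR_ID))
    return results
```

A controller reported with `is_ati` set to `True` is in the 0x1002 group discussed above.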
Comment 50 Jens Axboe 2021-09-03 20:40:07 UTC
Patches have been queued up. Tejun, can you close it?
Comment 51 Hans de Goede 2021-09-03 20:51:09 UTC
The patches for both this bug as well as for bug 201693 are on their way to Linus, closing.
Comment 52 Gurenko Alex 2021-09-15 08:22:49 UTC
Probably a little late at this point, but I'm trying to understand something here.

The issue seems to manifest itself on 8{6,7}0 EVO models (probably EVQ), but there are also Pro models that seem to work just fine, based on random comments from people commenting on the news about this patch. Myself included. I'm using a Samsung 860 Pro with an X570 chipset for a year now with zero issues so far. So now this patch will unconditionally cut performance on affected and not-affected devices, is that right?
Will there be a flag to force-enable NCQ then?
Comment 53 Justin Clift 2021-09-15 08:54:47 UTC
In theory, the `libata.force` kernel command line options allow turning `ncq` and `ncqtrim` both on and off.

From:

https://www.kernel.org/doc/html/v5.15-rc1/admin-guide/kernel-parameters.html

```
* [no]ncq: Turn on or off NCQ.
* [no]ncqtrim: Turn off queued DSM TRIM.
```

So, if the change does cut performance with your system you should be able to enable things again without too much hassle.  Hopefully. (!) :)
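
To double-check what a running system was actually booted with, something like the sketch below can pull the `libata.force` entries out of a kernel command line string (for example the contents of `/proc/cmdline`). It is a simplified illustration: it only splits the documented `[ID:]VAL` comma-separated form and does not reproduce the kernel's own parsing rules. `parse_libata_force` is a made-up helper name.

```python
def parse_libata_force(cmdline):
    """Return a list of (target, value) pairs found in libata.force=.

    target is the optional port/device ID part as a string, or None
    when the value was given without an ID.
    """
    prefix = "libata.force="
    entries = []
    for token in cmdline.split():
        if not token.startswith(prefix):
            continue
        for item in token[len(prefix):].split(","):
            if not item:
                continue
            # "1.00:noncq" -> target "1.00", value "noncq"
            target, sep, value = item.rpartition(":")
            entries.append((target if sep else None, value))
    return entries
```

For example, `parse_libata_force("root=/dev/sda1 libata.force=noncqtrim quiet")` returns `[(None, "noncqtrim")]`.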
Comment 54 Justin Clift 2021-09-15 08:59:04 UTC
This is the source code with the exact spellings, if that helps:

https://github.com/torvalds/linux/blob/3ca706c189db861b2ca2019a0901b94050ca49d8/drivers/ata/libata-core.c#L6155-L6160
Comment 55 Gurenko Alex 2021-09-15 09:07:49 UTC
(In reply to Justin Clift from comment #53)
> In theory, the kernel command line options allow turning `ncq` and `ncqtrim`
> both on, and off.
> 
> From:
> 
> https://www.kernel.org/doc/html/v5.15-rc1/admin-guide/kernel-parameters.html
> 
> ```
> * [no]ncq: Turn on or off NCQ.
> * [no]ncqtrim: Turn off queued DSM TRIM.
> ```
> 
> So, if the change does cut performance with your system you should be able
> to enable things again without too much hassle.  Hopefully. (!) :)

Thanks, that actually helps. I've been looking into other drives to replace my 860 Pro with. Since I have 2 almost identical systems, I've been thinking of putting my drive into the other setup (which uses Windows) and buying myself new drives, as I wanted to expand anyway.
Comment 56 Hans de Goede 2021-09-15 11:32:13 UTC
(In reply to Gurenko Alex from comment #52)
> Probably a little late at this point, but I'm trying to understand something
> here.
> 
> The issue seems to manifest itself on 8{6,7}0 EVO models (probably EVQ),
> but there are also Pro models that seem to work just fine, based on random
> comments from people commenting on the news about this patch. Myself included.
> I'm using Samsung 860 Pro with X570 chipset for a year now with zero issues
> so far. So now this patch will unconditionally cut performance on affected
> and not-affected devices, is that right?

There are 2 parts to the patch:

1. Disable queued-trim on all Samsung 860 + 870 drives. This will impact your setup too, but it only impacts trims; NCQ is otherwise still fully used, so the performance impact is typically negligible, especially since most distros don't use continuous trim to begin with. Chances are you have not seen any issues because of this.

2. Completely disable NCQ when a Samsung 860 / 870 drive is connected to a SATA controller with an ATI PCI-vendor-id. Your X570 has an AMD PCI-vendor-id, so you are not impacted by this change.

Also note that several people have actually reported issues with queued-trims in combination with the 860 Pro, IOW the 860 Pro really also needs 1.
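
The two parts can be summarised as a small decision function. This is only a sketch of the behaviour described above, not the kernel code: the model patterns, the `quirks_for` name and the returned dictionary keys are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase

ATI_VENDOR_ID = 0x1002  # ATI PCI vendor ID (AMD proper is 0x1022)

# Assumed model-string patterns for the affected drive families.
AFFECTED_MODEL_PATTERNS = ("Samsung SSD 860*", "Samsung SSD 870*")


def quirks_for(model, controller_vendor_id):
    """Return which features stay enabled for a drive/controller pair."""
    features = {"queued_trim": True, "ncq": True}
    if any(fnmatchcase(model, p) for p in AFFECTED_MODEL_PATTERNS):
        # Part 1: no queued TRIM on any 860 / 870 drive.
        features["queued_trim"] = False
        # Part 2: no NCQ at all behind an ATI-vendor-id controller.
        if controller_vendor_id == ATI_VENDOR_ID:
            features["ncq"] = False
    return features
```

With these assumptions, `quirks_for("Samsung SSD 860 PRO 512GB", 0x1022)` gives `{"queued_trim": False, "ncq": True}`, which matches the X570 case described above.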
Comment 57 Gurenko Alex 2021-09-15 12:18:08 UTC
(In reply to Hans de Goede from comment #56)
> (In reply to Gurenko Alex from comment #52)
> > Probably a little late at this point, but I'm trying to understand
> > something here.
> > 
> > The issue seems to manifest itself on 8{6,7}0 EVO models (probably EVQ),
> > but there are also Pro models that seem to work just fine, based on random
> > comments from people commenting on the news about this patch. Myself included.
> > I'm using Samsung 860 Pro with X570 chipset for a year now with zero issues
> > so far. So now this patch will unconditionally cut performance on affected
> > and not-affected devices, is that right?
> 
> There are 2 parts to the patch:
> 
> > 1. Disable queued-trim on all Samsung 860 + 870 drives. This will impact
> > your setup too, but it only impacts trims; NCQ is otherwise still fully
> > used, so the performance impact is typically negligible, especially
> > since most distros don't use continuous trim to begin with. Chances are you
> > have not seen any issues because of this.
> > 
> > 2. Completely disable NCQ when a Samsung 860 / 870 drive is connected
> to a SATA controller with an ATI PCI-vendor-id. Your X570 has an AMD
> PCI-vendor-id, so you are not impacted by this change.
> 
> Also note that several people have actually reported issues with
> queued-trims in combination with the 860 Pro, IOW the 860 Pro really also
> needs 1.

Thanks a lot for the clear explanation. I think I got worried by the kernel mailing list reference: "Note that with AMD SATA controllers users are reporting even worse issues".