Bug 203475
| Summary: | Samsung 860 EVO queued TRIM issues | | |
|---|---|---|---|
| Product: | IO/Storage | Reporter: | Roman Mamedov (rm+bko) |
| Component: | Serial ATA | Assignee: | Tejun Heo (tj) |
| Status: | CLOSED CODE_FIX | | |
| Severity: | normal | CC: | agurenko, alexander, bill-osdl.org-bugzilla, braccoz, brice.simon, bugzilla, fweimer, i, johnsimcall, justin, jwrdegoede, kernelbugs, lnicola, mikko.rantalainen, ole, pizza, pjbrs, reg.kernelbugzilla.wad1w, rm+bko, siltal02, sitsofe, stathis, t50, tom.crossland, ushakov, vincent |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 4.14.114 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | dmesg of the errors occuring; disable queued TRIM for Samsung 860 series SSDs | | |
Created attachment 282581
disable queued TRIM for Samsung 860 series SSDs
This patch is still relevant for master.

Add my vote to merging this; I'd like to be able to re-enable NCQ on this SSD.

This patch looks good - any chance you can email one with a proper commit log and signed-off-by etc to linux-ide@vger.kernel.org? And you can CC me, axboe@kernel.dk, and I'll get it queued up for the current kernel.

Jens, thanks, sent to https://marc.info/?l=linux-ide&m=156312691006716&w=2, it is now being discussed there.

Solomon: what model do you have that also has a problem with TRIM, 860 EVO mSATA too? And which firmware revision?

I have the 1TB SATA (not mSATA!) version.

smartctl -a dump:

Model Family: Samsung based SSDs
Device Model: Samsung SSD 860 EVO 1TB
Serial Number: S3Z8NB0K717690X
LU WWN Device Id: 5 002538 e4054049c
Firmware Version: RVT01B6Q
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Jul 15 13:47:44 2019 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

kernel log snippet: (Untainted Fedora 5.1.16-300.fc30.x86_64 kernel)

ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata1.00: supports DRM functions and may not be fully accessible
ata1.00: ATA-11: Samsung SSD 860 EVO 1TB, RVT01B6Q, max UDMA/133
ata1.00: 1953525168 sectors, multi 1: LBA48 NCQ (depth 32), AA
ata1.00: supports DRM functions and may not be fully accessible
ata1.00: configured for UDMA/133
scsi 0:0:0:0: Direct-Access ATA Samsung SSD 860 1B6Q PQ: 0 ANSI: 5
sd 0:0:0:0: Attached scsi generic sg0 type 0
ata1.00: Enabling discard_zeroes_data
sd 0:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata1.00: Enabling discard_zeroes_data
sda: sda1 sda2 sda3
ata1.00: Enabling discard_zeroes_data
sd 0:0:0:0: [sda] supports TCG Opal
sd 0:0:0:0: [sda] Attached SCSI disk

See also BZ #201693

> See also BZ #201693
Did you confirm that with my patch applied you no longer have any problem with the 860 EVO on the AMD SATA controller? I thought that one was a hopeless matter, with the issues extending beyond TRIM to regular (high-speed) reads/writes too. For that reason I moved mine to an ASMedia controller, and there it is clear-cut that only the queued TRIM fails; everything else works fine.
I'm building a Fedora kernel with the patch applied, and will get back to you later today. But in the meantime I can confirm that by setting the drive's queue depth to 1, I have no timeout or corruption issues. [[ echo 1 > /sys/block/sda/device/queue_depth ]]

Finally got it built and booted up... and it went kaboom.

Same kernel (Fedora 5.1.16-300) but with Roman's patch applied yields much the same kernel log, with this addition:

ata1.00: disabling queued TRIM support

Unfortunately, about 30 seconds later, it went kaboom:

[ 35.527148] ata1.00: exception Emask 0x10 SAct 0xfc000 SErr 0x0 action 0x6 frozen
[ 35.527155] ata1.00: irq_stat 0x08000000, interface fatal error
[ 35.527161] ata1.00: failed command: WRITE FPDMA QUEUED
[ 35.527171] ata1.00: cmd 61/20:70:e0:a6:8b/00:00:25:00:00/40 tag 14 ncq dma 16384 out
         res 40/00:70:e0:a6:8b/00:00:25:00:00/40 Emask 0x10 (ATA bus error)
[ 35.527176] ata1.00: status: { DRDY }
[ 35.527179] ata1.00: failed command: WRITE FPDMA QUEUED
[ 35.527187] ata1.00: cmd 61/08:78:e0:ad:8b/00:00:25:00:00/40 tag 15 ncq dma 4096 out
         res 40/00:70:e0:a6:8b/00:00:25:00:00/40 Emask 0x10 (ATA bus error)
[ 35.527191] ata1.00: status: { DRDY }
[ 35.527194] ata1.00: failed command: WRITE FPDMA QUEUED
[ 35.527202] ata1.00: cmd 61/20:80:60:d0:91/00:00:25:00:00/40 tag 16 ncq dma 16384 out
         res 40/00:70:e0:a6:8b/00:00:25:00:00/40 Emask 0x10 (ATA bus error)
[ 35.527205] ata1.00: status: { DRDY }
[ 35.527208] ata1.00: failed command: WRITE FPDMA QUEUED
[ 35.527216] ata1.00: cmd 61/40:88:00:d1:91/00:00:25:00:00/40 tag 17 ncq dma 32768 out
         res 40/00:70:e0:a6:8b/00:00:25:00:00/40 Emask 0x10 (ATA bus error)
[ 35.527219] ata1.00: status: { DRDY }
[ 35.527222] ata1.00: failed command: WRITE FPDMA QUEUED
[ 35.527230] ata1.00: cmd 61/08:90:c0:51:92/00:00:25:00:00/40 tag 18 ncq dma 4096 out
         res 40/00:70:e0:a6:8b/00:00:25:00:00/40 Emask 0x10 (ATA bus error)
[ 35.527233] ata1.00: status: { DRDY }
[ 35.527236] ata1.00: failed command: WRITE FPDMA QUEUED
[ 35.527243] ata1.00: cmd 61/20:98:20:52:92/00:00:25:00:00/40 tag 19 ncq dma 16384 out
         res 40/00:70:e0:a6:8b/00:00:25:00:00/40 Emask 0x10 (ATA bus error)
[ 35.527246] ata1.00: status: { DRDY }
[ 35.527252] ata1: hard resetting link
[ 35.986132] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 35.986457] ata1.00: supports DRM functions and may not be fully accessible
[ 35.987384] ata1.00: disabling queued TRIM support
[ 35.989818] ata1.00: supports DRM functions and may not be fully accessible
[ 35.990591] ata1.00: disabling queued TRIM support
[ 35.992641] ata1.00: configured for UDMA/133
[ 35.992670] ata1: EH complete
[ 35.992941] ata1.00: Enabling discard_zeroes_data

So perhaps this SSD is simply incompatible with NCQ. Sigh.

> So perhaps this SSD is simply incompatible with NCQ.
Not in general, only in combination with AMD SATA, as discussed in that other bug report. And indeed there it's not only TRIM, but also regular writes. Any chance you could test on a different controller (ASMedia, Marvell, ...)?
It's frustrating that Samsung has demonstrated no interest in solving this problem properly. It's not like AMD-based systems are _that_ rare. Every system I have at home is AMD-based or has an incompatible form factor. I'll see what I can dig up around the office. I just swapped in an ASMedia-based SATA controller, and re-enabled NCQ (by using the default queue_depth). The system is subjectively much, much faster and is (so far) error free. I'm getting the same issue on 4.15..5.4.49 with an Intel ASRock Z170 Extreme4 SATA controller: [389520.385306] ata2.00: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x6 frozen [389520.385315] ata2.00: failed command: WRITE FPDMA QUEUED [389520.385327] ata2.00: cmd 61/60:00:80:8e:20/00:00:98:00:00/40 tag 0 ncq dma 49152 out res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout) [389520.385332] ata2.00: status: { DRDY } [389520.385336] ata2.00: failed command: WRITE FPDMA QUEUED [389520.385345] ata2.00: cmd 61/20:08:00:8f:20/00:00:98:00:00/40 tag 1 ncq dma 16384 out res 40/00:01:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout) [389520.385349] ata2.00: status: { DRDY } [389520.385353] ata2.00: failed command: SEND FPDMA QUEUED [389520.385364] ata2.00: cmd 64/01:10:00:00:00/00:00:00:00:00/a0 tag 2 ncq dma 512 out res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout) [389520.385370] ata2.00: status: { DRDY } [389520.385374] ata2.00: failed command: WRITE FPDMA QUEUED [389520.385382] ata2.00: cmd 61/e0:18:b8:ea:77/05:00:97:00:00/40 tag 3 ncq dma 770048 out res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout) [389520.385386] ata2.00: status: { DRDY } [389520.385393] ata2: hard resetting link [389520.699442] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [389520.701434] ata2.00: supports DRM functions and may not be fully accessible [389520.704682] ata2.00: supports DRM functions and may not be fully accessible [389520.707501] ata2.00: configured for UDMA/133 [389520.707511] ata2: EH complete [389520.707742] 
ata2.00: Enabling discard_zeroes_data [389551.093259] ata2.00: exception Emask 0x0 SAct 0x1fc0000 SErr 0x0 action 0x6 frozen [389551.093261] ata2.00: failed command: WRITE FPDMA QUEUED [389551.093264] ata2.00: cmd 61/d8:90:a8:bc:a0/09:00:97:00:00/40 tag 18 ncq dma 1290240 ou res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout) [389551.093265] ata2.00: status: { DRDY } [389551.093266] ata2.00: failed command: WRITE FPDMA QUEUED [389551.093267] ata2.00: cmd 61/e0:98:b8:ea:77/05:00:97:00:00/40 tag 19 ncq dma 770048 out res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout) [389551.093268] ata2.00: status: { DRDY } [389551.093269] ata2.00: failed command: SEND FPDMA QUEUED [389551.093271] ata2.00: cmd 64/01:a0:00:00:00/00:00:00:00:00/a0 tag 20 ncq dma 512 out res 40/00:01:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout) [389551.093271] ata2.00: status: { DRDY } [389551.093272] ata2.00: failed command: WRITE FPDMA QUEUED [389551.093274] ata2.00: cmd 61/20:a8:00:8f:20/00:00:98:00:00/40 tag 21 ncq dma 16384 out res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout) [389551.093274] ata2.00: status: { DRDY } [389551.093275] ata2.00: failed command: WRITE FPDMA QUEUED [389551.093295] ata2.00: cmd 61/60:b0:80:8e:20/00:00:98:00:00/40 tag 22 ncq dma 49152 out res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout) [389551.093296] ata2.00: status: { DRDY } [389551.093296] ata2.00: failed command: WRITE FPDMA QUEUED [389551.093298] ata2.00: cmd 61/b0:b8:80:c6:a0/09:00:97:00:00/40 tag 23 ncq dma 1269760 ou res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout) [389551.093299] ata2.00: status: { DRDY } [389551.093300] ata2.00: failed command: WRITE FPDMA QUEUED [389551.093301] ata2.00: cmd 61/10:c0:f0:21:22/00:00:96:00:00/40 tag 24 ncq dma 8192 out res 40/00:01:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout) [389551.093302] ata2.00: status: { DRDY } [389551.093303] ata2: hard resetting link [389551.407389] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 
300) [389551.409259] ata2.00: supports DRM functions and may not be fully accessible [389551.412712] ata2.00: supports DRM functions and may not be fully accessible [389551.415759] ata2.00: configured for UDMA/133 [389551.415773] ata2: EH complete [389581.797243] ata2.00: exception Emask 0x0 SAct 0x3f80 SErr 0x0 action 0x6 frozen [389581.797246] ata2.00: failed command: WRITE FPDMA QUEUED [389581.797248] ata2.00: cmd 61/10:38:f0:21:22/00:00:96:00:00/40 tag 7 ncq dma 8192 out res 40/00:01:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout) [389581.797249] ata2.00: status: { DRDY } [389581.797250] ata2.00: failed command: WRITE FPDMA QUEUED [389581.797252] ata2.00: cmd 61/b0:40:80:c6:a0/09:00:97:00:00/40 tag 8 ncq dma 1269760 ou res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [389581.797253] ata2.00: status: { DRDY } [389581.797253] ata2.00: failed command: WRITE FPDMA QUEUED [389581.797255] ata2.00: cmd 61/60:48:80:8e:20/00:00:98:00:00/40 tag 9 ncq dma 49152 out res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout) [389581.797256] ata2.00: status: { DRDY } [389581.797257] ata2.00: failed command: WRITE FPDMA QUEUED [389581.797258] ata2.00: cmd 61/20:50:00:8f:20/00:00:98:00:00/40 tag 10 ncq dma 16384 out res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout) [389581.797259] ata2.00: status: { DRDY } [389581.797260] ata2.00: failed command: SEND FPDMA QUEUED [389581.797262] ata2.00: cmd 64/01:58:00:00:00/00:00:00:00:00/a0 tag 11 ncq dma 512 out res 40/00:01:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout) [389581.797262] ata2.00: status: { DRDY } [389581.797263] ata2.00: failed command: WRITE FPDMA QUEUED [389581.797265] ata2.00: cmd 61/e0:60:b8:ea:77/05:00:97:00:00/40 tag 12 ncq dma 770048 out res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [389581.797265] ata2.00: status: { DRDY } [389581.797266] ata2.00: failed command: WRITE FPDMA QUEUED [389581.797268] ata2.00: cmd 61/d8:68:a8:bc:a0/09:00:97:00:00/40 tag 13 ncq dma 1290240 ou res 
40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout) [389581.797268] ata2.00: status: { DRDY } [389581.797270] ata2: hard resetting link [389582.111393] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [389582.113289] ata2.00: supports DRM functions and may not be fully accessible [389582.116517] ata2.00: supports DRM functions and may not be fully accessible [389582.119421] ata2.00: configured for UDMA/133 [389582.119438] ata2: EH complete [389582.119715] ata2.00: Enabling discard_zeroes_data [389582.120788] ata2.00: Enabling discard_zeroes_data [389612.533285] ata2.00: NCQ disabled due to excessive errors [389612.533292] ata2.00: exception Emask 0x0 SAct 0x7c00000f SErr 0x0 action 0x6 frozen [389612.533301] ata2.00: failed command: WRITE FPDMA QUEUED [389612.533313] ata2.00: cmd 61/b0:00:80:c6:a0/09:00:97:00:00/40 tag 0 ncq dma 1269760 ou res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout) [389612.533317] ata2.00: status: { DRDY } [389612.533322] ata2.00: failed command: WRITE FPDMA QUEUED [389612.533331] ata2.00: cmd 61/10:08:f0:21:22/00:00:96:00:00/40 tag 1 ncq dma 8192 out res 40/00:01:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout) [389612.533335] ata2.00: status: { DRDY } [389612.533339] ata2.00: failed command: READ FPDMA QUEUED [389612.533347] ata2.00: cmd 60/18:10:c0:d3:00/00:00:00:00:00/40 tag 2 ncq dma 12288 in res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout) [389612.533351] ata2.00: status: { DRDY } [389612.533354] ata2.00: failed command: READ FPDMA QUEUED [389612.533363] ata2.00: cmd 60/20:18:80:b9:e7/00:00:58:00:00/40 tag 3 ncq dma 16384 in res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout) [389612.533366] ata2.00: status: { DRDY } [389612.533371] ata2.00: failed command: WRITE FPDMA QUEUED [389612.533380] ata2.00: cmd 61/d8:d0:a8:bc:a0/09:00:97:00:00/40 tag 26 ncq dma 1290240 ou res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout) [389612.533383] ata2.00: status: { DRDY } [389612.533387] ata2.00: failed 
command: WRITE FPDMA QUEUED [389612.533396] ata2.00: cmd 61/e0:d8:b8:ea:77/05:00:97:00:00/40 tag 27 ncq dma 770048 out res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout) [389612.533399] ata2.00: status: { DRDY } [389612.533402] ata2.00: failed command: SEND FPDMA QUEUED [389612.533410] ata2.00: cmd 64/01:e0:00:00:00/00:00:00:00:00/a0 tag 28 ncq dma 512 out res 40/00:01:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout) [389612.533414] ata2.00: status: { DRDY } [389612.533417] ata2.00: failed command: WRITE FPDMA QUEUED [389612.533426] ata2.00: cmd 61/20:e8:00:8f:20/00:00:98:00:00/40 tag 29 ncq dma 16384 out res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout) [389612.533429] ata2.00: status: { DRDY } [389612.533433] ata2.00: failed command: WRITE FPDMA QUEUED [389612.533441] ata2.00: cmd 61/60:f0:80:8e:20/00:00:98:00:00/40 tag 30 ncq dma 49152 out res 40/00:01:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout) [389612.533445] ata2.00: status: { DRDY } [389612.533451] ata2: hard resetting link [389612.851755] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [389612.853797] ata2.00: supports DRM functions and may not be fully accessible [389612.857594] ata2.00: supports DRM functions and may not be fully accessible [389612.860819] ata2.00: configured for UDMA/133 [389612.860879] ata2: EH complete [389612.865362] ata2.00: Enabling discard_zeroes_data This is during an fstrim, and it doesn't happen on the Samsung 850 EVO. 
Device Model: Samsung SSD 850 EVO 2TB Firmware Version: EMT02B6Q Device Model: Samsung SSD 860 EVO 2TB Firmware Version: RVT04B6Q 00:17.0 SATA controller: Intel Corporation Q170/Q150/B150/H170/H110/Z170/CM236 Chipset SATA Controller [AHCI Mode] (rev 31) Same issue, different controller: System: FUJITSU PRIMERGY TX1310 M1/D3219-A1, BIOS V4.6.5.4 R1.11.0 for D3219-A1x 09/25/2018 Kernel: Linux server 5.4.72-gentoo-x86_64 #1 SMP Sat Oct 17 05:17:10 EET 2020 x86_64 Intel(R) Xeon(R) CPU E3-1226 v3 @ 3.30GHz GenuineIntel GNU/Linux Controller: 00:1f.2 SATA controller: Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] (rev 04) Device Model: Samsung SSD 860 EVO 500GB Firmware Version: RVT04B6Q [395138.151251] ata6.00: exception Emask 0x10 SAct 0x40003fff SErr 0x400100 action 0x6 frozen [395138.152011] ata6.00: irq_stat 0x08000008, interface fatal error [395138.152755] ata6: SError: { UnrecovData Handshk } [395138.153470] ata6.00: failed command: WRITE FPDMA QUEUED [395138.154222] ata6.00: cmd 61/08:00:78:38:80/00:00:0e:00:00/40 tag 0 ncq dma 4096 out res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error) [395138.155801] ata6.00: status: { DRDY } [395138.156579] ata6.00: failed command: WRITE FPDMA QUEUED [395138.156581] ata6.00: cmd 61/08:08:18:26:81/00:00:0e:00:00/40 tag 1 ncq dma 4096 out res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error) [395138.156581] ata6.00: status: { DRDY } [395138.156582] ata6.00: failed command: WRITE FPDMA QUEUED [395138.156593] ata6.00: cmd 61/08:10:50:59:81/00:00:0e:00:00/40 tag 2 ncq dma 4096 out res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error) [395138.156594] ata6.00: status: { DRDY } [395138.156594] ata6.00: failed command: WRITE FPDMA QUEUED [395138.156596] ata6.00: cmd 61/08:18:90:6a:81/00:00:0e:00:00/40 tag 3 ncq dma 4096 out res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error) [395138.156596] ata6.00: status: { DRDY } [395138.156597] 
ata6.00: failed command: WRITE FPDMA QUEUED [395138.156598] ata6.00: cmd 61/08:20:58:b2:81/00:00:0e:00:00/40 tag 4 ncq dma 4096 out res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error) [395138.156599] ata6.00: status: { DRDY } [395138.156599] ata6.00: failed command: WRITE FPDMA QUEUED [395138.156601] ata6.00: cmd 61/08:28:b0:26:c0/00:00:0e:00:00/40 tag 5 ncq dma 4096 out res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error) [395138.156602] ata6.00: status: { DRDY } [395138.171913] ata6.00: failed command: WRITE FPDMA QUEUED [395138.171915] ata6.00: cmd 61/10:30:a0:27:c0/00:00:0e:00:00/40 tag 6 ncq dma 8192 out res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error) [395138.171916] ata6.00: status: { DRDY } [395138.171916] ata6.00: failed command: WRITE FPDMA QUEUED [395138.171919] ata6.00: cmd 61/08:38:50:2a:c0/00:00:0e:00:00/40 tag 7 ncq dma 4096 out res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error) [395138.176836] ata6.00: status: { DRDY } [395138.176837] ata6.00: failed command: WRITE FPDMA QUEUED [395138.176839] ata6.00: cmd 61/08:40:e8:49:c8/00:00:0e:00:00/40 tag 8 ncq dma 4096 out res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error) [395138.176839] ata6.00: status: { DRDY } [395138.176840] ata6.00: failed command: WRITE FPDMA QUEUED [395138.176841] ata6.00: cmd 61/08:48:58:08:80/00:00:0f:00:00/40 tag 9 ncq dma 4096 out res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error) [395138.176842] ata6.00: status: { DRDY } [395138.183063] ata6.00: failed command: WRITE FPDMA QUEUED [395138.183065] ata6.00: cmd 61/08:50:08:08:c0/00:00:13:00:00/40 tag 10 ncq dma 4096 out res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error) [395138.183075] ata6.00: status: { DRDY } [395138.183076] ata6.00: failed command: WRITE FPDMA QUEUED [395138.183077] ata6.00: cmd 61/08:58:80:08:c0/00:00:13:00:00/40 tag 11 ncq dma 4096 out res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error) 
[395138.183078] ata6.00: status: { DRDY } [395138.189053] ata6.00: failed command: WRITE FPDMA QUEUED [395138.189055] ata6.00: cmd 61/08:60:a8:12:c0/00:00:13:00:00/40 tag 12 ncq dma 4096 out res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error) [395138.189055] ata6.00: status: { DRDY } [395138.189065] ata6.00: failed command: WRITE FPDMA QUEUED [395138.189066] ata6.00: cmd 61/08:68:f8:12:c0/00:00:13:00:00/40 tag 13 ncq dma 4096 out res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error) [395138.189067] ata6.00: status: { DRDY } [395138.189068] ata6.00: failed command: WRITE FPDMA QUEUED [395138.189070] ata6.00: cmd 61/08:f0:90:2d:80/00:00:0e:00:00/40 tag 30 ncq dma 4096 out res 40/00:68:f8:12:c0/00:00:13:00:00/40 Emask 0x10 (ATA bus error) [395138.189071] ata6.00: status: { DRDY } [395138.199031] ata6: hard resetting link [395138.511140] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [395138.517064] ata6.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded [395138.519256] ata6.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out [395138.521402] ata6.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out [395138.523837] ata6.00: supports DRM functions and may not be fully accessible [395138.529475] ata6.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded [395138.530403] ata6.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out [395138.531236] ata6.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out [395138.532417] ata6.00: supports DRM functions and may not be fully accessible [395138.536106] ata6.00: configured for UDMA/133 [395138.537034] ata6: EH complete [395138.537973] ata6.00: Enabling discard_zeroes_data What's the recommended way to go? Disable NCQ? > What's the recommended way to go? Disable NCQ?
I believe if you see "WRITE FPDMA QUEUED" messages, the issue is with NCQ in general, and yes, you should try disabling it for the device. But if you see only "SEND FPDMA QUEUED" failures, as in the initial post, then you may be able to get away with disabling just the queued TRIM.
It is surprising to see that it even fails on Intel's controllers as well, all of this was mostly discussed with regard to AMD SATA.
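The rule of thumb in the comment above (failed "WRITE FPDMA QUEUED" points at NCQ in general, failed "SEND FPDMA QUEUED" alone points at queued TRIM) can be sketched as a small log filter. This is only an illustration: the function name `classify_ata_failures` is made up here, it expects ordinary one-message-per-line kernel log output on stdin, and the hint it prints is a rough heuristic, since a single timed-out command can cause every outstanding queued command to be reported as failed.

```shell
#!/bin/sh
# Hypothetical helper, not part of libata or any distribution.
# Reads a kernel log on stdin and counts failed queued ATA commands.
classify_ata_failures() {
    awk '
        # Match lines such as: "ata2.00: failed command: WRITE FPDMA QUEUED"
        {
            i = index($0, "failed command: ")
            if (i == 0) next
            cmd = substr($0, i + length("failed command: "))
            sub(/[ \t]+$/, "", cmd)   # drop trailing whitespace
            count[cmd]++
        }
        END {
            send = count["SEND FPDMA QUEUED"] + 0
            rw = count["WRITE FPDMA QUEUED"] + count["READ FPDMA QUEUED"] + 0
            printf "SEND FPDMA QUEUED: %d\n", send
            printf "READ/WRITE FPDMA QUEUED: %d\n", rw
            if (rw > 0)
                print "hint: regular NCQ I/O is failing; try disabling NCQ for the device"
            else if (send > 0)
                print "hint: only queued TRIM is failing; disabling queued TRIM may be enough"
            else
                print "hint: no failed FPDMA commands found"
        }
    '
}
```

Typical use would be `dmesg | classify_ata_failures` after sourcing the function.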
(In reply to Roman Mamedov from comment #15)
> It is surprising to see that it even fails on Intel's controllers as well,
> all of this was mostly discussed with regard to AMD SATA.

It's not surprising when you realise that queued trim used to be disabled on the Samsung 8* until Samsung's marketing department made an unsubstantiated claim that "the improved queued trim enhances Linux compatibility":

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca6bfcb2f6d9deab3924bf901e73622a94900473

(In reply to Simon Arlott from comment #16)
> It's not surprising when you realise that queued trim used to be disabled on
> the Samsung 8* until Samsung's marketing department made an unsubstantiated
> claim that "the improved queued trim enhances Linux compatibility":
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca6bfcb2f6d9deab3924bf901e73622a94900473

So it sounds like we just need to revert that patch, or at least re-enable the ATA_HORKAGE_NO_NCQ_TRIM quirk for the 860 series?

Hans: also see https://bugzilla.kernel.org/show_bug.cgi?id=201693 . My personal experience is detailed over on https://marc.info/?t=154644279600003&r=1&w=2 and happens on plain reads. I've been booting with the kernel param libata.force=2.00:noncq to disable NCQ on the second ATA port where the Samsung 860 is plugged in which seems to stabilize things.

(In reply to Sitsofe Wheeler from comment #18)
> Hans: also see https://bugzilla.kernel.org/show_bug.cgi?id=201693 . My
> personal experience is detailed over on
> https://marc.info/?t=154644279600003&r=1&w=2 and happens on plain reads.
> I've been booting with the kernel param libata.force=2.00:noncq to disable
> NCQ on the second ATA port where the Samsung 860 is plugged in which seems
> to stabilize things.
I disabled NCQ for the drive using the equivalent kernel parameter and have not seen these messages again (although they have only appeared once recently - after a few months of the SSD's operation). For what it's worth, performance of 4K random reads has seen a tenfold decline (from 380MB/s down to 38MB/s) without NCQ, which I guess is to be expected. Performance on other tests, with NCQ vs without NCQ, didn't seem to be affected much.

Intel controller, same issue.

Model: Samsung SSD 860 EVO 1TB
Firmware Revision: RVT04B6Q
Machine: Dell Precision M4700
BIOS: A19, 11/30/2018
SATA controller: Intel Corporation 7 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)
Kernel: Linux 5.10.0-1-amd64 #1 SMP Debian 5.10.5-1 (2021-01-09) x86_64 GNU/Linux
Linux version 5.10.0-1-amd64 (debian-kernel@lists.debian.org) (gcc-10 (Debian 10.2.1-5) 10.2.1 20210108, GNU ld (GNU Binutils for Debian) 2.35.1) #1 SMP Debian 5.10.5-1 (2021-01-09)

ata1.00: exception Emask 0x10 SAct 0x7f80 SErr 0x440100 action 0x6 frozen
ata1.00: irq_stat 0x08000000, interface fatal error
ata1: SError: { UnrecovData CommWake Handshk }
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/00:38:20:16:02/0a:00:65:00:00/40 tag 7 ncq dma 1310720 ou
         res 40/00:40:20:20:02/00:00:65:00:00/40 Emask 0x10 (ATA bus error)
ata1.00: status: { DRDY }

Disabling NCQ and setting link_power_management_policy to max_performance reduces the frequency of errors.

echo 1 > /sys/block/sda/device/queue_depth
echo max_performance > /sys/class/scsi_host/host*/link_power_management_policy

I had some days without errors, but occasionally they happen again, mostly after updating/installing packages.

I'm encountering this bug as well, on a Thinkpad t450s with a Samsung SSD 860 EVO 1TB (firmware RVT04B6Q), running Slackware-14.2 with the kernel upgraded to 5.10.15. I'm adding my info particularly because of my non-AMD SATA controller:

00:1f.2 SATA controller: Intel Corporation Wildcat Point-LP SATA Controller [AHCI Mode] (rev 03) (prog-if 01 [AHCI 1.0])
Subsystem: Lenovo Wildcat Point-LP SATA Controller [AHCI Mode]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin B routed to IRQ 44
Region 0: I/O ports at 30a8 [size=8]
Region 1: I/O ports at 30b4 [size=4]
Region 2: I/O ports at 30a0 [size=8]
Region 3: I/O ports at 30b0 [size=4]
Region 4: I/O ports at 3060 [size=32]
Region 5: Memory at f123c000 (32-bit, non-prefetchable) [size=2K]
Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
Address: fee00298 Data: 0000
Capabilities: [70] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [a8] SATA HBA v1.0 BAR4 Offset=00000004
Kernel driver in use: ahci

I have two ext4 partitions mounted with discards on, one of which is encrypted. I see ata errors just about every time I reboot my machine, and was able to easily provoke them manually by issuing fstrim on my root and home partitions.

I was (apparently) able to work around this bug both by issuing

echo 1 > /sys/block/sda/device/queue_depth

and by reverting https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca6bfcb2f6d9deab3924bf901e73622a94900473

Please let me know if there's anything else I can do to help. I personally was quite put off by the sudden onset of all these ata errors after I thought I had prolonged my laptop's life with a nice and big SSD. I'm happy to work around the issue, but it would be better to be able to use vanilla sources without failures.

Hi All, Same issue here w/ Intel Comet Lake SATA Controller (on a set of Intel NUCs).
By the look of it the kernel also tries to reduce the link speed from 6Gbps to 3Gbps but no joy. [Mon Mar 8 17:04:24 2021] ata3.00: exception Emask 0x10 SAct 0x1000000 SErr 0x400100 action 0x6 frozen [Mon Mar 8 17:04:24 2021] ata3.00: irq_stat 0x08000000, interface fatal error [Mon Mar 8 17:04:24 2021] ata3: SError: { UnrecovData Handshk } [Mon Mar 8 17:04:24 2021] ata3.00: failed command: WRITE FPDMA QUEUED [Mon Mar 8 17:04:24 2021] ata3.00: cmd 61/00:c0:30:66:1c/02:00:1d:00:00/40 tag 24 ncq dma 262144 out [Mon Mar 8 17:04:24 2021] ata3.00: status: { DRDY } [Mon Mar 8 17:04:24 2021] ata3: hard resetting link [Mon Mar 8 17:04:24 2021] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [Mon Mar 8 17:04:24 2021] ata3.00: supports DRM functions and may not be fully accessible [Mon Mar 8 17:04:24 2021] ata3.00: supports DRM functions and may not be fully accessible [Mon Mar 8 17:04:24 2021] ata3.00: configured for UDMA/133 [Mon Mar 8 17:04:24 2021] ata3: EH complete [Mon Mar 8 17:04:24 2021] ata3.00: Enabling discard_zeroes_data [Mon Mar 8 17:04:24 2021] ata3.00: exception Emask 0x10 SAct 0x100000 SErr 0x400100 action 0x6 frozen [Mon Mar 8 17:04:24 2021] ata3.00: irq_stat 0x08000000, interface fatal error [Mon Mar 8 17:04:24 2021] ata3: SError: { UnrecovData Handshk } [Mon Mar 8 17:04:24 2021] ata3.00: failed command: WRITE FPDMA QUEUED [Mon Mar 8 17:04:24 2021] ata3.00: cmd 61/00:a0:30:14:1d/02:00:1d:00:00/40 tag 20 ncq dma 262144 out [...] 
[Mon Mar 8 17:04:25 2021] ata3.00: status: { DRDY }
[Mon Mar 8 17:04:25 2021] ata3: hard resetting link
[Mon Mar 8 17:04:25 2021] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[Mon Mar 8 17:04:25 2021] ata3.00: supports DRM functions and may not be fully accessible
[Mon Mar 8 17:04:25 2021] ata3.00: supports DRM functions and may not be fully accessible
[Mon Mar 8 17:04:25 2021] ata3.00: configured for UDMA/133
[Mon Mar 8 17:04:25 2021] ata3: EH complete
[Mon Mar 8 17:04:25 2021] ata3.00: Enabling discard_zeroes_data

So far I have:
- Updated NUC BIOS to FNCML357
- Updated Samsung disks' FW to RVT04B6Q
- Updated Ubuntu 20.04 w/ Kernel 5.4.0-66-generic
- Tried on two different NUC servers
- Tried w/ two different Samsung drives (1TB and 500G)
- Tried different power settings on the NUC (and attempted to disable the m2 and SDHC slots)
- Tried a few fresh installs of Ubuntu 20.04.2 as well

I have also raised a case w/ Samsung just in case. Intel have not helped so far. But the error messages keep on going. Only disabling NCQ seems to have some level of impact. Aside from crippling performance of course.

Hi all! Same issue here with AMD SB950 Controller (on motherboard Gigabyte GA-970A-UD3 rev. 1.0/1.1 with latest BIOS) and SSD Samsung 860 EVO 1TB (firmware RVT04B6Q). Kernel version 5.11.17. Workaround "libata.force=1.00:noncq" in cmdline works for me.

At some point I tried to swap that Samsung SSD for a SanDisk Ultra 3D (SDSSDH3) SSD, but got the same errors even while setting up LVM partitions on it, so I wrote it off as this AMD SATA controller being buggy with any SSD.

Just now got a Marvell 88SE9230 PCIe controller card and thought to try the same 860 EVO with that - less than a minute after flipping ncq depth from 1 (set by workaround-script on boot) to 32 (as per "ata7.00: 976773168 sectors, multi 1: LBA48 NCQ (depth 32), AA"), got the same litany of errors as with the AMD controller before, on f2fs trim requests (if I'm reading dmesg correctly), with a pretty much idle desktop.

There was a mention of potential power issues, but given that f2fs seems to say "F2FS-fs (dm-33): Issue discard(411486, 411486, 1) failed, ret: -5" specifically, it seems like a weird coincidence, and I have a Thermaltake TR2 650W PSU here that is only a year or two old, in this otherwise ~10yo machine that barely draws ~200W iirc, so idk if it's likely.

Link to dmesg output with all libata-related stuff from kernel init and full log of ssd/f2fs errors that happened with Linux 5.10.35 (built from kernel.org tarball) + Marvell 88SE9230 PCIe card + 860 EVO ~1min after switching ncq depth from 1 to 32: https://e.var.nz/2021-05-22.samsung-860-evo-marvell-88SE9230-trim-issue-dmesg.5kbcfhyvg03qm.log

Though it looks same-ish as was already posted above by other folks.

Might still try the SanDisk SSD with the Marvell 88SE9230 controller again, but pretty sure it's just some kind of linux issue at this point, unfortunately, given the different SSDs and SATA controllers involved.

In a failed copy-paste I omitted the first paragraph of the message above, sorry:

Have what looks like the same issue with AMD SB850 (M4A87TD EVO motherboard) and Samsung 860 EVO 500G for about a year with 5.4/5.10 kernels. Tried patching the linux quirks table to enable milder workarounds, but same as comments above suggest, only disabling NCQ seems to help.

...
Here's another system (running an Ubuntu SMP PREEMPT kernel based on vanilla 5.4.86) with a Samsung 860 EVO. I hit random freezes, and when the system automatically recovers after a long delay, the journalctl output contains the following errors:

ata1.00: exception Emask 0x0 SAct 0xfffe3f00 SErr 0x0 action 0x6 frozen
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/08:40:c0:d4:a0/00:00:37:00:00/40 tag 8 ncq dma 4096 out
         res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/08:48:d0:d4:a0/00:00:37:00:00/40 tag 9 ncq dma 4096 out
         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/18:50:e0:d4:a0/00:00:37:00:00/40 tag 10 ncq dma 12288 out
         res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/10:58:88:d5:a0/00:00:37:00:00/40 tag 11 ncq dma 8192 out
         res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/10:60:b8:d5:a0/00:00:37:00:00/40 tag 12 ncq dma 8192 out
         res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/08:68:d8:d5:a0/00:00:37:00:00/40 tag 13 ncq dma 4096 out
         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/08:88:b8:d2:a0/00:00:37:00:00/40 tag 17 ncq dma 4096 out
         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
...

Device Model: Samsung SSD 860 EVO 500GB
Serial Number: S3Z2NB0K836858L
LU WWN Device Id: 5 002538 e40709e42
Firmware Version: RVT01B6Q

I know that USB devices support turning quirks on and off for different devices. Is there a runtime option to disable queued TRIM for a given SATA device?
The system is running intel chipset: description: SATA controller product: 7 Series/C210 Series Chipset Family 6-port SATA Controller [AHCI mode] [8086:1E02] vendor: Intel Corporation [8086] physical id: 1f.2 bus info: pci@0000:00:1f.2 version: 04 width: 32 bits clock: 66MHz capabilities: storage msi pm ahci_1.0 bus_master cap_list configuration: driver=ahci latency=0 The motherboard is P8H77-M PRO in case it makes a difference. It seems pretty safe to assume that the 860 EVO series is just broken and cannot cope with all commands combined with NCQ, no matter what Samsung marketing department says. I would rather not disable NCQ support because it would cause major performance hit. Same issue. OS: Ubuntu 20.04.2 LTS CPU: Ryzen 5 5600X, Motherboard: ROG STRIX B550-E GAMING Memory: Corsair DDR4 CMK32GX4M2A2400C14 32GB (2x16GB) Bios: Version: 2006 Release Date: 03/19/2021 SSD: Device Model: Samsung SSD 860 EVO 1TB Firmware Version: RVT02B6Q SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s) AMD SATA controller: SATA controller: Advanced Micro Devices, Inc. 
[AMD] Device 43eb Kernel: 5.4.0-80-generic --- Jul 25 02:59:19 purkki kernel: [3651117.306707] ahci 0000:01:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0xd5c54000 flags=0x0000] Jul 25 02:59:19 purkki kernel: [3651117.599827] ata2.00: exception Emask 0x10 SAct 0x1fe00000 SErr 0x0 action 0x6 frozen Jul 25 02:59:19 purkki kernel: [3651117.599831] ata2.00: irq_stat 0x08000000, interface fatal error Jul 25 02:59:19 purkki kernel: [3651117.599835] ata2.00: failed command: WRITE FPDMA QUEUED Jul 25 02:59:19 purkki kernel: [3651117.599839] ata2.00: cmd 61/20:a8:60:df:2d/00:00:3b:00:00/40 tag 21 ncq dma 16384 out Jul 25 02:59:19 purkki kernel: [3651117.599839] res 40/00:e0:c0:0a:14/00:00:3b:00:00/40 Emask 0x10 (ATA bus error) Jul 25 02:59:19 purkki kernel: [3651117.599842] ata2.00: status: { DRDY } Jul 25 02:59:19 purkki kernel: [3651117.599844] ata2.00: failed command: WRITE FPDMA QUEUED Jul 25 02:59:19 purkki kernel: [3651117.599847] ata2.00: cmd 61/10:b0:90:ab:2e/00:00:3b:00:00/40 tag 22 ncq dma 8192 out Jul 25 02:59:19 purkki kernel: [3651117.599847] res 40/00:e0:c0:0a:14/00:00:3b:00:00/40 Emask 0x10 (ATA bus error) Jul 25 02:59:19 purkki kernel: [3651117.599851] ata2.00: status: { DRDY } Jul 25 02:59:19 purkki kernel: [3651117.599852] ata2.00: failed command: WRITE FPDMA QUEUED Jul 25 02:59:19 purkki kernel: [3651117.599855] ata2.00: cmd 61/10:b8:40:e4:2e/00:00:3b:00:00/40 tag 23 ncq dma 8192 out Jul 25 02:59:19 purkki kernel: [3651117.599855] res 40/00:e0:c0:0a:14/00:00:3b:00:00/40 Emask 0x10 (ATA bus error) Jul 25 02:59:19 purkki kernel: [3651117.599858] ata2.00: status: { DRDY } Jul 25 02:59:19 purkki kernel: [3651117.599859] ata2.00: failed command: WRITE FPDMA QUEUED Jul 25 02:59:19 purkki kernel: [3651117.599862] ata2.00: cmd 61/10:c0:90:e7:2e/00:00:3b:00:00/40 tag 24 ncq dma 8192 out Jul 25 02:59:19 purkki kernel: [3651117.599862] res 40/00:e0:c0:0a:14/00:00:3b:00:00/40 Emask 0x10 (ATA bus error) Jul 25 02:59:19 purkki kernel: 
[3651117.599865] ata2.00: status: { DRDY } Jul 25 02:59:19 purkki kernel: [3651117.599867] ata2.00: failed command: WRITE FPDMA QUEUED Jul 25 02:59:19 purkki kernel: [3651117.599870] ata2.00: cmd 61/10:c8:80:ed:2e/00:00:3b:00:00/40 tag 25 ncq dma 8192 out Jul 25 02:59:19 purkki kernel: [3651117.599870] res 40/00:e0:c0:0a:14/00:00:3b:00:00/40 Emask 0x10 (ATA bus error) Jul 25 02:59:19 purkki kernel: [3651117.599873] ata2.00: status: { DRDY } Jul 25 02:59:19 purkki kernel: [3651117.599874] ata2.00: failed command: WRITE FPDMA QUEUED Jul 25 02:59:19 purkki kernel: [3651117.599877] ata2.00: cmd 61/10:d0:b0:ed:2e/00:00:3b:00:00/40 tag 26 ncq dma 8192 out Jul 25 02:59:19 purkki kernel: [3651117.599877] res 40/00:e0:c0:0a:14/00:00:3b:00:00/40 Emask 0x10 (ATA bus error) Jul 25 02:59:19 purkki kernel: [3651117.599880] ata2.00: status: { DRDY } Jul 25 02:59:19 purkki kernel: [3651117.599881] ata2.00: failed command: WRITE FPDMA QUEUED Jul 25 02:59:19 purkki kernel: [3651117.599884] ata2.00: cmd 61/20:d8:50:ff:2e/00:00:3b:00:00/40 tag 27 ncq dma 16384 out Jul 25 02:59:19 purkki kernel: [3651117.599884] res 40/00:e0:c0:0a:14/00:00:3b:00:00/40 Emask 0x10 (ATA bus error) Jul 25 02:59:19 purkki kernel: [3651117.599887] ata2.00: status: { DRDY } Jul 25 02:59:19 purkki kernel: [3651117.599889] ata2.00: failed command: WRITE FPDMA QUEUED Jul 25 02:59:19 purkki kernel: [3651117.599891] ata2.00: cmd 61/30:e0:c0:0a:14/00:00:3b:00:00/40 tag 28 ncq dma 24576 out Jul 25 02:59:19 purkki kernel: [3651117.599891] res 40/00:e0:c0:0a:14/00:00:3b:00:00/40 Emask 0x10 (ATA bus error) Jul 25 02:59:19 purkki kernel: [3651117.599894] ata2.00: status: { DRDY } Jul 25 02:59:19 purkki kernel: [3651117.599897] ata2: hard resetting link Jul 25 02:59:20 purkki kernel: [3651118.075832] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300) Jul 25 02:59:20 purkki kernel: [3651118.076267] ata2.00: supports DRM functions and may not be fully accessible Jul 25 02:59:25 purkki kernel: [3651123.267858] ata2.00: 
qc timeout (cmd 0x47) Jul 25 02:59:25 purkki kernel: [3651123.267867] ata2.00: READ LOG DMA EXT failed, trying PIO Jul 25 02:59:25 purkki kernel: [3651123.267868] ata2.00: NCQ Send/Recv Log not supported Jul 25 02:59:25 purkki kernel: [3651123.267870] ata2.00: failed to get Identify Device Data, Emask 0x40 Jul 25 02:59:25 purkki kernel: [3651123.267871] ata2.00: ATA Identify Device Log not supported Jul 25 02:59:25 purkki kernel: [3651123.267872] ata2.00: Security Log not supported Jul 25 02:59:25 purkki kernel: [3651123.267877] ata2.00: failed to set xfermode (err_mask=0x40) Jul 25 02:59:25 purkki kernel: [3651123.267884] ata2: hard resetting link Uptime: 23:16:52 up 43 days, 2:29, 1 user, load average: 0.00, 0.01, 0.00 I did add: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash libata.force=noncq" (In reply to Logman from comment #27) > I did add: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash libata.force=noncq" So far looks ok. Uptime: 12:46:27 up 2 days, 16:50, no errors. As a data point, the Samsung 870 EVO appears to have either the same problem, or something closely related. SSD info: === START OF INFORMATION SECTION === Device Model: Samsung SSD 870 EVO 1TB Serial Number: S5Y2NF0R128941E LU WWN Device Id: 5 002538 f4112d21c Firmware Version: SVT01B6Q User Capacity: 1,000,204,886,016 bytes [1.00 TB] Sector Size: 512 bytes logical/physical Rotation Rate: Solid State Device Form Factor: 2.5 inches Device is: Not in smartctl database [for details use: -P showall] ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5 SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s) Local Time is: Sun Aug 22 15:50:09 2021 AEST SMART support is: Available - device has SMART capability. 
SMART support is: Enabled Example kernel messages (from kernel 5.13.11): ***************************** Aug 22 14:47:39 s3 kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x90000 action 0x6 frozen Aug 22 14:47:39 s3 kernel: ata1: SError: { PHYRdyChg 10B8B } Aug 22 14:47:39 s3 kernel: ata1.00: failed command: WRITE DMA EXT Aug 22 14:47:39 s3 kernel: ata1.00: cmd 35/00:08:b8:29:00/00:00:29:00:00/e0 tag 7 dma 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Aug 22 14:47:39 s3 kernel: ata1.00: status: { DRDY } Aug 22 14:47:39 s3 kernel: ata1: hard resetting link Aug 22 14:47:40 s3 kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300) Aug 22 14:47:40 s3 kernel: ata1.00: supports DRM functions and may not be fully accessible Aug 22 14:47:40 s3 kernel: ata1.00: supports DRM functions and may not be fully accessible Aug 22 14:47:40 s3 kernel: ata1.00: configured for UDMA/133 Aug 22 14:47:40 s3 kernel: ata1.00: device reported invalid CHS sector 0 Aug 22 14:47:40 s3 kernel: sd 0:0:0:0: [sda] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=31s Aug 22 14:47:40 s3 kernel: sd 0:0:0:0: [sda] tag#7 Sense Key : Illegal Request [current] Aug 22 14:47:40 s3 kernel: sd 0:0:0:0: [sda] tag#7 Add. Sense: Unaligned write command Aug 22 14:47:40 s3 kernel: sd 0:0:0:0: [sda] tag#7 CDB: Write(10) 2a 00 29 00 29 b8 00 00 08 00 Aug 22 14:47:40 s3 kernel: blk_update_request: I/O error, dev sda, sector 687876536 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0 ***************************** Using "libata.force=noncq" isn't solving the problem though. The above error is with NCQ disabled. :( Looking at the kernel source here as a guide: https://github.com/torvalds/linux/blob/9ff50bf2f2ff5fab01cac26d8eed21a89308e6ef/drivers/ata/libata-core.c#L3951-L3952 ... it seems like two potential kernel parameters for libata.force would be needed. Both "noncq" and "noncqtrim". I'll try that shortly, and see if it helps. 
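Since several comments hinge on whether `libata.force` options actually reached the kernel, here is a small sketch for checking a command line string. It assumes the documented `libata.force=[ID:]VAL,[ID:]VAL,...` syntax, where ID is PORT[.DEVICE]; the function name is made up for illustration. It only reports what is written on the command line and does not model how the kernel applies entries that omit the ID.

```python
from typing import List, Optional, Tuple


def libata_force_options(cmdline: str) -> List[Tuple[Optional[str], str]]:
    """Return (target, value) pairs for every libata.force entry in cmdline.

    target is the "PORT[.DEVICE]" prefix as written, or None when the
    entry has no explicit ID.
    """
    prefix = "libata.force="
    options: List[Tuple[Optional[str], str]] = []
    for token in cmdline.split():
        if not token.startswith(prefix):
            continue
        for item in token[len(prefix):].split(","):
            if not item:
                continue  # tolerate stray commas
            # Split "1.00:noncq" into ID and value; plain "noncq" has no ID.
            target, sep, value = item.rpartition(":")
            options.append((target if sep else None, value))
    return options
```

Typical use would be `libata_force_options(open("/proc/cmdline").read())`, which reflects what the running kernel was booted with rather than what is in the bootloader config file.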
While looking at that kernel source, it seems like some other Samsung SSDs have trouble with their link state power management:

https://github.com/torvalds/linux/blob/9ff50bf2f2ff5fab01cac26d8eed21a89308e6ef/drivers/ata/libata-core.c#L3930-L3934

Not sure if there's a kernel parameter to disable that, but manually setting it should work.

eg:

# echo 'max_performance' > /sys/class/scsi_host/host0/link_power_management_policy

(In reply to Justin Clift from comment #29)
> Not sure if there's a kernel parameter to disable that, but manually setting
> should work.
>
> eg:
>
> # echo 'max_performance' >
> /sys/class/scsi_host/host0/link_power_management_policy

You can set the default to max_performance by setting the following on the kernel cmdline: "ahci.mobile_lpm_policy=0"

Note that as the name implies, the kernel only sets the policy to a different value by default on mobile (laptop) chipsets; on desktop chipsets the default is max_performance.

Have you tried setting the link_power_management_policy with your 870 EVO? (and does it help?).

Also I wonder if you could try replacing the SATA cable with a new one? Errors like this can also happen due to a bad SATA cable.

(In reply to Justin Clift from comment #29)

Also I wonder about your PSU? Is it perhaps old? Or are you perhaps using a converter to go from a molex power-connector to a sata power-connector? Those might be flaky too.

The reason why I'm asking this is that disabling NCQ drastically lowers the performance of the SSD, which in turn drastically lowers its power-consumption. So there could be a power-supply issue (voltage-drop or spikes under load) which is causing issues with the power supplied to the SATA PHYs, leading to these kinds of transfer errors. Such an issue would only show under heavy load; and disabling NCQ makes it impossible to cause a heavy load on the SSD, since now it will only process 1 request at a time.

(In reply to Justin Clift from comment #29)

p.s.
What is the chipset-vendor of the SATA controller to which your 870 EVO is connected? All the troubles with the 860 EVO seem to be limited to AMD/Asmedia/Marvell SATA controllers. The Intel SATA controllers seem to work fine. (In reply to Hans de Goede from comment #32) > (In reply to Justin Clift from comment #29) > > What is the chipset-vendor of the SATA controller to which your 870 EVO is > connected? All the troubles with the 860 EVO seem to be limited to > AMD/Asmedia/Marvell SATA controllers. The Intel SATA controllers seem to > work fine. No they don't, please stop repeating this. This has been a problem on the 840, the 850, the 860 and now the 870. The 840 and 850 are still prevented from using queued TRIM by the kernel. The 860 and 870 are not, based solely on marketing information from Samsung claiming that the problem is fixed. The problem is not the SATA controllers (Intel is affected too), SATA cables (I've swapped mine) or the power supplies but the SSDs. (In reply to Simon Arlott from comment #33) > > What is the chipset-vendor of the SATA controller to which your 870 EVO is > > connected? All the troubles with the 860 EVO seem to be limited to > > AMD/Asmedia/Marvell SATA controllers. The Intel SATA controllers seem to > > work fine. > > No they don't, please stop repeating this. > > This has been a problem on the 840, the 850, the 860 and now the 870. > > The 840 and 850 are still prevented from using queued TRIM by the kernel. > > The 860 and 870 are not, based solely on marketing information from Samsung > claiming that the problem is fixed. > > The problem is not the SATA controllers (Intel is affected too), SATA cables > (I've swapped mine) or the power supplies but the SSDs. So after completely re-reading / analyzing both this bug as well as bug 201693 with a fresh pair of eyes (since the last time I did this was a long time ago) I agree. 
After careful reading / analysis it seems that there really are 2 different bugs here impacting both the 860 EVO and the 870 EVO:

1. Queued Trim commands are causing issues on Intel + ASmedia + Marvell controllers
2. Things are seriously broken on AMD controllers and only completely disabling NCQ altogether helps there.

I will submit a kernel patch (with a Fixes tag so that it gets backported to stable series) for 1. right away; and I've asked a colleague to start working on a new ATA horkage flag which disables NCQ on AMD SATA controllers only, so that we can add that flag (together with the ATA_HORKAGE_NO_NCQ_TRIM flag which my patch adds) to the 860 EVO and the 870 EVO to also resolve 2.

###

Note this still does not explain Justin's problem though, since Justin already has NCQ completely disabled.

Justin, are you sure you actually have "libata.force=noncq" on the kernel commandline? You can check this with "cat /proc/cmdline"; just adding it to the /etc/default/grub file is not enough, you also need to generate grub2.cfg for changes to take effect.

Also are you perhaps using an out of tree kernel-driver?

p.s.

Sorry that it took me so long (much too long) to realize that we are dealing with 2 distinct bugs here.

Thanks Hans. I'll check what I can now, though in-depth testing will have to be on the weekend. :)

Data that's likely relevant and useful:

* The computer this is happening on is an older model Acer Nitro 5 laptop. Ryzen 7 2700U cpu, and RX 560X graphics. Bought it before knowing that Ryzen (at least this one) needs a bunch of kernel command line options to even think about being stable.
:/ So, my kernel command line currently is: ******* $ cat /proc/cmdline BOOT_IMAGE=/boot/vmlinuz-5.13.11-lp153.2.g8c13a2d-default root=UUID=8bde2e75-7e73-43a7-8a82-03d5b3b81afc splash=silent resume=/dev/mapper/cr_ata-KINGSTON_RBUSNS8180DS3128GJ_50026B768291D106-part3 splash=silent resume=/dev/mapper/cr_ata-KINGSTON_RBUSNS8180DS3128GJ_50026B768291D106-part3 quiet rcu_nocbs=0-7 pcie_aspm=off pcie_port_pm=off pci=noacpi ivrs_ioapic[4]=00.14.0 ivrs_ioapic[5]=00.00.2 idle=nomwait intel_idle.max_cstate=0 processor.max_cstate=1 iommu=pt libata.force=noncq,noncqtrim mitigations=auto ******* The laptop has been turned on all day today, but not actually doing anything as I use a work provided macbook during the day. Before starting work today I manually set (using the echo approach) the link state power management to max_performance for the Samsung 870 EVO: ******* $ cat /sys/class/scsi_host/host0/link_power_management_policy max_performance ******* *No* errors have shown up in the meantime, which is a very, very good sign that "something" in one of those changes helped: ******* sudo journalctl -k | grep ata1 Aug 23 07:54:49 s3 kernel: ata1: SATA max UDMA/133 abar m2048@0xff700000 port 0xff700100 irq 22 Aug 23 07:54:49 s3 kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300) Aug 23 07:54:49 s3 kernel: ata1.00: FORCE: horkage modified (noncq) Aug 23 07:54:49 s3 kernel: ata1.00: FORCE: horkage modified (noncqtrim) Aug 23 07:54:49 s3 kernel: ata1.00: supports DRM functions and may not be fully accessible Aug 23 07:54:49 s3 kernel: ata1.00: ATA-11: Samsung SSD 870 EVO 1TB, SVT01B6Q, max UDMA/133 Aug 23 07:54:49 s3 kernel: ata1.00: 1953525168 sectors, multi 1: LBA48 NCQ (not used) Aug 23 07:54:49 s3 kernel: ata1.00: supports DRM functions and may not be fully accessible Aug 23 07:54:49 s3 kernel: ata1.00: configured for UDMA/133 Aug 23 07:54:49 s3 kernel: ata1.00: Enabling discard_zeroes_data Aug 23 07:54:49 s3 kernel: ata1.00: Enabling discard_zeroes_data Aug 23 
07:54:49 s3 kernel: ata1.00: Enabling discard_zeroes_data
*******

Note the 'FORCE: horkage modified (noncqtrim)', so it's pretty clear that was picked up by the kernel. :)

That being said, it *is* also possible there's a cabling issue at play here too. Unlike other people's laptops, this one has the cover over the ssd/hdd area removed and I'm running a sata + power extender cable to the front for easy access. eg it lets me swap physical drives (when powered off) easily.

That being said, the previous drive (a crap Crucial BX500, ~500GB) didn't show an error with this cable. And a Samsung 860 Evo 500GB with Win10 on it runs fine (even yesterday) off the same cable.

But still, it could be possible the Samsung 870 Evo is a bit more sensitive to something about that cable than the others.

---

For better testing (this weekend), I can:

1. Move the Samsung 870 Evo back into the laptop housing directly (without extender cable)
2. Try the kernel without any libata.force options. eg test if the problem is really the cable
3. If problems occur, then try with libata.force=noncq
4. Ditto, but with just libata.force=noncqtrim
5. Try with libata.force=noncq,noncqtrim

If problems are still showing up, then try setting the link state power management to max_performance (not the default on this laptop).

---

Meanwhile, that kernel patch seems like it'll help people anyway. :)

Probably worth mentioning that this ncq problem occurs within a few minutes of starting the laptop (prior to the settings change earlier today). So it's pretty easy to notice when a change for the better has happened. :)

To add another data point, I can confirm that my 860 EVO works fine on my ASMedia controller with the exception of NCQ trim.

Laurentiu, does that mean you run it with (say) your kernel having `libata.force=noncqtrim`, or an equivalent approach?

Gah. Sorry, just realised that should have been "Nicola, ..." instead.
;)

I used to disable NCQ periodically, trim, then turn it back on, because I didn't realize that noncqtrim exists. I haven't had any NCQ issues in a couple of years (since getting that drive).

(Don't worry about the name, you got it right the first time.)

Cool. :)

In the meantime, I've rebooted with `libata.force=noncqtrim` and left the link power management at its default setting:

*******
$ cat /sys/class/scsi_host/host0/link_power_management_policy
med_power_with_dipm

$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.13.11-lp153.2.g8c13a2d-default root=UUID=8bde2e75-7e73-43a7-8a82-03d5b3b81afc splash=silent resume=/dev/mapper/cr_ata-KINGSTON_RBUSNS8180DS3128GJ_50026B768291D106-part3 splash=silent resume=/dev/mapper/cr_ata-KINGSTON_RBUSNS8180DS3128GJ_50026B768291D106-part3 quiet rcu_nocbs=0-7 pcie_aspm=off pcie_port_pm=off pci=noacpi ivrs_ioapic[4]=00.14.0 ivrs_ioapic[5]=00.00.2 idle=nomwait intel_idle.max_cstate=0 processor.max_cstate=1 iommu=pt libata.force=noncqtrim mitigations=auto
*******

So far (only about 15 mins) things are working ok, with no weirdness from the ssd. I'll update this issue either way tomorrow, after it's been running a bunch of hours.

As a data point, the `libata.force=noncqtrim` option by itself wasn't the complete solution for this system.

This morning, in order to try and trigger any weirdness, I copied ~60GB of random files from one folder of the drive to another.
A few minutes later, ata errors started showing up: ******* Aug 24 10:03:37 s3 kernel: ata1.00: exception Emask 0x0 SAct 0x70 SErr 0xd0000 action 0x6 frozen Aug 24 10:03:37 s3 kernel: ata1: SError: { PHYRdyChg CommWake 10B8B } Aug 24 10:03:37 s3 kernel: ata1.00: failed command: WRITE FPDMA QUEUED Aug 24 10:03:37 s3 kernel: ata1.00: cmd 61/08:20:68:b8:77/00:00:2d:00:00/40 tag 4 ncq dma 4096 out Aug 24 10:03:37 s3 kernel: ata1.00: status: { DRDY } Aug 24 10:03:37 s3 kernel: ata1.00: failed command: WRITE FPDMA QUEUED Aug 24 10:03:37 s3 kernel: ata1.00: cmd 61/08:28:90:bc:01/00:00:08:00:00/40 tag 5 ncq dma 4096 out Aug 24 10:03:37 s3 kernel: ata1.00: status: { DRDY } Aug 24 10:03:37 s3 kernel: ata1.00: failed command: READ FPDMA QUEUED Aug 24 10:03:37 s3 kernel: ata1.00: cmd 60/08:30:90:af:72/00:00:2e:00:00/40 tag 6 ncq dma 4096 in Aug 24 10:03:37 s3 kernel: ata1.00: status: { DRDY } Aug 24 10:03:37 s3 kernel: ata1: hard resetting link Aug 24 10:03:38 s3 kernel: ata1: SATA link down (SStatus 0 SControl 300) Aug 24 10:03:38 s3 kernel: ata1: hard resetting link Aug 24 10:03:38 s3 kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300) Aug 24 10:03:38 s3 kernel: ata1.00: supports DRM functions and may not be fully accessible Aug 24 10:03:38 s3 kernel: ata1.00: disabling queued TRIM support Aug 24 10:03:38 s3 kernel: ata1.00: supports DRM functions and may not be fully accessible Aug 24 10:03:38 s3 kernel: ata1.00: disabling queued TRIM support Aug 24 10:03:38 s3 kernel: ata1.00: configured for UDMA/133 Aug 24 10:03:38 s3 kernel: ata1: EH complete Aug 24 10:03:38 s3 kernel: ata1.00: Enabling discard_zeroes_data ******* (with many more repeats of above READ/WRITE FPDMA QUEUED errors) Rebooting with the inclusion of `ahci.mobile_lpm_policy=0` on the kernel command line unexpectedly *didn't* leave the link state power management at `max_performance`: ******* $ cat /proc/cmdline BOOT_IMAGE=/boot/vmlinuz-5.13.11-lp153.2.g8c13a2d-default 
root=UUID=8bde2e75-7e73-43a7-8a82-03d5b3b81afc splash=silent resume=/dev/mapper/cr_ata-KINGSTON_RBUSNS8180DS3128GJ_50026B768291D106-part3 splash=silent resume=/dev/mapper/cr_ata-KINGSTON_RBUSNS8180DS3128GJ_50026B768291D106-part3 quiet rcu_nocbs=0-7 pcie_aspm=off pcie_port_pm=off pci=noacpi ivrs_ioapic[4]=00.14.0 ivrs_ioapic[5]=00.00.2 idle=nomwait intel_idle.max_cstate=0 processor.max_cstate=1 iommu=pt libata.force=noncqtrim ahci.mobile_lpm_policy=0 mitigations=auto

$ cat /sys/class/scsi_host/host0/link_power_management_policy
med_power_with_dipm
*******

I've manually changed it (using echo approach):

*******
# echo 'max_performance' > /sys/class/scsi_host/host0/link_power_management_policy

$ cat /sys/class/scsi_host/host0/link_power_management_policy
max_performance
*******

With that link power management change in place, copying around ~140GB of files on the drive has worked without error. :)

So far, that combination is looking decent. After work tonight I'll probably try to figure out a udev rule as a workaround for setting the link power management (as per https://bugzilla.kernel.org/show_bug.cgi?id=201693#c13).

Assuming this combination is now functional, that's great. I'll still try to break it on the weekend as above though, to figure out whether the extension cable (etc) is really causing issues and help diagnose things. :)

Justin, so I just checked and ahci.mobile_lpm_policy does not do anything on your AMD based laptop, because none of the AMD chipsets are marked as being "mobile" in the PCI-device-id list in drivers/ata/ahci.c.

Are you perhaps using TLP or some other script to "improve" / tweak the power-management settings? Then that script is likely setting the link_power_management_policy ...

As for testing with vs without libata.force=noncqtrim, notice that to actually test if this makes a difference you need to make sure that there are actually trim commands being sent to the disk.
So you would need to cause heavy file-io (including erasing large files) and then run "fstrim" at the same time.

(In reply to Hans de Goede from comment #34)
...
> 2. Things are seriously broken on AMD controllers and only completely
> disabling NCQ altogether helps there.

Please note that even disabling NCQ doesn't solve this problem completely. I still had occasional I/O freezes with my AMD SP5100 (SB700S) chipset, but without any kernel messages.

I upgraded to an AMD X570 based system several months ago and everything is completely stable now with NCQ *enabled*.

Apologies for the delay, this weekend got away from me with other things.

I'll have to get this tested sometime in the next few days or next weekend. :/

As already mentioned in comment 34 we have been working towards a solution for this:

"""
So after completely re-reading / analyzing both this bug as well as bug 201693 with a fresh pair of eyes (since the last time I did this was a long time ago) I agree.

After careful reading / analysis it seems that there really are 2 different bugs here impacting both the 860 EVO and the 870 EVO:

1. Queued Trim commands are causing issues on Intel + ASmedia + Marvell controllers
2. Things are seriously broken on AMD controllers and only completely disabling NCQ altogether helps there.
"""

A patch implementing 1. was submitted upstream a week ago here:
https://lore.kernel.org/linux-ide/20210823095220.30157-1-hdegoede@redhat.com/T/#u

And a patch implementing 2. was just submitted upstream:
https://lore.kernel.org/linux-ide/54f63e11-e421-0fa6-80e1-297287dc0974@redhat.com/

Together these should resolve (work around) this issue for most users.

For clarification - we established in https://bugzilla.kernel.org/show_bug.cgi?id=201693 that the problem is limited to "ATI AMD" AHCI controllers - 0x1002, not "Modern AMD" - 0x1022.

Patches have been queued up. Tejun, can you close it?

The patches for both this bug as well as for bug 201693 are on their way to Linus, closing.
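To make the combined effect of the two patches easier to follow, here is a rough sketch of the decision they describe in this thread. It is not the kernel code: the flag names are hypothetical labels, the function name is made up, and matching on the "Samsung SSD 860" / "Samsung SSD 870" model prefixes is an assumption based on the model strings posted above. Only the two PCI vendor IDs (0x1002 and 0x1022) come from the comments.

```python
from typing import Set

ATI_VENDOR_ID = 0x1002  # "ATI AMD" AHCI controllers, reported as affected
AMD_VENDOR_ID = 0x1022  # "Modern AMD" controllers, reported as not affected


def workarounds_for(model: str, controller_vendor_id: int) -> Set[str]:
    """Return illustrative workaround labels for a drive/controller pair."""
    flags: Set[str] = set()
    if model.startswith(("Samsung SSD 860", "Samsung SSD 870")):
        # Part 1: queued TRIM is disabled regardless of the controller.
        flags.add("no_queued_trim")
        # Part 2: NCQ is disabled entirely only behind ATI-vendor controllers.
        if controller_vendor_id == ATI_VENDOR_ID:
            flags.add("no_ncq")
    return flags
```

For example, `workarounds_for("Samsung SSD 860 EVO 1TB", AMD_VENDOR_ID)` yields only the queued-TRIM label, which matches the statement that 0x1022 controllers keep NCQ enabled.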
Probably a little late at this point, but I'm trying to understand something here.

The issue seems to manifest itself on 8{6,7}0 EVO models (probably EVQ), but there are also Pro models that seem to work just fine, based on random comments from people commenting on the news about this patch. Myself included. I'm using a Samsung 860 Pro with an X570 chipset for a year now with zero issues so far. So now this patch will unconditionally cut performance on affected and not-affected devices, is that right? Will there be a flag to force enable ncq then?

In theory, the kernel command line options allow turning `ncq` and `ncqtrim` both on, and off.

From:

https://www.kernel.org/doc/html/v5.15-rc1/admin-guide/kernel-parameters.html

```
* [no]ncq: Turn on or off NCQ.
* [no]ncqtrim: Turn off queued DSM TRIM.
```

So, if the change does cut performance with your system you should be able to enable things again without too much hassle. Hopefully. (!) :)

This is the source code with the exact spellings, if that helps:

https://github.com/torvalds/linux/blob/3ca706c189db861b2ca2019a0901b94050ca49d8/drivers/ata/libata-core.c#L6155-L6160

(In reply to Justin Clift from comment #53)
> In theory, the kernel command line options allow turning `ncq` and `ncqtrim`
> both on, and off.
>
> From:
>
> https://www.kernel.org/doc/html/v5.15-rc1/admin-guide/kernel-parameters.html
>
> ```
> * [no]ncq: Turn on or off NCQ.
> * [no]ncqtrim: Turn off queued DSM TRIM.
> ```
>
> So, if the change does cut performance with your system you should be able
> to enable things again without too much hassle. Hopefully. (!) :)

Thanks, that actually helps. I've been looking into other drives to replace my 860 Pro with. Since I have 2 almost identical systems, I've been thinking of putting my drive into another setup (that uses Windows) and buying myself new drives as I wanted to expand anyway.

(In reply to Gurenko Alex from comment #52)
> Probably a little late at this point, but I'm trying to understand something
> here.
> > The issue seems to manifests itself on 8{6,7}0 EVO models (probably EVQ), > but there are also Pro models that seems to work just fine, based on random > comments from people commenting the news about this patch. Myself included. > I'm using Samsung 860 Pro with X570 chipset for a year now with zero issues > so far. So now this patch will unconditionally cut performance on affected > and not-affected devices, is that right? There are 2 parts to the patch: 1. Disable queued-trim on all Samsung 860 + 870 drives, this will impact your setup too, but this only impact trims, NCQ is otherwise still fully used so the performance impact of this typically is negligible, especially since most distro-s don't use continues trim to begin with. Chances are you have not seen any issues because of this. 2. Completely disable NCQ when a Samsung 860 / 870 drive is used connected to a SATA controller with an ATI PCI-vendor-id. Your X570 has an AMD PCI-vendor-id, so you are not impacted by this change. Also note that several people have actually reported issues with queued-trims in combination with the 860 Pro, IOW the 860 Pro really also needs 1. (In reply to Hans de Goede from comment #56) > (In reply to Gurenko Alex from comment #52) > > Probably a little late at this point, but I'm trying to understand > something > > here. > > > > The issue seems to manifests itself on 8{6,7}0 EVO models (probably EVQ), > > but there are also Pro models that seems to work just fine, based on random > > comments from people commenting the news about this patch. Myself included. > > I'm using Samsung 860 Pro with X570 chipset for a year now with zero issues > > so far. So now this patch will unconditionally cut performance on affected > > and not-affected devices, is that right? > > There are 2 parts to the patch: > > 1. 
Disable queued-trim on all Samsung 860 + 870 drives, this will impact > your setup too, but this only impact trims, NCQ is otherwise still fully > used so the performance impact of this typically is negligible, especially > since most distro-s don't use continues trim to begin with. Chances are you > have not seen any issues because of this. > > 2. Completely disable NCQ when a Samsung 860 / 870 drive is used connected > to a SATA controller with an ATI PCI-vendor-id. Your X570 has an AMD > PCI-vendor-id, so you are not impacted by this change. > > Also note that several people have actually reported issues with > queued-trims in combination with the 860 Pro, IOW the 860 Pro really also > needs 1. Thanks a lot for a clear explanation. I think I've got worried by the kernel mailing list reference: "Note that with AMD SATA controllers users are reporting even worse issues". I'm observing this issue with a Samsung 860 EVO on a new Minisforum UM560. The SSD works fine with the same kernel on another machine, so I assume the problem is not the SSD itself. System info: **** System: Kernel: 5.19.7 arch: x86_64 bits: 64 Machine: Type: Desktop System: BESSTAR TECH product: UM560 v: N/A serial: N/A Mobo: BESSTAR TECH model: F6BFC serial: N/A UEFI: American Megatrends LLC. 
v: 5.19 date: 07/01/2022 CPU: Info: 6-core model: AMD Ryzen 5 5625U with Radeon Graphics bits: 64 type: MT MCP cache: L2: 3 MiB Speed (MHz): avg: 1658 min/max: 1600/4387 cores: 1: 1600 2: 1600 3: 2300 4: 1600 5: 1600 6: 1600 7: 1600 8: 1600 9: 1600 10: 1600 11: 1600 12: 1600 Graphics: Device-1: AMD Barcelo driver: amdgpu v: kernel Drives: Local Storage: total: 4.09 TiB used: 1.04 TiB (25.4%) ID-1: /dev/nvme0n1 vendor: Seagate model: XPG SPECTRIX S40G size: 3.64 TiB ID-2: /dev/sda vendor: Samsung model: SSD 860 EVO 500GB size: 465.76 GiB Partition: ID-1: / size: 365.29 GiB used: 232.34 GiB (63.6%) fs: btrfs dev: /dev/sda2 ID-2: /boot size: 511 MiB used: 192.6 MiB (37.7%) fs: vfat dev: /dev/sda1 ID-3: /home size: 365.29 GiB used: 232.34 GiB (63.6%) fs: btrfs dev: /dev/sda2 ID-4: /var size: 365.29 GiB used: 232.34 GiB (63.6%) fs: btrfs dev: /dev/sda2 **** Relevant kernel logs: **** [ 0.000000] nuc kernel: Linux version 5.19.5-arch1-1 (linux@archlinux) (gcc (GCC) 12.2.0, GNU ld (GNU Binutils) 2.39.0) #1 SMP PREEMPT_DYNAMIC Mon, 29 Aug 2022 15:51:05 +0000 [ 0.000000] nuc kernel: Command line: initrd=\amd-ucode.img initrd=\initramfs-linux.img root=UUID=e7088636-a53d-4132-944a-caf17fe426d7 rw rootflags=subvol=ROOT [ 0.000000] nuc kernel: efi: EFI v2.70 by American Megatrends [ 0.000000] nuc kernel: efi: ACPI=0xcc97e000 ACPI 2.0=0xcc97e014 TPMFinalLog=0xcc94a000 SMBIOS=0xcd026000 SMBIOS 3.0=0xcd025000 MEMATTR=0xcb309018 ESRT=0xcb2fa698 RNG=0xcd062b18 TPMEventLog=0xc8a8f018 [ 0.000000] nuc kernel: efi: seeding entropy pool [ 0.314920] nuc kernel: pci 0000:05:00.0: [1022:7901] type 00 class 0x010601 [ 0.314948] nuc kernel: pci 0000:05:00.0: reg 0x24: [mem 0xfce01000-0xfce017ff] [ 0.314955] nuc kernel: pci 0000:05:00.0: enabling Extended Tags [ 0.315013] nuc kernel: pci 0000:05:00.0: 126.016 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x16 link at 0000:00:08.2 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link) [ 0.315045] nuc kernel: pci 0000:05:00.1: 
[1022:7901] type 00 class 0x010601 [ 0.315072] nuc kernel: pci 0000:05:00.1: reg 0x24: [mem 0xfce00000-0xfce007ff] [ 0.315079] nuc kernel: pci 0000:05:00.1: enabling Extended Tags [ 0.317166] nuc kernel: libata version 3.00 loaded. [ 0.334166] nuc kernel: pci 0000:05:00.0: Adding to iommu group 5 [ 0.334168] nuc kernel: pci 0000:05:00.1: Adding to iommu group 5 [ 0.361332] nuc kernel: ahci 0000:05:00.0: version 3.0 [ 0.361433] nuc kernel: ahci 0000:05:00.0: AHCI 0001.0301 32 slots 1 ports 6 Gbps 0x2 impl SATA mode [ 0.361435] nuc kernel: ahci 0000:05:00.0: flags: 64bit ncq sntf ilck pm led clo only pmp fbs pio slum part [ 0.361549] nuc kernel: scsi host0: ahci [ 0.361613] nuc kernel: scsi host1: ahci [ 0.361628] nuc kernel: ata1: DUMMY [ 0.361631] nuc kernel: ata2: SATA max UDMA/133 abar m2048@0xfce01000 port 0xfce01180 irq 33 [ 0.361738] nuc kernel: ahci 0000:05:00.1: AHCI 0001.0301 32 slots 1 ports 6 Gbps 0x1 impl SATA mode [ 0.361739] nuc kernel: ahci 0000:05:00.1: flags: 64bit ncq sntf ilck pm led clo only pmp fbs pio slum part [ 0.361825] nuc kernel: scsi host2: ahci [ 0.361849] nuc kernel: ata3: SATA max UDMA/133 abar m2048@0xfce00000 port 0xfce00100 irq 35 [ 0.678825] nuc kernel: ata3: SATA link down (SStatus 0 SControl 300) [ 0.842165] nuc kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [ 0.842746] nuc kernel: ata2.00: supports DRM functions and may not be fully accessible [ 0.842753] nuc kernel: ata2.00: ATA-11: Samsung SSD 860 EVO 500GB, RVT01B6Q, max UDMA/133 [ 0.843993] nuc kernel: ata2.00: 976773168 sectors, multi 1: LBA48 NCQ (depth 32), AA [ 0.848290] nuc kernel: ata2.00: Features: Trust Dev-Sleep NCQ-sndrcv [ 0.848975] nuc kernel: ata2.00: supports DRM functions and may not be fully accessible [ 0.855193] nuc kernel: ata2.00: configured for UDMA/133 [ 0.865886] nuc kernel: scsi 1:0:0:0: Direct-Access ATA Samsung SSD 860 1B6Q PQ: 0 ANSI: 5 [ 0.866242] nuc kernel: ata2.00: Enabling discard_zeroes_data [ 0.866260] nuc kernel: sd 1:0:0:0: 
[sda] 976773168 512-byte logical blocks: (500 GB/466 GiB) [ 0.866273] nuc kernel: sd 1:0:0:0: [sda] Write Protect is off [ 0.866276] nuc kernel: sd 1:0:0:0: [sda] Mode Sense: 00 3a 00 00 [ 0.866292] nuc kernel: sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA [ 0.866325] nuc kernel: sd 1:0:0:0: [sda] Preferred minimum I/O size 512 bytes [ 0.866534] nuc kernel: ata2.00: Enabling discard_zeroes_data [ 0.886989] nuc kernel: sd 1:0:0:0: [sda] supports TCG Opal [ 0.886996] nuc kernel: sd 1:0:0:0: [sda] Attached SCSI disk [ 1.956802] nuc kernel: sd 1:0:0:0: Attached scsi generic sg0 type 0 [ 5.610398] nuc kernel: ata2.00: exception Emask 0x10 SAct 0xc0000 SErr 0x4c0000 action 0x6 frozen [ 5.610407] nuc kernel: ata2.00: irq_stat 0x08000000, interface fatal error [ 5.610411] nuc kernel: ata2: SError: { CommWake 10B8B Handshk } [ 5.610416] nuc kernel: ata2.00: failed command: WRITE FPDMA QUEUED [ 5.610417] nuc kernel: ata2.00: cmd 61/20:90:60:dc:d3/00:00:1b:00:00/40 tag 18 ncq dma 16384 out [ 5.610424] nuc kernel: ata2.00: status: { DRDY } [ 5.610426] nuc kernel: ata2.00: failed command: WRITE FPDMA QUEUED [ 5.610427] nuc kernel: ata2.00: cmd 61/20:98:60:dc:b3/00:00:1b:00:00/40 tag 19 ncq dma 16384 out [ 5.610432] nuc kernel: ata2.00: status: { DRDY } [ 5.610436] nuc kernel: ata2: hard resetting link [ 6.080138] nuc kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [ 6.080502] nuc kernel: ata2.00: supports DRM functions and may not be fully accessible [ 6.084997] nuc kernel: ata2.00: supports DRM functions and may not be fully accessible [ 6.089284] nuc kernel: ata2.00: configured for UDMA/133 [ 6.099448] nuc kernel: ata2: EH complete [ 6.099646] nuc kernel: ata2.00: Enabling discard_zeroes_data [ 6.310438] nuc kernel: ata2.00: exception Emask 0x12 SAct 0xf9000 SErr 0x500 action 0x6 frozen [ 6.310451] nuc kernel: ata2.00: irq_stat 0x08000000, interface fatal error [ 6.310455] nuc kernel: ata2: SError: { UnrecovData Proto } 
[ 6.310461] nuc kernel: ata2.00: failed command: WRITE FPDMA QUEUED [ 6.310464] nuc kernel: ata2.00: cmd 61/08:60:00:4a:f1/00:00:0d:00:00/40 tag 12 ncq dma 4096 out [ 6.310473] nuc kernel: ata2.00: status: { DRDY } [ 6.310476] nuc kernel: ata2.00: failed command: READ FPDMA QUEUED [ 6.310478] nuc kernel: ata2.00: cmd 60/30:78:68:dd:e1/01:00:15:00:00/40 tag 15 ncq dma 155648 in [ 6.310485] nuc kernel: ata2.00: status: { DRDY } [ 6.310487] nuc kernel: ata2.00: failed command: READ FPDMA QUEUED [ 6.310489] nuc kernel: ata2.00: cmd 60/10:80:50:15:e5/01:00:15:00:00/40 tag 16 ncq dma 139264 in [ 6.310495] nuc kernel: ata2.00: status: { DRDY } [ 6.310497] nuc kernel: ata2.00: failed command: READ FPDMA QUEUED [ 6.310498] nuc kernel: ata2.00: cmd 60/80:88:08:d6:e8/00:00:15:00:00/40 tag 17 ncq dma 65536 in [ 6.310504] nuc kernel: ata2.00: status: { DRDY } [ 6.310506] nuc kernel: ata2.00: failed command: READ FPDMA QUEUED [ 6.310508] nuc kernel: ata2.00: cmd 60/d0:90:40:8b:eb/00:00:15:00:00/40 tag 18 ncq dma 106496 in [ 6.310514] nuc kernel: ata2.00: status: { DRDY } [ 6.310516] nuc kernel: ata2.00: failed command: READ FPDMA QUEUED [ 6.310517] nuc kernel: ata2.00: cmd 60/78:98:70:19:dd/00:00:15:00:00/40 tag 19 ncq dma 61440 in [ 6.310523] nuc kernel: ata2.00: status: { DRDY } [ 6.310527] nuc kernel: ata2: hard resetting link [ 6.780155] nuc kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [ 6.780575] nuc kernel: ata2.00: supports DRM functions and may not be fully accessible [ 6.786950] nuc kernel: ata2.00: supports DRM functions and may not be fully accessible [ 6.792020] nuc kernel: ata2.00: configured for UDMA/133 [ 6.802279] nuc kernel: ata2: EH complete [ 6.802490] nuc kernel: ata2.00: Enabling discard_zeroes_data [ 32.301996] nuc kernel: ata2.00: exception Emask 0x10 SAct 0x1ff000 SErr 0x4c0000 action 0x6 frozen [ 32.302006] nuc kernel: ata2.00: irq_stat 0x08000000, interface fatal error [ 32.302010] nuc kernel: ata2: SError: { CommWake 10B8B Handshk } [ 
32.302016] nuc kernel: ata2.00: failed command: WRITE FPDMA QUEUED [ 32.302019] nuc kernel: ata2.00: cmd 61/00:60:e8:a1:87/02:00:1a:00:00/40 tag 12 ncq dma 262144 out [ 32.302028] nuc kernel: ata2.00: status: { DRDY } [ 32.302031] nuc kernel: ata2.00: failed command: WRITE FPDMA QUEUED [ 32.302032] nuc kernel: ata2.00: cmd 61/00:68:e8:a5:87/02:00:1a:00:00/40 tag 13 ncq dma 262144 out [ 32.302039] nuc kernel: ata2.00: status: { DRDY } [ 32.302041] nuc kernel: ata2.00: failed command: WRITE FPDMA QUEUED [ 32.302043] nuc kernel: ata2.00: cmd 61/00:70:e8:b1:87/02:00:1a:00:00/40 tag 14 ncq dma 262144 out [ 32.302049] nuc kernel: ata2.00: status: { DRDY } [ 32.302051] nuc kernel: ata2.00: failed command: WRITE FPDMA QUEUED [ 32.302052] nuc kernel: ata2.00: cmd 61/00:78:b0:8d:88/02:00:1a:00:00/40 tag 15 ncq dma 262144 out [ 32.302058] nuc kernel: ata2.00: status: { DRDY } [ 32.302060] nuc kernel: ata2.00: failed command: WRITE FPDMA QUEUED [ 32.302061] nuc kernel: ata2.00: cmd 61/00:80:b0:9b:88/02:00:1a:00:00/40 tag 16 ncq dma 262144 out [ 32.302067] nuc kernel: ata2.00: status: { DRDY } [ 32.302069] nuc kernel: ata2.00: failed command: WRITE FPDMA QUEUED [ 32.302070] nuc kernel: ata2.00: cmd 61/00:88:b0:9f:88/02:00:1a:00:00/40 tag 17 ncq dma 262144 out [ 32.302076] nuc kernel: ata2.00: status: { DRDY } [ 32.302078] nuc kernel: ata2.00: failed command: WRITE FPDMA QUEUED [ 32.302079] nuc kernel: ata2.00: cmd 61/00:90:b0:a9:88/02:00:1a:00:00/40 tag 18 ncq dma 262144 out [ 32.302085] nuc kernel: ata2.00: status: { DRDY } [ 32.302087] nuc kernel: ata2.00: failed command: WRITE FPDMA QUEUED [ 32.302088] nuc kernel: ata2.00: cmd 61/00:98:b0:b1:88/02:00:1a:00:00/40 tag 19 ncq dma 262144 out [ 32.302094] nuc kernel: ata2.00: status: { DRDY } [ 32.302096] nuc kernel: ata2.00: failed command: READ FPDMA QUEUED [ 32.302097] nuc kernel: ata2.00: cmd 60/20:a0:c0:be:fe/00:00:1b:00:00/40 tag 20 ncq dma 16384 in [ 32.302103] nuc kernel: ata2.00: status: { DRDY } [ 32.302107] nuc kernel: 
ata2: hard resetting link [ 32.771952] nuc kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [ 32.772617] nuc kernel: ata2.00: supports DRM functions and may not be fully accessible [ 32.778439] nuc kernel: ata2.00: supports DRM functions and may not be fully accessible [ 32.782935] nuc kernel: ata2.00: configured for UDMA/133 [ 32.793210] nuc kernel: ata2: EH complete [ 32.793396] nuc kernel: ata2.00: Enabling discard_zeroes_data **** The issue still manifests with noncqtrim: **** [ 0.050394] nuc kernel: Kernel command line: initrd=\amd-ucode.img initrd=\initramfs-linux.img libata.force=2.00:noncqtrim root=... [ 0.377335] nuc kernel: ata2: SATA max UDMA/133 abar m2048@0xfce01000 port 0xfce01180 irq 33 [ 0.863834] nuc kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [ 0.864490] nuc kernel: ata2.00: supports DRM functions and may not be fully accessible [ 0.864497] nuc kernel: ata2.00: ATA-11: Samsung SSD 860 EVO 500GB, RVT01B6Q, max UDMA/133 [ 0.865849] nuc kernel: ata2.00: 976773168 sectors, multi 1: LBA48 NCQ (depth 32), AA [ 0.870957] nuc kernel: ata2.00: Features: Trust Dev-Sleep NCQ-sndrcv [ 0.871657] nuc kernel: ata2.00: supports DRM functions and may not be fully accessible [ 0.878088] nuc kernel: ata2.00: configured for UDMA/133 [ 0.891275] nuc kernel: ata2.00: Enabling discard_zeroes_data [ 0.891639] nuc kernel: ata2.00: Enabling discard_zeroes_data [ 7.580617] nuc kernel: ata2.00: exception Emask 0x10 SAct 0x400000 SErr 0x4c0000 action 0x6 frozen [ 7.581377] nuc kernel: ata2.00: irq_stat 0x08000000, interface fatal error [ 7.581614] nuc kernel: ata2: SError: { CommWake 10B8B Handshk } [ 7.581824] nuc kernel: ata2.00: failed command: WRITE FPDMA QUEUED [ 7.582026] nuc kernel: ata2.00: cmd 61/08:b0:80:08:10/00:00:00:00:00/40 tag 22 ncq dma 4096 out [ 7.582422] nuc kernel: ata2.00: status: { DRDY } [ 7.582612] nuc kernel: ata2: hard resetting link [ 8.053782] nuc kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [ 
8.054433] nuc kernel: ata2.00: supports DRM functions and may not be fully accessible [ 8.060727] nuc kernel: ata2.00: supports DRM functions and may not be fully accessible [ 8.065520] nuc kernel: ata2.00: configured for UDMA/133 [ 8.075755] nuc kernel: ata2: EH complete [ 8.075998] nuc kernel: ata2.00: Enabling discard_zeroes_data [ 11.926832] nuc kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [ 11.927174] nuc kernel: ata2.00: supports DRM functions and may not be fully accessible [ 11.931644] nuc kernel: ata2.00: supports DRM functions and may not be fully accessible [ 11.935943] nuc kernel: ata2.00: configured for UDMA/133 [ 11.946262] nuc kernel: ata2.00: Enabling discard_zeroes_data **** ... and with noncqtrim,noncq (and ahci.mobile_lpm_policy=1 for good measure) **** [ 0.050340] nuc kernel: Kernel command line: initrd=\amd-ucode.img initrd=\initramfs-linux.img libata.force=2.00:noncqtrim,2.00:noncq ahci.mobile_lpm_policy=1 root=... [ 0.436444] nuc kernel: ata2: SATA max UDMA/133 abar m2048@0xfce01000 port 0xfce01180 irq 33 [ 0.923236] nuc kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [ 0.924024] nuc kernel: ata2.00: FORCE: horkage modified (noncq) [ 0.924074] nuc kernel: ata2.00: supports DRM functions and may not be fully accessible [ 0.924076] nuc kernel: ata2.00: ATA-11: Samsung SSD 860 EVO 500GB, RVT01B6Q, max UDMA/133 [ 0.924079] nuc kernel: ata2.00: 976773168 sectors, multi 1: LBA48 NCQ (not used) [ 0.928747] nuc kernel: ata2.00: Features: Trust Dev-Sleep [ 0.929625] nuc kernel: ata2.00: supports DRM functions and may not be fully accessible [ 0.934390] nuc kernel: ata2.00: configured for UDMA/133 [ 0.947300] nuc kernel: ata2.00: Enabling discard_zeroes_data [ 0.947612] nuc kernel: ata2.00: Enabling discard_zeroes_data [ 6.783741] nuc kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x400000 action 0x6 frozen [ 6.784491] nuc kernel: ata2.00: irq_stat 0x08000000, interface fatal error [ 6.784909] nuc kernel: ata2: 
SError: { Handshk } [ 6.785118] nuc kernel: ata2.00: failed command: WRITE DMA [ 6.785317] nuc kernel: ata2.00: cmd ca/00:08:b0:45:91/00:00:00:00:00/e5 tag 19 dma 4096 out [ 6.785707] nuc kernel: ata2.00: status: { DRDY } [ 6.785896] nuc kernel: ata2: hard resetting link [ 7.257023] nuc kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [ 7.257384] nuc kernel: ata2.00: supports DRM functions and may not be fully accessible [ 7.261455] nuc kernel: ata2.00: supports DRM functions and may not be fully accessible [ 7.265234] nuc kernel: ata2.00: configured for UDMA/133 [ 7.275443] nuc kernel: ata2: EH complete [ 7.275791] nuc kernel: ata2.00: Enabling discard_zeroes_data [ 7.653439] nuc kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x400000 action 0x6 frozen [ 7.654185] nuc kernel: ata2.00: irq_stat 0x08000000, interface fatal error [ 7.654774] nuc kernel: ata2: SError: { Handshk } [ 7.655311] nuc kernel: ata2.00: failed command: WRITE DMA EXT [ 7.655834] nuc kernel: ata2.00: cmd 35/00:20:00:04:b4/00:00:1b:00:00/e0 tag 22 dma 16384 out [ 7.656979] nuc kernel: ata2.00: status: { DRDY } [ 7.657666] nuc kernel: ata2: hard resetting link [ 8.127025] nuc kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [ 8.127692] nuc kernel: ata2.00: supports DRM functions and may not be fully accessible [ 8.132495] nuc kernel: ata2.00: supports DRM functions and may not be fully accessible [ 8.136612] nuc kernel: ata2.00: configured for UDMA/133 [ 8.146823] nuc kernel: ata2: EH complete [ 8.149159] nuc kernel: ata2.00: Enabling discard_zeroes_data [ 8.426877] nuc kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x400000 action 0x6 frozen [ 8.428829] nuc kernel: ata2.00: irq_stat 0x08000000, interface fatal error [ 8.430147] nuc kernel: ata2: SError: { Handshk } [ 8.430851] nuc kernel: ata2.00: failed command: WRITE DMA EXT [ 8.431432] nuc kernel: ata2.00: cmd 35/00:20:20:b1:b4/00:00:1b:00:00/e0 tag 30 dma 16384 out [ 8.431965] nuc kernel: ata2.00: 
status: { DRDY } [ 8.432176] nuc kernel: ata2: hard resetting link [ 8.900241] nuc kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [ 8.900629] nuc kernel: ata2.00: supports DRM functions and may not be fully accessible [ 8.905896] nuc kernel: ata2.00: supports DRM functions and may not be fully accessible [ 8.909937] nuc kernel: ata2.00: configured for UDMA/133 [ 8.920106] nuc kernel: ata2: EH complete [ 8.922107] nuc kernel: ata2.00: Enabling discard_zeroes_data [ 9.320416] nuc kernel: ata2: limiting SATA link speed to 3.0 Gbps [ 9.320426] nuc kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x400000 action 0x6 frozen [ 9.322428] nuc kernel: ata2.00: irq_stat 0x08000000, interface fatal error [ 9.323496] nuc kernel: ata2: SError: { Handshk } [ 9.324411] nuc kernel: ata2.00: failed command: WRITE DMA [ 9.325107] nuc kernel: ata2.00: cmd ca/00:08:f8:79:3e/00:00:00:00:00/eb tag 6 dma 4096 out [ 9.325860] nuc kernel: ata2.00: status: { DRDY } [ 9.326399] nuc kernel: ata2: hard resetting link [ 9.793803] nuc kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 320) [ 9.794516] nuc kernel: ata2.00: supports DRM functions and may not be fully accessible [ 9.799595] nuc kernel: ata2.00: supports DRM functions and may not be fully accessible [ 9.803621] nuc kernel: ata2.00: configured for UDMA/133 [ 9.813823] nuc kernel: ata2: EH complete [ 9.814452] nuc kernel: ata2.00: Enabling discard_zeroes_data **** I'm now forcing the link speed to 3.0Gbps using libata.force=2:3.0G,2:noncqtrim and the issue has not yet manifested again. Is this something libata should manage automatically? (In reply to Tom Crossland from comment #58) > I'm observing this issue with a Samsung 860 EVO on a new Minisforum UM560. > The SSD works fine with the same kernel on another machine, so I assume the > problem is not the SSD itself. > > I'm now forcing the link speed to 3.0Gbps using > libata.force=2:3.0G,2:noncqtrim and the issue has not yet manifested again. 
Right, so I see that the Minisforum UM560 is quite a small form factor device. Typically these devices use some custom SATA connection to 2.5" drives. Assuming you are indeed using a 2.5" Samsung 860 EVO and not an M.2 version, I think the likely cause here is that special custom SATA connection. If you did not use that from day 1 there might be dust in there, so you could try reseating the connector.

Note it might also just be that that special cable, in combination with the Samsung 860 EVO, which is known to be picky about SATA cable signal issues, just does not work reliably at 6.0Gbps.

> Is this something libata should manage automatically?

No, this is a very specific problem related to your specific setup and not a generic problem with Ryzen systems, nor with the Samsung 860 EVO.

(In reply to Hans de Goede from comment #59)
> Right, so I see that the Minisforum UM560 is quite a small form factor
> device. Typically these devices use some custom SATA connection to 2.5"
> drives. Assuming you are indeed using a 2.5" Samsung 860 EVO and not an M.2
> version, I think the likely cause here is that special custom SATA
> connection. If you did not use that from day 1 there might be dust in there,
> so you could try reseating the connector.
> Note it might also just be that that special cable, in combination with the
> Samsung 860 EVO, which is known to be picky about SATA cable signal issues,
> just does not work reliably at 6.0Gbps.

Yes, the UM560 is a mini PC and the SSD is the 2.5" model. It works fine at 6Gbps in a different mini PC with an even smaller form factor (Intel NUC715BNK), but that has a different SATA controller (Intel i5-7250U SoC) and better build quality. I agree there may be an issue with the connection in the UM560; the cable looks a bit flimsy and was fiddly to connect to the board. But it's out of the box and free of dust. I'll try reseating the connector to see if I can use 6.0Gbps reliably.

> > Is this something libata should manage automatically?
>
> No, this is a very specific problem related to your specific setup and not a
> generic problem with Ryzen systems, nor with the Samsung 860 EVO.

Ok, understood, thanks for the helpful response and your time.

Just as another data point, I have a Samsung SSD 860 EVO connected to an Intel SATA controller, and the device/SATA link will randomly have failures unless I use the libata.force=3.0Gbps kernel flag. Based on various reports about the Samsung 850 and 860 series, I first tried to limit NCQ depth. (I tried values 1 and 6 for months; both reduced performance a lot but couldn't avoid all problems. Both seemed to reduce the number of problems, though. I would guess the reduction was actually caused by the reduced amount of data over the SATA connection because of higher latency for commands.)

With libata.force=3.0Gbps I've yet to see the error, so my current guess is that either the hardware for this chipset is broken or the Linux kernel has some SATA-specific race condition that only happens under very high load. I'm pretty sure I never saw this issue on this very same hardware with kernel 4.15.x, which I used to run for a long time. I'm currently running kernel 5.4.x distributed by Canonical/Ubuntu.

The reason I'm considering the possibility of a race condition in Linux is that I've seen similar problems on multiple production servers I maintain. Those servers have zero common parts (some have AMD CPUs, some have Intel CPUs, some have Samsung SSDs, some have SSDs made by other manufacturers), and yet applying the libata.force=3.0Gbps kernel flag has made all those systems stable. Those servers are running Linux kernel 5.11.x or 5.15.x.
Here's the controller I'm currently still running, which also needs libata.force=3.0Gbps to be stable with the Samsung SSD 860 EVO (lspci -vvvnn):

00:1f.2 SATA controller [0106]: Intel Corporation 7 Series/C210 Series Chipset Family 6-port SATA Controller [AHCI mode] [8086:1e02] (rev 04) (prog-if 01 [AHCI 1.0])
	Subsystem: ASUSTeK Computer Inc. P8 series motherboard [1043:84ca]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin B routed to IRQ 27
	Region 0: I/O ports at f0b0 [size=8]
	Region 1: I/O ports at f0a0 [size=4]
	Region 2: I/O ports at f090 [size=8]
	Region 3: I/O ports at f080 [size=4]
	Region 4: I/O ports at f060 [size=32]
	Region 5: Memory at f7c16000 (32-bit, non-prefetchable) [size=2K]
	Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee04004 Data: 4022
	Capabilities: [70] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [a8] SATA HBA v1.0 BAR4 Offset=00000004
	Capabilities: [b0] PCI Advanced Features
		AFCap: TP+ FLR+
		AFCtrl: FLR-
		AFStatus: TP-
	Kernel driver in use: ahci
	Kernel modules: ahci

FWIW, in my case, after updating the firmware of the SSD to the latest version, the issue has not reappeared (running at 6.0Gbps without any libata.force options).

These links may be useful:
https://semiconductor.samsung.com/consumer-storage/support/tools/
https://blog.quindorian.org/2021/05/firmware-update-samsung-ssd-in-linux.html/

(In reply to Mikko Rantalainen from comment #61)
> Just as another data point, I have a Samsung SSD 860 EVO connected to an
> Intel SATA controller, and the device/SATA link will randomly have failures
> unless I use the libata.force=3.0Gbps kernel flag. Based on various reports
> about the Samsung 850 and 860 series, I first tried to limit NCQ depth. (I
> tried values 1 and 6 for months; both reduced performance a lot but couldn't
> avoid all problems. Both seemed to reduce the number of problems, though. I
> would guess the reduction was actually caused by the reduced amount of data
> over the SATA connection because of higher latency for commands.)
>
> With libata.force=3.0Gbps I've yet to see the error, so my current guess is
> that either the hardware for this chipset is broken or the Linux kernel has
> some SATA-specific race condition that only happens under very high load. I'm
> pretty sure I never saw this issue on this very same hardware with kernel
> 4.15.x, which I used to run for a long time. I'm currently running kernel
> 5.4.x distributed by Canonical/Ubuntu.
>
> The reason I'm considering the possibility of a race condition in Linux is
> that I've seen similar problems on multiple production servers I maintain.
> Those servers have zero common parts (some have AMD CPUs, some have Intel
> CPUs, some have Samsung SSDs, some have SSDs made by other manufacturers),
> and yet applying the libata.force=3.0Gbps kernel flag has made all those
> systems stable. Those servers are running Linux kernel 5.11.x or 5.15.x.

Watch how developers carelessly enable and disable skipping trim for Samsung 860/870 between successive kernel versions; see my comment with sources here:
https://bugzilla.kernel.org/show_bug.cgi?id=201693#c116

This occurred for me on a Marvell HBA (several identical ones in the same machine) with kernel 5.10.0-21-amd64, even after trying all the command-line workarounds here, including noncqtrim, and running the Samsung firmware update tool on Windows.

06:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller (rev 11) (prog-if 01 [AHCI 1.0])
	Subsystem: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller
	...
	Subsystem: 1b4b:9215

I replaced the Samsung drives (Intel and Crucial are OK), but if anyone else winds up in this situation, this udev rule worked while I waited for delivery:

$ cat /etc/udev/rules.d/99-samsung-860.rules
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd*", ENV{DEVTYPE}=="disk", ENV{ID_MODEL}=="Samsung_SSD_860*", ATTR{device/queue_depth}="1"

A zfs scrub went from thousands of checksum errors to none.

I think we might need an ATA_HORKAGE_NO_NCQ_ON_MARVELL (like ATA_HORKAGE_NO_NCQ_ON_ATI) for data safety. That being said, I do not know whether any Marvell controllers other than the one(s) I have are affected, but 1b4b:9215 is a turkey with the 860. Those specific 860s are fine on motherboard AMD controllers, as expected.

> I'm now forcing the link speed to 3.0Gbps using
> libata.force=2:3.0G,2:noncqtrim and the issue has not yet manifested again.
This combination has worked for me with firmware RVT04B6Q. I'm no longer seeing the error messages from the kernel.
I have this SATA controller
09:00.2 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
and an 860 EVO SSD.
I was seeing the exact same error messages. When no libata command-line option was specified, or when noncqtrim was forced, I'd see errors like this:
failed command: WRITE FPDMA QUEUED
When noncq was specified, I'd see errors like this:
failed command: WRITE DMA EXT
The quoted 3.0G + noncqtrim combination, together with the updated firmware (previously RVT02B6Q), appears to work.
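When comparing boots with different libata.force options, as in the comment above, it can help to tally which ATA commands the error handler reports as failed. The following is only a rough sketch: the function name `count_failed_commands` is hypothetical, and it assumes the log text contains the "failed command: ..." lines in the form quoted in this thread. If non-queued commands such as WRITE DMA EXT still fail with noncq set, that would hint at a link-level problem rather than an NCQ-specific one, though the tally alone does not prove either.

```python
import re
from collections import Counter

# Matches the libata error-handler line that names the failed ATA command,
# e.g. "ata2.00: failed command: WRITE FPDMA QUEUED".
FAILED_RE = re.compile(r"failed command: (.+)")


def count_failed_commands(log_text: str) -> Counter:
    """Count 'failed command' lines per ATA command name (hypothetical helper)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED_RE.search(line)
        if match:
            counts[match.group(1).strip()] += 1
    return counts


if __name__ == "__main__":
    # Sample lines in the same shape as the logs quoted in this thread.
    sample = """\
ata2.00: failed command: WRITE FPDMA QUEUED
ata2.00: status: { DRDY }
ata2.00: failed command: WRITE DMA EXT
"""
    for command, count in count_failed_commands(sample).items():
        print(f"{command}: {count}")
```

In practice the input would be the saved output of dmesg or journalctl -k from each boot being compared.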
I get similar errors with my Samsung 870 EVO 1TB SATA (firmware version SVT01B6Q) and the Intel "8 Series/C220 Series Chipset Family 6-port SATA Controller 1" SATA controller (the machine is an HP ZBook 15 G2 laptop):

2023-09-10T11:50:59.858670+0200 zira kernel: ata1.00: exception Emask 0x0 SAct 0xc00 SErr 0x40000 action 0x0
2023-09-10T11:51:00.117366+0200 zira kernel: ata1.00: irq_stat 0x40000008
2023-09-10T11:51:00.117431+0200 zira kernel: ata1: SError: { CommWake }
2023-09-10T11:51:00.117474+0200 zira kernel: ata1.00: failed command: READ FPDMA QUEUED
2023-09-10T11:51:00.117511+0200 zira kernel: ata1.00: cmd 60/00:50:b8:12:c5/02:00:1f:00:00/40 tag 10 ncq dma 262144 in
         res 41/40:00:90:13:c5/00:02:1f:00:00/00 Emask 0x409 (media error) <F>
2023-09-10T11:51:00.117537+0200 zira kernel: ata1.00: status: { DRDY ERR }
2023-09-10T11:51:00.117560+0200 zira kernel: ata1.00: error: { UNC }
2023-09-10T11:51:00.117583+0200 zira kernel: ata1.00: supports DRM functions and may not be fully accessible
2023-09-10T11:51:00.117614+0200 zira kernel: ata1.00: supports DRM functions and may not be fully accessible
2023-09-10T11:51:00.117651+0200 zira kernel: ata1.00: configured for UDMA/133
2023-09-10T11:51:00.117681+0200 zira kernel: sd 0:0:0:0: [sda] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
2023-09-10T11:51:00.117953+0200 zira kernel: sd 0:0:0:0: [sda] tag#10 Sense Key : Medium Error [current]
2023-09-10T11:51:00.118165+0200 zira kernel: sd 0:0:0:0: [sda] tag#10 Add. Sense: Unrecovered read error - auto reallocate failed
2023-09-10T11:51:00.118366+0200 zira kernel: sd 0:0:0:0: [sda] tag#10 CDB: Read(10) 28 00 1f c5 12 b8 00 02 00 00
2023-09-10T11:51:00.118557+0200 zira kernel: I/O error, dev sda, sector 533009296 op 0x0:(READ) flags 0x80700 phys_seg 37 prio class 2
2023-09-10T11:51:00.118582+0200 zira kernel: ata1: EH complete
2023-09-10T11:51:00.118608+0200 zira kernel: ata1.00: Enabling discard_zeroes_data
[...]

(there's also "failed command: WRITE FPDMA QUEUED" later).

This ended up with the filesystem automatically remounted as read-only, and after the reboot, I needed a manual fsck.

"media error" and "UNC" indicate that your SSD encountered unreadable sectors within its flash. This is an error condition internal to the SSD, and has nothing to do with its communication with the controller, or with this bug. The other errors you see are likely just a side effect of the above.

Hi everybody, I have a question: does this fix work for drives attached to SAS controllers too? I ask because I have a few 870s attached to a SAS controller, and I see this in the log:

Sep 18 04:32:17 truenas kernel: mpt3sas_cm0: hba_port entry: 0000000061f45ae0, port: 12 is added to hba_port list
Sep 18 04:32:17 truenas kernel: mpt3sas_cm0: vphy entry: 00000000d0fa4886, port id: 0, phy:17 is added to port's vphys_list
Sep 18 04:32:17 truenas kernel: mpt3sas_cm0: host_add: handle(0x0001), sas_addr(0x5f4ee0805d7cd700), phys(21)
Sep 18 04:32:17 truenas kernel: mpt3sas_cm0: handle(0x17) sas_address(0x3f4ee0805d7cd700) port_type(0x1)
Sep 18 04:32:17 truenas kernel: scsi 0:0:0:0: Direct-Access ATA Samsung SSD 870 3B6Q PQ: 0 ANSI: 6
Sep 18 04:32:17 truenas kernel: scsi 0:0:0:0: SATA: handle(0x0017), sas_addr(0x3f4ee0805d7cd700), phy(0), device_name(0x5002538f4361c09b)
Sep 18 04:32:17 truenas kernel: scsi 0:0:0:0: enclosure logical id (0x3f4ee0805d112100), slot(8)
Sep 18 04:32:17 truenas kernel: scsi 0:0:0:0: enclosure level(0x0001), connector name( C1 )
Sep 18 04:32:17 truenas kernel: scsi 0:0:0:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Sep 18 04:32:17 truenas kernel: scsi 0:0:0:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
Sep 18 04:32:17 truenas kernel: end_device-0:0: add: handle(0x0017), sas_addr(0x3f4ee0805d7cd700)
Sep 18 04:32:17 truenas kernel: mpt3sas_cm0: handle(0x1b) sas_address(0x3f4ee0805d7cd701) port_type(0x1)

and there is absolutely no mention of disabling NCQ trim, even if I add libata.force=noncqtrim to the kernel parameters.

Interesting. Doing a bit of looking just now (am not a kernel hacker, so this is very much just winging it), the mpt3sas kernel driver code looks to be here:
https://github.com/torvalds/linux/tree/master/drivers/scsi/mpt3sas

Doing a search for NCQ in that directory does throw up a few hits:
https://github.com/search?q=repo%3Atorvalds%2Flinux+path%3A%2F%5Edrivers%5C%2Fscsi%5C%2Fmpt3sas%5C%2F%2F+ncq&type=code

Not super sure how relevant they are though. They might need to be specified as mpt3sas module load options rather than kernel command line parameters (no idea). :)
Created attachment 282579 [details]
dmesg of the errors occuring

I have a Samsung SSD 860 EVO mSATA 500GB SSD connected via an ASMedia ASM1062 Serial ATA Controller. It causes 20-30 second lockups on fstrim (which runs during bootup on my system), with messages such as:

[ 332.792044] ata14.00: exception Emask 0x0 SAct 0x3fffe SErr 0x0 action 0x6 frozen
[ 332.798271] ata14.00: failed command: SEND FPDMA QUEUED
[ 332.804499] ata14.00: cmd 64/01:08:00:00:00/00:00:00:00:00/a0 tag 1 ncq dma 512 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 332.817145] ata14.00: status: { DRDY }

After disabling queued TRIM via the included patch, the issue disappears.
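For reading exception lines like the one above: the SAct value is, as far as I understand it, a bitmask of the NCQ tags that were outstanding when the error was raised, one bit per tag. A minimal sketch of decoding it follows; the helper name `sact_tags` is made up for illustration, and the 32-tag limit is an assumption based on the "NCQ (depth 32)" lines in the logs in this thread.

```python
def sact_tags(sact: int, max_tags: int = 32) -> list[int]:
    """Return the NCQ tag numbers whose bits are set in an SAct bitmask.

    Hypothetical helper: assumes bit N of SAct corresponds to tag N.
    """
    return [tag for tag in range(max_tags) if (sact >> tag) & 1]


if __name__ == "__main__":
    # SAct values copied from exception lines quoted in this thread.
    for value in (0x3fffe, 0xc0000, 0x400000):
        print(hex(value), sact_tags(value))
```

This lines up with the logs elsewhere in this thread: an exception with SAct 0xc0000 is followed by failed commands with tag 18 and tag 19, and SAct 0x400000 by a single failed command with tag 22.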