Bug 201693

Summary: Samsung 860 EVO NCQ Issue with AMD SATA Controller
Product: IO/Storage
Reporter: Ryley Angus (ryleyjangus)
Component: Serial ATA
Assignee: Tejun Heo (tj)
Status: RESOLVED CODE_FIX
Severity: normal
CC: a+1009kernel, alejandro.donato, alexander, alexey.kv, amazon1, andreas.bugzilla.kernel.org, andrew, bestbeforejunefirst+kb, bloodjazman, dag, erwin.gaubitzer, forum, fweimer, greg, hardwareadictos, johnsimcall, jwrdegoede, kernel.bugzilla, kernel, klaus, nejrobbins, nospam.linux, nouveau, nx42768, ole, pizza, pjbrs, pmenzel+bugzilla.kernel.org, public-t.b, reg.kernelbugzilla.wad1w, rm+bko, roxmail, ryleyjangus, sitsofe, t50, ucelsanicin
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.19.1
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
  dmesg output
  lspci -vvv output
  smartctl status for the 860 EVO
  hdparm -I output
  signature.asc

Description Ryley Angus 2018-11-15 00:05:57 UTC
Created attachment 279441 [details]
dmesg output

Hi, I've recently purchased a 2TB Samsung 860 EVO SSD to replace an existing 850 EVO. There have been no other hardware changes to the affected system and I cloned my existing Linux installation (LUKS/ext4) to the new SSD.

I had no issues with the 850 EVO (I purchased it after the queued trim issue was mitigated), but I have immediately had problems with the 860 EVO. If the filesystem is trimmed (manually or automatically), dmesg immediately reports several "WRITE FPDMA QUEUED" errors before hard resetting the link. The only time these errors haven't occurred is when the amount of space trimmed (as reported by fstrim) is less than 15-20GB.

If I disable NCQ for the SSD, I can trim the drive without issue. I've also tried using "libata.force=noncqtrim" but this did not change the situation.
Comment 1 Ryley Angus 2018-11-15 00:06:54 UTC
Created attachment 279443 [details]
lspci -vvv output
Comment 2 Ryley Angus 2018-11-15 00:07:24 UTC
Created attachment 279445 [details]
smartctl status for the 860 EVO
Comment 3 Ryley Angus 2018-11-15 00:17:48 UTC
Created attachment 279447 [details]
hdparm -I output
Comment 4 Ryley Angus 2018-11-15 00:50:36 UTC
Samsung's support forum has several users experiencing similar issues as far back as June: https://eu.community.samsung.com/t5/Cameras-IT-Everything-Else/860-EVO-250GB-causing-freezes-on-AMD-system/td-p/575813 . It seems there may be a compatibility issue between some AMD SATA controllers and the 860 series. I'll try to test my SSD with an Intel SATA controller.
Comment 5 Ryley Angus 2018-11-15 22:26:24 UTC
After disabling discard/trim on the SSD's ext4 partition, I was able to consistently reproduce the freezing behaviour that was previously triggered by trimming the disk by copying a large (>50GB) file to the SSD. The freezing was not immediate, it occurred 2-3 minutes into the transfer.

If I disable NCQ, both trim and heavy data transfers work reliably (as far as I can tell).
Comment 6 Roman Mamedov 2018-11-28 18:03:07 UTC
I can confirm I face the same issue with a 500 GB 860 EVO just on dd'ing from another disk to it (i.e. full speed streaming write).

Also you can see UDMA CRC Error Count increasing in SMART.

There were no issues with more than a dozen other vendor SSDs that I tried on this controller so far, only this one.

Eventually the Samsung steps down to 1.5Gbps SATA link, and from then on starts working fine. Disabling NCQ does indeed help, but it hobbles the random IO performance immensely. As far as I know there is no solution, other than hopefully a firmware fix by Samsung. Until then, to prevent data loss, NCQ should be disabled unfortunately.
Comment 7 Solomon Peachy 2018-12-01 18:20:30 UTC
Seeing this on my system too, currently running a Fedora 4.19.2 kernel.

Model=Samsung SSD 860 EVO 1TB
FwRev=RVT01B6Q

MSI 970A-G46 motherboard, which has an AMD970+SB950 chipset.

I can provide more details if necessary

Samsung has not released any firmware updates for this device, and by all accounts they do not intend to, despite this problem affecting Windows systems as well.
Comment 8 Matt Whitlock 2019-01-26 03:40:24 UTC
For those affected by this issue, does downgrading to kernel 4.18.19 relieve the symptoms? I started seeing similar "FPDMA QUEUED" errors during heavy I/O to my Samsung SSD 860 Pro after I upgraded to the 4.19 kernel series. Downgrading to 4.18.19, my symptoms disappeared.

I am currently in the process of bisecting the kernel sources to find the commit that introduced the regression. I've been at it since mid-December, as it takes several days to gain confidence that a given commit is "good." (Usually it takes 2-3 days of uptime before I discover that a given commit is "bad.")

If downgrading to 4.18.19 does not resolve the issue for others in this report, then I am experiencing a different issue.
Comment 9 Solomon Peachy 2019-01-26 13:01:28 UTC
The forum thread referenced by comment 4 includes logs taken from a 4.15 kernel.

By all accounts this does not appear to be a "regression" introduced by a recent Linux kernel; instead it appears to be due to an incompatibility between AMD SATA controllers and the 860 EVO device firmware.

(And it's also broken/flaky on Windows too.  So, probably not Linux's fault..)
Comment 10 Sitsofe Wheeler 2019-08-03 14:48:52 UTC
I think I'm seeing this issue too (a Samsung 860 SSD triggers "WRITE FPDMA QUEUED" errors in the kernel log/dmesg under heavy I/O, causing terrible performance and unreliable operation unless NCQ is disabled). The SATA controller in the HP MicroServer N36L seeing the issue is an AMD SB7x0/SB8x0/SB9x0. The issue happens on both 4.15.0-43-generic and 4.18.0-13-generic kernels, and in my case I was using a simple fio job to make the issue occur:

fio --name=test --readonly --rw=randread --filename /dev/sdb --bs=32k \
    --ioengine=libaio --iodepth=32 --direct=1 --runtime=10m --time_based=1


Here are some more links that may (or may not) be the same thing:

https://github.com/zfsonlinux/zfs/issues/4873#issuecomment-449886669
https://eu.community.samsung.com/t5/Cameras-IT-Everything-Else/860-EVO-250GB-causing-freezes-on-AMD-system/td-p/575813
https://marc.info/?l=linux-block&m=154644276512949&w=2

This may also be linked to https://bugzilla.redhat.com/show_bug.cgi?id=1729678 and https://bugzilla.kernel.org/show_bug.cgi?id=203475 .

(CC'ing Jens)
Comment 11 Daniel Kenzelmann 2019-11-30 11:59:27 UTC
Same issue with Samsung SSD 860 EVO 1TB but with (newer?) RVT03B6Q firmware and the following controller:
Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] (rev 40) (prog-if 01 [AHCI 1.0])
	Subsystem: ASRock Incorporation SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]
	
Disabling NCQ completely seems to solve the issue (using libata.force=noncq).

Does anyone know if setting /sys/block/<device>/device/queue_depth to 1 has the same effect of solving this issue? (or is it something else when enabling ncq altogether that causes the issue?)
Comment 12 Roman Mamedov 2019-11-30 12:13:03 UTC
> Does anyone know if setting /sys/block/<device>/device/queue_depth to 1 has
> the same effect of solving this issue?

Yes it absolutely should. You don't have to disable NCQ for the entire system to solve this. It can only be an issue if you want to use this as your boot drive and can't find a way to set queue_depth=1 early enough during boot; then you can still hit the NCQ issue during the bootup process before it is set.
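
A minimal sketch of the per-device workaround discussed above, for anyone who would rather script it than echo into sysfs by hand. The function name set_queue_depth and the sysfs_root parameter are hypothetical (the latter exists only so the function can be tried against a fake directory tree), and the accepted range of 1 to 32 is an assumption based on the default depth mentioned in this thread. It must run as root on a real system.

```python
from pathlib import Path


def set_queue_depth(device, depth=1, sysfs_root="/sys"):
    """Write `depth` to <sysfs_root>/block/<device>/device/queue_depth.

    `device` is a block device name such as "sdb" (no /dev/ prefix).
    Returns the value read back from the attribute after writing.
    """
    # Assumption: valid depths are 1..32; a depth of 1 stops commands
    # from being queued on this one device.
    if not 1 <= depth <= 32:
        raise ValueError("queue depth must be between 1 and 32")
    attr = Path(sysfs_root) / "block" / device / "device" / "queue_depth"
    attr.write_text(f"{depth}\n")
    return int(attr.read_text().strip())
```

Calling set_queue_depth("sdb") would apply the workaround to /dev/sdb only. As noted above, this does not protect the early part of boot if the affected drive holds the root filesystem.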
Comment 13 Alexander Tsoy 2020-06-21 16:56:43 UTC
Confirming this issue for Samsung 883 DCT as well. Disabling NCQ works as a workaround:

$ cat /etc/udev/rules.d/99-disk.rules 
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd*", ENV{DEVTYPE}=="disk", ENV{ID_MODEL}=="Samsung_SSD_883_DCT_1.92TB", ATTR{device/queue_depth}="1"

You might also want to include the udev rule into initramfs. For dracut users:

$ cat /etc/dracut.conf.d/local.conf
install_optional_items+=" /etc/udev/rules.d/99-disk.rules "
Comment 14 Alexander Tsoy 2020-08-04 23:47:31 UTC
Update: still have several issues with device/queue_depth=1: occasional freezes and very long freezes after performing fstrim, probably due to queued trim still being enabled. So I ended up with a different workaround (see Documentation/admin-guide/kernel-parameters.txt for the libata.force option format):

$ grep -o "libata[^ ]*" /proc/cmdline 
libata.force=1:noncq,2:noncq

$ sudo dmesg | grep NCQ
[    4.438905] ata2.00: 3750748848 sectors, multi 16: LBA48 NCQ (not used)
[    4.443898] ata1.00: 3750748848 sectors, multi 16: LBA48 NCQ (not used)
[    4.445272] ata4.00: 19532873728 sectors, multi 0: LBA48 NCQ (depth 32), AA
[    4.446274] ata3.00: 19532873728 sectors, multi 0: LBA48 NCQ (depth 32), AA
Comment 15 Solomon Peachy 2020-08-17 15:19:54 UTC
I'd purchased a 3rd-party SATA controller and have been using the 860 EVO apparently problem-free for many months with ncq enabled... until this morning.

It was the first reboot in over 2 months, and the system crapped out badly on startup.  My guess is that Fedora tried to do a queued FSTRIM, leading to a large pile of DMA WRITE errors, resulting in xfs corruption bad enough that it failed to mount -- I had to run xfs_repair to get it to successfully mount the rootfs.

I forgot to disable NCQ after everything was fixed... and it got trashed to the point of needing xfs_repair _again_.

Meanwhile, Samsung still refuses to acknowledge there is a problem with their 860 EVO SSD firmware, much less release an update.
Comment 16 Solomon Peachy 2020-08-17 15:43:08 UTC
Oh, forgot to say my latest corruption was with Fedora's 5.7.12-200.fc32.x86_64 kernel, plugged into an ASMedia ASM1062 SATA controller.  The motherboard is the same as before, sporting an AMD970+SB950 chipset.
Comment 17 Dag Nygren 2020-08-22 08:31:33 UTC
Seeing a very similar thing happen with a completely different setup:

Drive: SAMSUNG 870 QVO 1TB
Controller: Intel Corporation 82801IBM/IEM

Just tried setting the queue_depth to 1 and so far so good. But the problem has been very intermittent. Cannot change cable as this is a laptop.
Comment 18 Andreas Elvers 2020-09-24 11:12:36 UTC
The NCQ problem also affects the Samsung server grade SSDs.

Drive: Samsung MZ7LH1T9HMLT
Controller: SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]

This issue seems to be very big. I wonder why Samsung and AMD can't come to a conclusion to help their customers.
Comment 19 Alejandro Donato 2020-11-18 20:33:28 UTC
+1

Kernel: 5.4.0-54-generic

SATA controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]

Samsung SSD 850 EVO 500GB (firmware EMT02B6Q)

On the ASMedia SATA II 6Gbps controller (now disabled): read errors and heavy freezes (even filesystem damage).

On the AMD controller, the link fails and eventually falls back to UDMA/100:PIO4:

----syslog------

[  248.159560] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  248.228187] ata3.00: configured for UDMA/133
[  312.724185] ata3: limiting SATA link speed to 1.5 Gbps
[  320.675869] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[  320.769960] ata3.00: configured for UDMA/133
[  392.987615] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[  393.069527] ata3.00: configured for UDMA/133
[  468.707516] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[  468.802562] ata3.00: configured for UDMA/133
[  542.007285] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[  542.069186] ata3.00: configured for UDMA/133
[  571.309052] hrtimer: interrupt took 19395 ns
[  606.873323] ata3.00: limiting speed to UDMA/100:PIO4
[  614.882962] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[  614.969015] ata3.00: configured for UDMA/100
[  688.073627] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[  688.143670] ata3.00: configured for UDMA/100

-------------

Performance is heavily degraded.

This is a CRITICAL failure. Please update the bug report.

Thanks!!!
Comment 20 Roman Elshin 2020-11-21 13:15:30 UTC
I have a Samsung 860 Pro with RVM02B6Q firmware, and it is incompatible not only with AMD AHCI. With an ASMedia ASM1061 it seems to work with queued TRIM disabled (ATA_HORKAGE_NO_NCQ_TRIM | ATA_HORKAGE_ZERO_AFTER_TRIM in libata-core.c).
With a Marvell 88SE9120 it requires libata.force=3.0G to work.
Interesting: is there any SATA3 PCIe x1 card where Samsung's 860* crap works flawlessly?
Comment 21 nejrobbins 2020-12-05 03:26:41 UTC
Also happening here. 860 EVO, latest firmware, ASRock 970M Pro3 motherboard. So AMD SATA chipset, 970 northbridge with AMD SB950 southbridge. Getting the same "FPDMA QUEUED" errors, and the drive falls back to SATA 1.5 Gbps. Disabling NCQ does solve the issue, but hurts performance.

Interestingly, on Windows with the AMD Sata Driver the OS freezes entirely. But with the Microsoft Sata Controller Driver (that AMD now recommends using), the system doesn't freeze anymore (but still gets the other issues without disabling NCQ). So maybe some type of sata firmware fix is possible?
Comment 22 Roman Mamedov 2020-12-05 09:15:08 UTC
> But with the Microsoft Sata Controller Driver (that AMD now recommends
> using), the system doesn't freeze anymore (but still gets the other issues
> without disabling NCQ).

Well, if it still has issues until NCQ is disabled, it only confirms that the issue can't really be solved by the driver. Also, did you check what performance you get with it, and how it compares with using AMD's driver? IIRC the MS driver was quite slow, so it might be a bit more "reliable" only due to being suboptimal and not pushing the hardware nearly as hard.

One other thing I should mention, as I see people reporting issues on non-AMD controllers as well;  to clarify my experience so far, on the AMD chipset controller I have to disable NCQ entirely to get it working; but on the ASMedia controllers it seems to be enough to disable just the queued TRIM (so the same as Roman Elshin reports above).
https://bugzilla.kernel.org/show_bug.cgi?id=203475
Try that, maybe you can regain some of the lost performance and still get a reliable operation out of the device.
Comment 23 nejrobbins 2020-12-05 16:19:13 UTC
With the AMD driver I couldn't even run a benchmark, as the system would just freeze and I'd have to force restart. Both with CrystalDiskMark and Samsung Magician.

However, on the Microsoft driver I would still get around 500MB/s, give or take, if I remember correctly, which is the rated spec for the drive. I'm on Linux now, but I haven't tested speeds with NCQ disabled.

I'll try disabling queued TRIM only and seeing what happens. Thx.
Comment 24 Sitsofe Wheeler 2020-12-08 19:19:13 UTC
Can people who are seeing this report which model (e.g. SSD 860 EVO 500GB), firmware (e.g. RVT01B6Q) and PCI card (e.g. AMD SB7x0/SB8x0/SB9x0) they have? In my case smartctl -a <dev> reports that I'm on RVT01B6Q firmware, which is apparently behind the latest (RVT04B6Q) listed on https://www.samsung.com/semiconductor/minisite/ssd/download/tools/ . If folks are feeling brave and can take the risk, can they report whether the issue is still reproduced on the latest firmware?
Comment 25 Sitsofe Wheeler 2020-12-08 19:31:53 UTC
Hmm poking about the web it doesn't look like firmware updates are solving this issue (see https://community.amd.com/t5/server-gurus-discussions/issues-with-samsung-ssds-on-epyc/td-p/402737 ).
Comment 26 nejrobbins 2020-12-08 19:57:06 UTC
Another person experiencing the issue with an Intel NUC: https://unix.stackexchange.com/questions/623238/root-causes-for-failed-command-write-fpdma-queued.

But @Sitsofe Wheeler I'm on RVT04B6Q and still experiencing the issue, so it doesn't seem to be firmware as you said. I have AMD SB950.
Comment 27 Sitsofe Wheeler 2020-12-08 23:05:21 UTC
I flashed my 860 EVO to RVT04B6Q and the issue is still present, which confirms nejrobbins' message above.
Comment 28 Roy 2020-12-28 00:44:08 UTC
Wish I had found this bug report before getting an SSD to provide a much-needed performance boost to my ageing PC.

* Specs:
- Asus M5A97 Evo Rev2.0, latest UEFI,
- AMD FX-6300,
- Samsung EVO 860, firmware RVT04B6Q,
- Kernel 5.9.16

lspci -vv (filtered SATA controller only, double-checked with lshw)
00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] (rev 40) (prog-if 01 [AHCI 1.0])
	Subsystem: ASUSTeK Computer Inc. M5A99X EVO (R1.0) SB950
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 32
	Interrupt: pin A routed to IRQ 19
	NUMA node: 0
	Region 0: I/O ports at f040 [size=8]
	Region 1: I/O ports at f030 [size=4]
	Region 2: I/O ports at f020 [size=8]
	Region 3: I/O ports at f010 [size=4]
	Region 4: I/O ports at f000 [size=16]
	Region 5: Memory at fe60b000 (32-bit, non-prefetchable) [size=1K]
	Capabilities: <access denied>
	Kernel driver in use: ahci


* Symptoms:
dmesg was riddled with these messages:

dec 27 14:08:24 tuvok kernel: ata1: log page 10h reported inactive tag 17
dec 27 14:08:24 tuvok kernel: ata1.00: exception Emask 0x1 SAct 0x7ffc003f SErr 0x0 action 0x0
dec 27 14:08:24 tuvok kernel: ata1.00: irq_stat 0x40000008
dec 27 14:08:24 tuvok kernel: ata1.00: failed command: WRITE FPDMA QUEUED
dec 27 14:08:24 tuvok kernel: ata1.00: cmd 61/08:00:58:2d:57/00:00:2d:00:00/40 tag 0 ncq dma 4096 out
                                       res 40/00:28:f0:2d:57/00:00:2d:00:00/40 Emask 0x1 (device error)
dec 27 14:08:24 tuvok kernel: ata1.00: status: { DRDY }
dec 27 14:08:24 tuvok kernel: ata1.00: failed command: WRITE FPDMA QUEUED
dec 27 14:08:24 tuvok kernel: ata1.00: cmd 61/08:08:68:2d:57/00:00:2d:00:00/40 tag 1 ncq dma 4096 out
                                       res 40/00:28:f0:2d:57/00:00:2d:00:00/40 Emask 0x1 (device error)
dec 27 14:08:24 tuvok kernel: ata1.00: status: { DRDY }
dec 27 14:08:24 tuvok kernel: ata1.00: failed command: WRITE FPDMA QUEUED
dec 27 14:08:24 tuvok kernel: ata1.00: cmd 61/08:10:98:2d:57/00:00:2d:00:00/40 tag 2 ncq dma 4096 out
                                       res 40/00:28:f0:2d:57/00:00:2d:00:00/40 Emask 0x1 (device error)
dec 27 14:08:24 tuvok kernel: ata1.00: status: { DRDY }
dec 27 14:08:24 tuvok kernel: ata1.00: failed command: WRITE FPDMA QUEUED
dec 27 14:08:24 tuvok kernel: ata1.00: cmd 61/08:18:a8:2d:57/00:00:2d:00:00/40 tag 3 ncq dma 4096 out
                                       res 40/00:28:f0:2d:57/00:00:2d:00:00/40 Emask 0x1 (device error)

* Work-around:
Booting with libata.force=noncq works around this issue.
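
For anyone reading excerpts like the one above: the SAct value in the "exception" line is a bitmask with one bit per outstanding NCQ tag. A small sketch for decoding it follows; the function name active_tags is hypothetical and not part of any kernel tooling.

```python
def active_tags(sact):
    """Return the NCQ tag numbers whose bits are set in a SAct bitmask."""
    # One bit per tag; an ATA NCQ queue has at most 32 tags (0..31).
    return [tag for tag in range(32) if (sact >> tag) & 1]


# SAct value taken from the dmesg excerpt in this comment.
tags = active_tags(0x7FFC003F)
print(len(tags), "commands were outstanding; tags:", tags)
```

Tags 0 through 3, which the log lists as failed commands, are among the set bits, so the failures hit a queue that was already well filled.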
Comment 29 Gregory P. Smith 2021-02-23 06:18:25 UTC
A new Samsung 870 EVO 1TB SSD runs into this issue on Linux any time a DISCARD is sent. :(

Ex: Removing an LVM snapshot after doing a backup because of the questionable behavior I've been observing... bam:

Feb 22 11:28:32 zoonaut kernel: [130904.469448] ata4.00: qc timeout (cmd 0x47)
Feb 22 11:28:32 zoonaut kernel: [130904.470626] ata4.00: READ LOG DMA EXT failed, trying PIO
Feb 22 11:28:32 zoonaut kernel: [130904.470633] ata4: failed to read log page 10h (errno=-5)
Feb 22 11:28:32 zoonaut kernel: [130904.470700] ata4.00: exception Emask 0x1 SAct 0x40 SErr 0x0 action 0x6 frozen
Feb 22 11:28:32 zoonaut kernel: [130904.470748] ata4.00: irq_stat 0x40000008
Feb 22 11:28:32 zoonaut kernel: [130904.470779] ata4.00: failed command: SEND FPDMA QUEUED
Feb 22 11:28:32 zoonaut kernel: [130904.470824] ata4.00: cmd 64/01:30:00:00:00/00:00:00:00:00/a0 tag 6 ncq dma 512 out
Feb 22 11:28:32 zoonaut kernel: [130904.470824]          res 50/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x1 (device error)
Feb 22 11:28:32 zoonaut kernel: [130904.470932] ata4.00: status: { DRDY }
Feb 22 11:28:32 zoonaut kernel: [130904.470965] ata4: hard resetting link
Feb 22 11:28:32 zoonaut kernel: [130904.789997] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb 22 11:28:32 zoonaut kernel: [130904.794412] ata4.00: supports DRM functions and may not be fully accessible
Feb 22 11:28:32 zoonaut kernel: [130904.797441] ata4.00: supports DRM functions and may not be fully accessible
Feb 22 11:28:32 zoonaut kernel: [130904.799759] ata4.00: configured for UDMA/133
Feb 22 11:28:32 zoonaut kernel: [130904.799771] ata4.00: device reported invalid CHS sector 0
Feb 22 11:28:32 zoonaut kernel: [130904.799792] sd 3:0:0:0: [sdc] tag#6 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 22 11:28:32 zoonaut kernel: [130904.799799] sd 3:0:0:0: [sdc] tag#6 Sense Key : Illegal Request [current] 
Feb 22 11:28:32 zoonaut kernel: [130904.799803] sd 3:0:0:0: [sdc] tag#6 Add. Sense: Unaligned write command
Feb 22 11:28:32 zoonaut kernel: [130904.799809] sd 3:0:0:0: [sdc] tag#6 CDB: Write same(16) 93 08 00 00 00 00 69 40 10 00 00 20 00 00 00 00
Feb 22 11:28:32 zoonaut kernel: [130904.799815] blk_update_request: I/O error, dev sdc, sector 1765806080 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
Feb 22 11:28:32 zoonaut kernel: [130904.799969] ata4: EH complete
Feb 22 11:28:32 zoonaut kernel: [130904.800165] ata4.00: Enabling discard_zeroes_data
...
Feb 22 18:40:55 zoonaut kernel: [156847.619004] ata4.00: exception Emask 0x0 SAct 0xf000 SErr 0x0 action 0x0
Feb 22 18:40:55 zoonaut kernel: [156847.619075] ata4.00: irq_stat 0x40000008
Feb 22 18:40:55 zoonaut kernel: [156847.619106] ata4.00: failed command: SEND FPDMA QUEUED
Feb 22 18:40:55 zoonaut kernel: [156847.619148] ata4.00: cmd 64/01:60:00:00:00/00:00:00:00:00/a0 tag 12 ncq dma 512 out
Feb 22 18:40:55 zoonaut kernel: [156847.619148]          res 41/04:01:00:00:00/00:00:00:00:00/00 Emask 0x401 (device error) <F>
Feb 22 18:40:55 zoonaut kernel: [156847.619247] ata4.00: status: { DRDY ERR }
Feb 22 18:40:55 zoonaut kernel: [156847.619275] ata4.00: error: { ABRT }
Feb 22 18:40:55 zoonaut kernel: [156847.619792] ata4.00: supports DRM functions and may not be fully accessible
Feb 22 18:40:55 zoonaut kernel: [156847.622381] ata4.00: supports DRM functions and may not be fully accessible
Feb 22 18:40:55 zoonaut kernel: [156847.624512] ata4.00: configured for UDMA/133
Feb 22 18:40:55 zoonaut kernel: [156847.624544] ata4: EH complete
Feb 22 18:40:55 zoonaut kernel: [156847.624714] ata4.00: Enabling discard_zeroes_data
Feb 22 18:40:55 zoonaut kernel: [156847.734820] ata4.00: exception Emask 0x0 SAct 0x7e SErr 0x0 action 0x0
Feb 22 18:40:55 zoonaut kernel: [156847.734890] ata4.00: irq_stat 0x40000008
Feb 22 18:40:55 zoonaut kernel: [156847.734922] ata4.00: failed command: SEND FPDMA QUEUED
Feb 22 18:40:55 zoonaut kernel: [156847.734963] ata4.00: cmd 64/01:08:00:00:00/00:00:00:00:00/a0 tag 1 ncq dma 512 out
Feb 22 18:40:55 zoonaut kernel: [156847.734963]          res 41/04:01:00:00:00/00:00:00:00:00/00 Emask 0x401 (device error) <F>
Feb 22 18:40:55 zoonaut kernel: [156847.735061] ata4.00: status: { DRDY ERR }
Feb 22 18:40:55 zoonaut kernel: [156847.735090] ata4.00: error: { ABRT }
Feb 22 18:40:55 zoonaut kernel: [156847.735722] ata4.00: supports DRM functions and may not be fully accessible
Feb 22 18:40:55 zoonaut kernel: [156847.738424] ata4.00: supports DRM functions and may not be fully accessible
Feb 22 18:40:55 zoonaut kernel: [156847.740732] ata4.00: configured for UDMA/133
Feb 22 18:40:55 zoonaut kernel: [156847.740776] ata4: EH complete
Feb 22 18:40:55 zoonaut kernel: [156847.741073] ata4.00: Enabling discard_zeroes_data
... and on and on and on...
Feb 22 18:40:56 zoonaut kernel: [156848.541172] ata4.00: Enabling discard_zeroes_data
Feb 22 18:40:56 zoonaut kernel: [156848.638967] ata4.00: exception Emask 0x0 SAct 0x3f00 SErr 0x0 action 0x0
Feb 22 18:40:56 zoonaut kernel: [156848.642010] ata4.00: irq_stat 0x40000008
Feb 22 18:40:56 zoonaut kernel: [156848.645155] ata4.00: failed command: SEND FPDMA QUEUED
Feb 22 18:40:56 zoonaut kernel: [156848.648344] ata4.00: cmd 64/01:40:00:00:00/00:00:00:00:00/a0 tag 8 ncq dma 512 out
Feb 22 18:40:56 zoonaut kernel: [156848.648344]          res 41/04:01:00:00:00/00:00:00:00:00/00 Emask 0x401 (device error) <F>
Feb 22 18:40:56 zoonaut kernel: [156848.654650] ata4.00: status: { DRDY ERR }
Feb 22 18:40:56 zoonaut kernel: [156848.657798] ata4.00: error: { ABRT }
Feb 22 18:40:56 zoonaut kernel: [156848.661629] ata4.00: supports DRM functions and may not be fully accessible
Feb 22 18:40:56 zoonaut kernel: [156848.664769] ata4.00: supports DRM functions and may not be fully accessible
Feb 22 18:40:56 zoonaut kernel: [156848.666981] ata4.00: configured for UDMA/133
Feb 22 18:40:56 zoonaut kernel: [156848.667013] sd 3:0:0:0: [sdc] tag#8 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 22 18:40:56 zoonaut kernel: [156848.667019] sd 3:0:0:0: [sdc] tag#8 Sense Key : Illegal Request [current] 
Feb 22 18:40:56 zoonaut kernel: [156848.667024] sd 3:0:0:0: [sdc] tag#8 Add. Sense: Unaligned write command
Feb 22 18:40:56 zoonaut kernel: [156848.667029] sd 3:0:0:0: [sdc] tag#8 CDB: Write same(16) 93 08 00 00 00 00 69 40 10 00 00 3f ff c0 00 00
Feb 22 18:40:56 zoonaut kernel: [156848.667035] blk_update_request: I/O error, dev sdc, sector 1765806080 op 0x3:(DISCARD) flags 0x4000 phys_seg 1 prio class 0
Feb 22 18:40:56 zoonaut kernel: [156848.670129] ata4: EH complete

lvremove thankfully just waits and retries all its I/O patiently, ultimately succeeding and just noting in an error message that its DISCARD got an I/O error.

But I no longer trust the device in my machine.

HP Microserver Gen10 AMD based system: lspci -v shows

00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 49) (prog-if 01 [AHCI 1.0])
        Subsystem: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
Comment 30 Gregory P. Smith 2021-02-24 05:27:19 UTC
For those disabling NCQ as a workaround.  You can do this per-drive rather than system wide.  Writing 1 to /sys/block/sdX/device/queue_depth instead of the default value disables NCQ.

lvresize -L -299G no longer takes ages or produces ATA discard errors in syslog after I do that.  Re-enabling NCQ by writing a higher value there, like the default 32 (really 31), causes it to run into errors again.

I've already picked up a WD Blue SSD to replace this buggy Samsung.  Not supporting NCQ properly is unacceptable to me.  I could just leave trim and discard disabled, but that's an equally hacky non-default config that nothing manufactured and sold in 2021 should require.

I'll avoid Samsung SSDs in the future and recommend others do the same.  This may only be a bug in their SATA line (which is becoming a legacy product, used merely for replacing HDDs).

Linux kernel wise, the 8xx series Samsung SATA SSDs could be blocklisted as known troublesome devices so that trim or NCQ are disabled by default on them.  Do we accept this kind of vendor quirk hack in mainline kernels?
Comment 31 Roman Mamedov 2021-02-24 07:23:43 UTC
Gregory, did you try disabling just the queued TRIM, not NCQ entirely? As suggested in https://bugzilla.kernel.org/show_bug.cgi?id=203475
Comment 32 Gregory P. Smith 2021-02-24 09:30:22 UTC
Thanks for the link. I hadn't seen that issue. Seems to be the same thing as this. Disabling just queued TRIM and not NCQ entirely appears to require rebuilding a kernel, which isn't something I can do to this machine.

Relevant patch and pointer to the change that needs reverting in the kernel:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca6bfcb2f6d9deab3924bf901e73622a94900473

People using Linux distros - file issues against your distro asking them to use a "Samsung [78]*" in that blocklist.

Also related: https://bugzilla.kernel.org/show_bug.cgi?id=202093  Which came from Canonical's existing Ubuntu issue https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1809972
Comment 33 Matt Whitlock 2021-02-24 15:01:33 UTC
(In reply to Gregory P. Smith from comment #30)
> Linux kernel wise, the 8xx series Samsung SATA SSDs could be blocklisted as
> known troublesome devices so that trim or NCQ are disabled by default on
> them.

PLEASE don't do that! I have a Samsung 860 Pro and a Samsung 860 EVO, and both work great with NCQ and queued Trim enabled. The incompatibility is specifically with AMD SATA controllers. Don't hobble these great drives universally because of one bad controller.
Comment 34 nejrobbins 2021-02-24 17:12:00 UTC
It is predominantly AMD SATA, but there have been some reports of it happening to Intel systems. I agree though it shouldn't be disabled outright.

From what I saw, typically just disabling queued TRIM doesn't fix the issue, as it appears in other contexts and not just when trimming. But it may still be useful to try for some.

You can also disable NCQ for a specific SATA port (and not system wide) with libata.force=X.00:noncq, where X is the ATA port number (the N in the ataN names shown in dmesg).

From: https://wiki.archlinux.org/index.php/Solid_state_drive#Resolving_NCQ_errors
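
The libata.force strings quoted in this thread (global, per port, and per port.device) all follow the "[ID:]VAL" pattern from Documentation/admin-guide/kernel-parameters.txt. A minimal sketch for assembling one is below; the helper name libata_force is hypothetical, and it only builds the string, it does not touch the bootloader configuration.

```python
def libata_force(ports=(), option="noncq"):
    """Build a libata.force= kernel command line parameter.

    ports  -- ATA IDs as printed by libata in dmesg, e.g. 1 or "1.00";
              an empty sequence applies the option to every port
    option -- the value to force, e.g. "noncq" or "noncqtrim"
    """
    ids = [str(p) for p in ports]
    if not ids:
        return f"libata.force={option}"
    return "libata.force=" + ",".join(f"{i}:{option}" for i in ids)


print(libata_force([1, 2]))            # libata.force=1:noncq,2:noncq
print(libata_force([4], "noncqtrim"))  # libata.force=4:noncqtrim
print(libata_force())                  # libata.force=noncq
```

Limiting the option to the affected ports leaves NCQ enabled for other drives on the same machine.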

I wonder what the performance impact of disabling NCQ is? I'm not sure of the relationship between average use queue depth and sequential vs random RW.


Just as a side note, on Windows with the Microsoft SATA driver the OS would freeze during benchmarks, but with the AMD driver it wouldn't. Not sure if NCQ got disabled there or anything, and didn't do a benchmark at the time.
Comment 35 Gregory P. Smith 2021-02-24 18:41:10 UTC
Kernel wise, it should ship with safe default behavior.  Let people who want this enabled on known often borked devices opt-in.  Don't default to causing performance or potential data loss issues for those who fail to opt-out.

The kernel's libata has the HORKAGE list for a reason.  Let's use it to maximal user benefit to avoid problems.  Re-enabling ncq (or ncqtrim if it is that specific) on these Samsung SSDs was a mistake that everyone here piping up is paying for.

The 870 is brand new, just released this year.  Yet it has the problem.

While I've seen an old blog post claiming that disabling NCQ on their SATA SSD led to a reduction in 4K random read performance, firing up fio to do a randread test is not reproducing that for me.  In fact... I just found that I/O speed on the device surprisingly went _up_ 20-30% after I disabled NCQ by writing 1 to /sys/block/sdc/device/queue_depth.  WAT?

If that's the case, I don't want NCQ on such a device.  No idea if I'm holding fio wrong.  First time using it.

Anyways, that's all the time I have for this.  Samsung did wrong with their SATA SSD firmware.  The Kernel is doing wrong for users who own those devices today.  

You've got the power to fix it for all Samsung SSD owners.
Comment 36 Matt Whitlock 2021-02-24 19:38:38 UTC
(In reply to Gregory P. Smith from comment #35)
> firing up fio to do a
> randread test is not reproducing that for me.  In fact... I just found that
> I/O speed on the device surprising went _up_ 20-30% after I disabled NCQ by
> writing 1 to /sys/block/sdc/device/queue_depth.  WAT?

I just ran fio on my Samsung 860 EVO 2TB, random 4K reads with libaio engine, I/O depth 256, jobs 4, runtime 120 seconds.

With queue_depth=32 (default): read: IOPS=46.7k, BW=182MiB/s

With queue_depth=1: read: IOPS=13.3k, BW=51.0MiB/s

No messages in my kernel logs for the duration of these tests.

So again, please don't cripple these drives for everyone. If there is an incompatibility with specific SATA controllers, then address that specifically.
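
To put the two fio results above side by side, here is a small arithmetic sketch. The function name iops_lost is hypothetical; the inputs are the IOPS figures quoted in this comment.

```python
def iops_lost(iops_with_ncq, iops_without_ncq):
    """Fraction of random-read IOPS lost when NCQ is disabled."""
    return 1.0 - iops_without_ncq / iops_with_ncq


# 46.7k IOPS at queue_depth=32 versus 13.3k IOPS at queue_depth=1
loss = iops_lost(46_700, 13_300)
print(f"IOPS lost with NCQ disabled: {loss:.1%}")
print(f"Speed-up from NCQ: {46_700 / 13_300:.2f}x")
```

On this drive and controller, disabling NCQ costs roughly 70% of 4K random-read throughput.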
Comment 37 Gregory P. Smith 2021-02-24 21:02:20 UTC
Trying to come up with a list of broken samsung-ssd/sata-controller combos is Samsung's job.  But I'd be surprised if they wanted to do that as it'd nerf their marketing-only benchmarks.  Ship code that is safe by default.

Rollback https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca6bfcb2f6d9deab3924bf901e73622a94900473

Disabling it on all Intel, AMD, asmedia, and marvell SATA controllers would be the only reasonable choice to do in the kernel based on all of the comments on the issues so far.

You're setting an arbitrarily impossible testing bar to meet in favor of your own personal combo's performance at the expense of everyone else's data integrity.
Comment 38 Roman Mamedov 2021-02-24 21:18:25 UTC
It should be noted that the referenced commit changes only the NCQ TRIM blacklist, whereas Matt's test results were obtained with NCQ disabled entirely.

Disabling just the queued TRIM will not have nearly as much performance impact as disabling NCQ itself. In most cases it will have none, since IIRC the latest best practices from FS devs are not to use inline trim (the "discard" mount option) at all, opting for daily/weekly invocations of "fstrim" instead. But this also might not help on some controllers. The telltale sign for whether it will help is whether the errors mention "SEND FPDMA QUEUED" (a problem with NCQ TRIM specifically) as opposed to "WRITE FPDMA QUEUED" (a generic NCQ issue).
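The distinction above can be checked mechanically. The following is a rough Python sketch that counts libata "failed command:" lines in saved dmesg output and maps them to the two cases described in this comment. The function names and return values are hypothetical, and the sample log lines are illustrative, not taken from any attachment on this bug.

```python
import re
from collections import Counter

# libata reports the failing command name after "failed command:".
FAILED_CMD = re.compile(r"failed command: (.+?)\s*$")


def count_failed_commands(log_text: str) -> Counter:
    """Count failed ATA commands by name in dmesg-style text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED_CMD.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def suggest_workaround(counts: Counter):
    """Map the failing command to the libata.force flag worth trying first."""
    if counts["WRITE FPDMA QUEUED"] > 0:
        return "noncq"      # generic NCQ issue, noncqtrim alone may not help
    if counts["SEND FPDMA QUEUED"] > 0:
        return "noncqtrim"  # problem with queued TRIM specifically
    return None


# Illustrative input only.
SAMPLE_LOG = """\
[ 1234.000001] ata1.00: failed command: WRITE FPDMA QUEUED
[ 1234.000002] ata1.00: failed command: WRITE FPDMA QUEUED
[ 1234.000003] ata1: hard resetting link
"""
```

If both kinds of error show up, the sketch treats it as the generic case, since disabling only queued TRIM would leave the WRITE failures in place.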
Comment 39 Gregory P. Smith 2021-02-24 21:44:36 UTC
Right.  When I can reboot, I'll try mine with just noncqtrim.  (It wasn't clear to me if there is a way to control the noncqtrim horkage setting via /sys/block/sdc/device/ like there is for disabling NCQ entirely.)

Matt: Can you share the fio command line / config you used?  I'd like to repeat the same test on my own system.  Yours is clearly performing much more like I'd expect given the settings (ncq being disabled _should_ put a big dent in 4k random read performance on a decent SSD)
Comment 40 Matt Whitlock 2021-02-24 23:20:38 UTC
(In reply to Gregory P. Smith from comment #39)
> Matt: Can you share the fio command line / config you used?

Not knowing anything about fio myself, I just used the example command line from https://docs.oracle.com/en-us/iaas/Content/Block/References/samplefiocommandslinux.htm:

fio --filename=/dev/disk/by-id/ata-Samsung_SSD_yaddayaddayadda --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly
Comment 41 Gregory P. Smith 2021-02-27 05:51:58 UTC
Confirming: Rebooting with libata.force=4:noncqtrim to disable queued trim on my Samsung 870 EVO _appears_ to work around the issue for easy to reproduce situations (lvresize to reduce a volume size).

[    2.481763] ata4.00: FORCE: horkage modified (noncqtrim)
[    2.481823] ata4.00: supports DRM functions and may not be fully accessible
[    2.482434] ata4.00: disabling queued TRIM support
[    2.482437] ata4.00: ATA-11: Samsung SSD 870 EVO 1TB, SVT01B6Q, max UDMA/133
[    2.482439] ata4.00: 1953525168 sectors, multi 1: LBA48 NCQ (depth 32), AA

Also confirming: There is no measurable performance degradation during normal use of my SSD from doing so.  All this is disabling is Queued TRIM.

I understand this to mean a TRIM must act as a barrier to let the existing queued IO finish before happening alone, as a serialized command, after which normal NCQ IO can resume.

Confirming by doing some lvresize -L -100G commands on a LV on the volume during an fio run, I see a very brief blip in the speeds printed to stdout.  But as it's just a single trim it is inconsequential to performance.

I wouldn't mount your filesystem with -o discard in such a configuration if you'll have frequent transient files and need unwavering read throughput.  But a regular ~weekly scheduled fstrim -a shouldn't be a big deal (which I believe is the default setup in popular distros anyways?).

The patch just re-marks these drives as "disable Queued TRIM" by default.  It doesn't disable NCQ entirely.  Seems like a good safe default.

I don't doubt that some people have issues with NCQ itself.  The reason the noncqtrim libata.force flag was added to the kernel in 2015, alongside the older noncq flag, was the large number of SSDs out there that don't behave well.  (https://patchwork.ozlabs.org/project/linux-ide/patch/1430790861-30066-1-git-send-email-martin.petersen@oracle.com/)

The kernel could fail better in this situation.  Perhaps: "Got an error as a result of a queued trim?  Automatically flip that device to noncqtrim mode."  But that's a larger logic change with consequences.  Updating the horkage blocklist is simple and targets this specific issue.

Matt - thanks for the fio command and link!  _That_ one seems to properly exercise my SSD.

4k read: IOPS=88.7k, BW=346MiB/s (363MB/s)      with noncqtrim or without
4k read: IOPS=11.1k, BW=43.5MiB/s (45.6MB/s)    with noncq (queue_depth=1)
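As a sanity check on the two result lines above, the bandwidth figures follow directly from IOPS times the 4 KiB block size. A small Python sketch (the helper name is mine, and fio's rounded IOPS values are used as input, so the last digit can differ slightly from what fio printed):

```python
def bandwidth_from_iops(iops: float, block_size: int = 4096):
    """Return (MiB/s, MB/s) for a given IOPS figure and block size in bytes."""
    bytes_per_second = iops * block_size
    return bytes_per_second / 2**20, bytes_per_second / 1e6


ncq_mib, ncq_mb = bandwidth_from_iops(88_700)      # with noncqtrim or without
noncq_mib, noncq_mb = bandwidth_from_iops(11_100)  # with noncq (queue_depth=1)

# How many times faster 4k random reads are with NCQ left on.
slowdown = 88_700 / 11_100

print(f"NCQ on:  {ncq_mib:.0f} MiB/s ({ncq_mb:.0f} MB/s)")
print(f"NCQ off: {noncq_mib:.1f} MiB/s ({noncq_mb:.1f} MB/s)")
print(f"ratio:   {slowdown:.1f}x")
```

So disabling NCQ entirely costs roughly a factor of eight on this workload, while noncqtrim costs nothing measurable.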
Comment 42 Hans de Goede 2021-03-02 12:34:36 UTC
Hi All,

Upstream kernel dev here, who has done some work in this area in the past.

So I was about to submit a patch upstream to just disable NCQ-TRIM on all "Samsung SSD 8*" sata drives based on this bug report.

But reading through the entire bug report again, I have decided to not do this.

My reason for not doing this is that I believe that it will not solve the problem, it will maybe make it less prominent but it will not solve it.

There are several comments in this bug-report indicating that problems still happen under heavy IO with (queued) trim disabled, see comment 5, comment 6, comment 14.

And everything just points to an incompatibility between AMD/Asmedia SATA controllers (AMD southbridges have been made by Asmedia for a while) and the Samsung 860 / 870 SATA SSDs. Given that falling back to 1.5 Gbps appears to help, I guess that there is some incompatibility between the 2 phy-s on each side, which leads to issues under heavy load.

I guess a heavy load on the SSD causes the SSD's voltage-rails to become more noisy and this leaks through the phy-s, which the AMD/Asmedia sata-controllers do not like.

Disabling NCQ (or NCQ-TRIM) reduces the load, masking the issue, but as several comments here show, the issue still happens, just less frequently, so this is not really a fix.

This, combined with there also being many reports (including here) about similar issues under Windows, leads me to the conclusion that AMD/Asmedia sata-controllers and Samsung 860 / 870 sata SSDs are simply incompatible with each other.

So the only solution which I can give you is to not use this combination.

The only kernel patch to "solve" this which I can envision is detecting the combination and then simply refusing to use the SSD (with a big fat warning message).
Comment 43 Gregory P. Smith 2021-03-02 21:36:35 UTC
What is your goal in leaving this enabled?  Causing errors to surface soon that anyone paying attention to hardware misperformance and their log spam will notice, Google the error message, and hopefully wind up here?  learning that they need to customize a kernel command line to have libata.force=$BUSNUMBER:noncqtrim?

I admire the fail fast to force acknowledgement of the problem approach, but that is still rather indirect and painful to address.

**Gaining the ability to control noncqtrim at runtime via /sys/block/sdX/device would be excellent.**

That way decisions like this could be left to userland/distro/etc logic to determine and not require special custom kernel command line configs and further reboots.  It could even be detected by userspace and done automagically.  Discussion could move out of the kernel bugzilla.

Some comments from the various issues related to this also claim to be on Intel or other controllers.  comment 17 and comment 27 in this issue for example.

It is obviously impossible for us to verify the veracity of every commenter's hardware setup.  But doing nothing and taking the manufacturer's word for it, as the 2018 change to re-enable the buggy setting did, seems to have been a mistake.  I question their motivation to re-enable it...
Comment 44 Hans de Goede 2021-03-03 08:40:03 UTC
(In reply to Gregory P. Smith from comment #43)
> What is your goal in leaving this enabled?  Causing errors to surface soon
> that anyone paying attention to hardware misperformance and their log spam
> will notice, Google the error message, and hopefully wind up here?  learning
> that they need to customize a kernel command line to have
> libata.force=$BUSNUMBER:noncqtrim?

As I mentioned already in my original comment, I did consider enabling noncqtrim on these models. But (as also already mentioned) the reporters in comment 5 (original reporter) and comment 6 (dd does not do trim) both report still seeing issues when not using trim; IOW noncqtrim is not sufficient to fix this. It merely helps make the issue less obvious.

> Some comments from the various issues related to this also claim to be on
> Intel or other controllers.  comment 17 and comment 27 in this issue for
> example.

I did check those reports when writing my original comment, but I dismissed them. Comment 17 talks about "the problem has been very intermittent" and there has been 0 follow-up to that comment, making it at best anecdotal proof of there also being problems with Intel controllers.

Comment 27 is discussed in more detail at https://unix.stackexchange.com/questions/623238/root-causes-for-failed-command-write-fpdma-queued and in that case the problem happens at random without there being any load, which is very different from the reporters here, who all report the problem happening under stress. And the reporter there suspects that it might be a bad SATA cable, which sounds plausible.

Not every case of SATA transfer errors has the same root cause. A bad power-supply or a bad cable could equally well be causing problems.

> It is obviously impossible for us to verify the veracity of every commenter's
> hardware setup.  But doing nothing and taking the manufacturer's word for it
> as the 2018 change to re-enable the buggy setting seems to have been a
> mistake.  I question their motivation to re-enable it...

Again, merely re-enabling noncqtrim is not enough to fix this. If this works for you great. But there are a lot of indications that it does not help in all cases.

The only thing which does seem to help consistently for everyone is using the noncq setting. But as you have shown with the fio tests yourself, the performance hit from that is huge.

As I already mentioned before, I suspect some sort of power-supply issue may also have a hand in things here, and doing a trim typically involves erasing flash blocks, which is an operation with high power consumption. So what happens when enabling noncqtrim is that the SATA connection to the SSD sits idle while waiting for the trim to complete. Since there is no SATA traffic, the higher power consumption caused by the trim also cannot cause any SATA transfer corruption.

While reading through this bug, I noticed that some of the involved systems are quite old. I wonder what the quality of the PSU-s used in these systems is. Even if the PSU-s were fine when they were new, capacitors degrade over time.

Likewise I guess some people may be using converters to go from a molex power-connector to a sata power-connector. Those might be flaky too.

Has anyone who is seeing this tried replacing his PSU with a new high-quality PSU and checked if that helps ?  Yes unless you have a spare one lying around to test this is not cheap, but it would be an interesting data point.

Also note that AMD claims that they cannot reproduce the issue:
https://community.amd.com/t5/server-gurus-discussions/issues-with-samsung-ssds-on-epyc/m-p/402746/highlight/true#M835

I assume that AMD is using a high quality PSU, with nice and clean power-rails, in their testing, which might be why they are not seeing this.

TL;DR: this is a complex issue. I would love to make a magic wand and make it go away for everyone, but I don't see any easy answers here. I do not believe that noncqtrim will solve this for everyone. The only thing which consistently seems to help is to go full noncq on these drives, which would lead to a big flood of complaints about the performance tanking from people using these with Intel SATA controllers.
Comment 45 Roy 2021-03-03 09:43:11 UTC
(In reply to Hans de Goede from comment #44)
> While reading through this bug, I noticed that some of the involved systems
> are quite old. I wonder what the quality of the used PSU-s in these systems
> is. Even if the PSU-s were fine when they were new, capacitors degrade over
> time.
> 
> Likewise I guess some people may be using converters to go from a molex
> power-connector to a sata power-connector. Those might be flaky too.
> 
> Has anyone who is seeing this tried replacing his PSU with a new
> high-quality PSU and checked if that helps ?  Yes unless you have a spare
> one lying around to test this is not cheap, but it would be an interesting
> data point.
> 
> Also note that AMD claims that they cannot reproduce the issue:
> https://community.amd.com/t5/server-gurus-discussions/issues-with-samsung-
> ssds-on-epyc/m-p/402746/highlight/true#M835
> 
> I assume that AMD is using a high quality PSU, with nice and clean
> power-rails in their testing, which might be why they are not seeing this.

Please bear in mind that this AMD community forum report is against a Ryzen-era machine. On the contrary, most people confirming this bug have Bulldozer-era machines or older. There's a good chance problems have been resolved with the Ryzen generation of south bridges.

For me (AMD FX 6300, Asus M5A97-EVO R2, Samsung ) this bug is incredibly easy to reproduce: just boot and run for a few hours. Disabling NCQ makes the problem entirely go away, disabling NCQ Trim is something I haven't tried (yet). The real-world performance penalty appears to be limited, and one I'm (begrudgingly) willing to live with for the remaining lifespan of this machine. Getting a new SSD is an easy way to bring new life to such machines, which is why you're seeing quite a few of us on this report.
Comment 46 Roman Mamedov 2021-03-03 09:49:14 UTC
While I'm not really advocating any kernel change anymore, it seems baffling that with literally 15 other SSDs not having any issue whatsoever on the same system and controller, but when specifically this one single model from Samsung displays its high-profile and well-known all over the Internet NCQ issue, some will still lean to wave that away as "your own fault", that I use an old system with bad PSUs. Great.
Comment 47 Hans de Goede 2021-03-03 10:36:38 UTC
(In reply to Roman Mamedov from comment #46)
> While I'm not really advocating any kernel change anymore, it seems baffling
> that with literally 15 other SSDs not having any issue whatsoever on the
> same system and controller, but when specifically this one single model from
> Samsung displays its high-profile and well-known all over the Internet NCQ
> issue, some will still lean to wave that away as "your own fault", that I
> use an old system with bad PSUs. Great.

<sigh>

I'm not blaming anyone / I'm not saying this is anyone's fault.

As an engineer I'm trying to find a root-cause for this problem. Because without a root cause I cannot fix it.

One part of the equation seems to be using these specific Samsung SSDs, but that clearly is not the whole story.

All I did was post a theory that it might be related to using an older, possibly degraded, PSU. SSD-s have much more "spiky" power-consumption behavior than HDDs, so this might be PSU related.
Comment 48 Hans de Goede 2021-03-03 10:45:29 UTC
(In reply to Roy from comment #45)
> Please bear in mind that this AMD community forum report is against a
> Ryzen-era machine. On the contrary, most people confirming this bug have
> Bulldozer-era machines or older.

So I guess we should consider doing a kernel side quirk where the kernel disables NCQ on the combination of having a Samsung 860 or 870 SSD with a SATA controller on these older AMD chipsets. This does require having a list of PCI-ids for the controllers on which to enable this quirk.
Comment 49 Roy 2021-03-03 10:55:56 UTC
(In reply to Hans de Goede from comment #48)
> (In reply to Roy from comment #45)
> > Please bear in mind that this AMD community forum report is against a
> > Ryzen-era machine. On the contrary, most people confirming this bug have
> > Bulldozer-era machines or older.
> 
> So I guess we should consider doing a kernel side quirk where the kernel
> disables NCQ on the combination of having a Samsung 860 or 870 SSD with a
> SATA controller on these older AMD chipsets. This does require having a list
> of PCI-ids for the controllers on which to enable this quirk.

[roy@Tuvok ~]$ lsscsi -v
[0:0:0:0]    disk    ATA      Samsung SSD 860  4B6Q  /dev/sda 
  dir: /sys/bus/scsi/devices/0:0:0:0  [/sys/devices/pci0000:00/0000:00:11.0/ata1/host0/target0:0:0/0:0:0:0]

[roy@Tuvok ~]$ lspci -vn
<...>
00:11.0 0106: 1002:4391 (rev 40) (prog-if 01 [AHCI 1.0])
	Subsystem: 1043:84dd
<...>
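To illustrate how the proposed controller-plus-drive quirk could be matched, here is a userspace Python sketch (not kernel code) that pulls the class and vendor:device IDs out of an `lspci -vn` line like the one above and checks them against a quirk list. The list contents and the function names are hypothetical: 1002:4391 is simply the one ID posted in this comment, and the model prefixes follow the "Samsung 860 or 870" wording from comment 48.

```python
import re

# "00:11.0 0106: 1002:4391 (rev 40) ..." -> slot, class, vendor, device
LSPCI_LINE = re.compile(
    r"^(?P<slot>[0-9a-f:.]+) (?P<cls>[0-9a-f]{4}): "
    r"(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})"
)

# Hypothetical quirk table; a real one would need IDs collected from reporters.
QUIRKED_CONTROLLERS = {(0x1002, 0x4391)}
QUIRKED_MODEL_PREFIXES = ("Samsung SSD 860", "Samsung SSD 870")


def parse_lspci_line(line: str):
    """Return (class, vendor, device) or None if the line is not a device line."""
    match = LSPCI_LINE.match(line)
    if not match:
        return None
    return (match["cls"], int(match["vendor"], 16), int(match["device"], 16))


def needs_ncq_quirk(model: str, vendor: int, device: int) -> bool:
    """True when both the drive model and the controller are on the lists."""
    on_controller_list = (vendor, device) in QUIRKED_CONTROLLERS
    return on_controller_list and model.startswith(QUIRKED_MODEL_PREFIXES)
```

Matching on exact vendor:device pairs rather than on the vendor ID alone keeps newer controllers from the same vendor out of the quirk, which is the concern raised about Ryzen-era parts in comment 45.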
Comment 50 Alejandro Donato 2021-03-03 11:14:43 UTC
My 2 cents.

Before reporting this bug, I ruled out any hardware issue. I went through the whole set of hardware tests and checks so as not to file a false, incomplete, or useless report.
In my humble opinion, this bug tracker is not something to play with.
This is not a regular user forum for asking how to compile a driver.
I assume the people who take the time to report and track a bug are experienced.
Sorry if I sound angry, but the conclusions sound really out of context.

I have 4 systems that all have different hardware (with similar or the same controllers), and they all fail.

4 disks, 4 PSUs, 4 SATA cables... I assume my luck is not so bad that all of this hardware is faulty.

I hope my comments are not taken badly (and sorry for my bad English, it is not my native language). I only ask that this issue not be minimized, and that people try to understand that this is not a "crappy/old hardware" issue.

My technical skills lead me to presume a firmware/driver fault, which can surely be fixed.

Thanks!
Comment 51 nejrobbins 2021-03-03 14:25:00 UTC
(In reply to Roy from comment #45)
> 
> For me (AMD FX 6300, Asus M5A97-EVO R2, Samsung ) this bug is incredibly
> easy to reproduce: just boot and run for a few hours. Disabling NCQ makes
> the problem entirely go away, disabling NCQ Trim is something I haven't
> tried (yet). The real-world performance penalty appears to be limited, and
> one I'm (begrudgingly) willing to live with for the remaining lifespan of
> this machine. Getting a new SSD is an easy way to bring new life to such
> machines, which is why you're seeing quite a few of us on this report.


Yep, FX 6200 here and can report the same. Issue seems to show up particularly during writes, like when running mkinitcpio.

When I disable NCQ, the error doesn't show up ever again. I have used this SSD with two different power supplies (EVGA 430W W1 and EVGA 650W Supernova G1) and have had the error with both. Have also tried different SATA cables.

Haven't tried disabling NCQ TRIM, but I doubt it will help, as I have the WRITE FPDMA QUEUED error, and not SEND FPDMA QUEUED.

I also tried this SSD briefly on Windows, and I noticed that during benchmarks with the AMD SATA driver, the system would freeze completely, but with the AMD-recommended Microsoft driver, it would not. IIRC the CRC Error Count would still increase, which is still indicative of this problem and that the driver didn't do anything regarding NCQ. 

I'm wondering, what is the real impact of receiving these errors? I guess the drive would lock up a bit, but is the impact of this error worse than the performance impact of disabling NCQ?
Comment 52 Matt Whitlock 2021-03-03 17:19:56 UTC
(In reply to nejrobbins from comment #51)
> CRC Error Count

How do we view this counter under Linux? I haven't seen any reports of this value in the comments on this bug report, and I think it would be very revealing.
Comment 53 Roy 2021-03-03 17:29:08 UTC
(In reply to Matt Whitlock from comment #52)
> (In reply to nejrobbins from comment #51)
> > CRC Error Count
> 
> How do we view this counter under Linux? I haven't seen any reports of this
> value in the comments on this bug report, and I think it would be very
> revealing.

sudo smartctl -a /dev/sdX
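For tracking the counter over time, here is a small Python sketch that extracts the raw value of SMART attribute 199 from the attribute table that smartctl prints. It assumes the usual ten-column table layout, and that attribute 199 is the interface CRC error counter on these drives (commonly labelled UDMA_CRC_Error_Count or CRC_Error_Count); the sample text is illustrative, not output from a drive in this thread.

```python
def crc_error_count(smartctl_output: str):
    """Return the raw value of SMART attribute 199, or None if it is absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


# Illustrative sample of the attribute table.
SAMPLE_OUTPUT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1234
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       17
"""
```

Recording the value before and after a stress run shows whether link-level CRC errors are actually accumulating while the "failed command" messages appear.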
Comment 54 Tejun Heo 2021-03-03 17:47:06 UTC
Hans, given that nobody is likely to take a bus tracer to root-cause an issue specific to the combination of an older controller and some SSDs, that there are multiple reports, and that the downsides of disabling ncq trim are pretty minimal, maybe disabling ncq trim on the affected combos isn't such a bad idea, at least as a quality-of-life measure?

We had something similar with SanDisk SSDs which would completely lock up under load several years ago. The lockup rate was too high to be practical, especially in large deployments, and disabling NCQ (yeah, whole NCQ) lowered the failure rate enough so that at least the machines stayed up most of the time. So, we quirked it off. Fortunately, we could push SanDisk to root cause the problem and the root cause turned out to be too high max request size, and now the affected drives have NCQ back on with IO size quirk.

The point I'm trying to make is that while it's of course ideal to root cause issues and plug them at the source, we sometimes have to operate with information that's available at the moment and there's nothing wrong with lowering failure rate enough with imperfect workarounds so that things are more bearable for the time being. We just have to balance the pros and cons and make a reasonable decision. Here, provided that disabling NCQ trim on the specific combo makes sufficient difference, I don't think quirking it is unreasonable.

Thanks.
Comment 55 Hans de Goede 2021-03-03 18:57:32 UTC
(In reply to Tejun Heo from comment #54)
> Hans, given that nobody is likely to take a bus tracer to root-cause an
> issue specific to combination of an older controller and some SSDs, there
> are multiple reports, and that the downsides of disabling ncq trim are
> pretty minimal, maybe disabling ncq trim on the affected combos isn't such a
> bad idea at least as a quality-of-life measure?

Tejun, I completely agree with you. As I stated in my first comment, disabling ncq-trim was my initial plan. But then I took a closer look at all the comments here and for say 50% of the cases disabling (ncq) trim is not enough, whereas 100% seem to report success with disabling ncq altogether.

But disabling ncq altogether is a big hammer. Too big IMHO since this only happens with AMD + Asmedia controllers. I guess we could introduce a special horkage flag for this and disable NCQ on these devices if the PCI vendor-id == AMD or vendor-id == Asmedia ?   I know you're no longer the drivers/ata maintainer, but your input / insight on this is still very much welcome.

###

Another consideration is that the impact of the bug is not entirely clear to me yet. Yes there are errors in the log, but it seems that for most users the system recovers just fine after that. The recovery does take time, but it is also not clear to me what the frequency of the errors is. Some reports talk about 20 times a day ...

So 2 questions for everyone who is seeing this bug:

1. Besides the errors in the logs, what are the other symptoms of this bug which you see on your system(s)? Do things get slow / is there data corruption / anything else ?

2. If the problem is merely things getting slow, how slow are we talking about and does this slowdown happen all the time, or only a couple of times per day ?
Comment 56 Tejun Heo 2021-03-03 19:06:40 UTC
(In reply to Hans de Goede from comment #55)
> Tejun, I completely agree with you. As I stated in my first comment
> disabling ncq-trim was my initial plan. But then I took a closer look at all
> the comments here and for say 50% of the cases disabling (ncq) trim is not
> enough. Where as 100% seems to report success with disabling ncq altogether.

I see. Understood.

> But disabling ncq altogether is a big hammer. Too big IMHO since this only
> happens with AMD + Asmedia controllers. I guess we could introduce a special
> horkage flag for this and disable NCQ on these devices if the PCI vendor-id
> == AMD or vendor-id == Asmedia ?   I know you're no longer the drivers/ata
> maintainer, but your input / insight on this is still very much welcome.

Fully agreed, given how wide spread these SSDs are, I think it'd make sense to apply the workaround only on the affected controllers, even for just turning off NCQ trim.

> Another consideration is that  the impact of the bug is not entirely clear
> to me yet. Yes there are errors in the log, but it seems that for most users
> the system recovers just fine after that. The recovery does take time, but
> it is also not clear to me what the frequency of the errors is. Some reports
> talk about 20 times a day ...
> 
> So 2 questions for everyone who is seeing this bug:
> 
> 1. Beside the errors in the logs, what are the other symptoms of this bug
> which you see on your system(s). Do things get slow / is there data
> corruption / anything else ?
> 
> 2. If the problem is merely things getting slow, how slow are we talking
> about and does this slowdown happen all the time, or only a couple of times
> per day ?

Yeah, gathering more data on the symptoms and the effectiveness of workarounds would hopefully shed more light on the direction. Thank you so much for working on this.
Comment 57 Matt Whitlock 2021-03-03 19:38:26 UTC
(In reply to Hans de Goede from comment #55)
> 1. Beside the errors in the logs, what are the other symptoms of this bug
> which you see on your system(s).

Back when I was experiencing intermittent "WRITE FPDMA QUEUED" errors on my Samsung SSD 860 Pro on an "Intel Corporation NM10/ICH7 Family SATA Controller [AHCI mode] (rev 01)" (an issue that disappeared after I recapped my motherboard), a symptom that I was experiencing was that the MD software RAID driver would kick the affected drive out of my mirrored pair after a batch of errors. Maybe that was just due to a software timeout, though I wasn't setting the "failfast" flag. One maybe relevant observation is that I was not able to re-add the drive to the array until after rebooting.

Unfortunately (and fortunately), I am no longer able to reproduce the problem, so I will not be able to collect any more observations or attempt any workarounds. For what it's worth, I do run my file systems with the "discard" mount flag enabled, so I believe I can assert that queued TRIM isn't causing any problems for me.
Comment 58 Matt Whitlock 2021-03-03 19:48:00 UTC
(In reply to Roy from comment #53)
> sudo smartctl -a /dev/sdX

That would show the interface CRC error count from the device's perspective, but how do we view the error count from the host controller's perspective? Is there a stats file for the SATA controller in debugfs or something like that?
Comment 59 Alejandro Donato 2021-03-03 20:15:12 UTC
On 3/3/21 at 16:06, bugzilla-daemon@bugzilla.kernel.org wrote:
> So 2 questions for everyone who is seeing this bug:
>
> 1. Beside the errors in the logs, what are the other symptoms of this bug
> which you see on your system(s). Do things get slow / is there data
> corruption / anything else ?
>
> 2. If the problem is merely things getting slow, how slow are we talking
> about and does this slowdown happen all the time, or only a couple of times
> per day ?

1 - things get very slow and it ends in data corruption

2 - I notice that working on big files (like accessing virtual disk images)
starts to trigger the issue. And, as I said before, it ends in data
corruption.

Thanks!
Comment 60 Solomon Peachy 2021-03-03 21:21:58 UTC
Created attachment 295623 [details]
signature.asc

On Wed, Mar 03, 2021 at 06:57:32PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> 1. Beside the errors in the logs, what are the other symptoms of this bug
> which
> you see on your system(s). Do things get slow / is there data corruption /
> anything else ?

When the SSD was plugged into the motherboard's controller, I would see
significant slowdowns that occasionally led to data corruption.  I had
to completely disable NCQ to make the issues go away completely, with
the resultant performance impact.

When I plugged the SSD into a generic PCIe ASMedia controller, disabling
queued trim was sufficient to avoid issues -- but leaving queued trim
enabled was all but guaranteed to cause filesystem corruption, twice to
the point of the FS going read-only and un-mountable without an
xfs_repair pass.

> 2. If the problem is merely things getting slow, how slow are we talking
> about
> and does this slowdown happen all the time, or only a couple of times per day
> ?

It easily happened multiple times a day; my supposition was that it 
primarily depended on the nuances of the write load but I was never able 
to narrow it down.

I've since swapped that SSD onto an Intel 8086:8d62 controller, and it 
hasn't so much as hiccupped since, with full NCQ and queued trim.
Comment 61 PJBrs 2021-03-09 08:03:14 UTC
I already replied to the bug report over here - https://bugzilla.kernel.org/show_bug.cgi?id=203475 - since that one isn't specific to AMD SATA controllers.

Hans de Goede, you wrote:
(In reply to Hans de Goede from comment #55)
> But disabling ncq altogether is a big hammer. Too big IMHO since this only
> happens with AMD + Asmedia controllers. I guess we could introduce a special
> horkage flag for this and disable NCQ on these devices if the PCI vendor-id
> == AMD or vendor-id == Asmedia ?   I know you're no longer the drivers/ata
> maintainer, but your input / insight on this is still very much welcome.

I've read this report here as well as the reports in bug 203475, and it seems to me that, while the issue is most severe on AMD controllers, it definitely also shows up on several different Intel SATA controllers.

I'm using a 1 TB Samsung 860 EVO SSD (firmware RVT04B6Q) on my ThinkPad T450s. SATA controller info:

00:1f.2 SATA controller: Intel Corporation Wildcat Point-LP SATA Controller [AHCI Mode] (rev 03) (prog-if 01 [AHCI 1.0])
        Subsystem: Lenovo Wildcat Point-LP SATA Controller [AHCI Mode]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin B routed to IRQ 44
        Region 0: I/O ports at 30a8 [size=8]
        Region 1: I/O ports at 30b4 [size=4]
        Region 2: I/O ports at 30a0 [size=8]
        Region 3: I/O ports at 30b0 [size=4]
        Region 4: I/O ports at 3060 [size=32]
        Region 5: Memory at f123c000 (32-bit, non-prefetchable) [size=2K]
        Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
                Address: fee00298  Data: 0000
        Capabilities: [70] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [a8] SATA HBA v1.0 BAR4 Offset=00000004
        Kernel driver in use: ahci

I have two ext4 partitions on this drive mounted with discard, one of which is encrypted.

> So 2 questions for everyone who is seeing this bug:
> 
> 1. Beside the errors in the logs, what are the other symptoms of this bug
> which you see on your system(s). Do things get slow / is there data
> corruption / anything else ?

I noticed this issue first when one to several times each day the machine would freeze almost entirely. Only the mouse cursor seemed to still react. I noticed it especially a couple of minutes after resuming. I work around the issue by reverting https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca6bfcb2f6d9deab3924bf901e73622a94900473

I didn't wait for more severe problems to occur before working around the issue.

> 2. If the problem is merely things getting slow, how slow are we talking
> about and does this slowdown happen all the time, or only a couple of times
> per day ?

As mentioned above, a full freeze except for mouse cursor for a very noticeable length of time (I think somewhere between 10-30 seconds, didn't count).

In closing - I agree that there is no clear-cut problem and therefore no clear-cut solution. But I very strongly want to signal here that it also exists with Intel SATA controllers. I understand that disabling queued trim alone may not be a sufficient solution in all cases, but from the reports I've read it seems to me that it solves much more than the 50% of the problems that you estimated. Maybe that's because it's easier to disable ncq altogether (and solve the issue) than it is to disable queued trim alone?
Comment 62 Klaus Zipfel 2021-03-21 01:20:38 UTC
I also want to elaborate on this issue.

It seems to look mostly like a storage controller issue and is not limited to trim. Yet, the SSD type in combination with NCQ might also play a role.

My motherboard: ASRock Fatal1ty Z370 Gaming K6 (Intel + additional ASMedia storage controllers)

My system uses BTRFS on LVM on LUKS on a Samsung 870 Evo (Prior: WD Green: WDS240G2G0B). If I recall this correctly, TRIM operations will not be forwarded to the drive when using the default config on this type of setup.

With the SSD attached to an ASMedia storage controller (ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 02)), the error seems amplified: e.g. if I clone a 12 GB repo, the system tends to hang completely 

Having the
Comment 63 Klaus Zipfel 2021-03-21 01:42:37 UTC
I also want to elaborate on this issue.

It seems to look mostly like a storage controller issue and is not limited to trim only or the SSD type. Yet, the SSD type in combination with NCQ might also play a role.

My motherboard: ASRock Fatal1ty Z370 Gaming K6 (Intel + additional ASMedia storage controllers)

My system uses BTRFS on LVM on LUKS on a Samsung 870 Evo (Prior: WD Green: WDS240G2G0B) on Kernel 5.11.6-1. 
If I recall this correctly, TRIM operations will not be forwarded to the drive when using the default config on this type of setup. So TRIM might not be the main driver of this issue.

With the SSD attached to an ASMedia storage controller (ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 02)), the error seems amplified: e.g. if I clone a 12 GB git repo, storage I/O seems to hang every other second, and this can also cause the whole system to freeze.
dmesg also prints a lot of 'BTRFS error (device dm-3): bdev /dev/mapper/system-pool errs: wr 0, rd 0, flush 0, corrupt 108, gen 0' due to wrong checksums ('BTRFS warning (device dm-3): csum failed root 264 ino 115629 off 1016291328 csum 0x1b9828c4 expected csum 0xa30291b3 mirror 1').

Running a scrub (which reports "scrub started on /dev/mapper/system-pool") confirms the errors in dmesg.

This "seemed to get better" with the SSD attached to the Intel SATA controller (Intel Corporation 200 Series PCH SATA controller [AHCI mode]).
However, it did not go away for the Samsung 870 EVO, while it seems to have gone away for the WD drive - but I only tried that once!
Setting the NCQ queue_depth to 1 seems to mitigate the issue (on the Intel controller for the Samsung SSD - but not at all on the ASMedia controller!).
However, cloning the previously mentioned repo again and again to the Samsung 870 Evo on the Intel controller with queue_depth=1 ended in BTRFS checksum errors again on the freshly cloned repo. I always checked this by running a scrub ("scrub started on /dev/mapper/system-pool").

So in my eyes, the problem is not yet pinned down to a single drive or storage controller.
Comment 64 Hans de Goede 2021-03-21 10:29:52 UTC
@Klaus Zipfel

Thank you for the long comment and all the testing you've done.

Your BTRFS tests showing data corruption (re)confirm that this really is a serious issue.

Your tests also show that unfortunately there is no easy fix from the kernel side here.

I'm a bit surprised that you need queue_depth=1 on the Intel controller at all; and that you still see corruption in that scenario.

Is your Samsung drive using the latest firmware? There were some issues with AMD controllers which reportedly are fixed by a firmware update.

Same question for your motherboard BIOS: in the past, BIOS updates have (silently, without any mention in the changelog) resolved SATA issues as well, so make sure you are up to date there too.

Also, are you doing any overclocking? Overclocking can cause things like this even if the system is otherwise fine, especially overclocking of the bus/base frequency instead of just bumping the multiplier.

I suspect either a firmware issue with the drive, or perhaps a power-supply issue.

The fact that queue_depth=1 is necessary makes me suspect the PSU; as explained earlier, this significantly lowers the number of power-consumption spikes which the SSD will exhibit.

How old is your PSU, and how is the drive connected to it? Is it possible to connect the drive to another SATA power connector on the PSU?
Comment 65 Alejandro Donato 2021-03-21 18:15:50 UTC
Another 2 cents from me,

In my tests, I can confirm that power is not the issue (I'm using a 1000W 
power supply tested and verified under load). No electrical noise, no 
power spikes (checked with an oscilloscope).

I also ran tests with 3 EVO drives.

So, if someone can validate this, we can discard power issues.

Thanks all!

Comment 66 Roman Elshin 2021-03-21 19:40:58 UTC
>So, if someone can validate this, we can discard power issues.

I haven't checked all 3 of my relatively new PSUs with an oscilloscope, but other SSDs work fine with them, and I'd assume a power issue couldn't be solved by using an external PCIe controller in the same system (a Marvell 88SE9215 PCIe x1 card in my case).
Comment 67 Klaus Zipfel 2021-03-22 00:37:45 UTC
(In reply to Hans de Goede from comment #64)

I have more insight on my end now. Please regard my previous statements about the Intel controller + Samsung 870 SSD (including NCQ = off) **causing the corruption of my FS** as "almost" void. I cannot speak for the ASMedia controller yet, but I will test this hardware combination once I have fixed the issue on my end.


To your questions:

- The SSD is on the latest firmware (the Samsung 870 Evo was released at the beginning of this year; no newer firmware is available)

- My BIOS is on the latest version (though that is from around mid/end of 2019)

- Yes, I had the system overclocked, but only via the multiplier. The BCLK was 100 MHz with Spread Spectrum turned off. For the tests that follow, however, I turned off all overclocking.

- My PSU (Enermax Platimax D.F. 1200W) is around 2-3 years old and "completely overpowered" for my system (got this from an RMA). However I can not speak for the voltage stability (did not check with an oscilloscope yet)...

- ... but I hooked up the SSD to a second PSU dedicated to this SSD.

Anyway, the errors still seemed to appear, no matter whether NCQ was on or off.

A big HOWEVER now: I seem to have found at least the reason for **my** issues: the kernel module I am working on right now seems to cause the problem for me (https://github.com/systemofapwne/mousedriver/issues/3). Possibly this is due to the FPU arithmetic (even though I use kernel_fpu_begin()/kernel_fpu_end() where it matters).

When this kernel module is not loaded, the Samsung 870 EVO SSD on the Intel controller with NCQ turned on (queue_depth = 32) reproducibly did not cause BTRFS checksum errors, while with my kernel module loaded it did.
Comment 68 andreas 2021-03-22 10:51:13 UTC
I am experiencing the same bug with Samsung PM883 drives. 2TB and 4TB models.

Device Model:     SAMSUNG MZ7LH3T8HMLT-00005
Firmware Version: HXT7404Q

> On 8. Dec 2020, at 20:19, bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=201693
> 
> --- Comment #24 from Sitsofe Wheeler (sitsofe@yahoo.com) ---
> Can people who are seeing this report which model (e.g. SSD 860 EVO 500GB) 
> firmware (e.g. RVT01B6Q) and PCI card (e.g. AMD SB7x0/SB8x0/SB9x0 ) they
> have?
> In my case smartctl -a <dev> reports that I'm on RVT01B6Q firmware which is
> apparently behind the latest (RVT04B6Q) listed on
> https://www.samsung.com/semiconductor/minisite/ssd/download/tools/ . If folks
> are feeling brave and can take the risk can they report if the issue is still
> reproduced on the latest firmware?
Comment 69 Klaus Zipfel 2021-03-23 23:10:53 UTC
Adding to my previous comment #67: The error with the data corruption definitely was on my side and unrelated to this issue. I am sorry for taking up your precious time.

I can confirm that the Samsung SSD 870 EVO now seems to work without corrupting my filesystem, not only on the Intel controller but also on the ASMedia one.

Note: Trim is still off and NCQ on.
Comment 70 Alejandro Donato 2021-03-24 23:47:51 UTC
Maybe this info helps.

In my tests, data corruption shows up in 2 out of 10 runs, under heavy load 
(moving and deleting multiple big files), using the SSD as a cache for 
a mechanical drive.

In fact, I noticed the bad performance while working with this kind of 
cache array.

Obviously, the other tests (as a single drive) show the real issue.

Comment 71 Hans de Goede 2021-08-30 15:14:47 UTC
As already mentioned in bug 203475 we have been working towards a solution for this:

"""
So after completely re-reading / analyzing both this bug as well as bug 201693 with a fresh pair of eyes (since the last time I did this was a long time ago) I agree. After careful reading / analysis it seems that there really are 2 different bugs here impacting both the 860 EVO and the 870 EVO:

1. Queued Trim commands are causing issues on Intel + ASmedia + Marvell controllers

2. Things are seriously broken on AMD controllers and only completely disabling NCQ altogether helps there.
"""

A patch implementing 1. has been submitted upstream a week ago here:
https://lore.kernel.org/linux-ide/20210823095220.30157-1-hdegoede@redhat.com/T/#u

And a patch implementing 2. was just submitted upstream:
https://lore.kernel.org/linux-ide/54f63e11-e421-0fa6-80e1-297287dc0974@redhat.com/

Together these should resolve (work around) this issue for most users.
Comment 72 Hans de Goede 2021-09-01 09:42:28 UTC
Hi All,

So there is now a report in the upstream patch discussions of a user with an 860 PRO on an X570 AMD motherboard who is not seeing any issues.

I replied the following to this: "The problem is that when users are hit by this they end up with a non-functional system and even fs / data corruption. Whereas OTOH disabling NCQ leads to a (significant) performance degradation but affected systems will still work fine.

So I believe that it is best to err on the safe side here and accept the performance degradation as a trade-off for fixing the fs / data corruption."

With that said, I would still like to try to narrow the set of AMD boards on which we disable NCQ in combination with an 860 or 870 drive.

If you have a Samsung 860 or 870 SSD with an AMD motherboard and you need to disable NCQ / set the queue depth to 1 to make it work reliably, can you then please provide a comment (or attachment) with the output of:

lspci -nn

Run on the troublesome AMD motherboard?

The goal is to see if we can build a set of affected AMD SATA controller PCI product IDs, so that the kernel patch which automatically disables NCQ can be made narrower.
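
For anyone collating the replies, here is a minimal sketch of how the SATA controller IDs could be pulled out of such a listing. It assumes the text is plain "lspci -nn" output; the function name sata_controller_ids and the SAMPLE string are made up for illustration, and only lines of class "SATA controller [0106]" are considered.

```python
import re

# Matches an `lspci -nn` line of class "SATA controller [0106]" and captures
# the last [vendor:device] pair on the line (a trailing "(rev xx)" is ignored).
SATA_LINE = re.compile(
    r"^(?P<slot>\S+) SATA controller \[0106\]: .* "
    r"\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
)


def sata_controller_ids(lspci_output):
    """Return a list of (slot, 'vendor:device') for each SATA controller line."""
    found = []
    for line in lspci_output.splitlines():
        match = SATA_LINE.match(line.strip())
        if match:
            found.append((match["slot"], f"{match['vendor']}:{match['device']}"))
    return found


# Illustrative input only; real input would be the saved output of `lspci -nn`.
SAMPLE = """\
00:11.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] [1002:4391] (rev 40)
00:12.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
02:00.0 SATA controller [0106]: ASMedia Technology Inc. ASM1062 Serial ATA Controller [1b21:0612] (rev 01)
"""

if __name__ == "__main__":
    for slot, pci_id in sata_controller_ids(SAMPLE):
        print(slot, pci_id)
```

The bracketed "[AHCI mode]" text is skipped because only a bracket holding two 4-digit hex values separated by a colon is accepted as the ID.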
Comment 73 Mike Kazantsev 2021-09-01 10:11:52 UTC
Full "lspci -nn" output on this workstation:

  00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD/ATI] RX780/RX790 Host Bridge [1002:5957]
  00:02.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] RX780/RD790 PCI to PCI bridge (external gfx0 port A) [1002:5978]
  00:0a.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] RD790 PCI to PCI bridge (PCI express gpp port F) [1002:597f]
  00:11.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] [1002:4391] (rev 40)
  00:12.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
  00:12.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
  00:13.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
  00:13.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
  00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller [1002:4385] (rev 42)
  00:14.2 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 Azalia (Intel HDA) [1002:4383] (rev 40)
  00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller [1002:439d] (rev 40)
  00:14.4 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 PCI to PCI Bridge [1002:4384] (rev 40)
  00:14.5 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller [1002:4399]
  00:15.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0) [1002:43a0]
  00:16.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
  00:16.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
  00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 10h Processor HyperTransport Configuration [1022:1200]
  00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 10h Processor Address Map [1022:1201]
  00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 10h Processor DRAM Controller [1022:1202]
  00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 10h Processor Miscellaneous Control [1022:1203]
  00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 10h Processor Link Control [1022:1204]
  01:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 06)
  02:07.0 Ethernet controller [0200]: VIA Technologies, Inc. VT6105/VT6106S [Rhine-III] [1106:3106] (rev 8b)
  03:00.0 USB controller [0c03]: NEC Corporation uPD720200 USB 3.0 Host Controller [1033:0194] (rev 03)
  04:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon RX 550 640SP / RX 560/560X] [1002:67ff] (rev cf)
  04:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1002:aae0]

Connected drive info from smartctl:

  Device Model: Samsung SSD 860 EVO 500GB
  Firmware Version: RVT04B6Q
  ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
  SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)

Tried patching drivers/ata/libata-core.c on this setup first, with:

+  { "Samsung SSD 860*",           NULL,   ATA_HORKAGE_NO_NCQ_TRIM |
+                                          ATA_HORKAGE_ZERO_AFTER_TRIM, },

IIRC I confirmed in dmesg that these were applied, but that didn't help, so I have used "libata.force=2.00:noncq" from then on, plus a fallback hack to do it by device-id via sysfs on early boot, just in case. That seems to remove the issues, and of course I'm fine with this trade-off, considering the alternative.
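
A rough sketch of what a sysfs fallback of this kind could look like (not the actual script used here). It assumes each disk exposes device/model and device/queue_depth under /sys/block, that writing 1 to queue_depth is an acceptable stand-in for disabling NCQ, and that the model string in sysfs may be truncated, hence the prefix match. The function name limit_queue_depth is hypothetical.

```python
from pathlib import Path


def limit_queue_depth(model_prefix, sys_block=Path("/sys/block"), depth=1):
    """Write `depth` to queue_depth for every disk whose model starts with
    `model_prefix`; return the names of the block devices that were changed."""
    changed = []
    for dev in sorted(sys_block.iterdir()):
        model_file = dev / "device" / "model"
        if not model_file.is_file():
            continue  # e.g. loop or device-mapper nodes have no device/model
        if model_file.read_text().strip().startswith(model_prefix):
            (dev / "device" / "queue_depth").write_text(f"{depth}\n")
            changed.append(dev.name)
    return changed


# Assumed usage, as root and early in boot:
#   limit_queue_depth("Samsung SSD 860")
```

The sys_block parameter exists only so the function can be pointed at a scratch directory for a dry run instead of the real /sys/block.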

I've also seen the same issue on a different (old, but slightly less so) AMD chipset with a similar Samsung drive, when dd'ing a Windows install to it via a live USB. I might get an lspci off it later today, when I'm near it. I think Windows works there without complaining, but maybe also with its own hacks and/or suboptimally.
Comment 74 nejrobbins 2021-09-01 22:53:28 UTC
I'm not exactly sure how to interpret the different values in the output of lspci -nn, and my SATA controller seems the same as yours, but I'll post my full output just in case. This is a 970 FX board. 

My lspci -nn output:
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD/ATI] RD9x0/RX980 Host Bridge [1002:5a14] (rev 02)
00:02.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] RD890/RD9x0/RX980 PCI to PCI bridge (PCI Express GFX port 0) [1002:5a16]
00:09.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] RD890/RD9x0/RX980 PCI to PCI bridge (PCI Express GPP Port 4) [1002:5a1c]
00:11.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] [1002:4391] (rev 40)
00:12.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
00:12.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
00:13.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
00:13.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller [1002:4385] (rev 42)
00:14.2 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 Azalia (Intel HDA) [1002:4383] (rev 40)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller [1002:439d] (rev 40)
00:14.4 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 PCI to PCI Bridge [1002:4384] (rev 40)
00:14.5 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller [1002:4399]
00:15.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0) [1002:43a0]
00:16.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
00:16.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 0 [1022:1600]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 1 [1022:1601]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 2 [1022:1602]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 3 [1022:1603]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 4 [1022:1604]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 5 [1022:1605]
01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] [1002:67df] (rev ef)
01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590] [1002:aaf0]
02:00.0 USB controller [0c03]: Etron Technology, Inc. EJ188/EJ198 USB 3.0 Host Controller [1b6f:7052]
04:00.0 Network controller [0280]: Qualcomm Atheros AR9287 Wireless Network Adapter (PCI-Express) [168c:002e] (rev 01)

Device Model:     Samsung SSD 860 EVO 250GB
Firmware Version: RVT04B6Q
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Comment 75 Mike Kazantsev 2021-09-02 04:12:18 UTC
"lspci -nn" from the other AMD mobo where I've seen this:

  00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD/ATI] RD9x0/RX980 Host Bridge [1002:5a14] (rev 02)
  00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD/ATI] RD890S/RD990 I/O Memory Management Unit (IOMMU) [1002:5a23]
  00:02.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] RD890/RD9x0/RX980 PCI to PCI bridge (PCI Express GFX port 0) [1002:5a16]
  00:04.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] RD890/RD9x0/RX980 PCI to PCI bridge (PCI Express GPP Port 0) [1002:5a18]
  00:09.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] RD890/RD9x0/RX980 PCI to PCI bridge (PCI Express GPP Port 4) [1002:5a1c]
  00:0a.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] RD890/RD9x0/RX980 PCI to PCI bridge (PCI Express GPP Port 5) [1002:5a1d]
  00:11.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] [1002:4391] (rev 40)
  00:12.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
  00:12.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
  00:13.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
  00:13.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
  00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller [1002:4385] (rev 42)
  00:14.2 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 Azalia (Intel HDA) [1002:4383] (rev 40)
  00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller [1002:439d] (rev 40)
  00:14.4 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 PCI to PCI Bridge [1002:4384] (rev 40)
  00:14.5 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller [1002:4399]
  00:16.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
  00:16.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
  00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 0 [1022:1600]
  00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 1 [1022:1601]
  00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 2 [1022:1602]
  00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 3 [1022:1603]
  00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 4 [1022:1604]
  00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 5 [1022:1605]
  01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] [1002:67df] (rev e7)
  01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590] [1002:aaf0]
  02:00.0 USB controller [0c03]: Etron Technology, Inc. EJ168 USB 3.0 Host Controller [1b6f:7023] (rev 01)
  03:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 06)
  04:00.0 USB controller [0c03]: Etron Technology, Inc. EJ168 USB 3.0 Host Controller [1b6f:7023] (rev 01)

Seem to be exactly same SSD there:

  Model Family:     Samsung based SSDs
  Device Model:     Samsung SSD 860 EVO 500GB
  Firmware Version: RVT04B6Q
  ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
  SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Comment 76 Alejandro Donato 2021-09-02 15:29:00 UTC
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] RS780 Host Bridge [1022:9600]
00:02.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] RS780 PCI to PCI bridge (ext gfx port 0) [1022:9603]
00:07.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] RS780/RS880 PCI to PCI bridge (PCIE port 3) [1022:9607]
00:09.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] RS780/RS880 PCI to PCI bridge (PCIE port 4) [1022:9608]
00:0a.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] RS780/RS880 PCI to PCI bridge (PCIE port 5) [1022:9609]
00:11.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] [1002:4391]
00:12.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
00:12.1 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0 USB OHCI1 Controller [1002:4398]
00:12.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
00:13.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
00:13.1 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0 USB OHCI1 Controller [1002:4398]
00:13.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller [1002:4385] (rev 3c)
00:14.1 IDE interface [0101]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 IDE Controller [1002:439c]
00:14.2 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 Azalia (Intel HDA) [1002:4383]
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller [1002:439d]
00:14.4 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 PCI to PCI Bridge [1002:4384]
00:14.5 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller [1002:4399]
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 0 [1022:1600]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 1 [1022:1601]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 2 [1022:1602]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 3 [1022:1603]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 4 [1022:1604]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 5 [1022:1605]
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GF119 [GeForce GT 610] [10de:104a] (rev a1)
01:00.1 Audio device [0403]: NVIDIA Corporation GF119 HDMI Audio Controller [10de:0e08] (rev a1)
02:00.0 SATA controller [0106]: ASMedia Technology Inc. ASM1062 Serial ATA Controller [1b21:0612] (rev 01)
03:00.0 USB controller [0c03]: Etron Technology, Inc. EJ188/EJ198 USB 3.0 Host Controller [1b6f:7052]
04:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 06)

Thx for your work on this!

On 2/9/21 at 01:12, bugzilla-daemon@bugzilla.kernel.org wrote:
> lspci -nn
Comment 77 Krzysztof Oledzki 2021-09-02 16:02:27 UTC
Dell Optiplex 580 w/AMD 785G+SB710 is also impacted by this issue.

What seems to be in common is 1002:4391, but there are several more board_ahci_sb700 (and also sb600) devices in linux/drivers/ata/ahci.c

        { PCI_VDEVICE(ATI, 0x4380), board_ahci_sb600 }, /* ATI SB600 */
        { PCI_VDEVICE(ATI, 0x4390), board_ahci_sb700 }, /* ATI SB700/800 */
        { PCI_VDEVICE(ATI, 0x4391), board_ahci_sb700 }, /* ATI SB700/800 */
        { PCI_VDEVICE(ATI, 0x4392), board_ahci_sb700 }, /* ATI SB700/800 */
        { PCI_VDEVICE(ATI, 0x4393), board_ahci_sb700 }, /* ATI SB700/800 */
        { PCI_VDEVICE(ATI, 0x4394), board_ahci_sb700 }, /* ATI SB700/800 */
        { PCI_VDEVICE(ATI, 0x4395), board_ahci_sb700 }, /* ATI SB700/800 */

I wonder if the problem is really AMD or "ATI AMD".
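
To make that comparison concrete, here is a small sketch that checks a reported "vendor:device" pair against the entries quoted above. The table is copied from the ahci.c excerpt in this comment only (it is not a complete copy of the driver's ID table), 0x1002 is the PCI vendor ID that PCI_VDEVICE(ATI, ...) expands to, and the function name ati_ahci_board is made up.

```python
ATI_VENDOR_ID = 0x1002  # vendor ID used by PCI_VDEVICE(ATI, ...)

# Device IDs taken from the ahci.c excerpt quoted in this comment.
ATI_AHCI_BOARDS = {
    0x4380: "board_ahci_sb600",
    0x4390: "board_ahci_sb700",
    0x4391: "board_ahci_sb700",
    0x4392: "board_ahci_sb700",
    0x4393: "board_ahci_sb700",
    0x4394: "board_ahci_sb700",
    0x4395: "board_ahci_sb700",
}


def ati_ahci_board(pci_id):
    """Map an lspci-style 'vendor:device' string to the board type listed
    above, or return None if it is not one of the quoted ATI entries."""
    vendor, device = (int(part, 16) for part in pci_id.split(":"))
    if vendor != ATI_VENDOR_ID:
        return None
    return ATI_AHCI_BOARDS.get(device)
```

With this, the 1002:4391 controller common to the reports so far maps to board_ahci_sb700, while an ASMedia 1b21:0612 controller maps to nothing.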

00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] RS880 Host Bridge [1022:9601]
00:02.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] RS780 PCI to PCI bridge (ext gfx port 0) [1022:9603]
00:04.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] RS780/RS880 PCI to PCI bridge (PCIE port 0) [1022:9604]
00:09.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] RS780/RS880 PCI to PCI bridge (PCIE port 4) [1022:9608]
00:11.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] [1002:4391]
00:12.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
00:12.1 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0 USB OHCI1 Controller [1002:4398]
00:12.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
00:13.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
00:13.1 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0 USB OHCI1 Controller [1002:4398]
00:13.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller [1002:4385] (rev 3c)
00:14.1 IDE interface [0101]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 IDE Controller [1002:439c]
00:14.2 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 Azalia (Intel HDA) [1002:4383]
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller [1002:439d]
00:14.4 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 PCI to PCI Bridge [1002:4384]
00:14.5 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller [1002:4399]
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 10h Processor HyperTransport Configuration [1022:1200]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 10h Processor Address Map [1022:1201]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 10h Processor DRAM Controller [1022:1202]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 10h Processor Miscellaneous Control [1022:1203]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 10h Processor Link Control [1022:1204]
01:00.0 Ethernet controller [0200]: Mellanox Technologies MT27500 Family [ConnectX-3] [15b3:1003]
02:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] RV620 LE [Radeon HD 3450] [1002:95c5]
03:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5761 Gigabit Ethernet PCIe [14e4:1681] (rev 10)
Comment 78 Hans de Goede 2021-09-02 16:09:47 UTC
(In reply to Krzysztof Oledzki from comment #77)
> Dell Optiplex 580 w/AMD 785G+SB710 is also impacted by this issue.
> 
> What seems to be in common is 1002:4391, but there are several more
> board_ahci_sb700 (and also sb600) devices in linux/drivers/ata/ahci.c
> 
>         { PCI_VDEVICE(ATI, 0x4380), board_ahci_sb600 }, /* ATI SB600 */
>         { PCI_VDEVICE(ATI, 0x4390), board_ahci_sb700 }, /* ATI SB700/800 */
>         { PCI_VDEVICE(ATI, 0x4391), board_ahci_sb700 }, /* ATI SB700/800 */
>         { PCI_VDEVICE(ATI, 0x4392), board_ahci_sb700 }, /* ATI SB700/800 */
>         { PCI_VDEVICE(ATI, 0x4393), board_ahci_sb700 }, /* ATI SB700/800 */
>         { PCI_VDEVICE(ATI, 0x4394), board_ahci_sb700 }, /* ATI SB700/800 */
>         { PCI_VDEVICE(ATI, 0x4395), board_ahci_sb700 }, /* ATI SB700/800 */
> 
> I wonder if the problem is really AMD or "ATI AMD".

I agree, it seems like we need to change the kernel patch to automatically disable NCQ on Samsung 860 and 870 drives when the SATA controller's vendor-id == 0x1002.

Is anyone seeing the issue where NCQ needs to be completely disabled / queue-depth needs to be set to 1 on a motherboard where "lspci -nn" shows 1022 as the vendor-id for the SATA controller?
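
For anyone wanting to check this on their own machine: the SATA controller's vendor-id can be pulled out of the "lspci -nn" output mechanically. This is only a minimal sketch, assuming the one-line-per-device format pasted in this bug; the function name sata_vendor_ids is made up for illustration.

```python
import re

# Matches lines such as
#   "00:11.0 SATA controller [0106]: ... [AHCI mode] [1002:4391]"
# [0106] is the PCI class code for a SATA controller; the last
# [vvvv:dddd] pair on the line is the vendor:device id.
SATA_LINE = re.compile(
    r"^(\S+) SATA controller \[0106\]: .*\[([0-9a-f]{4}):([0-9a-f]{4})\]"
)

def sata_vendor_ids(lspci_nn_output):
    """Return (slot, vendor_id, device_id) for every SATA controller line."""
    found = []
    for line in lspci_nn_output.splitlines():
        m = SATA_LINE.match(line.strip())
        if m:
            found.append((m.group(1), m.group(2), m.group(3)))
    return found
```

A vendor_id of "1002" is the older ATI id discussed above, while "1022" is the newer AMD id. The input can be obtained by running "lspci -nn" and passing its text output to the function.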
Comment 79 Matt Whitlock 2021-09-02 17:17:42 UTC
I've been running a Samsung SSD 860 PRO 512GB on an Intel NM10/ICH7 SATA controller for over two years now with zero problems.

I've also been running a Samsung SSD 860 EVO 2TB on the same controller for the past 10 months and have had no problems with it either.

The PRO has partitions that are members of RAID1 mdraid volumes, whose contained file systems are mounted with "-o discard". The EVO has partitions that are members of the same mdraid volumes and also a partition that is a member of a RAID1 mdraid volume that contains a LUKS volume that has the "allow-discards" flag enabled and whose contained file system is mounted with "-o discard".

I'm currently running Linux version 5.10.52-gentoo. The only blacklist entry in libata-core.c that matches my Samsung SSDs sets ATA_HORKAGE_ZERO_AFTER_TRIM (which is actually a good thing[1], not really a horkage), so I assume queued TRIM is enabled on both.

I am somewhat sad to see the baby thrown out with the bath water in this latest round of patches. I am fortunate enough to have found this bug report and to be paying attention so I can apply a reverse patch to avoid taking a performance hit going forward. Others will not be so lucky.

__________

[1] https://patchwork.ozlabs.org/project/linux-ide/patch/1420727311-7066-1-git-send-email-martin.petersen@oracle.com/
Comment 80 Hans de Goede 2021-09-02 20:14:12 UTC
(In reply to Matt Whitlock from comment #79)
> I am somewhat sad to see the baby thrown out with the bath water in this
> latest round of patches. I am fortunate enough to have found this bug
> report and to be paying attention so I can apply a reverse patch to avoid
> taking a performance hit going forward. Others will not be so lucky.

In case of an Intel SATA controller we will only be disabling queued trim commands, while otherwise leaving NCQ fully enabled. The chances of you actually noticing any performance difference from this are pretty small.

One of the reasons why this bug has been open so long is to avoid causing performance regressions on unaffected systems.
Comment 81 Gregory P. Smith 2021-09-02 20:16:35 UTC
My 1T Samsung 870 is connected to the Marvell controller below.

This is an HPE MicroServer Gen10.  All four drive bays use that controller.  This is a fairly popular, common, and affordable home/office server machine.

$ lspci -nn
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Root Complex [1022:1576]
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) I/O Memory Management Unit [1022:1577]
00:01.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] [1002:9874] (rev 84)
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Host Bridge [1022:157b]
00:02.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Root Port [1022:157c]
00:02.5 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Root Port [1022:157c]
00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Host Bridge [1022:157b]
00:08.0 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Carrizo Platform Security Processor [1022:1578]
00:09.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Carrizo Audio Dummy Host Bridge [1022:157d]
00:10.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] FCH USB XHCI Controller [1022:7914] (rev 20)
00:11.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 49)
00:12.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller [1022:7908] (rev 49)
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 4a)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 11)
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 0 [1022:1570]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 1 [1022:1571]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 2 [1022:1572]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 3 [1022:1573]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 4 [1022:1574]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 5 [1022:1575]
01:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9230 PCIe SATA 6Gb/s Controller [1b4b:9230] (rev 11)
02:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe [14e4:165f]
02:00.1 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe [14e4:165f]
Comment 82 Gregory P. Smith 2021-09-02 20:18:49 UTC
booting with libata.force=4:noncqtrim has made the just posted setup reliable for me.
Comment 83 Hans de Goede 2021-09-02 20:27:02 UTC
(In reply to Gregory P. Smith from comment #81)
> My 1T Samsung 870 is connected to the Marvell controller below.

Thank you for the lspci output.

If I'm reading your comment 41 right, just disabling queued trim ("noncqtrim" option) is enough to make things work in that setup, correct?

This matches all the other reports where the "noncqtrim" option is sufficient to make things work normally, except on some AMD/ATI SATA controllers.

There already is a patch pending upstream to make noncqtrim the default on all Samsung 860 and 870 SSDs independent of the used controller:
https://lore.kernel.org/linux-ide/20210823095220.30157-1-hdegoede@redhat.com/T/#u

The reason I was asking for lspci output is that for some users with AMD/ATI SATA controllers the "noncqtrim" option is not enough to get things stable. They need "noncq", which is a much bigger hammer, so the plan is to limit that to only certain SATA controllers (or certain SATA controller vendor-ids).
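
To make the difference between the two workarounds concrete, here is a small sketch that builds the boot parameter in the forms that have appeared in this bug (global, per port, per port.device). The helper name libata_force is hypothetical; only the "noncqtrim" and "noncq" values and the "ID:VALUE" comma-separated syntax are taken from the comments here.

```python
def libata_force(option, ports=None):
    """Build a libata.force= kernel command line parameter.

    option: "noncqtrim" disables only queued TRIM,
            "noncq" disables NCQ entirely (the bigger hammer).
    ports:  optional list of ids such as "4" or "1.00";
            None or an empty list applies the option to all devices.
    """
    if not ports:
        return f"libata.force={option}"
    return "libata.force=" + ",".join(f"{port}:{option}" for port in ports)
```

For example, libata_force("noncqtrim", ["4"]) yields the parameter reported as working in comment 82.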
Comment 84 Hans de Goede 2021-09-02 20:27:45 UTC
(In reply to Gregory P. Smith from comment #82)
> booting with libata.force=4:noncqtrim has made the just posted setup
> reliable for me.

Ah looks like our comments crossed, thanks for confirming that.
Comment 85 Hans de Goede 2021-09-03 20:54:18 UTC
The patches for both this bug (using the ATI 0x1002 vendor id for the check) as well as for bug 203475 have been merged into:

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/?h=for-next

So they are on their way to Linus, closing.
Comment 86 Boann 2021-09-06 05:48:15 UTC
Just to add a confusing data point, my Samsung 860 EVO and AMD SATA controller work perfectly together.

SSD info, from smartctl:

Device Model:     Samsung SSD 860 EVO 1TB
Firmware Version: RVT04B6Q
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)

NCQ is definitely enabled, according to the dmesg log:

[    2.897566] ata4.00: ATA-11: Samsung SSD 860 EVO 1TB, RVT04B6Q, max UDMA/133
[    2.897568] ata4.00: 1953525168 sectors, multi 1: LBA48 NCQ (depth 32), AA
[    2.899995] ata4.00: supports DRM functions and may not be fully accessible
[    2.902825] ata4.00: configured for UDMA/133

Kernel version, according to uname:

Linux 4.19.0-17-amd64 #1 SMP Debian 4.19.194-3 (2021-07-18) x86_64 GNU/Linux

SATA controller info, according to `lspci -v -nn`:

15:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller [1022:43c8] (rev 01) (prog-if 01 [AHCI 1.0])
        Subsystem: ASMedia Technology Inc. 400 Series Chipset SATA Controller [1b21:1062]
        Flags: bus master, fast devsel, latency 0, IRQ 40
        Memory at fce80000 (32-bit, non-prefetchable) [size=128K]
        Expansion ROM at fce00000 [disabled] [size=512K]
        Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [78] Power Management version 3
        Capabilities: [80] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Kernel driver in use: ahci
        Kernel modules: ahci

39:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 61) (prog-if 01 [AHCI 1.0])
        Subsystem: Micro-Star International Co., Ltd. [MSI] FCH SATA Controller [AHCI mode] [1462:7b79]
        Flags: bus master, fast devsel, latency 0, IRQ 44
        Memory at fcf00000 (32-bit, non-prefetchable) [size=2K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [64] Express Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable+ Count=1/2 Maskable- 64bit+
        Capabilities: [d0] SATA HBA v1.0
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [270] #19
        Kernel driver in use: ahci
        Kernel modules: ahci

---

I've been using this SSD for nine months daily without ever seeing this error or any I/O issues.

I don't know if I've ever used "queued trim". I know periodic trim with fstrim is enabled and runs weekly without hiccups.

I ran `zgrep FPDMA /var/log/*` to see if there was anything logged there (kernel logs going back 3 weeks), and there is not a single line reported.

I also occasionally sync my system drive to an external backup HD and run checksums over all files to compare, so I can detect if a single bit flips, and with this SSD, it never has.

Sorry if this seems so selfish, when so many people are struggling with this mysterious, alarming, infuriating bug. But my own issue is rather the opposite: When my distro's kernel receives this patch to disable NCQ, will there be an easy way I can override to re-enable it? I know my current system configuration is fine.
Comment 87 Krzysztof Oledzki 2021-09-06 06:17:58 UTC
No confusion, we established that the problem is limited to "ATI AMD" AHCI controllers - 0x1002, not "Modern AMD" - 0x1022. You seem to be using 1022:43c8 / 1022:7901 so nothing should change for you.

However, we are still disabling NCQ TRIM for Samsung SSD 840/850/860/870 on all controllers, but this is expected to have a negligible performance impact.

See also:
 https://bugzilla.kernel.org/show_bug.cgi?id=203475#c49
 https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=libata-5.15&id=7a8526a5cd51cf5f070310c6c37dd7293334ac49
 https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=libata-5.15&id=8a6430ab9c9c87cb64c512e505e8690bbaee190b

BTW: we now also have the "ncqati" flag, which allows re-enabling NCQ.
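
The policy described in this comment can be written down as a small decision function. This is purely an illustration of what has been said in this bug, not the kernel's code: the function name and the string labels are invented, and the model check is a simplification of the kernel's model-string matching.

```python
import re

ATI_VENDOR_ID = 0x1002  # "ATI AMD" controllers such as 1002:4391

def expected_workarounds(model, sata_vendor_id):
    """Return the workarounds this bug says apply to a drive/controller pair."""
    m = re.match(r"Samsung SSD (\d{3})", model)
    if not m:
        return set()
    family = m.group(1)
    result = set()
    # Queued TRIM is disabled on all controllers for these families.
    if family in {"840", "850", "860", "870"}:
        result.add("no queued TRIM")
    # NCQ itself is disabled only for 860/870 behind an ATI (0x1002) controller.
    if family in {"860", "870"} and sata_vendor_id == ATI_VENDOR_ID:
        result.add("no NCQ")
    return result
```

With vendor-id 0x1022 (as in comment 86) an 860 EVO only loses queued TRIM and keeps NCQ.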
Comment 88 Boann 2021-09-06 13:57:24 UTC
(In reply to Krzysztof Oledzki from comment #87)
> No confusion, we established that the problem is limited to "ATI AMD" AHCI
> controllers - 0x1002, not "Modern AMD" - 0x1022.
>
> BTW: we also have now "ncqati" flag allowing to re-enable NCQ.

Oh, well, then that's excellent. Thank you sirs, for your thoughtful and careful handling of this bug.
Comment 89 Andrew Filippov 2021-09-06 20:39:20 UTC
(In reply to Krzysztof Oledzki from comment #87)
> No confusion, we established that the problem is limited to "ATI AMD" AHCI
> controllers - 0x1002, not "Modern AMD" - 0x1022. You seem to be using
> 1022:43c8 / 1022:7901 so nothing should change for you.
> 
> However, we are still disabling NCQ TRIM for Samsung SSD 840/850/860/870 on
> all controllers, but this is expected to have neglectable perf impact.
> 
> See also:
>  https://bugzilla.kernel.org/show_bug.cgi?id=203475#c49
>  https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/
> commit/?h=libata-5.15&id=7a8526a5cd51cf5f070310c6c37dd7293334ac49
>  https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/
> commit/?h=libata-5.15&id=8a6430ab9c9c87cb64c512e505e8690bbaee190b
> 
> BTW: we also have now "ncqati" flag allowing to re-enable NCQ.

Thanks for the information.

What options need to be passed to the 5.15+ kernel via "libata.force=" for ATA TRIM to fully work as before on Samsung EVO 860/870?
Comment 91 QK 2021-10-06 21:43:19 UTC
I'm experiencing regular system freezes with both Samsung 860 and 870 EVO running BTRFS and Kernel 5.13. Everything usually runs smoothly for some hours and then the system freezes and can only be recovered by a forced power switch off.

I have an AMD SATA controller of the type 1022 (which is supposed to work well, based on what I read above).

'sudo smartctl -a /dev/sda' tells me:
"SMART overall-health self-assessment test result: PASSED"

I saw here that the blacklisting of Samsung 860 and 870 will occur in Kernel 5.15: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/ata/libata-core.c?h=v5.15-rc4

I have 2 questions:
1. Do you think that the root cause of those system freezes could be indeed the "queued TRIM" issue with those SSDs?
2. How could I manually switch "queued TRIM" off before kernel 5.15 is released? Is that accomplished by setting "libata.force=noncqtrim" as a boot parameter?
Comment 92 Krzysztof Oledzki 2021-10-07 02:12:31 UTC
First, I think we continue to mix two different issues - general NCQ on "AMD ATI AHCI" (this bug) and NCQ trim - https://bugzilla.kernel.org/show_bug.cgi?id=203475

Now... 5.13 was a non-longterm stable kernel and is now EOL, with 5.13.19 being the last version. However, 5.13.19 does include the "libata: add ATA_HORKAGE_NO_NCQ_TRIM for Samsung 860 and 870 SSDs" fix: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.13.y&id=e8d5567d9f6c5946dca0b17b43f101e0875ce5ac which should disable NCQ TRIM for you.

So, if NCQ TRIM is the reason for your problem, you don't have to wait for 5.15, just update the kernel to 5.13.19.

Longer term, you should consider updating to 5.14-stable (5.14.9, soon 5.14.10 - https://lwn.net/ml/linux-kernel/20211006073100.650368172@linuxfoundation.org/) as this fix has been included since 5.14.6 - https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.14.6

But again, this is https://bugzilla.kernel.org/show_bug.cgi?id=203475
Comment 93 Roman Mamedov 2021-10-07 06:19:41 UTC
> Do you think that the root cause of those system freezes could be indeed the
> "queued TRIM" issue with those SSDs?

It does not appear to be the case. I believe all of these NCQ issues would typically lead only to very long delays, as the SATA device returns those errors but then recovers, not a complete hang. And especially I would not expect the Xorg mouse cursor to stop moving (if that's the case for you), while the kernel is waiting for the SATA timeouts. Clicks on the on-screen elements might not register, but the cursor should keep moving. Also check if you can toggle NumLock light on the keyboard during the hang. If it's this SATA issue, it should absolutely work.

To be completely sure, look up a manual for how to set up "netconsole", that way you could hopefully save the latest dmesg messages before the hang to another computer and see exactly what's the cause.
Comment 94 Roy 2021-10-12 12:47:20 UTC
Upgraded to 5.14.9 (Fedora), and thought I'd remove the "libata.force=noncq" kernel parameter. Alas, still trouble!

[  284.042684] ata1.00: exception Emask 0x10 SAct 0x70000001 SErr 0x0 action 0x6 frozen
[  284.042694] ata1.00: irq_stat 0x08000000, interface fatal error
[  284.042698] ata1.00: failed command: WRITE FPDMA QUEUED
[  284.042700] ata1.00: cmd 61/20:00:d0:7a:a6/00:00:2b:00:00/40 tag 0 ncq dma 16384 out
                        res 40/00:e0:a8:21:02/00:00:49:00:00/40 Emask 0x10 (ATA bus error)
[  284.042711] ata1.00: status: { DRDY }
[  284.042714] ata1.00: failed command: WRITE FPDMA QUEUED
[  284.042716] ata1.00: cmd 61/10:e0:a8:21:02/00:00:49:00:00/40 tag 28 ncq dma 8192 out
                        res 40/00:e0:a8:21:02/00:00:49:00:00/40 Emask 0x10 (ATA bus error)
[  284.042726] ata1.00: status: { DRDY }
[  284.042729] ata1.00: failed command: WRITE FPDMA QUEUED
[  284.042731] ata1.00: cmd 61/08:e8:a8:fc:33/00:00:2c:00:00/40 tag 29 ncq dma 4096 out
                        res 40/00:e0:a8:21:02/00:00:49:00:00/40 Emask 0x10 (ATA bus error)
[  284.042739] ata1.00: status: { DRDY }
[  284.042742] ata1.00: failed command: WRITE FPDMA QUEUED
[  284.042744] ata1.00: cmd 61/10:f0:c0:fc:33/00:00:2c:00:00/40 tag 30 ncq dma 8192 out
                        res 40/00:e0:a8:21:02/00:00:49:00:00/40 Emask 0x10 (ATA bus error)
[  284.042752] ata1.00: status: { DRDY }
[  284.042756] ata1: hard resetting link
[  284.506747] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  284.507053] ata1.00: supports DRM functions and may not be fully accessible
[  284.507946] ata1.00: disabling queued TRIM support
[  284.510247] ata1.00: supports DRM functions and may not be fully accessible
[  284.510977] ata1.00: disabling queued TRIM support
[  284.512930] ata1.00: configured for UDMA/133
[  284.512963] ata1: EH complete
[  284.513372] ata1.00: Enabling discard_zeroes_data

I'm fairly sure it's connected to the AMD SATA controller in my system. For completeness:
[user@Host ~]$ lspci | grep -i sata
00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] (rev 40)
04:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 01)

[user@Host ~]$ lspci -vvvn # manually filtered
00:11.0 0106: 1002:4391 (rev 40) (prog-if 01 [AHCI 1.0])
	Subsystem: 1043:84dd
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 32
	Interrupt: pin A routed to IRQ 19
	NUMA node: 0
	IOMMU group: 6
	Region 0: I/O ports at f040 [size=8]
	Region 1: I/O ports at f030 [size=4]
	Region 2: I/O ports at f020 [size=8]
	Region 3: I/O ports at f010 [size=4]
	Region 4: I/O ports at f000 [size=16]
	Region 5: Memory at fe60b000 (32-bit, non-prefetchable) [size=1K]
	Capabilities: <access denied>
	Kernel driver in use: ahci

04:00.0 0106: 1b21:0612 (rev 01) (prog-if 01 [AHCI 1.0])
	Subsystem: 1043:84b7
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 35
	NUMA node: 0
	IOMMU group: 19
	Region 0: I/O ports at c050 [size=8]
	Region 1: I/O ports at c040 [size=4]
	Region 2: I/O ports at c030 [size=8]
	Region 3: I/O ports at c020 [size=4]
	Region 4: I/O ports at c000 [size=32]
	Region 5: Memory at fe400000 (32-bit, non-prefetchable) [size=512]
	Capabilities: <access denied>
	Kernel driver in use: ahci
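
For comparing logs between kernels, a rough sketch that counts the failure pattern shown above ("failed command" lines and "hard resetting link") in a dmesg dump. The function name is made up, and the regular expressions assume the message format pasted in this comment.

```python
import re
from collections import Counter

FAILED_CMD = re.compile(r"(ata\d+\.\d+): failed command: (.+)$")
HARD_RESET = re.compile(r"(ata\d+): hard resetting link")

def summarize_ata_errors(dmesg_text):
    """Count failed commands per (device, command) and link resets per port."""
    failed = Counter()
    resets = Counter()
    for line in dmesg_text.splitlines():
        m = FAILED_CMD.search(line)
        if m:
            failed[(m.group(1), m.group(2).strip())] += 1
            continue
        m = HARD_RESET.search(line)
        if m:
            resets[m.group(1)] += 1
    return failed, resets
```

Feeding it the excerpt above would report four failed WRITE FPDMA QUEUED commands on ata1.00 and one hard reset on ata1.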
Comment 95 Roman Mamedov 2021-10-12 12:55:38 UTC
> Upgraded to 5.14.9 (Fedora), and thought I'd remove the "libata.force=noncq"
> kernel parameter. Alas, still trouble!

Post the portion of dmesg from boot-up where it says what NCQ depth is being used and, IIRC, whether any quirks are active.
Comment 96 Roy 2021-10-12 13:00:37 UTC
Yes, sorry, as I pressed "post" I realised that I omitted some details. Here goes:

okt 12 13:34:41 Tuvok kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
okt 12 13:34:41 Tuvok kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
okt 12 13:34:41 Tuvok kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
okt 12 13:34:41 Tuvok kernel: Loaded X.509 cert 'Fedora kernel signing key: XXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
okt 12 13:34:41 Tuvok kernel: ima: Allocated hash algorithm: sha256
okt 12 13:34:41 Tuvok kernel: ima: No architecture policies found
okt 12 13:34:41 Tuvok kernel: evm: Initialising EVM extended attributes:
okt 12 13:34:41 Tuvok kernel: evm: security.selinux
okt 12 13:34:41 Tuvok kernel: evm: security.SMACK64 (disabled)
okt 12 13:34:41 Tuvok kernel: evm: security.SMACK64EXEC (disabled)
okt 12 13:34:41 Tuvok kernel: evm: security.SMACK64TRANSMUTE (disabled)
okt 12 13:34:41 Tuvok kernel: evm: security.SMACK64MMAP (disabled)
okt 12 13:34:41 Tuvok kernel: evm: security.apparmor (disabled)
okt 12 13:34:41 Tuvok kernel: evm: security.ima
okt 12 13:34:41 Tuvok kernel: evm: security.capability
okt 12 13:34:41 Tuvok kernel: evm: HMAC attrs: 0x1
okt 12 13:34:41 Tuvok kernel: ata1.00: supports DRM functions and may not be fully accessible
okt 12 13:34:41 Tuvok kernel: usb 4-2: new full-speed USB device number 2 using ohci-pci
okt 12 13:34:41 Tuvok kernel: PM:   Magic number: 9:228:582
okt 12 13:34:41 Tuvok kernel: pci_bus 0000:03: hash matches
okt 12 13:34:41 Tuvok kernel: pcieport 0000:00:06.0: hash matches
okt 12 13:34:41 Tuvok kernel: RAS: Correctable Errors collector initialized.
okt 12 13:34:41 Tuvok kernel: ata1.00: disabling queued TRIM support
okt 12 13:34:41 Tuvok kernel: ata1.00: ATA-11: Samsung SSD 860 EVO 1TB, RVT04B6Q, max UDMA/133
okt 12 13:34:41 Tuvok kernel: ata1.00: 1953525168 sectors, multi 1: LBA48 NCQ (depth 32), AA
okt 12 13:34:41 Tuvok kernel: ata2.00: ATA-8: ST31000524AS, JC45, max UDMA/133
okt 12 13:34:41 Tuvok kernel: ata2.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 32)
okt 12 13:34:41 Tuvok kernel: ata3.00: ATAPI: TSSTcorp CDDVDW SH-224BB, SB00, max UDMA/100
okt 12 13:34:41 Tuvok kernel: ata2.00: configured for UDMA/133
okt 12 13:34:41 Tuvok kernel: ata3.00: configured for UDMA/100
okt 12 13:34:41 Tuvok kernel: ata1.00: supports DRM functions and may not be fully accessible
okt 12 13:34:41 Tuvok kernel: ata1.00: disabling queued TRIM support
okt 12 13:34:41 Tuvok kernel: ata1.00: configured for UDMA/133
okt 12 13:34:41 Tuvok kernel: scsi 0:0:0:0: Direct-Access     ATA      Samsung SSD 860  4B6Q PQ: 0 ANSI: 5
okt 12 13:34:41 Tuvok kernel: ata1.00: Enabling discard_zeroes_data
okt 12 13:34:41 Tuvok kernel: sd 0:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
okt 12 13:34:41 Tuvok kernel: sd 0:0:0:0: [sda] Write Protect is off
okt 12 13:34:41 Tuvok kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
okt 12 13:34:41 Tuvok kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
okt 12 13:34:41 Tuvok kernel: sd 0:0:0:0: Attached scsi generic sg0 type 0
okt 12 13:34:41 Tuvok kernel: ata1.00: Enabling discard_zeroes_data
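
The two things of interest in a boot log like the one above are the NCQ depth per device and whether queued TRIM was disabled. A minimal sketch for extracting both, assuming the message wording shown here; the function name ncq_status is invented for illustration.

```python
import re

NCQ_DEPTH = re.compile(r"(ata\d+\.\d+): .*\bNCQ \(depth (\d+)\)")
TRIM_OFF = re.compile(r"(ata\d+\.\d+): disabling queued TRIM support")

def ncq_status(boot_log):
    """Map each ATA device to its reported NCQ depth and queued-TRIM state."""
    status = {}

    def entry(dev):
        # Create the per-device record on first sight.
        return status.setdefault(
            dev, {"depth": None, "queued_trim_disabled": False}
        )

    for line in boot_log.splitlines():
        m = NCQ_DEPTH.search(line)
        if m:
            entry(m.group(1))["depth"] = int(m.group(2))
        m = TRIM_OFF.search(line)
        if m:
            entry(m.group(1))["queued_trim_disabled"] = True
    return status
```

In the log above this shows ata1.00 (the 860 EVO) still at depth 32 with queued TRIM disabled, i.e. only the queued-TRIM workaround is active on this kernel.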
Comment 97 Hans de Goede 2021-10-12 13:16:05 UTC
Hi Roy (long time no see),

The 5.14.9 kernel has the commit to disable queued-trims, which also shows up in your logs. But you have one of the troublesome ATI/AMD chipset-era SATA controllers, so you also need this patch:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=86524ac0ddacaaf39edf90ef7473ffc868dd6089

5.14.11 is in Fedora's updates-testing repo:
https://koji.fedoraproject.org/koji/buildinfo?buildID=1843984a

If you install that the problem should go away.

Regards,

Hans
Comment 98 Krzysztof Oledzki 2021-10-13 02:03:22 UTC
Yes, "libata: Add ATA_HORKAGE_NO_NCQ_ON_ATI for Samsung 860 and 870 SSD." has been included in 5.14.11:
 https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.14.11
 https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.14.y&id=86524ac0ddacaaf39edf90ef7473ffc868dd6089

It is now also in 5.10.72, 5.4.152, 4.19.210, 4.14.250, 4.9.286 and 4.4.288.
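
The version list above lends itself to a quick check against a "uname -r" string. This is a sketch only: the function name is made up, treating 5.15 and later as fixed is an assumption based on the libata-5.15 branch referenced earlier in this bug, and distribution kernels may carry backports that the version number does not reveal.

```python
import re

# First stable release per series carrying ATA_HORKAGE_NO_NCQ_ON_ATI,
# taken from the list in this comment.
FIRST_FIXED = {
    (5, 14): 11, (5, 10): 72, (5, 4): 152,
    (4, 19): 210, (4, 14): 250, (4, 9): 286, (4, 4): 288,
}

def has_ati_ncq_quirk(release):
    """Return True/False for listed series, None if the series is not listed."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not m:
        return None
    major, minor = int(m.group(1)), int(m.group(2))
    patch = int(m.group(3) or 0)
    if (major, minor) >= (5, 15):  # assumption: merged via libata-5.15
        return True
    first = FIRST_FIXED.get((major, minor))
    if first is None:
        return None
    return patch >= first
```

So "5.14.9" gives False and "5.14.11" gives True, matching the behaviour reported in comments 94 and 99.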
Comment 99 Roy 2021-10-14 12:16:47 UTC
Thanks both, and thanks for getting these quirks in place. I can confirm that with 5.14.11 I can remove the libata.force=noncq kernel parameter without any trouble.
Comment 101 inchybinky 2022-02-26 23:19:42 UTC
Does anyone know if 5.14.11 has also addressed the problem for Samsung's PM883 enterprise-class drives?  

They use the Samsung 860 controller and hardware.  Another person in this thread has identified the same issue with this drive.

I am on the current version of Ubuntu LTS, with Kernel 5.4..  on an Intel NUC 10th Gen.  I see the FPDMA write errors described above; they occur at boot, but then the system works fine.  CRC error counts increase with every reboot though.

I have not added any kernel parameters.  It would be good to know if the PM883 drive has been included though.  If not, I suppose I will have to add one or both of the above parameters for trim and ncq.

Thank you
Comment 102 inchybinky 2022-02-26 23:53:03 UTC
(In reply to andreas from comment #68)
> I am experiencing the same bug with Samsung PM883 drives. 2TB and 4TB models.
> 
> Device Model:     SAMSUNG MZ7LH3T8HMLT-00005
> Firmware Version: HXT7404Q
> 
> > On 8. Dec 2020, at 20:19, bugzilla-daemon@bugzilla.kernel.org wrote:
> > 
> > https://bugzilla.kernel.org/show_bug.cgi?id=201693
> > 
> > --- Comment #24 from Sitsofe Wheeler (sitsofe@yahoo.com) ---
> > Can people who are seeing this report which model (e.g. SSD 860 EVO 500GB) 
> > firmware (e.g. RVT01B6Q) and PCI card (e.g. AMD SB7x0/SB8x0/SB9x0 ) they
> > have?
> > In my case smartctl -a <dev> reports that I'm on RVT01B6Q firmware which is
> > apparently behind the latest (RVT04B6Q) listed on
> > https://www.samsung.com/semiconductor/minisite/ssd/download/tools/ . If
> folks
> > are feeling brave and can take the risk can they report if the issue is
> still
> > reproduced on the latest firmware?
> > 
> > -- 
> > You are receiving this mail because:
> > You are on the CC list for the bug.

Hello andreas:  I also have a PM883 (7.68TB version).  Were you able to solve your problem with a kernel upgrade to 5.14.11?  Or did you add one or both of the trim/ncq commands?  Would appreciate your guidance.  Thank you in advance.
Comment 103 Hans de Goede 2022-02-28 11:35:52 UTC
(In reply to inchybinky from comment #101)
> Does anyone know if 5.14.11 has also addressed the problem for Samsung's
> PM883 enterprise-class drives?  
> 
> They use the Samsung 860 controller and hardware.  Another person in this
> thread has identified the same issue with this drive.
> 
> I am on the current version of Ubuntu LTS, with Kernel 5.4..  on an Intel
> NUC 10th Gen.  I have FPDMA Write as described above; they occur at boot,
> but then the system works fine.  CRC error counts increase with every reboot
> though.
> 
> I have not added any kernel paramaters.  It would be good to know if the
> PM883 drive has been included though.  If not, I suppose I will have to add
> one or both of the above parameters for trim and ncq.
> 
> Thank you

ATM the kernel only applies the workaround to SATA devices with a model string matching one of:

"Samsung SSD 860*"
"Samsung SSD 870*"

To also apply the workaround automatically to the PM883 enterprise-class drives I need to know the model string of those, please run:

cat /sys/class/scsi_device/*/device/model

on a machine with such a drive and then copy and paste the output in a comment here.
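
To see why the PM883 is not covered at the moment, the two patterns above can be tried against model strings with ordinary glob matching. This is a rough stand-in for the kernel's matching, not a copy of it; the function name is hypothetical and the PM883 model string used in the example is the one reported in comment 68.

```python
from fnmatch import fnmatchcase

# The two model patterns quoted above.
PATTERNS = ["Samsung SSD 860*", "Samsung SSD 870*"]

def matches_860_870(model):
    """True if the model string matches one of the 860/870 patterns."""
    model = model.strip()
    return any(fnmatchcase(model, pattern) for pattern in PATTERNS)
```

"Samsung SSD 860 EVO 1TB" matches, while "SAMSUNG MZ7LH3T8HMLT-00005" does not, since the comparison is case-sensitive and the string carries no 860/870 marker at all.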
Comment 104 Krzysztof Oledzki 2022-03-01 04:55:14 UTC
Also, before we add a workaround, we should check if disabling NCQ solves the problem.

Could you please test this?

libata.force=noncq
Comment 105 hardwareadictos 2022-03-19 10:15:47 UTC
Good morning.

I can confirm this happens with an AMD SATA controller (https://www.supermicro.com/en/products/motherboard/M11SDV-8C+-LN4F) and an 870 EVO:

  *-sata
       description: SATA controller
       product: FCH SATA Controller [AHCI mode]
       vendor: Advanced Micro Devices, Inc. [AMD]
       physical id: 0.2
       bus info: pci@0000:07:00.2
       logical name: scsi0
       logical name: scsi1
       logical name: scsi2
       logical name: scsi3
       version: 51
       width: 32 bits
       clock: 33MHz
       capabilities: sata pm pciexpress msi ahci_1.0 bus_master cap_list emulated
       configuration: driver=ahci latency=0
       resources: irq:46 memory:ef602000-ef602fff
     *-disk:0
          description: ATA Disk
          product: Samsung SSD 870
          physical id: 0
          bus info: scsi@0:0.0.0
          logical name: /dev/sda
          version: 1B6Q
          size: 931GiB (1TB)
          capabilities: removable
          configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512
        *-medium
             physical id: 0
             logical name: /dev/sda
             size: 931GiB (1TB)
             capabilities: gpt-1.00 partitioned partitioned:gpt
             configuration: guid=1295f5ca-cc51-6449-b884-b9b76f930336
     *-disk:1
          description: ATA Disk
          product: Samsung SSD 870
          physical id: 1
          bus info: scsi@1:0.0.0
          logical name: /dev/sdb
          version: 2B6Q
          size: 931GiB (1TB)
          capabilities: removable
          configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512
        *-medium
             physical id: 0
             logical name: /dev/sdb
             size: 931GiB (1TB)
             capabilities: gpt-1.00 partitioned partitioned:gpt
             configuration: guid=6bee9b39-d327-cc4b-b3df-d8582cce9d04
     *-disk:2
          description: ATA Disk
          product: Samsung SSD 870
          physical id: 2
          bus info: scsi@2:0.0.0
          logical name: /dev/sdc
          version: 1B6Q
          size: 931GiB (1TB)
          capabilities: removable
          configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512
        *-medium
             physical id: 0
             logical name: /dev/sdc
             size: 931GiB (1TB)
             capabilities: gpt-1.00 partitioned partitioned:gpt
             configuration: guid=5445d2ac-f424-ee4b-ae77-4e7b03750b3f
     *-disk:3
          description: ATA Disk
          product: Samsung SSD 870
          physical id: 3
          bus info: scsi@3:0.0.0
          logical name: /dev/sdd
          version: 1B6Q
          size: 931GiB (1TB)
          capabilities: removable
          configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512
        *-medium
             physical id: 0
             logical name: /dev/sdd
             size: 931GiB (1TB)
             capabilities: gpt-1.00 partitioned partitioned:gpt
             configuration: guid=4f3e0797-d8f7-2b4d-9834-00bcc939e156

I have 4 disks in a RAIDZ config:

=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 870 EVO 1TB
LU WWN Device Id: 5 002538 f311ab0dc
Firmware Version: SVT01B6Q
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Mar 19 11:07:39 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Running PROXMOX 7.1.10 with 5.13 kernel:

 5.13.19-6-pve #1 SMP PVE 5.13.19-14 (Thu, 10 Mar 2022 16:24:52 +0100) x86_64 GNU/Linux

Disabled NCQ per device (libata.force=1.00:noncq,2.00:noncq,3.00:noncq,4.00:noncq) and globally (libata.force=noncq) via GRUB, and also through the per-device sysfs setting:

for f in /sys/block/sd*/device/queue_depth; do echo 1 > "$f"; done

Also set the link power management policy to max_performance:

for f in /sys/class/scsi_host/host*/link_power_management_policy; do echo max_performance > "$f"; done
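
For readers assembling the same kernel parameter, the per-port form used above can be generated rather than typed by hand. This is only a sketch: the helper name build_libata_force is hypothetical, and the port IDs must be taken from the ataN.MM prefixes in your own dmesg.

```python
def build_libata_force(ports, options):
    """Return a libata.force= kernel parameter applying each option to each port.

    ports   -- "PORT.DEVICE" IDs as printed by the kernel, e.g. "1.00"
    options -- libata.force values used in this thread, e.g. "noncq" or "3.0G"
    """
    ports = list(ports)  # allow generators; we iterate once per option
    entries = [f"{port}:{opt}" for opt in options for port in ports]
    return "libata.force=" + ",".join(entries)


# Reproduces the per-device string quoted in this comment.
print(build_libata_force(["1.00", "2.00", "3.00", "4.00"], ["noncq"]))
```

The resulting string goes on the kernel command line (for example via GRUB); it has no effect if written anywhere else.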

None of these configurations works consistently; the ZFS array still crashes within minutes to hours. I have been running the same hardware on a TrueNAS system (FreeBSD) with constant rsyncs on the same pool without a single error.

Let me know if you need me to do any tests. Open to help.
Comment 106 Paul Menzel 2022-03-19 13:35:15 UTC
In his patch notes [1], Christian writes it might be firmware related and affect newer Samsung 870 EVOs.

Could this be verified here, and was Samsung notified?

[1]: https://lore.kernel.org/linux-ide/YjSVGnH+8NypPR6r@smile.fi.intel.com/T/#m4b286c284a4ac305cd4c070e34335bdcfd01b982
Comment 107 inchybinky 2022-03-28 18:08:00 UTC
(In reply to Krzysztof Oledzki from comment #104)
> Also, before we add a workaround, we should check if disabling NCQ solves
> the problem.
> 
> Could you please test this?
> 
> libata.force=noncq

Hello Hans de Goede and Krzysztof Oledzki:

The output of cat /sys/class/scsi_device/*/device/model for the SAMSUNG PM883 7.68TB SATA SSD is:  SAMSUNG MZ7LH7T6

libata.force=noncq helps, but does not completely eliminate the problem.

It is also necessary to force the SATA link to 3.0G. Otherwise, there are multiple errors in dmesg (along with rising CRC error count) with the link attempting at 6.0G before it downshifts to 3.0G.  Setting it at 3.0G stops the CRC error count as there is no attempt to link at 6.0G.

With libata.force=noncq,3.0G, no error messages appear in dmesg.
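
To confirm that the 3.0G limit actually took effect, the negotiated speed can be read back from sysfs instead of being inferred from dmesg. The sketch below assumes the kernel exposes /sys/class/ata_link/linkN/sata_spd (libata's transport class); the function name read_link_speeds is hypothetical.

```python
from pathlib import Path


def read_link_speeds(sysfs_root="/sys/class/ata_link"):
    """Map each ATA link name (e.g. "link1") to its current speed string."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}  # not Linux, or libata transport class not present
    speeds = {}
    for link in sorted(root.glob("link*")):
        spd_file = link / "sata_spd"
        if spd_file.is_file():
            speeds[link.name] = spd_file.read_text().strip()
    return speeds


if __name__ == "__main__":
    for name, speed in read_link_speeds().items():
        print(f"{name}: {speed}")
```

A link forced with 3.0G should report "3.0 Gbps" here right after boot, without first attempting 6.0G.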

Thank you for your assistance.
Comment 108 Hans de Goede 2022-04-01 14:08:31 UTC
(In reply to inchybinky from comment #107)
> (In reply to Krzysztof Oledzki from comment #104)
> > Also, before we add a workaround, we should check if disabling NCQ solves
> > the problem.
> > 
> > Could you please test this?
> > 
> > libata.force=noncq
> 
> Hello Hans de Goede and Krzysztof Oledzki:
> 
> The output of cat /sys/class/scsi_device/*/device/model for the SAMSUNG
> PM883 7.68TB SATA SSD is:  SAMSUNG MZ7LH7T6
> 
> libada.force=noncq helps, but does not completely eliminate the problem.  
> 
> It is also necessary to force the SATA link to 3.0G. Otherwise, there are
> multiple errors in dmesg (along with rising CRC error count) with the link
> attempting at 6.0G before it downshifts to 3.0G.  Setting it at 3.0G stops
> the CRC error count as there is no attempt to link at 6.0G.
> 
> With libadata.force=noncq,3.0G, no error messages appear in dmesg.
> 
> Thank you for your assistance.

Thank you for testing.

I just noticed from your original report that you are using an Intel NUC, so an Intel SATA controller rather than an ATI/AMD SATA controller. That means that whatever you are seeing is different from the problem this bug is originally about.

All the other reporters were having issues with Samsung 860 / 870 series in combination with an ATI SATA controller.

I'm not sure what is going on here. Since you also need to downgrade the linkspeed, it might be a good idea to replace the SATA cable if possible.
Comment 109 Hans de Goede 2022-04-01 14:11:59 UTC
From: https://www.reddit.com/r/intelnuc/comments/m12ytr/do_you_use_a_10th_gen_nuc_with_a_25_samsung_ssd/

"I had a bit of trouble trying both an 850 evo and then an 860 evo in mine - could read the SSD and transfer data at less than 5Mb/s, could not format. My issue was the SATA cable had In to be reseated in its socket on the mother board - a few times - before either started performing as expected."

And there also is:
https://community.intel.com/t5/Intel-NUCs/NUC-D54250WYKH-SSD-not-detected-issue-Finding-and-possible/td-p/346477/page/2
https://www.intel.com/content/www/us/en/support/articles/000029624/intel-nuc/intel-nuc-kits.html
Comment 110 inchybinky 2022-04-03 13:15:47 UTC
(In reply to Hans de Goede from comment #109)
> From:
> https://www.reddit.com/r/intelnuc/comments/m12ytr/
> do_you_use_a_10th_gen_nuc_with_a_25_samsung_ssd/
> 
> "I had a bit of trouble trying both an 850 evo and then an 860 evo in mine -
> could read the SSD and transfer data at less than 5Mb/s, could not format.
> My issue was the SATA cable had In to be reseated in its socket on the
> mother board - a few times - before either started performing as expected."
> 
> And there also is:
> https://community.intel.com/t5/Intel-NUCs/NUC-D54250WYKH-SSD-not-detected-
> issue-Finding-and-possible/td-p/346477/page/2
> https://www.intel.com/content/www/us/en/support/articles/000029624/intel-nuc/
> intel-nuc-kits.html

(In reply to Hans de Goede from comment #108)
> (In reply to inchybinky from comment #107)
> > (In reply to Krzysztof Oledzki from comment #104)
> > > Also, before we add a workaround, we should check if disabling NCQ solves
> > > the problem.
> > > 
> > > Could you please test this?
> > > 
> > > libata.force=noncq
> > 
> > Hello Hans de Goede and Krzysztof Oledzki:
> > 
> > The output of cat /sys/class/scsi_device/*/device/model for the SAMSUNG
> > PM883 7.68TB SATA SSD is:  SAMSUNG MZ7LH7T6
> > 
> > libada.force=noncq helps, but does not completely eliminate the problem.  
> > 
> > It is also necessary to force the SATA link to 3.0G. Otherwise, there are
> > multiple errors in dmesg (along with rising CRC error count) with the link
> > attempting at 6.0G before it downshifts to 3.0G.  Setting it at 3.0G stops
> > the CRC error count as there is no attempt to link at 6.0G.
> > 
> > With libadata.force=noncq,3.0G, no error messages appear in dmesg.
> > 
> > Thank you for your assistance.
> 
> Thank you for testing.
> 
> I just noticed from your original report that you are using an Intel NUC, so
> an Intel SATA controller rather then an ATI/AMD SATA controller. That means
> that whatever you are seeing is different then the problem this bug is
> originally about. 
> 
> All the other reporters were having issues with Samsung 860 / 870 series in
> combination with an ATI SATA controller.
> 
> I'm not sure what is going on here. Since you also need to downgrade the
> linkspeed, it might be a good idea to replace the SATA cable if possible.

Hi again, and thank you for your reply.

There are several others who have reported problems with Intel controllers here, and another person with a PM883 (comment 68) - see the comments linked at the end of this comment.

I don't think replacing the SATA cable is an option for me in this Intel NUC, unless I special order one from Intel - if they allow that.  Though I suspect it's not the problem, as many others have had the same issue, some in this thread, and many more turn up through a Google search of other forums.

I learned in other forums and confirmed from reading threads here that libata.force=noncq,3.0G gets rid of the error messages.  To be sure, it does.  My drive is working fine for my use case and I'm happy to leave it as is with this setting.  And I can see that Ubuntu is still trimming the drive once a week - no problems there either.  Still, I have to use these flags for this drive.

I have read that the Samsung PM883 shares the same hardware as the Samsung 860, just different flash + capacitors.  

https://bugzilla.kernel.org/show_bug.cgi?id=201693#c17
https://bugzilla.kernel.org/show_bug.cgi?id=201693#c26
https://bugzilla.kernel.org/show_bug.cgi?id=201693#c34
https://bugzilla.kernel.org/show_bug.cgi?id=201693#c43
https://bugzilla.kernel.org/show_bug.cgi?id=201693#c68
Comment 111 hardwareadictos 2022-04-23 15:35:46 UTC
I can confirm, after nearly 2 weeks of stability, that limiting the SATA link speed to 3.0G fixed my problems as well.
Comment 112 public-t.b 2022-07-11 11:01:17 UTC
I'm a bit late, but just for information in case someone stumbles on this later on:

I used a Samsung 870 EVO 4TB with a Ryzen 7 3700X on X470 (I don't remember whether it was attached to the AMD controller or to another chip on the mainboard) with a kernel from before the patches were introduced. I did not know these issues existed, as this was the first SSD I had not run attached to an old 3ware RAID controller (I'm running all my other Samsung SSDs in RAID1 on 3ware controllers without any issues so far; the controller does not seem capable of forwarding TRIM commands).

The ugly result:
I had to RMA the SSD after some months because of growing problems with ECC errors, CRC errors and sector reallocations whenever a TRIM hit the drive. While the drive was about 30% full everything was fine, but as it filled up, the issues and performance degradation hit more frequently (whenever an error occurred, speeds dropped to at best 60MB/s for read and write).

After adding the noncq parameters the errors no longer showed up, but since the reallocated sectors remained, I went through the RMA process and got a new drive. From now on I will keep these SSDs on 3ware RAID controllers (maybe 10%-20% slower than direct attached, but less headache).
Comment 113 Roman Mamedov 2022-07-11 11:39:03 UTC
> Ryzen 7 3700X on X470 (I don't remember if attached to AMD or other chip on
> the mainboard) with Kernel from before the patches being introduced

The issue, and the patches to fix it, are unrelated to the controller that you are using, as yours is a newer model which is not known to be affected.

Also, nobody has previously reported seeing any persistent negative effect on the SSD from the controller issues, such as reallocated sectors. So I'm inclined to guess that your first SSD was simply faulty.

Lastly, check whether your 3ware controller actually passes the TRIM command through, as those enterprise controllers are sometimes picky about whether they will or not. Without TRIM, both performance and longevity of the SSD could be taking a hit.
Comment 114 Roman Mamedov 2022-07-11 11:43:27 UTC
Oh, I overlooked you already said they do not pass TRIM. Well, that's unfortunate.
Comment 115 Mike 2022-12-03 03:32:14 UTC
TL;DR: Being unaware of incompatible kernel<->firmware behaviour can kill an 870 EVO 2TB drive.

The history:
I connected a new 870 EVO 2TB to an Intel SATA 200 controller (X299 chipset) in AHCI mode. I cloned my previous drive layout from a classic HDD and was not aware of the problems with Samsung drives and Linux. The drive held a couple of Windows versions and Ubuntu 18.04 with kernels 4.11.x and later 4.15.x. 1.5 years ago I saw these errors for the first time:
 failed command: WRITE FPDMA QUEUED
after some of the restarts. I thought: OK, new machine, bad contact on the SATA cables. I changed the physical cable and port (from 6 to 1) and put the drive into an older rack housing for 2x2.5" in 3.5", which (as I discovered later) limited SATA to 3Gbps mode (SATA II). I did not mind, because I wanted the drive to be removable for other reasons, but I think this may have masked the problem a little.
I booted Linux maybe 3-5 times during those 1.5 years for Ubuntu updates and small tasks, and did not observe anything wrong or any FPDMA QUEUED error spam. Two days ago I updated 18.04 to the latest kernel (5.4, if I remember correctly) and tried to see whether my new Audigy Rx was working. Everything was detected and not greyed out, but there was no sound. I started searching with Firefox, which began crashing on every tab I opened, at any address, with "Report a bug". I could not find anything using Firefox, so I restarted the system, thinking that might help after the last system update. Nothing helped, and dmesg started to spam: failed command: WRITE FPDMA QUEUED
I thought: OK, Ubuntu 18.04 may be only lightly supported by now, so I will upgrade. The upgrade to 20.04 finished with errors; after the restart there was no GUI, only lightdm errors and
failed command: WRITE FPDMA QUEUED
I did another upgrade, to Ubuntu 22.04: again with errors, and no boot, only errors:
failed command: WRITE FPDMA QUEUED
lightdm errors and no GPU detected
I checked the drive under Windows with Samsung Magician and got unrecoverable errors.
Finally I swapped the SATA port and cable (port 6 to 1) to confirm it was not a port or cable problem; I still got the same failed command: WRITE FPDMA QUEUED errors and verified that the bad block count keeps rising.

The conclusion:
During the year of use under Windows, Samsung Magician did not report anything wrong in SMART or with the drive. The drive has about 5 TBW on 2TB, so it will be covered by RMA, but I am sure the cause is the combination of Samsung firmware and the kernel options that changed between kernel releases from 4.15 to 5.15, with "failed command: WRITE FPDMA QUEUED" appearing and disappearing accordingly.
I hope I can recover all important data by only reading the drive passively - Linux is not bootable any more.

My thoughts:
I strongly dismiss the tales about poor cabling. Such errors can appear with weak connections, but the Samsung problem is kernel behavior triggering drive firmware bugs under some circumstances. It is more likely in the higher SATA speed mode (6Gbps) or at higher data throughput during boot or intensive move/copy operations. It can cause bad blocks on drives, because they have their own queuing strategy for refreshing/trimming/replacing data in the physical structure. I see no other explanation.

It should not be like this, especially for unaware users. Users who are unaware of kernel versions do not know about the problem with the drive and can damage it even after you find a patch. This should definitely be secured in firmware.
Comment 116 Mike 2022-12-04 01:40:20 UTC
(In reply to Mike from comment #115)
> The conclusion:
> During the year under Windows use and watching Samsung Magician didn't
> reported anything wrong on SMART and with the drive. The drive has about 5
> TBW usage on 2TB so it will be under RMA, but I am sure it has coincidence
> of Samsung firmware and kernel options you play between kernel patch
> releases from 4.15 - 5.15 and more appearing/disappearing of "failed
> command: WRITE FPDMA QUEUED. "

I have now identified the developers' changes to the TRIM settings, so I have proof of what killed my SAMSUNG 870 EVO 2TB between OS updates:

1) Ubuntu 18.04 linux 4.15.0-200-generic - stable working drive, because of libata-core.c line 4544 - all Samsung 8xx excluded - drive is safe (without the user being aware):
	{ "Samsung SSD 8*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },

Updated, on the security recommendation of Ubuntu's system update and unaware of the kernel regression, to:
2) Ubuntu 18.04 linux 4.19.125-0419125-generic - unstable working drive, because of libata-core.c line 4575 - only Samsung 840 & 850 excluded:
	{ "Samsung SSD 840*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },
	{ "Samsung SSD 850*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },

Upgraded to Ubuntu 20.04:
3) Ubuntu 20.04 linux 5.4.0-135-generic - unstable working drive, because of libata-core.c line 4555 - only Samsung 840 & 850 excluded:
	{ "Samsung SSD 840*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },
	{ "Samsung SSD 850*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },

Upgraded to Ubuntu 22.04:
4) Ubuntu 22.04 linux 5.4.0-135-generic - unstable working drive, because of the previous state of libata-core.c, now on line 4002 - true poetry of excluded Samsung SSDs:

	{ "Samsung SSD 840 EVO*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_NO_DMA_LOG |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },
	{ "Samsung SSD 840*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },
	{ "Samsung SSD 850*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },
	{ "Samsung SSD 860*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM |
						ATA_HORKAGE_NO_NCQ_ON_ATI, },
	{ "Samsung SSD 870*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM |
						ATA_HORKAGE_NO_NCQ_ON_ATI, },

Thanks for the nice unaware killing of my data. What TRIM does on an SSD is another story, told here:
https://www.algolia.com/blog/engineering/when-solid-state-drives-are-not-that-solid/
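
For readers following the table excerpts in this comment: the model column is a glob pattern, and the first matching entry supplies the flags. The sketch below uses Python's fnmatch as a stand-in for the kernel's own glob matcher (equivalent for a plain trailing "*"); the pattern lists are copied from the excerpts quoted above, and the model string is an assumed example in the format smartctl reports.

```python
from fnmatch import fnmatchcase

# Patterns as quoted in this comment for the 4.15 and 4.19 tables.
TABLE_4_15 = ["Samsung SSD 8*"]
TABLE_4_19 = ["Samsung SSD 840*", "Samsung SSD 850*"]


def first_match(model, patterns):
    """Return the first pattern matching the model string, or None."""
    for pattern in patterns:
        if fnmatchcase(model, pattern):
            return pattern
    return None


model = "Samsung SSD 870 EVO 2TB"  # assumed example model string
print(first_match(model, TABLE_4_15))  # matched by the broad 8* entry
print(first_match(model, TABLE_4_19))  # no entry covers the 870
```

This only shows which entry would be selected; it says nothing about whether the selected flags are sufficient for a given controller.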
Comment 117 Paul Menzel 2022-12-04 19:02:00 UTC
@Mike, I am sorry about your troubles. Please contact the maintainers and linux-ide@vger.kernel.org, and also mention the Samsung SSD EVO firmware version you have.
Comment 118 Krzysztof Oledzki 2022-12-04 21:16:19 UTC
Perhaps as a start, it would make sense to file a separate bug instead of trying to hijack this one, and be more clear about the problem statement, include data from SMART, etc.

This one (currently in state RESOLVED, CODE_FIX) was tracking the issue with "ATI AMD" AHCI controllers, whereas the recent update clearly states "Intel SATA 200 controller X299 chipset in AHCI mode".
Comment 119 Julien FR 2023-02-07 15:54:28 UTC
We're encountering the same issue with an X570 board, 6x 4 TB Samsung 860 EVO and 5.15.83-1-pve #1 SMP PVE 5.15.83-1 (2022-12-15T00:00Z) x86_64 GNU/Linux.

All drives are connected to the motherboard's integrated SATA ports, but only two of the six ports are affected.

Affected controller:
26:00.0 SATA controller [0106]: ASMedia Technology Inc. ASM1062 Serial ATA Controller [1b21:0612] (rev 02)

Unaffected controllers:
2b:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
2c:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)


I'm struggling to understand whether the current fix is supposed to apply in our context.
Queuing is clearly not disabled (showing depth 32 for all 6 drives).
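
The queue depth mentioned here can be collected for all drives at once from the same sysfs attribute that earlier comments write to. A minimal sketch, with the hypothetical helper name read_queue_depths; a value of 1 means commands are issued one at a time, while 32 indicates NCQ is in use.

```python
from pathlib import Path


def read_queue_depths(sysfs_root="/sys/block"):
    """Map each sd* block device name to its current SCSI queue depth."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}
    depths = {}
    for dev in sorted(root.glob("sd*")):
        depth_file = dev / "device" / "queue_depth"
        if depth_file.is_file():
            depths[dev.name] = int(depth_file.read_text())
    return depths


if __name__ == "__main__":
    for name, depth in read_queue_depths().items():
        print(f"{name}: queue_depth={depth}")
```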
Comment 120 Hans de Goede 2023-02-15 15:29:12 UTC
(In reply to Julien FR from comment #119)
> I'm struggling to understand if the current fix is supposed to be applied in
> our context ? 

No, the current fix is not supposed to apply in your context.

The current fix only applies to old AMD SATA controllers, from back when they still used ATI's PCI vendor ID for the SATA controllers.

I wonder if the somewhat old 5.15.83 kernel you are running has the NO_NCQ_TRIM quirk for the Samsung 860 series:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8a6430ab9c9c87cb64c512e505e8690bbaee190b

Have you tried passing "libata.force=noncqtrim" on the kernel commandline?
Comment 121 Julien FR 2023-02-15 16:13:13 UTC
That's reassuring. 

I have passed libata.force=1.00:noncq,2.00:noncq,1.00:3.0G,2.00:3.0G 
The pool then accepted a burst of 12 TB at a stable rate of 1 GB/s, which to me seems to confirm that these options do the trick.

Curiously though, on a second server with the same CPU/motherboard but 6x 4 TB 870 EVO (not 860), I have not seen the issue yet (despite fairly high disk throughput).


Would it be as reliable to use noncqtrim instead of noncq + 3.0G?
Comment 122 Hans de Goede 2023-02-15 16:37:45 UTC
> Would it be as reliable to use noncqtrim instead of noncq + 3.0G ?

I'm afraid I cannot answer that. Analysis of previous bugreports has shown that using noncqtrim appears to be enough to make Samsung 860 SSDs work reliably with all SATA controllers except for the old ATI ones.

But there are no guarantees here. Testing noncqtrim would be interesting as another datapoint in how to get these somewhat disappointing SSDs reliable with Linux, but doing so might eat your data!
Comment 123 Hans de Goede 2023-02-15 16:41:32 UTC
Quick follow-up: 5.15.83 already automatically enables noncqtrim on the Samsung 860* models, so it seems that in your case with the ASMedia controller that is not enough to make things work in a stable fashion:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/drivers/ata/libata-core.c?h=v5.15.83
Comment 124 Paul Menzel 2023-02-15 17:12:39 UTC
Julien, as this bug is resolved and for a different issue, please create a new issue, and reference it here.
Comment 125 Lucas Tam 2023-08-16 21:54:15 UTC
I'm seeing this error on 5.15.108-2 on Proxmox but with a WD Pro Red... is this bug only for Samsung EVOs?
Comment 126 DocMAX 2024-02-11 22:55:12 UTC
I have the Samsung 870 QVO and I can't use it at 6.0 Gb/s!

07:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 81)
07:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 81)


dmesg:
Feb 11 22:51:31 pve kernel: ata2.00: Enabling discard_zeroes_data
Feb 11 22:52:21 pve kernel: ata2.00: exception Emask 0x0 SAct 0x1c SErr 0xc0000 action 0x6 frozen
Feb 11 22:52:21 pve kernel: ata2: SError: { CommWake 10B8B }
Feb 11 22:52:21 pve kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Feb 11 22:52:21 pve kernel: ata2.00: cmd 61/10:10:10:0a:00/00:00:00:00:00/40 tag 2 ncq dma 8192 out
                                     res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 11 22:52:21 pve kernel: ata2.00: status: { DRDY }
Feb 11 22:52:21 pve kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Feb 11 22:52:21 pve kernel: ata2.00: cmd 61/10:18:10:74:c0/00:00:d1:01:00/40 tag 3 ncq dma 8192 out
                                     res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 11 22:52:21 pve kernel: ata2.00: status: { DRDY }
Feb 11 22:52:21 pve kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Feb 11 22:52:21 pve kernel: ata2.00: cmd 61/10:20:10:76:c0/00:00:d1:01:00/40 tag 4 ncq dma 8192 out
                                     res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 11 22:52:21 pve kernel: ata2.00: status: { DRDY }
Feb 11 22:52:21 pve kernel: ata2: hard resetting link
Feb 11 22:52:21 pve kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb 11 22:52:21 pve kernel: ata2.00: supports DRM functions and may not be fully accessible
Feb 11 22:52:21 pve kernel: ata2.00: supports DRM functions and may not be fully accessible
Feb 11 22:52:21 pve kernel: ata2.00: configured for UDMA/133
Feb 11 22:52:21 pve kernel: ata2: EH complete
Feb 11 22:52:21 pve kernel: ata2.00: Enabling discard_zeroes_data
Feb 11 22:52:22 pve kernel: ata2.00: exception Emask 0x10 SAct 0x100400 SErr 0x0 action 0x6 frozen
Feb 11 22:52:22 pve kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Feb 11 22:52:22 pve kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Feb 11 22:52:22 pve kernel: ata2.00: cmd 61/08:50:00:28:00/00:00:24:00:00/40 tag 10 ncq dma 4096 out
                                     res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Feb 11 22:52:22 pve kernel: ata2.00: status: { DRDY }
Feb 11 22:52:22 pve kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Feb 11 22:52:22 pve kernel: ata2.00: cmd 61/08:a0:00:28:00/00:00:26:00:00/40 tag 20 ncq dma 4096 out
                                     res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Feb 11 22:52:22 pve kernel: ata2.00: status: { DRDY }
Feb 11 22:52:22 pve kernel: ata2: hard resetting link
Feb 11 22:52:22 pve kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb 11 22:52:22 pve kernel: ata2.00: supports DRM functions and may not be fully accessible
Feb 11 22:52:22 pve kernel: ata2.00: supports DRM functions and may not be fully accessible
Feb 11 22:52:22 pve kernel: ata2.00: configured for UDMA/133
Feb 11 22:52:22 pve kernel: ata2: EH complete
Feb 11 22:52:22 pve kernel: ata2.00: Enabling discard_zeroes_data
Feb 11 22:52:53 pve kernel: ata2.00: exception Emask 0x0 SAct 0x80000 SErr 0x0 action 0x6 frozen
Feb 11 22:52:53 pve kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Feb 11 22:52:53 pve kernel: ata2.00: cmd 61/10:98:10:76:c0/00:00:d1:01:00/40 tag 19 ncq dma 8192 out
                                     res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 11 22:52:53 pve kernel: ata2.00: status: { DRDY }
Feb 11 22:52:53 pve kernel: ata2: hard resetting link
Feb 11 22:52:54 pve kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb 11 22:52:54 pve kernel: ata2.00: supports DRM functions and may not be fully accessible
Feb 11 22:52:54 pve kernel: ata2.00: supports DRM functions and may not be fully accessible
Feb 11 22:52:54 pve kernel: ata2.00: configured for UDMA/133
Feb 11 22:52:54 pve kernel: ata2: EH complete
Feb 11 22:52:54 pve kernel: ata2.00: Enabling discard_zeroes_data
Feb 11 22:53:24 pve kernel: ata2: limiting SATA link speed to 3.0 Gbps
Feb 11 22:53:24 pve kernel: ata2.00: exception Emask 0x0 SAct 0x1000 SErr 0x0 action 0x6 frozen
Feb 11 22:53:24 pve kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Feb 11 22:53:24 pve kernel: ata2.00: cmd 61/10:60:10:76:c0/00:00:d1:01:00/40 tag 12 ncq dma 8192 out
                                     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 11 22:53:24 pve kernel: ata2.00: status: { DRDY }
Feb 11 22:53:24 pve kernel: ata2: hard resetting link
Feb 11 22:53:25 pve kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Feb 11 22:53:25 pve kernel: ata2.00: supports DRM functions and may not be fully accessible
Feb 11 22:53:25 pve kernel: ata2.00: supports DRM functions and may not be fully accessible
Feb 11 22:53:25 pve kernel: ata2.00: configured for UDMA/133
Feb 11 22:53:25 pve kernel: ata2.00: device reported invalid CHS sector 0
Feb 11 22:53:25 pve kernel: ata2: EH complete
Feb 11 22:53:25 pve kernel: ata2.00: Enabling discard_zeroes_data
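
As a reading aid for the log above: SAct is a bitmap of the NCQ tags that were outstanding when the port froze, and the SPD field in bits 7:4 of SStatus gives the negotiated link generation. The sketch below decodes the values that appear in this log; the function names are hypothetical, and the kernel prints both registers in hexadecimal.

```python
def active_tags(sact):
    """List the NCQ tags (0-31) whose bits are set in the SAct bitmap."""
    return [tag for tag in range(32) if (sact >> tag) & 1]


def link_generation(sstatus):
    """Return the SPD field of SStatus: 1, 2 or 3 for 1.5, 3.0 or 6.0 Gbps."""
    return (sstatus >> 4) & 0xF


# SAct values taken from the exception lines in the log above.
for sact in (0x1C, 0x100400, 0x80000, 0x1000):
    print(hex(sact), active_tags(sact))

# SStatus values from the "SATA link up" lines.
for sstatus in (0x133, 0x123):
    print(hex(sstatus), "generation", link_generation(sstatus))
```

The decoded tags line up with the "tag N" fields of the failed WRITE FPDMA QUEUED commands listed after each exception line, and the generation drops from 3 to 2 after the kernel limits the link to 3.0 Gbps.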