Bug 201693 - Samsung 860 EVO NCQ Issue with AMD SATA Controller
Summary: Samsung 860 EVO NCQ Issue with AMD SATA Controller
Status: RESOLVED CODE_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
Hardware: All Linux
Importance: P1 normal
Assignee: Tejun Heo
Reported: 2018-11-15 00:05 UTC by Ryley Angus
Modified: 2021-09-07 01:41 UTC
CC List: 26 users
Kernel Version: 4.19.1
Tree: Mainline
Regression: No


Attachments
dmesg output (93.57 KB, text/plain)
2018-11-15 00:05 UTC, Ryley Angus
lspci -vvv output (33.17 KB, text/plain)
2018-11-15 00:06 UTC, Ryley Angus
smartctl status for the 860 EVO (4.16 KB, text/plain)
2018-11-15 00:07 UTC, Ryley Angus
hdparm -I output (3.39 KB, text/plain)
2018-11-15 00:17 UTC, Ryley Angus
signature.asc (833 bytes, application/pgp-signature)
2021-03-03 21:21 UTC, Solomon Peachy

Description Ryley Angus 2018-11-15 00:05:57 UTC
Created attachment 279441 [details]
dmesg output

Hi, I've recently purchased a 2TB Samsung 860 EVO SSD to replace an existing 850 EVO. There have been no other hardware changes to the affected system and I cloned my existing Linux installation (LUKS/ext4) to the new SSD.

I had no issues with the 850 EVO (I purchased it after the queued trim issue was mitigated), but I have immediately had problems with the 860 EVO. If the filesystem is trimmed (manually or automatically), dmesg immediately reports several "WRITE FPDMA QUEUED" errors before hard resetting the link. The only time these errors haven't occurred is when the amount of space trimmed (as reported by fstrim) is less than 15-20GB.

If I disable NCQ for the SSD, I can trim the drive without issue. I've also tried using "libata.force=noncqtrim" but this did not change the situation.
Comment 1 Ryley Angus 2018-11-15 00:06:54 UTC
Created attachment 279443 [details]
lspci -vvv output
Comment 2 Ryley Angus 2018-11-15 00:07:24 UTC
Created attachment 279445 [details]
smartctl status for the 860 EVO
Comment 3 Ryley Angus 2018-11-15 00:17:48 UTC
Created attachment 279447 [details]
hdparm -I output
Comment 4 Ryley Angus 2018-11-15 00:50:36 UTC
Samsung's support forum has several users experiencing similar issues as far back as June: https://eu.community.samsung.com/t5/Cameras-IT-Everything-Else/860-EVO-250GB-causing-freezes-on-AMD-system/td-p/575813 . It seems there may be a compatibility issue between some AMD SATA controllers and the 860 series. I'll try to test my SSD with an Intel SATA controller.
Comment 5 Ryley Angus 2018-11-15 22:26:24 UTC
After disabling discard/trim on the SSD's ext4 partition, I was able to consistently reproduce the freezing behaviour that was previously triggered by trimming the disk by copying a large (>50GB) file to the SSD. The freezing was not immediate, it occurred 2-3 minutes into the transfer.

If I disable NCQ, both trim and heavy data transfers work reliably (as far as I can tell).
Comment 6 Roman Mamedov 2018-11-28 18:03:07 UTC
I can confirm I face the same issue with a 500 GB 860 EVO just on dd'ing from another disk to it (i.e. full speed streaming write).

Also you can see UDMA CRC Error Count increasing in SMART.

There were no issues with more than a dozen other vendors' SSDs that I tried on this controller so far, only this one.

Eventually the Samsung steps down to 1.5Gbps SATA link, and from then on starts working fine. Disabling NCQ does indeed help, but it hobbles the random IO performance immensely. As far as I know there is no solution, other than hopefully a firmware fix by Samsung. Until then, to prevent data loss, NCQ should be disabled unfortunately.
Comment 7 Solomon Peachy 2018-12-01 18:20:30 UTC
Seeing this on my system too, currently running a Fedora 4.19.2 kernel.

Model=Samsung SSD 860 EVO 1TB
FwRev=RVT01B6Q

MSI 970A-G46 motherboard, which has an AMD970+SB950 chipset.

I can provide more details if necessary

Samsung has not released any firmware updates for this device, and by all accounts they do not intend to, despite this problem affecting Windows systems as well.
Comment 8 Matt Whitlock 2019-01-26 03:40:24 UTC
For those affected by this issue, does downgrading to kernel 4.18.19 relieve the symptoms? I started seeing similar "FPDMA QUEUED" errors during heavy I/O to my Samsung SSD 860 Pro after I upgraded to the 4.19 kernel series. Downgrading to 4.18.19, my symptoms disappeared.

I am currently in the process of bisecting the kernel sources to find the commit that introduced the regression. I've been at it since mid-December, as it takes several days to gain confidence that a given commit is "good." (Usually it takes 2-3 days of uptime before I discover that a given commit is "bad.")

If downgrading to 4.18.19 does not resolve the issue for others in this report, then I am experiencing a different issue.
Comment 9 Solomon Peachy 2019-01-26 13:01:28 UTC
The forum thread referenced by comment 4 includes logs taken from a 4.15 kernel.

By all accounts this does not appear to be a "regression" introduced by a recent Linux kernel; instead it appears to be due to an incompatibility between AMD SATA controllers and the 860 EVO device firmware.

(And it's broken/flaky on Windows too.  So, probably not Linux's fault.)
Comment 10 Sitsofe Wheeler 2019-08-03 14:48:52 UTC
I think I'm seeing this issue too: a Samsung 860 SSD triggers "WRITE FPDMA QUEUED" errors in the kernel log/dmesg under heavy I/O, causing terrible performance and unreliability unless NCQ is disabled. The SATA controller in the affected HP MicroServer N36L is an AMD SB7x0/SB8x0/SB9x0. The issue happens on both the 4.15.0-43-generic and 4.18.0-13-generic kernels, and in my case a simple fio job was enough to trigger it:

fio --name=test --readonly --rw=randread --filename /dev/sdb --bs=32k \
    --ioengine=libaio --iodepth=32 --direct=1 --runtime=10m --time_based=1


Here are some more links that may (or may not) be the same thing:

https://github.com/zfsonlinux/zfs/issues/4873#issuecomment-449886669
https://eu.community.samsung.com/t5/Cameras-IT-Everything-Else/860-EVO-250GB-causing-freezes-on-AMD-system/td-p/575813
https://marc.info/?l=linux-block&m=154644276512949&w=2

This may also be linked to https://bugzilla.redhat.com/show_bug.cgi?id=1729678 and https://bugzilla.kernel.org/show_bug.cgi?id=203475 .

(CC'ing Jens)
Comment 11 Daniel Kenzelmann 2019-11-30 11:59:27 UTC
Same issue with Samsung SSD 860 EVO 1TB but with (newer?) RVT03B6Q firmware and the following controller:
Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] (rev 40) (prog-if 01 [AHCI 1.0])
	Subsystem: ASRock Incorporation SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]
	
Disabling NCQ completely (using libata.force=noncq) seems to solve the issue.

Does anyone know if setting /sys/block/<device>/device/queue_depth to 1 has the same effect of solving this issue? (or is it something else when enabling ncq altogether that causes the issue?)
Comment 12 Roman Mamedov 2019-11-30 12:13:03 UTC
> Does anyone know if setting /sys/block/<device>/device/queue_depth to 1 has
> the same effect of solving this issue?

Yes, it absolutely should. You don't have to disable NCQ for the entire system to solve this. It can only be an issue if you want to use this as your boot drive and can't find a way to set queue_depth=1 early enough during boot; in that case you can still hit the NCQ issue during bootup before it is set.
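
As a rough illustration of the per-device approach, the sketch below writes a queue depth of 1 to the sysfs attribute named in the question. The helper names and the `sysfs_root` parameter are made up for this example (the parameter only exists so the code can be tried against a scratch directory). On a real system it needs root, and the setting has to be re-applied after every boot.

```python
from pathlib import Path


def queue_depth_path(device: str, sysfs_root: str = "/sys") -> Path:
    """Return the sysfs attribute that controls the queue depth of e.g. 'sdb'."""
    return Path(sysfs_root) / "block" / device / "device" / "queue_depth"


def set_queue_depth(device: str, depth: int, sysfs_root: str = "/sys") -> int:
    """Write a new queue depth and return the value read back afterwards.

    depth=1 is the value this thread uses to stop NCQ being used on one drive.
    """
    if not 1 <= depth <= 32:
        raise ValueError("queue depth must be between 1 and 32")
    attr = queue_depth_path(device, sysfs_root)
    attr.write_text(f"{depth}\n")
    # Read the attribute back so the caller sees what is actually in effect.
    return int(attr.read_text().strip())
```

Calling `set_queue_depth("sdb", 1)` as root would be the equivalent of echoing 1 into the attribute by hand; reading the value back is a cheap way to confirm the write took effect.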
Comment 13 Alexander Tsoy 2020-06-21 16:56:43 UTC
Confirming this issue for Samsung 883 DCT as well. Disabling NCQ works as a workaround:

$ cat /etc/udev/rules.d/99-disk.rules 
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd*", ENV{DEVTYPE}=="disk", ENV{ID_MODEL}=="Samsung_SSD_883_DCT_1.92TB", ATTR{device/queue_depth}="1"

You might also want to include the udev rule in the initramfs. For dracut users:

$ cat /etc/dracut.conf.d/local.conf
install_optional_items+=" /etc/udev/rules.d/99-disk.rules "
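
If several drive models need the same treatment, rules in this shape can be generated rather than typed. A hedged sketch: the helper names are hypothetical, and the guess that ID_MODEL is the model string with whitespace turned into underscores should be confirmed with `udevadm info --query=property --name=/dev/sdX` before relying on it.

```python
def udev_model_string(model: str) -> str:
    """Approximate ID_MODEL: runs of whitespace become single underscores.

    This mapping is an assumption; check the real value with udevadm.
    """
    return "_".join(model.split())


def queue_depth_rule(model: str, depth: int = 1) -> str:
    """Build one udev rule line in the same shape as the rule quoted above."""
    return (
        'ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd*", '
        'ENV{DEVTYPE}=="disk", '
        f'ENV{{ID_MODEL}}=="{udev_model_string(model)}", '
        f'ATTR{{device/queue_depth}}="{depth}"'
    )
```

Passing "Samsung SSD 883 DCT 1.92TB" reproduces the rule shown in this comment.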
Comment 14 Alexander Tsoy 2020-08-04 23:47:31 UTC
Update: I still have several issues with device/queue_depth=1: occasional freezes, and very long freezes after performing fstrim, probably because queued trim is still enabled. So I ended up with a different workaround (see Documentation/admin-guide/kernel-parameters.txt for the libata.force option format):

$ grep -o "libata[^ ]*" /proc/cmdline 
libata.force=1:noncq,2:noncq

$ sudo dmesg | grep NCQ
[    4.438905] ata2.00: 3750748848 sectors, multi 16: LBA48 NCQ (not used)
[    4.443898] ata1.00: 3750748848 sectors, multi 16: LBA48 NCQ (not used)
[    4.445272] ata4.00: 19532873728 sectors, multi 0: LBA48 NCQ (depth 32), AA
[    4.446274] ata3.00: 19532873728 sectors, multi 0: LBA48 NCQ (depth 32), AA
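
To check a whole machine at once, the same dmesg lines can be parsed mechanically. A minimal sketch, assuming the line format shown above (other kernel versions may word the NCQ part differently); the function name and the use of 0 for "not used" are arbitrary choices for this example.

```python
import re

# Matches lines such as:
#   ata2.00: 3750748848 sectors, multi 16: LBA48 NCQ (not used)
#   ata4.00: 19532873728 sectors, multi 0: LBA48 NCQ (depth 32), AA
NCQ_LINE = re.compile(r"(ata\d+\.\d+): .*\bNCQ \((not used|depth (\d+))")


def ncq_depths(dmesg_text: str) -> dict:
    """Map each ATA device ID to its NCQ depth, or 0 when NCQ is not used."""
    result = {}
    for line in dmesg_text.splitlines():
        match = NCQ_LINE.search(line)
        if match:
            device, _, depth = match.groups()
            result[device] = int(depth) if depth else 0
    return result
```

Fed the four lines above, it reports depth 0 for ata1.00 and ata2.00 and depth 32 for ata3.00 and ata4.00.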
Comment 15 Solomon Peachy 2020-08-17 15:19:54 UTC
I'd purchased a 3rd-party SATA controller and have been using the 860 EVO apparently problem-free for many months with ncq enabled... until this morning.

It was the first reboot in over 2 months, and the system crapped out badly on startup.  My guess is that Fedora tried to do a queued FSTRIM, leading to a large pile of DMA WRITE errors, resulting in xfs corruption bad enough that it failed to mount -- I had to run xfs_repair to get it to successfully mount the rootfs.

I forgot to disable NCQ after everything was fixed... and it got trashed to the point of needing xfs_repair _again_.

Meanwhile, Samsung still refuses to acknowledge there is a problem with their 860 EVO SSD firmware, much less release an update.
Comment 16 Solomon Peachy 2020-08-17 15:43:08 UTC
Oh, forgot to say my latest corruption was with Fedora's 5.7.12-200.fc32.x86_64 kernel, plugged into an ASMedia ASM1062 SATA controller.  The motherboard is the same as before, sporting an AMD970+SB950 chipset.
Comment 17 Dag Nygren 2020-08-22 08:31:33 UTC
Seeing a very similar thing happen with a completely different setup:

Drive: SAMSUNG 870 QVO 1TB
Controller: Intel Corporation 82801IBM/IEM

Just tried setting the queue_depth to 1 and so far so good. But the problem has been very intermittent. Cannot change cable as this is a laptop.
Comment 18 Andreas Elvers 2020-09-24 11:12:36 UTC
The NCQ problem also affects the Samsung server grade SSDs.

Drive: Samsung MZ7LH1T9HMLT
Controller: SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]

This issue seems to be very big. I wonder why Samsung and AMD can't come to a solution to help their customers.
Comment 19 Alejandro Donato 2020-11-18 20:33:28 UTC
+1

Kernel: 5.4.0-54-generic

SATA controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]

Samsung SSD 850 EVO 500GB (firmware EMT02B6Q)

On the ASMEDIA SATA II 6Gbps controller (now disabled): read errors and heavy freezes (even filesystem damage).

On the AMD controller, the link fails and eventually falls back to UDMA/100:PIO4...

----syslog------

[  248.159560] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  248.228187] ata3.00: configured for UDMA/133
[  312.724185] ata3: limiting SATA link speed to 1.5 Gbps
[  320.675869] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[  320.769960] ata3.00: configured for UDMA/133
[  392.987615] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[  393.069527] ata3.00: configured for UDMA/133
[  468.707516] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[  468.802562] ata3.00: configured for UDMA/133
[  542.007285] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[  542.069186] ata3.00: configured for UDMA/133
[  571.309052] hrtimer: interrupt took 19395 ns
[  606.873323] ata3.00: limiting speed to UDMA/100:PIO4
[  614.882962] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[  614.969015] ata3.00: configured for UDMA/100
[  688.073627] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[  688.143670] ata3.00: configured for UDMA/100

-------------
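
For reading logs like the one above: the SStatus number that libata prints is hexadecimal, and its second nibble is the negotiated link speed. The sketch below decodes it using the field layout from the SATA/AHCI register definition; the function name and the returned dictionary shape are just choices made for this example.

```python
# SPD field values defined for the SATA SStatus register.
SPEEDS = {1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}


def decode_sstatus(hex_value: str) -> dict:
    """Split an SStatus value (printed in hex by libata) into its fields."""
    value = int(hex_value, 16)
    det = value & 0xF          # bits 3:0, device detection
    spd = (value >> 4) & 0xF   # bits 7:4, negotiated link speed
    ipm = (value >> 8) & 0xF   # bits 11:8, interface power state
    return {"det": det, "spd": SPEEDS.get(spd, "unknown"), "ipm": ipm}
```

For the log above, `decode_sstatus("123")` gives a speed of 3.0 Gbps and `decode_sstatus("113")` gives 1.5 Gbps, matching the "SATA link up" text on the same lines.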

Performance is heavily degraded.

This is a CRITICAL failure. Please update the bug report.

Thanks!!!
Comment 20 Roman Elshin 2020-11-21 13:15:30 UTC
I have a Samsung 860 Pro with RVM02B6Q firmware; it is incompatible not only with AMD AHCI. With an ASMedia ASM1061 it seems to work with queued TRIM disabled (ATA_HORKAGE_NO_NCQ_TRIM | ATA_HORKAGE_ZERO_AFTER_TRIM in libata-core.c).
With a Marvell 88SE9120 it requires libata.force=3.0G to work.
Interesting: is there any SATA3 PCIe x1 card where Samsung's 860* crap works flawlessly?
Comment 21 nejrobbins 2020-12-05 03:26:41 UTC
Also happening here. 860 EVO, latest firmware, ASRock 970M Pro3 motherboard. So AMD SATA chipset: 970 northbridge with AMD SB950 southbridge. Getting the same "FPDMA QUEUED" errors, and the drive falls back to SATA 1.5Gbps. Disabling NCQ does solve the issue, but hurts performance.

Interestingly, on Windows with the AMD Sata Driver the OS freezes entirely. But with the Microsoft Sata Controller Driver (that AMD now recommends using), the system doesn't freeze anymore (but still gets the other issues without disabling NCQ). So maybe some type of sata firmware fix is possible?
Comment 22 Roman Mamedov 2020-12-05 09:15:08 UTC
> But with the Microsoft Sata Controller Driver (that AMD now recommends
> using), the system doesn't freeze anymore (but still gets the other issues
> without disabling NCQ).

Well, if it still has issues until NCQ is disabled, that only confirms that the issue can't really be solved by the driver. Also, did you check what performance you get with it, and how it compares with using AMD's driver? IIRC the MS driver was quite slow, so it might be a bit more "reliable" only due to being suboptimal and not pushing the hardware nearly as hard.

One other thing I should mention, as I see people reporting issues on non-AMD controllers as well;  to clarify my experience so far, on the AMD chipset controller I have to disable NCQ entirely to get it working; but on the ASMedia controllers it seems to be enough to disable just the queued TRIM (so the same as Roman Elshin reports above).
https://bugzilla.kernel.org/show_bug.cgi?id=203475
Try that, maybe you can regain some of the lost performance and still get a reliable operation out of the device.
Comment 23 nejrobbins 2020-12-05 16:19:13 UTC
With the AMD driver I couldn't even run a benchmark, as the system would just freeze and I'd have to force restart. Both with CrystalDiskMark and Samsung Magician.

However, on the Microsoft driver I would still get around 500MB/s, give or take, if I remember correctly, which is the rated spec for the drive. I'm now on Linux, but I haven't tested speeds with NCQ disabled.

I'll try disabling queued TRIM only and seeing what happens. Thx.
Comment 24 Sitsofe Wheeler 2020-12-08 19:19:13 UTC
Can people who are seeing this report which model (e.g. SSD 860 EVO 500GB), firmware (e.g. RVT01B6Q) and PCI card (e.g. AMD SB7x0/SB8x0/SB9x0) they have? In my case smartctl -a <dev> reports that I'm on RVT01B6Q firmware, which is apparently behind the latest (RVT04B6Q) listed on https://www.samsung.com/semiconductor/minisite/ssd/download/tools/ . If folks are feeling brave and can take the risk, can they report whether the issue still reproduces on the latest firmware?
Comment 25 Sitsofe Wheeler 2020-12-08 19:31:53 UTC
Hmm poking about the web it doesn't look like firmware updates are solving this issue (see https://community.amd.com/t5/server-gurus-discussions/issues-with-samsung-ssds-on-epyc/td-p/402737 ).
Comment 26 nejrobbins 2020-12-08 19:57:06 UTC
Another person experiencing the issue with an Intel NUC: https://unix.stackexchange.com/questions/623238/root-causes-for-failed-command-write-fpdma-queued.

But @Sitsofe Wheeler I'm on RVT04B6Q and still experiencing the issue, so it doesn't seem to be firmware as you said. I have AMD SB950.
Comment 27 Sitsofe Wheeler 2020-12-08 23:05:21 UTC
I flashed my 860 EVO to RVT04B6Q and the issue is still present, which confirms nejrobbins' message above.
Comment 28 Roy 2020-12-28 00:44:08 UTC
Wish I had found this bug report before getting an SSD to provide a much-needed performance boost to my ageing PC.

* Specs:
- Asus M5A97 Evo Rev2.0, latest UEFI,
- AMD FX-6300,
- Samsung EVO 860, firmware RVT04B6Q,
- Kernel 5.9.16

lspci -vv (filtered SATA controller only, double-checked with lshw)
00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] (rev 40) (prog-if 01 [AHCI 1.0])
	Subsystem: ASUSTeK Computer Inc. M5A99X EVO (R1.0) SB950
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 32
	Interrupt: pin A routed to IRQ 19
	NUMA node: 0
	Region 0: I/O ports at f040 [size=8]
	Region 1: I/O ports at f030 [size=4]
	Region 2: I/O ports at f020 [size=8]
	Region 3: I/O ports at f010 [size=4]
	Region 4: I/O ports at f000 [size=16]
	Region 5: Memory at fe60b000 (32-bit, non-prefetchable) [size=1K]
	Capabilities: <access denied>
	Kernel driver in use: ahci


* Symptoms:
dmesg was riddled with these messages:

dec 27 14:08:24 tuvok kernel: ata1: log page 10h reported inactive tag 17
dec 27 14:08:24 tuvok kernel: ata1.00: exception Emask 0x1 SAct 0x7ffc003f SErr 0x0 action 0x0
dec 27 14:08:24 tuvok kernel: ata1.00: irq_stat 0x40000008
dec 27 14:08:24 tuvok kernel: ata1.00: failed command: WRITE FPDMA QUEUED
dec 27 14:08:24 tuvok kernel: ata1.00: cmd 61/08:00:58:2d:57/00:00:2d:00:00/40 tag 0 ncq dma 4096 out
                                       res 40/00:28:f0:2d:57/00:00:2d:00:00/40 Emask 0x1 (device error)
dec 27 14:08:24 tuvok kernel: ata1.00: status: { DRDY }
dec 27 14:08:24 tuvok kernel: ata1.00: failed command: WRITE FPDMA QUEUED
dec 27 14:08:24 tuvok kernel: ata1.00: cmd 61/08:08:68:2d:57/00:00:2d:00:00/40 tag 1 ncq dma 4096 out
                                       res 40/00:28:f0:2d:57/00:00:2d:00:00/40 Emask 0x1 (device error)
dec 27 14:08:24 tuvok kernel: ata1.00: status: { DRDY }
dec 27 14:08:24 tuvok kernel: ata1.00: failed command: WRITE FPDMA QUEUED
dec 27 14:08:24 tuvok kernel: ata1.00: cmd 61/08:10:98:2d:57/00:00:2d:00:00/40 tag 2 ncq dma 4096 out
                                       res 40/00:28:f0:2d:57/00:00:2d:00:00/40 Emask 0x1 (device error)
dec 27 14:08:24 tuvok kernel: ata1.00: status: { DRDY }
dec 27 14:08:24 tuvok kernel: ata1.00: failed command: WRITE FPDMA QUEUED
dec 27 14:08:24 tuvok kernel: ata1.00: cmd 61/08:18:a8:2d:57/00:00:2d:00:00/40 tag 3 ncq dma 4096 out
                                       res 40/00:28:f0:2d:57/00:00:2d:00:00/40 Emask 0x1 (device error)

* Work-around:
Booting with libata.force=noncq works around this issue.
Comment 29 Gregory P. Smith 2021-02-23 06:18:25 UTC
A new Samsung 870 EVO 1TB SSD runs into this issue on Linux any time a DISCARD is sent. :(

Ex: Removing an LVM snapshot after doing a backup because of the questionable behavior I've been observing... bam:

Feb 22 11:28:32 zoonaut kernel: [130904.469448] ata4.00: qc timeout (cmd 0x47)
Feb 22 11:28:32 zoonaut kernel: [130904.470626] ata4.00: READ LOG DMA EXT failed, trying PIO
Feb 22 11:28:32 zoonaut kernel: [130904.470633] ata4: failed to read log page 10h (errno=-5)
Feb 22 11:28:32 zoonaut kernel: [130904.470700] ata4.00: exception Emask 0x1 SAct 0x40 SErr 0x0 action 0x6 frozen
Feb 22 11:28:32 zoonaut kernel: [130904.470748] ata4.00: irq_stat 0x40000008
Feb 22 11:28:32 zoonaut kernel: [130904.470779] ata4.00: failed command: SEND FPDMA QUEUED
Feb 22 11:28:32 zoonaut kernel: [130904.470824] ata4.00: cmd 64/01:30:00:00:00/00:00:00:00:00/a0 tag 6 ncq dma 512 out
Feb 22 11:28:32 zoonaut kernel: [130904.470824]          res 50/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x1 (device error)
Feb 22 11:28:32 zoonaut kernel: [130904.470932] ata4.00: status: { DRDY }
Feb 22 11:28:32 zoonaut kernel: [130904.470965] ata4: hard resetting link
Feb 22 11:28:32 zoonaut kernel: [130904.789997] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb 22 11:28:32 zoonaut kernel: [130904.794412] ata4.00: supports DRM functions and may not be fully accessible
Feb 22 11:28:32 zoonaut kernel: [130904.797441] ata4.00: supports DRM functions and may not be fully accessible
Feb 22 11:28:32 zoonaut kernel: [130904.799759] ata4.00: configured for UDMA/133
Feb 22 11:28:32 zoonaut kernel: [130904.799771] ata4.00: device reported invalid CHS sector 0
Feb 22 11:28:32 zoonaut kernel: [130904.799792] sd 3:0:0:0: [sdc] tag#6 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 22 11:28:32 zoonaut kernel: [130904.799799] sd 3:0:0:0: [sdc] tag#6 Sense Key : Illegal Request [current] 
Feb 22 11:28:32 zoonaut kernel: [130904.799803] sd 3:0:0:0: [sdc] tag#6 Add. Sense: Unaligned write command
Feb 22 11:28:32 zoonaut kernel: [130904.799809] sd 3:0:0:0: [sdc] tag#6 CDB: Write same(16) 93 08 00 00 00 00 69 40 10 00 00 20 00 00 00 00
Feb 22 11:28:32 zoonaut kernel: [130904.799815] blk_update_request: I/O error, dev sdc, sector 1765806080 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
Feb 22 11:28:32 zoonaut kernel: [130904.799969] ata4: EH complete
Feb 22 11:28:32 zoonaut kernel: [130904.800165] ata4.00: Enabling discard_zeroes_data
...
Feb 22 18:40:55 zoonaut kernel: [156847.619004] ata4.00: exception Emask 0x0 SAct 0xf000 SErr 0x0 action 0x0
Feb 22 18:40:55 zoonaut kernel: [156847.619075] ata4.00: irq_stat 0x40000008
Feb 22 18:40:55 zoonaut kernel: [156847.619106] ata4.00: failed command: SEND FPDMA QUEUED
Feb 22 18:40:55 zoonaut kernel: [156847.619148] ata4.00: cmd 64/01:60:00:00:00/00:00:00:00:00/a0 tag 12 ncq dma 512 out
Feb 22 18:40:55 zoonaut kernel: [156847.619148]          res 41/04:01:00:00:00/00:00:00:00:00/00 Emask 0x401 (device error) <F>
Feb 22 18:40:55 zoonaut kernel: [156847.619247] ata4.00: status: { DRDY ERR }
Feb 22 18:40:55 zoonaut kernel: [156847.619275] ata4.00: error: { ABRT }
Feb 22 18:40:55 zoonaut kernel: [156847.619792] ata4.00: supports DRM functions and may not be fully accessible
Feb 22 18:40:55 zoonaut kernel: [156847.622381] ata4.00: supports DRM functions and may not be fully accessible
Feb 22 18:40:55 zoonaut kernel: [156847.624512] ata4.00: configured for UDMA/133
Feb 22 18:40:55 zoonaut kernel: [156847.624544] ata4: EH complete
Feb 22 18:40:55 zoonaut kernel: [156847.624714] ata4.00: Enabling discard_zeroes_data
Feb 22 18:40:55 zoonaut kernel: [156847.734820] ata4.00: exception Emask 0x0 SAct 0x7e SErr 0x0 action 0x0
Feb 22 18:40:55 zoonaut kernel: [156847.734890] ata4.00: irq_stat 0x40000008
Feb 22 18:40:55 zoonaut kernel: [156847.734922] ata4.00: failed command: SEND FPDMA QUEUED
Feb 22 18:40:55 zoonaut kernel: [156847.734963] ata4.00: cmd 64/01:08:00:00:00/00:00:00:00:00/a0 tag 1 ncq dma 512 out
Feb 22 18:40:55 zoonaut kernel: [156847.734963]          res 41/04:01:00:00:00/00:00:00:00:00/00 Emask 0x401 (device error) <F>
Feb 22 18:40:55 zoonaut kernel: [156847.735061] ata4.00: status: { DRDY ERR }
Feb 22 18:40:55 zoonaut kernel: [156847.735090] ata4.00: error: { ABRT }
Feb 22 18:40:55 zoonaut kernel: [156847.735722] ata4.00: supports DRM functions and may not be fully accessible
Feb 22 18:40:55 zoonaut kernel: [156847.738424] ata4.00: supports DRM functions and may not be fully accessible
Feb 22 18:40:55 zoonaut kernel: [156847.740732] ata4.00: configured for UDMA/133
Feb 22 18:40:55 zoonaut kernel: [156847.740776] ata4: EH complete
Feb 22 18:40:55 zoonaut kernel: [156847.741073] ata4.00: Enabling discard_zeroes_data
... and on and on and on...
Feb 22 18:40:56 zoonaut kernel: [156848.541172] ata4.00: Enabling discard_zeroes_data
Feb 22 18:40:56 zoonaut kernel: [156848.638967] ata4.00: exception Emask 0x0 SAct 0x3f00 SErr 0x0 action 0x0
Feb 22 18:40:56 zoonaut kernel: [156848.642010] ata4.00: irq_stat 0x40000008
Feb 22 18:40:56 zoonaut kernel: [156848.645155] ata4.00: failed command: SEND FPDMA QUEUED
Feb 22 18:40:56 zoonaut kernel: [156848.648344] ata4.00: cmd 64/01:40:00:00:00/00:00:00:00:00/a0 tag 8 ncq dma 512 out
Feb 22 18:40:56 zoonaut kernel: [156848.648344]          res 41/04:01:00:00:00/00:00:00:00:00/00 Emask 0x401 (device error) <F>
Feb 22 18:40:56 zoonaut kernel: [156848.654650] ata4.00: status: { DRDY ERR }
Feb 22 18:40:56 zoonaut kernel: [156848.657798] ata4.00: error: { ABRT }
Feb 22 18:40:56 zoonaut kernel: [156848.661629] ata4.00: supports DRM functions and may not be fully accessible
Feb 22 18:40:56 zoonaut kernel: [156848.664769] ata4.00: supports DRM functions and may not be fully accessible
Feb 22 18:40:56 zoonaut kernel: [156848.666981] ata4.00: configured for UDMA/133
Feb 22 18:40:56 zoonaut kernel: [156848.667013] sd 3:0:0:0: [sdc] tag#8 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 22 18:40:56 zoonaut kernel: [156848.667019] sd 3:0:0:0: [sdc] tag#8 Sense Key : Illegal Request [current] 
Feb 22 18:40:56 zoonaut kernel: [156848.667024] sd 3:0:0:0: [sdc] tag#8 Add. Sense: Unaligned write command
Feb 22 18:40:56 zoonaut kernel: [156848.667029] sd 3:0:0:0: [sdc] tag#8 CDB: Write same(16) 93 08 00 00 00 00 69 40 10 00 00 3f ff c0 00 00
Feb 22 18:40:56 zoonaut kernel: [156848.667035] blk_update_request: I/O error, dev sdc, sector 1765806080 op 0x3:(DISCARD) flags 0x4000 phys_seg 1 prio class 0
Feb 22 18:40:56 zoonaut kernel: [156848.670129] ata4: EH complete

lvremove thankfully just waits and retries all its I/O patiently, ultimately succeeding and just noting in an error message that its DISCARD got an I/O error.

But I no longer trust the device in my machine.

HP Microserver Gen10 AMD based system: lspci -v shows

00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 49) (prog-if 01 [AHCI 1.0])
        Subsystem: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
Comment 30 Gregory P. Smith 2021-02-24 05:27:19 UTC
For those disabling NCQ as a workaround.  You can do this per-drive rather than system wide.  Writing 1 to /sys/block/sdX/device/queue_depth instead of the default value disables NCQ.

After I do that, lvresize -L -299G no longer takes ages or produces ata discard errors in syslog.  Re-enabling NCQ by writing a higher value there, like the default 32 (really 31), causes it to run into errors again.

I've already picked up a WD Blue SSD to replace this buggy Samsung.  Not supporting NCQ properly is unacceptable to me.  I could just leave trim and discard disabled, but that's an equally hacky non-default config that nothing manufactured and sold in 2021 should require.

I'll avoid Samsung SSDs in the future and recommend others do the same.  This may only be a bug in their SATA line (which is becoming a legacy product, merely a replacement for HDDs).

Linux kernel wise, the 8xx series Samsung SATA SSDs could be blocklisted as known troublesome devices so that trim or NCQ are disabled by default on them.  Do we accept this kind of vendor quirk hack in mainline kernels?
Comment 31 Roman Mamedov 2021-02-24 07:23:43 UTC
Gregory, did you try disabling just the queued TRIM, not NCQ entirely? As suggested in https://bugzilla.kernel.org/show_bug.cgi?id=203475
Comment 32 Gregory P. Smith 2021-02-24 09:30:22 UTC
Thanks for the link. I hadn't seen that issue. Seems to be the same thing as this. Disabling just queued TRIM and not NCQ entirely appears to require rebuilding a kernel, which isn't something I can do to this machine.

Relevant patch and pointer to the change that needs reverting in the kernel:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca6bfcb2f6d9deab3924bf901e73622a94900473

People using Linux distros - file issues against your distro asking them to use a "Samsung [78]*" in that blocklist.

Also related: https://bugzilla.kernel.org/show_bug.cgi?id=202093  Which came from Canonical's existing Ubuntu issue https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1809972
Comment 33 Matt Whitlock 2021-02-24 15:01:33 UTC
(In reply to Gregory P. Smith from comment #30)
> Linux kernel wise, the 8xx series Samsung SATA SSDs could be blocklisted as
> known troublesome devices so that trim or NCQ are disabled by default on
> them.

PLEASE don't do that! I have a Samsung 860 Pro and a Samsung 860 EVO, and both work great with NCQ and queued Trim enabled. The incompatibility is specifically with AMD SATA controllers. Don't hobble these great drives universally because of one bad controller.
Comment 34 nejrobbins 2021-02-24 17:12:00 UTC
It is predominantly AMD SATA, but there have been some reports of it happening to Intel systems. I agree though it shouldn't be disabled outright.

From what I saw, typically just disabling queued TRIM doesn't fix the issue, as it appears in other contexts and not just when trimming. But it may still be useful to try for some.

You can also disable NCQ for a specific SATA port (and not system wide) with libata.force=X.00:noncq, where X is the ATA port number (the N in the ataN prefix of the dmesg lines).

From: https://wiki.archlinux.org/index.php/Solid_state_drive#Resolving_NCQ_errors
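
A possible way to find the port number without reading through dmesg, sketched in Python: resolve the drive's sysfs path and pick out its ataN component. This assumes the resolved path of /sys/block/<device> contains an ataN directory, which is worth checking by hand with `readlink -f /sys/block/sdX` first; the function names are invented for this example.

```python
import os
import re


def ata_port_from_path(sysfs_path: str):
    """Pull N out of the 'ataN' component of a resolved sysfs path, if any."""
    match = re.search(r"/ata(\d+)/", sysfs_path)
    return int(match.group(1)) if match else None


def ata_port_of(device: str):
    """Resolve /sys/block/<device> and return its ATA port number, or None."""
    return ata_port_from_path(os.path.realpath(f"/sys/block/{device}"))


def libata_force(ports, setting: str = "noncq") -> str:
    """Build a kernel command-line argument such as libata.force=1.00:noncq."""
    return "libata.force=" + ",".join(f"{p}.00:{setting}" for p in sorted(ports))
```

For example, `libata_force([ata_port_of("sdc")])` would produce the argument to add to the kernel command line for that one drive, and passing "noncqtrim" as the setting gives the queued-TRIM-only variant.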

I wonder what the performance impact of disabling NCQ is? I'm not sure of the relationship between average use queue depth and sequential vs random RW.


Just as a side note, on Windows with the AMD SATA driver the OS would freeze during benchmarks, but with the Microsoft driver it wouldn't. Not sure if NCQ got disabled there or anything, and I didn't do a benchmark at the time.
Comment 35 Gregory P. Smith 2021-02-24 18:41:10 UTC
Kernel wise, it should ship with safe default behavior.  Let people who want this enabled on known often borked devices opt-in.  Don't default to causing performance or potential data loss issues for those who fail to opt-out.

The kernel's libata has the HORKAGE list for a reason.  Let's use it to maximal user benefit to avoid problems.  Re-enabling ncq (or ncqtrim if it is that specific) on these Samsung SSDs was a mistake that everyone here piping up is paying for.

The 870 is brand new, just released this year.  Yet it has the problem.

While I've seen an old blog post claiming that disabling NCQ on their SATA SSD led to a reduction in 4K random read performance, firing up fio to do a randread test is not reproducing that for me.  In fact... I just found that I/O speed on the device surprisingly went _up_ 20-30% after I disabled NCQ by writing 1 to /sys/block/sdc/device/queue_depth.  WAT?

If that's the case, I don't want NCQ on such devices.  No idea if I'm holding fio wrong.  First time using it.

Anyways, that's all the time I have for this.  Samsung did wrong with their SATA SSD firmware.  The Kernel is doing wrong for users who own those devices today.  

You've got the power to fix it for all Samsung SSD owners.
Comment 36 Matt Whitlock 2021-02-24 19:38:38 UTC
(In reply to Gregory P. Smith from comment #35)
> firing up fio to do a
> randread test is not reproducing that for me.  In fact... I just found that
> I/O speed on the device surprisingly went _up_ 20-30% after I disabled NCQ by
> writing 1 to /sys/block/sdc/device/queue_depth.  WAT?

I just ran fio on my Samsung 860 EVO 2TB, random 4K reads with libaio engine, I/O depth 256, jobs 4, runtime 120 seconds.

With queue_depth=32 (default): read: IOPS=46.7k, BW=182MiB/s

With queue_depth=1: read: IOPS=13.3k, BW=51.0MiB/s

No messages in my kernel logs for the duration of these tests.

So again, please don't cripple these drives for everyone. If there is an incompatibility with specific SATA controllers, then address that specifically.
Comment 37 Gregory P. Smith 2021-02-24 21:02:20 UTC
Trying to come up with a list of broken samsung-ssd/sata-controller combos is Samsung's job.  But I'd be surprised if they wanted to do that as it'd nerf their marketing-only benchmarks.  Ship code that is safe by default.

Rollback https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca6bfcb2f6d9deab3924bf901e73622a94900473

Disabling it on all Intel, AMD, asmedia, and marvell SATA controllers would be the only reasonable choice to do in the kernel based on all of the comments on the issues so far.

You're setting an arbitrarily impossible testing bar to meet in favor of your own personal combo's performance at the expense of everyone else's data integrity.
Comment 38 Roman Mamedov 2021-02-24 21:18:25 UTC
Should be noted that the referenced commit changes only the NCQ TRIM blacklist, whereas Matt's test results are obtained while disabling the NCQ entirely. 

Disabling just the queued TRIM will not have nearly as much performance impact as disabling NCQ itself. In most cases it will have none, since IIRC the latest best practice from FS devs is not to use inline trim (the "discard" mount option) at all, opting for daily/weekly invocations of "fstrim" instead. But this also might not help on some controllers; the telltale sign for whether it will is whether the errors were "WRITE FPDMA QUEUED" (a generic NCQ issue) or "SEND FPDMA QUEUED" (a problem with NCQ TRIM specifically).
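
A rough sketch of how to check which of the two error types a system is logging. The function name count_fpdma_errors is made up, and the pattern assumes the usual libata "failed command: ..." wording in the kernel log; adjust it if your log lines look different.

```sh
# Hypothetical helper: count queued-command failures by type.
# Reads kernel log text on stdin, e.g.: dmesg | count_fpdma_errors
count_fpdma_errors() {
    awk '
        /failed command: WRITE FPDMA QUEUED/ { n_write++ }
        /failed command: SEND FPDMA QUEUED/  { n_send++ }
        END {
            # Unset counters print as 0 with %d.
            printf "WRITE FPDMA QUEUED: %d\n", n_write
            printf "SEND FPDMA QUEUED: %d\n", n_send
        }
    '
}
```

If only the WRITE count is non-zero, noncqtrim alone is less likely to help, per the distinction above.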
Comment 39 Gregory P. Smith 2021-02-24 21:44:36 UTC
Right.  When I can reboot, I'll try mine with just noncqtrim.  (It wasn't clear to me if there is a way to control the noncqtrim horkage setting via /sys/block/sdc/device/ like there is for disabling NCQ entirely.)

Matt: Can you share the fio command line / config you used?  I'd like to repeat the same test on my own system.  Yours is clearly performing much more like I'd expect given the settings (ncq being disabled _should_ put a big dent in 4k random read performance on a decent SSD)
Comment 40 Matt Whitlock 2021-02-24 23:20:38 UTC
(In reply to Gregory P. Smith from comment #39)
> Matt: Can you share the fio command line / config you used?

Not knowing anything about fio myself, I just used the example command line from https://docs.oracle.com/en-us/iaas/Content/Block/References/samplefiocommandslinux.htm:

fio --filename=/dev/disk/by-id/ata-Samsung_SSD_yaddayaddayadda --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly
Comment 41 Gregory P. Smith 2021-02-27 05:51:58 UTC
Confirming: Rebooting with libata.force=4:noncqtrim to disable queued trim on my Samsung 870 EVO _appears_ to work around the issue for easy to reproduce situations (lvresize to reduce a volume size).

[    2.481763] ata4.00: FORCE: horkage modified (noncqtrim)
[    2.481823] ata4.00: supports DRM functions and may not be fully accessible
[    2.482434] ata4.00: disabling queued TRIM support
[    2.482437] ata4.00: ATA-11: Samsung SSD 870 EVO 1TB, SVT01B6Q, max UDMA/133
[    2.482439] ata4.00: 1953525168 sectors, multi 1: LBA48 NCQ (depth 32), AA
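
To verify after a reboot that the option took effect, something like the following sketch can pull the affected devices out of the log. The function name queued_trim_disabled is made up; the message text it matches is the "disabling queued TRIM support" line shown above.

```sh
# Hypothetical helper: list ATA devices for which the kernel logged
# that queued TRIM was disabled.
# Reads kernel log text on stdin, e.g.: dmesg | queued_trim_disabled
queued_trim_disabled() {
    sed -n 's/^.*\(ata[0-9][0-9]*\.[0-9][0-9]*\): disabling queued TRIM support.*$/\1/p' | sort -u
}
```

Fed the excerpt above, it prints ata4.00; no output means no device logged that message.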

Also confirming: There is no measurable performance degradation during normal use of my SSD from doing so.  All this is disabling is Queued TRIM.

I understand this to mean that a TRIM must act as a barrier: the existing queued IO has to finish first, then the TRIM runs alone as a serialized command, after which normal NCQ IO can resume.

Confirming by doing some lvresize -L -100G commands on a LV on the volume during an fio run, I see a very brief blip in the speeds printed to stdout.  But as it's just a single trim it is inconsequential to performance.

I wouldn't mount your filesystem with -o discard in such a configuration if you'll have frequent transient files and need unwavering read throughput.  But a regular ~weekly scheduled fstrim -a shouldn't be a big deal (which I believe is the default setup in popular distros anyways?).

The patch just re-marks these drives as "disable Queued TRIM" by default.  It doesn't disable NCQ entirely.  Seems like a good safe default.

I don't doubt that some people have issues with NCQ itself.  The reason noncq and noncqtrim libata.force flags were added to the kernel in 2015 was due to the large number of SSDs out there that don't behave well.  (https://patchwork.ozlabs.org/project/linux-ide/patch/1430790861-30066-1-git-send-email-martin.petersen@oracle.com/)

The kernel could fail better in this situation.  Perhaps: "Got an error as a result of a queued trim?  Automatically flip that device to noncqtrim mode."  But that's a larger logic change with consequences.  Updating the horkage blocklist is simple and targets this specific issue.

Matt - thanks for the fio command and link!  _That_ one seems to properly exercise my SSD.

4k read: IOPS=88.7k, BW=346MiB/s (363MB/s)      with noncqtrim or without
4k read: IOPS=11.1k, BW=43.5MiB/s (45.6MB/s)    with noncq (queue_depth=1)
Comment 42 Hans de Goede 2021-03-02 12:34:36 UTC
Hi All,

Upstream kernel dev here, who has done some work in this area in the past.

So I was about to submit a patch upstream to just disable NCQ-TRIM on all "Samsung SSD 8*" sata drives based on this bug report.

But reading through the entire bug report again, I have decided to not do this.

My reason for not doing this is that I believe that it will not solve the problem, it will maybe make it less prominent but it will not solve it.

There are several comments in this bug-report indicating that problems still happen under heavy IO with (queued) trim disabled, see comment 5, comment 6, comment 14.

And everything just points to an incompatibility between AMD/Asmedia SATA controllers (AMD southbridges have been made by Asmedia for a while) and the Samsung 860 / 870 sata SSDs. Given that falling back to 1.5 Gbps appears to help, I guess that there is some incompatibility between the 2 phy-s on each side, which leads to issues under heavy load.

I guess a heavy load on the SSD causes the SSD's voltage-rails to become more noisy and this leaks through the phy-s, which the AMD/Asmedia sata-controllers do not like.

Disabling NCQ (or NCQ-TRIM) reduces the load, masking the issue, but as several comments here show, the issue still happens, just less frequently, so this is not really a fix.

This, combined with there also being many reports (including here) about similar issues under Windows, leads me to the conclusion that AMD/Asmedia sata-controllers and Samsung 860 / 870 sata SSDs are simply incompatible with each other.

So the only solution which I can give you is to not use this combination.

The only kernel patch to "solve" this which I can envision is detecting the combination and then simply refusing to use the SSD (with a big fat warning message).
Comment 43 Gregory P. Smith 2021-03-02 21:36:35 UTC
What is your goal in leaving this enabled?  Causing errors to surface soon that anyone paying attention to hardware misperformance and their log spam will notice, Google the error message, and hopefully wind up here?  learning that they need to customize a kernel command line to have libata.force=$BUSNUMBER:noncqtrim?

I admire the fail fast to force acknowledgement of the problem approach, but that is still rather indirect and painful to address.

**Gaining the ability to control noncqtrim at runtime via /sys/block/sdX/device would be excellent.**

That way decisions like this could be left to userland/distro/etc logic to determine and not require special custom kernel command line configs and further reboots.  It could even be detected by userspace and done automagically.  Discussion could move out of the kernel bugzilla.

Some comments from the various issues related to this also claim to be on Intel or other controllers.  comment 17 and comment 27 in this issue for example.

It is obviously impossible for us to verify the veracity of every commenter's hardware setup.  But doing nothing and taking the manufacturer's word for it, as the 2018 change to re-enable the buggy setting did, seems to have been a mistake.  I question their motivation to re-enable it...
Comment 44 Hans de Goede 2021-03-03 08:40:03 UTC
(In reply to Gregory P. Smith from comment #43)
> What is your goal in leaving this enabled?  Causing errors to surface soon
> that anyone paying attention to hardware misperformance and their log spam
> will notice, Google the error message, and hopefully wind up here?  learning
> that they need to customize a kernel command line to have
> libata.force=$BUSNUMBER:noncqtrim?

As I mentioned already in my original comment, I did consider enabling noncqtrim on these models. But (as also already mentioned) the reporters in comment 5 (original reporter) and comment 6 (dd does not do trim) both report still seeing issues when not using trim, IOW noncqtrim is not sufficient to fix this. It merely helps make the issue less obvious.

> Some comments from the various issues related to this also claim to be on
> Intel or other controllers.  comment 17 and comment 27 in this issue for
> example.

I did check those reports when writing my original comment, but I dismissed them. Comment 17 talks about "the problem has been very intermittent" and there has been 0 follow-up to that comment, making it at best anecdotal proof of there also being problems with Intel controllers.

Comment 27 is discussed in more detail in https://unix.stackexchange.com/questions/623238/root-causes-for-failed-command-write-fpdma-queued and in that case the problem happens at random without there being any load, which is very different from the reporters here, who all report the problem happening under stress. And the reporter there suspects that it might be a bad SATA cable, which sounds plausible.

Not every case of SATA transfer errors has the same root cause. A bad power-supply or a bad cable could equally well be causing problems.

> It is obviously impossible for us to verify the veracity of every commenters
> hardware setup.  But doing nothing and taking the manufacturers word for it
> as the 2018 change to re-enable the buggy setting seems to have been a
> mistake.  I question their motivation to re-enable it...

Again, merely re-enabling noncqtrim is not enough to fix this. If this works for you great. But there are a lot of indications that it does not help in all cases.

The only thing which does seem to help consistently for everyone is using the noncq setting. But as you have shown with the fio tests yourself, the performance hit from that is huge.

As I already mentioned before, I suspect some sort of power-supply issue may also have a hand in things here, and doing a trim typically involves erasing flash blocks, which is an operation with high power-consumption. So what happens when enabling noncqtrim is that the sata connection to the SSD sits idle while waiting for the trim to complete. Since there is no SATA traffic, the higher power-consumption caused by the trim also cannot cause any SATA transfer corruption.

While reading through this bug, I noticed that some of the involved systems are quite old. I wonder what the quality of the used PSU-s in these systems is. Even if the PSU-s were fine when they were new, capacitors degrade over time.

Likewise I guess some people may be using converters to go from a molex power-connector to a sata power-connector. Those might be flaky too.

Has anyone who is seeing this tried replacing their PSU with a new high-quality PSU and checked if that helps?  Yes, unless you have a spare one lying around this is not cheap to test, but it would be an interesting data point.

Also note that AMD claims that they cannot reproduce the issue:
https://community.amd.com/t5/server-gurus-discussions/issues-with-samsung-ssds-on-epyc/m-p/402746/highlight/true#M835

I assume that AMD is using a high quality PSU, with nice and clean power-rails, in their testing, which might be why they are not seeing this.

TL;DR: this is a complex issue, I would love to make a magic wand and make it go away for everyone, but I don't see any easy answers here. I do not believe that noncqtrim will solve this for everyone. The only thing which consistently seems to help is to go full noncq on these drives, which would lead to a big flood of complaints about the performance tanking from people using these with Intel SATA controllers.
Comment 45 Roy 2021-03-03 09:43:11 UTC
(In reply to Hans de Goede from comment #44)
> While reading through this bug, I noticed that some of the involved systems
> are quite old. I wonder what the quality of the used PSU-s in these systems
> is. Even if the PSU-s where fine when they were new capacitors degrade over
> time.
> 
> Likewise I guess some people may be using converters to go from a molex
> power-connector to a sata power-connector. Those might be flaky too.
> 
> Has anyone who is seeing this tried replacing his PSU with a new
> high-quality PSU and checked if that helps ?  Yes unless you have a spare
> one lying around to test this is not cheap, but it would be an interesting
> data point.
> 
> Also note that AMD claims that they cannot reproduce the issue:
> https://community.amd.com/t5/server-gurus-discussions/issues-with-samsung-
> ssds-on-epyc/m-p/402746/highlight/true#M835
> 
> I assume that AMD is using a high quality PSU, with nice and clean
> power-rails in there testing, which might be while they are not seeing this.

Please bear in mind that this AMD community forum report is against a Ryzen-era machine. On the contrary, most people confirming this bug have Bulldozer-era machines or older. There's a good chance problems have been resolved with the Ryzen generation of south bridges.

For me (AMD FX 6300, Asus M5A97-EVO R2, Samsung ) this bug is incredibly easy to reproduce: just boot and run for a few hours. Disabling NCQ makes the problem entirely go away, disabling NCQ Trim is something I haven't tried (yet). The real-world performance penalty appears to be limited, and one I'm (begrudgingly) willing to live with for the remaining lifespan of this machine. Getting a new SSD is an easy way to bring new life to such machines, which is why you're seeing quite a few of us on this report.
Comment 46 Roman Mamedov 2021-03-03 09:49:14 UTC
While I'm not really advocating any kernel change anymore, it seems baffling that with literally 15 other SSDs not having any issue whatsoever on the same system and controller, but when specifically this one single model from Samsung displays its high-profile and well-known all over the Internet NCQ issue, some will still lean to wave that away as "your own fault", that I use an old system with bad PSUs. Great.
Comment 47 Hans de Goede 2021-03-03 10:36:38 UTC
(In reply to Roman Mamedov from comment #46)
> While I'm not really advocating any kernel change anymore, it seems baffling
> that with literally 15 other SSDs not having any issue whatsoever on the
> same system and controller, but when specifically this one single model from
> Samsung displays its high-profile and well-known all over the Internet NCQ
> issue, some will still lean to wave that away as "your own fault", that I
> use an old system with bad PSUs. Great.

<sigh>

I'm not blaming anyone / I'm not saying this is anyone's fault.

As an engineer I'm trying to find a root-cause for this problem. Because without a root cause I cannot fix it.

One part of the equation seems to be using these specific Samsung SSDs, but that clearly is not the whole story.

All I did was post a theory that it might be related to using an older, possibly degraded, PSU. SSD-s have much more "spiky" power-consumption behavior than HDDs, so this might be PSU related.
Comment 48 Hans de Goede 2021-03-03 10:45:29 UTC
(In reply to Roy from comment #45)
> Please bear in mind that this AMD community forum report is against a
> Ryzen-era machine. On the contrary, most people confirming this bug have
> Bulldozer-era machines or older.

So I guess we should consider doing a kernel side quirk where the kernel disables NCQ on the combination of having a Samsung 860 or 870 SSD with a SATA controller on these older AMD chipsets. This does require having a list of PCI-ids for the controllers on which to enable this quirk.
Comment 49 Roy 2021-03-03 10:55:56 UTC
(In reply to Hans de Goede from comment #48)
> (In reply to Roy from comment #45)
> > Please bear in mind that this AMD community forum report is against a
> > Ryzen-era machine. On the contrary, most people confirming this bug have
> > Bulldozer-era machines or older.
> 
> So I guess we should consider doing a kernel side quirk where the kernel
> disables NCQ on the combination of having a Samsung 860 or 870 SSD with a
> SATA controller on these older AMD chipsets. This does require having a list
> of PCI-ids for the controllers on which to enable this quirk.

[roy@Tuvok ~]$ lsscsi -v
[0:0:0:0]    disk    ATA      Samsung SSD 860  4B6Q  /dev/sda 
  dir: /sys/bus/scsi/devices/0:0:0:0  [/sys/devices/pci0000:00/0000:00:11.0/ata1/host0/target0:0:0/0:0:0:0]

[roy@Tuvok ~]$ lspci -vn
<...>
00:11.0 0106: 1002:4391 (rev 40) (prog-if 01 [AHCI 1.0])
	Subsystem: 1043:84dd
<...>
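
For others who want to report the same data point for the PCI-id list, a small sketch that pulls the vendor:device pair out of numeric lspci output. The function name sata_controller_ids is made up; it relies on 0106 being the PCI class code for SATA controllers, as in the line above.

```sh
# Hypothetical helper: print "vendor:device" for every SATA controller
# (PCI class 0106) in "lspci -n" style output read on stdin.
# Typical use: lspci -n | sata_controller_ids
sata_controller_ids() {
    awk '$2 == "0106:" { print $3 }'
}
```

For the output above this prints 1002:4391.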
Comment 50 Alejandro Donato 2021-03-03 11:14:43 UTC
My 2 cents.

Before reporting this bug, I ruled out any hardware issue. I did the whole "hard test check and procedures" so as not to generate a false/incomplete/useless report.
In my humble opinion, this bug tracker is not something to play with.
This is not a regular user forum asking for a way to compile a driver.
I assume the ones who take the time to report and track a bug are experienced people.
Sorry if I sound angry, but the conclusions sound really out of context.

I have 4 systems that all have different hardware (with similar or the same controllers), and they all fail.

4 disks, 4 PSUs, 4 SATA cables... I assume my luck is not so bad that all of this hardware is faulty...

I hope my comments are not taken badly (and sorry for my bad English, it's not my native language); I only ask that this issue not be minimized and that you try to understand that this is not a "crappy/old hardware" related issue.

My technical skills lead me to presume a firmware/driver fault, which can surely be fixed.

Thanks!
Comment 51 nejrobbins 2021-03-03 14:25:00 UTC
(In reply to Roy from comment #45)
> 
> For me (AMD FX 6300, Asus M5A97-EVO R2, Samsung ) this bug is incredibly
> easy to reproduce: just boot and run for a few hours. Disabling NCQ makes
> the problem entirely go away, disabling NCQ Trim is something I haven't
> tried (yet). The real-world performance penalty appears to be limited, and
> one I'm (begrudgingly) willing to live with for the remaining lifespan of
> this machine. Getting a new SSD is an easy way to bring new life to such
> machines, which is why you're seeing quite a few of us on this report.


Yep, FX 6200 here and can report the same. Issue seems to show up particularly during writes, like when running mkinitcpio.

When I disable NCQ, the error doesn't show up ever again. I have used this SSD with two different power supplies (EVGA 430W W1 and EVGA 650W Supernova G1) and have had the error with both. Have also tried different SATA cables.

Haven't tried disabling NCQ TRIM, but I doubt it will help, as I have the WRITE FPDMA QUEUED error, and not SEND FPDMA QUEUED.

I also tried this SSD briefly on Windows, and I noticed that during benchmarks with the AMD SATA driver, the system would freeze completely, but with the AMD-recommended Microsoft driver, it would not. IIRC the CRC Error Count would still increase, which is still indicative of this problem and that the driver didn't do anything regarding NCQ. 

I'm wondering, what is the real impact of receiving these errors? I guess the drive would lock up a bit, but is the impact of this error worse than the performance impact of disabling NCQ?
Comment 52 Matt Whitlock 2021-03-03 17:19:56 UTC
(In reply to nejrobbins from comment #51)
> CRC Error Count

How do we view this counter under Linux? I haven't seen any reports of this value in the comments on this bug report, and I think it would be very revealing.
Comment 53 Roy 2021-03-03 17:29:08 UTC
(In reply to Matt Whitlock from comment #52)
> (In reply to nejrobbins from comment #51)
> > CRC Error Count
> 
> How do we view this counter under Linux? I haven't seen any reports of this
> value in the comments on this bug report, and I think it would be very
> revealing.

sudo smartctl -a /dev/sdX
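
To pick just the CRC counter out of that output, a minimal sketch. The function name crc_error_count is made up; it assumes the usual smartctl attribute table layout, where SMART attribute ID 199 is the interface CRC error count (the attribute name shown varies by drive) and the raw value is the tenth column.

```sh
# Hypothetical helper: print the raw value of SMART attribute 199
# (interface CRC error count) from smartctl attribute-table output on stdin.
# Typical use: sudo smartctl -A /dev/sdX | crc_error_count
crc_error_count() {
    awk '$1 == 199 { print $10 }'
}
```

Running it before and after a stress test shows whether the counter moved; no output means the drive does not report attribute 199.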
Comment 54 Tejun Heo 2021-03-03 17:47:06 UTC
Hans, given that nobody is likely to take a bus tracer to root-cause an issue specific to a combination of an older controller and some SSDs, there are multiple reports, and that the downsides of disabling ncq trim are pretty minimal, maybe disabling ncq trim on the affected combos isn't such a bad idea at least as a quality-of-life measure?

We had something similar with SanDisk SSDs which would completely lock up under load several years ago. The lockup rate was too high to be practical, especially in large deployments, and disabling NCQ (yeah, whole NCQ) lowered the failure rate enough so that at least the machines stayed up most of the time. So, we quirked it off. Fortunately, we could push SanDisk to root cause the problem and the root cause turned out to be too high max request size, and now the affected drives have NCQ back on with IO size quirk.

The point I'm trying to make is that while it's of course ideal to root cause issues and plug them at the source, we sometimes have to operate with information that's available at the moment and there's nothing wrong with lowering failure rate enough with imperfect workarounds so that things are more bearable for the time being. We just have to balance the pros and cons and make a reasonable decision. Here, provided that disabling NCQ trim on the specific combo makes sufficient difference, I don't think quirking it is unreasonable.

Thanks.
Comment 55 Hans de Goede 2021-03-03 18:57:32 UTC
(In reply to Tejun Heo from comment #54)
> Hans, given that nobody is likely to take a bus tracer to root-cause an
> issue specific to combination of an older controller and some SSDs, there
> are multiple reports, and that the downsides of disabling ncq trim are
> pretty minimal, maybe disabling ncq trim on the affected combos isn't such a
> bad idea at least as a qualify-of-life measure?

Tejun, I completely agree with you. As I stated in my first comment, disabling ncq-trim was my initial plan. But then I took a closer look at all the comments here, and for say 50% of the cases disabling (ncq) trim is not enough, whereas 100% seem to report success with disabling ncq altogether.

But disabling ncq altogether is a big hammer. Too big IMHO since this only happens with AMD + Asmedia controllers. I guess we could introduce a special horkage flag for this and disable NCQ on these devices if the PCI vendor-id == AMD or vendor-id == Asmedia ?   I know you're no longer the drivers/ata maintainer, but your input / insight on this is still very much welcome.
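
As a userspace illustration of the vendor-id check (not kernel code, and not a proposed patch), a sketch that maps the vendor half of a vendor:device pair to a name. The function name sata_vendor_name is made up; the IDs are the standard PCI vendor IDs: 1002 (ATI, now AMD) and 1022 for AMD, 1b21 for ASMedia, 8086 for Intel. It expects lowercase hex as printed by lspci -n.

```sh
# Hypothetical helper: name the vendor of a PCI "vendor:device" pair.
# Only the vendors discussed in this bug are listed; anything else is "other".
sata_vendor_name() {
    case "${1%%:*}" in
        1002|1022) echo "AMD" ;;
        1b21)      echo "ASMedia" ;;
        8086)      echo "Intel" ;;
        *)         echo "other" ;;
    esac
}
```

For example, "sata_vendor_name 1002:4391" prints AMD.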

###

Another consideration is that  the impact of the bug is not entirely clear to me yet. Yes there are errors in the log, but it seems that for most users the system recovers just fine after that. The recovery does take time, but it is also not clear to me what the frequency of the errors is. Some reports talk about 20 times a day ...

So 2 questions for everyone who is seeing this bug:

1. Beside the errors in the logs, what are the other symptoms of this bug which you see on your system(s). Do things get slow / is there data corruption / anything else ?

2. If the problem is merely things getting slow, how slow are we talking about and does this slowdown happen all the time, or only a couple of times per day ?
Comment 56 Tejun Heo 2021-03-03 19:06:40 UTC
(In reply to Hans de Goede from comment #55)
> Tejun, I completely agree with you. As I stated in my first comment
> disabling ncq-trim was my initial plan. But then I took a closer look at all
> the comments here and for say 50% of the cases disabling (ncq) trim is not
> enough. Where as 100% seems to report success with disabling ncq altogether.

I see. Understood.

> But disabling ncq altogether is a big hammer. Too big IMHO since this only
> happens with AMD + Asmedia controllers. I guess we could introduce a special
> horkage flag for this and disable NCQ on these devices if the PCI vendor-id
> == AMD or vendor-id == Asmedia ?   I know you're no longer the drivers/ata
> maintainer, but your input / insight on this is still very much welcome.

Fully agreed, given how wide spread these SSDs are, I think it'd make sense to apply the workaround only on the affected controllers, even for just turning off NCQ trim.

> Another consideration is that  the impact of the bug is not entirely clear
> to me yet. Yes there are errors in the log, but it seems that for most users
> the system recovers just fine after that. The recovery does take time, but
> it is also not clear to me what the frequency of the errors is. Some reports
> talk about 20 times a day ...
> 
> So 2 questions for everyone who is seeing this bug:
> 
> 1. Beside the errors in the logs, what are the other symptoms of this bug
> which you see on your system(s). Do things get slow / is there data
> corruption / anything else ?
> 
> 2. If the problem is merely things getting slow, how slow are we talking
> about and does this slowdown happen all the time, or only a couple of times
> per day ?

Yeah, gathering more data on the symptoms and the effectiveness of workarounds would hopefully shed more light on the direction. Thank you so much for working on this.
Comment 57 Matt Whitlock 2021-03-03 19:38:26 UTC
(In reply to Hans de Goede from comment #55)
> 1. Beside the errors in the logs, what are the other symptoms of this bug
> which you see on your system(s).

Back when I was experiencing intermittent "WRITE FPDMA QUEUED" errors on my Samsung SSD 860 Pro on an "Intel Corporation NM10/ICH7 Family SATA Controller [AHCI mode] (rev 01)" (an issue that disappeared after I recapped my motherboard), a symptom that I was experiencing was that the MD software RAID driver would kick the affected drive out of my mirrored pair after a batch of errors. Maybe that was just due to a software timeout, though I wasn't setting the "failfast" flag. One maybe relevant observation is that I was not able to re-add the drive to the array until after rebooting.

Unfortunately (and fortunately), I am no longer able to reproduce the problem, so I will not be able to collect any more observations or attempt any workarounds. For what it's worth, I do run my file systems with the "discard" mount flag enabled, so I believe I can assert that queued TRIM isn't causing any problems for me.
Comment 58 Matt Whitlock 2021-03-03 19:48:00 UTC
(In reply to Roy from comment #53)
> sudo smartctl -a /dev/sdX

That would show the interface CRC error count from the device's perspective, but how do we view the error count from the host controller's perspective? Is there a stats file for the SATA controller in debugfs or something like that?
Comment 59 Alejandro Donato 2021-03-03 20:15:12 UTC
On 3/3/21 at 16:06, bugzilla-daemon@bugzilla.kernel.org wrote:
> So 2 questions for everyone who is seeing this bug:
>
> 1. Beside the errors in the logs, what are the other symptoms of this bug
> which you see on your system(s). Do things get slow / is there data
> corruption / anything else ?
>
> 2. If the problem is merely things getting slow, how slow are we talking
> about and does this slowdown happen all the time, or only a couple of times
> per day ?

1 - things get very slow and it ends in data corruption

2 - I notice that working on big files (like accessing virtual disk
images) starts to trigger the issue. And, as I said before, it ends with
data corruption.

Thanks!
Comment 60 Solomon Peachy 2021-03-03 21:21:58 UTC
Created attachment 295623 [details]
signature.asc

On Wed, Mar 03, 2021 at 06:57:32PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> 1. Beside the errors in the logs, what are the other symptoms of this bug
> which
> you see on your system(s). Do things get slow / is there data corruption /
> anything else ?

When the SSD was plugged into the motherboard's controller, I would see 
significant slowdowns that occasionally led to data corruption.  I had 
to completely disable NCQ to make the issues go away completely, with 
the resultant performance impact.

When I plugged the SSD into a generic PCIe ASMedia controller, disabling 
queued trim was sufficient to avoid issues -- but leaving queued trim 
enabled was all but guaranteed to cause filesystem corruption, twice to 
the point of the FS going read-only and un-mountable without an 
xfs_repair pass.

> 2. If the problem is merely things getting slow, how slow are we talking
> about
> and does this slowdown happen all the time, or only a couple of times per day
> ?

It easily happened multiple times a day; my supposition was that it 
primarily depended on the nuances of the write load but I was never able 
to narrow it down.

I've since swapped that SSD onto an Intel 8086:8d62 controller, and it 
hasn't so much as hiccupped since, with full NCQ and queued trim.
Comment 61 PJBrs 2021-03-09 08:03:14 UTC
I already replied to the bug report over here - https://bugzilla.kernel.org/show_bug.cgi?id=203475 - since that one isn't specific to AMD SATA controllers.

Hans de Goede, you wrote:
(In reply to Hans de Goede from comment #55)
> But disabling ncq altogether is a big hammer. Too big IMHO since this only
> happens with AMD + Asmedia controllers. I guess we could introduce a special
> horkage flag for this and disable NCQ on these devices if the PCI vendor-id
> == AMD or vendor-id == Asmedia ?   I know you're no longer the drivers/ata
> maintainer, but your input / insight on this is still very much welcome.

I've read this report here as well as the reports in bug 203475, and it seems to me that, while the issue is most severe on AMD controllers, it definitely also shows up on several different intel sata controllers. 

I'm using a 1 TB Samsung 860 EVO SSD (firmware RVT04B6Q) on my ThinkPad T450s. SATA controller info:

00:1f.2 SATA controller: Intel Corporation Wildcat Point-LP SATA Controller [AHCI Mode] (rev 03) (prog-if 01 [AHCI 1.0])
        Subsystem: Lenovo Wildcat Point-LP SATA Controller [AHCI Mode]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin B routed to IRQ 44
        Region 0: I/O ports at 30a8 [size=8]
        Region 1: I/O ports at 30b4 [size=4]
        Region 2: I/O ports at 30a0 [size=8]
        Region 3: I/O ports at 30b0 [size=4]
        Region 4: I/O ports at 3060 [size=32]
        Region 5: Memory at f123c000 (32-bit, non-prefetchable) [size=2K]
        Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
                Address: fee00298  Data: 0000
        Capabilities: [70] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [a8] SATA HBA v1.0 BAR4 Offset=00000004
        Kernel driver in use: ahci

I have two ext4 partitions on this drive mounted with discard, one of which is encrypted.

> So 2 questions for everyone who is seeing this bug:
> 
> 1. Beside the errors in the logs, what are the other symptoms of this bug
> which you see on your system(s). Do things get slow / is there data
> corruption / anything else ?

I noticed this issue first when one to several times each day the machine would freeze almost entirely. Only the mouse cursor seemed to still react. I noticed it especially a couple of minutes after resuming. I work around the issue by reverting https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca6bfcb2f6d9deab3924bf901e73622a94900473

I didn't wait for more severe problems to occur before working around the issue.

> 2. If the problem is merely things getting slow, how slow are we talking
> about and does this slowdown happen all the time, or only a couple of times
> per day ?

As mentioned above, a full freeze except for mouse cursor for a very noticeable length of time (I think somewhere between 10-30 seconds, didn't count).

In closing - I agree that there is no clear-cut problem and therefore no clear-cut solution. But I very strongly want to signal here that it also exists with intel sata controllers. I understand that disabling queued trim alone may not be a sufficient solution in all cases, but from the reports I've read it seems to me that it solves much more than the 50% of the problems that you estimated. Maybe that's because it's easier to disable ncq altogether (and solve the issue) than it is to disable queued trim alone?
Comment 62 Klaus Zipfel 2021-03-21 01:20:38 UTC
I also want to elaborate on this issue.

It seems to look mostly like a storage controller issue and is not limited to trim. Yet, the SSD type in combination with NCQ might also play a role.

My motherboard: Asrock Fatal1ty Z370 Gaming K6 (Intel + additional ASMedia storage controllers)

My system uses BTRFS on LVM on LUKS on a Samsung 870 Evo (Prior: WD Green: WDS240G2G0B). If I recall this correctly, TRIM operations will not be forwarded to the drive when using the default config on this type of setup.

When having the SSD attached to an AS Media storage controller (ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 02)), the error seems to appear amplified: E.g if I clone a 12 GB repo, the system tends to hang completely 

Having the
Comment 63 Klaus Zipfel 2021-03-21 01:42:37 UTC
I also want to elaborate on this issue.

It seems to look mostly like a storage controller issue and is not limited to trim only or the SSD type. Yet, the SSD type in combination with NCQ might also play a role.

My motherboard: Asrock Fatal1ty Z370 Gaming K6 (Intel + additional ASMedia storage controllers)

My system uses BTRFS on LVM on LUKS on a Samsung 870 Evo (Prior: WD Green: WDS240G2G0B) on Kernel 5.11.6-1. 
If I recall this correctly, TRIM operations will not be forwarded to the drive when using the default config on this type of setup. So TRIM might not be the main driver of this issue.

With the SSD attached to an ASMedia storage controller (ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 02)), the error seems to be amplified: e.g. if I clone a 12 GB git repo, the storage I/O seems to hang every other second and can also cause the whole system to freeze.
dmesg also prints a lot of 'BTRFS error (device dm-3): bdev /dev/mapper/system-pool errs: wr 0, rd 0, flush 0, corrupt 108, gen 0' due to wrong checksums '(BTRFS warning (device dm-3): csum failed root 264 ino 115629 off 1016291328 csum 0x1b9828c4 expected csum 0xa30291b3 mirror 1)'

Running a scrub ("scrub started on /dev/mapper/system-pool") confirms the errors in dmesg.

This seemed to get better with the SSD attached to the Intel SATA controller (Intel Corporation 200 Series PCH SATA controller [AHCI mode]).
However, it did not go away for the Samsung 870 EVO, while it seemed to for the WD drive (but I only tried that once!).
Setting the NCQ queue_depth to 1 seems to mitigate the issue (on the Intel controller for the Samsung SSD, but not at all on the ASMedia controller!).
However, cloning the previously mentioned repo again and again to the Samsung 870 EVO on the Intel controller with queue_depth=1 ended in BTRFS checksum errors again on the freshly cloned repo. I always checked this by running a scrub ("scrub started on /dev/mapper/system-pool").

So in my eyes, the problem is not yet pinned down to a single drive or storage controller.
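
For readers who want to repeat the queue_depth experiment described above, here is a minimal sketch in Python. It assumes the standard Linux sysfs attribute `/sys/block/<dev>/device/queue_depth`; the helper names and the `sysfs_root` parameter are made up for illustration (the parameter only exists so the code can be tried against a scratch directory). Writing to the real attribute needs root, and the setting does not survive a reboot.

```python
from pathlib import Path


def queue_depth_path(device: str, sysfs_root: str = "/sys") -> Path:
    """Return the sysfs attribute that holds the queue depth for e.g. 'sda'."""
    return Path(sysfs_root) / "block" / device / "device" / "queue_depth"


def read_queue_depth(device: str, sysfs_root: str = "/sys") -> int:
    """Read the current queue depth (1 means NCQ is effectively off)."""
    return int(queue_depth_path(device, sysfs_root).read_text().strip())


def set_queue_depth(device: str, depth: int, sysfs_root: str = "/sys") -> None:
    """Write a new queue depth; SATA NCQ allows at most 32 outstanding commands."""
    if not 1 <= depth <= 32:
        raise ValueError("queue depth must be between 1 and 32")
    queue_depth_path(device, sysfs_root).write_text(f"{depth}\n")
```

For example, `set_queue_depth("sda", 1)` corresponds to the queue_depth=1 setting tested above, where "sda" stands in for whichever device node the SSD has.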
Comment 64 Hans de Goede 2021-03-21 10:29:52 UTC
@Klaus Zipfel

Thank you for the long comment and all the testing you've done.

Your BTRFS tests showing data-corruption, (re)confirms that this really is a serious issue.

Your tests also show that unfortunately there is no easy fix from the kernel side here.

I'm a bit surprised that you need queue_depth=1 on the Intel controller at all; and that you still see corruption in that scenario.

Is your Samsung drive using the latest firmware? There were some issues with AMD controllers which reportedly are fixed by a firmware update.

Same question for your motherboard BIOS: in the past, BIOS updates have (silently, without any mention in the changelog) resolved SATA issues as well, so make sure you are up to date there too.

Also are you doing any overclocking ? Overclocking can also cause things like this even if the system is otherwise fine, esp. overclocking of the bus/base frequency instead of just bumping the multiplier.

I suspect either a firmware issue with the drive, or perhaps a power-supply issue.

The fact that queue_depth=1 is necessary makes me suspect the PSU, as explained earlier this significantly lowers the amount of power-consumption spikes which the SSD will exhibit.

How old is your PSU? and how is the drive connected to the PSU? Is it possible to connect the drive to another sata-power connector on the PSU ?
Comment 65 Alejandro Donato 2021-03-21 18:15:50 UTC
Another 2 cents from me,

In my tests, I can confirm that power is not the issue (I'm using a 1000W
power supply tested and verified under load). No electrical noise, no
power spikes (checked with an oscilloscope).

I also ran the tests with 3 EVO drives.

So, if someone can validate this, we can discard power issues.

Thanks all!

Comment 66 Roman Elshin 2021-03-21 19:40:58 UTC
>So, if someone can validate this, we can discard power issues.

I haven't checked my 3 relatively new PSUs with an oscilloscope, but other SSDs work fine with them, and I'd assume a power issue could not be solved by using an external PCIe controller in the same system (a Marvell 88SE9215 PCIe x1 card in my case).
Comment 67 Klaus Zipfel 2021-03-22 00:37:45 UTC
(In reply to Hans de Goede from comment #64)
> @Klaus Zipfel
> 
> Thank you for the long comment and all the testing you've done.
> 
> Your BTRFS tests showing data-corruption, (re)confirms that this really is a
> serious issue.
> 
> Your tests also show that unfortunately there is no easy fix from the kernel
> side here.
> 
> I'm a bit surprised that you need queue_depth=1 on the Intel controller at
> all; and that you still see corruption in that scenario.
> 
> Is your samsung drive using the latest firmware? There were some issues with
> AMD controller which reportedly are fixed by a firmware update.
> 
> Same question for your motherboard BIOS, in the past BIOS update have
> (silently without any mentions in the changelog) resolved SATA issues as
> well, so make sure you are up2date there too.
> 
> Also are you doing any overclocking ? Overclocking can also cause things
> like this even if the system is otherwise fine, esp. overclocking of the
> bus/base frequency instead of just bumping the multiplier.
> 
> I suspect either a firmware issue with the drive, or perhaps a power-supply
> issue.
> 
> The fact that queue_depth=1 is necessary makes me suspect the PSU, as
> explained earlier this significantly lowers the amount of power-consumption
> spikes which the SSD will exhibit.
> 
> How old is your PSU? and how is the drive connected to the PSU? Is it
> possible to connect the drive to another sata-power connector on the PSU ?


I have more insights on my end now. Please regard my previous statements about the Intel controller + Samsung 870 SSD (including NCQ = off) **causing the corruption of my FS** as "almost" void. I cannot speak for the ASMedia controller yet, but I will test that hardware combination once I have fixed the issue on my end.


To your questions:

- The SSD is on the latest firmware (the Samsung 870 EVO was released at the beginning of this year; no newer firmware is available)

- My BIOS is on the latest version (though it dates from around mid/end of 2019)

- Yes, I had the system overclocked, but only via the multiplier. The BCLK was 100 MHz with Spread Spectrum turned off. For all tests from now on, I have turned off all overclocking though.

- My PSU (Enermax Platimax D.F. 1200W) is around 2-3 years old and "completely overpowered" for my system (got this from an RMA). However I can not speak for the voltage stability (did not check with an oscilloscope yet)...

- ... but I hooked up the SSD to a second PSU dedicated to this SSD.

Anyway, the errors still seemed to appear, no matter if NCQ was on or off.

A big HOWEVER now: I seem to have found at least the reason for **my** issues: the kernel module I am working on right now seems to cause the problem for me (https://github.com/systemofapwne/mousedriver/issues/3). It may be due to the FPU arithmetic (even though I use kernel_fpu_begin()/kernel_fpu_end() where it matters).

With this kernel module not loaded, the Samsung 870 EVO SSD on the Intel controller with NCQ turned on (queue_depth = 32) reproducibly did not cause BTRFS checksum errors, while with my kernel module loaded it did.
Comment 68 andreas 2021-03-22 10:51:13 UTC
I am experiencing the same bug with Samsung PM883 drives. 2TB and 4TB models.

Device Model:     SAMSUNG MZ7LH3T8HMLT-00005
Firmware Version: HXT7404Q

> On 8. Dec 2020, at 20:19, bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=201693
> 
> --- Comment #24 from Sitsofe Wheeler (sitsofe@yahoo.com) ---
> Can people who are seeing this report which model (e.g. SSD 860 EVO 500GB) 
> firmware (e.g. RVT01B6Q) and PCI card (e.g. AMD SB7x0/SB8x0/SB9x0 ) they
> have?
> In my case smartctl -a <dev> reports that I'm on RVT01B6Q firmware which is
> apparently behind the latest (RVT04B6Q) listed on
> https://www.samsung.com/semiconductor/minisite/ssd/download/tools/ . If folks
> are feeling brave and can take the risk can they report if the issue is still
> reproduced on the latest firmware?
> 
Comment 69 Klaus Zipfel 2021-03-23 23:10:53 UTC
Adding to my previous comment #67: The data corruption error was definitely on my side and unrelated to this issue. I am sorry for taking up your precious time.

I can confirm that the Samsung SSD 870 EVO now seems to work without corrupting my filesystem, not only on the Intel controller but also on the ASMedia one.

Note: Trim is still off and NCQ on.
Comment 70 Alejandro Donato 2021-03-24 23:47:51 UTC
Maybe this info helps.

In my tests, data corruption shows up about 2 out of 10 times under heavy load
(moving and deleting multiple big files), when using the SSD as a cache for
a mechanical drive.

In fact, I first noticed the bad performance while working with this kind of
cache array.

Obviously, the other tests (as a single drive) show the real issue.

Comment 71 Hans de Goede 2021-08-30 15:14:47 UTC
As already mentioned in bug 203475 we have been working towards a solution for this:

"""
So after completely re-reading / analyzing both this bug as well as bug 201693 with a fresh pair of eyes (since the last time I did this was a long time ago) I agree. After careful reading / analysis it seems that there really are 2 different bugs here impacting both the 860 EVO and the 870 EVO:

1. Queued Trim commands are causing issues on Intel + ASmedia + Marvell controllers

2. Things are seriously broken on AMD controllers and only completely disabling NCQ altogether helps there.
"""

A patch implementing 1. has been submitted upstream a week ago here:
https://lore.kernel.org/linux-ide/20210823095220.30157-1-hdegoede@redhat.com/T/#u

And a patch implementing 2. was just submitted upstream:
https://lore.kernel.org/linux-ide/54f63e11-e421-0fa6-80e1-297287dc0974@redhat.com/

Together these should resolve (work around) this issue for most users.
Comment 72 Hans de Goede 2021-09-01 09:42:28 UTC
Hi All,

So there are now some reports in the upstream patch discussions of a user with a 860 PRO on a X570 AMD motherboard who is not seeing any issues.

I replied the following to this: "The problem is that when users are hit by this they end up with a non-functional system and even fs / data corruption. Whereas OTOH disabling NCQ leads to a (significant) performance degradation but affected systems will still work fine.

So I believe that it is best to err on the safe side here and accept the performance degradation as a trade-off for fixing the fs / data corruption."

With that said I would still like to try and make the set of AMD boards on which we disable NCQ in combination with an 860 or 870 drive narrower.

If you have a Samsung 860 or 870 SSD with an AMD motherboard and you need to disable NCQ / set the queue depth to 1 to make it work reliably, can you then please provide a comment (or attachment) with the output of:

lspci -nn

Run on the troublesome AMD motherboard?

The goal is to see if we can build a list of affected AMD SATA controller PCI product IDs, so that the kernel patch which automatically disables NCQ can be made narrower.
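
To pull just the relevant IDs out of a pasted `lspci -nn` listing, something like the following Python sketch may help. It is an illustration only: the regular expression and the function name are assumptions based on the output format seen in this bug, where SATA controllers in AHCI mode carry PCI class `[0106]` and each line ends with `[vendor:device]` plus an optional `(rev xx)`.

```python
import re

# One line of `lspci -nn` output, e.g.
# 00:11.0 SATA controller [0106]: <name> [1002:4391] (rev 40)
LSPCI_NN_LINE = re.compile(
    r"^\s*(?P<slot>\S+)\s+.+?\s+\[(?P<class_id>[0-9a-f]{4})\]:\s+"
    r".+?\s+\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
    r"(?:\s+\(rev [0-9a-f]+\))?\s*$"
)


def sata_controllers(lspci_output: str) -> list:
    """Return (slot, 'vendor:device') for every SATA controller (class 0106)."""
    found = []
    for line in lspci_output.splitlines():
        match = LSPCI_NN_LINE.match(line)
        if match and match.group("class_id") == "0106":
            pci_id = f"{match.group('vendor')}:{match.group('device')}"
            found.append((match.group("slot"), pci_id))
    return found
```

Feeding it the full output of `lspci -nn` returns only the SATA controller entries, which is the part that matters for narrowing the quirk.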
Comment 73 Mike Kazantsev 2021-09-01 10:11:52 UTC
Full "lspci -nn" output on this workstation:

  00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD/ATI] RX780/RX790 Host Bridge [1002:5957]
  00:02.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] RX780/RD790 PCI to PCI bridge (external gfx0 port A) [1002:5978]
  00:0a.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] RD790 PCI to PCI bridge (PCI express gpp port F) [1002:597f]
  00:11.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] [1002:4391] (rev 40)
  00:12.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
  00:12.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
  00:13.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
  00:13.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
  00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller [1002:4385] (rev 42)
  00:14.2 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 Azalia (Intel HDA) [1002:4383] (rev 40)
  00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller [1002:439d] (rev 40)
  00:14.4 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 PCI to PCI Bridge [1002:4384] (rev 40)
  00:14.5 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller [1002:4399]
  00:15.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0) [1002:43a0]
  00:16.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
  00:16.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
  00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 10h Processor HyperTransport Configuration [1022:1200]
  00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 10h Processor Address Map [1022:1201]
  00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 10h Processor DRAM Controller [1022:1202]
  00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 10h Processor Miscellaneous Control [1022:1203]
  00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 10h Processor Link Control [1022:1204]
  01:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 06)
  02:07.0 Ethernet controller [0200]: VIA Technologies, Inc. VT6105/VT6106S [Rhine-III] [1106:3106] (rev 8b)
  03:00.0 USB controller [0c03]: NEC Corporation uPD720200 USB 3.0 Host Controller [1033:0194] (rev 03)
  04:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon RX 550 640SP / RX 560/560X] [1002:67ff] (rev cf)
  04:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1002:aae0]

Connected drive info from smartctl:

  Device Model: Samsung SSD 860 EVO 500GB
  Firmware Version: RVT04B6Q
  ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
  SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)

Tried patching drivers/ata/libata-core.c on this setup first, with:

+  { "Samsung SSD 860*",           NULL,   ATA_HORKAGE_NO_NCQ_TRIM |
+                                          ATA_HORKAGE_ZERO_AFTER_TRIM, },

IIRC I confirmed in dmesg that these were applied, but that didn't help, so I have used "libata.force=2.00:noncq" since then, plus a fallback hack to do it by device ID via sysfs on early boot just in case. That seems to remove the issues, and of course I'm fine with this trade-off, considering the alternative.

I have also seen the same issue on a different (old, but slightly less so) AMD chipset with a similar Samsung drive when dd'ing a Windows install to it via live USB. I might get an lspci off it later today when I'm near it. I think Windows works there without complaining, but maybe also with its own hacks and/or suboptimally.
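
As a rough illustration of how a model pattern such as "Samsung SSD 860*" selects drives, here is a small Python sketch. It is not the kernel code: the table below is a hypothetical miniature holding only the entry quoted above, and Python's `fnmatchcase` is used as a stand-in for the kernel's own glob matching, which is assumed to behave the same for a simple trailing `*`.

```python
from fnmatch import fnmatchcase

# Hypothetical miniature of a model-pattern quirk table, holding only the
# entry quoted above. The flag names are copied from that entry.
QUIRK_TABLE = [
    ("Samsung SSD 860*", {"ATA_HORKAGE_NO_NCQ_TRIM", "ATA_HORKAGE_ZERO_AFTER_TRIM"}),
]


def quirks_for_model(model: str) -> set:
    """Collect the flags of every table entry whose pattern matches the model."""
    flags = set()
    for pattern, quirk_flags in QUIRK_TABLE:
        # Case-sensitive glob match against the model string reported by the drive.
        if fnmatchcase(model, pattern):
            flags |= quirk_flags
    return flags
```

With this table, "Samsung SSD 860 EVO 500GB" picks up both flags, while a drive that reports a different model string (for example the "SAMSUNG MZ7LH3T8HMLT-00005" mentioned in comment 68) matches nothing and would need its own entry.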
Comment 74 nejrobbins 2021-09-01 22:53:28 UTC
I'm not exactly sure how to interpret the different values in the output of lspci -nn, and my SATA controller seems the same as yours, but I'll post my full output just in case. This is a 970 FX board. 

My lspci -nn output:
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD/ATI] RD9x0/RX980 Host Bridge [1002:5a14] (rev 02)
00:02.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] RD890/RD9x0/RX980 PCI to PCI bridge (PCI Express GFX port 0) [1002:5a16]
00:09.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] RD890/RD9x0/RX980 PCI to PCI bridge (PCI Express GPP Port 4) [1002:5a1c]
00:11.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] [1002:4391] (rev 40)
00:12.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
00:12.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
00:13.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
00:13.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller [1002:4385] (rev 42)
00:14.2 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 Azalia (Intel HDA) [1002:4383] (rev 40)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller [1002:439d] (rev 40)
00:14.4 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 PCI to PCI Bridge [1002:4384] (rev 40)
00:14.5 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller [1002:4399]
00:15.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0) [1002:43a0]
00:16.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
00:16.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 0 [1022:1600]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 1 [1022:1601]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 2 [1022:1602]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 3 [1022:1603]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 4 [1022:1604]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 5 [1022:1605]
01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] [1002:67df] (rev ef)
01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590] [1002:aaf0]
02:00.0 USB controller [0c03]: Etron Technology, Inc. EJ188/EJ198 USB 3.0 Host Controller [1b6f:7052]
04:00.0 Network controller [0280]: Qualcomm Atheros AR9287 Wireless Network Adapter (PCI-Express) [168c:002e] (rev 01)

Device Model:     Samsung SSD 860 EVO 250GB
Firmware Version: RVT04B6Q
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Comment 75 Mike Kazantsev 2021-09-02 04:12:18 UTC
"lspci -nn" from the other AMD mobo where I've seen this:

  00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD/ATI] RD9x0/RX980 Host Bridge [1002:5a14] (rev 02)
  00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD/ATI] RD890S/RD990 I/O Memory Management Unit (IOMMU) [1002:5a23]
  00:02.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] RD890/RD9x0/RX980 PCI to PCI bridge (PCI Express GFX port 0) [1002:5a16]
  00:04.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] RD890/RD9x0/RX980 PCI to PCI bridge (PCI Express GPP Port 0) [1002:5a18]
  00:09.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] RD890/RD9x0/RX980 PCI to PCI bridge (PCI Express GPP Port 4) [1002:5a1c]
  00:0a.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] RD890/RD9x0/RX980 PCI to PCI bridge (PCI Express GPP Port 5) [1002:5a1d]
  00:11.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] [1002:4391] (rev 40)
  00:12.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
  00:12.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
  00:13.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
  00:13.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
  00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller [1002:4385] (rev 42)
  00:14.2 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 Azalia (Intel HDA) [1002:4383] (rev 40)
  00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller [1002:439d] (rev 40)
  00:14.4 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 PCI to PCI Bridge [1002:4384] (rev 40)
  00:14.5 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller [1002:4399]
  00:16.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
  00:16.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
  00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 0 [1022:1600]
  00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 1 [1022:1601]
  00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 2 [1022:1602]
  00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 3 [1022:1603]
  00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 4 [1022:1604]
  00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 5 [1022:1605]
  01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] [1002:67df] (rev e7)
  01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590] [1002:aaf0]
  02:00.0 USB controller [0c03]: Etron Technology, Inc. EJ168 USB 3.0 Host Controller [1b6f:7023] (rev 01)
  03:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 06)
  04:00.0 USB controller [0c03]: Etron Technology, Inc. EJ168 USB 3.0 Host Controller [1b6f:7023] (rev 01)

Seems to be exactly the same SSD there:

  Model Family:     Samsung based SSDs
  Device Model:     Samsung SSD 860 EVO 500GB
  Firmware Version: RVT04B6Q
  ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
  SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Comment 76 Alejandro Donato 2021-09-02 15:29:00 UTC
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] RS780 Host Bridge [1022:9600]
00:02.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] RS780 PCI to PCI bridge (ext gfx port 0) [1022:9603]
00:07.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] RS780/RS880 PCI to PCI bridge (PCIE port 3) [1022:9607]
00:09.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] RS780/RS880 PCI to PCI bridge (PCIE port 4) [1022:9608]
00:0a.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] RS780/RS880 PCI to PCI bridge (PCIE port 5) [1022:9609]
00:11.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] [1002:4391]
00:12.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
00:12.1 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0 USB OHCI1 Controller [1002:4398]
00:12.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
00:13.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
00:13.1 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0 USB OHCI1 Controller [1002:4398]
00:13.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller [1002:4385] (rev 3c)
00:14.1 IDE interface [0101]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 IDE Controller [1002:439c]
00:14.2 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 Azalia (Intel HDA) [1002:4383]
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller [1002:439d]
00:14.4 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 PCI to PCI Bridge [1002:4384]
00:14.5 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller [1002:4399]
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 0 [1022:1600]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 1 [1022:1601]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 2 [1022:1602]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 3 [1022:1603]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 4 [1022:1604]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 5 [1022:1605]
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GF119 [GeForce GT 610] [10de:104a] (rev a1)
01:00.1 Audio device [0403]: NVIDIA Corporation GF119 HDMI Audio Controller [10de:0e08] (rev a1)
02:00.0 SATA controller [0106]: ASMedia Technology Inc. ASM1062 Serial ATA Controller [1b21:0612] (rev 01)
03:00.0 USB controller [0c03]: Etron Technology, Inc. EJ188/EJ198 USB 3.0 Host Controller [1b6f:7052]
04:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 06)

Thx for your work on this!

On 2/9/21 at 01:12, bugzilla-daemon@bugzilla.kernel.org wrote:
> lspci -nn
Comment 77 Krzysztof Oledzki 2021-09-02 16:02:27 UTC
Dell Optiplex 580 w/AMD 785G+SB710 is also impacted by this issue.

What seems to be in common is 1002:4391, but there are several more board_ahci_sb700 (and also sb600) devices in linux/drivers/ata/ahci.c

        { PCI_VDEVICE(ATI, 0x4380), board_ahci_sb600 }, /* ATI SB600 */
        { PCI_VDEVICE(ATI, 0x4390), board_ahci_sb700 }, /* ATI SB700/800 */
        { PCI_VDEVICE(ATI, 0x4391), board_ahci_sb700 }, /* ATI SB700/800 */
        { PCI_VDEVICE(ATI, 0x4392), board_ahci_sb700 }, /* ATI SB700/800 */
        { PCI_VDEVICE(ATI, 0x4393), board_ahci_sb700 }, /* ATI SB700/800 */
        { PCI_VDEVICE(ATI, 0x4394), board_ahci_sb700 }, /* ATI SB700/800 */
        { PCI_VDEVICE(ATI, 0x4395), board_ahci_sb700 }, /* ATI SB700/800 */

I wonder if the problem is really AMD or "ATI AMD".

00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] RS880 Host Bridge [1022:9601]
00:02.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] RS780 PCI to PCI bridge (ext gfx port 0) [1022:9603]
00:04.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] RS780/RS880 PCI to PCI bridge (PCIE port 0) [1022:9604]
00:09.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] RS780/RS880 PCI to PCI bridge (PCIE port 4) [1022:9608]
00:11.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] [1002:4391]
00:12.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
00:12.1 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0 USB OHCI1 Controller [1002:4398]
00:12.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
00:13.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
00:13.1 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0 USB OHCI1 Controller [1002:4398]
00:13.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller [1002:4385] (rev 3c)
00:14.1 IDE interface [0101]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 IDE Controller [1002:439c]
00:14.2 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 Azalia (Intel HDA) [1002:4383]
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller [1002:439d]
00:14.4 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 PCI to PCI Bridge [1002:4384]
00:14.5 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller [1002:4399]
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 10h Processor HyperTransport Configuration [1022:1200]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 10h Processor Address Map [1022:1201]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 10h Processor DRAM Controller [1022:1202]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 10h Processor Miscellaneous Control [1022:1203]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 10h Processor Link Control [1022:1204]
01:00.0 Ethernet controller [0200]: Mellanox Technologies MT27500 Family [ConnectX-3] [15b3:1003]
02:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] RV620 LE [Radeon HD 3450] [1002:95c5]
03:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5761 Gigabit Ethernet PCIe [14e4:1681] (rev 10)
Comment 78 Hans de Goede 2021-09-02 16:09:47 UTC
(In reply to Krzysztof Oledzki from comment #77)
> Dell Optiplex 580 w/AMD 785G+SB710 is also impacted by this issue.
> 
> What seems to be in common is 1002:4391, but there are several more
> board_ahci_sb700 (and also sb600) devices in linux/drivers/ata/ahci.c
> 
>         { PCI_VDEVICE(ATI, 0x4380), board_ahci_sb600 }, /* ATI SB600 */
>         { PCI_VDEVICE(ATI, 0x4390), board_ahci_sb700 }, /* ATI SB700/800 */
>         { PCI_VDEVICE(ATI, 0x4391), board_ahci_sb700 }, /* ATI SB700/800 */
>         { PCI_VDEVICE(ATI, 0x4392), board_ahci_sb700 }, /* ATI SB700/800 */
>         { PCI_VDEVICE(ATI, 0x4393), board_ahci_sb700 }, /* ATI SB700/800 */
>         { PCI_VDEVICE(ATI, 0x4394), board_ahci_sb700 }, /* ATI SB700/800 */
>         { PCI_VDEVICE(ATI, 0x4395), board_ahci_sb700 }, /* ATI SB700/800 */
> 
> I wonder if the problem is really AMD or "ATI AMD".

I agree, it seems like we need to change the kernel patch to automatically disable NCQ on Samsung 860 and 870 drives when the SATA controller's vendor-id == 0x1002.

Is anyone seeing the issue where NCQ needs to be completely disabled / queue-depth needs to be set to 1 on a motherboard where "lspci -nn" shows 1022 as the vendor-id for the SATA controller?
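
For anyone who wants to check this quickly, here is a minimal Python sketch (not part of any kernel patch) that pulls the vendor and device IDs of the SATA controllers out of "lspci -nn" output. The names sata_controller_ids and is_ati_vendor are made up for illustration, and the sketch assumes the controller is listed under PCI class 0106 with the usual "[vvvv:dddd]" suffix.

```python
import re

# Matches the "[vendor:device]" pair that "lspci -nn" appends, e.g. "[1002:4391]".
PCI_ID_RE = re.compile(r"\[([0-9a-fA-F]{4}):([0-9a-fA-F]{4})\]")

ATI_VENDOR_ID = 0x1002  # listed by lspci as "[AMD/ATI]"
AMD_VENDOR_ID = 0x1022  # listed by lspci as "[AMD]"


def sata_controller_ids(lspci_nn_output):
    """Return a list of (vendor, device) ints, one per SATA controller line."""
    ids = []
    for line in lspci_nn_output.splitlines():
        # PCI class 0106 is "SATA controller"; skip every other device.
        if "SATA controller [0106]" not in line:
            continue
        match = PCI_ID_RE.search(line)
        if match:
            ids.append((int(match.group(1), 16), int(match.group(2), 16)))
    return ids


def is_ati_vendor(vendor):
    """True for the older ATI vendor-id (0x1002), False for anything else."""
    return vendor == ATI_VENDOR_ID
```

Feeding it the saved output of "lspci -nn" and checking is_ati_vendor() on each returned vendor value should be enough to answer the question above.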
Comment 79 Matt Whitlock 2021-09-02 17:17:42 UTC
I've been running a Samsung SSD 860 PRO 512GB on an Intel NM10/ICH7 SATA controller for over two years now with zero problems.

I've also been running a Samsung SSD 860 EVO 2TB on the same controller for the past 10 months and have had no problems with it either.

The PRO has partitions that are members of RAID1 mdraid volumes, whose contained file systems are mounted with "-o discard". The EVO has partitions that are members of the same mdraid volumes and also a partition that is a member of a RAID1 mdraid volume that contains a LUKS volume that has the "allow-discards" flag enabled and whose contained file system is mounted with "-o discard".

I'm currently running Linux version 5.10.52-gentoo. The only blacklist entry in libata-core.c that matches my Samsung SSDs sets ATA_HORKAGE_ZERO_AFTER_TRIM (which is actually a good thing[1], not really a horkage), so I assume queued TRIM is enabled on both.

I am somewhat sad to see the baby thrown out with the bath water in this latest round of patches. I am fortunate enough to have found this bug report and to be paying attention so I can apply a reverse patch to avoid taking a performance hit going forward. Others will not be so lucky.

__________

[1] https://patchwork.ozlabs.org/project/linux-ide/patch/1420727311-7066-1-git-send-email-martin.petersen@oracle.com/
Comment 80 Hans de Goede 2021-09-02 20:14:12 UTC
(In reply to Matt Whitlock from comment #79)
> I am somewhat sad to see the baby thrown out with the bath water in this
> latest round of patches. I am fortunate enough to have found this bug
> report and to be paying attention so I can apply a reverse patch to avoid
> taking a performance hit going forward. Others will not be so lucky.

In the case of an Intel SATA controller, we will only be disabling queued trim commands while otherwise leaving NCQ fully enabled. The chances of you actually noticing any performance difference from this are pretty small.

One of the reasons this bug has been open so long is to avoid causing performance regressions on unaffected systems.
Comment 81 Gregory P. Smith 2021-09-02 20:16:35 UTC
My 1T Samsung 870 is connected to the Marvell controller below.

This is an HPE MicroServer Gen10. All four drive bays use that controller. This is a fairly popular, common, and affordable home/office server machine.

$ lspci -nn
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Root Complex [1022:1576]
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) I/O Memory Management Unit [1022:1577]
00:01.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] [1002:9874] (rev 84)
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Host Bridge [1022:157b]
00:02.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Root Port [1022:157c]
00:02.5 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Root Port [1022:157c]
00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Host Bridge [1022:157b]
00:08.0 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Carrizo Platform Security Processor [1022:1578]
00:09.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Carrizo Audio Dummy Host Bridge [1022:157d]
00:10.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] FCH USB XHCI Controller [1022:7914] (rev 20)
00:11.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 49)
00:12.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller [1022:7908] (rev 49)
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 4a)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 11)
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 0 [1022:1570]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 1 [1022:1571]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 2 [1022:1572]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 3 [1022:1573]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 4 [1022:1574]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 5 [1022:1575]
01:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9230 PCIe SATA 6Gb/s Controller [1b4b:9230] (rev 11)
02:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe [14e4:165f]
02:00.1 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe [14e4:165f]
Comment 82 Gregory P. Smith 2021-09-02 20:18:49 UTC
Booting with libata.force=4:noncqtrim has made the setup I just posted reliable for me.
Comment 83 Hans de Goede 2021-09-02 20:27:02 UTC
(In reply to Gregory P. Smith from comment #81)
> My 1T Samsung 870 is connected to the Marvell controller below.

Thank you for the lspci output.

If I'm reading your comment 41 correctly, then just disabling queued trim (the "noncqtrim" option) is enough to make things work in that setup, correct?

This matches all the other reports where the "noncqtrim" option is sufficient to make things work normally, except on some AMD/ATI SATA controllers.

There is already a patch pending upstream to make noncqtrim the default on all Samsung 860 and 870 SSDs, regardless of the controller used:
https://lore.kernel.org/linux-ide/20210823095220.30157-1-hdegoede@redhat.com/T/#u

The reason I was asking for lspci output is that for some users with AMD/ATI SATA controllers the "noncqtrim" option is not enough to get things stable. They need "noncq", which is a much bigger hammer, so the plan is to limit that to only certain SATA controllers (or certain SATA controller vendor-ids).
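
To make the plan concrete, here is a rough Python sketch of the intended policy. It is not the kernel code: the function name planned_restrictions and the model patterns are hypothetical, and it assumes the vendor-id check ends up being the 0x1002 one proposed in comment 78.

```python
from fnmatch import fnmatchcase

ATI_VENDOR_ID = 0x1002  # assumption: the check proposed in comment 78

# Hypothetical model patterns, chosen for illustration only.
AFFECTED_MODELS = ("Samsung SSD 860*", "Samsung SSD 870*")


def planned_restrictions(model, controller_vendor):
    """Return the set of restrictions the plan above would apply to a drive."""
    restrictions = set()
    if any(fnmatchcase(model, pattern) for pattern in AFFECTED_MODELS):
        # Queued TRIM is turned off regardless of the controller.
        restrictions.add("noncqtrim")
        if controller_vendor == ATI_VENDOR_ID:
            # The bigger hammer: no NCQ at all, but only behind ATI controllers.
            restrictions.add("noncq")
    return restrictions
```

With this sketch, an 860 EVO behind a 1022 controller would only lose queued trim, while the same drive behind a 1002 controller would lose NCQ entirely.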
Comment 84 Hans de Goede 2021-09-02 20:27:45 UTC
(In reply to Gregory P. Smith from comment #82)
> Booting with libata.force=4:noncqtrim has made the setup I just posted
> reliable for me.

Ah, looks like our comments crossed, thanks for confirming that.
Comment 85 Hans de Goede 2021-09-03 20:54:18 UTC
The patches for both this bug (using the ATI 0x1002 vendor-id for the check) and bug 203475 have been merged into:

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/?h=for-next

So they are on their way to Linus; closing.
Comment 86 Boann 2021-09-06 05:48:15 UTC
Just to add a confusing data point, my Samsung 860 EVO and AMD SATA controller work perfectly together.

SSD info, from smartctl:

Device Model:     Samsung SSD 860 EVO 1TB
Firmware Version: RVT04B6Q
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)

NCQ is definitely enabled, according to the dmesg log:

[    2.897566] ata4.00: ATA-11: Samsung SSD 860 EVO 1TB, RVT04B6Q, max UDMA/133
[    2.897568] ata4.00: 1953525168 sectors, multi 1: LBA48 NCQ (depth 32), AA
[    2.899995] ata4.00: supports DRM functions and may not be fully accessible
[    2.902825] ata4.00: configured for UDMA/133

Kernel version, according to uname:

Linux 4.19.0-17-amd64 #1 SMP Debian 4.19.194-3 (2021-07-18) x86_64 GNU/Linux

SATA controller info, according to `lspci -v -nn`:

15:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller [1022:43c8] (rev 01) (prog-if 01 [AHCI 1.0])
        Subsystem: ASMedia Technology Inc. 400 Series Chipset SATA Controller [1b21:1062]
        Flags: bus master, fast devsel, latency 0, IRQ 40
        Memory at fce80000 (32-bit, non-prefetchable) [size=128K]
        Expansion ROM at fce00000 [disabled] [size=512K]
        Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [78] Power Management version 3
        Capabilities: [80] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Kernel driver in use: ahci
        Kernel modules: ahci

39:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 61) (prog-if 01 [AHCI 1.0])
        Subsystem: Micro-Star International Co., Ltd. [MSI] FCH SATA Controller [AHCI mode] [1462:7b79]
        Flags: bus master, fast devsel, latency 0, IRQ 44
        Memory at fcf00000 (32-bit, non-prefetchable) [size=2K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [64] Express Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable+ Count=1/2 Maskable- 64bit+
        Capabilities: [d0] SATA HBA v1.0
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [270] #19
        Kernel driver in use: ahci
        Kernel modules: ahci

---

I've been using this SSD for nine months daily without ever seeing this error or any I/O issues.

I don't know if I've ever used "queued trim". I know periodic trim with fstrim is enabled and runs weekly without hiccups.

I ran `zgrep FPDMA /var/log/*` to see if there was anything logged there (kernel logs going back 3 weeks), and there is not a single line reported.

I also occasionally sync my system drive to an external backup HD and run checksums over all files to compare, so I can detect if a single bit flips, and with this SSD, it never has.

Sorry if this seems selfish when so many people are struggling with this mysterious, alarming, infuriating bug. But my own issue is rather the opposite: when my distro's kernel receives this patch to disable NCQ, will there be an easy way to override it and re-enable NCQ? I know my current system configuration is fine.
Comment 87 Krzysztof Oledzki 2021-09-06 06:17:58 UTC
No confusion: we established that the problem is limited to "ATI AMD" AHCI controllers (vendor-id 0x1002), not "Modern AMD" ones (0x1022). You seem to be using 1022:43c8 / 1022:7901, so nothing should change for you.

However, we are still disabling NCQ TRIM for Samsung SSD 840/850/860/870 on all controllers, but this is expected to have negligible performance impact.

See also:
 https://bugzilla.kernel.org/show_bug.cgi?id=203475#c49
 https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=libata-5.15&id=7a8526a5cd51cf5f070310c6c37dd7293334ac49
 https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=libata-5.15&id=8a6430ab9c9c87cb64c512e505e8690bbaee190b

BTW: we now also have an "ncqati" flag that allows re-enabling NCQ.
Comment 88 Boann 2021-09-06 13:57:24 UTC
(In reply to Krzysztof Oledzki from comment #87)
> No confusion, we established that the problem is limited to "ATI AMD" AHCI
> controllers - 0x1002, not "Modern AMD" - 0x1022.
>
> BTW: we now also have an "ncqati" flag that allows re-enabling NCQ.

Oh, well, then that's excellent. Thank you sirs, for your thoughtful and careful handling of this bug.
Comment 89 Andrew Filippov 2021-09-06 20:39:20 UTC
(In reply to Krzysztof Oledzki from comment #87)
> No confusion, we established that the problem is limited to "ATI AMD" AHCI
> controllers - 0x1002, not "Modern AMD" - 0x1022. You seem to be using
> 1022:43c8 / 1022:7901 so nothing should change for you.
> 
> However, we are still disabling NCQ TRIM for Samsung SSD 840/850/860/870 on
> all controllers, but this is expected to have negligible performance impact.
> 
> See also:
>  https://bugzilla.kernel.org/show_bug.cgi?id=203475#c49
>  https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=libata-5.15&id=7a8526a5cd51cf5f070310c6c37dd7293334ac49
>  https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=libata-5.15&id=8a6430ab9c9c87cb64c512e505e8690bbaee190b
> 
> BTW: we now also have an "ncqati" flag that allows re-enabling NCQ.

Thanks for the information.

What options need to be passed to the 5.15+ kernel via "libata.force=" for ATA TRIM to work fully as before on the Samsung 860/870 EVO?
