Created attachment 279441 [details]
Hi, I've recently purchased a 2TB Samsung 860 EVO SSD to replace an existing 850 EVO. There have been no other hardware changes to the affected system and I cloned my existing Linux installation (LUKS/ext4) to the new SSD.
I had no issues with the 850 EVO (I purchased it after the queued trim issue was mitigated), but I have immediately had problems with the 860 EVO. If the filesystem is trimmed (manually or automatically), dmesg immediately reports several "WRITE FPDMA QUEUED" errors before hard resetting the link. The only time these errors haven't occurred is when the amount of space trimmed (as reported by fstrim) is less than 15-20GB.
If I disable NCQ for the SSD, I can trim the drive without issue. I've also tried using "libata.force=noncqtrim" but this did not change the situation.
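For anyone wanting to try this via the bootloader: `libata.force` also accepts a per-port/per-device prefix, so NCQ can be disabled for just the problem drive instead of system-wide. A minimal sketch of the GRUB config change; the `2.00` device ID below is an assumption, check dmesg for the ata ID your 860 EVO actually gets:

```shell
# /etc/default/grub  (then regenerate with update-grub or grub2-mkconfig)

# Global: disable NCQ on all libata-attached devices
GRUB_CMDLINE_LINUX_DEFAULT="quiet libata.force=noncq"

# Per-device alternative: disable NCQ only for ata2.00
# (hypothetical ID -- match it to your dmesg output)
#GRUB_CMDLINE_LINUX_DEFAULT="quiet libata.force=2.00:noncq"
```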
Created attachment 279443 [details]
lspci -vvv output
Created attachment 279445 [details]
smartctl status for the 860 EVO
Created attachment 279447 [details]
hdparm -I output
Samsung's support forum has several users experiencing similar issues as far back as June: https://eu.community.samsung.com/t5/Cameras-IT-Everything-Else/860-EVO-250GB-causing-freezes-on-AMD-system/td-p/575813 . It seems there may be a compatibility issue between some AMD SATA controllers and the 860 series. I'll try to test my SSD with an Intel SATA controller.
After disabling discard/trim on the SSD's ext4 partition, I was able to consistently reproduce the freezing behaviour (previously triggered by trimming the disk) by copying a large (>50GB) file to the SSD. The freezing was not immediate; it occurred 2-3 minutes into the transfer.
If I disable NCQ, both trim and heavy data transfers work reliably (as far as I can tell).
I can confirm that I face the same issue with a 500 GB 860 EVO just by dd'ing from another disk to it (i.e. a full-speed streaming write).
You can also see the UDMA CRC Error Count increasing in SMART.
There have been no issues with more than a dozen SSDs from other vendors that I've tried on this controller so far; only this one.
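For anyone wanting to watch that SMART attribute: on real hardware you would run `sudo smartctl -A /dev/sdX` and look at the raw value (the last field) of the UDMA_CRC_Error_Count line. A minimal sketch of extracting it; the sample line below is hypothetical, for illustration only:

```shell
# On real hardware: sudo smartctl -A /dev/sdX | grep UDMA_CRC
# Hypothetical sample of the relevant smartctl -A output line:
sample='199 UDMA_CRC_Error_Count  0x003e  099  099  000  Old_age  Always  -  42'

# The raw error count is the last whitespace-separated field.
crc_count=$(echo "$sample" | awk '{print $NF}')
echo "UDMA CRC errors: $crc_count"
```

If that number keeps climbing during heavy writes, the link itself is dropping commands, which matches the FPDMA errors in dmesg.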
Eventually the Samsung steps down to a 1.5Gbps SATA link, and from then on it works fine. Disabling NCQ does indeed help, but it hobbles random I/O performance immensely. As far as I know there is no solution other than, hopefully, a firmware fix from Samsung. Until then, to prevent data loss, NCQ should unfortunately be disabled.
Seeing this on my system too, currently running a Fedora 4.19.2 kernel.
Model=Samsung SSD 860 EVO 1TB
MSI 970A-G46 motherboard, which has an AMD970+SB950 chipset.
I can provide more details if necessary
Samsung has not released any firmware updates for this device, and by all accounts they do not intend to, despite this problem affecting Windows systems as well.
For those affected by this issue, does downgrading to kernel 4.18.19 relieve the symptoms? I started seeing similar "FPDMA QUEUED" errors during heavy I/O to my Samsung SSD 860 Pro after I upgraded to the 4.19 kernel series. After downgrading to 4.18.19, my symptoms disappeared.
I am currently in the process of bisecting the kernel sources to find the commit that introduced the regression. I've been at it since mid-December, as it takes several days to gain confidence that a given commit is "good." (Usually it takes 2-3 days of uptime before I discover that a given commit is "bad.")
If downgrading to 4.18.19 does not resolve the issue for others in this report, then I am experiencing a different issue.
The forum thread referenced by comment 4 includes logs taken from a 4.15 kernel.
By all accounts this does not appear to be a "regression" introduced by a recent Linux kernel; instead it appears to be due to an incompatibility between AMD SATA controllers and the 860 EVO device firmware.
(And it's also broken/flaky on Windows, so it's probably not Linux's fault.)
I think I'm seeing this issue too: a Samsung 860 SSD triggers "WRITE FPDMA QUEUED" errors in the kernel log/dmesg under heavy I/O, causing terrible performance and unreliable behaviour unless NCQ is disabled. The SATA controller seeing the issue, in an HP MicroServer N36L, is an AMD SB7x0/SB8x0/SB9x0. The issue happens on both 4.15.0-43-generic and 4.18.0-13-generic kernels, and in my case I used a simple fio job to make it occur:
fio --name=test --readonly --rw=randread --filename=/dev/sdb --bs=32k \
    --ioengine=libaio --iodepth=32 --direct=1 --runtime=10m --time_based=1
Here are some more links that may (or may not) be the same thing:
This may also be linked to https://bugzilla.redhat.com/show_bug.cgi?id=1729678 and https://bugzilla.kernel.org/show_bug.cgi?id=203475 .
Same issue with Samsung SSD 860 EVO 1TB but with (newer?) RVT03B6Q firmware and the following controller:
Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] (rev 40) (prog-if 01 [AHCI 1.0])
Subsystem: ASRock Incorporation SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]
Disabling NCQ completely seems to solve the issue (using libata.force=noncq).
Does anyone know if setting /sys/block/<device>/device/queue_depth to 1 has the same effect of solving this issue? (Or is it something else about enabling NCQ altogether that causes the issue?)
> Does anyone know if setting /sys/block/<device>/device/queue_depth to 1 has
> the same effect of solving this issue?
Yes, it absolutely should. You don't have to disable NCQ for the entire system to solve this. It can only be a problem if you use this as your boot drive and don't find a way to set queue_depth=1 early enough during boot; in that case you can still hit the NCQ issue during the bootup process, before the setting is applied.
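One way to get queue_depth set early is a udev rule, which fires as soon as the device node appears. A sketch; the filename and the model match string are assumptions, check `cat /sys/block/sdX/device/model` for the exact value on your system:

```shell
# /etc/udev/rules.d/90-samsung-860-noncq.rules  (hypothetical filename)
# Set queue_depth to 1 for any disk whose model matches "Samsung SSD 860*"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTRS{model}=="Samsung SSD 860*", ATTR{device/queue_depth}="1"
```

Note that this still doesn't cover I/O issued before udev processes the rule in the initramfs, so for a boot drive the libata.force=noncq kernel parameter remains the more robust option.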