Bug 201693 - Samsung 860 EVO NCQ Issue with AMD SATA Controller
Summary: Samsung 860 EVO NCQ Issue with AMD SATA Controller
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
Hardware: All Linux
Importance: P1 normal
Assignee: Tejun Heo
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-11-15 00:05 UTC by Ryley Angus
Modified: 2019-11-30 12:13 UTC
CC List: 10 users

See Also:
Kernel Version: 4.19.1
Tree: Mainline
Regression: No


Attachments
dmesg output (93.57 KB, text/plain)
2018-11-15 00:05 UTC, Ryley Angus
lspci -vvv output (33.17 KB, text/plain)
2018-11-15 00:06 UTC, Ryley Angus
smartctl status for the 860 EVO (4.16 KB, text/plain)
2018-11-15 00:07 UTC, Ryley Angus
hdparm -I output (3.39 KB, text/plain)
2018-11-15 00:17 UTC, Ryley Angus

Description Ryley Angus 2018-11-15 00:05:57 UTC
Created attachment 279441 [details]
dmesg output

Hi, I've recently purchased a 2TB Samsung 860 EVO SSD to replace an existing 850 EVO. There have been no other hardware changes to the affected system and I cloned my existing Linux installation (LUKS/ext4) to the new SSD.

I had no issues with the 850 EVO (I purchased it after the queued trim issue was mitigated), but I have immediately had problems with the 860 EVO. If the filesystem is trimmed (manually or automatically), dmesg immediately reports several "WRITE FPDMA QUEUED" errors before hard resetting the link. The only time these errors haven't occurred is when the amount of space trimmed (as reported by fstrim) is less than 15-20GB.

If I disable NCQ for the SSD, I can trim the drive without issue. I've also tried using "libata.force=noncqtrim" but this did not change the situation.
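
To compare runs with and without NCQ, it can help to count the relevant kernel messages rather than eyeball them. The following is a minimal sketch, not part of the original report: the function name count_ncq_errors is hypothetical, and it assumes the kernel log contains the literal strings "WRITE FPDMA QUEUED" and "hard resetting link" for the failures described above.

```shell
#!/bin/sh
# Count NCQ write failures and link resets in a kernel log read from stdin.
# Assumption: the log uses the strings "WRITE FPDMA QUEUED" and
# "hard resetting link" for the errors described in this report.
count_ncq_errors() {
    awk '
        /WRITE FPDMA QUEUED/  { writes++ }
        /hard resetting link/ { resets++ }
        END { printf "write_fpdma_queued=%d hard_resets=%d\n", writes, resets }
    '
}

# Example usage (reading the kernel log may require root):
#   dmesg | count_ncq_errors
```

Running it once after a trim with NCQ enabled and once with NCQ disabled gives two numbers that can be compared directly.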
Comment 1 Ryley Angus 2018-11-15 00:06:54 UTC
Created attachment 279443 [details]
lspci -vvv output
Comment 2 Ryley Angus 2018-11-15 00:07:24 UTC
Created attachment 279445 [details]
smartctl status for the 860 EVO
Comment 3 Ryley Angus 2018-11-15 00:17:48 UTC
Created attachment 279447 [details]
hdparm -I output
Comment 4 Ryley Angus 2018-11-15 00:50:36 UTC
Samsung's support forum has several users experiencing similar issues as far back as June: https://eu.community.samsung.com/t5/Cameras-IT-Everything-Else/860-EVO-250GB-causing-freezes-on-AMD-system/td-p/575813 . It seems there may be a compatibility issue between some AMD SATA controllers and the 860 series. I'll try to test my SSD with an Intel SATA controller.
Comment 5 Ryley Angus 2018-11-15 22:26:24 UTC
After disabling discard/trim on the SSD's ext4 partition, I was able to consistently reproduce the freezing behaviour, previously triggered by trimming the disk, by copying a large (>50GB) file to the SSD. The freezing was not immediate; it occurred 2-3 minutes into the transfer.

If I disable NCQ, both trim and heavy data transfers work reliably (as far as I can tell).
Comment 6 Roman Mamedov 2018-11-28 18:03:07 UTC
I can confirm that I see the same issue with a 500 GB 860 EVO when simply dd'ing from another disk to it (i.e. a full-speed streaming write).

Also you can see UDMA CRC Error Count increasing in SMART.
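
One way to watch that counter is to pull the raw value of SMART attribute 199 out of the smartctl attribute table. This is a rough sketch under assumptions: the function name crc_error_count is hypothetical, and it assumes the usual ten-column `smartctl -A` table in which the first column is the attribute ID and the tenth is the raw value.

```shell
#!/bin/sh
# Print the raw value of SMART attribute 199 (the UDMA CRC error counter)
# from `smartctl -A` output supplied on stdin.
# Assumption: standard attribute table layout, ID in column 1 and
# RAW_VALUE in column 10.
crc_error_count() {
    awk '$1 == 199 { print $10 }'
}

# Example usage (replace /dev/sdX with the affected drive; needs root):
#   smartctl -A /dev/sdX | crc_error_count
```

Matching on the numeric ID rather than the attribute name avoids depending on how a given drive database labels the attribute.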

There were no issues with the more than a dozen SSDs from other vendors that I have tried on this controller so far; only this one is affected.

Eventually the Samsung steps down to a 1.5Gbps SATA link, and from then on it works fine. Disabling NCQ does indeed help, but it hobbles random I/O performance immensely. As far as I know there is no solution other than, hopefully, a firmware fix from Samsung. Until then, unfortunately, NCQ should be disabled to prevent data loss.
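
To see whether a link has already stepped down, the negotiated speed can be read from sysfs. The sketch below is illustrative only: the function name show_sata_link_speeds is hypothetical, the optional root argument exists purely so the function can be exercised against a fake directory tree, and it assumes the kernel exposes a sata_spd attribute under /sys/class/ata_link/.

```shell
#!/bin/sh
# Print the negotiated SATA speed of every ATA link known to libata.
# $1: sysfs root (defaults to /sys; overridable only to allow testing).
# Assumption: each link directory under class/ata_link has a sata_spd file.
show_sata_link_speeds() {
    root=${1:-/sys}
    for f in "$root"/class/ata_link/*/sata_spd; do
        [ -r "$f" ] || continue
        link=$(basename "$(dirname "$f")")
        printf '%s: %s\n' "$link" "$(cat "$f")"
    done
}

# Example usage:
#   show_sata_link_speeds
```

A link that reports 1.5 Gbps on a port and drive that normally negotiate a higher speed would be consistent with the step-down described above.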
Comment 7 Solomon Peachy 2018-12-01 18:20:30 UTC
Seeing this on my system too, currently running a Fedora 4.19.2 kernel.

Model=Samsung SSD 860 EVO 1TB
FwRev=RVT01B6Q

MSI 970A-G46 motherboard, which has an AMD970+SB950 chipset.

I can provide more details if necessary.

Samsung has not released any firmware updates for this device, and by all accounts they do not intend to, despite this problem affecting Windows systems as well.
Comment 8 Matt Whitlock 2019-01-26 03:40:24 UTC
For those affected by this issue, does downgrading to kernel 4.18.19 relieve the symptoms? I started seeing similar "FPDMA QUEUED" errors during heavy I/O to my Samsung SSD 860 Pro after I upgraded to the 4.19 kernel series. Downgrading to 4.18.19, my symptoms disappeared.

I am currently in the process of bisecting the kernel sources to find the commit that introduced the regression. I've been at it since mid-December, as it takes several days to gain confidence that a given commit is "good." (Usually it takes 2-3 days of uptime before I discover that a given commit is "bad.")

If downgrading to 4.18.19 does not resolve the issue for others in this report, then I am experiencing a different issue.
Comment 9 Solomon Peachy 2019-01-26 13:01:28 UTC
The forum thread referenced by comment 4 includes logs taken from a 4.15 kernel.

By all accounts this does not appear to be a "regression" introduced by a recent Linux kernel; instead it appears to be due to an incompatibility between AMD SATA controllers and the 860 EVO device firmware.

(And it's broken/flaky on Windows too. So, probably not Linux's fault.)
Comment 10 Sitsofe Wheeler 2019-08-03 14:48:52 UTC
I think I'm seeing this issue too: a Samsung 860 SSD triggers "WRITE FPDMA QUEUED" errors in the kernel log (dmesg) under heavy I/O, causing terrible performance and unreliability unless NCQ is disabled. The affected machine is an HP MicroServer N36L whose SATA controller is an AMD SB7x0/SB8x0/SB9x0. The issue happens on both the 4.15.0-43-generic and 4.18.0-13-generic kernels. In my case a simple fio job was enough to trigger it:

fio --name=test --readonly --rw=randread --filename /dev/sdb --bs=32k \
    --ioengine=libaio --iodepth=32 --direct=1 --runtime=10m --time_based=1


Here are some more links that may (or may not) be the same thing:

https://github.com/zfsonlinux/zfs/issues/4873#issuecomment-449886669
https://eu.community.samsung.com/t5/Cameras-IT-Everything-Else/860-EVO-250GB-causing-freezes-on-AMD-system/td-p/575813
https://marc.info/?l=linux-block&m=154644276512949&w=2

This may also be linked to https://bugzilla.redhat.com/show_bug.cgi?id=1729678 and https://bugzilla.kernel.org/show_bug.cgi?id=203475 .

(CC'ing Jens)
Comment 11 Daniel Kenzelmann 2019-11-30 11:59:27 UTC
Same issue with Samsung SSD 860 EVO 1TB but with (newer?) RVT03B6Q firmware and the following controller:
Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] (rev 40) (prog-if 01 [AHCI 1.0])
	Subsystem: ASRock Incorporation SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]
	
Disabling NCQ completely seems to solve the issue (using libata.force=noncq).

Does anyone know if setting /sys/block/<device>/device/queue_depth to 1 has the same effect of solving this issue? Or is it something else about enabling NCQ altogether that causes the issue?
Comment 12 Roman Mamedov 2019-11-30 12:13:03 UTC
> Does anyone know if setting /sys/block/<device>/device/queue_depth to 1 has
> the same effect of solving this issue?

Yes, it absolutely should. You don't have to disable NCQ for the entire system to solve this. It can only be an issue if you want to use this as your boot drive and can't find a way to set queue_depth=1 early enough during boot; you could then still hit the NCQ issue during bootup before it is set.
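
The per-device setting discussed here could be applied with something like the following. This is a minimal sketch, not a tested fix: the function name set_queue_depth_one is hypothetical, sdb is only an example device name, and the optional second argument exists solely so the function can be tried against a fake sysfs tree.

```shell
#!/bin/sh
# Set the queue depth of one block device to 1, which stops the kernel
# from issuing more than one command at a time to that device.
# $1: device name such as sdb (example only)
# $2: sysfs root (defaults to /sys; overridable only to allow testing)
set_queue_depth_one() {
    dev=$1
    root=${2:-/sys}
    path="$root/block/$dev/device/queue_depth"
    # Refuse to continue if the attribute is missing or not writable.
    [ -w "$path" ] || { echo "cannot write $path" >&2; return 1; }
    echo 1 > "$path"
    # Read the value back so the caller can confirm it took effect.
    cat "$path"
}

# Example usage (as root):
#   set_queue_depth_one sdb
```

The setting does not persist across reboots, so it would have to be reapplied at every boot. For a boot drive, where this may run too late, the kernel's libata.force parameter also accepts a port-scoped form (an ID prefix before the value), which may be an alternative to disabling NCQ system-wide.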
