Bug 201693 - Samsung 860 EVO NCQ Issue with AMD SATA Controller
Summary: Samsung 860 EVO NCQ Issue with AMD SATA Controller
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: Tejun Heo
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-11-15 00:05 UTC by Ryley Angus
Modified: 2021-02-27 05:51 UTC
CC List: 23 users

See Also:
Kernel Version: 4.19.1
Tree: Mainline
Regression: No


Attachments
dmesg output (93.57 KB, text/plain)
2018-11-15 00:05 UTC, Ryley Angus
lspci -vvv output (33.17 KB, text/plain)
2018-11-15 00:06 UTC, Ryley Angus
smartctl status for the 860 EVO (4.16 KB, text/plain)
2018-11-15 00:07 UTC, Ryley Angus
hdparm -I output (3.39 KB, text/plain)
2018-11-15 00:17 UTC, Ryley Angus

Description Ryley Angus 2018-11-15 00:05:57 UTC
Created attachment 279441 [details]
dmesg output

Hi, I've recently purchased a 2TB Samsung 860 EVO SSD to replace an existing 850 EVO. There have been no other hardware changes to the affected system and I cloned my existing Linux installation (LUKS/ext4) to the new SSD.

I had no issues with the 850 EVO (I purchased it after the queued trim issue was mitigated), but I have immediately had problems with the 860 EVO. If the filesystem is trimmed (manually or automatically), dmesg immediately reports several "WRITE FPDMA QUEUED" errors before hard resetting the link. The only time these errors haven't occurred is when the amount of space trimmed (as reported by fstrim) is less than 15-20GB.

If I disable NCQ for the SSD, I can trim the drive without issue. I've also tried using "libata.force=noncqtrim" but this did not change the situation.
Comment 1 Ryley Angus 2018-11-15 00:06:54 UTC
Created attachment 279443 [details]
lspci -vvv output
Comment 2 Ryley Angus 2018-11-15 00:07:24 UTC
Created attachment 279445 [details]
smartctl status for the 860 EVO
Comment 3 Ryley Angus 2018-11-15 00:17:48 UTC
Created attachment 279447 [details]
hdparm -I output
Comment 4 Ryley Angus 2018-11-15 00:50:36 UTC
Samsung's support forum has several users experiencing similar issues as far back as June: https://eu.community.samsung.com/t5/Cameras-IT-Everything-Else/860-EVO-250GB-causing-freezes-on-AMD-system/td-p/575813 . It seems there may be a compatibility issue between some AMD SATA controllers and the 860 series. I'll try to test my SSD with an Intel SATA controller.
Comment 5 Ryley Angus 2018-11-15 22:26:24 UTC
After disabling discard/trim on the SSD's ext4 partition, I was able to consistently reproduce the freezing behaviour previously triggered by trimming the disk, this time by copying a large (>50GB) file to the SSD. The freezing was not immediate; it occurred 2-3 minutes into the transfer.

If I disable NCQ, both trim and heavy data transfers work reliably (as far as I can tell).
Comment 6 Roman Mamedov 2018-11-28 18:03:07 UTC
I can confirm I face the same issue with a 500 GB 860 EVO, just from dd'ing from another disk to it (i.e. a full-speed streaming write).

You can also see the UDMA CRC Error Count increasing in SMART.

There were no issues with more than a dozen SSDs from other vendors that I have tried on this controller so far; only this one.

Eventually the Samsung steps down to a 1.5 Gbps SATA link, and from then on starts working fine. Disabling NCQ does indeed help, but it hobbles random I/O performance immensely. As far as I know there is no solution, other than (hopefully) a firmware fix from Samsung. Until then, to prevent data loss, NCQ unfortunately has to be disabled.
Comment 7 Solomon Peachy 2018-12-01 18:20:30 UTC
Seeing this on my system too, currently running a Fedora 4.19.2 kernel.

Model=Samsung SSD 860 EVO 1TB
FwRev=RVT01B6Q

MSI 970A-G46 motherboard, which has an AMD970+SB950 chipset.

I can provide more details if necessary.

Samsung has not released any firmware updates for this device, and by all accounts they do not intend to, despite this problem affecting Windows systems as well.
Comment 8 Matt Whitlock 2019-01-26 03:40:24 UTC
For those affected by this issue, does downgrading to kernel 4.18.19 relieve the symptoms? I started seeing similar "FPDMA QUEUED" errors during heavy I/O to my Samsung SSD 860 Pro after I upgraded to the 4.19 kernel series. After downgrading to 4.18.19, my symptoms disappeared.

I am currently in the process of bisecting the kernel sources to find the commit that introduced the regression. I've been at it since mid-December, as it takes several days to gain confidence that a given commit is "good." (Usually it takes 2-3 days of uptime before I discover that a given commit is "bad.")

If downgrading to 4.18.19 does not resolve the issue for others in this report, then I am experiencing a different issue.
Comment 9 Solomon Peachy 2019-01-26 13:01:28 UTC
The forum thread referenced by comment 4 includes logs taken from a 4.15 kernel.

By all accounts this does not appear to be a "regression" introduced by a recent Linux kernel; instead it appears to be due to an incompatibility between AMD SATA controllers and the 860 EVO device firmware.

(And it's broken/flaky on Windows too, so it's probably not Linux's fault.)
Comment 10 Sitsofe Wheeler 2019-08-03 14:48:52 UTC
I think I'm seeing this issue too (a Samsung 860 SSD triggers "WRITE FPDMA QUEUED" errors in the kernel log/dmesg under heavy I/O, causing terrible performance and unreliability unless NCQ is disabled). The SATA controller seeing the issue, in an HP MicroServer N36L, is an AMD SB7x0/SB8x0/SB9x0. The issue happens on both the 4.15.0-43-generic and 4.18.0-13-generic kernels, and in my case I was using a simple fio job to make it occur:

fio --name=test --readonly --rw=randread --filename=/dev/sdb --bs=32k \
    --ioengine=libaio --iodepth=32 --direct=1 --runtime=10m --time_based=1


Here are some more links that may (or may not) be the same thing:

https://github.com/zfsonlinux/zfs/issues/4873#issuecomment-449886669
https://eu.community.samsung.com/t5/Cameras-IT-Everything-Else/860-EVO-250GB-causing-freezes-on-AMD-system/td-p/575813
https://marc.info/?l=linux-block&m=154644276512949&w=2

This may also be linked to https://bugzilla.redhat.com/show_bug.cgi?id=1729678 and https://bugzilla.kernel.org/show_bug.cgi?id=203475 .

(CC'ing Jens)
Comment 11 Daniel Kenzelmann 2019-11-30 11:59:27 UTC
Same issue with Samsung SSD 860 EVO 1TB but with (newer?) RVT03B6Q firmware and the following controller:
Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] (rev 40) (prog-if 01 [AHCI 1.0])
	Subsystem: ASRock Incorporation SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]
	
Disabling NCQ completely seems to solve the issue (using libata.force=noncq).

Does anyone know if setting /sys/block/<device>/device/queue_depth to 1 has the same effect of solving this issue? (Or is it something else about enabling NCQ altogether that causes the issue?)
Comment 12 Roman Mamedov 2019-11-30 12:13:03 UTC
> Does anyone know if setting /sys/block/<device>/device/queue_depth to 1 has
> the same effect of solving this issue?

Yes, it absolutely should. You don't have to disable NCQ for the entire system to solve this. It can only be an issue if you want to use this as your boot drive and can't find a way to set queue_depth=1 early enough during boot; then you can still hit the NCQ issue during bootup before it is set.
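A minimal sketch of that per-device workaround, for reference (the helper name and the sdb device are illustrative; the sysfs root is parameterized only so the write logic can be exercised against a fake tree):

```shell
#!/bin/sh
# Sketch: disable NCQ on one drive by forcing its queue depth to 1.
# On a real system this writes to /sys/block/<dev>/device/queue_depth;
# the root is a parameter so the function can be tested elsewhere.
set_queue_depth() {
    dev="$1"; depth="$2"; root="${3:-/sys/block}"
    attr="$root/$dev/device/queue_depth"
    if [ ! -e "$attr" ]; then
        echo "no such attribute: $attr" >&2
        return 1
    fi
    printf '%s\n' "$depth" > "$attr"
}
```

On a live system (as root), `set_queue_depth sdb 1` is equivalent to `echo 1 > /sys/block/sdb/device/queue_depth`; writing a higher value (up to the default 32) re-enables NCQ.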
Comment 13 Alexander Tsoy 2020-06-21 16:56:43 UTC
Confirming this issue for Samsung 883 DCT as well. Disabling NCQ works as a workaround:

$ cat /etc/udev/rules.d/99-disk.rules 
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd*", ENV{DEVTYPE}=="disk", ENV{ID_MODEL}=="Samsung_SSD_883_DCT_1.92TB", ATTR{device/queue_depth}="1"

You might also want to include the udev rule in the initramfs. For dracut users:

$ cat /etc/dracut.conf.d/local.conf
install_optional_items+=" /etc/udev/rules.d/99-disk.rules "
Comment 14 Alexander Tsoy 2020-08-04 23:47:31 UTC
Update: I still have several issues with device/queue_depth=1: occasional freezes, and very long freezes after performing fstrim, probably because queued TRIM is still enabled. So I ended up with a different workaround (see Documentation/admin-guide/kernel-parameters.txt for the libata.force option format):

$ grep -o "libata[^ ]*" /proc/cmdline 
libata.force=1:noncq,2:noncq

$ sudo dmesg | grep NCQ
[    4.438905] ata2.00: 3750748848 sectors, multi 16: LBA48 NCQ (not used)
[    4.443898] ata1.00: 3750748848 sectors, multi 16: LBA48 NCQ (not used)
[    4.445272] ata4.00: 19532873728 sectors, multi 0: LBA48 NCQ (depth 32), AA
[    4.446274] ata3.00: 19532873728 sectors, multi 0: LBA48 NCQ (depth 32), AA
Comment 15 Solomon Peachy 2020-08-17 15:19:54 UTC
I'd purchased a third-party SATA controller and had been using the 860 EVO apparently problem-free for many months with NCQ enabled... until this morning.

It was the first reboot in over 2 months, and the system crapped out badly on startup. My guess is that Fedora tried to do a queued fstrim, leading to a large pile of DMA WRITE errors and xfs corruption bad enough that the rootfs failed to mount -- I had to run xfs_repair to get it to mount successfully.

I forgot to disable NCQ after everything was fixed... and it got trashed to the point of needing xfs_repair _again_.

Meanwhile, Samsung still refuses to acknowledge there is a problem with their 860 EVO SSD firmware, much less release an update.
Comment 16 Solomon Peachy 2020-08-17 15:43:08 UTC
Oh, forgot to say my latest corruption was with Fedora's 5.7.12-200.fc32.x86_64 kernel, plugged into an ASMedia ASM1062 SATA controller.  The motherboard is the same as before, sporting a AMD970+SB950 chipset.
Comment 17 Dag Nygren 2020-08-22 08:31:33 UTC
Seeing a very similar thing happen with a completely different setup:

Drive: SAMSUNG 870 QVO 1TB
Controller: Intel Corporation 82801IBM/IEM

Just tried setting queue_depth to 1, and so far so good. But the problem has been very intermittent. I cannot change the cable, as this is a laptop.
Comment 18 Andreas Elvers 2020-09-24 11:12:36 UTC
The NCQ problem also affects the Samsung server grade SSDs.

Drive: Samsung MZ7LH1T9HMLT
Controller: SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]

This issue seems to be very widespread. I wonder why Samsung and AMD can't come to an agreement to help their customers.
Comment 19 Alejandro Donato 2020-11-18 20:33:28 UTC
+1

Kernel: 5.4.0-54-generic

SATA controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]

Samsung SSD 850 EVO 500GB (firmware EMT02B6Q)

On the ASMedia SATA II 6Gbps controller (now disabled), read errors and heavy freezes (even filesystem damage).

On the AMD controller, the link fails and eventually steps down to UDMA/100 and PIO4...

----syslog------

[  248.159560] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  248.228187] ata3.00: configured for UDMA/133
[  312.724185] ata3: limiting SATA link speed to 1.5 Gbps
[  320.675869] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[  320.769960] ata3.00: configured for UDMA/133
[  392.987615] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[  393.069527] ata3.00: configured for UDMA/133
[  468.707516] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[  468.802562] ata3.00: configured for UDMA/133
[  542.007285] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[  542.069186] ata3.00: configured for UDMA/133
[  571.309052] hrtimer: interrupt took 19395 ns
[  606.873323] ata3.00: limiting speed to UDMA/100:PIO4
[  614.882962] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[  614.969015] ata3.00: configured for UDMA/100
[  688.073627] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[  688.143670] ata3.00: configured for UDMA/100

-------------

Performance is heavily degraded.

This is a CRITICAL failure. Please update the bug report.

Thanks!!!
Comment 20 Roman Elshin 2020-11-21 13:15:30 UTC
I have a Samsung 860 Pro with RVM02B6Q firmware; it is incompatible not only with AMD AHCI. With an ASMedia ASM1061 it seems to work with queued TRIM disabled (ATA_HORKAGE_NO_NCQ_TRIM | ATA_HORKAGE_ZERO_AFTER_TRIM in libata-core.c).
With a Marvell 88SE9120 it requires libata.force=3.0G to work.
Interesting: is there any SATA 3 PCIe x1 card on which Samsung's 860* drives work flawlessly?
Comment 21 nejrobbins 2020-12-05 03:26:41 UTC
Also happening here. 860 EVO, latest firmware, ASRock 970M Pro3 motherboard, so an AMD SATA chipset: 970 northbridge with AMD SB950 southbridge. Getting the same "FPDMA QUEUED" errors, and the drive falls back to SATA 1.5 Gbps. Disabling NCQ does solve the issue, but hurts performance.

Interestingly, on Windows with the AMD SATA driver the OS freezes entirely, but with the Microsoft SATA controller driver (which AMD now recommends using), the system doesn't freeze anymore (though it still gets the other issues unless NCQ is disabled). So maybe some kind of SATA firmware fix is possible?
Comment 22 Roman Mamedov 2020-12-05 09:15:08 UTC
> But with the Microsoft Sata Controller Driver (that AMD now recommends
> using), the system doesn't freeze anymore (but still gets the other issues
> without disabling NCQ).

Well, if it still has issues until NCQ is disabled, that only confirms that the issue can't really be solved by the driver. Also, did you check what performance you get with it, and how it compares with AMD's driver? IIRC the MS driver was quite slow, so it might be a bit more "reliable" only due to being suboptimal and not pushing the hardware nearly as hard.

One other thing I should mention, as I see people reporting issues on non-AMD controllers as well: to clarify my experience so far, on the AMD chipset controller I have to disable NCQ entirely to get it working, but on the ASMedia controllers it seems to be enough to disable just the queued TRIM (the same as Roman Elshin reports above):
https://bugzilla.kernel.org/show_bug.cgi?id=203475
Try that; maybe you can regain some of the lost performance and still get reliable operation out of the device.
Comment 23 nejrobbins 2020-12-05 16:19:13 UTC
With the AMD driver I couldn't even run a benchmark, as the system would just freeze and I'd have to force restart. Both with CrystalDiskMark and Samsung Magician.

However, with the Microsoft driver I would still get around 500 MB/s give or take, if I remember correctly, which is the rated spec for the drive. I'm on Linux now, but I haven't tested speeds with NCQ disabled.

I'll try disabling only queued TRIM and see what happens. Thanks.
Comment 24 Sitsofe Wheeler 2020-12-08 19:19:13 UTC
Can people who are seeing this report which model (e.g. SSD 860 EVO 500GB), firmware (e.g. RVT01B6Q), and PCI controller (e.g. AMD SB7x0/SB8x0/SB9x0) they have? In my case smartctl -a <dev> reports that I'm on RVT01B6Q firmware, which is apparently behind the latest (RVT04B6Q) listed on https://www.samsung.com/semiconductor/minisite/ssd/download/tools/ . If folks are feeling brave and can take the risk, can they report whether the issue still reproduces on the latest firmware?
Comment 25 Sitsofe Wheeler 2020-12-08 19:31:53 UTC
Hmm, poking around the web, it doesn't look like firmware updates are solving this issue (see https://community.amd.com/t5/server-gurus-discussions/issues-with-samsung-ssds-on-epyc/td-p/402737 ).
Comment 26 nejrobbins 2020-12-08 19:57:06 UTC
Another person experiencing the issue with an Intel NUC: https://unix.stackexchange.com/questions/623238/root-causes-for-failed-command-write-fpdma-queued.

But @Sitsofe Wheeler: I'm on RVT04B6Q and still experiencing the issue, so as you said, it doesn't seem to be the firmware. I have an AMD SB950.
Comment 27 Sitsofe Wheeler 2020-12-08 23:05:21 UTC
I flashed my 860 EVO to RVT04B6Q and the issue is still present, which confirms nejrobbins' message above.
Comment 28 Roy 2020-12-28 00:44:08 UTC
Wish I had found this bug report before getting an SSD to provide a much-needed performance boost to my ageing PC.

* Specs:
- Asus M5A97 Evo Rev2.0, latest UEFI,
- AMD FX-6300,
- Samsung EVO 860, firmware RVT04B6Q,
- Kernel 5.9.16

lspci -vv (filtered SATA controller only, double-checked with lshw)
00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] (rev 40) (prog-if 01 [AHCI 1.0])
	Subsystem: ASUSTeK Computer Inc. M5A99X EVO (R1.0) SB950
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 32
	Interrupt: pin A routed to IRQ 19
	NUMA node: 0
	Region 0: I/O ports at f040 [size=8]
	Region 1: I/O ports at f030 [size=4]
	Region 2: I/O ports at f020 [size=8]
	Region 3: I/O ports at f010 [size=4]
	Region 4: I/O ports at f000 [size=16]
	Region 5: Memory at fe60b000 (32-bit, non-prefetchable) [size=1K]
	Capabilities: <access denied>
	Kernel driver in use: ahci


* Symptoms:
dmesg was riddled with these messages:

dec 27 14:08:24 tuvok kernel: ata1: log page 10h reported inactive tag 17
dec 27 14:08:24 tuvok kernel: ata1.00: exception Emask 0x1 SAct 0x7ffc003f SErr 0x0 action 0x0
dec 27 14:08:24 tuvok kernel: ata1.00: irq_stat 0x40000008
dec 27 14:08:24 tuvok kernel: ata1.00: failed command: WRITE FPDMA QUEUED
dec 27 14:08:24 tuvok kernel: ata1.00: cmd 61/08:00:58:2d:57/00:00:2d:00:00/40 tag 0 ncq dma 4096 out
                                       res 40/00:28:f0:2d:57/00:00:2d:00:00/40 Emask 0x1 (device error)
dec 27 14:08:24 tuvok kernel: ata1.00: status: { DRDY }
dec 27 14:08:24 tuvok kernel: ata1.00: failed command: WRITE FPDMA QUEUED
dec 27 14:08:24 tuvok kernel: ata1.00: cmd 61/08:08:68:2d:57/00:00:2d:00:00/40 tag 1 ncq dma 4096 out
                                       res 40/00:28:f0:2d:57/00:00:2d:00:00/40 Emask 0x1 (device error)
dec 27 14:08:24 tuvok kernel: ata1.00: status: { DRDY }
dec 27 14:08:24 tuvok kernel: ata1.00: failed command: WRITE FPDMA QUEUED
dec 27 14:08:24 tuvok kernel: ata1.00: cmd 61/08:10:98:2d:57/00:00:2d:00:00/40 tag 2 ncq dma 4096 out
                                       res 40/00:28:f0:2d:57/00:00:2d:00:00/40 Emask 0x1 (device error)
dec 27 14:08:24 tuvok kernel: ata1.00: status: { DRDY }
dec 27 14:08:24 tuvok kernel: ata1.00: failed command: WRITE FPDMA QUEUED
dec 27 14:08:24 tuvok kernel: ata1.00: cmd 61/08:18:a8:2d:57/00:00:2d:00:00/40 tag 3 ncq dma 4096 out
                                       res 40/00:28:f0:2d:57/00:00:2d:00:00/40 Emask 0x1 (device error)

* Work-around:
Booting with libata.force=noncq works around this issue.
Comment 29 Gregory P. Smith 2021-02-23 06:18:25 UTC
A new Samsung 870 EVO 1TB SSD runs into this issue on Linux any time a DISCARD is sent. :(

For example: removing an LVM snapshot after doing a backup (because of the questionable behavior I've been observing)... bam:

Feb 22 11:28:32 zoonaut kernel: [130904.469448] ata4.00: qc timeout (cmd 0x47)
Feb 22 11:28:32 zoonaut kernel: [130904.470626] ata4.00: READ LOG DMA EXT failed, trying PIO
Feb 22 11:28:32 zoonaut kernel: [130904.470633] ata4: failed to read log page 10h (errno=-5)
Feb 22 11:28:32 zoonaut kernel: [130904.470700] ata4.00: exception Emask 0x1 SAct 0x40 SErr 0x0 action 0x6 frozen
Feb 22 11:28:32 zoonaut kernel: [130904.470748] ata4.00: irq_stat 0x40000008
Feb 22 11:28:32 zoonaut kernel: [130904.470779] ata4.00: failed command: SEND FPDMA QUEUED
Feb 22 11:28:32 zoonaut kernel: [130904.470824] ata4.00: cmd 64/01:30:00:00:00/00:00:00:00:00/a0 tag 6 ncq dma 512 out
Feb 22 11:28:32 zoonaut kernel: [130904.470824]          res 50/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x1 (device error)
Feb 22 11:28:32 zoonaut kernel: [130904.470932] ata4.00: status: { DRDY }
Feb 22 11:28:32 zoonaut kernel: [130904.470965] ata4: hard resetting link
Feb 22 11:28:32 zoonaut kernel: [130904.789997] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb 22 11:28:32 zoonaut kernel: [130904.794412] ata4.00: supports DRM functions and may not be fully accessible
Feb 22 11:28:32 zoonaut kernel: [130904.797441] ata4.00: supports DRM functions and may not be fully accessible
Feb 22 11:28:32 zoonaut kernel: [130904.799759] ata4.00: configured for UDMA/133
Feb 22 11:28:32 zoonaut kernel: [130904.799771] ata4.00: device reported invalid CHS sector 0
Feb 22 11:28:32 zoonaut kernel: [130904.799792] sd 3:0:0:0: [sdc] tag#6 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 22 11:28:32 zoonaut kernel: [130904.799799] sd 3:0:0:0: [sdc] tag#6 Sense Key : Illegal Request [current] 
Feb 22 11:28:32 zoonaut kernel: [130904.799803] sd 3:0:0:0: [sdc] tag#6 Add. Sense: Unaligned write command
Feb 22 11:28:32 zoonaut kernel: [130904.799809] sd 3:0:0:0: [sdc] tag#6 CDB: Write same(16) 93 08 00 00 00 00 69 40 10 00 00 20 00 00 00 00
Feb 22 11:28:32 zoonaut kernel: [130904.799815] blk_update_request: I/O error, dev sdc, sector 1765806080 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
Feb 22 11:28:32 zoonaut kernel: [130904.799969] ata4: EH complete
Feb 22 11:28:32 zoonaut kernel: [130904.800165] ata4.00: Enabling discard_zeroes_data
...
Feb 22 18:40:55 zoonaut kernel: [156847.619004] ata4.00: exception Emask 0x0 SAct 0xf000 SErr 0x0 action 0x0
Feb 22 18:40:55 zoonaut kernel: [156847.619075] ata4.00: irq_stat 0x40000008
Feb 22 18:40:55 zoonaut kernel: [156847.619106] ata4.00: failed command: SEND FPDMA QUEUED
Feb 22 18:40:55 zoonaut kernel: [156847.619148] ata4.00: cmd 64/01:60:00:00:00/00:00:00:00:00/a0 tag 12 ncq dma 512 out
Feb 22 18:40:55 zoonaut kernel: [156847.619148]          res 41/04:01:00:00:00/00:00:00:00:00/00 Emask 0x401 (device err
or) <F>
Feb 22 18:40:55 zoonaut kernel: [156847.619247] ata4.00: status: { DRDY ERR }
Feb 22 18:40:55 zoonaut kernel: [156847.619275] ata4.00: error: { ABRT }
Feb 22 18:40:55 zoonaut kernel: [156847.619792] ata4.00: supports DRM functions and may not be fully accessible
Feb 22 18:40:55 zoonaut kernel: [156847.622381] ata4.00: supports DRM functions and may not be fully accessible
Feb 22 18:40:55 zoonaut kernel: [156847.624512] ata4.00: configured for UDMA/133
Feb 22 18:40:55 zoonaut kernel: [156847.624544] ata4: EH complete
Feb 22 18:40:55 zoonaut kernel: [156847.624714] ata4.00: Enabling discard_zeroes_data
Feb 22 18:40:55 zoonaut kernel: [156847.734820] ata4.00: exception Emask 0x0 SAct 0x7e SErr 0x0 action 0x0
Feb 22 18:40:55 zoonaut kernel: [156847.734890] ata4.00: irq_stat 0x40000008
Feb 22 18:40:55 zoonaut kernel: [156847.734922] ata4.00: failed command: SEND FPDMA QUEUED
Feb 22 18:40:55 zoonaut kernel: [156847.734963] ata4.00: cmd 64/01:08:00:00:00/00:00:00:00:00/a0 tag 1 ncq dma 512 out
Feb 22 18:40:55 zoonaut kernel: [156847.734963]          res 41/04:01:00:00:00/00:00:00:00:00/00 Emask 0x401 (device error) <F>
Feb 22 18:40:55 zoonaut kernel: [156847.735061] ata4.00: status: { DRDY ERR }
Feb 22 18:40:55 zoonaut kernel: [156847.735090] ata4.00: error: { ABRT }
Feb 22 18:40:55 zoonaut kernel: [156847.735722] ata4.00: supports DRM functions and may not be fully accessible
Feb 22 18:40:55 zoonaut kernel: [156847.738424] ata4.00: supports DRM functions and may not be fully accessible
Feb 22 18:40:55 zoonaut kernel: [156847.740732] ata4.00: configured for UDMA/133
Feb 22 18:40:55 zoonaut kernel: [156847.740776] ata4: EH complete
Feb 22 18:40:55 zoonaut kernel: [156847.741073] ata4.00: Enabling discard_zeroes_data
... and on and on and on...
Feb 22 18:40:56 zoonaut kernel: [156848.541172] ata4.00: Enabling discard_zeroes_data
Feb 22 18:40:56 zoonaut kernel: [156848.638967] ata4.00: exception Emask 0x0 SAct 0x3f00 SErr 0x0 action 0x0
Feb 22 18:40:56 zoonaut kernel: [156848.642010] ata4.00: irq_stat 0x40000008
Feb 22 18:40:56 zoonaut kernel: [156848.645155] ata4.00: failed command: SEND FPDMA QUEUED
Feb 22 18:40:56 zoonaut kernel: [156848.648344] ata4.00: cmd 64/01:40:00:00:00/00:00:00:00:00/a0 tag 8 ncq dma 512 out
Feb 22 18:40:56 zoonaut kernel: [156848.648344]          res 41/04:01:00:00:00/00:00:00:00:00/00 Emask 0x401 (device error) <F>
Feb 22 18:40:56 zoonaut kernel: [156848.654650] ata4.00: status: { DRDY ERR }
Feb 22 18:40:56 zoonaut kernel: [156848.657798] ata4.00: error: { ABRT }
Feb 22 18:40:56 zoonaut kernel: [156848.661629] ata4.00: supports DRM functions and may not be fully accessible
Feb 22 18:40:56 zoonaut kernel: [156848.664769] ata4.00: supports DRM functions and may not be fully accessible
Feb 22 18:40:56 zoonaut kernel: [156848.666981] ata4.00: configured for UDMA/133
Feb 22 18:40:56 zoonaut kernel: [156848.667013] sd 3:0:0:0: [sdc] tag#8 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 22 18:40:56 zoonaut kernel: [156848.667019] sd 3:0:0:0: [sdc] tag#8 Sense Key : Illegal Request [current] 
Feb 22 18:40:56 zoonaut kernel: [156848.667024] sd 3:0:0:0: [sdc] tag#8 Add. Sense: Unaligned write command
Feb 22 18:40:56 zoonaut kernel: [156848.667029] sd 3:0:0:0: [sdc] tag#8 CDB: Write same(16) 93 08 00 00 00 00 69 40 10 00 00 3f ff c0 00 00
Feb 22 18:40:56 zoonaut kernel: [156848.667035] blk_update_request: I/O error, dev sdc, sector 1765806080 op 0x3:(DISCARD) flags 0x4000 phys_seg 1 prio class 0
Feb 22 18:40:56 zoonaut kernel: [156848.670129] ata4: EH complete

lvremove thankfully just waits and retries all its I/O patiently, ultimately succeeding, and merely notes in an error message that its DISCARD got an I/O error.

But I no longer trust the device in my machine.

HP Microserver Gen10 AMD based system: lspci -v shows

00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 49) (prog-if 01 [AHCI 1.0])
        Subsystem: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
Comment 30 Gregory P. Smith 2021-02-24 05:27:19 UTC
For those disabling NCQ as a workaround: you can do this per-drive rather than system-wide. Writing 1 to /sys/block/sdX/device/queue_depth instead of the default value disables NCQ.

lvresize -L -299G no longer takes ages or produces ATA discard errors in syslog after I do that. Re-enabling NCQ by putting a higher value there, such as the default 32 (really 31), causes it to run into errors again.

I've already picked up a WD Blue SSD to replace this buggy Samsung. Not supporting NCQ properly is unacceptable to me. I could just leave trim and discard disabled, but that's an equally hacky non-default config that nothing manufactured and sold in 2021 should require.

I'll avoid Samsung SSDs in the future and recommend others do the same. This may only be a bug in their SATA line (which is becoming a legacy product merely for replacing HDDs).

Linux-kernel-wise, the 8xx-series Samsung SATA SSDs could be blocklisted as known-troublesome devices so that trim or NCQ is disabled on them by default. Do we accept this kind of vendor quirk hack in mainline kernels?
Comment 31 Roman Mamedov 2021-02-24 07:23:43 UTC
Gregory, did you try disabling just the queued TRIM, not NCQ entirely? As suggested in https://bugzilla.kernel.org/show_bug.cgi?id=203475
Comment 32 Gregory P. Smith 2021-02-24 09:30:22 UTC
Thanks for the link. I hadn't seen that issue; it seems to be the same thing as this one. Disabling just queued TRIM, and not NCQ entirely, appears to require rebuilding the kernel, which isn't something I can do on this machine.

Relevant patch and pointer to the change that needs reverting in the kernel:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca6bfcb2f6d9deab3924bf901e73622a94900473

People using Linux distros: file issues against your distro asking them to add a "Samsung [78]*" entry to that blocklist.

Also related: https://bugzilla.kernel.org/show_bug.cgi?id=202093 , which came from Canonical's existing Ubuntu issue https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1809972
Comment 33 Matt Whitlock 2021-02-24 15:01:33 UTC
(In reply to Gregory P. Smith from comment #30)
> Linux kernel wise, the 8xx series Samsung SATA SSDs could be blocklisted as
> known troublesome devices so that trim or NCQ are disabled by default on
> them.

PLEASE don't do that! I have a Samsung 860 Pro and a Samsung 860 EVO, and both work great with NCQ and queued Trim enabled. The incompatibility is specifically with AMD SATA controllers. Don't hobble these great drives universally because of one bad controller.
Comment 34 nejrobbins 2021-02-24 17:12:00 UTC
It is predominantly AMD SATA, but there have been some reports of it happening on Intel systems. I agree, though, that it shouldn't be disabled outright.

From what I've seen, typically just disabling queued TRIM doesn't fix the issue, as it appears in other contexts and not just when trimming. But it may still be worth trying for some.

You can also disable NCQ for a specific SATA port (rather than system-wide) with libata.force=X:noncq, where X is the ATA port number.

From: https://wiki.archlinux.org/index.php/Solid_state_drive#Resolving_NCQ_errors
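To find the right X for a given disk, the device's sysfs path contains the matching ataN name. A small sketch (the helper and the sample path are illustrative; on a live system feed it the output of `readlink -f /sys/block/sdX` instead):

```shell
#!/bin/sh
# Sketch: extract the libata port number (the N in "ataN") from a block
# device's sysfs path, for use with libata.force=N:noncq.
ata_port_of() {
    # keep the first ataN component of the path, then strip the prefix
    printf '%s\n' "$1" | grep -o 'ata[0-9][0-9]*' | head -n 1 | sed 's/^ata//'
}

# Example with a sample (illustrative) sysfs path:
path="/sys/devices/pci0000:00/0000:00:11.0/ata4/host3/target3:0:0/3:0:0:0/block/sdc"
ata_port_of "$path"   # prints: 4
```

Note the port numbers can change across reboots if drives are moved between ports, so the parameter is best re-checked after hardware changes.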

I wonder what the performance impact of disabling NCQ is? I'm not sure of the relationship between typical queue depth in everyday use and sequential vs. random read/write performance.


Just as a side note: on Windows with the Microsoft SATA driver the OS would freeze during benchmarks, but with the AMD driver it wouldn't. I'm not sure whether NCQ got disabled there or anything, and I didn't do a benchmark at the time.
Comment 35 Gregory P. Smith 2021-02-24 18:41:10 UTC
Kernel-wise, it should ship with safe default behavior. Let people who want this enabled on known-often-borked devices opt in; don't default to causing performance problems or potential data loss for those who fail to opt out.

The kernel's libata HORKAGE list exists for a reason. Let's use it to maximal user benefit to avoid problems. Re-enabling NCQ (or NCQ TRIM, if it is that specific) on these Samsung SSDs was a mistake that everyone piping up here is paying for.

The 870 is brand new, just released this year.  Yet it has the problem.

While I've seen an old blog post claiming that disabling NCQ on a SATA SSD led to a reduction in 4K random read performance, firing up fio to do a randread test is not reproducing that for me. In fact, I just found that I/O speed on the device surprisingly went _up_ 20-30% after I disabled NCQ by writing 1 to /sys/block/sdc/device/queue_depth. WAT?

If that's the case, I don't want NCQ on such a device. No idea if I'm holding fio wrong; it's my first time using it.

Anyway, that's all the time I have for this. Samsung did wrong with their SATA SSD firmware. The kernel is doing wrong by users who own those devices today.

You've got the power to fix it for all Samsung SSD owners.
Comment 36 Matt Whitlock 2021-02-24 19:38:38 UTC
(In reply to Gregory P. Smith from comment #35)
> firing up fio to do a
> randread test is not reproducing that for me.  In fact... I just found that
> I/O speed on the device surprising went _up_ 20-30% after I disabled NCQ by
> writing 1 to /sys/block/sdc/device/queue_depth.  WAT?

I just ran fio on my Samsung 860 EVO 2TB, random 4K reads with libaio engine, I/O depth 256, jobs 4, runtime 120 seconds.

With queue_depth=32 (default): read: IOPS=46.7k, BW=182MiB/s

With queue_depth=1: read: IOPS=13.3k, BW=51.0MiB/s

No messages in my kernel logs for the duration of these tests.

So again, please don't cripple these drives for everyone. If there is an incompatibility with specific SATA controllers, then address that specifically.
Comment 37 Gregory P. Smith 2021-02-24 21:02:20 UTC
Trying to come up with a list of broken Samsung-SSD/SATA-controller combos is Samsung's job. But I'd be surprised if they wanted to do that, as it would nerf their marketing-only benchmarks. Ship code that is safe by default.

Rollback https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca6bfcb2f6d9deab3924bf901e73622a94900473

Disabling it on all Intel, AMD, ASMedia, and Marvell SATA controllers would be the only reasonable choice in the kernel, based on all of the comments on these issues so far.

You're setting an arbitrarily impossible testing bar in favor of your own personal combo's performance, at the expense of everyone else's data integrity.
Comment 38 Roman Mamedov 2021-02-24 21:18:25 UTC
It should be noted that the referenced commit changes only the NCQ TRIM blacklist, whereas Matt's test results were obtained by disabling NCQ entirely.

Disabling just the queued TRIM will not have nearly as much performance impact as disabling NCQ itself. In most cases it will have none, since IIRC the latest best practice from FS devs is not to use inline trim (the "discard" mount option) at all, opting for daily/weekly invocations of "fstrim" instead. But even this might not help on some controllers; the telltale sign is whether the errors mention "WRITE FPDMA QUEUED" (a generic NCQ issue) as opposed to "SEND FPDMA QUEUED" (a problem with NCQ TRIM specifically).
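That distinction can be turned into a quick triage rule. A minimal sketch, where `classify_ncq_error` is a hypothetical helper and the matched strings are the failed-command names that libata prints in the kernel log:

```shell
#!/bin/sh
# Sketch: tell a generic NCQ failure from a queued-TRIM-specific one,
# based on the "failed command:" name libata logs. classify_ncq_error
# is a hypothetical helper, not part of any kernel tooling.

classify_ncq_error() {
    case "$1" in
        *"WRITE FPDMA QUEUED"*) echo "generic NCQ problem: try libata.force=noncq" ;;
        *"SEND FPDMA QUEUED"*)  echo "queued TRIM problem: try libata.force=noncqtrim" ;;
        *)                      echo "not an NCQ-related error" ;;
    esac
}

# On a real system, feed it lines from the kernel log, e.g.:
#   dmesg | grep 'failed command:'
classify_ncq_error "failed command: SEND FPDMA QUEUED"
```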
Comment 39 Gregory P. Smith 2021-02-24 21:44:36 UTC
Right.  When I can reboot, I'll try mine with just noncqtrim.  (It wasn't clear to me if there is a way to control the noncqtrim horkage setting via /sys/block/sdc/device/ like there is for disabling NCQ entirely.)
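Since there is no sysfs knob for noncqtrim, the setting has to go on the kernel command line. A sketch assuming a GRUB-based system; the port number "4" is only an example, and the dmesg greps are one way to find and verify yours:

```shell
# /etc/default/grub -- example only; "4" is the ata port of the affected
# SSD (find yours with: dmesg | grep -i 'ata[0-9].*Samsung')
GRUB_CMDLINE_LINUX="libata.force=4:noncqtrim"

# Then regenerate the config and reboot:
#   sudo grub-mkconfig -o /boot/grub/grub.cfg
# Verify after reboot:
#   dmesg | grep 'disabling queued TRIM'
```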

Matt: Can you share the fio command line / config you used?  I'd like to repeat the same test on my own system.  Yours is clearly performing much more like I'd expect given the settings (NCQ being disabled _should_ put a big dent in 4K random read performance on a decent SSD).
Comment 40 Matt Whitlock 2021-02-24 23:20:38 UTC
(In reply to Gregory P. Smith from comment #39)
> Matt: Can you share the fio command line / config you used?

Not knowing anything about fio myself, I just used the example command line from https://docs.oracle.com/en-us/iaas/Content/Block/References/samplefiocommandslinux.htm:

fio --filename=/dev/disk/by-id/ata-Samsung_SSD_yaddayaddayadda --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly
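For anyone repeating the comparison, here is a sketch of running that exact read-only test at both queue depths. `extract_iops` is a hypothetical helper for pulling the IOPS figure out of fio's summary line; the fio flags are those from the command above, and the device path placeholder is left for you to fill in.

```shell
#!/bin/sh
# Sketch: compare 4K randread IOPS with NCQ on (depth 32) and off (depth 1).
# extract_iops is a hypothetical helper that parses fio's summary line,
# e.g. "read: IOPS=46.7k, BW=182MiB/s".

extract_iops() {
    sed -n 's/.*IOPS=\([^,]*\),.*/\1/p'
}

# On a real system (as root; --readonly keeps the test non-destructive):
#   DEV=/dev/disk/by-id/ata-Samsung_SSD_yaddayaddayadda
#   for depth in 32 1; do
#       echo "$depth" > /sys/block/sdc/device/queue_depth
#       fio --filename="$DEV" --direct=1 --rw=randread --bs=4k \
#           --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 \
#           --time_based --group_reporting --name=qd$depth --readonly \
#           | grep 'read: IOPS' | extract_iops
#   done

# Parsing demo on a sample summary line:
echo "read: IOPS=46.7k, BW=182MiB/s (191MB/s)" | extract_iops   # prints: 46.7k
```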
Comment 41 Gregory P. Smith 2021-02-27 05:51:58 UTC
Confirming: Rebooting with libata.force=4:noncqtrim to disable queued trim on my Samsung 870 EVO _appears_ to work around the issue for easy to reproduce situations (lvresize to reduce a volume size).

[    2.481763] ata4.00: FORCE: horkage modified (noncqtrim)
[    2.481823] ata4.00: supports DRM functions and may not be fully accessible
[    2.482434] ata4.00: disabling queued TRIM support
[    2.482437] ata4.00: ATA-11: Samsung SSD 870 EVO 1TB, SVT01B6Q, max UDMA/133
[    2.482439] ata4.00: 1953525168 sectors, multi 1: LBA48 NCQ (depth 32), AA

Also confirming: There is no measurable performance degradation during normal use of my SSD from doing so.  All this is disabling is Queued TRIM.

I understand this to mean a TRIM must act as a barrier: existing queued I/O is allowed to finish, the TRIM then runs alone as a serialized command, and normal NCQ I/O resumes afterward.

Confirming: while running some "lvresize -L -100G" commands on an LV on the volume during an fio run, I see a very brief blip in the speeds printed to stdout.  But as it's just a single trim, it is inconsequential to performance.

I wouldn't mount your filesystem with -o discard in such a configuration if you'll have frequent transient files and need unwavering read throughput.  But a regular ~weekly scheduled "fstrim -a" shouldn't be a big deal (which I believe is the default setup in popular distros anyways).
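A sketch of checking for that periodic-trim setup. `show_trim_schedule` is a hypothetical helper; it assumes util-linux's fstrim and, where present, systemd's fstrim.timer (weekly by default on distros that ship it).

```shell
#!/bin/sh
# Sketch: prefer a periodic fstrim over mount -o discard.
# show_trim_schedule is a hypothetical helper, not a standard tool.

show_trim_schedule() {
    if command -v systemctl >/dev/null 2>&1; then
        systemctl list-timers fstrim.timer --no-pager 2>/dev/null \
            || echo "fstrim.timer not available on this system"
    else
        echo "no systemd: schedule 'fstrim -a' weekly via cron instead"
    fi
}

show_trim_schedule

# To enable the timer on systemd systems:
#   sudo systemctl enable --now fstrim.timer
# Manual one-off trim of all mounted filesystems, verbose:
#   sudo fstrim -av
```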

The patch just re-marks these drives as "disable Queued TRIM" by default.  It doesn't disable NCQ entirely.  Seems like a good safe default.

I don't doubt that some people have issues with NCQ itself.  The reason noncq and noncqtrim libata.force flags were added to the kernel in 2015 was due to the large number of SSDs out there that don't behave well.  (https://patchwork.ozlabs.org/project/linux-ide/patch/1430790861-30066-1-git-send-email-martin.petersen@oracle.com/)

The kernel could fail better in this situation.  Perhaps: "Got an error as a result of a queued trim?  Automatically flip that device to noncqtrim mode."  But that's a larger logic change with consequences.  Updating the horkage blocklist is simple and targets this specific issue.

Matt - thanks for the fio command and link!  _That_ one seems to properly exercise my SSD.

4k read: IOPS=88.7k, BW=346MiB/s (363MB/s)      with noncqtrim or without
4k read: IOPS=11.1k, BW=43.5MiB/s (45.6MB/s)    with noncq (queue_depth=1)
