Bug 202665 - NVMe AMD-Vi IO_PAGE_FAULT only with hardware IOMMU and fstrim/discard
Summary: NVMe AMD-Vi IO_PAGE_FAULT only with hardware IOMMU and fstrim/discard
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Block Layer
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: Jens Axboe
URL:
Keywords:
Duplicates: 198733 (view as bug list)
Depends on:
Blocks:
 
Reported: 2019-02-24 10:21 UTC by aladjev.andrew@gmail.com
Modified: 2019-12-02 05:22 UTC (History)
32 users (show)

See Also:
Kernel Version: 4.20.12
Tree: Mainline
Regression: No


Attachments
4.20.12 kernel config (118.35 KB, text/plain)
2019-02-24 10:21 UTC, aladjev.andrew@gmail.com
Details
dmesg with page fault (64.48 KB, text/plain)
2019-02-24 10:22 UTC, aladjev.andrew@gmail.com
Details
dmesg with iommu=soft (57.37 KB, text/plain)
2019-02-24 10:22 UTC, aladjev.andrew@gmail.com
Details
dmesg without discard (60.93 KB, text/plain)
2019-02-24 10:23 UTC, aladjev.andrew@gmail.com
Details
dmesg with hardware iommu and fstrim (67.26 KB, text/plain)
2019-02-26 14:34 UTC, aladjev.andrew@gmail.com
Details
fstrim / function_graph trace for fstrim (1.04 MB, application/x-xz)
2019-03-31 14:36 UTC, Jonathan McDowell
Details
ftrace for fstrim / (1.28 MB, application/x-xz)
2019-03-31 18:25 UTC, Tim Murphy
Details
ftrace for (failing) fstrim /mnt/work (1.58 MB, application/x-xz)
2019-03-31 18:38 UTC, Tim Murphy
Details
5.0.5 trace-cmd (1.65 MB, application/x-xz)
2019-04-02 10:54 UTC, nutodafozo
Details
trace-cmd results (10.84 KB, text/plain)
2019-04-03 14:07 UTC, Tim Murphy
Details
trace_report (392.66 KB, application/x-xz)
2019-04-05 06:01 UTC, Andreas
Details
trace report + dmesg (21.54 KB, application/x-xz)
2019-05-03 18:39 UTC, Christoph Nelles
Details
attachment-10825-0.html (1.26 KB, text/html)
2019-09-07 18:06 UTC, Tim Murphy
Details
kernel-5.3-nvme-discard-align-to-page-size.patch (3.46 KB, patch)
2019-10-01 21:37 UTC, Vladimir Smirnov
Details | Diff
log with iommu=pt (2.37 MB, text/plain)
2019-10-03 04:45 UTC, swk
Details
log with iommu=on (93.35 KB, text/plain)
2019-10-03 04:46 UTC, swk
Details
log with AMD virtualization OFF (2.70 MB, text/plain)
2019-10-03 10:43 UTC, swk
Details

Description aladjev.andrew@gmail.com 2019-02-24 10:21:21 UTC
Created attachment 281315 [details]
4.20.12 kernel config

Hello. I think I've found a new issue with NVMe that has not been reported yet.

Yesterday I installed an AMD 2400G, an ASUS TUF B450M Pro, and an ADATA SX8200 NVMe drive. First of all I updated to the latest BIOS, 0604. Then I used sysresccd (kernel 4.19) with "iommu=soft" to boot and install a Gentoo base system with the latest toolchain: gcc 8.2.0, binutils 2.30-r4, glibc 2.27-r6, linux-firmware 2019022, linux-headers 4.20, kernel 4.20.12.

I've configured the kernel with the recommended options for Raven Ridge. I will attach the ".config" file to this issue.

Then I created an fstab with:
/	ext4	discard,noatime	0 1
/home	ext4	discard,noatime	0 2

and made a first reboot. Everything worked as expected, so I proceeded with the installation. After 20 minutes I checked dmesg and found that AMD-Vi was spamming errors:

# dmesg
[  145.127297] nvme 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000fd769100 flags=0x0000]
[  145.127301] AMD-Vi: Event logged [IO_PAGE_FAULT device=06:00.0 domain=0x0000 address=0x00000000fd769180 flags=0x0000]

I haven't had any issues with my data on the SSD. smartctl said the disk is healthy, cool, and fine. I will attach "dmesg.bad".

Then I enabled "iommu=soft", rebooted, and the issue disappeared. Then I removed "iommu=soft" and the "discard" option from fstab, rebooted, and the issue disappeared too. I will attach "dmesg.iommu.soft" and "dmesg.without.discard".

So I can reproduce the IO_PAGE_FAULT only with the hardware IOMMU enabled and discard. People on the Arch Linux forum reproduced the same issue on an AMD 2700, an ASUS TUF B450 Plus, and an Intel SSDPEKKW256G8 NVMe drive without discard, but with a manual fstrim.

I think I will use the system with "iommu=soft" and discard for now. Please let me know how to debug this issue so I can provide more details. Thank you.

PS: please ignore the "[drm]" errors; I will configure amdgpu later.
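A quick way to check whether a trim produced these faults is to count the AMD-Vi events in the kernel log. A minimal sketch (the sample file path is hypothetical; the sample lines are the ones quoted above, and on a live system you would pipe dmesg instead of reading a saved file):

```shell
# Count AMD-Vi page-fault events in a captured dmesg.
cat > /tmp/dmesg.sample <<'EOF'
[  145.127297] nvme 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000fd769100 flags=0x0000]
[  145.127301] AMD-Vi: Event logged [IO_PAGE_FAULT device=06:00.0 domain=0x0000 address=0x00000000fd769180 flags=0x0000]
EOF
grep -c 'IO_PAGE_FAULT' /tmp/dmesg.sample   # prints 2
```

On an affected system, running "fstrim -v /" and then this count immediately before and after makes it easy to see whether a given kernel or boot option actually avoids the faults.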
Comment 1 aladjev.andrew@gmail.com 2019-02-24 10:22:18 UTC
Created attachment 281317 [details]
dmesg with page fault
Comment 2 aladjev.andrew@gmail.com 2019-02-24 10:22:41 UTC
Created attachment 281319 [details]
dmesg with iommu=soft
Comment 3 aladjev.andrew@gmail.com 2019-02-24 10:23:02 UTC
Created attachment 281321 [details]
dmesg without discard
Comment 4 aladjev.andrew@gmail.com 2019-02-26 14:29:08 UTC
I've read the recommendation from vendors to disable discard and run fstrim daily instead. I've reproduced the same issue with the hardware IOMMU and fstrim.
Comment 5 aladjev.andrew@gmail.com 2019-02-26 14:34:43 UTC
Created attachment 281357 [details]
dmesg with hardware iommu and fstrim
Comment 6 aladjev.andrew@gmail.com 2019-03-01 12:54:34 UTC
This is the only issue I have with the Ryzen build for now. Everything just works, including the radeonsi video driver.

I think this issue may be related to AMD StoreMI (https://www.amd.com/system/files/2018-04/AMD-StoreMI-FAQ.pdf), which is not yet implemented in Linux but exists in the B450 chipset.
Comment 7 nutodafozo 2019-03-08 11:15:28 UTC
I can confirm this issue on an ASUS X470-PRO motherboard (latest BIOS from today, 4406) and an HP EX920 SSD (same SM2262 controller) - running fstrim spams dmesg with IO_PAGE_FAULT.
OS is Ubuntu 18.04.2 (Linux machine 4.18.0-16-generic #17~18.04.1-Ubuntu SMP Tue Feb 12 13:35:51 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux)

[ 3415.904659] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000fbf96000 flags=0x0000]
...
[ 3415.908733] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000fd1bb000 flags=0x0000]
[ 3415.908768] amd_iommu_report_page_fault: 28 callbacks suppressed
[ 3415.908769] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x00000000fd1b0000 flags=0x0000]
...
[ 3415.924844] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x00000000fd1a8000 flags=0x0000]
Comment 8 Andreas 2019-03-08 12:29:17 UTC
*** Bug 198733 has been marked as a duplicate of this bug. ***
Comment 9 Andreas 2019-03-08 12:32:32 UTC
BTW, still present in 5.0.0.

For details see Bug 198733 (Linux ZenMachine 5.0.0-gentoo-RYZEN #1 SMP Wed Mar 6 10:39:58 CET 2019 x86_64 AMD Ryzen 7 1800X Eight-Core Processor AuthenticAMD GNU/Linux).
Comment 10 Eduard Hasenleithner 2019-03-09 16:53:06 UTC
Here is another setup with the same problems (on Linux 5.0.0)
* AMD Ryzen TR 1950X
* MSI X399 SLI PLUS
* Corsair MP510 960GB

Status with this setup is identical to the OP
* iommu=soft + fstrim => OK
* iommu active + discard => AMD-Vi IO_PAGE_FAULT
* iommu active + fstrim => AMD-Vi IO_PAGE_FAULT

So at the moment it looks like the problem is restricted to AMD Ryzen. Also worth noting is that the NVMe shows errors in its log (example of one message from "nvme smart-log"):
error_count  : 65526
sqid         : 6
cmdid        : 0x4c
status_field : 0x400c(INTERNAL)
parm_err_loc : 0xffff
lba          : 0x80000000
nsid         : 0x1
vs           : 0
Comment 11 Mikhail Kurinnoi 2019-03-17 08:27:33 UTC
I can confirm this issue on an AMD Ryzen 3 2200G, ASRock B450M Pro4, and ADATA XPG SX8200 Pro 256GB (NVMe). I didn't face this issue with only a single SATA SSD (Kingston HyperX Savage 240GB) installed, on the same kernel.
Comment 12 Andreas 2019-03-17 09:00:28 UTC
Ah! I've had this error with only one NVMe SSD from the start. I just recently added a second NVMe SSD, using a PCI Express expansion card. It didn't change anything for me.
Comment 13 Stanislaw Gruszka 2019-03-29 10:39:40 UTC
There is an upstream fix for an AMD IOMMU DMA mapping issue:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4e50ce03976fbc8ae995a000c4b10c737467beaa

It's already in 5.0.x and 4.19.x (4.20 is end of life). There is a chance it fixes this issue.
Comment 14 Tim Murphy 2019-03-29 19:19:42 UTC
I can confirm the above fix does not resolve the IOMMU issue. I'm running the latest Manjaro kernel (5.0.3-1) and the problem still occurs. The iommu=soft workaround does avoid the issue for me, so I'm confident my issue is the same as this one.
Comment 15 Jonathan McDowell 2019-03-29 21:05:46 UTC
(In reply to Tim Murphy from comment #14)
> I can confirm the above fix does not resolve the IOMMU issue. I'm running
> latest manjaro kernel (5.0.3-1) and the problem still occurs. The iommu=soft
> workaround does avoid the issue for me, thus I'm confident my issue is the
> same as this one.

The fix in the 5.0.x stream only landed in 5.0.5, so it's worth trying a later kernel. I no longer see the IO_PAGE_FAULT message with 5.0.5, but I still see:

[   40.479512] print_req_error: I/O error, dev nvme0n1, sector 1390960 flags 803                                       

messages when I do "fstrim -v /", which I don't see when I have "iommu=pt" passed on the kernel command line.
Comment 16 Tim Murphy 2019-03-29 21:42:23 UTC
Thanks. Sadly, my setup still shows page fault errors with 5.0.5-arch1, which I pulled & built earlier today. FWIW, the error occurs only on my 2nd 'data' NVMe, which is attached via an add-in card in the 2nd (x4) PCIe slot. My boot NVMe has never displayed the error, on any kernel. My setup is an ASRock B450M Pro4, Ryzen 1700.
Comment 17 Jonathan McDowell 2019-03-30 09:58:35 UTC
My hardware is also an ASRock B450M Pro4, but with a Ryzen 2700. I'm running 2 NVMe SSDs, one in the board slot and one in a x4 PCIe slot. I see the print_req_error and IO_PAGE_FAULT errors for both pre-5.0.5, but only the print_req_error messages on 5.0.5. Reliably triggered by "fstrim -v /", and neither appears with "iommu=pt".
Comment 18 Stanislaw Gruszka 2019-03-31 07:40:38 UTC
If you have not done so already, please configure the kernel with

CONFIG_DYNAMIC_FTRACE
CONFIG_FUNCTION_GRAPH_TRACER 

run this script as root with iommu enabled:


#!/bin/bash
mount -t debugfs debugfs /sys/kernel/debug
cd /sys/kernel/debug/tracing/
function_graph > current_tracer 
echo 1 > tracing_on 
fstrim /
echo 0 > tracing_on
cat trace  > ~/trace.txt


and provide trace.txt file here (compressed if too big for bugzilla).
Comment 19 Stanislaw Gruszka 2019-03-31 07:59:24 UTC
(In reply to Stanislaw Gruszka from comment #18)
> function_graph > current_tracer 
echo function_graph > current_tracer
Comment 20 Jonathan McDowell 2019-03-31 14:36:21 UTC
Created attachment 282075 [details]
fstrim / function_graph trace for fstrim

Trace from 5.0.5 with no iommu= option passed to the kernel. No IO_PAGE_FAULT errors, but still the I/O errors, which aren't seen with "iommu=pt".
Comment 21 Tim Murphy 2019-03-31 18:25:31 UTC
Created attachment 282077 [details]
ftrace for fstrim /

I'm adding the requested result from my 5.0.5-arch1-1-custom kernel.
Thanks, Tim
Comment 22 Tim Murphy 2019-03-31 18:38:47 UTC
Created attachment 282079 [details]
ftrace for (failing) fstrim /mnt/work

This is the requested output for the failing operation on my 5.0.5-arch1-1-custom rig (fstrim of the NVMe device attached via a PCIe add-in card) - it obsoletes my prior attachment, which was taken on my non-failing root NVMe device (fstrim /).

Here's what shows in dmesg when the fstrim fails.

[   69.396195] nvme 0000:23:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x0 flags=0x0000]
[   69.396458] nvme 0000:23:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x0 flags=0x0000]
[   69.396645] nvme 0000:23:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x200 flags=0x0000]
[   69.396898] nvme 0000:23:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x0 flags=0x0000]
[   69.397079] nvme 0000:23:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x400 flags=0x0000]
[   69.397258] nvme 0000:23:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x0 flags=0x0000]
[   69.397439] nvme 0000:23:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x400 flags=0x0000]
[   69.397618] nvme 0000:23:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x0 flags=0x0000]
[   69.404931] nvme 0000:23:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x200 flags=0x0000]
[   69.405166] print_req_error: I/O error, dev nvme0n1, sector 76088 flags 803
[   69.405226] nvme 0000:23:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x0 flags=0x0000]
[   69.405373] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0000 address=0x200 flags=0x0000]
[tim1@pearl ~]$

Thanks
Comment 23 Stanislaw Gruszka 2019-04-02 09:03:40 UTC
Sorry guys, but it is harder than I thought to see where the problem is. Perhaps tracing events will give a better picture. Please install:

https://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git

and do 

$ ./trace-cmd/tracecmd/trace-cmd record -e block -e nvme -e iommu fstrim /
$ ./trace-cmd/tracecmd/trace-cmd report > trace_report.txt

and attach trace_report.txt(.xz)

All block, nvme, and iommu events should be available; if not, a kernel recompilation with the proper options will be needed.
Comment 24 nutodafozo 2019-04-02 10:54:49 UTC
Created attachment 282093 [details]
5.0.5 trace-cmd
Comment 25 nutodafozo 2019-04-02 10:56:24 UTC
Comment on attachment 282093 [details]
5.0.5 trace-cmd

Stanislaw, here's my trace attached, it is 5.0.5-050005-generic #201903271212 SMP Wed Mar 27 16:14:07 UTC 2019 x86_64.

linux@pc:/tmp$ sudo ./trace-cmd/tracecmd/trace-cmd record -e block -e nvme -e iommu fstrim -v /
/: 869,9 GiB (934002778112 bytes) trimmed
CPU0 data recorded at offset=0x55d000
    8192 bytes in size
CPU1 data recorded at offset=0x55f000
    0 bytes in size
CPU2 data recorded at offset=0x55f000
    4096 bytes in size
CPU3 data recorded at offset=0x560000
    20480 bytes in size
CPU4 data recorded at offset=0x565000
    139264 bytes in size
CPU5 data recorded at offset=0x587000
    0 bytes in size
CPU6 data recorded at offset=0x587000
    0 bytes in size
CPU7 data recorded at offset=0x587000
    8192 bytes in size
CPU8 data recorded at offset=0x589000
    45056 bytes in size
CPU9 data recorded at offset=0x594000
    0 bytes in size
CPU10 data recorded at offset=0x594000
    4096 bytes in size
CPU11 data recorded at offset=0x595000
    16384 bytes in size
CPU12 data recorded at offset=0x599000
    0 bytes in size
CPU13 data recorded at offset=0x599000
    3526656 bytes in size
CPU14 data recorded at offset=0x8f6000
    12582912 bytes in size
CPU15 data recorded at offset=0x14f6000
    315392 bytes in size
linux@pc:/tmp$ sudo ./trace-cmd/tracecmd/trace-cmd report > trace_report.txt
trace-cmd: No such file or directory
  [nvme:nvme_sq] function nvme_trace_disk_name not defined
  [nvme:nvme_setup_cmd] function nvme_trace_disk_name not defined
  [nvme:nvme_complete_rq] function nvme_trace_disk_name not defined

dmesg errors:
[  230.001807] nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0xf9d9b000 flags=0x0000]
[  230.003210] nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0xff783000 flags=0x0000]
[  230.004647] nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0xf9e74000 flags=0x0000]
[  230.006076] nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0xff77c000 flags=0x0000]
[  230.007463] nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0xe4eda000 flags=0x0000]
[  230.008914] nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0xfa2fe000 flags=0x0000]
[  230.010305] nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0xfb8b1000 flags=0x0000]
[  230.011688] nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0xfa432000 flags=0x0000]
[  230.013064] nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0xfa430000 flags=0x0000]
[  230.014437] nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0xfa435000 flags=0x0000]
[  230.015819] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0xfa4f7000 flags=0x0000]
[  230.017197] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0xfa42d000 flags=0x0000]
[  230.018580] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0xfa749000 flags=0x0000]
[  230.019956] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0xfa74a000 flags=0x0000]
[  230.021333] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0xfa74c000 flags=0x0000]
[  230.022783] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0xf9dcc000 flags=0x0000]
[  230.023828] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0xfa09e000 flags=0x0000]
[  230.025200] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0xfd37d000 flags=0x0000]
[  230.026585] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0xfd3f3000 flags=0x0000]
[  230.028008] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0xfd37d000 flags=0x0000]
Comment 26 Tim Murphy 2019-04-03 14:07:31 UTC
Created attachment 282111 [details]
trace-cmd results

Stanislaw, here is the output you requested from my system (apologies for the delay). Note, the failing device in my case is mounted on /mnt/work. Thanks.

[tim1@pearl 202665]$ sudo trace-cmd record -e block -e nvme -e iommu fstrim /mnt/work
[sudo] password for tim1: 
fstrim: /mnt/work: FITRIM ioctl failed: Input/output error
CPU0 data recorded at offset=0x57e000
0 bytes in size
CPU1 data recorded at offset=0x57e000
0 bytes in size
CPU2 data recorded at offset=0x57e000
0 bytes in size
CPU3 data recorded at offset=0x57e000
0 bytes in size
CPU4 data recorded at offset=0x57e000
4096 bytes in size
CPU5 data recorded at offset=0x57f000
4096 bytes in size
CPU6 data recorded at offset=0x580000
0 bytes in size
CPU7 data recorded at offset=0x580000
0 bytes in size
CPU8 data recorded at offset=0x580000
0 bytes in size
CPU9 data recorded at offset=0x580000
0 bytes in size
CPU10 data recorded at offset=0x580000
0 bytes in size
CPU11 data recorded at offset=0x580000
0 bytes in size
CPU12 data recorded at offset=0x580000
0 bytes in size
CPU13 data recorded at offset=0x580000
0 bytes in size
CPU14 data recorded at offset=0x580000
0 bytes in size
CPU15 data recorded at offset=0x580000
0 bytes in size
[tim1@pearl 202665]$ sudo trace-cmd report > trace_report.txt
trace-cmd: No such file or directory
[nvme:nvme_sq] function nvme_trace_disk_name not defined
[nvme:nvme_setup_cmd] function nvme_trace_disk_name not defined
[nvme:nvme_complete_rq] function nvme_trace_disk_name not defined
[tim1@pearl 202665]$
Comment 27 Andreas 2019-04-05 06:01:24 UTC
Created attachment 282131 [details]
trace_report

I can also confirm that the bug is still present in 5.0.6.

Board: ASUS PRIME X370-PRO, BIOS 4406 02/28/2019
CPU: AMD Ryzen 7 1800X
NVMe #1: Intel M.2 600p (onboard M.2)
NVMe #2: Crucial P1 (SilverStone SST-ECM20 PCIe 3.0 x4 to M.2)

/ is on NVMe #1 from the onboard M.2, but the problem also occurs on the NVMe from the PCIe expansion card.
Comment 28 Andreas 2019-04-05 13:10:15 UTC
And just to make it clear:

* iommu=soft   + discard/fstrim     => OK
* iommu=pt     + discard/fstrim     => OK
* iommu active + discard or fstrim  => nvme/AMD-Vi IO_PAGE_FAULT

I currently use iommu=pt. Would it help to see a log with iommu=soft and iommu=pt to see the difference?

Using btrfs in a RAID (with one partition on my NVMe #1 and one on my NVMe #2) I had various other serious errors as well, like:
print_req_error: I/O error, dev nvme1n1, sector 1214060160

BUT this occurred after the complete failure of NVMe #2 and its loss from the list of devices (/dev/nvme1n1 was gone), like this:
[  605.403827] INFO: task systemd:2816 blocked for more than 120 seconds. 
[  605.403830]       Tainted: G           O    T 4.20.1-gentoo-RYZEN #1 
[  605.403831] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[  605.403833] systemd         D    0  2816      1 0x00000000 
[  605.403836] Call Trace: 
[  605.403844]  __schedule+0x21c/0x720 
[  605.403847]  schedule+0x27/0x80 
[  605.403850]  io_schedule+0x11/0x40 
[  605.403853]  wait_on_page_bit+0x11d/0x200 
[  605.403855]  ? __page_cache_alloc+0x20/0x20 
[  605.403859]  read_extent_buffer_pages+0x257/0x300 
[  605.403863]  btree_read_extent_buffer_pages+0xc2/0x230 
[  605.403865]  ? alloc_extent_buffer+0x35e/0x390 
[  605.403868]  read_tree_block+0x5c/0x80 
[  605.403871]  read_block_for_search.isra.13+0x1a9/0x380 
[  605.403874]  btrfs_search_slot+0x226/0x970 
[  605.403876]  btrfs_lookup_inode+0x63/0xfc 
[  605.403879]  btrfs_iget_path+0x67e/0x770 
[  605.403882]  btrfs_lookup_dentry+0x478/0x570 
[  605.403885]  btrfs_lookup+0x18/0x40 
[  605.403888]  path_openat+0xbbd/0x13e0 
[  605.403891]  do_filp_open+0xa7/0x110 
[  605.403894]  do_sys_open+0x18e/0x230 
[  605.403896]  __x64_sys_openat+0x1f/0x30 
[  605.403899]  do_syscall_64+0x55/0x100 
[  605.403901]  entry_SYSCALL_64_after_hwframe+0x44/0xa9 
[  605.403904] RIP: 0033:0x7f57bc1a731a 
[  605.403909] Code: Bad RIP value. 
[  605.403911] RSP: 002b:00007ffe14628540 EFLAGS: 00000246 ORIG_RAX: 0000000000000101 
[  605.403913] RAX: ffffffffffffffda RBX: 00007ffe14628638 RCX: 00007f57bc1a731a 
[  605.403914] RDX: 00000000000a0100 RSI: 0000562ae1fd7dd0 RDI: 00000000ffffff9c 
[  605.403915] RBP: 0000000000000008 R08: 91824bee752ca339 R09: 00007f57bbf11540 
[  605.403917] R10: 0000000000000000 R11: 0000000000000246 R12: 0000562ae1fd7de6 
[  605.403918] R13: 0000562ae1fd7b10 R14: 00007ffe146285c0 R15: 0000562ae1fa6168 
[  655.735860] nvme nvme1: Device not ready; aborting reset 

Without btrfs in RAID mode the device somehow isn't lost, although I don't have any other partition in real use on NVMe #2 at the moment, other than a swap partition. But I cannot say that my system swaps much, as it has 32 GB of RAM, which rarely gets used up completely.

I don't see the connection to this error though, but if there is one, it could help to diagnose it. So if it helps, I can set up a test partition on that other NVMe device. Just tell me what you need me to set up...
Comment 29 Christoph Nelles 2019-05-03 18:38:06 UTC
Add me to the list of affected people. 
- Asrock X399 Taichi ATX-Size
- Threadripper 1950X
- 2x Corsair force M510P 480GB (Phison Electronics Corporation E12 NVMe Controller)
- Kernel 5.0.10 customized.

Adding trace report + kernel log.
Comment 30 Christoph Nelles 2019-05-03 18:39:20 UTC
Created attachment 282603 [details]
trace report + dmesg
Comment 31 Christoph Nelles 2019-05-04 09:52:51 UTC
Not directly related to this issue, but the workaround iommu=pt has further side effects. I had to disable Secure Memory Encryption, as the Megaraid SAS and radeon drivers were unable to initialize properly with SME:

mpt3sas 0000:09:00.0: SME is active, device will require DMA bounce buffers
mpt2sas_cm0: reply_post_free pool: dma_pool_alloc failed
mpt2sas_cm0: failure at drivers/scsi/mpt3sas/mpt3sas_scsih.c:10506/_scsih_probe()!

radeon 0000:07:00.0: SME is active, device will require DMA bounce buffers
radeon 0000:07:00.0: SME is active, device will require DMA bounce buffers
software IO TLB: SME is active and system is using DMA bounce buffers
[drm:r600_ring_test [radeon]] *ERROR* radeon: ring 0 test failed (scratch(0x8504)=0xCAFEDEAD)
radeon 0000:07:00.0: disabling GPU acceleration

I tried iommu=on mem_encrypt=off, but discarding the NVMe failed as before.
Comment 32 Andreas 2019-05-05 08:17:35 UTC
Out of interest, the issue still exists in kernel 5.0.12.

Slightly OT: I use a Radeon RX Vega (VEGA10) graphics card and SME has never worked. I tried iommu=soft, iommu=pt, and no iommu kernel cmdline option (i.e. on). Whenever I use mem_encrypt=on, the last line I see on the screen is this:
fb0: switching to amdgpudrmfb from simple
The system doesn't panic or stall or anything; it's just that I don't see any more screen updates at all. I'd have to do everything blind, or over ssh.

BTW, I looked more closely at /Documentation/admin-guide/kernel-parameters.txt in the Linux kernel source tree, and there is also the following:

amd_iommu= [HW,X86-64]
           Pass parameters to the AMD IOMMU driver in the system.
           Possible values are:
           fullflush - enable flushing of IO/TLB entries when
                       they are unmapped. Otherwise they are
                       flushed before they will be reused, which
                       is a lot of faster
           off       - do not initialize any AMD IOMMU found in
                       the system
           force_isolation - Force device isolation for all
                             devices. The IOMMU driver is not
                             allowed anymore to lift isolation
                             requirements as needed. This option
                             does not override iommu=pt

amd_iommu_dump= [HW,X86-64]
                Enable AMD IOMMU driver option to dump the ACPI table
                for AMD IOMMU. With this option enabled, AMD IOMMU
                driver will print ACPI tables for AMD IOMMU during
                IOMMU initialization.

amd_iommu_intr=[legacy|vapic]

iommu=[off|force|noforce|biomerge|panic|nopanic|merge|nomerge|soft|pt|nopt]


Maybe I need some amd_iommu* tweaks as well? What options are there for amd_iommu_dump? enable/disable maybe?

And is the amdgpu bug with SME enabled somehow related?
Comment 33 Christoph Nelles 2019-05-05 12:21:56 UTC
amd_iommu_dump=1
During boot, the ACPI IVRS table is then printed to the console/log:

------- example, not related to this issue
[    0.851042] AMD-Vi: Using IVHD type 0x11
[    0.851401] AMD-Vi: device: 00:00.2 cap: 0040 seg: 0 flags: b0 info 0000
[    0.851401] AMD-Vi:        mmio-addr: 00000000feb80000
[    0.851430] AMD-Vi:   DEV_SELECT_RANGE_START  devid: 00:01.0 flags: 00
[    0.851431] AMD-Vi:   DEV_RANGE_END           devid: ff:1f.6
[    0.851870] AMD-Vi:   DEV_ALIAS_RANGE                 devid: ff:00.0 flags: 00 devid_to: 00:14.4
[    0.851871] AMD-Vi:   DEV_RANGE_END           devid: ff:1f.7
[    0.851875] AMD-Vi:   DEV_SPECIAL(HPET[0])           devid: 00:14.0
[    0.851876] AMD-Vi:   DEV_SPECIAL(IOAPIC[33])                devid: 00:14.0
[    0.851877] AMD-Vi:   DEV_SPECIAL(IOAPIC[34])                devid: 00:00.1
[    1.171028] AMD-Vi: IOMMU performance counters supported
------- example, not related to this issue

Maybe worth having a look into these.

I need only a text console, but using the radeon driver saves me a few watts compared to just using the VGA framebuffer. I am not sure how SME works with DMA, but device reads/writes would also need to be encrypted.
Comment 34 hamelg 2019-06-08 15:27:57 UTC
The same here. I have to apply the iommu=soft workaround to be able to trim my SSD.

 - MSI B450 GAMING PLUS (MS-7B86), BIOS 1.4
 - CPU: AMD Ryzen 2700
 - SSD nvme Force MP510 (FW ECFM12.2)
 - Kernel 5.1.7 (Archlinux)
Comment 35 nutodafozo 2019-07-02 11:38:57 UTC
AMD broke VFIO since AGESA 0072 (https://www.reddit.com/r/Amd/comments/bh3qqz/agesa_0072_pci_quirk/).
There's a patch for the 5.1 kernel that makes it work; could somebody test whether it helps with our problem here?

patch: https://clbin.com/VCiYJ
Comment 36 Andreas 2019-07-02 19:53:19 UTC
(In reply to nutodafozo from comment #35)
> AMD broke VFIO since agesa 0072
> (https://www.reddit.com/r/Amd/comments/bh3qqz/agesa_0072_pci_quirk/).
> There's a patch for 5.1 kernel that makes it work, could somebody test if it
> helps with our problem here?
> 
> patch: https://clbin.com/VCiYJ

Tried it on kernel 5.1.15. It doesn't fix the discard (trim) problem for me.
Comment 37 Christoph Nelles 2019-07-02 20:00:17 UTC
Same here with 5.1.15. Without iommu=pt the errors come back.
Comment 38 hamelg 2019-07-02 20:54:50 UTC
Same here, the patch doesn't make any difference :(
My BIOS has AGESA code 1.0.0.6; the patch is related to versions 1.0.0.7.x+.
Comment 39 nutodafozo 2019-07-19 07:04:41 UTC
Did somebody try with the new AGESA versions (1.0.0.2/1.0.0.3ab)?
Comment 40 Seba Pe 2019-07-21 21:48:06 UTC
(In reply to nutodafozo from comment #39)
> did somebody try with new agesa's (1.0.0.2/1.0.0.3ab)?

Can reproduce with a Ryzen 3600x, x570 motherboard, AGESA 1.0.0.3ab. Kernel 5.2.1-arch1.
Comment 41 swk 2019-08-22 01:04:47 UTC
hardware:
threadripper 2990wx, x399, AGESA 1.1.0.2
Force MP510, Fw: ECFM12.3

kernel: 5.3.0-rc5

boot param:
iommu=pt, avic=1

this issue is gone, but with iommu=pt removed the IO_PAGE_FAULT errors come back.
Comment 42 stuart hayes 2019-09-04 20:48:14 UTC
Is it possible that relaxed ordering is enabled on the NVMe device? (lspci -vvv, look for RlxdOrd+ or RlxdOrd-.) A few months ago I was getting I/O page faults with an NVMe drive on an AMD system that had relaxed ordering enabled, because (as I recall) the drive's write to the completion queue got reordered ahead of its last data write, and the amd_iommu driver had already unmapped the data buffer before the last write went through the hardware, which caused the IOMMU fault.

If it is enabled, you could try disabling it.
Comment 43 Christoph Nelles 2019-09-04 21:12:59 UTC
> If it is enabled, you could try disabling it.

Can you give me some instructions or directions on how to do this? I haven't found much on the internet.
Comment 44 stuart hayes 2019-09-05 01:15:13 UTC
I think you'll need the pciutils package for lspci / setpci.

First find the bus/device/function number of your NVMe drive... probably "lspci |grep -i nvme" will show you.  It'll be some numbers like 0000:05:00.0 if it's on PCI bus 5, device 0, function 0.  Once you have that you can do...

lspci -vvvv -s 0000:05:00.0 |grep Rlxd   (use your numbers, not 0000:05:00.0)

That will show you if it is even enabled... if you see RlxdOrd+, then read the device control register in the PCI express capability structure.  I don't have a system in front of me to test this on, so it may not work, but I think you should be able to do that with:

setpci -s 0000:05:00.0 CAP_EXP+0x08.w

That will show you the 16-bit value of the pci express device control register.  You want to write that value, except set bit 4 to 0.  That means subtract 0x10 from the value, assuming bit 4 is set... so if you read 0x1234, clearing bit 4 would result in 0x1224.  Write that value with:

setpci -s 0000:05:00.0 CAP_EXP+0x08.w=0x1224  (use your number, not 0x1224!)

Then do the original "lspci -vvv -s 0000:05:00.0 | grep Rlxd" to make sure it now says "RlxdOrd-" instead of "RlxdOrd+".

Your change will be wiped out if you unplug the drive or reboot the system, though... it isn't permanent.
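The bit-4 arithmetic above can also be done in the shell instead of by hand. A minimal sketch, using the 0x201f register value that appears later in this thread (substitute your own device's value; the setpci invocations in the comments stay as shown above):

```shell
# Take the 16-bit Device Control register value as read by
# "setpci -s <BDF> CAP_EXP+0x08.w", clear bit 4 (Enable Relaxed
# Ordering), and print the value to write back.
val=201f
new=$(printf '%x' $(( 0x$val & ~0x10 )))
echo "$new"   # prints 200f
# Write it back with: setpci -s <BDF> CAP_EXP+0x08.w=0x$new
```

If bit 4 is already clear, the mask is a no-op and the printed value equals the one read, so the script is safe to run unconditionally.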
Comment 45 nutodafozo 2019-09-05 09:51:43 UTC
$ sudo lspci -vvvv -s 01:00.0
01:00.0 Non-Volatile memory controller: Silicon Motion, Inc. Device 2262 (rev 03) (prog-if 02 [NVM Express])
        Subsystem: Silicon Motion, Inc. Device 2262
....
                DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
....
user@pc:~$ sudo setpci -s 01:00.0 CAP_EXP+0x08.w
201f
>>> bin(0x201f)   
'0b10000000011111'
$ sudo setpci -s 01:00.0 CAP_EXP+0x08.w=0x200f
$ sudo lspci -vvvv -s 01:00.0|grep Rlx      
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
$ sudo systemctl start fstrim.service

Hm, is that it? No errors in dmesg now...
Comment 46 hamelg 2019-09-05 19:07:44 UTC
The workaround doesn't work here; I still get the kernel error "nvme...IO_PAGE_FAULT" and fstrim fails with "Input/output error".

DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
        RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
Comment 47 Eduard Hasenleithner 2019-09-05 19:39:16 UTC
Workaround also fails for me. Here is the lspci output after disabling RlxdOrd:

41:00.0 Non-Volatile memory controller: Device 1987:5012 (rev 01) (prog-if 02 [NVM Express])
        Subsystem: Device 1987:5012
...
               DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
Comment 48 Andreas 2019-09-05 20:43:07 UTC
Same here. I booted without iommu=pt and risked using the mount option discard. The error messages (AMD-Vi IO_PAGE_FAULT) came both before and after setting RlxdOrd-.

# lspci | grep -i "Non-Volatile memory controller"
01:00.0 Non-Volatile memory controller: Intel Corporation SSD 600P Series (rev 03)
04:00.0 Non-Volatile memory controller: Micron/Crucial Technology Device 2263 (rev 03)
# lspci -vvvv -s 0000:01:00.0 |grep Rlxd
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
# lspci -vvvv -s 0000:04:00.0 |grep Rlxd
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
# setpci -s 01:00.0 CAP_EXP+0x08.w
201f
# setpci -s 04:00.0 CAP_EXP+0x08.w
201f
# setpci -s 01:00.0 CAP_EXP+0x08.w=0x200f
# lspci -vvvv -s 0000:01:00.0 |grep Rlxd
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
# setpci -s 04:00.0 CAP_EXP+0x08.w=0x200f
# lspci -vvvv -s 0000:04:00.0 |grep Rlxd
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
# uname -a
Linux ZenMachine 5.2.11-gentoo-RYZEN #1 SMP Fri Aug 30 23:51:50 CEST 2019 x86_64 AMD Ryzen 7 1800X Eight-Core Processor AuthenticAMD GNU/Linux
Comment 49 Eduard Hasenleithner 2019-09-07 13:41:50 UTC
Found a workaround for the problem with my MP510:

When I change the kmalloc_array call in nvme_setup_discard in drivers/nvme/host/core.c to unconditionally allocate a full page, "range = kmalloc_array(256, sizeof(*range), GFP_ATOMIC | __GFP_NOWARN)", then the IO_PAGE_FAULT messages are gone.

Not sure what is going on here, but I suspect a firmware bug on the MP510 (my firmware is ECFM12.1). The NVMe behaves as if it expects the data area for the NVMe 1.3 "6.7 Dataset Management command" to always have a full page size. When kmalloc_array is called with a smaller size it returns addresses which are not aligned to a page size. The NVMe then sees that the "offset portion of the PBAO field of PRP1 is non-zero" and assumes the address of the 2nd page to be present in PRP2. On my host PRP2 is always set to zero, and when the NVMe tries to read it, the result is IO_PAGE_FAULT messages with a zero address:
nvme 0000:aa:bb.c: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x0 flags=0x0000].

So another conclusion is that setting iommu=pt does not really fix my problem with the MP510; it just hides the discard problem.
Comment 50 Christoph Nelles 2019-09-07 14:41:57 UTC
Incredible analysis :-) Haven't looked into the code, but maybe rounding up to a full page would be the safer alternative instead of unconditionally requesting one.

My Force MP510 has firmware ECFM12.2, but it seems Corsair does not offer FW upgrades.
Comment 51 hamelg 2019-09-07 15:13:50 UTC
(In reply to Eduard Hasenleithner from comment #49)

> Not sure what is going on here but I suspect a firmware bug on the MP510 (my
> firmware is ECFM12.1). 


Some comments report the same issue with different SSD controllers, not only the Phison E12.
Comment 52 Eduard Hasenleithner 2019-09-07 15:42:26 UTC
(In reply to hamelg from comment #51)
> some comments reports the same issue with different SSD controllers, not
> only Phison E12.

True. But I'm being so specific here because I suspect that the reason others are failing is different from my case. E.g. the other logs contain IO_PAGE_FAULT messages with a nonzero address.
Comment 53 Andreas 2019-09-07 17:32:48 UTC
(In reply to Eduard Hasenleithner from comment #49)
> So another conclusion is that when setting iommu=pt it does not really fix
> my problem with MP510 but just hides the discard problem.

What's the implication? TRIM isn't actually being used?
Is there a way to see if and when and how many blocks have been "trimmed"?
Comment 54 Tim Murphy 2019-09-07 18:06:33 UTC
Created attachment 284879 [details]
attachment-10825-0.html

The -v option to fstrim lists how many blocks were trimmed.

Comment 55 Andreas 2019-09-07 18:33:37 UTC
(In reply to Tim Murphy from comment #54)
> The -v option to fstrim lists how many blocks were trimmed

Thanks.
> # fstrim -v /
> /: 80,9 GiB (86825152512 Bytes) trimmed

I used the command again after one minute:
> /: 1.2 GiB (1303777280 bytes) trimmed

I don't really understand this, but maybe it has to do with me using the discard mount option rather than a periodic fstrim.

Anyway, if this is correct, TRIM should work despite 1) the errors and 2) iommu=pt.
Comment 56 Christoph Nelles 2019-09-07 18:43:33 UTC
This is only a summary; it may be one large continuous block or many small fragments. The MP510 has a very high maximum discard size:
/sys/block/nvme0n1/queue/discard_granularity 512
/sys/block/nvme0n1/queue/discard_max_bytes 2199023255040
/sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040
/sys/block/nvme0n1/queue/discard_zeroes_data 0

So a blkdiscard should be able to do this in one discard request. If you left an unpartitioned area, you can create a partition there and test blkdiscard. Or if you have a swap partition on it, swapon will do a discard, AFAIK.

At least for SATA/SAS there were multiple commands/ways of trimming a device, but I have no idea how this is implemented with NVMe.
Comment 57 Andreas 2019-09-07 19:31:12 UTC
(In reply to Christoph Nelles from comment #56)
> /sys/block/nvme0n1/queue/discard_granularity 512
> /sys/block/nvme0n1/queue/discard_max_bytes 2199023255040
> /sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040
> /sys/block/nvme0n1/queue/discard_zeroes_data 0

All my NVMe SSDs show the same values. That is an Intel 600p (nvme0n1) on the on-board NVMe connector of the motherboard and a Crucial P1 (nvme1n1) on a PCIe NVMe expansion card (SilverStone SST-ECM20 PCIe 3.0 x4 to M.2).

Anyway, fstrim -v doesn't seem to work on swap devices (as they cannot be mounted), and blkdiscard only does discards but doesn't give summaries. (Or I'm just too stupid to find the required command line options...)
Comment 58 Vladimir Smirnov 2019-09-09 18:47:14 UTC
I have the same issue on one of my NVMe SSDs.

System:

MSI X570 Ace (AGESA both stock and 1.0.0.3abb), Ryzen 9 3900X, with a Corsair MP600 (NVMe, firmware EGFM11.0, Phison E16). I have a second NVMe drive (Samsung 950 Pro) which is not affected.

The messages I got were the same as in comment 49:
nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x0 flags=0x0000]

So I've tried the workaround from that comment (unconditionally allocate a full page) and it also helped to get rid of those error messages.
Comment 59 swk 2019-09-10 09:11:35 UTC
(In reply to Christoph Nelles from comment #50)
> Incredibale analysis :-) Haven't looked into the code, but maybe rounding up
> to a full page may be the safer alternative instead of using requesting one.
> 
> My Force MP510 have firmware ECFM12.2, but it seems Corsair does not offer
> FW upgrades.


Here is ECFM12.3 for the MP510:

http://ssd.borecraft.com/photos/Phison%20ECFM12.3%20Firmware%20update-20190720T063056Z-001.zip

By the way, I have tried it; there is no change with respect to this issue, but you get a little performance improvement.
Comment 60 Eduard Hasenleithner 2019-09-12 17:01:34 UTC
I've now investigated the situation also for an Intel 660p device with firmware 002C. (This should have a Silicon Motion SM2263EN controller.) With this I'm also getting IO_PAGE_FAULT logs. The controller behaves differently, but IMHO also non-conformant to the NVMe spec:

* The controller always reads a multiple of 512 bytes
* When the 512 bytes don't fit within the remaining part of the page, the controller continues reading with the subsequent page. The subsequent page is really adjacent; it doesn't use the value given in PRP2 (which happens to be 0).

So for this model it is sufficient to align the discard info to a 512 byte boundary.

Considering all the trouble with different controllers, it is probably best to allocate a multiple of a page (4096 bytes) for the discard command. (I guess a page is also the maximum needed for discard.) Is it realistic to get such a kernel patch accepted?
Comment 61 nutodafozo 2019-09-12 21:51:41 UTC
How come this problem arises only on Ryzen?
Comment 62 valahanovich 2019-09-14 01:22:24 UTC
(In reply to nutodafozo from comment #61)
> How come this problem arises only on Ryzen?

Have similar problem on 
Cpu: fx8350
Mb: M5A99FX PRO R2.0
Ssd: Crucial P1 1TB (CT1000P1SSD8)

So... not only Ryzen, probably AMD-Vi in general.
Comment 63 nutodafozo 2019-09-14 07:38:48 UTC
If it's AMD-Vi, then why does patching the generic kernel code help for the Phison E12?


As for my HP EX920 (Silicon Motion SM2262 controller), I ran trim again and can again confirm that setting RlxdOrd- helps it - no errors.

So it seems this topic has at least 3 different cases of these AMD-Vi IO_PAGE_FAULT errors:
1) SM2262 controller. Solution: set RlxdOrd-. #gotta wait for another confirmation, adata 8200 user would be nice.
2) phison e12/e16 ssds. Solution: kernel patch by Eduard Hasenleithner
3) intel 660p (SM2263EN). Solution: kernel patch by Eduard Hasenleithner #gotta wait for another confirmation,crucial p1 user would be nice.
...
4) intel 600p (SM2260)?

PS I should probably try Eduard's patch on my SM2262 to see if it helps, too, without setting RlxdOrd-.
Comment 64 Andreas 2019-09-14 09:52:17 UTC
Okay, yes, changing only "segments" to 256 fixed the error for me as well.
diff for drivers/nvme/host/core.c:

diff -Nau1r a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
--- a/drivers/nvme/host/core.c 2019-09-14 11:27:34.986373747 +0200
+++ b/drivers/nvme/host/core.c 2019-09-13 16:14:16.937812531 +0200
@@ -564,3 +564,3 @@
 
-       range = kmalloc_array(segments, sizeof(*range),
+       range = kmalloc_array(256, sizeof(*range),
                                GFP_ATOMIC | __GFP_NOWARN);

That's for both my Intel 600p and Crucial P1.
Comment 65 Andreas 2019-09-14 10:06:01 UTC
More details:

Board: ASUS PRIME X370-PRO, BIOS 5204 07/29/2019
CPU: AMD Ryzen 7 1800X
NVMe #1: Intel M.2 600p (onboard M.2)
NVMe #2: Crucial P1 (SilverStone SST-ECM20 PCIe 3.0 x4 to M.2)
Distro: Gentoo Linux
Kernel: 5.2.14

TRIM usage:
AMD-Vi IO_PAGE_FAULT occurred constantly with the filesystems' "discard" mount option.

I applied the segments-->256 patch and removed iommu=pt from the kernel cmdline.

Result:
* iommu active + discard or fstrim  => No more AMD-Vi IO_PAGE_FAULTs

About the patch: I don't understand what I'm doing here. So: what does this fix do, and is this the final way to fix it? Are there any negative side effects (I assume using a variable size in kmalloc_array does make sense)? Why does it work with some other NVMe SSDs without this fix, and why does RlxdOrd- make a difference on some other systems?
Comment 66 Norik 2019-09-16 19:07:52 UTC
One more datapoint and possible resolution with this specific hardware.

Base Board Information
	Manufacturer: Micro-Star International Co., Ltd
	Product Name: B450-A PRO (MS-7B86)
	Version: 2.0

BIOS Information
	Vendor: American Megatrends Inc.
	Version: A.A0
	Release Date: 08/28/2019

01:00.0 Non-Volatile memory controller: Silicon Motion, Inc. Device 2262 (rev 03) (prog-if 02 [NVM Express])

--snip--
Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
			MaxPayload 128 bytes, MaxReadReq 512 bytes

--snip--

BOOT_IMAGE=/boot/vmlinuz-5.0.0-27-generic root=/dev/mapper/kubuntu--vg-root ro video=vesafb:off quiet splash vt.handoff=1



The "AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0... " messages were logged when trimming with "fstrim -v /". Setting RlxdOrd- using "sudo setpci -s 01:00.0 CAP_EXP+0x08.w=0x200f" did not have an effect; page fault messages were still logged. The system would go into a catatonic state for a few minutes while starting VMs under QEMU/KVM.

Setting the BIOS option IOMMU=Enabled instead of the default IOMMU=Auto resolved the issue.
Comment 67 Vladimir Smirnov 2019-09-17 12:28:54 UTC
If it matters, in my case (Corsair MP600, MSI X570 Ace motherboard) I had IOMMU=Enabled since first attempts to workaround that and only patch from #49 helped.
Comment 68 nutodafozo 2019-09-22 19:05:59 UTC
Earlier I said that setting RlxdOrd- helped me with the HP EX920 (SM2262); it turns out it doesn't. Just now I got another bunch of IO_PAGE_FAULTs after fstrim.service kicked in.

[190614.943785] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0xfcdd1000 flags=0x0000]
[190614.947494] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0xc0986000 flags=0x0000]
Comment 69 Norik 2019-09-22 19:24:56 UTC
(In reply to Norik from comment #66)
> [...]
> Setting the BIOS setting IOMMU=Enable from the default IOMMU=Auto resolved
> the issue.


Update: this setting seems to reduce the frequency of the page fault logs, but does not eliminate them. It seems related to the size of the trim activity.
Comment 70 Andreas 2019-09-23 16:02:35 UTC
For the record, I also tried changing the IOMMU value in the BIOS (UEFI) setup. No change.

ASUS PRIME X370-PRO, BIOS 5204 07/29/2019
UEFI BIOS Setting: Advanced\AMD CBS\IOMMU=[Disabled|Enabled|Auto]

Auto means enabled here, and setting it to Enabled explicitly doesn't change anything: I get the same errors with both settings without iommu=pt or without the patch (tested with kernel 5.2.14).

I applied the patch from Eduard Hasenleithner to drivers/nvme/host/core.c (kmalloc_array: segments --> 256), again on kernel 5.3.0. I use it now on a daily basis, with the "discard" mount option in fstab for all filesystems which support it, and all seems stable. BIG THANKS.
Comment 71 Christoph Nelles 2019-09-29 16:07:52 UTC
My version of the fix for Linux 5.3.1. Probably many errors in these few lines, but currently this works for me.

--- a/drivers/nvme/host/core.c  2019-09-21 07:19:47.000000000 +0200
+++ b/drivers/nvme/host/core.c  2019-09-29 18:01:13.533381568 +0200
@@ -563,5 +563,11 @@
        struct bio *bio;
+       size_t space_required = sizeof(*range) * segments;
+       size_t space_padded = round_up(space_required, PAGE_SIZE);

-       range = kmalloc_array(segments, sizeof(*range),
-                               GFP_ATOMIC | __GFP_NOWARN);
+       if (space_required > PAGE_SIZE) {
+               pr_warning("Discard request larger than one page. Segments: %lu, struct size: %lu, total size: %lu, padded: %lu\n ",
+               (unsigned long) segments, (unsigned long) sizeof(*range), (unsigned long) space_required, (unsigned long) space_padded);
+       }
+
+       range = kmalloc(space_padded, GFP_ATOMIC | __GFP_NOWARN);
        if (!range) 

Not sure if more than 256 segments are possible, and not sure if pr_warning is allowed in this context.
Comment 72 Serg Shipaev 2019-10-01 13:50:05 UTC
(In reply to Christoph Nelles from comment #71)
> My version of the fix for Linux 5.3.1. Probably many errors in these few
> lines, but currently this works for me.
> 

Hi,

Your patch is working like a charm on my system:

CentOS 7.7, ELRepo kernel 5.3.1, patched with your patch:
Linux home.dmfn.ru 5.3.1-1.el7.dmfn.x86_64 #4 SMP Tue Oct 1 01:19:52 MSK 2019 x86_64 x86_64 x86_64 GNU/Linux

hardware IOMMU (2 x Intel Xeon 2683v3):
dmesg | grep -i iommu
[    1.088429] DMAR: IOMMU enabled
...

NVMEs:
root@home ~]# nvme list
...
Transcend 1GB TME110S TS1TMTE110S
WDC WDS256G1X0C-00ENX0 256GB

No more issues with discard!

THANKS!
Comment 73 Vladimir Smirnov 2019-10-01 21:16:40 UTC
My version of the fix:
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -559,10 +559,15 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
                struct nvme_command *cmnd)
 {
        unsigned short segments = blk_rq_nr_discard_segments(req), n = 0;
+       unsigned short alloc_size = segments;
        struct nvme_dsm_range *range;
        struct bio *bio;
 
-       range = kmalloc_array(segments, sizeof(*range),
+       if (ns->ctrl->quirks & NVME_QUIRK_DISCARD_ALIGN_TO_PAGE_SIZE) {
+               alloc_size = round_up(segments, PAGE_SIZE);
+       }
+
+       range = kmalloc_array(alloc_size, sizeof(*range),
                                GFP_ATOMIC | __GFP_NOWARN);
        if (!range) {
                /*
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 2d678fb968c7..5abcd1bd6028 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -97,6 +97,10 @@ enum nvme_quirks {
         * Force simple suspend/resume path.
         */
        NVME_QUIRK_SIMPLE_SUSPEND               = (1 << 10),
+        /*
+         * Discard command should be aligned to a PAGE_SIZE
+         */
+        NVME_QUIRK_DISCARD_ALIGN_TO_PAGE_SIZE   = (1 << 11),
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 732d5b63ec05..af3faa468682 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3012,9 +3012,13 @@ static const struct pci_device_id nvme_id_table[] = {
                                NVME_QUIRK_DEALLOCATE_ZEROES, },
        { PCI_VDEVICE(INTEL, 0xf1a5),   /* Intel 600P/P3100 */
                .driver_data = NVME_QUIRK_NO_DEEPEST_PS |
-                               NVME_QUIRK_MEDIUM_PRIO_SQ },
+                               NVME_QUIRK_MEDIUM_PRIO_SQ |
+                               NVME_QUIRK_DISCARD_ALIGN_TO_PAGE_SIZE, },
        { PCI_VDEVICE(INTEL, 0xf1a6),   /* Intel 760p/Pro 7600p */
-               .driver_data = NVME_QUIRK_IGNORE_DEV_SUBNQN, },
+               .driver_data = NVME_QUIRK_IGNORE_DEV_SUBNQN |
+                               NVME_QUIRK_DISCARD_ALIGN_TO_PAGE_SIZE, },
+       { PCI_VDEVICE(INTEL, 0xf1a8),   /* Intel 660P */
+               .driver_data = NVME_QUIRK_DISCARD_ALIGN_TO_PAGE_SIZE },
        { PCI_VDEVICE(INTEL, 0x5845),   /* Qemu emulated controller */
                .driver_data = NVME_QUIRK_IDENTIFY_CNS |
                                NVME_QUIRK_DISABLE_WRITE_ZEROES, },
@@ -3028,6 +3032,20 @@ static const struct pci_device_id nvme_id_table[] = {
                .driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY, },
        { PCI_DEVICE(0x144d, 0xa821),   /* Samsung PM1725 */
                .driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY, },
+       { PCI_DEVICE(0x1987, 0x5016),   /* Phison E16 */
+               .driver_data = NVME_QUIRK_DISCARD_ALIGN_TO_PAGE_SIZE, },
+       { PCI_DEVICE(0x1987, 0x5012),   /* Phison E12 */
+               .driver_data = NVME_QUIRK_DISCARD_ALIGN_TO_PAGE_SIZE, },
+       { PCI_DEVICE(0x126f, 0x2265),   /* Silicon Motion SM2265 */
+               .driver_data = NVME_QUIRK_DISCARD_ALIGN_TO_PAGE_SIZE, },
+       { PCI_DEVICE(0x126f, 0x2263),   /* Silicon Motion SM2263 */
+               .driver_data = NVME_QUIRK_DISCARD_ALIGN_TO_PAGE_SIZE, },
+       { PCI_DEVICE(0x126f, 0x2262),   /* Silicon Motion SM2262 */
+               .driver_data = NVME_QUIRK_DISCARD_ALIGN_TO_PAGE_SIZE, },
+       { PCI_DEVICE(0x126f, 0x2260),   /* Silicon Motion SM2260 */
+               .driver_data = NVME_QUIRK_DISCARD_ALIGN_TO_PAGE_SIZE, },
+       { PCI_DEVICE(0xc0a9, 0x2263),   /* Crucial P1 (SM2263) */
+               .driver_data = NVME_QUIRK_DISCARD_ALIGN_TO_PAGE_SIZE, },
        { PCI_DEVICE(0x144d, 0xa822),   /* Samsung PM1725a */
                .driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY, },
        { PCI_DEVICE(0x1d1d, 0x1f1f),   /* LighNVM qemu device */

It's done by introducing a new quirk that is currently applied only to Phison E16 and E12 devices and some device IDs I've found for SM226x SSDs (I bet that's not all of them, though).
Comment 74 Christoph Nelles 2019-10-01 21:24:47 UTC
Best solution :-) But are you sure you calculated the alloc size correctly? 

//given there's one segment
+       unsigned short alloc_size = segments;
//alloc_size = 1
+               alloc_size = round_up(segments, PAGE_SIZE);
//with the quirk, alloc_size is rounded up to 4096
+       range = kmalloc_array(alloc_size, sizeof(*range),
                                GFP_ATOMIC | __GFP_NOWARN);
//allocating 4096 * sizeof(struct nvme_dsm_range) (16 bytes) = 64 KiB
Comment 75 Vladimir Smirnov 2019-10-01 21:31:08 UTC
Right, it should be, I think:
round_up(segments, PAGE_SIZE / sizeof(*range));

to preserve the semantics.
Comment 76 Vladimir Smirnov 2019-10-01 21:37:49 UTC
Created attachment 285295 [details]
kernel-5.3-nvme-discard-align-to-page-size.patch

This one is a fixed version of the previous one (with the quirk for specific models).
Comment 77 Marti Raudsepp 2019-10-02 14:06:31 UTC
(In reply to Serg Shipaev from comment #72)
> Your patch is working like a charm on my system:

> hardware IOMMU (2 x Intel Xeon 2683v3):

Just to be clear, Serg, did you have this issue on an Intel system with Intel VT-d?

Because I'm seeing lots of reports here with AMD-Vi, but you're the only person who has even suggested this affects an Intel IOMMU. And you did not post any logs etc. from such a system.
Comment 78 Serg Shipaev 2019-10-02 14:18:28 UTC
(In reply to Marti Raudsepp from comment #77)
> (In reply to Serg Shipaev from comment #72)
> > Your patch is working like a charm on my system:
> 
> > hardware IOMMU (2 x Intel Xeon 2683v3):
> 
> Just to be clear, Serg, did you have this issue on an Intel system with
> Intel VT-d?

Hi, Marti

Indeed. The Intel platform also has this issue, and the patches above make a pretty good workaround.
Comment 79 Vladimir Smirnov 2019-10-02 14:20:49 UTC
I think the messages on the Intel platform are different and come from DMAR.
Comment 80 Andreas 2019-10-02 16:24:54 UTC
Patch works on my system.

Again the specs:
Board: ASUS PRIME X370-PRO, BIOS 5220 09/12/2019
CPU: AMD Ryzen 7 1800X
NVMe #1: Intel M.2 600p [8086:f1a5]
NVMe #2: Crucial P1 [c0a9:2263]
Distro: Gentoo Linux
Kernel: 5.3.2
Patch: kernel-5.3-nvme-discard-align-to-page-size.patch

Thanks!
Comment 81 swk 2019-10-03 04:43:59 UTC
I have enclosed 2 logs, with and without iommu=pt.

IOMMU was enabled in the BIOS during both test cases.
Setup 1: X570 + Ryzen 3600X + Gigabyte AORUS Gen 4 1TB + Kernel 5.4.0rc1
Setup 2: X399 + TR2990WX + Corsair MP510 + Kernel 5.4.0rc1

I have not applied the PAGE alignment code modification.

As per my observation, with iommu=pt it works even if the memory is not 4 KB aligned, but it does not when iommu=on (not pt).
Comment 82 swk 2019-10-03 04:45:42 UTC
Created attachment 285313 [details]
log with iommu=pt
Comment 83 swk 2019-10-03 04:46:13 UTC
Created attachment 285315 [details]
log with iommu=on
Comment 84 swk 2019-10-03 04:48:44 UTC
Watch for the [NVME_DSM] tag.
Comment 85 Vladimir Smirnov 2019-10-03 05:45:46 UTC
(In reply to swk from comment #81)
> I have enclosed 2 logs with and without iommu=pt
> 
> Enabled IOMMU in the bios during both type of test case.
> Setup 1: X570 + Ryzen 3600X + Gigabyte AORUS Gen 4 1TB + Kernel 5.4.0rc1
> Setup 2: X399 + TR2990WX + Corsair MP510 + Kernel 5.4.0rc1
> 
> I have not applied the PAGE alignment code modification.
> 
> as per my observation with iommu=pt, it works even if the memory is 4KB
> unaligned but it does not when iommu=on (not pt)

As mentioned somewhere earlier in this bug, `iommu=pt` only masks the issue (the IOMMU is enabled only for devices that are passed through to a VM, so it cannot detect and report a page fault) and discard continues to work anyway.
Comment 86 swk 2019-10-03 10:43:16 UTC
Created attachment 285319 [details]
log with AMD virtualization OFF

With IOMMU and virtualization enabled, we use the virtual bus address for DMA between the host and the NVMe device, and for some reason I think it's messed up.

With IOMMU pass-through set, the kernel is told not to apply virtual bus address translation for devices which do not support it, so our NVMe, which falls into this category, now operates only using CPU-side virtual memory and not a virtual bus address.

The enclosed log file confirms that the root cause is IOMMU + virtualization. In this run I disabled virtualization and the IOMMU in the BIOS, so the kernel never uses virtual bus memory, which is the same as with IOMMU pass-through.
Comment 87 Vladimir Smirnov 2019-10-03 16:29:14 UTC
(In reply to swk from comment #86)
> Created attachment 285319 [details]
> log with AMD virtualization OFF
> 
> with IOMMU and virtualization enabled, we use the virtual bus address for
> the DMA between host and nvme device, for some reason i thinks its messed
> up. 
> 
> with IOMMU pass through set, it tells the kernel not to apply the virtual
> bus address translation on the devices which does not support, so our nvme
> which comes under this category now operates only using cpu side virtual
> memory and not virtual bus address.
> 
> the enclosed log file confirms that root cause is iommu+virtualization. in
> this run I have disabled the virtualization and iommu in bios there by
> kernel never uses virtual bus memory which is same as IOMMU pass through.

There is another view on the root cause of the page fault, described by the original workaround author: https://bugzilla.kernel.org/show_bug.cgi?id=202665#c49 (for Phison controllers) and https://bugzilla.kernel.org/show_bug.cgi?id=202665#c60 (for Silicon Motion controllers).

I'm not qualified enough to judge, but it sounds more like the controllers' firmware is the actual root cause of the issue and the IOMMU only makes it detectable.
Comment 88 Andreas 2019-10-06 08:27:16 UTC
Out of interest: wouldn't the fastest solution be to just allocate a fixed size in nvme_setup_discard, without the overhead of checking whether a specific PCI device is actually affected? Naturally this would only be meaningful if nvme_setup_discard is called often, saving cycles when it is called again and again. This also assumes that memory is not limited, which in my case it isn't. I also assume that the allocation is freed after the discard command finishes.

With kernel 5.3.4 I reverted to the patch by Eduard Hasenleithner in comment 49, unconditionally allocating 4096 bytes, one full page.
Comment 89 Vladimir Smirnov 2019-10-06 16:13:48 UTC
I've just tried to rewrite the patch by Eduard Hasenleithner to look more like what's done in other parts of the driver - for such cases it's more common to use quirks, as the overhead of one if statement is not a big deal.

However, I'm not familiar with kernel development (I just happen to have built a desktop at home that's affected by this), and it seems there have been no comments from Jens about the best approach to solve it.
Comment 90 Andreas 2019-10-06 22:26:31 UTC
Vladimir Smirnov, your patch worked fine.

I'm also just a user of a desktop system that happened to be hit by this issue.

I don't know which approach is the correct one. But I hope this issue will be fixed in the kernel - for good.
Comment 91 Rocky Prabowo 2019-10-07 18:08:19 UTC
Both disabling PCIe relaxed ordering and using Vladimir's patch fix the issues mentioned here, but disabling RlxdOrd is kind of useless when the scheduled fstrim from systemd kicks in early during the boot process, so you have to make a script that disables PCIe relaxed ordering before the fstrim service does its job.

I have an ADATA SX8200PNP (aka SX8200 Pro) running in a Ryzen 2500U laptop. The SX8200PNP uses the SM2262EN controller but has a different VID and PID, not the generic one used in the patch, so I had to add new entries to nvme_id_table with the workaround made by Vladimir, plus an additional quirk for another problem related to the SX8200PNP (NVME_QUIRK_IGNORE_DEV_SUBNQN).
Comment 92 Philip Langdale 2019-10-18 05:13:39 UTC
The way to get this fixed in the kernel is to send the patch (Vladimir's, probably) to the linux-nvme mailing list. The devs rarely pay much attention to what goes on in Bugzilla.

Vladimir, if you can send the patch to the list, it will bring it to the right people's attention and if it needs modification, they'll tell you. I'd offer to do it myself, but I can't provide a Sign-Off on it; only you can.

For the record, I've got a Phison E12-based SSD (MyDigitalSSD BPX Pro) and an Intel CPU, so the error messages look different but the problem and solution are the same.
Comment 93 nutodafozo 2019-10-18 17:23:48 UTC
I'd like to confirm that the patch also helped me with another Phison E12 SSD, the Silicon Power P34A80 (FW 12.3).

I want to ask if anybody else gets these errors:
pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
nvme 0000:01:00.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
nvme 0000:01:00.0: AER:   device [1987:5012] error status/mask=00001000/00006000
nvme 0000:01:00.0: AER:    [12] Timeout
Comment 94 Christoph Nelles 2019-10-18 17:26:44 UTC
I have them. There seems to be no way to suppress the corrected-error messages, so I commented them out:

--- linux-5.3.1-stock/drivers/pci/pcie/aer.c    2019-09-21 07:19:47.000000000 +0200
+++ linux-5.3.1/drivers/pci/pcie/aer.c  2019-09-29 21:49:39.579714115 +0200
@@ -1178,6 +1178,6 @@
                        e_info.multi_error_valid = 0;
-               aer_print_port_info(pdev, &e_info);
+               //aer_print_port_info(pdev, &e_info);

-               if (find_source_device(pdev, &e_info))
-                       aer_process_err_devices(&e_info);
+               //if (find_source_device(pdev, &e_info))
+               //      aer_process_err_devices(&e_info);
        }
Comment 95 ono.kirin 2019-11-01 05:10:59 UTC
I made a Docker image to build kernel 4.15.0 of Ubuntu 16.04 with the patch applied to fix the problem. After this, fstrim works correctly.

https://github.com/fx-kirin/docker-ubuntu-kernel-build/tree/ubuntu16.04-kernel4.15.0

The explanation is here.

http://fx-kirin.com/ubuntu/fix-amd-vi-io_page_fault/
Comment 96 valahanovich 2019-11-10 12:38:25 UTC
OK, the alignment patch helped me with an FX-8350 CPU and a Crucial P1 1TB (CT1000P1SSD8).
Relaxed ordering wasn't active in my case.
But it looks like the bug is still not mentioned on the mailing lists. We don't necessarily need to send a signed patch; just mention the problem and the existing solution.
Comment 98 Eduard Hasenleithner 2019-11-10 15:29:19 UTC
Actually I've already started discussion in this thread: https://lists.infradead.org/pipermail/linux-nvme/2019-November/027822.html
Comment 99 hamelg 2019-11-10 16:51:09 UTC
Does it mean the fix will be present with vanilla kernel 5.4 ?
Comment 100 Keith Busch 2019-11-13 00:37:22 UTC
(In reply to hamelg from comment #99)
> Does it mean the fix will be present with vanilla kernel 5.4 ?

It's staged for 5.5; we can set it for stable once that merge window opens so it can hit all the LTS kernels.
Comment 101 Andreas 2019-11-19 15:00:15 UTC
I think there should be an option to default to a 4k page. And the reason is:

(In reply to Vladimir Smirnov from comment #87)
> ... it sounds more like the controller's
> firmware is the actual root cause of the issue and IOMMU only allows to
> detect it.

If so, it should be the default in case the IOMMU+Virtualization is not active, because otherwise there is no way to detect it.

Also, I'd like to know whether the quirk is checked on every call of nvme_setup_discard -- if so, wouldn't the cleaner solution be to check for affected hardware once (at initialization) and then set the kmalloc_array size accordingly, statically?
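The idea can be sketched in userspace C. The names here (NVME_QUIRK_DMA_BUFFER, MAX_DSM_RANGES, discard_alloc_size) are illustrative, not the kernel's actual identifiers; the point is just that the allocation size could be derived once from a per-device quirk flag instead of being branched on for every discard:

```c
#include <stddef.h>

/* Hypothetical model of the suggestion above: pick the discard-range
 * buffer size from a per-device quirk flag, which could be evaluated
 * once at controller initialization. */
#define NVME_QUIRK_DMA_BUFFER (1u << 0)  /* illustrative quirk bit */
#define MAX_DSM_RANGES 256               /* a DSM command holds at most 256 ranges */
#define RANGE_SIZE 16                    /* sizeof one DSM range descriptor */

static size_t discard_alloc_size(unsigned int quirks, unsigned int segments)
{
    /* A non-conformant controller may DMA a fixed amount regardless of
     * the stated range count, so always hand it the maximum-sized
     * (page-sized: 256 * 16 = 4096 bytes) buffer. */
    if (quirks & NVME_QUIRK_DMA_BUFFER)
        return (size_t)MAX_DSM_RANGES * RANGE_SIZE;

    /* Conformant controllers only need room for the actual segments. */
    return (size_t)segments * RANGE_SIZE;
}
```

With that in place, the hot path would simply call kmalloc with the precomputed size instead of re-testing the quirk on each request.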
Comment 102 Dirk Pritsch 2019-11-28 15:29:51 UTC
Hi.

Don't know if you're still looking for affected devices, but I think I have one (or two):

The error is as follows:

[ 1924.242628] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
[ 1924.242634] nvme 0000:01:00.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 1924.242637] nvme 0000:01:00.0: AER:   device [1987:5013] error status/mask=00001000/00006000
[ 1924.242640] nvme 0000:01:00.0: AER:    [12] Timeout               
[ 1925.192262] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
[ 1925.192269] nvme 0000:01:00.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 1925.192274] nvme 0000:01:00.0: AER:   device [1987:5013] error status/mask=00001000/00006000
[ 1925.192276] nvme 0000:01:00.0: AER:    [12] Timeout               
[ 1925.855331] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
[ 1925.855337] nvme 0000:01:00.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 1925.855341] nvme 0000:01:00.0: AER:   device [1987:5013] error status/mask=00001000/00006000
[ 1925.855343] nvme 0000:01:00.0: AER:    [12] Timeout
...

The devices are two Gigabyte SSD:

root@enterprise:~# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev  
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     SN193808924309       GIGABYTE GP-GSM2NE3100TNTD               1           1,02  TB /   1,02  TB    512   B +  0 B   EDFM00.2
/dev/nvme1n1     SN193808927365       GIGABYTE GP-GSM2NE3512GNTD               1         512,11  GB / 512,11  GB    512   B +  0 B   EDFM00.2

Both sitting on an "ASUS ROG STRIX B450-I GAMING" Mini-ITX board with a Ryzen 5-3600 CPU

running a PopOS! 19.10 with kernel Linux version 5.3.0-20-generic (buildd@lgw01-amd64-060) (gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2)) #21+system76~1572304854~19.10~8caa3e6-Ubuntu SMP Tue Oct 29 00:4 (Ubuntu 5.3.0-20.21+system76~1572304854~19.10~8caa3e6-generic 5.3.7)

Nov 28 12:02:33 pop-os kernel: [    0.970766] nvme nvme0: pci function 0000:01:00.0
Nov 28 12:02:33 pop-os kernel: [    0.970814] nvme nvme1: pci function 0000:07:00.0
Nov 28 12:02:33 pop-os kernel: [    1.190008] nvme nvme1: missing or invalid SUBNQN field.
Nov 28 12:02:33 pop-os kernel: [    1.195375] nvme nvme0: missing or invalid SUBNQN field.
Nov 28 12:02:33 pop-os kernel: [    1.226915] nvme nvme1: allocated 128 MiB host memory buffer.
Nov 28 12:02:33 pop-os kernel: [    1.249453] nvme nvme1: 8/0/0 default/read/poll queues
Nov 28 12:02:33 pop-os kernel: [    1.254760]  nvme0n1: p1
Nov 28 12:02:33 pop-os kernel: [    1.263947] nvme nvme0: allocated 128 MiB host memory buffer.
Nov 28 12:02:33 pop-os kernel: [    1.300798] nvme nvme0: 8/0/0 default/read/poll queues
Nov 28 12:02:33 pop-os kernel: [    1.332509]  nvme1n1: p1 p2 p3 p4 p5 p6 p7 p8 p9
...

Nov 28 12:02:33 pop-os kernel: [    0.776499] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
Nov 28 12:02:33 pop-os kernel: [    0.776499] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
Nov 28 12:02:33 pop-os kernel: [    0.779361] pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
Nov 28 12:02:33 pop-os kernel: [    0.779361] pci 0000:00:00.2: AMD-Vi: Extended features (0x58f77ef22294ade):
Nov 28 12:02:33 pop-os kernel: [    0.779362]  PPR X2APIC NX GT IA GA PC GA_vAPIC
Nov 28 12:02:33 pop-os kernel: [    0.779364] AMD-Vi: Interrupt remapping enabled
Nov 28 12:02:33 pop-os kernel: [    0.779364] AMD-Vi: Virtual APIC enabled
Nov 28 12:02:33 pop-os kernel: [    0.779364] AMD-Vi: X2APIC enabled
Nov 28 12:02:33 pop-os kernel: [    0.779442] AMD-Vi: Lazy IO/TLB flushing enabled
Nov 28 12:02:33 pop-os kernel: [    0.780098] amd_uncore: AMD NB counters detected
Nov 28 12:02:33 pop-os kernel: [    0.780101] amd_uncore: AMD LLC counters detected
Nov 28 12:02:33 pop-os kernel: [    0.780221] LVT offset 0 assigned for vector 0x400
Nov 28 12:02:33 pop-os kernel: [    0.780278] perf: AMD IBS detected (0x000003ff)
Nov 28 12:02:33 pop-os kernel: [    0.780282] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
...


Please tell me if you need more data. (I will install Buster in the next few days, so I can also double-check the error messages.)

Regards, Dirk
Comment 103 Vladimir Smirnov 2019-11-28 15:55:14 UTC
Your message doesn't sound like the same issue discussed in this bug.

However, you can still try this patch; just verify that the PCI ID of your SSD is in it.
Comment 104 nutodafozo 2019-11-29 09:19:09 UTC
These AER timeouts are probably Phison E12 specific; I have them, and Christoph Nelles above confirmed he has them too.
Comment 105 Karsten Weiss 2019-11-30 20:45:05 UTC
For the record, this is the latest version of the patch:

http://git.infradead.org/nvme.git/commitdiff/530436c45ef2e446c12538a400e465929a0b3ade?hp=400b6a7b13a3fd71cff087139ce45dd1e5fff444

Shouldn’t it also be backported to older stable kernels?
Comment 106 AM 2019-12-02 05:12:26 UTC
Forgive me if this is a dumb question, but are there some instructions regarding how one could implement this fix?

I see that the fix is within: drivers/nvme/host/core.c 
...and replacing 'segments' with '256'

-    range = kmalloc_array(segments, sizeof(*range),
+    range = kmalloc_array(256, sizeof(*range),



1)  How does one get to/access (drivers/nvme/host/core.c) in order to make this change?

2)  After making this change, does anything else need to be done, other than a reboot, I'm guessing?

Note You need to log in before you can comment on or make changes to this bug.