Bug 217871

Summary: Got nvme timeout with KIOXIA 512GB + VMD mode
Product: IO/Storage    Reporter: AceLan Kao (acelan)
Component: NVMe    Assignee: IO/NVME Virtual Default Assignee (io_nvme)
Status: NEW
Severity: high    CC: bagasdotme, bjorn, dirk.su, kai.heng.feng, kbusch, max.lee, nirmal.patel, ping.a.bao
Priority: P3
Hardware: Intel
OS: Linux
Kernel Version:    Subsystem:
Regression: No    Bisected commit-id:
Attachments: dmesg.log, dmesg intrmap=off

Description AceLan Kao 2023-09-05 05:39:58 UTC
Created attachment 305033 [details]
dmesg.log

Kernel: latest mainline 3f86ed6ec0b3 ("Merge tag 'arc-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc")

Encountering the nvme timeout issue below with VMD mode:

[   97.343423] nvme nvme0: I/O 259 QID 2 timeout, completion polled
[   97.343446] nvme nvme0: I/O 384 QID 3 timeout, completion polled
[   97.343459] nvme nvme0: I/O 320 QID 4 timeout, completion polled
[   97.343470] nvme nvme0: I/O 707 QID 5 timeout, completion polled
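
("completion polled" means the timeout handler itself found the completion by polling the completion queue: the drive had already posted the CQE, but the interrupt path never processed it in time. Roughly, the relevant path in drivers/nvme/host/pci.c looks like the sketch below; this is a paraphrase and details vary across kernel versions.)

static enum blk_eh_timer_return nvme_timeout_sketch(struct request *req)
{
        struct nvme_queue *nvmeq = req->mq_hctx->driver_data;

        /* Reap any CQEs the controller posted but the IRQ path never saw. */
        nvme_poll_irqdisable(nvmeq);

        if (blk_mq_rq_state(req) != MQ_RQ_IN_FLIGHT) {
                /* logs "I/O %d QID %d timeout, completion polled" */
                return BLK_EH_DONE;
        }

        /* A genuine timeout: fall through to abort/reset handling. */
        return BLK_EH_RESET_TIMER;
}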

Disabling APST (nvme_core.default_ps_max_latency_us=0) doesn't help.

ubuntu@ubuntu:~$ sudo nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0
temperature : 41 C (314 Kelvin)
available_spare : 100%
available_spare_threshold : 50%
percentage_used : 0%
endurance group critical warning summary: 0
data_units_read : 4995511
data_units_written : 4477527
host_read_commands : 29406022
host_write_commands : 22505736
controller_busy_time : 116
power_cycles : 199
power_on_hours : 523
unsafe_shutdowns : 52
media_errors : 0
num_err_log_entries : 0
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 41 C (314 Kelvin)
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0


ubuntu@ubuntu:~$ sudo smartctl -a /dev/nvme0
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-6.5.0-060500rc5drmtip20230813-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: KBG50ZNS512G NVMe KIOXIA 512GB
Serial Number: 33PC1K7SEWDX
Firmware Version: 11200108
PCI Vendor/Subsystem ID: 0x1e0f
IEEE OUI Identifier: 0x8ce38e
Total NVM Capacity: 512,110,190,592 [512 GB]
Unallocated NVM Capacity: 0
Controller ID: 0
NVMe Version: 1.4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 8ce38e 10011ec358
Local Time is: Tue Aug 22 09:41:45 2023 EDT
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x00df): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp Verify
Log Page Attributes (0x1e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 83 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Namespace 1 Features (0x02): NA_Fields

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
 0 + 4.10W - - 0 0 0 0 1 1
 1 + 2.60W - - 1 1 1 1 1 1
 2 + 1.30W - - 2 2 2 2 1 1
 3 - 0.0500W - - 3 3 3 3 800 1000
 4 - 0.0030W - - 4 4 4 4 3000 44900

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
 0 + 512 0 2
 1 - 4096 0 1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 40 Celsius
Available Spare: 100%
Available Spare Threshold: 50%
Percentage Used: 0%
Data Units Read: 4,995,632 [2.55 TB]
Data Units Written: 4,477,529 [2.29 TB]
Host Read Commands: 29,406,658
Host Write Commands: 22,505,946
Controller Busy Time: 116
Power Cycles: 199
Power On Hours: 523
Unsafe Shutdowns: 52
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 40 Celsius

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
Comment 1 Bagas Sanjaya 2023-09-05 12:21:37 UTC
(In reply to AceLan Kao from comment #0)
> Kernel: latest mainline 3f86ed6ec0b3 ("Merge tag 'arc-6.6-rc1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc")
> 
> Encountering the nvme timeout issue below with VMD mode:
> 
> [   97.343423] nvme nvme0: I/O 259 QID 2 timeout, completion polled
> [   97.343446] nvme nvme0: I/O 384 QID 3 timeout, completion polled
> [   97.343459] nvme nvme0: I/O 320 QID 4 timeout, completion polled
> [   97.343470] nvme nvme0: I/O 707 QID 5 timeout, completion polled
> 
> Disabling APST (nvme_core.default_ps_max_latency_us=0) doesn't help.
> [...]

Do you have this issue on v6.5? v6.4?
Comment 2 AceLan Kao 2023-09-06 07:48:23 UTC
Forgot to mention that this is an MTL-H platform.
I can see the issue on both v6.4 and v6.5 kernels.
Comment 3 AceLan Kao 2023-09-14 23:45:11 UTC
Disabling CPU C-states in the BIOS makes it work well, with no nvme timeout errors.
Comment 4 Keith Busch 2023-09-15 01:57:48 UTC
Is there a reason you don't just disable vmd mode?
Comment 5 AceLan Kao 2023-09-25 06:22:44 UTC
No, we can't disable VMD mode; that decision has been made by Dell.

The two workarounds are not feasible for us:
1. Set AHCI mode instead of VMD mode.
2. Disable CPU C-states.

We're still testing the latest mainline kernel, but no luck so far.
Comment 6 Nirmal Patel 2023-09-28 15:12:55 UTC
Is this a regression? Has this particular drive ever worked with VMD mode enabled? Do you have the IOMMU enabled?
Comment 7 AceLan Kao 2023-10-03 05:44:21 UTC
Looks like the issue only happens on the new SoC; I didn't observe the nvme timeout issue on RPL-S.

Tried iommu=force and iommu=off; neither cmdline option helps.
Comment 8 Ping Bao 2023-10-12 17:41:48 UTC
Tested the customer's KIOXIA 2TB SSD as-is (i.e. without re-installing the OS) on our RVP platform; it previously showed the NVMe timeout (Oct 5 08:17:05) on the customer platform.

We booted the customer's OS image on that SSD on our RVP platform and could not reproduce the reported NVMe timeout, as shown below (Oct 12 12:19:56):

$ cat kern.log | grep -e vmd -e nvme
Oct  5 08:17:05 u kernel: [    1.400024] vmd 0000:00:0e.0: PCI host bridge to bus 10000:e0
Oct  5 08:17:05 u kernel: [    1.614382] vmd 0000:00:0e.0: Bound to PCI domain 10000
Oct  5 08:17:05 u kernel: [    1.799023] nvme nvme0: pci function 10000:e1:00.0
Oct  5 08:17:05 u kernel: [    1.799063] nvme 10000:e1:00.0: PCI INT A: no GSI
Oct  5 08:17:05 u kernel: [    1.805219] nvme nvme0: 18/0/0 default/read/poll queues
Oct  5 08:17:05 u kernel: [    1.808576]  nvme0n1: p1 p2 p3
Oct  5 08:17:05 u kernel: [   33.889977] nvme nvme0: I/O 960 QID 4 timeout, completion polled
Oct  5 08:17:05 u kernel: [   64.610386] nvme nvme0: I/O 128 QID 7 timeout, completion polled
Oct  5 08:17:05 u kernel: [   64.692950] EXT4-fs (nvme0n1p3): mounted filesystem 4cf993b3-ed8f-4f23-a995-bb7955c24e12 ro with ordered data mode. Quota mode: none.
Oct  5 08:17:05 u kernel: [   95.330389] nvme nvme0: I/O 963 QID 4 timeout, completion polled
Oct  5 08:17:05 u kernel: [  126.049958] nvme nvme0: I/O 961 QID 4 timeout, completion polled
Oct  5 08:17:05 u kernel: [  126.173027] block nvme0n1: the capability attribute has been deprecated.
Oct  5 08:17:05 u kernel: [  156.770086] nvme nvme0: I/O 579 QID 2 timeout, completion polled
Oct  5 08:17:05 u kernel: [  156.770112] nvme nvme0: I/O 320 QID 5 timeout, completion polled
Oct  5 08:17:05 u kernel: [  156.770126] nvme nvme0: I/O 128 QID 8 timeout, completion polled
Oct  5 08:17:05 u kernel: [  156.921174] EXT4-fs (nvme0n1p3): re-mounted 4cf993b3-ed8f-4f23-a995-bb7955c24e12 r/w. Quota mode: none.
Oct  5 08:17:05 u kernel: [  186.978103] nvme nvme0: I/O 448 QID 1 timeout, completion polled
Oct  5 08:17:05 u kernel: [  186.978123] nvme nvme0: I/O 768 QID 3 timeout, completion polled
Oct  5 08:17:05 u kernel: [  217.186086] nvme nvme0: I/O 771 QID 3 timeout, completion polled
Oct  5 08:17:05 u kernel: [  217.318607] Modules linked in: ecc(+) processor_thermal_rfim(+) industrialio(+) drm_kms_helper processor_thermal_mbox processor_thermal_rapl intel_vpu(+) i2c_algo_bit intel_ipu6(O+) intel_rapl_common intel_vsec mac_hid ov02c10(O) intel_skl_int3472_tps68470 mei_csi(O) mei_ace(O) tps68470_regulator int3403_thermal(+) intel_vsc(O) mei_vsc(O) int340x_thermal_zone clk_tps68470 v4l2_fwnode v4l2_async mei int3400_thermal acpi_tad intel_skl_int3472_discrete intel_hid acpi_thermal_rel sparse_keymap acpi_pad sch_fq_codel v4l2loopback(O) videodev mc msr parport_pc ppdev lp parport drm efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_sensor_custom hid_sensor_hub hid_generic intel_ishtp_hid nvme nvme_core nvme_common ahci video libahci crc32_pclmul psmouse ucsi_acpi thunderbolt intel_lpss_pci spi_intel_pci i2c_i801 intel_ish_ipc xhci_pci intel_lpss typec_ucsi spi_intel i2c_smbus intel_ishtp
Oct  5 08:17:05 u kernel: [  217.318660]  idma64 xhci_pci_renesas vmd typec i2c_hid_acpi i2c_hid hid wmi pinctrl_meteorlake

Oct 12 12:19:56 u kernel: [    1.667196] vmd 0000:00:0e.0: PCI host bridge to bus 10000:e0
Oct 12 12:19:56 u kernel: [    1.870276] vmd 0000:00:0e.0: Bound to PCI domain 10000
Oct 12 12:19:56 u kernel: [    1.968882] nvme nvme0: pci function 10000:e1:00.0
Oct 12 12:19:56 u kernel: [    1.968923] nvme 10000:e1:00.0: PCI INT A: no GSI
Oct 12 12:19:56 u kernel: [    1.973985] nvme nvme0: 18/0/0 default/read/poll queues
Oct 12 12:19:56 u kernel: [    1.977394]  nvme0n1: p1 p2 p3
Oct 12 12:19:56 u kernel: [    3.904155] EXT4-fs (nvme0n1p3): mounted filesystem 4cf993b3-ed8f-4f23-a995-bb7955c24e12 ro with ordered data mode. Quota mode: none.
Oct 12 12:19:56 u kernel: [    4.052327] block nvme0n1: the capability attribute has been deprecated.
Oct 12 12:19:56 u kernel: [    4.196034] EXT4-fs (nvme0n1p3): re-mounted 4cf993b3-ed8f-4f23-a995-bb7955c24e12 r/w. Quota mode: none.

The NVMe timeout issue could possibly be caused by HW/BIOS/something else.
Comment 9 AceLan Kao 2023-11-16 06:26:01 UTC
Adding a small delay in nvme_irq() eases this issue; with the delay in place it becomes hard to reproduce.

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 2f57da12d9836..b1a6a02e5cce8 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -28,6 +28,7 @@
 #include <linux/io-64-nonatomic-hi-lo.h>
 #include <linux/sed-opal.h>
 #include <linux/pci-p2pdma.h>
+#include <linux/delay.h>
 
 #include "trace.h"
 #include "nvme.h"
@@ -1081,6 +1082,7 @@ static irqreturn_t nvme_irq(int irq, void *data)
        struct nvme_queue *nvmeq = data;
        DEFINE_IO_COMP_BATCH(iob);
 
+       udelay(5);
        if (nvme_poll_cq(nvmeq, &iob)) {
                if (!rq_list_empty(iob.req_list))
                        nvme_pci_complete_batch(&iob);
Comment 10 Nirmal Patel 2023-11-16 21:21:01 UTC
What happens if you try intremap=off?
Comment 11 AceLan Kao 2023-11-17 00:39:40 UTC
Created attachment 305413 [details]
dmesg intrmap=off

With intrmap=off, there are still nvme timeout messages:

[   35.902221] nvme nvme0: I/O 768 QID 4 timeout, completion polled
[   66.366225] nvme nvme0: I/O 771 QID 4 timeout, completion polled
[  130.253142] nvme nvme0: I/O 896 QID 6 timeout, completion polled
[  130.253168] nvme nvme0: I/O 768 QID 7 timeout, completion polled
[  130.293240] nvme nvme0: I/O 614 QID 2 timeout, completion polled
[  161.267969] nvme nvme0: I/O 1 QID 5 timeout, completion polled
Comment 12 AceLan Kao 2023-11-17 00:41:51 UTC
Typo in the above comment: it's intremap. I passed the correct kernel cmdline:

[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.6.0 root=UUID=2da08224-2920-429a-95e8-755a71fdc71d ro i915.force_probe=7d55 intremap=off quiet splash vt.handoff=7
Comment 13 AceLan Kao 2023-11-20 01:36:51 UTC
Some more findings. I'm not sure how this relates to the issue, but the delay works:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3f0c9ee09a12..24993cfd11fb 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -28,6 +28,7 @@
 #include <linux/io-64-nonatomic-hi-lo.h>
 #include <linux/sed-opal.h>
 #include <linux/pci-p2pdma.h>
+#include <linux/delay.h>

 #include "trace.h"
 #include "nvme.h"
@@ -990,6 +991,7 @@ static inline bool nvme_cqe_pending(struct nvme_queue *nvmeq)
 {
        struct nvme_completion *hcqe = &nvmeq->cqes[nvmeq->cq_head];

+       udelay(5);
        return (le16_to_cpu(READ_ONCE(hcqe->status)) & 1) == nvmeq->cq_phase;
 }
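
For reference, the check patched here implements the NVMe completion-queue phase-tag protocol: the controller writes each CQE with the current phase bit, and the host treats a slot as holding a new entry only when that bit matches the phase it expects for the current pass over the ring. An illustrative consumer-side sketch (not the mainline code):

#include <linux/nvme.h>

struct sketch_cq {
        struct nvme_completion *cqes;   /* host-memory CQ ring */
        u16 head;
        u16 depth;
        u8  phase;                      /* flips on every wrap */
};

static bool sketch_cqe_pending(struct sketch_cq *cq)
{
        u16 status = le16_to_cpu(READ_ONCE(cq->cqes[cq->head].status));

        /* Bit 0 of the CQE Status field is the Phase Tag. */
        return (status & 1) == cq->phase;
}

static void sketch_cq_advance(struct sketch_cq *cq)
{
        if (++cq->head == cq->depth) {
                cq->head = 0;
                cq->phase ^= 1;         /* next lap expects the other phase */
        }
}

If the MSI-X write can reach the CPU before the CQE write reaches host memory, this check sees the stale phase and reports no pending completion, which is exactly the before/after mismatch logged in the next comment.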
Comment 14 AceLan Kao 2023-11-20 01:47:41 UTC
I'm not sure whose fault it is, but I get many value-difference logs between the two reads.

[    1.658977] drivers/nvme/host/pci.c:999] nvme_cqe_pending() before = 0x0000 after = 0x0001
[    3.775746] drivers/nvme/host/pci.c:999] nvme_cqe_pending() before = 0x0000 after = 0x0001
[    3.775773] drivers/nvme/host/pci.c:999] nvme_cqe_pending() before = 0x0000 after = 0x0001
[    3.780746] EXT4-fs (nvme0n1p2): mounted filesystem 2da08224-2920-429a-95e8-755a71fdc71d ro with ordered data mode. Quota mode: none.
[    3.813830] drivers/nvme/host/pci.c:999] nvme_cqe_pending() before = 0x0000 after = 0x0001
[    3.829130] drivers/nvme/host/pci.c:999] nvme_cqe_pending() before = 0x0000 after = 0x0001
[    3.829308] drivers/nvme/host/pci.c:999] nvme_cqe_pending() before = 0x0000 after = 0x0001
[    3.834134] drivers/nvme/host/pci.c:999] nvme_cqe_pending() before = 0x0000 after = 0x0001
[    3.868118] drivers/nvme/host/pci.c:999] nvme_cqe_pending() before = 0x0000 after = 0x0001


diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3f0c9ee09a12..6564a1afae99 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -28,6 +28,7 @@
 #include <linux/io-64-nonatomic-hi-lo.h>
 #include <linux/sed-opal.h>
 #include <linux/pci-p2pdma.h>
+#include <linux/delay.h>
 
 #include "trace.h"
 #include "nvme.h"
@@ -989,7 +990,13 @@ static void nvme_pci_complete_batch(struct io_comp_batch *iob)
 static inline bool nvme_cqe_pending(struct nvme_queue *nvmeq)
 {
        struct nvme_completion *hcqe = &nvmeq->cqes[nvmeq->cq_head];
+       int before, after;
 
+       before = READ_ONCE(hcqe->status);
+       udelay(5);
+       after = READ_ONCE(hcqe->status);
+       if (before != after)
+               printk("%s:%d] %s() before = 0x%04X after = 0x%04X\n", __FILE__, __LINE__, __func__, before, after);
        return (le16_to_cpu(READ_ONCE(hcqe->status)) & 1) == nvmeq->cq_phase;
 }
Comment 15 Keith Busch 2023-11-20 02:45:55 UTC
Sounds like vmd is reordering msix and cqe writes. Is relaxed ordering enabled on any of the devices in the hierarchy? If so, try turning it off.
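
A minimal sketch of one way to try that (illustrative and untested; the 0x10000 domain number is an assumption taken from the logs in this report): clear the Relaxed Ordering Enable bit in the PCIe Device Control register of every function under the VMD segment, so endpoints are not allowed to set the RO attribute on their upstream writes.

#include <linux/pci.h>

/* Illustrative only: strip Relaxed Ordering Enable below the VMD domain. */
static void sketch_vmd_clear_relaxed_ordering(void)
{
        struct pci_dev *pdev = NULL;

        for_each_pci_dev(pdev) {
                if (pci_domain_nr(pdev->bus) != 0x10000)    /* VMD segment */
                        continue;
                pcie_capability_clear_word(pdev, PCI_EXP_DEVCTL,
                                           PCI_EXP_DEVCTL_RELAX_EN);
        }
}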
Comment 16 AceLan Kao 2023-11-20 03:52:50 UTC
The Relaxed Ordering bit is set on the PCI bridge, but modifying it via a pci/quirk doesn't work, and the issue persists.

BTW, adding mb() before READ_ONCE(hcqe->status) doesn't work, either.

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index eeec1d6f9023..f0f4ff963e96 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4502,6 +4502,7 @@ DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x2f0d, PCI_CLASS_NOT_DEFINED
                              quirk_relaxedordering_disable);
 DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x2f0e, PCI_CLASS_NOT_DEFINED, 8,
                              quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x7e4d, quirk_relaxedordering_disable);
 
 /*
  * The AMD ARM A1100 (aka "SEATTLE") SoC has a bug in its PCIe Root Complex


$ sudo dmesg | grep Relaxed
[    0.857153] pci 10000:e0:06.0: Disable Relaxed Ordering Attributes to avoid PCIe Completion erratum

10000:e0:06.0 PCI bridge [0604]: Intel Corporation Device [8086:7e4d] (rev 10)
        Kernel driver in use: pcieport
10000:e1:00.0 Non-Volatile memory controller [0108]: KIOXIA Corporation Device [1e0f:000c]
        Subsystem: KIOXIA Corporation Device [1e0f:0001]
        Kernel driver in use: nvme

u@u-XPS-13-9340:~/linux$ lspci -xxxs 10000:e0:06.0
10000:e0:06.0 PCI bridge: Intel Corporation Device 7e4d (rev 10)
00: 86 80 4d 7e 06 05 10 00 10 00 04 06 00 00 81 00
10: 00 00 00 00 00 00 00 00 00 e1 e1 00 f0 00 00 20
20: 00 7e 00 7e f1 ff 01 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 00 01 02 00

u@u-XPS-13-9340:~/linux$ lspci -xxxs 10000:e1:00.0
10000:e1:00.0 Non-Volatile memory controller: KIOXIA Corporation Device 000c
00: 0f 1e 0c 00 06 04 10 00 00 02 08 01 00 00 00 00
10: 04 00 00 7e 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 0f 1e 01 00
30: 00 00 00 00 80 00 00 00 00 00 00 00 00 01 00 00
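
Worth noting how this quirk works (paraphrased from drivers/pci/quirks.c and probe.c): the fixup only sets a flag on the device it matches; the actual clearing happens later, at enumeration time, on the endpoints below it, in pci_configure_relaxed_ordering().

static void quirk_relaxedordering_disable(struct pci_dev *dev)
{
        /*
         * Endpoints below this device get PCI_EXP_DEVCTL_RELAX_EN cleared
         * by pci_configure_relaxed_ordering() during enumeration.
         */
        dev->dev_flags |= PCI_DEV_FLAGS_NO_RELAXED_ORDERING;
        pci_info(dev, "Disable Relaxed Ordering Attributes to avoid PCIe Completion erratum\n");
}

So the bridge's own config header would not be expected to change, and the lspci -xxx dumps above only cover the first 64 bytes of config space, which do not contain the PCIe Device Control register anyway; lspci -vv reports the bit as RlxdOrd+/RlxdOrd- on the endpoint's DevCtl line.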
Comment 17 AceLan Kao 2023-11-20 06:51:41 UTC
Adding a delay in vmd_irq() also works, but I still see the value change with the patch from comment #14 (the instrumented nvme_cqe_pending()) applied.

There are no nvme timeout logs in dmesg after boot-up (0/10 reproductions).

diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index ad56df98b8e6..ac0a8c0d40ce 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -16,6 +16,7 @@
 #include <linux/srcu.h>
 #include <linux/rculist.h>
 #include <linux/rcupdate.h>
+#include <linux/delay.h>

 #include <asm/irqdomain.h>

@@ -680,6 +681,7 @@ static irqreturn_t vmd_irq(int irq, void *data)
        struct vmd_irq *vmdirq;
        int idx;

+       udelay(5);
        idx = srcu_read_lock(&irqs->srcu);
        list_for_each_entry_rcu(vmdirq, &irqs->irq_list, node)
                generic_handle_irq(vmdirq->virq);
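
If the delay helps here, it is presumably because it sits between the parent MSI-X arriving at vmd and the child nvme_irq() reading the CQE, giving an in-flight posted CQE write time to reach host memory before the phase check runs.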
Comment 18 Nirmal Patel 2023-11-20 23:43:13 UTC
(In reply to AceLan Kao from comment #16)
> The Relaxed Ordering bit is set on the PCI bridge, but modifying it via a
> pci/quirk doesn't work, and the issue persists.
> [...]


Are you saying the Relaxed Ordering bit is set on your platform by default?
I checked multiple server platforms (not MTL) and they all have this register cleared; it is hardwired to 0b by the BIOS.
Can you try the setpci command to change it?

You can also try to disable it from the BIOS; maybe MTL has this register enabled for all the devices.
i.e. EDKII->Socket Configuration->IIO Configuration->PCIe Relaxed Ordering
Comment 19 AceLan Kao 2023-11-21 00:03:45 UTC
Yes, Relaxed Ordering is set by default, and I can't disable it through a pci/quirk.

Tried setpci, but still can't modify the value.

u@u-XPS-13-9340:~$ lspci -xxxs 10000:e0:06.0
10000:e0:06.0 PCI bridge: Intel Corporation Device 7e4d (rev 10)
00: 86 80 4d 7e 06 05 10 00 10 00 04 06 00 00 81 00
10: 00 00 00 00 00 00 00 00 00 e1 e1 00 f0 00 00 20
20: 00 7e 00 7e f1 ff 01 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 00 01 02 00

u@u-XPS-13-9340:~$ sudo setpci -s 10000:e0:06.0 0x08.B=0x00
u@u-XPS-13-9340:~$ lspci -xxxs 10000:e0:06.0
10000:e0:06.0 PCI bridge: Intel Corporation Device 7e4d (rev 10)
00: 86 80 4d 7e 06 05 10 00 10 00 04 06 00 00 81 00
10: 00 00 00 00 00 00 00 00 00 e1 e1 00 f0 00 00 20
20: 00 7e 00 7e f1 ff 01 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 00 01 02 00

I think there is no such option in Dell's BIOS to disable it. Should I ask Dell to fix it?
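
One caveat about the setpci test above: config offset 0x08 is the read-only Revision ID / Class Code field, so a write there cannot take effect. The Relaxed Ordering Enable bit is bit 4 of the PCIe Device Control register, at offset 8 into the PCIe capability; with pciutils it can be addressed as, e.g., setpci -s 10000:e1:00.0 CAP_EXP+8.w=0000:0010 (the value:mask syntax clears just that bit). A kernel-side sketch of checking it (illustrative):

#include <linux/pci.h>

static bool sketch_relaxed_ordering_enabled(struct pci_dev *pdev)
{
        u16 devctl = 0;

        pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &devctl);
        return devctl & PCI_EXP_DEVCTL_RELAX_EN;    /* bit 4 of DevCtl */
}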
Comment 20 Kai-Heng Feng 2023-11-29 04:23:35 UTC
(In reply to Keith Busch from comment #15)
> Sounds like vmd is reordering msix and cqe writes. Is relaxed ordering
> enabled on any of the devices in the hierarchy? If so, try turning it off.

Does mb() have the same effect? With mb() added before reading hcqe->status, the issue can still be observed.
Comment 21 Nirmal Patel 2023-11-30 18:19:32 UTC
(In reply to AceLan Kao from comment #19)
> Yes, Relaxed Ordering is set by default, and I can't disable it through a
> pci/quirk.
> 
> Tried setpci, but still can't modify the value.
> [...]
> I think there is no such option in Dell's BIOS to disable it. Should I ask
> Dell to fix it?

You can try asking them.
Comment 22 AceLan Kao 2023-12-05 00:52:48 UTC
With a test BIOS that disables relaxed ordering by default, we still see this issue.

As this issue only happens in VMD mode, the question becomes more like why vmd raises the interrupt before the nvme completion data is visible.
Comment 23 Nirmal Patel 2023-12-12 17:23:33 UTC
(In reply to AceLan Kao from comment #14)
> I'm not sure whose fault it is, but I get many value-difference logs
> between the two reads.
> 
> [    1.658977] drivers/nvme/host/pci.c:999] nvme_cqe_pending() before =
> 0x0000 after = 0x0001
> [...]
> +       before = READ_ONCE(hcqe->status);
> +       udelay(5);
> +       after = READ_ONCE(hcqe->status);
> [...]

I am not an NVMe expert, but if there is a HW bug, what would be the best place to add a workaround? Adding udelay(5) in vmd_irq() is not a good option.

We need some input here on the workaround.
Comment 24 Keith Busch 2023-12-12 17:27:31 UTC
Arbitrary delays make for terrible workarounds. I'd say just disable VMD and call it a day. The device appears to be muxing interrupts without providing appropriate notification.
Comment 26 AceLan Kao 2023-12-13 02:55:40 UTC
Kai-Heng,
The series is old and I couldn't resolve the conflict in the 2nd patch, so I skipped that one. And it looks like I can't boot the machine with the partially patched kernel :(

Keith,
Could we say that this is an issue in the VMD device, since we didn't encounter it in AHCI mode?
This issue occurs with many different brands of nvme, and we only see it happen on MTL platforms.
I also heard that Windows encountered the same issue, and fixed it by adding a dummy read of an nvme register in the nvme ISR, claiming that it flushes the I/O.
We tried the same approach to work around the issue on Linux, but a dummy read is expensive and probably takes longer than 5 microseconds, so I doubt the dummy read is really what flushes the I/O.
Anyway, we are still looking into this issue and can't just disable VMD.
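
For reference, a sketch of the dummy-read idea described above (an assumption, mirroring the description of the Windows behavior rather than any mainline patch, written as if it lived in drivers/nvme/host/pci.c): a register read from the controller is non-posted, and per PCIe ordering its completion cannot pass the controller's earlier posted writes, so by the time the read returns, any CQE written before the interrupt was sent should be visible in host memory.

static irqreturn_t sketch_nvme_irq_with_flush(int irq, void *data)
{
        struct nvme_queue *nvmeq = data;

        /*
         * Non-posted MMIO read; its completion pushes the device's
         * earlier posted CQE writes ahead of it.  Costly: an MMIO read
         * through VMD can easily take longer than udelay(5).
         */
        (void)readl(nvmeq->dev->bar + NVME_REG_CSTS);

        return nvme_irq(irq, data);    /* then run the normal handler */
}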
Comment 27 Kai-Heng Feng 2023-12-13 03:33:23 UTC
Yeah, that's why I am asking Keith and Nirmal for their input.
Comment 28 Keith Busch 2023-12-13 11:20:21 UTC
(In reply to Kai-Heng Feng from comment #25)
> Can this series help?
> https://lore.kernel.org/linux-pci/1573040408-3831-1-git-send-email-jonathan.derrick@intel.com/

I don't think so. That was about performance, not correctness.
Comment 29 Bjorn Helgaas 2024-09-03 17:16:00 UTC
Based on this patch:

https://lore.kernel.org/r/20240903025544.286223-1-kai.heng.feng@canonical.com

this hardware erratum may be related:

https://edc.intel.com/content/www/us/en/design/products/platforms/details/meteor-lake-u-p/core-ultra-processor-specification-update/errata-details/#MTL016

The patch needs some revision, but adding the erratum link here for the record.