Bug 217871
| Summary: | Got nvme timeout with KIOXIA 512GB + VMD mode | | |
|---|---|---|---|
| Product: | IO/Storage | Reporter: | AceLan Kao (acelan) |
| Component: | NVMe | Assignee: | IO/NVME Virtual Default Assignee (io_nvme) |
| Status: | NEW | | |
| Severity: | high | CC: | bagasdotme, bjorn, dirk.su, kai.heng.feng, kbusch, max.lee, nirmal.patel, ping.a.bao |
| Priority: | P3 | | |
| Hardware: | Intel | | |
| OS: | Linux | | |
| Kernel Version: | | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | dmesg.log, dmesg intrmap=off | | |
Comment #0 (AceLan Kao):

Created attachment 305033 [details]
dmesg.log

Kernel: latest mainline 3f86ed6ec0b3 ("Merge tag 'arc-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc")

Encountered the nvme timeout issue below with VMD mode:

```
[   97.343423] nvme nvme0: I/O 259 QID 2 timeout, completion polled
[   97.343446] nvme nvme0: I/O 384 QID 3 timeout, completion polled
[   97.343459] nvme nvme0: I/O 320 QID 4 timeout, completion polled
[   97.343470] nvme nvme0: I/O 707 QID 5 timeout, completion polled
```

Disabling APST (nvme_core.default_ps_max_latency_us=0) doesn't help.

```
ubuntu@ubuntu:~$ sudo nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                         : 0
temperature                              : 41 C (314 Kelvin)
available_spare                          : 100%
available_spare_threshold                : 50%
percentage_used                          : 0%
endurance group critical warning summary : 0
data_units_read                          : 4995511
data_units_written                       : 4477527
host_read_commands                       : 29406022
host_write_commands                      : 22505736
controller_busy_time                     : 116
power_cycles                             : 199
power_on_hours                           : 523
unsafe_shutdowns                         : 52
media_errors                             : 0
num_err_log_entries                      : 0
Warning Temperature Time                 : 0
Critical Composite Temperature Time      : 0
Temperature Sensor 1                     : 41 C (314 Kelvin)
Thermal Management T1 Trans Count        : 0
Thermal Management T2 Trans Count        : 0
Thermal Management T1 Total Time         : 0
Thermal Management T2 Total Time         : 0

ubuntu@ubuntu:~$ sudo smartctl -a /dev/nvme0
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-6.5.0-060500rc5drmtip20230813-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       KBG50ZNS512G NVMe KIOXIA 512GB
Serial Number:                      33PC1K7SEWDX
Firmware Version:                   11200108
PCI Vendor/Subsystem ID:            0x1e0f
IEEE OUI Identifier:                0x8ce38e
Total NVM Capacity:                 512,110,190,592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      0
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            8ce38e 10011ec358
Local Time is:                      Tue Aug 22 09:41:45 2023 EDT
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x00df):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp Verify
Log Page Attributes (0x1e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     83 Celsius
Critical Comp. Temp. Threshold:     85 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     4.10W       -        -    0  0  0  0        1       1
 1 +     2.60W       -        -    1  1  1  1        1       1
 2 +     1.30W       -        -    2  2  2  2        1       1
 3 -   0.0500W       -        -    3  3  3  3      800    1000
 4 -   0.0030W       -        -    4  4  4  4     3000   44900

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        40 Celsius
Available Spare:                    100%
Available Spare Threshold:          50%
Percentage Used:                    0%
Data Units Read:                    4,995,632 [2.55 TB]
Data Units Written:                 4,477,529 [2.29 TB]
Host Read Commands:                 29,406,658
Host Write Commands:                22,505,946
Controller Busy Time:               116
Power Cycles:                       199
Power On Hours:                     523
Unsafe Shutdowns:                   52
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               40 Celsius

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
```
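(For completeness: nvme_core.default_ps_max_latency_us is a standard nvme_core module parameter. A sketch of applying and verifying it, assuming a GRUB-based Ubuntu install as the shell prompt suggests; the sysfs path is the generic module-parameter location, not something from this report.)

```sh
# Append the parameter to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub:
#   nvme_core.default_ps_max_latency_us=0
sudo update-grub && sudo reboot

# After reboot, confirm the value actually took effect (expect "0"):
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
```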
---

(In reply to AceLan Kao from comment #0)
> Encountered the nvme timeout issue below with VMD mode
> [...]

Do you have this issue on v6.5? v6.4?

---

Forgot to mention that this is an MTL-H platform. I can see the issue on both v6.4 and v6.5 kernels.

Disabling CPU C-states from the BIOS makes it work well, with no nvme timeout errors.
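(An aside, not from the report: a software-only approximation of that BIOS change, sometimes useful while debugging, is to cap C-states from the kernel command line. Both parameters are documented in kernel-parameters.txt, but whether they mask the timeout on this platform is untested.)

```sh
# Limit both the intel_idle and ACPI idle drivers to shallow C-states:
#   intel_idle.max_cstate=1 processor.max_cstate=1
# After boot, list the idle states that remain available:
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
```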
---

Is there a reason you don't just disable VMD mode?

---

No, we can't disable VMD mode; that decision has been made by Dell. The two workarounds are not feasible for us:
1. Set AHCI mode instead of VMD mode.
2. Disable CPU C-states.

We're still checking the latest mainline kernel, but no luck so far.

---

Is this a regression? Has this particular drive ever worked with VMD mode enabled? Do you have the IOMMU enabled?

---

It looks like the issue only happens on the new SoC; we didn't observe the nvme timeout issue on RPL-S. Tried iommu=force and iommu=off; neither cmdline helps.

---

Tested the customer's KIOXIA 2TB SSD as-is on our RVP platform, i.e. without re-installing the OS. That drive previously showed the NVMe timeout (Oct 5 08:17:05) on the customer platform. We booted the customer OS image on that SSD on our RVP platform and could not reproduce the reported NVMe timeout, as shown below (Oct 12 12:19:56):

```
$ cat kern.log | grep -e vmd -e nvme
Oct  5 08:17:05 u kernel: [    1.400024] vmd 0000:00:0e.0: PCI host bridge to bus 10000:e0
Oct  5 08:17:05 u kernel: [    1.614382] vmd 0000:00:0e.0: Bound to PCI domain 10000
Oct  5 08:17:05 u kernel: [    1.799023] nvme nvme0: pci function 10000:e1:00.0
Oct  5 08:17:05 u kernel: [    1.799063] nvme 10000:e1:00.0: PCI INT A: no GSI
Oct  5 08:17:05 u kernel: [    1.805219] nvme nvme0: 18/0/0 default/read/poll queues
Oct  5 08:17:05 u kernel: [    1.808576] nvme0n1: p1 p2 p3
Oct  5 08:17:05 u kernel: [   33.889977] nvme nvme0: I/O 960 QID 4 timeout, completion polled
Oct  5 08:17:05 u kernel: [   64.610386] nvme nvme0: I/O 128 QID 7 timeout, completion polled
Oct  5 08:17:05 u kernel: [   64.692950] EXT4-fs (nvme0n1p3): mounted filesystem 4cf993b3-ed8f-4f23-a995-bb7955c24e12 ro with ordered data mode. Quota mode: none.
Oct  5 08:17:05 u kernel: [   95.330389] nvme nvme0: I/O 963 QID 4 timeout, completion polled
Oct  5 08:17:05 u kernel: [  126.049958] nvme nvme0: I/O 961 QID 4 timeout, completion polled
Oct  5 08:17:05 u kernel: [  126.173027] block nvme0n1: the capability attribute has been deprecated.
Oct  5 08:17:05 u kernel: [  156.770086] nvme nvme0: I/O 579 QID 2 timeout, completion polled
Oct  5 08:17:05 u kernel: [  156.770112] nvme nvme0: I/O 320 QID 5 timeout, completion polled
Oct  5 08:17:05 u kernel: [  156.770126] nvme nvme0: I/O 128 QID 8 timeout, completion polled
Oct  5 08:17:05 u kernel: [  156.921174] EXT4-fs (nvme0n1p3): re-mounted 4cf993b3-ed8f-4f23-a995-bb7955c24e12 r/w. Quota mode: none.
Oct  5 08:17:05 u kernel: [  186.978103] nvme nvme0: I/O 448 QID 1 timeout, completion polled
Oct  5 08:17:05 u kernel: [  186.978123] nvme nvme0: I/O 768 QID 3 timeout, completion polled
Oct  5 08:17:05 u kernel: [  217.186086] nvme nvme0: I/O 771 QID 3 timeout, completion polled
Oct  5 08:17:05 u kernel: [  217.318607] Modules linked in: ecc(+) processor_thermal_rfim(+) industrialio(+) drm_kms_helper processor_thermal_mbox processor_thermal_rapl intel_vpu(+) i2c_algo_bit intel_ipu6(O+) intel_rapl_common intel_vsec mac_hid ov02c10(O) intel_skl_int3472_tps68470 mei_csi(O) mei_ace(O) tps68470_regulator int3403_thermal(+) intel_vsc(O) mei_vsc(O) int340x_thermal_zone clk_tps68470 v4l2_fwnode v4l2_async mei int3400_thermal acpi_tad intel_skl_int3472_discrete intel_hid acpi_thermal_rel sparse_keymap acpi_pad sch_fq_codel v4l2loopback(O) videodev mc msr parport_pc ppdev lp parport drm efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_sensor_custom hid_sensor_hub hid_generic intel_ishtp_hid nvme nvme_core nvme_common ahci video libahci crc32_pclmul psmouse ucsi_acpi thunderbolt intel_lpss_pci spi_intel_pci i2c_i801 intel_ish_ipc xhci_pci intel_lpss typec_ucsi spi_intel i2c_smbus intel_ishtp
Oct  5 08:17:05 u kernel: [  217.318660]  idma64 xhci_pci_renesas vmd typec i2c_hid_acpi i2c_hid hid wmi pinctrl_meteorlake
Oct 12 12:19:56 u kernel: [    1.667196] vmd 0000:00:0e.0: PCI host bridge to bus 10000:e0
Oct 12 12:19:56 u kernel: [    1.870276] vmd 0000:00:0e.0: Bound to PCI domain 10000
Oct 12 12:19:56 u kernel: [    1.968882] nvme nvme0: pci function 10000:e1:00.0
Oct 12 12:19:56 u kernel: [    1.968923] nvme 10000:e1:00.0: PCI INT A: no GSI
Oct 12 12:19:56 u kernel: [    1.973985] nvme nvme0: 18/0/0 default/read/poll queues
Oct 12 12:19:56 u kernel: [    1.977394] nvme0n1: p1 p2 p3
Oct 12 12:19:56 u kernel: [    3.904155] EXT4-fs (nvme0n1p3): mounted filesystem 4cf993b3-ed8f-4f23-a995-bb7955c24e12 ro with ordered data mode. Quota mode: none.
Oct 12 12:19:56 u kernel: [    4.052327] block nvme0n1: the capability attribute has been deprecated.
Oct 12 12:19:56 u kernel: [    4.196034] EXT4-fs (nvme0n1p3): re-mounted 4cf993b3-ed8f-4f23-a995-bb7955c24e12 r/w. Quota mode: none.
```

The NVMe timeout issue could possibly be caused by HW/BIOS/something else.

---

Adding a small delay in nvme_irq() eases this issue; it becomes hard to reproduce with the delay in place:

```diff
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 2f57da12d9836..b1a6a02e5cce8 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -28,6 +28,7 @@
 #include <linux/io-64-nonatomic-hi-lo.h>
 #include <linux/sed-opal.h>
 #include <linux/pci-p2pdma.h>
+#include <linux/delay.h>
 
 #include "trace.h"
 #include "nvme.h"
@@ -1081,6 +1082,7 @@ static irqreturn_t nvme_irq(int irq, void *data)
 	struct nvme_queue *nvmeq = data;
 	DEFINE_IO_COMP_BATCH(iob);
 
+	udelay(5);
 	if (nvme_poll_cq(nvmeq, &iob)) {
 		if (!rq_list_empty(iob.req_list))
 			nvme_pci_complete_batch(&iob);
```

---

What happens if you try intremap=off?
Created attachment 305413 [details]
dmesg intrmap=off
With intrmap=off, there are still nvme timeout messages:
```
[   35.902221] nvme nvme0: I/O 768 QID 4 timeout, completion polled
[   66.366225] nvme nvme0: I/O 771 QID 4 timeout, completion polled
[  130.253142] nvme nvme0: I/O 896 QID 6 timeout, completion polled
[  130.253168] nvme nvme0: I/O 768 QID 7 timeout, completion polled
[  130.293240] nvme nvme0: I/O 614 QID 2 timeout, completion polled
[  161.267969] nvme nvme0: I/O 1 QID 5 timeout, completion polled
```
Typo in my comment above: it's intremap; I passed the correct kernel cmdline.

```
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.6.0 root=UUID=2da08224-2920-429a-95e8-755a71fdc71d ro i915.force_probe=7d55 intremap=off quiet splash vt.handoff=7
```

---

Some more findings. I'm not sure how it relates to the issue, but the delay works:

```diff
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3f0c9ee09a12..24993cfd11fb 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -28,6 +28,7 @@
 #include <linux/io-64-nonatomic-hi-lo.h>
 #include <linux/sed-opal.h>
 #include <linux/pci-p2pdma.h>
+#include <linux/delay.h>
 
 #include "trace.h"
 #include "nvme.h"
@@ -990,6 +991,7 @@ static inline bool nvme_cqe_pending(struct nvme_queue *nvmeq)
 {
 	struct nvme_completion *hcqe = &nvmeq->cqes[nvmeq->cq_head];
 
+	udelay(5);
 	return (le16_to_cpu(READ_ONCE(hcqe->status)) & 1) == nvmeq->cq_phase;
 }
```

---

Comment #14 (AceLan Kao): I'm not sure whose fault it is, but I get many logs showing that the value differs between the two reads:

```
[    1.658977] drivers/nvme/host/pci.c:999] nvme_cqe_pending() before = 0x0000 after = 0x0001
[    3.775746] drivers/nvme/host/pci.c:999] nvme_cqe_pending() before = 0x0000 after = 0x0001
[    3.775773] drivers/nvme/host/pci.c:999] nvme_cqe_pending() before = 0x0000 after = 0x0001
[    3.780746] EXT4-fs (nvme0n1p2): mounted filesystem 2da08224-2920-429a-95e8-755a71fdc71d ro with ordered data mode. Quota mode: none.
[    3.813830] drivers/nvme/host/pci.c:999] nvme_cqe_pending() before = 0x0000 after = 0x0001
[    3.829130] drivers/nvme/host/pci.c:999] nvme_cqe_pending() before = 0x0000 after = 0x0001
[    3.829308] drivers/nvme/host/pci.c:999] nvme_cqe_pending() before = 0x0000 after = 0x0001
[    3.834134] drivers/nvme/host/pci.c:999] nvme_cqe_pending() before = 0x0000 after = 0x0001
[    3.868118] drivers/nvme/host/pci.c:999] nvme_cqe_pending() before = 0x0000 after = 0x0001
```

```diff
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3f0c9ee09a12..6564a1afae99 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -28,6 +28,7 @@
 #include <linux/io-64-nonatomic-hi-lo.h>
 #include <linux/sed-opal.h>
 #include <linux/pci-p2pdma.h>
+#include <linux/delay.h>
 
 #include "trace.h"
 #include "nvme.h"
@@ -989,7 +990,13 @@ static void nvme_pci_complete_batch(struct io_comp_batch *iob)
 static inline bool nvme_cqe_pending(struct nvme_queue *nvmeq)
 {
 	struct nvme_completion *hcqe = &nvmeq->cqes[nvmeq->cq_head];
+	int before, after;
 
+	before = READ_ONCE(hcqe->status);
+	udelay(5);
+	after = READ_ONCE(hcqe->status);
+	if (before != after)
+		printk("%s:%d] %s() before = 0x%04X after = 0x%04X\n", __FILE__, __LINE__, __func__, before, after);
 	return (le16_to_cpu(READ_ONCE(hcqe->status)) & 1) == nvmeq->cq_phase;
 }
```

---

Comment #15 (Keith Busch): Sounds like vmd is reordering msix and cqe writes. Is relaxed ordering enabled on any of the devices in the hierarchy? If so, try turning it off.
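(As a quick check for that question: the relaxed ordering state is visible as the RlxdOrd flag in each function's PCIe Device Control register. A generic lspci recipe, not taken from this report:)

```sh
# "RlxdOrd+" in DevCtl means the function may set the Relaxed Ordering
# attribute in the TLPs it issues; "RlxdOrd-" means it may not.
for dev in 10000:e0:06.0 10000:e1:00.0; do
    sudo lspci -vvv -s "$dev" | grep -E 'DevCtl:|RlxdOrd'
done
```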
---

Comment #16 (AceLan Kao): The Relaxed Ordering bit has been set up on the PCI bridge, and modifying it from a pci/quirk doesn't work; the issue persists. BTW, adding mb() before READ_ONCE(hcqe->status) doesn't work, either.

```diff
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index eeec1d6f9023..f0f4ff963e96 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4502,6 +4502,7 @@ DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x2f0d, PCI_CLASS_NOT_DEFINED
 			      quirk_relaxedordering_disable);
 DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x2f0e, PCI_CLASS_NOT_DEFINED, 8,
 			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x7e4d, quirk_relaxedordering_disable);
 
 /*
  * The AMD ARM A1100 (aka "SEATTLE") SoC has a bug in its PCIe Root Complex
```

```
$ sudo dmesg | grep Relaxed
[    0.857153] pci 10000:e0:06.0: Disable Relaxed Ordering Attributes to avoid PCIe Completion erratum

10000:e0:06.0 PCI bridge [0604]: Intel Corporation Device [8086:7e4d] (rev 10)
        Kernel driver in use: pcieport
10000:e1:00.0 Non-Volatile memory controller [0108]: KIOXIA Corporation Device [1e0f:000c]
        Subsystem: KIOXIA Corporation Device [1e0f:0001]
        Kernel driver in use: nvme

u@u-XPS-13-9340:~/linux$ lspci -xxxs 10000:e0:06.0
10000:e0:06.0 PCI bridge: Intel Corporation Device 7e4d (rev 10)
00: 86 80 4d 7e 06 05 10 00 10 00 04 06 00 00 81 00
10: 00 00 00 00 00 00 00 00 00 e1 e1 00 f0 00 00 20
20: 00 7e 00 7e f1 ff 01 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 00 01 02 00

u@u-XPS-13-9340:~/linux$ lspci -xxxs 10000:e1:00.0
10000:e1:00.0 Non-Volatile memory controller: KIOXIA Corporation Device 000c
00: 0f 1e 0c 00 06 04 10 00 00 02 08 01 00 00 00 00
10: 04 00 00 7e 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 0f 1e 01 00
30: 00 00 00 00 80 00 00 00 00 00 00 00 00 01 00 00
```
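(Context on why such a quirk can print its message yet change nothing observable: quirk_relaxedordering_disable() only sets a flag on the matched device; the bit that matters is cleared later, during enumeration, and only on endpoints whose root port carries the flag. A sketch paraphrased from drivers/pci/quirks.c and drivers/pci/probe.c; details vary by kernel version, and how bridges inside a VMD domain are typed may matter here.)

```c
/* drivers/pci/quirks.c: the quirk only marks the device. */
static void quirk_relaxedordering_disable(struct pci_dev *dev)
{
	dev->dev_flags |= PCI_DEV_FLAGS_NO_RELAXED_ORDERING;
	pci_info(dev, "Disable Relaxed Ordering Attributes to avoid PCIe Completion erratum\n");
}

/*
 * drivers/pci/probe.c (simplified): at device-configure time the Enable
 * Relaxed Ordering bit is cleared on the *endpoint*, telling it not to
 * set the RO attribute in its TLPs, but only when pcie_find_root_port()
 * resolves to a root port that carries the quirk flag.
 */
static void pci_configure_relaxed_ordering(struct pci_dev *dev)
{
	struct pci_dev *root = pcie_find_root_port(dev);

	if (root && (root->dev_flags & PCI_DEV_FLAGS_NO_RELAXED_ORDERING))
		pcie_capability_clear_word(dev, PCI_EXP_DEVCTL,
					   PCI_EXP_DEVCTL_RELAX_EN);
}
```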
---

Adding a delay in vmd_irq() also works, but I still see the value change with the patch from comment #14 applied on nvme_pci_complete_batch(). There are no nvme timeout logs in dmesg after bootup (0/10):

```diff
diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index ad56df98b8e6..ac0a8c0d40ce 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -16,6 +16,7 @@
 #include <linux/srcu.h>
 #include <linux/rculist.h>
 #include <linux/rcupdate.h>
+#include <linux/delay.h>
 
 #include <asm/irqdomain.h>
 
@@ -680,6 +681,7 @@ static irqreturn_t vmd_irq(int irq, void *data)
 	struct vmd_irq *vmdirq;
 	int idx;
 
+	udelay(5);
 	idx = srcu_read_lock(&irqs->srcu);
 	list_for_each_entry_rcu(vmdirq, &irqs->irq_list, node)
 		generic_handle_irq(vmdirq->virq);
```

---

(In reply to AceLan Kao from comment #16)
> The Relaxed Ordering bit has been set up on the PCI bridge, and modifying
> it from a pci/quirk doesn't work; the issue persists.
> [...]

Are you saying the Relaxed Ordering bit is set on your platform by default? I checked multiple server platforms (not MTL) and they all have this register cleared; it is hardwired to 0b by the BIOS. Can you try the setpci command to change it? You can also try disabling it from the BIOS; maybe MTL has this register enabled for all devices, i.e. EDKII -> Socket Configuration -> IIO Configuration -> PCIe Relaxed Ordering.

---

Comment #19 (AceLan Kao): Yes, Relaxed Ordering is set by default, and I can't disable it through a pci/quirk. Tried setpci, but still can't modify the value.

```
u@u-XPS-13-9340:~$ lspci -xxxs 10000:e0:06.0
10000:e0:06.0 PCI bridge: Intel Corporation Device 7e4d (rev 10)
00: 86 80 4d 7e 06 05 10 00 10 00 04 06 00 00 81 00
10: 00 00 00 00 00 00 00 00 00 e1 e1 00 f0 00 00 20
20: 00 7e 00 7e f1 ff 01 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 00 01 02 00

u@u-XPS-13-9340:~$ sudo setpci -s 10000:e0:06.0 0x08.B=0x00
u@u-XPS-13-9340:~$ lspci -xxxs 10000:e0:06.0
10000:e0:06.0 PCI bridge: Intel Corporation Device 7e4d (rev 10)
00: 86 80 4d 7e 06 05 10 00 10 00 04 06 00 00 81 00
10: 00 00 00 00 00 00 00 00 00 e1 e1 00 f0 00 00 20
20: 00 7e 00 7e f1 ff 01 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 00 01 02 00
```

I think there is no such option in Dell's BIOS to disable it; should I ask Dell to fix it?
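(A note on the setpci attempt above: config offset 0x08 holds the read-only Revision ID / Class Code bytes, so a write there is expected to be silently dropped. The Enable Relaxed Ordering bit is bit 4 of the Device Control register, at offset 0x08 into the PCI Express *capability*, which setpci can address symbolically. A sketch, assuming the standard capability layout; whether the hardware accepts the write is a separate question:)

```sh
# Read Device Control (16-bit, at CAP_EXP+0x08); bit 4 = Enable Relaxed Ordering.
sudo setpci -s 10000:e1:00.0 CAP_EXP+0x8.w

# Clear only bit 4, leaving the other DevCtl bits untouched
# (setpci's value:mask form writes just the masked bits):
sudo setpci -s 10000:e1:00.0 CAP_EXP+0x8.w=0x0000:0x0010
```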
---

(In reply to Keith Busch from comment #15)
> Sounds like vmd is reordering msix and cqe writes. Is relaxed ordering
> enabled on any of the devices in the hierarchy? If so, try turning it off.

Does mb() have the same effect?

---

Adding mb() before reading hcqe->status still shows the issue.

---

(In reply to AceLan Kao from comment #19)
> Yes, Relaxed Ordering is set by default, and I can't disable it through a
> pci/quirk. Tried setpci, but still can't modify the value.
> [...]
> I think there is no such option in Dell's BIOS to disable it; should I ask
> Dell to fix it?

You can try asking them.

---

With a test BIOS that disables relaxed ordering by default, we still see this issue. As the issue only happens in VMD mode, the question becomes more like: why does vmd raise the interrupt before the nvme data is visible?

---

(In reply to AceLan Kao from comment #14)
> I'm not sure whose fault it is, but I get many logs showing that the value
> differs between the two reads.
> [...]

I am not an NVMe expert, but if there is a HW bug, what would be the best place to add a workaround? I.e., adding udelay(5) in vmd_irq() is not a good option. We need some input here on the WA.

---

Arbitrary delays make for terrible workarounds. I'd say just disable VMD and call it a day. The device appears to be muxing interrupts without providing appropriate notification.

---

Comment #25 (Kai-Heng Feng): Can this series help?
https://lore.kernel.org/linux-pci/1573040408-3831-1-git-send-email-jonathan.derrick@intel.com/

---

Kai-Heng, the series is old and I can't resolve the conflict in the 2nd patch, so I skipped that one. And it looks like I can't boot the machine with the partially patched kernel :(

Keith, could we say this is an issue in the VMD device, given that we don't encounter it in AHCI mode? The issue occurs with many different brands of nvme drives, and we only see it on MTL platforms. I also heard that Windows encountered the same issue, and fixed it by adding a dummy read of an nvme register in the nvme ISR, claiming that it flushes the I/O. We tried the same workaround on Linux, but a dummy read is expensive and probably takes longer than 5 microseconds, so I doubt whether it's really the dummy read that flushes the I/O. Anyway, we are still looking into this issue and can't just disable VMD.

---

Yeah, that's why I am asking Keith and Nirmal for their input.

---

(In reply to Kai-Heng Feng from comment #25)
> Can this series help?
> https://lore.kernel.org/linux-pci/1573040408-3831-1-git-send-email-jonathan.derrick@intel.com/

I don't think so. That was about performance, not correctness.

---

Based on this patch:
https://lore.kernel.org/r/20240903025544.286223-1-kai.heng.feng@canonical.com
this hardware erratum may be related:
https://edc.intel.com/content/www/us/en/design/products/platforms/details/meteor-lake-u-p/core-ultra-processor-specification-update/errata-details/#MTL016
The patch needs some revision, but I'm adding the erratum link here for the record.
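(For illustration, a minimal sketch of the dummy-read workaround described above. The assumption, not confirmed by the report, is that the read targets the controller's CSTS register from the interrupt handler: a non-posted MMIO read cannot complete until posted writes ahead of it, such as the CQE DMA write, have landed. Not a tested patch.)

```c
/* Hypothetical sketch against drivers/nvme/host/pci.c. */
static irqreturn_t nvme_irq(int irq, void *data)
{
	struct nvme_queue *nvmeq = data;
	DEFINE_IO_COMP_BATCH(iob);

	/*
	 * Dummy MMIO read of the controller status register. PCIe
	 * ordering forces posted writes already in flight to be visible
	 * before this read returns, so the CQE should be observable by
	 * the time we poll it. MMIO reads through VMD are slow, which
	 * matches the "more expensive than udelay(5)" observation above.
	 */
	(void)readl(nvmeq->dev->bar + NVME_REG_CSTS);

	if (nvme_poll_cq(nvmeq, &iob)) {
		if (!rq_list_empty(iob.req_list))
			nvme_pci_complete_batch(&iob);
		return IRQ_HANDLED;
	}
	return IRQ_NONE;
}
```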