Created attachment 285027 [details]
dmesg

We have a laptop equipped with an i5-8265U CPU and a Kingston PCIe NVMe SSD. We tested mainline kernel 5.3.0+ on it: the system takes about 60 seconds to resume. The suspend mode is s2idle.

Here is the information on the NVMe SSD:

$ sudo lspci -nnvs 02:00.0
02:00.0 Non-Volatile memory controller [0108]: Kingston Technologies Device [2646:5008] (rev 01) (prog-if 02 [NVM Express])
	Subsystem: Kingston Technologies Device [2646:5008]
	Flags: bus master, fast devsel, latency 0, IRQ 16, NUMA node 0
	Memory at 81100000 (64-bit, non-prefetchable) [size=16K]
	I/O ports at 3000 [size=8]
	Capabilities: [80] Express Endpoint, MSI 00
	Capabilities: [d0] MSI-X: Enable+ Count=9 Masked-
	Capabilities: [e0] MSI: Enable- Count=1/8 Maskable- 64bit+
	Capabilities: [f8] Power Management version 3
	Capabilities: [100] Vendor Specific Information: ID=1556 Rev=1 Len=008 <?>
	Capabilities: [108] Latency Tolerance Reporting
	Capabilities: [110] L1 PM Substates
	Capabilities: [128] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [200] Advanced Error Reporting
	Capabilities: [300] #19
	Kernel driver in use: nvme
	Kernel modules: nvme

After the system resumes, there is a gap of about 60 seconds between "ata3: SATA link down (SStatus 4 SControl 300)" and "nvme nvme0: I/O 12 QID 0 timeout, completion polled". The kernel then keeps printing "nvme nvme0: I/O ... timeout, completion polled" and the system usually hangs:

[   84.002162] ata3: SATA link down (SStatus 4 SControl 300)
[  144.507485] nvme nvme0: I/O 12 QID 0 timeout, completion polled
[  144.508712] OOM killer enabled.
[  144.508714] Restarting tasks ... done.
[  144.539322] PM: suspend exit
[  147.542914] wlo1: authenticate with dc:9f:db:70:1f:a0
[  147.544185] wlo1: send auth to dc:9f:db:70:1f:a0 (try 1/3)
[  147.695868] wlo1: authenticated
[  147.696964] wlo1: associate with dc:9f:db:70:1f:a0 (try 1/3)
[  147.700996] wlo1: RX AssocResp from dc:9f:db:70:1f:a0 (capab=0x31 status=0 aid=3)
[  147.712441] wlo1: associated
[  174.715534] nvme nvme0: I/O 978 QID 3 timeout, completion polled
[  174.715577] nvme nvme0: I/O 393 QID 4 timeout, completion polled
[  174.715639] nvme nvme0: I/O 515 QID 8 timeout, completion polled

The timeout messages do not appear between boot and the first suspend; they only occur after resume.
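For reference, the active suspend mode can be confirmed via the mem_sleep sysfs file; the selected mode is shown in brackets, so on this machine the output should look something like:

$ cat /sys/power/mem_sleep
[s2idle] deep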
Created attachment 285029 [details]
dmesg with dump_stack in nvme_timeout()

I added a dump_stack() to see the backtrace:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 732d5b63ec05..f4650c7d516f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1266,6 +1266,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, completion polled\n",
 			 req->tag, nvmeq->qid);
+		dump_stack();
 		return BLK_EH_DONE;
 	}
 
The log shows the timeout comes from blk_mq_timeout_work:

[  140.412451] nvme nvme0: I/O 0 QID 0 timeout, completion polled
[  140.412459] CPU: 5 PID: 299 Comm: kworker/5:1H Not tainted 5.3.0+ #1
[  140.412462] Hardware name: ASUSTeK COMPUTER INC. ZenBook UX431FL/UX431FL, BIOS UX431FL.300 06/17/2019
[  140.412472] Workqueue: kblockd blk_mq_timeout_work
[  140.412475] Call Trace:
[  140.412491]  dump_stack+0x46/0x60
[  140.412501]  nvme_timeout.cold.77+0x136/0x191 [nvme]
[  140.412511]  ? sched_clock_cpu+0xc/0xb0
[  140.412514]  ? update_rq_clock.part.92+0xf/0x60
[  140.412519]  ? cpumask_next+0x16/0x20
[  140.412524]  blk_mq_check_expired+0x120/0x140
[  140.412530]  blk_mq_queue_tag_busy_iter+0x195/0x2e0
[  140.412535]  ? blk_poll+0x340/0x340
[  140.412538]  ? blk_poll+0x340/0x340
[  140.412542]  blk_mq_timeout_work+0x4f/0x120
[  140.412548]  process_one_work+0x1f0/0x3b0
[  140.412552]  worker_thread+0x28/0x3c0
[  140.412556]  ? process_one_work+0x3b0/0x3b0
[  140.412561]  kthread+0x10b/0x130
[  140.412566]  ? kthread_create_on_node+0x60/0x60
[  140.412571]  ret_from_fork+0x1f/0x30
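For context, "completion polled" means nvme_timeout() found the command already completed when it polled the completion queue directly, i.e. the device finished the I/O but its interrupt never arrived. A simplified sketch of that check, paraphrased from memory of 5.3-era drivers/nvme/host/pci.c (not a verbatim excerpt):

static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
{
	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
	struct nvme_queue *nvmeq = iod->nvmeq;
	struct nvme_dev *dev = nvmeq->dev;

	/*
	 * Did we miss an interrupt?  Poll the completion queue with the
	 * queue's IRQ disabled; if the tag is found completed there, the
	 * request is finished and no error handling is needed.
	 */
	if (nvme_poll_irqdisable(nvmeq, req->tag)) {
		dev_warn(dev->ctrl.device,
			 "I/O %d QID %d timeout, completion polled\n",
			 req->tag, nvmeq->qid);
		return BLK_EH_DONE;
	}

	/* ... otherwise the driver escalates (abort / controller reset) ... */
	return BLK_EH_RESET_TIMER;	/* placeholder for the escalation paths */
}

Since every timeout in our logs resolves via this polled path, the completions are being written by the device but the interrupts are not being delivered.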
I checked the power status of the NVMe SSD.

Before suspend:

grep: /sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/autosuspend_delay_ms: Input/output error
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/control:on
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/runtime_active_time:324462
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/runtime_status:active
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/runtime_suspended_time:0
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup:disabled
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_abort_count:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_active:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_active_count:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_count:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_expire_count:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_last_time_ms:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_max_time_ms:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_total_time_ms:

After resume:

grep: /sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/autosuspend_delay_ms: Input/output error
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/control:on
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/runtime_active_time:423800
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/runtime_status:active
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/runtime_suspended_time:0
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup:disabled
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_abort_count:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_active:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_active_count:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_count:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_expire_count:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_last_time_ms:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_max_time_ms:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_total_time_ms:

The two dumps are identical except for runtime_active_time, which has increased.
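For reference, the dumps above were presumably collected with something like:

$ grep . /sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/*

(The grep error on autosuspend_delay_ms is expected: reading that attribute returns -EIO when the driver does not use runtime-PM autosuspend.)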
Created attachment 285051 [details]
NVMe SSD information from smartctl
I added a quirk to make the NVMe SSD take the simple suspend path:

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1ede1763a5ee..63395f800dcc 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2298,6 +2298,10 @@ static const struct nvme_core_quirk_entry core_quirks[] = {
 		.vid = 0x14a4,
 		.fr = "22301111",
 		.quirks = NVME_QUIRK_SIMPLE_SUSPEND,
+	},
+	{
+		.vid = 0x2646,
+		.quirks = NVME_QUIRK_SIMPLE_SUSPEND,
 	}
 };

With this quirk applied, the issue can no longer be reproduced. However, it is just a shot in the dark. Is there a way to diagnose this further and confirm that the quirk is correct for this device? That would be helpful for the commit message. (See the sketch below for what the quirk changes.)
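For background: entries in core_quirks[] are matched against the controller's PCI vendor ID (and optionally firmware revision / model number) from the Identify Controller data, and NVME_QUIRK_SIMPLE_SUSPEND makes nvme_suspend() skip the NVMe power-state handling and do a full shutdown/reset cycle instead. A rough sketch of the branch, paraphrased from 5.3-era drivers/nvme/host/pci.c (not verbatim; the exact conditions vary by kernel version):

static int nvme_suspend(struct device *dev)
{
	struct nvme_dev *ndev = pci_get_drvdata(to_pci_dev(dev));
	struct nvme_ctrl *ctrl = &ndev->ctrl;

	/*
	 * Fall back to the "simple" path (shut the controller down now and
	 * reset it on resume) if the platform resumes via firmware, the
	 * controller reports no non-operational power states, or the
	 * controller carries the NVME_QUIRK_SIMPLE_SUSPEND quirk.
	 */
	if (pm_suspend_via_firmware() || !ctrl->npss ||
	    (ctrl->quirks & NVME_QUIRK_SIMPLE_SUSPEND))
		return nvme_disable_prepare_reset(ndev, true);

	/* ... otherwise the NPSS-based power-state path continues here ... */
	return 0;	/* placeholder */
}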
Created attachment 285109 [details]
dmesg shows no more IRQs from the NVMe SSD after resume

To understand the IRQ handling for the NVMe device, I added debug messages on top of the mainline kernel:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6b4d7b064b38..8b2d2dbf5455 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1019,6 +1019,8 @@ static irqreturn_t nvme_irq(int irq, void *data)
 	struct nvme_queue *nvmeq = data;
 	irqreturn_t ret = IRQ_NONE;
 	u16 start, end;
+	static u64 idx = 0;
+	dev_err(nvmeq->dev->dev, "%s: idx=%llu start\n", __func__, idx);
 
 	/*
 	 * The rmb/wmb pair ensures we see all updates from a previous run of
@@ -1033,9 +1035,11 @@ static irqreturn_t nvme_irq(int irq, void *data)
 
 	if (start != end) {
 		nvme_complete_cqes(nvmeq, start, end);
-		return IRQ_HANDLED;
+		ret = IRQ_HANDLED;
 	}
 
+	dev_err(nvmeq->dev->dev, "%s: idx=%llu end\n", __func__, idx);
+	idx++;
 	return ret;
 }

There are a lot of "nvme_irq ... start/end" messages before suspend:

[   54.062745] nvme 0000:02:00.0: nvme_irq: idx=20216 start
[   54.062754] nvme 0000:02:00.0: nvme_irq: idx=20216 end

But there are none after resume; the log then only shows "nvme ... timeout, completion polled" continuously:

[  162.662370] nvme nvme0: I/O 112 QID 1 timeout, completion polled

This means the IRQ handler nvme_irq() is no longer invoked after resume. (A note on the counter used above follows.)
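One caveat with the instrumentation above: the static idx counter is shared by all queues' IRQ handlers with no synchronization, so sequence numbers can race on a multi-queue device. A hedged variant using an atomic counter (my suggestion, not part of the attachment):

#include <linux/atomic.h>

/*
 * One counter shared by every queue's IRQ handler.  atomic64_inc_return()
 * gives each invocation a unique sequence number even when interrupts for
 * several queues fire concurrently on different CPUs.
 */
static atomic64_t nvme_irq_seq = ATOMIC64_INIT(0);

static irqreturn_t nvme_irq(int irq, void *data)
{
	struct nvme_queue *nvmeq = data;
	u64 idx = atomic64_inc_return(&nvme_irq_seq);

	dev_err(nvmeq->dev->dev, "%s: idx=%llu start\n", __func__, idx);
	/* ... existing completion handling as in the diff above ... */
	dev_err(nvmeq->dev->dev, "%s: idx=%llu end\n", __func__, idx);
	return IRQ_HANDLED;	/* placeholder; keep the original return logic */
}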
Created attachment 285111 [details]
dmesg shows IRQs from the NVMe SSD both before suspend and after resume

I combined the quirk in comment #4 with the debug messages in comment #5. The debug messages now appear both before suspend and after resume, which means the IRQs from the NVMe device come back after resume when the simple-suspend quirk is applied.
Created attachment 285145 [details]
Added debug messages without the NVME_QUIRK_SIMPLE_SUSPEND quirk

I added more debug messages (see the attached file, based on the mainline kernel) to trace the suspend & resume code path. The suspend & resume path shows up in the dmesg log as:

[  178.363854] nvme 0000:02:00.0: nvme_irq: idx=20544 start
[  178.363861] nvme 0000:02:00.0: nvme_irq: idx=20544 end
[  178.363868] Filesystems sync: 0.003 seconds
[  178.365124] Freezing user space processes ... (elapsed 0.003 seconds) done.
[  178.368170] OOM killer disabled.
[  178.368171] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[  178.369360] printk: Suspending console(s) (use no_console_suspend to debug)
[  178.370370] wlo1: deauthenticating from dc:9f:db:70:1f:b0 by local choice (Reason: 3=DEAUTH_LEAVING)
[  178.483304] nvme 0000:02:00.0: nvme_suspend: freezed
[  178.483437] nvme 0000:02:00.0: nvme_irq: idx=20545 start
[  178.483448] nvme 0000:02:00.0: nvme_irq: idx=20545 end
[  178.483479] nvme 0000:02:00.0: nvme_suspend: ndev->last_ps 0, ctrl->npss = 4
[  178.483569] nvme 0000:02:00.0: nvme_irq: idx=20546 start
[  178.483577] nvme 0000:02:00.0: nvme_irq: idx=20546 end
[  178.483602] nvme 0000:02:00.0: __nvme_submit_sync_cmd: request result 0
[  178.483605] nvme 0000:02:00.0: nvme_suspend: protocol specific
[  178.499057] ACPI: EC: interrupt blocked
[  196.930344] ACPI: EC: interrupt unblocked
[  197.045281] pci 0000:01:00.0: Enabling HDA controller
[  197.061208] iwlwifi 0000:00:14.3: Applying debug destination EXTERNAL_DRAM
[  197.061341] ACPI: button: The lid device is not compliant to SW_LID.
[  197.177426] iwlwifi 0000:00:14.3: Applying debug destination EXTERNAL_DRAM
[  197.242784] iwlwifi 0000:00:14.3: FW already configured (0) - re-configuring
[  197.374262] ata3: SATA link down (SStatus 4 SControl 300)
[  257.103510] nvme nvme0: I/O 6 QID 0 timeout, completion polled
[  257.103606] nvme 0000:02:00.0: __nvme_submit_sync_cmd: request result 0
[  257.103617] nvme 0000:02:00.0: nvme_resume: ndev->last_ps=0
[  257.104774] OOM killer enabled.
[  257.104776] Restarting tasks ... done.
[  257.132452] PM: suspend exit
[  259.938464] wlo1: authenticate with dc:9f:db:70:1f:b0
...
[  260.097035] wlo1: associated
[  287.311528] nvme nvme0: I/O 700 QID 1 timeout, completion polled
[  287.311557] nvme nvme0: I/O 173 QID 4 timeout, completion polled

The NVMe device seems to suspend and resume as expected, but no IRQ arrives from the NVMe SSD after resume.
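The "nvme_suspend: ndev->last_ps 0, ctrl->npss = 4" line corresponds to the NPSS-based suspend path: the driver saves the current power state and then asks the controller to enter its deepest power state via Set Features. A rough sketch of that sequence, paraphrased from 5.3-era drivers/nvme/host/pci.c (not verbatim):

/* Get/set the NVMe power state via the Power Management feature (FID 0x02). */
static int nvme_set_power_state(struct nvme_ctrl *ctrl, u32 ps)
{
	return nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, ps, NULL, 0, NULL);
}

static int nvme_suspend(struct device *dev)
{
	struct nvme_dev *ndev = pci_get_drvdata(to_pci_dev(dev));
	struct nvme_ctrl *ctrl = &ndev->ctrl;
	int ret;

	nvme_start_freeze(ctrl);	/* the "nvme_suspend: freezed" message */
	nvme_wait_freeze(ctrl);

	ret = nvme_get_power_state(ctrl, &ndev->last_ps);	/* last_ps = 0 */
	if (ret < 0)
		goto unfreeze;

	/* Enter the deepest (non-operational) power state: npss = 4 here. */
	ret = nvme_set_power_state(ctrl, ctrl->npss);
	/* ... on failure the driver falls back to nvme_disable_prepare_reset() ... */
unfreeze:
	nvme_unfreeze(ctrl);
	return ret;
}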
Created attachment 285383 [details]
Debug diff to skip setting NVMe's PCI_MSIX_ENTRY_CTRL_MASKBIT

After a long tracing session I found that PCI_MSIX_ENTRY_CTRL_MASKBIT is set during resume, which masks the NVMe device's IRQ signal. As a result there are no NVMe interrupts after resume, and NVMe I/O requests keep timing out.

I made a debug diff (attached) which skips setting PCI_MSIX_ENTRY_CTRL_MASKBIT during resume if the device is an NVMe device; with it, NVMe works fine after resume. Here are the debug messages in dmesg during resume:

[   45.665657] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x0c origianl value 0x901110d0, going to write 0x901110d1. However, 0000:02:00.0 is nvme, so skip
[   45.665660] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0x901110d0
[   45.665726] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x1c origianl value 0x0b881490, going to write 0x0b881491. However, 0000:02:00.0 is nvme, so skip
[   45.665727] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0x0b881490
[   45.666098] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x2c origianl value 0x0ca17434, going to write 0x0ca17435. However, 0000:02:00.0 is nvme, so skip
[   45.666100] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0x0ca17434
[   45.666396] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x3c origianl value 0x4815041c, going to write 0x4815041d. However, 0000:02:00.0 is nvme, so skip
[   45.666397] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0x4815041c
[   45.666452] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x4c origianl value 0x480c1b44, going to write 0x480c1b45. However, 0000:02:00.0 is nvme, so skip
[   45.666453] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0x480c1b44
[   45.666504] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x5c origianl value 0x440048a8, going to write 0x440048a9. However, 0000:02:00.0 is nvme, so skip
[   45.666505] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0x440048a8
[   45.666552] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x6c origianl value 0x10886064, going to write 0x10886065. However, 0000:02:00.0 is nvme, so skip
[   45.666553] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0x10886064
[   45.666601] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x7c origianl value 0x01802170, going to write 0x01802171. However, 0000:02:00.0 is nvme, so skip
[   45.666602] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0x01802170
[   45.666606] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x8c origianl value 0xd1264c94, going to write 0xd1264c95. However, 0000:02:00.0 is nvme, so skip
[   45.666608] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0xd1264c94

Observations:

1. The difference between the original "[PCI MSI/X] Mask bits" value and the value about to be written is exactly PCI_MSIX_ENTRY_CTRL_MASKBIT.
2. PCI_MSIX_ENTRY_CTRL_MASKBIT is set according to the "flag" argument of __pci_msix_desc_mask_irq() in drivers/pci/msi.c.
3. In __pci_restore_msix_state(), the whole entry->masked value is passed as the "flag" to msix_mask_irq() -> __pci_msix_desc_mask_irq().

So the question becomes: why is the whole entry->masked value passed directly as the "flag" to msix_mask_irq()? On this device entry->masked is non-zero even though the mask bit itself is clear, so the "flag" evaluates as true in the if-condition in __pci_msix_desc_mask_irq() and every vector ends up masked. See the sketch below.
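To make the failure mode concrete, here is a simplified sketch of the pre-fix logic in drivers/pci/msi.c as I remember it from 5.3 (paraphrased, not a verbatim excerpt; the nine vector-control offsets 0x0c..0x8c in the log match the device's MSI-X Count=9):

/*
 * desc->masked caches the whole MSI-X Vector Control dword, e.g.
 * 0x901110d0 in the log above: non-zero even though the mask bit
 * (PCI_MSIX_ENTRY_CTRL_MASKBIT, bit 0) is clear.
 */
u32 __pci_msix_desc_mask_irq(struct msi_desc *desc, u32 flag)
{
	u32 mask_bits = desc->masked;
	void __iomem *desc_addr = pci_msix_desc_addr(desc);

	mask_bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
	if (flag)		/* BUG: true for ANY non-zero flag value */
		mask_bits |= PCI_MSIX_ENTRY_CTRL_MASKBIT;
	writel(mask_bits, desc_addr + PCI_MSIX_ENTRY_VECTOR_CTRL);

	return mask_bits;
}

static void __pci_restore_msix_state(struct pci_dev *dev)
{
	struct msi_desc *entry;

	/* ... re-enable MSI-X, then restore each vector's mask state: */
	for_each_pci_msi_entry(entry, dev)
		msix_mask_irq(entry, entry->masked);	/* full dword as "flag" */
	/* ... */
}

If I recall the eventual fix correctly, it makes the test look only at the mask bit, i.e. if (flag & PCI_MSIX_ENTRY_CTRL_MASKBIT), so stray vendor-specific bits in the cached Vector Control word no longer mask the vector on resume.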
Created attachment 285385 [details]
dmesg of comment #8
I sent a new patch [1] to fix the issue described in comment 8, which is the root cause of this ticket. Once the new patch is merged, we can revert commit 19ea025e1d28 ("nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T") in the mainline kernel.

[1] https://patchwork.kernel.org/patch/11178729/
The patch has been merged as commit 253eaf4faaaa ("PCI/MSI: Fix incorrect MSI-X masking on resume") [1] in the linux-next repository. I also sent another patch [2] to revert the previous quirk.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/drivers/pci/msi.c?id=253eaf4faaaadffeade9192a42f47c51d1b0c3f9
[2]: https://patchwork.kernel.org/patch/11220807/