Bug 204887

Summary: Resume issue: "nvme nvme0: I/O 12 QID 0 timeout, completion polled"
Product: Drivers Reporter: jian-hong
Component: Flash/Memory Technology Devices    Assignee: David Woodhouse (dwmw2)
Status: RESOLVED CODE_FIX    
Severity: high    
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 5.3.0 Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg
dmesg with dump_stack in nvme_timeout()
NVME SSD information by smartctl
dmesg shows no more IRQ from NVME SSD after resume
dmesg shows IRQ from NVME SSD both before suspend and after resume
Added debug message without NVME_QUIRK_SIMPLE_SUSPEND quirk
Debug difference to skip NVMe's PCI_MSIX_ENTRY_CTRL_MASKBIT setting
dmesg of comment #8

Description jian-hong 2019-09-17 08:30:19 UTC
Created attachment 285027 [details]
dmesg

We have a laptop equipped with an i5-8265U CPU and a Kingston PCIe NVMe SSD.  We tested mainline kernel 5.3.0+ on the laptop.  The system takes about 60 seconds to resume.  The suspend mode is s2idle.

Here is the information about the NVMe SSD:

$ sudo lspci -nnvs 02:00.0
02:00.0 Non-Volatile memory controller [0108]: Kingston Technologies Device [2646:5008] (rev 01) (prog-if 02 [NVM Express])
	Subsystem: Kingston Technologies Device [2646:5008]
	Flags: bus master, fast devsel, latency 0, IRQ 16, NUMA node 0
	Memory at 81100000 (64-bit, non-prefetchable) [size=16K]
	I/O ports at 3000 [size=8]
	Capabilities: [80] Express Endpoint, MSI 00
	Capabilities: [d0] MSI-X: Enable+ Count=9 Masked-
	Capabilities: [e0] MSI: Enable- Count=1/8 Maskable- 64bit+
	Capabilities: [f8] Power Management version 3
	Capabilities: [100] Vendor Specific Information: ID=1556 Rev=1 Len=008 <?>
	Capabilities: [108] Latency Tolerance Reporting
	Capabilities: [110] L1 PM Substates
	Capabilities: [128] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [200] Advanced Error Reporting
	Capabilities: [300] #19
	Kernel driver in use: nvme
	Kernel modules: nvme

There is a time gap of about 60 seconds between "[   84.002162] ata3: SATA link down (SStatus 4 SControl 300)" and "[  144.507485] nvme nvme0: I/O 12 QID 0 timeout, completion polled" after the system resumes.  It keeps showing messages like "nvme nvme0: I/O 978 QID 3 timeout, completion polled" and then the system usually hangs.

[   84.002162] ata3: SATA link down (SStatus 4 SControl 300)
[  144.507485] nvme nvme0: I/O 12 QID 0 timeout, completion polled
[  144.508712] OOM killer enabled.
[  144.508714] Restarting tasks ... done.
[  144.539322] PM: suspend exit
[  147.542914] wlo1: authenticate with dc:9f:db:70:1f:a0
[  147.544185] wlo1: send auth to dc:9f:db:70:1f:a0 (try 1/3)
[  147.695868] wlo1: authenticated
[  147.696964] wlo1: associate with dc:9f:db:70:1f:a0 (try 1/3)
[  147.700996] wlo1: RX AssocResp from dc:9f:db:70:1f:a0 (capab=0x31 status=0 aid=3)
[  147.712441] wlo1: associated
[  174.715534] nvme nvme0: I/O 978 QID 3 timeout, completion polled
[  174.715577] nvme nvme0: I/O 393 QID 4 timeout, completion polled
[  174.715639] nvme nvme0: I/O 515 QID 8 timeout, completion polled

The timeout messages do not appear between boot and the first suspend; they only occur after resume.
Comment 1 jian-hong 2019-09-17 09:15:14 UTC
Created attachment 285029 [details]
dmesg with dump_stack in nvme_timeout()

I added a dump_stack() call to see the backtrace:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 732d5b63ec05..f4650c7d516f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1266,6 +1266,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
                dev_warn(dev->ctrl.device,
                         "I/O %d QID %d timeout, completion polled\n",
                         req->tag, nvmeq->qid);
+               dump_stack();
                return BLK_EH_DONE;
        }


The log shows the timeout comes from blk_mq_timeout_work:

[  140.412451] nvme nvme0: I/O 0 QID 0 timeout, completion polled
[  140.412459] CPU: 5 PID: 299 Comm: kworker/5:1H Not tainted 5.3.0+ #1
[  140.412462] Hardware name: ASUSTeK COMPUTER INC. ZenBook UX431FL/UX431FL, BIOS UX431FL.300 06/17/2019
[  140.412472] Workqueue: kblockd blk_mq_timeout_work
[  140.412475] Call Trace:
[  140.412491]  dump_stack+0x46/0x60
[  140.412501]  nvme_timeout.cold.77+0x136/0x191 [nvme]
[  140.412511]  ? sched_clock_cpu+0xc/0xb0
[  140.412514]  ? update_rq_clock.part.92+0xf/0x60
[  140.412519]  ? cpumask_next+0x16/0x20
[  140.412524]  blk_mq_check_expired+0x120/0x140
[  140.412530]  blk_mq_queue_tag_busy_iter+0x195/0x2e0
[  140.412535]  ? blk_poll+0x340/0x340
[  140.412538]  ? blk_poll+0x340/0x340
[  140.412542]  blk_mq_timeout_work+0x4f/0x120
[  140.412548]  process_one_work+0x1f0/0x3b0
[  140.412552]  worker_thread+0x28/0x3c0
[  140.412556]  ? process_one_work+0x3b0/0x3b0
[  140.412561]  kthread+0x10b/0x130
[  140.412566]  ? kthread_create_on_node+0x60/0x60
[  140.412571]  ret_from_fork+0x1f/0x30
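
For context, "completion polled" means the command had in fact completed in the completion queue, but no interrupt was delivered to reap it; the timeout handler only finds the completion by polling.  Here is a paraphrased sketch of the relevant part of nvme_timeout() in v5.3 drivers/nvme/host/pci.c (not a verbatim copy; helper names are from memory):

static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
{
	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
	struct nvme_queue *nvmeq = iod->nvmeq;
	struct nvme_dev *dev = nvmeq->dev;
	...
	/*
	 * Did we miss an interrupt?  Poll the CQ with interrupts disabled;
	 * if the completion is already there, just warn and finish the request.
	 */
	if (nvme_poll_irqdisable(nvmeq, req->tag)) {
		dev_warn(dev->ctrl.device,
			 "I/O %d QID %d timeout, completion polled\n",
			 req->tag, nvmeq->qid);
		return BLK_EH_DONE;
	}
	...
}

So each of these messages points at a lost or blocked interrupt rather than at the device failing to complete the I/O.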
Comment 2 jian-hong 2019-09-18 08:30:44 UTC
I checked the power status of the NVMe SSD:

Before suspend:

grep: /sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/autosuspend_delay_ms: Input/output error
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/control:on
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/runtime_active_time:324462
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/runtime_status:active
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/runtime_suspended_time:0
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup:disabled
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_abort_count:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_active:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_active_count:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_count:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_expire_count:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_last_time_ms:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_max_time_ms:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_total_time_ms:

After resume:

grep: /sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/autosuspend_delay_ms: Input/output error
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/control:on
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/runtime_active_time:423800
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/runtime_status:active
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/runtime_suspended_time:0
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup:disabled
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_abort_count:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_active:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_active_count:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_count:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_expire_count:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_last_time_ms:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_max_time_ms:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_total_time_ms:

They look the same, except for the increased runtime_active_time.
Comment 3 jian-hong 2019-09-18 09:07:48 UTC
Created attachment 285051 [details]
NVME SSD information by smartctl
Comment 4 jian-hong 2019-09-23 03:18:34 UTC
I added a quirk to make the NVMe SSD take the simple suspend path:

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1ede1763a5ee..63395f800dcc 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2298,6 +2298,10 @@ static const struct nvme_core_quirk_entry core_quirks[] = {
                .vid = 0x14a4,
                .fr = "22301111",
                .quirks = NVME_QUIRK_SIMPLE_SUSPEND,
+       },
+       {
+               .vid = 0x2646,
+               .quirks = NVME_QUIRK_SIMPLE_SUSPEND,
        }
 };

With this change, the issue can no longer be reproduced.  However, it is just a shot in the dark.  Is there any way to diagnose this further to make sure the quirk is correct for the device?  That would be helpful for the commit message.
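
For reference, a paraphrased sketch of how NVME_QUIRK_SIMPLE_SUSPEND changes the suspend path in nvme_suspend() (drivers/nvme/host/pci.c around v5.3; helper names are from memory and may differ slightly in this exact tree):

static int nvme_suspend(struct device *dev)
{
	struct nvme_dev *ndev = pci_get_drvdata(to_pci_dev(dev));
	struct nvme_ctrl *ctrl = &ndev->ctrl;

	/*
	 * With the quirk set (or no NPSS states, or firmware-managed
	 * suspend), skip the "keep the controller enabled and only program
	 * a low NVMe power state" path and fully shut the controller down
	 * across suspend instead.
	 */
	if (pm_suspend_via_firmware() || !ctrl->npss ||
	    (ctrl->quirks & NVME_QUIRK_SIMPLE_SUSPEND)) {
		nvme_dev_disable(ndev, true);
		return 0;
	}

	/* ... otherwise save ndev->last_ps and set an NVMe power state ... */
}

In other words, with the quirk the controller is fully disabled and re-initialized across suspend/resume instead of being left running in a low-power state.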
Comment 5 jian-hong 2019-09-23 08:21:24 UTC
Created attachment 285109 [details]
dmesg shows no more IRQ from NVME SSD after resume

To understand the IRQ handling for the NVMe device, I added debug messages based on the mainline kernel:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6b4d7b064b38..8b2d2dbf5455 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1019,6 +1019,8 @@ static irqreturn_t nvme_irq(int irq, void *data)
        struct nvme_queue *nvmeq = data;
        irqreturn_t ret = IRQ_NONE;
        u16 start, end;
+       static u64 idx = 0;
+       dev_err(nvmeq->dev->dev, "%s: idx=%llu start\n", __func__, idx);
 
        /*
         * The rmb/wmb pair ensures we see all updates from a previous run of
@@ -1033,9 +1035,11 @@ static irqreturn_t nvme_irq(int irq, void *data)
 
        if (start != end) {
                nvme_complete_cqes(nvmeq, start, end);
-               return IRQ_HANDLED;
+               ret = IRQ_HANDLED;
        }
 
+       dev_err(nvmeq->dev->dev, "%s: idx=%llu end\n", __func__, idx);
+       idx++;
        return ret;
 }

There are a lot of "nvme_irq ... start/end" messages before suspend:

[   54.062745] nvme 0000:02:00.0: nvme_irq: idx=20216 start
[   54.062754] nvme 0000:02:00.0: nvme_irq: idx=20216 end

But there are no more after resume; only "nvme ... timeout, completion polled" messages show up continuously.

[  162.662370] nvme nvme0: I/O 112 QID 1 timeout, completion polled

That means nvme_irq, the IRQ handler, is not invoked after resume.
Comment 6 jian-hong 2019-09-23 08:26:04 UTC
Created attachment 285111 [details]
dmesg shows IRQ from NVME SSD both before suspend and after resume

I combined the quirk from comment #4 with the debug messages from comment #5.  The debug messages show up both before suspend and after resume.

That means the IRQs from the NVMe device come back after resume.
Comment 7 jian-hong 2019-09-24 08:57:40 UTC
Created attachment 285145 [details]
Added debug message without NVME_QUIRK_SIMPLE_SUSPEND quirk

I added more debug messages (see the attached file), based on the mainline kernel, to trace the suspend & resume code path.

The suspend & resume path shows up in the dmesg log as follows:

[  178.363854] nvme 0000:02:00.0: nvme_irq: idx=20544 start
[  178.363861] nvme 0000:02:00.0: nvme_irq: idx=20544 end
[  178.363868] Filesystems sync: 0.003 seconds
[  178.365124] Freezing user space processes ... (elapsed 0.003 seconds) done.
[  178.368170] OOM killer disabled.
[  178.368171] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[  178.369360] printk: Suspending console(s) (use no_console_suspend to debug)
[  178.370370] wlo1: deauthenticating from dc:9f:db:70:1f:b0 by local choice (Reason: 3=DEAUTH_LEAVING)
[  178.483304] nvme 0000:02:00.0: nvme_suspend: freezed
[  178.483437] nvme 0000:02:00.0: nvme_irq: idx=20545 start
[  178.483448] nvme 0000:02:00.0: nvme_irq: idx=20545 end
[  178.483479] nvme 0000:02:00.0: nvme_suspend: ndev->last_ps 0, ctrl->npss = 4
[  178.483569] nvme 0000:02:00.0: nvme_irq: idx=20546 start
[  178.483577] nvme 0000:02:00.0: nvme_irq: idx=20546 end
[  178.483602] nvme 0000:02:00.0: __nvme_submit_sync_cmd: request result 0
[  178.483605] nvme 0000:02:00.0: nvme_suspend: protocol specific
[  178.499057] ACPI: EC: interrupt blocked
[  196.930344] ACPI: EC: interrupt unblocked
[  197.045281] pci 0000:01:00.0: Enabling HDA controller
[  197.061208] iwlwifi 0000:00:14.3: Applying debug destination EXTERNAL_DRAM
[  197.061341] ACPI: button: The lid device is not compliant to SW_LID.
[  197.177426] iwlwifi 0000:00:14.3: Applying debug destination EXTERNAL_DRAM
[  197.242784] iwlwifi 0000:00:14.3: FW already configured (0) - re-configuring
[  197.374262] ata3: SATA link down (SStatus 4 SControl 300)
[  257.103510] nvme nvme0: I/O 6 QID 0 timeout, completion polled
[  257.103606] nvme 0000:02:00.0: __nvme_submit_sync_cmd: request result 0
[  257.103617] nvme 0000:02:00.0: nvme_resume: ndev->last_ps=0
[  257.104774] OOM killer enabled.
[  257.104776] Restarting tasks ... done.
[  257.132452] PM: suspend exit
[  259.938464] wlo1: authenticate with dc:9f:db:70:1f:b0
...
[  260.097035] wlo1: associated
[  287.311528] nvme nvme0: I/O 700 QID 1 timeout, completion polled
[  287.311557] nvme nvme0: I/O 173 QID 4 timeout, completion polled

The NVMe device seems to suspend & resume as expected, but there is no IRQ from the NVMe SSD after resume.
Comment 8 jian-hong 2019-10-07 08:56:21 UTC
Created attachment 285383 [details]
Debug difference to skip NVMe's PCI_MSIX_ENTRY_CTRL_MASKBIT setting

After tracing this for a long time, I found that PCI_MSIX_ENTRY_CTRL_MASKBIT gets set during resume, which masks the NVMe device's IRQs.  So there are no NVMe interrupts after resume, and NVMe I/O requests usually time out.
I made a debug diff (attached) which skips setting PCI_MSIX_ENTRY_CTRL_MASKBIT for the NVMe device during resume; with it, NVMe works fine after resume.

Here are the debug messages in dmesg during resume:

[   45.665657] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x0c origianl value 0x901110d0, going to write 0x901110d1.  However, 0000:02:00.0 is nvme, so skip
[   45.665660] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0x901110d0
[   45.665726] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x1c origianl value 0x0b881490, going to write 0x0b881491.  However, 0000:02:00.0 is nvme, so skip
[   45.665727] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0x0b881490
[   45.666098] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x2c origianl value 0x0ca17434, going to write 0x0ca17435.  However, 0000:02:00.0 is nvme, so skip
[   45.666100] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0x0ca17434
[   45.666396] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x3c origianl value 0x4815041c, going to write 0x4815041d.  However, 0000:02:00.0 is nvme, so skip
[   45.666397] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0x4815041c
[   45.666452] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x4c origianl value 0x480c1b44, going to write 0x480c1b45.  However, 0000:02:00.0 is nvme, so skip
[   45.666453] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0x480c1b44
[   45.666504] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x5c origianl value 0x440048a8, going to write 0x440048a9.  However, 0000:02:00.0 is nvme, so skip
[   45.666505] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0x440048a8
[   45.666552] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x6c origianl value 0x10886064, going to write 0x10886065.  However, 0000:02:00.0 is nvme, so skip
[   45.666553] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0x10886064
[   45.666601] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x7c origianl value 0x01802170, going to write 0x01802171.  However, 0000:02:00.0 is nvme, so skip
[   45.666602] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0x01802170
[   45.666606] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x8c origianl value 0xd1264c94, going to write 0xd1264c95.  However, 0000:02:00.0 is nvme, so skip
[   45.666608] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0xd1264c94

1. The difference between the original "[PCI MSI/X] Mask bits" value and the new one is PCI_MSIX_ENTRY_CTRL_MASKBIT.
2. PCI_MSIX_ENTRY_CTRL_MASKBIT is set according to the "flag" of __pci_msix_desc_mask_irq() in drivers/pci/msi.c.
3. In __pci_restore_msix_state(), the entry->masked is passed as the "flag" to msix_mask_irq() -> __pci_msix_desc_mask_irq().

So, the question becomes: why is the whole entry->masked value passed directly as the "flag" to msix_mask_irq()?  Since it is a non-zero value, the "flag" evaluates as true in the if-condition in __pci_msix_desc_mask_irq(), and the mask bit ends up set.
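
To make that concrete, here is a paraphrased sketch of the two pieces of drivers/pci/msi.c involved (not a verbatim copy of v5.3; see the real file for details):

u32 __pci_msix_desc_mask_irq(struct msi_desc *desc, u32 flag)
{
	u32 mask_bits = desc->masked;	/* last saved Vector Control value */

	mask_bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
	if (flag)	/* any non-zero "flag" sets the per-vector mask bit */
		mask_bits |= PCI_MSIX_ENTRY_CTRL_MASKBIT;
	writel(mask_bits, pci_msix_desc_addr(desc) + PCI_MSIX_ENTRY_VECTOR_CTRL);

	return mask_bits;
}

static void __pci_restore_msix_state(struct pci_dev *dev)
{
	...
	arch_restore_msi_irqs(dev);
	for_each_pci_msi_entry(entry, dev)
		/*
		 * entry->masked holds the whole saved Vector Control
		 * register, not just the mask bit.  On this device the
		 * other bits are non-zero (the 0x901110d0-style values
		 * above), so the "flag" is always true and every vector
		 * ends up masked after resume.
		 */
		msix_mask_irq(entry, entry->masked);
	...
}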
Comment 9 jian-hong 2019-10-07 08:57:24 UTC
Created attachment 285385 [details]
dmesg of comment #8
Comment 10 jian-hong 2019-10-08 06:32:08 UTC
I sent a new patch [1] to fix the issue described in comment 8, which is the root cause of this ticket.
Once the new patch is merged, we can revert commit 19ea025e1d28 ("nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T") in the mainline kernel.

[1] https://patchwork.kernel.org/patch/11178729/
Comment 11 jian-hong 2019-10-31 09:54:14 UTC
The patch has been merged as commit 253eaf4faaaa ("PCI/MSI: Fix incorrect MSI-X masking on resume") [1] in the linux-next repository.

I also sent another patch [2] to revert the previous quirk.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/drivers/pci/msi.c?id=253eaf4faaaadffeade9192a42f47c51d1b0c3f9
[2]: https://patchwork.kernel.org/patch/11220807/
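
For reference, the essence of the merged fix (paraphrased from memory; the authoritative diff is at [1]) is to pass only the mask bit, not the whole saved Vector Control value, when re-masking vectors in __pci_restore_msix_state():

--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ static void __pci_restore_msix_state(struct pci_dev *dev)
 	arch_restore_msi_irqs(dev);
 	for_each_pci_msi_entry(entry, dev)
-		msix_mask_irq(entry, entry->masked);
+		msix_mask_irq(entry, entry->masked &
+			      PCI_MSIX_ENTRY_CTRL_MASKBIT);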