Bug 204887 - Resume issue: "nvme nvme0: I/O 12 QID 0 timeout, completion polled"
Summary: Resume issue: "nvme nvme0: I/O 12 QID 0 timeout, completion polled"
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Flash/Memory Technology Devices
Hardware: x86-64 Linux
Importance: P1 high
Assignee: David Woodhouse
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-09-17 08:30 UTC by jian-hong
Modified: 2020-02-20 09:48 UTC

See Also:
Kernel Version: 5.3.0
Tree: Mainline
Regression: No


Attachments
dmesg (82.96 KB, text/plain)
2019-09-17 08:30 UTC, jian-hong
dmesg with dump_stack in nvme_timeout() (69.58 KB, text/plain)
2019-09-17 09:15 UTC, jian-hong
NVME SSD information by smartctl (3.72 KB, text/plain)
2019-09-18 09:07 UTC, jian-hong
dmesg shows no more IRQ from NVME SSD after resume (154.60 KB, text/plain)
2019-09-23 08:21 UTC, jian-hong
dmesg shows IRQ from NVME SSD both before suspend and after resume (154.52 KB, text/plain)
2019-09-23 08:26 UTC, jian-hong
Added debug message without NVME_QUIRK_SIMPLE_SUSPEND quirk (5.27 KB, patch)
2019-09-24 08:57 UTC, jian-hong
Debug difference to skip NVMe's PCI_MSIX_ENTRY_CTRL_MASKBIT setting (4.05 KB, patch)
2019-10-07 08:56 UTC, jian-hong
dmesg of comment #8 (64.79 KB, text/plain)
2019-10-07 08:57 UTC, jian-hong

Description jian-hong 2019-09-17 08:30:19 UTC
Created attachment 285027 [details]
dmesg

We have a laptop equipped with an i5-8265U CPU and a Kingston PCIe NVMe SSD.  We tested mainline kernel 5.3.0+ on the laptop.  The system takes about 60 seconds to resume from suspend (s2idle).

Here is the information on the NVMe SSD:

$ sudo lspci -nnvs 02:00.0
02:00.0 Non-Volatile memory controller [0108]: Kingston Technologies Device [2646:5008] (rev 01) (prog-if 02 [NVM Express])
	Subsystem: Kingston Technologies Device [2646:5008]
	Flags: bus master, fast devsel, latency 0, IRQ 16, NUMA node 0
	Memory at 81100000 (64-bit, non-prefetchable) [size=16K]
	I/O ports at 3000 [size=8]
	Capabilities: [80] Express Endpoint, MSI 00
	Capabilities: [d0] MSI-X: Enable+ Count=9 Masked-
	Capabilities: [e0] MSI: Enable- Count=1/8 Maskable- 64bit+
	Capabilities: [f8] Power Management version 3
	Capabilities: [100] Vendor Specific Information: ID=1556 Rev=1 Len=008 <?>
	Capabilities: [108] Latency Tolerance Reporting
	Capabilities: [110] L1 PM Substates
	Capabilities: [128] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [200] Advanced Error Reporting
	Capabilities: [300] #19
	Kernel driver in use: nvme
	Kernel modules: nvme

There is a time gap of about 60 seconds between "[   84.002162] ata3: SATA link down (SStatus 4 SControl 300)" and "[  144.507485] nvme nvme0: I/O 12 QID 0 timeout, completion polled" after the system resumes.  The kernel keeps printing messages like "nvme nvme0: I/O 978 QID 3 timeout, completion polled", and then the system usually hangs.

[   84.002162] ata3: SATA link down (SStatus 4 SControl 300)
[  144.507485] nvme nvme0: I/O 12 QID 0 timeout, completion polled
[  144.508712] OOM killer enabled.
[  144.508714] Restarting tasks ... done.
[  144.539322] PM: suspend exit
[  147.542914] wlo1: authenticate with dc:9f:db:70:1f:a0
[  147.544185] wlo1: send auth to dc:9f:db:70:1f:a0 (try 1/3)
[  147.695868] wlo1: authenticated
[  147.696964] wlo1: associate with dc:9f:db:70:1f:a0 (try 1/3)
[  147.700996] wlo1: RX AssocResp from dc:9f:db:70:1f:a0 (capab=0x31 status=0 aid=3)
[  147.712441] wlo1: associated
[  174.715534] nvme nvme0: I/O 978 QID 3 timeout, completion polled
[  174.715577] nvme nvme0: I/O 393 QID 4 timeout, completion polled
[  174.715639] nvme nvme0: I/O 515 QID 8 timeout, completion polled

The timeout messages do not appear between boot and the first suspend; they only occur after resume.
Comment 1 jian-hong 2019-09-17 09:15:14 UTC
Created attachment 285029 [details]
dmesg with dump_stack in nvme_timeout()

I added a dump_stack() call to capture the backtrace:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 732d5b63ec05..f4650c7d516f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1266,6 +1266,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
                dev_warn(dev->ctrl.device,
                         "I/O %d QID %d timeout, completion polled\n",
                         req->tag, nvmeq->qid);
+               dump_stack();
                return BLK_EH_DONE;
        }


The log shows the timeout is reported from blk_mq_timeout_work:

[  140.412451] nvme nvme0: I/O 0 QID 0 timeout, completion polled
[  140.412459] CPU: 5 PID: 299 Comm: kworker/5:1H Not tainted 5.3.0+ #1
[  140.412462] Hardware name: ASUSTeK COMPUTER INC. ZenBook UX431FL/UX431FL, BIOS UX431FL.300 06/17/2019
[  140.412472] Workqueue: kblockd blk_mq_timeout_work
[  140.412475] Call Trace:
[  140.412491]  dump_stack+0x46/0x60
[  140.412501]  nvme_timeout.cold.77+0x136/0x191 [nvme]
[  140.412511]  ? sched_clock_cpu+0xc/0xb0
[  140.412514]  ? update_rq_clock.part.92+0xf/0x60
[  140.412519]  ? cpumask_next+0x16/0x20
[  140.412524]  blk_mq_check_expired+0x120/0x140
[  140.412530]  blk_mq_queue_tag_busy_iter+0x195/0x2e0
[  140.412535]  ? blk_poll+0x340/0x340
[  140.412538]  ? blk_poll+0x340/0x340
[  140.412542]  blk_mq_timeout_work+0x4f/0x120
[  140.412548]  process_one_work+0x1f0/0x3b0
[  140.412552]  worker_thread+0x28/0x3c0
[  140.412556]  ? process_one_work+0x3b0/0x3b0
[  140.412561]  kthread+0x10b/0x130
[  140.412566]  ? kthread_create_on_node+0x60/0x60
[  140.412571]  ret_from_fork+0x1f/0x30
Comment 2 jian-hong 2019-09-18 08:30:44 UTC
I checked the power status of the NVMe SSD:

Before suspend:

grep: /sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/autosuspend_delay_ms: Input/output error
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/control:on
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/runtime_active_time:324462
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/runtime_status:active
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/runtime_suspended_time:0
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup:disabled
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_abort_count:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_active:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_active_count:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_count:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_expire_count:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_last_time_ms:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_max_time_ms:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_total_time_ms:

After resume:

grep: /sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/autosuspend_delay_ms: Input/output error
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/control:on
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/runtime_active_time:423800
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/runtime_status:active
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/runtime_suspended_time:0
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup:disabled
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_abort_count:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_active:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_active_count:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_count:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_expire_count:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_last_time_ms:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_max_time_ms:
/sys/module/nvme/drivers/pci:nvme/0000:02:00.0/power/wakeup_total_time_ms:

They look the same, except that runtime_active_time has increased.
Comment 3 jian-hong 2019-09-18 09:07:48 UTC
Created attachment 285051 [details]
NVME SSD information by smartctl
Comment 4 jian-hong 2019-09-23 03:18:34 UTC
I added a quirk to make the NVMe SSD use the simple suspend path:

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1ede1763a5ee..63395f800dcc 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2298,6 +2298,10 @@ static const struct nvme_core_quirk_entry core_quirks[] = {
                .vid = 0x14a4,
                .fr = "22301111",
                .quirks = NVME_QUIRK_SIMPLE_SUSPEND,
+       },
+       {
+               .vid = 0x2646,
+               .quirks = NVME_QUIRK_SIMPLE_SUSPEND,
        }
 };

With this quirk, the issue can no longer be reproduced.  However, it is just a shot in the dark.  Is there any way to diagnose further and make sure this quirk is correct for the device?  That would be helpful for the commit message.
Comment 5 jian-hong 2019-09-23 08:21:24 UTC
Created attachment 285109 [details]
dmesg shows no more IRQ from NVME SSD after resume

To understand the IRQ handling for the NVMe device, I added debug messages on top of the mainline kernel:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6b4d7b064b38..8b2d2dbf5455 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1019,6 +1019,8 @@ static irqreturn_t nvme_irq(int irq, void *data)
        struct nvme_queue *nvmeq = data;
        irqreturn_t ret = IRQ_NONE;
        u16 start, end;
+       static u64 idx = 0;
+       dev_err(nvmeq->dev->dev, "%s: idx=%llu start\n", __func__, idx);
 
        /*
         * The rmb/wmb pair ensures we see all updates from a previous run of
@@ -1033,9 +1035,11 @@ static irqreturn_t nvme_irq(int irq, void *data)
 
        if (start != end) {
                nvme_complete_cqes(nvmeq, start, end);
-               return IRQ_HANDLED;
+               ret = IRQ_HANDLED;
        }
 
+       dev_err(nvmeq->dev->dev, "%s: idx=%llu end\n", __func__, idx);
+       idx++;
        return ret;
 }

There are a lot of "nvme_irq ... start/end" messages before suspend:

[   54.062745] nvme 0000:02:00.0: nvme_irq: idx=20216 start
[   54.062754] nvme 0000:02:00.0: nvme_irq: idx=20216 end

But there are no more after resume; the log then only shows "nvme ... timeout, completion polled" messages continuously.

[  162.662370] nvme nvme0: I/O 112 QID 1 timeout, completion polled

That means nvme_irq, the IRQ handler, is not invoked after resume.
Comment 6 jian-hong 2019-09-23 08:26:04 UTC
Created attachment 285111 [details]
dmesg shows IRQ from NVME SSD both before suspend and after resume

Combining the quirk from comment #4 with the debug messages from comment #5, the debug messages appear both before suspend and after resume.

That means, with the quirk applied, IRQs from the NVMe device come back after resume.
Comment 7 jian-hong 2019-09-24 08:57:40 UTC
Created attachment 285145 [details]
Added debug message without NVME_QUIRK_SIMPLE_SUSPEND quirk

I added more debug messages (see the attached patch, based on the mainline kernel) to trace the suspend & resume code path.

The suspend & resume path shows up in the dmesg log as follows:

[  178.363854] nvme 0000:02:00.0: nvme_irq: idx=20544 start
[  178.363861] nvme 0000:02:00.0: nvme_irq: idx=20544 end
[  178.363868] Filesystems sync: 0.003 seconds
[  178.365124] Freezing user space processes ... (elapsed 0.003 seconds) done.
[  178.368170] OOM killer disabled.
[  178.368171] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[  178.369360] printk: Suspending console(s) (use no_console_suspend to debug)
[  178.370370] wlo1: deauthenticating from dc:9f:db:70:1f:b0 by local choice (Reason: 3=DEAUTH_LEAVING)
[  178.483304] nvme 0000:02:00.0: nvme_suspend: freezed
[  178.483437] nvme 0000:02:00.0: nvme_irq: idx=20545 start
[  178.483448] nvme 0000:02:00.0: nvme_irq: idx=20545 end
[  178.483479] nvme 0000:02:00.0: nvme_suspend: ndev->last_ps 0, ctrl->npss = 4
[  178.483569] nvme 0000:02:00.0: nvme_irq: idx=20546 start
[  178.483577] nvme 0000:02:00.0: nvme_irq: idx=20546 end
[  178.483602] nvme 0000:02:00.0: __nvme_submit_sync_cmd: request result 0
[  178.483605] nvme 0000:02:00.0: nvme_suspend: protocol specific
[  178.499057] ACPI: EC: interrupt blocked
[  196.930344] ACPI: EC: interrupt unblocked
[  197.045281] pci 0000:01:00.0: Enabling HDA controller
[  197.061208] iwlwifi 0000:00:14.3: Applying debug destination EXTERNAL_DRAM
[  197.061341] ACPI: button: The lid device is not compliant to SW_LID.
[  197.177426] iwlwifi 0000:00:14.3: Applying debug destination EXTERNAL_DRAM
[  197.242784] iwlwifi 0000:00:14.3: FW already configured (0) - re-configuring
[  197.374262] ata3: SATA link down (SStatus 4 SControl 300)
[  257.103510] nvme nvme0: I/O 6 QID 0 timeout, completion polled
[  257.103606] nvme 0000:02:00.0: __nvme_submit_sync_cmd: request result 0
[  257.103617] nvme 0000:02:00.0: nvme_resume: ndev->last_ps=0
[  257.104774] OOM killer enabled.
[  257.104776] Restarting tasks ... done.
[  257.132452] PM: suspend exit
[  259.938464] wlo1: authenticate with dc:9f:db:70:1f:b0
...
[  260.097035] wlo1: associated
[  287.311528] nvme nvme0: I/O 700 QID 1 timeout, completion polled
[  287.311557] nvme nvme0: I/O 173 QID 4 timeout, completion polled

The NVMe device seems to suspend & resume as expected, but there are no IRQs from the NVMe SSD after resume.
Comment 8 jian-hong 2019-10-07 08:56:21 UTC
Created attachment 285383 [details]
Debug difference to skip NVMe's PCI_MSIX_ENTRY_CTRL_MASKBIT setting

After tracing for a long time, I found that PCI_MSIX_ENTRY_CTRL_MASKBIT is set during resume, which masks the NVMe device's IRQs.  As a result, there are no NVMe interrupts after resume, and NVMe I/O requests usually time out.
I made a debug diff (attached) which skips setting PCI_MSIX_ENTRY_CTRL_MASKBIT for the NVMe device during resume; with it, NVMe works fine after resume.

Here are the debug messages in dmesg during resume:

[   45.665657] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x0c origianl value 0x901110d0, going to write 0x901110d1.  However, 0000:02:00.0 is nvme, so skip
[   45.665660] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0x901110d0
[   45.665726] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x1c origianl value 0x0b881490, going to write 0x0b881491.  However, 0000:02:00.0 is nvme, so skip
[   45.665727] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0x0b881490
[   45.666098] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x2c origianl value 0x0ca17434, going to write 0x0ca17435.  However, 0000:02:00.0 is nvme, so skip
[   45.666100] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0x0ca17434
[   45.666396] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x3c origianl value 0x4815041c, going to write 0x4815041d.  However, 0000:02:00.0 is nvme, so skip
[   45.666397] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0x4815041c
[   45.666452] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x4c origianl value 0x480c1b44, going to write 0x480c1b45.  However, 0000:02:00.0 is nvme, so skip
[   45.666453] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0x480c1b44
[   45.666504] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x5c origianl value 0x440048a8, going to write 0x440048a9.  However, 0000:02:00.0 is nvme, so skip
[   45.666505] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0x440048a8
[   45.666552] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x6c origianl value 0x10886064, going to write 0x10886065.  However, 0000:02:00.0 is nvme, so skip
[   45.666553] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0x10886064
[   45.666601] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x7c origianl value 0x01802170, going to write 0x01802171.  However, 0000:02:00.0 is nvme, so skip
[   45.666602] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0x01802170
[   45.666606] nvme 0000:02:00.0: __pci_msix_desc_mask_irq: 0x8c origianl value 0xd1264c94, going to write 0xd1264c95.  However, 0000:02:00.0 is nvme, so skip
[   45.666608] nvme 0000:02:00.0: msix_mask_irq: 0000:02:00.0 called with flag is True, new masked value is 0xd1264c94

1. The difference between the original "[PCI MSI/X] Mask bits" value and the new one is PCI_MSIX_ENTRY_CTRL_MASKBIT.
2. PCI_MSIX_ENTRY_CTRL_MASKBIT is set according to the "flag" of __pci_msix_desc_mask_irq() in drivers/pci/msi.c.
3. In __pci_restore_msix_state(), the entry->masked is passed as the "flag" to msix_mask_irq() -> __pci_msix_desc_mask_irq().

So the question becomes: why is the whole entry->masked value passed directly as the "flag" to msix_mask_irq()?  Since it is usually a non-zero value, the "flag" evaluates as true in the if-condition in __pci_msix_desc_mask_irq(), and the vector gets masked.
Comment 9 jian-hong 2019-10-07 08:57:24 UTC
Created attachment 285385 [details]
dmesg of comment #8
Comment 10 jian-hong 2019-10-08 06:32:08 UTC
I sent a new patch [1] to fix the issue described in comment 8, which is the root cause of this ticket.
If the new patch is merged, we can revert commit 19ea025e1d28 ("nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T") in the mainline kernel.

[1] https://patchwork.kernel.org/patch/11178729/
Comment 11 jian-hong 2019-10-31 09:54:14 UTC
The patch has been merged as commit 253eaf4faaaa ("PCI/MSI: Fix incorrect MSI-X masking on resume") [1] in the linux-next repository.

I also sent another patch [2] to revert the previous quirk.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/drivers/pci/msi.c?id=253eaf4faaaadffeade9192a42f47c51d1b0c3f9
[2]: https://patchwork.kernel.org/patch/11220807/
