Bug 220015

Summary: [BISECTED] NVME re-read ANA log page patch causes boot hang in 6.15.0-rc2
Product: IO/Storage
Reporter: Todd Brandt (todd.e.brandt)
Component: NVMe
Assignee: IO/NVME Virtual Default Assignee (io_nvme)
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Severity: normal
CC: hare, linuxnet111, tr.ml
Priority: P3
Hardware: All
OS: Linux
Kernel Version: 6.15.0-rc2
Subsystem:
Regression: Yes
Bisected commit-id: 62baf70c327444338c34703c71aa8cc8e4189bd6
Bug Depends on:
Bug Blocks: 178231
Attachments: nvme-boot-error-console.txt

Description Todd Brandt 2025-04-15 21:49:30 UTC
Created attachment 307966
nvme-boot-error-console.txt

The following commit causes a boot hang on at least 2 of our machines. Reverting this commit and building 6.15.0-rc2 fixes the issue. I've attached the console log showing the error text as nvme-boot-error-console.txt.

commit 62baf70c327444338c34703c71aa8cc8e4189bd6 (refs/bisect/bad)
Author: Hannes Reinecke <hare@kernel.org>
Date:   Thu Apr 3 09:19:30 2025 +0200

    nvme: re-read ANA log page after ns scan completes
    
    When scanning for new namespaces we might have missed an ANA AEN.
    
    The NVMe base spec (NVMe Base Specification v2.1, Figure 151 'Asynchronous
    Event Information - Notice': Asymmetric Namespace Access Change) states:
    
      A controller shall not send this event if an Attached Namespace
      Attribute Changed asynchronous event [...] is sent for the same event.
    
    so we need to re-read the ANA log page after we rescanned the namespace
    list to update the ANA states of the new namespaces.
    
    Signed-off-by: Hannes Reinecke <hare@kernel.org>
    Reviewed-by: Keith Busch <kbusch@kernel.org>
    Signed-off-by: Christoph Hellwig <hch@lst.de>

Comment 1 Todd Brandt 2025-04-15 21:51:17 UTC
The key piece of console error info is this:

[   35.326061] nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x2010
[   35.382840] nvme0n1: I/O Cmd(0x2) @ LBA 0, 8 blocks, I/O Error (sct 0x3 / sc 0x71) 
[   35.391169] I/O error, dev nvme0n1, sector 0 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[   35.400504] nvme nvme0: Failed to get ANA log: -4
[   35.456208] nvme nvme0: 8/0/0 default/read/poll queues
[   35.462684] nvme nvme0: Ignoring bogus Namespace Identifiers
[   35.498123] DMAR: DRHD: handling fault status reg 2
[   35.503428] DMAR: [DMA Read NO_PASID] Request device [01:00.0] fault addr 0x0 [fault reason 0x06] PTE Read access is not set
[   35.515010] DMAR: Dump dmar1 table entries for IOVA 0x0
[   35.520640] DMAR: root entry: 0x0000000105038001
[   35.520641] DMAR: context entry: hi 0x0000000000000a02, low 0x0000000105037001
[   35.533282] DMAR: pte level: 4, pte value: 0x0000000101754003
[   35.539427] DMAR: pte level: 3, pte value: 0x0000000000000000
[   35.545577] DMAR: page table not present at level 2
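
For readers decoding the register values in the log above, here is a minimal sketch in plain C. The bit positions are assumed from the NVMe and PCI specifications, and the macro and function names are hypothetical rather than the kernel's own. Read this way, CSTS=0x3 has both the ready and the fatal-status bits set, PCI_STATUS=0x2010 has "received master abort" set, and "sct 0x3 / sc 0x71" combines to status 0x371, a path-related status.

```c
#include <stdbool.h>
#include <stdint.h>

/* NVMe CSTS (Controller Status) register bits (assumed from the NVMe spec). */
#define CSTS_RDY	(1u << 0)	/* controller ready */
#define CSTS_CFS	(1u << 1)	/* controller fatal status */

/* PCI status register bits (assumed from the PCI spec). */
#define PCI_STAT_CAP_LIST		0x0010u	/* capabilities list present */
#define PCI_STAT_REC_MASTER_ABORT	0x2000u	/* received master abort */

/* Hypothetical helper: true if the controller reports itself ready. */
bool csts_ready(uint32_t csts)
{
	return (csts & CSTS_RDY) != 0;
}

/* Hypothetical helper: true if the controller reports a fatal status. */
bool csts_fatal(uint32_t csts)
{
	return (csts & CSTS_CFS) != 0;
}

/* Hypothetical helper: true if the device recorded a master abort. */
bool pci_master_abort(uint16_t status)
{
	return (status & PCI_STAT_REC_MASTER_ABORT) != 0;
}

/*
 * The log prints the status code type (sct) and status code (sc)
 * separately; this recombines them as (sct << 8) | sc.
 */
uint16_t nvme_status(uint8_t sct, uint8_t sc)
{
	return (uint16_t)(((unsigned int)sct << 8) | sc);
}
```

With the values from the log, csts_ready(0x3), csts_fatal(0x3) and pci_master_abort(0x2010) are all true, and nvme_status(0x3, 0x71) is 0x371.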

Comment 2 Todd Brandt 2025-04-15 23:03:42 UTC
OK, I found the following proposed fix on LKML and am trying it now:

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index b502ac07483b..eb6ea8acb3cc 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4300,7 +4300,7 @@ static void nvme_scan_work(struct work_struct *work)
 	if (test_bit(NVME_AER_NOTICE_NS_CHANGED, &ctrl->events))
 		nvme_queue_scan(ctrl);
 #ifdef CONFIG_NVME_MULTIPATH
-	else
+	else if (ctrl->ana_log_buf)
 		/* Re-read the ANA log page to not miss updates */
 		queue_work(nvme_wq, &ctrl->ana_work);
 #endif
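
The effect of the one-line guard can be sketched in a small user-space model (this is not kernel code). It assumes, as the patch implies, that ctrl->ana_log_buf is NULL on controllers for which no ANA log buffer was allocated. struct model_ctrl and both function names are hypothetical. Under that assumption, the unguarded `else` queues an ANA log read even when there is no buffer to read into, which would be consistent with the DMAR fault at address 0x0 in the console log, though this report does not confirm that.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct nvme_ctrl; only what the model needs. */
struct model_ctrl {
	bool ns_change_pending;	/* models NVME_AER_NOTICE_NS_CHANGED being set */
	void *ana_log_buf;	/* NULL when no ANA log buffer was allocated */
	int rescans_queued;	/* counts modeled nvme_queue_scan() calls */
	int ana_reads_queued;	/* counts modeled queue_work(..., &ana_work) calls */
};

/* Tail of the scan work before the fix: ANA work is queued unconditionally. */
void scan_tail_unguarded(struct model_ctrl *ctrl)
{
	if (ctrl->ns_change_pending)
		ctrl->rescans_queued++;
	else
		ctrl->ana_reads_queued++;
}

/* Tail of the scan work with the guard: ANA work needs a buffer. */
void scan_tail_guarded(struct model_ctrl *ctrl)
{
	if (ctrl->ns_change_pending)
		ctrl->rescans_queued++;
	else if (ctrl->ana_log_buf)
		ctrl->ana_reads_queued++;
}
```

With ana_log_buf set to NULL and no namespace-change notice pending, scan_tail_unguarded() still queues an ANA read, while scan_tail_guarded() queues nothing.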

Comment 3 Todd Brandt 2025-04-15 23:39:03 UTC
This patch seems to fix things, so once it's available upstream I'll close this issue.

Comment 4 Artem S. Tashkinov 2025-04-16 11:44:17 UTC
*** Bug 220007 has been marked as a duplicate of this bug. ***

Comment 5 RockT 2025-04-16 13:35:58 UTC
Can confirm it happens on a Lenovo ThinkPad L14 as well.

Comment 6 Todd Brandt 2025-04-16 21:27:55 UTC
RockT: does the above patch fix it?

It has fixed the problem on multiple machines here. I have 3 that crashed at boot and were not working, and one HP Spectre that couldn't suspend because nvme refused to suspend. It seems this issue was pretty broad in its effects. The patch fixed it on all 4.

Comment 7 RockT 2025-04-18 19:16:21 UTC
I'm using the Ubuntu mainline kernel. I was able to patch the source but could not rebuild. All the documentation I found seems outdated. Sorry :(