Bug 220015 - [BISECTED] NVME re-read ANA log page patch causes boot hang in 6.15.0-rc2
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: NVMe
Hardware: All Linux
Importance: P3 normal
Assignee: IO/NVME Virtual Default Assignee
URL:
Keywords:
Depends on:
Blocks: 178231
Reported: 2025-04-15 21:49 UTC by Todd Brandt
Modified: 2025-04-19 10:43 UTC

See Also:
Kernel Version: 6.15.0-rc2
Subsystem:
Regression: Yes
Bisected commit-id: 62baf70c327444338c34703c71aa8cc8e4189bd6


Attachments
nvme-boot-error-console.txt (58.19 KB, text/plain)
2025-04-15 21:49 UTC, Todd Brandt

Description Todd Brandt 2025-04-15 21:49:30 UTC
Created attachment 307966
nvme-boot-error-console.txt

The following commit causes a boot hang on at least two of our machines. Reverting it and rebuilding 6.15.0-rc2 fixes the issue. I've attached the console log showing the error text as nvme-boot-error-console.txt.

commit 62baf70c327444338c34703c71aa8cc8e4189bd6 (refs/bisect/bad)
Author: Hannes Reinecke <hare@kernel.org>
Date:   Thu Apr 3 09:19:30 2025 +0200

    nvme: re-read ANA log page after ns scan completes
    
    When scanning for new namespaces we might have missed an ANA AEN.
    
    The NVMe base spec (NVMe Base Specification v2.1, Figure 151 'Asynchronous
    Event Information - Notice': Asymmetric Namespace Access Change) states:
    
      A controller shall not send this event if an Attached Namespace
      Attribute Changed asynchronous event [...] is sent for the same event.
    
    so we need to re-read the ANA log page after we rescanned the namespace
    list to update the ANA states of the new namespaces.
    
    Signed-off-by: Hannes Reinecke <hare@kernel.org>
    Reviewed-by: Keith Busch <kbusch@kernel.org>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
Comment 1 Todd Brandt 2025-04-15 21:51:17 UTC
The key piece of console error info is this:

[   35.326061] nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x2010
[   35.382840] nvme0n1: I/O Cmd(0x2) @ LBA 0, 8 blocks, I/O Error (sct 0x3 / sc 0x71) 
[   35.391169] I/O error, dev nvme0n1, sector 0 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[   35.400504] nvme nvme0: Failed to get ANA log: -4
[   35.456208] nvme nvme0: 8/0/0 default/read/poll queues
[   35.462684] nvme nvme0: Ignoring bogus Namespace Identifiers
[   35.498123] DMAR: DRHD: handling fault status reg 2
[   35.503428] DMAR: [DMA Read NO_PASID] Request device [01:00.0] fault addr 0x0 [fault reason 0x06] PTE Read access is not set
[   35.515010] DMAR: Dump dmar1 table entries for IOVA 0x0
[   35.520640] DMAR: root entry: 0x0000000105038001
[   35.520641] DMAR: context entry: hi 0x0000000000000a02, low 0x0000000105037001
[   35.533282] DMAR: pte level: 4, pte value: 0x0000000101754003
[   35.539427] DMAR: pte level: 3, pte value: 0x0000000000000000
[   35.545577] DMAR: page table not present at level 2
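
The register and status values in the log above can be decoded by hand. The following is a minimal sketch, not kernel code: the bit positions follow the NVMe and PCI specifications as I understand them, and the macro and function names are hypothetical, chosen only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* NVMe CSTS (Controller Status) register bits. */
#define CSTS_RDY (1u << 0) /* controller ready */
#define CSTS_CFS (1u << 1) /* controller fatal status */

/* PCI status register bits. */
#define PCI_STATUS_CAP_LIST_BIT     0x0010u /* capabilities list present */
#define PCI_STATUS_REC_MASTER_ABORT 0x2000u /* received master abort */

/* Status code type 0x3 groups the path-related status codes. */
#define SCT_PATH_RELATED 0x3u

static bool csts_is_ready(uint32_t csts) { return (csts & CSTS_RDY) != 0; }
static bool csts_is_fatal(uint32_t csts) { return (csts & CSTS_CFS) != 0; }

static bool pci_received_master_abort(uint16_t status)
{
	return (status & PCI_STATUS_REC_MASTER_ABORT) != 0;
}

/* Map the "sct / sc" pair printed by the kernel to a short description. */
static const char *describe_path_status(unsigned sct, unsigned sc)
{
	if (sct != SCT_PATH_RELATED)
		return "not a path-related status";
	switch (sc) {
	case 0x01: return "ANA persistent loss";
	case 0x02: return "ANA inaccessible";
	case 0x03: return "ANA transition";
	case 0x70: return "host pathing error";
	case 0x71: return "command aborted by host";
	default:   return "other path-related status";
	}
}
```

With the values from the log, CSTS=0x3 has both the ready and the fatal-status bits set, PCI_STATUS=0x2010 includes the received-master-abort bit, and describe_path_status(0x3, 0x71) returns "command aborted by host". The -4 in "Failed to get ANA log" is presumably -EINTR.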
Comment 2 Todd Brandt 2025-04-15 23:03:42 UTC
OK, I found the following proposed fix on LKML; trying it now:

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index b502ac07483b..eb6ea8acb3cc 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4300,7 +4300,7 @@ static void nvme_scan_work(struct work_struct *work)
 	if (test_bit(NVME_AER_NOTICE_NS_CHANGED, &ctrl->events))
 		nvme_queue_scan(ctrl);
 #ifdef CONFIG_NVME_MULTIPATH
-	else
+	else if (ctrl->ana_log_buf)
 		/* Re-read the ANA log page to not miss updates */
 		queue_work(nvme_wq, &ctrl->ana_work);
 #endif
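
The effect of the one-line change can be modelled in user space. This is only a sketch under stated assumptions: struct fake_ctrl and both functions are hypothetical stand-ins rather than kernel APIs, and it assumes ana_log_buf stays NULL on controllers for which no ANA log buffer was allocated, which would be consistent with the DMAR fault at address 0x0 in the console log.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct nvme_ctrl; only what the sketch needs. */
struct fake_ctrl {
	bool ns_changed_pending; /* models NVME_AER_NOTICE_NS_CHANGED in ctrl->events */
	void *ana_log_buf;       /* NULL when no ANA log buffer was allocated */
	int scans_queued;        /* counts modelled nvme_queue_scan() calls */
	int ana_reads_queued;    /* counts modelled queue_work(..., &ctrl->ana_work) calls */
};

/* Tail of the scan work before the fix: ANA work is queued unconditionally. */
static void scan_tail_unpatched(struct fake_ctrl *c)
{
	if (c->ns_changed_pending)
		c->scans_queued++;
	else
		c->ana_reads_queued++;
}

/* Tail of the scan work with the fix: ANA work needs an allocated buffer. */
static void scan_tail_patched(struct fake_ctrl *c)
{
	if (c->ns_changed_pending)
		c->scans_queued++;
	else if (c->ana_log_buf)
		c->ana_reads_queued++;
}
```

For a controller whose ana_log_buf is NULL, scan_tail_unpatched() still queues an ANA read while scan_tail_patched() queues nothing. For a controller with a buffer, the two behave the same.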
Comment 3 Todd Brandt 2025-04-15 23:39:03 UTC
This patch seems to fix things, so once it's available upstream I'll close this issue.
Comment 4 Artem S. Tashkinov 2025-04-16 11:44:17 UTC
*** Bug 220007 has been marked as a duplicate of this bug. ***
Comment 5 RockT 2025-04-16 13:35:58 UTC
Can confirm it happens on a Lenovo Thinkpad L14 as well.
Comment 6 Todd Brandt 2025-04-16 21:27:55 UTC
RockT: does the above patch fix it?

It has fixed the problem on multiple machines here. I have three machines that crashed at boot and one HP Spectre that couldn't suspend because the NVMe device refused to suspend, so this issue seems to have been pretty broad in its effects. The patch fixed it on all four.
Comment 7 RockT 2025-04-18 19:16:21 UTC
I'm using the Ubuntu mainline kernel. I was able to patch the source but could not rebuild it. All the documentation I found seems outdated. Sorry :(
