Created attachment 307966
nvme-boot-error-console.txt

The following commit causes a boot hang in at least 2 of our machines. Reverting this commit and building 6.15.0-rc2 fixes the issue. I've attached the console log showing the error text as nvme-boot-error-console.txt.

commit 62baf70c327444338c34703c71aa8cc8e4189bd6 (refs/bisect/bad)
Author: Hannes Reinecke <hare@kernel.org>
Date:   Thu Apr 3 09:19:30 2025 +0200

    nvme: re-read ANA log page after ns scan completes

    When scanning for new namespaces we might have missed an ANA AEN.

    The NVMe base spec (NVMe Base Specification v2.1, Figure 151 'Asynchonous
    Event Information - Notice': Asymmetric Namespace Access Change) states:

      A controller shall not send this even if an Attached Namespace
      Attribute Changed asynchronous event [...] is sent for the same event.

    so we need to re-read the ANA log page after we rescanned the namespace
    list to update the ANA states of the new namespaces.

    Signed-off-by: Hannes Reinecke <hare@kernel.org>
    Reviewed-by: Keith Busch <kbusch@kernel.org>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
The key piece of console error info is this:

[ 35.326061] nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x2010
[ 35.382840] nvme0n1: I/O Cmd(0x2) @ LBA 0, 8 blocks, I/O Error (sct 0x3 / sc 0x71)
[ 35.391169] I/O error, dev nvme0n1, sector 0 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[ 35.400504] nvme nvme0: Failed to get ANA log: -4
[ 35.456208] nvme nvme0: 8/0/0 default/read/poll queues
[ 35.462684] nvme nvme0: Ignoring bogus Namespace Identifiers
[ 35.498123] DMAR: DRHD: handling fault status reg 2
[ 35.503428] DMAR: [DMA Read NO_PASID] Request device [01:00.0] fault addr 0x0 [fault reason 0x06] PTE Read access is not set
[ 35.515010] DMAR: Dump dmar1 table entries for IOVA 0x0
[ 35.520640] DMAR: root entry: 0x0000000105038001
[ 35.520641] DMAR: context entry: hi 0x0000000000000a02, low 0x0000000105037001
[ 35.533282] DMAR: pte level: 4, pte value: 0x0000000101754003
[ 35.539427] DMAR: pte level: 3, pte value: 0x0000000000000000
[ 35.545577] DMAR: page table not present at level 2
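For readers unfamiliar with the raw values in the log above, here is a minimal C sketch of how they can be decoded. The bit positions follow the CSTS register layout in the NVMe base specification (RDY in bit 0, CFS in bit 1, SHST in bits 3:2), and the combined status value is formed the way the Linux driver's status constants are laid out (status code type in bits 10:8, status code in bits 7:0). The helper names `decode_csts` and `combine_status` are made up for illustration and are not kernel functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* CSTS (Controller Status) bits, per the NVMe base specification. */
#define CSTS_RDY        (1u << 0)   /* controller ready */
#define CSTS_CFS        (1u << 1)   /* controller fatal status */
#define CSTS_SHST_SHIFT 2
#define CSTS_SHST_MASK  (3u << CSTS_SHST_SHIFT) /* shutdown status */

/* Illustrative container for the decoded fields (hypothetical name). */
struct csts_fields {
	bool ready;
	bool fatal;
	unsigned shutdown_status;
};

/* Split a raw CSTS value such as 0x3 into its named fields. */
static struct csts_fields decode_csts(uint32_t csts)
{
	struct csts_fields f;

	f.ready = (csts & CSTS_RDY) != 0;
	f.fatal = (csts & CSTS_CFS) != 0;
	f.shutdown_status = (csts & CSTS_SHST_MASK) >> CSTS_SHST_SHIFT;
	return f;
}

/* Combine "sct 0x3 / sc 0x71" into a single status value (0x371). */
static unsigned combine_status(unsigned sct, unsigned sc)
{
	return ((sct & 0x7u) << 8) | (sc & 0xffu);
}
```

With the values from the log, `decode_csts(0x3)` reports both the ready and the fatal-status bits set, and `combine_status(0x3, 0x71)` gives 0x371, which in the Linux driver is `NVME_SC_HOST_ABORTED_CMD` (a path-related status meaning the host aborted the command). The `-4` in "Failed to get ANA log" corresponds to `-EINTR` on Linux.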
ok I found the following proposed fix in lkml, trying it now:

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index b502ac07483b..eb6ea8acb3cc 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4300,7 +4300,7 @@ static void nvme_scan_work(struct work_struct *work)
 	if (test_bit(NVME_AER_NOTICE_NS_CHANGED, &ctrl->events))
 		nvme_queue_scan(ctrl);
 #ifdef CONFIG_NVME_MULTIPATH
-	else
+	else if (ctrl->ana_log_buf)
 		/* Re-read the ANA log page to not miss updates */
 		queue_work(nvme_wq, &ctrl->ana_work);
 #endif
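To make the effect of the one-line change easier to see, here is a small user-space model of the decision at the end of `nvme_scan_work()`, before and after the patch. It is only a sketch: `struct fake_ctrl`, `enum scan_followup` and both functions are hypothetical stand-ins, not kernel code. It assumes `CONFIG_NVME_MULTIPATH` is enabled and that `ana_log_buf` is NULL on controllers for which the driver never allocated an ANA log buffer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the parts of struct nvme_ctrl the hunk touches. */
struct fake_ctrl {
	bool ns_changed_pending; /* models test_bit(NVME_AER_NOTICE_NS_CHANGED, ...) */
	void *ana_log_buf;       /* assumed NULL when no ANA log buffer exists */
};

/* What the tail of the scan work decides to queue next. */
enum scan_followup {
	FOLLOWUP_NONE,
	FOLLOWUP_RESCAN,
	FOLLOWUP_ANA_WORK
};

/* Before the patch: ANA work is queued whenever no rescan is pending. */
static enum scan_followup followup_unpatched(const struct fake_ctrl *ctrl)
{
	if (ctrl->ns_changed_pending)
		return FOLLOWUP_RESCAN;
	return FOLLOWUP_ANA_WORK;
}

/* After the patch: ANA work is queued only if an ANA log buffer exists. */
static enum scan_followup followup_patched(const struct fake_ctrl *ctrl)
{
	if (ctrl->ns_changed_pending)
		return FOLLOWUP_RESCAN;
	if (ctrl->ana_log_buf)
		return FOLLOWUP_ANA_WORK;
	return FOLLOWUP_NONE;
}
```

In this model, a controller with `ana_log_buf == NULL` and no pending rescan gets `FOLLOWUP_ANA_WORK` from the unpatched function and `FOLLOWUP_NONE` from the patched one. Controllers that do have an ANA log buffer behave the same either way.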
This patch seems to fix things, so once it's available in upstream I'll close this issue.
*** Bug 220007 has been marked as a duplicate of this bug. ***
Can confirm it happens on a Lenovo ThinkPad L14 as well.
RockT: does the above patch fix it? It has fixed the problem on multiple machines here. I have 3 that crashed at boot and were not working, and one HP Spectre that couldn't suspend because nvme refused to suspend. This issue seems to have been pretty broad in its effects. The patch fixed it on all 4.
I'm using the Ubuntu mainline kernel. I was able to patch the source but could not rebuild it. All the documentation I found seems outdated. Sorry :(
Merged: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/nvme/host/core.c?id=26d7fb4fd4ca1180e2fa96587dea544563b4962a