Bug 214983 - KIOXIA KBG40ZNV256G NVME SSD killed by resume from s2idle
Summary: KIOXIA KBG40ZNV256G NVME SSD killed by resume from s2idle
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: NVMe
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: IO/NVME Virtual Default Assignee
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-11-10 18:00 UTC by pbs3141
Modified: 2021-11-15 17:11 UTC
CC List: 1 user

See Also:
Kernel Version: 5.15.0
Subsystem:
Regression: No
Bisected commit-id:


Attachments
lspci output (11.52 KB, text/plain)
2021-11-10 18:01 UTC, pbs3141
Details
nvme output (7.38 KB, text/plain)
2021-11-10 18:02 UTC, pbs3141
Details
full dmesg of affected system (80.29 KB, text/plain)
2021-11-10 18:03 UTC, pbs3141
Details
relocate hmb disabling on s2idle (968 bytes, patch)
2021-11-10 20:41 UTC, Keith Busch
Details | Diff
dmesg (83.24 KB, text/plain)
2021-11-10 23:39 UTC, pbs3141
Details
my crappy printk patch (1.17 KB, patch)
2021-11-11 03:46 UTC, pbs3141
Details | Diff
dmesg output after my crappy printk patch (82.28 KB, text/plain)
2021-11-11 03:47 UTC, pbs3141
Details
dmesg (82.52 KB, text/plain)
2021-11-11 14:42 UTC, pbs3141
Details
patch (2.88 KB, patch)
2021-11-11 14:43 UTC, pbs3141
Details | Diff

Description pbs3141 2021-11-10 18:00:10 UTC
I have a laptop with an SSD that fails to come back online after resuming from suspend.

The SSD is a KIOXIA KBG40ZNV256G with fully up-to-date firmware; its hardware-identifying information is below:

    lspci -v:
        lspci.txt (attached)

    nvme id-ctrl /dev/nvme0n1 -H:
        nvme.txt (attached)

The laptop is an HP 14-fq1021nr, with UEFI firmware updated to the latest version. This laptop ONLY supports the s2idle suspend mode, as confirmed by

    cat /sys/power/state:
        freeze mem disk
        
    cat /sys/power/mem_sleep:
        [s2idle]

Putting it into the s2idle state works perfectly in the preinstalled OS (Windows 10). By contrast, under Linux, suspend is 100% reproducibly broken. Although the system goes to sleep and wakes up without a hitch, it subsequently freezes for exactly 30 seconds before the NVMe drive disconnects with

    [   82.704939] nvme nvme0: Device not ready; aborting reset, CSTS=0x3
    [   82.704950] nvme nvme0: Removing after probe failure status: -19

After that the system is crippled, although it is still possible to write out the dmesg to an external USB drive, which I have attached. In the dmesg you'll notice that I booted with the kernel parameter

    nvme_core.default_ps_max_latency_us=0

This is because it's recommended in various places as a fix for this sort of issue. It didn't work, and neither did replacing the 0 with 6000 or removing the parameter entirely. Also in the dmesg is the kernel version: untainted vanilla 5.15.0, but the issue is present on every other kernel version I have tried.
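
(For anyone reproducing this: the parameter's current value and the resulting APST state can be checked via sysfs and nvme-cli; feature 0x0c is Autonomous Power State Transition. The device path is this machine's; adjust as needed.)

    cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
    # Feature 0x0c = Autonomous Power State Transition (APST); with the
    # parameter set to 0 it should report as disabled.
    sudo nvme get-feature /dev/nvme0 -f 0x0c -H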

Other notes
-----------

 - There are many other threads on the Internet about similar issues with KIOXIA drives under Linux, though none of them with this exact model number.

 - The drive model number KBG40ZNV256G is not listed on KIOXIA's website; the closest one is KBG40ZNS256G, with an 'S'. I guess this means HP must have customised the firmware?

 - The two Amazon reviews of this laptop by Linux users do NOT complain about this issue. This is despite the fact they do complain about much more trivial things. We deduce that HP are selling this laptop as several different hardware combinations, and that the issue only occurs with some of them. (Hardware fault is ruled out since it works fine under Windows.)
Comment 1 pbs3141 2021-11-10 18:01:17 UTC
Created attachment 299505 [details]
lspci output
Comment 2 pbs3141 2021-11-10 18:02:22 UTC
Created attachment 299507 [details]
nvme output
Comment 3 pbs3141 2021-11-10 18:03:23 UTC
Created attachment 299509 [details]
full dmesg of affected system
Comment 4 Keith Busch 2021-11-10 18:29:37 UTC
The dmesg says your controller is in a fatal status, but a controller reset is supposed to clear it. It's not clearing it, though. The ACPI firmware tells the driver to use the "simple" suspend to prepare for D3, and the driver is going to honor that. It also looks like you've disabled APST, so we can rule that out as well. I'm not sure at the moment what else to try; I may come back with a debug patch later.
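
For reference, CSTS=0x3 decodes per the NVMe spec as bit 0 (RDY) and bit 1 (CFS, Controller Fatal Status) both set. A minimal illustration, using the same bit values the kernel defines in include/linux/nvme.h:

    #include <stdio.h>

    /* NVMe Controller Status (CSTS) register bits, per the NVMe spec
     * (the kernel defines the same values in include/linux/nvme.h). */
    #define NVME_CSTS_RDY (1 << 0)  /* controller ready */
    #define NVME_CSTS_CFS (1 << 1)  /* controller fatal status */

    int main(void)
    {
        unsigned csts = 0x3;  /* value from the error message above */

        /* prints "RDY=1 CFS=1": the fatal-status bit is latched, and per
         * the report a controller reset is failing to clear it */
        printf("RDY=%d CFS=%d\n", !!(csts & NVME_CSTS_RDY),
               !!(csts & NVME_CSTS_CFS));
        return 0;
    }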
Comment 5 Keith Busch 2021-11-10 20:41:41 UTC
Created attachment 299515 [details]
relocate hmb disabling on s2idle

So, I notice your device has an HMB. The spec recommends that a host disable the HMB prior to shutting down the controller, but it's not required, and this driver doesn't do that. I wonder if this controller requires the recommended sequence...

I've attached an experimental patch that will disable HMB first. Could you see if this is successful?
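
The attached patch is the authoritative change; as a rough, hedged sketch of the idea (function names taken from the driver's suspend path as it appears in the logs later in this thread), the reordering amounts to:

    /* Sketch only -- see the attached patch for the real change. */
    static int nvme_suspend(struct device *dev)
    {
            struct nvme_dev *ndev = pci_get_drvdata(to_pci_dev(dev));
            int ret;

            /* Tell the controller to stop using the Host Memory Buffer
             * (passing 0 disables it), as the spec recommends... */
            ret = nvme_set_host_mem(ndev, 0);
            if (ret < 0)
                    return ret;

            /* ...and only then shut the controller down. */
            return nvme_disable_prepare_reset(ndev, true);
    }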
Comment 6 pbs3141 2021-11-10 23:36:15 UTC
I tested your patch, and although the 30-second freeze is now gone, the drive still disconnects after 30 seconds. I've attached a new dmesg.

Sorry about the slight delay; I'm in UTC+9. I also had some teething problems with the kernel build; subsequent builds should be much faster.
Comment 7 pbs3141 2021-11-10 23:39:33 UTC
Created attachment 299521 [details]
dmesg
Comment 8 Keith Busch 2021-11-11 02:33:46 UTC
The dmesg actually looks the same as before. There's still a memory-access IO fault around the suspend sequence, and the controller still reports fatal status on resume. I was hoping the IO fault was an HMB access that the patch could have prevented, but I cannot tell what the fault is about just from the dmesg.

It looks like the kernel version is the same as before the patch, though. Are you sure it's applied? If you're building from a git repo, there should be a '+' in the version string, since there are commits beyond the tagged kernel. I guess we could add a printk to make absolutely sure it's definitely running the patch.
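
One quick way to check (assuming a build from a git tree): the running kernel's version string carries that suffix.

    uname -r
    # e.g. "5.15.0+" -- the trailing '+' means the tree has commits beyond
    # the v5.15 tag (with CONFIG_LOCALVERSION_AUTO=y it would instead be a
    # "-gXXXXXXX" hash suffix)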
Comment 9 pbs3141 2021-11-11 03:45:48 UTC
I suspected so too at first, but I just threw in a printk like you said and it appears in the dmesg, so the patch is definitely running.

I've attached my crappy printk patch and dmesg for completeness, though there's nothing interesting in either of them.

My Comment 6 is also a red herring. The freeze disappearing was not due to the patch; it was due to me changing the set of commands I had run before suspending.

Other notes
-----------

 - For anyone else who stumbles upon this thread in the future, it is necessary to run mount, dmesg and cat each at least once before suspending. Otherwise you will not be able to use them later to write out the dmesg after the error. See the example below.
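
A minimal example of that precaution; the point is that the binaries and the libraries they need get cached in RAM, so they keep working after the root NVMe device drops out:

    # run once before suspending, so these tools survive the disconnect
    mount > /dev/null
    dmesg > /dev/null
    cat /dev/null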
Comment 10 pbs3141 2021-11-11 03:46:51 UTC
Created attachment 299523 [details]
my crappy printk patch
Comment 11 pbs3141 2021-11-11 03:47:44 UTC
Created attachment 299525 [details]
dmesg output after my crappy printk patch
Comment 12 Keith Busch 2021-11-11 06:38:18 UTC
Thanks, the print was helpful, but we may need more before the return statements. It looks like the IO fault happens after the nvme driver completes its suspend sequence. If that is the case, the device has no business accessing host memory, especially since we moved the HMB disable ahead of shutdown.

I am assuming the IO faults are related to the controller fatal status. If that assumption is correct, I'm not sure what we can do from the driver to help, since it happens after the driver completes shutdown. We may need someone from Kioxia to explain the stuck CSTS.CFS bit.
Comment 13 pbs3141 2021-11-11 09:44:22 UTC
Ok, I'll stuff it with printks and see what happens :D
Comment 14 pbs3141 2021-11-11 14:41:17 UTC
I wrapped every line in pre/post printks and logged return values, and the result was this:

[  137.679215] Pre: ret = nvme_set_host_mem(ndev, 0);
[  137.695890] Post: ret = nvme_set_host_mem(ndev, 0);
[  137.695893] 0
[  137.695895] Pre: return nvme_disable_prepare_reset(ndev, true);
[  137.837283] ACPI: EC: interrupt blocked
[  137.933840] ACPI: button: The lid device is not compliant to SW_LID.
[  137.943184] ACPI: EC: interrupt unblocked
[  137.974130] nvme 0000:02:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0009 address=0xc8961000 flags=0x0000]
[  137.974134] nvme 0000:02:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0009 address=0xc8961000 flags=0x0000]

So the page fault is still happening after the suspend.

The rest of the dmesg, as well as the patch that generated it, are attached.
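
The attached patch contains the exact instrumentation; schematically, each statement in the suspend path was wrapped like this (a sketch reconstructed from the log above):

    printk("Pre: ret = nvme_set_host_mem(ndev, 0);\n");
    ret = nvme_set_host_mem(ndev, 0);
    printk("Post: ret = nvme_set_host_mem(ndev, 0);\n");
    printk("%d\n", ret);

    printk("Pre: return nvme_disable_prepare_reset(ndev, true);\n");
    return nvme_disable_prepare_reset(ndev, true);
    /* nothing can print after a bare return -- see Comment 17 */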
Comment 15 pbs3141 2021-11-11 14:42:17 UTC
Created attachment 299533 [details]
dmesg
Comment 16 pbs3141 2021-11-11 14:43:51 UTC
Created attachment 299535 [details]
patch
Comment 17 Keith Busch 2021-11-11 18:44:13 UTC
It's not completely clear since you "surround" a 'return' statement, so the post message never gets printed. However, I suspect it does complete prior to the page fault event. Now, I am speculating that the fault and the controller fatal status are related, but there's no way I can confirm that.

Going back through your dmesg.... The fault always happens on the same address, c8961000. Where on earth is the device getting this address from, and why is it accessing it at this point? The address falls within this e820 range:

BIOS-e820: [mem 0x00000000c76d7000-0x00000000ccffdfff] reserved

So it's not usable. Something very bizarre is happening here: where did the device get this address, and why is it accessing it after the driver shut it down? I'm not sure there's anything we can do from the driver side to help here.
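
The containment is easy to verify by hand against the boot log:

    # 0xc76d7000 <= 0xc8961000 <= 0xccffdfff, so the faulting address
    # lies inside the reserved window:
    dmesg | grep 'BIOS-e820' | grep c76d7000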
Comment 18 pbs3141 2021-11-12 03:14:27 UTC
> It's not completely clear since you "surround" a 'return' statement, so the
> post message never gets printed. However, I suspect it does complete prior to
> the page fault event.

It was cavalier of me not to surround the return statement. But your suspicion is correct; I just fixed the loophole and got

[   67.967564] Pre: ret = nvme_set_host_mem(ndev, 0);
[   67.981593] Post: ret = nvme_set_host_mem(ndev, 0);
[   67.981598] ret = 0
[   67.981601] Pre: ret = nvme_disable_prepare_reset(ndev, true);
[   68.128321] Post: ret = nvme_disable_prepare_reset(ndev, true);
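
In sketch form, the loophole fix that produced the lines above just captures the return value so the Post message can print before returning:

    printk("Pre: ret = nvme_disable_prepare_reset(ndev, true);\n");
    ret = nvme_disable_prepare_reset(ndev, true);
    printk("Post: ret = nvme_disable_prepare_reset(ndev, true);\n");
    return ret;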

> So it's not usable. Something very bizarre is happening here: where did the
> device get this address, and why is it accessing it after the driver shut it
> down? I'm not sure there's anything we can do from the driver side to help
> here.

So, how to take it from here? The immediate options I can see are

 - Ask KIOXIA about the stuck bit and the page fault, related or not. For this, the contact victor.gladkov@kioxia.com may be useful, having posted on this list in the past. If KIOXIA don't have a clue, ask HP.

 - Figure out what the kernel does to trigger the page fault. Perhaps this would suggest moving the discussion to another kernel subsystem, where further progress could be made.
Comment 19 Keith Busch 2021-11-12 04:10:40 UTC
Reaching out to Kioxia is probably the most reasonable approach at this point.
Comment 20 pbs3141 2021-11-12 07:39:04 UTC
Ok, I dropped them a message at their technical enquiry page,

    https://customer-us.kioxia.com/inquiry/product

The message says

    The KBG40ZNV256G SSD drive is behaving oddly under Linux, at least on the HP 14-fq1021nr laptop, resulting in broken suspend. The odd behaviour is as follows:

    1. When the laptop wakes up from suspend, the drive is stuck in fatal status 0x3. The error bit remains stuck even after a controller reset. This leads to the drive not being usable after suspend.

    2. Just before the laptop goes into suspend, the drive generates a page fault on address c8961000, which falls within a reserved address range according to the BIOS. This may or may not be helpful in diagnosing the first problem.

    We'd be very grateful for your assistance in explaining this behaviour. The corresponding discussion on kernel.org is https://bugzilla.kernel.org/show_bug.cgi?id=214983.

Sorry if I've misrepresented what you've said!
Comment 21 Keith Busch 2021-11-12 16:43:34 UTC
Thanks, your message sounds good to me.
Comment 22 pbs3141 2021-11-13 10:39:41 UTC
Following some obscure advice I came across on the Arch Wiki, I found that booting with

    iommu=soft

fixes the issue, though this can only be considered a temporary workaround until the real issue is fixed.
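
To make the workaround persist across boots (assuming a GRUB-based setup; adjust for other bootloaders), append the option to the kernel command line and regenerate the config:

    # /etc/default/grub -- append iommu=soft to the existing options
    GRUB_CMDLINE_LINUX_DEFAULT="... iommu=soft"

    # then regenerate, e.g.:
    sudo grub-mkconfig -o /boot/grub/grub.cfg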

So you were right that the page fault was related. Does this help explain what's going on any better?
Comment 23 Keith Busch 2021-11-15 17:11:03 UTC
IIUC, that option should have the kernel bounce the destination through a valid address, so I guess it makes sense that the soft IOMMU would cure the IO page fault. This option is outside the driver, though, so I'm a little outside my domain here. If you want to take this to the mailing list, linux-nvme@lists.infradead.org may have more knowledgeable people.
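
(If it helps confirm the mechanism: iommu=soft makes the kernel fall back to swiotlb bounce buffering instead of the AMD IOMMU, which should be visible at boot; exact message wording varies by kernel version.)

    dmesg | grep -i 'software io tlb'
    # e.g. "software IO TLB: mapped [mem 0x...-0x...] (64MB)"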
