Bug 217802

Summary: regression NVME failure in 6.4.11 : 6.4.10 works fine.
Product: Drivers Reporter: Gene (gjunk2)
Component: Flash/Memory Technology Devices Assignee: David Woodhouse (dwmw2)
Status: RESOLVED CODE_FIX    
Severity: high CC: aslatter, bjorn, bronecki.damian, carlon.luca, fergalmt, hi, info, ivzave, jade, jastxakajasmineteax, kernel_bugzilla, kgreunke, martchus, mike, miles, miso, ngompa13, nikof.06, philipp.misof, regressions, timkniel
Priority: P3    
Hardware: Intel   
OS: Linux   
Kernel Version: 6.4.11 Subsystem:
Regression: Yes Bisected commit-id: 101bd907b4244a726980ee67f95ed9cafab6ff7a

Description Gene 2023-08-16 20:21:23 UTC
Failure manually transcribed:

kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
kernel: nvme 0000:04:00.0: Unable to change power state from D3cold to D0, device inaccessible
kernel: nvme nvme0: Disabling device after reset failure: -19
mount[353]: mount /sysroot: can't read superblock on /dev/nvme0n1p5.
mount[353]:       dmesg(1) may have more information after failed mount system call.
kernel: nvme0n1: detected capacity change from 2000409264 to 0
kernel: EXT4-fs (nvme0n1p5): unable to read superblock
systemd[1]: sysroot.mount: Mount process exited, code=exited, status=32/n/a
...
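
For anyone checking their own logs for the same signature, here is a minimal Python sketch. The function name and the regular expressions are illustrative only; they are derived from the messages quoted in this report and are not part of any kernel tooling. It flags a controller only when both the "controller is down" line and the failed-reset line appear.

```python
import re

# Patterns derived from the log lines quoted in this report.
# Illustrative only: not an exhaustive or official failure signature.
CONTROLLER_DOWN = re.compile(
    r"nvme (nvme\d+): controller is down; will reset: "
    r"CSTS[=:](0x[0-9a-fA-F]+), PCI_STATUS=(0x[0-9a-fA-F]+)"
)
RESET_FAILED = re.compile(
    r"nvme (nvme\d+): Disabling device after reset failure: (-?\d+)"
)


def find_nvme_reset_failures(lines):
    """Map each controller that went down AND failed its reset to its details."""
    down = {}
    failed = {}
    for line in lines:
        m = CONTROLLER_DOWN.search(line)
        if m:
            down[m.group(1)] = {"csts": m.group(2), "pci_status": m.group(3)}
            continue
        m = RESET_FAILED.search(line)
        if m:
            failed[m.group(1)] = int(m.group(2))
    # Keep only controllers for which both messages were seen.
    return {
        ctrl: {**info, "errno": failed[ctrl]}
        for ctrl, info in down.items()
        if ctrl in failed
    }
```

The function accepts any iterable of log lines, for example a saved copy of the kernel log from the failed boot (such as the output of `journalctl -k -b -1`).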

All kernels are upstream, untainted and compiled on Arch linux using:

 gcc version 13.2.1

Kernels Tested:
 - 6.4.10 - works fine
 - 6.5-rc6 - fails
 - 6.4.11 with 1 revert also fails

    Revert "nvme-pci: add NVME_QUIRK_BOGUS_NID for Samsung PM9B1 256G and 512G"
    
    This reverts commit 061fbf64825fb47367bbb6e0a528611f08119473.

Hardware:
  model name      : Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
  stepping        : 9
  microcode       : 0xf4

nvme:
04:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961/SM963

All tests were run on a Dell laptop running Arch.
Comment 1 Gene 2023-08-16 20:33:07 UTC
Also I did try 6.4.11 with the suggested options : 
   nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

Also did not boot.
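
To confirm that the two options actually reached the kernel on a given boot, a small Python sketch like the following can help. The helper name is hypothetical, and the only assumption about the system is that the running kernel's command line is readable from /proc/cmdline.

```python
# Options suggested by the nvme driver's own error message in this report.
WORKAROUND = {
    "nvme_core.default_ps_max_latency_us": "0",
    "pcie_aspm": "off",
}


def missing_workaround_params(cmdline, wanted=WORKAROUND):
    """Return the 'key=value' workaround options absent from a kernel command line."""
    present = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore bare flags such as "rw" or "quiet"
            present[key] = value
    # An option counts as missing if it is absent or set to a different value.
    return [f"{k}={v}" for k, v in wanted.items() if present.get(k) != v]
```

On a running system, pass it the contents of /proc/cmdline, e.g. `missing_workaround_params(open("/proc/cmdline").read())`. An empty list means both options were present with the suggested values.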
Comment 2 Gene 2023-08-17 01:38:23 UTC
The git bisect results are on LKML:

https://lkml.org/lkml/2023/8/16/1363
Comment 3 Gene 2023-08-17 10:19:28 UTC
Just FYI in case of interest to anyone.

I can confirm that after blacklisting the drivers (rtsx_pci_sdmmc and rtsx_pci), rebuilding the initramfs and rebooting, both 6.4.11 and 6.5-rc6 work fine.
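
For reference, a minimal Python sketch of what such a blacklist file could contain. It only renders the text. The file name mentioned below is a hypothetical choice, and the assumption is that the two Realtek card reader modules are rtsx_pci_sdmmc and rtsx_pci. The `blacklist` directive is standard modprobe.d syntax.

```python
# Realtek card reader modules blacklisted as a workaround (assumed names).
MODULES = ("rtsx_pci_sdmmc", "rtsx_pci")


def render_blacklist(modules=MODULES):
    """Render modprobe.d file content that blacklists the given modules."""
    lines = ["# Workaround for kernel bug 217802 (illustrative content)"]
    lines += [f"blacklist {name}" for name in modules]
    return "\n".join(lines) + "\n"
```

The resulting text would go into a file under /etc/modprobe.d/, for instance a hypothetical rtsx-blacklist.conf, followed by an initramfs rebuild with the distribution's own tool so that the early boot image picks it up.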
Comment 4 Jasmine T 2023-08-27 14:03:44 UTC
(In reply to Gene from comment #1)
> Also I did try 6.4.11 with the suggested options : 
>    nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
> 
> Also did not boot.

Hello,
I'm facing this same problem with linux-mainline-6.5rc6-1 (built by Chaotic-AUR), linux-zen-6.4.12 and linux-lts-6.1.47-1. OS is Garuda Linux. I understand that here, support is not given for downstream kernels like Zen and LTS. 

In my case, adding 
    nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
did fix it for me and some others facing similar issues (they didn't get thrown into an emergency shell after failing to switch root though - they got stuck on a black screen instead). None of us tried blacklisting the kernels, as these boot params suggested by the error worked. 
Everyone affected by this used NVMe devices, a lot of them from Samsung.

I use a Dell XPS 15 9560 (Toshiba KXG50ZNV512G NVMe 512GB). It has the problematic Realtek card reader.
I'm unsure if I should make a new report since the problem is only slightly different, with newer kernels. Reporting kernel bugs is very new to me so please let me know the right course of action for reporting this :) (not just Gene, but anyone here).
Comment 5 Jasmine T 2023-08-27 14:06:36 UTC
(In reply to Jasmine T from comment #4)
> None of us tried blacklisting the kernels


Sorry, typo... modules, not kernels. Need sleep.
Comment 6 Damian B 2023-08-29 06:32:22 UTC
I have had the same issue since 6.4.11 on my Dell XPS 15 9560 laptop running Fedora 38.
Comment 7 René 2023-08-30 09:24:50 UTC
Same issue here on 6.4.11 or higher

Dell Precision 5520
sn: X7AS11Z7TYAT
model: KXG50ZNV1T02 NVMe TOSHIBA 1024GB 
lspci: Toshiba Corporation XG5 NVMe SSD Controller (prog-if 02 [NVM Express])
Comment 8 Fergal MT 2023-09-01 06:38:43 UTC
Hi !

I have the same issue here, with a DELL XPS 15 9560, like Damian B.
Comment 9 timkniel 2023-09-06 15:23:29 UTC
Same issue, Dell Precision 5520.
Comment 10 Luca Carlon 2023-09-08 00:04:50 UTC
I also experienced this issue. All kernels suddenly stopped booting: 6.5, 6.4, 6.1 and 5.15. 6.1 stops working from 6.1.45 to 6.1.46.

By bisection I can say that boot of 6.1 fails after this commit: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=8ee39ec479147e29af704639f8e55fce246ed2d9. It is the same change already mentioned above.

Machine is:
Dell Precision 5520
SSD: PM961 NVMe SED Samsung 512GB

I also noticed another unrelated issue, so I decided to replace my SSD with a Samsung SSD 970 EVO Plus 1TB. This seems to solve both issues and I can now boot whatever kernel version I tested.
Comment 11 Nazar Mokrynskyi 2023-09-08 02:15:04 UTC
So happy I'm not alone. I have been suffering from this for over a week with a Samsung 980 Pro 2TB:

02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO (prog-if 02 [NVM Express])
	Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
	Flags: bus master, fast devsel, latency 0, IRQ 16, NUMA node 0, IOMMU group 17
	Memory at 85000000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: <access denied>
	Kernel driver in use: nvme
	Kernel modules: nvme

Tried different motherboard, bought another CPU to try, will try older kernel now.

Here are kernel logs:
[ 2762.189019] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[ 2762.189022] nvme nvme0: Does your device have a faulty power saving mode enabled?
[ 2762.189022] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
[ 2762.254958] nvme 0000:02:00.0: enabling device (0000 -> 0002)
[ 2762.255161] nvme nvme0: Disabling device after reset failure: -19
[ 2762.271015] I/O error, dev nvme0n1, sector 178296536 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 2
[ 2762.271044] kworker/u64:12: attempt to access beyond end of device
Comment 12 Nazar Mokrynskyi 2023-09-08 11:11:15 UTC
My issue must be different, having it with kernels down to 6.4.3 on ASUS PRIME Z690-P D4.
Comment 13 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-09-11 08:02:53 UTC
Just to clarify: is the latest 6.5.y version and/or 6.6-rc1 broken as well?
Comment 14 Nicola 2023-09-11 10:02:16 UTC
Same here with a Dell XPS 9560.

The issue originally manifested with the stock Toshiba SSD (THNSN5256GPUK 256GB). I tried replacing it with a WD_BLACK SN770 1TB, same behaviour.

With the new WD SSD installed, adding the kernel parameters "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" does not seem to help.

I tried the following Kernels:
- 6.1.52-1 (lts): doesn't work
- 6.4.8: works out of the box
- 6.5.2: doesn't work
Comment 15 Michal Hlavac 2023-09-11 10:12:29 UTC
openSUSE Tumbleweed already applied patch to kernel 6.4.12 and I can confirm, it works on my XPS 15 9560
Comment 16 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-09-11 10:16:53 UTC
(In reply to Michal Hlavac from comment #15)
> openSUSE Tumbleweed already applied patch 

What patch? A revert of 101bd907b4244a ("misc: rtsx: judge ASPM Mode to set PETXCFG Reg")?
Comment 17 Michal Hlavac 2023-09-11 10:26:47 UTC
Sorry, I don't know, downstream ticket is https://bugzilla.suse.com/show_bug.cgi?id=1214428 
Maybe it will help you
Comment 18 loqs 2023-09-11 10:49:35 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #16)
> What patch? A revert of 101bd907b4244a ("misc: rtsx: judge ASPM Mode to set
> PETXCFG Reg")?
Yes https://github.com/openSUSE/kernel-source/commit/1b02b1528a26f4e9b577e215c114d8c5e773ee10

It is reported as still present on 6.5.2 in https://bugs.archlinux.org/task/79439#comment221866
Comment 19 Jasmine T 2023-09-12 06:11:50 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #13)
> Just to clarify: is the latest 6.5.y version and/or 6.6-rc1 broken as well?

Bug is still present on 6.6rc1-1 (using build from Chaotic-AUR).
Errors are exactly the same as what I previously experienced. More details including errors and system specs here: 
https://forum.garudalinux.org/t/dumped-into-emergency-shell-after-update-failed-to-start-switch-root-btrfs-errors/30440

Dell XPS 9560 i7-7700HQ with Toshiba KXG50ZNV512G NVMe 512GB (completely stock model)
Comment 20 Jade 2023-09-18 07:18:23 UTC
Hi all,

There are several people hitting this, also on the 9560, downstream at NixOS. I have confirmed that the revert on 6.4 fixes my machine booting.

This is our bug: https://github.com/NixOS/nixpkgs/issues/253418, and there is a bunch of troubleshooting here: https://discourse.nixos.org/t/nvme-drive-not-detecting-after-calameres-initiates/32108

My plan is to submit a change to revert the patch on all supported kernels in NixOS, following openSUSE's lead.
Comment 21 Artem S. Tashkinov 2023-09-24 11:59:31 UTC
The issue has been known for over a month now, yet the bad commit has still not been reverted in either mainline or stable. No idea what's going on.
Comment 22 Alyssa Ross 2023-09-24 15:16:18 UTC
It looks like the fix is still waiting for Tested-by tags from people affected by this issue: https://lore.kernel.org/lkml/37b1afb997f14946a8784c73d1f9a4f5@realtek.com/

You could test it and submit one. ;)
Comment 23 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-09-25 07:54:48 UTC
Yeah, tested-by would likely help; FWIW, I was and still am unhappy about how this regression is handled, but CCing Linus ~two weeks ago and pointing him to the discussion yesterday[1] didn't lead to any visible action from his side. :-/

[1] https://lore.kernel.org/all/169557219938.3206394.2779757887621800924@leemhuis.info/
Comment 24 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-09-25 09:13:10 UTC
FWIW, testing is always helpful in cases like this, but not needed anymore, as things will likely proceed soon anyway:
https://lore.kernel.org/all/2023092522-climatic-commend-8c99@gregkh/
Comment 25 Eva 2023-09-25 10:57:58 UTC
Ah I was just about to test the patch... awesome to hear ^^ thank you 
everyone for your hard work on this regression.

On 25/9/23 7:13 pm, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=217802
>
> --- Comment #24 from The Linux kernel's regression tracker (Thorsten
> Leemhuis) (regressions@leemhuis.info) ---
> FWIW, testing is always helpful in cases like this, but not needed anymore,
> as things will likely proceed soon anyway:
> https://lore.kernel.org/all/2023092522-climatic-commend-8c99@gregkh/
>
Comment 26 Gene 2023-10-01 23:08:29 UTC
Resolved in 6.6-rc4
Should be in 6.5.6 stable as well.
Comment 27 Bjorn Helgaas 2023-11-01 12:04:45 UTC
Apparently the fix is 0e4cac557531 ("misc: rtsx: Fix some platforms can not boot and move the l1ss judgment to probe"), which is included in v6.6.

https://git.kernel.org/linus/0e4cac557531