Bug 217802
Summary: | regression NVME failure in 6.4.11 : 6.4.10 works fine. | ||
---|---|---|---|
Product: | Drivers | Reporter: | Gene (gjunk2) |
Component: | Flash/Memory Technology Devices | Assignee: | David Woodhouse (dwmw2) |
Status: | RESOLVED CODE_FIX | ||
Severity: | high | CC: | aslatter, bjorn, bronecki.damian, carlon.luca, fergalmt, hi, info, ivzave, jade, jastxakajasmineteax, kernel_bugzilla, kgreunke, martchus, mike, miles, miso, ngompa13, nikof.06, philipp.misof, regressions, timkniel |
Priority: | P3 | ||
Hardware: | Intel | ||
OS: | Linux | ||
Kernel Version: | 6.4.11 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | 101bd907b4244a726980ee67f95ed9cafab6ff7a |
Description
Gene
2023-08-16 20:21:23 UTC
Also I did try 6.4.11 with the suggested options:
nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

Also did not boot.

git bisect results on lkml
https://lkml.org/lkml/2023/8/16/1363

Just FYI in case of interest to anyone.

I can confirm that blacklisting the drivers (rtsx_pci_sdmmc and rtsx_pci) and rebuilding the initramfs - rebooting then works fine for both 6.4.11 and 6.5-rc6.

(In reply to Gene from comment #1)
> Also I did try 6.4.11 with the suggested options :
> nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
>
> Also did not boot.

Hello, I'm facing this same problem with linux-mainline-6.5rc6-1 (built by Chaotic-AUR), linux-zen-6.4.12 and linux-lts-6.1.47-1. OS is Garuda Linux. I understand that here, support is not given for downstream kernels like Zen and LTS.

In my case, adding nvme_core.default_ps_max_latency_us=0 pcie_aspm=off did fix it for me and some others facing similar issues (they didn't get thrown into an emergency shell after failing to switch root though - they got stuck on a black screen instead). None of us tried blacklisting the kernels, as these boot params suggested by the error worked.

Everyone affected by this used NVMe devices, a lot of them from Samsung. I use a Dell XPS 15 9560 (Toshiba KXG50ZNV512G NVMe 512GB). It has the problematic Realtek card reader.

I'm unsure if I should make a new report since the problem is only slightly different, with newer kernels. Reporting kernel bugs is very new to me so please let me know the right course of action for reporting this :) (not just Gene, but anyone here).

(In reply to Jasmine T from comment #4)
> None of us tried blacklisting the kernels

Sorry, typo... modules, not kernels. Need sleep.

I have the same issue since 6.4.11 on my Dell XPS 15 9560 laptop using Fedora 38.

Same issue here on 6.4.11 or higher
Dell Precision 5520
sn: X7AS11Z7TYAT
model: KXG50ZNV1T02 NVMe TOSHIBA 1024GB
lspci: Toshiba Corporation XG5 NVMe SSD Controller (prog-if 02 [NVM Express])

Hi !
I have the same issue here, with a DELL XPS 15 9560, like Damian B.

Same issue, Dell Precision 5520.

I also experienced this issue. All kernels suddenly stopped booting: 6.5, 6.4, 6.1 and 5.15.
6.1 stopped working between 6.1.45 and 6.1.46.
By bisection I can say that, after this commit, boot of 6.1 fails: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=8ee39ec479147e29af704639f8e55fce246ed2d9. Same as already mentioned.

Machine is: Dell Precision 5520
SSD: PM961 NVMe SED Samsung 512GB

I also noticed another unrelated issue, so I decided to replace my SSD with a Samsung SSD 970 EVO Plus 1TB. This seems to solve both issues and I can now boot whatever kernel version I tested.

So happy I'm not alone, suffering for over a week due to this with Samsung 980 Pro 2TB:

02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO (prog-if 02 [NVM Express])
    Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
    Flags: bus master, fast devsel, latency 0, IRQ 16, NUMA node 0, IOMMU group 17
    Memory at 85000000 (64-bit, non-prefetchable) [size=16K]
    Capabilities: <access denied>
    Kernel driver in use: nvme
    Kernel modules: nvme

Tried different motherboard, bought another CPU to try, will try older kernel now.

Here are kernel logs:

[ 2762.189019] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[ 2762.189022] nvme nvme0: Does your device have a faulty power saving mode enabled?
[ 2762.189022] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
[ 2762.254958] nvme 0000:02:00.0: enabling device (0000 -> 0002)
[ 2762.255161] nvme nvme0: Disabling device after reset failure: -19
[ 2762.271015] I/O error, dev nvme0n1, sector 178296536 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 2
[ 2762.271044] kworker/u64:12: attempt to access beyond end of device

My issue must be different, having it with kernels down to 6.4.3 on ASUS PRIME Z690-P D4.

Just to clarify: is the latest 6.5.y version and/or 6.6-rc1 broken as well?

Same here with a Dell XPS 9560. The issue originally manifested with the stock Toshiba SSD (THNSN5256GPUK 256GB). I tried replacing it with a WD_BLACK SN770 1TB, same behaviour. With the new WD SSD installed, adding the kernel parameters "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" does not seem to help.

I tried the following kernels:
- 6.1.52-1 (lts): doesn't work
- 6.4.8: works out of the box
- 6.5.2: doesn't work

openSUSE Tumbleweed already applied patch to kernel 6.4.12 and I can confirm, it works on my XPS 15 9560

(In reply to Michal Hlavac from comment #15)
> openSUSE Tumbleweed already applied patch

What patch? A revert of 101bd907b4244a ("misc: rtsx: judge ASPM Mode to set PETXCFG Reg")?

Sorry, I don't know, downstream ticket is https://bugzilla.suse.com/show_bug.cgi?id=1214428
Maybe it will help you

(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #16)
> What patch? A revert of 101bd907b4244a ("misc: rtsx: judge ASPM Mode to set
> PETXCFG Reg")?

Yes
https://github.com/openSUSE/kernel-source/commit/1b02b1528a26f4e9b577e215c114d8c5e773ee10

It is reported as still present on 6.5.2 in https://bugs.archlinux.org/task/79439#comment221866

(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #13)
> Just to clarify: is the latest 6.5.y version and/or 6.6-rc1 broken as well?
Bug is still present on 6.6rc1-1 (using build from Chaotic-AUR). Errors are exactly the same as what I previously experienced.

More details including errors and system specs here: https://forum.garudalinux.org/t/dumped-into-emergency-shell-after-update-failed-to-start-switch-root-btrfs-errors/30440

Dell XPS 9560 i7-7700HQ with Toshiba KXG50ZNV512G NVMe 512GB (completely stock model)

Hi all,

There are several people hitting this, also on the 9560, downstream at NixOS. I have confirmed that the revert on 6.4 fixes my machine booting.

This is our bug: https://github.com/NixOS/nixpkgs/issues/253418, and there is a bunch of troubleshooting here: https://discourse.nixos.org/t/nvme-drive-not-detecting-after-calameres-initiates/32108

My plan is to submit a change to revert the patch on all supported kernels in NixOS, following openSUSE.

The issue has been known for over a month now, yet the bad commit has still not been reverted in either mainline or stable. No idea what's going on.

It looks like the fix is still waiting for Tested-by tags from people affected by this issue: https://lore.kernel.org/lkml/37b1afb997f14946a8784c73d1f9a4f5@realtek.com/

You could test it and submit one. ;)

Yeah, tested-by would likely help; FWIW, I was and still am unhappy about how this regression is handled, but CCing Linus ~two weeks ago and pointing him to the discussion yesterday[1] didn't lead to any visible action from his side. :-/

[1] https://lore.kernel.org/all/169557219938.3206394.2779757887621800924@leemhuis.info/

FWIW, testing is always helpful in cases like this, but not needed anymore, as things will likely proceed soon anyway:
https://lore.kernel.org/all/2023092522-climatic-commend-8c99@gregkh/

Ah I was just about to test the patch... awesome to hear ^^ thank you everyone for your hard work on this regression.
On 25/9/23 7:13 pm, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=217802
>
> --- Comment #24 from The Linux kernel's regression tracker (Thorsten
> Leemhuis) (regressions@leemhuis.info) ---
> FWIW, testing is always helpful in cases like this, but not needed anymore,
> as things will likely proceed soon anyway:
> https://lore.kernel.org/all/2023092522-climatic-commend-8c99@gregkh/
>

Resolved in 6.6-rc4

Should be in 6.5.6 stable as well.

Apparently the fix is 0e4cac557531 ("misc: rtsx: Fix some platforms can not boot and move the l1ss judgment to probe"), which is included in v6.6.

https://git.kernel.org/linus/0e4cac557531
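
For anyone triaging a similar report, here is a rough sketch (not from any commenter above) that scans saved kernel log text for the NVMe reset message quoted in this thread and checks /proc/modules-style text for the Realtek card-reader modules. The function names are made up for illustration, and the module set assumes the two modules to blacklist are rtsx_pci and rtsx_pci_sdmmc. Finding both is only a hint, not proof of this regression: one reporter above with the same log message concluded their issue was a different one.

```python
import re

# Matches the reset message quoted in this thread, e.g.
# "nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10"
RESET_RE = re.compile(
    r"nvme (?P<ctrl>nvme\d+): controller is down; will reset: "
    r"CSTS=(?P<csts>0x[0-9a-fA-F]+), PCI_STATUS=(?P<pci_status>0x[0-9a-fA-F]+)"
)

# Assumption: these are the card-reader modules discussed in this thread.
RTSX_MODULES = {"rtsx_pci", "rtsx_pci_sdmmc"}


def find_nvme_resets(log_text):
    """Return a (controller, CSTS, PCI_STATUS) tuple for each reset message."""
    return [
        (m.group("ctrl"), int(m.group("csts"), 16), int(m.group("pci_status"), 16))
        for m in RESET_RE.finditer(log_text)
    ]


def loaded_rtsx_modules(proc_modules_text):
    """Return the rtsx modules present in /proc/modules-style text.

    The first whitespace-separated field of each line is the module name.
    """
    names = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return sorted(names & RTSX_MODULES)


if __name__ == "__main__":
    sample_log = (
        "[ 2762.189019] nvme nvme0: controller is down; will reset: "
        "CSTS=0xffffffff, PCI_STATUS=0x10"
    )
    for ctrl, csts, pci_status in find_nvme_resets(sample_log):
        print(f"{ctrl}: CSTS={csts:#x} PCI_STATUS={pci_status:#x}")
```

On a running system the inputs would typically be the output of dmesg and the contents of /proc/modules; on a machine that no longer boots, the log text has to come from a saved journal or a photo transcription instead.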