Created attachment 303231 [details] output of `nvme id-ctrl /dev/nvme0`

# Issue description

A few seconds after GNOME starts, my system completely freezes, then drops back to the console and repeatedly prints error messages like:

> EXT4-fs error (device nvme0n1p3): __ext4_find_entry:1635: inode #1309456: comm gmain: reading directory lblock #0
> EXT4-fs error (device nvme0n1p3): __ext4_find_entry:1635: inode #1315081: comm systemd-login: reading directory lblock #0

The only way out is to reboot.

# How to reproduce

Just boot normally, let GNOME start as usual, and wait a few seconds for the system to freeze.

# What I've tried (without much luck)

- I installed the OS on the drive to begin with, so the drive works long enough to have everything copied and installed.
- This is the 3rd drive I'm testing (I tried two Samsung 970 EVO Plus 2 TB drives before, and now a Seagate FireCuda 530 2 TB).
- I've tried disabling APST by setting the kernel parameter nvme_core.default_ps_max_latency_us=0 (I've also tried various other values, starting with 5500 as I've seen recommended a few times).
- Installing a different distribution (Manjaro).
- Starting in recovery mode and then resuming the normal boot seems to postpone the freeze, but only by a few minutes; it still freezes.
- On the first two Samsung drives, I managed to install Windows. I did not try with the latest drive.

# Software / hardware

- Linux kernel: 5.15.0-53-generic
- Distribution: Zorin OS 16.2
- NVMe drive: Seagate FireCuda 530 2 TB
- CPU: AMD Ryzen 5700X

# Attached

- `nvme_id_ctrl.txt`: the output of `nvme id-ctrl /dev/nvme0`
- `smartctl.txt`: the output of `smartctl -a /dev/nvme0`
Created attachment 303232 [details] output of `smartctl -a /dev/nvme0`
Created attachment 303233 [details] output of `modinfo nvme_core`
Unfortunately nothing here so far tells us what's going on. Would it be possible to boot off a different drive and recreate these ext4 errors with the NVMe used as a data mount instead? If you can do that and attach a dmesg, we'll be better able to work out next steps.
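For reference, a minimal sketch of that kind of reproduction, assuming the NVMe data partition is /dev/nvme0n1p3 and an otherwise unused /mnt/nvme mount point (both names are only examples, adjust to the actual layout):

    # boot from the other drive, then mount the NVMe partition as a plain data mount
    sudo mkdir -p /mnt/nvme
    sudo mount /dev/nvme0n1p3 /mnt/nvme

    # generate some read/write traffic on the mounted filesystem
    sudo cp -r /usr/share/doc /mnt/nvme/io-test
    sync

    # in another terminal, watch for nvme/ext4 errors as they appear ...
    sudo dmesg -w

    # ... and once the errors show up, save the full log to attach here
    sudo dmesg > dmesg-nvme-data-mount.txt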
Another try then: I've captured dmesg before the whole system freezes (still booting from the NVMe):

> [sam. nov. 19 23:22:36 2022] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [sam. nov. 19 23:22:36 2022] blk_update_request: I/O error, dev nvme0n1, sector 152139544 op 0x0:(READ) flags 0x80700 phys_seg 24 prio class 0
> [sam. nov. 19 23:22:36 2022] blk_update_request: I/O error, dev nvme0n1, sector 152139752 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 0
This says the PCIe link is inaccessible: both MMIO and config space reads failed in this case.
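For reference, one way to confirm that from a shell, assuming the drive sits at PCI address 0000:04:00.0 (the address that shows up in the later logs); a config-space read returning all F's means the device is no longer reachable over the link:

    # read the vendor ID (config space offset 0x00, 16 bits);
    # a healthy device returns its real vendor ID, a dead link returns ffff
    sudo setpci -s 04:00.0 0x00.w

    # check whether the device is still enumerated at all
    lspci -s 04:00.0
    ls /sys/bus/pci/devices/0000:04:00.0/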
As per [1], I've tried the kernel parameter pcie_aspm=off, then nvme_core.default_ps_max_latency_us=0, and finally both together, but I get the same result in all three cases.

[1] https://lore.kernel.org/lkml/YnNeTsSzFJqEK%2Fs+@kbusch-mbp.dhcp.thefacebook.com/T/
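For anyone following along, a sketch of how those parameters are typically applied on an Ubuntu-based distribution such as Zorin, assuming GRUB is the bootloader (the file path and update command differ on other setups):

    # /etc/default/grub
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pcie_aspm=off nvme_core.default_ps_max_latency_us=0"

    # regenerate the GRUB config and reboot
    sudo update-grub
    sudo reboot

    # after the reboot, verify the parameters actually took effect
    cat /proc/cmdline
    cat /sys/module/nvme_core/parameters/default_ps_max_latency_us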
If neither of those parameters helped, then we're in territory below the visibility of the NVMe protocol level. This smells like a platform-specific power quirk, so we'll probably need some more details on the platform. Is your machine x86 based? I'll think on this over the weekend, but you might have better luck if you report this via email to the linux-pci@vger.kernel.org and linux-nvme@lists.infradead.org lists instead.
(In reply to Keith Busch from comment #7) > Is your machine x86 based? Oh duh, you already answered that... :) "CPU : AMD Ryzen 5700X" So maybe linux-acpi@vger.kernel.org too.
I've tried what you proposed: booting from another drive and mounting the NVMe drive. The issue appears pretty fast there too. Here is the dmesg output:

> [ 281.797947] blk_update_request: I/O error, dev nvme0n1, sector 2786568960 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
> [ 281.797972] Buffer I/O error on dev nvme0n1p3, logical block 308281696, lost async page write
> [ 281.850852] FAT-fs (nvme0n1p4): unable to read boot sector to mark fs as dirty
> [ 343.901432] EXT4-fs warning (device nvme0n1p3): htree_dirblock_to_tree:1067: inode #77070337: lblock 0: comm ls: error -5 reading directory block
> [ 343.902354] EXT4-fs error (device nvme0n1p3): __ext4_find_entry:1658: inode #77070337: comm test-nvme-write: reading directory lblock 0
> [ 350.028540] Aborting journal on device nvme0n1p3-8.
> [ 350.028548] Buffer I/O error on dev nvme0n1p3, logical block 223903744, lost sync page write
> [ 350.028554] JBD2: Error -5 detected when updating journal superblock for nvme0n1p3-8.

I'm going to try reporting this issue to the mailing lists you've given.
Whoops, sorry, the log got truncated. Here is the full relevant portion:

> [ 281.692677] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
> [ 281.778102] nvme 0000:04:00.0: enabling device (0000 -> 0002)
> [ 281.778436] nvme nvme0: Removing after probe failure status: -19
> [ 281.797929] nvme0n1: detected capacity change from 3907029168 to 0
> [ 281.797947] blk_update_request: I/O error, dev nvme0n1, sector 2786568960 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
> [ 281.797972] Buffer I/O error on dev nvme0n1p3, logical block 308281696, lost async page write
> [ 281.850852] FAT-fs (nvme0n1p4): unable to read boot sector to mark fs as dirty
> [ 343.901432] EXT4-fs warning (device nvme0n1p3): htree_dirblock_to_tree:1067: inode #77070337: lblock 0: comm ls: error -5 reading directory block
> [ 343.902354] EXT4-fs error (device nvme0n1p3): __ext4_find_entry:1658: inode #77070337: comm test-nvme-write: reading directory lblock 0
> [ 350.028540] Aborting journal on device nvme0n1p3-8.
> [ 350.028548] Buffer I/O error on dev nvme0n1p3, logical block 223903744, lost sync page write
> [ 350.028554] JBD2: Error -5 detected when updating journal superblock for nvme0n1p3-8.
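The "comm test-nvme-write" in that log is the test process generating the I/O. The actual script isn't attached, so the following is only a hypothetical sketch of that kind of write/read loop (the mount point and file names are assumptions), meant to be run while `dmesg -w` is open in another terminal:

    #!/bin/sh
    # Hypothetical reproducer in the spirit of the "test-nvme-write" process
    # seen above: keep writing and reading files on the mounted NVMe
    # filesystem until an I/O error occurs.
    MNT=/mnt/nvme   # assumed mount point of /dev/nvme0n1p3
    i=0
    while dd if=/dev/urandom of="$MNT/testfile.$i" bs=1M count=64 conv=fsync 2>/dev/null; do
        cat "$MNT/testfile.$i" > /dev/null || break
        i=$((i+1))
    done
    echo "I/O failed after $i iterations; check dmesg"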
Still showing an unusable link: MMIO returns all F's.
Could you attach output from 'sudo lspci -vvv' and 'lspci -tv'?
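For reference, when scanning that output for a link problem, the link capability/status lines and the AER error status registers are usually the interesting parts; a hedged way to pull just those for the drive, assuming the 0000:04:00.0 address from the log:

    # link speed/width and error status of the NVMe endpoint
    sudo lspci -s 04:00.0 -vvv | grep -E 'LnkCap:|LnkSta:|CESta:|UESta:'

    # the upstream root port is worth the same check; its address is visible
    # in the `lspci -tv` tree output
    sudo lspci -tv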
Created attachment 303270 [details] output of `sudo lspci -vvv`
Created attachment 303273 [details] output of `lspci -tv`
Thanks. Unfortunately I'm not seeing anything of concern here. Could you possibly attach the full 'dmesg' that includes the "nvme nvme0: controller is down; will reset: CSTS=0xffffffff" error message?
Created attachment 303275 [details] output of `dmesg` after the drive failed
I'm now on kernel 6.0.9-060009-generic, but it still fails (as seen in the attached dmesg output).
With kernel 6.0.9 and the parameters "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off", the drive still fails. I think I'll try an older kernel (I remember that when I first bought the Samsung drive in July, it worked for long periods of time) and another OS (FreeBSD?). Could the source of the issue be hardware other than the NVMe drive? The motherboard, maybe? Unfortunately I don't have any other hardware combination to test with.
Same issue with kernel 5.11.0-27.
I've tried every possible value of PCIe ASPM mode in the BIOS (disabled, L0s, L1, L0sL1) but no change.
I've tried booting Windows from an HDD with the NVMe SSD attached as a secondary data drive, and the issue is pretty much the same as on Linux: the drive just disappears. There is probably another factor at play other than the Linux kernel. I'm about to throw in the towel on this one.
Yeah, I got nothing on this. The symptoms are definitely a link issue, but I'm out of ideas on what the kernel could try to mitigate it.
I've bought another drive, a SATA SSD this time. I guess we'll never know what was wrong. I'm closing this bug.