Bug 216709

Summary: NVME controller is down (Seagate FireCuda 530)
Product: IO/Storage
Reporter: Thomas Qmrd (thomas)
Component: NVMe
Assignee: IO/NVME Virtual Default Assignee (io_nvme)
Status: RESOLVED INSUFFICIENT_DATA
Severity: normal
CC: kbusch
Priority: P1
Hardware: i386
OS: Linux
Kernel Version: 5.15.0-53-generic
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
- output of `nvme id-ctrl /dev/nvme0`
- output of `smartctl -a /dev/nvme0`
- output of `modinfo nvme_core`
- output of `sudo lspci -vvv`
- output of `lspci -tv`
- output of `dmesg` after the drive failed

Description Thomas Qmrd 2022-11-19 13:57:03 UTC
Created attachment 303231 [details]
output of `nvme id-ctrl /dev/nvme0`

# Issue description
A few seconds after GNOME starts, my system completely freezes, then drops back to the console and repeatedly prints error messages like:
> EXT4-fs error (device nvme0n1p3): __ext4_find_entry:1635: inode #1309456:
> comm gmain: reading directory lblock #0
> EXT4-fs error (device nvme0n1p3): __ext4_find_entry:1635: inode #1315081:
> comm systemd-login: reading directory lblock #0

The only way out is to reboot.


# How to reproduce
Just boot normally, let GNOME start as usual, and wait a few seconds for the system to freeze.


# What I've tried (without much luck)
- I installed the OS on the drive to begin with, so the drive works long enough for everything to be copied and installed.
- This is the 3rd drive I'm testing (I tried two Samsung 970 EVO Plus 2TB drives before, and now a Seagate FireCuda 530 2TB).
- I've tried disabling APST by setting the kernel parameter nvme_core.default_ps_max_latency_us=0 (I've also tried various other values, starting with 5500 as I've seen recommended a few times); see the sketch after this list.
- Installing a different distribution (Manjaro).
- Starting in recovery mode and then resuming normal boot seems to postpone the freeze, but only by a few minutes; it still freezes.
- On the first two Samsung drives, I managed to install Windows. I did not try with the latest drive.
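
For reference, this is roughly how I set the APST parameter on my GRUB-based install (a minimal sketch; it needs a reboot to take effect):

    # In /etc/default/grub, append to GRUB_CMDLINE_LINUX_DEFAULT:
    #   nvme_core.default_ps_max_latency_us=0
    sudo update-grub    # regenerate the GRUB config, then reboot
    # After rebooting, confirm the parameter actually took effect:
    cat /proc/cmdline
    cat /sys/module/nvme_core/parameters/default_ps_max_latency_us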


# Software / hardware
- Linux Kernel: 5.15.0-53-generic
- Distribution: Zorin OS 16.2
- NVMe Drive: Seagate FireCuda 530 2TB
- CPU: AMD Ryzen 5700X


# Attached
- `nvme_id_ctrl.txt`: the output of `nvme id-ctrl /dev/nvme0`
- `smartctl.txt`: the output of `smartctl -a /dev/nvme0`
Comment 1 Thomas Qmrd 2022-11-19 13:57:27 UTC
Created attachment 303232 [details]
output of `smartctl -a /dev/nvme0`
Comment 2 Thomas Qmrd 2022-11-19 14:16:06 UTC
Created attachment 303233 [details]
output of `modinfo nvme_core`
Comment 3 Keith Busch 2022-11-19 20:59:10 UTC
Unfortunately, nothing here so far tells us what's going on. Would it be possible to boot off a different drive and recreate these ext4 errors with your nvme as a data mount point instead? If you can do that and attach a dmesg, we'll be better able to work out next steps.
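
Something along these lines should do it (just a sketch; adjust the device and mount point names to your setup, /mnt/nvmetest is only an example):

    # Booted from a different drive, with the suspect nvme attached as data:
    sudo mkdir -p /mnt/nvmetest
    sudo mount /dev/nvme0n1p3 /mnt/nvmetest
    # Generate some I/O until the errors show up, e.g.:
    ls -R /mnt/nvmetest > /dev/null
    # Then capture the kernel log:
    sudo dmesg > dmesg.txt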
Comment 4 Thomas Qmrd 2022-11-19 22:31:46 UTC
Another try, then: I've captured dmesg before the whole system freezes (still booting from the NVMe):
> [Sat Nov 19 23:22:36 2022] nvme nvme0: controller is down; will reset:
> CSTS=0xffffffff, PCI_STATUS=0xffff
> [Sat Nov 19 23:22:36 2022] blk_update_request: I/O error, dev nvme0n1,
> sector 152139544 op 0x0:(READ) flags 0x80700 phys_seg 24 prio class 0
> [Sat Nov 19 23:22:36 2022] blk_update_request: I/O error, dev nvme0n1,
> sector 152139752 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 0
Comment 5 Keith Busch 2022-11-19 22:43:03 UTC
This says the PCIe link is inaccessible; MMIO and config space reads both failed in this case.
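
You can see the same thing from userspace once the link drops: config space reads come back as all F's. For example (a sketch, with 04:00.0 standing in for the SSD's address from lspci):

    # A live device returns its real vendor ID; a dead link returns ffff:
    sudo setpci -s 04:00.0 VENDOR_ID
    # Hex dump of the standard config space header; all "ff" = link is gone:
    sudo lspci -s 04:00.0 -x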
Comment 6 Thomas Qmrd 2022-11-19 22:57:58 UTC
As per [1], I've tried the kernel parameter pcie_aspm=off, then nvme_core.default_ps_max_latency_us=0, and finally both together, but I get the same result in all three cases.


[1] https://lore.kernel.org/lkml/YnNeTsSzFJqEK%2Fs+@kbusch-mbp.dhcp.thefacebook.com/T/
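
For reference, a quick way to confirm both parameters are actually in effect on the running system (a sketch; 04:00.0 is my SSD's PCI address):

    cat /proc/cmdline    # both parameters should appear here
    # LnkCtl shows whether ASPM is really disabled on the SSD's link:
    sudo lspci -vvv -s 04:00.0 | grep -E 'LnkCap|LnkCtl|LnkSta'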
Comment 7 Keith Busch 2022-11-19 23:23:13 UTC
If neither of those parameters helped, then we're in territory below the visibility of the NVMe protocol level. This smells like a platform-specific power quirk, so we'll probably need some more details on that. Is your machine x86 based?

I'll think on this over the weekend, but you might have better luck if you report this via email to the linux-pci@vger.kernel.org and linux-nvme@lists.infradead.org lists instead.
Comment 8 Keith Busch 2022-11-19 23:28:02 UTC
(In reply to Keith Busch from comment #7)
> Is your machine x86 based?

Oh duh, you already answered that... :)

"CPU :  AMD Ryzen 5700X"

So maybe linux-acpi@vger.kernel.org too.
Comment 9 Thomas Qmrd 2022-11-20 13:32:13 UTC
I've tried what you proposed: booting from another drive and mounting the NVMe drive as data. The issue appears pretty quickly there too. Here is the dmesg output:

> [  281.797947] blk_update_request: I/O error, dev nvme0n1, sector 2786568960
> op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
> [  281.797972] Buffer I/O error on dev nvme0n1p3, logical block 308281696,
> lost async page write
> [  281.850852] FAT-fs (nvme0n1p4): unable to read boot sector to mark fs as
> dirty
> [  343.901432] EXT4-fs warning (device nvme0n1p3):
> htree_dirblock_to_tree:1067: inode #77070337: lblock 0: comm ls: error -5
> reading directory block
> [  343.902354] EXT4-fs error (device nvme0n1p3): __ext4_find_entry:1658:
> inode #77070337: comm test-nvme-write: reading directory lblock 0
> [  350.028540] Aborting journal on device nvme0n1p3-8.
> [  350.028548] Buffer I/O error on dev nvme0n1p3, logical block 223903744,
> lost sync page write
> [  350.028554] JBD2: Error -5 detected when updating journal superblock for
> nvme0n1p3-8.

I'm going to try reporting this issue to the mailing lists you suggested.
Comment 10 Thomas Qmrd 2022-11-20 13:36:27 UTC
Whoops, sorry, the log got truncated. Here is the full relevant portion:

> [  281.692677] nvme nvme0: controller is down; will reset: CSTS=0xffffffff,
> PCI_STATUS=0x10
> [  281.778102] nvme 0000:04:00.0: enabling device (0000 -> 0002)
> [  281.778436] nvme nvme0: Removing after probe failure status: -19
> [  281.797929] nvme0n1: detected capacity change from 3907029168 to 0
> [  281.797947] blk_update_request: I/O error, dev nvme0n1, sector 2786568960
> op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
> [  281.797972] Buffer I/O error on dev nvme0n1p3, logical block 308281696,
> lost async page write
> [  281.850852] FAT-fs (nvme0n1p4): unable to read boot sector to mark fs as
> dirty
> [  343.901432] EXT4-fs warning (device nvme0n1p3):
> htree_dirblock_to_tree:1067: inode #77070337: lblock 0: comm ls: error -5
> reading directory block
> [  343.902354] EXT4-fs error (device nvme0n1p3): __ext4_find_entry:1658:
> inode #77070337: comm test-nvme-write: reading directory lblock 0
> [  350.028540] Aborting journal on device nvme0n1p3-8.
> [  350.028548] Buffer I/O error on dev nvme0n1p3, logical block 223903744,
> lost sync page write
> [  350.028554] JBD2: Error -5 detected when updating journal superblock for
> nvme0n1p3-8.
Comment 11 Keith Busch 2022-11-21 17:55:09 UTC
Still showing an unusable link: MMIO returns all F's.
Comment 12 Keith Busch 2022-11-21 21:37:32 UTC
Could you attach output from 'sudo lspci -vvv' and 'lspci -tv'?
Comment 13 Thomas Qmrd 2022-11-22 20:10:32 UTC
Created attachment 303270 [details]
output of `sudo lspci -vvv`
Comment 14 Thomas Qmrd 2022-11-22 20:11:26 UTC
Created attachment 303273 [details]
output `lspci -tv`
Comment 15 Keith Busch 2022-11-22 20:58:29 UTC
Thanks. Unfortunately I'm not seeing anything of concern here. Could you possibly attach the full 'dmesg' that includes the "nvme nvme0: controller is down; will reset: CSTS=0xffffffff" error message?
Comment 16 Thomas Qmrd 2022-11-22 21:31:17 UTC
Created attachment 303275 [details]
output of `dmesg` after the drive failed
Comment 17 Thomas Qmrd 2022-11-22 21:32:05 UTC
I'm now on kernel 6.0.9-060009-generic, but it still fails (as seen in the attached dmesg output).
Comment 18 Thomas Qmrd 2022-11-22 21:53:52 UTC
With kernel 6.0.9 and the parameters "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off", the drive still fails.


I think I'll try an older kernel (I remember that when I first bought the Samsung drive in July, it worked for long periods of time) and another OS (FreeBSD?).

Could the source of the issue be in hardware other than the NVMe drive itself? The motherboard, maybe? Unfortunately I don't have any other hardware combination to test with.
Comment 19 Thomas Qmrd 2022-11-22 22:07:54 UTC
Same issue with kernel 5.11.0-27.
Comment 20 Thomas Qmrd 2022-11-22 22:28:05 UTC
I've tried every possible PCIe ASPM mode in the BIOS (disabled, L0s, L1, L0sL1), but no change.
Comment 21 Thomas Qmrd 2022-11-27 14:18:36 UTC
I've tried booting Windows from an HDD with the NVMe SSD attached as a secondary data drive, and the issue is pretty much the same as on Linux: the drive just disappears. So there is probably another factor at play beyond the Linux kernel.

I'm about to throw in the towel on this one.
Comment 22 Keith Busch 2022-11-28 17:33:04 UTC
Yeah, I got nothing on this. The symptoms are definitely a link issue, but I'm out of ideas on what the kernel could try to mitigate it.
Comment 23 Thomas Qmrd 2022-12-09 06:16:37 UTC
I've bought another drive, a SATA SSD this time. I guess we'll never know what was wrong. I'm closing this bug.