Bug 216809

Summary:	nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting
Product:	IO/Storage	Reporter:	jfhart085
Component:	NVMe	Assignee:	IO/NVME Virtual Default Assignee (io_nvme)
Status:	NEW ---
Severity:	blocking	CC:	andreas.thalhammer, iccfish, kbusch, konoha02, nazar, rev, rrd, tzink
Priority:	P1
Hardware:	All
OS:	Linux
Kernel Version:	6.1	Subsystem:
Regression:	No	Bisected commit-id:
Attachments:	nvme drive and controller specs; iostat on nvme; Samsung 990 Pro 4TB disconnects under load with `nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off`

Description jfhart085 2022-12-14 11:35:51 UTC

I am attempting to load an NVMe device (nvme0n1) for use as the main system drive, using the following command:

rsync -axvH /. --exclude=/lost+found --exclude=/var/log.bu --exclude=/usr/var/log.bu --exclude=/usr/X11R6/var/log.bu --exclude=/home/jhart/.cache/mozilla/firefox/are7uokl.default-release/cache2.bu --exclude=/home/jhart/.cache/thunderbird/7zsnqnss.default/cache2.bu /mnt/root_new 2>&1 | tee root.log

The I/O quickly hangs, and the drive cannot be unmounted without a reboot.

dmesg reports the following: {
[Dec14 19:24] nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting
[Dec14 19:25] nvme nvme0: I/O 0 QID 1 timeout, reset controller
[ +30.719985] nvme nvme0: I/O 8 QID 0 timeout, reset controller
[Dec14 19:28] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
[  +0.031803] nvme nvme0: Abort status: 0x371
[Dec14 19:30] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
[  +0.000019] nvme nvme0: Removing after probe failure status: -19
}
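
For anyone collecting these messages across several boots, the timeout lines above can be picked apart mechanically. The following is a minimal sketch, assuming the older message format shown in this log (kernels before 6.8 in this thread); the function name `parse_timeouts` and the regular expression are illustrative and not part of any kernel tooling. In NVMe, queue ID 0 is the admin queue, which is why the sketch labels it separately.

```python
import re

# Pattern for the message format shown above, e.g.
# "nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting"
# The "(I/O Cmd)" part is optional because the reset lines omit it.
TIMEOUT_RE = re.compile(
    r"nvme (?P<ctrl>nvme\d+): I/O (?P<tag>\d+)(?: \([^)]*\))? "
    r"QID (?P<qid>\d+) timeout, (?P<action>aborting|reset controller)"
)

def parse_timeouts(log_text):
    """Return one dict per timeout line found in log_text."""
    events = []
    for line in log_text.splitlines():
        m = TIMEOUT_RE.search(line)
        if m:
            events.append({
                "ctrl": m.group("ctrl"),
                "tag": int(m.group("tag")),
                "qid": int(m.group("qid")),
                "action": m.group("action"),
            })
    return events

sample = """\
[Dec14 19:24] nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting
[Dec14 19:25] nvme nvme0: I/O 0 QID 1 timeout, reset controller
[ +30.719985] nvme nvme0: I/O 8 QID 0 timeout, reset controller
[Dec14 19:28] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
"""

for event in parse_timeouts(sample):
    # QID 0 is the admin queue; every other QID is an I/O queue.
    kind = "admin queue" if event["qid"] == 0 else "I/O queue"
    print(event["ctrl"], kind, event["qid"], "tag", event["tag"], event["action"])
```

Lines that are not timeout messages, such as "Device not ready", are simply skipped.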

Linux DellXPS 6.1.0 #1 SMP Tue Dec 13 21:48:51 JST 2022 x86_64 GNU/Linux
custom distribution built entirely from source

Comment 1 jfhart085 2022-12-14 11:59:53 UTC

Created attachment 303406 [details]
nvme drive and controller specs

Comment 2 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-12-15 06:50:46 UTC

Did this work on earlier kernel version? If yes, which was the latest version that worked?

Comment 3 jfhart085 2022-12-15 06:55:45 UTC

I have just recently started trying to use NVME.  I originally tried with linux-57 and had similar failures, so I upgraded to 6.1 and still had them.  The report I give here is for linux kernel 6.1.

Comment 4 jfhart085 2022-12-15 06:56:32 UTC

that should read linux 5.7.  My apologies for the finger stutter.

Comment 5 jfhart085 2022-12-15 06:57:54 UTC

To clarify: I have not tried NVME with anything earlier than linux-5.7

Comment 6 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-12-15 07:03:09 UTC

thx for clarifying, then it's not a regression I should track and something for the regular developers. If you don't get a reply from them here, report this to the appropriate maintainers and subsystem list mentioned in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/MAINTAINERS

Comment 7 Nelson G 2023-10-07 02:32:59 UTC

oct 06 20:46:04 debian kernel: nvme nvme0: I/O 648 (I/O Cmd) QID 4 timeout, aborting
oct 06 20:47:23 debian kernel: nvme nvme0: I/O 0 (I/O Cmd) QID 6 timeout, aborting
oct 06 20:47:23 debian kernel: nvme nvme0: I/O 681 QID 3 timeout, reset controller
oct 06 20:47:23 debian kernel: nvme nvme0: I/O 8 QID 0 timeout, reset controller
oct 06 20:47:23 debian kernel: nvme nvme0: Abort status: 0x371
oct 06 20:47:23 debian kernel: nvme nvme0: Abort status: 0x371
oct 06 20:47:23 debian kernel: nvme nvme0: Abort status: 0x371
oct 06 20:47:23 debian kernel: nvme nvme0: Abort status: 0x371
oct 06 20:47:23 debian kernel: nvme nvme0: Abort status: 0x371
oct 06 20:47:23 debian kernel: nvme nvme0: 15/0/0 default/read/poll queues
oct 06 20:47:23 debian kernel: nvme nvme0: Ignoring bogus Namespace Identifiers


I got this error today.
I powered on my machine (cold boot) and after login to my xfce session everything except the cursor got frozen, nothing responded (not even the clock).  I could go into tty and ctrl alt del from there.  It rebooted after a minute or so.

Comment 8 Andreas 2023-10-07 06:44:14 UTC

@Nelson G, Comment 7
Which version of Debian are you running? Which kernel version exactly?

On the shell, can you share the (selective) output of the following commands:
# cat /etc/debian_version
# cat /etc/*release
# uname -a
# hostnamectl

Also, if installed, (selective) output about the system would be nice:
# dmidecode -t system

You might want to _not_ paste the serial numbers and UUIDs, hence "selective".

Comment 9 Nelson G 2023-10-09 22:42:18 UTC

oops, sorry, here:
debian 12.1 
linux 6.1.52-1 x86-64

Thinkpad e495 

# dmidecode 3.4
Getting SMBIOS data from sysfs.
SMBIOS 3.1.1 present.

Handle 0x000F, DMI type 1, 27 bytes
System Information
	Manufacturer: LENOVO
	Product Name: 20NECTO1WW
	Version: ThinkPad E495
	Serial Number: blank
	UUID: blank
	Wake-up Type: Power Switch
	SKU Number: LENOVO_MT_20NE_BU_Think_FM_ThinkPad E495
	Family: ThinkPad E495

Handle 0x0036, DMI type 15, 31 bytes
System Event Log
	Area Length: 1346 bytes
	Header Start Offset: 0x0000
	Header Length: 16 bytes
	Data Start Offset: 0x0010
	Access Method: General-purpose non-volatile data functions
	Access Address: 0x00F0
	Status: Valid, Not Full
	Change Token: 0x00000053
	Header Format: Type 1
	Supported Log Type Descriptors: 4
	Descriptor 1: POST error
	Data Format 1: POST results bitmap
	Descriptor 2: PCI system error
	Data Format 2: None
	Descriptor 3: System reconfigured
	Data Format 3: None
	Descriptor 4: Log area reset/cleared
	Data Format 4: None

Comment 10 Andreas 2023-10-10 03:50:30 UTC

Can you say with certainty that the NVMe SSD itself isn't faulty, causing the errors?

Did you check the SMART (and other) logs, e.g. with smartctl?

For reference, my system:
Hardware specs:
  Product: LENOVO Legion 5 Pro 16ACH6H 82JQ
  SKU Number: LENOVO_MT_82JQ_BU_idea_FM_Legion 5 Pro 16ACH6H
  Family: Legion 5 Pro 16ACH6H
  NVMe PCIe SSD (upgraded, not original): Crucial P1 CT2000P1SSD8
Software specs:
  Gentoo Linux
  Kernel 6.5.6-gentoo x86_64 AMD Ryzen 7 5800H (custom kernel, my own .config)

E.g. this is from my system, without the errors you describe:

localhost ~ # smartctl -H /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.5.6-gentoo-L5P] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

localhost ~ # smartctl -l error /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.5.6-gentoo-L5P] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
Error Information (NVMe Log 0x01, 16 of 256 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0       1937     0  0x0014  0x4005      -            0     1     -

localhost ~ # nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 44 °C (317 K)
available_spare                         : 100%
available_spare_threshold               : 5%
percentage_used                         : 3%
endurance group critical warning summary: 0
Data Units Read                         : 273.156.408 (139,86 TB)
Data Units Written                      : 69.561.559 (35,62 TB)
host_read_commands                      : 9.941.155.688
host_write_commands                     : 1.220.956.942
controller_busy_time                    : 22.719
power_cycles                            : 1.549
power_on_hours                          : 8.511
unsafe_shutdowns                        : 96
media_errors                            : 0
num_err_log_entries                     : 1.937
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Temperature Sensor 1           : 44 °C (317 K)
Thermal Management T1 Trans Count       : 17
Thermal Management T2 Trans Count       : 4
Thermal Management T1 Total Time        : 17593
Thermal Management T2 Total Time        : 9

Comment 11 Nelson G 2023-10-12 21:09:17 UTC

Well, I don't think my SSD is the culprit because it is in a bad state, but rather because its Linux support is as bad as that of my ThinkPad's BIOS OEM team.  They simply don't care about Linux when you ask for help/support about this bug, or worse, a critical bug/implementation. They act as if the error is using Linux per se.
Anyway:

$ smartctl -H /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-13-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

$ smartctl -l error /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-13-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

$ sudo nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning			: 0
temperature				: 38°C (311 Kelvin)
available_spare				: 100%
available_spare_threshold		: 10%
percentage_used				: 3%
endurance group critical warning summary: 0
Data Units Read				: 45.442.041 (23,27 TB)
Data Units Written			: 26.846.180 (13,75 TB)
host_read_commands			: 462.473.222
host_write_commands			: 320.013.095
controller_busy_time			: 25.241
power_cycles				: 7.142
power_on_hours				: 11.836
unsafe_shutdowns			: 415
media_errors				: 10
num_err_log_entries			: 10
Warning Temperature Time		: 189
Critical Composite Temperature Time	: 1  —very old record :)—
Thermal Management T1 Trans Count	: 0
Thermal Management T2 Trans Count	: 0
Thermal Management T1 Total Time	: 0
Thermal Management T2 Total Time	: 0



Oh btw,  I reported this warning some time ago https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=990510

[    1.390941] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0xffffc000 flags=0x0000]
[    1.390976] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0xffffc080 flags=0x0000]


I tend to get it constantly, as the very first boot message (even before prompting me to luks)  
The day I got the bug we're reporting here, I also got that one,  just sharing in case anyone recognizes that one.

Comment 12 Andreas 2023-10-13 14:21:44 UTC

I tend to think that bug fixes take time, especially if the developers don't experience those issues for themselves, on their own hardware. E.g. bug #202665 got fixed >1 yr after it was initially reported.

So anyway, you should definitely get the newest kernel and see if the issue is still there, because that's what the developers will ask you to do anyhow. Also, with the sources at your disposal, you can apply any patches they throw at you for testing.

The basics are:
https://www.wikihow.com/Compile-the-Linux-Kernel

1. Get the source. (Apply patches if/as necessary).
2. Compile the kernel and the modules, then install it.
3. Boot that kernel. (Probably involves configuring a boot manager.)

How you would do all that depends on the distribution you use.
For Debian, read this:
https://www.debian.org/doc//manuals/debian-handbook/sect.kernel-compilation.pl.html
E.g. for Ubuntu, you'd read this:
https://wiki.ubuntu.com/Kernel/BuildYourOwnKernel

I know, this seems like a lot of work, but it's the only way I know of...

Comment 13 Keith Busch 2023-10-13 14:25:26 UTC

Well, I don't think there's any indication that there's a kernel bug of any kind here. Devices that don't complete commands will exceed the kernel's tolerance for timeouts, and there's nothing the kernel can do about it.

Comment 14 Andreas 2023-10-13 20:12:23 UTC

Is there a way to tweak the timeout, to adjust it to a higher value? As I understand it, the kernel resets the controller after the timeout is reached, but how certain is it that the command wouldn't have succeeded if given more time?

Comment 15 Keith Busch 2023-10-13 20:41:35 UTC

Not sure if more time will help here. The 'iostat' output I suggested on the mailing list should indicate if waiting longer will help or not.

You can change the default with kernel parameter 'nvme_core.io_timeout=<time-in-seconds>'. Default is 30 seconds.
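
To confirm which value is actually in effect after changing the boot parameter, the running value can be read back. This is a minimal sketch, assuming the parameter is exposed at `/sys/module/nvme_core/parameters/io_timeout` (the usual sysfs location for module parameters; treat the path as an assumption for your kernel). The function name `read_io_timeout` is illustrative.

```python
from pathlib import Path

# Assumed sysfs location of the module parameter on a running system.
IO_TIMEOUT_PARAM = Path("/sys/module/nvme_core/parameters/io_timeout")

def read_io_timeout(param_path=IO_TIMEOUT_PARAM):
    """Return the I/O timeout in seconds, or None if it cannot be read."""
    try:
        return int(Path(param_path).read_text().strip())
    except (OSError, ValueError):
        return None

timeout = read_io_timeout()
if timeout is None:
    print("nvme_core io_timeout not readable on this system")
else:
    print(f"nvme_core io_timeout = {timeout} s")
```

The function returns None rather than raising, so it can also be run on machines where the nvme_core module is not loaded.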

Comment 16 Andreas 2023-10-13 22:25:11 UTC

@Keith Busch, thank you very much.

To get iostat, install sysstat. (On Debian: "sudo apt install sysstat")

My system is working and doesn't have those issues, for reference:

nvme0 contains my main Gentoo Linux system
nvme1 is just mostly idle all the time

localhost ~ # dmesg | grep nvme
[    7.269597] nvme 0000:02:00.0: platform quirk: setting simple suspend
[    7.269615] nvme 0000:05:00.0: platform quirk: setting simple suspend
[    7.269710] nvme nvme1: pci function 0000:05:00.0
[    7.269713] nvme nvme0: pci function 0000:02:00.0
[    7.279969] nvme nvme1: missing or invalid SUBNQN field.
[    7.280233] nvme nvme1: Shutdown timeout set to 8 seconds
[    7.297803] nvme nvme1: 16/0/0 default/read/poll queues
[    7.302096]  nvme1n1: p1 p2 p3 p4
[    7.303998] nvme nvme0: 15/0/0 default/read/poll queues
[    7.311543]  nvme0n1: p1 p2 p3 p4 p5 p6 p7 p8

localhost ~ # iostat -x nvme0n1
Linux 6.5.7-gentoo-L5P (localhost.localdomain)  2023-10-14      _x86_64_        (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,65    0,00    0,36    0,05    0,00   98,94

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme0n1         21,81    684,75     0,01   0,07    1,31    31,40   11,16    366,46     0,15   1,37    4,81    32,84   11,96  12172,93     0,00   0,00    0,22  1017,38    0,21    0,44    0,08   1,28

localhost ~ # iostat -x nvme1n1
Linux 6.5.7-gentoo-L5P (localhost.localdomain)  2023-10-14      _x86_64_        (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,65    0,00    0,36    0,05    0,00   98,93

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme1n1          1,03      5,39     0,00   0,00    0,04     5,23    0,00      0,00     0,00   0,00    0,00     0,00    0,00      0,00     0,00   0,00    0,00     0,00    0,00    0,00    0,00   0,01

@Nelson G, could you also provide your iostat output, please?
Also, try adding nvme_core.io_timeout=60 to your kernel commandline and see if that changes anything. Also, try what Keith Busch suggested on the mailing list:

----snap---
If the device is responding very slowly, suggestions I might have
include:

  a. Enable discards if you've disabled them
  b. Disable discards if you've enabled them
  c. If not already enabled, use an io scheduler, like mq-deadline or kyber
----snap---
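
For suggestion (c), it helps to know which I/O scheduler is active before changing anything. The sketch below assumes the usual sysfs layout, where `/sys/block/<device>/queue/scheduler` lists the available schedulers with the active one in square brackets; the function names are illustrative.

```python
from pathlib import Path

def parse_scheduler(text):
    """Split sysfs scheduler text into (active, available)."""
    active = None
    available = []
    for name in text.split():
        # The active scheduler is the entry wrapped in brackets.
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available

def read_scheduler(device="nvme0n1"):
    """Read the scheduler list for a block device, or (None, []) on failure."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    try:
        return parse_scheduler(path.read_text())
    except OSError:
        return None, []

print(parse_scheduler("[none] mq-deadline kyber"))
# ('none', ['none', 'mq-deadline', 'kyber'])
```

An active scheduler of "none" would mean suggestion (c) has not been tried yet on that device.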

Thanks.

Comment 17 Nelson G 2023-10-14 03:10:46 UTC

Here:
iostat system defaults,  both output of warm boot and like 10 minutes of usage:

Linux 6.1.0-13-amd64 (debian) 	13/10/23 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7,53    0,00    8,96    2,09    0,00   81,42

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme0n1        605,22  18567,04    59,70   8,98    0,32    30,68   34,25    337,62     0,11   0,33    0,26     9,86    0,00      0,00     0,00   0,00    0,00     0,00    1,32    0,13    0,20  24,69

Linux 6.1.0-13-amd64 (debian) 	13/10/23 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2,07    0,01    1,81    0,46    0,00   95,65

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme0n1        132,05   4217,31    16,53  11,13    0,64    31,94    5,87     51,63     0,02   0,34    0,23     8,80    0,00      0,00     0,00   0,00    0,00     0,00    0,29    0,11    0,09   4,12


iostat WITH nvme_core.io_timeout=60, both warm boot and like 10 minutes of usage (again):

Linux 6.1.0-13-amd64 (debian) 	13/10/23 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7,21    0,00    7,36    1,69    0,00   83,74

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme0n1        818,44  19837,90    44,75   5,18    0,62    24,24   22,52    213,35     0,02   0,08    0,27     9,47    0,00      0,00     0,00   0,00    0,00     0,00    0,86    0,13    0,51  21,85

Linux 6.1.0-13-amd64 (debian) 	13/10/23 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1,73    0,00    1,06    0,17    0,00   97,04

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme0n1         47,16   1470,21     5,82  10,98    0,70    31,17    3,21     37,63     0,02   0,76    0,09    11,71    0,05      4,11     0,00   0,00    0,29    78,79    0,23    0,13    0,03   1,51


Off topic:
Yeah, I suppose you're right about bugs getting fixed once a developer experiences them.  To keep things short I'll just say I've had enough bad experiences with Lenovo.  And I've been told already that my issues are 'hard' to get "fixed" (they're fixed already by AMD but not implemented in the BIOS) because, well, Microsoft Windows is the main OS and not Linux. So no Linux-related fixes in the BIOS for me https://bugzilla.kernel.org/show_bug.cgi?id=216161
As for the SSD, it's cheap, the average temperatures are ridiculously high, and the speed rates are mediocre.  But it gets the job done for a boot drive (system and config files are there to keep things faster than an HDD). My gf's computer has a Samsung, and oh boy, that drive runs cooler and is faster (same storage capacity btw, 256gb lol).
So yeah, I still don't expect much of it.  Like, no dmesg warnings. Or the bug we're reporting (which only happened to me on very rare occasions).  At least it rarely fails on scary ways like 'nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting'
And thanks for the advice of the kernel update!  I use both debian and arch,  and the 'fix' for https://bugzilla.kernel.org/show_bug.cgi?id=202665 was supposed to be on 6.1 and beyond, although,  knowing that ADATA told users with that bug to use windows, and that it seems to be vendor related (not entirely of course, maybe common nvme controller related), i prefer to just reboot once I see that warning.

Comment 18 Andreas 2023-10-14 18:15:25 UTC

You'd have to check, but maybe those error messages aren't really affecting your system at all. I would suggest to test this by running your system very very long (>1-2 days), no reboots, and see if it remains stable. Especially with the Arch system, as it always uses the newest kernel.

Also, if you're up to it, you could try switching (NVMe) SSDs and see if the fault is gone with another SSD.

@jfhart085@gmail.com,
the original reporter: if you could also provide the requested info, this bug could be closed. Since you have a Kingston SSD, it could be different. Also, a Dell XPS isn't supposed to be a cheap system.

Something like:
# smartctl -H /dev/nvme0n1
# smartctl -l error /dev/nvme0n1
# nvme smart-log /dev/nvme0n1
# iostat -x nvme0n1

@Nelson G
Off Topic:

I am very happy with my Legion 5 Pro from Lenovo. It is running very well with Linux, with the exception that I have to boot into Windows for BIOS updates. Some people also reported issues with dual "switcheroo" graphics on Linux, but since I'm only using the AMD Radeon integrated graphics on Linux (haven't even loaded the closed-source Nvidia drivers) and I specifically bought this laptop BECAUSE it has a dedicated Nvidia card -- for me only on Windows, only for gaming --, I didn't have those (now fixed, as I heard) Nvidia/AMD switcheroo issues. But that's the whole purpose of the Legion 5 Pro anyhow: Gaming! Not Linux, but if you know what you're doing, it works very very well.

Lenovo has some dedicated Linux laptops, like certain ThinkPad series. Even in the BIOS/UEFI settings, there's a "Windows" or "Linux" OS preset on them, so I guess it all depends on what you're looking for.

Anyway: Have you reported your issues/findings to the Lenovo support forums? They have a guy there, MarkRHPearson, acting as a Linux liaison, and he speaks to the company (and the technicians, e.g. for BIOS updates) on behalf of us Linux users.

https://forums.lenovo.com/t5/Linux-Operating-Systems/ct-p/lx_en

That said, it may well be that there's nothing that can be done. Maybe under Windows your SSD has the same issues, but Windows doesn't directly report them to you -- it may just be the specific SSD you're using...

Comment 19 Keith Busch 2023-10-18 15:56:35 UTC

The way to run iostat as needed for this issue is "iostat -x /dev/nvme0n1 1".

Notice the "1" at the end, as in to repeat the stat once every second so that we can see what's changed at each time interval.

If the device is hopelessly broken, I expect at some point the util will approach 100%, r/s and w/s to go to 0, and aqu-sz to maintain a relatively large non-zero value.

If the device is simply unable to keep up with the workload you're demanding out of it but still responsive in a slower capacity, then 'r/s' and 'w/s' should be non-zero, but show an obvious drop from prior to observing timeouts.
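
The two cases described above can be turned into a rough rule of thumb for reading each one-second iostat sample. This is only a sketch: the function name and all thresholds (95% utilisation, a queue size of 1, a drop to half the earlier rate) are assumptions chosen for illustration, not values given in this thread.

```python
def classify_sample(r_s, w_s, aqu_sz, util, baseline_iops=None,
                    util_busy=95.0, queue_stuck=1.0, drop_ratio=0.5):
    """Classify one `iostat -x` sample for a device.

    r_s, w_s, aqu_sz and util are the r/s, w/s, aqu-sz and %util columns.
    baseline_iops is r/s + w/s from before the timeouts, if known.
    All thresholds are illustrative assumptions.
    """
    iops = r_s + w_s
    # Nothing completes, yet the device is busy with commands queued.
    if iops == 0 and util >= util_busy and aqu_sz >= queue_stuck:
        return "stalled"
    # Still completing commands, but clearly slower than before.
    if baseline_iops and 0 < iops < drop_ratio * baseline_iops:
        return "slow"
    return "ok"

# Sample from comment 17 (r/s, w/s, aqu-sz, %util).
print(classify_sample(605.22, 34.25, 0.20, 24.69))
# Hypothetical stalled device: no completions, full utilisation, deep queue.
print(classify_sample(0.0, 0.0, 32.0, 100.0))
```

Only the first call uses numbers from this report; the second uses made-up values to show the "hopelessly broken" pattern.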

Comment 20 jfhart085 2023-10-30 13:05:46 UTC

I no longer have the Kingston SSD.  I replaced it with a Samsung unit.
I was able to resolve the issue as follows:

> On 12/19/22 11:41 PM, Keith Busch wrote:
>
>>
>>> MaxPayload 128 bytes, MaxReadReq 512 bytes
>>>                  DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+
>>> AuxPwr+ TransPend-
>>>                  LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1,
>>> Latency L0 <1us, L1 <8us
>>>                          ClockPM+ Surprise- LLActRep- BwNot-
>>>                  LnkCtl: ASPM Disabled; RCB 64 bytes Disabled-
>>> Retrain- CommClk+
>>>                          ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>>>                  LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train-
>>> SlotClk+ DLActive- BWMgmt- ABWMgmt-
>
>
>> Something seems off if it's downtraining to Gen1 x1. I believe this
>> setup should be capable of Gen2 x4. It sounds like the links among these
>> components may not be reliable.
>
> This was a useful comment, and led me on to some further research. The
> system in question here (a rather older Dell) was equipped with the
> following PCI slots:
>
>             slot  pins
>            ----  ----
> PCI      -    6   124 available
> PCI      -    5   124 blocked by RX580
> PCIe x16 -    4   164 secondary graphics, RX580
> PCIe x8  -    3    98 NVME controller
> PCIe x1  -    2    36 TV card
> PCIe x16 -    1   164 primary graphics, available
>
> It also uses an Nvidia chipset for the PCI bridge.  It turns out that
> back in 2008 Dell had custom modifications made to a standard chip set
> configuration normally supplied by Nvidia. This enabled the
> motherboard to handle up to two graphics cards at the same time while
> minimizing cost.  The trade off apparently was that the two x16 slots
> actually only have 8 data lanes each, and the x8 slot (slot 3) only
> one lane. I had the NVME controller in the x8 slot, expecting the
> controller to negotiate for 4 lanes, not realizing the single lane
> restriction.  I moved the controller to the unused x16 slot (slot 1)
> since I have only a single card in the machine.  It is now operating
> as expected without further difficulty.
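
The downtraining that led to this fix can be spotted by comparing the LnkCap and LnkSta lines of `lspci -vv`. Below is a minimal sketch, assuming the line format quoted above; the function names are illustrative. Note that LnkCap describes what the device itself can do, so a slot with fewer lanes (as in this machine) shows up as a status below capability.

```python
import re

# Matches e.g. "LnkCap: Port #0, Speed 8GT/s, Width x4" and
# "LnkSta: Speed 2.5GT/s, Width x1"; tolerates extra text after the speed.
LINK_RE = re.compile(r"(LnkCap|LnkSta):.*?Speed ([\d.]+)GT/s[^,]*, Width x(\d+)")

def parse_links(lspci_text):
    """Return {'LnkCap': (speed_gts, width), 'LnkSta': (speed_gts, width)}."""
    links = {}
    for name, speed, width in LINK_RE.findall(lspci_text):
        links[name] = (float(speed), int(width))
    return links

def is_downtrained(links):
    """True if the negotiated link is slower or narrower than the capability."""
    cap, sta = links["LnkCap"], links["LnkSta"]
    return sta[0] < cap[0] or sta[1] < cap[1]

sample = """\
LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1,
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled-
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train-
"""

links = parse_links(sample)
print(links)                      # {'LnkCap': (8.0, 4), 'LnkSta': (2.5, 1)}
print("downtrained:", is_downtrained(links))  # downtrained: True
```

A downtrained link is not proof of a fault on its own, but it is a cheap first check before suspecting the drive or the driver.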

Comment 21 Andreas 2023-10-30 13:18:13 UTC

Thank you so much, @jfhart085!

@Nelson G,
Maybe this helps you as well? Does your laptop have two different PCIe NVMe slots? If so, can you switch the SSDs?

And please test your system for an extended period of time, run the iostat command like Keith Busch requested in Comment 19 (on a separate virtual console), then post its output when you notice the problem.

Comment 22 Nelson G 2023-10-31 19:18:21 UTC

Mine only has one PCIe NVMe slot.

Alright, I will try to run iostat for an extended period of time. It's been running for at least 10 min now and there doesn't seem to be anything unusual in the output; I will share it in a minute.

I am currently using the 'nvme_core.default_ps_max_latency_us=0' boot parameter because I sadly got this bug once again a few days ago

journald=
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 83 (I/O Cmd) QID 2 timeout, aborting
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 453 (I/O Cmd) QID 8 timeout, aborting
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 454 (I/O Cmd) QID 8 timeout, aborting
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 898 (I/O Cmd) QID 1 timeout, aborting
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 899 (I/O Cmd) QID 1 timeout, aborting
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 0 QID 0 timeout, reset controller
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 83 QID 2 timeout, reset controller
oct 29 22:16:46 debian kernel: nvme nvme0: Abort status: 0x371
oct 29 22:16:46 debian kernel: nvme nvme0: Abort status: 0x371
oct 29 22:16:46 debian kernel: nvme nvme0: Abort status: 0x371
oct 29 22:16:46 debian kernel: nvme nvme0: Abort status: 0x371
oct 29 22:16:46 debian kernel: nvme nvme0: Abort status: 0x371
oct 29 22:16:46 debian kernel: nvme nvme0: 15/0/0 default/read/poll queues
oct 29 22:16:46 debian kernel: nvme nvme0: Ignoring bogus Namespace Identifiers


had to reboot because I needed the system working correctly asap.   But next time i will run iostat and dmesg and save the output (if possible because, everything is slow/unresponsive)
I was reading the Arch Linux forums and found that boot parameter to be 'useful' for keeping the system from hitting that bug at boot (I've only ever gotten that bug during the boot process, never after DAYS of the system running.  I'd say I've reproduced this bug about 1 time out of 100 boots; I really don't know what triggers it).

In a few hours I will run iostat again with the default boot parameters and a more exhaustive nvme activity.

Comment 23 Nelson G 2023-10-31 19:21:06 UTC

Created attachment 305345 [details]
iostat on nvme

nvme_core.default_ps_max_latency_us=0 boot parameter being used,  not so much happening on that ssd during that test.

Comment 24 Andreas 2023-10-31 19:36:13 UTC

With this last information, logically, there are two actions required:

#1 the devs need an iostat output when the error occurs, not when the system is running as it should.

#2 if the dmesg errors only occur right at boot time, but don't occur at all while the system is running, this might be a rare case that could be caused and/or addressed by the kernel, and therefore could also be fixed in the code.

Issue #1 is for @Nelson G
Issue #2 is for the devs, @Keith Busch

Comment 25 Rev 2024-03-12 13:12:22 UTC

This bug is still recent. I get it on mainline 6.7(.9), and had it on previous kernels (so long ago, I don't even remember what kernel) for ubuntu 22.04.4 LTS. 

Strangely it was better on some of the previous kernel series. 

But now (with 6.7) I get system hangs with the same log as the initial bug reporter. 

And it gets as bad as system crashes, because no I/O is possible any more. The system just stops working, because the OS cannot access its boot drive (512 GB LVM2, Adata SX6000LNP with most recent firmware) any more. And of course, there are no logs because of that.

First I thought it must be my drive, but checks give the nvme drive a clean slate.

I can even trigger the bug by loading video files and jumping around in time (skipping) while watching. Sooner or later the system will hang with the OP's error. And later the system will lose I/O to the boot drive and I have to manually power off the hardware and cold boot it (no shutdown / reboot possible, no other working environment possible because I/O is not possible).

That makes my beloved linux less stable than my windows PC (puke).

Comment 26 Keith Busch 2024-03-12 14:44:55 UTC

We're not doing anything outside spec compliance here. The device just doesn't respond. Why is that? Usually that's a firmware bug, and smart tools generally wouldn't indicate that. Sometimes we can work around firmware bugs in the driver, but we can't efficiently guess as to what the problem might be without hardware in hand.

Comment 27 Rev 2024-03-14 10:10:30 UTC

Thank you, I can understand that. I probably should have watched my system better, when the bug was not or very seldom happening and noted the kernel versions. But as typical human behaviour, when an error is not bugging you, you are happy and do nothing. Now I can remember the issue getting worse on 6.7 mainline series and that brings back the memories of "I had that before...".

Installed 6.8(.0) today - no cold boot necessary yet, but error logs like

nvme nvme0: Abort status: 0x0

and

nvme nvme0: I/O tag 512 (f200) opcode 0x1 (I/O Cmd) QID 3 timeout, aborting req_op:WRITE(1) size:4096

are still there.
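
The newer message format quoted above carries more detail than the older one (tag, opcode, request operation and size). The following is a minimal sketch for pulling those fields out, assuming exactly the format of the line in this comment; the regular expression and the name `parse_timeout_v2` are illustrative, not kernel tooling.

```python
import re

# Matches the newer format quoted above, e.g.
# "nvme nvme0: I/O tag 512 (f200) opcode 0x1 (I/O Cmd) QID 3 timeout,
#  aborting req_op:WRITE(1) size:4096"
TIMEOUT_V2_RE = re.compile(
    r"nvme (?P<ctrl>nvme\d+): I/O tag (?P<tag>\d+) \((?P<cid>[0-9a-f]+)\) "
    r"opcode (?P<opcode>0x[0-9a-f]+) \((?P<kind>[^)]*)\) QID (?P<qid>\d+) "
    r"timeout, aborting req_op:(?P<req_op>\w+)\(\d+\) size:(?P<size>\d+)"
)

def parse_timeout_v2(line):
    """Return the fields of one newer-format timeout line, or None."""
    m = TIMEOUT_V2_RE.search(line)
    if not m:
        return None
    fields = m.groupdict()
    # Convert the numeric fields; opcode is printed in hexadecimal.
    fields["tag"] = int(fields["tag"])
    fields["qid"] = int(fields["qid"])
    fields["size"] = int(fields["size"])
    fields["opcode"] = int(fields["opcode"], 16)
    return fields

line = ("nvme nvme0: I/O tag 512 (f200) opcode 0x1 (I/O Cmd) QID 3 timeout, "
        "aborting req_op:WRITE(1) size:4096")
print(parse_timeout_v2(line))
```

Grouping the parsed results by `req_op` and `qid` over a day of logs would show whether the timeouts are confined to writes or to particular queues.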

Comment 28 Keith Busch 2024-03-14 16:44:42 UTC

Do the occurrences become less frequent if you revert to an older kernel right now? I just want to rule out the possibility that your drive is just wearing out and starting to struggle.

Comment 29 Rev 2024-03-15 08:45:58 UTC

TBW (terabytes written) and drive age are at 1/10 of the manufacturer spec.

Since I don't remember the last kernel that worked better, or at least without I/O crashes, I am open to suggestions on which kernel to try.

Comment 30 Rev 2024-03-16 22:50:34 UTC

now the "nvme" filter changed to the following errors in 6.8.1:

(this is new:)
nvme nvme0: Ignoring bogus Namespace Identifiers
nvme nvme0: 7/0/0 default/read/poll queues

(this is old/known:)
nvme nvme0: Abort status: 0x0
nvme nvme0: I/O tag 987 (13db) opcode 0x1 (I/O Cmd) QID 4 timeout, aborting req_op:WRITE(1) size:8192


and at least the last 2 errors repeat steadily in the average 50/hour margin, but no I/O crash yet. (QID varies 1-7 as does size:4096-12288)

By the way, since there are a lot of different NVME SSD brands and types with this error found, how can one get rid of it?

Comment 31 Andreas 2024-03-17 17:50:09 UTC

I have two M.2 SSDs from Crucial:

nvme0: CT4000P3PSSD8 (4 TB), firmware P9CR40A
nvme1: CT2000P1SSD8 (2 TB), firmware P3CR021

I don't have any issues. I recently bought the bigger SSD and moved Linux from the other one, swapping their places. 

I see this during Kernel 6.8.1 boot and in dmesg:

[    6.641233] nvme 0000:02:00.0: platform quirk: setting simple suspend
[    6.641249] nvme 0000:05:00.0: platform quirk: setting simple suspend
[    6.641327] nvme nvme0: pci function 0000:02:00.0
[    6.641331] nvme nvme1: pci function 0000:05:00.0
[    6.647140] nvme nvme0: missing or invalid SUBNQN field.
[    6.672906] nvme nvme1: 15/0/0 default/read/poll queues
[    6.673090] nvme nvme0: allocated 32 MiB host memory buffer.
[    6.681449]  nvme1n1: p1 p2 p3 p4
[    6.721320] nvme nvme0: 8/0/0 default/read/poll queues
[    6.725566] nvme nvme0: Ignoring bogus Namespace Identifiers
[    6.727874]  nvme0n1: p1 p2 p3 p4 p5

This begs the question what SUBNQN field and Namespace Identifiers are and why they aren't present or have to be ignored: the CT2000P1SSD8 is a Crucial P1, it's the older one. The newer CT4000P3PSSD8 Crucial P3 Plus has the "missing or invalid SUBNQN field" and the "Ignoring bogus Namespace Identifiers" issue.

Obviously I have no idea what this means, but I'm guessing that it is unrelated to the I/O timeout issues, since I don't get any of those.

Comment 32 Keith Busch 2024-03-18 04:13:50 UTC

(In reply to Andreas from comment #31)
> This begs the question what SUBNQN field and Namespace Identifiers are and
> why they aren't present or have to be ignored: the CT2000P1SSD8 is a Crucial
> P1, it's the older one. The newer CT4000P3PSSD8 Crucial P3 Plus has the
> "missing or invalid SUBNQN field" and the "Ignoring bogus Namespace
> Identifiers" issue.

These duplicate and invalid naming messages don't mean anything unless your environment is an enterprise server with multipath, or in some cases, virtualization. No one else should mind if your manufacturer didn't assemble unique IDs. That used to harm things if you put two such devices in the same machine because of how udev works, but less so anymore.

Technically, there's still a data integrity danger with udev enumeration order, but not one anyone in real life seems to care about. Linus Torvalds in fact requested we remove the safety check.

Comment 33 Keith Busch 2024-03-18 14:37:54 UTC

(In reply to Rev from comment #30)
> By the way, since there are a lot of different NVME SSD brands and types
> with this error found, how can one get rid of it?

The error is that the drive is failing to produce a response to a command within a very generous time. We wait 30 seconds, which should be an eternity for NAND media. In my experience seeing many of these occurrences, they are always some broken firmware condition, and I have to harass the vendor for a proper fix.
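
For reference, the 30 second wait is, as far as I know, the default of the `nvme_core.io_timeout` module parameter. A minimal sketch for reading the value in effect, assuming the parameter is exposed at the sysfs path below (`read_io_timeout` is just an illustrative helper, and it falls back to 30 if the file is not readable):

```python
from pathlib import Path

# Assumed sysfs location of the nvme_core "io_timeout" parameter (in seconds).
IO_TIMEOUT_PATH = Path("/sys/module/nvme_core/parameters/io_timeout")

def read_io_timeout(path=IO_TIMEOUT_PATH, default=30):
    """Return the I/O timeout in seconds, or `default` if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return default

print(f"NVMe I/O timeout: {read_io_timeout()} s")
```

Raising the timeout only hides the symptom for longer; it does not make the drive answer any sooner.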

Comment 34 Rev 2024-03-19 12:42:58 UTC

(In reply to Keith Busch from comment #33)
> (In reply to Rev from comment #30)
> > By the way, since there are a lot of different NVME SSD brands and types
> > with this error found, how can one get rid of it?
> 
> The error is that the drive is failing to produce a response to a command
> within a very generous time. We wait 30 seconds, which should be an eternity
> for NAND media. In my experience seeing many of these occurrences, they are
> always some broken firmware condition, and I have to harass the vendor for a
> proper fix.

Since it is a sadly very, very common response from hardware vendors to pack their product in a black box (in this case, check for vendors who give customers the option to do a firmware update in a non Windows environment - is that number zero or non zero?) and leave open source communities hanging, my opinion would be to find a consolidation in the open source communities to concentrate on their strength to find a workaround.

I know, this is an opinion from someone not involved in the process. A very difficult process, I can just guess. But if this is a common problem, and from the responses around the web for me it seems like it is, it might be the only chance to solve the issue.

Comment 35 Keith Busch 2024-03-19 14:26:11 UTC

(In reply to Rev from comment #34)
> Since it is a sadly very, very common response from hardware vendors to pack
> their product in a black box (in this case, check for vendors who give
> customers the option to do a firmware update in a non Windows environment -
> is that number zero or non zero?) and leave open source communities hanging,
> my opinion would be to find a consolidation in the open source communities
> to concentrate on their strength to find a workaround.

Not sure I follow what you have in mind here. There's no one size fits all workaround for this.

If you want a simple way to change the workload the device receives, which may reduce the number of timeouts you observe, try using the kyber I/O scheduler. If you're already using that, try mq-deadline instead.
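
A small sketch for checking which scheduler is active before switching. It assumes the usual sysfs layout, where `/sys/block/<device>/queue/scheduler` lists the available schedulers with the active one in square brackets; the device name `nvme0n1` is only an example.

```python
from pathlib import Path

def parse_scheduler(text):
    """Split a sysfs scheduler line into (active, available)."""
    names = text.split()
    active = next((n.strip("[]") for n in names if n.startswith("[")), None)
    return active, [n.strip("[]") for n in names]

def current_scheduler(device="nvme0n1"):
    """Read and parse the scheduler line for a block device (example name)."""
    path = Path(f"/sys/block/{device}/queue/scheduler")
    return parse_scheduler(path.read_text())

# Example line in the format described above.
print(parse_scheduler("[none] mq-deadline kyber bfq"))
```

Writing one of the listed names into the same file as root selects that scheduler until the next reboot.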

Or try a different filesystem. Using ext4? Try xfs. Using xfs? Try btrfs. Using btrfs? Try ext4.

And ensure partitions are aligned on 64k boundaries.
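
The alignment check is simple arithmetic. The sketch below assumes the partition start is read from sysfs, where it is given in 512-byte sectors; the disk and partition names are placeholders.

```python
from pathlib import Path

SECTOR_BYTES = 512  # sysfs reports partition starts in 512-byte sectors

def is_aligned(start_sector, boundary=64 * 1024):
    """True if the partition's first byte falls on a multiple of `boundary`."""
    return (start_sector * SECTOR_BYTES) % boundary == 0

def partition_start(disk="nvme0n1", part="nvme0n1p1"):
    """Read a partition's start sector from sysfs (names are examples)."""
    return int(Path(f"/sys/block/{disk}/{part}/start").read_text())

print(is_aligned(2048))  # 2048 * 512 = 1 MiB, a multiple of 64 KiB
print(is_aligned(34))    # 34 * 512 = 17408 bytes, not a multiple of 64 KiB
```

Partitions created by current tools usually start at sector 2048 and pass this check; very old layouts may not.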

Are discards disabled? If so, turn them on.

Is your drive capacity utilization very high? Try deleting a bunch of stuff and keep your utilization below 80%.

Do you have any power saving features enabled? Try turning them off.

Comment 36 Nazar Mokrynskyi 2024-11-12 05:06:17 UTC

I also have this on two systems, both running Ubuntu 24.04.1.

First system:
ASUS Pro WS TRX50-SAGE WIFI + AMD Threadripper 7970X
Ubuntu 24.04 Desktop + kernel 6.11.x from Xanmod
SSDs that have this issue: Solidigm P44 Pro 2TB and Samsung 990 Pro 4TB

Second system:
Gigabyte MZ32-AR0 (rev 1.0) + AMD Epyc 7302P
Ubuntu 24.04 Server + kernel 6.8-generic from stock repositories
SSDs that have this issue: SkHynix P41 Platinum 2TB (almost identical to P44 Pro above with different firmware) and Samsung 990 Pro 4TB (different unit from first system)

I tried nvme_core.default_ps_max_latency_us=0 separately and with pcie_aspm=off pcie_port_pm=off.
Maybe coincidence, but with all 3 it survives a bit longer than without, but still reliably crashes when running `btrfs scrub start /`.
Attaching logs from second system that started with `nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off`.

Things I have tried so far without any luck:
* putting SSD into different PCIe slot (both native M.2 on the motherboard and through adapter into regular PCIe slot)
* forcing PCIe 3.0 speed for these PCIe 4.0 SSDs
* various BIOS options (like enabling/disabling AER, ACS, some others)

All SSDs are with latest available firmware. The SSD I didn't have this issue with was ADATA XPG SP8100NP-2TT-C 2TB (not the best SSD performance-wise).

Will attach full kernel log with `nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off`, open for experiments because both of the systems are freezing sometimes multiple times a day, making it barely usable.

Can compile and test the kernel or boot it with custom options if needed, anything that can fix this issue (it has been multiple months and I upgraded to 990 Pro primarily to deal with this problem).

Comment 37 Nazar Mokrynskyi 2024-11-12 05:07:04 UTC

Created attachment 307207 [details]
Samsung 990 Pro 4TB disconnects under load with `nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off`

Comment 38 iccfish 2024-12-04 03:53:05 UTC

(In reply to Rev from comment #34)
> Since it is a sadly very, very common response from hardware vendors to pack
> their product in a black box (in this case, check for vendors who give
> customers the option to do a firmware update in a non Windows environment -
> is that number zero or non zero?) and leave open source communities hanging,
> my opinion would be to find a consolidation in the open source communities
> to concentrate on their strength to find a workaround.
> 
> I know, this is an opinion from someone not involved in the process. A very
> difficult process, I can just guess. But if this is a common problem, and
> from the responses around the web for me it seems like it is, it might be
> the only chance to solve the issue.

I encountered this issue recently.

Previously I was running kernel 6.9.x. About 3 months after the system booted, one NVMe SSD suddenly became very slow: continuous writes ran at around 10 MB/s and were unstable (the speed dropped to zero from time to time). I thought it might be due to the long uptime, so I rebooted.

Before rebooting, I upgraded the kernel from 6.9.x to 6.12.x. After the reboot, this NVMe SSD was still in a strange state (low and unstable speed, busy time around 100%).

The file system was then shut down by the kernel half an hour after boot.
The log shows:


Dec  2 20:36:28 E5W kernel: nvme4n1: I/O Cmd(0x1) @ LBA 1789475760, 1024 blocks, I/O Error (sct 0x0 / sc 0x7) 
Dec  2 20:36:28 E5W kernel: I/O error, dev nvme4n1, sector 1789475760 op 0x1:(WRITE) flags 0x104000 phys_seg 1 prio class 0
Dec  2 20:36:28 E5W kernel: nvme4n1p1: writeback error on inode 3228735725, offset 1614807040, sector 1789470640
Dec  2 20:36:28 E5W kernel: nvme nvme4: Abort status: 0x0
Dec  2 20:36:30 E5W kernel: nvme nvme4: I/O tag 64 (2040) opcode 0x1 (I/O Cmd) QID 6 timeout, aborting req_op:WRITE(1) size:9216
Dec  2 20:36:31 E5W kernel: nvme nvme4: I/O tag 183 (20b7) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:524288
Dec  2 20:36:31 E5W kernel: nvme nvme4: I/O tag 184 (20b8) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:524288
Dec  2 20:36:32 E5W kernel: nvme4n1: I/O Cmd(0x1) @ LBA 2001250948, 18 blocks, I/O Error (sct 0x0 / sc 0x7) 
Dec  2 20:36:32 E5W kernel: I/O error, dev nvme4n1, sector 2001250948 op 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 0
Dec  2 20:36:32 E5W kernel: nvme nvme4: Abort status: 0x0
Dec  2 20:36:32 E5W kernel: I/O error, dev nvme4n1, sector 2001250948 op 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 0
Dec  2 20:36:32 E5W kernel: XFS (nvme4n1p1): log I/O error -5
Dec  2 20:36:32 E5W kernel: XFS (nvme4n1p1): Filesystem has been shut down due to log error (0x2).
Dec  2 20:36:32 E5W kernel: XFS (nvme4n1p1): Please unmount the filesystem and rectify the problem(s).


I ran a smartctl check and there are no obvious errors. Reading from this NVMe SSD with `dd` works. I currently have no idea how to solve this issue, and plan to revert to kernel 5.14.0 to see if that resolves it.
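
For anyone reading the status values in the log: a rough decoder sketch. It assumes the usual split of the status value into a status code (low 8 bits) and a status code type (next 3 bits), which is my understanding of how the Linux driver prints it, and the names are the ones I recall from the NVMe specification. Only a few codes are listed, and `decode_status` is an illustrative helper.

```python
# Status code types (SCT), named as I recall them from the NVMe specification.
SCT_NAMES = {
    0x0: "Generic Command Status",
    0x1: "Command Specific Status",
    0x2: "Media and Data Integrity Errors",
    0x3: "Path Related Status",
}

# A few (sct, sc) pairs relevant to this thread; deliberately incomplete.
SC_NAMES = {
    (0x0, 0x00): "Successful Completion",
    (0x0, 0x07): "Command Abort Requested",
    (0x3, 0x71): "Command Aborted By Host",
}

def decode_status(status):
    """Split a status value into (sct name, sc name)."""
    sct = (status >> 8) & 0x7
    sc = status & 0xFF
    sct_name = SCT_NAMES.get(sct, f"sct 0x{sct:x}")
    sc_name = SC_NAMES.get((sct, sc), f"sc 0x{sc:02x}")
    return sct_name, sc_name

print(decode_status(0x007))  # "sct 0x0 / sc 0x7" from the I/O error lines
print(decode_status(0x0))    # "Abort status: 0x0"
print(decode_status(0x371))  # abort status reported elsewhere in this bug
```

Read this way, "sct 0x0 / sc 0x7" is the write being failed because the host asked for it to be aborted after the timeout, not a media error reported by the drive.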

Comment 39 iccfish 2024-12-04 03:55:24 UTC

(In reply to iccfish from comment #38)

> Dec  2 20:36:28 E5W kernel: nvme4n1: I/O Cmd(0x1) @ LBA 1789475760, 1024
> blocks, I/O Error (sct 0x0 / sc 0x7) 
> Dec  2 20:36:28 E5W kernel: I/O error, dev nvme4n1, sector 1789475760 op
> 0x1:(WRITE) flags 0x104000 phys_seg 1 prio class 0
> Dec  2 20:36:28 E5W kernel: nvme4n1p1: writeback error on inode 3228735725,
> offset 1614807040, sector 1789470640
> Dec  2 20:36:28 E5W kernel: nvme nvme4: Abort status: 0x0
> Dec  2 20:36:30 E5W kernel: nvme nvme4: I/O tag 64 (2040) opcode 0x1 (I/O
> Cmd) QID 6 timeout, aborting req_op:WRITE(1) size:9216
> Dec  2 20:36:31 E5W kernel: nvme nvme4: I/O tag 183 (20b7) opcode 0x1 (I/O
> Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:524288
> Dec  2 20:36:31 E5W kernel: nvme nvme4: I/O tag 184 (20b8) opcode 0x1 (I/O
> Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:524288
> Dec  2 20:36:32 E5W kernel: nvme4n1: I/O Cmd(0x1) @ LBA 2001250948, 18
> blocks, I/O Error (sct 0x0 / sc 0x7) 
> Dec  2 20:36:32 E5W kernel: I/O error, dev nvme4n1, sector 2001250948 op
> 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 0
> Dec  2 20:36:32 E5W kernel: nvme nvme4: Abort status: 0x0
> Dec  2 20:36:32 E5W kernel: I/O error, dev nvme4n1, sector 2001250948 op
> 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 0
> Dec  2 20:36:32 E5W kernel: XFS (nvme4n1p1): log I/O error -5
> Dec  2 20:36:32 E5W kernel: XFS (nvme4n1p1): Filesystem has been shut down
> due to log error (0x2).
> Dec  2 20:36:32 E5W kernel: XFS (nvme4n1p1): Please unmount the filesystem
> and rectify the problem(s).
> 
> 
> I've done smartctl check, there's no obvious errors. using `dd` command can
> read from this nvme ssd. I have no idea how to solve this issue currently,
> and plan to revert back to kernel 5.14.0 to see if issue resolved.

Here is where it started:

Dec  2 20:29:02 E5W kernel: XFS (nvme4n1p1): Mounting V5 Filesystem 17300a9d-05ca-418f-ade0-f185d8816118
Dec  2 20:29:02 E5W kernel: XFS (nvme4n1p1): Ending clean mount
Dec  2 20:33:14 E5W kernel: nvme nvme4: I/O tag 618 (126a) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:524288
Dec  2 20:33:14 E5W kernel: nvme nvme4: I/O tag 619 (126b) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:524288
Dec  2 20:33:14 E5W kernel: nvme nvme4: I/O tag 620 (126c) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:524288
Dec  2 20:33:20 E5W kernel: nvme nvme4: Abort status: 0x0
Dec  2 20:33:23 E5W kernel: nvme nvme4: Abort status: 0x0
Dec  2 20:33:26 E5W kernel: nvme nvme4: Abort status: 0x0
Dec  2 20:33:50 E5W kernel: nvme nvme4: I/O tag 47 (302f) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:524288
Dec  2 20:33:50 E5W kernel: nvme nvme4: I/O tag 48 (3030) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:524288
Dec  2 20:33:50 E5W kernel: nvme nvme4: I/O tag 49 (3031) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:524288
Dec  2 20:33:50 E5W kernel: nvme nvme4: Abort status: 0x0
Dec  2 20:33:51 E5W kernel: nvme nvme4: Abort status: 0x0
Dec  2 20:33:51 E5W kernel: nvme nvme4: Abort status: 0x0
Dec  2 20:33:58 E5W kernel: nvme nvme4: I/O tag 64 (d040) opcode 0x1 (I/O Cmd) QID 6 timeout, aborting req_op:WRITE(1) size:9216
Dec  2 20:33:58 E5W kernel: nvme nvme4: Abort status: 0x0
Dec  2 20:34:00 E5W kernel: nvme nvme4: I/O tag 512 (e200) opcode 0x1 (I/O Cmd) QID 2 timeout, aborting req_op:WRITE(1) size:30208
Dec  2 20:34:00 E5W kernel: nvme nvme4: I/O tag 65 (5041) opcode 0x1 (I/O Cmd) QID 6 timeout, aborting req_op:WRITE(1) size:4096
Dec  2 20:34:00 E5W kernel: nvme nvme4: I/O tag 96 (3060) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:524288
Dec  2 20:34:01 E5W kernel: nvme nvme4: Abort status: 0x0
Dec  2 20:34:01 E5W kernel: nvme nvme4: Abort status: 0x0
Dec  2 20:34:01 E5W kernel: nvme nvme4: Abort status: 0x0

Comment 40 Keith Busch 2024-12-04 18:53:09 UTC

I wonder if you have a high media utilization that's straining the SSD's background tasks. Not all devices support reporting utilization, but many do.

What you can try is "sudo nvme list" and compare the two numbers in the "Usage" column. If the usage is a large percentage of the total capacity, that may mean your SSD will have trouble keeping up with new writes due to garbage collection overhead.

If the numbers are the same, that might mean the device doesn't support reporting media utilization, which would make it difficult to determine if that is happening.

You can also check "df -h" and see if your nvme mount points' "Used" is very close to the "Size".

If the device supports reporting utilization, and if that utilization is significantly higher than the filesystem's "Used", that would indicate your filesystem is not performing discards (aka "trim", "fstrim") as expected, and that can really harm the SSD's performance as you approach full capacity, especially if the device's over-provisioning is very small. I didn't see any "discards" in the iostat's output.

If the device's utilization is very high, and closely matches the filesystem's "Used" capacity, then you may have simply oversubscribed the media, and the controller can no longer keep up. What you'd have to do in that case is create an unused partition, maybe 10-20% of the capacity, and never write to it, and make sure the partition's LBAs are discarded on the device ('sudo blkdiscard /dev/nvme0n1p2', or whatever your sequestered partition is if not p2). That would simulate a higher over-provisioning for the used partition, making it easier for the controller to do garbage collection.
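
The arithmetic behind these checks can be sketched as follows. The usage string format ("used / total" with decimal unit suffixes) is an assumption about what the "Usage" column prints, the sample values are illustrative, and both helper names are made up. The 15% default simply sits in the middle of the 10-20% range mentioned above.

```python
import re

# Decimal (SI) unit suffixes; assumed to match what the "Usage" column prints.
UNITS = {"B": 1, "kB": 1e3, "MB": 1e6, "GB": 1e9, "TB": 1e12}
USAGE_RE = re.compile(r"([\d.]+)\s*([A-Za-z]+)\s*/\s*([\d.]+)\s*([A-Za-z]+)")

def utilization(usage):
    """Return used/total as a fraction from a 'used / total' string."""
    m = USAGE_RE.search(usage)
    if m is None:
        raise ValueError(f"unrecognized usage string: {usage!r}")
    used = float(m.group(1)) * UNITS[m.group(2)]
    total = float(m.group(3)) * UNITS[m.group(4)]
    return used / total

def reserve_bytes(capacity_bytes, fraction=0.15, align=64 * 1024):
    """Size of a never-written partition, rounded down to a 64 KiB multiple."""
    return int(capacity_bytes * fraction) // align * align

# Illustrative values only, not taken from a real device.
print(f"{utilization('1.50 TB / 4.00 TB'):.1%}")
print(reserve_bytes(4 * 10**12))
```

A result of exactly 100% most likely means the device does not report utilization at all, as described above, rather than that it is full.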

Comment 41 Nazar Mokrynskyi 2024-12-08 21:50:39 UTC

Thanks for suggestion, it was worth a shot, but I don't think it is the reason.

I have Samsung 990 Pro 4T, only 1.5T of which is used. Both `nvme list` and `df -h` output closely match each other.

This is one of the best consumer SSDs with PCIe 4.0 interface available, it is very unlikely that it stops working during reads with less than 40% utilization because it is oversubscribed.

Comment 42 Keith Busch 2024-12-08 22:56:11 UTC

Thanks for checking. In that case it doesn't sound like you're hitting any sort of media utilization corner case. That's pretty much the only time I've seen timeouts where the conclusion was "yes, that's expected", so I'm still considering this to be a vendor-specific oddity.

Comment 44 Nazar Mokrynskyi 2024-12-09 08:51:44 UTC

It might be vendor-specific, but I've seen the same exact issue with two other SSDs and another user recently reported the same issue with yet another SSD (Crucial T705 2TB, PCIe 5.0) at https://forum.level1techs.com/t/nvme-ssd-disappears-disconnects-from-the-system/220275/6?u=nazar-pc

So there is clearly some issue somewhere that is Linux-specific. Samsung refuses to offer any kind of support for 990 Pro saying that it works great on Windows, there are no known issues and they do not support Linux, so I'm on my own.

Anything else I can check on my end to narrow things down further?

Comment 45 rrd 2024-12-10 01:59:11 UTC

Chiming in that I'm seeing a similar issue with my NVMe drives.

Linux version 6.8.12-1-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-1 (2024-08-05T16:17Z)

Two Silicon Power 2TB UD90 NVMe 4.0 Gen4 PCIe drives purchased at the same time, installed in two GLOTRENDS PA09-HS M.2 NVMe to PCIe 4.0 X4 adapters (with M.2 heatsink) in two separate PCIe slots. The motherboard is an H11SSL-i.

openzfs mirror, zfs-2.2.6-pve1, one of them (nvme1) drops out after a month or two, degrading the pool. --> The other one (nvme0) has NEVER done this. Reboot and nvme1 comes back. If I understand it right, operation was a write which timed out, and since the nvme was so unresponsive the kernel disabled the device. Well, I guess the mirror came in handy. 

Have not tried a firmware update on the nvme yet (issues with a reader). Nor have I tried any of the suggested kernel parameters.

Hence, seems like a failed/defective nvme. Going to keep it and swap for another identical one I got later and see what happens. 

Since nvme1 is offline, here is nvme0 smart data:

```
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        30 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    3%
Data Units Read:                    58,873,504 [30.1 TB]
Data Units Written:                 218,987,666 [112 TB]
Host Read Commands:                 787,905,522
Host Write Commands:                2,416,266,061
Controller Busy Time:               169,561
Power Cycles:                       31
Power On Hours:                     2,826
Unsafe Shutdowns:                   18
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 8 of 8 entries)
No Errors Logged
```

```
[ 1531.339888] nvme nvme1: I/O tag 412 (219c) opcode 0x1 (I/O Cmd) QID 6 timeout, aborting req_op:WRITE(1) size:4096
[ 1531.344939] nvme nvme1: Abort status: 0x0
[ 1551.432807] nvme nvme1: I/O tag 413 (d19d) opcode 0x1 (I/O Cmd) QID 6 timeout, aborting req_op:WRITE(1) size:16384
[ 1551.437339] nvme nvme1: Abort status: 0x0
[ 1556.553531] nvme nvme1: I/O tag 415 (019f) opcode 0x1 (I/O Cmd) QID 6 timeout, aborting req_op:WRITE(1) size:16384
[ 1556.557791] nvme nvme1: Abort status: 0x0
[ 1556.875491] nvme nvme1: I/O tag 418 (01a2) opcode 0x1 (I/O Cmd) QID 6 timeout, aborting req_op:WRITE(1) size:16384
[ 1556.879723] nvme nvme1: Abort status: 0x0
[ 1561.685244] nvme nvme1: I/O tag 412 (219c) opcode 0x1 (I/O Cmd) QID 6 timeout, reset controller
[ 1643.613804] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[ 1663.639732] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[ 1663.640022] nvme nvme1: Disabling device after reset failure: -19
```
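
As a reading aid for the "CSTS=0x1" lines: a sketch that decodes the low bits of the controller status register, using the bit layout as I understand it from the NVMe specification (bit 0 RDY, bit 1 CFS, bits 3:2 SHST). `decode_csts` is an illustrative helper.

```python
def decode_csts(csts):
    """Decode the low bits of the NVMe controller status (CSTS) register."""
    return {
        "RDY": bool(csts & 0x1),    # controller ready
        "CFS": bool(csts & 0x2),    # controller fatal status
        "SHST": (csts >> 2) & 0x3,  # shutdown status
    }

print(decode_csts(0x1))
```

With CSTS=0x1 only the ready bit is set and no fatal status is flagged, which would fit a controller that never acknowledged the disable request during the reset. That is an interpretation, not something the log states.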

I have another

Comment 46 tzink 2025-02-06 11:07:03 UTC

I have the same issue. Lenovo Thinkpad L13, Transcend TS2TMTE400S 2TB PCIe M.2 nvme. Encrypted LVM with btrfs.

```
$ lsb_release -a
No LSB modules are available.
Distributor ID:	Kali
Description:	Kali GNU/Linux Rolling
Release:	2024.4
Codename:	kali-rolling

$ uname -a
Linux 0trust 6.11.2-amd64 #1 SMP PREEMPT_DYNAMIC Kali 6.11.2-1kali1 (2024-10-15) x86_64 GNU/Linux
```

Several times a day the system becomes unresponsive for about 30 secs (only mouse cursor can be moved). When this happens the following messages appear in dmesg:


```
[  +1.847918] nvme nvme0: I/O tag 128 (2080) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:4096
[  +0.807991] nvme nvme0: I/O tag 129 (b081) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:16384
[  +0.000024] nvme nvme0: I/O tag 130 (6082) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:16384
[  +0.000008] nvme nvme0: I/O tag 131 (2083) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:16384
[ +27.403961] nvme nvme0: I/O tag 152 (1098) opcode 0x1 (I/O Cmd) QID 1 timeout, reset controller
[  +0.000009] WARNING: CPU: 2 PID: 370 at arch/x86/kernel/apic/vector.c:878 apic_set_affinity+0x80/0x90
[  +0.000188] CPU: 2 UID: 0 PID: 370 Comm: kworker/2:1H Not tainted 6.11.2-amd64 #1  Kali 6.11.2-1kali1
[  +0.000009] Hardware name: LENOVO 21FN000BGE/21FN000BGE, BIOS R26ET43W (1.21 ) 12/12/2024
[  +0.000004] Workqueue: kblockd blk_mq_timeout_work
[  +0.000010] RIP: 0010:apic_set_affinity+0x80/0x90
[  +0.000008] Code: 00 20 00 75 1c e8 00 fb ff ff 89 c3 48 c7 c7 d8 e3 f2 b4 e8 d2 72 c4 00 89 d8 5b 5d e9 54 d9 e7 00 e8 a4 f9 ff ff 89 c3 eb e2 <0f> 0b bb fb ff ff ff 89 d8 5b 5d e9 3b d9 e7 00 90 90 90 90 90 90
[  +0.000005] RSP: 0018:ffffb06a40b27ba8 EFLAGS: 00010046
[  +0.000006] RAX: 000000001d03a000 RBX: ffff993d40062800 RCX: 0000000000000000
[  +0.000004] RDX: 0000000000000000 RSI: ffffffffb4f825c0 RDI: ffff993d5655df00
[  +0.000004] RBP: 0000000000000000 R08: ffffffffb4f825c0 R09: ffffffffb4f821a0
[  +0.000003] R10: 0000000000000000 R11: 0000000000000000 R12: ffff993d5655d940
[  +0.000004] R13: ffff993d558f6a80 R14: ffffffffb4f825c0 R15: 0000000000000000
[  +0.000005] FS:  0000000000000000(0000) GS:ffff99401eb00000(0000) knlGS:0000000000000000
[  +0.000004] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000004] CR2: 00007f5f98455008 CR3: 0000000195a22000 CR4: 0000000000750ef0
[  +0.000005] PKRU: 55555554
[  +0.000003] Call Trace:
[  +0.000006]  <TASK>
[  +0.000004]  ? apic_set_affinity+0x80/0x90
[  +0.000007]  ? __warn.cold+0x8e/0xe8
[  +0.000008]  ? apic_set_affinity+0x80/0x90
[  +0.000015]  ? report_bug+0xff/0x140
[  +0.000010]  ? handle_bug+0x3c/0x80
[  +0.000008]  ? exc_invalid_op+0x17/0x70
[  +0.000006]  ? asm_exc_invalid_op+0x1a/0x20
[  +0.000015]  ? apic_set_affinity+0x80/0x90
[  +0.000008]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000007]  amd_ir_set_affinity+0x46/0xa0
[  +0.000009]  ioapic_set_affinity+0x24/0x70
[  +0.000006]  irq_do_set_affinity+0x1c0/0x200
[  +0.000007]  irq_setup_affinity+0xf7/0x180
[  +0.000005]  irq_startup+0x11e/0x130
[  +0.000004]  enable_irq+0x50/0xa0
[  +0.000006]  nvme_timeout+0x283/0x3a0 [nvme]
[  +0.000009]  blk_mq_handle_expired+0x6b/0xa0
[  +0.000005]  bt_iter+0x88/0xa0
[  +0.000005]  blk_mq_queue_tag_busy_iter+0x31c/0x5d0
[  +0.000004]  ? __pfx_blk_mq_handle_expired+0x10/0x10
[  +0.000004]  ? __pfx_blk_mq_handle_expired+0x10/0x10
[  +0.000005]  blk_mq_timeout_work+0x171/0x1b0
[  +0.000004]  process_one_work+0x177/0x330
[  +0.000006]  worker_thread+0x252/0x390
[  +0.000004]  ? __pfx_worker_thread+0x10/0x10
[  +0.000003]  kthread+0xd2/0x100
[  +0.000004]  ? __pfx_kthread+0x10/0x10
[  +0.000004]  ret_from_fork+0x34/0x50
[  +0.000005]  ? __pfx_kthread+0x10/0x10
[  +0.000004]  ret_from_fork_asm+0x1a/0x30
[  +0.000008]  </TASK>
[  +0.000001] ---[ end trace 0000000000000000 ]---
<above trace sometimes repeats>
[  +0.000207] nvme nvme0: Abort status: 0x371
[  +0.000006] nvme nvme0: Abort status: 0x371
[  +0.000003] nvme nvme0: Abort status: 0x371
[  +0.000002] nvme nvme0: Abort status: 0x371
[  +0.000002] nvme nvme0: Abort status: 0x371
[  +0.037048] nvme nvme0: 8/0/0 default/read/poll queues
```

After happening several times the filesystems will switch to ro (which probably is the sane and safe way to go), forcing a reboot. Yet, the issue returns every single time.
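
The relative timestamps in the log above line up with the freeze of about 30 seconds. A sketch that sums the deltas between the first abort message and the controller reset, assuming the `[  +N.NNNNNN]` prefix format shown in the log (`elapsed` is an illustrative helper):

```python
import re

# Matches the relative-timestamp prefix used in the log above.
DELTA_RE = re.compile(r"^\[\s*\+(\d+\.\d+)\]")

def elapsed(lines):
    """Seconds between the first and last line, from the per-line deltas."""
    deltas = [float(m.group(1)) for line in lines if (m := DELTA_RE.match(line))]
    # The first delta is relative to an earlier message, so it is skipped.
    return sum(deltas[1:])

log = [
    "[  +1.847918] nvme nvme0: I/O tag 128 (2080) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:4096",
    "[  +0.807991] nvme nvme0: I/O tag 129 (b081) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:16384",
    "[  +0.000024] nvme nvme0: I/O tag 130 (6082) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:16384",
    "[  +0.000008] nvme nvme0: I/O tag 131 (2083) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:16384",
    "[ +27.403961] nvme nvme0: I/O tag 152 (1098) opcode 0x1 (I/O Cmd) QID 1 timeout, reset controller",
]
print(f"{elapsed(log):.2f} s from first abort to controller reset")
```

For this excerpt the gap comes out at roughly 28 seconds, close to the length of the observed freeze.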

Smart does not report anything suspicious.

```
$ sudo smartctl -H /dev/nvme0n1      
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.11.2-amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

$ sudo smartctl -l error /dev/nvme0n1
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.11.2-amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
```

I also ran a complete diagnostic of the hardware using Lenovo Diagnostics. Every test passed.

I have had this issue since at least mid-2024, but I cannot pinpoint the exact time or kernel version when I first encountered it.