Bug 216809

Summary: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting
Product: IO/Storage
Reporter: jfhart085
Component: NVMe
Assignee: IO/NVME Virtual Default Assignee (io_nvme)
Status: NEW
Severity: blocking
CC: andreas.thalhammer, kbusch, konoha02, nazar, rev
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 6.1
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
  nvme drive and controller specs
  iostat on nvme
  Samsung 990 Pro 4TB disconnects under load with `nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off`

Description jfhart085 2022-12-14 11:35:51 UTC
Attempting to populate an NVMe device (nvme0n1) for use as the main system drive, using the following command:

rsync -axvH /. --exclude=/lost+found --exclude=/var/log.bu --exclude=/usr/var/log.bu --exclude=/usr/X11R6/var/log.bu --exclude=/home/jhart/.cache/mozilla/firefox/are7uokl.default-release/cache2.bu --exclude=/home/jhart/.cache/thunderbird/7zsnqnss.default/cache2.bu /mnt/root_new 2>&1 | tee root.log

The I/O quickly hangs, and the drive cannot be unmounted without a reboot.

dmesg reports the following: {
[Dec14 19:24] nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting
[Dec14 19:25] nvme nvme0: I/O 0 QID 1 timeout, reset controller
[ +30.719985] nvme nvme0: I/O 8 QID 0 timeout, reset controller
[Dec14 19:28] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
[  +0.031803] nvme nvme0: Abort status: 0x371
[Dec14 19:30] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
[  +0.000019] nvme nvme0: Removing after probe failure status: -19
}
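
For readers decoding the log above: `CSTS` is the NVMe Controller Status register, and `status: -19` matches `-ENODEV` on Linux. The following is a minimal sketch of splitting out the low CSTS bits, assuming the register layout from the NVMe base specification (bit 0 RDY, bit 1 CFS, bits 3:2 SHST). The function name `decode_csts` is hypothetical and not part of any tool mentioned in this report.

```python
def decode_csts(value: int) -> dict:
    """Split the low bits of an NVMe CSTS register value into named fields."""
    return {
        "RDY": bool(value & 0x1),    # controller ready
        "CFS": bool(value & 0x2),    # controller fatal status
        "SHST": (value >> 2) & 0x3,  # shutdown status
    }

# Value taken from the dmesg lines above.
print(decode_csts(0x1))
```

Under that reading, `CSTS=0x1` would mean the controller still reported RDY set and no fatal status at the moment the reset was abandoned.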

Linux DellXPS 6.1.0 #1 SMP Tue Dec 13 21:48:51 JST 2022 x86_64 GNU/Linux
custom distribution built entirely from source
Comment 1 jfhart085 2022-12-14 11:59:53 UTC
Created attachment 303406 [details]
nvme drive and controller specs
Comment 2 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-12-15 06:50:46 UTC
Did this work on earlier kernel version? If yes, which was the latest version that worked?
Comment 3 jfhart085 2022-12-15 06:55:45 UTC
I have just recently started trying to use NVME.  I originally tried with linux-57 and had similar failures, so I upgraded to 6.1 and still had them.  The report I give here is for linux kernel 6.1.
Comment 4 jfhart085 2022-12-15 06:56:32 UTC
that should read linux 5.7.  My apologies for the finger stutter.
Comment 5 jfhart085 2022-12-15 06:57:54 UTC
To clarify: I have not tried NVME with anything earlier than linux-5.7
Comment 6 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-12-15 07:03:09 UTC
thx for clarifying, then it's not a regression I should track and something for the regular developers. If you don't get a reply from them here, report this to the appropriate maintainers and subsystem list mentioned in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/MAINTAINERS
Comment 7 Nelson G 2023-10-07 02:32:59 UTC
oct 06 20:46:04 debian kernel: nvme nvme0: I/O 648 (I/O Cmd) QID 4 timeout, aborting
oct 06 20:47:23 debian kernel: nvme nvme0: I/O 0 (I/O Cmd) QID 6 timeout, aborting
oct 06 20:47:23 debian kernel: nvme nvme0: I/O 681 QID 3 timeout, reset controller
oct 06 20:47:23 debian kernel: nvme nvme0: I/O 8 QID 0 timeout, reset controller
oct 06 20:47:23 debian kernel: nvme nvme0: Abort status: 0x371
oct 06 20:47:23 debian kernel: nvme nvme0: Abort status: 0x371
oct 06 20:47:23 debian kernel: nvme nvme0: Abort status: 0x371
oct 06 20:47:23 debian kernel: nvme nvme0: Abort status: 0x371
oct 06 20:47:23 debian kernel: nvme nvme0: Abort status: 0x371
oct 06 20:47:23 debian kernel: nvme nvme0: 15/0/0 default/read/poll queues
oct 06 20:47:23 debian kernel: nvme nvme0: Ignoring bogus Namespace Identifiers


I got this error today.
I powered on my machine (cold boot) and after login to my xfce session everything except the cursor got frozen, nothing responded (not even the clock).  I could go into tty and ctrl alt del from there.  It rebooted after a minute or so.
Comment 8 Andreas 2023-10-07 06:44:14 UTC
@Nelson G, Comment 7
Which version of Debian are you running? Which kernel version exactly?

On the shell, can you share the (selective) output of the following commands:
# cat /etc/debian_version
# cat /etc/*release
# uname -a
# hostnamectl

Also, if installed, (selective) output about the system would be nice:
# dmidecode -t system

You might want to _not_ paste the serial numbers and UUIDs, hence "selective".
Comment 9 Nelson G 2023-10-09 22:42:18 UTC
oops, sorry, here:
debian 12.1 
linux 6.1.52-1 x86-64

Thinkpad e495 

# dmidecode 3.4
Getting SMBIOS data from sysfs.
SMBIOS 3.1.1 present.

Handle 0x000F, DMI type 1, 27 bytes
System Information
	Manufacturer: LENOVO
	Product Name: 20NECTO1WW
	Version: ThinkPad E495
	Serial Number: blank
	UUID: blank
	Wake-up Type: Power Switch
	SKU Number: LENOVO_MT_20NE_BU_Think_FM_ThinkPad E495
	Family: ThinkPad E495

Handle 0x0036, DMI type 15, 31 bytes
System Event Log
	Area Length: 1346 bytes
	Header Start Offset: 0x0000
	Header Length: 16 bytes
	Data Start Offset: 0x0010
	Access Method: General-purpose non-volatile data functions
	Access Address: 0x00F0
	Status: Valid, Not Full
	Change Token: 0x00000053
	Header Format: Type 1
	Supported Log Type Descriptors: 4
	Descriptor 1: POST error
	Data Format 1: POST results bitmap
	Descriptor 2: PCI system error
	Data Format 2: None
	Descriptor 3: System reconfigured
	Data Format 3: None
	Descriptor 4: Log area reset/cleared
	Data Format 4: None
Comment 10 Andreas 2023-10-10 03:50:30 UTC
Can you say with certainty that the NVMe SSD itself isn't faulty, causing the errors?

Did you check the SMART (and other) logs, e.g. with smartctl?

For reference, my system:
Hardware specs:
  Product: LENOVO Legion 5 Pro 16ACH6H 82JQ
  SKU Number: LENOVO_MT_82JQ_BU_idea_FM_Legion 5 Pro 16ACH6H
  Family: Legion 5 Pro 16ACH6H
  NVMe PCIe SSD (upgraded, not original): Crucial P1 CT2000P1SSD8
Software specs:
  Gentoo Linux
  Kernel 6.5.6-gentoo x86_64 AMD Ryzen 7 5800H (custom kernel, my own .config)

E.g. this is from my system, without the errors you describe:

localhost ~ # smartctl -H /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.5.6-gentoo-L5P] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

localhost ~ # smartctl -l error /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.5.6-gentoo-L5P] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
Error Information (NVMe Log 0x01, 16 of 256 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0       1937     0  0x0014  0x4005      -            0     1     -

localhost ~ # nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 44 °C (317 K)
available_spare                         : 100%
available_spare_threshold               : 5%
percentage_used                         : 3%
endurance group critical warning summary: 0
Data Units Read                         : 273.156.408 (139,86 TB)
Data Units Written                      : 69.561.559 (35,62 TB)
host_read_commands                      : 9.941.155.688
host_write_commands                     : 1.220.956.942
controller_busy_time                    : 22.719
power_cycles                            : 1.549
power_on_hours                          : 8.511
unsafe_shutdowns                        : 96
media_errors                            : 0
num_err_log_entries                     : 1.937
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Temperature Sensor 1           : 44 °C (317 K)
Thermal Management T1 Trans Count       : 17
Thermal Management T2 Trans Count       : 4
Thermal Management T1 Total Time        : 17593
Thermal Management T2 Total Time        : 9
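
The terabyte figures in parentheses follow from the raw counters: the NVMe specification defines one SMART "data unit" as 1000 units of 512 bytes. A small sketch of that arithmetic, using the counters from the output above (the function name is hypothetical):

```python
def data_units_to_tb(units: int) -> float:
    """Convert NVMe SMART 'Data Units' to decimal terabytes (1 unit = 1000 * 512 bytes)."""
    return units * 512_000 / 1e12

print(round(data_units_to_tb(273_156_408), 2))  # Data Units Read
print(round(data_units_to_tb(69_561_559), 2))   # Data Units Written
```

The same conversion is useful when comparing a drive's written volume with the endurance rating in the manufacturer's data sheet.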
Comment 11 Nelson G 2023-10-12 21:09:17 UTC
Well, I don't mean that my SSD is the culprit because it is in a bad state, but because its Linux support is as bad as that of my ThinkPad's BIOS OEM team. They simply don't care about Linux when you ask for help or support with this bug or, worse, with a critical bug or implementation problem. They act as if the error were using Linux per se.
Anyway:

$ smartctl -H /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-13-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

$ smartctl -l error /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-13-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

$ sudo nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning			: 0
temperature				: 38°C (311 Kelvin)
available_spare				: 100%
available_spare_threshold		: 10%
percentage_used				: 3%
endurance group critical warning summary: 0
Data Units Read				: 45.442.041 (23,27 TB)
Data Units Written			: 26.846.180 (13,75 TB)
host_read_commands			: 462.473.222
host_write_commands			: 320.013.095
controller_busy_time			: 25.241
power_cycles				: 7.142
power_on_hours				: 11.836
unsafe_shutdowns			: 415
media_errors				: 10
num_err_log_entries			: 10
Warning Temperature Time		: 189
Critical Composite Temperature Time	: 1  —very old record :)—
Thermal Management T1 Trans Count	: 0
Thermal Management T2 Trans Count	: 0
Thermal Management T1 Total Time	: 0
Thermal Management T2 Total Time	: 0



Oh btw,  I reported this warning some time ago https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=990510

[    1.390941] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0xffffc000 flags=0x0000]
[    1.390976] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0xffffc080 flags=0x0000]


I get it constantly, as the very first boot message (even before the LUKS passphrase prompt).
The day I hit the bug we're discussing here, I also got that one; just sharing in case anyone recognizes it.
Comment 12 Andreas 2023-10-13 14:21:44 UTC
I tend to think that bug fixes take time, especially if the developers don't experience the issues themselves on their own hardware. E.g. bug #202665 got fixed more than a year after it was initially reported.

So anyway, you should definitely get the newest kernel and see if the issue is still there, because that's what the developers will ask you to do anyhow. Also, with the sources at your disposal, you can apply any patches they throw at you for testing.

The basics are:
https://www.wikihow.com/Compile-the-Linux-Kernel

1. Get the source. (Apply patches if/as necessary).
2. Compile the kernel and the modules, then install it.
3. Boot that kernel. (Probably involves configuring a boot manager.)

How you would do all that depends on the distribution you use.
For Debian, read this:
https://www.debian.org/doc//manuals/debian-handbook/sect.kernel-compilation.pl.html
E.g. for Ubuntu, you'd read this:
https://wiki.ubuntu.com/Kernel/BuildYourOwnKernel

I know, this seems like a lot of work, but it's the only way I know of...
Comment 13 Keith Busch 2023-10-13 14:25:26 UTC
Well, I don't think there's any indication that there's a kernel bug of any kind here. Devices that don't complete commands will exceed the kernel's tolerance for timeouts, and there's nothing the kernel can do about it.
Comment 14 Andreas 2023-10-13 20:12:23 UTC
Is there a way to tweak the timeout and set it to a higher value? As I understand it, the kernel resets the controller after the timeout is reached, but how certain is it that the command wouldn't have succeeded if given more time?
Comment 15 Keith Busch 2023-10-13 20:41:35 UTC
Not sure if more time will help here. The 'iostat' output I suggested on the mailing list should indicate whether waiting longer will help or not.

You can change the default with kernel parameter 'nvme_core.io_timeout=<time-in-seconds>'. Default is 30 seconds.
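
To confirm which timeout is actually in effect after a reboot, the module parameter can usually be read back from sysfs. A minimal sketch, assuming the standard `/sys/module/<module>/parameters/<name>` layout and that `nvme_core` is loaded or built in; the helper name `read_io_timeout` is hypothetical:

```python
from pathlib import Path

# Assumed sysfs location of the parameter named above.
DEFAULT_PATH = Path("/sys/module/nvme_core/parameters/io_timeout")

def read_io_timeout(path: Path = DEFAULT_PATH):
    """Return the NVMe I/O timeout in seconds, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None

print(read_io_timeout())
```

A result of `None` only means the file was missing or unreadable on the machine running the script; it says nothing about the drive.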
Comment 16 Andreas 2023-10-13 22:25:11 UTC
@Keith Busch, thank you very much.

To get iostat, install sysstat. (On Debian: "sudo apt install sysstat")

My system is working and doesn't have those issues, for reference:

nvme0 contains my main Gentoo Linux system
nvme1 is just mostly idle all the time

localhost ~ # dmesg | grep nvme
[    7.269597] nvme 0000:02:00.0: platform quirk: setting simple suspend
[    7.269615] nvme 0000:05:00.0: platform quirk: setting simple suspend
[    7.269710] nvme nvme1: pci function 0000:05:00.0
[    7.269713] nvme nvme0: pci function 0000:02:00.0
[    7.279969] nvme nvme1: missing or invalid SUBNQN field.
[    7.280233] nvme nvme1: Shutdown timeout set to 8 seconds
[    7.297803] nvme nvme1: 16/0/0 default/read/poll queues
[    7.302096]  nvme1n1: p1 p2 p3 p4
[    7.303998] nvme nvme0: 15/0/0 default/read/poll queues
[    7.311543]  nvme0n1: p1 p2 p3 p4 p5 p6 p7 p8

localhost ~ # iostat -x nvme0n1
Linux 6.5.7-gentoo-L5P (localhost.localdomain)  2023-10-14      _x86_64_        (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,65    0,00    0,36    0,05    0,00   98,94

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme0n1         21,81    684,75     0,01   0,07    1,31    31,40   11,16    366,46     0,15   1,37    4,81    32,84   11,96  12172,93     0,00   0,00    0,22  1017,38    0,21    0,44    0,08   1,28

localhost ~ # iostat -x nvme1n1
Linux 6.5.7-gentoo-L5P (localhost.localdomain)  2023-10-14      _x86_64_        (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,65    0,00    0,36    0,05    0,00   98,93

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme1n1          1,03      5,39     0,00   0,00    0,04     5,23    0,00      0,00     0,00   0,00    0,00     0,00    0,00      0,00     0,00   0,00    0,00     0,00    0,00    0,00    0,00   0,01

@Nelson G, could you also provide your iostat output, please?
Also, try adding nvme_core.io_timeout=60 to your kernel commandline and see if that changes anything. Also, try what Keith Busch suggested on the mailing list:

----snap---
If the device is responding very slowly, suggestions I might have
include:

  a. Enable discards if you've disabled them
  b. Disable discards if you've enabled them
  c. If not already enabled, use an io scheduler, like mq-deadline or kyber
----snap---
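
Suggestion (c) can be checked from sysfs, where the kernel lists the schedulers available for a block device and marks the active one with square brackets. A minimal sketch of parsing that line, assuming the usual `/sys/block/<dev>/queue/scheduler` format; the sample string is illustrative and not taken from any system in this report:

```python
import re

def active_scheduler(line: str):
    """Return the bracketed (active) scheduler name from a sysfs scheduler line."""
    match = re.search(r"\[([^\]]+)\]", line)
    return match.group(1) if match else None

# Illustrative content of /sys/block/nvme0n1/queue/scheduler
sample = "[none] mq-deadline kyber"
print(active_scheduler(sample))
```

Writing one of the listed names back to the same file (as root) is the usual way to switch schedulers at runtime.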

Thanks.
Comment 17 Nelson G 2023-10-14 03:10:46 UTC
Here:
iostat system defaults,  both output of warm boot and like 10 minutes of usage:

Linux 6.1.0-13-amd64 (debian) 	13/10/23 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7,53    0,00    8,96    2,09    0,00   81,42

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme0n1        605,22  18567,04    59,70   8,98    0,32    30,68   34,25    337,62     0,11   0,33    0,26     9,86    0,00      0,00     0,00   0,00    0,00     0,00    1,32    0,13    0,20  24,69

Linux 6.1.0-13-amd64 (debian) 	13/10/23 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2,07    0,01    1,81    0,46    0,00   95,65

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme0n1        132,05   4217,31    16,53  11,13    0,64    31,94    5,87     51,63     0,02   0,34    0,23     8,80    0,00      0,00     0,00   0,00    0,00     0,00    0,29    0,11    0,09   4,12


iostat WITH nvme_core.io_timeout=60, both warm boot and like 10 minutes of usage (again):

Linux 6.1.0-13-amd64 (debian) 	13/10/23 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7,21    0,00    7,36    1,69    0,00   83,74

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme0n1        818,44  19837,90    44,75   5,18    0,62    24,24   22,52    213,35     0,02   0,08    0,27     9,47    0,00      0,00     0,00   0,00    0,00     0,00    0,86    0,13    0,51  21,85

Linux 6.1.0-13-amd64 (debian) 	13/10/23 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1,73    0,00    1,06    0,17    0,00   97,04

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme0n1         47,16   1470,21     5,82  10,98    0,70    31,17    3,21     37,63     0,02   0,76    0,09    11,71    0,05      4,11     0,00   0,00    0,29    78,79    0,23    0,13    0,03   1,51


Off topic:
Yeah,  I suppose you're right about bugs getting fixed once a developer experiences it.  To keep things short I'll just say I've had enough bad experiences with lenovo.  And I've been told already that my issues are 'hard' to get ""fixed"" (they're fixed already by amd but not implemented on the bios) because well, Microsoft Windows is the main OS and not linux. So no linux related fixes on the bios for me https://bugzilla.kernel.org/show_bug.cgi?id=216161  
As for the SSD, it's cheap, the average temperatures are ridiculously high, and the speeds are mediocre. But it gets the job done as a boot drive (system and config files are there to keep things faster than an HDD). My gf's computer has a Samsung and, oh boy, that drive runs cooler and faster (same storage capacity btw, 256 GB lol).
So yeah, I still don't expect much of it.  Like, no dmesg warnings. Or the bug we're reporting (which only happened to me on very rare occasions).  At least it rarely fails in scary ways like 'nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting'
And thanks for the advice of the kernel update!  I use both debian and arch,  and the 'fix' for https://bugzilla.kernel.org/show_bug.cgi?id=202665 was supposed to be on 6.1 and beyond, although,  knowing that ADATA told users with that bug to use windows, and that it seems to be vendor related (not entirely of course, maybe common nvme controller related), i prefer to just reboot once I see that warning.
Comment 18 Andreas 2023-10-14 18:15:25 UTC
You'd have to check, but maybe those error messages aren't really affecting your system at all. I would suggest to test this by running your system very very long (>1-2 days), no reboots, and see if it remains stable. Especially with the Arch system, as it always uses the newest kernel.

Also, if you're up to it, you could try switching (NVMe) SSDs and see if the fault is gone with another SSD.

@jfhart085@gmail.com,
the original reporter: if you could also provide the requested info, this bug could be closed. Since you have a Kingston SSD, it could be different. Also, a Dell XPS isn't supposed to be a cheap system.

Something like:
# smartctl -H /dev/nvme0n1
# smartctl -l error /dev/nvme0n1
# nvme smart-log /dev/nvme0n1
# iostat -x nvme0n1


@Nelson G
Off Topic:

I am very happy with my Legion 5 Pro from Lenovo. It is running very well with Linux, with the exception that I have to boot into Windows for BIOS updates. Some people also reported issues with dual "switcheroo" graphics on Linux, but since I'm only using the AMD Radeon integrated graphics on Linux (haven't even loaded the closed-source Nvidia drivers) and I specifically bought this laptop BECAUSE it has a dedicated Nvidia card -- for me only on Windows, only for gaming --, I didn't have those (now fixed, as I heard) Nvidia/AMD switcheroo issues. But that's the whole purpose of the Legion 5 Pro anyhow: Gaming! Not Linux, but if you know what you're doing, it works very very well.

Lenovo has some dedicated Linux laptops, like certain ThinkPad series. Even in the BIOS/UEFI settings, there's a "Windows" or "Linux" OS preset on them, so I guess it all depends on what you're looking for.

Anyway: Have you reported your issues/findings to the Lenovo support forums? They have a guy there, MarkRHPearson, acting as a Linux liaison, and he speaks to the company (and the technicians, e.g. for BIOS updates) on behalf of us Linux users.

https://forums.lenovo.com/t5/Linux-Operating-Systems/ct-p/lx_en

That said, it may well be that there's nothing that can be done. Maybe under Windows your SSD has the same issues, but Windows doesn't directly report them to you -- it may just be the specific SSD you're using...
Comment 19 Keith Busch 2023-10-18 15:56:35 UTC
The way to run iostat as needed for this issue is "iostat -x /dev/nvme0n1 1".

Notice the "1" at the end, as in to repeat the stat once every second so that we can see what's changed at each time interval.

If the device is hopelessly broken, I expect that at some point %util will approach 100%, r/s and w/s will go to 0, and aqu-sz will hold a relatively large non-zero value.

If the device is simply unable to keep up with the workload you're demanding out of it but still responsive in a slower capacity, then 'r/s' and 'w/s' should be non-zero, but show an obvious drop from prior to observing timeouts.
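
The two patterns described above can be turned into a rough check on each `iostat -x` interval. This is only a sketch: the function name and both thresholds are illustrative assumptions, not values given in this thread, and the second sample is hypothetical.

```python
def classify_sample(r_s, w_s, aqu_sz, util, util_near_full=95.0, min_queue=1.0):
    """Label one `iostat -x` interval using the two patterns described above.

    util_near_full and min_queue are illustrative thresholds (assumptions).
    """
    if r_s > 0 or w_s > 0:
        # Still completing I/O; compare r/s and w/s with earlier intervals.
        return "responsive"
    if util >= util_near_full and aqu_sz >= min_queue:
        # Requests are queued but nothing completes.
        return "stuck"
    return "inconclusive"

# First sample from Comment 17: r/s, w/s, aqu-sz, %util
print(classify_sample(605.22, 34.25, 0.20, 24.69))
# Hypothetical interval showing the "hopelessly broken" pattern
print(classify_sample(0.0, 0.0, 32.0, 100.0))
```

For the "slow but responsive" case, the label alone is not enough; the drop in r/s and w/s relative to intervals before the timeouts is what matters.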
Comment 20 jfhart085 2023-10-30 13:05:46 UTC
I no longer have the Kingston SSD.  I replaced it with a Samsung unit.
I was able to resolve the issue as follows:

> On 12/19/22 11:41 PM, Keith Busch wrote:
>
>>
>>> MaxPayload 128 bytes, MaxReadReq 512 bytes
>>>                  DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
>>>                  LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Latency L0 <1us, L1 <8us
>>>                          ClockPM+ Surprise- LLActRep- BwNot-
>>>                  LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
>>>                          ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>>>                  LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>
>
>> Something seems off if it's downtraining to Gen1 x1. I believe this
>> setup should be capable of Gen2 x4. It sounds like the links among these
>> components may not be reliable.
>
> This was a useful comment, and led me on to some further research. The
> system in question here (a rather older Dell) was equipped with the
> following PCI slots:
>
>  slot  type      pins  use
>  ----  --------  ----  ---
>     6  PCI        124  available
>     5  PCI        124  blocked by RX580
>     4  PCIe x16   164  secondary graphics, RX580
>     3  PCIe x8     98  NVME controller
>     2  PCIe x1     36  TV card
>     1  PCIe x16   164  primary graphics, available
>
> It also uses an Nvidia chipset for the PCI bridge.  It turns out that
> back in 2008 Dell had custom modifications made to a standard chip set
> configuration normally supplied by Nvidia. This enabled the
> motherboard to handle up to two graphics cards at the same time while
> minimizing cost.  The trade off apparently was that the two x16 slots
> actually only have 8 data lanes each, and the x8 slot (slot 3) only
> one lane. I had the NVME controller in the x8 slot, expecting the
> controller to negotiate for 4 lanes, not realizing the single lane
> restriction.  I moved the controller to the unused x16 slot (slot 1)
> since I have only a single card in the machine.  It is now operating
> as expected without further difficulty.
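
The downtraining that led to this diagnosis can be spotted by comparing the `LnkCap` (what the device supports) and `LnkSta` (what was negotiated) lines of `lspci -vv`. A minimal sketch, assuming those two lines appear unwrapped as in the sample below; the function names are hypothetical, and the sample values are the ones quoted in this comment:

```python
import re

# Matches e.g. "LnkCap: Port #0, Speed 8GT/s, Width x4" within a single line.
LINK_RE = re.compile(r"(LnkCap|LnkSta):.*?Speed ([\d.]+)GT/s.*?Width x(\d+)")

def link_summary(lspci_text: str) -> dict:
    """Extract (speed in GT/s, lane count) for LnkCap and LnkSta."""
    found = {}
    for name, speed, width in LINK_RE.findall(lspci_text):
        found[name] = (float(speed), int(width))
    return found

def is_downtrained(summary: dict) -> bool:
    """True when the negotiated link is slower or narrower than the capability."""
    cap, sta = summary["LnkCap"], summary["LnkSta"]
    return sta[0] < cap[0] or sta[1] < cap[1]

sample = """\
LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Latency L0 <1us, L1 <8us
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
"""
summary = link_summary(sample)
print(summary, is_downtrained(summary))
```

A downtrained link is not always a fault (some links drop speed when idle), so the result is a prompt to check the slot wiring rather than proof of a problem.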
Comment 21 Andreas 2023-10-30 13:18:13 UTC
Thank you so much, @jfhart085!

@Nelson G,
Maybe this helps you as well? Does your laptop have two different PCIe NVMe slots? If so, can you switch the SSDs?

And please test your system for an extended period of time, run the iostat command like Keith Busch requested in Comment 19 (on a separate virtual console), then post its output when you notice the problem.
Comment 22 Nelson G 2023-10-31 19:18:21 UTC
Mine only has one PCIe NVMe slot.

Alright, I will try to run iostat for an extended period of time. It's been running for at least 10 minutes now and there doesn't seem to be anything unusual in the output. I will share it in a minute.

I am currently using the 'nvme_core.default_ps_max_latency_us=0' boot parameter because I sadly got this bug once again a few days ago

journald=
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 83 (I/O Cmd) QID 2 timeout, aborting
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 453 (I/O Cmd) QID 8 timeout, aborting
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 454 (I/O Cmd) QID 8 timeout, aborting
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 898 (I/O Cmd) QID 1 timeout, aborting
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 899 (I/O Cmd) QID 1 timeout, aborting
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 0 QID 0 timeout, reset controller
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 83 QID 2 timeout, reset controller
oct 29 22:16:46 debian kernel: nvme nvme0: Abort status: 0x371
oct 29 22:16:46 debian kernel: nvme nvme0: Abort status: 0x371
oct 29 22:16:46 debian kernel: nvme nvme0: Abort status: 0x371
oct 29 22:16:46 debian kernel: nvme nvme0: Abort status: 0x371
oct 29 22:16:46 debian kernel: nvme nvme0: Abort status: 0x371
oct 29 22:16:46 debian kernel: nvme nvme0: 15/0/0 default/read/poll queues
oct 29 22:16:46 debian kernel: nvme nvme0: Ignoring bogus Namespace Identifiers


I had to reboot because I needed the system working correctly ASAP. But next time I will run iostat and dmesg and save the output (if possible, because everything is slow or unresponsive).
I was reading the Arch Linux forums and found that boot parameter to be 'useful' in keeping the system from hitting the bug at boot. (I've only ever gotten this bug during the boot process, never after days of the system running. I'd say I've reproduced it about 1 time in 100 boots; I really don't know what triggers it.)

In a few hours I will run iostat again with the default boot parameters and a more exhaustive nvme activity.
Comment 23 Nelson G 2023-10-31 19:21:06 UTC
Created attachment 305345 [details]
iostat on nvme

nvme_core.default_ps_max_latency_us=0 boot parameter being used,  not so much happening on that ssd during that test.
Comment 24 Andreas 2023-10-31 19:36:13 UTC
With this last information, logically, there are two actions required:

#1 the devs need an iostat output when the error occurs, not when the system is running as it should.

#2 if the dmesg errors only occur right at boot time, and don't occur at all while the system is running, this might be a rare case that could be caused and/or addressed by the kernel, and therefore could also be fixed in the code.

Issue #1 is for @Nelson G
Issue #2 is for the devs, @Keith Busch
Comment 25 Rev 2024-03-12 13:12:22 UTC
This bug is still present. I get it on mainline 6.7(.9), and had it on previous kernels (so long ago that I don't even remember which kernel) on Ubuntu 22.04.4 LTS.

Strangely it was better on some of the previous kernel series. 

But now (with 6.7) I get system hangs with the same log as the initial bug reporter. 

And it gets as bad as system crashes, because no I/O is possible any more. The system just stops working because the OS can no longer access its boot drive (512 GB LVM2, Adata SX6000LNP with the most recent firmware). And of course there are no logs because of that.

At first I thought it must be my drive, but checks give the NVMe drive a clean bill of health.

I can even trigger the bug by loading video files and skipping around while watching. Sooner or later the system will hang with the OP's error. Later still, the system loses I/O to the boot drive and I have to power off the hardware manually and cold boot it (no shutdown or reboot is possible, and no other working environment either, because no I/O is possible).

That makes my beloved linux less stable than my windows PC (puke).
Comment 26 Keith Busch 2024-03-12 14:44:55 UTC
We're not doing anything outside spec compliance here. The device just doesn't respond. Why is that? Usually that's a firmware bug, and smart tools generally wouldn't indicate that. Sometimes we can work around firmware bugs in the driver, but we can't efficiently guess as to what the problem might be without hardware in hand.
Comment 27 Rev 2024-03-14 10:10:30 UTC
Thank you, I can understand that. I probably should have watched my system better, when the bug was not or very seldom happening and noted the kernel versions. But as typical human behaviour, when an error is not bugging you, you are happy and do nothing. Now I can remember the issue getting worse on 6.7 mainline series and that brings back the memories of "I had that before...".

Installed 6.8(.0) today - no cold boot necessary yet, but error logs like

nvme nvme0: Abort status: 0x0

and

nvme nvme0: I/O tag 512 (f200) opcode 0x1 (I/O Cmd) QID 3 timeout, aborting req_op:WRITE(1) size:4096

are still there.
Comment 28 Keith Busch 2024-03-14 16:44:42 UTC
Do the occurrences become less frequent if you revert to an older kernel right now? I just want to rule out the possibility that your drive is just wearing out and starting to struggle.
Comment 29 Rev 2024-03-15 08:45:58 UTC
TBW (terabytes written) and drive age are at 1/10 of the manufacturer spec.

Since I don't remember the last kernel that worked better, or at least without I/O crashes, I am open to suggestions as to which kernel to try.
Comment 30 Rev 2024-03-16 22:50:34 UTC
now the "nvme" filter changed to the following errors in 6.8.1:

(this is new:)
nvme nvme0: Ignoring bogus Namespace Identifiers
nvme nvme0: 7/0/0 default/read/poll queues

(this is old/known:)
nvme nvme0: Abort status: 0x0
nvme nvme0: I/O tag 987 (13db) opcode 0x1 (I/O Cmd) QID 4 timeout, aborting req_op:WRITE(1) size:8192


At least the last two errors repeat steadily, averaging about 50 per hour, but there has been no I/O crash yet. (QID varies from 1 to 7, and size from 4096 to 12288.)
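
To track how often these timeouts occur and on which queues, the newer message format can be parsed from saved logs. A minimal sketch, assuming log lines shaped exactly like the one quoted above; the regular expression and the function name `parse_timeout` are illustrative, and the older format ("I/O 0 (I/O Cmd) QID 1 timeout, aborting") would need its own pattern:

```python
import re

TIMEOUT_RE = re.compile(
    r"I/O tag (?P<tag>\d+) \((?P<cid>[0-9a-f]+)\) opcode (?P<opcode>0x[0-9a-f]+) "
    r"\(I/O Cmd\) QID (?P<qid>\d+) timeout, aborting "
    r"req_op:(?P<req_op>\w+)\(\d+\) size:(?P<size>\d+)"
)

def parse_timeout(line: str):
    """Return the fields of an NVMe I/O timeout message, or None if it does not match."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    return {
        "tag": int(m["tag"]),
        "command_id": int(m["cid"], 16),  # hexadecimal value printed in parentheses
        "opcode": int(m["opcode"], 16),
        "qid": int(m["qid"]),
        "req_op": m["req_op"],
        "size": int(m["size"]),
    }

line = ("nvme nvme0: I/O tag 987 (13db) opcode 0x1 (I/O Cmd) QID 4 timeout, "
        "aborting req_op:WRITE(1) size:8192")
print(parse_timeout(line))
```

In both timeout lines quoted in this thread, the low 12 bits of the hexadecimal value equal the decimal tag (0x13db & 0xfff is 987, and 0xf200 & 0xfff is 512), which is a quick sanity check on the parse.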

By the way, since there are a lot of different NVME SSD brands and types with this error found, how can one get rid of it?
Comment 31 Andreas 2024-03-17 17:50:09 UTC
I have two M.2 SSDs from Crucial:

nvme0: CT4000P3PSSD8 (4 TB), firmware P9CR40A
nvme1: CT2000P1SSD8 (2 TB), firmware P3CR021

I don't have any issues. I recently bought the bigger SSD and moved Linux from the other one, swapping their places. 

I see this during Kernel 6.8.1 boot and in dmesg:

[    6.641233] nvme 0000:02:00.0: platform quirk: setting simple suspend
[    6.641249] nvme 0000:05:00.0: platform quirk: setting simple suspend
[    6.641327] nvme nvme0: pci function 0000:02:00.0
[    6.641331] nvme nvme1: pci function 0000:05:00.0
[    6.647140] nvme nvme0: missing or invalid SUBNQN field.
[    6.672906] nvme nvme1: 15/0/0 default/read/poll queues
[    6.673090] nvme nvme0: allocated 32 MiB host memory buffer.
[    6.681449]  nvme1n1: p1 p2 p3 p4
[    6.721320] nvme nvme0: 8/0/0 default/read/poll queues
[    6.725566] nvme nvme0: Ignoring bogus Namespace Identifiers
[    6.727874]  nvme0n1: p1 p2 p3 p4 p5

This begs the question what SUBNQN field and Namespace Identifiers are and why they aren't present or have to be ignored: the CT2000P1SSD8 is a Crucial P1, it's the older one. The newer CT4000P3PSSD8 Crucial P3 Plus has the "missing or invalid SUBNQN field" and the "Ignoring bogus Namespace Identifiers" issue.

Obviously I have no idea what this means, but I'm guessing that it is unrelated to the I/O timeout issues, since I don't get any of those.
Comment 32 Keith Busch 2024-03-18 04:13:50 UTC
(In reply to Andreas from comment #31)
> This begs the question what SUBNQN field and Namespace Identifiers are and
> why they aren't present or have to be ignored: the CT2000P1SSD8 is a Crucial
> P1, it's the older one. The newer CT4000P3PSSD8 Crucial P3 Plus has the
> "missing or invalid SUBNQN field" and the "Ignoring bogus Namespace
> Identifiers" issue.

These duplicate and invalid naming messages don't mean anything unless your environment is an enterprise server with multipath, or in some cases, virtualization. No one else should mind if your manufacturer didn't assemble unique IDs. That used to harm things if you put two such devices in the same machine because of how udev works, but less so anymore.

Technically, there's still a data integrity danger with udev enumeration order, but not one anyone in real life seems to care about. Linus Torvalds in fact requested we remove the safety check.
Comment 33 Keith Busch 2024-03-18 14:37:54 UTC
(In reply to Rev from comment #30)
> By the way, since there are a lot of different NVME SSD brands and types
> with this error found, how can one get rid of it?

The error is that the drive is failing to produce a response to a command within a very generous time. We wait 30 seconds, which should be an eternity for NAND media. In my experience seeing many of these occurrences, they are always some broken firmware condition, and I have to harass the vendor for a proper fix.
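
For anyone who wants to confirm which timeout their system is actually using, here is a small sketch that reads it from sysfs. It assumes the `nvme_core` module exposes the value, in seconds, as the `io_timeout` parameter at the path below; the helper name `read_io_timeout` is made up for illustration, and the function returns `None` when the file is missing or unreadable.

```python
from pathlib import Path

# Assumption: nvme_core exposes its I/O timeout (in seconds) at this path.
IO_TIMEOUT_PARAM = Path("/sys/module/nvme_core/parameters/io_timeout")


def read_io_timeout(path=IO_TIMEOUT_PARAM):
    """Return the timeout in seconds, or None if the file is absent or unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


print(read_io_timeout())
```

Raising the value would only hide a drive that is not answering; it is shown here for inspection, not as a fix.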
Comment 34 Rev 2024-03-19 12:42:58 UTC
(In reply to Keith Busch from comment #33)
> (In reply to Rev from comment #30)
> > By the way, since there are a lot of different NVME SSD brands and types
> > with this error found, how can one get rid of it?
> 
> The error is that the drive is failing to produce a response to a command
> within a very generous time. We wait 30 seconds, which should be an eternity
> for NAND media. In my experience seeing many of these occurrences, they are
> always some broken firmware condition, and I have to harass the vendor for a
> proper fix.

Since it is a sadly very, very common response from hardware vendors to pack their product in a black box (in this case, check for vendors who give customers the option to do a firmware update in a non Windows environment - is that number zero or non zero?) and leave open source communities hanging, my opinion would be to find a consolidation in the open source communities to concentrate on their strength to find a workaround.

I know this is an opinion from someone not involved in the process, which I can only guess is a very difficult one. But if this is a common problem, and judging from the responses around the web it seems to be, it might be the only chance to solve the issue.
Comment 35 Keith Busch 2024-03-19 14:26:11 UTC
(In reply to Rev from comment #34)
> Since it is a sadly very, very common response from hardware vendors to pack
> their product in a black box (in this case, check for vendors who give
> customers the option to do a firmware update in a non Windows environment -
> is that number zero or non zero?) and leave open source communities hanging,
> my opinion would be to find a consolidation in the open source communities
> to concentrate on their strength to find a workaround.

Not sure I follow what you have in mind here. There's no one size fits all workaround for this.

If you want a simple way to change the workload the device receives, which may reduce the number of timeouts you observe, try using the kyber I/O scheduler. If you're already using that, try mq-deadline instead.

Or try a different filesystem. Using ext4? Try xfs. Using xfs? Try btrfs. Using btrfs? Try ext4.

And ensure partitions are aligned on 64k boundaries.

Are discards disabled? If so, turn them on.

Is your drive capacity utilization very high? Try deleting a bunch of stuff and keep your utilization below 80%.

Do you have any power saving features enabled? Try turning them off.
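
Some of the checks above can be scripted. The sketch below is illustrative only, and all three helper names are made up. It assumes the scheduler line has the usual sysfs form with the active entry in square brackets (as read from `/sys/block/<device>/queue/scheduler`), and that partition start offsets are given in 512-byte sectors (as in `/sys/block/<device>/<partition>/start`).

```python
import shutil


def active_scheduler(sysfs_text):
    """Return the bracketed entry from a line such as '[none] mq-deadline kyber'."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def is_aligned_64k(start_sector, sector_size=512):
    """True if the partition's starting byte offset is a multiple of 64 KiB."""
    return (start_sector * sector_size) % (64 * 1024) == 0


def utilization(mount_point):
    """Fraction of the filesystem at mount_point that is in use."""
    usage = shutil.disk_usage(mount_point)
    return usage.used / usage.total


print(active_scheduler("[none] mq-deadline kyber"))
print(is_aligned_64k(2048), is_aligned_64k(63))
print(f"{utilization('/'):.0%} used")
```

A partition starting at sector 2048 (1 MiB) passes the alignment check, while the legacy start at sector 63 does not. A utilization above 0.8 corresponds to the 80% threshold mentioned above.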
Comment 36 Nazar Mokrynskyi 2024-11-12 05:06:17 UTC
I also have this on two systems, both running Ubuntu 24.04.1.

First system:
ASUS Pro WS TRX50-SAGE WIFI + AMD Threadripper 7970X
Ubuntu 24.04 Desktop + kernel 6.11.x from Xanmod
SSDs that have this issue: Solidigm P44 Pro 2TB and Samsung 990 Pro 4TB

Second system:
Gigabyte MZ32-AR0 (rev 1.0) + AMD Epyc 7302P
Ubuntu 24.04 Server + kernel 6.8-generic from stock repositories
SSDs that have this issue: SkHynix P41 Platinum 2TB (almost identical to P44 Pro above with different firmware) and Samsung 990 Pro 4TB (different unit from first system)

I tried nvme_core.default_ps_max_latency_us=0 separately and with pcie_aspm=off pcie_port_pm=off.
It may be a coincidence, but with all three it survives a bit longer than without, though it still reliably crashes when running `btrfs scrub start /`.
Attaching logs from second system that started with `nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off`.
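
When testing boot options like these, it helps to confirm that they actually reached the running kernel. This is a minimal sketch that checks a command-line string (for example the contents of `/proc/cmdline`) for the three settings named above. The helper name `missing_params` and the sample command line are made up for illustration.

```python
# The three settings from this comment and the values they were booted with.
WANTED = {
    "nvme_core.default_ps_max_latency_us": "0",
    "pcie_aspm": "off",
    "pcie_port_pm": "off",
}


def missing_params(cmdline, wanted=WANTED):
    """Return the wanted parameters that are absent or set to another value."""
    seen = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep:
            seen[key] = value  # a later duplicate overrides an earlier one here
    return sorted(key for key, value in wanted.items() if seen.get(key) != value)


# Hypothetical command line, not taken from the attached log.
sample_cmdline = (
    "BOOT_IMAGE=/vmlinuz root=/dev/nvme0n1p2 ro "
    "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
)
print(missing_params(sample_cmdline))
```

On a live system the input would be `open("/proc/cmdline").read()`; an empty result means all three settings are present with the expected values.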

Things I have tried so far without any luck:
* putting SSD into different PCIe slot (both native M.2 on the motherboard and through adapter into regular PCIe slot)
* forcing PCIe 3.0 speed for these PCIe 4.0 SSDs
* various BIOS options (like enabling/disabling AER, ACS, some others)

All SSDs are running the latest available firmware. The only SSD I didn't have this issue with was an ADATA XPG SP8100NP-2TT-C 2TB (not the best SSD performance-wise).

I will attach the full kernel log with `nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off`. I am open to experiments, because both systems freeze, sometimes multiple times a day, which makes them barely usable.

I can compile and test a kernel or boot with custom options if needed; anything that can fix this issue is welcome (it has been going on for multiple months, and I upgraded to the 990 Pro primarily to deal with this problem).
Comment 37 Nazar Mokrynskyi 2024-11-12 05:07:04 UTC
Created attachment 307207 [details]
Samsung 990 Pro 4TB disconnects under load with `nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off`