Attempting to set up an NVMe device (nvme0n1) as the main system drive, copying the existing system onto it with the following command:

rsync -axvH /. --exclude=/lost+found --exclude=/var/log.bu --exclude=/usr/var/log.bu --exclude=/usr/X11R6/var/log.bu --exclude=/home/jhart/.cache/mozilla/firefox/are7uokl.default-release/cache2.bu --exclude=/home/jhart/.cache/thunderbird/7zsnqnss.default/cache2.bu /mnt/root_new 2>&1 | tee root.log

The I/O quickly hangs and the drive cannot be unmounted without a reboot. dmesg reports the following:

[Dec14 19:24] nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting
[Dec14 19:25] nvme nvme0: I/O 0 QID 1 timeout, reset controller
[ +30.719985] nvme nvme0: I/O 8 QID 0 timeout, reset controller
[Dec14 19:28] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
[  +0.031803] nvme nvme0: Abort status: 0x371
[Dec14 19:30] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
[  +0.000019] nvme nvme0: Removing after probe failure status: -19

Linux DellXPS 6.1.0 #1 SMP Tue Dec 13 21:48:51 JST 2022 x86_64 GNU/Linux
Custom distribution built entirely from source.
Created attachment 303406 [details] nvme drive and controller specs
Did this work on an earlier kernel version? If yes, which was the latest version that worked?
I have just recently started trying to use NVME. I originally tried with linux-57 and had similar failures, so I upgraded to 6.1 and still had them. The report I give here is for linux kernel 6.1.
that should read linux 5.7. My apologies for the finger stutter.
To clarify: I have not tried NVME with anything earlier than linux-5.7
Thanks for clarifying. In that case it's not a regression I should track; it's something for the regular developers. If you don't get a reply from them here, report this to the appropriate maintainers and subsystem list mentioned in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/MAINTAINERS
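For finding the right addresses, the kernel tree itself has a helper script. A sketch, assuming you have a kernel source tree checked out (the script and the driver path below are from the mainline tree):

```shell
# From the top of a kernel source tree: list the maintainers and
# mailing lists responsible for the NVMe PCIe host driver.
./scripts/get_maintainer.pl -f drivers/nvme/host/pci.c
```

This typically prints the NVMe maintainers plus the linux-nvme mailing list, which is where reports like this one should go.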
oct 06 20:46:04 debian kernel: nvme nvme0: I/O 648 (I/O Cmd) QID 4 timeout, aborting
oct 06 20:47:23 debian kernel: nvme nvme0: I/O 0 (I/O Cmd) QID 6 timeout, aborting
oct 06 20:47:23 debian kernel: nvme nvme0: I/O 681 QID 3 timeout, reset controller
oct 06 20:47:23 debian kernel: nvme nvme0: I/O 8 QID 0 timeout, reset controller
oct 06 20:47:23 debian kernel: nvme nvme0: Abort status: 0x371
oct 06 20:47:23 debian kernel: nvme nvme0: Abort status: 0x371
oct 06 20:47:23 debian kernel: nvme nvme0: Abort status: 0x371
oct 06 20:47:23 debian kernel: nvme nvme0: Abort status: 0x371
oct 06 20:47:23 debian kernel: nvme nvme0: Abort status: 0x371
oct 06 20:47:23 debian kernel: nvme nvme0: 15/0/0 default/read/poll queues
oct 06 20:47:23 debian kernel: nvme nvme0: Ignoring bogus Namespace Identifiers

I got this error today. I powered on my machine (cold boot), and after logging in to my Xfce session everything except the cursor froze; nothing responded (not even the clock). I could switch to a TTY and press Ctrl+Alt+Del from there. It rebooted after a minute or so.
@Nelson G, Comment 7

Which version of Debian are you running? Which kernel version exactly? On the shell, can you share the (selective) output of the following commands?

# cat /etc/debian_version
# cat /etc/*release
# uname -a
# hostnamectl

Also, if installed, (selective) output about the system would be nice:

# dmidecode -t system

You might want to _not_ paste the serial numbers and UUIDs, hence "selective".
Oops, sorry. Here:

debian 12.1
linux 6.1.52-1 x86-64
Thinkpad e495

# dmidecode 3.4
Getting SMBIOS data from sysfs.
SMBIOS 3.1.1 present.

Handle 0x000F, DMI type 1, 27 bytes
System Information
	Manufacturer: LENOVO
	Product Name: 20NECTO1WW
	Version: ThinkPad E495
	Serial Number: blank
	UUID: blank
	Wake-up Type: Power Switch
	SKU Number: LENOVO_MT_20NE_BU_Think_FM_ThinkPad E495
	Family: ThinkPad E495

Handle 0x0036, DMI type 15, 31 bytes
System Event Log
	Area Length: 1346 bytes
	Header Start Offset: 0x0000
	Header Length: 16 bytes
	Data Start Offset: 0x0010
	Access Method: General-purpose non-volatile data functions
	Access Address: 0x00F0
	Status: Valid, Not Full
	Change Token: 0x00000053
	Header Format: Type 1
	Supported Log Type Descriptors: 4
	Descriptor 1: POST error
	Data Format 1: POST results bitmap
	Descriptor 2: PCI system error
	Data Format 2: None
	Descriptor 3: System reconfigured
	Data Format 3: None
	Descriptor 4: Log area reset/cleared
	Data Format 4: None
Can you say with certainty that the NVMe SSD itself isn't faulty, causing the errors? Did you check the SMART (and other) logs, e.g. with smartctl?

For reference, my system:

Hardware specs:
Product: LENOVO Legion 5 Pro 16ACH6H 82JQ
SKU Number: LENOVO_MT_82JQ_BU_idea_FM_Legion 5 Pro 16ACH6H
Family: Legion 5 Pro 16ACH6H
NVMe PCIe SSD (upgraded, not original): Crucial P1 CT2000P1SSD8

Software specs:
Gentoo Linux
Kernel 6.5.6-gentoo x86_64 (custom kernel, my own .config)
AMD Ryzen 7 5800H

E.g. this is from my system, without the errors you describe:

localhost ~ # smartctl -H /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.5.6-gentoo-L5P] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

localhost ~ # smartctl -l error /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.5.6-gentoo-L5P] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
Error Information (NVMe Log 0x01, 16 of 256 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc    LBA  NSID    VS
  0       1937     0  0x0014  0x4005      -      0     1     -

localhost ~ # nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning			: 0
temperature				: 44 °C (317 K)
available_spare				: 100%
available_spare_threshold		: 5%
percentage_used				: 3%
endurance group critical warning summary: 0
Data Units Read				: 273.156.408 (139,86 TB)
Data Units Written			: 69.561.559 (35,62 TB)
host_read_commands			: 9.941.155.688
host_write_commands			: 1.220.956.942
controller_busy_time			: 22.719
power_cycles				: 1.549
power_on_hours				: 8.511
unsafe_shutdowns			: 96
media_errors				: 0
num_err_log_entries			: 1.937
Warning Temperature Time		: 0
Critical Composite Temperature Time	: 0
Temperature Sensor 1			: 44 °C (317 K)
Thermal Management T1 Trans Count	: 17
Thermal Management T2 Trans Count	: 4
Thermal Management T1 Total Time	: 17593
Thermal Management T2 Total Time	: 9
Well, I don't mean my SSD is the culprit for being in a bad state, but its Linux support is as bad as my ThinkPad's BIOS OEM team's. They simply don't care about Linux when you ask for help/support about this bug, or worse, about a critical bug/implementation. They act as if the error is using Linux per se. Anyway:

$ smartctl -H /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-13-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

$ smartctl -l error /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-13-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

$ sudo nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning			: 0
temperature				: 38°C (311 Kelvin)
available_spare				: 100%
available_spare_threshold		: 10%
percentage_used				: 3%
endurance group critical warning summary: 0
Data Units Read				: 45.442.041 (23,27 TB)
Data Units Written			: 26.846.180 (13,75 TB)
host_read_commands			: 462.473.222
host_write_commands			: 320.013.095
controller_busy_time			: 25.241
power_cycles				: 7.142
power_on_hours				: 11.836
unsafe_shutdowns			: 415
media_errors				: 10
num_err_log_entries			: 10
Warning Temperature Time		: 189
Critical Composite Temperature Time	: 1 (very old record :))
Thermal Management T1 Trans Count	: 0
Thermal Management T2 Trans Count	: 0
Thermal Management T1 Total Time	: 0
Thermal Management T2 Total Time	: 0

Oh, by the way, I reported this warning some time ago: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=990510

[    1.390941] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0xffffc000 flags=0x0000]
[    1.390976] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0xffffc080 flags=0x0000]

I tend to get it constantly, as the very first boot message (even before being prompted for LUKS). The day I got the bug we're reporting here, I also got that one; just sharing in case anyone recognizes it.
I tend to think that bug fixes take time, especially if the developers don't experience the issues themselves, on their own hardware. E.g. bug #202665 got fixed more than a year after it was initially reported.

So anyway, you should definitely get the newest kernel and see if the issue is still there, because that's what the developers will ask you to do anyhow. Also, with the sources at your disposal, you can apply any patches they throw at you for testing.

The basics are (https://www.wikihow.com/Compile-the-Linux-Kernel):

1. Get the source. (Apply patches if/as necessary.)
2. Compile the kernel and the modules, then install them.
3. Boot that kernel. (Probably involves configuring a boot manager.)

How you would do all that depends on the distribution you use. For Debian, read this: https://www.debian.org/doc//manuals/debian-handbook/sect.kernel-compilation.pl.html
For Ubuntu, you'd read this: https://wiki.ubuntu.com/Kernel/BuildYourOwnKernel

I know, this seems like a lot of work, but it's the only way I know of...
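The three steps above can be sketched roughly like this (a minimal sketch, assuming an x86 system with the kernel source unpacked under /usr/src/linux; your distribution's documentation linked above is the authoritative recipe):

```shell
# Build and install a kernel from source.
cd /usr/src/linux          # the unpacked (and possibly patched) source tree

make olddefconfig          # start from the running kernel's config,
                           # accepting defaults for any new options
make -j"$(nproc)"          # compile the kernel image and modules

sudo make modules_install  # install modules under /lib/modules/<version>/
sudo make install          # install the kernel image; on most distros this
                           # also updates the bootloader entries

# Reboot and select the new kernel in the boot menu.
```

On Debian-family systems you would more commonly use `make bindeb-pkg` to get installable .deb packages instead of `make install`, but the overall flow is the same.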
Well, I don't think there's any indication that there's a kernel bug of any kind here. Devices that don't complete commands will exceed the kernel's tolerance for timeouts, and there's nothing the kernel can do about it.
Is there a way to tweak the timeout and adjust it to a higher value? As I understand it, the kernel resets the controller after the timeout is reached, but how certain is it that the command wouldn't have succeeded given more time?
Not sure if more time will help here. The 'iostat' output I suggested on the mailing list should indicate whether waiting longer will help or not. You can change the default with the kernel parameter 'nvme_core.io_timeout=<time-in-seconds>'. The default is 30 seconds.
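For reference, the value currently in effect can be read from sysfs at runtime (a sketch; the parameter name is the `nvme_core.io_timeout` mentioned above):

```shell
# Show the NVMe I/O timeout currently in effect, in seconds
# (default: 30).
cat /sys/module/nvme_core/parameters/io_timeout

# To experiment with a larger value, add this to the kernel
# command line and reboot:
#   nvme_core.io_timeout=60
```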
@Keith Busch, thank you very much. To get iostat, install sysstat. (On Debian: "sudo apt install sysstat") My system is working and doesn't have those issues, for reference: nvme0 contains my main Gentoo Linux system nvme1 is just mostly idle all the time localhost ~ # dmesg | grep nvme [ 7.269597] nvme 0000:02:00.0: platform quirk: setting simple suspend [ 7.269615] nvme 0000:05:00.0: platform quirk: setting simple suspend [ 7.269710] nvme nvme1: pci function 0000:05:00.0 [ 7.269713] nvme nvme0: pci function 0000:02:00.0 [ 7.279969] nvme nvme1: missing or invalid SUBNQN field. [ 7.280233] nvme nvme1: Shutdown timeout set to 8 seconds [ 7.297803] nvme nvme1: 16/0/0 default/read/poll queues [ 7.302096] nvme1n1: p1 p2 p3 p4 [ 7.303998] nvme nvme0: 15/0/0 default/read/poll queues [ 7.311543] nvme0n1: p1 p2 p3 p4 p5 p6 p7 p8 localhost ~ # iostat -x nvme0n1 Linux 6.5.7-gentoo-L5P (localhost.localdomain) 2023-10-14 _x86_64_ (16 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 0,65 0,00 0,36 0,05 0,00 98,94 Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util nvme0n1 21,81 684,75 0,01 0,07 1,31 31,40 11,16 366,46 0,15 1,37 4,81 32,84 11,96 12172,93 0,00 0,00 0,22 1017,38 0,21 0,44 0,08 1,28 localhost ~ # iostat -x nvme1n1 Linux 6.5.7-gentoo-L5P (localhost.localdomain) 2023-10-14 _x86_64_ (16 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 0,65 0,00 0,36 0,05 0,00 98,93 Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util nvme1n1 1,03 5,39 0,00 0,00 0,04 5,23 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,01 @Nelson G, could you also provide your iostat output, please? Also, try adding nvme_core.io_timeout=60 to your kernel commandline and see if that changes anything. 
Also, try what Keith Busch suggested on the mailing list:

----snap---
If the device is responding very slowly, suggestions I might have include:

a. Enable discards if you've disabled them
b. Disable discards if you've enabled them
c. If not already enabled, use an io scheduler, like mq-deadline or kyber
----snap---

Thanks.
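For suggestion (c), the scheduler can be inspected and changed through sysfs without rebooting. A sketch, assuming the affected namespace is nvme0n1:

```shell
# Show the available schedulers; the active one is in brackets
# (NVMe often defaults to [none]).
cat /sys/block/nvme0n1/queue/scheduler

# Switch to mq-deadline (or kyber) for the current boot only:
echo mq-deadline | sudo tee /sys/block/nvme0n1/queue/scheduler
```

To make the change persistent, a udev rule setting the `queue/scheduler` attribute is the usual approach; the sysfs write above is lost on reboot.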
Here: iostat with system defaults, output after both a warm boot and ~10 minutes of usage:

Linux 6.1.0-13-amd64 (debian)  13/10/23  _x86_64_  (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7,53    0,00    8,96    2,09    0,00   81,42

Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme0n1 605,22 18567,04 59,70 8,98 0,32 30,68 34,25 337,62 0,11 0,33 0,26 9,86 0,00 0,00 0,00 0,00 0,00 0,00 1,32 0,13 0,20 24,69

Linux 6.1.0-13-amd64 (debian)  13/10/23  _x86_64_  (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2,07    0,01    1,81    0,46    0,00   95,65

Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme0n1 132,05 4217,31 16,53 11,13 0,64 31,94 5,87 51,63 0,02 0,34 0,23 8,80 0,00 0,00 0,00 0,00 0,00 0,00 0,29 0,11 0,09 4,12

iostat WITH nvme_core.io_timeout=60, again after both a warm boot and ~10 minutes of usage:

Linux 6.1.0-13-amd64 (debian)  13/10/23  _x86_64_  (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7,21    0,00    7,36    1,69    0,00   83,74

Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme0n1 818,44 19837,90 44,75 5,18 0,62 24,24 22,52 213,35 0,02 0,08 0,27 9,47 0,00 0,00 0,00 0,00 0,00 0,00 0,86 0,13 0,51 21,85

Linux 6.1.0-13-amd64 (debian)  13/10/23  _x86_64_  (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1,73    0,00    1,06    0,17    0,00   97,04

Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme0n1 47,16 1470,21 5,82 10,98 0,70 31,17 3,21 37,63 0,02 0,76 0,09 11,71 0,05 4,11 0,00 0,00 0,29 78,79 0,23 0,13 0,03 1,51

Off topic: Yeah, I suppose you're right about bugs getting fixed once a developer experiences them.
To keep things short, I'll just say I've had enough bad experiences with Lenovo. And I've been told already that my issues are 'hard' to get "fixed" (they're fixed already by AMD, but not implemented in the BIOS) because, well, Microsoft Windows is the main OS, not Linux. So no Linux-related BIOS fixes for me: https://bugzilla.kernel.org/show_bug.cgi?id=216161

As for the SSD: it's cheap, the average temperatures are ridiculously high, and the speed is mediocre. But it gets the job done for a boot drive (system and config files live there to keep things faster than an HDD). My girlfriend's computer has a Samsung drive, and oh boy, that drive runs cooler and faster (same storage capacity, by the way, 256 GB, lol). So yeah, I still don't expect much of it. Like, no dmesg warnings. Or the bug we're reporting (which has only happened to me on very rare occasions). At least it rarely fails in scary ways like 'nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting'.

And thanks for the advice about the kernel update! I use both Debian and Arch, and the 'fix' for https://bugzilla.kernel.org/show_bug.cgi?id=202665 was supposed to be in 6.1 and beyond. Although, knowing that ADATA told users with that bug to use Windows, and that it seems to be vendor-related (not entirely, of course; maybe common to certain NVMe controllers), I prefer to just reboot once I see that warning.
You'd have to check, but maybe those error messages aren't really affecting your system at all. I suggest testing this by running your system for a very long time (more than 1-2 days), no reboots, and seeing if it remains stable. Especially with the Arch system, as it always uses the newest kernel. Also, if you're up to it, you could try switching (NVMe) SSDs and see if the fault is gone with another SSD.

@jfhart085@gmail.com, the original reporter: if you could also provide the requested info, this bug could be closed. Since you have a Kingston SSD, it could be different. Also, a Dell XPS isn't supposed to be a cheap system. Something like:

# smartctl -H /dev/nvme0n1
# smartctl -l error /dev/nvme0n1
# nvme smart-log /dev/nvme0n1
# iostat -x nvme0n1

@Nelson G, Off Topic: I am very happy with my Legion 5 Pro from Lenovo. It is running very well with Linux, with the exception that I have to boot into Windows for BIOS updates. Some people also reported issues with dual "switcheroo" graphics on Linux, but since I'm only using the AMD Radeon integrated graphics on Linux (I haven't even loaded the closed-source Nvidia drivers) and I specifically bought this laptop BECAUSE it has a dedicated Nvidia card -- for me only on Windows, only for gaming --, I didn't have those (now fixed, as I heard) Nvidia/AMD switcheroo issues. But that's the whole purpose of the Legion 5 Pro anyhow: gaming! Not Linux, but if you know what you're doing, it works very, very well.

Lenovo has some dedicated Linux laptops, like certain ThinkPad series. Even in the BIOS/UEFI settings, there's a "Windows" or "Linux" OS preset on them, so I guess it all depends on what you're looking for.

Anyway: have you reported your issues/findings to the Lenovo support forums? They have a guy there, MarkRHPearson, acting as a Linux liaison, and he speaks to the company (and the technicians, e.g. for BIOS updates) on behalf of us Linux users.
https://forums.lenovo.com/t5/Linux-Operating-Systems/ct-p/lx_en That said, it may well be that there's nothing that can be done. Maybe under Windows your SSD has the same issues, but Windows doesn't directly report them to you -- it may just be the specific SSD you're using...
The way to run iostat as needed for this issue is "iostat -x /dev/nvme0n1 1". Notice the "1" at the end, as in to repeat the stat once every second so that we can see what's changed at each time interval. If the device is hopelessly broken, I expect at some point the util will approach 100%, r/s and w/s to go to 0, and aqu-sz to maintain a relatively large non-zero value. If the device is simply unable to keep up with the workload you're demanding out of it but still responsive in a slower capacity, then 'r/s' and 'w/s' should be non-zero, but show an obvious drop from prior to observing timeouts.
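One way to capture that per-second output so the samples around a hang survive for posting later (a sketch, assuming sysstat is installed and the problem device is nvme0n1; the log path is just an example):

```shell
# Sample extended device statistics once per second, with a
# timestamp on each report (-t), and keep a copy in a log file.
iostat -x -t /dev/nvme0n1 1 | tee /var/tmp/nvme-iostat.log
```

Running this on a separate virtual console means the log keeps filling even while the desktop session is frozen, as long as the root filesystem still accepts writes.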
I no longer have the Kingston SSD. I replaced it with a Samsung unit. I was able to resolve the issue as follows: > On 12/19/22 11:41 PM, Keith Busch wrote: > >> >>> MaxPayload 128 bytes, MaxReadReq 512 bytes >>> DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ >>> AuxPwr+ TransPend- >>> LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, >>> Latency L0 <1us, L1 <8us >>> ClockPM+ Surprise- LLActRep- BwNot- >>> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- >>> Retrain- CommClk+ >>> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- >>> LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- >>> SlotClk+ DLActive- BWMgmt- ABWMgmt- > > >> Something seems off if it's downtraining to Gen1 x1. I believe this >> setup should be capable of Gen2 x4. It sounds like the links among these >> components may not be reliable. > > This was a useful comment, and led me on to some further research. The > system in question here (a rather older Dell) was equipped with the > following PCI slots: > > slot pins > ---- ---- > PCI - 6 124 available > PCI - 5 124 blocked by RX580 > PCIe x16 - 4 164 secondary graphics, RX580 > PCIe x8 - 3 98 NVME controller > PCIe x1 - 2 36 TV card > PCIe x16 - 1 164 primary graphics, available > > It also uses an Nvidia chipset for the PCI bridge. It turns out that > back in 2008 Dell had custom modifications made to a standard chip set > configuration normally supplied by Nvidia. This enabled the > motherboard to handle up to two graphics cards at the same time while > minimizing cost. The trade off apparently was that the two x16 slots > actually only have 8 data lanes each, and the x8 slot (slot 3) only > one lane. I had the NVME controller in the x8 slot, expecting the > controller to negotiate for 4 lanes, not realizing the single lane > restriction. I moved the controller to the unused x16 slot (slot 1) > since I have only a single card in the machine. It is now operating > as expected without further difficulty.
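For anyone wanting to check for the same kind of downtraining, the negotiated link can be compared against the card's capability directly. A sketch, assuming the SSD/controller sits at PCI address 01:00.0 (find yours with `lspci | grep -i nvme`):

```shell
# LnkCap is what the card supports; LnkSta is what was actually
# negotiated. A Gen3 x4 device showing "Speed 2.5GT/s, Width x1"
# in LnkSta is downtrained, as in the report above.
sudo lspci -s 01:00.0 -vv | grep -E 'LnkCap:|LnkSta:'
```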
Thank you so much, @jfhart085! @Nelson G, Maybe this helps you as well? Does your laptop have two different PCIe NVMe slots? If so, can you switch the SSDs? And please test your system for an extended period of time, run the iostat command like Keith Busch requested in Comment 19 (on a separate virtual console), then post its output when you notice the problem.
Mine only has one PCIe NVMe slot.

Alright, I will try to run iostat for an extended period of time. It's been running for at least 10 minutes now and there doesn't seem to be anything outstanding in the output; I will share it in a minute.

I am currently using the 'nvme_core.default_ps_max_latency_us=0' boot parameter because I sadly got this bug once again a few days ago.

journald:
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 83 (I/O Cmd) QID 2 timeout, aborting
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 453 (I/O Cmd) QID 8 timeout, aborting
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 454 (I/O Cmd) QID 8 timeout, aborting
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 898 (I/O Cmd) QID 1 timeout, aborting
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 899 (I/O Cmd) QID 1 timeout, aborting
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 0 QID 0 timeout, reset controller
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 83 QID 2 timeout, reset controller
oct 29 22:16:46 debian kernel: nvme nvme0: Abort status: 0x371
oct 29 22:16:46 debian kernel: nvme nvme0: Abort status: 0x371
oct 29 22:16:46 debian kernel: nvme nvme0: Abort status: 0x371
oct 29 22:16:46 debian kernel: nvme nvme0: Abort status: 0x371
oct 29 22:16:46 debian kernel: nvme nvme0: Abort status: 0x371
oct 29 22:16:46 debian kernel: nvme nvme0: 15/0/0 default/read/poll queues
oct 29 22:16:46 debian kernel: nvme nvme0: Ignoring bogus Namespace Identifiers

I had to reboot because I needed the system working correctly ASAP. But next time I will run iostat and dmesg and save the output (if possible, because everything becomes slow/unresponsive).

I was reading the Arch Linux forums and found that boot parameter described as 'useful' for keeping the system from hitting this bug at boot. I've only gotten the bug during the boot process, never after DAYS of the system running. I'd say I've reproduced this bug maybe 1 time out of 100 boots; I really don't know what triggers it.
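For anyone else wanting to try the same parameter: on Debian with GRUB it is typically set like this (a config fragment; the parameter value is the one from the comment above, and "quiet" is just a placeholder for whatever is already on your command line):

```shell
# /etc/default/grub (excerpt): append the parameter to the
# default kernel command line.
GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0"

# Then regenerate the GRUB config and reboot:
#   sudo update-grub
# Verify the parameter took effect after the reboot:
#   cat /proc/cmdline
```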
In a few hours I will run iostat again with the default boot parameters and a more exhaustive nvme activity.
Created attachment 305345 [details] iostat on nvme nvme_core.default_ps_max_latency_us=0 boot parameter being used, not so much happening on that ssd during that test.
With this last information, logically, there are two actions required:

#1 The devs need an iostat output from when the error occurs, not from when the system is running as it should.
#2 If the dmesg errors only occur right at boot time, but not at all while the system is running, this might be a rare case that could be caused and/or addressed by the kernel, and therefore could also be fixed in the code.

Issue #1 is for @Nelson G.
Issue #2 is for the devs, @Keith Busch.
This bug is still current. I get it on mainline 6.7(.9), and had it on previous kernels (so long ago I don't even remember which) on Ubuntu 22.04.4 LTS. Strangely, it was better on some of the previous kernel series. But now (with 6.7) I get system hangs with the same log as the initial bug reporter. And it escalates to system crashes, because no I/O is possible any more. The system just stops working, because the OS cannot access its boot drive (512 GB LVM2, ADATA SX6000LNP with the most recent firmware) any more. And of course, no logs because of that.

First I thought it must be my drive, but checks give the NVMe drive a clean slate. I can even trigger the bug by loading video files and jumping around in time (skipping) while watching. Sooner or later the system will hang with the OP's error. And later the system will lose I/O to the boot drive, and I have to manually power off the hardware and cold boot it (no shutdown/reboot possible, no other working environment possible since no I/O is possible). That makes my beloved Linux less stable than my Windows PC (puke).
We're not doing anything outside spec compliance here. The device just doesn't respond. Why is that? Usually it's a firmware bug, and SMART tools generally wouldn't indicate that. Sometimes we can work around firmware bugs in the driver, but we can't efficiently guess at what the problem might be without the hardware in hand.
Thank you, I can understand that. I probably should have watched my system more closely when the bug was not (or only very seldom) happening, and noted the kernel versions. But, typical human behaviour: when an error is not bugging you, you are happy and do nothing. Now I remember the issue getting worse in the 6.7 mainline series, and that brings back the memories of "I had that before...".

Installed 6.8(.0) today. No cold boot necessary yet, but error logs like

nvme nvme0: Abort status: 0x0
nvme nvme0: I/O tag 512 (f200) opcode 0x1 (I/O Cmd) QID 3 timeout, aborting req_op:WRITE(1) size:4096

are still there.
Do the occurrences become less frequent if you revert to an older kernel right now? I just want to rule out the possibility that your drive is just wearing out and starting to struggle.
TBW (terabytes written) and drive age are at 1/10 of the manufacturer spec. Since I don't remember the last kernel that worked better, or at least without I/O crashes, I am open to suggestions as to which kernel to choose.
Now the "nvme" filter shows the following errors on 6.8.1.

This is new:
nvme nvme0: Ignoring bogus Namespace Identifiers
nvme nvme0: 7/0/0 default/read/poll queues

This is old/known:
nvme nvme0: Abort status: 0x0
nvme nvme0: I/O tag 987 (13db) opcode 0x1 (I/O Cmd) QID 4 timeout, aborting req_op:WRITE(1) size:8192

At least the last two errors repeat steadily, around 50/hour on average, but no I/O crash yet. (QID varies from 1 to 7, as does size from 4096 to 12288.)

By the way, since there are a lot of different NVMe SSD brands and types showing this error, how can one get rid of it?
I have two M.2 SSDs from Crucial:

nvme0: CT4000P3PSSD8 (4 TB), firmware P9CR40A
nvme1: CT2000P1SSD8 (2 TB), firmware P3CR021

I don't have any issues. I recently bought the bigger SSD and moved Linux over from the other one, swapping their places. I see this during kernel 6.8.1 boot and in dmesg:

[    6.641233] nvme 0000:02:00.0: platform quirk: setting simple suspend
[    6.641249] nvme 0000:05:00.0: platform quirk: setting simple suspend
[    6.641327] nvme nvme0: pci function 0000:02:00.0
[    6.641331] nvme nvme1: pci function 0000:05:00.0
[    6.647140] nvme nvme0: missing or invalid SUBNQN field.
[    6.672906] nvme nvme1: 15/0/0 default/read/poll queues
[    6.673090] nvme nvme0: allocated 32 MiB host memory buffer.
[    6.681449] nvme1n1: p1 p2 p3 p4
[    6.721320] nvme nvme0: 8/0/0 default/read/poll queues
[    6.725566] nvme nvme0: Ignoring bogus Namespace Identifiers
[    6.727874] nvme0n1: p1 p2 p3 p4 p5

This begs the question what the SUBNQN field and Namespace Identifiers are, and why they aren't present or have to be ignored: the CT2000P1SSD8 is a Crucial P1; it's the older one. The newer CT4000P3PSSD8 Crucial P3 Plus has the "missing or invalid SUBNQN field" and the "Ignoring bogus Namespace Identifiers" issue. Obviously I have no idea what this means, but I'm guessing it is unrelated to the I/O timeout issues, since I don't get any of those.
(In reply to Andreas from comment #31)
> This begs the question what SUBNQN field and Namespace Identifiers are and
> why they aren't present or have to be ignored: the CT2000P1SSD8 is a Crucial
> P1, it's the older one. The newer CT4000P3PSSD8 Crucial P3 Plus has the
> "missing or invalid SUBNQN field" and the "Ignoring bogus Namespace
> Identifiers" issue.

These duplicate and invalid naming messages don't mean anything unless your environment is an enterprise server with multipath, or in some cases, virtualization. No one else should mind if your manufacturer didn't assign unique IDs. That used to harm things if you put two such devices in the same machine because of how udev works, but less so anymore. Technically, there's still a data integrity danger with udev enumeration order, but not one anyone in real life seems to care about. Linus Torvalds in fact requested we remove the safety check.
(In reply to Rev from comment #30) > By the way, since there are a lot of different NVME SSD brands and types > with this error found, how can one get rid of it? The error is that the drive is failing to produce a response to a command within a very generous time. We wait 30 seconds, which should be an eternity for NAND media. In my experience seeing many of these occurrences, they are always some broken firmware condition, and I have to harass the vendor for a proper fix.
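When chasing a vendor for a firmware fix, it helps to quote the exact model and firmware revision the drive reports. A sketch using nvme-cli (the controller node /dev/nvme0 is an example):

```shell
# Model name (mn) and firmware revision (fr) as reported by
# the controller's Identify data.
sudo nvme id-ctrl /dev/nvme0 | grep -E '^(mn|fr) '

# Some vendors publish SSD firmware updates for Linux via the
# LVFS/fwupd ecosystem; whether your drive is covered varies:
#   fwupdmgr get-devices
#   fwupdmgr update
```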
(In reply to Keith Busch from comment #33)
> (In reply to Rev from comment #30)
> > By the way, since there are a lot of different NVME SSD brands and types
> > with this error found, how can one get rid of it?
> 
> The error is that the drive is failing to produce a response to a command
> within a very generous time. We wait 30 seconds, which should be an eternity
> for NAND media. In my experience seeing many of these occurrences, they are
> always some broken firmware condition, and I have to harass the vendor for a
> proper fix.

Since it is a sadly very, very common response from hardware vendors to pack their product into a black box (in this case, check how many vendors give customers the option to do a firmware update in a non-Windows environment: is that number zero or non-zero?) and leave open source communities hanging, my opinion is that the open source communities should consolidate and concentrate their strength on finding a workaround.

I know, this is an opinion from someone not involved in the process. A very difficult process, I can only guess. But if this is a common problem, and from the responses around the web it seems like it is, this might be the only chance to solve the issue.
(In reply to Rev from comment #34)
> Since it is a sadly very, very common response from hardware vendors to pack
> their product in a black box (in this case, check for vendors who give
> customers the option to do a firmware update in a non Windows environment -
> is that number zero or non zero?) and leave open source communities hanging,
> my opinion would be to find a consolidation in the open source communities
> to concentrate on their strength to find a workaround.

Not sure I follow what you have in mind here. There's no one-size-fits-all workaround for this. If you want a simple way to change the workload the device receives, which may reduce the number of timeout observations:

- Try the kyber io scheduler. If you're already using that, try mq-deadline instead.
- Try a different filesystem. Using ext4? Try xfs. Using xfs? Try btrfs. Using btrfs? Try ext4. And ensure partitions are aligned on 64k boundaries.
- Are discards disabled? If so, turn them on.
- Is your drive capacity utilization very high? Try deleting a bunch of stuff and keep your utilization below 80%.
- Do you have any power saving features enabled? Try turning them off.
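For the discard suggestion, the current state can be inspected like this (a sketch for nvme0n1; whether periodic or online discard is in use depends on the distribution):

```shell
# A non-zero value means the device and block stack support discard.
cat /sys/block/nvme0n1/queue/discard_max_bytes

# Most distributions use periodic TRIM via a systemd timer:
systemctl status fstrim.timer

# Online discard would show up as a 'discard' mount option on the
# filesystem, e.g. for the root filesystem:
grep ' / ' /proc/mounts
```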
I also have this on two systems, both running Ubuntu 24.04.1. First system: ASUS Pro WS TRX50-SAGE WIFI + AMD Threadripper 7970X Ubuntu 24.04 Desktop + kernel 6.11.x from Xanmod SSDs that have this issue: Solidigm P44 Pro 2TB and Samsung 990 Pro 4TB Second system: Gigabyte MZ32-AR0 (rev 1.0) + AMD Epyc 7302P Ubuntu 24.04 Server + kernel 6.8-generic from stock repositories SSDs that have this issue: SkHynix P41 Platinum 2TB (almost identical to P44 Pro above with different firmware) and Samsung 990 Pro 4TB (different unit from first system) I tried nvme_core.default_ps_max_latency_us=0 separately and with pcie_aspm=off pcie_port_pm=off. Maybe coincidence, but with all 3 it survives a bit longer than without, but still reliably crashes when running `btrfs scrub start /`. Attaching logs from second system that started with `nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off`. Things I have tried so far without any luck: * putting SSD into different PCIe slot (both native M.2 on the motherboard and through adapter into regular PCIe slot) * forcing PCIe 3.0 speed for these PCIe 4.0 SSDs * various BIOS options (like enabling/disabling AER, ACS, some others) All SSDs are with latest available firmware. The SSD I didn't have this issue with was ADATA XPG SP8100NP-2TT-C 2TB (not the best SSD performance-wise). Will attach full kernel log with `nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off`, open for experiments because both of the systems are freezing sometimes multiple times a day, making it barely usable. Can compile and test the kernel or boot it with custom options if needed, anything that can fix this issue (it has been multiple months and I upgraded to 990 Pro primarily to deal with this problem).
Created attachment 307207 [details] Samsung 990 Pro 4TB disconnects under load with `nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off`
(In reply to Rev from comment #34)
> Since it is a sadly very, very common response from hardware vendors to pack
> their product in a black box (in this case, check for vendors who give
> customers the option to do a firmware update in a non Windows environment -
> is that number zero or non zero?) and leave open source communities hanging,
> my opinion would be to find a consolidation in the open source communities
> to concentrate on their strength to find a workaround.
>
> I know, this is an opinion from someone not involved in the process. A very
> difficult process, I can just guess. But if this is a common problem, and
> from the responses around the web for me it seems like it is, it might be
> the only chance to solve the issue.

I encountered this issue recently. I was previously on kernel 6.9.x; after the system had been up for about 3 months, one NVMe SSD suddenly became very slow, with continuous writes around 10MB/s and unstable throughput (speed drops to zero from time to time). I thought it might be due to the long uptime, so I rebooted. Before rebooting, I upgraded the kernel from 6.9.x to 6.12.x. After the reboot, this NVMe SSD was still in a strange working state (low and unstable speed, busy time around 100%), and the filesystem was shut down by the kernel half an hour after boot.
Log shows:

Dec 2 20:36:28 E5W kernel: nvme4n1: I/O Cmd(0x1) @ LBA 1789475760, 1024 blocks, I/O Error (sct 0x0 / sc 0x7)
Dec 2 20:36:28 E5W kernel: I/O error, dev nvme4n1, sector 1789475760 op 0x1:(WRITE) flags 0x104000 phys_seg 1 prio class 0
Dec 2 20:36:28 E5W kernel: nvme4n1p1: writeback error on inode 3228735725, offset 1614807040, sector 1789470640
Dec 2 20:36:28 E5W kernel: nvme nvme4: Abort status: 0x0
Dec 2 20:36:30 E5W kernel: nvme nvme4: I/O tag 64 (2040) opcode 0x1 (I/O Cmd) QID 6 timeout, aborting req_op:WRITE(1) size:9216
Dec 2 20:36:31 E5W kernel: nvme nvme4: I/O tag 183 (20b7) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:524288
Dec 2 20:36:31 E5W kernel: nvme nvme4: I/O tag 184 (20b8) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:524288
Dec 2 20:36:32 E5W kernel: nvme4n1: I/O Cmd(0x1) @ LBA 2001250948, 18 blocks, I/O Error (sct 0x0 / sc 0x7)
Dec 2 20:36:32 E5W kernel: I/O error, dev nvme4n1, sector 2001250948 op 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 0
Dec 2 20:36:32 E5W kernel: nvme nvme4: Abort status: 0x0
Dec 2 20:36:32 E5W kernel: I/O error, dev nvme4n1, sector 2001250948 op 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 0
Dec 2 20:36:32 E5W kernel: XFS (nvme4n1p1): log I/O error -5
Dec 2 20:36:32 E5W kernel: XFS (nvme4n1p1): Filesystem has been shut down due to log error (0x2).
Dec 2 20:36:32 E5W kernel: XFS (nvme4n1p1): Please unmount the filesystem and rectify the problem(s).

I've done a smartctl check; there are no obvious errors, and the drive can still be read with `dd`. I have no idea how to solve this issue currently, and plan to revert to kernel 5.14.0 to see if the issue resolves.
(In reply to iccfish from comment #38)
> Dec 2 20:36:28 E5W kernel: nvme4n1: I/O Cmd(0x1) @ LBA 1789475760, 1024 blocks, I/O Error (sct 0x0 / sc 0x7)
> Dec 2 20:36:28 E5W kernel: I/O error, dev nvme4n1, sector 1789475760 op 0x1:(WRITE) flags 0x104000 phys_seg 1 prio class 0
> Dec 2 20:36:28 E5W kernel: nvme4n1p1: writeback error on inode 3228735725, offset 1614807040, sector 1789470640
> Dec 2 20:36:28 E5W kernel: nvme nvme4: Abort status: 0x0
> Dec 2 20:36:30 E5W kernel: nvme nvme4: I/O tag 64 (2040) opcode 0x1 (I/O Cmd) QID 6 timeout, aborting req_op:WRITE(1) size:9216
> Dec 2 20:36:31 E5W kernel: nvme nvme4: I/O tag 183 (20b7) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:524288
> Dec 2 20:36:31 E5W kernel: nvme nvme4: I/O tag 184 (20b8) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:524288
> Dec 2 20:36:32 E5W kernel: nvme4n1: I/O Cmd(0x1) @ LBA 2001250948, 18 blocks, I/O Error (sct 0x0 / sc 0x7)
> Dec 2 20:36:32 E5W kernel: I/O error, dev nvme4n1, sector 2001250948 op 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 0
> Dec 2 20:36:32 E5W kernel: nvme nvme4: Abort status: 0x0
> Dec 2 20:36:32 E5W kernel: I/O error, dev nvme4n1, sector 2001250948 op 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 0
> Dec 2 20:36:32 E5W kernel: XFS (nvme4n1p1): log I/O error -5
> Dec 2 20:36:32 E5W kernel: XFS (nvme4n1p1): Filesystem has been shut down due to log error (0x2).
> Dec 2 20:36:32 E5W kernel: XFS (nvme4n1p1): Please unmount the filesystem and rectify the problem(s).
>
> I've done smartctl check, there's no obvious errors. using `dd` command can
> read from this nvme ssd. I have no idea how to solve this issue currently,
> and plan to revert back to kernel 5.14.0 to see if issue resolved.

Here the story begins...
Dec 2 20:29:02 E5W kernel: XFS (nvme4n1p1): Mounting V5 Filesystem 17300a9d-05ca-418f-ade0-f185d8816118
Dec 2 20:29:02 E5W kernel: XFS (nvme4n1p1): Ending clean mount
Dec 2 20:33:14 E5W kernel: nvme nvme4: I/O tag 618 (126a) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:524288
Dec 2 20:33:14 E5W kernel: nvme nvme4: I/O tag 619 (126b) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:524288
Dec 2 20:33:14 E5W kernel: nvme nvme4: I/O tag 620 (126c) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:524288
Dec 2 20:33:20 E5W kernel: nvme nvme4: Abort status: 0x0
Dec 2 20:33:23 E5W kernel: nvme nvme4: Abort status: 0x0
Dec 2 20:33:26 E5W kernel: nvme nvme4: Abort status: 0x0
Dec 2 20:33:50 E5W kernel: nvme nvme4: I/O tag 47 (302f) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:524288
Dec 2 20:33:50 E5W kernel: nvme nvme4: I/O tag 48 (3030) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:524288
Dec 2 20:33:50 E5W kernel: nvme nvme4: I/O tag 49 (3031) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:524288
Dec 2 20:33:50 E5W kernel: nvme nvme4: Abort status: 0x0
Dec 2 20:33:51 E5W kernel: nvme nvme4: Abort status: 0x0
Dec 2 20:33:51 E5W kernel: nvme nvme4: Abort status: 0x0
Dec 2 20:33:58 E5W kernel: nvme nvme4: I/O tag 64 (d040) opcode 0x1 (I/O Cmd) QID 6 timeout, aborting req_op:WRITE(1) size:9216
Dec 2 20:33:58 E5W kernel: nvme nvme4: Abort status: 0x0
Dec 2 20:34:00 E5W kernel: nvme nvme4: I/O tag 512 (e200) opcode 0x1 (I/O Cmd) QID 2 timeout, aborting req_op:WRITE(1) size:30208
Dec 2 20:34:00 E5W kernel: nvme nvme4: I/O tag 65 (5041) opcode 0x1 (I/O Cmd) QID 6 timeout, aborting req_op:WRITE(1) size:4096
Dec 2 20:34:00 E5W kernel: nvme nvme4: I/O tag 96 (3060) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:524288
Dec 2 20:34:01 E5W kernel: nvme nvme4: Abort status: 0x0
Dec 2 20:34:01 E5W kernel: nvme nvme4: Abort status: 0x0
Dec 2 20:34:01 E5W kernel: nvme nvme4: Abort status: 0x0
I wonder if you have a high media utilization that's straining the SSD's background tasks. Not all devices support reporting utilization, but many do.

What you can try is "sudo nvme list" and compare the two numbers in the "Usage" column. If the usage is a large percentage of the total capacity, that may mean your SSD has trouble keeping up with new writes due to garbage collection overhead. If the numbers are the same, that might mean the device doesn't support reporting media utilization, which would make it difficult to determine whether that is happening.

You can also check "df -h" and see if your nvme mount points' "Used" is very close to the "Size". If the device supports reporting utilization, and that utilization is significantly higher than the filesystem's "Used", that would indicate your filesystem is not performing discards (aka "trim", "fstrim") as expected, and that can really harm the SSD's performance as you approach full capacity, especially if the device's over-provisioning is very small. I didn't see any discards in the iostat output.

If the device's utilization is very high and closely matches the filesystem's "Used" capacity, then you may have simply oversubscribed the media, and the controller can no longer keep up. What you'd have to do in that case is create an unused partition, maybe 10-20% of the capacity, never write to it, and make sure the partition's LBAs are discarded on the device ('sudo blkdiscard /dev/nvme0n1p2', or whatever your sequestered partition is if not p2). That would simulate a higher over-provisioning for the used partition, making it easier for the controller to do garbage collection.
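The steps above can be sketched as a short sequence. The sizing math runs anywhere; the device commands are left as comments because they need root and a real drive, and the 1863 GiB capacity and `p2` partition name are purely hypothetical examples.

```shell
# Sizing math only: reserve ~15% of a hypothetical 1863 GiB drive
# as a never-written partition to act as extra over-provisioning.
capacity_gib=1863
reserve_gib=$(( capacity_gib * 15 / 100 ))
echo "reserve partition size: ${reserve_gib} GiB"   # prints 279 GiB

# On the actual system (requires root and nvme-cli):
#   sudo nvme list            # compare "Usage" vs total capacity
#   df -h                     # filesystem view of the same space
#   sudo blkdiscard /dev/nvme0n1p2   # discard the reserve partition's
#                                    # LBAs once, then never write to it
```

The one-time `blkdiscard` matters because the controller only treats those LBAs as free space after they have been explicitly discarded; merely leaving a partition unformatted is not enough if it ever held data.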
Thanks for the suggestion, it was worth a shot, but I don't think that is the reason. I have a Samsung 990 Pro 4T, only 1.5T of which is used. The `nvme list` and `df -h` outputs closely match each other. This is one of the best consumer SSDs with a PCIe 4.0 interface available; it is very unlikely that oversubscription would make it stop working during reads at less than 40% utilization.
Thanks for checking. In that case it doesn't sound like you're hitting any sort of media utilization corner case. That's pretty much the only time I've seen timeouts where the conclusion was "yes, that's expected", so I'm still considering this to be a vendor-specific oddity.
It might be vendor-specific, but I've seen the same exact issue with two other SSDs and another user recently reported the same issue with yet another SSD (Crucial T705 2TB, PCIe 5.0) at https://forum.level1techs.com/t/nvme-ssd-disappears-disconnects-from-the-system/220275/6?u=nazar-pc So there is clearly some issue somewhere that is Linux-specific. Samsung refuses to offer any kind of support for 990 Pro saying that it works great on Windows, there are no known issues and they do not support Linux, so I'm on my own. Anything else I can check on my end to narrow things down further?
Chiming in that I'm seeing a similar issue with my NVMe drives.

Linux version 6.8.12-1-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-1 (2024-08-05T16:17Z)

Two Silicon Power 2TB UD90 NVMe 4.0 Gen4 PCIe drives, purchased at the same time, installed in two GLOTRENDS PA09-HS M.2 NVMe to PCIe 4.0 X4 adapters with M.2 heatsinks, in two separate PCIe slots. The motherboard is a H11SSL-i.

OpenZFS mirror, zfs-2.2.6-pve1. One of the drives (nvme1) drops out after a month or two, degrading the pool. The other one (nvme0) has NEVER done this. After a reboot, nvme1 comes back.

If I understand it right, the operation was a write which timed out, and since the NVMe was so unresponsive the kernel disabled the device. Well, I guess the mirror came in handy.

I have not tried a firmware update on the NVMe yet (issues with a reader), nor have I tried any of the other suggested kernel parameters. Hence, this seems like a failed/defective NVMe. I am going to keep it, swap in another identical one I got later, and see what happens.

Since nvme1 is offline, here is nvme0's SMART data:

```
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 30 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 3%
Data Units Read: 58,873,504 [30.1 TB]
Data Units Written: 218,987,666 [112 TB]
Host Read Commands: 787,905,522
Host Write Commands: 2,416,266,061
Controller Busy Time: 169,561
Power Cycles: 31
Power On Hours: 2,826
Unsafe Shutdowns: 18
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0

Error Information (NVMe Log 0x01, 8 of 8 entries)
No Errors Logged
```

```
[ 1531.339888] nvme nvme1: I/O tag 412 (219c) opcode 0x1 (I/O Cmd) QID 6 timeout, aborting req_op:WRITE(1) size:4096
[ 1531.344939] nvme nvme1: Abort status: 0x0
[ 1551.432807] nvme nvme1: I/O tag 413 (d19d) opcode 0x1 (I/O Cmd) QID 6 timeout, aborting req_op:WRITE(1) size:16384
[ 1551.437339] nvme nvme1: Abort status: 0x0
[ 1556.553531] nvme nvme1: I/O tag 415 (019f) opcode 0x1 (I/O Cmd) QID 6 timeout, aborting req_op:WRITE(1) size:16384
[ 1556.557791] nvme nvme1: Abort status: 0x0
[ 1556.875491] nvme nvme1: I/O tag 418 (01a2) opcode 0x1 (I/O Cmd) QID 6 timeout, aborting req_op:WRITE(1) size:16384
[ 1556.879723] nvme nvme1: Abort status: 0x0
[ 1561.685244] nvme nvme1: I/O tag 412 (219c) opcode 0x1 (I/O Cmd) QID 6 timeout, reset controller
[ 1643.613804] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[ 1663.639732] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[ 1663.640022] nvme nvme1: Disabling device after reset failure: -19
```

I have another
I have the same issue. Lenovo ThinkPad L13, Transcend TS2TMTE400S 2TB PCIe M.2 NVMe. Encrypted LVM with btrfs.

```
$ lsb_release -a
No LSB modules are available.
Distributor ID: Kali
Description:    Kali GNU/Linux Rolling
Release:        2024.4
Codename:       kali-rolling
$ uname -a
Linux 0trust 6.11.2-amd64 #1 SMP PREEMPT_DYNAMIC Kali 6.11.2-1kali1 (2024-10-15) x86_64 GNU/Linux
```

Several times a day the system becomes unresponsive for about 30 seconds (only the mouse cursor can be moved). When this happens, the following messages appear in dmesg:

```
[ +1.847918] nvme nvme0: I/O tag 128 (2080) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:4096
[ +0.807991] nvme nvme0: I/O tag 129 (b081) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:16384
[ +0.000024] nvme nvme0: I/O tag 130 (6082) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:16384
[ +0.000008] nvme nvme0: I/O tag 131 (2083) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:16384
[ +27.403961] nvme nvme0: I/O tag 152 (1098) opcode 0x1 (I/O Cmd) QID 1 timeout, reset controller
[ +0.000009] WARNING: CPU: 2 PID: 370 at arch/x86/kernel/apic/vector.c:878 apic_set_affinity+0x80/0x90
[ +0.000188] CPU: 2 UID: 0 PID: 370 Comm: kworker/2:1H Not tainted 6.11.2-amd64 #1 Kali 6.11.2-1kali1
[ +0.000009] Hardware name: LENOVO 21FN000BGE/21FN000BGE, BIOS R26ET43W (1.21 ) 12/12/2024
[ +0.000004] Workqueue: kblockd blk_mq_timeout_work
[ +0.000010] RIP: 0010:apic_set_affinity+0x80/0x90
[ +0.000008] Code: 00 20 00 75 1c e8 00 fb ff ff 89 c3 48 c7 c7 d8 e3 f2 b4 e8 d2 72 c4 00 89 d8 5b 5d e9 54 d9 e7 00 e8 a4 f9 ff ff 89 c3 eb e2 <0f> 0b bb fb ff ff ff 89 d8 5b 5d e9 3b d9 e7 00 90 90 90 90 90 90
[ +0.000005] RSP: 0018:ffffb06a40b27ba8 EFLAGS: 00010046
[ +0.000006] RAX: 000000001d03a000 RBX: ffff993d40062800 RCX: 0000000000000000
[ +0.000004] RDX: 0000000000000000 RSI: ffffffffb4f825c0 RDI: ffff993d5655df00
[ +0.000004] RBP: 0000000000000000 R08: ffffffffb4f825c0 R09: ffffffffb4f821a0
[ +0.000003] R10: 0000000000000000 R11: 0000000000000000 R12: ffff993d5655d940
[ +0.000004] R13: ffff993d558f6a80 R14: ffffffffb4f825c0 R15: 0000000000000000
[ +0.000005] FS:  0000000000000000(0000) GS:ffff99401eb00000(0000) knlGS:0000000000000000
[ +0.000004] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000004] CR2: 00007f5f98455008 CR3: 0000000195a22000 CR4: 0000000000750ef0
[ +0.000005] PKRU: 55555554
[ +0.000003] Call Trace:
[ +0.000006]  <TASK>
[ +0.000004]  ? apic_set_affinity+0x80/0x90
[ +0.000007]  ? __warn.cold+0x8e/0xe8
[ +0.000008]  ? apic_set_affinity+0x80/0x90
[ +0.000015]  ? report_bug+0xff/0x140
[ +0.000010]  ? handle_bug+0x3c/0x80
[ +0.000008]  ? exc_invalid_op+0x17/0x70
[ +0.000006]  ? asm_exc_invalid_op+0x1a/0x20
[ +0.000015]  ? apic_set_affinity+0x80/0x90
[ +0.000008]  ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000007]  amd_ir_set_affinity+0x46/0xa0
[ +0.000009]  ioapic_set_affinity+0x24/0x70
[ +0.000006]  irq_do_set_affinity+0x1c0/0x200
[ +0.000007]  irq_setup_affinity+0xf7/0x180
[ +0.000005]  irq_startup+0x11e/0x130
[ +0.000004]  enable_irq+0x50/0xa0
[ +0.000006]  nvme_timeout+0x283/0x3a0 [nvme]
[ +0.000009]  blk_mq_handle_expired+0x6b/0xa0
[ +0.000005]  bt_iter+0x88/0xa0
[ +0.000005]  blk_mq_queue_tag_busy_iter+0x31c/0x5d0
[ +0.000004]  ? __pfx_blk_mq_handle_expired+0x10/0x10
[ +0.000004]  ? __pfx_blk_mq_handle_expired+0x10/0x10
[ +0.000005]  blk_mq_timeout_work+0x171/0x1b0
[ +0.000004]  process_one_work+0x177/0x330
[ +0.000006]  worker_thread+0x252/0x390
[ +0.000004]  ? __pfx_worker_thread+0x10/0x10
[ +0.000003]  kthread+0xd2/0x100
[ +0.000004]  ? __pfx_kthread+0x10/0x10
[ +0.000004]  ret_from_fork+0x34/0x50
[ +0.000005]  ? __pfx_kthread+0x10/0x10
[ +0.000004]  ret_from_fork_asm+0x1a/0x30
[ +0.000008]  </TASK>
[ +0.000001] ---[ end trace 0000000000000000 ]---

<above trace sometimes repeats>

[ +0.000207] nvme nvme0: Abort status: 0x371
[ +0.000006] nvme nvme0: Abort status: 0x371
[ +0.000003] nvme nvme0: Abort status: 0x371
[ +0.000002] nvme nvme0: Abort status: 0x371
[ +0.000002] nvme nvme0: Abort status: 0x371
[ +0.037048] nvme nvme0: 8/0/0 default/read/poll queues
```

After this happens several times, the filesystems switch to read-only (which probably is the sane and safe way to go), forcing a reboot. Yet the issue returns every single time.

SMART does not report anything suspicious:

```
$ sudo smartctl -H /dev/nvme0n1
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.11.2-amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

$ sudo smartctl -l error /dev/nvme0n1
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.11.2-amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
```

I also ran complete hardware diagnostics using Lenovo Diagnostics; every test passed. I have had this issue since at least mid 2024, but I cannot pinpoint the exact time or kernel version when I first encountered it.