Bug 216809
Summary: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting

| Product | IO/Storage | Reporter | jfhart085 |
|---|---|---|---|
| Component | NVMe | Assignee | IO/NVME Virtual Default Assignee (io_nvme) |
| Status | NEW | CC | andreas.thalhammer, kbusch, konoha02, nazar, rev |
| Severity | blocking | Priority | P1 |
| Hardware | All | OS | Linux |
| Kernel Version | 6.1 | Subsystem | |
| Regression | No | Bisected commit-id | |

Attachments:
- nvme drive and controller specs
- iostat on nvme
- Samsung 990 Pro 4TB disconnects under load with `nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off`
Description
jfhart085
2022-12-14 11:35:51 UTC
Created attachment 303406 [details]
nvme drive and controller specs
Did this work on an earlier kernel version? If yes, which was the latest version that worked?

I have just recently started trying to use NVMe. I originally tried with Linux 5.7 and had similar failures, so I upgraded to 6.1 and still had them. The report I give here is for Linux kernel 6.1. To clarify: I have not tried NVMe with anything earlier than Linux 5.7.

Thanks for clarifying; then it's not a regression I should track, but something for the regular developers. If you don't get a reply from them here, report this to the appropriate maintainers and the subsystem list mentioned in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/MAINTAINERS

oct 06 20:46:04 debian kernel: nvme nvme0: I/O 648 (I/O Cmd) QID 4 timeout, aborting
oct 06 20:47:23 debian kernel: nvme nvme0: I/O 0 (I/O Cmd) QID 6 timeout, aborting
oct 06 20:47:23 debian kernel: nvme nvme0: I/O 681 QID 3 timeout, reset controller
oct 06 20:47:23 debian kernel: nvme nvme0: I/O 8 QID 0 timeout, reset controller
oct 06 20:47:23 debian kernel: nvme nvme0: Abort status: 0x371
oct 06 20:47:23 debian kernel: nvme nvme0: Abort status: 0x371
oct 06 20:47:23 debian kernel: nvme nvme0: Abort status: 0x371
oct 06 20:47:23 debian kernel: nvme nvme0: Abort status: 0x371
oct 06 20:47:23 debian kernel: nvme nvme0: Abort status: 0x371
oct 06 20:47:23 debian kernel: nvme nvme0: 15/0/0 default/read/poll queues
oct 06 20:47:23 debian kernel: nvme nvme0: Ignoring bogus Namespace Identifiers

I got this error today. I powered on my machine (cold boot), and after logging in to my Xfce session everything except the cursor froze; nothing responded (not even the clock). I could switch to a TTY and press Ctrl+Alt+Del from there. It rebooted after a minute or so.

@Nelson G (Comment 7): Which version of Debian are you running? Which kernel version exactly?
On the shell, can you share the (selective) output of the following commands?

# cat /etc/debian_version
# cat /etc/*release
# uname -a
# hostnamectl

Also, if installed, (selective) output about the system would be nice:

# dmidecode -t system

You might want to _not_ paste the serial numbers and UUIDs, hence "selective".

Oops, sorry, here: Debian 12.1, Linux 6.1.52-1 x86-64, ThinkPad E495.

# dmidecode 3.4
Getting SMBIOS data from sysfs.
SMBIOS 3.1.1 present.

Handle 0x000F, DMI type 1, 27 bytes
System Information
	Manufacturer: LENOVO
	Product Name: 20NECTO1WW
	Version: ThinkPad E495
	Serial Number: blank
	UUID: blank
	Wake-up Type: Power Switch
	SKU Number: LENOVO_MT_20NE_BU_Think_FM_ThinkPad E495
	Family: ThinkPad E495

Handle 0x0036, DMI type 15, 31 bytes
System Event Log
	Area Length: 1346 bytes
	Header Start Offset: 0x0000
	Header Length: 16 bytes
	Data Start Offset: 0x0010
	Access Method: General-purpose non-volatile data functions
	Access Address: 0x00F0
	Status: Valid, Not Full
	Change Token: 0x00000053
	Header Format: Type 1
	Supported Log Type Descriptors: 4
	Descriptor 1: POST error
	Data Format 1: POST results bitmap
	Descriptor 2: PCI system error
	Data Format 2: None
	Descriptor 3: System reconfigured
	Data Format 3: None
	Descriptor 4: Log area reset/cleared
	Data Format 4: None

Can you say with certainty that the NVMe SSD itself isn't faulty, causing the errors? Did you check the SMART (and other) logs, e.g. with smartctl?

For reference, my system:

Hardware specs:
	Product: LENOVO Legion 5 Pro 16ACH6H 82JQ
	SKU Number: LENOVO_MT_82JQ_BU_idea_FM_Legion 5 Pro 16ACH6H
	Family: Legion 5 Pro 16ACH6H
NVMe PCIe SSD (upgraded, not original): Crucial P1 CT2000P1SSD8
Software specs: Gentoo Linux, kernel 6.5.6-gentoo x86_64, AMD Ryzen 7 5800H (custom kernel, my own .config)

E.g.
this is from my system, without the errors you describe:

localhost ~ # smartctl -H /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.5.6-gentoo-L5P] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

localhost ~ # smartctl -l error /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.5.6-gentoo-L5P] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF SMART DATA SECTION ===
Error Information (NVMe Log 0x01, 16 of 256 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc  LBA  NSID  VS
  0       1937     0  0x0014  0x4005      -    0     1   -

localhost ~ # nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 44 °C (317 K)
available_spare                     : 100%
available_spare_threshold           : 5%
percentage_used                     : 3%
endurance group critical warning summary: 0
Data Units Read                     : 273.156.408 (139,86 TB)
Data Units Written                  : 69.561.559 (35,62 TB)
host_read_commands                  : 9.941.155.688
host_write_commands                 : 1.220.956.942
controller_busy_time                : 22.719
power_cycles                        : 1.549
power_on_hours                      : 8.511
unsafe_shutdowns                    : 96
media_errors                        : 0
num_err_log_entries                 : 1.937
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1                : 44 °C (317 K)
Thermal Management T1 Trans Count   : 17
Thermal Management T2 Trans Count   : 4
Thermal Management T1 Total Time    : 17593
Thermal Management T2 Total Time    : 9

Well, I don't mean that my SSD is the culprit for being in a bad state, but that its Linux support is as bad as my ThinkPad's BIOS OEM team. They simply don't care about Linux when you ask for help/support with a bug like this, or worse, with a critical bug/implementation. They act as if the error is using Linux per se.
Anyway:

$ smartctl -H /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-13-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

$ smartctl -l error /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-13-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF SMART DATA SECTION ===
Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

$ sudo nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 38 °C (311 K)
available_spare                     : 100%
available_spare_threshold           : 10%
percentage_used                     : 3%
endurance group critical warning summary: 0
Data Units Read                     : 45.442.041 (23,27 TB)
Data Units Written                  : 26.846.180 (13,75 TB)
host_read_commands                  : 462.473.222
host_write_commands                 : 320.013.095
controller_busy_time                : 25.241
power_cycles                        : 7.142
power_on_hours                      : 11.836
unsafe_shutdowns                    : 415
media_errors                        : 10
num_err_log_entries                 : 10
Warning Temperature Time            : 189
Critical Composite Temperature Time : 1   (very old record :))
Thermal Management T1 Trans Count   : 0
Thermal Management T2 Trans Count   : 0
Thermal Management T1 Total Time    : 0
Thermal Management T2 Total Time    : 0

Oh, by the way, I reported this warning some time ago: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=990510

[    1.390941] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0xffffc000 flags=0x0000]
[    1.390976] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0xffffc080 flags=0x0000]

I tend to get it constantly, as the very first boot message (even before being prompted for LUKS). The day I got the bug we're reporting here, I also got that one; just sharing in case anyone recognizes it.
I tend to think that bug fixes take time, especially if the developers don't experience the issues themselves, on their own hardware. E.g. bug #202665 got fixed more than a year after it was initially reported. So anyway, you should definitely get the newest kernel and see if the issue is still there, because that's what the developers will ask you to do anyhow. Also, with the sources at your disposal, you can apply any patches they throw at you for testing. The basics are (https://www.wikihow.com/Compile-the-Linux-Kernel):

1. Get the source. (Apply patches if/as necessary.)
2. Compile the kernel and the modules, then install them.
3. Boot that kernel. (Probably involves configuring a boot manager.)

How you do all that depends on the distribution you use. For Debian, read this: https://www.debian.org/doc//manuals/debian-handbook/sect.kernel-compilation.pl.html For Ubuntu, you'd read this: https://wiki.ubuntu.com/Kernel/BuildYourOwnKernel I know this seems like a lot of work, but it's the only way I know of...

Well, I don't think there's any indication of a kernel bug of any kind here. Devices that don't complete commands will exceed the kernel's tolerance for timeouts, and there's nothing the kernel can do about it.

Is there a way to tweak the timeout, i.e. adjust it to a higher value? As I understand it, the kernel resets the controller after the timeout is reached, but how certain is it that the command wouldn't have succeeded given more time?

Not sure more time will help here. The 'iostat' output I suggested on the mailing list should indicate whether waiting longer will help or not. You can change the default with the kernel parameter 'nvme_core.io_timeout=<time-in-seconds>'. The default is 30 seconds.

@Keith Busch, thank you very much. To get iostat, install sysstat.
(On Debian: "sudo apt install sysstat")

My system is working and doesn't have those issues, for reference: nvme0 contains my main Gentoo Linux system; nvme1 is mostly idle all the time.

localhost ~ # dmesg | grep nvme
[    7.269597] nvme 0000:02:00.0: platform quirk: setting simple suspend
[    7.269615] nvme 0000:05:00.0: platform quirk: setting simple suspend
[    7.269710] nvme nvme1: pci function 0000:05:00.0
[    7.269713] nvme nvme0: pci function 0000:02:00.0
[    7.279969] nvme nvme1: missing or invalid SUBNQN field.
[    7.280233] nvme nvme1: Shutdown timeout set to 8 seconds
[    7.297803] nvme nvme1: 16/0/0 default/read/poll queues
[    7.302096] nvme1n1: p1 p2 p3 p4
[    7.303998] nvme nvme0: 15/0/0 default/read/poll queues
[    7.311543] nvme0n1: p1 p2 p3 p4 p5 p6 p7 p8

localhost ~ # iostat -x nvme0n1
Linux 6.5.7-gentoo-L5P (localhost.localdomain)  2023-10-14  _x86_64_  (16 CPU)

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           0,65   0,00     0,36     0,05    0,00  98,94

Device   r/s    rkB/s   rrqm/s  %rrqm  r_await  rareq-sz  w/s    wkB/s   wrqm/s  %wrqm  w_await  wareq-sz  d/s    dkB/s     drqm/s  %drqm  d_await  dareq-sz  f/s   f_await  aqu-sz  %util
nvme0n1  21,81  684,75  0,01    0,07   1,31     31,40     11,16  366,46  0,15    1,37   4,81     32,84     11,96  12172,93  0,00    0,00   0,22     1017,38   0,21  0,44     0,08    1,28

localhost ~ # iostat -x nvme1n1
Linux 6.5.7-gentoo-L5P (localhost.localdomain)  2023-10-14  _x86_64_  (16 CPU)

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           0,65   0,00     0,36     0,05    0,00  98,93

Device   r/s   rkB/s  rrqm/s  %rrqm  r_await  rareq-sz  w/s   wkB/s  wrqm/s  %wrqm  w_await  wareq-sz  d/s   dkB/s  drqm/s  %drqm  d_await  dareq-sz  f/s   f_await  aqu-sz  %util
nvme1n1  1,03  5,39   0,00    0,00   0,04     5,23      0,00  0,00   0,00    0,00   0,00     0,00      0,00  0,00   0,00    0,00   0,00     0,00      0,00  0,00     0,00    0,01

@Nelson G, could you also provide your iostat output, please? Also, try adding nvme_core.io_timeout=60 to your kernel command line and see if that changes anything.
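For anyone following along, the timeout currently in effect can be read back at runtime from the nvme_core module parameter in sysfs. A minimal sketch, assuming a system where the nvme_core module may or may not be loaded; the 30-second fallback is the driver default mentioned above:

```shell
#!/bin/sh
# Read the current NVMe I/O timeout from the nvme_core module parameter.
# Falls back to the documented default of 30 seconds when the module is
# not loaded (e.g. on machines without NVMe).
timeout_file=/sys/module/nvme_core/parameters/io_timeout
if [ -r "$timeout_file" ]; then
    t=$(cat "$timeout_file")
else
    t=30  # driver default
fi
echo "nvme io_timeout: ${t}s"
# To change it persistently, add e.g. nvme_core.io_timeout=60 to the
# kernel command line and regenerate your boot loader configuration.
```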
Also, try what Keith Busch suggested on the mailing list:

----snap---
If the device is responding very slowly, suggestions I might have include:
a. Enable discards if you've disabled them
b. Disable discards if you've enabled them
c. If not already enabled, use an io scheduler, like mq-deadline or kyber
----snap---

Thanks. Here:

iostat with system defaults, output after a warm boot and after about 10 minutes of usage:

Linux 6.1.0-13-amd64 (debian)  13/10/23  _x86_64_  (8 CPU)

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           7,53   0,00     8,96     2,09    0,00  81,42

Device   r/s     rkB/s     rrqm/s  %rrqm  r_await  rareq-sz  w/s    wkB/s   wrqm/s  %wrqm  w_await  wareq-sz  d/s   dkB/s  drqm/s  %drqm  d_await  dareq-sz  f/s   f_await  aqu-sz  %util
nvme0n1  605,22  18567,04  59,70   8,98   0,32     30,68     34,25  337,62  0,11    0,33   0,26     9,86      0,00  0,00   0,00    0,00   0,00     0,00      1,32  0,13     0,20    24,69

Linux 6.1.0-13-amd64 (debian)  13/10/23  _x86_64_  (8 CPU)

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           2,07   0,01     1,81     0,46    0,00  95,65

Device   r/s     rkB/s    rrqm/s  %rrqm  r_await  rareq-sz  w/s   wkB/s  wrqm/s  %wrqm  w_await  wareq-sz  d/s   dkB/s  drqm/s  %drqm  d_await  dareq-sz  f/s   f_await  aqu-sz  %util
nvme0n1  132,05  4217,31  16,53   11,13  0,64     31,94     5,87  51,63  0,02    0,34   0,23     8,80      0,00  0,00   0,00    0,00   0,00     0,00      0,29  0,11     0,09    4,12

iostat WITH nvme_core.io_timeout=60, again after a warm boot and about 10 minutes of usage:

Linux 6.1.0-13-amd64 (debian)  13/10/23  _x86_64_  (8 CPU)

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           7,21   0,00     7,36     1,69    0,00  83,74

Device   r/s     rkB/s     rrqm/s  %rrqm  r_await  rareq-sz  w/s    wkB/s   wrqm/s  %wrqm  w_await  wareq-sz  d/s   dkB/s  drqm/s  %drqm  d_await  dareq-sz  f/s   f_await  aqu-sz  %util
nvme0n1  818,44  19837,90  44,75   5,18   0,62     24,24     22,52  213,35  0,02    0,08   0,27     9,47      0,00  0,00   0,00    0,00   0,00     0,00      0,86  0,13     0,51    21,85

Linux 6.1.0-13-amd64 (debian)  13/10/23  _x86_64_  (8 CPU)

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           1,73   0,00     1,06     0,17    0,00  97,04

Device   r/s    rkB/s    rrqm/s  %rrqm  r_await  rareq-sz  w/s   wkB/s  wrqm/s  %wrqm  w_await  wareq-sz  d/s   dkB/s  drqm/s  %drqm  d_await  dareq-sz  f/s   f_await  aqu-sz  %util
nvme0n1  47,16  1470,21  5,82    10,98  0,70     31,17     3,21  37,63  0,02    0,76   0,09     11,71     0,05  4,11   0,00    0,00   0,29     78,79     0,23  0,13     0,03    1,51

Off topic: yeah, I suppose you're right about bugs getting fixed once a developer experiences them. To keep things short, I'll just say I've had enough bad experiences with Lenovo. I've already been told that my issues are 'hard' to get "fixed" (they're fixed already by AMD but not implemented in the BIOS) because Microsoft Windows is the main OS, not Linux. So no Linux-related BIOS fixes for me: https://bugzilla.kernel.org/show_bug.cgi?id=216161 As for the SSD: it's cheap, the average temperatures are ridiculously high, and the speed rates are mediocre. But it gets the job done for a boot drive (system and config files are there to keep things faster than an HDD). My girlfriend's computer has a Samsung drive, and that one runs cooler and faster (same storage capacity, by the way: 256 GB). So yeah, I still don't expect much of it; like, no dmesg warnings, or the bug we're reporting here (which has only happened to me on very rare occasions). At least it rarely fails in scary ways like 'nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting'. And thanks for the advice about the kernel update! I use both Debian and Arch, and the 'fix' for https://bugzilla.kernel.org/show_bug.cgi?id=202665 was supposed to be in 6.1 and beyond, although, knowing that ADATA told users with that bug to use Windows, and that it seems to be vendor related (not entirely, of course; maybe it is common to certain NVMe controllers), I prefer to just reboot once I see that warning.

You'd have to check, but maybe those error messages aren't really affecting your system at all. I would suggest testing this by running your system for a very long time (more than 1-2 days), with no reboots, and seeing if it remains stable. Especially with the Arch system, as it always uses the newest kernel.
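Keith Busch's suggestion (c) above, switching to mq-deadline or kyber, can be checked from sysfs, where the kernel marks the active scheduler with brackets. A sketch, assuming the device name nvme0n1 from this report; switching requires root:

```shell
#!/bin/sh
# Show the active I/O scheduler for a block device; the kernel marks
# the current one with brackets, e.g. "[none] mq-deadline kyber".
dev=nvme0n1
sched_file=/sys/block/$dev/queue/scheduler
if [ -r "$sched_file" ]; then
    line=$(cat "$sched_file")
    active=$(printf '%s\n' "$line" | sed -n 's/.*\[\(.*\)\].*/\1/p')
    echo "$dev active scheduler: $active"
    # To switch (as root): echo mq-deadline > "$sched_file"
else
    echo "$dev: no scheduler file (device not present?)"
fi
```

Note this sysfs change does not survive a reboot; a udev rule or boot parameter is needed to make it permanent.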
Also, if you're up to it, you could try switching (NVMe) SSDs and see if the fault is gone with another SSD.

@jfhart085@gmail.com, the original reporter: if you could also provide the requested info, this bug could be closed. Since you have a Kingston SSD, it could be different. Also, a Dell XPS isn't supposed to be a cheap system. Something like:

# smartctl -H /dev/nvme0n1
# smartctl -l error /dev/nvme0n1
# nvme smart-log /dev/nvme0n1
# iostat -x nvme0n1

@Nelson G, off topic: I am very happy with my Legion 5 Pro from Lenovo. It is running very well with Linux, with the exception that I have to boot into Windows for BIOS updates. Some people also reported issues with dual "switcheroo" graphics on Linux, but since I'm only using the AMD Radeon integrated graphics on Linux (I haven't even loaded the closed-source Nvidia drivers) and I specifically bought this laptop BECAUSE it has a dedicated Nvidia card (for me only on Windows, only for gaming), I didn't have those (now fixed, as I heard) Nvidia/AMD switcheroo issues. But that's the whole purpose of the Legion 5 Pro anyhow: gaming! Not Linux, but if you know what you're doing, it works very, very well. Lenovo has some dedicated Linux laptops, like certain ThinkPad series. Even in the BIOS/UEFI settings there's a "Windows" or "Linux" OS preset on them, so I guess it all depends on what you're looking for. Anyway: have you reported your issues/findings to the Lenovo support forums? They have a guy there, MarkRHPearson, acting as a Linux liaison, and he speaks to the company (and the technicians, e.g. for BIOS updates) on behalf of us Linux users. https://forums.lenovo.com/t5/Linux-Operating-Systems/ct-p/lx_en That said, it may well be that there's nothing that can be done. Maybe under Windows your SSD has the same issues but Windows doesn't directly report them to you; it may just be the specific SSD you're using...

The way to run iostat as needed for this issue is "iostat -x /dev/nvme0n1 1".
Notice the "1" at the end: it repeats the stats once every second so that we can see what changed at each time interval. If the device is hopelessly broken, I expect that at some point %util will approach 100%, r/s and w/s will go to 0, and aqu-sz will maintain a relatively large non-zero value. If the device is simply unable to keep up with the workload you're demanding of it, but is still responsive in a slower capacity, then r/s and w/s should be non-zero, but show an obvious drop from before the timeouts were observed.

I no longer have the Kingston SSD. I replaced it with a Samsung unit. I was able to resolve the issue as follows:

> On 12/19/22 11:41 PM, Keith Busch wrote:
>
>>> MaxPayload 128 bytes, MaxReadReq 512 bytes
>>> DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
>>> LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Latency L0 <1us, L1 <8us
>>>         ClockPM+ Surprise- LLActRep- BwNot-
>>> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
>>>         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>>> LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>>
>> Something seems off if it's downtraining to Gen1 x1. I believe this
>> setup should be capable of Gen2 x4. It sounds like the links among these
>> components may not be reliable.
>
> This was a useful comment, and led me on to some further research. The
> system in question here (a rather older Dell) was equipped with the
> following PCI slots:
>
> slot           pins
> ----           ----
> PCI      - 6   124   available
> PCI      - 5   124   blocked by RX580
> PCIe x16 - 4   164   secondary graphics, RX580
> PCIe x8  - 3    98   NVME controller
> PCIe x1  - 2    36   TV card
> PCIe x16 - 1   164   primary graphics, available
>
> It also uses an Nvidia chipset for the PCI bridge. It turns out that
> back in 2008 Dell had custom modifications made to a standard chipset
> configuration normally supplied by Nvidia. This enabled the motherboard
> to handle up to two graphics cards at the same time while minimizing
> cost. The trade-off apparently was that the two x16 slots actually only
> have 8 data lanes each, and the x8 slot (slot 3) only one lane. I had
> the NVME controller in the x8 slot, expecting the controller to
> negotiate for 4 lanes, not realizing the single-lane restriction. I
> moved the controller to the unused x16 slot (slot 1), since I have only
> a single card in the machine. It is now operating as expected without
> further difficulty.

Thank you so much, @jfhart085! @Nelson G, maybe this helps you as well? Does your laptop have two different PCIe NVMe slots? If so, can you switch the SSDs? And please test your system for an extended period of time, run the iostat command like Keith Busch requested in Comment 19 (on a separate virtual console), then post its output when you notice the problem.

Mine only has one PCIe NVMe slot. Alright, I will try to run iostat for an extended period of time. It's been running for at least 10 minutes now and there doesn't seem to be anything outstanding in the output; I will share it in a minute.
I am currently using the 'nvme_core.default_ps_max_latency_us=0' boot parameter because, sadly, I hit this bug once again a few days ago. journald:

oct 29 22:16:46 debian kernel: nvme nvme0: I/O 83 (I/O Cmd) QID 2 timeout, aborting
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 453 (I/O Cmd) QID 8 timeout, aborting
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 454 (I/O Cmd) QID 8 timeout, aborting
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 898 (I/O Cmd) QID 1 timeout, aborting
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 899 (I/O Cmd) QID 1 timeout, aborting
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 0 QID 0 timeout, reset controller
oct 29 22:16:46 debian kernel: nvme nvme0: I/O 83 QID 2 timeout, reset controller
oct 29 22:16:46 debian kernel: nvme nvme0: Abort status: 0x371
oct 29 22:16:46 debian kernel: nvme nvme0: Abort status: 0x371
oct 29 22:16:46 debian kernel: nvme nvme0: Abort status: 0x371
oct 29 22:16:46 debian kernel: nvme nvme0: Abort status: 0x371
oct 29 22:16:46 debian kernel: nvme nvme0: Abort status: 0x371
oct 29 22:16:46 debian kernel: nvme nvme0: 15/0/0 default/read/poll queues
oct 29 22:16:46 debian kernel: nvme nvme0: Ignoring bogus Namespace Identifiers

I had to reboot because I needed the system working correctly asap. Next time I will run iostat and dmesg and save the output (if possible, because everything becomes slow/unresponsive). I was reading the Arch Linux forums and found that boot parameter to be 'useful' in keeping the system from hitting this bug at boot. (I've only hit this bug during the boot process, never after DAYS of the system running. I'd say I've reproduced it maybe 1 time out of 100 boots; I really don't know what triggers it.) In a few hours I will run iostat again with the default boot parameters and more exhaustive NVMe activity.

Created attachment 305345 [details]
iostat on nvme
nvme_core.default_ps_max_latency_us=0 boot parameter in use; not much happening on that SSD during this test.
With this last information there are, logically, two actions required:

#1: The devs need an iostat output from when the error occurs, not from when the system is running as it should.
#2: If the dmesg errors only occur right at boot time, but not at all while the system is running, this might be a rare case that could be caused and/or addressed by the kernel, and therefore could also be fixed in the code.

Issue #1 is for @Nelson G. Issue #2 is for the devs, @Keith Busch.

This bug is still current. I get it on mainline 6.7(.9), and had it on previous kernels (so long ago I don't even remember which) on Ubuntu 22.04.4 LTS. Strangely, it was better on some of the previous kernel series. But now (with 6.7) I get system hangs with the same log as the initial bug reporter. And it gets as bad as system crashes, because no I/O is possible any more. The system just stops working, because the OS cannot access its boot drive (512 GB LVM2, Adata SX6000LNP with the most recent firmware). And of course, no logs because of that. First I thought it must be my drive, but checks give the NVMe drive a clean slate. I can even trigger the bug by loading video files and jumping around in time (skipping) while watching. Sooner or later the system will hang with the OP's error, and later the system will lose I/O to the boot drive and I have to manually power off the hardware and cold boot it (no shutdown/reboot possible, no other working environment possible, since no I/O is possible). That makes my beloved Linux less stable than my Windows PC (puke).

We're not doing anything outside spec compliance here. The device just doesn't respond. Why is that? Usually it's a firmware bug, and SMART tools generally wouldn't indicate that. Sometimes we can work around firmware bugs in the driver, but we can't efficiently guess at what the problem might be without the hardware in hand.

Thank you, I can understand that.
I probably should have watched my system more closely when the bug was not (or only very seldom) happening, and noted the kernel versions. But, as is typical human behaviour, when an error is not bugging you, you are happy and do nothing. Now I remember the issue getting worse in the 6.7 mainline series, and that brings back the memories of "I had that before...". I installed 6.8(.0) today: no cold boot necessary yet, but error logs like

nvme nvme0: Abort status: 0x0
nvme nvme0: I/O tag 512 (f200) opcode 0x1 (I/O Cmd) QID 3 timeout, aborting req_op:WRITE(1) size:4096

are still there.

Do the occurrences become less frequent if you revert to an older kernel right now? I just want to rule out the possibility that your drive is simply wearing out and starting to struggle.

TBW (terabytes written) and drive age are at 1/10 of the manufacturer spec. Since I don't remember the last kernel that worked better, or at least without I/O crashes, I am open to suggestions of which kernel to choose.

With 6.8.1 the "nvme" filter now shows the following errors:

(this is new:)
nvme nvme0: Ignoring bogus Namespace Identifiers
nvme nvme0: 7/0/0 default/read/poll queues
(this is old/known:)
nvme nvme0: Abort status: 0x0
nvme nvme0: I/O tag 987 (13db) opcode 0x1 (I/O Cmd) QID 4 timeout, aborting req_op:WRITE(1) size:8192

At least the last two errors repeat steadily, at around 50/hour on average, but no I/O crash yet. (QID varies 1-7, as does size: 4096-12288.)

By the way, since there are a lot of different NVMe SSD brands and types showing this error, how can one get rid of it?

I have two M.2 SSDs from Crucial:
nvme0: CT4000P3PSSD8 (4 TB), firmware P9CR40A
nvme1: CT2000P1SSD8 (2 TB), firmware P3CR021
I don't have any issues. I recently bought the bigger SSD and moved Linux over from the other one, swapping their places.
I see this during kernel 6.8.1 boot and in dmesg:

[    6.641233] nvme 0000:02:00.0: platform quirk: setting simple suspend
[    6.641249] nvme 0000:05:00.0: platform quirk: setting simple suspend
[    6.641327] nvme nvme0: pci function 0000:02:00.0
[    6.641331] nvme nvme1: pci function 0000:05:00.0
[    6.647140] nvme nvme0: missing or invalid SUBNQN field.
[    6.672906] nvme nvme1: 15/0/0 default/read/poll queues
[    6.673090] nvme nvme0: allocated 32 MiB host memory buffer.
[    6.681449] nvme1n1: p1 p2 p3 p4
[    6.721320] nvme nvme0: 8/0/0 default/read/poll queues
[    6.725566] nvme nvme0: Ignoring bogus Namespace Identifiers
[    6.727874] nvme0n1: p1 p2 p3 p4 p5

This begs the question of what the SUBNQN field and Namespace Identifiers are, and why they aren't present or have to be ignored: the CT2000P1SSD8 is a Crucial P1, the older one. The newer CT4000P3PSSD8 Crucial P3 Plus has the "missing or invalid SUBNQN field" and the "Ignoring bogus Namespace Identifiers" issues. Obviously I have no idea what this means, but I'm guessing it is unrelated to the I/O timeout issues, since I don't get any of those.

(In reply to Andreas from comment #31)
> This begs the question what SUBNQN field and Namespace Identifiers are and
> why they aren't present or have to be ignored [...]

These duplicate and invalid naming messages don't mean anything unless your environment is an enterprise server with multipath or, in some cases, virtualization. No one else should mind if your manufacturer didn't assign unique IDs. That used to harm things if you put two such devices in the same machine, because of how udev works, but less so anymore. Technically, there's still a data-integrity danger with udev enumeration order, but not one anyone in real life seems to care about.
Linus Torvalds in fact requested we remove the safety check.

(In reply to Rev from comment #30)
> By the way, since there are a lot of different NVME SSD brands and types
> with this error found, how can one get rid of it?

The error is that the drive is failing to produce a response to a command within a very generous time. We wait 30 seconds, which should be an eternity for NAND media. In my experience, having seen many of these occurrences, they are always some broken firmware condition, and I have to harass the vendor for a proper fix.

(In reply to Keith Busch from comment #33)
> The error is that the drive is failing to produce a response to a command
> within a very generous time. [...]

Since it is, sadly, a very common response from hardware vendors to pack their product into a black box (in this case: check how many vendors give customers the option to do a firmware update in a non-Windows environment; is that number zero or non-zero?) and leave open-source communities hanging, my opinion would be that the open-source communities should consolidate and concentrate their strength on finding a workaround. I know this is an opinion from someone not involved in the process, which I can only guess is a very difficult one. But if this is a common problem, and from the responses around the web it seems like it is, it might be the only chance to solve the issue.
(In reply to Rev from comment #34)
> Since it is a sadly very, very common response from hardware vendors to pack
> their product in a black box [...] my opinion would be to find a
> consolidation in the open source communities to concentrate on their
> strength to find a workaround.

Not sure I follow what you have in mind here. There's no one-size-fits-all workaround for this. If you want a simple way to change the workload the device receives, which may reduce the number of observed timeouts, try the kyber io scheduler. If you're already using that, try mq-deadline instead. Or try a different filesystem: using ext4? Try xfs. Using xfs? Try btrfs. Using btrfs? Try ext4. And ensure partitions are aligned on 64k boundaries. Are discards disabled? If so, turn them on. Is your drive-capacity utilization very high? Try deleting a bunch of stuff and keep your utilization below 80%. Do you have any power-saving features enabled? Try turning them off.

I also see this on two systems, both running Ubuntu 24.04.1.

First system: ASUS Pro WS TRX50-SAGE WIFI + AMD Threadripper 7970X, Ubuntu 24.04 Desktop + kernel 6.11.x from Xanmod. SSDs that have this issue: Solidigm P44 Pro 2TB and Samsung 990 Pro 4TB.

Second system: Gigabyte MZ32-AR0 (rev 1.0) + AMD Epyc 7302P, Ubuntu 24.04 Server + kernel 6.8-generic from stock repositories. SSDs that have this issue: SK hynix P41 Platinum 2TB (almost identical to the P44 Pro above, with different firmware) and Samsung 990 Pro 4TB (a different unit from the first system).

I tried nvme_core.default_ps_max_latency_us=0 separately and together with pcie_aspm=off pcie_port_pm=off. Maybe it's coincidence, but with all three the system survives a bit longer than without, yet still reliably crashes when running `btrfs scrub start /`.
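Keith's discard items (a/b) presuppose that the device and queue actually advertise discard support, which is visible in sysfs. A minimal sketch, assuming the device name nvme0n1; a discard_granularity of 0 means the device does not accept discards:

```shell
#!/bin/sh
# Check whether a block device advertises discard (TRIM) support:
# a non-zero discard_granularity means the device accepts discards.
dev=nvme0n1
f=/sys/block/$dev/queue/discard_granularity
if [ -r "$f" ] && [ "$(cat "$f")" -gt 0 ] 2>/dev/null; then
    echo "$dev: discard supported"
else
    echo "$dev: discard unsupported or device not present"
fi
# On most distributions periodic trimming is driven by fstrim rather
# than the 'discard' mount option; "systemctl status fstrim.timer"
# shows whether it is enabled.
```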
Attaching logs from the second system, which was started with `nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off`.

Things I have tried so far without any luck:
* putting the SSD into a different PCIe slot (both a native M.2 slot on the motherboard and a regular PCIe slot through an adapter)
* forcing PCIe 3.0 speed for these PCIe 4.0 SSDs
* various BIOS options (like enabling/disabling AER, ACS, and some others)

All SSDs are on the latest available firmware. The one SSD I didn't have this issue with was an ADATA XPG SP8100NP-2TT-C 2TB (not the best SSD performance-wise).

I will attach the full kernel log with `nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off`. I'm open to experiments, because both systems are freezing, sometimes multiple times a day, making them barely usable. I can compile and test a kernel or boot with custom options if needed; anything that can fix this issue (it has been going on for multiple months, and I upgraded to the 990 Pro primarily to deal with this problem).

Created attachment 307207 [details]
Samsung 990 Pro 4TB disconnects under load with `nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off`