Bug 195039
Summary: Samsung PM951 NVMe sudden controller death
Product: IO/Storage
Component: Other
Status: NEW
Severity: normal
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.11-rc3
Regression: No
Reporter: Marvin W (kernel)
Assignee: io_other
CC: adamkoch8, adelfino, akijo97, amanieu, axboe, basti.megamorf+kernel-org, brunogs001, bugs, cravchik, david.antliff, dirkjonker, drobek.krzysztof, eeshugerman, ian, intelligence.dance, Jbud, jckeerthan, juca, k.kulikov94, kernel, linux, linux, linux_kernel, luto, mail, mikhail.v.gavrilov, opensuser, pbrobinson, qwerty, rafal.moderski, raphael.droz, regressions, sereza, shopper2k, thehans, tr.ml, uran1980, v, vladbph, xken.sky
Attachments: nvme id-ctrl /dev/nvme0
Description
Marvin W
2017-03-25 18:30:18 UTC
Andy, Jens: Does this problem look like the problems you saw on the SM951 that already was blacklisted in https://git.kernel.org/torvalds/c/c5552fde102fcc3f2cf9e502b8ac90e3500d8fdf ?

Side note: I added this report to the list of regressions for Linux 4.11. I'll try to watch this place for further updates on this issue to document progress in my weekly reports. Please let me know in case the discussion moves to a different place (bugzilla or another mail thread for example). tia!

Marvin, can you try with -rc4, and revert commit c5552fde10? I just checked, it reverts cleanly.

Hi Marvin- Could you give me some more details of your hardware?

1. The raw device identification. A new enough smartctl will show it in 'smartctl -i /dev/nvme0'. Even better would be the full output of 'nvme id-ctrl /dev/nvme0' -- you can find the 'nvme' tool in the nvme-cli package.
2. What kind of computer is this? Is it a laptop? Is the affected disk something that came with the laptop?
3. Can you try booting with nvme_core.default_ps_max_latency_us=0? That will disable the power-saving feature that is likely at fault.

Samsung currently has a machine that appears to be affected and is trying to figure out what's going on. There's some reason to believe that the problem is triggered by specific combinations of laptop and SSD. I'll obviously need to update the blacklist to fix your laptop (in lieu of a better workaround that still lets you get some power savings), but I need the info above to figure out what the blacklist entry should look like.

Thanks,
Andy

Created attachment 255577 [details]
nvme id-ctrl /dev/nvme0

(In reply to Andy Lutomirski from comment #3)
> 1. The raw device identification. A new enough smartctl will show it in
> 'smartctl -i /dev/nvme0'. Even better would be the full output of 'nvme
> id-ctrl /dev/nvme0' -- you can find the 'nvme' tool in the nvme-cli package.

Andy,

(In reply to Andy Lutomirski from comment #3)
> 2. What kind of computer is this? Is it a laptop? Is the affected disk
> something that came with the laptop?

This is a Dell XPS 15 9550. It is available in many different configurations, including SATA, mSATA or NVMe SSDs with different models each and you don't know before buying which exact model you will receive in the end...

> 3. Can you try booting with nvme_core.default_ps_max_latency_us=0? That
> will disable the power-saving feature that is likely at fault.

I am running since Monday with Kernel 4.11-rc4 and nvme_core.default_ps_max_latency_us=0 and had no problems so far. Device was off/standby several hours, but prior crashes were after 1-4 hours so I assume the problem does not occur with nvme_core.default_ps_max_latency_us=0. I will also try with c5552fde10 reverted as suggested by Jens to be absolutely sure.

@Luto: What's the status here? Do you need any more information to fix?

Samsung engineers have an affected system and are trying to root-cause it. I was hoping they'd come up with something quickly, but I'm just going to submit a patch with a bigger quirk.

Also posted in bug 194921. I don't know if that breaks a rule. If it does, feel free to delete one of the posts. I'm running into what I think is the same or a related problem. When I upgraded to Ubuntu 17.04 beta 2 (which I believe is kernel 4.10), I started having crashes after anywhere from 10min to an hour. The operating system would state that the disk is now read only and/or give IO errors. I downgraded to 16.10 and now have no more problems of this kind. I'm also using a Dell 9550.
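For anyone who wants to try the workaround from item 3 above, a minimal sketch of setting the parameter as a boot option on a GRUB-based system (the file path and update command assume a Debian/Ubuntu-style layout; adjust for your distribution):

$ sudoedit /etc/default/grub
# append the parameter to the existing GRUB_CMDLINE_LINUX_DEFAULT line, e.g.:
# GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvme_core.default_ps_max_latency_us=0"
$ sudo update-grub        # on Fedora/openSUSE: sudo grub2-mkconfig -o /boot/grub2/grub.cfg
$ sudo reboot
# after the reboot, verify that the parameter actually took effect:
$ cat /proc/cmdline
$ cat /sys/module/nvme_core/parameters/default_ps_max_latency_us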
Output of 'nvme id-ctrl /dev/nvme0'

NVME Identify Controller: vid : 0x144d ssvid : 0x144d sn : S29PNXAH124276 mn : PM951 NVMe SAMSUNG 512GB fr : BXV77D0Q rab : 2 ieee : 002538 cmic : 0 mdts : 5 cntlid : 1 ver : 0 rtd3r : 0 rtd3e : 0 oaes : 0 oacs : 0x17 acl : 7 aerl : 3 frmw : 0x6 lpa : 0 elpe : 63 npss : 4 avscc : 0x1 apsta : 0x1 wctemp : 0 cctemp : 0 mtfa : 0 hmpre : 0 hmmin : 0 tnvmcap : 0 unvmcap : 0 rpmbs : 0 sqes : 0x66 cqes : 0x44 nn : 1 oncs : 0x1f fuses : 0 fna : 0 vwc : 0x1 awun : 255 awupf : 0 nvscc : 1 acwu : 0 sgls : 0
ps 0 : mp:6.00W operational enlat:5 exlat:5 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:4.20W operational enlat:30 exlat:30 rrt:1 rrl:1 rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:3.10W operational enlat:100 exlat:100 rrt:2 rrl:2 rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.0700W non-operational enlat:500 exlat:5000 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0050W non-operational enlat:2000 exlat:22000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:- active_power:-

Chris, the relevant code shouldn't be in 4.10 kernels at all. Can you provide the output of:

$ modinfo nvme_core
$ ls /sys/class/nvme/nvme0/power

The Samsung people working on this issue are thinking that it's possible that the bug isn't directly an APST problem and, if you're hitting it without APST, it could be an interesting data point.

Anyway, my plan is to make the quirk much, much broader for 4.11. I'm just hoping to hear back in the next day or two to see whether I should be quirking off APST on two particular Dell laptops or whether I should be quirking it off on the entire Samsung 950 line. So far, it does seem like the problem may be restricted to the two laptops in question.

I reinstalled 17.04 and I've been running for 3 hours without incident using the nvme_core kernel parameter above. I don't know if this has anything to do with the issue, but my system seems stable and it has not been for days. Here is the output with having run the nvme_core kernel parameter above:

filename: /lib/modules/4.10.0-19-generic/kernel/drivers/nvme/host/nvme-core.ko
version: 1.0
license: GPL
srcversion: 1BBEF320C053A2BA4284272
depends:
intree: Y
vermagic: 4.10.0-19-generic SMP mod_unload
parm: admin_timeout:timeout in seconds for admin commands (byte)
parm: io_timeout:timeout in seconds for I/O (byte)
parm: shutdown_timeout:timeout in seconds for controller shutdown (byte)
parm: max_retries:max number of retries a command may have (uint)
parm: nvme_char_major:int
parm: default_ps_max_latency_us:max power saving latency for new devices; use PM QOS to change per device (ulong)

I'll reboot and output the data without the kernel parameter and reply again in a couple of minutes.

On 2017-04-11 10:09 PM, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=195039
>
> --- Comment #9 from Andy Lutomirski (luto@kernel.org) ---
> Chris, the relevant code shouldn't be in 4.10 kernels at all. Can you provide the
> output of:
>
> $ modinfo nvme_core
> $ ls /sys/class/nvme/nvme0/power
>
> The Samsung people working on this issue are thinking that it's possible that
> the bug isn't directly an APST problem and, if you're hitting it without APST,
> it could be an interesting data point.
>
> Anyway, my plan is to make the quirk much, much broader for 4.11.
> I'm just hoping to hear back in the next day or two to see whether I should be
> quirking off APST on two particular Dell laptops or whether I should be quirking it
> off on the entire Samsung 950 line. So far, it does seem like the problem may be
> restricted to the two laptops in question.

Output of modinfo:

filename: /lib/modules/4.10.0-19-generic/kernel/drivers/nvme/host/nvme-core.ko
version: 1.0
license: GPL
srcversion: 1BBEF320C053A2BA4284272
depends:
intree: Y
vermagic: 4.10.0-19-generic SMP mod_unload
parm: admin_timeout:timeout in seconds for admin commands (byte)
parm: io_timeout:timeout in seconds for I/O (byte)
parm: shutdown_timeout:timeout in seconds for controller shutdown (byte)
parm: max_retries:max number of retries a command may have (uint)
parm: nvme_char_major:int
parm: default_ps_max_latency_us:max power saving latency for new devices; use PM QOS to change per device (ulong)

Output of 'ls /sys/class/nvme/nvme0/power'

async autosuspend_delay_ms control pm_qos_latency_tolerance_us runtime_active_kids runtime_active_time runtime_enabled runtime_status runtime_suspended_time runtime_usage

Awesome, I guess Ubuntu backported APST support. My current patch set to address this is here:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=nvme/power&id=37294ae0e942e9dc56e869af23cc6face284dec8

It's untested, and I won't have a chance to test until Tuesday.

I hit this same issue on my upgrade to Ubuntu 17.04. I downgraded again to Ubuntu 16.10 and everything was fine, until yesterday. I received a kernel update and apparently the changes are backported to their 4.8 kernel series, as I all of a sudden hit the bug there as well. Hardware: Dell XPS 9550

smartctl -x output
=== START OF INFORMATION SECTION ===
Model Number: PM951 NVMe SAMSUNG 512GB
Serial Number: S29PNXAGB11420
Firmware Version: BXV77D0Q
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Controller ID: 1
Number of Namespaces: 1
Namespace 1 Size/Capacity: 512,110,190,592 [512 GB]
Namespace 1 Utilization: 418,698,813,440 [418 GB]
Namespace 1 Formatted LBA Size: 512

If you need further information, please let me know

Could you try 4.11-rc8 or the test kernel here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184

Am I correct that when my arch install updates to kernel 4.11, I'll be able to remove nvme_core.default_ps_max_latency_us=0 as a kernel boot parameter?

Since 2017-05-13 I've had to run boot-repair 4 times to recover my system.
Each time I have a very slow shutdown (Mint exits, the screen is black, the power button stays lit for approx 30 seconds - much longer than usual), then on reboot I get a "missing HD" error from the BIOS. It auto-repairs itself to point at the Windows partition and then I only get a Windows boot. If I run boot-repair then I can get a grub that correctly boots back to Linux and Windows. I'm using a Dell XPS 9550, 32GB RAM, Samsung PM951 NVMe. The nvme firmware hasn't been changed since I bought the machine (over 1 year ago) and according to Samsung's site is the latest firmware. The change in state was upgrading from kernel 4.9.8 to 4.11, prior to 4.11 I've never seen this issue. In the last two weeks I was running an older BIOS (A06). A few days back I upgraded to BIOS A19 (the only reported stable BIOS with linux) and upgraded kernel 4.11 to 4.11.3. I've just had my 4th slow-shutdown and run of boot-repair. I believe 4.11.x is the common cause for this issue. Given the earlier reports I'm attaching some notes that I hope are useful, I'm happy to dig further if you give me some guidance. I'm running Linux Mint 18.1. Does anyone know if kernel 4.9 is still unaffected or if 4.12 fixes this? $ uname -a Linux ian-XPS-15-9550 4.11.3-041103-generic #201705251233 SMP Thu May 25 16:34:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux $ modinfo nvme_core filename: /lib/modules/4.11.3-041103-generic/kernel/drivers/nvme/host/nvme-core.ko version: 1.0 license: GPL srcversion: E78F732E1E5E7A40EEBCFD1 depends: intree: Y vermagic: 4.11.3-041103-generic SMP mod_unload parm: admin_timeout:timeout in seconds for admin commands (byte) parm: io_timeout:timeout in seconds for I/O (byte) parm: shutdown_timeout:timeout in seconds for controller shutdown (byte) parm: max_retries:max number of retries a command may have (uint) parm: nvme_char_major:int parm: default_ps_max_latency_us:max power saving latency for new devices; use PM QOS to change per device (ulong) $ sudo nvme id-ctrl /dev/nvme0 [sudo] password for ian: NVME Identify Controller: vid : 0x144d ssvid : 0x144d sn : S2FZNYAG801690 mn : PM951 NVMe SAMSUNG 1024GB fr : BXV76D0Q rab : 2 ieee : 002538 cmic : 0 mdts : 5 cntlid : 1 ver : 0 rtd3r : 0 rtd3e : 0 oaes : 0 oacs : 0x17 acl : 7 aerl : 3 frmw : 0x6 lpa : 0 elpe : 63 npss : 4 avscc : 0x1 apsta : 0x1 wctemp : 0 cctemp : 0 mtfa : 0 hmpre : 0 hmmin : 0 tnvmcap : 0 unvmcap : 0 rpmbs : 0 sqes : 0x66 cqes : 0x44 nn : 1 oncs : 0x1f fuses : 0 fna : 0 vwc : 0x1 awun : 255 awupf : 0 nvscc : 1 acwu : 0 sgls : 0 ps 0 : mp:6.00W operational enlat:5 exlat:5 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:- active_power:- ps 1 : mp:4.20W operational enlat:30 exlat:30 rrt:1 rrl:1 rwt:1 rwl:1 idle_power:- active_power:- ps 2 : mp:3.10W operational enlat:100 exlat:100 rrt:2 rrl:2 rwt:2 rwl:2 idle_power:- active_power:- ps 3 : mp:0.0700W non-operational enlat:500 exlat:5000 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:- active_power:- ps 4 : mp:0.0050W non-operational enlat:2000 exlat:22000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:- active_power:- $ ls /sys/class/nvme/nvme0/power async autosuspend_delay_ms control pm_qos_latency_tolerance_us runtime_active_kids runtime_active_time runtime_enabled runtime_status runtime_suspended_time runtime_usage Note on "(the only reported stable BIOS with linux)" - given my reading on reddit (/dell and /linux) it seems that the latest .20 and .25 BIOS have some issues for linux but .19 is widely reported to be stable. Given that .19 is recent, I've settled on it. Prior to that A06 has been fine a year. 
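For reference, the pm_qos_latency_tolerance_us entry in the sysfs listing above is the per-device interface that the default_ps_max_latency_us module parameter description refers to. A rough sketch of using it to restrict APST on a single controller at runtime, without touching the boot line (exact behaviour and accepted values may vary by kernel version):

$ cat /sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us    # "auto" means: follow the module-wide default
$ echo 0 | sudo tee /sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us
# a value of 0 should keep this controller out of its non-operational power states,
# roughly the per-device equivalent of booting with nvme_core.default_ps_max_latency_us=0
$ echo auto | sudo tee /sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us   # revert to the default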
Ian, sounds like you are impacted by the APST issue as well. Hopefully 4.12 will work better. Andy, what's the recommended work-around for 4.11 users? Ian, Kernel 4.11 has been working for me. I no longer have nvme_core.default_ps_max_latency_us=0 as a boot parameter and haven't had any issues with the SSD going into read-only mode as I was before. However, (and I don't know if this would have an impact), I switched to Arch from Ubuntu 16.10 last month around the same time I went from 4.10 to 4.11. Ian, the fix should have been: commit ff5350a86b20de23991e474e006e2ff2732b218e Author: Andy Lutomirski <luto@kernel.org> Date: Thu Apr 20 13:37:55 2017 -0700 nvme: Adjust the Samsung APST quirk and that made it in to 4.11. Jens, Chris, Andy - thanks for the quick response. I think I'm going to pop the base cover and re-seat the hd, maybe something is loose (and/or thermal related). Failing that I might regress to 4.9.8 to see if the problem persists. Thanks for removing this possibility. Cheers, Ian. I'll ask a follow-up in case it sparks any thoughts. One co-incidental factor seems to be that I have these shutdow-failures only after using my external monitor and suspending a few times. I don't recall (but don't have solid evidence) having this issue just from laptop suspends but if I switch to using my HDMI monitor a few times (with a cloned display) then that seems co-incident with this issue. Typically I use my laptop solo, sometimes I'll plug it into my home UHD monitor. I'll use and then unplug from the monitor several times over a week between deliberate laptop restarts. The specific behaviour is that after plugging in the HDMI monitor (after several successes) both the laptop and external screen offset the display by 50% (the left side starts in the middle of the screen, the middle wraps to the left edge and continues back to the middle of the screen). The mouse pointer moves but I can't click anything, keyboard shortcuts do nothing. I can swap to a console terminal (ctrl alt F1) and restart the mdm (Mint Display Manager) and I can swap back to Mint and continue. After a shutdown I have a long freeze, then I have a 'missing hd' on the next boot. This sounds far more like a Mint/display manager issue but exactly why it interferes with the SSD such that the BIOS does a recovery, and it only occurred since I switched to kernel 4.11, is a mystery. This behaviour might of course be caused by the same underlying problem or it might be coincidental. Possibly there's a HDMI/bus issue that's known to one of you? I'm only asking in case this jogs memories of a related BIOS/nvme bug. If not, I'll only repost back here if I make any progress on this issue. Cheers, Ian. Ian, this doesn't sound like an nvme problem at all. I'm guessing you have a graphics problem that's crashing the system in a way that annoys your BIOS. My Dell laptop (different model than yours) has an obnoxious feature in which, if it thinks something went wrong, it goes through a counterproductive recovery process. You can turn this off in the BIOS settings. I'm encountering the same issue on XPS 15 9550, however I upgraded the PM951 SSD to a larger PM961 one. I tried disabling only the lowest power state by setting nvme_core.default_ps_max_latency=2000 (the latency numbers are different for PM961, see below), however this didn't resolve the issue and I was still getting controller resets. I had to disable APST entirely by setting nvme_core.default_ps_max_latency=0 for it to work reliably. 
Unfortunately this causes a noticeable increase in power consumption of ~3-4W, which hurts battery life quite a bit. $ sudo nvme id-ctrl /dev/nvme0n1 NVME Identify Controller: vid : 0x144d ssvid : 0x144d sn : S36CNX0J302022 mn : SAMSUNG MZVLW1T0HMLH-000H1 fr : CXY70H1Q rab : 2 ieee : 002538 cmic : 0 mdts : 0 cntlid : 2 ver : 10200 rtd3r : 186a0 rtd3e : 4c4b40 oaes : 0 oacs : 0x7 acl : 7 aerl : 7 frmw : 0x16 lpa : 0x3 elpe : 63 npss : 4 avscc : 0x1 apsta : 0x1 wctemp : 350 cctemp : 353 mtfa : 50 hmpre : 0 hmmin : 0 tnvmcap : 1024209543168 unvmcap : 0 rpmbs : 0 sqes : 0x66 cqes : 0x44 nn : 1 oncs : 0x1f fuses : 0 fna : 0 vwc : 0x1 awun : 255 awupf : 0 nvscc : 1 acwu : 0 sgls : 0 subnqn : ps 0 : mp:7.60W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:- active_power:- ps 1 : mp:6.00W operational enlat:0 exlat:0 rrt:1 rrl:1 rwt:1 rwl:1 idle_power:- active_power:- ps 2 : mp:5.10W operational enlat:0 exlat:0 rrt:2 rrl:2 rwt:2 rwl:2 idle_power:- active_power:- ps 3 : mp:0.0400W non-operational enlat:210 exlat:1500 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:- active_power:- ps 4 : mp:0.0050W non-operational enlat:2200 exlat:6000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:- active_power:- The same bug seems to exist for the Dell xps 9560 and the pm961 it contains. I am running 4.14.13 currently but the bug was present on ubuntu 17.10's 4.11 and several versions in between. Like the poster above I have had to fully disable APST to achieve stability. While looking to add the 9560 to the quirk list in the kernel I saw that there is a check for a asus ryzen board and the 960 evo (which even has the same product id as the pm961) I am wondering if the intermittent nature of this bug means that it may be happening on ALL 960 evo/sm/pm 961's and we are only finding it piece meal? Does anyone have one of these drives and the a wide variety of hardware to test with? I can write the patch to add this combo to the quirks list, but there maybe deeper issues? I am not sure this is related but: I see sudden controller death as well on Dell XPS 15 9550 using Samsung 960 EVO 1TB, even with nvme_core.default_ps_max_latency_us=0 set. # nvme id-ctrl /dev/nvme0n1 NVME Identify Controller: vid : 0x144d ssvid : 0x144d sn : S3X3NF0JA01074T mn : Samsung SSD 960 EVO 1TB fr : 3B7QCXE7 rab : 2 ieee : 002538 cmic : 0 mdts : 9 cntlid : 2 ver : 10200 rtd3r : 7a120 rtd3e : 4c4b40 oaes : 0 oacs : 0x7 acl : 7 aerl : 3 frmw : 0x16 lpa : 0x3 elpe : 63 npss : 4 avscc : 0x1 apsta : 0x1 wctemp : 356 cctemp : 358 mtfa : 0 hmpre : 0 hmmin : 0 tnvmcap : 1000204886016 unvmcap : 0 rpmbs : 0 sqes : 0x66 cqes : 0x44 nn : 1 oncs : 0x1f fuses : 0 fna : 0x5 vwc : 0x1 awun : 255 awupf : 0 nvscc : 1 acwu : 0 sgls : 0 subnqn : ps 0 : mp:6.04W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:- active_power:- ps 1 : mp:5.09W operational enlat:0 exlat:0 rrt:1 rrl:1 rwt:1 rwl:1 idle_power:- active_power:- ps 2 : mp:4.08W operational enlat:0 exlat:0 rrt:2 rrl:2 rwt:2 rwl:2 idle_power:- active_power:- ps 3 : mp:0.0400W non-operational enlat:210 exlat:1500 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:- active_power:- ps 4 : mp:0.0050W non-operational enlat:2200 exlat:6000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:- active_power:- (In reply to Marvin W from comment #27) > I am not sure this is related but: > I see sudden controller death as well on Dell XPS 15 9550 using Samsung 960 > EVO 1TB, even with nvme_core.default_ps_max_latency_us=0 set. Hi Marvin - do you still see this bug? 
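A note on picking a non-zero value for nvme_core.default_ps_max_latency_us: as far as I can tell from the 4.11-era driver, a non-operational power state is only offered to the drive's autonomous transition table if its exit latency (the exlat figure from 'nvme id-ctrl') does not exceed the configured maximum, so a threshold can be read straight off the power state table. A worked example against the PM961 listing above (treat the exact comparison rule as an assumption; check your kernel's nvme_configure_apst() if it matters):

# PM961, from the id-ctrl output above:
#   ps 3 (non-operational): exlat = 1500 us
#   ps 4 (non-operational): exlat = 6000 us
# nvme_core.default_ps_max_latency_us=2000  ->  1500 <= 2000, so ps 3 stays enabled;
#                                               6000 >  2000, so ps 4 (the deepest state) is excluded
# nvme_core.default_ps_max_latency_us=6000  ->  both ps 3 and ps 4 are allowed
# nvme_core.default_ps_max_latency_us=0     ->  APST is not enabled at all
# (note the parameter name ends in "_us"; a misspelled parameter simply won't take effect)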
I've filed a very similar report for my 1TB PM951 NVMe in a Dell XPS 9550: https://bugzilla.kernel.org/show_bug.cgi?id=201811 Specifically on 4.19.0 using nvme_core.default_ps_max_latency_us=0 gets me almost-no read-only failures, except for the one that occurred yesterday and that was the first in over 10 days. I wonder if you solved this issue? Using 4.9.91 I didn't have this issue. I still see a similar problem with latest kernel on Fedora 29 on Lenovo T580 (latest BIOS 1.18) with Samsung 970 EVO drive. After few hours, especially with low battery condition or after resume from suspend, system starts output I/O errors and is not able to read from or write to the drive. I tried the workaround by setting nvme_core.default_ps_max_latency_us=5500 but the issue still resurfaces. > dmesg Dec 17 18:25:20 skyline.origin kernel: EXT4-fs (dm-1): I/O error while writing superblock Dec 17 18:25:20 skyline.origin kernel: EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detect> Dec 17 18:25:20 skyline.origin kernel: EXT4-fs (dm-1): Remounting filesystem read-only Dec 17 18:25:20 skyline.origin kernel: JBD2: Error -5 detected when updating journal superblock for dm-> Dec 17 18:25:20 skyline.origin kernel: Buffer I/O error on dev dm-1, logical block 0, lost sync page wr> Dec 17 18:25:20 skyline.origin kernel: EXT4-fs (dm-1): I/O error while writing superblock Dec 17 18:25:20 skyline.origin kernel: EXT4-fs error (device dm-1) in __ext4_new_inode:982: Journal has> Dec 17 18:25:20 skyline.origin kernel: EXT4-fs error (device dm-1) in __ext4_new_inode:940: Journal has> > uname -a Linux skyline.origin 4.19.8-300.fc29.x86_64 #1 SMP Mon Dec 10 15:23:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux > cat /proc/cmdline BOOT_IMAGE=/vmlinuz-4.19.8-300.fc29.x86_64 root=/dev/mapper/origin-root ro resume=/dev/mapper/origin-swap rd.luks.uuid=luks-e4dad99e-4f78-45ea-a01c-90f0aedbff5b rd.lvm.lv=origin/root rd.lvm.lv=origin/swap rhgb quiet nvme_core.default_ps_max_latency_us=5500 > sudo nvme id-ctrl /dev/nvme0n1 NVME Identify Controller: vid : 0x144d ssvid : 0x144d sn : S466NX0KA20403K mn : Samsung SSD 970 EVO 500GB ... ps 0 : mp:6.20W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:- active_power:- ps 1 : mp:4.30W operational enlat:0 exlat:0 rrt:1 rrl:1 rwt:1 rwl:1 idle_power:- active_power:- ps 2 : mp:2.10W operational enlat:0 exlat:0 rrt:2 rrl:2 rwt:2 rwl:2 idle_power:- active_power:- ps 3 : mp:0.0400W non-operational enlat:210 exlat:1200 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:- active_power:- ps 4 : mp:0.0050W non-operational enlat:2000 exlat:8000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:- active_power:- As a follow up to #29, I just had another drive failure with the following log output:
> dmesg
[ 4507.245989] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[ 4507.267985] nvme 0000:40:00.0: enabling device (0000 -> 0002)
[ 4507.268283] nvme nvme0: Removing after probe failure status: -19
[ 4507.281449] print_req_error: I/O error, dev nvme0n1, sector 201288984
[ 4507.281479] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 5641230 (offset 0 size 0 starting block 22749603)
[ 4507.281484] Buffer I/O error on device dm-1, logical block 22749603
[ 4507.281494] Buffer I/O error on device dm-1, logical block 22749604
[ 4507.281497] Buffer I/O error on device dm-1, logical block 22749605
[ 4507.281500] Buffer I/O error on device dm-1, logical block 22749606
[ 4507.281503] Buffer I/O error on device dm-1, logical block 22749607
[ 4507.281506] Buffer I/O error on device dm-1, logical block 22749608
[ 4507.281508] Buffer I/O error on device dm-1, logical block 22749609
[ 4507.281511] Buffer I/O error on device dm-1, logical block 22749610
[ 4507.281529] print_req_error: I/O error, dev nvme0n1, sector 201289304
[ 4507.281543] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 5641230 (offset 0 size 0 starting block 22749643)
[ 4507.281546] Buffer I/O error on device dm-1, logical block 22749643
[ 4507.281550] Buffer I/O error on device dm-1, logical block 22749644
[ 4507.281562] print_req_error: I/O error, dev nvme0n1, sector 201290776
[ 4507.281575] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 5641230 (offset 0 size 0 starting block 22749827)
[ 4507.281591] print_req_error: I/O error, dev nvme0n1, sector 201291096
[ 4507.281605] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 5641230 (offset 0 size 0 starting block 22749867)
[ 4507.281618] print_req_error: I/O error, dev nvme0n1, sector 201291736
[ 4507.281632] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 5641230 (offset 0 size 0 starting block 22749947)
[ 4507.281647] print_req_error: I/O error, dev nvme0n1, sector 201292760
[ 4507.281660] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 5641230 (offset 0 size 0 starting block 22750075)
[ 4507.281672] print_req_error: I/O error, dev nvme0n1, sector 201292952
[ 4507.281685] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 5641230 (offset 0 size 0 starting block 22750099)
[ 4507.281699] print_req_error: I/O error, dev nvme0n1, sector 201293784
[ 4507.281713] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 5641230 (offset 0 size 0 starting block 22750203)
[ 4507.281726] print_req_error: I/O error, dev nvme0n1, sector 201294232
[ 4507.281738] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 5641230 (offset 0 size 0 starting block 22750259)
[ 4507.281754] print_req_error: I/O error, dev nvme0n1, sector 58625640
[ 4507.282124] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 5641314 (offset 0 size 0 starting block 4916685)
[ 4507.282759] Aborting journal on device dm-1-8.
[ 4507.282773] EXT4-fs error (device dm-1) in ext4_free_blocks:4942: Journal has aborted
[ 4507.282778] Buffer I/O error on dev dm-1, logical block 15, lost async page write
[ 4507.282793] Buffer I/O error on dev dm-1, logical block 32, lost async page write
[ 4507.282808] Buffer I/O error on dev dm-1, logical block 22544389, lost async page write
[ 4507.282820] Buffer I/O error on dev dm-1, logical block 22544401, lost async page write
[ 4507.282828] Buffer I/O error on dev dm-1, logical block 22544402, lost async page write
[ 4507.282837] Buffer I/O error on dev dm-1, logical block 22544693, lost async page write
[ 4507.282850] Buffer I/O error on dev dm-1, logical block 22544733, lost async page write
[ 4507.282861] Buffer I/O error on dev dm-1, logical block 22544736, lost async page write
[ 4507.282866] Buffer I/O error on dev dm-1, logical block 33587200, lost sync page write
[ 4507.282878] Buffer I/O error on dev dm-1, logical block 22544742, lost async page write
[ 4507.282897] JBD2: Error -5 detected when updating journal superblock for dm-1-8.
[ 4507.282917] EXT4-fs (dm-1): I/O error while writing superblock
[ 4507.282929] EXT4-fs error (device dm-1) in ext4_do_update_inode:5310: Journal has aborted
[ 4507.282971] EXT4-fs error (device dm-1) in ext4_reserve_inode_write:5846: Journal has aborted
[ 4507.283027] EXT4-fs (dm-1): Delayed block allocation failed for inode 7344491 at logical offset 0 with max blocks 1 with error 30
[ 4507.283034] EXT4-fs (dm-1): This should not happen!! Data will be lost
[ 4507.283044] EXT4-fs error (device dm-1) in ext4_writepages:2877: Journal has aborted
[ 4507.283070] EXT4-fs (dm-1): I/O error while writing superblock
[ 4507.283078] EXT4-fs error (device dm-1) in ext4_do_update_inode:5310: Journal has aborted
[ 4507.283176] EXT4-fs (dm-1): previous I/O error to superblock detected
[ 4507.283204] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[ 4507.283212] EXT4-fs (dm-1): Remounting filesystem read-only
[ 4507.283309] EXT4-fs (dm-1): I/O error while writing superblock
[ 4507.283423] EXT4-fs (dm-1): I/O error while writing superblock
[ 4507.283433] EXT4-fs error (device dm-1) in ext4_evict_inode:258: Journal has aborted
[ 4507.283437] EXT4-fs error (device dm-1) in ext4_ext_remove_space:3061: Journal has aborted
[ 4507.283508] EXT4-fs (dm-1): I/O error while writing superblock
[ 4507.283514] EXT4-fs (dm-1): previous I/O error to superblock detected
[ 4507.283597] EXT4-fs error (device dm-1) in ext4_orphan_del:2901: Journal has aborted
[ 4507.283716] EXT4-fs error (device dm-1) in ext4_do_update_inode:5310: Journal has aborted
[ 4507.283961] JBD2: Detected IO errors while flushing file data on dm-1-8
[ 4507.294579] nvme nvme0: failed to set APST feature (-19)
[ 4507.405329] EXT4-fs error (device dm-1): ext4_find_entry:1439: inode #5505093: comm gnome-shell: reading directory lblock 0
[ 4640.444653] systemd-journald[847]: Failed to write entry (21 items, 635 bytes), ignoring: Read-only file system
[ 4640.444688] systemd-journald[847]: Failed to write entry (21 items, 740 bytes), ignoring: Read-only file system
[ 4640.444741] systemd-journald[847]: Failed to write entry (21 items, 635 bytes), ignoring: Read-only file system
[ 4640.444776] systemd-journald[847]: Failed to write entry (21 items, 740 bytes), ignoring: Read-only file system
[ 4640.444819] systemd-journald[847]: Failed to write entry (21 items, 635 bytes), ignoring: Read-only file system
[ 4640.444900] systemd-journald[847]: Failed to write entry (21 items, 740 bytes), ignoring: Read-only file system
[ 4640.444933] systemd-journald[847]: Failed to write entry (21 items, 635 bytes), ignoring: Read-only file system
[ 4640.444964] systemd-journald[847]: Failed to write entry (21 items, 740 bytes), ignoring: Read-only file system
[ 4641.345610] systemd-journald[847]: Failed to write entry (21 items, 635 bytes), ignoring: Read-only file system
[ 4641.345782] systemd-journald[847]: Failed to write entry (21 items, 740 bytes), ignoring: Read-only file system
Hi Sebastian. You might want to try to disable APST completely with `nvme_core.default_ps_max_latency_us=0`. For me this reduced the frequency of the read-only errors with 4.19.0 and 4.19.7 to once per week - but it didn't remove it. I've reverted to 4.9.91 and I no longer have these errors. Use `sudo nvme get-feature -f 0x0c -H /dev/nvme0 | grep APSTE` and you should then see: ` Autonomous Power State Transition Enable (APSTE): Disabled` if you've set the max latency to 0 (on a reboot). I actually added that line to GRUB. I've had no feedback in my post where I've noted the few things I tried with my 1TB PM951 and 4.19.0 and 4.19.7: https://bugzilla.kernel.org/show_bug.cgi?id=201811 Thanks Ian. I have switched to nvme_core.default_ps_max_latency_us=0 after the last crash and left laptop running overnight. Unfortunately this morning I still ran into drive I/O errors so had to do a hard reboot. APST was disabled.
Is there anything else I can try?
> sudo nvme get-feature -f 0x0c -H /dev/nvme0
[sudo] password for raytracer:
get-feature:0xc (Autonomous Power State Transition), Current value:00000000
Autonomous Power State Transition Enable (APSTE): Disabled
Auto PST Entries .................
Entry[ 0]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[ 1]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
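Regarding "Is there anything else I can try?": a few generic data-gathering steps that might help show whether this is an APST/power issue or a PCIe link problem (a sketch only; substitute your own device node and PCI address):

$ dmesg | grep -iE 'nvme|pcieport' | grep -iE 'controller is down|probe failure|AER|reset'
$ sudo nvme smart-log /dev/nvme0     # media/integrity error counters, temperature, unsafe shutdowns
$ sudo nvme error-log /dev/nvme0     # the controller's own error log entries, if any survive the reset
$ lspci | grep -i 'non-volatile'     # find the controller's PCI address, then:
$ sudo lspci -vvv -s <pci-address> | grep -E 'LnkSta|ASPM|AER'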
I don't know of anything else - at this point (with 2 read-only failures with APST disabled on 4.19.x) I reverted to 4.9.91. Sometime after 4.9 I remember reading about NVME Autonomous Power State Transition code updates. I see a reference to this for 4.11: https://kernelnewbies.org/Linux_4.11 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c5552fde102fcc3f2cf9e502b8ac90e3500d8fdf and this is when I started to note others with Samsung quirk issues. Some of these had quirks added to the kernel e.g. look for "PM951 NVMe SAMSUNG 512GB" here for 4.11 and 4.12: https://lore.kernel.org/patchwork/patch/781598/ and I'll note that this is the same model but different size to my drive (I have the 1TB PM951). After this I didn't see others having repeat issues with the same units. I _might_ be a 1TB specific issue, somehow connected to the APST code updates, that affects your 970 and my 950 as we're both on 1TB drives? i.e. this might be a subtly different but related bug. I also suspect fewer of us have the 1TB drive so it'll crop up less frequently. Unfortunately nobody has commented on my other kernel.org bug report so I've stepped back from the current kernel. One thing you might try is to run whatever was the latest 4.10 for maybe a week, then 4.11, to see if one of those introduces the read-only problem. That'd help us isolate where things started to break. I'm assuming that as others update from older kernels we'll see this 1TB issue affecting more people. We'd want to keep an eye on the launchpad bug report too as others note their issues e.g. https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184/comments/107 I have tried other kernels over the last year but after 4.9.91 I had other issues - e.g. >4.9.91 I might lose wifi (firmware issues) and some kernel lines wouldn't even boot for me. I was happy when it looked like I could upgrade to 4.19 as I figured I could stay current for a while, ho hum. I don't think the issue is limited to 1TB drives. I had regularly experienced it with Samsung 256GB NVME drive that came with Lenovo T580. I thought these issues were related to one bad drive, so last week replaced it with brand new Samsung 970 Evo 512GB NVME drive. However, the problems with sudden controller crashes remain (even with APST disabled). This is with latest and greatest kernel/bios/firmware. Unfortunately downgrading kernel isn't an option for me as I use Thunderbolt 3 heavily which doesn't have great support on older kernels. (In reply to Ian Ozsvald from comment #33) > I don't know of anything else - at this point (with 2 read-only failures > with APST disabled on 4.19.x) I reverted to 4.9.91. We seem to know that the drive exhibit the issue with 4.19 and not with 4.9. Has anyone tried to run kernels in between? My laptop runs with this drive: SAMSUNG MZVLB1T0HALR-000L7 and I've never had any issues, always running bleeding edge on it (current -rc + changes queued up for the next kernel). It'd be interesting to try and get more data points on versions that work / don't work. I may just have to go and get one of the problematic drives and see if I can reproduce. Hi Jens. 
Looking back at my notes for 6 months I see:

4.9.8, 4.9.45, 4.9.66, 4.9.91 - all fine, no HD issues
4.9.119 had a boot fail with "linux-headers-4.9.119-0409119-generic depends on libssl1.1 (>= 1.1.0);" which I haven't pursued, also iwlwifi failed - this did boot but since wifi didn't work, I didn't do any particular testing so I can't confirm if the nvme bug exists here
4.9.135 didn't get beyond "loading initial ramdisk" on boot, didn't diagnose any further
4.10 didn't try
4.11.12, 4.11.7, 4.11.0 each had "issues with SSD" - again the "long shutdown issue" noted below
4.12.4 didn't try "intel wifi not supported for Intel 8260" - unrelated issue
4.13 didn't try
4.14.1 had a "long shutdown issue" - this might be a different bug, these "long shutdowns" take 30 seconds to complete a shutdown, on the next boot "hard drive not present", on a 5 second power off->power on the hard drive is back and the next boot is successful. I had lots of these on various kernels (maybe 20+ experiences), very annoying, presumably a HD issue but may or may not be this power-saving bug
4.15.0 had a "long shutdown issue" on each shutdown
4.16-4.18 I slightly gave up and didn't try these, hoping that the upcoming 4.19 LTS would solve other issues

I'll note that I didn't note above that I had video issues with my NVIDIA card and a USB-C DisplayPort cable (I kept falling back on, and continue to use, HDMI). Between failing nvme, graphics issues and wifi issues I kept searching for a common denominator that'd just let me work. I'm *very* happy to try good ideas if it helps us narrow down what's going on.

I have been able to reproduce the issue on the following kernels: 4.18.16, 4.19.5, 4.19.8. One thing that I just realized after Jens said that he never had issues with his Samsung drive is that I started noticing all these failures around the time I upgraded to Fedora 29 (4.19 kernel) and the latest ThinkPad T580 BIOS (1.18). I don't recall having/noticing these issues with Fedora 28 and an older BIOS. Since changing drives and disabling APST didn't fix my issue and downgrading to 4.18 didn't do it, there is one variable I haven't tried yet - downgrading BIOS. I'll try that tonight and report back tomorrow.

For completeness, I'm using "Dell XPS 15 9550 1.2.19 System BIOS" from January 2017. Back in May 2017 (the last time I was looking at the BIOS) the later versions all exhibited some trouble with Linux, 1.2.19 was the known-good BIOS so I stuck with it. There's a bigger range of options now. I wonder if anyone coming through here could report which BIOS they have on a Dell 9550, their drive and whether they do or don't have problems?

Just a quick update as the Lenovo BIOS downgrade path looks promising so far. After 16 hours the system is still up and running. Typically I would run into multiple failures by now especially after coming out of suspend state overnight. I'm not sure how BIOS interacts with NVME devices attached to the system once the kernel takes over. The only thing that comes to my mind is that some ACPI tables set up by the BIOS get messed up causing havoc in the rest of the system. For the sake of clarity, the change applied was to downgrade Lenovo T580 BIOS from version 1.18 to 1.16.

It appears that BIOS may have been the culprit in my case. The system has been stable for the past 24 hours. I still run with the nvme_core.default_ps_max_latency_us=0 setting but I am a bit hesitant to remove it as I'm a big fan of the newly reacquired system stability.
In any case, Lenovo's ThinkPad T580 BIOS version 1.18 is officially on the not so favorable list. Thanks Ian and Jens for the support and guidance. Cheers! Sebastian just note that I had failures with 4.19.0 and 4.19.7 on the order of once per week with APST disabled, so 24 hours may not be long enough to rule this out. Looking at BIOS updates beyond my 1.2.19 I see that some NVME updates were released so I'm now mulling getting my BIOS updated to the latest to see what happens. I'll make a decision on this next week. I'll keep testing the current setup for a few more days and next I'll try running without 'nvme_core.default_ps_max_latency_us=0' to see if the issue is gone. So far so good though, before I would not be able to run for more than a few hours without crashing the drive. @Ian If you are adventurous (insert caution here), I would definitely give BIOS update a shot. Alternatively you could wait a week or two and see how I do with my current "fix" to see if it makes sense to mess with that in the first place. 24 hours of success - I've upgraded my BIOS from 1.2.19 (Jan 2017) to 1.9.0 (Oct 2018). In 24 hours using kernel 4.19.8 I've not had a read-only failure. Previously with 1.2.19 and kernel 4.19.8 I'd expect a read-only failure within a couple of hours. I've suspended several times and used my HDMI monitor, everything seems to be working fine. Tentatively I'd say that the BIOS update has fixed the issue. At least two intervening BIOS updates (which I hadn't applied) had NVME updates mentioned in their notes. I'm still using 'nvme_core.default_ps_max_latency_us=0' and will keep this for at least a week or so, to give me confidence that this configuration is stable. Just in case it is relevant I'll note that `fwupdmgr get-updates` shows that `Thunderbolt NVM for Xps Notebook 9550` can be upgraded from the current v12 to v16. I have used a USB-C to DisplayPort connector (which is physically the same as Thunderbolt 3), but I believe by using USB C I don't touch the Thunderbolt driver (anyone disagree?). I'll update this in a week or so, again I don't want to change anything else until I know that I trust my system. Sebastian - how are things for you? I think my issue is ultimately resolved but it required a motherboard replacement. After I did the BIOS downgrade, things improved slightly before they got worse. One day the drive failed and the system was no longer recognizing the drive or any other drive I put in the system. Lenovo diagnosed the issue to be a bad motherboard and replaced it. After motherboard replacement, the drive has been working great. It's been 4 days without any crashes. After reading a bunch more of this thread, I'm not at all convinced that this is an APST problem. It sounds like you're seeing failures with APST off and you're seeing more failures with APST on. So APST is probably just changing the frequency with which the problem is triggered. Off the top of my head, you might want to fiddle with your PCIe ASPM settings to see if there's any effect. Do you have pcie_aspm=force set? For what it's worth, the one APST-related failure that was fully root-caused that I know of turned out to be a design issue in the motherboard that caused ASPM exit to fail sometimes under certain conditions. Enabling APST makes deep ASPM states much more likely. I'm at 24 hours now using 4.19.8 with BIOS 1.9.0 (the new BIOS for my XPS 9550), with 'nvme_core.default_ps_max_latency_us=0' _disabled_ (i.e. removed from GRUB) and I've had no read-only failures. 
Previously I'd have had a failure within 30 minutes. Andy - I think you're right, I suspect my out of date BIOS was the root cause of my issue. To confirm - APST is enabled and I've had no failures: $ sudo nvme get-feature -f 0x0c -H /dev/nvme0 get-feature:0xc (Autonomous Power State Transition), Current value:0x000001 Autonomous Power State Transition Enable (APSTE): Enabled Tentatively I think I can say that the BIOS update might have solved this. I'll report back in a few days. After 5 days (without a reboot) using 4.19.8 with the new BIOS (1.9.0), either leaving it on or briefly using Suspend, I'm happy that the BIOS upgrade has solved the read-only problem. The machine is now stable. However - I've rebooted and got hit by the old dreaded "no bootable medium found". The solution, as before (with the old BIOS), was a hard power off (5 seconds on the power button), after that on a fresh boot the harddrive was magically present again. On a fresh boot I then asked for a restart after logging in, on that reboot I also got "no bootable medium found". Again a hard power off and power cycle solves the issue. I'm going to run as-is for a few days, then may try adding 'nvme_core.default_ps_max_latency_us=0' back to GRUB as a test as 'losing' the primary hard drive on a reboot doesn't make me happy. I'm open to any other ideas.. To update the above - I continue with 4.19.8 on BIOS 1.9.0. On one reboot I had the "no bootable medium found", on several other reboots I've had no issues. I didn't add 'nvme_core.default_ps_max_latency_us=0' back to GRUB as I think the BIOS update has solved most of the issues. I did note that my Thunderbolt driver was out of date, I had v12 and v16 was available. I've just updated it today. Possibly there was another historic bug that somehow interfered with the system due to this, I'll continue to monitor it. Hello, I want to join the discussion. I have Dell Latitude E7470 with LiteOn 500 Gb SSd, and I also contantly experience the problem you discuss. I have managed to temporary cure it by fresh install of Mint 19, but it appeared again and again after even minor updates. I used TimeShift to go back and it helped several times, but not this time. I have just updated my BIOS to 1.25 version, but, still, no luck. I will now check if Ubuntu 18.04 would run ok on 4.9.91 kernel. So far, I am sticking to the solution of using 4.9.91 kernel. Nothing else has worked for me. It's very confusing. Looks like the kernel team has fixed Samsung, but broken something else in the kernel... forever. And, I am still getting the problem after 3 days... I'm attaching this for the record - my bug (late 2018 to Feb 2019) went away after I upgraded my Dell 9550 BIOS to 1.9.0 (and possibly by upgrading Thunderbolt - see my posts above). I'm running kernel 4.19.8. I post this here just for the record, in case it helps others. My earlier bug reports are above in this thread. $ sudo nvme id-ctrl /dev/nvme0 [sudo] password for ian: NVME Identify Controller: vid : 0x144d ssvid : 0x144d sn : S2FZNYAG801690 mn : PM951 NVMe SAMSUNG 1024GB fr : BXV76D0Q rab : 2 ieee : 002538 ... 
ps 0 : mp:6.00W operational enlat:5 exlat:5 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:- active_power:- ps 1 : mp:4.20W operational enlat:30 exlat:30 rrt:1 rrl:1 rwt:1 rwl:1 idle_power:- active_power:- ps 2 : mp:3.10W operational enlat:100 exlat:100 rrt:2 rrl:2 rwt:2 rwl:2 idle_power:- active_power:- ps 3 : mp:0.0700W non-operational enlat:500 exlat:5000 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:- active_power:- ps 4 : mp:0.0050W non-operational enlat:2000 exlat:22000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:- active_power:- $ uname -a Linux ian-XPS-15-9550 4.19.8-041908-generic #201812080831 SMP Sat Dec 8 13:34:18 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux System BIOS was upgraded to 1.9.0 - I believe this is the thing that fixed the NVMe issues. Ian. I have no thunderbolt on Dell Latitude E7470 and have updated to latest BIOS. Still, even the old kernel have not solved my problem. Setting NVME kernel option to zero does not change anything. I am already tired of it and writing this message from Windows 7. Thanks, Sergey Just an update: it might be the hardware problem of contact in SSD connection to motherboard. I have got the same error in Windows. Now, I have cleaned the contacts with propanol. Let's see if it helps. To confirm, my issue was resolved, now I am running fine on the latest kernel. The lesson is: prior to trying to replace SSD or Motherboard, try to clean the connection thereof. In my case, it seems like I need to clean this connection ones a year.... Shame at Dell build quality. Best wishes, Sergey Adding another datapoint for others running into this issue: I'm on a desktop with a samsung 970 Evo 2TB nvme ssd: 01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 I started running into this issue out of the blue. After re-seating the nvme ssd, the issue seems to have gone away - I will update if it starts to happen again. Same issue here. Samsung 960 Evo Series (OEM) 1TB NVMe M.2 NGFF SSD PCIe 3.0 x4 80mm - (PM961) SAMSUNG MZVLW1T0HMLH-00000 S/N: S2U3NX0HC05293 FW: CXY7301Q I can confirm this is happening with 5.3.0-22-generic https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1852479 +-1d.0-[03]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 My laptop just crashes randomly. Disbling AER did not solve the problem. Linux pop-os 5.3.0-22-generic #24+system76~1573659475~19.10~26b2022-Ubuntu SMP Wed Nov 13 20:0 x86_64 x86_64 x86_64 GNU/Linux NVME Identify Controller: vid : 0x144d ssvid : 0x144d sn : S444NY0K600040 mn : SAMSUNG MZVLB256HAHQ-00000 fr : EXD7101Q rab : 2 ieee : 002538 ... ps 0 : mp:7.02W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:- active_power:- ps 1 : mp:6.30W operational enlat:0 exlat:0 rrt:1 rrl:1 rwt:1 rwl:1 idle_power:- active_power:- ps 2 : mp:3.50W operational enlat:0 exlat:0 rrt:2 rrl:2 rwt:2 rwl:2 idle_power:- active_power:- ps 3 : mp:0.0760W non-operational enlat:210 exlat:1200 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:- active_power:- ps 4 : mp:0.0050W non-operational enlat:2000 exlat:8000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:- active_power:- I'm not sure if I have the exact same problem as OP, but I've been struggling with NVMe stability issues ever since I put together this Ryzen desktop computer around July 2019. My problems were originally on a "Crucial P1 500GB" NVMe drive, but I just swapped over to a new "Samsung 970 EVO 1TB" by cloning data, and I'm still seeing similar issues. General Hardware Specs: Motherboard: ASUS Prime B450 Plus motherboard running BIOS rev. 
"2008" (also had issues on rev. 1804) CPU: AMD Ryzen 2700X GPU: NVIDIA Corporation TU116 [GeForce GTX 1660] Current drive: Samsung 970 EVO 1TB Model: MZ-V7E1T0BW Controller: SM981/PM981 Firmware: 2B2QEXE7 Previous drive (with basically same problems): Crucial P1 500GB Model: CT500P1SSD8 Controller: SM2263EN Firmware: P3CR013 The system can be stable for days or weeks on end, as long as I don't put it under particularly heavy sustained load (CPU mainly?). I have VERY repeatable results of AER errors showing up in dmesg just seconds after starting a specific workload: "mprime" executable (Linux version of "Prime95" from mersenne.org), specifically computing "P-1" aka "PM1" type of workunits. I'd posted my issues on various forums but haven't been able to solve this. So I had basically gave up on running "mprime" on this computer and mostly forgot about the problems for a few months, until recently I needed to use another application which seems is triggering these same type of errors again (Intel "Quartus Prime Lite" EDA tools, for FPGA development) I initially had tried disabling ASPM via kernel boot command line "pcie_aspm=off", as a recommended "solution" to my kernel logs being filled with spam from NVIDIA gpu. Errors involving: "[12] Timeout", "[ 6] BadTLP", and "[ 7] BadDLLP". Doing this got rid of those messages from GPU, but caused the NVMe to go into some unrecoverable state, at which point it would try to remount the drive as read only (also would show "BTRFS" errors when i'm only using EXT4?) Here is a snippet of kernel log from when I had ASUS 1804 BIOS, and Crucial P1 500GB SSD, with "pcie_aspm=off", where it was unable to reset NVMe: [ 989.409598] perf: interrupt took too long (4979 > 4912), lowering kernel.perf_event_max_sample_rate to 40000 [ 1195.031765] fuse: init (API version 7.31) [ 1327.328770] perf: interrupt took too long (6268 > 6223), lowering kernel.perf_event_max_sample_rate to 31750 [ 2238.284260] perf: interrupt took too long (7846 > 7835), lowering kernel.perf_event_max_sample_rate to 25250 [ 9117.462381] perf: interrupt took too long (9831 > 9807), lowering kernel.perf_event_max_sample_rate to 20250 [ 9261.476036] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff [ 9261.603999] pci_raw_set_power_state: 19 callbacks suppressed [ 9261.604009] nvme 0000:01:00.0: Refused to change power state, currently in D3 [ 9261.604430] nvme nvme0: Removing after probe failure status: -19 [ 9261.632241] print_req_error: I/O error, dev nvme0n1, sector 15247304 flags 100001 [ 9261.632255] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0 [ 9261.729511] nvme nvme0: failed to set APST feature (-19) [ 9261.739582] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0 [ 9261.739591] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0 [ 9261.739595] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0 [ 9261.756670] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 1, flush 0, corrupt 0, gen 0 [ 9261.756951] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 2, flush 0, corrupt 0, gen 0 [ 9261.758061] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 3, flush 0, corrupt 0, gen 0 [ 9261.758368] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 4, flush 0, corrupt 0, gen 0 [ 9261.759112] BTRFS error (device nvme0n1p2): 
bdev /dev/nvme0n1p2 errs: wr 4, rd 5, flush 0, corrupt 0, gen 0 [ 9261.759138] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 6, flush 0, corrupt 0, gen 0 [ 9262.276359] Core dump to |/bin/false pipe failed [ 9262.336595] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window] [ 9262.336817] caller _nv000939rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs [ 9262.975980] snd_hda_codec_hdmi hdaudioC0D0: HDMI: invalid ELD data byte 62 [ 9263.012987] Core dump to |/bin/false pipe failed [ 9263.015801] Core dump to |/bin/false pipe failed After re-enabling ASPM kernel boot parameter, and upgrading BIOS to latest "2008" revision I got messages like this (still on Crucial P1): [ 3203.674000] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0 [ 3203.674052] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID) [ 3203.674076] pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00001000/00006000 [ 3203.674081] pcieport 0000:00:03.1: AER: [12] Timeout [ 3205.713683] pcieport 0000:00:01.1: AER: Uncorrected (Fatal) error received: 0000:01:00.0 [ 3205.713694] nvme 0000:01:00.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent ID) [ 3205.713709] nvme nvme0: frozen state error detected, reset controller [ 3206.820214] pcieport 0000:00:01.1: AER: Root Port link has been reset [ 3206.820265] nvme nvme0: restart after slot reset [ 3206.963050] nvme nvme0: 15/0/0 default/read/poll queues [ 3206.963296] pcieport 0000:00:01.1: AER: Device recovery successful [ 3207.692447] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0 [ 3207.692464] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID) [ 3207.692470] pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00001000/00006000 [ 3207.692472] pcieport 0000:00:03.1: AER: [12] Timeout [ 3208.608352] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0 [ 3208.608370] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID) [ 3208.608378] pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00000040/00006000 [ 3208.608381] pcieport 0000:00:03.1: AER: [ 6] BadTLP [ 3210.904689] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0 [ 3210.904707] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID) [ 3210.904716] pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00001000/00006000 [ 3210.904719] pcieport 0000:00:03.1: AER: [12] Timeout [ 3211.260459] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0 [ 3211.260493] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID) [ 3211.260514] pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00000040/00006000 [ 3211.260519] pcieport 0000:00:03.1: AER: [ 6] BadTLP At some point I also tried "nvme_core.default_ps_max_latency_us=0" while on Crucial drive, which at best may have reduced the frequency of the problem occuring, but still eventually had crashing controller issues under loads. I suspected the Crucial drive had some unresolved controller firmware bugs, so I thought upgrading to a different brand with a new Samsung 970 EVO would help. I used clonezilla to copy the partition data over and grow to fit the new drive. 
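Following up on Andy's earlier question about PCIe ASPM: before and after changing pcie_aspm= options it can be worth checking which policy and per-link states are actually in effect. A rough sketch (01:00.0 and 00:01.1 are the NVMe controller and root port seen in the lspci/AER output above; adjust the addresses for your system):

$ cat /sys/module/pcie_aspm/parameters/policy        # the bracketed entry is the active policy
$ sudo lspci -vvv -s 01:00.0 | grep -E 'LnkCap|LnkCtl|LnkSta'   # ASPM states advertised vs. actually enabled
$ sudo lspci -vvv -s 00:01.1 | grep -E 'LnkCap|LnkCtl|LnkSta'   # same for the upstream root port
# pcie_aspm=off or pcie_aspm.policy=performance on the kernel command line are further knobs to test with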
Geting the clone to work without errors is a whole story in itself but I'll try to keep it short. Having only one M.2 slot on my motherboard, I was using a NVMe to USB 3.1 Gen 2 (up to 10Gbps) adapter device by mfgr "SSK". It failed to clone multiple times(some errors about "UAS" iirc) when plugged into my motherboard's USB 3.1 Gen 2 ports. Then I try swapped the USB adapter to a different port, supporting only USB 3.1 Gen 1 (up to 5Gbps), and that suceeded with 0 errors on the first try. So after booting up the new Samsung drive, I tried my high load mprime test and saw the same types of errors: (The high load process wasn't actually started until around 350s. No idea if first 2 lines are relevant or a problem in any way, but I'm including those "errors" just in case.) [ 194.587710] ucsi_ccg 0-0008: failed to reset PPM! [ 194.587734] ucsi_ccg 0-0008: PPM init failed (-110) ... [ 357.259829] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0 [ 357.259847] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID) [ 357.259855] pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00001000/00006000 [ 357.259857] pcieport 0000:00:03.1: AER: [12] Timeout [ 357.866075] pcieport 0000:00:01.1: AER: Uncorrected (Fatal) error received: 0000:01:00.0 [ 357.866098] nvme 0000:01:00.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent ID) [ 357.866124] nvme nvme0: frozen state error detected, reset controller [ 358.262744] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0 [ 358.262765] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID) [ 358.262772] pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00001000/00006000 [ 358.262775] pcieport 0000:00:03.1: AER: [12] Timeout [ 358.439057] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0 [ 358.439076] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID) [ 358.439084] pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00001000/00006000 [ 358.439086] pcieport 0000:00:03.1: AER: [12] Timeout [ 358.506164] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0 [ 358.506182] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID) [ 358.506194] pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00000040/00006000 [ 358.506196] pcieport 0000:00:03.1: AER: [ 6] BadTLP [ 358.748596] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0 [ 358.748606] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID) [ 358.748611] pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00001000/00006000 [ 358.748612] pcieport 0000:00:03.1: AER: [12] Timeout [ 358.971108] pcieport 0000:00:01.1: AER: Root Port link has been reset [ 358.971133] nvme nvme0: restart after slot reset [ 359.231681] nvme nvme0: Shutdown timeout set to 8 seconds [ 359.270538] nvme nvme0: 32/0/0 default/read/poll queues [ 359.270843] pcieport 0000:00:01.1: AER: Device recovery successful [ 359.355805] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0 [ 359.355825] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID) [ 359.355835] pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00001000/00006000 [ 359.355838] pcieport 
0000:00:03.1: AER: [12] Timeout ... More or less identical afaict (besides Samsung having a different queue depth). So these fatal errors can be reset/recovered from, but this is still very concerning to me, as I don't know whether constantly resetting the NVMe controller multiple times per minute will lead to data corruption. At this point I still have no idea what is going on, and the problem might be any combination of:
1) Linux kernel bug
2) BIOS revision bug from ASUS/AMD (flaw in the AMD 400 series PCIe bridge controller?)
3) Specific BIOS settings not configured right by me?
4) Misbehaving device firmware (NVMe controller and/or GPU causing some kind of PCIe bus conflict?)
5) Motherboard hardware defect, bad physical connection in some way? (based on others' reports that re-seating the NVMe solved their issue)
6) Power or voltages reaching the device fluctuating out of spec? (power issues are suspected since this only occurs under heavy load; I don't have a scope to check, though)
Any advice would be greatly appreciated. I don't know what combination of kernel boot settings (ASPM, AER, APST, nvme_core latency, etc.) and/or BIOS settings I should be trying anymore (or different BIOS revisions), as I don't understand how any of these interact, and there are too many combinations to try exhaustively. I also don't know whether kernel boot params can override BIOS settings or whether they need to be set to compatible values; the ASPM setting in my BIOS gives me 3 options: "Disabled", "Auto", or "Force L0s". I don't know off the top of my head whether the BIOS also has any APST, AER, or other related settings, but I can check if asked. Below I've included a bunch more general info and diagnostic commands I've run on my latest configuration with the Samsung drive installed.
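(Aside on the ASPM question above: a minimal way to check what ASPM policy the kernel actually ended up with, and whether ASPM is active on the NVMe link. This is only a sketch; the sysfs path and lspci options are standard, but 01:00.0 is simply this system's NVMe controller address as seen in the lspci output below, so adjust it on other machines.)
```
# Show the ASPM policy currently in effect (the bracketed entry is the active one),
# e.g. after booting with pcie_aspm=off or after changing the BIOS setting
cat /sys/module/pcie_aspm/parameters/policy

# Check whether ASPM L1 is actually enabled on the link to the NVMe device
sudo lspci -vvv -s 01:00.0 | grep -E 'LnkCap|LnkCtl'
```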
Let me know if there's any other command output or info I can provide to help: $ lsb_release -d Description: Linux Mint 19.3 Tricia $ uname -r 5.3.0-28-generic $ ls /sys/class/nvme/nvme0/power async autosuspend_delay_ms control pm_qos_latency_tolerance_us runtime_active_kids runtime_active_time runtime_enabled runtime_status runtime_suspended_time runtime_usage $ sudo nvme fw-log /dev/nvme0 Firmware Log for device:nvme0 afi : 0x1 frs1 : 0x3745584551324232 (2B2QEXE7) $ systool -vm nvme_core Module = "nvme_core" Attributes: coresize = "102400" initsize = "0" initstate = "live" refcnt = "5" srcversion = "B43C1A5A4BC80B50DFB88F2" taint = "" uevent = <store method only> version = "1.0" Parameters: admin_timeout = "60" default_ps_max_latency_us= "100000" force_apst = "N" io_timeout = "30" max_retries = "5" multipath = "Y" shutdown_timeout = "5" streams = "N" Sections: $ sudo nvme id-ctrl /dev/nvme0 NVME Identify Controller: vid : 0x144d ssvid : 0x144d sn : S5H9NC0MC24244K mn : Samsung SSD 970 EVO 1TB fr : 2B2QEXE7 rab : 2 ieee : 002538 cmic : 0 mdts : 9 cntlid : 4 ver : 10300 rtd3r : 30d40 rtd3e : 7a1200 oaes : 0 ctratt : 0 oacs : 0x17 acl : 7 aerl : 3 frmw : 0x16 lpa : 0x3 elpe : 63 npss : 4 avscc : 0x1 apsta : 0x1 wctemp : 358 cctemp : 358 mtfa : 0 hmpre : 0 hmmin : 0 tnvmcap : 1000204886016 unvmcap : 0 rpmbs : 0 edstt : 35 dsto : 0 fwug : 0 kas : 0 hctma : 0x1 mntmt : 356 mxtmt : 358 sanicap : 0 hmminds : 0 hmmaxd : 0 sqes : 0x66 cqes : 0x44 maxcmd : 0 nn : 1 oncs : 0x5f fuses : 0 fna : 0x5 vwc : 0x1 awun : 1023 awupf : 0 nvscc : 1 acwu : 0 sgls : 0 subnqn : ioccsz : 0 iorcsz : 0 icdoff : 0 ctrattr : 0 msdbd : 0 ps 0 : mp:6.20W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:- active_power:- ps 1 : mp:4.30W operational enlat:0 exlat:0 rrt:1 rrl:1 rwt:1 rwl:1 idle_power:- active_power:- ps 2 : mp:2.10W operational enlat:0 exlat:0 rrt:2 rrl:2 rwt:2 rwl:2 idle_power:- active_power:- ps 3 : mp:0.0400W non-operational enlat:210 exlat:1200 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:- active_power:- ps 4 : mp:0.0050W non-operational enlat:2000 exlat:8000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:- active_power:- $ sudo lspci -vvv -s 00:01:00.0 01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981 (prog-if 02 [NVM Express]) Subsystem: Samsung Electronics Co Ltd Device a801 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 63 NUMA node: 0 Region 0: Memory at f6800000 (64-bit, non-prefetchable) [size=16K] Capabilities: [40] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+ Address: 0000000000000000 Data: 0000 Capabilities: [70] Express (v2) Endpoint, MSI 00 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+ RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset- MaxPayload 256 bytes, MaxReadReq 512 bytes DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq- AuxPwr- TransPend- LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L0s unlimited, L1 <64us ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM 
L1 Enabled; RCB 64 bytes Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+ EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest- Capabilities: [b0] MSI-X: Enable+ Count=33 Masked- Vector table: BAR=0 offset=00003000 PBA: BAR=0 offset=00002000 Capabilities: [100 v2] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr- CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn- Capabilities: [148 v1] Device Serial Number 00-00-00-00-00-00-00-00 Capabilities: [158 v1] Power Budgeting <?> Capabilities: [168 v1] #19 Capabilities: [188 v1] Latency Tolerance Reporting Max snoop latency: 0ns Max no snoop latency: 0ns Capabilities: [190 v1] L1 PM Substates L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+ PortCommonModeRestoreTime=10us PortTPowerOnTime=10us L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1- T_CommonMode=0us LTR1.2_Threshold=0ns L1SubCtl2: T_PwrOn=10us Kernel driver in use: nvme Kernel modules: nvme $ lspci -tv -[0000:00]-+-00.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex +-00.2 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit +-01.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge +-01.1-[01]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 +-01.3-[02-08]--+-00.0 Advanced Micro Devices, Inc. [AMD] 400 Series Chipset USB 3.1 XHCI Controller | +-00.1 Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller | \-00.2-[03-08]--+-00.0-[04]----00.0 Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller | +-01.0-[05]-- | +-04.0-[06]-- | +-06.0-[07]-- | \-07.0-[08]-- +-02.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge +-03.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge +-03.1-[09]--+-00.0 NVIDIA Corporation TU116 [GeForce GTX 1660] | +-00.1 NVIDIA Corporation TU116 High Definition Audio Controller | +-00.2 NVIDIA Corporation Device 1aec | \-00.3 NVIDIA Corporation TU116 [GeForce GTX 1650 SUPER] +-04.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge +-07.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge +-07.1-[0a]--+-00.0 Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function | +-00.2 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor | \-00.3 Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller +-08.0 Advanced Micro Devices, Inc. 
[AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge +-08.1-[0b]--+-00.0 Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function | +-00.2 Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] | \-00.3 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller +-14.0 Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller +-14.3 Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge +-18.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0 +-18.1 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1 +-18.2 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2 +-18.3 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3 +-18.4 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4 +-18.5 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5 +-18.6 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6 \-18.7 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7 $ sudo nvme get-feature -f 0x0c -H /dev/nvme0 get-feature:0xc (Autonomous Power State Transition), Current value:0x000001 Autonomous Power State Transition Enable (APSTE): Enabled Auto PST Entries ................. Entry[ 0] ................. Idle Time Prior to Transition (ITPT): 71 ms Idle Transition Power State (ITPS): 3 ................. Entry[ 1] ................. Idle Time Prior to Transition (ITPT): 71 ms Idle Transition Power State (ITPS): 3 ................. Entry[ 2] ................. Idle Time Prior to Transition (ITPT): 71 ms Idle Transition Power State (ITPS): 3 ................. Entry[ 3] ................. Idle Time Prior to Transition (ITPT): 500 ms Idle Transition Power State (ITPS): 4 ................. Entry[ 4] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[ 5] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[ 6] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[ 7] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[ 8] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[ 9] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[10] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[11] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[12] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[13] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[14] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[15] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. 
Entry[16] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[17] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[18] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[19] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[20] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[21] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[22] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[23] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[24] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[25] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[26] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[27] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[28] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[29] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[30] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[31] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. 0 1 2 3 4 5 6 7 8 9 a b c d e f 0000: 18 47 00 00 00 00 00 00 18 47 00 00 00 00 00 00 ".G.......G......" 0010: 18 47 00 00 00 00 00 00 20 f4 01 00 00 00 00 00 ".G.............." 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................" 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................" 0040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................" 0050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................" 0060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................" 0070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................" 0080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................" 0090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................" 00a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................" 00b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................" 00c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................" 00d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................" 00e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................" 00f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................" Since my problems were only under heavy load, I suspected power delivery issues and upgraded my motherboard to Asus TUF Gaming X570 which has much better VRM. I am no longer able to reproduce any of the errors I reported above. 
So the previous Asus Prime B450 Plus motherboard was either a defective unit or under-spec'd in general to power a fully loaded 2700X. Hi, just thought I'd share my experience with this. I have a Lenovo P51s Thinkpad (20JY0004US), which has a Samsung MZVLB512HAJQ-000L7 drive. I can't say for sure, but I believe the issues arose when my distro updated the kernel to 5.4. I was unable to boot (endless read-only filesystem errors) until I added the `nvme_core.default_ps_max_latency_us=200` parameter. This mostly solved the problem -- I was able to boot -- but my system would crash occasionally, out of nowhere, with the same read-only filesystem errors. I'd estimate this happened once or twice a day, often when I unplugged the charger, but not always. For a while I thought it only happened when the laptop was unplugged, but at least once it happened while charging. Finally, I tried installing the latest firmware updates from Lenovo, which I'd never done before, and I haven't seen the issue since! However, the `nvme_core.default_ps_max_latency_us=200` parameter is still necessary. In my case, it was (and is) a hardware issue of a bad SSD-to-motherboard contact. Downgrading the kernel or changing the latency seemed to help a bit, but not forever. Thus, it was very hard to diagnose. My current solution is to remove the SSD and spray the contacts with a MAF sensor cleaner I bought for my car. It fixes the problem for a period of several months. Then I have to do it again. Cheers, Sergey Update to [my previous comment](https://bugzilla.kernel.org/show_bug.cgi?id=195039#c61): Actually, I still get the issue sometimes, but now it only happens when I plug my laptop in to charge. About 1 out of 3 times that I plug it in, it will occur. I'm sorry to bother you. I've been plagued by a problem - an unexplained crash during use. Dell G3 laptop, Ubuntu 20.04 LTS, Linux wlp2s0-hosts 5.4.0-31-generic # Ubuntu SMP Thu May 7 20:20:34 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux. A Samsung 1 TB NVMe SSD is also used.
The model is Samsung Electronics Co Ltd nvme SSD controller sm981 / pm981 / pm983 root@wlp2s0-hosts :/home/wlp2s0# smartctl -i /dev/nvme0 smartctl 7.1 2019-12-30 r5022 [x86_ 64-linux-5.4.0-31-generic] (local build) Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Model Number: PM981a NVMe Samsung 1024GB Serial Number: S4GXNE0M828422 Firmware Version: 15302129 PCI Vendor/Subsystem ID: 0x144d IEEE OUI Identifier: 0x002538 Total NVM Capacity: 1,024,209,543,168 [1.02 TB] Unallocated NVM Capacity: 0 Controller ID: 4 Number of Namespaces: 1 Namespace 1 Size/Capacity: 1,024,209,543,168 [1.02 TB] Namespace 1 Utilization: 138,282,958,848 [138 GB] Namespace 1 Formatted LBA Size: 512 Local Time is: Mon May 25 13:39:24 2020 CST root@wlp2s0-hosts :/home/wlp2s0# nvme id-ctrl /dev/nvme0 NVME Identify Controller: vid : 0x144d ssvid : 0x144d sn : S4GXNE0M828422 mn : PM981a NVMe Samsung 1024GB fr : 15302129 rab : 2 ieee : 002538 cmic : 0 mdts : 9 cntlid : 0x4 ver : 0x10300 rtd3r : 0x30d40 rtd3e : 0x7a1200 oaes : 0 ctratt : 0 rrls : 0 crdt1 : 0 crdt2 : 0 crdt3 : 0 oacs : 0x17 acl : 7 aerl : 3 frmw : 0x16 lpa : 0x2 elpe : 63 npss : 4 avscc : 0x1 apsta : 0x1 wctemp : 357 cctemp : 358 mtfa : 0 hmpre : 0 hmmin : 0 tnvmcap : 1024209543168 unvmcap : 0 rpmbs : 0 edstt : 35 dsto : 0 fwug : 0 kas : 0 hctma : 0x1 mntmt : 321 mxtmt : 358 sanicap : 0x2 hmminds : 0 hmmaxd : 0 nsetidmax : 0 anatt : 0 anacap : 0 anagrpmax : 0 nanagrpid : 0 sqes : 0x66 cqes : 0x44 maxcmd : 0 nn : 1 oncs : 0x5f fuses : 0 fna : 0x3 vwc : 0x1 awun : 1023 awupf : 0 nvscc : 1 nwpc : 0 acwu : 0 sgls : 0 mnan : 0 subnqn : ioccsz : 0 iorcsz : 0 icdoff : 0 ctrattr : 0 msdbd : 0 ps 0 : mp:6.60W operational enlat:0 exlat 0 rrt:0 rrl 0 rwt:0 rwl :0 idle_ power:- active_ power:- ps 1 : mp:4.40W operational enlat:0 exlat 0 rrt:1 rrl 1 rwt:1 rwl :1 idle_ power:- active_ power:- ps 2 : mp:3.10W operational enlat:0 exlat 0 rrt:2 rrl 2 rwt:2 rwl :2 idle_ power:- active_ power:- ps 3 : mp:0.0700W non-operational enlat:210 exlat :1200 rrt:3 rrl 3 rwt:3 rwl :3 idle_ power:- active_ power:- ps 4 : mp:0.0050W non-operational enlat:2000 exlat :8000 rrt:4 rrl 4 rwt:4 rwl :4 idle_ power:- active_ power:- Try using nvme_ core.default_ Ps_ max_ latency_ Us = 0 to boot but not to enter the system at all. I'm experiencing the same problem - only if the laptop is charging! Sys: Lenovo T480s Bios: Version: N22ET62W (1.39 ) Release Date: 02/18/2020 changed ssd from stock 256GB to KINGSTON SA2000M81000G 1TB while in battery mode the system is rock stable. when charging I see: [ 5088.579248] nvme nvme0: I/O 704 QID 2 timeout, aborting [ 5088.579274] nvme nvme0: I/O 705 QID 2 timeout, aborting [ 5088.579285] nvme nvme0: I/O 706 QID 2 timeout, aborting [ 5088.579294] nvme nvme0: I/O 707 QID 2 timeout, aborting [ 5088.579303] nvme nvme0: I/O 708 QID 2 timeout, aborting [ 5118.788204] nvme nvme0: I/O 704 QID 2 timeout, reset controller [ 5150.021209] nvme nvme0: I/O 0 QID 0 timeout, reset controller Tested kernels (all have this problem) 5.7.4 5.7.5 5.8.0 rc1 @RockT - I don't think its kernel issue. I too have a T580 and had numerous issues with NVME. See my initial comments in this thread from 12/2018. T580 seems to have power delivery issues that cause NVME drives to crash. The only fix I found that works reliably is replacing the motherboard on the laptop. 
I'm on a 3rd motherboard currently; the first one lasted a year, the second a year and a half, and the 3rd was installed a couple of weeks ago (each time going through the usual troubleshooting process that included swapping to a new NVMe drive, a new drive cage, etc). I'm running F32 w/ kernel 5.6, and after the motherboard swap all issues with NVMe are gone (at least until the mb fails again). It's pretty sad. Maybe the latest mobo revision has some fixes that will make things more reliable. @Sebastian Jastrzebski thank you for your answer, but I somehow doubt that it is a hardware problem: - the stock nvme card was stable - I applied kernel parameter "nvme_core.default_ps_max_latency_us=5500": $ cat /proc/cmdline BOOT_IMAGE=/vmlinuz-5.7.5-050705-generic root=/dev/mapper/vgubuntu--mate-root ro quiet splash nvme_core.default_ps_max_latency_us=5500 vt.handoff=7 This has been stable now for two days of work, including running some VMs and doing some dd tests. No matter if the laptop is charging or on battery, I don't have problems anymore. Created attachment 289877 [details] attachment-20893-0.html FWIW I believe the issue was hardware related in my case too. Setting default_ps_max_latency_us=200 fixed it for a couple months but eventually it returned. I tried firmware updates, pinning old kernels, installing different distros, etc. These changes would seem to fix it for a couple days (I even reported it fixed once or twice in this thread) but then it would come back, or start happening under different circumstances. Finally one day it got so bad I couldn't boot at all, so I cracked open the case and found there were structural issues with the mobo port. I believe it was the same issue as described [here](https://forums.lenovo.com/t5/ThinkPad-P-and-W-Series-Mobile/p51s-sata-mobo-connector-is-broken/td-p/4030539), though not as far gone. Plugging the drive into another computer (running the same version of the kernel) worked fine. I never got the port replaced/resoldered though, so I can't say with /complete/ certainty that it was the problem. It sure looks that way though, especially considering all the others in this thread coming to similar conclusions. I'm still not convinced. To resize my encrypted fs on the new 1TB drive I use sysresccd with kernel 5.4.44 LTS. I can fsck the filesystem with the stock kernel params, but as soon as I resize the filesystem the nvme controller locks hard. Not even a soft reboot can recover. As soon as I set "nvme_core.default_ps_max_latency_us=5500" with sysresccd, everything works as expected: resize, fsck, luks, lvm. Of course it's been running stable now for only three days - so take it with a pinch of salt.
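To make the various reports about =0 vs =5500 easier to interpret, here is a small, hedged sketch of how these knobs fit together, as I understand the mainline driver's behaviour (not verified against every kernel mentioned in this thread): APST transitions are only programmed into non-operational power states whose exit latency fits under the configured limit, and the idle time before each transition is derived from the state's entry+exit latency (roughly 50x). For the 970 EVO figures posted above (ps 3: enlat 210/exlat 1200, ps 4: enlat 2000/exlat 8000), the default 100000 us allows both states (hence the 71 ms and 500 ms entries in the APST dump), 5500 us allows ps 3 but blocks the deepest ps 4, and 0 disables APST entirely. The commands below are the standard ways to inspect and override this at runtime; /dev/nvme0 and the 5500 value are just examples:
```
# Global default applied to newly probed controllers (microseconds; 0 disables APST)
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us

# Is APST actually enabled on this controller, and what does its transition table look like?
sudo nvme get-feature -f 0x0c -H /dev/nvme0

# Per-device override at runtime, no reboot needed: a value in microseconds,
# or "auto" to return to the default behaviour
echo 5500 | sudo tee /sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us
```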
(In reply to RockT from comment #69) this solved the problem for me: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1852479 >here a bit of my cat /etc/default/grub GRUB_CMDLINE_LINUX_DEFAULT="quiet splash modprobe.blacklist=nouveau nvme_core.default_ps_max_latency_us=5500 pcie_aspm=off" GRUB_CMDLINE_LINUX="nouveau.modeset=0" For me pcie_aspm=off was the parameter that helped solve the issue. for more info +see here: https://wiki.archlinux.org/index.php/Solid_state_drive/NVMe +and here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1746340#yui_3_10_3_1_1590400769591_1615 Hi folks, just confirming that I have the same issue. I have a ThinkPad E480 with a new Kingston A2000 512GB NVME SSD. Here are some of the things I've experienced: -> Fedora 32 won't install to NVME with LUKS and it fails to format to ext4 (simply hangs). With unencrypted install (standard partitioning) it installs however there are frequent lock ups (can't do anything, even switching TTYs don't work). -> Ubuntu 20.04 can install however there are frequent lockups just like Fedora. I haven't tried out "pcie_aspm=off" as I am using a different OS for the time being but it sounds like that would fix it. Maybe one day I'll try it again. The question is, what would be the long-term fix? Is it simply a matter of solving the power saving issue? It would be nice to benefit from the power saving while still having the stability of a SATA SSD. Quick update, good news, I may have found workaround for people who are suffering from the NVMe timeout issues. I'm on Fedora 32 with the Kingston A2000 512GB SSD and after 2+ days of uptime, plugged on and on battery along with various work loads, I think it is safe to say that the system is quite robust. If I run into any issues down the road, I'll be sure to post an update. Like Juan, I updated the pci_aspm parameter however I went with "performance". I'm not sure which one actually solves the problem but I've got both parameters going. Basically, I put this in the /etc/grub/default file: --> GRUB_CMDLINE_LINUX_DEFAULT="nvme_core.default_ps_max_latency_us=0 pci_aspm=performance" This will ensure it will use these parameters all the time at boot. Then I reloaded the GRUB configuration (this is for Fedora): --> sudo grub2-mkconfig -o /boot/grub2/grub.cfg To check if the "nvme_core.default_ps_max_latency_us=0" has been set successfully, you can run the following command: --> cat /sys/module/nvme_core/parameters/default_ps_max_latency_us @berk Thank you! I had this issue with Kingston A2000 1000GB SSD. Running Kernel 5.7 on Arch with GRUB_CMDLINE_LINUX_DEFAULT="nvme_core.default_ps_max_latency_us=0 pci_aspm=performance" seems to solve it so far. I had hang ups a few times per hour. :S Despite nvme latency seeming to help for some time, this is, most probably, a hardware issue: the power supply to SSD is interrupted for an instant. Clean the SSD contacts with a paper dipped in isopropanol or similar. Cut a stripe of cardboard, dip it in isopropanol and clean the motherboard SSD contacts. I have had this issue since 3 years and have to repeat the cleaning around ones a year. Best wishes, Sergey (In reply to Sergey Slizovskiy from comment #75) > Despite nvme latency seeming to help for some time, this is, most probably, > a hardware issue: the power supply to SSD is interrupted for an instant. > Clean the SSD contacts with a paper dipped in isopropanol or similar. Cut a > stripe of cardboard, dip it in isopropanol and clean the motherboard SSD > contacts. 
> I have had this issue since 3 years and have to repeat the cleaning > around ones a year. > Best wishes, > Sergey I will also try this, but its strange since everthing is fine on Windows (i use Dual Boot). (In reply to Sergey Slizovskiy from comment #75) > Despite nvme latency seeming to help for some time, this is, most probably, > a hardware issue: the power supply to SSD is interrupted for an instant. > Clean the SSD contacts with a paper dipped in isopropanol or similar. Cut a > stripe of cardboard, dip it in isopropanol and clean the motherboard SSD > contacts. > I have had this issue since 3 years and have to repeat the cleaning > around ones a year. > Best wishes, > Sergey Thanks for the reply Sergey, I decided to give your method a try as I was reinstalling (although just blowing dust out of the PCI-e slot and wiping the NVME SSD terminals with some alcohol, however no luck here. Also this isn't an issue on Windows, so it could be a firmware issue on the laptop's side. However, from my previous comment, I noticed that the only thing I needed was the latency, not the aspm stuff. I haven't had any issues so far and I'm getting good uptime along with various workloads. I wrote a little article on my website on the fix I applied to my system. You can read it here: https://tekbyte.net/2020/fixing-nvme-ssd-problems-on-linux/ I should mention that I'm on a ThinkPad E480 running Fedora 32. I did some research and it seems to plague some other ThinkPad owners. I should also say that the Lenovo NVME SSD (some Toshiba OPAL SSD) doesn't have this problem. Apart from that, I simply apply the GRUB tweak and I'm done. Minor inconvenience but not much I can do unless some kernel update or BIOS update fixes this. There is also a BIOS update back but I doubt it'd fix the issue. Might report this to Lenovo. All the best, Berk I can also confirm this issue. I replaced the nvme hard drive of my Thinkpad T480s with a Kingston A2000 1TB drive. The previous drive, a 256GB Samsung PM961 had been running without issues for more than 2 years. The issue is fixed using the parameters nvme_core.default_ps_max_latency_us=0 and pci_aspm=performance. I am running Fedora with kernel 5.8.16. It seems this particular Kingston drive just has issues with Linux given that multiple people have reported issues with this drive, not just here but on several places, just Google for "kingston A2000 linux": - https://bbs.archlinux.org/viewtopic.php?id=256476 - https://askubuntu.com/questions/1222049/nvmekingston-a2000-sometimes-stops-giving-response-in-ubuntu-18-04dell-inspir - https://community.acer.com/en/discussion/604326/m-2-nvme-ssd-aspire-517-51g-issue-compatibility-kingston-a2000-linux-ubuntu Hi Just want to add my 2 cents here. I had the same problems as described above with my Kingston A2000 1TB nvme drive that I just installed in my Asus UX333 ZenBook laptop. Adding nvme_core.default_ps_max_latency_us=0 to my kernel params fixed the problem for me. Currently running on kernel 5.4.78-rt44-1-rt-lts #1 SMP PREEMPT_RT without problems. Thanks Hi, folks. Resently I bought Intel-NUC8i7HVK with two Kingston KINGSTON SA2000M81000G SSD Drives On installed Windows 10 there were no problems at all. On Linux Kubuntu and KDE Neon system hung rendomly. When I found this bug thread I tried to do described like there. 
Edit grub:
```
sudo nano /etc/default/grub
```
Add nvme_core.default_ps_max_latency_us=0 to GRUB_CMDLINE_LINUX_DEFAULT:
```
GRUB_DEFAULT=0
GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=0
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvme_core.default_ps_max_latency_us=0"
GRUB_CMDLINE_LINUX=""
```
Update grub:
```
sudo update-grub
```
Reboot, and also disable APST management in the motherboard BIOS. After the reboot, type in a console:
```
sudo nvme get-feature -f 0x0c -H /dev/nvme1n1 | grep APSTE
```
output:
```
Autonomous Power State Transition Enable (APSTE): Disabled
```
So APST is disabled now, and the problem was solved!!! Hey @Ivan Yakovlev, thank you so much for sharing this. I'm having the same issue but with different hardware, and simply disabling APSTE resolved the issue. I'm currently running a desktop with the following: MB: Gigabyte GA-H270-Gaming-3 BIOS V8 CPU: Intel i5-7500 NVME: TOSHIBA THNSF5512GPUK FW 51055KLA Thank you so much. Hi, I have an MSI B550M PRO motherboard with a Ryzen 3100 processor. Recently I installed a Crucial P5 1 TB NVMe SSD into the second M2_2 slot (controlled by the B550M chipset, since M2_1 is controlled by the processor). All worked fine until I executed # sudo smartctl -a /dev/nvme1n1 The output stalled after .... Critical Comp. Temperature Time: 0 Temperature Sensor 1: 51 Celsius Temperature Sensor 2: 57 Celsius Thermal Temp. 1 Transition Count: 11 Thermal Temp. 1 Total Time: 5118 for about 1 min, and after that an error was issued: Read Error Information Log failed: NVME_IOCTL_ADMIN_CMD: Interrupted system call and the disk disappeared from the system. Dmesg showed: kernel: [ 837.596338] nvme 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0xcfd00000 flags=0x0000] kernel: [ 898.560932] nvme nvme1: I/O 10 QID 0 timeout, reset controller kernel: [ 965.504798] nvme nvme1: Device not ready; aborting reset, CSTS=0x1 kernel: [ 971.040803] nvme nvme1: Device not ready; aborting reset, CSTS=0x1 kernel: [ 971.040809] nvme nvme1: Removing after probe failure status: -19 kernel: [ 976.564693] nvme nvme1: Device not ready; aborting reset, CSTS=0x1 kernel: [ 976.590379] nvme nvme1: failed to set APST feature (-19) Setting GRUB_CMDLINE_LINUX_DEFAULT="nvme_core.default_ps_max_latency_us=0 pci_aspm=performance" solved the issue (since I can't find how to disable APSTE in the BIOS). My system: $ uname -a Linux pcdom 5.8.0-38-generic #43~20.04.1-Ubuntu SMP Tue Jan 12 16:39:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux RM Samsung 970 EVO Plus 250GB mounted in the M2_1 slot (the one controlled by the processor) worked fine all the time. RM To those with a Kingston A2000: could you check if this patch makes the issue go away for you? Would be a great help! https://lore.kernel.org/linux-nvme/20210129052442.310780-1-linux@leemhuis.info/ @rafal.moderski: Not sure if it's relevant but I figured I'd share anyways. I recently saw the same problem as you did on a Gigabyte B550I AORUS PRO AX with a Ryzen 3600, an nVidia GTX 460, two Intel 660p NVMe SSDs and a bunch of SATA drives (running Linux 5.10.13-200.fc33.x86_64 on Fedora 33). Both of my NVMe drives dropped out randomly under I/O load and appeared to hang so hard that I had to pull the power cord from the motherboard.
[ 660.969381] nvme nvme0: I/O 32 QID 4 timeout, aborting [ 660.969393] nvme nvme0: I/O 33 QID 4 timeout, aborting [ 660.969396] nvme nvme0: I/O 34 QID 4 timeout, aborting [ 660.969399] nvme nvme0: I/O 35 QID 4 timeout, aborting [ 691.176647] nvme nvme0: I/O 32 QID 4 timeout, reset controller [ 721.384067] nvme nvme0: I/O 21 QID 0 timeout, reset controller [ 815.675331] nvme nvme0: Device not ready; aborting reset, CSTS=0x1 [ 815.681442] nvme nvme0: Abort status: 0x371 [ 815.681443] nvme nvme0: Abort status: 0x371 [ 815.681444] nvme nvme0: Abort status: 0x371 [ 815.681445] nvme nvme0: Abort status: 0x371 [ 876.201770] nvme nvme0: Device not ready; aborting reset, CSTS=0x1 [ 876.201773] nvme nvme0: Removing after probe failure status: -19 [ 936.707119] nvme nvme0: Device not ready; aborting reset, CSTS=0x1 [ 936.707258] blk_update_request: I/O error, dev nvme0n1, sector 644864384 op 0x1:(WRITE) flags 0x0 phys_seg 16 prio class 0 [ 936.707278] blk_update_request: I/O error, dev nvme0n1, sector 644864256 op 0x1:(WRITE) flags 0x0 phys_seg 16 prio class 0 [ 936.707281] blk_update_request: I/O error, dev nvme0n1, sector 644867968 op 0x1:(WRITE) flags 0x0 phys_seg 32 prio class 0 [ 936.707284] blk_update_request: I/O error, dev nvme0n1, sector 644866560 op 0x1:(WRITE) flags 0x0 phys_seg 24 prio class 0 [ 936.707288] blk_update_request: I/O error, dev nvme0n1, sector 644866304 op 0x1:(WRITE) flags 0x0 phys_seg 32 prio class 0 [ 936.707292] blk_update_request: I/O error, dev nvme0n1, sector 644866048 op 0x1:(WRITE) flags 0x0 phys_seg 19 prio class 0 [ 936.707295] blk_update_request: I/O error, dev nvme0n1, sector 644865920 op 0x1:(WRITE) flags 0x0 phys_seg 8 prio class 0 [ 936.707297] blk_update_request: I/O error, dev nvme0n1, sector 644865664 op 0x1:(WRITE) flags 0x0 phys_seg 18 prio class 0 [ 936.707299] blk_update_request: I/O error, dev nvme0n1, sector 644865408 op 0x1:(WRITE) flags 0x0 phys_seg 18 prio class 0 [ 936.707302] blk_update_request: I/O error, dev nvme0n1, sector 644865152 op 0x1:(WRITE) flags 0x0 phys_seg 28 prio class 0 [ 936.707304] md: super_written gets error=10 [ 936.707307] md/raid1:md127: Disk failure on nvme0n1p4, disabling device. md/raid1:md127: Operation continuing on 2 devices. [ 936.707310] md: md127: recovery interrupted. [ 936.714223] nvme nvme0: failed to set APST feature (-19) In my case the problems went away when I replaced the PSU (for a 550W). I can reproduce the issue with a ThinkPad T470 and a 1TB Kingston A2000 SSD. Have not tried the 'nvme_core.default_ps_max_latency_us=0' workaround yet. I have major issues with a new SAMSUNG NVME SSD that shows the same symptoms as those various users have reported above. Even with the nvme_core.default_ps_max_latency_us=0 and pcie_aspm=performancesettings my SSD controller is still affected by timeouts. 
(every few minutes, getting worse the longer the system is running until I/O essentially stalls indefinitely): ``` [Fr Apr 16 08:00:06 2021] nvme nvme0: pci function 0000:03:00.0 [Fr Apr 16 08:00:06 2021] nvme nvme0: Shutdown timeout set to 10 seconds [Fr Apr 16 08:00:06 2021] nvme nvme0: 16/0/0 default/read/poll queues [Fr Apr 16 08:00:06 2021] nvme0n1: p1 p2 p3 p4 p5 [Fr Apr 16 08:00:06 2021] BTRFS: device fsid bfa8a277-c2de-4b2a-a8c9-3488e648b423 devid 1 transid 55112 /dev/nvme0n1p2 scanned by systemd-udevd (406) [Fr Apr 16 08:00:07 2021] BTRFS info (device nvme0n1p2): disk space caching is enabled [Fr Apr 16 08:00:07 2021] BTRFS info (device nvme0n1p2): has skinny extents [Fr Apr 16 08:00:07 2021] BTRFS info (device nvme0n1p2): enabling ssd optimizations [Fr Apr 16 08:00:08 2021] Adding 65552380k swap on /dev/nvme0n1p3. Priority:-2 extents:1 across:65552380k SSFS [Fr Apr 16 08:00:08 2021] BTRFS info (device nvme0n1p2): disk space caching is enabled [Fr Apr 16 08:00:08 2021] EXT4-fs (nvme0n1p5): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none. [Fr Apr 16 08:30:40 2021] nvme nvme0: Shutdown timeout set to 10 seconds [Fr Apr 16 08:30:40 2021] nvme nvme0: 16/0/0 default/read/poll queues [Fr Apr 16 08:46:11 2021] nvme nvme0: I/O 819 QID 5 timeout, aborting [Fr Apr 16 08:46:11 2021] nvme nvme0: I/O 195 QID 10 timeout, aborting [Fr Apr 16 08:46:11 2021] nvme nvme0: I/O 196 QID 10 timeout, aborting [Fr Apr 16 08:46:11 2021] nvme nvme0: I/O 17 QID 11 timeout, aborting [Fr Apr 16 08:46:11 2021] nvme nvme0: I/O 56 QID 14 timeout, aborting [Fr Apr 16 08:46:11 2021] nvme nvme0: I/O 57 QID 14 timeout, aborting [Fr Apr 16 08:46:11 2021] nvme nvme0: I/O 58 QID 14 timeout, aborting [Fr Apr 16 08:46:11 2021] nvme nvme0: I/O 59 QID 14 timeout, aborting [Fr Apr 16 08:46:41 2021] nvme nvme0: I/O 819 QID 5 timeout, reset controller [Fr Apr 16 08:46:41 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 08:46:41 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 08:46:41 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 08:46:41 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 08:46:41 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 08:46:41 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 08:46:41 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 08:46:41 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 08:46:41 2021] nvme nvme0: Shutdown timeout set to 10 seconds [Fr Apr 16 08:46:41 2021] nvme nvme0: 16/0/0 default/read/poll queues [Fr Apr 16 08:48:58 2021] nvme nvme0: I/O 20 QID 11 timeout, aborting [Fr Apr 16 08:48:59 2021] nvme nvme0: I/O 1004 QID 8 timeout, aborting [Fr Apr 16 08:48:59 2021] nvme nvme0: I/O 207 QID 10 timeout, aborting [Fr Apr 16 08:48:59 2021] nvme nvme0: I/O 208 QID 10 timeout, aborting [Fr Apr 16 08:49:03 2021] nvme nvme0: I/O 256 QID 7 timeout, aborting [Fr Apr 16 08:49:03 2021] nvme nvme0: I/O 257 QID 7 timeout, aborting [Fr Apr 16 08:49:03 2021] nvme nvme0: I/O 258 QID 7 timeout, aborting [Fr Apr 16 08:49:03 2021] nvme nvme0: I/O 317 QID 7 timeout, aborting [Fr Apr 16 08:49:28 2021] nvme nvme0: I/O 20 QID 11 timeout, reset controller [Fr Apr 16 08:49:28 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 08:49:28 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 08:49:28 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 08:49:28 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 08:49:28 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 08:49:28 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 08:49:28 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 08:49:28 2021] 
nvme nvme0: Abort status: 0x371 [Fr Apr 16 08:49:28 2021] nvme nvme0: Shutdown timeout set to 10 seconds [Fr Apr 16 08:49:28 2021] nvme nvme0: 16/0/0 default/read/poll queues [Fr Apr 16 09:32:55 2021] nvme nvme0: I/O 564 QID 15 timeout, aborting [Fr Apr 16 09:32:59 2021] nvme nvme0: I/O 705 QID 3 timeout, aborting [Fr Apr 16 09:33:03 2021] nvme nvme0: I/O 16 QID 14 timeout, aborting [Fr Apr 16 09:33:03 2021] nvme nvme0: I/O 17 QID 14 timeout, aborting [Fr Apr 16 09:33:18 2021] nvme nvme0: I/O 213 QID 10 timeout, aborting [Fr Apr 16 09:33:21 2021] nvme nvme0: I/O 13 QID 11 timeout, aborting [Fr Apr 16 09:33:25 2021] nvme nvme0: I/O 14 QID 11 timeout, aborting [Fr Apr 16 09:33:25 2021] nvme nvme0: I/O 15 QID 11 timeout, aborting [Fr Apr 16 09:33:25 2021] nvme nvme0: I/O 564 QID 15 timeout, reset controller [Fr Apr 16 09:33:25 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:33:25 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:33:25 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:33:25 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:33:25 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:33:25 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:33:25 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:33:25 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:33:25 2021] nvme nvme0: Shutdown timeout set to 10 seconds [Fr Apr 16 09:33:26 2021] nvme nvme0: 16/0/0 default/read/poll queues [Fr Apr 16 09:50:40 2021] nvme nvme0: I/O 547 QID 15 timeout, aborting [Fr Apr 16 09:50:41 2021] nvme nvme0: I/O 768 QID 5 timeout, aborting [Fr Apr 16 09:50:41 2021] nvme nvme0: I/O 769 QID 5 timeout, aborting [Fr Apr 16 09:50:41 2021] nvme nvme0: I/O 770 QID 5 timeout, aborting [Fr Apr 16 09:50:41 2021] nvme nvme0: I/O 771 QID 5 timeout, aborting [Fr Apr 16 09:50:41 2021] nvme nvme0: I/O 791 QID 5 timeout, aborting [Fr Apr 16 09:50:41 2021] nvme nvme0: I/O 792 QID 5 timeout, aborting [Fr Apr 16 09:50:41 2021] nvme nvme0: I/O 793 QID 5 timeout, aborting [Fr Apr 16 09:51:11 2021] nvme nvme0: I/O 547 QID 15 timeout, reset controller [Fr Apr 16 09:51:11 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:51:11 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:51:11 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:51:11 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:51:11 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:51:11 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:51:11 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:51:11 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:51:11 2021] nvme nvme0: Shutdown timeout set to 10 seconds [Fr Apr 16 09:51:11 2021] nvme nvme0: 16/0/0 default/read/poll queues [Fr Apr 16 09:52:36 2021] nvme nvme0: I/O 738 QID 3 timeout, aborting [Fr Apr 16 09:52:36 2021] nvme nvme0: I/O 275 QID 7 timeout, aborting [Fr Apr 16 09:52:37 2021] nvme nvme0: I/O 302 QID 16 timeout, aborting [Fr Apr 16 09:52:38 2021] nvme nvme0: I/O 329 QID 13 timeout, aborting [Fr Apr 16 09:52:38 2021] nvme nvme0: I/O 384 QID 6 timeout, aborting [Fr Apr 16 09:52:38 2021] nvme nvme0: I/O 385 QID 6 timeout, aborting [Fr Apr 16 09:52:38 2021] nvme nvme0: I/O 386 QID 6 timeout, aborting [Fr Apr 16 09:52:38 2021] nvme nvme0: I/O 434 QID 6 timeout, aborting [Fr Apr 16 09:53:06 2021] nvme nvme0: I/O 738 QID 3 timeout, reset controller [Fr Apr 16 09:53:06 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:53:06 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:53:06 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:53:06 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 
09:53:06 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:53:06 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:53:06 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:53:06 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:53:06 2021] nvme nvme0: Shutdown timeout set to 10 seconds [Fr Apr 16 09:53:06 2021] nvme nvme0: 16/0/0 default/read/poll queues [Fr Apr 16 09:53:41 2021] nvme nvme0: I/O 720 QID 3 timeout, aborting [Fr Apr 16 09:53:41 2021] nvme nvme0: I/O 514 QID 15 timeout, aborting [Fr Apr 16 09:53:44 2021] nvme nvme0: I/O 364 QID 7 timeout, aborting [Fr Apr 16 09:53:44 2021] nvme nvme0: I/O 365 QID 7 timeout, aborting [Fr Apr 16 09:53:44 2021] nvme nvme0: I/O 515 QID 15 timeout, aborting [Fr Apr 16 09:53:44 2021] nvme nvme0: I/O 516 QID 15 timeout, aborting [Fr Apr 16 09:53:44 2021] nvme nvme0: I/O 517 QID 15 timeout, aborting [Fr Apr 16 09:53:44 2021] nvme nvme0: I/O 518 QID 15 timeout, aborting [Fr Apr 16 09:54:11 2021] nvme nvme0: I/O 720 QID 3 timeout, reset controller [Fr Apr 16 09:54:11 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:54:11 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:54:11 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:54:11 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:54:11 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:54:11 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:54:11 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:54:11 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:54:11 2021] nvme nvme0: Shutdown timeout set to 10 seconds [Fr Apr 16 09:54:11 2021] nvme nvme0: 16/0/0 default/read/poll queues [Fr Apr 16 09:58:18 2021] nvme nvme0: I/O 546 QID 9 timeout, aborting [Fr Apr 16 09:58:18 2021] nvme nvme0: I/O 274 QID 10 timeout, aborting [Fr Apr 16 09:58:19 2021] nvme nvme0: I/O 47 QID 11 timeout, aborting [Fr Apr 16 09:58:20 2021] nvme nvme0: I/O 48 QID 11 timeout, aborting [Fr Apr 16 09:58:20 2021] nvme nvme0: I/O 49 QID 11 timeout, aborting [Fr Apr 16 09:58:21 2021] nvme nvme0: I/O 547 QID 9 timeout, aborting [Fr Apr 16 09:58:21 2021] nvme nvme0: I/O 548 QID 9 timeout, aborting [Fr Apr 16 09:58:25 2021] nvme nvme0: I/O 512 QID 9 timeout, aborting [Fr Apr 16 09:58:48 2021] nvme nvme0: I/O 546 QID 9 timeout, reset controller [Fr Apr 16 09:58:48 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:58:48 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:58:48 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:58:48 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:58:48 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:58:48 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:58:48 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:58:48 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 09:58:48 2021] nvme nvme0: Shutdown timeout set to 10 seconds [Fr Apr 16 09:58:48 2021] nvme nvme0: 16/0/0 default/read/poll queues [Fr Apr 16 10:02:36 2021] nvme nvme0: I/O 363 QID 8 timeout, aborting [Fr Apr 16 10:02:36 2021] nvme nvme0: I/O 301 QID 16 timeout, aborting [Fr Apr 16 10:02:37 2021] nvme nvme0: I/O 773 QID 1 timeout, aborting [Fr Apr 16 10:02:37 2021] nvme nvme0: I/O 360 QID 13 timeout, aborting [Fr Apr 16 10:02:38 2021] nvme nvme0: I/O 557 QID 9 timeout, aborting [Fr Apr 16 10:02:38 2021] nvme nvme0: I/O 819 QID 5 timeout, aborting [Fr Apr 16 10:02:38 2021] nvme nvme0: I/O 820 QID 5 timeout, aborting [Fr Apr 16 10:02:39 2021] nvme nvme0: I/O 567 QID 15 timeout, aborting [Fr Apr 16 10:03:06 2021] nvme nvme0: I/O 363 QID 8 timeout, reset controller [Fr Apr 16 10:03:06 2021] nvme nvme0: Abort status: 
0x371 [Fr Apr 16 10:03:06 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 10:03:06 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 10:03:06 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 10:03:06 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 10:03:06 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 10:03:06 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 10:03:06 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 10:03:06 2021] nvme nvme0: Shutdown timeout set to 10 seconds [Fr Apr 16 10:03:06 2021] nvme nvme0: 16/0/0 default/read/poll queues [Fr Apr 16 10:04:06 2021] nvme nvme0: I/O 16 QID 14 timeout, aborting [Fr Apr 16 10:04:06 2021] nvme nvme0: I/O 567 QID 15 timeout, aborting [Fr Apr 16 10:04:10 2021] nvme nvme0: I/O 548 QID 9 timeout, aborting [Fr Apr 16 10:04:10 2021] nvme nvme0: I/O 568 QID 15 timeout, aborting [Fr Apr 16 10:04:10 2021] nvme nvme0: I/O 569 QID 15 timeout, aborting [Fr Apr 16 10:04:12 2021] nvme nvme0: I/O 549 QID 9 timeout, aborting [Fr Apr 16 10:04:12 2021] nvme nvme0: I/O 776 QID 1 timeout, aborting [Fr Apr 16 10:04:12 2021] nvme nvme0: I/O 512 QID 9 timeout, aborting [Fr Apr 16 10:04:36 2021] nvme nvme0: I/O 16 QID 14 timeout, reset controller [Fr Apr 16 10:04:36 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 10:04:36 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 10:04:36 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 10:04:36 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 10:04:36 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 10:04:36 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 10:04:36 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 10:04:36 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 10:04:36 2021] nvme nvme0: Shutdown timeout set to 10 seconds [Fr Apr 16 10:04:36 2021] nvme nvme0: 16/0/0 default/read/poll queues [Fr Apr 16 10:17:36 2021] nvme nvme0: I/O 564 QID 15 timeout, aborting [Fr Apr 16 10:17:37 2021] nvme nvme0: I/O 728 QID 12 timeout, aborting [Fr Apr 16 10:17:37 2021] nvme nvme0: I/O 729 QID 12 timeout, aborting [Fr Apr 16 10:17:37 2021] nvme nvme0: I/O 730 QID 12 timeout, aborting [Fr Apr 16 10:17:37 2021] nvme nvme0: I/O 731 QID 12 timeout, aborting [Fr Apr 16 10:17:37 2021] nvme nvme0: I/O 732 QID 12 timeout, aborting [Fr Apr 16 10:17:37 2021] nvme nvme0: I/O 733 QID 12 timeout, aborting [Fr Apr 16 10:17:37 2021] nvme nvme0: I/O 512 QID 15 timeout, aborting [Fr Apr 16 10:18:06 2021] nvme nvme0: I/O 564 QID 15 timeout, reset controller [Fr Apr 16 10:18:06 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 10:18:06 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 10:18:06 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 10:18:06 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 10:18:06 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 10:18:06 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 10:18:06 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 10:18:06 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 10:18:06 2021] nvme nvme0: Shutdown timeout set to 10 seconds [Fr Apr 16 10:18:06 2021] nvme nvme0: 16/0/0 default/read/poll queues ``` Here's data on my system: ``` + inxi -Fxz System: Kernel: 5.11.13-200.fc33.x86_64 x86_64 bits: 64 compiler: gcc v: 2.35-18.fc33 Console: tty pts/6 Distro: Fedora release 33 (Thirty Three) Machine: Type: Laptop System: TUXEDO product: TUXEDO Pulse 15 Gen1 v: Standard serial: <filter> Mobo: TUXEDO s model: PULSE1501 v: Standard serial: <filter> UEFI: American Megatrends v: N.1.07.A02 date: 12/08/2020 Battery: ID-1: BAT0 charge: 74.2 Wh (81.0%) condition: 91.6/91.6 Wh (100.0%) volts: 12.3 
min: 11.6 model: standard status: Discharging CPU: Info: 8-Core model: AMD Ryzen 7 4800H with Radeon Graphics bits: 64 type: MT MCP arch: Zen 2 rev: 1 cache: L2: 4 MiB flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm bogomips: 92626 Speed: 1397 MHz min/max: 1400/2900 MHz boost: enabled Core speeds (MHz): 1: 1397 2: 1397 3: 1350 4: 1397 5: 1379 6: 1397 7: 1472 8: 1397 9: 1397 10: 1397 11: 1397 12: 1397 13: 1397 14: 1397 15: 1365 16: 1390 Graphics: Device-1: Advanced Micro Devices [AMD/ATI] Renoir vendor: Tongfang Hongkong Limited driver: amdgpu v: kernel bus-ID: 04:00.0 Device-2: Chicony HD Webcam type: USB driver: uvcvideo bus-ID: 1-3:2 Display: server: Fedora Project X.org 1.20.10 driver: loaded: amdgpu,ati unloaded: fbdev,modesetting,vesa resolution: 1920x1080~60Hz OpenGL: renderer: AMD RENOIR (DRM 3.40.0 5.11.13-200.fc33.x86_64 LLVM 11.0.0) v: 4.6 Mesa 20.3.5 direct render: Yes Audio: Device-1: Advanced Micro Devices [AMD/ATI] vendor: Tongfang Hongkong Limited driver: snd_hda_intel v: kernel bus-ID: 04:00.1 Device-2: Advanced Micro Devices [AMD] Raven/Raven2/FireFlight/Renoir Audio Processor vendor: Tongfang Hongkong Limited driver: N/A bus-ID: 04:00.5 Device-3: Advanced Micro Devices [AMD] Family 17h HD Audio vendor: Tongfang Hongkong Limited driver: snd_hda_intel v: kernel bus-ID: 04:00.6 Sound Server-1: ALSA v: k5.11.13-200.fc33.x86_64 running: yes Sound Server-2: JACK v: 1.9.14 running: no Sound Server-3: PulseAudio v: 14.0-rebootstrapped running: yes Sound Server-4: PipeWire v: 0.3.25 running: yes Network: Device-1: Intel Wi-Fi 6 AX200 driver: iwlwifi v: kernel bus-ID: 01:00.0 IF: wlp1s0 state: up mac: <filter> Device-2: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet vendor: Tongfang Hongkong Limited driver: r8169 v: kernel port: f000 bus-ID: 02:00.0 IF: eno1 state: down mac: <filter> IF-ID-1: docker0 state: down mac: <filter> IF-ID-2: virbr0 state: down mac: <filter> IF-ID-3: virbr0-nic state: down mac: <filter> Bluetooth: Device-1: Intel AX200 Bluetooth type: USB driver: btusb v: 0.8 bus-ID: 1-4.4:5 Report: ID: hci0 state: up address: <filter> bt-v: 3.0 lmp-v: 5.2 Drives: Local Storage: total: 1.82 TiB used: 24.42 GiB (1.3%) ID-1: /dev/nvme0n1 vendor: Samsung model: SSD 980 PRO 2TB size: 1.82 TiB temp: 34.9 C Partition: ID-1: / size: 600 GiB used: 23 GiB (3.8%) fs: btrfs dev: /dev/nvme0n1p2 ID-2: /boot/efi size: 511 MiB used: 22.5 MiB (4.4%) fs: vfat dev: /dev/nvme0n1p1 Swap: ID-1: swap-1 type: partition size: 62.52 GiB used: 0 KiB (0.0%) dev: /dev/nvme0n1p3 ID-2: swap-2 type: zram size: 4 GiB used: 0 KiB (0.0%) dev: /dev/zram0 Sensors: System Temperatures: cpu: 44.8 C mobo: N/A gpu: amdgpu temp: 44.0 C Fan Speeds (RPM): N/A Info: Processes: 480 Uptime: 8h 29m Memory: 62.3 GiB used: 7.04 GiB (11.3%) Init: systemd runlevel: 5 Compilers: gcc: 10.2.1 Packages: 2226 Shell: Bash v: 5.0.17 inxi: 3.3.02 + nvme id-ctrl /dev/nvme0 NVME Identify Controller: vid : 0x144d ssvid : 0x144d sn : S69ENG0R111888X mn : Samsung SSD 980 PRO 2TB fr : 2B2QGXA7 rab : 2 ieee : 002538 cmic : 0 mdts : 7 cntlid : 0x6 ver : 0x10300 rtd3r : 0x30d40 rtd3e : 0x989680 oaes : 0x200 ctratt : 0x10 rrls : 0 cntrltype : 0 fguid : crdt1 : 0 crdt2 : 0 crdt3 : 0 oacs : 0x17 acl : 7 aerl : 3 frmw : 0x16 lpa : 0xf elpe : 63 npss : 4 avscc : 0x1 apsta : 0x1 wctemp : 355 cctemp : 358 mtfa : 0 hmpre : 0 hmmin : 0 tnvmcap : 2000398934016 unvmcap : 0 rpmbs : 0 edstt : 35 dsto : 0 fwug : 0 kas : 0 hctma : 0x1 mntmt : 318 mxtmt : 356 sanicap : 0x3 hmminds : 0 hmmaxd : 0 nsetidmax : 0 endgidmax : 1 
anatt : 0 anacap : 0 anagrpmax : 0 nanagrpid : 0 pels : 0 sqes : 0x66 cqes : 0x44 maxcmd : 256 nn : 1 oncs : 0x57 fuses : 0 fna : 0x5 vwc : 0x7 awun : 1023 awupf : 0 nvscc : 1 nwpc : 0 acwu : 0 sgls : 0 mnan : 0 subnqn : nqn.1994-11.com.samsung:nvme:980PRO:M.2:S69ENG0R111888X ioccsz : 0 iorcsz : 0 icdoff : 0 ctrattr : 0 msdbd : 0 ps 0 : mp:8.49W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:- active_power:- ps 1 : mp:4.48W operational enlat:0 exlat:200 rrt:1 rrl:1 rwt:1 rwl:1 idle_power:- active_power:- ps 2 : mp:3.18W operational enlat:0 exlat:1000 rrt:2 rrl:2 rwt:2 rwl:2 idle_power:- active_power:- ps 3 : mp:0.0400W non-operational enlat:2000 exlat:1200 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:- active_power:- ps 4 : mp:0.0050W non-operational enlat:500 exlat:9500 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:- active_power:- + modinfo nvme_core filename: /lib/modules/5.11.13-200.fc33.x86_64/kernel/drivers/nvme/host/nvme-core.ko.xz version: 1.0 license: GPL srcversion: DCD0195DBE946B8AF972AC5 depends: retpoline: Y intree: Y name: nvme_core vermagic: 5.11.13-200.fc33.x86_64 SMP mod_unload sig_id: PKCS#7 signer: Fedora kernel signing key sig_key: 2B:12:6B:43:7C:A9:60:4D:56:25:7D:ED:06:B4:18:E8:F0:AE:AD:F0 sig_hashalgo: sha256 signature: 03:47:AA:7A:9A:5E:80:AC:AF:A7:AF:3F:8F:C6:38:CB:2B:88:B1:54: 02:B2:BF:CA:78:A7:10:92:1C:55:05:73:4D:F2:BA:6C:E7:C6:F3:9C: CB:FA:D9:C6:2F:38:F2:CB:27:F2:78:48:19:75:D2:05:72:BE:68:76: EF:0C:11:33:D1:14:9B:AC:DB:3F:DE:1A:B1:58:A1:74:65:C5:56:B3: DE:5A:30:7D:86:7B:A9:CB:8A:8A:10:7A:F6:CC:86:10:64:DF:B2:C8: B8:5F:B5:C9:1D:15:F4:AD:4D:76:FC:C6:95:1A:A6:C7:C4:C8:F5:04: 84:F9:52:44:B1:DD:FC:55:92:30:DF:E3:43:9D:4A:AF:9E:08:13:DF: C9:C6:8F:FF:B2:F0:15:0B:3B:87:7F:E4:72:83:A2:C2:EB:86:EC:22: 17:C6:61:DE:6A:84:86:EE:84:E0:59:FE:0C:36:70:A7:1F:84:47:BF: 23:6D:CC:A7:A0:E6:CD:B0:8F:5E:4F:4B:80:0D:C6:D9:6E:DF:F7:7F: 3B:80:20:70:2C:2C:B5:45:C9:3A:FA:5E:63:94:27:C4:4C:BE:91:EE: C2:C6:F5:86:12:52:11:A1:30:39:38:9C:10:AB:1B:F0:A5:ED:DC:AB: AE:C2:81:B0:79:DE:27:4C:A1:F2:1E:9E:E6:AF:BB:B8:CF:65:08:C4: A8:6F:84:56:82:D5:D4:48:60:B7:6D:62:78:FC:12:59:68:C2:BA:35: 72:44:04:19:75:F0:98:5A:72:11:68:27:85:EF:50:B9:FE:0C:BB:3C: 3A:24:8F:12:EA:EB:F0:82:91:13:F4:73:CD:F8:A9:61:CE:98:7B:49: C8:F2:34:BA:55:B1:B6:2C:A7:09:38:1C:78:1D:EE:A2:16:98:E6:B8: 58:E9:0A:5D:91:1D:0B:E4:B2:88:1B:C6:5F:40:61:B2:5E:1D:AF:E8: 78:0E:C1:90:DE:CF:A6:EF:86:A8:D4:DE:0B:C1:62:13:9B:1B:CF:DC: 64:24:9C:10:EA:68:FC:72:BE:2A:0D:9F:49:28:1F:FB:2A:69:1B:12: F6:63:A3:98:3B:68:10:10:75:08:13:73:0E:12:47:E9:E7:35:34:35: A5:1F:80:39:0F:4E:D5:A6:69:C0:E1:B2:F5:8C:1A:3F:01:2B:D0:9C: 3C:6A:C2:F9:42:23:1F:33:8A:6A:1F:7F:B5:76:37:7E:12:07:15:A0: 8B:DC:ED:25:B1:74:7B:77:4D:7C:2C:30:19:A8:91:33:81:68:E3:5B: 30:D7:01:6B:69:94:5C:51:11:97:A4:7B:38:44:D0:75:C3:BC:6F:BA: CF:83:0A:42:E8:B2:47:40:61:2A:A0:33 parm: multipath:turn on native support for multiple controllers per subsystem (bool) parm: admin_timeout:timeout in seconds for admin commands (uint) parm: io_timeout:timeout in seconds for I/O (uint) parm: shutdown_timeout:timeout in seconds for controller shutdown (byte) parm: max_retries:max number of retries a command may have (byte) parm: default_ps_max_latency_us:max power saving latency for new devices; use PM QOS to change per device (ulong) parm: force_apst:allow APST for newly enumerated devices even if quirked off (bool) parm: streams:turn on support for Streams write directives (bool) + ls -l /sys/class/nvme/nvme0/power insgesamt 0 -rw-r--r--. 1 root root 4096 16. Apr 10:13 autosuspend_delay_ms -rw-r--r--. 
1 root root 4096 16. Apr 10:13 control -rw-r--r--. 1 root root 4096 16. Apr 10:13 pm_qos_latency_tolerance_us -r--r--r--. 1 root root 4096 16. Apr 10:13 runtime_active_time -r--r--r--. 1 root root 4096 16. Apr 10:13 runtime_status -r--r--r--. 1 root root 4096 16. Apr 10:13 runtime_suspended_time + cat /etc/default/grub GRUB_TIMEOUT=0 GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)" GRUB_DEFAULT=saved GRUB_DISABLE_SUBMENU=true GRUB_TERMINAL_OUTPUT="console" GRUB_CMDLINE_LINUX="resume=UUID=10110d5f-81e3-41c6-9ea5-dcefea2cb937 rhgb quiet nvme_core.default_ps_max_latency_us=0 pcie_aspm=performance" GRUB_DISABLE_RECOVERY="true" GRUB_ENABLE_BLSCFG=true + cat /proc/cmdline BOOT_IMAGE=(hd0,gpt2)/boot/vmlinuz-5.11.13-200.fc33.x86_64 root=UUID=bfa8a277-c2de-4b2a-a8c9-3488e648b423 ro resume=UUID=10110d5f-81e3-41c6-9ea5-dcefea2cb937 rhgb quiet nvme_core.default_ps_max_latency_us=0 pcie_aspm=performance + nvme get-feature -f 0x0c -H /dev/nvme0 | grep APSTE Autonomous Power State Transition Enable (APSTE): Disabled + fwupdmgr get-updates Devices with no available firmware updates: • Samsung SSD 980 PRO 2TB • System Firmware • UEFI dbx No updatable devices ``` Have you tried: nvme_core.default_ps_max_latency_us=5500 Worked here. (In reply to RockT from comment #88) > Have you tried: > nvme_core.default_ps_max_latency_us=5500 > > Worked here. It has somewhat alleviated the issue but not resolved it. The timeouts are still occurring, just less frequently: ``` [Fr Apr 16 13:06:09 2021] Command line: BOOT_IMAGE=(hd0,gpt2)/boot/vmlinuz-5.11.13-200.fc33.x86_64 root=UUID=bfa8a277-c2de-4b2a-a8c9-3488e648b423 ro resume=UUID=10110d5f-81e3-41c6-9ea5-dcefea2cb937 rhgb quiet nvme_core.default_ps_max_latency_us=5500 [Fr Apr 16 13:06:09 2021] Kernel command line: BOOT_IMAGE=(hd0,gpt2)/boot/vmlinuz-5.11.13-200.fc33.x86_64 root=UUID=bfa8a277-c2de-4b2a-a8c9-3488e648b423 ro resume=UUID=10110d5f-81e3-41c6-9ea5-dcefea2cb937 rhgb quiet nvme_core.default_ps_max_latency_us=5500 [Fr Apr 16 13:06:10 2021] nvme nvme0: pci function 0000:03:00.0 [Fr Apr 16 13:06:10 2021] nvme nvme0: Shutdown timeout set to 10 seconds [Fr Apr 16 13:06:10 2021] nvme nvme0: 16/0/0 default/read/poll queues [Fr Apr 16 13:06:10 2021] nvme0n1: p1 p2 p3 p4 p5 [Fr Apr 16 13:06:10 2021] BTRFS: device fsid bfa8a277-c2de-4b2a-a8c9-3488e648b423 devid 1 transid 56409 /dev/nvme0n1p2 scanned by systemd-udevd (416) [Fr Apr 16 13:06:12 2021] BTRFS info (device nvme0n1p2): disk space caching is enabled [Fr Apr 16 13:06:12 2021] BTRFS info (device nvme0n1p2): has skinny extents [Fr Apr 16 13:06:12 2021] BTRFS info (device nvme0n1p2): enabling ssd optimizations [Fr Apr 16 13:06:13 2021] Adding 65552380k swap on /dev/nvme0n1p3. Priority:-2 extents:1 across:65552380k SSFS [Fr Apr 16 13:06:13 2021] BTRFS info (device nvme0n1p2): disk space caching is enabled [Fr Apr 16 13:06:13 2021] EXT4-fs (nvme0n1p5): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
[Fr Apr 16 14:58:58 2021] nvme nvme0: Shutdown timeout set to 10 seconds [Fr Apr 16 14:58:58 2021] nvme nvme0: 16/0/0 default/read/poll queues [Fr Apr 16 15:07:05 2021] nvme nvme0: Shutdown timeout set to 10 seconds [Fr Apr 16 15:07:05 2021] nvme nvme0: 16/0/0 default/read/poll queues [Fr Apr 16 16:20:41 2021] nvme nvme0: I/O 145 QID 6 timeout, aborting [Fr Apr 16 16:20:41 2021] nvme nvme0: I/O 146 QID 6 timeout, aborting [Fr Apr 16 16:20:41 2021] nvme nvme0: I/O 147 QID 6 timeout, aborting [Fr Apr 16 16:20:41 2021] nvme nvme0: I/O 148 QID 6 timeout, aborting [Fr Apr 16 16:20:41 2021] nvme nvme0: I/O 149 QID 6 timeout, aborting [Fr Apr 16 16:20:41 2021] nvme nvme0: I/O 150 QID 6 timeout, aborting [Fr Apr 16 16:20:41 2021] nvme nvme0: I/O 151 QID 6 timeout, aborting [Fr Apr 16 16:20:41 2021] nvme nvme0: I/O 152 QID 6 timeout, aborting [Fr Apr 16 16:21:12 2021] nvme nvme0: I/O 145 QID 6 timeout, reset controller [Fr Apr 16 16:21:12 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 16:21:12 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 16:21:12 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 16:21:12 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 16:21:12 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 16:21:12 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 16:21:12 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 16:21:12 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 16:21:12 2021] nvme nvme0: Shutdown timeout set to 10 seconds [Fr Apr 16 16:21:12 2021] nvme nvme0: 16/0/0 default/read/poll queues [Fr Apr 16 16:58:43 2021] nvme nvme0: I/O 496 QID 16 timeout, aborting [Fr Apr 16 16:58:45 2021] nvme nvme0: I/O 223 QID 3 timeout, aborting [Fr Apr 16 16:58:45 2021] nvme nvme0: I/O 369 QID 5 timeout, aborting [Fr Apr 16 16:58:45 2021] nvme nvme0: I/O 370 QID 5 timeout, aborting [Fr Apr 16 16:58:45 2021] nvme nvme0: I/O 371 QID 5 timeout, aborting [Fr Apr 16 16:58:45 2021] nvme nvme0: I/O 372 QID 5 timeout, aborting [Fr Apr 16 16:58:45 2021] nvme nvme0: I/O 960 QID 10 timeout, aborting [Fr Apr 16 16:58:45 2021] nvme nvme0: I/O 961 QID 10 timeout, aborting [Fr Apr 16 16:59:13 2021] nvme nvme0: I/O 496 QID 16 timeout, reset controller [Fr Apr 16 16:59:13 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 16:59:13 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 16:59:13 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 16:59:13 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 16:59:13 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 16:59:13 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 16:59:13 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 16:59:13 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 16:59:13 2021] nvme nvme0: Shutdown timeout set to 10 seconds [Fr Apr 16 16:59:13 2021] nvme nvme0: 16/0/0 default/read/poll queues [Fr Apr 16 17:37:27 2021] nvme nvme0: I/O 487 QID 4 timeout, aborting [Fr Apr 16 17:37:27 2021] nvme nvme0: I/O 488 QID 4 timeout, aborting [Fr Apr 16 17:37:27 2021] nvme nvme0: I/O 489 QID 4 timeout, aborting [Fr Apr 16 17:37:27 2021] nvme nvme0: I/O 490 QID 4 timeout, aborting [Fr Apr 16 17:37:27 2021] nvme nvme0: I/O 491 QID 4 timeout, aborting [Fr Apr 16 17:37:27 2021] nvme nvme0: I/O 492 QID 4 timeout, aborting [Fr Apr 16 17:37:27 2021] nvme nvme0: I/O 692 QID 14 timeout, aborting [Fr Apr 16 17:37:27 2021] nvme nvme0: I/O 693 QID 14 timeout, aborting [Fr Apr 16 17:37:58 2021] nvme nvme0: I/O 487 QID 4 timeout, reset controller [Fr Apr 16 17:37:58 2021] blk_update_request: I/O error, dev nvme0n1, sector 10035776 op 0x0:(READ) flags 0x80700 phys_seg 4 prio 
class 0 [Fr Apr 16 17:37:58 2021] blk_update_request: I/O error, dev nvme0n1, sector 1458576 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 0 [Fr Apr 16 17:37:58 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 17:37:58 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 17:37:58 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 17:37:58 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 17:37:58 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 17:37:58 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 17:37:58 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 17:37:58 2021] nvme nvme0: Abort status: 0x371 [Fr Apr 16 17:37:58 2021] nvme nvme0: Shutdown timeout set to 10 seconds [Fr Apr 16 17:37:58 2021] nvme nvme0: 16/0/0 default/read/poll queues ```

Created attachment 296413 [details] attachment-4366-0.html

Sorry for repeating myself, I have this bug every year or so. It is an intermittent contact failure. Take out your SSD and clean all the contacts with paper wetted in solvent.
(In reply to Sergey Slizovskiy from comment #90)
> Created attachment 296413 [details]
> attachment-4366-0.html
>
> Sorry for repeating myself, I have this bug every year or so. It is an
> intermittent contact failure. Take out your SSD and clean all the contacts
> with paper wetted in solvent.
No problem. However, after reading through the thread, taking the drive out and making sure it was clean and properly seated when putting it back in was one of the first things I did :(

I have essentially tested all proposed solutions that were mentioned in the comments, without success. Is there anything else I can do? How can we determine if my problem requires a kernel-side fix?

I contacted Kingston support for my A2000 SSD, and they directed me to firmware update instructions (https://www.kingston.com/unitedstates/us/support/technical/ksm-firmware-update). The release notes for the A2000 read:

Firmware Rev. S5Z42109 (03-30-2021)
• Fixed an issue that might cause the drive to become unresponsive on Linux systems

I'll update the firmware and give it a try. Will come back with news.
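For anyone wanting to double-check which firmware revision a drive is actually running before and after flashing, nvme-cli can read it back from the controller identify data. A quick sketch (not an official procedure), assuming the drive shows up as /dev/nvme0 as in the outputs above:

```
# Print only the firmware revision ("fr") field of the identify-controller data
sudo nvme id-ctrl /dev/nvme0 | grep -E '^fr '

# Or list all NVMe devices with their model, serial and firmware revision columns
sudo nvme list
```

smartctl -i /dev/nvme0 from smartmontools should report the same firmware string.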
> I'll update the firmware and give it a try. Will come back with news.
Andrés, did you get a chance to try the new firmware? I'd be interested to know if it solves this problem.
(In reply to david.antliff from comment #94)
> > I'll update the firmware and give it a try. Will come back with news.
>
> Andrés, did you get a chance to try the new firmware? I'd be interested to
> know if it solves this problem.

Yes, I upgraded the firmware, and it's been a week now with no problems at all :)

I use the laptop a lot, and also left it on for a whole night one or two days.

I'd recommend all of you who have a Kingston A2000 SSD to upgrade the firmware.

(In reply to Andrés Delfino from comment #95)
> (In reply to david.antliff from comment #94)
> > > I'll update the firmware and give it a try. Will come back with news.
> >
> > Andrés, did you get a chance to try the new firmware? I'd be interested to
> > know if it solves this problem.
>
> Yes, I upgraded the firmware, and it's been a week now with no problems at
> all :)
>
> I use the laptop a lot, and also left it on for a whole night one or two days.
>
> I'd recommend all of you who have a Kingston A2000 SSD to upgrade the
> firmware.

That's great news - thank you for reporting back. Just for the avoidance of doubt - did you also remove the `nvme_core.default_ps_max_latency_us=` parameter from your kernel command-line? I'll assume you did, but I just want to double-check :)

I didn't remove the parameter... wait for it... because I have never added it to begin with :)

After reading a few comments in this bug, the possibility of the workaround not preventing the problem 100% of the time kind of discouraged me from trying it at all.

My testing was done on Linux Mint 20.1. With all updates installed, I'm running Linux 5.4.0-72-generic at the moment.

Don't hesitate to ask further questions. I'm glad to be of help. I was very frustrated with my new SSD sitting in its box; I hope all A2000 users take notice of this firmware upgrade.

(In reply to Andrés Delfino from comment #97)
> I didn't remove the parameter... wait for it... because I have never added
> it to begin with :)
> […]
> My testing was done on Linux Mint 20.1. With all updates installed, I'm
> running Linux 5.4.0-72-generic at the moment.

That kernel afaics contains this patch: https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/focal/commit/?id=47add9f75714fabd3702dca0e5899a56d2f3ee2f

It thus will avoid the deepest APST sleep mode even if you don't set the `nvme_core.default_ps_max_latency_us=` kernel parameter, even with the newest firmware. It might have been this patch (which was recently added to Ubuntu and thus Mint kernels) that fixed things for you. I don't think there is a parameter to disable this behaviour.

Kingston is aware of that patch (made weeks before the new firmware got out); I told them to write a kernel patch on top of that one to only apply the quirk if the SSD is running an older firmware. Not sure if they are working on it.

(In reply to Thorsten Leemhuis from comment #98)
> It thus will avoid the deepest APST sleep mode even if you don't set the
> `nvme_core.default_ps_max_latency_us=` kernel parameter, even with the
> newest firmware. It might have been this patch (which was recently added to
> Ubuntu and thus Mint kernels) that fixed things for you.

Oh, I see. If there's a way for me to verify if that patch is in the kernel I'm running, I can check that.

(In reply to Andrés Delfino from comment #99)
>
> Oh, I see. If there's a way for me to verify if that patch is in the kernel
> I'm running, I can check that.

In the running kernel: not that I'm aware of.
But you obviously could get the source of the kernel you are running and check the patched file. You can also check online: if you go to https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/focal/log/ you'll see a version that is nearly identical to the one you mentioned; and if you search for A2000 in that repo you'll see that the patch was applied there: https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/focal/commit/?id=47add9f75714fabd3702dca0e5899a56d2f3ee2f

(In reply to Thorsten Leemhuis from comment #100)
> In the running kernel: not that I'm aware of. But you obviously could get
> the source of the kernel you are running and check the patched file. You
> can also check online: if you go to
> https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/focal/log/
> you'll see a version that is nearly identical to the one you mentioned; and
> if you search for A2000 in that repo you'll see that the patch was applied
> there:
> https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/focal/commit/?id=47add9f75714fabd3702dca0e5899a56d2f3ee2f

Yes, the "kernel version" (as reported by uname -v) is 80, so yeah, I'm running with that patch applied.

Just realized that you are the one who wrote the patch, thank you for doing it!

Hopefully, the firmware upgrade by itself solves the issue as well, so users with an older kernel don't get bitten by this problem.

(In reply to Andrés Delfino from comment #101)
> Just realized that you are the one who wrote the patch, thank you for doing
> it!

You're welcome. While at it, a note to everyone following here: if any SSD or other hardware requires workarounds like `nvme_core.default_ps_max_latency_us=`, submit a trivial patch like the one mentioned above – or let the kernel developers know about the problem and the workaround, so they can write a patch. This will make Linux better, as it will then run better for everyone out of the box!

> Hopefully, the firmware upgrade by itself solves the issue as well, so users
> with an older kernel don't get bitten by this problem.

FWIW, that patch made it into 4.14.221, 4.19.175, 5.4.97, 5.10.15, 5.11 and later, so it will only be a problem for people that run really old kernels – or distro kernels that didn't pick up that patch (that's likely the bigger problem here :-/ )

Samsung have released a firmware update for the 980 PRO (3B2QGXA7) at the beginning of May 2021, to be found at https://www.samsung.com/semiconductor/minisite/ssd/download/tools/.

I've updated my SSD and will test whether it fixes my drive's timeout issues.

(In reply to basti.megamorf+kernel-org from comment #103)
> Samsung have released a firmware update for the 980 PRO (3B2QGXA7) at the
> beginning of May 2021, to be found at
> https://www.samsung.com/semiconductor/minisite/ssd/download/tools/.
>
> I've updated my SSD and will test whether it fixes my drive's timeout issues.

Unfortunately no improvement so far. I still experience controller timeouts daily, at intervals as low as every 30 seconds, sometimes even forcing the system into a completely unusable state where filesystems cannot be written to anymore at all.

I couldn't understand why people were only trying nvme_core.default_ps_max_latency_us=[0|200|5500], but I think I understand now and want to clear this up for anybody else trying a non-zero value: I believe this parameter indicates the maximum latency allowed to leave a deep power state.
Your SSD has multiple power states with varying power usage and a maximum delay to exit the given power state:

```
sudo nvme id-ctrl /dev/nvme0n1p3
ps 0 : mp:7.80W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:6.00W operational enlat:0 exlat:0 rrt:1 rrl:1 rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:3.40W operational enlat:0 exlat:0 rrt:2 rrl:2 rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.0700W non-operational enlat:210 exlat:1200 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0100W non-operational enlat:2000 exlat:8000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:- active_power:-
```

In my case ps3 has an entry latency of 210us and an exit latency of 1200us, meaning there is a maximum latency of 210+1200=1410us to leave the ps3 power state. Likewise, ps4 has a maximum latency of 10000us.

The whole problem is that one of the deeper power states is problematic and needs to be avoided. In my case I can set my value to 1410 to avoid entering ps4, since ps3 is fine.

Source: https://docs.microsoft.com/en-us/windows-hardware/design/component-guidelines/power-management-for-storage-hardware-devices-nvme

I'm experiencing the very same issue with the fresh CentOS 7 kernel 3.10.0-1160.59.1 - the NVMe devices (2 of them) disconnect shortly after OS boot:

[ 247.711460] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[ 248.078507] nvme 0000:01:00.0: enabling device (0000 -> 0002)
[ 248.078796] nvme nvme0: Removing after probe failure status: -19
[ 248.084551] nvme0n1: detected capacity change from 1000204886016 to 0
[ 248.084598] blk_update_request: I/O error, dev nvme0n1, sector 1953524992
[ 248.084660] Buffer I/O error on dev nvme0n1, logical block 244190624, async page read
[ 249.071522] nvme nvme0: failed to set APST feature (-19)

BUT the old kernel works totally well! I've stress-tested the NVMe devices a bit and neither got disconnected; they have been working well for several hours already. The working kernel version is 3.10.0-693.el7.x86_64.

The devices are Samsung 970 EVO Plus 1TB, firmware version 2B2QEXM7 for both.

01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
06:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983

Reporting here about WD Green SN350 1TB (15b7:5014)

$ lspci -nn
05:00.0 Non-Volatile memory controller [0108]: Sandisk Corp Device [15b7:5014]

NVME Identify Controller:
vid : 0x15b7
ssvid : 0x15b7
sn : 22292H......
mn : WD Green SN350 1TB
fr : 33006000
[...]
ps 0 : mp:5.00W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:0.6300W active_power:5.00W
ps 1 : mp:2.40W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:0.6300W active_power:2.40W
ps 2 : mp:1.90W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:0.6300W active_power:1.90W
ps 3 : mp:0.0250W non-operational enlat:3900 exlat:11000 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:0.0250W active_power:-
ps 4 : mp:0.0050W non-operational enlat:5000 exlat:39000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:0.0050W active_power:-

Latest and only available firmware version: 33006000 (https://wddashboarddownloads.wdc.com/wdDashboard/firmware/WD_Green_SN350_1TB/33006000/device_properties.xml, from https://wddashboarddownloads.wdc.com/wdDashboard/config/devices/lista_devices.xml)

I frequently and consistently experienced the above-mentioned filesystem problems due to power-state switching (`nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10`). Disabling the LAST pstate (ps4) by using nvme_core.default_ps_max_latency_us=11500 fixed it (no crash for 48h and counting).

I thus suggest adding

+ { PCI_DEVICE(0x15b7, 0x5014),   /* WDC Green SN350 1TB NVMe SSD */
+   .driver_data = NVME_QUIRK_NO_DEEPEST_PS, },

to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvme/host/pci.c

(In reply to Raphaël Droz from comment #107)
> I thus suggest adding
> + { PCI_DEVICE(0x15b7, 0x5014),   /* WDC Green SN350 1TB NVMe SSD */
> +   .driver_data = NVME_QUIRK_NO_DEEPEST_PS, },
>
> to
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvme/host/pci.c

Mentioning this here is unlikely to help, for two reasons afaics:

* this report is about an issue with a different device and hence makes things hard to follow; submitting a new ticket would have been better
* many kernel developers can't be reached through this server (see https://docs.kernel.org/admin-guide/reporting-issues.html )

Better report this by mail to the maintainers of that file (explained in the aforementioned document).
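Tying together the power-state reasoning above: a non-zero value for nvme_core.default_ps_max_latency_us can be derived directly from the enlat/exlat columns of the `nvme id-ctrl` power-state table. Below is a rough sketch, not an official tool; it follows the enlat+exlat rule of thumb used above rather than the kernel's exact APST selection logic, and it assumes nvme-cli is installed and the drive is /dev/nvme0 (adjust the path as needed). It prints a value intended to keep every power state except the deepest one:

```
sudo nvme id-ctrl /dev/nvme0 | awk '
  /^ps +[0-9]/ {
      n++
      # pull the entry and exit latencies (in microseconds) out of each "ps N :" line
      for (i = 1; i <= NF; i++) {
          if ($i ~ /^enlat:/) { v = $i; sub("enlat:", "", v); enlat[n] = v + 0 }
          if ($i ~ /^exlat:/) { v = $i; sub("exlat:", "", v); exlat[n] = v + 0 }
      }
  }
  END {
      allow = n - 1                      # deepest state we still want the kernel to use
      cap = enlat[allow] + exlat[allow]  # latency budget needed for that state
      print "suggested nvme_core.default_ps_max_latency_us=" cap
  }'
```

Run against the table quoted a few comments up this prints 1410, matching the value derived there; for the 980 PRO table earlier in this report it yields 3200, and for the WD Green SN350 table above it yields 14900 (the 11500 value used by that reporter achieved the same effect of keeping the drive out of ps4).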