Bug 195039 - Samsung PM951 NVMe sudden controller death
Summary: Samsung PM951 NVMe sudden controller death
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Other (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: io_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-03-25 18:30 UTC by Marvin W
Modified: 2020-10-28 08:14 UTC (History)
22 users (show)

See Also:
Kernel Version: 4.11-rc3
Tree: Mainline
Regression: No


Attachments
nvme id-ctrl /dev/nvme0 (1.26 KB, text/plain)
2017-03-27 15:37 UTC, Marvin W
Details

Description Marvin W 2017-03-25 18:30:18 UTC
After updating to Kernel 4.11-rc3 (current Fedora 26 kernel), I noticed sudden disk outage after some time. These are related kmsg logs:

    <4>[11018.909097] nvme 0000:04:00.0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
    <6>[11018.917238] nvme 0000:04:00.0: enabling device (0000 -> 0002)
    <4>[11018.917354] nvme nvme0: Removing after probe failure status: -19
    <6>[11018.944950] nvme0n1: detected capacity change from 512110190592 to 0
    <3>[11018.945165] blk_update_request: I/O error, dev nvme0n1, sector 916173360
    <3>[11018.945222] blk_update_request: I/O error, dev nvme0n1, sector 916173120
    <3>[11018.945284] blk_update_request: I/O error, dev nvme0n1, sector 916172864
    <3>[11018.945333] blk_update_request: I/O error, dev nvme0n1, sector 916172848
    [...]

Device details:

    PCI: 04:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM951/PM951 [144d:a802] (rev 01)
    Model Number: PM951 NVMe SAMSUNG 512GB
    Firmware Version: BXV77D0Q

As 4.11 enables APST for NVMe and there already is an exception for the Samsung SM951 NVMe SSD whose description seems to be related, I guess my SSD has the same bug.

For the moment I'm back to 4.9 kernel, but ready for testing if required.
Comment 1 Thorsten Leemhuis 2017-03-27 09:49:27 UTC
Andy, Jens: Does this problem look like the problems you saw on the SM951 that already was blacklisted in https://git.kernel.org/torvalds/c/c5552fde102fcc3f2cf9e502b8ac90e3500d8fdf ?

Side note: I added this report to the list of regressions for Linux 4.11. I'll try to watch this place for further updates on this issue to document progress in my weekly reports. Please let me know in case the discussion moves to a different place (bugzilla or another mail thread for example). tia!
Comment 2 Jens Axboe 2017-03-27 14:45:09 UTC
Marvin, can you try with -rc4, and revert commit c5552fde10? I just checked, it reverts cleanly.
Comment 3 Andy Lutomirski 2017-03-27 15:25:07 UTC
Hi Marvin-

Could you give me some more details of your hardware?

1. The raw device identification.  A new enough smartctl will show it in 'smartctl -i /dev/nvme0'.  Even better would be the full output of 'nvme id-ctrl /dev/nvme0' -- you can find the 'nvme' tool in the nvme-cli package.

2. What kind of computer is this?  Is it a laptop?  Is the affected disk something that came with the laptop?

3. Can you try booting with nvme_core.default_ps_max_latency_us=0?  That with disable the power-saving feature that is likely at fault.

Samsung currently has a machine that appears to be affected and is trying to figure out what's going on.  There's some reason to believe that the problem is triggered by specific combinations of laptop and SSD.  I'll obviously need to update the blacklist to fix your laptop (in lieu of a better workaround that still lets you get some power savings), but I need the info above to figure out what the blacklist entry should look like.

Thanks,
Andy
Comment 4 Marvin W 2017-03-27 15:37:24 UTC
Created attachment 255577 [details]
nvme id-ctrl /dev/nvme0

(In reply to Andy Lutomirski from comment #3)
> 1. The raw device identification.  A new enough smartctl will show it in
> 'smartctl -i /dev/nvme0'.  Even better would be the full output of 'nvme
> id-ctrl /dev/nvme0' -- you can find the 'nvme' tool in the nvme-cli package.
Comment 5 Marvin W 2017-03-29 15:49:17 UTC
Andy,

(In reply to Andy Lutomirski from comment #3)
> 2. What kind of computer is this?  Is it a laptop?  Is the affected disk
> something that came with the laptop?

This is a Dell XPS 15 9550. It is available in many different configurations, including SATA, mSATA or NVMe SSDs with different models each and you don't know before buying which exact model you will receive in the end...

> 3. Can you try booting with nvme_core.default_ps_max_latency_us=0?  That
> with disable the power-saving feature that is likely at fault.

I am running since Monday with Kernel 4.11-rc4 and nvme_core.default_ps_max_latency_us=0 and had no problems so far. Device was off/standby several hours, but prior crashes were after 1-4 hours so I assume the problem does not occur with nvme_core.default_ps_max_latency_us=0.

I will also try with c5552fde10 reverted as suggested by Jens to be absolutely sure.
Comment 6 Thorsten Leemhuis 2017-04-09 16:50:43 UTC
@Luto: What's the status here? Do you need any more information to fix?
Comment 7 Andy Lutomirski 2017-04-09 17:05:33 UTC
Samsung engineers have an affected system and are trying to root-cause it.  I was hoping they'd come up with something quickly, but I'm just going to submit a patch with a bigger quirk.
Comment 8 Chris Roth 2017-04-12 01:19:13 UTC
Also posted in bug 194921. I don't know if that breaks a rule. If it does, feel free to delete one of the posts.

I'm running into what I think is the same or a related problem.
When I upgraded to ubuntu 17.04 beta 2 (which I believe is kernel 4.10), I started having crashes after anywhere from 10min to an hour. The operating system would state that the disk is now read only and/or give IO errors. I downgraded to 16.10 and now have no more problems of this kind.

I'm also using a dell 9550.

Output of 'nvme id-ctrl /dev/nvme0'

NVME Identify Controller:
vid     : 0x144d
ssvid   : 0x144d
sn      :       S29PNXAH124276
mn      : PM951 NVMe SAMSUNG 512GB                
fr      : BXV77D0Q
rab     : 2
ieee    : 002538
cmic    : 0
mdts    : 5
cntlid  : 1
ver     : 0
rtd3r   : 0
rtd3e   : 0
oaes    : 0
oacs    : 0x17
acl     : 7
aerl    : 3
frmw    : 0x6
lpa     : 0
elpe    : 63
npss    : 4
avscc   : 0x1
apsta   : 0x1
wctemp  : 0
cctemp  : 0
mtfa    : 0
hmpre   : 0
hmmin   : 0
tnvmcap : 0
unvmcap : 0
rpmbs   : 0
sqes    : 0x66
cqes    : 0x44
nn      : 1
oncs    : 0x1f
fuses   : 0
fna     : 0
vwc     : 0x1
awun    : 255
awupf   : 0
nvscc   : 1
acwu    : 0
sgls    : 0
ps    0 : mp:6.00W operational enlat:5 exlat:5 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:4.20W operational enlat:30 exlat:30 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:3.10W operational enlat:100 exlat:100 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0700W non-operational enlat:500 exlat:5000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0050W non-operational enlat:2000 exlat:22000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-
Comment 9 Andy Lutomirski 2017-04-12 04:09:29 UTC
Chris, the relevant code should be in 4.10 kernels at all.  Can you provide the output of:

$ modinfo nvme_core
$ ls /sys/class/nvme/nvme0/power

The Samsung people working on this issue are thinking that it's possible that the bug isn't directly an APST problem and, if you're hitting it without APST, it could be an interesting data point.

Anyway, my plan is to make the quirk much, much broader for 4.11.  I'm just hoping to hear back in the next day or two to see whether I should be quirking off APST on two particular Dell laptops or whether I should be quirking it off on the entire Samsung 950 line.  So far, it does seem like the problem may be restricted to the two laptops in question.
Comment 10 Chris Roth 2017-04-12 05:04:46 UTC
I reinstalled 17.04 and I've been running for 3 hours without incident
using the nvme_core kernel parameter above. I don't know if this has
anything to do with the issue, but my system seems stable and it has not
been for days.

Here is the output with having run the nvme_core kernel parameter above:

filename:      
/lib/modules/4.10.0-19-generic/kernel/drivers/nvme/host/nvme-core.ko
version:        1.0
license:        GPL
srcversion:     1BBEF320C053A2BA4284272
depends:       
intree:         Y
vermagic:       4.10.0-19-generic SMP mod_unload
parm:           admin_timeout:timeout in seconds for admin commands (byte)
parm:           io_timeout:timeout in seconds for I/O (byte)
parm:           shutdown_timeout:timeout in seconds for controller
shutdown (byte)
parm:           max_retries:max number of retries a command may have (uint)
parm:           nvme_char_major:int
parm:           default_ps_max_latency_us:max power saving latency for
new devices; use PM QOS to change per device (ulong)

I'll reboot and output the data without the kernel parameter and reply
again in a couple of minutes.

On 2017-04-11 10:09 PM, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=195039
>
> --- Comment #9 from Andy Lutomirski (luto@kernel.org) ---
> Chris, the relevant code should be in 4.10 kernels at all.  Can you provide
> the
> output of:
>
> $ modinfo nvme_core
> $ ls /sys/class/nvme/nvme0/power
>
> The Samsung people working on this issue are thinking that it's possible that
> the bug isn't directly an APST problem and, if you're hitting it without
> APST,
> it could be an interesting data point.
>
> Anyway, my plan is to make the quirk much, much broader for 4.11.  I'm just
> hoping to hear back in the next day or two to see whether I should be
> quirking
> off APST on two particular Dell laptops or whether I should be quirking it
> off
> on the entire Samsung 950 line.  So far, it does seem like the problem may be
> restricted to the two laptops in question.
>
Comment 11 Chris Roth 2017-04-12 05:12:05 UTC
Output of modinfo:

filename:      
/lib/modules/4.10.0-19-generic/kernel/drivers/nvme/host/nvme-core.ko
version:        1.0
license:        GPL
srcversion:     1BBEF320C053A2BA4284272
depends:       
intree:         Y
vermagic:       4.10.0-19-generic SMP mod_unload
parm:           admin_timeout:timeout in seconds for admin commands (byte)
parm:           io_timeout:timeout in seconds for I/O (byte)
parm:           shutdown_timeout:timeout in seconds for controller
shutdown (byte)
parm:           max_retries:max number of retries a command may have (uint)
parm:           nvme_char_major:int
parm:           default_ps_max_latency_us:max power saving latency for
new devices; use PM QOS to change per device (ulong)

Output of 'ls /sys/class/nvme/nvme0/power'

async
autosuspend_delay_ms
control
pm_qos_latency_tolerance_us
runtime_active_kids
runtime_active_time
runtime_enabled
runtime_status
runtime_suspended_time
runtime_usage



On 2017-04-11 10:09 PM, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=195039
>
> --- Comment #9 from Andy Lutomirski (luto@kernel.org) ---
> Chris, the relevant code should be in 4.10 kernels at all.  Can you provide
> the
> output of:
>
> $ modinfo nvme_core
> $ ls /sys/class/nvme/nvme0/power
>
> The Samsung people working on this issue are thinking that it's possible that
> the bug isn't directly an APST problem and, if you're hitting it without
> APST,
> it could be an interesting data point.
>
> Anyway, my plan is to make the quirk much, much broader for 4.11.  I'm just
> hoping to hear back in the next day or two to see whether I should be
> quirking
> off APST on two particular Dell laptops or whether I should be quirking it
> off
> on the entire Samsung 950 line.  So far, it does seem like the problem may be
> restricted to the two laptops in question.
>
Comment 12 Andy Lutomirski 2017-04-12 21:41:01 UTC
Awesome, I guess Ubuntu backported APST support.
Comment 13 Andy Lutomirski 2017-04-13 15:18:07 UTC
My current patch set to address this is here:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=nvme/power&id=37294ae0e942e9dc56e869af23cc6face284dec8

It's untested, and I won't have a chance to test until Tuesday.
Comment 14 alberink+kernel 2017-04-26 08:46:24 UTC
I hit this same issue on my upgrade to Ubuntu 17.04. I downgraded again to Ubuntu 16.10 and everything was fine, until yesterday. I received a kernel update and apparently the changes are backported to their 4.8 kernel series, as I all of a sudden hit the bug there as well.

Hardware: 
Dell XPS 9550

smartctl -x output
=== START OF INFORMATION SECTION ===
Model Number:                       PM951 NVMe SAMSUNG 512GB
Serial Number:                      S29PNXAGB11420
Firmware Version:                   BXV77D0Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Controller ID:                      1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Utilization:            418,698,813,440 [418 GB]
Namespace 1 Formatted LBA Size:     512

If you need further information, please let me know
Comment 15 Andy Lutomirski 2017-04-26 22:22:28 UTC
Could you try 4.11-rc8 or the test kernel here:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184
Comment 16 Chris Roth 2017-05-15 16:50:28 UTC
Am I correct that when my arch install updates to kernel 4.11, I'll be able to remove nvme_core.default_ps_max_latency_us=0 as a kernel boot parameter?
Comment 17 Ian Ozsvald 2017-06-09 11:05:16 UTC
Since 2017-05-13 I've had to run boot-repair 4 times to recover my system. Each time I have a very slow shutdown (Mint exits, the screen is black, the power button stays lit for approx 30 seconds - much longer than usual), then on reboot I get a "missing HD" error from the BIOS. It auto-repairs itself to point at the Windows partition and then I only get a Windows boot. If I run boot-repair then I can get a grub that correctly boots back to Linux and Windows.

I'm using a Dell XPS 9550, 32GB RAM, Samsung PM951 NVMe. The nvme firmware hasn't been changed since I bought the machine (over 1 year ago) and according to Samsung's site is the latest firmware.

The change in state was upgrading from kernel 4.9.8 to 4.11, prior to 4.11 I've never seen this issue. In the last two weeks I was running an older BIOS (A06). A few days back I upgraded to BIOS A19 (the only reported stable BIOS with linux) and upgraded kernel 4.11 to 4.11.3. I've just had my 4th slow-shutdown and run of boot-repair. I believe 4.11.x is the common cause for this issue.

Given the earlier reports I'm attaching some notes that I hope are useful, I'm happy to dig further if you give me some guidance.

I'm running Linux Mint 18.1. 

Does anyone know if kernel 4.9 is still unaffected or if 4.12 fixes this?


$ uname -a
Linux ian-XPS-15-9550 4.11.3-041103-generic #201705251233 SMP Thu May 25 16:34:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

$ modinfo nvme_core
filename:       /lib/modules/4.11.3-041103-generic/kernel/drivers/nvme/host/nvme-core.ko
version:        1.0
license:        GPL
srcversion:     E78F732E1E5E7A40EEBCFD1
depends:        
intree:         Y
vermagic:       4.11.3-041103-generic SMP mod_unload 
parm:           admin_timeout:timeout in seconds for admin commands (byte)
parm:           io_timeout:timeout in seconds for I/O (byte)
parm:           shutdown_timeout:timeout in seconds for controller shutdown (byte)
parm:           max_retries:max number of retries a command may have (uint)
parm:           nvme_char_major:int
parm:           default_ps_max_latency_us:max power saving latency for new devices; use PM QOS to change per device (ulong)


$ sudo nvme id-ctrl /dev/nvme0
[sudo] password for ian: 
NVME Identify Controller:
vid     : 0x144d
ssvid   : 0x144d
sn      :       S2FZNYAG801690
mn      : PM951 NVMe SAMSUNG 1024GB               
fr      : BXV76D0Q
rab     : 2
ieee    : 002538
cmic    : 0
mdts    : 5
cntlid  : 1
ver     : 0
rtd3r   : 0
rtd3e   : 0
oaes    : 0
oacs    : 0x17
acl     : 7
aerl    : 3
frmw    : 0x6
lpa     : 0
elpe    : 63
npss    : 4
avscc   : 0x1
apsta   : 0x1
wctemp  : 0
cctemp  : 0
mtfa    : 0
hmpre   : 0
hmmin   : 0
tnvmcap : 0
unvmcap : 0
rpmbs   : 0
sqes    : 0x66
cqes    : 0x44
nn      : 1
oncs    : 0x1f
fuses   : 0
fna     : 0
vwc     : 0x1
awun    : 255
awupf   : 0
nvscc   : 1
acwu    : 0
sgls    : 0
ps    0 : mp:6.00W operational enlat:5 exlat:5 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:4.20W operational enlat:30 exlat:30 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:3.10W operational enlat:100 exlat:100 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0700W non-operational enlat:500 exlat:5000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0050W non-operational enlat:2000 exlat:22000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-


$ ls /sys/class/nvme/nvme0/power
async  autosuspend_delay_ms  control  pm_qos_latency_tolerance_us  runtime_active_kids  runtime_active_time  runtime_enabled  runtime_status  runtime_suspended_time  runtime_usage
Comment 18 Ian Ozsvald 2017-06-09 11:07:31 UTC
Note on "(the only reported stable BIOS with linux)" - given my reading on reddit (/dell and /linux) it seems that the latest .20 and .25 BIOS have some issues for linux but .19 is widely reported to be stable. Given that .19 is recent, I've settled on it. Prior to that A06 has been fine a year.
Comment 19 Jens Axboe 2017-06-09 14:30:42 UTC
Ian, sounds like you are impacted by the APST issue as well. Hopefully 4.12 will work better. Andy, what's the recommended work-around for 4.11 users?
Comment 20 Chris Roth 2017-06-09 15:26:44 UTC
Ian, Kernel 4.11 has been working for me. I no longer have nvme_core.default_ps_max_latency_us=0 as a boot parameter and haven't had any issues with the SSD going into read-only mode as I was before.

However, (and I don't know if this would have an impact), I switched to Arch from Ubuntu 16.10 last month around the same time I went from 4.10 to 4.11.
Comment 21 Andy Lutomirski 2017-06-09 15:35:09 UTC
Ian, the fix should have been:

commit ff5350a86b20de23991e474e006e2ff2732b218e
Author: Andy Lutomirski <luto@kernel.org>
Date:   Thu Apr 20 13:37:55 2017 -0700

    nvme: Adjust the Samsung APST quirk

and that made it in to 4.11.
Comment 22 Ian Ozsvald 2017-06-09 17:11:16 UTC
Jens, Chris, Andy - thanks for the quick response. I think I'm going to pop the base cover and re-seat the hd, maybe something is loose (and/or thermal related). Failing that I might regress to 4.9.8 to see if the problem persists. Thanks for removing this possibility. Cheers, Ian.
Comment 23 Ian Ozsvald 2017-06-10 13:24:39 UTC
I'll ask a follow-up in case it sparks any thoughts. One co-incidental factor seems to be that I have these shutdow-failures only after using my external monitor and suspending a few times. I don't recall (but don't have solid evidence) having this issue just from laptop suspends but if I switch to using my HDMI monitor a few times (with a cloned display) then that seems co-incident with this issue.

Typically I use my laptop solo, sometimes I'll plug it into my home UHD monitor. I'll use and then unplug from the monitor several times over a week between deliberate laptop restarts.

The specific behaviour is that after plugging in the HDMI monitor (after several successes) both the laptop and external screen offset the display by 50% (the left side starts in the middle of the screen, the middle wraps to the left edge and continues back to the middle of the screen). The mouse pointer moves but I can't click anything, keyboard shortcuts do nothing. I can swap to a console terminal (ctrl alt F1) and restart the mdm (Mint Display Manager) and I can swap back to Mint and continue. After a shutdown I have a long freeze, then I have a 'missing hd' on the next boot.

This sounds far more like a Mint/display manager issue but exactly why it interferes with the SSD such that the BIOS does a recovery, and it only occurred since I switched to kernel 4.11, is a mystery. This behaviour might of course be caused by the same underlying problem or it might be coincidental. Possibly there's a HDMI/bus issue that's known to one of you?

I'm only asking in case this jogs memories of a related BIOS/nvme bug. If not, I'll only repost back here if I make any progress on this issue. Cheers, Ian.
Comment 24 Andy Lutomirski 2017-06-12 19:04:30 UTC
Ian, this doesn't sound like an nvme problem at all.   I'm guessing you have a graphics problem that's crashing the system in a way that annoys your BIOS.  My Dell laptop (different model than yours) has an obnoxious feature in which, if it thinks something went wrong, it goes through a counterproductive recovery process.  You can turn this off in the BIOS settings.
Comment 25 Amanieu d'Antras 2017-06-30 16:29:15 UTC
I'm encountering the same issue on XPS 15 9550, however I upgraded the PM951 SSD to a larger PM961 one. I tried disabling only the lowest power state by setting nvme_core.default_ps_max_latency=2000 (the latency numbers are different for PM961, see below), however this didn't resolve the issue and I was still getting controller resets.

I had to disable APST entirely by setting nvme_core.default_ps_max_latency=0 for it to work reliably. Unfortunately this causes a noticeable increase in power consumption of ~3-4W, which hurts battery life quite a bit.

$ sudo nvme id-ctrl /dev/nvme0n1
NVME Identify Controller:
vid     : 0x144d
ssvid   : 0x144d
sn      : S36CNX0J302022      
mn      : SAMSUNG MZVLW1T0HMLH-000H1              
fr      : CXY70H1Q
rab     : 2
ieee    : 002538
cmic    : 0
mdts    : 0
cntlid  : 2
ver     : 10200
rtd3r   : 186a0
rtd3e   : 4c4b40
oaes    : 0
oacs    : 0x7
acl     : 7
aerl    : 7
frmw    : 0x16
lpa     : 0x3
elpe    : 63
npss    : 4
avscc   : 0x1
apsta   : 0x1
wctemp  : 350
cctemp  : 353
mtfa    : 50
hmpre   : 0
hmmin   : 0
tnvmcap : 1024209543168
unvmcap : 0
rpmbs   : 0
sqes    : 0x66
cqes    : 0x44
nn      : 1
oncs    : 0x1f
fuses   : 0
fna     : 0
vwc     : 0x1
awun    : 255
awupf   : 0
nvscc   : 1
acwu    : 0
sgls    : 0
subnqn  : 
ps    0 : mp:7.60W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:6.00W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:5.10W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0400W non-operational enlat:210 exlat:1500 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0050W non-operational enlat:2200 exlat:6000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-
Comment 26 Joshua Bonnett 2018-01-15 11:34:33 UTC
The same bug seems to exist for the Dell xps 9560 and the pm961 it contains. I am running 4.14.13 currently but the bug was present on ubuntu 17.10's 4.11 and several versions in between. Like the poster above I have had to fully disable APST to achieve stability. 


While looking to add the 9560 to the quirk list in the kernel I saw that there is a check for a asus ryzen board and the 960 evo (which even has the same product id as the pm961) I am wondering if the intermittent nature of this bug means that it may be happening on ALL 960 evo/sm/pm 961's and we are only finding it piece meal? Does anyone have one of these drives and the a wide variety of hardware to test with?  

I can write the patch to add this combo to the quirks list, but there maybe deeper issues?
Comment 27 Marvin W 2018-02-03 04:16:17 UTC
I am not sure this is related but:
I see sudden controller death as well on Dell XPS 15 9550 using Samsung 960 EVO 1TB, even with nvme_core.default_ps_max_latency_us=0 set.

# nvme id-ctrl /dev/nvme0n1
NVME Identify Controller:
vid     : 0x144d
ssvid   : 0x144d
sn      : S3X3NF0JA01074T     
mn      : Samsung SSD 960 EVO 1TB                 
fr      : 3B7QCXE7
rab     : 2
ieee    : 002538
cmic    : 0
mdts    : 9
cntlid  : 2
ver     : 10200
rtd3r   : 7a120
rtd3e   : 4c4b40
oaes    : 0
oacs    : 0x7
acl     : 7
aerl    : 3
frmw    : 0x16
lpa     : 0x3
elpe    : 63
npss    : 4
avscc   : 0x1
apsta   : 0x1
wctemp  : 356
cctemp  : 358
mtfa    : 0
hmpre   : 0
hmmin   : 0
tnvmcap : 1000204886016
unvmcap : 0
rpmbs   : 0
sqes    : 0x66
cqes    : 0x44
nn      : 1
oncs    : 0x1f
fuses   : 0
fna     : 0x5
vwc     : 0x1
awun    : 255
awupf   : 0
nvscc   : 1
acwu    : 0
sgls    : 0
subnqn  : 
ps    0 : mp:6.04W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:5.09W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:4.08W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0400W non-operational enlat:210 exlat:1500 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0050W non-operational enlat:2200 exlat:6000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-
Comment 28 Ian Ozsvald 2018-11-30 08:00:07 UTC
(In reply to Marvin W from comment #27)
> I am not sure this is related but:
> I see sudden controller death as well on Dell XPS 15 9550 using Samsung 960
> EVO 1TB, even with nvme_core.default_ps_max_latency_us=0 set.

Hi Marvin - do you still see this bug? I've filed a very similar report for my 1TB PM951 NVMe in a Dell XPS 9550: https://bugzilla.kernel.org/show_bug.cgi?id=201811

Specifically on 4.19.0 using nvme_core.default_ps_max_latency_us=0 gets me almost-no read-only failures, except for the one that occurred yesterday and that was the first in over 10 days.

I wonder if you solved this issue? Using 4.9.91 I didn't have this issue.
Comment 29 Sebastian Jastrzebski 2018-12-17 23:57:38 UTC
I still see a similar problem with latest kernel on Fedora 29 on Lenovo T580 (latest BIOS 1.18) with Samsung 970 EVO drive. 

After few hours, especially with low battery condition or after resume from suspend, system starts output I/O errors and is not able to read from or write to the drive.

I tried the workaround by setting nvme_core.default_ps_max_latency_us=5500 but the issue still resurfaces.

> dmesg

Dec 17 18:25:20 skyline.origin kernel: EXT4-fs (dm-1): I/O error while writing superblock
Dec 17 18:25:20 skyline.origin kernel: EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detect>
Dec 17 18:25:20 skyline.origin kernel: EXT4-fs (dm-1): Remounting filesystem read-only
Dec 17 18:25:20 skyline.origin kernel: JBD2: Error -5 detected when updating journal superblock for dm->
Dec 17 18:25:20 skyline.origin kernel: Buffer I/O error on dev dm-1, logical block 0, lost sync page wr>
Dec 17 18:25:20 skyline.origin kernel: EXT4-fs (dm-1): I/O error while writing superblock
Dec 17 18:25:20 skyline.origin kernel: EXT4-fs error (device dm-1) in __ext4_new_inode:982: Journal has>
Dec 17 18:25:20 skyline.origin kernel: EXT4-fs error (device dm-1) in __ext4_new_inode:940: Journal has> 

> uname -a
Linux skyline.origin 4.19.8-300.fc29.x86_64 #1 SMP Mon Dec 10 15:23:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

> cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-4.19.8-300.fc29.x86_64 root=/dev/mapper/origin-root ro resume=/dev/mapper/origin-swap rd.luks.uuid=luks-e4dad99e-4f78-45ea-a01c-90f0aedbff5b rd.lvm.lv=origin/root rd.lvm.lv=origin/swap rhgb quiet nvme_core.default_ps_max_latency_us=5500

> sudo nvme id-ctrl /dev/nvme0n1
NVME Identify Controller:
vid       : 0x144d
ssvid     : 0x144d
sn        : S466NX0KA20403K     
mn        : Samsung SSD 970 EVO 500GB     
...
ps    0 : mp:6.20W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:4.30W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:2.10W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0400W non-operational enlat:210 exlat:1200 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0050W non-operational enlat:2000 exlat:8000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-
Comment 30 Sebastian Jastrzebski 2018-12-18 00:53:25 UTC
As a follow up to #29, I just had another drive failure with the following log output:

> dmesg

[ 4507.245989] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[ 4507.267985] nvme 0000:40:00.0: enabling device (0000 -> 0002)
[ 4507.268283] nvme nvme0: Removing after probe failure status: -19
[ 4507.281449] print_req_error: I/O error, dev nvme0n1, sector 201288984
[ 4507.281479] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 5641230 (offset 0 size 0 starting block 22749603)
[ 4507.281484] Buffer I/O error on device dm-1, logical block 22749603
[ 4507.281494] Buffer I/O error on device dm-1, logical block 22749604
[ 4507.281497] Buffer I/O error on device dm-1, logical block 22749605
[ 4507.281500] Buffer I/O error on device dm-1, logical block 22749606
[ 4507.281503] Buffer I/O error on device dm-1, logical block 22749607
[ 4507.281506] Buffer I/O error on device dm-1, logical block 22749608
[ 4507.281508] Buffer I/O error on device dm-1, logical block 22749609
[ 4507.281511] Buffer I/O error on device dm-1, logical block 22749610
[ 4507.281529] print_req_error: I/O error, dev nvme0n1, sector 201289304
[ 4507.281543] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 5641230 (offset 0 size 0 starting block 22749643)
[ 4507.281546] Buffer I/O error on device dm-1, logical block 22749643
[ 4507.281550] Buffer I/O error on device dm-1, logical block 22749644
[ 4507.281562] print_req_error: I/O error, dev nvme0n1, sector 201290776
[ 4507.281575] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 5641230 (offset 0 size 0 starting block 22749827)
[ 4507.281591] print_req_error: I/O error, dev nvme0n1, sector 201291096
[ 4507.281605] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 5641230 (offset 0 size 0 starting block 22749867)
[ 4507.281618] print_req_error: I/O error, dev nvme0n1, sector 201291736
[ 4507.281632] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 5641230 (offset 0 size 0 starting block 22749947)
[ 4507.281647] print_req_error: I/O error, dev nvme0n1, sector 201292760
[ 4507.281660] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 5641230 (offset 0 size 0 starting block 22750075)
[ 4507.281672] print_req_error: I/O error, dev nvme0n1, sector 201292952
[ 4507.281685] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 5641230 (offset 0 size 0 starting block 22750099)
[ 4507.281699] print_req_error: I/O error, dev nvme0n1, sector 201293784
[ 4507.281713] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 5641230 (offset 0 size 0 starting block 22750203)
[ 4507.281726] print_req_error: I/O error, dev nvme0n1, sector 201294232
[ 4507.281738] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 5641230 (offset 0 size 0 starting block 22750259)
[ 4507.281754] print_req_error: I/O error, dev nvme0n1, sector 58625640
[ 4507.282124] EXT4-fs warning (device dm-1): ext4_end_bio:323: I/O error 10 writing to inode 5641314 (offset 0 size 0 starting block 4916685)
[ 4507.282759] Aborting journal on device dm-1-8.
[ 4507.282773] EXT4-fs error (device dm-1) in ext4_free_blocks:4942: Journal has aborted
[ 4507.282778] Buffer I/O error on dev dm-1, logical block 15, lost async page write
[ 4507.282793] Buffer I/O error on dev dm-1, logical block 32, lost async page write
[ 4507.282808] Buffer I/O error on dev dm-1, logical block 22544389, lost async page write
[ 4507.282820] Buffer I/O error on dev dm-1, logical block 22544401, lost async page write
[ 4507.282828] Buffer I/O error on dev dm-1, logical block 22544402, lost async page write
[ 4507.282837] Buffer I/O error on dev dm-1, logical block 22544693, lost async page write
[ 4507.282850] Buffer I/O error on dev dm-1, logical block 22544733, lost async page write
[ 4507.282861] Buffer I/O error on dev dm-1, logical block 22544736, lost async page write
[ 4507.282866] Buffer I/O error on dev dm-1, logical block 33587200, lost sync page write
[ 4507.282878] Buffer I/O error on dev dm-1, logical block 22544742, lost async page write
[ 4507.282897] JBD2: Error -5 detected when updating journal superblock for dm-1-8.
[ 4507.282917] EXT4-fs (dm-1): I/O error while writing superblock
[ 4507.282929] EXT4-fs error (device dm-1) in ext4_do_update_inode:5310: Journal has aborted
[ 4507.282971] EXT4-fs error (device dm-1) in ext4_reserve_inode_write:5846: Journal has aborted
[ 4507.283027] EXT4-fs (dm-1): Delayed block allocation failed for inode 7344491 at logical offset 0 with max blocks 1 with error 30
[ 4507.283034] EXT4-fs (dm-1): This should not happen!! Data will be lost

[ 4507.283044] EXT4-fs error (device dm-1) in ext4_writepages:2877: Journal has aborted
[ 4507.283070] EXT4-fs (dm-1): I/O error while writing superblock
[ 4507.283078] EXT4-fs error (device dm-1) in ext4_do_update_inode:5310: Journal has aborted
[ 4507.283176] EXT4-fs (dm-1): previous I/O error to superblock detected
[ 4507.283204] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[ 4507.283212] EXT4-fs (dm-1): Remounting filesystem read-only
[ 4507.283309] EXT4-fs (dm-1): I/O error while writing superblock
[ 4507.283423] EXT4-fs (dm-1): I/O error while writing superblock
[ 4507.283433] EXT4-fs error (device dm-1) in ext4_evict_inode:258: Journal has aborted
[ 4507.283437] EXT4-fs error (device dm-1) in ext4_ext_remove_space:3061: Journal has aborted
[ 4507.283508] EXT4-fs (dm-1): I/O error while writing superblock
[ 4507.283514] EXT4-fs (dm-1): previous I/O error to superblock detected
[ 4507.283597] EXT4-fs error (device dm-1) in ext4_orphan_del:2901: Journal has aborted
[ 4507.283716] EXT4-fs error (device dm-1) in ext4_do_update_inode:5310: Journal has aborted
[ 4507.283961] JBD2: Detected IO errors while flushing file data on dm-1-8
[ 4507.294579] nvme nvme0: failed to set APST feature (-19)
[ 4507.405329] EXT4-fs error (device dm-1): ext4_find_entry:1439: inode #5505093: comm gnome-shell: reading directory lblock 0
[ 4640.444653] systemd-journald[847]: Failed to write entry (21 items, 635 bytes), ignoring: Read-only file system
[ 4640.444688] systemd-journald[847]: Failed to write entry (21 items, 740 bytes), ignoring: Read-only file system
[ 4640.444741] systemd-journald[847]: Failed to write entry (21 items, 635 bytes), ignoring: Read-only file system
[ 4640.444776] systemd-journald[847]: Failed to write entry (21 items, 740 bytes), ignoring: Read-only file system
[ 4640.444819] systemd-journald[847]: Failed to write entry (21 items, 635 bytes), ignoring: Read-only file system
[ 4640.444900] systemd-journald[847]: Failed to write entry (21 items, 740 bytes), ignoring: Read-only file system
[ 4640.444933] systemd-journald[847]: Failed to write entry (21 items, 635 bytes), ignoring: Read-only file system
[ 4640.444964] systemd-journald[847]: Failed to write entry (21 items, 740 bytes), ignoring: Read-only file system
[ 4641.345610] systemd-journald[847]: Failed to write entry (21 items, 635 bytes), ignoring: Read-only file system
[ 4641.345782] systemd-journald[847]: Failed to write entry (21 items, 740 bytes), ignoring: Read-only file system
Comment 31 Ian Ozsvald 2018-12-18 11:23:44 UTC
Hi Sebastian. You might want to try to disable APST completely with `nvme_core.default_ps_max_latency_us=0`. For me this reduced the frequency of the read-only errors with 4.19.0 and 4.19.7 to once per week - but it didn't remove it. I've reverted to 4.9.91 and I no longer have these errors.

Use `sudo nvme get-feature -f 0x0c -H /dev/nvme0 | grep APSTE` and you should then see:
`	Autonomous Power State Transition Enable (APSTE): Disabled`
if you've set the max latency to 0 (on a reboot).

I actually added that line to GRUB.

I've had no feedback in my post where I've noted the few things I tried with my 1TB PM951 and 4.19.0 and 4.19.7: https://bugzilla.kernel.org/show_bug.cgi?id=201811
Comment 32 Sebastian Jastrzebski 2018-12-18 11:56:27 UTC
Thanks Ian. I have switched to nvme_core.default_ps_max_latency_us=0 after the last crash and left laptop running overnight. Unfortunately this morning I still ran into drive I/O errors so had to do a hard reboot. APST was disabled. 

Is there anything else I can try? 

> sudo nvme get-feature -f 0x0c -H /dev/nvme0
[sudo] password for raytracer: 
get-feature:0xc (Autonomous Power State Transition), Current value:00000000
	Autonomous Power State Transition Enable (APSTE): Disabled
	Auto PST Entries	.................
	Entry[ 0]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[ 1]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
Comment 33 Ian Ozsvald 2018-12-18 12:48:58 UTC
I don't know of anything else - at this point (with 2 read-only failures with APST disabled on 4.19.x) I reverted to 4.9.91.

Sometime after 4.9 I remember reading about NVME Autonomous Power State Transition code updates. I see a reference to this for 4.11:
https://kernelnewbies.org/Linux_4.11
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c5552fde102fcc3f2cf9e502b8ac90e3500d8fdf
and this is when I started to note others with Samsung quirk issues. 

Some of these had quirks added to the kernel e.g. look for "PM951 NVMe SAMSUNG 512GB" here for 4.11 and 4.12: https://lore.kernel.org/patchwork/patch/781598/ and I'll note that this is the same model but different size to my drive (I have the 1TB PM951).

After this I didn't see others having repeat issues with the same units. I _might_ be a 1TB specific issue, somehow connected to the APST code updates, that affects your 970 and my 950 as we're both on 1TB drives? i.e. this might be a subtly different but related bug. I also suspect fewer of us have the 1TB drive so it'll crop up less frequently.

Unfortunately nobody has commented on my other kernel.org bug report so I've stepped back from the current kernel.

One thing you might try is to run whatever was the latest 4.10 for maybe a week, then 4.11, to see if one of those introduces the read-only problem. That'd help us isolate where things started to break.

I'm assuming that as others update from older kernels we'll see this 1TB issue affecting more people. We'd want to keep an eye on the launchpad bug report too as others note their issues e.g. https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184/comments/107

I have tried other kernels over the last year but after 4.9.91 I had other issues - e.g. >4.9.91 I might lose wifi (firmware issues) and some kernel lines wouldn't even boot for me. I was happy when it looked like I could upgrade to 4.19 as I figured I could stay current for a while, ho hum.
Comment 34 Sebastian Jastrzebski 2018-12-18 14:00:29 UTC
I don't think the issue is limited to 1TB drives. I had regularly experienced it with Samsung 256GB NVME drive that came with Lenovo T580. I thought these issues were related to one bad drive, so last week replaced it with brand new Samsung 970 Evo 512GB NVME drive. 

However, the problems with sudden controller crashes remain (even with APST disabled). This is with latest and greatest kernel/bios/firmware. 

Unfortunately downgrading kernel isn't an option for me as I use Thunderbolt 3 heavily which doesn't have great support on older kernels.
Comment 35 Jens Axboe 2018-12-18 15:15:14 UTC
(In reply to Ian Ozsvald from comment #33)
> I don't know of anything else - at this point (with 2 read-only failures
> with APST disabled on 4.19.x) I reverted to 4.9.91.

We seem to know that the drive exhibit the issue with 4.19 and not with 4.9. Has anyone tried to run kernels in between? My laptop runs with this drive:

SAMSUNG MZVLB1T0HALR-000L7

and I've never had any issues, always running bleeding edge on it (current -rc + changes queued up for the next kernel). It'd be interesting to try and get more data points on versions that work / don't work.

I may just have to go and get one of the problematic drives and see if I can reproduce.
Comment 36 Ian Ozsvald 2018-12-18 15:48:40 UTC
Hi Jens. Looking back at my notes for 6 months I see:

4.9.8, 4.9.45, 4.9.66, 4.9.91 - all fine, no HD issues
4.9.119 had a boot fail with "linux-headers-4.9.119-0409119-generic depends on libssl1.1 (>= 1.1.0);" which I haven't persued, also iwlwifi failed - this did boot but since wifi didn't work, I didn't do any particular testing so I can't confirm if the nvme bug exists here
4.9.135 didn't get beyond "loading initial ramdisk" on boot, didn't diagnose any further
4.10 didn't try
4.11.12, 4.11.7, 4.11.0 each had "issues with SSD" - again the "long shutdown issue" noted below
4.12.4 didn't try "intel wifi not supported for Intel 8260" - unrelated issue
4.13 didn't try
4.14.1 had a "long shutdown issue" - this might be a different bug, these "long shutdowns" take 30 seconds to complete a shutdown, on the next boot "hard drive not present", on a 5 second power off->power on the hard drive is back and the next boot is successful. I had lots of these on various kernels (maybe 20+ experiences), very annoying, presumably a HD issue but may or may not be this power-saving bug
4.15.0 had a "long shutdown issue" on each shutdown 
4.16-4.18 I slightly gave up and didn't try these, hoping that the upcoming 4.19 LTS would solve other issues

I'll note that I didn't note above that I had video issues with my NVIDIA card and a USB-C DisplayPort cable (I kept falling back on, and continue to use, HDMI). Between failing nvme, graphics issues and wifi issues I kept searching for a common denominator that'd just let me work.

I've *very* happy to try good ideas if it helps us narrow down what's going on.
Comment 37 Sebastian Jastrzebski 2018-12-18 23:54:43 UTC
I have been able to reproduce the issue on the following kernels:

4.18.16
4.19.5
4.19.8

One thing that I just realized after Jens said that he never had issues with his Samsung drive is that I started noticing all these failures around the time I upgraded to Fedora 29 (4.19 kernel) and the latest ThinkPad T580 BIOS (1.18). I don't recall having/noticing these issues with Fedora 28 and an older BIOS. 

Since changing drives and disabling APST didn't fix my issue and downgrading to 4.18 didn't do it, there is one variable I haven't tried yet - downgrading BIOS.

I'll try that tonight and report back tomorrow.
Comment 38 Ian Ozsvald 2018-12-19 09:49:10 UTC
For completeness, I'm using "Dell XPS 15 9550 1.2.19 System BIOS" from January 2017. Back in May 2017 (the last time I was looking at the BIOS) the later versions all exhibited some trouble with Linux, 1.2.19 was the known-good BIOS so I stuck with it. There's a bigger range of options now.

I wonder if anyone coming through here could report which BIOS they have on a Dell 9550, their drive and whether they do or don't have problems?
Comment 39 Sebastian Jastrzebski 2018-12-19 15:13:27 UTC
Just a quick update as the Lenovo BIOS downgrade path looks promising so far.

After 16 hours the system is still up and running. Typically I would run into multiple failures by now especially after coming out of suspend state overnight.

I'm not sure how BIOS interacts with NVME devices attached to the system once the kernel takes over. The only thing that comes to my mind is that some ACPI tables set up by the BIOS get messed up causing havoc in the rest of the system.

For the sake of clarify, the change applied was to downgrade Lenovo T580 BIOS from version 1.18 to 1.16.
Comment 40 Sebastian Jastrzebski 2018-12-20 12:57:58 UTC
It appears that BIOS may have been the culprit in my case. The system has been stable for the past 24 hours. 

I still run with the nvme_core.default_ps_max_latency_us=0  setting but I am bit hesitant to remove it as I'm a big fan of the newly reacquired system stability.

In any case, Lenovo's ThinkPad T580 BIOS version 1.18 is officially on the not so favorable list.

Thanks Ian and Jens for the support and guidance. Cheers!
Comment 41 Ian Ozsvald 2018-12-20 14:37:48 UTC
Sebastian just note that I had failures with 4.19.0 and 4.19.7 on the order of once per week with APST disabled, so 24 hours may not be long enough to rule this out. 
Looking at BIOS updates beyond my 1.2.19 I see that some NVME updates were released so I'm now mulling getting my BIOS updated to the latest to see what happens. I'll make a decision on this next week.
Comment 42 Sebastian Jastrzebski 2018-12-20 21:45:42 UTC
I'll keep testing the current setup for a few more days and next I'll try running without 'nvme_core.default_ps_max_latency_us=0' to see if the issue is gone. 

So far so good though, before I would not be able to run for more than a few hours without crashing the drive. 

@Ian If you are adventurous (insert caution here), I would definitely give BIOS update a shot. Alternatively you could wait a week or two and see how I do with my current "fix" to see if it makes sense to mess with that in the first place.
Comment 43 Ian Ozsvald 2018-12-31 12:40:35 UTC
24 hours of success - I've upgraded my BIOS from 1.2.19 (Jan 2017) to 1.9.0 (Oct 2018). In 24 hours using kernel 4.19.8 I've not had a read-only failure. Previously with 1.2.19 and kernel 4.19.8 I'd expect a read-only failure within a couple of hours.

I've suspended several times and used my HDMI monitor, everything seems to be working fine. Tentatively I'd say that the BIOS update has fixed the issue. 

At least two intervening BIOS updates (which I hadn't applied) had NVME updates mentioned in their notes.

I'm still using 'nvme_core.default_ps_max_latency_us=0' and will keep this for at least a week or so, to give me confidence that this configuration is stable.

Just in case it is relevant I'll note that `fwupdmgr get-updates` shows that `Thunderbolt NVM for Xps Notebook 9550` can be upgraded from the current v12 to v16. I have used a USB-C to DisplayPort connector (which is physically the same as Thunderbolt 3), but I believe by using USB C I don't touch the Thunderbolt driver (anyone disagree?). I'll update this in a week or so, again I don't want to change anything else until I know that I trust my system.

Sebastian - how are things for you?
Comment 44 Sebastian Jastrzebski 2018-12-31 13:50:18 UTC
I think my issue is ultimately resolved but it required a motherboard replacement. 

After I did the BIOS downgrade, things improved slightly before they got worse. One day the drive failed and the system was no longer recognizing the drive or any other drive I put in the system. Lenovo diagnosed the issue to be a bad motherboard and replaced it.

After motherboard replacement, the drive has been working great. It's been 4 days without any crashes.
Comment 45 Andy Lutomirski 2018-12-31 20:16:35 UTC
After reading a bunch more of this thread, I'm not at all convinced that this is an APST problem.  It sounds like you're seeing failures with APST off and you're seeing more failures with APST on.  So APST is probably just changing the frequency with which the problem is triggered.

Off the top of my head, you might want to fiddle with your PCIe ASPM settings to see if there's any effect.  Do you have pcie_aspm=force set?

For what it's worth, the one APST-related failure that was fully root-caused that I know of turned out to be a design issue in the motherboard that caused ASPM exit to fail sometimes under certain conditions.  Enabling APST makes deep ASPM states much more likely.
Comment 46 Ian Ozsvald 2019-01-02 18:52:04 UTC
I'm at 24 hours now using 4.19.8 with BIOS 1.9.0 (the new BIOS for my XPS 9550), with 'nvme_core.default_ps_max_latency_us=0' _disabled_ (i.e. removed from GRUB) and I've had no read-only failures.

Previously I'd have had a failure within 30 minutes.

Andy - I think you're right, I suspect my out of date BIOS was the root cause of my issue.

To confirm - APST is enabled and I've had no failures:
$ sudo nvme get-feature -f 0x0c -H /dev/nvme0
get-feature:0xc (Autonomous Power State Transition), Current value:0x000001
	Autonomous Power State Transition Enable (APSTE): Enabled

Tentatively I think I can say that the BIOS update might have solved this. I'll report back in a few days.
Comment 47 Ian Ozsvald 2019-01-07 15:38:36 UTC
After 5 days (without a reboot) using 4.19.8 with the new BIOS (1.9.0), either leaving it on or briefly using Suspend, I'm happy that the BIOS upgrade has solved the read-only problem. The machine is now stable. 

However - I've rebooted and got hit by the old dreaded "no bootable medium found". The solution, as before (with the old BIOS), was a hard power off (5 seconds on the power button), after that on a fresh boot the harddrive was magically present again. 

On a fresh boot I then asked for a restart after logging in, on that reboot I also got "no bootable medium found". Again a hard power off and power cycle solves the issue.

I'm going to run as-is for a few days, then may try adding 'nvme_core.default_ps_max_latency_us=0' back to GRUB as a test as 'losing' the primary hard drive on a reboot doesn't make me happy. I'm open to any other ideas..
Comment 48 Ian Ozsvald 2019-02-05 11:12:42 UTC
To update the above - I continue with 4.19.8 on BIOS 1.9.0. On one reboot I had the "no bootable medium found", on several other reboots I've had no issues.

I didn't add 'nvme_core.default_ps_max_latency_us=0' back to GRUB as I think the BIOS update has solved most of the issues.

I did note that my Thunderbolt driver was out of date, I had v12 and v16 was available. I've just updated it today. Possibly there was another historic bug that somehow interfered with the system due to this, I'll continue to monitor it.
Comment 49 Sergey Slizovskiy 2019-05-23 16:42:03 UTC
Hello,  I want to join the discussion.  I have Dell Latitude E7470  with LiteOn 500 Gb SSd,  and I also contantly experience the problem you discuss.   I have managed to temporary cure it by fresh install of Mint 19,  but it appeared again and again after even minor updates.  I used TimeShift to go back and it helped several times,  but not this time.  I have just updated my BIOS to 1.25 version, but, still, no luck.   I will now check if Ubuntu 18.04 would run ok on  4.9.91 kernel.
Comment 50 Sergey Slizovskiy 2019-05-24 14:26:30 UTC
So far, I am sticking to the solution of using 4.9.91 kernel.  Nothing else has worked for me.  It's very confusing. Looks like the kernel team has fixed Samsung, but broken something else in the kernel... forever.
Comment 51 Sergey Slizovskiy 2019-05-26 19:34:19 UTC
And, I am still getting the problem after 3 days...
Comment 52 Ian Ozsvald 2019-05-27 08:16:20 UTC
I'm attaching this for the record - my bug (late 2018 to Feb 2019) went away after I upgraded my Dell 9550 BIOS to 1.9.0 (and possibly by upgrading Thunderbolt - see my posts above). I'm running kernel 4.19.8. I post this here just for the record, in case it helps others. My earlier bug reports are above in this thread.

$ sudo nvme id-ctrl /dev/nvme0
[sudo] password for ian:        
NVME Identify Controller:
vid     : 0x144d
ssvid   : 0x144d
sn      :       S2FZNYAG801690
mn      : PM951 NVMe SAMSUNG 1024GB               
fr      : BXV76D0Q
rab     : 2
ieee    : 002538
...
ps    0 : mp:6.00W operational enlat:5 exlat:5 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:4.20W operational enlat:30 exlat:30 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:3.10W operational enlat:100 exlat:100 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0700W non-operational enlat:500 exlat:5000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0050W non-operational enlat:2000 exlat:22000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-

$ uname -a
Linux ian-XPS-15-9550 4.19.8-041908-generic #201812080831 SMP Sat Dec 8 13:34:18 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

System BIOS was upgraded to 1.9.0 - I believe this is the thing that fixed the NVMe issues.

Ian.
Comment 53 Sergey Slizovskiy 2019-05-27 12:10:35 UTC
I have no thunderbolt on Dell Latitude E7470 and have updated to latest BIOS.  Still, even the old kernel have not solved my problem.  Setting NVME kernel option to zero does not change anything.  I am already tired of it and writing this message from Windows 7. 
Thanks,
Sergey
Comment 54 Sergey Slizovskiy 2019-05-28 10:05:36 UTC
Just an update:  it might be the hardware problem of contact in  SSD connection to motherboard. I have got the same error in Windows. Now, I have cleaned the contacts with propanol. Let's see if it helps.
Comment 55 Sergey Slizovskiy 2019-06-02 11:28:05 UTC
To confirm, my issue was resolved,  now I am running fine on the latest kernel.  The lesson is:  prior to trying to replace SSD or Motherboard,  try to clean the connection thereof. In my case, it seems like I need to clean this connection ones a year....  Shame at Dell build quality. 
Best wishes,
Sergey
Comment 56 jckeerthan 2019-10-14 13:16:54 UTC
Adding another datapoint for others running into this issue:

I'm on a desktop with a samsung 970 Evo 2TB nvme ssd: 01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983

I started running into this issue out of the blue. After re-seating the nvme ssd, the issue seems to have gone away - I will update if it starts to happen again.
Comment 57 Vlad Burlik 2019-12-04 21:56:16 UTC
Same issue here.
Samsung 960 Evo Series (OEM) 1TB NVMe M.2 NGFF SSD PCIe 3.0 x4 80mm - (PM961) 
SAMSUNG MZVLW1T0HMLH-00000
S/N: S2U3NX0HC05293
FW: CXY7301Q
Comment 58 juan 2019-12-05 14:01:10 UTC
I can confirm this is happening with 5.3.0-22-generic

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1852479

+-1d.0-[03]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983

My laptop just crashes randomly. Disbling AER did not solve the problem.

Linux pop-os 5.3.0-22-generic #24+system76~1573659475~19.10~26b2022-Ubuntu SMP Wed Nov 13 20:0 x86_64 x86_64 x86_64 GNU/Linux


NVME Identify Controller:
vid       : 0x144d
ssvid     : 0x144d
sn        : S444NY0K600040      
mn        : SAMSUNG MZVLB256HAHQ-00000              
fr        : EXD7101Q
rab       : 2
ieee      : 002538
...
ps    0 : mp:7.02W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:6.30W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:3.50W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0760W non-operational enlat:210 exlat:1200 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0050W non-operational enlat:2000 exlat:8000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-
Comment 59 Hans L 2020-02-13 03:17:50 UTC
I'm not sure if I have the exact same problem as OP, but I've been struggling with NVMe stability issues ever since I put together this Ryzen desktop computer around July 2019.  My problems were originally on a "Crucial P1 500GB" NVMe drive, but I just swapped over to a new "Samsung 970 EVO 1TB" by cloning data, and I'm still seeing similar issues.

General Hardware Specs:
  Motherboard: ASUS Prime B450 Plus motherboard running BIOS rev. "2008" (also had issues on rev. 1804)
  CPU: AMD Ryzen 2700X
  GPU: NVIDIA Corporation TU116 [GeForce GTX 1660]
  Current drive: Samsung 970 EVO 1TB
    Model: MZ-V7E1T0BW
    Controller: SM981/PM981
    Firmware: 2B2QEXE7
  Previous drive (with basically same problems): Crucial P1 500GB
    Model: CT500P1SSD8
    Controller: SM2263EN
    Firmware: P3CR013

The system can be stable for days or weeks on end, as long as I don't put it under particularly heavy sustained load (CPU mainly?). I have VERY repeatable results of AER errors showing up in dmesg just seconds after starting a specific workload: "mprime" executable (Linux version of "Prime95" from mersenne.org), specifically computing "P-1" aka "PM1" type of workunits.

I'd posted my issues on various forums but haven't been able to solve this. So I had basically gave up on running "mprime" on this computer and mostly forgot about the problems for a few months, until recently I needed to use another application which seems is triggering these same type of errors again (Intel "Quartus Prime Lite" EDA tools, for FPGA development)

I initially had tried disabling ASPM via kernel boot command line "pcie_aspm=off", as a recommended "solution" to my kernel logs being filled with spam from NVIDIA gpu.  Errors involving: "[12] Timeout", "[ 6] BadTLP", and "[ 7] BadDLLP".  Doing this got rid of those messages from GPU, but caused the NVMe to go into some unrecoverable state, at which point it would try to remount the drive as read only (also would show "BTRFS" errors when i'm only using EXT4?)

Here is a snippet of kernel log from when I had ASUS 1804 BIOS, and Crucial P1 500GB SSD, with "pcie_aspm=off", where it was unable to reset NVMe:
[  989.409598] perf: interrupt took too long (4979 > 4912), lowering kernel.perf_event_max_sample_rate to 40000
[ 1195.031765] fuse: init (API version 7.31)
[ 1327.328770] perf: interrupt took too long (6268 > 6223), lowering kernel.perf_event_max_sample_rate to 31750
[ 2238.284260] perf: interrupt took too long (7846 > 7835), lowering kernel.perf_event_max_sample_rate to 25250
[ 9117.462381] perf: interrupt took too long (9831 > 9807), lowering kernel.perf_event_max_sample_rate to 20250
[ 9261.476036] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[ 9261.603999] pci_raw_set_power_state: 19 callbacks suppressed
[ 9261.604009] nvme 0000:01:00.0: Refused to change power state, currently in D3
[ 9261.604430] nvme nvme0: Removing after probe failure status: -19
[ 9261.632241] print_req_error: I/O error, dev nvme0n1, sector 15247304 flags 100001
[ 9261.632255] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
[ 9261.729511] nvme nvme0: failed to set APST feature (-19)
[ 9261.739582] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
[ 9261.739591] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
[ 9261.739595] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
[ 9261.756670] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 1, flush 0, corrupt 0, gen 0
[ 9261.756951] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 2, flush 0, corrupt 0, gen 0
[ 9261.758061] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 3, flush 0, corrupt 0, gen 0
[ 9261.758368] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 4, flush 0, corrupt 0, gen 0
[ 9261.759112] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 5, flush 0, corrupt 0, gen 0
[ 9261.759138] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 6, flush 0, corrupt 0, gen 0
[ 9262.276359] Core dump to |/bin/false pipe failed
[ 9262.336595] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[ 9262.336817] caller _nv000939rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs
[ 9262.975980] snd_hda_codec_hdmi hdaudioC0D0: HDMI: invalid ELD data byte 62
[ 9263.012987] Core dump to |/bin/false pipe failed
[ 9263.015801] Core dump to |/bin/false pipe failed

After re-enabling ASPM kernel boot parameter, and upgrading BIOS to latest "2008" revision I got messages like this (still on Crucial P1):
[ 3203.674000] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[ 3203.674052] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 3203.674076] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00001000/00006000
[ 3203.674081] pcieport 0000:00:03.1: AER:    [12] Timeout
[ 3205.713683] pcieport 0000:00:01.1: AER: Uncorrected (Fatal) error received: 0000:01:00.0
[ 3205.713694] nvme 0000:01:00.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent ID)
[ 3205.713709] nvme nvme0: frozen state error detected, reset controller
[ 3206.820214] pcieport 0000:00:01.1: AER: Root Port link has been reset
[ 3206.820265] nvme nvme0: restart after slot reset
[ 3206.963050] nvme nvme0: 15/0/0 default/read/poll queues
[ 3206.963296] pcieport 0000:00:01.1: AER: Device recovery successful
[ 3207.692447] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[ 3207.692464] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 3207.692470] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00001000/00006000
[ 3207.692472] pcieport 0000:00:03.1: AER:    [12] Timeout
[ 3208.608352] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[ 3208.608370] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[ 3208.608378] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00000040/00006000
[ 3208.608381] pcieport 0000:00:03.1: AER:    [ 6] BadTLP
[ 3210.904689] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[ 3210.904707] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 3210.904716] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00001000/00006000
[ 3210.904719] pcieport 0000:00:03.1: AER:    [12] Timeout
[ 3211.260459] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[ 3211.260493] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[ 3211.260514] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00000040/00006000
[ 3211.260519] pcieport 0000:00:03.1: AER:    [ 6] BadTLP

At some point I also tried "nvme_core.default_ps_max_latency_us=0" while on Crucial drive, which at best may have reduced the frequency of the problem occuring, but still eventually had crashing controller issues under loads.

I suspected the Crucial drive had some unresolved controller firmware bugs, so I thought upgrading to a different brand with a new Samsung 970 EVO would help.
I used clonezilla to copy the partition data over and grow to fit the new drive.  Geting the clone to work without errors is a whole story in itself but I'll try to keep it short.
Having only one M.2 slot on my motherboard, I was using a NVMe to USB 3.1 Gen 2 (up to 10Gbps) adapter device by mfgr "SSK". 
It failed to clone multiple times(some errors about "UAS" iirc) when plugged into my motherboard's USB 3.1 Gen 2 ports.
Then I try swapped the USB adapter to a different port, supporting only USB 3.1 Gen 1 (up to 5Gbps), and that suceeded with 0 errors on the first try.

So after booting up the new Samsung drive, I tried my high load mprime test and saw the same types of errors:
(The high load process wasn't actually started until around 350s.  No idea if first 2 lines are relevant or a problem in any way, but I'm including those "errors" just in case.)
[  194.587710] ucsi_ccg 0-0008: failed to reset PPM!
[  194.587734] ucsi_ccg 0-0008: PPM init failed (-110)
 ...
[  357.259829] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[  357.259847] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[  357.259855] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00001000/00006000
[  357.259857] pcieport 0000:00:03.1: AER:    [12] Timeout
[  357.866075] pcieport 0000:00:01.1: AER: Uncorrected (Fatal) error received: 0000:01:00.0
[  357.866098] nvme 0000:01:00.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent ID)
[  357.866124] nvme nvme0: frozen state error detected, reset controller
[  358.262744] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[  358.262765] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[  358.262772] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00001000/00006000
[  358.262775] pcieport 0000:00:03.1: AER:    [12] Timeout
[  358.439057] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[  358.439076] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[  358.439084] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00001000/00006000
[  358.439086] pcieport 0000:00:03.1: AER:    [12] Timeout
[  358.506164] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[  358.506182] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[  358.506194] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00000040/00006000
[  358.506196] pcieport 0000:00:03.1: AER:    [ 6] BadTLP
[  358.748596] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[  358.748606] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[  358.748611] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00001000/00006000
[  358.748612] pcieport 0000:00:03.1: AER:    [12] Timeout
[  358.971108] pcieport 0000:00:01.1: AER: Root Port link has been reset
[  358.971133] nvme nvme0: restart after slot reset
[  359.231681] nvme nvme0: Shutdown timeout set to 8 seconds
[  359.270538] nvme nvme0: 32/0/0 default/read/poll queues
[  359.270843] pcieport 0000:00:01.1: AER: Device recovery successful
[  359.355805] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[  359.355825] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[  359.355835] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00001000/00006000
[  359.355838] pcieport 0000:00:03.1: AER:    [12] Timeout
...

More or less identical afaict (besides Samsung having a different queue depth).

So these fatal errors are able to be reset/recovered from, but this is still very concerning to me as I don't know if constantly resetting the NVMe controller multiple times per minute will lead to data corruption?

At this point I still have no idea what is going on and the problem might be any combination of:
1) Linux kernel bug
2) BIOS revision bug from ASUS/AMD  (flaw in AMD 400 series PCIe bridge controller?)
3) Specific BIOS settings are not configured right by me?
4) Misbehaving device firmware (NVMe controller and/or GPU causing some kind of PCIe bus conflicts?)
5) Motherboard hardware defect, bad physical connection in some way? (based on other's reports that re-seating NVMe solved their issue)
6) Power or voltages reaching device are fluctuating out of spec? (power issues suspected since only occurs under heavy load).  I don't have a scope to check this though.

Any advice would be greatly appreciated.

I don't know what combination of kernel boot settings(ASPM, AER, APST, nvme_core latency, etc.) and/or BIOS settings I should be trying anymore (or different BIOS revisions), as I don't understand how any of these interact and there's just too many combinations to try all of them exhaustively.
Not kernel boot params can override BIOS settings or if they need to be synced to compatible settings, but ASPM setting in BIOS gives me 3 options:  "Disabled", "Auto", or "Force L0s"
I don't know off the top of my head if BIOS also had any APST, AER, or other related settings, but I can check if asked.

Below I've included a bunch more various general info and diagnostic commands I've run on my latest configuration with Samsung drive installed.
Let me know if there's any other command output or info I can provide to help:

$ lsb_release -d
Description:    Linux Mint 19.3 Tricia

$ uname -r
5.3.0-28-generic

$ ls /sys/class/nvme/nvme0/power
async  autosuspend_delay_ms  control  pm_qos_latency_tolerance_us  runtime_active_kids  runtime_active_time  runtime_enabled  runtime_status  runtime_suspended_time  runtime_usage

$ sudo nvme fw-log /dev/nvme0
Firmware Log for device:nvme0
afi  : 0x1
frs1 : 0x3745584551324232 (2B2QEXE7)

$ systool -vm nvme_core
Module = "nvme_core"

  Attributes:
    coresize            = "102400"
    initsize            = "0"
    initstate           = "live"
    refcnt              = "5"
    srcversion          = "B43C1A5A4BC80B50DFB88F2"
    taint               = ""
    uevent              = <store method only>
    version             = "1.0"

  Parameters:
    admin_timeout       = "60"
    default_ps_max_latency_us= "100000"
    force_apst          = "N"
    io_timeout          = "30"
    max_retries         = "5"
    multipath           = "Y"
    shutdown_timeout    = "5"
    streams             = "N"

  Sections:

$ sudo nvme id-ctrl /dev/nvme0
NVME Identify Controller:
vid     : 0x144d
ssvid   : 0x144d
sn      : S5H9NC0MC24244K     
mn      : Samsung SSD 970 EVO 1TB                 
fr      : 2B2QEXE7
rab     : 2
ieee    : 002538
cmic    : 0
mdts    : 9
cntlid  : 4
ver     : 10300
rtd3r   : 30d40
rtd3e   : 7a1200
oaes    : 0
ctratt  : 0
oacs    : 0x17
acl     : 7
aerl    : 3
frmw    : 0x16
lpa     : 0x3
elpe    : 63
npss    : 4
avscc   : 0x1
apsta   : 0x1
wctemp  : 358
cctemp  : 358
mtfa    : 0
hmpre   : 0
hmmin   : 0
tnvmcap : 1000204886016
unvmcap : 0
rpmbs   : 0
edstt   : 35
dsto    : 0
fwug    : 0
kas     : 0
hctma   : 0x1
mntmt   : 356
mxtmt   : 358
sanicap : 0
hmminds : 0
hmmaxd  : 0
sqes    : 0x66
cqes    : 0x44
maxcmd  : 0
nn      : 1
oncs    : 0x5f
fuses   : 0
fna     : 0x5
vwc     : 0x1
awun    : 1023
awupf   : 0
nvscc   : 1
acwu    : 0
sgls    : 0
subnqn  : 
ioccsz  : 0
iorcsz  : 0
icdoff  : 0
ctrattr : 0
msdbd   : 0
ps    0 : mp:6.20W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:4.30W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:2.10W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0400W non-operational enlat:210 exlat:1200 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0050W non-operational enlat:2000 exlat:8000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-

$ sudo lspci -vvv -s 00:01:00.0
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981 (prog-if 02 [NVM Express])
	Subsystem: Samsung Electronics Co Ltd Device a801
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 63
	NUMA node: 0
	Region 0: Memory at f6800000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr+ UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L0s unlimited, L1 <64us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [b0] MSI-X: Enable+ Count=33 Masked-
		Vector table: BAR=0 offset=00003000
		PBA: BAR=0 offset=00002000
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [148 v1] Device Serial Number 00-00-00-00-00-00-00-00
	Capabilities: [158 v1] Power Budgeting <?>
	Capabilities: [168 v1] #19
	Capabilities: [188 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [190 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=0ns
		L1SubCtl2: T_PwrOn=10us
	Kernel driver in use: nvme
	Kernel modules: nvme


$ lspci -tv
-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
           +-00.2  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit
           +-01.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-01.1-[01]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
           +-01.3-[02-08]--+-00.0  Advanced Micro Devices, Inc. [AMD] 400 Series Chipset USB 3.1 XHCI Controller
           |               +-00.1  Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller
           |               \-00.2-[03-08]--+-00.0-[04]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
           |                               +-01.0-[05]--
           |                               +-04.0-[06]--
           |                               +-06.0-[07]--
           |                               \-07.0-[08]--
           +-02.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-03.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-03.1-[09]--+-00.0  NVIDIA Corporation TU116 [GeForce GTX 1660]
           |            +-00.1  NVIDIA Corporation TU116 High Definition Audio Controller
           |            +-00.2  NVIDIA Corporation Device 1aec
           |            \-00.3  NVIDIA Corporation TU116 [GeForce GTX 1650 SUPER]
           +-04.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-07.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-07.1-[0a]--+-00.0  Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function
           |            +-00.2  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
           |            \-00.3  Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller
           +-08.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-08.1-[0b]--+-00.0  Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function
           |            +-00.2  Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
           |            \-00.3  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller
           +-14.0  Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
           +-14.3  Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
           +-18.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
           +-18.1  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
           +-18.2  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
           +-18.3  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
           +-18.4  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
           +-18.5  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
           +-18.6  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6
           \-18.7  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7


$ sudo nvme get-feature -f 0x0c -H /dev/nvme0
get-feature:0xc (Autonomous Power State Transition), Current value:0x000001
	Autonomous Power State Transition Enable (APSTE): Enabled
	Auto PST Entries	.................
	Entry[ 0]
	.................
	Idle Time Prior to Transition (ITPT): 71 ms
	Idle Transition Power State   (ITPS): 3
	.................
	Entry[ 1]
	.................
	Idle Time Prior to Transition (ITPT): 71 ms
	Idle Transition Power State   (ITPS): 3
	.................
	Entry[ 2]
	.................
	Idle Time Prior to Transition (ITPT): 71 ms
	Idle Transition Power State   (ITPS): 3
	.................
	Entry[ 3]
	.................
	Idle Time Prior to Transition (ITPT): 500 ms
	Idle Transition Power State   (ITPS): 4
	.................
	Entry[ 4]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[ 5]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[ 6]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[ 7]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[ 8]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[ 9]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[10]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[11]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[12]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[13]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[14]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[15]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[16]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[17]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[18]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[19]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[20]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[21]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[22]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[23]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[24]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[25]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[26]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[27]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[28]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[29]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[30]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[31]
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
       0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
0000: 18 47 00 00 00 00 00 00 18 47 00 00 00 00 00 00 ".G.......G......"
0010: 18 47 00 00 00 00 00 00 20 f4 01 00 00 00 00 00 ".G.............."
0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
Comment 60 Hans L 2020-02-24 23:07:34 UTC
Since my problems were only under heavy load, I suspected power delivery issues and upgraded my motherboard to Asus TUF Gaming X570 which has much better VRM.  I am no longer able to reproduce any of the errors I reported above.

So the previous Asus Prime B450 Plus motherboard was either a defective unit or under-spec'd in general to power a fully loaded 2700X.
Comment 61 eeshugerman 2020-03-19 23:36:51 UTC
Hi, just thought I'd share my experience with this.

I have a Lenovo P51s Thinkpad (20JY0004US), which has a Samsung MZVLB512HAJQ-000L7 drive.

I can't say for sure, but I believe the issues arose when my distro updated the kernel to 5.4. I was unable to boot (endless read-only filesystem errors) until I added the `nvme_core.default_ps_max_latency_us=200` parameter. This mostly solved the problem -- I was able to boot -- but my system would crash occasionally, out of nowhere, with the same read-only filesystem errors. I'd estimate this happened once or twice a day, often when I unplugged the charger, but not always. For a while I thought it only happened when the laptop was unplugged, but at least once it happened while charging.

Finally, I tried installing the latest firmware updates from Lenovo, which I'd never done before, and I haven't seen the issue since! However the `nvme_core.default_ps_max_latency_us=200` parameter is still necessary.
Comment 62 Sergey Slizovskiy 2020-03-19 23:52:21 UTC
In my case, it was (and is) a hardware issue of a bad SSD-Motherboard contact.  Downgrading the kernel or changing latency seemed to help a bit,  but  not forever. Thus, it was very hard to diagnose. 
My current solution is to remove the SSD and spray with a MAF sensor cleaner I bought for my car   onto the contacts.  It fixes the problem for a period of several months.  Then I have to do it again. 

Cheers,
Sergey
Comment 63 eeshugerman 2020-03-25 21:04:55 UTC
Update to [my previous comment](https://bugzilla.kernel.org/show_bug.cgi?id=195039#c61): Actually, I still get the issue sometimes, but now it only happens when I plug my laptop in to charge. About 1 out of 3 times that I plug it in it will occur.
Comment 64 xken.sky 2020-05-25 05:42:58 UTC
I'm sorry to bother you. I've been plagued by a problem - an unexplained crash during the use.
Dell G3 laptop, Ubuntu 20.04lts, Linux wlp2s0 hosts 5.4.0-31-generic เท Ubuntu SMP Thu May 7 20:20:34 UTC 2020 x86_ 64 x86_ 64 x86_ 64 GNU/Linux
Samsung 1tssd solid-state hard disk is also used. The model is Samsung Electronics Co Ltd nvme SSD controller sm981 / pm981 / pm983
root@wlp2s0-hosts :/home/wlp2s0# smartctl -i /dev/nvme0
smartctl 7.1 2019-12-30 r5022 [x86_ 64-linux-5.4.0-31-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke,  www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number:                       PM981a NVMe Samsung 1024GB
Serial Number:                      S4GXNE0M828422
Firmware Version:                   15302129
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 1,024,209,543,168 [1.02 TB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,024,209,543,168 [1.02 TB]
Namespace 1 Utilization:            138,282,958,848 [138 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Mon May 25 13:39:24 2020 CST
root@wlp2s0-hosts :/home/wlp2s0# nvme id-ctrl /dev/nvme0
NVME Identify Controller:
vid       : 0x144d
ssvid     : 0x144d
sn        :       S4GXNE0M828422
mn        : PM981a NVMe Samsung 1024GB
fr        : 15302129
rab       : 2
ieee      : 002538
cmic      : 0
mdts      : 9
cntlid    : 0x4
ver       : 0x10300
rtd3r     : 0x30d40
rtd3e     : 0x7a1200
oaes      : 0
ctratt    : 0
rrls      : 0
crdt1     : 0
crdt2     : 0
crdt3     : 0
oacs      : 0x17
acl       : 7
aerl      : 3
frmw      : 0x16
lpa       : 0x2
elpe      : 63
npss      : 4
avscc     : 0x1
apsta     : 0x1
wctemp    : 357
cctemp    : 358
mtfa      : 0
hmpre     : 0
hmmin     : 0
tnvmcap   : 1024209543168
unvmcap   : 0
rpmbs     : 0
edstt     : 35
dsto      : 0
fwug      : 0
kas       : 0
hctma     : 0x1
mntmt     : 321
mxtmt     : 358
sanicap   : 0x2
hmminds   : 0
hmmaxd    : 0
nsetidmax : 0
anatt     : 0
anacap    : 0
anagrpmax : 0
nanagrpid : 0
sqes      : 0x66
cqes      : 0x44
maxcmd    : 0
nn        : 1
oncs      : 0x5f
fuses     : 0
fna       : 0x3
vwc       : 0x1
awun      : 1023
awupf     : 0
nvscc     : 1
nwpc      : 0
acwu      : 0
sgls      : 0
mnan      : 0
subnqn    :
ioccsz    : 0
iorcsz    : 0
icdoff    : 0
ctrattr   : 0
msdbd     : 0
ps    0 : mp:6.60W operational  enlat:0 exlat 0 rrt:0 rrl 0
rwt:0 rwl :0 idle_ power:- active_ power:-
ps    1 : mp:4.40W operational  enlat:0 exlat 0 rrt:1 rrl 1
rwt:1 rwl :1 idle_ power:- active_ power:-
ps    2 : mp:3.10W operational  enlat:0 exlat 0 rrt:2 rrl 2
rwt:2 rwl :2 idle_ power:- active_ power:-
ps    3 : mp:0.0700W non-operational  enlat:210 exlat :1200  rrt:3 rrl 3
rwt:3 rwl :3 idle_ power:- active_ power:-
ps    4 : mp:0.0050W non-operational  enlat:2000 exlat :8000  rrt:4 rrl 4
rwt:4 rwl :4 idle_ power:- active_ power:-
Try using nvme_ core.default_ Ps_ max_ latency_ Us = 0 to boot but not to enter the system at all.
Comment 65 RockT 2020-06-23 08:45:35 UTC
I'm experiencing the same problem - only if the laptop is charging!

Sys: Lenovo T480s
Bios: Version: N22ET62W (1.39 )
Release Date: 02/18/2020

changed ssd from stock 256GB to
KINGSTON SA2000M81000G 1TB

while in battery mode the system is rock stable.
when charging I see:

[ 5088.579248] nvme nvme0: I/O 704 QID 2 timeout, aborting
[ 5088.579274] nvme nvme0: I/O 705 QID 2 timeout, aborting
[ 5088.579285] nvme nvme0: I/O 706 QID 2 timeout, aborting
[ 5088.579294] nvme nvme0: I/O 707 QID 2 timeout, aborting
[ 5088.579303] nvme nvme0: I/O 708 QID 2 timeout, aborting
[ 5118.788204] nvme nvme0: I/O 704 QID 2 timeout, reset controller
[ 5150.021209] nvme nvme0: I/O 0 QID 0 timeout, reset controller

Tested kernels (all have this problem)
5.7.4
5.7.5
5.8.0 rc1
Comment 66 Sebastian Jastrzebski 2020-06-23 13:11:14 UTC
@RockT - I don't think its kernel issue. I too have a T580 and had numerous issues with NVME. See my initial comments in this thread from 12/2018. 

T580 seems to have power delivery issues that cause NVME drives to crash. The only fix I found that works reliably is replacing the motherboard on the laptop.
I'm on a 3rd motherboard currently, the first one lasted a year, the second one year and a half, the 3rd one was installed couple weeks ago (each time going through usual troubleshooting process that included swapping to new NVME drive, new drive cage, etc).

I'm running F32 w/ kernel 5.6 and after the motherboard swap all issues with NVME are gone (at least until mb fails again). Its pretty sad. Maybe the latest mobo revision has some fixes that will make things more reliable.
Comment 67 RockT 2020-06-24 16:35:24 UTC
@Sebastian Jastrzebski thank you for your answer.

but I somehow doubt that it is a hardware problem:

- the stock nvme card was stable
- I applied kernel parameter "nvme_core.default_ps_max_latency_us=5500":

 $ cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-5.7.5-050705-generic root=/dev/mapper/vgubuntu--mate-root ro quiet splash nvme_core.default_ps_max_latency_us=5500 vt.handoff=7

This is stable now for two days of work including running some vms and doing some dd tests.
No matter if the laptop is charging or on battery I don't have problems anymore.
Comment 68 eeshugerman 2020-06-25 02:32:50 UTC
Created attachment 289877 [details]
attachment-20893-0.html

FWIW I believe the issue was hardware related in my case too. Setting
default_ps_max_latency_us=200 fixed it for a couple months but eventually
it returned. I tried firmware updates, pinning old kernels, installing
different distros, etc. These changes would seem to fix it for a couple
days (I even reported it fixed once or twice in this thread) but then it
would come back, or start happening under different circumstances. Finally
one day it got so bad I couldn't boot at all so I cracked open the case and
found there were structural issues with the mobo port. I believe it was the
same issue as described [here](
https://forums.lenovo.com/t5/ThinkPad-P-and-W-Series-Mobile/p51s-sata-mobo-connector-is-broken/td-p/4030539),
though not as far gone. Plugging the drive into another computer (running
the same version of the kernel) worked fine. I never got the port
replaced/resoldered though, so I can't say with /complete/ certainty that
it was the problem. It sure looks that way though, especially considering
all the others in this thread coming to similar conclusions.

On Wed, Jun 24, 2020, 10:35 AM <bugzilla-daemon@bugzilla.kernel.org> wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=195039
>
> --- Comment #67 from RockT (tr.ml@gmx.de) ---
> @Sebastian Jastrzebski thank you for your answer.
>
> but I somehow doubt that it is a hardware problem:
>
> - the stock nvme card was stable
> - I applied kernel parameter "nvme_core.default_ps_max_latency_us=5500":
>
>  $ cat /proc/cmdline
> BOOT_IMAGE=/vmlinuz-5.7.5-050705-generic
> root=/dev/mapper/vgubuntu--mate-root
> ro quiet splash nvme_core.default_ps_max_latency_us=5500 vt.handoff=7
>
> This is stable now for two days of work including running some vms and
> doing
> some dd tests.
> No matter if the laptop is charging or on battery I don't have problems
> anymore.
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
Comment 69 RockT 2020-06-26 09:59:44 UTC
I'm still not convinced.

To resize my encrypted fs on the new 1TB I use sysreccd with kernel 5.4.44 LTS

I can fsck the filesystem with the stock kernel params. But as soon as I resize the filesystem - the nvme controller locks hard. Not even a soft reboot can recover.

As soon as I set "nvme_core.default_ps_max_latency_us=5500" with sysresccd  everything works as expected: resize, fsck, luks, lvm
Comment 70 RockT 2020-06-26 10:06:04 UTC
Of course it's running stable now for only three days - so take it with a pinch of salt.
Comment 71 juan 2020-06-26 10:32:08 UTC
(In reply to RockT from comment #69)

this solved the problem for me:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1852479


>here a bit of my cat /etc/default/grub


GRUB_CMDLINE_LINUX_DEFAULT="quiet splash modprobe.blacklist=nouveau nvme_core.default_ps_max_latency_us=5500 pcie_aspm=off"
GRUB_CMDLINE_LINUX="nouveau.modeset=0"

For me pcie_aspm=off was the parameter that helped solve the issue.

for more info

+see here: https://wiki.archlinux.org/index.php/Solid_state_drive/NVMe

+and here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1746340#yui_3_10_3_1_1590400769591_1615
Comment 72 berk 2020-07-11 03:34:08 UTC
Hi folks, just confirming that I have the same issue. I have a ThinkPad E480 with a new Kingston A2000 512GB NVME SSD. Here are some of the things I've experienced:

-> Fedora 32 won't install to NVME with LUKS and it fails to format to ext4 (simply hangs). With unencrypted install (standard partitioning) it installs however there are frequent lock ups (can't do anything, even switching TTYs don't work).

-> Ubuntu 20.04 can install however there are frequent lockups just like Fedora.

I haven't tried out "pcie_aspm=off" as I am using a different OS for the time being but it sounds like that would fix it. Maybe one day I'll try it again.

The question is, what would be the long-term fix? Is it simply a matter of solving the power saving issue? It would be nice to benefit from the power saving while still having the stability of a SATA SSD.
Comment 73 berk 2020-07-17 01:13:02 UTC
Quick update, good news, I may have found workaround for people who are suffering from the NVMe timeout issues. I'm on Fedora 32 with the Kingston A2000 512GB SSD and after 2+ days of uptime, plugged on and on battery along with various work loads, I think it is safe to say that the system is quite robust. If I run into any issues down the road, I'll be sure to post an update.

Like Juan, I updated the pci_aspm parameter however I went with "performance". I'm not sure which one actually solves the problem but I've got both parameters going.

Basically, I put this in the /etc/grub/default file: 

--> GRUB_CMDLINE_LINUX_DEFAULT="nvme_core.default_ps_max_latency_us=0 pci_aspm=performance"

This will ensure it will use these parameters all the time at boot.

Then I reloaded the GRUB configuration (this is for Fedora):
--> sudo grub2-mkconfig -o /boot/grub2/grub.cfg

To check if the "nvme_core.default_ps_max_latency_us=0" has been set successfully, you can run the following command:

--> cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
Comment 74 Kirill Kulikov 2020-08-10 09:50:45 UTC
@berk Thank you!
I had this issue with Kingston A2000 1000GB SSD. Running Kernel 5.7 on Arch with  GRUB_CMDLINE_LINUX_DEFAULT="nvme_core.default_ps_max_latency_us=0 pci_aspm=performance" seems to solve it so far.
I had hang ups a few times per hour. :S
Comment 75 Sergey Slizovskiy 2020-08-10 10:02:31 UTC
Despite nvme latency seeming to help for some time,  this is, most probably, a hardware issue: the power supply to SSD is interrupted for an instant. 
Clean the SSD contacts with a paper dipped in isopropanol or similar.  Cut a stripe of cardboard, dip it in isopropanol and clean the motherboard SSD contacts. 
   I have had this issue since 3 years and have to repeat the cleaning around ones a year. 
Best wishes,
Sergey
Comment 76 Kirill Kulikov 2020-08-10 11:32:38 UTC
(In reply to Sergey Slizovskiy from comment #75)
> Despite nvme latency seeming to help for some time,  this is, most probably,
> a hardware issue: the power supply to SSD is interrupted for an instant. 
> Clean the SSD contacts with a paper dipped in isopropanol or similar.  Cut a
> stripe of cardboard, dip it in isopropanol and clean the motherboard SSD
> contacts. 
>    I have had this issue since 3 years and have to repeat the cleaning
> around ones a year. 
> Best wishes,
> Sergey

I will also try this, but its strange since everthing is fine on Windows (i use Dual Boot).
Comment 77 berk 2020-08-13 09:23:22 UTC
(In reply to Sergey Slizovskiy from comment #75)
> Despite nvme latency seeming to help for some time,  this is, most probably,
> a hardware issue: the power supply to SSD is interrupted for an instant. 
> Clean the SSD contacts with a paper dipped in isopropanol or similar.  Cut a
> stripe of cardboard, dip it in isopropanol and clean the motherboard SSD
> contacts. 
>    I have had this issue since 3 years and have to repeat the cleaning
> around ones a year. 
> Best wishes,
> Sergey

Thanks for the reply Sergey, I decided to give your method a try as I was reinstalling (although just blowing dust out of the PCI-e slot and wiping the NVME SSD terminals with some alcohol, however no luck here. Also this isn't an issue on Windows, so it could be a firmware issue on the laptop's side.

However, from my previous comment, I noticed that the only thing I needed was the latency, not the aspm stuff. I haven't had any issues so far and I'm getting good uptime along with various workloads. 

I wrote a little article on my website on the fix I applied to my system. You can read it here: https://tekbyte.net/2020/fixing-nvme-ssd-problems-on-linux/

I should mention that I'm on a ThinkPad E480 running Fedora 32. I did some research and it seems to plague some other ThinkPad owners. I should also say that the Lenovo NVME SSD (some Toshiba OPAL SSD) doesn't have this problem. 

Apart from that, I simply apply the GRUB tweak and I'm done. Minor inconvenience but not much I can do unless some kernel update or BIOS update fixes this. There is also a BIOS update back but I doubt it'd fix the issue.  Might report this to Lenovo.

All the best,
Berk
Comment 78 Dirk Jonker 2020-10-28 08:14:46 UTC
I can also confirm this issue. I replaced the nvme hard drive of my Thinkpad T480s with a Kingston A2000 1TB drive. The previous drive, a 256GB Samsung PM961 had been running without issues for more than 2 years.

The issue is fixed using the parameters nvme_core.default_ps_max_latency_us=0 and pci_aspm=performance. I am running Fedora with kernel 5.8.16.

It seems this particular Kingston drive just has issues with Linux given that multiple people have reported issues with this drive, not just here but on several places, just Google for "kingston A2000 linux":

- https://bbs.archlinux.org/viewtopic.php?id=256476
- https://askubuntu.com/questions/1222049/nvmekingston-a2000-sometimes-stops-giving-response-in-ubuntu-18-04dell-inspir
- https://community.acer.com/en/discussion/604326/m-2-nvme-ssd-aspire-517-51g-issue-compatibility-kingston-a2000-linux-ubuntu

Note You need to log in before you can comment on or make changes to this bug.