Bug 201811

Summary: PM951 1TB still needs an APST quirk else I get a read-error failure
Product: IO/Storage Reporter: Ian Ozsvald (ian)
Component: Other Assignee: io_other
Status: RESOLVED DOCUMENTED    
Severity: normal    
Priority: P1    
Hardware: Intel   
OS: Linux   
Kernel Version: 4.19.0 Subsystem:
Regression: No Bisected commit-id:

Description Ian Ozsvald 2018-11-29 12:31:32 UTC
This builds upon the extant: https://bugzilla.kernel.org/show_bug.cgi?id=195039 (where I've commented in the past).

I have a PM951 1TB drive (the uncommon larger variant of the more usual 0.5TB drive) in a Dell XPS 9550. On kernel 4.19.0 I suffer read-only failures within an hour. Disabling APST solves the problem.

I had been commenting over on the Ubuntu Launchpad:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1805816 (my new report)
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184 (the original report by someone else, solved, for the more common 0.5TB drive)

`apport` suggests I need to post the bug here instead. The bugzilla link in the first line above is for the PM951 NVMe SAMSUNG 512GB; I have the 1TB equivalent.

If I disable APST in GRUB then I get no read-only failures. If I use a reduced power-saving APST option (I tried nvme_core.default_ps_max_latency_us=250, see https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1805816) then the machine lives a little longer before suffering the same read-only fate. I believe that a quirk needs to be added for my NVMe drive.
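
For reference, a minimal sketch of how that boot parameter is typically made persistent. It assumes a Debian/Ubuntu-style setup where /etc/default/grub is read by update-grub; the file location and the `quiet splash` defaults are assumptions about the install, not something taken from this report's GRUB configuration.

```shell
# /etc/default/grub fragment (the file uses shell variable syntax).
# A value of 0 asks the nvme driver not to enable APST at all;
# a small non-zero value such as 250 instead caps the latency it will accept.
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvme_core.default_ps_max_latency_us=0"
```

After editing the file, running `sudo update-grub` and rebooting should apply the setting on such a system.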

Details copied over:
$ sudo nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     S2FZNYAG801690       PM951 NVMe SAMSUNG 1024GB                1         314.10 GB / 1.02 TB        512 B + 0 B      BXV76D0Q

$ uname -a
Linux ian-XPS-15-9550 4.19.0-041900-generic #201810221809 SMP Mon Oct 22 22:11:45 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ lsb_release -rd
Description: Linux Mint 19 Tara
Release: 19

I am very happy to add more info - please let me know what you want.

Regards, Ian.
Comment 1 Ian Ozsvald 2018-11-29 12:33:05 UTC
$ sudo nvme id-ctrl /dev/nvme0
NVME Identify Controller:
vid     : 0x144d
ssvid   : 0x144d
sn      :       S2FZNYAG801690
mn      : PM951 NVMe SAMSUNG 1024GB               
fr      : BXV76D0Q
rab     : 2
ieee    : 002538
cmic    : 0
mdts    : 5
cntlid  : 1
ver     : 0
rtd3r   : 0
rtd3e   : 0
oaes    : 0
ctratt  : 0
oacs    : 0x17
acl     : 7
aerl    : 3
frmw    : 0x6
lpa     : 0
elpe    : 63
npss    : 4
avscc   : 0x1
apsta   : 0x1
wctemp  : 0
cctemp  : 0
mtfa    : 0
hmpre   : 0
hmmin   : 0
tnvmcap : 0
unvmcap : 0
rpmbs   : 0
edstt   : 35
dsto    : 0
fwug    : 0
kas     : 0
hctma   : 0
mntmt   : 0
mxtmt   : 0
sanicap : 0
hmminds : 0
hmmaxd  : 0
sqes    : 0x66
cqes    : 0x44
maxcmd  : 0
nn      : 1
oncs    : 0x1f
fuses   : 0
fna     : 0
vwc     : 0x1
awun    : 255
awupf   : 0
nvscc   : 1
acwu    : 0
sgls    : 0
subnqn  : 
ioccsz  : 0
iorcsz  : 0
icdoff  : 0
ctrattr : 0
msdbd   : 0
ps    0 : mp:6.00W operational enlat:5 exlat:5 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:4.20W operational enlat:30 exlat:30 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:3.10W operational enlat:100 exlat:100 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0700W non-operational enlat:500 exlat:5000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0050W non-operational enlat:2000 exlat:22000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-
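
To relate the power-state table above to a given nvme_core.default_ps_max_latency_us value, here is a rough sketch that lists which states would remain candidates for autonomous transitions. It rests on an assumption about the driver's behaviour that is not confirmed in this report: that only non-operational states whose exit latency (exlat) does not exceed the limit are considered. The function name `apst_candidates` is made up for illustration.

```shell
# Read `nvme id-ctrl` text on stdin and print the numbers of the power states
# that are non-operational AND have exlat (microseconds) <= the given limit.
# ASSUMPTION: this filter mirrors how the driver picks APST target states.
apst_candidates() {
    awk -v limit="$1" '
        $1 == "ps" && /non-operational/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^exlat:/) {
                    split($i, kv, ":")          # kv[2] holds the latency value
                    if (kv[2] + 0 <= limit + 0) print $2
                }
            }
        }
    '
}
```

Usage would be `sudo nvme id-ctrl /dev/nvme0 | apst_candidates 250`. Under the stated assumption, a limit of 250 admits neither ps 3 (exlat 5000) nor ps 4 (exlat 22000) on this drive.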
Comment 2 Ian Ozsvald 2018-11-29 12:36:22 UTC
$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.19.0-041900-generic root=/dev/mapper/mint--vg-root ro quiet splash nvme_core.default_ps_max_latency_us=0 vt.handoff=1
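
A small sketch for pulling the APST latency setting out of a command line like the one above, so it can be checked from a script. The function name `ps_max_latency_from_cmdline` is hypothetical, and the choice to report the last occurrence when the parameter is repeated is an assumption.

```shell
# Print the value of nvme_core.default_ps_max_latency_us found in a kernel
# command line string; the exit status is non-zero if the parameter is absent.
# The function body is a subshell, so its variables and `set -f` stay local.
ps_max_latency_from_cmdline() (
    set -f                          # no glob expansion while word-splitting
    found=0
    value=
    for word in $1; do
        case "$word" in
            nvme_core.default_ps_max_latency_us=*)
                value=${word#*=}    # if repeated, report the last one
                found=1
                ;;
        esac
    done
    [ "$found" -eq 1 ] && printf '%s\n' "$value"
)
```

For example, `ps_max_latency_from_cmdline "$(cat /proc/cmdline)"`. On kernels that expose it, the value in effect should also be readable from /sys/module/nvme_core/parameters/default_ps_max_latency_us.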
Comment 3 Ian Ozsvald 2018-11-29 12:38:13 UTC
Historically I had to stay on 4.9.91. Going >91 caused other problems (e.g. regressions with my Intel WiFi). Going > 4.9 had other issues, generally not booting. Having had to reinstall my home folder (due to Dropbox's requirement to move from encrypted home to whole-disk-encryption) I've taken the opportunity to upgrade everything afresh.

Hoping this isn't a pain, Ian.
Comment 4 Ian Ozsvald 2018-11-29 17:55:01 UTC
*Very* annoyingly I've had my first read-only failure just now, whilst (as best I know) APST was disabled. This morning I had a fresh boot and whilst I didn't confirm that APST was disabled, I have no reason to believe it wasn't. I have a script that checks for me; on this boot it shows:

$ more get_apste 
sudo nvme get-feature -f 0x0c -H /dev/nvme0 | grep APSTE
$ ./get_apste 
	Autonomous Power State Transition Enable (APSTE): Disabled
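
A slightly more script-friendly variant of that check, as a sketch: it reduces the human-readable output to the single state word so it can be compared or logged. The function name `apste_state` is made up here, and it assumes the `-H` output keeps the "(APSTE): <word>" layout shown above.

```shell
# Read `nvme get-feature -f 0x0c -H` output on stdin and print only the
# APSTE state word (e.g. "Disabled"). Prints nothing if no APSTE line is seen.
apste_state() {
    sed -n 's/.*(APSTE): *\([A-Za-z]*\).*/\1/p' | head -n 1
}
```

Usage: `sudo nvme get-feature -f 0x0c -H /dev/nvme0 | apste_state`.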

Here's a snippet of journalctl at the point of failure - I see no relevant logs. I was using the machine maybe 10 minutes prior to this, and it had been on (lightly used) since the morning. I spotted that the machine had gone read-only at 17:47 and did a hard reboot (5 seconds on the power key):

Nov 29 17:23:28 ian-XPS-15-9550 org.x.reader.Daemon[1296]: UnregisterDocument URI 'file:///home/ian/workspace/clients/Hiring/2018_11_29%20Robin%20Cole%20CV%208-9-2018.pdf'
Nov 29 17:26:25 ian-XPS-15-9550 NetworkManager[994]: <info>  [1543512385.6068] manager: NetworkManager state is now CONNECTED_GLOBAL
Nov 29 17:26:25 ian-XPS-15-9550 dbus-daemon[938]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.11' (uid=0 pid=994 comm="/usr/sbin/NetworkManager --no-daemon ")
Nov 29 17:26:25 ian-XPS-15-9550 systemd[1]: Starting Network Manager Script Dispatcher Service...
Nov 29 17:26:25 ian-XPS-15-9550 dbus-daemon[938]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Nov 29 17:26:25 ian-XPS-15-9550 systemd[1]: Started Network Manager Script Dispatcher Service.
Nov 29 17:26:25 ian-XPS-15-9550 nm-dispatcher[10640]: req:1 'connectivity-change': new request (1 scripts)
Nov 29 17:26:25 ian-XPS-15-9550 nm-dispatcher[10640]: req:1 'connectivity-change': start running ordered scripts...
-- Reboot --
Nov 29 17:47:43 ian-XPS-15-9550 kernel: microcode: microcode updated early to revision 0xc6, date = 2018-04-17
Nov 29 17:47:43 ian-XPS-15-9550 kernel: Linux version 4.19.0-041900-generic (kernel@tangerine) (gcc version 8.2.0 (Ubuntu 8.2.0-7ubuntu1)) #201810221809 SMP Mon Oct 22 22:11:45 UTC 2018
Nov 29 17:47:43 ian-XPS-15-9550 kernel: Command line: BOOT_IMAGE=/vmlinuz-4.19.0-041900-generic root=/dev/mapper/mint--vg-root ro quiet splash nvme_core.default_ps_max_latency_us=0 vt.handoff=1

This is the first read-only failure I've had since I've disabled APST. I'm now strongly considering reverting to 4.9.91. Any ideas would be very happily received.
Comment 5 Ian Ozsvald 2018-12-01 09:02:47 UTC
Having had that read-only failure I clean booted and the same machine, with no other modifications, has been running fine for 2 days (no sleeping, so 48+ hours on).
It is possible that, whilst trying to file this and the Launchpad bug, I ran a command that re-enabled APST. I'd be surprised if I did that, but I can't rule it out, and the read-only failure coincided with the bug filing.
Comment 6 Ian Ozsvald 2018-12-17 10:25:59 UTC
I'll note that I've now returned to 4.9.91 which was my last-good kernel from a couple of months back.

Following up on #5, I had a second read-only failure despite having APST disabled. In this case I'd upgraded to 4.19.7 (up from 4.19.0, where I had the previous read-only filesystem failure with APST disabled).

I've been using 4.9.91 for a week; it seems to be stable and I get no read-only failures. Maybe this is a quirk of my less-common (but still Dell's standard) 1TB PM951.

Perhaps this post will help someone else in the future. Cheers, Ian.
Comment 7 Ian Ozsvald 2019-01-02 18:46:12 UTC
A BIOS upgrade, detailed in https://bugzilla.kernel.org/show_bug.cgi?id=195039, seems to have solved this issue. Once I know that this is solved I'll close this issue.