Bug 194921

Summary: Kernel oopses/panics after controller gets reset
Product: Drivers
Reporter: atomnuker (atomnuker)
Component: Flash/Memory Technology Devices
Assignee: David Woodhouse (dwmw2)
Status: NEW
Severity: high
CC: axboe, chris.roth, luto, opensuser, regressions
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 4.11-rc2
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: 2 concatenated logs of the panic event happening

Description atomnuker 2017-03-18 15:57:29 UTC
Created attachment 255337: 2 concatenated logs of the panic event happening

Hi,

I'm not sure whether this is the correct category; the panic trace is vague and only reports a NULL dereference somewhere. The issue happens with the current git master of the kernel, though I have seen it as far back as the start of the 4.11 merge window. I didn't report it earlier because I assumed it would be fixed in an RC.

The sequence seems to be: the NVMe controller gets reset for some reason and reports a capacity change, btrfs then logs a burst of errors, and finally the kernel panics.

The issue seems to happen at random; I've seen it anywhere from 2 minutes to 2 hours after boot.
I'm happy to reproduce it with a kernel built with more debugging options if someone tells me what to enable.

I've attached 2 full dmesg+oops logs that I recovered from pstore, concatenated into one file because the form doesn't let me attach more than one.
Comment 1 The Linux kernel's regression tracker (Thorsten Leemhuis) 2017-03-27 10:09:58 UTC
Does this problem already happen on older versions (<= 4.10), or is it new with 4.11-rc? In the latter case it might qualify as a regression.

BTW: What kind of NVMe device is this? As far as I can see, that info is missing from your dmesg outputs :-/
Comment 2 atomnuker 2017-03-27 12:45:32 UTC
No, the problem doesn't occur with 4.10 or any older versions.
The NVMe is a Samsung 950 Pro 256GB.

I'm currently running 4.11-rc4 and haven't had any crashes in 3 and a half hours of heavy usage. The bug might have been fixed (and possibly I mis-categorized it under flash memory because of the vague dmesg) so if I can't reproduce it after a day I'll close this bug.

Cheers
Comment 3 The Linux kernel's regression tracker (Thorsten Leemhuis) 2017-03-27 15:06:20 UTC
(In reply to atomnuker from comment #2)
> No, the problem doesn't occur with 4.10 or any older versions.
> The NVMe is a Samsung 950 Pro 256GB.

Hmmm, some of the Samsung 950s have a known problem with APST, which has been enabled since https://git.kernel.org/torvalds/c/c5552fde102fcc3f2cf9e502b8ac90e3500d8fdf (see Bug 195039 for another owner of a Samsung SSD who ran into problems). Please attach your dmesg output so we can clearly identify what model and firmware it is.
 
> I'm currently running 4.11-rc4 and haven't had any crashes in 3 and a half
> hours of heavy usage.

In the other bug entry Jens just asked: "[…] can you try with -rc4, and revert commit c5552fde10? I just checked, it reverts cleanly. […]" Might be worth a try in case the problem shows up again. Note: It's a PM feature, so it might trigger less often under "heavy usage" ;-)

> I mis-categorized it under flash memory

Yes, it's likely a block-layer bug, but that doesn't matter much (and I can't change it :-/ ).
Comment 4 Andy Lutomirski 2017-03-27 15:26:15 UTC
Hi atomnuker-

This report is stranger than the other two because your SSD is more similar to mine, and mine works just fine.  Can you give me a bit more information?

1. The raw device identification.  A new enough smartctl will show it in 'smartctl -i /dev/nvme0'.  Even better would be the full output of 'nvme id-ctrl /dev/nvme0' -- you can find the 'nvme' tool in the nvme-cli package.

2. What kind of computer is this?  Is it a laptop?  Is the affected disk something that came with the laptop?

3. Can you try booting with nvme_core.default_ps_max_latency_us=0?  That will disable the power-saving feature that is likely at fault.
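
To check whether that boot parameter took effect, the current value can be read back at runtime. A minimal Python sketch, assuming the module parameter is exposed at /sys/module/nvme_core/parameters/default_ps_max_latency_us; the path and both helper names are assumptions made for illustration, not something confirmed in this report:

```python
from pathlib import Path

# Assumed sysfs location of the nvme_core module parameter.
PARAM = Path("/sys/module/nvme_core/parameters/default_ps_max_latency_us")


def read_ps_max_latency_us(path=PARAM):
    """Return the latency limit in microseconds, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def describe(limit):
    """Turn the raw value into a short human-readable status."""
    if limit is None:
        return "parameter not available"
    if limit == 0:
        return "power saving disabled (limit is 0)"
    return f"power saving allowed, latency limit {limit} us"


if __name__ == "__main__":
    print(describe(read_ps_max_latency_us()))
```

A result of 0 would indicate that the setting suggested in item 3 is active.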

Samsung currently has a machine that appears to be affected and is trying to figure out what's going on.  There's some reason to believe that the problem is triggered by specific combinations of laptop and SSD.  I'll obviously need to update the blacklist to fix your system (in lieu of a better workaround that still lets you get some power savings), but I need the info above to figure out what the blacklist entry should look like.

Thanks,
Andy
Comment 5 atomnuker 2017-03-27 15:27:10 UTC
(In reply to Thorsten Leemhuis from comment #3)
> (In reply to atomnuker from comment #2)
> > No, the problem doesn't occur with 4.10 or any older versions.
> > The NVMe is a Samsung 950 Pro 256GB.
> 
> Hmmm, some of the Samsung 950s have a known problem with APST, which is now
> enabled since
> https://git.kernel.org/torvalds/c/c5552fde102fcc3f2cf9e502b8ac90e3500d8fdf
> See Bug 195039 for another owner of a Samsung SSD that ran into problems.
> Please attach your dmesg output so we can clearly identify what model and
> firmware it is.
>  

Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller SM951/PM951
Firmware: 1B0QBXX7

This seems to be the model that has the NO_APST quirk; however, my firmware version is 1B0QBXX7 while the quirked one is BXW75D0Q, so APST seems to be active. (dmesg doesn't show anything at all about the NVMe device; only /sys/class/nvme/ and lspci do.)
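
The mismatch can be illustrated with a small sketch of firmware-based quirk matching. This is not the kernel's code: the QuirkEntry type, the one-row table and the matching rule are hypothetical, and only the vendor ID, model name and firmware strings are taken from this thread.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class QuirkEntry:
    """Hypothetical quirk-table row; None means "match anything"."""
    vid: int
    mn: Optional[str]
    fr: Optional[str]
    quirk: str


# Hypothetical table holding only the entry discussed in this report.
QUIRKS = [QuirkEntry(vid=0x144D, mn=None, fr="BXW75D0Q", quirk="NO_APST")]


def quirks_for(vid: int, mn: str, fr: str) -> List[str]:
    """Return the quirk names whose entries match this controller."""
    found = []
    for entry in QUIRKS:
        if entry.vid != vid:
            continue
        # Identify strings may be space-padded, so compare stripped values.
        if entry.mn is not None and entry.mn != mn.strip():
            continue
        if entry.fr is not None and entry.fr != fr.strip():
            continue
        found.append(entry.quirk)
    return found


if __name__ == "__main__":
    # Firmware from this report: no entry matches.
    print(quirks_for(0x144D, "Samsung SSD 950 PRO 256GB", "1B0QBXX7"))
    # The quirked firmware string: the entry matches.
    print(quirks_for(0x144D, "Samsung SSD 950 PRO 256GB", "BXW75D0Q"))
```

Under this assumed rule, a drive with the same vendor ID but a different firmware string is not covered by the existing entry, which is consistent with APST staying active here.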

> > I'm currently running 4.11-rc4 and haven't had any crashes in 3 and a half
> > hours of heavy usage.
> 
> In the other bug entry Jens just asked: "[…] can you try with -rc4, and
> revert commit c5552fde10? I just checked, it reverts cleanly. […]" Might be
> worth a try in case the problem shows up again. Note: It's a PM feature, so
> it might trigger less often under "heavy usage" ;-)

Took an hour-long break, left it idling, came back, and it's still alive. I don't think it's going to crash.
Comment 6 atomnuker 2017-03-27 15:33:07 UTC
(In reply to Andy Lutomirski from comment #4)
> Hi atomnuker-
> 
> This report is stranger than the other two because your SSD is more similar
> to mine, and mine works just fine.  Can you give me a bit more information?
> 
> 1. The raw device identification.  A new enough smartctl will show it in
> 'smartctl -i /dev/nvme0'.  Even better would be the full output of 'nvme
> id-ctrl /dev/nvme0' -- you can find the 'nvme' tool in the nvme-cli package.
> 

Posted in the previous comment.

> 2. What kind of computer is this?  Is it a laptop?  Is the affected disk
> something that came with the laptop?
> 

Yes, it's an XPS 15 (9550).

> 3. Can you try booting with nvme_core.default_ps_max_latency_us=0?  That
> will disable the power-saving feature that is likely at fault.
> 

Nothing's failing currently. I'll try that if it does.

> Samsung currently has a machine that appears to be affected and is trying to
> figure out what's going on.  There's some reason to believe that the problem
> is triggered by specific combinations of laptop and SSD.  I'll obviously
> need to update the blacklist to fix your system (in lieu of a better
> workaround that still lets you get some power savings), but I need the info
> above to figure out what the blacklist entry should look like.

Please don't; there's no reason to jump the gun and cut features just yet, since my system is running fine currently. I'll update this bug report tomorrow if it fails, or close it if it doesn't.

I still think the problem might be elsewhere entirely.
Comment 7 atomnuker 2017-03-27 15:36:32 UTC
(In reply to atomnuker from comment #6)
> (In reply to Andy Lutomirski from comment #4)
> > Hi atomnuker-
> > 
> > This report is stranger than the other two because your SSD is more similar
> > to mine, and mine works just fine.  Can you give me a bit more information?
> > 
> > 1. The raw device identification.  A new enough smartctl will show it in
> > 'smartctl -i /dev/nvme0'.  Even better would be the full output of 'nvme
> > id-ctrl /dev/nvme0' -- you can find the 'nvme' tool in the nvme-cli
> > package.
> > 
> 
> Posted in the previous comment.
> 
> > 2. What kind of computer is this?  Is it a laptop?  Is the affected disk
> > something that came with the laptop?
> > 
> 
> Yes, it's an XPS 15 (9550).

Forgot to mention: the disk did not come with the laptop. A year ago it was actually cheaper to buy the HDD + 32 GB SSD configuration and get the SSD separately than to buy the model with the SSD fitted.

> 
> > 3. Can you try booting with nvme_core.default_ps_max_latency_us=0?  That
> > will disable the power-saving feature that is likely at fault.
> > 
> 
> Nothing's failing currently. I'll try that if it does.
> 
> > Samsung currently has a machine that appears to be affected and is trying
> > to figure out what's going on.  There's some reason to believe that the
> > problem is triggered by specific combinations of laptop and SSD.  I'll
> > obviously need to update the blacklist to fix your system (in lieu of a
> > better workaround that still lets you get some power savings), but I need
> > the info above to figure out what the blacklist entry should look like.
> 
> Please don't, no reason to jump the gun and cut features just yet, my system
> is running fine currently. Let me update this bug report tomorrow if it
> fails or close it if it doesn't.
> 
> I still think the problem might be elsewhere entirely.

*might have been, unless it fails.
Comment 8 atomnuker 2017-03-28 08:49:59 UTC
(In reply to Andy Lutomirski from comment #4)
> Hi atomnuker-
> 
> This report is stranger than the other two because your SSD is more similar
> to mine, and mine works just fine.  Can you give me a bit more information?
> 
> 1. The raw device identification.  A new enough smartctl will show it in
> 'smartctl -i /dev/nvme0'.  Even better would be the full output of 'nvme
> id-ctrl /dev/nvme0' -- you can find the 'nvme' tool in the nvme-cli package.
> 
> 2. What kind of computer is this?  Is it a laptop?  Is the affected disk
> something that came with the laptop?
> 
> 3. Can you try booting with nvme_core.default_ps_max_latency_us=0?  That
> will disable the power-saving feature that is likely at fault.
> 
> Samsung currently has a machine that appears to be affected and is trying to
> figure out what's going on.  There's some reason to believe that the problem
> is triggered by specific combinations of laptop and SSD.  I'll obviously
> need to update the blacklist to fix your system (in lieu of a better
> workaround that still lets you get some power savings), but I need the info
> above to figure out what the blacklist entry should look like.
> 
> Thanks,
> Andy

Never mind, it crashed again after 13 or so hours, and then again after 20-odd minutes. It's very random, but when I set default_ps_max_latency_us it's fine. Go ahead and quirk it; bloody firmware. I can't even update it.
Comment 9 The Linux kernel's regression tracker (Thorsten Leemhuis) 2017-04-09 16:51:11 UTC
@Luto: What's the status here? Do you need any more information to fix this?
Comment 10 Andy Lutomirski 2017-04-09 17:15:34 UTC
atomnuker-

Can you post actual smartctl -i or nvme id-ctrl output?  I'm trying to get the full device ID to quirk it.  The lspci data isn't enough because it doesn't differentiate between devices (Samsung reused the ID for a lot of drives).

I'm a bit surprised because your SSD has the exact same firmware as mine, and mine works fine.
Comment 11 atomnuker 2017-04-09 17:47:18 UTC
smartctl:
Model Number:                       Samsung SSD 950 PRO 256GB
Serial Number:                      S2GLNCAGC28126Y
Firmware Version:                   1B0QBXX7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Controller ID:                      1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          256,060,514,304 [256 GB]
Namespace 1 Utilization:            200,372,977,664 [200 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Sun Apr  9 18:43:43 2017 BST

nvme id-ctrl:
vid     : 0x144d
ssvid   : 0x144d
sn      : S2GLNCAGC28126Y     
mn      : Samsung SSD 950 PRO 256GB               
fr      : 1B0QBXX7
rab     : 2
ieee    : 002538
cmic    : 0
mdts    : 5
cntlid  : 1
ver     : 0
rtd3r   : 0
rtd3e   : 0
oaes    : 0
oacs    : 0x7
acl     : 7
aerl    : 3
frmw    : 0x6
lpa     : 0x1
elpe    : 63
npss    : 4
avscc   : 0x1
apsta   : 0x1
wctemp  : 0
cctemp  : 0
mtfa    : 0
hmpre   : 0
hmmin   : 0
tnvmcap : 0
unvmcap : 0
rpmbs   : 0
sqes    : 0x66
cqes    : 0x44
nn      : 1
oncs    : 0x1f
fuses   : 0
fna     : 0x4
vwc     : 0x1
awun    : 255
awupf   : 0
nvscc   : 1
acwu    : 0
sgls    : 0
subnqn  : 
ps    0 : mp:6.50W operational enlat:5 exlat:5 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:5.80W operational enlat:30 exlat:30 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:3.60W operational enlat:100 exlat:100 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0700W non-operational enlat:500 exlat:5000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0050W non-operational enlat:2000 exlat:22000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-

> I'm a bit surprised because your SSD has the exact same firmware as mine, and
> mine works fine.
Odd. Does your device ID match mine? I'm giving 4.11-rc6 another go with the default default_ps_max_latency_us.
Comment 12 atomnuker 2017-04-09 20:00:47 UTC
Nope, 4.11-rc6 still makes the whole machine freeze with default_ps_max_latency_us=25000. I guess Samsung couldn't be bothered to write good firmware (or even to change IDs, as you said).
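
Why a limit of 25000 still leaves the deep power states in play can be seen from the latencies in the id-ctrl output in comment 11. A rough Python sketch, assuming the driver only lets the drive enter a non-operational state autonomously when that state's entry plus exit latency fits within default_ps_max_latency_us; that selection rule is an assumption made here, and the kernel's actual logic may differ:

```python
import re

# Power-state lines copied from the id-ctrl output in comment 11.
ID_CTRL_PS = """\
ps    0 : mp:6.50W operational enlat:5 exlat:5 rrt:0 rrl:0
ps    1 : mp:5.80W operational enlat:30 exlat:30 rrt:1 rrl:1
ps    2 : mp:3.60W operational enlat:100 exlat:100 rrt:2 rrl:2
ps    3 : mp:0.0700W non-operational enlat:500 exlat:5000 rrt:3 rrl:3
ps    4 : mp:0.0050W non-operational enlat:2000 exlat:22000 rrt:4 rrl:4
"""

PS_LINE = re.compile(
    r"ps\s+(\d+)\s*:\s*mp:\S+\s+(operational|non-operational)"
    r"\s+enlat:(\d+)\s+exlat:(\d+)"
)


def eligible_states(text, max_latency_us):
    """List non-operational states whose enlat + exlat fits the limit."""
    states = []
    for match in PS_LINE.finditer(text):
        index = int(match.group(1))
        kind = match.group(2)
        total_us = int(match.group(3)) + int(match.group(4))
        if kind != "non-operational":
            continue  # operational states are not idle targets in this model
        if total_us > max_latency_us:
            continue  # too slow to enter and leave under the assumed rule
        states.append(index)
    return states


if __name__ == "__main__":
    for limit in (25000, 10000, 0):
        print(limit, eligible_states(ID_CTRL_PS, limit))
```

Under this assumption, ps 3 (500 + 5000 = 5500 us) and ps 4 (2000 + 22000 = 24000 us) both fit within 25000, while a limit of 0 excludes every state.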
Comment 13 Chris Roth 2017-04-12 01:18:46 UTC
Also posted in bug 195039. I don't know if that breaks a rule. If it does, feel free to delete one of the posts.

I'm running into what I think is the same or a related problem.
When I upgraded to Ubuntu 17.04 beta 2 (which I believe uses kernel 4.10), I started having crashes after anywhere from 10 minutes to an hour. The operating system would state that the disk is now read-only and/or give I/O errors. I downgraded to 16.10 and now have no more problems of this kind.

I'm also using a Dell 9550.

Output of 'nvme id-ctrl /dev/nvme0'

NVME Identify Controller:
vid     : 0x144d
ssvid   : 0x144d
sn      :       S29PNXAH124276
mn      : PM951 NVMe SAMSUNG 512GB                
fr      : BXV77D0Q
rab     : 2
ieee    : 002538
cmic    : 0
mdts    : 5
cntlid  : 1
ver     : 0
rtd3r   : 0
rtd3e   : 0
oaes    : 0
oacs    : 0x17
acl     : 7
aerl    : 3
frmw    : 0x6
lpa     : 0
elpe    : 63
npss    : 4
avscc   : 0x1
apsta   : 0x1
wctemp  : 0
cctemp  : 0
mtfa    : 0
hmpre   : 0
hmmin   : 0
tnvmcap : 0
unvmcap : 0
rpmbs   : 0
sqes    : 0x66
cqes    : 0x44
nn      : 1
oncs    : 0x1f
fuses   : 0
fna     : 0
vwc     : 0x1
awun    : 255
awupf   : 0
nvscc   : 1
acwu    : 0
sgls    : 0
ps    0 : mp:6.00W operational enlat:5 exlat:5 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:4.20W operational enlat:30 exlat:30 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:3.10W operational enlat:100 exlat:100 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0700W non-operational enlat:500 exlat:5000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0050W non-operational enlat:2000 exlat:22000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-
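
To compare a report like this one with the output in comment 11, the identifying fields can be pulled out of the id-ctrl text. A minimal Python sketch; parse_id_ctrl and identity are hypothetical helpers, and the "name : value" line format is assumed from the output pasted in this thread, not taken from nvme-cli documentation:

```python
# A few lines copied from the id-ctrl output above.
SAMPLE = """\
NVME Identify Controller:
vid     : 0x144d
ssvid   : 0x144d
sn      :       S29PNXAH124276
mn      : PM951 NVMe SAMSUNG 512GB
fr      : BXV77D0Q
ps    0 : mp:6.00W operational enlat:5 exlat:5 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
"""


def parse_id_ctrl(text):
    """Collect simple "name : value" fields, skipping power-state lines."""
    fields = {}
    for line in text.splitlines():
        if not line or line[0].isspace():
            continue  # blank line or indented continuation of a ps entry
        key, sep, value = line.partition(":")
        key = key.strip()
        if not sep or " " in key:
            continue  # header line or "ps  N" power-state line
        fields[key] = value.strip()
    return fields


def identity(fields):
    """Return (vendor ID, model name, firmware revision)."""
    return (int(fields["vid"], 16), fields["mn"], fields["fr"])


if __name__ == "__main__":
    print(identity(parse_id_ctrl(SAMPLE)))
```

The resulting triple differs from the 950 PRO in comment 11 in both model name and firmware revision, even though the vendor ID is the same.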
Comment 14 Andy Lutomirski 2017-04-12 04:04:49 UTC
Can someone with appropriate permissions mark this a duplicate of bug 195039?