Created attachment 255337: 2 concatenated logs of the panic. Hi, I'm not sure whether this is the correct category; the panic trace is vague and only reports a NULL dereference somewhere. The issue happens with the current git master of the kernel, though I have observed it as far back as the start of the 4.11 merge window. I didn't report it earlier because I assumed it would be fixed in an RC. After the NVMe controller gets reset for some reason, it reports a capacity change, a burst of btrfs errors is logged, and then the kernel panics. It seems to happen at random: I've seen it anywhere from 2 minutes to 2 hours after boot. I'm happy to reproduce it with a kernel built with more debugging enabled if someone tells me what to turn on. I've attached the 2 full dmesg+oops logs I recovered from pstore, concatenated into one file because the form doesn't let me attach more.
Does this problem already happen on older versions (<= 4.10), or is it new with 4.11-rc? In the latter case it might qualify as a regression. BTW: What kind of NVMe device is this? As far as I can see, that info is missing from your dmesg outputs :-/
No, the problem doesn't occur with 4.10 or any older versions. The NVMe is a Samsung 950 Pro 256GB. I'm currently running 4.11-rc4 and haven't had any crashes in 3 and a half hours of heavy usage. The bug might have been fixed (and I possibly mis-categorized it under flash memory because of the vague dmesg), so if I can't reproduce it after a day I'll close this bug. Cheers
(In reply to atomnuker from comment #2)
> No, the problem doesn't occur with 4.10 or any older versions.
> The NVMe is a Samsung 950 Pro 256GB.
Hmmm, some of the Samsung 950s have a known problem with APST, which has been enabled since https://git.kernel.org/torvalds/c/c5552fde102fcc3f2cf9e502b8ac90e3500d8fdf
See Bug 195039 for another owner of a Samsung SSD who ran into problems. Please attach your dmesg output so we can clearly identify what model and firmware it is.
> I'm currently running 4.11-rc4 and haven't had any crashes in 3 and a half
> hours of heavy usage.
In the other bug entry Jens just asked: "[…] can you try with -rc4, and revert commit c5552fde10? I just checked, it reverts cleanly. […]" Might be worth a try in case the problem shows up again. Note: it's a PM feature, so it might trigger less often when there is "heavy usage" ;-)
> I mis-categorized it under flash memory
Yes, it's likely a block-layer bug, but that doesn't matter much (and I can't change it :-/ ).
Hi atomnuker-
This report is stranger than the other two because your SSD is more similar to mine, and mine works just fine. Can you give me a bit more information?
1. The raw device identification. A new enough smartctl will show it in 'smartctl -i /dev/nvme0'. Even better would be the full output of 'nvme id-ctrl /dev/nvme0' -- you can find the 'nvme' tool in the nvme-cli package.
2. What kind of computer is this? Is it a laptop? Is the affected disk something that came with the laptop?
3. Can you try booting with nvme_core.default_ps_max_latency_us=0? That will disable the power-saving feature that is likely at fault.
Samsung currently has a machine that appears to be affected and is trying to figure out what's going on. There's some reason to believe that the problem is triggered by specific combinations of laptop and SSD. I'll obviously need to update the blacklist to fix your system (in lieu of a better workaround that still lets you get some power savings), but I need the info above to figure out what the blacklist entry should look like.
Thanks,
Andy
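For anyone trying the boot parameter from item 3, here is a minimal sketch for checking which value actually took effect after reboot. It assumes the parameter is exposed at the usual sysfs location for module parameters; `PARAM_PATH` and both helper names are illustrative and do not come from this thread.

```python
from pathlib import Path

# ASSUMPTION: conventional sysfs location of the nvme_core module parameter.
PARAM_PATH = Path("/sys/module/nvme_core/parameters/default_ps_max_latency_us")


def read_ps_max_latency_us(path=PARAM_PATH):
    """Return the configured latency limit in microseconds, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # File missing (module not loaded, older kernel) or unexpected content.
        return None


def apst_disabled(path=PARAM_PATH):
    """True only when the parameter is readable and set to 0."""
    return read_ps_max_latency_us(path) == 0


if __name__ == "__main__":
    print("default_ps_max_latency_us =", read_ps_max_latency_us())
```

If the value printed is not 0 after booting with the option, the parameter most likely did not reach the kernel command line.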
(In reply to Thorsten Leemhuis from comment #3)
> (In reply to atomnuker from comment #2)
> > No, the problem doesn't occur with 4.10 or any older versions.
> > The NVMe is a Samsung 950 Pro 256GB.
>
> Hmmm, some of the Samsung 950s have a known problem with APST, which is now
> enabled since
> https://git.kernel.org/torvalds/c/c5552fde102fcc3f2cf9e502b8ac90e3500d8fdf
> See Bug 195039 for another owner of a Samsung SSD that ran into problems.
> Please attach your dmesg output so we can clearly identify what model and
> firmware it is.
Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller SM951/PM951
Firmware: 1B0QBXX7
This seems to be the model that has the NO_APST quirk. However, my firmware version is 1B0QBXX7 while the quirked one is BXW75D0Q, so APST seems to be active (dmesg doesn't show anything at all about the NVMe device; only /sys/class/nvme/ and lspci do).
> > I'm currently running 4.11-rc4 and haven't had any crashes in 3 and a half
> > hours of heavy usage.
>
> In the other bug entry Jens just asked: "[…] can you try with -rc4, and
> revert commit c5552fde10? I just checked, it reverts cleanly. […]" Might be
> worth a try in case the problem shows up again. Note: It's a PM feature, so
> it might trigger less often where there is "heavy usage" ;-)
I took an hour-long break, left it idling, came back, and it's still alive. I don't think it's going to crash.
(In reply to Andy Lutomirski from comment #4)
> Hi atomnuker-
>
> This report is stranger than the other two because your SSD is more similar
> to mine, and mine works just fine. Can you give me a bit more information?
>
> 1. The raw device identification. A new enough smartctl will show it in
> 'smartctl -i /dev/nvme0'. Even better would be the full output of 'nvme
> id-ctrl /dev/nvme0' -- you can find the 'nvme' tool in the nvme-cli package.
Posted in the previous comment.
> 2. What kind of computer is this? Is it a laptop? Is the affected disk
> something that came with the laptop?
Yes, it's an XPS 15 (9550).
> 3. Can you try booting with nvme_core.default_ps_max_latency_us=0? That
> will disable the power-saving feature that is likely at fault.
Nothing's failing currently. I'll try that if it does.
> Samsung currently has a machine that appears to be affected and is trying to
> figure out what's going on. There's some reason to believe that the problem
> is triggered by specific combinations of laptop and SSD. I'll obviously
> need to update the blacklist to fix your system (in lieu of a better
> workaround that still lets you get some power savings), but I need the info
> above to figure out what the blacklist entry should look like.
Please don't; there's no reason to jump the gun and cut features just yet, as my system is running fine currently. Let me update this bug report tomorrow if it fails, or close it if it doesn't.
I still think the problem might be elsewhere entirely.
(In reply to atomnuker from comment #6)
> (In reply to Andy Lutomirski from comment #4)
> > Hi atomnuker-
> >
> > This report is stranger than the other two because your SSD is more similar
> > to mine, and mine works just fine. Can you give me a bit more information?
> >
> > 1. The raw device identification. A new enough smartctl will show it in
> > 'smartctl -i /dev/nvme0'. Even better would be the full output of 'nvme
> > id-ctrl /dev/nvme0' -- you can find the 'nvme' tool in the nvme-cli
> > package.
>
> Posted in the previous comment.
>
> > 2. What kind of computer is this? Is it a laptop? Is the affected disk
> > something that came with the laptop?
>
> Yes, its an XPS 15 (9550).
Forgot to mention: the SSD did not come with the laptop. A year ago it was actually cheaper to buy the HDD + 32 GB SSD configuration and get the SSD separately than to buy the SSD-fitted model.
> > 3. Can you try booting with nvme_core.default_ps_max_latency_us=0? That
> > will disable the power-saving feature that is likely at fault.
>
> Nothing's failing currently. I'll try that if it does.
>
> > Samsung currently has a machine that appears to be affected and is trying
> > to figure out what's going on. There's some reason to believe that the
> > problem is triggered by specific combinations of laptop and SSD. I'll
> > obviously need to update the blacklist to fix your system (in lieu of a
> > better workaround that still lets you get some power savings), but I need
> > the info above to figure out what the blacklist entry should look like.
>
> Please don't, no reason to jump the gun and cut features just yet, my system
> is running fine currently. Let me update this bug report tomorrow if it
> fails or close it if it doesn't.
>
> I still think the problem might be elsewhere entirely.
*might have been, unless it fails.
(In reply to Andy Lutomirski from comment #4)
> Hi atomnuker-
>
> This report is stranger than the other two because your SSD is more similar
> to mine, and mine works just fine. Can you give me a bit more information?
>
> 1. The raw device identification. A new enough smartctl will show it in
> 'smartctl -i /dev/nvme0'. Even better would be the full output of 'nvme
> id-ctrl /dev/nvme0' -- you can find the 'nvme' tool in the nvme-cli package.
>
> 2. What kind of computer is this? Is it a laptop? Is the affected disk
> something that came with the laptop?
>
> 3. Can you try booting with nvme_core.default_ps_max_latency_us=0? That
> will disable the power-saving feature that is likely at fault.
>
> Samsung currently has a machine that appears to be affected and is trying to
> figure out what's going on. There's some reason to believe that the problem
> is triggered by specific combinations of laptop and SSD. I'll obviously
> need to update the blacklist to fix your system (in lieu of a better
> workaround that still lets you get some power savings), but I need the info
> above to figure out what the blacklist entry should look like.
>
> Thanks,
> Andy
Never mind, it crashed again after 13 or so hours, and again after 20-odd minutes. It's very random, but when I set default_ps_max_latency_us as suggested it's fine. Go ahead and QUIRK it, bloody firmware. Can't even update it.
@Luto: What's the status here? Do you need any more information to fix this?
atomnuker- Can you post actual smartctl -i or nvme id-ctrl output? I'm trying to get the full device ID to quirk it. The lspci data isn't enough because it doesn't differentiate between devices (Samsung reused the ID for a lot of drives). I'm a bit surprised because your SSD has the exact same firmware as mine, and mine works fine.
smartctl:
Model Number: Samsung SSD 950 PRO 256GB
Serial Number: S2GLNCAGC28126Y
Firmware Version: 1B0QBXX7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Controller ID: 1
Number of Namespaces: 1
Namespace 1 Size/Capacity: 256,060,514,304 [256 GB]
Namespace 1 Utilization: 200,372,977,664 [200 GB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Sun Apr 9 18:43:43 2017 BST
nvme id-ctrl:
vid : 0x144d
ssvid : 0x144d
sn : S2GLNCAGC28126Y
mn : Samsung SSD 950 PRO 256GB
fr : 1B0QBXX7
rab : 2
ieee : 002538
cmic : 0
mdts : 5
cntlid : 1
ver : 0
rtd3r : 0
rtd3e : 0
oaes : 0
oacs : 0x7
acl : 7
aerl : 3
frmw : 0x6
lpa : 0x1
elpe : 63
npss : 4
avscc : 0x1
apsta : 0x1
wctemp : 0
cctemp : 0
mtfa : 0
hmpre : 0
hmmin : 0
tnvmcap : 0
unvmcap : 0
rpmbs : 0
sqes : 0x66
cqes : 0x44
nn : 1
oncs : 0x1f
fuses : 0
fna : 0x4
vwc : 0x1
awun : 255
awupf : 0
nvscc : 1
acwu : 0
sgls : 0
subnqn :
ps 0 : mp:6.50W operational enlat:5 exlat:5 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:5.80W operational enlat:30 exlat:30 rrt:1 rrl:1 rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:3.60W operational enlat:100 exlat:100 rrt:2 rrl:2 rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.0700W non-operational enlat:500 exlat:5000 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0050W non-operational enlat:2000 exlat:22000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:- active_power:-
> I'm a bit surprised because your SSD has the exact same firmware as mine, and
> mine works fine.
Odd. Does your device ID match mine?
I'm giving 4.11-rc6 another go with the default default_ps_max_latency_us.
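Since the PCI ID alone does not distinguish these drives, the fields that matter for identifying the device are `vid`, `mn` and `fr` from the id-ctrl output. Below is a small sketch that pulls them out of the text form of that output. The function names are made up for illustration, and the parser assumes the simple `key : value` layout seen in this comment; it is not part of nvme-cli.

```python
def parse_id_ctrl(text):
    """Parse 'key : value' lines from `nvme id-ctrl` text output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        # Skip lines without a separator, and keys containing spaces
        # (the header line and the multi-field "ps N" power state lines).
        if not sep or not key or " " in key:
            continue
        fields[key] = value.strip()
    return fields


def device_identity(fields):
    """Return the (vendor ID, model number, firmware revision) triple."""
    return (fields.get("vid"), fields.get("mn"), fields.get("fr"))


# Excerpt of the output posted in this comment.
SAMPLE = """\
vid : 0x144d
mn : Samsung SSD 950 PRO 256GB
fr : 1B0QBXX7
ps 0 : mp:6.50W operational enlat:5 exlat:5
"""

if __name__ == "__main__":
    print(device_identity(parse_id_ctrl(SAMPLE)))
```

Comparing this triple between two machines is a quick way to answer the "does your device ID match mine" question without reading the full dump.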
Nope, 4.11-rc6 is still making the whole machine freeze with default_ps_max_latency_us=25000. I guess Samsung couldn't be bothered to write good firmware (nor even change IDs like you said).
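To see why the 25000 µs default does not help here, the following rough sketch applies the latency limit to the power state table posted above. It assumes a simplified rule: a state is a candidate idle target only if it is non-operational and its entry plus exit latency fits within the limit. That rule is an assumption made for illustration; the driver's actual selection logic may differ in detail.

```python
from typing import NamedTuple


class PowerState(NamedTuple):
    index: int
    operational: bool
    enlat_us: int  # entry latency in microseconds
    exlat_us: int  # exit latency in microseconds


# Power state table from the 950 PRO id-ctrl output in this thread.
STATES = [
    PowerState(0, True, 5, 5),
    PowerState(1, True, 30, 30),
    PowerState(2, True, 100, 100),
    PowerState(3, False, 500, 5000),
    PowerState(4, False, 2000, 22000),
]


def eligible_idle_states(states, max_latency_us):
    """Indices of non-operational states whose entry + exit latency fits the limit.

    ASSUMPTION: simplified eligibility rule, not the driver's actual code.
    """
    return [
        s.index
        for s in states
        if not s.operational and s.enlat_us + s.exlat_us <= max_latency_us
    ]


if __name__ == "__main__":
    for limit in (0, 10000, 25000):
        print(limit, eligible_idle_states(STATES, limit))
```

Under this rule, a limit of 25000 still admits both non-operational states (ps 3 and ps 4), while a limit of 0 admits none, which matches the observation that only the zero setting avoids the freezes.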
Also posted in bug 195039. I don't know if that breaks a rule. If it does, feel free to delete one of the posts.
I'm running into what I think is the same or a related problem. When I upgraded to Ubuntu 17.04 beta 2 (which I believe is kernel 4.10), I started having crashes after anywhere from 10 minutes to an hour. The operating system would report that the disk is now read-only and/or give I/O errors. I downgraded to 16.10 and now have no more problems of this kind. I'm also using a Dell 9550.
Output of 'nvme id-ctrl /dev/nvme0':
NVME Identify Controller:
vid : 0x144d
ssvid : 0x144d
sn : S29PNXAH124276
mn : PM951 NVMe SAMSUNG 512GB
fr : BXV77D0Q
rab : 2
ieee : 002538
cmic : 0
mdts : 5
cntlid : 1
ver : 0
rtd3r : 0
rtd3e : 0
oaes : 0
oacs : 0x17
acl : 7
aerl : 3
frmw : 0x6
lpa : 0
elpe : 63
npss : 4
avscc : 0x1
apsta : 0x1
wctemp : 0
cctemp : 0
mtfa : 0
hmpre : 0
hmmin : 0
tnvmcap : 0
unvmcap : 0
rpmbs : 0
sqes : 0x66
cqes : 0x44
nn : 1
oncs : 0x1f
fuses : 0
fna : 0
vwc : 0x1
awun : 255
awupf : 0
nvscc : 1
acwu : 0
sgls : 0
ps 0 : mp:6.00W operational enlat:5 exlat:5 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:4.20W operational enlat:30 exlat:30 rrt:1 rrl:1 rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:3.10W operational enlat:100 exlat:100 rrt:2 rrl:2 rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.0700W non-operational enlat:500 exlat:5000 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0050W non-operational enlat:2000 exlat:22000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:- active_power:-
Can someone with appropriate permissions mark this a duplicate of bug 195039?