Bug 208123
Summary: | Kernel crashes due to NVMe disk: WD Blue SN550 (WDC WDS100T2B0C) | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | rugk (jml86khakons) |
Component: | NVMe | Assignee: | IO/NVME Virtual Default Assignee (io_nvme) |
Status: | NEW --- | ||
Severity: | high | CC: | akostadinov, andypoo, elvis.angelaccio, julian, kbusch, kilian2798, m.gelpke, marthasimons8888, noodles, richrocksmyworld, roxma |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 5.6.15-300.fc32.x86_64 | Subsystem: | |
Regression: | No | Bisected commit-id: |
Description
rugk
2020-06-10 10:53:36 UTC
> Did you try disabling APST?

Okay, so I've tried adding:
> nvme_core.default_ps_max_latency_us=5500
…as a kernel parameter.
(Which is, by the way, very convenient to do on Silverblue; just run:
$ rpm-ostree kargs --append=nvme_core.default_ps_max_latency_us=5500
I could also roll back, which of course I needed to do because I had accidentally changed another kernel parameter.)
And it seems to work so far. (But I'll report back soon if something happens.)
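
For anyone who wants to double-check that the parameter actually reached the running kernel, here is a minimal sketch that pulls the value out of a kernel command line string. The function name is made up for illustration, it only accepts plain decimal values, and it assumes that the last occurrence of the parameter is the one that takes effect.

```python
from typing import Optional

# Parameter name as used in this report.
PARAM = "nvme_core.default_ps_max_latency_us"


def ps_max_latency_from_cmdline(cmdline: str) -> Optional[int]:
    """Return the value given for PARAM on a kernel command line, or None.

    Hypothetical helper: decimal values only, and the last occurrence
    is assumed to win if the parameter appears more than once.
    """
    value = None
    for token in cmdline.split():
        key, sep, raw = token.partition("=")
        if sep and key == PARAM:
            try:
                value = int(raw)
            except ValueError:
                # Skip malformed values instead of guessing.
                continue
    return value
```

On a live system the input would be the contents of /proc/cmdline, for example `ps_max_latency_from_cmdline(open("/proc/cmdline").read())`. The value the module is currently using should also be readable from /sys/module/nvme_core/parameters/default_ps_max_latency_us.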
-- LOGS --
# nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0
temperature : 35 C
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 0%
endurance group critical warning summary: 0
data_units_read : 109.350
data_units_written : 187.970
host_read_commands : 2.017.447
host_write_commands : 1.014.888
controller_busy_time : 6
power_cycles : 41
power_on_hours : 37
unsafe_shutdowns : 29
media_errors : 0
num_err_log_entries : 1
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
# smartctl -t short /dev/nvme0
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.6.15-300.fc32.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
NVMe device successfully opened
Use 'smartctl -a' (or '-x') to print SMART (and more) information
[root@fedidea rugk]# smartctl -a /dev/nvme0
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.6.15-300.fc32.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: WDC WDS100T2B0C-00PXH0
Serial Number: ********
Firmware Version: 211070WD
PCI Vendor/Subsystem ID: 0x15b7
IEEE OUI Identifier: 0x001b44
Total NVM Capacity: 1.000.204.886.016 [1,00 TB]
Unallocated NVM Capacity: 0
Controller ID: 1
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1.000.204.886.016 [1,00 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 001b44 4a4408edc8
Local Time is: Fri Jun 12 00:29:54 2020 CEST
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size: 128 Pages
Warning Comp. Temp. Threshold: 80 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Namespace 1 Features (0x02): NA_Fields
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 3.50W 2.90W - 0 0 0 0 0 0
1 + 2.70W 1.80W - 0 0 0 0 0 0
2 + 1.90W 1.50W - 0 0 0 0 0 0
3 - 0.0200W - - 3 3 3 3 3900 11000
4 - 0.0050W - - 4 4 4 4 5000 39000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 2
1 - 4096 0 1
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 34 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 109.515 [56,0 GB]
Data Units Written: 188.161 [96,3 GB]
Host Read Commands: 2.031.364
Host Write Commands: 1.019.201
Controller Busy Time: 6
Power Cycles: 41
Power On Hours: 37
Unsafe Shutdowns: 29
Media and Data Integrity Errors: 0
Error Information Log Entries: 1
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Error Information (NVMe Log 0x01, max 256 entries)
No Errors Logged
# nvme get-feature -f 0x0c -H /dev/nvme0
get-feature:0xc (Autonomous Power State Transition), Current value:0x000001
Autonomous Power State Transition Enable (APSTE): Enabled
Auto PST Entries .................
Entry[ 0]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[ 1]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[ 2]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[ 3]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[ 4]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[ 5]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[ 6]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[ 7]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[ 8]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[ 9]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[10]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[11]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[12]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[13]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[14]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[15]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[16]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[17]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[18]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[19]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[20]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[21]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[22]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[23]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[24]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[25]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[26]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[27]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[28]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[29]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[30]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
Entry[31]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
# rpm-ostree kargs
resume=/dev/mapper/fedora-swap rd.lvm.lv=fedora/root rd.luks.uuid=luks-08a02aba-fa31-[…] rd.lvm.lv=fedora/swap rhgb quiet root=/dev/mapper/luks-08a02aba-fa31-[…] ostree=/ostree/boot.1/fedora/902a099ca320af417c8297a1bff8ce4dda8[…]/0 nvme_core.default_ps_max_latency_us=5500
So disabling APST seems to work as a workaround in 99% of cases. I still sometimes experience system freezes, but I could find nothing in the logs, so I'm not sure whether they are related to this issue.

Anyway, [on the forum](https://ask.fedoraproject.org/t/investigating-kernel-crashes-due-to-nvme-disk/7620/5?u=rugk) it is speculated that this is related to the Ryzen processor and the SN550 NVMe, because there is a very similar bug report on Reddit: https://www.reddit.com/r/pop_os/comments/g8y3ae/experiencing_random_freezes_after_installing_new/

If there is any other information I can provide, feel free to ask.

I experience the same behaviour. Even with the mentioned kernel parameter, which "fixes" it, I still get sporadic crashes with no logs indicating what happened or what led to the crash. Just a blinking Caps Lock LED. I will gladly provide more info if requested.

I've got the same disk and see the same issue. I had kern.log open and waited for the bug to occur; what happens is a "disconnect, reconnect, reset", and the reset fails with -19. I'm trying to tee the log onto a network drive or USB thumb drive and maybe take a screenshot of it. Is there something I could do to log more information about this issue?

WD has released new firmware for the WD Blue SN550: 21120WD. Use their Windows tool to verify and update the firmware of the SSD. The tool is called “dashboard”.

Is this fixed with the latest kernel/firmware?

I am affected by the same issue. Unfortunately, the laptop in question is in another country right now, so I do not have the ability to attempt the firmware update.

Kernel: 5.10.8 (sorry for the old kernel here)
Model Number: WDC WDS100T2B0C-00PXH0
Serial Number: 20324B802592
Firmware Version: 211070WD
04:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD Blue SN550 NVMe SSD [15b7:5009] (rev 01)

I managed to capture the dmesg output during a failure by keeping a log running with dmesg -w from another machine.
[ 7337.662382] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[ 7337.701297] nvme 0000:04:00.0: can't change power state from D3cold to D0 (config space inaccessible)
[ 7337.701559] nvme nvme0: Removing after probe failure status: -19

Followed by constant block/fs related errors which never recover.

Forgot to mention, this is also running on an AMD Ryzen laptop.

Metabox NL50RU/NL5xRU, BIOS 1.07.05TMB 12/15/2020
CPU0: AMD Ryzen 5 4500U with Radeon Graphics (family: 0x17, model: 0x60, stepping: 0x1)

The machine crashes sometimes after a couple of hours, or sometimes after a couple of days. By adding "nvme_core.default_ps_max_latency_us=0" to the boot line, the crashes went away and I am now enjoying 30+ day uptimes.

I updated the firmware in my WDS100T2B0C-00PXH0 to 211210WD last week and removed the nvme_core.default_ps_max_latency_us=5000 from the kernel command line. 5 days later still no crashes, despite several suspend/resume cycles and a lot of use. Looks like the new firmware has resolved things for me.

(In reply to Jonathan McDowell from comment #12)
> I updated the firmware in my WDS100T2B0C-00PXH0 to 211210WD last week and
> removed the nvme_core.default_ps_max_latency_us=5000 from the kernel command
> line. 5 days later still no crashes, despite several suspend/resume cycles
> and a lot of use. Looks like the new firmware has resolved things for me.

Hi Jonathan, could you share the output of "nvme id-ctrl /dev/nvmeXnY" on your WDS100T2B0C-00PXH0? I found a similar issue that seems to be highly related to "elpe: 255". After I set the elpe to 63 (by the way, I'm an NVMe SSD developer), no crash was observed. I googled for WDS100T2B0C and "id-ctrl" and found https://netlog.jpn.org/r271-635/2021/08/ssd_wdblue_sn550.html. The elpe was 255 at the time. Please let me know whether it is the same value on yours.

I'm seeing:

elpe : 255

on my upgraded device.
Full details with serial number redacted:

NVME Identify Controller:
vid : 0x15b7
ssvid : 0x15b7
sn : xxxxxxxxxxxx
mn : WDC WDS100T2B0C-00PXH0
fr : 211210WD
rab : 4
ieee : 001b44
cmic : 0
mdts : 7
cntlid : 0x1
ver : 0x10400
rtd3r : 0x7a120
rtd3e : 0xf4240
oaes : 0x200
ctratt : 0x2
rrls : 0
cntrltype : 1
fguid :
crdt1 : 0
crdt2 : 0
crdt3 : 0
oacs : 0x17
acl : 4
aerl : 7
frmw : 0x14
lpa : 0x1e
elpe : 255
npss : 4
avscc : 0x1
apsta : 0x1
wctemp : 353
cctemp : 358
mtfa : 50
hmpre : 51200
hmmin : 823
tnvmcap : 1000204886016
unvmcap : 0
rpmbs : 0
edstt : 70
dsto : 1
fwug : 1
kas : 0
hctma : 0x1
mntmt : 273
mxtmt : 358
sanicap : 0x60000002
hmminds : 0
hmmaxd : 8
nsetidmax : 0
endgidmax : 0
anatt : 0
anacap : 0
anagrpmax : 0
nanagrpid : 0
pels : 1
sqes : 0x66
cqes : 0x44
maxcmd : 0
nn : 1
oncs : 0x5f
fuses : 0
fna : 0
vwc : 0x7
awun : 0
awupf : 0
nvscc : 1
nwpc : 0
acwu : 0
sgls : 0
mnan : 0
subnqn : nqn.2018-01.com.wdc:nguid:E8238FA6BF53-0001-001B448B4xxxxxxx
ioccsz : 0
iorcsz : 0
icdoff : 0
ctrattr : 0
msdbd : 0
ps 0 : mp:3.50W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:0.6300W active_power:2.90W
ps 1 : mp:2.70W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:0.6300W active_power:1.80W
ps 2 : mp:1.90W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:0.6300W active_power:1.50W
ps 3 : mp:0.0250W non-operational enlat:3900 exlat:11000 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:0.0250W active_power:-
ps 4 : mp:0.0050W non-operational enlat:5000 exlat:39000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:0.0050W active_power:-

(In reply to Jonathan McDowell from comment #14)
> I'm seeing:
>
> elpe : 255
> ...

Thank you Jonathan. After some digging, the root cause of my issue turns out to be a firmware bug. It's not a linux kernel issue. Thanks.

FYI I experienced crashing with this drive on an old Lenovo ThinkCentre M625q with AMD E2-9000e CPU. It crashed seemingly as soon as there was no activity on the system for a few seconds.
And my fix was to set `nvme_core.default_ps_max_latency_us=15000` at the kernel command line. This basically disables the deepest power saving state (4) but for my application it is acceptable. It is still working after a couple of days but I didn't test it under high load.

I took the drive from a windows machine and I saw no firmware updates for it reported by some tool installed there. I don't know if there are separate downloads with a newer version though.

$ sudo nvme id-ctrl /dev/nvme0n1
NVME Identify Controller:
vid : 0x15b7
ssvid : 0x15b7
sn : [REDACTED]
mn : WDC PC SN520 SDAPNUW-256G-1006
fr : 20110006
rab : 4
ieee : 001b44
cmic : 0
mdts : 7
cntlid : 0x1
ver : 0x10300
rtd3r : 0x7a120
rtd3e : 0xf4240
oaes : 0x200
ctratt : 0x2
rrls : 0
cntrltype : 0
fguid : 00000000-0000-0000-0000-000000000000
crdt1 : 0
crdt2 : 0
crdt3 : 0
nvmsr : 0
vwci : 0
mec : 0
oacs : 0x17
acl : 4
aerl : 7
frmw : 0x14
lpa : 0x2
elpe : 255
npss : 4
avscc : 0x1
apsta : 0x1
wctemp : 355
cctemp : 359
mtfa : 50
hmpre : 0
hmmin : 0
tnvmcap : 256060514304
unvmcap : 0
rpmbs : 0
edstt : 31
dsto : 1
fwug : 1
kas : 0
hctma : 0x1
mntmt : 273
mxtmt : 359
sanicap : 0
hmminds : 0
hmmaxd : 0
nsetidmax : 0
endgidmax : 0
anatt : 0
anacap : 0
anagrpmax : 0
nanagrpid : 0
pels : 0
domainid : 0
megcap : 0
sqes : 0x66
cqes : 0x44
maxcmd : 0
nn : 1
oncs : 0x1f
fuses : 0
fna : 0
vwc : 0x1
awun : 0
awupf : 0
icsvscc : 1
nwpc : 0
acwu : 0
ocfs : 0
sgls : 0
mnan : 0
maxdna : 0
maxcna : 0
oaqd : 0
subnqn : nqn.2018-01.com.wdc:nguid:1832B5800315-0001-001B448B444C081C
ioccsz : 0
iorcsz : 0
icdoff : 0
fcatt : 0
msdbd : 0
ofcs : 0
ps 0 : mp:2.60W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:- active_power:- active_power_workload:-
ps 1 : mp:2.60W operational enlat:0 exlat:0 rrt:1 rrl:1 rwt:1 rwl:1 idle_power:- active_power:- active_power_workload:-
ps 2 : mp:1.70W operational enlat:0 exlat:0 rrt:2 rrl:2 rwt:2 rwl:2 idle_power:- active_power:- active_power_workload:-
ps 3 : mp:0.0250W non-operational enlat:5000 exlat:9000 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:- active_power:- active_power_workload:-
ps 4 : mp:0.0025W non-operational enlat:5000 exlat:44000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:- active_power:- active_power_workload:-
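
To make the effect of the different `nvme_core.default_ps_max_latency_us` values in this thread easier to see, here is a rough sketch of which power states remain available as idle targets under a given limit. It assumes the rule is that a non-operational state is only used when its exit latency (exlat) fits within the limit, which is my understanding of the driver and not something confirmed in this report. For the values quoted here the outcome would be the same if enlat + exlat were compared instead. The class and function names are invented for illustration; the latencies are copied from the id-ctrl outputs above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PowerState:
    index: int
    operational: bool
    enlat_us: int  # entry latency in microseconds
    exlat_us: int  # exit latency in microseconds


def apst_candidates(states: List[PowerState], max_latency_us: int) -> List[int]:
    """Indices of non-operational states allowed under the latency limit.

    Assumption (not taken from the bug report): a state qualifies when it
    is non-operational and its exit latency does not exceed the limit.
    """
    return [
        ps.index
        for ps in states
        if not ps.operational and ps.exlat_us <= max_latency_us
    ]


# WD Blue SN550 (WDS100T2B0C), values from the id-ctrl output in this thread.
SN550 = [
    PowerState(0, True, 0, 0),
    PowerState(1, True, 0, 0),
    PowerState(2, True, 0, 0),
    PowerState(3, False, 3900, 11000),
    PowerState(4, False, 5000, 39000),
]

# WD PC SN520, values from the last id-ctrl output in this thread.
SN520 = [
    PowerState(0, True, 0, 0),
    PowerState(1, True, 0, 0),
    PowerState(2, True, 0, 0),
    PowerState(3, False, 5000, 9000),
    PowerState(4, False, 5000, 44000),
]

print(apst_candidates(SN550, 5500))   # no low-power state left
print(apst_candidates(SN520, 15000))  # only state 3 left
```

Under this assumption, 5500 (and likewise 5000 or 0) leaves the SN550 with no non-operational state to idle into, which matches the description of that setting as disabling APST, while 15000 on the SN520 keeps state 3 and rules out only the deepest state 4.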