Fedora 32 Silverblue

## What happens

Randomly (I assume when it accesses the file system/the NVMe SSD heavily), the system just freezes and shows me a fullscreen error. It's always some kind of **ext4 error**, but it's a new installation, so the file system itself should be intact.

Here are some errors:

> [ 4948.250597] EXT4-fs error (device dm-2): __ext4_find_entry:1536: inode
> 83829000: comm gdm-session-wor: reading directory lblock 0

IMG_20200604_230820.jpg

-----

> [ 213.350921] EXT4-fs error (device dm-2): __ext4_find_entry:1536: inode
> 83029000: comm gdm-session-wor: reading directory lblock 0

IMG_20200605_000220.jpg

-----

> [ 206.681358] EXT4-fs error (device dm-4): ext4_read_inode_bitmap:200: comm
> dconf worker: Cannot read inode bitmap - block_group = 1056, inode_bitmap =
> 34603024
> [ 206.681465] EXT4-fs error (device dm-4) in ext4_free_inode:355: IO failure
> [ 206.775200] EXT4-fs error (device dm-4): ext4_wait_block_bitmap:520: comm cheese:cs0: Cannot read block bitmap - block_group = 38, block_bitmap = 1048582
> [ 206.775410] EXT4-fs error (device dm-4): ext4_discard_preallocations:4090: comm cheese:cs0: Error -5 reading block bitmap for 38
> [ 213.584473] EXT4-fs error (device dm-4): ext4_journal_check_start:84: Detected aborted journal
> [ 213.584557] EXT4-fs (dm-4): Remounting filesystem read-only

IMG_20200605_232825.jpg

### What also happened

I assume this also caused another error: the TPM seems to have been corrupted and I had to regenerate it. What I actually saw is: at some boot, the BIOS/UEFI showed me a message that claimed I had switched the CPU (of course I did not; it's the built-in AMD Ryzen CPU) and that it needs to regenerate the fTPM values or so. As I do not have anything that relies on the TPM, I could just choose `Y` (yes) to regenerate it. (Note: This happened after all photos IIRC.)
## System

Here are all logs with system information (nvme-cli, smartctl, lshw etc.): https://gist.github.com/rugk/d17c88a7f78c986029c08426235217ed

**Side-note:** I had to learn that not all WDC drives actually [support the custom WDC commands](https://github.com/linux-nvme/nvme-cli/issues/731) that `nvme-cli` provides.

### A log catching the problem

I've also managed to catch the `dmesg` output when this occurred. This time, it **was not noticeable graphically**, and I could actually still use the system. However, in the background, it seems to have remounted the whole file system as read-only (and did not tell me, lol). Do have a look at the end of that kernel log: https://gist.github.com/rugk/88cad699c2ccf2cf0d309aa3a81221a1

Funny how the system is still able to run when it throws all these kinds of errors…

## Links

Maybe easier to read: I've also posted this in the Fedora Ask forum: https://ask.fedoraproject.org/t/investigating-kernel-crashes-due-to-nvme-disk/7620?u=rugk

Reported downstream in the Fedora issue tracker at https://bugzilla.redhat.com/show_bug.cgi?id=1844905

I would even be glad if you could point me to a workaround already…
Did you try disabling APST?
Okay, so I've tried adding:

> nvme_core.default_ps_max_latency_us=5500

…as a kernel parameter. (Which is, BTW, very convenient to do on Silverblue; just run

$ rpm-ostree kargs --append=nvme_core.default_ps_max_latency_us=5500

Also, I could (and, of course, needed to) roll back, because I had accidentally changed another kernel parameter.)

And it seems to work so far. (But I'll report back soon if something happens.)

-- LOGS -- # nvme smart-log /dev/nvme0 Smart Log for NVME device:nvme0 namespace-id:ffffffff critical_warning : 0 temperature : 35 C available_spare : 100% available_spare_threshold : 10% percentage_used : 0% endurance group critical warning summary: 0 data_units_read : 109.350 data_units_written : 187.970 host_read_commands : 2.017.447 host_write_commands : 1.014.888 controller_busy_time : 6 power_cycles : 41 power_on_hours : 37 unsafe_shutdowns : 29 media_errors : 0 num_err_log_entries : 1 Warning Temperature Time : 0 Critical Composite Temperature Time : 0 Thermal Management T1 Trans Count : 0 Thermal Management T2 Trans Count : 0 Thermal Management T1 Total Time : 0 Thermal Management T2 Total Time : 0 # smartctl -t short /dev/nvme0 smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.6.15-300.fc32.x86_64] (local build) Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org NVMe device successfully opened Use 'smartctl -a' (or '-x') to print SMART (and more) information [root@fedidea rugk]# smartctl -a /dev/nvme0 smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.6.15-300.fc32.x86_64] (local build) Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Model Number: WDC WDS100T2B0C-00PXH0 Serial Number: ******** Firmware Version: 211070WD PCI Vendor/Subsystem ID: 0x15b7 IEEE OUI Identifier: 0x001b44 Total NVM Capacity: 1.000.204.886.016 [1,00 TB] Unallocated NVM Capacity: 0 Controller ID: 1 Number of Namespaces: 1 Namespace 1 Size/Capacity: 1.000.204.886.016 [1,00 TB] Namespace 1 Formatted
LBA Size: 512 Namespace 1 IEEE EUI-64: 001b44 4a4408edc8 Local Time is: Fri Jun 12 00:29:54 2020 CEST Firmware Updates (0x14): 2 Slots, no Reset required Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp Maximum Data Transfer Size: 128 Pages Warning Comp. Temp. Threshold: 80 Celsius Critical Comp. Temp. Threshold: 85 Celsius Namespace 1 Features (0x02): NA_Fields Supported Power States St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat 0 + 3.50W 2.90W - 0 0 0 0 0 0 1 + 2.70W 1.80W - 0 0 0 0 0 0 2 + 1.90W 1.50W - 0 0 0 0 0 0 3 - 0.0200W - - 3 3 3 3 3900 11000 4 - 0.0050W - - 4 4 4 4 5000 39000 Supported LBA Sizes (NSID 0x1) Id Fmt Data Metadt Rel_Perf 0 + 512 0 2 1 - 4096 0 1 === START OF SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED SMART/Health Information (NVMe Log 0x02) Critical Warning: 0x00 Temperature: 34 Celsius Available Spare: 100% Available Spare Threshold: 10% Percentage Used: 0% Data Units Read: 109.515 [56,0 GB] Data Units Written: 188.161 [96,3 GB] Host Read Commands: 2.031.364 Host Write Commands: 1.019.201 Controller Busy Time: 6 Power Cycles: 41 Power On Hours: 37 Unsafe Shutdowns: 29 Media and Data Integrity Errors: 0 Error Information Log Entries: 1 Warning Comp. Temperature Time: 0 Critical Comp. Temperature Time: 0 Error Information (NVMe Log 0x01, max 256 entries) No Errors Logged # nvme get-feature -f 0x0c -H /dev/nvme0 get-feature:0xc (Autonomous Power State Transition), Current value:0x000001 Autonomous Power State Transition Enable (APSTE): Enabled Auto PST Entries ................. Entry[ 0] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[ 1] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[ 2] ................. 
Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[ 3] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[ 4] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[ 5] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[ 6] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[ 7] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[ 8] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[ 9] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[10] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[11] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[12] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[13] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[14] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[15] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[16] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[17] ................. 
Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[18] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[19] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[20] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[21] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[22] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[23] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[24] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[25] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[26] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[27] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[28] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[29] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[30] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. Entry[31] ................. Idle Time Prior to Transition (ITPT): 0 ms Idle Transition Power State (ITPS): 0 ................. 
# rpm-ostree kargs resume=/dev/mapper/fedora-swap rd.lvm.lv=fedora/root rd.luks.uuid=luks-08a02aba-fa31-[…] rd.lvm.lv=fedora/swap rhgb quiet root=/dev/mapper/luks-08a02aba-fa31-[…] ostree=/ostree/boot.1/fedora/902a099ca320af417c8297a1bff8ce4dda8[…]/0 nvme_core.default_ps_max_latency_us=5500
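To make the effect of `nvme_core.default_ps_max_latency_us` on this drive more concrete, here is a minimal sketch in Python. It uses the power-state table from the `smartctl -a` output above and assumes (this is my reading of the kernel's behaviour, not something shown in the logs) that a non-operational state stays eligible as an APST target only if its exit latency does not exceed the threshold, and that a threshold of 0 switches APST off. Some write-ups compare entry plus exit latency instead; for the values tried in this thread both readings give the same result. The function name and the value 40000 are purely illustrative.

```python
# Power states of the WDC WDS100T2B0C as listed by smartctl above:
# (state, operational?, entry latency in µs, exit latency in µs)
SN550_POWER_STATES = [
    (0, True, 0, 0),
    (1, True, 0, 0),
    (2, True, 0, 0),
    (3, False, 3900, 11000),
    (4, False, 5000, 39000),
]


def apst_idle_targets(power_states, max_latency_us):
    """Return the non-operational states APST may still transition into.

    Assumption: a state is eligible when its exit latency does not exceed
    the threshold; a threshold of 0 turns APST off entirely.
    """
    if max_latency_us == 0:
        return []
    return [
        state
        for state, operational, _enlat_us, exlat_us in power_states
        if not operational and exlat_us <= max_latency_us
    ]


# 0, 5000, 5500 and 15000 are values used in this thread;
# 40000 is a hypothetical value that would re-admit state 4.
for threshold in (0, 5000, 5500, 15000, 40000):
    print(threshold, apst_idle_targets(SN550_POWER_STATES, threshold))
```

Under that assumption, 5500 leaves no low-power idle target at all on this drive, which would explain why it behaves like "APST disabled" here.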
So disabling APST seems to work as a workaround in 99% of cases. I still sometimes experience system freezes, but I could detect nothing in the logs, so I'm not sure whether they are related to this issue. Anyway, [on the forum](https://ask.fedoraproject.org/t/investigating-kernel-crashes-due-to-nvme-disk/7620/5?u=rugk) it is speculated that this is related to the Ryzen processor and the SN550 NVMe, because there is a very similar bug report on Reddit: https://www.reddit.com/r/pop_os/comments/g8y3ae/experiencing_random_freezes_after_installing_new/
If there is any other information I can provide, feel free to ask.
I experience the same behaviour. Even with the mentioned kernel parameter, which "fixes" it, I still get sporadic crashes with no logs indicating what happened or what led to the crash. Just a blinking Caps Lock LED. I will gladly provide more info if requested.
I've got the same disk and see the same issue. I had kern.log open and waited for the bug to occur; what happens there is a "disconnect, reconnect, reset", and the reset fails with -19. I'm trying to tee the log onto a network drive or USB thumb drive and maybe take a screenshot of it. Is there something I could do to log more info around this issue?
WD has released new firmware for the WD Blue SN550: 211210WD. Use their Windows tool to verify and update the firmware of the SSD.
The tool is called “Dashboard”.
Is this fixed with the latest kernel/firmware?
I am affected by the same issue. Unfortunately, the laptop in question is in another country right now, so I am not able to attempt the firmware update.

Kernel: 5.10.8 (sorry for the old kernel here)
Model Number: WDC WDS100T2B0C-00PXH0
Serial Number: 20324B802592
Firmware Version: 211070WD

04:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD Blue SN550 NVMe SSD [15b7:5009] (rev 01)

I managed to capture the dmesg output during a failure by keeping a log running with dmesg -w from another machine.

[ 7337.662382] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[ 7337.701297] nvme 0000:04:00.0: can't change power state from D3cold to D0 (config space inaccessible)
[ 7337.701559] nvme nvme0: Removing after probe failure status: -19

Followed by constant block/fs-related errors which never recover.
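For anyone who wants to catch this event automatically from a running `dmesg -w`, the following is a minimal sketch in Python that picks the "controller is down" line out of a stream of kernel log lines. The regular expression is modelled only on the three lines quoted above, so other kernel versions may word the message differently; the function name is hypothetical. Treating all-ones register values as "device unreachable" is an assumption based on the usual behaviour of reads from an inaccessible PCIe device, which fits the "config space inaccessible" message.

```python
import re

# Modelled on the line captured above:
# [ 7337.662382] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
CONTROLLER_DOWN = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+nvme (?P<ctrl>nvme\d+): controller is down; "
    r"will reset: CSTS=(?P<csts>0x[0-9a-fA-F]+), PCI_STATUS=(?P<pci>0x[0-9a-fA-F]+)"
)


def find_controller_resets(lines):
    """Yield (timestamp, controller, device_unreachable) for each matching line.

    device_unreachable is True when both registers read back as all ones
    (assumption: that is what an inaccessible PCIe device returns).
    """
    for line in lines:
        match = CONTROLLER_DOWN.search(line)
        if match is None:
            continue
        all_ones = (
            int(match["csts"], 16) == 0xFFFFFFFF
            and int(match["pci"], 16) == 0xFFFF
        )
        yield float(match["ts"]), match["ctrl"], all_ones


sample = [
    "[ 7337.662382] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff",
    "[ 7337.701297] nvme 0000:04:00.0: can't change power state from D3cold to D0 (config space inaccessible)",
    "[ 7337.701559] nvme nvme0: Removing after probe failure status: -19",
]

for event in find_controller_resets(sample):
    print(event)
```

The same function could be fed from `sys.stdin` when the script sits at the end of a pipe from `dmesg -w` on the remote session.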
Forgot to mention: this is also running on an AMD Ryzen laptop.

Metabox NL50RU/NL5xRU, BIOS 1.07.05TMB 12/15/2020
CPU0: AMD Ryzen 5 4500U with Radeon Graphics (family: 0x17, model: 0x60, stepping: 0x1)

The machine crashes sometimes after a couple of hours, sometimes after a couple of days. After adding "nvme_core.default_ps_max_latency_us=0" to the boot line, the crashes went away, and I am now enjoying 30+ day uptimes.
I updated the firmware in my WDS100T2B0C-00PXH0 to 211210WD last week and removed the nvme_core.default_ps_max_latency_us=5000 from the kernel command line. 5 days later still no crashes, despite several suspend/resume cycles and a lot of use. Looks like the new firmware has resolved things for me.
(In reply to Jonathan McDowell from comment #12)
> I updated the firmware in my WDS100T2B0C-00PXH0 to 211210WD last week and
> removed the nvme_core.default_ps_max_latency_us=5000 from the kernel command
> line. 5 days later still no crashes, despite several suspend/resume cycles
> and a lot of use. Looks like the new firmware has resolved things for me.

Hi Jonathan, could you share the output of "nvme id-ctrl /dev/nvmeXnY" on your WDS100T2B0C-00PXH0? I found a similar issue that seems to be closely related to "elpe: 255". After I set the elpe to 63 (btw, I'm an NVMe SSD developer), no crash was observed. I googled for WDS100T2B0C and "id-ctrl" and found https://netlog.jpn.org/r271-635/2021/08/ssd_wdblue_sn550.html. The elpe was 255 at the time. Please let me know whether it is the same value on yours.
I'm seeing: elpe : 255 on my upgraded device. Full details with serial number redacted: NVME Identify Controller: vid : 0x15b7 ssvid : 0x15b7 sn : xxxxxxxxxxxx mn : WDC WDS100T2B0C-00PXH0 fr : 211210WD rab : 4 ieee : 001b44 cmic : 0 mdts : 7 cntlid : 0x1 ver : 0x10400 rtd3r : 0x7a120 rtd3e : 0xf4240 oaes : 0x200 ctratt : 0x2 rrls : 0 cntrltype : 1 fguid : crdt1 : 0 crdt2 : 0 crdt3 : 0 oacs : 0x17 acl : 4 aerl : 7 frmw : 0x14 lpa : 0x1e elpe : 255 npss : 4 avscc : 0x1 apsta : 0x1 wctemp : 353 cctemp : 358 mtfa : 50 hmpre : 51200 hmmin : 823 tnvmcap : 1000204886016 unvmcap : 0 rpmbs : 0 edstt : 70 dsto : 1 fwug : 1 kas : 0 hctma : 0x1 mntmt : 273 mxtmt : 358 sanicap : 0x60000002 hmminds : 0 hmmaxd : 8 nsetidmax : 0 endgidmax : 0 anatt : 0 anacap : 0 anagrpmax : 0 nanagrpid : 0 pels : 1 sqes : 0x66 cqes : 0x44 maxcmd : 0 nn : 1 oncs : 0x5f fuses : 0 fna : 0 vwc : 0x7 awun : 0 awupf : 0 nvscc : 1 nwpc : 0 acwu : 0 sgls : 0 mnan : 0 subnqn : nqn.2018-01.com.wdc:nguid:E8238FA6BF53-0001-001B448B4xxxxxxx ioccsz : 0 iorcsz : 0 icdoff : 0 ctrattr : 0 msdbd : 0 ps 0 : mp:3.50W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:0.6300W active_power:2.90W ps 1 : mp:2.70W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:0.6300W active_power:1.80W ps 2 : mp:1.90W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:0.6300W active_power:1.50W ps 3 : mp:0.0250W non-operational enlat:3900 exlat:11000 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:0.0250W active_power:- ps 4 : mp:0.0050W non-operational enlat:5000 exlat:39000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:0.0050W active_power:-
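As a reading aid for the `id-ctrl` dump above, here is a small Python sketch that parses the `name : value` lines and decodes a few of the fields discussed in this thread. It assumes the usual one-field-per-line text output of `nvme id-ctrl` (the dump above has lost its line breaks), and it relies on the NVMe specification's definitions as I understand them: `elpe` and `npss` are zero-based counts, `wctemp`/`cctemp` are in kelvins, and `rtd3r` is in microseconds. The helper name is hypothetical.

```python
def parse_id_ctrl(text):
    """Parse 'name : value' lines from `nvme id-ctrl` text output into a dict."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep or name.strip().startswith("ps "):
            continue  # skip lines without a colon and power-state descriptors
        fields[name.strip()] = value.strip()
    return fields


# A few fields copied from the dump above, one per line.
SAMPLE = """\
fr        : 211210WD
elpe      : 255
npss      : 4
wctemp    : 353
cctemp    : 358
rtd3r     : 0x7a120
"""

fields = parse_id_ctrl(SAMPLE)
error_log_entries = int(fields["elpe"]) + 1    # zero-based count
power_state_count = int(fields["npss"]) + 1    # zero-based count
warning_temp_c = int(fields["wctemp"]) - 273   # kelvins to degrees Celsius
critical_temp_c = int(fields["cctemp"]) - 273
rtd3_resume_us = int(fields["rtd3r"], 16)      # microseconds

print(error_log_entries, power_state_count, warning_temp_c, critical_temp_c, rtd3_resume_us)
```

The decoded values line up with the earlier `smartctl -a` output for the same model: "max 256 entries" for the error log, power states 0 to 4, and warning/critical thresholds of 80 and 85 Celsius.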
(In reply to Jonathan McDowell from comment #14)
> I'm seeing:
>
> elpe : 255
> ...

Thank you, Jonathan. After some digging, the root cause of my issue turned out to be a firmware bug. It's not a Linux kernel issue. Thanks.
FYI, I experienced crashing with this drive on an old Lenovo ThinkCentre M625q with an AMD E2-9000e CPU. It crashed seemingly as soon as there was no activity on the system for a few seconds. My fix was to set `nvme_core.default_ps_max_latency_us=15000` on the kernel command line. This basically disables the deepest power saving state (4), but for my application that is acceptable. It is still working after a couple of days, but I didn't test it under high load. I took the drive from a Windows machine, and I saw no firmware updates for it reported by some tool installed there. I don't know if there are separate downloads with a newer version though. $ sudo nvme id-ctrl /dev/nvme0n1 NVME Identify Controller: vid : 0x15b7 ssvid : 0x15b7 sn : [REDACTED] mn : WDC PC SN520 SDAPNUW-256G-1006 fr : 20110006 rab : 4 ieee : 001b44 cmic : 0 mdts : 7 cntlid : 0x1 ver : 0x10300 rtd3r : 0x7a120 rtd3e : 0xf4240 oaes : 0x200 ctratt : 0x2 rrls : 0 cntrltype : 0 fguid : 00000000-0000-0000-0000-000000000000 crdt1 : 0 crdt2 : 0 crdt3 : 0 nvmsr : 0 vwci : 0 mec : 0 oacs : 0x17 acl : 4 aerl : 7 frmw : 0x14 lpa : 0x2 elpe : 255 npss : 4 avscc : 0x1 apsta : 0x1 wctemp : 355 cctemp : 359 mtfa : 50 hmpre : 0 hmmin : 0 tnvmcap : 256060514304 unvmcap : 0 rpmbs : 0 edstt : 31 dsto : 1 fwug : 1 kas : 0 hctma : 0x1 mntmt : 273 mxtmt : 359 sanicap : 0 hmminds : 0 hmmaxd : 0 nsetidmax : 0 endgidmax : 0 anatt : 0 anacap : 0 anagrpmax : 0 nanagrpid : 0 pels : 0 domainid : 0 megcap : 0 sqes : 0x66 cqes : 0x44 maxcmd : 0 nn : 1 oncs : 0x1f fuses : 0 fna : 0 vwc : 0x1 awun : 0 awupf : 0 icsvscc : 1 nwpc : 0 acwu : 0 ocfs : 0 sgls : 0 mnan : 0 maxdna : 0 maxcna : 0 oaqd : 0 subnqn : nqn.2018-01.com.wdc:nguid:1832B5800315-0001-001B448B444C081C ioccsz : 0 iorcsz : 0 icdoff : 0 fcatt : 0 msdbd : 0 ofcs : 0 ps 0 : mp:2.60W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:- active_power:- active_power_workload:- ps 1 : mp:2.60W operational enlat:0 exlat:0 rrt:1 rrl:1 rwt:1 rwl:1 idle_power:- 
active_power:- active_power_workload:- ps 2 : mp:1.70W operational enlat:0 exlat:0 rrt:2 rrl:2 rwt:2 rwl:2 idle_power:- active_power:- active_power_workload:- ps 3 : mp:0.0250W non-operational enlat:5000 exlat:9000 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:- active_power:- active_power_workload:- ps 4 : mp:0.0025W non-operational enlat:5000 exlat:44000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:- active_power:- active_power_workload:-
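The claim that 15000 only switches off state 4 on this SN520 can be checked against the `ps` lines of the dump above. The Python sketch below extracts the latencies with a regular expression written against exactly this output format (other nvme-cli versions may print it differently) and reports, per non-operational state, the smallest threshold that would still keep it. It assumes the kernel compares only the exit latency with `nvme_core.default_ps_max_latency_us`; if entry plus exit latency were used instead, the numbers would be 14000 and 49000, and 15000 would still fall between them. The function name is hypothetical.

```python
import re

# One power-state descriptor as printed by `nvme id-ctrl`, for example:
# "ps 3 : mp:0.0250W non-operational enlat:5000 exlat:9000 ..."
PS_LINE = re.compile(
    r"\bps\s+(?P<state>\d+)\s*:\s*mp:\S+\s+"
    r"(?P<kind>non-operational|operational)\s+"
    r"enlat:(?P<enlat>\d+)\s+exlat:(?P<exlat>\d+)"
)


def min_threshold_per_state(id_ctrl_text):
    """Map each non-operational state to the smallest threshold (µs) that keeps it.

    Assumption: only the exit latency (exlat) is compared with the threshold.
    """
    return {
        int(m["state"]): int(m["exlat"])
        for m in PS_LINE.finditer(id_ctrl_text)
        if m["kind"] == "non-operational"
    }


# Power-state lines of the SN520 from the dump above (trailing fields trimmed).
SN520_PS = """\
ps 0 : mp:2.60W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0
ps 1 : mp:2.60W operational enlat:0 exlat:0 rrt:1 rrl:1 rwt:1 rwl:1
ps 2 : mp:1.70W operational enlat:0 exlat:0 rrt:2 rrl:2 rwt:2 rwl:2
ps 3 : mp:0.0250W non-operational enlat:5000 exlat:9000 rrt:3 rrl:3 rwt:3 rwl:3
ps 4 : mp:0.0025W non-operational enlat:5000 exlat:44000 rrt:4 rrl:4 rwt:4 rwl:4
"""

needed = min_threshold_per_state(SN520_PS)
still_allowed = [state for state, need in needed.items() if need <= 15000]
print(needed, still_allowed)
```

With these numbers, a threshold of 15000 keeps state 3 available and excludes only state 4, which matches the description in the comment.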