Bug 208123 - Kernel crashes due to NVMe disk: WD Blue SN550 (WDC WDS100T2B0C)
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: NVMe
Hardware: x86-64 Linux
Importance: P1 high
Assignee: IO/NVME Virtual Default Assignee
Reported: 2020-06-10 10:53 UTC by rugk
Modified: 2023-11-15 19:53 UTC

Kernel Version: 5.6.15-300.fc32.x86_64
Regression: No

Description rugk 2020-06-10 10:53:36 UTC
Fedora 32 Silverblue

## What happens

Randomly (I assume when it accesses the file system/the NVMe SSD heavily), the system just freezes and shows me a fullscreen error. It's always some kind of **ext4 error**, but it's a new installation, so the file system should be intact.

Here are some errors:

> [ 4948.250597] EXT4-fs error (device dm-2): __ext4_find_entry:1536: inode
> 83829000: comm gdm-session-wor: reading directory lblock 0

IMG_20200604_230820.jpg

-----

> [  213.350921] EXT4-fs error (device dm-2): __ext4_find_entry:1536: inode
> 83029000: comm gdm-session-wor: reading directory lblock 0

IMG_20200605_000220.jpg

-----

> [  206.681358] EXT4-fs error (device dm-4): ext4_read_inode_bitmap:200: comm
> dconf worker: Cannot read inode bitmap - block_group = 1056, inode_bitmap =
> 34603024
> [  206.681465] EXT4-fs error (device dm-4) in ext4_free_inode:355: IO failure
> [  206.775200] EXT4-fs error (device dm-4): ext4_wait_block_bitmap:520: comm cheese:cs0: Cannot read block bitmap - block_group = 38, block_bitmap = 1048582
> [  206.775410] EXT4-fs error (device dm-4): ext4_discard_preallocations:4090: comm cheese:cs0: Error -5 reading block bitmap for 38
> [  213.584473] EXT4-fs error (device dm-4): ext4_journal_check_start:84: Detected aborted journal
> [  213.584557] EXT4-fs (dm-4): Remounting filesystem read-only

IMG_20200605_232825.jpg

### What also happened

I assume the same problem also caused another error: the TPM seems to have been corrupted and I had to regenerate it.

What I actually saw: on one boot, the BIOS/UEFI showed me a message claiming that I had switched the CPU (of course I did not; it's the built-in AMD Ryzen CPU) and that it needed to regenerate the fTPM values or so.
As I do not have anything that relies on the TPM, I could just choose `Y` (yes) to regenerate it.
(Note: This happened after all the photos were taken, IIRC.)

## System

Here are all logs with system information (nvme-cli, smartctl, lshw etc.):
https://gist.github.com/rugk/d17c88a7f78c986029c08426235217ed

**Side-note:** I had to learn that not all WDC drives actually [support the custom WDC commands](https://github.com/linux-nvme/nvme-cli/issues/731) that `nvme-cli` provides.

### A log catching the problem

I've also managed to catch `dmesg` output when this occurred. This time, it **was not noticeable graphically**, and I could actually still use the system. However, in the background, it seems to have remounted the whole file system read-only (and did not tell me, lol). Do have a look at the end of that kernel log:
https://gist.github.com/rugk/88cad699c2ccf2cf0d309aa3a81221a1

Funny how the system is still able to run while it throws all these kinds of errors…

## Links

Possibly easier to read: I've also posted this on the Ask Fedora forum: https://ask.fedoraproject.org/t/investigating-kernel-crashes-due-to-nvme-disk/7620?u=rugk

Reported downstream in the Fedora issue tracker at https://bugzilla.redhat.com/show_bug.cgi?id=1844905

I would already be glad if you could point me to a workaround…
Comment 1 Keith Busch 2020-06-10 13:53:22 UTC
Did you try disabling APST?
Comment 2 rugk 2020-06-11 22:36:45 UTC
Okay, so I've tried adding:
> nvme_core.default_ps_max_latency_us=5500

…as a kernel parameter.
(Which is, BTW, very convenient to do on Silverblue; just run
$ rpm-ostree kargs --append=nvme_core.default_ps_max_latency_us=5500
Also, I could (and, of course, needed to) roll back, because I accidentally changed another kernel parameter.)

And it seems to work so far. (But I'll report back soon if something happens.)
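
To double-check that the parameter is really active after a reboot, something like the following could be used. This is only a minimal sketch: the helper names `cmdline_value` and `live_values` are made up for illustration, and it assumes the `nvme_core` module exposes the parameter under `/sys/module/nvme_core/parameters/`.

```python
from pathlib import Path
from typing import Optional

PARAM = "nvme_core.default_ps_max_latency_us"
# Assumed sysfs location of the module parameter (readable on kernels that export it).
SYSFS = Path("/sys/module/nvme_core/parameters/default_ps_max_latency_us")


def cmdline_value(cmdline: str, name: str = PARAM) -> Optional[str]:
    """Return the value given for `name` on a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == name:
            value = val  # keep the last occurrence if the parameter is repeated
    return value


def live_values() -> dict:
    """Collect what the running system reports, skipping files that are missing."""
    out = {}
    proc = Path("/proc/cmdline")
    if proc.exists():
        out["cmdline"] = cmdline_value(proc.read_text())
    if SYSFS.exists():
        out["sysfs"] = SYSFS.read_text().strip()
    return out
```

Calling `live_values()` on the affected machine should show the same number in both places if the kernel argument was picked up.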




-- LOGS --

# nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning			: 0
temperature				: 35 C
available_spare				: 100%
available_spare_threshold		: 10%
percentage_used				: 0%
endurance group critical warning summary: 0
data_units_read				: 109.350
data_units_written			: 187.970
host_read_commands			: 2.017.447
host_write_commands			: 1.014.888
controller_busy_time			: 6
power_cycles				: 41
power_on_hours				: 37
unsafe_shutdowns			: 29
media_errors				: 0
num_err_log_entries			: 1
Warning Temperature Time		: 0
Critical Composite Temperature Time	: 0
Thermal Management T1 Trans Count	: 0
Thermal Management T2 Trans Count	: 0
Thermal Management T1 Total Time	: 0
Thermal Management T2 Total Time	: 0
# smartctl -t short /dev/nvme0
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.6.15-300.fc32.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

NVMe device successfully opened

Use 'smartctl -a' (or '-x') to print SMART (and more) information

[root@fedidea rugk]# smartctl -a /dev/nvme0
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.6.15-300.fc32.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       WDC WDS100T2B0C-00PXH0
Serial Number:                      ********
Firmware Version:                   211070WD
PCI Vendor/Subsystem ID:            0x15b7
IEEE OUI Identifier:                0x001b44
Total NVM Capacity:                 1.000.204.886.016 [1,00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1.000.204.886.016 [1,00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            001b44 4a4408edc8
Local Time is:                      Fri Jun 12 00:29:54 2020 CEST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     80 Celsius
Critical Comp. Temp. Threshold:     85 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.50W    2.90W       -    0  0  0  0        0       0
 1 +     2.70W    1.80W       -    0  0  0  0        0       0
 2 +     1.90W    1.50W       -    0  0  0  0        0       0
 3 -   0.0200W       -        -    3  3  3  3     3900   11000
 4 -   0.0050W       -        -    4  4  4  4     5000   39000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        34 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    109.515 [56,0 GB]
Data Units Written:                 188.161 [96,3 GB]
Host Read Commands:                 2.031.364
Host Write Commands:                1.019.201
Controller Busy Time:               6
Power Cycles:                       41
Power On Hours:                     37
Unsafe Shutdowns:                   29
Media and Data Integrity Errors:    0
Error Information Log Entries:      1
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, max 256 entries)
No Errors Logged
# nvme get-feature -f 0x0c -H /dev/nvme0
get-feature:0xc (Autonomous Power State Transition), Current value:0x000001
	Autonomous Power State Transition Enable (APSTE): Enabled
	Auto PST Entries	.................
	Entry[ 0]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[ 1]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[ 2]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[ 3]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[ 4]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[ 5]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[ 6]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[ 7]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[ 8]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[ 9]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[10]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[11]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[12]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[13]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[14]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[15]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[16]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[17]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[18]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[19]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[20]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[21]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[22]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[23]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[24]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[25]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[26]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[27]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[28]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[29]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[30]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
	Entry[31]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................
# rpm-ostree kargs
resume=/dev/mapper/fedora-swap rd.lvm.lv=fedora/root rd.luks.uuid=luks-08a02aba-fa31-[…] rd.lvm.lv=fedora/swap rhgb quiet root=/dev/mapper/luks-08a02aba-fa31-[…] ostree=/ostree/boot.1/fedora/902a099ca320af417c8297a1bff8ce4dda8[…]/0 nvme_core.default_ps_max_latency_us=5500
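
For reference, here is a rough sketch of why a cap of 5500 µs should effectively switch APST off on this drive. The latencies are copied from the "Supported Power States" table in the smartctl output above. Whether the kernel compares the cap against the exit latency alone or against entry plus exit latency is an assumption here (it may differ between kernel versions), so the sketch supports both; `PowerState` and `eligible_idle_states` are illustrative names, not kernel APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PowerState:
    index: int
    operational: bool
    entry_lat_us: int
    exit_lat_us: int


# Values copied from the smartctl "Supported Power States" table.
SN550_STATES = [
    PowerState(0, True, 0, 0),
    PowerState(1, True, 0, 0),
    PowerState(2, True, 0, 0),
    PowerState(3, False, 3900, 11000),
    PowerState(4, False, 5000, 39000),
]


def eligible_idle_states(states: List[PowerState], max_latency_us: int,
                         use_total: bool = True) -> List[int]:
    """Return the non-operational states whose latency fits under the cap.

    Assumption: the cap is checked against entry+exit latency (use_total=True)
    or against exit latency only (use_total=False).
    """
    result = []
    for ps in states:
        if ps.operational:
            continue  # only non-operational states are idle targets
        latency = ps.exit_lat_us + (ps.entry_lat_us if use_total else 0)
        if latency <= max_latency_us:
            result.append(ps.index)
    return result
```

With these numbers, `eligible_idle_states(SN550_STATES, 5500)` comes out empty under both comparison rules, which would fit the all-zero APST table in the get-feature output above. A much larger cap such as 100000 would leave states 3 and 4 eligible.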
Comment 3 rugk 2020-06-22 19:23:09 UTC
So disabling APST seems to work in 99% of the cases as a workaround. I still sometimes experience a freezing of the system, but I could detect nothing in the logs, so I'm not sure whether this is related to this issue.

Anyway, [on the forum](https://ask.fedoraproject.org/t/investigating-kernel-crashes-due-to-nvme-disk/7620/5?u=rugk) it is speculated that this is related to the combination of a Ryzen processor and the SN550 NVMe, because there is a very similar bug report on Reddit:
https://www.reddit.com/r/pop_os/comments/g8y3ae/experiencing_random_freezes_after_installing_new/
Comment 4 rugk 2020-06-22 19:37:02 UTC
If there is any other information I can provide, feel free to ask.
Comment 5 kilian2798 2020-10-10 01:44:32 UTC
I experience the same behaviour. Even if I add the mentioned kernel parameter, which "fixes" it, I still get sporadic crashes with no logs at all indicating what happened or what led to the crash. Just a blinking Caps Lock LED. I will gladly provide more info if requested.
Comment 6 Julian Hille 2021-01-05 01:32:54 UTC
I've got the same disk and see the same issue.
I had kern.log open and waited for the bug to occur. What happens there is a
"disconnect, reconnect, reset", and the reset fails with -19.
I'm now trying to tee the log onto a network drive or USB thumb drive, and maybe take a screenshot of it.

Is there something I could do to log more information around this issue?
Comment 7 Ma 2021-01-19 21:21:50 UTC
WD has released new firmware for the WD Blue SN550: 211210WD
Use their Windows tool to verify and update the firmware of the SSD.
Comment 8 Ma 2021-01-19 21:32:12 UTC
The tool is called “Dashboard”.
Comment 9 tornado99 2021-03-10 11:07:19 UTC
Is this fixed with the latest kernel/firmware?
Comment 10 Andrew Macks 2021-04-02 04:39:42 UTC
I am affected by the same issue.  Unfortunately, the laptop in question is in another country right now, so I do not have the ability to attempt the firmware update.

Kernel: 5.10.8 (sorry for the old kernel here)

Model Number:                       WDC WDS100T2B0C-00PXH0
Serial Number:                      20324B802592
Firmware Version:                   211070WD

04:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD Blue SN550 NVMe SSD [15b7:5009] (rev 01)

I managed to capture the dmesg output during a failure by having a log running with dmesg -w from another machine.

[ 7337.662382] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[ 7337.701297] nvme 0000:04:00.0: can't change power state from D3cold to D0 (config space inaccessible)
[ 7337.701559] nvme nvme0: Removing after probe failure status: -19

Followed by constant block/fs related errors which never recover.
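
For anyone else trying to catch this, a small filter along these lines could flag the three messages above in a saved `dmesg -w` log. It is only a sketch: the patterns are derived from the lines quoted in this comment, other kernel versions may word them differently, and `SIGNATURES` and `scan` are hypothetical names.

```python
import re
from typing import Iterable, List, Tuple

# Patterns built from the three dmesg lines quoted above.
SIGNATURES = {
    "controller_down": re.compile(
        r"nvme (\S+): controller is down; will reset: CSTS=(0x[0-9a-fA-F]+)"),
    "power_state": re.compile(
        r"nvme (\S+): can't change power state from (\S+) to (\S+)"),
    "removed": re.compile(
        r"nvme (\S+): Removing after probe failure status: (-?\d+)"),
}


def scan(lines: Iterable[str]) -> List[Tuple[str, Tuple[str, ...]]]:
    """Return (signature name, captured fields) for every matching log line."""
    hits = []
    for line in lines:
        for name, pattern in SIGNATURES.items():
            match = pattern.search(line)
            if match:
                hits.append((name, match.groups()))
    return hits
```

A CSTS value of all ones together with PCI_STATUS=0xffff suggests that register reads no longer reach the device at all, and status -19 corresponds to ENODEV on Linux.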
Comment 11 Andrew Macks 2021-04-02 04:41:12 UTC
Forgot to mention, this is also running on an AMD Ryzen laptop.

Metabox NL50RU/NL5xRU, BIOS 1.07.05TMB 12/15/2020
CPU0: AMD Ryzen 5 4500U with Radeon Graphics (family: 0x17, model: 0x60, stepping: 0x1)

The machine crashes sometimes after a couple of hours, or sometimes after a couple of days.

After adding "nvme_core.default_ps_max_latency_us=0" to the kernel command line, the crashes went away and I am now enjoying uptimes of 30+ days.
Comment 12 Jonathan McDowell 2021-04-12 07:58:15 UTC
I updated the firmware in my WDS100T2B0C-00PXH0 to 211210WD last week and removed the nvme_core.default_ps_max_latency_us=5000 from the kernel command line. 5 days later still no crashes, despite several suspend/resume cycles and a lot of use. Looks like the new firmware has resolved things for me.
Comment 13 roxma 2022-01-10 06:14:20 UTC
(In reply to Jonathan McDowell from comment #12)
> I updated the firmware in my WDS100T2B0C-00PXH0 to 211210WD last week and
> removed the nvme_core.default_ps_max_latency_us=5000 from the kernel command
> line. 5 days later still no crashes, despite several suspend/resume cycles
> and a lot of use. Looks like the new firmware has resolved things for me.

Hi Jonathan, could you share the output of "nvme id-ctrl /dev/nvmeXnY" on your WDS100T2B0C-00PXH0?

I found a similar issue that seems to be closely related to "elpe: 255". After I set elpe to 63 (BTW, I'm an NVMe SSD developer), no crash was observed.

I googled for WDS100T2B0C and "id-ctrl" and found https://netlog.jpn.org/r271-635/2021/08/ssd_wdblue_sn550.html. The elpe was 255 at the time. Please let me know whether it is the same value on yours.
Comment 14 Jonathan McDowell 2022-01-10 11:00:09 UTC
I'm seeing:

elpe      : 255

on my upgraded device. Full details with serial number redacted:

NVME Identify Controller:
vid       : 0x15b7
ssvid     : 0x15b7
sn        : xxxxxxxxxxxx        
mn        : WDC WDS100T2B0C-00PXH0                  
fr        : 211210WD
rab       : 4
ieee      : 001b44
cmic      : 0
mdts      : 7
cntlid    : 0x1
ver       : 0x10400
rtd3r     : 0x7a120
rtd3e     : 0xf4240
oaes      : 0x200
ctratt    : 0x2
rrls      : 0
cntrltype : 1
fguid     : 
crdt1     : 0
crdt2     : 0
crdt3     : 0
oacs      : 0x17
acl       : 4
aerl      : 7
frmw      : 0x14
lpa       : 0x1e
elpe      : 255
npss      : 4
avscc     : 0x1
apsta     : 0x1
wctemp    : 353
cctemp    : 358
mtfa      : 50
hmpre     : 51200
hmmin     : 823
tnvmcap   : 1000204886016
unvmcap   : 0
rpmbs     : 0
edstt     : 70
dsto      : 1
fwug      : 1
kas       : 0
hctma     : 0x1
mntmt     : 273
mxtmt     : 358
sanicap   : 0x60000002
hmminds   : 0
hmmaxd    : 8
nsetidmax : 0
endgidmax : 0
anatt     : 0
anacap    : 0
anagrpmax : 0
nanagrpid : 0
pels      : 1
sqes      : 0x66
cqes      : 0x44
maxcmd    : 0
nn        : 1
oncs      : 0x5f
fuses     : 0
fna       : 0
vwc       : 0x7
awun      : 0
awupf     : 0
nvscc     : 1
nwpc      : 0
acwu      : 0
sgls      : 0
mnan      : 0
subnqn    : nqn.2018-01.com.wdc:nguid:E8238FA6BF53-0001-001B448B4xxxxxxx
ioccsz    : 0
iorcsz    : 0
icdoff    : 0
ctrattr   : 0
msdbd     : 0
ps    0 : mp:3.50W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:0.6300W active_power:2.90W
ps    1 : mp:2.70W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:0.6300W active_power:1.80W
ps    2 : mp:1.90W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:0.6300W active_power:1.50W
ps    3 : mp:0.0250W non-operational enlat:3900 exlat:11000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:0.0250W active_power:-
ps    4 : mp:0.0050W non-operational enlat:5000 exlat:39000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:0.0050W active_power:-
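
As a reading aid for a few of the raw fields above, here is a small sketch of how I understand their encoding: ELPE and NPSS are zero-based counts, WCTEMP/CCTEMP are in Kelvin, and RTD3R/RTD3E are in microseconds. The function name `decode_id_ctrl` and the output keys are made up for illustration, and the integer 273 offset is an assumption that mirrors the Celsius thresholds smartctl printed earlier in this report.

```python
def decode_id_ctrl(fields: dict) -> dict:
    """Convert selected Identify Controller fields into friendlier units."""
    return {
        "error_log_entries": fields["elpe"] + 1,        # zero-based count
        "power_states": fields["npss"] + 1,             # zero-based count
        "warning_temp_c": fields["wctemp"] - 273,       # Kelvin to Celsius
        "critical_temp_c": fields["cctemp"] - 273,      # Kelvin to Celsius
        "rtd3_resume_s": fields["rtd3r"] / 1_000_000,   # microseconds to seconds
        "rtd3_entry_s": fields["rtd3e"] / 1_000_000,    # microseconds to seconds
    }


# Values copied from the id-ctrl output above.
SN550_ID_CTRL = {
    "elpe": 255,
    "npss": 4,
    "wctemp": 353,
    "cctemp": 358,
    "rtd3r": 0x7A120,
    "rtd3e": 0xF4240,
}
```

Decoded this way, elpe 255 means 256 error log entries, which lines up with the "max 256 entries" that smartctl reported in comment 2, and the thresholds come out at 80 and 85 Celsius.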
Comment 15 roxma 2022-01-10 11:34:42 UTC
(In reply to Jonathan McDowell from comment #14)
> I'm seeing:
> 
> elpe      : 255
> ...

Thank you, Jonathan. After some digging, the root cause of my issue turns out to be a firmware bug. It's not a Linux kernel issue. Thanks.
