Bug 211573 - Samsung 970 EVO Plus Generates NVME Errors
Summary: Samsung 970 EVO Plus Generates NVME Errors
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: NVMe
Hardware: All Linux
Importance: P1 normal
Assignee: IO/NVME Virtual Default Assignee
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-02-05 03:37 UTC by gs
Modified: 2023-09-16 09:50 UTC
CC List: 8 users

See Also:
Kernel Version: 5.10.13
Subsystem:
Regression: No
Bisected commit-id:


Description gs 2021-02-05 03:37:18 UTC
The Samsung 970 EVO Plus generates NVMe error log entries whether or not it is mounted.

Sample size of 2 different hardware units.

When the drives were used in a FreeBSD machine, the error count did not increase. Sample 2 has more power-on hours, but most of them were spent in a different (FreeBSD) machine. Sample 1 has only been used in Linux machines. I therefore believe this is a Linux driver/kernel bug.

I have noticed no adverse effects of this error, simply an increment in the number of log entries.


Please let me know how I can assist in debugging, or if this is not the correct location to report this issue.

To my novice eye, the NVMe standard says that commands addressed to namespace 0xFFFFFFFF should apply to all namespaces, but the drive appears to treat it as an invalid namespace in the entries below.

Steps to reproduce:
Insert drive into system and power up

Expected result:
No errors generated

Actual Results:
Errors generated, although no adverse effects observed thus far (although no exhaustive search performed)

Build & Hardware:
Linux NAME 5.10.13-arch1-1 #1 SMP PREEMPT Wed, 03 Feb 2021 23:44:07 +0000 x86_64 GNU/Linux
Hardware: Dell Precision 7530 with Samsung 970 EVO Plus

The error always has the same Status and PELoc. Examples from SMART:
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0     801903     0  0x0004  0x4016  0x004            0     -     -
  1     801902     0  0x0014  0x4016  0x004            0     -     -
  2     801901     0  0x0003  0x4016  0x004            0     -     -
  3     801900     0  0x0006  0x4016  0x004            0     -     -
  4     801899     0  0x0016  0x4016  0x004            0     -     -
  5     801898     0  0x0014  0x4016  0x004            0     -     -
  6     801897     0  0x0001  0x4016  0x004            0     -     -
  7     801896     0  0x0003  0x4016  0x004            0     -     -
  8     801895     0  0x0001  0x4016  0x004            0     -     -
  9     801894     0  0x0003  0x4016  0x004            0     -     -
 10     801893     0  0x0004  0x4016  0x004            0     -     -
 11     801892     0  0x0015  0x4016  0x004            0     -     -
 12     801891     0  0x0001  0x4016  0x004            0     -     -
 13     801890     0  0x0003  0x4016  0x004            0     -     -
 14     801889     0  0x0006  0x4016  0x004            0     -     -
 15     801888     0  0x0014  0x4016  0x004            0     -     -
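
The claim that every entry shares one Status and PELoc can be checked mechanically. The following is a minimal sketch, assuming the rows are pasted as text in smartctl's nine-column layout (Num, ErrCount, SQId, CmdId, Status, PELoc, LBA, NSID, VS); the function name `tally_status_peloc` and the `SAMPLE` constant are hypothetical, not part of smartmontools.

```python
from collections import Counter

# A few rows copied from the smartctl "Error Information" table above.
SAMPLE = """\
  0     801903     0  0x0004  0x4016  0x004            0     -     -
  1     801902     0  0x0014  0x4016  0x004            0     -     -
  2     801901     0  0x0003  0x4016  0x004            0     -     -
"""


def tally_status_peloc(table_text):
    """Count (SQId, Status, PELoc) combinations in smartctl error-log rows."""
    counts = Counter()
    for line in table_text.splitlines():
        fields = line.split()
        # Data rows have 9 columns and start with a numeric index;
        # header and "... entries not read" lines are skipped.
        if len(fields) != 9 or not fields[0].isdigit():
            continue
        sqid, status, peloc = int(fields[2]), fields[4], fields[5]
        counts[(sqid, status, peloc)] += 1
    return counts


if __name__ == "__main__":
    for key, n in tally_status_peloc(SAMPLE).items():
        print(key, n)
```

A single key in the result means all parsed rows came from the same queue with the same Status and PELoc, which is what the table above shows.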

Examples from nvme error-log:
 Entry[61]   
.................
error_count	: 801847
sqid		: 0
cmdid		: 0x1
status_field	: 0x4016(INVALID_NS: The namespace or the format of that namespace is invalid)
parm_err_loc	: 0x4
lba		: 0
nsid		: 0xffffffff
vs		: 0
trtype		: The transport type is not indicated or the error is not transport related.
cs		: 0
trtype_spec_info: 0
.................
 Entry[62]   
.................
error_count	: 801846
sqid		: 0
cmdid		: 0x4
status_field	: 0x4016(INVALID_NS: The namespace or the format of that namespace is invalid)
parm_err_loc	: 0x4
lba		: 0
nsid		: 0xffffffff
vs		: 0
trtype		: The transport type is not indicated or the error is not transport related.
cs		: 0
trtype_spec_info: 0
.................
 Entry[63]   
.................
error_count	: 801845
sqid		: 0
cmdid		: 0x2
status_field	: 0x4016(INVALID_NS: The namespace or the format of that namespace is invalid)
parm_err_loc	: 0x4
lba		: 0
nsid		: 0xffffffff
vs		: 0
trtype		: The transport type is not indicated or the error is not transport related.
cs		: 0
trtype_spec_info: 0
.................
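
For readers who want to interpret the raw values, here is a minimal sketch of how the 16-bit status field and the parameter error location can be split into their parts. It assumes the Error Information Log layout from the NVMe base specification (bit 0 is the phase tag, bits 8:1 the status code, bits 11:9 the status code type, bit 14 "More", bit 15 "Do Not Retry"; the error location holds a byte offset in bits 7:0 and a bit offset in bits 10:8). The function names and the small `STATUS_NAMES` table are hypothetical and cover only a handful of codes.

```python
# Tiny, illustrative subset of status codes keyed by (SCT, SC).
# Names follow the NVMe base specification; this is not a full table.
STATUS_NAMES = {
    (0x0, 0x02): "Invalid Field in Command",
    (0x0, 0x0B): "Invalid Namespace or Format",
    (0x1, 0x09): "Invalid Log Page",
}


def decode_status(status_field):
    """Split the 16-bit Status Field of an Error Information Log entry."""
    sc = (status_field >> 1) & 0xFF   # Status Code
    sct = (status_field >> 9) & 0x7   # Status Code Type
    return {
        "phase": status_field & 0x1,
        "sc": sc,
        "sct": sct,
        "more": (status_field >> 14) & 0x1,
        "dnr": (status_field >> 15) & 0x1,
        "name": STATUS_NAMES.get((sct, sc), "unknown"),
    }


def decode_peloc(parm_err_loc):
    """Split Parameter Error Location into (byte offset, bit offset)."""
    return parm_err_loc & 0xFF, (parm_err_loc >> 8) & 0x7


if __name__ == "__main__":
    print(decode_status(0x4016))
    print(decode_peloc(0x4))
```

Under these assumptions, 0x4016 decodes to generic status 0x0B (Invalid Namespace or Format), and a parm_err_loc of 0x4 points at byte 4 of the submitted command, which is where the NSID field begins. That is consistent with the nsid of 0xffffffff shown in the entries above.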



SMART information of unit 1 (newer):

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.13-arch1-1] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO Plus 2TB
Serial Number:                      XXXXXXXXXXXXXXXXX
Firmware Version:                   2B2QEXM7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      4
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization:            1,969,516,711,936 [1.96 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5a01aec1a4
Local Time is:                      Thu Feb  4 21:23:18 2021 CST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.50W       -        -    0  0  0  0        0       0
 1 +     5.90W       -        -    1  1  1  1        0       0
 2 +     3.60W       -        -    2  2  2  2        0       0
 3 -   0.0700W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        44 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    31,326,159 [16.0 TB]
Data Units Written:                 35,929,602 [18.3 TB]
Host Read Commands:                 300,068,596
Host Write Commands:                239,589,759
Controller Busy Time:               532
Power Cycles:                       37
Power On Hours:                     818
Unsafe Shutdowns:                   18
Media and Data Integrity Errors:    0
Error Information Log Entries:      103,607
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               44 Celsius
Temperature Sensor 2:               38 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0     103607     0  0x000d  0x4016  0x004            0     -     -
  1     103606     0  0x000f  0x4016  0x004            0     -     -
  2     103605     0  0x000b  0x4016  0x004            0     -     -
  3     103604     0  0x0009  0x4016  0x004            0     -     -
  4     103603     0  0x001c  0x4016  0x004            0     -     -
  5     103602     0  0x001c  0x4016  0x004            0     -     -
  6     103601     0  0x0017  0x4016  0x004            0     -     -
  7     103600     0  0x001c  0x4016  0x004            0     -     -
  8     103599     0  0x001c  0x4016  0x004            0     -     -
  9     103598     0  0x001c  0x4016  0x004            0     -     -
 10     103597     0  0x0016  0x4016  0x004            0     -     -
 11     103596     0  0x000d  0x4016  0x004            0     -     -
 12     103595     0  0x001d  0x4016  0x004            0     -     -
 13     103594     0  0x000f  0x4016  0x004            0     -     -
 14     103593     0  0x001d  0x4016  0x004            0     -     -
 15     103592     0  0x000d  0x4016  0x004            0     -     -
... (48 entries not read)


SMART information unit 2:
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.13-arch1-1] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO Plus 2TB
Serial Number:                      XXXXXXXXXXXXXx
Firmware Version:                   2B2QEXM7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      4
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization:            1,657,011,032,064 [1.65 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5901ae3f49
Local Time is:                      Thu Feb  4 21:23:56 2021 CST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.50W       -        -    0  0  0  0        0       0
 1 +     5.90W       -        -    1  1  1  1        0       0
 2 +     3.60W       -        -    2  2  2  2        0       0
 3 -   0.0700W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        46 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    4,854,764 [2.48 TB]
Data Units Written:                 22,988,874 [11.7 TB]
Host Read Commands:                 68,667,495
Host Write Commands:                450,184,240
Controller Busy Time:               376
Power Cycles:                       136
Power On Hours:                     516
Unsafe Shutdowns:                   63
Media and Data Integrity Errors:    0
Error Information Log Entries:      801,903
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               46 Celsius
Temperature Sensor 2:               41 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0     801903     0  0x0004  0x4016  0x004            0     -     -
  1     801902     0  0x0014  0x4016  0x004            0     -     -
  2     801901     0  0x0003  0x4016  0x004            0     -     -
  3     801900     0  0x0006  0x4016  0x004            0     -     -
  4     801899     0  0x0016  0x4016  0x004            0     -     -
  5     801898     0  0x0014  0x4016  0x004            0     -     -
  6     801897     0  0x0001  0x4016  0x004            0     -     -
  7     801896     0  0x0003  0x4016  0x004            0     -     -
  8     801895     0  0x0001  0x4016  0x004            0     -     -
  9     801894     0  0x0003  0x4016  0x004            0     -     -
 10     801893     0  0x0004  0x4016  0x004            0     -     -
 11     801892     0  0x0015  0x4016  0x004            0     -     -
 12     801891     0  0x0001  0x4016  0x004            0     -     -
 13     801890     0  0x0003  0x4016  0x004            0     -     -
 14     801889     0  0x0006  0x4016  0x004            0     -     -
 15     801888     0  0x0014  0x4016  0x004            0     -     -
... (48 entries not read)
Comment 1 Keith Busch 2021-10-25 22:33:22 UTC
You should report these kinds of errors to your vendor. The driver isn't doing anything wrong here.
Comment 2 Athanasius 2022-08-08 07:53:22 UTC
I am seeing exactly this as well.  It only started recently, but then I'd not been booting into this Linux installation since around February 2022.

That was exactly the time when I upgraded from Debian buster (oldstable) to bullseye (current stable), moving from a 4.19-based kernel to a 5.10-based one.

So presumably some difference between those *Debian* kernel versions has caused SMART to start logging these "Device: /dev/nvme0, number of Error Log entries increased from 2479 to 2482" (and similar) reports.

It consistently detects the count has increased by a few every boot.
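
To track how much the counter moves per boot, the smartd message can be parsed directly. This is a minimal sketch that assumes the message wording quoted above; the pattern and the function name `error_log_delta` are hypothetical and may need adjusting for other smartmontools versions.

```python
import re

# Matches the smartd message quoted above; the exact wording is an assumption
# taken from that single example.
PATTERN = re.compile(
    r"Device: (?P<dev>[^,\s]+), number of Error Log entries increased "
    r"from (?P<old>\d+) to (?P<new>\d+)"
)


def error_log_delta(message):
    """Return (device, increase) for a smartd message, or None if no match."""
    m = PATTERN.search(message)
    if m is None:
        return None
    return m.group("dev"), int(m.group("new")) - int(m.group("old"))


if __name__ == "__main__":
    msg = ("Device: /dev/nvme0, number of Error Log entries "
           "increased from 2479 to 2482")
    print(error_log_delta(msg))
```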

The only output in `dmesg` for `nvme0`:

```
08:52:29 0$ dmesg | grep nvme0
[    1.096802] nvme nvme0: pci function 0000:0a:00.0
[    1.103753] nvme nvme0: missing or invalid SUBNQN field.
[    1.103774] nvme nvme0: Shutdown timeout set to 8 seconds
[    1.114520] nvme nvme0: 8/0/0 default/read/poll queues
[    1.117133]  nvme0n1: p1 p2 p3 p4 p5
```
Comment 3 Athanasius 2022-08-08 07:54:46 UTC
In fact, now I just got the smartctl email for this boot up... it's consistently +3 on Error Logs for the past two boots.
Comment 4 Athanasius 2022-08-08 08:08:45 UTC
I shall try to remember to check if Windows has similar SMART logging anywhere, given that the drive in question is purely for my Windows 10 install (it's the C: drive).  Linux is only concerned with it due to mounting, using ntfs-3g, in case I need to check something in there.

fstab for it:

UUID=<uuid>   /Win10-C   ntfs-3g rw,exec,user,noatime,uid=athan,gid=athan,umask=02,nofail  0       0
Comment 5 Athanasius 2022-08-08 14:26:36 UTC
So, it turns out that the best way to get full SMART information in Windows 10 is ... to install smartmontools for Windows.

Doing so and running `smartctl -x <nvme drive>` there shows another +3 to the "Error Information Log Entries" count.

At this stage it *could* be that every reboot causes this, or it could be that both boot-up and reboot in Linux cause it.  I'll investigate further.

And, yes, it's entirely possible that this is just a misfeature of the drive, or an actual indication of problems with my unit.
Comment 6 Keith Busch 2022-08-08 14:56:31 UTC
The nvme specification is not very consistent on how to identify what features the controller supports, so in some cases the driver just has to try it and see if it worked.

The log entries are likely harmless driver initiated admin commands (SqId 0) checking if a particular feature is supported. The SSD doesn't *need* to log an error entry for such commands as it has no impact on media health (which is what SMART is supposed to care about), but it is allowed to save the error if it wants. I personally find these types of errors to be less than useless.
Comment 7 Athanasius 2022-08-08 15:04:26 UTC
Indeed, the three showing up in Windows 10 are:

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0       2485     0  0x0089  0x4212  0x028            0     -     -
  1       2484     0  0x001d  0x4212  0x028            0     -     -
  2       2483     0  0x0002  0x4004  0x028            0     0     -

In Linux that section of `smartctl -x <device>` output was empty.

In my case you can blame the Debian buster->bullseye upgrade for suddenly highlighting these.  Either it upgraded smartmontools to a version that sends the alert emails, or the different kernel version has tickled something, or both.

All the rest of the SMART output suggests there are no issues with the drive health, so I'll just ignore those specific emails.

Thanks.
Comment 8 kolAflash 2023-05-18 23:03:38 UTC
Maybe related / helpful:

Bug 217445 - standby-resume cycle increases NVMe error count (maybe bad NVMe commands)
https://bugzilla.kernel.org/show_bug.cgi?id=217445
(rebooting also increases the error count in that bugreport)

Smartd should ignore non-error entries from NVMe Error Information log
https://www.smartmontools.org/ticket/1222
