Bug 215763

Summary: nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2)
Product: IO/Storage
Reporter: sander44 (ionut_n2001)
Component: NVMe
Assignee: IO/NVME Virtual Default Assignee (io_nvme)
Status: NEW
Severity: high
CC: bnafta, icegood1980, kbusch
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 5.17.0-next
Subsystem:
Regression: No
Bisected commit-id:
Attachments: dmesg 5.17.0-next

Description sander44 2022-03-27 17:05:01 UTC
Hi,

Today I tried the vanilla 5.17.0-next kernel at this commit:


commit f022814633e1c600507b3a99691b4d624c2813f0 (grafted, HEAD -> master, origin/master, origin/HEAD)
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Sat Mar 26 14:54:41 2022 -0700

    Merge tag 'trace-v5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
    
    Pull trace event string verifier fix from Steven Rostedt:
     "The run-time string verifier checks all trace event formats as
      they are read from the tracing file to make sure that the %s pointers
      are not reading something that no longer exists.
    
      However, it failed to account for the valid case of '%*.s' where the
      length given is zero, and the string is NULL. It incorrectly flagged
      it as a null pointer dereference and gave a WARN_ON()"
    
    * tag 'trace-v5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
      tracing: Have trace event string test handle zero length strings


nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2)

dmesg | grep -E "nvme|nvme0"
[    1.457145] nvme 0000:03:00.0: platform quirk: setting simple suspend
[    1.457179] nvme nvme0: pci function 0000:03:00.0
[    1.534265] nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) 
[    1.541260] nvme nvme0: 8/0/0 default/read/poll queues
[    1.542478]  nvme0n1: p1 p2 p3 p4
[    2.991708] EXT4-fs (nvme0n1p2): mounted filesystem with ordered data mode. Quota mode: none.
[    4.910626] Adding 8388604k swap on /dev/nvme0n1p3.  Priority:-2 extents:1 across:8388604k SSFS
[    4.915648] EXT4-fs (nvme0n1p2): re-mounted. Quota mode: none.
[    5.116574] EXT4-fs (nvme0n1p4): mounted filesystem with ordered data mode. Quota mode: none.


lspci
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:02.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:02.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 51)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 7
01:00.0 VGA compatible controller: NVIDIA Corporation GA106M [GeForce RTX 3060 Mobile / Max-Q] (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 228e (rev a1)
02:00.0 Network controller: MEDIATEK Corp. MT7921 802.11ax PCI Express Wireless Network Adapter
03:00.0 Non-Volatile memory controller: Intel Corporation Device f1aa (rev 03)
04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne (rev c4)
04:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Renoir Radeon High Definition Audio Controller
04:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor
04:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1
04:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1
04:00.5 Multimedia controller: Advanced Micro Devices, Inc. [AMD] Raven/Raven2/FireFlight/Renoir Audio Processor (rev 01)
04:00.6 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) HD Audio Controller

Drive model: INTEL SSDPEKNU010TZ

Comment 1 sander44 2022-03-27 17:05:16 UTC
Created attachment 300627
dmesg 5.17.0-next

Comment 2 Keith Busch 2022-03-27 17:26:34 UTC
It's harmless. The driver is querying whether the controller supports a particular optional mode of Identification. The NVMe spec doesn't give a driver a good way to know ahead of time whether a controller supports an optional identification, so the driver just has to try it to find out. There is a TP (technical proposal) in the workgroup that may improve the situation, but it may take some time to be published and for devices to implement it.
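
For anyone decoding these log lines by hand, here is a minimal sketch in Python. It assumes the opcode and status names below, which are my reading of the NVMe base specification and cover only the values that appear in this report. The table and function names are hypothetical, not part of any tool.

```python
import re

# Lookup tables limited to the values that appear in this report.
# Assumption: names follow the NVMe base specification; verify before relying on them.
ADMIN_OPCODES = {0x06: "Identify"}
IO_OPCODES = {0x02: "Read"}
STATUS_CODE_TYPES = {0x0: "Generic Command Status", 0x3: "Path Related Status"}
STATUS_CODES = {
    (0x0, 0x02): "Invalid Field in Command",
    (0x3, 0x71): "Command Aborted By Host",
}

# Matches lines such as "nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2)".
LINE_RE = re.compile(
    r"(Admin|I/O) Cmd\((0x[0-9a-fA-F]+)\).*"
    r"\(sct (0x[0-9a-fA-F]+) / sc (0x[0-9a-fA-F]+)\)"
)


def decode_nvme_error(line):
    """Describe an nvme error line from dmesg, or return None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    queue = m.group(1)
    opcode = int(m.group(2), 16)
    sct = int(m.group(3), 16)
    sc = int(m.group(4), 16)
    opcodes = ADMIN_OPCODES if queue == "Admin" else IO_OPCODES
    return {
        "queue": queue,
        "command": opcodes.get(opcode, "unknown"),
        "status_type": STATUS_CODE_TYPES.get(sct, "unknown"),
        "status": STATUS_CODES.get((sct, sc), "unknown"),
        # Combined form (sct << 8 | sc), which appears to be what the
        # "Abort status: 0x371" lines print.
        "combined": hex((sct << 8) | sc),
    }


print(decode_nvme_error("nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2)"))
```

For the line in the original report, this decodes to an Identify command rejected with Invalid Field in Command, which is consistent with the controller simply not supporting the optional Identify variant the driver tried.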

Comment 3 Fabio C. Barrionuevo da Luz 2023-06-04 00:20:33 UTC
I have the same error, but with kernel 5.19.0 from Ubuntu 22.04 (KDE neon User Edition).


$ sysctl fs.inotify

fs.inotify.max_queued_events = 32768
fs.inotify.max_user_instances = 8192
fs.inotify.max_user_watches = 1048576


$ free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi        12Gi        37Gi       443Mi        12Gi        49Gi
Swap:             0B          0B          0B


$ uname -a
Linux fabio-pc 5.19.0-43-generic #44~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon May 22 13:39:36 UTC 2 x86_64 x86_64 x86_64 GNU/Linux



$ sudo dmesg | grep nvme1

[    1.461384] nvme nvme1: pci function 0000:04:00.0
[    1.485117] nvme nvme1: 15/0/0 default/read/poll queues
[    1.503202]  nvme1n1: p1 p2 p4 p5
[    4.183196] EXT4-fs (nvme1n1p2): mounted filesystem with ordered data mode. Quota mode: none.
[    4.525988] EXT4-fs (nvme1n1p2): re-mounted. Quota mode: none.
[    4.814331] Adding 67583996k swap on /dev/nvme1n1p5.  Priority:-2 extents:1 across:67583996k SSFS
[  592.844663] nvme nvme1: I/O 640 QID 7 timeout, aborting
[  592.844674] nvme nvme1: I/O 641 QID 7 timeout, aborting
[  592.844678] nvme nvme1: I/O 642 QID 7 timeout, aborting
[  592.844682] nvme nvme1: I/O 643 QID 7 timeout, aborting
[  592.844686] nvme nvme1: I/O 644 QID 7 timeout, aborting
[  622.861066] nvme nvme1: I/O 640 QID 7 timeout, reset controller
[  654.797304] nvme nvme1: I/O 24 QID 0 timeout, reset controller
[  683.493736] nvme1n1: I/O Cmd(0x2) @ LBA 1318392128, 256 blocks, I/O Error (sct 0x3 / sc 0x71) 
[  683.493747] I/O error, dev nvme1n1, sector 1318392128 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 0
[  683.493767] nvme nvme1: Abort status: 0x371
[  683.493770] nvme nvme1: Abort status: 0x371
[  683.493773] nvme nvme1: Abort status: 0x371
[  683.493774] nvme nvme1: Abort status: 0x371
[  683.493776] nvme nvme1: Abort status: 0x371
[  683.535306] nvme nvme1: 15/0/0 default/read/poll queues
[ 1002.955149] nvme nvme1: I/O 0 QID 6 timeout, aborting
[ 1002.955162] nvme nvme1: I/O 1 QID 6 timeout, aborting
[ 1002.955168] nvme nvme1: I/O 832 QID 6 timeout, aborting
[ 1002.955172] nvme nvme1: I/O 833 QID 6 timeout, aborting
[ 1002.955177] nvme nvme1: I/O 834 QID 6 timeout, aborting
[ 1033.674738] nvme nvme1: I/O 0 QID 6 timeout, reset controller
[ 1064.393291] nvme nvme1: I/O 16 QID 0 timeout, reset controller
[ 1095.144296] nvme1n1: I/O Cmd(0x2) @ LBA 566100072, 256 blocks, I/O Error (sct 0x3 / sc 0x71) 
[ 1095.144307] I/O error, dev nvme1n1, sector 566100072 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 0
[ 1095.144363] nvme nvme1: Abort status: 0x371
[ 1095.144367] nvme nvme1: Abort status: 0x371
[ 1095.144369] nvme nvme1: Abort status: 0x371
[ 1095.144371] nvme nvme1: Abort status: 0x371
[ 1095.144373] nvme nvme1: Abort status: 0x371
[ 1095.202389] nvme nvme1: 15/0/0 default/read/poll queues
[ 2375.079668] nvme nvme1: I/O 706 QID 2 timeout, aborting
[ 2375.079683] nvme nvme1: I/O 707 QID 2 timeout, aborting
[ 2375.079689] nvme nvme1: I/O 708 QID 2 timeout, aborting
[ 2375.079695] nvme nvme1: I/O 706 QID 3 timeout, aborting
[ 2375.079700] nvme nvme1: I/O 708 QID 3 timeout, aborting
[ 2405.798811] nvme nvme1: I/O 706 QID 2 timeout, reset controller
[ 2436.517961] nvme nvme1: I/O 20 QID 0 timeout, reset controller
[ 2467.257607] nvme nvme1: Abort status: 0x371
[ 2467.257612] nvme nvme1: Abort status: 0x371
[ 2467.257615] nvme nvme1: Abort status: 0x371
[ 2467.257616] nvme nvme1: Abort status: 0x371
[ 2467.257618] nvme nvme1: Abort status: 0x371
[ 2467.297335] nvme nvme1: 15/0/0 default/read/poll queues
[ 2500.004300] nvme nvme1: I/O 709 QID 3 timeout, aborting
[ 2500.004315] nvme nvme1: I/O 711 QID 3 timeout, aborting
[ 2500.004321] nvme nvme1: I/O 64 QID 5 timeout, aborting
[ 2500.004328] nvme nvme1: I/O 576 QID 9 timeout, aborting
[ 2500.004333] nvme nvme1: I/O 577 QID 9 timeout, aborting
[ 2530.723505] nvme nvme1: I/O 709 QID 3 timeout, reset controller
[ 2561.442725] nvme nvme1: I/O 20 QID 0 timeout, reset controller
[ 2592.186195] nvme1n1: I/O Cmd(0x2) @ LBA 1423860112, 256 blocks, I/O Error (sct 0x3 / sc 0x71) 
[ 2592.186202] I/O error, dev nvme1n1, sector 1423860112 op 0x0:(READ) flags 0x80700 phys_seg 20 prio class 0
[ 2592.186227] nvme nvme1: Abort status: 0x371
[ 2592.186228] nvme nvme1: Abort status: 0x371
[ 2592.186229] nvme nvme1: Abort status: 0x371
[ 2592.186230] nvme nvme1: Abort status: 0x371
[ 2592.186231] nvme nvme1: Abort status: 0x371
[ 2592.226978] nvme nvme1: 15/0/0 default/read/poll queues
[ 2721.182884] nvme nvme1: I/O 0 QID 4 timeout, aborting
[ 2721.182899] nvme nvme1: I/O 1 QID 4 timeout, aborting
[ 2721.182904] nvme nvme1: I/O 2 QID 4 timeout, aborting
[ 2721.182909] nvme nvme1: I/O 3 QID 4 timeout, aborting
[ 2721.182914] nvme nvme1: I/O 4 QID 4 timeout, aborting
[ 2751.902167] nvme nvme1: I/O 0 QID 4 timeout, reset controller
[ 2782.621470] nvme nvme1: I/O 15 QID 0 timeout, reset controller
[ 2813.373247] nvme nvme1: Abort status: 0x371
[ 2813.373251] nvme nvme1: Abort status: 0x371
[ 2813.373253] nvme nvme1: Abort status: 0x371
[ 2813.373255] nvme nvme1: Abort status: 0x371
[ 2813.373256] nvme nvme1: Abort status: 0x371
[ 2813.420857] nvme nvme1: 15/0/0 default/read/poll queues


$ sudo smartctl -a /dev/nvme1

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.19.0-43-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       ADATA SX8200PNP
Serial Number:                      2K342LQHNHJ2
Firmware Version:                   42B2S7JA
PCI Vendor/Subsystem ID:            0x1cc1
IEEE OUI Identifier:                0x000000
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1.024.209.543.168 [1,02 TB]
Namespace 1 Utilization:            892.418.854.912 [892 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Sat Jun  3 13:06:09 2023 -03
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     75 Celsius
Critical Comp. Temp. Threshold:     80 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.00W       -        -    0  0  0  0        0       0
 1 +     4.60W       -        -    1  1  1  1        0       0
 2 +     3.80W       -        -    2  2  2  2        0       0
 3 -   0.0450W       -        -    3  3  3  3     2000    2000
 4 -   0.0040W       -        -    4  4  4  4    15000   15000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        39 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    10%
Data Units Read:                    28.790.609 [14,7 TB]
Data Units Written:                 98.726.571 [50,5 TB]
Host Read Commands:                 605.778.838
Host Write Commands:                1.588.032.646
Controller Busy Time:               21.672
Power Cycles:                       1.188
Power On Hours:                     9.520
Unsafe Shutdowns:                   151
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

Comment 4 Fabio C. Barrionuevo da Luz 2023-06-04 02:22:00 UTC
Additionally, my system freezes for about 40 seconds or more and only recovers after I press CTRL + ALT + F2, wait a few seconds, and press CTRL + ALT + F1.

I am trying to identify whether the problem is caused by a kernel misconfiguration, by an SSD controller or NAND malfunction (the ADATA SSD Toolbox on Windows 10 does not report any errors), by a power setting as described in https://wiki.archlinux.org/title/Solid_state_drive/NVMe#Controller_failure_due_to_broken_APST_support, by misbehavior of the official NVIDIA driver, by a problem in X, or by excessive inode use by the JetBrains PyCharm or Android Studio IDEs.
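
To narrow down the APST possibility, it may help to see which power states the kernel could transition into autonomously. The sketch below assumes, as far as I understand the driver, that APST only targets non-operational states whose entry plus exit latency fits within nvme_core.default_ps_max_latency_us (default 100000). The helper name is hypothetical, and the table is copied from the smartctl output in comment 3.

```python
# Power state table copied from the smartctl output in comment 3.
POWER_STATES = """\
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.00W       -        -    0  0  0  0        0       0
 1 +     4.60W       -        -    1  1  1  1        0       0
 2 +     3.80W       -        -    2  2  2  2        0       0
 3 -   0.0450W       -        -    3  3  3  3     2000    2000
 4 -   0.0040W       -        -    4  4  4  4    15000   15000
"""


def apst_candidates(table, max_latency_us=100_000):
    """List (state, total_latency_us) for non-operational states within the limit.

    Assumption: the kernel compares entry + exit latency against
    nvme_core.default_ps_max_latency_us; check the driver source to confirm.
    """
    candidates = []
    for row in table.splitlines():
        fields = row.split()
        # Data rows have 11 columns and start with the numeric state index.
        if len(fields) != 11 or not fields[0].isdigit():
            continue
        state = int(fields[0])
        operational = fields[1] == "+"
        total_latency_us = int(fields[9]) + int(fields[10])
        if not operational and total_latency_us <= max_latency_us:
            candidates.append((state, total_latency_us))
    return candidates


print(apst_candidates(POWER_STATES))
```

Under these assumptions, both non-operational states (3 and 4) fall within the default limit. Booting once with nvme_core.default_ps_max_latency_us=0, as the linked wiki page describes, would show whether the timeouts still occur when those states are not used.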


If anyone knows how I can get more useful logs to help figure out the cause of the problem, I would appreciate it.
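
One way to get a more compact picture from the logs I already have is to group the timeout messages into episodes and compare when they occur with what the system was doing. This is a rough sketch that assumes the message format shown in comment 3. The 45-second grouping gap is an arbitrary choice and the function name is hypothetical.

```python
import re

# Matches lines such as "[  592.844663] nvme nvme1: I/O 640 QID 7 timeout, aborting".
TIMEOUT_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] nvme (nvme\d+): I/O (\d+) QID (\d+) timeout, "
    r"(aborting|reset controller)"
)

# A few lines copied from the dmesg output in comment 3.
SAMPLE = [
    "[  592.844663] nvme nvme1: I/O 640 QID 7 timeout, aborting",
    "[  592.844674] nvme nvme1: I/O 641 QID 7 timeout, aborting",
    "[  592.844678] nvme nvme1: I/O 642 QID 7 timeout, aborting",
    "[  592.844682] nvme nvme1: I/O 643 QID 7 timeout, aborting",
    "[  592.844686] nvme nvme1: I/O 644 QID 7 timeout, aborting",
    "[  622.861066] nvme nvme1: I/O 640 QID 7 timeout, reset controller",
    "[  654.797304] nvme nvme1: I/O 24 QID 0 timeout, reset controller",
    "[ 1002.955149] nvme nvme1: I/O 0 QID 6 timeout, aborting",
]


def timeout_episodes(lines, gap=45.0):
    """Group timeout messages into episodes separated by more than `gap` seconds."""
    episodes = []
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m is None:
            continue
        timestamp = float(m.group(1))
        qid = int(m.group(4))
        action = m.group(5)
        # Start a new episode when the previous message is too far in the past.
        if not episodes or timestamp - episodes[-1]["end"] > gap:
            episodes.append(
                {"start": timestamp, "end": timestamp, "aborts": 0, "resets": 0, "qids": set()}
            )
        episode = episodes[-1]
        episode["end"] = timestamp
        episode["qids"].add(qid)
        if action == "aborting":
            episode["aborts"] += 1
        else:
            episode["resets"] += 1
    return episodes


for ep in timeout_episodes(SAMPLE):
    print(ep)
```

In the sample, the first episode spans roughly a minute and ends with a timeout on QID 0, the admin queue, which would fit the length of the freezes I described.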