Hello, I installed a "Kingston KC3000 PCIe 4.0 SSD" of 4 Terabyte size in my new Ryzen 6000 Laptop "Lenovo Yoga Slim 7 Pro X". The original SSD works fine, but the new Kingston one results in a delayed boot and produces errors in the kernel log. Each operation on the ssd is causing a delay. For example I need to wait for some time until "fdisk -l" responds back. "smartctl -a" doesn't work at all: Read NVMe Identify Controller failed: NVME_IOCTL_ADMIN_CMD: Interrupted system call An excerpt of the most relevant kernel logs "dmesg | grep nvme": [ 0.815616] nvme 0000:02:00.0: platform quirk: setting simple suspend [ 0.815645] nvme nvme0: pci function 0000:02:00.0 [ 1.281525] nvme nvme0: Shutdown timeout set to 10 seconds [ 1.285707] nvme nvme0: 16/0/0 default/read/poll queues [ 32.150395] nvme nvme0: invalid id 0 completed on queue 7 [ 32.150404] nvme nvme0: I/O 512 (Read) QID 7 timeout, aborting [ 32.150651] nvme nvme0: Abort status: 0x0 [ 62.870472] nvme nvme0: I/O 512 QID 7 timeout, reset controller [ 124.310467] nvme nvme0: invalid id 0 completed on queue 0 [ 124.310490] nvme nvme0: invalid id 0 completed on queue 0 [ 124.310505] nvme nvme0: invalid id 0 completed on queue 0 [ 124.310520] nvme nvme0: invalid id 0 completed on queue 0 [ 124.310534] nvme nvme0: invalid id 0 completed on queue 0 [ 124.310548] nvme nvme0: invalid id 0 completed on queue 0 [ 124.310562] nvme nvme0: invalid id 0 completed on queue 0 [ 124.310576] nvme nvme0: invalid id 0 completed on queue 0 [ 124.310590] nvme nvme0: invalid id 0 completed on queue 0 [ 124.310605] nvme nvme0: invalid id 0 completed on queue 0 [ 124.310619] nvme nvme0: invalid id 0 completed on queue 0 [ 124.310634] nvme nvme0: invalid id 0 completed on queue 0 [ 124.310648] nvme nvme0: invalid id 0 completed on queue 0 [ 124.310662] nvme nvme0: invalid id 0 completed on queue 0 [ 124.310676] nvme nvme0: invalid id 0 completed on queue 0 [ 124.310690] nvme nvme0: invalid id 0 completed on queue 0 [ 124.310695] nvme nvme0: I/O 8 QID 0 timeout, reset controller [ 124.327412] nvme0n1: Read(0x2) @ LBA 8001573376, 8 blocks, Host Aborted Command (sct 0x3 / sc 0x71) [ 124.327464] I/O error, dev nvme0n1, sector 8001573376 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2 [ 124.359883] nvme nvme0: Shutdown timeout set to 10 seconds [ 124.362027] nvme nvme0: 16/0/0 default/read/poll queues [ 144.913853] nvme nvme0: invalid id 0 completed on queue 2 [ 155.030534] nvme nvme0: I/O 704 (Read) QID 2 timeout, aborting [ 185.750213] nvme nvme0: I/O 704 QID 2 timeout, reset controller [ 185.750274] nvme nvme0: invalid id 0 completed on queue 0 [ 185.787269] nvme0n1: Read(0x2) @ LBA 8001573376, 8 blocks, Host Aborted Command (sct 0x3 / sc 0x71) [ 185.787363] I/O error, dev nvme0n1, sector 8001573376 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2 [ 185.787471] nvme nvme0: Abort status: 0x371 [ 185.802599] nvme nvme0: Shutdown timeout set to 10 seconds [ 185.804537] nvme nvme0: 16/0/0 default/read/poll queues [ 267.670245] nvme nvme0: request 0x0 genctr mismatch (got 0x0 expected 0x1) [ 267.670332] nvme nvme0: invalid id 0 completed on queue 0 [ 267.670343] nvme nvme0: I/O 24 QID 0 timeout, reset controller [ 267.710387] nvme0: Identify(0x6), Host Aborted Command (sct 0x3 / sc 0x71) [ 267.726065] nvme nvme0: Shutdown timeout set to 10 seconds [ 267.729259] nvme nvme0: 16/0/0 default/read/poll queues [ 806.976986] nvme nvme0: invalid id 0 completed on queue 7 [ 806.977004] nvme nvme0: I/O 513 (Read) QID 7 timeout, aborting [ 806.977236] nvme nvme0: Abort status: 0x0 The attachment is a .tar.xz containing command outputs of: "smartctl -a ", "fdisk -l", "dmesg | grep nvme", "lspci -vv", "dmesg", "journalctl -b" My main question: Is it possible to determine if this is a kernel issue or is it a hardware issue and can it be solved?
Created attachment 301876 [details] tar.xz containing outputs of different commands The attachment is a .tar.xz containing command outputs of: "smartctl -a ", "fdisk -l", "dmesg | grep nvme", "lspci -vv", "dmesg", "journalctl -b"
Created attachment 301877 [details] dmesg output
Created attachment 301878 [details] dmesg | grep nvme dmesg | grep nvme
Created attachment 301879 [details] fdisk -l
Created attachment 301880 [details] journalctl -b
Created attachment 301881 [details] lspci -vv
Created attachment 301882 [details] smartctl -a
It looks like your device is out of compliance with respect to host command id's. I can send a patch later today that will turn off the generation sequence for this model and see if that fixes your observations.
Thank you for your response. I standby and make myself mentally ready to compile a kernel. What consequences I have to expect if your workaround works? Making the ssd normally usable again is the goal, but what about other features like the power management?
There are no other side effects from disabling the sequencer. It's just a defensive mechanism in the driver to detect double completions. It's a relatively new driver feature because we didn't believe it was necessary until it actually happened in the wild. The consequences of a double competition are so awful that we were practically compelled to detect it. In general, though, it's unnecessary paranoia for most devices, so it's harmless to turn it off.
Created attachment 301885 [details] proposed fix
I compiled a new kernel from a freshly pulled linux git repository. Some of the messages regarding nvme disappeared, but "gdisk /dev/nvme0n1" and "fdsik -l" are still stuck aswell as "smartctl -a /dev/nvme0n1". I also have the problem that my whole system freezes after some time. The kernel log gives the following messages after boot: archusb ~# journalctl -b | grep nvme Sep 28 20:59:03 archusb kernel: Linux version 6.0.0-rc7nvmefix-00068-g49c13ed0316d-dirty (user@headquarter) (gcc (GCC) 12.2.0, GNU ld (GNU Binutils) 2.39.0) #1 SMP PREEMPT_DYNAMIC Wed Sep 28 16:54:32 CEST 2022 Sep 28 20:59:03 archusb kernel: Command line: initrd=\initramfs-linux-nvmefix.img root=UUID=6d6c3551-a53c-425b-ab29-9cf52e1943f7 rw Sep 28 20:59:03 archusb kernel: Kernel command line: initrd=\initramfs-linux-nvmefix.img root=UUID=6d6c3551-a53c-425b-ab29-9cf52e1943f7 rw Sep 28 20:59:03 archusb kernel: usb usb1: Manufacturer: Linux 6.0.0-rc7nvmefix-00068-g49c13ed0316d-dirty xhci-hcd Sep 28 20:59:03 archusb kernel: usb usb2: Manufacturer: Linux 6.0.0-rc7nvmefix-00068-g49c13ed0316d-dirty xhci-hcd Sep 28 20:59:03 archusb kernel: usb usb3: Manufacturer: Linux 6.0.0-rc7nvmefix-00068-g49c13ed0316d-dirty xhci-hcd Sep 28 20:59:03 archusb kernel: usb usb4: Manufacturer: Linux 6.0.0-rc7nvmefix-00068-g49c13ed0316d-dirty xhci-hcd Sep 28 20:59:03 archusb kernel: usb usb5: Manufacturer: Linux 6.0.0-rc7nvmefix-00068-g49c13ed0316d-dirty xhci-hcd Sep 28 20:59:03 archusb kernel: usb usb6: Manufacturer: Linux 6.0.0-rc7nvmefix-00068-g49c13ed0316d-dirty xhci-hcd Sep 28 20:59:03 archusb kernel: usb usb7: Manufacturer: Linux 6.0.0-rc7nvmefix-00068-g49c13ed0316d-dirty xhci-hcd Sep 28 20:59:03 archusb kernel: usb usb8: Manufacturer: Linux 6.0.0-rc7nvmefix-00068-g49c13ed0316d-dirty xhci-hcd Sep 28 20:59:03 archusb kernel: usb usb9: Manufacturer: Linux 6.0.0-rc7nvmefix-00068-g49c13ed0316d-dirty xhci-hcd Sep 28 20:59:03 archusb kernel: usb usb10: Manufacturer: Linux 6.0.0-rc7nvmefix-00068-g49c13ed0316d-dirty xhci-hcd Sep 28 20:59:03 archusb kernel: nvme 0000:02:00.0: platform quirk: setting simple suspend Sep 28 20:59:03 archusb kernel: nvme nvme0: pci function 0000:02:00.0 Sep 28 20:59:03 archusb kernel: nvme nvme0: Shutdown timeout set to 10 seconds Sep 28 20:59:03 archusb kernel: nvme nvme0: 16/0/0 default/read/poll queues And the following messages after I executed "gdisk /dev/nvme0n1" (3 times in a row, last time my system freezed): Sep 28 20:59:35 archusb kernel: nvme nvme0: invalid id 0 completed on queue 1 Sep 28 20:59:35 archusb kernel: nvme nvme0: I/O 384 (Read) QID 1 timeout, aborting Sep 28 20:59:35 archusb kernel: nvme nvme0: Abort status: 0x0 Sep 28 21:00:04 archusb systemd-udevd[375]: nvme0n1: Worker [425] processing SEQNUM=2739 is taking a long time Sep 28 21:00:06 archusb kernel: nvme nvme0: I/O 384 QID 1 timeout, reset controller Sep 28 21:00:06 archusb kernel: nvme nvme0: invalid id 0 completed on queue 15 Sep 28 21:00:06 archusb kernel: nvme0n1: Read(0x2) @ LBA 8001573376, 8 blocks, Host Aborted Command (sct 0x3 / sc 0x71) Sep 28 21:00:06 archusb kernel: I/O error, dev nvme0n1, sector 8001573376 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2 Sep 28 21:00:06 archusb kernel: nvme0n1: Read(0x2) @ LBA 0, 32 blocks, Host Aborted Command (sct 0x3 / sc 0x71) Sep 28 21:00:06 archusb kernel: I/O error, dev nvme0n1, sector 0 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 2 Sep 28 21:00:06 archusb kernel: nvme nvme0: Shutdown timeout set to 10 seconds Sep 28 21:00:06 archusb kernel: nvme nvme0: 16/0/0 default/read/poll queues Sep 28 21:01:46 archusb kernel: nvme nvme0: I/O 0 (Read) QID 13 timeout, aborting Sep 28 21:01:46 archusb kernel: nvme nvme0: Abort status: 0x0 Sep 28 21:02:24 archusb kernel: nvme nvme0: invalid id 0 completed on queue 12 Sep 28 21:02:24 archusb kernel: nvme nvme0: I/O 960 (Read) QID 12 timeout, aborting Sep 28 21:02:24 archusb kernel: nvme nvme0: Abort status: 0x0 Sep 28 21:02:55 archusb kernel: nvme nvme0: I/O 960 QID 12 timeout, reset controller Sep 28 21:03:56 archusb kernel: nvme nvme0: I/O 4 QID 0 timeout, reset controller Sep 28 21:03:56 archusb kernel: videobuf2_common tpm_crb snd_soc_acpi aesni_intel typec_ucsi vfat sparse_keymap tpm_tis snd_timer fat crypto_simd videodev snd tpm_tis_core platform_profile ecdh_generic cryptd rapl pcspkr typec mc cec crc16 snd_pci_acp3x soundcore rfkill ccp i2c_piix4 k10temp roles tpm i2c_hid_acpi wmi mac_hid i2c_hid rng_core amd_pmc acpi_cpufreq video acpi_tad fuse bpf_preload ip_tables x_tables btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq uas usb_storage usbhid serio_raw atkbd nvme libps2 vivaldi_fmap nvme_core i8042 xhci_pci crc32c_intel xhci_pci_renesas serio Sep 28 21:03:56 archusb kernel: CPU: 12 PID: 92 Comm: ksoftirqd/12 Not tainted 6.0.0-rc7nvmefix-00068-g49c13ed0316d-dirty #1 7ecd7809f8f37e0153b2dde865e868f3fbcc257c Sep 28 21:03:56 archusb kernel: nvme_del_cq_end+0x27/0x40 [nvme a5ac8cd8b1b0917064a5d222a5652fc5a9ad93f3] Sep 28 21:04:24 archusb kernel: videobuf2_common tpm_crb snd_soc_acpi aesni_intel typec_ucsi vfat sparse_keymap tpm_tis snd_timer fat crypto_simd videodev snd tpm_tis_core platform_profile ecdh_generic cryptd rapl pcspkr typec mc cec crc16 snd_pci_acp3x soundcore rfkill ccp i2c_piix4 k10temp roles tpm i2c_hid_acpi wmi mac_hid i2c_hid rng_core amd_pmc acpi_cpufreq video acpi_tad fuse bpf_preload ip_tables x_tables btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq uas usb_storage usbhid serio_raw atkbd nvme libps2 vivaldi_fmap nvme_core i8042 xhci_pci crc32c_intel xhci_pci_renesas serio Sep 28 21:04:24 archusb kernel: CPU: 12 PID: 92 Comm: ksoftirqd/12 Tainted: G W 6.0.0-rc7nvmefix-00068-g49c13ed0316d-dirty #1 7ecd7809f8f37e0153b2dde865e868f3fbcc257c Sep 28 21:04:52 archusb kernel: videobuf2_common tpm_crb snd_soc_acpi aesni_intel typec_ucsi vfat sparse_keymap tpm_tis snd_timer fat crypto_simd videodev snd tpm_tis_core platform_profile ecdh_generic cryptd rapl pcspkr typec mc cec crc16 snd_pci_acp3x soundcore rfkill ccp i2c_piix4 k10temp roles tpm i2c_hid_acpi wmi mac_hid i2c_hid rng_core amd_pmc acpi_cpufreq video acpi_tad fuse bpf_preload ip_tables x_tables btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq uas usb_storage usbhid serio_raw atkbd nvme libps2 vivaldi_fmap nvme_core i8042 xhci_pci crc32c_intel xhci_pci_renesas serio Sep 28 21:04:52 archusb kernel: CPU: 12 PID: 92 Comm: ksoftirqd/12 Tainted: G W L 6.0.0-rc7nvmefix-00068-g49c13ed0316d-dirty #1 7ecd7809f8f37e0153b2dde865e868f3fbcc257c Sep 28 21:04:52 archusb kernel: RIP: 0010:nvme_pci_complete_rq+0x4/0x90 [nvme] Sep 28 21:04:56 archusb kernel: CPU: 12 PID: 92 Comm: ksoftirqd/12 Tainted: G W L 6.0.0-rc7nvmefix-00068-g49c13ed0316d-dirty #1 7ecd7809f8f37e0153b2dde865e868f3fbcc257c Sep 28 21:04:56 archusb kernel: RIP: 0010:nvme_error_status+0x1c/0x150 [nvme_core] Sep 28 21:04:56 archusb kernel: nvme_complete_rq+0x46/0x2e0 [nvme_core 270b20fd7b0b171d3fb7775cf55fe067099f26f1] Sep 28 21:05:08 archusb kernel: Tainted: G W L 6.0.0-rc7nvmefix-00068-g49c13ed0316d-dirty #1 Sep 28 21:05:24 archusb kernel: videobuf2_common tpm_crb snd_soc_acpi aesni_intel typec_ucsi vfat sparse_keymap tpm_tis snd_timer fat crypto_simd videodev snd tpm_tis_core platform_profile ecdh_generic cryptd rapl pcspkr typec mc cec crc16 snd_pci_acp3x soundcore rfkill ccp i2c_piix4 k10temp roles tpm i2c_hid_acpi wmi mac_hid i2c_hid rng_core amd_pmc acpi_cpufreq video acpi_tad fuse bpf_preload ip_tables x_tables btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq uas usb_storage usbhid serio_raw atkbd nvme libps2 vivaldi_fmap nvme_core i8042 xhci_pci crc32c_intel xhci_pci_renesas serio Sep 28 21:05:24 archusb kernel: CPU: 12 PID: 92 Comm: ksoftirqd/12 Tainted: G W L 6.0.0-rc7nvmefix-00068-g49c13ed0316d-dirty #1 7ecd7809f8f37e0153b2dde865e868f3fbcc257c Sep 28 21:05:52 archusb kernel: videobuf2_common tpm_crb snd_soc_acpi aesni_intel typec_ucsi vfat sparse_keymap tpm_tis snd_timer fat crypto_simd videodev snd tpm_tis_core platform_profile ecdh_generic cryptd rapl pcspkr typec mc cec crc16 snd_pci_acp3x soundcore rfkill ccp i2c_piix4 k10temp roles tpm i2c_hid_acpi wmi mac_hid i2c_hid rng_core amd_pmc acpi_cpufreq video acpi_tad fuse bpf_preload ip_tables x_tables btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq uas usb_storage usbhid serio_raw atkbd nvme libps2 vivaldi_fmap nvme_core i8042 xhci_pci crc32c_intel xhci_pci_renesas serio Sep 28 21:05:52 archusb kernel: CPU: 12 PID: 92 Comm: ksoftirqd/12 Tainted: G W L 6.0.0-rc7nvmefix-00068-g49c13ed0316d-dirty #1 7ecd7809f8f37e0153b2dde865e868f3fbcc257c It clearly shows some qid timing out, whatever this means.
Could you send output of 'lspci -xs 02:00.0'?
Here it is: archusb ~# lspci -xs 02:00.0 02:00.0 Non-Volatile memory controller: Kingston Technology Company, Inc. Device 5013 (rev 01) 00: 46 26 13 50 06 04 10 00 01 02 08 01 10 00 00 00 10: 04 00 70 80 00 00 00 00 00 00 00 00 00 00 00 00 20: 00 00 00 00 00 00 00 00 00 00 00 00 46 26 13 50 30: 00 00 00 00 80 00 00 00 00 00 00 00 ff 01 00 00
Thanks, that confirms the quirk used the correct identifiers. So it's not a matter of sequencing breaking your device, but something seems to have gone very wrong. I'll send a debug patch because your dmesg output doesn't make immediate sense: [ 32.150363] could not locate request for tag 0x0 [ 32.150395] nvme nvme0: invalid id 0 completed on queue 7 Assuming we're actually using the queue 7's tagset, which is shown to be an active queue in your setup, tag 0 should always be valid.
In the meantime, could you run with kernel param 'nvme.io_queue_depth=2'?
I booted up the kernel with your parameter. Then I tried to create a simple gpt partition with gdisk. The command hanged, but then gave "The operation has completed successfully.". Opening gdisk a second time gave "Caution: invalid main GPT header, but valid backup; regenerating main header from backup!". The following logs were generated during the process: archusb ~# journalctl -b | grep nvme Sep 29 18:09:43 archusb kernel: Linux version 6.0.0-rc7nvmefix-00068-g49c13ed0316d-dirty (user@headquarter) (gcc (GCC) 12.2.0, GNU ld (GNU Binutils) 2.39.0) #1 SMP PREEMPT_DYNAMIC Wed Sep 28 16:54:32 CEST 2022 Sep 29 18:09:43 archusb kernel: Command line: initrd=\initramfs-linux-nvmefix.img root=UUID=6d6c3551-a53c-425b-ab29-9cf52e1943f7 rw nvme.io_queue_depth=2 Sep 29 18:09:43 archusb kernel: Kernel command line: initrd=\initramfs-linux-nvmefix.img root=UUID=6d6c3551-a53c-425b-ab29-9cf52e1943f7 rw nvme.io_queue_depth=2 [...usbstuff..] Sep 29 18:09:43 archusb kernel: nvme 0000:02:00.0: platform quirk: setting simple suspend Sep 29 18:09:43 archusb kernel: nvme nvme0: pci function 0000:02:00.0 Sep 29 18:09:43 archusb kernel: nvme nvme0: Shutdown timeout set to 10 seconds Sep 29 18:09:43 archusb kernel: nvme nvme0: 16/0/0 default/read/poll queues Sep 29 16:10:16 archusb kernel: nvme nvme0: I/O 0 (Read) QID 16 timeout, aborting Sep 29 16:10:16 archusb kernel: nvme nvme0: Abort status: 0x0 Sep 29 16:10:47 archusb kernel: nvme nvme0: I/O 0 (Read) QID 10 timeout, aborting Sep 29 16:10:47 archusb kernel: nvme nvme0: Abort status: 0x0 Sep 29 16:12:14 archusb kernel: nvme nvme0: I/O 0 (Read) QID 10 timeout, aborting Sep 29 16:12:14 archusb kernel: nvme nvme0: Abort status: 0x0 Sep 29 16:12:14 archusb kernel: nvme0n1: p1 Sep 29 16:12:46 archusb kernel: nvme nvme0: I/O 0 QID 5 timeout, completion polled Sep 29 16:12:46 archusb kernel: nvme0n1: p1 Sep 29 16:12:46 archusb systemd-homed[551]: block device /sys/devices/pci0000:00/0000:00:02.4/0000:02:00.0/nvme/nvme0/nvme0n1/nvme0n1p1 has been removed. Sep 29 16:13:32 archusb kernel: nvme nvme0: I/O 0 (Read) QID 8 timeout, aborting Sep 29 16:13:32 archusb kernel: nvme nvme0: Abort status: 0x0 Sep 29 16:13:32 archusb systemd-homed[551]: block device /sys/devices/pci0000:00/0000:00:02.4/0000:02:00.0/nvme/nvme0/nvme0n1/nvme0n1p1 has been removed.
This is not looking like a functional nvme controller. Was this device working in a different environment? Could we test with one more kernel parameter in addition to the previous request?: "pci=nomsi". Your system will probably be slower with that, but it will simplify the nvme queue and interrupt setup.
Okay, I already mentioned that the drive didn't appear in Windows 10 drive selection. Now I tried with Windows 11 and it appeared and I could install Windows, but unfortunately it won't boot. During boot I get bluescreens. Because of your findings and the fact that I couldn't make the SSD work on windows, I also conclude the ssd is defective. I decided to order the exact same ssd (because I am crazy) and send the current one back as defective. I will report back when the new SSD is here and if it works. The parameter pci=nomsi lets my kernel drop into a recovery shell. I attach a photo of how it appears to me. I think my USB Stick is indirectly attached via pci and it can't find the UUID of the partition anymore.
Created attachment 301925 [details] kernel with nvme patch and pci=nomsi parameter
I received my identical replacement SSD (Kingston KC3000 4TB). Unfortunately it still doesn't work and shows the exactly same behavior. Sometimes my "fdisk -l" command isn't stuck when I freshly booted. But it will reproducable be stuck some time later. I will provide three log files for three different boots. The first one contains a stack trace which might be of interest: ct 06 12:49:19 archusb kernel: Hardware name: LENOVO 82TL/LNVNB161216, BIOS JVCN30WW 07/25/2022 Oct 06 12:49:19 archusb kernel: Workqueue: kblockd blk_mq_timeout_work Oct 06 12:49:19 archusb kernel: RIP: 0010:blk_mq_put_rq_ref+0x4f/0x60 Oct 06 12:49:19 archusb kernel: Code: 2b 90 00 cc 8b 83 98 00 00 00 83 c0 7f 83 f8 7f 76 18 f0 ff 8b 98 00 00 00 74 06 5b c3 cc cc cc cc 48 89 df 5b e9 41 c0 ff ff <0f> 0b eb e4 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 66 0f 1f 00 0f Oct 06 12:49:19 archusb kernel: RSP: 0018:ffffa73001f4fd78 EFLAGS: 00010293 Oct 06 12:49:19 archusb kernel: RAX: 0000000000000071 RBX: ffff98e8c9418000 RCX: 0000000000000000 Oct 06 12:49:19 archusb kernel: RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffff98e8c9418000 Oct 06 12:49:19 archusb kernel: RBP: ffff98e8ca2103b0 R08: 0000000000000000 R09: 0000000000000008 Oct 06 12:49:19 archusb kernel: R10: ffffa73001f4fc94 R11: 000000000000001c R12: ffff98e8c0c4fc00 Oct 06 12:49:19 archusb kernel: R13: 0000000000000001 R14: 0000000000000004 R15: 0000000000000004 Oct 06 12:49:19 archusb kernel: FS: 0000000000000000(0000) GS:ffff98ef22140000(0000) knlGS:0000000000000000 Oct 06 12:49:19 archusb kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Oct 06 12:49:19 archusb kernel: CR2: 000055b20aaf20c8 CR3: 000000071f610000 CR4: 0000000000750ee0 Oct 06 12:49:19 archusb kernel: PKRU: 55555554 Oct 06 12:49:19 archusb kernel: Call Trace: Oct 06 12:49:19 archusb kernel: <TASK> Oct 06 12:49:19 archusb kernel: bt_iter+0x50/0xa0 Oct 06 12:49:19 archusb kernel: blk_mq_queue_tag_busy_iter+0x327/0x5e0 Oct 06 12:49:19 archusb kernel: ? __blk_mq_complete_request_remote+0x20/0x20 Oct 06 12:49:19 archusb kernel: ? finish_task_switch.isra.0+0x90/0x2d0 Oct 06 12:49:19 archusb kernel: ? __blk_mq_complete_request_remote+0x20/0x20 Oct 06 12:49:19 archusb kernel: blk_mq_timeout_work+0x69/0x150 Oct 06 12:49:19 archusb kernel: process_one_work+0x1c7/0x380 Oct 06 12:49:19 archusb kernel: worker_thread+0x51/0x390 Oct 06 12:49:19 archusb kernel: ? rescuer_thread+0x3b0/0x3b0 Oct 06 12:49:19 archusb kernel: kthread+0xde/0x110 Oct 06 12:49:19 archusb kernel: ? kthread_complete_and_exit+0x20/0x20 Oct 06 12:49:19 archusb kernel: ret_from_fork+0x22/0x30 Oct 06 12:49:19 archusb kernel: </TASK> Oct 06 12:49:19 archusb kernel: ---[ end trace 0000000000000000 ]--- Oct 06 12:49:19 archusb kernel: nvme nvme0: Shutdown timeout set to 10 seconds Oct 06 12:49:19 archusb kernel: nvme nvme0: 16/0/0 default/read/poll queues I now think it is a kernel issue again. How can I be of most help to further trace the problem? First I do is updating from rc7 to the 6.0 kernel.
Created attachment 302944 [details] after ssd replacement 3 outputs of "journalctl -b" from 3 different boots
Created attachment 302946 [details] trace prints on completion Long shot question: do you happen to have access to pcie protocol analyzer, like a LeCroy or Xgig? Attaching a patch. If you could add that, and recompile the kernel, but change your .config so that NVMe is built-in (option 'y') instead of a module (option 'n'): CONFIG_NVME_CORE=y CONFIG_BLK_DEV_NVME=y Then add to the kernel command line: trace-event=nvme:nvme_setup_cmd,nvme:nvme_complete_rq,nvme:nvme_sq The trace-event commandline won't work if nvme is built as a module, so the built-in step is important. After booting, run 'cat /sys/kernel/debug/tracing/trace' and attach the results.
> do you happen to have access to pcie protocol analyzer, like a LeCroy or > Xgig? I do not. I am sorry. I know what the difference between a spinlock and a semaphore is, but I didn't ascend to such a high level yet. --- I installed Windows 11 on the new SSD. It worked, but setting up the OS was much of a struggle. There were multiple bluescreens and freezes during the initial setup process. This is the same behavior as with the first KC3000 SSD. Now it seems to "mostly" work on windows, with which I mean I still suffer from occasional bluescreens, but could execute some programs. I used "AS SSD Benchmark", "Kingston SSD Manager" and "CrystalDiskInfo". The tools seem to report no errors if I read them correctly. There is no firmware update for the ssd available. I will send the information to the Kingston support e-mail aswell. I wonder why the normal SSD for my laptop works fine, while the Kingston one does not. Also other people are using this SSD without any problems at least on windows. Might this be a Kingston Firmware, Ryzen 6000 or Lenovo Firmware problem? I report back when I have news. I will own the SSD for another week before I need to send it back aswell. Until then I will compile the kernel according to your instructions.
Created attachment 302947 [details] Screenshots and export from kingston ssd manager
Created attachment 302948 [details] Screenshot from Crystaldiskinfo
Created attachment 302949 [details] Screenshot from AS SSD Benchmark
I downloaded the 6.0 mainline kernel sources as tarball. Then applied both patches provided in this issue. The .config is again based on the default Arch Linux config from /proc/config.gz . The two settings in the .config for including the nvme modules into the kernel have been set. I compiled the new kernel and reconfigured my bootloader. #cat /boot/loader/entries/linux-nvmetrace.conf title Arch Linux 6.0 Nvmetrace linux /vmlinuz-linux-nvmetrace initrd /initramfs-linux-nvmetrace.img options root=UUID=6d6c3551-a53c-425b-ab29-9cf52e1943f7 rw trace-event=nvme:nvme_setup_cmd,nvme:nvme_complete_rq,nvme:nvme_sq Next I shortly upload the desired output.
Created attachment 302951 [details] nvmetrace kernel "journalctl -b"
Created attachment 302952 [details] nvmetrace kernel "zcat /proc/config.gz"
Created attachment 302953 [details] nvmetrace kernel "cat /sys/kernel/debug/tracing/trace"
The compiled kernel is slow and also runs into a full system freeze after a while. The "fdisk -l" command was stuck as I generated the files so the issue was already clearly present. Once I saw a message indicating a kernel panic as the freeze happened, but couldn't see why.
(In reply to hastegeldhastewas from comment #24) > > do you happen to have access to pcie protocol analyzer, like a LeCroy or > > Xgig? > > I do not. I am sorry. I know what the difference between a spinlock and a > semaphore is, but I didn't ascend to such a high level yet. That type of equipment would actually be going much lower level: snooping signals off copper wires :) You usually only find them in hardware labs, and I only ask about getting such traces if I strongly suspect a protocol violation (this is clearly in that arena). Lacking wire traces, we have kernel traces, and that's often good enough. From the kernel trace you provided (thank you!), your device is definitely busted. The relevant snippets are as follows: 1.093817: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=11, cmdid=128, nsid=1 The above is the first command in your trace dispatched to qid=11. 32.151747: nvme_handle_cqe: nvme0: qid=11 sq_id=11 head=1 tail=1 cmdid:0(0) And above is the first completion on qid=11. Two things to note here: 1. The cqe's cmdid is 0, but the only outstanding command id on that queue is 128. The device posted a completion with the wrong cmdid. 2. The timestamp is 30 seconds late. This is clearly a timeout polling loop that discovered the malformed CQE. The device failed to deliver an MSI for the posted CQE. I haven't seen a controller this broken since the first Apple ones. I'll send a new patch with the same quirks that one is using.
Created attachment 302954 [details] All the quirks This will enforce the simplest possible queue configuration that nvme can do. When using this patch, there will be exactly one IO queue, one possible command ID, and one MSI vector. Apply this atop a pristine 6.0+ kernel (please back-out all the previously requested patches), and you can also remove all the kernel cmdline options I previously requested.
I will use "make defconfig" to create a kernel and apply the requested patch. As I know the KC3000 uses the Phison E18 (PS5018-E18-41) controller. This one is also used on most if not all of the fastest SSDs available right now. Am I assuming correctly that the encountered violation is something Phison has to fix and not Kingston?
The defconfig kernel didn't work. After googling I found out this is expected on real hardware. I continued compiling the kernel with the quirks based on the official unmodified Arch Linux config again. Then applied only the quirk patch. Unfortunately commands like gdisk still hang. The logs show that the kernel applies the quirks as expected. One timeout message is present.
Created attachment 302957 [details] maxquirk kernel "dmesg"
Created attachment 302958 [details] maxquirk kernel "journalctl -b"
Hello Keith and maybe the others, I once again would like to politely ask for your expertise. I found someone in the internet, claiming the KC3000 works in his linux setup. After reading that, I organized another computer with an m.2 slot and the kingston KC3000 SSD works perfectly, although only via PCIe 3.0 . Nothing lags and there are no error-entries in the log. In my understanding the PCIe lanes are basically wires which are connected directly from the cpu to the nvme ssd. I have a hard time to believe the laptop firmware has something todo with them. But at this point in time all the arrows point at it. Do you have a technical explanation for that behavior? I will attach all the logs and also a trace based on your tracing kernel patch. Please note that in the logs nvme1 is the Kingston one, while nvme0 is the one of my friend. I even could execute smartctl without any problems and included an output aswell.
Created attachment 303041 [details] other computer, same kernel, same ssd, but it works outputs of: dmesg journalctl_-b lspci_-vv smartctl_-a
Created attachment 303042 [details] other computer, same kernel, same ssd, but it works + tracing patch this kernel has the tracing patch outputs of: dmesg journalctl_-b lspci_-vv smartctl_-a sys_kernel_debug_tracing_trace
Maybe disable pcie relaxed ordering. Some devices do this completely wrong. The "working" one has this disabled. If there's no bios option to turn it off, then we'll need yet another kernel patch to quirk it off.
The bios is somehow limited. I don't know if there are hidden options and how to access them. But I could not find the setting you mentioned. So I need another quirk patch. From a technical point of view: 1. Do you see any chance that the firmware of my laptop causes the problem? 2. Is the PCIe connection terminated at the Phison E18 controller? If yes: Assuming kingston left the E18 unmodified, will the problem I encountered be present for all Phison E18 ssd's (which are a lot)?
Created attachment 303043 [details] disable ro for kingston In order to verify it successfully did what we want, you can run: # lspci -vv -s 04:00.0 | grep RlxOrd If it shows "RlxOrd-", then it was successful in disabling the feature. If it says "RlxOrd+", then it was not.
(In reply to hastegeldhastewas from comment #43) > 1. Do you see any chance that the firmware of my laptop causes the problem? I don't know, possibly. I think we'd need packet level tracing to know where to point the finger (device vs. platform), which is not a reasonable thing to obtain. > 2. Is the PCIe connection terminated at the Phison E18 controller? If yes: > Assuming kingston left the E18 unmodified, will the problem I encountered be > present for all Phison E18 ssd's (which are a lot)? My understanding is that these 3rd party controllers are provided with development kits for vendors to customize as they see fit, so they're not all equal. And though some models may have the same Phison NVMe controller, I have no idea if they're all sharing the same PCIe IP core.
Thanks for your answers. I applied the patch on the current mainline kernel 6.1-rc1 , but it unfortunately doesn't boot. It can't find the root partition by its UUID. I will attach a photo. My keyboard isn't working in the emergency shell. The patch seems to mess up the pci connection to the usb controller. The desired partition is on the USB-stick I boot from. I used the same steps to compile the kernel as always and copied the bootloader file with the definately correct UUID. If you can fix this easily I will happily compile another kernel. If not I would say we stop the experiment. Whatever is going wrong seems to most likely not be related to linux. I of course am very thankful you still took the time to help me.
Created attachment 303044 [details] kernel without relaxed ordering fails to boot
Well that's weird. So boot succeeded without the pci quirk patch, but breaks with it? That really doesn't make sense right now; the only thing it does is toggle an optional PCIe config register on the nvme device and shouldn't affect anything to do with the root device.
Okay, then I most likeli messed something up. I will try again.
Hello Keith, I now managed to boot 4 differnet kernels (so the UUID issue was something I did wrong). Two of them are without modification and two got the no relaxed ordering patch. linux-6.0.2 linux-6.0.2-noro linux-6.1.0-rc1 linux-6.1.0-rc1-noro The "RlxOrd-" quirk doesn't seem to be working. I tried to verify by using "lspci -vv | grep RlxOrd" and I can only see "RlxOrd+" and no minus. The ssd seems to be "02:00.0" and also shows "RlxOrd+". I am currently investigating if I can see the reason or if something I did wrong is the reason.
Created attachment 303048 [details] logs_for_four_kernels_some_with_no_relaxed_ordering
Created attachment 303049 [details] output of lspci -nn Vendor and device id seem to be correct. I will compile another kernel.
Two new kernels are compiling. One of them uses the no relaxed reordering patch applied to 6.1-rc1, and the other one uses the same patch and kernel but I changed the class to PCI_CLASS_STORAGE_EXPRESS. Like so: /mnt/ssd/dl_unrar/linux-6.1-rc1-noro-otherclass> cat drivers/pci/quirks.c| grep -n1 5013 4331- 4332:DECLARE_PCI_FIXUP_CLASS_EARLY(0x2646, 0x5013, PCI_CLASS_STORAGE_EXPRESS, 8, 4333- quirk_relaxedordering_disable); /mnt/ssd/dl_unrar/linux-6.1-rc1-noro> cat drivers/pci/quirks.c| grep -n1 5013 4331- 4332:DECLARE_PCI_FIXUP_CLASS_EARLY(0x2646, 0x5013, PCI_CLASS_NOT_DEFINED, 8, 4333- quirk_relaxedordering_disable);
Created attachment 303054 [details] logs_kernels_with_no_relaxed_ordering quirk not working No, both kernels do not seem to apply the quirk and still have the error. See the outputs/logs attached. There is a line saying something about an nvme quirk for 02:00.0: Oct 20 13:27:55 archusb kernel: nvme 0000:02:00.0: platform quirk: setting simple suspend Only idea left for me is that one quirk overwrites the other.
Huh, I guess I don't know how pci early quirks work. I'll mess with this on emulation and see what I'm doing wrong. You can manually turn it off at the command line after boot to see what happens. If your drive's BDf is 02:00.0, the command looks like: # setpci -s 02:00.0 CAP_EXP+8.w=0:10 Verify the RlxOrd bit in lspci is set to '-' after to make sure it worked. If the above doesn't toggle it as expected ... well, I'd be stumped ... again.
(In reply to Keith Busch from comment #55) > Huh, I guess I don't know how pci early quirks work. I'll mess with this on > emulation and see what I'm doing wrong. Ah, I see this quirk only works for root ports. This needs an entirely different identifer in that case. Instead of the '0x2646, 0x5013' combo, it needs 'PCI_VENDOR_ID_AMD, 0x14ba'.
Created attachment 303056 [details] logs_setpci_last_try_linux-6.1.0-rc1-noro Hey Keith, I tried the setpci command as soon as I logged in. Applying the setting worked, but unfortunately the errors persisted. Damn, that could have been it. One last log tarball is attached to this message. To get my money back I will now bring the ssd back to the post office. So I can't perform any more tests and the issue can be closed. I thank you again very much for your endurance. I felt very well taken care of. Maybe we read each other again during my next issue for my next ssd.
Okay I will set PCI_VENDOR_ID_AMD as my last last try now. like this: DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_AMD, 0x14ba, PCI_CLASS_STORAGE_EXPRESS, 8, quirk_relaxedordering_disable);
Two kernels are in the making: /mnt/ssd/dl_unrar/linux-6.1-rc1-noro-otherids> cat drivers/pci/quirks.c| grep -n1 0x14ba 4331- 4332:DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_AMD, 0x14ba, PCI_CLASS_NOT_DEFINED, 8, 4333- quirk_relaxedordering_disable); /mnt/ssd/dl_unrar/linux-6.1-rc1-noro-otherids-otherclass> cat drivers/pci/quirks.c| grep -n1 0x14ba 4331- 4332:DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_AMD, 0x14ba, PCI_CLASS_STORAGE_EXPRESS, 8, 4333- quirk_relaxedordering_disable);
Both kernels didn't succeed in applying the quirk. I only see "RlxdOrd+" for both new kernels. Damn. I will bring back the haunted ssd now. Thank you again.