Bug 217981

Summary: Need NVME_QUIRK_BOGUS_NID for INTEL SSDPD2KS019T7
Product: IO/Storage Reporter: PJWELSH (pj)
Component: NVMe Assignee: IO/NVME Virtual Default Assignee (io_nvme)
Status: RESOLVED CODE_FIX    
Severity: normal CC: kbusch
Priority: P3    
Hardware: Intel   
OS: Linux   
Kernel Version: Subsystem:
Regression: No Bisected commit-id:
Attachments: attachment-21992-0.html

Description PJWELSH 2023-10-04 19:38:51 UTC
Same nguid with latest firmware version:
[~]# nvme id-ns /dev/nvme2 -n 1|grep -E "nguid|i64"
nguid   : 0100000001000000e4d25c3a83874bf0
eui64   : 0000000000000000
root@canas4[~]# nvme id-ns /dev/nvme0 -n 1|grep -E "nguid|i64"
nguid   : 0100000001000000e4d25c3a83874bf0
eui64   : 0000000000000000
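
For anyone wanting to check several drives at once, here is a minimal sketch that compares the `nguid` values across devices. It assumes the text output format of `nvme id-ns` shown above; the function names (`parse_id_ns`, `find_duplicate_nguids`) are made up for illustration and are not part of nvme-cli.

```python
import re
from collections import defaultdict


def parse_id_ns(text):
    """Extract the nguid and eui64 fields from `nvme id-ns` text output."""
    ids = {}
    for line in text.splitlines():
        m = re.match(r"\s*(nguid|eui64)\s*:\s*([0-9a-fA-F]+)\s*$", line)
        if m:
            ids[m.group(1)] = m.group(2).lower()
    return ids


def find_duplicate_nguids(outputs):
    """Map each non-zero NGUID seen on more than one device to those devices.

    `outputs` maps a device path to the captured `nvme id-ns` text for it.
    An all-zero NGUID means "not reported" and is skipped.
    """
    seen = defaultdict(list)
    for device, text in outputs.items():
        nguid = parse_id_ns(text).get("nguid")
        if nguid and set(nguid) != {"0"}:
            seen[nguid].append(device)
    return {nguid: devs for nguid, devs in seen.items() if len(devs) > 1}


# Sample text copied from the two outputs above (both drives report the same values).
SAMPLE = (
    "nguid   : 0100000001000000e4d25c3a83874bf0\n"
    "eui64   : 0000000000000000\n"
)
outputs = {"/dev/nvme0": SAMPLE, "/dev/nvme2": SAMPLE}
duplicates = find_duplicate_nguids(outputs)
print(duplicates)
```

On a live system the `outputs` dictionary would be filled from the real command output for each controller rather than from the pasted sample.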


System doesn't like this:
# dmesg -T |grep -i nvme0                       
[Tue Oct  3 15:06:40 2023] nvme nvme0: pci function 0000:03:00.0
[Tue Oct  3 15:06:40 2023] nvme nvme0: failed to register the CMB
[Tue Oct  3 15:06:40 2023] nvme nvme0: 48/0/0 default/read/poll queues
[Tue Oct  3 15:06:40 2023] nvme nvme0: VID:DID 8086:0a54 model:INTEL SSDPD2KS019T7 firmware:QDAA0130
[Tue Oct  3 15:06:40 2023] nvme nvme0: ignoring nsid 1 because of duplicate IDs
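
The two log lines that matter here are the `VID:DID` line (the PCI ID a quirk would be keyed on) and the `ignoring nsid` line. A rough sketch for pulling both out of saved `dmesg` output follows; the regular expressions are written against the exact message wording above and are an assumption about that wording, which may differ in other kernel versions. `summarize` is a hypothetical helper name.

```python
import re

# Patterns matched against the message text shown above (assumed wording).
DEVICE_INFO = re.compile(
    r"nvme (?P<ctrl>nvme\d+): VID:DID (?P<vid>[0-9a-f]{4}):(?P<did>[0-9a-f]{4}) "
    r"model:(?P<model>.+?) firmware:(?P<firmware>\S+)"
)
DUPLICATE = re.compile(
    r"nvme (?P<ctrl>nvme\d+): ignoring nsid (?P<nsid>\d+) because of duplicate IDs"
)


def summarize(log_lines):
    """Collect PCI ID, model, firmware and ignored namespaces per controller."""
    report = {}
    for line in log_lines:
        info = DEVICE_INFO.search(line)
        if info:
            entry = report.setdefault(info["ctrl"], {"ignored_nsids": []})
            entry.update(
                vid=info["vid"],
                did=info["did"],
                model=info["model"],
                firmware=info["firmware"],
            )
        dup = DUPLICATE.search(line)
        if dup:
            entry = report.setdefault(dup["ctrl"], {"ignored_nsids": []})
            entry["ignored_nsids"].append(int(dup["nsid"]))
    return report


# The two relevant lines from the log above.
LOG = [
    "[Tue Oct  3 15:06:40 2023] nvme nvme0: VID:DID 8086:0a54 "
    "model:INTEL SSDPD2KS019T7 firmware:QDAA0130",
    "[Tue Oct  3 15:06:40 2023] nvme nvme0: ignoring nsid 1 because of duplicate IDs",
]
report = summarize(LOG)
print(report)
```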


One of the mirrored disk pairs is "lost" now.
Comment 1 PJWELSH 2023-10-04 19:59:09 UTC
Not sure if you need this part now as the id is noted previously:
[~]# lspci -nn -d ::0108|grep 0a54                 
03:00.0 Non-Volatile memory controller [0108]: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller] [8086:0a54]
42:00.0 Non-Volatile memory controller [0108]: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller] [8086:0a54]

Not sure if I've missed anything.
Comment 2 PJWELSH 2023-10-10 15:33:42 UTC
Not sure if anything else is needed. However, based on what I've read so far, I think the only change needed is to add "NVME_QUIRK_BOGUS_NID" for this device in drivers/nvme/host/pci.c.
Comment 3 Keith Busch 2023-10-12 15:37:18 UTC
Is this with a recent kernel? The default behavior now should already handle this.
Comment 4 welsh 2023-10-12 15:58:33 UTC
Created attachment 305207 [details]
attachment-21992-0.html

It looks like kernel 6.1.50 from the TrueNAS folks.
I originally submitted a bug with them
(https://www.truenas.com/community/threads/bluefin-to-cobia-rc1-drive-now-fails-with-duplicate-ids.113205/)
and the thinking was that the best course of action would be to check/fix
with upstream first.
However, I did add a note yesterday
(https://www.truenas.com/community/threads/bluefin-to-cobia-rc1-drive-now-fails-with-duplicate-ids.113205/post-784010)
asking them to validate that they have applied the patch that went into
6.1.40 in July, commit ac522fc6c3165fd0daa2f8da7e07d5f800586daa, which will:

> Relax our check for them for so that it doesn't reject the probe on
> single-ported PCIe devices, but prints a big warning instead.

However, the current upstream pci.c code does not seem to set
"NVME_QUIRK_BOGUS_NID" for this device.
Basically, I'm not sure who or what is to blame ATM, other than that I
randomly "lose" one of two drives in an array on reboot due to a duplicate
GUID :(

Comment 5 PJWELSH 2023-10-12 16:10:43 UTC
Sorry for the odd email response... I will attempt to remember to use the bug submission page instead to help avoid confusion.
Comment 6 Keith Busch 2023-10-12 17:51:42 UTC
Oh, that version has the "fix" I mentioned, so it must mean your controller is claiming "CMIC", or multi-controller capabilities. I'll apply a kernel quirk for the provided device ID.
Comment 7 PJWELSH 2023-10-12 17:53:07 UTC
Is there a way for me to validate the CMIC attribute?
Comment 8 Keith Busch 2023-10-12 17:57:57 UTC
'nvme id-ctrl /dev/nvme0 | grep cmic'. A value with the second bit set (mask 0x2) means multi-controller.

The other possibility is nmic, which you can check with 'nvme id-ns /dev/nvme0n1 | grep nmic'. Any value with the first bit set (mask 0x1) is claiming to be multi-controller capable.
Comment 9 PJWELSH 2023-10-12 18:00:49 UTC
[~]# nvme id-ctrl /dev/nvme0 | grep cmic
cmic      : 0x3

[~]# nvme id-ns /dev/nvme0n1 |grep nmic
nmic    : 0x1
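
To make the two values above easier to read, here is a small sketch that decodes them and models why the namespace gets dropped. The bit masks are my reading of the NVMe specification (CMIC bit 0 = multiple ports, bit 1 = multiple controllers; NMIC bit 0 = namespace may be shared). The `duplicate_id_action` function is a simplified illustration of the behavior described in this thread, not the kernel's actual code, and its names and return strings are invented for the example.

```python
# Masks assumed from the NVMe specification (0-indexed bit positions).
CMIC_MULTI_PORT = 0x1  # bit 0: subsystem may have more than one port
CMIC_MULTI_CTRL = 0x2  # bit 1: subsystem may have more than one controller
NMIC_SHARED = 0x1      # bit 0: namespace may be attached to several controllers


def claims_multipath(cmic, nmic):
    """True when the device claims both multi-controller and a shared namespace."""
    return bool(cmic & CMIC_MULTI_CTRL) and bool(nmic & NMIC_SHARED)


def duplicate_id_action(cmic, nmic, bogus_nid_quirk=False):
    """Simplified model of what happens to a namespace with duplicate IDs.

    Hypothetical summary of the thread: with the quirk the reported IDs are
    not trusted, so the namespace is kept; without it, a PCIe device that
    claims multipath capability has the namespace ignored, while a
    single-ported device only gets a warning.
    """
    if bogus_nid_quirk:
        return "ignore reported IDs, keep namespace"
    if claims_multipath(cmic, nmic):
        return "ignore namespace"
    return "warn, keep namespace"


# Values reported above for the INTEL SSDPD2KS019T7.
cmic, nmic = 0x3, 0x1
print(claims_multipath(cmic, nmic))
print(duplicate_id_action(cmic, nmic))
print(duplicate_id_action(cmic, nmic, bogus_nid_quirk=True))
```

With cmic 0x3 and nmic 0x1 both conditions hold, which matches the "ignoring nsid 1 because of duplicate IDs" message seen on 6.1.50 even though that kernel already has the relaxed check.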
Comment 10 Keith Busch 2023-10-12 18:25:15 UTC
Yah, that's doubly confirming it.

I'm a bit surprised since that is a pretty old model. Something must have happened with whatever batch you have; the identifiers had been reliably unique as far as I remember. Unfortunately the quirk mechanism only works at device ID granularity, so it will apply to every drive with this ID. I'll just post it to the mailing list.
Comment 11 Keith Busch 2023-10-18 15:50:14 UTC
This is applied for the next 6.6-rc.
Comment 12 PJWELSH 2023-10-18 18:42:46 UTC
Will/can it also be put into the LT 6.1 kernel?
Thanks
Comment 13 Keith Busch 2023-10-18 19:05:31 UTC
I'll keep an eye out for the stable release notice after rc7 is posted. If it works like it has in the past, the stable bot should auto apply the quirk patch to all the LTS trees sometime next week.
Comment 14 PJWELSH 2024-01-16 21:38:09 UTC
Seems to all be good now. Closing.
Thanks for the help! Much appreciated.