I have two NVMe devices with the same serial number. The kernel refuses to work with one of them, and dmesg shows the following message:

> nvme nvme0: Duplicate cntlid 0 with nvme1, rejecting

After I change the SN of one of the devices (and reboot), the message no longer appears and the device is accessible. The message is misleading: it has nothing to do with the root cause of the issue I had.
It sounds like your device does not report a subsystem NQN, in which case the driver falls back to the legacy way of constructing one from the serial number, as per the spec. Since your devices have the same serial number, they appear to be in the same subsystem. Those two controllers, reported to be in the same subsystem, also have the same controller ID, which is not allowed, and that is the first place the driver detects a problem. I'd need to double-check the spec to see whether such legacy subsystems allow multiple controllers in this manner. If they do, then the error message is appropriate.
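To illustrate the fallback described above, here is a minimal sketch of how a legacy NQN could be assembled from Identify Controller fields. It assumes the layout is a fixed prefix, followed by the PCI vendor ID and subsystem vendor ID as four hex digits each, then the 20-byte serial number and the 40-byte model number padded with spaces. The function name `legacy_subnqn` and the sample values are hypothetical, and this is not the driver's actual code.

```python
# Sketch only: assumed layout of the legacy NQN, not taken from the driver.
NQN_PREFIX = "nqn.2014.08.org.nvmexpress:"


def legacy_subnqn(vid: int, ssvid: int, sn: str, mn: str) -> str:
    """Build a legacy subsystem NQN from Identify Controller fields.

    SN is treated as a 20-character field and MN as a 40-character field,
    both left-aligned, space-padded, and truncated to the field width.
    """
    return f"{NQN_PREFIX}{vid:04x}{ssvid:04x}{sn:<20.20}{mn:<40.40}"


# Hypothetical drives: same vendor, same model, same serial number.
drive_a = legacy_subnqn(0x1234, 0x1234, "SN0001", "ExampleModel")
drive_b = legacy_subnqn(0x1234, 0x1234, "SN0001", "ExampleModel")
# Same drive after its serial number has been changed.
drive_b_fixed = legacy_subnqn(0x1234, 0x1234, "SN0002", "ExampleModel")

print(drive_a == drive_b)        # identical NQN, so same apparent subsystem
print(drive_a == drive_b_fixed)  # distinct NQN once the SN differs
```

Under these assumptions, two drives of the same model with the same serial number produce byte-identical NQNs, so the host has no way to tell that they are separate subsystems.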
The spec doesn't seem to disallow it. The exact text is:

> The method shown in Figure 137 should be used by the host to construct an NVM Subsystem NQN for older NVM subsystems that do not provide an NQN in the Identify Controller data structure. The mechanism used by the vendor to assign Serial Number and Model Number values to ensure uniqueness is outside the scope of this specification. An NVM subsystem may contain multiple controllers. All controllers contained in the NVM subsystem share the same NVM subsystem unique identifier. The Controller ID (CNTLID) value returned in the Identify Controller data structure may be used to uniquely identify a controller within an NVM subsystem. The Controller ID value when combined with the NVM subsystem identifier forms a globally unique value that identifies the controller.

Based on this text, the existing error message is appropriate.
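The uniqueness rule in the quoted text can be sketched as a small registry that keys controllers by subsystem NQN and rejects a second controller with the same CNTLID. This is an illustration under assumptions, not the driver's implementation: the class `SubsystemRegistry`, the exception `DuplicateCntlid`, and the NQN strings are hypothetical, and the error text only imitates the dmesg line reported above.

```python
class DuplicateCntlid(Exception):
    """Raised when two controllers in one subsystem share a CNTLID."""


class SubsystemRegistry:
    def __init__(self) -> None:
        # Maps subsystem NQN -> {cntlid: controller name}.
        self._subsystems: dict[str, dict[int, str]] = {}

    def register(self, name: str, subnqn: str, cntlid: int) -> None:
        controllers = self._subsystems.setdefault(subnqn, {})
        if cntlid in controllers:
            # (subsystem NQN, CNTLID) must be globally unique.
            raise DuplicateCntlid(
                f"{name}: Duplicate cntlid {cntlid} with "
                f"{controllers[cntlid]}, rejecting"
            )
        controllers[cntlid] = name


registry = SubsystemRegistry()
shared_nqn = "nqn.example:same-serial"  # hypothetical NQN shared by both drives

registry.register("nvme1", shared_nqn, 0)
try:
    registry.register("nvme0", shared_nqn, 0)
except DuplicateCntlid as err:
    print(err)

# With a distinct NQN (for example after changing the SN) the same CNTLID is fine.
registry.register("nvme0", "nqn.example:other-serial", 0)
```

In this model the check itself is correct per the spec text; the message simply does not say that the two controllers ended up in the same subsystem because their legacy NQNs collided.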
The change in behaviour after I changed the SN on one of the drives suggests that the kernel tries to construct a legacy NQN and, because it conflicts with the other drive's NQN, falls back to the cntlid to identify the drive. But there is no message saying that such a fallback is taking place, so maybe the message is not misleading, but incomplete.