Bug 214025 - Better error message for PCI devices killed during boot?
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All Linux
Importance: P1 low
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-08-10 17:27 UTC by Sam Edwards
Modified: 2021-09-17 09:30 UTC

See Also:
Kernel Version: 5.13.8
Tree: Mainline
Regression: No


Description Sam Edwards 2021-08-10 17:27:17 UTC
Hello,

I recently finished troubleshooting an issue where an NVMe SSD on the PCIe bus wasn't being initialized by the driver. The kernel log contained:

pci 0000:02:00.0: CLS mismatch (64 != 1020), using 64 bytes
...
nvme 0000:02:00.0: can't change power state from D3hot to D0 (config space inaccessible)

The problem (which deserves its own bug report) was that ACPI initialization was powering off the device between the time the PCI bus was scanned and the time the driver probed the device. The CLS value of 1020 came from the Cache Line Size register, which counts in 4-byte units, being read as 0xFF (255*4 = 1020) because the config space was inaccessible. However, to a user who doesn't have full intuition about PCI, neither of these messages makes it clear what's really happening.
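
As a rough illustration of the arithmetic above (a userspace sketch, not kernel code): the helper names below are hypothetical, and the assumption is that a read from an inaccessible config space comes back as all ones, which is what the 0xFF value in the log suggests.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * The Cache Line Size register (config offset 0x0C) holds the size in
 * 32-bit units, so the size in bytes is the register value times 4.
 * Hypothetical helper, named for illustration only.
 */
static unsigned int cls_reg_to_bytes(uint8_t reg)
{
    return (unsigned int)reg << 2;
}

/*
 * Assumption: an all-ones byte means the read never reached the device.
 * 0xFF would decode to 1020 bytes, which is not a plausible cache line size.
 */
static bool cls_reg_looks_dead(uint8_t reg)
{
    return reg == 0xFF;
}
```

With these, a register value of 16 decodes to the 64 bytes in the log, while 0xFF decodes to 1020 and is flagged as a likely dead read rather than a real mismatch.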

I'd have expected a (WARN/ERR) message saying something more like, "pci 0000:02:00.0: device has unexpectedly disappeared from the bus; removing", implemented either as a check right before driver probing or at key stages of the PCI device fixup process (such as when computing CLS). This check is probably not necessary for hotplugged devices, since major platform power management initialization won't happen between the hotplug event and driver binding, but I strongly believe it's appropriate at boot, when other subsystems are liable to interfere with PCI devices.
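
A minimal userspace sketch of what such a pre-probe check might look like, assuming that a device which has dropped off the bus returns all ones for the vendor/device ID dword at config offset 0x00. Both function names are hypothetical and the ID value is passed in rather than read from hardware; in the kernel, the existing pci_device_is_present() helper would be the natural starting point for the actual read.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Assumption: the dword at config offset 0x00 (vendor ID in the low
 * 16 bits, device ID in the high 16 bits) reads as all ones when the
 * device no longer responds.
 */
static bool id_dword_means_gone(uint32_t id)
{
    return id == 0xFFFFFFFFu;
}

/*
 * Hypothetical pre-probe check. Returns true and fills msg with the
 * proposed warning if the device looks absent; returns false and leaves
 * msg untouched otherwise. bdf is the "domain:bus:device.function" name.
 */
static bool check_before_probe(const char *bdf, uint32_t id,
                               char *msg, size_t len)
{
    if (!id_dword_means_gone(id))
        return false;

    snprintf(msg, len,
             "pci %s: device has unexpectedly disappeared from the bus; removing",
             bdf);
    return true;
}
```

The point of the sketch is only that the check is cheap: one config read and one comparison per device, done once before binding the driver.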

An alternative to removing the device would be to keep it present in sysfs but put it in some other state (D3cold?) and hold off on trying to bind the driver. This hopefully increases the chance that the user sees that the device is present but in an unusual state.

Thoughts?
Comment 1 Gino Badouri 2021-09-13 10:37:21 UTC
Hi there,


I have the exact same problem.
I've bought a Razer laptop (11th gen Intel), and I'm getting:

pci 0000:02:00.0: CLS mismatch (64 != 1020), using 64 bytes
And then later on:
nvme 0000:02:00.0: can't change power state from D3hot to D0 (config space inaccessible)

The NVMe drive is not detected now.

It used to work in 5.11, but in 5.13 and 5.14 it's broken.
I've tried to boot with pci_aspm=off but that didn't fix the problem either.

Any known workaround for this?
Comment 2 Sam Edwards 2021-09-15 01:09:21 UTC
Hey! Same laptop.

"CLS mismatch (... != 1020)" is because the PCIe device is being shut off by ACPI after it's discovered by enumeration but before it's been fully initialized. This bug report is only to request a specific check, and more helpful error message, for that circumstance.

The underlying problem you're encountering is actually #214035 (please make it known over there that you are affected by this too; the ACPI subsystem maintainer hasn't yet taken notice of it). I worked around it by adding an unconditional return at the top of acpi_turn_off_unused_power_resources().

Speaking of the NVMe drive, check its SMART information to see if you have the SSSTC CA6, firmware ERA0901. If you do, beware: I recently encountered some filesystem corruption due to a write issue on that SSD. I don't know if I just had bad luck, but make sure you're taking backups regularly just in case. (It might help to keep the SSD's write cache disabled by adding a "hdparm -W0" to your startup scripts.)
Comment 3 Gino Badouri 2021-09-15 07:28:27 UTC
Thanks for the information Sam!

For the record I'm running the exact same firmware version: ERA0901
Afaik there are no firmware updates available either.

I've found out that you can work around the bug by enabling the Intel VMD RST chipset in the BIOS.
Even if you don't create a RAID set or use their Optane caching technology, the kernel will at least be able to detect the drive.
On the other hand, if you use Windows, you will probably need to reinstall it, because it needs a separate driver from Intel.

For testing I've also added an older Samsung 970 EVO NVMe drive, and this seems to work fine on all kernels.
Now, as you also mention possible corruption, I think I'll just replace it with something else.
Razer has probably chosen this drive because it's cheap (no proper testing with Linux, no firmware updates, possible corruption, etc.).
Comment 4 Gino Badouri 2021-09-17 09:05:21 UTC
Hi Sam,

Small update.
I've replaced the SSSTC NVMe SSD with another one from WD, and the same thing happens.
So for the record, the first (internal) one is now from WD and the second one is the Samsung 970 EVO SSD.

So it seems the NVMe SSD type/brand is not to blame here.
It just doesn't initialize properly when it's in the first slot, regardless of the brand.
If you don't have a second SSD installed, you could try moving it to the optional slot, which is right above it.
Or you could enable the Intel VMD Rapid Storage chipset, but that will require you to reinstall Windows with the floppy driver from Intel.
Comment 5 Sam Edwards 2021-09-17 09:30:48 UTC
To be precise, the only thing to blame *here* is that the kernel isn't giving a clear indication that the PCI device has been shut off.

Again: The underlying problem you're encountering is actually #214035. The insight about enabling VMD is probably very useful over there.

That bug is about preventing the hardware from being shut off. This bug is about detecting when that has happened.
