Bug 217705

Summary: kernel 6.4.x power management issues on kaby lake CPU
Product: Drivers Reporter: Michal Hlavac (miso)
Component: PCI Assignee: drivers_pci (drivers_pci)
Status: RESOLVED CODE_FIX    
Severity: normal CC: bjorn, kernel, mike, miso, paul.grandperrin, sven.koehler, tiwai
Priority: P3    
Hardware: All   
OS: Linux   
Kernel Version: Subsystem:
Regression: Yes Bisected commit-id: 8ee39ec479147e29af704639f8e55fce246ed2d9

Description Michal Hlavac 2023-07-25 14:44:23 UTC
Hi,

After upgrading to kernel 6.4.x I experienced these problems:
1. When the notebook is disconnected from power, it boots only to the console (no X).
2. When power is disconnected, the notebook does not suspend to RAM.


> dmesg log when I disconnect power is:
> ata2: SATA link down (SStatus 4 SControl 300)
> pci 0000:01:00.0: not ready 1023ms after resume; waiting
> pci 0000:01:00.0: not ready 2047ms after resume; waiting
> pci 0000:01:00.0: not ready 4095ms after resume; waiting
> pci 0000:01:00.0: not ready 8191ms after resume; waiting
> pci 0000:01:00.0: not ready 16383ms after resume; waiting
> pci 0000:01:00.0: not ready 32767ms after resume; waiting
> pci 0000:01:00.0: not ready 65535ms after resume; giving up
> pci 0000:01:00.0: Unable to change power state from D3cold to D0, device
> inaccessible
> pci 0000:01:00.0: Unable to change power state from D3cold to D0, device
> inaccessible
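
The wait times in the log above follow a doubling pattern: each reported value is one less than a power of two, and the last one is 65535 ms. Below is a minimal sketch of that pattern, assuming a retry delay that starts at 1 ms and doubles after every attempt. It is an illustrative model chosen to reproduce the logged numbers, not the kernel's actual code; the function name and both thresholds are assumptions.

```python
def simulate_wait(report_after_ms=1000, timeout_ms=60000):
    """Return the (elapsed_ms, action) pairs a doubling retry loop would log.

    Assumptions (hypothetical, picked to match the dmesg output above):
    - the delay starts at 1 ms and doubles after each retry;
    - nothing is logged until the delay exceeds report_after_ms;
    - the loop gives up once the delay exceeds timeout_ms.
    """
    delay = 1
    messages = []
    while True:
        if delay > timeout_ms:
            messages.append((delay - 1, "giving up"))
            return messages
        if delay > report_after_ms:
            messages.append((delay - 1, "waiting"))
        # A real implementation would sleep for `delay` ms here.
        delay *= 2


for elapsed, action in simulate_wait():
    print(f"not ready {elapsed}ms after resume; {action}")
```

Under these assumptions the device is polled for a little over a minute in total before the resume is abandoned, which matches the long hang seen when power is disconnected.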

lspci
> 00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core
> Processor Host Bridge/DRAM Registers (rev 05)
> 00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe
> Controller (x16) (rev 05)
> 00:02.0 VGA compatible controller: Intel Corporation HD Graphics 630 (rev
> 04)
> 00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200
> v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 05)
> 00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset
> Family USB 3.0 xHCI Controller (rev 31)
> 00:14.2 Signal processing controller: Intel Corporation 100 Series/C230
> Series Chipset Family Thermal Subsystem (rev 31)
> 00:15.0 Signal processing controller: Intel Corporation 100 Series/C230
> Series Chipset Family Serial IO I2C Controller #0 (rev 31)
> 00:15.1 Signal processing controller: Intel Corporation 100 Series/C230
> Series Chipset Family Serial IO I2C Controller #1 (rev 31)
> 00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series
> Chipset Family MEI Controller #1 (rev 31)
> 00:17.0 SATA controller: Intel Corporation HM170/QM170 Chipset SATA
> Controller [AHCI Mode] (rev 31)
> 00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family
> PCI Express Root Port #1 (rev f1)
> 00:1c.1 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family
> PCI Express Root Port #2 (rev f1)
> 00:1d.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family
> PCI Express Root Port #9 (rev f1)
> 00:1d.4 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family
> PCI Express Root Port #13 (rev f1)
> 00:1d.6 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family
> PCI Express Root Port #15 (rev f1)
> 00:1f.0 ISA bridge: Intel Corporation HM175 Chipset LPC/eSPI Controller (rev
> 31)
> 00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset
> Family Power Management Controller (rev 31)
> 00:1f.3 Audio device: Intel Corporation CM238 HD Audio Controller (rev 31)
> 00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus
> (rev 31)
> 01:00.0 3D controller: NVIDIA Corporation GP107M [GeForce GTX 1050 Mobile]
> (rev a1)
> 02:00.0 Network controller: Qualcomm Atheros QCA6174 802.11ac Wireless
> Network Adapter (rev 32)
> 03:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI
> Express Card Reader (rev 01)
> 04:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD
> Controller SM961/PM961/SM963

Then I found the same problem described here:
https://bbs.archlinux.org/viewtopic.php?id=286976

As a workaround, I disabled the bumblebee service, and now it works.
Comment 1 Michal Hlavac 2023-07-25 14:46:41 UTC
Downstream issue #: https://bugzilla.suse.com/show_bug.cgi?id=1213617
Comment 2 Sven Köhler 2023-08-30 22:46:55 UTC
On a Dell XPS 15 9560 (Intel 7700hq), it affects my SSD. The system effectively doesn't boot anymore.

The message I get is:
> nvme 0000:04:00.0: Unable to change power state from D3cold to D0, device
> inaccessible
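
Both the original report (device 0000:01:00.0) and this comment (device 0000:04:00.0) show the same "Unable to change power state from D3cold to D0" message. The following is a rough sketch of how one might scan saved log lines for that message and list the affected devices. The function name and the regular expression are assumptions made for illustration; they are not part of any kernel tooling.

```python
import re

# Matches e.g. "nvme 0000:04:00.0: Unable to change power state from D3cold to D0"
# and captures the logging prefix plus the PCI address (domain:bus:device.function).
D3COLD_ERROR = re.compile(
    r"(\S+) ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"Unable to change power state from D3cold to D0"
)


def affected_devices(log_lines):
    """Return unique (prefix, pci_address) pairs in order of first appearance."""
    seen = []
    for line in log_lines:
        match = D3COLD_ERROR.search(line)
        if match and match.groups() not in seen:
            seen.append(match.groups())
    return seen


sample = [
    "pci 0000:01:00.0: not ready 65535ms after resume; giving up",
    "pci 0000:01:00.0: Unable to change power state from D3cold to D0, device",
    "nvme 0000:04:00.0: Unable to change power state from D3cold to D0, device",
]
print(affected_devices(sample))
```

The addresses can then be matched against the lspci listing to see which piece of hardware each one refers to.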
Comment 3 Paul Grandperrin 2023-09-10 13:51:11 UTC
Exact same issue as Sven Köhler, on the same hardware: I can't boot anymore on the latest kernel.

It was working on 6.1.43 and doesn't work on 6.1.51.

I'll try to narrow it down some more if I have time.
Comment 4 Paul Grandperrin 2023-09-10 14:50:21 UTC
I tried a few kernels:
6.1.43: working
6.1.45: working
6.1.46: nvme inaccessible 
6.1.47: nvme inaccessible
6.1.51: nvme inaccessible 

So it seems 6.1.46 is the culprit.
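
The narrowing-down above can be expressed as a binary search over an ordered list of test results, which is the same idea that commit bisection applies to individual commits. This is only a sketch: it assumes every release before the regression works and every release after it fails, and the function name is hypothetical.

```python
def find_regression(results):
    """Return (last_good, first_bad) from results ordered oldest to newest.

    `results` is a list of (version, works) pairs. Assumption: the list is
    "all good, then all bad", with a working first entry and a broken last one.
    """
    lo, hi = 0, len(results) - 1
    if not results[lo][1] or results[hi][1]:
        raise ValueError("need a working oldest entry and a broken newest entry")
    # Invariant: results[lo] works, results[hi] does not.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if results[mid][1]:
            lo = mid
        else:
            hi = mid
    return results[lo][0], results[hi][0]


tested = [
    ("6.1.43", True),
    ("6.1.45", True),
    ("6.1.46", False),
    ("6.1.47", False),
    ("6.1.51", False),
]
print(find_regression(tested))
```

Each step halves the remaining range, so even a long list of candidates needs only a handful of test boots.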

I was using NixOS prebuilt kernels until now as my machine is really slow to compile kernels but I'll try to bisect to the exact commit.
Comment 5 Maximilien Richer 2023-09-10 14:54:17 UTC
Having the same issue as Sven on the 6.4.x branch (same hardware), I managed to boot using the workaround suggested by the kernel error, i.e. adding

> nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

to the kernel command line.
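
To confirm that the two parameters actually reached the running kernel, one option is to compare them against the boot command line, which Linux exposes in /proc/cmdline. The helper below is a hedged sketch; its name is made up for illustration, and it only does an exact match on whitespace-separated tokens.

```python
WANTED = ["nvme_core.default_ps_max_latency_us=0", "pcie_aspm=off"]


def missing_params(cmdline, wanted=WANTED):
    """Return the entries of `wanted` that do not appear in `cmdline`."""
    present = cmdline.split()
    return [param for param in wanted if param not in present]


# Example with a made-up command line; on a real system the string could be
# read with open("/proc/cmdline").read() instead.
example = "BOOT_IMAGE=/vmlinuz root=/dev/nvme0n1p2 ro pcie_aspm=off"
print(missing_params(example))
```

An empty result means both parameters are present on the command line.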
Comment 6 Paul Grandperrin 2023-09-10 15:06:46 UTC
Thanks Maximilien!

Oh what a relief!
I was afraid I wouldn't be able to use kernel updates for some time.

What do you mean they were suggested in the kernel error though?
I don't see suggestions anywhere, is it in 6.4 only?
Comment 7 Paul Grandperrin 2023-09-10 15:15:48 UTC
Ok so the workaround only works until the computer goes to sleep.
Then the SSD becomes inaccessible again..
Comment 8 Paul Grandperrin 2023-09-10 15:29:04 UTC
.. actually, it doesn't work :(

Even without suspends, after a while, the nvme becomes inaccessible.

Good bye kernel updates, 6.1.45 will be my last version
Comment 9 Sven Köhler 2023-09-11 22:26:24 UTC
(In reply to Paul Grandperrin from comment #4)
> I tried a few kernels:
> 6.1.45: working
> 6.1.46: nvme inaccessible 
> 
> So it seems 6.1.46 is the culprit.


Would you feel comfortable bisecting it?


Also, my experience with the kernel bugzilla is often that none of the kernel developers respond, sadly.
Comment 10 Paul Grandperrin 2023-09-11 22:40:24 UTC
I have been bisecting it non stop since...
I think I only need to test one or two more commits before I find it.

Once I have the commit, what should be the next steps?

Try the latest kernel with this commit reverted to validate it's this one?

Then, how to help as much as possible to get a kernel dev to fix it?

I know C a little bit, but I'm nowhere near able to work on the kernel myself.
Comment 11 Paul Grandperrin 2023-09-11 22:41:15 UTC
What's the proper way to communicate with the kernel devs? Email?
Comment 12 Paul Grandperrin 2023-09-12 07:50:44 UTC
8ee39ec479147e29af704639f8e55fce246ed2d9 is the first bad commit
commit 8ee39ec479147e29af704639f8e55fce246ed2d9
Author: Ricky WU <ricky_wu@realtek.com>
Date:   Tue Jul 25 09:10:54 2023 +0000

    misc: rtsx: judge ASPM Mode to set PETXCFG Reg
    
    commit 101bd907b4244a726980ee67f95ed9cafab6ff7a upstream.
    
    ASPM Mode is ASPM_MODE_CFG need to judge the value of clkreq_0
    to set HIGH or LOW, if the ASPM Mode is ASPM_MODE_REG
    always set to HIGH during the initialization.
    
    Cc: stable@vger.kernel.org
    Signed-off-by: Ricky Wu <ricky_wu@realtek.com>
    Link: https://lore.kernel.org/r/52906c6836374c8cb068225954c5543a@realtek.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 drivers/misc/cardreader/rts5227.c  |  2 +-
 drivers/misc/cardreader/rts5228.c  | 18 ------------------
 drivers/misc/cardreader/rts5249.c  |  3 +--
 drivers/misc/cardreader/rts5260.c  | 18 ------------------
 drivers/misc/cardreader/rts5261.c  | 18 ------------------
 drivers/misc/cardreader/rtsx_pcr.c |  5 ++++-
 6 files changed, 6 insertions(+), 58 deletions(-)
Comment 13 Paul Grandperrin 2023-09-12 08:07:43 UTC
This is kind of weird. This patch touches ASPM things (which makes sense) but only in the cardreader drivers.

Is it possible that a bug in a cardreader driver impacts other components, like the NVMe?

I'm building 6.1.51 with this commit reverted to check that.
Comment 14 Paul Grandperrin 2023-09-12 09:26:43 UTC
I'm currently writing from my patched 6.1.51 kernel, so I can confirm this commit is to blame.

Should I contact Ricky WU directly?
Comment 15 Paul Grandperrin 2023-09-12 11:32:43 UTC
Good news, blacklisting rtsx_pci and rtsx_pci_sdmmc solves the issue.
The card reader won't work anymore, but that's better than not booting.
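
For anyone wanting to reproduce this workaround, one common way to blacklist modules is a modprobe.d configuration file containing one `blacklist` line per module. The sketch below only renders the file contents; the function name is hypothetical, and the file name mentioned in the comment is just an example, not something taken from this report.

```python
def render_blacklist(modules):
    """Return modprobe.d-style text with one `blacklist` line per module."""
    lines = ["# Keep the Realtek card reader drivers from loading automatically"]
    lines += [f"blacklist {name}" for name in modules]
    return "\n".join(lines) + "\n"


# The resulting text could be saved as, for example,
# /etc/modprobe.d/blacklist-rtsx.conf (an assumed file name).
print(render_blacklist(["rtsx_pci", "rtsx_pci_sdmmc"]), end="")
```

Depending on the distribution, the initramfs may need to be regenerated before the change takes effect at early boot.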
Comment 16 Paul Grandperrin 2023-09-12 12:31:57 UTC
I sent an email to the appropriate mailing list and developers (I hope).

https://lore.kernel.org/stable/5DHV0S.D0F751ZF65JA1@gmail.com/T/#u
Comment 17 Sven Köhler 2023-09-12 18:21:16 UTC
Wow, that was surprising! Thank you Paul!
Comment 18 mike 2023-10-18 01:08:17 UTC
I'm glad I found this report. Trying to install Manjaro on my old Dell 5520 laptop, it was showing no SSD. lsblk shows the nvme0n1 device with 4 partitions (a standard Windows installation), but fdisk/cfdisk were unable to operate on the drive at all. I did all the usual checks (booting UEFI mode, secure boot disabled, Intel RAID disabled), but it still wasn't showing up.

I found the error in journalctl: "Unable to change power state from D3cold to D0, device inaccessible" and it brought me here. Based on reading the thread, the simplest workaround seemed to be going into BIOS and disabling the SD Card reader. Did that and boom, Manjaro installer works! (6.5.3-1-MANJARO kernel, FWIW.)
Comment 19 Artem S. Tashkinov 2023-10-24 11:04:46 UTC
This should get fixed soon.
Comment 20 Bjorn Helgaas 2023-11-01 11:59:54 UTC
Apparently the fix is 0e4cac557531 ("misc: rtsx: Fix some platforms can not boot and move the l1ss judgment to probe"), which is included in v6.6.

https://git.kernel.org/linus/0e4cac557531