Bug 216877 - Regression in PCI powermanagement breaks resume after suspend
Summary: Regression in PCI powermanagement breaks resume after suspend
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: Intel Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL: https://lkml.org/lkml/2023/1/4/599
Keywords:
Depends on:
Blocks:
 
Reported: 2023-01-02 14:02 UTC by Thomas Witt
Modified: 2023-03-10 17:09 UTC

See Also:
Kernel Version: 6.1
Tree: Mainline
Subsystem:
Regression: Yes


Attachments
output of git bisect log (2.46 KB, text/plain)
2023-01-02 14:02 UTC, Thomas Witt
lspci -v (6.76 KB, text/plain)
2023-01-02 14:03 UTC, Thomas Witt
output of netconsole after the issue happens (12.10 KB, text/plain)
2023-01-02 14:03 UTC, Thomas Witt
lspci -vv, as root user (36.48 KB, text/plain)
2023-01-02 17:31 UTC, Thomas Witt
journal before suspend (78.38 KB, text/plain)
2023-01-02 17:32 UTC, Thomas Witt

Description Thomas Witt 2023-01-02 14:02:51 UTC
Created attachment 303512
output of git bisect log

After commit 5e85eba6f50dc288c22083a7e213152bcc4b8208 ("PCI/ASPM: Refactor L1 PM Substates Control Register programming"), my laptop no longer resumes its PCI devices from suspend.

My laptop is a Tuxedo Infinitybook S 14 v5; as far as I can tell, it uses a Clevo L140CU mainboard.

The main symptom is:
iwlwifi 0000:02:00.0: Unable to change power state from D3hot to D0, device inaccessible
nvme 0000:03:00.0: Unable to change power state from D3hot to D0, device inaccessible

After that, the level of interaction I still have with the laptop varies, but it cannot run dmesg and it cannot do a clean reboot. The issue occurs on every suspend/resume cycle.
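
The two log lines quoted above share one pattern, so affected devices can be picked out of a saved kernel log mechanically. What follows is a minimal sketch in Python, assuming the log has already been captured as text (for example via netconsole). The function name `find_inaccessible_devices` is hypothetical, and the regular expression is modeled only on the two messages quoted in this report, so other kernel versions may need a different pattern.

```python
import re

# Pattern modeled on the two messages quoted in this report (assumption:
# the wording is the same on the kernel being examined).
PATTERN = re.compile(
    r"(?P<driver>\S+) "
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"Unable to change power state from \S+ to \S+, device inaccessible"
)


def find_inaccessible_devices(log_text):
    """Return (driver, PCI address) pairs for devices that failed to resume."""
    return [(m.group("driver"), m.group("bdf")) for m in PATTERN.finditer(log_text)]


if __name__ == "__main__":
    import sys

    # Usage: pipe a saved kernel log into the script on stdin.
    for driver, bdf in find_inaccessible_devices(sys.stdin.read()):
        print(driver, bdf)
```

Fed the two lines from this report, the sketch should yield the iwlwifi device at 0000:02:00.0 and the nvme device at 0000:03:00.0.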
Comment 1 Thomas Witt 2023-01-02 14:03:22 UTC
Created attachment 303513
lspci -v
Comment 2 Thomas Witt 2023-01-02 14:03:52 UTC
Created attachment 303514
output of netconsole after the issue happens
Comment 3 Thomas Witt 2023-01-02 17:31:22 UTC
Created attachment 303515
lspci -vv, as root user
Comment 4 Thomas Witt 2023-01-02 17:32:13 UTC
Created attachment 303516
journal before suspend

This is the output of "journalctl -k -b -1".
Comment 5 Artem S. Tashkinov 2023-01-08 09:46:19 UTC
Is this still broken in 6.1.4?
Comment 6 Thomas Witt 2023-01-08 10:17:52 UTC
(In reply to Artem S. Tashkinov from comment #5)
> Is this still broken in 6.1.4?

Yes, same symptoms.
Comment 7 Artem S. Tashkinov 2023-01-09 12:24:24 UTC
It's being discussed on LKML as well, thanks!

https://lkml.org/lkml/2023/1/4/599
Comment 8 Bjorn Helgaas 2023-02-10 23:18:31 UTC
This should be fixed by
https://git.kernel.org/linus/ff209ecc376a
which will appear in v6.2-rc8.

Please reopen if you can reproduce the problem on v6.2-rc8 or later.
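
Before retesting, it can help to confirm that the running kernel is at least v6.2-rc8. The sketch below is a rough Python check, assuming the release string has the form printed by `uname -r` (for example "6.2.0-rc8"). The function name `has_fix_by_version` is hypothetical, and the check looks only at mainline version numbers: it cannot tell whether the fix was backported to an older stable series.

```python
import re


def has_fix_by_version(release):
    """Return True if `release` is v6.2-rc8 or later by mainline numbering.

    Assumption: stable backports to older series are not detected.
    """
    m = re.match(r"(\d+)\.(\d+)(?:\.\d+)?(?:-rc(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor = int(m.group(1)), int(m.group(2))
    rc = int(m.group(3)) if m.group(3) else None
    if (major, minor) != (6, 2):
        # Any series newer than 6.2 is later than v6.2-rc8; older ones are not.
        return (major, minor) > (6, 2)
    # Within the 6.2 series, release candidates before rc8 lack the fix.
    return rc is None or rc >= 8


if __name__ == "__main__":
    import platform

    print(has_fix_by_version(platform.release()))
```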
Comment 9 Raghav Shankar 2023-03-04 07:47:38 UTC
This still affects my system on v6.2.2 (with zen patches). I have to restart NetworkManager.service to get any networking working again. Additionally, startup got a lot slower after updating to v6.2.2.
Comment 10 Raghav Shankar 2023-03-04 07:48:23 UTC
To clarify, the startup problem seems to be caused by NetworkManager taking too long.
Comment 11 Bjorn Helgaas 2023-03-09 21:47:56 UTC
Raghav, thanks very much for your report.  I'm not sure the problem you're seeing is the same as what Thomas reported here.  Thomas reported that his system was completely unusable after suspend/resume, and the only thing he could really do was reboot (and even that didn't work reliably).

In your case, it sounds like the system is slow and something is wrong with NetworkManager.  Is this a regression?  If it used to work, and simply upgrading the kernel to v6.2.2 caused problems, then we should look for a kernel issue.

If it seems like a kernel issue, can you please open a new report with more details (distro details, dmesg log, "sudo lspci -vv" output)?  Most kernel subsystems don't pay attention to bugzilla, so email to linux-kernel@vger.kernel.org and whatever other list seems relevant is probably best.  Maybe netdev@vger.kernel.org if you think it's network-related, or linux-pci@vger.kernel.org if you think it's PCI-related.
Comment 12 Raghav Shankar 2023-03-10 14:55:42 UTC
Ah, thank you for your response. I actually managed to fix it by passing ibt=off to the kernel cmdline, as that feature was causing issues with systemd service bringup. Thank you again for taking the time to help me here.
Comment 13 Bjorn Helgaas 2023-03-10 15:35:40 UTC
Hmmm.  That's horrible.  "ibt=off" isn't documented at all, and even if it were, users should not be required to diagnose the slowdown and somehow figure out to use "ibt=off" to avoid it, so I would definitely consider "ibt=off" as an interim *workaround*, but not an actual *fix*.

I did find a couple of bug reports that mention "ibt=off" as a workaround:

  https://bugs.archlinux.org/task/74886
  https://bugs.archlinux.org/task/74891

Both are related to the nvidia driver, and it looks like you should see "Missing ENDBR" in your dmesg log if you are seeing the same problem.  So, if you're seeing that problem, I guess using "ibt=off" is OK.

But if you're seeing something different, i.e., you're not using nvidia, please report it to linux-kernel@vger.kernel.org and Peter Zijlstra <peterz@infradead.org> (and cc: me).  We would want to see the complete dmesg log to help figure this out.
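
The two checks described above (whether "Missing ENDBR" appears in the dmesg log, and whether "ibt=off" is currently in effect) can be sketched in Python as follows. This assumes the dmesg output has been saved as text and that the kernel command line is readable from /proc/cmdline. Both function names are hypothetical, and the match is a plain substring search rather than a parse of the exact kernel message format.

```python
def missing_endbr_lines(dmesg_text):
    """Return the log lines that contain the "Missing ENDBR" marker."""
    return [line for line in dmesg_text.splitlines() if "Missing ENDBR" in line]


def ibt_disabled(cmdline):
    """Return True if the kernel command line contains the ibt=off parameter."""
    return "ibt=off" in cmdline.split()


if __name__ == "__main__":
    import sys

    # Usage: pipe saved dmesg output into the script on stdin.
    for line in missing_endbr_lines(sys.stdin.read()):
        print(line)
    with open("/proc/cmdline") as f:
        print("ibt=off in effect:", ibt_disabled(f.read()))
```

If the first function returns nothing while "ibt=off" still changes behavior, that would point toward the "something different" case described above, which should be reported with the complete dmesg log.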
Comment 14 Raghav Shankar 2023-03-10 17:09:07 UTC
I came across the workaround through discussions of that problem, such as this one: https://bbs.archlinux.org/viewtopic.php?id=276805. Before this, I had tried a few other kernel versions, and I believe I had the same issue in 5.19 as well. The thread I've linked concerns 5.18. I do use NVIDIA hardware, which could be what caused this. I'm not sure whether I saw the string you describe when I looked at my dmesg.
