Bug 216782 - resume from suspend broken on Asus UX305FA after PCI/PTM changes in kernel 6.1-rc1
Summary: resume from suspend broken on Asus UX305FA after PCI/PTM changes in kernel 6....
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI (show other bugs)
Hardware: Intel Linux
: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-12-07 08:07 UTC by Tasev Nikola
Modified: 2024-03-08 09:29 UTC (History)
9 users (show)

See Also:
Kernel Version: 6.1-rc1
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg-6.1-rc8 (59.60 KB, text/plain)
2022-12-07 08:07 UTC, Tasev Nikola
Details
dmesg-6.0.7 (61.50 KB, text/plain)
2022-12-07 08:09 UTC, Tasev Nikola
Details
dmidecode Asus UX305FA (11.08 KB, text/plain)
2022-12-07 08:09 UTC, Tasev Nikola
Details
lspci -vvn (16.52 KB, text/plain)
2022-12-07 08:11 UTC, Tasev Nikola
Details
lspci -tv (807 bytes, text/plain)
2022-12-07 08:12 UTC, Tasev Nikola
Details
git-bisect log (3.21 KB, application/octet-stream)
2022-12-09 10:05 UTC, Tasev Nikola
Details
dmesg with PCI debug (99.17 KB, text/plain)
2022-12-14 11:00 UTC, Tasev Nikola
Details
custom config 6.1-rc1 (206.11 KB, text/plain)
2022-12-15 10:17 UTC, Tasev Nikola
Details
dmesg with efi=debug pci=earlydump add_efi_memmap (123.90 KB, text/plain)
2022-12-15 10:18 UTC, Tasev Nikola
Details
screen picture dmesg after resume (144.68 KB, image/jpeg)
2022-12-19 17:10 UTC, Tasev Nikola
Details
PCI/ASPM: Do not save and restore L1SS state if ASPM is forced (2.19 KB, patch)
2023-01-12 13:48 UTC, Rafael J. Wysocki
Details | Diff
lspci -vv and lspci -tv output (68.01 KB, text/plain)
2023-01-16 15:44 UTC, Mark Francis Enriquez
Details
skip L1SS restore if not enabled (1.34 KB, patch)
2023-01-17 05:51 UTC, Vidya Sagar
Details | Diff
lspci_vvv202304061800 (13.97 KB, text/plain)
2023-04-06 10:05 UTC, KobaKo
Details
Dmesg log for the Gigabyte G5 GD laptop, (71.43 KB, text/plain)
2023-05-03 15:48 UTC, Mark Francis Enriquez
Details
L1SS save/restore patch with 100ms delay (3.39 KB, patch)
2023-06-09 03:03 UTC, David Box
Details | Diff
Add back L1SS save and restore (5.80 KB, patch)
2023-06-26 09:58 UTC, Mika Westerberg
Details | Diff

Description Tasev Nikola 2022-12-07 08:07:39 UTC
Created attachment 303373 [details]
dmesg-6.1-rc8

Hi
Starting from kernel 6.1-rc1 for the first time in 8 years the resume from suspend is broken on my Asus UX305FA. After resume the sysrsq keys do not work and a hard reset is the only solution for shutdown, the screen is black, keyboard and mouse do not respond.

After bisecting the first bad commit is:

commit a47126ec29f538e1197862919f94d3b6668144a4
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Fri Sep 9 15:24:57 2022 -0500

    PCI/PTM: Cache PTM Capability offset
    
    Cache the PTM Capability offset instead of searching for it every time we
    enable/disable PTM or save/restore PTM state.  No functional change
    intended.
    
    Link: https://lore.kernel.org/r/20220909202505.314195-2-helgaas@kernel.org
    Tested-by: Rajvi Jingar <rajvi.jingar@linux.intel.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
    Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>

I try to revert only that patch but the build failed with error.
That patch is part of a series of 9 patches. Should i try to revert all 9 ?

I don't know if it is related at all, but looking into dmesg diff of 6.0.7 - 6.1-rc1 kernels i see a new entry "ACPI data" in BIOS-e820:

6.0.7:
[    0.000000] BIOS-e820: [mem 0x00000000bce16000-0x00000000bce75fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000bce76000-0x00000000bd2d0fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000bd2d1000-0x00000000be413fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x00000000be414000-0x00000000beffdfff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000beffe000-0x00000000beffefff] usable

6.1-rc1:
[    0.000000] BIOS-e820: [mem 0x00000000bce16000-0x00000000bce75fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000bce76000-0x00000000bce96fff] ACPI data
[    0.000000] BIOS-e820: [mem 0x00000000bce97000-0x00000000be413fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x00000000be414000-0x00000000beffdfff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000beffe000-0x00000000beffefff] usable

Full dmesg's are attached.

Tasev Nikola
Comment 1 Tasev Nikola 2022-12-07 08:09:03 UTC
Created attachment 303374 [details]
dmesg-6.0.7
Comment 2 Tasev Nikola 2022-12-07 08:09:54 UTC
Created attachment 303375 [details]
dmidecode Asus UX305FA
Comment 3 Tasev Nikola 2022-12-07 08:11:38 UTC
Created attachment 303376 [details]
lspci -vvn
Comment 4 Tasev Nikola 2022-12-07 08:12:30 UTC
Created attachment 303377 [details]
lspci -tv
Comment 5 Tasev Nikola 2022-12-07 14:14:25 UTC
I succeded reverting the 9 patches, rebuild the kernel but the resume from suspend is still broken.
The bisecting give's a wrong first bad commit. I'm lost, is hard to make a mistake because the testing here it's easy, close the lid and open it, it will resume or not.

Tasev
Comment 6 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-12-08 14:47:29 UTC
FWIW, have you tried rechecking a47126ec29f5 and it's parent? Maybe that will give some insights. Or try to do a bisection again (maybe even starting with the kernels you already built), but this time try at least a second time if a suspend fails, maybe there is a second problem with suspend that only occurs once in a while, which might send you off the rails.
Comment 7 Tasev Nikola 2022-12-08 17:32:34 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #6)
> FWIW, have you tried rechecking a47126ec29f5 and it's parent? Maybe that
> will give some insights. Or try to do a bisection again (maybe even starting
> with the kernels you already built), but this time try at least a second
> time if a suspend fails, maybe there is a second problem with suspend that
> only occurs once in a while, which might send you off the rails.

Hi 
I have done a new bisection today with sadly the exact same result.
The laptop suspend just fine always, the led is blinking, the resume is broken.
I just forgot to mention in my first post that in the last 4 build's of the bisecting the screen resume with success but the mouse cursor was frozen but the sysrsq keys are working. That's starting from:
[034f93fcb12f579c0108c7e2ca2d17ec4e5170aa] Merge branch 'pci/pm'
The next 3 are:
[3e347969a5776947a115649dae740a9ed47473f5] PCI/PM: Reduce D3hot delay with usleep_range()
[91b12b2a100e977274d3c277a4ff2df0b7439e7d] PCI/PTM: Move pci_ptm_info() body into its only caller
[e243c173c015d62b2bca9b030777ceba13311033] PCI/PTM: Add pci_upstream_ptm() helper


I'm just an average user, i do not understand what you mean by "rechecking a47126ec29f5 and it's parent". Do you mean revert the 9 patches one by one and test? I tried with the first one but the build failed with errors. Then i reverted all the 9 patches starting from 9 to 1, the build succeded but the resume from suspend is still broken
Comment 8 Bjorn Helgaas 2022-12-08 19:07:35 UTC
Marking as regression, reassigning to PCI
Comment 9 Bjorn Helgaas 2022-12-09 00:24:55 UTC
Ouch.  I'm really sorry about this regression, and thank you very much for reporting it and also doing all the bisection you've already done!

This is odd because your lspci doesn't show any devices that even have PTM.

If you still have a bisection log, can you attach the whole thing here?

Apparently resume fails at a47126ec29f5 ("PCI/PTM: Cache PTM Capability offset").

The parent of a47126ec29f5 (a47126ec29f5^) is 568035b01cfb ("Linux 6.0-rc1"); does resume work at 568035b01cfb?
Comment 10 Tasev Nikola 2022-12-09 10:05:27 UTC
Created attachment 303383 [details]
git-bisect log
Comment 11 Tasev Nikola 2022-12-09 10:09:45 UTC
(In reply to Bjorn Helgaas from comment #9)
> Ouch.  I'm really sorry about this regression, and thank you very much for
> reporting it and also doing all the bisection you've already done!
> 
> This is odd because your lspci doesn't show any devices that even have PTM.
> 
> If you still have a bisection log, can you attach the whole thing here?

The log is attached

> 
> Apparently resume fails at a47126ec29f5 ("PCI/PTM: Cache PTM Capability
> offset").
> 
> The parent of a47126ec29f5 (a47126ec29f5^) is 568035b01cfb ("Linux
> 6.0-rc1"); does resume work at 568035b01cfb?

Every kernel before 6.1-rc1 is working fine including 6.0.x .

I find the 9 patches here: https://lore.kernel.org/linux-pm/20220909202505.314195-2-helgaas@kernel.org/
I revert all 9, rebuild but resume is still broken.
For me then this is not PTM related, right?

Should i do git checkout 568035b01cfb and build a kernel?
Comment 12 Tasev Nikola 2022-12-14 11:00:51 UTC
Created attachment 303405 [details]
dmesg with PCI debug
Comment 13 Tasev Nikola 2022-12-14 11:01:52 UTC
After reading https://01.org/linuxgraphics/documentation/development/how-debug-suspend-resume-issues

I tried on kernel 91b12b2a100e977274d3c277a4ff2df0b7439e7d ,because on that kernel the screen after resume is frozen but the sysrsq keys are working :

dmesg > dmesg_before; echo mem > /sys/power/state; dmesg > dmesg_after  

but the file dmesg_after was not generated so is not a i915 issue.
I also rebuild that kernel with PCI debug activated, the dmesg with PCI debug is attached.
I also tried to unload many modules before suspend (wifi, card reader ...) with no success.
Comment 14 Bjorn Helgaas 2022-12-14 21:48:09 UTC
I'm stumped.

The "ACPI data" change in the original report is interesting because I think that's logged for E820_TYPE_ACPI, which is only used in parse_memmap_one() (not relevant since you're not using the "memmap" kernel parameter), do_add_efi_memmap() (probably not relevant either), and arch/x86/kernel/crash.c.  

The last is only compiled when CONFIG_KEXEC_CORE=y.  Is it possible that's enabled in your v6.1-rc1 kernel but not in v6.0?  If so, does it make any difference if you turn it off in v6.1-rc1?

The v6.0 dmesg includes "Booting paravirtualized kernel on bare hardware", while the v6.1-rc1 dmesg does not, suggesting that CONFIG_PARAVIRT=y in v6.0 but not v6.1-rc1.

ehci_hcd and ohci_hcd appear in v6.0 but not v6.1-rc1.  No idea if that's significant.

None of these *should* cause resume to fail, but if we can find something that changes the behavior, then we'd at least have a clue.  The only thing I can think to try is to make the v6.1-rc1 config as close as possible to v6.0 to see if the behavior is sensitive to any of these things.

For future boots, can you add "efi=debug pci=earlydump" to your kernel parameters, please?
Comment 15 Tasev Nikola 2022-12-15 10:16:14 UTC
(In reply to Bjorn Helgaas from comment #14)
> I'm stumped.
> 
> The "ACPI data" change in the original report is interesting because I think
> that's logged for E820_TYPE_ACPI, which is only used in parse_memmap_one()
> (not relevant since you're not using the "memmap" kernel parameter),
> do_add_efi_memmap() (probably not relevant either), and
> arch/x86/kernel/crash.c.  
> 
> The last is only compiled when CONFIG_KEXEC_CORE=y.  Is it possible that's
> enabled in your v6.1-rc1 kernel but not in v6.0?  If so, does it make any
> difference if you turn it off in v6.1-rc1?

The CONFIG_KEXEC_CORE=y is enabled on both config

> 
> The v6.0 dmesg includes "Booting paravirtualized kernel on bare hardware",
> while the v6.1-rc1 dmesg does not, suggesting that CONFIG_PARAVIRT=y in v6.0
> but not v6.1-rc1.

I included the 6.0.7 kernel dmesg from the Ubuntu kernel ppa mainline repository because before submitting a bug i always check if the behaviour is the same between my custom compiled kernel and the Ubuntu official one. I also checked with the 6.1-rc1 to 6.1-rc5 from the Ubuntu kernel ppa mainline and the resume is also broken. I use my custom config only for speedup the build process by disabling as many drivers as i can and also virtualisation. My custom config is attached.

> 
> ehci_hcd and ohci_hcd appear in v6.0 but not v6.1-rc1.  No idea if that's
> significant.
> 
> None of these *should* cause resume to fail, but if we can find something
> that changes the behavior, then we'd at least have a clue.  The only thing I
> can think to try is to make the v6.1-rc1 config as close as possible to v6.0
> to see if the behavior is sensitive to any of these things.
> 
> For future boots, can you add "efi=debug pci=earlydump" to your kernel
> parameters, please?

The dmesg with efi=debug pci=earlydump add_efi_memmap is attached.
Comment 16 Tasev Nikola 2022-12-15 10:17:15 UTC
Created attachment 303412 [details]
custom config 6.1-rc1
Comment 17 Tasev Nikola 2022-12-15 10:18:19 UTC
Created attachment 303413 [details]
dmesg with efi=debug pci=earlydump add_efi_memmap
Comment 18 Bjorn Helgaas 2022-12-15 20:36:07 UTC
Thank you!

My guess is the E820 "ACPI data" entry in 6.1-rc1 means a kexec kernel has been loaded, and maybe that didn't happen with 6.0.  Still conceivable that this difference could be related.  There were very few kexec changes in 6.1-rc1, but I suppose it would be easy to test 6.1-rc1 *without* loading a kexec kernel.

Sounds like CONFIG_PARAVIRT is unrelated.

Ubuntu ppa 6.0.7 works, Ubuntu ppa 6.1-rc1 is broken, so ehci/ohci differences should also be unrelated.

Comment #11 says resume still fails on 6.1-rc1 even with the PTM patches reverted.  That, plus the fact that you have no devices with PTM, suggests that the problem isn't related to those patches.

I'm cc'ing Rafael since he knows how suspend/resume works.
Comment 19 Tasev Nikola 2022-12-19 17:09:05 UTC
This weekend with kernel kernel 91b12b2a100e977274d3c277a4ff2df0b7439e7d i tried to suspend resume many times from tty2, first by stopping the graphic session
with sudo systemctl stop sddm.service, and only once, by luck after resume i could still use the keyboard so i tried dmesg > dmesg_after but it failed with I/O errors, then i switched to tty3 and could type dmesg and take a picture of the screen. Nothing else worked, i could not restart the graphic session or write something on the ssd.
 
The picture is attached.
Comment 20 Tasev Nikola 2022-12-19 17:10:16 UTC
Created attachment 303432 [details]
screen picture dmesg after resume
Comment 21 Rafael J. Wysocki 2022-12-19 17:34:40 UTC
Not that I have any clues, but please carry out the following experiment:

1. Check out commit a47126ec29f5 "PCI/PTM: Cache PTM Capability offset".

Build the kernel and test it.  If suspend/resume fails, do

2. git reset --hard HEAD^

Build the kernel and test it.
Comment 22 Tasev Nikola 2022-12-19 17:42:12 UTC
(In reply to Rafael J. Wysocki from comment #21)
> Not that I have any clues, but please carry out the following experiment:
> 
> 1. Check out commit a47126ec29f5 "PCI/PTM: Cache PTM Capability offset".
> 
> Build the kernel and test it.  If suspend/resume fails, do
> 
> 2. git reset --hard HEAD^
> 
> Build the kernel and test it.

I will try tomorrow and report.
Comment 23 Tasev Nikola 2022-12-20 11:31:28 UTC
I build today the 2 kernels but the resume from suspend fails on both.
Comment 24 Rafael J. Wysocki 2022-12-20 21:01:09 UTC
But in comment #11 you said that the latter worked for you, because it is 6.0-rc1.

It cannot be both working and not working, unless there are .config differences that make suspend fail on your system.

You said that 6.0.y worked for you in particular.  If you have a .config used for building one of these working kernels, please build that kernel again using the same .config and verify that suspend works.

Then please merge commit 91b12b2a100e977274d3c2 on top of it:

# git merge --log 91b12b2a100e977274d3c2

build the kernel and see if suspend works.
Comment 25 Tasev Nikola 2022-12-21 14:05:05 UTC
My bad, i mixed up something.

6.0-rc1 and 6.0-rc2 are broken.
From 6.0-rc3 and up the resume from suspend is working fine.
From 6.1-rc1 the resume from suspend is broken again.
Comment 26 Tasev Nikola 2022-12-21 14:57:17 UTC
After merging commit 91b12b2a100e977274d3c2 on top of 568035b01cfb the resume from suspend is still broken.

After git checkout v6.0-rc3 the resume from suspend is working.
Comment 27 Rafael J. Wysocki 2022-12-21 15:09:29 UTC
But 568035b01cfb is 6.0-rc1 which is known bad.

Please do

$ git checkout v6.0-rc3
$ git merge --log 91b12b2a100e977274d3c2

and test that.
Comment 28 Tasev Nikola 2022-12-21 16:59:47 UTC
The kernel with commit 91b12b2a100e977274d3c2 on top of v6.0-rc3 is working,
i tested suspend/resume 6 times.
Comment 29 Tasev Nikola 2022-12-22 13:13:10 UTC
For testing i copied the pci folder from kernel v6.0 in kernel v6.1.
The kernel 6.1 with pci from 6.0 is working.
Comment 30 Rafael J. Wysocki 2022-12-22 17:04:07 UTC
Good, so you know that the PTM changes did not break it for you (as per comment #28) and you know that some of the later PCI changes that went into 6.1 did that.


Now, please check if commit c440f9969523 ("Merge tag 'i2c-for-6.1-rc1-batch2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux") works and (if so) please test commit 041bc24d867a ("Merge tag 'pci-v6.1-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci").
Comment 31 Rafael J. Wysocki 2022-12-22 17:13:55 UTC
I kind of expect the latter to fail and my next step would be to test commit 3e347969a577 ("PCI/PM: Reduce D3hot delay with usleep_range()") on top of 6.0-rc3, that is

$ git checkout v6.0-rc3
$ git merge --log 3e347969a577
Comment 32 Tasev Nikola 2022-12-23 12:12:35 UTC
Commit c440f9969523 ("Merge tag 'i2c-for-6.1-rc1-batch2') works.
Commit 041bc24d867a ("Merge tag 'pci-v6.1-changes') as you expected fails.
Commit 3e347969a577 ("PCI/PM: Reduce D3hot delay) on top of 6.0-rc3 works.

The end of the year is close, i wish you a Very Merry Christmas and a Happy New Year!
Comment 33 Rafael J. Wysocki 2022-12-27 15:30:25 UTC
Thanks for the wishes, all the best in the new year to you too!

Because commit 041bc24d867a ("Merge tag 'pci-v6.1-changes') fails and one of its parents (commit c440f9969523) works, the issue has been introduced by the other parent of it, that is 0e00a3aeae25 ("Merge branch 'for-linus' into next"), so the problematic change is located somewhere between 6.0-rc1 and 0e00a3aeae25.

Now, v6.0-rc1 is known bad, so all of the subsequent tests need to be done on top of v6.0-rc3 in analogy with comment #31.  Please remember to check out v6.0-rc3 upfront every time.

My next step would be to test commit 519e512110e4 ("Merge branch 'pci/msi'").  If it failed, I would test commit c1c2d8921f10 ("Merge branch 'pci/aspm'").

If 519e512110e4 works, the next one to test will be 99e2c73df882 ("Merge branch 'pci/resource'").
Comment 34 Tasev Nikola 2022-12-28 12:44:39 UTC
Thanks for the wishes to,

After testing:
Commit 519e512110e4 ("Merge branch 'pci/msi'") on top of 6.0-rc3 fails.
Commit c1c2d8921f10 ("Merge branch 'pci/aspm'") on top of 6.0-rc3 works.
Comment 35 Rafael J. Wysocki 2022-12-28 20:32:06 UTC
I would test f9538e27a2d9 ("Merge branch 'pci/dpc'") next.

If this worked, the problematic commit would be on the pci/msi branch, so it would be 2b96f92ca425 ("PCI/MSI: Correct 'can_mask' test in msi_add_msi_desc()").  That is rather unlikely.

Otherwise, it would be on the pci/dpc branch, so it would be 5459c0b70467 ("PCI/DPC: Quirk PIO log size for certain Intel Root Ports").
Comment 36 Tasev Nikola 2022-12-29 14:24:16 UTC
Commit f9538e27a2d9 ("Merge branch 'pci/dpc'") on top of 6.0-rc3 fails.

I reverted commit 5459c0b70467 ("PCI/DPC: Quirk PIO log size for certain Intel Root Ports") from kernel v6.1 but it fails too.

Looking into commit 5459c0b70467 the fix is for Tiger lake and Alder lake systems. My Asus UX305FA is and old Broadwell platform so commit 5459c0b70467 should not affect it.

Commit 2b96f92ca425 ("PCI/MSI: Correct 'can_mask' test in msi_add_msi_desc()") is just a typo fix, i reverted it also from kernel v6.1 but it fails too.
Comment 37 Rafael J. Wysocki 2022-12-29 19:39:51 UTC
Well, you said in comment #34 that commit c1c2d8921f10 ("Merge branch 'pci/aspm'") on top of 6.0-rc3 worked and you said in comment #36 that commit f9538e27a2d9 ("Merge branch 'pci/dpc'") on top of 6.0-rc3 failed.

The difference between the two is just commit 5459c0b70467 and the merge itself:

$ git log --oneline c1c2d8921f10..f9538e27a2d9 | head
f9538e27a2d9 Merge branch 'pci/dpc'
5459c0b70467 PCI/DPC: Quirk PIO log size for certain Intel Root Ports

Does it still fail if you merge commit f9538e27a2d9 on top of v6.0-rc3 and revert commit 5459c0b70467 on top of that?
Comment 38 Tasev Nikola 2023-01-08 19:42:58 UTC
I did:
git reset --hard HEAD^
git checkout v6.0-rc3
git merge --log f9538e27a2d9
git revert 5459c0b70467

Yes it still fails.
Comment 39 Tasev Nikola 2023-01-08 19:47:03 UTC
I rechecked the kernel c1c2d8921f10 from comment 34 and it *fails*.
When testing i probably booted another kernel from grub by mistake and that's why i reported that kernel c1c2d8921f10 is working, i'm really sorry for that.
Comment 40 Tasev Nikola 2023-01-09 07:28:03 UTC
Having time i re-tested the kernel's from comment 32, 34 and 36.
All i reported is correct except kernel c1c2d8921f10 wich i reported as working but it fails always when re-testing.

Since kernel c1c2d8921f10 is tagged ("Merge branch 'pci/aspm'") i remembered that in grub i have pcie_aspm=force set as parameter so i re-tested the kernel's with or without pcie_aspm=force

commit c440f9969523 with or without pcie_aspm=force works,
commit 041bc24d867a with pcie_aspm=force fails, without pcie_aspm=force works,
commit 519e512110e4 with pcie_aspm=force fails, without pcie_aspm=force works,
commit c1c2d8921f10 with pcie_aspm=force fails, without pcie_aspm=force works,
kernel v6.1 with pcie_aspm=force fails, without pcie_aspm=force works,
kernel v6.2-rc2 with pcie_aspm=force fails, without pcie_aspm=force works.

I have pcie_aspm=force in grub since 2014, in that time without the parameter, monitoring the power draw with powertop, the UX305FA was idling between 4-5 Watts. With the pcie_aspm=force parameter set the power draw was only 2-2,2 Watts.

Today with kernel v6.1 i checked the power draw with powertop without pcie_aspm=force and the UX305FA is idling between 2-2,2 Watts.
Comment 41 Mark Francis Enriquez 2023-01-12 13:02:24 UTC
Hello,

Excuse me for intruding, but I also have the same problem with the reporter and also did a manual bisect.
Hope this additional information helps.

The bisect returned this commit:

commit 4ff116d0d5fd8a025604b0802d93a2d5f4e465d1
Author: Vidya Sagar <vidyas@nvidia.com>
Date:   Tue Sep 13 18:48:22 2022 +0530

    PCI/ASPM: Save L1 PM Substates Capability for suspend/resume

    Previously the L1 PM Substates Control Registers (CTL1 and CTL2) weren't
    saved and restored during suspend/resume leading to the L1 PM Substates
    configuration being lost post-resume.

    Save the L1 PM Substates Control Registers so that the configuration is
    retained post-resume.

    [bhelgaas: drop pci_is_pcie() testing; we can rely on pci_configure_ltr()
    having already done that]
    Link: https://lore.kernel.org/r/20220913131822.16557-3-vidyas@nvidia.com
    Signed-off-by: Vidya Sagar <vidyas@nvidia.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

From this, it seems that it is PCI ASPM related, particularly on the restoration of L1 control registers (?).



To test my hypothesis:
I tried to comment out the new call to "pci_restore_aspm_l1ss_state()" within "pci_restore_state()" in drivers/pci/pci.c.
Nothing was broken and we are not restoring registers before so, I thought Why Not?

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 2127aba3550b..6c226f12b196 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1772,7 +1772,7 @@ void pci_restore_state(struct pci_dev *dev)
         * LTR itself (in the PCIe capability).
         */
        pci_restore_ltr_state(dev);
-       pci_restore_aspm_l1ss_state(dev);
+       // pci_restore_aspm_l1ss_state(dev);

        pci_restore_pcie_state(dev);
        pci_restore_pasid_state(dev);


Results:
I tried rebuilding with the change on top of the v6.1 tag (7614896350aa20764c5eca527262d9eb0a57da63) and the system successfully resumed from sleep.
Comment 42 Rafael J. Wysocki 2023-01-12 13:10:51 UTC
This is useful information, thanks for sharing it!

Do you also boot with pcie_aspm=force?
Comment 43 Mark Francis Enriquez 2023-01-12 13:18:20 UTC
(In reply to Rafael J. Wysocki from comment #42)
> This is useful information, thanks for sharing it!
> 
> Do you also boot with pcie_aspm=force?

Yes, I do. I think I left that in the boot loader to make use of my Ethernet card's powersaving. I'm am also using a laptop which mostly is connected to the LAN.
Comment 44 Rafael J. Wysocki 2023-01-12 13:27:13 UTC
So with pcie_aspm=force, restoring the ASPM state is likely to confuse the firmware which is still assuming that it can control ASPM.

I guess it should be possible to avoid saving and restoring of the L1SS state if pcie_aspm=force is used (and actually in effect), but whether or not this is the right thing to do is a good question.

Let me see if I can cut a patch for this anyway.
Comment 45 Rafael J. Wysocki 2023-01-12 13:48:59 UTC
Created attachment 303587 [details]
PCI/ASPM: Do not save and restore L1SS state if ASPM is forced

So here's a patch to avoid saving/restoring the L1SS state if ASPM is forced (and not negotiated with the platform firmware).

Can you please check if it works for you?
Comment 46 Mark Francis Enriquez 2023-01-12 14:56:00 UTC
As for my case, unfortunately, it does not work. I compiled and tested twice.

The patch adds something in pcie_no_aspm() which is called when the FADT or firmware (I think) says there is no ASPM.

It seems in my case, ASPM is supported and FADT is okay with it, hence it is never called. 

I cannot really speak for the reporter's case since our machines are different. Mine's a Gigabyte laptop and the reporter's an Asus Laptop. Based on the reporter's dmesg log, this may likely have an effect for him/her.


In retrospect, since ASPM is declared as supported by my machine's firmware, I really do not have to specify "pcie_aspm=force" dont't I?

Nevertheless, Thanks for the patch.
Comment 47 Rafael J. Wysocki 2023-01-12 14:58:59 UTC
(In reply to Mark Francis Enriquez from comment #46)
> As for my case, unfortunately, it does not work. I compiled and tested twice.
> 
> The patch adds something in pcie_no_aspm() which is called when the FADT or
> firmware (I think) says there is no ASPM.
> 
> It seems in my case, ASPM is supported and FADT is okay with it, hence it is
> never called. 
> 
> I cannot really speak for the reporter's case since our machines are
> different. Mine's a Gigabyte laptop and the reporter's an Asus Laptop. Based
> on the reporter's dmesg log, this may likely have an effect for him/her.
> 
> In retrospect, since ASPM is declared as supported by my machine's firmware,
> I really do not have to specify "pcie_aspm=force" dont't I?

That appears to be the case.

Can you please see if you can reproduce the issue without pcie_aspm=force (because it affects one more thing which is unlikely in your case, but still)?
Comment 48 Tasev Nikola 2023-01-12 15:13:08 UTC
The patch works for me with or without pcie_aspm=force, tested on kernel v6.1-rc1.
Comment 49 Mark Francis Enriquez 2023-01-12 15:51:52 UTC
(In reply to Rafael J. Wysocki from comment #47)
> (In reply to Mark Francis Enriquez from comment #46)
> > As for my case, unfortunately, it does not work. I compiled and tested
> twice.
> > 
> > The patch adds something in pcie_no_aspm() which is called when the FADT or
> > firmware (I think) says there is no ASPM.
> > 
> > It seems in my case, ASPM is supported and FADT is okay with it, hence it
> is
> > never called. 
> > 
> > I cannot really speak for the reporter's case since our machines are
> > different. Mine's a Gigabyte laptop and the reporter's an Asus Laptop.
> Based
> > on the reporter's dmesg log, this may likely have an effect for him/her.
> > 
> > In retrospect, since ASPM is declared as supported by my machine's
> firmware,
> > I really do not have to specify "pcie_aspm=force" dont't I?
> 
> That appears to be the case.
> 
> Can you please see if you can reproduce the issue without pcie_aspm=force
> (because it affects one more thing which is unlikely in your case, but
> still)?

With or Without "pcie_aspm=force", I can still reproduce the issue.
Comment 50 Rafael J. Wysocki 2023-01-12 16:58:09 UTC
Something appears to be confused by restoring the L1SS state.

There may be an ordering issue (for one, I'm not sure if enabling L1.2 before restoring all of the devices PCIe capabilities is valid) or a timing one.

It's the time to involve the author of commit 4ff116d0d5fd8a025604b0 I think.
Comment 51 Vidya Sagar 2023-01-16 14:26:15 UTC
@Tasev and @Mark,
Could you please share the output of 'sudo lspci -vv', 'sudo lspci -tv' and also specify your LAN devices respectively?
Comment 52 Mark Francis Enriquez 2023-01-16 15:44:58 UTC
Created attachment 303614 [details]
lspci -vv and lspci -tv output

Hello Vidya,

I attached the output of the two lspci commands. 

Though, about "LAN devices", Are you referring to the Ethernet one? If so, I think that's on 
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
Comment 53 Vidya Sagar 2023-01-17 05:51:01 UTC
Created attachment 303618 [details]
skip L1SS restore if not enabled
Comment 54 Vidya Sagar 2023-01-17 05:54:17 UTC
Comment on attachment 303618 [details]
skip L1SS restore if not enabled

Thanks Mark for the lspci output.
Though many devices have support for ASPM L1.1 and L1.2, only ASPM L1 is enabled. So, ideally, saving and restoring L1SS registers shouldn't affect the suspend/resume sequence unless there is some issue with HW.

Also, setting/not setting pcie_aspm=force shouldn't make any difference. Could you please confirm this?
Looking at the code, if the BIOS is enabling ASPM states, then, setting pcie_aspm=force doesn't seem to change anything. Kernel just leaves the ASPM part as it was set by BIOS and doesn't do any further changes. The only parameter that can affect it is setting pcie_aspm=off, in which case, kernel would disable the ASPM states even if BIOS left them enabled.

Can you please try the attached patch where I'm skipping restoration of L1 SubState registers if they are not enabled in the first place. This would help me understand if touching the register (althrough writing with the same value) itself is causing the issue.
Comment 55 Tasev Nikola 2023-01-17 09:32:18 UTC
Hello Vidya

The output of the two lspci commands from my Asus UX305FA are already attached.
Should I attach them once more?

My only network is the:
02:00.0 Network controller: Intel Corporation Wireless 7265 (rev 59)

I tested your patch on kernel v6.1-rc1, it's working with or without pcie_aspm=force.
Comment 56 Mark Francis Enriquez 2023-01-17 14:49:29 UTC
(In reply to Vidya Sagar from comment #54)
> Comment on attachment 303618 [details]
> skip L1SS restore if not enabled
> 
> Thanks Mark for the lspci output.
> Though many devices have support for ASPM L1.1 and L1.2, only ASPM L1 is
> enabled. So, ideally, saving and restoring L1SS registers shouldn't affect
> the suspend/resume sequence unless there is some issue with HW.
> 
> Also, setting/not setting pcie_aspm=force shouldn't make any difference.
> Could you please confirm this?
> Looking at the code, if the BIOS is enabling ASPM states, then, setting
> pcie_aspm=force doesn't seem to change anything. Kernel just leaves the ASPM
> part as it was set by BIOS and doesn't do any further changes. The only
> parameter that can affect it is setting pcie_aspm=off, in which case, kernel
> would disable the ASPM states even if BIOS left them enabled.
> 
> Can you please try the attached patch where I'm skipping restoration of L1
> SubState registers if they are not enabled in the first place. This would
> help me understand if touching the register (althrough writing with the same
> value) itself is causing the issue.

I do not think I understand what you mean about "setting/not setting pcie_aspm=force". Are you saying that even if ASPM is forced (or not) would not affect the issue?
(Though, my machine's BIOS allows ASPM so "pcie_aspm=force" does nothing except printing "PCIe ASPM is forcibly enabled" at boot up)

Anyways, I was able to recompile with this patch alone on top of the v6.1 tag.
I tried this twice with and without "pcie_aspm=force", and suspend/resume works.

Hope this helps confirm (or invalidate) your thoughts.
Comment 57 Vidya Sagar 2023-01-18 13:27:09 UTC
Thanks Mark for trying out my patch.

>> Are you saying that even if ASPM is forced (or not) would not affect the
>> issue?
Yes.

>> my machine's BIOS allows ASPM so "pcie_aspm=force" does nothing except
>> printing "PCIe ASPM is forcibly enabled" at boot up
This matches with my understanding also.

Given that the issue disappears with my patch, I'm more inclined to believe that this is a HW issue. Reason being that, we don't have ASPM L1SS enabled in the system in the first place and during save/restore operation, system is behaving weirdly even if the disabled values (i.e. zeros) are written in to the control registers.

Bjorn / Rafel, what is your take on this? I'm fine with pushing my patch for review if that seems to be the best solution at this point.
Comment 58 Rafael J. Wysocki 2023-01-18 18:02:15 UTC
(In reply to Vidya Sagar from comment #57)

> Bjorn / Rafel, what is your take on this? I'm fine with pushing my patch for
> review if that seems to be the best solution at this point.

There is a functional regression to be address. Commit 4ff116d0d5fd8a02 has introduced it and whether or not it has just exposed a hardware issue, doesn't really matter too much.

If the patch in question is the best way to address it you can think about, so yes please submit it for review.  At least this will allow other users who may be affected by this problems to apply it as a workaround.
Comment 59 Vidya Sagar 2023-01-19 09:52:03 UTC
Mark, given that your platform has devices that support L1 Sub-States, could you please do the following experiment and update the result?
It seems, only ASPM L1 is enabled by default in your system. Can you try enabling the L1SS also by changing the policy of the system (i.e. echo powersupersave > /sys/module/pcie_aspm/parameters/policy) and then taking the system through suspend/resume sequence.
The reason why I want to do this experiment is that, I would like to understand if a) just accessing the L1SS register itself has the issue
b) accessing L1SS register when L1SS are not enabled has the issue
c) everything works fine with save and restore when L1SS are enabled

In the meantime, I sent the patch that doesn't attempt to save/restore L1SS registers if L1SS states are not enabled for review.
https://patchwork.ozlabs.org/project/linux-pci/list/?series=337457
Comment 60 Mark Francis Enriquez 2023-01-19 10:07:56 UTC
Interesting, it actually worked...

I used a regular v6.1 build supplied by my distro (i.e., no Vidya's patch) and set the policy to powersupersave. It resumed successfully. I did this twice and had the same result.

Now, I tried setting the policy back to default and suspended (still in the same v6.1 kernel). I only tried once because the resume failed and had to reboot.
Comment 61 Vidya Sagar 2023-01-19 11:20:05 UTC
I think it appears to be case-(b) as mentioned in comment #59.
It definately looks like some HW issue to me.
Anyway, the patch I sent for review should address this kind of issues though.
Comment 62 Tasev Nikola 2023-01-24 12:17:14 UTC
Hello Vidya

I tested the V2 of your patch from 
https://patchwork.kernel.org/project/linux-pci/patch/20230120091540.3305-1-vidyas@nvidia.com/

I can confirm the patch works, tested on kernel v6.1.
Comment 63 Vidya Sagar 2023-01-24 13:54:52 UTC
Thanks for testing the patch.
Comment 64 Mark Francis Enriquez 2023-01-24 14:49:44 UTC
Hello,

Sorry for the delay.

I tested the V2 patch once more on top of the v6.2-rc5 tag (2241ab53cbb5cdb08a6b2d4688feb13971058f65) with and without the patch.
I left the ASPM policy set on default rather than powersupersave and tried suspending/resuming thrice on each build.

As expected, the build with the patch resumed successfully.

Surprisingly, the build without the patch DID SUCCEED to resume but some PCIE devices failed to resume.

Here's an excerpt, if you would like to see it.
[  108.732484] ACPI: PM: Waking up from system sleep state S3
[  108.740540] ACPI: EC: interrupt unblocked
[  108.765035] sdhci-pci 0000:04:00.0: Unable to change power state from D3hot to D0, device inaccessible
[  108.807919] nvme 0000:02:00.0: Unable to change power state from D3hot to D0, device inaccessible
[  108.948541] ACPI: EC: event unblocked
[  108.948754] sd 0:0:0:0: [sda] Starting disk
[  108.955676] i915 0000:00:02.0: [drm] GuC firmware i915/tgl_guc_70.1.1.bin version 70.1.1
[  108.955687] i915 0000:00:02.0: [drm] HuC firmware i915/tgl_huc_7.9.3.bin version 7.9.3
[  108.955739] nvme 0000:02:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  108.956225] nvme nvme0: Disabling device after reset failure: -19
[  108.956265] nvme0n1: detected capacity change from 1000215216 to 0
[  108.959584] i915 0000:00:02.0: [drm] HuC authenticated
[  108.959918] i915 0000:00:02.0: [drm] GuC submission enabled
[  108.959920] i915 0000:00:02.0: [drm] GuC SLPC enabled
[  108.960326] i915 0000:00:02.0: [drm] GuC RC: enabled
[  109.054948] mmc0: Reset 0x1 never completed.
[  109.054949] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
[  109.054951] mmc0: sdhci: Sys addr:  0xffffffff | Version:  0x0000ffff
[  109.054954] mmc0: sdhci: Blk size:  0x0000ffff | Blk cnt:  0x0000ffff
[  109.054956] mmc0: sdhci: Argument:  0xffffffff | Trn mode: 0x0000ffff
[  109.054958] mmc0: sdhci: Present:   0xffffffff | Host ctl: 0x000000ff
[  109.054960] mmc0: sdhci: Power:     0x000000ff | Blk gap:  0x000000ff
[  109.054962] mmc0: sdhci: Wake-up:   0x000000ff | Clock:    0x0000ffff
[  109.054964] mmc0: sdhci: Timeout:   0x000000ff | Int stat: 0xffffffff
[  109.054966] mmc0: sdhci: Int enab:  0xffffffff | Sig enab: 0xffffffff
[  109.054968] mmc0: sdhci: ACmd stat: 0x0000ffff | Slot int: 0x0000ffff
[  109.054970] mmc0: sdhci: Caps:      0xffffffff | Caps_1:   0xffffffff
[  109.054972] mmc0: sdhci: Cmd:       0x0000ffff | Max curr: 0xffffffff
[  109.054974] mmc0: sdhci: Resp[0]:   0xffffffff | Resp[1]:  0xffffffff
[  109.054976] mmc0: sdhci: Resp[2]:   0xffffffff | Resp[3]:  0xffffffff
[  109.054977] mmc0: sdhci: Host ctl2: 0x0000ffff
[  109.054980] mmc0: sdhci: ADMA Err:  0xffffffff | ADMA Ptr: 0xffffffff
[  109.054980] mmc0: sdhci: ============================================
[  109.200311] usb 1-8: reset high-speed USB device number 3 using xhci_hcd
[  110.667020] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  111.028032] ata1.00: configured for UDMA/133
[  111.061182] mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_component_ops [i915])
[  111.061830] OOM killer enabled.
[  111.061834] Restarting tasks ... done.
[  111.065469] random: crng reseeded on system resumption
[  111.067417] PM: suspend exit

Once Again, the dmesg log above is a NON Issue as the errors does not even appear when the patch is applied. The system resuming might be due to me forcibly enabling GuC submission on i915 (which I seem to have forgotten to remove for this test).

In any case, the patch works.

Thanks.
Comment 65 Tasev Nikola 2023-02-08 14:33:28 UTC
Hi,
Will the patch be merged soon ?
Will the patch be backported to the 6.1 kernel ?
Comment 66 Bjorn Helgaas 2023-02-10 23:21:17 UTC
This should be fixed by:

https://git.kernel.org/linus/a7152be79b62 ("Revert "PCI/ASPM: Save L1 PM Substates Capability for suspend/resume"")

which will appear in v6.2-rc8.  It is marked for stable so it will be backported to v6.1 after v6.2 releases.

Please reopen this if you can reproduce it on v6.2-rc8 or later.

Tasev and Mark, thank you VERY much for all your work bisecting and testing this.  I know that's a lot of work, and I really appreciate it!
Comment 67 KobaKo 2023-04-06 10:04:28 UTC
a7152be79b62) Revert "PCI/ASPM: Save L1 PM Substates Capability for suspend/resume”  
  
this caused the CPU(i7-1355u) can't sleep deeper than pc2.

log: lscpi_vvv_202304061800
Comment 68 KobaKo 2023-04-06 10:05:17 UTC
Created attachment 304090 [details]
lspci_vvv202304061800
Comment 69 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-04-06 10:11:16 UTC
(In reply to KobaKo from comment #67)
> a7152be79b62) Revert "PCI/ASPM: Save L1 PM Substates Capability for
> suspend/resume”  
>   
> this caused the CPU(i7-1355u) can't sleep deeper than pc2.

please file a new ticket and then drop the link to it here, otherwise things just get confusing
Comment 70 KobaKo 2023-04-11 08:36:00 UTC
here's a new ticket to track the cpu idle issue,
https://bugzilla.kernel.org/show_bug.cgi?id=217321
Comment 71 Mark Francis Enriquez 2023-05-03 15:48:39 UTC
Created attachment 304208 [details]
Dmesg log for the Gigabyte G5 GD laptop,

Attaching the dmesg log for the Gigabyte G5 GD laptop, which was affected by this issue prior.
Comment 72 David Box 2023-06-09 03:03:48 UTC
Created attachment 304386 [details]
L1SS save/restore patch with 100ms delay

Looking at the history of this issue, it's possible these problems are because the device wasn't yet ready when the control register was being written to. We have seen something similar before on Chrome systems that were able to achieve very low power during suspend. Please try the attached patch which adds a 100ms delay before the write to see if that helps.
Comment 73 Mika Westerberg 2023-06-26 09:58:46 UTC
Created attachment 304476 [details]
Add back L1SS save and restore

Tasev, Mark, Thomas, I wonder if you guys could try the attached patch on your system and check that suspend/resume still works?

I'm asking because we are seeing an issue on certain systems if this functionality is not there. They are not entering certain CPU low power states and this causes them to draw lots of power from the battery so we are hoping to restore this functionality, and at the same time avoid breaking your systems.

Thanks in advance!

The patch should apply on top of v6-4.
Comment 74 Tasev Nikola 2023-06-26 12:38:47 UTC
(In reply to Mika Westerberg from comment #73)
> Created attachment 304476 [details]
> Add back L1SS save and restore
> 
> Tasev, Mark, Thomas, I wonder if you guys could try the attached patch on
> your system and check that suspend/resume still works?
> 
> I'm asking because we are seeing an issue on certain systems if this
> functionality is not there. They are not entering certain CPU low power
> states and this causes them to draw lots of power from the battery so we are
> hoping to restore this functionality, and at the same time avoid breaking
> your systems.
> 
> Thanks in advance!
> 
> The patch should apply on top of v6-4.

The suspend/resume works with your patch for me (Asus UX305FA).
I tested suspend/resume 5 time's kernel v6.4.
Tasev
Comment 75 Mark Francis Enriquez 2023-06-26 13:11:47 UTC
(In reply to Mika Westerberg from comment #73)
> Created attachment 304476 [details]
> Add back L1SS save and restore
> 
> Tasev, Mark, Thomas, I wonder if you guys could try the attached patch on
> your system and check that suspend/resume still works?
> 
> I'm asking because we are seeing an issue on certain systems if this
> functionality is not there. They are not entering certain CPU low power
> states and this causes them to draw lots of power from the battery so we are
> hoping to restore this functionality, and at the same time avoid breaking
> your systems.
> 
> Thanks in advance!
> 
> The patch should apply on top of v6-4.

Hello,

It works on this particular Gigabyte G5 GD (11th Gen Intel) laptop.

Though, I must comment though that the same stock kernel 6.1 that I tested last January 2023 which failed before, also now works in the same system. There's a lot of software/userspace changes that occurred within the last 6 months so I am trying to rollback my system to January 12, 2023, to further verify.

Thank you and I apologize.
Comment 76 Mika Westerberg 2023-06-27 06:26:32 UTC
Thanks a lot for checking! I sent the patch upstream with you CC'd now.
Comment 77 Tasev Nikola 2023-10-06 13:25:09 UTC
(In reply to Mika Westerberg from comment #76)
> Thanks a lot for checking! I sent the patch upstream with you CC'd now.

I just tested the patch v4, it works on my Asus UX305FA.
Then i remove the Asus UX305FA from the denylist, rebuild the kernel and it works also fine. I tested also with pcie_aspm=force set in grub and it works.
For me with this v4 patch everything just works.

Thank you for your patch Mika.
Comment 78 Mika Westerberg 2023-10-09 08:20:10 UTC
This is great news, thanks a lot for testing! This means that we can drop the denylist at least for this Asus UX305FA system.
Comment 79 Tasev Nikola 2023-12-21 15:52:46 UTC
Hi
I just tested the patch v5 from David, it works on my Asus UX305FA.
I tested also with pcie_aspm=force set in grub and it works.
Tested on kernel v6.7-rc6.
Comment 80 Tasev Nikola 2024-03-08 09:29:18 UTC
I just tested the patch v7 from Bjorn, it works on my Asus UX305FA.
Tested on kernel v6.8-rc1.

Note You need to log in before you can comment on or make changes to this bug.