Bug 217415 - mt7921e error on hibernate resume path
Summary: mt7921e error on hibernate resume path
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless
Hardware: AMD Linux
Importance: P3 normal
Assignee: drivers_network-wireless@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-05-08 09:36 UTC by kolAflash
Modified: 2024-11-21 20:26 UTC
CC List: 9 users

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg with hibernation and failure to start wifi (130.30 KB, text/plain)
2024-03-17 16:55 UTC, Alex Maras
dmesg with failure to rmmod (118.32 KB, text/plain)
2024-03-25 14:03 UTC, Alex Maras
failure to reboot after rmmod stalls (389.24 KB, image/jpeg)
2024-03-25 14:04 UTC, Alex Maras
dmesg failure (6.32 KB, text/plain)
2024-07-06 11:20 UTC, Siddh Raman Pant
dmesg with error (33.39 KB, text/plain)
2024-08-07 12:20 UTC, piotrekkr@o2.pl

Description kolAflash 2023-05-08 09:36:29 UTC
When resuming from hibernation (suspend to disk) I got this error from mt7921e.

[T29172] mt7921e 0000:01:00.0: Message 00020007 (seq 11) timeout
[T29172] mt7921e 0000:01:00.0: PM: dpm_run_callback(): pci_pm_restore+0x0/0x90 returns -110
[T29172] mt7921e 0000:01:00.0: PM: failed to restore async: error -110
[T29172] mt7921e 0000:01:00.0: HW/SW Version: 0x8a108a10, Build Time: 20220311230842a
[T29172] 
[T29172] mt7921e 0000:01:00.0: WM Firmware Version: ____010000, Build Time: 20220311230931

Full dmesg:
https://gitlab.freedesktop.org/drm/amd/uploads/4ae31a3d6b9a7db839943c16e06d8704/Ryzen-5650U_6.1.26-with-8cf17c25e_hibernation-wakeup.txt
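
For readers comparing logs: the `-110` in the PM lines above is a negated Linux errno. The following is a minimal sketch for pulling these codes out of dmesg, assuming the generic Linux errno numbering (110 = ETIMEDOUT, 5 = EIO). The function names `pm_error_code` and `errno_name` are hypothetical helpers, not part of any existing tool.

```sh
#!/bin/sh
# Hypothetical helpers for reading "PM: dpm_run_callback(): ... returns -N" lines.

# Print the positive error number from each dpm_run_callback() line on stdin.
pm_error_code() {
    sed -n 's/.*dpm_run_callback().* returns -\([0-9][0-9]*\).*/\1/p'
}

# Map a few Linux errno numbers to their symbolic names. Only the values
# seen in this report are listed (assumption: generic Linux numbering).
errno_name() {
    case "$1" in
        5)   echo "EIO" ;;
        110) echo "ETIMEDOUT" ;;
        *)   echo "unknown ($1)" ;;
    esac
}
```

Possible usage: `sudo dmesg | pm_error_code | while read -r n; do errno_name "$n"; done`.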

Came up as part of a different problem:
Ryzen 3500U and 5650U: StandBy and External Monitors broken since >= 6.1
https://gitlab.freedesktop.org/drm/amd/-/issues/2492#note_1894147

Maybe related:
wifi mediatek mt7921e problem after suspend
https://bugzilla.kernel.org/show_bug.cgi?id=215463
Comment 2 kolAflash 2023-08-15 09:10:04 UTC
(In reply to Mario Limonciello (AMD) from comment #1)
> Maybe
> https://patchwork.kernel.org/project/linux-wireless/patch/
> 19f1aae1ab9ea867eb42742fc5b72ed4d7307b0a.1687159671.git.deren.wu@mediatek.
> com/ helps

I've tested Linux-6.1.43 which includes that patch.
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=97ccc14d114b1cf3bc16670fa09f74ec7233b643


dmesg shows these errors multiple times. This is the first occurrence.
[ T1314] mt7921e 0000:01:00.0: enabling device (0000 -> 0002)
[ T1314] mt7921e 0000:01:00.0: ASIC revision: 79610010
[  T404] mt7921e 0000:01:00.0: HW/SW Version: 0x8a108a10, Build Time: 20230526130917a
[  T404] mt7921e 0000:01:00.0: WM Firmware Version: ____010000, Build Time: 20230526130958
[ T3390] mt7921e 0000:01:00.0: Message 00020007 (seq 4) timeout
[ T3390] mt7921e 0000:01:00.0: PM: dpm_run_callback(): pci_pm_restore+0x0/0x90 returns -110
[ T3390] mt7921e 0000:01:00.0: PM: failed to restore async: error -110


Later on there are also multiple stacktraces involving net/mac80211/rx.c
This is the first occurrence.
[ T1410] ------------[ cut here ]------------
[ T1410] WARNING: CPU: 2 PID: 1410 at /home/myuser/opt/linux-kernel/build.backup-exclude-m461c/build_bisect/worktree/net/mac80211/rx.c:5169 ieee80211_rx_list+0x588/0xc60 [mac80211]
[...]
[ T1410] CPU: 2 PID: 1410 Comm: napi/phy0-8193 Tainted: G        W   E      6.1.43-v6.1.43 #18
[ T1410] Hardware name: HP HP EliteBook 845 G8 Notebook PC/8895, BIOS T82 Ver. 01.13.01 03/31/2023
[ T1410] RIP: 0010:ieee80211_rx_list+0x588/0xc60 [mac80211]
[ T1410] Code: ff 8b b5 18 05 00 00 85 f6 74 0b a9 00 00 04 00 0f 84 98 00 00 00 48 89 df e8 a4 06 60 ea e9 6d fb ff ff 0f 0b e9 59 fb ff ff <0f> 0b e9 52 fb ff ff 8b 85 78 15 00 00 85 c0 0f 84 10 fe ff ff e9
[ T1410] RSP: 0018:ffffb40a006b7c20 EFLAGS: 00010246
[ T1410] RAX: 000000ff0000ff00 RBX: ffff8eba878cd600 RCX: ffff8eb99a602198
[ T1410] RDX: ffff8eb99a6003a0 RSI: 0000000000000000 RDI: ffff8eb99a6008e0
[ T1410] RBP: ffff8eb99a6008e0 R08: 00000000ffffffa6 R09: ffffb40a006b7d30
[ T1410] R10: 000000000000143c R11: 000000002d1db16d R12: ffff8eb99a602080
[ T1410] R13: 0000000000000000 R14: ffff8eb99a603748 R15: 0000000000000000
[ T1410] FS:  0000000000000000(0000) GS:ffff8ec7ce680000(0000) knlGS:0000000000000000
[ T1410] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ T1410] CR2: 00007f1a789f7020 CR3: 0000000094010000 CR4: 0000000000750ee0
[ T1410] PKRU: 55555554
[ T1410] Call Trace:
[ T1410]  <TASK>
[ T1410]  ? __warn+0x7d/0x140
[ T1410]  ? ieee80211_rx_list+0x588/0xc60 [mac80211]
[ T1410]  ? report_bug+0xf8/0x1e0
[ T1410]  ? handle_bug+0x44/0x80
[ T1410]  ? exc_invalid_op+0x13/0x60
[ T1410]  ? asm_exc_invalid_op+0x16/0x20
[ T1410]  ? ieee80211_rx_list+0x588/0xc60 [mac80211]
[ T1410]  ? swiotlb_tbl_map_single+0x5d6/0x6b0
[ T1410]  mt76_rx_complete+0x198/0x2e0 [mt76]
[ T1410]  ? swiotlb_map+0x96/0x260
[ T1410]  mt76_rx_poll_complete+0x373/0x570 [mt76]
[ T1410]  ? mt76_dma_rx_poll+0x25d/0x480 [mt76]
[ T1410]  mt76_dma_rx_poll+0x25d/0x480 [mt76]
[ T1410]  ? __napi_poll+0x1b0/0x1b0
[ T1410]  mt7921_poll_rx+0x4a/0xe0 [mt7921e]
[ T1410]  __napi_poll+0x29/0x1b0
[ T1410]  ? napi_threaded_poll+0x80/0x100
[ T1410]  napi_threaded_poll+0x9d/0x100
[ T1410]  kthread+0xd9/0x100
[ T1410]  ? kthread_complete_and_exit+0x20/0x20
[ T1410]  ret_from_fork+0x22/0x30
[ T1410]  </TASK>
[ T1410] ---[ end trace 0000000000000000 ]---


Finally the system is crashing when waking from standby. The crash itself is more likely an issue from the amdgpu driver. But there's also another stacktrace involving mt7921e which I could recover from pstore.
[ T1410] ------------[ cut here ]------------
[ T1410] WARNING: CPU: 9 PID: 1410 at /home/myuser/opt/linux-kernel/build.backup-exclude-m461c/build_bisect/worktree/net/mac80211/rx.c:5169 ieee80211_rx_list+0x588/0xc60 [mac80211]
[...]
[ T1410] CPU: 9 PID: 1410 Comm: napi/phy0-8193 Tainted: G        W   E      6.1.43-v6.1.43 #18
[ T1410] Hardware name: HP HP EliteBook 845 G8 Notebook PC/8895, BIOS T82 Ver. 01.13.01 03/31/2023
[ T1410] RIP: 0010:ieee80211_rx_list+0x588/0xc60 [mac80211]
[ T1410] Code: ff 8b b5 18 05 00 00 85 f6 74 0b a9 00 00 04 00 0f 84 98 00 00 00 48 89 df e8 a4 06 60 ea e9 6d fb ff ff 0f 0b e9 59 fb ff ff <0f> 0b e9 52 fb ff ff 8b 85 78 15 00 00 85 c0 0f 84 10 fe ff ff e9
[ T1410] RSP: 0018:ffffb40a006b7c20 EFLAGS: 00010246
[ T1410] RAX: 000000ff0000ff00 RBX: ffff8ec16adfca00 RCX: ffff8eb99a602198
[ T1410] RDX: ffff8eb99a6003a0 RSI: 0000000000000000 RDI: ffff8eb99a6008e0
[ T1410] RBP: ffff8eb99a6008e0 R08: 00000000ffffffa7 R09: ffffb40a006b7d30
[ T1410] R10: 000000000000143c R11: 0000000004c72c40 R12: ffff8eb99a602080
[ T1410] R13: 0000000000000000 R14: ffff8eb99a603748 R15: 0000000000000000
[ T1410] FS:  0000000000000000(0000) GS:ffff8ec7ce840000(0000) knlGS:0000000000000000
[ T1410] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ T1410] CR2: 0000000000000000 CR3: 0000000094010000 CR4: 0000000000750ee0
[ T1410] PKRU: 55555554
[ T1410] Call Trace:
[ T1410]  <TASK>
[ T1410]  ? __warn+0x7d/0x140
[ T1410]  ? ieee80211_rx_list+0x588/0xc60 [mac80211]
[ T1410]  ? report_bug+0xf8/0x1e0
[ T1410]  ? handle_bug+0x44/0x80
[ T1410]  ? exc_invalid_op+0x13/0x60
[ T1410]  ? asm_exc_invalid_op+0x16/0x20
[ T1410]  ? ieee80211_rx_list+0x588/0xc60 [mac80211]
[ T1410]  ? swiotlb_tbl_map_single+0x5d6/0x6b0
[ T1410]  mt76_rx_complete+0x198/0x2e0 [mt76]
[ T1410]  ? swiotlb_map+0x96/0x260
[ T1410]  mt76_rx_poll_complete+0x373/0x570 [mt76]
[ T1410]  ? mt76_dma_rx_poll+0x25d/0x480 [mt76]
[ T1410]  mt76_dma_rx_poll+0x25d/0x480 [mt76]
[ T1410]  ? __napi_poll+0x1b0/0x1b0
[ T1410]  mt7921_poll_rx+0x4a/0xe0 [mt7921e]
[ T1410]  __napi_poll+0x29/0x1b0
[ T1410]  ? napi_threaded_poll+0x80/0x100
[ T1410]  napi_threaded_poll+0x9d/0x100
[ T1410]  kthread+0xd9/0x100
[ T1410]  ? kthread_complete_and_exit+0x20/0x20
[ T1410]  ret_from_fork+0x22/0x30
[ T1410]  </TASK>
[ T1410] ---[ end trace 0000000000000000 ]---


See here for full logs and the mentioned amdgpu driver issue:
https://gitlab.freedesktop.org/drm/amd/-/issues/2492#note_2043652
Comment 3 Alex Maras 2024-03-17 16:55:55 UTC
Created attachment 306000 [details]
dmesg with hibernation and failure to start wifi

I'm getting an ongoing issue with the same symptoms. Coming up from hibernate, wifi is completely dead and similar messages appear in dmesg.

Running `rmmod mt7921e` followed by `modprobe mt7921e` fixes it, except that sometimes `rmmod` refuses to finish and, more importantly, doesn't respond to `kill -9` and blocks reboot indefinitely.

I'll try to get a `dmesg` recording next time the `rmmod` fails to finish, but I don't know if it shows anything. 

Attached the dmesg for a failure coming up out of hibernate.
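
The manual reload described above could be sketched as a small script. This is only an illustration: `module_loaded` and `reload_wifi` are hypothetical names, and the sketch does nothing about the case where `rmmod` itself hangs.

```sh
#!/bin/sh
# Hypothetical helper: succeed if module $1 is listed in a
# /proc/modules-style file ($2, defaulting to /proc/modules).
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}"
}

# Unload the driver if it is currently loaded, then load it again.
# Needs root; returns non-zero if rmmod or modprobe fails.
reload_wifi() {
    if module_loaded mt7921e; then
        rmmod mt7921e || return 1
    fi
    modprobe mt7921e
}
```

After a failed resume, `reload_wifi` would be called from a root shell that has sourced these functions.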
Comment 4 Alex Maras 2024-03-25 14:03:03 UTC
Created attachment 306040 [details]
dmesg with failure to rmmod

Attaching a dmesg from when I tried to `rmmod` and it timed out and failed. This happens once every several times I do it.
Comment 5 Alex Maras 2024-03-25 14:04:53 UTC
Created attachment 306041 [details]
failure to reboot after rmmod stalls

And a photo of the output that eventually shows up when the rmmod gets stuck and I try rebooting.
Comment 6 hurricanepootis 2024-05-28 23:01:30 UTC
I'm having the same issue on my desktop PC as well. Ryzen 5 5600X, MSI x470 gaming pro, Radeon RX 6800, and MT7922. I get the exact same error after my device wakes up from suspend. However, it's not all of the time, but most of the time.

Running Arch Linux with Linux 6.9.2
Comment 7 kolAflash 2024-05-29 09:15:13 UTC
@hurricanepootis
mt7921e is a wireless network driver for the MT7922. But your mainboard doesn't seem to have a wireless network chip like the MT7922.
https://www.msi.com/Motherboard/X470-GAMING-PRO/Specification

Please share hardware details about your wireless network card. Do you use an extra PCI or USB wireless network card?

And can you share a dmesg log after waking from suspend?
  sudo dmesg
Thanks!




@Alex Maras
From your dmesg log I guess this is your computer. Please reply if this isn't correct.
[    0.000000] DMI: Framework Laptop 13 (AMD Ryzen 7040Series)/FRANMDCP07, BIOS 03.03 10/17/2023
Framework seems to have official Linux support.
https://knowledgebase.frame.work/en_us/can-i-install-linux-By6nAJ7td
You might ask the Framework support if they can help with this problem.
https://framework.kustomer.help/en_us/contact/support-request-ryon9uAuq

I searched the Framework forum.
https://community.frame.work/tag/linux
There are some Linux users having similar problems with the mt7921e driver.
https://community.frame.work/t/framework-13-amd-on-arch-issues-with-wireless-after-resume/44597
https://community.frame.work/t/responded-issues-on-arch-linux-with-rz616-on-framework-13-amd-7040-on-linux-kernel-6-5-7/38404
https://community.frame.work/t/tracking-unstable-and-unreliable-wlan-rz616-mt7922-fw13-amd-diy/40316
Placing new firmware into /lib/firmware/ seems to be an interesting idea discussed there.
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git
I didn't read fully through those. Maybe have a look if you can spare some time.
Comment 8 hurricanepootis 2024-05-29 15:43:01 UTC
@kolAflash, I am using a generic M.2 Key E (WiFi) to PCIe x1 adapter, with a cable connected to the motherboard's USB 2.0 header for the Bluetooth.

As far as I am aware, there is no actual chip handling any logic on the adapter, just splitting out what's needed from the m.2 slot.

As a side note, in the past I have used an Intel AX210 in that adapter and had no problems with suspending with it, and also briefly used a Mediatek Mt7921 and cannot recall if I did or did not have problems.
Comment 9 Haonan Chen 2024-06-11 11:09:56 UTC
Same issue, with Lenovo ARX8 (R9-7945HX). This is result of `lspci -kvv`:

```
04:00.0 Network controller: MEDIATEK Corp. MT7922 802.11ax PCI Express Wireless Network Adapter
Subsystem: Lenovo Device e0c6
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 123
IOMMU group: 1
Region 0: Memory at 3ffc02000000 (64-bit, prefetchable) [size=1M]
Region 2: Memory at d1700000 (64-bit, non-prefetchable) [size=32K]
Capabilities: [80] Express (v2) Endpoint, IntMsgNum 0
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75W TEE-IO-
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #1, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <2us, L1 <8us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM L1 Enabled; RCB 64 bytes, LnkDisable- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x1
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
AtomicOpsCtl: ReqEn-
IDOReq- IDOCompl- LTR+ EmergencyPowerReductionReq-
10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
LnkCap2: Supported Link Speeds: 2.5-5GT/s, Crosslink- Retimer- 2Retimers- DRS-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete- EqualizationPhase1-
EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [e0] MSI: Enable+ Count=1/32 Maskable+ 64bit+
Address: 00000000fee00000  Data: 0000
Masking: fffffffe  Pending: 00000000
Capabilities: [f8] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Kernel driver in use: mt7921e
Kernel modules: mt7921e
```
Comment 10 Haonan Chen 2024-06-11 11:29:48 UTC
Adding the kernel parameter `mt7921e.disable_aspm=y` worked around this issue on my machine. I am using Linux 6.9.
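
To check whether the parameter took effect, one option is to look at the `LnkCtl` line of `lspci -vv`, which in the output from comment 9 reads `ASPM L1 Enabled`. A small sketch follows; `aspm_state` is a hypothetical helper, and the `04:00.0` address is taken from that comment, so it will differ on other machines.

```sh
#!/bin/sh
# Hypothetical helper: print the ASPM state found on the LnkCtl line of
# `lspci -vv` output read from stdin, e.g. "L1 Enabled" or "Disabled".
aspm_state() {
    sed -n 's/.*LnkCtl:[[:space:]]*ASPM \([^;]*\);.*/\1/p'
}

# Example (root is usually needed to see the full capability list):
#   lspci -vv -s 04:00.0 | aspm_state
```

If the workaround is active, the expectation is that the reported state changes from `L1 Enabled` to a disabled state, but that has not been verified in this thread.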
Comment 11 hurricanepootis 2024-07-01 00:39:08 UTC
Adding 'mt7921e.disable_aspm=y' did not fix my issue, nor did upgrading my bios to AMD AGESA ComboAm4v2PI 1.2.0.A.
Comment 12 hurricanepootis 2024-07-05 04:46:00 UTC
I mitigated this issue by removing the module before putting my PC to sleep. I automated this by creating a script with the following:
```
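#!/usr/bin/env sh
# systemd-sleep hook: $1 is "pre" (before sleep) or "post" (after wake-up).
case "${1}" in
    pre)
        # Unload the driver before the system goes to sleep.
        echo "Removing mt7921e kernel module"
        rmmod mt7921e
        ;;
    post)
        # Load the driver again after resume.
        echo "Adding mt7921e kernel module"
        modprobe mt7921e
        ;;
esac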
```
And placing it at `/usr/lib/systemd/system-sleep/mt7921e-sleep-fix.sh`. What this script does is that whenever the system starts to suspend, systemd will execute anything in that folder with either `pre` or `post` as the first argument. It's a simple script tbh, you can read more about how it works on the man page for systemd-sleep.
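
According to the systemd-sleep man page, hooks also receive a second argument naming the operation (`suspend`, `hibernate`, `hybrid-sleep`, or `suspend-then-hibernate`). Below is a hedged variant of the script above that logs that argument and supports a dry run for testing without root. `handle_sleep_event`, `run`, `MODULE`, and `DRY_RUN` are illustrative names, not anything systemd defines.

```sh
#!/bin/sh
# Illustrative variant of the hook; MODULE and DRY_RUN are made-up knobs.
MODULE="${MODULE:-mt7921e}"

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# $1 is "pre" or "post"; $2 is the sleep operation passed by systemd-sleep.
handle_sleep_event() {
    case "$1" in
        pre)
            echo "Removing $MODULE before $2"
            run rmmod "$MODULE"
            ;;
        post)
            echo "Loading $MODULE after $2"
            run modprobe "$MODULE"
            ;;
    esac
}

# systemd-sleep supplies both arguments; with none given, nothing happens.
handle_sleep_event "${1:-}" "${2:-}"
```

As with the original script, the file has to be executable for systemd to run it. Running it by hand as `DRY_RUN=1 ./mt7921e-sleep-fix.sh pre suspend` prints the command instead of executing it.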
Comment 13 Siddh Raman Pant 2024-07-06 11:20:53 UTC
Created attachment 306537 [details]
dmesg failure

I faced the same issue on Debian 6.7.12-1. Attached is the error in dmesg.

Removing and loading the module `mt7921e` again fixes the issue.
Comment 14 piotrekkr@o2.pl 2024-08-07 12:20:51 UTC
Created attachment 306681 [details]
dmesg with error

I have the same issue with mt7921e on an Asus Vivobook, Linux Mint 21.3 and kernel 6.5.0-45-generic. It happens when waking up from sleep. Interestingly, it only breaks after being in sleep for longer than a few minutes. If I close the lid, wait a few minutes and then open it, it works. When I let it sleep for an hour or more, I see the issues. They also prevent OS shutdown: the screen goes blank on shutdown, but the laptop does not shut down completely, and I need to do a hard shutdown by pressing the power button for 10 s or more. The workaround with rmmod and modprobe in a systemd sleep hook seems to be working okay.
Comment 15 Artem S. Tashkinov 2024-08-12 16:03:59 UTC
Is this still reproducible under 6.10.4 when using the latest firmware files?
Comment 16 hurricanepootis 2024-08-13 03:58:57 UTC
I am on Kernel 6.10.4 and am now using linux-firmware-git 20240809.59460076 on Arch Linux. I have put my computer to sleep and back a few times now without the script I wrote (see above). I will need to use wifi more through the next few days, and I will also try on firmware 20240703.e94a2a3b (the current linux-firmware on arch) to see if the kernel and/or firmware has potentially fixed the issue.
Comment 17 Vadim 2024-08-23 13:33:46 UTC
I have the same problem in Ubuntu 24.04 (6.8.0-41).
Aspire A715-42G

Aug 23 15:44:16 hacker kernel: mt7921e 0000:04:00.0: Message 00020007 (seq 10) timeout
Aug 23 15:44:16 hacker kernel: mt7921e 0000:04:00.0: PM: dpm_run_callback(): pci_pm_resume+0x0/0x110 returns -110
Aug 23 15:44:16 hacker kernel: mt7921e 0000:04:00.0: PM: failed to resume async: error -110
Comment 18 Vadim 2024-08-24 05:07:08 UTC
I installed a new kernel via mainline kernels

Kernel: 6.10.6-061006-generic
OS: Ubuntu 24.04 LTS x86_64 

Aug 24 08:02:39 hacker kernel: mt7921e 0000:04:00.0: Message 00020007 (seq 3) timeout
Aug 24 08:02:39 hacker kernel: mt7921e 0000:04:00.0: PM: dpm_run_callback(): pci_pm_resume returns -110
Aug 24 08:02:39 hacker kernel: mt7921e 0000:04:00.0: PM: failed to resume async: error -110

Conclusion: the new kernel did not fix this
Comment 19 Pavel Petrovic 2024-09-10 16:16:21 UTC
Confirming this bug also affects the ASUS Vivobook 16X K3605ZU with the latest Ubuntu 24.04.1. The messages in the syslog are a bit different, but it is likely the same or a closely related problem. The Ubuntu apport tool uploaded all the logs here:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2059744

I have also compiled the same version of kernel 6.8.0-41-generic from sources from apt packages using the usual guide
  https://wiki.ubuntu.com/Kernel/BuildYourOwnKernel
(which required a lot of additional packages as well)

and it behaves the same. It is now ready for trying patches or changing kernel config options, if you have any suggestions.

I also tried newer kernels through mainline. Version 6.9.12-060912-generic behaves the same (no wifi after wakeup from suspend), and the 6.10.* kernels fail to build the NVIDIA module on this system; even after attempts to fix that, I could not get such a kernel to work for testing. I would need some more help to try those.

Would you have any suggestions what to try?
Comment 20 Pavel Petrovic 2024-09-14 15:07:32 UTC
To add some more observations.
It is not only the wifi, it is also sound. And the failure (as you may have noticed in the logs) is in the 

0000:00:1c.0 PCI bridge: Intel Corporation Alder Lake PCH-P PCI Express Root Port #9 (rev 01)

before the attempt to wake up wifi.

syslog notices:

pcieport 0000:00:1c.0: broken device, retraining non-functional downstream link at 2.5GT/s

and then

pcieport 0000:00:1c.0: retraining failed

exactly before the first problem with the mt7921:

0000:00:1c.0 PCI bridge: Intel Corporation Alder Lake PCH-P PCI Express Root Port #9 (rev 01)
Comment 21 Pavel Petrovic 2024-09-14 15:09:12 UTC
(the last line should have been)

pci 0000:2c:00.0: not ready 1023ms after resume;
Comment 22 Pavel Petrovic 2024-09-17 15:20:39 UTC
A new kernel 6.8.0-45 arrived in Ubuntu 24.04.1 today.
But it is even worse than before.
After wakeup from suspend, even usb tethering does not work properly,
which used to work with v41, and system freezes quite easily.

Here is a syslog from the suspending part:
  https://capek.ii.fmph.uniba.sk/suspending-syslog

and from the waking up after suspend:
  https://capek.ii.fmph.uniba.sk/wakeup-syslog
Comment 23 Vadim 2024-09-17 15:30:59 UTC
Pavel, I'm not a developer, maybe you should buy an Intel WiFi card 🤔 I read that some laptops have the ability to change the Wi-Fi card.
Comment 24 Pavel Petrovic 2024-09-17 17:30:13 UTC
(In reply to Vadim from comment #23)
> Pavel, I'm not a developer, maybe you should buy an Intel WiFi card 🤔 I
> read that some laptops have the ability to change the Wi-Fi card.

Thanks, but it is easier to turn off automatic suspend in the settings, and be careful not to leave the PC on unattended, it's not that terrible, but it would be nice to have it fixed, hopefully some expert on PCI will take a look eventually. I do not know how much it is related to the network card, or the PCI bus itself.
Comment 25 Pavel Petrovic 2024-10-21 11:59:25 UTC
A new kernel 6.8.0-47 arrived in Ubuntu 24.04.1 today.
Still similar story.
  https://capek.ii.fmph.uniba.sk/wakeup-syslog-47

Suspend ruins the system.
It is annoying to turn off the computer completely all the time when moving somewhere. Having it ON while in the bag is bad, it could overheat, no air to cool...
I hope someone will fix this, please.
Comment 26 Mario Limonciello (AMD) 2024-10-21 13:12:19 UTC
For distro kernels please report issues to distros and please keep the focus on upstream bugs on upstream kernels.
Comment 27 Pavel Petrovic 2024-10-21 21:13:12 UTC
Thank you Mario for your suggestion.
But well, we were just sent here from there.
It is likely an active bug in upstream kernel as well.
If you or someone else can give clear instructions how to test such upstream kernel on this Ubuntu machine and report its details, I will be glad to do that.

As mentioned above, when I tried mainline, 6.10.* failed to build NVIDIA module on this system.
Comment 28 Pavel Petrovic 2024-11-02 21:06:01 UTC
New Ubuntu kernel 6.8.0-48, same result.
Kernel crashes, machine does not even reboot after wakeup.
  https://capek.ii.fmph.uniba.sk/6.8.0-48-wakeup-from-suspend.txt

2024-11-02T21:48:58.952069+01:00 buchlovice kernel: pcieport 0000:00:1c.0: broken device, retraining non-functional downstream link at 2.5GT/s
2024-11-02T21:48:58.952070+01:00 buchlovice kernel: pcieport 0000:00:1c.0: retraining failed
2024-11-02T21:48:58.952072+01:00 buchlovice kernel: pcieport 0000:00:1c.0: broken device, retraining non-functional downstream link at 2.5GT/s
2024-11-02T21:48:58.952074+01:00 buchlovice kernel: pcieport 0000:00:1c.0: retraining failed
2024-11-02T21:48:58.952075+01:00 buchlovice kernel: mt7921e 0000:2c:00.0: not ready 1023ms after resume; waiting
2024-11-02T21:48:58.952077+01:00 buchlovice kernel: mt7921e 0000:2c:00.0: not ready 2047ms after resume; waiting
2024-11-02T21:48:58.952079+01:00 buchlovice kernel: mt7921e 0000:2c:00.0: not ready 4095ms after resume; waiting
2024-11-02T21:48:58.952081+01:00 buchlovice kernel: mt7921e 0000:2c:00.0: not ready 8191ms after resume; waiting
2024-11-02T21:48:58.952083+01:00 buchlovice kernel: mt7921e 0000:2c:00.0: not ready 16383ms after resume; waiting
2024-11-02T21:48:58.952085+01:00 buchlovice kernel: mt7921e 0000:2c:00.0: not ready 32767ms after resume; waiting
2024-11-02T21:48:58.952088+01:00 buchlovice kernel: mt7921e 0000:2c:00.0: not ready 65535ms after resume; giving up
2024-11-02T21:48:58.952089+01:00 buchlovice kernel: mt7921e 0000:2c:00.0: Unable to change power state from D3cold to D0, device inaccessible
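
The wait times in these messages follow a doubling pattern (each value is 2^n - 1 ms) and end at 65535 ms, so the device was still not ready about 65 seconds after resume when the kernel gave up. A small sketch that reproduces the logged values is shown here. It only illustrates the pattern visible in the log and is not taken from the kernel source; `backoff_delays` is a made-up name.

```sh
#!/bin/sh
# Print the delays seen in the "not ready ...ms after resume" messages:
# start at 1024 ms, double at each step, report (delay - 1), stop at 65535.
backoff_delays() {
    delay=1024
    while [ "$delay" -le 65536 ]; do
        printf '%s\n' "$((delay - 1))"
        delay=$((delay * 2))
    done
}

backoff_delays
```

The seven printed values correspond to the seven "not ready" lines in the log above.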
Comment 29 Pavel Petrovic 2024-11-21 15:36:06 UTC
New Ubuntu kernel 6.8.0-49, same result.
Kernel crashes, machine does not even reboot after wakeup.

2024-11-21T16:27:34.851605+01:00 buchlovice kernel: Freezing user space processes
2024-11-21T16:27:34.851692+01:00 buchlovice kernel: Freezing user space processes completed (elapsed 0.001 seconds)
2024-11-21T16:27:34.851698+01:00 buchlovice kernel: OOM killer disabled.
2024-11-21T16:27:34.851700+01:00 buchlovice kernel: Freezing remaining freezable tasks
2024-11-21T16:27:34.851702+01:00 buchlovice kernel: Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
2024-11-21T16:27:34.851703+01:00 buchlovice kernel: printk: Suspending console(s) (use no_console_suspend to debug)
2024-11-21T16:27:34.851705+01:00 buchlovice kernel: ACPI: EC: interrupt blocked
2024-11-21T16:27:34.851706+01:00 buchlovice kernel: ACPI: EC: interrupt unblocked
2024-11-21T16:27:34.851707+01:00 buchlovice kernel: pcieport 0000:00:1c.0: broken device, retraining non-functional downstream link at 2.5GT/s
2024-11-21T16:27:34.851708+01:00 buchlovice kernel: pcieport 0000:00:1c.0: retraining failed
2024-11-21T16:27:34.851709+01:00 buchlovice kernel: pcieport 0000:00:1c.0: broken device, retraining non-functional downstream link at 2.5GT/s
2024-11-21T16:27:34.851710+01:00 buchlovice kernel: pcieport 0000:00:1c.0: retraining failed
2024-11-21T16:27:34.851711+01:00 buchlovice kernel: mt7921e 0000:2c:00.0: not ready 1023ms after resume; waiting
2024-11-21T16:27:34.851712+01:00 buchlovice kernel: mt7921e 0000:2c:00.0: not ready 2047ms after resume; waiting
2024-11-21T16:27:34.851714+01:00 buchlovice kernel: mt7921e 0000:2c:00.0: not ready 4095ms after resume; waiting
2024-11-21T16:27:34.851715+01:00 buchlovice kernel: mt7921e 0000:2c:00.0: not ready 8191ms after resume; waiting
2024-11-21T16:27:34.851716+01:00 buchlovice kernel: mt7921e 0000:2c:00.0: not ready 16383ms after resume; waiting
2024-11-21T16:27:34.851717+01:00 buchlovice kernel: mt7921e 0000:2c:00.0: not ready 32767ms after resume; waiting
2024-11-21T16:27:34.851718+01:00 buchlovice kernel: mt7921e 0000:2c:00.0: not ready 65535ms after resume; giving up
2024-11-21T16:27:34.851718+01:00 buchlovice kernel: mt7921e 0000:2c:00.0: Unable to change power state from D3cold to D0, device inaccessible
2024-11-21T16:27:34.851720+01:00 buchlovice kernel: pcieport 10000:e0:06.0: can't derive routing for PCI INT A
2024-11-21T16:27:34.851720+01:00 buchlovice kernel: nvme 10000:e1:00.0: PCI INT A: no GSI
2024-11-21T16:27:34.851721+01:00 buchlovice kernel: i915 0000:00:02.0: [drm] GT0: GuC firmware i915/adlp_guc_70.bin version 70.20.0
2024-11-21T16:27:34.851722+01:00 buchlovice kernel: i915 0000:00:02.0: [drm] GT0: HuC firmware i915/tgl_huc.bin version 7.9.3
2024-11-21T16:27:34.851723+01:00 buchlovice kernel: i915 0000:00:02.0: [drm] GT0: HuC: authenticated for all workloads
2024-11-21T16:27:34.851724+01:00 buchlovice kernel: i915 0000:00:02.0: [drm] GT0: GUC: submission enabled
2024-11-21T16:27:34.851726+01:00 buchlovice kernel: i915 0000:00:02.0: [drm] GT0: GUC: SLPC enabled
2024-11-21T16:27:34.851727+01:00 buchlovice kernel: i915 0000:00:02.0: [drm] GT0: GUC: RC enabled
2024-11-21T16:27:34.851728+01:00 buchlovice kernel: nvme nvme0: 16/0/0 default/read/poll queues
2024-11-21T16:27:34.851729+01:00 buchlovice kernel: mt7921e 0000:2c:00.0: driver own failed
2024-11-21T16:27:34.851730+01:00 buchlovice kernel: mt7921e 0000:2c:00.0: PM: dpm_run_callback(): pci_pm_resume+0x0/0x110 returns -5
2024-11-21T16:27:34.851731+01:00 buchlovice kernel: mt7921e 0000:2c:00.0: PM: failed to resume async: error -5
Comment 30 Mario Limonciello (AMD) 2024-11-21 15:45:43 UTC
> It is likely an active bug in upstream kernel as well.

It very well may be, but we don't know for sure until the upstream kernel has been tested.  6.8 is end of life upstream and won't be picking up any new patches or actively looked at.

As I said in comment #26, please keep distro kernel bugs in distro trackers. TBH from the logs that have been shown on distro kernel this "could" be an issue in PCIe core or firmware not even in Mediatek driver.

6.12 is likely to be declared the next LTS kernel, this would be a good place to test.
Comment 31 Pavel Petrovic 2024-11-21 16:33:08 UTC
(In reply to Mario Limonciello (AMD) from comment #30)
> > It is likely an active bug in upstream kernel as well.
> 
> It very well may be, but we don't know for sure until the upstream kernel
> has been tested.  6.8 is end of life upstream and won't be picking up any
> new patches or actively looked at.
> 
> As I said in comment #26, please keep distro kernel bugs in distro trackers.
> TBH from the logs that have been shown on distro kernel this "could" be an
> issue in PCIe core or firmware not even in Mediatek driver.
> 
> 6.12 is likely to be declared the next LTS kernel, this would be a good
> place to test.


Thank you, could you, please, point me to a tutorial that will explain
how to create a bootable ISO with some distribution and an upstream kernel?

Otherwise I do not see how a regular Linux user should respond
to your rant. As I mentioned above, I tried, but did not get far.
Comment 33 Mario Limonciello (AMD) 2024-11-21 16:35:01 UTC
And https://docs.fedoraproject.org/en-US/fedora/latest/preparing-boot-media/
Comment 34 Pavel Petrovic 2024-11-21 18:33:54 UTC
Many thanks, Mario. I have downloaded the ISO you linked, written it to a bootable medium, started Fedora from the medium, issued Suspend, it suspended fine (power LED went off).
Pressing a key started the wakeup process (power LED went on), display immediately showed the latest contents of the screen, but the computer froze. Live medium, so no logs, therefore I shrank my Ubuntu partition by 20GB, installed Fedora on the main disk, booted from the disk, made sure the system is updated, and clicked Suspend. Power LED went off.
After a key press, power LED went on, but nothing came up for more than 5 minutes, even the display was dark.

journalctl does not contain a single line between the suspend and next boot after hard power OFF and ON.

It is saved here:

https://capek.ii.fmph.uniba.sk/6.13.0-0.rc0.20241119git158f238aa69d.2.fc42.x86_64-no-wakeup-after-suspend.txt

It contains the last two boots - one that ends with suspend and the following one after hard power-off. 

If there is anything to try or other logs to provide, please, let me know.
Comment 35 Mario Limonciello (AMD) 2024-11-21 20:26:50 UTC
At this point I can confidently say that we're looking at different issues from you and the original reporter (@kolAflash).

I think it would be best to split up your issue into a "few" bugs to get the attention of the right people for each component where I see a problem.

Let me pull a few things from your log to show you what I mean.

> Nov 21 20:09:42 fedora kernel: pcieport 10000:e0:06.0: can't derive routing
> for PCI INT A
> Nov 21 20:09:42 fedora kernel: nvme 10000:e1:00.0: PCI INT A: not connected

There is some problem with what /appears/ to be interrupt handling for your NVME disk.  FWIW this might not be crucial.
On AMD platforms we had a similar problem with the IOMMU showing this.
I dug into it and confirmed it was a false positive and it's sorted on AMD with this (that won't do anything for Intel).
https://github.com/torvalds/linux/commit/0feda94c868d396fac3b3cb14089d2d989a07c72

It would be best to have Intel guys confirm if that's a problem or not.  However...

> Nov 21 19:59:25 fedora kernel: WARNING: CPU: 15 PID: 11 at
> mm/page_alloc.c:4727 __alloc_pages_noprof+0x2ca/0x330
> Nov 21 19:59:25 fedora kernel: Modules linked in: nvme nvme_core nvme_auth i915(+) nouveau(+) mxm_wmi drm_ttm_helper gpu_sched drm_gpuvm drm_exec i2c_algo_bit drm_buddy crct10dif_pclmul crc32_pclmul ttm crc32c_intel polyval_clmulni ucsi_acpi polyval_generic hid_multitouch drm_display_helper ghash_clmulni_intel sha512_ssse3 typec_ucsi sha256_ssse3 sha1_ssse3 cec typec vmd i2c_hid_acpi video i2c_hid wmi pinctrl_tigerlake serio_raw fuse

There is a page allocation failure right after this, so NVME HMB /might/ not have gotten set up properly.

> Nov 21 18:59:55 fedora kernel: nouveau 0000:01:00.0: gsp: rc engn:00000001
> chid:0 type:45 scope:1 part:233
> Nov 21 18:59:55 fedora kernel: nouveau 0000:01:00.0:
> fifo:c00000:0000:0000:[(udev-worker)[542]] errored - disabling channel
> Nov 21 18:59:55 fedora kernel: nouveau 0000:01:00.0: DRM: channel 0 killed!
> Nov 21 18:59:55 fedora kernel: nouveau 0000:01:00.0: gsp: rc engn:00000001
> chid:8 type:45 scope:1 part:233
> Nov 21 18:59:55 fedora kernel: nouveau 0000:01:00.0:
> fifo:c00400:0001:0008:[(udev-worker)[542]] errored - disabling channel
> Nov 21 18:59:55 fedora kernel: nouveau 0000:01:00.0: DRM: channel 8 killed!
> Nov 21 18:59:55 fedora kernel: nouveau 0000:01:00.0: gsp:msg fn:103
> len:0x78/0x58 res:0x62 resp:0x62
> Nov 21 18:59:55 fedora kernel: msg: 00000000: 03 00 d0 c1 03 00 d0 c1 00 00
> 1d de 80 00 00 00  ................
> Nov 21 18:59:55 fedora kernel: msg: 00000010: 62 00 00 00 38 00 00 00 00 00
> 00 00 00 00 00 00  b...8...........
> Nov 21 18:59:55 fedora kernel: msg: 00000020: 00 00 00 00 03 00 d0 c1 00 00
> 00 00 00 00 00 00  ................
> Nov 21 18:59:55 fedora kernel: msg: 00000030: 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00  ................
> Nov 21 18:59:55 fedora kernel: msg: 00000040: 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00  ................
> Nov 21 18:59:55 fedora kernel: msg: 00000050: 00 00 00 00 00 00 00 00        
>                  ........
> Nov 21 18:59:55 fedora kernel: nouveau 0000:01:00.0: systemd-logind[1123]:
> VMM allocation failed: -22
> Nov 21 18:59:55 fedora kernel: nouveau 0000:01:00.0: gsp: rc engn:00000001
> chid:0 type:45 scope:1 part:233
> Nov 21 18:59:55 fedora kernel: nouveau 0000:01:00.0: gsp: rc engn:00000001
> chid:8 type:45 scope:1 part:233
> Nov 21 18:59:55 fedora gnome-shell[2393]: Failed to open gpu
> '/dev/dri/card0': GDBus.Error:org.freedesktop.DBus.Error.InvalidArgs: Invalid
> argument
> Nov 21 18:59:55 fedora kernel: nouveau 0000:01:00.0: gsp:msg fn:103
> len:0x78/0x58 res:0x62 resp:0x62
> Nov 21 18:59:55 fedora kernel: msg: 00000000: 03 00 d0 c1 03 00 d0 c1 00 00
> 1d de 80 00 00 00  ................
> Nov 21 18:59:55 fedora kernel: msg: 00000010: 62 00 00 00 38 00 00 00 00 00
> 00 00 00 00 00 00  b...8...........
> Nov 21 18:59:55 fedora kernel: msg: 00000020: 00 00 00 00 03 00 d0 c1 00 00
> 00 00 00 00 00 00  ................
> Nov 21 18:59:55 fedora kernel: msg: 00000030: 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00  ................
> Nov 21 18:59:55 fedora kernel: msg: 00000040: 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00  ................
> Nov 21 18:59:55 fedora kernel: msg: 00000050: 00 00 00 00 00 00 00 00        
>                  ........
> Nov 21 18:59:55 fedora kernel: nouveau 0000:01:00.0: systemd-logind[1123]:
> VMM allocation failed: -22
> Nov 21 18:59:55 fedora kernel: nouveau 0000:01:00.0: gsp: rc engn:00000001
> chid:0 type:45 scope:1 part:233
> Nov 21 18:59:55 fedora kernel: nouveau 0000:01:00.0: gsp: rc engn:00000001
> chid:8 type:45 scope:1 part:233

Nouveau seems to be misbehaving here.  This should be a bug filed against 
https://gitlab.freedesktop.org/drm/nouveau/-/issues

> Nov 21 19:03:21 fedora kernel: PM: suspend entry (s2idle)
> Nov 21 19:03:21 fedora kernel: Filesystems sync: 0.014 seconds
> -- Boot a07ba43e15ca4a569cfb9dbcd7512e47 --

Hanging on s2idle needs to be triaged by the Intel s2idle triage script:

https://github.com/intel/S0ixSelftestTool

I suggest you run that and then open up another bug for Intel guys to look at the results.
