Bug 218982
Summary: | 6.10 regression: System doesn't resume from standby (with bisect) | ||
---|---|---|---|
Product: | ACPI | Reporter: | Saverio Miroddi (saverio.pub2) |
Component: | Power-Sleep-Wake | Assignee: | acpi_power-sleep-wake |
Status: | NEW --- | ||
Severity: | normal | CC: | gwk2112, h, mario.limonciello, vasant.hegde |
Priority: | P3 | ||
Hardware: | AMD | ||
OS: | Linux | ||
Kernel Version: | Subsystem: | ||
Regression: | No | Bisected commit-id: | |
Attachments: |
dmesg output
lspci -vvv output dmesg after aborted suspend with iommu=off |
Description
Saverio Miroddi
2024-06-24 13:21:45 UTC
Sorry. I broke the suspend/resume path :-( Thanks a lot for reporting and bisecting. Similar bug was reported last week [1]. I have posted patch to fix this issue [2]. Can you please test this patch? [1] https://bugzilla.kernel.org/show_bug.cgi?id=218975 [2] https://lore.kernel.org/linux-iommu/20240621101533.20216-1-vasant.hegde@amd.com/ -Vasant (In reply to Vasant Hegde from comment #1) > Sorry. I broke the suspend/resume path :-( Thanks a lot for reporting and > bisecting. > > Similar bug was reported last week [1]. I have posted patch to fix this > issue [2]. Can you please test this patch? > > [1] https://bugzilla.kernel.org/show_bug.cgi?id=218975 > > [2] > https://lore.kernel.org/linux-iommu/20240621101533.20216-1-vasant.hegde@amd. > com/ Thanks! I've tested one suspend/resume cycle, and the patch fixed it :) (In reply to Saverio Miroddi from comment #2) > (In reply to Vasant Hegde from comment #1) > > Sorry. I broke the suspend/resume path :-( Thanks a lot for reporting and > > bisecting. > > > > Similar bug was reported last week [1]. I have posted patch to fix this > > issue [2]. Can you please test this patch? > > > > [1] https://bugzilla.kernel.org/show_bug.cgi?id=218975 > > > > [2] > > > https://lore.kernel.org/linux-iommu/20240621101533.20216-1-vasant.hegde@amd. > > com/ > > Thanks! I've tested one suspend/resume cycle, and the patch fixed it :) Thanks! Joerg queued patch to fix branch. Hopefully it will go to upstream soon. -Vasant I've been having the same problem as Saverio Miroddi and found the same commit with git bisect. I've applied the above patch (using linux-git in Arch) but it doesn't fix it for me, so I've reverted to linux-lts for now. My motherboard is an MSI MAG X670E TOMAHAWK WIFI with a Ryzen 7700X. Hi Tony, (In reply to Tony Houghton from comment #4) > I've been having the same problem as Saverio Miroddi and found the same > commit with git bisect. I've applied the above patch (using linux-git in > Arch) but it doesn't fix it for me, so I've reverted to linux-lts for now. > My motherboard is an MSI MAG X670E TOMAHAWK WIFI with a Ryzen 7700X. You mean latest upstream is not working for you? -Vasant Yes, I've double-checked that what I've built and installed includes that call to iommu_enable_gt(). It's still failing to resume for me. > I've been having the same problem as Saverio Miroddi and found the same
> commit with git bisect. I've applied the above patch (using linux-git in
> Arch) but it doesn't fix it for me, so I've reverted to linux-lts for now. My
> motherboard is an MSI MAG X670E TOMAHAWK WIFI with a Ryzen 7700X.
Not sure if related in one way or another, but my experience with the MSI MAG Mortar has been very poor.
The resume from standby has been stable only on the first firmware release, then subsequent firmware versions broke resume - approximately 20% of the time, system wouldn't resume.
Not sure if also related, but the first firmware, where resume worked, had EXPO broken. Subsequent firmware versions had EXPO working.
I'm moved to another brand since them, and didn't experience any resume problem.
I have had suspend-related problems with an MSI board in the past and RMAd it. I thought it was a one-off because MSI don't seem to have a bad reputation. This one has been fine until recently. It's commit 8e0179733172 that seems to have triggered the trouble with this board, and the subsequent modification from <https://lore.kernel.org/linux-iommu/20240621101533.20216-1-vasant.hegde@amd.com/> doesn't fix it for me. I did look into these code path again and I don't find anything suspicious. (I am on v6.12-rc5). In resume path it should call : amd_iommu_resume -> enable_iommus -> early_enable_iommus -> early_enable_iommu -> iommu_enable_gt(). Can you provide full dmesg and `lspci -vvv` output ? Also see if issues goes away if you disable IOMMU (iommu=off) -Vasant Created attachment 307106 [details]
dmesg output
Created attachment 307107 [details]
lspci -vvv output
I've attached the dmesg and lspci outputs. I tried iommu=off and that gave me new problem: every time I try to suspend, the PC resumes spontaneously about a second after the screens turn off. I would try disabling IOMMU in the BIOS, but I couldn't find such an option (nor for SVM, which seemed like it could be related when I Googled for the IOMMU setting). Thanks for the dmesg Tony. Looking into the dmesg : [ 0.416070] AMD-Vi: Extended features (0x246577efa2254afa, 0x0): PPR NX GT [5] IA GA PC GA_vAPIC [ 0.416077] AMD-Vi: Interrupt remapping enabled [ 0.464738] AMD-Vi: Virtual APIC enabled It has GT support and looking into code path again I can confirm GT feature gets enabled in resume path. Only other noticeable difference is X2APIC support. Looks like x2apic is not enabled in BIOS. But driver will use MSI interrupt and it should work fine. (I am referring to iommu_init_irq() code path). By any chance do you see any message on console once you resume the system? Just to double check, can you try out below change? This is queued for v6.13 merge window. Basically this patch will skip copy_device_table() code path in resume path (This is to rule out the resume code taking wrong path in copy_device_table() function). -Vasant commit 3f6eeada6930056c38f20964120ac402cf0030b9 Author: Vasant Hegde <vasant.hegde@amd.com> Date: Wed Oct 16 08:49:58 2024 +0000 iommu/amd: Do not try copy old DTE resume path In suspend/resume path, no need to copy old DTE (early_enable_iommus()). Just need to reload IOMMU hardware. This is the side effect of commit 3ac3e5ee5ed5 ("iommu/amd: Copy old trans table from old kernel") which changed early_enable_iommus() but missed to fix enable_iommus(). Resume path continue to work as 'amd_iommu_pre_enabled' is set to false and copy_device_table() will fail. It will just re-loaded IOMMU. Hence I think we don't need to backport this to stable tree. Signed-off-by: Vasant Hegde <vasant.hegde@amd.com> Link: https://lore.kernel.org/r/20241016084958.99727-1-vasant.hegde@amd.com Signed-off-by: Joerg Roedel <jroedel@suse.de> diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c index 43131c3a2172..3fa70169aace 100644 --- a/drivers/iommu/amd/init.c +++ b/drivers/iommu/amd/init.c @@ -2882,11 +2882,6 @@ static void enable_iommus_vapic(void) #endif } -static void enable_iommus(void) -{ - early_enable_iommus(); -} - static void disable_iommus(void) { struct amd_iommu *iommu; @@ -2913,7 +2908,8 @@ static void amd_iommu_resume(void) iommu_apply_resume_quirks(iommu); /* re-load the hardware */ - enable_iommus(); + for_each_iommu(iommu) + early_enable_iommu(iommu); amd_iommu_enable_interrupts(); } (In reply to Tony Houghton from comment #12) > I've attached the dmesg and lspci outputs. I tried iommu=off and that gave > me new problem: every time I try to suspend, the PC resumes spontaneously > about a second after the screens turn off. I am not sure what's going on here. > I would try disabling IOMMU in > the BIOS, but I couldn't find such an option (nor for SVM, which seemed like > it could be related when I Googled for the IOMMU setting). I am not sure about vendor BIOS. Ideally there should be an option to disable IOMMU (check for IOMMU or DMAr or AMD-Vi in BIOS). -Vasant >> every time I try to suspend, the PC resumes spontaneously
>> about a second after the screens turn off.
> I am not sure what's going on here.
I suggest turning on /sys/power/pm_debug_messages and then sharing the dmesg. This should capture any interrupt or GPIO that was causing an aborted suspend.
Created attachment 307197 [details]
dmesg after aborted suspend with iommu=off
Comment on attachment 307197 [details]
dmesg after aborted suspend with iommu=off
[ 194.510735] mt7921e 0000:0f:00.0: PM: pci_pm_suspend(): mt7921_pci_suspend [mt7921e] returns -12
[ 194.510740] mt7921e 0000:0f:00.0: PM: dpm_run_callback(): pci_pm_suspend returns -12
[ 194.510744] mt7921e 0000:0f:00.0: PM: failed to suspend async: error -12
Issue occurs in wifi driver. Do you have a lot running at this time? -12 is out of memory.
(In reply to Vasant Hegde from comment #13) > Thanks for the dmesg Tony. Thank you for investigating and helping; sorry about the delay in replying. > Only other noticeable difference is X2APIC support. Looks like x2apic is not > enabled in BIOS. But driver will use MSI interrupt and it should work fine. > (I am referring to iommu_init_irq() code path). Is that likely to be significant? I Googled x2apic, and apparently it doesn't do much besides provide support for more than 256 CPU cores. Again, I can't find a BIOS option for this, but I did eventually find one to disable IOMMU (see below). > By any chance do you see any message on console once you resume the system? No, the display output stays off, and so does networking, so I can't see anything. All that happens is the fans and LEDs etc come on, but the system is completely unresponsive. My keyboard lights up too, but the caps lock LED doesn't toggle when I press the caps lock key. > Just to double check, can you try out below change? This is queued for > v6.13 merge window. Basically this patch will skip copy_device_table() code > path in resume path (This is to rule out the resume code taking wrong path > in copy_device_table() function). ... > commit 3f6eeada6930056c38f20964120ac402cf0030b9 I've added that patch, and it doesn't make a difference that I can tell. I see that de111f6b4f6a3010020825d22a068f416bc29c95, which causes the issue for me, is also very simple. AIUI, de111f6b4f6a removes the calls to iommu_enable_gt from early_enable_iommu and early_enable_iommus and calls iommu_enable_gt from iommu_init_pci instead. Then the patch from <https://lore.kernel.org/linux-iommu/20240621101533.20216-1-vasant.hegde@amd.com/> restored iommu_enable_gt to early_enable_iommu but didn't affect the other two locations. So that makes me think calling iommu_enable_gt from iommu_init_pci is what's upsetting my board. (In reply to Mario Limonciello (AMD) from comment #17) > Issue occurs in wifi driver. Do you have a lot running at this time? -12 is > out of memory. I thought I already posted a message after attaching this new dmesg output, but it seems to have got lost :-(. I also noticed that the error appeared to be from the wifi driver. It only occurs when I add iommu=off to the kernel command line, and as well as causing the suspend to abort, the driver fails to initialise, which I didn't notice until now, because I use ethernet by default. Should I report this issue separately against the mt7921e driver? It happens even with nothing running except GNOME in 32GB of RAM, so if it really is out of memory, maybe it's getting stuck in some loop/recursion that keeps trying to allocate more? If I disable the wifi in the BIOS, it fixes this suspend issue, but the resume still fails, so it seems iommu=off doesn't fix the original issue. I also found the BIOS option to disable IOMMU, but that doesn't make any difference to either issue. Hi Tony, (In reply to Tony Houghton from comment #18) > (In reply to Vasant Hegde from comment #13) > > Thanks for the dmesg Tony. > > Thank you for investigating and helping; sorry about the delay in replying. > > > Only other noticeable difference is X2APIC support. Looks like x2apic is > not > > enabled in BIOS. But driver will use MSI interrupt and it should work fine. > > (I am referring to iommu_init_irq() code path). > > Is that likely to be significant? I Googled x2apic, and apparently it > doesn't do much besides provide support for more than 256 CPU cores. Again, > I can't find a BIOS option for this, but I did eventually find one to > disable IOMMU (see below). Its not significant. Even with APIC mode it should work fine. > > > By any chance do you see any message on console once you resume the system? > > No, the display output stays off, and so does networking, so I can't see > anything. All that happens is the fans and LEDs etc come on, but the system > is completely unresponsive. My keyboard lights up too, but the caps lock LED > doesn't toggle when I press the caps lock key. > > > Just to double check, can you try out below change? This is queued for > > v6.13 merge window. Basically this patch will skip copy_device_table() code > > path in resume path (This is to rule out the resume code taking wrong path > > in copy_device_table() function). > ... > > commit 3f6eeada6930056c38f20964120ac402cf0030b9 > > I've added that patch, and it doesn't make a difference that I can tell. Thanks for testing. > > I see that de111f6b4f6a3010020825d22a068f416bc29c95, which causes the issue > for me, is also very simple. AIUI, de111f6b4f6a removes the calls to > iommu_enable_gt from early_enable_iommu and early_enable_iommus and calls > iommu_enable_gt from iommu_init_pci instead. Then the patch from > <https://lore.kernel.org/linux-iommu/20240621101533.20216-1-vasant.hegde@amd. > com/> restored iommu_enable_gt to early_enable_iommu but didn't affect the > other two locations. So that makes me think calling iommu_enable_gt from > iommu_init_pci is what's upsetting my board. Right. That's commit 150bdf5f8d. I don't think iommu_init_pci() path gets called in resume path. Only resume callback handler gets called. -Vasant Hi, (In reply to Tony Houghton from comment #19) > (In reply to Mario Limonciello (AMD) from comment #17) > > > If I disable the wifi in the BIOS, it fixes this suspend issue, but the > resume still fails, so it seems iommu=off doesn't fix the original issue. I > also found the BIOS option to disable IOMMU, but that doesn't make any > difference to either issue. If I read it correctly, suspend/resume doesn't work even if IOMMU is disabled in BIOS. is that correct? Then the problem is in some other place, not in IOMMU subsystem. -Vasant I've done some more testing etc, and now it seems that de111f6b4f6a works correctly after all. I'm sure I confirmed it before, so maybe changing the BIOS settings and/or flashing a new version made some difference. But I still have the original problem with more recent kernels. I've done another git bisect and found a new suspect: first bad commit: [ceac1cb0259de682d78f5c784ef8e0b13022e9d9] Bluetooth: btusb: mediatek: add ISO data transmission functions I'll do some more tests on that and its immediate predecessor, but it seems more than a coincidence that it's related to the same piece of hardware that caused the other issue (comments #12, #17 etc). When you're doing your bisect you should explicitly look for any DMA mapping errors that are introduced with IOMMU disabled. BTW - please don't test with "iommu=off". To isolate AMD IOMMU specific issue you should use "amd_iommu=off". This will at least allow the use of software bounce buffers, which might have been the problem induced in Mediatek driver by running "iommu=off". |