Bug 218982

Summary: 6.10 regression: System doesn't resume from standby (with bisect)
Product: ACPI Reporter: Saverio Miroddi (saverio.pub2)
Component: Power-Sleep-WakeAssignee: acpi_power-sleep-wake
Status: NEW ---    
Severity: normal CC: gwk2112, h, mario.limonciello, vasant.hegde
Priority: P3    
Hardware: AMD   
OS: Linux   
Kernel Version: Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg output
lspci -vvv output
dmesg after aborted suspend with iommu=off

Description Saverio Miroddi 2024-06-24 13:21:45 UTC
After upgrading my system from 6.9 to 6.10-rcX (currently, rc5), my system doesn't resume from suspend anymore.

Specifically, when I perform a resume, the machine is hung, without any signal sent to the monitor.

I'm not sure which hardware component is responsible; I presume it's related to the motherboard, which is a MAG B650M MORTAR WIFI (MS-7D76), with a 7950X AMD CPU.

I've performed a bisect, and the commit where this issue first appeared follows.

commit de111f6b4f6a3010020825d22a068f416bc29c95
Author: Vasant Hegde <vasant.hegde@amd.com>
Date:   Mon May 6 08:20:39 2024 +0000

    iommu/amd: Enable Guest Translation after reading IOMMU feature register
    
    Commit 8e0179733172 ("iommu/amd: Enable Guest Translation before
    registering devices") moved IOMMU Guest Translation (GT) enablement to
    early init path. It does feature check based on Global EFR value (got from
    ACPI IVRS table). Later it adjusts EFR value based on IOMMU feature
    register (late_iommu_features_init()).
    
    It seems in some systems BIOS doesn't set gloabl EFR value properly.
    This is causing mismatch. Hence move IOMMU GT enablement after
    late_iommu_features_init() so that it does check based on IOMMU EFR
    value.
    
    Fixes: 8e0179733172 ("iommu/amd: Enable Guest Translation before registering devices")
    Reported-by: Klara Modin <klarasmodin@gmail.com>
    Closes: https://lore.kernel.org/linux-iommu/333e6eb6-361c-4afb-8107-2573324bf689@gmail.com/
    Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
    Tested-by: Klara Modin <klarasmodin@gmail.com>
    Link: https://lore.kernel.org/r/20240506082039.7575-1-vasant.hegde@amd.com
    Signed-off-by: Joerg Roedel <jroedel@suse.de>
Comment 1 Vasant Hegde 2024-06-24 16:00:00 UTC
Sorry. I broke the suspend/resume path :-( Thanks a lot for reporting and bisecting.

Similar bug was reported last week [1]. I have posted patch to fix this issue [2]. Can you please test this patch?

[1] https://bugzilla.kernel.org/show_bug.cgi?id=218975

[2] https://lore.kernel.org/linux-iommu/20240621101533.20216-1-vasant.hegde@amd.com/


-Vasant
Comment 2 Saverio Miroddi 2024-06-25 07:25:29 UTC
(In reply to Vasant Hegde from comment #1)
> Sorry. I broke the suspend/resume path :-( Thanks a lot for reporting and
> bisecting.
> 
> Similar bug was reported last week [1]. I have posted patch to fix this
> issue [2]. Can you please test this patch?
> 
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=218975
> 
> [2]
> https://lore.kernel.org/linux-iommu/20240621101533.20216-1-vasant.hegde@amd.
> com/

Thanks! I've tested one suspend/resume cycle, and the patch fixed it :)
Comment 3 Vasant Hegde 2024-06-28 09:21:56 UTC
(In reply to Saverio Miroddi from comment #2)
> (In reply to Vasant Hegde from comment #1)
> > Sorry. I broke the suspend/resume path :-( Thanks a lot for reporting and
> > bisecting.
> > 
> > Similar bug was reported last week [1]. I have posted patch to fix this
> > issue [2]. Can you please test this patch?
> > 
> > [1] https://bugzilla.kernel.org/show_bug.cgi?id=218975
> > 
> > [2]
> >
> https://lore.kernel.org/linux-iommu/20240621101533.20216-1-vasant.hegde@amd.
> > com/
> 
> Thanks! I've tested one suspend/resume cycle, and the patch fixed it :)

Thanks! Joerg queued patch to fix branch. Hopefully it will go to upstream soon.

-Vasant
Comment 4 Tony Houghton 2024-10-27 21:52:18 UTC
I've been having the same problem as Saverio Miroddi and found the same commit with git bisect. I've applied the above patch (using linux-git in Arch) but it doesn't fix it for me, so I've reverted to linux-lts for now. My motherboard is an MSI MAG X670E TOMAHAWK WIFI with a Ryzen 7700X.
Comment 5 Vasant Hegde 2024-10-28 10:33:11 UTC
Hi Tony,

(In reply to Tony Houghton from comment #4)
> I've been having the same problem as Saverio Miroddi and found the same
> commit with git bisect. I've applied the above patch (using linux-git in
> Arch) but it doesn't fix it for me, so I've reverted to linux-lts for now.
> My motherboard is an MSI MAG X670E TOMAHAWK WIFI with a Ryzen 7700X.

You mean latest upstream is not working for you?

-Vasant
Comment 6 Tony Houghton 2024-10-28 19:52:10 UTC
Yes, I've double-checked that what I've built and installed includes that call to iommu_enable_gt(). It's still failing to resume for me.
Comment 7 Saverio Miroddi 2024-10-28 20:20:49 UTC
> I've been having the same problem as Saverio Miroddi and found the same
> commit with git bisect. I've applied the above patch (using linux-git in
> Arch) but it doesn't fix it for me, so I've reverted to linux-lts for now. My
> motherboard is an MSI MAG X670E TOMAHAWK WIFI with a Ryzen 7700X.

Not sure if related in one way or another, but my experience with the MSI MAG Mortar has been very poor.

The resume from standby has been stable only on the first firmware release, then subsequent firmware versions broke resume - approximately 20% of the time, system wouldn't resume.

Not sure if also related, but the first firmware, where resume worked, had EXPO broken. Subsequent firmware versions had EXPO working.

I'm moved to another brand since them, and didn't experience any resume problem.
Comment 8 Tony Houghton 2024-10-29 22:11:50 UTC
I have had suspend-related problems with an MSI board in the past and RMAd it. I thought it was a one-off because MSI don't seem to have a bad reputation. This one has been fine until recently. It's commit 8e0179733172 that seems to have triggered the trouble with this board, and the subsequent modification from <https://lore.kernel.org/linux-iommu/20240621101533.20216-1-vasant.hegde@amd.com/> doesn't fix it for me.
Comment 9 Vasant Hegde 2024-10-30 09:53:16 UTC
I did look into these code path again and I don't find anything suspicious. (I am on v6.12-rc5).

In resume path it should call : 
amd_iommu_resume -> enable_iommus -> early_enable_iommus -> early_enable_iommu -> iommu_enable_gt().

Can you provide full dmesg and `lspci -vvv` output ?

Also see if issues goes away if you disable IOMMU (iommu=off)

-Vasant
Comment 10 Tony Houghton 2024-10-31 22:34:59 UTC
Created attachment 307106 [details]
dmesg output
Comment 11 Tony Houghton 2024-10-31 22:35:25 UTC
Created attachment 307107 [details]
lspci -vvv output
Comment 12 Tony Houghton 2024-10-31 22:40:09 UTC
I've attached the dmesg and lspci outputs. I tried iommu=off and that gave me new problem: every time I try to suspend, the PC resumes spontaneously about a second after the screens turn off. I would try disabling IOMMU in the BIOS, but I couldn't find such an option (nor for SVM, which seemed like it could be related when I Googled for the IOMMU setting).
Comment 13 Vasant Hegde 2024-11-04 09:50:15 UTC
Thanks for the dmesg Tony.

Looking into the dmesg :
[    0.416070] AMD-Vi: Extended features (0x246577efa2254afa, 0x0): PPR NX GT [5] IA GA PC GA_vAPIC
[    0.416077] AMD-Vi: Interrupt remapping enabled
[    0.464738] AMD-Vi: Virtual APIC enabled

It has GT support and looking into code path again I can confirm GT feature gets enabled in resume path.

Only other noticeable difference is X2APIC support. Looks like x2apic is not enabled in BIOS. But driver will use MSI interrupt and it should work fine. (I am referring to iommu_init_irq() code path).

By any chance do you see any message on console once you resume the system?

Just to double check, can you try out below change?  This is queued for v6.13 merge window. Basically this patch will skip copy_device_table() code path in resume path (This is to rule out the resume code taking wrong path in copy_device_table() function). 


-Vasant

commit 3f6eeada6930056c38f20964120ac402cf0030b9
Author: Vasant Hegde <vasant.hegde@amd.com>
Date:   Wed Oct 16 08:49:58 2024 +0000

    iommu/amd: Do not try copy old DTE resume path
    
    In suspend/resume path, no need to copy old DTE (early_enable_iommus()).
    Just need to reload IOMMU hardware.
    
    This is the side effect of commit 3ac3e5ee5ed5 ("iommu/amd: Copy old
    trans table from old kernel") which changed early_enable_iommus() but
    missed to fix enable_iommus().
    
    Resume path continue to work as 'amd_iommu_pre_enabled' is set to false
    and copy_device_table() will fail. It will just re-loaded IOMMU. Hence I
    think we don't need to backport this to stable tree.
    
    Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
    Link: https://lore.kernel.org/r/20241016084958.99727-1-vasant.hegde@amd.com
    Signed-off-by: Joerg Roedel <jroedel@suse.de>

diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 43131c3a2172..3fa70169aace 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -2882,11 +2882,6 @@ static void enable_iommus_vapic(void)
 #endif
 }
 
-static void enable_iommus(void)
-{
-	early_enable_iommus();
-}
-
 static void disable_iommus(void)
 {
 	struct amd_iommu *iommu;
@@ -2913,7 +2908,8 @@ static void amd_iommu_resume(void)
 		iommu_apply_resume_quirks(iommu);
 
 	/* re-load the hardware */
-	enable_iommus();
+	for_each_iommu(iommu)
+		early_enable_iommu(iommu);
 
 	amd_iommu_enable_interrupts();
 }
Comment 14 Vasant Hegde 2024-11-04 09:53:01 UTC
(In reply to Tony Houghton from comment #12)
> I've attached the dmesg and lspci outputs. I tried iommu=off and that gave
> me new problem: every time I try to suspend, the PC resumes spontaneously
> about a second after the screens turn off. 

I am not sure what's going on here.


> I would try disabling IOMMU in
> the BIOS, but I couldn't find such an option (nor for SVM, which seemed like
> it could be related when I Googled for the IOMMU setting).

I am not sure about vendor BIOS. Ideally there should be an option to disable IOMMU (check for IOMMU or DMAr or AMD-Vi in BIOS).

-Vasant
Comment 15 Mario Limonciello (AMD) 2024-11-04 14:12:26 UTC
>> every time I try to suspend, the PC resumes spontaneously
>> about a second after the screens turn off. 
> I am not sure what's going on here.

I suggest turning on /sys/power/pm_debug_messages and then sharing the dmesg.  This should capture any interrupt or GPIO that was causing an aborted suspend.
Comment 16 Tony Houghton 2024-11-10 17:14:22 UTC
Created attachment 307197 [details]
dmesg after aborted suspend with iommu=off
Comment 17 Mario Limonciello (AMD) 2024-11-10 17:20:51 UTC
Comment on attachment 307197 [details]
dmesg after aborted suspend with iommu=off

[  194.510735] mt7921e 0000:0f:00.0: PM: pci_pm_suspend(): mt7921_pci_suspend [mt7921e] returns -12
[  194.510740] mt7921e 0000:0f:00.0: PM: dpm_run_callback(): pci_pm_suspend returns -12
[  194.510744] mt7921e 0000:0f:00.0: PM: failed to suspend async: error -12

Issue occurs in wifi driver. Do you have a lot running at this time? -12 is out of memory.
Comment 18 Tony Houghton 2024-11-10 19:20:11 UTC
(In reply to Vasant Hegde from comment #13)
> Thanks for the dmesg Tony.

Thank you for investigating and helping; sorry about the delay in replying.

> Only other noticeable difference is X2APIC support. Looks like x2apic is not
> enabled in BIOS. But driver will use MSI interrupt and it should work fine.
> (I am referring to iommu_init_irq() code path).

Is that likely to be significant? I Googled x2apic, and apparently it doesn't do much besides provide support for more than 256 CPU cores. Again, I can't find a BIOS option for this, but I did eventually find one to disable IOMMU (see below).

> By any chance do you see any message on console once you resume the system?

No, the display output stays off, and so does networking, so I can't see anything. All that happens is the fans and LEDs etc come on, but the system is completely unresponsive. My keyboard lights up too, but the caps lock LED doesn't toggle when I press the caps lock key.

> Just to double check, can you try out below change?  This is queued for
> v6.13 merge window. Basically this patch will skip copy_device_table() code
> path in resume path (This is to rule out the resume code taking wrong path
> in copy_device_table() function). 
...
> commit 3f6eeada6930056c38f20964120ac402cf0030b9

I've added that patch, and it doesn't make a difference that I can tell.

I see that de111f6b4f6a3010020825d22a068f416bc29c95, which causes the issue for me, is also very simple. AIUI, de111f6b4f6a removes the calls to iommu_enable_gt from early_enable_iommu and early_enable_iommus and calls iommu_enable_gt from iommu_init_pci instead. Then the patch from <https://lore.kernel.org/linux-iommu/20240621101533.20216-1-vasant.hegde@amd.com/> restored iommu_enable_gt to early_enable_iommu but didn't affect the other two locations. So that makes me think calling iommu_enable_gt from iommu_init_pci is what's upsetting my board.
Comment 19 Tony Houghton 2024-11-10 19:21:20 UTC
(In reply to Mario Limonciello (AMD) from comment #17)
 
> Issue occurs in wifi driver. Do you have a lot running at this time? -12 is
> out of memory.

I thought I already posted a message after attaching this new dmesg output, but it seems to have got lost :-(.

I also noticed that the error appeared to be from the wifi driver. It only occurs when I add iommu=off to the kernel command line, and as well as causing the suspend to abort, the driver fails to initialise, which I didn't notice until now, because I use ethernet by default. Should I report this issue separately against the mt7921e driver? It happens even with nothing running except GNOME in 32GB of RAM, so if it really is out of memory, maybe it's getting stuck in some loop/recursion that keeps trying to allocate more?

If I disable the wifi in the BIOS, it fixes this suspend issue, but the resume still fails, so it seems iommu=off doesn't fix the original issue. I also found the BIOS option to disable IOMMU, but that doesn't make any difference to either issue.
Comment 20 Vasant Hegde 2024-11-11 11:52:43 UTC
Hi Tony,

(In reply to Tony Houghton from comment #18)
> (In reply to Vasant Hegde from comment #13)
> > Thanks for the dmesg Tony.
> 
> Thank you for investigating and helping; sorry about the delay in replying.
> 
> > Only other noticeable difference is X2APIC support. Looks like x2apic is
> not
> > enabled in BIOS. But driver will use MSI interrupt and it should work fine.
> > (I am referring to iommu_init_irq() code path).
> 
> Is that likely to be significant? I Googled x2apic, and apparently it
> doesn't do much besides provide support for more than 256 CPU cores. Again,
> I can't find a BIOS option for this, but I did eventually find one to
> disable IOMMU (see below).

Its not significant. Even with APIC mode it should work fine.

> 
> > By any chance do you see any message on console once you resume the system?
> 
> No, the display output stays off, and so does networking, so I can't see
> anything. All that happens is the fans and LEDs etc come on, but the system
> is completely unresponsive. My keyboard lights up too, but the caps lock LED
> doesn't toggle when I press the caps lock key.
> 
> > Just to double check, can you try out below change?  This is queued for
> > v6.13 merge window. Basically this patch will skip copy_device_table() code
> > path in resume path (This is to rule out the resume code taking wrong path
> > in copy_device_table() function). 
> ...
> > commit 3f6eeada6930056c38f20964120ac402cf0030b9
> 
> I've added that patch, and it doesn't make a difference that I can tell.

Thanks for testing.

> 
> I see that de111f6b4f6a3010020825d22a068f416bc29c95, which causes the issue
> for me, is also very simple. AIUI, de111f6b4f6a removes the calls to
> iommu_enable_gt from early_enable_iommu and early_enable_iommus and calls
> iommu_enable_gt from iommu_init_pci instead. Then the patch from
> <https://lore.kernel.org/linux-iommu/20240621101533.20216-1-vasant.hegde@amd.
> com/> restored iommu_enable_gt to early_enable_iommu but didn't affect the
> other two locations. So that makes me think calling iommu_enable_gt from
> iommu_init_pci is what's upsetting my board.

Right. That's commit 150bdf5f8d.

I don't think iommu_init_pci() path gets called in resume path. Only resume callback handler gets called.

-Vasant
Comment 21 Vasant Hegde 2024-11-11 11:53:47 UTC
Hi,

(In reply to Tony Houghton from comment #19)
> (In reply to Mario Limonciello (AMD) from comment #17)
>  

> 
> If I disable the wifi in the BIOS, it fixes this suspend issue, but the
> resume still fails, so it seems iommu=off doesn't fix the original issue. I
> also found the BIOS option to disable IOMMU, but that doesn't make any
> difference to either issue.

If I read it correctly, suspend/resume doesn't work even if IOMMU is disabled in BIOS. is that correct? Then the problem is in some other place, not in IOMMU subsystem.

-Vasant
Comment 22 Tony Houghton 2024-11-12 15:30:47 UTC
I've done some more testing etc, and now it seems that de111f6b4f6a works correctly after all. I'm sure I confirmed it before, so maybe changing the BIOS settings and/or flashing a new version made some difference. But I still have the original problem with more recent kernels.

I've done another git bisect and found a new suspect:

first bad commit: [ceac1cb0259de682d78f5c784ef8e0b13022e9d9] Bluetooth: btusb: mediatek: add ISO data transmission functions

I'll do some more tests on that and its immediate predecessor, but it seems more than a coincidence that it's related to the same piece of hardware that caused the other issue (comments #12, #17 etc).
Comment 23 Mario Limonciello (AMD) 2024-11-12 15:34:36 UTC
When you're doing your bisect you should explicitly look for any DMA mapping errors that are introduced with IOMMU disabled.

BTW - please don't test with "iommu=off".  To isolate AMD IOMMU specific issue you should use "amd_iommu=off".

This will at least allow the use of software bounce buffers, which might have been the problem induced in Mediatek driver by running "iommu=off".