Bug 219499 - AMDGPU fails to resume from suspend unless amd_iommu=off
Summary: AMDGPU fails to resume from suspend unless amd_iommu=off
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(Other) (show other bugs)
Hardware: Intel Linux
: P3 normal
Assignee: Vasant Hegde
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-11-14 15:17 UTC by Hamish McIntyre-Bhatty
Modified: 2025-02-17 05:27 UTC (History)
5 users (show)

See Also:
Kernel Version:
Subsystem:
Regression: Yes
Bisected commit-id: c4cb23111103a841c2df30058597398443bcad5f


Attachments
AMDGPU crash log (532.13 KB, text/plain)
2024-11-14 15:17 UTC, Hamish McIntyre-Bhatty
Details
[Patch] Print Control Register (1.52 KB, patch)
2024-11-19 10:15 UTC, Vasant Hegde
Details | Diff
AMDGPU crash log after applying patches (479.77 KB, text/plain)
2024-11-27 17:09 UTC, Hamish McIntyre-Bhatty
Details
dmesg output after resuming with previous commit. (107.14 KB, text/plain)
2024-11-27 18:38 UTC, Hamish McIntyre-Bhatty
Details
iommu/amd: Expicitly enable CNTRL.EPH_En bit in resume path (1.90 KB, application/mbox)
2024-11-28 16:55 UTC, Vasant Hegde
Details
New dmesg output (295.17 KB, text/plain)
2024-12-19 17:47 UTC, Hamish McIntyre-Bhatty
Details

Description Hamish McIntyre-Bhatty 2024-11-14 15:17:03 UTC
Created attachment 307218 [details]
AMDGPU crash log

On Lenovo Thinkbook 14 G2 ARE with Ryzen 4500U, resume from suspend doesn't work since commit c4cb23111103a841c2df30058597398443bcad5f:

`Author: Vasant Hegde vasant.hegde@amd.com
Date:   Thu Apr 18 10:33:57 2024 +0000

iommu/amd: Add support for enable/disable IOPF

Return success from enable_feature(IOPF) path as this interface is going
away. Instead we will enable/disable IOPF support in attach/detach device
path.

In attach device path, if device is capable of PRI, then we will add it to
per IOMMU IOPF queue and enable PPR support in IOMMU. Also it will
attach device to domain even if it fails to enable PRI or add device to
IOPF queue as device can continue to work without PRI support.

In detach device patch it follows following sequence:
  - Flush the queue for the given device
  - Disable PPR support in DTE[devid]
  - Remove device from IOPF queue
  - Disable device PRI

Also add IOMMU_IOPF as dependency to AMD_IOMMU driver.

Co-developed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20240418103400.6229-13-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
drivers/iommu/amd/Kconfig     |  1 +
drivers/iommu/amd/amd_iommu.h |  4 ++++
drivers/iommu/amd/iommu.c     | 39 +++++++++++++++++++++++++++++++--------
drivers/iommu/amd/ppr.c       | 41 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 77 insertions(+), 8 deletions(-)
`

See https://gitlab.freedesktop.org/drm/amd/-/issues/2092 for previous discussion - it seemed to have been fixed and then regressed again after 6.10-rc1. Attached is a journalctl log from my system after attempting to resume.
Comment 1 Hamish McIntyre-Bhatty 2024-11-14 15:18:20 UTC
Just adding an extra note to say I'm running on Linux Mint 22, but I did bisect using vanilla kernel builds, just with the config copied over.
Comment 2 Vasant Hegde 2024-11-14 16:47:48 UTC
Hi Hamish,

(In reply to Hamish McIntyre-Bhatty from comment #1)
> Just adding an extra note to say I'm running on Linux Mint 22, but I did
> bisect using vanilla kernel builds, just with the config copied over.

Thanks for the report.  Looking into the attach dmesg I see that we hit Illegal dev table .. I believe which is causing resume to fail. Looking into the DTE I can't make out why this is Illegal. Can we also dump IOMMU Control register?


Nov 14 11:52:19 Thinkbook kernel: iommu ivhd0: AMD-Vi: Event logged [ILLEGAL_DEV_TABLE_ENTRY device=0000:06:00.0 pasid=0x00000 address=0x135300000 flags=0x0080]
Nov 14 11:52:19 Thinkbook kernel: AMD-Vi: DTE[0]: 7d90000000000003
Nov 14 11:52:19 Thinkbook kernel: AMD-Vi: DTE[1]: 0000100103fc0009
Nov 14 11:52:19 Thinkbook kernel: AMD-Vi: DTE[2]: 2000000117840013
Nov 14 11:52:19 Thinkbook kernel: AMD-Vi: DTE[3]: 0000000000000000


-Vasant
Comment 3 Vasant Hegde 2024-11-14 17:17:24 UTC
Can you please try below patch?


diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 5785f35426d6..329ccf3213e8 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -3217,6 +3217,7 @@ static int amd_iommu_enable_interrupts(void)
 			goto out;
 	}
 
+out:
 	/*
 	 * Interrupt handler is ready to process interrupts. Enable
 	 * PPR and GA log interrupt for all IOMMUs.
@@ -3224,7 +3225,6 @@ static int amd_iommu_enable_interrupts(void)
 	enable_iommus_vapic();
 	enable_iommus_ppr();
 
-out:
 	return ret;
 }
Comment 4 Mario Limonciello (AMD) 2024-11-14 18:30:18 UTC
Also separately from the patch are the illegal dev table entries new to this issue or a red herring?  
IE If you run the commit before the bisected one do they show up too?
Comment 5 Hamish McIntyre-Bhatty 2024-11-15 14:33:00 UTC
I don't know how to dump the IOMMU control register, could I have some instructions for that please? I will try the patch and the commit before the bisected one and report back soon.
Comment 6 Vasant Hegde 2024-11-19 10:15:14 UTC
Created attachment 307238 [details]
[Patch] Print Control Register

Hi,

Attached patch will print IOMMU Control register along with ILlegal device table entry details.

Please apply this patch along with previously mentioned change. Let us know how it goes.

-Vasant
Comment 7 Hamish McIntyre-Bhatty 2024-11-27 17:08:27 UTC
Hi there,

I applied both patches, and the result is that it still doesn't resume. I have attached a new log, hopefully this will be helpful.
Comment 8 Hamish McIntyre-Bhatty 2024-11-27 17:09:10 UTC
Created attachment 307291 [details]
AMDGPU crash log after applying patches

AMDGPU crash log after applying patches.
Comment 9 Hamish McIntyre-Bhatty 2024-11-27 18:38:46 UTC
Created attachment 307292 [details]
dmesg output after resuming with previous commit.

Don't seem to get the illegal dev table entries with the previous commit, I think. I'm not 100% sure what I'm looking for, though, so don't take my word for it. Here's the dmesg output.
Comment 10 Vasant Hegde 2024-11-28 16:29:08 UTC
Hi Hamish,

Thanks for the log. So with failed log I still see ILLEGAL_DEV_TABLE_ENTRY  and passed log everything looks fine. 

I did look into the code path and the culprit commit. Only difference culprit commit does is enabling DTE[PPR] bit.


I did check internally. We are not able to reproduce the issue. Sorry. I have to trouble you to test more patches :-(

I spoke to HW engineer and he pointed out that in resume path CNTRL.EPH_EN bit is cleared. Its default value should be 1. I will prepare another patch. Can you test it please ?

-Vasant
Comment 11 Vasant Hegde 2024-11-28 16:55:04 UTC
Created attachment 307302 [details]
iommu/amd: Expicitly enable CNTRL.EPH_En bit in resume path

Hi,

Please test this patch along with previous patches and provide dmesg after suspend/resume.

-Vasant
Comment 12 Hamish McIntyre-Bhatty 2024-11-28 16:57:24 UTC
No problem, but I have just distro-hopped on the system I'm using to build the kernels so it might take a little time to get things ready again. I'll post back here when all is done.
Comment 13 Vasant Hegde 2024-12-18 06:28:55 UTC
Hi Hamish,

(In reply to Hamish McIntyre-Bhatty from comment #12)
> No problem, but I have just distro-hopped on the system I'm using to build
> the kernels so it might take a little time to get things ready again. I'll
> post back here when all is done.

Did you get  a chance to test above patch?

-Vasant
Comment 14 Hamish McIntyre-Bhatty 2024-12-19 17:43:53 UTC
Apologies, just had to create a new account as I apparently hadn't created an entry in my password manager. I've just tested the patch, and I can confirm that suspend and resume now work again.

Journalctl output below.
Comment 15 Hamish McIntyre-Bhatty 2024-12-19 17:47:49 UTC
Created attachment 307373 [details]
New dmesg output

New dmesg output after suspending and resuming with all patches. Note that I'm doing hybrid suspend so you see messages about hibernation too.
Comment 16 Hamish McIntyre-Bhatty 2025-01-26 17:57:59 UTC
It's been a little while so I thought I'd bump this. I'm happy to test again with the latest kernel, with or without the patches, just to make sure this issue is still here.

What would it be useful for me to do to move this forwards?
Comment 17 Vasant Hegde 2025-01-27 08:56:43 UTC
HI,

(In reply to Hamish McIntyre-Bhatty from comment #16)
> It's been a little while so I thought I'd bump this. I'm happy to test again
> with the latest kernel, with or without the patches, just to make sure this
> issue is still here.
> 
> What would it be useful for me to do to move this forwards?

Somehow I missed your update. Sorry for that.
Will send patch today with your tested by.

-Vasant
Comment 18 Vasant Hegde 2025-01-27 09:52:26 UTC
Posted to IOMMU list : https://lore.kernel.org/linux-iommu/20250127094411.5931-1-vasant.hegde@amd.com/T/#u

-Vasant
Comment 19 Hamish McIntyre-Bhatty 2025-01-27 13:03:07 UTC
No worries, glad I could help.
Comment 20 Hamish McIntyre-Bhatty 2025-02-03 12:05:58 UTC
Not sure this should be marked as fixed as it doesn't seem to have gotten into the kernel - no replies to that list message. Unless I'm misunderstanding how this works, which is entirely possible - this is my first kernel bug report.
Comment 21 Vasant Hegde 2025-02-04 04:56:40 UTC
(In reply to Hamish McIntyre-Bhatty from comment #20)
> Not sure this should be marked as fixed as it doesn't seem to have gotten
> into the kernel - no replies to that list message. Unless I'm
> misunderstanding how this works, which is entirely possible - this is my
> first kernel bug report.

Yeah. Sometime we don't see immediate replies in kernel mailing list as people are busy. 

This will be eventually picked up by IOMMU subsystem maintainer (Joerg) and he will send it to Linus. 

Later it will go to all applicable stable branches.

It takes sometime. I will keep this updated.

-Vasant
Comment 22 Mario Limonciello (AMD) 2025-02-04 16:25:49 UTC
I also don't think it's a good idea to close the bug until the fix is actually landed.  I'll re-open it and as this lands we can close it with a link to the hash that fixes it.
Comment 23 Hamish McIntyre-Bhatty 2025-02-04 17:06:47 UTC
I second that. And thanks for explaining how it works, Vasant.
Comment 24 Vasant Hegde 2025-02-17 05:27:40 UTC
Upstream commit :




commit ef75966abf950c0539534effa4960caa29fb7167
Author: Vasant Hegde <vasant.hegde@amd.com>
Date:   Mon Jan 27 09:44:11 2025 +0000

    iommu/amd: Expicitly enable CNTRL.EPHEn bit in resume path

    With recent kernel, AMDGPU failed to resume after suspend on certain laptop.

    Sample log:
    -----------
    Nov 14 11:52:19 Thinkbook kernel: iommu ivhd0: AMD-Vi: Event logged [ILLEGAL_DEV_TABLE_ENTRY device=0000:06:00.0 pasid=0x00000 address=0x135300000 flags=0x0080]
    Nov 14 11:52:19 Thinkbook kernel: AMD-Vi: DTE[0]: 7d90000000000003
    Nov 14 11:52:19 Thinkbook kernel: AMD-Vi: DTE[1]: 0000100103fc0009
    Nov 14 11:52:19 Thinkbook kernel: AMD-Vi: DTE[2]: 2000000117840013
    Nov 14 11:52:19 Thinkbook kernel: AMD-Vi: DTE[3]: 0000000000000000

    This is because in resume path, CNTRL[EPHEn] is not set. Fix this by
    setting CNTRL[EPHEn] to 1 in resume path if EFR[EPHSUP] is set.

    Note
      May be better approach is to save the control register in suspend path
      and restore it in resume path instead of trying to set indivisual
      bits. We will have separate patch for that.

    Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219499
    Fixes: c4cb23111103 ("iommu/amd: Add support for enable/disable IOPF")
    Tested-by: Hamish McIntyre-Bhatty <kernel-bugzilla@regd.hamishmb.com>
    Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
    Link: https://lore.kernel.org/r/20250127094411.5931-1-vasant.hegde@amd.com
    Signed-off-by: Joerg Roedel <jroedel@suse.de>


-Vasant

Note You need to log in before you can comment on or make changes to this bug.