Created attachment 307218 [details] AMDGPU crash log On Lenovo Thinkbook 14 G2 ARE with Ryzen 4500U, resume from suspend doesn't work since commit c4cb23111103a841c2df30058597398443bcad5f: `Author: Vasant Hegde vasant.hegde@amd.com Date: Thu Apr 18 10:33:57 2024 +0000 iommu/amd: Add support for enable/disable IOPF Return success from enable_feature(IOPF) path as this interface is going away. Instead we will enable/disable IOPF support in attach/detach device path. In attach device path, if device is capable of PRI, then we will add it to per IOMMU IOPF queue and enable PPR support in IOMMU. Also it will attach device to domain even if it fails to enable PRI or add device to IOPF queue as device can continue to work without PRI support. In detach device patch it follows following sequence: - Flush the queue for the given device - Disable PPR support in DTE[devid] - Remove device from IOPF queue - Disable device PRI Also add IOMMU_IOPF as dependency to AMD_IOMMU driver. Co-developed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Vasant Hegde <vasant.hegde@amd.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20240418103400.6229-13-vasant.hegde@amd.com Signed-off-by: Joerg Roedel <jroedel@suse.de> drivers/iommu/amd/Kconfig | 1 + drivers/iommu/amd/amd_iommu.h | 4 ++++ drivers/iommu/amd/iommu.c | 39 +++++++++++++++++++++++++++++++-------- drivers/iommu/amd/ppr.c | 41 +++++++++++++++++++++++++++++++++++++++++ 4 files changed, 77 insertions(+), 8 deletions(-) ` See https://gitlab.freedesktop.org/drm/amd/-/issues/2092 for previous discussion - it seemed to have been fixed and then regressed again after 6.10-rc1. Attached is a journalctl log from my system after attempting to resume.
Just adding an extra note to say I'm running on Linux Mint 22, but I did bisect using vanilla kernel builds, just with the config copied over.
Hi Hamish, (In reply to Hamish McIntyre-Bhatty from comment #1) > Just adding an extra note to say I'm running on Linux Mint 22, but I did > bisect using vanilla kernel builds, just with the config copied over. Thanks for the report. Looking into the attach dmesg I see that we hit Illegal dev table .. I believe which is causing resume to fail. Looking into the DTE I can't make out why this is Illegal. Can we also dump IOMMU Control register? Nov 14 11:52:19 Thinkbook kernel: iommu ivhd0: AMD-Vi: Event logged [ILLEGAL_DEV_TABLE_ENTRY device=0000:06:00.0 pasid=0x00000 address=0x135300000 flags=0x0080] Nov 14 11:52:19 Thinkbook kernel: AMD-Vi: DTE[0]: 7d90000000000003 Nov 14 11:52:19 Thinkbook kernel: AMD-Vi: DTE[1]: 0000100103fc0009 Nov 14 11:52:19 Thinkbook kernel: AMD-Vi: DTE[2]: 2000000117840013 Nov 14 11:52:19 Thinkbook kernel: AMD-Vi: DTE[3]: 0000000000000000 -Vasant
Can you please try below patch? diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c index 5785f35426d6..329ccf3213e8 100644 --- a/drivers/iommu/amd/init.c +++ b/drivers/iommu/amd/init.c @@ -3217,6 +3217,7 @@ static int amd_iommu_enable_interrupts(void) goto out; } +out: /* * Interrupt handler is ready to process interrupts. Enable * PPR and GA log interrupt for all IOMMUs. @@ -3224,7 +3225,6 @@ static int amd_iommu_enable_interrupts(void) enable_iommus_vapic(); enable_iommus_ppr(); -out: return ret; }
Also separately from the patch are the illegal dev table entries new to this issue or a red herring? IE If you run the commit before the bisected one do they show up too?
I don't know how to dump the IOMMU control register, could I have some instructions for that please? I will try the patch and the commit before the bisected one and report back soon.
Created attachment 307238 [details] [Patch] Print Control Register Hi, Attached patch will print IOMMU Control register along with ILlegal device table entry details. Please apply this patch along with previously mentioned change. Let us know how it goes. -Vasant
Hi there, I applied both patches, and the result is that it still doesn't resume. I have attached a new log, hopefully this will be helpful.
Created attachment 307291 [details] AMDGPU crash log after applying patches AMDGPU crash log after applying patches.
Created attachment 307292 [details] dmesg output after resuming with previous commit. Don't seem to get the illegal dev table entries with the previous commit, I think. I'm not 100% sure what I'm looking for, though, so don't take my word for it. Here's the dmesg output.
Hi Hamish, Thanks for the log. So with failed log I still see ILLEGAL_DEV_TABLE_ENTRY and passed log everything looks fine. I did look into the code path and the culprit commit. Only difference culprit commit does is enabling DTE[PPR] bit. I did check internally. We are not able to reproduce the issue. Sorry. I have to trouble you to test more patches :-( I spoke to HW engineer and he pointed out that in resume path CNTRL.EPH_EN bit is cleared. Its default value should be 1. I will prepare another patch. Can you test it please ? -Vasant
Created attachment 307302 [details] iommu/amd: Expicitly enable CNTRL.EPH_En bit in resume path Hi, Please test this patch along with previous patches and provide dmesg after suspend/resume. -Vasant
No problem, but I have just distro-hopped on the system I'm using to build the kernels so it might take a little time to get things ready again. I'll post back here when all is done.
Hi Hamish, (In reply to Hamish McIntyre-Bhatty from comment #12) > No problem, but I have just distro-hopped on the system I'm using to build > the kernels so it might take a little time to get things ready again. I'll > post back here when all is done. Did you get a chance to test above patch? -Vasant
Apologies, just had to create a new account as I apparently hadn't created an entry in my password manager. I've just tested the patch, and I can confirm that suspend and resume now work again. Journalctl output below.
Created attachment 307373 [details] New dmesg output New dmesg output after suspending and resuming with all patches. Note that I'm doing hybrid suspend so you see messages about hibernation too.
It's been a little while so I thought I'd bump this. I'm happy to test again with the latest kernel, with or without the patches, just to make sure this issue is still here. What would it be useful for me to do to move this forwards?
HI, (In reply to Hamish McIntyre-Bhatty from comment #16) > It's been a little while so I thought I'd bump this. I'm happy to test again > with the latest kernel, with or without the patches, just to make sure this > issue is still here. > > What would it be useful for me to do to move this forwards? Somehow I missed your update. Sorry for that. Will send patch today with your tested by. -Vasant
Posted to IOMMU list : https://lore.kernel.org/linux-iommu/20250127094411.5931-1-vasant.hegde@amd.com/T/#u -Vasant
No worries, glad I could help.
Not sure this should be marked as fixed as it doesn't seem to have gotten into the kernel - no replies to that list message. Unless I'm misunderstanding how this works, which is entirely possible - this is my first kernel bug report.
(In reply to Hamish McIntyre-Bhatty from comment #20) > Not sure this should be marked as fixed as it doesn't seem to have gotten > into the kernel - no replies to that list message. Unless I'm > misunderstanding how this works, which is entirely possible - this is my > first kernel bug report. Yeah. Sometime we don't see immediate replies in kernel mailing list as people are busy. This will be eventually picked up by IOMMU subsystem maintainer (Joerg) and he will send it to Linus. Later it will go to all applicable stable branches. It takes sometime. I will keep this updated. -Vasant
I also don't think it's a good idea to close the bug until the fix is actually landed. I'll re-open it and as this lands we can close it with a link to the hash that fixes it.
I second that. And thanks for explaining how it works, Vasant.
Upstream commit : commit ef75966abf950c0539534effa4960caa29fb7167 Author: Vasant Hegde <vasant.hegde@amd.com> Date: Mon Jan 27 09:44:11 2025 +0000 iommu/amd: Expicitly enable CNTRL.EPHEn bit in resume path With recent kernel, AMDGPU failed to resume after suspend on certain laptop. Sample log: ----------- Nov 14 11:52:19 Thinkbook kernel: iommu ivhd0: AMD-Vi: Event logged [ILLEGAL_DEV_TABLE_ENTRY device=0000:06:00.0 pasid=0x00000 address=0x135300000 flags=0x0080] Nov 14 11:52:19 Thinkbook kernel: AMD-Vi: DTE[0]: 7d90000000000003 Nov 14 11:52:19 Thinkbook kernel: AMD-Vi: DTE[1]: 0000100103fc0009 Nov 14 11:52:19 Thinkbook kernel: AMD-Vi: DTE[2]: 2000000117840013 Nov 14 11:52:19 Thinkbook kernel: AMD-Vi: DTE[3]: 0000000000000000 This is because in resume path, CNTRL[EPHEn] is not set. Fix this by setting CNTRL[EPHEn] to 1 in resume path if EFR[EPHSUP] is set. Note May be better approach is to save the control register in suspend path and restore it in resume path instead of trying to set indivisual bits. We will have separate patch for that. Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219499 Fixes: c4cb23111103 ("iommu/amd: Add support for enable/disable IOPF") Tested-by: Hamish McIntyre-Bhatty <kernel-bugzilla@regd.hamishmb.com> Signed-off-by: Vasant Hegde <vasant.hegde@amd.com> Link: https://lore.kernel.org/r/20250127094411.5931-1-vasant.hegde@amd.com Signed-off-by: Joerg Roedel <jroedel@suse.de> -Vasant