First of all, sorry that I can't immediately provide the dmesg output. For some reason `journalctl -b -1` is perplexingly returning the current log rather than the one from the bisection. I'll try again and maybe I'll have better luck next time.

Despite that, I have bisected the boot failure to commit b5c58b2fdc427e7958412ecb2de2804a1f7c1572, which I'll append at the end for convenience. After booting (at other times) into a working kernel, examining the journal of the broken kernel always shows stack traces of kernel NULL pointer dereferences within the amdgpu call stack. The issue is 100% reproducible, so I was easily able to bisect it. Unfortunately, despite my best attempts at monkey-reverting that commit, my botched conflict resolution did not even compile.

I doubt this is relevant here, but I'll preemptively add that the system should have IOMMU and kCFI in use and that it's built with Clang 18.

============================================================================
b5c58b2fdc427e7958412ecb2de2804a1f7c1572 is the first bad commit
commit b5c58b2fdc427e7958412ecb2de2804a1f7c1572 (HEAD)
Author: Leon Romanovsky <leon@kernel.org>
Date:   Wed Jul 24 21:04:49 2024 +0300

    dma-mapping: direct calls for dma-iommu

    Directly call into dma-iommu just like we have been doing for dma-direct
    for a while. This avoids the indirect call overhead for IOMMU ops and
    removes the need to have DMA ops entirely for many common configurations.

    Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
    Signed-off-by: Leon Romanovsky <leon@kernel.org>
    Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Acked-by: Robin Murphy <robin.murphy@arm.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>

<changes skipped for brevity>
Created attachment 306902
Stacktrace

I was able to recover the latest stacktrace from the full journal and I'm uploading a cleaned up version.
I confirm the bug on my hardware. I don't know if it can help, but I can attach my journal too...
Created attachment 306903
stacktrace
I have tried https://lore.kernel.org/lkml/Zu_FDfHZAVzPv1lq@infradead.org/ and it did not make any obvious difference (probably not surprising, since that patch involves USB, but I wanted to report this before anyone asks about it).
Can you please try this patch?

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index b839683da0ba..cf3b89e681a3 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -926,7 +926,7 @@ bool dma_addressing_limited(struct device *dev)
 			 dma_get_required_mask(dev))
 		return true;
 
-	if (unlikely(ops))
+	if (unlikely(ops) || use_dma_iommu(dev)
 		return false;
 	return !dma_direct_all_ram_mapped(dev);
 }
Thank you for looking into this. I had to manually redo the patch to make it apply to the current git master branch and add an extra closing bracket to make the resulting file compile, but it did not seem to make a difference at boot time: still the same kernel NULL pointer dereference, and the stack trace appears to be unchanged as well.

Here's the patch I used:

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index b839683da0ba..f088a6462564 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -926,7 +926,7 @@ bool dma_addressing_limited(struct device *dev)
 			 dma_get_required_mask(dev))
 		return true;
 
-	if (unlikely(ops))
+	if (unlikely(ops) || use_dma_iommu(dev))
 		return false;
 	return !dma_direct_all_ram_mapped(dev);
 }
And here's the patch I used without markdown formatting...

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index b839683da0ba..f088a6462564 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -926,7 +926,7 @@ bool dma_addressing_limited(struct device *dev)
 			 dma_get_required_mask(dev))
 		return true;
 
-	if (unlikely(ops))
+	if (unlikely(ops) || use_dma_iommu(dev))
 		return false;
 	return !dma_direct_all_ram_mapped(dev);
 }
Yes, this is what I ended up with.

The official patch was posted:
https://lore.kernel.org/all/c2671009f217bfc3c9f7fbda2eefb0c44768fc3b.1727028402.git.leon@kernel.org/

Thanks for the report.
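For readers following along, here is a rough user-space sketch of the control flow that the patch in this thread changes. It is not kernel code: `struct device`, its fields, and the function names `addressing_limited_before`, `addressing_limited_after` and `check_model` are hypothetical stand-ins, and the idea that an IOMMU-backed device ends up with a NULL ops pointer after the bisected commit is an assumption drawn from the commit message ("removes the need to have DMA ops entirely"). It only models which branch is taken, not why amdgpu crashes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's dma_map_ops; contents are irrelevant here. */
struct dma_map_ops { int unused; };

/* Hypothetical model of a device: each flag replaces one kernel-side check. */
struct device {
	const struct dma_map_ops *ops; /* NULL when no dma_map_ops are installed */
	bool dma_iommu;                /* models what use_dma_iommu(dev) would report */
	bool mask_too_small;           /* models the DMA mask vs. required mask test */
	bool all_ram_mapped;           /* models dma_direct_all_ram_mapped(dev) */
};

/* Before the patch: only a non-NULL ops pointer skips the dma-direct check. */
bool addressing_limited_before(const struct device *dev)
{
	if (dev->mask_too_small)
		return true;
	if (dev->ops)
		return false;
	return !dev->all_ram_mapped;
}

/* With the patch: a device on the dma-iommu path also skips the dma-direct check. */
bool addressing_limited_after(const struct device *dev)
{
	if (dev->mask_too_small)
		return true;
	if (dev->ops || dev->dma_iommu)
		return false;
	return !dev->all_ram_mapped;
}

/* Usage example: an IOMMU-backed device with no ops pointer (assumed scenario). */
void check_model(void)
{
	struct device iommu_dev = {
		.ops = NULL,
		.dma_iommu = true,
		.mask_too_small = false,
		.all_ram_mapped = false,
	};

	/* Old flow falls through to the dma-direct answer and reports "limited". */
	assert(addressing_limited_before(&iommu_dev) == true);
	/* Patched flow reports "not limited" without consulting dma-direct. */
	assert(addressing_limited_after(&iommu_dev) == false);
}
```

In this model the two versions disagree only for a device that has no ops pointer but is handled by dma-iommu, which is the case the extra `use_dma_iommu(dev)` test covers.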
Thank you, Leon. Applying your mailing list patch on top of Christoph's does resolve the amdgpu issue.

Reported-and-tested-by: Niklāvs Koļesņikovs <pinkflames.linux@gmail.com>
The fix has landed upstream and will ship as part of 6.12-rc1. Closing. Cheers.