Bug 219292

Summary: Commit b5c58b2fdc42 causes a kernel null pointer dereference with amdgpu
Product: Drivers Reporter: Niklāvs Koļesņikovs (pinkflames.linux)
Component: IOMMU Assignee: drivers_iommu
Status: RESOLVED CODE_FIX    
Severity: normal CC: jean-christophe, leon
Priority: P3    
Hardware: All   
OS: Linux   
Kernel Version: Linux 6.12 pre-rc1 commit b5c58b2fdc42 Subsystem:
Regression: Yes Bisected commit-id: b5c58b2fdc427e7958412ecb2de2804a1f7c1572
Attachments: Stacktrace
stacktrace

Description Niklāvs Koļesņikovs 2024-09-20 04:45:25 UTC
First of all, sorry that I can't immediately provide the dmesg output - for some reason `journalctl -b -1` is perplexingly returning the current log rather than the one from the bisection. I'll try again and maybe I'll have better luck next time.

Despite that, I have bisected the boot failure to commit b5c58b2fdc427e7958412ecb2de2804a1f7c1572, which I'll append for convenience at the end. After booting (other times) into a working kernel, examining the journal of the broken kernel always shows stack traces of kernel NULL pointer dereferences within the amdgpu call stack. The issue is 100% reproducible, so I was easily able to bisect it.

Unfortunately, despite my best attempts at monkey-reverting that commit, my botched conflict resolution did not even compile.

I doubt this is relevant here, but I'll preemptively add that the system should have IOMMU and kCFI in use and that it's built with Clang 18.

============================================================================

b5c58b2fdc427e7958412ecb2de2804a1f7c1572 is the first bad commit
commit b5c58b2fdc427e7958412ecb2de2804a1f7c1572 (HEAD)
Author: Leon Romanovsky <leon@kernel.org>
Date:   Wed Jul 24 21:04:49 2024 +0300

    dma-mapping: direct calls for dma-iommu
    
    Directly call into dma-iommu just like we have been doing for dma-direct
    for a while.  This avoids the indirect call overhead for IOMMU ops and
    removes the need to have DMA ops entirely for many common configurations.
    
    Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
    Signed-off-by: Leon Romanovsky <leon@kernel.org>
    Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Acked-by: Robin Murphy <robin.murphy@arm.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>

<changes skipped for brevity>
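
The commit message above describes replacing indirect calls through a DMA ops table with direct calls into dma-iommu. The following is a minimal userspace sketch of that dispatch idea, not the kernel's code: every type, field, and function name here (`sketch_device`, `uses_iommu`, `sketch_pick_path`, and so on) is a hypothetical stand-in, and the order of the checks is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the kernel's real types. */
struct sketch_dma_ops {
	uint64_t (*map)(uint64_t phys);   /* reached only via an indirect call */
};

struct sketch_device {
	const struct sketch_dma_ops *ops; /* NULL when no ops table is installed */
	bool uses_iommu;                  /* models a per-device "use dma-iommu" flag */
};

enum sketch_path { PATH_DIRECT, PATH_IOMMU, PATH_OPS };

/* Toy backends: the arithmetic is arbitrary and only distinguishes the paths. */
static uint64_t sketch_direct_map(uint64_t phys) { return phys; }
static uint64_t sketch_iommu_map(uint64_t phys)  { return phys ^ 0xff000000ULL; }
static uint64_t sketch_ops_map(uint64_t phys)    { return phys + 0x1000; }

/* Decide which backend handles a mapping request for this device. */
static enum sketch_path sketch_pick_path(const struct sketch_device *dev)
{
	if (dev->uses_iommu)
		return PATH_IOMMU;   /* direct call into the IOMMU backend */
	if (dev->ops == NULL)
		return PATH_DIRECT;  /* direct call into the direct-mapping backend */
	return PATH_OPS;         /* fall back to the indirect ops table */
}

static uint64_t sketch_map(const struct sketch_device *dev, uint64_t phys)
{
	switch (sketch_pick_path(dev)) {
	case PATH_IOMMU:
		return sketch_iommu_map(phys);
	case PATH_DIRECT:
		return sketch_direct_map(phys);
	default:
		return dev->ops->map(phys); /* the only function-pointer call */
	}
}

/* Small self-check showing the three kinds of device. */
static void sketch_demo(void)
{
	const struct sketch_dma_ops legacy = { .map = sketch_ops_map };
	const struct sketch_device plain = { .ops = NULL, .uses_iommu = false };
	const struct sketch_device iommu = { .ops = NULL, .uses_iommu = true };
	const struct sketch_device other = { .ops = &legacy, .uses_iommu = false };

	assert(sketch_pick_path(&plain) == PATH_DIRECT);
	assert(sketch_pick_path(&iommu) == PATH_IOMMU);
	assert(sketch_pick_path(&other) == PATH_OPS);
	/* The first two devices both have a NULL ops pointer. */
	assert(plain.ops == NULL && iommu.ops == NULL);
	assert(sketch_map(&other, 0x2000) == 0x3000);
}
```

In this model, a device without an ops table is not necessarily a direct-mapping device, so any code that treats "no ops" as "direct mapping" would also need to consult the IOMMU flag.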
Comment 1 Niklāvs Koļesņikovs 2024-09-20 05:12:52 UTC
Created attachment 306902 [details]
Stacktrace

I was able to recover the latest stacktrace from the full journal and I'm uploading a cleaned up version.
Comment 2 Jean-Christophe Guillain 2024-09-20 08:31:22 UTC
I confirm the bug on my hardware.
I don't know if it can help, but I can attach my journal too...
Comment 3 Jean-Christophe Guillain 2024-09-20 08:32:29 UTC
Created attachment 306903 [details]
stacktrace
Comment 4 Niklāvs Koļesņikovs 2024-09-22 08:49:39 UTC
I have tried https://lore.kernel.org/lkml/Zu_FDfHZAVzPv1lq@infradead.org/ and it did not make any obvious difference (probably not surprising, since that patch involves USB but I wanted to report this before anyone asks about it).
Comment 5 leon 2024-09-22 10:29:45 UTC
Can you please try this patch?

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index b839683da0ba..cf3b89e681a3 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -926,7 +926,7 @@ bool dma_addressing_limited(struct device *dev)
                         dma_get_required_mask(dev))
                return true;
 
-       if (unlikely(ops))
+       if (unlikely(ops) || use_dma_iommu(dev)
                return false;
        return !dma_direct_all_ram_mapped(dev);
 }


Comment 6 Niklāvs Koļesņikovs 2024-09-22 19:12:44 UTC
Thank you for looking into this. I had to manually redo the patch to make it apply to the current git master branch and add an extra closing bracket to make the resulting file compile, but it did not seem to make a difference at boot time: still the same kernel NULL pointer dereference, and the stack trace appears to be unchanged as well.

Here's the patch I used:

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index b839683da0ba..f088a6462564 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -926,7 +926,7 @@ bool dma_addressing_limited(struct device *dev)
                         dma_get_required_mask(dev))
                return true;
 
-       if (unlikely(ops))
+       if (unlikely(ops) || use_dma_iommu(dev))
                return false;
        return !dma_direct_all_ram_mapped(dev);
 }
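
For readers following the one-line change, here is a rough userspace model of the decision that the patched `dma_addressing_limited()` appears to make. It is a sketch under assumptions, not the kernel function: `struct model_dev` and all of its fields are hypothetical stand-ins for the kernel helpers named in the diff, and the mask comparison in the first check is a reconstruction, since the diff context shows only its tail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the per-device state consulted by the check. */
struct model_dev {
	uint64_t dma_mask;       /* range the device says it can address */
	uint64_t bus_dma_limit;  /* 0 means "no bus limit" (assumption) */
	uint64_t required_mask;  /* models dma_get_required_mask(dev) */
	bool has_ops;            /* models a non-NULL ops pointer */
	bool uses_iommu;         /* models use_dma_iommu(dev) */
	bool all_ram_mapped;     /* models dma_direct_all_ram_mapped(dev) */
};

/* Smaller of two values, ignoring zeros; returns 0 only if both are 0. */
static uint64_t model_min_not_zero(uint64_t a, uint64_t b)
{
	if (a == 0)
		return b;
	if (b == 0)
		return a;
	return a < b ? a : b;
}

/*
 * patched == false models "if (unlikely(ops))",
 * patched == true  models "if (unlikely(ops) || use_dma_iommu(dev))".
 */
static bool model_addressing_limited(const struct model_dev *dev, bool patched)
{
	if (model_min_not_zero(dev->dma_mask, dev->bus_dma_limit) <
	    dev->required_mask)
		return true;

	if (dev->has_ops || (patched && dev->uses_iommu))
		return false;

	return !dev->all_ram_mapped;
}

/* Self-check: an IOMMU device with no ops table and wide enough masks. */
static void model_demo(void)
{
	const struct model_dev gpu = {
		.dma_mask = UINT64_MAX,
		.bus_dma_limit = 0,
		.required_mask = (1ULL << 36) - 1,
		.has_ops = false,
		.uses_iommu = true,
		.all_ram_mapped = false, /* direct-mapping state, assumed not set up */
	};

	/* Without the extra check, the device falls through to the direct test. */
	assert(model_addressing_limited(&gpu, false) == true);
	/* With it, the IOMMU device is reported as not limited. */
	assert(model_addressing_limited(&gpu, true) == false);
}
```

The model only shows how the added condition changes the return value for an IOMMU-backed device without an ops table; it makes no claim about how that value leads to the crash in amdgpu.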
Comment 7 Niklāvs Koļesņikovs 2024-09-22 19:14:42 UTC
And here's the patch I used without markdown formatting...

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index b839683da0ba..f088a6462564 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -926,7 +926,7 @@ bool dma_addressing_limited(struct device *dev)
                         dma_get_required_mask(dev))
                return true;
 
-       if (unlikely(ops))
+       if (unlikely(ops) || use_dma_iommu(dev))
                return false;
        return !dma_direct_all_ram_mapped(dev);
 }
Comment 8 leon 2024-09-23 05:46:38 UTC
Yes, this is what I ended up with.
The official patch has been posted:
https://lore.kernel.org/all/c2671009f217bfc3c9f7fbda2eefb0c44768fc3b.1727028402.git.leon@kernel.org/

Thanks for the report.

Comment 9 Niklāvs Koļesņikovs 2024-09-23 06:23:45 UTC
Thank you, Leon. Applying your mailing list patch on top of Christoph's does resolve the amdgpu issue.

Reported-and-tested-by: Niklāvs Koļesņikovs <pinkflames.linux@gmail.com>
Comment 10 Niklāvs Koļesņikovs 2024-09-24 20:34:36 UTC
The fix has landed upstream and will ship as part of 6.12-rc1. Closing.

Cheers.