Bug 219292 - Commit b5c58b2fdc42 causes a kernel null pointer dereference with amdgpu
Summary: Commit b5c58b2fdc42 causes a kernel null pointer dereference with amdgpu
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: IOMMU
Hardware: All Linux
Importance: P3 normal
Assignee: drivers_iommu
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-09-20 04:45 UTC by Niklāvs Koļesņikovs
Modified: 2024-09-24 20:34 UTC

See Also:
Kernel Version: Linux 6.12 pre-rc1 commit b5c58b2fdc42
Subsystem:
Regression: Yes
Bisected commit-id: b5c58b2fdc427e7958412ecb2de2804a1f7c1572


Attachments
Stacktrace (4.41 KB, text/plain)
2024-09-20 05:12 UTC, Niklāvs Koļesņikovs
stacktrace (8.64 KB, text/plain)
2024-09-20 08:32 UTC, Jean-Christophe Guillain

Description Niklāvs Koļesņikovs 2024-09-20 04:45:25 UTC
First of all, sorry that I can't immediately provide the dmesg output: for some reason `journalctl -b -1` is perplexingly returning the current log rather than the one from the bisection. I'll try again and maybe I'll have better luck next time.

Despite that, I have bisected the boot failure to commit b5c58b2fdc427e7958412ecb2de2804a1f7c1572, which I'll append at the end for convenience. Whenever I boot into a working kernel afterwards and examine the journal of the broken kernel, it always shows stack traces of kernel NULL pointer dereferences within the amdgpu call stack. The issue is 100% reproducible, so I was easily able to bisect it.

Unfortunately, despite my best attempts at monkey-reverting that commit, my botched conflict resolution did not even compile.

I doubt this is relevant here, but I'll preemptively add that the system should have IOMMU and kCFI in use and that it's built with Clang 18.

============================================================================

b5c58b2fdc427e7958412ecb2de2804a1f7c1572 is the first bad commit
commit b5c58b2fdc427e7958412ecb2de2804a1f7c1572 (HEAD)
Author: Leon Romanovsky <leon@kernel.org>
Date:   Wed Jul 24 21:04:49 2024 +0300

    dma-mapping: direct calls for dma-iommu
    
    Directly call into dma-iommu just like we have been doing for dma-direct
    for a while.  This avoids the indirect call overhead for IOMMU ops and
    removes the need to have DMA ops entirely for many common configurations.
    
    Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
    Signed-off-by: Leon Romanovsky <leon@kernel.org>
    Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Acked-by: Robin Murphy <robin.murphy@arm.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
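
The dispatch idea described in the commit message above can be sketched in plain C. This is a simplified userspace model, not the kernel code: the `struct device` fields, the `enum dma_path` values, and the helpers `map_page`, `direct_map_page`, `iommu_map_page`, `legacy_map_page` and `legacy_ops` are all hypothetical stand-ins invented for illustration. It only shows the general pattern of calling the common backends directly and falling back to an indirect call through an ops table for everything else.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT the kernel's real definitions. */
enum dma_path { PATH_DIRECT, PATH_IOMMU, PATH_OPS };

struct dma_map_ops {
	enum dma_path (*map_page)(void);
};

struct device {
	const struct dma_map_ops *dma_ops; /* NULL when no ops table is installed */
	bool dma_iommu;                    /* true when dma-iommu handles the device */
};

static enum dma_path direct_map_page(void) { return PATH_DIRECT; }
static enum dma_path iommu_map_page(void)  { return PATH_IOMMU; }
static enum dma_path legacy_map_page(void) { return PATH_OPS; }

/* An ops table as some other (hypothetical) DMA backend might install it. */
static const struct dma_map_ops legacy_ops = { .map_page = legacy_map_page };

/* Pick the mapping backend; only the last branch is an indirect call. */
static enum dma_path map_page(const struct device *dev)
{
	assert(dev != NULL);
	if (dev->dma_iommu)
		return iommu_map_page();   /* direct call into the IOMMU backend */
	if (!dev->dma_ops)
		return direct_map_page();  /* direct call into the direct backend */
	return dev->dma_ops->map_page();   /* indirect call through the ops table */
}
```

In this model a device with `dma_iommu` set and `dma_ops == NULL` takes the IOMMU branch without ever touching an ops table, which is the property that matters for the rest of this report: code that still tests the ops pointer alone can no longer recognise such a device.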

<changes skipped for brevity>
Comment 1 Niklāvs Koļesņikovs 2024-09-20 05:12:52 UTC
Created attachment 306902
Stacktrace

I was able to recover the latest stacktrace from the full journal and I'm uploading a cleaned up version.
Comment 2 Jean-Christophe Guillain 2024-09-20 08:31:22 UTC
I confirm the bug on my hardware.
I don't know if it can help, but I can attach my journal too...
Comment 3 Jean-Christophe Guillain 2024-09-20 08:32:29 UTC
Created attachment 306903
stacktrace
Comment 4 Niklāvs Koļesņikovs 2024-09-22 08:49:39 UTC
I have tried https://lore.kernel.org/lkml/Zu_FDfHZAVzPv1lq@infradead.org/ and it did not make any obvious difference (probably not surprising, since that patch involves USB but I wanted to report this before anyone asks about it).
Comment 5 leon 2024-09-22 10:29:45 UTC
Can you please try this patch?

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index b839683da0ba..cf3b89e681a3 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -926,7 +926,7 @@ bool dma_addressing_limited(struct device *dev)
                         dma_get_required_mask(dev))
                return true;
 
-       if (unlikely(ops))
+       if (unlikely(ops) || use_dma_iommu(dev)
                return false;
        return !dma_direct_all_ram_mapped(dev);
 }


Comment 6 Niklāvs Koļesņikovs 2024-09-22 19:12:44 UTC
Thank you for looking into this. I had to manually redo the patch to make it apply to the current git master branch and add an extra closing parenthesis to make the resulting file compile, but it did not seem to make a difference at boot time: still the same kernel NULL pointer dereference, and the stack trace appears to be unchanged as well.

Here's the patch I used:

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index b839683da0ba..f088a6462564 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -926,7 +926,7 @@ bool dma_addressing_limited(struct device *dev)
                         dma_get_required_mask(dev))
                return true;
 
-       if (unlikely(ops))
+       if (unlikely(ops) || use_dma_iommu(dev))
                return false;
        return !dma_direct_all_ram_mapped(dev);
 }
Comment 7 Niklāvs Koļesņikovs 2024-09-22 19:14:42 UTC
And here's the patch I used without markdown formatting...

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index b839683da0ba..f088a6462564 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -926,7 +926,7 @@ bool dma_addressing_limited(struct device *dev)
                         dma_get_required_mask(dev))
                return true;
 
-       if (unlikely(ops))
+       if (unlikely(ops) || use_dma_iommu(dev))
                return false;
        return !dma_direct_all_ram_mapped(dev);
 }
Comment 8 leon 2024-09-23 05:46:38 UTC
Yes, this is what I ended up with.
The official patch was posted:
https://lore.kernel.org/all/c2671009f217bfc3c9f7fbda2eefb0c44768fc3b.1727028402.git.leon@kernel.org/

Thanks for the report.
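
For readers following the one-line change in `dma_addressing_limited()`, here is a small userspace model of the control flow before and after the patch. It is a sketch under stated assumptions, not the kernel implementation: the struct fields `mask_too_small` and `all_ram_mapped`, the counter `direct_checks_on_iommu_dev`, and the functions `direct_all_ram_mapped`, `addressing_limited_before` and `addressing_limited_after` are hypothetical names made up for this example. One plausible reading of the fix, which the thread itself does not spell out, is that a device handled by dma-iommu no longer has an ops table, so a check on `ops` alone lets it fall through to the direct-mapping helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT the kernel's real definitions. */
struct dma_map_ops { int unused; };

struct device {
	const struct dma_map_ops *dma_ops; /* NULL when no ops table is installed */
	bool dma_iommu;      /* models use_dma_iommu(dev) */
	bool mask_too_small; /* models the DMA mask vs. required mask comparison */
	bool all_ram_mapped; /* models the result of the direct-mapping helper */
};

/* Counts how often the direct-mapping helper runs for an IOMMU device. */
static int direct_checks_on_iommu_dev;

static bool direct_all_ram_mapped(const struct device *dev)
{
	assert(dev != NULL);
	if (dev->dma_iommu)
		direct_checks_on_iommu_dev++; /* the fall-through the patch avoids */
	return dev->all_ram_mapped;
}

/* Shape of the check before the patch: only the ops pointer is tested. */
static bool addressing_limited_before(const struct device *dev)
{
	if (dev->mask_too_small)
		return true;
	if (dev->dma_ops)
		return false;
	return !direct_all_ram_mapped(dev);
}

/* Shape of the check after the patch: IOMMU devices return early as well. */
static bool addressing_limited_after(const struct device *dev)
{
	if (dev->mask_too_small)
		return true;
	if (dev->dma_ops || dev->dma_iommu)
		return false;
	return !direct_all_ram_mapped(dev);
}
```

With a device that has `dma_iommu` set and no ops table, `addressing_limited_before` reaches `direct_all_ram_mapped`, while `addressing_limited_after` returns `false` without calling it. The model says nothing about where exactly the real kernel dereferenced the NULL pointer; the attached stack traces are the authority on that.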

Comment 9 Niklāvs Koļesņikovs 2024-09-23 06:23:45 UTC
Thank you, Leon. Applying your mailing list patch on top of Christoph's patch does resolve the amdgpu issue.

Reported-and-tested-by: Niklāvs Koļesņikovs <pinkflames.linux@gmail.com>
Comment 10 Niklāvs Koļesņikovs 2024-09-24 20:34:36 UTC
The fix has landed upstream and will ship as part of 6.12-rc1. Closing.

Cheers.
