Created attachment 283635
On recent kernels (at least since 5.1.15), video playback sometimes hard-resets
my POWER9 system with an AMD RX 580. Due to the nature of the crash, no kernel
logs appear to be produced by the event. The system's Hostboot firmware does
record the crash, though, and it produces a GUARD event that disables the CPU
slice the crash occurred on for subsequent boots until the event is cleared.
An upstream Hostboot issue has been submitted by another person who has
encountered this behavior as well:
I have been unable to reproduce the issue on kernel 5.0.9, indicating a
regression somewhere in between.
Attached is my 5.1.15 kernel config, the output of lspci -vv, and the contents
Created attachment 283637
Created attachment 283639
can you bisect?
Can't reproduce this on 5.1.17 with Polaris WX5100 and 18-core POWER9. Since both you and Timothy have dual-CPU systems with the GPU on the second CPU's PCIe, this could indicate that the problem is only affecting dual processor systems possibly in that specific configuration (alternatively, the problem could be fixed in 5.1.17 already...)
Never mind. It seems I was able to hit the same problem today. It did not happen during video playback, though; it happened randomly, after about a day or two of uptime.
I just tested and confirmed this bug is still present on the latest 5.2.0+ GIT HEAD. I can reliably reproduce this bug at will, but it is a somewhat involved process to get the machine up and running to the point where I can trigger it each time so bisect will be slow.
Maybe narrow it down first (i.e. find the last release that was good, by testing 5.1.14 first), and then bisect only the history between the last good and first bad release. We know 5.1.9 is good but we don't know which release between 5.1.9 and 5.1.15 introduced the problem.
I'm seeing this on 5.1.0 and up. 5.0.0+ was the last working version for me, I'm continuing the bisect.
That's strange. Both me and Shawn had 5.1.9 as a definitely working version.
I actually haven't tested 5.1.9, but I can confirm that 5.0.9 works fine.
I see, I misread then, because the working kernel I have here is 5.1.9... I did have it hang on 5.1.17, though.
I just double checked -- 5.0.21 works, 5.1.0 does not. Daniel, didn't you say your lockup was somewhat random? I can trigger this lockup every single time with a specific action, so is it possible 5.1.9 for you just hasn't hit the right combination of factors to trigger the lockup on your system?
Well, previously I had X11 running with 5.1.9 for ~2 weeks with no lockups; then I rebooted into 5.1.17 and suddenly it locked up the second day with a checkstop. Now back in 5.1.9, and nothing for almost 3 days already.
Looks like this is a case where it's fairly critical to have a known 100% reliable way to reproduce a bug like this. I'm bisecting further, so we should be able to get to the bottom of this soon.
Bisect shows that the failing commit is:
This is on a WX7100 GPU, the lockup is 100% repeatable after that patch goes in. Unfortunately it does not cleanly reverse on the 5.1 kernel so I can't verify this is the only problem with 5.1, but it's a start.
Any IBMers have ideas why this patch went in and why it would be causing "Cache line inhibited hit cacheable space" faults?
Created attachment 283799
test patch #1
Though I'm not familiar with this code, a quick spot check shows what I believe
to be an inconsistency with the commit's claim of functional identicality.
Namely, the previous caller of __dma_get_coherent_pfn (now
arch_dma_coherent_to_pfn) would explicitly modify the vm_area to mark it as
uncacheable in the !coherent case. It seems the new caller (dma_common_mmap)
does not do this. I have written a small patch to restore the previous
behavior (I think). Note that this probably isn't upstreamable since this fix
should probably go somewhere in the powerpc arch code rather than the dma core.
Tim, since you're the only one who can easily reproduce this,
would you mind giving this patch a shot?
On second glance, it seems I got it backwards. pgprot_noncache /is/ actually
being set via the default implementation of arch_dma_mmap_pgprot, but this
creates the opposite issue. In the coherent case, the vma is now marked
as noncache but in the previous implementation it was not. I'll post a new
patch to solve this by providing a powerpc implementation of
arch_dma_mmap_pgprot that only sets noncache in the !coherent case to match
the previous behavior.
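
To illustrate the difference described above, here is a minimal userspace sketch. It is not the attached patch and not kernel code: every type, macro, and function name below (the model_ and MODEL_ prefixes) is a hypothetical stand-in for the kernel's pgprot_t, struct device, and pgprot_noncached, chosen only to show the decision logic of the default hook versus a powerpc hook that sets noncache only in the !coherent case.

```c
/* Userspace model only. None of these definitions are the kernel's;
 * they are hypothetical stand-ins used to show the decision logic. */
#include <stdbool.h>

/* Hypothetical "cache-inhibited" protection bit. */
#define MODEL_PAGE_NO_CACHE 0x1UL

/* Stand-in for the kernel's pgprot_t. */
typedef struct { unsigned long bits; } model_pgprot_t;

/* Stand-in for struct device; only DMA coherency matters here. */
struct model_device {
	bool dma_coherent;
};

/* Stand-in for pgprot_noncached(): mark the mapping cache-inhibited. */
static model_pgprot_t model_pgprot_noncached(model_pgprot_t prot)
{
	prot.bits |= MODEL_PAGE_NO_CACHE;
	return prot;
}

/* Models the default hook as described in the comment above:
 * the mapping is marked noncached regardless of coherency. */
static model_pgprot_t model_default_mmap_pgprot(const struct model_device *dev,
						model_pgprot_t prot)
{
	(void)dev; /* coherency is ignored in this path */
	return model_pgprot_noncached(prot);
}

/* Models the proposed powerpc hook: noncached only when the device
 * is not DMA coherent, otherwise the protection bits are unchanged. */
static model_pgprot_t model_powerpc_mmap_pgprot(const struct model_device *dev,
						model_pgprot_t prot)
{
	if (!dev->dma_coherent)
		return model_pgprot_noncached(prot);
	return prot;
}
```

In this model, a coherent device keeps a cacheable mapping under the powerpc hook but gets a cache-inhibited one under the default hook, which is the mismatch suspected of producing the "Cache line inhibited hit cacheable space" fault.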
Created attachment 283801
test patch #2
Here's the new patch that should restore the previous behavior correctly.
Created attachment 283803
test patch #3
oops, missed a couple of includes and made a typo. Fixed those.
I was encountering a bug showing similar symptoms with a different trigger: For me, any attempt to play sound consistently and immediately crashed my system.
This was not the case with the 4.20 kernel, but was confirmed to happen with the 5.1 kernel. Git bisection pointed to the same patch Timothy identified as the one introducing the issue for me.
I can confirm that the patch provided by Shawn appears to fix the issue. Building a kernel with that patch applied to head (commit 22051d9c4a57d3b4a8b5a7407efc80c71c7bfb16) from linux.git provides me with a kernel that no longer crashes when I attempt to play sound.
Patch #3 confirmed working when applied against kernel 5.2.1. Thanks Shawn!
Great! I've posted the patch to the linuxppc-dev mailing list here: https://patchwork.ozlabs.org/patch/1133466/.
A nit, but might want to mark this bug as a regression and update the kernel version to 5.2.1?
Done. I've also updated the product/component to Platform Specific/PPC-64 since
this wasn't an issue with amdgpu after all.
This is in my fixes branch which I'll send to Linus later this week:
Now merged, thanks all.
If the fix isn't working please reopen.