Bug 204145

Summary: amdgpu video playback causes host to hard reset (checkstop) on POWER9 with RX 580
Product: Platform Specific/Hardware
Reporter: Shawn Anastasio (shawn)
Component: PPC-64
Assignee: drivers_video-dri
Status: CLOSED CODE_FIX
Severity: high
CC: alexdeucher, dan, linux, michael, psppsn96, robert, tpearson
Priority: P1
Hardware: PPC-64
OS: Linux
Kernel Version: 5.2.1
Tree: Mainline
Regression: Yes
Attachments: 5.1.15 kconfig
/proc/cpuinfo
lspci -vv
test patch #1
test patch #2
test patch #3

Description Shawn Anastasio 2019-07-12 02:43:41 UTC
Created attachment 283635 [details]
5.1.15 kconfig

On recent kernels (at least since 5.1.15), video playback will (sometimes) hard 
reset my POWER9 system with an AMD RX 580. Due to the nature of the crash, it
does not seem that any kernel logs are produced by the event. The system
Hostboot firmware does record the crash though, and it produces a GUARD event
which disables the CPU slice the crash occurred on for subsequent boots until 
it is cleared.

An upstream Hostboot issue has been submitted by another person who has
encountered this behavior as well:
https://github.com/open-power/hostboot/issues/180

I have been unable to reproduce the issue on kernel 5.0.9, indicating a 
regression somewhere in between.

Attached is my 5.1.15 kernel config, the output of lspci -vv, and the contents
of /proc/cpuinfo.
Comment 1 Shawn Anastasio 2019-07-12 02:44:12 UTC
Created attachment 283637 [details]
/proc/cpuinfo
Comment 2 Shawn Anastasio 2019-07-12 02:44:24 UTC
Created attachment 283639 [details]
lspci -vv
Comment 3 Alex Deucher 2019-07-12 02:59:54 UTC
Can you bisect?
Comment 4 Daniel Kolesa 2019-07-13 13:40:58 UTC
Can't reproduce this on 5.1.17 with a Polaris WX5100 and an 18-core POWER9. Since both you and Timothy have dual-CPU systems with the GPU on the second CPU's PCIe, the problem may only affect dual-processor systems, possibly only in that specific configuration. Alternatively, it could already be fixed in 5.1.17.
Comment 5 Daniel Kolesa 2019-07-14 15:35:20 UTC
Never mind. It seems I was able to hit the same problem today. It did not happen during video playback though, just randomly, after about a day or two of uptime.
Comment 6 Timothy Pearson 2019-07-16 00:13:32 UTC
I just tested and confirmed this bug is still present on the latest 5.2.0+ GIT HEAD.  I can reliably reproduce this bug at will, but it is a somewhat involved process to get the machine up and running to the point where I can trigger it each time so bisect will be slow.
Comment 7 Daniel Kolesa 2019-07-16 00:16:59 UTC
Maybe narrow it down first (i.e. find the last release that was good, by testing 5.1.14 first), and then bisect only the history between the last good and the first bad release. We know 5.1.9 is good but we don't know which release between 5.1.9 and 5.1.15 introduced the problem.
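
To illustrate why narrowing the range first helps, here is a minimal sketch of the search that bisection performs, written as a standalone C model rather than anything git itself runs. The names first_bad, is_bad_fn and example_is_bad are hypothetical, and revisions are modelled as plain indices. It assumes every revision after the first bad one is also bad.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical test callback: returns true if revision `rev` shows the bug. */
typedef bool (*is_bad_fn)(size_t rev);

/*
 * Return the index of the first bad revision in the range (good, bad].
 * Preconditions: `good` is known good, `bad` is known bad, good < bad,
 * and every revision after the first bad one is also bad.
 */
size_t first_bad(size_t good, size_t bad, is_bad_fn is_bad)
{
	assert(good < bad);

	while (bad - good > 1) {
		/* Test the midpoint and keep the half that still contains the change. */
		size_t mid = good + (bad - good) / 2;

		if (is_bad(mid))
			bad = mid;
		else
			good = mid;
	}
	return bad;
}

/* Hypothetical stand-in for "boot this revision and try to trigger the crash". */
bool example_is_bad(size_t rev)
{
	return rev >= 42;
}
```

Each test halves the remaining range, so the number of kernels to build and boot grows with the logarithm of the distance between the known good and known bad points. Starting from two adjacent releases instead of two distant ones saves a few of those slow test cycles.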
Comment 8 Timothy Pearson 2019-07-16 09:49:36 UTC
I'm seeing this on 5.1.0 and up.  5.0.0+ was the last working version for me; I'm continuing the bisect.
Comment 9 Daniel Kolesa 2019-07-16 13:01:55 UTC
That's strange. Both me and Shawn had 5.1.9 as a definitely working version.
Comment 10 Shawn Anastasio 2019-07-16 18:54:50 UTC
I actually haven't tested 5.1.9, but I can confirm that 5.0.9 works fine.
Comment 11 Daniel Kolesa 2019-07-16 18:55:35 UTC
I see, I misread then, because the working kernel I have here is 5.1.9... I did have it hang on 5.1.17, though.
Comment 12 Timothy Pearson 2019-07-16 20:18:19 UTC
I just double checked -- 5.0.21 works, 5.1.0 does not.  Daniel, didn't you say your lockup was somewhat random?  I can trigger this lockup every single time with a specific action, so is it possible 5.1.9 for you just hasn't hit the right combination of factors to trigger the lockup on your system?
Comment 13 Daniel Kolesa 2019-07-16 20:20:22 UTC
Well, previously I had X11 running with 5.1.9 for ~2 weeks with no lockups; then I rebooted into 5.1.17 and suddenly it locked up the second day with a checkstop. Now back in 5.1.9, and nothing for almost 3 days already.
Comment 14 Timothy Pearson 2019-07-16 20:26:34 UTC
Looks like this is a case where it's fairly critical to have a known 100% reliable way to reproduce a bug like this.  I'm bisecting further, so we should be able to get to the bottom of this soon.
Comment 15 Timothy Pearson 2019-07-17 09:36:08 UTC
Bisect shows that the failing commit is:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.1.y&id=6666cc17d7802b7dcbb073e7be1eee2cf6fa64d9

This is on a WX7100 GPU; the lockup is 100% repeatable after that patch goes in.  Unfortunately it does not revert cleanly on the 5.1 kernel so I can't verify this is the only problem with 5.1, but it's a start.

Any IBMers have ideas why this patch went in and why it would be causing "Cache line inhibited hit cacheable space" faults?
Comment 16 Shawn Anastasio 2019-07-17 19:12:39 UTC
Created attachment 283799 [details]
test patch #1

Though I'm not familiar with this code, a quick spot check shows what I believe 
to be an inconsistency with the commit's claim of functional equivalence. 
Namely, the previous caller of __dma_get_coherent_pfn (now 
arch_dma_coherent_to_pfn) would explicitly modify the vm_area to mark it as 
uncacheable in the !coherent case. It seems the new caller (dma_common_mmap)
does not do this. I have written a small patch to restore the previous 
behavior (I think). Note that this probably isn't upstreamable since this fix 
should probably go somewhere in the powerpc arch code rather than the dma core.

Tim, since you're the only one who can easily reproduce this,
would you mind giving this patch a shot?
Comment 17 Shawn Anastasio 2019-07-17 19:42:26 UTC
On second glance, it seems I got it backwards. pgprot_noncache /is/ actually
being set via the default implementation of arch_dma_mmap_pgprot, but this
creates the opposite issue. In the coherent case, the vma is now marked
as noncache but in the previous implementation it was not. I'll post a new
patch to solve this by providing a powerpc implementation of
arch_dma_mmap_pgprot that only sets noncache in the !coherent case to match
the previous behavior.
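
As a rough illustration of the difference described above, here is a userspace model in C. It is a sketch and not the attached patch. The types and names (model_pgprot_t, MODEL_PAGE_NO_CACHE, struct model_device and the model_* functions) are hypothetical stand-ins for the kernel's pgprot and device structures; only the decision logic is meant to match the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for kernel types; these are not the real definitions. */
typedef unsigned long model_pgprot_t;
#define MODEL_PAGE_NO_CACHE 0x1UL	/* hypothetical "cache-inhibited" bit */

struct model_device {
	bool dma_coherent;	/* models whether the device does coherent DMA */
};

static model_pgprot_t model_pgprot_noncached(model_pgprot_t prot)
{
	return prot | MODEL_PAGE_NO_CACHE;
}

/* Models the default described above: the mapping is always marked noncache. */
model_pgprot_t model_default_mmap_pgprot(const struct model_device *dev,
					 model_pgprot_t prot)
{
	assert(dev != NULL);
	(void)dev;	/* coherency is ignored in this variant */
	return model_pgprot_noncached(prot);
}

/* Models the proposed override: noncache only in the !coherent case. */
model_pgprot_t model_powerpc_mmap_pgprot(const struct model_device *dev,
					 model_pgprot_t prot)
{
	assert(dev != NULL);
	if (!dev->dma_coherent)
		return model_pgprot_noncached(prot);
	return prot;
}
```

In this model the two variants agree for a non-coherent device and differ only for a coherent one, where the default marks the mapping noncache and the override leaves the protection bits unchanged. That coherent case is the one the comment above identifies as the change in behavior.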
Comment 18 Shawn Anastasio 2019-07-17 20:02:33 UTC
Created attachment 283801 [details]
test patch #2

Here's the new patch that should restore the previous behavior correctly.
Comment 19 Shawn Anastasio 2019-07-17 20:30:27 UTC
Created attachment 283803 [details]
test patch #3

Oops, missed a couple of includes and made a typo. Fixed those.
Comment 20 Robert Bridge 2019-07-17 22:09:41 UTC
I was encountering a bug showing similar symptoms with a different trigger: For me, any attempt to play sound consistently and immediately crashed my system. 

This was not the case with the 4.20 kernel, but I confirmed it happens with the 5.1 kernel. Git bisection identified the same commit Timothy found as the one introducing the issue for me.

I can confirm that the patch provided by Shawn appears to fix the issue. Building a kernel with that patch applied to head (commit 22051d9c4a57d3b4a8b5a7407efc80c71c7bfb16) from linux.git provides me with a kernel that no longer crashes when I attempt to play sound.
Comment 21 Timothy Pearson 2019-07-17 23:20:18 UTC
Patch #3 confirmed working when applied against kernel 5.2.1.  Thanks Shawn!
Comment 22 Shawn Anastasio 2019-07-17 23:58:35 UTC
Great! I've posted the patch to the linuxppc-dev mailing list here: https://patchwork.ozlabs.org/patch/1133466/.
Comment 23 Timothy Pearson 2019-07-18 00:39:38 UTC
A nit, but might want to mark this bug as a regression and update the kernel version to 5.2.1?
Comment 24 Shawn Anastasio 2019-07-18 00:50:32 UTC
Done. I've also updated the product/component to Platform Specific/PPC-64 since
this wasn't an issue with amdgpu after all.
Comment 25 Michael Ellerman 2019-07-22 23:11:37 UTC
This is in my fixes branch which I'll send to Linus later this week:

https://git.kernel.org/powerpc/c/b4fc36e60f25cf22bf8b7b015a701015740c3743
Comment 26 Michael Ellerman 2019-07-30 05:28:18 UTC
Now merged, thanks all.

If the fix isn't working please reopen.