Bug 150731 - amdgpu: segfault on unbind in sysfs; card becomes nonresponsive
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-07-29 22:41 UTC by Jimi
Modified: 2017-06-10 19:28 UTC
CC List: 2 users

See Also:
Kernel Version: 4.6.4
Subsystem:
Regression: No
Bisected commit-id:


Attachments
X crash log (58.34 KB, text/plain)
2016-08-11 23:32 UTC, Jimi
X post-crash log (61.87 KB, text/plain)
2016-08-11 23:33 UTC, Jimi
dmesg log (128.64 KB, text/plain)
2016-08-11 23:46 UTC, Jimi
dmesg log (amdgpu-pro) (107.61 KB, text/plain)
2016-08-12 01:09 UTC, Jimi

Description Jimi 2016-07-29 22:41:06 UTC
Full details here: https://www.reddit.com/r/linux_gaming/comments/4udupx/nvidiaamd_support_questions/d5ovipc

Summary:
I'm using an R9 380. Others confirmed having this issue on the R9 285 and RX 480 (so, Tonga & Polaris 10 at least).

I can bind my video card to amdgpu, and that works. It crashes X, but when I log back in, it's properly connected and everything.

However, if I try to unbind it, after waiting for a few seconds, I get a segfault. Any subsequent attempts to do anything with that card in sysfs--trying to unbind again, trying to bind to something else, etc.--will get stuck forever, never segfaulting, because the card is not responding.
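A minimal sketch of the unbind step described above, for anyone trying to reproduce it. The helper names and the SYSFS_ROOT variable are hypothetical (SYSFS_ROOT only exists so the functions can be pointed at a fake tree); the sysfs paths are the standard PCI ones. On an affected card, the write inside pci_unbind is presumably the one that segfaults or never returns.

```bash
#!/bin/bash
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the name of the driver a PCI device is bound to, or "none".
pci_current_driver() {
    local link="$SYSFS_ROOT/bus/pci/devices/$1/driver"
    if [ -L "$link" ]; then
        basename "$(readlink "$link")"
    else
        echo none
    fi
}

# Ask the bound driver to release the device; do nothing if it is unbound.
pci_unbind() {
    local dev="$1"
    [ -e "$SYSFS_ROOT/bus/pci/devices/$dev/driver" ] || return 0
    echo "$dev" > "$SYSFS_ROOT/bus/pci/devices/$dev/driver/unbind"
}

# Example (run as root, with the card's real address):
#   pci_current_driver 0000:0X:00.0
#   pci_unbind 0000:0X:00.0
```

Checking pci_current_driver before and after the unbind shows whether the driver link actually went away.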

Removing the card (echo 1 > /sys/bus/pci/devices/0000:0X:00.0/remove) works, but after a rescan (echo 1 > /sys/bus/pci/rescan), the card is no longer in sysfs at all, as if it's been powered down. It can't be accessed by the system in any way after that, until the computer reboots.
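The remove-and-rescan sequence above can be wrapped up roughly as follows. This is only a sketch: the function name, SYSFS_ROOT, and RESCAN_DELAY are hypothetical, and the one-second pause between the two writes is an assumption, not something the report specifies. The final check makes the "card is no longer in sysfs" symptom visible.

```bash
#!/bin/bash
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Remove a PCI device, rescan the bus, and report whether it came back.
pci_remove_and_rescan() {
    local dev="$1"
    local node="$SYSFS_ROOT/bus/pci/devices/$dev"
    if [ ! -e "$node" ]; then
        echo "no such device: $dev" >&2
        return 2
    fi
    echo 1 > "$node/remove"
    sleep "${RESCAN_DELAY:-1}"   # assumed settle time, adjust as needed
    echo 1 > "$SYSFS_ROOT/bus/pci/rescan"
    if [ -e "$node" ]; then
        echo present
    else
        echo missing
        return 1
    fi
}

# Example (run as root): pci_remove_and_rescan 0000:0X:00.0
```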

It may or may not be related to the "reset issues" bug:
http://vfio.blogspot.de/2015/04/progress-on-amd-front.html
https://lists.gnu.org/archive/html/qemu-devel/2015-04/msg03128.html
That bug officially only affects Hawaii and Bonaire, but Tonga cards (380, 285) exhibit the same behavior even if it may not be for the same reason. Whether it affects Polaris 10 (RX 480) is unknown. The RX 480 tester is currently finding that out.

I also had this issue on 4.6.1, so it probably at least affects 4.6 in general. Maybe all kernel versions that have amdgpu?
Comment 1 Jimi 2016-07-29 22:44:33 UTC
Clarification: X doesn't crash at the moment the card is bound to amdgpu, and I never actually get to bind it to amdgpu myself. X crashes when I rescan PCI devices (after unbinding the card from whatever it was originally bound to, which in my case is always vfio-pci because I pass it into a QEMU/KVM virtual machine). When I log back in, the card has already been bound to amdgpu automatically.
Comment 2 Jimi 2016-07-29 22:45:47 UTC
Another clarification: the behavior is the same if I don't bind the card to amdgpu myself and let it be bound to amdgpu on boot, automatically, which is how I usually test it.
Comment 3 Jimi 2016-08-09 16:50:21 UTC
I've now confirmed this issue on Fiji (R9 Fury) as well.
Comment 4 Jimi 2016-08-11 23:32:58 UTC
Created attachment 228411
X crash log

Here are my Xorg logs for when I unbind from vfio-pci, remove, and rescan, and X crashes and comes back with the card bound. The post-crash log is the Xorg.0.log file, which just shows X loading a desktop that uses both cards (although the AMD card has "Ignore" set to "true" since it's just meant for running games with the DRI_PRIME variable), and the crash log is the Xorg.0.log.old file, which captures the moment of the crash starting at time [456.336].

You can see there aren't any errors in there. It seems to be just reconfiguring the graphics because it noticed a new available card, and somehow that resulted in me being booted back to the login screen. And according to all the tutorials I've read on switching a card between vfio-pci and X, it shouldn't even be doing this on its own. It should be waiting for me to bind the card to amdgpu myself. Why is it doing it automatically and booting me out?
Comment 5 Jimi 2016-08-11 23:33:16 UTC
Created attachment 228421
X post-crash log
Comment 6 Jimi 2016-08-11 23:46:49 UTC
Created attachment 228431
dmesg log

And now I've tried unbinding it from amdgpu without X running at all, and it still failed, which confirms this is a kernel bug. I've captured the dmesg log, and as far as I can tell, the part of the log that pertains to amdgpu is the stack trace starting at [1131.985756].
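To pull only the relevant part out of a long dmesg log, a small filter keyed on the timestamp may help. This is a sketch with a hypothetical function name; it assumes the default "[seconds.micros]" prefix and drops any line that lacks one.

```bash
#!/bin/bash
# Print dmesg-style lines whose bracketed timestamp is >= the given start time.
dmesg_since() {
    awk -v start="$1" '
        match($0, /^\[ *[0-9]+\.[0-9]+\]/) {
            ts = substr($0, 2, RLENGTH - 2)   # text between the brackets
            if (ts + 0 >= start + 0) print
        }'
}

# Example: dmesg | dmesg_since 1131.985756
```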
Comment 7 Jimi 2016-08-12 00:16:57 UTC
I should mention at this point, I think there are 2 different bugs going on. One bug is making it impossible to unbind any cards from the driver, and another bug is making X immediately bind itself to an amdgpu card the instant it becomes available and crash. The former is definitely something wrong with amdgpu in the kernel, but the latter could be X's fault--I don't know. Just in case, I've filed a report for X, too: https://bugs.freedesktop.org/show_bug.cgi?id=97313
Comment 8 Jimi 2016-08-12 01:09:50 UTC
Created attachment 228441
dmesg log (amdgpu-pro)

And here's the dmesg log from testing this with amdgpu-pro (without X running), with the crash starting at [137.003975].

amdgpu-pro exhibited almost exactly the same behavior. The only difference was that instead of getting a segfault after a few seconds, the terminal session that unbound the card was immediately spammed with the dmesg stack trace in the attached file.
Comment 9 Jimi 2016-08-16 09:21:12 UTC
Over at the X bug report, I've figured out that when X has AutoAddGPU turned off, it doesn't crash (meaning that bug is not related to this bug). However, the card is still automatically bound to amdgpu before I can bind it, which means 2 things:
1. That's going to be a problem when I'm able to successfully unbinding my card from amdgpu and will need to be able to respond PCI devices without it auto-binding to amdgpu, because I'll want to bind it to vfio-pci.
2. That part of the X bug is actually its own bug, which makes sense, because the card would immediately auto-bind to amdgpu if I tested things without X running.

So, as far as this bug report is concerned, we actually have 2 bugs going on: the card can't be unbound, and the card is automatically bound on a rescan, stopping the user from having a choice in which driver it gets bound to. I think these 2 issues just may be related. Maybe they even have the same cause?
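For reference, AutoAddGPU is an option in the ServerFlags section of the X server configuration. The helper below just prints such a snippet; the function name and the file name in the example are hypothetical, and the exact configuration used in the X bug report is not shown in this thread.

```bash
#!/bin/bash
# Print an xorg.conf.d snippet that turns off automatic GPU hot-add in X.
autoaddgpu_off_snippet() {
    cat <<'EOF'
Section "ServerFlags"
    Option "AutoAddGPU" "off"
EndSection
EOF
}

# Example (file name is an assumption):
#   autoaddgpu_off_snippet > /etc/X11/xorg.conf.d/10-no-autoaddgpu.conf
```

This only changes what X does when a card appears; it has no effect on the kernel binding the card to amdgpu.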
Comment 10 Jimi 2016-08-16 09:23:32 UTC
Sorry, autocorrect typos. Thing #1 is supposed to say that it's going to be a problem when I'm able to successfully unbind my card from amdgpu and will need to be able to rescan PCI devices without it auto-binding to amdgpu, because I'll want to bind it to vfio-pci.
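One possible way to rescan without the card being auto-bound is to switch off bus-wide autoprobing for the duration of the rescan and then bind the card by hand. The sketch below uses the standard drivers_autoprobe, driver_override, and drivers_probe sysfs files; the function name and SYSFS_ROOT are hypothetical, and whether this actually sidesteps the behavior described in this report is untested.

```bash
#!/bin/bash
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Rescan with autoprobe disabled, then bind one device to the chosen driver.
rescan_and_bind() {
    local dev="$1" drv="$2" bus="$SYSFS_ROOT/bus/pci" saved status=0
    saved=$(cat "$bus/drivers_autoprobe")
    echo 0 > "$bus/drivers_autoprobe"      # stop drivers from claiming new devices
    echo 1 > "$bus/rescan"
    if [ -e "$bus/devices/$dev" ]; then
        echo "$drv" > "$bus/devices/$dev/driver_override"
        echo "$dev" > "$bus/drivers_probe" # probe only this device
    else
        status=1                           # device did not reappear
    fi
    echo "$saved" > "$bus/drivers_autoprobe"
    return "$status"
}

# Example (run as root): rescan_and_bind 0000:0X:00.0 vfio-pci
```

Two caveats: any other device discovered during the same rescan is also left unbound, and driver_override stays set until it is cleared.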
Comment 11 Luke A. Guest 2017-06-08 14:27:20 UTC
I can confirm that the OS completely hangs when unbinding R9 380 (Tonga Pro) with X running. Works fine with X off.

I have amdgpu and vfio-pci both in the kernel and used the following script to unbind it:

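#!/bin/bash
# Usage: pass one or more full PCI addresses as arguments
for dev in "$@"; do
        sysdev="/sys/bus/pci/devices/$dev"
        vendor=$(cat "$sysdev/vendor")
        device=$(cat "$sysdev/device")
        # Release the device from whatever driver currently owns it
        if [ -e "$sysdev/driver" ]; then
                echo "$dev" > "$sysdev/driver/unbind"
        fi
        # Register this vendor:device ID with vfio-pci so it claims the device
        echo "$vendor $device" > /sys/bus/pci/drivers/vfio-pci/new_id
done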

lspci -nnk shows:

03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Tonga PRO [Radeon R9 285/380] [1002:6939] (rev f1)
        Subsystem: PC Partner Limited / Sapphire Technology Radeon R9 380 Nitro 4G D5 [174b:e308]
        Kernel driver in use: vfio-pci
03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Tonga HDMI Audio [Radeon R9 285/380] [1002:aad8]
        Subsystem: PC Partner Limited / Sapphire Technology Radeon R9 285/380 HDMI Audio [174b:aad8]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel
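To check the result of a bind or unbind at a glance, output like the above can be reduced to slot and driver pairs. A small sketch, with a hypothetical function name, that assumes the usual lspci -nnk layout (slot at the start of an unindented line, "Kernel driver in use:" on an indented one):

```bash
#!/bin/bash
# Reduce `lspci -nnk` output to "slot driver" pairs for devices with a driver bound.
lspci_drivers() {
    awk '
        /^[0-9a-f]/             { slot = $1 }        # device header line
        /Kernel driver in use:/ { print slot, $NF }  # driver name is the last field
    '
}

# Example: lspci -nnk | lspci_drivers
```

Devices with no driver bound produce no line at all, so a missing slot means the device is unbound.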
Comment 12 Jimi 2017-06-10 19:28:05 UTC
Turns out this bug has been getting ignored because of an extremely obscure fact about terrible bug report website organization, that I'm sure has screwed many other people in the past: https://bugzilla.kernel.org/show_bug.cgi?id=195321#c5

Thankfully, someone else posted it in the right place recently: https://bugs.freedesktop.org/show_bug.cgi?id=100399

Let's add our voices to that.
