Bug 150731
Summary: | amdgpu: segfault on unbind in sysfs; card becomes nonresponsive | ||
---|---|---|---|
Product: | Drivers | Reporter: | Jimi (JimiJames.Bove) |
Component: | Video(DRI - non Intel) | Assignee: | drivers_video-dri |
Status: | NEW --- | ||
Severity: | normal | CC: | JimiJames.Bove, laguest |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 4.6.4 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | X crash log, X post-crash log, dmesg log, dmesg log (amdgpu-pro) | ||
Description
Jimi
2016-07-29 22:41:06 UTC
Clarification: X doesn't crash at the moment I bind the card to amdgpu, because I never actually get to bind it to amdgpu myself. X crashes when I rescan PCI devices (after unbinding the card from whatever it was originally bound to, which in my case is always vfio-pci because I pass it into a QEMU/KVM virtual machine). When I log back in, the card has been automatically bound to amdgpu successfully.
Another clarification: the behavior is the same if I don't bind the card to amdgpu myself and let it be bound to amdgpu automatically on boot, which is how I usually test it.
I've now confirmed this issue on Fiji (R9 Fury) as well.
Created attachment 228411 [details]
X crash log
Here are my Xorg logs for when I unbind from vfio-pci, remove, and rescan, and X crashes and comes back with the card bound. The post-crash log is the Xorg.0.log file, which just shows X loading a desktop that uses both cards (although the AMD card has "Ignore" set to "true" since it's just meant for running games with the DRI_PRIME variable), and the crash log is the Xorg.0.log.old file, which captures the moment of the crash starting at time [456.336].
You can see there aren't any errors in there. It seems to be just reconfiguring the graphics because it noticed a new available card, and somehow that resulted in me being booted back to the login screen. And according to all the tutorials I've read on switching a card between vfio-pci and X, it shouldn't even be doing this on its own. It should be waiting for me to bind the card to amdgpu myself. Why is it doing it automatically and booting me out?
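For reference, the unbind, remove, and rescan sequence described above can be sketched roughly as follows. This is only an illustration of the standard sysfs PCI interface, not the exact commands used in this report: the function name `rescan_card` and the `SYSFS_PCI` variable are hypothetical (the variable exists only so the steps can be dry-run against a scratch directory), and the device address in the usage note is just an example.

```shell
#!/bin/bash
# Sketch of the unbind -> remove -> rescan sequence, assuming the
# standard sysfs PCI layout. SYSFS_PCI is a hypothetical override for
# dry runs; on a real system it defaults to /sys/bus/pci.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci}"

rescan_card() {
    local dev="$1"                          # PCI address of the card
    local devdir="$SYSFS_PCI/devices/$dev"

    # 1. Unbind from whatever driver currently owns the device, if any.
    if [ -e "$devdir/driver" ]; then
        echo "$dev" > "$devdir/driver/unbind"
    fi

    # 2. Remove the device from the PCI bus.
    echo 1 > "$devdir/remove"

    # 3. Rescan the bus so the device is rediscovered.
    echo 1 > "$SYSFS_PCI/rescan"
}
```

On a real system this would be run as root, for example `rescan_card 0000:03:00.0`. Step 3 is the point at which the crash described in this comment occurs.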
Created attachment 228421 [details]
X post-crash log
Created attachment 228431 [details]
dmesg log
And now I've tried unbinding it from amdgpu without X running at all, and it of course didn't work, confirming its kernel bug status. I've captured the dmesg log, and as far as I can tell, the part of the log that pertains to amdgpu is the stack trace starting at [1131.985756].
I should mention at this point, I think there are 2 different bugs going on. One bug is making it impossible to unbind any cards from the driver, and another bug is making X immediately bind itself to an amdgpu card the instant it becomes available and crash. The former is definitely something wrong with amdgpu in the kernel, but the latter could be X's fault--I don't know. Just in case, I've filed a report for X, too: https://bugs.freedesktop.org/show_bug.cgi?id=97313
Created attachment 228441 [details]
dmesg log (amdgpu-pro)
And here's the dmesg log from testing this with amdgpu-pro (without X running), with the crash starting at [137.003975].
amdgpu-pro exhibited almost exactly the same behavior. The only difference was instead of getting a segfault after a few seconds, the terminal session that unbound the card was immediately spammed with the dmesg stack trace in this attached file.
Over at the X bug report, I've figured out that when X has AutoAddGPU turned off, it doesn't crash (meaning that bug is not related to this bug). However, the card is still automatically bound to amdgpu before I can bind it, which means 2 things:
1. That's going to be a problem when I'm able to successfully unbinding my card from amdgpu and will need to be able to respond PCI devices without it auto-binding to amdgpu, because I'll want to bind it to vfio-pci.
2. That part of the X bug is actually its own bug, which makes sense, because the card would immediately auto-bind to amdgpu if I tested things without X running.
So, as far as this bug report is concerned, we actually have 2 bugs going on: the card can't be unbound, and the card is automatically bound on a rescan, stopping the user from having a choice in which driver it gets bound to. I think these 2 issues just may be related. Maybe they even have the same cause?
Sorry, autocorrect typos. Thing #1 is supposed to say that it's going to be a problem when I'm able to successfully unbind my card from amdgpu and will need to be able to rescan PCI devices without it auto-binding to amdgpu, because I'll want to bind it to vfio-pci.
I can confirm that the OS completely hangs when unbinding R9 380 (Tonga Pro) with X running. Works fine with X off. I have amdgpu and vfio-pci both in kernel, and used the following to unbind it.
    #!/bin/bash
    # For each PCI address given, unbind it from its current driver
    # and register its vendor/device ID with vfio-pci.
    for dev in "$@"; do
        vendor=$(cat "/sys/bus/pci/devices/$dev/vendor")
        device=$(cat "/sys/bus/pci/devices/$dev/device")
        if [ -e "/sys/bus/pci/devices/$dev/driver" ]; then
            echo "$dev" > "/sys/bus/pci/devices/$dev/driver/unbind"
        fi
        echo "$vendor $device" > /sys/bus/pci/drivers/vfio-pci/new_id
    done
lspci -nnk shows:
    03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Tonga PRO [Radeon R9 285/380] [1002:6939] (rev f1)
        Subsystem: PC Partner Limited / Sapphire Technology Radeon R9 380 Nitro 4G D5 [174b:e308]
        Kernel driver in use: vfio-pci
    03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Tonga HDMI Audio [Radeon R9 285/380] [1002:aad8]
        Subsystem: PC Partner Limited / Sapphire Technology Radeon R9 285/380 HDMI Audio [174b:aad8]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel
Turns out this bug has been getting ignored because of an extremely obscure fact about terrible bug report website organization that I'm sure has screwed many other people in the past: https://bugzilla.kernel.org/show_bug.cgi?id=195321#c5
Thankfully, someone else posted it in the right place recently: https://bugs.freedesktop.org/show_bug.cgi?id=100399
Let's add our voices to that.
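As a possible alternative to the `new_id` approach in the unbind script earlier in this thread, the kernel's per-device `driver_override` attribute can pin a single device to one driver, instead of registering a vendor/device ID that matches every card of that model. The sketch below is an untested assumption for this setup: the function name `bind_with_override` and the `SYSFS_PCI` variable are hypothetical, and nothing in this report confirms that it avoids the unbind crash or the auto-binding on rescan.

```shell
#!/bin/bash
# Sketch: bind one PCI device to a chosen driver via driver_override.
# SYSFS_PCI is a hypothetical override for dry runs against a scratch
# directory; on a real system it defaults to /sys/bus/pci.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci}"

bind_with_override() {
    local dev="$1"       # PCI address of the device
    local driver="$2"    # driver that should claim it, e.g. vfio-pci
    local devdir="$SYSFS_PCI/devices/$dev"

    # Restrict driver matching for this one device to the named driver.
    echo "$driver" > "$devdir/driver_override"

    # Release the device from its current driver, if it has one.
    if [ -e "$devdir/driver" ]; then
        echo "$dev" > "$devdir/driver/unbind"
    fi

    # Ask the PCI core to probe the device again.
    echo "$dev" > "$SYSFS_PCI/drivers_probe"
}
```

Run as root, a call would look like `bind_with_override 0000:03:00.0 vfio-pci`. Note that the unbind step is the same one that triggers the segfault reported here, so this only changes how the target driver is selected.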