Bug 194559

Summary: amdgpu problems loading 2 firmwares on multi-smp system
Product: Drivers
Reporter: Janpieter Sollie (janpieter.sollie)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Severity: normal
CC: deathsimple, fin4478, janpieter.sollie
Priority: P1
Hardware: x86-64
OS: Linux
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=194731
          https://bugzilla.kernel.org/show_bug.cgi?id=194899
Kernel Version: 4.9.9
Subsystem:
Regression: No
Bisected commit-id:
Attachments: dmesg.txt, lspci.txt and .config
             config of working drm-next kernel

Description Janpieter Sollie 2017-02-12 14:34:36 UTC
Created attachment 254705
dmesg.txt, lspci.txt and .config

System: Opteron 2876 (x2), 128 GB RAM, x86_64.
VGA1: SI, Cape Verde Pro.
VGA2: R9 Nano.
VGA3: onboard mgag200.
Distribution: Gentoo.
Kernel: vanilla-sources-4.9.9.
Kernel loader: LILO 24.
Firmware: linux-firmware-20170126.
Bug: loading amdgpu on this system causes a reboot. Even with panic disabled, the kernel does not wait for me to reset the system. The issue also occurs when booting the system with init=/bin/bash and then running modprobe amdgpu.
Workarounds:
1) Removing /lib/firmware/radeon.
2) Removing /lib/firmware/amdgpu.
3) Booting the kernel with nosmp.
Each of these avoids the reboot, but leaves hardware uninitialized.
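A minimal sketch of workarounds 1 and 2 that renames the firmware directory instead of deleting it, so the change can be reverted. The function names and the .disabled suffix are illustrative assumptions, not part of the report, and this does not cover firmware that is built into the kernel or copied into an initramfs.

```shell
# Hypothetical helpers (not from the report): set a firmware directory aside
# instead of deleting it, so the workaround can be undone later.
# $1 = driver name (radeon or amdgpu), $2 = firmware root (default /lib/firmware)
set_aside_firmware() {
    fw_root="${2:-/lib/firmware}"
    src="$fw_root/$1"
    dst="$fw_root/$1.disabled"
    if [ ! -d "$src" ]; then
        echo "no such directory: $src" >&2
        return 1
    fi
    if [ -e "$dst" ]; then
        echo "refusing to overwrite: $dst" >&2
        return 1
    fi
    mv "$src" "$dst"
}

# Undo set_aside_firmware for the same driver name and firmware root.
restore_firmware() {
    fw_root="${2:-/lib/firmware}"
    [ -d "$fw_root/$1.disabled" ] || return 1
    [ ! -e "$fw_root/$1" ] || return 1
    mv "$fw_root/$1.disabled" "$fw_root/$1"
}
```

Usage, as root: set_aside_firmware radeon, test the module load, then restore_firmware radeon.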
Tried without success:
1) Using the radeon module instead of amdgpu.
2) Using amdgpu-pro.
3) Using a different kernel version (4.4.39).
4) Booting with iommu=soft.
5) Loading drm first (works), then loading amdgpu (crashes).
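The load order in item 5 can be sketched as a small helper. This is only an illustration: the function name and the RUN switch are assumptions, and by default it prints the commands instead of running them, since actually loading amdgpu on the affected system reboots the machine.

```shell
# Hypothetical helper (not from the report): load modules one at a time, in
# the order given. By default it only prints the commands; set RUN=1 (as
# root) to really run modprobe.
load_modules_in_order() {
    for mod in "$@"; do
        if [ "${RUN:-0}" = "1" ]; then
            echo "loading $mod"
            sync                      # flush pending writes before a risky load
            modprobe "$mod" || return 1
        else
            echo "modprobe $mod"
        fi
    done
}
```

Usage: load_modules_in_order drm amdgpu prints the two modprobe commands; RUN=1 load_modules_in_order drm amdgpu performs them.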

I suspect a thread-safety bug in the kernel's drm code.

Attached:
1) dmesg of the nosmp boot.
2) lspci output.
3) Config of the current 4.9.9 kernel.
Comment 1 fin4478 2017-02-13 03:28:17 UTC
Stock kernels have very little amdgpu code; see kernel.org and click diff.
Use the command: 
git clone -b drm-next-4.11-wip git://people.freedesktop.org/~agd5f/linux

The kernel configuration files of official Debian kernels are available in /boot, named after the kernel release. Copy one to the linux directory as .config. Connect all your devices and run: make localmodconfig. You can also use make defconfig to create an initial .config file.
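A minimal sketch of locating a starting config on distributions other than Debian. It assumes the config is either at /boot/config-<release> or exposed as /proc/config.gz (the latter only exists when the running kernel was built with CONFIG_IKCONFIG_PROC). The function name and the optional root prefix are illustrative.

```shell
# Hypothetical helper (not from the thread): print the first readable kernel
# config source. $1 = optional root prefix (for testing), $2 = kernel release
# (defaults to the running kernel).
find_kernel_config() {
    root="${1:-}"
    release="${2:-$(uname -r)}"
    for candidate in "$root/boot/config-$release" "$root/proc/config.gz"; do
        if [ -r "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}
```

Usage, from the linux directory: if the printed path ends in .gz, decompress it with zcat into .config; otherwise copy it to .config. Then run make localmodconfig as described above.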

Run make xconfig and check that you have enabled: Reroute Broken IRQ, Virtualization KVM and the 300 Hz CPU timer. I also disabled Swap, Kernel Debug, CPU Freq scaling, CPU handling in ACPI, and using the BIOS to control CPU and devices. Under drivers->graphics->amdgpu, enable CIK support for a GCN 1.1 GPU and SI support for a GCN 1.0 GPU.
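The SI/CIK settings can also be checked from the command line after saving the config. A minimal sketch, assuming the options correspond to the Kconfig symbols CONFIG_DRM_AMDGPU_SI and CONFIG_DRM_AMDGPU_CIK (my understanding of the amdgpu Kconfig, not something stated in this thread); the function name is illustrative.

```shell
# Hypothetical helper (not from the thread): report which of the given
# Kconfig symbols are not set to y or m in a .config file.
# $1 = path to .config, remaining arguments = symbol names.
# Returns 0 if all symbols are enabled, 1 otherwise.
config_has() {
    cfg="$1"; shift
    missing=0
    for sym in "$@"; do
        if ! grep -q "^${sym}=[ym]\$" "$cfg"; then
            echo "not enabled: $sym"
            missing=1
        fi
    done
    return "$missing"
}
```

Usage: config_has .config CONFIG_DRM_AMDGPU_SI CONFIG_DRM_AMDGPU_CIK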

Create a Debian kernel package:
export CONCURRENCY_LEVEL=4
fakeroot make-kpkg --initrd kernel_image
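The fixed value of 4 suits a four-core machine. On a system with more cores, such as the dual-socket machine in this report, the level could follow the core count instead. A minimal sketch, assuming nproc from coreutils is available; the function name and the fallback argument are illustrative.

```shell
# Hypothetical helper (not from the thread): print the number of available
# CPUs, or a fallback value ($1, default 4) when nproc is not installed.
pick_concurrency() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    else
        echo "${1:-4}"
    fi
}
```

Usage: export CONCURRENCY_LEVEL="$(pick_concurrency)" before running make-kpkg.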

Install the kernel package with Gdebi. To make a custom kernel boot, add this line to /etc/initramfs-tools/modules:
unix
Then run: sudo update-initramfs -u
Reboot.
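The step of adding a line to the modules file can be made repeatable, so running it twice does not duplicate the entry. A minimal sketch; the function name is an illustrative assumption.

```shell
# Hypothetical helper (not from the thread): append a module name to a
# modules list only if that exact line is not already present.
# $1 = file (e.g. /etc/initramfs-tools/modules), $2 = module name
ensure_module_line() {
    file="$1"
    mod="$2"
    if grep -qxF -- "$mod" "$file" 2>/dev/null; then
        return 0
    fi
    echo "$mod" >> "$file"
}
```

Usage, as root: ensure_module_line /etc/initramfs-tools/modules unix, then regenerate the initramfs.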
Comment 2 Janpieter Sollie 2017-02-14 06:55:11 UTC
Dear fin4478,
Thank you for the tips, I will try them ASAP,
but I am confused: I do not use Debian, and the system is headless (running as an OpenCL accelerator). Does this matter?
Comment 3 Janpieter Sollie 2017-02-14 07:59:04 UTC
it works! attached my config file of your drm-next kernel
I don't know what needs to be done for you developers to integrate drm-next into the mainline kernel, but thank you!!!
Comment 4 Janpieter Sollie 2017-02-14 08:00:08 UTC
Created attachment 254741
config of working drm-next kernel
Comment 5 fin4478 2017-02-14 08:15:00 UTC
(In reply to Janpieter Sollie from comment #3)
> it works! attached my config file of your drm-next kernel
> I don't know what needs to be done for you developers to integrate drm-next
> into the mainline kernel, but thank you!!!

AMD should warn against using stock kernels and explain how to use the ~agd5f wip kernel and the latest mesa git. Here is the page for you, dear AMD:
http://support.amd.com/en-us/download/linux

This and many other amdgpu bug reports prove my point.
Comment 6 Michel Dänzer 2017-02-14 08:32:19 UTC
(In reply to fin4478 from comment #5)
> This and many other amdgpu bug reports prove my point.

Your bug report comments like this one rather indicate that you don't understand how the kernel development process works.
Comment 7 fin4478 2017-02-14 15:05:09 UTC
(In reply to Michel Dänzer from comment #6)
> (In reply to fin4478 from comment #5)
> > This and many other amdgpu bug reports prove my point.
> 
> Your bug report comments like this one rather indicate that you don't
> understand how the kernel development process works.

You do not see how the agd5f wip kernel solved this and many other problems.

AMD should warn against using stock kernels and explain how to use the ~agd5f wip kernel and the latest mesa git. Here is the page for you, dear AMD:
http://support.amd.com/en-us/download/linux

You clearly want bad reputation for Amd gpus so I stop giving this info.
Comment 8 Christian König 2017-02-14 15:18:09 UTC
(In reply to fin4478 from comment #7)
> You clearly want bad reputation for Amd gpus so I stop giving this info.

Well, as an AMD employee I can only advise you to stop giving incorrect information.

Alex's branches only contain additional features that are not upstream yet, so they are far more unstable than the upstream kernel driver.
Comment 9 Janpieter Sollie 2017-02-16 09:15:55 UTC
Additional comment:
It works on 4.10-rc8, so the necessary patch is already integrated.

thank you kernel developers!
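Going by the results in this thread (reboot on 4.9.9, working on 4.10-rc8), a quick check of whether a kernel release is at least 4.10 could look like the sketch below. It compares only the major and minor numbers, and the function name is an illustrative assumption. The thread does not identify the fixing commit, so this is a rough indicator only.

```shell
# Hypothetical helper (not from the thread): succeed if release $1 is at
# least major.minor version $2. Only major and minor are compared, so
# suffixes such as -rc8 or -gentoo are ignored.
kver_ge() {
    have_major=${1%%.*}
    rest=${1#*.}
    have_minor=${rest%%[!0-9]*}
    want_major=${2%%.*}
    rest=${2#*.}
    want_minor=${rest%%[!0-9]*}
    [ "$have_major" -gt "$want_major" ] && return 0
    [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]
}
```

Usage: kver_ge "$(uname -r)" 4.10 && echo "running kernel is 4.10 or newer"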