Bug 73901

Summary: Kernel crash after modprobe radeon runpm=1
Product: Drivers
Reporter: Pali Rohár (pali)
Component: Video (DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED CODE_FIX
Severity: high
CC: alexdeucher, mh-kernelbug
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 3.14
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
  syslog output
  dmesg output
  fix ATPX detection on non-VGA dGPUs
  avoid a possible crash when runpm is forced on non-ATPX systems

Description Pali Rohár 2014-04-12 18:44:19 UTC
After modprobing the radeon driver with runpm=1, the notebook display panel immediately goes black (probably turned off), and after one or two seconds the kernel freezes or crashes (sysrq does not work either). I was able to dump and sync the syslog kernel output before the freeze (by calling the sync command in an infinite loop in the background). See the attachment, which contains the log from the syslog daemon after modprobing the radeon kernel module (with runpm=1). I'm not able to provide any other debug output because the display is off and the kernel is crashing. The problem is always reproducible on a Dell Latitude E6440 notebook, which has a muxless AMD Radeon HD 8690M graphics card.

The black screen is probably caused by the intel driver at this line:
[  171.913779] i915: switched off

And the kernel crash is caused by a NULL dereference:
[  173.442690] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008

My AMD graphics card is identified by lspci -nn as:
01:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Sun XT [Radeon HD 8670A/8670M/8690M] [1002:6660]
Comment 1 Pali Rohár 2014-04-12 18:44:46 UTC
Created attachment 131981 [details]
syslog output
Comment 2 Alex Deucher 2014-04-14 14:18:39 UTC
What kernel are you using? Can you attach your full dmesg output without radeon.runpm=1?
Comment 3 Pali Rohár 2014-04-14 15:28:15 UTC
Created attachment 132211 [details]
dmesg output

I'm using version 3.14 (as specified in bugzilla). The dmesg output from the kernel without any radeon params is attached.
Comment 4 Alex Deucher 2014-04-14 16:57:52 UTC
Your system does not appear to have the ATPX ACPI methods that are required for runtime pm to work properly (required to power off the dGPU). You should see something like:
ATPX version X, functions 0xXXXXXXXX
in your dmesg output.
Comment 5 Pali Rohár 2014-04-14 17:21:10 UTC
So I cannot turn off dGPU when it is not used?

Also I think that kernel should not crash when booting with (maybe incorrect?) param runpm.
Comment 6 Pali Rohár 2014-04-14 17:22:43 UTC
Btw, I looked into DSDT/SSDT acpi tables and there is ATPX method (in SSDT7, scope \_SB.PCI0.GFX0).
Comment 7 Alex Deucher 2014-04-14 17:25:50 UTC
(In reply to Pali Rohár from comment #5)
> So I cannot turn off dGPU when it is not used?
> 

Correct.  The driver requires that method to power on/off the dGPU.

> Also I think that kernel should not crash when booting with (maybe
> incorrect?) param runpm.

Yes, that should probably be fixed.

(In reply to Pali Rohár from comment #6)
> Btw, I looked into DSDT/SSDT acpi tables and there is ATPX method (in SSDT7,
> scope \_SB.PCI0.GFX0).

Did you enable vgaswitcheroo support in your kernel config?
Comment 8 Pali Rohár 2014-04-14 17:31:31 UTC
(In reply to Alex Deucher from comment #7)
> (In reply to Pali Rohár from comment #6)
> > Btw, I looked into DSDT/SSDT acpi tables and there is ATPX method (in
> SSDT7,
> > scope \_SB.PCI0.GFX0).
> 
> Did you enable vgaswitcheroo support in your kernel config?

Kernel is from http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.14-trusty/

And in file /boot/config-3.14.0-031400-generic I see:

CONFIG_VGA_SWITCHEROO=y
Comment 9 Pali Rohár 2014-04-14 22:15:43 UTC
I looked into the radeon_atpx_handler.c code and found the reason why the radeon kernel driver does not detect ATPX.

First, here is the lspci output:
00:02.0 VGA compatible controller [0300]: Intel Corporation 4th Gen Core Processor Integrated Graphics Controller [8086:0416] (rev 06)
01:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Sun XT [Radeon HD 8670A/8670M/8690M] [1002:6660]

Second, here is the relevant code of the function radeon_atpx_detect(void) from radeon_atpx_handler.c:

	int vga_count = 0;

	while ((pdev = pci_get_class(PCI_CLASS_DISPLAY_VGA << 8, pdev)) != NULL) {
		vga_count++;

		has_atpx |= (radeon_atpx_pci_probe_handle(pdev) == true);
	}

	if (has_atpx && vga_count == 2) { ... ATPX was detected ... }

And some defines (from pci_ids.h):

#define PCI_CLASS_DISPLAY_VGA		0x0300
#define PCI_CLASS_DISPLAY_OTHER		0x0380

Because my Radeon card has PCI class 0380 and not 0300, it is not checked for ATPX in the while loop, so vgaswitcheroo is not enabled.
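
The mismatch can be reproduced outside the kernel with a small userspace sketch. It assumes that the lookup compares the full 24-bit class code for equality, and that both devices have programming interface 0x00; `struct fake_pci_dev`, `count_by_class` and `count_display` are hypothetical names used only for illustration, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values copied from the pci_ids.h excerpt above. */
#define PCI_CLASS_DISPLAY_VGA   0x0300
#define PCI_CLASS_DISPLAY_OTHER 0x0380

/* Hypothetical stand-in for a PCI device: only the 24-bit class code matters here. */
struct fake_pci_dev {
	const char *slot;
	uint32_t class_code; /* base class << 16 | subclass << 8 | prog-if */
};

/* The two display devices from the lspci output; prog-if 0x00 is an assumption. */
static const struct fake_pci_dev devices[] = {
	{ "00:02.0", 0x030000 }, /* Intel iGPU, class 0300 */
	{ "01:00.0", 0x038000 }, /* Radeon dGPU, class 0380 */
};

/* Count devices whose full class code equals class_code. */
static int count_by_class(const struct fake_pci_dev *devs, size_t n,
			  uint32_t class_code)
{
	int count = 0;

	assert(devs != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (devs[i].class_code == class_code)
			count++;
	return count;
}

/* Count display devices: VGA only (include_other == 0) or VGA plus "other". */
static int count_display(const struct fake_pci_dev *devs, size_t n,
			 int include_other)
{
	int count = count_by_class(devs, n, PCI_CLASS_DISPLAY_VGA << 8);

	if (include_other)
		count += count_by_class(devs, n, PCI_CLASS_DISPLAY_OTHER << 8);
	return count;
}
```

With only the VGA class, `count_display(devices, 2, 0)` finds one device, so a check for two display devices fails; when the "other" display class is included, both devices are counted.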

I created this quick & dirty patch, and with it runpm=1 works without any crash.

--- radeon_atpx_handler.c.orig	2014-04-14 17:36:36.583744668 +0200
+++ radeon_atpx_handler.c	2014-04-14 23:50:53.354492060 +0200
@@ -528,6 +528,12 @@ static bool radeon_atpx_detect(void)
 		has_atpx |= (radeon_atpx_pci_probe_handle(pdev) == true);
 	}
 
+	while ((pdev = pci_get_class(PCI_CLASS_DISPLAY_OTHER << 8, pdev)) != NULL) {
+		vga_count++;
+
+		has_atpx |= (radeon_atpx_pci_probe_handle(pdev) == true);
+	}
+
 	if (has_atpx && vga_count == 2) {
 		acpi_get_name(radeon_atpx_priv.atpx.handle, ACPI_FULL_PATHNAME, &buffer);
 		printk(KERN_INFO "VGA switcheroo: detected switching method %s handle\n",

The vgaswitcheroo debugfs file now appears as well:

$ sudo cat /sys/kernel/debug/vgaswitcheroo/switch
0:IGD:+:Pwr:0000:00:02.0
1:DIS: :DynPwr:0000:01:00.0
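
For anyone who wants to check these lines from a program, here is a minimal userspace sketch that splits one line of the switch file into fields. The field meanings (index, client type, `+` for the active client, power state with a `Dyn` prefix for dynamic power control, PCI address) are inferred from the two lines above and should be treated as assumptions; `struct switch_entry`, `parse_switch_line` and `dynamic_power` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* One parsed line of /sys/kernel/debug/vgaswitcheroo/switch (layout assumed). */
struct switch_entry {
	int id;
	char type[16];     /* e.g. "IGD" or "DIS" */
	bool active;       /* true when the third field is '+' */
	char power[16];    /* e.g. "Pwr" or "DynPwr" */
	char pci_addr[32]; /* e.g. "0000:01:00.0" */
};

/* Parse a line such as "1:DIS: :DynPwr:0000:01:00.0"; returns false on mismatch. */
static bool parse_switch_line(const char *line, struct switch_entry *out)
{
	char active;

	assert(line != NULL && out != NULL);
	/* The PCI address contains colons itself, so it is read last with %s. */
	if (sscanf(line, "%d:%15[^:]:%c:%15[^:]:%31s", &out->id, out->type,
		   &active, out->power, out->pci_addr) != 5)
		return false;
	out->active = (active == '+');
	return true;
}

/* True when the power field starts with "Dyn". */
static bool dynamic_power(const struct switch_entry *e)
{
	return strncmp(e->power, "Dyn", 3) == 0;
}
```

Applied to the second line above, the sketch reports the DIS client as inactive with dynamic power control, which is the state expected once runpm is in effect.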

Alex, I think you now have everything needed to implement a proper fix for this bug.
Comment 10 Alex Deucher 2014-04-14 23:06:25 UTC
Created attachment 132301 [details]
fix ATPX detection on non-VGA dGPUs

Thanks for sorting this out.
Comment 11 Alex Deucher 2014-04-14 23:07:33 UTC
Created attachment 132311 [details]
avoid a possible crash when runpm is forced on non-ATPX systems

Fix runpm=1 handling on non-PX systems.
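
The attachment itself is not reproduced in this thread, so the following is only a sketch of the kind of guard its title describes, not the actual patch. It assumes the runpm parameter takes -1 (default, PX systems only), 0 (disabled) and 1 (forced on); `px_runtime_pm_enabled` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether PX runtime power management should be enabled.
 * Assumed runpm values: -1 = default (PX only), 0 = disabled, 1 = forced on.
 * Even when forced on, the sketch refuses to enable it without ATPX,
 * because ATPX is what powers the dGPU on and off.
 */
static bool px_runtime_pm_enabled(int runpm, bool has_atpx)
{
	assert(runpm >= -1 && runpm <= 1);
	if (runpm == 0)
		return false;
	return has_atpx;
}
```

Under these assumptions, runpm=1 on a system where ATPX was not detected simply leaves runtime pm off instead of taking the code path that crashed.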
Comment 12 Pali Rohár 2014-04-15 08:02:39 UTC
(In reply to Alex Deucher from comment #10)
> Created attachment 132301 [details]
> fix ATPX detection on non-VGA dGPUs
> 
> Thanks for sorting this out.

This patch is the same as mine; it is already tested and working.
Comment 13 Pali Rohár 2014-04-15 08:07:11 UTC
(In reply to Alex Deucher from comment #11)
> Created attachment 132311 [details]
> avoid a possible crash when runpm is forced on non-ATPX systems
> 
> Fix runpm=1 handling on non-PX systems.

It is not possible to apply this patch on top of 3.14 nor on top of linus master (55101e2d6ce1c780f6ee8fee5f37306971aac6cd)

linux/drivers/gpu/drm/radeon$ patch -p5 -i 0002-drm-radeon-don-t-allow-runpm-1-on-systems-with-out-A.patch
patching file radeon_kms.c
Hunk #1 FAILED at 107.
1 out of 1 hunk FAILED -- saving rejects to file radeon_kms.c.rej
Comment 14 Alex Deucher 2014-04-15 13:22:52 UTC
(In reply to Pali Rohár from comment #13)
> (In reply to Alex Deucher from comment #11)
> > Created attachment 132311 [details]
> > avoid a possible crash when runpm is forced on non-ATPX systems
> > 
> > Fix runpm=1 handling on non-PX systems.
> 
> It is not possible to apply this patch on top of 3.14 nor on top of linus
> master (55101e2d6ce1c780f6ee8fee5f37306971aac6cd)
> 
> linux/drivers/gpu/drm/radeon$ patch -p5 -i
> 0002-drm-radeon-don-t-allow-runpm-1-on-systems-with-out-A.patch
> patching file radeon_kms.c
> Hunk #1 FAILED at 107.
> 1 out of 1 hunk FAILED -- saving rejects to file radeon_kms.c.rej

It relies on other patches in the radeon -fixes tree.  It should apply against:
http://cgit.freedesktop.org/~deathsimple/linux/log/?h=drm-fixes-3.15-wip
Comment 15 Pali Rohár 2014-04-20 20:53:05 UTC
I have now tested this patch with the 3.15-rc2 kernel, and there is no kernel crash with runpm=1 anymore.

But there is another problem: runpm=1 does not seem to work correctly. It does not power off the radeon card when it is not in use.
Comment 16 Pali Rohár 2014-04-22 20:12:41 UTC
My bad, I'm using tlp, which calls:
$ echo on > /sys/bus/pci/devices/0000:01:00.0/power/control
when the notebook is running on AC. This prevents runpm from working correctly. After I blacklisted the radeon card in tlp, runpm started working correctly.
Comment 17 Pali Rohár 2014-07-12 21:36:22 UTC
OK, when I set the control to auto via

$ echo auto > /sys/bus/pci/devices/0000:01:00.0/power/control

the card is automatically turned off when it is not in use. When I set it to on via

$ echo on > /sys/bus/pci/devices/0000:01:00.0/power/control

then it is always on. So it is working as expected, and I am closing this bug as fixed.
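
A small userspace sketch for checking which mode a device is in, for example to catch a tool that has written "on" to the control file. It relies only on the two values used in this thread ("auto" and "on"); `read_attr` and `runtime_pm_allowed` are hypothetical helper names, and the path is the one from the commands above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Read the first line of a sysfs attribute into buf, without the newline. */
static bool read_attr(const char *path, char *buf, size_t len)
{
	FILE *f;
	bool ok;

	assert(path != NULL && buf != NULL && len > 0);
	f = fopen(path, "r");
	if (!f)
		return false;
	ok = fgets(buf, (int)len, f) != NULL;
	fclose(f);
	if (ok)
		buf[strcspn(buf, "\n")] = '\0';
	return ok;
}

/* "auto" lets the device power down when idle; "on" keeps it powered. */
static bool runtime_pm_allowed(const char *control)
{
	return strcmp(control, "auto") == 0;
}
```

A caller would pass /sys/bus/pci/devices/0000:01:00.0/power/control to `read_attr` and then hand the result to `runtime_pm_allowed`; a false result while the card is idle suggests that something has set the control to "on".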