Bug 72701 - Screen freeze when using radeon driver
Summary: Screen freeze when using radeon driver
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: x86-64 Linux
Importance: P1 high
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-03-22 16:22 UTC by klod
Modified: 2016-06-05 03:25 UTC
CC List: 7 users

See Also:
Kernel Version: 3.13 and 3.14
Subsystem:
Regression: No
Bisected commit-id:


Attachments
possible fix (1.83 KB, patch)
2014-04-07 02:13 UTC, Alex Deucher
possible fix (3.98 KB, patch)
2014-04-09 02:07 UTC, Alex Deucher
possible fix (4.01 KB, patch)
2014-04-09 05:59 UTC, Alex Deucher
dmesg with radeon.test=3 (659.58 KB, text/plain)
2014-07-14 11:43 UTC, abandoned account
dmesg with radeon.test=0 (475.71 KB, text/plain)
2014-07-14 11:46 UTC, abandoned account
kernel .config used in previous dmesgs (95.76 KB, text/plain)
2014-07-14 11:48 UTC, abandoned account
ati catalyst screenshot of the Information tab when both graphic cards were on (140.06 KB, image/png)
2014-07-14 12:05 UTC, abandoned account
sloppy patch try (4.91 KB, patch)
2014-07-14 16:29 UTC, abandoned account
updated sloppy patch to kernel 3.18-rc4 (3.53 KB, patch)
2014-11-12 15:53 UTC, abandoned account

Description klod 2014-03-22 16:22:56 UTC
I get this problem with every distribution using a new kernel that I tried to boot from a live CD, as well as with the ones installed on my hard drive. Faster-booting ones manage to finish booting before they freeze, while slower ones freeze in the middle of the boot process.


[klod@klod ~]$ lspci | grep VGA
00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] BeaverCreek [Radeon HD 6520G]
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Whistler [Radeon HD 6630M/6650M/6750M/7670M/7690M]


When I turn off discrete GPU in my laptop's BIOS, it works well.
Comment 1 Alex Deucher 2014-03-22 16:50:32 UTC
Does booting with radeon.runpm=0 on the kernel command line in grub help?
Comment 2 klod 2014-03-22 20:32:38 UTC
Yes, it does. Thank you :)
I can't find any documentation on radeon parameters, but I guess that has something to do with power management. I tried radeon.dpm=0 earlier, but it didn't work. I hope this issue will be resolved soon :)
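
For readers looking for the available options: `modinfo radeon` lists the module's parameters with short descriptions, and the command line of the running kernel can be read from /proc/cmdline. The following is only a minimal sketch of pulling the radeon.* options out of such a command line; the function name `radeon_params` and the sample string are made up for illustration.

```python
def radeon_params(cmdline):
    """Return {name: value} for each radeon.<name>=<value> token in a kernel command line."""
    params = {}
    for token in cmdline.split():
        if token.startswith("radeon.") and "=" in token:
            # Strip the "radeon." prefix, then split name from value once.
            key, value = token[len("radeon."):].split("=", 1)
            params[key] = value
    return params


if __name__ == "__main__":
    # Hypothetical sample; on a live system the string would come from /proc/cmdline.
    sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet radeon.runpm=0 radeon.dpm=1"
    print(radeon_params(sample))
```

Values are kept as strings on purpose, since some options (runpm, for instance) also accept -1.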
Comment 4 klod 2014-03-24 19:06:11 UTC
Just a few questions:

1. - What are "PX" and "non-PX" cards?

2. - Aren't these going to disable power management in my GPU? What's the difference between applying those and using "radeon.runpm=0" in grub? 

3. - How can I apply those? 

Thank you ! :)
Comment 5 Alex Deucher 2014-03-24 19:12:47 UTC
(In reply to klod from comment #4)
> Just a few questions:
> 
> 1. - What are "PX" and "non-PX" cards?

PX = PowerXpress.  PX systems are laptops with two GPUs, an integrated and a discrete GPU.

> 
> 2. - Aren't these going to disable power management in my GPU? What's the
> difference between applying those and using "radeon.runpm=0" in grub? 
> 

It does not disable power management.  It only disables the special handling for PX systems which gets incorrectly applied to non-PX systems in certain cases.  When you apply the patches you shouldn't need to add the radeon.runpm=0 option.  If radeon.runpm=0 fixes the issue, so should the patches.

> 3. - How can I apply those? 
> 

If you are using git:
git am <patch file>

Otherwise:
patch -p1 -i <patch file>
Comment 6 klod 2014-03-24 19:30:58 UTC
Well, "radeon.runpm=0" allows me to boot and use the system, but with much higher temperatures and shorter battery life. I wouldn't call that "fixing the issue", as it's still worse than what I have with 3.12 and the "radeon.dpm=1" parameter.
Comment 7 klod 2014-04-01 19:42:55 UTC
I think those patches are applied in 3.14 and 3.13.8, but I still need to use radeon.runpm=0 in order to boot with my discrete card.
Comment 8 Alex Deucher 2014-04-01 19:51:57 UTC
(In reply to klod from comment #7)
> I think those patches are applied in 3.14 and 3.13.8, but i still need to
> use radeon.runpm=0 in order to boot with my discrete card.

You have a PX system so those patches are not relevant for you.  It seems runpm is not working properly on your system. Booting with radeon.runpm=0 reverts back to the 3.12 behavior (PX dGPUs are not dynamically powered down).
Comment 9 klod 2014-04-01 21:53:33 UTC
Well, it seems so. What can I do to find what the problem is?
Comment 10 Alex Deucher 2014-04-01 22:36:22 UTC
Did manually powering on/off the dGPU via debugfs ever work on your system?  See the "Forcing the power state of the devices" section of this page:
http://nouveau.freedesktop.org/wiki/Optimus/
for how to test that.
Comment 11 Alex Deucher 2014-04-07 02:13:24 UTC
Created attachment 131601 [details]
possible fix

Does the attached kernel patch help?
Comment 12 klod 2014-04-09 00:04:00 UTC
I'm sorry, I'm very busy these days. I will try that when I have time
Comment 13 Alex Deucher 2014-04-09 02:07:30 UTC
Created attachment 131741 [details]
possible fix

updated patch.
Comment 14 Alex Deucher 2014-04-09 05:59:04 UTC
Created attachment 131781 [details]
possible fix

fix a stupid typo.
Comment 15 abandoned account 2014-07-14 10:08:22 UTC
Hi. I believe that I am in the exact same situation as the OP.

The system locks up about 10 (sometimes 14-15) seconds into boot; I don't even have to get past typing my LUKS password to mount the rootfs.

It works fine if I use radeon.runpm=0, or if I disable the CONFIG_VGA_SWITCHEROO kernel option (=n) and recompile the kernel. It also works with kernel 3.12.23 (and 3.12.24, if I remember correctly). It doesn't work with 3.15.4 or 3.16-rc4. I diffed the sources a bit and noticed that the runpm option (and a lot of other related ones) didn't exist in 3.12.23/24, so it makes sense that those versions work.
I tried to apply the patch from your comment #14, but 4 hunks failed and the kernel wouldn't compile.

I also tried
 echo OFF > /sys/kernel/debug/vgaswitcheroo/switch
which locks up the running system: I had audio playing (from a video) and it kept repeating the last buffer while locked up. Only holding the power button for 4 seconds turned the laptop off (Lenovo Z575, same lspci output as the OP); sysrq has no effect and the numlock/capslock LEDs don't respond in this state.
Before echoing OFF to it, the switch file looked like this:
sudo cat /sys/kernel/debug/vgaswitcheroo/switch
0:IGD:+:Pwr:0000:00:01.0
1:DIS: :Pwr:0000:01:00.0
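
For scripted checks, the two lines above can be split into fields. This is only a sketch that assumes the layout seen in this output (index, client name, a "+" marking the active GPU, power state, PCI address); the names `SwitchClient` and `parse_switch` are hypothetical.

```python
from typing import NamedTuple


class SwitchClient(NamedTuple):
    index: int
    name: str      # "IGD" or "DIS" in the output above
    active: bool   # True when the third field is "+"
    power: str     # e.g. "Pwr" (or "DynPwr", as seen later in this thread)
    pci_addr: str


def parse_switch(text):
    """Parse the contents of the vgaswitcheroo switch file into SwitchClient tuples."""
    clients = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # The PCI address itself contains colons, so split off only the first four fields.
        index, name, active, power, pci_addr = line.split(":", 4)
        clients.append(
            SwitchClient(int(index), name, active == "+", power, pci_addr.strip())
        )
    return clients


if __name__ == "__main__":
    sample = "0:IGD:+:Pwr:0000:00:01.0\n1:DIS: :Pwr:0000:01:00.0\n"
    for client in parse_switch(sample):
        print(client)
```

On a live system the text would be read (as root) from /sys/kernel/debug/vgaswitcheroo/switch.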


I'd be happy to try anything, I have time. I've been trying to track this down for 2 days by changing and matching kernel configs (which was silly, because I should've focused on the video settings only), knowing that 3.12.23 worked but 3.15.4 didn't. It all eventually led me to this page, thank goodness :)

I'm not exactly sure what more info I should give at this time, but I'd be happy to provide it; just say what you need.

What else I can say: turning the discrete card off from the BIOS (switching a setting from Dynamic (both cards on) to UDMA (only the integrated card on, discrete card off)) works fine.

Probably worth noting: when I had Windows 7 64-bit with the laptop drivers, it worked fine. The discrete card was powered off while not in use. This was with the Lenovo-provided video drivers (heavily outdated, from around 2011-2012).

However, with the ATI Mobility (generic?) drivers from the AMD site (even the very latest ones from a month or two ago, when I last tried), the system would freeze, likely when the driver tried, after a short while, to power off the discrete card. One option prevented the freeze: a registry setting called "enableulps" (set to 1 by default!). When I set it to 0 (manually, in safe mode, then rebooted), the discrete card stayed at 100 MHz GPU and 150 MHz memory all the time when idle; it would go higher when in use, but never lower, and it never powered down with these drivers.

With the original manufacturer drivers (from Lenovo), enableulps is 1 and the card is at 0 MHz GPU and 27 MHz memory (as reported by GPU-Z), i.e. turned off. However incorrect those readings may be, they were reported consistently all the time.

So the Lenovo drivers may do something extra in order to power down the discrete card, something that the generic ATI Mobility drivers (with the out-of-the-box install) don't do and therefore freeze the system.

I have also tried the fglrx driver on Linux, but I can't remember much about how it worked, because I was running the system mostly in UDMA mode (set in the BIOS; only the integrated graphics card on). I do remember that it didn't lock up (maybe because it didn't try to power down the discrete card?) the few times I tried it with both cards on. Should you need me to retry something with the fglrx driver, let me know; it may not be easy, but I am willing to try again. I am now using the radeon driver with no intent to switch back to fglrx.

The temperatures while I'm just writing this are:
$ sensors
acpitz-virtual-0
Adapter: Virtual device
temp1:        +53.0°C  (crit = +98.0°C)
temp2:        +44.0°C  (crit = +126.0°C)

radeon-pci-0008
Adapter: PCI adapter
temp1:        +53.0°C  (crit = +120.0°C, hyst = +90.0°C)

radeon-pci-0100
Adapter: PCI adapter
temp1:        +61.0°C  

k10temp-pci-00c3
Adapter: PCI adapter
temp1:        +52.6°C  (high = +70.0°C)
                       (crit = +100.0°C, hyst = +99.0°C)

When I have UDMA set in the BIOS (so only one card), they are around 40-44°C, max 45°C (and of course radeon-pci-0100 is gone).
acpitz-virtual-0 temp1 is equal to k10temp-pci-00c3 temp1 = cpu temp
acpitz-virtual-0 temp2 is motherboard temp
radeon-pci-0008 temp1 is integrated gfx card
radeon-pci-0100 temp1 is discrete gfx card

some lspci -vvv (the VGA parts)

00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] BeaverCreek [Radeon HD 6520G] (prog-if 00 [VGA controller])
        Subsystem: Lenovo Device 3970
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 42
        Region 0: Memory at d0000000 (32-bit, prefetchable) [size=256M]
        Region 1: I/O ports at 3000 [size=256]
        Region 2: Memory at f0200000 (32-bit, non-prefetchable) [size=256K]
        Expansion ROM at <unassigned> [disabled]
        Capabilities: <access denied>
        Kernel driver in use: radeon

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Whistler [Radeon HD 6630M/6650M/6750M/7670M/7690M] (prog-if 00 [VGA controller])
        Subsystem: Lenovo Radeon HD 6650M
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 43
        Region 0: Memory at e0000000 (64-bit, prefetchable) [size=256M]
        Region 2: Memory at f0100000 (64-bit, non-prefetchable) [size=128K]
        Region 4: I/O ports at 2000 [size=256]
        [virtual] Expansion ROM at f0120000 [disabled] [size=128K]
        Capabilities: <access denied>
        Kernel driver in use: radeon

Thank you for your time on this. Much appreciated!
Comment 16 abandoned account 2014-07-14 11:43:58 UTC
Created attachment 142981 [details]
dmesg with radeon.test=3

With radeon.test=3 it took about 80 seconds before the boot screen text appeared. Before that the screen was black with no cursor and the numlock/capslock LEDs weren't turning on; it seemed locked up, but it wasn't, it was running the tests.
[With radeon.test=1 the black screen (right after loading the initrd image) lasts less time, maybe 40-50 seconds.]

When I booted into X, I tried to start parole (a media player) to play a video (it autoplays the last playlist on startup), but it crashed and the window closed without notice. I guess the errors were:
[  336.437925] [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (1544192, 2, 4096, -12)

maybe a side effect of the tests


on another note:
I enabled a few other flags (like radeon.dpm=1) to reduce the temperatures.
(Now the cards' frequencies look the way they do when idle, even when playing a video in a window; I've noticed it uses only the IGD for everything.)

# cat $(find /sys/kernel/debug/dri/ -iname \*pm\*)
uvd    vclk: 0 dclk: 0
power level 0    sclk: 10000 mclk: 15000 vddc: 900 vddci: 0
uvd    vclk: 0 dclk: 0
power level 0    sclk: 10000 mclk: 15000 vddc: 900 vddci: 0
uvd    vclk: 0 dclk: 0
power level 0    sclk: 27587 vddc: 888
uvd    vclk: 0 dclk: 0
power level 0    sclk: 27587 vddc: 888
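
To compare these numbers with clock readings from other tools, the lines can be parsed and converted to MHz. The sketch below assumes the radeon driver prints sclk/mclk in units of 10 kHz; that unit is an assumption, not something stated in this thread. Under it, 10000 and 15000 would be 100 MHz and 150 MHz, matching the idle clocks reported on Windows with enableulps=0. The function name `parse_power_level` and its `clock_unit_khz` parameter are hypothetical.

```python
import re

# Matches "name: number" pairs such as "sclk: 10000".
_FIELD = re.compile(r"(\w+):\s*(\d+)")


def parse_power_level(line, clock_unit_khz=10):
    """Parse one 'power level' line; sclk/mclk are converted to MHz, other fields kept as-is.

    clock_unit_khz=10 encodes the ASSUMPTION that clocks are printed in 10 kHz units.
    """
    result = {}
    for key, raw in _FIELD.findall(line):
        value = int(raw)
        if key in ("sclk", "mclk"):
            result[key + "_mhz"] = value * clock_unit_khz / 1000
        else:
            result[key] = value
    return result


if __name__ == "__main__":
    print(parse_power_level("power level 0    sclk: 10000 mclk: 15000 vddc: 900 vddci: 0"))
    print(parse_power_level("power level 0    sclk: 27587 vddc: 888"))
```

Under the same assumption, the second line (presumably the integrated GPU, since it has no mclk field) works out to roughly 276 MHz.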

# sensors
acpitz-virtual-0
Adapter: Virtual device
temp1:        +47.0°C  (crit = +98.0°C)
temp2:        +43.0°C  (crit = +126.0°C)

radeon-pci-0008
Adapter: PCI adapter
temp1:        +47.0°C  (crit = +120.0°C, hyst = +90.0°C)

radeon-pci-0100
Adapter: PCI adapter
temp1:        +52.0°C  (crit = +120.0°C, hyst = +90.0°C)

k10temp-pci-00c3
Adapter: PCI adapter
temp1:        +47.2°C  (high = +70.0°C)
                       (crit = +100.0°C, hyst = +99.0°C)

These readings were taken while playing a 720p video in the background, in a window (resized by me to less than 720p) that covers about 1/4 of the screen, behind a semi-transparent xfce4-terminal.
Comment 17 abandoned account 2014-07-14 11:46:03 UTC
Created attachment 142991 [details]
dmesg with radeon.test=0

Here is another dmesg in which the only thing I've changed (from the previous dmesg) is radeon.test=0.
(In the previous boot I had edited radeon.test to 3 from the grub menu, so for this boot I didn't have to do anything to get radeon.test=0.)
Comment 18 abandoned account 2014-07-14 11:48:05 UTC
Created attachment 143001 [details]
kernel .config used in previous dmesgs

3.16-rc4
Comment 19 abandoned account 2014-07-14 12:05:18 UTC
Created attachment 143011 [details]
ati catalyst screenshot of the Information tab when both graphic cards were on

Found a screenshot of the ATI Catalyst Control Center Information tab, taken when I used the fglrx driver, which shows information about both graphics cards. Might be useful.
Comment 20 abandoned account 2014-07-14 16:29:29 UTC
Created attachment 143031 [details]
sloppy patch try

I modified your previous patch slightly (because some hunks failed and there was a compilation error). It still doesn't handle the case runpm=1, only runpm=-1, but I wanted to test it regardless. Just as expected, it doesn't work (the system still locks up). I'm realizing that's because vgaswitcheroo is what turns OFF the DIS card, and since that is what causes the freeze anyway, I guess the problem is that vgaswitcheroo cannot turn off DIS without crashing (just like the AMD Mobility drivers do on Windows).
Too tired, can't keep my eyes open, but I'll try to find out more about it in the morning. :)
Comment 21 abandoned account 2014-11-12 15:53:57 UTC
Created attachment 157401 [details]
updated sloppy patch to kernel 3.18-rc4

Doesn't freeze (works) with these kernel params:
radeon.dpm=1 radeon.runpm=1
radeon.dpm=0 radeon.runpm=1
(but this doesn't turn off the DIScrete card)
sudo cat /sys/kernel/debug/vgaswitcheroo/switch
0:IGD:+:Pwr:0000:00:01.0
1:DIS: :DynPwr:0000:01:00.0
# echo 'OFF' > /sys/kernel/debug/vgaswitcheroo/switch
has no effect

Freezes system with:
radeon.dpm=0 radeon.runpm=-1
(this is what I tested; it freezes the system about 10 seconds after boot)

Without this patch:
Tested to freeze only when running:
# echo 'OFF' > /sys/kernel/debug/vgaswitcheroo/switch
with:
radeon.dpm=1 radeon.runpm=0
radeon.dpm=0 radeon.runpm=0
Comment 22 abandoned account 2014-11-12 17:01:21 UTC
To avoid any freezes, with the caveat that the DIScrete card won't ever be turned off (even # echo 'OFF' > /sys/kernel/debug/vgaswitcheroo/switch has no effect), I am using:
 radeon.dpm=1 radeon.runpm=1
with the above patch (originally made by Alex Deucher in comment #14), which I also put here: https://github.com/emanueLczirai/coostomhuston/blob/a3c118ac44b616ebcc049419cc08c4d13ebb44bd/system/lenovo%20z575/OS/manjaro/filesystem%20now/home/emacs/build/kernel/linuxgit/2100_DIScrete_gfx_card_systemfreeze.patch

As per our IRC conversation, it seems the Lenovo board may require some quirk to avoid the system freeze, so I accept the current workaround. Besides, I always keep my DIS card off from the BIOS anyway (the BIOS setting Graphics: UDMA, instead of Dynamic, does this).

I am thus giving up on this for now, but I'm always ready to test new ideas, should any arise.

Thank you.
Comment 23 Vladislav Kamenev 2016-03-27 21:49:30 UTC
This bug also affects me, but in a different way.
I'm using DRI_PRIME to switch between the integrated and discrete GPUs.
Running the system on the integrated GPU works perfectly.
When I launch games or videos (not all, but many of them) on the discrete card, the system freezes after a short time.
Netconsole: empty.
Kdump: no record.
I have no idea how to debug it.
radeon.runpm=0 is present.

Looking forward to some discussion
