Bug 63101

Summary: Hard lockup when launching games like TF2 on kernels 3.11.5 and 3.12 rc4 and above if radeon.dpm=1 is used
Product: Drivers
Reporter: Kertesz Laszlo (laszlo.kertesz)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: NEW
Severity: normal
CC: alexandre.f.demers, alexdeucher, i, satishsaley, szg00000
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.12.0
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments:
- dmesg of the locked up session
- dmesg of 95167aad6761ec8a0fc76506ed00439483208ee1
- x crashlog
- dmesg, kernel next-20131108 until it froze
- disable various dpm features
- adjust dpm features for stability
- adjust dpm features for stability v2

Description Kertesz Laszlo 2013-10-15 20:04:09 UTC
Created attachment 111161 [details]
dmesg of the locked up session

I began to experience system lockups when launching Team Fortress 2 on an A8-5500 (using its integrated 7560 Radeon video with the radeon driver). It started around kernel 3.12 rc4 (I build the kernel from mainline git).
I have a leftover package of 3.12 rc3+ and that works fine.

This also happens with the stable 3.11.5 kernel version.

Kernel 3.11.0 doesn't have this issue.

When I launch the game, the menus are shown fine, but as soon as I enter the actual game, it either crashes outright and throws me back to the desktop, or the last frame stays frozen on screen.
Either way, I can move the mouse but cannot interact with windows, and the keyboard responds only to the magic SysRq keys. From time to time the system beeps.
Dmesg is filled with warning and BUG/Oops entries.

Begins with:

Oct 15 21:21:17 laca-desktop kernel: [  355.859798] radeon 0000:00:01.0: GPU lockup CP stall for more than 10000msec
Oct 15 21:21:17 laca-desktop kernel: [  355.859807] radeon 0000:00:01.0: GPU lockup (waiting for 0x0000000000001276 last fence id 0x0000000000001275)
Oct 15 21:21:17 laca-desktop kernel: [  355.866308] [TTM] Buffer eviction failed

Then follow warnings like:

WARNING: CPU: 3 PID: 4041 at drivers/gpu/drm/drm_mm.c:257 drm_mm_remove_node+0x104/0x110 [drm]()

After that the BUGs:

Oct 15 21:21:20 laca-desktop kernel: [  358.208965] BUG: unable to handle kernel NULL pointer dereference at 0000000000000088
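
The message sequence quoted above (GPU lockup, TTM eviction failure, drm_mm warning, NULL pointer BUG) can be picked out of a long dmesg mechanically. The following is only a minimal sketch that matches the substrings visible in this report; the enum and the function name classify_log_line are hypothetical and not part of any kernel or distribution tool.

```c
#include <string.h>

/* Categories in the order the messages appear in the report. */
enum log_kind {
    LOG_OTHER = 0,
    LOG_GPU_LOCKUP,
    LOG_TTM_EVICTION,
    LOG_DRM_WARNING,
    LOG_KERNEL_BUG
};

/* Classify one dmesg line by plain substring match (hypothetical helper). */
enum log_kind classify_log_line(const char *line)
{
    if (strstr(line, "GPU lockup"))
        return LOG_GPU_LOCKUP;
    if (strstr(line, "[TTM] Buffer eviction failed"))
        return LOG_TTM_EVICTION;
    /* The warnings in this report all point at drm_mm.c. */
    if (strstr(line, "WARNING:") && strstr(line, "drm_mm.c"))
        return LOG_DRM_WARNING;
    if (strstr(line, "BUG: unable to handle kernel NULL pointer dereference"))
        return LOG_KERNEL_BUG;
    return LOG_OTHER;
}
```

Feeding each line of the attached dmesg through such a function and noting the first non-OTHER result would show which failure came first in a given session.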

The system is unusable (at best I can move the mouse pointer); I have to reboot with the SysRq combinations.

All of the problematic kernels share the new cpufreq code, I suppose, and they all behave the same way: instead of powering the CPU between 0.9 and 1.34 V as earlier kernels did, they now use 0.65 to 1.22 V. The CPU barely reaches its nominal 3.2 GHz speed and never goes above it (before, it went up to ~3.5 GHz and boosting worked; now core boosting never works).
Comment 1 Alex Deucher 2013-10-15 22:57:35 UTC
Can you bisect?
Comment 2 Kertesz Laszlo 2013-10-16 06:32:26 UTC
Going back to commit 95167aad6761ec8a0fc76506ed00439483208ee1:

- GPU lockup only, a somewhat recoverable error; the GPU stays locked in the highest power state.
Lots of errors like
[  523.602078] radeon 0000:00:01.0: ffff8800034ebc00 pin failed
And
[  546.612391] WARNING: CPU: 0 PID: 2891 at drivers/gpu/drm/drm_mm.c:257 ttm_bo_man_put_node+0x23/0x3d [ttm]()

No other errors; I can even launch other games (tried DoD). The CPU is severely throttled (this is with a 4-thread compile; the CPU is rated at 3.2 GHz):

# cpufreq-aperf -o
CPU	Average freq(KHz)	Time in C0	Time in Cx	C0 percentage
000	1536000			00 sec 996 ms	00 sec 003 ms	99
001	1536000			00 sec 984 ms	00 sec 015 ms	98
002	1536000			00 sec 970 ms	00 sec 029 ms	97
003	1536000			00 sec 985 ms	00 sec 014 ms	98

Probably because the GPU is in the highest power state.
Dmesg attached. I will continue bisecting.
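
To put a number on the throttling in the cpufreq-aperf output above, the reported average of 1536000 kHz can be compared with the rated 3.2 GHz (3200000 kHz). This is just a worked-arithmetic sketch; the helper name percent_of_rated is hypothetical.

```c
/* Integer percentage of the rated clock, rounded down.
 * Both arguments are in kHz, as printed by cpufreq-aperf. */
unsigned int percent_of_rated(unsigned long long avg_khz,
                              unsigned long long rated_khz)
{
    if (rated_khz == 0)
        return 0; /* avoid dividing by zero on bad input */
    return (unsigned int)((avg_khz * 100ULL) / rated_khz);
}
```

With the values from the table, percent_of_rated(1536000, 3200000) gives 48, so all four cores were running at roughly half of their rated clock under full load.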
Comment 3 Kertesz Laszlo 2013-10-16 06:39:28 UTC
Created attachment 111221 [details]
dmesg of 95167aad6761ec8a0fc76506ed00439483208ee1
Comment 4 Kertesz Laszlo 2013-10-16 06:41:00 UTC
X crashed right after I posted the previous comment. Attached the Xorg log with the crash.
Comment 5 Kertesz Laszlo 2013-10-16 06:41:28 UTC
Created attachment 111231 [details]
x crashlog
Comment 6 Christopher Meng 2013-10-18 03:21:14 UTC
3.12 RC4 made my system laggy, but RC5 seems to have fixed the problem.
Comment 7 Kertesz Laszlo 2013-10-18 05:54:32 UTC
I run the latest git, updated to commit 83f11a9cf2578b104c0daf18fc9c7d33c3d6d53a, and I still have this issue.

The problem was introduced somewhere between

a2ac07fe292ea41296049dfdbfeed203e2467ee7 (v3.12-rc3-71-ga2ac07f)
and
0bfdbf0e79ab20394c932f27f6d3a34b757035ef (v3.12-rc3-276-g0bfdbf0).
I ran into an issue when bisecting: when checking out certain commits that were shown as rc3+ in the commit log, I was thrown back to somewhere between rc1 and rc2.
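
For a sense of how much work is left in that range: the two git-describe names above differ by 276 - 71 = 205 commits, assuming the first commit is an ancestor of the second. A binary search over n candidates needs at most ceil(log2(n)) test builds, which the sketch below computes; the function name bisect_steps is hypothetical and this is not what git itself prints.

```c
/* Smallest k with 2^k >= n: the worst-case number of halvings
 * needed to isolate one commit among n candidates. */
unsigned int bisect_steps(unsigned long n)
{
    unsigned int steps = 0;
    unsigned long span = 1;

    while (span < n) {
        span *= 2;
        steps++;
    }
    return steps;
}
```

For the range above, bisect_steps(276 - 71) returns 8, so about eight kernel builds should be enough to find the first bad commit.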
Comment 8 Kertesz Laszlo 2013-10-18 10:10:17 UTC
OK, I finished bisecting.
According to git, 6d15ee492809d38bd62237b6d0f6a81d4dd12d15 (v3.12-rc3-267-g6d15ee4) is the first bad commit.
Comment 9 Kertesz Laszlo 2013-10-19 06:55:47 UTC
Reverted 6d15ee492809d38bd62237b6d0f6a81d4dd12d15, but it didn't help; I still get the lockup.

Build v3.12-rc3-265-gafe05d4 (until commit afe05d41e2c25ca3e047f9c7e5341bda553a932f) is working.
Comment 10 Kertesz Laszlo 2013-10-19 13:09:57 UTC
This issue has to do with the radeon driver, because if I don't append radeon.dpm=1 to the kernel boot options, TF2 launches even with the latest 3.12 rc5+ kernel (but it has dismal performance).

Also, the CPU voltage is correct without the dpm option: between 0.9 and 1.34 V, instead of 0.65 to 1.22 V as it is with dpm.
Comment 11 Alex Deucher 2013-10-19 13:13:10 UTC
(In reply to Kertesz Laszlo from comment #10)
> This issue has to do with the radeon driver, because if i dont append
> radeon.dpm=1 to the kernel boot options, TF2 launches even with latest 3.12
> rc5+ kernel (but it has dismal performance).

If it was dpm specific you should have mentioned that.  Make sure your kernel has this patch:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4076a65544e2de310cbf4eaadb13ee15bbfaaf4f
Comment 12 Kertesz Laszlo 2013-10-19 13:25:04 UTC
(In reply to Alex Deucher from comment #11)
> (In reply to Kertesz Laszlo from comment #10)
> > This issue has to do with the radeon driver, because if i dont append
> > radeon.dpm=1 to the kernel boot options, TF2 launches even with latest 3.12
> > rc5+ kernel (but it has dismal performance).
> 
> If it was dpm specific you should have mentioned that.  Make sure your
> kernel has this patch:
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/
> ?id=4076a65544e2de310cbf4eaadb13ee15bbfaaf4f

Sorry about that. I have had the dpm option set since it first appeared and totally forgot about it. I noticed this when I started the computer in single-user mode, which doesn't have that option, and then proceeded to runlevel 2.

I have that patch; I have the Linus kernel cloned from git (now at version v3.12-rc5-123-g04919af). I even re-cloned it recently.

Also please take a look at bug #62861 - it is related to the dpm code too.
Comment 13 Kertesz Laszlo 2013-10-20 08:03:50 UTC
I reverted commit 4076a65544e2de310cbf4eaadb13ee15bbfaaf4f, but it still crashes with dpm.

Reverting this commit changed the CPU voltage handling back to what it was before.

It seems that CPU power management for APUs is best in its latest form. If dpm is not used, or with 4076a65544e2de310cbf4eaadb13ee15bbfaaf4f reverted and dpm activated, the CPU is scaled with voltages between 0.9 and 1.34 V, using all frequencies. However, if all cores are used, after a while the CPU becomes throttled and the effective frequencies drop to around 3 GHz, in some cases to ~2 GHz or lower (temperatures go as high as 60C without dpm and ~56C with dpm).

With the new code the voltages are ~0.65 to 1.22 V and the frequencies max out at 3.168 GHz (rarely 3.2), coupled with a lower maximum temperature (51C at most, mostly 49C). The frequencies are stable, though: I can compile a kernel with 4 threads and there is absolutely no throttling. The only thing missing is the highest voltage level (1.34 V), which enables turbo mode.
Comment 14 Kertesz Laszlo 2013-10-22 13:25:52 UTC
I did some more testing. It turns out that I can get into the game if I set the texture quality to "low". But it still crashes at some point afterwards, usually after 20 or so minutes.

Sometimes I get tons of messages like this:

Oct 22 13:06:00 laca-desktop kernel: [  561.631586] [TTM] Buffer eviction failed
Oct 22 13:06:00 laca-desktop kernel: [  561.631594] radeon 0000:00:01.0: ffff88005c4e8400 pin failed
Oct 22 13:06:00 laca-desktop kernel: [  561.631596] [drm:radeon_crtc_page_flip] *ERROR* failed to pin new rbo buffer before flip

or

Oct 22 13:06:00 laca-desktop kernel: [  561.743974] radeon 0000:00:01.0: ffff88005c4e8400 pin failed
Oct 22 13:06:00 laca-desktop kernel: [  561.743977] [drm:radeon_crtc_page_flip] *ERROR* failed to pin new rbo buffer before flip


With the latest git kernel, the computer is still somewhat usable after the GPU lockup message: the GPU is throttled to max and OpenGL apps fail with "Bus Error" (both TF2 and glxgears report this error). Sometimes the computer freezes after a while if left in this state.
Comment 15 Kertesz Laszlo 2013-11-08 14:33:15 UTC
Created attachment 113871 [details]
dmesg, kernel next-20131108 until it froze

Tried the next-20131108 kernel.
It still crashes, but now the GPU soft-resets, sometimes even 5 or so times in a row. If it resets multiple times, the X server usually freezes and only the mouse moves. Sometimes it resets once or twice and then the X server is usable (it sometimes resets again later out of the blue, but the X session isn't affected).
But if I try to launch TF2 again, it locks the X server and sometimes freezes the whole computer (the logs don't show anything). Most times I was able to reboot with the SysRq keys.

PS. Weird thing: I was once able to play TF2 with the 3.12 final kernel too. It was perfectly playable then; I launched it a few times in a row and had no issues whatsoever. I did nothing out of the ordinary. The session had been running for a day or two before I launched it; before that I had compiled stuff, tested stuff in virtual machines, watched movies, browsed, etc.
However, after a reboot I couldn't reproduce this. No software (Mesa etc.) was changed, but after the reboot it crashed as before.
Comment 16 Kertesz Laszlo 2013-11-09 20:57:49 UTC
It seems that it does work after a day or so of uptime. I had 1 day and 6 or 8 hours of uptime and I could play the game just fine. Before that I had used programs that consume quite a lot of memory (Eclipse, virtual machines, browsers).

Any ideas why this might be happening?
Comment 17 Alex Deucher 2013-11-11 18:38:57 UTC
Created attachment 114241 [details]
disable various dpm features

You might try disabling trinity dpm features to see if you can narrow down what's causing the stability problems.  Try the attached patch and see if it helps, and if so, try enabling additional features to narrow down the problem.
Comment 18 Kertesz Laszlo 2013-11-12 19:00:06 UTC
Tested with 3.12.0-next-20131111, started the computer and launched Steam and TF2 right away.

Activating the options from enable_nbps_policy through enable_gfx_clock_gating worked well.

The enable_mg_clock_gating, enable_gfx_dynamic_mgpg and override_dynamic_mgpg options gave GPU lockups and soft resets (same as in the attachment above), mostly after I entered the map, but sometimes the game resumed after the reset and I could play without issues.
Combining them with uvd_dpm seems to be the killer factor, though.

This is what seems to work well so far (uvd_dpm seems not to create problems if the mgpg and mg options are disabled):

	pi->enable_bapm = false;
	pi->enable_nbps_policy = true; /* mychange */
	pi->enable_sclk_ds = true;
	pi->enable_gfx_power_gating = true;
	pi->enable_gfx_clock_gating = true;
	pi->enable_mg_clock_gating = false;
	pi->enable_gfx_dynamic_mgpg = false; /* ??? */
	pi->override_dynamic_mgpg = false;
	pi->enable_auto_thermal_throttling = true;
	pi->voltage_drop_in_dce = false; /* need to restructure dpm/modeset interaction */
	pi->uvd_dpm = true; /* ??? */
Comment 19 Kertesz Laszlo 2013-11-13 07:23:41 UTC
Enabling any of enable_mg_clock_gating, enable_gfx_dynamic_mgpg, override_dynamic_mgpg and uvd_dpm will cause a GPU soft reset, a TF2 crash and, most times, an unusable desktop, sometimes with a garbled image (composed of the real desktop tiles mixed with some of the TF2 tiles). This is exactly the behavior described with the unmodified 3.12.0-next-20131111 kernel.
Comment 20 Alex Deucher 2013-11-14 15:24:14 UTC
Created attachment 114671 [details]
adjust dpm features for stability

This patch should do the trick.  With this it sounds like dpm is finally stable on your system.
Comment 21 Kertesz Laszlo 2013-11-14 16:26:46 UTC
The 

pi->uvd_dpm = true; /* ??? */

line above wasn't a typo. I was trying to say that:

- If ANY of the enable_mg_clock_gating, enable_gfx_dynamic_mgpg, override_dynamic_mgpg options was enabled, it caused soft resets, which most of the time were recovered from IF uvd_dpm was false.
- But if ANY of the above 3 options was true AND uvd_dpm was true, I got the worst kind of behavior, which most times rendered the desktop unusable.
- If the 3 options were false AND uvd_dpm was true, it works.
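
The three cases above amount to a small decision table, which can be written down as follows. This is only a sketch of the behavior observed on this one A8-5500 system, not driver code: the struct dpm_flags, the enum and the function expected_outcome are hypothetical, and only the four field names come from the trinity dpm options discussed here. The case where all four options are false is not covered by the list; treating it as stable is an assumption based on the earlier test with the features disabled.

```c
#include <stdbool.h>

/* The four options whose combination mattered in the tests. */
struct dpm_flags {
    bool enable_mg_clock_gating;
    bool enable_gfx_dynamic_mgpg;
    bool override_dynamic_mgpg;
    bool uvd_dpm;
};

enum dpm_outcome {
    DPM_STABLE,      /* games and video decoding work */
    DPM_SOFT_RESETS, /* GPU soft resets, mostly recovered */
    DPM_UNUSABLE     /* resets that usually leave the desktop unusable */
};

/* Map a flag combination to the behavior reported in this thread. */
enum dpm_outcome expected_outcome(const struct dpm_flags *f)
{
    bool any_mg = f->enable_mg_clock_gating ||
                  f->enable_gfx_dynamic_mgpg ||
                  f->override_dynamic_mgpg;

    if (!any_mg)
        return DPM_STABLE; /* uvd_dpm == false here is an assumption */
    return f->uvd_dpm ? DPM_UNUSABLE : DPM_SOFT_RESETS;
}
```

Read this way, uvd_dpm on its own is harmless; it only makes things worse when at least one of the three mg/mgpg options is also enabled.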

I have the kernel (next-20131113) compiled with that exact piece of code I pasted in my previous comment right now, and everything works fine so far: all the games I tried, hardware decoding with XBMC and the WebGL demos from chromeexperiments (on SeaMonkey and Firefox; Chrome seems to hate mesa/radeon).
Comment 22 Alex Deucher 2013-11-14 16:45:25 UTC
Created attachment 114681 [details]
adjust dpm features for stability v2

Thanks for clarifying.  New patch attached.