Bug 63101
Summary: | Hard lockup when launching games like TF2 on kernels 3.11.5 and 3.12 rc4 and above if radeon.dpm=1 is used | ||
---|---|---|---|
Product: | Drivers | Reporter: | Kertesz Laszlo (laszlo.kertesz) |
Component: | Video(DRI - non Intel) | Assignee: | drivers_video-dri |
Status: | NEW --- | ||
Severity: | normal | CC: | alexandre.f.demers, alexdeucher, i, satishsaley, szg00000 |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 3.12.0 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
dmesg of the locked up session
dmesg of 95167aad6761ec8a0fc76506ed00439483208ee1
x crashlog
dmesg, kernel next-20131108 until it froze
disable various dpm features
adjust dpm features for stability
adjust dpm features for stability v2 |
Can you bisect?

Going back to commit 95167aad6761ec8a0fc76506ed00439483208ee1 - GPU lockup only, somewhat recoverable error, GPU locked in highest power state. Lots of errors like

  [ 523.602078] radeon 0000:00:01.0: ffff8800034ebc00 pin failed

and

  [ 546.612391] WARNING: CPU: 0 PID: 2891 at drivers/gpu/drm/drm_mm.c:257 ttm_bo_man_put_node+0x23/0x3d [ttm]()

No other errors, I can even launch other games (tried DoD). The CPU is severely throttled (this is with a 4-thread compile, the CPU is rated at 3.2 GHz):

  # cpufreq-aperf -o
  CPU  Average freq(KHz)  Time in C0     Time in Cx     C0 percentage
  000  1536000            00 sec 996 ms  00 sec 003 ms  99
  001  1536000            00 sec 984 ms  00 sec 015 ms  98
  002  1536000            00 sec 970 ms  00 sec 029 ms  97
  003  1536000            00 sec 985 ms  00 sec 014 ms  98

probably because of the GPU being in the highest state. Attached dmesg. I continue bisecting.

Created attachment 111221 [details]
dmesg of 95167aad6761ec8a0fc76506ed00439483208ee1
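The throttling figure in the comment above can be checked with a little arithmetic. What follows is a minimal Python sketch, not part of the original report: it parses the four cpufreq-aperf rows quoted in the comment and compares the average frequency with the 3.2 GHz rated clock. The helper names `parse_aperf` and `throttle_ratio` are hypothetical and introduced only for illustration.

```python
# Rated clock quoted in the report: 3.2 GHz, expressed in KHz.
RATED_KHZ = 3_200_000

# Data rows copied from the cpufreq-aperf output quoted in the comment.
APERF_OUTPUT = """\
000 1536000 00 sec 996 ms 00 sec 003 ms 99
001 1536000 00 sec 984 ms 00 sec 015 ms 98
002 1536000 00 sec 970 ms 00 sec 029 ms 97
003 1536000 00 sec 985 ms 00 sec 014 ms 98
"""


def parse_aperf(text):
    """Return a list of (cpu, average_khz, c0_percent) tuples.

    Each data row has 11 whitespace-separated fields:
    CPU, average KHz, "SS sec MMM ms" twice, and the C0 percentage.
    Lines that do not match this shape (for example a header) are skipped.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 11 or not fields[0].isdigit():
            continue
        rows.append((int(fields[0]), int(fields[1]), int(fields[-1])))
    return rows


def throttle_ratio(average_khz, rated_khz=RATED_KHZ):
    """Fraction of the rated clock that the core actually averaged."""
    return average_khz / rated_khz


if __name__ == "__main__":
    for cpu, khz, c0 in parse_aperf(APERF_OUTPUT):
        pct = 100 * throttle_ratio(khz)
        print(f"CPU {cpu}: {khz / 1000:.0f} MHz, {pct:.0f}% of rated, C0 {c0}%")
```

With the figures from the report, 1536000 KHz works out to 48% of the rated 3.2 GHz on every core, while the cores stay in C0 97-99% of the time, which is consistent with the reporter's description of heavy throttling under load.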
X crashed right after I posted the previous comment. Attached Xorg log with the crash. Created attachment 111231 [details]
x crashlog
3.12 RC4 made my system laggy but RC5 seems to have fixed the problem.

I run the latest git, updated up to commit 83f11a9cf2578b104c0daf18fc9c7d33c3d6d53a, and I still have this issue.

The problem was introduced somewhere between a2ac07fe292ea41296049dfdbfeed203e2467ee7 (v3.12-rc3-71-ga2ac07f) and 0bfdbf0e79ab20394c932f27f6d3a34b757035ef (v3.12-rc3-276-g0bfdbf0). I ran into some issues when bisecting: I was thrown back to between rc1 and rc2 when checking out certain commits which were shown as rc3+ in the commit log.

OK, I finished bisecting. According to git, 6d15ee492809d38bd62237b6d0f6a81d4dd12d15 (v3.12-rc3-267-g6d15ee4) is the first bad commit.

Reverted 6d15ee492809d38bd62237b6d0f6a81d4dd12d15, but it didn't help, I still get the lockup. Build v3.12-rc3-265-gafe05d4 (up to commit afe05d41e2c25ca3e047f9c7e5341bda553a932f) is working.

This issue has to do with the radeon driver, because if I don't append radeon.dpm=1 to the kernel boot options, TF2 launches even with the latest 3.12 rc5+ kernel (but it has dismal performance). Also, the CPU voltage is correct without the dpm option - between 0.9-1.34 V, instead of 0.65-1.22 V as it is with dpm.

(In reply to Kertesz Laszlo from comment #10)
> This issue has to do with the radeon driver, because if i dont append
> radeon.dpm=1 to the kernel boot options, TF2 launches even with latest 3.12
> rc5+ kernel (but it has dismal performance).

If it was dpm specific you should have mentioned that. Make sure your kernel has this patch:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4076a65544e2de310cbf4eaadb13ee15bbfaaf4f

(In reply to Alex Deucher from comment #11)
> (In reply to Kertesz Laszlo from comment #10)
> > This issue has to do with the radeon driver, because if i dont append
> > radeon.dpm=1 to the kernel boot options, TF2 launches even with latest 3.12
> > rc5+ kernel (but it has dismal performance).
>
> If it was dpm specific you should have mentioned that. Make sure your
> kernel has this patch:
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4076a65544e2de310cbf4eaadb13ee15bbfaaf4f

Sorry about that. I have had the dpm option there since it appeared and totally forgot about it. Now I started the computer in single user mode, which doesn't have that option, then proceeded to runlevel 2. This way I saw this. I have that patch; I have the Linus kernel cloned from git (now at version v3.12-rc5-123-g04919af). I even re-cloned it recently. Also please take a look at bug #62861 - it is related to the dpm code too.

I reverted commit 4076a65544e2de310cbf4eaadb13ee15bbfaaf4f, but it still crashes with dpm. Reverting this commit changed the CPU voltage handling back to how it was before.

It seems that the CPU power management is best for APUs in its latest form. If dpm is not used, or with 4076a65544e2de310cbf4eaadb13ee15bbfaaf4f reverted and dpm activated, the CPU is scaled with voltages between 0.9 and 1.34, using all frequencies, but if all cores are used, after a while the CPU becomes throttled and the effective frequencies go down to around ~3 GHz or even lower, in some cases even ~2 GHz or lower (temps go as high as 60C without dpm and ~56C with dpm). With the new code we have ~0.65 - 1.22 volts and the frequencies are maxed out at 3.168 GHz (rarely 3.2), coupled with a lower max temperature (max 51C, mostly 49), but the frequencies are stable: I can compile a kernel with 4 threads and there is absolutely no throttling. The only thing missing is the highest voltage level (1.34), which enables turbo mode.

I did some more testing. Turns out that I can get into the game if I set the texture quality to "low". But it still crashes afterwards at some point - usually after ~20 or so minutes.
Sometimes I have tons of messages like this:

  Oct 22 13:06:00 laca-desktop kernel: [ 561.631586] [TTM] Buffer eviction failed
  Oct 22 13:06:00 laca-desktop kernel: [ 561.631594] radeon 0000:00:01.0: ffff88005c4e8400 pin failed
  Oct 22 13:06:00 laca-desktop kernel: [ 561.631596] [drm:radeon_crtc_page_flip] *ERROR* failed to pin new rbo buffer before flip

or

  Oct 22 13:06:00 laca-desktop kernel: [ 561.743974] radeon 0000:00:01.0: ffff88005c4e8400 pin failed
  Oct 22 13:06:00 laca-desktop kernel: [ 561.743977] [drm:radeon_crtc_page_flip] *ERROR* failed to pin new rbo buffer before flip

With the latest git kernel, after the GPU lockup message the computer is still somewhat usable; the GPU is throttled to max and OpenGL apps fail with "Bus Error" (both TF2 and glxgears report this error). Sometimes, after a while, the computer freezes if left in this state.

Created attachment 113871 [details]
dmesg, kernel next-20131108 until it froze
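When comparing dmesg attachments across kernels, it can help to count how often each of the recurring radeon/TTM messages appears. The following Python sketch is an illustration only, not a tool used in this thread: the pattern names and the `count_events` helper are hypothetical, and the regular expressions are written against the message text quoted in the comments above.

```python
import re
from collections import Counter

# Message signatures taken from the log excerpts quoted in this report.
# The dictionary keys are made-up labels for illustration.
PATTERNS = {
    "gpu_lockup": re.compile(r"GPU lockup"),
    "eviction_failed": re.compile(r"\[TTM\] Buffer eviction failed"),
    "pin_failed": re.compile(r"radeon \S+: [0-9a-f]+ pin failed"),
    "flip_pin_failed": re.compile(r"failed to pin new rbo buffer before flip"),
    "null_deref": re.compile(
        r"BUG: unable to handle kernel NULL pointer dereference"
    ),
}


def count_events(lines):
    """Count how many lines match each known message signature."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# The five lines quoted in the comment above, without the syslog prefix.
SAMPLE = [
    "[ 561.631586] [TTM] Buffer eviction failed",
    "[ 561.631594] radeon 0000:00:01.0: ffff88005c4e8400 pin failed",
    "[ 561.631596] [drm:radeon_crtc_page_flip] *ERROR* failed to pin new rbo buffer before flip",
    "[ 561.743974] radeon 0000:00:01.0: ffff88005c4e8400 pin failed",
    "[ 561.743977] [drm:radeon_crtc_page_flip] *ERROR* failed to pin new rbo buffer before flip",
]

if __name__ == "__main__":
    for name, n in sorted(count_events(SAMPLE).items()):
        print(f"{name}: {n}")
```

To run it over a saved log, pass an open file object instead of `SAMPLE`, for example `count_events(open("dmesg.txt"))`, where `dmesg.txt` is a placeholder file name.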
Tried the next-20131108 kernel.
Still crashes, but now the GPU soft-resets, sometimes even 5 or so times in a row. If it resets multiple times, it usually freezes the X server and only the mouse is moving. Sometimes it resets 1 or 2 times first, then the X server is usable (it sometimes resets afterwards out of the blue too, but the X session isn't affected).
But if I try to launch TF2 again, it locks the X server and sometimes freezes the whole computer (logs don't show anything). Most times I was able to reboot with the sysrq keys.
PS. Weird thing, I once was able to play TF2 with the 3.12 final kernel too. Then it was perfectly playable, I launched it a few times in a row and had no issues whatsoever. I did nothing out of the ordinary; the session had been running for a day or 2 before I launched it, and previously I compiled stuff, tested stuff in virtual machines, watched movies, browsed etc.
However, after reboot I couldn't reproduce this. No software (mesa etc.) was changed, but after reboot it crashed as before.
It seems that it does work after a day or so of uptime. I had 1 day and 6 or 8 hours of uptime and I could play the game just fine. Previously I did use programs that used quite some memory (eclipse, virtual machines, browsers). Any ideas why this might be happening? Created attachment 114241 [details]
disable various dpm features
You might try disabling trinity dpm features to see if you can narrow down what's causing the stability problems. Try the attached patch and see if it helps, and if so, try enabling additional features to narrow down the problem.
Tested with 3.12.0-next-20131111: started the computer and launched Steam and TF2 right away.

Activating the options from enable_nbps_policy to enable_gfx_clock_gating worked well. The enable_mg_clock_gating, enable_gfx_dynamic_mgpg and override_dynamic_mgpg options gave GPU lockups and soft resets (same as the above attachment), mostly after I entered the map, but sometimes after the reset the game resumed and I could play without issues. Combining them with uvd_dpm seems to be the killer factor though.

This is what seems to work well so far (uvd_dpm seems to not create problems if the mgpg and mg options are disabled):

  pi->enable_bapm = false;
  pi->enable_nbps_policy = true;
  /*mychange*/
  pi->enable_sclk_ds = true;
  pi->enable_gfx_power_gating = true;
  pi->enable_gfx_clock_gating = true;
  pi->enable_mg_clock_gating = false;
  pi->enable_gfx_dynamic_mgpg = false; /* ??? */
  pi->override_dynamic_mgpg = false;
  pi->enable_auto_thermal_throttling = true;
  pi->voltage_drop_in_dce = false; /* need to restructure dpm/modeset interaction */
  pi->uvd_dpm = true; /* ??? */

Enabling any of enable_mg_clock_gating, enable_gfx_dynamic_mgpg, override_dynamic_mgpg and uvd_dpm will cause GPU soft resets, TF2 to crash and most times an unusable desktop, sometimes with a garbled image (composed of the real desktop tiles with some of the TF2 tiles). Exactly the behavior described with the unmodified 3.12.0-next-20131111 kernel.

Created attachment 114671 [details]
adjust dpm features for stability
This patch should do the trick. With this it sounds like dpm is finally stable on your system.
The pi->uvd_dpm = true; /* ??? */ line above wasn't a typo. I was trying to say that:

- ANY of the enable_mg_clock_gating, enable_gfx_dynamic_mgpg, override_dynamic_mgpg options enabled caused soft resets, which most of the time were recovered IF uvd_dpm was false.
- But if ANY of the above 3 options was true AND uvd_dpm true, I got the worst kind of behavior, which most times rendered the desktop unusable.
- If the 3 options were false AND uvd_dpm true, it works.

I have the kernel (next-20131113) compiled with that exact piece of code I pasted in my previous comment right now and everything works fine so far: all games I tried, hardware decoding with xbmc and the WebGL demos from chromeexperiments (on Seamonkey and Firefox; Chrome seems to hate mesa/radeon).

Created attachment 114681 [details]
adjust dpm features for stability v2
Thanks for clarifying. New patch attached.
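The three cases the reporter spells out for the dpm flags amount to a small decision table. The Python sketch below restates them for illustration only. It is not driver code and not the attached patch: the `DpmFlags` class and `observed_outcome` function are hypothetical, only the four field names come from the thread, and the outcomes are one reporter's observations on a single A8-5500 system. The case with all four flags off is an assumption here, treated as stable because the thread reports no problems once the three gating options are disabled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DpmFlags:
    """The four trinity dpm options discussed in the comments (all default off)."""
    enable_mg_clock_gating: bool = False
    enable_gfx_dynamic_mgpg: bool = False
    override_dynamic_mgpg: bool = False
    uvd_dpm: bool = False


def observed_outcome(flags):
    """Map a flag combination to the behavior the reporter described."""
    any_mg_option = (
        flags.enable_mg_clock_gating
        or flags.enable_gfx_dynamic_mgpg
        or flags.override_dynamic_mgpg
    )
    if not any_mg_option:
        # Reported as working with uvd_dpm on; assumed stable with it off too.
        return "works"
    if flags.uvd_dpm:
        return "soft resets, desktop mostly unusable"
    return "soft resets, mostly recovered"


if __name__ == "__main__":
    # The configuration the reporter says is running fine.
    print(observed_outcome(DpmFlags(uvd_dpm=True)))
    # Any one of the three gating options together with uvd_dpm.
    print(observed_outcome(DpmFlags(enable_mg_clock_gating=True, uvd_dpm=True)))
```

Written this way, it is easier to see that uvd_dpm on its own is not the trigger; it only makes the failures worse when one of the three gating options is also enabled.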
|
Created attachment 111161 [details] dmesg of the locked up session

I began to experience system lockups when launching Team Fortress 2 on an A8-5500 (using its integrated Radeon 7560 video with the radeon driver). It started around kernel 3.12 rc4 (I build the kernel from mainline git). I have a leftover package of 3.12 rc3+ and that works fine. This also happens with the stable 3.11.5 kernel version. Kernel 3.11.0 doesn't have this issue.

As I launch the game, the menus are shown fine, but as soon as I enter the actual game, either it crashes outright and throws me back to the desktop or I have the last screen frozen. Either way, I can move the mouse but cannot interact with windows, and the keyboard responds only to the magic keys. From time to time the system beeps.

Dmesg is filled with warning and BUG/Oops entries. It begins with:

  Oct 15 21:21:17 laca-desktop kernel: [ 355.859798] radeon 0000:00:01.0: GPU lockup CP stall for more than 10000msec
  Oct 15 21:21:17 laca-desktop kernel: [ 355.859807] radeon 0000:00:01.0: GPU lockup (waiting for 0x0000000000001276 last fence id 0x0000000000001275)
  Oct 15 21:21:17 laca-desktop kernel: [ 355.866308] [TTM] Buffer eviction failed

Then follow warnings like:

  WARNING: CPU: 3 PID: 4041 at drivers/gpu/drm/drm_mm.c:257 drm_mm_remove_node+0x104/0x110 [drm]()

After that the BUGs:

  Oct 15 21:21:20 laca-desktop kernel: [ 358.208965] BUG: unable to handle kernel NULL pointer dereference at 0000000000000088

The system is unusable (I can move the mouse pointer in the best case); I have to reboot with the sysrq combinations.

All of the problematic kernels share the new cpufreq code, I suppose, and they all work the same way: instead of powering the CPU between 0.9-1.34 V as the kernels before did, now I have 0.65-1.22 volts, with the CPU barely reaching its nominal 3.2 GHz speed and never going above it (before, it went up to ~3.5 GHz and boosting was working; now core boosting never works).