Bug 61941 - Random GPU lockups/resets on Mobility Radeon HD 3650 with radeon.dpm=1
Summary: Random GPU lockups/resets on Mobility Radeon HD 3650 with radeon.dpm=1
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-09-23 21:40 UTC by Coacher
Modified: 2016-08-19 22:24 UTC
CC List: 10 users

See Also:
Kernel Version: >=3.11.0
Subsystem:
Regression: No
Bisected commit-id:


Attachments
lspci -vvv (35.81 KB, application/octet-stream)
2013-09-23 21:40 UTC, Coacher
dmesg after lockup (73.77 KB, text/plain)
2013-09-23 21:40 UTC, Coacher
another dmesg after lockup (72.49 KB, text/plain)
2013-09-23 21:41 UTC, Coacher

Description Coacher 2013-09-23 21:40:21 UTC
Created attachment 109311
lspci -vvv

Hello.

I am running a 3.11 kernel with radeon.dpm enabled via a kernel boot option. Occasionally, under circumstances I have not been able to pin down, my GPU resets. It happens rarely and at random during ordinary desktop use: web browsing, document editing, video playback. No games, video benchmarks, or other GPU-intensive workloads are involved.

During a lockup my screen turns black, but the backlight stays on. After a while the image reappears, blurry at first, and after 10-15 seconds the screen is completely back to normal. Sometimes I am able to continue working after the reset, but sometimes the screen stays stuck on the image that appeared after restoration and I have to reboot the machine to fix it.

Everything else seems to work fine during lockups; for example, music keeps playing without any problems. I can't say for sure, but the bug seems to happen more often when video is playing in Firefox. (JIC: I don't have Adobe Flash installed.)

My OS is Gentoo amd64 with vanilla kernel 3.11.{0,1}, and this happens with power_dpm_state set to "balanced". I've attached two dmesg outputs captured after such lockups.
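
For anyone trying to reproduce this, it may help to record the DPM configuration next to each dmesg capture. The following is a minimal sketch, assuming the card is card0 and that the radeon driver exposes power_dpm_state under /sys/class/drm/card0/device; the function names are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def dpm_from_cmdline(cmdline: str):
    """Return the value of radeon.dpm from a kernel command line, or None if absent."""
    for token in cmdline.split():
        if token.startswith("radeon.dpm="):
            return token.split("=", 1)[1]
    return None


def read_dpm_state(device_dir="/sys/class/drm/card0/device"):
    """Return the contents of power_dpm_state, or None if the file cannot be read.

    The default directory is an assumption (first DRM card); pass another path
    if the GPU is not card0.
    """
    state_file = Path(device_dir) / "power_dpm_state"
    try:
        return state_file.read_text().strip()
    except OSError:
        return None
```

On a live system, dpm_from_cmdline(Path("/proc/cmdline").read_text()) and read_dpm_state() together give the two values mentioned above.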

I understand that this description is probably too vague to act on, and I am ready to provide any additional info.
Comment 1 Coacher 2013-09-23 21:40:48 UTC
Created attachment 109321
dmesg after lockup
Comment 2 Coacher 2013-09-23 21:41:30 UTC
Created attachment 109331
another dmesg after lockup
Comment 3 Alex Deucher 2013-09-23 22:15:33 UTC
Do you only get the lockups with dpm enabled?  If so, try disabling certain dpm features and see if any of them help.  See if you can narrow down which if any of them help.  E.g.,

diff --git a/drivers/gpu/drm/radeon/rv6xx_dpm.c b/drivers/gpu/drm/radeon/rv6xx_dpm.c
index 5811d27..13c5267 100644
--- a/drivers/gpu/drm/radeon/rv6xx_dpm.c
+++ b/drivers/gpu/drm/radeon/rv6xx_dpm.c
@@ -1981,10 +1981,10 @@ int rv6xx_dpm_init(struct radeon_device *rdev)
        else
                pi->fb_div_scale = 0;
 
-       pi->voltage_control =
-               radeon_atom_is_voltage_gpio(rdev, SET_VOLTAGE_TYPE_ASIC_VDDC, 0);
+       pi->voltage_control = false;
+//             radeon_atom_is_voltage_gpio(rdev, SET_VOLTAGE_TYPE_ASIC_VDDC, 0);
 
-       pi->gfx_clock_gating = true;
+       pi->gfx_clock_gating = false;
 
        pi->sclk_ss = radeon_atombios_get_asic_ss_info(rdev, &ss,
                                                       ASIC_INTERNAL_ENGINE_SS, 0);
@@ -1993,13 +1993,14 @@ int rv6xx_dpm_init(struct radeon_device *rdev)
 
        /* Disable sclk ss, causes hangs on a lot of systems */
        pi->sclk_ss = false;
+       pi->mclk_ss = false;
 
        if (pi->sclk_ss || pi->mclk_ss)
                pi->dynamic_ss = true;
        else
                pi->dynamic_ss = false;
 
-       pi->dynamic_pcie_gen2 = true;
+       pi->dynamic_pcie_gen2 = false;
 
        if (pi->gfx_clock_gating &&
            (rdev->pm.int_thermal_type != THERMAL_TYPE_NONE))
Comment 4 Coacher 2013-09-25 15:44:30 UTC
(In reply to Alex Deucher from comment #3)
> Do you only get the lockups with dpm enabled?  If so, try disabling certain
> dpm features and see if any of them help.  See if you can narrow down which
> if any of them help.  E.g.,

So far, yes, it happens only with dpm enabled. I'll try switching individual features off and report whether that helps.
Comment 5 Coacher 2013-10-19 18:17:33 UTC
Disabling voltage control doesn't help, i.e. I still have this issue after the following change:

-       pi->voltage_control =
-               radeon_atom_is_voltage_gpio(rdev, SET_VOLTAGE_TYPE_ASIC_VDDC, 0);
+       pi->voltage_control = false;
+//             radeon_atom_is_voltage_gpio(rdev, SET_VOLTAGE_TYPE_ASIC_VDDC, 0);

I will try to disable dynamic_pcie_gen2 this time.
Comment 6 Coacher 2013-10-19 18:48:17 UTC
Disabling dynamic_pcie_gen2 also doesn't help. Will try mclk_ss next.
Comment 7 Coacher 2013-10-22 16:39:39 UTC
mclk_ss doesn't help either. Last option is gfx_clock_gating. Will try it next.
Comment 8 Coacher 2013-10-22 16:47:08 UTC
I've just noticed that if, after such a lockup, I only restart the X server without fully rebooting the machine, video playback (mplayer) is unavailable. xvinfo shows this:

X-Video Extension version 2.2
screen #0
 no adaptors present

Prior to the lockup, video playback works flawlessly with the very same configuration. Not sure whether this info is useful. These results were obtained with the latest stable kernel, 3.11.6.
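
The xvinfo symptom above could be checked automatically after each X restart. This is a rough sketch that only looks for the "no adaptors present" text shown in the output; the function names are hypothetical, and run_xvinfo assumes xvinfo is installed and an X server is reachable.

```python
import subprocess


def xv_adaptors_missing(xvinfo_output: str) -> bool:
    """Return True if the xvinfo output reports a screen without Xv adaptors."""
    return "no adaptors present" in xvinfo_output.lower()


def run_xvinfo() -> str:
    """Run xvinfo and return its standard output as text.

    Assumes the xvinfo binary is on PATH and DISPLAY points at a running X server.
    """
    return subprocess.run(["xvinfo"], capture_output=True, text=True).stdout
```

Calling xv_adaptors_missing(run_xvinfo()) after a lockup and X restart would then flag the broken state described here.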
Comment 9 Coacher 2013-10-25 18:55:14 UTC
Changing gfx_clock_gating doesn't help either. There is one more thing I changed alongside dpm, though: I enabled aspm=1 for the kernel driver.

I will remove aspm=1 from modprobe.conf so that the driver decides for itself whether it should be on or off. I'll report whether this helps.
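
To confirm which values the loaded module actually picked up after editing modprobe.conf, something like the following sketch could be used. It assumes the radeon module exposes its dpm and aspm parameters as readable files under /sys/module/radeon/parameters; the function name is hypothetical.

```python
from pathlib import Path


def read_radeon_params(names=("dpm", "aspm"),
                       params_dir="/sys/module/radeon/parameters"):
    """Map each parameter name to its current value, or None if it cannot be read."""
    values = {}
    for name in names:
        try:
            values[name] = (Path(params_dir) / name).read_text().strip()
        except OSError:
            # Module not loaded, or parameter not exported on this kernel.
            values[name] = None
    return values
```

As far as I know, a value of -1 for these parameters means the driver chooses automatically, which is the state expected once aspm=1 is removed.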
Comment 10 Coacher 2013-11-10 18:22:34 UTC
I've just had another such lockup on a vanilla 3.12 kernel without specifying the aspm option to the driver. What should I test now?
Comment 11 Coacher 2013-11-21 08:54:06 UTC
Phoronix recently published a series of Radeon card tests. Their setup was kernel 3.12 with dpm enabled, a Mesa 10.0 pre-release, and Ubuntu 13.10. They had the same problem as me [0]:

"The Radeon HD 3650 ... was unstable when running the Source Engine tests and resulted in "GPU lockup CP stall for more than 10000msec" and "*ERROR* radeon: fence wait failed" errors. With frequent GPU lock-ups, the HD 3650 tests were abandoned."

[0]: http://www.phoronix.com/scan.php?page=article&item=amd_gallium3d_tf2css&num=2
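
The two error messages quoted above can be used to pick the relevant lines out of a saved dmesg log. A minimal sketch, assuming the log is available as text and that the messages appear with the wording quoted in the article; the names are hypothetical.

```python
import re

# Patterns built from the two messages quoted in the Phoronix article.
LOCKUP_PATTERNS = (
    re.compile(r"GPU lockup CP stall for more than \d+msec"),
    re.compile(r"\*ERROR\* radeon: fence wait failed"),
)


def find_lockup_lines(dmesg_text: str):
    """Return the dmesg lines that match any of the lockup patterns, in order."""
    return [
        line
        for line in dmesg_text.splitlines()
        if any(pattern.search(line) for pattern in LOCKUP_PATTERNS)
    ]
```

Feeding it the contents of the attached dmesg files would give a quick count of lockup events per boot.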
Comment 12 Shawn Starr 2013-11-21 13:14:47 UTC
Ilya,

I have a theory I've been testing, and 'so far' it seems to hold. Do you have any CPU frequency daemons running (e.g. thermald), or have CPUFreq set to a governor other than performance?

If you set it to performance and test, do you still get GPU resets?

With GLAMOR and no EXA in xorg.conf, I haven't had a reset in two weeks of varied use, with DPM enabled and the CPU governor set to performance mode.

This might be a big stretch, but anything is possible when it comes to potential triggers.
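
To check the governor on every core before a test run, a sketch along these lines might help. It assumes the usual cpufreq sysfs layout under /sys/devices/system/cpu; the function names are hypothetical.

```python
from pathlib import Path


def read_governors(cpu_root="/sys/devices/system/cpu"):
    """Map each cpuN directory name to the contents of its scaling_governor file."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(cpu_root).glob(pattern)):
        cpu_name = gov_file.parent.parent.name  # e.g. "cpu0"
        governors[cpu_name] = gov_file.read_text().strip()
    return governors


def all_performance(governors) -> bool:
    """True only if at least one CPU was found and every one uses 'performance'."""
    return bool(governors) and all(g == "performance" for g in governors.values())
```

If all_performance(read_governors()) is False during a lockup, the run does not match the setup described above.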
Comment 13 Coacher 2013-11-21 16:14:57 UTC
(In reply to Shawn Starr from comment #12)
Hello, Shawn.

I don't have any such daemon, but my cpufreq governor is set to conservative both on AC and battery.

OK, I will try your suggestion, but could you please share what your theory is about?
Comment 14 Shawn Starr 2013-11-21 17:55:53 UTC
Somehow, when the CPU shifts frequency, this causes voltage changes that affect the GPU's switching between DPM power states.

This week I will rerun my old tests with EXA to see if this holds true, but this time keeping the cpufreq governor in performance mode only.

I could be grasping at straws but really, nothing would surprise me these days :)
Comment 15 Shawn Starr 2013-11-22 15:10:54 UTC
My theory is incorrect, but what is interesting is that I could get the GPU to reset quicker. Still, EXA is unreliable; use GLAMOR in your xorg.conf (if you don't have a newer stack).

This was stable, even if it's masking the real issue going on.
Comment 16 Coacher 2013-12-20 06:21:24 UTC
I've noticed that this bug happens more frequently on my machine while running Virtualbox. However, it does occur without Virtualbox as well.
Comment 17 Coacher 2014-01-10 07:14:03 UTC
Just got another lockup with 3.12.6 kernel.

Ping.
Comment 18 Coacher 2014-01-18 19:01:44 UTC
This issue still occurs with 3.12.8 kernel.
Comment 19 Coacher 2014-02-15 18:25:51 UTC
Still occurs with 3.13.2 kernel.

Ping.
Comment 20 Coacher 2014-02-15 19:19:07 UTC
(In reply to Ilya Tumaykin from comment #19)
Same with 3.13.3.
Comment 21 Huan Zhang 2014-02-16 02:48:33 UTC
I am having exactly the same issue on Ubuntu 13.10 (kernel 3.11.0-15-generic). My GPU is an HD 3450 (RV620 LE). Apart from the random hang issue, sometimes the kernel crashes during "modprobe radeon". I have to disable dpm for now.
Comment 22 Coacher 2014-04-15 12:24:36 UTC
Reproducible with 3.14.0 kernel.
Comment 23 Pavol Klačanský 2014-11-19 19:37:25 UTC
It occurs at least once a day on 3.18.0-031800rc4-generic.
Comment 24 Mihai Coman 2015-03-23 18:22:12 UTC
Still happens on 3.19.2-1-ARCH. After blanking, sometimes the image comes back on; I can move the cursor, but the interface is unresponsive.
Comment 25 Denis Ollier 2016-08-19 22:24:35 UTC
Still occurs on kernel 4.7.1-1-ARCH.
