Bug 60523 - Radeon DPM not working with 2 monitors attached to Radeon HD5770 (Juniper)
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: x86-64 Linux
Importance: P1 high
Assignee: drivers_video-dri
Reported: 2013-07-05 16:57 UTC by Tobias Droste
Modified: 2016-03-23 18:31 UTC

See Also:
Kernel Version: drm-next-3.12-wip
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg | grep -iE "drm|radeon|power|uvd" after starting a video with mplayer (9.85 KB, text/plain)
2013-07-05 16:57 UTC, Tobias Droste
dmesg | grep -iE "drm|radeon|power|uvd" (9.74 KB, text/plain)
2013-07-30 17:49 UTC, Tobias Droste
drm/radeon boot output (5.67 KB, text/plain)
2013-08-15 20:56 UTC, Christian Birchinger
debug drm output (9.04 KB, text/plain)
2013-08-15 21:18 UTC, Christian Birchinger
full dmesg (HD6870) (76.37 KB, text/x-log)
2013-08-19 02:30 UTC, Timothy Dale
fix for 3.11 (17.33 KB, patch)
2013-08-20 16:50 UTC, Alex Deucher
output of lspci -vv (45.83 KB, text/plain)
2014-02-01 20:36 UTC, Alex Belykh
dmesg when booting with two monitors (62.22 KB, text/plain)
2014-02-01 20:38 UTC, Alex Belykh
dmesg when booting up with a single monitor (59.18 KB, text/plain)
2014-02-01 20:39 UTC, Alex Belykh
fix (1.13 KB, patch)
2014-03-06 21:04 UTC, Alex Deucher
vbios.rom [AMD/ATI] Juniper XT [Radeon HD 5770] (61.50 KB, application/octet-stream)
2014-12-04 19:06 UTC, Tobias Droste
Advanced Micro Devices, Inc. [AMD/ATI] Barts XT [Radeon HD 6870] (62.50 KB, application/octet-stream)
2014-12-04 19:14 UTC, Mathias Tillman
[AMD/ATI] Barts XT [Radeon HD 6870] (64.00 KB, application/octet-stream)
2015-02-13 12:28 UTC, Timothy Dale


Description Tobias Droste 2013-07-05 16:57:40 UTC
Created attachment 106814 [details]
dmesg | grep -iE "drm|radeon|power|uvd" after starting a video with mplayer

I get a lockup with the latest DPM code that is not always reproducible.
There are 2 scenarios:

1. PC boots, DPM is active and reports different power states and levels in dmesg, but does not change the power level (it's always at power level 2)

2. mplayer is started and plays a video with UVD. DPM switches the power state to UVD and switches between powerlevel 0 and 2 while playing the video

3. When closing mplayer there are 2 possible outcomes:
a) DPM switches the power state back to performance and stays in power level 0 or 1. This will stay until something triggers it to switch to power level 2 again. As soon as it's in power level 2 again it's never changing back.

b) PC locks up, display is black and VT switch doesn't work until I SysRq+REISUB to reboot the machine

Scenario b is harder to trigger but it happened at least 2 times today.

dmesg | grep -iE "drm|radeon|power|uvd"
Comment 1 Tobias Droste 2013-07-09 18:01:20 UTC
I just tried the most recent branch with the "force performance level" framework and the result is this:

# echo auto > power_dpm_force_performance_level
bash: echo: write error: Invalid argument
# echo low > power_dpm_force_performance_level
bash: echo: write error: Invalid argument
# echo high > power_dpm_force_performance_level
bash: echo: write error: Invalid argument

Does this have something to do with the power level not switching? Or am I doing something wrong?
Comment 2 Alex Deucher 2013-07-09 18:27:39 UTC
You're doing it correctly, but the power management hardware seems to be in a bad state on your system.
Comment 3 Tobias Droste 2013-07-09 18:42:23 UTC
Anything I can do to reset it? Or provide debug information? A restart doesn't seem to help.
Comment 4 Alex Deucher 2013-07-09 19:46:45 UTC
It doesn't look like it's working correctly to begin with.
Comment 5 Tobias Droste 2013-07-30 17:37:08 UTC
In case you are wondering if this was somehow fixed with your latest patches: It wasn't ;-)

Good news is: 
No lockups!

Bad news: 
It's still not automatically switching power levels.
The only way to get it into a lower power level is to run a small video with a player using UVD. After closing the player the power level goes to 0. But as soon as something needs some more power the level goes back to 2 and stays there, even when the load is low again.

and this is still happening:
# echo low > power_dpm_force_performance_level
bash: echo: write error: Invalid argument
Comment 6 Alex Deucher 2013-07-30 17:45:01 UTC
Please attach your dmesg output with the latest drm bits.
Comment 7 Tobias Droste 2013-07-30 17:49:23 UTC
Created attachment 107045 [details]
dmesg | grep -iE "drm|radeon|power|uvd"
Comment 8 Alex Deucher 2013-07-30 17:57:05 UTC
Does this patch help?

diff --git a/drivers/gpu/drm/radeon/cypress_dpm.c b/drivers/gpu/drm/radeon/cypress_dpm.c
index 9bcdd17..1acbddb 100644
--- a/drivers/gpu/drm/radeon/cypress_dpm.c
+++ b/drivers/gpu/drm/radeon/cypress_dpm.c
@@ -2113,7 +2113,7 @@ int cypress_dpm_init(struct radeon_device *rdev)
            (rdev->family == CHIP_HEMLOCK))
                pi->gfx_clock_gating = false;
        else
-               pi->gfx_clock_gating = true;
+               pi->gfx_clock_gating = false;//true;
 
        pi->mg_clock_gating = true;
        pi->mgcgtssm = true;
Comment 9 Tobias Droste 2013-07-30 21:50:21 UTC
No.

It seems to work well as soon as the power state switches to a state with UVD (jumping from 0 to 2 based on load). But without UVD running it stays on power level 2 as soon as it lands there.

So it goes (ps = power state, pl = power level):

ps(non-uvd) pl2 -> ps(uvd) pl2 -> ps(uvd) pl0 -> ps(uvd) pl1 -> ps(uvd) pl0 -> ps(non-uvd) pl0

it stays at this until something is rendering in 3D and goes to pl2. 

But when the 3D rendering stops it doesn't go back to pl0 or pl1. So it really seems like it's stuck there in this power state.

and this is still happening:
# echo low > power_dpm_force_performance_level
bash: echo: write error: Invalid argument
Comment 10 Tobias Droste 2013-07-30 21:51:55 UTC
Interesting fact #2:

# echo low > power_dpm_force_performance_level

works with UVD running!
Comment 11 Tobias Droste 2013-07-30 21:59:38 UTC
Interesting fact #3:

it's also working if I only attach 1 monitor!

So it looks like power state 1, 2 and 4 are working correctly and power state 3 (no uvd and two monitors attached) is broken.
Comment 12 Alex Deucher 2013-08-01 22:54:06 UTC
Do you have any sort of background animations, compute jobs, or anything like that running?  I have a very similar 5770, and I can't reproduce the issue.  The multi-monitor state works fine here.
Comment 13 Tobias Droste 2013-08-01 23:16:21 UTC
The only active component is kwin with compositing. 
But even if I disable compositing and don't touch the mouse or keyboard for a while the state doesn't change. 

I also doubt that's the problem because

# echo high > power_dpm_force_performance_level

gives 

bash: echo: write error: Invalid argument

with 2 monitors attached and works fine with 1 monitor.

Maybe a weird bug in the bios? But the card works fine with catalyst in windows.
Comment 14 Tobias Droste 2013-08-02 16:31:38 UTC
I changed rv770_smc.c to this: http://pastebin.com/eMzfrAaZ, but there aren't any messages in dmesg after booting.

'echo low > power_dpm_force_performance_level' fails, but it also doesn't print anything to dmesg.
Changing from 1 monitor to 2 and back doesn't print anything to dmesg either.
Comment 15 Christian Birchinger 2013-08-04 18:54:26 UTC
I think I have the same problem using kernel version 3.11-rc3+ (git version incl. latest drm-fixes)

I'm also using a 5770 (MSI Radeon HD 5770 Hawk) graphics card.

I get the same error messages (write error: Invalid argument)
when echoing the values into power_dpm_force_performance_level.

During idle (XFCE with or without composite), it remains on power level 2:

# cat /sys/kernel/debug/dri/0/radeon_pm_info
uvd    vclk: 0 dclk: 0
power level 2    sclk: 87500 mclk: 120000 vddc: 1200 vddci: 1100
(Temperature is around 48C)


When playing a video, the states change (card gets cooler):

# cat /sys/kernel/debug/dri/64/radeon_pm_info
uvd    vclk: 50000 dclk: 40000
power level 1    sclk: 40000 mclk: 90000 vddc: 950 vddci: 1100
# cat /sys/kernel/debug/dri/64/radeon_pm_info
uvd    vclk: 50000 dclk: 40000
power level 2    sclk: 40000 mclk: 90000 vddc: 950 vddci: 1100
# cat /sys/kernel/debug/dri/64/radeon_pm_info
uvd    vclk: 50000 dclk: 40000
power level 0    sclk: 40000 mclk: 90000 vddc: 950 vddci: 1100
(Temperature is around 44C)


The Windows driver brings the temperature down to 37C in idle.
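
For anyone who wants to log these values over time rather than eyeball them, here is a rough Python sketch that parses the two-line radeon_pm_info format pasted above. The function name `parse_pm_info` is made up for illustration, and it assumes the output always looks like the samples in this comment (the clocks appear to be in units of 10 kHz, but that is my reading, not something the output states).

```python
import re

# Matches "name: number" pairs such as "sclk: 87500" or "vclk: 0".
PAIR_RE = re.compile(r"(\w+):\s*(\d+)")


def parse_pm_info(text):
    """Parse radeon_pm_info text (format as pasted in this thread) into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("uvd"):
            info["uvd"] = {k: int(v) for k, v in PAIR_RE.findall(line)}
        elif line.startswith("power level"):
            level = re.match(r"power level\s+(\d+)", line)
            info["power_level"] = int(level.group(1))
            info["clocks"] = {k: int(v) for k, v in PAIR_RE.findall(line)}
    return info


# Sample copied from the idle case above.
sample = """uvd    vclk: 0 dclk: 0
power level 2    sclk: 87500 mclk: 120000 vddc: 1200 vddci: 1100
"""
info = parse_pm_info(sample)
print(info["power_level"], info["clocks"]["sclk"], info["clocks"]["mclk"])
```

To use it on a live system you would read /sys/kernel/debug/dri/0/radeon_pm_info as root and pass the contents to `parse_pm_info`.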
Comment 16 Christian Birchinger 2013-08-04 23:25:19 UTC
I also had a hard lockup having the machine running over night. Too bad i found nothing in the syslog.

A small detail appeared though: When the monitor is off (dpms standby), it starts changing power states.
Comment 17 Alex Deucher 2013-08-15 12:32:12 UTC
Does this patch help?

diff --git a/drivers/gpu/drm/radeon/cypress_dpm.c b/drivers/gpu/drm/radeon/cypress_dpm.c
index 95a66db..a1d2503 100644
--- a/drivers/gpu/drm/radeon/cypress_dpm.c
+++ b/drivers/gpu/drm/radeon/cypress_dpm.c
@@ -1898,10 +1898,10 @@ int cypress_dpm_enable(struct radeon_device *rdev)
        cypress_start_dpm(rdev);
 
        if (pi->gfx_clock_gating)
-               cypress_gfx_clock_gating_enable(rdev, true);
+               cypress_gfx_clock_gating_enable(rdev, false);
 
        if (pi->mg_clock_gating)
-               cypress_mg_clock_gating_enable(rdev, true);
+               cypress_mg_clock_gating_enable(rdev, false);
 
        if (rdev->irq.installed &&
            r600_is_internal_thermal_sensor(rdev->pm.int_thermal_type)) {
Comment 18 Tobias Droste 2013-08-15 15:53:23 UTC
Doesn't help here.

But I can confirm that it works as soon as dpms is active and dpm switches to power level 0. It stays at power level 0 after the monitors are active again and I can echo things to power_dpm_force_performance_level. But as soon as it goes back to power level 2 it stays there and echoing to power_dpm_force_performance_level fails again.
Comment 19 Christian Birchinger 2013-08-15 18:54:56 UTC
Just to be totally clear: in my case I only have one monitor in use. I don't use any multi-head setup at the moment.

The rest is identical. dpms standby puts it at level 0, and when it wakes up it's also at level 0. After it has switched to level 2, though, it won't ever go back to 1 or 0 (except when using dpms again or the video playback trick).

During dpms standby i see "caps: single_disp video" in the dmesg output
but when the monitor is on it is "caps: video". No idea if that is normal
but as i said, i'm only using one display.
Comment 20 Alex Deucher 2013-08-15 20:09:32 UTC
Christian,  Can you try this patch:
http://lists.freedesktop.org/archives/dri-devel/2013-August/043464.html
I think the vblank period on your monitor is short enough that it's causing the driver to select the multi-head case to avoid mclk switching.  You should also try the patch in comment 17.
Comment 21 Christian Birchinger 2013-08-15 20:52:43 UTC
No change here at all with those 2 patches. I'm attaching my boot dmesg in case my perhaps unusual setup (CRT monitor) is causing something special.
Comment 22 Christian Birchinger 2013-08-15 20:56:07 UTC
Created attachment 107211 [details]
drm/radeon boot output

The relevant radeon and drm boot message output
Comment 23 Alex Deucher 2013-08-15 21:08:51 UTC
Ah, you have a system with gddr5 memory.  The blanking period is probably too short on your monitor to support mclk switching.  Something like this will tell you for sure:

diff --git a/drivers/gpu/drm/radeon/cypress_dpm.c b/drivers/gpu/drm/radeon/cypress_dpm.c
index 95a66db..cfe8313 100644
--- a/drivers/gpu/drm/radeon/cypress_dpm.c
+++ b/drivers/gpu/drm/radeon/cypress_dpm.c
@@ -2169,6 +2169,8 @@ bool cypress_dpm_vblank_too_short(struct radeon_device *rdev)
        /* we never hit the non-gddr5 limit so disable it */
        u32 switch_limit = pi->mem_gddr5 ? 450 : 0;
 
+       DRM_ERROR("vblank_time: %d switch_limit: %d", vblank_time, switch_limit);
+
        if (vblank_time < switch_limit)
                return true;
        else
diff --git a/drivers/gpu/drm/radeon/radeon_pm.c b/drivers/gpu/drm/radeon/radeon_pm.c
index a44ae9a..7b4c9db 100644
--- a/drivers/gpu/drm/radeon/radeon_pm.c
+++ b/drivers/gpu/drm/radeon/radeon_pm.c
@@ -648,10 +648,15 @@ static struct radeon_ps *radeon_dpm_pick_power_state(struct radeon_device *rdev,
 
        /* check if the vblank period is too short to adjust the mclk */
        if (single_display && rdev->asic->dpm.vblank_too_short) {
-               if (radeon_dpm_vblank_too_short(rdev))
+               if (radeon_dpm_vblank_too_short(rdev)) {
+                       DRM_ERROR("vblank too short\n");
                        single_display = false;
+               }
        }
 
+       DRM_ERROR("single display = %d crtcs: %d", single_display,
+                 rdev->pm.dpm.new_active_crtc_count);
+
        /* certain older asics have a separare 3D performance state,
         * so try that first if the user selected performance
         */
Comment 24 Christian Birchinger 2013-08-15 21:18:17 UTC
Created attachment 107212 [details]
debug drm output

Yes, i get lots of output. Log is attached
Comment 25 Christian Birchinger 2013-08-15 21:27:44 UTC
So with the problem being the vblank i switched the resolutions using xrandr. Using lower resolution modes makes it start switching.

~ $ xrandr
Screen 0: minimum 320 x 200, current 1600 x 1200, maximum 8192 x 8192
DisplayPort-0 disconnected (normal left inverted right x axis y axis)
HDMI-0 disconnected (normal left inverted right x axis y axis)
DVI-0 connected 1600x1200+0+0 (normal left inverted right x axis y axis) 416mm x 312mm
   1600x1200      85.0*+
   1280x1024     100.0  
   1152x864       99.4  
   1024x768      100.0  
   800x600       100.0  
   640x480       100.0  
   640x400        85.1  
   400x300       144.4  
   320x240       150.3  
   320x200       139.9  

"1280x1024" and "640x400" makes it switch to level 0. But my default working mode "1600x1200" for the past 10+ years triggers the issue.

Modeline        "1600x1200" 220.00 1600 1616 1808 2080 1200 1204 1207 1244 +hsync +vsync
Modeline        "1280x1024" 181.75 1280 1312 1440 1696 1024 1031 1046 1072 -hsync -vsync
Modeline        "1152x864"  137.65 1152 1184 1312 1536 864 866 885 902 -hsync -vsync
Modeline        "1024x768"  115.50 1024 1056 1248 1440 768 771 781 802 -hsync -vsync
Modeline        "800x600"    69.65 800 864 928 1088 600 604 610 640 -hsync -vsync
Modeline        "640x480"    45.80 640 672 768 864 480 488 494 530 -hsync -vsync
Modeline        "640x400"    31.50 640 672 736 832 400 401 404 445 -hsync +vsync
Modeline        "400x300"    25.00 400 424 488 520 300 319 322 333 doublescan
Modeline        "320x240"    15.75 320 336 384 400 240 244 246 262 doublescan
Modeline        "320x200"    12.59 320 336 384 400 200 204 205 225 doublescan

I only really use 1600x1200 now; the other modes come from a time when scaling used the whole CPU ;)

The EDID file it uses on bootup contains the same modes. The EDID was generated with the same values because
the radeon driver seems to ignore the values in xorg.conf and only uses the EDID from KMS etc.
Comment 26 Alex Deucher 2013-08-15 21:40:28 UTC
In order to switch the mclk, the hw needs at least 450us.  The vblank period of the 1600x1200 mode is 396us, so it's not long enough to switch.  The switch has to happen during vblank to avoid seeing a flicker when the mclk changes.  As such the driver picks a power state where the mclk doesn't change (the same power state that is used for multi-head).  You could try specifying a 1600x1200 modeline with a longer vblank period if you want to use that mode and still support mclk switching.
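
To make the decision described here concrete, this is a minimal Python model of it, pieced together from the debug patch in comment 23. It is only a sketch: the function names are invented, the 450 us limit is the gddr5 `switch_limit` from that patch, and the "fewer than two active CRTCs" starting condition is my assumption about how the driver initialises `single_display`.

```python
GDDR5_SWITCH_LIMIT_US = 450  # switch_limit for gddr5 boards, per the patch in comment 23


def vblank_too_short(vblank_time_us, mem_gddr5=True):
    """Mirror of the check in the patch: the non-gddr5 limit is disabled (0)."""
    switch_limit = GDDR5_SWITCH_LIMIT_US if mem_gddr5 else 0
    return vblank_time_us < switch_limit


def use_single_display_state(active_crtcs, vblank_time_us, mem_gddr5=True):
    """Return True if the single-display (mclk switching) state would be picked."""
    # Assumption: the driver starts from "fewer than two active CRTCs".
    single_display = active_crtcs < 2
    # A too-short vblank demotes a single monitor to the multi-head state.
    if single_display and vblank_too_short(vblank_time_us, mem_gddr5):
        single_display = False
    return single_display


# One monitor at 1600x1200 with a 396 us vblank: treated like multi-head.
print(use_single_display_state(active_crtcs=1, vblank_time_us=396))
# One monitor with a vblank of exactly 450 us: single-display state is allowed.
print(use_single_display_state(active_crtcs=1, vblank_time_us=450))
```

This only models which state gets selected; it says nothing about why the multi-head state then sticks at the highest level.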
Comment 27 Tobias Droste 2013-08-15 21:43:48 UTC
Isn't this the reason why there is a multi-monitor power state? same mclk but different sclk for each power level? So switching between them should be no problem because there's no memory reclocking happening.
Comment 28 Christian Birchinger 2013-08-15 21:46:05 UTC
Ok thanks.

I was just in the middle of posting this:

With 1280x1024 it switched to power level 0 but without "single_disp".
With the really low 640x400 mode it did also use "single_disp".

But i now see that's no longer relevant.
Comment 29 Alex Deucher 2013-08-15 21:55:39 UTC
Correct.  I'm not sure why that state seems to get stuck in the highest performance level on your cards though.
Comment 30 Christian Birchinger 2013-08-15 22:09:30 UTC
Maybe the same reason why Tobias is stuck at level 2.

Since i'm no longer able to use tools like xvidtune and the online modeline calculator tells me 1600x1200 85hz requires >300Mhz pixel clock, so i'm probably
stuck at trying out lots of predefined modes that got pasted online.

How would i calculate the vblank period from modelines? It would really help
if i at least knew a mode is within the specs before trying it out.
Comment 31 Alex Deucher 2013-08-15 22:16:01 UTC
(In reply to Christian Birchinger from comment #30)
> Maybe the same reason why Tobias is stuck at level 2.
> 

Right, you both seem to be afflicted by the same issue.

> Since i'm no longer able to use tools like xvidtune and the online modeline
> calculator tells me 1600x1200 85hz requires >300Mhz pixel clock, so i'm
> probably
> stuck at trying out lots of predefined modes that got pasted online.

You can use gtf or cvt to generate modelines and add them manually with xrandr.

> 
> How would i calculate the vblank period from modelines? It would really help
> if i at least knew a mode is within the specs before trying it out.

See r600_dpm_get_vblank_time() in r600_dpm.c for the formula:

line_time_us = (mode.crtc_htotal * 1000) / mode.clock;
vblank_lines = mode.crtc_vblank_end - mode.crtc_vdisplay;
vblank_time_us = vblank_lines * line_time_us;
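
If it helps, here is a small Python sketch of the same formula that takes an xorg-style Modeline string, so a mode can be checked before trying it. It is only an approximation of what the driver does: the helper name `vblank_time_us` is made up, it assumes `crtc_vblank_end` equals the vertical total of the modeline (which I would expect for a plain progressive mode, not for doublescan or interlaced ones), and it uses integer division like the C code.

```python
import shlex


def vblank_time_us(modeline):
    """Estimate the vblank time in microseconds from a Modeline string."""
    # Fields: Modeline "name" clock hdisp hsync_start hsync_end htotal
    #         vdisp vsync_start vsync_end vtotal [flags...]
    fields = shlex.split(modeline)
    clock_khz = int(round(float(fields[2]) * 1000))  # modeline clock is in MHz
    htotal = int(fields[6])
    vdisplay = int(fields[7])
    vtotal = int(fields[10])  # assumed equal to crtc_vblank_end

    line_time_us = (htotal * 1000) // clock_khz  # integer division, as in C
    vblank_lines = vtotal - vdisplay
    return vblank_lines * line_time_us


modes = [
    'Modeline "1600x1200" 220.00 1600 1616 1808 2080 1200 1204 1207 1244 +hsync +vsync',
    'Modeline "1280x1024" 181.75 1280 1312 1440 1696 1024 1031 1046 1072 -hsync -vsync',
    'Modeline "1600x1200_test" 229.5 1600 1664 1856 2160 1200 1201 1204 1250 +hsync +vsync',
]
for mode in modes:
    print(shlex.split(mode)[1], vblank_time_us(mode))
# 1600x1200 396
# 1280x1024 432
# 1600x1200_test 450
```

Anything that comes out at 450 or more should be long enough for mclk switching on a gddr5 board, going by the limit mentioned in this thread.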
Comment 32 Christian Birchinger 2013-08-15 22:41:25 UTC
Yes, i did this:

xrandr --newmode "1600x1200_test" 229.5   1600 1664 1856 2160 1200 1201 1204 1250 +hsync +vsync

Puts my state to this:

uvd    vclk: 0 dclk: 0
power level 0    sclk: 15700 mclk: 30000 vddc: 950 vddci: 1100

(so it's level 0 and single_disp)

The reason i was using obscure realtime testing with xvidtune is because now the image is much
narrower and i get ugly crt stretching pillow effects when i compensate for it.

But yes, that's nothing the driver can take care of anymore. Maybe i find some non-standard mode
that still gets accepted. But without a tool that displays tests in realtime like xvidtune there
will be lots of trying around.
Comment 33 Timothy Dale 2013-08-19 02:04:57 UTC
I'm also experiencing this issue with my 6870. I decided to attach a second monitor yesterday and immediately noticed that my card went from idling at around 45°C to a steady 55°C.

'watch cat /sys/kernel/debug/dri/0/radeon_pm_info' shows that I'm stuck at power level 2, even when running a game like Xonotic. I get the same 'bash: echo: write error: Invalid argument' error if I try to force a performance level as well.

Using 3.11-rc5 + the drm-next-3.12-wip branch.
Comment 34 Timothy Dale 2013-08-19 02:30:38 UTC
Created attachment 107241 [details]
full dmesg (HD6870)
Comment 35 Pitam Mitra 2013-08-19 18:46:50 UTC
I have the same problem with HD5700. Since the others have attached dmesg reports already, I am not including mine. 

Unlike droste, when I do:

# echo high > power_dpm_force_performance_level
OR
# echo low > power_dpm_force_performance_level


nothing happens, and nothing prints to dmesg. I am always "stuck" in power level 2. Using 3.11git-rc5 from Fedora rawhide.
Comment 36 Tobias Droste 2013-08-19 18:57:24 UTC
Hm I'm pretty sure the "stuck" power level is the same problem as the "write error: Invalid argument" problem.

Are you sure you're in the correct directory before executing the command?

If you're not in '/sys/class/drm/card0/device' you would just create a text file with 'high' or 'low' in it.
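
One way to rule out the wrong-directory problem is to always write to the absolute sysfs path. Below is a rough Python sketch of that; the helper name `force_performance_level` is made up, the path assumes the card is card0, and the accepted values are just the three used in this thread (auto, low, high). It returns False on the "Invalid argument" error instead of raising, so the stuck case is easy to spot.

```python
import errno

# Assumes the GPU is card0; adjust if the system has more than one card.
SYSFS_FILE = "/sys/class/drm/card0/device/power_dpm_force_performance_level"


def force_performance_level(level, path=SYSFS_FILE):
    """Write a DPM performance level; return False if the kernel rejects it with EINVAL."""
    if level not in ("auto", "low", "high"):
        raise ValueError("unexpected level: %r" % (level,))
    try:
        with open(path, "w") as f:
            f.write(level)
    except OSError as e:
        # EINVAL is what bash reports as "write error: Invalid argument".
        if e.errno == errno.EINVAL:
            return False
        raise  # permission problems or a missing file are reported as-is
    return True


# Example (needs root and a radeon card with DPM enabled):
#   ok = force_performance_level("low")
#   print("accepted" if ok else "rejected with Invalid argument")
```

Because the path is absolute, running this from any directory either talks to the driver or fails loudly, and never silently creates a text file.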
Comment 37 Alex Deucher 2013-08-20 14:40:22 UTC
I was finally able to reproduce this, but only with gcc 4.8.  Older versions of gcc work fine.  Looks like the gcc bug has struck again.  See:
https://bugs.freedesktop.org/show_bug.cgi?id=66932
Now to find what other part gcc doesn't like...
Comment 38 Tobias Droste 2013-08-20 16:15:01 UTC
I don't want to say that it's not a gcc bug, but I'm using gcc 4.7:

gcc version 4.7.2 20130108 [gcc-4_7-branch revision 195012] (SUSE Linux)
Comment 39 Alex Deucher 2013-08-20 16:50:32 UTC
Created attachment 107254 [details]
fix for 3.11

The attached patch seems to fix it for me.
Comment 40 Timothy Dale 2013-08-20 21:18:54 UTC
(In reply to Alex Deucher from comment #39)
> Created attachment 107254 [details]
> fix for 3.11
> 
> The attached patch seems to fix it for me.

That doesn't seem to fix it for me (6870). I first tried 3.11-rc6 + drm-next-3.12-wip (ee31b2b), and then 3.11-rc6 + the standalone patch you attached.

In the former case, when trying to force a performance level I still got: echo: write error: Invalid argument

And the power level still idles at:
power level 2    sclk: 90000 mclk: 105000 vddc: 1175 vddci: 1150

gcc --version
gcc (GCC) 4.8.1 20130725 (prerelease)
Comment 41 Alex Deucher 2013-08-20 21:24:05 UTC
Well, it helps a little, but I'm still able to reproduce it eventually even with the patch.  The same kernel source is working fine on a fedora 16 system and now exhibits this problem on Fedora 19.  So maybe it's not all gcc.
Comment 42 Alex Deucher 2013-08-20 21:37:27 UTC
I can reproduce it in Fedora 17 as well (gcc 4.7).  So it seems to be something about Fedora 16 (gcc 4.6).
Comment 43 Nathan Jones 2013-09-05 22:32:22 UTC
I can reproduce it in Gentoo with 3.11.0, GCC 4.6.3. GPU Temperature is 25C (as reported by lm-sensors) with any UVD accelerated media playing, and "cat /sys/kernel/debug/dri/0/radeon_pm_info" shows:

uvd    vclk: 50000 dclk: 40000
power level 2    sclk: 40000 mclk: 90000 vddc: 950 vddci: 1100

As soon as I stop playing media, it stays at the lower power level, but even scrolling a web page makes it switch to:

uvd    vclk: 0 dclk: 0
power level 2    sclk: 85000 mclk: 120000 vddc: 1250 vddci: 1100

And the temperature creeps up to 34C.

ASUS EAH5770 CUCore 2DI/1GD5 is the model of my GPU.

This only happens with two monitors attached, and I tried the patch in Comment 39, but the behavior did not change. With a single monitor it drops to 157/300, and switches back and forth like it should. For me the problem is not so severe, since even extended GPU load at 100% still doesn't put me over 40C, but it would be nice if it worked with multiple monitors as well as with a single one.
Comment 44 Tobias Droste 2013-11-02 07:57:28 UTC
I see you want to enable DPM by default. Are there any news on this one? Should I (we) try the drm-next branch to see if it is fixed? It's still an issue on the drm-fixes branch from airlied.
Comment 45 Alex Deucher 2013-11-04 14:05:58 UTC
(In reply to Tobias Droste from comment #44)
> I see you want to enable DPM by default. Are there any news on this one?
> Should I (we) try the drm-next branch to see if it is fixed? It's still an
> issue on the drm-fixes branch from airlied.

I don't know of any changes that would affect this.
Comment 46 Alex Belykh 2014-02-01 20:36:47 UTC
Created attachment 124101 [details]
output of lspci -vv

I experience a similar issue with an HD4730, although the lockup is unavoidable rather than "hard to trigger".

Booting with radeon.dpm=1 and two monitors attached results in periodic (every 15-45 sec) lockups (for ~7-15 sec), until one of them results in a reboot (most of the time during bootup, before even X is started).

radeon.dpm=0 seems to boot up and work fine (although I think I had unexplainable loud fan spinups after some time, need to test that further).

Bisecting kernel didn't yield anything useful. It just pointed me at 66229b200598a3b66b839d1759ff3f5b17ac5639 "drm/radeon/kms: add dpm support for rv7xx (v4)".

Latest drm-next kernel (commit ef64cf9d06049e4e9df661f3be60b217e476bee1) didn't help either.

The abovementioned patch for fixing arrays' size applied over 66229b2 didn't alleviate the problem in the slightest (gcc 4.7.3 though).

Funnily enough, if I boot with either of the monitors detached, it boots and works fine, and if I then reattach the second monitor and connect it with xrandr, everything keeps working fine. Disconnecting a monitor every time one needs to reboot is a drag, though.
Comment 47 Alex Belykh 2014-02-01 20:38:11 UTC
Created attachment 124111 [details]
dmesg when booting with two monitors

The lockups at 3.8 sec and 19.3 sec are observable. Also I wonder why it initially switches into "single_disp video" only to immediately switch to "video" afterwards.
Incidentally, one of the displays is vga, other is dvi, if that matters.
Comment 48 Alex Belykh 2014-02-01 20:39:36 UTC
Created attachment 124121 [details]
dmesg when booting up with a single monitor

attaching a second monitor at 241.8 sec.
Comment 49 Alex Deucher 2014-03-06 21:04:48 UTC
Created attachment 128311 [details]
fix

This patch fixes the issue here.
Comment 50 Tobias Droste 2014-03-07 06:35:31 UTC
Unfortunately not here.

It's still:
# echo auto > power_dpm_force_performance_level
bash: echo: write error: Invalid argument
# echo low > power_dpm_force_performance_level
bash: echo: write error: Invalid argument
# echo high > power_dpm_force_performance_level
bash: echo: write error: Invalid argument
Comment 51 Alex Belykh 2014-03-07 08:02:05 UTC
And not here. Applying this patch on top of latest drm-next (786a7828bc74b9b1466e83abb200b75f80f94121) resulted in the same kind of lockups ending with reboot.
Comment 52 Vasco Gervasi 2014-05-12 20:50:38 UTC
Same problem with an HD6870 and two monitors.
I am using 3.15-rc5.
Comment 53 Digingbil 2014-05-13 23:09:57 UTC
Same problem here. Sapphire HD 5770 with 3.13.0-24-generic Ubuntu Gnome 14.04 with two DVI monitors. If I disconnect/disable one of them then it's working as it should and the fan doesn't sound like a Boeing 747 ;) If you need more info I will be more than happy to assist, although my results are much the same as the ones above.
Comment 54 Mathias Tillman 2014-06-11 10:26:50 UTC
Just thought I'd drop by and say that I have the same problem on my XFX Radeon HD 6870 using kernel 3.15.0-rc3+ (compiled from the latest agd5f repo) when I have two monitors attached. Trying to echo to power_dpm_force_performance_level results in an "Invalid argument" like others have mentioned as well.

What's interesting here is that I tried overriding the sclk and mclk level values in the code (in btc_dpm.c, btc_apply_state_adjust_rules), I changed the sclk and mclk for the medium and high levels from 77500, 105000 to 25000, 42000 and 90000, 105000 to 40000, 50000 respectively. And now it doesn't get stuck in the highest level anymore and I can successfully override the performance level using power_dpm_force_performance_level.

I'm not really sure what is causing all this, but I do know that btc_dpm_vblank_too_short returns false on stock values, so it's not that at least.
Comment 56 Tobias Droste 2014-07-02 00:49:51 UTC
No change here
Comment 57 Mathias Tillman 2014-07-02 09:39:35 UTC
No difference here either, unfortunately.
Comment 58 Alex Deucher 2014-07-03 13:56:26 UTC
What kernel did you try?  I haven't been able to reproduce this for a while.  I just tested a bunch of boards yesterday with 3.16 and multi-head dpm worked properly on all of them.
Comment 59 Mathias Tillman 2014-07-03 15:15:08 UTC
I was using the 3.15 kernel (with your patch applied manually) on my last reply, but I just updated to the drm-fixes-3.16 branch and unfortunately there is no difference; it still gets stuck in the highest power level, and echoing to power_dpm_force_performance_level results in an Invalid argument.
Comment 60 Mathias Tillman 2014-07-03 17:47:00 UTC
I've done some further tests with overriding the power state level values as detailed above, here are my results:

The stock values are
level 0: sclk: 10000, mclk: 30000, vddc: 950, vddci: 950
level 1: sclk: 77500, mclk: 105000, vddc: 1100 vddci: 1150
level 2: sclk: 90000, mclk: 105000, vddc: 1175, vddci: 1150

sclk, vddc and vddci were all at their stock values for all power levels during my tests.

I changed mclk on power level 2 to the values below (OK means that it does not get stuck in the highest power level):
95000 = OK
100000 = OK
104000 = OK
105000 = NOT OK
110000 = NOT OK

So as you can see it gets stuck when the mclk value is 105000 or higher, but why that is I really have no idea. Could it be a firmware or VBIOS problem?
Comment 61 Tobias Droste 2014-07-03 23:20:45 UTC
I'm currently running drm-next from that day+your patch.

It does not seem to be the same problem for me, because mclk is 120000 for every level as soon as there's a second screen and only level 2 stays stuck.
Comment 62 Mathias Tillman 2014-12-04 14:01:06 UTC
Sorry for bringing up an old bug, but I'm still having this problem in 3.18, so I decided to do some further tests which made me discover some new things that may or may not be related.
I started off by adding a new sysfs entry for calling a pplib command (I noticed that one such call failed when I unplugged and replugged one of my monitors). What I discovered was that ALL pplib commands failed (returned 0x00) when it got stuck in the higher power level, and it started working again once I unplugged my second monitor. What's interesting here is that the commands were also fine when I was running them using a single monitor that was also running on the highest power level (though it was not stuck). So it seems that they fail only when the power level is stuck. I'm not sure if this is related or not, but it certainly seems that way to me.
To check whether it was actually related to the mclk value, as my previous tests had suggested, I added another sysfs entry for changing the mclk value on the fly (using radeon_set_memory_clock). I guess the code for disabling mclk switching with multiple monitors is necessary, as the screen would become garbled and eventually the computer would lock up when I ran it multiple times.
In any case, it works if I only call it a few times, so I was able to test my theory. Unfortunately it showed the same symptoms no matter what I lowered the mclk to (pplib commands failing, unable to force the power level, etc). So it doesn't seem to be directly related to what the mclk value is, but more that it at one point has been running at a particular value and that triggers a part of the code that disables pplib commands and power scaling? I'm really not sure why this is happening.

I hope this information will be useful to someone, I will probably keep trying to solve this, but I am starting to run out of ideas (and I'm far from familiar with how the radeon code actually works).
Comment 63 Tobias Droste 2014-12-04 18:49:27 UTC
The bug may have an ancient file date, but this is not an old bug. I'm still seeing this with 3.18 (latest drm-fixes) and I'm glad I'm not the only one. 

I also tried to add some debug outputs to see what happens and manually adjust clocks but I couldn't get it to work.

If you have something to try I would be happy to try it to finally fix this. It gets quite annoying.

Btw. which gcc version are you using to compile the kernel?
I'm on gcc (SUSE Linux) 4.8.3 20140627 [gcc-4_8-branch revision 212064] now so if it is a bug in gcc it's not fixed in 4.8.3
Comment 64 Alex Deucher 2014-12-04 19:01:55 UTC
Please attach a copy of the vbios from your systems.

(as root)
(use lspci to get the bus id)
cd /sys/bus/pci/devices/<pci bus id>
echo 1 > rom
cat rom > /tmp/vbios.rom
echo 0 > rom
Comment 65 Tobias Droste 2014-12-04 19:06:17 UTC
Created attachment 159721 [details]
vbios.rom [AMD/ATI] Juniper XT [Radeon HD 5770]
Comment 66 Mathias Tillman 2014-12-04 19:14:59 UTC
Created attachment 159731 [details]
Advanced Micro Devices, Inc. [AMD/ATI] Barts XT [Radeon HD 6870]

Thank you for looking into this again, Alex.
Comment 67 Timothy Dale 2015-02-13 12:28:52 UTC
Created attachment 166641 [details]
[AMD/ATI] Barts XT [Radeon HD 6870]

Reattached a second monitor recently and I'm still experiencing this as well (kernel 3.18.6)
Comment 68 Alex Deucher 2015-02-14 05:34:55 UTC
(In reply to Mathias Tillman from comment #60)
> I've done some further tests with overriding the power state level values as
> detailed above, here are my results:
> 
> The stock values are
> level 0: sclk: 10000, mclk: 30000, vddc: 950, vddci: 950
> level 1: sclk: 77500, mclk: 105000, vddc: 1100 vddci: 1150
> level 2: sclk: 90000, mclk: 105000, vddc: 1175, vddci: 1150
> 
> sclk, vddc and vddci were all at their stock values for all power levels
> during my tests.
> 
> I changed mclk on power level 2 to the values below (OK means that it does
> not get stuck in the highest power level):
> 95000 = OK
> 100000 = OK
> 104000 = OK
> 105000 = NOT OK
> 110000 = NOT OK

Does changing the mclk to a lower value in level 1 (leave levels 0 and 2 unchanged) help?  How about changing the sclk in level 2?
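
For reference, the quoted results bracket the failure threshold as follows. This is just a small Python sketch over the numbers from the quoted comment; the variable names are only for illustration.

```python
# mclk values tried for power level 2, taken from the quoted comment.
# True means the GPU did not get stuck in the highest power level.
results = {
    95000: True,
    100000: True,
    104000: True,
    105000: False,
    110000: False,
}

highest_ok = max(mclk for mclk, ok in results.items() if ok)
lowest_bad = min(mclk for mclk, ok in results.items() if not ok)
print(highest_ok, lowest_bad)
```

The lowest failing value, 105000, is the same as the stock mclk of level 1.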
Comment 69 Digingbil 2015-05-24 09:43:38 UTC
Any new insights on this issue, guys? Has it maybe been fixed in a recent kernel?
Comment 70 John K. 2015-05-28 09:59:05 UTC
I am having the same bug with kernel 4.0.4 and with a radeon 5870.

When I have two monitors on, the power level goes to 2 and gets stuck there, and I get:

echo low > power_dpm_force_performance_level
bash: echo: write error: Invalid argument
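
The same check can be scripted to log the failure. This is a minimal Python sketch, not a fix. It assumes the card is card0, so the file would be /sys/class/drm/card0/device/power_dpm_force_performance_level, and the helper name force_performance_level is made up for illustration.

```python
import errno

# Levels accepted by the radeon DPM sysfs interface.
VALID_LEVELS = ("auto", "low", "high")


def force_performance_level(sysfs_file, level):
    """Write level to power_dpm_force_performance_level.

    Returns None on success, or the errno name (e.g. "EINVAL") when the
    write is rejected. Needs root on a real system.
    """
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    try:
        # The write error surfaces when the file is flushed on close,
        # which still happens inside this try block.
        with open(sysfs_file, "w") as f:
            f.write(level + "\n")
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return None
```

In the stuck state described here, the call would be expected to return "EINVAL", matching the "Invalid argument" message from bash.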

Any hope of a solution?
Comment 71 Tobias Droste 2015-05-30 03:40:39 UTC
No changes regarding this issue in 4.1-rc4
Comment 72 Martin Steghöfer 2015-10-30 12:44:49 UTC
I'm having the same problem (stuck in power level 2 with a dual monitor setup; trying to force it results in an "invalid argument" error) on kernel 4.2.0-16.19 (Ubuntu versioning; I'm not sure how that translates to the "real" kernel versions) - just in case you need someone to try something or provide information.

(In reply to Alex Belykh from comment #46)
> Funnily enough, if I boot with either of monitors detached, it boots and
> works fine, and if I then reattach the second monitor and connect it with
> xrandr, everything keeps working fine.

I can confirm the effectiveness of this workaround. Booting with only one monitor attached and adding the other one later results in being able to force performance levels and also in correct automatic level adjustments (when not forced).

Maybe this workaround is the key to solving this issue. Now I can run the system in both the working and the screwed-up state with otherwise the same hardware configuration and log internal information. There has to be some loggable difference between the two. I just don't know what to look for (I have never done any kernel development), so I'd need someone to provide the log statements and interpret their output.
Comment 73 Mathias Anselmann 2016-01-24 15:52:24 UTC
Same problem here on kernel 4.3.3 and a Radeon 6870 with 2 monitors. Stuck at power level 2. If I can submit any useful information, feel free to ask.
Greetings
Mathias
Comment 74 Steven Haigh 2016-03-05 04:41:01 UTC
I'm actually looking at this - and I wonder if this is the same problem as the one I filed here:
https://bugs.freedesktop.org/show_bug.cgi?id=94387

It has all the hallmarks of it. I'd love to help get this fixed as I'm sick of heating my room up beyond a comfortable level :P
Comment 75 Steven Haigh 2016-03-06 00:49:09 UTC
To get more info on this, I rebooted the system with a single screen attached:

$ cat /sys/kernel/debug/dri/64/radeon_pm_info
uvd    vclk: 0 dclk: 0
power level 0    sclk: 30000 mclk: 30000 vddc: 950 vddci: 950

Plugged in the second DVI cable & KDE automatically configured the second screen for me:

$ cat /sys/kernel/debug/dri/64/radeon_pm_info
uvd    vclk: 0 dclk: 0
power level 0    sclk: 30000 mclk: 105000 vddc: 950 vddci: 1150

Plugged in the third screen to the DP - which KDE automatically configured again:

$ cat /sys/kernel/debug/dri/64/radeon_pm_info
uvd    vclk: 0 dclk: 0
power level 0    sclk: 30000 mclk: 105000 vddc: 950 vddci: 1150

This is good.

Yet if I boot the system with three screens attached, I'm stuck at:

$ cat /sys/kernel/debug/dri/64/radeon_pm_info
uvd    vclk: 0 dclk: 0
power level 2    sclk: 90000 mclk: 105000 vddc: 1175 vddci: 1150
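
To compare the working and stuck states over time, the output can be parsed and logged. This is a minimal Python sketch; the function names are made up for illustration. The MHz conversion assumes the clocks are reported in units of 10 kHz, which matches 90000 being 900 MHz on this card but is an assumption on my part.

```python
import re


def parse_pm_info(text):
    """Parse radeon_pm_info output into a dict of integers.

    Returns a dict with "level" plus every "name: number" pair found on
    the "power level" line, or None if there is no such line.
    """
    for line in text.splitlines():
        m = re.match(r"\s*power level (\d+)\s+(.*)", line)
        if m:
            info = {"level": int(m.group(1))}
            for key, value in re.findall(r"(\w+):\s*(\d+)", m.group(2)):
                info[key] = int(value)
            return info
    return None


def to_mhz(clock):
    """Convert a clock value to MHz, ASSUMING units of 10 kHz."""
    return clock / 100
```

Feeding it the text of /sys/kernel/debug/dri/64/radeon_pm_info (the path from the commands above) once per second would show when the level changes and whether it ever drops back.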
Comment 76 Steven Haigh 2016-03-06 00:50:23 UTC
Hmmm - yet when I hit submit on this page, it pushed the GPU up a power level, and now it's stuck there again:

$ cat /sys/kernel/debug/dri/64/radeon_pm_info
uvd    vclk: 0 dclk: 0
power level 2    sclk: 90000 mclk: 105000 vddc: 1175 vddci: 1150
Comment 77 Steven Haigh 2016-03-13 03:51:24 UTC
Bueller? Bueller? Bueller?
