Bug 49031

Summary: [ivb] Major power consumption regression probably with i915 drm error
Product: Power Management Reporter: Micael Dias (kam1kaz3)
Component: OtherAssignee: Rafael J. Wysocki (rjw)
Status: CLOSED DUPLICATE    
Severity: normal CC: adam, chris, daniel, florian, kruegsch, lenb, rjw, ying.huang
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.6.2+ Subsystem:
Regression: Yes Bisected commit-id:
Attachments: dmesg using kernel 3.7rc1
dmesg using kernel 3.5.6
dmesg for latest drm-intel-next-queued
powertop with kernel 3.5
powertop with kernel 3.7
powertop idlestats for kernel 3.7 (left) and 3.5 (right)
Debug suspend when children forbidden

Description Micael Dias 2012-10-18 21:06:01 UTC
Created attachment 83921 [details]
dmesg using kernel 3.7rc1

I'm using arch and have seen major power consumption regression upgrading the kernel from 3.5.7 to 3.6.2. I've tested 3.7rc1 with the same results.

When using 3.5.7 I used to get ~11 W discharge rate from powertop, and with this new kernels I get ~23 W.

I also get this suspecting message from dmesg with the new kernels:
[    8.307873] [drm:__gen6_gt_force_wake_mt_get] *ERROR* Timed out waiting for forcewake old ack to clear.
Comment 1 Micael Dias 2012-10-18 21:07:18 UTC
Created attachment 83931 [details]
dmesg using kernel 3.5.6
Comment 2 Micael Dias 2012-10-18 21:12:20 UTC
Forgot to mention hardware:

Laptop: ASUS N56VZ
CPU: Intel Ivy Bridge
GPU: nVidia (disabled with bbswitch)
Comment 3 Daniel Vetter 2012-10-18 21:40:55 UTC
The ERROR should be fixed with http://cgit.freedesktop.org/~danvet/drm-intel/commit/?h=drm-intel-next-queued&id=16995a9fe140802c026c2ce17bf7e232f86d57ab but should otherwise be totally harmless (if it only happens once after boot-up and maybe after each resume). The power regression itself is a bit more serious. To confirm, can you please grab latest powertop and check whether the gpu goes into rc6 (only works on recent kernels, not 3.5).
Comment 4 Chris Wilson 2012-10-18 21:56:28 UTC
Also please don't apply random kernel parameters 'i915_enable_rc6=7 i915_enable_fbc=1' without understanding the danger they pose.
Comment 5 Micael Dias 2012-10-19 04:35:34 UTC
I can't test the latest kernel from git right now (will do shortly), but I remember using powertop to check the gpu state, and it did go into RC6, RC6p and RC6pp without any trouble, spending most off the time in RC6pp.

I read RC6pp and the frame buffer compression has some stability problems on some hardware, but I've not had any trouble because of those yet. What other dangers can it pose? Is there any chance it can cause harware damage?
Comment 6 Chris Wilson 2012-10-19 08:09:46 UTC
FBC very rarely saves power, but it does cause a massive (10-100x) slowdown of the system. RC6pp was never implemented in silicon for SNB/IVB. Both are known for causing random render corruption, GPU hangs and hard hangs.
Comment 7 Micael Dias 2012-10-24 05:00:46 UTC
Created attachment 84581 [details]
dmesg for latest drm-intel-next-queued

Compiled kernel from git 3b6bd9fbb0e3ce885e5b5c87000133bb899b8a14 and even though the error message is not there, power consumption is still much higher than with 3.5.

I'm attaching screenshots from powertop.

Note: powertop says that backight is consuming more on 3.7 but it was set to the exact same brightness using /sys/class/backlight/intel_backlight/brightness
Comment 8 Micael Dias 2012-10-24 05:01:29 UTC
Created attachment 84591 [details]
powertop with kernel 3.5
Comment 9 Micael Dias 2012-10-24 05:01:50 UTC
Created attachment 84601 [details]
powertop with kernel 3.7
Comment 10 Daniel Vetter 2012-10-24 08:23:11 UTC
The interesting tab on powertop is the next one, with idle stats. Can you please attach a screenshot of that on 3.7? Also, can you test what happens when you boot with i915.semaphores=0 ?
Comment 11 Micael Dias 2012-10-24 12:51:34 UTC
i915.semaphores=0 doesn't change anything.

Kernel 3.7 idle CPU temperature: 52 ºC

Kernel 3.5 idle CPU temperature: 46 ºC

I'll upload the wanted screenshots in a minute.
Comment 12 Micael Dias 2012-10-24 12:52:17 UTC
Created attachment 84671 [details]
powertop idlestats for kernel 3.7 (left) and 3.5 (right)
Comment 13 Daniel Vetter 2012-10-24 13:50:08 UTC
Hm, this is _really_ unusual. Normally all the power regressions of this magnitude are due to rc6, which is not the case for you. So I think we need to take a step back and re-asses whether this is really due to graphics.

Can you please boot with i915.ko disable (e.g. add i915.die=1 to the kernel cmdline) on both kernels and measure power consumption again? Also, I think we need to have a bisect result to move forward with this issue.
Comment 14 Micael Dias 2012-10-24 14:33:45 UTC
Sure.

Kernel 3.5 with i915.die=1
Power consumption: 17.0 W
CPU temperature: 43 ºC

Kernel 3.7 with i915.die=1
Power consumption: 17.5 W
CPU temperature: 46~47 ºC

Idle stats no longer shows the time spent on RC states, and I cannot get into gnome.
Comment 15 Daniel Vetter 2012-10-24 14:52:46 UTC
Meh, your bios doesn't set up rc6, so that blows out everything else. I think you need to try to bisect this one, with rc6 apparently working on 3.6 I have no idea what burns down so much power for you.
Comment 16 Micael Dias 2012-10-24 17:23:52 UTC
Kernel 3.5 (28a33cbc24e4256c143dce96c7d93bf423229f92): good
Kernel 3.6-rc3 (fea7a08acb13524b47711625eebea40a0ede69a0) bad

The RCs in between those don't boot. Nothing happens after "Loading initial ramdisk"
Comment 17 Daniel Vetter 2012-10-24 17:33:03 UTC
On Wed, Oct 24, 2012 at 7:23 PM,  <bugzilla-daemon@bugzilla.kernel.org> wrote:
> --- Comment #16 from Micael Dias <kam1kaz3@gmail.com>  2012-10-24 17:23:52
> ---
> Kernel 3.5 (28a33cbc24e4256c143dce96c7d93bf423229f92): good
> Kernel 3.6-rc3 (fea7a08acb13524b47711625eebea40a0ede69a0) bad
>
> The RCs in between those don't boot. Nothing happens after "Loading initial
> ramdisk"

Have you tried doing this with git bisect? You can mark (some) bad
commits as "please skip" with it.
Comment 18 Micael Dias 2012-10-24 19:32:39 UTC
No, I'm not confortable enough with that. If there's any quick guide you could point me to so I can use that with the kernel I'd appreciate it.
Comment 19 Daniel Vetter 2012-10-24 19:38:59 UTC
On Wed, Oct 24, 2012 at 9:32 PM,  <bugzilla-daemon@bugzilla.kernel.org> wrote:
> --- Comment #18 from Micael Dias <kam1kaz3@gmail.com>  2012-10-24 19:32:39
> ---
> No, I'm not confortable enough with that. If there's any quick guide you
> could
> point me to so I can use that with the kernel I'd appreciate it.

My usual recommendation is:
http://www.reactivated.net/weblog/archives/2006/01/using-git-bisect-to-find-buggy-kernel-patches/
Comment 20 Micael Dias 2012-10-25 16:30:48 UTC
Finished bisecting, this is the first bad commit:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=71a83bd727cc31c5fe960c3758cb396267ff710e

Now I'm not sure anymore if this has anything to do with the CPU...
Comment 21 Daniel Vetter 2012-10-25 18:55:42 UTC
Well, that makes quite a bit sense, and also makes it rather sure that i915 is not the culprit. Please file a new bug (probably by mail to lkml), cc'ing the people in the commit message with the details we've already gathered.

Closing this one here (many kernel hackers don't like bugzilla, so mailing list report is more successful, usually).
Comment 22 Rafael J. Wysocki 2012-10-29 07:40:15 UTC
Reassigned to power management and reopened.
Comment 23 Rafael J. Wysocki 2012-10-29 10:31:51 UTC
@Micael: What's in the /sys/devices/.../power/control file corresponding to the NVidia graphics adapter?
Comment 24 Micael Dias 2012-10-29 13:21:56 UTC
> pwd
/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/power

> cat control
auto


This is the same for both 3.5 and 3.7-rc2 with or without bbswitch.
Comment 25 Marco Krüger 2012-10-29 13:55:21 UTC
Maybe it is realated to bug 48721.

Please check the line 'RC6 residency since boot: XXX' of [1] twice.
If the XXX-value doesn't increas, the GPU can't switch to the RC6 state.

We also noticed that the current CPU frequency [2] doesn't fit the scaling frequency [3], when using the ondemand governor.

[1] cat /sys/kernel/debug/dri/0/i915_drpc_info
[2] cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
[3] cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
Comment 26 Daniel Vetter 2012-10-29 14:36:05 UTC
(In reply to comment #25)
> Maybe it is realated to bug 48721.
> 
> Please check the line 'RC6 residency since boot: XXX' of [1] twice.
> If the XXX-value doesn't increas, the GPU can't switch to the RC6 state.

We've already checked that theory, the gpu successfully enters rc6.
Comment 27 Micael Dias 2012-10-29 14:58:31 UTC
Just confirming that RC6 residency does increase, and cat'ing those cpu freq files report "1200000" for both.
Comment 28 Adam Ryczkowski 2012-10-29 16:03:30 UTC
I also experience this bug (http://unix.stackexchange.com/questions/53213/why-kernel-3-6-2-is-so-much-less-enery-efficient-than-3-6-1-on-mint-13) and I am ready to assist you with the bisecting thing, if it is still needed.
Comment 29 Rafael J. Wysocki 2012-10-29 23:17:17 UTC
(In reply to comment #24)
> > pwd
> /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/power
> 
> > cat control
> auto
> 
> 
> This is the same for both 3.5 and 3.7-rc2 with or without bbswitch.

Does writing "on" to that file make a difference?
Comment 30 Micael Dias 2012-10-30 00:26:26 UTC
No difference at all.
Comment 31 Huang Ying 2012-10-31 00:38:20 UTC
Created attachment 85341 [details]
Debug suspend when children forbidden

Hi, Micael Dias, Could you try the attached patch?
Comment 32 Micael Dias 2012-10-31 01:36:59 UTC
Just compiled from git with the patch, still have the problem.
Comment 33 Huang Ying 2012-10-31 01:53:10 UTC
In

https://bugzilla.kernel.org/show_bug.cgi?id=48981

It is reported that 3.7 can not turn on/off Nvidia card with bbswitch, can you try some method in that bug report?  And verify whether Nvidia card can be turned on/off successfully?
Comment 34 Micael Dias 2012-10-31 02:19:26 UTC
I confirm that bbswitch was not working even though it printed to dmesg that the card was OFF.

Compiled bbswitch from git and now it's working as intended, and I finally have low power consumption again :)


Thank you all very much for being so patient and helping me fix the issue.

You may now close this bug :)
Comment 35 Rafael J. Wysocki 2012-11-02 09:52:22 UTC

*** This bug has been marked as a duplicate of bug 48981 ***
Comment 36 Florian Mickler 2012-11-11 18:53:49 UTC
A patch referencing this bug report has been merged in Linux v3.7-rc5:

commit b3c32c4f9565f93407921c0d8a4458042eb8998e
Author: Huang Ying <ying.huang@intel.com>
Date:   Thu Oct 25 09:36:03 2012 +0800

    PCI/PM: Fix proc config reg access for D3cold and bridge suspending