Bug 49031
Summary: | [ivb] Major power consumption regression probably with i915 drm error | ||
---|---|---|---|
Product: | Power Management | Reporter: | Micael Dias (kam1kaz3) |
Component: | Other | Assignee: | Rafael J. Wysocki (rjw) |
Status: | CLOSED DUPLICATE | ||
Severity: | normal | CC: | adam, chris, daniel, florian, kruegsch, lenb, rjw, ying.huang |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 3.6.2+ | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
dmesg using kernel 3.7rc1
dmesg using kernel 3.5.6 dmesg for latest drm-intel-next-queued powertop with kernel 3.5 powertop with kernel 3.7 powertop idlestats for kernel 3.7 (left) and 3.5 (right) Debug suspend when children forbidden |
Created attachment 83931 [details]
dmesg using kernel 3.5.6
Forgot to mention hardware: Laptop: ASUS N56VZ CPU: Intel Ivy Bridge GPU: nVidia (disabled with bbswitch) The ERROR should be fixed with http://cgit.freedesktop.org/~danvet/drm-intel/commit/?h=drm-intel-next-queued&id=16995a9fe140802c026c2ce17bf7e232f86d57ab but should otherwise be totally harmless (if it only happens once after boot-up and maybe after each resume). The power regression itself is a bit more serious. To confirm, can you please grab latest powertop and check whether the gpu goes into rc6 (only works on recent kernels, not 3.5). Also please don't apply random kernel parameters 'i915_enable_rc6=7 i915_enable_fbc=1' without understanding the danger they pose. I can't test the latest kernel from git right now (will do shortly), but I remember using powertop to check the gpu state, and it did go into RC6, RC6p and RC6pp without any trouble, spending most off the time in RC6pp. I read RC6pp and the frame buffer compression has some stability problems on some hardware, but I've not had any trouble because of those yet. What other dangers can it pose? Is there any chance it can cause harware damage? FBC very rarely saves power, but it does cause a massive (10-100x) slowdown of the system. RC6pp was never implemented in silicon for SNB/IVB. Both are known for causing random render corruption, GPU hangs and hard hangs. Created attachment 84581 [details]
dmesg for latest drm-intel-next-queued
Compiled kernel from git 3b6bd9fbb0e3ce885e5b5c87000133bb899b8a14 and even though the error message is not there, power consumption is still much higher than with 3.5.
I'm attaching screenshots from powertop.
Note: powertop says that backight is consuming more on 3.7 but it was set to the exact same brightness using /sys/class/backlight/intel_backlight/brightness
Created attachment 84591 [details]
powertop with kernel 3.5
Created attachment 84601 [details]
powertop with kernel 3.7
The interesting tab on powertop is the next one, with idle stats. Can you please attach a screenshot of that on 3.7? Also, can you test what happens when you boot with i915.semaphores=0 ? i915.semaphores=0 doesn't change anything. Kernel 3.7 idle CPU temperature: 52 ºC Kernel 3.5 idle CPU temperature: 46 ºC I'll upload the wanted screenshots in a minute. Created attachment 84671 [details]
powertop idlestats for kernel 3.7 (left) and 3.5 (right)
Hm, this is _really_ unusual. Normally all the power regressions of this magnitude are due to rc6, which is not the case for you. So I think we need to take a step back and re-asses whether this is really due to graphics. Can you please boot with i915.ko disable (e.g. add i915.die=1 to the kernel cmdline) on both kernels and measure power consumption again? Also, I think we need to have a bisect result to move forward with this issue. Sure. Kernel 3.5 with i915.die=1 Power consumption: 17.0 W CPU temperature: 43 ºC Kernel 3.7 with i915.die=1 Power consumption: 17.5 W CPU temperature: 46~47 ºC Idle stats no longer shows the time spent on RC states, and I cannot get into gnome. Meh, your bios doesn't set up rc6, so that blows out everything else. I think you need to try to bisect this one, with rc6 apparently working on 3.6 I have no idea what burns down so much power for you. Kernel 3.5 (28a33cbc24e4256c143dce96c7d93bf423229f92): good Kernel 3.6-rc3 (fea7a08acb13524b47711625eebea40a0ede69a0) bad The RCs in between those don't boot. Nothing happens after "Loading initial ramdisk" On Wed, Oct 24, 2012 at 7:23 PM, <bugzilla-daemon@bugzilla.kernel.org> wrote: > --- Comment #16 from Micael Dias <kam1kaz3@gmail.com> 2012-10-24 17:23:52 > --- > Kernel 3.5 (28a33cbc24e4256c143dce96c7d93bf423229f92): good > Kernel 3.6-rc3 (fea7a08acb13524b47711625eebea40a0ede69a0) bad > > The RCs in between those don't boot. Nothing happens after "Loading initial > ramdisk" Have you tried doing this with git bisect? You can mark (some) bad commits as "please skip" with it. No, I'm not confortable enough with that. If there's any quick guide you could point me to so I can use that with the kernel I'd appreciate it. On Wed, Oct 24, 2012 at 9:32 PM, <bugzilla-daemon@bugzilla.kernel.org> wrote: > --- Comment #18 from Micael Dias <kam1kaz3@gmail.com> 2012-10-24 19:32:39 > --- > No, I'm not confortable enough with that. If there's any quick guide you > could > point me to so I can use that with the kernel I'd appreciate it. My usual recommendation is: http://www.reactivated.net/weblog/archives/2006/01/using-git-bisect-to-find-buggy-kernel-patches/ Finished bisecting, this is the first bad commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=71a83bd727cc31c5fe960c3758cb396267ff710e Now I'm not sure anymore if this has anything to do with the CPU... Well, that makes quite a bit sense, and also makes it rather sure that i915 is not the culprit. Please file a new bug (probably by mail to lkml), cc'ing the people in the commit message with the details we've already gathered. Closing this one here (many kernel hackers don't like bugzilla, so mailing list report is more successful, usually). Reassigned to power management and reopened. @Micael: What's in the /sys/devices/.../power/control file corresponding to the NVidia graphics adapter? > pwd /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/power > cat control auto This is the same for both 3.5 and 3.7-rc2 with or without bbswitch. Maybe it is realated to bug 48721. Please check the line 'RC6 residency since boot: XXX' of [1] twice. If the XXX-value doesn't increas, the GPU can't switch to the RC6 state. We also noticed that the current CPU frequency [2] doesn't fit the scaling frequency [3], when using the ondemand governor. [1] cat /sys/kernel/debug/dri/0/i915_drpc_info [2] cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq [3] cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq (In reply to comment #25) > Maybe it is realated to bug 48721. > > Please check the line 'RC6 residency since boot: XXX' of [1] twice. > If the XXX-value doesn't increas, the GPU can't switch to the RC6 state. We've already checked that theory, the gpu successfully enters rc6. Just confirming that RC6 residency does increase, and cat'ing those cpu freq files report "1200000" for both. I also experience this bug (http://unix.stackexchange.com/questions/53213/why-kernel-3-6-2-is-so-much-less-enery-efficient-than-3-6-1-on-mint-13) and I am ready to assist you with the bisecting thing, if it is still needed. (In reply to comment #24) > > pwd > /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/power > > > cat control > auto > > > This is the same for both 3.5 and 3.7-rc2 with or without bbswitch. Does writing "on" to that file make a difference? No difference at all. Created attachment 85341 [details]
Debug suspend when children forbidden
Hi, Micael Dias, Could you try the attached patch?
Just compiled from git with the patch, still have the problem. In https://bugzilla.kernel.org/show_bug.cgi?id=48981 It is reported that 3.7 can not turn on/off Nvidia card with bbswitch, can you try some method in that bug report? And verify whether Nvidia card can be turned on/off successfully? I confirm that bbswitch was not working even though it printed to dmesg that the card was OFF. Compiled bbswitch from git and now it's working as intended, and I finally have low power consumption again :) Thank you all very much for being so patient and helping me fix the issue. You may now close this bug :) *** This bug has been marked as a duplicate of bug 48981 *** A patch referencing this bug report has been merged in Linux v3.7-rc5: commit b3c32c4f9565f93407921c0d8a4458042eb8998e Author: Huang Ying <ying.huang@intel.com> Date: Thu Oct 25 09:36:03 2012 +0800 PCI/PM: Fix proc config reg access for D3cold and bridge suspending |
Created attachment 83921 [details] dmesg using kernel 3.7rc1 I'm using arch and have seen major power consumption regression upgrading the kernel from 3.5.7 to 3.6.2. I've tested 3.7rc1 with the same results. When using 3.5.7 I used to get ~11 W discharge rate from powertop, and with this new kernels I get ~23 W. I also get this suspecting message from dmesg with the new kernels: [ 8.307873] [drm:__gen6_gt_force_wake_mt_get] *ERROR* Timed out waiting for forcewake old ack to clear.