Bug 48721
Summary: | cpufreq stuck at max frequency, suspend/resume, i915 may affect | ||
---|---|---|---|
Product: | ACPI | Reporter: | Maurizio Avogadro (mavoga) |
Component: | Power-Processor | Assignee: | Len Brown (lenb) |
Status: | CLOSED DUPLICATE | ||
Severity: | normal | CC: | 1007380, abhijeet.1989, alan, alejometal3, alex.shi, alga777, brendansellers, const, corin, e.at.chi.kaen, eduedix, flyser42, garkein, hendry, j.keil, jj, js314592, jvpgomes, kernel.org, kernel, kruegsch, lenb, loic, losinski, matthias, matti.niemenmaa+kernelbugs, mswal2846, nefthy-kernel.org, onorua, patrakov, pl4nkton, rjw, rjw, robert, rscrihf, rui.zhang, samuel-kbugs, tomka |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 3.6.0+ | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
3.5.5 kernel: output of 'grep -r . /sys/devices/system/cpu/ 2>/dev/null'
3.6.1 kernel: output of 'grep -r . /sys/devices/system/cpu/ 2>/dev/null'
3.6.1 kernel config
git bisect log
CPU and GPU working reliably for ~12h with this patch
Script to (possibly) provoke the bug |
Created attachment 83131 [details]
3.6.1 kernel: output of 'grep -r . /sys/devices/system/cpu/ 2>/dev/null'
Created attachment 83151 [details]
3.6.1 kernel config
I confirm the same behavior on 3.6.2 as well. Bisecting is complicated because, on rare occasions, the CPU frequency scales correctly for some time after a reboot. The behavior is totally unpredictable: sometimes the laptop starts, everything goes fine and the CPU clock scales. Without first identifying what triggers the bug, bisecting is very hard if not impossible, and my attempts led to nonsensical results. Furthermore, I experience the very same issue as #48791: the GPU can't reach the rc6 states (source: powertop). That happens randomly, also after a regular boot, and it seems to happen exactly when the CPU clock is stuck. I believe these bugs are related.

Created attachment 83601 [details]
git bisect log
I can confirm the behavior and I did a git bisect.
The result is, that
"drm/i915: fix forcewake related hangs on snb" - 6af2d180f82151cf3d58952e35a4f96e45bc453a
is the bad commit.
It also matches my hardware specs: i7-2640M with Sandy Bridge GPU in use. Can somebody check this against the last good commit in the attached git bisect log?
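Because the symptom only shows up on some boots, a single clean boot does not prove that a commit is good. The following is a minimal sketch of keeping a per-commit tally during a manual bisect. The function names, the tally file and the number of clean boots required are assumptions made for illustration; none of them come from this report.

```shell
#!/bin/sh
# Hypothetical helpers for bisecting an intermittent bug by hand.

# record_result FILE RESULT
# Append one observation ("good" or "bad") for the commit under test.
record_result() {
    echo "$2" >> "$1"
}

# verdict FILE REQUIRED_GOOD
# Print "bad" as soon as one bad boot was recorded, "good" once
# REQUIRED_GOOD clean boots were recorded, and "undecided" otherwise.
verdict() {
    [ -f "$1" ] || { echo undecided; return; }
    if grep -qx bad "$1"; then
        echo bad
    elif [ "$(grep -cx good "$1")" -ge "$2" ]; then
        echo good
    else
        echo undecided
    fi
}
```

One bad boot is enough to mark a commit bad, while a good verdict needs several clean boots. Only when `verdict` prints `good` or `bad` would the result be passed on to `git bisect good` or `git bisect bad`, and the tally file would be reset for the next commit.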
Built git checkout 9978cf5 ("i915: Remove silly test"): the CPU clock is stuck at 2.2GHz at first reboot.

Confirming this behavior on an Intel Core i5-2520M running Arch Linux, Intel 3000 GPU used. I will be glad to attach any information that is needed.

(In reply to comment #7)
> Built git checkout 9978cf5 ("i915: Remove silly test"): the CPU clock is stuck
> at 2.2GHz at first reboot.

That is strange. I tested that commit again and I can not force the buggy behavior. BTW - I used a wattmeter to check the state without the need to login. That allowed me a high reboot frequency for testing.

Hello, I did a bisect myself (although not with a power meter..) but came across the same commit as Marco. I stumbled over a few more hurdles since some versions simply broke while booting (some would boot after the 2. or 3. attempt, others not). However, in addition I tried resume/suspend for every commit. Some commits would boot up fine with frequency scaling working. But after suspend/resume frequency scaling was broken too. I did a quick search for "sandy bridge forcewake" and I found this (among other mails): http://www.gossamer-threads.com/lists/linux/kernel/1405844

I've built v3.6.2 with 6af2d180f82151cf3d58952e35a4f96e45bc453a reverted. Behaviour as above. Upon initial boot, frequency scaling is broken, after suspend/resume frequency scaling is working. I fear that this snake pit goes even deeper. Hopefully someone with more insight can shed some light here..

I guess it is related to the bug [1], that keeps the CPU busy.
[1] - https://bugs.freedesktop.org/show_bug.cgi?id=54089

Hi, I tried out several patches against stable/v3.6.2:
* Reverting 6af2d180f82151cf3d58952e35a4f96e45bc453a: result like in comment #10.
* This [1] patch + reverting 6af2d180: upon boot frequency scaling is working. However, after some suspend/resume cycles frequency scaling is broken again.
* All commits from drm-intel-nightly from [2]: no result.
Hope this helps.
[1] https://bugs.freedesktop.org/show_bug.cgi?id=54089
[2] git://people.freedesktop.org/~danvet/drm-intel

Some more observations from me. First off, another bisect:

git bisect start
# good: [73b6448a7705298b2b10367a50fd063b27cdbeb8] Linux 3.5.6
git bisect good 73b6448a7705298b2b10367a50fd063b27cdbeb8
# bad: [6af2d180f82151cf3d58952e35a4f96e45bc453a] drm/i915: fix forcewake related hangs on snb
git bisect bad 6af2d180f82151cf3d58952e35a4f96e45bc453a
# good: [6b16351acbd415e66ba16bf7d473ece1574cf0bc] Linux 3.5-rc4
git bisect good 6b16351acbd415e66ba16bf7d473ece1574cf0bc
# bad: [930ebb462422117e12b85bb5fd6548ed13d0afb5] drm/i915: fix up ilk rc6 disabling confusion
git bisect bad 930ebb462422117e12b85bb5fd6548ed13d0afb5
# good: [e055684168af48ac7deb27d7267046a0fb9ef80e] drm/i915: context switch implementation
git bisect good e055684168af48ac7deb27d7267046a0fb9ef80e
# good: [4a87d65d54ffc76e1d6f7e2124354997b66bd81c] drm/i915: add HDMI and DP port enumeration on ValleyView
git bisect good 4a87d65d54ffc76e1d6f7e2124354997b66bd81c
# bad: [e86fe0d31722090e3bb72a3e8aab70f07e2c6b77] drm/i915: mask tiled bit when updating IVB sprites
git bisect bad e86fe0d31722090e3bb72a3e8aab70f07e2c6b77
# good: [ff049b6ce21d2814451afd4a116d001712e0116b] drm/i915: bind driver to ValleyView chipsets
git bisect good ff049b6ce21d2814451afd4a116d001712e0116b
# good: [79f5b2c7599270aa3dcfffb445f8f520c05a7fc5] drm/i915: make enable/disable_gt_powersave locking consistent
git bisect good 79f5b2c7599270aa3dcfffb445f8f520c05a7fc5
# bad: [01a06850fb45ace55ed67d1d9da2df553a041e40] drm/i915: disable drm agp support for !gen3 with kms enabled
git bisect bad 01a06850fb45ace55ed67d1d9da2df553a041e40
# bad: [87207ca20eeb519aa0333b754db9cf3c369ea6f7] drm/i915: don't use dev->agp
git bisect bad 87207ca20eeb519aa0333b754db9cf3c369ea6f7

As you can see I ended up with a completely different commit. Currently I'm running 3.4.9 which did not yet show the problem.
Since I experienced the problem with 3.5.6 too, I suspect the error is somewhere between 3.4.9 and 3.5.9. It is interesting, though, that some versions in between would work well whereas others seem to be broken. Most often I came across AGP (which was refactored at some point) and forcewake. Maybe the culprit is there. I also reverted 990bbdadabaa51828e475eda86ee5720a4910cc3 (which is mentioned in 6af2d180f82151cf3d58952e35a4f96e45bc453a) but that left me with a non-working graphics card.

After a ~12h test and normal use with some suspends and resumes, it seems that with commit 6af2d180f82151cf3d58952e35a4f96e45bc453a reverted [1] and an msleep added as mentioned in [2], both CPU frequency scaling and the GPU rc6 state are working reliably as expected. At least in my case... Attaching a patch for clarity. I should also try applying [2] only in order to confirm both changes are needed...

[1] see comment #10 here
[2] https://bugs.freedesktop.org/show_bug.cgi?id=54089#c24 (Comment #24)

Created attachment 84431 [details]
CPU and GPU working reliably for ~12h with this patch
(In reply to comment #14)
> After ~12h test and normal use with some suspends and resumes it seems that
> reverting the commit 6af2d180f82151cf3d58952e35a4f96e45bc453a [1] and adding a
> msleep as mentioned in [2] it seems that both CPU frequency scaling and GPU rc6
> state are working reliably as expected. At least in my case...
>
> Attaching a patch for clarity.
>
> I should also try applying [2] only in order to confirm both changes are
> needed...
>
> [1] see comment #10 here
> [2] https://bugs.freedesktop.org/show_bug.cgi?id=54089#c24 (Comment #24)

When I revert and/or patch from v3.6.3 neither of them fixes the bug. My last "good" bisect commit was in v3.5-rc4 and that worked for me.

In my case, even with the patch proposed, the issue appears at boot randomly, say every 5-6 boots, and always disappeared after a suspend/resume cycle until now. Bisecting would be a PITA under these circumstances: I'm unable to reliably reproduce the bug, I can only tell that graphics activity seems irrelevant and that it happens exclusively on boot. Does the driver rely on some value stored in NVRAM? Could this issue be related to some NVRAM data corruption? I also experienced a weird NVRAM corruption on this laptop, which required a complete reflash to solve, some time after installing the first 3.6 kernel.

I've applied the patch to my Lenovo X220 FC17 3.6.3 system. So far, I am able to successfully un-suspend without experiencing this problem.

So I've suspended and come out of suspension several times since 10/26 without an issue, but this morning I came out of suspension and the problem reappeared, despite the patch application. My CPU at the time was low and TOP showed no run away process, yet the temperature was climbing and the fan was flying to no avail. Rebooting fresh has "reset" the problem...

At a T420 it doesn't work - tested at kernel 3.6.4 at a stable Gentoo.

*** Bug 48791 has been marked as a duplicate of this bug.
***

I was seeing the same thing on an i5-2540M using 3.6.3. I was wondering why the fan was always running so much, except (I think) on boot and sometimes after a resume. I finally noticed that scaling_cur_freq != cpuinfo_cur_freq and found this bug report. Yesterday I updated to the 3.6.6 kernel and even after several suspend/resume cycles it's still working properly.

(In reply to comment #22)
> Yesterday I updated to the 3.6.6 kernel and even
> after several suspend/resume cycles it's still working properly.

I also have updated and thought it's fixed. But after some more hours of usual work it has fallen back to improper behaviour.

I haven't tried any other kernels or patches, but I can confirm this behavior on my i7-2620M, currently running kernel release 3.6.6.

I can confirm it's still a problem with 3.6.6. I rebooted again and did a suspend/resume, so one of those triggered it again.

I can confirm the problem too. Having an i7-2620M SandyBridge running ArchLinux. Last working build for me was 3.5.6. All later releases up to 3.6.6 show the same problem.

I just did another suspend/resume cycle and it is corrected, so it seems very random.

I can also confirm exactly what #25 Samuel Sieb and #26 Peter Schneider say. As #26 says, the problem is exactly the same. There are times when it works perfectly and sometimes it doesn't. So, when there is a problem, is there any way I can help by giving you guys some outputs which you can check against the times when there is no issue? I would be interested in helping in any way to get the bug solved. It has been months since it was produced and I really want this to go away.
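One way to provide comparable outputs is to capture the same sysfs dump that is attached to this report, once while scaling works and once while it is stuck, and then diff the two. This is only a sketch: the function names and the output file names are made up for illustration, and the `grep -r .` invocation is the one used for the attachments.

```shell
#!/bin/sh
# Hypothetical helpers for comparing a "good" and a "stuck" state.

# snapshot DIR OUTFILE
# Dump every readable file below DIR as "path:content", sorted so that
# two snapshots can be compared line by line.
snapshot() {
    grep -r . "$1" 2>/dev/null | sort > "$2"
}

# changed GOOD_SNAPSHOT BAD_SNAPSHOT
# Print only the lines that differ between the two snapshots.
changed() {
    diff "$1" "$2" | grep '^[<>]'
}
```

For example, `snapshot /sys/devices/system/cpu good.txt` after a working resume, `snapshot /sys/devices/system/cpu stuck.txt` when the fan spins up, then `changed good.txt stuck.txt`. Frequency and idle-time counters change all the time, so part of the diff will be noise.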
I'd like to point out one more observation: Once the error occurs and the CPU frequency is stuck, this also applies to the Intel GPU:

> # cat /sys/kernel/debug/dri/0/i915_cur_delayinfo
> GT_PERF_STATUS: 0x00000d29
> RPSTAT1: 0x00048d00
> Render p-state ratio: 13
> Render p-state VID: 41
> Render p-state limit: 255
> CAGF: 650MHz
> RP CUR UP EI: 61329us
> RP CUR UP: 4us
> RP PREV UP: 0us
> RP CUR DOWN EI: 0us
> RP CUR DOWN: 0us
> RP PREV DOWN: 0us
> Lowest (RPN) frequency: 650MHz
> Nominal (RP1) frequency: 650MHz
> Max non-overclocked (RP0) frequency: 1300MHz
>
> # cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
> 800000
> ############ After standby #################################
>
> # cat /sys/kernel/debug/dri/0/i915_cur_delayinfo
> GT_PERF_STATUS: 0x00001ac6
> RPSTAT1: 0x00049a19
> Render p-state ratio: 26
> Render p-state VID: 198
> Render p-state limit: 255
> CAGF: 1300MHz
> RP CUR UP EI: 9142us
> RP CUR UP: 9142us
> RP PREV UP: 66000us
> RP CUR DOWN EI: 191074us
> RP CUR DOWN: 0us
> RP PREV DOWN: 0us
> Lowest (RPN) frequency: 650MHz
> Nominal (RP1) frequency: 650MHz
> Max non-overclocked (RP0) frequency: 1300MHz
>
> # cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
> 2700000

Notice how CAGF is now at the maximum frequency. Also, the GT_PERF_STATUS and RPSTAT1 values are different and reproducible. I have no idea what these mean though. (I've posted the same info on the Arch forum and this behavior was confirmed by another user.)

Upgrading from 3.4.9 to 3.4.10 seems to make that behaviour much worse than before on a stable x86 Gentoo with the i915 module on a ThinkPad T420.

Hello,
using 3.6 I get this behavior rarely (actually I hadn't correlated it with a kernel problem), while with 3.7 I get it consistently.
According to PowerTop, the power consumption doubles (from 13.5-15 W to >30 W), CPU cores and GPU work at maximum frequency, and statistics show heavy use of Turbo mode.
I can confirm that suspending and resuming the machine restores the correct frequency scaling *and* the "benefit" survives reboots. Just as an example:
> 1. Kernel 3.6
> ==================================================
>
> GT_PERF_STATUS: 0x00000d29
> RPSTAT1: 0x00048d00
> Render p-state ratio: 13
> Render p-state VID: 41
> Render p-state limit: 255
> CAGF: 650MHz <==============
> RP CUR UP EI: 25483us
> RP CUR UP: 59us
> RP PREV UP: 0us
> RP CUR DOWN EI: 0us
> RP CUR DOWN: 0us
> RP PREV DOWN: 0us
> Lowest (RPN) frequency: 650MHz
> Nominal (RP1) frequency: 650MHz
> Max non-overclocked (RP0) frequency: 1100MHz
>
>
> 2. Kernel 3.7 (after poweroff and cold reboot)
> ==================================================
> GT_PERF_STATUS: 0x000016c3
> RPSTAT1: 0x00049614
> Render p-state ratio: 22
> Render p-state VID: 195
> Render p-state limit: 255
> CAGF: 1100MHz <==============
> RP CUR UP EI: 15388us
> RP CUR UP: 15388us
> RP PREV UP: 66000us
> RP CUR DOWN EI: 232477us
> RP CUR DOWN: 0us
> RP PREV DOWN: 0us
> Lowest (RPN) frequency: 650MHz
> Nominal (RP1) frequency: 650MHz
> Max non-overclocked (RP0) frequency: 1100MHz
>
> 3. Kernel 3.7 (after suspend/resume)
> ==================================================
> GT_PERF_STATUS: 0x00000d29
> RPSTAT1: 0x00048d00
> Render p-state ratio: 13
> Render p-state VID: 41
> Render p-state limit: 255
> CAGF: 650MHz <==============
> RP CUR UP EI: 55593us
> RP CUR UP: 4us
> RP PREV UP: 0us
> RP CUR DOWN EI: 0us
> RP CUR DOWN: 0us
> RP PREV DOWN: 0us
> Lowest (RPN) frequency: 650MHz
> Nominal (RP1) frequency: 650MHz
> Max non-overclocked (RP0) frequency: 1100MHz
>
> 4. Kernel 3.7 (after reboot without poweroff)
> ==================================================
> GT_PERF_STATUS: 0x00000d29
> RPSTAT1: 0x00040d00
> Render p-state ratio: 13
> Render p-state VID: 41
> Render p-state limit: 255
> CAGF: 650MHz <==============
> RP CUR UP EI: 2005us
> RP CUR UP: 6us
> RP PREV UP: 0us
> RP CUR DOWN EI: 0us
> RP CUR DOWN: 0us
> RP PREV DOWN: 0us
> Lowest (RPN) frequency: 650MHz
> Nominal (RP1) frequency: 650MHz
> Max non-overclocked (RP0) frequency: 1100MHz
I'm on an Asus X53Sv (AKA K53Sv) with BIOS rev. 320.
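The dumps above differ most visibly in the CAGF line, which sits at the RPN value when things are fine and at the RP0 value when the GPU is stuck. A minimal sketch of classifying such a dump follows. The function names and the state labels are invented for illustration, and the parsing assumes the line format shown in this thread.

```shell
#!/bin/sh
# Hypothetical helpers for reading an i915_cur_delayinfo dump.

# field_mhz LABEL
# Read a dump on stdin and print the MHz value of the first line that
# contains LABEL (for example CAGF, RPN or RP0).
field_mhz() {
    sed -n "s/.*$1[^:]*: *\([0-9][0-9]*\)MHz.*/\1/p" | head -n 1
}

# gpu_state FILE
# Print "max" when CAGF is at or above RP0, "idle" when it is at or
# below RPN, "intermediate" in between, "unknown" if parsing fails.
gpu_state() {
    cur=$(field_mhz CAGF < "$1")
    min=$(field_mhz RPN < "$1")
    max=$(field_mhz RP0 < "$1")
    if [ -z "$cur" ] || [ -z "$min" ] || [ -z "$max" ]; then
        echo unknown
    elif [ "$cur" -ge "$max" ]; then
        echo max
    elif [ "$cur" -le "$min" ]; then
        echo idle
    else
        echo intermediate
    fi
}
```

Run as root with debugfs mounted, `gpu_state /sys/kernel/debug/dri/0/i915_cur_delayinfo` would print one of the labels. A result of `max` only hints at this bug when the machine is otherwise idle, since a GPU under real load reaches RP0 legitimately.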
I'm running this script after each resume / boot cycle now at my ThinkPad T420 (i5 + intel i915 graphic) :

$ cat check_rc6.sh
#!/bin/sh
#
# Optional command to run when the RC6 residency counter does not advance.
ACTION=$1
S=/sys/class/drm/card0/power/rc6pp_residency_ms
B=`cat $S`
sleep 3
A=`cat $S`
# An unchanged counter means no time was spent in this RC6 state.
if [ "$A" -eq "$B" ]; then
    if [ -n "$ACTION" ]; then
        $ACTION
    else
        echo "RC6 issue"
        aplay /usr/share/sounds/pop.wav
    fi
fi

The patch mentioned above from bug https://bugs.freedesktop.org/show_bug.cgi?id=54089 shipped in Linux-3.8-rc

Please report if the issues reported in this bug report can be reproduced in Linux-3.8-rc

(In reply to comment #34)
> The patch mentioned above from bug
> https://bugs.freedesktop.org/show_bug.cgi?id=54089
> shipped in Linux-3.8-rc
>
> Please report if the issues reported in this bug
> report can be reproduced in Linux-3.8-rc

Linux-3.8.0-rc3 seems to fix the issue for me. Thanks

Bug 48791 has been marked as a duplicate of this one, but I still see it with Linux-3.8.0-rc3. The issue is that after some resumes from suspend (maybe 20%?) the GPU can't reach the RC6 state anymore leading to heat and power drain. This is on a Thinkpad X220. Should I reopen Bug 48791?

(In reply to comment #36)
> This is on a Thinkpad X220. Should I reopen Bug 48791?

yes - of course

I, like several others, have downloaded the 3.8.0 rc3 kernel from http://www.kernel.org/, compiled it using these directions http://www.howopensource.com/2011/08/how-to-install-compile-linux-kernel-3-0-in-fedora-15-and-14/ (making sure to replace the kernel name) and it is working fine for me .... so great to have suspend/resume working again!

I confirm the same behavior described in the initial description on 3.7.4-1-ARCH #1 SMP PREEMPT Mon Jan 21 23:05:29 CET 2013 x86_64 GNU/Linux. My machine is an Asus Laptop N53SV Intel Core i5 2430M @ 2.4 GHz, using Arch Linux up-to-date and no added parameters on the kernel boot line. If you need further info, I'm glad to post it.
(In reply to comment #39)
> I confirm the same behavior described in initial description on 3.7.4-1-ARCH #1
> SMP PREEMPT Mon Jan 21 23:05:29 CET 2013 x86_64 GNU/Linux.

This bug here is only fixed in the 3.8 series. The remaining discussion is about the seemingly independent bug 48791 which is not fixed yet.

I think this bug is related to https://bugzilla.kernel.org/show_bug.cgi?id=52411 I don't know how to mark this bug as being related.

Since status is NEEDINFO, related Archlinux bugs are: https://bugs.archlinux.org/task/33810 & https://bugs.archlinux.org/task/32025 And how about the extra info like this stable series have ruined my life? ;) TMI, I guess. ;) I did quickly try 3.8-rc7 and it seemed to run cool. http://stats.webconverger.org/x220/temp/043.png I assume we wait for 3.8 to drop and then I'll try to forget about the 3.7.x nightmare.

I have noticed this again at least twice after resuming from suspend on 3.8.2-1-ARCH #1 SMP PREEMPT Mon Mar 4 09:06:43 CET 2013 x86_64 GNU/Linux. However it is not regular as it was for me on 3.6-3.7.x. Machine is a Sony VAIO VPCEG34FX.

With 3.8.3 the problem became almost constant on ThinkPad x220. Linux x220 3.8.3 #1 SMP PREEMPT Sun Mar 17 13:46:24 EET 2013 x86_64 GNU/Linux With 3.8.1 and even 3.8.2 the situation was much better. I don't know what is going on. What information is needed? How can I help?

Yes, I also experienced this on 3.8.3 after unsuspending. However, I resuspended and unsuspended and it "went away." So not consistent for me. I'm also running a Lenovo X220, on Fedora 18 with kernel Linux 3.8.3-201.fc18.x86_64.

Running 3.9-rc3, and I had this issue today. Linux version 3.9.0-1-rc3-mainline-dirty (wgiokas@wst420) (gcc version 4.7.2 (GCC) ) #1 SMP PREEMPT Sun Mar 17 20:13:34 CDT 2013 It does seem to be better than 3.6 and 3.7, but still not fixed. The only thing that I can see blocking this is the fact that it's very hard to reproduce this bug.
If it wasn't, then I would be bisecting, but right now it is almost impossible to differentiate a good and a bad commit.

Hi, guys, can you please verify if this is a duplicate of bug 52411? please check if the problem still exists in the latest upstream kernel.

Hi Zhang, the symptoms described in bug 52411 are indeed the same; nevertheless, I experienced the issue sometimes - but quite rarely - after resuming my laptop even with 3.8: if I'm not mistaken, the last time it happened was some weeks ago with 3.8.5. But please bear in mind that most of the time I simply shutdown -h now my laptop, therefore I can't provide much information. Thanks

I tried kernel 3.8.8 and it seems that the problem is still present. Besides happening (apparently) randomly when the laptop resumes from suspend, I also noticed that if I turn the laptop on and I take some time to login, it will enter the same condition, with the GPU not going to rc6.

Hi, alex, can you please check if this is the problem you meet?

Reply-To: daniel.dorau@gmx.de
For my understanding: Is the GPU not going to rc6 the same as all CPUs staying at their max frequencies after resume? The latter was what I thought this issue was about and what I experienced after the 3.5.7 to 3.6 upgrade (definitely introduced with 3.6). The issue became increasingly rare and I haven't experienced it for a while now. After reading the other bug's description I'm not sure if they are the same, although probably related.

bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=48721
>
> Zhang Rui <rui.zhang@intel.com> changed:
>
> What |Removed |Added
> ----------------------------------------------------------------------------
> CC | |alex.shi@intel.com
>
> --- Comment #50 from Zhang Rui <rui.zhang@intel.com> 2013-04-21 14:31:15 ---
> Hi, alex,
>
> can you please check if this is the problem you meet?

(In reply to comment #50)
> Hi, alex,
>
> can you please check if this is the problem you meet?

No, this is a different problem from the one I met. In my problem, the cpu freq can be changed, just ondemand governor behaviour issue.

Do all these systems have i915 graphics?

I have an i915. I'll check that /sys/kernel/debug/dri/0/i915_cur_delayinfo file each resume to see if there's a correlation for me as well.

Reply-To: daniel.dorau@gmx.de
As far as I'm concerned, yes. I didn't notice the issue on more recent kernels (3.9.x) though.

It's definitely correlated with the values in i915_cur_delayinfo.

Typical values when stuck:
CAGF: 1300MHz
RP CUR UP EI: 49915us
RP CUR UP: 49915us
RP PREV UP: 66000us
RP CUR DOWN EI: 189311us
RP CUR DOWN: 0us
RP PREV DOWN: 300000us

Typical values when normal:
CAGF: 650MHz
RP CUR UP EI: 33251us
RP CUR UP: 99us
RP PREV UP: 0us
RP CUR DOWN EI: 0us
RP CUR DOWN: 0us
RP PREV DOWN: 0us

Can you please check if the symptoms are reproducible with the (new) intel_pstate driver instead of acpi-cpufreq?

Could you explain how I would do that? Which version of the kernel is needed and what steps to take?

i was just hit today using intel_pstate, resolved it by another suspend-resume cycle. Thinkpad X220t Funtoo GNU/Linux Linux 3.10.0-rc4-JJ #1 SMP PREEMPT Wed Jun 5 16:51:49 CEST 2013 x86_64 Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz GenuineIntel GNU/Linux

# cpupower frequency-info
analyzing CPU 0:
driver: intel_pstate
...
available cpufreq governors: performance, powersave
current policy: frequency should be within 800 MHz and 3.20 GHz.
The governor "powersave" may decide which speed to use within this range.

(In reply to comment #57)
> Can you please check if the symptoms are reproducible with the (new)
> intel_pstate driver instead of acpi-cpufreq?
Never noticed until now with kernel 3.9 and intel_pstate driver (approx. 2 weeks), will report if it happens.

(In reply to comment #58)
> Could you explain how I would do that? Which version of the kernel is needed
> and what steps to take?

You need 3.9 or later, but better use 3.10-rc5. Select CONFIG_X86_INTEL_PSTATE, build the kernel and install it.

(In reply to comment #59)
> i was just hit today using intel_pstate, resolved it by another suspend-resume
> cycle.

Are you ever able to see the problem after a sequence like this:

# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
# echo mem > /sys/power/state
# echo 0 > /sys/devices/system/cpu/intel_pstate/no_turbo

(that is, turn turbo off, suspend/resume, turn turbo on)?

Created attachment 104201 [details]
Script to (possibly) provoke the bug

New kernels seem to work a lot better but I ran into the same troubles with 3.9.5 and pstate activated several days ago. I've just written a hacky bash script to provoke the issue - unfortunately it did not yet provoke the bug. Maybe somebody has more luck? The script suspends automatically several times and tries to detect the stuck CPU and/or GPU clock. "true" means that the bug was detected during that standby cycle. You may need to install rtcwake. A sample output looks like this:

> -> # bash suspend_bug_detect.sh
> Detected CPUs are: cpu0 cpu1 cpu2 cpu3
>
> Frequency scaling information:
> analyzing CPU 0:
> driver: intel_pstate
> CPUs which run at the same hardware frequency: 0
> CPUs which need to have their frequency coordinated by software: 0
> maximum transition latency: 0.97 ms.
> hardware limits: 800 MHz - 3.40 GHz
> available cpufreq governors: performance, powersave
> current policy: frequency should be within 800 MHz and 3.40 GHz.
> The governor "powersave" may decide which speed to use
> within this range.
> current CPU frequency is 3.13 GHz (asserted by call to hardware).
> boost state support:
> Supported: yes
> Active: yes
> 25500 MHz max turbo 4 active cores
> 25500 MHz max turbo 3 active cores
> 25500 MHz max turbo 2 active cores
> 25500 MHz max turbo 1 active cores
> Intel GPU info:
> Maximum Frequency: 1300MHz
>
> Trying to provoke bug 48721 by suspending...
>
> run, gpu, cpu0, cpu1, cpu2, cpu3,
> 1, false, false, false, false, false,
> 2, false, false, false, false, false,
> 3, false, false, false, false, false,
> 4, false, false, false, false, false,
> 5, false, false, false, false, false,

Another interesting datapoint. I've been running my laptop with the scaling_governor set to "powersave", so the CPU frequency is locked to the lowest level. I've had a few times when the hardware wouldn't resume, not sure if that's related, but it never happened before this. More interesting is that this last time I resumed, the CPU frequency was still low as it should be, but the result from i915_cur_delayinfo was stuck high as described earlier in this bug. I had to suspend and resume again to get it back to normal.

Can people please verify if the problem is still reproducible with this patch: https://bugzilla.kernel.org/attachment.cgi?id=106891&action=diff from bug #58971 applied?

Any results anyone?

The patch definitely helps against the frequency stuck at the maximum. However, it is not a full solution to the power consumption problem, as explained below. The affected laptop is a Sony VAIO VPCZ23A4R. Without the patch, there is some chance that on boot the CPU will be stuck at 3.5 GHz and the GPU will be stuck in the powered-on state even if I set the minimum and maximum frequency to 800 MHz using cpufreq-set for all CPUs. A suspend-resume cycle usually helps. With the patch, I have not yet seen the "CPU stuck at 3.5 GHz and GPU is 100% active" bug. However, now it is stuck in 2-2.5 GHz range (i.e.
does not go below 2 GHz) and the GPU is in RC6 only for 1-5% of all the time, with the workload being just Chromium with GMail and some bugzillas open. The CPU frequency does obey cpufreq-set, but this does not affect RC6 residency. So not as bad, but still bad. A suspend-resume cycle helps.

Actually, when I close Chromium, the CPU frequency goes down to ~900 MHz and RC6 residency appears. However it still looks like a bug that just typing emails and bugs in Chromium raises the power consumption from 10.5 W (previous kernels could go as low as 8W) to 15-16W.

Here the issue seems gone since kernel 3.9 and intel_pstate. I only noticed a slight temperature increase since the new scaling driver was introduced, but neither the graphics nor the CPU clock got stuck anymore as far as I can see.

(In reply to Maurizio Avogadro from comment #69)
> Here the issue seems gone since kernel 3.9 and intel_pstate. I only noticed
> a slight temperature increase since the new scaling driver was introduced,
> but neither the graphics nor the CPU clock got stuck anymore as far as I can
> see.

Awesome, thanks!

(In reply to Alexander E. Patrakov from comment #67)
> The patch definitely helps against the frequency stuck at the maximum.
> However, it is not a full solution to the power consumption problem, as
> explained below.

This particular bug entry is not about the "power consumption problem" (anyway, you can't consume power, but energy), only about the CPU frequency stuck at the maximum. So if the patch helps with that, I'm inclined to consider it as resolved.

(In reply to Rafael J. Wysocki from comment #71)
> This particular bug entry is not about the "power consumption problem"
> (anyway, you can't consume power, but energy), only about the CPU frequency
> stuck at the maximum. So if the patch helps with that, I'm inclined to
> consider it as resolved.

Well, if you understand why the calls to gen6_gt_force_wake_get() are necessary, then let's treat the bug as fixed.
However, my opinion is not that certain, because of the following (please note that it is my purely subjective opinion based on black-box observations):

* I still have extreme sensitivity of the CPU frequency and GPU RC6 occupancy to the load.
* In physics and in signal processing, excessive sensitivity and instability are closely related notions if there is a positive feedback loop somewhere.
* Lockup in one of the extreme states (like max frequency here) is one of the manifestations of instability of the intended state.

If you are absolutely sure that there are no feedback loops that can become positive in the code that changes the power states, please close the bug as fixed.

(In reply to Alexander E. Patrakov from comment #72)
> Well, if you understand why the calls to gen6_gt_force_wake_get() are
> necessary, then let's treat the bug as fixed.

Please refer to bug #58971 for that.

> If you are absolutely sure that there are no feedback loops that can become
> positive in the code that changes the power states, please close the bug as
> fixed.

I'm not absolutely sure about anything, but this particular bug entry is about a specific kind of behavior that either can or cannot be observed. If you don't observe it any more (and no one else does, for that matter), I think it should be closed, because we can't possibly make any progress here in that case (it's kind of difficult to fix a bug that no one is able to reproduce).

I have read the related bug entry, and I am not convinced that this bug has been root-caused. But if you are comfortable with other people opening follow-up bugs that may share the same root cause - yes, sure, close this one.

After rereading the above, I understand that my arguments can also be used against myself, as I am also not sure that the extreme sensitivity of the consumed power to the load is related to the root cause of this bug, whatever it is.
So, let's just close this bug :)

I'm marking this one as a duplicate of bug #58971, because both the time frame it appeared in and the visible symptoms indicate at least some dependency between these bugs.

*** This bug has been marked as a duplicate of bug 58971 *** |
Created attachment 83121 [details]
3.5.5 kernel: output of 'grep -r . /sys/devices/system/cpu/ 2>/dev/null'

On my laptop (a Clevo W150HRM with an i7 2720QM CPU) I'm currently running Debian wheezy/sid with the latest aptosid (3.6-1.slh.3) amd64 kernel: after switching from the 3.5.5 to the 3.6 kernel I noticed the idle temperatures suddenly increased by ~10°C with no apparent activity on the system.

The temperature increase, together with the sysfs scaling_cur_freq and cpuinfo_cur_freq values, seems to support my suspicion of a cpufreq bug: the ondemand governor doesn't seem able to modulate the CPU clock anymore. While the scaling_cur_freq value varies depending on system load just as expected, the cpuinfo_cur_freq value remains stuck at (or, because of turbo boost, above) the maximum CPU frequency, as if the performance governor were set. I could get the same frequency readings with cpufreq-info, turbostat and i7z: the CPU frequency never drops below 2.2GHz.

Furthermore, no cpufreq-set setting seems able to lower the idle frequency in any way, even by changing governor, and a vanilla 3.6.1 kernel I compiled for testing purposes showed the same behavior. As soon as I reboot into a 3.5.5 kernel, cpuinfo_cur_freq begins to follow scaling_cur_freq just as expected, and the idle temperatures decrease by over 10°C.
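The mismatch described above, with scaling_cur_freq following the load while cpuinfo_cur_freq stays high, can be checked per CPU with a few lines of shell. This is a rough sketch only: the function names, the `CPU_ROOT` override and the 100000 kHz tolerance are assumptions chosen for illustration, and cpuinfo_cur_freq is normally readable by root only.

```shell
#!/bin/sh
# Hypothetical check: compare the frequency the governor asked for
# with the frequency the hardware reports. CPU_ROOT can be overridden
# for testing against a copy of the sysfs tree.
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

# freq_mismatch REQUESTED_KHZ ACTUAL_KHZ [TOLERANCE_KHZ]
# Succeeds when the actual frequency exceeds the requested one by more
# than the tolerance (default 100000 kHz, an arbitrary choice).
freq_mismatch() {
    req_khz=$1
    act_khz=$2
    tol_khz=${3:-100000}
    [ "$act_khz" -gt $((req_khz + tol_khz)) ]
}

# check_cpus
# Print one line per CPU that exposes both cpufreq files.
check_cpus() {
    for dir in "$CPU_ROOT"/cpu[0-9]*/cpufreq; do
        [ -r "$dir/scaling_cur_freq" ] || continue
        [ -r "$dir/cpuinfo_cur_freq" ] || continue
        requested=$(cat "$dir/scaling_cur_freq")
        actual=$(cat "$dir/cpuinfo_cur_freq")
        cpu=$(basename "$(dirname "$dir")")
        if freq_mismatch "$requested" "$actual"; then
            echo "$cpu: MISMATCH requested=$requested actual=$actual"
        else
            echo "$cpu: ok requested=$requested actual=$actual"
        fi
    done
}
```

Calling `check_cpus` as root on an idle system would list every CPU. A steady run of MISMATCH lines matches the symptom reported here, although a single sample can also catch a normal transition between frequencies.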