Bug 48721

Summary: cpufreq stuck at max frequency, suspend/resume, i915 may affect
Product: ACPI
Reporter: Maurizio Avogadro (mavoga)
Component: Power-Processor
Assignee: Len Brown (lenb)
Status: CLOSED DUPLICATE
Severity: normal
CC: 1007380, abhijeet.1989, alan, alejometal3, alex.shi, alga777, brendansellers, const, corin, e.at.chi.kaen, eduedix, flyser42, garkein, hendry, j.keil, jj, js314592, jvpgomes, kernel.org, kernel, kruegsch, lenb, loic, losinski, matthias, matti.niemenmaa+kernelbugs, mswal2846, nefthy-kernel.org, onorua, patrakov, pl4nkton, rjw, rjw, robert, rscrihf, rui.zhang, samuel-kbugs, tomka
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.6.0+
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: 3.5.5 kernel: output of 'grep -r . /sys/devices/system/cpu/ 2>/dev/null'
3.6.1 kernel: output of 'grep -r . /sys/devices/system/cpu/ 2>/dev/null'
3.6.1 kernel config
git bisect log
CPU and GPU working reliably for ~12h with this patch
Script to (possibly) provoke the bug

Description Maurizio Avogadro 2012-10-13 01:56:34 UTC
Created attachment 83121 [details]
3.5.5 kernel: output of 'grep -r . /sys/devices/system/cpu/ 2>/dev/null'

On my laptop (a Clevo W150HRM with an i7 2720QM CPU) I'm currently running Debian wheezy/sid with latest aptosid (3.6-1.slh.3) amd64 kernel: after switching from 3.5.5 to 3.6 kernel I noticed the idle temperatures suddenly increased by ~10°C with no apparent activity on the system.

The temperature increase, together with the sysfs scaling_cur_freq and cpuinfo_cur_freq values, seems to support my suspicion of a cpufreq bug: the ondemand governor no longer seems able to modulate the CPU clock. While the scaling_cur_freq value varies with system load as expected, the cpuinfo_cur_freq value remains stuck at (or, because of turbo boost, above) the maximum CPU frequency, just as if the performance governor were set.

I could get the same frequency readings with cpufreq-info, turbostat and i7z: the CPU frequency never decreases under 2.2GHz.

Furthermore, no cpufreq-set setting seems able to lower the idle frequency in any way, even by changing governor, and a test 3.6.1 vanilla kernel I compiled for testing purposes showed the same behavior.

As soon as I reboot to a 3.5.5 kernel, cpuinfo_cur_freq begins to follow scaling_cur_freq just as expected, and the idle temperatures drop back by over 10°C.
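
The comparison described above can be run mechanically against dumps like the attached 'grep -r . /sys/devices/system/cpu/' output. A minimal sketch, assuming each dump line has the form path:value (which is what grep -r prints) and assuming a CPU counts as "stuck" when cpuinfo_cur_freq is at or above cpuinfo_max_freq while scaling_cur_freq is below it. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as
#   /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:800000
LINE_RE = re.compile(r"/cpu/(cpu\d+)/cpufreq/(\w+):(\d+)\s*$")


def parse_cpufreq_dump(text):
    """Collect the single-number cpufreq attributes (kHz) per CPU from a grep dump."""
    cpus = defaultdict(dict)
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            cpu, attr, value = match.groups()
            cpus[cpu][attr] = int(value)
    return dict(cpus)


def stuck_cpus(cpus):
    """Return CPUs whose hardware frequency sits at the maximum while the
    frequency requested by the governor is lower (the assumed heuristic)."""
    stuck = []
    for cpu, attrs in sorted(cpus.items()):
        try:
            hw_cur = attrs["cpuinfo_cur_freq"]
            requested = attrs["scaling_cur_freq"]
            hw_max = attrs["cpuinfo_max_freq"]
        except KeyError:
            continue  # attribute missing, e.g. not readable without root
        if hw_cur >= hw_max and requested < hw_max:
            stuck.append(cpu)
    return stuck
```

Feeding the 3.5.5 and 3.6.1 dumps through parse_cpufreq_dump() and stuck_cpus() should make the difference between the two kernels visible at a glance, if the heuristic holds.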
Comment 1 Maurizio Avogadro 2012-10-13 01:57:21 UTC
Created attachment 83131 [details]
 3.6.1 kernel: output of 'grep -r . /sys/devices/system/cpu/ 2>/dev/null'
Comment 2 Maurizio Avogadro 2012-10-13 02:29:44 UTC
Created attachment 83151 [details]
3.6.1 kernel config
Comment 3 onorua 2012-10-13 07:17:29 UTC
I confirm the same behavior on 3.6.2 as well.
Comment 4 Maurizio Avogadro 2012-10-15 00:40:40 UTC
Bisecting is somewhat complicated because on rare occasions the CPU frequency modulates correctly for some time after a reboot.
Comment 5 Maurizio Avogadro 2012-10-15 22:58:16 UTC
The behavior is actually totally unpredictable: sometimes the laptop starts, everything goes fine and CPU clock modulates. Bisecting is very hard if not impossible without first identifying what triggers the bug, and led me to nonsensical results.

Furthermore, I experience the very same issue as #48791: the GPU can't reach the rc6 states (source: powertop). That happens randomly, even after a regular boot, and it seems to happen exactly when the CPU clock is stuck. I believe these bugs are related.
Comment 6 Marco Krüger 2012-10-16 11:54:02 UTC
Created attachment 83601 [details]
git bisect log

I can confirm the behavior and I did a git bisect.
The result is, that 
"drm/i915: fix forcewake related hangs on snb" - 6af2d180f82151cf3d58952e35a4f96e45bc453a
 is the bad commit.

It also matches my hardware specs: i7-2640M with Sandy Bridge GPU in use. Can somebody check this against the last good commit in the attached git bisect log?
Comment 7 Maurizio Avogadro 2012-10-16 23:44:52 UTC
Built git checkout 9978cf5 ("i915: Remove silly test"): the CPU clock is stuck at 2.2GHz at first reboot.
Comment 8 William Giokas 2012-10-17 00:41:46 UTC
Confirming this behavior on an Intel Core i5-2520M running Arch Linux, Intel 3000 GPU used. I will be glad to attach any information that is needed.
Comment 9 Marco Krüger 2012-10-17 08:31:23 UTC
(In reply to comment #7)
> Built git checkout 9978cf5 ("i915: Remove silly test"): the CPU clock is
> stuck
> at 2.2GHz at first reboot.

That is strange. I tested that commit again and I can not force the buggy behavior.
BTW - I used a wattmeter to check the state without the need to login. That allowed me a high reboot frequency for testing.
Comment 10 j.keil 2012-10-18 19:43:31 UTC
Hello,

I did a bisect myself (although not with a power meter..) and came across the same commit as Marco. I ran into a few more hurdles, since some versions simply broke while booting (some would boot on the second or third attempt, others not at all).

However, in addition I tried resume/suspend for every commit. Some commits would boot up fine with frequency scaling working. But after suspend/resume frequency scaling was broken too.

I did a quick search for "sandy bridge forcewake" and I found this (among other mails):
http://www.gossamer-threads.com/lists/linux/kernel/1405844

I've built v3.6.2 with 6af2d180f82151cf3d58952e35a4f96e45bc453a reverted. Behaviour as above. Upon initial boot, frequency scaling is broken, after suspend/resume frequency scaling is working.

I fear that this snake pit goes even deeper. Hopefully someone with more insight can shed some light here..
Comment 11 Marco Krüger 2012-10-18 20:18:33 UTC
I guess it is related to the bug [1], that keeps the CPU busy. 

[1] - https://bugs.freedesktop.org/show_bug.cgi?id=54089
Comment 12 j.keil 2012-10-19 16:08:10 UTC
Hi,

I tried out several patches against stable/v3.6.2:

* Reverting 6af2d180f82151cf3d58952e35a4f96e45bc453a: result like in comment #10.
* This [1] patch + reverting 6af2d180: upon boot frequency scaling is working. However, after some suspend/resume cycles frequency scaling is broken again.
* All commits from drm-intel-nightly from [2]: no result.

Hope this helps.

[1] https://bugs.freedesktop.org/show_bug.cgi?id=54089
[2] git://people.freedesktop.org/~danvet/drm-intel
Comment 13 j.keil 2012-10-23 07:39:25 UTC
Some more observations from me. First off, another bisect:

git bisect start
# good: [73b6448a7705298b2b10367a50fd063b27cdbeb8] Linux 3.5.6
git bisect good 73b6448a7705298b2b10367a50fd063b27cdbeb8
# bad: [6af2d180f82151cf3d58952e35a4f96e45bc453a] drm/i915: fix forcewake related hangs on snb
git bisect bad 6af2d180f82151cf3d58952e35a4f96e45bc453a
# good: [6b16351acbd415e66ba16bf7d473ece1574cf0bc] Linux 3.5-rc4
git bisect good 6b16351acbd415e66ba16bf7d473ece1574cf0bc
# bad: [930ebb462422117e12b85bb5fd6548ed13d0afb5] drm/i915: fix up ilk rc6 disabling confusion
git bisect bad 930ebb462422117e12b85bb5fd6548ed13d0afb5
# good: [e055684168af48ac7deb27d7267046a0fb9ef80e] drm/i915: context switch implementation
git bisect good e055684168af48ac7deb27d7267046a0fb9ef80e
# good: [4a87d65d54ffc76e1d6f7e2124354997b66bd81c] drm/i915: add HDMI and DP port enumeration on ValleyView
git bisect good 4a87d65d54ffc76e1d6f7e2124354997b66bd81c
# bad: [e86fe0d31722090e3bb72a3e8aab70f07e2c6b77] drm/i915: mask tiled bit when updating IVB sprites
git bisect bad e86fe0d31722090e3bb72a3e8aab70f07e2c6b77
# good: [ff049b6ce21d2814451afd4a116d001712e0116b] drm/i915: bind driver to ValleyView chipsets
git bisect good ff049b6ce21d2814451afd4a116d001712e0116b
# good: [79f5b2c7599270aa3dcfffb445f8f520c05a7fc5] drm/i915: make enable/disable_gt_powersave locking consistent
git bisect good 79f5b2c7599270aa3dcfffb445f8f520c05a7fc5
# bad: [01a06850fb45ace55ed67d1d9da2df553a041e40] drm/i915: disable drm agp support for !gen3 with kms enabled
git bisect bad 01a06850fb45ace55ed67d1d9da2df553a041e40
# bad: [87207ca20eeb519aa0333b754db9cf3c369ea6f7] drm/i915: don't use dev->agp
git bisect bad 87207ca20eeb519aa0333b754db9cf3c369ea6f7

As you can see, I ended up with a completely different commit. Currently I'm running 3.4.9, which has not yet shown the problem. Since I experienced the problem with 3.5.6 too, I suspect the error lies somewhere between 3.4.9 and 3.5.9.
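
To compare this log with the one attached in comment #6, it can help to reduce each log to its verdicts. A minimal sketch, assuming the standard 'git bisect log' layout shown above ('git bisect good|bad <sha>' lines, preceded by '# good:' / '# bad:' comment lines that carry the commit subject). The function names are hypothetical.

```python
import re

VERDICT_RE = re.compile(r"^git bisect (good|bad) ([0-9a-f]{7,40})\s*$")
COMMENT_RE = re.compile(r"^# (good|bad): \[([0-9a-f]{7,40})\] (.*)$")


def parse_bisect_log(text):
    """Return a list of (verdict, sha, subject) tuples in log order."""
    subjects = {}
    verdicts = []
    for line in text.splitlines():
        line = line.strip()
        comment = COMMENT_RE.match(line)
        if comment:
            # Remember the subject so it can be attached to the verdict line.
            subjects[comment.group(2)] = comment.group(3)
            continue
        verdict = VERDICT_RE.match(line)
        if verdict:
            sha = verdict.group(2)
            verdicts.append((verdict.group(1), sha, subjects.get(sha, "")))
    return verdicts


def last_bad(verdicts):
    """The most recently marked bad commit, i.e. the current first-bad candidate."""
    bad = [entry for entry in verdicts if entry[0] == "bad"]
    return bad[-1] if bad else None
```

Running both logs through parse_bisect_log() and diffing the results shows where the two bisects diverged, which is where a flaky "good" verdict is most likely hiding.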

It is interesting though, that some versions in between would work well whereas others seem to be broken.
Most often I came across AGP (which was refactored at some point) and forcewake. Maybe the culprit's here.

I also reverted 990bbdadabaa51828e475eda86ee5720a4910cc3 (which is mentioned in 6af2d180f82151cf3d58952e35a4f96e45bc453a) but that left me with a non-working graphics card.
Comment 14 Maurizio Avogadro 2012-10-23 12:34:05 UTC
After ~12h of testing and normal use with some suspends and resumes, it seems that with commit 6af2d180f82151cf3d58952e35a4f96e45bc453a reverted [1] and an msleep added as mentioned in [2], both CPU frequency scaling and the GPU rc6 state work reliably as expected. At least in my case...

Attaching a patch for clarity.

I should also try applying [2] only in order to confirm both changes are needed...

[1] see comment #10 here
[2] https://bugs.freedesktop.org/show_bug.cgi?id=54089#c24 (Comment #24)
Comment 15 Maurizio Avogadro 2012-10-23 12:36:47 UTC
Created attachment 84431 [details]
CPU and GPU working reliably for ~12h with this patch
Comment 16 Marco Krüger 2012-10-25 11:51:14 UTC
(In reply to comment #14)
> After ~12h test and normal use with some suspends and resumes it seems that
> reverting the commit 6af2d180f82151cf3d58952e35a4f96e45bc453a [1] and adding
> a
> msleep as mentioned in [2] it seems that both CPU frequency scaling and GPU
> rc6
> state are working reliably as expected. At least in my case...
> 
> Attaching a patch for clarity.
> 
> I should also try applying [2] only in order to confirm both changes are
> needed...
> 
> [1] see comment #10 here
> [2] https://bugs.freedesktop.org/show_bug.cgi?id=54089#c24 (Comment #24)

When I revert and/or patch from v3.6.3 neither of them fixes the bug. My last "good" bisect commit was in v3.5-rc4 and that worked for me.
Comment 17 Maurizio Avogadro 2012-10-25 15:18:16 UTC
In my case, even with the patch proposed, the issue appears at boot randomly, say every 5-6 boots, and always disappeared after a suspend/resume cycle until now.

Bisecting would be a PITA under these circumstances: I'm unable to reliably reproduce the bug, I can only tell that graphics activity seems irrelevant and that it happens exclusively on boot.
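
To put a number on why bisecting is painful here: if the bug shows up on roughly one boot in five or six, a single clean boot says little about a commit. A minimal sketch, assuming independent boots with a fixed per-boot probability of hitting the bug (an idealisation, since the actual trigger is unknown), of how many consecutive clean boots are needed before calling a commit "good" with a chosen confidence:

```python
import math


def clean_boots_needed(p_bug, confidence=0.95):
    """Smallest n such that a bad commit would have shown the bug at least once
    in n boots with probability >= confidence, i.e. (1 - p_bug)**n <= 1 - confidence."""
    if not 0.0 < p_bug < 1.0:
        raise ValueError("p_bug must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_bug))


print(clean_boots_needed(1 / 5))  # bug on one boot in five
print(clean_boots_needed(1 / 6))  # bug on one boot in six
```

Under these assumptions that comes to 14 and 17 clean boots per bisect step, which explains how a handful of lucky boots can send a bisect in the wrong direction.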

Does the driver rely on some value stored in NVRAM? Could this issue be related to some NVRAM data corruption? I also experienced a weird NVRAM corruption on this laptop, which required a complete reflash to solve, some time after installing the first 3.6 kernel.
Comment 18 mswal2846 2012-10-26 00:50:05 UTC
I've applied the patch to my Lenovo X220 FC17 3.6.3 system.  So far, I am able to successfully un-suspend without experiencing this problem.
Comment 19 mswal2846 2012-10-29 14:03:16 UTC
So I've suspended and come out of suspension several times since 10/26 without an issue, but this morning I came out of suspension and the problem reappeared, despite the patch application.

My CPU load at the time was low and top showed no runaway process, yet the temperature was climbing and the fan was running flat out to no avail.

Rebooting fresh has "reset" the problem...
Comment 20 Toralf Förster 2012-10-30 21:30:47 UTC
On a T420 it doesn't work; tested with kernel 3.6.4 on stable Gentoo.
Comment 21 Toralf Förster 2012-10-31 20:31:28 UTC
*** Bug 48791 has been marked as a duplicate of this bug. ***
Comment 22 Samuel Sieb 2012-11-08 18:31:54 UTC
I was seeing the same thing on an i5-2540M using 3.6.3.  I was wondering why the fan was always running so much, except (I think) on boot and sometimes after a resume.  I finally noticed that scaling_cur_freq != cpuinfo_cur_freq and found this bug report.  Yesterday I updated to the 3.6.6 kernel and even after several suspend/resume cycles it's still working properly.
Comment 23 Constantin Baranov 2012-11-08 18:53:35 UTC
(In reply to comment #22)
> Yesterday I updated to the 3.6.6 kernel and even
> after several suspend/resume cycles it's still working properly.
I also have updated and thought it's fixed. But after some more hours of usual work it has fallen back to improper behaviour.
Comment 24 Brendan 2012-11-10 03:36:29 UTC
I haven't tried any other kernels or patches, but I can confirm this behavior on my i7-2620M, currently running kernel release 3.6.6.
Comment 25 Samuel Sieb 2012-11-12 08:01:01 UTC
I can confirm it's still a problem with 3.6.6.  I rebooted again and did a suspend/resume, so one of those triggered it again.
Comment 26 Peter Schneider 2012-11-12 08:29:49 UTC
I can confirm the problem too. Having an i7-2620M SandyBridge running ArchLinux. Last working build for me was 3.5.6. All later releases up to 3.6.6 show the same problem.
Comment 27 Samuel Sieb 2012-11-12 08:45:47 UTC
I just did another suspend/resume cycle and it is corrected, so it seems very random.
Comment 28 eduedix 2012-11-12 09:06:38 UTC
I can also confirm exactly what #25 Samuel Sieb and #26 Peter Schneider say.
Comment 29 abhijeet.1989 2012-11-14 07:18:14 UTC
As #26 says, the problem is exactly the same. There are times when it works perfectly and sometimes it doesn't. 

So, when there is a problem, is there any way I can help by giving you guys some outputs which you can check against the times when there is no issue? I would be interested in helping in any way to get the bug solved. It has been months since it was produced and I really want this to go away.
Comment 30 Matthias Blaicher 2012-11-21 09:19:23 UTC
I'd like to point out one more observation:
Once the error occurs and the CPU frequency is fixed, this also applies to the Intel GPU:

> # cat /sys/kernel/debug/dri/0/i915_cur_delayinfo
> GT_PERF_STATUS: 0x00000d29
> RPSTAT1: 0x00048d00
> Render p-state ratio: 13
> Render p-state VID: 41
> Render p-state limit: 255
> CAGF: 650MHz
> RP CUR UP EI: 61329us
> RP CUR UP: 4us
> RP PREV UP: 0us
> RP CUR DOWN EI: 0us
> RP CUR DOWN: 0us
> RP PREV DOWN: 0us
> Lowest (RPN) frequency: 650MHz
> Nominal (RP1) frequency: 650MHz
> Max non-overclocked (RP0) frequency: 1300MHz
> 
> # cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
> 800000


> ############ After standby #################################
> 
> # cat /sys/kernel/debug/dri/0/i915_cur_delayinfo
> GT_PERF_STATUS: 0x00001ac6
> RPSTAT1: 0x00049a19
> Render p-state ratio: 26
> Render p-state VID: 198
> Render p-state limit: 255
> CAGF: 1300MHz
> RP CUR UP EI: 9142us
> RP CUR UP: 9142us
> RP PREV UP: 66000us
> RP CUR DOWN EI: 191074us
> RP CUR DOWN: 0us
> RP PREV DOWN: 0us
> Lowest (RPN) frequency: 650MHz
> Nominal (RP1) frequency: 650MHz
> Max non-overclocked (RP0) frequency: 1300MHz
> 
> # cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
> 2700000

Notice how CAGF is now at the maximum frequency. The GT_PERF_STATUS and RPSTAT1 values are also different, and reproducibly so. I have no idea what these mean though.
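
The two raw registers appear to encode the same numbers that are printed below them. A minimal sketch of the decoding, assuming the field layout that matches the dumps in this thread (GT_PERF_STATUS: ratio in bits 15:8, VID in bits 7:0; RPSTAT1: CAGF in bits 14:8, in units of 50 MHz). The layout is inferred from these values and should be checked against the i915 source before relying on it.

```python
GT_FREQ_STEP_MHZ = 50  # assumed GPU frequency granularity on these machines


def decode_gt_perf_status(value):
    """Split GT_PERF_STATUS into (render p-state ratio, render p-state VID)."""
    return (value >> 8) & 0xFF, value & 0xFF


def decode_cagf_mhz(rpstat1):
    """Extract the current actual GPU frequency (CAGF) in MHz from RPSTAT1."""
    return ((rpstat1 >> 8) & 0x7F) * GT_FREQ_STEP_MHZ


# Register values taken from the two dumps above.
print(decode_gt_perf_status(0x00000D29), decode_cagf_mhz(0x00048D00))
print(decode_gt_perf_status(0x00001AC6), decode_cagf_mhz(0x00049A19))
```

So the differing register values are just the raw form of "ratio 13 / 650MHz" versus "ratio 26 / 1300MHz"; they do not seem to carry additional information beyond the decoded lines.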

(I've posted the same info on the Arch forum and this behavior was confirmed by another user.)
Comment 31 Toralf Förster 2012-12-13 13:11:16 UTC
Upgrading from 3.4.9 to 3.4.10 seems to make that behaviour much worse than before on a stable x86 Gentoo system with the i915 module on a ThinkPad T420.
Comment 32 Alessio Gaeta 2012-12-30 14:54:25 UTC
Hello,
using 3.6 I get this behavior rarely (actually I didn't correlate it with a kernel problem either), while with 3.7 I get it consistently.

According to PowerTop, the power consumption doubles (from 13.5-15 W to >30 W), CPU cores and GPU work at maximum frequency, and statistics show heavy use of Turbo mode.

I can confirm that suspending and resuming the machine restores the correct frequency scaling *and* the "benefit" survives reboots. Just as an example:

> 1. Kernel 3.6
> ==================================================
> 
> GT_PERF_STATUS: 0x00000d29
> RPSTAT1: 0x00048d00
> Render p-state ratio: 13
> Render p-state VID: 41
> Render p-state limit: 255
> CAGF: 650MHz <==============
> RP CUR UP EI: 25483us
> RP CUR UP: 59us
> RP PREV UP: 0us
> RP CUR DOWN EI: 0us
> RP CUR DOWN: 0us
> RP PREV DOWN: 0us
> Lowest (RPN) frequency: 650MHz
> Nominal (RP1) frequency: 650MHz
> Max non-overclocked (RP0) frequency: 1100MHz
> 
> 
> 2. Kernel 3.7 (after poweroff and cold reboot)
> ==================================================
> GT_PERF_STATUS: 0x000016c3
> RPSTAT1: 0x00049614
> Render p-state ratio: 22
> Render p-state VID: 195
> Render p-state limit: 255
> CAGF: 1100MHz <==============
> RP CUR UP EI: 15388us
> RP CUR UP: 15388us
> RP PREV UP: 66000us
> RP CUR DOWN EI: 232477us
> RP CUR DOWN: 0us
> RP PREV DOWN: 0us
> Lowest (RPN) frequency: 650MHz
> Nominal (RP1) frequency: 650MHz
> Max non-overclocked (RP0) frequency: 1100MHz
> 
> 3. Kernel 3.7 (after suspend/resume)
> ==================================================
> GT_PERF_STATUS: 0x00000d29
> RPSTAT1: 0x00048d00
> Render p-state ratio: 13
> Render p-state VID: 41
> Render p-state limit: 255
> CAGF: 650MHz <==============
> RP CUR UP EI: 55593us
> RP CUR UP: 4us
> RP PREV UP: 0us
> RP CUR DOWN EI: 0us
> RP CUR DOWN: 0us
> RP PREV DOWN: 0us
> Lowest (RPN) frequency: 650MHz
> Nominal (RP1) frequency: 650MHz
> Max non-overclocked (RP0) frequency: 1100MHz
> 
> 4. Kernel 3.7 (after reboot without poweroff)
> ==================================================
> GT_PERF_STATUS: 0x00000d29
> RPSTAT1: 0x00040d00
> Render p-state ratio: 13
> Render p-state VID: 41
> Render p-state limit: 255
> CAGF: 650MHz <==============
> RP CUR UP EI: 2005us
> RP CUR UP: 6us
> RP PREV UP: 0us
> RP CUR DOWN EI: 0us
> RP CUR DOWN: 0us
> RP PREV DOWN: 0us
> Lowest (RPN) frequency: 650MHz
> Nominal (RP1) frequency: 650MHz
> Max non-overclocked (RP0) frequency: 1100MHz

I'm on an Asus X53Sv (AKA K53Sv) with BIOS rev. 320.
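
The four dumps above can be compared mechanically. A minimal sketch, assuming the "key: value" layout shown in this thread (optionally prefixed by the quote marker "> ") and assuming that "stuck" means CAGF sits at the RP0 frequency on an otherwise idle machine. The helper names are hypothetical.

```python
import re

# Optional quote marker, then "name: value"; trailing markers like "<====" are ignored.
FIELD_RE = re.compile(r"^(?:>\s*)?([^:<>]+):\s*(\S+)")


def parse_delayinfo(text):
    """Parse an i915_cur_delayinfo dump into a {field name: raw value} dict."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            fields[match.group(1).strip()] = match.group(2)
    return fields


def mhz(value):
    """Convert a value such as '650MHz' to an integer number of MHz."""
    if not value.endswith("MHz"):
        raise ValueError(f"not a MHz value: {value!r}")
    return int(value[:-3])


def gpu_stuck_at_max(fields):
    """True if the current GPU frequency is at (or above) the RP0 maximum."""
    cagf = mhz(fields["CAGF"])
    rp0 = mhz(fields["Max non-overclocked (RP0) frequency"])
    return cagf >= rp0
```

Applied to the dumps above, only case 2 (3.7 after a cold boot) would be flagged. A GPU legitimately running at RP0 under load would be flagged too, so the check is only meaningful on an idle system.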
Comment 33 Toralf Förster 2012-12-30 16:11:06 UTC
I'm now running this script after each resume / boot cycle on my ThinkPad T420 (i5 + Intel i915 graphics):

$ cat check_rc6.sh
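#!/bin/bash
#
# Report (or act) when the GPU spent no time in RC6 during a 3 s window.
# Usage: check_rc6.sh [ACTION]   (ACTION is a command to run on failure)
# Note: [[ ]] is a bash construct, hence bash rather than plain sh.

ACTION=$1

# Cumulative RC6pp residency counter, in milliseconds
S=/sys/class/drm/card0/power/rc6pp_residency_ms

B=$(cat "$S")
sleep 3
A=$(cat "$S")

# An unchanged counter means the state was never entered in between
if [[ $A -eq $B ]]; then
        if [[ -n "$ACTION" ]]; then
                $ACTION
        else
                echo "RC6 issue"
                aplay /usr/share/sounds/pop.wav
        fi
fi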
Comment 34 Len Brown 2013-01-15 01:14:46 UTC
The patch mentioned above from bug
https://bugs.freedesktop.org/show_bug.cgi?id=54089
shipped in Linux-3.8-rc

Please report if the issues reported in this bug
report can be reproduced in Linux-3.8-rc
Comment 35 Maurizio Avogadro 2013-01-16 16:09:24 UTC
(In reply to comment #34)
> The patch mentioned above from bug
> https://bugs.freedesktop.org/show_bug.cgi?id=54089
> shipped in Linux-3.8-rc
> 
> Please report if the issues reported in this bug
> report can be reproduced in Linux-3.8-rc

Linux-3.8.0-rc3 seems to fix the issue for me.

Thanks
Comment 36 Thomas Kahle 2013-01-17 20:19:12 UTC
Bug 48791 has been marked as a duplicate of this one, but I still see it with Linux-3.8.0-rc3.  The issue is that after some resumes from suspend (maybe 20%?) the GPU can't reach the RC6 state anymore leading to heat and power drain.  This is on a Thinkpad X220.  Should I reopen Bug 48791?
Comment 37 Toralf Förster 2013-01-17 20:23:58 UTC
(In reply to comment #36)
> This is on a Thinkpad X220.  Should I reopen Bug 48791?
yes - of course
Comment 38 mswal2846 2013-01-19 17:59:05 UTC
I, like several others, have downloaded the 3.8.0 rc3 kernel from http://www.kernel.org/, compiled it using these directions  http://www.howopensource.com/2011/08/how-to-install-compile-linux-kernel-3-0-in-fedora-15-and-14/ (making sure to replace the kernel name) and it is working fine for me .... so great to have suspend/resume working again!
Comment 39 Alejandro Rodriguez 2013-01-24 14:15:41 UTC
I confirm the same behavior described in the initial description on 3.7.4-1-ARCH #1 SMP PREEMPT Mon Jan 21 23:05:29 CET 2013 x86_64 GNU/Linux.

My machine is an Asus Laptop N53SV Intel Core i5 2430M @ 2.4 GHz, Using Arch Linux up-to-date and no added parameters on kernel boot line.

If you need further info, I'm glad to post it.
Comment 40 Thomas Kahle 2013-01-24 16:24:32 UTC
(In reply to comment #39)
> I confirm the same bahavior described in initial description on 3.7.4-1-ARCH
> #1
> SMP PREEMPT Mon Jan 21 23:05:29 CET 2013 x86_64 GNU/Linux.

This bug here is only fixed in the 3.8 series.  The remaining discussion is about the seemingly independent bug 48791 which is not fixed yet.
Comment 41 Kai Hendry 2013-02-05 15:42:06 UTC
I think this bug is related to https://bugzilla.kernel.org/show_bug.cgi?id=52411
I don't know how to mark this bug as being related.
Comment 42 Kai Hendry 2013-02-15 01:50:15 UTC
Since status is NEEDINFO, related Archlinux bugs are:
https://bugs.archlinux.org/task/33810 & https://bugs.archlinux.org/task/32025

And how about extra info like "this stable series has ruined my life"? ;) TMI, I guess. ;) I did quickly try 3.8-rc7 and it seemed to run cool. http://stats.webconverger.org/x220/temp/043.png

I assume we wait for 3.8 to drop and then I'll try to forget about the 3.7.x nightmare.
Comment 43 Roy Crihfield 2013-03-16 17:03:41 UTC
I have noticed this again at least twice after resuming from suspend on 3.8.2-1-ARCH #1 SMP PREEMPT Mon Mar 4 09:06:43 CET 2013 x86_64 GNU/Linux. However it is not regular as it was for me on 3.6-3.7.x.

Machine is a Sony VAIO VPCEG34FX.
Comment 44 onorua 2013-03-18 10:07:42 UTC
With 3.8.3 the problem became almost constant on a ThinkPad X220.
Linux x220 3.8.3 #1 SMP PREEMPT Sun Mar 17 13:46:24 EET 2013 x86_64 GNU/Linux

With 3.8.1 and even 3.8.2 the situation was much better. I don't know what is going on. What information is needed? How can I help?
Comment 45 mswal2846 2013-03-18 13:18:48 UTC
Yes, I also experienced this on 3.8.3 after unsuspending.  However, I resuspended and unsuspended and it "went away."  So it is not consistent for me.  I am also running a Lenovo X220, on Fedora 18 with kernel 3.8.3-201.fc18.x86_64.
Comment 46 William Giokas 2013-03-19 23:11:14 UTC
Running 3.9-rc3, and I had this issue today.

Linux version 3.9.0-1-rc3-mainline-dirty (wgiokas@wst420) (gcc version 4.7.2 (GCC) ) #1 SMP PREEMPT Sun Mar 17 20:13:34 CDT 2013

It does seem to be better than 3.6 and 3.7, but it is still not fixed. The only thing I can see blocking this is that the bug is very hard to reproduce. If it weren't, I would be bisecting, but right now it is almost impossible to tell a good commit from a bad one.
Comment 47 Zhang Rui 2013-04-17 13:49:48 UTC
Hi, guys,

can you please verify if this is a duplicate of bug 52411?
please check if the problem still exists in the latest upstream kernel.
Comment 48 Maurizio Avogadro 2013-04-17 23:17:37 UTC
Hi Zhang

the symptoms described in bug 52411 are indeed the same; nevertheless, I have occasionally (but quite rarely) experienced the issue after resuming my laptop even with 3.8: if I'm not mistaken, the last time it happened was some weeks ago with 3.8.5.

But please bear in mind that most of the time I simply shut my laptop down with shutdown -h now, therefore I can't provide much information.

Thanks
Comment 49 João Gomes 2013-04-20 11:20:51 UTC
I tried kernel 3.8.8 and it seems that the problem is still present.
Besides happening (apparently) randomly when the laptop resumes from suspend, I also noticed that if I turn the laptop on and take some time to log in, it ends up in the same condition, with the GPU not going to rc6.
Comment 50 Zhang Rui 2013-04-21 14:31:15 UTC
Hi, alex,

can you please check if this is the problem you meet?
Comment 51 Anonymous Emailer 2013-04-21 15:40:14 UTC
Reply-To: daniel.dorau@gmx.de

For my understanding: is the GPU not going to rc6 the same as all CPUs staying at their max frequencies after resume? The latter was what I thought this issue was about and what I experienced after the 3.5.7 to 3.6 upgrade (definitely introduced with 3.6). The issue has become increasingly rare and I haven't experienced it for a while now.

After reading the other bug's description I'm not sure if they are the same, although they are probably related.

Comment 52 Alex Shi 2013-04-22 00:39:55 UTC
(In reply to comment #50)
> Hi, alex,
> 
> can you please check if this is the problem you meet?

No, this is a different problem from the one I encountered. In my case the CPU frequency can be changed; it is just an ondemand governor behaviour issue.
Comment 53 Len Brown 2013-06-04 00:41:31 UTC
Do all these systems have i915 graphics?
Comment 54 Samuel Sieb 2013-06-04 04:37:54 UTC
I have an i915.  I'll check that /sys/kernel/debug/dri/0/i915_cur_delayinfo file each resume to see if there's a correlation for me as well.
Comment 55 Anonymous Emailer 2013-06-04 06:59:31 UTC
Reply-To: daniel.dorau@gmx.de

As far as I'm concerned, yes. I didn't notice the issue on more recent kernels (3.9.x) though.
Comment 56 Samuel Sieb 2013-06-05 02:55:47 UTC
It's definitely correlated with the values in i915_cur_delayinfo.

Typical values when stuck:
CAGF: 1300MHz
RP CUR UP EI: 49915us
RP CUR UP: 49915us
RP PREV UP: 66000us
RP CUR DOWN EI: 189311us
RP CUR DOWN: 0us
RP PREV DOWN: 300000us

Typical values when normal:
CAGF: 650MHz
RP CUR UP EI: 33251us
RP CUR UP: 99us
RP PREV UP: 0us
RP CUR DOWN EI: 0us
RP CUR DOWN: 0us
RP PREV DOWN: 0us
Comment 57 Rafael J. Wysocki 2013-06-09 22:44:52 UTC
Can you please check if the symptoms are reproducible with the (new) intel_pstate driver instead of acpi-cpufreq?
Comment 58 Samuel Sieb 2013-06-09 23:25:01 UTC
Could you explain how I would do that?  Which version of the kernel is needed and what steps to take?
Comment 59 Jonas Jelten 2013-06-09 23:49:34 UTC
I was just hit today using intel_pstate, and resolved it with another suspend-resume cycle.
Thinkpad X220t Funtoo GNU/Linux
Linux 3.10.0-rc4-JJ #1 SMP PREEMPT Wed Jun 5 16:51:49 CEST 2013 x86_64 Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz GenuineIntel GNU/Linux

# cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
...
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 800 MHz and 3.20 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
Comment 60 Maurizio Avogadro 2013-06-10 07:27:03 UTC
(In reply to comment #57)
> Can you please check if the symptoms are reproducible with the (new)
> intel_pstate driver instead of acpi-cpufreq?

I have not noticed it so far with kernel 3.9 and the intel_pstate driver (approx. 2 weeks); I will report if it happens.
Comment 61 Rafael J. Wysocki 2013-06-10 11:43:14 UTC
(In reply to comment #58)
> Could you explain how I would do that?  Which version of the kernel is needed
> and what steps to take?

You need 3.9 or later, but better use 3.10-rc5.  Select CONFIG_X86_INTEL_PSTATE, build the kernel and install it.
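
Before testing, it may be worth confirming that the option actually made it into the build. A minimal sketch, assuming the usual kernel .config syntax (CONFIG_FOO=y, or "# CONFIG_FOO is not set" for disabled options). The function name is hypothetical.

```python
def config_option_state(config_text, option="CONFIG_X86_INTEL_PSTATE"):
    """Return the assigned value ('y', 'm', ...) or None if the option is not set."""
    for line in config_text.splitlines():
        line = line.strip()
        # "# CONFIG_FOO is not set" lines start with '#', so they never match.
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None
```

At run time, /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver should then read intel_pstate rather than acpi-cpufreq.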
Comment 62 Rafael J. Wysocki 2013-06-10 11:44:34 UTC
(In reply to comment #59)
> i was just hit today using intel_pstate, resolved it by another
> suspend-resume
> cycle.

Are you ever able to see the problem after a sequence like this:

# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
# echo mem > /sys/power/state
# echo 0 > /sys/devices/system/cpu/intel_pstate/no_turbo

(that is, turn turbo off, suspend/resume, turn turbo on)?
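
For repeated runs, the sequence can be scripted. A minimal sketch in Python, assuming root privileges and the sysfs paths quoted above; write_sysfs and turbo_suspend_cycle are hypothetical helper names, and the writer is injectable so that the step order can be checked without actually suspending.

```python
NO_TURBO = "/sys/devices/system/cpu/intel_pstate/no_turbo"
POWER_STATE = "/sys/power/state"

# Same order as the shell sequence: turbo off, suspend/resume, turbo on.
STEPS = [(NO_TURBO, "1"), (POWER_STATE, "mem"), (NO_TURBO, "0")]


def write_sysfs(path, value):
    """Write a value to a sysfs attribute (needs root on a real system)."""
    with open(path, "w") as handle:
        handle.write(value)


def turbo_suspend_cycle(write=write_sysfs):
    """Run the steps in order; the write to /sys/power/state returns after resume."""
    for path, value in STEPS:
        write(path, value)
```

Calling turbo_suspend_cycle() with the default writer suspends the machine, so it is best run from a terminal that survives the suspend.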
Comment 63 Matthias Blaicher 2013-06-10 16:25:28 UTC
Created attachment 104201 [details]
Script to (possibly) provoke the bug

New kernels seem to work a lot better but I ran into the same troubles with 3.9.5 and pstate activated several days ago. 

I've just written a hacky bash script to provoke the issue - unfortunately it did not yet provoke the bug.

Maybe somebody has more luck? The script suspends automatically several times and tries to detect the stuck CPU and/or GPU clock. "true" means that the bug was detected during that standby cycle. You may need to install rtcwake.

Sample output looks like this:

> -> # bash suspend_bug_detect.sh                     
> Detected CPUs are: cpu0 cpu1 cpu2 cpu3
> 
> Frequency scaling information:
> analyzing CPU 0:
>   driver: intel_pstate
>   CPUs which run at the same hardware frequency: 0
>   CPUs which need to have their frequency coordinated by software: 0
>   maximum transition latency: 0.97 ms.
>   hardware limits: 800 MHz - 3.40 GHz
>   available cpufreq governors: performance, powersave
>   current policy: frequency should be within 800 MHz and 3.40 GHz.
>                   The governor "powersave" may decide which speed to use
>                   within this range.
>   current CPU frequency is 3.13 GHz (asserted by call to hardware).
>   boost state support:
>     Supported: yes
>     Active: yes
>     25500 MHz max turbo 4 active cores
>     25500 MHz max turbo 3 active cores
>     25500 MHz max turbo 2 active cores
>     25500 MHz max turbo 1 active cores
> Intel GPU info:
> Maximum Frequency:  1300MHz
> 
> Trying to provoke bug 48721 by suspending...
> 
> run,    gpu,    cpu0,   cpu1,   cpu2,   cpu3,
> 1,      false,  false,  false,  false,  false,
> 2,      false,  false,  false,  false,  false,
> 3,      false,  false,  false,  false,  false,
> 4,      false,  false,  false,  false,  false,
> 5,      false,  false,  false,  false,  false,
Comment 64 Samuel Sieb 2013-06-27 16:49:15 UTC
Another interesting datapoint.  I've been running my laptop with the scaling_governor set to "powersave", so the CPU frequency is locked to the lowest level.  I've had a few times when the hardware wouldn't resume, not sure if that's related, but it never happened before this. More interesting is that this last time I resumed, the CPU frequency was still low as it should be, but the result from i915_cur_delayinfo was stuck high as described earlier in this bug.  I had to suspend and resume again to get it back to normal.
Comment 65 Rafael J. Wysocki 2013-07-17 11:31:37 UTC
Can people please verify if the problem is still reproducible with this patch:

https://bugzilla.kernel.org/attachment.cgi?id=106891&action=diff

from bug #58971 applied?
Comment 66 Rafael J. Wysocki 2013-08-06 22:54:50 UTC
Any results anyone?
Comment 67 Alexander E. Patrakov 2013-08-07 05:26:56 UTC
The patch definitely helps against the frequency being stuck at the maximum. However, it is not a full solution to the power consumption problem, as explained below.

The affected laptop is a Sony VAIO VPCZ23A4R. Without the patch, there is some chance that on boot the CPU will be stuck at 3.5 GHz and the GPU will be stuck in the powered-on state even if I set the minimum and maximum frequency to 800 MHz using cpufreq-set for all CPUs. A suspend-resume cycle usually helps.

With the patch, I have not yet seen the "CPU stuck at 3.5 GHz and GPU is 100% active" bug. However, now it is stuck in 2-2.5 GHz range (i.e. does not go below 2 GHz) and the GPU is in RC6 only for 1-5% of all the time, with the workload being just Chromium with GMail and some bugzillas open. The CPU frequency does obey cpufreq-set, but this does not affect RC6 residency. So not as bad, but still bad. A suspend-resume cycle helps.
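
The RC6 share quoted here can also be measured without powertop. A minimal sketch, assuming the i915 residency counter at /sys/class/drm/card0/power/rc6_residency_ms (a sibling of the rc6pp file used in comment #33; the card index may differ) and assuming that the counter is a monotonically increasing number of milliseconds:

```python
import time

RC6_COUNTER = "/sys/class/drm/card0/power/rc6_residency_ms"  # assumed path


def residency_percent(before_ms, after_ms, interval_s):
    """Share of the interval spent in RC6, given two counter readings."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return 100.0 * (after_ms - before_ms) / (interval_s * 1000.0)


def measure_rc6(interval_s=3.0, path=RC6_COUNTER):
    """Sample the counter twice and return the RC6 residency in percent."""
    with open(path) as handle:
        before = int(handle.read())
    start = time.monotonic()
    time.sleep(interval_s)
    elapsed = time.monotonic() - start
    with open(path) as handle:
        after = int(handle.read())
    return residency_percent(before, after, elapsed)
```

A reading that stays in the low single digits on an idle desktop would match the 1-5% figure described above, while 0% matches the "GPU never reaches RC6" case from bug 48791.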
Comment 68 Alexander E. Patrakov 2013-08-07 05:42:58 UTC
Actually, when I close Chromium, the CPU frequency goes down to ~900 MHz and there appears RC6. However it still looks like a bug that just typing emails and bugs in Chromium raises the power consumption from 10.5 W (previous kernels could go as low as 8W) to 15-16W.
Comment 69 Maurizio Avogadro 2013-08-07 08:46:55 UTC
Here the issue seems gone since kernel 3.9 and intel_pstate. I only noticed a slight temperature increase since the new scaling driver was introduced, but neither the graphics nor the CPU clock got stuck anymore as far as I can see.
Comment 70 Rafael J. Wysocki 2013-08-07 10:57:36 UTC
(In reply to Maurizio Avogadro from comment #69)
> Here the issue seems gone since kernel 3.9 and intel_pstate. I only noticed
> a slight temperature increase since the new scaling driver was introduced,
> but neither the graphics nor the CPU clock got stuck anymore as far as I can
> see.

Awesome, thanks!
Comment 71 Rafael J. Wysocki 2013-08-07 11:00:51 UTC
(In reply to Alexander E. Patrakov from comment #67)
> The patch definitely helps against the frequency stuck at the maximum.
> However, it is not a full solution to the power consumption problem, as
> explained below.

This particular bug entry is not about the "power consumption problem" (anyway, you can't consume power, but energy), only about the CPU frequency stuck at the maximum.  So if the patch helps with that, I'm inclined to consider it as resolved.
Comment 72 Alexander E. Patrakov 2013-08-07 12:11:35 UTC
(In reply to Rafael J. Wysocki from comment #71)
> This particular bug entry is not about the "power consumption problem"
> (anyway, you can't consume power, but energy), only about the CPU frequency
> stuck at the maximum.  So if the patch helps with that, I'm inclined to
> consider it as resolved.

Well, if you understand why the calls to gen6_gt_force_wake_get() are necessary, then let's treat the bug as fixed. However, my opinion is not that certain, because of the following (please note that it is my purely subjective opinion based on black-box observations):

 * I still have extreme sensitivity of the CPU frequency and GPU RC6 occupancy to the load.
 * In physics and in signal processing, excessive sensitivity and instability are closely related notions if there is a positive feedback loop somewhere.
 * Lockup in one of the extreme states (like max frequency here) is one of the manifestations of instability of the intended state.

If you are absolutely sure that there are no feedback loops that can become positive in the code that changes the power states, please close the bug as fixed.
Comment 73 Rafael J. Wysocki 2013-08-07 23:01:36 UTC
(In reply to Alexander E. Patrakov from comment #72)
> Well, if you understand why the calls to gen6_gt_force_wake_get() are
> necessary, then let's treat the bug as fixed.

Please refer to bug #58971 for that.

> If you are absolutely sure that there are no feedback loops that can become
> positive in the code that changes the power states, please close the bug as
> fixed.

I'm not absolutely sure about anything, but this particular bug entry is about a specific kind of behavior that either can or cannot be observed.  If you don't observe it any more (and no one else does, for that matter), I think it should be closed, because we can't possibly make any progress here in that case (it's kind of difficult to fix a bug that no one is able to reproduce).
Comment 74 Alexander E. Patrakov 2013-08-08 05:35:39 UTC
I have read the related bug entry, and I am not convinced that this bug has been root-caused. But if you are comfortable with other people opening follow-up bugs that may share the same root cause - yes, sure, close this one.
Comment 75 Alexander E. Patrakov 2013-08-08 07:12:18 UTC
After rereading the above, I understand that my arguments can also be used against myself, as I am also not sure that the extreme sensitivity of the consumed power to the load is related to the root cause of this bug, whatever it is. So, let's just close this bug :)
Comment 76 Rafael J. Wysocki 2013-09-10 00:42:46 UTC
I'm marking this one as a duplicate of bug #58971, because both the time frame it appeared in and the visible symptoms indicate at least some dependency between these bugs.

*** This bug has been marked as a duplicate of bug 58971 ***