Bug 58971

Summary: High temperature after resuming from suspend to RAM (system idle).
Product: Drivers
Reporter: Alexander Kaltsas (alexkaltsas)
Component: Video(DRI - Intel)
Assignee: intel-gfx-bugs (intel-gfx-bugs)
Status: RESOLVED CODE_FIX
Severity: high
CC: alexkaltsas, anarsoul, chris, daniel, david.pretty, dirk.brandewie, hendry, intel-gfx-bugs, james.ravn, jbarnes, jj, johnmbryant, kernel.org, kernel, mariusz.libera, matej, mavoga, mpessas+bugs, pmunksgaard, RayFredPip, rjw, rockorequin, sudhir, tianyu.lan, tinotom, uwe.sommerlatt, v.antonovich, viresh.kumar, wcang79, xdever, zoku88
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 3.7+
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: System informations. lspci, ver_linux, cpuinfo etc.
acpidump output.
acpidump
My /proc/cpuinfo
Patch with msleep(100)
My /proc/cpuinfo
Patch with msleep(100)
set min GPU frequency when idle
dmesg of failed resume with initcall_debug
intel_pm.c patch against linux 3.10.1
i915 error state

Description Alexander Kaltsas 2013-05-29 17:34:43 UTC
Created attachment 102851 [details]
System informations. lspci, ver_linux, cpuinfo etc.

With kernel 3.9.x, after resuming from suspend to RAM, the temperature when the system is idle is more than 10 degrees higher than before suspending. The last kernel working without this issue is 3.8.7.

There doesn't seem to be high CPU usage (~5%).

It drains the battery really quickly.

Things I tried without solving the problem:

-- I removed every module that could be removed (rmmod). I couldn't remove i915. I guess KMS prevented it even when X was down.

-- I compiled kernels 3.9.2 and 3.9.4 with CONFIG_X86_INTEL_PSTATE disabled.

-- I used the intel_pstate=disable option.


I am attaching a file with my system information.

Related links:

https://bbs.archlinux.org/viewtopic.php?id=150743
https://bbs.archlinux.org/viewtopic.php?id=163854
Comment 1 Alexander Kaltsas 2013-05-29 17:42:07 UTC
Just a correction. Last good kernel was 3.8.11 not 3.8.7.
Comment 2 Lan Tianyu 2013-05-30 02:19:27 UTC
Please provide the output of acpidump. Could you do a bisect between v3.8.11 and v3.9 to find which commit is causing this issue?
Comment 3 Alexander Kaltsas 2013-05-30 08:32:32 UTC
I started a git bisect but I couldn't use the produced kernel (kernel panic at early boot). I am trying to solve it and continue.

Something else. The git tag output is

v3.8
v3.8-rc1
v3.8-rc2
v3.8-rc3
v3.8-rc4
v3.8-rc5
v3.8-rc6
v3.8-rc7
v3.9
v3.9-rc1
v3.9-rc2
v3.9-rc3
v3.9-rc4
v3.9-rc5
v3.9-rc6
v3.9-rc7
v3.9-rc8

I used v3.8 as good and v3.9 as bad. Did I do it correctly? There is no version 3.8.11.

I am attaching the acpidump output.
Comment 4 Alexander Kaltsas 2013-05-30 08:33:25 UTC
Created attachment 102991 [details]
acpidump output.
Comment 5 Lan Tianyu 2013-05-30 08:45:23 UTC
(In reply to comment #3)
> I used v3.8 as good and v3.9 as bad. Did I do it correctly? There is no version 3.8.11.
Yes. Please go ahead. Currently I have no more clues.
Comment 6 Alexander Kaltsas 2013-05-31 20:12:14 UTC
As for the git bisect:

There is a point in the bisected range (3.8 to 3.9) that makes the built kernel unbootable (really early kernel panic. .... Not syncing: attempted to kill init! ... exitcode=0x00000009). I have tried everything I could to debug the kernel panic, without success. Every time this point is included in a bisect step I must make an assumption: either the built kernel is git bisect good or bad. That doubles the steps every time it occurs. It is almost impossible to try every combination. I don't know what to try next.

I had also posted my problem to the following bug report. 

https://bugzilla.kernel.org/show_bug.cgi?id=58801
Comment 7 Rafael J. Wysocki 2013-06-04 01:03:01 UTC
You can use "git bisect skip" to avoid commits that break your system entirely.

Anyway, does 3.10-rc4 still have the problem?
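
To make the skip handling concrete, here is a minimal sketch of how each test result could map to a bisect command. The outcome labels (ok, hot, panic) and the helper name bisect_cmd_for are made up for illustration; only the git bisect subcommands themselves are real.

```shell
# Hypothetical helper: map the outcome of testing one bisect step to the
# git bisect subcommand to run next. The outcome names are assumptions of
# this sketch, not anything git itself knows about.
bisect_cmd_for() {
    case "$1" in
        ok)    echo "git bisect good" ;;   # resumed, temperature normal
        hot)   echo "git bisect bad"  ;;   # resumed, frequency stuck at max
        panic) echo "git bisect skip" ;;   # kernel did not boot, cannot judge
        *)     echo "unknown outcome: $1" >&2; return 1 ;;
    esac
}

# Usage: print the command for a kernel that panicked at early boot.
bisect_cmd_for panic
```

With skip, an unbootable commit no longer forces a guess between good and bad; git picks a nearby commit to test instead.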
Comment 8 Alexander Kaltsas 2013-06-04 12:17:18 UTC
I will give it a try with the 3.10-rc4.
Comment 9 Alexander Kaltsas 2013-06-05 19:16:01 UTC
3.10-rc4 is also bad.

When this happens the cpu frequency is always at max, even if I change the governor to powersave.

I will give bisect another try.
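
For anyone who wants to check the "always at max" symptom without screenshots, a rough sketch follows. It assumes the usual cpufreq sysfs files scaling_cur_freq and cpuinfo_max_freq (values in kHz); the function name cpus_at_max is hypothetical, and whether these files reflect the real clock depends on the scaling driver, so treat the result as a hint only.

```shell
# List CPUs whose reported current frequency is at or above the advertised
# maximum. The sysfs root is a parameter so the function can also be run
# against a saved copy of the directory tree.
cpus_at_max() {
    root="${1:-/sys/devices/system/cpu}"
    for dir in "$root"/cpu[0-9]*; do
        cur_file="$dir/cpufreq/scaling_cur_freq"
        max_file="$dir/cpufreq/cpuinfo_max_freq"
        # Skip CPUs that do not expose both files.
        [ -r "$cur_file" ] && [ -r "$max_file" ] || continue
        cur=$(cat "$cur_file")
        max=$(cat "$max_file")
        if [ "$cur" -ge "$max" ]; then
            echo "${dir##*/} cur=${cur} max=${max}"
        fi
    done
}

# Usage: run once before suspend and once after resume, then compare.
# cpus_at_max
```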
Comment 10 Rafael J. Wysocki 2013-06-05 19:33:05 UTC
I wonder what powertop says when this is happening?
Comment 12 Sudhir Khanger 2013-06-05 23:42:20 UTC
This problem is much more prominent on my i7-2620M Intel Sandybridge system. After waking up from suspend temperature soars to over 80C and power consumption reaches 30-40W.
Comment 13 Sudhir Khanger 2013-06-05 23:42:59 UTC
Created attachment 103601 [details]
acpidump
Comment 14 Sudhir Khanger 2013-06-06 01:09:57 UTC
Here are some more htop & powertop screenshots

http://imgur.com/VC7b9lz
http://imgur.com/zS45WNR
http://imgur.com/v91HLgl
http://imgur.com/274Ne5W
http://imgur.com/v108Saw
Comment 15 tinotom 2013-06-06 07:37:40 UTC
Hi,

same behaviour here, with the only difference that the platform is amd64 and the cpu is an i7-2670M.
Moreover I am quite sure that another bug report is opened here regarding the same issue but I can't find it.
Regards,
Comment 17 Rafael J. Wysocki 2013-06-06 11:03:50 UTC
(In reply to comment #14)
> Here are some more htop & powertop screenshots
> 
> http://imgur.com/VC7b9lz
> http://imgur.com/zS45WNR
> http://imgur.com/v91HLgl
> http://imgur.com/274Ne5W
> http://imgur.com/v108Saw

What about "frequency stats"?
Comment 18 Rafael J. Wysocki 2013-06-06 11:10:41 UTC
(In reply to comment #14)
> Here are some more htop & powertop screenshots
> 
> http://imgur.com/VC7b9lz

This shows that 80% of the time is spent in the turbo range, which isn't correct.  The idle states statistics look OK.

So, it appears to be a problem with frequency scaling.

(All of you) please tell me what's there in
/sys/devices/system/cpu/cpu*/cpufreq/scaling_driver
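
A small sketch for collecting that information in one go, assuming the standard sysfs layout named above. The function name show_scaling_drivers is hypothetical, and the optional root argument exists only so the loop can be tried on a copy of the tree.

```shell
# Print "cpuN: driver" for every CPU that exposes a cpufreq scaling_driver.
show_scaling_drivers() {
    root="${1:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_driver; do
        [ -r "$f" ] || continue
        cpu=${f#"$root"/}      # strip the root prefix
        cpu=${cpu%%/*}         # keep only the cpuN component
        echo "$cpu: $(cat "$f")"
    done
}

# Usage:
# show_scaling_drivers
```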
Comment 19 Rafael J. Wysocki 2013-06-06 11:14:19 UTC
It would be awesome if someone able to reproduce this problem could carry out a binary search for the bad commit with

$ git bisect start -- drivers/cpufreq/
Comment 20 Alexander Kaltsas 2013-06-06 11:19:46 UTC
intel_pstate.

The main change between 3.8 and 3.9 I believe is Intel PSTATE. I thought the problem was Intel PSTATE.

As I mentioned before I tried:

Compiled a kernel 3.9.2 and 3.9.4 with CONFIG_X86_INTEL_PSTATE disabled

Used the intel_pstate=disable option.
Comment 21 Alexander Kaltsas 2013-06-06 11:22:43 UTC
I will try the suggested git bisect.
Comment 22 Rafael J. Wysocki 2013-06-06 11:24:00 UTC
(In reply to comment #20)
> intel_pstate.
> 
> The main change between 3.8 and 3.9 I believe is Intel PSTATE. I thought the
> problem was Intel PSTATE.
> 
> As I mentioned before I tried:
> 
> Compiled a kernel 3.9.2 and 3.9.4 with CONFIG_X86_INTEL_PSTATE disabled
> 
> Used the intel_pstate=disable option.

In which case it should use acpi-cpufreq instead.  Does it?  If so, does it have the same problem?
Comment 23 Rafael J. Wysocki 2013-06-06 11:25:53 UTC
(In reply to comment #21)
> I will try the suggested git bisect.

If you only see the problem with intel_pstate, it would be good to verify if it's always been there with intel_pstate or only from a certain point.  You can further restrict your search to drivers/cpufreq/intel_pstate.c for that.
Comment 24 Alexander Kaltsas 2013-06-06 11:27:52 UTC
Yes, when I disable Intel PSTATE the scaling driver reverts to acpi-cpufreq. But the problem is still present.
Comment 25 Sudhir Khanger 2013-06-06 15:29:26 UTC
I can reproduce this bug in 3.7, long before the intel_pstate driver was released. So intel_pstate could just have arrived at a coincidental time. Also, I am not affected by this after every suspend/resume cycle, only occasionally.

Idle frequency has gone from 800 MHz into the turbo range, as you can see, although I have made no changes to the BIOS and no effort to enable turbo. It should not go beyond 2.7 GHz, which is the maximum for my system in a non-turbo setup. My reading tells me the frequency is supposed to be high with the intel_pstate driver.

$ ls /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver
/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver  /sys/devices/system/cpu/cpu2/cpufreq/scaling_driver
/sys/devices/system/cpu/cpu1/cpufreq/scaling_driver  /sys/devices/system/cpu/cpu3/cpufreq/scaling_driver
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
intel_pstate
$ cat /sys/devices/system/cpu/cpu2/cpufreq/scaling_driver
intel_pstate
$ cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_driver
intel_pstate
$ cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_driver
intel_pstate

I don't know anything about git bisect but let me do some reading and I will try to do it or ask any questions I have.

Thanks.
Comment 26 Viresh Kumar 2013-06-06 15:50:35 UTC
(In reply to comment #3)
> I used v3.8 as good and v3.9 as bad. Did I do it correctly? There is no version 3.8.11.

Can you try -rcs of 3.9 ? Maybe we can narrow down the problem a bit.
Comment 27 Rafael J. Wysocki 2013-06-06 20:48:02 UTC
(In reply to comment #24)
> Yes, when I disable Intel PSTATE the scaling driver reverts to acpi-cpufreq.
> But the problem is still present.

In that configuration (i.e. PSTATE disabled, acpi-cpufreq in use), does this commit make any difference:

http://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/commit/?h=pm-fixes&id=8673b83bf2f013379453b4779047bf3c6ae387e4
Comment 28 Rafael J. Wysocki 2013-06-06 21:16:20 UTC
Anyway, I don't think this qualifies as a regression, at least not as a recent one.
Comment 29 Alexander Kaltsas 2013-06-07 08:59:11 UTC
Reverting 

http://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/commit/?h=pm-fixes&id=8673b83bf2f013379453b4779047bf3c6ae387e4

and disabling intel_pstate doesn't solve the issue.

Powertop output with the issue present, Intel Pstate disabled and the above patch changed.

http://i.imgur.com/9OJiV8I.png
http://i.imgur.com/NMcSJdi.png
http://i.imgur.com/zgkzTys.png
http://i.imgur.com/fZS1aqV.png
http://i.imgur.com/VvfC6cB.png
Comment 30 Rafael J. Wysocki 2013-06-07 10:43:21 UTC
Well, I'm not sure what you did, because the commit I asked about in comment #27 has not been merged yet, so it cannot be reverted. :-)

Or did you test linux-next with that commit reverted?
Comment 31 Alexander Kaltsas 2013-06-07 10:54:27 UTC
I downloaded linux-3.9.tar.xz, applied the patch patch-3.9.4.xz. After that I opened the file drivers/cpufreq/acpi-cpufreq.c

I located the lines 

cmd.addr.msr.reg = MSR_IA32_PERF_CTL; (line 350)

and 

cmd.addr.msr.reg = MSR_AMD_PERF_CTL; (line 354)

And changed it to cmd.addr.msr.reg = MSR_IA32_PERF_STATUS; and cmd.addr.msr.reg = MSR_AMD_PERF_STATUS;.
Comment 32 Alexander Kaltsas 2013-06-07 11:04:02 UTC
Ok, I did it wrong. I just realized I was working in the wrong place. I have to sleep tbh.
Comment 33 Alexander Kaltsas 2013-06-07 12:25:22 UTC
This time I think I did it correctly.

I applied the patch to kernel 3.9.4 and disabled Intel pstate. No change. The issue is still present.

Do you want me to test it with 3.10-rc4?
Comment 34 Rafael J. Wysocki 2013-06-07 12:39:07 UTC
No, thanks.

I think it's better to focus on intel_pstate from now on.  I'll get back to you later today.
Comment 35 Rafael J. Wysocki 2013-06-09 22:18:48 UTC
I promised to get back to you, but I didn't manage to do that on Friday.  Sorry about that.

intel_pstate has a couple of tunables under /sys/devices/system/cpu/intel_pstate one of which is 'no_turbo'.

When you reproduce the bug after a system resume, does writing 1 and then 0 to that file make any difference?
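
The toggle asked about here can be sketched as a tiny helper. The path is the no_turbo tunable named above; the function name toggle_no_turbo and the default one-second pause between the two writes are arbitrary choices for this sketch, and it needs root when pointed at the real file.

```shell
# Write 1 and then 0 to the no_turbo tunable, pausing in between.
# $1: file to write (defaults to the intel_pstate tunable)
# $2: pause in seconds between the two writes (assumed default: 1)
toggle_no_turbo() {
    f="${1:-/sys/devices/system/cpu/intel_pstate/no_turbo}"
    echo 1 > "$f" || return 1
    sleep "${2:-1}"
    echo 0 > "$f"
}

# Usage (as root, after a resume that shows the problem):
# toggle_no_turbo
```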
Comment 36 Rafael J. Wysocki 2013-06-09 22:24:10 UTC
Also, if you write 1 to /sys/devices/system/cpu/intel_pstate/no_turbo before suspend, is the problem reproducible after writing 0 to it right after the subsequent resume?

In other words, are you able to see the problem after a sequence like this:

# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
# echo mem > /sys/power/state
# echo 0 > /sys/devices/system/cpu/intel_pstate/no_turbo
Comment 37 Alexander Kaltsas 2013-06-10 08:06:13 UTC
It may be i915 related after all. I need to run some tests. I will reply soon.
Comment 38 Rafael J. Wysocki 2013-06-10 11:45:50 UTC
Even so, can you please do the comment #36 check anyway?
Comment 39 Alexander Kaltsas 2013-06-10 11:48:41 UTC
ok.
Comment 40 Alexander Kaltsas 2013-06-10 12:33:01 UTC
No, using

# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
# echo mem > /sys/power/state
# echo 0 > /sys/devices/system/cpu/intel_pstate/no_turbo

When the problem is present doesn't change anything.
Comment 41 Dirk Brandewie 2013-06-10 14:46:32 UTC
So I am a little confused here: is intel_pstate implicated or not? Most of the thread explicitly took intel_pstate off the table.

Anyway maybe the output from turbostat when the system is in the bad state would help shed some light on what is going on.
Comment 42 Alexander Kaltsas 2013-06-10 14:53:06 UTC
Ok, I managed to complete a git bisect. The result was:

First bad commit: [15239099d7a7a9ecdc1ccb5b187ae4cda5488ff9] drm/i915: enable irqs earlier when resuming.

http://o.cs.uvic.ca:20810/perl/cid.pl?cid=15239099d7a7a9ecdc1ccb5b187ae4cda5488ff9

I removed the above patch from kernel 3.9.4, 3.9.5 and it seems to work ok after suspend and resume. With Intel PSTATE enabled.

The temperature is back to normal and the cpu frequency doesn't "lock" to the 
highest value.



Just a side note. Don't know if related at all:

I have noticed the following when using Intel PSTATE. If the system starts with frequency
scaling set to powersave, the frequency is around 800~900 MHz. If I change it to
performance it will go up. After changing it back to powersave the frequency will
stay at ~2000 MHz. If I suspend and resume it will be around 800~900 MHz again.
This doesn't seem to affect the temperature of the system.
Comment 43 Rafael J. Wysocki 2013-06-10 19:47:54 UTC
(In reply to comment #41)
> So I am a little confused here: is intel_pstate implicated or not? Most of
> the thread explicitly took intel_pstate off the table.

intel_pstate does not cause that problem to happen, but it's just much easier to work with intel_pstate than with acpi-cpufreq.

Since we have a bisection result now (thanks Alexander!), I'm reassigning this bug to the i915 people.
Comment 44 Alexander Kaltsas 2013-06-10 19:53:23 UTC
I don't have to open a new request. Right?

Thank you for your time.
Comment 45 Rafael J. Wysocki 2013-06-10 19:57:24 UTC
(In reply to comment #44)
> I don't have to open a new request. Right?

Right, it's better to keep all of the relevant info in one place.
Comment 46 Sudhir Khanger 2013-06-17 16:07:02 UTC
The problem seems to have gotten worse for me. Since 3.9.6, I am booting directly into a system which is running at the full frequency of 3.4GHz, consuming 35W, with a system temperature of 95-97C. This has happened more than a few times since I installed 3.9.6 last night.

http://imgur.com/vYyJ0fj
http://imgur.com/1pEwtSg
http://imgur.com/i0PMRSD
http://imgur.com/Fa9LtX4
http://imgur.com/6ALw4El
Comment 47 Jonas Jelten 2013-06-20 00:22:32 UTC
related?
https://bugzilla.kernel.org/show_bug.cgi?id=48791
Comment 48 Toralf Förster 2013-06-20 15:23:51 UTC
If you use the ondemand governor, what about running this after s2ram:
for g in performance ondemand; do for i in 3; do echo $g > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor; done; done

?
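
As written, the one-liner above only touches cpu3. A hedged sketch that bounces the governor on every CPU that exposes a scaling_governor file follows; the function name bounce_governors is hypothetical, and the governors you pass must be ones your scaling driver actually offers.

```shell
# Write each governor in turn to every CPU's scaling_governor file.
# $1: sysfs CPU root; remaining arguments: governors, applied in order.
bounce_governors() {
    root="$1"; shift
    for g in "$@"; do
        for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
            [ -w "$f" ] || continue
            echo "$g" > "$f"
        done
    done
}

# Usage (as root):
# bounce_governors /sys/devices/system/cpu performance ondemand
```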
Comment 49 Robert Csordas 2013-06-23 10:08:28 UTC
I have this problem too (I reported it as https://bugzilla.kernel.org/show_bug.cgi?id=57871). I have just found this thread. I removed commit 15239099d7a7a9ecdc1ccb5b187ae4cda5488ff9 (the one Alexander found), and it became much better, but the problem is still not solved. Originally my CPUs became locked to max freq 1 time out of 3; now I have tested 20-25 times and the problem showed up only once. But this also makes the testing much harder.
Comment 50 Alexander Kaltsas 2013-06-23 10:20:35 UTC
Since I removed the commit I can no longer replicate the problem. Everything is fine. Tested on kernel versions 3.9.4, 3.9.5, 3.9.6.
Comment 51 Rafael J. Wysocki 2013-06-23 10:28:23 UTC
On Sunday, June 23, 2013 10:20:36 AM you wrote:

> --- Comment #50 from Alexander Kaltsas <alexkaltsas@gmail.com>  2013-06-23
> 10:20:35 ---
> Since I removed the commit I can no longer replicate the problem. Everything is
> fine. Tested on kernel versions 3.9.4, 3.9.5, 3.9.6.

Can you both please attach the output of /proc/cpuinfo from your systems
(or provide pointers to that information)?
Comment 52 Robert Csordas 2013-06-23 10:47:07 UTC
Created attachment 105811 [details]
My /proc/cpuinfo
Comment 53 Robert Csordas 2013-06-23 10:48:57 UTC
Now I continued trying to lock the frequency. Over 35 tries (!!!) it was working perfectly. After that I removed the charger to run from battery. After 3 tries, the CPUs were locked at 3.3GHz. I continued and after another 3 tries it happened again. On the 2nd try after this the system did not come back from suspend (the display remained off). After a reboot, running from battery, on the 4th try the CPUs are at 3.3GHz again.

Before removing the commit there was no difference between charger attached or not.

Also, as I said on the other thread, I disabled turbo mode by setting bit 38 of MSR 0x1a0 to 1. Still, when the problem happens my 2.7 GHz CPUs are running at 3.3-3.4GHz.
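
For reference, the bit arithmetic behind that MSR write can be sketched as follows. The function name set_turbo_disable_bit is hypothetical; the commented usage assumes the rdmsr and wrmsr commands from msr-tools, that the msr kernel module is loaded, and that rdmsr prints hex without a 0x prefix.

```shell
# Return the given MSR 0x1a0 value with bit 38 (turbo disable) set.
# Relies on the shell doing 64-bit arithmetic, which bash does.
set_turbo_disable_bit() {
    printf '0x%x\n' $(( $1 | (1 << 38) ))
}

# Example with a made-up register value:
set_turbo_disable_bit 0x850089

# Assumed usage with msr-tools, as root, for the default CPU only:
# wrmsr 0x1a0 "$(set_turbo_disable_bit "0x$(rdmsr 0x1a0)")"
```

OR-ing the bit in leaves every other bit of the register untouched, which matters because the same MSR carries unrelated enable flags.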
Comment 54 Alexander Kaltsas 2013-06-23 11:35:47 UTC
My CPU info is already attached to the bug.

@ Robert Csordas. I could always replicate the issue by changing the frequency governor to performance (sudo cpupower frequency-set -g performance) and suspending.
Comment 55 Robert Csordas 2013-06-23 12:03:41 UTC
For me cpupower frequency-set -g performance makes no noticeable difference (i915 early IRQ removed, kernel 3.9.5).
Comment 56 Robert Csordas 2013-06-23 13:14:45 UTC
It seems that the problem is related to some ugly race condition. Because the "later enabling of IRQ" made the problem much less frequent, I thought it may be that something in the resuming of i915 is still happening "too early". I put a couple of msleep(100)'s in the __i915_drm_thaw function, where I thought it may be good. After that my machine resumed 30 out of 30 times without any problems running from battery. Later I "hand binary searched" the right sleep that solves the problem. So my machine is now working with one msleep(100); before the line intel_modeset_init_hw(dev); (the early IRQ commit already removed). With this one sleep added the machine resumed 20 out of 20 tries without problem (running from battery).

Interestingly, putting the sleep early in the function doesn't help (this may indicate that the race condition is in i915 itself and not between i915 and some other module???).

So the ugly solution which works for me:

1. remove commit 15239099d7a7a9ecdc1ccb5b187ae4cda5488ff9
2. locate function __i915_drm_thaw in i915_drv.c
3. find line intel_modeset_init_hw(dev); and put msleep(100); one line before it.
Comment 57 Rafael J. Wysocki 2013-06-23 21:37:35 UTC
Thanks a lot for that detective work and the information!
Comment 58 Way-Chuang Ang 2013-06-24 14:34:34 UTC
I have Dell Vostro 3350 with similar issue. The temperature may rise up after resuming from suspend. The fix for my problem was to suspend repeatedly until temperature is stable. This problem can be reliably reproduced on my system.
Comment 59 rocko 2013-06-24 18:04:49 UTC
@Robert Csordas: thanks for the msleep hint! I've been running into this issue bigtime with the 3.10-rc series kernel, where roughly four out of every five suspend/resume cycles will trigger the high CPU frequency/power usage. As per your step 2, I applied an msleep(100) just before intel_modeset_init_hw(dev) in __i915_drm_thaw, and early results look promising: with this new kernel I was able to perform 9 suspend/resumes in a row without hitting this bug. (The tenth resume completely failed - the screen backlight didn't even turn on and the system was completely unresponsive, but I'm going to assume for now that this is a different bug.)

Note that I didn't revert commit 15239099d7a7a9ecdc1ccb5b187ae4cda5488ff9, I just inserted the msleep call.
Comment 60 Robert Csordas 2013-06-24 18:29:37 UTC
Created attachment 105901 [details]
Patch with msleep(100)

I created a patch (with reverted early irq and sleep(100)) for those who want to experiment. It may be a temporary solution until the bug is fixed (of course if it works for other people too). I have been using it since yesterday and it seems to work fine.
Comment 61 Rafael J. Wysocki 2013-06-24 19:38:15 UTC
On Monday, June 24, 2013 02:34:35 PM you wrote:
> --- Comment #58 from Way-Chuang Ang <wcang79@gmail.com>  2013-06-24 14:34:34
> ---
> I have Dell Vostro 3350 with similar issue. The temperature may rise up after
> resuming from suspend. The fix for my problem was to suspend repeatedly until
> temperature is stable. This problem can be reliably reproduced on my system.

Can you please attach the contents of /proc/cpuinfo from your system?
Comment 62 Rafael J. Wysocki 2013-06-24 22:03:14 UTC
(In reply to comment #60)
> Created an attachment (id=105901) [details]
> Patch with msleep(100)
> 
> I created a patch (with reverted early irq and sleep(100)) for those who want
> to experiment. It may be a temporary solution until the bug is fixed (of
> course if it works for other people too). I have been using it since yesterday
> and it seems to work fine.

If you change /sys/power/pm_async to 0, do you still need the msleep() patch?
Comment 63 Way-Chuang Ang 2013-06-24 23:02:35 UTC
Created attachment 105911 [details]
My /proc/cpuinfo

This system only has Intel IGP.
Comment 64 Jesse Barnes 2013-06-24 23:56:59 UTC
Hm, the msleep() that seems to be helping is put just before a function where we enable RC6 and do some other power setup.  The msleep() might help make sure some other resume activities are complete before we talk to the Punit for example, or enable RC6 (which would be entered immediately).

The question is, what other activity might interfere with the Punit communication, or ring frequency changes that RC6 or ring freq scaling might cause?

I'm curious to hear the result from setting pm_async to 0...
Comment 65 Robert Csordas 2013-06-25 07:16:01 UTC
Here are the results from pm_async:

With original kernel (3.9.5) it makes no difference.

With kernel with commit 15239099d7a7a9ecdc1ccb5b187ae4cda5488ff9 removed, but no msleep, it seems to work fine (30 out of 30 resumes worked).
Comment 66 Robert Csordas 2013-06-25 07:18:24 UTC
Created attachment 105951 [details]
Patch with msleep(100)

I'm sorry about it but my last patch was not working since my original file was in my root folder.
Comment 67 Apostolis Bessas 2013-06-25 07:48:46 UTC
I use 3.9.7 (as shipped with Arch linux) on a thinkpad x220, i5-2520M CPU, and setting pm_async to 0 has fixed the issue for me so far.
Comment 68 Alexander Kaltsas 2013-06-25 09:54:41 UTC
I will install the stock kernel and give pm_async a try.
Comment 69 Alexander Kaltsas 2013-06-25 11:41:38 UTC
Setting pm_async to 0 has no effect for me. I set it to zero, suspended and when resumed the frequency locked at max and the temperature went up.
Comment 70 Jesse Barnes 2013-06-25 17:36:55 UTC
Can you try the drm-intel-nightly branch from git://people.freedesktop.org/~danvet/drm-intel with the attached min_freq patch applied?
Comment 71 Jesse Barnes 2013-06-25 17:37:24 UTC
Created attachment 106011 [details]
set min GPU frequency when idle
Comment 72 Rafael J. Wysocki 2013-06-25 23:35:19 UTC
(In reply to comment #65)
> Here are the results from pm_async:
> 
> With original kernel (3.9.5) it makes no difference.
> 
> With kernel with commit 15239099d7a7a9ecdc1ccb5b187ae4cda5488ff9 removed, but
> no msleep, it seems to work fine (30 out of 30 resumes worked).

OK, thanks.

Please boot with initcall_debug in the kernel command line, set pm_async back to 1 and try to reproduce the issue (by suspending and resuming).  Once you've reproduced it, please attach the part of dmesg corresponding to the last ("failing") suspend-resume cycle.
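
A possible way to cut out just the last cycle is sketched below. It assumes each suspend attempt starts with a line containing "PM: Syncing filesystems"; that marker is an assumption about the log text and may need adjusting for your kernel. The function name last_suspend_cycle is hypothetical.

```shell
# Print the log from the last line containing the marker through the end.
# Reads the log on stdin; prints nothing if the marker never appears.
last_suspend_cycle() {
    marker="${1:-PM: Syncing filesystems}"
    awk -v m="$marker" '
        index($0, m) { seen = 1; n = 0 }   # new cycle: drop the older one
        seen         { buf[n++] = $0 }     # keep lines of the current cycle
        END          { for (i = 0; i < n; i++) print buf[i] }
    '
}

# Usage:
# dmesg | last_suspend_cycle > failing-cycle.txt
```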
Comment 73 James Ravn 2013-06-25 23:59:34 UTC
I have the same system as Apostolis (x220, 3.9.7), and pm_async=0 seems to fix it for me as well. I would experience the max cpu / high fan about 1 out of 4 resumes, with pm_async disabled I haven't had a problem.
Comment 74 Robert Csordas 2013-06-26 06:45:18 UTC
(In reply to comment #73)
> I have the same system as Apostolis (x220, 3.9.7), and pm_async=0 seems to
> fix
> it for me as well. I would experience the max cpu / high fan about 1 out of 4
> resumes, with pm_async disabled I haven't had a problem.

I have the same machine too (x220), but with an i7 CPU, kernel 3.9.5, and pm_async doesn't help. You may try testing from battery: for me, on battery the problem happens much more often.
Comment 75 Robert Csordas 2013-06-26 07:01:15 UTC
Created attachment 106061 [details]
dmesg of failed resume with initcall_debug

Here is my dmesg of a failed wakeup.

I have not mentioned it yet, but I have quite a few i915 power management tweaks in my kernel command line: i915.i915_enable_rc6=7 i915.lvds_downclock=1 i915.i915_enable_fbc=1 i915.semaphores=1 i915.modeset=1 drm.vblankoffdelay=1
Comment 76 Way-Chuang Ang 2013-06-26 12:18:25 UTC
pm_async=0 doesn't help for my case either. The temperature still increases after a few times of suspends and resumes.
Comment 77 Way-Chuang Ang 2013-06-28 08:18:03 UTC
Hi Jesse Barnes. I want to try out the min_freq patch, do you have a patch for 3.10-rc7? 
Thanks.
Comment 78 Jesse Barnes 2013-06-28 15:19:40 UTC
No, I don't have anything against 3.10-rc7, that would be a bigger backport.  Can you try to repro on the kernel I pointed at instead?
Comment 79 Way-Chuang Ang 2013-06-28 16:26:01 UTC
I'm sorry if I missed out something, but I cannot find any reference to the kernel version that your patch should apply to.
Comment 80 Alexander Kaltsas 2013-06-28 16:37:05 UTC
I believe you should do something like:

git clone -b drm-intel-nightly --depth 1 git://people.freedesktop.org/~danvet/drm-intel

and then apply the patch with:

patch -Np1 -i "i915-min-freq.patch"
Comment 81 JohnMB 2013-07-13 08:24:05 UTC
(In reply to Jesse Barnes from comment #70)
> Can you try the drm-intel-nightly branch from
> git://people.freedesktop.org/~danvet/drm-intel with the attached min_freq
> patch applied?

Has anyone tried this now?

I have the same problem on 3.9 and 3.10 kernels, whereby idle temperature is approx 10 degrees higher after a suspend/resume, which results in the fan constantly running at a higher speed than is normal at idle.
Comment 82 JohnMB 2013-07-13 08:32:49 UTC
As others have reported, repeated suspend/resume cycles can often return the system to normal after a few attempts.
Comment 83 Lan Tianyu 2013-07-15 01:48:58 UTC
*** Bug 57871 has been marked as a duplicate of this bug. ***
Comment 84 Jesse Barnes 2013-07-15 17:39:35 UTC
Another patch to try, see comment #72 in https://bugs.freedesktop.org/show_bug.cgi?id=54089.
Comment 85 Alexander Kaltsas 2013-07-15 21:18:28 UTC
It seems to be ok with the above patch on v3.9.9. I will do further testing.
Comment 86 rocko 2013-07-16 02:47:00 UTC
Created attachment 106891 [details]
intel_pm.c patch against linux 3.10.1

So far the changes in that patch are working well in linux 3.10.1 as well (I've done five suspend/resume cycles without any issues).

I had to modify it slightly for 3.10 though, so I've attached what I'm trying in case anyone else wants to try it out.
Comment 87 JohnMB 2013-07-16 10:25:48 UTC
Also so far so good with that latest patch on 3.10.1 after 10+ s2ram/resume cycles. Will keep running for a while to make sure.
Comment 88 Chris Wilson 2013-07-17 11:26:11 UTC
commit 7dcd2677ea912573d9ed4bcd629b0023b2d11505
Author: Konstantin Khlebnikov <khlebnikov@openvz.org>
Date:   Wed Jul 17 10:22:58 2013 +0400

    drm/i915: fix long-standing SNB regression in power consumption after resume
Comment 89 Alexander Kaltsas 2013-07-17 11:46:54 UTC
Link/patch? 

Had anything to do with my git bisect findings?

drm/i915: enable irqs earlier when resuming.
Comment 90 Philip Munksgaard 2013-07-17 12:04:38 UTC
I suppose this is the patch he is referring to: https://lkml.org/lkml/2013/7/17/48
Comment 91 Daniel Vetter 2013-07-17 12:11:09 UTC
(In reply to Alexander Kaltsas from comment #89)
> Link/patch? 
> 
> Had anything to do with my git bisect findings?
> 
> drm/i915: enable irqs earlier when resuming.

If the fix in that patch is correct it's extremely timing dependent and small juggling of stuff around could uncover the underlying bug easily. Testing will tell.

(In reply to Philip Munksgaard from comment #90)
> I suppose this is the patch he is referring to:
> https://lkml.org/lkml/2013/7/17/48

Yes.
Comment 92 Alexander Kaltsas 2013-07-24 12:53:13 UTC
What kernel version will include the patch?

I tried to manually "insert" it into 3.9.9 but I believe there are many other changes.
Comment 93 JohnMB 2013-07-24 20:46:16 UTC
(In reply to Alexander Kaltsas from comment #92)
> What kernel version will include the patch?


It's in the 3.11 tree I think? And therefore I would expect it's heading for the stable queue for 3.9.x/3.10.x point releases soon?
Comment 94 Lan Tianyu 2013-07-25 01:29:59 UTC
*** Bug 59571 has been marked as a duplicate of this bug. ***
Comment 95 Alexander Kaltsas 2013-07-31 11:18:27 UTC
Just some info.

I have applied the patch to kernel 3.10.2 and I noticed image degradation after resume.

http://i.imgur.com/WgRpBLr.png

After some mouse movement the display is ok (with minor artifacts that eventually go away).

And some dmesg errors:

[ 7073.391125] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung

[ 7073.391130] [drm] capturing error event; look for more information in /sys/kernel/debug/dri/0/i915_error_state

[ 7073.402402] [drm:kick_ring] *ERROR* Kicking stuck semaphore on render ring
[13137.145779] [drm] Wrong MCH_SSKPD value: 0x16040307
[13137.145780] [drm] This can cause pipe underruns and display issues.
[13137.145780] [drm] Please upgrade your BIOS to fix this.
[13138.285473] [drm] Enabling RC6 states: RC6 on, RC6p on, RC6pp on

I am attaching the i915_error_state file.
Comment 96 Alexander Kaltsas 2013-07-31 11:22:31 UTC
Created attachment 107051 [details]
i915 error state

i915 error state
Comment 97 Daniel Vetter 2013-08-04 21:16:33 UTC
(In reply to Alexander Kaltsas from comment #95)
> Just some info.
> 
> I have applied the patch to kernel 3.10.2 and I noticed image degradation
> after resume.
> 
> http://i.imgur.com/WgRpBLr.png

This is bug #60530

> After some mouse movement the display is ok (with minor artifacts that
> eventually go away).
> 
> And some dmesg errors:
> 
> [ 7073.391125] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed...
> GPU hung
> 
> [ 7073.391130] [drm] capturing error event; look for more information in
> /sys/kernel/debug/dri/0/i915_error_state
> 
> [ 7073.402402] [drm:kick_ring] *ERROR* Kicking stuck semaphore on render ring
> [13137.145779] [drm] Wrong MCH_SSKPD value: 0x16040307
> [13137.145780] [drm] This can cause pipe underruns and display issues.
> [13137.145780] [drm] Please upgrade your BIOS to fix this.
> [13138.285473] [drm] Enabling RC6 states: RC6 on, RC6p on, RC6pp on
> 
> I am attaching the i915_error_state file.

And this is bug #53571
Comment 98 Alexander Kaltsas 2013-08-04 21:27:32 UTC
Thank you.
Comment 99 Rafael J. Wysocki 2013-09-10 00:42:46 UTC
*** Bug 48721 has been marked as a duplicate of this bug. ***
Comment 100 Mykal Valentine 2013-11-18 04:50:00 UTC
Question about the resolution for this bug.

I'm using a laptop with a HSW CPU and HD4500 integrated graphics on linux 3.12 and am seeing what seems to be a similar issue to this (seeing max frequency after resume).

I'm pretty sure the fix is in linux 3.12.  I looked at the source and the change to i915_dma.c is still there, but the intel_pm.c change doesn't seem to be there anymore, though it's hard to tell since I can't even find the function...

What should I do to see if this is the same bug or a new bug?