Bug 90041 - intel_pstate: very low CPU frequency - throttling config corrupted during S3
Summary: intel_pstate: very low CPU frequency - throttling config corrupted during S3
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Alias: None
Product: Power Management
Classification: Unclassified
Component: Hibernation/Suspend
Hardware: Intel Linux
Importance: P1 normal
Assignee: Chen Yu
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-12-18 16:35 UTC by Mihai Donțu
Modified: 2017-10-27 17:11 UTC

See Also:
Kernel Version: 3.18
Subsystem:
Regression: No
Bisected commit-id:


Attachments
/proc/cpuinfo (4.62 KB, text/plain)
2014-12-18 16:35 UTC, Mihai Donțu
i7z output (1.25 KB, text/plain)
2014-12-18 16:36 UTC, Mihai Donțu
Patch to force setting pstate in the intel_pstate driver (3.70 KB, patch)
2015-06-11 18:23 UTC, Doug Smythies
adjust throttling once resumed (1.51 KB, text/plain)
2016-06-26 15:19 UTC, Chen Yu
save/restore tstate for each CPUs (3.58 KB, text/plain)
2016-11-13 17:36 UTC, Chen Yu
version 2 of save/restore tstate across s3 (7.79 KB, text/plain)
2016-11-30 06:35 UTC, Chen Yu
Acpidump (197.82 KB, application/octet-stream)
2017-01-23 09:32 UTC, Kadir
version 3 (clear tstate after resumed back) (1.92 KB, text/plain)
2017-02-06 18:12 UTC, Chen Yu
version 4 (clear tstate after resumed back) (2.53 KB, text/plain)
2017-02-13 09:06 UTC, Chen Yu
Version 5 of Chen's patch (4.98 KB, patch)
2017-03-27 22:16 UTC, Doug Smythies

Description Mihai Donțu 2014-12-18 16:35:39 UTC
Created attachment 161221
/proc/cpuinfo

Hi,

I have a Dell Latitude E7440 laptop with an i7-4600U, on which intel_pstate behaves unusually: if I'm compiling something heavy (like chromium or libreoffice), after about 15 min the CPU frequency is set to ~800 MHz (it varies between ~750 and ~790) and i7z shows each core spending 90% of the time in the C1 state. What's interesting is that /proc/cpuinfo shows 'cpu MHz' at ~130 MHz, which I tend to believe because the entire desktop goes into ultra slow motion. If I suspend the current job (Ctrl+Z) and wait for about 5 min, it gets back to normal.

At first I thought it's some kind of reaction to high temperature, but I kept an eye on it and it never goes above 85°C.

I have also tried 'acpi-cpufreq' and the behavior is considerably better. The only problem with it is that _it seems_ to be worse when it comes to managing the CPU states when switching from AC to battery and back.

I have attached a sample of 'cpuinfo' to back up my 130MHz claim (which is surprising for me) and also the output of i7z.
Comment 1 Mihai Donțu 2014-12-18 16:36:03 UTC
Created attachment 161231
i7z output
Comment 2 Mihai Donțu 2014-12-18 17:30:17 UTC
I just switched to 'acpi-cpufreq' and started a build of calligra. The CPU clock multiplier started to slowly drop from 27 to 8 after only 5min and stayed there. The CPU temperature is ~50°C. While a good deal of time is spent in C1 state, the system is more usable than with intel_pstate, albeit less performant.
Comment 3 Doug Smythies 2015-06-11 16:24:23 UTC
With a reasonable degree of confidence, I believe the root issue to be a Dell BIOS problem. See also bug 62851
Comment 4 Mihai Donțu 2015-06-11 16:48:12 UTC
For the most part, I agree. I bought a cooling pad and can now do all kinds of builds with no problem.

A single issue remains: sometimes, when I resume from suspend to RAM, the CPU frequency is stuck at a little over 1GHz. I have to do another cycle of suspend-resume in order to get it back to normal. Can this also be a BIOS issue?
Comment 5 Doug Smythies 2015-06-11 18:17:48 UTC
(In reply to Mihai Donțu from comment #4)
> For the most part, I agree. I bought a cooling pad and can now do all kinds
> of builds with no problem.
> 
> A single issue remains: sometimes, when I resume from suspend to RAM, the
> CPU frequency is stuck at a little over 1GHz. I have to do another cycle of
> suspend-resume in order to get it back to normal. Can this also be a BIOS
> issue?

It might be. There are two lock-up issues, one due to the Dell BIOS and one due to an issue in the intel_pstate driver. Without more data, I do not know which particular problem you have. For the intel_pstate driver, I have submitted a patch to correct the problem. The patch has yet to be accepted, but it is my hope that it will be included in kernel 4.2-rc1.

In a moment I will attach the patch, in case you want to try it.
Comment 6 Doug Smythies 2015-06-11 18:23:06 UTC
Created attachment 179651
Patch to force setting pstate in the intel_pstate driver
Comment 7 Doug Smythies 2015-08-18 20:15:30 UTC
When you experience the stuck CPU frequency condition after resume from suspend, could you please do the following and report back (steps 1 and 2 are needed before any suspend):

1.) (needed once per boot) sudo modprobe msr
2.) before any suspend: sudo rdmsr -a 0x19a
3.) after a suspend that results in the stuck CPU frequencies:
 sudo rdmsr -a 0x19a
4.) If the result from step 3 is not 0, then:
 sudo wrmsr -a 0x19a 0x0
and check it:
 sudo rdmsr -a 0x19a
5.) Are the CPU frequencies O.K. now?
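
The check in steps 2 to 4 can be condensed into a small helper. This is only a sketch: the function name nonzero_cpus is made up for illustration, and it assumes that rdmsr -a prints one hexadecimal value per line, in CPU order, without a 0x prefix (as in the outputs posted in this report).

```shell
#!/bin/sh
# nonzero_cpus (hypothetical helper): read per-CPU register values from stdin,
# one hex value per line as printed by "rdmsr -a 0x19a", and print the index
# of every CPU whose value is not zero.
# Assumption: the lines arrive in CPU order, starting with CPU 0.
nonzero_cpus() {
    cpu=0
    while read -r value; do
        # Skip blank lines so they do not shift the CPU index.
        [ -z "$value" ] && continue
        if [ "$((0x$value))" -ne 0 ]; then
            echo "$cpu"
        fi
        cpu=$((cpu + 1))
    done
}
```

Usage, with the msr module loaded as in step 1: sudo rdmsr -a 0x19a | nonzero_cpus. No output means the register is clear on every CPU.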
Comment 8 Paul Johnson 2015-10-23 02:12:35 UTC
Hi, Doug!

I corresponded with you on another bug report. This concerns a Dell 6430u with an Intel i6 CPU. Today I upgraded to Ubuntu 15.10, which provides 4.2.0-16, hoping to leave all that suspend/CPU nonsense behind. Perhaps the fix is not in the kernel they shipped; it was frozen for a while...

I still notice the old suspend -> slow CPU after restart on battery problem. All CPUs stay frozen at 754 MHz.

I also notice a new problem that restart on power has a sluggish CPU. It stays damped down. It is not frozen at 754MHz, but slow. This reminds me of old old problem with "ondemand" governor where the up_threshold was set too high, at 95, and only way to shake loose was to lower up_threshold to 70 or such. Maybe that memory reveals me as old timer :)  In the new regime with powersave and performance, I have not found parameter similar to up_threshold that can be adjusted to easily fix the CPU-is-suffocated  problem.

I was frustrated, but just tested your  msr solution. It works! If I load msr and run 

$ sudo wrmsr -a 0x19a 0x0

then I can suspend and resume on battery, and the CPU frequencies scale across the full spectrum. They idle at 754 MHz but I have seen them go up to 2.6 GHz. That is a new behavior in this kernel.

I'll splat in some data, don't know for sure what might help

$ uname -a
Linux dellap14 4.2.0-16-generic #19-Ubuntu SMP Thu Oct 8 15:35:06 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
$ cpufreq-info -c0
cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009
Report errors and bugs to cpufreq@vger.kernel.org, please.
analyzing CPU 0:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 10.0 us.
  hardware limits: 754 MHz - 2.60 GHz
  available frequency steps: 2.60 GHz, 2.60 GHz, 2.50 GHz, 2.30 GHz, 2.20 GHz, 2.10 GHz, 1.90 GHz, 1.80 GHz, 1.60 GHz, 1.50 GHz, 1.40 GHz, 1.20 GHz, 1.10 GHz, 1000 MHz, 800 MHz, 754 MHz
  available cpufreq governors: conservative, ondemand, userspace, powersave, performance
  current policy: frequency should be within 754 MHz and 2.60 GHz.
                  The governor "ondemand" may decide which speed to use
                  within this range.
  current CPU frequency is 800 MHz.
  cpufreq stats: 2.60 GHz:3.31%, 2.60 GHz:0.01%, 2.50 GHz:0.92%, 2.30 GHz:0.74%, 2.20 GHz:0.30%, 2.10 GHz:2.58%, 1.90 GHz:0.76%, 1.80 GHz:0.95%, 1.60 GHz:1.06%, 1.50 GHz:1.01%, 1.40 GHz:1.68%, 1.20 GHz:2.31%, 1.10 GHz:2.55%, 1000 MHz:12.05%, 800 MHz:38.07%, 754 MHz:31.69%  (21929)

If you want to see data from the state in which it is stuck slow, please tell me what you need and I'll get it.  I'm not up to compile a kernel anymore, but can report on this one Ubuntu provides.
Comment 9 Doug Smythies 2015-10-23 06:22:25 UTC
(In reply to Paul Johnson from comment #8)

> I corresponded with you on other bug report.

For reference: bug 90421.

>  Perhaps the fix is not
> included in the kernel that they included, it was frozen for a while...

The patch I mentioned is included, but there are other root issues that show similar symptoms.

> I notice still the old suspend -> slow CPU after restart on battery problem.
> All CPU stay frozen at 754Mhz.

I think you are suffering from the Clock Modulation problem.

> I was frustrated, but just tested your msr solution. It works!

I'm confused. I thought you tried it with success on 2015.09.08 (see bug 90421).

> $ cpufreq-info -c0
> ...
>   available cpufreq governors: conservative, ondemand, userspace, powersave,
> performance

It looks as though you are using the acpi-cpufreq frequency scaling driver.
Clock Modulation effects should be less than with the intel_pstate driver.
 
> If you want to see data from the state in which it is stuck slow, please
> tell me what you need and I'll get it.

Myself, I'd like to know the msr register contents before you clear them. i.e. I'd like to know the Clock Modulation percent.
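
For readers who want to turn a raw register value into that percentage, the sketch below decodes it. It assumes the IA32_CLOCK_MODULATION layout described in Intel's Software Developer's Manual: bit 4 enables on-demand clock modulation and bits 3:1 select the duty cycle in steps of 12.5%. The optional extended duty-cycle bit 0 is ignored, and decode_clock_mod is a hypothetical name.

```shell
#!/bin/sh
# decode_clock_mod (hypothetical helper): decode one value of MSR 0x19a as
# printed by rdmsr (hex, no "0x" prefix).
# Assumed layout: bit 4 = modulation enabled, bits 3:1 = duty cycle in
# 12.5% steps. The extended duty-cycle bit 0 is ignored in this sketch.
decode_clock_mod() {
    raw=$((0x$1))
    enabled=$(( (raw >> 4) & 1 ))
    steps=$(( (raw >> 1) & 7 ))
    if [ "$enabled" -eq 0 ]; then
        echo "clock modulation disabled"
        return 0
    fi
    # Work in tenths of a percent to stay within integer arithmetic.
    tenths=$((steps * 125))
    echo "clock modulation enabled, duty cycle $((tenths / 10)).$((tenths % 10))%"
}
```

Under that assumed layout, decode_clock_mod 1c reports a duty cycle of 75.0%.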
Comment 10 Paul Johnson 2015-11-07 03:49:05 UTC
Hi Doug.

I think the conclusion here will be that the failure in frequency scaling is no longer linked one-to-one with the msr thing. On this Dell laptop, I no longer see the CPU almost always locked slow after resume. I've fiddled with this in a lot of ways; here's what I think are facts.

After a clean restart

$ /sbin/modprobe msr
$ sudo rdmsr -a 0x19a
0
0
0
0


While plugged in, suspend and resume:

$ sudo rdmsr -a 0x19a
0
0
0
0

While plugged in, suspend, then unplug. Same.

$ sudo rdmsr -a 0x19a
0
0
0
0

Remove plug, suspend, resume unplugged.


$ sudo rdmsr -a 0x19a
[sudo] password for pauljohn:
1c
1c
1c
1c


However, to my surprise, the cpu frequency scaling is working, 
CPUs do scale up and down. This is a surprise to me, I've not seen this until the Ubuntu update. Perhaps after a second wave of package updates.

$ uname -a
Linux dellap14 4.2.0-16-generic #19-Ubuntu SMP Thu Oct 8 15:35:06 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux


Is it possible I've reset those values accidentally?

No:

$ sudo rdmsr -a 0x19a
1c
1c
1c
1c


I can reset those to 0 thus

$ sudo wrmsr -a 0x19a 0x0

But don't notice any difference. 

After several resume/suspend cycles, laptop seems to not scale up any more. However, I could not see a pattern to explain it.

I find one way to dependably make frequency scaling break. One day, I needed some calculations quickly and I did this

$ sudo cpupower frequency-set -g performance

After calculations were done, I did this

$ sudo cpupower frequency-set -g powersave

Now the CPU is truly locked at 754. It has no scaling response at all, not even the minor fluctuations I was seeing before. Maybe this is the Dell BIOS problem you mentioned.

Now the msr setting is still 0, so we have no more fix for the slow CPU:

$ sudo rdmsr -a 0x19a
0
0
0
0


$ cpufreq-info


analyzing CPU 3:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 3
  CPUs which need to have their frequency coordinated by software: 3
  maximum transition latency: 10.0 us.
  hardware limits: 754 MHz - 2.60 GHz
  available frequency steps: 2.60 GHz, 2.60 GHz, 2.50 GHz, 2.30 GHz, 2.20 GHz, 2.10 GHz, 1.90 GHz, 1.80 GHz, 1.60 GHz, 1.50 GHz, 1.40 GHz, 1.20 GHz, 1.10 GHz, 1000 MHz, 800 MHz, 754 MHz
  available cpufreq governors: conservative, ondemand, userspace, powersave, performance
  current policy: frequency should be within 754 MHz and 2.60 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 754 MHz.
  cpufreq stats: 2.60 GHz:13.14%, 2.60 GHz:0.00%, 2.50 GHz:0.32%, 2.30 GHz:0.25%, 2.20 GHz:0.17%, 2.10 GHz:4.90%, 1.90 GHz:0.62%, 1.80 GHz:0.92%, 1.60 GHz:0.55%, 1.50 GHz:0.55%, 1.40 GHz:0.28%, 1.20 GHz:1.78%, 1.10 GHz:1.55%, 1000 MHz:3.73%, 800 MHz:13.73%, 754 MHz:57.51%  (1409)

I'll still keep my journal, maybe there will be a pattern.
Comment 11 Kadir 2016-02-22 11:47:57 UTC
Hi, 

I have a Dell Latitude e6320 and the exact problem as described here. On battery the cpu is stuck at 600 Mhz after a suspend and resume. I am on the Intel p-state driver.

After clean restart I get:

sudo rdmsr -a 0x19a
0
0
0
0

While plugged in, suspend and resume, I get:

sudo rdmsr -a 0x19a
0
0
0
0

While plugged in, suspend, then unplug, I get:

sudo rdmsr -a 0x19a
1c
1c
1c
1c


With removed plug, suspend, resume unplugged, I get:

sudo rdmsr -a 0x19a
1c
1c
1c
1c

If I do sudo wrmsr -a 0x19a 0x0, everything is normal again.

I am on:

uname -a
Linux latitude 4.3.5-300.fc23.x86_64 #1 SMP Mon Feb 1 03:18:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Is there a fix in kernel 4.4 or coming in 4.5?
Comment 12 Doug Smythies 2016-02-23 06:42:11 UTC
(In reply to Kadir from comment #11)

> I am on:
> 
> uname -a
> Linux latitude 4.3.5-300.fc23.x86_64 #1 SMP Mon Feb 1 03:18:41 UTC 2016
> x86_64 x86_64 x86_64 GNU/Linux
> 
> Is there a fix in kernel 4.4 or coming in 4.5?

No. In its current form, the intel_pstate driver is fundamentally incompatible with Clock Modulation.

Someone made a script that runs upon resume, clearing the register. See here:
https://bbs.archlinux.org/viewtopic.php?pid=1558948#p1558948
Comment 13 Kadir 2016-02-23 07:22:14 UTC
Thanks for your quick reply. It's unfortunate that a kernel fix is not available. I know two other persons (with Dell Latitude laptops) who also have this problem.

Thanks for pointing out the workaround script, I use a simpler version:

I've created the file /usr/lib/systemd/system-sleep/cpu_clock and put the following in:

#!/bin/sh
wrmsr -a 0x19a 0x0

And made the file executable.

The result is the same as with the script. Note that on Fedora the msr-tools package is not installed by default, so you first have to install that package and then create the file.
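
A slightly more defensive variant of the hook is sketched below. It relies on systemd calling scripts in the system-sleep directory with "pre" or "post" as the first argument, and clears the register only on "post", that is, after resume. The WRMSR variable and the handle_sleep_event function are illustrative names, not part of msr-tools or systemd, and the sketch assumes the msr module is already loaded.

```shell
#!/bin/sh
# Sketch of a systemd system-sleep hook. systemd runs such hooks with
# $1 = "pre" (before sleeping) or "post" (after waking) and $2 = the sleep type.
# WRMSR is an illustrative override so the command can be swapped for testing.
WRMSR="${WRMSR:-wrmsr}"

handle_sleep_event() {
    case "${1:-}" in
        post)
            # Clear MSR 0x19a on all CPUs after resume, as the one-line hook does.
            $WRMSR -a 0x19a 0x0
            ;;
        *)
            # Nothing to do before suspend or when called without arguments.
            ;;
    esac
}

handle_sleep_event "$@"
```

Installed in the same way as the one-line version, this avoids writing the register on the way into suspend, where the write has no lasting effect on the symptom described here.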

I hope a fix is coming in the next few kernels.
Comment 14 Chen Yu 2016-06-26 15:19:06 UTC
Created attachment 221191
adjust throttling once resumed

A trial patch to adjust processor throttling after resume.
Comment 15 Chen Yu 2016-06-26 15:20:19 UTC
@Kadir, I saw your reply at https://bugzilla.kernel.org/show_bug.cgi?id=80651
Could you help check whether the patch in comment 14 works for you?
Comment 16 Doug Smythies 2016-06-26 15:30:45 UTC
@Chen Yu, I believe @Kadir's post over on bug 80651 was incorrect for the original issue of that bug report. There were some posts in that bug about Clock Modulation, which was not the original issue.
Comment 17 Doug Smythies 2016-06-26 16:07:45 UTC
(In reply to Chen Yu from comment #14)
> Created attachment 221191 [details]
> adjust throttling once resumed
> 
> A trial patch to adjust processor throttling after resumed.

Chen, The patch is not in the correct format to apply. I get this:

doug@s15:~/temp-k-git/linux$ git am chen.patch
Patch format detection failed.

Note: None of my computers suffer from the Clock Modulation problem anyhow, I was just trying it.
Comment 18 Kadir 2016-06-26 16:58:50 UTC
I really want to help and test to resolve this issue, because it is quite an irritating bug. I know a couple of people with Dell latitude laptops and they all suffer from the same bug. 

But I have to look into how to apply patches to a Fedora kernel or build a custom kernel on Fedora.

Back in the day when using Gentoo, I could easily do this. Nowadays I kinda forgot how to do this custom kernel stuff. So atm I don't know how to help out, I have to look into it.
Comment 19 Kadir 2016-06-27 19:04:14 UTC
Small update; today I helped a friend to install fedora on a (second-hand) purchased Dell Latitude e6520 (with latest bios). 

I tested as described above to see if that laptop also shows the same behaviour. I was able to confidently reproduce this exact bug. So that is another (otherwise very well linux supported) laptop, affected by this bug.
Comment 20 Chen Yu 2016-06-28 12:59:04 UTC
(In reply to Doug Smythies from comment #17)
> (In reply to Chen Yu from comment #14)
> > Created attachment 221191 [details]
> > adjust throttling once resumed
> > 
> > A trial patch to adjust processor throttling after resumed.
> 
> Chen, The patch is not in the correct format to apply. I get this:
> 
> doug@s15:~/temp-k-git/linux$ git am chen.patch
> Patch format detection failed.
> 
> Note: None of my computers suffer from the Clock Modulation problem anyhow,
> I was just trying it.

Hi Doug, does patch -p1 < x.diff work?
Comment 21 Doug Smythies 2016-06-29 23:09:54 UTC
(In reply to Chen Yu from comment #20)
> (In reply to Doug Smythies from comment #17)
> > (In reply to Chen Yu from comment #14)
> > > Created attachment 221191 [details]
> > > adjust throttling once resumed
> > > 
> > > A trial patch to adjust processor throttling after resumed.
> > 
> > Chen, The patch is not in the correct format to apply. I get this:
> > 
> > doug@s15:~/temp-k-git/linux$ git am chen.patch
> > Patch format detection failed.
> > 
> > Note: None of my computers suffer from the Clock Modulation problem anyhow,
> > I was just trying it.
> 
> Hi, Doug, does patch -p1 < x.diff works ?

Hi Chen,

Yes it works. I did:

doug@s15:~/temp-k-git/linux$ patch -p1 < chen.patch

and it applied fine.

@Kadir:
1.) Keep in mind that the real root issue here is a Dell BIOS bug.
2.) I would make you a test kernel myself, but I only know how to make .deb type kernels (for Ubuntu or Debian). I do not know how to make one for Fedora.
Comment 22 Kadir 2016-06-30 13:32:33 UTC
Doug, thanks for helping out! 

I understand the issue is really the Dell BIOS. Unfortunately Dell does not want to fix this. I am glad it is being fixed in the Linux kernel.

I have just installed Ubuntu Mate 16.04 (on a separate drive) with the following kernel:

4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24 10:09:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

The bug is clearly present. When putting the laptop in suspend and then removing the charger and resuming the laptop I get the following output:

sudo rdmsr -a 0x19a

1c
1c
1c
1c

The CPU runs (with the powersave governor) at merely 600 MHz and does not scale up.

Inputting sudo wrmsr -a 0x19a 0x0 fixes the problem.
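
One plausible reading of these numbers, offered as an interpretation rather than something confirmed in this report: if 1c encodes a 75% duty cycle and the lowest P-state on this machine is 800 MHz, the effective clock comes out at about 600 MHz. The sketch below does that arithmetic. It assumes the IA32_CLOCK_MODULATION layout from Intel's documentation (bit 4 enable, bits 3:1 duty cycle in eighths); effective_mhz is a hypothetical helper and the 800 MHz base is an assumption.

```shell
#!/bin/sh
# effective_mhz (hypothetical helper): scale a base frequency by the duty
# cycle encoded in a raw MSR 0x19a value.
#   $1 = base frequency in MHz (assumed, e.g. the lowest P-state)
#   $2 = raw register value in hex, without "0x" prefix
# Assumed layout: bit 4 = enable, bits 3:1 = duty cycle in eighths.
effective_mhz() {
    raw=$((0x$2))
    if [ $(( (raw >> 4) & 1 )) -eq 0 ]; then
        # Modulation disabled: the base frequency is unchanged.
        echo "$1"
        return 0
    fi
    eighths=$(( (raw >> 1) & 7 ))
    echo $(( $1 * eighths / 8 ))
}
```

For example, effective_mhz 800 1c prints 600, while effective_mhz 800 0 prints 800.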


If you can make a test kernel, please do and I can test it. I assume it is just a matter of clicking and installing the .deb, or using apt-get (I am not very familiar with apt).
Comment 23 Doug Smythies 2016-06-30 14:07:34 UTC
@Kadir: I put the kernel (4.7-rc5 + chen patch) on my web site. Watch for an e-mail with the URL and some instructions.
Comment 24 Kadir 2016-06-30 16:47:51 UTC
Hi Chen,

With help from Doug, I just tested the patch.

uname -a:
4.7.0-rc5-chen #78 SMP Wed Jun 29 21:10:38 PDT 2016 x86_64 x86_64 x86_64 GNU/Linux

Unfortunately, the bug is not fixed with this patch. I still get:

1c
1c
1c
1c

When doing sudo rdmsr -a 0x19a and the clock speed of the cpu stays at around 600 Mhz and it does not scale up. 

Only if I do sudo wrmsr -a 0x19a 0x0 the clockspeeds go back to normal.
Comment 25 Chen Yu 2016-11-13 17:36:04 UTC
Created attachment 244311
save/restore tstate for each CPUs

OK, I finally got some time to deal with this. Could you please check whether this patch works for you? Thanks.
Comment 26 Chen Yu 2016-11-15 03:20:13 UTC
I've sent it to 
https://patchwork.kernel.org/patch/9428081/
@Kadir
@Doug Smythies
Comment 27 Doug Smythies 2016-11-21 23:26:45 UTC
@Chen Yu: I do not actually have a computer that suffers from the issue, and so can not test your proposed patch. However, and as previously done, I could compile a test kernel for Kadir's Ubuntu computer using the Ubuntu kernel configuration.

Additionally, and as seems to often happen, this bug report has progressed to talking about more than one issue. The original issue is thermal and will not be solved by your proposed patch. A secondary issue is resume from suspend on battery on several computer models, and will be solved by your patch (untested). As I have always done, I maintain that the root issue here is the fundamental incompatibility of the intel_pstate driver (well now only for the "get_target_pstate_use_performance" path, the "get_target_pstate_use_cpu_load" path would be fine) and clock_modulation and that is what should be fixed (and there are other issues and reasons it should be fixed).
Comment 28 Kadir 2016-11-24 07:37:55 UTC
@Chen Yu: Thanks for the patch, I can test it.
@Doug Smythies; Can you compile a test kernel (for testing with Ubuntu Mate 16.10) like last time? I can test it out over the weekend.
Comment 29 Chen Yu 2016-11-24 10:48:25 UTC
(In reply to Doug Smythies from comment #27)
> @Chen Yu: I do not actually have a computer that suffers from the issue, and
> so can not test your proposed patch. However, and as previously done, I
> could compile a test kernel for Kadir's Ubuntu computer using the Ubuntu
> kernel configuration.
thanks.
> 
> Additionally, and as seems to often happen, this bug report has progressed
> to talking about more than one issue. The original issue is thermal and will
> not be solved by your proposed patch. A secondary issue is resume from
> suspend on battery on several computer models, and will be solved by your
> patch (untested). As I have always done, I maintain that the root issue here
> is the fundamental incompatibility of the intel_pstate driver (well now only
> for the "get_target_pstate_use_performance" path, the
> "get_target_pstate_use_cpu_load" path would be fine) and clock_modulation
> and that is what should be fixed (and there are other issues and reasons it
> should be fixed).
I agree, and this looks like another problem we should deal with: when throttling is really activated, intel_pstate (and also acpi-cpufreq) needs to take the throttling into account. Currently only the thermal component would enable throttling, and since that case is rare (the other case should be considered invalid), maybe the maintainer is not interested in considering throttling in intel_pstate, at least for now?
Comment 30 Chen Yu 2016-11-24 10:49:42 UTC
(In reply to Kadir from comment #28)
> @Chen Yu: Thanks for the patch, I can test it.
> @Doug Smythies; Can you compile a test kernel (for testing with Ubuntu Mate
> 16.10) like last time? I can test it out over the weekend.

I got feedback from Rafael and will modify the patch; please stay tuned.
Comment 31 Doug Smythies 2016-11-24 15:40:09 UTC
(In reply to Chen Yu from comment #30)
> I got feedback from Rafael,will modify the patch, please stay tune.

Yes, I saw the e-mail.
I will wait for a version 2 of your proposed patch before I compile a test kernel for Kadir.

By the way, your patch mentions:

> However more and more reports show that other platforms also
> experienced the same issue, because some BIOSes would like to
> adjust the tstate if he thinks the temperature is too high.

I think this has nothing to do with temperature, but rather they are using Clock Modulation as a method of power management while on battery. However, I have no proof, and Dell has never replied to related inquiries on their forum type web site.
Comment 32 Chen Yu 2016-11-30 06:35:05 UTC
Created attachment 246371
version 2 of save/restore tstate across s3
Comment 33 Chen Yu 2016-11-30 06:36:16 UTC
(In reply to Doug Smythies from comment #31)
> (In reply to Chen Yu from comment #30)
> > I got feedback from Rafael,will modify the patch, please stay tune.
> 
> Yes, I saw the e-mail.
> I will wait for a version 2 of your proposed patch before I compile a test
> kernel for Kadir.
> 
> By the way, your patch mentions:
> 
> > However more and more reports show that other platforms also
> > experienced the same issue, because some BIOSes would like to
> > adjust the tstate if he thinks the temperature is too high.
> 
> I think this has nothing to do with temperature, but rather they are using
> Clock Modulation as a method of power management while on battery. However,
> I have no proof, and Dell has never replied to related inquires on their
> forum type web site.
OK, I will adjust the commit log.
Comment 34 Doug Smythies 2016-11-30 07:56:27 UTC
@Chen: So now you say:

> However more and more reports show that other platforms also
> experienced the same issue, because some BIOSes would like to
> adjust the tstate with some unknown reasons, maybe the BIOS thinks
> the temperature is too high, or the battery is too low, as mentioned
> by Doug Smythies.

However, I don't think it has anything to do with battery level, only that the laptop is on battery power.

@Kadir: A 4.9-rc7 kernel with Chen's V2 patch is on my web site, in the same directory as previous. I did not boot it, as I can not at the moment due to some other ongoing work.

linux-headers-4.9.0-rc7-chen2_4.9.0-rc7-chen2-164_amd64.deb
linux-image-4.9.0-rc7-chen2_4.9.0-rc7-chen2-164_amd64.deb
Comment 35 Kadir 2016-11-30 08:00:04 UTC
@Doug: I will test the patched kernel as soon as possible.
Comment 36 Kadir 2016-12-02 18:50:41 UTC
Success!

I did multiple resume/suspend cycles and the CPU scales up perfectly to max 3.1 GHz and scales down to min 800 MHz, so the patched kernel works perfectly!

I get the following from msr-tools:

After clean restart I get:

sudo rdmsr -a 0x19a
0
0
0
0

While plugged in, suspend and resume, I get:

sudo rdmsr -a 0x19a
0
0
0
0

While plugged in, suspend, then unplug, I get:

sudo rdmsr -a 0x19a
0
0
0
0


With removed plug, suspend, resume unplugged, I get:

sudo rdmsr -a 0x19a
0
0
0
0
Comment 37 Chen Yu 2016-12-03 14:45:47 UTC
OK, thanks Doug and Kadir.
@Kadir, just another question: do you see this issue on Linux without this patch applied and with the LATEST BIOS? I just want to confirm whether this issue has been fixed by Dell's BIOS, because working around it in the kernel, as this patch does, should be the last resort.
Comment 38 Doug Smythies 2016-12-03 15:10:08 UTC
(In reply to Chen Yu from comment #37)
> @Kadir, just another question, do you see this issue on linux without this
> patch applied and the LASTEST BIOS?

In comment #19 Kadir already mentioned the computer has the latest BIOS.

> I just want to confirm that this issue
> has been fixed by Dell's BIOS. Because it would be the last choice that we
> have to workaround it in the kernel as this patch does.

Yes. The root issue should be fixed (see my comment #27) rather than worked around.
Comment 39 Kadir 2016-12-04 11:26:27 UTC
@Chen: I am on the latest BIOS and I see this bug without this patch applied.

This bug is fixed with your latest patch.
Comment 40 Chen Yu 2016-12-12 05:44:11 UTC
Thanks.
BTW, someone also reported a similar issue on a Dell XPS 13 and said that if SpeedStep is disabled in the BIOS, the issue goes away. Could you also help verify this?

Another question: does this issue reproduce on Windows (I mean, with the same settings as when it is reproduced on Linux)?
Comment 41 Kadir 2016-12-13 19:02:26 UTC
@Chen

I tested with SpeedStep disabled. Indeed, this is also a workaround and the bug does not occur. However, battery life is completely destroyed with SpeedStep disabled; the CPU never clocks down to 800 MHz. Not only is battery life an issue, heat is also a problem.

With SpeedStep disabled, it is as if the performance governor is selected instead of the powersave governor. This defeats the purpose of having a powersave governor. For a laptop you need to have power saving, or else why use it as a laptop?

In Windows this bug does not occur.

IMO your latest patch resolves the issue and I think the bug should be fixed with your patch. It would be counterproductive, IMO, to disable any power-saving features on a laptop. That is the purpose of a laptop. Because many Dell laptops are affected by this bug, I think this should be fixed.
Comment 42 Chen Yu 2017-01-23 02:15:19 UTC
@Kadir

Could you upload your acpidump file?
Comment 43 Kadir 2017-01-23 09:32:37 UTC
Created attachment 252831
Acpidump

@Chen

Here you go, this is while running on my main OS (Fedora 25, 4.9.4-201.fc25.x86_64)
Comment 44 sascha_r 2017-02-02 14:00:41 UTC
(In reply to Chen Yu from comment #32)
> Created attachment 246371 [details]
> version 2 of save/restore tstate across s3

My system: Dell Latitude E6430, BIOS A18 (currently latest version),
Linux latitude 4.9.7-1-ck #1 SMP PREEMPT Wed Feb 1 16:15:00 EST 2017 x86_64 GNU/Linux

The patch fixes the issue for me. I will be glad to provide more information, if needed.
Comment 45 Chen Yu 2017-02-06 18:12:14 UTC
Created attachment 254321
version 3 (clear tstate after resumed back)

Here's version 3 of this patch, which might be more acceptable to the Linux community; please give it a try.
Comment 46 Doug Smythies 2017-02-08 21:56:02 UTC
I have made a 4.10-rc7 kernel for Kadir with Chen's new patch. However I used the two patch set from the linux-pm mailing list, and not the single patch posted herein.
Comment 47 Chen Yu 2017-02-09 03:37:45 UTC
thanks, Doug.
Comment 48 Kadir 2017-02-10 21:18:01 UTC
With the help of Doug, I tested the latest patch.

Something is not quite right; I don't know if it is supposed to be like that.

I get the following from msr-tools:

After clean restart I get:

sudo rdmsr -a 0x19a
0
0
0
0

While plugged in, suspend and resume, I get:

sudo rdmsr -a 0x19a
0
0
0
0

While plugged in, suspend, then unplug, I get:

sudo rdmsr -a 0x19a
1c
0
0
0


With removed plug, suspend, resume unplugged, I get:

sudo rdmsr -a 0x19a
1c
0
0
0

In this state the min. freq. of the CPU is around 630 MHz and the max freq. is 2365 MHz. This is with the powersave governor. So the CPU does not go to the max CPU freq. of 3300 MHz (with turbo boost).

Is this the expected behaviour? I mean, I could live with it, because the CPU does scale up when needed; it just does not go up to the max freq. or turbo boost.
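
To see at a glance whether all CPUs agree, the per-CPU values can be summarized. The following is a sketch only: summarize_clock_mod is a hypothetical name, and it assumes one hexadecimal value per line without a 0x prefix, as printed by rdmsr -a in this report.

```shell
#!/bin/sh
# summarize_clock_mod (hypothetical helper): read per-CPU values of MSR 0x19a
# from stdin (one hex value per line) and report whether the register is
# clear everywhere, set everywhere, or set on only some CPUs.
summarize_clock_mod() {
    zero=0
    nonzero=0
    while read -r value; do
        # Ignore blank lines in the input.
        [ -z "$value" ] && continue
        if [ "$((0x$value))" -eq 0 ]; then
            zero=$((zero + 1))
        else
            nonzero=$((nonzero + 1))
        fi
    done
    if [ "$nonzero" -eq 0 ]; then
        echo "clear on all CPUs"
    elif [ "$zero" -eq 0 ]; then
        echo "set on all CPUs"
    else
        echo "mixed: set on $nonzero of $((zero + nonzero)) CPUs"
    fi
}
```

Usage: sudo rdmsr -a 0x19a | summarize_clock_mod. For the output above (1c, 0, 0, 0) it reports a mixed state with the register set on 1 of 4 CPUs.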
Comment 49 Kadir 2017-02-10 21:46:10 UTC
Upon further testing, I noticed some more strange things.

If the laptop is plugged in and you unplug the laptop (while NOT in suspend, so when the laptop is fully powered on) the freq. dips below 800 Mhz and goes down to 630 Mhz for a couple of seconds and sudo rdmsr -a 0x19a first shows:

1c
1c
1c
1c

and then

1e
1e
1e
1e 

and when it normalizes after a couple of seconds and freq. is again at 800 Mhz it shows

0
0
0
0

Now I notice that while plugged in, suspend, then unplug, I get different results, sometimes 1c, sometimes 1e and sometimes 0

Very strange, the CPU freq. is quite unstable while plugging in and out and suspending and resuming.
Comment 50 Doug Smythies 2017-02-10 21:52:57 UTC
@Kadir,
Thanks very much for doing the test.
Yes, during suspend to memory, CPU 0 doesn't actually shutdown, and so isn't executing the recovery path upon resume (or so I think after a quick look at the patch). I assume the higher CPU frequency you are sometimes observing is when CPU 0 is idle, thus giving up its vote to the PLL. Under full load, where CPU 0 never goes into idle, I think the max CPU frequency will suffer. I'll create that scenario on my test computer shortly to verify. I have never mixed clock modulation on some cores and not others before.

Oh, we had an edit collision: I have not heard of the situation you describe, unplugging without suspend, before, which doesn't mean it hasn't always been there, just that no one reported it before.
Comment 51 Doug Smythies 2017-02-10 23:12:44 UTC
O.K. via my test computer, I confirm that mixed clock modulation, on some CPUs and not others, results in a variety of CPU frequencies, depending on which CPUs are in deep C states or not.

Under this condition the IA32_PERF_STATUS_MSR doesn't really tell the truth (in my opinion), as it doesn't seem to reflect the clock modulation stuff (no load):

# rdmsr --bitfield 15:8 -d -a 0x198
16
16
16
16
16
16
16
16

Whereas, if all CPUs are set to use clock modulation, then it does reflect it properly (no load):

# rdmsr --bitfield 15:8 -d -a 0x198
12
12
12
12
12
12
12
12
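
As a rough illustration of how the readings above translate to a frequency, here is a minimal sketch. It assumes that bits 15:8 of IA32_PERF_STATUS (MSR 0x198) hold the current ratio and that the ratio is multiplied by a 100 MHz bus clock, which is typical of Sandy Bridge and later processors but is an assumption about this test computer. The helper names and the raw register values are hypothetical.

```python
BUS_CLOCK_MHZ = 100  # assumption: 100 MHz bus clock (Sandy Bridge and later)


def bitfield(value: int, high: int, low: int) -> int:
    """Extract bits high:low (inclusive), like `rdmsr --bitfield high:low`."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


def perf_status_mhz(raw: int) -> int:
    """Convert a raw IA32_PERF_STATUS value to MHz via the ratio in bits 15:8."""
    return bitfield(raw, 15, 8) * BUS_CLOCK_MHZ


# Hypothetical raw values whose bits 15:8 are 16 and 12, as in the output above.
print(perf_status_mhz(0x1000))  # ratio 16
print(perf_status_mhz(0x0C00))  # ratio 12
```

Under these assumptions, a reading of 16 corresponds to 1600 MHz and a reading of 12 to 1200 MHz.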
Comment 52 Chen Yu 2017-02-13 03:45:15 UTC
(In reply to Doug Smythies from comment #50)
> @Kadir,
> Thanks very much for doing the test.
> Yes, during suspend to memory, CPU 0 doesn't actually shut down, and so isn't
> executing the recovery path upon resume (or so I think after a quick look at
> the patch).
[cut]
You are right, I've made a mistake, I forgot the case for CPU0. Let me think about it...
Comment 53 Chen Yu 2017-02-13 09:06:57 UTC
Created attachment 254719 [details]
version 4 (clear tstate after resumed back)

Here's a refreshed version that fixes the issue of the boot CPU not being taken care of in the previous version. Doug, would you please help review it before I send it out? Thanks.
Comment 54 Doug Smythies 2017-02-13 15:49:35 UTC
@Chen: I am not very familiar with this area of the code, and cannot comment just by looking at it. I cannot even make a test scenario with my test computer, because it does clear any clock modulation settings upon resume from suspend.

@Kadir: Hoping you are still willing to help. I'll compile you a kernel 4.10-rc8, using the Ubuntu kernel configuration, but with Chen's patches. I'll put it in the usual spot, and e-mail you. It will take me a few hours to get it done.
Comment 55 Kadir 2017-02-13 18:19:56 UTC
Of course I am willing to help! Let me know when the kernel is ready
Comment 56 Doug Smythies 2017-02-13 22:18:06 UTC
(In reply to Kadir from comment #55)
> Of course I am willing to help! Let me know when the kernel is ready

The kernel is in the usual location. Look for:

linux-headers-4.10.0-rc8-chen_4.10.0-rc8-chen-196_amd64.deb
linux-image-4.10.0-rc8-chen_4.10.0-rc8-chen-196_amd64.deb

There is only the patch from Chen from comment 53 applied to this one.
Comment 57 Kadir 2017-02-16 07:57:36 UTC
I did some testing with this latest patch and kernel, thank you Chen for the patch and thank you Doug for the test kernel.

The results are good. In all situations I get:

sudo rdmsr -a 0x19a
0
0
0
0

So that is very good!

I did notice the same behavior as mentioned in comment 49. When the laptop is plugged in and you are using it as normal and unplug it while in use, the CPU frequency drops slightly and normalizes after 2-3 seconds. In that case sudo rdmsr -a 0x19a ranges from 1c to 1e and finally 0.

I also noticed that this behavior is present with my main distro (Fedora 25) with kernel 4.9.9-200.fc25. I don't think it is a problem, I am just curious as to why this happens. It is not a problem at all with normal use.

So in conclusion, this latest patch resolves this bug and I am quite happy with it.
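
The all-zero check used in this test could be automated along the following lines. This is only a sketch: it assumes that `rdmsr -a 0x19a` prints one hex value per CPU, in CPU order, as in the output above, and that bit 4 of IA32_CLOCK_MODULATION is the enable bit. The function name and the sample strings are hypothetical.

```python
from typing import List


def throttled_cpus(rdmsr_output: str) -> List[int]:
    """Return the indices of CPUs whose clock-modulation enable bit is set."""
    values = [line.strip() for line in rdmsr_output.splitlines() if line.strip()]
    return [cpu for cpu, text in enumerate(values) if int(text, 16) & 0x10]


# Sample inputs: a clean result, and a case where only CPU 0 is left throttled.
print(throttled_cpus("0\n0\n0\n0\n"))
print(throttled_cpus("1c\n0\n0\n0\n"))
```

An empty list matches the good result reported here; a non-empty list names the CPUs that still have clock modulation enabled after resume.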
Comment 58 Doug Smythies 2017-02-17 15:20:18 UTC
@Kadir: Thanks for doing the test.

(In reply to Kadir from comment #57)
> I did notice the same behavior as mentioned in comment 49. When the laptop
> is plugged in and you are using it as normal and unplug it while in use, the
> CPU frequency drops slightly and normalizes after 2-3 seconds. In that case
> sudo rdmsr -a 0x19a ranges from 1c to 1e and finally 0.
> 
Hmmm... I wonder if this behavior changes at all as a function of battery level?
Did you do these tests at or near 100% battery? I wonder what happens at, say, 20% battery level.
Comment 59 Kadir 2017-02-27 08:53:50 UTC
The tests were done at/near 100% battery. This behavior is also present on my main distro, Fedora 25, running 4.9.11-200.fc25.x86_64.

It does not bother me much, but I sometimes notice it, e.g. as animations that stutter/lag when unplugging.
Comment 60 Victor Trac 2017-02-27 17:19:35 UTC
I tested the Version 4 patch on the DRM Intel Nightly kernel built on Feb 21. I believe I have isolated the issue to when my XPS 13 (9350) is plugged in using a Kensington SD4600P USB-C powered dock (which includes using 4k mini-DP at 60 Hz, some USB wireless devices, and a USB gigabit NIC) AFTER resuming. If I do a fresh boot while plugged into USB-C power, everything works fine. If I sleep and then resume, and then plug into USB-C, the CPU frequency drops to less than 400 MHz (including 90-120 MHz, which makes the laptop unusable).

I have not been able to reproduce the problem when using the Dell AC adapter without anything plugged into the USB-C port. 

Another interesting tidbit is that if I go to my BIOS and disable Intel SpeedStep, then the CPU never drops below 400 MHz (but it also maxes out at 2.2 GHz instead of the 3.1 GHz that the hardware supports, and my battery life is generally bad).

So the issue seems to be a combination of sleep/resume, SpeedStep, and USB-C. I hope that helps.

Also, I had this message show up while resuming on USB-C:

[178087.133316] usb usb1: root hub lost power or was reset
[178087.133321] usb usb2: root hub lost power or was reset
[178087.148186] [drm] GuC firmware load skipped
[178087.151461] rtc_cmos 00:01: System wakeup disabled by ACPI
[178087.532050] usb 1-4: reset full-speed USB device number 5 using xhci_hcd
[178087.888994] usb 1-3: reset full-speed USB device number 8 using xhci_hcd
[178088.245530] usb 1-5: reset high-speed USB device number 6 using xhci_hcd
[178088.462532] [drm] RC6 on
[178089.661661] ACPI Error: [SPRT] Namespace lookup failure, AE_ALREADY_EXISTS (20160930/dswload2-330)
[178089.661675] ACPI Exception: AE_ALREADY_EXISTS, During name lookup/catalog (20160930/psobject-227)
[178089.661681] ACPI Error: Method parse/execution failed [\_GPE._E42] (Node ffff88046e0d7258), AE_ALREADY_EXISTS (20160930/psparse-543)
[178089.661696] ACPI Error: Method parse/execution failed [\_GPE._E42] (Node ffff88046e0d7258), AE_ALREADY_EXISTS (20160930/psparse-543)
[178089.661715] ACPI Exception: AE_ALREADY_EXISTS, while evaluating GPE method [_E42] (20160930/evgpe-646)
[178090.060264] PM: resume of devices complete after 2927.349 msecs
[178090.060463] usb 1-3:1.0: rebind failed: -517
[178090.060472] usb 1-3:1.1: rebind failed: -517
[178090.061459] PM: Finishing wakeup.
Comment 61 Victor Trac 2017-03-27 21:48:00 UTC
I have confirmed that the problem ONLY happens when all of these are true, even with the latest patch:

* Resumed from sleep
* Intel SpeedStep is enabled in BIOS
* Laptop powered from USB-C via Kensington SD4600P USB-C hub
* Nexus 6P plugged into Kensington SD4600P

If I'm on a fresh boot, this doesn't happen, regardless of whether my Nexus 6P is plugged in or not. So it seems like when I plug in my phone, the SD4600P must be reporting some lower power available to the laptop, which is causing the kernel/BIOS to drop the CPU too low. What makes this seem like a kernel issue, however, is that it only happens after a sleep/resume cycle.
Comment 62 Doug Smythies 2017-03-27 22:15:09 UTC
@Victor: Thanks. There is another odd thing with Dell laptops and power that Chen's patch will not solve. Before your report, I had only heard of Dell laptops complaining, and enabling Clock Modulation, when they didn't detect the proper ID pin stuff from a Dell power adapter. I mean, in addition to the well-known resume-from-suspend-on-battery issue.

There was a version 5 of Chen's patch on the e-mail list that maybe you could try.
It might be easier if I attach it rather than try to give links to the list item or the patchwork item.

Just a few hours ago, I built a kernel 4.11-rc4 with Chen's version 5 patch applied for @Kadir to try (even though that patch was sent to the list a couple of weeks ago).
Comment 63 Doug Smythies 2017-03-27 22:16:53 UTC
Created attachment 255581 [details]
Version 5 of Chen's patch
Comment 64 Victor Trac 2017-03-29 16:12:59 UTC
@Doug:
Thanks for the new patch. I tried it on the March 27th nightly:

$ uname -a
Linux callisto 4.11.0-1-drm-intel-nightly #36 SMP PREEMPT Mon Mar 27 17:41:36 CDT 2017 x86_64 GNU/Linux


Unfortunately, I still see the problem after plugging my Nexus 6P into the SD4600P port. After a few minutes (it seems random), my CPU starts to go down to 100-300 MHz. I notice when my computer starts stuttering. It immediately stops when I unplug the phone.

What may be interesting is that if I plug my Nexus 6P directly into my XPS 13 on the USB 3.0 USB-A port using a USB-A <-> USB-C cable, then I don't have any issues.
Comment 65 Kadir 2017-03-31 18:21:26 UTC
Version 5 of the patch is perfect. It solves the original bug for me. 

Also the small bug/annoyance mentioned in comment 57 (https://bugzilla.kernel.org/show_bug.cgi?id=90041#c57) is fixed.
Comment 66 Len Brown 2017-04-04 00:21:38 UTC
for the record: chen-yu's patch is not upstream as of 4.11-rc5
and is not in linux-next, as of today
Comment 67 Doug Smythies 2017-04-20 18:22:34 UTC
(In reply to Len Brown from comment #66)
> for the record: chen-yu's patch is not upstream as of 4.11-rc5
> and is not in linux-next, as of today

Yes, Len, it was sent to the e-mail list, but nothing ever happened after that.

@Chen: At this point, I would suggest a RESEND.
Comment 68 Chen Yu 2017-04-24 02:02:28 UTC
I noticed that it has been marked as 'Deferred'
https://patchwork.kernel.org/patch/9629003/
I'll sync with Len/Rafael online.
Comment 69 Doug Smythies 2017-08-21 22:04:02 UTC
@Chen: I believe that pending changes to the intel_pstate CPU frequency scaling driver, proposed by Rafael, I think for kernel 4.14-rc1, will eliminate the ongoing troubles with Clock Modulation and the driver.

I'm saying that the issue will no longer exist, and that the intel_pstate CPU frequency scaling driver will respond "properly" to Clock Modulation events. Of course, I'll check when kernel 4.14-rc1 becomes available, or before if I can properly apply all the patches.

For reference see the patch set related to:

[PATCH 0/2] cpufreq: intel_pstate: Eliminate the PID controller
2017.07.24 03:22 (or 10:22 UTC, I think)

https://marc.info/?l=linux-pm&m=150093486908759&w=2
https://marc.info/?l=linux-pm&m=150093484308751&w=2
https://marc.info/?l=linux-pm&m=150093486808758&w=2

Note 1: I made this exact same comment on bug 189861 a few minutes ago.
Note 2: Victor seems to have some additional weird problems, so I am not certain about his particular use case.
