Bug 59571

Summary: CPU overheating on Dell XPS L502x laptop
Product: ACPI
Reporter: Pip (RayFredPip)
Component: Power-Processor
Assignee: Lan Tianyu (tianyu.lan)
Status: CLOSED DUPLICATE
Severity: high
CC: tianyu.lan
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 3.9.5-1-ARCH (Archlinux stock kernel)
Subsystem:
Regression: No
Bisected commit-id:
Attachments: The names of files explain themselves
lscpu.txt
i7z.txt
sensors.txt
grub_config.txt
screenshot of htop when overheating
powertop overview 1
powertop overview 2
powertop overview 3
powertop device stats 1
powertop device stats 2
powertop idle stats
powertop frequency stats
powertop tunables
dmesg output while overheating
grep sys_cpufreq_overheating

Description Pip 2013-06-11 00:50:34 UTC
Created attachment 104261 [details]
The names of files explain themselves

This is a Dell XPS L502x laptop with an Intel i7 quad-core Sandy Bridge CPU. For no apparent reason, the CPU randomly starts running at full power and overheats, which makes the fan spin at full speed with a lot of noise. Even so, the temperature barely drops, and the computer sometimes shuts itself down after reaching the critical point of 100 degrees Celsius.
It seems that the intel_pstate driver built into the kernel is buggy. The only workaround I have found in practice is to reboot or suspend the machine so that it cools down a little. There appears to be a flaw in the algorithm that controls CPU frequency scaling.
I have not observed any regular pattern yet, but the CPU overheating problem definitely exists. I don't know what causes the CPU to do heavy computing and overheat.
Comment 1 Pip 2013-06-11 00:51:31 UTC
Created attachment 104271 [details]
lscpu.txt
Comment 2 Pip 2013-06-11 00:51:47 UTC
Created attachment 104281 [details]
i7z.txt
Comment 3 Pip 2013-06-11 00:52:05 UTC
Created attachment 104291 [details]
sensors.txt
Comment 4 Pip 2013-06-11 00:52:30 UTC
Created attachment 104301 [details]
grub_config.txt
Comment 5 Lan Tianyu 2013-06-13 03:00:11 UTC
Hi, could you provide what powertop and htop show during overheating
(including these tabs: Overview, Idle stats, Frequency stats, Device stats, Tunables)?
Comment 6 Lan Tianyu 2013-06-13 03:04:23 UTC
BTW, does this bug happen in the mainline(v3.10-rc5) kernel?
Comment 7 Pip 2013-06-13 21:55:26 UTC
Well, I switched to the 3.9.1-1 kernel about 2 days ago, and so far so good: `cpupower` shows that the built-in intel_pstate driver works well. So there must be something different between this kernel and the latest 3.9.5-1 stock kernel.
I haven't tried the 3.10 kernel yet because it's not in the stable repository of my distribution (Archlinux).
Comment 8 Lan Tianyu 2013-06-14 05:49:46 UTC
This seems to be a regression between stable 3.9.1 and stable 3.9.5. Could you narrow the gap further or do a bisect? From your description, there is no direct way to reproduce the bug since it occurs randomly, right?

It would also be good to check whether the issue exists in the latest mainline kernel, v3.10-rc5, because it may already have been resolved there.
Comment 9 Pip 2013-06-19 19:30:09 UTC
I have tested the 3.9.6 stock kernel from my distribution and the problem remains, so I will have to wait for the next mainline release, 3.10, to be ready.
For now I'm running 3.9.1, which works very well. No such problem observed.
Comment 10 Lan Tianyu 2013-06-20 03:37:21 UTC
(In reply to comment #9)
> I have tested the 3.9.6 stock kernel from my distribution and the problem
> remains, so I will have to wait for the next mainline release, 3.10, to be
> ready.
> For now I'm running 3.9.1, which works very well. No such problem observed.
Hi,
   Could you do a bisect between 3.9.1 and 3.9.5? Or, since the bug is hard to reproduce, just try 3.9.2, 3.9.3, and so on.
Then we can find the first bad version.


(In reply to comment #5)
> Hi, could you provide what powertop and htop show during overheating
> (including these tabs: Overview, Idle stats, Frequency stats, Device stats,
> Tunables)?
Could you provide this info?
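The narrowing-down requested above can be sketched as a binary search over the stable releases. This is only an illustration, not part of the original thread: `first_bad_version` and the `is_bad` callback are hypothetical names, and the sketch assumes that once a release is bad, every later release is bad too. Because the overheating appears randomly, a "good" verdict from a short test run is not reliable, so each release would need to be run long enough to be reasonably sure.

```python
from typing import Callable, Optional, Sequence


def first_bad_version(
    versions: Sequence[str], is_bad: Callable[[str], bool]
) -> Optional[str]:
    """Return the first version for which is_bad() is true, or None.

    Assumes `versions` is ordered oldest to newest and that every version
    after the first bad one is also bad (the usual bisect assumption).
    """
    if not versions or not is_bad(versions[-1]):
        return None  # nothing to search, or the newest version is still good
    lo, hi = 0, len(versions) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid  # the first bad version is mid or earlier
        else:
            lo = mid + 1  # the first bad version is after mid
    return versions[lo]


if __name__ == "__main__":
    releases = ["3.9.1", "3.9.2", "3.9.3", "3.9.4", "3.9.5"]
    # Hypothetical verdicts, for illustration only; the real ones come
    # from booting each kernel and watching for the overheating.
    observed_bad = {"3.9.4", "3.9.5"}
    print(first_bad_version(releases, lambda v: v in observed_bad))
```

With five releases this needs at most three test boots instead of trying each release in turn.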
Comment 11 Pip 2013-06-20 20:44:33 UTC
Created attachment 105471 [details]
screenshot of htop when overheating
Comment 12 Pip 2013-06-20 20:45:10 UTC
Created attachment 105481 [details]
powertop overview 1
Comment 13 Pip 2013-06-20 20:45:27 UTC
Created attachment 105491 [details]
powertop overview 2
Comment 14 Pip 2013-06-20 20:45:47 UTC
Created attachment 105501 [details]
powertop overview 3
Comment 15 Pip 2013-06-20 20:46:33 UTC
Created attachment 105511 [details]
powertop device stats 1
Comment 16 Pip 2013-06-20 20:46:56 UTC
Created attachment 105521 [details]
powertop device stats 2
Comment 17 Pip 2013-06-20 20:47:20 UTC
Created attachment 105531 [details]
powertop idle stats
Comment 18 Pip 2013-06-20 20:47:45 UTC
Created attachment 105541 [details]
powertop frequency stats
Comment 19 Pip 2013-06-20 20:48:20 UTC
Created attachment 105551 [details]
powertop tunables
Comment 20 Pip 2013-06-20 20:50:22 UTC
These screenshots were taken while the laptop was overheating on stock kernel 3.9.6-1.
***********************************************************
pip@XPS-Pip ~ % sensors
acpitz-virtual-0
Adapter: Virtual device
temp1:        +84.0°C  (crit = +100.0°C)
temp2:        +84.0°C  (crit = +100.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +82.0°C  (high = +86.0°C, crit = +100.0°C)
Core 0:         +77.0°C  (high = +86.0°C, crit = +100.0°C)
Core 1:         +77.0°C  (high = +86.0°C, crit = +100.0°C)
Core 2:         +77.0°C  (high = +86.0°C, crit = +100.0°C)
Core 3:         +78.0°C  (high = +86.0°C, crit = +100.0°C)
***********************************************************
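To compare readings like these across kernels, the remaining margin to the critical temperature can be extracted from the `sensors` text. The following is a rough sketch, not something used in this report: `margins_to_critical` is a hypothetical helper, and the regular expression assumes the exact line layout shown above (label, current temperature, then a `crit = ...` limit in parentheses).

```python
import re

# Matches lines such as
#   "Core 0:         +77.0°C  (high = +86.0°C, crit = +100.0°C)"
#   "temp1:        +84.0°C  (crit = +100.0°C)"
_SENSOR_LINE = re.compile(
    r"^(?P<label>[^:]+):\s+\+?(?P<temp>-?\d+(?:\.\d+)?)°C"
    r".*?crit = \+?(?P<crit>-?\d+(?:\.\d+)?)°C"
)


def margins_to_critical(sensors_text):
    """Map each sensor label to (critical - current) in degrees Celsius.

    Lines that do not carry both a reading and a crit limit are ignored.
    If two adapters reuse a label, the later one overwrites the earlier.
    """
    margins = {}
    for line in sensors_text.splitlines():
        match = _SENSOR_LINE.match(line.strip())
        if match:
            label = match.group("label").strip()
            margins[label] = float(match.group("crit")) - float(match.group("temp"))
    return margins
```

Feeding it the output above would give one margin per sensor label, which makes it easier to see how close each core is to the 100°C shutdown point.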
Comment 21 Lan Tianyu 2013-06-21 02:53:04 UTC
(In reply to comment #18)
> Created an attachment (id=105541) [details]
> powertop frequency stats
This output is strange. No available scaling CPU frequencies are listed.

Please provide the output of "grep . /sys/bus/cpu/devices/cpu*/cpufreq/*".

Could you provide some syslog/dmesg output captured while it is overheating? There may be some clue in it.
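For collecting the same sysfs values repeatedly (for example once while cool and once while overheating), the grep above can be approximated in a few lines of Python. This is a sketch only: `read_cpufreq` is a hypothetical helper, the default path is the one from the grep command, and which attribute files exist depends on the cpufreq driver in use.

```python
import glob
import os


def read_cpufreq(base="/sys/bus/cpu/devices"):
    """Return {path: contents} for every readable cpufreq attribute file."""
    values = {}
    pattern = os.path.join(base, "cpu*", "cpufreq", "*")
    for path in sorted(glob.glob(pattern)):
        if not os.path.isfile(path):
            continue  # skip subdirectories such as stats/
        try:
            with open(path) as fh:
                values[path] = fh.read().strip()
        except OSError:
            # Some attributes are readable only by root or not at all.
            continue
    return values


if __name__ == "__main__":
    for path, value in read_cpufreq().items():
        print(f"{path}:{value}")
```

Saving two such snapshots and diffing them shows which attributes change between the normal and the overheating state.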
Comment 22 Pip 2013-06-24 11:30:43 UTC
Created attachment 105841 [details]
dmesg output while overheating
Comment 23 Pip 2013-06-24 11:31:17 UTC
Created attachment 105851 [details]
grep sys_cpufreq_overheating
Comment 24 Pip 2013-06-24 11:34:07 UTC
This time on the 3.9.7-1 stock kernel, the overheating problem remains. Here is the output from the command 'sensors':
acpitz-virtual-0
Adapter: Virtual device
temp1:        +84.0°C  (crit = +100.0°C)
temp2:        +84.0°C  (crit = +100.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +85.0°C  (high = +86.0°C, crit = +100.0°C)
Core 0:         +78.0°C  (high = +86.0°C, crit = +100.0°C)
Core 1:         +81.0°C  (high = +86.0°C, crit = +100.0°C)
Core 2:         +81.0°C  (high = +86.0°C, crit = +100.0°C)
Core 3:         +82.0°C  (high = +86.0°C, crit = +100.0°C)
**********************************************************
And here is the output from powertop Frequency stats:

            Package |             Core    |            CPU 0       CPU 1
                    |                     | Actual    3.0 GHz     3.0 GHz
Idle       100.0%   | Idle       100.0%   | Idle       100.0%      100.0%

                    |             Core    |            CPU 2       CPU 3
                    |                     | Actual    3.1 GHz     3.1 GHz
                    | Idle       100.0%   | Idle       100.0%      100.0%

                    |             Core    |            CPU 4       CPU 5
                    |                     | Actual    3.0 GHz     3.0 GHz
                    | Idle       100.0%   | Idle       100.0%      100.0%

                    |             Core    |            CPU 6       CPU 7
                    |                     | Actual    3.0 GHz     3.1 GHz
                    | Idle       100.0%   | Idle       100.0%      100.0%
Comment 25 Pip 2013-06-24 11:35:34 UTC
Output from cpupower while overheating
----------------
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 0.97 ms.
  hardware limits: 800 MHz - 3.10 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 800 MHz and 3.10 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 2.93 GHz (asserted by call to hardware).
  boost state support:
    Supported: yes
    Active: yes
    2800 MHz max turbo 4 active cores
    2800 MHz max turbo 3 active cores
    3000 MHz max turbo 2 active cores
    3100 MHz max turbo 1 active cores
*************************************
Comment 26 Pip 2013-07-10 13:12:11 UTC
This is the real-time output from i7z during the overheating problem on kernel 3.9.9:
Cpu speed from cpuinfo 2195.00Mhz
cpuinfo might be wrong if cpufreq is enabled. To guess correctly try estimating
Linux's inbuilt cpu_khz code emulated now
True Frequency (without accounting Turbo) 2194 MHz
  CPU Multiplier 22x || Bus clock frequency (BCLK) 99.73 MHz

Socket [0] - [physical cores=4, logical cores=8, max online cores ever=4]
  TURBO ENABLED on 4 Cores, Hyper Threading ON
  Max Frequency without considering Turbo 2293.73 MHz (99.73 x [23])
  Max TURBO Multiplier (if Enabled) with 1/2/3/4 Cores is  31x/30x/28x/28x
  Real Current Frequency 3048.42 MHz [99.73 x 30.57] (Max of below)
        Core [core-id]  :Actual Freq (Mult.)      C0%   Halt(C1)%  C3 %   C6 %
        Core 1 [0]:       3048.42 (30.57x)      1.28       0       1       0
        Core 2 [2]:       2992.03 (30.00x)         1    0.881      1       0
        Core 3 [4]:       3015.05 (30.23x)      10.5       0       0       0
        Core 4 [6]:       2953.21 (29.61x)      1.73    0.235      1       0
******************************************************************************
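The frequencies i7z reports follow from the bus clock times the multiplier. As a quick sanity check of the figures above (a sketch only; `core_frequency_mhz` is a hypothetical helper, and small differences are expected because the displayed BCLK and multipliers are rounded):

```python
def core_frequency_mhz(bclk_mhz, multiplier):
    """Core frequency in MHz as bus clock (BCLK) times the multiplier."""
    return bclk_mhz * multiplier


if __name__ == "__main__":
    bclk = 99.73  # BCLK in MHz, as printed by i7z above
    # Non-turbo maximum (23x) versus the multiplier seen on Core 1 (30.57x).
    print(round(core_frequency_mhz(bclk, 23), 2))
    print(round(core_frequency_mhz(bclk, 30.57), 2))
```

The point of interest is that every core sits near the 30x turbo multiplier even though the C0% column shows the cores doing very little work.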
I understand that this problem is very difficult to pin down, and it is even possible that there is a hardware bug in the CPU design. And this is the output from "powertop"--Frequency during overheating:
          Package |             Core    |            CPU 0       CPU 1
                  |                     | Actual    2.9 GHz     2.9 GHz
le       100.0%   | Idle       100.0%   | Idle       100.0%      100.0%
00 MHz     0.0%   |                     |

                  |             Core    |            CPU 2       CPU 3
                  |                     | Actual    3.0 GHz     2.9 GHz
                  | Idle       100.0%   | Idle       100.0%      100.0%
                  |                     |

                  |             Core    |            CPU 4       CPU 5
                  |                     | Actual    2.8 GHz     3.0 GHz
                  | Idle       100.0%   | Idle       100.0%      100.0%
                  |                     |

                  |             Core    |            CPU 6       CPU 7
                  |                     | Actual    2.9 GHz     2.8 GHz
                  | Idle       100.0%   | Idle       100.0%      100.0%
*****************************************************************************
And this is the output from "powertop"--Tunables during overheating:
>> Bad           Wireless Power Saving for interface wlan0                      
   Bad           NMI watchdog should be turned off
   Bad           VM writeback timeout
   Bad           Enable SATA link power Managmenet for host0
   Bad           Enable SATA link power Managmenet for host1
   Bad           Enable SATA link power Managmenet for host2
   Bad           Enable SATA link power Managmenet for host3
   Bad           Enable SATA link power Managmenet for host4
   Bad           Enable SATA link power Managmenet for host5
   Bad           Enable Audio codec power management
   Bad           Autosuspend for USB device 2.4G Keyboard Mouse [MOSART Semi.]
   Bad           Autosuspend for unknown USB device 4-1.5 (8086:0189)
   Bad           Runtime PM for PCI Device Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 4
   Bad           Runtime PM for PCI Device Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 5
   Bad           Runtime PM for PCI Device Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 2
   Bad           Runtime PM for PCI Device Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 6
   Bad           Runtime PM for PCI Device Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Control
   Bad           Runtime PM for PCI Device NEC Corporation uPD720200 USB 3.0 Host Controller
   Bad           Runtime PM for PCI Device Intel Corporation HM67 Express Chipset Family LPC Controller
   Bad           Runtime PM for PCI Device Intel Corporation 6 Series/C200 Series Chipset Family 6 port SATA AHCI Controll
   Bad           Runtime PM for PCI Device Intel Corporation 6 Series/C200 Series Chipset Family SMBus Controller
   Bad           Runtime PM for PCI Device NVIDIA Corporation GF108M [GeForce GT 540M]
   Bad           Runtime PM for PCI Device Intel Corporation Centrino Wireless-N 1030 [Rainbow Peak]
   Bad           Runtime PM for PCI Device Intel Corporation 6 Series/C200 Series Chipset Family MEI Controller #1
   Bad           Runtime PM for PCI Device Realtek Semiconductor Co., Ltd. RTL8111/8168 PCI Express Gigabit Ethernet contr
   Bad           Runtime PM for PCI Device Intel Corporation 6 Series/C200 Series Chipset Family High Definition Audio Con
   Bad           Runtime PM for PCI Device Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Control
   Bad           Runtime PM for PCI Device Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Cont
   Bad           Runtime PM for PCI Device Intel Corporation Xeon E3-1200/2nd Generation Core Processor Family PCI Express
   Bad           Runtime PM for PCI Device Intel Corporation 2nd Generation Core Processor Family DRAM Controller
   Bad           Runtime PM for PCI Device Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 1
   Bad           Wake-on-lan status for device eth0
********************************************************************************
And this is the output from "cpupower frequency-info" during overheating:
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 0.97 ms.
  hardware limits: 800 MHz - 3.10 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 800 MHz and 3.10 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 3.01 GHz (asserted by call to hardware).
  boost state support:
    Supported: yes
    Active: yes
    2800 MHz max turbo 4 active cores
    2800 MHz max turbo 3 active cores
    3000 MHz max turbo 2 active cores
    3100 MHz max turbo 1 active cores
********************************************************************************
I hope Intel will fix this bug in the next release, thank you very much :-)
Comment 27 Lan Tianyu 2013-07-15 03:44:02 UTC
Hi, please check whether the following patch fixes your issue.
Recently, we found that some high CPU frequency issues are related to the i915 driver.
https://bugzilla.kernel.org/attachment.cgi?id=105901
Comment 28 Pip 2013-07-15 20:09:55 UTC
Alright, I will tell you the result of my testing soon. Thank you :-)
Comment 29 Pip 2013-07-24 23:33:59 UTC
Okay, I've been testing the patch you suggested for several days now, and the good news is that the heating issue has not come back, at least not that I have observed. So it works. I hope you guys can improve the driver in the new release. Good job, guys :-)
Comment 30 Lan Tianyu 2013-07-25 01:29:59 UTC
OK, thanks for testing. Marking this bug as a duplicate.

*** This bug has been marked as a duplicate of bug 58971 ***