Created attachment 104261 [details] The names of files explain themselves
This is a Dell XPS L502x laptop with an Intel i7 quad-core Sandy Bridge CPU. For no apparent reason, the CPU randomly runs at full power and overheats, which makes the fan spin at full speed with a lot of noise. Even so, the temperature does not come down much; instead, the computer sometimes shuts itself down after reaching the critical point of 100 degrees Celsius. It seems that the intel_pstate driver built into the kernel is buggy. The only workaround I have found in practice is to reboot or suspend the machine, which cools it down a little. It looks like there is a gap in the algorithm that controls CPU frequency scaling. I have not observed any regular pattern yet, but the overheating problem does exist. I don't know what makes the CPU work so hard and overheat.
Created attachment 104271 [details] lscpu.txt
Created attachment 104281 [details] i7z.txt
Created attachment 104291 [details] sensors.txt
Created attachment 104301 [details] grub_config.txt
Hi, could you provide what powertop and htop show during overheating? (Including these tabs: Overview, Idle stats, Frequency stats, Device stats, Tunables.)
BTW, does this bug happen in the mainline (v3.10-rc5) kernel?
Well, I switched to the 3.9.1-1 kernel about 2 days ago, and so far so good: `cpupower` shows that the built-in intel_pstate driver works well. So something must have changed between this kernel and the latest 3.9.5-1 stock kernel. I haven't tried the 3.10 kernel yet because it's not in the stable repository of my distribution (Arch Linux, actually).
This seems to be a regression between stable 3.9.1 and stable 3.9.5. Could you narrow the gap further or do a bisect? From your description, there is no direct way to reproduce the bug since it occurs randomly, right? It would be better to check whether the issue exists in the latest mainline kernel, v3.10-rc5, because it may already have been resolved there.
I have tested the 3.9.6 stock kernel from my distribution and the problem remains, so I have to wait for the next mainline release, 3.10, to be ready. For now I'm running 3.9.1, which works very well; no such problem observed.
(In reply to comment #9)
> I have tested the 3.9.6 stock kernel from my distribution and the problem
> remains, so I have to wait for the next main line release of 3.10 which will
> be ready.
> Now I'm running on 3.9.1 which works very well. No such problem observed.

Hi, could you do a bisect between 3.9.1 and 3.9.5? Or, since the bug is hard to reproduce, just try 3.9.2, 3.9.3, ... one by one. Then we can find the first bad version.

(In reply to comment #5)
> Hi, Could you provide what powertop and htop show during overheating?
> (including these tab, Overview Idle stats Frequency stats Device stats
> Tunables).

Could you provide this info?
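The search for the first bad version requested above can be sketched as a binary search over the stable releases. This is only an illustration: the function name `first_bad` and the `is_bad` callback are hypothetical, and the sketch assumes that the first version in the list is known good, the last is known bad, and every version after the first bad one is also bad. Because the overheating is random, each "good" verdict in practice needs a long enough test run to be trusted.

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad version.

    Assumes versions[0] is known good, versions[-1] is known bad,
    and the bug never disappears again once it has been introduced.
    """
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return versions[hi]


if __name__ == "__main__":
    stable = ["3.9.1", "3.9.2", "3.9.3", "3.9.4", "3.9.5"]
    # Hypothetical verdicts, for illustration only; the real ones come
    # from booting each kernel and watching for the overheating.
    observed_bad = {"3.9.4", "3.9.5"}
    print(first_bad(stable, lambda v: v in observed_bad))
```

With five candidate versions this needs at most two test boots instead of three sequential ones; the saving grows with the number of versions or commits in the range.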
Created attachment 105471 [details] screenshot of htop when overheating
Created attachment 105481 [details] powertop overview 1
Created attachment 105491 [details] powertop overview 2
Created attachment 105501 [details] powertop overview 3
Created attachment 105511 [details] powertop device stats 1
Created attachment 105521 [details] powertop device stats 2
Created attachment 105531 [details] powertop idle stats
Created attachment 105541 [details] powertop frequency stats
Created attachment 105551 [details] powertop tunables
These screenshots are taken when the laptop is overheating, running on stock kernel 3.9.6-1.
***********************************************************
pip@XPS-Pip ~ % sensors
acpitz-virtual-0
Adapter: Virtual device
temp1:          +84.0°C  (crit = +100.0°C)
temp2:          +84.0°C  (crit = +100.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +82.0°C  (high = +86.0°C, crit = +100.0°C)
Core 0:         +77.0°C  (high = +86.0°C, crit = +100.0°C)
Core 1:         +77.0°C  (high = +86.0°C, crit = +100.0°C)
Core 2:         +77.0°C  (high = +86.0°C, crit = +100.0°C)
Core 3:         +78.0°C  (high = +86.0°C, crit = +100.0°C)
***********************************************************
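To make readings like the ones above easier to compare across kernels, the `sensors` text can be parsed and the remaining margin to each threshold computed. This is a minimal sketch that assumes the line layout shown in this report (`label: +NN.N°C (high = ..., crit = ...)`); the names `parse_sensors` and `headroom` are hypothetical helpers, not part of lm-sensors.

```python
import re

# One reading line, e.g. "Core 0: +77.0°C (high = +86.0°C, crit = +100.0°C)"
LINE_RE = re.compile(
    r"^(?P<label>[^:]+):\s+\+?(?P<temp>-?\d+(?:\.\d+)?)°C"
    r"(?:\s+\((?P<limits>[^)]*)\))?"
)
# One threshold inside the parentheses, e.g. "crit = +100.0°C"
LIMIT_RE = re.compile(r"(\w+)\s*=\s*\+?(-?\d+(?:\.\d+)?)°C")

SAMPLE = """\
acpitz-virtual-0
Adapter: Virtual device
temp1:          +84.0°C  (crit = +100.0°C)
temp2:          +84.0°C  (crit = +100.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +82.0°C  (high = +86.0°C, crit = +100.0°C)
Core 0:         +77.0°C  (high = +86.0°C, crit = +100.0°C)
Core 1:         +77.0°C  (high = +86.0°C, crit = +100.0°C)
Core 2:         +77.0°C  (high = +86.0°C, crit = +100.0°C)
Core 3:         +78.0°C  (high = +86.0°C, crit = +100.0°C)
"""


def parse_sensors(text):
    """Return a list of (label, temperature, {limit_name: value}) tuples."""
    readings = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # chip names, "Adapter:" lines and blank lines
        limits = {
            name: float(value)
            for name, value in LIMIT_RE.findall(match.group("limits") or "")
        }
        readings.append(
            (match.group("label").strip(), float(match.group("temp")), limits)
        )
    return readings


def headroom(readings, limit="crit"):
    """Degrees left before each sensor reaches the given limit."""
    return {
        label: limits[limit] - temp
        for label, temp, limits in readings
        if limit in limits
    }


if __name__ == "__main__":
    for label, margin in headroom(parse_sensors(SAMPLE)).items():
        print(f"{label}: {margin:.1f} °C below crit")
```

Sensors that do not report the requested limit (the acpitz readings have no `high` value) are simply left out of the result.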
(In reply to comment #18)
> Created an attachment (id=105541) [details]
> powertop frequency stats

This output is strange: it shows no available CPU scaling frequencies. Please provide the output of "grep . /sys/bus/cpu/devices/cpu*/cpufreq/*". Could you also provide the syslog/dmesg output from the time of the overheating? There may be some clues.
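For anyone who wants to capture the same cpufreq information repeatedly (for example, once while idle and once while overheating), the grep above can be approximated in a few lines of Python. This is a sketch only: `read_cpufreq` is a hypothetical helper name, and the set of attribute files under `cpufreq/` depends on the kernel and the scaling driver.

```python
from pathlib import Path


def read_cpufreq(base="/sys/bus/cpu/devices"):
    """Collect every readable cpufreq attribute, keyed by CPU directory name.

    Roughly equivalent to: grep . /sys/bus/cpu/devices/cpu*/cpufreq/*
    """
    result = {}
    for cpu_dir in sorted(Path(base).glob("cpu[0-9]*")):
        freq_dir = cpu_dir / "cpufreq"
        if not freq_dir.is_dir():
            continue  # CPU without a cpufreq policy
        attrs = {}
        for attr in sorted(freq_dir.iterdir()):
            if not attr.is_file():
                continue
            try:
                attrs[attr.name] = attr.read_text().strip()
            except OSError:
                continue  # some attributes are root-only or write-only
        result[cpu_dir.name] = attrs
    return result


if __name__ == "__main__":
    for cpu, attrs in read_cpufreq().items():
        for name, value in attrs.items():
            print(f"{cpu}/cpufreq/{name}: {value}")
```

The `base` parameter exists so the function can be pointed at a copy of the sysfs tree; on a live system the default path is the one used in the grep command.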
Created attachment 105841 [details] dmesg output while overheating
Created attachment 105851 [details] grep sys_cpufreq_overheating
This time on the 3.9.7-1 stock kernel, the overheating problem remains. Here is the output from the command 'sensors':
acpitz-virtual-0
Adapter: Virtual device
temp1:          +84.0°C  (crit = +100.0°C)
temp2:          +84.0°C  (crit = +100.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +85.0°C  (high = +86.0°C, crit = +100.0°C)
Core 0:         +78.0°C  (high = +86.0°C, crit = +100.0°C)
Core 1:         +81.0°C  (high = +86.0°C, crit = +100.0°C)
Core 2:         +81.0°C  (high = +86.0°C, crit = +100.0°C)
Core 3:         +82.0°C  (high = +86.0°C, crit = +100.0°C)
**********************************************************
And here is the output from powertop Frequency stats:
  Package    |     Core     |          CPU 0     CPU 1
             |              |  Actual  3.0 GHz   3.0 GHz
Idle 100.0%  | Idle 100.0%  |  Idle    100.0%    100.0%
             |     Core     |          CPU 2     CPU 3
             |              |  Actual  3.1 GHz   3.1 GHz
             | Idle 100.0%  |  Idle    100.0%    100.0%
             |     Core     |          CPU 4     CPU 5
             |              |  Actual  3.0 GHz   3.0 GHz
             | Idle 100.0%  |  Idle    100.0%    100.0%
             |     Core     |          CPU 6     CPU 7
             |              |  Actual  3.0 GHz   3.1 GHz
             | Idle 100.0%  |  Idle    100.0%    100.0%
Output from cpupower while overheating
----------------
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 0.97 ms.
  hardware limits: 800 MHz - 3.10 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 800 MHz and 3.10 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 2.93 GHz (asserted by call to hardware).
  boost state support:
    Supported: yes
    Active: yes
    2800 MHz max turbo 4 active cores
    2800 MHz max turbo 3 active cores
    3000 MHz max turbo 2 active cores
    3100 MHz max turbo 1 active cores
*************************************
This is the real-time output from i7z during the overheating problem on kernel 3.9.9:
Cpu speed from cpuinfo 2195.00Mhz
cpuinfo might be wrong if cpufreq is enabled. To guess correctly try estimating
Linux's inbuilt cpu_khz code emulated now
True Frequency (without accounting Turbo) 2194 MHz
  CPU Multiplier 22x || Bus clock frequency (BCLK) 99.73 MHz

Socket [0] - [physical cores=4, logical cores=8, max online cores ever=4]
  TURBO ENABLED on 4 Cores, Hyper Threading ON
  Max Frequency without considering Turbo 2293.73 MHz (99.73 x [23])
  Max TURBO Multiplier (if Enabled) with 1/2/3/4 Cores is 31x/30x/28x/28x
  Real Current Frequency 3048.42 MHz [99.73 x 30.57] (Max of below)
        Core [core-id]  :Actual Freq (Mult.)   C0%    Halt(C1)%  C3 %   C6 %
        Core 1 [0]:       3048.42 (30.57x)     1.28   0          1      0
        Core 2 [2]:       2992.03 (30.00x)     1      0.881      1      0
        Core 3 [4]:       3015.05 (30.23x)     10.5   0          0      0
        Core 4 [6]:       2953.21 (29.61x)     1.73   0.235      1      0
******************************************************************************
I understand that this problem is very difficult to pin down, and it's even possible that there is a hardware bug in the CPU design.
And this is the output from "powertop"--Frequency during overheating:
  Package    |     Core     |          CPU 0     CPU 1
             |              |  Actual  2.9 GHz   2.9 GHz
le 100.0%    | Idle 100.0%  |  Idle    100.0%    100.0%
00 MHz 0.0%  |              |
             |     Core     |          CPU 2     CPU 3
             |              |  Actual  3.0 GHz   2.9 GHz
             | Idle 100.0%  |  Idle    100.0%    100.0%
             |              |
             |     Core     |          CPU 4     CPU 5
             |              |  Actual  2.8 GHz   3.0 GHz
             | Idle 100.0%  |  Idle    100.0%    100.0%
             |              |
             |     Core     |          CPU 6     CPU 7
             |              |  Actual  2.9 GHz   2.8 GHz
             | Idle 100.0%  |  Idle    100.0%    100.0%
*****************************************************************************
And this is the output from "powertop"--Tunables during overheating:
>> Bad  Wireless Power Saving for interface wlan0
   Bad  NMI watchdog should be turned off
   Bad  VM writeback timeout
   Bad  Enable SATA link power Managmenet for host0
   Bad  Enable SATA link power Managmenet for host1
   Bad  Enable SATA link power Managmenet for host2
   Bad  Enable SATA link power Managmenet for host3
   Bad  Enable SATA link power Managmenet for host4
   Bad  Enable SATA link power Managmenet for host5
   Bad  Enable Audio codec power management
   Bad  Autosuspend for USB device 2.4G Keyboard Mouse [MOSART Semi.]
   Bad  Autosuspend for unknown USB device 4-1.5 (8086:0189)
   Bad  Runtime PM for PCI Device Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 4
   Bad  Runtime PM for PCI Device Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 5
   Bad  Runtime PM for PCI Device Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 2
   Bad  Runtime PM for PCI Device Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 6
   Bad  Runtime PM for PCI Device Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Control
   Bad  Runtime PM for PCI Device NEC Corporation uPD720200 USB 3.0 Host Controller
   Bad  Runtime PM for PCI Device Intel Corporation HM67 Express Chipset Family LPC Controller
   Bad  Runtime PM for PCI Device Intel Corporation 6 Series/C200 Series Chipset Family 6 port SATA AHCI Controll
   Bad  Runtime PM for PCI Device Intel Corporation 6 Series/C200 Series Chipset Family SMBus Controller
   Bad  Runtime PM for PCI Device NVIDIA Corporation GF108M [GeForce GT 540M]
   Bad  Runtime PM for PCI Device Intel Corporation Centrino Wireless-N 1030 [Rainbow Peak]
   Bad  Runtime PM for PCI Device Intel Corporation 6 Series/C200 Series Chipset Family MEI Controller #1
   Bad  Runtime PM for PCI Device Realtek Semiconductor Co., Ltd. RTL8111/8168 PCI Express Gigabit Ethernet contr
   Bad  Runtime PM for PCI Device Intel Corporation 6 Series/C200 Series Chipset Family High Definition Audio Con
   Bad  Runtime PM for PCI Device Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Control
   Bad  Runtime PM for PCI Device Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Cont
   Bad  Runtime PM for PCI Device Intel Corporation Xeon E3-1200/2nd Generation Core Processor Family PCI Express
   Bad  Runtime PM for PCI Device Intel Corporation 2nd Generation Core Processor Family DRAM Controller
   Bad  Runtime PM for PCI Device Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 1
   Bad  Wake-on-lan status for device eth0
********************************************************************************
And this is the output from "cpupower frequency-info" during overheating:
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 0.97 ms.
  hardware limits: 800 MHz - 3.10 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 800 MHz and 3.10 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 3.01 GHz (asserted by call to hardware).
  boost state support:
    Supported: yes
    Active: yes
    2800 MHz max turbo 4 active cores
    2800 MHz max turbo 3 active cores
    3000 MHz max turbo 2 active cores
    3100 MHz max turbo 1 active cores
********************************************************************************
I hope Intel will fix this bug in the next release, thank you very much :-)
Hi, please check whether the following patch fixes your issue. Recently, we found that some high CPU frequency issues were related to the i915 driver. https://bugzilla.kernel.org/attachment.cgi?id=105901
Alright, I will tell you the result of my testing soon. Thank you :-)
Okay, I've been testing the patch you suggested for several days now, and the good news is that the heating issue has not come back, at least not that I have observed. So it works. I hope you guys can improve the driver in the next release. Good job, guys :-)
OK, thanks for testing. Marking this bug as a duplicate. *** This bug has been marked as a duplicate of bug 58971 ***