Bug 51941

Summary: Huge CPU temperature rise with possible hardware damage in a fanless Atom-ION box
Product: ACPI
Component: Power-Processor
Reporter: Vyacheslav Dikonov (sdiconov)
Assignee: Len Brown (lenb)
Status: CLOSED INSUFFICIENT_DATA
Severity: normal
Priority: P1
CC: alan, lenb, rjw, rui.zhang
Hardware: All
OS: Linux
Kernel Version: 3.6
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments:
Kernel 3.5.7 which overheats the system
kernel3.5.7 temperature sensors readings
kernel 3.4.8 which is nice and cool
kernel3.4.8 temperature sensors readings
custom kernel 3.4.8 which runs best
custom 3.4.8 kernel config patch to show the difference from the stock
dmesg output with kernel 3.4.8
acpidump source
Result of running acpidump

Description Vyacheslav Dikonov 2012-12-24 09:56:55 UTC
I have a small fanless Atom330+ION box with Asrock ION330 motherboard. 
It is installed in a passively cooled case with hard TDP limits. The box has been well tested and runs perfectly with kernels up to 3.4. The CPU temperature stays at about 40-45 degrees Celsius and up to 58-60 at peak load. 

After upgrading the kernel to 3.6, the CPU temperature of both Atom cores rises to 60 degrees within 1 minute after boot and exceeds 80 under load. The heat spreads to other hardware components, and my HDD SMART data now has permanent overheat warnings (which voids the warranty).

The symptoms were well described in this article at Phoronix:
http://www.phoronix.com/scan.php?page=article&item=linux_power_20&num=2
I can confirm that it is true. The article says that kernel 3.7 (which I haven't yet tested) still has this regression. 

It may be similar or related to bug 48721, but in my case the hardware is different. The video is a built-in Nvidia ION, driven by the proprietary nvidia driver. The nvidia driver 310.19 (latest) and all other software packages (XBMC) remained the same before and after the bad kernel upgrade.
Comment 1 Len Brown 2013-01-15 01:21:53 UTC
Probably not the same as bug 48721 b/c that seems to be related
to use of the Intel graphics driver, and here you've got Nvidia graphics.

Please run powertop on both the working and failing configurations
and report what C-states are being used, and also what P-states
are being used.

note that powertop can be used with the --html option
to create a file you can attach to this bug report.
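
A minimal sketch of collecting one report per kernel, assuming powertop 2.x is installed and the script is run as root. The `--html` option is the one mentioned above; the `--time` option, the 60-second default, and the report file naming are assumptions made for illustration, and both function names are hypothetical.

```python
import platform
import subprocess


def powertop_command(kernel_release, seconds=60):
    """Build a powertop command line that writes an HTML report.

    The report name embeds the kernel release so that reports taken
    under the working and the failing kernel do not overwrite each other.
    """
    report = f"powertop-{kernel_release}.html"
    return ["powertop", f"--html={report}", f"--time={seconds}"]


def run_powertop(seconds=60):
    """Run powertop once for the currently booted kernel (needs root)."""
    # platform.release() returns the running kernel release string.
    cmd = powertop_command(platform.release(), seconds)
    subprocess.run(cmd, check=True)
    return cmd
```

Calling `run_powertop()` once after booting each kernel leaves one HTML file per configuration, which can then be attached to the report.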
Comment 2 Vyacheslav Dikonov 2013-01-27 10:50:34 UTC
Finally I have found some time to test it again. I have made 3 powertop reports with the corresponding sensors output:
1) - kernel 3.4.8-pae-slava1
2) - kernel 3.4.8-pae-alt1 (stock distro kernel with pae)
3) - kernel 3.5.7-pae-alt1 (stock distro kernel with pae)
Comment 3 Vyacheslav Dikonov 2013-01-27 10:55:54 UTC
Created attachment 91871
Kernel 3.5.7 which overheats the system

All kernels following 3.5.x demonstrate an identical overheating problem.
Comment 4 Vyacheslav Dikonov 2013-01-27 10:59:55 UTC
Created attachment 91881
kernel3.5.7 temperature sensors readings

Here the system is perfectly idle. The data was taken immediately after a cold boot. If I run any applications the temperatures jump higher.
Comment 5 Vyacheslav Dikonov 2013-01-27 11:02:23 UTC
Created attachment 91891
kernel 3.4.8 which is nice and cool
Comment 6 Vyacheslav Dikonov 2013-01-27 11:04:21 UTC
Created attachment 91901
kernel3.4.8 temperature sensors readings

Taken at idle after a cold boot. All conditions identical to the kernel3.5.7 sensors reading except the kernel versions.
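
A minimal sketch of how such idle readings could be captured and compared in a repeatable way. It assumes the sensors are exposed through the kernel's hwmon sysfs interface as `temp*_input` files holding millidegrees Celsius. The function names are hypothetical, and the fallback to a `device/` subdirectory is an assumption about where older kernels place these files.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"<chip>/<sensor file>": degrees Celsius} for all hwmon temperatures."""
    temps = {}
    for hwmon in sorted(Path(root).glob("hwmon*")):
        # Assumption: attributes live in hwmonN/ or, on older kernels, hwmonN/device/.
        for attr_dir in (hwmon, hwmon / "device"):
            if not attr_dir.is_dir():
                continue
            name_file = attr_dir / "name"
            chip = name_file.read_text().strip() if name_file.is_file() else hwmon.name
            for sensor in sorted(attr_dir.glob("temp*_input")):
                try:
                    millideg = int(sensor.read_text().strip())
                except (OSError, ValueError):
                    continue  # skip sensors that cannot be read
                temps[f"{chip}/{sensor.name}"] = millideg / 1000.0
    return temps


def temp_delta(baseline, candidate):
    """Per-sensor difference (candidate minus baseline) for sensors present in both."""
    return {key: candidate[key] - baseline[key] for key in baseline.keys() & candidate.keys()}
```

Saving the result of `read_hwmon_temps()` under each kernel and passing both dictionaries to `temp_delta()` gives the per-sensor difference between the two boots.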
Comment 7 Vyacheslav Dikonov 2013-01-27 11:07:27 UTC
Created attachment 91911
custom kernel 3.4.8 which runs best

This kernel is a tweaked version of the other 3.4.8 kernel. It shows better or similar thermal performance but seems a bit more responsive.
Comment 8 Vyacheslav Dikonov 2013-01-27 11:10:19 UTC
Created attachment 91921
custom 3.4.8 kernel config patch to show the difference from the stock

This patch shows all differences between the two 3.4.8 kernels.
Comment 9 Vyacheslav Dikonov 2013-01-27 11:14:55 UTC
I also tried later kernels, both ones optimized for Atom and lower latency and stock distro ones. They all overheat, and the temperatures are not visibly different from 3.5.7.

Kernels optimized for the Atom CPU behave similarly to non-optimized generic Pentium4 kernels.
Comment 10 Zhang Rui 2013-04-13 17:39:10 UTC
please attach the dmesg output for both 3.4.8 and 3.5.7.
Comment 11 Vyacheslav Dikonov 2013-04-14 21:01:08 UTC
Created attachment 98631
dmesg output with kernel 3.4.8
Comment 12 Zhang Rui 2013-04-15 01:16:23 UTC
please attach the acpidump output of this box.
Comment 13 Vyacheslav Dikonov 2013-04-15 23:48:22 UTC
There is no such command. A web search gave me only dead links, man pages and debs, but I run an rpm-based distribution. Where can I download this tool?
Comment 14 Zhang Rui 2013-04-17 06:47:36 UTC
Created attachment 98911
acpidump source

please build it and run acpidump > acpidump.out with root privilege.
Comment 15 Vyacheslav Dikonov 2013-04-17 20:11:42 UTC
Created attachment 99051
Result of running acpidump
Comment 16 Zhang Rui 2013-04-22 03:21:05 UTC
NO ACPI Fan/Thermal control on this platform.

Len,
can you please continue to look at this problem?
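
The absence of ACPI fan/thermal control can also be cross-checked on the running system. This is a rough sketch, assuming the kernel's thermal sysfs class at `/sys/class/thermal` and assuming that ACPI thermal zones register with type `acpitz` and ACPI fans with type `Fan`. The function names are hypothetical, and the check does not replace reading the acpidump output.

```python
from pathlib import Path


def list_thermal_devices(root="/sys/class/thermal"):
    """Return (zone_types, cooling_types) as two sorted lists of strings."""
    base = Path(root)

    def types(pattern):
        found = []
        for type_file in base.glob(f"{pattern}/type"):
            try:
                found.append(type_file.read_text().strip())
            except OSError:
                continue  # entry disappeared or is unreadable
        return sorted(found)

    return types("thermal_zone*"), types("cooling_device*")


def has_acpi_thermal_control(zone_types, cooling_types):
    """True if an ACPI thermal zone or an ACPI fan is registered (assumed type names)."""
    return "acpitz" in zone_types or "Fan" in cooling_types
```

On a platform without ACPI thermal zones or fans, `has_acpi_thermal_control(*list_thermal_devices())` would be expected to return `False`, even if `Processor` cooling devices are listed.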
Comment 17 Len Brown 2013-05-13 23:57:41 UTC
> nvidia: module license 'NVIDIA' taints kernel.

please contact nvidia for support,
or re-open when you can reproduce this
w/o their proprietary software.
Comment 18 Vyacheslav Dikonov 2013-05-16 10:00:42 UTC
1) The purpose of this machine is to play sound and video. It is physically impossible to install a different video card. There is no other a/v output. It is impossible to run the box without the nvidia driver.

HOWEVER,

2) The bug is reproducible by booting different versions of the linux kernel while running THE SAME nvidia blob, i.e. the system can run fine with the nvidia driver it uses. I did not change, reinstall, update or otherwise touch the nvidia driver before or after the kernel swap which triggered the problem. Only the kernel<->blob interface got rebuilt, but it was the same source rpm package for both good and bad kernels.

From this I conclude that the kernel is to blame and not nvidia. I need good justification to ask nvidia support, and I need some evidence that their driver has something to do with the problem. Could you give me any such evidence, or tips on how to get the truly relevant technical info to make such a request?
Comment 19 Alan 2013-05-16 10:02:09 UTC
Nvidia have their source code and can read ours, the reverse is not true. Only they can help you.
Comment 20 Vyacheslav Dikonov 2013-05-16 10:47:04 UTC
While this statement is true, it is irrelevant to this technical issue. 

On one hand, I see the fact that the system runs perfectly well and cool with an old kernel and goes mad with a newer kernel while the nvidia code remains the same, identical, constant, unchanged, frozen...
It is (possibly superficial) evidence that nvidia is not involved in the problem, unless there are some changes in the kernel that break mutual compatibility.

On the other hand, I have nothing to support the claim that the nvidia driver is defective in this area. Nvidia support will simply send me back to you, and I am unable to make them do anything better. We probably need to file such a request together, with a relevant technical reference.


BTW. ION1 is a popular, but discontinued product since ION2. Nvidia tends to drop support for such. Still ION seems to be the only way to build a home theater grade _silent_ HD Audio/Video player box _with_no_moving_parts_. It is a whole class of devices where linux OS (used to?) have an edge over other OSes.