Bug 51941 - Huge CPU temperature rise with possible hardware damage in a fanless Atom-ION box
Summary: Huge CPU temperature rise with possible hardware damage in a fanless Atom-ION...
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: ACPI
Classification: Unclassified
Component: Power-Processor
Hardware: All Linux
Importance: P1 normal
Assignee: Len Brown
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-12-24 09:56 UTC by Vyacheslav Dikonov
Modified: 2013-06-24 23:21 UTC
CC List: 4 users

See Also:
Kernel Version: 3.6
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Kernel 3.5.7 which overheats the system (11.18 KB, text/plain)
2013-01-27 10:55 UTC, Vyacheslav Dikonov
kernel3.5.7 temperature sensors readings (1.17 KB, text/plain)
2013-01-27 10:59 UTC, Vyacheslav Dikonov
kernel 3.4.8 which is nice and cool (11.94 KB, text/csv)
2013-01-27 11:02 UTC, Vyacheslav Dikonov
kernel3.4.8 temperature sensors readings (1.17 KB, text/plain)
2013-01-27 11:04 UTC, Vyacheslav Dikonov
custom kernel 3.4.8 which runs best (11.92 KB, text/plain)
2013-01-27 11:07 UTC, Vyacheslav Dikonov
custom 3.4.8 kernel config patch to show the difference from the stock (3.23 KB, patch)
2013-01-27 11:10 UTC, Vyacheslav Dikonov
dmesg output with kernel 3.4.8 (52.20 KB, text/plain)
2013-04-14 21:01 UTC, Vyacheslav Dikonov
acpidump source (260.00 KB, application/x-tar)
2013-04-17 06:47 UTC, Zhang Rui
Result of running acpidump (132.96 KB, application/octet-stream)
2013-04-17 20:11 UTC, Vyacheslav Dikonov

Description Vyacheslav Dikonov 2012-12-24 09:56:55 UTC
I have a small fanless Atom330+ION box with Asrock ION330 motherboard. 
It is installed in a passively cooled case with hard TDP limits. The box has been well tested and runs perfectly with kernels up to 3.4. The CPU temperature stays at about 40-45 degrees Celsius and up to 58-60 at peak load. 

After upgrading the kernel to 3.6, the CPU temperature of both Atom cores rises to 60 degrees within 1 minute after boot and exceeds 80 under load. The heat spreads to other hardware components and my HDD SMART now has permanent overheat warnings (which makes the warranty void).

The symptoms were well described in this article at Phoronix:
http://www.phoronix.com/scan.php?page=article&item=linux_power_20&num=2
I can confirm that it is true. The article says that kernel 3.7 (which I haven't yet tested) still has this regression. 

It may be similar or related to bug 48721, but in my case the hardware is different. The video is a built-in Nvidia ION, driven by the proprietary nvidia driver. The nvidia driver 310.19 (latest) and all other software packages (XBMC) remained the same before and after the bad kernel upgrade.
Comment 1 Len Brown 2013-01-15 01:21:53 UTC
Probably not the same as bug 48721 b/c that seems to be related
to use of the Intel graphics driver, and here you've got Nvidia graphics.

Please run powertop on both the working and failing configurations
and report what C-states are being used, and also what P-states
are being used.

note that powertop can be used with the --html option
to create a file you can attach to this bug report.
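
As a quick cross-check alongside the powertop report, the C-state counters can also be read straight from sysfs. The following is only a minimal Python sketch, not part of the request above. It assumes the kernel exposes the standard cpuidle directories under /sys/devices/system/cpu/cpuN/cpuidle/stateM (files name, usage and time, the latter in microseconds). The function names are illustrative, and P-state statistics are left out.

```python
from pathlib import Path


def read_cstate_residency(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu: {state_name: (usage_count, time_us)}} from the cpuidle tree."""
    result = {}
    for cpu_dir in sorted(Path(sysfs_root).glob("cpu[0-9]*")):
        states = {}
        for state_dir in sorted((cpu_dir / "cpuidle").glob("state[0-9]*")):
            name = (state_dir / "name").read_text().strip()
            usage = int((state_dir / "usage").read_text())   # number of entries
            time_us = int((state_dir / "time").read_text())  # cumulative residency
            states[name] = (usage, time_us)
        if states:
            result[cpu_dir.name] = states
    return result


def residency_share(states):
    """Turn one CPU's counters into each state's fraction of total idle time."""
    total = sum(time_us for _, time_us in states.values())
    if total == 0:
        return {name: 0.0 for name in states}
    return {name: time_us / total for name, (_, time_us) in states.items()}


if __name__ == "__main__":
    for cpu, states in read_cstate_residency().items():
        shares = residency_share(states)
        print(cpu, {name: round(share, 3) for name, share in shares.items()})
```

The counters are cumulative since boot, so a fair comparison between two kernels would take the readings after a similar uptime and workload on each.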
Comment 2 Vyacheslav Dikonov 2013-01-27 10:50:34 UTC
Finally I have found some time to test it again. I have made 3 powertop reports and corresponding sensors output
1) - kernel 3.4.8-pae-slava1
2) - kernel 3.4.8-pae-alt1 (stock distro kernel with pae)
3) - kernel 3.5.7-pae-alt1 (stock distro kernel with pae)
Comment 3 Vyacheslav Dikonov 2013-01-27 10:55:54 UTC
Created attachment 91871 [details]
Kernel 3.5.7 which overheats the system

All kernels from 3.5.x onward show the identical overheating problem.
Comment 4 Vyacheslav Dikonov 2013-01-27 10:59:55 UTC
Created attachment 91881 [details]
kernel3.5.7 temperature sensors readings

Here the system is perfectly idle. The data was taken immediately after a cold boot. If I run any applications the temperatures jump higher.
Comment 5 Vyacheslav Dikonov 2013-01-27 11:02:23 UTC
Created attachment 91891 [details]
kernel 3.4.8 which is nice and cool
Comment 6 Vyacheslav Dikonov 2013-01-27 11:04:21 UTC
Created attachment 91901 [details]
kernel3.4.8 temperature sensors readings

Taken at idle after a cold boot. All conditions identical to the kernel3.5.7 sensors reading except the kernel versions.
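
To make such idle readings easier to compare between two kernels, the temperatures can be collected programmatically. This is a hedged Python sketch, not the method used for the attachments. It assumes the sensors are exposed through the hwmon sysfs class, with tempN_input files (millidegrees Celsius) sitting directly in each /sys/class/hwmon/hwmonN directory. On some older kernels they may live under a device subdirectory instead. The function names are made up for illustration.

```python
from pathlib import Path


def _read(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def read_hwmon_temps(hwmon_root="/sys/class/hwmon"):
    """Return {"chip/label": degrees_celsius} for every readable tempN_input."""
    temps = {}
    for chip_dir in sorted(Path(hwmon_root).glob("hwmon*")):
        chip = _read(chip_dir / "name") or "unknown"
        for input_file in sorted(chip_dir.glob("temp*_input")):
            raw = _read(input_file)
            if raw is None or not raw.lstrip("-").isdigit():
                continue  # sensor reported an error instead of a value
            channel = input_file.name[: -len("_input")]
            label = _read(chip_dir / (channel + "_label")) or channel
            # Note: two chips with the same name and label would share a key.
            temps[f"{chip}/{label}"] = int(raw) / 1000.0
    return temps


def temperature_delta(baseline, current):
    """Per-sensor difference (current - baseline) for sensors present in both."""
    return {key: round(current[key] - baseline[key], 3)
            for key in baseline if key in current}


if __name__ == "__main__":
    for sensor, celsius in read_hwmon_temps().items():
        print(f"{sensor}: {celsius:.1f} C")
```

Saving the dictionary from one boot of each kernel and passing both to temperature_delta would give a per-sensor difference under otherwise identical conditions.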
Comment 7 Vyacheslav Dikonov 2013-01-27 11:07:27 UTC
Created attachment 91911 [details]
custom kernel 3.4.8 which runs best

This kernel is a tweaked version of the other 3.4.8 kernel. It shows better or similar thermal performance but seems a bit more responsive.
Comment 8 Vyacheslav Dikonov 2013-01-27 11:10:19 UTC
Created attachment 91921 [details]
custom 3.4.8 kernel config patch to show the difference from the stock

This patch shows all the differences between the two 3.4.8 kernels.
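
A comparison of this kind can also be produced directly from two kernel .config files. The sketch below is a hypothetical helper, not the tool used to create the attached patch. It assumes the usual .config syntax, where an option is either set as CONFIG_NAME=value or disabled as "# CONFIG_NAME is not set". The sample options in the demo are illustrative and are not taken from the attachment.

```python
def parse_kernel_config(text):
    """Map option names to values; '# CONFIG_X is not set' is stored as 'n'."""
    options = {}
    suffix = " is not set"
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_kernel_configs(stock_text, custom_text):
    """Return {option: (stock_value, custom_value)} for options that differ.

    An option missing from one file is reported as None on that side.
    """
    stock = parse_kernel_config(stock_text)
    custom = parse_kernel_config(custom_text)
    changes = {}
    for name in sorted(set(stock) | set(custom)):
        old, new = stock.get(name), custom.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes


if __name__ == "__main__":
    # Illustrative input only, not the contents of the attached configs.
    stock = "CONFIG_HZ=250\n# CONFIG_MATOM is not set\nCONFIG_SMP=y\n"
    custom = "CONFIG_HZ=1000\nCONFIG_MATOM=y\nCONFIG_SMP=y\n"
    for name, (old, new) in diff_kernel_configs(stock, custom).items():
        print(f"{name}: {old} -> {new}")
```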
Comment 9 Vyacheslav Dikonov 2013-01-27 11:14:55 UTC
I also tried later kernels, both stock distro builds and builds optimized for Atom and lower latency. They all overheat, and the temperatures are not visibly different from 3.5.7.

Kernels optimized for the Atom CPU behave similarly to non-optimized generic Pentium4 kernels.
Comment 10 Zhang Rui 2013-04-13 17:39:10 UTC
please attach the dmesg output for both 3.4.8 and 3.5.7.
Comment 11 Vyacheslav Dikonov 2013-04-14 21:01:08 UTC
Created attachment 98631 [details]
dmesg output with kernel 3.4.8
Comment 12 Zhang Rui 2013-04-15 01:16:23 UTC
please attach the acpidump output of this box.
Comment 13 Vyacheslav Dikonov 2013-04-15 23:48:22 UTC
There is no such command. A web search gave me only dead links, man pages and debs, but I run an rpm-based distribution. Where can I download this tool?
Comment 14 Zhang Rui 2013-04-17 06:47:36 UTC
Created attachment 98911 [details]
acpidump source

please build it and run acpidump > acpidump.out with root privilege.
Comment 15 Vyacheslav Dikonov 2013-04-17 20:11:42 UTC
Created attachment 99051 [details]
Result of running acpidump
Comment 16 Zhang Rui 2013-04-22 03:21:05 UTC
NO ACPI Fan/Thermal control on this platform.

Len,
can you please continue to look at this problem?
Comment 17 Len Brown 2013-05-13 23:57:41 UTC
> nvidia: module license 'NVIDIA' taints kernel.

please contact nvidia for support,
or re-open when you can reproduce this
w/o their proprietary software.
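
For reference, the taint state behind the quoted dmesg line can be inspected on a running system. This is a small Python sketch under stated assumptions: it reads the bitmask from /proc/sys/kernel/tainted and decodes only two of the documented bits (bit 0, proprietary module loaded; bit 12, out-of-tree module loaded). The table is deliberately incomplete and the function name is illustrative.

```python
from pathlib import Path

# Subset of the kernel's taint bits; the full list is in the kernel docs.
TAINT_FLAGS = {
    0: "P: proprietary module was loaded",
    12: "O: out-of-tree module was loaded",
}


def decode_taint(mask):
    """Return descriptions of the known taint bits that are set in mask."""
    return [text for bit, text in sorted(TAINT_FLAGS.items()) if mask & (1 << bit)]


if __name__ == "__main__":
    try:
        mask = int(Path("/proc/sys/kernel/tainted").read_text())
    except (OSError, ValueError):
        mask = None  # not Linux, or the file is unavailable
    if mask is None:
        print("taint mask not available")
    elif mask == 0:
        print("kernel is not tainted")
    else:
        print(f"taint mask {mask}:", decode_taint(mask) or "only bits outside this table")
```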
Comment 18 Vyacheslav Dikonov 2013-05-16 10:00:42 UTC
1) The purpose of this machine is to play sound and video. It is physically impossible to install a different video card. There is no other a/v output. It is impossible to run the box without the nvidia driver.

HOWEVER,

2) The bug is reproducible by booting different versions of the linux kernel while running THE SAME nvidia blob, i.e. the system can run fine with the nvidia driver it uses. I did not change, reinstall, update or otherwise touch the nvidia driver before or after the kernel swap which triggered the problem. Only the kernel<->blob interface was rebuilt, and from the same source rpm package for both the good and the bad kernels.

From this I conclude that the kernel is to blame and not nvidia. I need good justification to ask nvidia support, and I need some evidence that their driver has something to do with the problem. Could you give me any such evidence, or tips on how to get the truly relevant technical info to make such a request?
Comment 19 Alan 2013-05-16 10:02:09 UTC
Nvidia have their source code and can read ours, the reverse is not true. Only they can help you.
Comment 20 Vyacheslav Dikonov 2013-05-16 10:47:04 UTC
While this statement is true, it is irrelevant to this technical issue. 

On one hand, I see the fact that the system runs perfectly well and cool with an old kernel and goes mad with a newer kernel while the nvidia code remains the same, identical, constant, unchanged, frozen.....
This is (possibly superficial) evidence that nvidia is not involved in the problem, unless there are some changes in the kernel that break mutual compatibility.

On the other hand, I have nothing to support the claim that the nvidia driver is defective in this area. Nvidia support will simply send me back to you, and I am unable to make them do anything better. We probably need to file such a request together, with a relevant technical reference.


BTW, ION1 is a popular product, but it has been discontinued since ION2. Nvidia tends to drop support for such products. Still, ION seems to be the only way to build a home theater grade _silent_ HD Audio/Video player box _with_no_moving_parts_. It is a whole class of devices where linux OS (used to?) have an edge over other OSes.
