Bug 14695 - regression in karmic thermal control - Thinkpad X24
Status: CLOSED INSUFFICIENT_DATA
Product: ACPI
Classification: Unclassified
Component: EC
Hardware: All Linux
Importance: P1 normal
Assigned To: acpi_ec
Duplicates: 14888
Depends on:
Blocks: 14230
Reported: 2009-11-26 08:45 UTC by public
Modified: 2010-10-28 23:34 UTC

See Also:
Kernel Version: 2.6.32
Tree: Mainline
Regression: Yes


Attachments
acpidump (368.60 KB, application/octet-stream)
2009-12-01 09:35 UTC, public
cat_proc_acpi_thermal_zone (903 bytes, text/plain)
2009-12-01 09:35 UTC, public
before adding boot option (451 bytes, text/plain)
2009-12-10 22:59 UTC, public
after adding boot option (450 bytes, text/plain)
2009-12-10 22:59 UTC, public
powertop -d while cpu is idle (5.75 KB, text/plain)
2010-02-25 08:36 UTC, public
grep . /sys/firmware/acpi/interrupts/* (2.13 KB, text/plain)
2010-02-26 09:02 UTC, public

Description public 2009-11-26 08:45:08 UTC
"Since installing Karmic my fan has been operating at high speed too often for my liking and now I've had a number of unexpected shutdowns"

Please have a look at Launchpad bug report #432670:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/432670

There you find a detailed description and some users reporting that problem.

Kind regards,
Florian
Comment 1 public 2009-11-26 08:46:37 UTC
Besides the standard Karmic kernel, I also tried the current mainline kernel. Yesterday I encoded a video with avidemux, which also resulted in a critical temperature shutdown.
Comment 2 ykzhao 2009-11-30 12:59:08 UTC
Will you please attach the output of acpidump?
It would be great if you could also attach the output of "cat /proc/acpi/thermal_zone/*".
thanks.
Comment 3 public 2009-12-01 09:35:35 UTC
Created attachment 23983
acpidump
Comment 4 public 2009-12-01 09:35:56 UTC
Created attachment 23984
cat_proc_acpi_thermal_zone
Comment 5 Zhang Rui 2009-12-04 03:12:58 UTC
please boot with "acpi_enforce_resources=lax" and see if the problem still exists.
Comment 6 public 2009-12-06 17:03:07 UTC
I tried this boot parameter, but without success. The PC just crashed during an Avidemux encode.
Comment 7 Zhang Rui 2009-12-09 02:55:25 UTC
hmm, there is no ACPI fan control in this laptop and the passive trip point is too close to the critical trip point.
please set CONFIG_ACPI_THERMAL=y, boot with thermal.psv=80 and see if it helps.
Comment 8 public 2009-12-09 18:16:27 UTC
(In reply to comment #7)
> please set CONFIG_ACPI_THERMAL=y, boot with thermal.psv=80 and see if it helps.

Hi, could you please tell me in detail how to do this?
Comment 9 Zhang Rui 2009-12-10 01:21:55 UTC
is the ACPI thermal driver built in or loaded as a module?

if it's built in, please add boot option "thermal.psv=80" in the grub menu.
if it's a module, please add "thermal thermal.psv=80" in the /etc/modules file and reboot.

please attach the output of "grep . /proc/acpi/thermal_zone/THRM/*" after this test.
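One way to answer the built-in-or-module question from comment #9 is to look at the build configuration of the running kernel. The following is a minimal Python sketch and was not posted in this thread. It assumes the distribution installs the kernel config as /boot/config-<release> (Ubuntu does, other systems may not), and the helper name thermal_driver_mode is made up for illustration.

```python
import platform
import re


def thermal_driver_mode(config_text):
    """Classify CONFIG_ACPI_THERMAL as 'builtin', 'module', or 'disabled'."""
    match = re.search(r"^CONFIG_ACPI_THERMAL=([ym])\s*$", config_text, re.MULTILINE)
    if match is None:
        # Option missing or written as "# CONFIG_ACPI_THERMAL is not set".
        return "disabled"
    return "builtin" if match.group(1) == "y" else "module"


if __name__ == "__main__":
    # Assumed location of the config file for the running kernel.
    config_path = "/boot/config-" + platform.release()
    try:
        with open(config_path) as handle:
            print(config_path, "->", thermal_driver_mode(handle.read()))
    except OSError as error:
        print("could not read", config_path, "-", error)
```

A result of "builtin" corresponds to the grub boot option case described above, and "module" to the /etc/modules case.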
Comment 10 public 2009-12-10 22:57:35 UTC
Still high temperature, but a first test with a 30min Avidemux encode did not crash my computer.
I'll attach two files one before adding that boot option and one afterwards.
Comment 11 public 2009-12-10 22:59:15 UTC
Created attachment 24149
before adding boot option
Comment 12 public 2009-12-10 22:59:38 UTC
Created attachment 24150
after adding boot option
Comment 13 Zhang Rui 2009-12-11 01:02:12 UTC
comment #9 is just a workaround for this issue. It slows down the processor once the system overheats.
But the real problem is that the laptop overheats with the Karmic kernel while it didn't before, right?

Can you please give a detailed description of the problem:
1. What is the latest kernel that doesn't have this problem?
2. What is the earliest kernel in which this problem exists?
3. Can you please try the latest vanilla kernel (2.6.32) and see if it helps?
4. Can you hear the fan spinning with the working kernel? Can you hear the fan spinning with the overheating kernel? Do you think the computer runs at a much higher temperature with the Karmic kernel than before?
Comment 14 public 2009-12-11 18:09:23 UTC
(In reply to comment #13)
> But the real problem is that laptop overheats in Karmic kernel while it doesn't
> before, right?

Yes! This did not happen in Jaunty.

1. I don't know exactly; as it was in Jaunty, it should have been 2.6.28-17
2. 2.6.31-14, the first that came with Karmic as far as I know
3. I already tried 2.6.32-999
4. Both: yes
Comment 15 public 2009-12-16 15:16:45 UTC
Any new ideas? More information needed?
Comment 16 Rolf Leggewie 2009-12-17 16:54:57 UTC
I am another user affected by this issue.  My computer is a Thinkpad X24 that previously did not have overheating issues in Jaunty but does now in Karmic.  I am the original reporter of the downstream ticket in Launchpad referenced above.

I tested with today's Ubuntu mainline kernel (these are free of Ubuntu patches, more info at https://wiki.ubuntu.com/KernelTeam/MainlineBuilds) and the problem exists there as well.  The mainline kernel is compiled with CONFIG_ACPI_THERMAL=y, yet the computer shut down within about 60 seconds after increasing load to 100% by running a couple of sha512sum processes.  This was irrespective of whether the boot option thermal.psv=80 was given via grub or not.  I did see in conky that the CPU frequency dropped for about 2-3 seconds, went back up and then the computer shut down very quickly after that.

$ cat /proc/acpi/thermal_zone/THM0/*
<setting not supported>
<polling disabled>
state:                   ok
temperature:             74 C
critical (S5):           95 C
passive:                 80 C: tc1=5 tc2=2 tsp=600 devices=CPU0 

Can't answer all questions from comment #13 yet, but the latest mainline kernel still has this problem.
Comment 17 Rolf Leggewie 2009-12-17 16:57:52 UTC
Oh, the "passive" line above read 92° when booting without that boot option IIRC.
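The gap between the passive and critical trip points is what comment #7 calls "too close". As an illustration only (nothing like this was posted in the thread), here is a small Python sketch that reads the three values out of the procfs text quoted in comment #16 and computes that gap. The function names are hypothetical, and the line layout is assumed to match the listing above.

```python
import re

# Thermal zone text as quoted in comment #16 (booted with thermal.psv=80).
SAMPLE = """state:                   ok
temperature:             74 C
critical (S5):           95 C
passive:                 80 C: tc1=5 tc2=2 tsp=600 devices=CPU0
"""


def parse_thermal_zone(text):
    """Return the temperature, critical and passive values in degrees C."""
    patterns = {
        "temperature": r"^temperature:\s+(\d+) C",
        "critical": r"^critical \(S5\):\s+(\d+) C",
        "passive": r"^passive:\s+(\d+) C",
    }
    values = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text, re.MULTILINE)
        if match:
            values[name] = int(match.group(1))
    return values


def passive_headroom(values):
    """Degrees between the start of passive cooling and the critical shutdown."""
    return values["critical"] - values["passive"]


if __name__ == "__main__":
    zone = parse_thermal_zone(SAMPLE)
    print(zone, "headroom:", passive_headroom(zone), "C")
```

With the values above the headroom is 15 C. If the passive point is 92 C, as recalled in comment #17 for a boot without the option, only 3 C remain before the critical shutdown at 95 C.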
Comment 18 public 2009-12-30 22:47:16 UTC
Anything new about this bug?
Comment 19 Zhang Rui 2009-12-31 01:48:15 UTC
what's the kernel version of Jaunty?
As we only handle the vanilla kernel bugs here, I suggest that you try a vanilla kernel with the same version, and verify if the system works well in the vanilla kernel as well.
If yes, then we know that this is an upstream kernel regression, so we can use git bisect to find out which commit introduced this bug.
Comment 20 Rafael J. Wysocki 2009-12-31 10:53:51 UTC
*** Bug 14888 has been marked as a duplicate of this bug. ***
Comment 21 Rolf Leggewie 2010-01-01 14:58:49 UTC
Zhang, thanks for taking a look. Yes, this is a bug not specific to Ubuntu kernels.  The Ubuntu mainline kernels are vanilla kernels with no distro specific patches applied.
Comment 22 Zhang Rui 2010-01-04 01:01:32 UTC
hmmm,
what's the kernel version of Jaunty?
Comment 23 public 2010-01-04 08:15:27 UTC
Latest kernel used in jaunty is 2.6.28-17
http://bugzilla.kernel.org/show_bug.cgi?id=14695#c14

Please also have a look at Mark's "Sidux" note in comment 19: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/432670?comments=all
Comment 24 Derek Barnes 2010-01-09 08:25:17 UTC
I'm adding to this report on behalf of a number of users at ubuntuforums.org, including myself. Most of us are using Toshiba Satellites with dual AMD Turion processors. The fan doesn't seem to turn on until the CPUs reach temperatures well above recommended levels, or in some cases doesn't come on at all. We've tried multiple methods of correcting the problem, none of which have thus far succeeded. I'm hoping the detective work we have done can be of some use here.

Full thread on Ubuntu Forums:
http://ubuntuforums.org/showthread.php?t=1282161

-----

My results from the command line:

$ cat /proc/acpi/thermal_zone/*/trip_points
critical (S5):           108 C
passive:                 101 C: tc1=30 tc2=30 tsp=50 devices=CPU0 CPU1 
active[0]:               94 C: devices=FAN0 
active[1]:               82 C: devices=FAN1 
active[2]:               72 C: devices=FAN2 
active[3]:               52 C: devices=FAN3 
active[4]:               42 C: devices=FAN4 

"powersave -T" shows similar results. Only the "active[0]" point seems to work; at this point the fan goes from silent to full blast for a brief moment, then shuts off again.

I tried adjusting thermal management and cooling policy in powersave, to no avail; tried several ACPI recognition protocols in /etc/default/grub and got nothing. Just updated to 2.6.31-17-generic; no change.

Any more info you need, just say the word; I'll do my best to provide.
Comment 25 Zhang Rui 2010-01-11 01:25:53 UTC
Rolf Leggewie and Derek Barnes,
it would be great if you guys could open a new bug report for the problem on your laptops, because they don't look like the same problem as the one here: Bugie's laptop doesn't have ACPI fan devices.
Comment 26 Derek Barnes 2010-01-11 06:36:03 UTC
(In reply to comment #25)
> Rolf Leggewie and Derek Barnes,
> it would be great if you guys could open a new bug report for the problem on
> your laptops, because they don't look like the same problem as the one here:
> Bugie's laptop doesn't have ACPI fan devices.

Neither does mine as far as I can tell. That said, a fan that won't turn off and one that won't turn on are technically separate problems, whether they have a related cause or not. I'll file a separate report ASAP.
Comment 27 Zhang Rui 2010-01-13 07:01:20 UTC
(In reply to comment #26)
> (In reply to comment #25)
> > Rolf Leggewie and Derek Barnes,
> > it would be great if you guys could open a new bug report for the problem on
> > your laptops, because they don't look like the same problem as the one here:
> > Bugie's laptop doesn't have ACPI fan devices.
> 
> Neither does mine as far as I can tell. 

$ cat /proc/acpi/thermal_zone/*/trip_points
critical (S5):           108 C
passive:                 101 C: tc1=30 tc2=30 tsp=50 devices=CPU0 CPU1 
active[0]:               94 C: devices=FAN0 
active[1]:               82 C: devices=FAN1 
active[2]:               72 C: devices=FAN2 
active[3]:               52 C: devices=FAN3 
active[4]:               42 C: devices=FAN4 

this indicates that there are ACPI fans on your laptop.

> That said, a fan that won't turn off
> and one that won't turn on are technically separate problems, whether they
> have a related cause or not. I'll file a separate report ASAP.

that would be great! Thanks.
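The point made in comment #27 is that active[n] trip points listing FANx devices mean the firmware exposes ACPI fans. As a rough sketch (not part of the original thread, with a hypothetical function name), the same check can be done in Python on the trip_points text, assuming the procfs layout shown in the listing above.

```python
import re

# trip_points text as posted in comment #24.
TRIP_POINTS = """critical (S5):           108 C
passive:                 101 C: tc1=30 tc2=30 tsp=50 devices=CPU0 CPU1
active[0]:               94 C: devices=FAN0
active[1]:               82 C: devices=FAN1
active[2]:               72 C: devices=FAN2
active[3]:               52 C: devices=FAN3
active[4]:               42 C: devices=FAN4
"""

ACTIVE_LINE = re.compile(r"^active\[(\d+)\]:\s+(\d+) C: devices=(.*)$", re.MULTILINE)


def acpi_fan_trip_points(trip_points_text):
    """Map each active trip index to (temperature in C, list of devices)."""
    result = {}
    for index, temp, devices in ACTIVE_LINE.findall(trip_points_text):
        result[int(index)] = (int(temp), devices.split())
    return result


if __name__ == "__main__":
    fans = acpi_fan_trip_points(TRIP_POINTS)
    if fans:
        print("ACPI fan trip points found:", fans)
    else:
        print("no active trip points, so no ACPI fan control in this zone")
```

Run against the THM0 output from comment #16, which has only critical and passive lines, the function returns an empty mapping. That matches the statement in comment #7 that the laptop has no ACPI fan control.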
Comment 28 Zhang Rui 2010-01-13 07:07:39 UTC
Closing this bug because https://bugs.launchpad.net/ubuntu/+source/linux/+bug/432670/comments/28 says that the overheating problem doesn't exist in the 2.6.33-999 kernel.

Rolf Leggewie and bugie,
please re-open it if this is not true for you.
Comment 29 Rolf Leggewie 2010-01-13 23:28:09 UTC
Zhang, I'd reopen this ticket but apparently I lack the privs to do so.

While it is true that current mainline seems to have improved a lot here, driving the machine really hard still results in a shutdown.  I doubt that Mark drove his machine beyond normal day-to-day load.

Here's the situation as it applies to me.  Unpatched mainline kernel from a few weeks ago was no improvement over the kernel shipped in latest Ubuntu.  When shooting off a few sha256sum calculations for large files I could get the machine to shut down for overheating within 60 seconds after login.  With the suggested thermal.psv=80 and yesterday's mainline kernel the situation has improved a LOT.  Doing those CPU-intensive calculations again pushes the CPU temperature as measured in conky to about 90+ degrees very quickly.  The fan is going strong, although it may not be at the highest setting (sorry, forgot to check).  The on-demand governor only very seldom lowers CPU frequency (I'd guess every 30 seconds for about a second or two).  The machine would finally be pushed over the edge after about 10 minutes of non-stop heavy computation.

In essence I would say that during normal daytime operations one will possibly not hit this bug (although tabs with flash content in FF are CPU-intensive).  But it's not yet gone.
Comment 30 Zhang Rui 2010-01-14 06:22:18 UTC
So the problem you described doesn't exist in the earlier kernel releases,
and it just becomes better in the latest kernel but is not fixed, right?

re-open this bug.
Comment 31 Rolf Leggewie 2010-01-14 15:26:40 UTC
yes

Let me know what information I can contribute to narrow things down.
Comment 32 public 2010-01-18 18:52:45 UTC
I tested Ubuntu's mainline kernel linux-image-2.6.33-020633rc4-generic_2.6.33-020633rc4_i386.deb yesterday without any success. The temperature increased to 80°C and the CPU was then throttled.

Regards,
Florian
Comment 33 public 2010-01-23 16:47:31 UTC
Hi,

I discovered that CPU throttling should work with Karmic's standard kernel (workaround http://bugzilla.kernel.org/show_bug.cgi?id=14695#c13 ).

This already worked with the mainline kernel, as I said in my previous post. I'm using the GNOME applet that lets you select CPU frequency steps. Unfortunately I don't know its English name. It was always set to "on demand", which results in temperatures of about 100 degrees, even with the mentioned workaround. At the moment avidemux is encoding a video and the CPU is set to "1.6GHz", but now the CPU is throttled automatically (800MHz <-> 1.2GHz <-> 1.6GHz). Of course this isn't the expected behaviour, because the CPU should stay fixed at 1.6GHz with that setting. For me it's better that way. :-)

Kind regards,
Florian
Comment 34 Zhang Rui 2010-01-27 07:20:01 UTC
please run the top command for a couple of seconds when the system is idle and see if there is any process that takes a lot of CPU time.
Comment 35 public 2010-01-27 08:59:16 UTC
I already checked if there is any process eating up my cpu. If system is idle it is also cool. Can't say how many degrees but cool enough that the fan is not audible.
Comment 36 public 2010-02-15 07:35:48 UTC
Problem still exists. Anything new?
Comment 37 Rolf Leggewie 2010-02-20 12:48:08 UTC
There's some speculation in the Launchpad ticket now if messages in kern.log of the form

Feb 19 19:40:11 X24 kernel: [    1.196034] Clocksource tsc unstable (delta = -283962795 ns)

might have anything to do with this and point to the deeper issue.  I'm not an expert on what that line means, but it seems to have to do with errors when switching CPU frequency and I did observe some funky frequency switching in conky quite a few times just seconds before the overheat shutdown occurred.

Zhang, I think you are the expert.  Do you see any possible link here?
Comment 38 Zhang Rui 2010-02-23 07:39:15 UTC
please attach the output of " grep .  /sys/firmware/acpi/interrupts/*" when the system overheats.

(In reply to comment #37)
> Feb 19 19:40:11 X24 kernel: [    1.196034] Clocksource tsc unstable (delta =
> -283962795 ns)
> 
I don't think this is related because this can be seen on many platforms.
Comment 39 Zhang Rui 2010-02-25 02:25:05 UTC
please also attach the output of "powertop -d", on both kernels w/ and w/o this regression.
Comment 40 public 2010-02-25 08:36:25 UTC
Created attachment 25207
powertop -d while cpu is idle

Hi,

please find attached the output of powertop on my asus notebook.
Comment 41 Zhang Rui 2010-02-26 03:13:08 UTC
you probably forgot to attach the "powertop -d" output of a working kernel, i.e. a kernel without the regression.
Comment 42 public 2010-02-26 07:54:08 UTC
Unfortunately I do not have a kernel without this regression installed. Is it of any help if I use a Jaunty live image to get this information?
Comment 43 public 2010-02-26 09:02:59 UTC
Created attachment 25237
grep .  /sys/firmware/acpi/interrupts/*

CPU was about 87°C while taking the attached information.

Please note that I applied the boot option "thermal.psv=80" some weeks ago.

Regards,
Florian
Comment 44 Zhang Rui 2010-03-01 05:58:26 UTC
/sys/firmware/acpi/interrupts/gpe1A:       0	invalid
/sys/firmware/acpi/interrupts/gpe1B:       0	invalid
/sys/firmware/acpi/interrupts/gpe1C:  292585	enabled
/sys/firmware/acpi/interrupts/gpe1D:       0	invalid
/sys/firmware/acpi/interrupts/gpe1E:       0	invalid
/sys/firmware/acpi/interrupts/gpe1F:       0	invalid
/sys/firmware/acpi/interrupts/gpe_all:  292597
/sys/firmware/acpi/interrupts/sci:  292597
/sys/firmware/acpi/interrupts/sci_not:       0


        Device (EC0)
        {
            Name (_HID, EisaId ("PNP0C09"))
            ...
            Name (_GPE, 0x1C)
            ...
        }

this seems like another EC interrupt storm issue.

Alexey, can you look at this issue please?
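The reasoning in comment #44 is that nearly all SCI interrupts come from a single GPE, and that the DSDT assigns that GPE (0x1C) to the embedded controller. The Python sketch below, which is an illustration and not part of the thread, picks the busiest GPE out of the "grep . /sys/firmware/acpi/interrupts/*" text. The function name and the excerpted sample are assumptions, and a high count alone does not prove a storm, since the rate over time matters as well.

```python
# Excerpt of the counters quoted in comment #44 (tab-separated status column).
INTERRUPTS = """/sys/firmware/acpi/interrupts/gpe1B:       0\tinvalid
/sys/firmware/acpi/interrupts/gpe1C:  292585\tenabled
/sys/firmware/acpi/interrupts/gpe1D:       0\tinvalid
/sys/firmware/acpi/interrupts/gpe_all:  292597
/sys/firmware/acpi/interrupts/sci:  292597
/sys/firmware/acpi/interrupts/sci_not:       0
"""


def dominant_gpe(interrupts_text):
    """Return (name, count, share of SCI total) for the busiest GPE, or None."""
    counts = {}
    sci_total = 0
    for line in interrupts_text.splitlines():
        path, _, rest = line.partition(":")
        fields = rest.split()
        if not fields or not fields[0].isdigit():
            continue
        name = path.rsplit("/", 1)[-1]
        if name == "sci":
            sci_total = int(fields[0])
        elif name.startswith("gpe") and name != "gpe_all":
            counts[name] = int(fields[0])
    if not counts or sci_total == 0:
        return None
    busiest = max(counts, key=counts.get)
    return busiest, counts[busiest], counts[busiest] / sci_total


if __name__ == "__main__":
    name, count, share = dominant_gpe(INTERRUPTS)
    # The hex suffix of the file name is the GPE number, e.g. gpe1C -> 0x1C.
    print(name, hex(int(name[3:], 16)), count, "share of SCI: %.4f" % share)
```

For the posted counters, gpe1C accounts for more than 99.9% of all SCI interrupts, and its number matches the Name (_GPE, 0x1C) entry under Device (EC0).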
Comment 45 Alexey Starikovskiy 2010-03-01 09:57:29 UTC
Bugie,
Could you please uncomment "#define DEBUG" at the beginning of drivers/acpi/ec.c,
enable kernel timestamps in printk ("Kernel Hacking" section) and attach resulting dmesg?
Everybody else, please open other bug reports, as you have different issues.
Comment 46 public 2010-03-01 18:23:03 UTC
Hi Alexey,

do I have to compile anything?
Could you tell me in detail what to do?

Thanks,
Bugie
Comment 47 Alexey Starikovskiy 2010-03-01 18:49:43 UTC
https://help.ubuntu.com/community/Kernel/Compile -- follow this guide to compile a custom kernel... Once you have a bootable custom kernel, make the modifications above.
Comment 48 Florian Mickler 2010-10-07 20:55:18 UTC
Is this issue still present in current mainline kernels?
Did you succeed in compiling a bootable kernel?
Comment 49 Zhang Rui 2010-10-22 03:11:32 UTC
Bug closed as there is no response from the bug reporter for months.
please re-open this bug report if the problem still exists in the latest upstream kernel, say 2.6.35 or 2.6.36-rc and you can provide the info requested in comment #45.