Bug 14695
Summary: regression in karmic thermal control - Thinkpad X24
Product: ACPI
Reporter: public
Component: EC
Assignee: acpi_ec
Status: CLOSED INSUFFICIENT_DATA
Severity: normal
CC: astarikovskiy, bugzilla.kernel.org, dmbarnes, florian, lenb, rjw, rui.zhang, yakui.zhao
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.32
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 14230
Attachments:
  acpidump
  cat_proc_acpi_thermal_zone
  before adding boot option
  after adding boot option
  powertop -d while cpu is idle
  grep . /sys/firmware/acpi/interrupts/*
Description
public
2009-11-26 08:45:08 UTC
Besides the standard karmic kernel I also tried the current mainline kernel. Yesterday I encoded with avidemux, which resulted in a critical temperature shutdown, too.

Will you please attach the output of acpidump? It would be great if you could also attach the output of "cat /proc/acpi/thermal_zone/*". Thanks.

Created attachment 23983
acpidump

Created attachment 23984
cat_proc_acpi_thermal_zone
Please boot with "acpi_enforce_resources=lax" and see if the problem still exists.

I tried this boot parameter but without success. PC just crashed during Avidemux encode.

Hmm, there is no ACPI fan control in this laptop and the passive trip point is too close to the critical trip point.
Please set CONFIG_ACPI_THERMAL=y, boot with thermal.psv=80 and see if it helps.

(In reply to comment #7)
> please set CONFIG_ACPI_THERMAL=y, boot with thermal.psv=80 and see if it
> helps.

Hi, could you please tell me in detail how to do this?

Is the ACPI thermal driver built in or loaded as a module?
If it's built in, please add boot option "thermal.psv=80" in the grub menu.
If it's a module, please add "thermal thermal.psv=80" in the /etc/modules file and reboot.
Please attach the output of "grep . /proc/acpi/thermal_zone/THRM/*" after this test.

Still high temperature, but a first test with a 30min Avidemux encode did not crash my computer. I'll attach two files, one before adding that boot option and one afterwards.

Created attachment 24149
before adding boot option

Created attachment 24150
after adding boot option
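
The concern raised above, that the passive trip point sits too close to the critical one, can be checked directly from the trip point output. The following is a minimal Python sketch, assuming the text format printed by /proc/acpi/thermal_zone/*/trip_points on kernels of this era. The function names parse_trip_points and passive_margin are hypothetical, and the sample values are illustrative, modeled on readings reported elsewhere in this thread (critical 95 C, passive 92 C by default, passive 80 C with thermal.psv=80).

```python
import re

# Matches the "critical (S5): 95 C" and "passive: 80 C: ..." lines.
# The pattern also works on "grep ." output, where each line is
# prefixed with the file name.
TRIP_RE = re.compile(r"(critical \(S5\)|passive):\s*(\d+)\s*C")


def parse_trip_points(text):
    """Return a dict with the 'critical' and 'passive' temperatures (in C) found in text."""
    trips = {}
    for name, value in TRIP_RE.findall(text):
        key = "critical" if name.startswith("critical") else "passive"
        trips[key] = int(value)
    return trips


def passive_margin(text):
    """Return how many degrees lie between the passive and critical trip points."""
    trips = parse_trip_points(text)
    return trips["critical"] - trips["passive"]


# Illustrative samples (assumed values, not taken from the attachments).
SAMPLE_DEFAULT = """critical (S5): 95 C
passive: 92 C: tc1=5 tc2=2 tsp=600 devices=CPU0"""

SAMPLE_PSV80 = """critical (S5): 95 C
passive: 80 C: tc1=5 tc2=2 tsp=600 devices=CPU0"""

if __name__ == "__main__":
    print("margin without boot option:", passive_margin(SAMPLE_DEFAULT), "C")
    print("margin with thermal.psv=80:", passive_margin(SAMPLE_PSV80), "C")
```

A small margin leaves passive cooling (CPU throttling) very little time to act before the critical shutdown temperature is reached, which is why lowering the passive trip point was suggested as a workaround.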
Comment #9 is just a workaround for this issue. It slows down the processor once the system overheats.
But the real problem is that the laptop overheats in the Karmic kernel while it didn't before, right?

Can you give a detailed description of the problem please:
1. what is the latest kernel that doesn't have this problem
2. what is the earliest kernel in which this problem exists
3. can you please try the latest vanilla kernel (2.6.32) and see if it helps?
4. can you hear the fan spinning in the working kernel? can you hear the fan spinning in the overheating kernel?
Do you think the computer works at a much higher temperature in the karmic kernel than before?

(In reply to comment #13)
> But the real problem is that laptop overheats in Karmic kernel while it
> doesn't before, right?

Yes! This did not happen in Jaunty.

1. I don't know exactly; as it was in jaunty it should have been 2.6.28-17
2. 2.6.31-14, the first that came with karmic as far as I know
3. I already tried 2.6.32-999
4. Both: yes

Any new ideas? More information needed?

I am another user affected by this issue. My computer is a Thinkpad X24 that previously did not have overheating issues in Jaunty but does now in Karmic. I am the original reporter of the downstream ticket in Launchpad referenced above.

I tested with today's Ubuntu Mainline kernel (they are free from Ubuntu patches, more info at https://wiki.ubuntu.com/KernelTeam/MainlineBuilds) and the problem exists there as well. The mainline kernel is compiled with CONFIG_ACPI_THERMAL=y, yet the computer shut down within about 60 seconds after increasing load to 100% by running a couple of sha512sum processes. This was irrespective of whether the boot option thermal.psv=80 was given via grub or not. I did see in conky that the CPU frequency dropped for about 2-3 seconds, went back up and then the computer shut down very quickly after that.
$ cat /proc/acpi/thermal_zone/THM0/*
<setting not supported>
<polling disabled>
state: ok
temperature: 74 C
critical (S5): 95 C
passive: 80 C: tc1=5 tc2=2 tsp=600 devices=CPU0

Can't answer all questions from comment #14 yet, but the latest mainline kernel still has this problem.

Oh, the "passive" line above read 92° when booting without that boot option IIRC.

Anything new about this bug?

What's the kernel version of Jaunty?
As we only handle vanilla kernel bugs here, I suggest that you try a vanilla kernel with the same version, and verify whether the system works well in the vanilla kernel as well.
If yes, then we know that this is an upstream kernel regression, so we can use git bisect to find out which commit introduced this bug.

*** Bug 14888 has been marked as a duplicate of this bug. ***

Zhang, thanks for taking a look. Yes, this is a bug not specific to Ubuntu kernels. The Ubuntu mainline kernels are vanilla kernels with no distro specific patches applied.

Hmmm, what's the kernel version of Jaunty?

Latest kernel used in jaunty is 2.6.28-17
http://bugzilla.kernel.org/show_bug.cgi?id=14695#c14

Please also have a look at Mark's "Sidux" note at comment 19:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/432670?comments=all

I'm adding to this report on behalf of a number of users at ubuntuforums.org, including myself. Most of us are using Toshiba Satellites with dual AMD Turion processors. The fan doesn't seem to turn on until the CPUs reach temperatures well above recommended levels, or in some cases doesn't come on at all. We've tried multiple methods of correcting the problem, none of which have thus far succeeded. I'm hoping the detective work we have done can be of some use here.
Full thread on Ubuntu Forums: http://ubuntuforums.org/showthread.php?t=1282161

-----

My results from the command line:

$ cat /proc/acpi/thermal_zone/*/trip_points
critical (S5): 108 C
passive: 101 C: tc1=30 tc2=30 tsp=50 devices=CPU0 CPU1
active[0]: 94 C: devices=FAN0
active[1]: 82 C: devices=FAN1
active[2]: 72 C: devices=FAN2
active[3]: 52 C: devices=FAN3
active[4]: 42 C: devices=FAN4

"powersave -T" shows similar results. Only the "active[0]" point seems to work; at this point the fan goes from silent to full blast for a brief moment, then shuts off again.

I tried adjusting thermal management and cooling policy in powersave, to no avail; tried several ACPI recognition protocols in /etc/default/grub and got nothing. Just updated to 2.6.31-17-generic; no change.

Any more info you need, just say the word; I'll do my best to provide.

Rolf Leggewie and Derek Barnes,
it would be great if you guys can open a new bug report for the problem on your laptop, because they don't look like the same problem as the one here: Bugie's laptop doesn't have ACPI fan devices.

(In reply to comment #25)
> Rolf Leggewie and Derek Barnes,
> it would be great if you guys can open a new bug report for the problem on
> your laptop, because they don't look like the same problem as the one here:
> Bugie's laptop doesn't have ACPI fan devices.

Neither does mine as far as I can tell. That said, a fan that won't turn off and one that won't turn on are technically separate problems, whether they have a related cause or not. I'll file a separate report ASAP.

(In reply to comment #26)
> (In reply to comment #25)
> > Rolf Leggewie and Derek Barnes,
> > it would be great if you guys can open a new bug report for the problem on
> > your laptop, because they don't look like the same problem as the one here:
> > Bugie's laptop doesn't have ACPI fan devices.
>
> Neither does mine as far as I can tell.
$ cat /proc/acpi/thermal_zone/*/trip_points
critical (S5): 108 C
passive: 101 C: tc1=30 tc2=30 tsp=50 devices=CPU0 CPU1
active[0]: 94 C: devices=FAN0
active[1]: 82 C: devices=FAN1
active[2]: 72 C: devices=FAN2
active[3]: 52 C: devices=FAN3
active[4]: 42 C: devices=FAN4

This indicates that there are ACPI fans on your laptop.

> That said, a fan that won't turn off
> and one that won't turn on are technically separate problems, whether they
> have a related cause or not. I'll file a separate report ASAP.

That would be great! Thanks.

Closing this bug because https://bugs.launchpad.net/ubuntu/+source/linux/+bug/432670/comments/28 says that the overheating problem doesn't exist in the 2.6.33-999 kernel.
Rolf Leggewie and bugie, please re-open it if this is not true for you.

Zhang, I'd reopen this ticket but apparently I lack the privs to do so. While it is true that current mainline seems to have improved a lot here, driving the machine really hard still results in a shutdown. I doubt that mark drove his machine beyond normal day-to-day load.

Here's the situation as it applies to me. The unpatched mainline kernel from a few weeks ago was no improvement over the kernel shipped in the latest Ubuntu. When shooting off a few sha256sum calculations for large files I could get the machine to shut down for overheating within 60 seconds after login. With the suggested thermal.psv=80 and yesterday's mainline kernel the situation has improved a LOT. Doing those CPU-intensive calculations again pushes the CPU temperature as measured in conky to about 90+ degrees very quickly. The fan is going strong, although it may not be at the highest setting (sorry, forgot to check). The on-demand governor only very seldom lowers the CPU frequency (I'd guess every 30 seconds for about a second or two). The machine would finally be pushed over the edge after about 10 minutes of non-stop heavy computation.
In essence I would say that during normal daytime operations one will possibly not hit this bug (although tabs with flash content in FF are CPU-intensive). But it's not yet gone.

So the problem you described doesn't exist in the earlier kernel releases, and it has just become better in the latest kernel but is not fixed, right?
Re-opening this bug.

Yes. Let me know what information I can contribute to narrow things down.

I tested Ubuntu's mainline kernel linux-image-2.6.33-020633rc4-generic_2.6.33-020633rc4_i386.deb yesterday without any success. Temperature increased to 80°C and the cpu was throttled then.
Regards,
Florian

Hi,
I discovered that CPU throttling should work with Karmic's standard kernel (workaround http://bugzilla.kernel.org/show_bug.cgi?id=14695#c13 ). This already did work with the mainline kernel as said in my previous post.
I'm using the Gnome applet which lets you regulate CPU steps. Unfortunately I don't know its English name. It was always "on demand", which results in temperatures of about 100 degrees, even with the mentioned workaround. At the moment avidemux encodes a video and the CPU is set to "1.6GHz", but now the CPU is throttled automatically (800MHz <-> 1.2GHz <-> 1.6GHz). Of course this isn't the expected behaviour, because the CPU should be fixed at 1.6GHz with that setting. For me it's better that way. :-)
Kind regards,
Florian

Please run the top command for a couple of seconds when the system is idle and see if there is any process that takes a lot of cpu resources.

I already checked if there is any process eating up my cpu. If the system is idle it is also cool. Can't say how many degrees, but cool enough that the fan is not audible.

Problem still exists. Anything new?

There's some speculation in the Launchpad ticket now whether messages in kern.log of the form

Feb 19 19:40:11 X24 kernel: [ 1.196034] Clocksource tsc unstable (delta = -283962795 ns)

might have anything to do with this and point to the deeper issue.
I'm not an expert on what that line means, but it seems to have to do with errors when switching CPU frequency, and I did observe some funky frequency switching in conky quite a few times just seconds before the overheat shutdown occurred. Zhang, I think you are the expert. Do you see any possible link here?

Please attach the output of "grep . /sys/firmware/acpi/interrupts/*" when the system overheats.

(In reply to comment #37)
> Feb 19 19:40:11 X24 kernel: [ 1.196034] Clocksource tsc unstable (delta =
> -283962795 ns)
>

I don't think this is related because this can be seen on many platforms.

Please also attach the output of "powertop -d", on both kernels w/ and w/o this regression.

Created attachment 25207
powertop -d while cpu is idle
Hi,
please find attached the output of powertop on my asus notebook.
You probably forgot to attach the "powertop -d" output of a working kernel, i.e. a kernel without the regression.

Unfortunately I do not have a kernel without this regression installed. Is it of any help if I use a Jaunty live image to get this information?

Created attachment 25237
grep . /sys/firmware/acpi/interrupts/*
CPU was about 87°C while taking the attached information.
Please note that I applied the boot option "thermal.psv=80" some weeks ago.
Regards,
Florian

/sys/firmware/acpi/interrupts/gpe1A: 0 invalid
/sys/firmware/acpi/interrupts/gpe1B: 0 invalid
/sys/firmware/acpi/interrupts/gpe1C: 292585 enabled
/sys/firmware/acpi/interrupts/gpe1D: 0 invalid
/sys/firmware/acpi/interrupts/gpe1E: 0 invalid
/sys/firmware/acpi/interrupts/gpe1F: 0 invalid
/sys/firmware/acpi/interrupts/gpe_all: 292597
/sys/firmware/acpi/interrupts/sci: 292597
/sys/firmware/acpi/interrupts/sci_not: 0

Device (EC0)
{
    Name (_HID, EisaId ("PNP0C09"))
    ...
    Name (_GPE, 0x1C)
    ...
}

This seems like another EC interrupt storm issue.
Alexey, can you look at this issue please?

Bugie,
Could you please uncomment "#define DEBUG" at the beginning of drivers/acpi/ec.c, enable kernel timestamps in printk ("Kernel Hacking" section) and attach the resulting dmesg?

Everybody else, please open other bug reports, as you have different issues.

Hi Alexey,
do I have to compile anything? Could you tell me in detail what to do?
Thanks,
Bugie

https://help.ubuntu.com/community/Kernel/Compile -- follow this guide to compile a custom kernel...
Once you have a bootable custom kernel, do the modifications above.

Is this issue still present in current mainline kernels? Did you succeed in compiling a bootable kernel?

Bug closed as there has been no response from the bug reporter for months.
Please re-open this bug report if the problem still exists in the latest upstream kernel, say 2.6.35 or 2.6.36-rc, and you can provide the info requested in comment #45.
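
The reasoning behind the "EC interrupt storm" diagnosis is that nearly all SCI interrupts come from a single GPE, and that GPE number matches the _GPE value declared for the EC0 device. A minimal Python sketch of that check follows, assuming the "grep . /sys/firmware/acpi/interrupts/*" output format quoted in this report. The function names parse_counts and dominant_gpe are hypothetical; the sample text is the excerpt from this report.

```python
import re

# One entry per "path: count [state]" item, as printed by
# "grep . /sys/firmware/acpi/interrupts/*".
LINE_RE = re.compile(r"/sys/firmware/acpi/interrupts/(\w+):\s*(\d+)")
# Individual GPE counters are named "gpe" plus two hex digits (excludes gpe_all).
GPE_NAME_RE = re.compile(r"gpe[0-9A-Fa-f]{2}")


def parse_counts(text):
    """Map each counter name (gpe1C, sci, ...) to its interrupt count."""
    return {name: int(count) for name, count in LINE_RE.findall(text)}


def dominant_gpe(text):
    """Return (name, count, share_of_sci) for the busiest individual GPE."""
    counts = parse_counts(text)
    gpes = {n: c for n, c in counts.items() if GPE_NAME_RE.fullmatch(n)}
    name = max(gpes, key=gpes.get)
    sci = counts.get("sci", 0)
    share = gpes[name] / sci if sci else 0.0
    return name, gpes[name], share


SAMPLE = """/sys/firmware/acpi/interrupts/gpe1A: 0 invalid
/sys/firmware/acpi/interrupts/gpe1B: 0 invalid
/sys/firmware/acpi/interrupts/gpe1C: 292585 enabled
/sys/firmware/acpi/interrupts/gpe1D: 0 invalid
/sys/firmware/acpi/interrupts/gpe1E: 0 invalid
/sys/firmware/acpi/interrupts/gpe1F: 0 invalid
/sys/firmware/acpi/interrupts/gpe_all: 292597
/sys/firmware/acpi/interrupts/sci: 292597
/sys/firmware/acpi/interrupts/sci_not: 0"""

EC_GPE = 0x1C  # from "Name (_GPE, 0x1C)" in the EC0 device of the DSDT

if __name__ == "__main__":
    name, count, share = dominant_gpe(SAMPLE)
    is_ec = int(name[3:], 16) == EC_GPE
    print(f"{name}: {count} interrupts, {share:.2%} of all SCIs, EC GPE: {is_ec}")
```

A high share on its own only shows where the interrupts come from. Whether the count amounts to a storm also depends on the uptime over which it accumulated, which this excerpt does not include.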