Most recent kernel where this bug did not occur: 2.6.21
Distribution:
Hardware Environment:
Software Environment:
Problem Description:
Steps to reproduce:
Please describe how the machine fails, and attach the output from acpidump, dmesg -s64000, dmidecode, and the contents of /proc/acpi/thermal_zone/*/*. In particular, please note whether /proc/acpi/thermal_zone/*/polling_frequency is non-zero, and any change in behaviour when it is zero.
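One way to gather the thermal zone files as pasteable text is sketched below. The helper name dump_tree and the output file names are assumptions made for illustration. Printing the contents directly, rather than archiving them, sidesteps the fact that procfs files typically report a size of zero, which can leave an archive with empty entries.

```shell
#!/bin/sh
# Print every regular file under a directory, each preceded by its path,
# so the result can be pasted into a bug report as plain text.
# dump_tree is a hypothetical helper name, not an existing tool.
dump_tree() {
    dir=$1
    find "$dir" -type f | sort | while read -r f; do
        printf '== %s ==\n' "$f"
        cat "$f" 2>/dev/null
    done
}

# Assumed usage on the affected machine (acpidump and dmidecode need root):
#   acpidump > acpidump.txt
#   dmesg -s64000 > dmesg.txt
#   dmidecode > dmidecode.txt
#   dump_tree /proc/acpi/thermal_zone > thermal_zone.txt
```

The function only reads files, so it is safe to run against any directory.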
Oops, sorry, I forgot to add a description of the bug. Done; xe3 was re-built from parts.

/proc/acpi/.../trip_points:
critical (S5): 100 C
passive: 83 C...
active[0]: 100 C...

(Hmm, active=critical? Interesting. Fortunately the fan seems to be driven by the BIOS.)

Temperature is ~63 C in "normal" use. Now let's simulate fan failure... and let's load the cpu... temperature slowly rises, 1min00 -- 72C, 1min15 -- 75C, 1min30 -- 77C, 1min45 -- 80C, 1min00 -- 82C, 1min15 -- 83C, 1min45 -- sudden powerdown, presumably because of a hardware failsafe.

So we have two bugs here: the machine should have attempted passive cooling sooner, so that the critical temperature would not be reached, and the machine should have attempted a shutdown before the hardware failsafe killed the power. I could do both in 2.6.21, by echoing new trip points and enabling polling. Polling frequency is 2 seconds.
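A timeline like the one above can be recorded with a small loop rather than by hand. This is a minimal sketch: the function names read_temp and log_temps are hypothetical, and it assumes the procfs temperature file uses the "temperature: NN C" layout on this machine.

```shell
#!/bin/sh
# Extract the integer temperature from a file in the
# "temperature:   69 C" format (assumed procfs layout).
read_temp() {
    awk '/^temperature:/ { print $2 }' "$1"
}

# Print elapsed seconds and temperature every INTERVAL seconds
# (default 15) until interrupted.
log_temps() {
    file=$1
    interval=${2:-15}
    start=$(date +%s)
    while :; do
        now=$(date +%s)
        printf '%ds -- %sC\n' "$((now - start))" "$(read_temp "$file")"
        sleep "$interval"
    done
}

# Assumed usage (zone path is machine-specific):
#   log_temps /proc/acpi/thermal_zone/*/temperature 15 | tee -a temps.log
```

Since the machine powers off abruptly, the last lines of a log file may be lost; watching the output on a second machine over ssh or a serial console is likely more reliable.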
Created attachment 12360 [details] acpidump
Created attachment 12361 [details] dmesg
Created attachment 12362 [details] tar of acpi/thermal
Provided the files; hopefully I did not forget anything this time.
(aha, and I should say that 2 seconds polling is there by default, without any tweaks).
Created attachment 12363 [details] dmidecode

Of course I did forget to attach something. Fixed now.
Assuming the system always shuts down just above 83C, the failure at hand appears to be a thermally induced hardware or firmware induced hardware poweroff.

This isn't a particularly high temperature.
Is BIOS SETUP running with defaults?
Are there any cooling related options in BIOS SETUP?

When the system is booted with "acpi=off", does the same shutdown occur?
How about with ACPI enabled, but CONFIG_ACPI_THERMAL=n?

> temperature slowly rises, 1min00 -- 72C, 1min15 -- 75C, 1min30 --
> 77C, 1min45 -- 80C, 1min00 -- 82C, 1min15 -- 83C, 1min45 -- sudden
> powerdown, presumably because of hardware failsafe.

At what temperature do you notice the fan start spinning?
Is it spinning at full speed when the system shuts off?

> powernow-k8: Processor cpuid 661 not supported
> powernow: PowerNOW! Technology present. Can scale: frequency and voltage.
> powernow: SGTC: 10000
> powernow: Minimum speed 300 MHz. Maximum speed 900 MHz.

Does this system fail the same way when the powernow software is disabled?
Reply-To: pavel@ucw.cz

Hi!

> ------- Comment #9 from len.brown@intel.com 2007-08-13 09:14 -------
> Assuming the system always shuts down just above 83C,
> the failure at hand appears to be a thermally induced
> hardware or firmware induced hardware poweroff.

Yes, that was my diagnosis, too.

> This isn't a particularly high temperature.
> Is BIOS SETUP running with defaults?
> Are there any cooling related options in BIOS SETUP?

I do not think BIOS has any cooling-related options, I'll check again.

Actually, 83C is too _low_ for emergency-shutdown temperature. Apparently, they put HDD near the CPU, and running at the temperature near the 83C for extended period overheats the HDD, producing read/write errors.

Will check with acpi-off, rmmod thermal, etc.

Pavel
> tar of acpi/thermal

The files in attachment #5 [details] are empty.
Please just copy/paste their contents into bugzilla.
> Polling frequency is 2 seconds.

There is no _TZP in the DSDT, so the kernel should not enable polling on this box. Please disable/override whatever in user-space requests the 2-second polling interval and report how the machine behaves with polling disabled -- its default configuration.

Also, please kill acpid and cat /proc/acpi/event and see if changes in temperature give any thermal events.

Re: fan control
The DSDT does show real-live fan _ON/_OFF methods.
In 2.6.21 were you able to control the state of the fan by modifying the (100C) active trip point?
Are you able to cause the fan to start running at a specific temperature?

Are you able to change the behaviour of the fan by echo 0 or 3 into the fan state files -- and is the change in state indicated in those files?
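The polling and event checks requested above might look roughly like this. It is a sketch under assumptions: the helper name polling_disabled is hypothetical, the zone path is machine-specific, and it assumes that writing 0 to polling_frequency turns polling off and that the file then contains the word "disabled".

```shell
#!/bin/sh
# Succeed if a polling_frequency file reports polling as disabled.
# Assumes the kernel prints a line containing "disabled" in that case.
polling_disabled() {
    grep -q 'disabled' "$1"
}

# Assumed usage, as root, with ZONE set to the machine's thermal zone
# directory under /proc/acpi/thermal_zone:
#   killall acpid                        # so acpid does not consume the events
#   echo 0 > "$ZONE/polling_frequency"   # request no polling
#   polling_disabled "$ZONE/polling_frequency" && echo "polling is off"
#   cat /proc/acpi/event                 # blocks; prints events as they arrive
```

If nothing appears on /proc/acpi/event while the temperature crosses a trip point, the platform is probably not delivering thermal events.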
Reply-To: pavel@ucw.cz

> The files in attachment #5 [details] are empty
> please just copy/paste their contents into bugzilla

Sorry. Here they are; will find out why I have 2 seconds polling there.

dream:/proc/acpi/thermal_zone/THRM # cat cooling_mode
cooling mode: active
dream:/proc/acpi/thermal_zone/THRM # cat polling_frequency
polling frequency: 2 seconds
dream:/proc/acpi/thermal_zone/THRM # cat state
state: ok
dream:/proc/acpi/thermal_zone/THRM # cat temperature
temperature: 69 C
dream:/proc/acpi/thermal_zone/THRM # cat trip_points
critical (S5): 100 C
passive: 83 C: tc1=2 tc2=5 tsp=300 devices=0xcffdb338
active[0]: 100 C: devices=0xcffd4e14
dream:/proc/acpi/thermal_zone/THRM #

Machine is stable if I simulate fan failure, load the cpu and rmmod thermal and rmmod powernow_k7. I guess 300MHz cpu is not hot enough to reach any interesting temperature. Inserting powernow_k7 kills the machine in minute-or-so.
Reply-To: pavel@ucw.cz

Hi!

> Re: fan control
> The DSDT does show real-live fan _ON/_OFF methods.
> In 2.6.21 were you able to control the state of the fan
> by modifying the (100C) active trip point?
> Are you able to cause the fan to start running at a specific temperature?

Strange, yes, fan interface is there. But it says

dream:/proc/acpi/fan/FAN # cat state
status: off
dream:/proc/acpi/fan/FAN #

...and fan is clearly spinning.

> Are you able to change the behaviour of the fan by echo 0 or 3
> into the fan state files -- and is the change in state indicated
> in those files?

I can echo 0/3, and the value in the state file changes, but it does not seem to affect the physical fan. It just spins.

Pavel
Reply-To: pavel@ucw.cz

> Re: fan control
> The DSDT does show real-live fan _ON/_OFF methods.
> In 2.6.21 were you able to control the state of the fan
> by modifying the (100C) active trip point?
> Are you able to cause the fan to start running at a specific temperature?
>
> Are you able to change the behaviour of the fan by echo 0 or 3
> into the fan state files -- and is the change in state indicated
> in those files?

Fan seems to have a mind of its own. It comes to life at ~60 celsius, and turns itself off at ~55. Nothing I do can control it.

I was not trying to control it in 2.6.21 because the fan was physically dead at that time... I was trying to lower passive trip points to cool the machine that way.

Pavel
With acpi=off (and powernow-k7 loaded, fan failed, and cpu loaded) the machine fails after about 2min30 with the same thermally induced hardware poweroff.

No thermal-related settings in BIOS, but I reset the values to defaults, anyway.

Hmm, polling_frequency indeed says <disabled> on init=/bin/bash boot. Sorry for the confusion, something in suse10.2 is playing with me.
I tried cat /proc/acpi/event with acpid and polling disabled, but did not see anything -- but the 83C passive trip point is so close to hw shutdown that I may have missed it.
> Machine is stable if I simulate fan failure, load the cpu and rmmod
> thermal and rmmod powernow_k7. I guess 300MHz cpu is not hot enough to
> reach any interesting temperature. Inserting powernow_k7 kills the
> machine in minute-or-so.

If you boot without powernow_k7, the machine comes up and runs at constant 300MHz and does not fail under load -- even if you disconnect the fan? How hot is it getting? Can you load thermal here and verify that you are not exceeding 83C?

Is it possible to run the same experiment, but at peak MHz? I'd like to see powernow not loaded, if possible, or at least not running -- say, by using the performance governor.

> Fan seems to have a mind of its own. It comes to life at ~60
> celsius, and turns itself off at ~55. Nothing I do can control it.

Apparently ACPI fan control on this system is a facade with nothing behind it. Certainly with no _TZP and no thermal events, Windows will simply ignore it. Linux should just ignore it too. This can be done in a pretty way in 2.6.23-rc3 via "thermal.act=-1".

But the fact that thermal events don't work suggests that Windows will also not notice the passive trip point at 83C -- and it should run into the same hardware malfunction that Linux runs into.

It would be useful if you can verify that Windows causes this unit to malfunction the same way Linux does, or if Windows behaves differently.

Does more than one of these machines exist? If so, do they all fail the same way?

The fact that the fan kicks in starting at 60C but the system continues to heat up to 83C suggests that the thermal solution is simply broken on this unit.

But the fact that the malfunction occurs at 83C, which happens to be the passive trip point, is either a very large coincidence (possible, since you get a failure also with acpi=off, though unclear at what temperature) or throttling may actually be involved in provoking the failure.

What do you see if you boot 2.6.23-rc3 with "thermal.psv=70" and heat up the system? Does it continue to function while keeping the temperature down to 70?

Note that since there seem to be no thermal events on this box, you'll also need thermal.tzp=5 or use what SuSE uses to enable polling from proc.
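After rebooting with these overrides, it may be worth confirming that they actually reached the kernel command line. The sketch below does that; the helper name cmdline_value is hypothetical, while thermal.psv and thermal.tzp are the parameters named above.

```shell
#!/bin/sh
# Print the value of a "name=value" parameter found in a kernel
# command-line string; prints nothing if the parameter is absent.
cmdline_value() {
    name=$1
    shift
    for word in $*; do
        case $word in
            "$name"=*) printf '%s\n' "${word#*=}" ;;
        esac
    done
}

# Assumed usage after booting with "thermal.psv=70 thermal.tzp=5":
#   cmdline_value thermal.psv "$(cat /proc/cmdline)"
#   cmdline_value thermal.tzp "$(cat /proc/cmdline)"
```

An empty result would mean the bootloader entry was not the one that was edited, which is a cheaper thing to rule out than a failed thermal experiment.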
Created attachment 12386 [details] patch vs 2.6.23-rc3 creating thermal.crt=C

Please apply this patch to 2.6.23-rc3 and boot with "thermal.crt=80" or something that will provoke a graceful shutdown before the hardware malfunctions.

Note that since this system doesn't seem to provide any thermal events, you'll need to invoke polling (eg. thermal.tzp=10) to get this threshold noticed.

I hesitate to add a DMI option to invoke workarounds for this model automatically, because the cooling failure and hardware malfunction look like a unit failure that may not be shared with other units.

We could add a DMI entry to ignore the false active cooling hooks -- but that would be cosmetic only, since they do no harm today, other than perhaps confuse users of this model.

Assuming the powernow thing is not related to the failure, perhaps the (manually invoked) .psv and .crt thermal hooks are sufficient to make this box usable?
The patch in comment #19 is in the acpi test tree. Please let me know if it is insufficient to resolve this issue.
The patch in comment #19 shipped in linux-2.6.23-rc3-git9. Closed.
Reply-To: pavel@ucw.cz

Hi!

> the patch in comment #19 shipped in linux-2.6.23-rc3-git9
> closed.

Thanks a lot, configurable critical trip point should solve the issue.