Bug 6315
Summary: | Fan control broken for Toshiba Satellite A40 | ||
---|---|---|---|
Product: | Drivers | Reporter: | Frans Pop (elendil) |
Component: | I2C | Assignee: | Jean Delvare (jdelvare) |
Status: | RESOLVED CODE_FIX | ||
Severity: | high | CC: | acpi-bugzilla, d.gaffuri, trenn |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.16 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: |
ACPI dump
Output of dmesg
Make use of the INUSE_STS hardware semaphore
Detect conflicting ACPI I/O accesses
Stop unhiding the SMBus on Toshiba laptops |
Description
Frans Pop
2006-04-01 03:13:04 UTC
One thing I should probably mention. /proc/acpi/fan/FAN/state has always shown the status of the fan as "off", even when it's running. /proc/acpi/toshiba/fan (with toshiba_acpi loaded) shows the correct state.

$ cat /proc/acpi/fan/FAN/state
status: off
$ cat /proc/acpi/toshiba/fan
running: 1
force_on: 0

Could you provide the output of 'dmesg', 'acpidump' commands, please.

Created attachment 7737 [details]
ACPI dump
Created attachment 7738 [details]
Output of dmesg
any difference with boot option ec_intr=0?

> any difference with boot option ec_intr=0?
No, makes no difference at all.
Any difference when you do not try to load toshiba module? Can you (or could you with older kernels) influence fan activity with the toshiba module? Any error messages when you search dmesg |grep -i acpi?

> Any difference when you do not try to load toshiba module?

No, I first noticed the problem while I did not have that module loaded; having it loaded makes no difference.

> Can you (or could you with older kernels) influence fan activity with the toshiba module?

Yes.

# echo force_on:1 >/proc/acpi/toshiba/fan
# fan starts running
# cat /proc/acpi/toshiba/fan
running: 1
force_on: 1
# echo force_on:0 >/proc/acpi/toshiba/fan
# fan turns off again

But of course the fan only runs at constant speed, I still get no automatic, flexible control.

> Any error messages when you search dmesg |grep -i acpi?

I can see no errors. (Note: I already attached dmesg to this bug earlier.)

Hi, for the power off problem you may want to take a look at this bug: http://bugzilla.kernel.org/show_bug.cgi?id=6395

Thanks a lot for pointing that out. I've compiled a custom kernel based on Debian's current 2.6.16 with the patch from http://bugzilla.kernel.org/attachment.cgi?id=7896&action=view and indeed it does solve the reboot issue.

And, surprise, surprise, it also restores my fan to perfect working order :-) I definitely did not expect that when I convinced myself to check fan operation too on the off-chance.

Cheers,
FJP

P.S. I'll leave proper merging/closing of the bug to you folks. Thanks for all the help and suggestions.

Daniele, thanks for your help, very much appreciated :)

Frans, the fan behavior you describe suggests that something (BIOS? ACPI EC?) is accessing the ADM1032 chip in the background, bypassing the I/O region request and mutex which normally grant exclusive access to the i2c-i801 SMBus. This is likely to cause trouble if you are running "sensors" or any other tool accessing the ADM1032 chip in the regular way.
Concurrent accesses to the ADM1032 chip will result in data corruption on the SMBus and/or lockups.

Note that the Toshiba Satellite A40 originally didn't expose its SMBus device for the i2c-i801 to attach to. We added a PCI quirk to allow that, and with what we now know, it wasn't a safe decision. We may have to step back on this one (and possibly other similar quirks.)

On a more general note, we have a problem on many systems with "something" (again, I guess BIOS or ACPI EC) accessing devices on the SMBus or I/O regions that are in use by hardware monitoring drivers. Without proper locking to grant exclusive access to these resources, we are in trouble, and many users reported problems on different systems already. We need to think about a general solution to this issue, suggestions are welcome.

It would be worth trying with a minimal system (init=/bin/bash); if the fan still does not work, it's very likely the ACPI subsystem itself? Could this have to do with unthreaded executed ACPI control methods? It could be worth testing the patch from bug #5534, then.

I will try to test a kernel without the PEC bit patch and with the quirk disabled. Please hold off removing the quirk until I confirm that both the reboot problem and the fan control issue are solved by that. I'd really hate to lose the info unnecessarily :-P

I can confirm that without the quirk the PEC patch is not needed. So it looks as if khali's analysis in [1] is correct and disabling the quirks is the best solution for now (unfortunately ;-).

[1] http://bugzilla.kernel.org/show_bug.cgi?id=6315#c11

It does make me wonder though how Windows utilities are able to access hw-sensors data... I hope it will still be possible to find a structural solution that will allow availability of the data without causing the conflicts.

khali: Could you please let me know the patch you'll submit for inclusion in the kernel so I can get it applied in Debian ASAP?
Thomas, I am not familiar with ACPI, but my guess is that the ACPI EC somehow accesses the ADM1032 temperature sensor through the Intel SMBus, just like the i2c-i801 + lm90 driver combination does, but without proper resource request/locking. If so, users are likely to observe problems if the acpi subsystem and the i2c subsystem both access the devices at the same time.

What made things worse in Frans' (and Daniele's) case is that the i2c-i801 driver was leaving the SMBus device in a non-standard state (PEC enabled.) This wouldn't be a problem if all accesses were going through the i2c-i801 driver as they should. This also wouldn't be a problem if other subsystems (BIOS, ACPI EC) accessing the device did not rely on assumptions with regards to the "initial" state of the SMBus device. Unfortunately, both assertions seem to be false.

The most immediate fix was to revert a change in the i2c-i801 driver, so that it will again always leave the SMBus device in a standard state. This fix will go in 2.6.16.10 and 2.6.17-rc3.

However, this doesn't address the more general problem of ACPI EC possibly accessing resources without properly requesting them beforehand. Suggestions on how this problem could be solved are welcome.

Frans, the PEC bit patch is already merged:
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c79cfbaccac0ef81ab3e796da1582a83dcef0ff9
This patch will go in 2.6.16.10 too.

I will not disable the quirks for now. I want to test a few other possibilities before I do. It would obviously be better if ACPI EC and regular SMBus access could work together.

Excellent. Thanks. I'll keep following both bugs and will be happy to test any proposed patches.

Created attachment 7939 [details]
Make use of the INUSE_STS hardware semaphore
Frans, Daniele, please try this patch. It makes use of the SMBus hardware
semaphore of the 82801 family of chips. This semaphore is meant to help handle
"various independent software" accessing the SMBus host controller. The
i2c-i801 driver wasn't using that feature so far, as we did not expect other
software to access the device.
If the ACPI EC implementation in your respective Toshiba laptops is properly
using this semaphore, then this could solve the concurrent access problem for
this specific case. However I wouldn't be too optimistic: I have two laptops
using the i2c-i801 driver here (ICH3-M and ICH4-M chips) and in both cases, the
BIOS isn't properly releasing the semaphore before Linux loads.
If nothing else, this patch will log interesting information. It'll complain if
the BIOS did not properly release the semaphore (if so, the driver will
forcibly reset it so that it has a chance to work.) It'll also notify when a
concurrent access (presumably from ACPI EC) is detected. If the ACPI EC isn't
properly releasing the semaphore, i2c-i801 Please report to me what the logs
have to say after using it for some time.
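
The semaphore behaviour described above can be illustrated with a small toy model. This is a simulation only and touches no hardware: the class and function names are hypothetical, and the bit value 0x40 for INUSE_STS is an assumption made for illustration rather than something taken from the patch. The model assumes that a read which finds the bit clear also sets it (acquire), and that writing 1 to the bit clears it (release).

```python
INUSE_STS = 0x40  # assumed bit position of INUSE_STS in the host status register


class SimulatedHostStatus:
    """Toy model of a read-to-acquire semaphore; not real hardware access."""

    def __init__(self):
        self._in_use = False

    def read(self):
        """Return the status byte. A read that sees the bit clear also sets it."""
        value = INUSE_STS if self._in_use else 0
        self._in_use = True
        return value

    def write(self, value):
        """Writing 1 to the INUSE_STS bit releases the semaphore."""
        if value & INUSE_STS:
            self._in_use = False


def acquire(status, retries=100):
    """Poll until a read shows the semaphore was free; give up after `retries`."""
    for _ in range(retries):
        if not status.read() & INUSE_STS:
            return True
    return False


def release(status):
    status.write(INUSE_STS)


status = SimulatedHostStatus()

# Fair use: the driver acquires, performs its transaction, then releases.
fair = acquire(status)
release(status)

# Unfair use: another party acquires by reading the register and never releases,
# so every later attempt by the driver times out.
status.read()
starved = not acquire(status)

print(fair, starved)
```

In this model a single party that reads the register and never writes the bit back is enough to make every later `acquire` time out, which is why the scheme only works if all users release the semaphore when done.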
Khali: Should we test this with or without the PEC bit patch?

Frans, please test *with* the PEC bit patch applied.

Hi

801_smbus 0000:00:1f.3: SMBus semaphore is held by someone else!
lm90 0-004c: Register 0x0 read failed (-1)
i801_smbus 0000:00:1f.3: SMBus semaphore is held by someone else!
lm90 0-004c: Register 0x6 read failed (-1)
i801_smbus 0000:00:1f.3: SMBus semaphore is held by someone else!
lm90 0-004c: Register 0x5 read failed (-1)
i801_smbus 0000:00:1f.3: SMBus semaphore is held by someone else!
lm90 0-004c: Register 0x20 read failed (-1)
i801_smbus 0000:00:1f.3: SMBus semaphore is held by someone else!
lm90 0-004c: Register 0x19 read failed (-1)
i801_smbus 0000:00:1f.3: SMBus semaphore is held by someone else!
lm90 0-004c: Register 0x21 read failed (-1)

and so on...

Daniele

Identical messages as posted by Daniele for me, with the following comments:
- until the laptop switched off the fan for the first time, there was nothing in the logs and 'sensors' reported data OK
- after the laptop switched off the fan, every call of sensors would result in the messages _and_ sensors would no longer report correct values, but report the same values (from just before fan switched off) every time
- fan control kept on working correctly

So, with only the PEC bit patch, sensors would continue to work correctly; with both the PEC bit patch and this latest one, it does not (but I guess that is expected).

Registers reported as failed are (in order): 0, 6, 5, 20, 19, 21, 1, 8, 7, 2

I can confirm Frans' post, except that I'm not able now to reboot to check that errors start when fan goes off. Every time I run sensors I get a whole cycle of failed reads before getting the same temp (48.4, which is not correct as ACPI reports 71). The sequence is the same: 0x0, 0x6, 0x5, 0x20, 0x19, 0x21, 0x1, 0x8, 0x7, 0x2. Fan seems to work correctly.

Frans, yes the breakage was more or less expected.
What you both observed confirms that something is accessing the SMBus host controller behind the i2c-i801 driver's back. That something is acquiring the hardware semaphore (it's almost automatic) but does NOT properly release it when done. This semaphore can only be used if all users play fair. Here, the i2c-i801 driver (with the "use INUSE_STS" patch) will time out on every access, waiting for the semaphore, as soon as the other resource user accessed the device once.

Don't you have an "SMBus semaphore reset" message in your logs? If you have it, it should be printed just once, as soon as the i2c-i801 driver is loaded.

You can revert the "use INUSE_STS" patch for now, so that the i2c-i801 driver is again useable. I need some time to think of the next step.

I have that message

i801_smbus 0000:00:1f.3: SMBus semaphore reset (bad BIOS?)

exactly once while loading i2c-i801. I can confirm also that sensors reports correct values until the fan kicks in for the first time after module reload.

Same here:

Apr 23 23:15:26 localhost kernel: PCI: Enabled i801 SMBus device
Apr 23 23:15:26 localhost kernel: i801_smbus 0000:00:1f.3: SMBus semaphore reset (bad BIOS?)

so when the i2c-i801 driver is not loaded, ACPI works properly?

I'm not aware of any Linux-level locking that is necessary or appropriate for the i2c driver to coordinate access to the i2c bus with the ACPI EC driver. The ACPI EC driver has a notion of using the ACPI global lock to coordinate access with BIOS SMI code, but in practice, only very old systems set the bits that enable the use of that lock by the EC driver.

Perhaps we should simply not be loading the ACPI EC driver and the i2c driver at the same time? (Note that the EC driver is built-into the Linux ACPI implementation -- so what we're really talking about here is running the i2c driver only when acpi=off.)
Created attachment 10638 [details]
Detect conflicting ACPI I/O accesses

Here comes a patch (against 2.6.20.1) which will detect and log the ACPI accesses to I/O areas already requested by other drivers. I would like all users affected by ACPI vs lm_sensors conflicts to give it a try and post the generated logs.

A 2.6.21-rc3 version of the patch is also available:
http://jdelvare.pck.nerim.net/sensors/acpi-check-io-ports.patch

I'd like to test this for you, but my quirk to enable the SMBus on my laptop was removed from the kernel source :-( I doubt testing makes much sense without the SMBus enabled.

[...]

Hmm. Did not get around to checking the source before now, but I see the quirk is still there. I wonder then why it's not getting enabled anymore... I need to look into this first.

OK, I see now why the quirk is no longer active. There is now an "#ifndef CONFIG_ACPI_SLEEP" around that block of code and that is enabled in the Debian kernel config.

I'd appreciate some background on this and advice how best to test:
- remove the #ifndef
- unset CONFIG_ACPI_SLEEP

I'd hope that the eventual goal is to have both the SMBus (and i2c) and proper sleep support :-)

P.S. Thanks for getting back to this one Jean!

Frans, SMBus unhiding was made exclusive with suspend support between kernels 2.6.16 and 2.6.20 because we realized we weren't replaying the quirk on resume from RAM and it had bad consequences. Since 2.6.20 we're able to replay the quirks on resume so the #ifdef has been removed. If you're using an older kernel, please disable CONFIG_ACPI_SLEEP for the tests.

The goal "suspend and SMBus work together" is mostly reached by now. Now we need to work on the more complex "ACPI and SMBus work together" problem, this is what my test patch is for.

OK. Thanks for the info. I'm currently running Debian's 2.6.18 kernel which will be the one shipped with Etch.

I have tested your patch using the 2.6.20 Debian snapshot kernel.
This does not include 2.6.20.1 yet, but your patch applied cleanly.

After booting my laptop, the sensors were active normally and fan control was good. I gave it some work and let it cool down, and the fan speed adjusted correctly.

The question is what logs you are looking for? I had a look, but did not see anything obvious in the regular log files. A grep on "ACPI (read|write)" in /var/log returned nothing.

First of all, keep in mind that grep's syntax for alternatives is more like "ACPI \(read\|write\)".

Which drivers were loaded when you did your tests? You need to load the thermal, fan and i2c-i801 drivers to see something, and possibly also the toshiba-specific acpi driver.

You need to ensure that the i2c-i801 driver is working (i.e. the unhiding quirk worked.) You should see it in /proc/ioports:

1880-189f : 0000:00:1f.3
1880-189f : i801_smbus

If thermal isn't polling by default (check /proc/acpi/thermal_zone/*/polling_frequency) you may need to read from /proc/acpi/thermal_zone/*/temperature to trigger a resource access.

Or maybe you need to load your laptop's CPU up to the point the fan will kick in or on the contrary let it idle so the fan stops (your comment #22 suggests this.)

So please leave the patch in place and use your laptop with all drivers loaded, maybe the resource conflict only happens on specific events.

> First of all, keep in mind that grep's syntax for alternatives is more
> like "ACPI \(read\|write\)".

I used egrep of course :-)

> Which drivers were loaded when you did your tests? You need to load the
> thermal, fan and i2c-i801 drivers to see something, and possibly also
> the toshiba-specific acpi driver.

All of these were loaded.

> You need to ensure that the i2c-i801 driver is working (i.e. the unhiding
> quirk worked.) You should see it in /proc/ioports:

I have:

d880-d89f : 0000:00:1f.3
d880-d89f : motherboard
d880-d89f : i801_smbus

Also, 'sensors' works and gives correct output (showing changing limits for different FAN speeds, i.e.
when a limit is reached and a new level is entered).

> If thermal isn't polling by default (check
> /proc/acpi/thermal_zone/*/polling_frequency) you may need to read from
> /proc/acpi/thermal_zone/*/temperature to trigger a resource access.

The contents of that directory are:

* /proc/acpi/thermal_zone/THRM/cooling_mode
<setting not supported>
cooling mode: passive
* /proc/acpi/thermal_zone/THRM/polling_frequency
<polling disabled>
* /proc/acpi/thermal_zone/THRM/state
state: ok
* /proc/acpi/thermal_zone/THRM/temperature
temperature: 63 C
* /proc/acpi/thermal_zone/THRM/trip_points
critical (S5): 93 C
passive: 92 C: tc1=9 tc2=2 tsp=1800 devices=0xdeb96ea0
active[0]: 92 C: devices=0xdeb96928
active[1]: 92 C: devices=0xdeb96928

Seems like polling is not supported? However, reading temperature does nothing.

> Or maybe you need to load your laptop's CPU up to the point the fan
> will kick in or on the contrary let it idle so the fan stops (your
> comment #22 suggests this.)

Tried that several times. Also tried suspending to RAM and resuming.

> So please leave the patch in place and use your laptop with all drivers
> loaded, maybe the resource conflict only happens on specific events.

I've double checked that your patch was applied (it was). I just can't seem to trigger the events you're looking for. Not sure if that is good or bad :-P

Indeed for #22 it was trivial to trigger the conflict and I'm sure I've done the same now as then. Could it be that the problem has already been solved somehow? Would it maybe make sense to again apply the patch that showed the debugging from #21/#22?

I have now tested with a quickly compiled 2.6.20 kernel on a partition I normally only use as chroot. I don't really want to boot my regular working partition with that kernel. If you think that would be useful, I can compile a 2.6.20.1 kernel with your patch and boot that on my regular partition.
However, I'm also quite busy with the RC2 release of the Debian Installer (which is a requirement for releasing Etch).

Thanks for the complete report. So it looks like this isn't ACPI accessing the SMBus, but something else... presumably SMM, against which we can't do anything. This is bad news in that the solution I and others are working on these days probably won't work in your case, so we might have to remove the quirk which unhides the SMBus on your laptop.

Daniele, any chance you could try my patch on your laptop, and report if you see something in the logs or not?

> So it looks like this isn't ACPI accessing the SMBus, but something
> else... presumably SMM, against which we can't do anything.

Yes, this seems likely. I've always had the impression that the fan was not being controlled from linux, but from the BIOS. See also #1 and #8.

Just read up a bit about SMM, and it looks somewhat evil :-/

> [...] so we might have to remove the quirk which unhides the SMBus
> on your laptop.

Just as a question, not as criticism... Why? What would break? Everything seems to work fine currently, including suspending to RAM.

And another question I asked once before (#14): how come Windows _is_ able to access and display sensor data without conflicts?

> Daniele, any chance you could try my patch on your laptop, and report if you
see something in the logs or not?
Hi Jean
thanks for your work. I'm currently on 2.6.20.1 with your patch but there's
nothing in logs, even after stressing the CPU with kde compilation. Anyway
everything works well, fan is running when needed and sensors reports the right
temps. Since 2.6.20 the lspci command correctly reports SMBus after suspending
to RAM.
I agree with Frans hoping the quirk will not be removed. I've reverted the
ACPI_SLEEP patch on 2.6.17, 18 and 19 kernels and used suspend to RAM without
any problem.
Please tell me if there's something else I can do.
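
The two checks discussed earlier in this thread (confirming that i801_smbus owns its I/O region in /proc/ioports, and searching the logs for "ACPI (read|write)") can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the nesting indentation of the /proc/ioports sample is assumed, and the sample log line is made up because the exact message format of the test patch is not shown in the thread. Note that Python's `re` module, like egrep, treats an unescaped `(read|write)` as an alternation.

```python
import re

# Region lines as quoted in this thread; the nesting indentation is assumed.
IOPORTS_SAMPLE = """\
d880-d89f : 0000:00:1f.3
  d880-d89f : motherboard
    d880-d89f : i801_smbus
"""

# Hypothetical log lines; the second one is a made-up placeholder.
LOG_SAMPLE = """\
PCI: Enabled i801 SMBus device
ACPI read <made-up example line>
"""


def find_region(ioports_text, owner):
    """Return (start, end) of the first I/O region claimed by `owner`, or None."""
    for line in ioports_text.splitlines():
        match = re.match(r"\s*([0-9a-f]+)-([0-9a-f]+) : (.+)$", line)
        if match and match.group(3).strip() == owner:
            return int(match.group(1), 16), int(match.group(2), 16)
    return None


def acpi_access_lines(log_text):
    """Return the log lines matching the pattern the thread searches for."""
    pattern = re.compile(r"ACPI (read|write)")
    return [line for line in log_text.splitlines() if pattern.search(line)]


region = find_region(IOPORTS_SAMPLE, "i801_smbus")
hits = acpi_access_lines(LOG_SAMPLE)
print(region, hits)
```

On a real system the two sample strings would be replaced by the contents of /proc/ioports and of the kernel log. An empty `hits` list corresponds to the "nothing in logs" result reported above.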
I understand that things work OK for you even with the quirk enabled, and I can imagine your frustration to see it removed. The problem however is that a race condition exists. We have no control over when the SMM code is triggered, we don't even _know_ when it happens, and we are now certain it accesses the SMBus on your machine. If it ever happens that the SMM code is triggered at the exact moment the i2c-i801 driver uses the SMBus controller, unpredictable badness will ensue. This could include instant shutdown because the BIOS thinks the laptop is overheating, or... laptop damaged because the BIOS didn't realize it was overheating.

While I agree that the probability of such an event is low (by definition of a race condition), it does exist. Which is why I believe we have to remove the quirks from the kernel.

While I frequently state that dedicated hardware monitoring drivers such as the lm90 do a much better job than the ACPI thermal interface, in this particular case I think that the BIOS hiding the SMBus from us on these two laptops (Satellite A40 and Tecra M2) is a serious hint that Toshiba does not want the OS to touch the SMBus. Given that they have SMM code fiddling with it, it really makes sense to keep us away from it. Also notice how the BIOS dynamically changes the temperature limits for its own benefit. And notice how these are the only two Toshiba laptops we have quirks for, and both happen to have SMM code using the SMBus for thermal management purposes. What a coincidence...

> And another question I asked once before (#14): how come Windows _is_ able to
> access and display sensor data without conflicts?

It depends on what you call "Windows" here. Windows itself doesn't include any hardware monitoring tool as far as I know. It's frequent that the laptop or motherboard vendor includes a dedicated hardware monitoring program as part of the software bundle. They have a great advantage over us: they know all the hardware and BIOS details.
So they know all the wirings, scaling factors, temperature offsets etc. And they can even use specific, undocumented ways to retrieve the values. Are you using such a proprietary tool under Windows on your laptop?

There also are some third-party tools such as Motherboard Monitor (discontinued) or Speedfan reporting hardware monitoring information under Windows, but they face the same problems we do: they don't have the details about how the hardware is wired nor what the BIOS does.

Also, you shouldn't assume that what your Windows tool is doing is safe. As I explained in comment #38, there is a race condition. Maybe a proprietary tool has a way to work around it, but it could also be that the programmers either didn't realize there was such a race condition, or assumed that it was unlikely enough to be ignored - something we aren't willing to do in Linux.

Created attachment 10931 [details]
Stop unhiding the SMBus on Toshiba laptops
It was found that the Toshiba laptops with hidden Intel SMBus have SMM code
handling the thermal management which accesses the SMBus. Thus it is not safe
to unhide it and let Linux access it. We have to leave the SMBus hidden. SMM is
a pain, really.
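
The race condition behind this decision can be illustrated with a deliberately simplified model. Everything in it is hypothetical: the controller class, the two-step read, the register numbers and the values are made up, and real hardware carries far more state. The only point shown is that a second user overwriting shared controller state between two steps of a transaction makes the first user receive the wrong data without any error being raised.

```python
class ToyController:
    """Toy host controller with one shared command register and no locking."""

    def __init__(self, device_registers):
        self.device = device_registers
        self.command = None

    def set_command(self, register):
        self.command = register

    def run(self):
        """Perform the read selected by the current command register."""
        return self.device[self.command]


def firmware_access(ctrl):
    """Stand-in for firmware doing its own read; it overwrites the command register."""
    ctrl.set_command(0x01)
    ctrl.run()


def driver_read(ctrl, register, interrupt=None):
    """Two-step read as a driver might do it; `interrupt` runs between the steps."""
    ctrl.set_command(register)
    if interrupt is not None:
        interrupt(ctrl)  # models firmware code preempting the driver mid-transaction
    return ctrl.run()


# Made-up register contents, for illustration only.
chip = {0x00: 45, 0x01: 71}
ctrl = ToyController(chip)

undisturbed = driver_read(ctrl, 0x00)
disturbed = driver_read(ctrl, 0x00, interrupt=firmware_access)
print(undisturbed, disturbed)
```

In this model the disturbed read silently returns the contents of register 0x01 instead of 0x00. Since the driver cannot know when such an interruption occurs, it also cannot detect or retry the corrupted transaction.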