Most recent kernel where this bug did *NOT* occur: 2.6.20-rc1 and lower Distribution: Debian SID Hardware Environment: Amilo A CY26 Software Environment: Problem Description: From kernel 2.6.20, I have a problem with all sensors (sensors-applet for gnome). When my gnome session starts, everything starts except some applets which freezes gnome-panel during about 1 minute. So I tried to search the origin of this problem. This problem appear with 2.6.20-rc2 version. I also discovered that `cat /proc/acpi/thermal_zone/THRM/temperature` command prints immedialty the temperature with kernel releases < 2.6.20-rc2. But after this version, this command prints the result after 3-5 seconds. I also add that startup of the kernel is more slower since 2.6.20-rc2. If you need more information, i'm ready to answer your questions. Steps to reproduce:
Created attachment 11442 [details] dmesg of 2.6.20-rc1 kernel
Created attachment 11443 [details] dmesg of 2.6.20-rc2 kernel
Created attachment 11444 [details] Kernel startup This movie shows that the speed of startup is different between 2.6.20-rc1 and rc2
Created attachment 11445 [details] kernel config
Hi, Does the problem still exists in the latest kernel release?
I tested 2.6.20-rc2, 2.6.20, 2.6.21, 2.6.22-rc2, 2.6.22-rc4. The bug is still present.
Most recent kernel where this bug did *NOT* occur: 2.6.20-rc1 Distribution: Debian 4.0 (Etch) Hardware Environment: Clevo D4J D410J notebook I have the same experience, the following command works with different speed on different kernel versions: On 2.6.20-rc1: $ time cat /proc/acpi/thermal_zone/THRM/temperature temperature: 51 C real 0m0.156s user 0m0.004s sys 0m0.000s On 2.6.21-rc2: $ time cat /proc/acpi/thermal_zone/THRM/temperature temperature: 52 C real 0m1.020s user 0m0.000s sys 0m0.000s On 2.6.21-rc2 sometimes the execution of this command takes a lot more time, for example 19.879s (!). The problem still exists in 2.6.22-rc4 (1.006sec).
Created attachment 11768 [details] acpidump of Clevo D410J
There aren't any changes in acpi from 2.6.20-rc1 to rc2? Could you also send .config of each. Did CONFIG_PNPACPI get set by any chance?
Created attachment 11961 [details] config of 2.6.20-rc1 on Clevo D410J The configuration for 2.6.20-rc1 and 2.6.20-rc2 are exactly the same. The CONFIG_PNPACPI is 'y' in booth cases.
Created attachment 11962 [details] config of 2.6.20-rc2 on Clevo D410J
Would you mind git bisecting the patch out? As said, between rc1 and rc2 there nearly nothing happened with ACPI (some libata-acpi changes). Maybe this is some interference because of some unrelated driver... If you can reproduce this easily, bisect between rc1 and rc2 should not take too long and is the only possibility I see to find out what caused this. Maybe it's a Debian only bug or were rc1-rc2 tests made on vanilla kernels?
Thomas, his problem started with 2.6._20_-rc2. And there were several ACPI changes between 2.6.20-rc1 and 2.6.20-rc2.
I used 2.6.20-rc1 and 2.6.20-rc2 kernels from kernel.org without any changes and compiled from source (as far as I know this is vanilla). What does exactly "git bisecting the patch" mean? Is there any documentation how to do this? What I understand is that there were several patches applied to 2.6.20-rc1 to reach 2.6.20-rc2, which are listed one-by-one in http://kernel.org/pub/linux/kernel/v2.6/testing/ChangeLog-2.6.20-rc2 . The assumption is that one or more of them cause this problem while others has no influence.
You can look on kernel.org for git pointers, and do man git when you have it installed. Below is small snippet that I saved off once, but you can search the web and get tons of instructions, look for "git bisect" or better yet check out this guide and search in there: http://www.stardust.webpages.pl/files/handbook/handbook-en-0.3-rc1.pdf In short: # install git and cogito on your computer # clone Linus' tree: cg-clone \ git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git # start bisecting: cd linux-2.6 git bisect start git bisect bad v2.6.18-rc5 git bisect good v2.6.18-rc4 # start round cp /path/to/.config . make oldconfig make # install kernel, check whether it's good or bad, then: git bisect [bad|good] # start next round After at about 9 reboots, you'll have found the guilty commit ("... is first bad commit"). More information on git bisecting: man git-bisect
comment #13: Yep I mixed up 20 and 21, sorry -> there are a lot of ACPI changes. Still a bisect could help. Adding Alexey, he knows the EC code quite well, the culprit might be there...
I add the bug is still present in 2.6.22
please post cat of /proc/interrupts before and after cat of /proc/acpi/thermal_zone/*/*. also, please check if "ec_intr=0" added to command line changes the things...
When I add ec_intr=0, the problem seems to be solved ... Results of your tests : For 2.6.20-rc1 (ec_intr=0) : cat /proc/interrupts : CPU0 0: 20882 XT-PIC-XT timer 1: 11 XT-PIC-XT i8042 2: 0 XT-PIC-XT cascade 3: 578 XT-PIC-XT yenta, ohci1394, radeon@pci:0000:01:05.0 5: 2 XT-PIC-XT ALI 5451 8: 1 XT-PIC-XT rtc 9: 126 XT-PIC-XT acpi, ohci_hcd:usb1 10: 2 XT-PIC-XT ohci_hcd:usb2 11: 58 XT-PIC-XT yenta, eth0 12: 232 XT-PIC-XT i8042 14: 3609 XT-PIC-XT ide0 15: 524 XT-PIC-XT ide1 NMI: 0 LOC: 0 ERR: 0 MIS: 0 cat /proc/acpi/thermal_zone/THRM/* : <setting not supported> cooling mode: passive <polling disabled> state: ok temperature: 55 C critical (S5): 205 C passive: 200 C: tc1=2 tc2=5 tsp=150 devices=0xdfff7338 ------------------------------------------------------------------------------ For 2.6.20-rc1 (*NO* ec_intr=0) : cat /proc/interrupts : CPU0 0: 25534 XT-PIC-XT timer 1: 11 XT-PIC-XT i8042 2: 0 XT-PIC-XT cascade 3: 794 XT-PIC-XT yenta, ohci1394, radeon@pci:0000:01:05.0 5: 2 XT-PIC-XT ALI 5451 8: 1 XT-PIC-XT rtc 9: 169 XT-PIC-XT acpi, ohci_hcd:usb1 10: 2 XT-PIC-XT ohci_hcd:usb2 11: 68 XT-PIC-XT yenta, eth0 12: 232 XT-PIC-XT i8042 14: 3629 XT-PIC-XT ide0 15: 668 XT-PIC-XT ide1 NMI: 0 LOC: 0 ERR: 0 MIS: 0 cat /proc/acpi/thermal_zone/THRM/* : <setting not supported> cooling mode: passive <polling disabled> state: ok temperature: 53 C critical (S5): 205 C passive: 200 C: tc1=2 tc2=5 tsp=150 devices=0xdfff7338 ------------------------------------------------------------------------------ For 2.6.20-rc2 (ec_intr=0) : cat /proc/interrupts : CPU0 0: 18198 XT-PIC-XT timer 1: 11 XT-PIC-XT i8042 2: 0 XT-PIC-XT cascade 3: 680 XT-PIC-XT yenta, ohci1394, radeon@pci:0000:01:05.0 5: 2 XT-PIC-XT ALI 5451 8: 1 XT-PIC-XT rtc 9: 86 XT-PIC-XT acpi, ohci_hcd:usb1 10: 2 XT-PIC-XT ohci_hcd:usb2 11: 50 XT-PIC-XT yenta, eth0 12: 232 XT-PIC-XT i8042 14: 3625 XT-PIC-XT ide0 15: 434 XT-PIC-XT ide1 NMI: 0 LOC: 0 ERR: 0 MIS: 0 cat /proc/acpi/thermal_zone/THRM/* : <setting not supported> cooling mode: passive <polling disabled> state: ok temperature: 54 C critical (S5): 205 C passive: 200 C: tc1=2 tc2=5 tsp=150 devices=0xdfff7338 ------------------------------------------------------------------------------ For 2.6.20-rc2 (*NO* ec_intr=0) : cat /proc/interrupts : CPU0 0: 28094 XT-PIC-XT timer 1: 11 XT-PIC-XT i8042 2: 0 XT-PIC-XT cascade 3: 578 XT-PIC-XT yenta, ohci1394, radeon@pci:0000:01:05.0 5: 2 XT-PIC-XT ALI 5451 8: 1 XT-PIC-XT rtc 9: 101 XT-PIC-XT acpi, ohci_hcd:usb1 10: 2 XT-PIC-XT ohci_hcd:usb2 11: 60 XT-PIC-XT yenta, eth0 12: 236 XT-PIC-XT i8042 14: 3585 XT-PIC-XT ide0 15: 488 XT-PIC-XT ide1 NMI: 0 LOC: 0 ERR: 0 MIS: 0 cat /proc/acpi/thermal_zone/THRM/* : <setting not supported> cooling mode: passive <polling disabled> state: ok temperature: 59 C critical (S5): 205 C passive: 200 C: tc1=2 tc2=5 tsp=150 devices=0xdfff7338
Booting with 'ec_intr=0' kernel parameter reduces the needed time to show the temperature information on kernel versions above 2.6.20-rc2 on Clevo D410J: $ time cat /proc/acpi/thermal_zone/THRM/temperature temperature: 42 C real 0m0.005s user 0m0.001s sys 0m0.003s
I followed the "git bisect" procedure to find out the first bad commit. (It was not easy because some of the versions was not able to compile, so I had to select an other version to test.) The result is: 5c4064124a5720a2576eb4bd5b7200d70052e9b5 is first bad commit commit 5c4064124a5720a2576eb4bd5b7200d70052e9b5 Author: Alexey Starikovskiy <alexey.y.starikovskiy@intel.com> Date: Thu Dec 7 18:42:16 2006 +0300 ACPI: ec: Increase timeout from 50 to 500 ms to handle old slow machines. http://bugzilla.kernel.org/show_bug.cgi?id=7466 Signed-off-by: Len Brown <len.brown@intel.com> :040000 040000 f6a739daad9b6bd251c0b18287ab9aec54624eda 0f5c9263f0af4240c56fc201662d306ee492b999 M drivers
Created attachment 13004 [details] .config used for "git bisect"
Created attachment 13005 [details] BISECT_LOG
Created attachment 13012 [details] Split EC_DELAY Well, this proves that you don't have EC interrupt confirmation of the complete transaction, this means that your machine really should work with ec_intr=0 boot parameter, as suggested before. Anyway, boot parameters are not friendly :) Could you try the attached patch? You could also play with ACPI_EC_SHORT_DELAY -- set it to 5 and see what happens.
The split_ec_delay.patch from comment #24 was applied to linux-2.6.23-rc8. The measurement was made with 1000 calls to "time cat /proc/acpi/thermal_zone/THRM/temperature" command. The minimum, average and the maximum value of the "real" time spent expecuting the command is shown in the following table. The kernel was booted without the ec_intr=0 parameter. ACPI_EC_ | min | avg | max | acpi_ec_wait in dmesg SHORT_DELAY | | | | ------------+---------+-------+-------+----------------------- 1 | 0.003 | 0.005 | 0.511 | there was one, see (*) ------------+---------+-------+-------+----------------------- 2 | 0.005 | 0.009 | 1.093 | none ------------+---------+-------+-------+----------------------- 3 | 0.008 | 0.009 | 0.697 | none ------------+---------+-------+-------+----------------------- 4 | 0.01 | 0.01 | 0.05 | none ------------+---------+-------+-------+----------------------- 5 | 0.012 | 0.013 | 0.061 | none ------------+---------+-------+-------+----------------------- 50 | 0.058 | 0.109 | 0.604 | none ------------+---------+-------+-------+----------------------- 250 | 0.258 | 0.648 | 9.157 | none ------------+---------+-------+-------+----------------------- 500 | 1.003 | 1.728 |11.005 | none ------------+---------+-------+-------+----------------------- (*) ACPI: EC: acpi_ec_wait timeout, status = 1, expect_event = 1
Created attachment 13013 [details] Auto switch poll mode After discussion with Len, there is another approach to this problem... Please check if this patch works for you.
Created attachment 13014 [details] Auto switch poll mode for 2.6.23-rc8 extended with printk() on change It seems that the variable acpi_ec_mode does not reach a stable state on my laptop. If I do "time cat /proc/acpi/thermal_zone/THRM/temperature", the mode is always switched, like this: [ 798.539881] ACPI: EC: EC did not send confirm interrupt, switch to poll mode [ 798.540004] ACPI: EC: Non-query intterrupt from EC, must be confirmation, switch to interrupt [ 798.540064] ACPI: EC: Non-query intterrupt from EC, must be confirmation, switch to interrupt [ 798.540430] ACPI: EC: Non-query intterrupt from EC, must be confirmation, switch to interrupt [ 799.040927] ACPI: EC: EC did not send confirm interrupt, switch to poll mode [ 799.043546] ACPI: EC: Non-query intterrupt from EC, must be confirmation, switch to interrupt [ 802.414587] ACPI: EC: Non-query intterrupt from EC, must be confirmation, switch to interrupt [ 802.915288] ACPI: EC: EC did not send confirm interrupt, switch to poll mode [ 802.915977] ACPI: EC: Non-query intterrupt from EC, must be confirmation, switch to interrupt [ 802.937073] ACPI: EC: Non-query intterrupt from EC, must be confirmation, switch to interrupt [ 803.437333] ACPI: EC: EC did not send confirm interrupt, switch to poll mode [ 803.437977] ACPI: EC: Non-query intterrupt from EC, must be confirmation, switch to interrupt [ 803.446347] ACPI: EC: Non-query intterrupt from EC, must be confirmation, switch to interrupt The timing of 1000 execution of the command: min: 0.003 avg: 0.333 max: 3.505
Created attachment 13019 [details] Insert barrier() after each atomic_read() Please check if compiler _did_ change the order of atomic_read and next function call...
Created attachment 13032 [details] Insert mb() after each atomic_read() Please test if this patch helps.
Created attachment 13040 [details] dmesg with mb()s inserted The patch in comment #29 was applied to linux-2.6.23-rc8. While executing the 'time cat /proc/acpi/thermal_zone/THRM/temperature' 1000 times, the following messages repeat: [ 311.373860] ACPI: EC: lost interrupt [ 311.379419] ACPI: EC: increment interrupt counter [ 311.880903] ACPI: EC: lost interrupt [ 311.881790] ACPI: EC: increment interrupt counter [ 312.381975] ACPI: EC: lost interrupt [ 312.382875] ACPI: EC: increment interrupt counter [ 312.882075] ACPI: EC: lost interrupt ...and so on. The the statistics of reading out the temperature sensor is slow: min: 0.19s avg: 1.707s max: 11.009s
interrupt still happens half-a-second before timeout... did you try to disable spinlock debug ? could you please insert printk of what is read into count by each atomic_read, and what is incremented by gpe_handler?
Created attachment 13041 [details] Replace interrupt sync with spinlock Please check if this sync scheme works better...
Created attachment 13042 [details] also printk() the counters Disabling spinlock debug does not affect the problem. It seems that in function acpi_ec_check_status() the atomic_read() value always matches the old_count.
Created attachment 13043 [details] dmesg with printk()ing the counts
yep, somehow ordering of atomic_read and out/inb is broken... Please try #32.
The patch in comment #32 does not cleanly apply me for 2.6.23-rc8 and 2.6.23-rc9. Could you please specify the base version I have to apply this patch?
Created attachment 13045 [details] patch against 2.6.23-rc9 Patch was modified to apply to 2.6.23-rc9-git.
Created attachment 13048 [details] spinlock + more printk() I applied the patch from comment #37 to linux-2.6.23-rc8 without problem. The result was that the sensor reading out is still slow. I modified the patch and added more printk()s, the dmesg contains: [ 21.441681] ACPI: EC: acpi_ec_transaction_unlocked() enter [ 21.441829] ACPI: EC: acpi_ec_gpe_handler(): old_state=EC_WAIT_INTERRUPT [ 21.441886] ACPI: EC: acpi_ec_gpe_handler(): new_state=EC_READ_STATUS [ 21.941138] ACPI: EC: lost interrupt, ec->state forced to EC_READ_STATUS old_state=EC_WAIT_INTERRUPT [ 21.941217] ACPI: EC: transaction is complete [ 21.941450] ACPI: EC: acpi_ec_transaction_unlocked() enter [ 21.941610] ACPI: EC: acpi_ec_gpe_handler(): old_state=EC_WAIT_INTERRUPT [ 21.941668] ACPI: EC: acpi_ec_gpe_handler(): new_state=EC_READ_STATUS [ 22.440804] ACPI: EC: lost interrupt, ec->state forced to EC_READ_STATUS old_state=EC_WAIT_INTERRUPT [ 22.440884] ACPI: EC: transaction is complete ... and so on. Is it possible that the acpi_ec_gpe_handler() is working on an other ec, or on a different address space than the other part of the driver...? I added a "volatile" before enum ec_state state in the struct acpi_ec, but it did not help.
Could you please add printk( ..."%p", ec) to gpe_handler and to wait? If they work with different ec objects addresses should differ.
Created attachment 13049 [details] spinlock + printk(... ec) The pointers seems to point to the same place: [ 75.739649] ACPI: EC: acpi_ec_wait(): ec=c1995ed0 [ 75.739899] ACPI: EC: acpi_ec_transaction_unlocked() enter [ 75.740078] ACPI: EC: acpi_ec_wait(): ec=c1995ed0 [ 75.740184] ACPI: EC: acpi_ec_gpe_handler(): ec=c1995ed0 [ 75.740240] ACPI: EC: acpi_ec_gpe_handler(): old_state=EC_WAIT_INTERRUPT [ 75.740297] ACPI: EC: acpi_ec_gpe_handler(): new_state=EC_READ_STATUS [ 75.740724] ACPI: EC: acpi_ec_wait(): ec=c1995ed0 [ 76.239896] ACPI: EC: lost interrupt, ec->state forced to EC_READ_STATUS old_state=EC_WAIT_INTERRUPT [ 76.240110] ACPI: EC: transaction is complete [ 76.240990] ACPI: EC: acpi_ec_wait(): ec=c1995ed0 [ 76.241210] ACPI: EC: acpi_ec_transaction_unlocked() enter [ 76.241387] ACPI: EC: acpi_ec_wait(): ec=c1995ed0 [ 76.241493] ACPI: EC: acpi_ec_gpe_handler(): ec=c1995ed0 [ 76.241549] ACPI: EC: acpi_ec_gpe_handler(): old_state=EC_WAIT_INTERRUPT [ 76.241606] ACPI: EC: acpi_ec_gpe_handler(): new_state=EC_READ_STATUS [ 76.242029] ACPI: EC: acpi_ec_wait(): ec=c1995ed0 [ 76.741547] ACPI: EC: lost interrupt, ec->state forced to EC_READ_STATUS old_state=EC_WAIT_INTERRUPT [ 76.741742] ACPI: EC: transaction is complete
from fragments you've sent it looks like EC does not bother to send second confirmation interrupt, which appears to be a concluding one. Could you please check that driver always looses the last interrupt? It should be one before acpi_ec_wait/acpi_ec_transaction_unlocked pair. If it is so, then we could workaround that easily...
Created attachment 13051 [details] Don't wait for write completion Well, the patch could be as simple as this... Please test.
Created attachment 13052 [details] dump the full low level EC communication As the patch from comment #42 did not solve the problem, I created a patch which can dump the whole low level communication. To interpret the results we'll need the ACPI Specification 3.0b from http://www.acpi.info/DOWNLOADS/ACPIspec30b.pdf , chapter 12.
Created attachment 13053 [details] the actual communication log The sign "--->" means reading a register, "~~~>" means an interrupt and "<---" means write.
Created attachment 13054 [details] force the EC read with polling This patch solves the speed problem, it seems that there is one interrupt after write operation, but when the data is ready for read, the interrupt is missing. The 1000 execution of 'time cat /proc/acpi/thermal_zone/THRM/temperature' results the following timing statistics: min: 0.002 avg: 0.006 max: 0.022
Created attachment 13055 [details] communication log with force poll reading There is one interrupt after write, and the read is done by polling the status. By the way, all four status register reading which are marked with '*' are necessary, or it is enough to read the status register once after interrupt? [ 261.203966] ACPI: EC: acpi_ec_transaction_unlocked(): enter, count=2498 [ 261.203971] ACPI: EC: <--- command=0x80 [* 261.203977] ACPI: EC: ---> status=0x0A [* 261.203983] ACPI: EC: ---> status=0x0A [ 261.204057] ACPI: EC: ~~~> interrupt (2499) [* 261.204068] ACPI: EC: ---> status=0x08 [* 261.204078] ACPI: EC: ---> status=0x08 [ 261.204082] ACPI: EC: <--- data=0x10 [ 261.204088] ACPI: EC: acpi_ec_transaction_unlocked(): after write_cmd, count=2499 [ 261.204093] ACPI: EC: acpi_ec_transaction_unlocked(): before read_data, count=2499 [ 261.204100] ACPI: EC: ---> status=0x01 [ 261.204104] ACPI: EC: ---> data=0x05 [ 261.204108] ACPI: EC: acpi_ec_transaction_unlocked(): exit, count=2499
by ACPI specification, www.acpi.info, each transactional write or read to EC should cause an interrupt. In your hardware we see only one such interrupt immediately after write_command, so your hardware does not conform to spec. I think we should return to auto select mode scheme and forbid cycles in it. I'll do patch later, but you may try to do if yourself :)
Created attachment 13063 [details] Auto switch irq mode, add degraded mode Here is the promised patch. It's quite big, but it takes most out of your hardware -- meaning that it expects command completion interrupt only in BROKEN_IRQ mode. Please try.
Created attachment 13067 [details] Auto switch with degraded mode #2
Created attachment 13075 [details] separate read/write irq Unfortunately the patch from comment #49 does not reach stable state on my computer. I tried to create a patch which separates the read and the write irq. In case only the read irq is missing and it is recognized, then switch only the following reads to polling mode. Although the patch seems to work, it is far from complete. It does not wait for the interrupt but first evaluates the condition parameter of the wait_event_timeout() two times (see <linux/wait.h>, wait_event_timeout and __wait_event_timeout), and in case of reading it is sometimes enough that the OBF=1. This means that the timeout is not waited for the interrupt, but ignored. Here is the log when the change finally happens: [ 148.669438] ACPI: EC: acpi_ec_transaction_unlocked(): ec->flags=0x1 [ 148.669442] ACPI: EC: <--- command=0x80 [ 148.669447] ACPI: EC: acpi_ec_wait_ibf_0(): enter, ec->flags=0x5, broken_write_irq=0 [ 148.669453] ACPI: EC: ---> status=0x0A [ 148.669458] ACPI: EC: ---> status=0x0A [ 148.669525] ACPI: EC: ~~~> interrupt [ 148.669533] ACPI: EC: ---> status=0x08 [ 148.670500] ACPI: EC: ---> status=0x08 [ 148.670506] ACPI: EC: <--- data=0x2E [ 148.670512] ACPI: EC: acpi_ec_transaction_unlocked(): after write: ec->flags=0x1 [ 148.670517] ACPI: EC: acpi_ec_transaction_unlocked(): after read: ec->flags=0x1 [ 148.670522] ACPI: EC: acpi_ec_wait_obf_1(): enter, ec->flags=0x1, broken_read_irq=0 [ 148.670529] ACPI: EC: ---> status=0x02 [ 148.670534] ACPI: EC: ---> status=0x02 [ 149.170008] ACPI: EC: acpi_ec_wait_obf_1(): timeout, ec->flags=0x1 [ 149.170023] ACPI: EC: ---> status=0x01 [ 149.170027] ACPI: EC: lost interrupt after read, switch to degraded mode [ 149.170033] ACPI: EC: ---> data=0x82 [ 149.170038] ACPI: EC: acpi_ec_transaction_unlocked(): leave: ec->flags=0x1 [ 149.170055] ACPI: EC: acpi_ec_wait_ibf_0(): enter, ec->flags=0x1, broken_write_irq=0 [ 149.170061] ACPI: EC: ---> status=0x00 [ 149.170065] ACPI: EC: acpi_ec_transaction_unlocked(): ec->flags=0x1 [ 149.170070] ACPI: EC: <--- command=0x80 [ 149.170075] ACPI: EC: acpi_ec_wait_ibf_0(): enter, ec->flags=0x5, broken_write_irq=0 [ 149.170081] ACPI: EC: ---> status=0x0A [ 149.170086] ACPI: EC: ---> status=0x0A [ 149.170162] ACPI: EC: ~~~> interrupt [ 149.170169] ACPI: EC: ---> status=0x08 [ 149.170518] ACPI: EC: ---> status=0x08 [ 149.170522] ACPI: EC: <--- data=0x2F [ 149.170527] ACPI: EC: acpi_ec_transaction_unlocked(): after write: ec->flags=0x1 [ 149.170532] ACPI: EC: acpi_ec_transaction_unlocked(): after read: ec->flags=0x1 [ 149.170538] ACPI: EC: acpi_ec_wait_obf_1(): enter, ec->flags=0x1, broken_read_irq=1 [ 149.170544] ACPI: EC: ---> status=0x01 [ 149.170548] ACPI: EC: ---> data=0x11 [ 149.170552] ACPI: EC: acpi_ec_transaction_unlocked(): leave: ec->flags=0x1 [ 149.170560] ACPI: EC: acpi_ec_wait_ibf_0(): enter, ec->flags=0x1, broken_write_irq=0 [ 149.170566] ACPI: EC: ---> status=0x00 [ 149.170570] ACPI: EC: acpi_ec_transaction_unlocked(): ec->flags=0x1 [ 149.170574] ACPI: EC: <--- command=0x80 [ 149.170579] ACPI: EC: acpi_ec_wait_ibf_0(): enter, ec->flags=0x5, broken_write_irq=0 [ 149.170585] ACPI: EC: ---> status=0x0A [ 149.170590] ACPI: EC: ---> status=0x0A [ 149.170798] ACPI: EC: ~~~> interrupt [ 149.170807] ACPI: EC: ---> status=0x08 [ 149.172900] ACPI: EC: ---> status=0x08 [ 149.172904] ACPI: EC: <--- data=0x30 [ 149.172909] ACPI: EC: acpi_ec_transaction_unlocked(): after write: ec->flags=0x1 [ 149.172915] ACPI: EC: acpi_ec_transaction_unlocked(): after read: ec->flags=0x1 [ 149.172920] ACPI: EC: acpi_ec_wait_obf_1(): enter, ec->flags=0x1, broken_read_irq=1 [ 149.172926] ACPI: EC: ---> status=0x02 [ 149.172932] ACPI: EC: ---> status=0x02 [ 149.172937] ACPI: EC: ---> status=0x02 [ 149.172943] ACPI: EC: ---> status=0x02 [ 149.172948] ACPI: EC: ---> status=0x01 [ 149.172953] ACPI: EC: ---> data=0x00 [ 149.172956] ACPI: EC: acpi_ec_transaction_unlocked(): leave: ec->flags=0x1 [ 149.172995] ACPI: EC: ~~~> interrupt [ 149.173000] ACPI: EC: ---> status=0x00 [ 149.173007] ACPI: EC: acpi_ec_wait_ibf_0(): enter, ec->flags=0x1, broken_write_irq=0 [ 149.173014] ACPI: EC: ---> status=0x00 [ 149.173017] ACPI: EC: acpi_ec_transaction_unlocked(): ec->flags=0x1 [ 149.173022] ACPI: EC: <--- command=0x80 [ 149.173027] ACPI: EC: acpi_ec_wait_ibf_0(): enter, ec->flags=0x5, broken_write_irq=0 [ 149.173033] ACPI: EC: ---> status=0x0A [ 149.173038] ACPI: EC: ---> status=0x0A [ 149.173206] ACPI: EC: ~~~> interrupt [ 149.173213] ACPI: EC: ---> status=0x08 [ 149.173567] ACPI: EC: ---> status=0x08 [ 149.173572] ACPI: EC: <--- data=0x31 [ 149.173577] ACPI: EC: acpi_ec_transaction_unlocked(): after write: ec->flags=0x1 [ 149.173582] ACPI: EC: acpi_ec_transaction_unlocked(): after read: ec->flags=0x1 [ 149.173587] ACPI: EC: acpi_ec_wait_obf_1(): enter, ec->flags=0x1, broken_read_irq=1 [ 149.173593] ACPI: EC: ---> status=0x01 [ 149.173597] ACPI: EC: ---> data=0x00 [ 149.173601] ACPI: EC: acpi_ec_transaction_unlocked(): leave: ec->flags=0x1
Created attachment 13080 [details] Autoswitch with degraded mode #3 I think, this is a wrong direction to separate read and write, as we are speaking of workaround to rare (I hope), outdated broken controller. If we are required to read status register several times, it is better to turn force_poll = 1, rather than insert additional readings into irq mode. I like your debugging prints, do you mind if I put them into EC driver permanently (with some #define to enable/disable them from Kconfig)?
Created attachment 13081 [details] dmesg of the patch from comment #51 I applied the patch from comment #51 to 2.6.23-rc8. In the dmesg the following two lines repeats: [ 203.467620] ACPI: EC: lost interrupt, switch to degraded mode [ 203.467634] ACPI: EC: interrupt mode is broken, switch it off [ 203.968668] ACPI: EC: lost interrupt, switch to degraded mode [ 203.968682] ACPI: EC: interrupt mode is broken, switch it off ... and so on. The result of the 1000 read of /proc/acpi/thermal_zone/THRM/temperature is: min: 0.003 avg: 0.310 max: 3.508 I don't really get your point: how can you solve the auto switch if there _is_ an interrupt after write, but there is _not_ after read. If the comment at the top of the patch is still true, the patch will see that there is an interrupt after write: OK, lets switch to interrupt mode. Then the read comes: timeout, lets switch to poll mode. Next time the write comes, it will switch again to interrupt mode. I can imagine two solution: 1. separate read and write interrupt 2. start from the interrupt mode, and if one single interrupt is missing, switch to poll mode permanently. In this case I would printk() a warning (if ever would some vendor use Linux to verify the ACPI EC, he/she would recongise that there is a deviation from the ACPI standard). Starting from poll mode would be difficult because you never know how the timing of the status register change and interrupt will be. If the status register already signals that the data is ready, the interrupt might not be lost but needs still needs a short time to come. Concerning the dumper patch I am pleased and happy to give it free. I would only ask you to put my 'Signed-off-by: Márton Németh <nm127@freemail.hu>' line also to the patch.
Created attachment 13082 [details] quickly hacked stress test A quickly hacked stress test to execute 'time cat /proc/acpi/thermal_zone/THRM/temperature' 1000 times and compute the minimum, average and maximum times.
Sorry, I might have missed fact that after write there is an interrupt...
So what do you recommend?
Marton, none of your logs contain write (command=0x81) log. there are only command=0x80 (read) and command=0x84 (query). read is <command, <addr, >data, there is interrupt after command, and none after addr (error). My patch correctly handles that. query is <command, >data, there are no interrupts at all (error). My patch does not handle that. I think, that the logic of EC author was to send interrupt only to confirm that byte in consumed from input buffer (IBF_O) condition. So patch should be changed to have all OBF_1 waits done in poll mode, and have all IBF_0 done in int mode for BROKEN_IRQ bit set, do you agree?
Created attachment 13083 [details] patch number 4 Here is a patch
The patch from comment #57 works fine with my hardware. Thanks! (Sorry, there was a misunderstanding at my previous comments. I ment the outb() and inb() to/from the data register when I have written the 'write' and the 'read' operation.)
Oh, I forget the measurement results: min: 0.002 avg: 0.004 max: 0.031
great news :) I'll need to split the patch into series to be accepted into the kernel and send it up. Thanks. Driver became quite more intelligent after this bug report :)
One small thing in the source, the line if (unlikely(!test_bit(EC_FLAGS_IRQ_MODE, &ec->flags))) shouldn't be written as: if (likely(test_bit(EC_FLAGS_IRQ_MODE, &ec->flags))) As far as I understand the likely()/unlikely() this change should not affect the generated code but simpler
Sorry, didn't get your point... test_bit() and !test_bit() are true if bit=1 and bit=0 respectively, so replacing second with first doesn't work. (un)likely() is a hint for compiler what codepath should be optimized more, so it does not change meaning of above. Or you mean, it is better to change if/else also?
I was wrong, just forget comment #61. I thought that likely(!x) == unlikely(x) but it is not.
66c5f4e7367b0085652931b2f3366de29e7ff5ec (ACPI: EC: Add workaround for "optimized" controllers) and friends shipped in Linux-2.6.24-rc1-git6 closed.
Created attachment 14399 [details] dmesg of linux 2.6.24-rc7 on Clevo D410J As of linux-2.6.24-rc7, there is a switch between the iterrupt and polling mode: [ 12.747449] ACPI: EC: driver started in poll mode [ 12.781635] ACPI: EC: non-query interrupt received, switching to interrupt mode [ 22.017620] ACPI: EC: missing address confirmation, don't expect it any longer. [ 30.355550] ACPI: EC: missing confirmations, switch off interrupt mode. [ 30.356514] ACPI: EC: non-query interrupt received, switching to interrupt mode [ 246.813332] ACPI: EC: missing confirmations, switch off interrupt mode. [ 246.814442] ACPI: EC: non-query interrupt received, switching to interrupt mode [ 273.878575] ACPI: EC: missing confirmations, switch off interrupt mode. [ 273.879355] ACPI: EC: non-query interrupt received, switching to interrupt mode [ 328.871410] ACPI: EC: missing confirmations, switch off interrupt mode. [ 328.872426] ACPI: EC: non-query interrupt received, switching to interrupt mode
Created attachment 14484 [details] Linux 2.6.24-rc8 dmesg on Clevo D410J Linux 2.6.24-rc8 still contains the alternating between interrupt and non-interrupt mode: [ 12.235058] ACPI: EC: non-query interrupt received, switching to interrupt mode [ 26.914894] ACPI: EC: missing confirmations, switch off interrupt mode. [ 26.915805] ACPI: EC: non-query interrupt received, switching to interrupt mode [ 51.894832] ACPI: EC: missing confirmations, switch off interrupt mode. [ 51.896103] ACPI: EC: non-query interrupt received, switching to interrupt mode [ 69.242386] ACPI: EC: missing confirmations, switch off interrupt mode. [ 69.242960] ACPI: EC: non-query interrupt received, switching to interrupt mode [ 187.853864] ACPI: EC: missing confirmations, switch off interrupt mode. [ 187.854923] ACPI: EC: non-query interrupt received, switching to interrupt mode [ 111.651303] ACPI: EC: missing confirmations, switch off interrupt mode. [ 111.651918] ACPI: EC: non-query interrupt received, switching to interrupt mode [ 258.237359] ACPI: EC: missing confirmations, switch off interrupt mode. [ 133.339813] ACPI: EC: non-query interrupt received, switching to interrupt mode
Created attachment 14488 [details] Return OBF_1 check Please check if this patch helps.
Created attachment 14489 [details] add leading zeros to debug messages Add leading zeros to pr_debug() calls. For example if x=0x0a, the format "0x%2x" will result the string "0x a", the format "0x%2.2x" will result "0x0a". Signed-off-by: Márton Németh <nm127@freemail.hu>
Created attachment 14490 [details] "DEBUG" needs to be defined earlier The "DEBUG" symbol needs to be defined before #including <linux/kernel.h> to get the pr_debug() working. Signed-off-by: Márton Németh <nm127@freemail.hu>
Created attachment 14491 [details] dmesg of Linux 2.6.24-rc8 with patch from comment #67 on Clevo D410J The patch from commment #67 fixed the continuous switching between the interrupt and non-interrupt mode.
the dmesg spam in comment #65 and comment #66 is a regression -- marking this bug accordingly. patch in comment #67 applied to acpi tree patches in comment #68 and #69 also applied to acpi tree
the patches mentioned in comment #71 shipped in Linux-2.6.24-rc8-git8 closed.