Most recent kernel where this bug did not occur: none
Distribution: Gentoo stable
Hardware Environment: Lenovo Thinkpad x61 Tablet
Intel Core2 Duo, Santa Rosa chipset
Software Environment: Linux kernel, Gentoo stable userspace
Bootup + a couple of minutes, the following appears in dmesg:
irq 20: nobody cared (try booting with the "irqpoll" option)
[<f892a263>] uhci_irq+0x23/0x170 [uhci_hcd]
[<f8905262>] usb_hcd_irq+0x22/0x60 [usbcore]
[<f8905240>] (usb_hcd_irq+0x0/0x60 [usbcore])
Disabling IRQ #20
After this the USB ports on the right side of the notebook are dead (as expected).
Running with irqpoll caused my system to lock up hard once, haven't tried it lately. I assume it's buggy firmware from Lenovo: I'd be happy to test any patches you can throw at me. Attaching full dmesg in next post.
Created attachment 12294 [details]
dmesg of 2.6.23-rc2
The 'set_level status: 0' messages are acpi-video's vain attempts to set the brightness. It's independent of the spurious IRQ.
I get the same error on my T61 6564-CTO (15.4" widescreen), but now on irq 19. Afterwards, the two horizontal USB ports on the right side are dead. The one vertical USB port on the left side still works.
I can confirm that kernels 2.6.21 and 2.6.22 are affected.
Presumably something is wrong with the USB irq, or something caused a non-USB interrupt on irq 19.
Jul 3 23:51:54 thinkpad kernel: irq 19: nobody cared (try booting with the "irqpoll" option)
Jul 3 23:51:54 thinkpad kernel:
Jul 3 23:51:54 thinkpad kernel: Call Trace:
Jul 3 23:51:54 thinkpad kernel: <IRQ> [<ffffffff802ba4ee>] __report_bad_irq+0x30/0x72
Jul 3 23:51:54 thinkpad kernel: [<ffffffff802ba6fd>] note_interrupt+0x1cd/0x20e
Jul 3 23:51:54 thinkpad kernel: [<ffffffff802bafce>] handle_fasteoi_irq+0xa9/0xd1
Jul 3 23:51:54 thinkpad kernel: [<ffffffff802654a3>] do_IRQ+0xf1/0x15f
Jul 3 23:51:54 thinkpad kernel: [<ffffffff80257631>] ret_from_intr+0x0/0xa
Jul 3 23:51:54 thinkpad kernel: [<ffffffff88004d6b>] :uhci_hcd:uhci_irq+0x20/0x153
Jul 3 23:51:54 thinkpad kernel: [<ffffffff88019bec>] :ehci_hcd:ehci_irq+0x27/0x182
Jul 3 23:51:54 thinkpad kernel: [<ffffffff803bf04b>] usb_hcd_irq+0x24/0x52
Jul 3 23:51:54 thinkpad kernel: [<ffffffff8020f931>] handle_IRQ_event+0x25/0x53
Jul 3 23:51:54 thinkpad kernel: [<ffffffff802bafb9>] handle_fasteoi_irq+0x94/0xd1
Jul 3 23:51:54 thinkpad kernel: [<ffffffff802654a3>] do_IRQ+0xf1/0x15f
Jul 3 23:51:54 thinkpad kernel: [<ffffffff80257631>] ret_from_intr+0x0/0xa
Jul 3 23:51:54 thinkpad kernel: <EOI> [<ffffffff8039900d>] acpi_processor_idle+0x2a0/0x4a1
Jul 3 23:51:54 thinkpad kernel: [<ffffffff80399003>] acpi_processor_idle+0x296/0x4a1
Jul 3 23:51:54 thinkpad kernel: [<ffffffff80398d6d>] acpi_processor_idle+0x0/0x4a1
Jul 3 23:51:54 thinkpad kernel: [<ffffffff80244248>] cpu_idle+0x8c/0xaf
Jul 3 23:51:54 thinkpad kernel:
Jul 3 23:51:54 thinkpad kernel: handlers:
Jul 3 23:51:54 thinkpad kernel: [<ffffffff803bf027>] (usb_hcd_irq+0x0/0x52)
Jul 3 23:51:54 thinkpad kernel: Disabling IRQ #19
Something generates a lot of interrupts on irq 19, even if nothing is plugged in:
[root@thinkpad ~]# cat /proc/interrupts
0: 554859 0 IO-APIC-edge timer
1: 5536 0 IO-APIC-edge i8042
8: 1 0 IO-APIC-edge rtc
9: 200 1688 IO-APIC-fasteoi acpi
12: 61800 0 IO-APIC-edge i8042
14: 6789 8642 IO-APIC-edge libata
15: 122 21 IO-APIC-edge libata
16: 0 0 IO-APIC-fasteoi yenta, uhci_hcd:usb3
17: 24303 0 IO-APIC-fasteoi uhci_hcd:usb4, HDA Intel, iwl4965
18: 1 424 IO-APIC-fasteoi uhci_hcd:usb5, sdhci:slot0
19: 1983 98018 IO-APIC-fasteoi ehci_hcd:usb7
20: 223291 0 IO-APIC-fasteoi uhci_hcd:usb1
21: 0 0 IO-APIC-fasteoi uhci_hcd:usb2
22: 3 0 IO-APIC-fasteoi ehci_hcd:usb6
2298: 1487 0 PCI-MSI-edge eth0
NMI: 0 0
LOC: 554802 554779
The full lspci and lsusb output can be found here:
Two more comments: I am running Fedora 7 x86_64, while the OP is running Gentoo i386 (correct me if I am wrong), so it is not a 32/64-bit issue.
Also, there was a similar discussion concerning a T61 on the linux-usb-users list
No definite solution was identified.
As Volker said: this is clearly not restricted to my thinkpad model ;-)
Also (might be related, might not), my power button doesn't seem to generate ACPI events: /proc/acpi/button/power/PWRF/info reads "type: Power Button (FF)", and the lid and AC seem to generate events, but nothing on PWRF.
Something else that's VERY interesting:
0: 720651 763 IO-APIC-edge timer
1: 461 4 IO-APIC-edge i8042
5: 17 1 IO-APIC-edge serial
8: 38 1 IO-APIC-edge rtc
9: 6218 40 IO-APIC-fasteoi acpi
12: 9299 69 IO-APIC-edge i8042
14: 0 0 IO-APIC-edge libata
15: 0 0 IO-APIC-edge libata
16: 93926 0 IO-APIC-fasteoi uhci_hcd:usb3
17: 0 0 IO-APIC-fasteoi uhci_hcd:usb4
18: 1 1 IO-APIC-fasteoi uhci_hcd:usb5, yenta, \ i915@pci:0000:00:02.0
19: 2 1 IO-APIC-fasteoi ehci_hcd:usb1
20: 200000 1 IO-APIC-fasteoi ehci_hcd:usb2
21: 19210 1 IO-APIC-fasteoi uhci_hcd:usb6, ohci1394, \
HDA Intel, ipw3945
23: 0 0 IO-APIC-fasteoi sdhci:slot0
220: 627 1489 PCI-MSI-edge eth0
221: 26335 7 PCI-MSI-edge ahci
NMI: 0 0
LOC: 14171 263680
Notice the nice round 200000 at IRQ20?
And yes, there seem to be a lot of interrupts on the USB bus with no physical activity.
Bump just to verify that 2.6.23-rc3 (latest git as of today) is still affected by this.
I see the same problem on a Thinkpad T61P (Also with Santa Rosa chipset). Verified that the bug also is present with rc4. Will test with rc5 tonight.
The problem persists with rc5.
Will you please upload the acpidump with boot option acpi=off?
The acpidump tool can be found at http://www.kernel.org/pub/linux/kernel/people/lenb/acpi/utils/pmtool_20070714_debug.
Created attachment 12770 [details]
ACPI dump of TP X61T
This is the Thinkpad X61T (model 7767-B8G)'s acpidump. It's the same whether acpi=off is passed or not. And believe you me, the laptop's basically unusable with acpi=off!
Lastly, the power button seems to be back in action. You actually need to press it for a second or so, instead of just "clicking" it. Dunno if it came right because of a newer kernel, or if I just didn't press it long enough in the past...
Created attachment 12772 [details]
Output of acpidump running on a Thinkpad T61P (6460-6XG)
Here is the output from acpidump running on a T61P, which also shows the problem.
Thanks for the acpidump info.
Will you please try it with the boot options pci=noacpi apic=debug?
If the system boots successfully, please attach the following info:
b. lspci -v
Created attachment 12784 [details]
X61T: dmesg with pci=noacpi apic=debug
Created attachment 12785 [details]
X61T: /proc/interrupts with pci=noacpi apic=debug
Created attachment 12786 [details]
X61T: lspci -v with pci=noacpi apic=debug
Just some other random info: with pci=noacpi it seems everything is set to irq 10: this reflects the settings in the BIOS. I should also mention that IRQ 5 is "reserved" for the serial port for the wacom tablet.
please try acpi=noirq, and give us the /proc/interrupts
Also, if you have a winxp installed in the system, please let us know the device interrupt assignment in win. This can help us narrow down the issue.
Created attachment 12787 [details]
X61T: /proc/interrupts with acpi=noirq
Very little difference between pci=noacpi apic=debug and acpi=noirq
Created attachment 12788 [details]
X61T: Interrupt assignment in Vista Business (32-bit)
I manually copied this out from Device Manager: human error might have slipped in, so if anything is extremely weird, just ask me to verify. Anyone know a good tool to use in Vista to get this info without printscreen and OCR? ;-)
Do you want me to collect the same data for my T61p?
Ok, I've verified that I copied out the Vista information correctly. For future reference: the tool to use is msinfo32 (ships with Windows). I really hate booting into Vista, but this is for a good cause ;-) msinfo32 doesn't show the type (PCI/ISA) of the interrupts. If you require more detailed info and have a program to obtain it, I'd be happy to run it!
msinfo32 lists the negatively numbered IRQ's as unsigned numbers:
-5 == 4294967291,
-2 == 4294967294
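The mapping above is just a two's-complement reinterpretation of a 32-bit value. A minimal sketch of the conversion, in case anyone else needs to read msinfo32 output (the helper name `to_signed32` is made up for illustration):

```python
def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit integer as a two's-complement signed one."""
    value &= 0xFFFFFFFF
    # If the top bit is set, the value represents a negative number.
    return value - 0x100000000 if value & 0x80000000 else value


for raw in (4294967291, 4294967294, 20):
    print(raw, "->", to_signed32(raw))
```

Ordinary positive IRQ numbers pass through unchanged, so the helper can be applied to every value in the list.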
Hi, Jan Gutter
Thanks for the info.
Will you please check whether PCI debugging is enabled in the kernel configuration? If it is disabled, please enable it.
Please upload the following info with the boot options apic=debug initcall_debug debug.
b. lspci -v
c. lspci -xxx
Created attachment 12841 [details]
X61T: kernel config for the next 3 attachments
This is the kernel config under which the next three attachments were made: Yes, I did forget PCI_DEBUG and ACPI_DEBUG on the previous ones!
Created attachment 12842 [details]
X61T: dmesg with apic=debug initcall_debug debug
Created attachment 12843 [details]
X61T: lspci -v with apic=debug initcall_debug debug
Created attachment 12844 [details]
X61T: lspci -xxx with apic=debug initcall_debug debug
Hi, Jan Gutter
From comment #24 it seems that the error has disappeared.
Will you please check whether the error still exists with 2.6.23-rc6?
If it still fails, please attach the dmesg that contains the error info.
If it works, please try another version and attach the following info (dmesg, /proc/interrupts, lspci -v).
The error still appears after a couple of minutes, and the dmesg was taken before that happened. I'll post the dmesg with the error in the next message. All the tests are done with the latest -git sources.
Created attachment 12848 [details]
X61T: dmesg with apic=debug initcall_debug debug (showing spurious interrupt)
This is the dmesg showing the spurious interrupt
Created attachment 12849 [details]
X61T: /proc/interrupts with apic=debug initcall_debug debug
IRQ 20 has got a rate of about ~211 interrupts/sec before it gets killed.
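For anyone who wants to measure the rate the same way, here is a rough sketch that diffs two /proc/interrupts snapshots. The function names are invented, and the "after" snapshot is made-up sample data (the IRQ 20 count is simply advanced by 2110 over an assumed 10 seconds), not a real capture:

```python
def parse_interrupts(text):
    """Return {irq_label: total_count}, summed over all CPU columns."""
    totals = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # header line such as "CPU0 CPU1" has no colon
        counts = []
        for field in rest.split():
            if not field.isdigit():
                break  # first non-numeric field is the controller name
            counts.append(int(field))
        if counts:
            totals[label.strip()] = sum(counts)
    return totals


def irq_rates(before, after, seconds):
    """Interrupts per second for every IRQ present in both snapshots."""
    a, b = parse_interrupts(before), parse_interrupts(after)
    return {irq: (b[irq] - a[irq]) / seconds for irq in b if irq in a}


# Illustrative snapshots only; the second one is hypothetical.
BEFORE = """\
 19:   1983  98018  IO-APIC-fasteoi  ehci_hcd:usb7
 20: 223291      0  IO-APIC-fasteoi  uhci_hcd:usb1
"""
AFTER = """\
 19:   1983  98018  IO-APIC-fasteoi  ehci_hcd:usb7
 20: 225401      0  IO-APIC-fasteoi  uhci_hcd:usb1
"""

rates = irq_rates(BEFORE, AFTER, seconds=10)
print(rates)
```

On a live system the two snapshots would come from reading /proc/interrupts twice with a known delay in between.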
Hi, Jan Gutter
We have a T61P (15.4 widescreen) at hand. We tested kernels 2.6.23-rc2, rc3, rc5 and rc6, but we can't reproduce the error on our system.
I have a T61P 15.4 widescreen (exact model: 6460-6XG), and this shows the problem, approximately 10 minutes after boot:
[ 627.024000] irq 23: nobody cared (try booting with the "irqpoll" option)
[ 627.024000] [<c015b5d4>] __report_bad_irq+0x24/0x80
[ 627.024000] [<c015b892>] note_interrupt+0x262/0x2a0
[ 627.024000] [<f88b16c2>] usb_hcd_irq+0x22/0x60 [usbcore]
[ 627.024000] [<c015aaf0>] handle_IRQ_event+0x30/0x60
[ 627.024000] [<c015c27b>] handle_fasteoi_irq+0xbb/0xf0
[ 627.024000] [<c0106b1b>] do_IRQ+0x3b/0x70
[ 627.024000] [<c0105223>] common_interrupt+0x23/0x30
[ 627.024000] [<f8862977>] acpi_processor_idle+0x246/0x41f [processor]
[ 627.024000] [<f8862731>] acpi_processor_idle+0x0/0x41f [processor]
[ 627.024000] [<c0102413>] cpu_idle+0x53/0xe0
[ 627.024000] =======================
[ 627.024000] handlers:
[ 627.024000] [<f88b16a0>] (usb_hcd_irq+0x0/0x60 [usbcore])
[ 627.024000] Disabling IRQ #23
I'll be happy to help fix this problem.
Oh, by the way, the above message is from the Ubuntu Gutsy kernel, but the same thing happens with 2.6.23-rc5 (which is the last one I tested).
Might it be something in the BIOS settings or with specific hardware options? I believe there is a Lenovo tool to transfer the BIOS settings between similar model Thinkpads... I have read on a mailing list somewhere that a firmware update solved a similar problem once, that's why I assumed the answer's not necessarily directly linked to the kernel.
Might be. There is a newer version of the BIOS available for my laptop (it's currently running Version 7LET44WW (1.14-1.06) and the newest one is Version 7LET51WW (1.22-1.06)). I'll try to upgrade tomorrow, and see if the problem persists.
I'll be happy to try the bios settings migration tool also. However I don't remember changing any of the BIOS settings, so it should be as close to factory settings as possible.
Created attachment 12922 [details]
Get the info using ./test 0xfed1c000 0x4000 result
Hi, Jan Gutter
Will you please get the info using the attached files?
How to use the tool is described in the readme file:
./test 0xfed1c000 0x4000 result
Created attachment 12923 [details]
X61T: result of "./test 0xfed1c000 0x4000 result"
The dmesg has these lines:
simple: module license 'unspecified' taints kernel.
Exit the module
Note: this was taken *BEFORE* the spurious IRQ. Do you need me to re-run after the IRQ occurred?
I found something interesting, by pure accident today: if I use the rfkill switch (the one that disables both the bluetooth radio and the Wifi radio), the interrupts don't count up, AND I don't get the error!
If I use the rfkill switch to disable the radios, the following disappears from lsusb:
Bus 003 Device 003: ID 0a5c:2110 Broadcom Corp.
Likewise lspci -xxx has a difference in the following section:
00:1c.1 PCI bridge: Intel Corporation PCI Express Port 2 (rev 03)
rfkill not used (radios enabled):
e0: 00 0f c7 83 06 07 08 00 33 00 00 00 00 00 00 00
rfkill used (radios disabled):
e0: 00 0f c7 03 06 07 08 00 33 00 00 00 00 00 00 00
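To pin down exactly which byte and bit differ between the two rows above, a small diff helper can be used. This is only a sketch (`diff_config_row` is a made-up name), and it says nothing about what the bit means in the bridge's config space:

```python
def diff_config_row(row_a, row_b):
    """Compare two `lspci -xxx` rows; return (offset, old, new, flipped_bits)."""
    off_a, bytes_a = row_a.split(":")
    off_b, bytes_b = row_b.split(":")
    if off_a != off_b:
        raise ValueError("rows start at different offsets")
    base = int(off_a, 16)
    diffs = []
    for i, (x, y) in enumerate(zip(bytes_a.split(), bytes_b.split())):
        a, b = int(x, 16), int(y, 16)
        if a != b:
            # XOR leaves a 1 in every bit position that changed.
            flipped = [bit for bit in range(8) if (a ^ b) >> bit & 1]
            diffs.append((base + i, a, b, flipped))
    return diffs


enabled = "e0: 00 0f c7 83 06 07 08 00 33 00 00 00 00 00 00 00"
disabled = "e0: 00 0f c7 03 06 07 08 00 33 00 00 00 00 00 00 00"

for offset, old, new, bits in diff_config_row(enabled, disabled):
    print(f"offset 0x{offset:02x}: 0x{old:02x} -> 0x{new:02x}, bits {bits}")
```

For these two rows the only difference is the byte at offset 0xe3, where bit 7 is set with the radios enabled and clear with them disabled.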
For the record, the "ID 0a5c:2110 Broadcom Corp" part is the internal Bluetooth. The bt adapter is implemented as a USB device, basically it is a USB Bluetooth stick without the stick. It is supported via the hci_usb module.
Why did the bt driver not care about the irq?
Yep, I know, BUT I didn't think about it because the bluetooth device is clearly bound to usb 3-2, IRQ #16, from cat /proc/interrupts and dmesg. I can see the interrupts counting up with it enabled (and transferring stuff makes it count up faster, I think). IRQ #20 is bound to ehci_hcd:usb2, which has nothing bound to it, except the right-side USB ports.
usb 3-1 is the fingerprint reader, FWIW (also sharing IRQ #16).
Hi, Jan Gutter
Thanks for the info. It is unnecessary to re-run.
I have a similar problem on a T61 7959-AB8; here it's irq 23, which is normally only assigned to one of the USB controllers. I haven't yet tried with the rfkill switch off. The problem occurs about 10 min after boot on a mostly idle machine.
I'll send all the logs & dumps etc... tonight or tomorrow unless the problem is found in the meantime :-)
Did any USB guys look at the issue?
We checked the chipset config, and found the interrupt routing info is correct and Linux is doing the right thing so far.
I also have a T61 (7664-18G) and often experience those irq-switch-offs (#19 and #23). This only happens when the kill-switch is turned on, which <speculation>is physically connecting the integrated bluetooth-dongle using usb</speculation> and creating an acpi-event to enable the wifi-card. I took out my iwl4965 and inserted an atheros 5418 and the problem still remains. IMHO it has something to do with the bluetooth-dongle or the fingerprint reader, which are attached using usb. My T61 does not have the integrated camera.
Is there still a need for any logs or dumps?
On mine, I had it happen on irq #21 today instead of the usual irq #23, which is weird. Both have USB uhci's on them and I think the Bluetooth HCI is not on any of those 2 (I'll double check). Irq #23 also has sdhci on it (though my distro kernel didn't attach a driver to it). I'm starting to wonder if something's wrong with those UHCI's... well, UHCI is pretty wrong by definition but maybe something is more wrong than usual here :-)
Ok, a quick plot recap: (please correct me on anything, I'm just a clueless newbie!)
1. On certain model Thinkpads, we get an "IRQ XXX: nobody cared" message after a few minutes of uptime.
2. The IRQ *should be* connected to the USB driver handling the right-side USB ports, but the driver doesn't think it is, hence the ports are disabled.
3. If the rfkill switch is set (wireless disabled), the error does not seem to occur.
4. The bluetooth device unplugged by the switch is connected to a different USB bus entirely (i.e. NOT the right-side ports).
5. We've had some smart firmware guys look over the code, and it doesn't look like the problem is with the ACPI routing of the IRQ's.
Ok, now my own inferences (which might be less accurate):
1. If an IRQ fires and none of the drivers connected to that IRQ handle it, "irq NN: nobody cared" occurs.
2. This could mean two things: ACPI routing is busted -> the kernel associates the wrong IRQ to the driver, or the driver is busted -> the driver ignores an IRQ that it should handle, or misconfigures the hardware.
3. If the ACPI routing is OK, this would mean the USB chipset driver is the one to blame?
Finally, some burning questions:
1. What's the next step? Bug the USB guys again?
2. Also, does the fact that I have a nice, round 200000 interrupts recorded on the IRQ signify anything?
Another possible explanation could be some other device we don't have a driver for loaded at the moment (some legacy stuff, whatever) asserting the IRQ line, but the fact that it -changed- IRQ line for me today makes this less likely.
One thing I'm wondering is whether the IRQ is just a short interrupt or is actually asserted continuously. A way to check that would be to print on every occurrence and not only after the count reaches 100 (and still disable it tho). If the prints are all together and then it gets disabled, then it's probably asserted by something. If not, then it's a "short" interrupt, and thus is harmless, and the kernel is being a bit too harsh at disabling it.
Basically, a short IRQ is an IRQ from a device that got "caught" by the APIC, but by the time it's actually serviced by the processor, it's gone. There can be multiple reasons for that. It could be an APIC problem where some IRQs end up occasionally dispatched to multiple CPUs (I'm not too familiar with the x86 APICs so I don't know if that can be a problem), or it could be some HW issue where the IRQ output line from a chip, such as a UHCI controller, takes a bit too long to go down after it's been acked on the chip. In the latter case, by the time it actually goes down, it may already have been unmasked by the APIC and recorded as a new interrupt. That sort of thing...
At this stage, I suspect that the Intel folks are in the best spot to figure that out, though I can try tomorrow to figure out if it's a short interrupt problem or if there's actually a fully asserted interrupt happening (easy, just printk every time it's unhandled rather than after 100 iterations and look at the timestamps).
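The timestamp test described above can be sketched offline, assuming the printk timestamps of the unhandled occurrences have already been pulled out of dmesg. The function name and the 1 ms `burst_gap` cutoff are arbitrary assumptions for illustration, not kernel values:

```python
def classify_unhandled(timestamps, burst_gap=0.001):
    """Guess whether unhandled IRQs look continuously asserted or 'short'.

    timestamps: seconds, one per unhandled occurrence, in order.
    burst_gap:  hypothetical cutoff; gaps at or below it count as back-to-back.
    """
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if not gaps:
        return "unknown"
    # All occurrences packed together suggests a line held asserted;
    # occurrences spread out in time suggest short interrupts.
    return "asserted" if max(gaps) <= burst_gap else "short"


# Made-up data: 100 events 50 us apart, versus 100 events 5 ms apart.
burst = [627.024 + i * 0.00005 for i in range(100)]
spaced = [i * 0.005 for i in range(100)]

print(classify_unhandled(burst))
print(classify_unhandled(spaced))
```

With these sample inputs the first series is classified as asserted and the second as short; real traces may be messier and need a more forgiving rule.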
I have the same issue on an HP nc2510p (a 965-based system), so it's not just Thinkpads.
Does the HP also have a bluetooth module and does it also disable some external USB ports?
It has bluetooth, but the interrupt disabled is the one for the firewire interface. It's not flagged as being shared with anything else.
(Matthew, let's file the HP failure in a different bug report --
we have a little T61 community forming here and although
it would be great if the HP were the same, we're rarely so lucky:-)
I've got a T61 here on my desk which got IRQ 16 disabled
when running the Debian 2.6.22-1-amd64 kernel with usb5 and yenta on that IRQ;
but I've not reproduced the failure with any kernel.org kernels yet.
whelp, added yenta to my 2.6.23-rc8 kernel and still no failure.
0: 32281 32211 IO-APIC-edge timer
1: 3 5 IO-APIC-edge i8042
8: 1 0 IO-APIC-edge rtc
9: 338 329 IO-APIC-fasteoi acpi
12: 1170 1162 IO-APIC-edge i8042
14: 1094 1120 IO-APIC-edge ide0
16: 0 0 IO-APIC-fasteoi uhci_hcd:usb5, yenta
17: 1 1 IO-APIC-fasteoi ohci1394, uhci_hcd:usb6
18: 0 0 IO-APIC-fasteoi uhci_hcd:usb7
19: 18 23 IO-APIC-fasteoi ehci_hcd:usb2
20: 17 9 IO-APIC-fasteoi uhci_hcd:usb3
21: 0 0 IO-APIC-fasteoi uhci_hcd:usb4
22: 0 2 IO-APIC-fasteoi ehci_hcd:usb1
1273: 449 416 PCI-MSI-edge eth1
1274: 860 887 PCI-MSI-edge ahci
NMI: 0 0
LOC: 64446 64423
Bus 006 Device 001: ID 0000:0000
Bus 002 Device 001: ID 0000:0000
Bus 007 Device 001: ID 0000:0000
Bus 005 Device 001: ID 0000:0000
Bus 001 Device 001: ID 0000:0000
Bus 004 Device 001: ID 0000:0000
Bus 003 Device 002: ID 0483:2016 SGS Thomson Microelectronics Fingerprint Reader
Bus 003 Device 001: ID 0000:0000
Version: 7LET39WW (1.09 )
Release Date: 05/14/2007
can somebody ship me a .config for 2.6.23-rc8 that fails?
Also, what method are you using to tweak the RF switch?
Created attachment 12967 [details]
X61T: latest 2.6.23-rc8 config
This is similar to attachment 12841 [details], but I have made a few changes since (suggested by linuxpowertop.org), so I'm reposting. The main difference is that CONFIG_IRQBALANCE and CONFIG_ACPI_DEBUG are not set, but that doesn't affect the error.
Also, the rfkill switch is a physical switch located just slightly to the left of the lid switch, on the bottom edge of the laptop. Slid to the left, radios are disabled (usb disconnect event for bluetooth, wifi disabled), slid to the right, radios are enabled (usb connect event on usb 3-2, wifi enabled).
I had the exact same errors. The Thinkpad T61 permits disabling Bluetooth from the BIOS (under Security settings). Since I disabled Bluetooth, my laptop has been running without errors for 2-3 hours now. I also disabled some other things such as the serial and parallel ports, but I feel it is the Bluetooth that is causing the problem because:
- The problem only happens when the rfkill switch is on, never with it off
- The rfkill switch controls both wlan and bluetooth
- I've been running wlan all morning with Bluetooth disabled. No problems.
Hope this helps.
Hrm... looks like the current kernel code is smart enough to differentiate short interrupts from really stale ones, so there go my explanations. I'll still add some instrumentation to the interrupt code to see if I can see something fishy but so far, it seems like a genuine unhandled interrupt.
I tried disabling only the Bluetooth from the BIOS, and I can confirm that the IRQ isn't disabled. The machine has been running for 45 minutes now, and the USB ports are still working. This is on a T61p (6460-6XG)
I've been running for 2 days now with Bluetooth disabled on the Thinkpad T61 7664-16u, with wireless enabled. Not a single problem.
Installation of Fedora Core 7 i386 2.6.21-1.3194.fc7
failed on my T61.
The installation could not find the SATA/AHCI drive.
dmesg showed an irq20 nobody cared -- the irq shared
by yenta, uhci_hcd:usb4, and libata. Also, looking
at the dmesg and lspci output, the graphics device
is also connected to this pin, though it doesn't
seem to have a linux driver loaded.
(note that IRQ20 would be called IRQ16 on x86_64
to match GSI 16, since only i386 has the bogus irq compression code)
ACPI: PCI Interrupt 0000:15:00.0[A] -> GSI 16 (level, low) -> IRQ 20
ACPI: PCI Interrupt 0000:00:1d.0[A] -> GSI 16 (level, low) -> IRQ 20
ACPI: PCI Interrupt 0000:00:1f.2[B] -> GSI 16 (level, low) -> IRQ 20
ACPI: PCI Interrupt 0000:00:02.0[A] -> GSI 16 (level, low) -> IRQ 20
15:00.0 CardBus bridge: Ricoh Co Ltd RL5c476 II (rev ba)
Interrupt: pin A routed to IRQ 20
00:1d.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #1 (rev 03) (prog-if 00 [UHCI])
Interrupt: pin A routed to IRQ 20
00:1f.2 IDE interface: Intel Corporation Mobile SATA IDE Controller (rev 03) (prog-if 80 [Master])
Interrupt: pin B routed to IRQ 20
00:02.0 VGA compatible controller: Intel Corporation Mobile Integrated Graphics Controller (rev 0c) (prog-if 00 [VGA])
Interrupt: pin A routed to IRQ 20
Entering BIOS SETUP and changing SATA mode to "Compatibility"
from "AHCI", I was able to install FC7 w/o problems --
and libata ends up on IRQ14 plus IRQ15.
On mine, the IRQ that gets busted doesn't have ata on it:
23: 50102 49899 IO-APIC-fasteoi ehci_hcd:usb7
(I think the Ricoh sdhci thingy is also routed there tho, I don't have a driver loaded for it at the moment).
Last week a new BIOS came out for the Thinkpad X61T, this is just to confirm that it hasn't fixed anything:
it was 7SET18WW (1.04), now it's 7SET20WW (1.06), baseboard firmware is still on 7RHT16WW-1.02, though.
Len, did you have bluetooth disabled (either via kill-switch, or the security menu in the BIOS) when you did your Fedora install? Is the bluetooth USB device really the common denominator here?
All right, so I did some digging and came up with more data, though no solution at this point:
First, some data about my laptop, it's a T61, and the "stray" IRQ is #23 which is apparently only routed to one of the EHCI controllers (the one that is paired with the UHCIs that controls the leftmost ports and the BlueTooth dongle). I don't see anything else on that IRQ line, at least via /proc/interrupts, or whatever else I can find, but there might of course be something on the mobo...
I've hacked the kernel IRQ code (I'll attach a patch) to add a counter on each IRQ line of how many total bogus interrupts happened (interrupts that no handler accepted, that is, IRQ_NONE result). I've also hacked the threshold for disabling the IRQ so that if there's more than 1 jiffy between 2 occurrences, it will consider the IRQ spurious and not stray, and thus won't disable it.
My findings so far are that this IRQ line is subject to a flood of about 200 bogus IRQs per second whenever the BlueTooth dongle is active. They cause the kernel to switch the IRQ off after a while because the delay between two of them isn't long enough for the kernel to consider them as simple short interrupts.
It doesn't seem to be a stray interrupt, since this is a level IRQ line, that would result in one CPU being totally out until the IRQ gets disabled, which is not the case. We just get this continuous stream of 200 bogus IRQs per second, which don't get acked anywhere, so that's really strange. It looks like there's a 200Hz square wave connected to that IRQ line.
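To illustrate why the gap threshold matters, here is a toy model of the check. It is a simplified sketch, not the kernel code: the numbers (`check_every`, `limit`, and the two `window` values) are assumptions loosely modelled on my understanding of the spurious-interrupt logic in kernel/irq/spurious.c and should be verified against the actual source.

```python
def simulate(gap, window, total=200_000, check_every=100_000, limit=99_900):
    """Feed `total` unhandled IRQs spaced `gap` seconds apart into the model.

    Returns the interrupt count at which the line would be disabled, or None.
    window:      assumed max gap for unhandled IRQs to count as consecutive
    check_every: assumed number of IRQs between evaluations
    limit:       assumed unhandled count above which the line is disabled
    """
    unhandled = 0
    last = None
    for count in range(1, total + 1):
        now = count * gap
        if last is not None and now - last > window:
            unhandled = 1  # too far apart: treat as an isolated short IRQ
        else:
            unhandled += 1
        last = now
        if count % check_every == 0:
            if unhandled > limit:
                return count
            unhandled = 0
    return None


# 200 bogus IRQs per second means a 5 ms gap.
print(simulate(gap=0.005, window=0.1))    # generous window: line gets disabled
print(simulate(gap=0.005, window=0.001))  # ~1 jiffy at an assumed HZ=1000
```

In this model the line can only be disabled when the counter reaches a multiple of `check_every`, which may be why a round figure such as 200000 shows up in /proc/interrupts earlier in this report.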
Now, I've tried various things with USB & bluetooth; results are as follows:
- First, the interrupt in question seems to only have the EHCI on it. However, BT doesn't use EHCI (it's not a high speed device), it uses UHCI. So EHCI should only be involved in the initial port connect sequence. I've verified this is the case: the EHCI driver then switches the controller to HALT state, and whenever those stray IRQs happen, the EHCI status register always contains 0x1000, which means halted and no interrupt pending. So something "else" is toggling that IRQ line in the background.
- If I rmmod uhci_hcd (the _U_HCI driver where BT is connected), the flood stops. If I modprobe it again, it restarts.
- If I killswitch (HW switch), the flood stops, it restarts if I re-enable BT
- If I kill via /proc/acpi/ibm/bluetooth, same effect (both cause the USB device to disconnect/reconnect).
- If I go to the sysfs file of the BT HCI device and do echo suspend >power/state, the flood stops. If I do echo auto >power/state, the flood resumes.
So at this point, it looks like the IRQ line is "shorted" with a 200Hz output from the BT dongle, which is strange. Unfortunately, the BT dongle is some Broadcom part for which I can't download a data sheet or errata (and that wouldn't necessarily help anyway).
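For reference, the 0x1000 status value mentioned above can be decoded with a small table. The bit names below are my reading of the USBSTS register layout in the EHCI specification and should be treated as an assumption to verify, not as output from the driver:

```python
# Assumed USBSTS bit layout (bits 0-5 are the interrupt status bits).
USBSTS_BITS = {
    0: "USBINT",
    1: "USBERRINT",
    2: "PortChangeDetect",
    3: "FrameListRollover",
    4: "HostSystemError",
    5: "InterruptOnAsyncAdvance",
    12: "HCHalted",
    13: "Reclamation",
    14: "PeriodicScheduleStatus",
    15: "AsyncScheduleStatus",
}
INTERRUPT_MASK = 0x3F  # bits 0-5


def decode_usbsts(value):
    """Return (names of set bits, whether any interrupt status bit is set)."""
    flags = [name for bit, name in sorted(USBSTS_BITS.items()) if value >> bit & 1]
    return flags, bool(value & INTERRUPT_MASK)


flags, pending = decode_usbsts(0x1000)
print(flags, "interrupt pending:", pending)
```

Under that layout, 0x1000 has only the halted bit set and none of the interrupt bits, which matches the observation that the EHCI itself is not the one raising the line.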
Created attachment 13090 [details]
Add bogus IRQ counts to /proc/interrupts HACK
This hack adds counters of bogus interrupts to /proc/interrupts (in parentheses after the list of attached devices). This helps show where the problem is. You can see that number starting to increase at about 200 Hz on an IRQ line as soon as BT is enabled. Which IRQ line is affected seems to depend on the machine model. The exact same problem has been reported on other 965gm based machines such as HP.
reproduced the failure per Ben's description using FC8-T3.
I needed to
1. enable physical wireless kill switch (and see BT logo light up)
That caused the BT device to appear to lsusb, which it wasn't before:
Bus 003 Device 004: ID 0a5c:2110 Broadcom Corp.
2. Then enable the device:
# hciconfig hci0 up
And watched GSI 19 (IRQ 21 on this box)
[root@t61 ibm]# dmesg |grep 'IRQ 21'
ACPI: PCI Interrupt 0000:00:1d.7[D] -> GSI 19 (level, low) -> IRQ 21
increment at 200/sec
Confirmed that "hciconfig hci0" stops the issue,
as well as switching off the wireless radio switch.
It sure looks like some sort of BT driver needs to register on GSI 19
to handle the interrupts the hardware is sending there...
I don't totally agree with this BT driver idea. Here's why:
- First, this is a PCI IRQ line, thus it's level low. However, what we see is a burst of 200HZ, thus it's not a level interrupt, more like either some square wave or an edge interrupt, or a parasite copy of another interrupt.
- BT is a USB device. It shouldn't have a direct IRQ line
I wonder if maybe what happens is that all IRQs for the paired UHCI where BT is connected end up duplicated/shunted to the IRQ line of the EHCI... that would cause something like that. Maybe a misrouting in the chipset ?
I don't know if this is helpful but:
On a mailing list I saw someone with a T61, who had all his USB ports working until he got a driver for the wireless card installed. I.e. this guy had Bluetooth enabled and everything had been working for weeks, until he installed the driver for the wireless card, and suddenly all the USB ports on one side of his laptop stopped working.
I haven't had the time to verify this myself, but I'm thinking that both Bluetooth and Wireless need to be enabled for the problem to occur? Maybe it's triggered by something the wireless driver does when initializing the card?
Unfortunately I can't confirm this with my X61T. I've just disabled my wireless Intel 3945ABG (in the "security" tab in the BIOS), and confirmed that the PCI device is not even listed in lspci. The error still occurs on usb-2, BUT the IRQ has shifted to #19. It seems to confirm that the error follows the usb-2 line no matter how the ACPI enumerates the IRQ's. Looks like bluetooth still has a 100% correlation with the error.
I haven't seen any relationship to wifi either. The IRQ line seems to always be GSI 19 for a couple of machines around here. Is that the case for everyone else?
The GSI number is the "real" number; the number displayed by Linux in /proc/interrupts can change between boots. To see the GSI number, look in the kernel log for the message that tells you the mapping.
For example, currently, the "bad" interrupt for me in linux is IRQ 22, and I can see this in dmesg:
ACPI: PCI Interrupt 0000:00:1d.7[D] -> GSI 19 (level, low) -> IRQ 22
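A quick way to collect these mappings from a saved dmesg is a regular expression over the "ACPI: PCI Interrupt" lines. This is only a sketch; the regex is fitted to the log lines quoted in this report and may not match other kernel versions, and `gsi_map` is a made-up name:

```python
import re

# Matches lines such as:
#   ACPI: PCI Interrupt 0000:00:1d.7[D] -> GSI 19 (level, low) -> IRQ 22
PATTERN = re.compile(
    r"ACPI: PCI Interrupt (\S+)\[([A-D])\] -> GSI (\d+) \((\w+), (\w+)\) -> IRQ (\d+)"
)


def gsi_map(dmesg_text):
    """Return {gsi: [(pci_device, pin, linux_irq), ...]} from dmesg text."""
    result = {}
    for m in PATTERN.finditer(dmesg_text):
        device, pin, gsi, _trigger, _polarity, irq = m.groups()
        result.setdefault(int(gsi), []).append((device, pin, int(irq)))
    return result


# Sample lines copied from two different machines in this report.
SAMPLE = """\
ACPI: PCI Interrupt 0000:15:00.0[A] -> GSI 16 (level, low) -> IRQ 20
ACPI: PCI Interrupt 0000:00:1d.0[A] -> GSI 16 (level, low) -> IRQ 20
ACPI: PCI Interrupt 0000:00:1d.7[D] -> GSI 19 (level, low) -> IRQ 22
"""

for gsi, users in sorted(gsi_map(SAMPLE).items()):
    print("GSI", gsi, users)
```

Grouping by GSI rather than by the Linux IRQ number makes reports from different boots and different machines directly comparable.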
Yes. I've checked a couple of my kern.log files, and it's always GSI 19 that shows up.
I'll try to get the guy on the other (danish) mailing list to see if he can reproduce the problem without the wireless driver.
I checked a couple of coworkers' T61s and X61s and it's also always GSI 19. I then booted Windows XP on mine, and checked the IRQ assignment in the Device Manager, and it shows only the EHCI on that interrupt, so there doesn't seem to be any other driver attached there...
Maybe windows doesn't care about bogus IRQs as much as we do or uses a shorter time threshold to differentiate short IRQs from real stale ones.
The USB ID of the Bluetooth chip refers to the Broadcom BCM2045B:
Isn't this chip also present in the X60/T60 series? I'm pretty sure that supports what Benjamin said: if the same Bluetooth chip doesn't give any errors on the X60/T60 series, something specific to the way the X61/T61 series implements it in hardware might be different.
Either that, or it's a problem with the EHCI/UHCI combination ... some misconfiguration of the chipset that would cause the IRQs from that specific UHCI controller to be "mirrored" on the EHCI line...
(In reply to comment #68)
> I'll try to get the guy on the other (danish) mailing list to see if he can
> reproduce the problem without the wireless driver.
I think I'm the guy Klaus is talking about. I have been very busy for the last week, so I haven't had time to get into this until now. Sorry about that.
On my T61 the right-hand USB ports stop working approx. 10 min after I turn BT and wireless on with the master killswitch. But my USB ports all worked fine until I installed ndiswrapper and the Windows driver (after 2 weeks of no USB problems copying 75+ GB of data over the right-hand ports), so it has to have something to do with that.
What do you need from me in terms of testing this, and in terms of output before and after? I'm a bit of a Linux newbie, so I'll probably need quite detailed descriptions of what to do. If you don't want to spam this discussion with descriptions like that please e-mail me directly.
The first thing I need to find out is how to unload (or uninstall?) a Windows wireless driver loaded using ndiswrapper.
I doubt it's related. I never used nor installed ndiswrapper or any such thing and I've tracked the thing down pretty low level. I suspect if you run my bogus IRQ counting patch, you'll see them flowing even without ndiswrapper.
At this stage, I suspect there is little any of us can do except wait for somebody from either Intel or Lenovo to dig into the HW and see what's going on. I don't have enough knowledge about those x86 chipsets to go down there myself.
I think so too, someone's either going to need to get their hands dirty with a scope or some sort of firmware emulator (if the problem is HW), or check out the USB chip initialization (if the problem is SW). It would be very strange if the IRQ line got connected to some sort of 200HZ clock.
Are we reaching the limit of useful information that we're gathering here? Would it be productive to narrow the problem definitively down to the (Santa Rosa)/(BCM2045B and 3945ABG) combination?
Finally, the ugliest solution is a quirk in the driver. But if the problem is in the motherboard routing, that might be the only solution.
Sorry guys but this is what I just found, which tends to indicate it has nothing to do with wifi nor bluetooth being on or off specifically (but my earlier observations hold: turning off the bluetooth via the BIOS made my laptop very stable, and turning them on makes it very unstable, but....):
I was trying to copy 7 Gigs over the LAN card. The wifi/bluetooth switch was *OFF*.
Eventually the NIC stopped functioning. ifup/ifdown would not restore it.
Here's what I found in the system log:
Oct 17 09:01:06 ThinkpadT61 avahi-daemon: Registering new address record for 10.0.0.3 on wlan0.
Oct 17 09:01:07 ThinkpadT61 kernel: irq 177: nobody cared (try booting with the "irqpoll" option)
Oct 17 09:01:07 ThinkpadT61 kernel: [<c01402e3>] __report_bad_irq+0x2b/0x69
Oct 17 09:01:07 ThinkpadT61 kernel: [<c01404d0>] note_interrupt+0x1af/0x1e7
Oct 17 09:01:07 ThinkpadT61 kernel: [<f887d49e>] usb_hcd_irq+0x23/0x50 [usbcore]
Oct 17 09:01:07 ThinkpadT61 kernel: [<c013fae7>] handle_IRQ_event+0x23/0x49
Oct 17 09:01:07 ThinkpadT61 kernel: [<c013fbc0>] __do_IRQ+0xb3/0xe8
Oct 17 09:01:07 ThinkpadT61 kernel: [<c01050e5>] do_IRQ+0x43/0x52
Oct 17 09:01:07 ThinkpadT61 kernel: [<c01036b6>] common_interrupt+0x1a/0x20
Oct 17 09:01:07 ThinkpadT61 kernel: [<c01e007b>] acpi_ex_create_method+0x9f/0xa3
Oct 17 09:01:07 ThinkpadT61 kernel: [<f88495ac>] acpi_processor_idle+0x1ec/0x380 [processor]
Oct 17 09:01:07 ThinkpadT61 kernel: [<c0101b52>] cpu_idle+0x9f/0xb9
Oct 17 09:01:07 ThinkpadT61 kernel: handlers:
Oct 17 09:01:07 ThinkpadT61 kernel: [<f887d47b>] (usb_hcd_irq+0x0/0x50 [usbcore])
Oct 17 09:01:07 ThinkpadT61 kernel: Disabling IRQ #177
I have noticed this before: copying large amounts (gigs) of data over the T61 NIC kills it. In fact, I'm pretty much convinced I can't copy more than approx. 1 gig without failure (this is over sftp using nautilus, if that's of any relevance). Please do note, the NIC had died, not nautilus, and ifup/ifdown did not restore it.
Thinkpad T61, Debian Linux, kernel 2.6.18-5-686.
I'm not the most technical tool in the shed but if there is anything I can do to help, let me know.
Maybe running bluetooth/wifi or heavy NIC traffic is putting some kind of load on the system ... in other words, it is not those devices specifically that cause the problem, just what those devices are demanding of the hw/sw? Remove (or reduce) the demand and the problem becomes less frequent. (I'm such a newb...)
Just a sidenote on the NIC issue:
Usually "rmmod e1000 && modprobe e1000" takes care of this. (Yes this cures only a symptom)
I remember seeing this behaviour a lot (like every minute) when copying at GBit rates from a fileserver to a SATA drive attached to an ExpressCard controller. Reloading e1000 and reconfiguring the interface would be needed every 1-2 minutes, sometimes 3 minutes if the throughput was less. I was amazed though that the TCP connection survived the 20-30 module reloads...
As for the e1000 ethernet, I had problems with some old kernels. Upgrading to a current one solved them. Transferring multiple GB works perfectly on 2.6.22. I do not have a gigabit ethernet for testing, though.
Just a "me too".
dmesg shows "Disabling IRQ #23" @ 625 seconds. If I then plug in a USB device, USB isn't working via the 2 left ports. The right port works.
T61. No fingerprint reader, nothing plugged in to USB until after the problem occurs, no ndiswrapper, problem occurs even if using VESA X.org drivers (i.e. not due to proprietary Nvidia driver), problem also occurs if networked via ethernet with no wifi.
/proc/interrupts shows 23: CPU0 64688 CPU1 35313 IO-APIC-fasteoi timer ehci_hcd:usb7
Brand new ThinkPad from Lenovo, so it should be the latest BIOS. Let me know if there's any other useful information I can provide.
Thomas, can you try my hack to report the bogus IRQ counts and tell me what the counters look like in /proc/interrupts ? I added the bogus IRQ count in parentheses at the end of each line.
It's also possible that this is an unrelated problem. That is, I can see how a network driver using NAPI could cause occasional (or even frequent) short interrupts, and older kernels such as your 2.6.18 don't even try to ignore them, so you get a timebomb waiting to happen.
Just a possibility though, worth investigating. Can you also try a 2.6.23 kernel ?
I've upgraded to kernel 2.6.23-gentoo and the new BIOS for the X61T: 7SET21WW-1.07. Embedded controller is still 7RHT16WW-1.02.
The problem still occurs :-(
But, now I'm experiencing interrupts on GSI-19 at roughly 144/sec (as opposed to 200/sec). I'm not sure what caused the frequency change, but if you think it'll help, I can re-run on one of the older kernel configs + version.
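For anyone who wants to put a number on the rate, here is a minimal Python sketch that works from two copies of /proc/interrupts taken a known number of seconds apart. The helper names (`irq_total`, `irq_rate`) are made up for illustration; the only thing assumed is the usual /proc/interrupts layout, i.e. an IRQ label followed by one counter per CPU.

```python
def irq_total(interrupts_text, irq):
    """Sum the per-CPU counters on one IRQ line of /proc/interrupts-style text."""
    label = f"{irq}:"
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == label:
            total = 0
            # The counters come first; stop at the first non-numeric field
            # (the interrupt controller name, then the handler names).
            for field in fields[1:]:
                if not field.isdigit():
                    break
                total += int(field)
            return total
    raise KeyError(f"IRQ {irq} not found")


def irq_rate(before, after, irq, seconds):
    """Interrupts per second on one IRQ between two snapshots."""
    return (irq_total(after, irq) - irq_total(before, irq)) / seconds
```

Reading /proc/interrupts, waiting ten seconds, reading it again and calling `irq_rate(before, after, 19, 10)` should give a figure comparable to the per-second numbers quoted in this thread.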
I flashed the 7KET56WW-1.26 BIOS released today on my T61 (7664-18G) and it also didn't fix the problem.
Do the people at lenovo know of that problem/bug? If it's bios related they probably should know about it...
A similar bug report is on Ubuntu's Launchpad:
Installed Gutsy can not boot on Santa Rosa
Similar on Thinkpad R61 7732 (also with Santa Rosa chipset) with fedora-development (ie soon-to-be fedora 8), kernel 126.96.36.199-37.fc8 on i386. x86_64 with similar kernel has the same problem.
The kernel says:
irq 21: nobody cared (try booting with the "irqpoll" option)
Disabling IRQ #21
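Since the IRQ number differs from machine to machine in these reports (19, 20, 21, 23, 177), a small Python sketch for pulling it out of a saved dmesg or syslog dump may be handy. The function name `disabled_irqs` is hypothetical; the two message formats are taken from the logs pasted in this thread.

```python
import re

# Message formats as they appear in the logs quoted in this thread.
NOBODY_CARED = re.compile(r"irq (\d+): nobody cared")
DISABLING = re.compile(r"Disabling IRQ #(\d+)")


def disabled_irqs(log_text):
    """Return sorted IRQ numbers that were reported unhandled and then disabled."""
    reported = {int(n) for n in NOBODY_CARED.findall(log_text)}
    disabled = {int(n) for n in DISABLING.findall(log_text)}
    return sorted(reported & disabled)
```

Feeding it the text of a saved kernel log returns an empty list when the kernel never gave up on an interrupt line.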
I have a T61P and can confirm exact symptoms. Is there someone at Lenovo and/or Intel we should be bugging to investigate this? It certainly reeks of a hardware issue.
The problem is gone on my Thinkpad T61 (766314G) with the latest BIOS version 1.26.
My BIOS (1.07) for the X61T is dated 10 October; I see 1.26 for the T61 is dated 18 October. The changelogs for both of these look similar. Mine definitely still shows the problem, though.
Andreas, would it be a lot to ask if you could possibly flash the previous version, and verify that it's definitely been fixed? I usually just check /proc/interrupts before and after a BIOS update.
If this is really fixed in the BIOS, it might also be applicable to other makes and models sporting the Santa Rosa chipset...
PS. Kudos to Lenovo for fixing a bug in the BIOS on the T61's that improves Linux compatibility! (The fix for: Volume and mute buttons on the keyboard do not work on Linux.)
I can confirm I have this on a Lenovo ThinkPad X61, using Ubuntu 7.10 ( 2.6.22-14-generic AMD64 kernel).
Are there any known workarounds to this issue? Sounds like neither irqpoll nor noapic are good ones.
Latest BIOS (7KET56WW - 1.26-1.06 - 2007/10/18) does NOT fix the problem on R61 7732-1EG (Santa Rosa chipset) running Fedora devel on kernel 188.8.131.52-42.fc8.
Same here, 1.26-1.06 (7LET56WW) on a T61, problem still present.
This has been debugged to death from a USB viewpoint. It sometimes happens without USB devices being plugged in. The trigger seems to be that the plugging event wakes the machine up.
Ugh ? Oliver, are we talking about the same problem here ? The machine -is- already up, there is no suspend/resume process involved as far as I can tell. It's purely that when the USB bluetooth device becomes active on the internal USB, the interrupt line of the EHCI that is paired with the UHCI that drives it starts getting that 200HZ or so spurious interrupt flood. Do you think it could be some kind of wakeup thing coming from the EHCI ? It's not in D3 state... I've tried reading the status reg from it and it doesn't show any irq condition ...
I've compiled and installed kernel 184.108.40.206 with the realtime patch patch-220.127.116.11-rt5, noapci, to do some audio work on my T61. The problem appears even without the wifi/bluetooth switch off but with the RT patch the problem occurs quite frequently. Basically I lose the mouse and the keyboard, giving the system the appearance of being frozen. I found that after plugging in an external keyboard (and pressing some keys on it?) the mouse and keyboard start working again and I can continue. Then eventually I have to do it again.
I had the exact same problem when I boot into Windows XP (and plugging in an ext keyboard had the same effect, i.e. it "revives" the USB), but now the problem in Windows XP has gone away completely (after installing upgrades from Lenovo and Microsoft, not sure what fixed it).
The problem persists in Linux. Same laptop, dual boot, most recent BIOS.
Hopefully this provides some clues.
Sorry, that should have read "even with the wifi/bluetooth switch off".
I have a T61p, with the same problem since I got it in August.
irq 23: nobody cared (try booting with the "irqpoll" option)
Disabling IRQ #23
#91, I was under the mistaken impression that leaving the C3 state causes it. But it rather seems to be an interrupt routing issue.
I now switched to 64 Bit and the problem seems to be gone .....
Same here too. T61 with 7LET56WW (1.26-1.06) and Debian kernel (2.6.22 or 2.6.23).
One (ugly) way to make this bug go away: recompile the kernel taking away the ehci support from USB (under "device drivers--> USB support"). You're left with the VERY SLOW uhci USB1.1, but at least all USB ports work. I'm using gentoo, with kernel
Linux Challenger 2.6.22-gentoo-r8 #11 SMP PREEMPT
on a new thinkpad T61.
I have an X61s with the latest (1.08, 9/26/2007) BIOS and 2.6.24-rc3, and I'm seeing exactly the same behavior. Disabling the bluetooth or unloading the ehci_hcd module makes the problem go away; unloading and reloading the ehci_hcd module seems to restore the broken USB behavior.
I have not tried going to a 64bit-kernel, but I suppose that would be an interesting next step.
SuSE's bugzilla also has this bug logged: https://bugzilla.novell.com/show_bug.cgi?id=325601
Does anyone on this (now sizable!) mailing list have a contact at Lenovo who's able to help us here? If not, the only option might be a (gulp!) workaround.
Also, #97, does "same here" mean you also suffer from the bug, or does it mean 64-bit solved it for you?
I have a contact at lenovo who is investigating, but I have no more infos on what the status is there.
(In reply to comment #100)
> Also, #97, does "same here" mean you also suffer from the bug, or does it
> 64-bit solved it for you?
Sorry :/ I also suffer from this bug using Debian AMD64.
(In reply to comment #91)
> Ugh ? Oliver, are we talking about the same problem here ? The machine -is-
> already up, there is no suspend/resume process involved as far as I can tell.
> It's purely that when the USB bluetooth device becomes active on the internal
> USB, the interrupt line of the EHCI that is paired with the UHCI that drives
> starts getting that 200HZ or so spurrious interrupt flood. Do you think it
> could be some kind of wakeup thing coming from the EHCI ? It's not in D3
> state... I've tried reading the status reg from it and it doesn't show any
> condition ...
Speaking of wakeup and EHCI: I'm trying to debug #9258. And only recently I saw 'irq 19 nobody cared' in dmesg output. But that only happened right after resume from C3/C4 (don't remember exactly). I haven't seen that message in other situations (e.g. when the laptop runs normally). The laptop is an X61 Tablet with 2.6.24-rc3, without any bluetooth support compiled in. If you think these two bugs could be related or want more info about my setup, just give me a shout.
(In reply to comment #96)
> I now switched to 64 Bit and the problem seems to be gone .....
I've just installed the 64 bit kernel on my R61 and the problem definitely hasn't gone away for me.
By the way, a limited workaround for me is to plug the mouse into one of the two USB ports that have the problem. When the kernel disables interrupts, the mouse carries on working. Then i've still got the LHS port to plug other things into.
There is a new BIOS-Update available for the R61/T61's (2.07/1.08) which is supposed to fix the interrupt problem. http://www-307.ibm.com/pc/support/site.wss/document.do?sitestyle=lenovo&lndocid=MIGR-67989
Hopefully that annoying bug is finally fixed - TESTING :)
Applied it and I see no more spurious EHCI interrupts. Looks good here !
That's awesome news. Now I just have to wait for the X61T BIOS...
Thanks everyone for your help! Thanks Lenovo for fixing this bug too!
So, is there a bug resolution for "fixed in hardware"? ;-)
Well, for R61 and T61 owners, yes. For X60/X61 owners, it's "hopefully Lenovo will see fit to fix it in a BIOS update sometime soon". (i.e., it's not fixed for everyone just yet, but there is hope that it will be fixed in firmware --- and it seriously suggests that it can't be fixed in the kernel, so we can probably close the bug report....)
Confirmed fixed on my T61p.
Fixed on my T61 too.
Same here. Thanks for the notice Christian!
According to a contact at Lenovo, the fix is already in the latest X61 BIOS update. Ted, can you verify ?
The latest BIOS update I see on the Lenovo site is v1.10 (24 Oct 2007). I don't see any reference in its changelog as I do in the T61 such as, "(Fix) Unexpected interrupts from the USB controller may occur." Is there a link to a more recent X61 BIOS update?
That may well be the one. Can you verify that the problem still occurs with this version of the BIOS ?
This bug is mentioned in the changelog. If you don't update the BIOS, you must give "noirqdebug" on the kernel command line. It'll shorten your time on battery though.
@Oliver: It's mentioned in the changelog for the R61/T61/T61p, not for the X61 or X61 Tablet.
@Benjamin: I'll upgrade my BIOS to 1.10 tonight or tomorrow and let you know how it affects my system.
There doesn't appear to be a BIOS update that applies to my R61 (8932-A13) model. According to the Lenovo site, it's not one of the models that are supported by the BIOS update referred to above. The BIOS version on this computer starts with 7O, not 7L or 7K, as specified in the Lenovo BIOS update pages.
So it looks like this problem isn't solved for me yet.
I assume you have verified that the latest BIOS available for your model still has the problem regardless of whether the changelog talks about the fix or not ? I need to be sure before I go back to Lenovo.
I just upgraded from BIOS version 1.03 to 1.10. I can still reproduce the problem. To wit:
[ 796.048578] irq 19: nobody cared (try booting with the "irqpoll" option)
[ 796.048591] Call Trace:
[ 796.048596] <IRQ> [<ffffffff8026aade>] __report_bad_irq+0x1e/0x80
[ 796.048643] [<ffffffff8026adc3>] note_interrupt+0x283/0x2c0
[ 796.048663] [<ffffffff8026b92d>] handle_fasteoi_irq+0xdd/0x110
[ 796.048679] [<ffffffff8020c6ab>] do_IRQ+0x7b/0x100
[ 796.048690] [<ffffffff8020a3a1>] ret_from_intr+0x0/0xa
[ 796.048696] <EOI> [<ffffffff88025ac9>] :processor:acpi_processor_idle+0x25f/0x456
[ 796.048748] [<ffffffff88025abf>] :processor:acpi_processor_idle+0x255/0x456
[ 796.048766] [<ffffffff8802586a>] :processor:acpi_processor_idle+0x0/0x456
[ 796.048779] [<ffffffff802090c0>] cpu_idle+0x70/0xc0
[ 796.048820] handlers:
[ 796.048824] [<ffffffff8807dd30>] (usb_hcd_irq+0x0/0x60 [usbcore])
[ 796.048859] Disabling IRQ #19
So, I would say that the bug has *not* been fixed in the latest BIOS update for the X61.
I suppose on the upside, now that I've updated my BIOS, when my machine boots it claims it's Energy Star compliant. :)
Well, i'm completely at a loss to explain this, but the problem seems to have gone away of its own accord on my system! It's an R61 (8932-A13), running Fedora 8, with a home-built 18.104.22.168 x86_64 kernel which previously had the problem - and, as far as i know, i haven't done anything to fix it.
I can't say for certain when it disappeared, as i don't normally shut down and reboot my system for days on end (i use hibernate or suspend at night). However, i think it disappeared yesterday.
The only thing i can think that could have fixed it is an automatic update of the WinXP system that's dual bootable. I almost never boot into Windows, but i did yesterday - and Windows did an automatic update. Perhaps it did some BIOS configuration that's persistent.
Naturally, i can't undo this and verify my suspicion. And i can't yet say for certain that the problem's really gone away and it's not just taking much longer than usual for Linux to kill that interrupt, for some reason. But it's currently been up for 30 mins without that interrupt being disabled.
I first noticed this morning that those two USB ports were still working, even though the uptime was about 12 hours. I had a (Fedora) kernel update yesterday and i wondered if that had fixed it, so i rebooted with a kernel that i knew still had the problem and it appears to have gone away.
@Will: Did you change your Bluetooth usage? Did you accidentally nudge the kill switch for wireless? Both seem to have an effect on how long it takes for Linux to kill the interrupt.
It's also possible that Fedora switched to the new BT HCI driver which, I've been told, has been improved to generate far fewer interrupts. Thus, the IRQ shutdown is much less likely to happen, though the underlying HW issue is still there.
You can check if you still see IRQs being counted for the EHCI to which nothing is connected in /proc/interrupts
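That check can be scripted. A rough Python sketch, assuming only that each /proc/interrupts line is an IRQ label, a colon, one counter per CPU, and then the controller and handler names; `ehci_irq_counts` is a name invented for this example.

```python
def ehci_irq_counts(interrupts_text):
    """Map IRQ label -> summed counter for every line that lists ehci_hcd."""
    results = {}
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        # Skip the CPU header line (no colon) and lines without an EHCI handler.
        if not sep or "ehci_hcd" not in rest:
            continue
        total = 0
        for field in rest.split():
            if not field.isdigit():
                break  # reached the controller/handler names
            total += int(field)
        results[label.strip()] = total
    return results
```

Running it on two readings a minute apart, with nothing plugged into the ports behind that EHCI, shows whether the counter is still climbing.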
Ah, yeah. Bluetooth seems to have disappeared for some reason. I haven't used it for a while so i didn't notice. I don't know why it's gone, so i guess i'll have to investigate - or maybe i'll just leave it off and get my USB ports back! ;-)
Sorry about the false alarm.
This error is NOT corrected with the latest X61s BIOS. I have a new X61s I just received a few days ago; it exhibits this behavior with BIOS version 1.11:
thinkpad_acpi: ThinkPad BIOS 7NET30WW (1.11 ), EC 7MHT24WW-1.02
It's strange that I have 1.11, because 1.10 appears to be the latest version available on the web site. It must be because my system is so new. I am going through the process of configuring the system and troubleshooting any errors in the logs, and this was one of the first ones I found:
kernel: irq 20: nobody cared (try booting with the "irqpoll" option)
kernel: [<c045b16a>] __report_bad_irq+0x36/0x75
kernel: [<c045b380>] note_interrupt+0x1d7/0x213
kernel: [<c057ae99>] usb_hcd_irq+0x21/0x4e
kernel: [<c045a807>] handle_IRQ_event+0x23/0x51
kernel: [<c045bb0b>] handle_fasteoi_irq+0x86/0xa6
kernel: [<c045ba85>] handle_fasteoi_irq+0x0/0xa6
kernel: [<c04074c3>] do_IRQ+0x8c/0xb9
kernel: [<c0405b6f>] common_interrupt+0x23/0x28
kernel: [<c05382e6>] acpi_idle_enter_simple+0x17b/0x1f1
kernel: [<c0537f40>] acpi_idle_enter_bm+0xc3/0x2ee
kernel: [<c05a2399>] cpuidle_idle_call+0x5c/0x7f
kernel: [<c05a233d>] cpuidle_idle_call+0x0/0x7f
kernel: [<c040340b>] cpu_idle+0xab/0xcc
kernel: [<c057ae78>] (usb_hcd_irq+0x0/0x4e)
kernel: Disabling IRQ #20
This occurred about 10 minutes after boot, with radios enabled.
Also note that GSI 19 (level, low) -> IRQ 20.
Just bumping to confirm this issue is very much present on my X61.
Benjamin: any status update from your contact at lenovo?
Nope, not yet, but monitor the BIOS update site as they may not tell me right away if/when they post a fixed version
I just noticed that the 1.11 version of the BIOS is available. On the Lenovo web site, the Driver Matrix page for the X61/X61s claims the latest BIOS version is 1.10, but if you actually click on the link (http://www-307.ibm.com/pc/support/site.wss/document.do?lndocid=MIGR-67982) it takes you to a page for the 1.11 version of the BIOS.
Having tried it, I can also confirm that it doesn't fix the problem (not that surprising: the changelog doesn't mention fixing the spurious interrupt problem, and that was mentioned in the T61 firmware update). It looks like the same BIOS is used for the ThinkPad Reserve Edition, which is priced at roughly twice the normal X61s but supposedly has better service. Maybe we can find someone who purchased it who can complain to Lenovo? :-)
After returning from holiday, I also upgraded my X61 Tablet to 1.08 (which seems to have the same changelog as X61/X61s 1.11). Also negatory on the fix :-( It looks like the X61 and X61 Tablet BIOS code is unified, though...
BIOS 7SET22WW (1.08) has the same changelog as
BIOS 7NET30WW (1.11)
Well, since Santa obviously gave me coal for Christmas, I'll have to wait for my birthday present then....
I have a T61p on which the 2 USB ports on the right did not work (it was the bug on this page). However, flashing the BIOS to version 2.07 WORKS!!!
Also, previously with BIOS 1.26, adding 'irqpoll' to the boot options in grub and setting the IRQs in the BIOS to Auto-Detect for all hardware would work, but with annoying side effects (it would make GNOME pop up stupid messages about my CDROM drive all the time).
Anyway, let me know if you need any output or something. As far as I am concerned, bug 8853 is solved :D
I did a quick check on the Lenovo site and read the BIOS changelogs. I looked for the line:
(Fix) Unexpected interrupts from the USB controller may occur.
Thinkpad models without the fix listed (note, this does not necessarily mean they are affected by this bug):
ThinkPad R61 15 inch models (8942, 8943, 8944, 8945, 8947, 8948, 8949)
ThinkPad X61, X61s, Reserve Edition
ThinkPad X61 Tablet
ThinkPad Z61e, Z61m, Z61p, Z61t
Thinkpad models with the fix listed:
ThinkPad R61 14.1inch widescreen with IEEE 1394, ThinkPad T61, T61p
ThinkPad R61 and R61i 14.1inch widescreen without IEEE 1394, R61 15.4inch widescreen
It seems that all the BIOSes where the bug has been fixed had their Embedded Controller (EC) revision bumped. I assume that the bug's likely to be in the EC firmware.
Any ETA from the man at Lenovo?
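To compare BIOS and EC revisions before and after an update without retyping them, something like the following Python sketch could parse the thinkpad_acpi boot message. It assumes the message looks like the one pasted earlier in this thread ("ThinkPad BIOS 7NET30WW (1.11 ), EC 7MHT24WW-1.02"); other driver versions may print it differently, and `parse_thinkpad_bios` is a made-up name.

```python
import re

# Pattern modelled on the single thinkpad_acpi line quoted in this thread;
# treat the exact format as an assumption.
THINKPAD_LINE = re.compile(
    r"ThinkPad BIOS (\S+) \(([\d.]+)\s*\), EC (\S+)-([\d.]+)"
)


def parse_thinkpad_bios(line):
    """Return BIOS/EC ids and versions from a thinkpad_acpi line, or None."""
    match = THINKPAD_LINE.search(line)
    if match is None:
        return None
    bios_id, bios_version, ec_id, ec_version = match.groups()
    return {
        "bios_id": bios_id,
        "bios_version": bios_version,
        "ec_id": ec_id,
        "ec_version": ec_version,
    }
```

Comparing the `ec_version` value from before and after flashing would show whether a given update actually bumped the EC revision.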
Thanks for pointing that out, Jan! That BIOS update doesn't come up when i enter the model ID of my Thinkpad (8932-A13) directly into the Lenovo support page, so i hadn't found it when i'd looked previously.
I've applied this update and it appears to have fixed the problem. So far, bluetooth has been enabled for 45 minutes and the RHS USB ports are still working - with no kernel message about disabling interrupt.
So it appears this bug is solved for the 15.4" widescreen model with IEEE 1394 - 8932-A13.
I updated today, and I saw a list of BIOS updates with the fix listed! Only the Z61 series BIOS doesn't have the fix listed, but I'm not sure whether the problem affects that series, though.
I didn't see any problems after 18 minutes of uptime, and no continuous stream of spurious interrupts seem to be visible. (X61 Tablet 7767-B8G is FIXED!)
Thank you all for a sterling job on identifying and isolating a niggling bug, and thanks to Lenovo for fixing it! Benjamin, would you please convey my glee to them?!?!?
Confirmed fixed here on a Thinkpad X61s when updated to BIOS 2.06 (7NETA6WW). Thanks!
Confirmed fixed here as well on a ThinkPad X61! Updating the BIOS without an optical drive was a bit tricky, but worked when I followed this guide on ThinkWiki: http://www.thinkwiki.org/wiki/BIOS_Upgrade/X_Series#Approach_3:_Alternative_method_using_a_USB_stick
Also confirmed fixed on X61. Thanks Johannes for the pointer to the alternative firmware upgrade approach -- it saved me having to set up a bootable Windows physical machine.
It also fixed my X61. Updating was a real pain -- every attempt I made at using [FreeDOS or DOS 6.22] with [pxeboot+MEMDISK or USB flash drive] led to a successful boot but a system freeze during update. Johannes' suggested method (essentially the same but requiring Windows to create the boot disk) worked for some reason.
Is it worth having the kernel check for these known-bad BIOS versions and printing a warning? Or maybe something handled by HAL similar to the battery info.is_recalled stuff?
Confirmed that Lenovo BIOS 2.10 fixes this issue on T61.
No, I'm not excited about adding DMI entries to Linux
to warn users to upgrade their BIOS -- as there are lots
of versions of the BIOS, and looking at the DMI for the
one I just loaded, it doesn't even match what IBM
calls the BIOS (DMI says 2.16, Lenovo calls it 2.10)