Subject : Regression in 2.6.27: "irq 18: nobody cared" on Toshiba Satellite A100 Submitter : "M. Vefa Bicakci" <bicave@superonline.com> Date : 2008-10-19 14:06 References : http://marc.info/?l=linux-kernel&m=122442552100406&w=4 Handled-By : Stefan Assmann <sassmann@suse.de> This entry is being used for tracking a regression from 2.6.26. Please don't close it until the problem is fixed in the mainline.
Will you please use git-bisect to identify the commit by which the regression is caused by? Will you please attach the output of acpidump, lspci -vxxx, /proc/interrupts? thanks.
taken from initial discussion on lkml: contents of /proc/interrupts: --- 8< --- CPU0 CPU1 0: 45249492 60399 IO-APIC-edge timer 1: 25451 0 IO-APIC-edge i8042 8: 1 0 IO-APIC-edge rtc0 9: 36514 0 IO-APIC-fasteoi acpi 12: 1147983 2103 IO-APIC-edge i8042 14: 170245 0 IO-APIC-edge ata_piix 15: 558085 819 IO-APIC-edge ata_piix 16: 508 0 IO-APIC-fasteoi uhci_hcd:usb5, i915@pci:0000:00:02.0 17: 1353 0 IO-APIC-fasteoi firewire_ohci 18: 300158 1 IO-APIC-fasteoi uhci_hcd:usb4, tifm_7xx1, yenta 19: 0 0 IO-APIC-fasteoi uhci_hcd:usb3 20: 26606 2 IO-APIC-fasteoi eth0 22: 3206279 1 IO-APIC-fasteoi HDA Intel 23: 3 0 IO-APIC-fasteoi uhci_hcd:usb1, ehci_hcd:usb2 220: 2105545 0 PCI-MSI-edge iwl3945 NMI: 0 0 Non-maskable interrupts LOC: 5971997 27874747 Local timer interrupts RES: 938710 1791498 Rescheduling interrupts CAL: 138135 180813 function call interrupts TLB: 48455 64413 TLB shootdowns TRM: 0 0 Thermal event interrupts SPU: 0 0 Spurious interrupts ERR: 0 MIS: 0 --- >8 ---
Created attachment 18471 [details] spurious-counter.patch - should help to reproduce issue quicker This patches should trigger spurious IRQs more often as it reduces the number of unhandled IRQs that trigger a nobody cared event from 99900 to 999.
please provide required info as comment 1 suggested, so we can work on it
References : http://marc.info/?l=linux-kernel&m=122565894720909&w=4
Apparently, not a regression from 2.6.26.
http://marc.info/?l=linux-kernel&m=122624419830761&w=4 > I have also tried to boot with "noapic" option using the kernels > from Etch (Debian 4.0), unmodified: one based on 2.6.18 and one > based on 2.6.24. Again, I got similar "nobody cared" messages. clearing regression flag.
Hello, I am sorry; I didn't even know that this bug-report ticket existed until I read Len Brown's e-mail which is dated today. Currently, I am testing Debian's 2.6.24 based kernel *without* the "noapic" option. The uptime is currently almost three days, and I haven't yet seen the "nobody cared" message. Considering that I saw the "nobody cared" message with 2.6.25.19 in 41 hours, it is probable that 2.6.24 doesn't have this problem. However, there is no way of knowing how much I should have to wait until I can say that a certain version of the kernel does not have this problem. This makes bisecting very hard and time-consuming. Any suggestions are welcome. I am going to attach the output of "acpidump" and "lspci -vv -xxx -nn" in a moment. Thanks to everyone! Regards, M. Vefa Bicakci
Created attachment 18833 [details] acpidump output
Created attachment 18834 [details] Output of "lspci -vv -xxx -nn"
Created attachment 18835 [details] Contents of /proc/interrupts on Debian's 2.6.24-based kernel
please also attach the output from dmesg -s64000
Created attachment 18856 [details] dmesg from Debian's 2.6.24 based kernel Please note two things: 1) This dmesg is from Debian's 2.6.24 based kernel for Etch, with a modified version of the patch from Stefan Assmann. (I used "99" instead of "999" for the spurious interrupt count.) Furthermore, the uptime has been about 96 hours and I haven't been able to reproduce the "nobody cared" message with this kernel. 2) I just noticed that there is a file named "/etc/modprobe.d/ieee1394" on my system which blacklists the "ohci1394" module. This means that the firewire drivers were not loaded in my tests with kernels whose versions were lower than 2.6.25.x, including Debian's 2.6.24 based kernel that I am testing right now.
JFI: I think I got the same problem, with debian kernel 2.6.26-10: 2.6.26-1-amd64 #1 SMP Sat Nov 8 18:25:23 UTC 2008 x86_64 GNU/Linux It's again an USB interrupt, and I got it when I tried to get my Canon 350D (in "normal" mode, as opposed to "PTP") to run: - Plugged the camera in, wasn't detected - Turned camera off, and on again, again nothing on "lsusb" - Set udev logging to "debug", camera off and on, and there's the report. I looked in my logfiles (where the first entry is from May 6), and it's the first occurrence of "nobody cared"; as I mentioned, I used the camera in PTP previously. Maybe that makes a difference. I hope that helps a bit; I can try to find some time for testing, if it helps. Call Trace: <IRQ> [<ffffffffa020c11a>] :snd_hda_intel:azx_interrupt+0x20/0xf9 [<ffffffff8026c78f>] __report_bad_irq+0x30/0x72 [<ffffffff8026c9ce>] note_interrupt+0x1fd/0x23b [<ffffffff8026d257>] handle_fasteoi_irq+0xa5/0xc8 [<ffffffff8020f5e4>] do_IRQ+0x6d/0xd9 [<ffffffff8020b0a3>] default_idle+0x0/0x49 [<ffffffff8020c46d>] ret_from_intr+0x0/0x19 <EOI> [<ffffffff8021a817>] lapic_next_event+0x0/0x13 [<ffffffff8021eb54>] native_safe_halt+0x2/0x3 [<ffffffff8021eb54>] native_safe_halt+0x2/0x3 [<ffffffff8020b0cd>] default_idle+0x2a/0x49 [<ffffffff8020ac79>] cpu_idle+0x89/0xb3 handlers: [<ffffffff8038eaf3>] (usb_hcd_irq+0x0/0x78) [<ffffffffa020c0fa>] (azx_interrupt+0x0/0xf9 [snd_hda_intel]) Disabling IRQ #22
Update: Now STR doesn't work anymore, and I get more reports. PM: Some devices failed to suspend Restarting tasks ... done. PM: Syncing filesystems ... done. Freezing user space processes ... (elapsed 0.04 seconds) done. Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done. Suspending console(s) sd 3:0:0:0: [sdb] Synchronizing SCSI cache sd 3:0:0:0: [sdb] Stopping disk sd 0:0:0:0: [sda] Synchronizing SCSI cache sd 0:0:0:0: [sda] Stopping disk ACPI handle has no context! serial 00:0a: disabled parport_pc 00:06: disabled ACPI handle has no context! ACPI: PCI interrupt for device 0000:00:10.1 disabled ACPI: PCI interrupt for device 0000:00:0f.0 disabled ACPI: PCI interrupt for device 0000:00:0e.0 disabled usb_hcd_pci_suspend(): ehci_pci_suspend+0x0/0x99 [ehci_hcd] returns -22 pci_device_suspend(): usb_hcd_pci_suspend+0x0/0x176 returns -22 suspend_device(): pci_device_suspend+0x0/0x57 returns -22 Could not suspend device 0000:00:0b.1: error -22 PM: Writing back config space on device 0000:00:0e.0 at offset f (was 1030100, writing 103010b) ... irq 22: nobody cared (try booting with the "irqpoll" option) Pid: 0, comm: swapper Tainted: P 2.6.26-1-amd64 #1 Call Trace: <IRQ> [<ffffffffa020c11a>] :snd_hda_intel:azx_interrupt+0x20/0xf9 [<ffffffff8026c78f>] __report_bad_irq+0x30/0x72 [<ffffffff8026c9ce>] note_interrupt+0x1fd/0x23b [<ffffffff8026d257>] handle_fasteoi_irq+0xa5/0xc8 [<ffffffff8020f5e4>] do_IRQ+0x6d/0xd9 [<ffffffff8020b0a3>] default_idle+0x0/0x49 [<ffffffff8020c46d>] ret_from_intr+0x0/0x19 <EOI> [<ffffffff8022f2e8>] finish_task_switch+0x2a/0xba [<ffffffff8022f2e8>] finish_task_switch+0x2a/0xba [<ffffffff8021eb54>] native_safe_halt+0x2/0x3 [<ffffffff8021eb54>] native_safe_halt+0x2/0x3 [<ffffffff8020b0cd>] default_idle+0x2a/0x49 [<ffffffff8020ac79>] cpu_idle+0x89/0xb3 handlers: [<ffffffff8038eaf3>] (usb_hcd_irq+0x0/0x78) [<ffffffffa020c0fa>] (azx_interrupt+0x0/0xf9 [snd_hda_intel]) Disabling IRQ #22 ata2: SATA link down (SStatus 0 SControl 300) ... hdc: UDMA/66 mode selected PM: Some devices failed to suspend Restarting tasks ... done. hda_intel: azx_get_response timeout, switching to polling mode: last cmd=0x001f0500
So do any alsa guys look at the issue?
Hi, Pmarek From the log it seems that the issue on your box is different with that on Toshiba Satellite A100. Will you please open a new bug and attach the output of acpidump, lspci -vxxx, dmesg. Thanks.
Will you please double check whether the following issue still exists on the kernel of 2.6.27.2? >irq 18: nobody cared (try booting with the "irqpoll" option) If exists, please add the boot option of "nousb" and see whether it still exists? Thanks.
Dear ykzhao, I have confirmed that this problem exists in 2.6.25.19, 2.6.26.7 and 2.6.27.{1,2,4}. I tried to blacklist all USB related modules on 2.6.27.x and have confirmed that this problem exists when USB modules are blacklisted too. (I have already mentioned the above on LKML.) I would like to clarify one thing. When I boot using "noapic", the version of the kernel does not matter; I always get a "nobody cared" message during boot. However, when I boot with*out* "noapic", that is, normally, then I only get "nobody cared" messages after the computer has been on for a while - and only for certain kernel versions. Please read on. === Important update === Furthermore, I have found out that this problem was introduced by a change between 2.6.24 and 2.6.25. I have found this out by using 2.6.24 for quite some time and noticing that I did not get any "nobody cared" messages, even though I had reduced the check number for the spurious IRQ counter to "99". I have partially bisected the commits between 2.6.24 and 2.6.25 and have made some progress. I have waited 48 to 70 hours using the kernel built by each iteration of the bisection. If I got a "nobody cared" message, I used "git bisect bad". Otherwise, "git bisect good". Unfortunately, I don't have the time these days to continue bisecting. (I am kind of busy.) I will have the time about one-two weeks later. I am going to attach the current output of "git bisect log" in a moment. Please note that the bisection isn't finished yet. As I said, I will be able to continue bisecting in a few weeks. Regards, M. Vefa Bicakci
Created attachment 19165 [details] The output of "git bisect log" Please note that the bisection is not complete yet.
Hello, Wondering which commit could be the cause of this problem, I ran "git bisect visualize". After noticing that nothing really relevant showed up in the list of commits, I realized that waiting for 48-60 hours might not be enough to detect whether a certain version of the kernel has this problem or not. Furthermore, I have tested 2c6b8c030cfca334c3d700ee504036c585c4c6a3 again, and I have found out that it is actually "bad", not "good". This means that my bisection results are not meaningful. I have restarted the bisection and this time I will wait longer before deciding whether a given version of the kernel has the problem or not. I would really appreciate it if a kernel developer could help me to debug this problem. Regards, M. Vefa Bicakci
hi, vefa, thanks for your effort, is that any test result?
Dear Zhang Rui, I have started to bisect again, but this time with a 3-day waiting period before declaring a kernel "good". I am currently on the 4th iteration. However, I am still not sure whether a 3-day waiting period is long enough. (As you might remember from my previous posts, a 2-day waiting period was not enough.) I will attach the output of "git bisect log" in a moment. I would really appreciate it if someone could ask me to test certain git commits. Regards, M. Vefa Bicakci
Created attachment 19601 [details] Output of "git bisect log" Here is my bisection progress.
Hello, While bisecting (between 2.6.24 and 2.6.25), I noticed that the commits between the so called good and bad commits were very unrelated to the problem I am having. Finally, I tried a vanilla 2.6.24 kernel. I am sad to report that vanilla 2.6.24 has the same problem. However, all is not lost. I have built Debian's 2.6.24 based kernel with a patch which reduces the spurious IRQ counter limit to 10, and have run it for 18 days. I am happy to report that during that time I haven't had any "nobody cared" messages. So I decided to bisect the differences between Debian's 2.6.24 and the vanilla 2.6.24. I have created a git repository based on Debian's 2.6.24, and then reverted Debian's changes to 2.6.24 patch by patch and ran "git commit" between each step. This gave me a git repository which can be used to bisect the differences. (Unfortunately, git cannot bisect from future to past, which is the reason I have come up with this complicated approach.) Just to rule out the possibility that this might be a configuration related issue, I am trying a vanilla 2.6.24 kernel with Debian's configuration. If this test shows that the configuration does not trigger this bug, then I will start the bisection procedure using the git repository which I have documented above. Regards, M. Vefa Bicakci
HI, Vefa How about the git-bisect result? From the comment #25 it seems that there is no such issue on the 2.6.24 Debian's kernel. But the problem exists on the vanilla kernel(2.6.24). Right? If so, had better use the git-bisect to identify the bad commit. Thanks.
ping Vefa.
Hello, I apologize for not responding earlier; I wanted to gather more data before posting a message. I took 2.6.24's source and reduced its spurious interrupt limit to 10 in "kernel/irq/spurious.c". I called this 2.6.24-suspect. Testing this kernel was not helpful because after a certain amount of uptime and with a certain pattern of network activity, iwl3945 started to print error messages which are appended to this message. After a few "rmmod iwl3945" and "modprobe iwl3945" cycles, I got crashes. This occurred at least three times. In addition to this, I got the following "nobody cared" message one time while testing 2.6.24-suspect: === 8< === irq 17: nobody cared (try booting with the "irqpoll" option) Pid: 29141, comm: Xorg Not tainted 2.6.24-suspect #1 [<c0157ee6>] __report_bad_irq+0x36/0x75 [<c0158139>] note_interrupt+0x214/0x273 [<c01577e9>] handle_IRQ_event+0x23/0x51 [<c01586af>] handle_fasteoi_irq+0x85/0xa5 [<c010672f>] do_IRQ+0x55/0x6e [<c0184a4f>] sys_ioctl+0x45/0x5e [<c010497f>] common_interrupt+0x23/0x28 ======================= handlers: [<f8882d7a>] (usb_hcd_irq+0x0/0x56 [usbcore]) [<f8ddccfd>] (i915_driver_irq_handler+0x0/0x18e [i915]) Disabling IRQ #17 === >8 === However, this message occurred when I was trying CompizFusion - I was switching it on and off. To be honest, I don't know if this "nobody cared" message was printed because of the same problem as the one I had when I opened this bug report, or because 10 is a really low limit. The longest uptime I was able to achieve with 2.6.24-suspect before I got crashes was 4 days. Because wireless is essential for me, I didn't continue my tests with 2.6.24-suspect. I then used Sidux's 2.6.28.7 based kernel, unmodified. After an uptime of 15 days without any problems, I decided to test vanilla 2.6.28.7 with a low spurious interrupt limit. This time I used 99 as the spurious interrupt limit. I called this kernel 2.6.28.7-spurious. I am still running this kernel, and my current uptime is 15 days. So, in other words, I haven't been able to reproduce this problem with 2.6.28.7. (Yet?) However, I am going to try 2.6.24.7 and 2.6.29 with 99 as the spurious interrupt limit, so please do not close this bug report yet. Any suggestions on which of these two versions I should test next, how much time I should wait before deciding that I cannot reproduce this problem, and what would be a good spurious interrupt limit (instead of 10 or 99) ? There is one more important thing which happened: Before testing 2.6.24-suspect, I upgraded my laptop's BIOS. I can downgrade the BIOS if it becomes necessary to confirm the role of the BIOS version in this problem. Once again, sorry for my late response. Thank you, M. Vefa Bicakci ######################################################################## Here's the message iwl3945 prints: === 8< === Error sending REPLY_SCAN_ABORT_CMD: time out after 500ms. Error sending REPLY_RXON: time out after 500ms. === >8 ===
Please try the 2.6.29 kernel with the 99 as the spurious interrupt threshold and then see whether the problem can be reproduced. And IMO it is unnecessary to downgrade the BIOS. If the problem can't be reproduced, IMO this bug can be closed. Thanks.
Hello, Unfortunately, I have bad news. I took vanilla 2.6.29 and modified it so that its spurious interrupt threshold is 99. After an uptime of 4 days and 4.5 hours with this kernel, at 3:30am, when the computer was mostly idle, I got the "nobody cared" message below: === 8< === irq 18: nobody cared (try booting with the "irqpoll" option) Pid: 0, comm: swapper Not tainted 2.6.29-spurious #1 Call Trace: [<c0175494>] __report_bad_irq+0x24/0x90 [<c017566d>] note_interrupt+0x16d/0x1b0 [<c0174590>] handle_IRQ_event+0x30/0x60 [<c0175d2b>] handle_fasteoi_irq+0xab/0xd0 [<c0106273>] do_IRQ+0x43/0x90 [<c0104427>] common_interrupt+0x27/0x2c [<f80770c8>] acpi_idle_enter_bm+0x2c8/0x351 [processor] [<c02e387f>] cpuidle_idle_call+0x6f/0xc0 [<c0102ce6>] cpu_idle+0x66/0xa0 handlers: [<f80ba970>] (usb_hcd_irq+0x0/0x80 [usbcore]) [<f81bd600>] (tifm_7xx1_isr+0x0/0x140 [tifm_7xx1]) [<f80eaad0>] (yenta_interrupt+0x0/0xf0 [yenta_socket]) [<f808ad60>] (sdhci_irq+0x0/0x560 [sdhci]) Disabling IRQ #18 === >8 === My laptop has four USB ports, two on its right side and two on its back. I have a USB keyboard attached to one of the ports on the back and a USB mouse attached to one of the ports on the right side. This "nobody cared" message only affected the rear ports. If I connect my USB keyboard to a port on the right side, then it works properly. I don't know if this problem is specific to the ports on the real panel or is because of my USB keyboard. If needed, I can reboot and have my USB keyboard (and mouse) attached to one of the ports on the right side and see whether I get a "nobody cared" message or not. What I don't understand is how I managed to get with 2.6.28.7-spurious a 15 day uptime without any "nobody cared" messages. Should I go back to testing 2.6.24.7 or 2.6.28.7 ? I would appreciate any suggestion to debug this problem. I will not reboot until I get a reply from you, in case the current state of the computer can provide some information. Thank you, M. Vefa Bicakci --- For the record, the following is what my /proc/interrupts looks like: === 8< === CPU0 CPU1 0: 248683635 8493 IO-APIC-edge timer 1: 95240 46 IO-APIC-edge i8042 8: 15 0 IO-APIC-edge rtc0 9: 105898 25576 IO-APIC-fasteoi acpi 12: 1105620 0 IO-APIC-edge i8042 14: 809096 8 IO-APIC-edge ata_piix 15: 1900817 84 IO-APIC-edge ata_piix 16: 596 48 IO-APIC-fasteoi uhci_hcd:usb5, i915@pci:0000:00:02.0 17: 3501 0 IO-APIC-fasteoi firewire_ohci 18: 455207 8036 IO-APIC-fasteoi uhci_hcd:usb4, tifm_7xx1, yenta, mmc0 19: 441166 20272 IO-APIC-fasteoi uhci_hcd:usb3 20: 90489 12 IO-APIC-fasteoi eth0 22: 2727170 0 IO-APIC-fasteoi HDA Intel 23: 22 2 IO-APIC-fasteoi ehci_hcd:usb1, uhci_hcd:usb2 27: 4655721 586 PCI-MSI-edge iwl3945 NMI: 0 0 Non-maskable interrupts LOC: 10409292 99475736 Local timer interrupts RES: 2619999 4462391 Rescheduling interrupts CAL: 186 592 Function call interrupts TLB: 19465 26565 TLB shootdowns TRM: 0 0 Thermal event interrupts SPU: 0 0 Spurious interrupts ERR: 0 MIS: 0 === >8 ===
Hi Sorry for the late response. Will you please attach the output of dmesg, lspci -vxxx, lsusb -l? From the test result it seems that the issue can be reproduced if the spurious interrupt threshold is changed to 99. Of course it is not easy to reproduce. >After 4 days. How about if we don't change the spurious interrupt threshold? Thanks.
Hello, I will attach the output of "dmesg", "lspci -vxxx" and "lsusb -v" in a moment. I am using Sidux, and the lsusb I have on my system does not have a "-l" option. So, instead I will attach the output of "lsusb -v". My system is currently in the "nobody cared" state. Because of that, I have attached my USB keyboard to one of the ports on the right side. Normally I attach it to one of the ports on the back side. I have a custom patch applied; all it does is to print the name of the ISR which "worked" for the spurious interrupt. I will attach this patch in a moment too. Finally, I would like to note that I used to get the "nobody cared" messages even without changing the spurious interrupt threshold. This was the original reason of my bug report. I started to decrease the threshold on Stefan Assmann's advice so that the problem could be reproduced quickly. However, I will try your suggestion: I just compiled a vanilla 2.6.29.1 without modifying the threshold, and I will test this kernel with the keyboard attached to one of the ports on the back of the computer. Please note that it might take a long time to reproduce the problem. Thank you for your help. M. Vefa Bicakci
Created attachment 20885 [details] dmesg output for 2.6.29.1 with 99 as the spurious interrupt threshold
Created attachment 20886 [details] Output of lspci -vxxx
Created attachment 20887 [details] Output of lsusb -v
Created attachment 20888 [details] The patch that makes the kernel print out the name of the handler which "worked" for the spurious interrupt.
The patch I attached a few seconds ago also changes the spurious interrupt threshold to 10. However, the kernel I am currently using has a threshold of 99, not 10. As I said above, I will test 2.6.29.1 without the threshold modification.
Hello, Unfortunately, I have bad news. I had to restart my computer a few times after an uptime of 6 or 7 days because of X crashes which are unrelated to this bug. After those reboots, my longest uptime was 20 days, which lasted from April 16th to May 6th or 7th. (The reason of the reboot was a hard lock-up caused by the SD card reader, and even Alt-SysRq-B didn't work.) During that 20 day-long uptime, no "nobody cared" messages were emitted by the kernel. However, because the spurious interrupt threshold is at its default value of 99900, I am guessing that it might take even longer than 20 days to get a "nobody cared" message. The only thing I can think of doing is trying again with 2.6.29.3, this time being even more careful so that a reboot does not become necessary. I would appreciate any advice. Regards, M. Vefa Bicakci
does irqpoll boot option work here? It disappears in latest kernel, maybe we should just close the track, so please double check with latest kernel.
From the log in comment #38 it seems that this issue can't be reproduced on the upstream kernel. How about close this bug?
Hello, I am currently running 2.6.30.1 without any modifications, and the uptime is about 30 days, during which I haven't seen a nobody cared message. I still don't know the cause of this problem, and because of this I can't tell for sure whether the same problem will occur again. Maybe we could close the bug report for now, and I can reopen it if I ever get another nobody cared message. (Hopefully I won't!) Regards, M. Vefa Bicakci